Video understanding is concerned with parsing the image data flow for a semantic understanding of the objects in a scene, as well as the actions and interactions that define their behavior. When the objects of interest are people, there is a need to detect them (Tu et al., 2008), recognize them (Wu et al., 2008), track their position, and re-identify them when they reappear (Doretto et al., 2011). By detecting people's actions and interactions (Motiian et al., 2017), we can also attempt to predict their future behavior and intent. These techniques can be used to answer queries that require mining a large corpus of video data for safety and security applications; variations of them could also be used to analyze and quantify the behavior of a heart in an echocardiogram.
References
TCSVT
Online Human Interaction Detection and Recognition with Multiple Cameras
Motiian, S., Siyahjani, F., Almohsen, R., and Doretto, G.
IEEE Transactions on Circuits and Systems for Video Technology, 2017.
We address the problem of detecting and recognizing online the occurrence of human interactions as seen by a network of multiple cameras. We represent interactions by forming temporal trajectories, coupling together the body motion of each individual and their proximity relationships with others, and also sound whenever available. Such trajectories are modeled with kernel state-space (KSS) models. Their advantage is that they are suitable for online interaction detection and recognition, as well as for fusing information from multiple cameras, while enabling a fast implementation based on online recursive updates. For recognition, in order to compare interaction trajectories in the space of KSS models, we design so-called pairwise kernels with a special symmetry. For detection, we exploit the geometry of linear operators in Hilbert space, and extend to KSS models the concept of parity space, originally defined for linear models. For fusion, we combine KSS models with kernel construction and multiview learning techniques. We extensively evaluate the approach on four publicly available single-view data sets, and we also introduce, and will make public, a new challenging human interaction data set that we have collected using a network of three cameras. The results show that the approach holds promise to become an effective building block for the analysis of real-time human behavior from multiple cameras.
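A pairwise kernel for comparing interactions should be insensitive to which of the two participants is listed first. As a rough illustration of that kind of symmetry, here is a generic symmetrized product kernel (a standard construction, not necessarily the one used in the paper):

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    """Base RBF kernel between two feature vectors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def symmetric_pairwise_kernel(pair1, pair2, base=rbf):
    """Compare two interactions (a, b) and (c, d) irrespective of which
    participant is listed first:
        K((a,b),(c,d)) = k(a,c)k(b,d) + k(a,d)k(b,c)
    """
    a, b = pair1
    c, d = pair2
    return base(a, c) * base(b, d) + base(a, d) * base(b, c)
```

Swapping a and b (or c and d) leaves the value unchanged, which is the property needed when an interaction has no natural ordering of its participants.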
@article{motiianSAD2015tcsvt,
abbr = {TCSVT},
author = {Motiian, S. and Siyahjani, F. and Almohsen, R. and Doretto, G.},
title = {{Online Human Interaction Detection and Recognition with Multiple Cameras}},
journal = {IEEE Transactions on Circuits and Systems for Video Technology},
year = {2017},
volume = {27},
number = {3},
pages = {649--663},
bib2html_pubtype = {Journals}
}
JAIHC
Appearance-based person reidentification in camera networks: problem overview and current approaches
Doretto, G., Sebastian, T., Tu, P., and Rittscher, J.
Journal of Ambient Intelligence and Humanized Computing, 2011.
Recent advances in visual tracking methods allow following a given
object or individual in presence of significant clutter or partial
occlusions in a single or a set of overlapping camera views. The
question of when person detections in different views or at different
time instants can be linked to the same individual is of fundamental
importance to video analysis in a large-scale network of cameras.
This is the person reidentification problem. The paper focuses on
algorithms that use the overall appearance of an individual as opposed
to passive biometrics such as face and gait. Methods that effectively
address the challenges associated with changes in illumination, pose,
and clothing appearance variation are discussed. More specifically,
the development of a set of models that capture the overall appearance
of an individual and can effectively be used for information retrieval
are reviewed. Some of them provide a holistic description of a person,
and some others require an intermediate step where specific body
parts need to be identified. Some are designed to extract appearance
features over time, and some others can operate reliably also on
single images. The paper discusses algorithms for speeding up the
computation of signatures. In particular it describes very fast procedures
for computing co-occurrence matrices by leveraging a generalization
of the integral representation of images. The algorithms are deployed
and tested in a camera network comprising three cameras with non-overlapping
field of views, where a multi-camera multi-target tracker links the
tracks in different cameras by reidentifying the same people appearing
in different views.
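The fast signature computation rests on integral representations. A minimal sketch of the underlying idea, the classic summed-area table (the paper generalizes this to co-occurrence statistics), shows how any rectangular region sum reduces to four table lookups:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero border: ii[r, c] = img[:r, :c].sum()."""
    return np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def region_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1) via four lookups in the table."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]
```

Once the table is built in a single pass, statistics over arbitrarily many rectangles cost constant time each, which is what makes dense signature extraction affordable.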
@article{dorettoSTR11jaihc,
abbr = {JAIHC},
author = {Doretto, G. and Sebastian, T. and Tu, P. and Rittscher, J.},
title = {Appearance-based person reidentification in camera networks: problem
overview and current approaches},
journal = {Journal of Ambient Intelligence and Humanized Computing},
year = {2011},
volume = {2},
pages = {127--151},
affiliation = {West Virginia University, P.O. Box 6901, Morgantown, WV 26506, USA},
bib2html_pubtype = {Journals},
bib2html_rescat = {Human Reidentification, Identity Management, Video Analysis, Appearance
Modeling, Shape and Appearance Modeling, Integral Image Computations,
Track Matching},
issn = {1868-5137},
issue = {2},
keyword = {Engineering},
publisher = {Springer Berlin / Heidelberg},
url = {http://dx.doi.org/10.1007/s12652-010-0034-y}
}
ECCV
Unified crowd segmentation
Tu, P., Sebastian, T., Doretto, G., Krahnstoever, N., Rittscher, J., and Yu, T.
In Proceedings of European Conference on Computer Vision, 2008.
This paper presents a unified approach to crowd segmentation. A global
solution is generated using an Expectation Maximization framework.
Initially, a head and shoulder detector is used to nominate an exhaustive
set of person locations and these form the person hypotheses. The
image is then partitioned into a grid of small patches which are
each assigned to one of the person hypotheses. A key idea of this
paper is that while whole body monolithic person detectors can fail
due to occlusion, a partial response to such a detector can be used
to evaluate the likelihood of a single patch being assigned to a
hypothesis. This captures local appearance information without having
to learn specific appearance models. The likelihood of a pair of
patches being assigned to a person hypothesis is evaluated based
on low level image features such as uniform motion fields and color
constancy. During the E-step, the single and pairwise likelihoods
are used to compute a globally optimal set of assignments of patches
to hypotheses. In the M-step, parameters which enforce global consistency
of assignments are estimated. This can be viewed as a form of occlusion
reasoning. The final assignment of patches to hypotheses constitutes
a segmentation of the crowd. The resulting system provides a global
solution that does not require background modeling and is robust
with respect to clutter and partial occlusion.
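As a toy illustration of the E-step/M-step alternation described above, the following sketch uses unary likelihoods only, with hypothesis priors as the globally estimated parameters (the paper's E-step additionally incorporates pairwise patch terms and solves for a globally optimal assignment):

```python
import numpy as np

def hard_em_assign(patch_ll, n_iters=10):
    """Toy hard-EM patch-to-hypothesis assignment.

    patch_ll: (P, H) array of per-patch, per-hypothesis likelihoods.
    E-step: assign each patch to the hypothesis maximizing
    likelihood * prior. M-step: re-estimate the priors from the counts.
    Returns (assignments, priors).
    """
    P, H = patch_ll.shape
    priors = np.full(H, 1.0 / H)
    for _ in range(n_iters):
        # E-step: best hypothesis per patch under the current priors.
        assign = np.argmax(patch_ll * priors, axis=1)
        # M-step: priors proportional to how many patches each explains.
        counts = np.bincount(assign, minlength=H)
        priors = (counts + 1e-9) / (counts.sum() + H * 1e-9)
    return assign, priors
```

The final assignment plays the role of the crowd segmentation: patches grouped under one hypothesis delineate one person.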
@inproceedings{tuSDKRY08eccv,
abbr = {ECCV},
author = {Tu, P. and Sebastian, T. and Doretto, G. and Krahnstoever, N. and Rittscher, J. and Yu, T.},
title = {Unified crowd segmentation},
booktitle = {Proceedings of European Conference on Computer Vision},
year = {2008},
pages = {691--704},
bib2html_pubtype = {Conferences},
bib2html_rescat = {Video Analysis, People Detection, Integral Image Computations,
People Tracking},
}
CVPR
Face alignment using boosted ranking models
Wu, H., Liu, X., and Doretto, G.
In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.
Oral
Face alignment seeks to deform a face model to match it with the features
of the image of a face by optimizing an appropriate cost function.
We propose a new face model that is aligned by maximizing a score
function, which we learn from training data, and that we impose to
be concave. We show that this problem can be reduced to learning
a classifier that is able to say whether, by switching from
one alignment to a new one, the model is approaching the correct
fitting. This relates to the ranking problem, where a number of instances
need to be ordered. For training the model, we propose to extend
GentleBoost [23] to rank learning. Extensive experimentation shows
the superiority of this approach to other learning paradigms, and
demonstrates that this model exceeds the alignment performance of
the state-of-the-art.
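The reduction described above, deciding whether a switch between two alignments moves toward the correct fitting, amounts to classifying the sign of a score difference. A minimal sketch using a perceptron on feature differences (standing in for the GentleBoost rank learner of the paper):

```python
import numpy as np

def train_ranker(better, worse, n_epochs=50, lr=0.1):
    """Learn weights w so that score(better) > score(worse) per pair.

    better, worse: (N, D) arrays of alignment feature vectors, where
    row i of `better` is a closer fit than row i of `worse`.
    Classifying the sign of w . (f_better - f_worse) is exactly the
    "does this switch approach the correct fitting?" question.
    """
    diffs = better - worse                # (N, D) difference vectors
    w = np.zeros(diffs.shape[1])
    for _ in range(n_epochs):
        for d in diffs:
            if w @ d <= 0:                # misranked pair: update
                w += lr * d
    return w
```

At test time, alignment proceeds by accepting only switches that increase the learned score, so a concave score function guarantees the greedy search converges to the best fitting.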
@inproceedings{wuLD08cvpr,
abbr = {CVPR},
author = {Wu, H. and Liu, X. and Doretto, G.},
title = {Face alignment using boosted ranking models},
booktitle = {Proceedings of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition},
year = {2008},
pages = {1--8},
bib2html_pubtype = {Conferences},
bib2html_rescat = {Video Analysis, Appearance Modeling, Shape and Appearance Modeling,
Integral Image Computations, Face Tracking, Face Modeling},
wwwnote = {Oral}
}