Human Interactions

Recognizing human interactions from video is an important step towards the long-term goal of fully automatic scene understanding. Recent years have seen a concentration of work on recognizing single-person actions, as well as group activities, while the problem of modeling the interactions between two people remains relatively unexplored. In [ISVC13], interactions between people are modeled by forming temporal interaction trajectories that couple the body motion of each individual with their proximity relationships. Such trajectories lie on a well-defined Riemannian manifold and enjoy specific symmetry properties that have to be taken into account when developing a theoretically grounded recognition framework.

Dictionary Learning

Recent successes in the use of sparse coding for many computer vision applications have drawn attention to the problem of how an over-complete dictionary should be learned from data. The quality of a dictionary greatly affects performance in many respects, including computational cost. While the focus so far has been on learning compact, reconstructive, and discriminative dictionaries, in [ACCV12] all of these qualities are retained and further enhanced by learning a dictionary that is also able to predict the contextual information surrounding a sparsely coded signal.
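
As background, sparse coding approximates a signal as a combination of a few atoms from an over-complete dictionary. The following minimal sketch (a generic Orthogonal Matching Pursuit, not the context-aware learning method of [ACCV12]; all names are chosen here for illustration) shows the coding step that any such dictionary must support:

```python
import numpy as np

def omp(D, y, k):
    """Orthogonal Matching Pursuit: greedily approximate y with at most
    k atoms (columns of D, assumed unit-norm)."""
    residual, support = y.copy(), []
    for _ in range(k):
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # re-fit the coefficients on the selected atoms by least squares
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(0)
D = rng.standard_normal((16, 64))
D /= np.linalg.norm(D, axis=0)      # over-complete: 64 unit atoms in R^16
```

Because the dictionary has many more atoms than signal dimensions, typical signals admit much sparser codes than in an orthonormal basis, which is what makes learning the dictionary worthwhile.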

Exemplar-based Object Layout

Recognizing the presence of object classes in an image, or image classification, has become an increasingly important topic. Equally important, however, is the capability to locate these object classes in the image. The combined problem, usually referred to as object layout, is typically approached with models that require intensive training. In [ISVC11] this issue is addressed with the primary goal of minimizing the training requirements, so as to make adding new object classes easy, as opposed to approaches that favor training a suite of object-specific classifiers. It turns out that an object class can be effectively represented with enough image exemplars which, in combination with image retrieval techniques and statistical modeling, yield state-of-the-art object recognition performance with minimal training effort.

Transfer Learning from Multiple Sources

Transfer learning leverages the knowledge of source domains, available a priori, to help train a classifier for a target domain where the available data is scarce. The effectiveness of the transfer depends on the relationship between source and target. Rather than improving the learning, brute-force leveraging of a source poorly related to the target may actually decrease classifier performance. One strategy to reduce this negative transfer is to import knowledge from multiple sources, increasing the chance of finding one source closely related to the target. In [CVPR10] these ideas are explored by extending the boosting framework to transfer knowledge from multiple sources. The resulting algorithms are very efficient in terms of the speed with which they can be retrained once a new target domain is given, and have been applied to important computer vision problems such as object category recognition and specific object detection.
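
The core mechanism behind boosting-based transfer is instance reweighting: source instances that the weak learner misclassifies are down-weighted (they appear unrelated to the target), while misclassified target instances are up-weighted, AdaBoost-style. The sketch below illustrates one such round in the spirit of TrAdaBoost; it is a generic illustration with names chosen here, not the multi-source algorithm of [CVPR10]:

```python
import numpy as np

def transfer_reweight(w_src, w_tgt, miss_src, miss_tgt, n_rounds):
    """One boosting round of TrAdaBoost-style instance reweighting.
    miss_* are 0/1 arrays flagging instances the current weak learner
    misclassified; weights are not renormalized here, for clarity."""
    # weak learner error is measured on the *target* instances only
    eps = float(np.sum(w_tgt * miss_tgt) / np.sum(w_tgt))
    eps = min(max(eps, 1e-10), 0.499)
    beta_t = eps / (1.0 - eps)                       # AdaBoost factor
    beta_s = 1.0 / (1.0 + np.sqrt(2.0 * np.log(len(w_src)) / n_rounds))
    w_src = w_src * beta_s ** miss_src    # shrink source points that disagree
    w_tgt = w_tgt * beta_t ** (-miss_tgt)  # emphasize hard target points
    return w_src, w_tgt
```

Over the rounds, instances from sources that keep disagreeing with the target hypothesis fade away, which is how negative transfer is suppressed.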

Face Modeling and Tracking

Active Appearance Models (AAMs) represent facial images with generative models for both the shape and the appearance of the face. Despite their success, their performance is limited on faces that were not part of the training set. Moreover, training them with a large number of examples degrades their effectiveness. This limits their applicability for tracking multiple unseen faces in unconstrained scenarios. In [CVPR08] these issues are addressed by learning a discriminative face model that is fitted by minimizing a concave cost function, which turns out to be equivalent to learning a ranking function. The framework shows a dramatic improvement over AAMs in terms of alignment robustness and speed, enabling the simultaneous real-time tracking of tens of faces. The approach describes a general methodology applicable to the many problems (e.g., discriminative object tracking) that can be solved by learning the cost function to be optimized, which can be constrained to be either concave or convex.

People Detection and Recognition

People detection and tracking in video are fundamental Computer Vision capabilities that still constitute a research challenge. Important difficulties are due to partial occlusions of the objects of interest (people), the dynamic background (possibly due to the motion of the observer), and foreground clutter (due to non-person objects in motion). Traditional methods ignore one or more of these aspects, which prevents tracking multiple people from a moving platform, even in slightly crowded or cluttered conditions. In [ECCV08] these issues are addressed all at once by exploiting both people appearance and shape cues in an online optimization framework based on Expectation-Maximization. Given initial hypotheses of people positions nominated by a discriminative head-and-shoulders detector operating at a high false-alarm rate, images are analyzed by optimally assigning each image patch to the most likely person hypothesis. This amounts to automatically rejecting the false hypotheses, finding how many people are present in the scene, localizing them, and describing how they occlude each other.
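
To illustrate the flavor of such an assignment step, the following toy sketch soft-assigns patch positions to candidate hypotheses with a simple Gaussian mixture EM; hypotheses whose mixing weight collapses toward zero are rejected as false alarms. This is an illustration of the EM assignment idea only, not the appearance-and-shape model of [ECCV08]:

```python
import numpy as np

def em_assign(patches, centers, n_iter=30, sigma=1.0):
    """Soft-assign patch positions (N x 2) to K person hypotheses with EM.
    Each hypothesis is an isotropic Gaussian around its center; mixing
    weights that collapse toward zero flag false-alarm hypotheses."""
    pi = np.full(len(centers), 1.0 / len(centers))
    for _ in range(n_iter):
        # E-step: responsibility of hypothesis k for each patch
        d2 = ((patches[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        lik = pi * np.exp(-0.5 * d2 / sigma ** 2)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate the mixing weights (centers kept fixed here)
        pi = resp.sum(axis=0) / len(patches)
    return pi, resp
```

With two true clusters of patches and a spurious third hypothesis far from any data, the third mixing weight drops to zero, i.e., the false hypothesis is rejected automatically.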

Long-duration tracking of individuals across large sites remains an almost untouched research area. Tracks of individuals acquired in disjoint fields of view have to be connected despite the fact that the same person will appear in a different pose, from a different viewpoint, and under different illumination conditions. Ultimately, this is an identity-matching problem, which might be approached using traditional biometric cues, such as the face. However, practical scenarios prevent relying on good-quality acquisition of face images at standoff distance. Therefore, lacking more stable biometric data, one can resort to whole-body appearance information, provided that a person does not change clothes between sightings. [ICCV07] presents a model for the appearance of people. It describes the spatial distribution of the albedo of an object (person) as seen from the perspective of each of its constituent (body) parts. Estimating the model entails computing an occurrence matrix, for which a state-of-the-art algorithm enabling real-time performance is derived. It exploits a generalization of the popular integral image representation, which is widely applicable for quickly computing complex vector-valued image statistics.
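
The plain vector-valued integral image underlying this kind of speedup can be sketched as follows (a standard construction, not the occurrence-matrix generalization of [ICCV07]): any per-pixel feature vector can be summed over an arbitrary rectangle in constant time per channel.

```python
import numpy as np

def integral(feat):
    """Integral image of a vector-valued feature map feat[H, W, d]:
    ii[y, x] holds the sum of feat over the rectangle [0:y, 0:x]."""
    ii = np.zeros((feat.shape[0] + 1, feat.shape[1] + 1, feat.shape[2]))
    ii[1:, 1:] = feat.cumsum(0).cumsum(1)
    return ii

def region_sum(ii, y0, x0, y1, x1):
    """Sum of the feature vectors inside [y0:y1, x0:x1] with four lookups
    per channel, independent of the region size."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
```

After the one-off cumulative-sum pass, statistics over any number of candidate body-part regions cost the same regardless of how large the regions are, which is what enables real-time estimation.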

Visual Surveillance

Increasingly, large networks of surveillance cameras are employed to monitor public and private facilities. This continuous collection of imagery has the potential for tremendous impact on public safety and security. Unfortunately, this potential is often unrealized, since manual monitoring of growing numbers of video feeds is not feasible. As a consequence, surveillance video is mostly stored without being viewed and is only used for data-mining and forensic needs. However, computer-based video analytics is now becoming possible, enabling a proactive approach where security personnel can be continually apprised of who is on site, where they are, and what they are doing. Under this new paradigm, a significantly higher level of security can be achieved through the increased productivity of security officers. The ultimate goal of intelligent video for security and surveillance is to automatically detect events and situations that require the attention of security personnel. Augmenting security staff with automatic processing will increase their efficiency and effectiveness. This is a difficult problem, since events of interest are complicated and diverse. [SPIE-DSS07, AVSS09] discuss some of the challenges of developing surveillance systems, and present an overview of solutions concerning people detection, crowd analysis, multi-camera multi-target tracking, event detection, indexing, and search.

Aerial Video Analysis

In aerial video, moving objects of interest are typically very small, and being able to detect them is key to enabling tracking. Some detection methods learn the background and distinguish when a foreground object is present. These approaches require the image sensor to be fixed, and a large number of frames for learning the background. To avoid these constraints, one could use motion segmentation algorithms (which need as few as two consecutive frames), but these expect the foreground objects to be considerably large. When the objects are small, [CVPRW06] proposes to learn how to classify image regions into categories such as road, tree, grass, building, vehicle, and shadow, and to integrate this information with a motion segmentation algorithm for extracting the moving objects. The method dramatically boosts the detection rate of small objects, enabling reliable tracking. Moreover, it is general in the sense that it is not bound to a particular motion segmentation approach.

A fundamental goal in high-level vision is the ability to analyze a large field of view (which might be observed by an aerial sensor) and give a semantic interpretation of the interactions between the actors in the scene. Almost all approaches have been developed for the ideal scenario where very accurate tracking data of the actors is available and can be used to infer the status of the site. A more realistic setting is when the tracks of each actor are fragmented and the fragments are not linked, which accounts for occlusions, traffic, and tracking errors. [CVPR06] and [ICPR06] develop a framework where a dynamic Bayesian network represents the interactions between actors, and inference jointly estimates the most likely linking between fragments, given that a certain event is occurring, and the most likely occurring event, given the current linking. In this way the approach estimates the long-duration tracks while the events are being recognized, despite the high fragmentation. This is possible even in scenes with many non-involved movers, and under different scene viewpoints and/or configurations.

Dynamic Scene Analysis

One of the most important elements of modern Computer Vision is the concept of image texture, or simply texture. Depending on the task at hand (e.g., image-based rendering, recognition, or segmentation, just to mention a few broad areas), several texture models have been proposed in the literature. An image texture originates through an image formation process that is typically very complex and not invertible. However, for image analysis purposes, most of the time it is not necessary to recover all the unknowns of a scene, and one can be content with resorting to a statistical analysis of the data. It is within this spirit that textures are seen as a spatial statistical repetition of image patterns. More formally, image textures can be seen as realizations of stochastic processes defined on a surface, and the "repetition" property can be associated with the "stationarity" of the processes. What happens when these concepts are applied to video?

In nature there are plenty of scenes that originate video sequences showing temporal "repetition," intended in a statistical sense. One could think of a flow of water, a fire, or the flow of car traffic or people walking. These kinds of visual processes are now referred to as dynamic textures. [IJCV03, ICCV01] propose to study dynamic textures as stochastic processes that exhibit temporal stationarity, and introduce the use of linear dynamic systems for modeling their second-order statistical properties. Procedures for learning and simulating a dynamic texture model are derived using prediction error methods, and their effectiveness is demonstrated in several cases. The formalization is technically sound, and the model has since been used in the literature by several other authors to tackle many other problems.
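
A minimal version of the SVD-based learning and simulation steps for such a linear dynamic system can be sketched as follows. This is a simplified sketch in the spirit of the closed-form (suboptimal) solution, with all names chosen here for illustration:

```python
import numpy as np

def learn_lds(Y, n):
    """Closed-form (suboptimal) learning of x_{t+1} = A x_t + v_t,
    y_t = C x_t, from frames stacked as the columns of Y (pixels x time)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n]                                 # appearance (output) matrix
    X = np.diag(s[:n]) @ Vt[:n]                  # estimated state trajectory
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])     # dynamics by least squares
    V = X[:, 1:] - A @ X[:, :-1]                 # state innovations
    return A, C, X, V

def simulate(A, C, x0, Q, T, rng):
    """Synthesize T new frames by driving the learned system with noise."""
    L = np.linalg.cholesky(Q + 1e-8 * np.eye(len(x0)))
    x, frames = x0, []
    for _ in range(T):
        x = A @ x + L @ rng.standard_normal(len(x0))
        frames.append(C @ x)
    return np.stack(frames, axis=1)
```

Given training frames Y, `A, C, X, V = learn_lds(Y, n)` followed by `simulate(A, C, X[:, -1], V @ V.T / max(V.shape[1] - 1, 1), 100, np.random.default_rng(0))` extends the sequence with new, statistically similar frames.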

In analyzing visual processes, there may be portions of video that can be modeled as dynamic textures, meaning that they exhibit temporal stationarity. In addition, within a single frame they may also exhibit repetitions of the same patterns, as in image textures, meaning that the visual process is spatially stationary as well. Therefore, it makes sense to design models that can capture the structure of the joint spatial and temporal statistics, enabling recognition and segmentation. [ECCV04] introduces a model for this kind of dynamic texture, which combines a tree representation of Markov random fields, capturing the spatial stationarity, with linear dynamic systems, capturing the temporal stationarity of the visual process. The effectiveness of the model is demonstrated by extrapolating video in both the space and time domains. The framework sets the stage for simultaneous segmentation and recognition of spatio-temporal events.

Rather than attempting to model the temporal image variability of dynamic textures by capturing only how image intensities (appearance) vary over time, one could try to describe it by modeling how the shape of the scene varies. Both representations have advantages and limitations. For instance, the temporal variations of sharp edges are better captured by shape variation; however, shape alone cannot be used when a directional motion component is present, and appearance is the alternative. Therefore, exploiting the benefits of jointly modeling shape and appearance is very important, as has been demonstrated for single images, but the extension to dynamic scenes (motion) was missing. [IEEE TPAMI06, CVPR05] address this issue, and propose to explain stationary image variability by means of the joint variability of shape and appearance, akin to a temporal generalization of the well-known Active Appearance Models (AAMs). The issues of how much image variability should be modeled by shape, how much by appearance, how they vary over time (motion), and how appearance, shape, and motion merge together are all addressed. The approach is capable of learning the temporal variation of higher-order image statistics, typical of videos containing sharp edge variation.

Dynamic Background Modeling

Several video analysis tasks rely, as an intermediate step towards extracting the desired higher-level information, on modeling the background in order to detect the presence of foreground objects of interest. While several methods are available for simple scenarios, the case of a moving camera observing objects moving in a scene with severe motion clutter is still considered a challenge. [AVSS09] addresses this issue by providing a model for the background that takes into account the camera motion as well as the motion clutter. Detecting a foreground object becomes equivalent to detecting a model change. This is done optimally online by exploiting the sequential generalized likelihood ratio test, applied to the sufficient test statistic that describes the motion clutter.
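
The sequential-testing idea can be illustrated with a basic CUSUM detector for a change between two known Gaussian models of a scalar clutter statistic. The actual [AVSS09] test is a generalized likelihood ratio test; this sketch simplifies it to known pre- and post-change models, with all names chosen here for illustration:

```python
import numpy as np

def cusum(z, mu0, mu1, sigma, h):
    """Sequential change detection on a scalar clutter statistic z_t:
    accumulate the log-likelihood ratio of 'foreground' N(mu1, sigma^2)
    versus 'background' N(mu0, sigma^2), resetting at zero, and raise an
    alarm the first time the score exceeds the threshold h."""
    g = 0.0
    for t, zt in enumerate(z):
        llr = ((mu1 - mu0) / sigma ** 2) * (zt - 0.5 * (mu0 + mu1))
        g = max(0.0, g + llr)          # the reset keeps the test sequential
        if g > h:
            return t                   # alarm: model change detected at t
    return -1                          # no change detected

z = np.array([0.0] * 30 + [3.0] * 30)  # clutter statistic jumps at t = 30
alarm = cusum(z, mu0=0.0, mu1=3.0, sigma=1.0, h=5.0)  # alarms shortly after 30
```

The threshold h trades detection delay against false-alarm rate: a larger h tolerates longer runs of foreground-like clutter before declaring a change.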

Dynamic Texture Recognition

Recognition of objects based on their images is one of the central problems in modern Computer Vision. Objects can be characterized by their geometric, photometric, and dynamic properties. While a vast literature exists on recognition based on geometry and photometry, much less has been said about recognizing scenes based on their dynamics. [CVPR01] formulates the problem of recognizing a sequence of images based on a joint photometric-dynamic model. This enables distinguishing not just steam from foliage, but also fast turbulent steam from haze, or detecting the presence of strong winds by looking at trees. Sequences are not represented by local features or optical flow. Instead, they are assumed to be realizations of stationary stochastic processes. Recognition is based not on classifying individual realizations, but on the statistical models that generate them. This entails studying the structure of the space of models, and defining distances between model instances. It is shown that ignoring the structure of the model space leads to poor recognition performance.
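
One concrete way to define a distance between two such model instances is through the principal angles between their observability subspaces, which are invariant to the arbitrary choice of state-space basis. The sketch below uses finite observability matrices as a simplified stand-in for the subspace-angle (Martin-type) distances commonly used to compare dynamic texture models; it is an illustration, not the [CVPR01] formulation:

```python
import numpy as np

def observability(A, C, m=10):
    """Extended observability matrix [C; CA; ...; C A^(m-1)]."""
    blocks, M = [], C
    for _ in range(m):
        blocks.append(M)
        M = M @ A
    return np.vstack(blocks)

def subspace_distance(A1, C1, A2, C2, m=10):
    """Distance between two LDS models from the principal angles between
    their (finite) observability subspaces; zero for models that are
    equal up to a change of state-space basis."""
    Q1, _ = np.linalg.qr(observability(A1, C1, m))
    Q2, _ = np.linalg.qr(observability(A2, C2, m))
    cos = np.clip(np.linalg.svd(Q1.T @ Q2, compute_uv=False), 0.0, 1.0)
    return float(np.sqrt(np.sum(np.arccos(cos) ** 2)))
```

Two realizations of the same process yield models that differ by a state-space basis change; the distance above is zero for such pairs, which is exactly the model-space structure that naive parameter-vector comparisons ignore.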

Dynamic Texture Segmentation

Segmenting the image plane of video sequences is often one of the first steps towards the analysis of video. A lot of effort has been spent on developing image segmentation techniques based on cues such as color or texture. Similarly, there are several methods for segmenting image motion based on optical flow or motion features. On the other hand, there are cases where segmentation based on photometry or motion (dynamics) alone is insufficient: regions segmented based on photometry may not move coherently, and regions segmented based on motion may not be photometrically homogeneous. [ICCV03] addresses, for the first time, the problem of segmenting video based on the joint photometry and dynamics of the scene. The approach solves a variational optimization problem that looks for the region boundaries and the dynamic texture models that optimally represent the video data inside each region. The result is an algorithm that can group regions with the same spatio-temporal statistics. Extensions of this approach have been successfully used in applications such as traffic monitoring, and in medical image analysis for segmenting organs in motion.

Dynamic Texture Editing: Video-based rendering

In Computer Graphics, the operation of producing new video data by processing existing video data is known as Video-Based Rendering (VBR). Developing new VBR techniques is important because they allow quickly synthesizing new photo-realistic scenes (thanks to the origin of the data) without having to develop and simulate a synthetic model of the scene. The difficulty is in developing editing techniques that, once applied to the original data, produce the desired perceptual effect. [SIGGRAPH02, CVPR03] present a VBR approach that learns dynamic texture models from video, and then simulates them to synthesize/render new, unseen videos. Modeling the spatial stationarity enables synthesis not only in time, but also in space (so the frame size can grow) [ECCV04]. It is shown how dynamic texture model parameters can be edited (changed) online and mapped to meaningful perceptual changes, such as the spatial frequency content, the speed, the time axis, or the intensity of the visual process. This means that from a video sequence of sea waves one could, for instance, produce a new video with rougher or smoother sea movement, as desired.

[Texture03] further extends the previous VBR approach by introducing a model of the spatio-temporal statistics of a collection of images of dynamic scenes as seen from a moving camera. The joint modeling of the moving vantage point together with the statistics of the scene motion is obtained by introducing a time-variant linear dynamic system. The resulting algorithms could be useful for video editing, where the motion of a virtual camera can be controlled interactively, as well as for the stabilized synthetic generation of video sequences.

3D Object Modeling

Building 3D models is an important problem in several areas, such as forensic applications, medical applications, industrial inspection, and virtual visits of 3D synthetic environments. One way to build models is to acquire range data images by means of laser scanners, and then stitch them together. This implies registering, or aligning, range data pairs. [IEEE TPAMI02] introduces an original method to solve this problem, which operates in the frequency domain. The Fourier transform allows decoupling the estimation of the rotation parameters from the estimation of the translation parameters, and the algorithm exploits this well-known property through a three-step procedure. The performance of the algorithm, assessed through extensive testing with several objects, shows that good and very robust estimates of 3D rigid motion are achievable, well suited for unsupervised registration. The algorithm can be used as a pre-alignment tool for more accurate space-domain registration techniques, such as the ICP algorithm. These methods have been successfully deployed for building models of cultural heritage objects, and for registering computed tomography data.
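
The decoupling rests on the fact that the magnitude of the Fourier transform is invariant to translation, so rotation can be estimated from the magnitudes first, and translation recovered afterwards, e.g. by phase correlation. A minimal 2D phase-correlation sketch (handling circular shifts only, not the full 3D three-step procedure of [IEEE TPAMI02]) is:

```python
import numpy as np

def phase_correlation(f, g):
    """Recover the circular shift t such that g = roll(f, t): the
    normalized cross-power spectrum conj(F) * G has unit magnitude and a
    pure phase ramp, whose inverse transform is a delta at t. Note that
    |G| = |F| regardless of t, which is the property that lets rotation
    be estimated from the magnitudes alone."""
    F, G = np.fft.fft2(f), np.fft.fft2(g)
    R = np.conj(F) * G
    R /= np.maximum(np.abs(R), 1e-12)   # keep only the phase
    corr = np.fft.ifft2(R).real
    return np.unravel_index(int(np.argmax(corr)), corr.shape)
```

Because the spectrum is normalized to unit magnitude, the correlation peak is sharp even when the two inputs differ strongly in contrast, which is part of what makes frequency-domain pre-alignment robust.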

Registration of free-form 3D surfaces can be made more robust by integrating the surface albedo. [ICIP98] investigates this problem in the Fourier domain and proposes a new technique that uses radial projections of the frequency-domain representation of the combined range and intensity data. An interesting extension of the algorithm can be used to estimate 3D affine transformations. The results are useful per se in applications aimed at enhancing the visual quality of the models, or can serve as a good starting point for the ICP algorithm when higher precision is needed.