One of the most important elements of modern Computer Vision is the concept of image texture, or simply texture. Depending on the task at hand (e.g. image-based rendering, recognition, or segmentation, to mention a few broad areas), several texture models have been proposed in the literature. An image texture originates through an image formation process that is typically very complex and not invertible. However, for image analysis purposes, most of the time it is not necessary to recover all the unknowns of a scene, and one can resort to a statistical analysis of the data. It is in this spirit that textures are seen as a spatial statistical repetition of image patterns. More formally, image textures can be seen as realizations of stochastic processes defined on a surface, and the "repetition" property can be associated with the "stationarity" of the processes. What happens when these concepts are applied to video?

Dynamic Textures: Modeling the Temporal Statistics

In nature there are plenty of scenes that give rise to video sequences showing temporal "repetition," intended in a statistical sense: think of flowing water, a fire, a stream of car traffic, or people walking. These kinds of visual processes are now referred to as dynamic textures. (Doretto et al., 2003; Soatto et al., 2001) propose to study dynamic textures as stochastic processes that exhibit temporal stationarity, and introduce linear dynamic systems for modeling their second-order statistical properties. They derive procedures, including prediction error methods, for learning a dynamic texture model, as well as procedures for simulating it, and demonstrate its effectiveness in several cases. The formalization is technically sound, and the model has since been used by several other authors to tackle many other problems.
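To make the idea concrete, the second-order modeling can be sketched as a closed-form, PCA-based identification of a linear dynamic system: the frames define an appearance basis and a low-dimensional state trajectory, and a transition matrix fit by least squares drives synthesis. This is a minimal NumPy sketch under those assumptions, not the authors' exact implementation; function names and the state dimension are illustrative.

```python
import numpy as np

def learn_dynamic_texture(frames, n_states=10):
    """Closed-form learning of a linear dynamic system
    x_{t+1} = A x_t + v_t,  y_t = C x_t + mean,
    from a stack of frames of shape (T, H, W)."""
    T, H, W = frames.shape
    Y = frames.reshape(T, -1).T.astype(float)      # pixels x time
    mean = Y.mean(axis=1, keepdims=True)
    Y = Y - mean
    # PCA via SVD: appearance basis C, state trajectory X
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n_states]
    X = np.diag(s[:n_states]) @ Vt[:n_states, :]   # n_states x T
    # Least-squares fit of the state transition matrix A
    A, *_ = np.linalg.lstsq(X[:, :-1].T, X[:, 1:].T, rcond=None)
    A = A.T
    # Driving-noise covariance from one-step prediction residuals
    resid = X[:, 1:] - A @ X[:, :-1]
    Q = resid @ resid.T / (T - 1)
    return A, C, X, Q, mean, (H, W)

def synthesize(A, C, Q, mean, shape, x0, n_frames=50, rng=None):
    """Simulate the learned system forward to hallucinate new frames."""
    rng = np.random.default_rng(rng)
    L = np.linalg.cholesky(Q + 1e-8 * np.eye(Q.shape[0]))  # noise factor
    x, frames = x0.copy(), []
    for _ in range(n_frames):
        frames.append((C @ x + mean[:, 0]).reshape(shape))
        x = A @ x + L @ rng.standard_normal(x.shape)
    return np.stack(frames)
```

Once the system is learned from a short clip, driving it with fresh noise yields an arbitrarily long, statistically similar sequence, which is the essence of dynamic texture synthesis.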

Dynamic Textures: Joint Modeling of Spatial and Temporal Statistics

In analyzing visual processes there may be portions of videos that can be modeled as dynamic textures, meaning that they exhibit temporal stationarity. In addition, within a single frame they may also exhibit repetitions of the same patterns, as in image textures, which means that the visual process is spatially stationary as well. Therefore, it makes sense to design models that capture the structure of the joint spatial and temporal statistics, for the purpose of enabling recognition and segmentation. (Doretto et al., 2004) introduces a model for this class of dynamic textures, which combines a tree representation of Markov random fields, for capturing the spatial stationarity, with linear dynamic systems, for capturing the temporal stationarity of the visual process. The effectiveness of the model is demonstrated by extrapolating video in both the space and time domains. The framework sets the stage for simultaneous segmentation and recognition of spatio-temporal events.

Dynamic Shape and Appearance: Joint Shape, Appearance, and Dynamics Modeling

Rather than attempting to model the temporal image variability of dynamic textures by capturing only how image intensities (appearance) vary over time, one could try to describe it by modeling how the shape of the scene varies. Both representations have advantages and limitations. For instance, the temporal variations of sharp edges are better captured by shape variation; however, shape variation cannot be used when a directional motion component is present, in which case appearance is the alternative. Therefore, exploiting the benefits of jointly modeling shape and appearance is very important, as has been demonstrated for single images, but the extension to dynamic scenes (motion) was missing. (Doretto & Soatto, 2006; Doretto, 2005) address this issue, and propose to explain stationary image variability by means of the joint variability of shape and appearance, akin to a temporal generalization of the well-known Active Appearance Models (AAMs). They address how much image variability should be modeled by shape, how much by appearance, how both vary over time (motion), and how appearance, shape, and motion merge together. The approach is capable of learning the temporal variation of higher-order image statistics, typical of videos containing sharp edge variation.


  1. TPAMI
    Doretto, G., and Soatto, S. Dynamic shape and appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
  2. CVPR
    Doretto, G. Modeling dynamic scenes with active appearance. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005. (Oral)
  3. ECCV
    Doretto, G., Jones, E., and Soatto, S. Spatially homogeneous dynamic textures. In Proceedings of the European Conference on Computer Vision, 2004. (Oral)
  4. IJCV
    Doretto, G., Chiuso, A., Wu, Y. N., and Soatto, S. Dynamic textures. International Journal of Computer Vision, 2003.
  5. ICCV
    Soatto, S., Doretto, G., and Wu, Y. N. Dynamic textures. In Proceedings of the IEEE International Conference on Computer Vision, 2001. (Oral)