Video understanding is concerned with parsing a stream of image data to recover the semantics of a scene: not only the objects it contains, but also the actions and interactions that define their behavior. When the objects of interest are people, they must be detected (Tu et al., 2008) and recognized (Wu et al., 2008), and also tracked over time and re-identified when they reappear (Doretto et al., 2011). By detecting people's actions and interactions (Motiian et al., 2017), we can also attempt to predict their future behavior and intent. These techniques can be used to answer queries that require mining a large corpus of video data for safety and security applications. Variations of the same techniques could also be applied in other domains, for example to analyze and quantify the behavior of a heart in an echocardiogram.


