Recognizing human interactions from video is an important step towards the long-term goal of fully automatic scene understanding. Recent years have seen a concentration of work on recognizing single-person actions, as well as group activities. By contrast, modeling the interactions between two people is still relatively unexplored.


To compare human interactions effectively at the frame level, we have designed pairwise kernels that satisfy the balance property.
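
As an illustration of the balance property, the sketch below builds a generic balanced pairwise kernel from a base kernel by symmetrizing over the two possible matchings of people. It is not the kernel designed in the paper: the RBF base kernel, the function names, and the toy feature vectors are all assumptions made for this example. The balance property is understood here in its usual sense, namely that swapping the two people within an interaction leaves the kernel value unchanged.

```python
import math

def rbf(x, y, gamma=1.0):
    """Base kernel between two per-person feature vectors (assumed choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def balanced_pairwise_kernel(pair1, pair2, base=rbf):
    """Symmetrized tensor-product kernel between two interactions.

    Each interaction is a pair (person_a, person_b) of feature vectors.
    Summing over both matchings makes the value independent of the
    order of the people inside each pair.
    """
    a, b = pair1
    c, d = pair2
    return base(a, c) * base(b, d) + base(a, d) * base(b, c)

# Hypothetical per-person features for two interactions.
p1, p2 = [0.2, 0.8], [0.6, 0.4]
q1, q2 = [0.1, 0.9], [0.5, 0.5]

k_original = balanced_pairwise_kernel((p1, p2), (q1, q2))
k_swapped = balanced_pairwise_kernel((p2, p1), (q1, q2))
print(k_original, k_swapped)
```

Swapping the people in either interaction only reorders the two products in the sum, so both calls return the same value.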

At the video level, we have used Binet-Cauchy kernels to compute the similarity between videos.
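
To give a rough idea of how a video-level similarity can be built from frame-level comparisons, the sketch below implements a discounted trace kernel on two feature trajectories, which is one simple member of the Binet-Cauchy family. It is a simplified illustration only: it compares aligned frames directly and ignores any dynamical-system modeling, and the decay parameter `lam`, the dot-product frame kernel, and the function names are assumptions rather than the configuration used in the paper.

```python
import math

def dot(x, y):
    """Linear frame-level kernel (assumed; any valid frame kernel could be used)."""
    return sum(a * b for a, b in zip(x, y))

def trace_kernel(video_a, video_b, frame_kernel=dot, lam=0.1):
    """Discounted sum of frame-level kernel values over aligned frames.

    video_a, video_b: lists of per-frame feature vectors.
    lam: non-negative decay rate; later frames receive smaller weights.
    Only the overlapping frames of the two videos are compared.
    """
    n = min(len(video_a), len(video_b))
    return sum(
        math.exp(-lam * t) * frame_kernel(video_a[t], video_b[t])
        for t in range(n)
    )

# Hypothetical two-frame videos with two-dimensional frame features.
video_x = [[1.0, 0.0], [0.0, 1.0]]
video_y = [[0.5, 0.5], [0.0, 1.0]]
print(trace_kernel(video_x, video_y))
```

Passing a pairwise kernel as `frame_kernel` would combine the frame-level and video-level comparisons in one call.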


Representation of human interactions: At every frame, the bounding box delimiting the region of each person is assumed to be given (e.g., through the use of a person tracker, as is typically done in video surveillance settings). From each bounding box, two features are computed. The first is the histogram of oriented optical flow (HOOF), which captures the motion between two consecutive frames. The video below shows what optical flow and HOOF look like:
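
For readers unfamiliar with HOOF, here is a minimal sketch of how such a histogram can be computed from the flow vectors inside one bounding box. It follows the common formulation in which each vector is binned by its angle from the horizontal axis, weighted by its magnitude, and the histogram is normalized to sum to one. The number of bins, the mirroring of leftward and rightward motion, and the function name are assumptions; the exact settings used in the paper may differ.

```python
import math

def hoof(flow_vectors, num_bins=8):
    """Histogram of oriented optical flow for one bounding box.

    flow_vectors: iterable of (dx, dy) flow vectors inside the box.
    Returns a list of num_bins non-negative values summing to 1
    (or all zeros if there is no motion).
    """
    hist = [0.0] * num_bins
    for dx, dy in flow_vectors:
        magnitude = math.hypot(dx, dy)
        if magnitude == 0.0:
            continue
        # Use |dx| so that leftward and rightward motion share a bin;
        # the angle then lies in [-pi/2, pi/2].
        theta = math.atan2(dy, abs(dx))
        index = int((theta + math.pi / 2) / math.pi * num_bins)
        index = min(index, num_bins - 1)  # theta == pi/2 falls in the last bin
        hist[index] += magnitude
    total = sum(hist)
    if total > 0.0:
        hist = [h / total for h in hist]
    return hist

# Hypothetical flow field: two horizontal vectors and one vertical vector.
example = hoof([(1.0, 0.0), (-1.0, 0.0), (0.0, 2.0)], num_bins=4)
print(example)
```

Normalizing by the total magnitude makes the descriptor insensitive to the size of the bounding box.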


In addition to HOOF, we introduce a feature called the motion histogram (MH), which summarizes the motion trajectory over the past T - 1 frames (where T > 1). The video below shows what motion images and motion histograms look like:


We have also computed the distance between the two bounding boxes at each frame.
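
The text does not say which distance is used, so the sketch below assumes the Euclidean distance between the centers of the two boxes, with each box given as (x, y, width, height) in pixels. Both the box format and the choice of center-to-center distance are assumptions made for illustration.

```python
import math

def box_center(box):
    """Center of a box given as (x, y, width, height); the format is assumed."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def box_distance(box_a, box_b):
    """Euclidean distance between the centers of two bounding boxes."""
    ax, ay = box_center(box_a)
    bx, by = box_center(box_b)
    return math.hypot(ax - bx, ay - by)

# Hypothetical boxes for the two people in one frame.
person_a = (0, 0, 2, 2)
person_b = (3, 4, 2, 2)
print(box_distance(person_a, person_b))
```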

Pairwise kernels for human interactions: We have shown in the paper that the representation of human interactions has a specific Riemannian structure, and we have used this known structure to design kernels that map human interaction features to a reproducing kernel Hilbert space (RKHS).


We have evaluated the recognition accuracy of the designed kernels on the UT-Interaction and TVHI datasets and compared them with RBF kernels. The results show that the designed kernels perform better than the non-linear RBF kernel.