Histogram of Flow and Pyramid Histogram of Visual Words for Action Recognition

Ethem F. Can and R. Manmatha
Department of Computer Science, UMass Amherst, Amherst, MA 01002, USA

Abstract

In this paper, we focus on a representation (HFLOW) for action recognition in which we keep track of the flow of visual words between consecutive frames in a video clip. In addition, we consider a pyramid histogram of visual words (PHOW) extracted from densely sampled SIFT descriptors. Action recognition is done by training a multi-class SVM classifier on each feature and then classifying test videos. The fusion of these two techniques gives us state-of-the-art results on two challenging standard datasets of unconstrained web videos. On UCF50 we achieve 72.23% accuracy, and on the more challenging HMDB51 we improve on published techniques with an accuracy of 30.19%.

1. Introduction

Action recognition is a basic computer vision task that has been well studied, and much work has been done on it in recent years. As the field has progressed, the datasets have become more complicated. While early datasets such as KTH [13] helped advance the field, they were restricted to a small number of categories (6 for KTH) and were constrained in other ways. Most techniques perform very well on these datasets, with accuracies in the high 90% range. Kuehne et al. [6] argue that the earlier video datasets allow static poses (joint locations) to be exploited to get good results, and proposed HMDB51 as a dataset requiring motion cues. A second challenging dataset is UCF50 [1]. Recent datasets have more categories (around 50), use unconstrained videos, and are considered much more challenging than previous datasets, which focused on a person performing an action in front of a single stationary camera [11]. Many of the videos in these recent datasets are closer to real-world cases since they are collected from YouTube or are clips from films. The challenge of these datasets is exemplified by the fact that the best methods have accuracies in the low to mid-70% range on UCF50, while all previous techniques on HMDB51 have been below 30%.

Much of the recent work on action recognition has involved creating new descriptors [5, 12], or using existing descriptors or extensions of 2D descriptors such as HOG/HOF, SIFT3D, GIST3D and so on that incorporate a temporal dimension, followed by a classifier (usually a support vector machine). Another approach involves looking at motion changes [5, 7]. In addition, recent techniques [11, 14] have often involved the fusion of a number of descriptors to improve accuracy. Two recent papers [10, 18] provide extensive surveys of methods used for action recognition.

Here we show that representations densely sampled in time and space can produce state-of-the-art results on action recognition. Most actions involve one or more objects performing movements or interacting with other objects. Thus, our representations try to exploit both the fact that there is an object and the fact that there is motion. Specifically, we use two representations: one exploits motion changes, which are useful for describing actions, while the other uses static descriptors, which have been shown to be successful for object recognition [3]. We propose a new representation (HFLOW) that considers the histogram of the flow of visual words between adjacent frames in a video clip.
In addition, we use a representation based on spatial pyramid histograms of visual words (PHOW) extracted from densely sampled SIFT descriptors in each video frame. For both representations we start by extracting densely sampled SIFT descriptors, which we then quantize to visual words. For the PHOW representation we use pyramids up to level 2, where there are one, four, and sixteen cells at levels 0, 1, and 2 respectively. We compute a histogram for each region in each frame and pool the histograms of a region over the entire video to create a PHOW histogram for the video. In the datasets under consideration, videos may have varying numbers of frames, differing by as much as a factor of 50. To account for this, a new normalization scheme is used: we take the logarithm of each bin and divide by the $L_\infty$ norm, i.e., the maximum (after taking logarithms) over all bins at each level.
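As a concrete illustration, the construction just described might be sketched as follows in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: `phow_level`, `vocab`, and `frames` are hypothetical names, and only a single pyramid level is shown.

```python
import numpy as np

# A minimal sketch of the PHOW construction described above -- not the
# authors' code. `vocab` is the 1000-word visual vocabulary (one 128-D
# SIFT centroid per row); `frames` yields, for each sampled frame, its
# dense SIFT descriptors, their (x, y) positions, and the frame size.
# Only one pyramid level is shown; levels 0, 1 and 2 are built the same
# way and concatenated into the final 21,000-D vector.
def phow_level(frames, vocab, level=2):
    n = 2 ** level                                    # cells per side
    hist = np.zeros((n, n, len(vocab)))
    for descs, xs, ys, w, h in frames:
        # quantize each descriptor to its nearest visual word
        # (brute-force distances; fine for a sketch)
        d2 = ((descs[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
        words = d2.argmin(axis=1)
        rows = np.minimum(ys * n // h, n - 1).astype(int)
        cols = np.minimum(xs * n // w, n - 1).astype(int)
        np.add.at(hist, (rows, cols, words), 1)       # pool over the video
    v = np.log(np.maximum(hist, 1.0))                 # bins of 0 or 1 -> 0
    return (v / max(v.max(), 1e-12)).ravel()          # L-infinity normalize
```

Note that taking `log(max(count, 1))` maps bins with value 0 or 1 to 0, matching the observation below that such bins are mostly noise.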

For HFLOW we focus on a frame structure with 16 cells (a 4x4 grid) covering the video frame (level 2 of the spatial pyramid). We keep track of the flow of visual words between consecutive frames. There are 10 possible moves: a word can move to one of its 8 neighboring regions, stay in the same region, or disappear entirely from the neighborhood. That is, for a visual word $w_i$ in a region $r_j$ at frame $f_t$, we check which region it moves to at frame $f_{t+1}$. Rather than track an individual visual word, we look at the histogram of the visual words and check how the counts in each bin have changed compared to the counts at a neighbor or at its own location in the next frame. We compute a flow histogram for each region and again pool the histograms of a region over the entire video to create the HFLOW representation for the video. As for PHOW, normalization involves taking logarithms and an $L_\infty$ norm over each possible move.

Classification is done by training multi-class models using the SVM-multiclass classifier with a linear kernel. This is done for each representation separately, and the results are then combined using late fusion. The fused results are competitive with the state of the art on UCF50 (72.23%, with the best previous result at 76.9%) and outperform all reported results on HMDB51 (30.19%, with the best previous result under 30%). Each representation by itself also performs quite well on both datasets.

The paper is organized as follows. We discuss related work in the action recognition literature. This is followed by a discussion of our approaches. We then explain the details of the datasets and the experimental settings used. The next section reports the results, which is followed by a discussion. Finally, we conclude the paper.

2. Related Work

A number of approaches have either used 2D descriptors or extended them to 3D versions. Another common theme among recent techniques is the fusion of multiple techniques to improve results. Sun et al. [15] used 2D SIFT and 3D SIFT computed at keypoints. They also compute Zernike moments on a single frame (FRM-ZNK) or on a motion energy image (MEI-ZNK). They get their best results using fusion on both the KTH and Weizmann datasets. For object detection it has been shown that the best results are obtained using dense SIFT features rather than SIFT keypoints alone [3].

Several different ways of incorporating temporal information into descriptors have been proposed. One is the combination of the histogram of oriented gradients (HOG) and the histogram of oriented flow (HOF) to give a HOG/HOF feature [8]. This has been shown to work quite well. Like our technique, it combines an approach (HOG) which works well for recognizing objects with a version of HOG incorporating motion, called HOF. These descriptors are usually computed at 3D Harris corners. Wang et al. [17] compared several descriptors such as HOG3D, HOF, HOG/HOF, Cuboids (temporal Gabor filters), and ESURF (an extended version of SURF), and showed that HOG/HOF worked best when densely sampled points were used on the sports actions dataset. Other descriptors include STIP (spatio-temporal interest points) [8]. Several techniques focus on encoding the motion. SIFT displacement [7] looks at how SIFT keypoints move between frames by matching their SIFT descriptors. The matched descriptors in a region are then binned into a histogram using their orientations.
Two recent papers [10, 18] provide extensive surveys of the methods used for action recognition. We now focus on more recent papers using the UCF50 and HMDB51 datasets. Sadanand and Corso [12], in analogy with ObjectBank, proposed ActionBank, which describes actions in terms of manually constructed action templates. Their approach is somewhat expensive. Todorovic [16] views actions as stochastic Kronecker graphs. It is not clear from [1] what splits they used, so it is hard to compare with their results. Kliper-Gross et al. [5] focus on capturing local changes in motion directions. They encode every pixel of every frame by eight strings of eight trinary digits each. This patch-based method encodes the pixel in terms of motion changes across adjacent frames. They show that their method provides good performance on benchmark action recognition datasets. Reddy and Shah [11] classify actions using color SIFT (CSIFT) vectors computed separately at moving and stationary points. The results are fused with a trajectory-based descriptor (MBH) to give good results on UCF50. Their approach to HMDB51 is slightly different, since there they fuse CSIFT with HOG/HOF. Solmaz et al. [14] compute a temporal version of GIST called GIST3D using 3D Gabor filters. Fusing it with STIP lets them obtain among the best results of all techniques.

3. Approach

Here we describe the two representations: the pyramid histogram of visual words (PHOW), quantized from densely sampled SIFT (dense SIFT) descriptors, and the histogram of the flow of visual words (HFLOW) across regions between consecutive frames of a video clip. PHOW is a well-known descriptor used for object recognition, but we are not aware of its use for action recognition. HFLOW is a new descriptor that we propose.

Frames are subsampled from the video at a rate of 10 frames/second. For each frame, dense SIFT descriptors are extracted. To limit processing time, the frames are resized so that the height and width of a video frame are at most 300 pixels. The step size for dense sampling is set to 5 pixels. We extract SIFT descriptors at three pyramid levels (full size, 50% of full size, and 25% of full size).
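The extraction setup just described might be sketched per frame as follows, using OpenCV in place of whatever extractor the authors used; `dense_sift` and its parameters are illustrative names, and the caller is assumed to have already subsampled frames at 10 fps.

```python
import cv2
import numpy as np

def dense_sift(frame, step=5, scales=(1.0, 0.5, 0.25), max_side=300):
    """Densely sampled SIFT at three pyramid scales -- a sketch of the
    setup described above, not the authors' pipeline."""
    h, w = frame.shape[:2]
    s = min(1.0, max_side / max(h, w))            # cap the frame at 300 px
    frame = cv2.resize(frame, (int(w * s), int(h * s)))
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    descs = []
    for sc in scales:                             # three pyramid levels
        img = cv2.resize(gray, None, fx=sc, fy=sc)
        kps = [cv2.KeyPoint(float(x), float(y), float(step))
               for y in range(0, img.shape[0], step)
               for x in range(0, img.shape[1], step)]
        _, d = sift.compute(img, kps)
        if d is not None:
            descs.append(d)
    return np.vstack(descs)                       # one 128-D row per point
```

The resulting descriptors are then quantized against the visual vocabulary discussed next.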

Instead of creating a separate vocabulary for each dataset (which is time consuming), a standard vocabulary is borrowed from an external image set, the Large Scale Visual Recognition Challenge 2011 (ILSVRC2011) set [2]. This is a 1000-word vocabulary and is used to quantize all the SIFT vectors from all the videos. We note that a visual vocabulary created from the actual videos actually performed slightly worse. Using these visual words, the PHOW and HFLOW descriptors are created as described below.

3.1. Pyramid Histogram of Visual Words (PHOW)

Spatial pyramid representations have been widely used for object recognition [9]. A pyramid at grid level $l$ has $2^l$ regions along each dimension. We use pyramids up to level 2, where at levels $l = 0, 1, 2$ there are 1, 4, and 16 regions respectively. The histogram of each region is computed individually, giving a histogram of size 1000 per region. The histogram for each region $r_i$ is pooled over the entire video: we simply sum the values of each bin in the histogram of a region to obtain a histogram $h_i$ representing the video clip for that region.

Each video clip may have a different number of frames $n$; in fact, $n$ can vary by a factor of 50 in some datasets. Thus, a good normalization scheme is critical. $L_1$ and $L_2$ normalization do not work very well. Instead we adopt a different scheme. We reduce the dynamic range of values in the histograms by taking the logarithm of each bin, and for each level we normalize using the $L_\infty$ norm; that is, we divide all the bins by the maximum of the bins (after taking logarithms) at that level. Formally, let $h_{ijl}$ be the value of bin $i$ at region $j$ and level $l$:

$$v_{ijl} = \frac{\log(h_{ijl})}{\max_{ij}(\log(h_{ijl}))}, \quad h_{ijl} \neq 0 \qquad (1)$$
$$v_{ijl} = 0, \quad h_{ijl} = 0 \qquad (2)$$

Videos have a lot of repetition (nearby frames are similar), so a bin with value 1 is likely to be due to noise. Setting histogram bins with value 0 or 1 to 0, therefore, does not make much difference. The histograms are then appended to each other to create a feature vector of size 21,000 (1,000 + 4,000 + 16,000).

3.2. Histogram of Flow of Visual Words (HFLOW)

Previous attempts at tracking features for action recognition have involved seeing how keypoints move. SIFT displacement [7] involves tracking keypoints between frames (using the SIFT descriptors for matching) and then using the orientation histogram, weighted by the distance moved, as a feature vector for action recognition. Matching SIFT descriptors extracted from adjacent frames and then tracing those descriptors is very time consuming. In our work, we focus instead on the flow of the visual words between consecutive video frames. We have 1000 visual words and we want to characterize how they move between regions. Rather than focus on individual visual words, we look at how the histogram of the visual words moves between regions in consecutive frames.

Level 2 of the spatial pyramid is used; there are 4x4 regions in this case. Each region has its own histogram of visual words. We only consider the flow of visual words from a region to its neighboring regions. In other words, a visual word $w_i$ in region $r_j$ at frame $f_t$ can move to one of its 8 neighboring regions (for regions in the center) in the next frame $f_{t+1}$. There are two more possibilities: $w_i$ may (1) stay in the same region at frame $f_{t+1}$, or (2) disappear entirely from the neighborhood. Thus, there are ten ways for a visual word $w_i$ to flow.

Figure 1. Illustration of visual word $w_i$ at $f_t$ and its possible moves in the following frame $f_{t+1}$.
In Figure 1, we illustrate 9 possible moves of $w_i$ from $f_t$ to $f_{t+1}$. Note that there is one more possibility: a visual word present at $f_t$ may disappear at $f_{t+1}$.

Figure 2. Possible moves of visual words in the 4x4 grid layout.

Figure 2 shows some of the possible moves of visual words in the 4x4 grid layout that we use for HFLOW. For a histogram $h_j$ of a region $r_j$ at frame $f_t$, we compare that histogram with the histograms of its neighbors at frame $f_{t+1}$ to determine the amount of flow from $f_t$ to $f_{t+1}$ for that particular region.
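To make the comparison concrete, here is a minimal NumPy sketch of the flow computation for one frame pair, anticipating the formal definition given in the next paragraph (flow to a neighbor is the bin-wise minimum of the two histograms). The array shapes and names here are assumptions, not the authors' code.

```python
import numpy as np

# h_t, h_t1: (4, 4, 1000) arrays of un-normalized per-region visual
# word counts for frames f_t and f_{t+1}. Moves 0-8 are the eight
# neighbor offsets plus staying put; move 9 means "disappeared".
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1),
           (1, -1), (1, 0), (1, 1), (0, 0)]

def frame_flow(h_t, h_t1, grid=4, words=1000):
    g = np.zeros((10, words))
    for r in range(grid):
        for c in range(grid):
            seen = np.zeros(words)                # any flow found at all?
            for k, (dr, dc) in enumerate(OFFSETS):
                nr, nc = r + dr, c + dc
                if 0 <= nr < grid and 0 <= nc < grid:
                    f = np.minimum(h_t[r, c], h_t1[nr, nc])
                    g[k] += f
                    seen = np.maximum(seen, f)
            # move 9: present at f_t but unmatched anywhere in the
            # neighborhood at f_{t+1}
            g[9] += np.where(seen == 0, h_t[r, c], 0)
    return g.reshape(-1)                          # 10 x 1000 = 10,000-D
```

The per-pair vectors are then pooled over the whole video and normalized per move, as formalized below.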

We start with the un-normalized visual word histograms for each region. Assume that $h_j(f_t)$ is the histogram of region $r_j$ at frame $f_t$, and similarly $h_j(f_{t+1})$ is the histogram of region $r_j$ at frame $f_{t+1}$. The maximum amount of flow that can be carried from a bin $w_i^t \in h_j(f_t)$ to the corresponding bin $w_i^{t+1} \in h_k(f_{t+1})$ of one of its neighbors is the minimum of $w_i^t$ and $w_i^{t+1}$. Since each visual word has 10 possible moves in the HFLOW representation (8 to neighboring regions, 1 if it stays in the same region, and 1 if it disappears), the flows may be defined as $g_{flow}^0, \ldots, g_{flow}^8$ and $g_{flow}^9$. The $i$-th bin of the $k$-th flow vector is $g_{flow}^k(i) = \min\{w_i^t, w_i^{t+1}\}$ for $k = 0, \ldots, 8$. We define $g_{flow}^9(i) = w_i^t$ if the word disappears completely from the neighborhood (i.e., $w_i^{t+1} = 0$ in all neighboring regions), and 0 otherwise. Note that when $g_{flow}^9(i)$ is non-zero, all of $g_{flow}^0(i), \ldots, g_{flow}^8(i)$ are zero; this distinguishes this case from one where the visual word is absent at both $t$ and $t+1$.

For a pair of adjacent frames we thus have ten flow vectors, each of size 1,000 (one bin per visual word). Our representation for a frame pair is the concatenation of these vectors, as in the PHOW case: $g_{flow} = [g_{flow}^0 \ldots g_{flow}^9]$. That is, we have a 10,000-dimensional vector. A video clip of $n$ frames yields $(n-1)$ such $g_{flow}$ vectors. We finally pool these $(n-1)$ vectors over the entire video (each of the 10,000 dimensions is pooled separately) to obtain the final HFLOW representation for the video clip. As in the PHOW case, we normalize by taking logarithms and dividing by the $L_\infty$ norm; here we divide by the maximum for each individual move:

$$vf_i^k = \frac{\log(g_{flow}^k(i))}{\max_i \log(g_{flow}^k(i))} \qquad (3)$$

The final $vf$ vector has 10,000 dimensions.

4. Datasets and Experimental Environment

Recent papers (for example [11]) have noted that techniques perform well on older action recognition datasets because of the small number of categories and the often uniform backgrounds. They have argued that the more challenging datasets are UCF50 [1] and HMDB51 [6]. Here, we briefly discuss the datasets before describing our experiments on them.

UCF50. This benchmark dataset consists of about 6600 unconstrained video clips harvested from the web. There are 50 actions, and each action has more than 100 videos. The videos are divided into groups, and a group may contain multiple clips from the same video. There are several published results on this dataset; however, different researchers seem to use different splits, which can make a big difference in the results. For example, if videos from the same group are split between training and test sets, results are likely to be very high. We focus on two published splits [11, 12]. The first [11] is leave-one-group-out (LOgO), where one group is treated as the test set and all other groups are treated as training videos. The second [12] involves five-fold (5fold) group-wise cross-validation, where videos from the same group are placed in the same fold so that parts of the same video do not appear in both the training and test sets. In each case the reported scores are averaged over all the folds of the cross-validation.

HMDB51. This dataset consists of about 6700 videos. There are 51 action classes, and each action has at least 70 videos for training and 30 for testing in the standard splits provided in the original paper introducing the dataset [6].
There are 10 action classes that overlap with the UCF50 set. This dataset is considered very challenging, since the best existing result is about 29% [5, 14].

Experimental Environment. We use a multi-class SVM classifier based on the one-against-all approach. We set the regularization parameter to 5,000 for UCF50 and 500 for HMDB51, with a linear kernel, while creating the models.

5. Results and Discussion

We evaluate PHOW and HFLOW individually. In addition, we perform late fusion on the prediction scores of PHOW and HFLOW by scaling the prediction scores to [0, 1] and then averaging them. In the tables, PHOW+HFLOW denotes the late fusion of PHOW and HFLOW.

We first show the results (Table 1) for both datasets for two different step sizes (5 pixels versus 10 pixels). The smaller step size produces a small improvement at the cost of more computation.

Dataset    System        5 pixels   10 pixels
HMDB51     PHOW          28.47%     27.38%
HMDB51     HFLOW         26.10%     25.29%
HMDB51     PHOW+HFLOW    30.19%     29.06%
UCF50      PHOW          71.25%     69.84%
UCF50      HFLOW         66.18%     66.07%
UCF50      PHOW+HFLOW    72.23%     71.44%

Table 1. Accuracy vs. step size on HMDB51 and UCF50.

Table 2 compares our results on the UCF50 dataset with results from recent papers. The best results have all been published within the last 6 months. Besides our technique, [11] and [14] also fuse multiple descriptors. We note that on the 5fold split, PHOW and HFLOW are each better than HOG/HOF and ActionBank. HFLOW is worse than MIP, but PHOW performance is close to MIP (less than a 1% difference). The fused PHOW+HFLOW is almost the same as MIP and better than the other techniques. For the LOgO split, the fused PHOW+HFLOW is again almost the same as MIP and about 2% and 5% lower than [14] and [11] respectively. Overall, the fused PHOW+HFLOW performs quite well.
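Before turning to the per-dataset comparisons in Tables 2 and 3 below, the classification and late-fusion steps just described might be sketched as follows, substituting scikit-learn's one-vs-rest linear SVM for the SVM-multiclass package (so the regularization parameter is not directly comparable, and the exact score scaling in the paper may differ in detail); all names here are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def decision_scores(X_train, y_train, X_test, C):
    # One-vs-rest linear SVM; returns an (n_videos, n_classes) array
    # of decision values for the test videos.
    return LinearSVC(C=C).fit(X_train, y_train).decision_function(X_test)

def fuse(s_phow, s_hflow):
    # Late fusion: scale each classifier's scores to [0, 1], average,
    # and take the highest-scoring class per video.
    scale = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-12)
    return ((scale(s_phow) + scale(s_hflow)) / 2.0).argmax(axis=1)
```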

System              5fold     LOgO
HOG/HOF [8]         47.9%     N/A
ActionBank [12]     57.9%     N/A
MIP [5]             68.51%    72.68%
GIST3D+STIP [14]    N/A       73.70%
CSIFT+MBH [11]      N/A       76.90%
PHOW                67.64%    71.25%
HFLOW               62.76%    66.18%
PHOW+HFLOW          69.24%    72.23%

Table 2. Recognition accuracies on the UCF50 set.

Table 3 shows the corresponding accuracies on the HMDB51 dataset. The results for all algorithms are substantially lower, indicating a much more difficult dataset. Even small improvements on this dataset are difficult to achieve (since the dataset was released, all techniques have been in the 20-29% range). PHOW by itself performs quite well, doing better than all but two techniques (GIST3D+STIP [14] and MIP [5]). HFLOW is slightly worse. PHOW+HFLOW does better than all the other approaches (about 1% better than the closest techniques).

System                 Accuracy
HOG/HOF [8]            20.44%
C2 [4]                 22.83%
ActionBank [12]        26.90%
CSIFT+HOG/HOF [11]     27.02%
GIST3D+STIP [14]       29.20%
MIP [5]                29.17%
PHOW                   28.47%
HFLOW                  26.10%
PHOW+HFLOW             30.19%

Table 3. Recognition accuracies on the HMDB51 set.

Figure 3 is a confusion matrix showing the average accuracy over 3 splits of the HMDB51 dataset. Entry (i, j) shows what percentage of the videos labeled as action i in the ground truth are labeled as action j by the PHOW+HFLOW technique. The accuracies are rounded to the nearest integer for display purposes, so the rows don't sum exactly to 100%.

It is instructive to look at some examples. Eat and drink are easily confused, which seems reasonable. Somersault is classified correctly in about 16% of the test videos, and the major misclassifications are cartwheel (~23%), flic-flac (~12%), and handstand (~9%). A flic-flac is a backwards somersault, a cartwheel could be thought of as a sideways somersault, and a handstand could be viewed as a component of a somersault. Draw sword is correctly classified in about 27% of the test videos, and the major misclassifications are sword exercise (~28%) and sword (~6%), which refers to sword fights. Golf is correctly classified in about 87% of the test videos. Talk is correctly detected in about 52% of the test cases and is incorrectly labeled as drink (~7%) and chew (~6%). This implies that if there are significant differences between actions, an action is more likely to be detected correctly. Actions which are much more similar, such as somersault versus flic-flac or eat versus drink, are much harder to distinguish. It appears that modeling the object parts more closely in these cases (front versus back for somersault versus flic-flac) or the object used (say, in eating versus drinking) is important. Modeling the motions is important, but modeling the objects may also lead to significant improvements.

So the good performance of PHOW may be due to its ability to model objects. It has been shown before that PHOW and similar descriptors paired with SVM classifiers work very well for object recognition [3]. Chatfield et al. [3] also point out that small changes (specific details) in how the different parts of a system are built can make big differences. Normalization is also important (taking logarithms and using the $L_\infty$ norm), since the videos vary quite a bit in length, and in our experience $L_1$ or $L_2$ normalization did not perform as well. Another reason for the usefulness of PHOW may be that the classifier learns to encode the actions implicitly. HFLOW exploits motion information and comes close to PHOW in terms of accuracy. As previous attempts have shown, it is important to encode motion information.
HFLOW works even though it only encodes motion coarsely (from one region to another) rather than through fine measures of displacement as in [7, 5].

6. Conclusion

We focus on two representations: the pyramid histogram of visual words (PHOW) and the histogram of the flow of visual words (HFLOW). We make use of densely sampled SIFT descriptors that are quantized to visual words for PHOW, and we keep track of the flow of those visual words between consecutive frames of a video clip to compute HFLOW. We also take logarithms of the histograms and use $L_\infty$ normalization to handle videos of varying lengths. We show that late fusion of classifiers trained on both features gives state-of-the-art performance: 72.23% on UCF50 and 30.19% accuracy on HMDB51 (about 1% better than any previous technique). These results suggest that modeling the objects as well as the actions may be a good way of approaching the action recognition problem.

7. Acknowledgment

This work was supported in part by the Center for Intelligent Information Retrieval. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

References

[1] UCF50 action recognition dataset.

Figure 3. Confusion matrix for the HMDB51 dataset. The numbers are percentages; rows may not sum to exactly 100 because of rounding. The scores are averaged over three splits.

[2] Large Scale Visual Recognition Challenge 2011 (ILSVRC2011). http://www.image-net.org/challenges/LSVRC/2011/
[3] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.
[4] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for action recognition. In ICCV, 2007.
[5] O. Kliper-Gross, Y. Gurovich, T. Hassner, and L. Wolf. Motion interchange patterns for action recognition in unconstrained videos. In ECCV, 2012.
[6] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
[7] K.-T. Lai, M.-S. Chen, C.-H. Hsieh, and M.-F. Lai. Orientation histogram of SIFT displacement for recognizing actions in broadcast videos. In EUVIP.
[8] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[9] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[10] R. Poppe. A survey on vision-based human action recognition. IVC, 28(6), 2010.
[11] K. K. Reddy and M. Shah. Recognizing 50 human action categories of web videos. MVA.
[12] S. Sadanand and J. J. Corso. Action bank: a high-level representation of activity in video. In CVPR, 2012.
[13] C. Schüldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In ICPR, 2004.
[14] B. Solmaz, A. Modiri, and M. Shah. Classifying web videos using a global video descriptor. MVA.
[15] X. Sun, M. Chen, and A. Hauptmann. Action recognition via local descriptors and holistic features. In CVPR Workshop on Human Communicative Behavior Analysis (CVPR4HB), pages 58-65, 2009.

[16] S. Todorovic. Human activities as stochastic Kronecker graphs. In ECCV, 2012.
[17] H. Wang, M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
[18] D. Weinland, R. Ronfard, and E. Boyer. A survey of vision-based methods for action representation, segmentation and recognition. CVIU, 115(2), 2011.
