Histogram of Flow and Pyramid Histogram of Visual Words for Action Recognition

Ethem F. Can and R. Manmatha
Department of Computer Science, UMass Amherst, Amherst, MA 01002, USA

Abstract

In this paper, we focus on a representation (HFLOW) for action recognition in which we keep track of the flow of visual words between consecutive frames in a video clip. In addition, we consider a pyramid histogram of visual words (PHOW) extracted from densely sampled SIFT descriptors. Action recognition is done by training a multi-class SVM classifier on each feature and then classifying test videos. The fusion of these two techniques gives us state-of-the-art results on two challenging standard datasets of unconstrained web videos. On UCF50 we achieve 72.23% accuracy, and on the more challenging HMDB51 we improve on published techniques with an accuracy of 30.19%.

1. Introduction

Action recognition is a basic computer vision task that has been well studied, and much work has been done on it in recent years. As the field has progressed, the datasets have become more complicated. While early datasets such as KTH [13] helped advance the field, they were restricted to a small number of categories (6 for KTH) and were constrained in other ways. Most techniques perform very well on these datasets, with accuracies in the high 90% range. Kuehne et al. [6] argue that the earlier video datasets allow static poses (joint locations) to be exploited to get good results, and proposed HMDB51 as a dataset requiring motion cues. A second challenging dataset is UCF50 [1]. Recent datasets have more categories (around 50), use unconstrained videos, and are considered much more challenging than previous datasets, which focused on a person performing an action in front of a single stationary camera [11]. Many of the videos in these recent datasets are closer to real-world cases since they are collected from YouTube or are clips from films. The challenge of these datasets is exemplified by the fact that the best methods have accuracies in the low to mid-70% range on UCF50, while all previous techniques on HMDB51 have been below 30%.

Much of the recent work on action recognition has involved creating new descriptors [5, 12], or using existing descriptors or extensions of 2D descriptors such as HOG/HOF, SIFT3D, GIST3D and so on that incorporate a temporal dimension, followed by a classifier (usually a support vector machine). Another approach involves looking at motion changes [5, 7]. In addition, recent techniques [11, 14] have often involved the fusion of a number of descriptors to improve accuracy. Two recent papers [10, 18] provide extensive surveys of methods used for action recognition.

Here we show that representations densely sampled in time and space can produce state-of-the-art results on action recognition. Most actions involve one or more objects performing movements or interacting with other objects. Thus, our representations try to exploit both the fact that there is an object and the fact that there is motion. Specifically, we use two representations: one exploits motion changes, which are useful for describing actions, while the other uses static descriptors, which have been shown to be successful for object recognition [3]. We propose a new representation (HFLOW) that considers the histogram of the flow of visual words between adjacent frames in a video clip.
In addition, we use a representation based on spatial pyramid histograms of visual words (PHOW) extracted from densely sampled SIFT descriptors in each video frame. For both representations we start by extracting densely sampled SIFT descriptors, which we then quantize to visual words. For the PHOW representation we use pyramids up to level 2, where there are one, four, and sixteen cells at levels 0, 1, and 2 respectively. We compute a histogram for each region in each frame and pool the histograms of a region over the entire video to create a PHOW histogram for the video. In the datasets under consideration, videos may have varying numbers of frames, differing by as much as a factor of 50. To account for this, a new normalization scheme is used: we take the logarithm of each bin and divide by the $L_\infty$ norm, i.e., the maximum (after taking logarithms) over all bins at each level.
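As a concrete illustration, the construction just described might be sketched as follows in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: `phow_level`, `vocab`, and `frames` are hypothetical names, and only a single pyramid level is shown.

```python
import numpy as np

# A minimal sketch of the PHOW construction described above -- not the
# authors' code. `vocab` is the 1000-word visual vocabulary (one 128-D
# SIFT centroid per row); `frames` yields, for each sampled frame, its
# dense SIFT descriptors, their (x, y) positions, and the frame size.
# Only one pyramid level is shown; levels 0, 1 and 2 are built the same
# way and concatenated into the final 21,000-D vector.
def phow_level(frames, vocab, level=2):
    n = 2 ** level                                    # cells per side
    hist = np.zeros((n, n, len(vocab)))
    for descs, xs, ys, w, h in frames:
        # quantize each descriptor to its nearest visual word
        # (brute-force distances; fine for a sketch)
        d2 = ((descs[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
        words = d2.argmin(axis=1)
        rows = np.minimum(ys * n // h, n - 1).astype(int)
        cols = np.minimum(xs * n // w, n - 1).astype(int)
        np.add.at(hist, (rows, cols, words), 1)       # pool over the video
    v = np.log(np.maximum(hist, 1.0))                 # bins of 0 or 1 -> 0
    return (v / max(v.max(), 1e-12)).ravel()          # L-infinity normalize
```

Note that taking `log(max(count, 1))` maps bins with value 0 or 1 to 0, matching the observation below that such bins are mostly noise.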

For HFLOW we focus on a frame structure with 16 cells (a 4x4 grid) covering the video frame (level 2 of the spatial pyramid). We keep track of the flow of visual words between consecutive frames. There are 10 possible moves: a word can move to one of its 8 neighboring regions, stay in the same region, or disappear entirely from the neighborhood. That is, for a visual word $w_i$ in a region $r_j$ at frame $f_t$, we check which region it moves to at frame $f_{t+1}$. Rather than track an individual visual word, we look at the histogram of the visual words and check how the counts in each bin have changed compared to the counts at a neighbor or at its own location in the next frame. We compute a flow histogram for each region and again pool the histograms of a region over the entire video to create the HFLOW representation for the video. As for PHOW, normalization involves taking logarithms and an $L_\infty$ norm over each possible move.

Classification is done by training multi-class models using the SVM-multiclass classifier with a linear kernel. This is done for each representation separately, and the results are then combined using late fusion. The fused results are competitive with the state of the art on UCF50 (72.23%, with the best previous result at 76.9%) and outperform all reported results on HMDB51 (30.19%, with the best previous result under 30%). Each representation by itself also performs quite well on both datasets.

The paper is organized as follows. We discuss related work in the action recognition literature. This is followed by a discussion of our approaches. We then explain the details of the datasets and the experimental settings used. The next section reports the results, which is followed by a discussion. Finally, we conclude the paper.

2. Related Work

A number of approaches have either used 2D descriptors or extended them to 3D versions. Another common theme among recent techniques is the fusion of multiple techniques to improve results. Sun et al. [15] used 2D SIFT and 3D SIFT computed at keypoints. They also compute Zernike moments on a single frame (FRM-ZNK) or on a motion energy image (MEI-ZNK). They get their best results using fusion on both the KTH and Weizmann datasets. For object detection it has been shown that the best results are obtained using dense SIFT features rather than SIFT keypoints alone [3].

Several different ways of incorporating temporal information into descriptors have been proposed. One is the combination of the histogram of oriented gradients (HOG) and the histogram of oriented flow (HOF) to give a HOG/HOF feature [8]. This has been shown to work quite well. Like our technique, it combines an approach (HOG) which works well for recognizing objects with a version of HOG incorporating motion, called HOF. These descriptors are usually computed at 3D Harris corners. Wang et al. [17] compared several descriptors such as HOG3D, HOF, HOG/HOF, Cuboids (temporal Gabor filters), and ESURF (an extended version of SURF), and showed that HOG/HOF worked best when densely sampled points were used on the sports actions dataset. Other descriptors include STIP (spatio-temporal interest points) [8]. Several techniques focus on encoding the motion. SIFT displacement [7] looks at how SIFT keypoints move between frames by matching their SIFT descriptors. The matched descriptors in a region are then binned into a histogram using their orientations.
Two recent papers [10, 18] provide extensive surveys of the methods used for action recognition. We now focus on more recent papers using the UCF50 and HMDB51 datasets. Sadanand and Corso [12], in analogy with ObjectBank, proposed ActionBank, which describes actions in terms of manually constructed action templates. Their approach is somewhat expensive. Todorovic [16] views actions as stochastic Kronecker graphs. It is not clear from [1] what splits they used, so it is hard to compare with their results. Kliper-Gross et al. [5] focus on capturing local changes in motion directions. They encode every pixel of every frame by eight strings of eight trinary digits each. This patch-based method encodes the pixel in terms of motion changes across adjacent frames. They show that their method provides good performance on benchmark action recognition datasets. Reddy and Shah [11] classify actions using color SIFT (CSIFT) vectors computed separately at moving and stationary points. The results are fused with a trajectory-based descriptor (MBH) to give good results on UCF50. Their approach to HMDB51 is slightly different, since there they fuse CSIFT with HOG/HOF. Solmaz et al. [14] compute a temporal version of GIST called GIST3D using 3D Gabor filters. Fusing it with STIP lets them obtain among the best results of all techniques.

3. Approach

Here we describe the two representations: the pyramid histogram of visual words (PHOW), quantized from densely sampled SIFT (dense SIFT) descriptors, and the histogram of the flow of visual words (HFLOW) across regions between consecutive frames of a video clip. PHOW is a well-known descriptor used for object recognition, but we are not aware of its use for action recognition. HFLOW is a new descriptor that we propose.

Frames are subsampled from the video at a rate of 10 frames/second. For each frame, dense SIFT descriptors are extracted. To limit processing time, the frames are resized so that the height and width of a video frame are at most 300 pixels. The step size for dense sampling is set to 5 pixels. We extract SIFT descriptors at three pyramid levels (full size, 50% of full size, and 25% of full size).
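The extraction setup just described might be sketched per frame as follows, using OpenCV in place of whatever extractor the authors used; `dense_sift` and its parameters are illustrative names, and the caller is assumed to have already subsampled frames at 10 fps.

```python
import cv2
import numpy as np

def dense_sift(frame, step=5, scales=(1.0, 0.5, 0.25), max_side=300):
    """Densely sampled SIFT at three pyramid scales -- a sketch of the
    setup described above, not the authors' pipeline."""
    h, w = frame.shape[:2]
    s = min(1.0, max_side / max(h, w))            # cap the frame at 300 px
    frame = cv2.resize(frame, (int(w * s), int(h * s)))
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    descs = []
    for sc in scales:                             # three pyramid levels
        img = cv2.resize(gray, None, fx=sc, fy=sc)
        kps = [cv2.KeyPoint(float(x), float(y), float(step))
               for y in range(0, img.shape[0], step)
               for x in range(0, img.shape[1], step)]
        _, d = sift.compute(img, kps)
        if d is not None:
            descs.append(d)
    return np.vstack(descs)                       # one 128-D row per point
```

The resulting descriptors are then quantized against the visual vocabulary discussed next.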

Instead of creating a separate vocabulary for each dataset (which is time consuming), a standard vocabulary is borrowed from an external image set, the Large Scale Visual Recognition Challenge 2011 (ILSVRC2011) set [2]. This is a 1000-word vocabulary and is used to quantize all the SIFT vectors from all the videos. We note that a visual vocabulary created from the actual videos actually performed slightly worse. Using these visual words, the PHOW and HFLOW descriptors are created as described below.

3.1. Pyramid Histogram of Visual Words (PHOW)

Spatial pyramid representations have been widely used for object recognition [9]. A pyramid at grid level $l$ has $2^l$ regions along each dimension. We use pyramids up to level 2, where at levels $l = 0, 1, 2$ there are 1, 4, and 16 regions respectively. The histogram of each region is computed individually, giving a histogram of size 1000 per region. The histogram for each region $r_i$ is pooled over the entire video: we simply sum the values of each bin in the histogram of a region to obtain a histogram $h_i$ representing the video clip for that region.

Each video clip may have a different number of frames $n$; in fact, $n$ can vary by a factor of 50 in some datasets. Thus, a good normalization scheme is critical. $L_1$ and $L_2$ normalization do not work very well. Instead we adopt a different scheme. We reduce the dynamic range of values in the histograms by taking the logarithm of each bin, and for each level we normalize using the $L_\infty$ norm; that is, we divide all the bins by the maximum of the bins (after taking logarithms) at that level. Formally, let $h_{ijl}$ be the value of bin $i$ at region $j$ and level $l$:

$$v_{ijl} = \frac{\log(h_{ijl})}{\max_{ij}(\log(h_{ijl}))}, \quad h_{ijl} \neq 0 \qquad (1)$$
$$v_{ijl} = 0, \quad h_{ijl} = 0 \qquad (2)$$

Videos have a lot of repetition (nearby frames are similar), so a bin with value 1 is likely to be due to noise. Setting histogram bins with value 0 or 1 to 0, therefore, does not make much difference. The histograms are then appended to each other to create a feature vector of size 21,000 (1,000 + 4,000 + 16,000).

3.2. Histogram of Flow of Visual Words (HFLOW)

Previous attempts at tracking features for action recognition have involved seeing how keypoints move. SIFT displacement [7] involves tracking keypoints between frames (using the SIFT descriptors for matching) and then using the orientation histogram, weighted by the distance moved, as a feature vector for action recognition. Matching SIFT descriptors extracted from adjacent frames and then tracing those descriptors is very time consuming. In our work, we focus instead on the flow of the visual words between consecutive video frames. We have 1000 visual words and we want to characterize how they move between regions. Rather than focus on individual visual words, we look at how the histogram of the visual words moves between regions in consecutive frames.

Level 2 of the spatial pyramid is used; there are 4x4 regions in this case. Each region has its own histogram of visual words. We only consider the flow of visual words from a region to its neighboring regions. In other words, a visual word $w_i$ in region $r_j$ at frame $f_t$ can move to one of its 8 neighboring regions (for regions in the center) in the next frame $f_{t+1}$. There are two more possibilities: $w_i$ may (1) stay in the same region at frame $f_{t+1}$, or (2) disappear entirely from the neighborhood. Thus, there are ten ways for a visual word $w_i$ to flow.

Figure 1. Illustration of visual word $w_i$ at $f_t$ and its possible moves in the following frame $f_{t+1}$.
In Figure 1, we illustrate 9 possible moves of $w_i$ from $f_t$ to $f_{t+1}$. Note that there is one more possibility: a visual word present at $f_t$ may disappear at $f_{t+1}$.

Figure 2. Possible moves of visual words in the 4x4 grid layout.

Figure 2 shows some of the possible moves of visual words in the 4x4 grid layout that we use for HFLOW. For a histogram $h_j$ of a region $r_j$ at frame $f_t$, we compare that histogram with the histograms of its neighbors at frame $f_{t+1}$ to determine the amount of flow from $f_t$ to $f_{t+1}$ for that particular region.
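To make the comparison concrete, here is a minimal NumPy sketch of the flow computation for one frame pair, anticipating the formal definition given in the next paragraph (flow to a neighbor is the bin-wise minimum of the two histograms). The array shapes and names here are assumptions, not the authors' code.

```python
import numpy as np

# h_t, h_t1: (4, 4, 1000) arrays of un-normalized per-region visual
# word counts for frames f_t and f_{t+1}. Moves 0-8 are the eight
# neighbor offsets plus staying put; move 9 means "disappeared".
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1),
           (1, -1), (1, 0), (1, 1), (0, 0)]

def frame_flow(h_t, h_t1, grid=4, words=1000):
    g = np.zeros((10, words))
    for r in range(grid):
        for c in range(grid):
            seen = np.zeros(words)                # any flow found at all?
            for k, (dr, dc) in enumerate(OFFSETS):
                nr, nc = r + dr, c + dc
                if 0 <= nr < grid and 0 <= nc < grid:
                    f = np.minimum(h_t[r, c], h_t1[nr, nc])
                    g[k] += f
                    seen = np.maximum(seen, f)
            # move 9: present at f_t but unmatched anywhere in the
            # neighborhood at f_{t+1}
            g[9] += np.where(seen == 0, h_t[r, c], 0)
    return g.reshape(-1)                          # 10 x 1000 = 10,000-D
```

The per-pair vectors are then pooled over the whole video and normalized per move, as formalized below.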

We start with the un-normalized visual word histograms for each region. Assume that $h_j(f_t)$ is the histogram of region $r_j$ at frame $f_t$, and similarly $h_j(f_{t+1})$ is the histogram of region $r_j$ at frame $f_{t+1}$. The maximum amount of flow that can be carried from a bin $w_i^t \in h_j(f_t)$ to the corresponding bin $w_i^{t+1} \in h_k(f_{t+1})$ of one of its neighbors is the minimum of $w_i^t$ and $w_i^{t+1}$. Since each visual word has 10 possible moves in the HFLOW representation (8 to neighboring regions, 1 if it stays in the same region, and 1 if it disappears), the flows may be defined as $g_{flow}^0, \ldots, g_{flow}^8$ and $g_{flow}^9$. The $i$-th bin of the $k$-th flow vector is $g_{flow}^k(i) = \min\{w_i^t, w_i^{t+1}\}$ for $k = 0, \ldots, 8$. We define $g_{flow}^9(i) = w_i^t$ if the word disappears completely from the neighborhood (i.e., $w_i^{t+1} = 0$ in all neighboring regions), and 0 otherwise. Note that when $g_{flow}^9(i)$ is non-zero, all of $g_{flow}^0(i), \ldots, g_{flow}^8(i)$ are zero; this distinguishes this case from one where the visual word is absent at both $t$ and $t+1$.

For a pair of adjacent frames we thus have ten flow vectors, each of size 1,000 (one bin per visual word). Our representation for a frame pair is the concatenation of these vectors, as in the PHOW case: $g_{flow} = [g_{flow}^0 \ldots g_{flow}^9]$. That is, we have a 10,000-dimensional vector. A video clip of $n$ frames yields $(n-1)$ such $g_{flow}$ vectors. We finally pool these $(n-1)$ vectors over the entire video (each of the 10,000 dimensions is pooled separately) to obtain the final HFLOW representation for the video clip. As in the PHOW case, we normalize by taking logarithms and dividing by the $L_\infty$ norm; here we divide by the maximum for each individual move:

$$vf_i^k = \frac{\log(g_{flow}^k(i))}{\max_i \log(g_{flow}^k(i))} \qquad (3)$$

The final $vf$ vector has 10,000 dimensions.

4. Datasets and Experimental Environment

Recent papers (for example [11]) have noted that techniques perform well on older action recognition datasets because of the small number of categories and the often uniform backgrounds. They have argued that the more challenging datasets are UCF50 [1] and HMDB51 [6]. Here, we briefly discuss the datasets before describing our experiments on them.

UCF50. This benchmark dataset consists of about 6600 unconstrained video clips harvested from the web. There are 50 actions, and each action has more than 100 videos. The videos are divided into groups, and a group may contain multiple clips from the same video. There are several published results on this dataset; however, different researchers seem to use different splits, which can make a big difference in the results. For example, if videos from the same group are split between training and test sets, results are likely to be very high. We focus on two published splits [11, 12]. The first [11] is leave-one-group-out (LOgO), where one group is treated as the test set and all other groups are treated as training videos. The second [12] involves five-fold (5fold) group-wise cross-validation, where videos from the same group are placed in the same fold so that parts of the same video do not appear in both the training and test sets. In each case the reported scores are averaged over all the folds of the cross-validation.

HMDB51. This dataset consists of about 6700 videos. There are 51 action classes, and each action has at least 70 videos for training and 30 for testing in the standard splits provided in the original paper introducing the dataset [6].
There are 10 action classes that overlap with the UCF50 set. This dataset is considered very challenging, since the best existing result is about 29% [5, 14].

Experimental Environment. We use a multi-class SVM classifier based on the one-against-all approach. We set the regularization parameter to 5,000 for UCF50 and 500 for HMDB51, with a linear kernel, while creating the models.

5. Results and Discussion

We evaluate PHOW and HFLOW individually. In addition, we perform late fusion on the prediction scores of PHOW and HFLOW by scaling the prediction scores to [0, 1] and then averaging them. In the tables, PHOW+HFLOW denotes the late fusion of PHOW and HFLOW.

We first show the results (Table 1) for both datasets for two different step sizes (5 pixels versus 10 pixels). The smaller step size produces a small improvement at the cost of more computation.

Dataset    System        5 pixels   10 pixels
HMDB51     PHOW          28.47%     27.38%
HMDB51     HFLOW         26.10%     25.29%
HMDB51     PHOW+HFLOW    30.19%     29.06%
UCF50      PHOW          71.25%     69.84%
UCF50      HFLOW         66.18%     66.07%
UCF50      PHOW+HFLOW    72.23%     71.44%

Table 1. Accuracy vs. step size on HMDB51 and UCF50.

Table 2 compares our results on the UCF50 dataset with results from recent papers. The best results have all been published within the last 6 months. Besides our technique, [11] and [14] also fuse multiple descriptors. We note that on the 5fold split, PHOW and HFLOW are each better than HOG/HOF and ActionBank. HFLOW is worse than MIP, but PHOW performance is close to MIP (less than a 1% difference). The fused PHOW+HFLOW is almost the same as MIP and better than the other techniques. For the LOgO split, the fused PHOW+HFLOW is again almost the same as MIP and about 2% and 5% lower than [14] and [11] respectively. Overall, the fused PHOW+HFLOW performs quite well.
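Before turning to the per-dataset comparisons in Tables 2 and 3 below, the classification and late-fusion steps just described might be sketched as follows, substituting scikit-learn's one-vs-rest linear SVM for the SVM-multiclass package (so the regularization parameter is not directly comparable, and the exact score scaling in the paper may differ in detail); all names here are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def decision_scores(X_train, y_train, X_test, C):
    # One-vs-rest linear SVM; returns an (n_videos, n_classes) array
    # of decision values for the test videos.
    return LinearSVC(C=C).fit(X_train, y_train).decision_function(X_test)

def fuse(s_phow, s_hflow):
    # Late fusion: scale each classifier's scores to [0, 1], average,
    # and take the highest-scoring class per video.
    scale = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-12)
    return ((scale(s_phow) + scale(s_hflow)) / 2.0).argmax(axis=1)
```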

System              5fold     LOgO
HOG/HOF [8]         47.9%     N/A
ActionBank [12]     57.9%     N/A
MIP [5]             68.51%    72.68%
GIST3D+STIP [14]    N/A       73.70%
CSIFT+MBH [11]      N/A       76.90%
PHOW                67.64%    71.25%
HFLOW               62.76%    66.18%
PHOW+HFLOW          69.24%    72.23%

Table 2. Recognition accuracies on the UCF50 set.

Table 3 shows the corresponding accuracies on the HMDB51 dataset. The results for all algorithms are substantially lower, indicating a much more difficult dataset. Even small improvements on this dataset are difficult to achieve (since the dataset was released, all techniques have been in the 20-29% range). PHOW by itself performs quite well, doing better than all but two techniques (GIST3D+STIP [14] and MIP [5]). HFLOW is slightly worse. PHOW+HFLOW does better than all the other approaches (about 1% better than the closest techniques).

System                 Accuracy
HOG/HOF [8]            20.44%
C2 [4]                 22.83%
ActionBank [12]        26.90%
CSIFT+HOG/HOF [11]     27.02%
GIST3D+STIP [14]       29.20%
MIP [5]                29.17%
PHOW                   28.47%
HFLOW                  26.10%
PHOW+HFLOW             30.19%

Table 3. Recognition accuracies on the HMDB51 set.

Figure 3 is a confusion matrix showing the average accuracy over 3 splits of the HMDB51 dataset. Entry (i, j) shows what percentage of the videos labeled as action i in the ground truth are labeled as action j by the PHOW+HFLOW technique. The accuracies are rounded to the nearest integer for display purposes, so the rows don't sum exactly to 100%.

It is instructive to look at some examples. Eat and drink are easily confused, which seems reasonable. Somersault is classified correctly in about 16% of the test videos, and the major misclassifications are cartwheel (~23%), flic-flac (~12%), and handstand (~9%). A flic-flac is a backwards somersault, a cartwheel could be thought of as a sideways somersault, and a handstand could be viewed as a component of a somersault. Draw sword is correctly classified in about 27% of the test videos, and the major misclassifications are sword exercise (~28%) and sword (~6%), which refers to sword fights. Golf is correctly classified in about 87% of the test videos. Talk is correctly detected in about 52% of the test cases and is incorrectly labeled as drink (~7%) and chew (~6%). This implies that if there are significant differences between actions, an action is more likely to be detected correctly. Actions which are much more similar, such as somersault versus flic-flac or eat versus drink, are much harder to distinguish. It appears that modeling the object parts more closely in these cases (front versus back for somersault versus flic-flac) or the object used (say, in eating versus drinking) is important. Modeling the motions is important, but modeling the objects may also lead to significant improvements.

So the good performance of PHOW may be due to its ability to model objects. It has been shown before that PHOW and similar descriptors paired with SVM classifiers work very well for object recognition [3]. Chatfield et al. [3] also point out that small changes (specific details) in how the different parts of a system are built can make big differences. Normalization is also important (taking logarithms and using the $L_\infty$ norm), since the videos vary quite a bit in length, and in our experience $L_1$ or $L_2$ normalization did not perform as well. Another reason for the usefulness of PHOW may be that the classifier learns to encode the actions implicitly. HFLOW exploits motion information and comes close to PHOW in terms of accuracy. As previous attempts have shown, it is important to encode motion information.
HFLOW works even though it only encodes motion coarsely (from one region to another) rather than through fine measures of displacement as in [7, 5].

6. Conclusion

We focus on two representations: the pyramid histogram of visual words (PHOW) and the histogram of the flow of visual words (HFLOW). We make use of densely sampled SIFT descriptors that are quantized to visual words for PHOW, and we keep track of the flow of those visual words between consecutive frames of a video clip to compute HFLOW. We also take logarithms of the histograms and use $L_\infty$ normalization to handle videos of varying lengths. We show that late fusion of classifiers trained on both features gives state-of-the-art performance: 72.23% on UCF50 and 30.19% accuracy on HMDB51 (about 1% better than any previous technique). These results suggest that modeling the objects as well as the actions may be a good way of approaching the action recognition problem.

7. Acknowledgment

This work was supported in part by the Center for Intelligent Information Retrieval. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

References

[1] UCF50 action recognition dataset.

Figure 3. Confusion matrix for the HMDB51 dataset. The numbers are percentages; rows may not sum to exactly 100 because of rounding. The scores are averaged over three splits.

[2] Large Scale Visual Recognition Challenge 2011 (ILSVRC2011). http://www.image-net.org/challenges/LSVRC/2011/
[3] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.
[4] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for action recognition. In ICCV, 2007.
[5] O. Kliper-Gross, Y. Gurovich, T. Hassner, and L. Wolf. Motion interchange patterns for action recognition in unconstrained videos. In ECCV, 2012.
[6] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
[7] K.-T. Lai, M.-S. Chen, C.-H. Hsieh, and M.-F. Lai. Orientation histogram of SIFT displacement for recognizing actions in broadcast videos. In EUVIP.
[8] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[9] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[10] R. Poppe. A survey on vision-based human action recognition. IVC, 28(6), 2010.
[11] K. K. Reddy and M. Shah. Recognizing 50 human action categories of web videos. MVA.
[12] S. Sadanand and J. J. Corso. Action bank: a high-level representation of activity in video. In CVPR, 2012.
[13] C. Schüldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In ICPR, 2004.
[14] B. Solmaz, A. Modiri, and M. Shah. Classifying web videos using a global video descriptor. MVA.
[15] X. Sun, M. Chen, and A. Hauptmann. Action recognition via local descriptors and holistic features. In CVPR Workshop on Human Communicative Behavior Analysis (CVPR4HB), pages 58-65, 2009.

[16] S. Todorovic. Human activities as stochastic Kronecker graphs. In ECCV, 2012.
[17] H. Wang, M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
[18] D. Weinland, R. Ronfard, and E. Boyer. A survey of vision-based methods for action representation, segmentation and recognition. CVIU, 115(2), 2011.
