Action Recognition with HOG-OF Features

Florian Baumann
Institut für Informationsverarbeitung, Leibniz Universität Hannover, {last name}@tnt.uni-hannover.de

Abstract. In this paper a simple and efficient framework for single human action recognition is proposed. In two parallel processing streams, motion information and static object appearance are gathered with a frame-by-frame learning approach. For each processing stream a Random Forest classifier is learned separately. The final decision is determined by combining both probability functions. The proposed recognition system is evaluated on the KTH dataset for single human action recognition with the original training/testing splits and a 5-fold cross validation. The results demonstrate state-of-the-art accuracies with an overall training time of 30 seconds on a standard workstation.

1 Introduction

Human activity recognition is divided into human actions, human-human interactions, human-object interactions and group activities [1]. In this paper, we address the problem of recognizing actions performed by a single person, such as boxing, clapping, waving, walking, running and jogging; see Figure 1. Aggarwal and Ryoo categorize the developed methods for human activity recognition into single-layered and hierarchical approaches, which are further divided into several subcategories [1]. Poppe suggests dividing the methods into global and local approaches [12]. Global approaches treat the image as a whole, leading to a representation that is sensitive to noise, occlusions and changes in viewpoint or background. Local approaches extract regional features, leading to a more accurate representation that is invariant to viewpoint and background changes.

Contribution. In this paper, two different feature types, HOG and optical flow features, are used to learn two separate Random Forest classifiers which are then combined for the final action recognition. HOGs are well suited to gathering static information, since they are well established for the task of human detection [4], while the optical flow is used to extract motion information. Both are local approaches, leading to an accurate representation that is invariant to illumination, contrast and background changes.

Related Work. Earlier work was done by Mauthner et al. [10]. The authors use HOG descriptors for appearance and motion detection; the feature vectors are represented by NMF coefficients and concatenated, and an SVM is used to learn the classifier. Similar to this work, Seo and Milanfar [16] also use a one-shot learning method. Recent works that combine HOG and HOF features for single human action recognition were introduced by Wang et al. [19] and Laptev et al. [7]; both use an SVM to recognize patterns. In contrast, we use HOG descriptors and optical flow trajectories to learn two independent Random Forest classifiers.

(This work has been partially funded by the ERC within the starting grant Dynamic MinVIP.)

Fig. 1. Example images from the KTH dataset [15]. The dataset contains six actions performed by 25 people under different conditions.

2 Approach

For human action recognition, the use of static information as well as motion information is necessary to obtain robust classification results. In order to gather both, static features are extracted from each frame and motion features are extracted between frames. Two Random Forest classifiers are learned separately to find patterns.

2.1 Features

Frame-by-frame learning. Static appearance information is extracted using histograms of oriented gradients (HOGs). HOGs are well established for human detection and largely invariant to illumination and contrast changes. First described in 2005 by Dalal and Triggs [4], HOGs became increasingly popular for many visual object detection tasks. The computation proceeds as follows: to obtain a gradient image, a [-1, 0, 1] filter is applied to the image without smoothing. For the computation of the HOG descriptor, the gradient image is divided into 16x16 pixel non-overlapping blocks of four 8x8 pixel cells [4]. Next, the gradient orientations of each cell are used for a weighted vote into 9 bins covering the range 0° to 180°. Finally, overlapping spatial blocks are contrast normalized and concatenated to build the final descriptor.
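As an illustration only, a minimal sketch of such a per-frame descriptor computation using the scikit-image HOG implementation as a stand-in for the original (C++) code; the cell, block and bin settings follow the parameters above, while the file name is a placeholder.

from skimage import io
from skimage.feature import hog

# Load one frame as a grayscale image (path is a placeholder).
frame = io.imread("frame_0001.png", as_gray=True)

# 9 orientation bins over 0-180 degrees, 8x8-pixel cells,
# 2x2 cells (16x16 pixels) per block, L2-Hys block normalization
# as in Dalal and Triggs [4].
descriptor = hog(frame,
                 orientations=9,
                 pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2),
                 block_norm='L2-Hys')

print(descriptor.shape)  # one HOG feature vector per frame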

Motion recognition. The optical flow is used to gather motion information. It is based on the assumption that points at the same location have constant intensities over a short duration [6]:

I(x, y, t) = I(x + δx, y + δy, t + δt),   (1)

where I(x, y, t) is the image patch, displaced in time δt by a distance (δx, δy) in the x-/y-direction. The proposed framework uses the Lucas-Kanade optical flow [8], which is based on Equation (1) but additionally assumes that the intensities are constant in a small neighborhood. To compute the optical flow descriptor, strong corners are detected in two consecutive frames at t and t+1 with the Shi-Tomasi corner detector [17]. Next, these feature points are tracked with a pyramidal Lucas-Kanade tracker [2].

2.2 Random Forest

A Random Forest, introduced by Leo Breiman [3], consists of CART-like decision trees that are independently constructed on bootstrap samples. Compared to other ensemble learning algorithms such as boosting [5], which builds a flat structure of decision stumps, a Random Forest uses an ensemble of full decision trees and is multi-class capable. A trained classifier consists of several trees 1 ≤ t ≤ T, in which the class probabilities, estimated by majority voting, are used to calculate the sample's label y(x) with respect to a feature vector x:

y(x) = \operatorname{argmax}_c \left( \frac{1}{T} \sum_{t=1}^{T} I_{h_t(x)=c} \right).   (2)

The decision function h_t(x) provides the classification of one tree to a class c with the indicator function

I_{h_t(x)=c} = \begin{cases} 1, & h_t(x) = c, \\ 0, & \text{otherwise.} \end{cases}   (3)

Classification. A sample is classified by passing it down each tree until a leaf node is reached. A classification result is assigned to each leaf node, and the final decision is determined by taking the class with the most votes, see Equation (2).

2.3 Combination

For each feature type a Random Forest is learned separately to yield independent classifiers. The HOGs are used to learn a frame-by-frame classifier, so that one feature consists of the histogram obtained from a single frame. A majority vote over all frames leads to the sample's label, while the class probabilities are averaged. Optical flow trajectories are calculated between consecutive frames, and a feature is constructed by concatenating the trajectories of all frames. Figure 2 shows an overview of the implemented framework. The final decision is determined by combining the class probabilities Pr(A_i) obtained by the HOG classifier and Pr(B_i) obtained by the optical flow classifier with the product law, assuming that the events A_i and B_i are independent:

Pr(A_i ∩ B_i) = Pr(A_i) · Pr(B_i).
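To make the trajectory extraction of Section 2.1 concrete, the following is a small sketch using OpenCV's Shi-Tomasi detector and pyramidal Lucas-Kanade tracker; the corner-quality, window-size and pyramid-depth parameters are illustrative assumptions, not values reported in the paper.

import cv2
import numpy as np

def flow_trajectory(prev_gray, next_gray, max_corners=100):
    # Detect strong corners in frame t with the Shi-Tomasi detector [17].
    corners = cv2.goodFeaturesToTrack(prev_gray, max_corners, 0.01, 7)
    if corners is None:
        return np.zeros((0, 2), dtype=np.float32)
    # Track the corners into frame t+1 with the pyramidal Lucas-Kanade tracker [2].
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, corners, None, winSize=(15, 15), maxLevel=3)
    ok = status.flatten() == 1
    # Displacement vectors of the successfully tracked points between the two frames.
    return (next_pts[ok] - corners[ok]).reshape(-1, 2)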
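Likewise, the two-stream combination of Section 2.3 could be sketched as follows, with scikit-learn Random Forests standing in for the original CART ensembles; the function and variable names are hypothetical.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_stream(features, labels, n_trees=100):
    # One independent Random Forest per processing stream (HOG frames or OF trajectories).
    return RandomForestClassifier(n_estimators=n_trees).fit(features, labels)

def classify_video(hog_forest, of_forest, hog_frames, of_vector):
    # Frame-by-frame HOG stream: average the per-frame class probabilities, Pr(A_i).
    pr_a = hog_forest.predict_proba(hog_frames).mean(axis=0)
    # Optical flow stream: one concatenated trajectory vector per video, Pr(B_i).
    pr_b = of_forest.predict_proba(of_vector.reshape(1, -1))[0]
    # Product law under the independence assumption: Pr(A_i ∩ B_i) = Pr(A_i) · Pr(B_i).
    return hog_forest.classes_[np.argmax(pr_a * pr_b)]

Both forests must be trained on the same label set so that their class orderings agree before the probabilities are multiplied.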

Fig. 2. Overview of the recognition system. In two parallel units, static features and motion features are computed. For each feature type a Random Forest classifier is learned separately. The final decision is determined by combining both classifiers using the product law.

3 Experimental Results

The proposed method is applied to the task of human action recognition. The goal is to recognize actions in several environments, under different illumination conditions and performed by different subjects. For the evaluation a well-established, publicly available dataset is used and the results are compared with several state-of-the-art methods.

3.1 KTH Action Dataset [15]

The KTH dataset is a well-established, publicly available dataset for single human action recognition, consisting of 600 video files from 25 subjects performing six actions (walking, jogging, running, boxing, waving, clapping). Similar to [11], a fixed-position bounding box with a temporal window of 32 frames is selected, based on annotations by Lui [9]. Presumably, a smaller number of frames would be sufficient [14]. Furthermore, both the original training/testing splits from [15] and a 5-fold cross validation strategy are used. Jogging, running, walking, waving and clapping are learned almost perfectly, but boxing and clapping/waving are confused. Table 1 compares the proposed framework with several state-of-the-art methods. Figure 3(a) shows the confusion matrix for the 5-fold cross validation; the method achieves state-of-the-art results. Figure 3(b) shows the results for the original training/testing splits, where the proposed framework achieves competitive results. Presumably due to the smaller training set, these results are worse than the 5-fold cross validation results. The overall training time is about 30 seconds on a standard notebook with a single-threaded C++ implementation.
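For illustration, a minimal sketch of this preprocessing step, cropping a fixed-position bounding box over a 32-frame window; the box coordinates, start frame and video path are placeholders and do not reproduce the annotations of [9].

import cv2

def extract_clip(video_path, box, start_frame=0, length=32):
    # Crop a fixed-position bounding box (x, y, w, h) over a temporal window of `length` frames.
    x, y, w, h = box
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)
    frames = []
    while len(frames) < length:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frames.append(gray[y:y + h, x:x + w])
    cap.release()
    return frames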

Method                        Validation       Accuracy (%)
Schindler and Van Gool [14]   5-fold           87.98
Zhang et al. [20]             5-fold           94.60
Proposed method               5-fold           96.44
Laptev et al. [15]            Original split   91.80
Zhang et al. [20]             Original split   94.00
Wang et al. [18]              Original split   94.20
Proposed method               Original split   94.31
O'Hara and Draper [11]        Original split   97.90
Sadanand and Corso [13]       Original split   98.20

Table 1. Accuracies (%) of the proposed framework in comparison to state-of-the-art methods on the KTH dataset.

(a) 5-fold cross validation
      box   clap  wave  jog   run   walk
box   0.85  0.05  0.10  0     0     0
clap  0     1     0     0     0     0
wave  0     0     1     0     0     0
jog   0     0     0     1     0     0
run   0     0     0     0     1     0
walk  0     0     0     0     0     1

(b) Original splits
      box   clap  wave  jog   run   walk
box   1     0     0     0     0     0
clap  0.11  0.80  0.09  0     0     0
wave  0     0.12  0.88  0     0     0
jog   0     0     0     1     0     0
run   0     0     0     0     0.98  0.02
walk  0     0     0     0     0     1

Fig. 3. Confusion matrices for the KTH dataset. (a) 5-fold cross validation, (b) original splits.

4 Conclusion

In this paper a simple and efficient framework for single human action recognition is proposed. Optical flow features are used to gather motion information between frames, while static object information is extracted using histograms of oriented gradients. With a frame-by-frame learning approach, two Random Forest classifiers are built separately, and the final decision is determined by combining both class probabilities. The proposed framework is evaluated using two validation strategies on the well-known, publicly available KTH dataset for single human action recognition. The results demonstrate state-of-the-art accuracies with an overall training time of 30 seconds on a standard workstation.

References

1. Aggarwal, J., Ryoo, M.: Human activity analysis: A review. ACM Computing Surveys 43(3), 16:1–16:43 (Apr 2011)
2. Bouguet, J.Y.: Pyramidal implementation of the Lucas-Kanade feature tracker. Intel Corporation (2000)
3. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
4. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. vol. 1, pp. 886–893 (2005)
5. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: Machine Learning, Proceedings of the Thirteenth International Conference on. pp. 148–156 (1996)
6. Horn, B.K., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17 (1981)
7. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on (2008)
8. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence (1981)
9. Lui, Y.M., Beveridge, J., Kirby, M.: Action classification on product manifolds. In: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. pp. 833–839 (2010)
10. Mauthner, T., Roth, P.M., Bischof, H.: Instant action recognition. In: Proceedings of the 16th Scandinavian Conference on Image Analysis (SCIA). pp. 1–10. Springer-Verlag, Berlin, Heidelberg (2009)
11. O'Hara, S., Draper, B.: Scalable action recognition with a subspace forest. In: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. pp. 1210–1217 (2012)
12. Poppe, R.: A survey on vision-based human action recognition. Image and Vision Computing 28(6), 976–990 (2010)
13. Sadanand, S., Corso, J.J.: Action bank: A high-level representation of activity in video. In: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on (2012)
14. Schindler, K., Van Gool, L.: Action snippets: How many frames does human action recognition require? In: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. pp. 1–8 (2008)
15. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Pattern Recognition (ICPR), Proceedings of the 17th International Conference on. vol. 3, pp. 32–36 (2004)
16. Seo, H.J., Milanfar, P.: Action recognition from one example. Pattern Analysis and Machine Intelligence, IEEE Transactions on 33(5), 867–882 (May 2011)
17. Shi, J., Tomasi, C.: Good features to track. In: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. pp. 593–600 (1994)
18. Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. pp. 3169–3176 (2011)
19. Wang, H., Ullah, M.M., Kläser, A., Laptev, I., Schmid, C.: Evaluation of local spatio-temporal features for action recognition. In: British Machine Vision Conference (BMVC) (2009)
20. Zhang, Y., Liu, X., Chang, M.C., Ge, W., Chen, T.: Spatio-temporal phrases for activity recognition. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) Computer Vision (ECCV). vol. 7574, pp. 707–721 (2012)