Human Action Recognition in Videos Using Hybrid Motion Features


Si Liu 1,2, Jing Liu 1, Tianzhu Zhang 1, and Hanqing Lu 1

1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
2 China-Singapore Institute of Digital Media, Singapore

Abstract. In this paper, we present hybrid motion features to promote action recognition in videos. The features are composed of two complementary components drawn from different views of motion information. On one hand, the period feature is extracted to capture global motion in the time domain. On the other hand, the enhanced histograms of motion words (EHOM) are proposed to describe local motion information. Each word is represented by the optical flow of a frame, the correlations between words are encoded into the transition matrix of a Markov process, and its stationary distribution is extracted as the final EHOM. Compared to the traditional Bags of Words representation, EHOM preserves not only the relationships between words but also, to some extent, the temporal information in videos. We show that by integrating local and global features, we obtain improved recognition rates on a variety of standard datasets.

Keywords: Action recognition, Period, EHOM, Optical flow, Bag of words, Markov process.

1 Introduction

With the wide spread of digital cameras for public visual surveillance, digital multimedia processing has received increasing attention during the past decade. Human action recognition is becoming one of the most important topics in computer vision, with applications in areas such as surveillance, video retrieval, and human-computer interaction.

Successful extraction of good features from videos is crucial to action recognition. Ke et al. [1] extend 2D box features to 3D spatio-temporal volumetric features. More recently, Sun et al. [2] propose to model spatio-temporal context information in a hierarchical way. Among all the proposed features, there is a large family that directly describes motion. For example, Bobick and Davis [3] develop the temporal template, which captures both motion and shape. Laptev [4] extracts motion-based space-time features; this representation views human actions as motion patterns. Zhang et al. [5] propose Motion Context (MC), which captures the distribution of motion words and thus summarizes the local motion information in a rich 3D MC descriptor. These motion-based approaches have been shown to be successful for action recognition.

Acknowledging the discriminative power of motion features, we propose to combine the period feature and enhanced histograms of motion words (EHOM) to describe motion in a video. Given the large variation in realistic videos, our features are easier to extract than 3D volumes, trajectories, or spatio-temporal interest points.

Period Features: Periodic motion occurs often in human actions; for example, running and walking can be seen as periodic actions in the leg region. A variety of methods therefore use period features for action recognition. Cutler and Davis [6] compute an object's self-similarity as it evolves in time; for periodic motion the self-similarity measure is also periodic, and they apply time-frequency analysis to detect and characterize the periodic motion. Liu et al. [7] also classify periodic motions.

Optical Flow Features: Efros et al. [8] recognize the actions of small-scale figures using features derived from optical flow measurements in a spatio-temporal volume for each stabilized human figure. Fathi and Mori [9] construct mid-level motion features from low-level optical flow information. Ali and Shah [10] propose a set of kinematic features derived from the optical flow. All of these achieve good results.

Hybrid Features: We believe that period and optical flow are complementary for action recognition, for two main reasons. First, optical flow only captures the motion between two adjacent frames and is thus inherently local, while the period feature captures global motion in the time domain. Suppose, for example, we want to differentiate walking from jogging: the two produce quite similar optical flow, so they are difficult to distinguish from optical flow features alone, yet the period feature separates them easily because a jogger's legs move faster. Second, period information is not evident in several actions, such as bending; however, the optical flow of bending, with its forward and rising components, is quite discriminative. To exploit this synergy, we use hybrid features consisting of both period features (capturing global motion) and optical flow features (capturing local motion) to build an effective recognition framework.

2 Overview of Our Recognition System

The main components of the system are illustrated in Fig. 1. We first produce a figure-centric spatio-temporal volume (see Fig. 1(a)) for each person; it can be obtained by running any detection/tracking algorithm over the input sequence and constructing a fixed-size window around the person. Afterwards, we divide every frame of the spatio-temporal volume into m × n blocks to make the proposed algorithm robust to noise and efficient to compute. By doing this, we also implicitly maintain spatial information in the frame when constructing features.
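As a concrete illustration of this decomposition, the following minimal sketch (in Python/NumPy; the array layout and the grid sizes m, n are our assumptions, since the paper does not fix them) splits a figure-centric volume into the m × n grid of cuboids used below:

```python
import numpy as np

def volume_to_cuboids(volume, m, n):
    """Split a figure-centric volume of shape (T, H, W) into an m x n grid of
    spatio-temporal cuboids, one per block location, stacked over all frames."""
    T, H, W = volume.shape
    bh, bw = H // m, W // n          # block height/width (any remainder is cropped)
    cuboids = np.empty((m, n, T, bh, bw), dtype=volume.dtype)
    for i in range(m):
        for j in range(n):
            cuboids[i, j] = volume[:, i*bh:(i+1)*bh, j*bw:(j+1)*bw]
    return cuboids                   # cuboids[i, j] is the (i, j) block over time
```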

Fig. 1. The framework of our approach: the input video is converted into a figure-centric spatio-temporal cuboid (a), from which the video-based period feature and the frame-based optical flow / video-based EHOM (b, c) are extracted and combined into hybrid motion features for an SVM classifier (d).

As a result, we get m × n smaller spatio-temporal cuboids, each consisting of all the blocks at the corresponding location in every frame (Fig. 1(b)). Sec. 3 addresses quasi-period extraction from each cuboid to describe the global motion in the time domain; the features of all the cuboids are concatenated to form the period feature of the video. Sec. 4 introduces the EHOM feature extraction: each frame's optical flow is first assigned a label by the k-means clustering algorithm, and based on these labels a Markov process is used to encode the dynamic information (Fig. 1(c)). The hybrid features are then constructed and fed into the subsequent multi-class SVM classifier (Fig. 1(d)). The experimental results are reported in Sec. 5, and conclusions are given in Sec. 6.

3 Period Feature Extraction

Based on the spatio-temporal cuboids obtained by dividing the original video, our frequency extraction approach is appearance-based, similar to [11]. Fig. 2 shows the block diagram of the module. First, we use probabilistic PCA (ppca) [12] to detect the maximum spatially coherent changes over time in the object's appearance. Spatially correlated input data are grouped together; unlike pixel-wise approaches, ppca treats these pixels as one physical entity, which makes the method robust to noise. The final output is a combination of two indicators: the estimated frequency f_est and the degree of periodicity per_est.

Fig. 2. Block diagram of the period extraction module: the figure-centric cuboid passes through ppca and frequency analysis to yield f_est and per_est.

Next, we describe the ppca phase and the frequency analysis phase in turn.

ppca for Robust Periodicity Detection: Let X_{D×N} = [x_1 x_2 ... x_N] represent the input video, with D the number of pixels in one frame and N the number of image frames; the rows of an aligned image frame are concatenated to form the column x_n. The optimal linear reconstruction \hat{X} of the data is given by

\hat{X} = W U + \bar{X},

where W_{D×Q} = [w_1 w_2 ... w_Q] is the set of orthonormal basis vectors, U_{Q×N} is the matrix of Q-dimensional vectors of unobserved variables (see Fig. 3(b)), and \bar{X} is the matrix of mean vectors \bar{x}. Each eigenvector's corresponding eigenvalue appears in Λ = diag(λ_1, λ_2, ..., λ_D) of the covariance matrix S of the input data X, computed by eigenvalue decomposition: S = V Λ V^T. The dimension Q is selected by setting the maximum percentage of variance to retain in the reconstructed matrix \hat{X}.

Frequency Analysis: The periodogram is a typical non-parametric frequency-analysis method that estimates the power spectrum from the Fourier transform of the autocovariance function. We choose the modified periodogram of the non-parametric class:

P_q(f) = \frac{1}{N} \left| \sum_{n=0}^{N-1} w(n)\, x(n)\, e^{-j 2\pi f n} \right|^2,

where N is the frame length, w(n) is the window used, and x(n) is the principal component vector u_q^T from the ppca (see Fig. 3(b)). Weighting each spectrum P_q(f) by the relative percentage \bar{λ}_q of retained variance and summing gives the combined spectrum

P(f) = \sum_{q=1}^{Q} \bar{λ}_q P_q(f), \quad \bar{λ}_q = \frac{λ_q}{\sum_{d=1}^{D} λ_d}.

To detect the dominant frequency component in the spectrum P(f) (see Fig. 3(c)), we first detect the peaks and the local minima that define the peaks' supports. Peaks with a frequency lower than f_s/N are discarded, with f_s the sampling rate of the video and N the frame length. Afterwards, starting from the lowest found frequency to the highest, each peak is checked against the others for harmonicity. We require that a fundamental frequency have a higher peak than its harmonics, and a tolerance of f_s/N is used in the matching process.
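To make this step concrete, here is a minimal sketch of the variance-weighted periodogram (a sketch under assumptions: PCA is done via SVD, the window is the Hanning window used later in the experiments, and all names are illustrative):

```python
import numpy as np

def weighted_spectrum(X, Q, fs):
    """X: (D, N) data matrix, one column per aligned frame.
    Returns frequencies and the variance-weighted modified periodogram P(f)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    # PCA via SVD; rows of Vt are the principal-component time series u_q.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    lam = s ** 2                       # proportional to covariance eigenvalues
    w_bar = lam[:Q] / lam.sum()        # relative retained variance per component
    N = X.shape[1]
    win = np.hanning(N)                # Hanning window, as in the experiments
    P = np.zeros(N // 2 + 1)
    for q in range(Q):
        spec = np.abs(np.fft.rfft(win * Vt[q])) ** 2 / N   # modified periodogram P_q(f)
        P += w_bar[q] * spec
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    return freqs, P
```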

Fig. 3. (a) The spatio-temporal cuboid of running, denoted in red. (b) The first two principal components of the cuboid. (c) The weighted spectrum of running; the peaks are denoted in red and their supports in green.

We select the group k with the highest total energy to represent the dominant frequency component in the data, where the total energy is the sum of the areas between the left and right supports E(·) of the fundamental frequency peak f_k^0 and its harmonics f_k^i:

f_{est} = \arg\max_{f_k^0} \left\{ E(f_k^0) + \sum_i E(f_k^i) \right\}.   (1)

The estimated frequency f_est in Fig. 3(c) is 120 mHz, which means that the motion repeats itself every 8.33 frames.

Note that whether or not the data are periodic, as long as there are some minor peaks in the spectrum P(f) the above method will still produce a frequency estimate. We therefore compare the energy of all peaks found in P(f) with the total energy to separate the two cases:

per_{est} = \frac{\sum_{k=1}^{K} E_Δ(f_k)}{\sum_f P(f)},   (2)

where K is the number of peaks detected and E_Δ(f_k) is the area of the triangle formed by the peak and its left and right supports. Note that the peak supports have zero energy for the spectrum of a periodic signal; by using only the triangle areas in the numerator of eq. (2), we assign a lower per_est value to quasi-periodic signals. The obtained per_est and f_est are then concatenated to generate the period component of the hybrid feature.
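A sketch of how eqs. (1) and (2) might be realized (peak and support detection is simplified here via scipy.signal.find_peaks and immediate neighbors; the paper's exact support and harmonic-grouping rules are richer):

```python
import numpy as np
from scipy.signal import find_peaks

def period_feature(freqs, P, fs, N, per_thresh=0.4):
    peaks, _ = find_peaks(P)
    peaks = peaks[freqs[peaks] >= fs / N]        # discard peaks below f_s / N
    if len(peaks) == 0:
        return 0.0, 0.0
    def tri_area(p):                             # crude E_delta: neighbors as supports
        left = max(p - 1, 0)
        right = min(p + 1, len(P) - 1)
        return 0.5 * (freqs[right] - freqs[left]) * P[p]
    energies = np.array([tri_area(p) for p in peaks])
    tol = fs / N                                 # harmonic-matching tolerance
    best_f, best_E = 0.0, -np.inf
    for p in peaks:                              # try each peak as the fundamental
        f0 = freqs[p]
        E = sum(e for q, e in zip(peaks, energies)
                if abs(freqs[q] / f0 - round(freqs[q] / f0)) * f0 < tol)
        if E > best_E:
            best_f, best_E = f0, E
    per_est = energies.sum() / P.sum()           # eq. (2): peak energy vs. total energy
    f_est = best_f if per_est >= per_thresh else 0.0
    return f_est, per_est
```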

Fig. 4. Block diagram of the EHOM extraction module: optical flow extraction, visual word generation, and a Markov process yield the EHOM.

4 Enhanced Histograms of Motion Words Extraction

As motion frequency is a global and thus coarse description of motion, we adopt a local and finer motion descriptor, optical flow, as a complement. Fig. 4 shows the block diagram of the module. First, we extract the optical flow of every frame. Then we generate a codebook by clustering all optical flow features in the training dataset. One could then directly compute the histogram of word occurrences over the entire video sequence, but doing so discards the time-domain information. For action recognition, however, the dynamic properties of object components are essential, e.g. for the action of standing up or an airplane taking off. That is why we go one step further and combine an optical-flow-based Bags of Words representation with a Markov process [13] to get EHOM. The result is independent of the length of the video and simultaneously maintains both the dynamic information and the correlations between words. To the best of our knowledge, we are the first to consider the relationships between motion words in action recognition.

The Lucas-Kanade algorithm [14] is employed to compute the optical flow for each frame. The optical flow vector field F is then split into the horizontal and vertical components of the flow, F_x and F_y. These two non-negative channels are then blurred with a Gaussian and normalized; they serve as our optical flow motion features for each frame. Blurring the optical flow reduces the influence of noise and of small spatial shifts in the figure-centric volume. For each frame, the optical flow features of all blocks are concatenated to form a longer vector.

Next, we represent a video sequence as Bags of Words, with a single word per frame: a word corresponds to a frame, and a document corresponds to a video sequence. Specifically, given the optical flow vector of every frame in the video, we construct a visual vocabulary with the k-means algorithm and then assign each frame to the closest vocabulary word (using Euclidean distance). In Fig. 5(a), different colors mean the corresponding frames are assigned to different visual words.

As mentioned, we go one step further than Bags of Words by modeling the relationships between motion words with a Markov process. Before going into details, we present some basic definitions. A Markov chain [15] is a sequence of random observed variables with the Markov property; it is a powerful tool for modeling the dynamic properties of a system. The Markov stationary distribution, associated with an ergodic Markov chain, offers a compact and effective representation of a dynamic system.
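A condensed sketch of the per-frame motion-word pipeline (several substitutions are assumptions: OpenCV's dense Farneback flow stands in for Lucas-Kanade, the channels are made non-negative by half-wave rectification, and the blur width is illustrative):

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def frame_flow_feature(prev_gray, gray, m, n, sigma=1.5):
    """Blurred non-negative flow channels, averaged over an m x n block grid."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx, fy = flow[..., 0], flow[..., 1]
    # Half-wave rectify into non-negative channels, then Gaussian-blur each.
    chans = [np.maximum(fx, 0), np.maximum(-fx, 0),
             np.maximum(fy, 0), np.maximum(-fy, 0)]
    chans = [cv2.GaussianBlur(c, (0, 0), sigma) for c in chans]
    H, W = gray.shape
    bh, bw = H // m, W // n
    feat = np.array([c[i*bh:(i+1)*bh, j*bw:(j+1)*bw].mean()
                     for c in chans for i in range(m) for j in range(n)])
    return feat / (np.linalg.norm(feat) + 1e-8)   # normalize, as in the text

# Codebook over all training frames, then one motion word per frame:
# kmeans = KMeans(n_clusters=100).fit(train_frame_feats)  # vocabulary size 100
# word_labels = kmeans.predict(video_frame_feats)
```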

Fig. 5. Construction of EHOM: (a) frames assigned to different visual words; (b) visual word transition diagram; (c) visual word occurrence matrix; (d) Markov stationary features.

Theorem 4.1. Any ergodic finite-state Markov chain is associated with a unique stationary distribution (row) vector π such that πP = π.

Theorem 4.2. 1) The limit A = lim_{n→∞} A_n exists for all ergodic Markov chains, where

A_n = \frac{1}{n+1} (I + P + \cdots + P^n).   (3)

2) Each row of A is the unique stationary distribution vector π.

Hence, when the ergodicity condition is satisfied, we can approximate A by A_n. To further reduce the approximation error for a finite n, π is calculated as the column average of A_n.

For consecutive frames in a fixed-length time window with their codebook labels, we translate the sequential relations between these labels into a directed graph, similar to the state diagram of a Markov chain (Fig. 5(b)). Here we get K vertices corresponding to the K codewords, and weighted edges corresponding to the occurrences of each transition between words. We then establish an equivalent matrix representation of the graph (Fig. 5(c)) and perform row normalization on the matrix to arrive at a valid transition matrix P of a Markov chain. Once we obtain the transition matrix P and verify that it is associated with an ergodic Markov chain, we use eq. (3) to compute π (Fig. 5(d)).
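The EHOM itself reduces to a few lines; the following sketch follows eq. (3) directly (the additive smoothing that keeps every row of P normalizable is our assumption, standing in for the paper's unstated ergodicity fix):

```python
import numpy as np

def ehom(labels, K, n=50):
    """Markov stationary distribution of the word-transition chain.
    labels: sequence of per-frame codebook indices in [0, K)."""
    C = np.zeros((K, K))
    for a, b in zip(labels[:-1], labels[1:]):    # count word transitions (Fig. 5(c))
        C[a, b] += 1
    C += 1e-6                                    # smoothing so every row normalizes (assumption)
    P = C / C.sum(axis=1, keepdims=True)         # row-normalize -> transition matrix
    # A_n = (I + P + ... + P^n) / (n + 1), eq. (3)
    A, Pk = np.eye(K), np.eye(K)
    for _ in range(n):
        Pk = Pk @ P
        A += Pk
    A /= (n + 1)
    return A.mean(axis=0)                        # pi = column average of A_n
```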

5 Experiments

Here we briefly introduce the parameters used in our experiments. In the period feature extraction phase, the ppca retained variance is 90% and a Hanning window is used for periodogram smoothing. If per_est is less than 0.4, i.e. the signal is not periodic, we set the corresponding f_est to zero. In the EHOM extraction phase, the vocabulary size is set to 100 and we use n = 50 to estimate A by A_n; the length of the time window is 20 frames. For classification, we use a support vector machine (SVM) with an RBF kernel, and we apply PCA to reduce the dimension of the period feature to match that of the EHOM (a sketch of this classification stage follows at the end of Sec. 5.1).

To prove the effectiveness of our hybrid feature, we test the algorithm on two human action datasets: the KTH human motion dataset [16] and the Weizmann human action dataset [17]. For each dataset, we perform leave-one-out cross-validation: in each run, we leave the videos of one person out as test data and use the rest of the videos for training.

5.1 Evaluating Different Components in the Hybrid Feature

We first show that both components of our proposed hybrid feature are quite discriminative. The period features of the 6 activities in the KTH database are illustrated in Fig. 6. The bottom three actions have different frequencies in the leg regions (denoted by red ellipses); specifically, f_running > f_jogging > f_walking, where f stands for the frequency of the leg region, which conforms to intuition. Fig. 7 compares our proposed EHOM with the traditional BoW representation and shows that better results are achieved by considering the correlations between motion words.

Fig. 6. The frequencies of different actions in the KTH database.

The next experiment demonstrates the benefit of combining the period and EHOM features. Fig. 8 shows the classification results for the period feature, the EHOM feature, and the hybrid of the two; the average accuracies are 80.49%, 89.38% and 93.47%, respectively. The EHOM component thus achieves a better result than the period component, and the hybrid feature is more discriminative than either component alone.
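A sketch of the classification stage under the settings above (the scikit-learn pipeline and all variable names are assumptions; the SVM hyperparameters are left at their defaults):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def classify(period_feats, ehom_feats, labels, train_idx, test_idx):
    """Hybrid feature = PCA-reduced period feature concatenated with EHOM,
    classified by an RBF-kernel SVM (one train/test split of the
    leave-one-person-out protocol)."""
    d = ehom_feats.shape[1]            # reduce period feature to the EHOM dimension
    pca = PCA(n_components=d).fit(period_feats[train_idx])
    X = np.hstack([pca.transform(period_feats), ehom_feats])
    clf = SVC(kernel='rbf').fit(X[train_idx], labels[train_idx])
    return clf.predict(X[test_idx])
```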

Fig. 7. Comparison between BoW and EHOM on the KTH dataset:

    Method    Mean Accuracy
    BoW       87.25%
    EHOM      89.38%

Fig. 8. Comparison of the mean accuracy of different features on the KTH dataset:

    Method            Mean Accuracy
    period feature    80.49%
    EHOM feature      89.38%
    hybrid feature    93.47%

5.2 Comparison with the State of the Art

Experiments on the Weizmann Dataset: The Weizmann human action dataset contains 93 low-resolution video sequences showing 9 different people, each performing 10 different actions. We track and stabilize the figures using the background subtraction masks that come with the dataset. Fig. 9(a) shows some sample frames, and the confusion matrix of our results is shown in Fig. 9(b). Our method achieves 100% accuracy.

Fig. 9. Results on the Weizmann dataset: (a) sample frames; (b) confusion matrix over the ten actions (bend, jack, jump, pjump, run, side, skip, walk, wave1, wave2) using 100 codewords (overall accuracy = 100%).

Experiments on the KTH Dataset: The KTH human motion dataset contains six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping). Each action is performed several times by 25 subjects in four different conditions: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors.

Fig. 10. Comparison of the mean accuracy of different methods on the KTH dataset:

    Method              Mean Accuracy
    Ali and Shah [10]   87.70%
    Fathi and Mori [9]  90.50%
    Laptev et al. [18]  91.80%
    Our method          93.47%

Fig. 11. Results on the KTH dataset: (a) sample frames; (b) confusion matrix over the six actions (boxing, handclapping, handwaving, jogging, running, walking) using 100 codewords (overall accuracy = 93.47%).

Representative frames of this dataset are shown in Fig. 11(a). Note that a person may move in different directions within a video of the KTH database [16], so we divide each video into several segments according to the person's moving direction. Since most previously published results assign a single label to each video, we also report per-video classification on the KTH dataset; the per-video label is acquired by majority voting. The confusion matrix on the KTH dataset is shown in Fig. 11(b). Most of the confusion is among the last three actions: running, jogging and walking. We compare our results with the current state of the art in Fig. 10; our results outperform the other methods. One reason for the improvement is the complementarity of the period and EHOM components of our feature; another is that combining the Bags of Words representation with a Markov process preserves the correlations between words and, to some extent, the temporal information.

6 Conclusion

In this paper, we propose an efficient feature for human action recognition: a hybrid feature composed of two complementary ingredients. The period component captures global motion in the time domain, while the EHOM component, as an additional source of evidence, describes local motion information. When generating EHOM, we integrate the Bags of Words representation with a Markov process to relax the requirement on the duration of videos and to maintain the dynamic information. Experiments confirm the complementary roles of the two components. The proposed algorithm is simple to implement, and experiments have demonstrated its improved performance compared with state-of-the-art algorithms on the

task of action recognition. Since we have already achieved good results on benchmark databases under controlled settings, we plan to test our algorithm in more complicated settings, such as movies, in the future.

References

1. Ke, Y., Sukthankar, R., Hebert, M.: Efficient visual event detection using volumetric features. In: ICCV (2005)
2. Sun, J., Wu, X., Yan, S., Chua, T., Cheong, L., Li, J.: Hierarchical spatio-temporal context modeling for action recognition. In: CVPR (2009)
3. Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2001)
4. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV (2003)
5. Zhang, Z., Hu, Y., Chan, S., Chia, L.-T.: Motion context: A new representation for human action recognition. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305. Springer, Heidelberg (2008)
6. Cutler, R., Davis, L.S.: Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000)
7. Liu, Y., Collins, R., Tsin, Y.: Gait sequence analysis using frieze patterns. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351. Springer, Heidelberg (2002)
8. Efros, A., Berg, A., Mori, G., Malik, J.: Recognizing action at a distance. In: ICCV (2003)
9. Fathi, A., Mori, G.: Action recognition by learning mid-level motion features. In: CVPR (2008)
10. Ali, S., Shah, M.: Human action recognition in videos using kinematic features and multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2008)
11. Pogalin, E., Smeulders, A.W.M., Thean, A.H.C.: Visual quasi-periodicity. In: CVPR (2008)
12. Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61 (1999)
13. Li, J., Wu, W., Wang, T., Zhang, Y.: One step beyond histograms: Image representation using Markov stationary features. In: CVPR (2008)
14. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: DARPA Image Understanding Workshop (1981)
15. Breiman, L.: Probability. SIAM (1992)
16. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local SVM approach. In: ICPR (2004)
17. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. In: ICCV (2005)
18. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR (2008)
