Histograms of Oriented Optical Flow and Binet-Cauchy Kernels on Nonlinear Dynamical Systems for the Recognition of Human Actions


Rizwan Chaudhry, Avinash Ravichandran, Gregory Hager and René Vidal
Center for Imaging Science, Johns Hopkins University
3400 N Charles St, Baltimore, MD 21218

Abstract

System theoretic approaches to action recognition model the dynamics of a scene with linear dynamical systems (LDSs) and perform classification using metrics on the space of LDSs, e.g. Binet-Cauchy kernels. However, such approaches are only applicable to time series data living in a Euclidean space, e.g. joint trajectories extracted from motion capture data or feature point trajectories extracted from video. Much of the success of recent object recognition techniques relies on the use of more complex feature descriptors, such as SIFT descriptors or HOG descriptors, which are essentially histograms. Since histograms live in a non-Euclidean space, we can no longer model their temporal evolution with LDSs, nor can we classify them using a metric for LDSs. In this paper, we propose to represent each frame of a video using a histogram of oriented optical flow (HOOF) and to recognize human actions by classifying HOOF time series. For this purpose, we propose a generalization of the Binet-Cauchy kernels to nonlinear dynamical systems (NLDS) whose output lives in a non-Euclidean space, e.g. the space of histograms. This can be achieved by using kernels defined on the original non-Euclidean space, leading to a well-defined metric for NLDSs. We use these kernels for the classification of actions in video sequences using HOOF as the output of the NLDS. We evaluate our approach to recognition of human actions in several scenarios and achieve encouraging results.

1. Introduction

Analysis of human activities has always remained a topic of great interest in computer vision. It is seen as a stepping stone for applications such as automatic environment surveillance, assisted living and human-computer interaction. The surveys by Gavrila [14], Aggarwal et al. [1] and by Moeslund et al. [18], [19] provide a broad overview of over three hundred papers and numerous approaches for analyzing human motion in videos, including human motion capture, tracking, segmentation and recognition.

Related work. Recent work on activity recognition can be broadly classified into three types of approaches: local, global and system-theoretic. Local approaches use local spatiotemporal features to represent human activity in a video, e.g. [17, 10, 30]. Niebles et al. [20] presented an unsupervised method similar to the bag-of-words approach for learning the probability distributions of space-time interest points in human action videos. İkizler et al. [16] presented a method whereby limb motion model units are learnt from labeled motion capture data and used to detect more complex unseen motions in a test video using search queries specific to the limb motion in the desired activity. [15] and [31] represent human activities by 3-D space-time shapes; classification is performed by comparing geometric properties of these shapes against training data. However, an important limitation of the aforementioned approaches is that they do not incorporate global characteristics of the activity as a whole.

Global approaches use global features such as optical flow to represent the state of motion in the whole frame at a time instant. With a static background, one can represent the type of motion of the foreground object by computing features from the optical flow.
In [13], optical flow histograms were used to match the motion of a player in a soccer match to that of a subject in a control video. Tran et al. [27] present an optical flow and shape based approach that uses separate histograms for the horizontal and vertical components of the optical flow, as well as the silhouette of the person, as a motion descriptor. All these approaches, however, do not model the characteristic temporal dynamics of these features. Moreover, comparison is done either on a frame-by-frame basis or using other ad-hoc methods. Clearly, the natural way to compare human activities is to compare their temporal evolution as a whole.

System-theoretic approaches to recognition of human actions model feature variations with dynamical systems and hence specifically consider the dynamics of the activity. The recognition pipeline is composed of 1) finding features in every frame, 2) modeling the temporal evolution of these

features with a dynamical system, 3) using a similarity criterion, e.g. distances or kernels between dynamical systems, to train classifiers, and 4) using them on novel video sequences.

Figure 1. Optical flows and HOOF feature trajectories.

Bissacco et al. used joint-angle trajectories in [3], as well as joint trajectories from motion-capture data and features extracted from silhouettes in [4], to represent the action profiles. Ali et al. in [2] used joint trajectories to extract invariant features to model the non-linear dynamics of human activities. However, these approaches are mostly limited to local feature representations and, to our knowledge, there has been no work on modeling the dynamics of global features, e.g. optical flow variations.

Paper contributions. In this paper, we propose the Histogram of Oriented Optical Flow (HOOF) features to represent human activities. These novel features are independent of the scale of the moving person as well as the direction of motion. Extraction of HOOF features does not require any prior human segmentation or background subtraction. However, HOOF features are non-Euclidean, and thus the evolution of HOOF features creates a trajectory on a nonlinear manifold. Traditionally, Linear Dynamical Systems (LDSs) have been used to model feature time series that are Euclidean, e.g. joint angles, joint trajectories, pixel intensities, etc. Non-Euclidean data like histogram time series need to be modeled with Non-Linear Dynamical Systems (NLDS). Hence, similarity criteria designed for LDSs cannot be used to compare two histogram time series. In this paper, we extend the Binet-Cauchy kernels [29] to NLDS. This is done by replacing an infinite sum of output feature inner products in the kernel expression by a Mercer kernel [23] on the output space. We model the proposed HOOF features as outputs of NLDS and use the Binet-Cauchy kernels for NLDS to perform human activity recognition on the Weizmann database [15] with encouraging results.

Paper outline. The rest of this paper is organized as follows. Section 2 briefly reviews the LDS recognition pipeline for Euclidean time-series data. Section 3 proposes the Histogram of Oriented Optical Flow (HOOF) features, which are used to model the activity profile in each frame of a video; every activity video is thus represented as a non-Euclidean time series of HOOF features. Section 4 introduces NLDS and describes how NLDS parameters can be learnt using kernels defined on the underlying non-Euclidean space. Section 5 presents the Binet-Cauchy kernels for NLDSs, which define a similarity metric between two non-Euclidean time series. Section 6 gives experimental results for human activity recognition using the proposed metric and features. Finally, Section 7 gives concluding remarks and future directions.

2. Recognition with Linear Dynamical Systems

A LDS is represented by the tuple M = (μ, x_0, A, C, B, R) and evolves in time according to the equations

x_{t+1} = A x_t + B v_t,
y_t = μ + C x_t + w_t.    (1)

Here x_t ∈ R^n is the state of the LDS at time t; y_t ∈ R^p is the observed output or feature at time t; x_0 is the initial state of the system; and μ ∈ R^p is the mean of {y_t}_{t=1}^N, e.g. the mean joint-angle configuration. A ∈ R^{n×n} describes the dynamics of the state evolution, B ∈ R^{n×n_v} models the way in which input noise affects the state evolution, and C ∈ R^{p×n} transforms the state to an output or observation of the overall system. v_t ∈ R^{n_v} and w_t ∈ R^p are the system noise and the observation noise at time t, respectively.
We assume that the noise processes are zero-mean i.i.d. Gaussian, such that v_t ~ G(v_t; 0, I_{n_v}) and w_t ~ G(w_t; 0, R) with R ∈ R^{p×p}, where

G(z; μ_z, Σ) = (2π)^{-n/2} |Σ|^{-1/2} exp(-(1/2) ||z - μ_z||²_Σ)

is a multivariate Gaussian distribution on z, with ||z||²_Σ = z^T Σ^{-1} z. By this definition, B v_t ~ G(B v_t; 0, Q), where Q = B B^T ∈ R^{n×n}. We also assume that v_t and w_t are independent processes.
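As a concrete illustration of the generative model in equation (1), the following sketch simulates a short output sequence from a small LDS; the dimensions, parameter values, and the way A is kept stable are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_v, N = 4, 8, 4, 100               # state dim, output dim, noise dim, length

A = 0.9 * np.linalg.qr(rng.standard_normal((n, n)))[0]  # orthogonal matrix scaled to be stable
C = rng.standard_normal((p, n))           # observation map
B = 0.1 * np.eye(n, n_v)                  # input-noise coupling, Q = B B^T
R = 0.01 * np.eye(p)                      # output-noise covariance
mu = rng.standard_normal(p)               # output mean
x = rng.standard_normal(n)                # initial state x_0

Y = np.zeros((N, p))
for t in range(N):
    w = rng.multivariate_normal(np.zeros(p), R)
    Y[t] = mu + C @ x + w                         # y_t = mu + C x_t + w_t
    x = A @ x + B @ rng.standard_normal(n_v)      # x_{t+1} = A x_t + B v_t
```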

Given a set of T training videos, the first task is to learn the parameters M_i, i = 1, ..., T, from the feature trajectories of each video. There are several methods to learn these system parameters, e.g. [25], [28] and [11]. Once these parameters are identified for each of the videos, various metrics can be used to define the similarity between these LDSs. In particular, three major types of metrics are 1) geometric distances based on subspace angles between the observability subspaces of the LDSs [9], 2) algebraic metrics like the Binet-Cauchy kernels [29], and 3) information-theoretic metrics like the KL-divergence [6]. Given a metric, all pairwise similarities are evaluated on the training data and used for classification of novel sequences using methods such as k-Nearest Neighbors (k-NN) or Support Vector Machines (SVMs).

3. Histogram of Oriented Optical Flow (HOOF)

As we alluded to in the introduction, existing system-theoretic approaches to action recognition have been mostly applied to joint angles extracted from motion capture data. If one were to apply such approaches to video data, one would be faced with the challenging problem of accurately extracting and tracking the joints of a person in the presence of self-occlusions, changes of scale, pose, etc. Inspired by the recent success of histograms of features in the object recognition community, we posit that the natural feature to use in a motion sequence is optical flow. However, the raw optical flow data may be of no use, as the number of pixels in a person (hence the size of the descriptor) changes over time. Moreover, optical flow computations are very susceptible to background noise, scale changes, as well as directionality of movement.

To avoid these issues, one could use instead the distribution of optical flow. Indeed, when a person moves through a scene with a stationary background, it induces a very characteristic optical flow profile. Figure 1 shows some optical flow patterns for a sample walking sequence. However, notice that the observed optical flow profile could be different if the activity was performed at a larger scale, for example a zoomed-in walking person versus a far-away walking person: the magnitude of the optical flow vectors would be larger in the zoomed-in case. Similarly, if a person is running from the left to the right, the optical flow observed would be a reflection in the vertical axis of that observed if the person was running from the right to the left. We thus need a feature based on optical flow that represents the action profile at every time instant and that is invariant to the scale and directionality of motion.

To overcome these issues, in this paper we propose the Histogram of Oriented Optical Flow (HOOF), which is defined as follows. First, optical flow is computed at every frame of the window. Each flow vector is binned according to its primary angle from the horizontal axis and weighted according to its magnitude. Thus, all optical flow vectors v = [x, y]^T with direction θ = tan^{-1}(y/x) in the range

-π/2 + π (b-1)/B ≤ θ < -π/2 + π b/B    (2)

will contribute √(x² + y²) to the sum in bin b, 1 ≤ b ≤ B, out of a total of B bins. Finally, the histogram is normalized to sum up to 1.
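As a sketch of this construction, the following Python function bins a dense flow field into a HOOF histogram: each vector votes into one of B bins by its primary angle and contributes its magnitude, and the histogram is normalized to sum to 1. Computing the flow with cv2.calcOpticalFlowFarneback is one plausible choice; the paper only states that OpenCV was used.

```python
import numpy as np

def hoof(flow, B=30):
    """Histogram of Oriented Optical Flow for one frame.

    flow : (H, W, 2) array of optical flow (x, y) components, e.g. as
           returned by cv2.calcOpticalFlowFarneback on two gray frames.
    B    : number of bins (30 here is an illustrative choice).
    """
    vx, vy = flow[..., 0], flow[..., 1]
    mag = np.sqrt(vx ** 2 + vy ** 2)
    theta = np.arctan2(vy, vx)                    # angle in (-pi, pi]
    # Fold to the primary angle in [-pi/2, pi/2): theta and theta +/- pi
    # land in the same bin, as in equation (2).
    theta = np.where(theta >= np.pi / 2, theta - np.pi, theta)
    theta = np.where(theta < -np.pi / 2, theta + np.pi, theta)
    b = ((theta + np.pi / 2) // (np.pi / B)).astype(int)
    b = np.clip(b, 0, B - 1)                      # guard floating-point edge cases
    h = np.bincount(b.ravel(), weights=mag.ravel(), minlength=B)
    s = h.sum()
    return h / s if s > 0 else h                  # normalize to sum to 1
```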
Figure 2. Histogram formation with four bins, B = 4.

Figure 2 illustrates the procedure. Binning according to the primary angle, the smallest signed angle between the horizontal axis and the vector, allows the histogram representation to be independent of the (left or right) direction of motion. Normalization makes the histogram representation scale-invariant. We expect to observe the same histogram whether a person is moving from the left to the right or in the opposite direction, and whether a person is running far away in the scene or very near the camera. Since the contribution of each optical flow vector to its corresponding bin is proportional to its magnitude, small noisy optical flow measurements have little effect on the observed histogram. Assuming a stationary background, there is no optical flow in the background. Using the magnitude-based addition to each bin, we can simply compute the optical flow histogram on the whole frame rather than requiring a pre-computed segmentation of the moving person. The number of bins, B, is a parameter of choice. Generally we observe that with histogram time series of at least 30 bins per histogram, we are able to achieve good recognition results.

3.1. Kernels for comparing HOOF

HOOF features provide us with a normalized histogram h_t = [h_{t;1}, h_{t;2}, ..., h_{t;B}]^T at each time instant t. In order

to use such histograms for recognition purposes, we need to have a way of comparing two histograms. To that end, notice that histograms cannot be treated as simple vectors in a Euclidean space. Histograms are essentially probability mass functions, hence they must satisfy the constraints Σ_{i=1}^B h_{t;i} = 1 and h_{t;i} ≥ 0, i ∈ {1, ..., B}. At first sight, one may think that this space is still simple enough. However, the space of histograms H is actually a Riemannian manifold with a nontrivial structure.

The problem of equipping the space of probability density functions (PDFs) with a differential manifold structure and defining a Riemannian metric on it has been an active area of research in the field of information geometry. The work by Rao [21] was the first to introduce a Riemannian structure to this statistical manifold by introducing the Fisher-Rao metric. The Fisher-Rao metric, however, is extremely hard to work with due to the difficulty in computing geodesics on this space [26]. Even though the space H turns out to be difficult to work with, we know that it is not the only possible representation for PDFs. There are many different re-parameterizations of PDFs that are equivalent, including the cumulative distribution function, the log density function and the square-root density function. Each of these parameterizations leads to a different resulting manifold. Depending on the choice of representation, the resulting Riemannian structure can have varying degrees of complexity, and numerical techniques may be required to compute geodesics on the manifold.

For the sake of computational simplicity, in this paper we restrict our attention to similarity measures built by mapping the histogram h ∈ H to a high-dimensional (possibly infinite) Euclidean space F using the map Φ : H → F. Since F is a Euclidean space, all the natural notions of finding distances between two points can be employed for comparison. Most of the time, however, the map Φ cannot be found. Mercer kernels [23] have the special property of being positive definite kernels that induce an inner product in a higher-dimensional space under the map Φ. More specifically, for points lying on the non-linear manifold H, the Mercer kernel is given by k(h_1, h_2) = Φ(h_1)^T Φ(h_2), and hence a similarity measure in the high-dimensional space can be computed by simply evaluating the kernel function on the original representation, without knowing the mapping Φ. We now briefly describe some popular kernel measures used on the space of histograms.

The histogram h_t = [h_{t;1}, ..., h_{t;B}]^T can be reparameterized to the square-root representation for histograms, √h_t := [√h_{t;1}, ..., √h_{t;B}]^T, such that Σ_{i=1}^B (√h_{t;i})² = 1. This projects every histogram onto the unit hypersphere S^{B-1}. The Riemannian metric between two points R_1 and R_2 on the hypersphere is d(R_1, R_2) = cos^{-1}(R_1^T R_2). Thus a kernel between two histograms can be defined as an inner product on their square-root representations:

k_S(h_1, h_2) = Σ_{i=1}^B √(h_{1;i} h_{2;i}).    (3)

Note that this is precisely the kernel that can be achieved by using the RBF kernel, k(h_1, h_2) = exp(-d(h_1, h_2)), on the Bhattacharyya distance between the two histograms.

Minimum Difference of Pairwise Assignment (MDPA) [5] is similar to the Earth Mover's Distance (EMD) [22] and is a metric on the space of histograms that implicitly is a summation of distances between points on an underlying metric space from which the histograms were created.
For ordinal histograms, i.e., histograms created from linearly varying data (as opposed to modular data, e.g. the modular group Z_p = {0, 1, ..., p-1}), the MDPA distance is

d_MDPA(h_1, h_2) = Σ_{i=1}^B | Σ_{j=1}^i (h_{1;j} - h_{2;j}) |.    (4)

Another popular distance between two histograms is the χ² distance, which is defined as

d_χ²(h_1, h_2) = (1/2) Σ_{i=1}^B (h_{1;i} - h_{2;i})² / (h_{1;i} + h_{2;i}).    (5)

We can use the RBF kernel to create kernels as similarity measures from these distances. Finally, the Histogram Intersection kernel (HIST) [12] is another Mercer kernel [23] on the space of histograms and is defined for normalized histograms as

k_HIST(h_1, h_2) = Σ_{i=1}^B min(h_{1;i}, h_{2;i}).    (6)

The inner product of the square-root representations is by construction an inner product and hence is a Mercer kernel. Also, the χ² and HIST kernels are provably Mercer kernels [32], [12]. However, to the best of our knowledge, the positive definiteness of the MDPA kernel has not been established [32].
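To make the four measures concrete, here is a sketch implementing equations (3)-(6) for a pair of normalized histograms; the function names are ours, and the bandwidth used to turn the distances (4) and (5) into RBF kernels is an illustrative choice.

```python
import numpy as np

def k_geodesic(h1, h2):
    """Inner product of square-root representations, equation (3)."""
    return np.sum(np.sqrt(h1 * h2))

def d_mdpa(h1, h2):
    """Minimum Difference of Pairwise Assignment distance, equation (4)."""
    return np.sum(np.abs(np.cumsum(h1 - h2)))

def d_chi2(h1, h2):
    """Chi-square distance, equation (5); empty bins contribute nothing."""
    num, den = (h1 - h2) ** 2, h1 + h2
    return 0.5 * np.sum(np.divide(num, den, out=np.zeros_like(num), where=den > 0))

def k_hist(h1, h2):
    """Histogram intersection kernel, equation (6)."""
    return np.sum(np.minimum(h1, h2))

def rbf(dist, sigma=1.0):
    """Turn a distance into a kernel value, k = exp(-d); sigma is illustrative."""
    return np.exp(-dist / sigma)
```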

3.2. Kernels for comparing HOOF time series

Since HOOF features h_t = [h_{t;1}, h_{t;2}, ..., h_{t;B}]^T are defined at each frame of the video, our actual representation is a time series of these histograms, {h_t}_{t=1}^N. Our goal is to compare these time series in order to perform classification of actions. But rather than comparing these time series directly, we want to exploit the temporal evolution of these histograms in order to distinguish different actions. We posit that each action induces a time series of HOOF with specific dynamics, and that different actions induce different dynamics. Therefore, we propose to recognize actions by comparing the dynamics of HOOF time series.

There are two important technical challenges in developing a framework for the classification of HOOF time series. The first one is that, because each histogram h_t lives in a non-Euclidean space H, we cannot model its temporal evolution with LDSs. We address this issue in Section 4 by using the previously defined kernels on H to define a NLDS. The second challenge is how to compute a distance between HOOF time series. We address this issue in Section 5, where we extend Binet-Cauchy kernels to NLDSs.

4. Modeling HOOF Time Series with Non-Linear Dynamical Systems

Modeling Euclidean feature trajectories with LDSs has been very useful for dynamical system recognition. However, we need NLDSs to model non-Euclidean feature trajectories like histograms. Consider the Mercer kernel k(y_t, y'_t) = Φ(y_t)^T Φ(y'_t) on the non-Euclidean space, such that the implicit map Φ maps the original non-Euclidean space H to a high-dimensional (possibly infinite) Euclidean space F. Since k is a Mercer kernel, it represents an inner product in the higher-dimensional Euclidean space. Using the map Φ, we can therefore transform the non-Euclidean features {y_t}_{t=1}^N to the high-dimensional Euclidean space as {Φ(y_t)}_{t=1}^N and assume that the higher-dimensional trajectories follow a linear dynamical system

x_{t+1} = A x_t + B v_t,
Φ(y_t) = C x_t + w_t.    (7)

The main difference between (7) and (1) is that we do not necessarily know the embedding Φ, hence we cannot identify (A, C) as before. Moreover, even if we knew Φ, C : R^n → F is now a linear operator, rather than simply a matrix, because F is possibly infinite-dimensional. Therefore, the goal is to identify the parameters (A, B), the sequence x_t, and some representation for C by exploiting the fact that we only know the kernel k.

In [7], an approach based on Kernel PCA (KPCA) [24] that parallels the PCA approach for learning LDS parameters in [11] was proposed to learn the system parameters of equation (7). Briefly, given the output feature sequence {y_t}_{t=1}^N, the intra-sequence kernel matrix K = [k(y_i, y_j)]_{i,j=1}^N is computed, where k(y_i, y_j) = Φ(y_i)^T Φ(y_j). The centered kernel matrix, which represents the kernel between zero-mean data in the high-dimensional space, is then computed as K̃ = (I_N - (1/N) e e^T) K (I_N - (1/N) e e^T), where e = [1, ..., 1]^T ∈ R^N. After performing the eigenvalue decomposition K̃ = V D V^T, the j-th eigenvector v_j can be used to obtain the j-th kernel principal component as Σ_{i=1}^N α_{i,j} Φ(y_i), where α_{i,j} represents the i-th component of the j-th weight vector α_j = v_j / √λ_j, assuming that the eigenvectors are sorted in descending order of the eigenvalues {λ_j}_{j=1}^N. Given α and K̃, the sequence of hidden states X = [x_1, x_2, ..., x_N] and the state-transition matrix A can be estimated as

X = α^T K̃,    (8)
A = [x_2, x_3, ..., x_N] [x_1, x_2, ..., x_{N-1}]^†.    (9)

The state noise at time t is estimated as v̂_t = x_{t+1} - A x_t, and the noise covariance matrix as Q = (1/(N-1)) Σ_{t=1}^{N-1} v̂_t v̂_t^T. Using a Cholesky decomposition on Q, B is estimated such that B B^T = Q. For details on estimating R and other parameters, refer to [7]. Notice, however, that we have not shown how to estimate C or a representation of it, though it is clear that C is somehow implicitly represented by the kernel matrix K̃. Since our goal is to use the NLDS in (7) for recognition, rather than synthesis, we do not actually need to compute C. All we need is a way of comparing two linear operators C, which can be done by comparing the corresponding kernel matrices, as we show in the next section.
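The identification procedure above can be summarized in a short sketch, following [7, 11]; the function and variable names are ours, and the small jitter added before the Cholesky factorization is a numerical convenience, not part of the method as stated.

```python
import numpy as np

def learn_nlds(K, n):
    """Identify NLDS parameters from one HOOF time series.

    K : N x N intra-sequence kernel matrix, K[i, j] = k(y_i, y_j).
    n : number of hidden states (kernel principal components kept).
    """
    N = K.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J                                 # centered kernel matrix
    lam, V = np.linalg.eigh(Kc)                    # eigenvalues in ascending order
    lam, V = lam[::-1][:n], V[:, ::-1][:, :n]      # keep the top n components
    alpha = V / np.sqrt(lam)                       # weight vectors alpha_j = v_j / sqrt(lam_j)
    X = alpha.T @ Kc                               # hidden states x_1..x_N, equation (8)
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])       # state-transition matrix, equation (9)
    Vhat = X[:, 1:] - A @ X[:, :-1]                # state-noise estimates
    Q = Vhat @ Vhat.T / (N - 1)                    # state-noise covariance
    B = np.linalg.cholesky(Q + 1e-9 * np.eye(n))   # B B^T = Q
    return X, A, Q, B, alpha
```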
5. Comparing HOOF Time Series with Binet-Cauchy Kernels for NLDS

Vishwanathan et al. [29] presented the family of Binet-Cauchy kernels for LDSs. In this section we briefly review the main concepts of the Binet-Cauchy kernels for LDSs and then develop extensions for the case of NLDSs. From the family of Binet-Cauchy kernels, the trace kernel K^T_LDS for comparing two infinitely long zero-mean Euclidean time series generated by the LDSs M = (x_0, A, C, B, R) and M' = (x'_0, A', C', B', R') is defined as

K^T_LDS({y_t}_{t=0}^∞, {y'_t}_{t=0}^∞) = K^T_LDS(M, M') := E_{v,w} [ Σ_{t=0}^∞ λ^t y_t^T y'_t ].    (10)

Here 0 < λ < 1, and E represents the expected value of the infinite sum of the inner products with respect to the joint probability distribution of v_t and w_t. It was shown in [29] that if the two LDSs have the same underlying and independent noise processes, with covariances Q and R for the state and output, respectively, then the Binet-Cauchy trace kernel can be computed in closed form as

K^T_LDS(M, M') = x_0^T P x'_0 + (λ / (1 - λ)) trace(Q P + R),    (11)

where

P = Σ_{t=0}^∞ λ^t (A^t)^T C^T C' A'^t.    (12)

If λ ||A|| ||A'|| < 1, where ||·|| is a matrix norm, then P can be computed by solving the Sylvester equation [29]

P = λ A^T P A' + C^T C'.    (13)

Notice that, as a result of the system parameter learning method [11], the second term on the right-hand side of equation (13), C^T C', is the matrix of all pairwise inner products of the principal components of the matrix Y = [y_1 - ȳ, y_2 - ȳ, ..., y_N - ȳ] and the matrix Y' = [y'_1 - ȳ', y'_2 - ȳ', ..., y'_{N'} - ȳ'], where ȳ is the mean of the sequence {y_t}_{t=1}^N and so on. Hence, the (i, j)-th entry of C^T C' is c_i^T c'_j, where c_i is the i-th principal component of Y and c'_j is the j-th principal component of Y'.

We now develop the Binet-Cauchy trace kernel for NLDSs. From equation (10), we see that the Binet-Cauchy trace kernel for LDSs is the expected value of an infinite series of weighted inner products between the outputs of two systems. We can similarly write the Binet-Cauchy trace kernel for NLDSs as the expected value of an infinite series of weighted inner products between the outputs after embedding them into the high-dimensional (possibly infinite) space using the map Φ. Specifically,

K^T_NLDS(M, M') := E_{v,w} [ Σ_{t=0}^∞ λ^t Φ(y_t)^T Φ(y'_t) ] = E_{v,w} [ Σ_{t=0}^∞ λ^t k(y_t, y'_t) ],    (14)

where k is the kernel defined on the non-Euclidean space of outputs H. If we look at equation (13), we see that in the case of NLDSs the equivalent form for the trace kernel is not immediately obtainable, because C and C' are unknown, and hence the term C^T C' cannot be evaluated directly. However, notice that C^T C' is now the product of the matrices formed from the kernel principal components from the NLDS identification process, as opposed to the principal components as in the case of LDSs. Thus, similar to the approach used in [7], the (i, j)-th entry of C^T C' can be computed as

[C^T C']_{i,j} = ( Σ_{k=1}^N α_{k,i} Φ(y_k) )^T ( Σ_{l=1}^{N'} α'_{l,j} Φ(y'_l) ) = α_i^T S α'_j,    (15)

where S is the matrix of all inner products of the form [S]_{k,l} = Φ(y_k)^T Φ(y'_l) = k(y_k, y'_l), with k ∈ {1, ..., N} and l ∈ {1, ..., N'}. Before computing the entries of C^T C' in equation (15), we need to center the kernel computation S. This is done by computing α̃_i = α_i - ((e^T α_i)/N) e and α̃'_j = α'_j - ((e^T α'_j)/N') e, and evaluating F = α̃^T S α̃'. Hence, the Binet-Cauchy kernel for NLDS requires the computation of the infinite sum

P̃ = Σ_{t=0}^∞ λ^t (A^t)^T F A'^t.    (16)

If λ ||A|| ||A'|| < 1, where ||·|| is a matrix norm, then P̃ can be computed by solving the corresponding Sylvester equation,

P̃ = λ A^T P̃ A' + F,    (17)

and the Binet-Cauchy trace kernel for NLDS is

K^T_NLDS(M, M') = x_0^T P̃ x'_0 + (λ / (1 - λ)) trace(Q P̃ + R).    (18)

Notice that equation (18) is a general kernel that takes into consideration the dynamics of the system encoded by (A, C), the noise processes Q and R, and the initial state x_0. The effect of the initial state in a periodic time series is simply to delay the time series. Since in activity recognition it does not matter, e.g., at which part of the walking cycle the video begins, we would like a kernel that is independent of the initial states and the noise processes. Hence we define the Binet-Cauchy maximum singular value kernel for NLDS as

K^σ_NLDS(M, M') = σ_max(P̃),    (19)

the maximum singular value of P̃, which is a kernel only on the dynamics of the NLDS. Furthermore, we can show that the Martin distance used by [7] (and all subspace-angle-based distances between dynamical systems) are special cases of the Binet-Cauchy kernels [8].
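Putting equations (15)-(19) together, the following sketch computes the maximum singular value kernel between two identified models. Since (17) is a two-sided (Stein-type) equation, a simple fixed-point iteration is used here, which converges exactly under the stated condition λ ||A|| ||A'|| < 1; the names and the choice λ = 0.9 are ours.

```python
import numpy as np

def cross_gram(alpha1, alpha2, S):
    """F from equation (15) after centering the weight vectors.

    S : N x N' matrix of kernel values k(y_k, y'_l) between the sequences.
    """
    a1 = alpha1 - alpha1.mean(axis=0, keepdims=True)   # center each weight vector
    a2 = alpha2 - alpha2.mean(axis=0, keepdims=True)
    return a1.T @ S @ a2

def k_max_sv(A1, A2, F, lam=0.9, iters=1000):
    """Iterate P <- lam * A1^T P A2 + F, equation (17), then return the
    maximum singular value of P, equation (19)."""
    P = F.copy()
    for _ in range(iters):
        P = lam * A1.T @ P @ A2 + F
    return np.linalg.svd(P, compute_uv=False)[0]       # largest singular value
```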
6. Experiments

To test the performance of our proposed HOOF features on activity recognition using the Binet-Cauchy kernel for NLDS, we perform a number of experiments on the Weizmann Human Action dataset [15]. This dataset contains 90 videos of 10 different actions, each performed by 9 different persons. The classes of actions include running, side walking, waving, jumping, etc. Optical flow was computed using OpenCV on each of the sequences, and HOOF histograms were extracted from each frame, leading to a HOOF time series for each video sequence.

6.1. Classification Results

We use each of the kernels described in Section 3.1 to perform Kernel PCA on the HOOF time series extracted

from each video. We then use the Binet-Cauchy maximum singular value kernel, K^σ_NLDS, to compute all pairwise similarity values. We normalize the similarity values so that K̃(M, M') = 1 if M = M', by computing

K̃(M, M') = K(M, M') / √(K(M, M) K(M', M')),

and find pairwise distances between systems by computing

d(M, M') = √(2 (1 - K̃(M, M'))).

Classification is then performed with leave-one-out 1-Nearest Neighbor (1-NN) classification using these distance values.

As a baseline, Figure 3 shows the performance of using the distance between the temporal means to perform classification. The temporal means were computed by averaging the histogram time series, h̄ = (1/N) Σ_{i=1}^N h_i. Notice that h̄ is also a histogram, so we can apply the distance metrics of Section 3.1 to compare them. Although using the mean histograms to represent the activity profile ignores any dynamics of the motion, we see that the HOOF features give low error rates.

Figure 3. Misclassification rates when using distances between HOOF temporal means (error rate vs. number of bins; Euclidean, geodesic, MDPA, chi-square and intersection distances).

Figure 4 shows the error rates for the inner product of the square-root representations, or the geodesic kernel, across all bin sizes. We can see that the Binet-Cauchy maximum singular value kernel gives the lowest error rate of 5.6%, achieving a recognition rate of 94.4%. The corresponding confusion matrix is shown in Figure 5. One sequence each is misclassified for the classes jump and run, and three sequences of class wave1 were misclassified as wave2.

Figure 4. Misclassification rates with the geodesic kernel on NLDS (error rate vs. number of bins B; Martin distance, maximum singular value kernel, and temporal means).

Table 1 compares the performance of three state-of-the-art activity recognition algorithms on the Weizmann database with similar experimental setups. We can see that the proposed method performs better than two of the methods with any choice of Mercer kernel on the space of histograms. Furthermore, the proposed method performs almost as well as the best method, [15]. The small decrease in performance is due to the fact that [15] requires the accurate extraction of the silhouette of the moving person at every time instant, which is a very strong assumption. Our method, on the other hand, is very general, does not require any preprocessing steps, and still gives better results than all state-of-the-art methods except [15].

Table 1. Comparison of recognition rates with state-of-the-art methods on the Weizmann database:
Proposed method - Geodesic kernel: 94.4
Proposed method - MDPA distance
Proposed method - χ² distance
Proposed method - HIST kernel
Gorelick et al. [15]
Ali et al. [2]: 92.6
Niebles et al. [20]: 90.0

Figure 5. Confusion matrix for recognition on the Weizmann database (classes: bend, jack, jump, pjump, run, side, skip, walk, wave1, wave2); average recognition rate is 94.4%.
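A sketch of the normalization and leave-one-out 1-NN step just described, assuming a precomputed matrix of pairwise kernel values between all sequences; the names are ours.

```python
import numpy as np

def classify_loo_1nn(K, labels):
    """Leave-one-out 1-NN from pairwise NLDS kernel values.

    K      : M x M matrix, K[i, j] = kernel between sequences i and j.
    labels : length-M array of action labels.
    """
    d = np.sqrt(np.diag(K))
    Kn = K / np.outer(d, d)                          # normalized so Kn[i, i] = 1
    D = np.sqrt(np.maximum(2.0 * (1.0 - Kn), 0.0))   # pairwise distances
    np.fill_diagonal(D, np.inf)                      # leave one out: ignore self-matches
    labels = np.asarray(labels)
    pred = labels[np.argmin(D, axis=1)]              # label of the nearest neighbor
    return pred, np.mean(pred == labels)             # predictions and accuracy
```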

7. Conclusions and Future Work

We have presented an activity recognition method that models the activity in a scene as a time series of non-Euclidean Histogram of Oriented Optical Flow features. We have shown that these features do not need any preprocessing, human detection, tracking or prior background subtraction, and that they represent the activity comprehensively. The HOOF features are scale-invariant as well as independent of the direction of motion. Since the space of histograms is non-Euclidean, we have modeled the temporal evolution of HOOF features using NLDSs and learnt the system parameters using kernels on the original histograms. More importantly, we have extended the Binet-Cauchy kernels to measure the similarities between two NLDSs and shown that the Binet-Cauchy kernels can be computed by evaluating pairwise Mercer kernels on the non-Euclidean space of features. We have applied our framework to our proposed HOOF features and have achieved state-of-the-art results on the Weizmann human action database. Currently we are working on extending our method to multiple disconnected motions in a scene by tracking and segmenting optical flow activity in the scene, as well as accounting for the motion of the camera.

8. Acknowledgements

This work was partially supported by startup funds from JHU and by grants ONR N , ONR N , NSF CAREER and ARL Robotics-CTA 84MC.

References

[1] J. K. Aggarwal and Q. Cai. Human motion analysis: A review. Computer Vision and Image Understanding, 73:90-102, 1999.
[2] S. Ali, A. Basharat, and M. Shah. Chaotic invariants for human action recognition. In IEEE International Conference on Computer Vision, 2007.
[3] A. Bissacco, A. Chiuso, Y. Ma, and S. Soatto. Recognition of human gaits. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 52-58, 2001.
[4] A. Bissacco, A. Chiuso, and S. Soatto. Classification and recognition of dynamical models: The role of phase, independent components, kernels and optimal transport. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11):1958-1972, 2007.
[5] S.-H. Cha and S. N. Srihari. On measuring the distance between histograms. Pattern Recognition, 35(6):1355-1370, June 2002.
[6] A. Chan and N. Vasconcelos. Probabilistic kernels for the classification of auto-regressive visual processes. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, 2005.
[7] A. Chan and N. Vasconcelos. Classifying video with kernel dynamic textures. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1-6, 2007.
[8] R. Chaudhry and R. Vidal. Recognition of visual dynamical processes: Theory, kernels and experimental evaluation. Technical report, Johns Hopkins University, 2009.
[9] K. D. Cock and B. D. Moor. Subspace angles and distances between ARMA models. System and Control Letters, 46(4):265-270, 2002.
[10] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, October 2005.
[11] G. Doretto, A. Chiuso, Y. Wu, and S. Soatto. Dynamic textures. Int. Journal of Computer Vision, 51(2):91-109, 2003.
[12] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley-Interscience, October 2004.
[13] A. Efros, A. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In IEEE International Conference on Computer Vision, pages 726-733, 2003.
[14] D. M. Gavrila. The visual analysis of human movement: A survey. Computer Vision and Image Understanding, 73:82-98, 1999.
[15] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. IEEE Trans. on Pattern Analysis and Machine Intelligence, 29(12):2247-2253, 2007.
[16] N. İkizler and D. A. Forsyth. Searching for complex human activities with no visual examples. International Journal of Computer Vision, 80(3):337-357, 2008.
[17] I. Laptev. On space-time interest points. Int. Journal of Computer Vision, 64(2-3):107-123, 2005.
[18] T. B. Moeslund and E. Granum. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 81:231-268, 2001.
[19] T. B. Moeslund, A. Hilton, and V. Krüger. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104:90-126, 2006.
[20] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 79:299-318, 2008.
[21] C. R. Rao. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37:81-89, 1945.
[22] Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In IEEE Int. Conf. on Computer Vision, 1998.
[23] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[24] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
[25] R. Shumway and D. Stoffer. An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, 3(4):253-264, 1982.
[26] A. Srivastava, I. Jermyn, and S. Joshi. Riemannian analysis of probability density functions with applications in vision. In IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[27] D. Tran and A. Sorokin. Human activity recognition with metric learning. In European Conference on Computer Vision, 2008.
[28] P. van Overschee and B. D. Moor. Subspace Identification for Linear Systems. Kluwer Academic Publishers, 1996.
[29] S. Vishwanathan, A. Smola, and R. Vidal. Binet-Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. International Journal of Computer Vision, 73(1):95-119, 2007.
[30] G. Willems, T. Tuytelaars, and L. J. V. Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In European Conference on Computer Vision, 2008.
[31] A. Yilmaz and M. Shah. Actions sketch: A novel action representation. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[32] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: An in-depth study. Technical report, INRIA, 2005.


More information

Action Recognition in Low Quality Videos by Jointly Using Shape, Motion and Texture Features

Action Recognition in Low Quality Videos by Jointly Using Shape, Motion and Texture Features Action Recognition in Low Quality Videos by Jointly Using Shape, Motion and Texture Features Saimunur Rahman, John See, Chiung Ching Ho Centre of Visual Computing, Faculty of Computing and Informatics

More information

Dynamic Texture with Fourier Descriptors

Dynamic Texture with Fourier Descriptors B. Abraham, O. Camps and M. Sznaier: Dynamic texture with Fourier descriptors. In Texture 2005: Proceedings of the 4th International Workshop on Texture Analysis and Synthesis, pp. 53 58, 2005. Dynamic

More information

Alternative Statistical Methods for Bone Atlas Modelling

Alternative Statistical Methods for Bone Atlas Modelling Alternative Statistical Methods for Bone Atlas Modelling Sharmishtaa Seshamani, Gouthami Chintalapani, Russell Taylor Department of Computer Science, Johns Hopkins University, Baltimore, MD Traditional

More information

Shape Descriptor using Polar Plot for Shape Recognition.

Shape Descriptor using Polar Plot for Shape Recognition. Shape Descriptor using Polar Plot for Shape Recognition. Brijesh Pillai ECE Graduate Student, Clemson University bpillai@clemson.edu Abstract : This paper presents my work on computing shape models that

More information

Available online at ScienceDirect. Procedia Computer Science 60 (2015 )

Available online at  ScienceDirect. Procedia Computer Science 60 (2015 ) Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science (15 ) 43 437 19th International Conference on Knowledge Based and Intelligent Information and Engineering Systems Human

More information

REJECTION-BASED CLASSIFICATION FOR ACTION RECOGNITION USING A SPATIO-TEMPORAL DICTIONARY. Stefen Chan Wai Tim, Michele Rombaut, Denis Pellerin

REJECTION-BASED CLASSIFICATION FOR ACTION RECOGNITION USING A SPATIO-TEMPORAL DICTIONARY. Stefen Chan Wai Tim, Michele Rombaut, Denis Pellerin REJECTION-BASED CLASSIFICATION FOR ACTION RECOGNITION USING A SPATIO-TEMPORAL DICTIONARY Stefen Chan Wai Tim, Michele Rombaut, Denis Pellerin Univ. Grenoble Alpes, GIPSA-Lab, F-38000 Grenoble, France ABSTRACT

More information

Learning Human Actions with an Adaptive Codebook

Learning Human Actions with an Adaptive Codebook Learning Human Actions with an Adaptive Codebook Yu Kong, Xiaoqin Zhang, Weiming Hu and Yunde Jia Beijing Laboratory of Intelligent Information Technology School of Computer Science, Beijing Institute

More information

Image Processing. Image Features

Image Processing. Image Features Image Processing Image Features Preliminaries 2 What are Image Features? Anything. What they are used for? Some statements about image fragments (patches) recognition Search for similar patches matching

More information

Ranking on Data Manifolds

Ranking on Data Manifolds Ranking on Data Manifolds Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics, 72076 Tuebingen, Germany {firstname.secondname

More information

3D Object Recognition using Multiclass SVM-KNN

3D Object Recognition using Multiclass SVM-KNN 3D Object Recognition using Multiclass SVM-KNN R. Muralidharan, C. Chandradekar April 29, 2014 Presented by: Tasadduk Chowdhury Problem We address the problem of recognizing 3D objects based on various

More information

HW2 due on Thursday. Face Recognition: Dimensionality Reduction. Biometrics CSE 190 Lecture 11. Perceptron Revisited: Linear Separators

HW2 due on Thursday. Face Recognition: Dimensionality Reduction. Biometrics CSE 190 Lecture 11. Perceptron Revisited: Linear Separators HW due on Thursday Face Recognition: Dimensionality Reduction Biometrics CSE 190 Lecture 11 CSE190, Winter 010 CSE190, Winter 010 Perceptron Revisited: Linear Separators Binary classification can be viewed

More information

SUPERVISED NEIGHBOURHOOD TOPOLOGY LEARNING (SNTL) FOR HUMAN ACTION RECOGNITION

SUPERVISED NEIGHBOURHOOD TOPOLOGY LEARNING (SNTL) FOR HUMAN ACTION RECOGNITION SUPERVISED NEIGHBOURHOOD TOPOLOGY LEARNING (SNTL) FOR HUMAN ACTION RECOGNITION 1 J.H. Ma, 1 P.C. Yuen, 1 W.W. Zou, 2 J.H. Lai 1 Hong Kong Baptist University 2 Sun Yat-sen University ICCV workshop on Machine

More information

Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information

Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information Ana González, Marcos Ortega Hortas, and Manuel G. Penedo University of A Coruña, VARPA group, A Coruña 15071,

More information

Fast Motion Consistency through Matrix Quantization

Fast Motion Consistency through Matrix Quantization Fast Motion Consistency through Matrix Quantization Pyry Matikainen 1 ; Rahul Sukthankar,1 ; Martial Hebert 1 ; Yan Ke 1 The Robotics Institute, Carnegie Mellon University Intel Research Pittsburgh Microsoft

More information

Motion Estimation and Optical Flow Tracking

Motion Estimation and Optical Flow Tracking Image Matching Image Retrieval Object Recognition Motion Estimation and Optical Flow Tracking Example: Mosiacing (Panorama) M. Brown and D. G. Lowe. Recognising Panoramas. ICCV 2003 Example 3D Reconstruction

More information

Shape Context Matching For Efficient OCR

Shape Context Matching For Efficient OCR Matching For Efficient OCR May 14, 2012 Matching For Efficient OCR Table of contents 1 Motivation Background 2 What is a? Matching s Simliarity Measure 3 Matching s via Pyramid Matching Matching For Efficient

More information

Chapter 3 Image Registration. Chapter 3 Image Registration

Chapter 3 Image Registration. Chapter 3 Image Registration Chapter 3 Image Registration Distributed Algorithms for Introduction (1) Definition: Image Registration Input: 2 images of the same scene but taken from different perspectives Goal: Identify transformation

More information

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation. Equation to LaTeX Abhinav Rastogi, Sevy Harris {arastogi,sharris5}@stanford.edu I. Introduction Copying equations from a pdf file to a LaTeX document can be time consuming because there is no easy way

More information

A Taxonomy of Semi-Supervised Learning Algorithms

A Taxonomy of Semi-Supervised Learning Algorithms A Taxonomy of Semi-Supervised Learning Algorithms Olivier Chapelle Max Planck Institute for Biological Cybernetics December 2005 Outline 1 Introduction 2 Generative models 3 Low density separation 4 Graph

More information

Preliminary Local Feature Selection by Support Vector Machine for Bag of Features

Preliminary Local Feature Selection by Support Vector Machine for Bag of Features Preliminary Local Feature Selection by Support Vector Machine for Bag of Features Tetsu Matsukawa Koji Suzuki Takio Kurita :University of Tsukuba :National Institute of Advanced Industrial Science and

More information

Silhouette Based Human Motion Detection and Recognising their Actions from the Captured Video Streams Deepak. N. A

Silhouette Based Human Motion Detection and Recognising their Actions from the Captured Video Streams Deepak. N. A Int. J. Advanced Networking and Applications 817 Silhouette Based Human Motion Detection and Recognising their Actions from the Captured Video Streams Deepak. N. A Flosolver Division, National Aerospace

More information

CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points]

CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points] CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, 2015. 11:59pm, PDF to Canvas [100 points] Instructions. Please write up your responses to the following problems clearly and concisely.

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution

More information

Temporal Feature Weighting for Prototype-Based Action Recognition

Temporal Feature Weighting for Prototype-Based Action Recognition Temporal Feature Weighting for Prototype-Based Action Recognition Thomas Mauthner, Peter M. Roth, and Horst Bischof Institute for Computer Graphics and Vision Graz University of Technology {mauthner,pmroth,bischof}@icg.tugraz.at

More information

Content-based image and video analysis. Machine learning

Content-based image and video analysis. Machine learning Content-based image and video analysis Machine learning for multimedia retrieval 04.05.2009 What is machine learning? Some problems are very hard to solve by writing a computer program by hand Almost all

More information

Video Alignment. Final Report. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin

Video Alignment. Final Report. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Final Report Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Omer Shakil Abstract This report describes a method to align two videos.

More information

Motion Tracking and Event Understanding in Video Sequences

Motion Tracking and Event Understanding in Video Sequences Motion Tracking and Event Understanding in Video Sequences Isaac Cohen Elaine Kang, Jinman Kang Institute for Robotics and Intelligent Systems University of Southern California Los Angeles, CA Objectives!

More information

Dimension Reduction CS534

Dimension Reduction CS534 Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of

More information

CS 231A Computer Vision (Fall 2012) Problem Set 3

CS 231A Computer Vision (Fall 2012) Problem Set 3 CS 231A Computer Vision (Fall 2012) Problem Set 3 Due: Nov. 13 th, 2012 (2:15pm) 1 Probabilistic Recursion for Tracking (20 points) In this problem you will derive a method for tracking a point of interest

More information

Learning to Recognize Faces in Realistic Conditions

Learning to Recognize Faces in Realistic Conditions 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Artistic ideation based on computer vision methods

Artistic ideation based on computer vision methods Journal of Theoretical and Applied Computer Science Vol. 6, No. 2, 2012, pp. 72 78 ISSN 2299-2634 http://www.jtacs.org Artistic ideation based on computer vision methods Ferran Reverter, Pilar Rosado,

More information

Video Google faces. Josef Sivic, Mark Everingham, Andrew Zisserman. Visual Geometry Group University of Oxford

Video Google faces. Josef Sivic, Mark Everingham, Andrew Zisserman. Visual Geometry Group University of Oxford Video Google faces Josef Sivic, Mark Everingham, Andrew Zisserman Visual Geometry Group University of Oxford The objective Retrieve all shots in a video, e.g. a feature length film, containing a particular

More information

Classification of objects from Video Data (Group 30)

Classification of objects from Video Data (Group 30) Classification of objects from Video Data (Group 30) Sheallika Singh 12665 Vibhuti Mahajan 12792 Aahitagni Mukherjee 12001 M Arvind 12385 1 Motivation Video surveillance has been employed for a long time

More information