Supervised Spatio-Temporal Neighborhood Topology Learning for Action Recognition

Andy J. Ma

(This paper has been submitted to the IEEE Transactions on Circuits and Systems for Video Technology.)

Abstract

Supervised manifold learning has been successfully applied to human action recognition, where class label information can improve recognition performance. However, the learned manifold may not preserve both the local structure and the temporal pose correspondence of sequences from the same action. To overcome this problem, this paper proposes a new supervised manifold learning algorithm, namely supervised spatio-temporal neighborhood topology learning (SSTNTL), for action recognition. By analyzing the topological characteristics in the context of action recognition, we propose to construct the neighborhood topology using both supervised spatial and temporal pose correspondence information. Employing the locality preserving property of LPP, SSTNTL solves a generalized eigenvalue problem to obtain the projections that not only separate data points from different classes, but also preserve the local structure and temporal pose correspondence of sequences from the same class. Experimental results demonstrate that SSTNTL outperforms the manifold embedding methods based on other topologies as well as those based on local discriminant information. Moreover, compared with state-of-the-art action recognition algorithms, SSTNTL gives the best performance for both human action and gesture recognition.

1 Introduction

Action recognition is an active research topic in computer vision and pattern recognition, owing to a wide range of potential applications such as intelligent video surveillance, perceptual interfaces and content-based video retrieval. While many algorithms and systems have been developed in the last decade [32] [25] [1], recognizing actions in videos remains challenging.

A key issue in action recognition is to extract useful action information from raw video data. Various approaches have been proposed to extract features from video sequences, such as key frame extraction [35] [2], space-time interest point detection and description [22] [13], and key point trajectory based approaches [21] [26]. However, recognizing actions using key frames lacks motion information, and the interest point and trajectory based methods extract local features but do not take advantage of global space-time shape information.

Action sequences, characterized by poses deforming continuously over time, can be regarded as data points on dynamic action manifolds. In [33] [34], Locality Preserving Projection (LPP) [8] was employed to learn and match dynamic shape manifolds for human action recognition. Besides LPP, there are many manifold learning methods, such as Isometric feature Mapping (Isomap) [31], Locally Linear Embedding (LLE) [27] and Laplacian Eigenmap (LE) [3], which aim to discover the intrinsic geometrical structure of a data manifold. However, none of these manifold learning algorithms makes use of the temporal information in action sequences, which is thought to be important. In [9], Spatio-Temporal Isomap (ST-Isomap) was proposed to construct local neighborhoods that emphasize the similarity between temporally related blocks. Besides ST-Isomap, a temporal extension of Laplacian Eigenmap (TLE) was proposed in [14] and achieves better performance. Both ST-Isomap and TLE are unsupervised methods, i.e. class label information is not taken into account. Consequently, they may not be efficiently applied to supervised dimensionality reduction problems.

Recently, supervised manifold embedding methods have been proposed, such as Locality Sensitive Discriminant Analysis (LSDA) [4], Local Fisher Discriminant Analysis (LFDA) [30], Marginal Fisher Analysis (MFA) [37], and Supervised Locality Preserving Projection (SLPP) [39]. Although these methods show that label information is important, they do not take temporal cues into account. In this context, Jia and Yeung [10] proposed a local spatio-temporal discriminant embedding (LSTDE) method to discover the local spatial and local temporal discriminant structures. LSTDE is derived by maximizing the inter-class variance and minimizing the intra-class variance based on the local spatio-temporal information.

In this case, recognition based on a whole sequence rather than on local segments may not benefit much from LSTDE. Moreover, the solution to LSTDE is based on an iterative scheme, but there is no justification of whether the iterative solution converges and is optimal. In addition, unlike Isomap, LLE or LE, LSTDE is not derived from manifold properties, so the manifold interpretation of LSTDE is unclear.

To overcome the limitations mentioned above, this paper proposes a new method to learn the manifold structure using supervised spatial and temporal pose correspondence information from the topological aspect of a manifold. Topology is an important concept in the definition of a manifold [5]. By analyzing the topological base of existing methods for action recognition, we define a new topology from spatial and class label information, which is used to construct the supervised spatial neighborhood (SSN). However, the SSN does not take full advantage of the temporal information. Thus, we further analyze the temporal pose correspondence between sequences of the same action and develop the temporal pose correspondence neighborhood (TPCN). The supervised spatial and temporal pose correspondence information is fused by taking the union of the SSN and the TPCN. Finally, the optimal linear projection functions preserving the neighborhood relationship are obtained by solving a generalized eigenvalue problem. Since the proposed method learns the neighborhood topology from both the supervised spatial distribution and the temporal pose correspondence information, we call it supervised spatio-temporal neighborhood topology learning (SSTNTL).

The rest of this paper is organized as follows. In Section 2, we present the proposed method for action recognition. Experimental results and conclusions are reported in Sections 3 and 4, respectively.

2 Supervised Spatio-Temporal Neighborhood Topology Learning

The major difference between LPP and SLPP lies in the adjacency graph. In this section, we show that the adjacency graph is intimately related to the topological base from a mathematical perspective. Since topology is an important concept in the definition of a manifold, we first give a topological analysis in the context of action recognition. Then, we propose to construct the neighborhood topology from the supervised spatial as well as temporal pose correspondence information. Finally, employing the locality preserving property of LPP, we derive SSTNTL for action recognition.

Figure 1. Similar poses in two different actions.

2.1 Topological Analysis for Action Recognition

In action recognition, actions are regarded as data points on a manifold. Suppose there are $c$ classes of actions and the data points from all the actions lie on a smooth and compact manifold. Different actions may be close to each other on the manifold, because different actions share similar poses. This is illustrated in Fig. 1, which shows two actions, run and walk. The poses (frames) indicated with red rectangles in the run and walk actions are similar, so certain data points from these two actions will be close together, and the sequences representing these two actions may be close to each other. In this case, the two actions in Fig. 1 can be represented by the data points (diamonds and circles represent different actions) shown in Fig. 2(a). Denote the two actions by $A_i$ and $A_j$, and let $A = A_i \cup A_j$.

Usually, the topology defined on $A$ is induced by the Euclidean topology; denote this topology by $\tau_e$. Since a topology is too abstract to work with directly, we investigate the topological base [11] instead. In mathematics, if $\tau = \langle \mathcal{B} \rangle$, where $\langle \mathcal{B} \rangle$ denotes the family of sets generated by the family of sets $\mathcal{B}$, i.e.

$$\langle \mathcal{B} \rangle = \{ U \subseteq A \mid \forall x \in U, \exists B \in \mathcal{B}, \text{ s.t. } x \in B \subseteq U \},$$

then $\mathcal{B}$ is called a topological base of the topological space $(A, \tau)$. With this representation, $\tau_e$ can be written as

$$\tau_e = \langle \mathcal{B}_e \rangle, \quad \mathcal{B}_e = \{ A \cap B(x_n, \varepsilon) \mid x_n \in A, \varepsilon > 0 \} \qquad (1)$$

where $B(x_n, \varepsilon)$ is the ball centered at $x_n$ with radius $\varepsilon$. The topology in LPP is constructed with a fixed $\varepsilon$ in $\tau_e$, which can be written as

$$\tau_{LPP} = \langle \mathcal{B}_{LPP} \rangle, \quad \mathcal{B}_{LPP} = \{ A \cap B(x_n, \varepsilon_{LPP}) \mid x_n \in A \} \qquad (2)$$

where $\varepsilon_{LPP}$ is the parameter of the $\varepsilon$-neighborhood. For a suitable parameter $\varepsilon_{LPP}$, the topological base of LPP can be represented by the ovals in Fig. 2(b). Based on the topological base, the adjacency graph in LPP is constructed by putting a link between any two points in the same oval, as also shown in Fig. 2(b). From Fig. 2(b), we can see that close points from two different actions are connected, so the projections learned by LPP will map these points into the low-dimensional space as close together as possible. However, this is not what we want for classification.
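For illustration, the $\varepsilon$-neighborhood graph underlying (2) can be sketched in a few lines of NumPy. This is not code from the paper; it assumes Euclidean distance and the simple-minded 0/1 weighting mentioned later in Section 3.1, and it makes the problem above visible: with overlapping classes, the $\varepsilon$-ball graph links points across the two actions.

```python
import numpy as np

def lpp_adjacency(X, eps):
    """Epsilon-neighborhood adjacency graph as used by LPP.

    X   : (N, d) array of data points (one row per frame/pose).
    eps : radius of the ball B(x_n, eps); every pair of points whose
          Euclidean distance is at most eps is linked, regardless of class.
    Returns an (N, N) 0/1 weight matrix ("simple-minded" weighting).
    """
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    W = (dist <= eps).astype(float)
    np.fill_diagonal(W, 0.0)          # no self-loops
    return W

# Two overlapping 2-D "actions", as in Fig. 2(a): the epsilon-ball graph
# links diamonds to circles, which is what harms classification.
rng = np.random.default_rng(0)
A_i = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
A_j = rng.normal(loc=[0.5, 0.0], scale=0.3, size=(20, 2))
X = np.vstack([A_i, A_j])
labels = np.array([0] * 20 + [1] * 20)

W = lpp_adjacency(X, eps=0.4)
rows, cols = np.nonzero(W)
cross_links = np.sum(labels[rows] != labels[cols]) // 2
print("edges between the two actions:", cross_links)
```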

Figure 2. (a) Visualization of two close actions; visualization of the topological base and adjacency graph in (b) LPP, (c) SLPP, and (d) the proposed supervised spatial method.

Topologies defined on $A$ can have many possible variations. Besides the Euclidean topology, the topology in SLPP is given by

$$\tau_{SLPP} = \langle \mathcal{B}_{SLPP} \rangle, \quad \mathcal{B}_{SLPP} = \{ A_i, A_j \} \qquad (3)$$

The topological base and adjacency graph of SLPP are shown as ovals and edges in Fig. 2(c). Since SLPP only considers the class information, every two points from the same class are connected by an edge in the graph in Fig. 2(c). Therefore, the topology in SLPP hardly contains any local information about the data.

2.2 Supervised Spatial Neighborhood Topology Construction

According to the analysis in the previous section, the topological base plays an important role in both LPP and SLPP. With the spatial distribution and class label information of the data, the topological bases $\mathcal{B}_{LPP}$ in LPP and $\mathcal{B}_{SLPP}$ in SLPP can be constructed. However, $\mathcal{B}_{LPP}$ cannot separate data points from different classes, and $\mathcal{B}_{SLPP}$ hardly contains any local information. In order to ensure that the topological base not only preserves local structure but is also discriminative, we take the intersection of $\mathcal{B}_{LPP}$ and $\mathcal{B}_{SLPP}$ and define the supervised spatial topological base as

$$\mathcal{B}_{SS} = \{ B_{LPP} \cap B_{SLPP} \mid B_{LPP} \in \mathcal{B}_{LPP}, B_{SLPP} \in \mathcal{B}_{SLPP} \} \qquad (4)$$

Since $B_{LPP} = A \cap B(x_n, \varepsilon_{LPP})$ and $B_{SLPP} = A_i$ or $A_j$, $\mathcal{B}_{SS}$ in (4) can be rewritten and the novel topology defined as

$$\tau_{SS} = \langle \mathcal{B}_{SS} \rangle, \quad \mathcal{B}_{SS} = \{ A_k \cap B(x_n, \varepsilon_{SS}) \mid x_n \in A_k, \; k = i \text{ or } j \} \qquad (5)$$

The topological base $\mathcal{B}_{SS}$ of (5) and the adjacency graph constructed from $\mathcal{B}_{SS}$ are shown in Fig. 2(d). With this new topology and adjacency graph, $A$ can be separated into two connected components, $A_i$ and $A_j$. In addition, close points from the same action are connected by an edge in the graph. Therefore, the proposed supervised spatial topology $\tau_{SS}$ not only preserves the local structure of data points from the same class but also separates data points from different classes.

Figure 3. Visualization of the temporal continuity in an action sequence.

Besides spatial and label information, the supervised spatial topology $\tau_{SS}$ contains temporal adjacency information as well. As shown in Fig. 3, action sequences are characterized by poses deforming continuously over time. Mathematically, temporal continuity means that for a suitable $\varepsilon_{SS}$ there exists a positive integer $\delta$ such that $\| x_{n+t} - x_n \| \le \varepsilon_{SS}$ when $|t| \le \delta$. By the definition of the supervised spatial topological base, we have

$$x_{n+t} \in B_{SS} \in \mathcal{B}_{SS}, \quad \text{when } |t| \le \delta \qquad (6)$$

This means that the temporally adjacent neighbors, indicated by the red ovals in Fig. 3, are contained in the supervised spatial neighbors. Moreover, the construction of $\mathcal{B}_{SS}$ avoids the non-trivial problem of selecting the temporal adjacency parameter, which is another advantage of the proposed method.

In practice, it is difficult to choose an optimal $\varepsilon_{SS}$ when using the $\varepsilon$-neighborhood to construct the adjacency graph, so we construct the graph using k-NN instead. However, the number of data points from different classes may not be the same, so we choose a percentage parameter $a_{SS}$ instead of $k$ to construct the neighborhood of each data point. The algorithmic procedure to construct the supervised spatial neighborhood $N_{SS}(x_n)$ for each $x_n$ is stated in Algorithm 1 (Fig. 4).

Algorithm 1. Constructing $N_{SS}$
Input: training samples $x_1, \dots, x_N$; corresponding labels $y_1, \dots, y_N$; percentage parameter $a_{SS}$.
Output: supervised spatial neighborhoods $N_{SS}$.
for $i = 1, \dots, c$
    Compute the number of samples of class $i$, $N_i$;
    Calculate the k-NN parameter $k^{SS}_i = a_{SS} N_i$;
for $n = 1, \dots, N$
    Select the $k^{SS}_{y_n}$ nearest samples to $x_n$ from class $y_n$ as $N_{SS}(x_n)$;
return $N_{SS}(x_1), \dots, N_{SS}(x_N)$.

Figure 4. Algorithm 1. Construction of the supervised spatial neighborhood.
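As a companion to Algorithm 1, the following is a minimal NumPy sketch of the class-restricted k-NN construction. It is an illustration rather than the authors' implementation; in particular, the rounding of $a_{SS} N_i$ to an integer is an assumption of this sketch, since the paper does not state the rounding rule.

```python
import numpy as np

def supervised_spatial_neighborhood(X, y, a_ss):
    """Sketch of Algorithm 1: class-restricted k-NN with a percentage parameter.

    X    : (N, d) array of training samples (one row per frame/pose).
    y    : (N,) array of class labels.
    a_ss : percentage parameter; class i uses roughly a_ss * N_i neighbors
           (rounding and the lower/upper bounds are assumptions of this sketch).
    Returns a dict mapping sample index n -> indices of N_SS(x_n).
    """
    neighborhoods = {}
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]                      # samples of this class
        k = min(max(1, int(round(a_ss * len(idx)))), len(idx) - 1)
        Xc = X[idx]
        # pairwise Euclidean distances within the class only
        d = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)                      # exclude the point itself
        for local_n, n in enumerate(idx):
            nearest = np.argsort(d[local_n])[:k]
            neighborhoods[n] = idx[nearest]
    return neighborhoods
```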

2.3 Temporal Pose Correspondence Neighborhood Topology Construction

Although the supervised spatial neighborhood $N_{SS}$ contains temporally adjacent neighbors, as discussed above, its construction still does not take full advantage of the temporal information. Considering different sequences of the same action bend, as shown in Fig. 5, it can be observed that different sequences share similar poses deforming similarly over time. However, if the neighborhood is constructed from spatial information only, the corresponding poses in different sequences of the same action may not be close to each other, due to background and appearance changes. Thus, the temporal pose correspondence between sequences of the same action may not be discovered by the supervised spatial neighbors. In this context, we propose to construct a neighborhood from the temporal pose correspondence directly.

Figure 5. Topological visualization of the temporal pose correspondence neighborhood.

Denote two complete sequences of the same action (for actions like run and walk, the moving direction is assumed to be the same), with segmented regions of interest, by $A_{i_1} = \{ x^{i_1}_1, \dots, x^{i_1}_{N_{i_1}} \}$ and $A_{i_2} = \{ x^{i_2}_1, \dots, x^{i_2}_{N_{i_2}} \}$. For non-periodic actions, since we assume the action sequences are complete, the beginning and ending poses of $A_{i_1}$ and $A_{i_2}$ are the same. However, different sequences of the same action are performed by different subjects, so $A_{i_1}$ and $A_{i_2}$ may contain different numbers of frames, and the pose correspondence between $A_{i_1}$ and $A_{i_2}$ needs to be found. Dynamic time warping (DTW) [28] is an algorithm that measures the similarity between two sequences which may vary in speed. With DTW, the temporal pose correspondence is represented by the warping function

$$F = [(f_{i_1}(1), f_{i_2}(1)), \dots, (f_{i_1}(t), f_{i_2}(t)), \dots, (f_{i_1}(T), f_{i_2}(T))] \qquad (7)$$

where $1 \le f_{i_1}(t) \le N_{i_1}$, $1 \le f_{i_2}(t) \le N_{i_2}$, and $T$ is an integer such that $T \ge \max\{N_{i_1}, N_{i_2}\}$. With the five constraints proposed in [28], i.e. the monotonicity, continuity, boundary, adjustment-window, and slope constraints, DTW finds the warping function $F$ by solving the following optimization problem

$$\min_F \sum_{t=1}^{T} \omega(t) \, \| x^{i_1}_{f_{i_1}(t)} - x^{i_2}_{f_{i_2}(t)} \| \qquad (8)$$

where $\omega(t)$ is a non-negative weighting coefficient. For the symmetric form, $\omega(t) = (f_{i_1}(t) - f_{i_1}(t-1)) + (f_{i_2}(t) - f_{i_2}(t-1))$, while $\omega(t) = f_{i_1}(t) - f_{i_1}(t-1)$ or $\omega(t) = f_{i_2}(t) - f_{i_2}(t-1)$ for the asymmetric forms. In [28], the DTW distance matrix is constructed by the dynamic programming (DP) based time-normalization algorithm, and the optimal $F$ is then obtained by searching for the path in the matrix with minimal cost (more details about DTW can be found in [28]).

For periodic actions, the beginning and ending poses of $A_{i_1}$ and $A_{i_2}$ may not be the same, so DTW cannot be applied directly. Suppose $N_{i_1} \le N_{i_2}$, and define the permutation function $\sigma_{t_0}$ by $\sigma_{t_0}(t) = t - t_0$ if $t_0 < t \le N_{i_1}$ and $\sigma_{t_0}(t) = t + N_{i_1} - t_0$ if $0 \le t \le t_0$, where $t_0$ is a shift ranging over the integers from $0$ to $N_{i_1} - 1$. In this case, the optimal $F$ is obtained by

$$\min_{t_0} \min_F \sum_{t=1}^{T} \omega(t) \, \| x^{i_1}_{\sigma_{t_0}(f_{i_1}(t))} - x^{i_2}_{f_{i_2}(t)} \| \qquad (9)$$

To solve optimization problem (9), we fix $t_0$ and compute the minimum value with respect to $F$ by DTW. Then, by choosing the lowest cost over all shifts $t_0$, the optimal solution is achieved.
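To make the alignment step concrete, the sketch below implements a plain dynamic-programming DTW for (8) and the shift search of (9). It is a simplification of the time-normalized algorithm in [28]: unit weights $\omega(t) = 1$ and no adjustment-window or slope constraints; the function names and the direction of the circular shift are choices of this sketch, not of the paper.

```python
import numpy as np

def dtw_cost_and_path(A, B):
    """Plain DTW between sequences A (n x d) and B (m x d) with Euclidean
    frame distance and the basic monotonic/continuity/boundary constraints.
    Returns (total cost, warping path as a list of (i, j) index pairs)."""
    n, m = len(A), len(B)
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    acc = np.full((n, m), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = dist[i, j] + best_prev
    # backtrack to recover the warping function F = [(f1(t), f2(t))]
    path, i, j = [(n - 1, m - 1)], n - 1, m - 1
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((acc[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i, j - 1], (i, j - 1)))
        _, (i, j) = min(candidates, key=lambda c: c[0])
        path.append((i, j))
    return acc[-1, -1], path[::-1]

def periodic_dtw(A, B):
    """Eq. (9): try every circular shift t0 of the shorter sequence A and keep
    the alignment with the lowest DTW cost (shift direction is an assumption)."""
    best = (np.inf, None, None)
    for t0 in range(len(A)):
        shifted = np.roll(A, -t0, axis=0)   # circularly shift the frames
        cost, path = dtw_cost_and_path(shifted, B)
        if cost < best[0]:
            best = (cost, t0, path)         # path indices refer to the shifted A
    return best
```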
The optimal warping function $F$ defined by (7) gives the correspondence between similar poses in different sequences of the same action. However, the difference between corresponding frames may still be very large due to background and appearance changes. Consequently, if corresponding frames of the same pose in different sequences were placed in the same neighborhood regardless of their distance, the manifold structure could be destroyed. Thus, only neighboring corresponding frames are selected as the base of the temporal pose correspondence neighborhood topology, i.e.

$$\tau_{TPC} = \langle \mathcal{B}_{TPC} \rangle, \quad \mathcal{B}_{TPC} = \{ C(x_n) \cap B(x_n, \varepsilon_{TPC}) \mid x_n \in X \} \qquad (10)$$

where $X$ is the universal set and $C(x_n)$ denotes the set of poses in different sequences corresponding to $x_n$.

The topological base constructed from the temporal pose correspondence is indicated by the red ovals in Fig. 5. As with the supervised spatial neighborhood, the temporal pose correspondence neighborhood is constructed by k-NN, and a percentage parameter $a_{TPC}$ is used instead of $k$. Denote by $M_i$ the number of sequences of action $i$. The algorithmic procedure to construct the temporal pose correspondence neighborhood $N_{TPC}(x_n)$ for each $x_n$ is stated in Algorithm 2 (Fig. 6).

Algorithm 2. Constructing $N_{TPC}$
Input: training sequences $A_1, \dots, A_M$; corresponding labels $y_1, \dots, y_M$; percentage parameter $a_{TPC}$.
Output: temporal pose correspondence neighborhoods $N_{TPC}$.
for $i = 1, \dots, c$
    Find the action sequences with label $i$;
    for $m = 1, \dots, M_i$
        for $n = m + 1, \dots, M_i$
            Obtain the warping function $F_{A_{i_m}, A_{i_n}}$ by solving (8) or (9);
for $n = 1, \dots, N$
    Construct the corresponding set $C(x_n)$ of $x_n$ from the warping functions $F$;
    Count the number of elements in $C(x_n)$, $\#C(x_n)$;
    Set $k^{TPC}_n = a_{TPC} \, \#C(x_n)$;
    Select the $k^{TPC}_n$ nearest samples in $C(x_n)$ as $N_{TPC}(x_n)$;
return $N_{TPC}(x_1), \dots, N_{TPC}(x_N)$.

Figure 6. Algorithm 2. Construction of the temporal pose correspondence neighborhood.

2.4 Supervised Spatial and Temporal Pose Correspondence Neighborhood Topology Learning

As discussed in Sections 2.2 and 2.3, the topological base $\mathcal{B}_{SS}$ contains supervised spatial information, while $\mathcal{B}_{TPC}$ captures the temporal pose correspondence. In order to ensure that the topological base embodies both kinds of information, we take the union of $\mathcal{B}_{SS}$ and $\mathcal{B}_{TPC}$, and the fused topology, namely the supervised spatio-temporal neighborhood topology, is defined as

$$\tau_{ST} = \langle \mathcal{B}_{ST} \rangle, \quad \mathcal{B}_{ST} = \{ B_{ST} \mid B_{ST} \in \mathcal{B}_{SS} \text{ or } B_{ST} \in \mathcal{B}_{TPC} \} \qquad (11)$$

Based on (11), the fused neighborhood of each $x_n$ is given by

$$N_{ST}(x_n) = N_{SS}(x_n) \cup N_{TPC}(x_n) \qquad (12)$$

Suppose the manifold $M_{ST}$ is endowed with the neighborhood topology $\tau_{ST}$. In [3], it is shown that the optimal mapping $f$ preserving locality on the manifold $M_{ST}$ is given by solving the following optimization problem

$$\arg\min_{f, \; \|f\|_{L^2(M_{ST})} = 1} \int_{M_{ST}} \| \nabla f \|^2 \qquad (13)$$

To avoid the out-of-sample problem, the mapping $f$ is restricted to be linear. In [8], it is shown that the optimal linear projections preserving locality can be obtained by solving the following optimization problem

$$\arg\min_{e, \; e^T X D X^T e = 1} e^T X L X^T e \qquad (14)$$

where $L$ and $D$ are the Laplacian and diagonal matrices. Since the Laplacian $L$ and the diagonal matrix $D$ are sparse and have a special block structure, $X L X^T$ and $X D X^T$ can be computed as

$$X L X^T = (X_1 \cdots X_c) \, \mathrm{diag}(L_1, \dots, L_c) \, (X_1 \cdots X_c)^T = \sum_{i=1}^{c} X_i L_i X_i^T \qquad (15)$$

$$X D X^T = (X_1 \cdots X_c) \, \mathrm{diag}(D_1, \dots, D_c) \, (X_1 \cdots X_c)^T = \sum_{i=1}^{c} X_i D_i X_i^T \qquad (16)$$

Finally, the algorithmic procedure of the proposed SSTNTL method is presented in Algorithm 3 (Fig. 7).

Algorithm 3. SSTNTL
Input: training sequences $A_1, \dots, A_M$; corresponding labels $y_1, \dots, y_M$; parameters $a_{SS}$, $a_{TPC}$, $l$.
Output: projection matrix $P = (e_1, \dots, e_l)$.
Construct $N_{SS}$ by Algorithm 1;
Construct $N_{TPC}$ by Algorithm 2;
Construct $N_{ST}$ by (12);
Set matrices $M_L = 0$ and $M_D = 0$;
for $i = 1, \dots, c$
    Compute $M_L = M_L + X_i L_i X_i^T$ and $M_D = M_D + X_i D_i X_i^T$;
Solve the generalized eigenvalue problem $M_L e = \lambda M_D e$;
Sort the eigenvectors $e_1, \dots, e_l$ by their eigenvalues $\lambda_1 \le \dots \le \lambda_l$;
return $P = (e_1, \dots, e_l)$.

Figure 7. Algorithm 3. The algorithmic procedure of SSTNTL.
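The projection-learning step of Algorithm 3 amounts to building a graph Laplacian on the fused neighborhoods and solving one generalized eigenvalue problem. The sketch below is a hypothetical NumPy/SciPy rendering under the simple-minded 0/1 weighting; it computes $X L X^T$ and $X D X^T$ globally rather than class by class (equivalent here because the graph decomposes by class), assumes $X D X^T$ is positive definite (which the PCA preprocessing of Section 3.1 helps ensure), and keeps the eigenvectors with the smallest eigenvalues, consistent with the minimization in (14).

```python
import numpy as np
from scipy.linalg import eigh

def sstntl_projections(X, neighborhoods, l):
    """Sketch of the projection-learning step of SSTNTL (Algorithm 3).

    X             : (N, d) data matrix, one row per frame (after PCA).
    neighborhoods : dict n -> iterable of neighbor indices, i.e. the fused
                    N_ST(x_n) = N_SS(x_n) union N_TPC(x_n) built beforehand.
    l             : number of projection directions to keep.
    Returns P of shape (d, l).
    """
    N = X.shape[0]
    # simple-minded 0/1 weights on the fused adjacency graph, symmetrized
    W = np.zeros((N, N))
    for n, nbrs in neighborhoods.items():
        W[n, list(nbrs)] = 1.0
    W = np.maximum(W, W.T)

    D = np.diag(W.sum(axis=1))        # degree (diagonal) matrix
    L = D - W                         # graph Laplacian

    Xt = X.T                          # columns are samples, as in X L X^T
    M_L = Xt @ L @ Xt.T
    M_D = Xt @ D @ Xt.T               # assumed positive definite after PCA

    # generalized eigenproblem M_L e = lambda M_D e; locality preservation
    # keeps the eigenvectors with the smallest eigenvalues
    eigvals, eigvecs = eigh(M_L, M_D)
    return eigvecs[:, :l]
```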

3 Experiments

In this section, we evaluate the proposed SSTNTL on three publicly available action databases: the Weizmann [7] and KTH [29] human action databases and the Cambridge-Gesture database [12]. First, we give a brief introduction to these three databases and the experimental settings in Section 3.1. Then, the results on the three databases are reported in Sections 3.2, 3.3 and 3.4.

3.1 Databases and Settings

The Weizmann database [7] contains 93 videos of nine persons, each performing ten actions: bend, jumping jack, jump in place on two legs, jump forward on two legs, gallop sideways, skip, run, walk, wave one hand, and wave two hands. The motion detection method introduced in [18] is used to extract the region of interest from the raw videos. Since all actions in this database except bend are periodic, each video of a periodic action contains multiple periods of the same action. We segment the action sequences into sets of periodic motion cycles using the information saliency curve [18], following the procedure in [16] [17]. One period of each sequence is used in our experiments, and all frames are normalized to the same size in pixels. We follow the common nine-fold cross-validation protocol [15] [38] to evaluate the proposed method, and the average recognition rate is recorded.

In the KTH database [29], 25 subjects perform six actions under four scenarios. The six actions are boxing, hand clapping, hand waving, jogging, running and walking. The four scenarios are outdoors (S1), outdoors with scale variations (S2), outdoors with different clothes (S3) and indoors (S4). A similar procedure is performed on this database to extract action periods with the information saliency method. In order to make the features less sensitive to background and appearance changes, we use the difference of every two successive frames instead of the original grey-scale images on this database. The leave-one-out cross-validation protocol in [26] [23] [36] is used to evaluate the proposed method, and the average recognition rate over 25 runs for each scenario is recorded.

The Cambridge-Gesture database [12] consists of 900 image sequences of nine hand gesture classes under five kinds of illumination. The nine hand gesture actions are Flat-Leftward (FL), Flat-Rightward (FR), Flat-Contract (FC), Spread-Leftward (SL), Spread-Rightward (SR), Spread-Contract (SC), V-Shape-Leftward (VL), V-Shape-Rightward (VR), and V-Shape-Contract (VC). Each class and illumination set contains 20 sequences. The holistic representation of the spatial envelope [24] is used to compute a 960-dimensional feature vector for each image. Following the experimental protocol in [12] [19], the 20 sequences per class under plain illumination are randomly partitioned into 10 sequences for training and 10 for validation, while testing is performed on the data sets with the other four illuminations.

For these three databases, Principal Component Analysis (PCA) [6] is used for dimensionality reduction to avoid the singular matrix problem and improve the computational efficiency. The PCA dimension is determined by keeping around 90% of the energy. After performing PCA, the manifold embedding methods are used to extract features for action recognition.
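The 90%-energy criterion for the PCA stage can be realized with a standard SVD; the sketch below is one possible rendering (centering and the economy-size SVD are standard choices of this sketch, not details given in the paper).

```python
import numpy as np

def pca_keep_energy(X, energy=0.90):
    """PCA keeping the smallest number of components whose cumulative
    explained variance reaches `energy` (about 90% in the paper).

    X : (N, d) data matrix, one row per frame.
    Returns (projector, mean); a frame x maps to projector.T @ (x - mean).
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # economy-size SVD; squared singular values give per-component variance
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2
    ratio = np.cumsum(var) / var.sum()
    k = int(np.searchsorted(ratio, energy) + 1)
    return Vt[:k].T, mean              # (d, k) projector and the data mean
```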
In our experiments, we compare the proposed SSTNTL with SNTL [20], LPP [8], SLPP [39], LSDA [4] and LSTDE [10]. The weight matrix is calculated by the simple-minded (0/1) method, to avoid the parameter selection required by heat-kernel weighting. As shown in Algorithm 3 (Fig. 7), the parameters $a_{SS}$, $a_{TPC}$ and $l$ need to be determined for SSTNTL. Since it is difficult and time-consuming to search for the best combination of the three parameters simultaneously, we determine them in a two-stage scheme. First, with $l = 60$ fixed, $a_{SS}$ and $a_{TPC}$ are selected from $\{0.02, 0.04, \dots, 0.2\}$ and $\{0.1, 0.2, \dots, 0.9\}$, respectively. Second, the best dimension $l$ is selected from two up to the determined PCA dimension with a step size of two, for the fixed $a_{SS}$ and $a_{TPC}$. The parameter selection procedure for the other methods is similar to that of SSTNTL. The neighborhood parameter in SNTL, LPP and LSDA is selected from the same set as $a_{SS}$ in SSTNTL. Since LSTDE is very slow when the neighborhood is large, its neighborhood parameter is selected from $\{0.01, 0.02, 0.03\}$ instead.

After feature extraction, we use a nearest neighbor framework in the classification stage. Let $A_{test}$ be a testing action sequence. The class label of $A_{test}$ is given by

$$y_{test} = y_{\arg\min_m d(P^T A_{test}, \, P^T A_m)} \qquad (17)$$

where $d$ measures the distance between two sequences $u$ and $v$, e.g. $u = P^T A_{test}$ and $v = P^T A_m$. Since the median Hausdorff distance gives more robust results, as mentioned in [33], we define $d$ as in (18) in our experiments:

$$d(u, v) = \mathrm{median}_j \min_i \| u_i - v_j \| + \mathrm{median}_i \min_j \| u_i - v_j \| \qquad (18)$$
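Equations (17) and (18) together define a simple nearest-neighbor classifier over projected sequences. The sketch below is an illustrative implementation, not the authors' code; it represents each sequence with frames as rows, so projection is written `A @ P` rather than $P^T A$, and the helper names are hypothetical.

```python
import numpy as np

def median_hausdorff(u, v):
    """Eq. (18): symmetric median Hausdorff distance between two projected
    sequences u (T1 x l) and v (T2 x l)."""
    dist = np.linalg.norm(u[:, None, :] - v[None, :, :], axis=-1)  # (T1, T2)
    return np.median(dist.min(axis=0)) + np.median(dist.min(axis=1))

def classify_sequence(A_test, train_seqs, train_labels, P):
    """Eq. (17): nearest-neighbor rule over projected training sequences.

    A_test       : (T, d) testing sequence (frames as rows, after PCA).
    train_seqs   : list of (T_m, d) training sequences A_m.
    train_labels : list of class labels y_m.
    P            : (d, l) projection matrix learned by SSTNTL.
    """
    u = A_test @ P
    costs = [median_hausdorff(u, A_m @ P) for A_m in train_seqs]
    return train_labels[int(np.argmin(costs))]
```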

Figure 8. Recognition accuracy (%) of SSTNTL with different values of $a_{SS}$ and $a_{TPC}$, and fixed $l = 60$, on the Weizmann database.

Figure 9. Recognition accuracy with the best neighborhood parameter and different embedding dimensions on the Weizmann database.

3.2 Results on the Weizmann Human Action Database

The recognition accuracies of SSTNTL with different values of $a_{SS}$ and $a_{TPC}$, and fixed $l = 60$, on this database are shown in Fig. 8. A darker element of the matrix in Fig. 8 represents a higher recognition rate. The columns give the recognition rates for varying $a_{SS}$ at a specific $a_{TPC}$, while the rows give the recognition rates for varying $a_{TPC}$ at a specific $a_{SS}$. From Fig. 8, we can see that the highest accuracy of 100% is achieved on this database when $a_{SS} = 0.06$ and $a_{TPC} = 0.9$. From the last row of Fig. 8, we observe that the recognition rate does not change with $a_{TPC}$ when $a_{SS} = 0.2$. This may be because the supervised spatial neighborhood $N_{SS}$ already contains the temporal pose correspondence neighborhood $N_{TPC}$ when $a_{SS} = 0.2$. Thus, SSTNTL may reduce to SNTL when $a_{SS}$ is large enough.

Fig. 9 shows the recognition accuracies of the different methods at different dimensions. From Fig. 9, we can see that the recognition rate of SSTNTL is higher than those of all the other methods at every dimension. Moreover, when the dimension is less than 20, SSTNTL, SNTL, LPP and SLPP outperform LSDA and LSTDE by a large margin. This result suggests that the neighborhood topology learning methods with a clear manifold interpretation are much better than those without one when the dimension is small.

The highest accuracy and the corresponding optimal dimension of each method are recorded in the left-hand part of Table 1. From Table 1, we can see that SSTNTL not only achieves the best recognition accuracy of 100%, but also has the lowest optimal dimension. This means SSTNTL can recognize actions correctly while requiring only low storage. On the other hand, compared with state-of-the-art algorithms on this database, SSTNTL outperforms the Expandable Data-Driven Graphical Model (EDDGM) [15] and matches the best performance of the Boosted Exemplar Learning (BEL) method [38].

3.3 Results on the KTH Human Action Database

On this database, the highest recognition accuracies of the different manifold embedding methods under the four scenarios are presented in the top six rows of Table 2. Both SSTNTL and SNTL achieve the best performance under scenarios 1 and 3. SSTNTL, SNTL and LSTDE perform better than the other methods under scenario 4, while SSTNTL outperforms the other manifold embedding methods under scenario 2. As observed in Table 2, the recognition rates under scenarios 1, 3 and 4 are higher than those under scenario 2, which means that the motion features for scenarios 1, 3 and 4 are more discriminative than those for scenario 2. In this situation, it is possible that the temporally corresponding poses in different sequences of the same action are nearly the same and already lie in the same supervised spatial neighborhood. Consequently, SSTNTL and SNTL give the same recognition rates under scenarios 1, 3 and 4. For scenario 2, the features are less discriminative; however, SSTNTL achieves a remarkable improvement compared to the other manifold embedding methods, since SSTNTL takes advantage of the temporal pose correspondence to construct the topology.

This also confirms that the temporal pose correspondence information can help to recognize actions.

Table 1. Recognition accuracy (%) and corresponding optimal dimension on the Weizmann database for SSTNTL, SNTL [20], LPP [8], SLPP [39], LSDA [4], LSTDE [10], EDDGM [15] and BEL [38].

Table 2. Recognition accuracy (%) of different methods on the KTH database under scenarios S1–S4 and the mean, for SSTNTL, SNTL [20], LPP [8], SLPP [39], LSDA [4], LSTDE [10], Tracklet [26], HSTM [23] and AFMKL [36].

We also compare the manifold embedding methods with state-of-the-art human action recognition algorithms. The recognition rates of the Tracklet method [26], the Hierarchical Space-Time Model (HSTM) [23] and Multiple Kernel Learning with Augmented Features (AFMKL) [36] are presented in the last three rows of Table 2. Compared with all the methods in Table 2, SSTNTL gives the best performance together with SNTL under scenarios 1 and 3, and together with SNTL and LSTDE under scenario 4, while Tracklet outperforms the others under scenario 2. The reason why SSTNTL is not better than Tracklet under scenario 2 can be explained as follows. The features used in SSTNTL are obtained by motion detection. Since motion detection under scenario 2, with its scale variations, is more challenging than under the other three scenarios, the features obtained in scenario 2 are less robust than those in the other scenarios. Thus, the performance of SSTNTL drops together with that of the other manifold embedding methods. Since the Tracklet and AFMKL methods rely on local features without motion detection, they outperform the embedding methods under scenario 2. In addition, we compare the mean accuracies of all the methods in the last column of Table 2. The mean recognition rate of SSTNTL is higher than those of the other manifold embedding methods as well as the state-of-the-art human action recognition algorithms, which confirms that SSTNTL is preferable for human action recognition.

Figure 10. Confusion matrices on the KTH database: (a) SSTNTL, (b) SNTL [20], (c) Tracklet [26], (d) AFMKL [36].

While SSTNTL achieves the highest average recognition rate of 95.7%, we are interested in investigating which actions were misclassified. The confusion matrices of the best four algorithms are shown in Fig. 10. We can see that the actions jogging and running are confused with each other by all the methods. This is because jogging and running are nearly the same except for the speed at which the two actions are performed. Another observation is that the recognition rates for each action with SSTNTL, SNTL and AFMKL are larger than or equal to 90%, while only SSTNTL achieves 100% accuracy on some actions among these three methods. This means that SSTNTL not only gives high performance on all the actions, but also recognizes some actions perfectly, which is another advantage of the proposed method.

3.4 Results on the Hand Gesture Action Database

Recognition accuracies for the different illumination sets of this database are presented in Table 3. Compared with the other manifold embedding methods, SSTNTL and SNTL obtain the highest recognition rate for illumination set 1, while SSTNTL outperforms the others for sets 2, 3 and 4 and in mean accuracy. This confirms that SSTNTL is better than the topology learning methods with other topologies, as well as the local discriminative algorithms even with temporal information.

Table 3. Recognition accuracy (%) of different methods on the Cambridge-Gesture database under illumination sets 1–4 and the mean, for SSTNTL, SNTL [20], LPP [8], SLPP [39], LSDA [4], LSTDE [10], TCCA [12] and PM [19].

We also compare the proposed method with state-of-the-art gesture action recognition algorithms. Tensor Canonical Correlation Analysis (TCCA) [12] and the Product Manifold (PM) method [19] are used for comparison with the manifold embedding methods. From Table 3, we can see that SSTNTL, SNTL and PM all obtain the highest accuracy of 89% for illumination set 1. The product manifold method outperforms the others for set 3, while SSTNTL gives the best performance for sets 2 and 4. Comparing mean recognition rates, SSTNTL achieves the highest mean accuracy of 89% on this database.

Figure 11. Confusion matrices on the Cambridge-Gesture database: (a) SSTNTL, (b) LPP [8], (c) TCCA [12], (d) PM [19].

Finally, the confusion matrices of the best four algorithms are shown in Fig. 11. Comparing the diagonal elements of the confusion matrices, SSTNTL outperforms SNTL on eight actions, and TCCA and the product manifold method on five actions each. This means SSTNTL outperforms the others on more than half of the gesture actions in this database. Overall, SSTNTL also performs well for hand gesture action recognition.

4 Conclusion

In this paper, we have proposed a novel manifold learning method, namely supervised spatio-temporal neighborhood topology learning (SSTNTL), for action classification. Starting from an analysis of the topological characteristics in the context of action recognition, we proposed to construct the neighborhood topology from two aspects. First, the spatial distribution containing local structure, as well as the label information separating data points from different classes, is used to construct the supervised spatial topology. Second, the temporal pose correspondence neighborhood is constructed by discovering the global similarity between action sequences of the same class. These two neighborhood topologies are fused by taking their union. Based on the fused topology, SSTNTL employs the locality preserving property of LPP and solves a generalized eigenvalue problem to obtain the projections that not only separate data points from different classes, but also preserve the local structure and temporal pose correspondence of sequences from the same class.

The proposed algorithm is evaluated on three publicly available action video databases. Experimental results show that SSTNTL outperforms the neighborhood topology learning methods with other topologies, as well as the local discriminant algorithms even with temporal information. On the other hand, by analyzing the 2D visualizations of the training and testing data, we show that the generalization ability of the topology learning methods is better than that of the local-discriminability-based methods when the reduced dimension is low, and SSTNTL shows the best generalization ability. In addition, compared with state-of-the-art action recognition algorithms, SSTNTL gives the best performance for both human and gesture action recognition.

References

[1] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Computing Surveys, 43(3):16:1–16:43.
[2] S. Baysal, M. C. Kurt, and P. Duygulu. Recognizing human actions using key poses. Proceedings of the IEEE International Conference on Pattern Recognition.
[3] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems.
[4] D. Cai, X. He, K. Zhou, J. Han, and H. Bao. Locality sensitive discriminant analysis. Proceedings of the 20th International Joint Conference on Artificial Intelligence.
[5] M. P. do Carmo. Riemannian Geometry. Birkhäuser, 1993.

[6] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley.
[7] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12).
[8] X. He and P. Niyogi. Locality preserving projections. Advances in Neural Information Processing Systems.
[9] O. C. Jenkins and M. J. Matarić. A spatio-temporal extension to Isomap nonlinear dimension reduction. Proceedings of the International Conference on Machine Learning.
[10] K. Jia and D.-Y. Yeung. Human action recognition using local spatio-temporal discriminant embedding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8.
[11] J. L. Kelley. General Topology. Birkhäuser.
[12] T.-K. Kim and R. Cipolla. Canonical correlation analysis of video volume tensors for action categorization and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(8).
[13] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8.
[14] M. Lewandowski, J. M. del Rincon, D. Makris, and J.-C. Nebel. Temporal extension of Laplacian eigenmaps for unsupervised dimensionality reduction of time series. Proceedings of the IEEE International Conference on Pattern Recognition.
[15] W. Li, Z. Zhang, and Z. Liu. Expandable data-driven graphical modeling of human actions based on salient postures. IEEE Transactions on Circuits and Systems for Video Technology, 18(11).
[16] C. Liu and P. C. Yuen. Human action recognition using boosted eigenactions. Image and Vision Computing, 28(5).
[17] C. Liu and P. C. Yuen. A boosted co-training algorithm for human action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 21(9).
[18] C. Liu, P. C. Yuen, and G. Qiu. Object motion detection using information theoretic spatio-temporal saliency. Pattern Recognition, 42(11).
[19] Y. M. Lui, J. R. Beveridge, and M. Kirby. Action classification on product manifolds. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[20] A. J. Ma, P. C. Yuen, W. Zou, and J.-H. Lai. Supervised neighborhood topology learning for human action recognition. IEEE International Conference on Computer Vision Workshops.
[21] R. Messing, C. Pal, and H. Kautz. Activity recognition using the velocity histories of tracked keypoints. Proceedings of the IEEE International Conference on Computer Vision.
[22] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 79(3).
[23] H. Ning, T. X. Han, D. B. Walther, M. Liu, and T. S. Huang. Hierarchical space-time model enabling efficient search for human actions. IEEE Transactions on Circuits and Systems for Video Technology, 19(6).
[24] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3).
[25] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6).
[26] M. Raptis and S. Soatto. Tracklet descriptors for action modeling and video analysis. Proceedings of the European Conference on Computer Vision, 6311/2010.
[27] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290.
[28] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 2(1):43–49.
[29] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. Proceedings of the IEEE International Conference on Pattern Recognition.
[30] M. Sugiyama. Local Fisher discriminant analysis for supervised dimensionality reduction. Proceedings of the 23rd International Conference on Machine Learning.
[31] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290.
[32] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea. Machine recognition of human activities: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 18(11).
[33] L. Wang and D. Suter. Learning and matching of dynamic shape manifolds for human action recognition. IEEE Transactions on Image Processing, 16(6).
[34] L. Wang and D. Suter. Visual learning and recognition of sequential data manifolds with applications to human movement analysis. Computer Vision and Image Understanding, 110(2).
[35] D. Weinland and E. Boyer. Action recognition using exemplar-based embedding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–7.
[36] X. Wu, D. Xu, L. Duan, and J. Luo. Action recognition using context and appearance distribution features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[37] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1):40–51.
[38] T. Zhang, J. Liu, C. X. Si Liu, and H. Lu. Boosted exemplar learning for action recognition and annotation. IEEE Transactions on Circuits and Systems for Video Technology, 21(7).
[39] Z. Zheng, F. Yang, W. Tan, J. Jia, and J. Yang. Gabor feature-based face recognition using supervised locality preserving projection. Signal Processing, 87, 2007.


More information

FACE RECOGNITION USING SUPPORT VECTOR MACHINES

FACE RECOGNITION USING SUPPORT VECTOR MACHINES FACE RECOGNITION USING SUPPORT VECTOR MACHINES Ashwin Swaminathan ashwins@umd.edu ENEE633: Statistical and Neural Pattern Recognition Instructor : Prof. Rama Chellappa Project 2, Part (b) 1. INTRODUCTION

More information

Laplacian MinMax Discriminant Projection and its Applications

Laplacian MinMax Discriminant Projection and its Applications Laplacian MinMax Discriminant Projection and its Applications Zhonglong Zheng and Xueping Chang Department of Computer Science, Zhejiang Normal University, Jinhua, China Email: zhonglong@sjtu.org Jie Yang

More information

Manifold Clustering. Abstract. 1. Introduction

Manifold Clustering. Abstract. 1. Introduction Manifold Clustering Richard Souvenir and Robert Pless Washington University in St. Louis Department of Computer Science and Engineering Campus Box 1045, One Brookings Drive, St. Louis, MO 63130 {rms2,

More information

Sparse Manifold Clustering and Embedding

Sparse Manifold Clustering and Embedding Sparse Manifold Clustering and Embedding Ehsan Elhamifar Center for Imaging Science Johns Hopkins University ehsan@cis.jhu.edu René Vidal Center for Imaging Science Johns Hopkins University rvidal@cis.jhu.edu

More information

Dependency Modeling for Information Fusion with Applications in Visual Recognition

Dependency Modeling for Information Fusion with Applications in Visual Recognition Dependency Modeling for Information Fusion with Applications in Visual Recognition MA Jinhua A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy Principal

More information

Jointly Learning Data-Dependent Label and Locality-Preserving Projections

Jointly Learning Data-Dependent Label and Locality-Preserving Projections Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence Jointly Learning Data-Dependent Label and Locality-Preserving Projections Chang Wang IBM T. J. Watson Research

More information

Semi-supervised Learning by Sparse Representation

Semi-supervised Learning by Sparse Representation Semi-supervised Learning by Sparse Representation Shuicheng Yan Huan Wang Abstract In this paper, we present a novel semi-supervised learning framework based on l 1 graph. The l 1 graph is motivated by

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

Bag of Optical Flow Volumes for Image Sequence Recognition 1

Bag of Optical Flow Volumes for Image Sequence Recognition 1 RIEMENSCHNEIDER, DONOSER, BISCHOF: BAG OF OPTICAL FLOW VOLUMES 1 Bag of Optical Flow Volumes for Image Sequence Recognition 1 Hayko Riemenschneider http://www.icg.tugraz.at/members/hayko Michael Donoser

More information

Aggregating Descriptors with Local Gaussian Metrics

Aggregating Descriptors with Local Gaussian Metrics Aggregating Descriptors with Local Gaussian Metrics Hideki Nakayama Grad. School of Information Science and Technology The University of Tokyo Tokyo, JAPAN nakayama@ci.i.u-tokyo.ac.jp Abstract Recently,

More information

Homeomorphic Manifold Analysis (HMA): Generalized Separation of Style and Content on Manifolds

Homeomorphic Manifold Analysis (HMA): Generalized Separation of Style and Content on Manifolds Homeomorphic Manifold Analysis (HMA): Generalized Separation of Style and Content on Manifolds Ahmed Elgammal a,,, Chan-Su Lee b a Department of Computer Science, Rutgers University, Frelinghuysen Rd,

More information

Video-based Human Activity Analysis: An Operator-based Approach

Video-based Human Activity Analysis: An Operator-based Approach Video-based Human Activity Analysis: An Operator-based Approach Xiao Bian North Carolina State University Department of Electrical and Computer Engineering 27606, Raleigh, NC, USA xbian@ncsu.edu Hamid

More information

Discriminative Embedding via Image-to-Class Distances

Discriminative Embedding via Image-to-Class Distances X. ZHEN, L. SHAO, F. ZHENG: 1 Discriminative Embedding via Image-to-Class Distances Xiantong Zhen zhenxt@gmail.com Ling Shao ling.shao@ieee.org Feng Zheng cip12fz@sheffield.ac.uk Department of Medical

More information

Recognizing Handwritten Digits Using the LLE Algorithm with Back Propagation

Recognizing Handwritten Digits Using the LLE Algorithm with Back Propagation Recognizing Handwritten Digits Using the LLE Algorithm with Back Propagation Lori Cillo, Attebury Honors Program Dr. Rajan Alex, Mentor West Texas A&M University Canyon, Texas 1 ABSTRACT. This work is

More information

Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions

Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions Akitsugu Noguchi and Keiji Yanai Department of Computer Science, The University of Electro-Communications, 1-5-1 Chofugaoka,

More information

Evaluation of local descriptors for action recognition in videos

Evaluation of local descriptors for action recognition in videos Evaluation of local descriptors for action recognition in videos Piotr Bilinski and Francois Bremond INRIA Sophia Antipolis - PULSAR group 2004 route des Lucioles - BP 93 06902 Sophia Antipolis Cedex,

More information

Fuzzy Bidirectional Weighted Sum for Face Recognition

Fuzzy Bidirectional Weighted Sum for Face Recognition Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2014, 6, 447-452 447 Fuzzy Bidirectional Weighted Sum for Face Recognition Open Access Pengli Lu

More information

3D Posture Representation Using Meshless Parameterization with Cylindrical Virtual Boundary

3D Posture Representation Using Meshless Parameterization with Cylindrical Virtual Boundary 3D Posture Representation Using Meshless Parameterization with Cylindrical Virtual Boundary Yunli Lee and Keechul Jung School of Media, College of Information Technology, Soongsil University, Seoul, South

More information

ACTIVE CLASSIFICATION FOR HUMAN ACTION RECOGNITION. Alexandros Iosifidis, Anastasios Tefas and Ioannis Pitas

ACTIVE CLASSIFICATION FOR HUMAN ACTION RECOGNITION. Alexandros Iosifidis, Anastasios Tefas and Ioannis Pitas ACTIVE CLASSIFICATION FOR HUMAN ACTION RECOGNITION Alexandros Iosifidis, Anastasios Tefas and Ioannis Pitas Depeartment of Informatics, Aristotle University of Thessaloniki, Greece {aiosif,tefas,pitas}@aiia.csd.auth.gr

More information

Computation Strategies for Volume Local Binary Patterns applied to Action Recognition

Computation Strategies for Volume Local Binary Patterns applied to Action Recognition Computation Strategies for Volume Local Binary Patterns applied to Action Recognition F. Baumann, A. Ehlers, B. Rosenhahn Institut für Informationsverarbeitung (TNT) Leibniz Universität Hannover, Germany

More information

Dimension Reduction of Image Manifolds

Dimension Reduction of Image Manifolds Dimension Reduction of Image Manifolds Arian Maleki Department of Electrical Engineering Stanford University Stanford, CA, 9435, USA E-mail: arianm@stanford.edu I. INTRODUCTION Dimension reduction of datasets

More information

Latest development in image feature representation and extraction

Latest development in image feature representation and extraction International Journal of Advanced Research and Development ISSN: 2455-4030, Impact Factor: RJIF 5.24 www.advancedjournal.com Volume 2; Issue 1; January 2017; Page No. 05-09 Latest development in image

More information

Incremental Action Recognition Using Feature-Tree

Incremental Action Recognition Using Feature-Tree Incremental Action Recognition Using Feature-Tree Kishore K Reddy Computer Vision Lab University of Central Florida kreddy@cs.ucf.edu Jingen Liu Computer Vision Lab University of Central Florida liujg@cs.ucf.edu

More information

Learning a Spatially Smooth Subspace for Face Recognition

Learning a Spatially Smooth Subspace for Face Recognition Learning a Spatially Smooth Subspace for Face Recognition Deng Cai Xiaofei He Yuxiao Hu Jiawei Han Thomas Huang University of Illinois at Urbana-Champaign Yahoo! Research Labs {dengcai2, hanj}@cs.uiuc.edu,

More information

Image-Based Face Recognition using Global Features

Image-Based Face Recognition using Global Features Image-Based Face Recognition using Global Features Xiaoyin xu Research Centre for Integrated Microsystems Electrical and Computer Engineering University of Windsor Supervisors: Dr. Ahmadi May 13, 2005

More information

Facial Expression Recognition Using Expression- Specific Local Binary Patterns and Layer Denoising Mechanism

Facial Expression Recognition Using Expression- Specific Local Binary Patterns and Layer Denoising Mechanism Facial Expression Recognition Using Expression- Specific Local Binary Patterns and Layer Denoising Mechanism 1 2 Wei-Lun Chao, Jun-Zuo Liu, 3 Jian-Jiun Ding, 4 Po-Hung Wu 1, 2, 3, 4 Graduate Institute

More information

Stratified Structure of Laplacian Eigenmaps Embedding

Stratified Structure of Laplacian Eigenmaps Embedding Stratified Structure of Laplacian Eigenmaps Embedding Abstract We construct a locality preserving weight matrix for Laplacian eigenmaps algorithm used in dimension reduction. Our point cloud data is sampled

More information

Adaptive Action Detection

Adaptive Action Detection Adaptive Action Detection Illinois Vision Workshop Dec. 1, 2009 Liangliang Cao Dept. ECE, UIUC Zicheng Liu Microsoft Research Thomas Huang Dept. ECE, UIUC Motivation Action recognition is important in

More information

EigenJoints-based Action Recognition Using Naïve-Bayes-Nearest-Neighbor

EigenJoints-based Action Recognition Using Naïve-Bayes-Nearest-Neighbor EigenJoints-based Action Recognition Using Naïve-Bayes-Nearest-Neighbor Xiaodong Yang and YingLi Tian Department of Electrical Engineering The City College of New York, CUNY {xyang02, ytian}@ccny.cuny.edu

More information

FACE RECOGNITION FROM A SINGLE SAMPLE USING RLOG FILTER AND MANIFOLD ANALYSIS

FACE RECOGNITION FROM A SINGLE SAMPLE USING RLOG FILTER AND MANIFOLD ANALYSIS FACE RECOGNITION FROM A SINGLE SAMPLE USING RLOG FILTER AND MANIFOLD ANALYSIS Jaya Susan Edith. S 1 and A.Usha Ruby 2 1 Department of Computer Science and Engineering,CSI College of Engineering, 2 Research

More information

LEARNING A SPARSE DICTIONARY OF VIDEO STRUCTURE FOR ACTIVITY MODELING. Nandita M. Nayak, Amit K. Roy-Chowdhury. University of California, Riverside

LEARNING A SPARSE DICTIONARY OF VIDEO STRUCTURE FOR ACTIVITY MODELING. Nandita M. Nayak, Amit K. Roy-Chowdhury. University of California, Riverside LEARNING A SPARSE DICTIONARY OF VIDEO STRUCTURE FOR ACTIVITY MODELING Nandita M. Nayak, Amit K. Roy-Chowdhury University of California, Riverside ABSTRACT We present an approach which incorporates spatiotemporal

More information

Robust Face Recognition via Sparse Representation Authors: John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma

Robust Face Recognition via Sparse Representation Authors: John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma Robust Face Recognition via Sparse Representation Authors: John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma Presented by Hu Han Jan. 30 2014 For CSE 902 by Prof. Anil K. Jain: Selected

More information

arxiv: v3 [cs.cv] 3 Oct 2012

arxiv: v3 [cs.cv] 3 Oct 2012 Combined Descriptors in Spatial Pyramid Domain for Image Classification Junlin Hu and Ping Guo arxiv:1210.0386v3 [cs.cv] 3 Oct 2012 Image Processing and Pattern Recognition Laboratory Beijing Normal University,

More information

EVENT DETECTION AND HUMAN BEHAVIOR RECOGNITION. Ing. Lorenzo Seidenari

EVENT DETECTION AND HUMAN BEHAVIOR RECOGNITION. Ing. Lorenzo Seidenari EVENT DETECTION AND HUMAN BEHAVIOR RECOGNITION Ing. Lorenzo Seidenari e-mail: seidenari@dsi.unifi.it What is an Event? Dictionary.com definition: something that occurs in a certain place during a particular

More information

Learning Human Actions with an Adaptive Codebook

Learning Human Actions with an Adaptive Codebook Learning Human Actions with an Adaptive Codebook Yu Kong, Xiaoqin Zhang, Weiming Hu and Yunde Jia Beijing Laboratory of Intelligent Information Technology School of Computer Science, Beijing Institute

More information

An efficient face recognition algorithm based on multi-kernel regularization learning

An efficient face recognition algorithm based on multi-kernel regularization learning Acta Technica 61, No. 4A/2016, 75 84 c 2017 Institute of Thermomechanics CAS, v.v.i. An efficient face recognition algorithm based on multi-kernel regularization learning Bi Rongrong 1 Abstract. A novel

More information

SINGLE-SAMPLE-PER-PERSON-BASED FACE RECOGNITION USING FAST DISCRIMINATIVE MULTI-MANIFOLD ANALYSIS

SINGLE-SAMPLE-PER-PERSON-BASED FACE RECOGNITION USING FAST DISCRIMINATIVE MULTI-MANIFOLD ANALYSIS SINGLE-SAMPLE-PER-PERSON-BASED FACE RECOGNITION USING FAST DISCRIMINATIVE MULTI-MANIFOLD ANALYSIS Hsin-Hung Liu 1, Shih-Chung Hsu 1, and Chung-Lin Huang 1,2 1. Department of Electrical Engineering, National

More information

Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information

Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information Ana González, Marcos Ortega Hortas, and Manuel G. Penedo University of A Coruña, VARPA group, A Coruña 15071,

More information

Statistics of Pairwise Co-occurring Local Spatio-Temporal Features for Human Action Recognition

Statistics of Pairwise Co-occurring Local Spatio-Temporal Features for Human Action Recognition Statistics of Pairwise Co-occurring Local Spatio-Temporal Features for Human Action Recognition Piotr Bilinski, Francois Bremond To cite this version: Piotr Bilinski, Francois Bremond. Statistics of Pairwise

More information

Texture Representations Using Subspace Embeddings

Texture Representations Using Subspace Embeddings Texture Representations Using Subspace Embeddings Xiaodong Yang and YingLi Tian Department of Electrical Engineering The City College of New York, CUNY {xyang02, ytian}@ccny.cuny.edu Abstract In this paper,

More information

SUSPICIOUS MOTION DETECTION IN SURVEILLANCE VIDEO

SUSPICIOUS MOTION DETECTION IN SURVEILLANCE VIDEO SUSPICIOUS MOTION DETECTION IN SURVEILLANCE VIDEO V.Padmavathi 1 Dr.M.Kalaiselvi Geetha 2 ABSTRACT Video Surveillance systems are used for any type of monitoring. Action recognition has drawn much attention

More information

Learning Semantic Features for Action Recognition via Diffusion Maps

Learning Semantic Features for Action Recognition via Diffusion Maps Learning Semantic Features for Action Recognition via Diffusion Maps Jingen Liu a, Yang Yang, Imran Saleemi and Mubarak Shah b a Department of EECS, University of Michigan, Ann Arbor, MI, USA b Department

More information

Multidirectional 2DPCA Based Face Recognition System

Multidirectional 2DPCA Based Face Recognition System Multidirectional 2DPCA Based Face Recognition System Shilpi Soni 1, Raj Kumar Sahu 2 1 M.E. Scholar, Department of E&Tc Engg, CSIT, Durg 2 Associate Professor, Department of E&Tc Engg, CSIT, Durg Email:

More information

Facial Expression Recognition with Emotion-Based Feature Fusion

Facial Expression Recognition with Emotion-Based Feature Fusion Facial Expression Recognition with Emotion-Based Feature Fusion Cigdem Turan 1, Kin-Man Lam 1, Xiangjian He 2 1 The Hong Kong Polytechnic University, Hong Kong, SAR, 2 University of Technology Sydney,

More information

Face Recognition Using Vector Quantization Histogram and Support Vector Machine Classifier Rong-sheng LI, Fei-fei LEE *, Yan YAN and Qiu CHEN

Face Recognition Using Vector Quantization Histogram and Support Vector Machine Classifier Rong-sheng LI, Fei-fei LEE *, Yan YAN and Qiu CHEN 2016 International Conference on Artificial Intelligence: Techniques and Applications (AITA 2016) ISBN: 978-1-60595-389-2 Face Recognition Using Vector Quantization Histogram and Support Vector Machine

More information

Gesture Recognition Under Small Sample Size

Gesture Recognition Under Small Sample Size Gesture Recognition Under Small Sample Size Tae-Kyun Kim 1 and Roberto Cipolla 2 1 Sidney Sussex College, University of Cambridge, Cambridge, CB2 3HU, UK 2 Department of Engineering, University of Cambridge,

More information