Histogram of Oriented Principal Components for Cross-View Action Recognition


Histogram of Oriented Principal Components for Cross-View Action Recognition

arXiv: v2 [cs.cv] 3 Sep 2015

Hossein Rahmani, Arif Mahmood, Du Huynh, Member, IEEE, and Ajmal Mian, Member, IEEE

Abstract: Existing techniques for 3D action recognition are sensitive to viewpoint variations because they extract features from depth images, which are viewpoint dependent. In contrast, we directly process pointclouds for cross-view action recognition from unknown and unseen views. We propose the Histogram of Oriented Principal Components (HOPC) descriptor, which is robust to noise, viewpoint, scale and action speed variations. At a 3D point, HOPC is computed by projecting the three scaled eigenvectors of the pointcloud within its local spatio-temporal support volume onto the vertices of a regular dodecahedron. HOPC is also used for the detection of Spatio-Temporal Keypoints (STKs) in 3D pointcloud sequences, so that view-invariant STK descriptors (or Local HOPC descriptors) at these key locations only are used for action recognition. We also propose a global descriptor computed from the normalized spatio-temporal distribution of STKs in 4D, which we refer to as STK-D. We have evaluated the performance of our proposed descriptors against nine existing techniques on two cross-view and three single-view human action recognition datasets. Experimental results show that our techniques provide significant improvement over state-of-the-art methods.

Index Terms: Spatio-temporal keypoint, pointcloud, view invariance.

1 INTRODUCTION

Human action recognition has numerous applications in smart surveillance, human-computer interaction, sports and elderly care [1], [2]. Kinect-like depth cameras have become popular for this task because depth sequences are somewhat immune to variations in illumination, clothing color and texture. However, the presence of occlusions, sensor noise, variations in action execution speed and, most importantly, sensor viewpoint still make action recognition challenging. Designing an efficient representation for 3D video sequences is an important task for many computer vision problems. Most existing techniques (e.g. [3]-[7]) treat depth sequences similar to conventional videos and use color-based action recognition representations. However, simple extensions of color-based action recognition techniques to depth sequences are not optimal [8], [9]. Instead of processing depth sequences, richer geometric features can be extracted from 3D pointcloud videos. Action recognition research [4]-[14] has mainly focused on actions captured from a fixed viewpoint. However, a practical human action recognition system should be able to recognize actions from different views. Some view-invariant approaches [15]-[28] have also been proposed for cross-view action recognition, where recognition is performed from an unknown and/or unseen view. These approaches generally rely on geometric constraints [15]-[19], view-invariant features [20]-[26], and human body joint tracking [27], [28]. More recent approaches transfer features across views [29]-[36].

The authors are with the School of Computer Science and Software Engineering, The University of Western Australia, 35 Stirling Highway, Crawley, Western Australia. hossein@csse.uwa.edu.au, {arif.mahmood,du.huynh,ajmal.mian}@uwa.edu.au

Fig. 1: 3D pointcloud sequences of a subject performing the holding head action. Notice how the depth values (color) change significantly with viewpoint.
However, these methods do not perform as well as fixed-view action recognition. The majority of cross-view action recognition research has focused on color videos or skeleton data. Cross-view action recognition from 3D videos remains an under-explored area. We believe that cross-view action recognition from 3D pointcloud videos holds more promise because view-invariant features can be extracted from such videos. We approach the cross-view action recognition problem from a novel perspective by directly processing the 3D pointcloud sequences (Fig. 1). We extend our previous research [37], in which we proposed a new descriptor, the Histogram of Oriented Principal Components (HOPC), to capture the local geometric characteristics around each point in a 3D pointcloud sequence. Based on HOPC, we propose a Spatio-Temporal Keypoint (STK) detection method so that

2 MANUSCRIPT FOR REVIEW, 2014 view-invariant Local HOPC descriptors are extracted from the most discriminative points within a sequence of 3D pointclouds. We also propose another descriptor, STK-D, which is computed from the spatio-temporal distribution of the STKs. Since Local HOPC and STK-D capture complementary information, their combination significantly improves the cross-view action recognition accuracy over existing state-of-the-art. To achieve view invariance for HOPC, all points within an adaptable spatio-temporal support volume of each STK are aligned along the eigenvectors of its spatial support volume. In other words, the spatio-temporal support volume is aligned in a local object centered coordinate basis. Thus, HOPC descriptor extracted from this aligned support volume is view-invariant (Fig. 2). Note that this strategy does not necessarity work for other descriptors as shown in Fig. 2. As humans often perform the same action at different speeds, for speed invariance, we propose automatic temporal scale selection that minimizes the eigenratios over a varying temporal window size independently at each STK. Our four main contributions are summarized as follows: Firstly, we propose the HOPC descriptor which encodes shape and motion in a robust way. Secondly, we propose a view-invariant Spatio-Temporal Keypoint (STK) detector that is integrated with HOPC in the sense that it detects points that are suitable for HOPC. Thirdly, we propose a global action descriptor based on the spatio-temporal distribution of STKs. Finally, we propose a method for viewpoint and speed invariant action recognition. Moreover, we introduce a new UWA3D Multiview Activity II dataset in addition to [37] which contains 30 actions performed by 10 subjects from four different views. This dataset is larger in number of action classes than existing 3D action datasets. The proposed descriptors have been evaluated on two multi-view and three single-view human action recognition datasets. The former includes the Northwestern-UCLA Multiview Action3D [29] and the UWA3D Multiview Activity II datasets whereas the latter includes MSR Action3D [38], MSR Daily Activity3D [39], and MSR Gesture3D [40] datasets. Our extensive experimental results show that the proposed descriptors achieved significantly better accuracy compared to the nine existing state-of-the-art techniques [3], [9], [10], [27] [29], [36], [59], [60]. 2 R ELATED W ORK Based on the data type, action recognition methods can be divided into three categories including color-based, skeleton-based and depth-based methods. In color videos, a significant portion of the existing work has been proposed for single-view action recognition, where the training and test videos are captured from the same view. In order to recognize actions across different views, one approach is to collect data from all possible views and train a separate classifier for each view. However, this approach does not scale well due to the requirement of a large number of labeled samples for each view and it becomes infeasible as 2 Fig. 2: After orientation normalization, the HOPC descriptors are similar for the two views. However, the HON and HOG descriptors are still different. the number of action categories increases. To overcome this problem, some techniques infer 3D scene structure and use geometric transformations to achieve view invariance [15] [19]. 
These methods critically rely on accurate detection of the body joints and contours, which are still open problems in real-world settings. Other methods focus on spatiotemporal features which are inherently view-invariant [20] [26]. However, these methods have limitations as some of them require access to mocap data while others compromise discriminative power to achieve view invariance [41]. More recently, knowledge transfer based methods [29] [36] have become popular. These methods find a view independent latent space in which features extracted from different views are directly comparable. Such methods are either not applicable or perform poorly when the recog-

3 MANUSCRIPT FOR REVIEW, nition is performed on videos from unknown and, more importantly, from unseen views. To overcome this problem, Wang et al. [29] proposed cross-view action representation by exploiting the compositional structure in spatio-temporal patterns and geometrical relations among views. Although their method can be applied to action recognition from unknown and unseen views, it requires 3D skeleton data for training which is not always available. Our proposed approach also falls in this category except that it uses 3D pointcloud sequences and does not require skeleton data. To the best of our knowledge, we are the first to propose cross-view action recognition using 3D pointcloud videos. In skeleton-based action recognition methods, multicamera motion capture (MoCap) systems [42] have been used for human action recognition. However, such specialized equipment is marker-based and expensive. On the other hand, some other methods [13], [14], [27], [28], [39], [59] use the human joint positions extracted by the OpenNI tracking framework [43]. For example, Yang and Tian [14] used pairwise 3D joint position differences in each frame and temporal differences across frames to represent an action. Since 3D joints cannot capture all the discriminative information, the action recognition accuracy is compromised. Wang et al. [39] extended this approach by computing the histogram of occupancy patterns of a fixed region around each joint in each frame. In order to make this method more robust to viewpoint variations, they proposed a global orientation normalization using the skeleton data [28]. In this method, a plane is fitted to the joints and a rotation matrix is computed to rotate this plane to the XY -plane. However, this method is only applicable if the subject is in an upright pose. Moreover, when the subject is in a non-frontal view, the joint positions may have large errors, making the normalization process unreliable. In contrast, our proposed orientation normalization method does not need the joint positions and can efficiently work in non-frontal as well as non upright positions. In addition to that, as our method performs local orientation normalization at each STK, it is more robust than the single global normalization proposed by [28]. Many of the existing depth-based action recognition methods use global features such as silhouettes and spacetime volume information. For example, Li et al. [38] sampled boundary pixels from 2D silhouettes as a bag of features. Yang et al. [7] added temporal derivatives of 2D projections to get Depth Motion Maps (DMM). Vieira et al. [44] computed silhouettes in 3D by using the spacetime occupancy patterns. Oreifej and Liu [9] extended histogram of oriented 3D normals [45] to 4D by adding the time derivative. Recently, Yang and Tian [10] extended HON4D by concatenating the 4D normals in the local neighbourhood of each pixel as its descriptor. Our proposed HOPC descriptor is more informative than HON4D [37] because it captures the spread of data in three principal directions. Holistic methods may fail in scenarios where the subject significantly changes her/his spatial position [9], [10]. Some other methods use local features where a set of interest points are extracted from the depth sequence and a local feature descriptor is computed for each interest point. For example, Cheng et al. [3] used the Cuboid interest point detector [46] and proposed a Comparative Coding Descriptor (CCD). 
Due to the presence of noise in depth sequences, simply extending color-based interest point detectors, such as Cuboid [47], 3D Hessian [48] and 3D Harris [46], degrades the efficiency and effectiveness of these detectors, as most interest points are detected at irrelevant locations [8], [9]. Motion trajectory based action recognition methods [20], [49]-[51] are also not reliable in depth sequences [9]. Therefore, recent depth based action recognition methods resorted to alternative ways to extract more reliable interest points. Wang et al. [40] proposed Haar features to be extracted from each random subvolume. Xia and Aggarwal [8] proposed a filtering method to extract spatio-temporal interest points. Their approach fails when the action execution speed is faster than the flip of the signal caused by sensor noise. Moreover, both techniques are not robust to viewpoint variations.

3 HOPC: HISTOGRAM OF ORIENTED PRINCIPAL COMPONENTS

HOPC is extracted at each point within a sequence of 3D pointclouds $\mathbf{Q} = \mathrm{seq}(Q_1, \dots, Q_t, \dots, Q_{n_f})$, where $n_f$ denotes the number of 3D pointclouds in the sequence and $Q_t$ is the 3D pointcloud at time $t$. Consider a point $p = (x_t, y_t, z_t)^\top$ in $Q_t$. We define two different support volumes for $p$: a spatial support volume and a spatio-temporal support volume. The spatial support volume of $p$, denoted by $\Omega^{S}(p)$, contains the 3D points in $Q_t$ that are in a sphere of radius $r$ centered at $p$ (Fig. 3(b)). To define the spatio-temporal support volume of $p$, denoted by $\Omega^{ST}(p)$, we merge the sequence of pointclouds in the small time interval $[t-\tau, t+\tau]$. The 3D points which are in a sphere of radius $r$ centered at $p$ are considered as $\Omega^{ST}(p)$ (Fig. 3(c)). The covariance matrix $C^{\alpha}$ of the points $q \in \Omega^{\alpha}(p)$, $\alpha \in \{ST, S\}$, is given by:

$$C^{\alpha} = \frac{1}{n_p} \sum_{q \in \Omega^{\alpha}(p)} (q - \mu)(q - \mu)^\top, \qquad (1)$$

where $\mu = \frac{1}{n_p} \sum_{q \in \Omega^{\alpha}(p)} q$ and $n_p = |\Omega^{\alpha}(p)|$ denotes the number of points in the support volume of $p$. Performing eigen decomposition on the covariance matrix $C^{\alpha}$ gives us:

$$V^{\alpha} E^{\alpha} V^{\alpha\top} = C^{\alpha}, \qquad (2)$$

where $E^{\alpha}$ is a diagonal matrix containing the eigenvalues $\lambda_1^{\alpha} \ge \lambda_2^{\alpha} \ge \lambda_3^{\alpha} \ge 0$ of $C^{\alpha}$ and $V^{\alpha} = [v_1^{\alpha}\; v_2^{\alpha}\; v_3^{\alpha}]$ contains the three corresponding orthonormal eigenvectors. The HOPC descriptor is built by projecting each eigenvector onto $m$ directions obtained from the vertices of a regular polyhedron. In particular, we consider a regular

dodecahedron, which is composed of $m = 20$ vertices, each of which corresponds to a histogram bin. Let $\{u_i\}_{i=1}^{m}$ be the vertices of a regular dodecahedron and let $U = [u_1, u_2, \dots, u_m] \in \mathbb{R}^{3 \times m}$. For a regular dodecahedron centered at the origin, these vertices are given as:

8 vertices from $(\pm 1, \pm 1, \pm 1)$
4 vertices from $(0, \pm\varphi^{-1}, \pm\varphi)$
4 vertices from $(\pm\varphi^{-1}, \pm\varphi, 0)$
4 vertices from $(\pm\varphi, 0, \pm\varphi^{-1})$

where $\varphi = (1+\sqrt{5})/2$ is the golden ratio. Each eigenvector is a direction in the 3D space representing the distribution of point positions in the support volume of $p$. Therefore, its orientation has a 180 degree ambiguity. To resolve this orientation ambiguity, we consider the distribution of point vector directions and their magnitudes within the support volume of $p$. That is, for each point $q \in \Omega^{\alpha}(p)$, we compute $o = q - p$ and we determine the sign of each eigenvector $v_j^{\alpha}$ as follows:

$$v_j^{\alpha} = v_j^{\alpha} \cdot \mathrm{sign}\Big( \sum_{q \in \Omega^{\alpha}(p)} \mathrm{sign}(o^\top v_j^{\alpha})\,(o^\top v_j^{\alpha})^2 \Big), \qquad (3)$$

where the sign function returns the sign of an input number. Note that the squared projection operation ensures that small projected values, which are often due to noise, are suppressed. If the signs of the eigenvectors $v_1^{\alpha}$, $v_2^{\alpha}$, and $v_3^{\alpha}$ disagree, i.e. $v_1^{\alpha} \times v_2^{\alpha} \neq v_3^{\alpha}$, we switch the sign of the eigenvector whose $\big|\sum_{q \in \Omega^{\alpha}(p)} \mathrm{sign}(o^\top v_j^{\alpha})(o^\top v_j^{\alpha})^2\big|$ value is the smallest. We then project each eigenvector $v_j^{\alpha}$ onto $U$ to give us:

$$b_j^{\alpha} = U^\top v_j^{\alpha} \in \mathbb{R}^{m}, \quad \text{for } 1 \le j \le 3. \qquad (4)$$

If $v_j^{\alpha}$ perfectly aligns with $u_i \in U$, it should vote into only the $i$th bin. However, as the $u_i$'s are not orthogonal to each other, $b_j^{\alpha}$ will have non-zero projection values in other bins as well. To overcome this effect, we quantize the projection values of $b_j^{\alpha}$ by imposing a threshold value $\psi$ computed as follows:

$$\psi = u_k^\top u_l, \quad \text{for } u_k, u_l \in U, \; k \neq l, \qquad (5)$$

where $u_k$ and $u_l$ are any two neighbouring vertices in $U$. The quantized vector is then given by

$$\hat{b}_j^{\alpha}(z) = \begin{cases} 0 & \text{if } b_j^{\alpha}(z) \le \psi \\ b_j^{\alpha}(z) - \psi & \text{otherwise,} \end{cases}$$

where $1 \le z \le m$ denotes a bin number. For the $j$th eigenvector, we define $h_j^{\alpha}$ to be $\hat{b}_j^{\alpha}$ scaled by its corresponding eigenvalue $\lambda_j^{\alpha}$:

$$h_j^{\alpha} = \lambda_j^{\alpha} \frac{\hat{b}_j^{\alpha}}{\|\hat{b}_j^{\alpha}\|_2} \in \mathbb{R}^{m}, \quad \text{for } 1 \le j \le 3. \qquad (6)$$

We concatenate the histograms of oriented principal components of the three eigenvectors in decreasing order of the magnitudes of their associated eigenvalues to form a descriptor for point $p$:

$$h_p^{\alpha} = [h_1^{\alpha\top} \; h_2^{\alpha\top} \; h_3^{\alpha\top}]^\top \in \mathbb{R}^{3m}. \qquad (7)$$

The spatial HOPC descriptor $h_p^{S}$ encodes the shape of the support volume around $p$. On the other hand, the spatio-temporal HOPC descriptor $h_p^{ST}$ encodes information from both shape and motion. Since the smallest principal component of the local surface is the total least squares estimate of the surface normal [52], the surface normals encoded in our descriptor are more robust to noise than the gradient-based surface normals used in [9], [45]. Moreover, HOPC additionally encodes the first two eigenvectors, which are more dominant compared to the third one. The computation of the spatial and spatio-temporal HOPC descriptors is shown in Fig. 3(d).

Fig. 3: Spatio-Temporal Keypoint (STK) detection. (a) A 3D pointcloud sequence corresponding to the holding head action, (b) the spatial support volume of a particular point p, (c) the spatio-temporal support volume of p, (d) the HOPC descriptors, (e) STK detection.
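To make the descriptor computation of Eqs. (1)-(7) concrete, the following minimal NumPy sketch computes HOPC for a single point. It is illustrative only: function and variable names such as `dodecahedron_directions` and `hopc` are ours, the handedness check after Eq. (3) is omitted, the dodecahedron vertices are assumed to be unit-normalized, and the quantization threshold `psi` is left as an input.

```python
import numpy as np

PHI = (1 + np.sqrt(5)) / 2  # golden ratio

def dodecahedron_directions():
    """Columns of U: 20 unit vectors towards the vertices of a regular dodecahedron."""
    s = (-1.0, 1.0)
    verts  = [(a, b, c) for a in s for b in s for c in s]          # 8 vertices (+-1, +-1, +-1)
    verts += [(0.0, a / PHI, b * PHI) for a in s for b in s]       # 4 vertices (0, +-1/phi, +-phi)
    verts += [(a / PHI, b * PHI, 0.0) for a in s for b in s]       # 4 vertices (+-1/phi, +-phi, 0)
    verts += [(a * PHI, 0.0, b / PHI) for a in s for b in s]       # 4 vertices (+-phi, 0, +-1/phi)
    U = np.array(verts).T                                          # 3 x 20
    return U / np.linalg.norm(U, axis=0)                           # unit length (our assumption)

def hopc(support_points, p, U, psi):
    """HOPC descriptor of p from the points inside its support volume (n_p x 3 array)."""
    mu = support_points.mean(axis=0)
    Q = support_points - mu
    C = Q.T @ Q / len(support_points)                              # Eq. (1)
    lam, V = np.linalg.eigh(C)                                     # ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]                                 # lambda1 >= lambda2 >= lambda3
    O = support_points - p                                         # o = q - p for every q
    h = []
    for j in range(3):
        proj = O @ V[:, j]
        s = np.sign(np.sum(np.sign(proj) * proj ** 2))             # Eq. (3): sign disambiguation
        V[:, j] *= s if s != 0 else 1.0
        b = U.T @ V[:, j]                                          # Eq. (4): projection onto bins
        b = np.where(b > psi, b - psi, 0.0)                        # quantization with threshold psi
        h.append(lam[j] * b / (np.linalg.norm(b) + 1e-12))         # Eq. (6): eigenvalue scaling
    return np.concatenate(h)                                       # Eq. (7): 3m = 60 dimensional
```

Calling `hopc` with the points of $\Omega^{S}(p)$ gives the spatial descriptor $h_p^{S}$, and calling it with the points of $\Omega^{ST}(p)$ gives the spatio-temporal descriptor $h_p^{ST}$.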

4 SPATIO-TEMPORAL KEYPOINT (STK) DETECTION

The aim of the STK detection is to find points in 3D pointcloud action sequences that satisfy three constraints:

- Repeatability: STKs should be identified with high repeatability in different samples of the same action in the presence of noise and viewpoint changes.
- Uniqueness: A unique coordinate basis should be obtained from the neighbourhood of the STKs for the purpose of view-invariant description.
- Significant spatio-temporal variation: STKs should be detected where the neighbourhood has significant space-time variations.

To achieve these aims, we propose an STK detection technique which has high repeatability and uniqueness, and detects points where the space-time variation is significant. Consider a point $p = (x_t, y_t, z_t)^\top$ within a sequence of 3D pointclouds. We perform eigen decomposition on the spatial and the spatio-temporal covariance matrices $C^{S}$ and $C^{ST}$ as described in Section 3. For the first two constraints, we define the following ratios:

$$\delta_{12}^{S} = \frac{\lambda_1^{S}}{\lambda_2^{S}}, \quad \delta_{23}^{S} = \frac{\lambda_2^{S}}{\lambda_3^{S}}, \quad \delta_{12}^{ST} = \frac{\lambda_1^{ST}}{\lambda_2^{ST}}, \quad \delta_{23}^{ST} = \frac{\lambda_2^{ST}}{\lambda_3^{ST}}. \qquad (8)$$

For 3D symmetrical surfaces, the ratio between the first two eigenvalues or the last two eigenvalues is very close to 1. The principal components at such locations are, therefore, ambiguous. Thus, for a point to qualify as a potential keypoint, the condition

$$\{\delta_{12}^{S}, \delta_{23}^{S}, \delta_{12}^{ST}, \delta_{23}^{ST}\} > \theta_{STK} = 1 + \epsilon_{STK} \qquad (9)$$

must be satisfied, where $\epsilon_{STK}$ is a small margin to cater for noise. This process eliminates ambiguous points and produces a subset of candidate keypoints which can be described uniquely in a local coordinate basis. Recall that $h_p^{S}$ in (7) represents the spatial HOPC and $h_p^{ST}$ the spatio-temporal HOPC at point $p$. For the third constraint, a quality factor $\eta_p$ is computed for all candidate keypoints:

$$\eta_p = \frac{1}{2} \sum_{i=1}^{3m} \frac{\big(h_p^{S}(i) - h_p^{ST}(i)\big)^2}{h_p^{S}(i) + h_p^{ST}(i)}. \qquad (10)$$

When $h_p^{S} = h_p^{ST}$, the quality factor is at its minimum value of $\eta_p = 0$, which basically means that the candidate point $p$ has a stationary spatio-temporal support volume. On the other hand, significant variations in space-time change the direction and magnitude of the spatio-temporal eigenvectors with respect to the spatial eigenvectors. Thus, $\eta_p$ is large when a significant motion occurs in the spatio-temporal support volume. STKs that are in the vicinity of each other are similar as they describe more or less the same local support volume. We perform a non-maximum suppression to keep a minimum distance between STKs. We define a radius $r'$ (with $r' < r$) and a time interval $[t-\tau', t+\tau']$ (with $\tau' \le \tau$), where $t$ is the frame number being considered. The candidate STKs are first sorted according to their quality values. Starting from the highest quality STK, all candidate STKs falling within $r'$ and $\tau'$ from it are discarded. The same process is repeated on the remaining candidate STKs until only a desired number, $n_k$, of STKs are left. Figure 3 shows the steps of our STK detection algorithm. Figure 4 shows the extracted STKs from four different views for a 3D pointcloud sequence corresponding to the two hand waving action.
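For illustration, a minimal sketch of the detection test of Eqs. (8)-(10) and of the suppression step is given below. The names (`is_candidate`, `quality`, `suppress`) and the candidate bookkeeping are ours, and the neighbourhood construction is assumed to have been done beforehand; treat this as a simplified sketch rather than the reference implementation.

```python
import numpy as np

def is_candidate(lam_S, lam_ST, theta_stk):
    """Eq. (9): every eigenratio of Eq. (8) must exceed theta_stk = 1 + eps_stk."""
    eps = 1e-12
    d12_s,  d23_s  = lam_S[0] / (lam_S[1] + eps),  lam_S[1] / (lam_S[2] + eps)
    d12_st, d23_st = lam_ST[0] / (lam_ST[1] + eps), lam_ST[1] / (lam_ST[2] + eps)
    return min(d12_s, d23_s, d12_st, d23_st) > theta_stk

def quality(h_S, h_ST):
    """Eq. (10): quality factor eta_p comparing spatial and spatio-temporal HOPC."""
    num, den = (h_S - h_ST) ** 2, h_S + h_ST
    mask = den > 0
    return 0.5 * np.sum(num[mask] / den[mask])

def suppress(candidates, r1, tau1, n_k):
    """Greedy non-maximum suppression: keep at most n_k STKs, discarding candidates
    that fall within radius r1 AND time interval tau1 of a better (higher eta) STK.
    candidates: list of (eta, x, y, z, t) tuples."""
    kept = []
    for eta, x, y, z, t in sorted(candidates, reverse=True):       # best quality first
        far = all(np.linalg.norm([x - k[1], y - k[2], z - k[3]]) > r1 or abs(t - k[4]) > tau1
                  for k in kept)
        if far:
            kept.append((eta, x, y, z, t))
        if len(kept) == n_k:
            break
    return kept
```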
5 VIEW-INVARIANT STK DESCRIPTION (LOCAL HOPC)

The HOPC descriptor discussed in Section 3 is not view-invariant yet. We compute Local HOPC only at the STKs, since it is possible to normalize the orientation of the local region only at these points, i.e. a unique local coordinate basis can only be defined at these points. We perform orientation normalization at each STK using the eigenvectors of its spatial covariance matrix $C^{S}$ (see Section 3). We consider the eigenvectors $V^{S} = [v_1^{S}\; v_2^{S}\; v_3^{S}]$ of $C^{S}$ as a local object centered coordinate basis. Note that the matrix $V^{S}$ is orthonormal and can be used as a valid 3D rotation matrix, since

$$v_i^{S} \cdot v_j^{S} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases} \qquad (11)$$

We apply the 3D rotation $R = V^{S\top}$ to all the mean-centered points $\{q_i\}_{i=1}^{n_p}$ within the spatio-temporal support volume of $p$, $\Omega^{ST}(p)$, and bring them to a canonical coordinate system:

$$q'_i = R\, q_i, \quad \text{for } i = 1, \dots, n_p, \qquad (12)$$

where $q'_i$ denotes the rotated point in the local object centered coordinate basis. Note that the first, second, and third principal components are now aligned with the X, Y, and Z axes of the Cartesian coordinates. Since the same STKs in two different views have the same canonical representation, we can do cross-view keypoint matching (Fig. 2). It is important to note that our STK detection algorithm has already pruned ambiguous points to make the local object centered coordinate basis unique, i.e. no two eigenvectors have the same eigenvalues. Therefore, the eigenvector with the maximum eigenvalue will always map to the X axis, the second largest to the Y axis and the smallest to the Z axis. After the orientation normalization given in (12), for each point $q' \in \Omega^{ST}(p)$, we inspect the eigenratios $\delta_{12}^{ST}$ and $\delta_{23}^{ST}$ (Eq. (9)) computed using the neighbouring points of $q'$ to determine how the HOPC descriptor at $q'$ should contribute to the STK descriptor computation of $p$. When $\delta_{12}^{ST}$ and $\delta_{23}^{ST}$ are both larger than $\theta_l = 1 + \epsilon_l$, where $\epsilon_l$ is a small margin, all the eigenvectors are uniquely defined and, therefore, can all contribute to the STK descriptor. When $\delta_{12}^{ST} \le \theta_l$ and $\delta_{23}^{ST} > \theta_l$, the first two eigenvectors are ambiguous and so only the third eigenvector should contribute to the STK descriptor. A similar argument

applies to the case where $\delta_{12}^{ST} > \theta_l$ and $\delta_{23}^{ST} \le \theta_l$. When both $\delta_{12}^{ST} \le \theta_l$ and $\delta_{23}^{ST} \le \theta_l$, then $q'$ has no contribution to the descriptor computation. In summary, the following three criteria need to be considered for the construction of $h_{q'}^{ST}$:

1) If $\delta_{12}^{ST} > \theta_l$ and $\delta_{23}^{ST} > \theta_l$, then $h_{q'}^{ST} = [h_1^{ST\top}\; h_2^{ST\top}\; h_3^{ST\top}]^\top$;
2) If $\delta_{12}^{ST} \le \theta_l$ and $\delta_{23}^{ST} > \theta_l$, then $h_{q'}^{ST} = [0^\top\; 0^\top\; h_3^{ST\top}]^\top$;
3) If $\delta_{12}^{ST} > \theta_l$ and $\delta_{23}^{ST} \le \theta_l$, then $h_{q'}^{ST} = [h_1^{ST\top}\; 0^\top\; 0^\top]^\top$.

Next, the orientation normalized spatio-temporal support volume around the STK $p$ is partitioned into $\gamma = n_x \times n_y \times n_t$ spatio-temporal cells along the X, Y, and T dimensions. We use $c_s$, where $s = 1, \dots, \gamma$, to denote the $s$th cell. The cell descriptor $h_{c_s}$ is computed by accumulating the $h_{q'}^{ST}$:

$$h_{c_s} = \sum_{q' \in c_s} h_{q'}^{ST}, \qquad (13)$$

and then normalizing:

$$h_{c_s} \leftarrow \frac{h_{c_s}}{\|h_{c_s}\|_2}. \qquad (14)$$

We define the final view-invariant descriptor, $h_v$, of STK $p$ to be the concatenation of the $h_{c_s}$ obtained from all the cells:

$$h_v = [h_{c_1}^\top\; h_{c_2}^\top \cdots h_{c_\gamma}^\top]^\top. \qquad (15)$$

The above steps are repeated for all the STKs. Thus, the STK descriptors encode view-invariant spatio-temporal patterns that will be used for action description.

6 ACTION DESCRIPTION

6.1 Bag of STK Descriptors

We represent each sequence of 3D pointclouds by a set of STK descriptors. Inspired by the successful bag-of-words approach for object recognition, we build a codebook by clustering the STK descriptors ($h_v$) with the K-means algorithm. Clustering is performed over all action descriptors extracted from all training view samples. Thus, the codebook that we learn is not single action or single view specific. For a fair evaluation, we do not use the target test views in codebook learning or any other training task. We consider each cluster as a codeword that represents a specific spatio-temporal pattern shared by the STKs in that cluster. One codeword is assigned to each STK descriptor based on the minimum Euclidean distance. The histogram of codewords is used as an action descriptor. For classification, we use an SVM classifier with the histogram intersection kernel [53].

6.2 Mining Discriminative Codebooks

Not all codewords have the same level of discrimination. Some codewords may encode movements that do not offer good discrimination among different actions, e.g. the sway of the human body. We use the F-score to find the most discriminative features in the codebook and discard non-discriminative features. The F-score [54] measures the discrimination of two sets of real numbers. For more than two sets of real numbers, we use the multiset F-score [55] to measure their discrimination. Given the training histograms of codewords $x_k$, for $k = 1, \dots, m$, and $l \ge 2$ action classes, if the number of samples in the $j$th ($1 \le j \le l$) class is $n_j$, then the F-score of the $i$th histogram bin is defined as:

$$F_i = \frac{\sum_{j=1}^{l} \big(\bar{x}_i^{(j)} - \bar{x}_i\big)^2}{\sum_{j=1}^{l} \frac{1}{n_j - 1} \sum_{k=1}^{n_j} \big(x_{k,i}^{(j)} - \bar{x}_i^{(j)}\big)^2}, \qquad (16)$$

where $\bar{x}_i$ and $\bar{x}_i^{(j)}$ are the averages of the $i$th histogram bin over all samples and over the $j$th class samples, respectively, and $x_{k,i}^{(j)}$ is the $i$th histogram bin of the $k$th sample in the $j$th class. The larger the F-score, the more discriminative the corresponding histogram bin.

Fig. 4: STKs (shown in red) projected onto the XYZ dimensions of all points of a 3D pointcloud sequence corresponding to the two hand waving action. Four different views are shown. Note that the distribution of STKs encodes the action globally as they are detected only where movement is performed.
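As a concrete illustration of Eq. (16), the multiset F-score of every histogram bin can be computed as in the short sketch below (names such as `multiset_fscore` are ours; this is an illustrative simplification, not the authors' code).

```python
import numpy as np

def multiset_fscore(X, y):
    """Per-bin multiset F-score of Eq. (16).
    X: (n_samples, n_bins) training histograms of codewords; y: (n_samples,) class labels."""
    overall_mean = X.mean(axis=0)                       # x_bar_i over all samples
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]                                  # samples of class j
        class_mean = Xc.mean(axis=0)                    # x_bar_i^(j)
        num += (class_mean - overall_mean) ** 2
        den += ((Xc - class_mean) ** 2).sum(axis=0) / max(len(Xc) - 1, 1)
    return num / (den + 1e-12)
```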
Therefore, we rank the codewords by their F-scores and select the codewords whose F-scores are higher than a threshold. In our experiments, up to 1.5% improved accuracy was observed by selecting the top 98% discriminative features out of the total.

6.3 Encoding Spatio-Temporal STK Distribution

The bag-of-words approach efficiently encodes the local spatio-temporal information in a 3D pointcloud sequence. However, it ignores the spatio-temporal relationship among the STKs. We observed that encoding the distribution of STKs in space-time (Fig. 4) can further improve the discrimination between different actions in addition to the bag-of-words based descriptors. To incorporate the space-time positional information of STKs, we propose a method that encodes this information. Let $P = \{p_i \in \mathbb{R}^4,\ i = 1, \dots, n_k\}$ represent the set of all selected STKs within a sequence of 3D pointclouds $\mathbf{Q}$, where $n_k$ is the number of STKs and $p_i = (x, y, z, t)^\top$ are the coordinates of an STK in the 4D space, with $x$ and $y$ being the spatial coordinates, $z$ being depth and $t$ being time. To cope with the heterogeneity in the vectors, we normalize the vectors so that all their components have zero mean and unit variance. To simplify the description, let us assume that the set $P = \{p_i \in \mathbb{R}^4\}$ now contains the normalized vectors as described above. By dropping the time axis, we have

7 MANUSCRIPT FOR REVIEW, TABLE 1: The 600 vertices of a 120-cell regular polychoron centered at the origin generated from all and even permutations of these coordinates [56]. Vertices Permutation Coordinate points 24 all 0, 0, 2, 2 64 all 1, 1, 1,?5 64 all ϕ 2, ϕ, ϕ, ϕ 64 all ϕ 1, ϕ 1, ϕ 1, ϕ`2 96 even 0, ϕ 2, 1, ϕ`2 96 even 0, ϕ 1, ϕ,?5 192 even ϕ 1, 1, ϕ, 2 a set of normalized 3D STKs: P 1 tp 1 i P R3 u. Eigen decomposition is then applied to the covariance matrix of points in P 1 to yield two eigenratios δ 12 λ 1 {λ 2 and δ 23 λ 2 {λ 3. To get a unique coordinate basis, we require that δ 12, δ 23 ą θ g 1 ` ɛ g, where ɛ g is a small constant. If these constraints are not satisfied, we perform an iterative refinement of STKs as follows. Given n k initial STKs, in each iteration, m k (where m k! n k ) STKs with the lowest quality factor (Eq. (10)) are removed. Eigen decomposition is applied to the remaining points to yield two new eigenratios. This process is iterated until the eigenratio constraints are satisfied. Generally, three iterations are sufficient. To achieve a view-invariant representation, all points in P 1 are aligned along V, i.e., for all p 1 P P 1, we set p 1 Ð V J p 1 where V is the eigenvector matrix obtained from the eigen decomposition in the last iteration. The normalized temporal dimension is then reattached to each point in P 1 : p Ð rp 1, ts, (17) to form the set P p tp i P R 4 u. To encode the distribution of STKs in the 4D space, we consider a 4D regular geometric object called polychoron [56] which is a 4D extension of the 2D polygon. The vertices of a regular polychoron divide the 4D space uniformly, and therefore, each vertex can be considered as a histogram bin. In particular, from the set of regular polychorons, we consider the 120-cell regular polychoron with 600 vertices as given in Table 1 [56]. Given the set P p constructed above, we project each orientation normalized p i onto the 600 vertices of the polychoron and select the vertex with the highest projection value. The histogram bin corresponding to the selected vertex is incremented by one. We repeat this process for all STKs in P p and the final histogram is a 600 dimensional STK Distribution (STK-D) descriptor which encodes the global spatio-temporal distribution of STKs of the sequence Q in a compact and discriminative form. 7 ADAPTABLE SUPPORT VOLUME So far, for STK detection and description, we have used a fixed spatio-temporal support volume with spatial radius r and temporal scale τ. However, subjects may have different scales (height and width) and may perform actions at different speeds. Therefore, simply using a fixed spatial radius r and temporal scale τ is not optimal. Moreover, a larger value of r enables the proposed descriptor to encapsulate more information about shape but makes the descriptor vulnerable to occlusions. Similarly, a small τ is preferable over large τ for better temporal action localization. However, a small τ may not capture sufficient information about an action if it is performed slowly. 7.1 Spatial Scale Selection Several automatic spatial scale selection methods have been proposed for 3D object retrieval [57]. We adapt the method proposed by Mian et al. [58] in object retrieval for action recognition in 3D pointcloud sequences. Note that in the human action recognition problem, the subject s height is available in most cases (which is not the case for object retrieval). Where available, we use the subject s height (h s ) to find an appropriate spatial scale. 
We select the radius as $r = \sigma h_s$, where $0 < \sigma < 1$ is a constant factor. We have empirically selected the value of $\sigma$ to maximize the descriptiveness and the robustness of our descriptor to occlusions. In all experiments, we use a fixed value of $\sigma$ for all actions, views and datasets. In our experiments in Section 8, we observe that this simple approach achieves almost the same accuracy as the automatic spatial scale selection method adapted from [58]. Once we have selected an appropriate spatial scale $r$, we proceed to select an appropriate temporal scale $\tau$.

7.2 Automatic Temporal Scale Selection

Most existing action recognition techniques [5], [6], [8], [9], [44], [47] use a fixed temporal scale. We observe that variations in action execution speed cause significant disparity among the descriptors from the same action (Fig. 5). To make our descriptor robust to action speed variations, we propose an automatic temporal scale selection technique. Let $\mathbf{Q} = \mathrm{seq}(Q_1, \dots, Q_t, \dots, Q_{n_f})$ be a sequence of 3D pointclouds. For a given point $p = (x_t, y_t, z_t)^\top \in Q_t$ and a given temporal scale $\tau$, we can define the spatial support volume $\Omega^{S}(p)$ and the spatio-temporal support volume $\Omega_{\tau}^{ST}(p)$ of $p$. The covariance matrix of the points falling within $\Omega_{\tau}^{ST}(p)$ can be eigen-decomposed to yield the eigenvalues $\lambda_1^{\tau} \ge \lambda_2^{\tau} \ge \lambda_3^{\tau} \ge 0$. These steps are similar to those described in Section 3, except that, in this section, we repeat them for each temporal scale $\tau = 1, \dots, \tau_m$, where $\tau_m$ is a fixed upper threshold. For each $\tau$ value, we calculate:

$$A_p(\tau) = \frac{\lambda_2^{\tau}}{\lambda_1^{\tau}} + \frac{\lambda_3^{\tau}}{\lambda_2^{\tau}}. \qquad (18)$$

The optimal temporal scale $\tau(p)$ for the given point $p$ is chosen to be the one that minimizes $A_p$ over the range $1 \le \tau \le \tau_m$, i.e.,

$$\tau(p) = \operatorname*{argmin}_{\tau} A_p(\tau). \qquad (19)$$
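The selection loop of Eqs. (18) and (19), detailed in Algorithm 1, can be sketched as follows. Function and variable names (`temporal_scale`, `frames`) are ours, and the support-volume construction is simplified, so this is an illustrative sketch under those assumptions.

```python
import numpy as np

def temporal_scale(frames, t, p, r, tau_m):
    """Automatic temporal scale selection for the point p of frame t.
    frames: list of (n_i, 3) pointclouds; p: (3,) point; r: spatial radius; tau_m: upper bound."""
    eps = 1e-12
    A = []
    for tau in range(1, tau_m + 1):
        # points of frames t-tau .. t+tau falling inside the sphere of radius r around p
        window = np.vstack(frames[max(t - tau, 0): t + tau + 1])
        support = window[np.linalg.norm(window - p, axis=1) <= r]
        mu = support.mean(axis=0)
        C = (support - mu).T @ (support - mu) / len(support)
        lam = np.sort(np.linalg.eigvalsh(C))[::-1]                    # lambda1 >= lambda2 >= lambda3
        A.append(lam[1] / (lam[0] + eps) + lam[2] / (lam[1] + eps))   # Eq. (18)
    tau_star = int(np.argmin(A)) + 1          # Eq. (19); argmin returns the smallest tau on ties
    flag = 0 if tau_star == tau_m else 1      # flag = 0 means the STK should be discarded
    return tau_star, flag
```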

8 MANUSCRIPT FOR REVIEW, Fig. 5: The same action (hand waving) is shown at three different speeds: (a) slow, (b) moderate, and (c) fast. The number of frames reduces as the action speed increases. For the slow movement, the optimal temporal scale is found to be τ 3, for moderate movement τ 2, and for fast movement τ 1. As an example to illustrate this automatic temporal scale selection process, Fig. 5(a)-(c) show the temporal sequences of pointclouds for the hand waving action performed at three different speeds. The dotted circle shows the sphere defined by the spatial radius r in each pointcloud. The spatial radius r in the three cases is the same because of similar geometry. Our aim here is to select the optimal temporal scale for the point p in the pointcloud Q t shown in black and dotted outline. Figure 5(d) shows the union of points in the range Q t 3 Q t`3 which are within the radius r measured from the coordinate (x, y, z) of point p. Figure 5(e) and (f) show the union of points in the same way for Q t 2 Q t`2 and Q t 1 Q t`1, respectively. Figure 5(g)-(i) show the plots of A p with the variation of τ. Increasing τ beyond a certain value does not affect the accumulated pointcloud as the value of A p becomes constant. In most cases, increasing τ decreases A p until a fixed value is reached. We compute A p pτq for all values of τ and find the global minimum τ. When more than one τ exist, the smallest value of τ is chosen. For each STK, the temporal scale is selected independently and may vary from one STK to the other in the same 3D pointcloud sequence. The proposed temporal scale selection is detailed in Algorithm 1. The algorithm outputs two variables τ and flag P t0, 1u. If the optimal τ is equal to τ m then the flag is set to 0, indicating that the STK p should be discarded. If the computed optimal τ is smaller than τ m then the flag is set to 1, indicating that the return τ value is the optimal temporal scale for p. Algorithm 1: Automatic Temporal Scale Selection input : Q, p, r, and τ m. output: τ, flag. 1 for τ 1 : τ m do 2 Construct Ω ST τ ppq; 3 µ τ Ð 1 n p řqpω τ ppq q; C τ Ð 1 n p řqpω τ ppq pq µ τ qpq µ τ q J ; 4 V τ» fi λτ λ τ 2 0 fl V 0 0 λ τ τ J C τ ; 3 5 A p pτq Ð λτ 2 λ τ 1 6 end 7 τ argmin τ ` λτ 3 λ ; τ 2 A p pτq; 8 if τ τ m then 9 flagð 0; 10 else 11 flagð 1; 12 end 8 EXPERIMENTS We evaluate the proposed algorithm on five benchmark datasets, including two multi-view (Northwestern-UCLA Multiview Action3D [29], and UWA3D Multiview Activity II) and three single-view (MSR Action3D [38], MSR Daily Activity3D [39], and MSR Gesture3D [40]) datasets. Performance is compared to nine existing action recognition methods including Histogram of Oriented 4D Normals (HON4D) [9], Super Normal Vector (SNV) [10], Lie Algebra Relative Pairs (LARP) [59], Comparative Coding Descriptor (CCD) [3], Virtual Views (VV) [60], Histogram of 3D Joints (HOJ3D) [27], Discriminative Virtual Views (DVV) [36], Actionlet Ensemble (AE) [28], and AND- OR graph (AOG) [29]. The baseline results are obtained using publicly available implementations of CCD [3], VV [60], DVV [36], HON4D [9], SNV [10], and LARP [59] from the respective authors websites. For the remaining three methods AOG [29], HOJ3D [27], and AE [28], we use our implementations because their codes are not publicly available. For CCD [3], VV [60] and DVV [36], we use DSTIP [8], which is more robust to 3D sensor noise compared to color-based interest point detectors, to extract and describe the spatio-temporal interest points. 
Our algorithm is robust to many different parameter settings (see Section 8.7). To help the reader reproduce our results, we provide the parameter values that we used in Table 2. The UWA3D Multiview Activity II dataset and code will be made publicly available. To evaluate individual components of the proposed algorithm, we report results for the following four settings: Holistic HOPC: A sequence of 3D pointclouds is divided into γ 6 ˆ 5 ˆ 3 spatio-temporal cells along the X, Y, and T dimensions. The spatio-temporal HOPC descriptor

9 MANUSCRIPT FOR REVIEW, TABLE 2: Parameters and their values: K: number of codewords, TABLE 3: Comparison of action recognition accuracy (%) on nk : number of STKs, θstk, θl, θg : eigenratio thresholds, nx ˆ ny ˆnt : spatio-temporal cells (Section 5), τm : maximum temporal scale, mk : iterative refinement (Section 6.3). the Northwestern-UCLA Multiview Action3D dataset where the samples from the first two cameras are used as training data, and the samples from the third camera are used as test data. Parameter K nk θstk θl θg nx ny nt τm mk Value nf 0.05nk Data type RGB Skeleton Depth CCD [3] VV [60] HOJ3D [27] DVV [36] AE [28] AOG [29] HON4D [9] SNV [10] LARP [59] Holistic HOPC STK-D Local HOPC Local HOPC+STK-D Fig. 6: Sample pointclouds from the Northwestern-UCLA Multiview Action3D dataset [29] captured by 3 cameras. in (7) is computed for each point p within the hst p sequence. The cell descriptor is computed using (13) and then normalized using (14). The final descriptor for the given sequence is a concatenation of all the cell descriptors. We use SVM for classification. Similar to HON4D [9] and SNV [10], our Holistic HOPC is suitable for single-view action recognition [37] and can handle more inter-class similarities of local motions compared to local methods [9]. STK-D: For each sequence of 3D pointclouds, the histogram of spatio-temporal distribution of STKs is used as the sequence descriptor (Section 6.3). Local HOPC: For each sequence, STKs are detected using the method proposed in Section 4. The proposed orientation normalization is then applied at each STK neighborhood to extract its view-invariant HOPC descriptor (Section 5). The BoW approach is used to describe the sequence. Local HOPC+STK-D: The bag of STK descriptors and the histogram of spatio-temporal distribution of STKs are concatenated to form the sequence descriptor. 8.1 N-UCLA Multiview Action3D Dataset The Northwestern-UCLA dataset [29] contains RGB, depth and human skeleton positions captured simultaneously by three Kinect cameras. It consists of 10 action categories: (1) pick up with one hand, (2) pick up with two hands, (3) drop trash, (4) walk around, (5) sit down, (6) stand up, (7) donning, (8) doffing, (9) throw, and (10) carry. Each action was performed by 10 subjects 1 to 6 times. Figure 6 shows 12 sample 3D pointclouds of four actions captured by the three cameras. To compare our method with state-of-the-art algorithms, we use the same experimental setting as [29], using the samples from the first two cameras as training data, and the samples from the third camera as test data. Results are given in Table 3. Holistic approaches such as HON4D [9], SNV [10], and the proposed Holistic HOPC achieved low recognition accuracy since they are not designed to handle viewpoint changes. Similarly, since depth is a function of viewpoint, CCD [3] achieved low accuracy by encoding the differences between the depth values of an interest point and its neighbourhood points. Among the knowledge transfer based methods, VV [60] and DVV [36] did not perform well; however, AOG [29] obtained high accuracy on only RGB videos. On depth videos, AOG also did not perform well. A possible reason is that depth videos have higher noise levels and interpolating noisy features across views can compromise discrimination ability. Skeleton based methods such as AE [28] and LARP [59] achieved high accuracy. We used the scale and orientation normalization of skeletons proposed in LARP [59] for AE [28] as well which improved the results of AE. 
However, skeleton data may not be reliable, or even available, when the subject is not in an upright position or is occluded [8]. More importantly, the application of these methods is limited to human activity recognition where the human skeleton is generally estimated by [43]. STK-D alone achieved higher accuracy compared to all depth image based methods. This confirms the repeatability of STKs and the robustness of STK-D to viewpoint changes. The Local HOPC descriptor achieved higher accuracy than STK-D and all depth based methods. Since HOPC and STK-D capture complementary information, their combination (Local HOPC+STK-D) further improved the performance by 8.1% achieving the overall best accuracy of 80%. Note that this is about 6% higher than the nearest competitor LARP [59] which requires skeleton data whereas our method does not. The confusion matrix of our proposed view-invariant Local HOPC+STK-D method is shown in Fig. 7. The action

10 MANUSCRIPT FOR REVIEW, Fig. 8: Sample pointclouds from the UWA3D Multiview Activity Fig. 7: Confusion matrix of our algorithm on the Northwestern- II dataset captured by one camera from 4 different views. UCLA Multiview Action3D dataset [29]. (7) donning and action (8) doffing have maximum confusion with each other because the motion and appearance of these actions are very similar. Similarly, action (1) pick up with one hand and action (3) drop trash have high confusion due to similarity in motion and appearance. 8.2 UWA3D Multiview Activity II Dataset This dataset was collected in our lab using Kinect to emphasize three points: (1) Larger number of human activities. (2) Each subject performed all actions in a continuous manner with no breaks or pauses. Therefore, the start and end positions of body for the same actions are different. (3) Each subject performed the same actions four times while imaged from four different views: front view, left and right side views, and top view. This dataset consists of 30 human activities performed by 10 subjects with different scales: (1) one hand waving, (2) one hand Punching, (3) two hand waving, (4) two hand punching, (5) sitting down, (6) standing up, (7) vibrating, (8) falling down, (9) holding chest, (10) holding head, (11) holding back, (12) walking, (13) irregular walking, (14) lying down, (15) turning around, (16) drinking, (17) phone answering, (18) bending, (19) jumping jack, (20) running, (21) picking up, (22) putting down, (23) kicking, (24) jumping, (25) dancing, (26) moping floor, (27) sneezing, (28) sitting down (chair), (29) squatting, and (30) coughing. To capture depth videos, each subject performed 30 activities 4 times in a continuous manner. Each time, the Kinect was moved to a different angle to capture the actions from four different views. Note that this approach generates more challenging data than when actions are captured simultaneously from different viewpoints. We organized our dataset by segmenting the continuous sequences of activities. The dataset is challenging because of varying viewpoints, self-occlusion and high similarity among activities. For example, the actions (16) drinking and (17) phone answering have very similar motion, but the location of hand in these two actions is slightly different. Also, some actions such as (10) holding head and (11) holding back, have self-occlusion. Moreover, in the top view, the lower part of the body was not properly captured because of occlusion. Figure 8 shows 16 sample pointclouds of five actions from 4 views. For cross-view action recognition, we use the samples from two views as training data, and the samples from the two remaining views as test data. Table 4 summarizes our results. Since this dataset is more challenging compared to the N-UCLA dataset, the performance of all methods drops significantly. Our Holistic HOPC descriptor achieved higher average recognition accuracy than the depth based methods but lower than the methods which use normalized skeleton data. Among the depth based methods, HON4D [9] and SNV [10] are the nearest competitors to the Holistic HOPC. The Local HOPC achieved higher accuracy than STK-D and the Holistic HOPC. Combining STK-D with Local HOPC again improved performance by 8.2% achieving the overall best performance of 52.2%. Note that this is about 9% higher than the nearest competitor LARP [59] which uses skeleton data. Local HOPC+STK-D achieved the highest accuracy in all combinations of training and test views except one. 
The accuracy of skeleton based methods is significantly lower on this dataset because the skeleton data is not accurate for some actions such as drinking, phone answering, sneezing or is not available for some actions such as falling down and lying down. Moreover, the overall accuracy of the knowledge transfer based methods VV [60], DVV [36], and AOG [29] when depth videos are used as input data is low because motion and appearance of many actions are very similar and the depth sequences have a high level of noise. Therefore, the view dependent local features used in VV [60], DVV [36] and the appearance and motion interpolation based method used in AOG [29] are not enough to discriminate between actions in the presence of noise. Figure 9 shows the confusion matrix of our proposed view-invariant Local HOPC+STK-D method when videos from view V1 and view V2 are used for training and videos from view V3 are used as test data. The actions that causes the most confusion are (9) holding chest versus

11 MANUSCRIPT FOR REVIEW, TABLE 4: Comparison of action recognition accuracy (%) on the UWA3D Multiview Activity II dataset. Each time two views are used for training and the remain two views are individually used for testing. Training views V 1 & V 2 V 1 & V 3 V 1 & V 4 V 2 & V 3 V 2 & V 4 V 3 & V 4 Mean Test view V 3 V 4 V 2 V 4 V 2 V 3 V 1 V 4 V 1 V 3 V 1 V 2 AOG [29] (RGB) HOJ3D [27] (Skeleton) AE [28] (Norm. Skeleton) LARP [59] (Norm. Skeleton) CCD [3] (Depth) VV [60] (Depth) DVV [36] (Depth) AOG [29] (Depth) HON4D [9] (Depth) SNV [10] (Depth) Holistic HOPC (Depth) STK-D (Depth) Local HOPC (Depth) Local HOPC+STK-D (Depth) skeleton based methods such as HOJ3D [27] and LARP [59]. The average accuracy of our view-invariant Local HOPC+STK-D method is 82.9% which is still higher than HOJ3D [27], AE [28], HON4D [9], and LARP [59]. 8.4 MSR Daily Activity3D Dataset This dataset [39] contains 16 daily activities performed twice by 10 subjects, once in standing position and once while sitting. Most activities involve human-object interactions which makes this dataset challenging. We follow the experimental setting of [39] and use samples from half of the subjects as training data, and the rest as test data. As shown in Table 5, the proposed Holistic HOPC outperformed all techniques achieving an average accuracy of 88.8%. The view-invariant Local HOPC+STK-D outperformed AOG [29], HOJ3D [27] and LARP [59]; however, it achieved lower accuracy than HON4D [9] and SNV [10], because these methods assume that the training and test samples are obtained from the same viewpoint. Fig. 9: Confusion matrix of our algorithm on the UWA3D Multiview Activity II dataset when view V 1 and view V 2 are used for training and view V 3 is used for test. (11) holding back and (12) walking versus (13) irregular walking, because the motion and appearance of these two actions are very similar. 8.3 MSR Action3D Dataset The MSR Action3D dataset [38] consists of 20 actions performed 2 to 3 times by 10 subjects. This dataset is challenging due to high inter-action similarities. Following the protocol of [9], we use 5 subjects for training and the remaining 5 for testing and exhaustively repeated the experiments 252 folds. Table 5 compares our algorithms with existing methods. The proposed Holistic HOPC outperformed all methods and achieved 86.5% average accuracy which is more than 2% higher than its nearest competitor SNV [10] and significantly higher than the 8.5 MSR Gesture3D Dataset The MSR Gesture3D dataset [40] contains 12 American sign language gestures performed 2 to 3 times by 10 subjects. For comparison with previous techniques, we use the leave-one-subject-out cross validation scheme proposed by [40]. Table 5 compares our methods to existing ones excluding AE [28], LARP [59], AOG [29] and HOJ3D [27] since they require 3D joint positions which are not present in this dataset. Our Holistic HOPC outperformed all techniques and achieved an average accuracy of 96.2%. The Local HOPC+STK-D achieves an accuracy of 93.6% which is higher than HON4D [9]. 8.6 Effects of Adaptable Support Volume Spatial Scale Selection In this experiment, we evaluate the influence of three different approaches for spatial scale selection at each STK. In the first approach, we use a constant spatial scale for

12 MANUSCRIPT FOR REVIEW, TABLE 5: Comparison of average action recognition accuracy TABLE 7: Average recognition accuracy of the proposed method (%) on the MSR Action3D [38], MSR Daily Activity3D [39], and MSR Gesture3D [40] datasets. NA: RGB or skeleton data Not Available. in two different settings on the Northwestern-UCLA Multiview Action3D [29] and the UWA3D Multiview Activity II datasets. (1) Constant temporal scale (τ 2) and (2) our automatic temporal scale selection technique. Method Action DailyActivity Gesture AOG [29] (RGB) HOJ3D [27] (Skeleton) AE [28] (Norm. Skeleton) LARP [59] (Norm. Skeleton) AOG [29] (Depth) HON4D [9] (Depth) SNV [10] (Depth) NA NA NA NA NA NA NA Holistic HOPC (Depth) Local HOPC+STK-D (Depth) Dataset N-UCLA UWA3DII Temporal scale selection method Constant Automatic TABLE 6: Average recognition accuracy of the proposed method in three different settings on the Northwestern-UCLA Multiview Action3D [29] and the UWA3D Multiview Activity II datasets. (1) Constant spatial scale for all subjects, (2) ratio of subject s height as the spatial scale, and (3) automatic spatial scale selection [58]. Dataset N-UCLA UWA3DII Spatial scale selection method Constant Subject height Automatic all subjects. In the second approach, we select a scale for each subject relative to the subject s height. In the third one, we use the automatic spatial scale selection method proposed by Mian et al. [58]. Table 6 shows the average accuracy of the proposed method in the three settings. Using the subject s height to find a subject specific scale for the STKs turns out to be the best approach. Automatic scale selection performs closely and can be a good option if the subject s height cannot be measured due to occlusions. Constant scale for all subjects performs the worst. However, the performance of our algorithm is better than existing techniques in all three settings. Fig. 10: STKs (shown in red) extracted using our proposed detector. STKs are projected on XY Z dimensions of all points within a 3D pointcloud sequence corresponding to the action two hand waving. The top nk 100, 400, 700, 1000 with the best quality are shown in (a)-(d), respectively. Note that the highest quality STKs are detected where significant movement is performed. Noisy points begin to appear as late as nk late as nk Figure 11(a) shows the influence of the number of STKs on the average recognition accuracy. The proposed method achieves the best recognition accuracy when nk 400; however, the performance remains stable up to nk Threshold Values We evaluate the effect of the eigenratio thresholds θstk for STK detection in (9), θl for view-invariant Local HOPC in Section 5, and θg for STK-D in Section 6.3 on the average recognition accuracy of our proposed method. Figures 11(b)-(d) show our results. Notice that there is a large range (1.1 ď θstk ď 1.5) over which the recognition accuracy remains stable. For very small values of θstk, Automatic Temporal Scale Selection We evaluate the improvement gained by our method using automatic temporal scale selection by repeating our experiments with constant temporal scale for STK detection and Local HOPC descriptor extraction. Table 7 shows the average recognition accuracy of our proposed method using a constant temporal scale (τ 2) and automatic temporal scale selection. The proposed automatic temporal scale selection technique achieved higher accuracy which shows the robustness of our method to action speed variations. 
8.7 Evaluation of Parameters and Computation Time Number of STKs To study the effect of the total number of STKs (nk ), we select STKs with the top nk 100, 400, 700, 1000 quality factors as shown in Fig. 10. Note how the STK detector effectively captures the movement of the hands in the highest quality STKs, and noisy points begin to appear as Fig. 11: Average recognition accuracy of Local HOPC+STK-D versus (a) the number of STKs, (b) θstk, (c) θl, and (d) θg on the Northwestern-UCLA [29] and the UWA3D Multiview Activity II datasets.

13 MANUSCRIPT FOR REVIEW, a unique coordinate basis can not be obtained and for larger values of θ STK, the number of detected STKs is not sufficient. A more stable trend in recognition accuracy can be observed for varying the thresholds θ l and θ g. The recognition accuracy starts to decrease when θ l ą 1.5 because the number of points within the spatio-temporal support volume of STKs which have unique eigenvectors starts to decrease. Finally, varying θ g does not change the accuracy significantly because the extracted STKs from most actions already have a unique orientation and the proposed iterative refinement process (Section 6.3) almost always finds an unambiguous orientation for all values of θ g Computation Time The average computational time of the STK detection is 1.7 seconds per frame on a 3.4GHz machine with 24GB RAM using Matlab. The calculation of Local HOPC at STKs takes 0.2 seconds per frame. The overall computational time of the proposed view-invariant method is about 2 seconds per frame. However, the proposed view-dependent Holistic HOPC is faster and take only 0.6 seconds per frame. The average computation time of the nearest competitor AOG [29] that uses depth images is 1.4 seconds per frame. However, our method outperforms AOG [29] on single-view and multi-view datasets by significant margins. Moreover, the calculation of STK and HOPC are individually parallel in nature and can be implemented on a GPU. 9 DISCUSSION AND CONCLUSION Performance of the current 3D action recognition techniques degrades under viewpoint variations because they treat 3D videos as depth image sequences. Depth images are defined with respect to a particular viewpoint and are thus highly dependent on the viewpoint. We have proposed an algorithm for cross-view action recognition which directly processes 3D pointcloud videos to achieve robustness to variations in viewpoint, subject scale and action speed. We have also proposed the HOPC descriptor that is well integrated with our proposed spatio-temporal keypoint (STK) detection algorithm. Local HOPC descriptor combined with global STK-Distribution achieve state-of-the-art results on two standard cross-view action recognition datasets. Unlike HOJ3D [27], LARP [59], AE [39], and AOG [29], our method does not require skeleton data. Skeleton or joint estimation methods such as [43] do not work well when the human body is only partially visible. Moreover, joint estimation may not be reliable when the subject is not in an upright position (e.g. patient lying on bed) [8] or touches the background. Finally, surveillance cameras are usually mounted at elevated angles which causes further difficulties in joint estimation [8]. Thus, our proposed methods (and other non-skeleton based methods) are more general in the sense that they can be applied to a wider variety of action recognition problems where skeletonization of the data is either not possible or has not been achieved yet. ACKNOWLEDGMENT We thank the authors of [29] for providing the Northwestern-UCLA Multiview Action3D dataset and especially Dr Jiang Wang for answering our questions about the implementation of AOG [29] and AE [28] methods. We also thank the authors of [3], [8] [10], [36], [59], [60] for making their codes publicly available. This research is supported by ARC Discovery grant DP REFERENCES [1] J. K. Aggarwal and M. S. Ryoo, Human activity analysis: A review, in ACM Computing Survey, [2] D. Weinland, R. Ronfard, and E. 
Computation Time
The average computation time of the STK detection is 1.7 seconds per frame on a 3.4 GHz machine with 24 GB RAM using Matlab. The calculation of Local HOPC at the STKs takes 0.2 seconds per frame, so the overall computation time of the proposed view-invariant method is about 2 seconds per frame. The proposed view-dependent Holistic HOPC is faster and takes only 0.6 seconds per frame. The average computation time of the nearest competitor, AOG [29], which uses depth images, is 1.4 seconds per frame; however, our method outperforms AOG [29] on both single-view and multi-view datasets by significant margins. Moreover, the STK and HOPC computations are individually parallel in nature and can be implemented on a GPU.

9 DISCUSSION AND CONCLUSION
The performance of current 3D action recognition techniques degrades under viewpoint variations because they treat 3D videos as depth image sequences; depth images are defined with respect to a particular viewpoint and are therefore highly viewpoint dependent. We have proposed an algorithm for cross-view action recognition that directly processes 3D pointcloud videos to achieve robustness to variations in viewpoint, subject scale and action speed. We have also proposed the HOPC descriptor, which is well integrated with our proposed spatio-temporal keypoint (STK) detection algorithm. The Local HOPC descriptor combined with the global STK-Distribution (STK-D) achieves state-of-the-art results on two standard cross-view action recognition datasets. Unlike HOJ3D [27], LARP [59], AE [39] and AOG [29], our method does not require skeleton data. Skeleton or joint estimation methods such as [43] do not work well when the human body is only partially visible. Moreover, joint estimation may not be reliable when the subject is not in an upright position (e.g., a patient lying on a bed) [8] or touches the background. Finally, surveillance cameras are usually mounted at elevated angles, which causes further difficulties in joint estimation [8]. Our proposed methods (and other non-skeleton based methods) are therefore more general in the sense that they can be applied to a wider variety of action recognition problems where skeletonization of the data is either not possible or has not yet been achieved.

ACKNOWLEDGMENT
We thank the authors of [29] for providing the Northwestern-UCLA Multiview Action3D dataset, and especially Dr Jiang Wang for answering our questions about the implementation of the AOG [29] and AE [28] methods. We also thank the authors of [3], [8]-[10], [36], [59], [60] for making their code publicly available. This research is supported by ARC Discovery grant DP.

REFERENCES
[1] J. K. Aggarwal and M. S. Ryoo, Human activity analysis: A review, in ACM Computing Surveys.
[2] D. Weinland, R. Ronfard, and E. Boyer, A survey of vision-based methods for action representation, segmentation and recognition, in CVIU.
[3] Z. Cheng, L. Qin, Y. Ye, Q. Huang, and Q. Tian, Human daily action analysis with multi-view and color-depth data, in ECCVW.
[4] C. Lu, J. Jia, and C.-K. Tang, Range-sample depth feature for action recognition, in CVPR.
[5] H. Rahmani, A. Mahmood, D. Huynh, and A. Mian, Action classification with locality-constrained linear coding, in ICPR.
[6] H. Rahmani, A. Mahmood, A. Mian, and D. Huynh, Real time action recognition using histograms of depth gradients and random decision forests, in WACV.
[7] X. Yang, C. Zhang, and Y. Tian, Recognizing actions using depth motion maps-based histograms of oriented gradients, in ACM ICM.
[8] L. Xia and J. Aggarwal, Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera, in CVPR.
[9] O. Oreifej and Z. Liu, HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences, in CVPR.
[10] X. Yang and Y. Tian, Super normal vector for activity recognition using depth sequences, in CVPR.
[11] H. Zhang, W. Zhou, C. Reardon, and L. E. Parker, Simplex-based 3D spatio-temporal feature description for action recognition, in CVPR.
[12] Y. Tian, R. Sukthankar, and M. Shah, Spatiotemporal deformable part models for action detection, in CVPR.
[13] M. Zanfir, M. Leordeanu, and C. Sminchisescu, The Moving Pose: An efficient 3D kinematics descriptor for low-latency action recognition and detection, in ICCV.
[14] X. Yang and Y. Tian, EigenJoints-based action recognition using naive Bayes nearest neighbor, in CVPRW.
[15] A. Yilmaz and M. Shah, Action sketch: A novel action representation, in CVPR.
[16] T. Syeda-Mahmood, A. Vasilescu, and S. Sethi, Action recognition from arbitrary views using 3D exemplars, in ICCV.
[17] F. Lv and R. Nevatia, Single view human action recognition using key pose matching and Viterbi path searching, in CVPR.
[18] D. Gavrila and L. Davis, 3D model-based tracking of humans in action: A multi-view approach, in CVPR.
[19] T. Darrell, I. Essa, and A. Pentland, Task-specific gesture analysis in real-time using interpolated views, in PAMI.
[20] B. Li, O. Camps, and M. Sznaier, Cross-view activity recognition using hankelets, in CVPR.
[21] V. Parameswaran and R. Chellappa, View invariance for human action recognition, in IJCV.
[22] C. Rao, A. Yilmaz, and M. Shah, View-invariant representation and recognition of actions, in IJCV.
[23] T. Syeda-Mahmood, A. Vasilescu, and S. Sethi, Recognizing action events from multiple viewpoints, in IEEE Workshop on Detection and Recognition of Events in Video.
[24] S. Seitz and C. Dyer, View-invariant analysis of cyclic motion, in IJCV.
[25] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, Actions as space-time shapes, in ICCV.
[26] D. Weinland, R. Ronfard, and E. Boyer, Free viewpoint action recognition using motion history volumes, in CVIU.
[27] L. Xia, C. C. Chen, and J. K. Aggarwal, View invariant human action recognition using histograms of 3D joints, in CVPRW.
[28] J. Wang, Z. Liu, Y. Wu, and J. Yuan, Learning actionlet ensemble for 3D human action recognition, in PAMI, 2013.

[29] J. Wang, X. Nie, Y. Xia, Y. Wu, and S. Zhu, Cross-view action modeling, learning and recognition, in CVPR.
[30] A. Gupta, J. Martinez, J. J. Little, and R. J. Woodham, 3D pose from motion for cross-view action recognition via non-linear circulant temporal encoding, in CVPR.
[31] J. Zheng and Z. Jiang, Learning view-invariant sparse representations for cross-view action recognition, in ICCV.
[32] Z. Zhang, C. Wang, B. Xiao, W. Zhou, S. Liu, and C. Shi, Cross-view action recognition via a continuous virtual path, in CVPR.
[33] A. Farhadi and M. K. Tabrizi, Learning to recognize activities from the wrong view point, in ECCV.
[34] A. Farhadi, M. K. Tabrizi, I. Endres, and D. A. Forsyth, A latent model of discriminative aspect, in ICCV.
[35] J. Liu, M. Shah, B. Kuipers, and S. Savarese, Cross-view action recognition via view knowledge transfer, in CVPR.
[36] R. Li and T. Zickler, Discriminative virtual views for cross-view action recognition, in CVPR.
[37] H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian, HOPC: Histogram of oriented principal components of 3D pointclouds for action recognition, in ECCV.
[38] W. Li, Z. Zhang, and Z. Liu, Action recognition based on a bag of 3D points, in CVPRW.
[39] J. Wang, Z. Liu, Y. Wu, and J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, in CVPR.
[40] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu, Robust 3D action recognition with random occupancy patterns, in ECCV.
[41] X. Yang and Y. Tian, A survey of vision-based methods for action representation, segmentation and recognition.
[42] L. Campbell and A. Bobick, Recognition of human body motion using phase space constraints, in ICCV.
[43] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, Real-time human pose recognition in parts from single depth images, in CVPR.
[44] A. W. Vieira, E. Nascimento, G. Oliveira, Z. Liu, and M. Campos, STOP: Space-time occupancy patterns for 3D action recognition from depth map sequences, in CIARP.
[45] S. Tang, X. Wang, X. Lv, T. Han, J. Keller, Z. He, M. Skubic, and S. Lao, Histogram of oriented normal vectors for object recognition with a depth sensor, in ACCV.
[46] I. Laptev, On space-time interest points, in IJCV.
[47] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, Behavior recognition via sparse spatio-temporal features, in ICCV.
[48] G. Willems, T. Tuytelaars, and L. Van Gool, An efficient dense and scale-invariant spatio-temporal interest point detector, in ECCV.
[49] H. Wang and C. Schmid, Action recognition with improved trajectories, in ICCV.
[50] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, Action recognition by dense trajectories, in CVPR.
[51] S. Wu, O. Oreifej, and M. Shah, Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories, in ICCV.
[52] N. J. Mitra and A. Nguyen, Estimating surface normals in noisy point cloud data, in SCG.
[53] S. Maji, A. C. Berg, and J. Malik, Classification using intersection kernel support vector machines is efficient, in CVPR.
[54] Y. W. Chen and C. J. Lin, Combining SVMs with various feature selection strategies, in Feature Extraction, ser. Studies in Fuzziness and Soft Computing, Springer Berlin Heidelberg.
[55] J. Xie and C. Wang, Using support vector machines with a novel hybrid feature selection method for diagnosis of erythemato-squamous diseases, in Expert Systems with Applications.
[56] H. S. M. Coxeter, Regular Polytopes, Dover Publications.
[57] F. Tombari and L. Di Stefano, Performance evaluation of 3D keypoint detectors, in IJCV.
[58] A. Mian, M. Bennamoun, and R. Owens, On the repeatability and quality of keypoints for local feature-based 3D object retrieval from cluttered scenes, in IJCV.
[59] R. Vemulapalli, F. Arrate, and R. Chellappa, Human action recognition by representing 3D skeletons as points in a Lie group, in CVPR.
[60] R. Gopalan, R. Li, and R. Chellappa, Domain adaptation for object recognition: An unsupervised approach, in ICCV.

Hossein Rahmani received the B.Sc. degree in Computer Software Engineering from Isfahan University of Technology (IUT), Isfahan, Iran, in 2004, and the M.S. degree in Software Engineering from Shahid Beheshti University (SBU), Tehran, Iran, in 2010. He is currently working towards the PhD degree in computer science at The University of Western Australia. His research interests include computer vision, 3D shape analysis and pattern recognition, and his current research is focused on RGB-Depth based human activity recognition.

Arif Mahmood obtained a Gold Medal in his MS and completed his PhD with distinction at the Lahore University of Management Sciences. He is currently a postdoctoral researcher at Qatar University. Previously, he was a Research Assistant Professor at The University of Western Australia. His major research interests include data clustering and classification, action and object recognition, crowd analysis, community detection, and content-based image search and matching.

Du Huynh is an Associate Professor at the School of Computer Science and Software Engineering, The University of Western Australia. She obtained her PhD in computer vision in 1994 from the same university. Since then, she has worked for the Australian Cooperative Research Centre for Sensor Signal and Information Processing (CSSIP) and Murdoch University. She has been a visiting scholar at Lund University, Malmo University, the Chinese University of Hong Kong, Nagoya University, Gunma University, and the University of Melbourne. Associate Professor Huynh has won several grants funded by the Australian Research Council. Her research interests include shape from motion, multiple view geometry, video image processing, and visual tracking.

Ajmal Mian completed his PhD at The University of Western Australia in 2006 with distinction and received the Australasian Distinguished Doctoral Dissertation Award from the Computing Research and Education Association of Australasia. He received two prestigious nationally competitive fellowships, namely the Australian Postdoctoral Fellowship in 2008 and the Australian Research Fellowship. He received the UWA Outstanding Young Investigator Award in 2011, the West Australian Early Career Scientist of the Year award in 2012, and the Vice-Chancellor's Mid-Career Research Award. He has secured five Australian Research Council grants worth over $2.3 million. He is currently with the School of Computer Science and Software Engineering, The University of Western Australia. His research interests include computer vision, action recognition, 3D shape analysis, hyperspectral image analysis, machine learning, and multimodal biometrics.
