Action Recognition By Learnt Class-Specific Overcomplete Dictionaries


Tanaya Guha
Electrical and Computer Engineering, University of British Columbia, Vancouver, Canada

Rabab K. Ward
Electrical and Computer Engineering, University of British Columbia, Vancouver, Canada

Abstract

This paper presents a sparse signal representation based approach to the problem of human action recognition in videos. For each action, a redundant basis (dictionary) is learnt by solving a sparse optimization problem. A dictionary is learnt from the image patches of its corresponding action, such that every patch vector is represented by a linear combination of a small number of basis vectors. By learning one dictionary per action, each dictionary is expected to represent one particular action efficiently. We show that such class-specific dictionaries, each representative of one action, provide a powerful means of action classification. Given a query sequence, the classifier seeks the dictionary that best approximates the query. We have evaluated the proposed approach on standard datasets. Experimental results demonstrate high accuracy and robustness against occlusion and viewpoint changes.

I. INTRODUCTION

Understanding human actions is a key component of computer vision research because of its applications in varied fields such as human-computer interfaces, video surveillance, sports events and video indexing [1]. Various methods have been developed to analyze simple actions and complex activities. One existing line of research is optical flow based data matching [2]. The computation of optical flow vectors is difficult due to smooth surfaces and motion discontinuities [1]. Another approach is to analyze human motion dynamically using state-space modeling [3]. These methods are suited to modeling complex activities but can be computationally intensive.
Some researchers have also developed algorithms that consider action sequences as 3D space-time volumes or shapes [1], [4]. Feature tracking is another popular action recognition method, and it depends largely on the accuracy of the tracking system [5]. Another effective approach is Bag-Of-Words (BOW) modeling, which represents an action in terms of the codewords of a predefined codebook [6]. Despite the significant research effort, the applicability of most action recognition systems is limited by real-life conditions such as occlusion and changes in appearance, and by computational complexity.

Sparse representation of signals has grown into a major field of research in recent years. It is now well established that signals such as natural images admit a sparse decomposition when they are represented over some redundant basis, called a dictionary. A crucial step is the design or selection of such a dictionary. One option is to choose from a set of pre-defined dictionaries, e.g. curvelets, bandlets or variants of wavelets [7]. Another interesting option is to use the training samples directly as the dictionary columns, as proposed by Wright et al. [8]. Recent research shows that it is also possible to learn a dictionary from a set of training examples so that the dictionary is better adapted to the given data. Practical dictionary learning algorithms such as MOD (Method of Optimal Directions) [9] and K-SVD [10] have been proposed. Dictionaries learnt using these algorithms are non-parametric and can yield a more compact representation of the signal because they are better adapted to capture the inherent structure of the data. Dictionaries learnt by the K-SVD method have been shown to achieve better results than off-the-shelf dictionaries in image denoising [11]. Though primarily developed for signal reconstruction, the idea of sparse decomposition has also been used to solve several classification problems.
Inspired by the success of sparse analysis based solutions to texture segmentation [12], [13], face recognition [8] and audio classification [14], we propose a sparse signal representation based approach for learning and recognizing human actions in videos.

This paper aims at recognizing human actions using learnt overcomplete dictionaries. Our method is silhouette-based; it represents each video sequence by a single image, computed by averaging the silhouettes extracted from all frames. We refer to such an averaged silhouette image as an averaged Motion Energy Image (MEI) [4]. The training phase of the proposed method involves learning one overcomplete dictionary for every action. A dictionary for a particular action is learnt by adapting a set of basis vectors to a large number of random patches extracted from the averaged MEIs of that action. This is done in such a way that each input patch has a sparse representation over the basis vectors. A dictionary tailored to a particular class is expected to give an efficient (i.e. sparser) representation for that class of signals and a less efficient (less sparse) representation for signals that belong to a different class. Thus class-specific dictionaries become representative of their corresponding classes. Given a query action video, the classification problem is equivalent to finding, among the learnt dictionaries, the one that best approximates the query.

The proposed method has the following advantages:

- The dictionary learning process of one class is independent of the other classes, so training can be easily parallelized. Also, the existing dictionaries need not change when a new class is added to the system. This is a major advantage, as in practice most databases have to be continuously updated with new classes.
- We do not assume the availability of a large number of training samples per class. The dictionaries can be learnt even when a single training sample is available per class.
- The use of Random Projections for dimensionality reduction makes the approach fast and computationally efficient.

Difference from BOW modeling: The dictionary learning approach is significantly different from the traditional BOW approach. We represent each input vector as a linear combination of a small number of codewords, called dictionary columns, while in BOW modeling an input vector is approximated by only one codeword, i.e. a single column vector of a codebook/dictionary. We also advocate learning class-specific dictionaries, as opposed to constructing a single dictionary for all classes as in the BOW approach. Note that our method also differs notably from the sparse face recognition approach presented in [8]: we learn the dictionary atoms from the training data by solving an optimization problem, whereas in [8] the training samples are used directly as dictionary atoms.

To evaluate the proposed method, we have designed various experiments on the Weizmann human action and Robustness datasets [1], and on subsets of the KTH human motion dataset [15] and the UCF sports dataset [16]. The experimental results show high accuracy and robustness of the approach against partial occlusion, viewpoint and scale changes.

II. THE PROPOSED METHOD

Assume that there are $C$ classes of actions and $c$ training samples per class; the training samples are denoted by $A_{ij}$, $i \in [1, 2, \ldots, C]$ and $j \in [1, 2, \ldots, c]$. Our aim is to learn $C$ separate dictionaries and to use them to classify any new video sequence.

A. Video Representation

For each action sequence $A_{ij}$, $i \in [1, 2, \ldots, C]$, $j \in [1, 2, \ldots, c]$, silhouettes are extracted at each frame.
A simple normalized cross-correlation based registration technique is used to align the human figures. The aligned silhouettes are then transformed to an averaged MEI, denoted $I_{ij} \in \mathbb{R}^{M \times M}$, by simply taking their mean. This MEI representation implicitly captures both the shape and the temporal information of human actions and is very fast to compute. Figure 1 shows examples of the averaged MEIs for the 10 actions in the Weizmann dataset.

Fig. 1. Averaged MEIs for actions (clockwise) bend, jack, jump, pjump, run, wave2, wave1, walk, skip, side.

B. Multiscale Patch Domain Analysis

Each MEI is subjected to local image analysis at two scales: random overlapping patches are extracted from the original MEI and from its downsampled (and smoothed) version. At each scale, $\rho$ patches of size $\eta \times \eta$ are extracted. Ideally, one patch centered at every image pixel should be extracted, but in practice extracting a sufficiently large number of patches is enough. Every patch can be represented as a vector of length $\eta^2$. The process of patch extraction can be described as the action of a linear operator $\Phi$:

$$\Phi : \mathbb{R}^{M \times M} \rightarrow \mathbb{R}^{\eta^2 \times 2\rho} \tag{1}$$

Patches carrying little or no information are removed by hard thresholding. The set of all informative patches collected over the $c$ training samples of a class constitutes the training set for dictionary learning. Let such a collection of patches be denoted by $P_i \in \mathbb{R}^{\eta^2 \times 2c\rho}$, $i \in [1, 2, \ldots, C]$, where $2c\rho$ is the number of patches extracted per class.

C. Dimensionality Reduction

A patch size as small as $16 \times 16$ produces a vector of dimension $256 \times 1$. With a redundancy of 2, each dictionary would then need more than 500 basis vectors ($2 \times 256 = 512$) in order to secure a sparse representation of the input data. This high dimensionality seriously limits the speed and efficiency of our algorithm. A natural solution is to reduce the dimensionality.
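The video representation and patch extraction of Sections II-A and II-B can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the 2 x 2 box-filter downsampling, the use of squared patch energy for the hard threshold, and the default threshold value are all hypothetical choices made here; silhouette extraction and registration are assumed to have been done already.

```python
import numpy as np

def averaged_mei(silhouettes):
    """Average aligned silhouettes of shape (T, M, M) into one (M, M) image."""
    return np.asarray(silhouettes, dtype=float).mean(axis=0)

def downsample(image):
    """Smooth with a 2x2 box filter and halve the resolution (assumed scheme)."""
    m = image.shape[0] // 2 * 2          # crop to an even size
    n = image.shape[1] // 2 * 2
    img = image[:m, :n]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2]
                   + img[0::2, 1::2] + img[1::2, 1::2])

def random_patches(image, eta, rho, rng):
    """Draw rho random eta x eta patches and return them as columns (eta^2, rho)."""
    rows = rng.integers(0, image.shape[0] - eta + 1, size=rho)
    cols = rng.integers(0, image.shape[1] - eta + 1, size=rho)
    return np.stack([image[r:r + eta, c:c + eta].ravel()
                     for r, c in zip(rows, cols)], axis=1)

def extract_informative_patches(mei, eta=16, rho=1000,
                                energy_threshold=1.0, seed=0):
    """Patches from two scales, keeping only those above an energy threshold."""
    rng = np.random.default_rng(seed)
    patches = np.hstack([random_patches(mei, eta, rho, rng),
                         random_patches(downsample(mei), eta, rho, rng)])
    energy = np.sum(patches ** 2, axis=0)   # assumed notion of "information"
    return patches[:, energy > energy_threshold]
```

Stacking the outputs of `extract_informative_patches` over the $c$ training MEIs of a class column-wise would give the matrix $P_i$ used in the remainder of this section.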
To obtain a lower dimensional representation, standard methods such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are widely used. Recently Random Projection (RP) has emerged as a powerful tool [17]. Theoretical results show that random projections of high dimensional data into a lower dimensional subspace can preserve the distances between vectors quite well. RP is very simple to use and has low computational complexity. The original $d$-dimensional data ($d = \eta^2$) is projected onto an $n$-dimensional subspace ($n \ll d$) by premultiplying the data matrix $P_i$ by a random matrix $R_{proj} \in \mathbb{R}^{n \times d}$. In practice, any normally distributed $R_{proj}$ with zero mean and unit variance serves the purpose. The dimensionality reduction is performed as

$$Y_i = R_{proj} P_i \tag{2}$$

where $i \in [1, 2, \ldots, C]$. The dimensionality reduced data matrix $Y_i \in \mathbb{R}^{n \times 2c\rho}$ contains the projections of $P_i$ into some random $n$-dimensional subspace (they are not true projections, because the rows of $R_{proj}$ are not orthogonal).

D. Dictionary Learning

The next step is to learn $C$ redundant dictionaries $D_i$, $i \in [1, 2, \ldots, C]$, i.e. one dictionary per class. The idea of fitting a dictionary to a set of training examples was originally proposed by Olshausen and Field in 1996 [18]. Recently an algorithm known as K-SVD has been proposed for adapting dictionaries to the data [10]. We have applied K-SVD to learn the class-specific dictionaries because it exhibits faster convergence than its competitors.

Consider a set of dimensionality reduced patches as our training signals $Y = \{y_1, y_2, \ldots, y_N\}$, where $y_i \in \mathbb{R}^n$ and $N = 2c\rho$. We intend to find a dictionary matrix $D \in \mathbb{R}^{n \times K}$ having $K$ ($K > n$) atoms $\{d_1, d_2, \ldots, d_K\}$, over which $Y$ has a sparse representation $X = \{x_1, x_2, \ldots, x_N\}$, where $x_i \in \mathbb{R}^K$. In other words, we seek a decomposition such that every patch in $Y$ is represented by a linear combination of no more than $\tau_1$ ($\tau_1 \ll K$) dictionary atoms. This optimization problem can be formally written as

$$\min_{D, X} \|Y - DX\|_2^2 \quad \text{subject to} \quad \|x_i\|_0 \le \tau_1 \tag{3}$$

K-SVD solves (3) by performing two steps at each iteration: (i) sparse coding and (ii) dictionary update. In the sparse coding stage, $D$ is kept fixed and the coefficient matrix $X$ is computed by a pursuit algorithm. Next, the dictionary $D$ is updated sequentially, one atom at a time, allowing the relevant coefficients in $X$ to change as well. This simultaneous update of the dictionary atoms and their coefficients is unique to K-SVD and results in faster convergence. The dictionary atoms are always normalized. For the details of this algorithm, please refer to the works of Aharon et al. [10], [11]. Fig. 2 displays some sample atoms of the dictionary learnt for the action bend.

Fig. 2. Sample dictionary atoms from the dictionary learnt by the K-SVD method for the action bend.

TABLE I. Relation between class-specific dictionary size and classification accuracy ($\tau_1 \approx 10\%$ of $K$ and $\tau_2 = 2$)

Dictionary size    Accuracy (%)
16 × 32            72.2
32 × 64            88.9
64 × 128           97.7

Note that (3) contains the term $\|\cdot\|_0$, i.e. the $\ell_0$ seminorm, which counts the number of non-zero elements in a vector. The $\ell_0$ term renders the problem non-convex. Hence, solving (3) exactly is NP-hard.
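The random projection of equation (2) and the two K-SVD steps described above (sparse coding with the dictionary fixed, then an atom-by-atom dictionary update) can be sketched in NumPy as follows. This is a simplified illustration, not the authors' code: the function names `random_projection`, `omp` and `ksvd` are hypothetical, the atom update uses the rank-1 SVD of the restricted residual as in the standard K-SVD formulation of [10], and atoms are initialized from randomly chosen training vectors as described later in Section III-A.

```python
import numpy as np

def random_projection(P, n, rng):
    """Equation (2): Y = R_proj P with i.i.d. N(0, 1) entries in R_proj."""
    R_proj = rng.standard_normal((n, P.shape[0]))
    return R_proj @ P, R_proj

def omp(D, y, sparsity):
    """Greedy Orthogonal Matching Pursuit with at most `sparsity` atoms."""
    residual, support = y.copy(), []
    x, coef = np.zeros(D.shape[1]), np.zeros(0)
    for _ in range(sparsity):
        k = int(np.argmax(np.abs(D.T @ residual)))   # best-correlated atom
        if k in support:                             # no further progress
            break
        support.append(k)
        # Orthogonalization step: least-squares refit on the whole support.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

def ksvd(Y, K, sparsity, n_iter=20, seed=0):
    """Learn D (n x K) and sparse codes X (K x N) for the columns of Y."""
    rng = np.random.default_rng(seed)
    # Initialize atoms with random training vectors, normalized to unit norm.
    D = Y[:, rng.choice(Y.shape[1], size=K, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    for _ in range(n_iter):
        # (i) Sparse coding: D fixed, one pursuit per training signal.
        X = np.column_stack([omp(D, y, sparsity) for y in Y.T])
        # (ii) Dictionary update: refit each atom and its coefficients.
        for k in range(K):
            users = np.flatnonzero(X[k])             # signals using atom k
            if users.size == 0:
                continue
            X[k, users] = 0.0
            E = Y[:, users] - D @ X[:, users]        # residual without atom k
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]                        # unit-norm by construction
            X[k, users] = s[0] * Vt[0]
    return D, X
```

With the sizes reported in Section III, a call would look like `Y, R = random_projection(P_i, 64, rng)` followed by `D_i, _ = ksvd(Y, K=128, sparsity=12, n_iter=20)`. The per-signal Python loop is written for clarity and would be slow at that scale.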
Nonetheless, it is well known that if $\tau_1$ is small compared to $K$, approximate solutions of such problems can be found using pursuit algorithms. The simplest pursuit algorithms are Matching Pursuit (MP) and Orthogonal MP (OMP). These two algorithms are greedy methods that select the dictionary elements sequentially by computing inner products between the input data and the dictionary atoms. OMP differs from MP in the orthogonalization step it performs after each basis vector is selected. Another approach is to render the problem convex by replacing the $\ell_0$ norm with the $\ell_1$ norm. This is known as the Basis Pursuit (BP) algorithm. The FOCal Under-determined System Solver (FOCUSS) uses the more general $\ell_p$ norm, where $p \le 1$. K-SVD is flexible and can work with any of these pursuit algorithms.

TABLE II. Comparison with the simple reconstructive approach. Columns: Approach, Accuracy (%), Time/classification (sec). Rows: Simple reconstruction, Proposed method.

E. Classification Strategy

Consider a query sequence $V_Q$. After transforming it to an averaged MEI, a large number of random patches are extracted and projected into the same lower dimensional subspace, defined by $R_{proj}$. The set of dimensionality reduced patches taken from $V_Q$ is denoted by $Q = \{q_1, q_2, \ldots, q_N\}$, where $q_i \in \mathbb{R}^n$.

Given the dictionaries of all classes, the simplest classification approach is to approximate the test patches by each of the $C$ dictionaries with some constant sparsity $\tau_2$. This generates $C$ different reconstruction errors. The dictionary that produces the minimum error determines the class of $V_Q$. This will be referred to as the simple reconstructive approach. The reconstruction error has already been shown to be discriminative for texture segmentation [12].

We propose an alternative classification strategy that uses sparsity as the discriminating criterion. This method concatenates the class-specific dictionaries to create a bigger dictionary $D_S$:

$$D_S = [D_1 \; D_2 \; \ldots \; D_C] \tag{4}$$

where $D_S \in \mathbb{R}^{n \times KC}$.
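The two decision rules of this subsection can be sketched side by side: the simple reconstructive approach (minimum reconstruction error over the individual dictionaries) and the sparsity-based rule over the concatenated dictionary of equation (4), whose formal statement is given in equations (5) to (7). The code is an illustrative NumPy sketch, not the authors' implementation; the function names are hypothetical, and all class dictionaries are assumed to have the same number of atoms $K$ and unit-norm columns.

```python
import numpy as np

def omp(D, y, sparsity):
    """Greedy Orthogonal Matching Pursuit with at most `sparsity` atoms."""
    residual, support = y.copy(), []
    x, coef = np.zeros(D.shape[1]), np.zeros(0)
    for _ in range(sparsity):
        k = int(np.argmax(np.abs(D.T @ residual)))   # best-correlated atom
        if k in support:                             # no further progress
            break
        support.append(k)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

def sparse_code(D, Q, sparsity):
    """Code every column of Q over D; returns a (num_atoms, N) matrix."""
    return np.column_stack([omp(D, q, sparsity) for q in Q.T])

def classify_by_reconstruction(dictionaries, Q, tau2):
    """Simple reconstructive approach: smallest total squared error wins."""
    errors = [np.linalg.norm(Q - D @ sparse_code(D, Q, tau2)) ** 2
              for D in dictionaries]
    return int(np.argmin(errors))

def classify_by_sparsity(dictionaries, Q, tau2):
    """Proposed rule: the class whose block of X_Q has most non-zeros wins."""
    D_S = np.hstack(dictionaries)                    # equation (4)
    X_Q = sparse_code(D_S, Q, tau2)
    K = dictionaries[0].shape[1]                     # assumed equal for all
    counts = [np.count_nonzero(X_Q[i * K:(i + 1) * K])
              for i in range(len(dictionaries))]
    return int(np.argmax(counts))
```

Both functions return a zero-based class index. The sparsity-based rule runs a single pursuit per patch over $D_S$, whereas the reconstructive rule runs one pursuit per patch and per class.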
With the help of a pursuit algorithm we find $X_Q$, the sparse representation of $Q$ over $D_S$, as follows:

$$\min_{X_Q} \|Q - D_S X_Q\|_2^2 \quad \text{subject to} \quad \|x_Q\|_0 \le \tau_2 \tag{5}$$

where $x_Q$ denotes a column of $X_Q$. The resulting coefficient matrix $X_Q$ can also be written as

$$X_Q = [X_{D_1} \; X_{D_2} \; \ldots \; X_{D_C}]^T \tag{6}$$

where $X_{D_i}$ is the coefficient matrix corresponding to the $i$th dictionary. Assuming $V_Q$ belongs to class $q$, ideally $Q$ should use only the atoms of $D_q$ for its linear decomposition, which means that all the non-zero coefficients should be concentrated in $X_{D_q}$. Although this condition is difficult to achieve in practice because of the correlation amongst the atoms of different dictionaries and the input vectors, we can still expect that a large number of non-zero coefficients will come from $X_{D_q}$. The estimated class $\hat{i}$ of $Q$ is given by

$$\hat{i} = \arg\max_{i \in [1, 2, \ldots, C]} \|X_{D_i}\|_0 \tag{7}$$

This means that the query $V_Q$ is assigned to the class whose dictionary makes the maximum contribution (number of non-zeros) to the sparse representation of the patches of $Q$.

Clearly $X_Q$ is block sparse, because its non-zero coefficients occur in clusters. This encourages us to exploit block sparsity as an additional structure. However, each block of $D_S$ is a redundant dictionary, which makes it difficult to use block sparsity promoting algorithms such as Block OMP (BOMP) [19]. We have used BOMP and observed that the results are neither consistent nor very accurate.

Why do we choose to solve a concatenated system? Solving a single system with a huge dictionary of size $n \times KC$ may seem identical to solving each of the $C$ systems, where the dictionary of each system has size $n \times K$. But this is not the case. Assuming that the least squares problem can be solved with a cost of $\lambda$ per iteration, the cost of solving the bigger system is approximately $\tau_2 (nKL + \lambda)$ for $\tau_2$ iterations, where $L$ is the number of concatenated dictionaries. If the $L$ systems are solved separately, the cost equals $L \tau_2 (nK + \lambda)$. The concatenated system therefore saves $(L - 1)\tau_2\lambda$ operations; the saving approaches a factor of $L$ only when the least squares cost $\lambda$ dominates $nKL$.

III. PERFORMANCE EVALUATION

A. Training on the Weizmann action dataset

To evaluate the proposed approach, we first train the system on the Weizmann human action dataset [1]. It consists of 90 low-resolution (deinterlaced, 50 fps) videos of 9 different subjects, each performing 10 natural actions: bend, jumping jack (jack), jump forward (jump), jump in place (pjump), run, gallop sideways (side), skip, walk, wave one hand (wave1) and wave both hands (wave2). This dataset uses a fixed camera setting and a simple background.

The training process uses 8 averaged MEIs per class. From each averaged MEI, a large number ($2\rho = 2{,}000$) of patches of size $16 \times 16$ are extracted, a thousand at each scale. To learn a class-specific dictionary, we thus have a training set of 16,000 patches. These patches have energy values above an empirically set threshold. Thresholding is important because many homogeneous patches, which contribute nothing to the training process, can be extracted from silhouettes. Each patch is converted to a vector of dimension $\eta^2 = 256$ to form $P_i \in \mathbb{R}^{256 \times 16{,}000}$, where $i \in [1, 2, \ldots, C]$. A normally distributed random matrix $R_{proj} \in \mathbb{R}^{64 \times 256}$ with zero mean and unit variance is constructed. The dimension of $P_i$ is reduced by premultiplying it with $R_{proj}$ as in (2). The resulting training set has dimension $64 \times 16{,}000$. At every run a new $R_{proj}$ is constructed.

For every class of action, a dictionary of size $64 \times 128$ ($K = 128$) is learnt. The dictionary columns are initialized with random vectors taken from the reduced $Y$. Ten dictionaries, one per action, are learnt using the K-SVD algorithm with $\tau_1 = 12$ (approximately 10% of the value of $K$) and 20 K-SVD iterations. Selecting the right dictionary size is critical. We set the redundancy to 2 and find the optimal dictionary size empirically (refer to Table I).

B. Results on the Weizmann action dataset

In the classification stage, all 10 class-specific dictionaries are concatenated to create $D_S \in \mathbb{R}^{64 \times 1280}$. Given a new test video sequence, a set of 2,000 patches is extracted and projected onto the same lower dimensional subspace, defined by $R_{proj}$. The projected data is denoted by $Q \in \mathbb{R}^{64 \times 2{,}000}$, which is sparse-coded over all the elements of $D_S$. We have used OMP to solve the sparse representation problem, since OMP provides straightforward control over the number of non-zeros through its number of iterations. From a run-time point of view, OMP is also more efficient than other, more sophisticated pursuit algorithms. At each run we train the system with the video sequences of 8 subjects and test with the remaining subject. The recognition rates reported below are averaged over 9 runs.

Figure 4 presents the recognition results achieved by the proposed approach. Our method classifies all the actions perfectly except for the action wave1, which is confused with the very similar action wave2. It misclassifies only 2 out of 90 sequences, an error rate of only 2.22%. We also observed that the recognition accuracy improves as the sparsity increases, i.e. as $\tau_2$ in (5) gets smaller (refer to Fig. 3). This trend roughly indicates that the number of highly discriminative basis vectors present in a class-specific dictionary is small, and yet those dictionary atoms are sufficient for decision-making. The highest accuracy achieved in our experiments is 97.7%, for $\tau_2 = 2$ and individual dictionary size $64 \times 128$. Note that for $\tau_2 = 1$, OMP performs only one iteration. In this iteration it selects, for every input vector, the best dictionary atom (i.e. the atom that forms the biggest inner product with the input vector). This corresponds to finding the nearest neighbor of each vector.

Fig. 3. Recognition accuracy vs. sparsity for dictionary size 64 × 128.

Fig. 4. Confusion matrix for the Weizmann action dataset. Overall recognition rate 97.7%.

We have compared our method with the simple reconstructive approach, which, with its best combination of dictionary size and sparsity ($\tau_2 = 6$), misclassifies 8 out of the 90 sequences. The results shown in Table II confirm that the proposed method is superior in terms of both accuracy and speed. Finally, we compare our approach with a number of recent works in Table III. The proposed method is comparable to the state of the art.

TABLE III. Comparison with recent works on the Weizmann dataset

Approach                  Avg. accuracy (%)
Gorelick et al. 07 [1]    97.8
Niebles et al. 08 [20]    90.0
Lin et al. 09 [21]        100
Yao et al. 10 [22]        95.6
Proposed                  97.7

C. Results on the Robustness dataset

This dataset is designed to test the robustness of an algorithm against changes in viewpoint, partial occlusion and unusual action scenarios. It consists of 20 videos of subjects walking in a non-uniform background, creating various difficult scenarios. Sample MEIs of this dataset can be found in Figs. 5 and 6. We have used the videos of all 9 subjects available in the action dataset to train the dictionaries and performed testing with the sequences of the Robustness dataset. The proposed method performs well under changes in viewpoint (Table IV) except in extreme conditions, since the system is trained only with subjects walking in 0°. The performance of the proposed method is encouraging (90%) under occlusion and other difficult scenarios (Table V).

Fig. 5. Sample MEIs from the Robustness dataset: subjects walking at different angles.

Fig. 6. Sample MEIs from the Robustness dataset: walking with a dog, knees up, occluded legs and swinging a bag.

TABLE IV. Performance under viewpoint changes (system trained with subjects walking in 0°)

Test sequence      Classified as
walking in 0°      walk
walking in 9°      walk
walking in 18°     walk
walking in 27°     walk
walking in 36°     walk
walking in 45°     walk
walking in 54°     walk
walking in 63°     jump
walking in 72°     jump
walking in 81°     jump

TABLE V. Performance under occlusion and other difficult scenarios. Test sequences: normal walk, walking in a skirt, carrying briefcase, limping man, occluded legs, knees up, walking with a dog, sleepwalking, swinging a bag, occluded by a pole. Nine of the ten sequences are classified as walk and one as jump.

D. Results on the KTH subset

The system is also tested on a subset of the KTH dataset, consisting of 48 videos of three actions (the actions common to Weizmann and KTH): running, walking and waving. These test sequences use a fixed camera setting and a homogeneous background but contain viewpoint changes and illumination and scale variations. Some sample MEIs are presented in Fig. 7 and the results are shown in Table VI.

Fig. 7. Sample MEIs from the KTH subset: run, walk, wave.

TABLE VI. Results on the KTH subset, trained on the Weizmann dataset

Action    Accuracy (%)
Run       75.0
Walk      75.0
Wave      87.5

E. Results on the UCF sports subset

Finally we test the system on 10 running sequences taken from the UCF sports dataset. These are real-world videos with unrestricted camera motion, occlusion, background clutter etc. Since the background is not known, the silhouettes are extracted by simple thresholding and morphological operations, and are therefore noisy and crude (Fig. 8). The proposed method could recognize only 4 out of the 10 sequences, i.e. a recognition rate of 40%.

Fig. 8. Sample MEIs of running sequences taken from the UCF sports dataset.

IV. CONCLUSION

We propose a novel approach for modeling and recognizing human actions. The main contributions of our work are:

- Sparse representation of human actions by learning non-parametric, overcomplete dictionaries instead of the commonly used predefined dictionaries. This, to the best of our knowledge, has not been explored yet.
- A classification strategy involving class-specific learnt dictionaries instead of a single shared dictionary.

The proposed method achieves high recognition accuracy and is robust to partial occlusion, viewpoint and scale changes. Though the method works reasonably well on the KTH dataset even when trained on the Weizmann dataset, its performance is not up to the mark for realistic sports videos. This is because the segmentation of silhouettes is difficult for such sequences. The proposed approach relies completely on local image analysis. Although local patches help the recognition process under occlusion and other distortions, when there are many classes with similar local appearance (e.g. wave1 and wave2) the method needs to be improved by incorporating global information and using more sophisticated video representations.

Our work is an attempt to show that sparse representations and class-specific dictionaries can be key to developing robust action recognition systems. There is considerable scope for future work. Some significant improvements will be to enhance the system to work for real-world videos, complex multi-subject activities etc.

REFERENCES

[1] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," IEEE Trans. PAMI, vol. 29, no. 12, December.
[2] A. A. Efros, A. C. Berg, G. Mori, and J. Malik, "Recognizing action at a distance," in Proc. ICCV.
[3] J. Yamato, J. Ohya, and K. Ishii, "Recognizing human action in time-sequential images using hidden Markov model," in Proc. CVPR, 1992.
[4] A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," IEEE Trans. PAMI, vol. 23.
[5] C. Bregler, "Learning and recognizing human dynamics in video sequences," in Proc. CVPR.
[6] Z. Zhang, Y. Hu, S. Chan, and L.-T. Chia, "Motion context: A new representation for human action recognition," in Proc. ECCV, 2008, vol. 5305.
[7] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way, 3rd ed., Academic Press, NY.
[8] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. PAMI, vol. 31.
[9] K. Engan, S. O. Aase, and J. H. Husoy, "Frame based signal compression using method of optimal directions (MOD)," in Proc. ISCAS.
[10] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Processing, vol. 54, Nov.
[11] M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Trans. Image Processing, vol. 15, no. 12, Dec.
[12] G. Peyré, "Sparse modeling of textures," J. Math. Imaging Vis., vol. 34, no. 1.
[13] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Discriminative learned dictionaries for local image analysis," in Proc. CVPR.
[14] R. Grosse, R. Raina, H. Kwong, and A. Y. Ng, "Shift-invariant sparse coding for audio classification," in Proc. UAI.
[15] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proc. ICPR, 2004, vol. 3.
[16] M. D. Rodriguez, J. Ahmed, and M. Shah, "Action MACH: a spatio-temporal maximum average correlation height filter for action recognition," in Proc. CVPR, 2008.
[17] E. Bingham and H. Mannila, "Random projection in dimensionality reduction: applications to image and text data," in Proc. ACM Int. Conf. Knowledge Discovery and Data Mining, New York, NY, USA, 2001.
[18] B. A. Olshausen and D. J. Field, "Natural image statistics and efficient coding," Network: Computation in Neural Systems, vol. 7, no. 2.
[19] Y. C. Eldar and H. Bolcskei, "Block-sparsity: Coherence and efficient recovery," in Proc. ICASSP, 2009.
[20] J. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised learning of human action categories using spatial-temporal words," Int. J. Computer Vision, vol. 79.
[21] Z. Lin, Z. Jiang, and L. S. Davis, "Recognizing actions by shape-motion prototype trees," in Proc. ICCV.
[22] A. Yao, J. Gall, and L. Van Gool, "A Hough transform-based voting framework for action recognition," in Proc. CVPR, 2010.
