Human Activity Recognition Based on R Transform and Fourier Mellin Transform


Pengfei Zhu, Weiming Hu, Li Li, and Qingdi Wei

Institute of Automation, Chinese Academy of Sciences, Beijing, China
{pfzhu,wmhu,lli,qdwei}@nlpr.ia.ac.cn

G. Bebis et al. (Eds.): ISVC 2009, Part II, LNCS 5876. © Springer-Verlag Berlin Heidelberg 2009

Abstract. Human activity recognition is attracting considerable attention in computer vision. In this paper we present a novel human activity recognition method based on the R transform and the Fourier Mellin Transform (FMT). First, we convert the original image sequence to the Radon domain and obtain the R transform curves. Then we extract Rotation-Scaling-Translation (RST) invariant features with the FMT and reduce their dimensionality with PCA. At the recognition stage, the Earth Mover's Distance (EMD) is used to compare sequences. In the experiments, we compare our method with other methods, and the results show its effectiveness.

1 Introduction

Human activity recognition is an attractive research direction in computer vision, with wide applications such as intelligent surveillance, analysis of people's physical condition, and care of the elderly [1]. It includes tracking, extraction and representation of action features, action model learning, and high-level semantic understanding. Feature representation is a key step. Because video data vary in scale, angle and location relative to the camera, feature extraction is difficult, and the extraction of view-invariant features is attracting more and more researchers.

Rao et al. [2] present a computational representation of human action that captures these dramatic changes using the spatio-temporal curvature of a 2-D trajectory. This representation is compact, view-invariant, and capable of explaining an action in terms of meaningful action units called dynamic instants and intervals. Ogale et al. [3] represent human actions as short sequences of atomic body poses. Actions and their constituent atomic poses are extracted from a set of multiview, multiperson video sequences by an automatic keyframe selection process, and are used to automatically construct a probabilistic context-free grammar (PCFG). Parameswaran and Chellappa [4] exploit a wealth of techniques in 2D invariance that can be used to advantage in 3D to 2D projection, and model actions in terms of view-invariant canonical body poses and trajectories in 2D invariance space, leading to a simple and effective way to represent and recognize human actions from a general viewpoint. Weinland et al. [5] introduce Motion History Volumes (MHV) as a free-viewpoint representation for human actions in the case of multiple calibrated, background-subtracted video cameras. They present algorithms for computing, aligning and comparing MHVs of different actions performed by different people in a variety of viewpoints. Weinland et al. [6] propose a framework in which actions are modeled using three-dimensional occupancy grids, built from multiple viewpoints, in an exemplar-based HMM. The novelty is that a 3D reconstruction is not required during the recognition phase. Instead, learned 3D exemplars are used to produce 2D image information that is compared to the observations, and parameters that describe image projections are added as latent variables in the recognition process. Li and Fukui [7] propose a view-invariant human action recognition method based on non-rigid factorization and Hidden Markov Models. Shen and Foroosh [8] show that fundamental ratios are invariant to camera parameters and hence can be used to identify similar plane motions from varying viewpoints. For action recognition, they decompose a body posture into a set of point triplets (planes). The similarity between two actions is then determined by the motion of the point triplets and hence by their associated fundamental ratios, which provides view-invariant recognition of actions. Natarajan and Nevatia [9] present an approach that simultaneously tracks and recognizes known actions and is robust to variations in view and scale, starting from a person detection in the standing pose. To tackle activity recognition, Gilbert et al. [10] propose learning compound features that are assembled from simple 2D corners in both space and time.

In this paper, we present a novel human activity recognition method based on the R transform and the Fourier Mellin Transform (FMT). Figure 1 shows the framework of our method.

Fig. 1. Overview of our approach

The rest of this paper is organized as follows. Section 2 describes the Radon transform and the R transform. The Fourier Mellin transform is introduced in Section 3. The experiments that evaluate our method are presented in Section 4. Section 5 concludes the paper and is followed by the references.

2 Radon Transform and R Transform

In mathematics, the two-dimensional Radon transform is the integral of a function over the set of lines in all directions, which is roughly equivalent to finding the projection of a shape onto any given line. Each discrete binary image is projected to the Radon domain. Let $f(x, y)$ be an image. Its Radon transform is defined as [11][12]:

$$T_{R_f}(\rho, \theta) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,\delta(x\cos\theta + y\sin\theta - \rho)\,dx\,dy = \mathrm{Radon}\{f(x, y)\} \tag{1}$$

where $\theta \in [0, \pi]$, $\rho \in [-\infty, \infty]$, and $\delta(\cdot)$ is the Dirac delta function,

$$\delta(x) = \begin{cases} 1 & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

For geometric transformations such as scaling, translation and rotation, the Radon transform has the following properties.

For a scaling factor $\alpha$,

$$\mathrm{Radon}\left\{f\left(\frac{x}{\alpha}, \frac{y}{\alpha}\right)\right\} = \frac{1}{\alpha}\,T_{R_f}(\alpha\rho, \theta) \tag{3}$$

For a translation of $(x_0, y_0)$,

$$\mathrm{Radon}\{f(x - x_0, y - y_0)\} = T_{R_f}(\rho - x_0\cos\theta - y_0\sin\theta, \theta) \tag{4}$$

For a rotation of $\theta_0$,

$$\mathrm{Radon}\{f_{\theta_0}(x, y)\} = T_{R_f}(\rho, \theta + \theta_0) \tag{5}$$

From equations (3)-(5), we can see that the Radon transform is not invariant to scaling, translation or rotation. An improved representation of the Radon transform, the R transform, is introduced in [13][12]:

$$R_f(\theta) = \int_{-\infty}^{\infty} T_{R_f}^2(\rho, \theta)\,d\rho \tag{6}$$

For a scaling factor $\alpha$,

$$\frac{1}{\alpha^2}\int_{-\infty}^{\infty} T_{R_f}^2(\alpha\rho, \theta)\,d\rho = \frac{1}{\alpha^3}\int_{-\infty}^{\infty} T_{R_f}^2(\nu, \theta)\,d\nu = \frac{1}{\alpha^3}\,R_f(\theta) \tag{7}$$

For a translation of $(x_0, y_0)$,

$$\int_{-\infty}^{\infty} T_{R_f}^2(\rho - x_0\cos\theta - y_0\sin\theta, \theta)\,d\rho = \int_{-\infty}^{\infty} T_{R_f}^2(\nu, \theta)\,d\nu = R_f(\theta) \tag{8}$$

For a rotation of $\theta_0$,

$$\int_{-\infty}^{\infty} T_{R_f}^2(\rho, \theta + \theta_0)\,d\rho = R_f(\theta + \theta_0) \tag{9}$$

From equations (7)-(9), we can see that the R transform is invariant to translation, that a change of scale only rescales its amplitude, and that a rotation results in a phase shift. In the experiments, we normalize the R transform curve by equation (10) to obtain scale invariance:

$$R'(\theta) = \frac{R(\theta)}{\max_{\theta} R(\theta)} \tag{10}$$

Figure 2 shows the Radon transform and the R transform of the example images.

Fig. 2. The Radon transform and the R transform of the example images
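As a rough illustration of equations (6) and (10), the following sketch computes an R transform curve for a small binary shape. It is not the authors' implementation: the function names `r_curve` and `normalize` are hypothetical, and the Radon projection is approximated by counting foreground pixels in unit-width $\rho$ bins, which is an assumption made here for brevity. On a discrete grid the translation invariance of equation (8) holds exactly only at angles where $\rho$ stays on the integer grid (such as 0° and 90°) and approximately elsewhere.

```python
import math

def r_curve(pixels, n_angles=180):
    """Unnormalized R transform, Eq. (6), of a shape given as (x, y) foreground pixels."""
    # Every projection rho = x*cos(theta) + y*sin(theta) lies in [-rho_max, rho_max].
    rho_max = math.ceil(max(math.hypot(x, y) for x, y in pixels))
    curve = []
    for a in range(n_angles):
        theta = math.pi * a / n_angles
        c, s = math.cos(theta), math.sin(theta)
        # Assumed discretization: nearest-bin Radon projection for this angle.
        projection = [0] * (2 * rho_max + 1)
        for x, y in pixels:
            projection[round(x * c + y * s) + rho_max] += 1
        # Eq. (6): integrate the squared projection over rho.
        curve.append(sum(t * t for t in projection))
    return curve

def normalize(curve):
    """Eq. (10): divide by the maximum so that amplitude scaling is removed."""
    peak = max(curve)
    return [v / peak for v in curve]

# A 4x2 rectangle and a translated copy of it.
rect = [(x, y) for x in range(4) for y in range(2)]
moved = [(x + 5, y + 3) for x, y in rect]

# At 0 and 90 degrees the curve is unchanged by the translation.
print(r_curve(rect)[0], r_curve(moved)[0])
print(r_curve(rect)[90], r_curve(moved)[90])
```

At 0° the projection counts pixels per column and at 90° per row, so the two printed pairs agree for the original and the translated rectangle.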

3 Fourier Mellin Transform

The use of the Fourier Mellin Transform for rigid image registration was proposed in [14], that is, for matching images that are translated, rotated and scaled with respect to one another. Let $F_1(\xi, \eta)$ and $F_2(\xi, \eta)$ be the Fourier transforms of images $f_1(x, y)$ and $f_2(x, y)$, respectively. If $f_2$ differs from $f_1$ only by a displacement $(x_0, y_0)$, then

$$f_2(x, y) = f_1(x - x_0, y - y_0), \tag{11}$$

or, in the frequency domain, using the Fourier shift theorem,

$$F_2(\xi, \eta) = e^{-j2\pi(\xi x_0 + \eta y_0)}\,F_1(\xi, \eta). \tag{12}$$

The cross-power spectrum is then defined as

$$C(\xi, \eta) = \frac{F_1(\xi, \eta)\,F_2^*(\xi, \eta)}{\left|F_1(\xi, \eta)\,F_2(\xi, \eta)\right|} = e^{j2\pi(\xi x_0 + \eta y_0)}, \tag{13}$$

where $F^*$ is the complex conjugate of $F$. The Fourier shift theorem guarantees that the phase of the cross-power spectrum is equivalent to the phase difference between the images. The inverse Fourier transform of (13) results in

$$c(x, y) = \delta(x - x_0, y - y_0), \tag{14}$$

which is approximately zero everywhere except at the optimal registration point.

If $f_1$ and $f_2$ are related by a translation $(x_0, y_0)$ and a rotation $\theta_0$, then

$$f_2(x, y) = f_1(x\cos\theta_0 + y\sin\theta_0 - x_0,\; -x\sin\theta_0 + y\cos\theta_0 - y_0). \tag{15}$$

Using the Fourier translation property and the Fourier rotation property, we have

$$F_2(\xi, \eta) = e^{-j2\pi(\xi x_0 + \eta y_0)}\,F_1(\xi\cos\theta_0 + \eta\sin\theta_0,\; -\xi\sin\theta_0 + \eta\cos\theta_0). \tag{16}$$

Let $M_1$ and $M_2$ be the magnitudes of $F_1$ and $F_2$, respectively. They are related by

$$M_2(\xi, \eta) = M_1(\xi\cos\theta_0 + \eta\sin\theta_0,\; -\xi\sin\theta_0 + \eta\cos\theta_0). \tag{17}$$

To recover the rotation, the Fourier magnitude spectra are transformed to a polar representation,

$$M_1(\rho, \theta) = M_2(\rho, \theta - \theta_0), \tag{18}$$

where $\rho$ and $\theta$ are the radius and angle in the polar coordinate system, respectively. The rotation now appears as a shift along the $\theta$ axis, so (13) can be applied to find $\theta_0$.

If $f_1$ is a translated, rotated and scaled version of $f_2$, the Fourier magnitude spectra are transformed to log-polar representations and related by

$$M_2(\rho, \theta) = M_1(\rho / s, \theta - \theta_0), \tag{19}$$

i.e.

$$M_2(\log\rho, \theta) = M_1(\log\rho - \log s, \theta - \theta_0). \tag{20}$$
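The translation-recovery step of equations (11)-(14) can be sketched in one dimension with only the standard library. This is an illustrative sketch under stated assumptions, not the registration code of [14]: the names `dft`, `estimate_shift` and `circular_shift` are hypothetical, the DFT is the naive $O(N^2)$ form, and shifts are circular. The sketch conjugates $F_1$ rather than $F_2$ so that, under the DFT sign convention used here, the peak index equals the shift directly. It also checks the property behind equation (17), namely that the Fourier magnitude does not change under a shift.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform of a 1-D sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for m in range(n):
        acc = 0j
        for k in range(n):
            acc += x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
        out.append(acc / n if inverse else acc)
    return out

def circular_shift(x, s):
    """Return x delayed by s samples with wrap-around, as in Eq. (11)."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]

def estimate_shift(f1, f2):
    """Phase correlation: locate the peak of the inverse normalized cross-power spectrum."""
    F1, F2 = dft(f1), dft(f2)
    cross = []
    for a, b in zip(F1, F2):
        p = a.conjugate() * b          # ordering chosen so the peak sits at +shift
        mag = abs(p)
        cross.append(p / mag if mag > 1e-12 else 0j)  # keep the phase only
    corr = dft(cross, inverse=True)
    return max(range(len(corr)), key=lambda i: corr[i].real)

signal = [4.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0, 0.0]
shifted = circular_shift(signal, 3)

print(estimate_shift(signal, shifted))

# The magnitude spectrum is unchanged by the shift.
same_magnitude = all(
    abs(abs(a) - abs(b)) < 1e-9 for a, b in zip(dft(signal), dft(shifted))
)
print(same_magnitude)
```

The same peak search applied along the $\theta$ axis of a polar magnitude spectrum, or along both axes of a log-polar one, is what recovers rotation and scale.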

Writing $\xi = \log\rho$ and $d = \log s$, where $s$ is the scaling factor, equation (20) becomes

$$M_2(\xi, \theta) = M_1(\xi - d, \theta - \theta_0). \tag{21}$$

4 Experiments

In our experiments, we use the Weizmann dataset to evaluate our method, with 93 videos of 9 actors and 10 actions (bend, jack, jump, pjump, run, side, skip, walk, wave1, wave2). Sample images are shown in Figure 3.

Fig. 3. The example images of the Weizmann dataset

Each silhouette image is first normalized to a fixed resolution. We then convert the image to the Radon domain and obtain an R curve by the R transform. Before extracting the invariant features with the Fourier Mellin transform, we convert the curve to a 2D R transform image. PCA is then used to obtain more compact features.

Since the periods of the activities are not uniform, comparing sequences is not straightforward. The same activity can be performed at different speeds, which stretches or shrinks the sequence in time. To eliminate the effects of different speeds and to perform a robust comparison, the Earth Mover's Distance (EMD) [15] is used in our experiments. The EMD has shown promising performance in image retrieval and visual tracking because it finds the optimal signature alignment and can therefore measure similarity accurately.

For two arbitrary activity sequences $P$ and $Q$, let $P = \{(p_i, w_{p_i}),\ 1 \le i \le m\}$ and $Q = \{(q_j, w_{q_j}),\ 1 \le j \le n\}$, where $m$ and $n$ are the numbers of clusters in $P$ and $Q$, respectively. The EMD between $P$ and $Q$ is computed by

$$D(P, Q) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} d_{ij} f_{ij}}{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}} \tag{22}$$

where $d_{ij}$ is the Euclidean distance between $p_i$ and $q_j$, and $f_{ij}$ is the optimal match between the two signatures $P$ and $Q$, which can be computed by solving a linear programming problem.
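Equation (22) can be illustrated with a deliberately restricted special case. This sketch assumes that both signatures have the same number of clusters and that every cluster carries the same weight $1/n$. Under that assumption an optimal flow can be taken to be a one-to-one matching, so the EMD equals the smallest mean matched distance, which the hypothetical function `emd_uniform` finds by brute force. It is meant only for very small $n$, and it is not the general solver used in the paper: signatures of different sizes or weights require the full linear program.

```python
import math
from itertools import permutations

def emd_uniform(p, q):
    """EMD of Eq. (22) for equal-size signatures whose clusters all weigh 1/n."""
    if len(p) != len(q):
        raise ValueError("this special case needs the same number of clusters")
    n = len(p)
    # Try every one-to-one matching and keep the cheapest total distance.
    best = min(
        sum(math.dist(p[i], q[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    # Each matched pair carries flow 1/n and the total flow is 1,
    # so Eq. (22) reduces to the mean matched distance.
    return best / n

# Two-cluster signatures whose cluster centres differ by one unit in y.
P = [(0.0, 0.0), (2.0, 0.0)]
Q = [(2.0, 1.0), (0.0, 1.0)]
print(emd_uniform(P, Q))
print(emd_uniform(P, P))
```

The order of the clusters in `Q` does not matter, which is the alignment property that motivates using the EMD for sequences performed at different speeds.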

The optimal flow $F = [f_{ij}]$ is the solution of the linear program

$$\min\ \mathrm{WORK}(P, Q, F) = \sum_{i=1}^{m}\sum_{j=1}^{n} d_{ij} f_{ij}$$

subject to

$$\begin{aligned}
f_{ij} &\ge 0, \\
\sum_{j=1}^{n} f_{ij} &\le w_{p_i}, \\
\sum_{i=1}^{m} f_{ij} &\le w_{q_j}, \\
\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij} &= \min\left(\sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{n} w_{q_j}\right).
\end{aligned}$$

4.1 Experiment 1

In this experiment, we evaluate our method with respect to rotation, translation and scaling separately. Figures 4 and 5 show the correct recognition rates when the activity sequences are rotated or scaled. The correct recognition rates reach up to 90% for the rotated activities and up to 80% for the scaled ones. For the translated sequences, the correct recognition rate is 100%.

Fig. 4. The correct recognition rates of rotated activity sequences

Fig. 5. The correct recognition rates of scaled activity sequences

Fig. 6. The example images of our dataset

4.2 Experiment 2

In this experiment, we build a dataset that includes the original data of the Weizmann dataset, sequences of the Weizmann dataset rotated by a random angle between -30° and 30°, translated sequences of the Weizmann dataset, and scaled sequences of the Weizmann dataset. Example images are shown in Figure 6. We compare our method with three other methods: Zernike moments, the R transform, and the Fourier Mellin Transform. Figure 7 shows the correct recognition rates. From the figure we can see that our method performs better than the other three methods. Our RST-invariant features based on the R transform and the Fourier Mellin Transform are effective and can be used in human activity recognition.

Fig. 7. The correct recognition rates

5 Conclusion

In this paper we present a novel human activity recognition method based on the R transform and the Fourier Mellin Transform (FMT). Our feature extraction method is Rotation-Scaling-Translation invariant, which makes it suitable for human activity recognition, especially when the camera is unstable. The experimental results show the effectiveness of our method.

Acknowledgment

This work is partly supported by NSFC and the National 863 High-Tech R&D Program of China (Grant No. 2006AA01Z453).

References

1. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behavior. IEEE Trans. on Systems, Man and Cybernetics, Part C: Applications and Reviews 37 (2004)
2. Rao, C., Yilmaz, A., Shah, M.: View-invariant representation and recognition of actions. International Journal of Computer Vision 50 (2002)
3. Ogale, A., Karapurkar, A., Aloimonos, Y.: View-invariant modeling and recognition of human actions using grammars. In: Workshop on Dynamical Vision at ICCV, vol. 5 (2005)
4. Parameswaran, V., Chellappa, R.: View invariance for human action recognition. International Journal of Computer Vision 66 (2006)
5. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding 104 (2006)
6. Weinland, D., Boyer, E., Ronfard, R.: Action recognition from arbitrary views using 3D exemplars. In: Proceedings of the International Conference on Computer Vision, pp. 1-7 (2007)
7. Li, X., Fukui, K.: View-invariant human action recognition based on factorization and HMMs. IEICE Transactions on Information and Systems (2008)
8. Shen, Y., Foroosh, H.: View-invariant action recognition using fundamental ratios. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-7 (2008)
9. Natarajan, P., Nevatia, R.: View and scale invariant action recognition using multiview shape-flow models. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8 (2008)
10. Gilbert, A., Illingworth, J., Bowden, R.: Scale invariant action recognition using compound features mined from dense spatio-temporal corners. In: European Conference on Computer Vision (2008)
11. Deans, S.: Application of the Radon transform. Wiley Interscience Publications, New York (1983)
12. Wang, Y., Huang, K., Tan, T.: Human activity recognition based on R transform. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8 (2007)
13. Tabbone, S., Wendling, L., Salmon, J.: A new shape descriptor defined on the Radon transform. Computer Vision and Image Understanding 102 (2006)
14. Reddy, B., Chatterji, B.: An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE Trans. Image Processing 8 (1996)
15. Rubner, Y., Tomasi, C., Guibas, L.: The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40 (2000)
