Dynamic Human Shape Description and Characterization


Z. Cheng*, S. Mosher, Jeanne Smith, H. Cheng, and K. Robinette
Infoscitex Corporation, Dayton, Ohio, USA
711th Human Performance Wing, Air Force Research Laboratory, Dayton, Ohio, USA

Abstract

Dynamic human shape description and characterization were investigated in this paper. The dynamic shapes of a subject in four activities (jogging, limping, shooting, and walking) were generated via 3-D motion replication. The Paquet Shape Descriptor (PSD) was used to describe the shape of the subject in each frame. The unique features of dynamic human shapes were revealed from observations of the 3-D plots of the PSDs. Principal component analysis was performed on the calculated PSDs, and the principal components (PCs) were used to characterize them. The PSD was then reasonably approximated by its first few projections in the eigenspace formed by the PCs and represented by the corresponding projection coefficients. As such, the dynamic human shapes for each activity were described by these projection coefficients. Based on the projection coefficients, data mining technology was employed for activity classification. Case studies were performed to validate the methodology developed.

Keywords: Human Modeling, Dynamic Shape, Shape Descriptor, Principal Component Analysis, Activity Recognition

1. Introduction

While a human is moving or performing an action, his body shape changes dynamically. In other words, shape change and motion are tied together during a human action (activity). However, human shape and motion are often treated separately in activity recognition. Shape dynamics describe the spatial-temporal shape deformation of an object during its movement and thus provide important information about the identity of a subject and the motions performed by the subject (Jin and Mokhtarian, 2006). A few researchers have utilized shape dynamics for human activity recognition.
In (Kilner et al., 2009), the authors addressed the problem of human action matching in outdoor sports broadcast environments by analyzing 3-D data from a recorded human activity and retrieving the most appropriate proxy action from a motion capture library. In (Niebles and Li, 2007), a video sequence was represented as a collection of spatial and spatial-temporal features obtained by extracting static and dynamic interest points; a hierarchical model was then proposed that can be characterized as a constellation of bags-of-features, both spatial and temporal. In (Jin and Mokhtarian, 2006), a system was proposed for recognizing object motions based on their shape dynamics; the spatial-temporal shape deformation in motions was captured by hidden Markov models. In (Blank et al., 2005), human action in video sequences was seen as silhouettes of a moving torso and protruding limbs undergoing articulated motion, and human actions were regarded as three-dimensional shapes induced by the silhouettes in the space-time volume.

Dynamic human shapes can be described by a dynamic 3-D human shape model which, in turn, can be extracted from 2-D video imagery or 3-D sensor data, or created by 3-D replication/animation. A dynamic 3-D shape model usually contains tens of thousands of graphic elements (vertices or polygons). In order to use the information coded in dynamic shapes for human identification and activity recognition, it is necessary to find an effective method for dynamic shape description and characterization.

2. Dynamic Shape Creation

Since the technologies capable of capturing the 3-D dynamic shapes of a subject during motion are still very limited in maturity and availability, very little data on dynamic human shapes are available at this time. However, as a motion capture system can be used to capture human motion and a laser scanner can be used to

*Corresponding author. Email: Zhiqing.cheng@wpafb.af.mil

capture human body shape, various techniques have been developed to replicate/animate human motion in 3-D space, thus generating the dynamic shapes of a subject in an action. In this paper, Blender (http://www.blender.org), an open-source software tool, was used to animate the motion of a human subject in 3-D space during four different activities: walking, jogging, limping, and shooting. The data used as the basis for the animation were acquired in the Human Signatures Laboratory of the US Air Force and included scan data and motion capture (MoCap) data. The human subject, with markers attached, was scanned using the Cyberware whole-body scanner. Motion capture data were acquired for the same subject with the same markers attached. The markers allowed the joint centers to be determined for both the scan and the MoCap data. The scan was imported into Blender, and the joint centers were used to define the skeleton in the BVH (Biovision Hierarchy) file format. Euler angles for the different body segments were computed from the joint centers and other markers and were used to set up the BVH files for the four activities. The BVH files were imported into Blender and used to animate the whole-body scan of the subject. Figure 1 shows images captured from the animation within Blender for the four activities. From the Blender animation, a 3-D mesh can be output at each frame of motion, as shown in Fig. 2 for limping, representing the 3-D dynamic body shape of the subject at that instant of the motion. Thus, the 3-D mesh output in each frame can be used as simulation data of dynamic human shapes for training the algorithms developed for activity recognition.

3. Dynamic Shape Description and Characterization

The dynamic shapes shown in Figs. 1 and 2 are represented by 3-D meshes. Each mesh may contain as many as tens of thousands of graphical elements (vertices or polygons).
It is not feasible to use the vertices or polygons directly for the analysis of human shape dynamics. One way to describe dynamic shapes effectively and to enable further analysis is to use a shape descriptor (Cohen and Li, 2003; Chu and Cohen, 2005). In this paper, the Paquet Shape Descriptor (PSD) (Paquet et al, 2000; Robinette, 2003), with certain modifications, is used to describe dynamic shapes and to analyze shape dynamics. As illustrated in Fig. 3, the PSD uses 120 bins (discrete parameters) to characterize shape variation. Among these 120 bins, 40 are related to the radius r, 40 to the first angle (cos(θ)), and 40 to the second angle (cos(δ)). The details of the PSD calculation are omitted here. The 3-D plots of the time histories of the PSD for the four activities are illustrated in Fig. 4, where the first 40 bins, corresponding to the radius, are shown at the top, the next 40 bins of cos(θ) in the middle, and the last 40 bins of cos(δ) at the bottom.

Figure 1. Replication of a subject in four activities: limping, jogging, shooting, and walking.

Figure 2. Dynamic shapes of a subject during limping.

Figure 3. Paquet shape descriptor and its coordinate system.

By visually inspecting these plots, one can find the following.

- The variation of each bin over time differs: some bins vary markedly and significantly over time, while others do not.
- Periodic features are exhibited by the plots for the activities of jogging, limping, and walking.
- The 3-D plot for each activity is unique; there are visible and significant differences among the plots for the four activities.

These observations from the PSD reveal some unique features of shape dynamics. However, directly using the PSD to analyze shape dynamics is still not feasible, since it has 120 bins (variables), which form a 120-dimensional space. Further treatment is necessary to characterize the shape descriptor and to reduce the dimension of the problem space.
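Although the details of the PSD calculation are omitted above, its general form can be sketched. The code below is a plausible reading, not the authors' implementation: it assumes the 120 bins are per-frame histograms of the radius r and two direction cosines of each mesh vertex about the body centroid; the function name and the choice of axes for cos(θ) and cos(δ) are hypothetical.

```python
import numpy as np

def psd_like_descriptor(vertices, n_bins=40):
    """Sketch of a PSD-style descriptor: histograms of the radius r and two
    direction cosines of each vertex about the centroid, concatenated into
    a 3 * n_bins vector (120 bins for n_bins = 40, as in the paper)."""
    v = np.asarray(vertices, dtype=float)
    v = v - v.mean(axis=0)                  # center on the body centroid
    r = np.linalg.norm(v, axis=1)
    r_safe = np.where(r == 0, 1.0, r)
    cos_theta = v[:, 2] / r_safe            # angle to the vertical axis
    cos_delta = v[:, 0] / r_safe            # angle to a lateral axis (assumed)
    h_r, _ = np.histogram(r / r.max(), bins=n_bins, range=(0.0, 1.0))
    h_t, _ = np.histogram(cos_theta, bins=n_bins, range=(-1.0, 1.0))
    h_d, _ = np.histogram(cos_delta, bins=n_bins, range=(-1.0, 1.0))
    d = np.concatenate([h_r, h_t, h_d]).astype(float)
    return d / len(v)                        # normalize by vertex count

# Random point cloud as a stand-in for a body mesh's vertices.
rng = np.random.default_rng(0)
desc = psd_like_descriptor(rng.normal(size=(1000, 3)))
print(desc.shape)  # (120,)
```

Evaluating such a descriptor on the mesh output at every frame yields the per-frame PSD vectors whose time histories are plotted in Fig. 4.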

Figure 4. The time histories of the 120 bins of the PSD for the four activities.
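The periodicity visible in the jogging, limping, and walking plots of Fig. 4 can also be checked numerically. A minimal sketch: estimate the dominant period of a single bin's time history from the first peak of its autocorrelation. The 50-frame synthetic gait cycle below is illustrative only, not data from the paper.

```python
import numpy as np

def dominant_period(series):
    """Estimate the dominant period (in frames) of a bin's time history
    from the first local maximum of its normalized autocorrelation."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..N-1
    ac = ac / ac[0]
    for lag in range(1, len(ac) - 1):
        if ac[lag] > ac[lag - 1] and ac[lag] > ac[lag + 1] and ac[lag] > 0.5:
            return lag
    return None  # no clear periodicity (e.g., the shooting activity)

# Synthetic periodic bin history: a 50-frame cycle over 300 frames.
t = np.arange(300)
series = np.sin(2 * np.pi * t / 50)
print(dominant_period(series))  # 50
```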

Therefore, principal component analysis (PCA) is used to characterize the high-dimensional space defined by the PSD. Denote

p_ijk = {p_1 p_2 ... p_120}^T, (1)

the PSD for the i-th subject in the j-th activity at the k-th frame. For the data collected, denote

P = {p_ijk}, i = 1, ..., I; j = 1, ..., J; k = 1, ..., K, (2)

where I is the number of subjects, J is the number of activities, and K is the number of frames for each activity. Note, however, that the number of activities each subject performs can differ, as can the number of frames for each activity. By performing PCA on P, one can find the principal components that characterize the space defined by the shape descriptor.

In this paper, dynamic shapes were created for the four activities at a frame interval of 0.02 s, with 85 frames for jogging, 352 frames for limping, 554 frames for shooting, and 227 frames for walking. The percentage of variance of each principal component (PC) is shown in Fig. 5, and the first four PCs are shown in Fig. 6. The original PSD vector can be projected onto the space (eigenspace) formed by the PCs; that is, it can be expanded in terms of the PCs. As shown in Fig. 5, among all 120 PCs, only the first 10~20 are significant. This means that the original PSD can be reasonably approximated by its first few projections in the eigenspace and represented by the projection coefficients corresponding to these significant PCs. Figure 7 illustrates the time histories of the first and second projection coefficients for the four activities.

Figure 5. Percentage of variance of each principal component.

Figure 6. First four principal components of the PSD.
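The PCA of Eqs. (1) and (2) and the per-component variance percentages of Fig. 5 can be reproduced with a standard singular value decomposition. The sketch below uses a synthetic stand-in for the descriptor matrix P (the study's data are not available here), with N = 1218 frames of a 120-bin descriptor as in the paper.

```python
import numpy as np

# Synthetic stand-in for P: N = 1218 frames of a 120-bin descriptor,
# driven by 3 underlying factors plus a small amount of noise.
rng = np.random.default_rng(1)
M, N = 120, 1218
latent = rng.normal(size=(N, 3)) @ rng.normal(size=(3, M))
P = latent + 0.01 * rng.normal(size=(N, M))   # rows = frames (records)

X = P - P.mean(axis=0)                  # center each bin
U, S, Vt = np.linalg.svd(X, full_matrices=False)
var_pct = 100 * S**2 / np.sum(S**2)     # Fig. 5: % of variance per PC
W = Vt.T                                # columns v_m are the PCs, Eq. (3)
Y = X @ W                               # projection coefficients, Eq. (4)

print(Y.shape)                          # one coefficient row per frame
print(var_pct[:3].sum())                # first few PCs dominate
```

Because only the leading components carry significant variance, truncating W to its first L columns (Eq. (5)) reduces each 120-bin frame to L coefficients with little loss.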

Figure 7. Time histories of the first and second projection coefficients for the four activities.

Denote

W_M = {v_1 v_2 ... v_M}, (3)

where v_m is the m-th principal component (eigenvector) of P. The original observations (data) can be projected onto the space defined by W_M; that is,

Y_M^T = P^T W_M, (4)

where Y_M = Y[M, N] is the matrix of projection coefficients, each column of which corresponds to one original record, M is the dimension of the shape descriptor (M = 120 for the PSD), and N is the total number of shapes observed (N = 1218 for the case in this paper). From Fig. 5 we can see that, of the 120 principal components in total, fewer than 20 are significant. This means that, instead of using the full space of dimension M, one can construct a new space with only the significant principal components, that is,

W_L = {v_1 v_2 ... v_L}, L < M, (5)

which substantially reduces the dimension of the space. For the case investigated in this paper, L = 20, which is much smaller than M = 120. The projection onto this reduced space is then given by

Y_L^T = P^T W_L, (6)

where Y_L = Y[L, N]. Each original record can be either fully reconstructed by Eq. (4) or partially reconstructed (approximated) by Eq. (6). Usually an original record can be well approximated by its partial reconstruction with the significant principal components. This means that the original data of dimension M can be represented by projection coefficients of dimension L (L << M). In this space of reduced dimension the problem becomes tractable, as the number of variables is much smaller. In fact, for the case in this paper, the two projection coefficients corresponding to the first two most significant principal components are sufficient to represent the shape dynamics for action recognition. The sequence of a projection coefficient over the frames of a particular subject in a particular action constitutes a time series, as shown in Fig. 7. The time histories of the first and second coefficients are unique to each action, and so can be used as discriminators for activity recognition.

4. Activity Recognition Based on Shape Dynamics

The shape dynamics of a subject during motion, as described in Section 3, can be used for activity recognition. In this paper, a data mining tool was employed to classify the four activities (jog, limp, shoot, walk) based on 85 frames from each activity. Note that in the classification each frame was treated independently rather than placed in sequence as a time series. Five attributes were used in the classification: (a) Pelvis_Velocity, the resultant velocity at the mid-pelvis location; (b) PC1, the first projection coefficient; (c) PC2, the second projection coefficient; (d) PC1_Velocity, the derivative of PC1; and (e) PC2_Velocity, the derivative of PC2. The significance of each attribute can be assessed in terms of gain ratio, as given in Table 1.
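Gain ratio, the ranking criterion behind Table 1, is information gain normalized by split information. A minimal sketch, assuming equal-width discretization of each numeric attribute (the exact discretization used by the authors' tool may differ); the attribute values and labels below are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(attr_values, labels, n_bins=4):
    """Gain ratio of a numeric attribute after equal-width discretization:
    (class entropy - conditional entropy) / split information."""
    lo, hi = min(attr_values), max(attr_values)
    width = (hi - lo) / n_bins or 1.0
    bins = [min(int((v - lo) / width), n_bins - 1) for v in attr_values]
    n = len(labels)
    cond, split = 0.0, 0.0
    for b in set(bins):
        idx = [i for i, bb in enumerate(bins) if bb == b]
        w = len(idx) / n
        cond += w * entropy([labels[i] for i in idx])
        split -= w * math.log2(w)
    gain = entropy(labels) - cond
    return gain / split if split else 0.0

# Hypothetical attribute that separates two activities perfectly.
vals = [0.1, 0.2, 0.15, 0.9, 0.8, 0.95]
labs = ["walk", "walk", "walk", "jog", "jog", "jog"]
print(gain_ratio(vals, labs))  # 1.0
```

An attribute with no discriminating power scores near 0; a perfectly discriminating one, as above, scores 1.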
While Pelvis_Velocity is the most significant, all five attributes were selected for classification. Various classification methods are available, such as those provided by Weka (http://www.cs.waikato.ac.nz/ml/weka/). Among them, the five conventional methods listed in Table 2 were chosen for the case study. All of them achieved a classification accuracy greater than 95%, as shown in Table 2.

Table 1. Attribute ranking results

Table 2. Classification accuracy

5. Conclusion

Based on the study in this paper, the following conclusions can be drawn.

- Shape dynamics contain information about both body motion and shape change and have great potential for human identification and activity recognition.
- Shape dynamics can be well captured by a shape descriptor and further characterized by principal components.
- Human motion/action in 3-D space can be replicated or animated with high biofidelity,

which can be used to generate data for training a model or for evaluating the performance of a tool.
- Using a dynamic 3-D human shape model for human activity recognition is plausible. This approach is unique in that it differs from conventional techniques based on 2-D imagery or models, and it is effective in that it can overcome the shortcomings inherent in 2-D methods.
- As a shape descriptor, the PSD is not reversible. This means that while it can be used for analysis, as in this paper, it cannot be used for shape reconstruction. Also, spatial information may not be uniquely represented in the original definition of the PSD, which can be remedied by certain treatments or modifications.

It should be pointed out that the dynamic shape models used in this study were created from 3-D surface scan data and motion capture data using OpenSim and Blender. While these models provide a highly biofidelic description of body shape during motion, the body surface deformation may not be fully or accurately represented by them. However, since the body shape variation induced by articulated motion is much larger than the surface deformation, most observations and results of this paper can reasonably be postulated to hold even when the surface deformation is represented more precisely. More investigation is needed to validate this assumption.

Acknowledgement

This study was carried out under the support of SBIR Phase I funding (FA865-1-M-692) provided by the US Air Force.

References

Blank M, Gorelick L, Shechtman E, Irani M, and Basri R, 2005. Actions as Space-Time Shapes. In: Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV '05).

Chu C-W and Cohen I, 2005. Posture and Gesture Recognition using 3-D Body Shapes Decomposition. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05).

Cohen I and Li H, 2003. Inference of Human Posture by Classification of 3-D Human Body Shape. In: IEEE International Workshop on Analysis and Modeling of Faces and Gestures, ICCV 2003.

Jin N and Mokhtarian F, 2006. A Non-Parametric HMM Learning Method for Shape Dynamics with Application to Human Motion Recognition. In: The 18th International Conference on Pattern Recognition (ICPR '06).

Kilner J, Guillemaut J-Y, and Hilton A, 2009. 3-D Action Matching with Key-Pose Detection. In: 2009 IEEE 12th International Conference on Computer Vision Workshops.

Niebles J-C and Li F-F, 2007. A Hierarchical Model of Shape and Appearance for Human Action Classification. In: IEEE Computer Vision and Pattern Recognition (CVPR 2007).

Paquet E, Rioux M, Murching A, Naveen T, and Tabatabai A, 2000. Description of shape information for 2-D and 3-D objects. Signal Processing: Image Communication 16(2), pp 103-122.

Robinette K, 2003. An Investigation of 3-D Anthropometric Shape Descriptors for Database Mining. Ph.D. Thesis, University of Cincinnati.