Epitomic Analysis of Human Motion

Wooyoung Kim    James M. Rehg
Department of Computer Science, Georgia Institute of Technology, Atlanta, GA 30332
{wooyoung, rehg}@cc.gatech.edu

Abstract

Epitomic analysis is a recent statistical approach to forming a generative model, and it has been applied to image, video, and audio data. In this paper, we describe a novel application of epitomic analysis to motion capture data. We show how to construct a motion epitome and use it for various applications in animation, such as motion synthesis, reconstruction, and classification of motions. We also discuss the use of the motion epitome for measuring motion naturalness.

1. Introduction

The image epitome was introduced by Jojic, Frey and Kannan in [4] as a novel generative image model. The epitome representation of a source image is much smaller than the source but retains most of the constitutive elements needed to reconstruct it. Applications of image epitomes include denoising, detection, indexing and retrieval, and segmentation. More recently, epitomic models have been developed for video [1] and audio data [5]. This paper introduces a novel epitomic model for motion capture data which we call the motion epitome. Epitomic motion analysis yields a generative time-series model which compresses sequences of motion capture data into a compact representation of their most salient constituent elements. In comparison to conventional generative models of human motion, such as a Hidden Markov Model (HMM) or a Switching Linear Dynamic System (SLDS), the motion epitome offers the potential to simultaneously represent correlations at multiple time scales and complex patterns of dependency between states in the model. Used in conjunction with non-parametric motion representations (such as motion graphs [6]), the motion epitome also offers the potential to generate more accurate transitions between motion sequences.
We describe a procedure for learning motion epitomes from a corpus of motion capture data, and present the use of motion epitomes in the reconstruction of motions, the synthesis of long motion sequences, and the classification of motions. We provide experimental results for the synthesis and classification of motion sequences.

2. Motivation

The motivation for this work is the existence, for the first time, of large and diverse collections of motion capture data. For example, the CMU corpus (available at http://mocap.cs.cmu.edu) contains 1648 motions and includes 23 different types of motions. We believe these collections can enable a large-scale investigation of data-driven approaches to the generation and analysis of human movement. A basic question is whether we can identify a canonical finite set of primitive movement elements from which all human movement can be constructed. One approach to this problem is to design a statistical generative model and examine the representations that it constructs when it is trained on a sufficiently large corpus of motion data. In this work we develop the motion epitome as a generative statistical model for motion capture data.
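The preprocessing described in Section 3 below converts joint rotations from Euler angles to quaternions (to avoid the 360-degree wrap-around) and rescales root translations into [-1, 1]. A minimal sketch of that step; the Z-Y-X rotation order is an assumption, since the paper does not state which convention its mocap pipeline uses:

```python
import numpy as np

def euler_to_quaternion(phi, theta, psi):
    """Convert Euler angles (radians) to a unit quaternion (w, x, y, z).

    Assumes a Z-Y-X rotation order; this convention is an assumption,
    not taken from the paper.
    """
    cphi, sphi = np.cos(phi / 2), np.sin(phi / 2)
    cth, sth = np.cos(theta / 2), np.sin(theta / 2)
    cpsi, spsi = np.cos(psi / 2), np.sin(psi / 2)
    w = cphi * cth * cpsi + sphi * sth * spsi
    x = sphi * cth * cpsi - cphi * sth * spsi
    y = cphi * sth * cpsi + sphi * cth * spsi
    z = cphi * cth * spsi - sphi * sth * cpsi
    return np.array([w, x, y, z])

def scale_translation(t, lo=None, hi=None):
    """Linearly rescale root translations into [-1, 1]."""
    lo = t.min() if lo is None else lo
    hi = t.max() if hi is None else hi
    return 2.0 * (t - lo) / (hi - lo) - 1.0
```

A quaternion represents a rotation without the 360-degree periodicity of Euler angles, so nearby poses map to nearby feature values in the motion image.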

3. Motion Capture Data

Motion capture data is obtained by tracking markers attached to the figure from multiple cameras at 60-120 Hz. Using commercial software (such as IQ), 3D marker positions are converted into rotation angles φ, θ, ψ with respect to an underlying 3D skeleton model, as shown in Figure 1(a). We make an analogy between motion images and intensity images, and propose to learn epitomic representations of motion using the same framework as image epitomes. Figure 1(b) gives an example of a motion image constructed from joint angle data. In the motion image, each row consists of a feature vector F = {T_x, T_y, T_z, R_φ, R_θ, R_ψ, {φ_i, θ_i, ψ_i}_{i=1..J}}, where (T_x, T_y, T_z) is the root position, (R_φ, R_θ, R_ψ) is the root rotation, and (φ_i, θ_i, ψ_i) is the relative rotation of the i-th joint. Hence, each row gives the pose of the entire figure at a specific time instant t.

Figure 1: (a) the skeleton of 29 joints; (b) a motion image and the corresponding motion sequence. The motion image on the left is un-normalized raw data.

Motion capture data is represented in the Euler angle system, in which a 3D rotation is described by the rotation angles φ, θ, ψ. This causes a problem for epitomic analysis, since angles that differ by the period of 360 degrees are treated as different values. To solve this problem, we first convert the Euler angles to quaternions, which represent a rotation by an axis ν in 3D and a rotation angle α. After the conversion to quaternions, we scale the translation data so that all of the data lies in the same range of [-1, 1]. Then, for better visualization, we extract only the rotation angles to make a new motion image. The new motion images are shown in Figures 3 and 4.

4. Motion Epitome

The motion epitome is a generative model of motion capture data which is much more compact in time, but retains most of the patterns in the input motion. It is a probabilistic model of a copying and averaging process inspired

by the paper in [2]. Figure 3 shows a motion epitome extracted from a set of motion sequences. We first describe how to generate a motion from a motion epitome, then explain the inference and learning processes.

4.1. Generating a motion with a motion epitome

We regard a motion sequence as a 2D array of the features F = {T_x, T_y, T_z, R_φ, R_θ, R_ψ, {φ_i, θ_i, ψ_i}_{i=1..J}} defined in Section 3, one per time t, with t in {1, ..., T}, where T is the number of time frames of the motion. The epitome e is then a set of probability distributions of size T_e, where T_e < T, so that an instance of a feature vector at time t can be estimated from any of the probability distributions in e. Here we use a Gaussian distribution parameterized by a mean and covariance for each entry of the epitome. Hence, at the epitome coordinate t_e, the probability density of the feature vector stored in entry t of the motion is

    p(x_t | e, t_e) = N(x_t; μ_{t_e}, φ_{t_e}),

where μ_{t_e} is the mean and φ_{t_e} is the covariance.

Figure 2: (a) Graphical model of epitomic analysis; (b) generative process.

Figure 2(b) illustrates the generative procedure consistent with the graphical model in Figure 2(a). Given e, the goal is to generate an output time series X. An intermediate step is to generate overlapping patches z_k by sampling from the epitome representation. For x_i at time i to be generated, we first consider all possible sets of time coordinates that include i. Representative sets k, s, h are shown in Figure 2(b). For each time coordinate set k, we randomly choose a patch τ_k from the epitome. The distribution of τ_k could be learned or assumed to be

uniform. After that, we generate a predicted patch z_k using the distribution

    p(z_k | τ_k, e) = Π_j N(z_{j,k}; μ_{τ_k(j)}, φ_{τ_k(j)}),    (1)

where μ_{τ_k(j)}, φ_{τ_k(j)} are the mean and variance stored in the patch τ_k at j. Note that k(j) refers to the j-th coordinate in k, and z_{j,k} to the value at k(j) in z_k. All overlapping patches z_k are then combined to make a single prediction for the observed value x_i. Here we assume that, in order to generate a consistent value x_i, the contributions from all overlapping patches {z_k : i ∈ k} are averaged and Gaussian noise with variance ψ_i is added:

    p(x_i | {z_k : i ∈ k}) = N(x_i; (1/N_i) Σ_{k: i ∈ k} z_{i,k}, ψ_i),    (2)

where N_i is the number of coordinate sets that overlap coordinate i.

4.2. Inference with the motion epitome

Epitomic analysis allows arbitrary patches from the model in the generative process. Although this makes it possible to generate a wider range of motions than we are interested in, we constrain the generated motions so that patches agree at any time i they share. Therefore, we assume that all values that share a time coordinate i take the same value ζ_i. The inference problem then becomes the following: given an epitome e and an observation X, we want to infer the hidden value ζ_i at time i which is shared by all patches {z_k : i ∈ k} predicted from e. This is obtained by maximizing the log likelihood of the data,

    log p(X) = log Σ_{{z_k}} Σ_{{τ_k}} p(X, {z_k, τ_k})
             ≥ Σ_{{z_k}} Σ_{{τ_k}} q({z_k, τ_k}) log [ p(X, {z_k, τ_k}) / q({z_k, τ_k}) ],    (3)

where

    p(X, {z_k, τ_k}) = p({z_k, τ_k}) p(X | {z_k, τ_k})
                     = p({z_k, τ_k}) Π_{i=1..n} p(x_i | {z_k})
                     = [ Π_k p(τ_k) p(z_k | τ_k) ] Π_{i=1..n} p(x_i | {z_k}).    (4)

Note that equation (4) can be fully expressed with epitome parameters by substituting equations (1) and (2) into (4). Here, q({z_k, τ_k}) is an auxiliary joint probability distribution over the set of epitome patches {τ_k} and the set of patches {z_k} for all coordinate patches k.
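The generative process of equations (1)-(2) can be sketched for a scalar (1-D) feature. This is an illustrative simplification of the paper's multi-dimensional feature vectors: the function name, the dense sliding-window patch scheme, and the uniform patch prior are assumptions, not details from the paper.

```python
import numpy as np

def generate_from_epitome(mu_e, var_e, T, P, psi=0.01, seed=0):
    """Sample a length-T time series from a 1-D epitome (eqs. (1)-(2)).

    mu_e, var_e : per-entry epitome means/variances (length T_e)
    P           : patch length; one patch per window k = [start, start+P)
    psi         : observation noise variance (assumed uniform here)
    """
    rng = np.random.default_rng(seed)
    T_e = len(mu_e)
    acc = np.zeros(T)      # running sum of overlapping patch values z_{i,k}
    cnt = np.zeros(T)      # N_i: number of patches overlapping coordinate i
    for start in range(T - P + 1):          # coordinate set k
        tau = rng.integers(0, T_e - P + 1)  # uniform choice of tau_k
        idx = np.arange(tau, tau + P)
        z = rng.normal(mu_e[idx], np.sqrt(var_e[idx]))  # eq. (1)
        acc[start:start + P] += z
        cnt[start:start + P] += 1
    # eq. (2): average the overlapping contributions, add Gaussian noise
    return rng.normal(acc / cnt, np.sqrt(psi))
```

Because every output value averages several overlapping epitome patches, neighboring frames share patch contributions, which is what smooths the generated series.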
Following the approximate and efficient inference and learning algorithm proposed in [3], we decouple the variables in the q-distribution:

    q({z_k}, {τ_k}) = Π_k q(τ_k) q(z_k).    (5)

Since we assumed that the values of the z_k agree in overlapping regions, we can express q(z_k) as q(z_k) = Π_j δ(z_{j,k} - ζ_{k(j)}). The estimate of ζ_i is then obtained by setting to zero the derivative of equation (3) with respect to ζ_i:

    ζ_i = [ x_i/ψ_i + Σ_{k,j: k(j)=i} Σ_{τ_k} q(τ_k) μ_{τ_k(j)}/φ_{τ_k(j)} ]
          / [ 1/ψ_i + Σ_{k,j: k(j)=i} Σ_{τ_k} q(τ_k) (1/φ_{τ_k(j)}) ].    (6)

Likewise, we can obtain q(τ_k) by setting to zero the derivative of (3) with respect to q(τ_k):

    q(τ_k) ∝ p(τ_k) Π_i N(ζ_{k(i)}; μ_{τ_k(i)}, φ_{τ_k(i)}).    (7)

Inference consists of iterating the two updates (6) and (7) until they converge. All the motion applications we discuss in this paper use this inference algorithm.

4.3. Learning the motion epitome

Now suppose that we are learning the parameters of a motion epitome given an observation X. The learning process then requires estimating the parameters μ_j and φ_j as well. These are obtained by setting to zero the derivatives of (3) with respect to μ_j and φ_j, respectively:

    μ_j = [ Σ_k Σ_{i: i ∈ k} Σ_{τ_k: τ_k(i)=j} q(τ_k) ζ_i ] / [ Σ_k Σ_{i: i ∈ k} Σ_{τ_k: τ_k(i)=j} q(τ_k) ],    (8)

    φ_j = [ Σ_k Σ_{i: i ∈ k} Σ_{τ_k: τ_k(i)=j} q(τ_k) (ζ_i - μ_j)^2 ] / [ Σ_k Σ_{i: i ∈ k} Σ_{τ_k: τ_k(i)=j} q(τ_k) ].    (9)

Here, estimating ζ_i and the parameters of the epitome is the M-step, and computing q(τ_k), which is an approximation to the posterior, is the E-step. Hence, we learn the epitome by iterating these EM steps.

Figure 3: Learning a motion epitome from an input motion.

Figure 3 illustrates the learning process of the motion epitome with motion images. In Figure 4 we also provide examples of motion epitomes learned from a set of basketball-dribbling motions (a) and from a set of jogging motions (b). The left of each figure is the set of training data and the right is the learned motion epitome. Clearly the epitome is much smaller in size, but it contains the most salient patterns of each motion type. Note that the motion image is not a raw image as shown in Figure 1(b).
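The coupled updates (6)-(9) can be sketched as an EM loop for a scalar time series. This is an illustrative simplification (1-D features, dense overlapping windows, a uniform patch prior, and a fixed iteration count), not the authors' implementation. Setting ψ_i = np.inf for a frame drops the data term from (6), which is the missing-frame treatment used later for fill-in.

```python
import numpy as np

def learn_epitome(x, psi, T_e, P, iters=30, eps=1e-6):
    """EM sketch of equations (6)-(9) for a scalar time series.

    x    : observed series, shape (T,)
    psi  : per-frame noise variances; np.inf marks a missing frame
    T_e  : epitome length; P : patch length
    """
    T = len(x)
    starts = np.arange(T - P + 1)          # coordinate sets k
    taus = np.arange(T_e - P + 1)          # candidate epitome patches
    mu = np.random.default_rng(0).normal(size=T_e)
    phi = np.ones(T_e)
    zeta = np.where(np.isinf(psi), 0.0, x).copy()
    inv_psi = np.where(np.isinf(psi), 0.0, 1.0 / psi)
    for _ in range(iters):
        # E-step, eq. (7): q(tau_k) ∝ prod_i N(zeta_{k(i)}; mu, phi)
        q = np.zeros((len(starts), len(taus)))
        for a, s in enumerate(starts):
            seg = zeta[s:s + P]
            for b, t in enumerate(taus):
                m, v = mu[t:t + P], phi[t:t + P]
                q[a, b] = -0.5 * np.sum((seg - m) ** 2 / v + np.log(v))
            q[a] = np.exp(q[a] - q[a].max())
            q[a] /= q[a].sum()
        # eq. (6): combine the data term with the epitome predictions
        num = np.where(np.isinf(psi), 0.0, x * inv_psi)
        den = inv_psi.copy()
        for a, s in enumerate(starts):
            for b, t in enumerate(taus):
                m, v = mu[t:t + P], phi[t:t + P]
                num[s:s + P] += q[a, b] * m / v
                den[s:s + P] += q[a, b] / v
        zeta = num / (den + eps)
        # M-step, eqs. (8)-(9): responsibility-weighted mean and variance
        w = np.zeros(T_e); wz = np.zeros(T_e); wz2 = np.zeros(T_e)
        for a, s in enumerate(starts):
            seg = zeta[s:s + P]
            for b, t in enumerate(taus):
                w[t:t + P] += q[a, b]
                wz[t:t + P] += q[a, b] * seg
                wz2[t:t + P] += q[a, b] * seg ** 2
        mu = wz / (w + eps)
        phi = np.maximum(wz2 / (w + eps) - mu ** 2, 1e-4)
    return mu, phi, zeta
```

Holding mu and phi fixed and iterating only the q and zeta updates gives the pure inference (reconstruction) procedure of Section 4.2.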
We converted the raw data from Euler angles to the quaternion system and scaled the translation data for efficient use of the epitome. The motion images in Figures 3 and 4 are obtained by extracting only the rotation angles after the conversion and scaling, for better representation.

Figure 4: Examples of motion epitomes learned from training data of (a) basketball-dribbling and (b) jogging motions.

To see the details of the motion visualization, visit http://www.cc.gatech.edu/ wooyoung/motionvisual/motionvisual.htm. On this site, we show that motion types can be distinguished with this motion image.

5. Applications of the Motion Epitome

The motion epitome can be used for several motion applications, including motion synthesis, key-frame interpolation, motion editing, transitions in motion blending, and motion classification. Here we present three of these applications: motion fill-in, motion classification, and measuring motion naturalness. We use the same graphical model for these tasks as in Figure 2, and the same notation as in Section 3. The core algorithm in each of these applications is first to reconstruct the true motion from an input motion sequence, given a previously trained epitome model, and then to compute the similarity between the reconstructed motion and the input.

5.1. Motion Reconstruction

We first consider the general case of motion reconstruction with the motion epitome. The process follows the inference algorithm of Section 4.2. Given an epitome model e and an input motion X, we iterate the updates (6) and (7) until they converge. The sequence ζ = (ζ_i) is then the resulting reconstructed motion. We can observe from equation (2) that the generated value x_i will be dominated by the Gaussian noise if its variance ψ_i is very high; the inferred feature value ζ_i shared for time i is then determined by the patches predicted from the epitome, as shown in (6). We also see that if

the noise ψ_i is very low, ζ_i is dominated by x_i. From these observations, we can set ψ_i high if x_i is unreliable. If the noise variance ψ_i is known, then we only iterate equations (7) and (6) to obtain ζ. In the motion applications in Sections 5.3 and 5.4, we deal with the case where the noise variance is unknown and not uniform.

5.2. Motion Fill-In

Figure 5: Synthesized frames using epitomic analysis and a non-parametric method. Bottom-left is the result of epitomic analysis and bottom-right is the result of non-parametric synthesis. Compared with the original frame shown at bottom-middle, the frames reconstructed using the epitome show a smoother transition.

    Motion type             Error with epitomic analysis   Error with non-parametric synthesis
    jog and left turn       1.34e+03                       6.67e+03
    jog and right turn      4.30e+03                       7.80e+03
    jog and walking turn    1.01e+03                       13.50e+03
    dribbling basketball    8.40e+03                       51.90e+03
    sitting and walking     1.40e+03                       9.10e+03

Table 1: Results of filling-in with the motion epitome and with non-parametric synthesis.

The purpose of this task is to fill in several missing frames of a given input motion. The feature values of missing frames are modelled as having infinite variance in the measurement noise. Hence, we set ψ_i = ∞ for each missing frame i. Then we only need to reconstruct the ζ_i of the missing frames by iterating equations (7) and (6). In this experiment, we first construct several types of motion epitomes, such as dribbling a basketball, turning while jogging, walking, and standing or walking after sitting. Then a walking motion with missing frames is reconstructed from the walking epitome, a jogging motion from the jogging epitome, and so on. We present the results in Table 1, with a comparison to the non-parametric synthesis of [2]. Non-parametric synthesis searches for

windows of data in the motion library which best match the neighborhood of the missing frames, and fills in by copying. One example is shown in Figure 5. The frames synthesized using epitomic analysis are more similar to the original frames than those of non-parametric sampling. As the error values in Table 1 show, the motions synthesized with epitomic analysis have lower errors than those of non-parametric synthesis overall. These results confirm that we can fill in frames using only the patches generated from the motion epitome, whose size is very small, rather than searching a huge data library for the most similar patches. We plan to extend this experiment to the application of creating smooth motion transitions, as in [6].

5.3. Motion Classification

Figure 6: Charts of the jump model and various types of test motion, including jump.

Figure 7: Charts of the boxing model and various types of test motion, including boxing.

The basic approach to motion classification is to find an appropriate similarity measurement between a given input and one type of motion epitome. Hence, we estimate the measure by maximizing the probability of the input given the epitome, P(X | e) = P(X, e)/P(e). Since the epitome e is already given, this leads to finding the joint

Figure 8: Charts of the walking model and various types of test motion, including walking.

Figure 9: Charts of the running model and various types of test motion, including running.

probability of X and e. For this purpose, we take the log of the joint probability as follows:

    log P(X, e) = log Σ_{T,Z} P(X, e, T, Z)
                = log Σ_{{z_k, τ_k}} P(X, {z_k, τ_k}, e)
                ≥ Σ_{{z_k, τ_k}} q({z_k, τ_k}) log [ P(X, {z_k, τ_k}, e) / q({z_k, τ_k}) ].    (10)

We choose the q-distribution as in equation (7) of Section 4.2. By substituting equation (5) into (10) to find the optimal solution, we eventually obtain q(τ_k) and ζ_i as in equations (7) and (6), and ψ_i = (x_i - ζ_i)^2, since ψ_i is also unknown. We update ζ_i, ψ_i and q(τ_k) iteratively until convergence and substitute them into equation (10); the measure for

Figure 10: Charts of the basketball-dribbling model and various types of test motion, including basketball-dribbling.

motion classification is

    Σ_{i=1..n} log N(x_i; ζ_i, ψ_i) + log p(e).    (11)

In our experiment, we first learned 5 types of motion epitomes: dribbling a basketball, boxing, walking, running/jogging, and jumping. Next, we reconstruct a true motion given a test motion and an epitome. The score that identifies the type of each test motion is then obtained from equation (11). Figures 6 to 10 show the results. Each figure shows the scores of seven different sets of data against one epitome model: walk, run, jump, boxing, or basketball. The graph on the left of each figure presents the scores of each set as a curve, where each point of a curve is a single motion sequence. To represent each data set according to its scores, we added a histogram on the right side of each figure. To better understand the graphs, take Figure 8 as an example. The given epitome model type is walking, and seven different test sets are scored against it. Here, walk is a set of walking motions that were excluded from the training data when we learned the epitome, while walk2 is a training set. These two sets have much higher scores than the other types of set overall, so we can draw a boundary for motion classification.

5.4. Motion naturalness

Ever since motion capture data has been used for synthesizing human motion, looking natural has been an important measure of the quality of the output. Several possible measures are proposed in [7], where the authors present statistical models of natural motion using mixtures of Gaussians, hidden Markov models, and switching linear dynamic systems. In this paper, we present epitomic analysis as another approach to measuring motion naturalness. The procedure mostly follows that of motion classification, except that we train a classifier to distinguish between natural and unnatural movement based on the natural training data.
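The score of equation (11), with the unknown noise set to ψ_i = (x_i - ζ_i)^2, can be sketched as below. This is a simplified scalar version; `log_prior` is a placeholder for log p(e), since the paper does not specify how that term is evaluated, and the small constant added to ψ is a numerical-stability assumption.

```python
import numpy as np

def epitome_score(x, zeta, log_prior=0.0):
    """Score a test motion against an epitome model (equation (11)).

    x    : observed series
    zeta : reconstruction of x inferred from the epitome (Section 5.1)
    With psi_i = (x_i - zeta_i)^2, the score is
    sum_i log N(x_i; zeta_i, psi_i) + log p(e).
    """
    psi = (x - zeta) ** 2 + 1e-8  # avoid log(0) on exact reconstructions
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * psi) + (x - zeta) ** 2 / psi)
    return log_lik + log_prior
```

Classification then reduces to reconstructing the test motion under each trained epitome and picking the model with the highest score; closer reconstructions yield smaller ψ and hence higher scores.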
In this experiment, we take two types of motions, walk and run, and provide a way of discriminating unnaturalness with natural epitome models. First, we construct three types of epitome models using only natural motions: walk, run, and a combination of walk and run motions. Second, given an input motion sequence, we infer a true motion based on each model, then compute a score using the measure in equation (11). We present the results in Figures 11, 12 and 13. In Figure 11, the left graph illustrates that most of the natural walking motions have a higher score than the unnatural ones under the natural walk model. Similar results appear for run motions against the run model in the left graph of Figure 12, but the classifier is less clear than for walk motions. This phenomenon can be explained as follows. The unnatural motion capture data in the testing sets were obtained from a number

Figure 11: Walking motions measured with two different types of epitome model. The chart on the left shows scores of natural and unnatural walk motions against the natural walk epitome model; the chart on the right shows scores against the natural run epitome model.

of sources, including editing, key-framing, added noise, bad motion transitions, and insufficiently cleaned motion capture data. Most of the unnatural run motions tested were collected via bad key-framing, unlike the walk motions, which mostly came from a bad editing process. Motions in this category are likely to contain only very short stretches of unnaturalness, so a natural model cannot catch the differences easily. We also present other interesting results in the right graphs of Figures 11 and 12. As we can see in the graph on the right of Figure 11, the natural run epitome model does not distinguish unnatural walks from natural ones. The same holds for unnatural runs against the natural walk model. These results tell us that each type of epitome model is more sensitive to the differences between motion types than to the subtle changes between naturalness and unnaturalness. We can, however, overcome this limitation by combining different types of natural motions to build a natural epitome model. We demonstrate this possibility in Figure 13, where unnatural walks and runs are scored against the combined natural walk-and-run model. We plan to extend this experiment with a more robust framework in the future.

6. Conclusion

We have shown that the motion epitome is an innovative representative model of motion capture data, and have performed several experiments with motion applications of the epitome. The results of fill-in and classification are quite successful. Furthermore, we showed the potential of epitomic analysis for measuring motion naturalness.
We have also shown that the motion epitome does not just discriminate between natural and unnatural motions, but also gives a way of fixing the unnaturalness of a given input through reconstruction. More results are available at http://www.cc.gatech.edu/ wooyoung/motionepitome/motionepitome.htm. There are two contributions in our work. First, from the machine learning point of view, we have explored the use of epitomic analysis in a new problem domain, motion capture data. Second, from the graphics point of view, epitomic analysis has the potential to produce richer and more realistic continuous animation than other conventional generative models, such as HMMs and SLDSs, because epitomic analysis is a patch-based algorithm. The motion epitome, however, has to deal with a high-dimensional feature space, unlike image and video epitomes. For efficiency, we need to find a way of reducing the dimension, such as combining PCA with the epitome.

Figure 12: Running motions measured with two different types of epitome model. The chart on the left shows scores of natural and unnatural run motions against the natural run epitome model; the chart on the right shows scores against the natural walk epitome model.

Another way is to use a subset of joints instead of the whole body, so that we can work in a lower-dimensional feature space. With a more efficient motion epitome, we would like to compare our motion naturalness results with those of [7].

References

[1] V. Cheung, B. J. Frey, and N. Jojic. Video Epitomes. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005.
[2] A. Efros and T. K. Leung. Texture Synthesis by Non-parametric Sampling. In Proc. IEEE Intern. Conf. on Computer Vision (ICCV), 1999.
[3] B. J. Frey and N. Jojic. Advances in algorithms for inference and learning in complex probability models. IEEE Trans. PAMI, 2003.
[4] N. Jojic, B. J. Frey, and A. Kannan. Epitomic Analysis of Appearance and Shape. In Proc. IEEE Intern. Conf. on Computer Vision (ICCV), 2003.
[5] A. Kapoor and S. Basu. The Audio Epitome: A New Representation for Modeling and Classifying Auditory Phenomena. In Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[6] L. Kovar, M. Gleicher, and F. Pighin. Motion Graphs. ACM Transactions on Graphics 21(3), Proceedings of SIGGRAPH 2002.
[7] L. Ren, A. Patrick, A. A. Efros, J. K. Hodgins, and J. M. Rehg. A Data-Driven Approach to Quantifying Natural Human Motion. In Proc. SIGGRAPH 2005.

Figure 13: Walking and running motions measured with the combined walk-and-run epitome model. The chart on the left shows scores of walk motions; the chart on the right shows scores of run motions.