Stylized synthesis of facial speech motions

By Yuru Pei* and Hongbin Zha

Computer Animation and Virtual Worlds 2007; 18. Published online 26 June 2007 in Wiley InterScience.

*Correspondence to: Y. Pei, State Key Laboratory of Machine Perception, Science Building 2, Peking University, No. 5 Yiheyuan Road, Haidian District, Beijing, China. peiyuru@cis.pku.edu.cn

Abstract

Stylized synthesis of facial speech motions is central to facial animation. Most synthesis algorithms emphasize the reasonable concatenation of captured motion segments, while the dynamic modeling of speech units, e.g. visemes and visyllables (the visual appearance of a syllable), has drawn little attention. In this paper, we address the fundamental issues of the stylized dynamic modeling of visyllables. A decomposable generalized model is learnt for stylized motion synthesis. The visyllable modeling has two parts: (1) a dynamic model for each kind of visyllable, learnt with a Gaussian Process Dynamical Model (GPDM); and (2) a unified mapping between the high dimensional observation space and the low dimensional latent space, based on a multilinear model. The dynamic visyllable model embeds the high dimensional motion data and simultaneously constructs the dynamic mapping in the latent space. To generalize the visyllable model over several instances, the mapping coefficient matrices are assembled into a tensor, which is decomposed into independent modes, e.g. identity and uttering style. Novel stylized motions can then be synthesized by linearly combining the components of each mode. Copyright 2007 John Wiley & Sons, Ltd.

Received: 15 May 2007; Accepted: 17 May 2007

KEY WORDS: speech animation; visyllable dynamic model; stylized synthesis; decomposable generalized mapping

Introduction

Synthesizing stylized speech motions is important for realistic human facial animation. Currently, 3D facial motions can be captured in real time with optical devices. Given the large volume of such datasets, most 3D speech motion synthesis algorithms are based on the concatenation of captured motion segments and variable transition models [1-5] to handle coarticulation. Hitherto, little effort has addressed the internal dynamics of the pronunciation unit: what the dynamics inside a visyllable are, and how the facial shape varies throughout one visyllable.

In this paper, we address a dynamic model specific to each kind of visyllable. Moreover, a decomposable generalized mapping between the low dimensional latent space and the high dimensional observation space is generated based on a multilinear model. The work rests on the assumption that the stylized mapping can be decomposed into a combination of independent components. The attributes that parameterize the mapping space are computed via the N-mode Singular Value Decomposition (SVD). A novel stylized visyllable is synthesized by a linear combination of the decomposed mode components.

Our framework incorporates the speech motion dynamics into the visyllable model. As illustrated in Figure 1, first, a GPDM is learnt for every visyllable instance; then a generalized mapping is generated based on a multilinear model; finally, stylized motions are synthesized by the linear combination of components and the mean prediction in the visyllable dynamic model.

The main idea of this paper is to represent the visyllable dynamics with a Gaussian Process. The context-dependent speaking motion transitions are largely incorporated inside the visyllable models. The model consists of an explicit mapping function for the temporal transition and a low dimensional embedding. A generalized uttering-style mapping is constructed as a tensor product of independent mode components.

Figure 1. An overview of stylized speech synthesis. The stylized motion synthesis is achieved with the mean prediction in the low dimensional space and the stylized mapping to the high dimensional observation space.

Related Work

Facial speech motions have highly repetitive motion patterns, clear meanings (uttered words), complex physically driven mechanisms (the joint work of a set of muscles), and high dimensional representations. Motion synthesis is mainly based on the concatenation of visual units [2-4,6,7]. Ezzat et al. [2] employ a variant of the Multidimensional Morphable Model (MMM) to embed mouth configurations. An HMM is employed to model a probabilistic state machine for speech animation [6]. A dynamic function of phonemes is presented in Reference [8]. Cao et al. [9] build a data structure called the Anime Graph to encapsulate a facial shape database along with the speech information. A Radial Basis Function (RBF) network is used to map Mocap speech data to 3D facial models [1], in which EM-PCA is used to learn the coarticulation model and find the expressive eigenspace from Mocap data. Hitherto, the dynamics inside speech units has not drawn much attention; we propose a framework that incorporates the speech dynamics into the visyllable model. Our approach is similar to the Motion Texture of Li et al. [10]. In their work, the repetitive motion pattern is modeled with a Linear Dynamic System (LDS), and a separate embedding mechanism handles the high dimensional motion data. Instead, we provide an integrated framework for the motion embedding and the dynamic modeling, along with a generalized style mapping.

Motion analysis has drawn attention for many years. Motion data can be obtained from videos, motion capture devices, and real-time 3D scanners. Due to the high dimensionality and the complex transition dynamics of the data, dimensionality reduction and dynamics modeling are the major issues in this field. Most work separates the data embedding from the dynamics modeling [3,4,7,10-12]. First, linear and non-linear embedding techniques are used to acquire a low dimensional representation, e.g. PCA [3] and graph-based non-linear dimensionality reduction methods such as Isomap [7,12] and LLE [11]. Then, dynamic models are generated in the low dimensional space by different algorithms, e.g. the HMM [4,6], the LDS [10], and the GMM [12]. In linear embeddings such as PCA and ICA, the mapping between the observations and the latent variables is constructed explicitly, whereas this is not the case in many non-linear embedding methods. Rahimi et al. [13] learn a semi-supervised embedding through RBF regression with Newtonian dynamics in the low dimensional space; however, probabilistic transitions are not embodied. Lawrence [14] proposes an unsupervised probabilistic embedding with a Gaussian Process, the GPLVM, which has been used in stylized motion synthesis [15]. In the GPLVM, the mapping function is kernelized with the RBF; with a Gaussian prior, the mapping weights can be marginalized to yield multivariate Gaussian data likelihoods. Wang et al. [16] propose the GPDM, which incorporates a dynamic model in the latent space and is a dynamic extension of the GPLVM. A modified GPDM [17] has been used in 3D people tracking. In this paper, we employ the GPDM to model the visyllable dynamics.

Bilinear and multilinear models have been employed in computer vision and computer graphics [11,18-21]; the attributes affecting the appearance are decoupled. Content and style separation based on the bilinear model has been used in the analysis of gait and face image ensembles. Vasilescu and Terzopoulos [19] use the multilinear model to decompose face image data into different components: head position, illumination, and expression. In 3D face analysis [20], bilinear and trilinear models are used to separate the components related to identities, expressions, and visemes. In this paper, a generalized mapping based on the multilinear model is proposed to separate orthogonal speaking-style components.

Visyllable Model Learning

Visyllable Feature Vectors

The speech motions in this paper are confined to the subregion of the human face that is directly related to pronunciation, including the lips, the chin, and the cheeks. Every visyllable is represented as a set of marker trajectories. The feature vector $y_i = [p_i, v_i]$ defines every frame in a sequence with $n$ markers, where $p_i$ is the vector of 3D coordinates of the facial markers and $v_i$ is the marker velocity, the difference between two consecutive frames, $v_i = p_i - p_{i-1}$. A visyllable with $m$ frames is represented as an $m \times 6n$ matrix whose rows are the feature vectors $y_i$.

Visyllable Model

The high dimensional visyllable data is formidable to process, so the dimensionality of the motion data has to be reduced for dynamic model learning. Instead of separating the dimensionality reduction from the dynamic model learning, the GPDM computes the dynamic mapping while finding the low dimensional latent variables. Thus, the dynamics modeling and the dimensionality reduction are integrated into the same framework, and a smooth probability density function over the latent variables is obtained. The mapping between the high dimensional feature vector $y_t$ and the latent variable $x_t$, as well as the dynamic mapping in the latent space, are both kernelized with the RBF. The dynamic model is defined as:

$$x_t = \sum_i a_i \phi_i(x_{t-1}) + n_{x,t} \qquad (1)$$

The mapping between the high dimensional feature vector and the latent variable is defined as:

$$y_t = \sum_j b_j \varphi_j(x_t) + n_{y,t} \qquad (2)$$

The weights $A = \{a_i\}$, $B = \{b_j\}$ and the basis functions $\phi_i$, $\varphi_j$ define the dynamics in the latent space and the non-linear mapping, and $n_{x,t}$ and $n_{y,t}$ are zero-mean white Gaussian noise. The learning process solves for the latent variables and the hyper-parameters of the kernel functions. With Gaussian priors on the weights, the parameters can be computed by minimizing the negative log-posterior:

$$L = -\ln p(X, Y, \bar{\alpha}, \bar{\beta}) = \frac{d}{2}\ln|K_X| + \frac{1}{2}\mathrm{tr}\left(K_X^{-1} X_{OUT} X_{OUT}^T\right) + \sum_j \ln \alpha_j - m \ln|W| + \frac{D}{2}\ln|K_Y| + \frac{1}{2}\mathrm{tr}\left(K_Y^{-1} Y W^2 Y^T\right) + \sum_j \ln \beta_j \qquad (3)$$

where $K_Y$ and $K_X$ are the kernel matrices of the mapping and the dynamic model; the matrix elements are defined with the RBF and an affine transformation (see the Appendix). For every visyllable instance, the mapping $f_Y(x)$ from the low dimensional representation to the high dimensional motion data is computed along with the dynamic mapping $f_X(x)$ over the latent variables:

$$f_Y(x) = \mu_Y + \bar{Y}^T K_Y^{-1} k_Y(x) \qquad (4)$$

$$f_X(x) = X_{OUT}^T K_X^{-1} k_X(x) \qquad (5)$$

where $\mu_Y = \frac{1}{m}\sum_i^m y_i$ and $\mu_X = \frac{1}{m}\sum_i^m x_i$ are the means of the training data, $\bar{Y}$ is the observation matrix with the mean subtracted, and $X_{OUT}$ is the output of the dynamic mapping. As shown in Figure 2, each point in the latent space corresponds to a face configuration represented by the feature vector. Warm colors indicate a small reconstruction variance and cold colors a large one; this color convention is shared by all the figures in this paper. With the explicit mapping functions, a new motion sequence can be predicted given an initial pose.

Figure 2. The reconstruction variance map of a visyllable in the latent space.
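The following numpy sketch makes the mean mappings of equations (4) and (5) concrete for a learnt model. It assumes the latent trajectory X and the observations Y are already given by the GPDM optimization, and it uses a plain RBF kernel, omitting the affine and noise terms detailed in the Appendix; all names are illustrative.

```python
import numpy as np

def rbf(a, b, inv_width=1.0):
    # Plain RBF kernel between two point sets (rows are points).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * inv_width * d2)

def fit_mean_mappings(X, Y):
    # X: latent trajectory (m x d), Y: feature vectors (m x 6n).
    mu_Y = Y.mean(axis=0)
    Y_bar = Y - mu_Y                               # mean-subtracted observations
    K_Y = rbf(X, X) + 1e-6 * np.eye(len(X))        # mapping kernel matrix
    K_X = rbf(X[:-1], X[:-1]) + 1e-6 * np.eye(len(X) - 1)
    X_out = X[1:]                                  # dynamics outputs x_2 .. x_m

    A_Y = np.linalg.solve(K_Y, Y_bar)              # K_Y^{-1} Y_bar, solved once
    A_X = np.linalg.solve(K_X, X_out)              # K_X^{-1} X_OUT

    def f_Y(x):                                    # eq. (4): latent -> features
        return mu_Y + A_Y.T @ rbf(X, x[None, :])[:, 0]

    def f_X(x):                                    # eq. (5): latent -> next latent
        return A_X.T @ rbf(X[:-1], x[None, :])[:, 0]

    return f_Y, f_X
```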

Style Separation in a Generalized Model

Due to inter-personal differences in uttering styles, instances of the same visyllable may take on different appearances. Moreover, uttering a syllable with different speeds and stresses can cause varied appearances even for the same person. Figure 3 shows the effects of stress and speed variations on marker trajectories. The goal is to build a generalized model that accommodates all the instances.

Figure 3. Trajectory variations of a marker (y-direction) due to variations in stress (left) and speed (right). The neutral speech trajectory is shown as a green dashed line for comparison.

One intuitive method is to concatenate all instances together and feed them to the visyllable model learning. However, due to the style discrepancy, the difference between instances is larger than that within a visyllable, as shown in Figure 4: the embeddings of different instances lie far away from each other in the latent space. Our system therefore employs an alternative method. First, a dynamic model specific to every instance is learnt. Second, the mapping coefficients of all the models are assembled into a tensor, and the N-mode SVD is applied for style separation.

Figure 4. The upper row shows the embeddings of two instances with different styles; the bottom shows the generalized embedding of all instances of a visyllable.

A generalized embedding has to be constructed before the tensorization of the coefficient matrices. We select an instance with the neutral style as the reference, and the latent embedding of the reference is chosen as the initial value when learning the other instances. In this way, the latent variables take similar values in the embedding space, as shown in Figure 4. The mapping between the latent space and the input motions is rewritten as:

$$f_Y^i(x) = B_0^i + B_1^i\, k_Y^{ref}(x), \quad i = 1, \ldots, N \qquad (6)$$

where $B_0^i = \mu_Y^i$ and $B_1^i = \bar{Y}^T K_Y^{-1}\, k_Y^i(x^*)/k_Y^{ref}(x^*)$. $N$ is the number of instances of one visyllable, and $k_Y^i(x^*)/k_Y^{ref}(x^*)$ is the ratio between the $i$th kernel and the reference kernel, evaluated on the $i$th instance's latent variables.

After the model learning for all instances, the mapping coefficients $B_1^i$ in $f_Y(x)$ are assembled in order into a tensor with respect to the identity and the uttering styles, and the independent mode components are separated by tensor decomposition. A concise description of the relevant multilinear algebra can be found in References [19,20]. The tensor is decomposed as the mode product of a core tensor $Z$ with a set of mode matrices $U$:

$$T = Z \times_1 U_{ID} \times_2 U_{SA} \times_3 U_{MC} \qquad (7)$$

The core tensor $Z$ encodes the structure information and controls the interaction between the mode matrices. $U_{ID}$ is the mode matrix for identity, $U_{SA}$ for uttering style, and $U_{MC}$ for the mapping coefficients $B_1$. The tensor is thus decomposed as the N-mode product of orthogonal spaces related to the different attributes. The core tensor is computed as:

$$Z = T \times_1 U_{ID}^T \times_2 U_{SA}^T \times_3 U_{MC}^T \qquad (8)$$
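A minimal numpy sketch of the N-mode SVD of equations (7) and (8), under the assumption that the coefficient vectors $B_1^i$ are stacked into a third-order tensor indexed by identity, uttering style, and mapping coefficient; the shapes are illustrative.

```python
import numpy as np

def unfold(T, mode):
    # Mode-n unfolding: rows indexed by the chosen mode.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    # n-mode product T x_n M: contract the given mode of T with the columns of M.
    out = np.tensordot(M, np.moveaxis(T, mode, 0), axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def n_mode_svd(T):
    # Mode matrices from the SVD of each unfolding; core tensor from eq. (8).
    U = [np.linalg.svd(unfold(T, n), full_matrices=False)[0]
         for n in range(T.ndim)]                    # U_ID, U_SA, U_MC
    Z = T
    for n, Un in enumerate(U):                      # Z = T x_1 U_ID^T x_2 ...
        Z = mode_product(Z, Un.T, n)
    return Z, U

# Reconstruction check, eq. (7): T == Z x_1 U_ID x_2 U_SA x_3 U_MC.
# Illustrative shape: 2 identities x 9 uttering styles x 50 coefficients.
T = np.random.randn(2, 9, 50)
Z, (U_ID, U_SA, U_MC) = n_mode_svd(T)
T_rec = mode_product(mode_product(mode_product(Z, U_ID, 0), U_SA, 1), U_MC, 2)
assert np.allclose(T, T_rec)
```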

The mode matrices $U_{ID}$ and $U_{SA}$ can be seen as independent of the speaking content, and together they compose the uttering styles. The dynamic mappings in the latent space are similar across a generalized visyllable model; the mappings between the latent space and the high dimensional observation space, however, vary, and are formulated as the tensor product of the different style modes. The generalized mapping enables the linear combination of style factors for the synthesis of a novel style.

Motion Synthesis

Given a syllable script, motion synthesis builds the 3D trajectories of the facial markers. We assume the initial pose $y_0$ is predefined, e.g. the neutral facial configuration; the goal is to generate a complete motion sequence of a visyllable given an initial pose. As the probability distribution of the latent variables has been learnt, the embedded motion in the latent space can be reconstructed with the dynamic mapping $f_X(x)$. The high dimensional motions in the observation space are then computed via the stylized tensor product.

Visyllable Motion Prediction in the Latent Space

Given an initial pose in the latent space, the visyllable trajectory can be constructed with the dynamic mapping function $f_X(x)$ by Gaussian sampling with zero-mean noise added. However, when the initial pose diverges from the training data, the mean prediction by the dynamic mapping produces a sequence with a large reconstruction variance that drifts from the training data. To reduce the uncertainty in the prediction, an optimization is introduced to minimize an objective function $L_1(x, y)$ related to the likelihood of the new data:

$$L_1(x, y) = \frac{(x - f_X(x))^2}{2\sigma_X^2(x)} + \frac{d}{2} \ln \sigma_X^2(x) \qquad (9)$$

where $\sigma_X^2(x)$ is the variance of the dynamic mapping and $d$ is the dimension of the latent space.
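As an illustrative sketch of the mean prediction and its uncertainty, the loop below iterates $f_X$ from an initial latent pose and tracks a standard GP predictive variance as a stand-in for $\sigma_X^2(x)$ in Equation (9); the simplified RBF kernel and all names are assumptions, not the paper's exact formulation.

```python
import numpy as np

def rbf(a, b, inv_width=1.0):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * inv_width * d2)

def predict_sequence(x0, X, n_steps):
    # X: learnt latent trajectory (m x d); inputs/outputs of the dynamics.
    X_in, X_out = X[:-1], X[1:]
    K_inv = np.linalg.inv(rbf(X_in, X_in) + 1e-6 * np.eye(len(X_in)))
    traj, var = [x0], []
    x = x0
    for _ in range(n_steps):
        k = rbf(X_in, x[None, :])[:, 0]        # k_X(x) against training inputs
        s2 = 1.0 - k @ K_inv @ k               # predictive variance; k(x,x)=1 here
        x = X_out.T @ (K_inv @ k)              # mean prediction f_X(x), eq. (5)
        traj.append(x)
        var.append(max(s2, 0.0))               # grows far from data, cf. eq. (9)
    return np.array(traj), np.array(var)

# Hypothetical usage: start from the first training pose.
# traj, var = predict_sequence(X[0], X, n_steps=30)
```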

Style Interpolation

Synthesizing a novel style is an interesting issue in motion synthesis. A log-space interpolation scheme has been used in style-based IK [15]. In our system, with the multilinear model, the mapping in Equation (4) is reformulated as:

$$f_Y(x) = \sum_{i=1}^{N} B_0^i + Z \times_1 U_{ID} \times_2 U_{SA} \times_3 U_{MC}\, k_Y^{ref}(x) \qquad (10)$$

The novel style is synthesized with the tensor product of the style mode matrices. Each mode matrix can be composed as a linear combination of components; based on this component interpolation, the stylized mapping in the space spanned by the tensor mode components can be obtained.

Experiments

Two subjects were asked to pronounce each syllable with the neutral style three times, at three speeds, and with three stresses; thus, for each visyllable there are 18 instances. The motion-captured data are aligned in preprocessing. A dynamic model is learnt from the motion data of every syllable; some of the visyllable dynamic models used in the experiments are shown in Figure 5. A generalized model is constructed with a decomposable mapping between the latent space and the high dimensional observation space.

Figure 5. Some English and Mandarin visyllable dynamic models used in the experiments. The three dimensional latent spaces are shown with the reconstruction variance plotted. The circled blue line traces the latent variables of the reference instance, while the triangled magenta line traces the latent variables of the synthesized sequences.

When assembling the tensor, the mapping coefficients are brought to the same size via time warping. The length of the coefficient matrix of the first visyllable dynamic model is selected as the reference for the time warping, and the time scaling parameter is defined as $r_t = m_i / m_{ref}$, where $m_i$ is the frame number of the current instance and $m_{ref}$ the frame number of the reference visyllable. With the N-mode SVD, the components corresponding to each mode are computed as shown in Figure 7, where the first row gives the coefficient vectors of the uttering mode components and the first column the coefficient vectors of the identity mode components. The central part of Figure 7 shows the facial configurations for each mode component in a visyllable model (the 16th frame of each motion sequence is shown), and the trajectories of one lower-lip marker are plotted for each identity.

Figure 7. The coefficient vectors of mode components.

Visual Speech Motion Synthesis

With the first pose specified, the motion sequence for a syllable can be predicted by the dynamic mapping in the latent space. The optimization in the motion synthesis phase keeps the synthesis results consistent with the training data. Synthesis results are shown in Figures 5 and 6 and in the accompanying videos.

Figure 6. The comparison of three synthesized marker trajectories (y-direction) against the ground truth in the Euclidean coordinate system. The positions of the three markers are shown in the left image.

Stylized Motion Synthesis

We linearly combine the identity mode components $U_{ID}^i$ and the uttering mode components $U_{SA}^i$ to synthesize a novel style:

$$U_{ID} = \alpha^U U_{ID}^1 + (1 - \alpha^U) U_{ID}^2, \qquad U_{SA} = \sum_{i=1}^{9} \beta_i^U U_{SA}^i$$

where $0 \le \alpha^U \le 1$, $0 \le \beta_i^U \le 1$, and $\sum_{i=1}^{9} \beta_i^U = 1$.
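As a sketch of this interpolation, assuming the core tensor Z and the mode matrices from the N-mode SVD sketch above (two identities, nine uttering styles), the blended mode vectors can be folded back through the mode products to yield a stylized coefficient vector $B_1$; all names here are illustrative.

```python
import numpy as np

def mode_product(T, M, mode):
    # Same n-mode product as in the N-mode SVD sketch.
    out = np.tensordot(M, np.moveaxis(T, mode, 0), axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def interpolate_style(Z, U_ID, U_SA, U_MC, alpha, beta):
    u_id = alpha * U_ID[0] + (1.0 - alpha) * U_ID[1]    # blend the two identities
    u_sa = beta @ U_SA                                  # convex mix of 9 styles
    B1 = mode_product(mode_product(mode_product(
        Z, u_id[None, :], 0), u_sa[None, :], 1), U_MC, 2)
    return B1.reshape(-1)                               # stylized coefficients B_1

# Hypothetical usage with the beta weights summing to 1:
# beta = np.full(9, 1.0 / 9)
# B1 = interpolate_style(Z, U_ID, U_SA, U_MC, alpha=0.5, beta=beta)
# The stylized mapping of eq. (10) then evaluates B0 + B1 * k_Y_ref(x).
```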

Figure 8. The reference and stylized synthesis sequences. The first row is the reference motion sequence; the lower three rows are stylized synthesis motion sequences of different persons.

The motion sequences are computed by the stylized mapping in Equation (10). Figure 8 shows visyllable motion synthesis with the identity mode held constant. Figure 9 shows motions synthesized with the tensor product of modified identity and uttering modes, where the first and second columns are the coefficient vectors of the uttering mode components and the identity mode components, respectively, and the third column shows the synthesized motion sequences. Due to the time warping in the generalized model learning, the factor $r_t$ is used to rescale the synthesized motion sequences.

Figure 9. Stylized visyllable motion synthesis.

Comparison With the Ground Truth

For evaluation, we compare the motion-captured data of a test sentence with the synthesis results. As shown in Figure 6, the synthesized trajectories follow the ground truth. In some parts, however, there are apparent discrepancies, as circled; these occur mainly at the short pauses in the syllable scripts, where the system cannot reconstruct the varied motions of the pause segments. In a complete motion sequence, people tend to pay little attention to the mouth shape when there is no audio signal. Therefore, even with these discrepancies, the synthesis results show visual appearances consistent with the input syllable scripts.

Mapping the Motion Data to a Novel 3D Face

Mapping the motion-captured data to a novel 3D face is not a major issue of this paper. We employ a method similar to the blendshape face of Deng et al. [1]: an RBF network is trained for the mapping between motion-captured markers and shape blending coefficients. In our system, however, the prototypes are determined automatically with the Isomap embedding and clustering in the low dimensional space, as described in the speech motion transferring system [7]. The blending coefficients $G = (g_0, g_1, \ldots, g_m)$ for the training data of the RBF network are computed via the L2 distance to the clustering centroids. The mapping between the latent variable $x_j$ and the blending coefficients $G$ is computed with RBF regression:

$$G = Q(x_j) = \sum_i w_i\, h(\|x_j - c_i\|)$$

where $x_j$ is a 3D latent variable and the $c_i$ are the training data. $h(r) = \sqrt{r^2 + e^2}$ is the Hardy multiquadrics function, with $e$ a stiffness constant that regulates the effects of the feature points. The mapping results on a 3D face model are shown in Figure 10, where the first row is the ground truth, and the second and third rows show the synthesized marker set and the feature mesh of the corresponding frames.

Figure 10. Synthesized facial motions.
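A minimal sketch of this RBF regression with the Hardy multiquadrics basis; the centers C, the target blending coefficients, and the value of the stiffness constant e are illustrative placeholders.

```python
import numpy as np

def hardy(r, e=0.1):
    # Hardy multiquadrics h(r) = sqrt(r^2 + e^2); e is the stiffness constant.
    return np.sqrt(r ** 2 + e ** 2)

def fit_rbf(C, G_train, e=0.1):
    # Solve H w = G for the weights, with H_ij = h(||c_i - c_j||).
    H = hardy(np.linalg.norm(C[:, None, :] - C[None, :, :], axis=-1), e)
    W = np.linalg.solve(H, G_train)                # (n_centers, n_coeffs)

    def Q(x):                                      # latent point -> blending coeffs
        h = hardy(np.linalg.norm(C - x, axis=-1), e)
        return h @ W
    return Q

# Hypothetical usage: 3D latent points mapped to 12 blendshape weights.
C = np.random.randn(30, 3)
G_train = np.random.rand(30, 12)
Q = fit_rbf(C, G_train)
g = Q(np.zeros(3))
```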

Conclusion and Future Work

In this paper, we present a dynamic visyllable model to represent speech motion data, and propose a stylized motion synthesis method based on the learnt probability density function over facial configurations; the stylized synthesis is achieved by the generalized mapping from the latent space to the observation space.

The number of visyllables is comparatively large, e.g. approximately 400 in Mandarin and some 900 demi-visyllables [3] in English, which means a large number of visyllable models need to be learnt. It is therefore natural to classify the visyllables. Hitherto, visyllables have been classified via the syllable definition; we think the shape analysis of visyllables deserves more attention. To classify a visyllable according to its own visual appearance, a reasonable similarity description should be defined that accounts for the style variations; moreover, the similarity description should be robust in the case of only partial matching between visyllables. Automatic visyllable annotation and classification might be interesting directions for future research.

Appendix

GPDM Learning

The likelihood of the motion-captured data is modeled with the GPDM, a latent variable dynamic model. From the Bayesian view, the weights $A = \{a_i\}$ and $B = \{b_j\}$ of the basis functions can be marginalized. Given Gaussian priors on $a_i$ and $b_j$, the marginalization over $A$ and $B$ produces multivariate Gaussian data likelihoods of the following forms:

$$p(Y \mid X, \bar{\beta}) = \frac{|W|^m}{\sqrt{(2\pi)^{mD} |K_Y|^D}} \exp\left(-\frac{1}{2}\mathrm{tr}\left(K_Y^{-1} Y W^2 Y^T\right)\right)$$

$$p(X \mid \bar{\alpha}) = \frac{p(x_1)}{\sqrt{(2\pi)^{(m-1)d} |K_X|^d}} \exp\left(-\frac{1}{2}\mathrm{tr}\left(K_X^{-1} X_{OUT} X_{OUT}^T\right)\right)$$

where $W$ contains the scaling factors for the different dimensions of the observations, and $K_Y$ and $K_X$ are the kernel matrices. The elements of the kernel matrices are defined with the RBF and an affine transformation as:

$$k_Y(x, x') = \beta_1 \exp\left(-\frac{\beta_2}{2}\|x - x'\|^2\right) + \frac{\delta_{x,x'}}{\beta_3}$$

$$k_X(x, x') = \alpha_1 \exp\left(-\frac{\alpha_2}{2}\|x - x'\|^2\right) + \alpha_3 x^T x' + \frac{\delta_{x,x'}}{\alpha_4}$$

$\bar{\beta} = \{\beta_1, \ldots, W\}$ are the hyper-parameters of the kernel functions, with $\beta_1$ and $\beta_2$ the scale and the inverse width of the basis functions, and $\beta_3$ related to the noise. $X_{OUT} = (x_2, \ldots, x_m)$ is the output of the dynamic model, with input $X = (x_1, \ldots, x_{m-1})$, where $m$ is the number of training frames. The hyper-parameters $\alpha_1$ and $\alpha_2$ are the scale and the inverse width of the RBF basis function; the linear term is included in the kernel function of the dynamic model with coefficient $\alpha_3$, and $\alpha_4$ is related to the noise. Moreover, priors $p(\bar{\alpha}) \propto \prod_i \alpha_i^{-1}$ and $p(\bar{\beta}) \propto \prod_i \beta_i^{-1}$ are placed on the hyper-parameters of the mapping and the dynamic models to avoid overfitting. Thus, the generalized model of the observations, combining the priors, the mapping, and the dynamics, is:

$$p(X, Y, \bar{\alpha}, \bar{\beta}) = p(Y \mid X, \bar{\beta})\, p(X \mid \bar{\alpha})\, p(\bar{\alpha})\, p(\bar{\beta})$$

The model is learnt by minimizing the negative log-posterior in Equation (3).
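For concreteness, the two kernels above can be assembled into kernel matrices as in the following sketch; the hyper-parameter values are placeholders, since in the paper they are obtained by minimizing the negative log-posterior of Equation (3).

```python
import numpy as np

def k_Y(X1, X2, beta):
    # k_Y(x,x') = beta1 * exp(-beta2/2 ||x-x'||^2) + delta_{x,x'} / beta3
    b1, b2, b3 = beta
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    K = b1 * np.exp(-0.5 * b2 * d2)
    if X1 is X2:
        K += np.eye(len(X1)) / b3                  # noise term on the diagonal
    return K

def k_X(X1, X2, alpha):
    # k_X(x,x') = alpha1 * exp(-alpha2/2 ||x-x'||^2) + alpha3 x^T x'
    #             + delta_{x,x'} / alpha4          (RBF + linear term + noise)
    a1, a2, a3, a4 = alpha
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    K = a1 * np.exp(-0.5 * a2 * d2) + a3 * (X1 @ X2.T)
    if X1 is X2:
        K += np.eye(len(X1)) / a4
    return K

# Usage on a latent trajectory X (m x d): the dynamics kernel matrix is
# built on the inputs x_1 .. x_{m-1}, matching X_OUT = (x_2, ..., x_m).
X = np.random.randn(20, 3)
K_X = k_X(X[:-1], X[:-1], alpha=(1.0, 1.0, 0.1, 100.0))
K_Y = k_Y(X, X, beta=(1.0, 1.0, 100.0))
```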
Acknowledgements

The authors would like to thank Neil Lawrence for his publicly available source code of the GPLVM, and the anonymous referees for their useful suggestions for improving this paper. This work was supported in part by National Science Foundation (China) grants NSF and National 973 Research grants 2004CB.

References

1. Deng Z, Neumann U, Lewis JP, Kim TY, Bulut M, Narayanan S. Expressive facial animation synthesis by learning speech co-articulation and expression spaces. IEEE Transactions on Visualization and Computer Graphics 2006; 12(6).
2. Ezzat T, Geiger G, Poggio T. Trainable videorealistic speech animation. ACM Transactions on Graphics 2002; 21(3).
3. Kshirsagar S, Magnenat-Thalmann N. Visyllable based speech animation. Computer Graphics Forum 2003; 22(3).
4. Bregler C, Covell M, Slaney M. Video rewrite: driving visual speech with audio. In ACM SIGGRAPH '97 Proceedings, August 1997.
5. Ma J, Cole R, Pellom B, Ward W, Wise B. Accurate visible speech synthesis based on concatenating variable length motion capture data. IEEE Transactions on Visualization and Computer Graphics 2006; 12(2).
6. Brand M. Voice puppetry. In ACM SIGGRAPH '99 Proceedings, 1999.
7. Pei Y, Zha H. Transferring of speech movements from video to 3D face space. IEEE Transactions on Visualization and Computer Graphics 2007; 13(1).
8. King SA, Parent RE. Creating speech-synchronized animation. IEEE Transactions on Visualization and Computer Graphics 2005; 11(3).
9. Cao Y, Tien WC, Faloutsos P, Pighin F. Expressive speech-driven facial animation. ACM Transactions on Graphics 2005; 24(4).
10. Li Y, Wang T, Shum H. Motion texture: a two-level statistical model for character motion synthesis. In ACM SIGGRAPH '02 Proceedings, August 2002.
11. Elgammal A, Lee C. Separating style and content on a nonlinear manifold. In CVPR '04 Proceedings, July 2004.
12. Wang Q, Xu G, Ai H. Learning object intrinsic structure for robust visual tracking. In CVPR '03 Proceedings, June 2003.
13. Rahimi A, Recht B, Darrell T. Learning appearance manifolds from video. In CVPR '05 Proceedings, June 2005.
14. Lawrence ND. Gaussian process latent variable models for visualization of high dimensional data. In NIPS Proceedings, 2004.

15. Grochow K, Martin SL, Hertzmann A, Popovic Z. Style-based inverse kinematics. In ACM SIGGRAPH '04 Proceedings, August 2004.
16. Wang J, Fleet DJ, Hertzmann A. Gaussian process dynamical models. In NIPS Proceedings, 2005.
17. Urtasun R, Fleet DJ, Fua P. 3D people tracking with Gaussian process dynamical models. In CVPR '06 Proceedings, 2006.
18. Tenenbaum J, Freeman WT. Separating style and content with bilinear models. Neural Computation 2000; 12.
19. Vasilescu MAO, Terzopoulos D. Multilinear subspace analysis of image ensembles. In CVPR '03 Proceedings, June 2003.
20. Vlasic D, Brand M, Pfister H, Popovic J. Face transfer with multilinear models. ACM Transactions on Graphics 2005; 24(3).
21. Wang Y, Huang X, Lee C, et al. High resolution acquisition, learning and transfer of dynamic 3D facial expressions. Computer Graphics Forum 2004; 23(3).

Authors' biographies:

Yuru Pei received her Ph.D. in Computer Science from Peking University, China, in 2006, her M.S. in Computer Science from Zhejiang University in 2003, and her B.S. in Computer Science from Central South University. She is now an Assistant Professor in the State Key Laboratory of Machine Perception, Peking University. Her research is mainly in character animation, especially facial animation of speech and expression. She also works on craniofacial reconstruction.

Hongbin Zha received his B.E. in Electrical Engineering from Hefei University of Technology, China, in 1983, and his M.S. and Ph.D. in Electrical Engineering from Kyushu University, Japan, in 1987 and 1990, respectively. After working as a Research Associate in the Department of Control Engineering and Science, Kyushu Institute of Technology, Japan, he joined Kyushu University in 1991 as an Associate Professor. He was also a Visiting Professor in the Centre for Vision, Speech and Signal Processing, Surrey University, UK. Since 2000, he has been a Professor in the State Key Laboratory of Machine Perception, Peking University, Beijing, China. His research interests include computer vision, 3D geometric modeling, digital museums, and robotics. He has published over 140 technical publications in journals, books, and international conference proceedings. Dr Zha received the Franklin V. Taylor Award from the IEEE Systems, Man and Cybernetics Society.


More information

Expanding gait identification methods from straight to curved trajectories

Expanding gait identification methods from straight to curved trajectories Expanding gait identification methods from straight to curved trajectories Yumi Iwashita, Ryo Kurazume Kyushu University 744 Motooka Nishi-ku Fukuoka, Japan yumi@ieee.org Abstract Conventional methods

More information

Multi-Modal Face Image Super-Resolutions in Tensor Space

Multi-Modal Face Image Super-Resolutions in Tensor Space Multi-Modal Face Image Super-Resolutions in Tensor Space Kui Jia and Shaogang Gong Department of Computer Science Queen Mary University of London London, E1 4NS, UK {chrisjia,sgg}@dcs.qmul.ac.uk Abstract

More information

Manifold Clustering. Abstract. 1. Introduction

Manifold Clustering. Abstract. 1. Introduction Manifold Clustering Richard Souvenir and Robert Pless Washington University in St. Louis Department of Computer Science and Engineering Campus Box 1045, One Brookings Drive, St. Louis, MO 63130 {rms2,

More information

Deep Generative Models Variational Autoencoders

Deep Generative Models Variational Autoencoders Deep Generative Models Variational Autoencoders Sudeshna Sarkar 5 April 2017 Generative Nets Generative models that represent probability distributions over multiple variables in some way. Directed Generative

More information

Face Recognition using Tensor Analysis. Prahlad R. Enuganti

Face Recognition using Tensor Analysis. Prahlad R. Enuganti Face Recognition using Tensor Analysis Prahlad R. Enuganti The University of Texas at Austin Final Report EE381K 14 Multidimensional Digital Signal Processing May 16, 2005 Submitted to Prof. Brian Evans

More information

3D Human Motion Tracking Using Dynamic Probabilistic Latent Semantic Analysis

3D Human Motion Tracking Using Dynamic Probabilistic Latent Semantic Analysis 3D Human Motion Tracking Using Dynamic Probabilistic Latent Semantic Analysis Kooksang Moon and Vladimir Pavlović Department of Computer Science, Rutgers University Piscataway, NJ 885 {ksmoon, vladimir}@cs.rutgers.edu

More information

Behaviour based particle filtering for human articulated motion tracking

Behaviour based particle filtering for human articulated motion tracking Loughborough University Institutional Repository Behaviour based particle filtering for human articulated motion tracking This item was submitted to Loughborough University's Institutional Repository by

More information

Client Dependent GMM-SVM Models for Speaker Verification

Client Dependent GMM-SVM Models for Speaker Verification Client Dependent GMM-SVM Models for Speaker Verification Quan Le, Samy Bengio IDIAP, P.O. Box 592, CH-1920 Martigny, Switzerland {quan,bengio}@idiap.ch Abstract. Generative Gaussian Mixture Models (GMMs)

More information

Locality Preserving Projections (LPP) Abstract

Locality Preserving Projections (LPP) Abstract Locality Preserving Projections (LPP) Xiaofei He Partha Niyogi Computer Science Department Computer Science Department The University of Chicago The University of Chicago Chicago, IL 60615 Chicago, IL

More information

CS 231. Deformation simulation (and faces)

CS 231. Deformation simulation (and faces) CS 231 Deformation simulation (and faces) 1 Cloth Simulation deformable surface model Represent cloth model as a triangular or rectangular grid Points of finite mass as vertices Forces or energies of points

More information

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders CSC 411: Lecture 14: Principal Components Analysis & Autoencoders Richard Zemel, Raquel Urtasun and Sanja Fidler University of Toronto Zemel, Urtasun, Fidler (UofT) CSC 411: 14-PCA & Autoencoders 1 / 18

More information

Vision-based Control of 3D Facial Animation

Vision-based Control of 3D Facial Animation Eurographics/SIGGRAPH Symposium on Computer Animation (2003) D. Breen, M. Lin (Editors) Vision-based Control of 3D Facial Animation Jin-xiang Chai,1 Jing Xiao1 and Jessica Hodgins1 1 The Robotics Institute,

More information

Data fusion and multi-cue data matching using diffusion maps

Data fusion and multi-cue data matching using diffusion maps Data fusion and multi-cue data matching using diffusion maps Stéphane Lafon Collaborators: Raphy Coifman, Andreas Glaser, Yosi Keller, Steven Zucker (Yale University) Part of this work was supported by

More information

CS 231. Deformation simulation (and faces)

CS 231. Deformation simulation (and faces) CS 231 Deformation simulation (and faces) Deformation BODY Simulation Discretization Spring-mass models difficult to model continuum properties Simple & fast to implement and understand Finite Element

More information

Facial expression recognition using shape and texture information

Facial expression recognition using shape and texture information 1 Facial expression recognition using shape and texture information I. Kotsia 1 and I. Pitas 1 Aristotle University of Thessaloniki pitas@aiia.csd.auth.gr Department of Informatics Box 451 54124 Thessaloniki,

More information