Exploiting Spatial-temporal Constraints for Interactive Animation Control

Exploiting Spatial-temporal Constraints for Interactive Animation Control

Jinxiang Chai
CMU-RI-TR

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Robotics.

The Robotics Institute
Carnegie Mellon University
Pittsburgh, Pennsylvania
October, 2006

Thesis Committee:
Jessica Hodgins, Chair
Takeo Kanade
Leonard McMillan
Michiel van de Panne
Nancy Pollard

Copyright © 2006 by Jinxiang Chai. All rights reserved.

Abstract

Interactive control of human characters would allow the intuitive control of characters in computer/video games, the control of avatars for virtual reality, electronically mediated communication or teleconferencing, and the rapid prototyping of character animations for movies. To be useful, such a system must be capable of controlling a lifelike character interactively, precisely, and intuitively. Building an animation system for home use is particularly challenging because the system should also be low cost and not require a considerable amount of time, skill, or artistry to assemble.

This thesis explores an approach that exploits a number of different spatial-temporal constraints for interactive animation control. The control inputs from such a system will often be low dimensional, containing far less information than actual human motion. Thus they cannot be directly used for precise control of high-dimensional characters. However, natural human motion is highly constrained; the movements of the degrees of freedom of the limbs or facial expressions are not independent. Our hypothesis is that the knowledge about natural human motion embedded in a domain-specific motion capture database can be used to transform underconstrained user input into realistic human motions. The spatial-temporal coherence embedded in the motion data allows us to control high-dimensional human animations with low-dimensional user input.

We demonstrate the power and flexibility of this approach through three different applications: controlling detailed three-dimensional (3D) facial expressions using a single video camera, controlling complex 3D full-body movements using two synchronized video cameras and a very small number of retro-reflective markers, and controlling realistic facial expressions or full-body motions using a sparse set of intuitive constraints defined throughout the motion. For all three systems, we assess the quality of the results by comparison with animations created by a commercial optical motion capture system. We demonstrate that the quality of the animation created by all three systems is comparable to that of commercial motion capture, while requiring less expense, time, and space to capture the user's input.

Acknowledgments

First and foremost, I would like to thank my advisor, Professor Jessica Hodgins, for her inspiration, support and guidance throughout my graduate studies. This thesis would not have been possible without her guidance and resources. I would also like to thank the other members of my thesis committee, Michiel van de Panne, Nancy Pollard, Takeo Kanade, and Leonard McMillan, for their comments and insights on this work.

Many, many others have contributed to my effort. I am especially grateful to Dr. Harry Shum for introducing me to the area of graphics and vision during my stay at Microsoft Research Asia and for encouraging me to pursue an academic career after graduate school. The collaborations with Dr. Harry Shum and Dr. Xin Tong from Microsoft Research Asia, Dr. Richard Szeliski and Dr. Sing Bing Kang from Microsoft Research Redmond, and Dr. Ramesh Raskar from Mitsubishi Electric Research Laboratories have broadened and deepened my knowledge and understanding of graphics and vision. I would also like to thank my friend, Jing Xiao, for his collaboration on the first project of this thesis. My colleagues at the CMU graphics lab and the Robotics Institute have helped me in innumerable ways, technical and social, and made my graduate school experience very enjoyable. I would also like to thank Moshe Mahler for his help in modeling and rendering the images for this thesis and Justin Macey for his assistance in collecting and cleaning the motion capture data.

And finally, most of my gratitude goes to my dear wife, Zoie. I cannot thank her enough for being patient and caring, enduring all the hardship of being a graduate student's wife, and doing everything for me that I was supposed to do instead of working on my thesis.

TABLE OF CONTENTS

1. Introduction
   Contributions
   Organization
2. Vision-based Control of 3D Facial Animation
   Background
   Overview
   Video Analysis
      Facial Tracking
      Control Parameters
   Motion Capture Data Preprocessing
      Decoupling Pose and Expression
   Expression Control
      Control Parameter Normalization
      Data-driven Filtering
      Data-driven Expression Synthesis
   Expression Retargeting
   Results
   Discussion
3. Vision-based Control of 3D Full-body Motion
   Background
      Control Interfaces for Human Motion
      Animation with Motion Capture Data
   Overview
   Motion Performance

   Motion Analysis
      Subject Calibration
      Online Local Modeling
      Fast Online K-nearest Neighbor Search
      Online Motion Synthesis
   Numerical Comparison
      Dimensionality Reduction
      Online Motion Synthesis Using Local Models
   Results
   Discussion
4. Generating Statistically Valid Motion Using Constrained Optimization
   Background
      Constraint-based Trajectory Optimization
      Data-driven Motion Synthesis
   Overview
   Spatial-temporal Motion Analysis
   Constraint-based Motion Synthesis
      Constraints
      Objective Function
      Optimization
   Results
      Facial Expression
      Full-body Animation
      Other Experiments
   Discussion
5. Conclusions and Future Work

LIST OF FIGURES

1.1 Spatial-temporal correlations in human motions: (a) and (b) show two examples of spatial correlation in human motions. (a) Plot of the elbow angle vs. the shoulder angle for five different punching actions. (b) Plot of the Y-coordinate of a point located on the upper lip vs. the Y-coordinate of a point located on the lower lip for six different snoring expressions.
1.2 Algorithm overview.
2.1 Interactive Expression Control: a user can control 3D facial expressions of an avatar interactively. (Left) The users act out the motion in front of a single-view camera. (Middle) The controlled facial movement of the avatars without texture maps. (Right) The controlled facial movement of the avatars with texture-mapped models.
2.2 System overview diagram. At run time, the video images from a single-view camera are fed into the Video Analysis component, which simultaneously extracts two types of animation control parameters: expression control parameters and 3D pose control parameters. The Expression Control and Animation component uses the expression control parameters as well as a preprocessed motion capture database to synthesize the facial expression, which describes only the movement of the motion capture markers on the surface of the motion capture subject. The Expression Retargeting component uses the synthesized expression, together with the scanned surface model of the motion capture subject and the input avatar surface model, to produce the facial expression for the avatar. The avatar expression is then combined with the avatar pose, which is directly derived from the pose control parameters, to generate the final animation.
2.3 User-independent facial tracking: the red arrow denotes the position and orientation of the head and the green dots show the positions of the tracked points.

2.4 Fifteen high-level expression control parameters. (Top left) The features that measure the distance between two feature points. (Top right) The features that measure the distance between a point and a line. (Bottom left) The horizontal and vertical orientation of the mouth. (Bottom right) The center position of the mouth.
2.5 The scanned head surface model of the motion capture subject aligned with 76 motion capture markers.
2.6 Data-driven filtering diagram. At run time, we first use a nearest-neighbor search algorithm to find the K closest examples in the motion capture database. We then compute the principal components of the closest examples. We keep the B largest eigenvectors as the filter basis, where B is automatically determined by retaining 99% of the variation of the original data. Finally, we project the noisy segment into a local linear space spanned by the filter basis and reconstruct the control signal in the low-dimensional space.
2.7 Precomputation of the deformation basis. The system first builds a surface correspondence between the scanned source surface model and the input target surface model. Both surface models are in the neutral expression. Then the system adapts the deformation bases of the motion capture database S_0, S_1, ..., S_L to the target model T_0, T_1, ..., T_L based on the deformation relationship derived from the local surface correspondence diagram.
2.8 Dense surface correspondence. (Left) The scanned source surface model. (Middle) The animated surface model. (Right) The morphed model from the source surface to the target surface using the surface correspondence.
2.9 The top seven deformation bases for a target surface model. (a) The gray mask is the target surface model in the neutral expression. (b-h) The needles show the scale and direction of the 3D deformation vector on each vertex.
2.10 Online expression retargeting diagram. At run time, the system projects the synthesized expression into the deformation basis space of the source model S_1, ..., S_L to compute the combination weights. The deformation of the target surface is generated by blending together the deformation bases of the target surface T_1, ..., T_L using the combination weights.

2.11 Results of two users controlling the 3D facial expressions of two different target surface models.
2.12 Results of two users controlling the 3D facial expressions of two different texture-mapped avatar models.
3.1 Users wearing a few retro-reflective markers control the full-body motion of avatars by acting out the motion in front of two synchronized cameras. From left to right: walking, running, hopping, jumping, boxing, and Kendo (Japanese sword art).
3.2 System overview.
3.3 Marker detection and correspondence: a user acts out the motion in front of two synchronized video cameras. (a) and (b) The images from the left and right cameras respectively. (c) The detected marker positions in the left image. (d) The detected marker locations in the right image and the epipolar lines of the markers that were detected in the left image. For each marker in the left image, the matching marker in the right image should be located on its corresponding epipolar line.
3.4 A 2D example of the fast nearest neighbor search using two dimensions of the neighbor graph for the boxing database: (a) the data points in the database after we project them into the 2D eigen-space; (b) the magenta circle represents the previous pose and the magenta square represents the current pose. At run time, we use the neighbors of the previous frame (blue points) and a precomputed neighbor graph to find the possible neighbors of the current pose in the neighbor graph (red points). The algorithm then searches only the red and blue points to find the nearest neighbors of the current query point. (c) the green points are the nearest neighbors found using this algorithm.
3.5 Comparison of four dimensionality reduction methods: each curve shows the average reconstruction error with increasing number of dimensions. We could not compute the complete GPLVM error curves for the medium and large databases because of the computational cost.

3.6 Comparison of methods for synthesizing motions from low-dimensional continuous control signals. (a) Average errors for boxing motion: 7.67 degrees/joint per frame for nearest-neighbor synthesis (NN), 6.15 degrees/joint per frame for locally weighted regression (LWR), and 2.31 degrees/joint per frame for our method. (b) Average errors for walking motion: 4.46 degrees/joint per frame for NN, 3.32 degrees/joint per frame for LWR, and 1.30 degrees/joint per frame for our method. None of the testing sequences are in the database, and the boxing and walking motions are synthesized from the same set of markers used for the two-camera system (figure 3.1).
3.7 Performance animation from low-dimensional signals. (1) The input video and corresponding output animation. (2)-(5) Animations created by users using the two-camera system.
3.8 Comparison with ground truth data. (1) Ground truth motion capture data. (2) Synthesized motion from the same marker set as that used for the two-camera system.
4.1 Our system generates a wide variety of natural motions based on various kinds of user-defined constraints.
4.2 The average reconstruction error of the linear time-invariant system: (a) Average reconstruction error of facial motion data in terms of the order of the dynamic system (m) and the number of dimensions of the control input (d_u); (b) Average reconstruction error of human body motion data in terms of the order of the dynamic system (m) and the number of dimensions of the control input (d_u).
4.3 Typical user-defined constraints: (a) positions or orientations of any points on the face, or distance between any two points; (b) positions or orientations of any points on the body, distance between any two points, or joint angle values for any joints; (c) notations.
4.4 Keyframing constraints for creating facial animation: (a) the facial expression in the first frame; (b) the facial expression in the middle frame; (c) the facial expression in the last frame.
4.5 Sparse spatial-temporal constraints in screen-space for generating facial animation: (a) The user picks six points on the face; (b)-(d) their screen-space position constraints at three key frames.

4.6 Combination of key-trajectory constraints and keyframing constraints: (a) The user defines a distance between the left corner of the mouth and the right corner of the mouth; (b) the neutral expression in the first frame; (c) the neutral expression in the last frame; (d) the distance values throughout the motion.
4.7 Two typical constraints for generating full-body animation: (a) keyframing constraints for generating running animation; (b)-(c) key-trajectory constraints where the user selects six points on the character and then specifies their 3D trajectories across the motion.
4.8 Facial animation generated by various spatial-temporal constraints: (a) facial animation generated by the keyframing constraints shown in figure 4.4; (b) facial animation generated by the sparse 2D constraints shown in figure 4.5; (c) facial animation generated by key trajectories of eight green facial points.
4.9 Full-body animation generated by key-frame constraints: (a) baby walking; (b) careful walking; (c) Mickey Mouse walking.
4.10 Full-body animation generated by key-frame constraints: (a) climbing over an obstacle; (b) running.
4.11 Full-body animation generated by key-frame constraints: (a) motion transition from walking to jumping; (b) motion transition from walking to picking up an object; (c) motion transition from walking to sitting down.
4.12 Data generalization: (top) a short sequence of normal walking data used for training a statistical model; (bottom) walking on a slope generated by the statistical model learned from the walking data shown on top and user-defined constraints.

LIST OF TABLES

3.1 Performance of the nearest-neighbor search algorithm, where mean, minimum, and maximum are the mean number of nodes searched, the minimum number of nodes searched, and the maximum number of nodes searched. All three numbers are significantly smaller than the size of the database. The error (degrees per joint) is measured by computing the L2 distance between the pose synthesized from the examples found using exhaustive search and the pose synthesized from the examples found using our nearest-neighbor search.

12 1. INTRODUCTION A long-standing goal in computer graphics is to create an animation system that allows everyone to design motion for a lifelike human character quickly and easily. Such a system would allow a naive user to intuitively specify a small set of poses at key instants; the system then automatically generates realistic motion that best satisfies the user-defined constraints. A more skilled user could use the system to generate more detailed motion by specifying fine-grained constraints throughout the motion. This thesis addresses how to use a wide variety of spatial-temporal constraints for creating two kinds of interactive animation systems: performance animation and constraint-based motion generation. A performance animation system automatically transforms the performance of the user into motion for an animated character. Real-time motion control by performance would allow the control of characters in computer/video games, the control of avatars for virtual reality, electronically mediated communication or teleconferences, and the rapid prototyping of character animations. For example, in video games, players could puppeteer the expressions and body movement of their avatars in real time. In multi-user virtual worlds such as virtual poker rooms or online chat rooms, users could put on graphical masks that transmit their expressions. In teleconferencing, users could transmit their body movement and facial expressions via low-bandwidth animation signals rather than high-bandwidth video streams. In animation or TV studios, actors could perform the desired facial expressions or body movements to create realistic prototype animations for a movie character. Constraint-based motion generation is an appealing method for rapid prototyping of character animation for movies and video games because it allows the user to specify the desired motion in a sparse, intuitive way. For example, the user might specify a sparse set of key frames and foot contact information. The system automatically creates realistic animation that best satisfies the user-specified constraints. Such a system could significantly reduce the amount of time, skill, and artistry required for animation design. The thesis aims to create interactive animation systems that might someday be practical for

13 1. Introduction 2 use in the home. Creating a system for a home user is a particularly difficult challenge because it must be low cost and intuitive to use. The system should also not require a considerable amount of time, skill, and expertise to install. In this thesis, we address these issues while also addressing the main difficulties in constructing any animation control system: designing a rich set of believable human actions for the virtual character and giving the user interactive, precise, and intuitive control over these actions. Animating lifelike human movement is a complex endeavor. There are many degrees of freedom (DOFs) that must be animated. The human body may be represented with as many as 267 DOFs [45] and is commonly represented with at least sixty DOFs, and a realistic human face model contains hundreds of DOFs [74]. The degrees of freedom must often be considered simultaneously because individual degrees of freedom interact with each other. For example, changing the torso position may require modifying the arms to keep the hands resting on a table. Creating a smile requires the coordinated movement of most of the points on the face. Human movement also has a wide range of variations even for the same functional action; walking may be performed quite differently by two individuals. And finally, people are extremely good at judging whether an animated motion looks realistic or lifelike. A movement which accomplishes the intended task, for example, punching a specific point on the boxing bag, may still be judged as unacceptable if it looks jerky, uncoordinated, or otherwise not natural. One of the most successful and popular approaches for creating realistic human animation is to use motion capture data. Recent technological advances in motion capture equipment have made it possible to record 3D human motions with high fidelity, resolution, and consistency at interactive rates. A recent notable example of motion capture data is the movie The Lord of the Rings: the Return of the King where prerecorded body movement and facial expressions were used to animate the synthetic character Gollum. Motion capture data have also been successfully used to create many video games, particularly sports games such as football, hockey, and basketball. Although motion capture data are a reliable way to capture the detail and nuance of live motion, reuse and modification for a different purpose remains a challenging task. Providing the user with an intuitive and interactive interface for precisely controlling a broad range of human animations is difficult because human expressions or full-body movements are often high dimensional but user-defined constraints are usually not. For example, the control input from a low-cost performance animation often contains much lower-dimensional information than human motion. In a constraint-based motion generation system, the number of key frames

specified by the user is often much smaller than the number of total frames to be animated. Therefore, one common challenge for creating both animation systems is that the mapping from input space to motion configuration space is not one-to-one, and many solutions will be consistent with the control input.

This thesis proposes a novel approach for computing realistic human motion from underconstrained user input by constraining the generated motions to lie in the space of natural human motions. The key insight in our approach is that natural human motions are highly constrained and the movements of the DOFs are not independent (figure 1.1). Our approach utilizes the spatial-temporal correlation in human motion to transform input constraints into high-dimensional, lifelike human animations.

Fig. 1.1: Spatial-temporal correlations in human motions: (a) and (b) show two examples of spatial correlation in human motions. (a) Plot of the elbow angle vs. the shoulder angle for five different punching actions. (b) Plot of the Y-coordinate of a point located on the upper lip vs. the Y-coordinate of a point located on the lower lip for six different snoring expressions.

A block diagram of the algorithm is given in figure 1.2. To create an animation system for a specific application like boxing, we first use a Vicon optical motion capture system [100] to record a high-quality human motion database that contains variations of each basic boxing action, such as speed, style, and hitting direction, and also transitions between the actions. The system automatically learns statistical models from this motion capture data and then enforces those statistical models as a prior on the motion reconstruction. This process resolves the ambiguity of the solution, yielding realistic, natural-looking motions.
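The short sketch below makes the spatial-correlation observation of figure 1.1 and the idea of a motion prior concrete. It uses synthetic data and a plain PCA prior purely for illustration; the channel indices are hypothetical, and it is not the statistical models actually used in Chapters 2-4.

```python
# Illustration only: DOFs of natural motion are highly correlated, so a
# low-dimensional linear prior learned from a motion database can fill in
# DOFs that the input does not constrain.  Synthetic data, hypothetical
# channel indices.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "database": 500 frames x 60 DOFs driven by only 5 latent factors,
# mimicking the strong spatial correlation shown in figure 1.1.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 60))
database = latent @ mixing + 0.01 * rng.normal(size=(500, 60))

# Correlation between two hypothetical channels (e.g., shoulder vs. elbow).
shoulder, elbow = database[:, 10], database[:, 11]
print(f"correlation between the two channels: {np.corrcoef(shoulder, elbow)[0, 1]:+.2f}")

# Learn a low-dimensional linear prior (PCA) from the database.
mean = database.mean(axis=0)
_, s, vt = np.linalg.svd(database - mean, full_matrices=False)
k = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), 0.99) + 1   # keep 99% of the energy
basis = vt[:k]                                                  # k x 60

# An underconstrained, noisy observation: only 6 of the 60 DOFs are measured.
true_pose = database[123]
observed_idx = [0, 5, 10, 20, 40, 55]
obs = true_pose[observed_idx] + 0.05 * rng.normal(size=len(observed_idx))

# Solve for prior coefficients that explain the sparse observation, then
# reconstruct the full pose from the prior.
A = basis[:, observed_idx].T                    # 6 x k
w, *_ = np.linalg.lstsq(A, obs - mean[observed_idx], rcond=None)
reconstructed = mean + w @ basis

print("mean reconstruction error per DOF:", np.abs(reconstructed - true_pose).mean())
```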

Fig. 1.2: Algorithm overview.

To demonstrate the generality and power of our approach, we have designed three animation systems that allow the user to create complex human actions from intuitive spatial-temporal constraints.

In vision-based control of 3D facial expression (Chapter 2), the user controls the 3D facial expressions of an avatar interactively by acting out the desired expressions in front of a single video camera. At run time, the system uses a real-time algorithm to automatically extract a small set of high-level expression control parameters, such as the width of the mouth, from the video. The system then utilizes the spatial-temporal correlation embedded in a prerecorded motion capture database to efficiently transform the noisy and low-resolution control signals to high-quality motion. Degrees of freedom that were noisy and corrupted are filtered and then mapped to the motion capture data; missing degrees of freedom and details are synthesized using the information contained in the motion capture data. We demonstrate the power of this approach through two users who control and animate a wide range of 3D facial expressions of two different avatars in real time.

In vision-based control of complex full-body motion (Chapter 3), the user wears a small set of retro-reflective markers and performs in front of two synchronized video cameras. At run time, the system automatically learns a series of local pose models from a set of motion capture examples that are a close match to the marker locations captured by the cameras. These local models are then used to reconstruct the motion of the user as a full-body animation. We demonstrate the power of this approach with real-time control of six different behaviors using two video cameras and a small set of retro-reflective markers. A standard performance animation system for full-body control might require 8-16 expensive motion capture cameras and at least 40 markers.

Using the knowledge embedded in a motion capture database, we can reduce the cost of the system to two low-cost video cameras and 6-9 markers.

In constraint-based motion generation (Chapter 4), the user specifies a sparse set of spatial-temporal constraints throughout the motion, such as key frames. The system automatically learns a low-dimensional statistical dynamical model from motion capture data and then enforces this model as a spatial-temporal constraint in the motion generation process. The statistical dynamical model, together with an automatically derived objective function that measures the goodness of the motion and the user-defined constraints, comprises a constrained optimization problem. The optimization yields statistically valid motion that matches the constraints specified by the animator. We demonstrate the effectiveness of this approach by generating both face and full-body animation from a variety of user-defined constraints. Such a system can significantly reduce the talent and resources required to produce 3D character animation.

The common theme that underlies all three systems is the use of the knowledge embedded in human motion data to reduce the complexity of human motions to the number of dimensions that can be controlled by the user input. This approach is robust to variations in the kinematic model of the character, the sensor used for motion performance, and the application domain. The created animation may contain motions (body postures or facial expressions) that are not in the database because of spatial interpolations/extrapolations of the data; the timing of the motions is also not limited to that in the database because it is directly controlled by the user. However, the database must contain the basic actions required for the application domain.

1.1 Contributions

The contributions of this thesis are two-fold. First, we introduce a novel approach for generating realistic motions from a wide variety of spatial-temporal constraints by constraining the generated motion to lie in the space of natural motions. We demonstrate the power and flexibility of the approach by building three prototype systems: controlling 3D facial expressions interactively using a single video camera; controlling complex full-body motion using two synchronized video cameras and a small set of retro-reflective markers; and creating either full-body animation or facial expressions from a sparse set of spatial-temporal constraints specified by the user.

Second, this thesis contributes a set of algorithms to the field of computer graphics. Included among these are:

- Two general online animation algorithms that transform continuous low-dimensional control signals to high-dimensional human motions using the knowledge embedded in motion capture data. These two algorithms are complementary: the first creates motions from a small set of high-level control parameters, such as the width of the mouth, and the second synthesizes motions from low-level control parameters, such as positions or orientations of certain points on the human body.
- A general constraint-based motion generation algorithm that translates a small set of spatial-temporal constraints to high-dimensional human motion using priors learned from motion capture data.
- An online algorithm that reduces the number of dimensions of human poses at run time using the knowledge about natural human motion embedded in a motion capture database.
- A statistical low-dimensional linear dynamical model that efficiently represents complex, coordinated human motion.
- A nearest-neighbor search algorithm for a continuously moving point query whose computational complexity is constant and independent of the size of the database.
- A data-driven filtering algorithm that removes the noise in continuous animation signals using motion capture data.
- An expression retargeting method whose run-time computation is constant and independent of the complexity of the character model.

1.2 Organization

In the next chapter, we describe the details of controlling and animating 3D facial expressions using a single video camera. Chapter 3 presents our framework and results for interactive control of full-body motion using two synchronized video cameras with a very small number of reflective markers. We then describe our approach for motion generation using a sparse set of spatial-temporal constraints in Chapter 4. Chapter 5 concludes and discusses future work.

2. VISION-BASED CONTROL OF 3D FACIAL ANIMATION

A vision-based interface for facial animation should accurately track both 3D head motion and detailed deformations of the full face. If the system is to allow any user to control the actions of a character model, the system must also be independent of the facial dimensions, proportions, and geometry of the user. In the computer vision literature, extensive research has been done in the area of vision-based facial tracking. Many algorithms exist, but the performance of facial tracking systems, particularly user-independent expression tracking, is still not good enough for online animation applications. Consequently, we do not attempt to track all the details of the facial expression. Instead, our system robustly tracks a small set of distinctive facial features in real time and then translates these low-dimensional tracking data into detailed animation using the information contained in a motion capture database. We show that a rich set of lifelike facial actions can be created from a motion capture database and that the user can control these actions interactively by acting out the desired motions in front of a single video camera (figure 2.1).

In the process, we face several challenges. First, in order to control facial actions interactively via a vision-based interface, we need to extract meaningful animation control signals from the video sequence of a live performer in real time. Second, we must transform the low-dimensional control signals derived from our vision-based interface to high-dimensional expression data. This problem is particularly difficult because the mapping between the two spaces is not one-to-one, so a direct frame-by-frame mapping from control signal space to facial expression space will not work. Third, we want to animate any 3D character model by reusing and interpolating motion capture data, but motion capture data record only the motion of a finite number of facial markers on a source model. Thus we need to adapt the motion capture data of a source model to all the vertices of the character model to be animated. Finally, if the system is to allow any user to control any 3D face model, we must take into account differences in facial proportions and geometry.
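As a rough, self-contained illustration of the kind of lightweight 2D feature tracking such an interface depends on, the sketch below follows a handful of points from a webcam with OpenCV's pyramidal Lucas-Kanade tracker. It is only a stand-in: the tracker described in Section 2.3 is model-based (a cylindrical head model plus affine feature windows, geometric constraints, and re-registration), and the camera index and feature initialization below are placeholders.

```python
# Stand-in for lightweight 2D feature tracking.  The thesis's tracker is
# model-based; here we simply follow corner points with pyramidal
# Lucas-Kanade optical flow.  Webcam index and initialization are placeholders.
import cv2

cap = cv2.VideoCapture(0)                      # assumed webcam
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# In the thesis, 19 facial points are clicked by the user in the first frame;
# here we just pick strong corners as stand-in features.
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=19, qualityLevel=0.01,
                              minDistance=10)

lk_params = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

while True:
    ok, frame = cap.read()
    if not ok or pts is None or len(pts) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None, **lk_params)
    good = status.ravel() == 1
    for x, y in new_pts[good].reshape(-1, 2):
        cv2.circle(frame, (int(x), int(y)), 3, (0, 255, 0), -1)
    cv2.imshow("tracked features", frame)
    if cv2.waitKey(1) & 0xFF == 27:            # Esc to quit
        break
    prev_gray, pts = gray, new_pts[good].reshape(-1, 1, 2)

cap.release()
cv2.destroyAllWindows()
```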

Fig. 2.1: Interactive Expression Control: a user can control 3D facial expressions of an avatar interactively. (Left) The users act out the motion in front of a single-view camera. (Middle) The controlled facial movement of the avatars without texture maps. (Right) The controlled facial movement of the avatars with texture-mapped models.

2.1 Background

Facial animation has been driven by keyframe interpolation [71, 77, 59], direct parameterization [72, 73, 27], the user's performance [30, 103, 98, 32], pseudomuscle-based models [99, 22, 50], muscle-based simulation [97, 58], 2D facial data for speech [18, 17, 33], and full 3D motion capture data [43, 68]. Among these approaches, our work is most closely related to the performance-driven approach and to motion capture, and we therefore review the research related to these two approaches in greater detail. Parke and Waters offer an excellent survey of the entire field [74].

A number of researchers have described techniques for recovering facial motions directly from video. Williams tracked expressions of a live performer with special makeup and then mapped 2D tracking data onto the surface of a scanned 3D face model [103]. Facial trackers without special markers can be based on deformable patches [13], edge or feature detectors [98, 54], 3D face models [32, 78, 29], or data-driven models [41]. For example, Terzopoulos and Waters tracked the contour features on eyebrows and lips for automatic estimation of the face muscle contraction parameters from a video sequence, and these muscle parameters were then used to animate the physically based muscle structure of a synthetic character [98]. Essa and his colleagues tracked facial expressions using optical flow in an estimation and control framework

20 2. Vision-based Control of 3D Facial Animation 9 coupled with a physical model describing the skin and muscle structure of the face [32]. More recently, Gokturk and his colleagues applied Principal Component Analysis (PCA) on stereo tracking data to learn a deformable model and then incorporated the learned deformation model into an optical flow estimation framework to simultaneously track the head and a small set of facial features [41]. The direct use of tracking motion for animation requires that the face model of a live performer have similar proportions and geometry as those of the animated model. Recently, however, vision-based tracking has been combined with blendshape interpolation techniques [72, 59] to create facial animations for a new target model [20, 26, 34]. The target model could be a 2D drawing or any 3D model. Generally, this approach requires that an artist generate a set of key expressions for the target model. These key expressions are correlated to the different values of facial features in the labelled images. Vision-based tracking can then extract the values of the facial expressions in the video image, and the examples can be interpolated appropriately. Buck and his colleagues introduced a 2D hand-drawn animation system in which a small set of 2D facial parameters are tracked in the video images of a live performer and then used to blend a set of hand-drawn faces for various expressions [20]. The FaceStation system automatically located and tracked 2D facial expressions in real time and then animated a 3D virtual character by morphing among 16 predefined 3D morphs [34]. Chuang and Bregler explored a similar approach for animating a 3D face model by interpolating a set of 3D key facial shapes created by an artist [26]. The primary strength of such animation systems is the flexibility to animate a target face model that is different from the source model in the video. These systems, however, rely on the labor of skilled animators to produce the key poses for each animated model. The resolution and accuracy of the final animation remains highly sensitive to that of the visual control signal due to the direct frame-by-frame mappings adopted in these systems. Any jitter in the tracking system results in an unsatisfactory animation. A simple filtering technique like a Kalman filter might be used to reduce the noise and other artifacts, but it would also remove the details and high frequencies in the visual signals. Like vision-based animation, motion capture also uses measured human motion data to animate facial expressions. Motion capture data, however, have more subtle details than vision tracking data because an accurate hardware setup is used in a controlled capture environment. Guenter and his colleagues created an impressive system for capturing human facial expressions using special facial markers and multiple calibrated cameras and replaying them to make highly

21 2. Vision-based Control of 3D Facial Animation 10 realistic 3D facial animations [43]. Recently, Noh and Neumann presented an expression cloning technique to adapt existing motion capture data of a 3D source facial model to a new 3D target model [68]. A recent notable example of motion capture data is the movie The Lord of the Rings: the Return of the King where prerecorded movement and facial expressions were used to animate the synthetic character Gollum. The vision-based approach and motion capture both have advantages and disadvantages. The vision-based approach gives us an intuitive and inexpensive way to control a wide range of actions, but current results are disappointing for animation applications [40]. In contrast, motion capture data generate high-quality animation but are expensive to collect, and once they have been collected, may not be exactly what the animator needs, particularly for interactive applications in which the required motions cannot be precisely or completely predicted in advance. Our goal is to obtain the advantage of each method while avoiding the disadvantages. In particular, we use a vision-based interface to extract a small set of animation control parameters from a single-camera video and then use the embedded knowledge in the motion capture data to translate it into high-quality facial expressions. The result is an animation that does what an animator wants it to do, but has the same high quality as motion capture data. An alternative approach to performance-based animation is to use high degree-of-freedom (DOF) input devices to control and animate facial expressions directly. For example, DeGraf demonstrated a real-time facial animation system that used a special purpose interactive device called a waldo to achieve real-time control [30]. Faceworks is another high DOF device that gives an animator direct control of a character s facial expressions using multiple sliders to directly control facial expression parameters [35]. These systems allow trained puppeteers to create animation interactively but are not appropriate for occasional users of video games or teleconferencing systems because an untrained user cannot learn to simultaneously manipulate a large number of DOFs independently in a reasonable period of time. Computer vision researchers have explored another way to combine motion capture data and vision-based tracking [47, 75, 91]. They use motion capture data to build a prior model describing the movement of the human body and then incorporate this model into a probabilistic Bayesian framework to constrain the search. The main difference between our work and this approach is that our goal is to animate and control the movement of any character model rather than reconstruct and recognize the motion of a specific subject. What matters in our work is how the motion looks, how well the character responds to the user s input, and the latency. To this end,

we not only want to remove the jitter and wobbles from the vision-based interface but also to retain as much as possible the small details and high frequencies in the motion capture data. In contrast, tracking applications require that the prior model from motion capture data be able to predict real-world behaviors well so that it can provide a powerful cue for the reconstructed motion in the presence of occlusion and measurement noise. The details and high frequencies of the reconstructed motion are not the primary concern of these systems.

2.2 Overview

Our system transforms low-quality tracking motion into high-quality animation by using the knowledge of human facial motion that is embedded in motion capture data. The input to our system consists of a single video stream recording the user's facial movement, a preprocessed motion capture database, a 3D source surface scanned from the motion capture subject, and a 3D avatar surface model to be animated. By acting out a desired expression in front of a camera, any user can interactively control the facial expressions of any 3D character model. Our system is organized into four major components (figure 2.2):

- Video analysis. Simultaneously track the 3D position and orientation of the head and a small group of important facial features in video and then automatically translate them into two sets of high-level animation control parameters: expression control parameters and head pose parameters.
- Motion capture data preprocessing. Automatically separate head motion from facial deformations in the motion capture data and extract the expression control parameters from the decoupled motion capture data.
- Expression control and animation. Efficiently transform the noisy and low-resolution expression control signals to high-quality motion in the context of the motion capture database. Degrees of freedom that were noisy and corrupted are filtered and then mapped to the motion capture data; missing degrees of freedom and details are synthesized using the information contained in the motion capture data.
- Expression retargeting. Adapt the synthesized motion to animate all the vertices of a new character model.
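A schematic sketch of how these four components could be wired together at run time is shown below. All class and function names are hypothetical placeholders chosen for illustration, and the motion capture preprocessing component is absent from the per-frame loop because it runs off-line.

```python
# Schematic (hypothetical) run-time data flow among the components above.
from dataclasses import dataclass
import numpy as np

@dataclass
class ControlParams:
    expression: np.ndarray   # 15 expression control parameters
    head_pose: np.ndarray    # 6-DOF head position and orientation

def video_analysis(frame: np.ndarray) -> ControlParams:
    """Track the head and 19 facial features; return the control parameters."""
    raise NotImplementedError

def expression_control(params: ControlParams, mocap_db) -> np.ndarray:
    """Map noisy low-dimensional controls to marker motion via the database."""
    raise NotImplementedError

def expression_retargeting(marker_motion: np.ndarray, source_model, avatar) -> np.ndarray:
    """Transfer the synthesized marker motion to every vertex of the avatar."""
    raise NotImplementedError

def animate_frame(frame, mocap_db, source_model, avatar):
    params = video_analysis(frame)                    # online
    markers = expression_control(params, mocap_db)    # online, data-driven
    vertices = expression_retargeting(markers, source_model, avatar)
    return vertices, params.head_pose                 # the pose drives the avatar directly
```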

Fig. 2.2: System overview diagram. At run time, the video images from a single-view camera are fed into the Video Analysis component, which simultaneously extracts two types of animation control parameters: expression control parameters and 3D pose control parameters. The Expression Control and Animation component uses the expression control parameters as well as a preprocessed motion capture database to synthesize the facial expression, which describes only the movement of the motion capture markers on the surface of the motion capture subject. The Expression Retargeting component uses the synthesized expression, together with the scanned surface model of the motion capture subject and the input avatar surface model, to produce the facial expression for the avatar. The avatar expression is then combined with the avatar pose, which is directly derived from the pose control parameters, to generate the final animation.

The motion capture data preprocessing is done off-line; the other stages are completed online based on input from the user. We describe each of these components in more detail in the next four sections of this chapter.

2.3 Video Analysis

We now describe how to robustly extract a small set of high-level expression control parameters from a single video stream in real time. Section 2.3.1 describes how the system simultaneously tracks the pose and expression deformation from video. The pose tracker recovers the position and orientation of the head, while the expression tracker tracks the position of 19 distinctive facial features. Section 2.3.2 explains how to extract a set of high-level animation control parameters from the tracking data.

2.3.1 Facial Tracking

Our system tracks the six DOFs of head motion (yaw, pitch, roll, and 3D position). We use a generic cylinder model to approximate the head geometry of the user and then apply a model-based tracking technique to recover the head pose of the user in the monocular video stream [106, 21]. The expression tracking step involves tracking nineteen 2D features on the face: one for each mid-point of the upper and lower lip, one for each mouth corner, two for each eyebrow, four for each eye, and three for the nose, as shown in figure 2.3. We choose these facial features because they have high-contrast textures in the local feature window and because they record the movement of important facial areas.

Initialization: The facial tracking algorithm is based on a hand-initialized first frame. Pose tracking needs the initialization of the pose and the parameters of the cylinder model, including the radius, height, and center position, whereas expression tracking requires the initial positions of the 2D features. By default, the system starts with the neutral expression in the fronto-parallel pose. A user clicks on the 2D positions of the nineteen points in the first frame. After the features are identified in the first frame, the system automatically computes the cylindrical model parameters. After the initialization step, the system builds a texture-mapped reference head model for use in tracking by projecting the first image onto the surface of the initialized cylinder model.

Pose tracking: During tracking, the system dynamically updates the position and orientation of the reference head model. The texture of the model is updated by projecting the current

25 2. Vision-based Control of 3D Facial Animation 14 Fig. 2.3: User-independent facial tracking: the red arrow denotes the position and orientation of the head and the green dots show the positions of the tracked points. image onto the surface of the cylinder. When the next frame is captured, the new head pose is automatically computed by minimizing the sum of the squared intensity differences between the projected reference head model and the new frame. The dynamically updated reference head model can handle gradual lighting changes and self-occlusion, which allows us to recover the head motion even when some of the face is not visible. Because the reference head model is dynamically updated, tracking errors will accumulate over time. We use a re-registration technique to prevent this problem. At run time, the system automatically stores several texture-mapped head models in key poses, and re-registers the new video image with the closest example rather than with the reference head model when the difference between the current pose and its closest pose among the examples is smaller than a user-defined threshold. Some pixels in the processed images may disappear or become distorted or corrupted because of occlusion, non-rigid motion, and sensor noise. Those pixels should contribute less to motion estimation than other pixels. The system uses a robust technique, compensated iteratively reweighted least squares, to reduce the contribution of these noisy and corrupted pixels [12]. Expression tracking: After the head poses have been recovered, the system uses the poses and head model to warp the images into the fronto-parallel view. The positions of facial features

26 2. Vision-based Control of 3D Facial Animation 15 are then estimated in the warped current image, given their locations in the warped reference image. Warping removes the effects of the global pose in the 2D tracking features by associating the movements of features only with expression deformation. It also improves the accuracy and robustness of our feature tracking because the movement of the features becomes smaller after warping. To track the 2D position of a facial feature, we define a small square window centered at the feature s position. We make the simplifying assumption that the movement of pixels in the feature window can be approximated as the affine motion of a plane centered at the 3D coordinate of the feature. Affine deformations have six degrees of freedom and can be inferred using optical flow in the feature window. We employ a gradient-based motion estimation method to find the affine motion parameters thereby minimizing the sum of the squared intensity difference in the feature window between the current frame and the reference frame [10, 89, 8]. Even in warped images, tracking single features such as the eyelids based on intensity only is unreliable. To improve tracking robustness, we incorporate several prior geometric constraints into the feature tracking system. For example, we assign a small threshold to limit the horizontal displacement of features located on the eyelids because the eyelids deform almost vertically. The system uses a re-registration technique similar to the one used in pose tracking to handle error accumulation in expression tracking. In particular, the system stores a set of warped facial images and feature locations in example expressions at run time and re-registers the facial features of the new images with those in the closest example rather than with those in the reference image when the difference between current expression and its closest template falls below a small threshold. Our facial tracking system runs in real time at 20 fps. The system is user-independent and can track the facial movement of different human subjects. Figure 2.3 shows the results of our pose tracking and expression tracking for two performers Control Parameters To build a common interface between the motion capture data and the vision-based interface, we derive a small set of parameters from the tracked facial features as a robust and discriminative control signal for facial animation. We want the control parameters extracted from feature tracking to correspond in a meaningful and intuitive way with the expression movement qualities they control. In total, the system automatically extracts fifteen control parameters that describe the facial expression of the observed actor (figure 2.4):

Fig. 2.4: Fifteen high-level expression control parameters. (Top left) The features that measure the distance between two feature points. (Top right) The features that measure the distance between a point and a line. (Bottom left) The horizontal and vertical orientation of the mouth. (Bottom right) The center position of the mouth.

- Mouth (six): The system extracts six scalar quantities describing the movement of the mouth based on the positions of four tracking features around the mouth: left corner, right corner, upper lip, and lower lip. More precisely, these six control parameters are the distance between the lower and upper lips (one), the distance between the left and right corners of the mouth (one), the center of the mouth (two), the angle of the line segment connecting the lower lip and upper lip with respect to the vertical line (one), and the angle of the line segment connecting the left and right corners of the mouth with respect to the horizontal line (one).
- Nose (two): Based on three tracked features on the nose, we compute two control parameters describing the movement of the nose: the distance between the left and right corners of the nose and the distance between the top point of the nose and the line segment connecting the left and right corners (two).
- Eye (two): The control parameters describing the eye movements are the distance between the upper and lower eyelids of each eye (two).
- Eyebrow (five): The control parameters for the eyebrow actions consist of the angle of each eyebrow relative to a horizontal line (two), the distance between each eyebrow and eye (two), and the distance between the left and right eyebrows (one).
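A sketch of how a few of these quantities could be computed from the nineteen tracked points is shown below. The landmark indices are hypothetical placeholders and only a representative subset of the fifteen parameters is computed; the full set is exactly the one enumerated above.

```python
# Sketch: computing a representative subset of the expression control
# parameters from the 19 tracked 2D features (in a pose-normalized view).
# Landmark indices are hypothetical placeholders.
import numpy as np

def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    ab = b - a
    cross = ab[0] * (p - a)[1] - ab[1] * (p - a)[0]
    return abs(cross) / np.linalg.norm(ab)

def control_parameters(pts):
    """pts: (19, 2) array of tracked feature positions."""
    upper_lip, lower_lip = pts[0], pts[1]                  # assumed indices
    mouth_left, mouth_right = pts[2], pts[3]
    nose_left, nose_right, nose_top = pts[4], pts[5], pts[6]
    left_eye_top, left_eye_bot = pts[7], pts[8]
    right_eye_top, right_eye_bot = pts[9], pts[10]

    lip_gap = np.linalg.norm(upper_lip - lower_lip)        # mouth open/close
    mouth_width = np.linalg.norm(mouth_left - mouth_right) # mouth width
    mouth_center = (mouth_left + mouth_right) / 2.0        # two parameters
    v = upper_lip - lower_lip                              # mouth orientation
    h = mouth_right - mouth_left
    vert_angle = np.degrees(np.arctan2(v[0], v[1]))        # vs. the vertical axis
    horiz_angle = np.degrees(np.arctan2(h[1], h[0]))       # vs. the horizontal axis

    nose_width = np.linalg.norm(nose_left - nose_right)
    nose_height = point_line_distance(nose_top, nose_left, nose_right)

    left_eye_open = np.linalg.norm(left_eye_top - left_eye_bot)
    right_eye_open = np.linalg.norm(right_eye_top - right_eye_bot)

    return np.array([lip_gap, mouth_width, *mouth_center, vert_angle,
                     horiz_angle, nose_width, nose_height,
                     left_eye_open, right_eye_open])

# Toy usage with random stand-in coordinates.
print(control_parameters(np.random.default_rng(0).uniform(size=(19, 2))))
```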

The expression control signal describes the evolution of the fifteen control parameters derived from the tracking data. We use the notation $\tilde{C}_i \equiv \{\tilde{c}_{a,i} \mid a = 1, \ldots, 15\}$ to denote the control signal at time $i$. Here $a$ is the index of the individual control parameter. These fifteen parameters are used to control the expression deformation of the character model. In addition, the head pose derived from video gives us six additional animation control parameters, which are used to drive the avatar's position and orientation.

2.4 Motion Capture Data Preprocessing

We used a Minolta Vivid 700 laser scanner to build the surface model of a motion capture subject [49]. We set up a Vicon motion capture system to record facial movement by attaching 76 reflective markers onto the face of the motion capture subject. The scanned surface model with the registered motion capture markers is shown in figure 2.5. The markers were arranged so that their movement captures the subtle nuances of the facial expressions. During capture sessions, the subject must be allowed to move his head freely because head motion is involved in almost all natural expressions. As a result, the head motion and the facial expressions are coupled in the motion capture data, and we need to separate the head motion from the facial expression accurately in order to reuse and modify the motion capture data. Section 2.4.1 describes a factorization algorithm that separates the head motion from the facial expression in the motion capture data by utilizing the inherent rank constraints in the motion capture data.

2.4.1 Decoupling Pose and Expression

Assuming the facial expression deforms with L independent modes of variation, its shape can be represented as a linear combination of a deformation basis set $S_1, S_2, \ldots, S_L$. Each deformation basis $S_i$ is a $3 \times P$ matrix describing the deformation mode of $P$ points. The recorded facial motion capture data, denoted as $X_f$, combine the effects of 3D head pose and local expression deformation:

$$X_f = R_f \left( \sum_{i=1}^{L} \lambda_{fi} S_i \right) + T_f \qquad (2.1)$$

where $R_f$ is a $3 \times 3$ head rotation matrix and $T_f$ is a $3 \times 1$ head translation in frame $f$. The weight corresponding to the $i$th deformation basis $S_i$ is $\lambda_{fi}$.
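Equation (2.1) is easy to state in code. The snippet below simply evaluates the model for a single frame with random placeholder values; the shapes follow the text (each deformation basis is a 3 × P matrix).

```python
# Direct evaluation of the deformation model in equation (2.1) for one frame.
# Random values stand in for real data; shapes follow the text.
import numpy as np

rng = np.random.default_rng(1)
L, P = 25, 76                                  # number of bases and markers
S = rng.normal(size=(L, 3, P))                 # deformation bases S_1 .. S_L
lam = rng.normal(size=L)                       # weights lambda_{f,i} for frame f
R = np.linalg.qr(rng.normal(size=(3, 3)))[0]   # arbitrary orthonormal stand-in for R_f
T = rng.normal(size=(3, 1))                    # head translation T_f

# X_f = R_f * (sum_i lambda_{f,i} S_i) + T_f
X = R @ np.tensordot(lam, S, axes=1) + T       # 3 x P marker positions
print(X.shape)                                 # (3, 76)
```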

Fig. 2.5: The scanned head surface model of the motion capture subject aligned with 76 motion capture markers.

Our goal is to separate the head pose $R_f$ and $T_f$ from the motion capture data $X_f$ so that the motion capture data record only expression deformations. We eliminate $T_f$ from $X_f$ by subtracting the mean of all 3D points. We then represent the resulting motion capture data in matrix notation:

$$\underbrace{\begin{bmatrix} X_1 \\ \vdots \\ X_F \end{bmatrix}}_{M} = \underbrace{\begin{bmatrix} \lambda_{11} R_1 & \cdots & \lambda_{1L} R_1 \\ \vdots & \ddots & \vdots \\ \lambda_{F1} R_F & \cdots & \lambda_{FL} R_F \end{bmatrix}}_{Q} \cdot \underbrace{\begin{bmatrix} S_1 \\ \vdots \\ S_L \end{bmatrix}}_{B} \qquad (2.2)$$

where $F$ is the number of frames of motion capture data, $M$ is a $3F \times P$ matrix storing the 3D coordinates of all the motion capture marker locations, $Q$ is a $3F \times 3L$ scaled rotation matrix recording the head orientations and the weight for each deformation basis in every frame, and $B$ is a $3L \times P$ matrix containing all the deformation bases. Equation (2.2) shows that without noise, the rank of the data matrix $M$ is at most $3L$. Therefore, we can automatically determine the number of deformation bases by computing the rank of matrix $M$.

When noise corrupts the motion capture data, the data matrix $M$ will not be exactly of rank $3L$. However, we can perform singular value decomposition (SVD) on the data matrix $M$ such that $M = U S V^T$, and then get the best possible rank-$3L$ approximation of the data matrix,

factoring it into two matrices:

$$\tilde{Q} = U S^{\frac{1}{2}} \qquad (2.3)$$

$$\tilde{B} = S^{\frac{1}{2}} V^T \qquad (2.4)$$

The rank of the data matrix $M$, $3L$, is automatically determined by keeping a specific amount of the original data energy. In our experiment, we found that 25 deformation bases were sufficient to capture 99.6% of the deformation variations in the motion capture database.

The decomposition of the matrix $M$ (equations 2.3 and 2.4) is determined only up to a linear transformation. Any non-singular $3L \times 3L$ matrix $G$ and its inverse could be inserted between $\tilde{Q}$ and $\tilde{B}$ without changing the result. Thus the actual scaled rotation matrix $Q$ and basis matrix $B$ are given by

$$Q = \tilde{Q} G \qquad (2.5)$$

$$B = G^{-1} \tilde{B} \qquad (2.6)$$

To recover the appropriate linear transformation matrix $G$, we introduce two different sets of linear constraints on the matrix $GG^T$: rotation constraints and basis constraints (for details, please refer to [105]). Rotation constraints utilize the orthogonality of the rotation matrices to impose linear constraints on the matrix $GG^T$. Given 3D motion capture data, the deformation bases representing the facial expression are not unique; any non-singular transformation of the deformation bases is still valid to describe the facial expressions. To remove the ambiguity, our algorithm automatically finds L appropriate frames in the motion capture data that cover all the deformation variations in the data. We choose these L frames as the deformation basis. The specific form of the deformation basis provides another set of linear constraints on the matrix $GG^T$. These two sets of constraints allow us to uniquely recover the matrix $GG^T$ via standard least-squares techniques. We use the SVD technique again to factor the matrix $GG^T$ into the matrix $G$ and its transpose $G^T$, a step that leads to the actual rotations $R_f$, configuration coefficients $\lambda_{fi}$, and deformation bases $S_1, \ldots, S_L$.
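The rank-3L factorization of equations (2.2)-(2.4) amounts to a few lines of linear algebra. The sketch below runs on synthetic data of the right shape; recovering the corrective transformation G of equations (2.5) and (2.6) from the rotation and basis constraints is not shown.

```python
# Sketch of equations (2.2)-(2.4): stack the mean-subtracted marker data into
# M (3F x P), truncate its SVD to rank 3L, and split it into a scaled-rotation
# factor and a basis factor.  Synthetic data stand in for real motion capture;
# the corrective transformation G (equations 2.5-2.6) is not recovered here.
import numpy as np

rng = np.random.default_rng(2)
F, P, L = 200, 76, 5                       # frames, markers, deformation modes

# Build synthetic data that actually has rank 3L, mimicking equation (2.2).
Q_true = rng.normal(size=(3 * F, 3 * L))   # scaled rotations, stacked per frame
B_true = rng.normal(size=(3 * L, P))       # stacked deformation bases
M = Q_true @ B_true + 1e-3 * rng.normal(size=(3 * F, P))   # small noise

# Truncated SVD (equations 2.3 and 2.4).
U, s, Vt = np.linalg.svd(M, full_matrices=False)
r = 3 * L
Q_hat = U[:, :r] * np.sqrt(s[:r])          # \tilde{Q} = U S^{1/2}
B_hat = np.sqrt(s[:r])[:, None] * Vt[:r]   # \tilde{B} = S^{1/2} V^T

print("rank-3L reconstruction error:", np.abs(Q_hat @ B_hat - M).max())
```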

Let X_i = {x_{b,i} | b = 1, ..., 76} be the 3D positions of the motion capture markers in frame i and C_i = {c_{a,i} | a = 1, ..., 15} be the control parameters derived from frame i. Here x_{b,i} is the 3D coordinate of the bth motion capture marker in frame i. In this way, each motion capture frame X_i is automatically associated with animation control parameters C_i.

2.5 Expression Control

Given the control parameters derived from a vision-based interface, controlling the head motion of a virtual character is straightforward. The system directly maps the orientation of the performer to the virtual character. The position parameters derived from video need to be appropriately scaled before they are used to control the position of an avatar. This scale is computed as the ratio of the mouth width between the user and the avatar.

Controlling the expression deformation requires integrating the information in the expression control parameters and the motion capture data. In this section, we present a novel data-driven approach for motion synthesis that translates the noisy, lower-resolution expression control signals into high-resolution motion data using the information contained in the motion capture database. The system first finds the K closest examples in the motion capture database using the low-resolution and noisy query control parameters from the vision-based interface and then linearly interpolates the corresponding high-quality motion examples in the database with a local regression technique. Because the mapping from the control parameter space to the motion data space is not one to one, the query control signal is based on multiple frames rather than a single frame, thereby reducing the mapping ambiguity by integrating the query evidence forward and backward over a window of a short fixed length.

The motion synthesis process begins with a normalization step that corrects for the difference between the animation control parameters in the tracking data and the motion capture data (Section 2.5.1). We then describe a data-driven approach to filter the noisy expression control signals derived from the vision-based interface (Section 2.5.2). Next, we introduce a data-driven expression synthesis approach that transforms the filtered control signals into high-quality motion data (Section 2.5.3). Finally, we describe a new data structure and an efficient K-nearest-points search technique that we use to speed up the synthesis process (Section 2.5.4).

2.5.1 Control Parameter Normalization

The expression control parameters from the tracking data are inconsistent with those in the motion capture data because the user and the motion capture subject have different facial geometry and proportions. We use a normalization step that automatically scales the measured control parameters and the database control parameters according to the control parameters of the neutral expression and one extreme expression, specifically to remove these differences. By scaling both sets of control parameters, we ensure that the control parameters extracted from the user have approximately the same magnitude as those extracted from the motion capture data when both are in the same expression. The normalization functions are

norm(\tilde{c}) = \frac{\tilde{c} - \tilde{c}_0}{\tilde{c}_e - \tilde{c}_0}    (2.7)

norm(c) = \frac{c - c_0}{c_e - c_0}    (2.8)

where c_0 and c_e are the expression values corresponding to the neutral expression and extreme expression of the motion capture subject, and \tilde{c}_0 and \tilde{c}_e are the expression values corresponding to the neutral expression and extreme expression of the user. The normalization transforms the value of the control parameters in the neutral expression to zero and the value of the control parameters in the extreme expression to one. In this way, all the expression features have approximately the same scale.
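A short sketch of this normalization is given below, under the assumption that each control parameter is scaled independently and that neutral and extreme reference values are available for both the user and the motion capture subject; the function and variable names are illustrative, not from the thesis.

    import numpy as np

    def normalize_controls(c, c_neutral, c_extreme):
        """Equations (2.7)-(2.8): map each control parameter so that the
        neutral expression becomes 0 and the extreme expression becomes 1."""
        c, c_neutral, c_extreme = (np.asarray(v, dtype=float)
                                   for v in (c, c_neutral, c_extreme))
        return (c - c_neutral) / (c_extreme - c_neutral)

    # The same function is applied to the user's measured parameters and to the
    # database parameters, each with its own neutral/extreme reference values:
    # user_norm = normalize_controls(c_user, c_user_neutral, c_user_extreme)
    # db_norm   = normalize_controls(c_db,   c_db_neutral,   c_db_extreme)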

2.5.2 Data-driven Filtering

Control signals in a vision-based interface are often noisy. Before we translate them into motion data, they must be filtered in a way that is suitable for mapping. This section describes a data-driven filtering technique that uses the knowledge in the motion capture data to filter the control signals (figure 2.6). We divide the continuous control signals into segments of a fixed, short temporal length L and then use the knowledge embedded in the motion capture database to sequentially filter the segments by learning a local linear dynamical model at run time. For each frame in the motion capture database, we obtain one motion segment starting from that frame and treat it as a data point. Conceptually, all the motion segments form a nonlinear manifold embedded in a high-dimensional configuration space. Each motion segment in the motion capture database is a point sample on the manifold. The motion segments from the vision-based interface are noisy samples of this manifold. The key idea of our filtering technique is that we can use a low-dimensional linear subspace to approximate the local region of the high-dimensional nonlinear manifold and then filter a noisy sample by projecting it into its local low-dimensional subspace.

Fig. 2.6: Data-driven filtering diagram. At run time, we first use a nearest-neighbor search algorithm to find the K closest examples in the motion capture database. We then compute the principal components of the closest examples and keep the B largest eigenvectors as the filter basis, where B is automatically determined by retaining 99% of the variation of the original data. Finally, we project the noisy segment into the local linear space spanned by the filter basis and reconstruct the control signal in the low-dimensional space.

For each noisy segment, the motion capture database is searched for the examples that are closest to the segment. These examples are then used to build a low-dimensional subspace in the local region via Principal Component Analysis (PCA). The principal components, which represent the major sources of dynamical variation in the local region, naturally capture the constraints on the dynamical behavior in the local region. We can use this low-dimensional linear space to reduce the noise and error in the sensed control signal.

Let \tilde{\phi}_t = [\tilde{C}_{t_1}, ..., \tilde{C}_{t_L}] be a segment of input control signals and \phi_i = [Z_{i_1}, ..., Z_{i_L}] be the ith segment in the database. At run time, we use the following query metric to find the K closest

examples, \phi_t^1, ..., \phi_t^K, for the current segment \tilde{\phi}_t:

D(\tilde{\phi}_t, \phi_i) = \|\tilde{\phi}_t - \phi_i\| = \sqrt{ \sum_{a=1}^{15} \sum_{l=1}^{L} \big( \tilde{c}_{a,t_l} - z_{a,i_l} \big)^2 }    (2.9)

where \|\cdot\| denotes the Euclidean norm. After subtracting the mean of the K closest examples, \bar{\phi}_t = \frac{1}{K} \sum_{k=1}^{K} \phi_t^k, we apply PCA to the covariance matrix of these K examples and obtain a low-dimensional representation for the segments located in the local region of the current segment:

W = U_t^T (\phi_t - \bar{\phi}_t)    (2.10)

where U_t is constructed from the eigenvectors u_i, i = 1, ..., B, corresponding to the B largest eigenvalues of the covariance matrix of the data set. The dimension of the new space W is much lower than the dimension of the original space \phi. The eigenvectors form the basis of our data-driven filter. We can filter the noisy segment with the following function:

\hat{\phi}_t = filter(\tilde{\phi}_t) = \bar{\phi}_t + U_t U_t^T (\tilde{\phi}_t - \bar{\phi}_t)    (2.11)

where \hat{\phi}_t is the filtered control segment. Both the local basis matrix U_t and the mean segment \bar{\phi}_t are learned at run time.
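The following sketch puts the pieces of the filter together: a brute-force nearest-neighbor search under the metric of equation (2.9), a local PCA that keeps 99% of the variance, and the projection of equation (2.11). It ignores the accelerated search of Section 2.5.4, assumes segments are stored as flattened (L x 15) arrays, and uses illustrative names throughout.

    import numpy as np

    def filter_segment(noisy_segment, database_segments, k=100, energy=0.99):
        """Data-driven filter (equations 2.9-2.11): project a noisy control
        segment onto a local PCA subspace learned from its K nearest
        database segments."""
        x = noisy_segment.ravel()
        data = database_segments.reshape(len(database_segments), -1)

        # K nearest examples under the Euclidean metric of equation (2.9)
        dists = np.linalg.norm(data - x, axis=1)
        neighbors = data[np.argsort(dists)[:k]]

        # local PCA: keep enough eigenvectors to retain `energy` of the variance
        mean = neighbors.mean(axis=0)
        centered = neighbors - mean
        _, s, Vt = np.linalg.svd(centered, full_matrices=False)
        var = s**2
        b = int(np.searchsorted(np.cumsum(var) / var.sum(), energy)) + 1
        U = Vt[:b].T                                  # filter basis (equation 2.10)

        # reconstruct in the low-dimensional subspace (equation 2.11)
        filtered = mean + U @ (U.T @ (x - mean))
        return filtered.reshape(noisy_segment.shape)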

The performance of the data-driven filtering algorithm depends on the dimension of the local linear space (B), the number of closest examples (K), and the number of frames in the segment (L). We now discuss how to select appropriate values for these three parameters:

The number of dimensions B is determined by the amount of energy (variation) we want to retain from the original data: the more variation we keep, the larger the number of dimensions. If we keep a large amount of the variation in the data, the noise in the signal might not be removed. If the number of dimensions is too small, the signal might be oversmoothed and lose its details and high-frequency components. We automatically determine the appropriate value of B by retaining 99% of the variation of the original data. The number of dimensions is about 5 to 10 in the experiments reported in Section 2.7.

As the number of closest examples K increases, the dimension of the subspace becomes higher so that it is capable of representing a larger range of local dynamical behavior. A higher-dimensional subspace, however, provides fewer constraints on the sensed motion, and the subspace constraints might then be insufficient to remove the noise in the data. If fewer examples are used, they might not be similar enough to the user's motion and the subspace might not fit the motion well, resulting in filtered motion that is oversmoothed and distorted. The specific choice of K depends on the properties of a given motion capture database and the control data extracted from video. In our experiments, we found that K between 50 and 150 gave good results.

If we choose too many frames for the motion segments, the closest examples from the motion capture data might not be dense enough to learn an accurate local linear model in the high-dimensional space. If we choose too few frames, the model will not capture the regularities in the motion. The length of the motion segments also determines the response time of the system, and thus the action delay between the user and the avatar. We found 20 frames to be a reasonable segment length in our experiments. The delay of the system is then 0.33 s because the frame rate of the video camera is 60 fps.

2.5.3 Data-driven Expression Synthesis

Suppose we have N frames of motion capture data X_1, ..., X_N and the associated control parameters C_1, ..., C_N. Our expression control problem can then be stated as follows: given a segment of control signals \hat{\phi}_t = [\hat{C}_{t_1}, ..., \hat{C}_{t_L}], synthesize the corresponding motion \hat{M}_t = [\hat{X}_{t_1}, ..., \hat{X}_{t_L}].

A simple solution to this problem would be K-nearest-neighbor interpolation. For each frame \hat{C}_i, i = t_1, ..., t_L, the system could choose the K closest example points of the control parameters, denoted as C_i^1, ..., C_i^K, in the motion capture data and assign each of them a weight based on its distance to the query \hat{C}_i. These weights could then be used to synthesize new motion by interpolating the corresponding motion capture data X_i^1, ..., X_i^K. The downside of this approach is that the generated motion would not be smooth because the mapping from the control parameter space to the motion configuration space is not one to one.

We utilize the spatial-temporal correlations in human motion to remove the mapping ambiguity from the control parameter space to the motion configuration space. Rather than synthesizing the motion frame by frame, we transform the whole segment of control data into new motion. Synthesizing the motion in this way allows us to handle the mapping ambiguity by integrating expression control parameters forwards and backwards over the whole segment. Similar to the data-driven filtering step, the system automatically breaks the video stream into segments at run time, and then sequentially transforms the visual control signals into high-quality

motion data segment by segment. Given the current segment \hat{\phi}_t, we compute the interpolation weights based on its distance to the K closest segments \phi_t^k, which were found in the data-driven filtering step. The distance is measured by the same Euclidean distance function as in equation (2.9). We can then synthesize the segment of motion via a linear combination of the motion examples, M_t^k, that are associated with the K closest segments \phi_t^k:

\hat{M}_t = \frac{ \sum_{k=1}^{K} Ker\big(d(\phi_t^k, \hat{\phi}_t)\big) \, M_t^k }{ \sum_{k=1}^{K} Ker\big(d(\phi_t^k, \hat{\phi}_t)\big) }    (2.12)

where Ker(\cdot) is a smooth weight function for local motion interpolation, and we use a Gaussian kernel [5]:

Ker(d) = \exp(-d^2)    (2.13)

Both the dynamical filtering and the motion mapping procedure work on a short temporal window of 20 frames. Overlap between neighboring segments prevents discontinuities between segments. Two synthesized segments are blended by fading out one while fading in the other using a sigmoid-like function, \alpha = 0.5 \cos(\beta \pi) + 0.5. Over the transition duration, \beta moves linearly from 0 to 1. The motion during the transition is determined by linearly interpolating the two segments with weights \alpha and 1 - \alpha. In our experiments, the blend interval is set to five frames.
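A sketch of the two run-time steps just described follows: the Gaussian-kernel blend of equation (2.12) and the cosine cross-fade between overlapping segments. The distance rescaling inside the kernel is a numerical safeguard of ours, not something specified in the text, and all names are illustrative.

    import numpy as np

    def synthesize_segment(query, neighbor_segments, neighbor_motions):
        """Equations (2.12)-(2.13): blend the motion examples attached to the
        K closest control segments with Gaussian-kernel weights."""
        d = np.linalg.norm(neighbor_segments.reshape(len(neighbor_segments), -1)
                           - query.ravel(), axis=1)
        d = d / (d.max() + 1e-8)       # rescale to avoid underflow (our choice)
        w = np.exp(-d**2)
        w /= w.sum()
        # neighbor_motions: K x L x dof high-resolution motion segments
        return np.tensordot(w, neighbor_motions, axes=1)

    def blend_overlap(prev_tail, next_head):
        """Cross-fade two overlapping synthesized segments with
        alpha = 0.5*cos(beta*pi) + 0.5, beta moving linearly from 0 to 1."""
        beta = np.linspace(0.0, 1.0, len(prev_tail))
        alpha = 0.5 * np.cos(beta * np.pi) + 0.5
        return alpha[:, None] * prev_tail + (1.0 - alpha[:, None]) * next_head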

2.6 Expression Retargeting

The synthesized motion specifies the movement of a finite set of points on the source surface; however, animating a 3D character model requires moving all the vertices of the animated surface. Noh and Neumann introduced an expression cloning technique to map the expression of a source model to the surface of a target model [68]. Their method modifies the magnitudes and directions of the source motion vectors using the local shapes of the two models. It produces good results, but the run-time cost depends on the complexity of the animated model because the motion vector of each vertex on the target surface is adapted individually at run time. In this section, we present an efficient expression retargeting method whose run-time computation is constant, independent of the complexity of the character model. Our method precomputes all deformation bases of the target model so that the run-time operation only involves blending together these deformation bases.

Fig. 2.7: Precomputation of deformation bases. The system first builds a surface correspondence between the scanned source surface model and the input target surface model. Both surface models are in the neutral expression. Then the system adapts the deformation bases of the motion capture database S_0, S_1, ..., S_L to the target model T_0, T_1, ..., T_L based on the deformation relationship derived from the local surface correspondence.

As input to this process, we take the scanned source surface model, the input target surface model, and the deformation bases of the motion capture database S_0, S_1, ..., S_L. Both models are in the neutral expression. The system first builds a surface correspondence between the two models and then adapts the deformation bases of the source model to the target model T_0, T_1, ..., T_L based on the deformation relationship derived from the local surface correspondence (see figure 2.7). At run time, the system operates on the synthesized motion to generate the weight for each deformation basis. The output animation is created by blending together the target deformation bases using these weights. In particular, the expression retargeting process consists of four stages: motion vector interpolation, dense surface correspondences, motion vector transfer, and target motion synthesis.

Fig. 2.8: Dense surface correspondence. (Left) The scanned source surface model. (Middle) The animated surface model. (Right) The model morphed from the source surface to the target surface using the surface correspondence.

Motion Vector Interpolation: Given the deformation vector of the key points on the source surface for each deformation mode, the system deforms the remaining vertices on the source surface by linearly interpolating the movement of the key points using barycentric coordinates. The system generates a mesh model based on the 3D positions of the motion capture markers on the source model. For each vertex, the system determines the face on which the vertex is located and the corresponding barycentric coordinates for interpolation. The deformation vectors of the remaining vertices are then interpolated accordingly.

Dense Surface Correspondences: Beginning with a small set of correspondences that have been manually established between the two surfaces, a dense surface correspondence is computed by volume morphing with a radial basis function followed by a cylindrical projection [67] (figure 2.8). This operation gives us a homeomorphic (one-to-one and onto) mapping between the two surfaces. We then use radial basis functions (RBFs) once again to learn a continuous homeomorphic mapping function f(x_s) = (f_1(x_s), f_2(x_s), f_3(x_s)) from the neutral expression of the source surface, x_s = (x_s, y_s, z_s), to the neutral expression of the target surface, x_t = (x_t, y_t, z_t), such that x_t = f(x_s).
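For this step, a continuous mapping from source-surface points to target-surface points can be fit with off-the-shelf radial basis function interpolation. The sketch below uses SciPy's RBFInterpolator as a stand-in for the RBF fit described in the text (the kernel choice and all names are ours), starting from the matched vertex pairs produced by the volume morph.

    import numpy as np
    from scipy.interpolate import RBFInterpolator

    def fit_surface_mapping(source_points, target_points):
        """Learn a continuous mapping f with x_t = f(x_s) from corresponding
        source/target surface points (each a P x 3 array)."""
        rbf = RBFInterpolator(source_points, target_points,
                              kernel='thin_plate_spline')

        def f(x_s):
            # accept a single 3-vector or an array of points
            return rbf(np.atleast_2d(x_s)).squeeze()

        return f

    # example: map a neutral source vertex onto the target surface
    # f = fit_surface_mapping(source_vertices, target_vertices)
    # x_t = f(source_vertices[0])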

Motion Vector Transfer: The deformation on the source surface cannot simply be transferred to a target model without adjusting the direction and scale of each motion vector, because facial proportions and geometry vary between models. For any point x_s on the source surface, we can compute the corresponding point x_t on the target surface using the mapping function f(x_s). Given a small deformation \delta x_s = (\delta x_s, \delta y_s, \delta z_s) of a source point x_s, the deformation \delta x_t of its corresponding target point x_t is computed with the Jacobian matrix Jf(x_s) = (Jf_1(x_s), Jf_2(x_s), Jf_3(x_s))^T:

\delta x_t = Jf(x_s) \cdot \delta x_s    (2.14)

We use the learned RBF function f(x_s) to compute the Jacobian matrix at x_s numerically:

Jf_i(x_s) = \left( \frac{f_i(x_s + \delta x_s, y_s, z_s) - f_i(x_s, y_s, z_s)}{\delta x_s},\ \frac{f_i(x_s, y_s + \delta y_s, z_s) - f_i(x_s, y_s, z_s)}{\delta y_s},\ \frac{f_i(x_s, y_s, z_s + \delta z_s) - f_i(x_s, y_s, z_s)}{\delta z_s} \right), \quad i = 1, 2, 3    (2.15)

Geometrically, the Jacobian matrix adjusts the direction and magnitude of the source motion vector according to the local surface correspondence between the two models. Because the deformation of the source motion is represented as a linear combination of a set of small deformation bases, the deformation \delta x_t can be computed as

\delta x_t = Jf(x_s) \cdot \sum_{i=1}^{L} \lambda_i \, \delta x_{s,i} = \sum_{i=1}^{L} \lambda_i \, \big( Jf(x_s) \cdot \delta x_{s,i} \big)    (2.16)

where \delta x_{s,i} represents the small deformation corresponding to the ith deformation basis, and \lambda_i is the combination weight. Equation (2.16) shows that we can precompute the deformation bases of the target surface, \delta x_{t,i}, such that

\delta x_{t,i} = Jf(x_s) \cdot \delta x_{s,i}    (2.17)

Figure 2.9 shows the seven largest deformation bases on a target surface model.
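The motion-vector transfer of equations (2.15) and (2.17) can then be precomputed per vertex with finite differences of the learned mapping. This is a sketch under the assumption that f accepts and returns 3-vectors; the step size and names are illustrative.

    import numpy as np

    def numerical_jacobian(f, x_s, eps=1e-3):
        """Finite-difference Jacobian Jf(x_s) of the learned mapping
        (equation 2.15)."""
        J = np.zeros((3, 3))
        fx = np.asarray(f(x_s), dtype=float)
        for j in range(3):
            step = np.zeros(3)
            step[j] = eps
            J[:, j] = (np.asarray(f(x_s + step), dtype=float) - fx) / eps
        return J

    def precompute_target_bases(f, x_s, source_deltas):
        """Equation (2.17): adapt each source deformation vector at x_s to the
        target surface, so the run-time cost is just a weighted blend."""
        J = numerical_jacobian(f, np.asarray(x_s, dtype=float))
        return [J @ np.asarray(d, dtype=float) for d in source_deltas]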

Target Motion Synthesis: After the motion data are synthesized from the motion capture database via the vision-based interface, the system projects them into the deformation basis space of the source model S_1, ..., S_L to compute the combination weights \lambda_1, ..., \lambda_L. The deformation of the target surface, \delta x_t, is generated by blending together the deformation bases of the target surface, \delta x_{t,i}, using the combination weights \lambda_i according to equation (2.16) (see figure 2.10). The target motion synthesis is done online; the other three steps are completed off-line.

One strength of our expression retargeting method is speed. The run-time computational cost of the expression retargeting depends only on the number of deformation bases of the motion capture database, L, rather than on the number of vertices of the animated model. In our implementation, the expression retargeting cost is far less than the rendering cost.

Fig. 2.9: The top seven deformation bases for a target surface model. (a) The gray mask is the target surface model in the neutral expression. (b-h) The needles show the scale and direction of the 3D deformation vector on each vertex.

2.7 Results

All of the motion data in the database were from one subject and collected at the rate of 120 frames/second with 92 facial markers. In total, the motion capture database contains about frames (approximately 10 minutes). We captured the subject performing various sets of facial actions, including the six basic facial expressions (anger, fear, surprise, sadness, joy, and disgust) and other common facial actions such as eating, yawning, and snoring. During the capture, the subject was instructed to repeat the same facial action at least six times in order to capture different styles of the same facial action. We also recorded a small amount of motion data related to speaking (about 6000 frames) and included it in the motion capture database, but the amount of speaking data was not sufficient to cover all the variety of facial movements associated with speaking.

We tested our facial animation system on different users. We use a single video camera to record the facial expression of the user; the frame rate of the video camera is about 60 frames/second. The users were not told which facial actions were in the motion capture database. We also did not give the users specific instructions about the kind of expressions they should perform or how they should perform them.

Fig. 2.10: Online expression retargeting diagram. At run time, the system projects the synthesized expression into the deformation basis space of the source model S_1, ..., S_L to compute the combination weights. The deformation of the target surface is generated by blending together the deformation bases of the target surface T_1, ..., T_L using the combination weights.

The user was instructed to start from the neutral expression in the fronto-parallel view. At run time, the system is completely automatic except that the positions of the tracking features in the first frame must be initialized. The facial tracking system then runs in real time (about 20 frames/second). Dynamical filtering, expression synthesis, and expression retargeting do not require user intervention, and the resulting animation is synthesized in real time. The system has a 0.33 s delay to remove the mapping ambiguity between the visual tracking data and the motion capture data.

Figure 2.11 and figure 2.12 show several sample frames from the single-camera video of two subjects and the animated 3D facial expressions of two different virtual characters. The virtual characters exhibit a wide range of facial expressions and capture the detailed muscle movement in untracked areas like the lower cheek. Although the tracked features are noisy and suffer from the jitter typically seen in vision-based tracking systems, the controlled expression animations are smooth and high quality.

Fig. 2.11: Results of two users controlling the 3D facial expressions of two different target surface models.

Fig. 2.12: Results of two users controlling the 3D facial expressions of two different texture-mapped avatar models.

2.8 Discussion

We have demonstrated how a preprocessed motion capture database, in conjunction with a real-time facial tracking system, can be used to create a performance-based facial animation in which a live performer effectively controls the expressions and facial actions of a 3D virtual character. In particular, we have developed an end-to-end facial animation system for tracking real-time facial movements in video, preprocessing the motion capture database, controlling and animating the facial actions using the preprocessed motion capture database and the extracted animation control parameters, and retargeting the synthesized expressions to a new target model.

The output animation might not exactly match the facial movement in the video because the synthesized motion is interpolated and adapted from the motion capture data. If the database does not include the exact facial movements seen in the video, the animation is synthesized by interpolating the closest motion examples in the database. In addition, the same facial action performed by a user and an avatar will produce different facial deformations because they generally have different facial geometry and proportions.

One limitation of the current system is that it sometimes loses details of the lip movement. This problem probably arises because the motion capture database does not include sufficient samples related to speaking. Alternatively, the tracking data from the vision-based interface might not be sufficient to capture the subtle movement of the lips. With a larger database and more tracking features around the mouth, this artifact might be eliminated. Currently, our system does not consider speech as an input; however, previous work on speech animation shows there is a good deal of mutual information between vocal and facial gestures [18, 17, 33]. The combination of a speech interface and a vision-based interface would improve the quality of the final animation and increase the controllability of the facial movements.

Although we have performed a preliminary evaluation by visually inspecting the results of using the system on two subjects who are not in the database, we have not rigorously evaluated the results. The accuracy of our system depends on three key factors: tracking error from the vision-based interface, motion synthesis error, and expression retargeting error.

45 3. VISION-BASED CONTROL OF 3D FULL-BODY MOTION The ability to accurately reconstruct a user s motion in real time would allow the intuitive control of characters in computer games, the control of avatars for virtual reality or electronically mediated communication, and the rapid prototyping of character animations. This problem has been solved by commercially available motion capture equipment, but this solution is far too expensive for common use. It is also cumbersome, requiring the user to wear carefully positioned retro-reflective markers and skin-tight clothing, 15 magnetic motion sensors, or a full exoskeleton. In this paper, we present a different approach to solving this problem: reconstructing the user s motion from the positions of a small set of markers captured with two video cameras. This necessarily incomplete information about the user s motion is supplemented by a database of prerecorded human motion. The results are visually comparable in quality to those obtained from a commercial motion capture system with a full set of markers provided that similar behaviors are found in the pre-recorded database. The cost is low because only two synchronized video cameras are required. The system is easy to set up and relatively non-intrusive because the user is required to wear only a small set of markers (6 9 for the experiments reported here) and can wear street clothes (figure 3.1). Providing accurate control of full-body motion based on a small set of markers is difficult because the information about the user s motion is incomplete. The control signals, or input, are the locations of the markers. This information is quite low-dimensional (less than twenty degrees of freedom) compared to a typical human model (approximately sixty degrees of freedom). The control signals cannot be used directly to create a full-body animation because they will be consistent with many disparate solutions for the character s pose. We eliminate this ambiguity by building a local, low-dimensional model of the user s motion on the fly from a motion database of pre-recorded, high-quality motion. The key insight behind this approach is that natural human motion is highly coordinated and the movement of the degrees of freedom are not independent. As a result, the local models can be quite low-dimensional while accurately representing the motion.

46 3. Vision-based Control of 3D Full-body Motion 35 Fig. 3.1: Users wearing a few retro-reflective markers control the full-body motion of avatars by acting out the motion in front of two synchronized cameras. From left to right: walking, running, hopping, jumping, boxing, and Kendo (Japanese sword art). We demonstrate the power and flexibility of this approach by having users control six behaviors in real time without significant latency: walking, running, hopping, jumping, boxing, and Kendo (Japanese sword art) (figure 3.1). The reconstructed motion is based on a single large human motion database. Our experiments indicate that this approach scales well with the size and heterogeneity of the database and is robust to variations in kinematics between users. The resulting animation also captures the individual style of the user s motion through spatial-temporal interpolation of the data. The database, however, must contain the basic actions required for the application domain. We assess the quality of the reconstructed motion by comparing against ground truth data simultaneously captured with a full marker set in a commercial motion capture system. We also compare alternative techniques for the constituent elements of the system: dimensionality reduction for the human motions and the local models used for synthesis. 3.1 Background In the next two sections, we discuss related work in control interfaces for human motion. Because we use a motion capture database in our system, we also briefly review research utilizing motion capture data for animation.

47 3. Vision-based Control of 3D Full-body Motion Control Interfaces for Human Motion Computer and video games offer a variety of interfaces for controlling human motion such as mice, joysticks, and button or key presses. These interfaces can provide direct control over only a limited number of degrees of freedom and the details of the motion must be computed automatically. Control at a more detailed level can be provided if the user acts out the desired motion using his or her body in a performance animation system. Such systems have proved quite successful for television characters who respond in real time to the actions of human actors [90]. Active optical, passive optical, magnetic, and exoskeleton-based motion capture systems all now have the ability to perform real-time capture of a typical human model. While very effective for television, trade shows, and other performance venues, the time required to suit up the user (commonly referred to as the time to don and doff) prevents their use in location-based entertainment and other applications where the system will be used by many people. These systems are also not appropriate for home use because of cost. Systems that can extract meaningful information about the user s motion from only a few sensors are appealing because they dramatically reduce don and doff time. The infrared sensor-based games (Mocap Boxing and Police 911 by Konami [51], for example) are a successful commercial example of this class of interface. These systems track the motion of the hands and render the hands, a first person view of part of the upper body, and the effect of the boxing gloves or gun on the environment. Sony s EyeToy [93] is a vision-based system that requires no markers but is capable of extracting the 2D locations of simple gestures such as a punch or a wave. An earlier system that implemented a number of different interfaces for computer games was presented in the research community [37]. None of these systems attempted to fully capture or animate the user s motion but instead focused on recognizing or locating a limited set of simple actions and showing their effect on the scene. Researchers have also explored techniques for using a few sensors to reconstruct full-body motion. Badler and colleagues [7] used four magnetic sensors and real-time inverse kinematics algorithms to control a standing figure in a virtual environment. Their system adopted a heuristic approach to handling the kinematic redundancy while we use a data-driven approach. Semwal and colleagues [88] provided an analytic solution to the inverse kinematics algorithm based on eight magnetic sensors. Yin and Pai [111] used a foot pressure sensor to develop an interface that extracts full-body motion from a database. Their system was successful at reproducing full body

48 3. Vision-based Control of 3D Full-body Motion 37 motion for a limited range of behaviors with a latency of one second. However, foot pressure patterns may be insufficient to accurately reconstruct a motion with detailed upper body motions. The interface problem becomes more tractable if the motion is performed in several layers so that not all degrees of freedom need to be animated simultaneously. Oore and colleagues [69] used a pair of six degree-of-freedom tracking devices to provide interactive control over the stepping and walking motions of a character. Dontcheva and colleagues [31] also used layering in their puppeteering system. Recently, Grochow and colleagues [42] applied a global nonlinear dimensionality reduction technique, Gaussian Process Latent Variable Model (GPLVM) [55], to human motion data. They combined the learned probabilistic model with kinematic constraints to create a character that could be interactively posed with a mouse. Their global, nonlinear dimensionality reduction technique works well with a small homogenous data set, but might not be suitable for a large heterogeneous motion data set. Lee and colleagues [56] built a vision-based interface to transform noisy silhouette data obtained from a single video camera to full-body movements with a latency of about two seconds. Their approach searched a motion graph using Hu moments computed from the input silhouettes. Ren and colleagues [81] used silhouettes from a three-camera system and a motion capture database to select among a large set of simple features for those most suited to identifying the yaw orientation and pose of the user from three silhouettes. Their application was domain specific in that the features were selected using training data of a specific behavior, swing dancing. Their approach produced high-quality motion that approximated that of the user with 0.8 second latency. Neither of these systems gives the user precise control over the character s motion because a motion graph-based approach cannot modify the existing motions in the database. In addition to eliminating the synthesis latency in these systems, our approach uses a series of local models to interpolate the motions in the database for more accurate control. Another alternative is to employ vision-based tracking to capture the movement of the user. However, that technique has not yet been successfully used to accurately reconstruct complex full-body human motion in real time [47, 15, 91, 25]. The work of Howe [47] and Sidenbladh [91] and their colleagues is perhaps most closely related to that presented in here in that they also use motion capture data. Howe and colleagues [47] published one of the earliest papers on using global PCA to reduce the dimensionality of human motion. They incorporated the reduced model into a probabilistic Bayesian framework to constrain the search of human motion. Sidenbladh and colleagues [91] reduced the dimensionality of the database using global PCA and then

49 3. Vision-based Control of 3D Full-body Motion 38 constrained the set of allowable trajectories within a high-dimensional state space. Our goals are different, however, because we focus on high-quality animation and real-time control Animation with Motion Capture Data A number of researchers have developed techniques for synthesizing animated sequences from motion capture data. Three distinct approaches have been used: constructing models of human motion [60, 16], reordering motion clips employing a motion graph [3, 53, 56, 80, 4] and interpolating motion to create new sequences [44, 102, 82, 52]. In our work, we construct a graph of nearest neighbors for fast search of the motion examples that are close to the current control signals and use it to build a local linear model of the motion for interpolation. We therefore discuss motion graphs and motion interpolation in more detail. Motion graphs create an animation by cutting pieces from a motion database and reassembling them to form a new motion. Because the motion that is selected is not modified, it retains the subtle details of the original motion data but the synthesized motions are restricted to those in the motion capture database. For example, a motion graph cannot be used to synthesize a walking motion for a slope of a particular angle unless the database included data for that slope. Interpolation addresses this problem by allowing synthesis of motion variations that are not in the database. Both Guo and Roberge [44] and Wiley and Hahn [102] produced modified motions using linear interpolation. Rose and colleagues [82] used radial basis functions to interpolate motions located irregularly in the parameter space. Generally, this approach requires segmenting the motion data into structurally similar sequences, building a temporal correspondence among them, and annotating each with a small set of meaningful, high-level control knobs. Given new values of the control parameters, the sequences can be interpolated to compute a motion that matches the specified parameters. Recently, Kovar and Gleicher [52] introduced a method for automatically locating logically similar motion segments in a data set and using them to construct parameterized motions. These algorithms produce high-quality motion for new parameter values that are within the space of the interpolated examples. Like interpolation, our approach can generate spatial/temporal variations that are not in the database. Because interpolation occurs only in the local region of the current control signals with our approach, it does not require that the motions be structurally similar at a high level.

3.2 Overview

Our system transforms low-dimensional control signals obtained from only a few markers into full-body animation by constructing a series of local models from a database of human motion at run time and using those models to fill in probable values for the information about the user's motion not captured by the markers.

We first perform a series of off-line captures to create a large and heterogeneous human motion database (about 1 hour) using a Vicon optical motion capture system with twelve 120 Hz Mx-40 cameras [100]. The database contains ten full-body behaviors: boxing (71597 frames), walking ( frames), running (18523 frames), jumping (40303 frames), hopping (18952 frames), locomotion transitions (36251 frames), dancing (18002 frames), basketball (12484 frames), climbing on playground equipment (51947 frames), and Kendo (59600 frames). We used a marker set with 41 markers, an adaptation of the Helen Hayes marker set. We added four extra markers on the bamboo sword for Kendo. For Kendo, boxing, and locomotion, the subjects were instructed to repeat each action at least five times in order to capture variations in performance and to ensure that the local model was constructed from sufficiently similar data.

Each motion in the database has a skeleton that includes the subject's limb lengths and joint range of motion computed automatically from calibration captures. Each motion sequence contains trajectories for the absolute position and orientation of the root node (pelvis) as well as relative joint angles of 18 joints. These joints are the head, thorax, upper neck, lower neck, upper back, lower back, and left and right humerus, radius, wrist, femur, tibia, and metatarsal.

We denote the set of motion capture data in the database as {q_n | n = 1, ..., N}, where q_n is the joint angle representation of a specific pose in the database. The control signals obtained from the interface at time t are represented by the locations of a small set of retro-reflective markers worn by the user, denoted as \tilde{c}_t. We always place markers on the torso of the user so that the absolute position and orientation of the user, denoted as \tilde{z}_t, can be directly obtained from the control signals, \tilde{c}_t. The online motion control problem is to synthesize the current human body pose, \tilde{q}_t, in real time based on the current low-dimensional control signals, \tilde{c}_t, obtained from the vision-based interface, the motion capture data in the database, {q_1, ..., q_N}, and the synthesized poses in the previous frames, [\tilde{q}_1, ..., \tilde{q}_{t-1}].

The system contains three major components (figure 3.2):

Fig. 3.2: System overview.

Motion performance. The user wears a small set of retro-reflective markers to perform a motion in front of two synchronized video cameras. The system automatically extracts the locations of the markers, [\tilde{c}_1, ..., \tilde{c}_t], and the absolute position and orientation of the motion, [\tilde{z}_1, ..., \tilde{z}_t], from the video streams in real time. The trajectories of the markers specify the desired trajectories of certain points on the animated character.

Online local modeling. To synthesize the current pose \tilde{q}_t, we first search the motion capture database for examples that are close to the current control signals \tilde{c}_t and the synthesized poses in the previous frames [\tilde{q}_1, ..., \tilde{q}_{t-1}]. Because the runtime computational cost depends on the efficiency of the nearest neighbor search process, we introduce a data structure, a neighbor graph, and an algorithm that accelerates the nearest neighbor search by utilizing the temporal coherence of the control signals. The nearest neighbors, denoted as {q_t^k | k = 1, ..., K}, are then used to learn a local linear model of human pose near the current control signal.

Online motion synthesis. The local linear model is used to reconstruct the user's pose, \tilde{q}_t, based on the control signals obtained from the vision-based interface, \tilde{c}_t, a human pose prior term that ensures that the synthesized motion satisfies the probabilistic distribution of human motions in the database, and a smoothness term that minimizes velocity changes in the synthesized motion.

We describe these components in more detail in the next three sections.

52 3. Vision-based Control of 3D Full-body Motion Motion Performance In this section, we describe a simple but robust vision algorithm to extract the locations of the retro-reflective markers from two synchronized video cameras. We then describe the subject calibration process that makes the vision-based interface robust to users of different sizes and to variations in marker placement Motion Analysis Our system employs two Pulnix video cameras (TMC-6700-CL), which have image resolution and a frame rate of 60 fps, as input devices (figure 3.3). We use the method described by Zhang [113] to determine the intrinsic parameters of the cameras and the relative transformation between them. To illuminate the retro-reflective markers, we placed a photography light near each camera. To make detection more robust, we apply background subtraction techniques based on the statistics of the images and search just the foreground pixels. The system computes the mean and standard deviation of each background pixel in each color channel for a sequence of frames where the user is not present. During online tracking, pixels that differ in at least one color channel by more than a user-defined threshold from the background distribution are labelled as foreground pixels. We then perform a morphological filter (dilation) on the foreground pixels to enlarge the region and to ensure that markers on the boundary are included. After the system locates the markers in each image, we establish a correspondence between the markers using epipolar geometry and color similarity constraints. The epipolar geometry constraint reduces the search space for the corresponding marker to only those markers that lie on a single epipolar line [108]. Figure 3.3 (c) and (d) shows the marker locations and the corresponding epipolar lines. Because the number of markers is small, the epipolar line constraint is generally sufficient to find the correct point correspondences. Occasionally, more than one point in the second camera might satisfy the epipolar geometry constraint from the first camera, but temporal coherence can be used to reduce this ambiguity. Given a point correspondence between the two calibrated cameras, we compute the marker s 3D location by finding the intersection of rays cast from the 2D markers in both cameras. Occlusion might prevent a marker from being seen by both cameras. To address this problem, we also include the 2D locations of the markers seen by only one camera in the control signals. Once the system labels the markers in the first frame, it can label the markers in each subse-

53 3. Vision-based Control of 3D Full-body Motion 42 (a) (b) (c) (d) Fig. 3.3: Marker detection and correspondence: a user acts out the motion in front of two synchronized video cameras. (a) and (b) The images from the left and right cameras respectively. (c) The detected marker positions in the left image. (d) The detected marker locations in the right image and the epipolar lines of the markers that were detected in the left image. For each marker in the left image, the matching marker in the right image should be located on its corresponding epipolar line.

54 3. Vision-based Control of 3D Full-body Motion 43 quent frame by matching the current marker locations to the marker set found for the most recent synthesized motion (described in section 3.5). Because the synthesized motion is reconstructed from the motion capture database, it includes any occluded markers. Therefore, the system can automatically handle missing markers by labelling a marker that was occluded once it can be seen again. The marker tracking and labelling system runs in real time and did not require manual intervention for the experiments reported here. The marker labelling appears to be robust to variations in body type and occlusion Subject Calibration Subject calibration ensures that the vision-based interface is robust to users of different sizes and to variations in marker placement. Subject calibration consists of two steps: skeleton calibration and marker calibration. Skeleton calibration. Skeleton calibration estimates the user s skeleton model from the 3D locations of a few markers obtained from the vision-based interface. In our experiments, we place markers on the left hand, left elbow, left foot, left knee, and each shoulder. Two markers are placed on the front of the waist. We instruct the user to assume a T Pose and capture the 3D locations of the markers. The locations of this small set of markers are not sufficient to compute a detailed skeleton model; therefore we use these measured 3D marker locations to interpolate a database of detailed skeleton models from a variety of subjects. We then place markers on the right limb and model the right side of the skeleton model in a similar fashion. Each user need perform the skeleton calibration step only once. Marker calibration. The goal of marker calibration is to determine the location of the control markers used in the interface relative to the inboard joint. For example, the location of the hand marker relative to the coordinate system of the wrist. We first measure the location of the markers for the T pose in the world coordinate frame. Given this information and user s skeleton model, we compute the 3D positions of the inboard joints relative to the world coordinate frame in the T pose via forward kinematics. The location of markers in the coordinate system of the inboard joint can then be found by computing the difference between the location of the inboard joint and the marker relative to the world coordinate frame. This calibration step must be repeated if the marker placement changes. The system can handle extra markers if the marker calibration step is repeated for the new marker set. We preprocess the motion capture database by computing the 3D location of the control

markers, c_n, corresponding to the motion capture data for each frame q_n in the database:

c_n = f(q_n; s, \tilde{v}_l, z_0)    (3.1)

where the function f is the forward kinematics function that computes the marker positions from the joint angles of the current frame, q_n, given the user's skeleton model, s, and the locations of the control markers, \tilde{v}_l, relative to the inboard joint. We fix a default root position and orientation, z_0, for the motion capture data.

3.4 Online Local Modeling

The motion synthesis problem is difficult because the positions of the small set of markers worn by the user do not adequately constrain the joint angles of a full-body human model. Our key idea is to use a lazy learning algorithm to automatically construct a series of simple local models that sufficiently constrain the solution space. The lazy learning algorithm postpones all computation until an explicit request for information (e.g., prediction or local modeling) is received [2].

The motions to be synthesized, [\tilde{q}_1, ..., \tilde{q}_t], form a nonlinear manifold embedded in a high-dimensional space. At run time, the system automatically learns a series of low-dimensional linear models to approximate this high-dimensional manifold. To build a local model, we search the motion capture database for examples that are close to the current marker locations and recently synthesized poses. These examples are then used as training data to learn a simple linear model via Principal Component Analysis (PCA) [11]. A new local model is created to synthesize each pose.

The system relies on the current control signals from the interface, \tilde{c}_t, and the synthesized poses in the previous two frames, [\tilde{q}_{t-1}, \tilde{q}_{t-2}], to find the K closest examples {q_t^k | k = 1, ..., K} for the current pose \tilde{q}_t. The query metric, for each example q_n in the database, is

\alpha \, \| c_n - T(\tilde{z}_t, z_0) \, \tilde{c}_t \|^2 + (1 - \alpha) \, \| q_n - 2\tilde{q}_{t-1} + \tilde{q}_{t-2} \|^2    (3.2)

where \|\cdot\| denotes the Euclidean norm and T(\tilde{z}_t, z_0) is the transformation matrix that aligns the current root position and orientation of the control markers, \tilde{z}_t, with the default root position and orientation of the motion capture data, z_0. The first term evaluates how well the control parameters associated with the example pose match the control signals from the interface. The second term evaluates the continuity of the motion that would result if q_n were to be placed after \tilde{q}_{t-1} and \tilde{q}_{t-2}. In our experiments, \alpha is set to 0.8.
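A brute-force version of this nearest-neighbor query is easy to state. The sketch below evaluates the metric of Equation (3.2) for every database frame, assuming the current control signal has already been transformed by T(z_t, z_0) into the database's default root frame; Section 3.4.1 replaces the exhaustive loop with the neighbor-graph search. Names and the default K are illustrative.

    import numpy as np

    def query_cost(c_n, q_n, c_t_aligned, q_prev1, q_prev2, alpha=0.8):
        """Equation (3.2): control-marker mismatch plus a continuity penalty
        against the two previously synthesized poses."""
        control = np.sum((c_n - c_t_aligned) ** 2)
        smooth = np.sum((q_n - 2.0 * q_prev1 + q_prev2) ** 2)
        return alpha * control + (1.0 - alpha) * smooth

    def k_nearest_exhaustive(db_markers, db_poses, c_t_aligned,
                             q_prev1, q_prev2, k=50):
        """Indices of the K closest database examples under the metric above."""
        costs = np.array([query_cost(c_n, q_n, c_t_aligned, q_prev1, q_prev2)
                          for c_n, q_n in zip(db_markers, db_poses)])
        return np.argsort(costs)[:k]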

After subtracting the mean, p_t, of the K closest examples in the local region, we apply PCA to the covariance matrix of these K examples, {q_t^k | k = 1, ..., K}, in the joint angle space. We obtain a linear model for the current pose \tilde{q}_t:

\tilde{q}_t = p_t + U_t w_t    (3.3)

where U_t is constructed from the eigenvectors corresponding to the largest eigenvalues of the covariance matrix of the local examples, and w_t is a B-dimensional vector that is a low-dimensional representation of the current pose \tilde{q}_t. The number of nearest neighbors, K, and the dimension of the space, B, are selected locally and adjusted for the current query point by leave-one-out cross-validation [94, 5]. More specifically, each time a prediction is required, a set of local models is identified, each with a different dimension, B, and each including a different number of neighbors, K. The generalization ability of each model is then assessed through a local leave-one-out cross-validation procedure and the best model is selected for reconstruction. The dimension of the new space is usually less than seven in the experiments reported here and is therefore much lower than the dimension of the original space.

The local model avoids the problem of finding an appropriate structure for a global model, which would necessarily be high dimensional and nonlinear. Instead, we assume that a series of low-dimensional local models is sufficient to approximate the global high-dimensional manifold. Our local models do not require any parameter tuning because all parameters are automatically selected by cross-validation.

3.4.1 Fast Online K-nearest Neighbor Search

The main drawback of the local model is the time required to find the nearest neighbors. Because the number of queries will be large, the computational cost can be significantly reduced by preprocessing the motion capture database to create a data structure that allows fast nearest-neighbor search. This section introduces a neighbor graph and an algorithm that accelerates the runtime query by utilizing temporal coherence.

We first build a neighbor graph, each node of which represents a pose q_n in the human motion database. We connect the ith node and the jth node if and only if they satisfy

\| q_i - q_j \|_{L_1} < \max \left\{ \frac{f_m}{f_c} \, d, \ \epsilon \right\}    (3.4)

where \|\cdot\|_{L_1} denotes the L_1 distance, d represents the largest L_1 distance between two consecutive poses in the database, and f_m and f_c are the camera frame rates used for the motion capture and the control interface, respectively. \epsilon is a specified search radius for nearest neighbors. In our experiments, d is 1.75 degrees per joint angle and \epsilon is 3 degrees per joint angle.

Let {q_{t-1}^k | k = 1, ..., K} be the nearest neighbors of the previous frame. The nearest neighbors of the current query point, {q_t^k | k = 1, ..., K}, can be approximately found by searching {q_{t-1}^k | k = 1, ..., K} and their neighbors in the neighbor graph. A 2D example of our nearest neighbor search algorithm is shown in figure 3.4. The neighbor graph significantly reduces the computational time of the search by examining only the data points in the neighboring nodes of the last query point.
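A compact sketch of the neighbor graph and the temporally coherent lookup it enables follows. The quadratic construction loop is for clarity only (a real preprocessing pass over a one-hour database would need something faster), the per-joint L1 threshold follows Equation (3.4), and the cost callback stands in for the metric of Equation (3.2); all names are illustrative.

    import numpy as np

    def build_neighbor_graph(poses, radius):
        """Connect poses whose mean per-joint L1 distance is below the search
        radius of Equation (3.4); poses is an N x D array of joint angles."""
        graph = []
        for i in range(len(poses)):
            d = np.mean(np.abs(poses - poses[i]), axis=1)
            nbrs = np.nonzero(d < radius)[0]
            graph.append([int(j) for j in nbrs if j != i])
        return graph

    def coherent_knn(graph, cost_of, prev_neighbors, k):
        """Approximate the current K nearest neighbors by scoring only the
        previous frame's neighbors and their graph neighbors."""
        candidates = set(prev_neighbors)
        for idx in prev_neighbors:
            candidates.update(graph[idx])
        candidates = np.fromiter(candidates, dtype=int)
        costs = np.array([cost_of(i) for i in candidates])
        return candidates[np.argsort(costs)[:k]]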

Table 3.1 shows that the performance scales well with the size and heterogeneity of the database.

Tab. 3.1: Performance of the nearest neighbor search algorithm on the boxing and heterogeneous databases (columns: database, size, mean, minimum, maximum, and error), where mean, minimum, and maximum are the mean, minimal, and maximum number of nodes searched. All three numbers are significantly smaller than the size of the database. The error (degrees per joint) is measured by computing the L_2 distance between the pose synthesized from the examples found using exhaustive search and the pose synthesized from the examples found using our nearest neighbor search.

3.5 Online Motion Synthesis

The system automatically extracts the absolute root position and orientation, [\tilde{z}_1, ..., \tilde{z}_t], of the reconstructed motions directly from the vision-based interface. This section focuses on how to reconstruct the joint angle values, [\tilde{q}_1, ..., \tilde{q}_t], from the low-dimensional control signals, [\tilde{c}_1, ..., \tilde{c}_t].

At run time, the system automatically transforms the low-dimensional control signals from the vision-based interface into full-body human motion frame by frame using the learned local linear model (Equation 3.3), the training examples in the local region, {q_t^k | k = 1, ..., K}, the previously synthesized poses, [\tilde{q}_1, ..., \tilde{q}_{t-1}], and the current low-dimensional control signals, \tilde{c}_t. We use the local linear model (Equation 3.3) as a hard constraint and optimize the current pose \tilde{q}_t in the low-dimensional space w_t using a set of three energy terms: prior, control, and smoothness.

Fig. 3.4: A 2D example of the fast nearest neighbor search using two dimensions of the neighbor graph for the boxing database: (a) the data points in the database after we project them into the 2D eigenspace; (b) the magenta circle represents the previous pose and the magenta square represents the current pose. At run time, we use the neighbors of the previous frame (blue points) and a precomputed neighbor graph to find the possible neighbors of the current pose in the neighbor graph (red points). The algorithm then searches only the red and blue points to find the nearest neighbors of the current query point. (c) the green points are the nearest neighbors found using this algorithm.

The prior term, E_prior, measures the a priori likelihood of the current pose using the knowledge embedded in the motion capture database. It constrains the reconstructed pose to satisfy the probabilistic distribution determined by the training examples {q_t^k | k = 1, ..., K} in the local region. We assume the poses in the local region follow a multivariate normal distribution, and the pose prior term maximizes

P(\tilde{q}_t \mid q_t^1, ..., q_t^K) = \frac{ \exp\left( -\frac{1}{2} (\tilde{q}_t - p_t)^T \Lambda_t^{-1} (\tilde{q}_t - p_t) \right) }{ (2\pi)^{d/2} \, |\Lambda_t|^{1/2} }    (3.5)

where d is the dimension of \tilde{q}_t, the vector p_t and the matrix \Lambda_t are the mean vector and covariance matrix of the data samples {q_t^k | k = 1, ..., K} in the local region, and |\Lambda_t| is the determinant of the covariance matrix \Lambda_t. We minimize the negative log of P(\tilde{q}_t \mid q_t^1, ..., q_t^K), yielding the energy formulation

E_{prior} = (\tilde{q}_t - p_t)^T \Lambda_t^{-1} (\tilde{q}_t - p_t)    (3.6)

The control term, E_control, measures the deviation of the marker locations in the reconstructed motion from the control inputs obtained from the vision-based interface:

E_{control} = \| f(\tilde{q}_t; s, \tilde{v}_l, \tilde{z}_t) - \tilde{c}_t \|^2    (3.7)

where the function f is the forward kinematics function (Equation 3.1). Generating the animated sequence from only this constraint in the original joint angle space would be similar to performing per-frame inverse kinematics as was done by Badler and Yamane and their colleagues [7, 109]. If markers are visible to only one camera, we use the intrinsic and extrinsic parameters of that camera to project the 3D locations to 2D locations in the camera's image plane and then minimize the difference between the projected 2D locations and the 2D marker locations from the single camera.

The smoothness term, E_smoothness, measures the smoothness of the synthesized motion if \tilde{q}_t were placed after [\tilde{q}_1, ..., \tilde{q}_{t-1}]. We assume that the pose at time t depends only on the poses at times t-1 and t-2, and the smoothness term is

E_{smoothness} = \| \tilde{q}_t - 2\tilde{q}_{t-1} + \tilde{q}_{t-2} \|^2    (3.8)

where \tilde{q}_{t-1} and \tilde{q}_{t-2} are the synthesized poses in the previous two frames.

Combining Equations 3.6, 3.7, and 3.8 and substituting \tilde{q}_t using the local model (Equation 3.3), the complete energy function for motion synthesis is

\arg\min_{w_t} \; w_t^T U_t^T \Lambda_t^{-1} U_t w_t \; + \; \alpha \, \| f(w_t; U_t, p_t, s, \tilde{v}_l, \tilde{z}_t) - \tilde{c}_t \|^2 \; + \; \beta \, \| U_t w_t + p_t - 2\tilde{q}_{t-1} + \tilde{q}_{t-2} \|^2    (3.9)
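The per-frame optimization of Equation (3.9) can be posed as a small nonlinear least-squares problem in w_t, which is how a Levenberg-Marquardt solver is typically applied to it. The sketch below assumes the local covariance is well conditioned (in practice it may need a small regularizer) and that fk(q) returns the predicted control-marker positions of Equation (3.1); beta is left to the caller because its value is not given in this excerpt, and all names are illustrative.

    import numpy as np
    from scipy.linalg import cholesky
    from scipy.optimize import least_squares

    def synthesize_pose(U, p, Lam, fk, c_t, q_prev1, q_prev2, w0, alpha, beta):
        """Minimize Equation (3.9) over the low-dimensional coordinates w_t and
        return the reconstructed pose q_t = p_t + U_t w_t."""
        prior_sqrt = cholesky(np.linalg.inv(Lam))   # turns the prior into residuals

        def residuals(w):
            q = p + U @ w
            r_prior = prior_sqrt @ (q - p)                            # prior term
            r_control = np.sqrt(alpha) * (fk(q) - c_t)                # control term
            r_smooth = np.sqrt(beta) * (q - 2.0 * q_prev1 + q_prev2)  # smoothness
            return np.concatenate([r_prior, r_control, r_smooth])

        sol = least_squares(residuals, w0, method='lm')
        return p + U @ sol.x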

We initialize the optimization with the closest example in the database and optimize using the Levenberg-Marquardt nonlinear programming method [9]. The solution converges rapidly because of a good starting point and a low-dimensional optimization space (generally less than seven). In our experiments, α is set to 0.8 and β is set to

3.6 Numerical Comparison

In this section, we compare alternative techniques for the constituent elements of our performance animation system: dimensionality reduction for the human poses and local models for synthesis.

3.6.1 Dimensionality Reduction

The performance of the system depends on the ability to represent human motion in a low-dimensional space. Without this low-dimensional representation, the mapping from the control signals to the motion database would be one to many. The low-dimensional space also reduces the time required for optimization. In this section, we compare the performance of our online dimensionality reduction algorithm with other dimensionality reduction methods.

Previous work in dimensionality reduction can be divided into three categories: global principal component analysis (PCA) [11], nonlinear dimensionality reduction [64, 87, 84, 96, 55], and mixtures of local linear models (LLM) [38, 19, 46]. To study the performance of these algorithms on human motion data, we constructed three human motion databases: a 6-second walking database (700 frames), a 10-minute boxing database (71597 frames), and a 1-hour heterogeneous database ( frames) that includes walking, running, jumping, hopping, dancing, basketball, boxing, Kendo, and climbing on playground equipment. Figure 3.5 plots the performance of one algorithm from each of these classes as well as the approach described in this thesis. Each curve shows the average reconstruction error per joint with increasing number of dimensions. The average reconstruction error is the L_2 distance between the original motion and the motion reconstructed from the low-dimensional space.

Principal Component Analysis (PCA). PCA finds a global linear subspace approximating the nonlinear manifold of the motion capture database. Global PCA is widely used because of its simplicity and computational efficiency [11]. The number of dimensions required increases dramatically as the size and heterogeneity of the human motion database increase.

[Figure 3.5: three plots of average reconstruction error (degrees per joint angle) versus number of dimensions for global PCA, GPLVM, LLM, and our method: (a) small walking database; (b) medium-size boxing database; (c) large heterogeneous database.]

Fig. 3.5: Comparison of four dimensionality reduction methods: each curve shows the average reconstruction error with increasing number of dimensions. We could not compute the complete GPLVM error curves for the medium and large databases because of the computational cost.
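As a rough illustration of how the global-PCA curve in figure 3.5 can be produced, the sketch below computes an average per-angle reconstruction error for an increasing number of principal components; the data layout is assumed, and the error computed here is an RMS-per-angle proxy for the per-joint error reported in the thesis.

```python
import numpy as np

def pca_error_curve(poses, max_dims):
    """Average reconstruction error vs. number of global PCA dimensions.

    poses : (N, D) array of joint-angle poses (degrees), one row per frame.
    """
    mean = poses.mean(axis=0)
    X = poses - mean
    # Principal directions come from the SVD of the centered data matrix.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    errors = []
    for d in range(1, max_dims + 1):
        B = Vt[:d]                                 # (d, D) orthonormal basis
        recon = X @ B.T @ B + mean                 # project and reconstruct
        per_frame = np.linalg.norm(poses - recon, axis=1) / np.sqrt(poses.shape[1])
        errors.append(per_frame.mean())
    return np.array(errors)
```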

For the walking database, 14 dimensions are required to obtain a reconstruction error of less than one degree. For the boxing database, 27 dimensions are required, and the large heterogeneous database requires at least 38 dimensions.

Nonlinear dimensionality reduction. Direct synthesis of motions in the low-dimensional space requires an explicit mapping from the low-dimensional space to the original configuration space. Most previous research in nonlinear dimensionality reduction [64, 87, 84, 96], therefore, is not appropriate for our work. An exception is the work of Lawrence [55], where a Gaussian Process Latent Variable Model (GPLVM) was proposed to compute a global nonlinear map from the low-dimensional latent space to a high-dimensional space. Recently, Grochow and his colleagues [42] applied the GPLVM to human motion data to animate a character interactively. The GPLVM works well for the small walking data set (figure 3.5(a)). The average reconstruction error in the 2D latent space is about 2.7 degrees per joint; however, like global PCA, the performance of the GPLVM deteriorates as the size and heterogeneity of the database increase.

Mixtures of local linear models (LLM). LLM first partitions the data space into disjoint regions with a clustering technique and then performs PCA for each cluster [38, 19, 46]. LLM performs well for all three databases (figure 3.5). For the walking database, six dimensions are required to obtain a reconstruction error of less than one degree. For the boxing database, eighteen dimensions are required, and the large heterogeneous database requires at least 27 dimensions. The performance of LLM depends on the number and quality of the clusters.

Our method provides better performance for our application than the three other methods because it is a local method where the model is constructed at run time using only the current set of nearest neighbors. LLM has better performance than either global PCA or GPLVM because LLM models clusters of the data. Our local modeling method is more efficient than LLM because we build the local model from the neighbor set of the current query data rather than from a precomputed cluster center. Our method scales well with the size and heterogeneity of the database; the algorithm produces similar error curves for the three testing databases. For the walking database, four dimensions are required to obtain a reconstruction error of less than one degree. For the boxing database, six dimensions are required, and the large heterogeneous database requires at least five dimensions.
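A minimal sketch of this run-time local modeling step follows, assuming the K nearest neighbors have already been found (for example, with the neighbor-graph search sketched earlier); the variance threshold used to pick the local dimension is an illustrative choice, not the thesis's criterion.

```python
import numpy as np

def build_local_model(neighbor_poses, var_threshold=0.99):
    """Construct a local linear model q = U w + p from the K closest examples.

    neighbor_poses : (K, D) array of the database poses nearest to the query.
    Returns the local mean p, basis U (D x d), and neighbor covariance.
    """
    p = neighbor_poses.mean(axis=0)
    X = neighbor_poses - p
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep enough principal directions to explain var_threshold of the variance.
    energy = np.cumsum(S ** 2) / np.sum(S ** 2)
    d = int(np.searchsorted(energy, var_threshold)) + 1
    U = Vt[:d].T                                    # (D, d) local basis
    Lam = np.cov(neighbor_poses, rowvar=False)      # covariance used in E_prior
    return p, U, Lam
```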

[Figure 3.6: two plots of average reconstruction error (degrees per joint angle) versus frame number for NN, LWR, and our method: (a) boxing; (b) walking.]

Fig. 3.6: Comparison of methods for synthesizing motions from low-dimensional continuous control signals. (a) Average errors for boxing motion: 7.67 degrees/joint per frame for nearest neighbor synthesis (NN), 6.15 degrees/joint per frame for locally weighted regression (LWR), and 2.31 degrees/joint per frame for our method. (b) Average errors for walking motion: 4.46 degrees/joint per frame for NN, 3.32 degrees/joint per frame for LWR, and 1.30 degrees/joint per frame for our method. None of the testing sequences are in the database, and the boxing and walking motions are synthesized from the same set of markers used for the two camera system (figure 3.1).

3.6.2 Online Motion Synthesis Using Local Models

The local model used for online motion synthesis is different from previous lazy learning methods such as locally weighted regression [5] because we synthesize the motion in a low-dimensional parametric space constructed from a set of closest examples rather than directly interpolating the local examples. In figure 3.6, we compare our method with two popular local learning methods: nearest neighbor (NN) and locally weighted regression (LWR). Nearest neighbor synthesis simply chooses the closest example. Locally weighted regression interpolates the nearby examples, weighted by their distance to the query point, and has proven very effective in such problem domains as motor learning [6] and speech synthesis [28]. Figure 3.6 shows that our method creates more accurate results than either nearest neighbor synthesis or locally weighted regression.

3.7 Results

We test the effectiveness of our algorithm on different behaviors and different users using a large and heterogeneous human motion database and evaluate the quality of the synthesized motions by comparing them with motion capture data recorded with a full marker set.

Our results are best seen in video form, although we show several frames of a few motions in figure 3.7. We tested our system by controlling and animating a virtual character using two synchronized streams of video data and a small set of markers. Figure 3.7 shows sample frames of the results. The users control boxing, Kendo, walking, running, jumping, and hopping. In the accompanying video, we also demonstrate that the users can transition from one behavior to another, for example from walking to running, and that the system can synthesize motions in which the user is not fully facing forward. The video also illustrates a one camera system which uses a slightly larger set of markers. We also compare the reconstructed motion with motion capture data recorded with a full marker set.

Leave-one-out evaluation. First, we evaluate the reconstructed motions by leaving out one sequence of motion capture data from each database as the testing sequence (figure 3.8(1)). The 3D trajectories from the control markers used for the two camera system (as captured by the Vicon system) are then input to our online motion synthesis system to construct an animation (figure 3.8(2)). This test, however, does not include the effect of errors in the tracking of markers in the two camera vision system.

End-to-end evaluation. To perform an end-to-end evaluation, we synchronize the Vicon system and our two camera system and capture the movement of the user wearing the full set of markers. We then compare the motion from the Vicon system using the full marker set and the motion from the two camera system using the small set of markers. The result of this test for boxing is shown in the video. The reconstruction error for the boxing sequence is 2.54 degrees per joint angle.

3.8 Discussion

We have presented an approach for performance animation that uses a series of local models created at run time from a large and heterogeneous human motion database to reconstruct full-body human motion from low-dimensional control signals. We demonstrate the power and flexibility of this approach with different users wearing a small set of markers and controlling a variety of behaviors in real time by performing in front of one or two video cameras. Given an appropriate database, the results are comparable in quality to those obtained from a commercial motion capture

Fig. 3.7: Performance animation from low-dimensional signals. (1) The input video and corresponding output animation. (2)-(5) Animations created by users using the two camera system.

Fig. 3.8: Comparison with ground truth data. (1) Ground truth motion capture data. (2) Synthesized motion from the same marker set as that used for the two camera system.

system; however, our performance animation system is far less expensive and requires less time to suit up the user. Because the models are local, the system handles a heterogeneous database without difficulty. In our experiments, combining databases containing different behaviors had no effect on the performance of the local models or on the quality of the reconstructed motion. When we used a database in which each clip was labelled according to its behavior, we observed that the nearest neighbor search would rarely pick a sequence of poses from a behavior other than the one the user was performing. A global method such as PCA or GPLVM has a much more difficult time modeling a heterogeneous database because it must compute a global model of the entire database rather than consecutive local models of the region around the user's current pose. Global models might be appropriate for applications such as synthesis of motion without a continuous driving signal (for example, [86]), but given the temporal coherence of the control signals of our performance animation system, they were not required.

The performance of the system scales well with the size of the database because the cost of the nearest neighbor search is largely independent of the size of the database; however, it does depend on the density of the database because an ǫ ball is searched to find the nearest neighbors at run time. If repeated poses in the database became an issue, the size of the ǫ ball could be reduced or the database could be culled for duplicate sequences as a pre-computation step.

The system achieves some generality beyond the database, particularly with respect to small changes in style (speed and exact posture). However, we do find nearest neighbors for the entire

pose on each frame, and therefore novel combinations of behaviors (hopping while performing Kendo, for example) will likely not yield reasonable results. We have not yet attempted to assess how far the user's motions can stray from those in the database before the quality of the resulting animation declines to an unacceptable level. We have tested the system with users whose motion was not part of the database and found that the quality of the reconstructed motion was still good. We have not yet attempted to rigorously assess the dependence of the system on the body type of the user. A larger set of prototype skeleton models would likely result in a better match to the user's skeleton, as would a more sophisticated pose for calibration (such as a motorcycle pose with all joints slightly bent rather than a T pose).

We made a somewhat arbitrary decision in choosing where to place the markers for a specific behavior, although we always put several markers on the torso to compute the root position and orientation. For locomotion, we placed markers only on the hands, feet, and shoulders, which allowed a user wearing street clothes to control the character's motion. For boxing, we added a marker on the head because head motion is a key element of boxing. For Kendo, we placed markers on the lower arms rather than on the hands in order to reduce occlusion by the sword. We also added one marker on the sword. A more principled analysis of marker placement could be performed using synthetic data rendered with a graphical model of the character performing the target behavior.

We have tested the system with two synchronized cameras and a small set of markers. One limitation of the current system is that it does not allow the user to move freely in the space because of the requirement that most markers be seen by at least one camera. A third camera would reduce the constraints on the user's facing direction. For any given number of cameras, a larger set of markers should reconstruct the motion more accurately but will increase the intrusiveness of the animation interface. Similarly, adding more cameras to the system could improve the performance; the system, however, will become more expensive and cumbersome. The number of cameras and markers should probably be determined by the application. We chose to use two cameras and a small marker set because a user might be able to create such a simple and cheap setup at home. Other sensors also fit this description and might provide the types of control signals we seek. For example, inertial measurement units (IMUs) are now sold in small packages and could include a wireless link to the animation computer [107, 65]. A standard performance animation system for human characters would require at least 18 IMUs for

full-body motion control. The local models should allow us to significantly reduce the number of IMUs and thus the cost of the system.

Another limitation of the system is that an appropriate database must be available. That should not be a problem for a sports video game because the virtual players are often animated using motion capture. We believe that the range of behaviors expected of the user is sufficiently limited in many other applications and that this approach will be widely applicable. For example, local models could be used to constrain the search space for markerless human motion capture [25] as well as motion planning [110]. Commercial motion capture systems could use the local model to filter noisy data and fill in missing values automatically. Another potential application is the automatic synthesis of detailed full-body animation based on a small set of trajectories keyframed by an animator [80].

4. GENERATING STATISTICALLY VALID MOTION USING CONSTRAINED OPTIMIZATION

The previous two chapters described performance animation systems. In this third system, our goal is to design an animation system that allows users to easily create natural-looking character animation by specifying a sparse set of spatial-temporal constraints throughout the motion. A naive user might specify a small set of intuitive constraints, for example, the mouth width, height, and eye openness at key time instants. The system then automatically finds a natural facial animation that best satisfies those constraints. A more skilled user could use the system to control the trajectories of a small set of joint angles or end-positions on the limbs of a character to create a stylized walking gait. An ideal motion synthesis system should allow the user to specify different kinds of constraints, either at isolated points or across the whole motion, in order to accommodate users with different skill levels.

One appealing solution to this problem is physically based optimization [104], which allows the user to specify various forms of constraints throughout the motion and uses optimization to compute the physically valid motion that best satisfies these constraints. Unfortunately, merely getting the physics correct does not ensure that the motion will appear natural for characters with many degrees of freedom. Optimizing in the high-dimensional space required for full-body or facial animation often results in local minima.

In this chapter, we present an approach that allows the user to generate a wide range of natural-looking facial and full-body animation by specifying spatial-temporal constraints throughout the motion. We formulate the problem as trajectory optimization and consider the whole motion simultaneously. Instead of using the laws of physics to generate physically correct animation, our approach relies on statistical laws of human motion to generate statistically valid motion. The system automatically learns a statistical dynamic model from motion capture data. The statistical dynamic model plays a role that is similar to that played by the dynamics in physically based optimization in that it constrains the motion to just part of the space of possible human motions. The statistical dynamic model, however, is linear and much lower dimensional than the physical

Fig. 4.1: Our system generates a wide variety of natural motions based on various kinds of user-defined constraints.

dynamic model, making the optimization more efficient and less likely to be subject to local minima.

We demonstrate the effectiveness of this approach in two different domains: facial animation and human body animation (figure 4.1). We demonstrate that the system can generate natural-looking facial and human body animation from a sparse set of spatial-temporal constraints. For example, the user can generate a desired walking animation using a small set of key frames and foot contact constraints. The user can also specify a small set of key trajectories, such as the trajectories of the root position, hand positions, and foot positions, to generate a realistic running motion.

4.1 Background

We construct statistical models from motion capture data and then use them to generate statistically valid motion from user-defined spatial-temporal constraints in the framework of trajectory optimization. We therefore discuss related work in constraint-based trajectory optimization and data-driven animation.

4.1.1 Constraint-based Trajectory Optimization

Trajectory optimization methods, which were first introduced to the graphics community by Witkin and Kass [104], provide a powerful framework for generating character animation from

various forms of user-specified constraints and an objective function that measures the performance of a desired motion. The algorithm minimizes the objective function while satisfying Newtonian physical constraints and user-defined constraints. Extending physically based optimization to generate natural motion for a full human character has proved difficult because the system is highly nonlinear and has many degrees of freedom, and because it is difficult to define an objective function that reliably measures the naturalness of human motion [Cohen:1992,Liu:1994].

One way to make the problem tractable is to simplify the governing physical laws. Both Liu and Popovic [61] and Abe and his colleagues [1] showed that many dynamic effects can be preserved by enforcing patterns of linear and angular momentum during the motion. Reducing the number of degrees of freedom to be optimized can also create tractable problems. For example, Popovic and Witkin [79] showed that significant changes to motion capture data can be made by manually reducing the degrees of freedom to those most important for the task. Safonova and her colleagues [86] demonstrated that efficient optimization can be achieved for a sixty degrees of freedom (DOF) character in a behavior-specific, low-dimensional space without simplifying the dynamics. Other improvements include reformulating the dynamics to speed up the optimization process [36], using proper variable scaling and initialization techniques to improve the convergence of the optimization [95], and using a small set of motion examples to optimize appropriate parameters in the objective function [62]. In particular, Fang and Pollard [36] showed that objective functions and constraints whose derivatives can be computed in time linear in the degrees of freedom of the character result in fast per-iteration computation times and an optimization that scales well to more complex characters. Sulejmanpasic and Popovic [95] demonstrated that proper scaling and estimation of joint angles, torques, and Lagrange multipliers can dramatically improve the convergence of motion optimization. More recently, Liu and her colleagues [62] introduced a novel optimization framework, Nonlinear Inverse Optimization, for optimizing appropriate parameters of the objective function from a small set of motion examples and then used the estimated parameters to synthesize a new locomotion.

Our work uses a similar trajectory optimization framework but replaces the physical dynamic model with a statistical dynamic model. We also derive an objective function based on a probabilistic framework. We demonstrate that our approach can generate a wide variety of whole-body animations, but we also show that the approach can be used to generate facial expressions from

constraints, a problem which is not amenable to physically based trajectory optimization because style plays a significant role in facial animation.

4.1.2 Data-driven Motion Synthesis

An alternative technique for generating motion from user-defined constraints is to use motion capture data. One approach is to use a motion graph to re-sequence motions to form a new motion. This approach has been used to generate impressive animation for a human character [3, 53, 56, 80, 4, 57] and for facial expressions [112]. Because the original motion is not modified, these systems are incapable of matching poses or satisfying kinematic constraints such as end-position constraints unless the motion database contains a motion that directly satisfies those constraints.

Motion interpolation allows new motions to be created [44, 102, 82, 52, 66, 85]. This approach requires segmenting the motion data into structurally similar sequences, building a temporal correspondence among them, and annotating each with a small set of meaningful, high-level control knobs. Given new values of the control parameters, the sequences can be interpolated to compute a motion that matches the specified parameters. This interpolation framework can generate poses that are not in the database, but it does not allow the user to specify fine-grained constraints or constraints across multiple frames such as key-trajectory constraints. For example, interpolation cannot constrain the foot position of a character to a user-defined path over a period of time. Our approach does not require that the motions be structurally similar at a high level and allows the user to define constraints throughout the motion.

Statistical models of human motion have also been used for motion synthesis. For example, a number of researchers have used variants of Hidden Markov Models (HMMs) to statistically represent human motion: either full-body movements [39, 14, 16] or speech-driven facial expressions [18, 17, 33]. HMMs learned from human motion data have been used to interpolate key frames [39, 14], synthesize a new style of motion [16], and generate facial expressions from speech signals [18, 17, 33]. More recently, switching linear dynamic systems (SLDS) have been used to model human motion [76, 60]. In SLDS models, the Markov process controls an underlying linear dynamic system, rather than a fixed Gaussian measurement model. By mapping discrete hidden states to piecewise linear measurement models, the SLDS framework has greater descriptive power than an HMM. Pavlovic and his colleagues [76] present results for human motion synthesis, classification, and visual tracking using learned SLDS models. Li and his col-

leagues [60] used SLDS to extrapolate and interpolate a database of disco dancing motion. Our approach also learns spatial-temporal statistical models from motion capture data. However, our temporal model is different from the previous models used for motion synthesis because it is a continuous dynamic model rather than a mixture of discrete states (HMMs) or a mixture of continuous models (SLDS). This property leads us to formulate the motion synthesis problem in the framework of trajectory optimization.

A number of researchers have also developed statistical models for human poses and used them to estimate poses from kinematic constraints. For example, Grochow and colleagues [42] applied a global nonlinear dimensionality reduction technique, the Gaussian Process Latent Variable Model [55], to human motion data and then used the learned statistical pose model to compute poses from a small set of user-defined constraints. Another solution for data-driven inverse kinematics is to interpolate a small set of preexisting examples using constraints. The idea has been used to compute human body poses [83] or facial expressions [112] from kinematic constraints at a single frame. The advantage of these systems is that the user can specify fine-grained constraints to compute a desired pose at a single frame. However, these statistical pose models cannot be used to generate an animation from sparse constraints such as keyframe constraints because they lack temporal information.

We have developed performance animation systems which transform low-dimensional, continuous control signals into character animation. In Chapter 2, we presented a real-time vision-based performance animation system which transforms a small set of high-level facial features into high-quality facial animation by interpolating examples in the database at run time [24]. In Chapter 3, we used a series of local statistical pose models constructed at run time to reconstruct motion from continuous, low-dimensional control signals obtained from video cameras [23]. Similar to data-driven inverse kinematics, those approaches cannot be used to generate animation from a small set of keyframe constraints because they rely on the temporal continuity of control signals.

Our statistical dynamic model is motivated by the dynamical model used for video textures by Soatto and his colleagues [92]. They showed that a sequence of images of moving scenes such as sea waves, smoke, and whirlwinds can be modeled by second-order linear dynamic systems. They applied the learned linear dynamic systems to synthesize an infinite-length texture sequence by sampling noise from a known Gaussian distribution. Hsu and colleagues [48] used a similar linear dynamic model to translate one style of human body motion to another. More recently, Wang and his colleagues [101] learned a nonlinear dynamical model, Gaussian Process

Dynamical Models, from a small set of human motion capture data and demonstrated its effectiveness for human motion representation. Our goal is different from theirs because we aim to generate a desired motion that best satisfies user-defined constraints rather than generate a long sequence of video textures, translate one style of motion to another, or generate a long sequence of random human motions. Another key difference is that we introduce a new term, the control input, into the dynamic model. The control input term allows an efficient and linear representation of human dynamics, which are usually too complex to be accurately represented by simpler linear dynamical models.

4.2 Overview

The key idea of our approach is that statistical models learned from motion capture data can be used to create natural character animation that matches constraints specified by the user. The combination of the statistical models and the user's constraints provides sufficient information to produce motion with a natural appearance.

The human body motion capture database (about 15 minutes) included data from locomotion (jumping, running, and walking, which includes walking on slopes, climbing over obstacles, and stylized walking) and interaction with the environment (standing up/sitting down, reaching for/picking up/placing an object). The facial expression database (about 9 minutes) included six basic facial expressions (happiness, surprise, disgust, fear, anger, sadness) and facial movements related to everyday life such as speaking, eating, and snoring. The motion was captured with a Vicon motion capture system of 12 MX-40 cameras [100] with 92 markers for facial expressions and 41 markers for full-body movements. The motion was captured at 120 Hz and then downsampled to 30 Hz.

We denote the set of motion capture data in the database as {y_n | n = 1,..., N}, where y_n is the measurement of a character configuration at the nth frame. In facial animation, y_n contains the 3D positions of all vertices on the face model. In human body animation, y_n contains the position and orientation of the root and the joint angles. The constraints defined by the user are represented by c. The constrained motion synthesis problem is to synthesize a sequence of motion of length T (body or face), {y_t | t = 1,..., T}, based on the user-defined constraints, c, and the statistical models learned from motion capture data. The system contains three major components:

Statistical models. The system automatically learns a statistical dynamic model from motion capture data. This model is combined with a probabilistic model to ensure that the generated motion is natural.

Constraint specification. The user defines various forms of kinematic constraints throughout the motion: c = f(y_1, ..., y_T). The constraints could be positions, orientations, or the distance between two points. They can be specified either at isolated points (key frames) or across the whole motion (continuously).

Constrained motion synthesis. The system uses trajectory optimization to automatically find an animation that best satisfies the user-specified constraints while matching the statistical properties of the motion capture data.

We describe these components in more detail in the next two sections.

4.3 Spatial-temporal Motion Analysis

We first preprocess the motion capture data by applying Principal Component Analysis (PCA) and obtain a linear model for y_t:

y_t = C x_t + D    (4.1)

where the vector x_t \in R^{d_x} is a low-dimensional representation of the character configuration y_t \in R^{d_y}. The matrix C is constructed from the eigenvectors corresponding to the largest eigenvalues of the covariance matrix of the data, and D is the mean of all example data, D = (\sum_{n=1}^{N} y_n)/N. The dimensionality of the system state, d_x, can be automatically determined by choosing d_x as the cutoff where the singular values drop below a threshold.

We use an m-order linear time-invariant system to describe the dynamical behavior of the captured motion in the reduced low-dimensional space [63]:

x_t = \sum_{i=1}^{m} A_i x_{t-i} + B u_t    (4.2)

where m is the order of the linear dynamic model, x_t \in R^{d_x} and u_t \in R^{d_u} are the system state and control input, and d_u is the dimensionality of the control input u_t. This formulation is similar to a linear time-invariant control system commonly adopted in the control community [70].

It differs from the linear dynamic model used in [92] because of the introduction of the control term u_t to the equation.

Given the low-dimensional representation of the original motion capture data, x_1, ..., x_N, we want to identify the state-space model, including the system matrices A_i and B, the dimensionality of the control input d_u, and the corresponding control inputs {u_n | n = m + 1, ..., N}. The matrices A_i are not unique. To eliminate this ambiguity, we seek the A_i that minimize the control input in the sense of the Frobenius norm:

\hat{A}_1, \ldots, \hat{A}_m = \arg\min_{A_1, \ldots, A_m} \sum_{n} \| B u_n \|^2    (4.3)

The matrices A_i can then be uniquely found by computing the least-squares solution:

\hat{A}_1, \ldots, \hat{A}_m = \arg\min_{A_1, \ldots, A_m} \sum_{n=m+1}^{N} \Big\| x_n - \sum_{i=1}^{m} A_i x_{n-i} \Big\|^2    (4.4)

We use the estimated matrices \hat{A}_i to compute the control input term as follows:

\hat{z}_n = x_n - \sum_{i=1}^{m} \hat{A}_i x_{n-i}, \quad n = m + 1, \ldots, N    (4.5)

We form a d_x × (N - m) matrix by stacking all control inputs B u_n:

\underbrace{(\hat{z}_{m+1} \; \cdots \; \hat{z}_N)}_{\hat{Z}} = B \underbrace{(u_{m+1} \; \cdots \; u_N)}_{U}    (4.6)

We can now estimate the matrices B and U by factoring the matrix \hat{Z} into the product of a d_x × d_u matrix \hat{B} and a d_u × (N - m) matrix \hat{U} using Singular Value Decomposition. The dimensionality of the control input (d_u) can be automatically determined by choosing d_u as the cutoff where the singular values drop below a threshold.

Functionally, a statistical dynamic model is similar to a physical dynamic model. For example, given initial values of the system state (x_t), the linear dynamic model in Equation (4.2) can be used to generate an animation by sequentially choosing an appropriate value for the control input (u_t), just as joint torques would be used to advance a physical model through time. The advantage of using a statistical dynamic model for animation is that it is linear and usually much lower dimensional than a physical dynamic model.

The number of dimensions of the control input (d_u) characterizes the complexity of our dynamic model.
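A minimal sketch of this identification step follows, assuming the PCA coordinates are stored as rows of a NumPy array; the variable names and the singular-value threshold are illustrative.

```python
import numpy as np

def identify_lti(X, m, sv_threshold=1e-3):
    """Fit x_t = sum_i A_i x_{t-i} + B u_t from PCA coordinates X of shape (N, d_x)."""
    N, dx = X.shape
    # Least-squares fit of [A_1 ... A_m] (Equation 4.4).
    targets = X[m:]                                         # x_n for n = m+1..N
    regressors = np.hstack([X[m - i:N - i] for i in range(1, m + 1)])
    W, *_ = np.linalg.lstsq(regressors, targets, rcond=None)
    A = [W[(i - 1) * dx:i * dx].T for i in range(1, m + 1)]

    # Residuals z_n = x_n - sum_i A_i x_{n-i} (Equation 4.5), stacked column-wise.
    Z = (targets - regressors @ W).T                        # (d_x, N - m)

    # Factor Z = B U with an SVD; keep singular values above the threshold.
    Us, S, Vt = np.linalg.svd(Z, full_matrices=False)
    du = int(np.sum(S > sv_threshold * S[0]))
    B = Us[:, :du] * S[:du]                                 # (d_x, d_u)
    U_ctrl = Vt[:du]                                        # (d_u, N - m) control inputs
    return A, B, U_ctrl
```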

[Figure 4.2: two surface plots of average reconstruction error (mm for the face, degrees per joint angle for the body) versus the dynamic system order (m) and the dimensionality of the control input.]

Fig. 4.2: The average reconstruction error of the linear time-invariant system: (a) average reconstruction error of facial motion data in terms of the order of the dynamic system (m) and the number of dimensions of the control input (d_u); (b) average reconstruction error of human body motion data in terms of the order of the dynamic system (m) and the number of dimensions of the control input (d_u).

Figure 4.2 plots the reconstruction error of the motion data in terms of the order of the dynamic system (m) and the number of dimensions of the control input (d_u). The average reconstruction error is the L_2 distance between the original motion and the motion reconstructed from the linear time-invariant system. We can observe that the reconstruction error of the statistical model decreases as either the order of the dynamic system or the number of dimensions of the control input increases. If we choose d_u to be zero (simply dropping the control term), our model reduces to the linear dynamic model used in [92], which produces the largest reconstruction error. If d_u is equal to the number of dimensions of the system state d_x, the model can reconstruct any temporal behavior with zero error. In practice, human motion is highly coordinated, so the dimensionality of the control input required for accurate motion representation (d_u) is often much lower than the dimensionality of the system state (d_x). For the examples reported here, we set the dynamic order to 2 and the dimensionality of the control input to 1 for facial movement (the reconstruction error is about 0.1 mm/vertex per frame); we set the dynamic order to 3 and the dimensionality of the control input to 4 for human body animation (the reconstruction error is about 0.7 degrees/joint per frame).

4.4 Constraint-based Motion Synthesis

Constraint-based motion synthesis provides the user with intuitive control over the resulting motion: the user specifies a desired motion with various forms of constraints, such as keyframes, end-positions, or joint angle values; the system then automatically finds an animation that best satisfies the user-specified constraints while matching the spatial-temporal properties of the motion capture data.

We formulate the motion synthesis as a constrained optimization problem and consider the entire motion simultaneously. Like physically based optimization [104], we choose to represent x_t and u_t independently. The motion to be synthesized is represented as a sequence of system states and control inputs:

D = (x_1, \ldots, x_T, u_{m+1}, \ldots, u_T)    (4.7)

Let x and u be the concatenation of the system states x_t from frame 1 to frame T and the concatenation of the control signals u_t from frame m + 1 to frame T, respectively. The general problem statement is

\arg\min_{x,u} \; E(x, u) \quad \text{subject to} \quad f(x) = c    (4.8)

where E is an objective function (a scalar function of x and u) that measures the quality of the generated motion using the prior knowledge embedded in the motion capture data, and f is a vector function of the constraints defined by the user.

4.4.1 Constraints

The system allows the user to specify various forms of kinematic constraints throughout the motion or at isolated points in the motion. For facial animation, the animator can specify positions or orientations of any points on the face, or the distance between any two points (figure 4.3(a)). For whole-body animation, the user can specify positions or orientations of any points on the body, or joint angle values for any joints (figure 4.3(b)). Rather than requiring that constraints be specified in 3D, it is often more natural to specify where the projection of a point on the character should be located. Therefore, the system also allows the user to specify the 2D projections of any 3D point in a user-defined screen space.

The system allows the user to sketch out the motion in greater or lesser detail. For example, a casual user might prefer using sparse constraints such as keyframes to generate animation, while a

Fig. 4.3: Typical user-defined constraints: (a) positions or orientations of any points on the face, or the distance between any two points; (b) positions or orientations of any points on the body, the distance between any two points, or joint angle values for any joints; (c) notation.

more skilled user might want to control the paths of specific joints or points over a period of time. Spatially, the constraints could provide either an exact configuration, such as a full-body pose, or a small subset of the joint angles or end-positions. Temporally, the constraints could be instantaneous constraints, which constrain the motion at a particular frame, multiple-frame constraints, or continuous constraints during a period of time. User-defined constraints may be equalities or inequalities. For example, the user may constrain a point to lie on a plane, a line, or a point (equality constraints), or to lie within a region (an inequality constraint). Mathematically, we can divide the user-defined constraints into two groups:

Linear constraints, f_L(x) = c_L, generate linear equality or inequality constraints on the state x. Linear constraints can be used to define position constraints in facial animation and joint angle constraints in human body animation.

Nonlinear constraints, f_N(x) = c_N, provide nonlinear equality or inequality constraints on the state x. The most common nonlinear constraints in human body animation might be end-effector constraints, for example foot contact constraints. In facial animation, they can be used to specify the distance between two points on the face or the 2D projections of 3D facial points.

Figures 4.4, 4.5, and 4.6 show examples of user-defined constraints that were used to generate facial animation. Figure 4.7 shows two examples of user-defined constraints that were used to

Fig. 4.4: Keyframing constraints for creating facial animation: (a) the facial expression in the first frame; (b) the facial expression in the middle frame; (c) the facial expression in the last frame.

Fig. 4.5: Sparse spatial-temporal constraints in screen space for generating facial animation: (a) the user picks six points on the face; (b)-(d) their screen-space position constraints at three key frames.

generate full-body animation.

4.4.2 Objective Function

There might be many motions that satisfy the user-defined constraints. To remove ambiguities, we would like to constrain the generated motion to lie in the space of natural human motions by imposing a prior on the generated motion. Let D denote the generated motion (x_1, \ldots, x_T, u_{m+1}, \ldots, u_T); then the motion prior has the form

p(D) = p(x_1, \ldots, x_T, u_{m+1}, \ldots, u_T)    (4.9)

We assume that the control inputs u_t are independent and identically distributed and that the density of u_t is a mixture of K Gaussian components [11]:

p_{control}(u_t) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(u_t; \phi_k, \Lambda_k)    (4.10)

Fig. 4.6: Combination of key-trajectory constraints and keyframing constraints: (a) the user defines a distance between the left corner of the mouth and the right corner of the mouth; (b) the neutral expression in the first frame; (c) the neutral expression in the last frame; (d) the distance values throughout the motion.

Fig. 4.7: Two typical constraints for generating full-body animation: (a) keyframing constraints for generating a running animation; (b)-(c) key-trajectory constraints where the user selects six points on the character and then specifies their 3D trajectories across the motion.

where K is the number of Gaussian density models and π_k is a mixing parameter that corresponds to the prior probability that u_t was generated by the kth component. The function \mathcal{N}(u_t; \phi_j, \Lambda_j) denotes the multivariate normal density function with mean \phi_j and covariance matrix \Lambda_j. The parameters of the Gaussian mixture model (\pi_k, \phi_k, \Lambda_k) are automatically estimated using an expectation-maximization (EM) algorithm [11]. The training data are the values of the control inputs {\hat{u}_n} computed from the original motion capture data {y_n | n = 1, ..., N} (see section 4.3).

Based on the statistical dynamic model, the current system state x_t depends on the previous system states and the current control input u_t. We assume that its likelihood is measured by the deviation from the statistical dynamic model:

p_{dynamics}(x_t \mid x_{t-1}, \ldots, x_{t-m}, u_t) = \exp\Big\{ -\alpha \, \Big\| x_t - \sum_{i=1}^{m} A_i x_{t-i} - B u_t \Big\|^2 \Big\}    (4.11)

where α is a tuning parameter. Finally, combining the prior for the control input and the prior for the dynamic model, we have the complete prior:

L(D) = p(x_1, \ldots, x_T, u_{m+1}, \ldots, u_T) \approx \prod_{t=m+1}^{T} p_{control}(u_t) \cdot \prod_{t=m+1}^{T} p_{dynamics}(x_t \mid x_{t-1}, \ldots, x_{t-m}, u_t)    (4.12)

where the term p(x_1, \ldots, x_m) is dropped because it is usually very small compared with the rest of the terms. In the implementation, we minimize the negative log of L, yielding the energy formulation

E = E_{control} + E_{dynamics}    (4.13)

The control prior term, E_control, measures the a priori likelihood of the control input using the knowledge embedded in the motion capture data. We minimize the negative log of \prod_{t=m+1}^{T} p_{control}(u_t), yielding the energy formulation

E_{control} = -\sum_{t=m+1}^{T} \log\Big( \sum_{j=1}^{K} \pi_j \, \mathcal{N}(u_t; \phi_j, \Lambda_j) \Big)    (4.14)

The dynamics term, E_dynamics, measures the deviation from the linear time-invariant system in Equation (4.2):

E_{dynamics} = \alpha \sum_{t=m+1}^{T} \Big\| x_t - \sum_{i=1}^{m} A_i x_{t-i} - B u_t \Big\|^2    (4.15)

where m is the order of the statistical dynamic model, T is the number of frames to be synthesized, and the matrices A_i and B are the parameters of the statistical dynamic model.
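Both prior terms can be evaluated directly from the fitted models. Below is a hedged sketch that uses scikit-learn's GaussianMixture (an EM-based mixture fit, standing in for the mixture described above) to score the control inputs, together with the dynamics residual of Equation (4.15); the function names and the α value are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_control_prior(U_train, K=5):
    """Fit the Gaussian mixture prior p_control from training control inputs.

    U_train : (N - m, d_u) control inputs recovered from the database.
    """
    return GaussianMixture(n_components=K, covariance_type='full').fit(U_train)

def prior_energies(X, U, A, B, gmm, alpha=1.0):
    """E_control (Eq. 4.14) and E_dynamics (Eq. 4.15) for a candidate motion.

    X : (T, d_x) system states; U : (T - m, d_u) control inputs.
    """
    m = len(A)
    # Negative log-likelihood of the control inputs under the mixture.
    e_control = -np.sum(gmm.score_samples(U))
    # Deviation from the linear time-invariant model.
    e_dynamics = 0.0
    for t in range(m, X.shape[0]):
        pred = sum(A[i] @ X[t - 1 - i] for i in range(m)) + B @ U[t - m]
        e_dynamics += alpha * np.sum((X[t] - pred) ** 2)
    return e_control, e_dynamics
```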

Besides these two energy terms, we also define a smoothness term, E_smoothness, to minimize the sum of the accelerations of each degree of freedom over the entire animation. The smoothness term is

E_{smoothness} = \sum_{t=1}^{T} \ddot{y}_t^T \ddot{y}_t = \sum_{t=1}^{T} \ddot{x}_t^T C^T C \, \ddot{x}_t    (4.16)

The final objective function is a weighted sum of the three components:

E = E_{control} + \alpha E_{dynamics} + \beta E_{smoothness}    (4.17)

4.4.3 Optimization

We can formulate the constraint-based motion synthesis as the following constrained optimization problem:

\arg\min_{x,u} \; E_{control} + \alpha E_{dynamics} + \beta E_{smoothness} \quad \text{subject to} \quad f_L(x) = c_L, \;\; f_N(x) = c_N    (4.18)

We solve the optimization problem using sequential quadratic programming (SQP) [9], where each iteration solves a quadratic programming subproblem. The Jacobian of the constraint function f and the gradient and Hessian matrix of the objective function E are symbolically evaluated at each iteration. We initialize all values with random numbers between 0 and 1, except that a linear interpolation of the user-specified keyframe constraints is used for initialization.
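As a concrete illustration of Equation (4.18), the sketch below sets up the problem with SciPy's SLSQP routine (a standard sequential quadratic programming implementation) in place of the thesis's solver. It reuses the prior_energies function sketched above; the single keyframe constraint, the weights, and the matrix names are hypothetical stand-ins.

```python
import numpy as np
from scipy.optimize import minimize

def synthesize_motion(T, dx, du, m, A, B, gmm, C,
                      key_t, key_x, alpha=1.0, beta=0.1):
    """Minimize Eq. (4.17) subject to one (hypothetical) keyframe equality constraint.

    key_t, key_x : frame index and target low-dimensional pose for the keyframe.
    """
    n_x, n_u = T * dx, (T - m) * du

    def unpack(v):
        return v[:n_x].reshape(T, dx), v[n_x:].reshape(T - m, du)

    def objective(v):
        X, U = unpack(v)
        # Inner alpha fixed to 1; the Eq. (4.17) weight is applied below.
        e_control, e_dynamics = prior_energies(X, U, A, B, gmm, alpha=1.0)
        acc = X[2:] - 2 * X[1:-1] + X[:-2]           # finite-difference accelerations
        e_smooth = np.sum((acc @ C.T) ** 2)          # accelerations in the full space
        return e_control + alpha * e_dynamics + beta * e_smooth

    def keyframe_eq(v):
        X, _ = unpack(v)
        return X[key_t] - key_x                      # f_L(x) - c_L = 0

    v0 = np.random.rand(n_x + n_u)                   # random initialization, as in the text
    res = minimize(objective, v0, method='SLSQP',
                   constraints=[{'type': 'eq', 'fun': keyframe_eq}])
    X, _ = unpack(res.x)
    return X                                         # low-dimensional states; y_t = C x_t + D
```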

4.5 Results

We tested our system by generating both facial animation and human body animation from various forms of user-defined constraints. In the current implementation, the user can generate a desired facial animation by specifying an exact facial configuration, the positions of any points on the face, distances between any two points, 2D projections of any points in a user-defined screen space, or any combination of these constraints. The user can create a human body animation by specifying an exact pose, the positions and orientations of any points, angle values of any joints on the character, or any combination of these constraints. The user can specify any combination of the constraints throughout the motion or at isolated points of the motion. In the accompanying video, we demonstrate that the user can generate a wide variety of different kinds of motion.

4.5.1 Facial Expression

The system learns a single statistical model from the whole facial database and then uses it to generate a wide range of facial animations (see figure 4.8). We demonstrate that the user can generate a wide range of facial expressions using a variety of spatial-temporal constraints.

Keyframe constraints. The user uses three key frames and their timing to generate animation (see figure 4.8(a)).

Sparse spatial-temporal constraints. The user specifies 2D projections of six facial points in the screen space at three key instants and then uses them to generate facial animation (see figure 4.8(b)). This type of constraint could easily be extracted by rotoscoping.

Trajectory constraints. The user achieves detailed control over the facial movement by specifying the trajectories of a small set of 3D facial points (see figure 4.8(c)). We also demonstrate that the user can use trajectories of a small set of high-level facial features (the mouth width, the mouth height, and the openness of the eyes) to generate facial animation.

Combination of keyframe and trajectory constraints. The user generates realistic facial animation by combining sparse keyframe constraints (three key frames) and sparse trajectory constraints (one trajectory).

4.5.2 Full-body Animation

The motion capture database contains a wide range of full-body movements including locomotion (jumping, running, walking) and interaction with the environment (standing up/sitting down, reaching for/picking up/placing an object). We label the motion in the database with five basic behaviors: walking (including normal walking, walking on a slope, climbing over an obstacle, and stylized walking), running, jumping, interaction with a chair/step stool (standing up/sitting down), interaction with an object (reaching for/picking up/placing an object), and their transitions. We demonstrate that the user can generate a wide range of full-body animation using our system:

Individual behaviors. Each behavior is learned by determining a statistical model using labeled

Fig. 4.8: Facial animation generated by various spatial-temporal constraints: (a) facial animation generated by the keyframing constraints shown in figure 4.4; (b) facial animation generated by the sparse 2D constraints shown in figure 4.5; (c) facial animation generated by key trajectories of eight green facial points.

motion capture data. We demonstrate the effectiveness of our system for generating individual behaviors from sparse keyframe constraints and other spatial-temporal constraints. Our behavior-specific model is capable of generating a rich variety of actions. For example, we can use a sparse set of keyframe constraints together with the statistical walking model to generate normal walking, walking on a slope, climbing over an obstacle, and stylized walking such as Mickey Mouse walking, careful walking, and baby walking (see figures 4.9 and 4.10).

Transition from one behavior to another. Our system can also synthesize motion that transitions from one behavior to another using the statistical model learned from labeled transition data. We demonstrate that the system can generate the transition from walking to jumping, from walking to sitting down, and from walking to picking up an object (see figure 4.11) using several user-defined key frames, foot contacts, and their timings.

4.5.3 Other Experiments

We compared results using different sets of spatial-temporal constraints with ground truth motion capture data. For facial animation, we evaluate the synthesized motions by leaving out one sequence of motion capture data from the database as the testing sequence. For full-body animation, the database does not include any motion from the testing subject. A variety of constraints are computed from the testing motion capture data and then input to our system to generate an animation. In the accompanying video, we show a side-by-side comparison between ground truth walking motion and walking motions generated from sparse keyframe constraints, sparse trajectory constraints, and the combination of keyframe and trajectory constraints.

We tested the generalization capability of our statistical models within our motion optimization framework. The accompanying video demonstrates that the system can generate new motions that are not in the database. For example, we show that the system can use the statistical model learned from a short sequence of normal walking to generate walking on a new slope and walking with a new step size (see figure 4.12). Our video also demonstrates that the system can generate the motion for a new skeleton model. Finally, we compared animations generated by statistical models that are learned from different training databases. We have observed that a big and heterogeneous database might require more constraints from the user to generate natural animation.

Fig. 4.9: Full-body animation generated by key-frame constraints: (a) baby walking; (b) careful walking; (c) Mickey Mouse walking.

Fig. 4.10: Full-body animation generated by key-frame constraints: (a) climbing over an obstacle; (b) running.

4.6 Discussion

We have presented an approach for generating both full-body movement and facial expression from spatial-temporal constraints while matching the statistical properties of a database of captured motion. The system first automatically learns a low-dimensional linear dynamical model from motion capture data and then enforces this as a spatial-temporal prior to generate the motion. The statistical dynamic equations, together with an automatically derived objective function and user-defined constraints, comprise a problem of constrained optimization. Solving this constrained optimization problem in the low-dimensional space yields optimal, natural motion that achieves the goals specified by the user.

Our system allows the user to generate realistic animation from various forms of user-defined constraints. Any kinematic constraints could be integrated into our statistical optimization frame-

Fig. 4.11: Full-body animation generated by key-frame constraints: (a) motion transition from walking to jumping; (b) motion transition from walking to picking up an object; (c) motion transition from walking to sitting down.

Fig. 4.12: Data generalization: (top) a short sequence of normal walking data that is used for training a statistical model; (bottom) walking on a slope generated by the statistical model learned from the walking data shown at the top and user-defined constraints.

work as long as the constraints can be expressed as a function of the motion: c = f(y_1, ..., y_T). The user can specify the constraints at isolated points, during a period of time, or across the entire motion.

The quality of the final animation depends on two factors: the motion prior and the naturalness of the user-defined constraints. An appropriate motion capture database must be available. Without the use of motion capture data, the system will not generate natural motions unless the user can specify a very detailed set of constraints across the entire motion. If the user's intent conflicts with the naturalness defined by the statistical model, the current system will sacrifice realism to favor user intent because the user-specified constraints are enforced as hard constraints in the optimization. An alternative solution would be to define user-specified constraints as soft constraints, as part of the objective function, and formulate the motion synthesis as an unconstrained optimization problem. A tradeoff between the user's intent and the naturalness of the motion could then be achieved by adjusting the weights in the objective function.

The system achieves some generality beyond the motion capture data. For example, we have generated motion using constraints that cannot be satisfied directly by any motion in the database and found that the quality of the reconstructed motion was still good. We show the results of generating new motions such as walking at a new speed, walking with a new step size, and walking on a new slope using a short sequence of normal walking data. We also show that we can retarget a short sequence of walking motion from one skeleton model to a new skeleton model. However, we have not yet attempted to assess how far the user's constraints can stray


More information

A Morphable Model for the Synthesis of 3D Faces

A Morphable Model for the Synthesis of 3D Faces A Morphable Model for the Synthesis of 3D Faces Marco Nef Volker Blanz, Thomas Vetter SIGGRAPH 99, Los Angeles Presentation overview Motivation Introduction Database Morphable 3D Face Model Matching a

More information

CS 231A Computer Vision (Fall 2012) Problem Set 3

CS 231A Computer Vision (Fall 2012) Problem Set 3 CS 231A Computer Vision (Fall 2012) Problem Set 3 Due: Nov. 13 th, 2012 (2:15pm) 1 Probabilistic Recursion for Tracking (20 points) In this problem you will derive a method for tracking a point of interest

More information

To Do. Advanced Computer Graphics. The Story So Far. Course Outline. Rendering (Creating, shading images from geometry, lighting, materials)

To Do. Advanced Computer Graphics. The Story So Far. Course Outline. Rendering (Creating, shading images from geometry, lighting, materials) Advanced Computer Graphics CSE 190 [Spring 2015], Lecture 16 Ravi Ramamoorthi http://www.cs.ucsd.edu/~ravir To Do Assignment 3 milestone due May 29 Should already be well on way Contact us for difficulties

More information

Dynamic Time Warping for Binocular Hand Tracking and Reconstruction

Dynamic Time Warping for Binocular Hand Tracking and Reconstruction Dynamic Time Warping for Binocular Hand Tracking and Reconstruction Javier Romero, Danica Kragic Ville Kyrki Antonis Argyros CAS-CVAP-CSC Dept. of Information Technology Institute of Computer Science KTH,

More information

Video Texture. A.A. Efros

Video Texture. A.A. Efros Video Texture A.A. Efros 15-463: Computational Photography Alexei Efros, CMU, Fall 2005 Weather Forecasting for Dummies Let s predict weather: Given today s weather only, we want to know tomorrow s Suppose

More information

Computer Animation Visualization. Lecture 5. Facial animation

Computer Animation Visualization. Lecture 5. Facial animation Computer Animation Visualization Lecture 5 Facial animation Taku Komura Facial Animation The face is deformable Need to decide how all the vertices on the surface shall move Manually create them Muscle-based

More information

Human Character Animation in 3D-Graphics: The EMOTE System as a Plug-in for Maya

Human Character Animation in 3D-Graphics: The EMOTE System as a Plug-in for Maya Hartmann - 1 Bjoern Hartman Advisor: Dr. Norm Badler Applied Senior Design Project - Final Report Human Character Animation in 3D-Graphics: The EMOTE System as a Plug-in for Maya Introduction Realistic

More information

Optimal motion trajectories. Physically based motion transformation. Realistic character animation with control. Highly dynamic motion

Optimal motion trajectories. Physically based motion transformation. Realistic character animation with control. Highly dynamic motion Realistic character animation with control Optimal motion trajectories Physically based motion transformation, Popovi! and Witkin Synthesis of complex dynamic character motion from simple animation, Liu

More information

Animating Non-Human Characters using Human Motion Capture Data

Animating Non-Human Characters using Human Motion Capture Data Animating Non-Human Characters using Human Motion Capture Data Laurel Bancroft 1 and Jessica Hodgins 2 1 College of Fine Arts, Carngie Mellon University, lbancrof@andrew.cmu.edu 2 Computer Science, Carnegie

More information

CSE452 Computer Graphics

CSE452 Computer Graphics CSE452 Computer Graphics Lecture 19: From Morphing To Animation Capturing and Animating Skin Deformation in Human Motion, Park and Hodgins, SIGGRAPH 2006 CSE452 Lecture 19: From Morphing to Animation 1

More information

FACIAL ANIMATION WITH MOTION CAPTURE BASED ON SURFACE BLENDING

FACIAL ANIMATION WITH MOTION CAPTURE BASED ON SURFACE BLENDING FACIAL ANIMATION WITH MOTION CAPTURE BASED ON SURFACE BLENDING Lijia Zhu and Won-Sook Lee School of Information Technology and Engineering, University of Ottawa 800 King Edward Ave., Ottawa, Ontario, Canada,

More information

Digital Makeup Face Generation

Digital Makeup Face Generation Digital Makeup Face Generation Wut Yee Oo Mechanical Engineering Stanford University wutyee@stanford.edu Abstract Make up applications offer photoshop tools to get users inputs in generating a make up

More information

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images 1 Introduction - Steve Chuang and Eric Shan - Determining object orientation in images is a well-established topic

More information

CS 223B Computer Vision Problem Set 3

CS 223B Computer Vision Problem Set 3 CS 223B Computer Vision Problem Set 3 Due: Feb. 22 nd, 2011 1 Probabilistic Recursion for Tracking In this problem you will derive a method for tracking a point of interest through a sequence of images.

More information

Articulated Structure from Motion through Ellipsoid Fitting

Articulated Structure from Motion through Ellipsoid Fitting Int'l Conf. IP, Comp. Vision, and Pattern Recognition IPCV'15 179 Articulated Structure from Motion through Ellipsoid Fitting Peter Boyi Zhang, and Yeung Sam Hung Department of Electrical and Electronic

More information

Facial Motion Capture Editing by Automated Orthogonal Blendshape Construction and Weight Propagation

Facial Motion Capture Editing by Automated Orthogonal Blendshape Construction and Weight Propagation Facial Motion Capture Editing by Automated Orthogonal Blendshape Construction and Weight Propagation Qing Li and Zhigang Deng Department of Computer Science University of Houston Houston, TX, 77204, USA

More information

Graph-based High Level Motion Segmentation using Normalized Cuts

Graph-based High Level Motion Segmentation using Normalized Cuts Graph-based High Level Motion Segmentation using Normalized Cuts Sungju Yun, Anjin Park and Keechul Jung Abstract Motion capture devices have been utilized in producing several contents, such as movies

More information

Homework 2 Questions? Animation, Motion Capture, & Inverse Kinematics. Velocity Interpolation. Handing Free Surface with MAC

Homework 2 Questions? Animation, Motion Capture, & Inverse Kinematics. Velocity Interpolation. Handing Free Surface with MAC Homework 2 Questions? Animation, Motion Capture, & Inverse Kinematics Velocity Interpolation Original image from Foster & Metaxas, 1996 In 2D: For each axis, find the 4 closest face velocity samples: Self-intersecting

More information

Face Cyclographs for Recognition

Face Cyclographs for Recognition Face Cyclographs for Recognition Guodong Guo Department of Computer Science North Carolina Central University E-mail: gdguo@nccu.edu Charles R. Dyer Computer Sciences Department University of Wisconsin-Madison

More information

An Interactive Technique for Robot Control by Using Image Processing Method

An Interactive Technique for Robot Control by Using Image Processing Method An Interactive Technique for Robot Control by Using Image Processing Method Mr. Raskar D. S 1., Prof. Mrs. Belagali P. P 2 1, E&TC Dept. Dr. JJMCOE., Jaysingpur. Maharashtra., India. 2 Associate Prof.

More information

CITY UNIVERSITY OF HONG KONG 香港城市大學

CITY UNIVERSITY OF HONG KONG 香港城市大學 CITY UNIVERSITY OF HONG KONG 香港城市大學 Modeling of Single Character Motions with Temporal Sparse Representation and Gaussian Processes for Human Motion Retrieval and Synthesis 基於時域稀疏表示和高斯過程的單角色動作模型的建立及其在動作檢索和生成的應用

More information

A Robust Method of Facial Feature Tracking for Moving Images

A Robust Method of Facial Feature Tracking for Moving Images A Robust Method of Facial Feature Tracking for Moving Images Yuka Nomura* Graduate School of Interdisciplinary Information Studies, The University of Tokyo Takayuki Itoh Graduate School of Humanitics and

More information

arxiv: v1 [cs.cv] 28 Sep 2018

arxiv: v1 [cs.cv] 28 Sep 2018 Camera Pose Estimation from Sequence of Calibrated Images arxiv:1809.11066v1 [cs.cv] 28 Sep 2018 Jacek Komorowski 1 and Przemyslaw Rokita 2 1 Maria Curie-Sklodowska University, Institute of Computer Science,

More information

Course Outline. Advanced Computer Graphics. Animation. The Story So Far. Animation. To Do

Course Outline. Advanced Computer Graphics. Animation. The Story So Far. Animation. To Do Advanced Computer Graphics CSE 163 [Spring 2017], Lecture 18 Ravi Ramamoorthi http://www.cs.ucsd.edu/~ravir 3D Graphics Pipeline Modeling (Creating 3D Geometry) Course Outline Rendering (Creating, shading

More information

Probabilistic Tracking and Reconstruction of 3D Human Motion in Monocular Video Sequences

Probabilistic Tracking and Reconstruction of 3D Human Motion in Monocular Video Sequences Probabilistic Tracking and Reconstruction of 3D Human Motion in Monocular Video Sequences Presentation of the thesis work of: Hedvig Sidenbladh, KTH Thesis opponent: Prof. Bill Freeman, MIT Thesis supervisors

More information

CS 231A Computer Vision (Winter 2014) Problem Set 3

CS 231A Computer Vision (Winter 2014) Problem Set 3 CS 231A Computer Vision (Winter 2014) Problem Set 3 Due: Feb. 18 th, 2015 (11:59pm) 1 Single Object Recognition Via SIFT (45 points) In his 2004 SIFT paper, David Lowe demonstrates impressive object recognition

More information

Last Time? Animation, Motion Capture, & Inverse Kinematics. Today. Keyframing. Physically-Based Animation. Procedural Animation

Last Time? Animation, Motion Capture, & Inverse Kinematics. Today. Keyframing. Physically-Based Animation. Procedural Animation Last Time? Animation, Motion Capture, & Inverse Kinematics Navier-Stokes Equations Conservation of Momentum & Mass Incompressible Flow Today How do we animate? Keyframing Procedural Animation Physically-Based

More information

Last Time? Inverse Kinematics. Today. Keyframing. Physically-Based Animation. Procedural Animation

Last Time? Inverse Kinematics. Today. Keyframing. Physically-Based Animation. Procedural Animation Last Time? Inverse Kinematics Navier-Stokes Equations Conservation of Momentum & Mass Incompressible Flow Today How do we animate? Keyframing Procedural Animation Physically-Based Animation Forward and

More information

Motion Interpretation and Synthesis by ICA

Motion Interpretation and Synthesis by ICA Motion Interpretation and Synthesis by ICA Renqiang Min Department of Computer Science, University of Toronto, 1 King s College Road, Toronto, ON M5S3G4, Canada Abstract. It is known that high-dimensional

More information

CHAPTER 3 DISPARITY AND DEPTH MAP COMPUTATION

CHAPTER 3 DISPARITY AND DEPTH MAP COMPUTATION CHAPTER 3 DISPARITY AND DEPTH MAP COMPUTATION In this chapter we will discuss the process of disparity computation. It plays an important role in our caricature system because all 3D coordinates of nodes

More information

The Template Update Problem

The Template Update Problem The Template Update Problem Iain Matthews, Takahiro Ishikawa, and Simon Baker The Robotics Institute Carnegie Mellon University Abstract Template tracking dates back to the 1981 Lucas-Kanade algorithm.

More information

Gesture Recognition: Hand Pose Estimation. Adrian Spurr Ubiquitous Computing Seminar FS

Gesture Recognition: Hand Pose Estimation. Adrian Spurr Ubiquitous Computing Seminar FS Gesture Recognition: Hand Pose Estimation Adrian Spurr Ubiquitous Computing Seminar FS2014 27.05.2014 1 What is hand pose estimation? Input Computer-usable form 2 Augmented Reality Gaming Robot Control

More information

Abstract We present a system which automatically generates a 3D face model from a single frontal image of a face. Our system consists of two component

Abstract We present a system which automatically generates a 3D face model from a single frontal image of a face. Our system consists of two component A Fully Automatic System To Model Faces From a Single Image Zicheng Liu Microsoft Research August 2003 Technical Report MSR-TR-2003-55 Microsoft Research Microsoft Corporation One Microsoft Way Redmond,

More information

Feature Tracking and Optical Flow

Feature Tracking and Optical Flow Feature Tracking and Optical Flow Prof. D. Stricker Doz. G. Bleser Many slides adapted from James Hays, Derek Hoeim, Lana Lazebnik, Silvio Saverse, who 1 in turn adapted slides from Steve Seitz, Rick Szeliski,

More information

Stereo and Epipolar geometry

Stereo and Epipolar geometry Previously Image Primitives (feature points, lines, contours) Today: Stereo and Epipolar geometry How to match primitives between two (multiple) views) Goals: 3D reconstruction, recognition Jana Kosecka

More information

Motion Capture & Simulation

Motion Capture & Simulation Motion Capture & Simulation Motion Capture Character Reconstructions Joint Angles Need 3 points to compute a rigid body coordinate frame 1 st point gives 3D translation, 2 nd point gives 2 angles, 3 rd

More information

M I RA Lab. Speech Animation. Where do we stand today? Speech Animation : Hierarchy. What are the technologies?

M I RA Lab. Speech Animation. Where do we stand today? Speech Animation : Hierarchy. What are the technologies? MIRALab Where Research means Creativity Where do we stand today? M I RA Lab Nadia Magnenat-Thalmann MIRALab, University of Geneva thalmann@miralab.unige.ch Video Input (face) Audio Input (speech) FAP Extraction

More information

TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA

TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA Tomoki Hayashi 1, Francois de Sorbier 1 and Hideo Saito 1 1 Graduate School of Science and Technology, Keio University, 3-14-1 Hiyoshi,

More information

Why study Computer Vision?

Why study Computer Vision? Why study Computer Vision? Images and movies are everywhere Fast-growing collection of useful applications building representations of the 3D world from pictures automated surveillance (who s doing what)

More information

CS 231. Deformation simulation (and faces)

CS 231. Deformation simulation (and faces) CS 231 Deformation simulation (and faces) Deformation BODY Simulation Discretization Spring-mass models difficult to model continuum properties Simple & fast to implement and understand Finite Element

More information

Gaze interaction (2): models and technologies

Gaze interaction (2): models and technologies Gaze interaction (2): models and technologies Corso di Interazione uomo-macchina II Prof. Giuseppe Boccignone Dipartimento di Scienze dell Informazione Università di Milano boccignone@dsi.unimi.it http://homes.dsi.unimi.it/~boccignone/l

More information

Using temporal seeding to constrain the disparity search range in stereo matching

Using temporal seeding to constrain the disparity search range in stereo matching Using temporal seeding to constrain the disparity search range in stereo matching Thulani Ndhlovu Mobile Intelligent Autonomous Systems CSIR South Africa Email: tndhlovu@csir.co.za Fred Nicolls Department

More information

Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos

Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos Sung Chun Lee, Chang Huang, and Ram Nevatia University of Southern California, Los Angeles, CA 90089, USA sungchun@usc.edu,

More information

THE capability to precisely synthesize online fullbody

THE capability to precisely synthesize online fullbody 1180 JOURNAL OF MULTIMEDIA, VOL. 9, NO. 10, OCTOBER 2014 Sparse Constrained Motion Synthesis Using Local Regression Models Huajun Liu a, Fuxi Zhu a a School of Computer, Wuhan University, Wuhan 430072,

More information

Factorization Method Using Interpolated Feature Tracking via Projective Geometry

Factorization Method Using Interpolated Feature Tracking via Projective Geometry Factorization Method Using Interpolated Feature Tracking via Projective Geometry Hideo Saito, Shigeharu Kamijima Department of Information and Computer Science, Keio University Yokohama-City, 223-8522,

More information

Model-Based Face Computation

Model-Based Face Computation Model-Based Face Computation 1. Research Team Project Leader: Post Doc(s): Graduate Students: Prof. Ulrich Neumann, IMSC and Computer Science John P. Lewis Hea-juen Hwang, Zhenyao Mo, Gordon Thomas 2.

More information

Animations. Hakan Bilen University of Edinburgh. Computer Graphics Fall Some slides are courtesy of Steve Marschner and Kavita Bala

Animations. Hakan Bilen University of Edinburgh. Computer Graphics Fall Some slides are courtesy of Steve Marschner and Kavita Bala Animations Hakan Bilen University of Edinburgh Computer Graphics Fall 2017 Some slides are courtesy of Steve Marschner and Kavita Bala Animation Artistic process What are animators trying to do? What tools

More information

Meticulously Detailed Eye Model and Its Application to Analysis of Facial Image

Meticulously Detailed Eye Model and Its Application to Analysis of Facial Image Meticulously Detailed Eye Model and Its Application to Analysis of Facial Image Tsuyoshi Moriyama Keio University moriyama@ozawa.ics.keio.ac.jp Jing Xiao Carnegie Mellon University jxiao@cs.cmu.edu Takeo

More information

Face Tracking. Synonyms. Definition. Main Body Text. Amit K. Roy-Chowdhury and Yilei Xu. Facial Motion Estimation

Face Tracking. Synonyms. Definition. Main Body Text. Amit K. Roy-Chowdhury and Yilei Xu. Facial Motion Estimation Face Tracking Amit K. Roy-Chowdhury and Yilei Xu Department of Electrical Engineering, University of California, Riverside, CA 92521, USA {amitrc,yxu}@ee.ucr.edu Synonyms Facial Motion Estimation Definition

More information

Video based Animation Synthesis with the Essential Graph. Adnane Boukhayma, Edmond Boyer MORPHEO INRIA Grenoble Rhône-Alpes

Video based Animation Synthesis with the Essential Graph. Adnane Boukhayma, Edmond Boyer MORPHEO INRIA Grenoble Rhône-Alpes Video based Animation Synthesis with the Essential Graph Adnane Boukhayma, Edmond Boyer MORPHEO INRIA Grenoble Rhône-Alpes Goal Given a set of 4D models, how to generate realistic motion from user specified

More information

Motion Capture. CS 448D: Character Animation Prof. Vladlen Koltun Stanford University

Motion Capture. CS 448D: Character Animation Prof. Vladlen Koltun Stanford University Motion Capture CS 448D: Character Animation Prof. Vladlen Koltun Stanford University History: Rotoscope Trace animated character over recorded actor, frame-by-frame Invented by Max Fleicher in 1915 and

More information

Animation by Adaptation Tutorial 1: Animation Basics

Animation by Adaptation Tutorial 1: Animation Basics Animation by Adaptation Tutorial 1: Animation Basics Michael Gleicher Graphics Group Department of Computer Sciences University of Wisconsin Madison http://www.cs.wisc.edu/graphics Outline Talk #1: Basics

More information

CS-184: Computer Graphics

CS-184: Computer Graphics CS-184: Computer Graphics Lecture #19: Motion Capture!!! Prof. James O Brien! University of California, Berkeley!! V2015-S-18-1.0 Today 1 18-MoCap.key - April 8, 2015 Motion Capture 2 2 18-MoCap.key -

More information

VISEME SPACE FOR REALISTIC SPEECH ANIMATION

VISEME SPACE FOR REALISTIC SPEECH ANIMATION VISEME SPACE FOR REALISTIC SPEECH ANIMATION Sumedha Kshirsagar, Nadia Magnenat-Thalmann MIRALab CUI, University of Geneva {sumedha, thalmann}@miralab.unige.ch http://www.miralab.unige.ch ABSTRACT For realistic

More information

Human Upper Body Pose Estimation in Static Images

Human Upper Body Pose Estimation in Static Images 1. Research Team Human Upper Body Pose Estimation in Static Images Project Leader: Graduate Students: Prof. Isaac Cohen, Computer Science Mun Wai Lee 2. Statement of Project Goals This goal of this project

More information

CSE 252B: Computer Vision II

CSE 252B: Computer Vision II CSE 252B: Computer Vision II Lecturer: Serge Belongie Scribes: Jeremy Pollock and Neil Alldrin LECTURE 14 Robust Feature Matching 14.1. Introduction Last lecture we learned how to find interest points

More information

Canny Edge Based Self-localization of a RoboCup Middle-sized League Robot

Canny Edge Based Self-localization of a RoboCup Middle-sized League Robot Canny Edge Based Self-localization of a RoboCup Middle-sized League Robot Yoichi Nakaguro Sirindhorn International Institute of Technology, Thammasat University P.O. Box 22, Thammasat-Rangsit Post Office,

More information

Last Time? Animation, Motion Capture, & Inverse Kinematics. Today. Keyframing. Physically-Based Animation. Procedural Animation

Last Time? Animation, Motion Capture, & Inverse Kinematics. Today. Keyframing. Physically-Based Animation. Procedural Animation Last Time? Animation, Motion Capture, & Inverse Kinematics Navier-Stokes Equations Conservation of Momentum & Mass Incompressible Flow Today How do we animate? Keyframing Procedural Animation Physically-Based

More information

Learning a generic 3D face model from 2D image databases using incremental structure from motion

Learning a generic 3D face model from 2D image databases using incremental structure from motion Learning a generic 3D face model from 2D image databases using incremental structure from motion Jose Gonzalez-Mora 1,, Fernando De la Torre b, Nicolas Guil 1,, Emilio L. Zapata 1 a Department of Computer

More information

3D Human Motion Analysis and Manifolds

3D Human Motion Analysis and Manifolds D E P A R T M E N T O F C O M P U T E R S C I E N C E U N I V E R S I T Y O F C O P E N H A G E N 3D Human Motion Analysis and Manifolds Kim Steenstrup Pedersen DIKU Image group and E-Science center Motivation

More information

Announcements. Midterms back at end of class ½ lecture and ½ demo in mocap lab. Have you started on the ray tracer? If not, please do due April 10th

Announcements. Midterms back at end of class ½ lecture and ½ demo in mocap lab. Have you started on the ray tracer? If not, please do due April 10th Announcements Midterms back at end of class ½ lecture and ½ demo in mocap lab Have you started on the ray tracer? If not, please do due April 10th 1 Overview of Animation Section Techniques Traditional

More information

Motion Estimation. There are three main types (or applications) of motion estimation:

Motion Estimation. There are three main types (or applications) of motion estimation: Members: D91922016 朱威達 R93922010 林聖凱 R93922044 謝俊瑋 Motion Estimation There are three main types (or applications) of motion estimation: Parametric motion (image alignment) The main idea of parametric motion

More information

Lecture 14: Computer Vision

Lecture 14: Computer Vision CS/b: Artificial Intelligence II Prof. Olga Veksler Lecture : Computer Vision D shape from Images Stereo Reconstruction Many Slides are from Steve Seitz (UW), S. Narasimhan Outline Cues for D shape perception

More information

Human Shape from Silhouettes using Generative HKS Descriptors and Cross-Modal Neural Networks

Human Shape from Silhouettes using Generative HKS Descriptors and Cross-Modal Neural Networks Human Shape from Silhouettes using Generative HKS Descriptors and Cross-Modal Neural Networks Endri Dibra 1, Himanshu Jain 1, Cengiz Öztireli 1, Remo Ziegler 2, Markus Gross 1 1 Department of Computer

More information

Announcements. New version of assignment 1 on the web page: Tuesday s class in the motion capture lab:

Announcements. New version of assignment 1 on the web page: Tuesday s class in the motion capture lab: Announcements New version of assignment 1 on the web page: www.cs.cmu.edu/~jkh/anim_class.html Test login procedure NOW! Tuesday s class in the motion capture lab: Wean1326 Volunteers needed for capture

More information

Computer Animation and Visualisation. Lecture 3. Motion capture and physically-based animation of characters

Computer Animation and Visualisation. Lecture 3. Motion capture and physically-based animation of characters Computer Animation and Visualisation Lecture 3. Motion capture and physically-based animation of characters Character Animation There are three methods Create them manually Use real human / animal motions

More information

RENDERING AND ANALYSIS OF FACES USING MULTIPLE IMAGES WITH 3D GEOMETRY. Peter Eisert and Jürgen Rurainsky

RENDERING AND ANALYSIS OF FACES USING MULTIPLE IMAGES WITH 3D GEOMETRY. Peter Eisert and Jürgen Rurainsky RENDERING AND ANALYSIS OF FACES USING MULTIPLE IMAGES WITH 3D GEOMETRY Peter Eisert and Jürgen Rurainsky Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institute Image Processing Department

More information

A novel approach to motion tracking with wearable sensors based on Probabilistic Graphical Models

A novel approach to motion tracking with wearable sensors based on Probabilistic Graphical Models A novel approach to motion tracking with wearable sensors based on Probabilistic Graphical Models Emanuele Ruffaldi Lorenzo Peppoloni Alessandro Filippeschi Carlo Alberto Avizzano 2014 IEEE International

More information

Prof. Fanny Ficuciello Robotics for Bioengineering Visual Servoing

Prof. Fanny Ficuciello Robotics for Bioengineering Visual Servoing Visual servoing vision allows a robotic system to obtain geometrical and qualitative information on the surrounding environment high level control motion planning (look-and-move visual grasping) low level

More information

Computer and Machine Vision

Computer and Machine Vision Computer and Machine Vision Lecture Week 10 Part-2 Skeletal Models and Face Detection March 21, 2014 Sam Siewert Outline of Week 10 Lab #4 Overview Lab #5 and #6 Extended Lab Overview SIFT and SURF High

More information

CS 231. Deformation simulation (and faces)

CS 231. Deformation simulation (and faces) CS 231 Deformation simulation (and faces) 1 Cloth Simulation deformable surface model Represent cloth model as a triangular or rectangular grid Points of finite mass as vertices Forces or energies of points

More information

Product information. Hi-Tech Electronics Pte Ltd

Product information. Hi-Tech Electronics Pte Ltd Product information Introduction TEMA Motion is the world leading software for advanced motion analysis. Starting with digital image sequences the operator uses TEMA Motion to track objects in images,

More information

Speech Driven Synthesis of Talking Head Sequences

Speech Driven Synthesis of Talking Head Sequences 3D Image Analysis and Synthesis, pp. 5-56, Erlangen, November 997. Speech Driven Synthesis of Talking Head Sequences Peter Eisert, Subhasis Chaudhuri,andBerndGirod Telecommunications Laboratory, University

More information

Computer Graphics. Si Lu. Fall uter_graphics.htm 11/27/2017

Computer Graphics. Si Lu. Fall uter_graphics.htm 11/27/2017 Computer Graphics Si Lu Fall 2017 http://web.cecs.pdx.edu/~lusi/cs447/cs447_547_comp uter_graphics.htm 11/27/2017 Last time o Ray tracing 2 Today o Animation o Final Exam: 14:00-15:30, Novermber 29, 2017

More information

Tracking facial features using low resolution and low fps cameras under variable light conditions

Tracking facial features using low resolution and low fps cameras under variable light conditions Tracking facial features using low resolution and low fps cameras under variable light conditions Peter Kubíni * Department of Computer Graphics Comenius University Bratislava / Slovakia Abstract We are

More information

CS 231. Inverse Kinematics Intro to Motion Capture. 3D characters. Representation. 1) Skeleton Origin (root) Joint centers/ bones lengths

CS 231. Inverse Kinematics Intro to Motion Capture. 3D characters. Representation. 1) Skeleton Origin (root) Joint centers/ bones lengths CS Inverse Kinematics Intro to Motion Capture Representation D characters ) Skeleton Origin (root) Joint centers/ bones lengths ) Keyframes Pos/Rot Root (x) Joint Angles (q) Kinematics study of static

More information

Motion Editing with Data Glove

Motion Editing with Data Glove Motion Editing with Data Glove Wai-Chun Lam City University of Hong Kong 83 Tat Chee Ave Kowloon, Hong Kong email:jerrylam@cityu.edu.hk Feng Zou City University of Hong Kong 83 Tat Chee Ave Kowloon, Hong

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 14 130307 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Stereo Dense Motion Estimation Translational

More information

Modeling 3D Human Poses from Uncalibrated Monocular Images

Modeling 3D Human Poses from Uncalibrated Monocular Images Modeling 3D Human Poses from Uncalibrated Monocular Images Xiaolin K. Wei Texas A&M University xwei@cse.tamu.edu Jinxiang Chai Texas A&M University jchai@cse.tamu.edu Abstract This paper introduces an

More information

A Low Power, High Throughput, Fully Event-Based Stereo System: Supplementary Documentation

A Low Power, High Throughput, Fully Event-Based Stereo System: Supplementary Documentation A Low Power, High Throughput, Fully Event-Based Stereo System: Supplementary Documentation Alexander Andreopoulos, Hirak J. Kashyap, Tapan K. Nayak, Arnon Amir, Myron D. Flickner IBM Research March 25,

More information

Finally: Motion and tracking. Motion 4/20/2011. CS 376 Lecture 24 Motion 1. Video. Uses of motion. Motion parallax. Motion field

Finally: Motion and tracking. Motion 4/20/2011. CS 376 Lecture 24 Motion 1. Video. Uses of motion. Motion parallax. Motion field Finally: Motion and tracking Tracking objects, video analysis, low level motion Motion Wed, April 20 Kristen Grauman UT-Austin Many slides adapted from S. Seitz, R. Szeliski, M. Pollefeys, and S. Lazebnik

More information

Fast Natural Feature Tracking for Mobile Augmented Reality Applications

Fast Natural Feature Tracking for Mobile Augmented Reality Applications Fast Natural Feature Tracking for Mobile Augmented Reality Applications Jong-Seung Park 1, Byeong-Jo Bae 2, and Ramesh Jain 3 1 Dept. of Computer Science & Eng., University of Incheon, Korea 2 Hyundai

More information

Synthesis by Example. Connecting Motion Planning and Example based Movement. Michael Gleicher

Synthesis by Example. Connecting Motion Planning and Example based Movement. Michael Gleicher Synthesis by Example Connecting Motion Planning and Example based Movement Michael Gleicher Dept of Computer Sciences University of Wisconsin Madison Case Study 1 Part I. Michael Gleicher 1 What is Motion

More information

Chapter 9 Object Tracking an Overview

Chapter 9 Object Tracking an Overview Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging

More information

cse 252c Fall 2004 Project Report: A Model of Perpendicular Texture for Determining Surface Geometry

cse 252c Fall 2004 Project Report: A Model of Perpendicular Texture for Determining Surface Geometry cse 252c Fall 2004 Project Report: A Model of Perpendicular Texture for Determining Surface Geometry Steven Scher December 2, 2004 Steven Scher SteveScher@alumni.princeton.edu Abstract Three-dimensional

More information

Motion and Tracking. Andrea Torsello DAIS Università Ca Foscari via Torino 155, Mestre (VE)

Motion and Tracking. Andrea Torsello DAIS Università Ca Foscari via Torino 155, Mestre (VE) Motion and Tracking Andrea Torsello DAIS Università Ca Foscari via Torino 155, 30172 Mestre (VE) Motion Segmentation Segment the video into multiple coherently moving objects Motion and Perceptual Organization

More information

Motion Texture. Harriet Pashley Advisor: Yanxi Liu Ph.D. Student: James Hays. 1. Introduction

Motion Texture. Harriet Pashley Advisor: Yanxi Liu Ph.D. Student: James Hays. 1. Introduction Motion Texture Harriet Pashley Advisor: Yanxi Liu Ph.D. Student: James Hays 1. Introduction Motion capture data is often used in movies and video games because it is able to realistically depict human

More information

On Modeling Variations for Face Authentication

On Modeling Variations for Face Authentication On Modeling Variations for Face Authentication Xiaoming Liu Tsuhan Chen B.V.K. Vijaya Kumar Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 xiaoming@andrew.cmu.edu

More information