An Example-based Approach to Text-driven Speech Animation with Emotional Expressions

Hyewon Pyun, Wonseok Chae, Yejin Kim, Hyungwoo Kang, and Sung Yong Shin

CS/TR, July 19, 2004
KAIST, Department of Computer Science

1 Introduction

1.1 Motivation

Visual speech animations of virtual characters have been playing increasingly important roles in computer graphics applications such as computer games, movies, and internet broadcasting. Beyond those applications, the face is the most important part of the human body in everyday life: it is how we recognize each other and express our emotions. Realistic facial animation is therefore effective in human-computer interaction. For example, virtual characters with talking faces are now widely used as guiding agents at information desks, digital actors in movies, and avatars in internet chatrooms.

In general, the visual speech animation of a virtual character must be synchronized with a given input speech track. Most previous speech animation approaches have therefore focused on lip synchronization [2, 4, 5, 6, 7, 11, 24]; that is, they have mainly been concerned with producing a visual speech animation in which lip movements are synchronized with the speech track. For a more natural and realistic facial animation, however, it is necessary to incorporate emotional expressions into such a pure lip-sync animation. While many approaches have been proposed for generating emotional expressions on given face models [1, 9, 10, 12, 15, 18, 21, 23, 25], none of them provides an explicit solution for combining lip movements and facial expressions seamlessly.

In this paper, we propose a novel scheme for the real-time speech animation of a 3D face model that effectively combines lip-sync movements with emotional expressions. To achieve this goal, we address three issues. First, for realistic 3D lip synchronization, we give a simple, effective scheme for producing a text-driven lip-sync animation with coarticulation effects. Second, for fast and intuitive facial expression control, we provide an example-based expression synthesis scheme based on scattered data interpolation. Finally, for combining the lip-sync animation with emotional expressions in an on-line manner, we present an importance-based scheme for compositing facial models. The resulting facial animations are smooth, expressive, easy to control, and generated in real time.

1.2 Related Work

Speech Animation

Many methods have been proposed for generating a visual speech animation synchronized with a given speech track; they produce either 2D or 3D facial animations. 2D methods produced video streams based on analysis of input videos or image samples [2, 4, 6, 7]. While the 2D methods were capable of generating realistic speech animations, their main concern was lip synchronization. To incorporate upper-face expressions as well, Brand proposed a whole-face speech animation method driven by an audio signal, based on a statistical training model [3]. With this method, however, it is not easy for animators to control expressions or predict the animation results.

3D speech animation methods provided animators with more freedom in viewing, lighting, manipulation, and reuse. Based on Parke's parametric facial expression model [14], Pearce et al. developed an early system for 3D speech animation [17]. This system was able to convert a string of phonemes into time-varying control parameters to produce an animation sequence. Cohen and Massaro further extended this approach to generate a 3D speech animation from text [5]. Waters et al. developed a text-driven 3D speech animation system called DECface, based on the key-framing of 3D viseme models [24]. Kalberer et al. proposed a similar method, but captured the viseme models from a talking human face with a 3D digitizer to produce more realistic speech animation [11]. All of these 3D methods have also focused mainly on lip synchronization. Although the parametric approach allows for the control of upper-face expressions, manipulating facial expressions with a set of parameters is not intuitive for animators.

Expression Modeling

There is also a rich body of work on creating whole-face expressions independently of a speech track. Physically-based approaches synthesized facial expressions by simulating the physical properties of facial skin and muscles [12, 21, 23]. Parke proposed a parametric approach that represents the motion of a group of vertices to generate a wide range of facial expressions [15].

In performance-driven approaches, facial animations were generated from facial motion data captured from the live performance of an actor [9, 25]. Kalra et al. used free-form deformation to manipulate facial expressions [10]. Pighin et al. presented an example-based approach to generate photorealistic 3D facial expressions from a set of 2D photographs [18], and Blanz and Vetter proposed an expression synthesis scheme using a large set of example 3D face models [1]. Most of these approaches have focused on pure expression synthesis, without addressing the problem of synchronization with a given speech track.

1.3 Overview

As shown in Figure 1, our method consists of three main components responsible for lip movements, emotional expressions, and their composition, respectively. Our method first generates a visual speech animation synchronized with a speech track obtained from text. We use an available TTS (Text-To-Speech) system to obtain a synthesized speech track, together with the corresponding phonemes and their lengths. As a preprocess, we construct a set of 3D face models called visemes (the term viseme is an abbreviation of "visual phoneme"), that is, the visual counterparts of phonemes [8]. Regarding a sequence of 3D viseme models as key-frames, our method then generates a visual speech animation synchronized with the phoneme sequence by interpolating the viseme models. To reflect coarticulation effects, the viseme model at each key-frame is adjusted in accordance with the length of the corresponding phoneme.

Figure 1: System overview (text → TTS → lip-sync animation generation; user-supplied emotional parameters → facial expression animation generation; composition → speech animation).
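To make the data flow of Figure 1 concrete, the following Python sketch shows one way the three components could be driven frame by frame. It is illustrative only and not the authors' implementation; the three callables stand in for the procedures of Sections 2, 3, and 4, and their names are ours.

def animate(phoneme_lengths, emotion_curve, synth_viseme, synth_expression,
            composite, fps=30.0):
    """Per-frame driver for the pipeline of Figure 1 (a sketch).

    phoneme_lengths: phoneme durations in seconds, as reported by the TTS system.
    emotion_curve:   maps a time t to a 2D emotion parameter vector (Section 3).
    synth_viseme, synth_expression, composite: placeholder callables for the
    components described in Sections 2, 3, and 4, respectively.
    """
    total = float(sum(phoneme_lengths))
    frames = []
    for k in range(int(total * fps)):
        t = k / fps
        viseme = synth_viseme(t)                          # lip-sync face at time t
        expression = synth_expression(emotion_curve(t))   # emotion face at time t
        frames.append(composite(viseme, expression))      # Section 4 composition
    return frames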

While synthesizing the output viseme model at each frame, our method simultaneously produces an emotional expression model to be combined with the output viseme model. For intuitive and efficient synthesis of the emotional expression model, we adopt the scattered data interpolation scheme proposed by Sloan et al. [20]. Referring to the emotion space diagram [19], we first parameterize a set of example 3D face models with key-expressions on a 2D space. Given a parameter vector, we obtain the corresponding emotional expression model by blending the key-expression models. Based on cardinal basis functions, this scheme produces the expression model in real time.

Finally, we combine the output viseme model with the expression model at every frame to obtain a smooth, expressive, and realistic 3D facial animation. To avoid conflicts between the viseme model and the expression model, we propose an importance-based approach for their seamless composition. That is, we assign each vertex of the face model an importance value based on its relative contribution to accurate pronunciation with respect to the emotional expressions. The viseme model and the expression model at each frame are blended vertex-wise using the importance values as weights, producing a convincing visual speech animation with emotional expressions in real time. While the output viseme model is synthesized at every frame driven by the input text, the corresponding expression model is generated simultaneously in accordance with a stream of emotion parameters interactively fed by an animator.

The remainder of this paper is organized as follows. In Sections 2 and 3, we describe our methods for generating the output viseme models and expression models, respectively. We explain how to combine them to obtain the final result in Section 4. In Section 5, we show some experimental results. Finally, we conclude this paper and discuss future research in Section 6.

2 Lip-Sync Animation

In this section, we present our method for generating a visual speech animation synchronized with a given speech track. In general, the speech track can be obtained either by recording a human voice or from a speech synthesis system. Since our objective is to develop a text-driven speech animation system, we utilize an available TTS system to acquire the input speech track. One advantage of using a TTS system is that we can directly obtain the sequence of phonemes and their lengths. Given such information, our initial objective is to generate the lip-motion animation synchronized with the phoneme sequence.

For generating realistic lip movements, we predefine a set of key-viseme models for English phonemes, based on the assumption that any output viseme model can be made by blending a finite set of key-visemes [6]. Although the number of phonemes in American English is larger, the number of corresponding key-visemes can be much smaller, since a single viseme can represent two or more phonemes. We select 14 key-visemes for our experiments: 6 key-visemes represent the consonantal phonemes, 7 key-visemes represent the vocalic phonemes, and one extra key-viseme represents silence. This last key-viseme is also used as the base model, whose shape is deformed to create all other key-viseme models. Thus, the animator needs to construct the 14 3D face models corresponding to the key-visemes. Figures 2 and 3 show the key-viseme models for the vowels and the consonants, respectively.

Figure 2: Key-viseme models for the vowels (a, ae, o, uh, u, e, i).

Figure 3: Key-viseme models for the consonants (g, d, m, r, f, ch).

Figure 4: Key-viseme models for an example phoneme sequence ([Hello]: h, e, l, o, u).

From the set of key-viseme models, we select the models corresponding to the phonemes in the input sequence (Figure 4).

Let H and L denote the phoneme sequence and the phoneme length sequence, respectively, obtained from the TTS system, and let P denote the corresponding key-viseme sequence:

H = {H_1, H_2, ..., H_m},   (1)
L = {l_1, l_2, ..., l_m},   (2)
P = {P_1, P_2, ..., P_m},   (3)

where each key-viseme model P_j is a polygonal mesh composed of a set of vertices v_i^j = (x_i^j, y_i^j, z_i^j), 1 ≤ i ≤ n, and a set of edges connecting them. From L, we first compute the time instant t_j, 1 ≤ j ≤ m, at which each key-viseme model P_j is located along the time axis. Let T be the sequence of such time instants:

T = {t_1, t_2, ..., t_m}.   (4)

Assuming that a local extremum is achieved at every key-frame, where a key-viseme model is placed, we compute each t_j for key-viseme P_j as follows:

t_j = l_j / 2 + \sum_{k=1}^{j-1} l_k.   (5)

Here, we also assume that each key-viseme model is placed at the middle of the corresponding phoneme duration along the time axis. With every key-viseme model P_j placed at the corresponding time t_j, we first describe our basic interpolation scheme and then elaborate it to handle more complex cases such as long phoneme durations and coarticulation. Our basic interpolation scheme constructs a piecewise cubic spline interpolating each sequence of corresponding vertices of the selected key-viseme models (Figure 5).

Figure 5: A piecewise cubic spline interpolation.
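As a small illustration of Equation (5), the Python snippet below (ours, not part of the report) computes the key-frame times from a list of phoneme lengths returned by the TTS system.

import numpy as np

def keyframe_times(lengths):
    # Equation (5): place each key-viseme at the middle of its phoneme,
    # t_j = l_j / 2 + sum of the preceding phoneme lengths.
    lengths = np.asarray(lengths, dtype=float)
    starts = np.concatenate(([0.0], np.cumsum(lengths)[:-1]))
    return starts + lengths / 2.0

# Example: three phonemes lasting 0.08 s, 0.12 s, and 0.10 s
print(keyframe_times([0.08, 0.12, 0.10]))   # -> [0.04 0.14 0.25]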

We set the tangent vector of this curve at every key-frame to zero to ensure a local extremum at the key-frame. The x coordinate of the vertex v_i at time t (t_j ≤ t ≤ t_{j+1}) is represented by a cubic polynomial:

x_i(t) = a t^3 + b t^2 + c t + d.   (6)

Here, the coefficients a, b, c, d are obtained from the following constraints:

x_i(t_j) = x_i^j,   x_i(t_{j+1}) = x_i^{j+1},   x_i'(t_j) = x_i'(t_{j+1}) = 0,   (7)

where x_i^j and x_i^{j+1} denote the x coordinates of the vertex v_i at times t_j and t_{j+1}, respectively. The y and z coordinates are computed similarly.

We now explain how to handle a very long phoneme duration, where the same lip shape needs to be kept for a while. For each phoneme H_j, we assign a viseme maintenance interval M_j centered at t_j, whose length l(M_j) is defined as follows:

l(M_j) = l_j - δ_1  if l_j > δ_1,  and  0 otherwise.   (8)

If l_j > δ_1, we use both ends of this interval (denoted t_j^s and t_j^e) instead of t_j for spline interpolation with the neighboring key-visemes. Phoneme H_3 in Figure 5 shows this case; its magnified view is given in Figure 6.

Figure 6: A viseme maintenance interval.

Finally, we describe how to incorporate coarticulation effects into our visual speech animation. When a real person speaks, the lips do not always fully reach each key-viseme in the sequence, since the duration of each phoneme is usually very short. Therefore, the lip shapes of the neighboring phonemes tend to force the current viseme toward a similar lip shape. That is, the viseme for each phoneme is affected by the current phoneme length and the neighboring lip shapes; this effect is called coarticulation.
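The following Python sketch illustrates this part of the scheme. With zero tangents at both key-frames, the cubic of Equations (6)-(7) reduces to a smoothstep blend between consecutive key-visemes, and the maintenance interval of Equation (8) is included as a small helper. It is our reading of the text, not the authors' code.

import numpy as np

def interpolate_visemes(key_positions, key_times, t):
    # Piecewise cubic interpolation with zero tangents at every key-frame
    # (Equations 6-7). key_positions is an (m, n, 3) array of m key-viseme
    # meshes with n vertices each; key_times are the t_j of Equation (5).
    key_times = np.asarray(key_times, dtype=float)
    t = np.clip(t, key_times[0], key_times[-1])
    j = min(np.searchsorted(key_times, t, side="right") - 1, len(key_times) - 2)
    s = (t - key_times[j]) / (key_times[j + 1] - key_times[j])
    w = 3.0 * s**2 - 2.0 * s**3          # cubic blend with zero end tangents
    return (1.0 - w) * key_positions[j] + w * key_positions[j + 1]

def maintenance_interval(t_j, l_j, delta1):
    # Equation (8): for a long phoneme, the key pose is held over an interval
    # of length l_j - delta1 centred at t_j instead of a single instant.
    half = max(l_j - delta1, 0.0) / 2.0
    return t_j - half, t_j + half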

To approximate this coarticulation effect, we propose a simple and effective scheme based on shape blending of successive visemes. Starting from the first key-viseme model P_1 in P, we successively adjust each key-viseme model P_j with respect to the previous key-viseme model, to obtain a new key-viseme sequence \tilde{P}:

\tilde{P} = {\tilde{P}_1, \tilde{P}_2, ..., \tilde{P}_m},   (9)

where a vertex position \tilde{v}_i^j of key-viseme \tilde{P}_j is obtained as follows:

\tilde{v}_i^j = w(l_j) v_i^j + (1 - w(l_j)) \tilde{v}_i^{j-1}.   (10)

Here, the weight w(l_j) is a function of the corresponding phoneme length l_j. When l_j is long enough, the original key-viseme is achieved fully; otherwise, the key-viseme is adjusted according to Equation (10). From empirical observations, we derive a heuristic for viseme transition: the transition speed from one key-viseme to the next is initially low, then abruptly becomes high, and finally slows down again, as shown in Figure 7. Based on this heuristic, we define the weight function w(l_j):

w(l_j) = -2 l_j^3 / δ_2^3 + 3 l_j^2 / δ_2^2  if l_j < δ_2,  and  1 otherwise,   (11)

which is derived from the constraints w(0) = 0, w(δ_2) = 1, and w'(0) = w'(δ_2) = 0. With the modified key-viseme sequence \tilde{P} obtained in this way, we then apply our key-framing scheme as explained above to obtain a smooth and natural-looking lip-sync animation.

Figure 7: The weight function w(l).
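Below is a Python sketch of Equations (9)-(11). It assumes that the previously adjusted key-viseme is used on the right-hand side of Equation (10), which is how we read "successively adjust"; δ_2 is a user-chosen threshold.

import numpy as np

def coarticulation_weight(l, delta2):
    # Equation (11): w(0)=0, w(delta2)=1, w'(0)=w'(delta2)=0, and w=1 beyond delta2.
    if l >= delta2:
        return 1.0
    s = l / delta2
    return 3.0 * s**2 - 2.0 * s**3

def adjust_visemes(key_visemes, lengths, delta2):
    # Equations (9)-(10): blend each key-viseme toward its predecessor when the
    # phoneme is short. key_visemes is an (m, n, 3) array of vertex positions.
    adjusted = np.array(key_visemes, dtype=float, copy=True)
    for j in range(1, len(adjusted)):
        w = coarticulation_weight(lengths[j], delta2)
        adjusted[j] = w * adjusted[j] + (1.0 - w) * adjusted[j - 1]
    return adjusted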

3 Emotional Expression Animation

In this section, we describe our example-based emotional expression synthesis scheme. Like the viseme models for pronunciation, a human face has well-characterized emotional key-expressions, so a variety of emotional expressions can be obtained by blending those key-expressions. Referring to the emotion space diagram [19], we choose six emotional key-expressions: neutral, happy, sad, surprised, afraid, and angry. The 3D face models corresponding to these key-expressions and their parameterization are shown in Figures 8 and 9, respectively. Note that the neutral expression model is the same as the viseme model representing silence, that is, the base model.

Figure 8: Emotional key-expression models (neutral, happy, sad, surprised, afraid, angry).

For intuitive and efficient emotional expression synthesis from these key-expression models, we adopt a multi-dimensional scattered data interpolation scheme [20]. We first parameterize the key-expression models in a 2D space with the neutral expression located at the center (Figure 9). Our expression synthesis problem is then transformed into a scattered data interpolation problem. We predefine the weight function for each key-expression model based on cardinal basis functions consisting of linear and radial basis functions. The global shapes of the weight functions are first approximated by linear basis functions and then adjusted locally by radial basis functions to exactly interpolate the key-expression models. At runtime, we interactively specify a parameter vector and synthesize the corresponding expression model by blending the key-expression models with the weight values obtained by evaluating the predefined weight functions at this parameter vector. The weight function w_i(·) of each key-expression model E_i, 1 ≤ i ≤ M, at a parameter vector p is defined as follows:

w_i(p) = \sum_{l=0}^{2} a_{il} A_l(p) + \sum_{j=1}^{M} r_{ji} R_j(p),   (12)

where A_l(p) and a_{il} are the linear basis functions and their coefficients, respectively, and R_j(p) and r_{ji} are the radial basis functions and their coefficients.

Let p_i, 1 ≤ i ≤ M, be the parameter vector of key-expression model E_i. To interpolate the key-expression models exactly, the weight of a key-expression model E_i should be one at p_i and zero at every p_j with j ≠ i; that is, w_i(p_j) = 1 for i = j and w_i(p_j) = 0 for i ≠ j. Ignoring the second term of Equation (12), we first solve for the linear coefficients a_{il}:

w_i(p) = \sum_{l=0}^{2} a_{il} A_l(p).   (13)

The linear bases are simply A_l(p) = p_l, 1 ≤ l ≤ 2, where p_l is the l-th component of p, and A_0(p) = 1. Using the parameter vector p_i of each key-expression model and its weight w_i(p_i), we employ a least-squares method to evaluate the unknown linear coefficients a_{il}.

Given the linear approximation, we then compute the residuals for the key-expression models as follows:

\tilde{w}_i(p) = w_i(p) - \sum_{l=0}^{2} a_{il} A_l(p),  for all i.   (14)

With these residuals, we solve for the radial coefficients r_{ji} in Equation (12). The radial basis function R_j(p) is a function of the Euclidean distance between p and p_j in the parameter space:

R_j(p) = B(\| p - p_j \| / α),  for 1 ≤ j ≤ M,   (15)

where B(·) is the cubic B-spline function and α is the dilation factor, chosen as the separation to the nearest other key-expression model in the parameter space. The radial coefficients are then found by solving the matrix system

r R = \tilde{w},   (16)

where r is an M × M matrix of the unknown radial coefficients r_{ji}, and R and \tilde{w} are matrices of the same size defined by the radial bases and the residuals, respectively, such that R_{ij} = R_i(p_j) and \tilde{w}_{ij} = \tilde{w}_i(p_j).

With the weight functions predefined, we can blend the key-expression models at runtime. Using the predefined weight functions for the key-expression models E_j, 1 ≤ j ≤ M, as given in Equation (12), we generate a new face model E_new at the parameter vector p_in:

v_i^new(p_in) = \sum_{j=1}^{M} w_j(p_in) v_i^j,   (17)

where v_i^new and v_i^j, 1 ≤ i ≤ n, denote the i-th vertex of E_new and of E_j, respectively.
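The Python sketch below illustrates how the weight functions of Equations (12)-(16) can be precomputed and then evaluated for the runtime blend of Equation (17). It follows the scheme of Sloan et al. [20] as described above; the exact B-spline kernel and the per-key dilation factor are our assumptions rather than values taken from the report.

import numpy as np

def cubic_bspline(t):
    # Uniform cubic B-spline kernel B(|t|), nonzero for |t| < 2 (an assumed kernel).
    t = np.abs(np.asarray(t, dtype=float))
    out = np.zeros_like(t)
    near, far = t < 1.0, (t >= 1.0) & (t < 2.0)
    out[near] = (4.0 - 6.0 * t[near]**2 + 3.0 * t[near]**3) / 6.0
    out[far] = (2.0 - t[far])**3 / 6.0
    return out

def fit_cardinal_weights(params):
    # Precompute the cardinal-basis weight functions of Equation (12) for the
    # key-expression parameter vectors params (M x 2). Returns weights(p),
    # the M blend weights at an arbitrary parameter vector p.
    params = np.asarray(params, dtype=float)
    M = len(params)
    A = np.hstack([np.ones((M, 1)), params])                 # A_0 = 1, A_1, A_2
    target = np.eye(M)                                       # cardinal values w_i(p_j)
    lin, *_ = np.linalg.lstsq(A, target, rcond=None)         # Equation (13), least squares
    dists = np.linalg.norm(params[:, None, :] - params[None, :, :], axis=-1)
    alpha = np.where(dists > 0, dists, np.inf).min(axis=1)   # nearest other key (dilation)
    R = cubic_bspline(dists / alpha[None, :])                # R[k, j] = R_j(p_k), Equation (15)
    residual = target - A @ lin                              # Equation (14)
    rad = np.linalg.solve(R, residual)                       # Equation (16)

    def weights(p):
        p = np.asarray(p, dtype=float)
        a = np.concatenate(([1.0], p))
        r = cubic_bspline(np.linalg.norm(p - params, axis=1) / alpha)
        return a @ lin + r @ rad

    return weights

# Runtime blending (Equation 17): with the key-expression meshes stacked in an
# (M, n, 3) array E, the face at parameter p is np.tensordot(weights(p), E, 1).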

In practice, we employ a cubic spline curve in the parameter space to continuously feed the emotional parameters during an animation (see Figure 9).

Figure 9: The parameter space for expression synthesis (the neutral expression at the origin, with the other key-expressions placed around it).

4 Composition

In this section, we explain how to combine a visual speech animation and emotional expressions. Since they are generated frame by frame simultaneously, our problem reduces to compositing a viseme model and an expression model. Analyzing the key-viseme models and the key-expression models with respect to the base model (the neutral expression), we characterize the vertices of the face model in terms of their contributions to facial movements. For example, vertices near the eyes contribute mainly to emotional expressions, while vertices near the mouth contribute to both pronunciation and emotional expressions. However, when the two types of facial movement conflict, the vertex movements should be constrained mainly by pronunciation. Based on this observation, we introduce the notion of importance, which measures the relative contribution of each vertex to pronunciation with respect to the emotional expressions. Let α_i denote the importance value of vertex v_i, with 0 ≤ α_i ≤ 1 for all i. If α_i ≥ 0.5, the movement of v_i is constrained by pronunciation; otherwise, it is constrained by the emotional expressions.

To compute the importance values effectively, we have empirically derived the following three rules. First, the importance of a vertex is proportional to the norm of its displacement vector from the corresponding vertex of the base model. Second, a vertex with a small displacement is considered important if it has a neighboring vertex with a large displacement. Finally, a vertex of high importance is constrained by pronunciation, and a vertex of low importance is constrained by the emotional expressions.

According to these rules, we compute the importance of each vertex in three steps. In the first two steps, two independent importance values are computed from the viseme models and the expression models, respectively; they are then combined into the final importance in the third step. Let p_1(v_i) and e_1(v_i), 1 ≤ i ≤ n, be the pronunciation and emotion importances, respectively. In the first step, these importances are computed from the maximum norms of the displacement vectors of each vertex v_i over the respective models, normalized so that their values range from 0 to 1:

p_1(v_i) = \max_j \| v_i^{P_j} - v_i \| / \max_{j,k} \| v_k^{P_j} - v_k \|,
e_1(v_i) = \max_j \| v_i^{E_j} - v_i \| / \max_{j,k} \| v_k^{E_j} - v_k \|,

where v_i^{P_j} and v_i^{E_j} denote the vertices in a viseme model P_j and an expression model E_j, respectively, corresponding to a vertex v_i in the base model.

In the second step, we propagate the importance value of each vertex to its neighboring vertices if it is large enough. The importances p_2(v_i) and e_2(v_i) are obtained as follows:

p_2(v_i) = \max( {p_1(v_i)} ∪ {p_1(v_j) : \| v_i - v_j \| < L_p, p_1(v_j) > S_1} ),
e_2(v_i) = \max( {e_1(v_i)} ∪ {e_1(v_j) : \| v_i - v_j \| < L_e, e_1(v_j) > S_1} ),

where S_1, L_p, and L_e are control parameters.

In the final step, the importance α_i of the vertex v_i is obtained as follows:

α_i = p_2(v_i) (1 - e_2(v_i))  if p_2(v_i) < S_2,  and  1 - (1 - p_2(v_i)) e_2(v_i)  otherwise.   (18)

Equation (18) adjusts the importance values so that they cluster near the two extremes, zero and one. Figure 10 shows the importance distributions after the first, second, and third steps; brighter regions indicate higher importance values.

Now we are ready to explain how to composite an expression model E and a viseme model P derived from the base model B. Let B = {v_1, v_2, ..., v_n}, E = {v_1^E, v_2^E, ..., v_n^E}, and P = {v_1^P, v_2^P, ..., v_n^P}.
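A brute-force Python sketch of the three steps is given below. The thresholds S_1, S_2 and the propagation radii L_p, L_e are control parameters whose values the report does not state, so the defaults here are placeholders.

import numpy as np

def importance(base, visemes, expressions, S1=0.3, S2=0.5, Lp=0.05, Le=0.05):
    # base: (n, 3) vertices of the base model; visemes, expressions: (m, n, 3)
    # stacks of key-viseme and key-expression models. Returns alpha, shape (n,).
    def step1(models):
        # Normalized maximum displacement norm per vertex.
        disp = np.linalg.norm(models - base[None, :, :], axis=-1).max(axis=0)
        return disp / disp.max()

    def step2(vals, radius):
        # Propagate large values (> S1) to vertices within the given radius.
        out = vals.copy()
        big = vals > S1
        for i in range(len(base)):
            near = np.linalg.norm(base - base[i], axis=1) < radius
            candidates = vals[near & big]
            if candidates.size:
                out[i] = max(out[i], candidates.max())
        return out

    p2 = step2(step1(visemes), Lp)
    e2 = step2(step1(expressions), Le)
    # Step 3 (Equation 18): push the combined values toward 0 or 1.
    return np.where(p2 < S2, p2 * (1.0 - e2), 1.0 - (1.0 - p2) * e2)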

Figure 10: Importance distributions: (a) p_1(v), (b) e_1(v), (c) p_2(v), (d) e_2(v), (e) α.

The vertices v_i^E and v_i^P, 1 ≤ i ≤ n, are obtained by displacing v_i where needed, so a natural correspondence is established between vertices with the same subscript. For every vertex v_i, we define displacement vectors Δv_i^E and Δv_i^P as follows:

Δv_i^E = v_i^E - v_i,  and  Δv_i^P = v_i^P - v_i.

Let C = {v_1^C, v_2^C, ..., v_n^C} be the composite model and Δv_i^C = v_i^C - v_i the displacement vector of a vertex v_i^C in C. Then v_i^C must lie on the plane spanned by Δv_i^E and Δv_i^P and containing v_i, as shown in Figure 11.

Figure 11: Composition of the two displacements.

Consider a vertex v_i^C, 1 ≤ i ≤ n, of the combined model C. If α_i ≥ 0.5, then the displacement vector Δv_i^P should be preserved in Δv_i^C for accurate pronunciation. Therefore, letting P_⊥(Δv_i^E) be the component of Δv_i^E perpendicular to Δv_i^P, only this component of Δv_i^E can contribute to Δv_i^C (see Figure 11). If α_i < 0.5, the roles of Δv_i^P and Δv_i^E are switched. Thus, we have

v_i^C = v_i + Δv_i^P + (1 - α_i) P_⊥(Δv_i^E)  if α_i ≥ 0.5,  and
v_i^C = v_i + Δv_i^E + α_i E_⊥(Δv_i^P)  otherwise,

where

P_⊥(Δv_i^E) = Δv_i^E - ((Δv_i^E · Δv_i^P) / \|Δv_i^P\|^2) Δv_i^P,  and
E_⊥(Δv_i^P) = Δv_i^P - ((Δv_i^P · Δv_i^E) / \|Δv_i^E\|^2) Δv_i^E.
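The per-vertex composition above can be written compactly as follows (a Python sketch, not the authors' code); degenerate displacement vectors are guarded with a small epsilon.

import numpy as np

def composite(base, viseme, expression, alpha, eps=1e-12):
    # base, viseme, expression: (n, 3) vertex arrays of B, P, and E;
    # alpha: (n,) importance values from the previous step.
    dP = viseme - base          # delta v^P
    dE = expression - base      # delta v^E

    def perp(a, b):
        # Component of a perpendicular to b (returns a where b is ~zero).
        bb = np.einsum("ij,ij->i", b, b)
        coef = np.where(bb > eps, np.einsum("ij,ij->i", a, b) / np.maximum(bb, eps), 0.0)
        return a - coef[:, None] * b

    lip = base + dP + (1.0 - alpha)[:, None] * perp(dE, dP)   # case alpha >= 0.5
    emo = base + dE + alpha[:, None] * perp(dP, dE)           # case alpha <  0.5
    return np.where((alpha >= 0.5)[:, None], lip, emo)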

Figure 12 shows a composite model (Figure 12(c)) constructed from a viseme model (Figure 12(a)) and an emotional expression model (Figure 12(b)). Note that the shape of the viseme model is preserved around the mouth, while that of the emotional expression model is preserved in the other parts of the face.

Figure 12: Composition: (a) viseme model (vowel "i"), (b) expression model ("happy"), (c) composite model.

5 Experimental Results

To obtain a phoneme sequence from input text, we adopt a commercial TTS system called VoiceText. For a given 3D face model, our method first generates a sequence of output visemes synchronized with the phoneme sequence. Using the emotion space diagram as explained in Section 3, our method simultaneously synthesizes emotional expressions, which are immediately combined with the output viseme sequence frame by frame to give the final animation on the fly. All experiments were conducted on an Intel Pentium 4 PC (2.4 GHz processor with 512 MB of memory).

Figure 13 shows the three models used for our experiments, and Table 1 gives the number of vertices and polygons in each model. In Figure 15, we show, row by row, 12 sample frames each of a lip-sync animation, an expression animation, and their composite animation generated for the model Man; note that only the composite version in the third row is actually displayed at runtime. Each composite model is obtained from the corresponding viseme model and expression model shown in the same column.

Figure 13: Models used for the experiments: (a) Man, (b) Woman, (c) Gorilla.

Table 1: Model specification (vertex and polygon counts for the Man, Woman, and Gorilla models).

The lip-sync animation was produced from input text A given in Table 2, while the emotional expressions were simultaneously synthesized from the input parameter curve shown in Figure 14(a). Note that our expression synthesis scheme can extrapolate expression models at parameter vectors even outside the convex hull of the parameter vectors of the key-expression models. Each composite model in the final animation nicely reflects the corresponding lip motion and emotional expression without conflicts. Similarly, Figures 16 and 17 show the animation results for the other two models. For the Woman model, we used input text B in Table 2 and the parameter curve in Figure 14(b); for the Gorilla model, text C and the curve in Figure 14(c).

Figure 14: Various emotional parameter curves (a), (b), (c), drawn in the parameter space spanned by the happy, surprised, sad, afraid, angry, and neutral key-expressions.

Table 2: Input sentences (quotes from Albert Einstein).
A: "The significant problems we face cannot be solved at the same level of thinking we were at when we created them."
B: "The grand aim of all science is to cover the greatest number of empirical facts by logical deduction from the smallest number of hypotheses or axioms."
C: "Any man who can drive safely while kissing a pretty girl is simply not giving the kiss the attention it deserves."

For efficiency analysis, we also ran an experiment on each of the three models with the same input text, a part of Abraham Lincoln's Gettysburg address. The text is composed of 266 words, and the TTS system produced a corresponding phoneme sequence 86 seconds long. We applied the same emotion parameter curve to all three models. Table 3 shows statistics obtained from this experiment. For each model, the table gives the computation time (in milliseconds) per frame for lip synchronization, emotion expression, and their composition, excluding rendering time. The frame rate indicates the number of frames per second for the final animation, including rendering time. As shown in this table, our method exhibits real-time performance in producing visual speech animations from text.

Table 3: Time-efficiency of our method for the Man, Woman, and Gorilla models (L: lip-sync animation, E: expression animation, C: composite animation; per-frame computation times in msec and frame rates in Hz).

Figure 15: Visual speech animation with the model Man.

Figure 16: Visual speech animation with the model Woman.

6 Conclusions

We have presented an example-based approach to creating a visual speech animation with emotional expressions in real time. Our contributions are three-fold. First, we suggested a simple, effective scheme for producing a text-driven 3D lip-sync animation with coarticulation effects. Second, we provided an example-based expression synthesis scheme based on scattered data interpolation. Our final contribution is an importance-based scheme for compositing a viseme (face) model and an expression (face) model, which enables a lip-sync animation to be combined with emotional expressions frame by frame in an on-line manner.

Figure 17: Visual speech animation with the model Gorilla.

There are several aspects for further improvement. For more realistic facial animation, we would like to extend our scheme to incorporate subtle movements in addition to emotional expressions, such as eye blinking, eyeball rolling, and head shaking or nodding. Although our lip-sync animation scheme mainly refers to the phoneme length, we also need to consider other factors, such as intonation or accent, that also affect lip movements. Finally, instead of providing the emotional parameters interactively through a user interface, we will try to extract the emotional parameters automatically from the input text so that the expression models can be derived from the phoneme sequence directly.

References

[1] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proceedings of SIGGRAPH 99.

[2] C. Bregler, M. Covell, and M. Slaney. Video rewrite: Driving visual speech with audio. In ACM SIGGRAPH 97 Conference Proceedings.

[3] M. Brand. Voice puppetry. In Proceedings of SIGGRAPH 99.

[4] E. Cosatto and H. Graf. Photo-realistic talking-heads from image samples. IEEE Transactions on Multimedia, 2(3).

[5] M. Cohen and D. Massaro. Modeling coarticulation in synthetic visual speech. In N. M. Thalmann and D. Thalmann (Eds.), Models and Techniques in Computer Animation, Springer-Verlag, Tokyo.

[6] T. Ezzat and T. Poggio. Visual speech synthesis by morphing visemes. International Journal of Computer Vision, 38.

[7] T. Ezzat, G. Geiger, and T. Poggio. Trainable videorealistic speech animation. In ACM SIGGRAPH 2002 Conference Proceedings.

[8] C. G. Fisher. Confusions among visually perceived consonants. Journal of Speech and Hearing Research, 11.

[9] B. Guenter, C. Grimm, D. Wood, H. Malvar, and F. Pighin. Making faces. In ACM SIGGRAPH 98 Conference Proceedings.

[10] P. Kalra, A. Mangili, N. M. Thalmann, and D. Thalmann. Simulation of facial muscle actions based on rational free form deformations. In Proceedings of Eurographics 92.

[11] G. A. Kalberer and L. V. Gool. Face animation based on observed 3D speech dynamics. In Computer Animation 2001.

[12] Y. C. Lee, D. Terzopoulos, and K. Waters. Realistic modeling for facial animation. In Proceedings of SIGGRAPH 95.

[13] J. P. Lewis, M. Cordner, and N. Fong. Pose space deformation: A unified approach to shape interpolation and skeleton-driven deformation. In ACM SIGGRAPH 2000 Conference Proceedings.

[14] F. I. Parke. A Parametric Model of Human Faces. PhD thesis, University of Utah.

[15] F. I. Parke. Parameterized models for facial animation. IEEE Computer Graphics and Applications, 2(9).

[16] F. I. Parke and K. Waters. Computer Facial Animation, 1996.

[17] A. Pearce, B. Wyvill, G. Wyvill, and D. Hill. Speech and expression: A computer solution to face animation. In Graphics Interface.

[18] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. H. Salesin. Synthesizing realistic facial expressions from photographs. In ACM SIGGRAPH 98 Conference Proceedings.

[19] J. A. Russell. A circumplex model of affect. Journal of Personality and Social Psychology, Vol. 39.

[20] P.-P. Sloan, C. F. Rose, and M. F. Cohen. Shape by example. In Proceedings of the 2001 Symposium on Interactive 3D Graphics.

[21] D. Terzopoulos and K. Waters. Physically-based facial modeling, analysis, and animation. Journal of Visualization and Computer Animation, Vol. 1, No. 4.

[22] I. H. Witten. Principles of Computer Speech. Academic Press.

[23] K. Waters. A muscle model for animating three-dimensional facial expressions. In Proceedings of SIGGRAPH 87.

[24] K. Waters and T. M. Levergood. DECface: An automatic lip synchronization algorithm for synthetic faces. Technical Report CRL 93/4, DEC Cambridge Research Laboratory, Cambridge, MA.

[25] L. Williams. Performance driven facial animation. In Proceedings of SIGGRAPH 90.


More information

Parameterization of Triangular Meshes with Virtual Boundaries

Parameterization of Triangular Meshes with Virtual Boundaries Parameterization of Triangular Meshes with Virtual Boundaries Yunjin Lee 1;Λ Hyoung Seok Kim 2;y Seungyong Lee 1;z 1 Department of Computer Science and Engineering Pohang University of Science and Technology

More information

Real-time Speech Motion Synthesis from Recorded Motions

Real-time Speech Motion Synthesis from Recorded Motions Eurographics/ACM SIGGRAPH Symposium on Computer Animation (2004) R. Boulic, D. K. Pai (Editors) Real-time Speech Motion Synthesis from Recorded Motions Yong Cao 1,2 Petros Faloutsos 1 Eddie Kohler 1 Frédéric

More information

Resynthesizing Facial Animation through 3D Model-Based Tracking

Resynthesizing Facial Animation through 3D Model-Based Tracking Resynthesizing Facial Animation through 3D Model-Based Tracking Frédéric Pighin y Richard Szeliski z David H. Salesin yz y University of Washington z Microsoft Research Abstract Given video footage of

More information

2D Image Morphing using Pixels based Color Transition Methods

2D Image Morphing using Pixels based Color Transition Methods 2D Image Morphing using Pixels based Color Transition Methods H.B. Kekre Senior Professor, Computer Engineering,MP STME, SVKM S NMIMS University, Mumbai,India Tanuja K. Sarode Asst.Professor, Thadomal

More information

Adding Hand Motion to the Motion Capture Based Character Animation

Adding Hand Motion to the Motion Capture Based Character Animation Adding Hand Motion to the Motion Capture Based Character Animation Ge Jin and James Hahn Computer Science Department, George Washington University, Washington DC 20052 {jinge, hahn}@gwu.edu Abstract. Most

More information

Recovering Non-Rigid 3D Shape from Image Streams

Recovering Non-Rigid 3D Shape from Image Streams Recovering Non-Rigid D Shape from Image Streams Christoph Bregler Aaron Hertzmann Henning Biermann Computer Science Department NYU Media Research Lab Stanford University 9 Broadway, th floor Stanford,

More information

Machine Learning for Video-Based Rendering

Machine Learning for Video-Based Rendering Machine Learning for Video-Based Rendering Arno Schödl arno@schoedl.org Irfan Essa irfan@cc.gatech.edu Georgia Institute of Technology GVU Center / College of Computing Atlanta, GA 30332-0280, USA. Abstract

More information

Announcements. Midterms back at end of class ½ lecture and ½ demo in mocap lab. Have you started on the ray tracer? If not, please do due April 10th

Announcements. Midterms back at end of class ½ lecture and ½ demo in mocap lab. Have you started on the ray tracer? If not, please do due April 10th Announcements Midterms back at end of class ½ lecture and ½ demo in mocap lab Have you started on the ray tracer? If not, please do due April 10th 1 Overview of Animation Section Techniques Traditional

More information

Speech Driven Face Animation Based on Dynamic Concatenation Model

Speech Driven Face Animation Based on Dynamic Concatenation Model Journal of Information & Computational Science 3: 4 (2006) 1 Available at http://www.joics.com Speech Driven Face Animation Based on Dynamic Concatenation Model Jianhua Tao, Panrong Yin National Laboratory

More information

Chapter 9 Animation System

Chapter 9 Animation System Chapter 9 Animation System 9.1 Types of Character Animation Cel Animation Cel animation is a specific type of traditional animation. A cel is a transparent sheet of plastic on which images can be painted

More information

Normals of subdivision surfaces and their control polyhedra

Normals of subdivision surfaces and their control polyhedra Computer Aided Geometric Design 24 (27 112 116 www.elsevier.com/locate/cagd Normals of subdivision surfaces and their control polyhedra I. Ginkel a,j.peters b,,g.umlauf a a University of Kaiserslautern,

More information

Efficient Rendering of Glossy Reflection Using Graphics Hardware

Efficient Rendering of Glossy Reflection Using Graphics Hardware Efficient Rendering of Glossy Reflection Using Graphics Hardware Yoshinori Dobashi Yuki Yamada Tsuyoshi Yamamoto Hokkaido University Kita-ku Kita 14, Nishi 9, Sapporo 060-0814, Japan Phone: +81.11.706.6530,

More information

Real time facial expression recognition from image sequences using Support Vector Machines

Real time facial expression recognition from image sequences using Support Vector Machines Real time facial expression recognition from image sequences using Support Vector Machines I. Kotsia a and I. Pitas a a Aristotle University of Thessaloniki, Department of Informatics, Box 451, 54124 Thessaloniki,

More information

Multi-modal Translation and Evaluation of Lip-synchronization using Noise Added Voice

Multi-modal Translation and Evaluation of Lip-synchronization using Noise Added Voice Multi-modal Translation and Evaluation of Lip-synchronization using Noise Added Voice Shigeo MORISHIMA (,2), Satoshi NAKAMURA (2) () Faculty of Engineering, Seikei University. --, Kichijoji-Kitamachi,

More information

Morphable Displacement Field Based Image Matching for Face Recognition across Pose

Morphable Displacement Field Based Image Matching for Face Recognition across Pose Morphable Displacement Field Based Image Matching for Face Recognition across Pose Speaker: Iacopo Masi Authors: Shaoxin Li Xin Liu Xiujuan Chai Haihong Zhang Shihong Lao Shiguang Shan Work presented as

More information

Approximation of 3D-Parametric Functions by Bicubic B-spline Functions

Approximation of 3D-Parametric Functions by Bicubic B-spline Functions International Journal of Mathematical Modelling & Computations Vol. 02, No. 03, 2012, 211-220 Approximation of 3D-Parametric Functions by Bicubic B-spline Functions M. Amirfakhrian a, a Department of Mathematics,

More information

Animation. CS 4620 Lecture 33. Cornell CS4620 Fall Kavita Bala

Animation. CS 4620 Lecture 33. Cornell CS4620 Fall Kavita Bala Animation CS 4620 Lecture 33 Cornell CS4620 Fall 2015 1 Announcements Grading A5 (and A6) on Monday after TG 4621: one-on-one sessions with TA this Friday w/ prior instructor Steve Marschner 2 Quaternions

More information

Rate-distortion Optimized Streaming of Compressed Light Fields with Multiple Representations

Rate-distortion Optimized Streaming of Compressed Light Fields with Multiple Representations Rate-distortion Optimized Streaming of Compressed Light Fields with Multiple Representations Prashant Ramanathan and Bernd Girod Department of Electrical Engineering Stanford University Stanford CA 945

More information

For each question, indicate whether the statement is true or false by circling T or F, respectively.

For each question, indicate whether the statement is true or false by circling T or F, respectively. True/False For each question, indicate whether the statement is true or false by circling T or F, respectively. 1. (T/F) Rasterization occurs before vertex transformation in the graphics pipeline. 2. (T/F)

More information

Motion Capture, Motion Edition

Motion Capture, Motion Edition Motion Capture, Motion Edition 2013-14 Overview Historical background Motion Capture, Motion Edition Motion capture systems Motion capture workflow Re-use of motion data Combining motion data and physical

More information