Speech Driven Face Animation Based on Dynamic Concatenation Model

Journal of Information & Computational Science 3: 4 (2006)

Jianhua Tao, Panrong Yin
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Received 1 May 2006; revised 20 July 2006

Abstract

In this paper, we design and develop a speech driven face animation system based on a dynamic concatenation model. The face animation is synthesized by unit concatenation and is synchronous with the real speech. Units are selected according to cost functions that measure the voice spectrum distance between training and target units; the visual distance between two adjacent training units is also used to obtain smoother mapping results. Finally, the Viterbi method is used to find the best face animation sequence. The experimental results show that the synthesized lip movement has a good and natural quality.

Keywords: Talking head; Face animation; Speech driven

1 Introduction

Speech driven face animation systems ([6, 8, 10, 15]) have been developed to improve human-machine interfaces such as virtual announcers, readers, and mobile message readers. They also help hearing-impaired people, or people in noisy environments, to communicate with others. Unlike text driven face animation ([3, 5]), which can only use the utterance context information, speech driven face animation is driven by real speech with both spectral and prosodic information, and is therefore able to offer more expressive and vivid outputs.

A speech driven face animation system is usually established in four steps, as defined in Picture My Voice ([10]): label an audio-visual database; choose representations of both the auditory and the visual speech; find a method to describe the relationship between the two representations; and synthesize the visible speech given the auditory speech. Many methods have been applied to audio-visual mapping, such as TDNN ([10]), MLPs ([8]), KNN ([6]), HMM ([15]), GMM, VQ ([15]), and rule-based approaches ([5, 9, 14]). They can be classified into two types: through speech recognition or not through speech recognition.

Project supported by the National Natural Science Foundation of China (No. ).
Corresponding author. E-mail address: jhtao@nlpr.ia.ac.cn (Jianhua Tao).
Copyright © 2006 Binary Information Press, December 2006.

The first approach divides the speech signal into language units such as phonemes, syllables, or words, and maps them directly to corresponding lip shapes; a smoothing algorithm then produces the lip movement sequence ([15]). The second approach analyzes the bimodal data with a statistical learning model and finds the mapping between continuous acoustic features and facial parameters, so that the face animation can be driven directly by a novel speech signal. For example, Picture My Voice ([10]) trained an ANN to learn the mapping from LPCs to face animation parameters, using the current phoneme together with 5 backward and 5 forward time steps as input to model the context.

In this paper, we propose an acoustic-context-based facial unit selection and dynamic concatenation model. The method constructs the output data stream by concatenating stored data units from the training database, which makes the synthesis result natural and realistic. Furthermore, a statistical method is introduced to train the model from a large corpus. The training and testing corpus is based on the 3D coordinates of FDPs, which are compatible with the MPEG-4 standard ([11]) and have been recorded with a motion capture system. Since an animated talking head is required to be lifelike and individual, FAPs are extracted as visual features so that the synthesized animation control parameters can be applied to different 3D mesh models.

The paper is organized as follows: Section 1 gives the general description of the paper, Section 2 introduces the database setup and the audio-visual feature extraction, Section 3 focuses on the model description and the training method, Section 4 presents experiments and analysis of the model, and Section 5 concludes the work.

2 Corpus

2.1 Corpus design and recording

To create the model, a corpus containing 1,000 sentences was produced by searching for appropriate data from 10 years of the People's Daily via a greedy algorithm ([6]). The following factors form the focus of our investigation:

(a) Identity of the current syllable;
(b) Identity of the current tone;
(c) Identity of the final in the previous syllable;
(d) Identity of the previous tone;
(e) Identity of the initial in the following syllable;
(f) Identity of the following tone.

Factor (a) has 417 values, corresponding to the 417 syllable types in Chinese. Factors (b), (d), and (f) each have 5 values, corresponding to the 4 full tones and the neutral tone. Factor (c) contains 20 values (initial types) and factor (e) contains 41 values (final types).

During the text selection phase, phrasing was coded solely on the basis of punctuation. After the text was selected and the database recorded, phrasing was recoded to correspond to pauses. Each utterance in our database contains at least 2 phrases. There were 2,869 phrases and 11,860 syllables in total, so on average each utterance contained 2 phrases.
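For illustration of this coverage-driven selection, a greedy pass could repeatedly pick the sentence that covers the most factor values not yet seen. The data structures and scoring below are assumptions, not the authors' actual selection procedure.

```python
def greedy_select(sentences, n_target=1000):
    """Greedily pick sentences that add the most uncovered factor values.

    `sentences` maps each candidate sentence to the set of factor values it
    contains, e.g. {"sent 1": {("syllable", "ma3"), ("tone", 3), ...}, ...}.
    """
    covered, selected = set(), []
    while sentences and len(selected) < n_target:
        # Sentence whose factor values add the largest number of new items.
        best = max(sentences, key=lambda s: len(sentences[s] - covered))
        if not sentences[best] - covered:      # nothing new left to cover
            break
        selected.append(best)
        covered |= sentences.pop(best)
    return selected
```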

After the corpus was designed, each sentence was recorded by a female speaker in a professional audio recording studio.

For dynamic facial parameter recording, most current corpora are based on streams of dynamic video images ([2, 3, 5, 14]); others record static facial visemes by 3D laser scanning or 2D-to-3D image reconstruction ([6, 8]). There are two popular visual representation standards: the Facial Action Coding System (FACS, [4]) developed by Ekman and Friesen, and MPEG-4 SNHC ([11]). FACS defines 44 AUs to denote independent facial muscle actions, and more than 7,000 AU combinations have been observed. The 68 FAPs defined in MPEG-4 SNHC also represent the basic facial actions, including two high-level parameters (viseme and expression) and 66 low-level parameters, divided into 10 groups according to the regions they affect. Since AUs cannot offer a quantitative description for face animation, this paper converts the 3D FDP movements into low-level FAPs. FAPs are computed and normalized in terms of FAPUs, which are distances between the main facial feature points, so that they can be applied to different face models in a consistent way (a small illustration is given after the capture setup below). In this paper, the 15 FAPs from groups 2 and 8 listed in Table 1 are extracted to represent the lip movement.

Table 1: FAPs for visual representation in our system.
  Group 2: Open jaw, Thrust jaw, Shift jaw, Push b lip, Push t lip.
  Group 8: Lower t midlip o, Raise b midlip o, Stretch l cornerlip o, Stretch r cornerlip o, Lower t lip lm o, Lower t lip rm o, Raise b lip lm o, Raise b lip rm o, Raise l cornerlip o, Raise r cornerlip o.

To acquire FAPs accurately, a commercially available motion capture system (Motion Analysis), in which eight cameras with a 75 Hz sampling frequency are employed, is used to track the FDPs on the speaker's face, as shown in Fig. 1. The output of the motion capture system is the set of 3D trajectories of all 50 markers. Since rigid head movement cannot be avoided during the experiment, 5 markers on the head are used to compensate for the global head movement and obtain the non-rigid facial deformations.
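As a small illustration of the FAPU normalization mentioned above, the sketch below assumes the common MPEG-4 convention that one FAPU is 1/1024 of the relevant facial distance; the numbers and the FAP name are hypothetical and not taken from the paper's data.

```python
def fap_amplitude(displacement_mm, reference_distance_mm):
    """Express a feature-point displacement as an MPEG-4 FAP amplitude,
    assuming one FAPU equals 1/1024 of the relevant facial distance
    (e.g. mouth width for the lip-corner FAPs)."""
    fapu = reference_distance_mm / 1024.0
    return displacement_mm / fapu

# Hypothetical numbers: a 6 mm outward motion of the left lip corner on a
# face with a 52 mm mouth width gives stretch_l_cornerlip_o of about 118 FAPU.
print(fap_amplitude(6.0, 52.0))
```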

Fig. 1: Placement of markers on the neutral face and captured data.

2.2 Features generation

Normalization of markup points and FAP flow generation

To extract FAPs correctly, the 3D trajectories of the face markers have to be normalized into an upright position in the positive XYZ space; a least-squares fitting method for two 3D point sets is then used to extract the rigid head movements ([1]). In our work, three benchmark points (a, b, c) are placed on the forehead, as shown in Fig. 1. The normalized facial parameters are obtained from

$P'_{i,t} = R(P_{i,t} + T)$.  (1)

Here, $P_{i,t}$ is the position captured by the motion capture system, $P'_{i,t}$ is the normalized position used for generating the FAP flow, $R$ is the rotation matrix, $T$ is the translation vector, $i$ indexes the markup point, and $t$ is the measured time. $P_{a,t}$, $P_{b,t}$, and $P_{c,t}$ are the benchmark positions; the distances among them are

$D_{a,b} = \|P_{a,t} - P_{b,t}\| = \sqrt{(x_{a,t}-x_{b,t})^2 + (y_{a,t}-y_{b,t})^2 + (z_{a,t}-z_{b,t})^2}$,
$D_{a,c} = \|P_{a,t} - P_{c,t}\| = \sqrt{(x_{a,t}-x_{c,t})^2 + (y_{a,t}-y_{c,t})^2 + (z_{a,t}-z_{c,t})^2}$,
$D_{b,c} = \|P_{b,t} - P_{c,t}\| = \sqrt{(x_{b,t}-x_{c,t})^2 + (y_{b,t}-y_{c,t})^2 + (z_{b,t}-z_{c,t})^2}$.

Let us suppose that the three benchmark points lie in the plane $z = 0$ of the new normalized XYZ reference frame. Then the new positions of these points are

$P'_{a,t} = \{0, 0, 0\}^T$,
$P'_{b,t} = \{0, D_{a,b}, 0\}^T$,
$P'_{c,t} = \left\{\dfrac{D_{a,b}^2 + D_{a,c}^2 - D_{b,c}^2}{2D_{a,b}},\ \sqrt{D_{a,c}^2 - \left(\dfrac{D_{a,b}^2 + D_{a,c}^2 - D_{b,c}^2}{2D_{a,b}}\right)^2},\ 0\right\}^T$.

Since points a, b, and c keep constant distances among them, these positions need not be recalculated at every time step. With formula (1) we then obtain

$R(P_{a,t} + T) = P'_{a,t} = \{0, 0, 0\}^T$,  (2)
$R(P_{b,t} + T) = P'_{b,t} = \{0, D_{a,b}, 0\}^T$,  (3)
$R(P_{c,t} + T) = P'_{c,t} = \left\{\dfrac{D_{a,b}^2 + D_{a,c}^2 - D_{b,c}^2}{2D_{a,b}},\ \sqrt{D_{a,c}^2 - \left(\dfrac{D_{a,b}^2 + D_{a,c}^2 - D_{b,c}^2}{2D_{a,b}}\right)^2},\ 0\right\}^T$.  (4)

Then,

$T = -P_{a,t}$.  (5)

We separate $R$ into a combination of rotations about the x, y, and z axes, $R = R_x R_y R_z$, where

$R_x = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{pmatrix}$, $R_y = \begin{pmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{pmatrix}$, $R_z = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}$.

The rotation matrix can then be deduced from formulas (2), (3), and (4). The output of the motion capture system at time step $t$ is

$\{P'_{1,t}, P'_{2,t}, \ldots, P'_{n,t}\}^T \in \mathbb{R}^{3n}, \quad n = 45$.  (6)

With the normalized markup points, we obtain the FAP flow by following the method defined in [16], where the markup points are regarded as FDP parameters:

$V_t = \{\mathrm{FAP}^t_1, \mathrm{FAP}^t_2, \ldots, \mathrm{FAP}^t_{15}\}$.  (7)
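The following is a minimal sketch of this normalization for a single frame. The marker array layout and the SVD-based solution of the rotation (in the spirit of the least-squares fitting of [1]) are implementation choices; it illustrates Eqs. (1)-(5) rather than reproducing the authors' code.

```python
import numpy as np

def head_pose_normalization(P, a, b, c):
    """Normalize one frame of 3D marker positions (Eq. (1)).

    P : (n, 3) array of captured marker positions for this frame.
    a, b, c : indices of the three forehead benchmark markers.
    Returns P' = R (P + T) with the benchmark points mapped onto the
    plane z = 0 as in Eqs. (2)-(5).
    """
    # Translation: move benchmark a to the origin (Eq. (5), T = -P_a).
    T = -P[a]
    Q = P + T                                   # translated markers

    # Distances between the benchmark markers (constant over time).
    Dab = np.linalg.norm(P[a] - P[b])
    Dac = np.linalg.norm(P[a] - P[c])
    Dbc = np.linalg.norm(P[b] - P[c])

    # Target benchmark positions in the normalized frame (Eqs. (2)-(4)).
    xc = (Dab**2 + Dac**2 - Dbc**2) / (2.0 * Dab)
    tgt = np.array([[0.0, 0.0, 0.0],
                    [0.0, Dab, 0.0],
                    [xc, np.sqrt(max(Dac**2 - xc**2, 0.0)), 0.0]])

    # Least-squares rotation between the two small point sets:
    # SVD of the source-target covariance gives the optimal R.
    src = Q[[a, b, c]]
    U, _, Vt = np.linalg.svd(src.T @ tgt)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:                    # guard against a reflection
        Vt[-1] *= -1
        R = (U @ Vt).T

    return (R @ Q.T).T                          # normalized markers P'
```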

Acoustic feature vectorization

For the audio features, some researchers use text-dependent language units and analyze their corresponding static visemes; in order to reduce manual intervention, we instead use acoustic features directly. Previous work has shown that MFCCs give an alternative representation of the speech spectrum and carry the auditory information. In each frame, we compute 12 MFCC coefficients ([17]). The parameters of frame $t$ are represented as

$a_t = \{a^t_1, a^t_2, \ldots, a^t_{12}\}$.  (8)
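For illustration, a 12-dimensional MFCC stream can be computed as follows. The paper itself uses the HTK toolkit [17]; the library, sampling rate, and frame settings below are assumptions, and alignment of the audio frame rate to the 75 Hz FAP stream is not shown.

```python
import librosa  # assumed here; the paper computes MFCCs with HTK [17]

def mfcc_stream(wav_path, sr=16000, n_mfcc=12, frame_ms=25, hop_ms=10):
    """Return one 12-dimensional MFCC vector per analysis frame (Eq. (8))."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(sr * frame_ms / 1000),
                                hop_length=int(sr * hop_ms / 1000))
    return mfcc.T            # shape: (num_frames, 12); row t corresponds to a_t

# a = mfcc_stream("utterance.wav"); a[t] is the vector a_t of Eq. (8)
```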

3 Dynamic Concatenation Model

3.1 Model description

The method is based on the cost function

$\mathrm{COST} = \alpha C_a + (1 - \alpha) C_V$,  (9)

where $C_a$ is the voice spectrum distance, $C_V$ is the visual distance between two adjacent units, and the weight $\alpha$ balances the effect of the two sub-costs.

In the unit selection procedure, for each target frame unit of the novel speech we list the candidate units that have the smallest acoustic distances to it. The audio content distance measures how close a candidate unit is to the target unit, and determines whether the most appropriate audio-visual unit is selected. The voice spectrum distance also accounts for the context of the target unit; the context covers the two preceding and two following frames for vowels, and one preceding and one following frame for consonants. The voice spectrum distance is therefore defined as

$C_a = \sum_{t=T-n}^{T+n} \sum_{m=1}^{12} \omega_{t,m} \left| a_{t,m} - \hat{a}_{t,m} \right|$.  (10)

The smoothness of the face animation is also very important; it is measured by the concatenation cost, which is computed from the FAP errors between two selected frames. Here, $V(u_{t-1}, u_t)$ and $V(u_t, u_{t+1})$ denote Euclidean distances:

$C_V = V(u_{t-1}, u_t) + V(u_t, u_{t+1})$.  (11)
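Below is a minimal sketch of the dynamic-programming unit selection implied by Eqs. (9)-(11). The candidate pruning, the plain L1 acoustic distance (without the context weighting of Eq. (10)), and the data layout are assumptions rather than the paper's implementation.

```python
import numpy as np

def select_units(target_mfcc, db_mfcc, db_faps, alpha=0.5, n_candidates=10):
    """Concatenate stored units for a novel utterance.

    target_mfcc : (T, 12) MFCC frames of the input speech.
    db_mfcc     : (N, 12) MFCC frames of the training database.
    db_faps     : (N, 15) FAP vectors aligned with db_mfcc.
    Returns a (T, 15) FAP sequence minimizing alpha*C_a + (1-alpha)*C_V.
    """
    T = len(target_mfcc)
    # Acoustic target cost of every database frame w.r.t. every target frame.
    dist = np.abs(db_mfcc[None, :, :] - target_mfcc[:, None, :]).sum(-1)  # (T, N)
    cand = np.argsort(dist, axis=1)[:, :n_candidates]                     # prune
    Ca = np.take_along_axis(dist, cand, axis=1)                           # (T, K)

    # Viterbi over the candidate lattice with the visual join cost C_V.
    acc = alpha * Ca[0]
    back = np.zeros((T, n_candidates), dtype=int)
    for t in range(1, T):
        join = np.linalg.norm(db_faps[cand[t - 1]][:, None, :] -
                              db_faps[cand[t]][None, :, :], axis=-1)      # (K, K)
        total = acc[:, None] + (1 - alpha) * join + alpha * Ca[t][None, :]
        back[t] = total.argmin(axis=0)
        acc = total.min(axis=0)

    # Backtrack the best path and return its FAP vectors.
    path = [int(acc.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    return db_faps[[cand[t, k] for t, k in enumerate(path)]]
```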

3.2 Training method

Although the initial weights can be assigned manually according to the researcher's experience, a further training method is still necessary to adapt the model to different speech corpora. The whole training procedure is described as follows. Suppose the initial weight vector of Eq. (10) is $\omega^0 = \{\omega^0_1, \omega^0_2, \ldots, \omega^0_p\}$. The training targets and outputs are $\{\hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_N\}$ and $\{Y_1, Y_2, \ldots, Y_N\}$. After time step $j-1$ of learning, the weight vector is $\omega^j = \{\omega^j_1, \omega^j_2, \ldots, \omega^j_p\}$, where $p$ is the length of the weight vector. The weight vector is restricted by the condition

$\sum_i \omega^j_i = 1$.  (12)

The condition ensures that the weights converge to steady values and keeps the weights balanced over the whole space. After training time step $j$, the new weight vector is obtained with

$\omega^{j+1}_i = \omega^j_i + \eta^j d^j_i$.  (13)

Here, $\eta^j$ determines the learning rate at time step $j$ and $d^j$ denotes the learning direction. Since all context parameters were normalized to the range 0 to 1, $d^j$ can be obtained from

$d^j_i = \left[1 - \dfrac{(\Delta Y_n)_{\min}}{\Delta Y_n}\right]\left[\Delta(V(a^{\min}_{n,j})) - \Delta(V(a^s_{n,j}))\right] + C$.  (14)

$\Delta Y_n$ is the FAP error between the synthesis result and the target for frame $n$:

$\Delta Y_n = \left\|\arg\max\Big[\sum_i \gamma_i V(a_{n,m,j})\Big] - \hat{Y}_n\right\|$.  (15)

$\Delta(V(a^s_{n,j}))$ is the corresponding error of the MFCC parameters between outputs and targets,

$\Delta(V(a^s_{n,j})) = \left|V(a^s_{n,j}) - \hat{V}(a_{n,j})\right|$,

and $\Delta(V(a^{\min}_{n,j}))$ is the MFCC error corresponding to the FAP candidate that gives the minimum FAP deviation between the synthesis results and the targets,

$\Delta(V(a^{\min}_{n,j})) = \left|V(a^{\min}_{n,j}) - \hat{V}(a_{n,j})\right|$.

With condition (12), we get

$\sum_i \omega^{j+1}_i = \sum_i \left(\omega^j_i + \eta^j d^j_i\right) = 1$,

and thus

$\sum_i \eta^j d^j_i = 0$.

To reduce the computing cost, $\eta^j$ is normally assigned a constant value. We then get

$C = \dfrac{1}{p}\left[\dfrac{(\Delta Y_n)_{\min}}{\Delta Y_n} - 1\right]\left[\Delta(V(a^{\min}_{n,j})) - \Delta(V(a^s_{n,j}))\right]$.  (16)

The whole automatic training procedure is then completed by combining Eqs. (13), (14), and (16). The block diagram of the approach for FAP flow prediction and training is given in Fig. 2.

Fig. 2: Diagram of FAP selection and weight assigning.
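A rough sketch of one weight-update step under the sum-to-one constraint of Eqs. (12), (13), and (16) is given below. The way the error terms are passed in, and the final clipping and renormalization, are assumptions added for illustration; this is not the authors' training code.

```python
import numpy as np

def update_weights(w, fap_err, fap_err_min, mfcc_err, mfcc_err_min, eta=0.01):
    """One training step for the context weights of Eq. (10).

    w            : (p,) current weights, summing to 1 (Eq. (12)).
    fap_err      : FAP error of the currently selected unit (Delta Y_n).
    fap_err_min  : smallest achievable FAP error over the candidates.
    mfcc_err     : (p,) per-dimension MFCC error of the selected unit.
    mfcc_err_min : (p,) per-dimension MFCC error of the best-FAP candidate.
    """
    ratio = fap_err_min / max(fap_err, 1e-8)
    diff = mfcc_err_min - mfcc_err
    # Learning direction (Eq. (14)); the constant C (Eq. (16)) is chosen
    # here so the direction sums to zero and Eq. (12) keeps holding.
    C = (ratio - 1.0) * diff.sum() / len(w)
    d = (1.0 - ratio) * diff + C
    w_new = w + eta * d                       # Eq. (13)
    w_new = np.clip(w_new, 0.0, None)         # keep weights non-negative
    return w_new / w_new.sum()                # re-impose the sum-to-one constraint
```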

Fig. 3: Synthesized FAP streams (FAP #51 and FAP #52; recorded, synthesized, and smoothed synthesized curves plotted against frames) for (a) a validation sentence and (b) a test sentence.

4 Testing and Validation

Once the cost function (with the audio content distance and the visual distance) is computed, the unit selection method is constructed. The approach finally aims to find the path in this graph that generates the minimum COST; Viterbi search is a valid method for this purpose. In this paper, three quarters of the corpus is used as training data and the rest for validation and testing, with both quantitative evaluation and qualitative observation. Fig. 3 shows the synthesized FAP stream results. Correlation coefficients (Eq. (17)) are used to represent the similarity between the outputs and the targets:

$CC = \dfrac{1}{T}\sum_{t=1}^{T}\dfrac{(\hat{f}(t) - \hat{\mu})(f(t) - \mu)}{\hat{\sigma}\sigma}$,  (17)

where $f(t)$ is the recorded FAP stream, $\hat{f}(t)$ is the synthesis result, $T$ is the total number of frames, and $\mu$, $\sigma$ ($\hat{\mu}$, $\hat{\sigma}$) are the corresponding means and standard deviations.
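Eq. (17) reads directly as code; in this small sketch the two arguments are assumed to be aligned per-frame FAP value arrays of equal length.

```python
import numpy as np

def correlation_coefficient(recorded, synthesized):
    """Eq. (17): normalized correlation between a recorded FAP stream f(t)
    and the synthesized stream f_hat(t) over all T frames."""
    f = np.asarray(recorded, dtype=float)
    f_hat = np.asarray(synthesized, dtype=float)
    mu, sigma = f.mean(), f.std()
    mu_hat, sigma_hat = f_hat.mean(), f_hat.std()
    return np.mean((f_hat - mu_hat) * (f - mu)) / (sigma_hat * sigma)

# Example: cc = correlation_coefficient(recorded_fap51, synthesized_fap51)
```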

Table 2: The average correlation coefficient for each FAP (validation and test sets) on the whole database.

Table 2 shows that the synthesized results give a good performance for the method. Although the test sentences show smaller correlation coefficients, most FAPs can still be synthesized well. It is also noticeable that the CCs of FAP #15 and FAP #60 on the validation data are even smaller than those on the test data, which may mainly be due to the speaker's habit of randomly shifting the mouth corner while speaking.

Finally, an MPEG-4 facial animation engine is used to assess our synthesized FAP streams qualitatively. The animation model is displayed at a frame rate of 75 fps. Fig. 4 shows some frames of the synthesized talking head.

Fig. 4: Some samples of the synthesized talking head.

5 Conclusion

In this paper, we have presented a speech driven face animation system. While many other researchers use images, optical flow, principal components, etc., as stored units, our system employs FAPs extracted from the 3D coordinates of FDPs to represent the visual features. This has the advantages of small storage, applicability to non-specific models, and the ability to drive both 3D mesh models and 2D images. The unit selection method for dynamic mapping from audio to visual parameters is a simplification of HMM-based approaches, but it performs well and naturally in both quantitative and qualitative evaluation. Without the parameter adjustment, under-fitting, and over-fitting problems involved in learning methods such as ANNs, it is a simple, direct, and effective way to synthesize a talking head.

References

[1] K. S. Arun, T. S. Huang and S. D. Blostein, Least-squares fitting of two 3-D point sets, IEEE Trans. Pattern Analysis and Machine Intelligence 9 (5).
[2] C. Bregler, M. Covell and M. Slaney, Video rewrite: driving visual speech with audio, in: ACM SIGGRAPH.
[3] E. Cosatto, G. Potamianos and H. P. Graf, Audio-visual unit selection for the synthesis of photorealistic talking-heads, in: IEEE International Conference on Multimedia and Expo (ICME), 2000.
[4] P. Ekman and W. V. Friesen, Facial Action Coding System, Consulting Psychologists Press, Palo Alto, Calif.
[5] T. Ezzat and T. Poggio, MikeTalk: A talking facial display based on morphing visemes, in: Proc. Computer Animation Conference, Philadelphia, USA.
[6] R. Gutierrez-Osuna, P. K. Kakumanu, A. Esposito, O. N. Garcia, A. Bojorquez, J. L. Castillo and I. Rudomin, Speech-driven facial animation with realistic dynamics, IEEE Trans. Multimedia 7 (1).
[7] A. Hunt and A. Black, Unit selection in a concatenative speech synthesis system using a large speech database, in: ICASSP, Vol. 1, 1996.
[8] Pengyu Hong, Zhen Wen and Thomas S. Huang, Real-time speech-driven face animation with expressions using neural networks, IEEE Trans. on Neural Networks 13 (4).
[9] S. Kshirsagar, T. Molet and N. M. Thalmann, Principal components of expressive speech animation, in: Proc. of Computer Graphics International.
[10] Dominic W. Massaro, Jonas Beskow, Michael M. Cohen, Christopher L. Fry and Tony Rodriguez, Picture my voice: audio to visual speech synthesis using artificial neural networks, in: Proceedings of AVSP 99, Santa Cruz, CA, August 1999.
[11] A. Murat Tekalp and Jörn Ostermann, Face and 2-D mesh animation in MPEG-4, Signal Processing: Image Communication 15 (2000).
[12] Ram R. Rao, Tsuhan Chen and Russell M. Mersereau, Audio-to-visual conversion for multimedia communication, IEEE Trans. Industrial Electronics 45 (1).
[13] Jian-Qing Wang, Ka-Ho Wong, Pheng-Ann Heng, H. M. Meng and Tien-Tsin Wong, A real-time Cantonese text-to-audiovisual speech synthesizer, in: Proceedings of ICASSP 04, Vol. 1, 2004.
[14] Ashish Verma, L. Venkata Subramaniam, Nitendra Rajput, Chalapathy Neti and Tanveer A. Faruquie, Animating expressive faces across languages, IEEE Trans. Multimedia 6 (6).
[15] E. Yamamoto, S. Nakamura and K. Shikano, Lip movement synthesis from speech based on hidden Markov models, Speech Communication 26 (1998).
[16] Overview of the MPEG-4 Standard, 4.htm.
[17] The Hidden Markov Model Toolkit (HTK).
