Facial Animation System Based on Image Warping Algorithm
Lanfang Dong 1, Yatao Wang 2, Kui Ni 3, Kuikui Lu 4
Vision Computing and Visualization Laboratory, School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
1: lfdong@ustc.edu.cn, 2: ytwang@mail.ustc.edu.cn, 3: nk@ustc.edu.cn, 4: lukui@mail.ustc.edu.cn

Abstract—This paper describes the technologies behind a facial animation application. The main contributions are as follows: 1) we survey the classic image warping algorithms and analyze their advantages and disadvantages when applied to facial animation; 2) we study the principles of the Mesh Warping algorithm and propose a novel Mesh Warping algorithm based on scan lines. Experimental results show that the algorithm meets the running-time requirements of a real-time facial animation system, relaxes the constraints on spline construction to some degree, and reduces the difficulty of image warping; 3) we introduce the MPEG-4 facial animation standard and use it to implement a speech-driven facial animation system. The system employs our scan-line Mesh Warping algorithm, which produces a variety of mouth shapes and facial expressions for the speaker, more realistic animation, better real-time performance, and better synchronization with the speech.

Keywords—Facial Animation; Image Warping; MPEG-4; Viseme Interpolation

I. Introduction
The human face and speech are the two most important channels of human communication. Combining animation with speech processing, speech animation technology uses a computer to generate animation in which the speech and the changing mouth shape stay synchronized, also known as a "Talking Head" or "Mouth-Shape Sync". Speech animation technology falls mainly into three types: sample-based, 3D model-based, and single-image-based. Sample-based speech animation generates new facial animation by recombining given samples. It is very realistic, but it requires a live talking video, so data acquisition is hard.
Besides, it can only produce the facial animations contained in the samples. 3D model-based facial animation first establishes a 3D face model and then drives the model to generate speech animation in which the speech and the mouth shape play synchronously. The talking face can have a variety of expressions. At the moment this approach is not as realistic as sample-based animation, but data acquisition is easy (only a few images from different angles), production is convenient (little or no user interaction), and it can generate realistic 3D animation. The speech animation we describe here is based on image warping: we input a single image of a human face (or an animal or cartoon face), position its feature points, and save the positions into data files. For the input sound files, we perform speech recognition and generate phoneme timestamp files. Then we synchronize the selected images and audio files, play the speech, and simultaneously drive the face in the image to animate. The system can be applied to human-computer interaction using images that contain human, animal, or cartoon faces. Image warping is the core technique in single-image-based speech animation. The techniques used for face image warping with good results fall into two categories: 1) Warping based on scattered point interpolation: the typical algorithm is warping based on radial basis functions (RBF) [1, 2, 3]. It makes positioning the feature points more convenient and can produce realistic warped images. But the functions chosen for the RBF, such as the Gaussian function, are generally quite complex, so warping is slow. In addition, it is difficult for the algorithm to guarantee the stability of the warped image's border. 2) Warping based on fragments: typical algorithms are warping based on triangulation [4, 5] and the grid distortion algorithms [6, 7]. Triangulation-based image warping algorithms can obtain good results when performing local warping of face images.
But the preprocessing that partitions the image into triangular pieces is relatively complex, and the reasonableness and effectiveness of the partition directly affect the final warping results. These algorithms are therefore less convenient: whenever the warping results are unsatisfactory and need to be adjusted, the entire warping must be redone. G. Wolberg proposed the distorted-grid warping algorithm, which usually uses cubic spline interpolation (Fig. 1). The grid distortion algorithm is mainly used for shape transitions between two faces (also called morphing). We call the two images IS and IT (the source and target images). The source image is associated with a grid MS, which specifies the coordinates of control points or landmarks. A second grid MT specifies the corresponding positions of those points in the target image. Together, MS and MT define a spatial transformation that maps every point in IS to IT. The mesh topology is required to be isomorphic, which forbids folding and discontinuities; the nodes of MT may therefore depart from those of MS as required, as long as they do not cause self-intersection. For simplicity, the grids are furthermore limited to a fixed boundary. Lin at National Taiwan University applied this approach in a facial animation system [8], in which facial features are defined by a mesh mask and the meshing algorithm is used for image warping to generate facial animation. However, fitting a mesh to the face image is too complex, user interaction is not easy, and the requirements on the mesh are high. If this scheme is used for speech animation, the animation process is hard to control, because the system requires high precision.

978-1-4577-0321-8/11/$26.00 2011 IEEE

Figure 1. Meshing of image warping based on grid distortion

II. Image Warping Based on Scan Lines
After carefully studying the Mesh Warping algorithm, we propose a novel image warping algorithm based on scan lines. Fig. 2 describes the details of our warping method, which consists of three stages.
1) Use the feature points of the source image and the target image to generate the feature points of the intermediate warped image (Fig. 3). The x coordinates of the intermediate feature points come from the target feature points, and the y coordinates come from the source feature points.
2) Perform warping in the x direction.
a) Use the source feature points and the intermediate feature points to construct the source vertical splines (Fig. 4) and the target vertical splines separately, where every spline is generated by linear interpolation.
b) Horizontal scan lines scan the vertical bands row by row. Each scan line intersects two adjacent vertical splines, and the points between them are obtained from the two intersection points (the control points) by linear interpolation.

Algorithm 1: Image Warping Based on Scan Lines
Input: a gray-scale rectangular image (M * N) and a number of feature points distributed over the source image and the target image.
Output: the new coordinates of each pixel in the warped image.

Figure 2.
Image warping algorithm
Figure 3. Schematic diagram of feature points
Figure 4. Schematic diagram of the vertical splines
Figure 5. Schematic diagram of the horizontal splines
3) Perform warping in the y direction.
a) Use the source feature points and the intermediate feature points to construct the source horizontal splines (Fig. 5) and the target horizontal splines separately, where every spline is generated by linear interpolation.
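As an illustration, the x-direction pass (step 2 above) can be sketched as follows. This is a minimal sketch under assumed data layouts, not the authors' exact implementation: each vertical spline is represented by its x-coordinate on every scan row, and the splines are assumed sorted left to right with the image border included so the boundary stays fixed.

```python
# Sketch of the x-direction (horizontal scan line) warping pass.
# Assumptions (not from the paper): each spline is represented by its
# x-coordinate on this scan row, sorted ascending, with the row's first
# and last columns included so the border stays fixed.

def warp_row_x(src_row, src_xs, dst_xs):
    """Remap one scan line so pixels at src_xs move to dst_xs."""
    width = len(src_row)
    out = [0] * width
    for x in range(width):
        # Find the target band [dst_xs[i], dst_xs[i+1]] containing x.
        i = 0
        while i + 1 < len(dst_xs) - 1 and dst_xs[i + 1] <= x:
            i += 1
        # Linearly map x back into the corresponding source band.
        span = dst_xs[i + 1] - dst_xs[i]
        t = (x - dst_xs[i]) / span if span else 0.0
        sx = src_xs[i] + t * (src_xs[i + 1] - src_xs[i])
        # Nearest-neighbour sampling keeps the sketch short; the real
        # system would also interpolate the pixel values.
        out[x] = src_row[min(width - 1, max(0, round(sx)))]
    return out

row = [10, 20, 30, 40, 50, 60, 70, 80]
# One interior spline moves from column 2 (source) to column 5 (target),
# stretching the left band and compressing the right one.
print(warp_row_x(row, [0, 2, 7], [0, 5, 7]))
```

The y-direction pass of step 3 is symmetric: the same routine is applied per column, with horizontal splines supplying the control points.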
Figure 6. Various mouth shapes: (a) source image; (b) mouth shape 1; (c) mouth shape 2; (d) mouth shape 3

b) Vertical scan lines scan the horizontal bands column by column. Each scan line intersects two adjacent horizontal splines, and the points between them are obtained from the two intersection points (the control points) by linear interpolation.
The horizontal and vertical splines described in the steps above can be constructed independently; they no longer need to form a spline grid sharing feature points. This improvement reduces the difficulty of constructing splines and makes image warping for speech animation easier. Some of the warping effects are shown in Fig. 6.

III. 2D Facial Animation System Based on Image Warping
MPEG-4 facial animation is only a standard and does not prescribe a specific implementation, which leaves researchers a vast design space. Based on MPEG-4, we implement a speech-driven facial animation system composed of a face parameter modeling module, a speech recognition module, and an animation generation module. The animation generation module consists of an animation parameter calculation module and an image warping module. The system block diagram is shown in Fig. 7. The system's inputs are arbitrary facial image files and audio files. The face parameter modeling module locates the coordinates of the feature points in the face images. The speech recognition module recognizes speech streams into viseme streams. System definition files define the correspondence between the standard face model and the visemes, including the viseme definitions, the displacement factor definitions of the feature points, and the FAP (Facial Animation Parameters) definitions of the expressions. Through predefined operations we can obtain a set of FAP values corresponding to the viseme currently being played.
From the FAP values and the FDP (Facial Definition Parameters) values (the coordinates of the facial feature points), we can also compute the displacements of the feature points in the face image. Finally, the image warping algorithm generates the animation. By broadcasting the speech and the animation simultaneously, the speech animation system is realized.

A. Face parameter modeling module
We mark the facial feature points using the face parameter modeling module. Our system selects 45 facial feature points, mostly distributed around the eyes and the mouth, to describe a frontal face image (Fig. 8). These points play a major role in generating animation and are mainly used to achieve a variety of mouth shapes, expressions, random blinking, and other effects. The feature points used to realize the random shaking of the whole face lie lower, in the cheek area.

B. Speech recognition module
The speech recognition module is mainly used to extract the visemes from the audio streams. Our system adopts the speech recognition engine SAPI 5.0 as the viseme extraction tool. SAPI 5.0 (MS Speech SDK 5.0) was released by the Microsoft Corporation in October 2000. Using SAPI 5.0, users can easily develop applications such as speech recognition, speech synthesis, and related applications.

Figure 7. Speech-driven facial animation flow chart
Figure 8. Distribution of feature points in face parameter modeling
TABLE 1. VISEMES AND PHONEMES
 # | Phonemes
 0 | ae ax ah aa ay
 1 | p b m
 2 | d t l
 3 | ey eh uh
 4 | f v
 5 | k g h
 6 | y iy ih ix
 7 | sh ch jh zh
 8 | ao ow aw oy
 9 | r
10 | w uw
11 | s z
12 | th dh
13 | n ng
14 | er
15 | silence

Visemes (visual phonemes) are the visual parameters corresponding to phonemes; they represent the mouth shapes of particular pronunciations. The visemes used in the system are listed in Table 1.

C. Animation parameter calculation module
The system uses high-level actions, low-level FAPs, feature points, and spline points to form a four-layer control structure for facial animation, shown in Fig. 9. High-level actions include mouth shapes and facial expressions. The low-level FAPs are FAPs 3-68 as defined in the MPEG-4 standard, which defines 68 FAPs in total. A FAP is implemented by moving a set of related feature points whose locations are determined on a neutral-state face model, according to the MPEG-4 definition. In the end, each animation frame is represented as the movement, relative to the neutral-state face model, of a number of feature points and their related splines. For a given intensity of a high-level action, the displacements of the feature points and splines are computed as:

low-level FAP strength = high-level action strength * low-level FAP weight
feature point displacement = low-level FAP strength * feature point weight * FAPU
spline point displacement = feature point displacement * spline point weight

The four-layer structure reduces the dependence of the facial animation on the face mesh, making it more convenient to replace the model. Abstracting the high-level actions makes application and extension more convenient.

D. Image warping module
With comprehensive consideration of real-time performance, realism, and operational flexibility, our system adopts the scan-line Mesh Warping algorithm for image warping.
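The layered displacement formulas above can be sketched directly in code. All of the weight tables and the FAPU value below are hypothetical placeholders for illustration, not the system's actual parameter tables.

```python
# Sketch of the four-layer displacement calculation described above.
# The weights and FAPU value are hypothetical examples, not the
# system's real parameter tables.

def feature_displacements(action_strength, fap_weights, point_weights, fapu):
    """High-level action -> low-level FAP strengths -> feature point shifts."""
    # low-level FAP strength = high-level action strength * FAP weight
    fap_strengths = {fap: action_strength * w for fap, w in fap_weights.items()}
    # feature point displacement = FAP strength * point weight * FAPU
    shifts = {}
    for fap, strength in fap_strengths.items():
        for point, w in point_weights.get(fap, {}).items():
            shifts[point] = shifts.get(point, 0.0) + strength * w * fapu
    return shifts

def spline_displacements(point_shifts, spline_weights):
    # spline point displacement = feature point displacement * spline weight
    return {sp: point_shifts[p] * w for sp, (p, w) in spline_weights.items()}

# Hypothetical "open mouth" action driving two FAPs and three points.
pts = feature_displacements(
    action_strength=0.5,
    fap_weights={"open_jaw": 1.0, "lower_lip": 0.6},
    point_weights={"open_jaw": {"chin": 1.0},
                   "lower_lip": {"lip_l": 0.8, "lip_r": 0.8}},
    fapu=10.0)
print(pts)  # e.g. the chin shifts 5.0 units, the lip corners about 2.4
```

A spline point tied to the chin with weight 0.5 would then move half as far as the chin itself, per the third formula.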
In this method the construction of the splines is relatively complicated, and it strongly influences the warping results. As mentioned earlier, the scan-line Mesh Warping algorithm can construct the splines and perform the warping in the X and Y directions independently. After repeated experiments, the system eventually settled on the following scheme. Horizontal splines covering the whole image area achieve the warping in the Y direction, as shown in Fig. 10. Vertical splines covering the mouth and eye areas separately, which reduces the difficulty of constructing splines, achieve the warping in the X direction, as shown in Fig. 11. In addition, vertical splines constructed separately for the whole image achieve the head-shaking effect, as shown in Fig. 12. In Fig. 10, Fig. 11, and Fig. 12, black rectangles represent feature points; white rectangles represent secondary feature points (mainly used to assist spline construction and to stabilize the warped image boundary; these points are computed simply from the feature points); diamonds represent the warping boundary and restrict the warping to the interior of the boundary area; upright triangles represent the pixels adjacent to feature points; inverted triangles represent the points one pixel away from feature points. These adjacent or nearby points mainly control the warping of local areas. It is worth mentioning that the feature points on the chin lie on the horizontal splines but not on the vertical splines, because the chin moves almost only up and down when talking (head shaking is handled separately). This reflects one improvement over the traditional Mesh Warping algorithm: constructing the horizontal and vertical splines independently, and warping in the X and Y directions separately.

Figure 9. Four-layer control structure
Figure 10. Horizontal splines
Figure 11. Vertical splines for facial warping
Figure 12. Vertical splines for shaking the head

IV. Experimental Results
The animation process of our system is shown in Fig. 13. Using the algorithm of Section 3, we implemented a speech-driven facial animation system. Fig. 14 shows some frames of the facial animations produced by our system. The main functions of the speech-driven facial animation system are as follows: it generates the 16 self-defined visemes for the mouth; it provides expression labels for six kinds of common expressions; it provides random blinking, head shaking, and other effects that enhance the realism of the animation; the visemes are shared, and the synchronization between voice and animation is good; the animation frame rate is 20 frames/sec.

V. Conclusion
We analyze the state of facial animation research and propose a facial animation generation method based on image warping. We construct the face model by choosing feature points on a frontal face image, and we achieve realistic facial animation using the scan-line Mesh Warping technology to produce a variety of mouth shapes and facial expressions for the talker. Nevertheless, the facial animation produced by the system can be improved in many aspects. The system does not model the teeth inside the mouth, but instead uses a smearing method; in some frames the mouth opens larger than its real size. In addition, the animation of the transitions between the various visemes and the various expressions needs further improvement.

Figure 13. System animation flow chart
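To illustrate the voice/animation synchronization described in Section IV, the following sketch selects the viseme class to render at each playback instant from the phoneme timestamp file. The timestamp format is an assumption of this sketch; the phoneme-to-viseme groups shown follow Table 1.

```python
# Sketch of voice/animation synchronization: given the phoneme
# timestamp file produced by recognition, pick the viseme class to
# display at playback time t. The (start, end, phoneme) record format
# is an assumption; the phoneme groups follow Table 1.

PHONEME_TO_VISEME = {
    "p": 1, "b": 1, "m": 1,        # viseme class 1 in Table 1
    "f": 4, "v": 4,                # class 4
    "ae": 0, "aa": 0, "ah": 0,     # class 0
    "silence": 15,                 # class 15
}

def viseme_at(timeline, t):
    """timeline: list of (start_sec, end_sec, phoneme), sorted by start.
    Returns the viseme class to render at time t (silence if outside)."""
    for start, end, ph in timeline:
        if start <= t < end:
            return PHONEME_TO_VISEME.get(ph, 15)
    return PHONEME_TO_VISEME["silence"]

# At 20 frames/sec, animation frame k is rendered at t = k / 20.0.
timeline = [(0.0, 0.2, "m"), (0.2, 0.5, "aa"), (0.5, 0.6, "silence")]
frames = [viseme_at(timeline, k / 20.0) for k in range(12)]
```

Interpolating between the warped images of consecutive viseme classes, rather than switching abruptly, is what the paper's viseme interpolation refers to.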
Acknowledgment
The authors would like to thank the students Jiahui Chen and Meng Li of the Vision Computing and Visualization Laboratory of the University of Science and Technology of China for their assistance. This work is supported by the Image-based Speech Animation project of the Youth Innovation Fund of the University of Science and Technology of China (2011-2012) and by the Intelligent Human-Machine Speech Interaction Robot Key Technology research program of Anhui Province (2009-2011) under Grant No. 09010206052.

Figure 14. Speech-driven facial animation: (a) speaking happily; (b) speaking angrily; (c) speaking with surprise

References
[1] D. Reisfeld, N. Arad, N. Dyn, et al., "Image warping by radial basis functions: Application to facial expressions," CVGIP: Graphical Models and Image Processing, 1994, 56(2), pp. 161-172.
[2] N. Arad, D. Reisfeld, "Image warping using few anchor points and radial functions," Computer Graphics Forum, 1995, 14(1), pp. 35-46.
[3] J. Noh, D. Fidaleo, U. Neumann, "Animated deformations with radial basis functions," in Proceedings of the ACM Symposium on Virtual Reality Software and Technology, Seoul, 2000, pp. 166-174.
[4] G. Zhu, B. Zhang, L. Wu, Z. Hu, "Research on Metamorphosis Using Delaunay Triangulation," Journal of Image and Graphics, 2003, 8A(6), pp. 641-646.
[5] Y. Zhang, H. Zhao, "An Image Deformation Algorithm Based on Triangle Skeleton Coordinate," Journal of Image and Graphics, 2001, 6A(4), pp. 365-368.
[6] S. Lee, G. Wolberg, S. Y. Shin, "Scattered data interpolation with multilevel B-splines," IEEE Transactions on Visualization and Computer Graphics, 1997, 3(3), pp. 228-244.
[7] D. B. Smythe, "A two-pass mesh warping algorithm for object transformation and image interpolation," Technical Report 1030, ILM Computer Graphics Department, Lucasfilm, San Rafael, Calif., 1998.
[8] I. Lin, C. Hung, T. Yang, M. Ouhyoung, "A Speech Driven Talking Head System Based on a Single Face Image," in Proceedings of the 7th Pacific Conference on Computer Graphics and Applications, Seoul, 1999, pp. 43-49.