Lip Tracking for MPEG-4 Facial Animation

Zhilin Wu, Petar S. Aleksic, and Aggelos K. Katsaggelos
Department of Electrical and Computer Engineering
Northwestern University
2145 North Sheridan Road, Evanston, IL 60208
Email: {zlwu, apetar, aggk}@ece.nwu.edu

Abstract

It is very important to accurately track the mouth of a talking person for many applications, such as face recognition and human-computer interaction. This is in general a difficult problem due to the complexity of shapes, colors, textures, and changing lighting conditions. In this paper we develop techniques for outer and inner lip tracking. From the tracking results FAPs are extracted, which are used to drive an MPEG-4 decoder. A novel method consisting of a Gradient Vector Flow (GVF) snake with a parabolic template as an additional external force is proposed. Based on the results of the outer lip tracking, the inner lip is tracked using a similarity function and a temporal smoothness constraint. Numerical results are presented using the Bernstein database.

1. Introduction

MPEG-4 is an emerging multimedia compression standard expected to have an important impact on a number of future consumer electronic products. MPEG-4 is the first audiovisual object-based representation standard, as opposed to most existing frame-based standards for video representation. One of the prominent features of MPEG-4 is facial animation. By controlling the Facial Definition Parameters (FDPs) and the Facial Animation Parameters (FAPs), a face can be animated with different shapes, textures, and expressions. This kind of animation can be used in a number of applications, such as web-based customer service with talking heads, or games. It can also be of great help to hearing-impaired people by providing visual information. In addition, for video conferencing MPEG-4 facial animation objects could be a rather cost-efficient alternative.
The animation objects can imitate a real person and animate the talking head satisfactorily as long as the parameters are extracted accurately. Transmission of all 68 FAPs at 30 frames/second without any compression needs about 30 kbps for a talking head. The data rate can be reduced to less than 0.5 kbps with further compression techniques, such as FAP interpolation [1][2], while standard video transmission requires tens of Mbps. The key issue now is how to accurately obtain the parameters of a face. Many studies have been done on this topic. The most interesting features in a face are the eyes and the mouth, because they are the prominent moving features. Some early studies used markers on the speakers' faces. This requirement considerably constrains users in various environments. Some recent studies used templates [3][4] and active contours [5]. The use of templates is valid in many cases. However, it usually requires a great amount of training and may not result in an exact fit of the features. The use of active contours [6] is appropriate especially when the feature shape is hard to represent with a simple template. Nevertheless, this method is sensitive to certain salient regions close to the desired feature. Random noise may strongly affect the deformation of active contours. For example, reflections on the lips may be stronger than a lip edge in terms of intensity difference. Such a reflection would have a great effect in pulling the active contour towards it. In this case, the active contour tends to track the reflection on the lips, not the real lip boundaries. In this paper we develop a method combining both active contours and a template. The template represents the shape of the feature and the active contours track its exact position. The advantage is that the final tracking results depend on both the Gradient Vector Flow (GVF) snake field vector and the template, appropriately weighted in terms of their qualities.
This combination results in accurate and robust tracking of the outer lips in our experiments with all frames of the Bernstein audio-visual database [7]. We found out experimentally that it is considerably harder to track the inner lips using the same approach applied to the tracking of the outer lips. We therefore applied a similarity function to determine the inner lip boundaries. In addition, a temporal smoothing constraint is applied to improve the accuracy of the tracking result. The paper is organized as follows. The database is briefly described in Sec. 2. The procedure for mouth tracking is described in Sec. 3, followed by the description of the proposed algorithms for outer and inner lip tracking in Sec. 4 and 5, respectively. The generation of FAPs is described in Sec. 6 and conclusions are drawn in Sec. 7.
In this work, only group 2 and group 8 FAPs, describing the inner and outer lip movement, are considered, as shown in Fig. 3.

Figure 1. FAP extraction system: the video sequence is processed by nose tracking, mouth extraction, outer lip tracking (GVF snake and parabola fitting), and inner lip tracking (similarity function), producing the group 2 and 8 FAPs.

2. The audio-visual database

This work utilizes speechreading material from the Bernstein Lipreading Corpus. This high-quality audiovisual database includes a total of 954 sentences, of which 474 were uttered by a single female speaker, and the remaining 480 by a male speaker. For each of the sentences, the database contains a speech waveform, a word-level transcription, and a video sequence time-synchronized with the speech waveform. Each utterance began and ended with a period of silence. The vocabulary size is approximately 1,000 words. The average utterance length is approximately 4 seconds. In order to extract visual features from the database, the video was sampled at a rate of 30 frames/sec (fps) with a spatial resolution of 320 x 240 pixels, 24 bits per pixel.

3. Mouth tracking

Figure 1 illustrates the FAP extraction system we have implemented [8]. In order to extract the mouth area from the Bernstein Lipreading Corpus, a neutral facial expression image was chosen among the sampled video images (Fig. 2a). A 17 x 44 image of the nostrils (Fig. 2b) was extracted from the neutral facial expression image to serve as a template for the template matching algorithm. The nostrils were chosen since they do not deform significantly during articulation [9]. The template matching algorithm, applied on the first frame of each sequence, locates the nostrils by searching a 20 x 20 pixel area, centered at the neutral face nose location, for the best match. In the subsequent frames, the search area is constrained to a 3 x 3 pixel area centered at the nose location in the previous frame.
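The nostril search described above can be sketched as follows. This is a minimal illustration, assuming grayscale numpy arrays and a sum-of-squared-differences cost; the paper does not specify the matching criterion, and the function name is ours:

```python
import numpy as np

def match_template(frame, template, center, search_radius):
    """Locate `template` in `frame` by exhaustive sum-of-squared-differences
    search over a square window of top-left positions around `center`
    (row, col). Returns the top-left corner (row, col) of the best match."""
    th, tw = template.shape
    rows, cols = frame.shape
    cr, cc = center
    best_err, best_pos = np.inf, None
    for r in range(max(0, cr - search_radius), min(rows - th, cr + search_radius) + 1):
        for c in range(max(0, cc - search_radius), min(cols - tw, cc + search_radius) + 1):
            patch = frame[r:r + th, c:c + tw].astype(float)
            err = np.sum((patch - template) ** 2)  # SSD matching cost
            if err < best_err:
                best_err, best_pos = err, (r, c)
    return best_pos
```

For the first frame the search radius would correspond to the larger window around the neutral nose location, and for subsequent frames to the small window around the previous nose location.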
Once the nose location has been identified, a rectangular 90 x 68 pixel region enclosing the mouth is extracted (Fig. 2c).

Figure 2. (a) Neutral facial expression image; (b) Extracted nose template; (c) Extracted mouth image

Figure 3. Outer and inner lip position FAPs

4. Outer lip tracking

The outer lip-tracking algorithm that we developed is a combination of an active contour algorithm and parabola templates [8][10]. We use the GVF snake as the active contour algorithm, since it provides a large capture range, and two parabolas as templates, as described next.

4.1. GVF snake

A snake is an elastic curve defined by a set of control points [6], and is used for finding visual features, such as lines, edges, or contours. The snake parametric representation is given by

x(s) = [x(s), y(s)], s \in [0, 1], (1)

where x(s) and y(s) are the vertical and horizontal coordinates and s is the normalized independent parameter. Snake deformation, controlled by internal and external snake forces, moves through the image minimizing the functional

E = \int_0^1 \frac{1}{2} \left( \alpha |x'(s)|^2 + \beta |x''(s)|^2 \right) + E_{ext}(x(s)) \, ds, (2)

where \alpha and \beta are weights that control the snake's tension and rigidity. The external force E_{ext}(x(s)) is derived from the image data. The GVF [11][12], defined as the vector field v(x, y) = (u(x, y), v(x, y)), can be used as an external force. It is computed by minimizing the functional

E = \iint \mu (u_x^2 + u_y^2 + v_x^2 + v_y^2) + |\nabla f|^2 |v - \nabla f|^2 \, dx \, dy, (3)

where f is an edge map derived from the image using a gradient operator. The parameter \mu is a weighting factor which is determined based on the noise level in the image. The important property of the GVF is that, when used as an external force, it increases the capture range of the snake algorithm. Figure 4 depicts an example of the GVF and snake results.
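A minimal numerical sketch of the GVF computation follows: gradient descent on the functional in Eq. (3), assuming a normalized edge map f as a numpy array. The parameter values are illustrative only, not those used in the paper:

```python
import numpy as np

def laplacian(a):
    """Five-point Laplacian with replicated borders."""
    p = np.pad(a, 1, mode='edge')
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4 * a

def gradient_vector_flow(f, mu=0.2, iters=100):
    """Gradient-descent solution of the GVF equations: at convergence the
    field (u, v) satisfies mu*Laplacian(u) = (u - fx)*(fx^2 + fy^2) (and
    likewise for v), so it follows the edge gradient near edges and
    diffuses smoothly into homogeneous regions (large capture range)."""
    fy, fx = np.gradient(f)       # derivatives along rows (y) and columns (x)
    g = fx ** 2 + fy ** 2         # squared edge-gradient magnitude
    u, v = fx.copy(), fy.copy()   # initialize the field with the edge gradient
    for _ in range(iters):
        u = u + mu * laplacian(u) - g * (u - fx)
        v = v + mu * laplacian(v) - g * (v - fy)
    return u, v
```

Far from an edge the data term vanishes (g close to 0), so the field there is filled in by pure diffusion, which is what pulls a distant snake toward the lip boundary.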
Figure 4. (a) Mouth image; (b) GVF; and (c) The final snake

4.2. Parabola templates

Based on our investigations we concluded that the snake algorithm, which uses only the GVF as an external force, is sensitive to random image noise and salient features around the lips (e.g., lip reflections). In order to improve the lip-tracking performance, two parabolas are fit along the upper and lower lip. Edge detection is performed on every extracted mouth image using the Canny detector to obtain an edge map image (Fig. 5b). In order to obtain sets of points on the upper and lower lip, the edge map image is scanned column-wise, keeping only the first and the last encountered nonzero pixels (Fig. 5c). Parabolas are fitted through each of the obtained sets of points (Fig. 5d).

Figure 5. (a) Extracted mouth image; (b) Edge map image; (c) Upper and lower lip boundaries; and (d) Fitted parabolas

The noise present in the mouth image and the texture of the area around the mouth may in some cases cause inaccurate fitting of the parabolas to the outer lips. We resolved these cases by taking several noise-eliminating steps, described in detail in [8][10]. Afterwards, the image consisting of the two final parabolas was blurred, and the parabola external force, v_parabola, was obtained using a gradient operator. v_parabola was added to the GVF external force, v, to obtain the final external force, v_final, by appropriately weighting the two external forces, that is,

v_final = v + w_t v_parabola. (4)

The value of w_t = 1.5 proved to provide consistently better results. The final external force, v_final, was used in the snake algorithm. Shown in Figure 6 are the snake results for cases of bad quality GVF (Fig. 6a), badly fitted parabolas (Fig. 6b), and the improved results obtained by combining the two (Fig. 6c).
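The parabola fitting and the weighted force combination of Eq. (4) can be sketched as follows. This is a minimal illustration with function names of our own; the least-squares fit is one natural reading of "parabolas are fitted through the points", and the noise-eliminating steps of [8][10] are omitted:

```python
import numpy as np

def fit_parabola(cols, rows):
    """Least-squares fit of rows ~= a*cols^2 + b*cols + c through the
    boundary points obtained from the column-wise edge-map scan."""
    A = np.stack([cols ** 2, cols, np.ones_like(cols)], axis=1).astype(float)
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(rows, dtype=float), rcond=None)
    return coeffs  # (a, b, c)

def combine_forces(v_gvf, v_parabola, w_t=1.5):
    """Final snake external force of Eq. (4): v_final = v + w_t * v_parabola."""
    return v_gvf + w_t * v_parabola
```

One parabola is fit to the upper-lip point set and one to the lower-lip set; the blurred parabola image's gradient then supplies v_parabola.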
The advantage of the developed algorithm lies in the fact that both the GVF and the parabola templates contribute to good tracking results, and their combination provides improved results in most cases.

Figure 6. (a1-c1) Mouth images; (a2-c2) GVFs; (a3-c3) Fitted parabolas; and (a4-c4) Snake results, when the GVF (a4) or the parabola templates (b4) do not give good results when applied individually, and when both methods give good results (c4).

5. Inner lip tracking

The tracking of inner lips presents a more challenging task than the tracking of outer lips. This is primarily due to the fact that the area inside the mouth is of similar color, texture, and intensity as the lips. In addition, teeth appear and disappear during typical conversation and further complicate matters. The technique described above for the tracking of the outer lips did not prove to provide good results when applied to the tracking of the inner lips. We therefore resorted to an approach based on similarity functions, as described next.

5.1 Inner lip model

We use two parabolas as inner lip templates. Figure 7 shows the mouth model, where the outer curve is the continuous contour obtained from the method described in Section 4. The inner lip model consists of two parabolas, which share the two inner corners, defining a line at angle \theta with respect to the horizontal axis. The two curves modeling the outer and inner lips define two areas, one between them (denoted by x_1) and one inside the inner lips (denoted by x_2).

Figure 7. Outer and inner lip model (regions x_1 and x_2)
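Under the model of Fig. 7 (with \theta = 0, as in our experiments), candidate inner-lip parabolas can be rasterized into the two regions. This is a sketch with hypothetical names, assuming the outer-lip contour is available as a boolean mask and the two parabolas are parameterized by their shared corners and apex offsets:

```python
import numpy as np

def inner_lip_masks(outer_mask, corners, heights):
    """Split the mouth area into x1 (between the outer and inner lip
    contours) and x2 (inside the inner lips). The inner lips are two
    parabolas sharing the mouth corners (theta = 0).
    corners = (left_col, right_col, corner_row);
    heights = (up, down) are the apex offsets of the upper and lower
    inner parabolas from the corner line."""
    rows, cols = outer_mask.shape
    xl, xr, yc = corners
    up, down = heights
    c = np.arange(cols)
    # normalized horizontal position t in [-1, 1] between the two corners
    t = np.clip(2.0 * (c - xl) / (xr - xl) - 1.0, -1.0, 1.0)
    upper = yc - up * (1.0 - t ** 2)    # upper inner parabola (smaller row)
    lower = yc + down * (1.0 - t ** 2)  # lower inner parabola (larger row)
    r = np.arange(rows)[:, None]
    x2 = (r >= upper) & (r <= lower) & (c >= xl) & (c <= xr)  # inside inner lips
    x1 = outer_mask & ~x2                                     # between contours
    return x1, x2
```

At t = +/-1 both parabolas meet the corner row, so the two curves share the mouth corners by construction.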
5.2 Similarity function

The best boundary separating regions x_1 and x_2 is the one defining the largest unlikeness between the two regions. Given a pixel q, we can classify it to region x_1 if

p(q | x_1) > p(q | x_2), (5)

where p(q | x_i), i = 1, 2, are the probabilities of q given that it belongs to region x_i. We may also treat p(q | x_i) as the value of the histogram of region x_i, and the comparison in Eq. (5) becomes

h_1(q) > h_2(q), (6)

where h_i(), i = 1, 2, are the histograms of regions x_1 and x_2. Assuming that the histograms h_1 and h_2, of the lip and mouth regions, respectively, are known, a region R can be classified as a lip region if

f(R) = \sum_{q \in R} \log \frac{h_1(q)}{h_2(q)} (7)

is maximized over all possible shapes of R. Alternatively, a region R can be classified as a mouth region if f(R) is minimized. This function calculates how similar the area R is to regions x_1 and x_2: the larger f(R) is, the closer this area is to x_1, and the smaller f(R) is, the closer this area is to x_2. To make the tracking algorithm luminance insensitive, we use the hue and saturation color space instead of the original RGB color space.

In finding the inner lip boundary, the results of the application of the outer lip tracking algorithm are used. Four displacement variables (d_1, d_2, j_1, j_2) are defined and the mouth region is denoted by R, as shown in Fig. 8. The mouth region is then defined as the result of the minimization

(D_1, D_2, J_1, J_2) = arg min f(d_1, d_2, j_1, j_2) (8)

over all possible combinations (d_1, d_2, j_1, j_2). The optimal (D_1, D_2, J_1, J_2) is then used to define the two parabolas for the upper and lower inner lips. An example of the application of this algorithm is shown in Fig. 9.

5.3 Training and tracking

The only training required is to obtain the two histograms for the lip and inside-mouth regions. We arbitrarily selected 20 frames in a sequence.
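The histogram-ratio score of Eq. (7) and the selection of the best candidate boundary as in Eq. (8) can be sketched as follows. This is a minimal illustration; the bin indexing and the epsilon guard against empty histogram bins are choices of our own:

```python
import numpy as np

def region_similarity(bins, h_lip, h_mouth, eps=1e-6):
    """f(R) = sum over pixels q in R of log(h1(q) / h2(q))  (Eq. (7)).
    `bins` holds the quantized hue/saturation bin index of each pixel
    in R; large f means R resembles the lip region x1, very negative f
    the inside-mouth region x2."""
    bins = np.asarray(bins)
    return float(np.sum(np.log((h_lip[bins] + eps) / (h_mouth[bins] + eps))))

def most_mouth_like(candidates, h_lip, h_mouth):
    """Among candidate pixel sets (one per displacement combination
    (d1, d2, j1, j2)), return the index of the one minimizing f(R),
    as in Eq. (8)."""
    scores = [region_similarity(b, h_lip, h_mouth) for b in candidates]
    return int(np.argmin(scores))
```

In the full algorithm each candidate pixel set would be generated by rasterizing the inner-lip parabolas for one combination of the four displacement variables.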
For each frame we use the outer lip tracking results obtained by the algorithm described earlier, and a hand-labeled inner lip parabola contour, to get the lip and inside-mouth regions. From these 20 frames, we obtained 973 pixels for the lip region (x_1) and 4898 pixels for the inside-mouth region (x_2) for training. To get the two histograms, a bin size of 64 was applied. In the Bernstein database, the speaker rarely tilts her head. Therefore, for this set of experiments, \theta was set equal to zero for all frames.

Figure 8. Inner lip tracking procedure (displacement variables d_1, d_2, j_1, j_2 and mouth region R)

Figure 9. Inner lip tracking results as two parabolas

5.4 Temporal smoothing

In order to preserve some form of temporal continuity, the values of the displacement parameters resulting from the minimization in Eq. (8) are used as predictors, which are corrected by the errors of this prediction with respect to the predictions resulting from previous frames. As an example, for the variable D_1, the predictor form we used is the following:

\hat{D}_k = D_k + a_1 (\hat{D}_{k-1} - D_k) + a_2 (\hat{D}_{k-2} - D_k), (9)

where k, k-1, k-2 are the indices of the current and previous frames, D_k is the value resulting from the minimization in the current frame, and ^ denotes corrected values. We want to couple the degree of trust in an estimate to the actual minimum value of the function f(), denoted by f_k. The form of the prediction coefficients we used is the following:

a_1 = (1 - \omega_1)/2,  a_2 = (1 - \omega_2)/2, (10)

where

\omega_1 = \frac{1}{\pi} \tan^{-1}(f_k / n_1) + 0.5 (11)

and

\omega_2 = \frac{1}{\pi} \tan^{-1}(f_k / n_2) + 0.5, (12)

where the values of n_1, n_2 are chosen experimentally, so that a forgetting factor is introduced. Clearly, if f_k << 0, then \omega_1 = \omega_2 = 0 and \hat{D}_k = (\hat{D}_{k-1} + \hat{D}_{k-2})/2. That is, the current estimate is
deemed unreliable, and its corrected value is based solely on the values obtained for the previous frames. On the other hand, if f_k >> 0, then \omega_1 = \omega_2 = 1 and \hat{D}_k = D_k. That is, the current value resulting from the minimization needs no correction.

5.5 Numerical evaluation

To evaluate the inner lip tracking results, we hand-labeled all inner lips for one sequence with 126 frames. If a pixel does not lie in both the tracked and the hand-labeled inner lip regions, this pixel is treated as an error pixel. The tracking error is defined as the ratio of all error pixels divided by the number of pixels in the hand-labeled mouth area. Figure 10(a) shows the so-defined errors for the frames tested. Figure 10(b) shows the corresponding number of error pixels for each frame, that is, the numerator of the fraction used for Fig. 10(a).

Figure 10. Inner lip tracking error: (a) Error ratio; (b) Number of error pixels

From the error ratio plot we can see some large errors. These errors typically correspond to frames for which the mouth is closed or nearly closed. Although the tracking of the inner lip is quite accurate, the error can still be large, since the number of pixels in R (closed mouth) is very small. This is supported by Fig. 10(b), in which clearly there is in most cases no correspondence between the peaks in the two figures (10(a) and 10(b)). Shown in Fig. 11 are two image examples. In both of these, the solid lines of the inner lip represent the hand-labeled boundary and the dotted lines represent the tracked results. In (a) the tracking result is close to the truth and the error is equal to 0.7. In (b) the mouth is nearly closed, and although the tracking result is quite accurate, the error is equal to 0.900. Based on numerical evaluation, the application of temporal smoothing improved the tracking results by reducing the error on the average by 0.03.

Figure 11. Inner lip tracking error examples: (a) error = 0.7; (b) error = 0.900

6.
FAP generation

FAPs, defined in the MPEG-4 standard, are the minimum set of facial animation parameters responsible for describing the movements of a face. They manipulate key feature points on a mesh model of a head to animate all kinds of facial movements and expressions. These parameters are either low-level (i.e., the displacement of a specific single point of the face) or high-level (i.e., the production of a facial expression) [13]. There are 68 FAPs in total, divided into 10 groups. We are interested in the outer and inner lip FAPs, which are in group 8 and group 2, respectively. There are 10 such FAPs in each of group 8 and group 2. All FAPs are expressed in terms of Facial Animation Parameter Units (FAPUs). These units are normalized by certain essential facial feature distances in order to give an accurate and consistent representation. Two FAPUs are involved in the mouth-related FAPs: the mouth width, and the separation between the horizontal line of the nostrils and the horizontal line of the neutral mouth corners. Each distance is normalized by 1024. In the Bernstein database, the first frame in each sequence is a neutral face. Therefore we can obtain the two FAPUs from the first frame and apply them to all the remaining frames. The tracked outer lip of the first frame is the neutral outer lip. Since a neutral mouth is a closed mouth, the line connecting the two outer lip corners is the neutral inner lip. In each following frame, the positions of the 10 outer lip and 10 inner lip FAP points are compared to the neutral FAP positions and are normalized by the FAPUs. This difference represents the mouth movement. Our system automatically reads all sequences from the Bernstein audio-visual database and generates all outer and inner lip FAPs. These parameters are then input to an MPEG-4 facial animation player [14] to generate MPEG-4 sequences. Based on a visual evaluation of the synthesized video, the mouth movements are very realistic and close to the original video. Some results are shown in Fig. 12. Images (a) and (c) are the two original frames
with tracked outer and inner lips. Images (b) and (d) are the corresponding frames generated by the MPEG-4 decoder with FAPs extracted from the original frames. When the sequences are played by the MPEG-4 player driven by the FAPs, realistic motion and articulation has been observed.

Figure 12. (a), (c) Original images with tracked outer and inner lips; (b), (d) MPEG-4 facial animations

With the Facial Animation Engine [15], all animated MPEG-4 face sequences were well synchronized with the acoustic signals.

7. Conclusions and Future Work

We have presented a combined method to accurately track the outer lips by using GVF snakes with parabolic templates as an additional external force. This combination relaxes the requirements on both the saliency of the boundaries and the accuracy of the templates. Furthermore, it is more flexible in tracking noisy signals. We have also presented an inner lip tracking method using a similarity function with temporal smoothing. An encoder that generates MPEG-4 FAPs from the continuous lip boundaries has also been designed. Good results have been achieved for the Bernstein database sequences. The outer lip FAPs have been used in our audio-visual speech recognition system, where they greatly improved recognition performance [8][10]. In order to get a more realistic talking head, additional facial features, such as the eyes, need to be precisely tracked. Another key issue demanding further work is making facial feature tracking real-time, which is extremely important for video conferencing applications.

References:

[1] H. Tao, H.H. Chen, W. Wu, and T.S. Huang, "Compression of MPEG-4 Facial Animation Parameters for Transmission of Talking Heads," IEEE Trans. on Circuits and Systems for Video Technology, vol. 9, no. 2, pp. 264-276, March 1999.

[2] F. Lavagetto and R. Pockaj, "An Efficient Use of MPEG-4 FAP Interpolation for Facial Animation at 70 Bits/Frame," IEEE Trans. on Circuits and Systems for Video Technology, vol. 11, no. 10, pp. 1085-1097, Oct. 2001.

[3] A.L. Yuille, P.W.
Hallinan, and D.S. Cohen, "Feature Extraction from Faces Using Deformable Templates," Int. J. of Computer Vision, vol. 8, no. 2, pp. 99-111, 1992.

[4] J. Luettin and N.A. Thacker, "Speechreading Using Probabilistic Models," Computer Vision and Image Understanding, vol. 65, no. 2, pp. 163-178, 1997.

[5] M. Pardas, "Extraction and Tracking of the Eyelids," Proc. of ICASSP, 2000.

[6] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active Contour Models," Int. J. of Computer Vision, vol. 1, no. 4, pp. 259-268, 1987.

[7] L. Bernstein and S. Eberhardt, "Johns Hopkins Lipreading Corpus I-II," Tech. Rep., Johns Hopkins University, Baltimore, MD, 1986.

[8] P.S. Aleksic, J.J. Williams, Z. Wu, and A.K. Katsaggelos, "Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features," to appear in EURASIP Journal on Applied Signal Processing, 2002.

[9] H.P. Graf, T. Chen, E. Petajan, and E. Cosatto, "Locating Faces and Facial Parts," Proc. Int. Workshop on Automatic Face and Gesture Recognition, pp. 41-46, 1995.

[10] P.S. Aleksic, J.J. Williams, Z. Wu, and A.K. Katsaggelos, "Audio-Visual Continuous Speech Recognition Using MPEG-4 Compliant Visual Features," to appear in Proc. ICIP, September 2002.

[11] C. Xu and J.L. Prince, "Gradient Vector Flow: A New External Force for Snakes," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 1997.

[12] C. Xu, D.L. Pham, and J.L. Prince, "Medical Image Segmentation Using Deformable Models," SPIE Handbook on Medical Imaging, Volume III: Medical Image Analysis, edited by J.M. Fitzpatrick and M. Sonka, May 2000.

[13] Text for ISO/IEC FDIS 14496-2 Visual, ISO/IEC JTC1/SC29/WG11 N2502, Nov. 1998.

[14] F. Lavagetto and R. Pockaj, "The Facial Animation Engine: Toward a High-Level Interface for the Design of MPEG-4 Compliant Animated Faces," IEEE Trans. on Circuits and Systems for Video Technology, vol. 9, no. 2, pp. 277-289, March 1999.

[15] http://www-dsp.com.dist.unige.it/~pok/RESEARCH/MPEG/fae.htm