An Efficient Use of MPEG-4 FAP Interpolation for Facial Animation at 70 bits/frame


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 11, NO. 10, OCTOBER 2001

An Efficient Use of MPEG-4 FAP Interpolation for Facial Animation at 70 bits/frame
Fabio Lavagetto and Roberto Pockaj

Abstract--An efficient algorithm is proposed to exploit the facial animation parameter (FAP) interpolation modality specified by the MPEG-4 standard in order to allow very low bit-rate transmission of the animation parameters. The proposed algorithm is based on a comprehensive analysis of the cross-correlation properties that characterize FAPs, which is reported and discussed extensively here. Based on this analysis, a subset of ten almost independent FAPs has been selected from the full set of 66 low-level FAPs to be transmitted and used at the decoder to interpolate the remaining ones. The performance achievable through the proposed algorithm has been evaluated objectively by means of conventional PSNR measures and compared to an alternative solution based on increasing the quantization scale factor used for FAP encoding. The subjective evaluation and comparison of the results has also been made possible by uploading mpg movies on a freely accessible web site (referenced in the bibliography). Experimental results demonstrate that the proposed FAP interpolation algorithm allows efficient parameter encoding at around 70 bits/frame or, in other words, at less than 2 kbits/s for smooth synthetic video at 25 frames/s.

Index Terms--Avatars, facial animation, MPEG-4.

I. INTRODUCTION

THE OBJECTIVE of the present paper is to define appropriate criteria for allowing MPEG-4 facial animation at very low bit rate and to provide, at the same time, a detailed explanation of a particular part of the facial animation specification. As in any conventional approach to lossy video compression, in facial animation the straightforward solution to reduce the bit rate due to the transmission of facial animation parameters (FAPs) is that of increasing the quantization scaling factor used for parameter encoding. The evident disadvantage of this approach is that of progressively degrading the quality of the animation with the introduction of jerky facial movements that are usually very annoying. In synthetic facial animation, differently from natural video coding, the coarse quantization of parameters does not affect the rendering quality of each individual frame, but rather the smoothness of the rendered facial movements.

However, another possibility exists to reduce the bit rate of the stream encoding the FAPs according to MPEG-4 specifications. This method is based on exploiting the a priori knowledge about the object that is animated, namely a human face. Through the analysis of the time correlation characterizing the facial movements of a person while talking, it is possible to reduce much of the FAP redundancy. Instead of coding and transmitting the complete set of 66 FAPs at each frame, it is possible to encode only a significant subset of them and let the decoder generate the missing parameters, thus achieving very low bit-rate animation.

Manuscript received June 1, 2000; revised July 18. This work was supported in part by the European Union under the ACTS Research Project VIDAS and by the IST Research Project Interface, both coordinated by DIST. This paper was recommended by Associate Editor E. Petajan. The authors are with DIST, University of Genova, Genova, Italy (e-mail: fabio@dist.unige.it; pok@dist.unige.it).
This procedure, named FAP interpolation, represents the core issue discussed in this paper. In Section II, we introduce the specifications of MPEG-4 concerned with FAP interpolation, and explain how the techniques we propose are compliant with the standard and oriented to fully exploit its potential. In Section III, we present the methodology and the setup used for high-precision data acquisition of real facial movements. In Section IV, a description is given of the techniques adopted to post-process the acquired data and to define an appropriate subset of FAPs assumed as a minimum basis capable of approximating all the possible human facial movements. The mechanism used by the decoder to interpolate the missing FAPs is explained in Section V, while preliminary objective and subjective results are reported in Section VI.

II. FAP STATUS AND INTERPOLATION

MPEG-4 specifications concerning facial animation [1] adopt the term FAP interpolation to indicate the procedure used by a generic facial animation decoder to autonomously define the value of the FAPs that have not been encoded and transmitted. Based on the knowledge of only a limited number of FAPs, the objective of the interpolation procedure is to estimate a variable number of the missing ones. In this context, since both the known and the estimated FAPs belong to the same frame, the estimation procedure is performed intra-frame, without taking into account any inter-frame FAP prediction. In this respect, therefore, rather than an interpolation, we should more properly call the procedure an actual FAP extrapolation. After having stressed this terminology ambiguity, let us now examine the related MPEG-4 specifications and focus the attention on some issues that are not of clear and immediate interpretation.

FAPs have been subdivided into a few groups, depending on the region of the face they are applied to, with the objective of optimizing the compression efficiency. These groups are listed in Table I. Together with I-frames, two different hierarchical masks are transmitted, fap_mask_type and fap_group_mask, with the objective of selecting the subset of the complete set of 68 defined FAPs that will be transmitted in the present I-frame and in the following P-frames. The fap_mask_type is a 2-bit mask whose meaning is described in Table II.

TABLE I. FAP groups and number of FAPs per group.
TABLE II. fap_mask_type.
TABLE III. Length in bits of fap_group_mask versus the group number.

The fap_group_mask, which is encoded only in case the value associated to fap_mask_type is 01 or 10, specifies which FAPs among those in the group are actually transmitted. The size of this mask is variable (see Table III), depending on the specific group it refers to. The value of this mask can be interpreted as a composition of 1-bit fields: if the bit value is 1, the corresponding FAP is transmitted; otherwise, it is not.

As described before, these masks are used to specify which FAPs are transmitted. Moreover, the masks are also used to encode the so-called FAP status. At each transmission of frame parameters to the decoder, the FAP status can be one of the following.

SET: the FAP was transmitted by the encoder.
LOCK: the FAP was not transmitted and maintains the value of the previous frame.
INTERP: the FAP was not transmitted and the decoder may define its new value (i.e., it can interpolate its value).

SET, LOCK, and INTERP are terms not explicitly mentioned in the standard, but have been introduced by the authors and will be used in the sequel to define the FAP status. The status of non-transmitted FAPs is determined by the decoder according to the value of the fap_mask_type and of the fap_group_mask. In fact, if the fap_mask_type has value 01, non-transmitted FAPs (represented by a 0 bit value in the corresponding fields of the fap_group_mask) must maintain the same value of the previous frame (FAPs are in LOCK status); if the fap_mask_type has value 10, non-transmitted FAPs can be interpolated by the decoder (FAPs are in INTERP status). The only exception is when the encoder has never transmitted a FAP. In this case, since no past reference value is available for this FAP, the only solution is to force its initial status to INTERP.

It is worth pointing out some problems that could originate from this specification in the case of broadcast transmission, at least as far as the authors' interpretation of the standard is concerned. When different decoders start to decode the broadcasted bitstream at different instants, an unresolved ambiguity is generated about the correct handling of non-transmitted FAPs. Let us consider the example of a FAP being transmitted for some time at the beginning of the communication and, after a while, no longer transmitted. In case the decoder is activated from the beginning (i.e., when the FAP is still transmitted), as soon as its transmission is stopped, the FAP enters a LOCK status. Conversely, in case the decoder is activated after the FAP transmission has stopped, the FAP enters an INTERP status.

It must also be noticed that when a FAP is in INTERP status, the decoder is not always free to fix its value. The standard, in fact, defines two default criteria for interpolating FAPs, called left-right interpolation and inner-outer lip contour interpolation, respectively. These criteria have been defined to exploit two evident characteristics of facial motion: the vertical symmetry of facial movements and the strong correlation between the movements of inner and outer lip contours. As a practical consequence, in case only the FAPs of the right part of the face are transmitted while those of the left part are in INTERP status, the decoder is forced to reproduce those received also on the left half of the face, and vice-versa.
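As an illustration of the status rules just described, the following sketch (in Python, with names such as FapStatus and fap_status that are ours, not identifiers taken from the MPEG-4 text) derives the status of a single low-level FAP from its fap_group_mask bit, the fap_mask_type of its group, and the knowledge of whether a value for that FAP was ever received.

    from enum import Enum

    class FapStatus(Enum):
        SET = 1      # value transmitted in the current frame
        LOCK = 2     # not transmitted; keep the value of the previous frame
        INTERP = 3   # not transmitted; the decoder may interpolate it

    def fap_status(group_mask_bit, fap_mask_type, ever_received):
        """Status of one FAP of a group whose fap_mask_type is 0b01 or 0b10."""
        if group_mask_bit == 1:
            return FapStatus.SET                  # FAP present in the bitstream
        if not ever_received:
            return FapStatus.INTERP               # no past value: forced to INTERP
        return FapStatus.LOCK if fap_mask_type == 0b01 else FapStatus.INTERP

The last branch is exactly the source of the ambiguity discussed above: two decoders that joined a broadcast at different times may disagree on ever_received and, therefore, on whether a no-longer-transmitted FAP is locked or interpolated. On top of this status logic, the default left-right rule still applies: when only one side of a symmetric pair is in SET status and the other side is in INTERP status, the received value must be mirrored.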
The same process is applied with respect to the lip contours: if only the FAPs related to the inner contour are transmitted while those of the outer contour are in INTERP status, the decoder is forced to reproduce those received also on the outer contour, and vice-versa.

Another FAP interpolation method is included in the MPEG-4 standard, applicable to all the profiles including facial animation except for the simplest one, Simple FA. This method makes use of the FAP Interpolation Table (FIT) [2] to allow the encoder to define inter-FAP dependence criteria in polynomial form. After downloading the FIT parameters, the encoder activity can be limited to the transmission of a subset of FAPs, leaving to the decoder the task of interpolating the missing ones on the basis of the inter-FAP relations specified by the FIT. For each FAP to be interpolated, FIT allows the definition of relations such as

    FAP_i = I_i(FAP_{j_1}, FAP_{j_2}, ..., FAP_{j_n})

Each interpolation function I(.) is in a rational polynomial form (this is not true for FAP 1 and FAP 2; please refer to [2] for more details)

    I(f_1, ..., f_n) = ( \sum_k c_k \prod_l f_l^{m_{k,l}} ) / ( \sum_k d_k \prod_l f_l^{n_{k,l}} )    (1)

An example of FIT can, therefore, be composed of simple proportional relations such as

    FAP_i = c_1 FAP_j    (2)
    FAP_k = c_2 FAP_j    (3)

where FAP_j is transmitted and FAP_i and FAP_k are interpolated from it.

In the opinion of the authors, however, the use of FIT is mainly oriented to guarantee the predictability of the animation when used together with the Facial Animation Table (FAT). In this way, the encoder will be able to guarantee a minimum level of quality, since the results of the animation would be fully predictable, and to achieve in the meantime a significant saving of bandwidth. It is reasonable to imagine that, for most applications, the inter-FAP dependence criteria should be very simple. Thanks to the strong symmetries characterizing a human face, it should be possible to express the majority of these dependencies in terms of simple direct proportions, without the need of adopting complex polynomial functions. Based on these considerations, it turns out that almost no FIT information is, in general, needed at the decoder, provided that, besides implementing the default interpolation functions left-right and inner-outer lip, it also includes this kind of simple proportional relations.

III. ACQUISITION SETUP

The Test Data Set has been acquired with the Elite system [3], which is composed of four synchronized cameras (one video camera and three IR cameras) together with a microphone. The 3-D position of small IR-reflecting markers distributed on the speaker's face is estimated and tracked by the system at 100 Hz, with sub-millimeter precision, by means of suitable triangulation algorithms applied to the trinocular IR images. Once the 3-D trajectories of each marker have been automatically computed by the system, suitable post-processing is applied to convert them into MPEG-4 compliant FAPs. After having estimated the neutral position of the speaker (as defined by the standard) and the Facial Animation Parameter Units (FAPUs), the rigid motion of the speaker's head is computed for each frame by analyzing the position of three reference markers located on almost non-deformable face structures, like the tip of the nose (see Fig. 1). The 3-D coordinates of each marker are normalized with respect to the neutral position (by compensating the rigid motion of the head), and the FAPs associated to the present frame are estimated by comparing the actual compensated positions of the markers with their coordinates in the neutral position.

Fig. 1. Location of feature points defined by MPEG-4 (left) and location of markers in the data acquisition phase (right).

Twenty markers have been distributed on the speaker's face, as shown in Fig. 1. Thanks to this marker configuration, 30 FAPs have been estimated, specifically: group 4 (eyebrow), group 5 (cheek), group 7 (head), group 8 (outer lip), and FAPs 3, 14, 15, 16, and 17 of group 2 (inner lip, jaw). The acquisition procedure described above has been applied to record a few sequences with three different speakers, two males and one female. The acquired data have been subdivided into two parts: the Training Data Set, obtained by selecting the first 2/3 of each sequence, and the Test Data Set, composed of the remaining 1/3 of each sequence. The former has been used to analyze the FAP correlation, while the latter has been used for the performance evaluation.
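The post-processing just described can be summarized by the following sketch. It is only an illustration of the procedure, not the Elite post-processing software: the rigid head motion is estimated from the three reference markers with a standard least-squares (Kabsch) fit, which is an assumption on our part, and each FAP is then obtained as the head-motion-compensated marker displacement along its axis, normalized by the relevant FAPU.

    import numpy as np

    def rigid_motion(ref_neutral, ref_current):
        """Rotation R and translation t mapping the neutral reference markers
        (3x3 array, one row per marker) onto their current positions."""
        cn, cc = ref_neutral.mean(0), ref_current.mean(0)
        H = (ref_neutral - cn).T @ (ref_current - cc)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:          # avoid reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        return R, cc - R @ cn

    def marker_to_fap(marker_xyz, neutral_xyz, R, t, axis, fapu):
        """One FAP value: head-motion-compensated displacement of a feature
        marker along `axis` (0=x, 1=y, 2=z), normalized by its FAPU."""
        compensated = R.T @ (marker_xyz - t)   # undo the rigid head motion
        return (compensated - neutral_xyz)[axis] / fapu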
IV. ANALYSIS OF FAP CORRELATION

For the analysis of FAP correlation, only a few of the 10 FAP groups defined in MPEG-4 have been considered (see Table I). The analysis we have carried out was oriented to model the FAP trajectories related to the human facial movements associated to normal speech production. The objective is to reduce the number of FAPs to encode and transmit and to provide the decoder with a suitable FAP interpolation mechanism. Should particular nonhuman facial movements be rendered for specific applications (like the animation of a cartoon-like character), it will be enough to modify the FAP mask associated to the transmission of an I-frame and to transmit all the FAPs describing that particular animation for the time needed.

As far as the correlation analysis is concerned, FAPs in group 1 (visemes and expressions) have not been considered since they do not simply encode the scalar value of a specific facial feature like the other FAPs but, on the contrary, they encode high-level information such as the global posture of the mouth and the whole facial expression. No measurement of FAPs in group 6 (tongue) was possible because of the limitations of the data acquisition system that has been used, and the same happened for some FAPs of group 2 (inner lip contour). FAPs in groups 9 and 10 have been excluded from the analysis since human facial movements associated to the nose and ears are usually negligible and of limited interest.

FAPs in group 7 (head) have been considered, even if the three rotations of the head have been assumed to be decorrelated. In conclusion, for the correlation analysis, three sets of FAPs have been considered: 1) some of those in group 2, together with those in group 5 and in group 8; 2) those in group 3; and 3) those in group 4. The reason for analyzing together, in the first set, FAPs coming from different groups (while the other two sets include only homogeneous FAPs) is the a priori knowledge about the anatomy of human faces. The bone and muscle structures of a human head, in fact, indicate that some regions of the face are affected by correlated motion, while others can reasonably be assumed to be independent. As an example, movements of cheeks and lips are strongly correlated, while movements of the eyebrows can be considered rather independent from those of the jaw. This consideration is very evident if facial movements are analyzed at a low level and instant by instant, while it becomes less and less valid if the analysis is shifted to the semantic level and applied on longer time intervals. In this last case, for instance, it might be possible to identify long-term correlations between the emphasis of pronunciation (with relation to lip movements) and movements of the head and eyebrows [4]. Anyway, these considerations go far beyond the scope of the paper, and our investigation here will be limited only to low-level FAP analysis even if, in the opinion of the authors, this represents a promising and fertile field of research for future applications.

A. Computation of FAP Correlations

Our analysis has been based on the FAP correlation matrix [5], computed as

    R = [ r_{ij} ],    i, j = 1, ..., N    (4)

where N is the number of FAPs under investigation and r_{ij} represents the correlation coefficient between the i-th and the j-th FAPs, defined as

    r_{ij} = c_{ij} / (\sigma_i \sigma_j)    (5)

The term \sigma_i^2 = c_{ii} represents the variance of the i-th FAP and c_{ij} the (i, j) coefficient of the covariance matrix, defined as

    c_{ij} = (1/K) \sum_{k=1}^{K} (F_i^{(k)} - \mu_i)(F_j^{(k)} - \mu_j)    (6)

where F_i^{(k)} is the value of the i-th FAP at the k-th of the K frames of the Training Data Set and \mu_i is its mean value. The matrix R is symmetric by definition, that is, r_{ij} = r_{ji}, with r_{ii} = 1. Though in rigorous mathematical terms r_{ij} is a signed real number, we have chosen to consider its absolute value to simplify the graphical interpretation of the results: the closer the value is to 1, the more the i-th and the j-th FAPs are correlated.

Fig. 2. Graphical representation of matrix R for group 4 (white squares indicate high correlation).

It is worth mentioning that other methods, different from the approach based on the correlation matrix described above, have been considered by the authors for the purpose of identifying a set of uncorrelated FAPs among the 66 defined by MPEG-4. The most suitable among them might seem to be Principal Component Analysis (PCA), of widespread use in this class of problems. However, such methods are of no use in this application and, therefore, have been discarded, since the m-vector basis they provide for n given vectors (m < n) almost never coincides with a subset of the given vectors. On the contrary, this is an obvious constraint for an MPEG-4 compliant codec, where only the 66 FAPs, or possibly a subset of them, are allowed to be transmitted.

After defining the criterion for the correlation estimation, let us examine the various FAP groups. In the following, it will be discussed how to choose the minimum set of FAPs that must be transmitted; in Section V, on the contrary, criteria will be defined to interpolate the non-transmitted parameters.
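In terms of code, (4)-(6) amount to the following sketch, where fap_traj is assumed to be a (frames x FAPs) array of trajectories taken from the Training Data Set:

    import numpy as np

    def fap_correlation_matrix(fap_traj):
        """Matrix of absolute correlation coefficients |r_ij| between FAPs."""
        X = fap_traj - fap_traj.mean(axis=0)      # remove the mean of each FAP
        C = (X.T @ X) / X.shape[0]                # covariance coefficients c_ij
        sigma = np.sqrt(np.diag(C))               # standard deviations
        return np.abs(C / np.outer(sigma, sigma)) # |r_ij| in [0, 1]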
B. Group 4 (Eyebrows)

Matrix R is represented with gray levels in Fig. 2. The graphical representation of R allows a faster assessment of the correlations: bright blocks indicate a high value of |r_{ij}|, while dark blocks identify totally uncorrelated FAPs. The visualization of R also allows a first data validation, by comparing the correlation of the FAPs on the right side of the face with that of the FAPs on the left side. By inspection, it turns out that FAP 38 has an unexpectedly low correlation with all the other FAPs. By analyzing the temporal trajectory of FAP 38 (squeeze_r_eyebrow), it was discovered that its intrinsically low dynamics, typically in the order of a few tens of units of the Eye Separation FAPU (ES), has been completely obscured by the acquisition noise, whose standard deviation was comparable with the signal. Because of this systematic error in the measurement process, all the FAPs affected by this kind of acquisition noise have been excluded from further analysis, and a second matrix R' has been computed in which the columns corresponding to symmetric FAPs have been merged.

Fig. 3. Graphical representation of matrix R' for group 4 (white squares indicate high correlation).
TABLE IV. Matrix R' for group 4 (left) and values of the s coefficient (right).
Fig. 4. Graphical representation of matrix R for groups 2 (partial), 5, and 8 (white squares indicate high correlation).

Since MPEG-4 forces identical values for symmetric FAPs in case of interpolation, it is convenient to consider symmetric FAPs as a single parameter (in our case, we have chosen to represent only the left FAPs). On the basis of the above considerations, measures related to FAP 38 have been replaced with those of the symmetric FAP 37, whose acquisition was significantly less noisy. Matrix R' is visualized in Fig. 3, while the values of its coefficients are reported in Table IV.

The information associated to Table IV provides good indications for selecting the specific FAP that will be used to interpolate the other ones. The criterion we have adopted is that of selecting the i-th FAP having the highest overall correlation with respect to the other FAPs, i.e., the one maximizing

    s_i = \sum_{j \neq i} |r_{ij}|    (7)

The selected i-th FAP can then be used to interpolate any other j-th FAP with r_{ij} > 0.75. The threshold value of 0.75 has been determined experimentally, seeking a reasonable tradeoff between rate and distortion. Once this operation has been completed, the rows and columns corresponding to the selected i-th and j-th FAPs are removed from the matrix and the coefficients s_i are recomputed. This process is then iterated until all the examined FAPs have been analyzed. As far as FAPs in group 4 are concerned, the values of r_{ij} and s_i allow an easy choice of FAP 33 (or of its symmetric FAP 34) as the best and unique FAP to transmit, since its correlation with each of the remaining FAPs of the group is above the threshold. In Section V, the interpolation criteria will be discussed and defined.

C. Groups 2, 5, and 8 (Jaw, Chin, Lip Protrusion, Cheeks, Outer Lip, Cornerlip)

The corresponding matrix R is visualized in Fig. 4. By inspecting this matrix, a few significant considerations can be drawn. Also in this case, an unexpectedly low correlation is found between FAPs 53 and 54 (stretch outer corner lips). If we then consider the FAPs corresponding to the upper lip, it is evident that they are substantially uncorrelated from all the other FAPs. Also, FAP 15 (shift_jaw) seems to be totally uncorrelated from the other FAPs. This conclusion is confirmed by a deeper analysis of its time trajectory, revealing small random variations due, very likely, to the acquisition noise, since during speech this movement is substantially absent.

Let us now consider the matrix R', obtained from R by removing the noisy FAPs and by merging left and right pairs. When inspecting the coefficients of matrix R' (see Fig. 5 and Table V), choosing the FAPs to transmit seems to be more complex than in group 4. The first FAP selected for transmission is FAP 52, used then to interpolate FAPs 3, 14, and 57 (whose correlation with FAP 52 is above the threshold). Therefore, four rows and four columns can be removed from R' and the new matrix R^(1) is obtained by recomputing the coefficients (the superscript 1 indicates that the first FAP to transmit has been selected or, in other terms, that the first simplification step has been completed). As evidenced in Table VI and Fig. 6, the FAPs with the highest s values are FAP 41 (lift_l_cheek) and FAP 59 (raise_l_cornerlip_o). The decision of selecting FAP 59 as the second parameter to transmit, though s_41 was slightly greater than s_59, is due to evident technical reasons: experiments show that the automatic extraction of lip corners from real video sequences is far easier than that of cheek coordinates.
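The iterative selection applied in this section and in the previous one can be sketched as follows. The definition of s_i as the sum of the |r_ij| of the remaining FAPs is our reading of (7), and the occasional manual overrides made for trackability reasons (e.g., FAP 59 instead of FAP 41) are not reproduced:

    import numpy as np

    def select_faps(R, threshold=0.75):
        """R: symmetric matrix of |r_ij|. Returns the FAPs to transmit and, for
        every interpolated FAP, the transmitted FAP it is estimated from."""
        remaining = list(range(R.shape[0]))
        transmitted, interpolated_from = [], {}
        while remaining:
            sub = R[np.ix_(remaining, remaining)]
            s = sub.sum(axis=1) - 1.0             # exclude r_ii = 1
            k = remaining[int(np.argmax(s))]      # FAP most correlated with the rest
            transmitted.append(k)
            drop = [k]
            for j in remaining:
                if j != k and R[k, j] > threshold:
                    interpolated_from[j] = k      # j can be interpolated from k
                    drop.append(j)
            remaining = [j for j in remaining if j not in drop]
        return transmitted, interpolated_from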

TABLE V. Matrix R' for groups 2 (partial), 5, and 8 (left), and values of the s coefficient (right).
TABLE VI. Matrix R^(1) for groups 2 (partial), 5, and 8 (left), and values of the s coefficient (right).
Fig. 5. Graphical representation of matrix R' for groups 2 (partial), 5, and 8 (white squares indicate high correlation).
Fig. 6. Graphical representation of matrix R^(1) for groups 2 (partial), 5, and 8 (white squares indicate high correlation).

The comparable s values of FAP 41 and FAP 59, together with their high cross correlation, should guarantee good interpolation anyway. Since the only over-threshold correlation of FAP 41 is the one with FAP 59, it can be reasonably assumed that FAP 41 can be interpolated effectively from FAP 59.

Let us now erase these two FAPs and recompute the matrix R^(2), as shown in Table VII and Fig. 7. Now the highest s value is reached by FAP 53, even if it is only slightly higher than those of the other FAPs. Also in this case, only FAP 39 has an over-threshold correlation with FAP 53 and results as the only parameter that can be interpolated from FAP 53.

TABLE VII. Matrix R^(2) for groups 2 (partial), 5, and 8 (left), and values of the s coefficient (right).
TABLE VIII. Matrix R^(3) for groups 2 (partial), 5, and 8 (left), and values of the s coefficient (right).
Fig. 7. Graphical representation of matrix R^(2) for groups 2 (partial), 5, and 8 (white squares indicate high correlation).
Fig. 8. Graphical representation of matrix R^(3) for groups 2 (partial), 5, and 8 (white squares indicate high correlation).

The next step is the computation of R^(3), as described in Table VIII and Fig. 8. The next parameter to transmit would be FAP 55, though with an s value only slightly greater than that of FAP 51. Also in this case, as in the previous step, we have preferred to transmit FAP 51 (lower_top_midlip_o), since it is highly correlated to FAP 55 and is more easily trackable from real video analysis. FAPs 16 and 17 are the last ones to be transmitted, being those maximally uncorrelated from the others.

Some comments are worth drawing about lip protrusion. Since these FAPs describe variations along the z axis (with positive orientation out of the screen) of the mid point of the upper and lower lip, their values are partially correlated to the lip openings and to the movements of the mouth corners (see the complete matrix R). Anyway, experimental results prove that these relations cannot be modeled easily and, in particular, that no linear dependence can be formalized between FAPs 16 and 17, on one side, and the other FAPs of groups 2, 5, and 8, on the other side.

D. Group 3 (Eyeballs, Pupils, Eyelids)

There is no need for experimental evidence to state that movements of the eyeballs and pupils are maximally correlated, since they are affected by the same rigid motion. Lower eyelids are static most of the time and only rarely affected by an almost imperceptible motion. Movements of the upper eyelids, on the other hand, are substantially uncorrelated from eyeballs and pupils. Moreover, as defined in the MPEG-4 specifications through the left-right interpolation criterion, all the right FAPs in this group can be interpolated from the corresponding left ones (or vice versa). Based on the previous considerations, we can conclude that only three FAPs (19, 23, and 25) are necessary for animating the entire group; the remaining FAPs are not necessary or can be interpolated as indicated in Table IX. Even the transmission of these three FAPs, if needed, can be omitted by simulating eye blinking at the decoder and by synthesizing the movements of the eyeballs based on the parameters encoding the head rotation. In Section V-D, some criteria are explained to synthetically generate FAPs 19, 23, and 25 and, therefore, to completely avoid the transmission of any FAP in group 3.

TABLE IX. FAP interpolation coefficients (values of a) for group 3 (TX means transmitted FAP; empty rows indicate FAPs not necessary for typical animations; 1) indicates that those FAPs can possibly be totally synthesized and 2) indicates that those FAPs can be interpolated from the head movements).
TABLE X. FAP interpolation coefficients (values of a) for group 4 (TX means transmitted FAP).

V. FAP INTERPOLATION CRITERIA

After having selected the FAPs that optimize the estimation of the non-transmitted parameters (see Section IV), their specific mutual dependencies must be formalized and implemented.

A. Computation of FAP Interpolation Criteria

For the sake of simplicity, let us assume that each of the FAPs to be interpolated is linearly dependent on a single FAP among those actually transmitted. As will be evidenced by the experimental results reported in the following, this hypothesis is very close to reality. By comparing the trajectories of the estimated and measured FAPs, it turns out that they are quite similar, except in correspondence of intervals where the FAP amplitude is very high. It is reasonable to suppose that this effect is due to a nonlinear saturation distortion affecting the estimates, which is difficult to counteract even by modeling the FAP dependencies through more complex relations. As an example, let us consider FAP 3 (open_jaw) and FAP 52 (raise_b_midlip_o): when the jaw is completely closed or open, the lower lip still has some residual possibility to move independently from the jaw itself. Despite this annoying but, fortunately, rare and scarcely perceivable distortion, let us formalize the linear inter-FAP dependence as follows:

    \hat{F}_i^{(k)} = a_{ij} F_j^{(k)},    k = 1, ..., K    (8)

where F^{(k)} represents the value of a FAP at the k-th frame of a sequence of K frames, F_i is the parameter to be interpolated, and F_j the parameter actually transmitted. The problem consists of determining the interpolation coefficient a_{ij} that minimizes the mean square error (MSE), defined as

    MSE = (1/K) \sum_{k=1}^{K} ( F_i^{(k)} - a_{ij} F_j^{(k)} )^2    (9)

The optimal coefficient results as

    a_{ij} = \sum_{k=1}^{K} F_i^{(k)} F_j^{(k)} / \sum_{k=1}^{K} ( F_j^{(k)} )^2    (10)
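In code, the one-parameter fit of (10) is immediate; the sketch below (our own, with the trajectories taken from the Training Data Set) returns the coefficient a_ij used to estimate FAP i from the transmitted FAP j:

    import numpy as np

    def interpolation_coefficient(f_i, f_j):
        """a_ij minimizing the MSE of f_i ~ a_ij * f_j over all K frames."""
        f_i, f_j = np.asarray(f_i, dtype=float), np.asarray(f_j, dtype=float)
        return float(np.dot(f_i, f_j) / np.dot(f_j, f_j))

    def interpolate(f_j, a_ij):
        """Estimated trajectory of the interpolated FAP i."""
        return a_ij * np.asarray(f_j, dtype=float)

Applying this fit to every (interpolated, transmitted) pair produced by the selection of Section IV yields tables of coefficients such as Tables X and XI below.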

Fig. 9. Trajectories of FAP 31 (raised left inner eyebrow, solid line) estimated from FAP 33 (raised left middle eyebrow) and its actual value (dashed line).
Fig. 10. Trajectories of FAP 37 (squeezed left eyebrow, solid line) estimated from FAP 33 (raised left middle eyebrow) and its actual value (dashed line).
TABLE XI. FAP interpolation coefficients (values of a) for groups 2, 5, and 8 (TX means transmitted FAP; empty rows indicate FAPs not necessary for typical animations).

B. Group 4 (Eyebrows)

Table X summarizes the values of a computed for the FAPs of group 4. In Figs. 9 and 10, the trajectories of the estimated FAPs (solid line) are compared to the actual measured values (dashed line). Besides the substantially correct reproduction of the trajectories, it must be noticed how, in the case of FAP 37, affected by significant acquisition noise, the estimates are somehow a low-pass filtered replica of the original parameters. Instead of degrading the quality of the synthesis, this low-pass filtering has the positive effect of gracefully smoothing the animation, thus making it more natural and realistic.

C. Groups 2, 5, and 8 (Jaw, Chin, Lip Protrusion, Cheeks, Outer Lip, Cornerlip)

Table XI indicates the values of a computed for the FAPs of groups 2, 5, and 8. In Figs. 11-14, the trajectories of the estimated FAPs (solid line) are compared with the actual FAP values (dashed line).

Fig. 11. Trajectory of FAP 3 (open jaw, solid line) estimated from FAP 52 (raised bottom midlip outer) and its actual value (dashed line).

D. Group 3 (Eyeballs, Pupils, Eyelids)

Various studies, both in medicine and in psychology, have computed typical values for the frequency and duration of eye blinking; in [6], as an example, typical figures are reported for the blink frequency and for the average duration of eye closure. Based on these experimental evidences, it is easy to simulate the eye blinking at the decoder by means of FAP 19. Some experiments and subjective evaluations carried out by the authors have proven that, for many applications, it is acceptable to interpolate FAPs 23 and 25 based only on the head rotation, in such a way as to maintain the gaze of the virtual character as frontal as possible, so as to meet the gaze of the interacting human, who is supposed to be seated frontally to the monitor. The typical values that we have used to interpolate the movements of the eyes from the movements of the head have been obtained experimentally.
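A possible realization of this synthesis is sketched below; the blink timing, the closure amplitude, and the gaze gain are placeholders, not the experimentally determined values used by the authors:

    import numpy as np

    def synthesize_eye_faps(head_pitch, head_yaw, fps=25.0,
                            blink_period_s=4.0, blink_dur_s=0.15,
                            blink_amplitude=1024, k_gaze=1.0):
        """Generate FAP 19 (eye blink) and FAPs 23 and 25 (gaze) trajectories
        from the head-rotation FAPs; all constants are illustrative."""
        head_pitch = np.asarray(head_pitch, dtype=float)
        head_yaw = np.asarray(head_yaw, dtype=float)
        n = len(head_pitch)
        fap19 = np.zeros(n)                          # close_t_l_eyelid
        period = max(1, int(round(blink_period_s * fps)))
        dur = max(1, int(round(blink_dur_s * fps)))
        for start in range(0, n, period):            # one blink per period
            fap19[start:start + dur] = blink_amplitude
        fap23 = -k_gaze * head_yaw                   # eyeball yaw counter-rotates
        fap25 = -k_gaze * head_pitch                 # eyeball pitch counter-rotates
        return fap19, fap23, fap25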

Fig. 12. Trajectory of FAP 57 (raised bottom left-middle lip outer, solid line) estimated from FAP 52 (raised bottom midlip outer) and its actual value (dashed line).
Fig. 13. Trajectory of FAP 39 (puffed left cheek, solid line) estimated from FAP 53 (stretched left cornerlip outer) and its actual value (dashed line).
Fig. 14. Trajectory of FAP 41 (lifted left cheek, solid line) estimated from FAP 59 (raised left cornerlip outer) and its actual value (dashed line).

VI. EXPERIMENTAL RESULTS

In conclusion, from what has been described in the previous sections, it turns out that a subset of only 10 FAPs, suitably chosen from the complete set of 66 FAPs, can be used to guarantee the efficient encoding of MPEG-4 facial animation sequences: one FAP for the eyebrow movements (FAP 33), six FAPs for mouth and cheek movements (FAPs 16, 17, 51, 52, 53, and 59), and three FAPs for the head rotation (FAPs 48, 49, and 50). As is proven experimentally, the remaining FAPs can easily be interpolated from those actually transmitted, or turn out to be superfluous in typical sequences with continuous speech produced by natural faces. In the opinion of the authors, at least one more FAP of group 6 should be transmitted for controlling the movements of the tongue. However, the evident difficulties in tracking the tongue movements have so far prevented its analysis and modeling.

In the remainder of this section, we provide a quality evaluation, both objective and subjective, of the animation obtained by using only this subset of 10 FAPs, compared to what is achievable by exploiting the full set of 46 FAPs captured with the acquisition system described in Section III. For running the experiments, three different sets of FAPs have been generated. The first, set A, has been generated by interpolating the FAP Test Data Set (see the description in Section III) starting from only 10 FAPs encoded with a quantization scaling factor of 1 and then decoded. The second, set B, has been obtained by encoding, and then decoding, the entire Test Data Set with a quantization scaling factor of 16, so as to maintain the same bit rate (around 1.4 kbits/s) associated with set A. The third, set C, has been obtained by encoding, and then decoding, the entire Test Data Set with a quantization scaling factor of 1 and, therefore, with the same quality as set A, but at a higher bit rate. Each of the three sets A, B, and C achieves a frame rate of 25 frames/s. Fig. 15 provides a graphical description of the three sets. The results have been compared to the original Test Data Set before parameter encoding.

In addition to the 30 FAPs actually captured through the acquisition system, the Test Data Set includes some FAPs whose value has been synthesized artificially, like the ten missing FAPs in group 2, those controlling the upper eyelids, and those responsible for the eyeball rotation. This has the purpose of simulating a more realistic situation, obtainable in case the complete set of FAPs could be captured through a more sophisticated acquisition system, where the maximum available information for facial animation is used, and of allowing a better comparison of FAP encoding with and without FAP interpolation. In Table XII the bit rate achieved for the three sets A, B, and C is reported.

Fig. 15. The three test data sets used in the experiments.
TABLE XII. Bit rates for the three test data sets.
TABLE XIII. PSNR for some FAPs obtained with the three different coding schemes, with (Y) or without (N) interpolation.
Fig. 16. FAP 35 (raise_l_o_eyebrow) encoded with qsf = 1, encoded with qsf = 16, and interpolated; note that, though FAP 35 encoded with qsf = 16 has a better PSNR, its step-wise behavior results in a subjectively worse animation than the interpolated FAP.
Fig. 17. FAP 48 (head_pitch) encoded with qsf = 1 (solid line) and encoded with qsf = 16 (dotted line); note that FAP 48 encoded with qsf = 16 has both a worse PSNR and a step-wise behavior compared with FAP 48 encoded with qsf = 1.

The bit rate for set A is significantly lower than for set C, while maintaining high quality in FAP reproduction (as evidenced in Table XIII). Table XIII reports the PSNR values computed for a few FAPs, part of them interpolated and part of them transmitted, as specified in the column Interp.

The results reported in Table XIII suggest some important considerations. In the case of set A, the interpolated FAPs are obviously characterized by a PSNR lower than in the other two cases. Nevertheless, the temporal trajectory of these FAPs, unlike for set B, is not substantially affected by quantization distortion. Fig. 16 evidences this phenomenon clearly: when FAP 35 (raise_l_o_eyebrow) is interpolated, its value differs from the actual measure more than in the case of qsf equal to 16. However, the step-wise characteristics of the quantization noise turn out to be subjectively more annoying during the animation.
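For reference, per-FAP figures of the kind reported in Table XIII can be reproduced with a sketch like the following, which also shows the step-wise effect of a coarse quantization scaling factor; the PSNR convention (peak taken as the dynamic range of the original trajectory) is an assumption on our part, since the paper does not spell it out:

    import numpy as np

    def quantize(fap, qsf):
        """Uniform quantization of a FAP trajectory with scaling factor qsf;
        a coarse qsf produces the step-wise trajectories of Figs. 16 and 17."""
        fap = np.asarray(fap, dtype=float)
        return np.round(fap / qsf) * qsf

    def fap_psnr(original, decoded):
        """Per-FAP PSNR; the peak is assumed to be the range of the original."""
        original = np.asarray(original, dtype=float)
        mse = np.mean((original - np.asarray(decoded, dtype=float)) ** 2)
        peak = np.ptp(original)
        return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)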

TABLE XIV. Encoding parameters of the "wow" sequence.

A second consideration concerns the 10 FAPs transmitted in the case of set A, whose quality results far higher than in the case of set B, where all the FAPs are subject to coarse quantization. In particular, the movements of the head are among those most sensitive to quantization noise, as shown in Fig. 17: the dark line represents the trajectory of FAP 48 (head_pitch) both in the case of set A and of set C, while the gray line refers to case B. Based on these experimental evidences, it is important to notice that the reproduction of head movements must be sufficiently smooth to avoid severe subjective artifacts, like annoying jerky head motion. Differently from many other movements of the face, the FAPs controlling the head motion must be quantized with very small values of qsf.

The third and last consideration suggested by the analysis of Table XIII concerns the negligible information conveyed by the average PSNR when mutually comparing sets A, B, and C. The above considerations on FAP quantization, and the fact that, in set A, the PSNR computed over the interpolated FAPs differs significantly from the PSNR associated with the transmitted FAPs, make the use of the average PSNR as an objective evaluation criterion almost meaningless.

The achieved results put in evidence how, toward the objective of reducing as much as possible the bit rate needed to transmit a FAP stream, it is preferable to employ FAP interpolation rather than to increase the quantization scaling factor. In order to allow a more reliable subjective evaluation of the quality improvements that can be achieved by exploiting FAP interpolation, two movies are available on the authors' web site, based on the Facial Animation Engine (FAE) [7] developed at the DSP Lab of DIST. In the first movie, the stream wow.fap (donated by DIST to the MPEG Face and Body Animation Ad Hoc Group) encoded with qsf = 1 (set C) is compared to the same stream encoded at very low bit rate with qsf = 16 (set B). The second movie compares set C with the same stream encoded at very low bit rate by using FAP interpolation (set A). The subjective evaluation is left to the readers. In Table XIV, the characteristics of the sequences are summarized.

As a final consideration, it is important to notice how the conventional objective evaluation based on the PSNR computed on each static frame has almost no meaning here, since the quality of the synthetic images is good all the time. On the contrary, the increase of the quantization factor significantly affects the smoothness of the movements which, as evidenced by the movie WowQ.mpg, can only be appreciated and evaluated by playing the video.

VII. CONCLUSION

The cross-correlation analysis between MPEG-4 FAPs reported in this paper, together with the proposed algorithm for FAP interpolation, represents a key reference for any study oriented to exploit this FAP encoding modality for achieving efficient transmission of FAP streams. The innovative contribution of this study consists both of the specific technical solution that is proposed and of the experimental evidences that are produced. Up to now, to the knowledge of the authors, no investigation has been reported in the scientific literature on the exploitation of the FAP interpolation modality, nor any concrete proposal on procedural solutions.
The experimental results reported here provide a clear indication of the performance level that FAP interpolation can guarantee and they suggest, therefore, a large variety of possible applications of MPEG-4 facial animation technologies within Internet-based services, mobile interpersonal communications, etc.

REFERENCES

[1] Text for ISO/IEC FDIS 14496-2 Visual, ISO/IEC JTC1/SC29/WG11 N2502, Nov. 1998.
[2] Text for ISO/IEC FDIS 14496-1 Systems, ISO/IEC JTC1/SC29/WG11 N2501, Nov. 1998.
[3] G. Ferrigno and A. Pedrotti, "ELITE: A digital dedicated hardware system for movement analysis via real-time TV signal processing," IEEE Trans. Biomed. Eng., vol. BME-32, 1985.
[4] C. Pelachaud, N. Badler, and M. Steedman, "Generating facial expressions for speech," Cognitive Science, vol. 20, no. 1, pp. 1-46, 1996.
[5] A. Papoulis, Probability, Random Variables, and Stochastic Processes. New York: McGraw-Hill.
[6] J. A. Stern, D. Boyer, D. J. Schroeder, R. M. Touchstone, and N. Stoliarov, "Blinks, saccades, and fixation pauses during vigilance task performance: II. Gender and time of day," FAA Office of Aviation Medicine, Civil Aeromedical Institute, Aviation Medicine Reports.
[7] F. Lavagetto and R. Pockaj, "The facial animation engine: Toward a high-level interface for the design of MPEG-4 compliant animated faces," IEEE Trans. Circuits Syst. Video Technol., vol. 9, pp. 277-289, Mar. 1999.

Fabio Lavagetto was born in Genoa, Italy. He received the Laurea degree in electrical engineering from the University of Genoa, Genoa, Italy, in March 1987, and the Ph.D. degree from the Department of Communication, Computer and System Sciences (DIST), University of Genoa. From 1987 to 1988, he was with the Marconi Group, Genova, Italy, working on real-time image processing. He was a visiting researcher with AT&T Bell Laboratories, Holmdel, NJ, during 1990, and a Contract Professor in digital signal processing at the University of Parma, Parma, Italy. Presently, he is an Associate Professor with DIST, where he teaches a course on radio communication systems and is responsible for many national and international research projects. He coordinated the European ACTS project VIDAS, concerned with the application of MPEG-4 technologies in multimedia telecommunication products. Since January 2000, he has been coordinating the IST European project INTERFACE, which is oriented to speech/image emotional analysis/synthesis. He is the author of more than 70 scientific papers in the area of multimedia data management and coding.

Roberto Pockaj was born in Genova, Italy. He received the Master's degree in electronic engineering in 1993 from the University of Genova, Genova, Italy, and the Ph.D. degree in computer engineering and computer science from the Department of Communication, Computer and System Sciences (DIST), University of Genova. From June 1992 to June 1996, he was a software designer with the Marconi Group, Genova, Italy, working in the field of real-time image and signal processing for optoelectronic applications (active and passive laser sensors). Between 1996 and 2001, he collaborated on the management of the European projects ACTS-VIDAS and IST-INTERFACE, and participated in the definition of the new MPEG-4 standard for the coding of multimedia contents within the Ad Hoc Group on Face and Body Animation. He is currently a Contract Researcher at DIST. He has authored many papers on image processing and multimedia management.


User Level QoS Assessment of a Multipoint to Multipoint TV Conferencing Application over IP Networks User Level QoS Assessment of a Multipoint to Multipoint TV Conferencing Application over IP Networks Yoshihiro Ito and Shuji Tasaka Department of Computer Science and Engineering, Graduate School of Engineering

More information

CONTENT ADAPTIVE COMPLEXITY REDUCTION SCHEME FOR QUALITY/FIDELITY SCALABLE HEVC

CONTENT ADAPTIVE COMPLEXITY REDUCTION SCHEME FOR QUALITY/FIDELITY SCALABLE HEVC CONTENT ADAPTIVE COMPLEXITY REDUCTION SCHEME FOR QUALITY/FIDELITY SCALABLE HEVC Hamid Reza Tohidypour, Mahsa T. Pourazad 1,2, and Panos Nasiopoulos 1 1 Department of Electrical & Computer Engineering,

More information

MPEG-4 AUTHORING TOOL FOR THE COMPOSITION OF 3D AUDIOVISUAL SCENES

MPEG-4 AUTHORING TOOL FOR THE COMPOSITION OF 3D AUDIOVISUAL SCENES MPEG-4 AUTHORING TOOL FOR THE COMPOSITION OF 3D AUDIOVISUAL SCENES P. Daras I. Kompatsiaris T. Raptis M. G. Strintzis Informatics and Telematics Institute 1,Kyvernidou str. 546 39 Thessaloniki, GREECE

More information

Adaptive Waveform Inversion: Theory Mike Warner*, Imperial College London, and Lluís Guasch, Sub Salt Solutions Limited

Adaptive Waveform Inversion: Theory Mike Warner*, Imperial College London, and Lluís Guasch, Sub Salt Solutions Limited Adaptive Waveform Inversion: Theory Mike Warner*, Imperial College London, and Lluís Guasch, Sub Salt Solutions Limited Summary We present a new method for performing full-waveform inversion that appears

More information

Optimized Progressive Coding of Stereo Images Using Discrete Wavelet Transform

Optimized Progressive Coding of Stereo Images Using Discrete Wavelet Transform Optimized Progressive Coding of Stereo Images Using Discrete Wavelet Transform Torsten Palfner, Alexander Mali and Erika Müller Institute of Telecommunications and Information Technology, University of

More information

CS 231. Deformation simulation (and faces)

CS 231. Deformation simulation (and faces) CS 231 Deformation simulation (and faces) Deformation BODY Simulation Discretization Spring-mass models difficult to model continuum properties Simple & fast to implement and understand Finite Element

More information

DIGITAL TELEVISION 1. DIGITAL VIDEO FUNDAMENTALS

DIGITAL TELEVISION 1. DIGITAL VIDEO FUNDAMENTALS DIGITAL TELEVISION 1. DIGITAL VIDEO FUNDAMENTALS Television services in Europe currently broadcast video at a frame rate of 25 Hz. Each frame consists of two interlaced fields, giving a field rate of 50

More information

White Paper: WiseStream II Technology. hanwhasecurity.com

White Paper: WiseStream II Technology. hanwhasecurity.com White Paper: WiseStream II Technology hanwhasecurity.com Contents 1. Introduction & Background p. 2 2. WiseStream II Technology p. 3 3. WiseStream II Setup p. 5 4. Conclusion p.7 1. Introduction & Background

More information

System Modeling and Implementation of MPEG-4. Encoder under Fine-Granular-Scalability Framework

System Modeling and Implementation of MPEG-4. Encoder under Fine-Granular-Scalability Framework System Modeling and Implementation of MPEG-4 Encoder under Fine-Granular-Scalability Framework Final Report Embedded Software Systems Prof. B. L. Evans by Wei Li and Zhenxun Xiao May 8, 2002 Abstract Stream

More information

One-pass bitrate control for MPEG-4 Scalable Video Coding using ρ-domain

One-pass bitrate control for MPEG-4 Scalable Video Coding using ρ-domain Author manuscript, published in "International Symposium on Broadband Multimedia Systems and Broadcasting, Bilbao : Spain (2009)" One-pass bitrate control for MPEG-4 Scalable Video Coding using ρ-domain

More information

Motion Estimation for Video Coding Standards

Motion Estimation for Video Coding Standards Motion Estimation for Video Coding Standards Prof. Ja-Ling Wu Department of Computer Science and Information Engineering National Taiwan University Introduction of Motion Estimation The goal of video compression

More information

Lip Tracking for MPEG-4 Facial Animation

Lip Tracking for MPEG-4 Facial Animation Lip Tracking for MPEG-4 Facial Animation Zhilin Wu, Petar S. Aleksic, and Aggelos. atsaggelos Department of Electrical and Computer Engineering Northwestern University 45 North Sheridan Road, Evanston,

More information

Coding of Coefficients of two-dimensional non-separable Adaptive Wiener Interpolation Filter

Coding of Coefficients of two-dimensional non-separable Adaptive Wiener Interpolation Filter Coding of Coefficients of two-dimensional non-separable Adaptive Wiener Interpolation Filter Y. Vatis, B. Edler, I. Wassermann, D. T. Nguyen and J. Ostermann ABSTRACT Standard video compression techniques

More information

A Low Bit-Rate Video Codec Based on Two-Dimensional Mesh Motion Compensation with Adaptive Interpolation

A Low Bit-Rate Video Codec Based on Two-Dimensional Mesh Motion Compensation with Adaptive Interpolation IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 11, NO. 1, JANUARY 2001 111 A Low Bit-Rate Video Codec Based on Two-Dimensional Mesh Motion Compensation with Adaptive Interpolation

More information

Perceptual coding. A psychoacoustic model is used to identify those signals that are influenced by both these effects.

Perceptual coding. A psychoacoustic model is used to identify those signals that are influenced by both these effects. Perceptual coding Both LPC and CELP are used primarily for telephony applications and hence the compression of a speech signal. Perceptual encoders, however, have been designed for the compression of general

More information

SINGLE PASS DEPENDENT BIT ALLOCATION FOR SPATIAL SCALABILITY CODING OF H.264/SVC

SINGLE PASS DEPENDENT BIT ALLOCATION FOR SPATIAL SCALABILITY CODING OF H.264/SVC SINGLE PASS DEPENDENT BIT ALLOCATION FOR SPATIAL SCALABILITY CODING OF H.264/SVC Randa Atta, Rehab F. Abdel-Kader, and Amera Abd-AlRahem Electrical Engineering Department, Faculty of Engineering, Port

More information

(Refer Slide Time 00:17) Welcome to the course on Digital Image Processing. (Refer Slide Time 00:22)

(Refer Slide Time 00:17) Welcome to the course on Digital Image Processing. (Refer Slide Time 00:22) Digital Image Processing Prof. P. K. Biswas Department of Electronics and Electrical Communications Engineering Indian Institute of Technology, Kharagpur Module Number 01 Lecture Number 02 Application

More information

An Adaptable Neural-Network Model for Recursive Nonlinear Traffic Prediction and Modeling of MPEG Video Sources

An Adaptable Neural-Network Model for Recursive Nonlinear Traffic Prediction and Modeling of MPEG Video Sources 150 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL 14, NO 1, JANUARY 2003 An Adaptable Neural-Network Model for Recursive Nonlinear Traffic Prediction and Modeling of MPEG Video Sources Anastasios D Doulamis,

More information

ADAPTIVE PICTURE SLICING FOR DISTORTION-BASED CLASSIFICATION OF VIDEO PACKETS

ADAPTIVE PICTURE SLICING FOR DISTORTION-BASED CLASSIFICATION OF VIDEO PACKETS ADAPTIVE PICTURE SLICING FOR DISTORTION-BASED CLASSIFICATION OF VIDEO PACKETS E. Masala, D. Quaglia, J.C. De Martin Λ Dipartimento di Automatica e Informatica/ Λ IRITI-CNR Politecnico di Torino, Italy

More information

Image and Video Coding I: Fundamentals

Image and Video Coding I: Fundamentals Image and Video Coding I: Fundamentals Thomas Wiegand Technische Universität Berlin T. Wiegand (TU Berlin) Image and Video Coding Organization Vorlesung: Donnerstag 10:15-11:45 Raum EN-368 Material: http://www.ic.tu-berlin.de/menue/studium_und_lehre/

More information

MAXIMIZING BANDWIDTH EFFICIENCY

MAXIMIZING BANDWIDTH EFFICIENCY MAXIMIZING BANDWIDTH EFFICIENCY Benefits of Mezzanine Encoding Rev PA1 Ericsson AB 2016 1 (19) 1 Motivation 1.1 Consumption of Available Bandwidth Pressure on available fiber bandwidth continues to outpace

More information

CS 231. Deformation simulation (and faces)

CS 231. Deformation simulation (and faces) CS 231 Deformation simulation (and faces) 1 Cloth Simulation deformable surface model Represent cloth model as a triangular or rectangular grid Points of finite mass as vertices Forces or energies of points

More information

SURVEILLANCE VIDEO FOR MOBILE DEVICES

SURVEILLANCE VIDEO FOR MOBILE DEVICES SURVEILLANCE VIDEO FOR MOBILE DEVICES Olivier Steiger, Touradj Ebrahimi Signal Processing Institute Ecole Polytechnique Fédérale de Lausanne (EPFL) CH-1015 Lausanne, Switzerland {olivier.steiger,touradj.ebrahimi}@epfl.ch

More information

On the Adoption of Multiview Video Coding in Wireless Multimedia Sensor Networks

On the Adoption of Multiview Video Coding in Wireless Multimedia Sensor Networks 2011 Wireless Advanced On the Adoption of Multiview Video Coding in Wireless Multimedia Sensor Networks S. Colonnese, F. Cuomo, O. Damiano, V. De Pascalis and T. Melodia University of Rome, Sapienza, DIET,

More information

Fast Wavelet-based Macro-block Selection Algorithm for H.264 Video Codec

Fast Wavelet-based Macro-block Selection Algorithm for H.264 Video Codec Proceedings of the International MultiConference of Engineers and Computer Scientists 8 Vol I IMECS 8, 19-1 March, 8, Hong Kong Fast Wavelet-based Macro-block Selection Algorithm for H.64 Video Codec Shi-Huang

More information

IST MPEG-4 Video Compliant Framework

IST MPEG-4 Video Compliant Framework IST MPEG-4 Video Compliant Framework João Valentim, Paulo Nunes, Fernando Pereira Instituto de Telecomunicações, Instituto Superior Técnico, Av. Rovisco Pais, 1049-001 Lisboa, Portugal Abstract This paper

More information

Topics for thesis. Automatic Speech-based Emotion Recognition

Topics for thesis. Automatic Speech-based Emotion Recognition Topics for thesis Bachelor: Automatic Speech-based Emotion Recognition Emotion recognition is an important part of Human-Computer Interaction (HCI). It has various applications in industrial and commercial

More information

Both LPC and CELP are used primarily for telephony applications and hence the compression of a speech signal.

Both LPC and CELP are used primarily for telephony applications and hence the compression of a speech signal. Perceptual coding Both LPC and CELP are used primarily for telephony applications and hence the compression of a speech signal. Perceptual encoders, however, have been designed for the compression of general

More information

IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 4, NO. 1, MARCH

IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 4, NO. 1, MARCH IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 4, NO. 1, MARCH 2014 43 Content-Aware Modeling and Enhancing User Experience in Cloud Mobile Rendering and Streaming Yao Liu,

More information

Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation

Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation Obviously, this is a very slow process and not suitable for dynamic scenes. To speed things up, we can use a laser that projects a vertical line of light onto the scene. This laser rotates around its vertical

More information

Facial Animation System Based on Image Warping Algorithm

Facial Animation System Based on Image Warping Algorithm Facial Animation System Based on Image Warping Algorithm Lanfang Dong 1, Yatao Wang 2, Kui Ni 3, Kuikui Lu 4 Vision Computing and Visualization Laboratory, School of Computer Science and Technology, University

More information

Accurate 3D Face and Body Modeling from a Single Fixed Kinect

Accurate 3D Face and Body Modeling from a Single Fixed Kinect Accurate 3D Face and Body Modeling from a Single Fixed Kinect Ruizhe Wang*, Matthias Hernandez*, Jongmoo Choi, Gérard Medioni Computer Vision Lab, IRIS University of Southern California Abstract In this

More information

coding of various parts showing different features, the possibility of rotation or of hiding covering parts of the object's surface to gain an insight

coding of various parts showing different features, the possibility of rotation or of hiding covering parts of the object's surface to gain an insight Three-Dimensional Object Reconstruction from Layered Spatial Data Michael Dangl and Robert Sablatnig Vienna University of Technology, Institute of Computer Aided Automation, Pattern Recognition and Image

More information

5LSH0 Advanced Topics Video & Analysis

5LSH0 Advanced Topics Video & Analysis 1 Multiview 3D video / Outline 2 Advanced Topics Multimedia Video (5LSH0), Module 02 3D Geometry, 3D Multiview Video Coding & Rendering Peter H.N. de With, Sveta Zinger & Y. Morvan ( p.h.n.de.with@tue.nl

More information

A METHOD TO MODELIZE THE OVERALL STIFFNESS OF A BUILDING IN A STICK MODEL FITTED TO A 3D MODEL

A METHOD TO MODELIZE THE OVERALL STIFFNESS OF A BUILDING IN A STICK MODEL FITTED TO A 3D MODEL A METHOD TO MODELIE THE OVERALL STIFFNESS OF A BUILDING IN A STICK MODEL FITTED TO A 3D MODEL Marc LEBELLE 1 SUMMARY The aseismic design of a building using the spectral analysis of a stick model presents

More information

VIDEO DENOISING BASED ON ADAPTIVE TEMPORAL AVERAGING

VIDEO DENOISING BASED ON ADAPTIVE TEMPORAL AVERAGING Engineering Review Vol. 32, Issue 2, 64-69, 2012. 64 VIDEO DENOISING BASED ON ADAPTIVE TEMPORAL AVERAGING David BARTOVČAK Miroslav VRANKIĆ Abstract: This paper proposes a video denoising algorithm based

More information

Context-Adaptive Binary Arithmetic Coding with Precise Probability Estimation and Complexity Scalability for High- Efficiency Video Coding*

Context-Adaptive Binary Arithmetic Coding with Precise Probability Estimation and Complexity Scalability for High- Efficiency Video Coding* Context-Adaptive Binary Arithmetic Coding with Precise Probability Estimation and Complexity Scalability for High- Efficiency Video Coding* Damian Karwowski a, Marek Domański a a Poznan University of Technology,

More information

Animated Talking Head With Personalized 3D Head Model

Animated Talking Head With Personalized 3D Head Model Animated Talking Head With Personalized 3D Head Model L.S.Chen, T.S.Huang - Beckman Institute & CSL University of Illinois, Urbana, IL 61801, USA; lchen@ifp.uiuc.edu Jörn Ostermann, AT&T Labs-Research,

More information

Video Compression An Introduction

Video Compression An Introduction Video Compression An Introduction The increasing demand to incorporate video data into telecommunications services, the corporate environment, the entertainment industry, and even at home has made digital

More information

Investigation of the GoP Structure for H.26L Video Streams

Investigation of the GoP Structure for H.26L Video Streams Investigation of the GoP Structure for H.26L Video Streams F. Fitzek P. Seeling M. Reisslein M. Rossi M. Zorzi acticom GmbH mobile networks R & D Group Germany [fitzek seeling]@acticom.de Arizona State

More information

FACE RECOGNITION USING INDEPENDENT COMPONENT

FACE RECOGNITION USING INDEPENDENT COMPONENT Chapter 5 FACE RECOGNITION USING INDEPENDENT COMPONENT ANALYSIS OF GABORJET (GABORJET-ICA) 5.1 INTRODUCTION PCA is probably the most widely used subspace projection technique for face recognition. A major

More information

New Results in Low Bit Rate Speech Coding and Bandwidth Extension

New Results in Low Bit Rate Speech Coding and Bandwidth Extension Audio Engineering Society Convention Paper Presented at the 121st Convention 2006 October 5 8 San Francisco, CA, USA This convention paper has been reproduced from the author's advance manuscript, without

More information

FACIAL MOVEMENT BASED PERSON AUTHENTICATION

FACIAL MOVEMENT BASED PERSON AUTHENTICATION FACIAL MOVEMENT BASED PERSON AUTHENTICATION Pengqing Xie Yang Liu (Presenter) Yong Guan Iowa State University Department of Electrical and Computer Engineering OUTLINE Introduction Literature Review Methodology

More information

Compression of RADARSAT Data with Block Adaptive Wavelets Abstract: 1. Introduction

Compression of RADARSAT Data with Block Adaptive Wavelets Abstract: 1. Introduction Compression of RADARSAT Data with Block Adaptive Wavelets Ian Cumming and Jing Wang Department of Electrical and Computer Engineering The University of British Columbia 2356 Main Mall, Vancouver, BC, Canada

More information

INTERNATIONAL ORGANISATION FOR STANDARDISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC1/SC29/WG11 CODING OF MOVING PICTURES AND AUDIO

INTERNATIONAL ORGANISATION FOR STANDARDISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC1/SC29/WG11 CODING OF MOVING PICTURES AND AUDIO INTERNATIONAL ORGANISATION FOR STANDARDISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC1/SC29/WG11 CODING OF MOVING PICTURES AND AUDIO ISO/IEC JTC1/SC29/WG11 N15071 February 2015, Geneva,

More information

An Efficient Saliency Based Lossless Video Compression Based On Block-By-Block Basis Method

An Efficient Saliency Based Lossless Video Compression Based On Block-By-Block Basis Method An Efficient Saliency Based Lossless Video Compression Based On Block-By-Block Basis Method Ms. P.MUTHUSELVI, M.E(CSE), V.P.M.M Engineering College for Women, Krishnankoil, Virudhungar(dt),Tamil Nadu Sukirthanagarajan@gmail.com

More information

A High Quality/Low Computational Cost Technique for Block Matching Motion Estimation

A High Quality/Low Computational Cost Technique for Block Matching Motion Estimation A High Quality/Low Computational Cost Technique for Block Matching Motion Estimation S. López, G.M. Callicó, J.F. López and R. Sarmiento Research Institute for Applied Microelectronics (IUMA) Department

More information

VIDEO streaming applications over the Internet are gaining. Brief Papers

VIDEO streaming applications over the Internet are gaining. Brief Papers 412 IEEE TRANSACTIONS ON BROADCASTING, VOL. 54, NO. 3, SEPTEMBER 2008 Brief Papers Redundancy Reduction Technique for Dual-Bitstream MPEG Video Streaming With VCR Functionalities Tak-Piu Ip, Yui-Lam Chan,

More information

3G Services Present New Challenges For Network Performance Evaluation

3G Services Present New Challenges For Network Performance Evaluation 3G Services Present New Challenges For Network Performance Evaluation 2004-29-09 1 Outline Synopsis of speech, audio, and video quality evaluation metrics Performance evaluation challenges related to 3G

More information

JPEG compression of monochrome 2D-barcode images using DCT coefficient distributions

JPEG compression of monochrome 2D-barcode images using DCT coefficient distributions Edith Cowan University Research Online ECU Publications Pre. JPEG compression of monochrome D-barcode images using DCT coefficient distributions Keng Teong Tan Hong Kong Baptist University Douglas Chai

More information

Performance Comparison between DWT-based and DCT-based Encoders

Performance Comparison between DWT-based and DCT-based Encoders , pp.83-87 http://dx.doi.org/10.14257/astl.2014.75.19 Performance Comparison between DWT-based and DCT-based Encoders Xin Lu 1 and Xuesong Jin 2 * 1 School of Electronics and Information Engineering, Harbin

More information

Differential Compression and Optimal Caching Methods for Content-Based Image Search Systems

Differential Compression and Optimal Caching Methods for Content-Based Image Search Systems Differential Compression and Optimal Caching Methods for Content-Based Image Search Systems Di Zhong a, Shih-Fu Chang a, John R. Smith b a Department of Electrical Engineering, Columbia University, NY,

More information

Efficient support for interactive operations in multi-resolution video servers

Efficient support for interactive operations in multi-resolution video servers Multimedia Systems 7: 241 253 (1999) Multimedia Systems c Springer-Verlag 1999 Efficient support for interactive operations in multi-resolution video servers Prashant J. Shenoy, Harrick M. Vin Distributed

More information

EXPLORING ON STEGANOGRAPHY FOR LOW BIT RATE WAVELET BASED CODER IN IMAGE RETRIEVAL SYSTEM

EXPLORING ON STEGANOGRAPHY FOR LOW BIT RATE WAVELET BASED CODER IN IMAGE RETRIEVAL SYSTEM TENCON 2000 explore2 Page:1/6 11/08/00 EXPLORING ON STEGANOGRAPHY FOR LOW BIT RATE WAVELET BASED CODER IN IMAGE RETRIEVAL SYSTEM S. Areepongsa, N. Kaewkamnerd, Y. F. Syed, and K. R. Rao The University

More information

Compression of Light Field Images using Projective 2-D Warping method and Block matching

Compression of Light Field Images using Projective 2-D Warping method and Block matching Compression of Light Field Images using Projective 2-D Warping method and Block matching A project Report for EE 398A Anand Kamat Tarcar Electrical Engineering Stanford University, CA (anandkt@stanford.edu)

More information

A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION

A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION Yi-Hau Chen, Tzu-Der Chuang, Chuan-Yung Tsai, Yu-Jen Chen, and Liang-Gee Chen DSP/IC Design Lab., Graduate Institute

More information