COMPREHENSIVE MANY-TO-MANY PHONEME-TO-VISEME MAPPING AND ITS APPLICATION FOR CONCATENATIVE VISUAL SPEECH SYNTHESIS


COMPREHENSIVE MANY-TO-MANY PHONEME-TO-VISEME MAPPING AND ITS APPLICATION FOR CONCATENATIVE VISUAL SPEECH SYNTHESIS Wesley Mattheyses 1, Lukas Latacz 1 and Werner Verhelst 1,2 1 Vrije Universiteit Brussel, Dept. ETRO-DSSP, Pleinlaan 2, B-1050 Brussel, Belgium 2 iminds, Gaston Crommenlaan 8, box 102, B-9050 Ghent, Belgium {wmatthey,llatacz,wverhels}@etro.vub.ac.be Corresponding author: Wesley Mattheyses, Vrije Universiteit Brussel, Dept. ETRO-DSSP, Pleinlaan 2, B-1050 Brussel, Belgium. Mail: wmatthey@etro.vub.ac.be Abstract: The use of visemes as atomic speech units in visual speech analysis and synthesis systems is well-established. Viseme labels are determined using a many-to-one phoneme-to-viseme mapping. However, due to visual coarticulation effects, an accurate mapping from phonemes to visemes should define a many-to-many mapping scheme. In this research it was found that neither the use of standardized nor speaker-dependent many-to-one viseme labels could satisfy the quality requirements of concatenative visual speech synthesis. Therefore, a novel technique to define a many-to-many phoneme-to-viseme mapping scheme is introduced, which makes use of both tree-based and k-means clustering approaches. We show that these many-to-many viseme labels describe the visual speech information more accurately than both phoneme-based and many-to-one viseme-based speech labels. In addition, we found that the use of these many-to-many visemes improves the precision of the segment selection phase in concatenative visual speech synthesis using limited speech databases. Furthermore, the resulting synthetic visual speech was both objectively and subjectively found to be of higher quality when the many-to-many visemes are used to describe the speech database as well as the synthesis targets. Keywords: visual speech synthesis, viseme classification, phoneme-to-viseme mapping, context-dependent visemes, visemes 1. INTRODUCTION AND RELATED WORK 1.1 The concept of visemes Among all sorts of auditory speech processing applications, the concept of a phoneme as a basic speech unit is well established. Auditory speech signals are segmented into a sequence of phonemes for both speech analysis and speech synthesis goals. The properties of such a phoneme set are language-dependent, but its definition is nowadays well standardized for many languages. Similarly, the processing of visual speech signals needs the definition of an atomic unit of visual speech. Fisher [1] introduced the term viseme to identify the visual counterpart of several consonant phonemes. Visemes can be considered as the particular facial and oral positions that appear when a speaker utters phonemes. This implies that a unique viseme is defined by the typical articulatory gestures (mouth opening, lip protrusion, jaw movement, etc.) that are needed to produce a certain phoneme [2]. On the other hand, an alternate definition that is widely used in the literature is to define a viseme as a group of phonemes that exhibit a similar visual representation. This second definition entails that there is not a one-to-one mapping between phonemes and visemes. This can be understood from the fact that not all articulators that are needed to utter phonemes are visible to an observer. For instance, the English /k/ and /g/ phonemes are created by raising the back of the tongue to touch the roof of the mouth, which is a gesture that cannot be noticed visually.
In addition, some phoneme pairs differ only in terms of voicing (e.g., English /v/ and /f/) or in terms of nasality. These two properties cannot be distinguished in the visual domain, which means that such phoneme pairs will

have the same appearance in the visual speech mode. As a consequence, the mapping from phonemes to visemes should behave like a many-to-one (Nx1) relationship, where visibly similar phonemes are mapped to the same viseme. The construction of such an Nx1 mapping scheme has been the subject of much research. Two different approaches can be distinguished. The first approach is based on the phonetic properties of the different phonemes of a language. Based on articulatory rules (e.g., place of articulation, position of the lips, etc.) and expert knowledge, a prediction of the visual appearance of a phoneme can be made [3;4]. This way, visemes can be defined by grouping those phonemes for which the visually important articulation properties are matching. Alternatively, in a second approach the set of visemes for a particular language is determined by various data-driven methods. These strategies involve the recording of real (audio-)visual speech data, after which the captured visual speech mode is further analyzed. To this end, the most common analysis approach is to conduct some kind of subjective perception test. Such an experiment involves participants who try to match fragments of the recorded visual speech to their audible counterparts [5-8]. The nature (consonant/vowel, position in word, etc.) and the size (phoneme, diphone, triphone, etc.) of the used speech fragments varies among the studies. For instance, in the pioneering study of Fisher [1], the participants were asked to lip-read the initial and final consonants of an utterance using a forced-error approach (the set of possible responses did not contain the correct answer). The responses of these kinds of perception experiments are often used to generate a confusion matrix, denoting which phonemes are visibly confused with certain other phonemes. From this confusion matrix, groups of visibly similar phonemes (i.e., visemes) can be determined. The benefit of these human-involved data-driven approaches to determine the phoneme-to-viseme mapping scheme is the fact that they measure exactly what needs to be modeled: the perceived visual similarity among phonemes. On the other hand, conducting these perception experiments is time-consuming and the results obtained are dependent on the lip-reading capabilities of the test subjects. In order to be able to process more speech data in a less time-consuming way, the analysis of the recorded speech samples can be approached from a mathematical point of view. In a study by Rogozan [9], basic geometrical properties of the speaker's mouth (height, width and opening) during the uttering of some test sentences were determined. The clustering of these properties led to a grouping of the phonemes into 13 visemes. Similarly, in studies by Hazen et al. [10] and Melenchón et al. [11] a clustering was applied on the principal component analysis (PCA) coefficients that were calculated on the video frames of the recorded visual speech. The downside of these mathematical analyses lies in the fact that they assume that the mathematical difference between two visual speech segments corresponds to the actual difference that a human observer would notice when comparing the same two segments. Unfortunately, this correlation between objective and subjective distances is far from straightforward.
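Both the perception-based and the mathematical analyses ultimately reduce to clustering a phoneme-by-phoneme dissimilarity matrix into groups of visibly similar phonemes. The sketch below illustrates this, under simplifying assumptions, with hierarchical clustering of a dissimilarity matrix derived from confusion counts; the phoneme subset, the confusion counts and the cut threshold are invented for illustration and do not reproduce any of the cited studies.

```python
# Sketch: deriving many-to-one viseme groups from a lip-reading confusion matrix.
# The phoneme set and confusion counts below are invented for illustration only;
# a real study would fill this matrix from perception-test responses.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

phonemes = ["p", "b", "m", "f", "v", "k", "g"]
confusion = np.array([
    [90, 70, 65,  5,  4,  2,  1],
    [70, 88, 60,  6,  5,  1,  2],
    [65, 60, 85,  4,  6,  2,  1],
    [ 5,  6,  4, 80, 72,  3,  2],
    [ 4,  5,  6, 72, 78,  2,  3],
    [ 2,  1,  2,  3,  2, 75, 68],
    [ 1,  2,  1,  2,  3, 68, 74],
], dtype=float)

# Symmetrize and turn confusion counts into a dissimilarity: highly confused
# phonemes are visually close, so their distance should be small.
sym = (confusion + confusion.T) / 2.0
dist = 1.0 - sym / sym.max()
np.fill_diagonal(dist, 0.0)

# Agglomerative clustering on the condensed distance matrix; the cut threshold
# controls how many viseme classes are obtained.
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")
for viseme_id in sorted(set(labels)):
    group = [p for p, l in zip(phonemes, labels) if l == viseme_id]
    print(f"viseme {viseme_id}: {group}")
```

With the toy values above, the clustering recovers the /p,b,m/, /f,v/ and /k,g/ groupings mentioned in the text; the same procedure applies to dissimilarities computed from geometric or PCA-based measurements instead of confusions.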
Several Nx1 phoneme-to-viseme mapping schemes have been defined in the literature. Most mappings agree on the clustering of some phonemes (e.g., the English /p/, /b/ and /m/ phonemes), although many differences between the mappings exist. In addition, the number of visemes defined by these Nx1 mappings varies among the different schemes. A standardized viseme mapping table for English has been defined in MPEG-4 [12], consisting of 14 different visemes augmented with a silence viseme. Although this viseme set has been applied for visual speech recognition [13] as well as for visual speech synthesis applications [14], many other viseme sets are still in use, even for English. For instance, various Nx1 phoneme-to-viseme mappings were applied for visual speech synthesis in [15-20], while other Nx1 viseme labeling approaches have been applied for (audio-)visual speech recognition purposes [21-23]. It is not straightforward to determine the exact number of visemes that is needed to accurately describe the visual speech information. In a study by Auer [24], the concept of a phonemic equivalence class (PEC) was introduced. Such a PEC can be seen as an equivalent of a viseme, since it

is used to group phonemes that are visibly similar. In that study, words from the English lexicon were transcribed using these PECs in order to assess their distinctiveness. The number of PECs was varied between 1, 2, 10, 12, 19 and 28. It was concluded that the use of at least 12 PECs resulted in a sufficiently unique transcription of the words. Note, however, that when optimizing the number of visemes used in a phoneme-to-viseme mapping, the target application (e.g., human speech recognition, machine-based speech recognition, speech synthesis, etc.) should also be taken into account. In addition, it has been shown that the best phoneme-to-viseme mapping (and as a consequence the number of visemes) should be constructed speaker-dependently [25]. A constant among almost every phoneme-to-viseme mapping scheme that can be found in the literature is the fact that the nature of the mapping is many-to-one. At first glance, this seems reasonable since it fits with the definition of a viseme being a group of phonemes that appear visibly the same. On the other hand, visual speech suffers from so-called coarticulation effects [26]. These effects are caused by Newton's Law of Inertia: after uttering a phone, it is impossible for the muscles of the human articulatory system to immediately (i.e., infinitely fast) change the positions of the different articulators in order to produce the next sound. Therefore, visual speech suffers both from anticipatory and preservatory coarticulation effects. Preservatory or backwards coarticulation means that a certain speech gesture continues after uttering a particular sound segment, while the other gestures needed to create this sound are already completed [27]. An example of this effect is the presence of lip protrusion during the /s/ segment of the English word 'boots'. In addition, anticipatory or forwards coarticulation occurs when a visible gesture of a speech segment occurs in advance of the other articulatory components of the segment [28]. An example of anticipatory coarticulation is the pre-rounding of the lips in order to utter the English sound /uw/: in the word 'school' the lip rounding can already be noticed while the sounds /s/ or /k/ are still being uttered. These two coarticulation effects imply that the visual appearance of a particular phoneme will not only be dependent on the phoneme's articulation properties but also on the nature of its neighboring phonemes in the sentence. In other words, a single phoneme can exhibit various visible representations, which means that it can be mapped on several different visemes. Consequently, a comprehensive phoneme-to-viseme mapping scheme should be a many-to-many (NxM) relationship [29]. Unfortunately, only very little research has been performed on the construction of such NxM mapping tables. A first step in this direction can be found in a study by Mattys [30], where some of the phoneme equivalence classes from [24] were redefined by taking the phonetic context of consonants into account. Finally, it can be argued that even a viseme set that is computed as an NxM mapping from phonemes is insufficient to accurately and efficiently define atomic units of visual speech. Instead of an a priori segmentation of the speech in terms of phonemes, the segmentation of a visual speech track could be performed by taking only the visual features into account. For example, in a study by Hilder et al.
[31] such a segmentation is proposed, which has the benefit that the different allophones of a phoneme can get a different viseme label and that the visual coarticulation is automatically taken into account. It was shown that this strategy leads to a clustering of the speech segments into visemes which is optimal in the sense that the inter-cluster distance is much larger than the intra-cluster distance. Unfortunately, this kind of viseme definition is impractical for use in (audio-)visual speech applications such as speech synthesis and speech recognition, since for these viseme sets the relationship between phonemes and visemes is not made explicit. 1.2 Visemes for Dutch In comparison with English, the available research on visemes for the Dutch language is limited. Whereas for English some kind of standardization for Nx1 visemes exists in the MPEG-4 standard, for Dutch only a phoneme set has been standardized. A first study on visemes for Dutch was performed by Eggermont [32], where some CVC syllables were the subject of an audiovisual perception experiment. In addition, Corthals [33] describes a phoneme-to-viseme grouping using phonetic

expert knowledge. Finally, Van Son et al. [34] define a new Nx1 phoneme-to-viseme mapping scheme that is constructed using the experimental results of new perception tests in combination with the few conclusions on Dutch visemes that could be found in earlier literature. 1.3 Visemes for (audio-)visual speech synthesis A visual speech synthesis system is a program that is able to construct a novel synthetic visual speech signal from an arbitrary text or phoneme sequence given as input. The synthetic visual speech shows a virtual speaker which can be rendered either in 2D or in 3D [35]. Various approaches for predicting the appropriate speech gestures based on the target phoneme sequence have been suggested. Rule-based synthesizers assign to each target phoneme a typical configuration of the virtual speaker. For instance, in a 3D-based synthesis approach these configurations can be expressed by means of parameter values of a parameterized 3D model [17;36]. Alternatively, a 2D-based synthesizer can assign to each target phoneme a still image of an original speaker uttering that particular phoneme [16;37]. Rule-based synthesizers generate the final synthetic visual speech signal by interpolating between the predicted keyframes. In order to cover the complete target language, a rule-based system has to know the mapping from any phoneme of that language to its typical visual representation, i.e., it has to define a complete phoneme-to-viseme mapping table. The actual speaker configuration that corresponds to each viseme label can be manually pre-defined (using a system-specific viseme set or a standardized viseme set such as described in MPEG-4) or it can be copied from original speech recordings [14;16;38]. Almost all rule-based visual speech synthesizers adopt an Nx1 phoneme-to-viseme mapping scheme, which reduces the number of rules needed to cover all phonemes of the target language. In a rule-based synthesis approach using an Nx1 mapping scheme, the visual coarticulation effects need to be mimicked in the keyframe interpolation stage. To this end, a widely adopted technique is the implementation of the Cohen-Massaro coarticulation model [39]. In this approach, the various speech gestures are given a prominence level that denotes the degree to which they are likely to be altered by coarticulation effects from neighboring speech segments. Interpolation between the keyframes is performed by weighting each keyframe configuration by an exponentially decaying function whose height is proportional to the prominence level of the corresponding speech segment. There have been only a few reports on the use of NxM phoneme-to-viseme mapping tables for rule-based visual speech synthesis. An example is the exploratory study by Galanes et al. [40], in which regression trees are used to analyze a database of 3D motion capture data in order to design prototype configurations for context-dependent visemes. To synthesize a novel visual speech signal, the same regression trees are used to perform an NxM phoneme-to-viseme mapping that determines for each input phoneme a typical configuration of the 3D landmarks taking its target phonetic context into account. Afterwards, the keyframes are interpolated using splines instead of using a coarticulation model, since coarticulation effects were already taken into account during the phoneme-to-viseme mapping stage.
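The dominance-based keyframe interpolation mentioned above (the Cohen-Massaro model) can be sketched as follows. This is a schematic illustration only: the one-dimensional targets, the parameter values and the exact negative-exponential dominance shape are assumptions for illustration, not the formulation used by any of the cited systems.

```python
# Sketch: dominance-based keyframe blending in the spirit of the Cohen-Massaro
# coarticulation model. Each speech segment i contributes a target parameter
# vector targets[i] (its ideal articulator configuration) and a dominance
# function that decays exponentially away from the segment centre; the animated
# trajectory is the dominance-weighted average of all targets.
# All numeric values are illustrative.
import numpy as np

centres   = np.array([0.10, 0.25, 0.45])     # segment centres (seconds)
targets   = np.array([[0.8], [0.2], [0.9]])  # per-segment articulator targets (1-D here)
magnitude = np.array([1.0, 0.4, 1.0])        # prominence: resistance to coarticulation
rate      = np.array([25.0, 15.0, 25.0])     # decay rate of each dominance function

def dominance(t):
    """Dominance of every segment at time t (negative exponential of |t - centre|)."""
    return magnitude * np.exp(-rate * np.abs(t - centres))

def blended_parameters(t):
    """Dominance-weighted average of the segment targets at time t."""
    d = dominance(t)
    return (d[:, None] * targets).sum(axis=0) / d.sum()

trajectory = np.array([blended_parameters(t) for t in np.arange(0.0, 0.55, 0.01)])
print(trajectory[:5].ravel())
```

Because the middle segment is given a low prominence, its target is largely overridden by its neighbours, which is exactly the coarticulation behavior the model is meant to reproduce.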
Another synthesis system that uses pre-defined context-dependent visemes was suggested by De Martino et al. [41]. In their approach, 3D motion capture trajectories corresponding to the uttering of original CVCV and diphthong samples are gathered, after which by means of k-means clustering [42] important groups of similar visual phoneme representations are distinguished. From these context-dependent visemes, the keyframe mouth dimensions corresponding to a novel phoneme sequence can be predicted. These predictions are then used to animate a 3D model of the virtual speaker. In follow-up research, for each context-dependent viseme identified by the clustering of the 3D motion capture data, a 2D still image of an original speaker is selected to define the articulation rules for a 2D photorealistic speech synthesizer [43]. Another strategy for generating synthetic visual speech from text input is the so-called unit selection approach, in which the novel speech signal is constructed by concatenating speech segments selected from a database containing original visual speech recordings. In this approach, no interpolation is needed since all output frames consist of visual speech information copied from the

database. In standard auditory unit selection [44], the speech segments are selected based on their phonemic match with the target speech. In the case of visual unit selection, the selection can be based on the target/database matching of either phonemes [45-47] or visemes [48;49]. From the literature it can be noticed that almost all viseme-based unit selection visual speech synthesizers apply an Nx1 phoneme-to-viseme mapping to label the database speech and to translate the target phoneme sequence into a target viseme sequence. In unit selection synthesis, original coarticulations are copied from the database to the output speech by concatenating original units longer than one phoneme/viseme. In addition, extended visual coarticulations can be taken into account by selecting those original speech segments of which the visual context (i.e., the visual properties of their neighboring phonemes/visemes) matches the visual context of the corresponding target speech segment [50]. A final strategy for estimating the desired speech gestures from the text input is the use of a statistical prediction model that has been trained on the correspondences between visual speech features and a phoneme/viseme-based labeling of the speech signal. Such a trained model can predict the target features of the synthesizer's output frames for an unseen phoneme/viseme sequence given as input. A common strategy is the use of Hidden Markov Models (HMMs) [51] to predict the target visual features. Such an HMM usually models the visual features of each phoneme/viseme sampled at 3 to 5 distinct time instances along the phoneme/viseme duration. It can be trained using both static and dynamic observation vectors, i.e., the visual feature values and their temporal derivatives. Similar to most selection-based visual speech synthesizers, prediction-based synthesizers often use a phoneme-based segmentation of the speech, for which the basic training/prediction unit can be for instance a single phoneme [52;53] or a syllable [54]. Diphones are used as basic synthesis units in a study by Govokhina et al. [55], in which it was concluded that the use of both static and dynamic features to train an HMM improved the synthesis quality, since this permits coarticulations to be learned by the prediction model as well. In follow-up research, their system was extended to allow some small asynchronies between the phoneme transitions and the transition points between the speech segments in the visual mode. This way, the HMM is capable of modeling some anticipatory visual coarticulation effects more accurately, since these occur in the visual speech mode before the corresponding phoneme is heard in the auditory mode [56]. 2. MOTIVATION AND OUTLINE Section 1.3 briefly explained the various techniques for generating a novel synthetic visual speech signal based on a given text or phoneme sequence. Each synthesis approach can be implemented using either phonemes or visemes as atomic speech units. In theory, a viseme-based approach should be superior since this type of labeling is more suited to identify visual speech information. It can be noticed that almost all viseme-based visual speech synthesizers that have been described in the literature use an Nx1 phoneme-to-viseme mapping scheme. For rule-based synthesizers, such an Nx1 mapping is advantageous since it reduces the number of rules needed to cover the whole target language.
Moreover, the application of an Nx1 mapping scheme is useful for selection-based synthesizers as well, since it reduces the database size needed to provide a sufficient number of database segments that match a given target speech segment. Similarly, the use of an Nx1 mapping scheme reduces the minimal number of original sentences needed to train the prediction models of prediction-based synthesis systems. On the other hand, when an Nx1 phoneme-to-viseme mapping scheme is applied, an additional modeling of the visual coarticulation effects is needed. To this end, many coarticulation models have been proposed for use in rule-based visual speech synthesis. In the case of selection-based synthesis, which is the most appropriate technique to produce very realistic synthetic speech signals, visual coarticulations have to be copied from original speech recordings. The accuracy of this information transfer can clearly be increased if the labeling of the original and the target speech data intrinsically describes these coarticulations.

This is feasible when context-dependent visemes are used to label the target and the database speech, i.e., when an NxM phoneme-to-viseme mapping scheme is applied. For selection-based synthesis, there exists a trade-off between the various approaches for labeling the speech data. The use of an Nx1 phoneme-to-viseme mapping increases the number of database segments that match a target speech segment, which means that it is more likely that for each target segment a highly suited database segment can be found. On the other hand, when this type of speech label is used, appropriate original visual coarticulations can only be selected by means of accurate selection costs and by selecting long original segments. When context-dependent visemes are used, the visual coarticulation effects are much better described, both in the database speech and in the target speech. Unfortunately, such an NxM mapping increases the number of distinct speech labels and thus decreases the number of database segments that match a target segment. Finally, visual unit selection synthesis can also be performed using phoneme-based speech labels. Although phonemes are less suited to describe visual speech information, their use may help to enhance the perception quality when the synthetic visual speech is shown audiovisually to a user, since the use of phoneme labels increases the audiovisual coherence [57]. To investigate the effects of all these possible speech labeling approaches on the quality of the synthetic visual speech, a concatenative speech synthesizer that was designed in earlier research is used [58;59]. Using this synthesizer, phoneme-based visual synthesis as well as the use of standardized and database-specific Nx1 phoneme-to-viseme mappings is investigated. In addition, multiple novel NxM mapping schemes are developed and evaluated, since only very few of these mappings can be found in the literature. 3. MODEL-BASED 2D CONCATENATIVE SPEECH SYNTHESIS 3.1 Synthesis overview In standard concatenative speech synthesis, for every synthesis target a set of candidate segments is selected from the system's pre-recorded speech database. From this set, the optimal segment is determined by minimizing a global selection cost, after which the speech data corresponding to this optimal segment is cut from the speech data in the database. When all synthesis targets have been processed, the output speech is constructed by a concatenation of all the extracted original speech segments. A straightforward application of this technique for visual speech synthesis would involve the selection and concatenation of ranges of video frames from a visual speech database. However, we have experienced that this static image-based representation of the visual speech information is not optimal for accurately analyzing, selecting and concatenating the visual speech segments. Therefore, an Active Appearance Model (AAM) [60] was trained on the mouth regions of the video frames from the database. Such a trained AAM makes it possible to describe the mouth configurations in terms of model parameters. Thus, instead of video frames the synthesizer selects and concatenates sub-trajectories of model parameters, allowing for a more accurate selection and concatenation. After concatenation, the parameter trajectories are non-uniformly time-scaled to synchronize the synthetic video track with an auditory speech track that was given as input to the system (for this, its phonemic segmentation is used).
Based on these time-scaled trajectories, inverse AAM projection leads to a new sequence of images containing the mouth area of the talking head. Joining this video with a background video showing the other parts of the face completes the speech synthesis. Examples of this (audio-)visual synthesis approach can be found at: 3.2 AAM Modeling A useful property of the AAM mapping is the fact that an AAM separately models the shape and the texture of the video frames. The shape information is determined by the position of important features in the image, while the texture information consists of the image's pixel values sampled with

respect to the shape of the image. In our particular case, the shape is determined by the positions of the lips, the cheeks, the chin, etc., while typical texture properties are the visibility of teeth or the visibility of mouth-cavity (the dark area inside an open mouth) in the image. AAMs are able to improve the synthesis quality, since the separate description of shape and texture makes it possible to differentiate the processing of both information streams (see [59] for further reading on this). The AAM modeling also includes a second principal component calculation, where the shape and the texture parameters are combined into a new set of model parameters describing all aspects of the data. These so-called combined model parameters are very useful for analysis purposes as they hold a compact description of the appearance of all the visual articulators. 3.3 Database normalization While recording an extensive audiovisual speech database, it is impossible to maintain exactly the same conditions throughout the recordings. For instance, small changes in the position of the speaker or fluctuations in the lighting conditions may occur. Therefore, some of the variations that are modeled by the AAM will have nothing to do with the actual speech production, but are due to changes in the recording conditions. To cope with this pollution of the speech data, a normalization strategy was designed for which some of the shape and texture parameters were fixed to their mean value throughout the database. A detailed description of this technique can be found in [61]. From these normalized shape and texture parameters, new sets of combined AAM parameters were calculated. In the remainder of this paper the term combined parameters will be used when the normalized version of the combined AAM parameters is intended. 3.4 Segment selection The selection of the optimal segment from the set of candidate segments matching a synthesis target is based on both target and join costs. The target costs express how closely the candidate segment matches the target segment, while the join costs express the ease with which consecutive selected segments can be concatenated. For the experiments described in this paper, two target costs were applied. A first target cost, $C_{sync}$, is needed to ensure that a quality synchronization between the synthetic video track and the input auditory speech is feasible. Its value increases when the synchronization of the candidate segment with the corresponding auditory speech fragment would require a more important time-scaling. A second target cost, $C_{context}$, is defined to model the visual coarticulation effects. This is achieved by comparing the visual context of the synthesis target with the visual context of the candidate segment. To calculate this target cost, a difference matrix is constructed that expresses the visual similarity between every two different phonemes/visemes (depending on the database labeling being based on phonemes or visemes) present in the database. It is important that this matrix is calculated for the particular speech data that is used for synthesis, since coarticulation effects can be speaker-specific. To calculate the difference matrix, for every distinct phoneme/viseme, all its instances in the database are gathered. For each instance, the combined AAM parameters of the video frame located at the middle of the instance are sampled.
From these values, means $\mu_{ij}$ and variances $\sigma_{ij}^2$ are calculated, where index $i$ corresponds to the different phonemes/visemes and index $j$ corresponds to the different model parameters. For a particular phoneme/viseme, the sum of all the variances of the model parameters expresses how much the visual appearance of that phoneme/viseme is affected by coarticulation effects. Two phonemes/visemes can be considered similar in terms of visual representation if their mean representations are alike and, in addition, if these mean representations are sufficiently reliable (i.e., if small summed variances were measured for these visemes). Thus, two matrices are calculated, which express for each pair of phonemes/visemes $(k,l)$ the Euclidean difference between their mean representations and the sum of the variances of their visual representations, respectively:

$A_{kl} = \sqrt{\sum_j (\mu_{kj} - \mu_{lj})^2}$   (3.1)

$B_{kl} = \sum_j \sigma_{kj}^2 + \sum_j \sigma_{lj}^2$   (3.2)

Dividing each matrix by its largest element produces the scaled matrices $\bar{A}$ and $\bar{B}$, after which the final difference matrix can be constructed:

$D_{kl} = \bar{A}_{kl} + \bar{B}_{kl}$   (3.3)

Matrix $D$ can be applied to calculate $C_{context}$ for a candidate segment matching a given target phoneme/viseme sequence: the three phonemes/visemes located before ($c_{-1}, c_{-2}, c_{-3}$) and after ($c_{+1}, c_{+2}, c_{+3}$) the segment in the database (i.e., the candidate's visual context) are compared to the target's visual context ($t_{-n}$, $t_{+n}$):

$C_{context} = \sum_{n=1}^{3} w(n) \left( D_{t_{-n} c_{-n}} + D_{t_{+n} c_{+n}} \right)$   (3.4)

where the factor $w(n)$ defines a triangular weighting that decreases linearly with the distance $n$ from the segment. To ensure smooth concatenations, the selection of a candidate segment is also based on join costs. The total join cost $C_{join}$ that is assigned to the transition from a given candidate segment to the next segment is calculated as follows:

$C_{join} = d_{shape} + d_{comb} + d_{\Delta comb}$   (3.5)

where $d_{shape}$ denotes the Euclidean distance between the AAM shape parameters of the video frames at the join position. Using the same video frames, $d_{comb}$ and $d_{\Delta comb}$ are the Euclidean distance between the combined AAM parameters and the Euclidean distance between the delta values of these combined parameters, respectively. Finally, the total selection cost $C_{total}$ for a candidate segment is calculated as the weighted sum of all sub-costs:

$C_{total} = w_1 C_{sync} + w_2 C_{context} + w_3 C_{join}$   (3.6)

where the weights $w_1$-$w_3$ were optimized manually.
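To make these selection costs concrete, the sketch below assembles the difference matrix of equations 3.1-3.3 and the context target cost of equation 3.4 from per-class means and variances of the combined AAM parameters. It is a minimal illustration only: the array shapes, the random statistics, the triangular weights and the equal-weight combination of the two scaled matrices are assumptions, not the exact values used by the synthesizer.

```python
# Sketch: building the phoneme/viseme difference matrix (eqs. 3.1-3.3) and a
# context target cost (eq. 3.4) from per-class statistics of combined AAM
# parameters. Shapes, the triangular weights and the equal-weight combination
# of the two scaled matrices are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_params = 12, 8            # distinct phonemes/visemes, AAM parameters
mu  = rng.normal(size=(n_classes, n_params))              # per-class mean AAM vectors
var = rng.uniform(0.1, 1.0, size=(n_classes, n_params))   # per-class variances

# Eq. 3.1: Euclidean distance between mean representations of every class pair.
A = np.linalg.norm(mu[:, None, :] - mu[None, :, :], axis=-1)
# Eq. 3.2: summed variances of both classes (reliability of the means).
B = var.sum(axis=1)[:, None] + var.sum(axis=1)[None, :]
# Eq. 3.3: scale each matrix by its largest element and combine.
D = A / A.max() + B / B.max()

def context_cost(target_ctx, candidate_ctx):
    """Eq. 3.4: compare the 3 classes before/after the target with those of the
    candidate, weighting nearer neighbours more heavily (triangular weights)."""
    weights = np.array([3.0, 2.0, 1.0]) / 3.0   # positions -1/-2/-3 and +1/+2/+3
    before = sum(w * D[t, c] for w, t, c in zip(weights, target_ctx[0], candidate_ctx[0]))
    after  = sum(w * D[t, c] for w, t, c in zip(weights, target_ctx[1], candidate_ctx[1]))
    return before + after

# Context = (three preceding class indices, three following class indices).
print(context_cost(([1, 2, 3], [4, 5, 6]), ([1, 2, 0], [4, 7, 6])))
```

In the actual system the difference matrix is computed once per speech database, since the per-class statistics (and thus the coarticulation behavior they capture) are speaker-specific.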

4. EVALUATION OF MANY-TO-ONE MAPPING SCHEMES FOR ENGLISH 4.1 Introduction In this section the standardized Nx1 mapping that is described in MPEG-4 [62] is evaluated for use in concatenative visual speech synthesis. Since this mapping scheme is designed for English, the English version of our visual speech synthesis system was used. The system used the LIPS2008 audiovisual database [63] as speech database. This is a dataset that contains 20 minutes of continuous audiovisual English speech from a single speaker. Note that for audio(-visual) concatenative speech synthesis, 20 minutes is too limited to attain quality results. However, for visual-only speech synthesis it should contain a sufficient amount of data, especially when an Nx1 viseme-based speech description is used. The video data was recorded in portrait orientation at 50 frames per second, with a resolution of 576x720 pixels. 4.2 Design of phoneme-to-viseme mapping schemes 4.2.1 Standardized Nx1 mapping Based on the description in MPEG-4, the English phoneme set that was originally used to segment the LIPS2008 corpus has been mapped on 14 visemes, augmented with one silence viseme. The mapping of those phonemes that are not mentioned in the MPEG-4 standard was based on their visual and/or articulatory resemblance with other phones.
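In code, a many-to-one labeling of this kind reduces to a simple table lookup. The fragment below is a schematic illustration: only the /p,b,m/ and /f,v/ groupings are taken from the discussion above, while the remaining entries and the viseme names are invented and do not reproduce the actual MPEG-4 table.

```python
# Sketch: applying a many-to-one (Nx1) phoneme-to-viseme table to a phoneme
# sequence. The table below is a small illustrative fragment, not the actual
# MPEG-4 assignment; only the /p,b,m/ and /f,v/ groupings come from the text.
NX1_TABLE = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "sil": "V_silence",
}

def to_visemes(phonemes, table, default="V_other"):
    """Collapse a phoneme sequence into Nx1 viseme labels."""
    return [table.get(p, default) for p in phonemes]

print(to_visemes(["sil", "b", "m", "f", "k"], NX1_TABLE))
# ['V_silence', 'V_bilabial', 'V_bilabial', 'V_labiodental', 'V_other']
```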

4.2.2 Speaker-dependent Nx1 mapping The MPEG-4 mapping scheme is designed to be a best-for-all-speakers phoneme-to-viseme mapping. However, for usage in data-driven visual speech synthesis, the phoneme-to-viseme mapping should be optimized for the particular speaker of the synthesizer's database. To define such a speaker-dependent mapping, the AAM-based representations of the mouth region of the video frames from the database were used. In a first step, for every distinct phoneme present in the database all its instances were gathered. Then, the combined AAM parameters of the frame located at the middle of each instance were sampled. A speaker-specific mean visual representation of each phoneme was calculated using the gathered AAM parameters. A hierarchical clustering analysis on the combined AAM parameters of these mean representations was performed to determine which phonemes are visibly similar for the speaker's speaking style. Using the dendrogram, a tree diagram that visualizes the arrangement of the clusters produced by the clustering algorithm, five important levels could be discerned in the hierarchical clustering procedure. Consequently, five different phoneme-to-viseme mappings were selected. They define 7, 9, 11, 19 and 22 visemes, respectively. Each of these viseme sets contains a 'silence' viseme on which only the silence phoneme is mapped. 4.3 Experiment Both the MPEG-4 mapping scheme and the speaker-dependent mappings were the subject of an experimental evaluation. A random selection of original sentences from the database was re-synthesized using our visual speech synthesis system, for which both the synthesis targets and the speech database were labeled using the various Nx1 viseme sets. The synthesis parameters (selection costs, concatenation smoothing, etc.) were the same for all strategies. A reference synthesis strategy was added, for which the standard English phoneme set (containing 45 entries) was used to label the database and the synthesis targets. For every synthesis, the target original sentence was excluded from the database. The original database transcript was used as text input and the original database auditory speech was used as audio input. A subjective evaluation of the syntheses was conducted, using four labeling strategies: speech synthesized using the speaker-dependent mappings on 9 (group 'SD9') and 22 (group 'SD22') visemes, speech synthesized using the MPEG-4 mapping (group 'MPEG4') and speech synthesized using standard phoneme-based labels (group 'PHON'). In addition, extra reference samples were added (group 'ORI'), for which the original AAM trajectories from the database were used to re-synthesize the visual speech. The samples were shown pairwise to the participants. Six different comparisons were considered, as shown in figure 4.1. The sequence of the comparison types as well as the sequence of the sample types within each pair was maximally randomized. The test consisted of 50 sample pairs: 14 comparisons containing an ORI sample and 36 comparisons between two actual syntheses. The same sentences were used for each comparison group. 13 people (aged 22-59, 9 male, 4 female) participated in the experiment, 8 of whom can be considered speech technology experts. The participants were asked to give their preference for one of the two samples of each pair using a 5-point comparative MOS scale [-2,2]. They were instructed to answer '0' if they had no clear preference.
The test instructions told the participants to pay attention to both the naturalness of the mouth movements and to how well these movements are in coherence with the auditory speech that is played along with the video. The key question of the test read as follows: "How much are you convinced that the person you see in the sample actually produces the auditory speech that you hear in the sample?" The results obtained are summarized in figure 4.1.

Figure 4.1: Subjective test results. The histograms show for each comparison the participants' preference for the left/right sample type on a 5-point scale [-2,2]. The results show that the participants were clearly in favor of the syntheses based on phonemes compared to the viseme-based syntheses. These differences were found to be significant (p < 0.05) by a paired-sample Wilcoxon signed-rank test. Furthermore, a perception of higher quality was attained by increasing the number of visemes. It can be concluded that the speaker-dependent mappings perform similarly to the standardized mapping scheme, since the SD9 group scores worse than the MPEG4 group while the SD22 group scores better than the MPEG4 group. The results show that the synthesized samples are still distinguishable from natural visual speech, although for this aspect also the phoneme-based synthesis outperforms the MPEG-4-based approach. 4.4 Conclusions Both standardized and speaker-dependent Nx1 phoneme-to-viseme mapping schemes for English were constructed and applied for concatenative visual speech synthesis. In theory, such a viseme-based synthesis should outperform the phoneme-based synthesis since it multiplies the number of candidate segments for selection, while the reduced number of distinct speech labels can be justified by the fact that there exists redundancy in a phoneme-based labeling of visual speech (this is the reason for the Nx1 behavior of the mapping). However, the synthesis based on phonemes resulted in higher subjective ratings compared to the syntheses based on visemes. In addition, the results obtained show that the synthesis quality increases when more distinct visemes are defined. These results raise some questions on the Nx1 viseme-based approach that is widely applied in (audio-)visual speech synthesis. For audio-visual speech synthesis, it was already shown that a phoneme-based speech labeling is preferable, since it allows the selection of multimodal segments from the database, which maximizes the audiovisual coherence in the synthetic multimodal output speech [57]. From the current results it appears that a similar phoneme-based synthesis is preferable for visual-only synthesis as well. However, it could be that the many-to-one phoneme-to-viseme mappings insufficiently describe all the details of the visual speech information. Although the synthesizer mimicked the visual coarticulation effects by applying a target cost based on the visual context, it is likely that higher quality viseme-based synthesis results can be achieved by using a many-to-many phoneme-to-viseme mapping instead, which describes the visual coarticulation effects already in the speech labeling itself.
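For completeness, the paired significance test applied to the comparison scores of section 4.3 can be sketched as follows; the preference scores are invented for illustration and the explicit removal of no-preference ratings is one possible convention.

```python
# Sketch: testing paired comparative-MOS preferences (5-point scale [-2, 2])
# with a Wilcoxon signed-rank test, as in the evaluation of section 4.3.
# The scores below are invented for illustration.
from scipy.stats import wilcoxon

# Preference scores for one comparison type (e.g. PHON vs. MPEG4), one value
# per rated sample pair; positive means the left-hand sample was preferred.
scores = [1, 2, 0, 1, 1, -1, 2, 1, 0, 2, 1, 1]

# Drop the zero (no-preference) ratings before testing the median against zero.
nonzero = [s for s in scores if s != 0]
stat, p_value = wilcoxon(nonzero)
print(f"W = {stat:.1f}, p = {p_value:.3f}")
```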

5. MANY-TO-MANY PHONEME-TO-VISEME MAPPING SCHEMES 5.1 Introduction In order to construct an NxM phoneme-to-viseme mapping scheme, the idea is to analyze an extensive set of audiovisual speech data, where the relationship between the visual appearances of the mouth area and the phonemic transcription of the speech is investigated. Since the resulting mapping schemes will eventually be evaluated for usage in concatenative visual speech synthesis, it was opted to construct a speaker-dependent mapping by analyzing a speech database that can be used for synthesis purposes too (i.e., a consistent dataset from a single speaker). Unfortunately, the amount of data in the English LIPS2008 database is insufficient for an accurate analysis of all distinct phonemes in all possible phonetic contexts. Therefore, the Dutch language was used instead, since in previous research our own extensive Dutch audiovisual speech database has been recorded [64]. This dataset consists of 1199 audiovisual sentences (138 min) from the open domain and 536 audiovisual sentences (52 min) from the limited domain of weather forecasts. The video was recorded with progressive scan in portrait orientation, with a resolution of 720x1280 pixels. A phonemic segmentation was determined using the SPRAAK toolkit [65] and a high-quality AAM was trained in order to describe the mouth area in terms of shape/texture parameters and in terms of combined AAM parameters. An example visual speech synthesis using this Dutch database, based on original auditory speech and its phonemic transcription, is given in movie 5.1. Movie 5.1: Example frames from a synthesized Dutch sentence. 5.2 Tree-based clustering 5.2.1 Regression trees To construct the phoneme-to-viseme mapping scheme, the data from the Dutch speech database was analyzed by clustering the visual appearances of the phoneme instances in the database. Each instance was represented by three sets of combined AAM parameters, corresponding to the video frames at 25%, 50% and 75% of the duration of the phoneme instance, respectively. This three-point sampling was chosen to integrate dynamics in the measurement, as the mouth appearance can vary during the uttering of a phoneme. As a clustering tool, multi-dimensional regression trees [66] were used, similar to the technique suggested in [40]. A regression tree is a data analysis tool that is able to cluster training data based on a number of decision features describing this data. To build such a tree, first a measure for the impurity in the training data needs to be defined. For this purpose, the distance $d(p,q)$ between phoneme instances $p$ and $q$ was defined as the weighted sum of the Euclidean distances between the combined AAM parameters $c_k(\cdot)$ of the video frames at 25%, 50% and 75% of the length of both instances:

$d(p,q) = \sum_{k \in \{25,50,75\}} w_k \, \lVert c_k(p) - c_k(q) \rVert$   (5.1)

Next, consider a subset $S$ containing $N$ phoneme instances. Equation 5.2 expresses for instance $p$ its mean distance from the other instances in $S$:

$\bar{d}_S(p) = \frac{1}{N-1} \sum_{q \in S,\, q \neq p} d(p,q)$   (5.2)

Let us write $\sigma_S^2(p)$ for the variance of these distances. For every instance $p$ in $S$ the value of $\bar{d}_S(p)$ is calculated, from which the smallest value is selected as $\bar{d}_{min}$ (with corresponding variance $\sigma_{min}^2$). A final measure for the impurity of subset $S$ can then be calculated by equation 5.3:

$I(S) = \bar{d}_{min} + \alpha\, \sigma_{min}^2$   (5.3)

When the regression tree is constructed, the training data is split by asking questions about the properties of the data instances. At each split, the best question is chosen in order to minimize the impurity in the data (this is the sum of the impurities of all subsets). A tree-like structure is obtained since at each split new branches are created in order to group similar data instances. In each next step, each branch itself is further split by asking other questions that in their turn minimize the impurity in the data. In the first steps of the tree building, the branching is based on big differences among the data instances, while the final splitting steps are based on only minor differences. For some branches, some of the last splitting steps can be superfluous. However, other branches do need many splitting steps in order to accurately cluster their data instances. A stop size needs to be defined as the minimal number of instances that are needed in a cluster. Obviously, the splitting also stops when no more improvement of the impurity can be found. 5.2.2 Decision features To build the regression trees, each phoneme instance needs to be characterized by an appropriate set of features. Various possible features can be used, of which the identity of the phoneme (i.e., its name or corresponding symbol) is the most straightforward. Another key feature is a vowel/consonant classification of the data instance. In addition, a set of phonetic features can be linked to each instance, based on the phonetic properties of the corresponding phoneme: vowel length, vowel type (short, long, diphthong, schwa), vowel height, vowel frontness, lip rounding, consonant type (plosive, fricative, affricate, nasal, liquid, trill), consonant place of articulation (labial, alveolar, palatal, labio-dental, dental, velar, glottal) and consonant voicing. Note that these features have been determined by phonetic knowledge of Dutch phonemes and that it can be expected that not all of them have an explicit influence on the visual representation of the phoneme. In addition, for each Dutch phoneme an additional set of purely visual features was calculated. To this end, several properties of the visual speech were measured throughout the database: mouth height, mouth width, the visible amount of teeth and the visible amount of mouth-cavity (the dark area inside an open mouth). For each of these measures, the 49 distinct Dutch phonemes were labeled based on their mean value for that measure (labels --, -, + and ++ were used for this). For example, phoneme /a/ (the long 'a' from the word 'daar') has value ++ for both the mouth-height and the teeth feature, while the phoneme /o/ (the long 'o' from the word 'door') has value ++ for the mouth-height feature but value -- for the teeth feature. In order to construct a many-to-many mapping scheme, the tree-based clustering needs to be able to model the visual coarticulation effects. Therefore, not only features of the particular phoneme instance itself but also features concerning its neighboring instances are used to describe the data instance.
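Before moving on to the pre-clustering step, the sketch below restates the distance and impurity measures of equations 5.1-5.3 in code. The three-point weights and the exact way the variance term enters the impurity are assumptions consistent with the description above ($\alpha$ is set to 2 further on).

```python
# Sketch: three-point instance distance (eq. 5.1) and subset impurity
# (eqs. 5.2-5.3) used to grow the regression trees. Each instance is an array
# of shape (3, n_params): combined AAM parameters at 25%, 50% and 75% of the
# phoneme duration. The weights and the alpha-weighted variance term are
# illustrative assumptions consistent with the text (alpha = 2 is used later).
import numpy as np

W = np.array([0.25, 0.5, 0.25])   # assumed weights for the three sampling points

def instance_distance(p, q):
    """Eq. 5.1: weighted sum of Euclidean distances at the three sampling points."""
    return float(sum(w * np.linalg.norm(p[k] - q[k]) for k, w in enumerate(W)))

def impurity(instances, alpha=2.0):
    """Eqs. 5.2-5.3: mean distance of each instance to the others; keep the
    smallest mean (and the spread of its distances) as the subset impurity."""
    n = len(instances)
    if n < 2:
        return 0.0
    best = None
    for i, p in enumerate(instances):
        dists = [instance_distance(p, q) for j, q in enumerate(instances) if j != i]
        mean_d = np.mean(dists)                      # eq. 5.2
        if best is None or mean_d < best[0]:
            best = (mean_d, np.var(dists))
    return best[0] + alpha * best[1]                 # eq. 5.3 (assumed combination)

rng = np.random.default_rng(1)
subset = [rng.normal(size=(3, 6)) for _ in range(10)]
print(round(impurity(subset), 3))
```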
Because features of the neighboring instances are included in this description, instances of a single Dutch phoneme can be mapped onto different visemes, depending on their context in the sentence. 5.2.3 Pre-cluster The complete database contains a very large number of phoneme instances. Obviously, it would require a very complex calculation to perform the tree-based clustering on the complete dataset at once. A

common approach in regression tree analysis is to select a particular feature for pre-clustering the data: the data instances are first grouped based on this feature, after which a separate tree-based clustering is performed on each of these groups. Two different options for this pre-cluster feature were investigated. In a first approach, the identity of the phoneme corresponding to each instance was chosen as pre-cluster feature. This implies that for each Dutch phoneme, a separate tree will be constructed. In another approach, the consonant/vowel property was used to pre-cluster the data. This way, only two large trees are calculated: a first one to cluster the data instances corresponding to a vowel and another tree to cluster the instances corresponding to a consonant. This second approach makes it possible that two different Dutch phonemes, both in a certain context, are mapped on the same tree-based viseme, as is illustrated in figure 5.1. Figure 5.1: Difference between using the phoneme identity (left) and the c/v property (right) as pre-cluster feature. 5.2.4 Clustering into visemes Once a pre-cluster feature has been selected, it has to be decided which features are used to build the regression trees. Many configurations are possible since features from the instance itself as well as features from its neighbors can be applied. In addition, a stop size has to be chosen, which corresponds to the minimal number of data instances that should reside in a node. This parameter has to be chosen small enough to ensure an in-depth analysis of the data. On the other hand, an end-node from the regression tree is characterized by the mean representation of its data instances. Therefore, the minimal number of instances in an end-node should be adequate to cope with inaccuracies in the training data (e.g., local phonemic segmentation errors). After extensive testing (using similar experiments as described in the next section), two final configurations for the tree-based clustering were defined, as described in table 5.1.

                                   Configuration A                        Configuration B
Pre-cluster feature                phoneme identity                       c/v classification
Current instance's features        /                                      phoneme identity, phonetic features, visible features
Neighboring instances' features    phonetic features, visible features    phonetic features, visible features

Table 5.1: Tree configurations A and B that were used to build the regression trees that map phonemes to visemes.

For the clustering calculations, the distance between two data samples was calculated using equation 5.1 and the impurity of the data was expressed as in equation 5.3 (using α = 2). In configuration A, a separate tree is built for each Dutch phoneme. Each of these trees defines a partitioning of all the training instances of a single phoneme based on their context. This context is described using both the set of phonetic and the set of visible features that were described in section 5.2.2, which should be sufficient to model the influence of the context on the dynamics of the current phoneme. Alternatively, in configuration B only two large trees are calculated. As these trees are built using a

huge amount of training data (for each tree, a fixed maximum number of uniformly sampled data instances was chosen), for both the instance itself and for its neighboring instances all possible features are given as input to the tree-building algorithm. Although the description of a data instance based on both the phoneme identity and its phonetic/visible features contains some level of redundancy, this way a maximal number of features is available to efficiently and rapidly decrease the impurity in the large data set. The clustering algorithm itself will determine which features to use for this purpose. For both configurations A and B, the trees were built using 25 and 50 instances as stop size, respectively, resulting in trees A25, A50, B25 and B50. 5.2.5 Objective candidate test As was already mentioned in the previous section, the large number of possible tree configurations imposes the need for an objective measure to assess the quality of the tree-based mapping from phonemes to visemes. An objective test was designed for which a number of database sentences are re-synthesized using our concatenative visual speech synthesizer. For every synthesis, the target original sentence is excluded from the database. The original database transcript is used as text input and the original database auditory speech is used as audio input. Both the database labeling and the description of the synthesis targets are written in terms of the tree-based visemes. As usual, the synthesis involves a set of candidate segments being determined for each synthesis target. To calculate a quality measure for the applied speech labeling, in a first step these candidate segments are ranked in terms of their total target cost. This total target cost is a combination of a cost based on the timing of the candidate and a cost based on its context (see section 3.4). Next, the n best candidates are selected and their distance from the ground truth is measured using the three-point distance that was described in equation 5.1. Finally, for each synthesis target a single error value is calculated by computing the mean distance over these n best candidates. Since the re-synthesis of a fixed group of database sentences using different speech labels defines corresponding synthesis targets for each of these label sets (as the original target phoneme sequence is the same for each approach), the calculation of the mean candidate quality for each synthesis target produces paired-sample data that can be used to compare the accuracy of the different speech labeling approaches. Using this objective measure, tree configurations A and B (see section 5.2.4) could be identified as high-quality clustering approaches. In figure 5.2 their performance is illustrated (using the n=50 best candidates and omitting the silence targets from the calculation, since they are 1x1 mapped on the silence viseme), together with the results obtained using two reference methods. The first reference result PHON was measured using a phoneme-based description of the database and the synthesis targets. For the second reference approach STDVIS, an Nx1 phoneme-to-viseme mapping for Dutch was constructed, based on the 11 viseme classifications described by Van Son et al. [34]. These Dutch standardized visemes are based on both subjective perception experiments and prior phonetic knowledge about the uttering of Dutch phonemes.
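The candidate test can be summarized in a few lines of code: rank the candidates of a target by total target cost, keep the n best and average their distance to the ground-truth segment. The sketch below uses toy stand-ins for the data structures; a real run would use the three-point AAM distance of equation 5.1 instead of a scalar difference.

```python
# Sketch: the objective candidate-quality measure of section 5.2.5. For each
# synthesis target the candidates are ranked by total target cost, the n best
# are kept, and their mean distance to the ground-truth segment is reported.
# Data structures and values are illustrative; a real run would pass the
# three-point AAM distance (eq. 5.1) as the distance function.
import numpy as np

def mean_nbest_distance(candidates, ground_truth, distance, n_best=50):
    """candidates: list of (target_cost, segment); returns the mean distance of
    the n_best lowest-cost candidates to the ground-truth segment."""
    ranked = sorted(candidates, key=lambda c: c[0])[:n_best]
    return float(np.mean([distance(seg, ground_truth) for _, seg in ranked]))

# Toy example with scalar "segments" and absolute difference as distance.
rng = np.random.default_rng(2)
cands = [(rng.uniform(), rng.normal()) for _ in range(200)]
print(round(mean_nbest_distance(cands, 0.0, lambda a, b: abs(a - b), n_best=50), 3))
```

Because the same target sentences are re-synthesized under every labeling scheme, these per-target values form paired samples, which is what allows the paired statistical comparison described above.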

Figure 5.2: Candidate test results obtained for a synthesis based on phonemes, a synthesis based on standardized Nx1 visemes and multiple syntheses based on tree-based NxM visemes (mean distances). The CC values were obtained by incorporating the context target cost to determine the n-best candidates. Figure 5.2 shows that phonemes appear to describe the synthesis targets more accurately than the standard Nx1 visemes. This difference was found to be statistically significant by a paired t-test (p < 0.05). This result is in line with the results that were obtained for English (see section 4.3), where a synthesis based on phonemes outperformed all syntheses based on Nx1 visemes. In addition, all tree-based label approaches significantly (p < 0.05) exceed both the phoneme-based labeling and the standard Nx1 viseme-based labeling. Thus, unlike the Nx1 visemes, the tree-based NxM phoneme-to-viseme mappings indeed define an improved description of the visual speech in comparison with phonemes. The results obtained show only minor, non-significant differences between the various tree configurations. In addition, an improvement of the candidate arrangement can be noticed when the context target cost is applied, especially for the phoneme-based and Nx1 viseme-based labels. As this context cost is used to model the visual coarticulation effects, it is logical that its usage has less influence on the results obtained for the NxM viseme-based labels, as they intrinsically model the visual coarticulation themselves. Note that, even when the context target cost was applied, the NxM viseme labels scored significantly better (p < 0.05) than the phoneme-based and Nx1 viseme-based labels. 5.3 Towards a useful many-to-many mapping scheme 5.3.1 Decreasing the number of visemes Using the regression trees, the phoneme instances from the Dutch speech database were clustered into small subsets. The viseme corresponding to a given phoneme instance is determined by the traversal of such a tree based on various properties of the instance itself and on the properties of its context. Given the extensive amount of training data and the large number of decision features, the tree-based clustering results in a large number of distinct visemes: 1050 for the A25-tree, 650 for the A50-tree, 800 for the B25-tree and 412 for the B50-tree. These big numbers are partly caused by the fact that an extensive analysis with many splitting steps was applied during the tree building. This is necessary since some pre-clusters contain a large number of diverse data instances. On the other hand, for the splitting of the data instances from some of the other pre-clusters (e.g., pre-clusters corresponding to less common phonemes) fewer splitting steps would have been sufficient. Consequently, the tree-based splitting has not only resulted in a large number of end-nodes but also in an over-splitting of some parts of the dataset. Another reason for the large number of tree-based

visemes is the fact that the pre-clustering step makes it impossible for the tree-clustering algorithm to combine similar data instances from different pre-clusters into the same node. Therefore, it can be assumed that for each tree configuration, many of its tree-based visemes are in fact similar enough to be considered as one single viseme. The standardized Nx1 viseme mapping identifies 11 distinct visual appearances for Dutch. A good quality automatic viseme classification should be expected to define at most a few visemes more. Since the tree-based clustering results in the definition of a much larger number of visemes, more useful NxM phoneme-to-viseme mapping schemes were constructed by performing a new clustering on the tree-based visemes themselves. First, a general description for each tree-based viseme defined by a particular tree configuration was determined. To this end, for each end-node of the tree a mean set of combined AAM parameters was calculated, sampled over all phoneme instances that reside in this node. Since the original phoneme instances were sampled at three distinct points, the tree-based visemes are also described by three sets of combined AAM parameters (describing the visual appearance at 25%, 50% and 75% of the viseme's duration). Next, based on their combined AAM parameters, all tree-based visemes were clustered using a k-means clustering approach [42]. Note that for k-means clustering, the number of clusters needs to be determined beforehand. Estimating this number using a heuristic approach (see footnote 1) resulted in about 20 clusters for all tree configurations. For the k-means clustering calculations, two different distances between the tree-based visemes were defined: the Euclidean difference between the combined AAM parameters of the frames at the middle of the visemes and the weighted sum of the distances between the combined AAM parameters of the frames at 25%, 50% and 75% of the duration of the visemes (see equation 5.1). Using the first distance measure, 11 and 20 clusters were calculated, respectively. These numbers of clusters were chosen to match the standard Nx1 viseme mapping and according to the heuristic method, respectively. In addition, using the three-point distance measure, an extra clustering into 50 distinct visemes was calculated. A larger number of clusters was chosen here since this distance measure incorporates the dynamics of the visemes, which is likely to result in the existence of more distinct visemes. Also, this number is the same as the number of Dutch phonemes that were used for the initial database labeling, which might be useful for comparison later on. Due to time constraints, only the A25 and B25 trees were processed, since these define a more in-depth initial segmentation of the training data. In addition, a single clustering of the visemes defined by the B50 tree was performed for verification, as described in table 5.2.

            11 clusters    20 clusters    50 clusters
Tree A25    A25_11         A25_20         A25_50
Tree B25    B25_11         B25_20         B25_50
Tree B50    -              B50_20         -

Table 5.2: Mapping from tree-based visemes to final NxM visemes.

5.3.2 Evaluation of the final NxM visemes In order to evaluate the final NxM phoneme-to-viseme mapping schemes, a similar evaluation of the n-best synthesis candidates as was used for the evaluation of the tree-based visemes (see section 5.2.5) was performed. Both phoneme-based speech labels and standard Nx1 viseme-based labels as well as a labeling using tree-based visemes A25 and B25 were added as reference. Figure 5.3 illustrates the test results.
The n=50 best candidates were used and silences were omitted from the calculation. The results were analyzed using a paired t-test.

¹ In this method the optimal number of clusters is estimated by successively performing the k-means clustering while increasing the cluster count. Afterwards, the final number of clusters is chosen by graphically determining the step at which there is a drop in the marginal gain of the percentage of variance that is explained by the clusters.
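The heuristic described in the footnote can be sketched as follows: k-means is run for an increasing number of clusters while the fraction of variance explained by the clustering is tracked, after which the knee of that curve is selected. The snippet below is only an illustration on random data; the actual feature construction and the graphical knee detection of the experiments are not reproduced here.

```python
# Hedged sketch of the elbow-style heuristic for choosing the cluster count.
import numpy as np
from sklearn.cluster import KMeans

def explained_variance_curve(features, k_values):
    total_ss = ((features - features.mean(axis=0)) ** 2).sum()
    curve = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
        # km.inertia_ is the within-cluster sum of squares.
        curve.append(1.0 - km.inertia_ / total_ss)
    return curve

rng = np.random.default_rng(0)
features = rng.normal(size=(1050, 90))   # e.g. flattened viseme descriptors
ks = list(range(2, 31))
curve = explained_variance_curve(features, ks)
gains = np.diff(curve)
# The paper determined the knee graphically; printing the marginal gains is a
# simple stand-in for that visual inspection.
for k, g in zip(ks[1:], gains):
    print(k, round(float(g), 4))
```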

Figure 5.3: Candidate test results obtained for a synthesis based on phonemes (PHON), a synthesis based on standardized Nx1 visemes (STDVIS) and multiple syntheses based on the final NxM visemes (mean distances). Some results obtained using tree-based NxM visemes (A25, B25) are added for comparison purposes.

The results show that all final NxM visemes significantly (p < 0.05) outperform a labeling based on phonemes and a labeling based on the standard Nx1 visemes. This is an important result since, unlike for the tree-based visemes, these final NxM mapping schemes define only a limited number of distinct speech labels. The NxM mappings on 11 and 20 distinct visemes result in a more accurate speech labeling in comparison with a phoneme-based labeling, despite the fact that they use fewer than half the number of distinct labels. In addition, it can be seen that the more NxM visemes are defined, the better the test results appear to be. This means that the NxM mappings on 50 visemes indeed model some extra differences compared with the NxM mappings on 20 visemes. From the test results it can also be concluded that the corresponding mappings derived from tree configurations A and B perform comparably. Furthermore, it appears that a k-means clustering of the visemes defined by the B50-tree performs worse (p < 0.05) than a similar clustering of the visemes defined by the B25-tree. When the results obtained for the final NxM visemes are compared with the results obtained for the tree-based visemes, the latter perform best (p < 0.05). However, this difference can still be considered rather small, given that for the tree-based visemes the number of distinct speech labels is up to a factor of 50 higher than for the final NxM visemes (e.g., 20 distinct labels for the A25_20 approach versus 1050 distinct labels for the A25 mapping scheme).

6. APPLICATION OF MANY-TO-MANY VISEMES FOR CONCATENATIVE VISUAL SPEECH SYNTHESIS

6.1 Application in a large-database system

In a first series of experiments the final NxM viseme labeling was applied for concatenative visual speech synthesis using the extensive Dutch (audio-)visual speech database [59;64]. The experiments are similar to the candidate test that was described in section 5.2.5, except that for actual synthesis it is not the quality of the n-best candidate segments but the quality of the final selected segment that is important. This final selection is based on the minimization of both target and join costs (see figure 6.1). Given the large size of the speech database, many candidates are available for each synthesis target. In order to reduce the calculation time, the synthesizer uses only the 700 best candidates (in terms of total target cost) for further selection.
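As a rough illustration of this pruning step, the candidates for one synthesis target can be reduced to the segments with the lowest total target cost before the join-cost optimization. The segment representation and cost function below are placeholders, not the synthesizer's actual implementation.

```python
# Hedged sketch of the candidate pre-selection by total target cost.
import heapq

def preselect_candidates(target, database_segments, target_cost, n_best=700):
    """Return the n_best database segments with the smallest total target cost."""
    return heapq.nsmallest(n_best, database_segments,
                           key=lambda seg: target_cost(target, seg))

# Illustrative usage with dummy segments and a dummy cost function.
segments = [{"id": i, "duration": 0.01 * i} for i in range(5000)]
cost = lambda target, seg: abs(seg["duration"] - target["duration"])
shortlist = preselect_candidates({"duration": 0.12}, segments, cost)
```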

Figure 6.1: Overview of the visual speech synthesis stages: (A) determine the candidate units from the database using the target costs, (B) select the final unit from the candidates using the join costs, and (C) concatenate, optimize and synchronize the selected units into the output speech. In sections 5.2 and 5.3 the accuracy of the speech labels was tested in stage A, while the attainable synthesis quality can be measured by evaluating the final selection in stage B.

In order to objectively assess the attainable synthesis quality using a particular speech labeling, a synthesis experiment was conducted in which 200 randomly selected sentences from the database were re-synthesized. For every synthesis, the target original sentence was excluded from the database. The original database transcript was used as text input and the original database auditory speech was used as audio input. During synthesis, the synthesizer selects for each synthesis target the best-matching segment from the database. An objective measure was calculated by comparing each of these selected segments with its corresponding ground-truth segment, using the weighted distance that was described in equation 5.1. Silences were omitted from the calculation. Figure 6.2 illustrates the test results obtained for some important NxM viseme-based labeling schemes described in section 5.3 and for a baseline system using phoneme-based speech labels. The results were further analyzed using a paired t-test.

Figure 6.2: Evaluation of the segment selection using a large database (mean distances). Both the results obtained using a phoneme-based speech labeling (PHON) and the results obtained using multiple final NxM viseme sets (B25_11, B25_20, B25_50) are shown.

In section 5.3 it was described how an objective experiment pointed out that the best mean candidate quality (measured over the 50 best candidates) is attained by a synthesis based on NxM visemes. The current experiment shows, however, a different behavior for the quality of the final selected segment. In this case, a selection based on phonemes performs as well as the best result that was obtained by using NxM viseme labels. It performs significantly better (p < 0.05) than the syntheses based on NxM viseme labels that use 11 or 20 distinct visemes. This can be understood as follows. The main reason why a synthesis would profit from the use of only a few distinct speech

labels is the increased number of candidates for each synthesis target. For the current experiment, however, the synthesizer's database is very large, so each of the different speech labeling strategies results in a huge number of candidates for the synthesis targets. This justifies the use of a larger number of distinct speech labels, since it will refine the segment selection, provided that the labeling is sufficiently accurate. In the current experiment the phoneme-based system performed as well as the system using 50 distinct NxM visemes. From section 5.3 it is known that the phoneme-based labeling is less accurate than this particular NxM viseme-based labeling. For synthesis, however, the selection of the final segment from all possible candidates is based on both target and join costs, meaning that the selection of a high-quality final segment from an overall low-quality set of candidate segments is possible when accurate selection costs are used. Moreover, there is no reason to assume that the final selected segment will be one of the n-best candidate segments based on target cost alone. This could explain why, for the current test, none of the syntheses based on NxM visemes is able to outperform the synthesis based on phoneme labels.

To check this assumption, the same 200 sentences were synthesized again, this time with another set of selection costs: the target cost based on the context was omitted and the influence of the join costs was reduced in favor of the influence of the target costs. Both the phoneme-based speech labeling and the NxM viseme-based labeling that scored best in the previous test were used. Obviously, the quality of the final selected segments decreased in this new synthesis setup. More importantly, for this new synthesis the NxM viseme-based result was found to be significantly better (p < 0.05 for a paired t-test) than the phoneme-based segment selection. So it appears that when non-optimal selection costs are applied, the more accurate labeling of the speech by means of the NxM visemes does improve the segment selection quality. One possible reason for this is that the join costs also partially model the visual coarticulation effects, since they push the selection towards segments that fit well with their neighboring segments in the synthetic speech signal.

From these results it can be concluded that for synthesis using a large database, the use of more distinct speech labels than the theoretical minimum (11 for Dutch) is preferable. In addition, given the large number of candidates that can be found for each target, a precise definition of the selection costs is able to conceal the differences between the accuracy of the different speech labeling approaches. Note, however, that the use of NxM viseme-based speech labels for synthesis using a large database can speed up the synthesis process. During synthesis, the heaviest calculation consists of the dynamic search among the consecutive sets of candidate segments to minimize the global selection cost. In addition, it has been shown that the use of viseme-based labels improves the overall quality of the n-best candidate segments. Therefore, the use of viseme labels permits a reduction of the number of candidate segments in comparison with a synthesis based on phoneme labels. Consequently, it will be easier to determine an optimal set of final selected segments, which results in reduced synthesis times.
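The dynamic search mentioned above can be sketched as a Viterbi-style pass over the consecutive candidate sets, accumulating target and join costs and backtracking the cheapest path. The cost functions and data structures below are placeholders under that assumption; the actual synthesizer may apply additional pruning.

```python
# Hedged sketch of the global selection-cost minimization.
# candidates[t] is the list of candidate segments for synthesis target t,
# target_costs[t][i] is the target cost of candidate i at target t, and
# join_cost(a, b) is the cost of concatenating segment a with segment b.
def select_segments(candidates, target_costs, join_cost):
    num_targets = len(candidates)
    best = [dict() for _ in range(num_targets)]   # lowest accumulated cost per candidate
    back = [dict() for _ in range(num_targets)]   # backpointers for path recovery
    for i, _ in enumerate(candidates[0]):
        best[0][i] = target_costs[0][i]
    for t in range(1, num_targets):
        for i, cand in enumerate(candidates[t]):
            scores = {j: best[t - 1][j] + join_cost(candidates[t - 1][j], cand)
                      for j in best[t - 1]}
            j_min = min(scores, key=scores.get)
            best[t][i] = scores[j_min] + target_costs[t][i]
            back[t][i] = j_min
    # Backtrack the globally cheapest path of segments.
    i = min(best[-1], key=best[-1].get)
    path = [i]
    for t in range(num_targets - 1, 0, -1):
        i = back[t][i]
        path.append(i)
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]
```

The cost of such a search grows with the square of the number of candidates per target, which is why reducing the candidate sets, as argued above, directly reduces the synthesis time.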
6.2 Application in limited-database systems

Motivation

In the previous section the use of NxM visemes for concatenative visual speech synthesis using a large database was evaluated. In practice, however, most visual speech synthesis systems use a much smaller database from which the speech segments are selected. The main reason why the use of Nx1 visemes for visual speech synthesis purposes is well established is the limited amount of speech data that is necessary to cover all visemes or di-visemes of the target language. Therefore, it is useful to evaluate the use of NxM viseme-based speech labels in such a limited-database system. In this case, the number of available candidates for each synthesis target will be much smaller compared with the large-database system that was tested in section 6.1. It is interesting to evaluate how this affects the quality of the syntheses based on the various speech labeling approaches.

Limited databases

From the complete audio-visual database for Dutch [64], several subsets were selected, each of which defines a new limited database. The selection of these subsets was performed by a sentence selection algorithm that ensures that, for each of the speech labeling approaches under test (phoneme, standard Nx1 viseme and several NxM visemes), the subset contains at least n instances of each distinct phoneme/viseme that is defined in the label sets. The subsets obtained are summarized in table 6.1. For n=3, the sentence selection was run twice, which resulted in the distinct subsets DB1 and DB2.

Database name    n    Database size (sentences)
DB1              3
DB2              3
DB3

Table 6.1: Construction of limited databases

Evaluation of the segment selection

The attainable synthesis quality using these limited databases was assessed by the same objective test that was described in section 6.1. In a first execution, 120 sentences were re-synthesized using DB1. Various NxM viseme-based labeling approaches as well as two baseline systems (phoneme-based and standard Nx1 viseme-based) were used. The results obtained are shown in figure 6.3 and were analyzed using a paired t-test.

Figure 6.3: Evaluation of the selected segments using DB1 (mean distances). The results obtained using a phoneme-based speech labeling (PHON), the standardized Nx1 viseme labeling (STDVIS) and various final NxM viseme sets (A25_11, A25_20, A25_50, B25_11, B25_20, B25_50) are shown.

The results show that for a synthesis using a limited database, all NxM viseme-based labeling approaches result in the selection of significantly (p < 0.05) better segments than syntheses based on phonemes or Nx1 visemes. In addition, in this test the viseme labels using 11 and 20 distinct visemes scored better than the label sets that use 50 distinct visemes. This could be explained by the fact that, due to the limited amount of available speech data, the mappings on 50 visemes yield considerably fewer candidate segments for each synthesis target than the approaches that use fewer distinct speech labels. This assumption is in line with the results shown in figure 6.2, where it was found that when a large amount of speech data is available, the viseme sets using 50 distinct labels perform best. It is worth mentioning that for the current test, the viseme sets using 50 distinct labels still performed significantly better (p < 0.05) than the phoneme-based labeling, which uses the same number of distinct speech labels. Similarly, the syntheses based

on 11 distinct NxM visemes performed significantly better (p < 0.05) than the synthesis based on the 11 standard Nx1 visemes. These are important results, since they show that an NxM viseme labeling does help to improve the segment selection quality in the case where not much speech data is available. To verify these results, similar experiments were conducted using other sets of target sentences and other limited databases (DB2 and DB3). All these experiments yielded results similar to those shown in figure 6.3.

Evaluation of the synthetic visual speech

Motivation

While the tests described in section 6.1 and in the previous section evaluated the effect of the speech labeling approach on the segment selection, and thus on the attainable speech quality, some final experiments were conducted in order to assess the achieved quality of the visual speech synthesizer (denoted as stage C in figure 6.1). For this, instead of evaluating the quality of the segment selection itself, the quality of the final visual output speech resulting from the concatenation of the selected segments is assessed. This concatenation involves some optimizations, such as a smoothing of the parameter trajectories both at the concatenation points and by a low-pass filtering technique (see [59] for more details). In addition, the concatenated speech is non-uniformly time-scaled to achieve synchronization with the target audio segmentation. It is interesting to evaluate the effect of these final synthesis steps on the observed differences between the different speech labeling approaches.

Objective evaluation

To objectively assess the quality of the synthesized speech, 70 randomly selected sentences from the full-size database were re-synthesized. The limited database DB1 was used as selection data set. It was ensured that the target sentences were not part of this database. The original database transcript was used as text input and the original database auditory speech was used as audio input. Optimal settings were used for the concatenation smoothing and for the other synthesis optimizations. To measure the quality of the synthesized visual speech, the final combined AAM parameter trajectories describing the output speech were compared with the ground-truth trajectories from the speech database. As distance measure a dynamic time warping (DTW) cost was used.

Dynamic time warping [67] is a time-series similarity measure that minimizes the effects of shifting and distortion in time by allowing an elastic transformation of the time series in order to detect similar shapes with different phases. A DTW from a time series X = (x_1, ..., x_N) to a time series Y = (y_1, ..., y_M) first involves the calculation of the local cost matrix C, representing all pairwise differences between the elements of X and Y. A warping path between X and Y can be defined as a series of tuples (n_k, m_k) that defines the correspondences from elements of X to elements of Y. When this warping path satisfies certain criteria, such as a boundary condition, a monotonicity condition and a step-size condition, it defines a valid warp from series X towards series Y (see [68] for a detailed explanation). A warping cost can be associated with each warping path by adding all local costs collected by traversing matrix C along the warping path. Through dynamic programming, the DTW algorithm searches for the optimal warping path between X and Y that minimizes this warping cost. This optimal warping path defines a useful distance measure between series X and Y through its associated warping cost.
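A minimal sketch of this DTW cost for a single one-dimensional parameter trajectory is given below, including a normalization by the length of the warping path as used in the experiment described next. The implementation details (distance on scalar values, tie handling between predecessors) are illustrative rather than a reproduction of the authors' implementation.

```python
# Hedged sketch of a path-length-normalized DTW cost between two trajectories.
import numpy as np

def dtw_cost(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, m = len(x), len(y)
    local = np.abs(x[:, None] - y[None, :])      # local cost matrix C
    acc = np.full((n, m), np.inf)                # accumulated warping cost
    steps = np.zeros((n, m), dtype=int)          # length of the best path so far
    acc[0, 0], steps[0, 0] = local[0, 0], 1
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            moves = []
            if i > 0: moves.append((acc[i - 1, j], steps[i - 1, j]))
            if j > 0: moves.append((acc[i, j - 1], steps[i, j - 1]))
            if i > 0 and j > 0: moves.append((acc[i - 1, j - 1], steps[i - 1, j - 1]))
            prev_cost, prev_len = min(moves, key=lambda t: t[0])
            acc[i, j] = prev_cost + local[i, j]
            steps[i, j] = prev_len + 1
    return acc[-1, -1] / steps[-1, -1]           # cost normalized by path length

print(dtw_cost([0.0, 0.2, 0.4, 0.4], [0.0, 0.1, 0.2, 0.4, 0.4]))
```

In the experiment, one such cost would be computed per combined AAM parameter trajectory and the per-parameter costs combined with variance-based weights, as described below.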
For the current experiment, a synthesized sentence was assessed by first calculating for each combined AAM parameter a DTW distance measure as the cost of warping the synthesized parameter trajectory towards the ground-truth trajectory. This value is then normalized according to the length of the warping path in order to cancel out the influence of the length of the sentence. For each sentence, a final distance measure was calculated as the weighted sum of these normalized DTW costs, where the weights were chosen according to the total model variance that is explained by each AAM parameter. The use of a DTW cost has been suggested in [69], where it was concluded that this distance correlates well with subjective evaluations, as it measures the

frame-wise distance between the synthetic and ground-truth visual speech as well as the temporal relationship between these two signals. The results obtained are shown in figure 6.4.

Figure 6.4: DTW-based evaluation of the final synthesis result (mean distances). The results obtained using a phoneme-based speech labeling (PHON), the standardized Nx1 viseme set (STDVIS) and various final NxM viseme sets (B25_11, B25_20, B25_50, A25_20) are shown.

A statistical analysis of the test results using a paired t-test showed that the syntheses based on NxM visemes scored significantly better (p < 0.05) than the syntheses based on phonemes and Nx1 visemes. The best results were obtained using the B25_20 visemes, although the differences among the different NxM viseme sets were not found to be significant. These results are in line with the evaluations of the segment selection quality described above. However, in the current test the labeling approach using 50 distinct visemes scores as well as the other NxM viseme-based approaches. It seems that the concatenations and optimizations in the final stages of the synthesis partly conceal the quality differences between the various speech labeling approaches that were measured for the segment selection stage.

Subjective evaluation

In addition to the objective evaluation, two subjective perception experiments were performed in order to compare the achieved synthesis quality using the different speech labeling approaches. For this, 20 randomly selected sentences from the full-size database were re-synthesized, using the limited database DB1 as selection data set. It was ensured that the target sentences were not part of this database. The original database transcript was used as text input and the original database audio was used as audio input. Optimal settings were used for the concatenation smoothing and for the other synthesis optimizations. For verification purposes, the DTW-based objective evaluation described above was repeated for the 20 sentences that were used in the subjective experiments, which resulted in observations comparable to the results that were obtained using the larger test set (see figure 6.4).

In a first subjective test, the differences between the NxM viseme sets B25_11, B25_20, B25_50 and A25_20 were investigated. Ten people participated in the experiment (8 male, 2 female; 9 of them aged 24-32, 1 aged 60), of whom 7 can be considered speech technology experts. The samples were shown pairwise to the participants, considering all comparisons among the four approaches under test. The order of the comparison types as well as the order of the sample types within each pair were maximally randomized. The participants were asked to give their preference for one of the two samples of each pair using a 5-point comparative MOS scale [-2, 2]. They were instructed to answer '0' if they had no clear preference. The test instructions told the participants to pay attention to both the naturalness of the mouth movements and to how well these

movements are in coherence with the auditory speech that is played along with the video. The key question of the test read as follows: "How much are you convinced that the person you see in the sample actually produces the auditory speech that you hear in the sample?" The results of the test are summarized in figure 6.5.

Figure 6.5: Subjective test results. The histograms show for each comparison the participants' preference for the left/right sample type on a 5-point scale [-2, 2].

The results show a slight preference for the syntheses using the B25_20 and B25_50 visemes, but none of the differences between the methods were shown to be significant by a paired-sample Wilcoxon signed-rank analysis. This result is in line with the results obtained by the objective DTW-based evaluation. As a conclusion, the B25_20 labels were selected as the most preferable viseme set for synthesis using database DB1, due to the slight preference for this method in both the objective and the subjective test. In addition, this labeling fits best with the assumption that an automatic viseme classification should identify more than 11 visemes but fewer than the number of phonemes.

In a last experiment, the NxM viseme labeling approach B25_20 was subjectively compared with a phoneme-based and an Nx1 viseme-based synthesis. A new perception experiment was conducted, with a set-up similar to the previous experiment. In the current test all comparisons between the three approaches under test were evaluated by 11 participants (9 male, 2 female, aged 24-60). Six of them can be considered speech technology experts. Figure 6.6 illustrates the test results.
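For reference, a paired-sample Wilcoxon signed-rank analysis of the comparative MOS scores for one comparison type could be carried out as sketched below. The scores are made up for illustration only, and SciPy's wilcoxon routine is used as a stand-in for whatever statistics tool was actually employed.

```python
# Hedged sketch: signed-rank test of CMOS preference scores against "no preference".
import numpy as np
from scipy.stats import wilcoxon

# Illustrative per-pair ratings on the [-2, 2] scale for one comparison type;
# the scores already encode the paired difference between the two sample types.
cmos_scores = np.array([1, 0, 2, -1, 1, 0, 1, 2, 0, 1])
stat, p_value = wilcoxon(cmos_scores)   # zeros ('no preference') are discarded by default
print(f"W = {stat}, p = {p_value:.3f}")
```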

Figure 6.6: Subjective test results. The histograms show for each comparison the participants' preference for the left/right sample type on a 5-point scale [-2, 2].

The results obtained show a behavior similar to that of the DTW-based objective evaluation described above. The synthesis based on the B25_20 NxM visemes was preferred over the synthesis based on phonemes and the synthesis based on standard Nx1 visemes. Both these differences were found to be significant (p < 0.05) by a paired-sample Wilcoxon signed-rank analysis. On the other hand, from the histograms it is clear that for many comparison pairs the test subjects answered 'no difference'. Also, in a substantial number of cases the phoneme-based synthesis was preferred over the NxM viseme-based synthesis. Feedback from the test subjects and a manual inspection of their answers pointed out that they often found it difficult to assess the difference between the two test samples of a comparison pair. This was mainly due to small local errors in the synthetic visual speech. These local audiovisual incoherencies degraded the perceived quality of the whole sample, even if the sample was of overall higher quality than the other test sample of the comparison pair. This observation is similar to [69], where it was concluded that the influence of local errors on the perceived speech quality can disturb this kind of subjective evaluation experiment.

7. DISCUSSION AND CONCLUSIONS

For some time now, the use of visemes to label visual speech data has been well established. This labeling approach is often used in visual speech analysis or synthesis systems, where the mapping from phonemes to visemes is constructed as a many-to-one relationship. In this paper the usage of both standardized and speaker-dependent English many-to-one phoneme-to-viseme mappings in concatenative visual speech synthesis was evaluated. A subjective experiment showed that the viseme-based syntheses were unable to increase (or even decreased) the attained synthesis quality compared to a phoneme-based synthesis. This is likely to be explained by the limited power of many-to-one phoneme-to-viseme mappings to accurately describe the visual speech information. As every instance of the same phoneme is mapped to the same viseme, these viseme labels are incapable of describing the visual coarticulation effects. This implies the need for a many-to-many phoneme-to-viseme mapping scheme, where on the one hand instances of different phonemes can be mapped to the same viseme, while on the other hand two instances of the same phoneme can be mapped to different visemes.

Using a large Dutch audiovisual speech database, a novel approach to construct many-to-many phoneme-to-viseme mapping schemes was designed. In a first step, regression trees were trained in order to cluster the visual appearances of the phoneme instances from the speech database. The mapping from phonemes to these tree-based visemes was based on several properties of the phoneme instance itself and on properties of its neighboring phoneme instances. Several tree


CAP 6412 Advanced Computer Vision CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong April 21st, 2016 Today Administrivia Free parameters in an approach, model, or algorithm? Egocentric videos by Aisha

More information

3D LIP TRACKING AND CO-INERTIA ANALYSIS FOR IMPROVED ROBUSTNESS OF AUDIO-VIDEO AUTOMATIC SPEECH RECOGNITION

3D LIP TRACKING AND CO-INERTIA ANALYSIS FOR IMPROVED ROBUSTNESS OF AUDIO-VIDEO AUTOMATIC SPEECH RECOGNITION 3D LIP TRACKING AND CO-INERTIA ANALYSIS FOR IMPROVED ROBUSTNESS OF AUDIO-VIDEO AUTOMATIC SPEECH RECOGNITION Roland Goecke 1,2 1 Autonomous System and Sensing Technologies, National ICT Australia, Canberra,

More information