In Pursuit of Visemes

ISCA Archive. AVSP: International Conference on Audio-Visual Speech Processing, Hakone, Kanagawa, Japan, September 30 - October 3, 2010.

Sarah Hilder, Barry-John Theobald, Richard Harvey
School of Computing Sciences, University of East Anglia, UK
{s.hilder, b.theobald, r.w.harvey}@uea.ac.uk

Abstract

We describe preliminary work towards an objective method for identifying visemes. Active appearance model (AAM) features are used to parameterise a speaker's lips and jaw during speech. The temporal behaviour of the AAM features between automatically identified salient points is used to represent visual speech gestures, and visemes are created by clustering these gestures using dynamic time warping (DTW) as a cost function. This method produces a significantly more structured model of visual speech than if a typical phoneme-to-viseme mapping is assumed.

Index Terms: Visemes, visual speech encoding.

1. Introduction

Phonemes are a linguistic unit used to represent speech: replacing one phoneme with another will change the meaning of an utterance. Visemes (visual phonemes) [1] are a supposed unit of visual speech; however, visemes are less well defined and less well understood. Our goal is to identify a representation for visual speech equivalent to the use of phonemes to transcribe the sounds of an acoustic utterance.

Previous efforts to define visemes have proved inconclusive. Many assume a many-to-one relationship between phonemes and visemes [2, 3, 4, 5, 6], but there are a number of limitations with this approach. Firstly, it does not take into account the asynchrony between the acoustic and visual modalities of speech, where the onset of movement does not always correspond to the onset of the acoustic realisation of a phone. Secondly, some phones do not require the use of the visual articulators, so phonemes such as /k/ or /g/, which are velar consonants articulated at the back of the soft palate, are unlikely to have an associated viseme. Finally, previous approaches generally ignore coarticulation effects: the allophones of a phoneme often appear very different visually, yet these obviously different visual gestures are assigned the same meaning in a visual sense.

To overcome these limitations, we break the assumption of a formal link between the acoustic and the visual representations of speech. Instead, we will use machine learning algorithms to analyse visual speech and automatically identify patterns of visual behaviour. This paper describes preliminary work towards this goal. The remainder of this paper is organised as follows. Section 2 reviews previous methods used to identify visemes, and Section 3 describes our approach. Section 4 discusses our findings, and Section 5 outlines future work.

2. Background and Related Work

The sounds of speech are dependent on the formation of the articulatory organs, such as the nasal cavity, tongue, teeth, velum and lips. Since only a small number of these articulators are visible (the lips, and partially the teeth and tongue), it is apparent that a one-to-one mapping between phonemes and visemes results in redundancy. Speech sounds that differ in their voicing or nasality tend to appear visually similar and so are assigned to the same visemic class. For example, /f/ is voiceless and /v/ is voiced, but both have the same place and manner of articulation (labiodental fricative). As they appear the same, they are often considered to form a viseme.
Typically, phonemes are mapped to visemes using some form of subjective assessment based on analysing patterns of confusion in a stimulus-response matrix [2]. Visemes are defined by clustering phonemes such that the within-cluster response accounts for at least 75% of all responses. Auer and Bernstein [3] compiled a set of consonant visemes and vowel visemes (referred to as phonemic equivalence classes, or PECs) by grouping phonemes based on their similarity. Their data were taken from the Eberhardt et al. consonant recognition task, where participants were asked to lip-read from C-/a/ contexts [4], and from Montgomery and Jackson's vowel recognition task, where participants were asked to lip-read the vowel in /h/-V-/g/ contexts [7]. They find that twelve PECs best approximate the data. Similarly, Lesner and Kricos [6] asked subjects to lip-read vowels and diphthongs in /h/-V-/g/ contexts spoken by different speakers. They find that the number and composition of visemes differ across speakers, and that speakers who are easier to speech-read generally produce a larger number of visemes.
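The within-cluster grouping rule described at the start of this section can be made concrete with a small sketch. The phoneme set and response counts below are invented, and the greedy merging procedure is only one plausible reading of the 75% criterion, not the procedure used in any of the studies cited here.

```python
import numpy as np

# Hypothetical stimulus-response confusion matrix: rows are presented
# phonemes, columns are perceived responses (response counts).
phones = ["p", "b", "m", "f", "v"]
confusions = np.array([
    [20,  9,  8,  2,  1],
    [10, 18, 10,  1,  1],
    [ 8, 10, 20,  1,  1],
    [ 1,  1,  1, 25, 12],
    [ 1,  1,  2, 13, 23],
])

def viseme_groups(conf, labels, threshold=0.75):
    """Greedily merge phoneme groups until, within every group, responses
    falling inside the group account for at least `threshold` of the total."""
    groups = [[i] for i in range(len(labels))]

    def within(group):
        return conf[np.ix_(group, group)].sum() / conf[group, :].sum()

    while len(groups) > 1:
        below = [g for g in groups if within(g) < threshold]
        if not below:
            break
        worst = min(below, key=within)
        others = [g for g in groups if g is not worst]
        # merge with the group that attracts most of the remaining responses
        target = max(others, key=lambda g: conf[np.ix_(worst, g)].sum())
        groups.remove(worst)
        groups.remove(target)
        groups.append(sorted(worst + target))

    return [[labels[i] for i in g] for g in groups]

print(viseme_groups(confusions, phones))   # [['p', 'b', 'm'], ['f', 'v']]
```

With these invented counts the rule recovers the familiar bilabial and labiodental groupings, but small changes to the counts (or a different merge order) can change the result, which is one reason the subjectively derived viseme sets disagree.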

A similar experiment was performed by Jiang et al. [8], in which participants were asked to lip-read the consonant in C-V words uttered by four speakers. They also find that visemes are both speaker dependent and context dependent, with the number of visemes varying from four to six. Fisher [1] asked participants to lip-read the initial and final consonants of an utterance using a forced-error approach, where the correct response was omitted from a closed set of possible responses. The results suggest that viseme groupings for initial and final consonants differ, and that initial consonants exhibit directional confusions: for example, /m/ is significantly confused with /b/, but /b/ is not significantly confused with /m/.

For experiments that require human participation, it is necessary for stimuli to be kept simple. Many of the methods used to identify visemes use data where phonemes are presented only in a single context [3, 7]. Natural speech production is much less constrained than this, and visual articulation is driven by what is necessary for acoustic distinctiveness, not for visual distinctiveness [14]. The physical constraints enforced by the human muscular system prevent the articulators from switching position instantaneously, which results in blurring across gesture boundaries [10]. Perkell and Matthies [9] describe this as the superposition of multiple influences on the movement of an articulator. Montgomery and Jackson [7] state this as the reason for selecting the /h/-V-/g/ context in their work: this context produces minimal coarticulation effects.

Coarticulation can be anticipatory (forwards coarticulation) or can reflect the influence of previous gestures (backwards coarticulation). For example, the appearance of /s/ in the words "sue" and "sea" may differ, as the anticipatory labial rounding of /u/ in "sue" begins during or before the articulation of /s/ (forwards coarticulation). Equally, the appearance of /k/ in the words "spook" and "speak" may differ due to the labial rounding of the preceding vowel in "spook" and the widening in "speak" (backwards coarticulation). Figure 1 illustrates the effect of backwards coarticulation on lip shape: the images were taken during the articulation of /t/ in the words "bat" (left) and "jot" (right), and in the latter case the lip roundedness of /dʒ/ and /ɒ/ from "jot" continues to influence /t/.

Figure 1: A frame from /t/ in the word "bat" (left) and "jot" (right). The difference in lip shape highlights the influence of coarticulation on visual speech production.

It is worth noting that coarticulation effects are not functions only of directly neighbouring phones, but have been found to be influenced by phones up to six segments away in either direction [11]. Löfqvist incorporated this idea into a theory of speech production [20], whereby each speech segment has a set of dominance functions, one for each articulator, and speech production involves concatenating and overlapping these dominance functions. The dominance and deformability of a gesture depend on whether fully reaching the articulatory targets is necessary to produce the required sound. This means that not all visual phones are equally affected by coarticulation, as the organs deemed necessary for producing a sound may or may not be visually apparent. For example, the consonants /f/ and /v/ are far less deformable than /k/ and /g/: the former are fricative consonants articulated using the upper teeth and lower lip, granting minimal freedom to the shape of the lips, whereas the latter are velar consonants articulated at the back of the soft palate, granting more freedom to the shape of the lips.
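The dominance-function idea above can be illustrated with a small numerical sketch in the spirit of the Cohen and Massaro coarticulation model [19]: each segment contributes a target value for an articulatory parameter (here a single, hypothetical lip-rounding value) together with an exponential dominance function, and the realised trajectory is the dominance-weighted blend of the overlapping targets. All timings, targets and strengths below are invented for illustration; this is not a model fitted in any of the cited studies.

```python
import numpy as np

# Hypothetical segments: (centre time in seconds, target lip rounding in [0, 1],
# dominance strength, time constant of the exponential dominance function).
segments = [
    (0.10, 0.2, 1.0, 0.06),   # deformable consonant, weak preference for spread lips
    (0.25, 0.9, 3.0, 0.08),   # /u/: dominant, strong demand for lip rounding
    (0.40, 0.3, 1.0, 0.06),   # a following deformable consonant
]

def dominance(t, centre, strength, tau):
    # exponentially decaying dominance around the segment centre
    return strength * np.exp(-np.abs(t - centre) / tau)

t = np.linspace(0.0, 0.5, 101)
num = np.zeros_like(t)
den = np.zeros_like(t)
for centre, target, strength, tau in segments:
    d = dominance(t, centre, target_weight := strength, tau) if False else dominance(t, centre, strength, tau)
    num += d * target
    den += d

rounding = num / den    # dominance-weighted blend of the articulatory targets

# Rounding rises well before the centre of /u/ (anticipatory coarticulation)
# and decays slowly afterwards (carry-over coarticulation).
print(round(float(np.interp(0.10, t, rounding)), 2),
      round(float(np.interp(0.25, t, rounding)), 2))
```

With the strongly dominant /u/ in the middle, the blended rounding at the centre of the preceding consonant is already well above that consonant's own target, mirroring the anticipatory effects discussed above.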
The articulation of the vowel /u/ is a dominant gesture, as lip rounding and protrusion are both essential to produce the sound. Consequently, in consonant recognition tasks, Owens and Blazek [15] and Benguerel and Pichora-Fuller [16] find that the most apparent coarticulation effects are in /u/-C-/u/ contexts, where C is a more deformable consonant (such as /s, z, t, d, l, n, k, g, h, j/). Benguerel and Pichora-Fuller [16] also find that in VCV contexts /u/ attains a near perfect recognition score, whereas /æ/ scores the lowest. Perkell and Matthies [9] measured coarticulation in /i/-C-/u/ utterances by recording the vertical displacement of a point on the upper lip. They find that many speakers begin lip protrusion for /u/ directly after the acoustic offset of /i/.

To date there has been no unequivocal agreement regarding the number of visemic classes, nor how the set of phonemes is clustered to form the visemes. This may be due to the subjective nature of the methods employed, where small variations in stimuli and different participants influence the resulting visemes. For the purposes of automatic lip-reading, Goldschen et al. [12] used an objective approach for identifying visemes using sentences as stimuli. A selection of static and dynamic lip features were extracted from video and manually segmented into phones. These were then clustered using a hidden Markov model similarity measure and the average-linkage hierarchical clustering algorithm [13]. The resulting visemes are consistent with results from perceptual experiments [1], but the notion of a viseme was extended to include lip opening/closing for the consonants /b/, /p/ and /m/, forming the groups /bcl, m, pcl/ and /b, p, r/ (where cl indicates closure).

To account for variation in visual articulation due to phonetic context, a many-to-many relationship between phonemes and visemes is required [17]. Mattys et al. [18] present one of the few attempts to model visemes in this way, using different viseme classes for initial and non-initial consonants. However, a limitation of this approach is that consonants are assumed to have no influence on the articulation of vowels.

We propose to identify visual units of speech independently of a phonetic/acoustic representation of speech. Instead, patterns of behaviour of the articulators will be used to identify visual meaning, and by clustering behaviours that appear similar we will identify a set of visemes. This will overcome three of the major shortcomings identified in previous work. Firstly, the allophones of a phoneme will not be required to have the same visual label. Secondly, the onset and offset of the visual gestures will be identified in the visual modality, so we do not require visemes to align with acoustic labels, as is usually the case [12, 21]. Thirdly, our analysis will be objective in nature, and the visual units will be derived from continuous speech (sentences).

3. Viseme extraction

To produce a set of visual gestures that we will refer to as visemes, we use continuous speech to acknowledge the influence of coarticulation, and adopt a data-driven approach to avoid prior assumptions regarding phonetic alignment or labels.

3.1. Stimuli

The stimuli used in this work are drawn from the LIPS2008 audio-visual corpus [22]. This contains 278 phonetically balanced sentences spoken by a single, female speaker. It was recorded at 50 frames per second in standard definition. The speaker maintained a neutral expression throughout the recording and spoke at a steady pace, and the camera captured a full frontal image of the face.

3.2. Feature extraction and preprocessing

In this work, active appearance models (AAMs) [23] are used to encode visual speech. AAMs provide a compact statistical description of the variation in the shape and appearance of a face. The shape of an AAM is defined by the two-dimensional vertex locations,

s = (x_1, y_1, x_2, y_2, \ldots, x_n, y_n)^T,

of a mesh that delineates the inner and outer lip contours and the jaw. A set of model parameters that control the non-rigid variation allowed by the mesh is derived by hand-labelling a set of training images, then applying principal components analysis (PCA) to give a compact model of the form

s = s_0 + \sum_{i=1}^{m} p_i s_i,    (1)

where s_0 is the mean shape and the vectors s_i are the eigenvectors of the covariance matrix corresponding to the m largest eigenvalues (see Figure 2). The coefficients p_i are the shape parameters, which define the contribution of each mode to the encoding of s.

The appearance of the AAM is an image defined over the pixels x = (x, y)^T that lie inside the base mesh s_0, together with a set of model parameters that control the variation allowed in the image. The appearance is constructed by warping each training image from the manually annotated mesh locations to the base shape, then applying PCA to these shape-normalised images to give a compact model of appearance variation of the form

A(x) = A_0(x) + \sum_{i=1}^{l} \lambda_i A_i(x), \quad \forall x \in s_0,    (2)

where the coefficients \lambda_i are the appearance parameters, A_0(x) is the base appearance, and the appearance images A_i(x) are the eigenvectors corresponding to the l largest eigenvalues of the covariance matrix (see Figure 2).

Figure 2: Top row: the first two modes of shape variation between ±3σ. Bottom row: the first mode of variation of the appearance model between ±3σ.

To encode the visual speech information within the video sequences, the face is tracked using the inverse compositional project-out AAM algorithm [24].
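To make Equation (1) concrete, the sketch below builds a linear shape model with PCA from a set of (here random, stand-in) landmark vectors, then projects a shape into the model and reconstructs it. The appearance model of Equation (2) is built in exactly the same way from shape-normalised pixel vectors. The data sizes and the 95% variance threshold are illustrative assumptions, not values from the paper, and AAM fitting itself (the inverse compositional project-out algorithm of [24]) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical training data: 200 hand-labelled meshes of 34 (x, y) vertices,
# flattened to shape vectors s = (x1, y1, ..., xn, yn); random stand-in values.
shapes = rng.normal(size=(200, 68))

# Linear shape model of Equation (1): s ~ s0 + sum_i p_i s_i.
s0 = shapes.mean(axis=0)                               # mean shape s0
U, sing, Vt = np.linalg.svd(shapes - s0, full_matrices=False)
var = sing ** 2 / (len(shapes) - 1)                    # variance of each mode
m = int(np.searchsorted(np.cumsum(var) / var.sum(), 0.95)) + 1
S = Vt[:m]                                             # rows are the modes s_i

def encode(s):
    """Project a shape vector onto the model to obtain the parameters p_i."""
    return S @ (s - s0)

def decode(p):
    """Reconstruct a shape from its parameters, as in Equation (1)."""
    return s0 + S.T @ p

p = encode(shapes[0])
print(m, float(np.linalg.norm(shapes[0] - decode(p))))  # modes kept, residual
```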
Next, Equations (1) and (2) are solved for the shape and appearance parameters respectively, which are then concatenated and smoothed to reduce the effects of noise. Smoothing is achieved using a cubic smoothing spline with a weighting of 0.7, a value determined by subjectively analysing the curves. The smoothed features are normalised as follows:

b = \begin{pmatrix} W p \\ \lambda \end{pmatrix},    (3)

where

W = \frac{\sum_{i=1}^{l} \sigma^2_{\lambda_i}}{\sum_{i=1}^{m} \sigma^2_{p_i}},    (4)

and where \lambda and p are column vectors of appearance and shape parameters respectively, l and m are the numbers of appearance and shape dimensions respectively, and \sigma^2_{\lambda_i} and \sigma^2_{p_i} represent the variance captured by each dimension of the respective model. A third PCA is applied to these features to model the correlated variation in the shape and appearance parameters, providing a compact, low-dimensional feature vector describing the shape and appearance variation of the lips and jaw during speech.
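A minimal sketch of the normalisation in Equations (3) and (4) and the final combined PCA, assuming the per-frame shape parameters p and appearance parameters λ have already been recovered by the tracker; the arrays below are random placeholders with invented dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Placeholder per-frame AAM parameters for one sentence:
# T frames, m shape parameters and l appearance parameters.
T, m, l = 400, 8, 20
p = rng.normal(scale=2.0, size=(T, m))        # shape parameters per frame
lam = rng.normal(scale=5.0, size=(T, l))      # appearance parameters per frame

# Equation (4): ratio of total appearance variance to total shape variance.
W = lam.var(axis=0).sum() / p.var(axis=0).sum()

# Equation (3): weighted shape parameters stacked with appearance parameters.
b = np.hstack([W * p, lam])                   # one combined vector per frame

# Third PCA over the combined vectors gives the final visual speech features.
b0 = b.mean(axis=0)
_, _, Vt = np.linalg.svd(b - b0, full_matrices=False)
keep = 10                                     # assumed number of retained modes
features = (b - b0) @ Vt[:keep].T
print(features.shape)                         # (400, 10)
```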

3.3. Speech segmentation

To produce the gestures we define to be visemes, the visual speech is first segmented into units. Typically this is done using the phonetic boundaries derived from the acoustic speech [12, 21], but here we use a data-driven approach whereby we locate segment boundaries using only the visual signal. During speech, articulators do not move at a constant rate: they tend to accelerate towards articulatory targets and decelerate as they approach or realise those targets. Consequently, we assume that a salient lip pose is one in which the lips are at their most still, and that a gesture is the transition from one salient lip pose to the next. This involves calculating the acceleration coefficients from the AAM features and extracting the frames where the sign changes from negative to positive. Using this method we extract 8421 gestures from the LIPS2008 corpus. This value falls between the number of phones articulated (11850) and the number of syllables produced (approximately 4500).
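The segmentation rule just described can be sketched on a synthetic one-dimensional trajectory. The paper does not state exactly how the multi-dimensional acceleration is reduced to a single sign, so the version below, which tests the sign change of the second difference of one feature dimension, is only one plausible reading; the signal, noise level and smoothing window are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical 1-D feature trajectory for one utterance (a single AAM feature
# dimension); in practice this would come from the smoothed combined features.
t = np.arange(500) / 50.0                              # 10 s at 50 frames/s
x = np.sin(2 * np.pi * 1.3 * t) + 0.02 * rng.normal(size=t.size)
x = np.convolve(x, np.ones(9) / 9, mode="same")        # simple smoothing stand-in

accel = np.diff(x, n=2)                                # second difference (acceleration)

# Salient poses: frames where the acceleration changes sign from negative to
# positive (approximate frame indices).
salient = np.where((accel[:-1] < 0) & (accel[1:] >= 0))[0] + 1

# A gesture is the stretch of the trajectory between consecutive salient frames.
gestures = [x[a:b + 1] for a, b in zip(salient[:-1], salient[1:])]
print(len(gestures), [len(g) for g in gestures[:5]])
```

Consecutive salient frames delimit the gestures that are subsequently compared with DTW.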
3.4. Clustering

Pair-wise distances between gestures are obtained by measuring the cost associated with performing a dynamic time warp (DTW). Dynamic time warping is a method for measuring the similarity between two time series that may vary in length or speed: a non-linear warp along the time axis is applied to one of the sequences to align it to the other, and the warp is optimised to minimise a cost function. The reader is referred to [25] for a more in-depth discussion of DTW; a simplified stand-in for this step is sketched at the end of this subsection.

We use the clustering toolbox CLUTO [26] (version 2.1.2) to cluster the gesture space. Empirically we find that the graph-based clustering algorithm results in the best clustering, possibly because it is able to model non-spherical clusters. To determine the required number of clusters, k, we measure the silhouette coefficient (SC) [27], Dunn's Index (DI) [28] and the Davies-Bouldin Index (DBI) [29] at n clusters, for a range of values of n. We find that k = 58 is optimal: this is where the SC and DI are maximised and the DBI is minimised.
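The sketch below is a simplified stand-in for this step, not the authors' pipeline: a basic dynamic-programming DTW cost with a Euclidean local distance, average-linkage hierarchical clustering over the precomputed distance matrix in place of CLUTO's graph-based algorithm, and k chosen by maximising the silhouette coefficient alone (the paper also scans Dunn's and Davies-Bouldin indices). The gesture sequences are random placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score

def dtw_cost(a, b):
    """Minimum alignment cost between feature sequences a (Ta, d) and b (Tb, d)."""
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            D[i, j] = local[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)]

def pick_k(dist, candidates):
    """Cluster a precomputed distance matrix and return the k with the best
    average silhouette coefficient (stand-in for the SC/DI/DBI scan)."""
    Z = linkage(squareform(dist, checks=False), method="average")
    scores = {}
    for k in candidates:
        labels = fcluster(Z, t=k, criterion="maxclust")
        if len(set(labels)) > 1:
            scores[k] = silhouette_score(dist, labels, metric="precomputed")
    best = max(scores, key=scores.get)
    return best, scores

# Random placeholder gestures: variable-length sequences of 10-D feature vectors.
rng = np.random.default_rng(3)
gestures = [rng.normal(size=(rng.integers(5, 15), 10)) + i % 3 for i in range(30)]

n = len(gestures)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw_cost(gestures[i], gestures[j])

best_k, scores = pick_k(dist, range(2, 9))
print(best_k, round(scores[best_k], 2))
```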

3.5. Results

Figure 3A shows the distance matrix produced by finding the DTW cost between each pair of gestures and ordering the samples by their cluster ID. In this image, the colours range from blue to red, representing the smallest to largest values respectively. The cluster boundaries are highlighted with black boxes along the leading diagonal, and a perfect clustering would produce a series of blue boxes down the diagonal on a red background. For comparison, Figure 3B presents a visualisation of the distance matrix produced by finding the cost of performing DTW between each pair of visemes, i.e. visual phonemes. To produce this image, the AAM features were segmented via the acoustic phone boundaries, and labels were then assigned to each segment based on the phoneme-to-viseme mapping taken from [30]. In this image, the samples are ordered by viseme group; for example, all segments with phoneme labels /p/, /b/ and /m/ are arranged sequentially in the distance matrix. Again, the group boundaries are outlined with black boxes.

Figure 3: Distance matrices obtained using the DTW cost between each pair of visemes. The visemes were found using: (A) the visual gestural approach described in this paper, and (B) the acoustic phone boundaries and the phoneme-to-viseme mapping from [30] (a traditional approach). Each viseme/gesture is delineated by a black square along the leading diagonal. Red regions correspond to high error, whereas blue regions correspond to low error. Ideally, for both images, the areas bound by the black squares would be more blue (small intra-viseme distance) whilst the other areas would appear more red (higher inter-viseme distance).

We can see that the distance matrix for the traditional visemes appears to lack structure, as the distances between grouped items appear to be no smaller than the distances between ungrouped items. The one exception is the group in the lower right-hand corner of the image, which represents silence: as the speaker was asked to maintain a neutral expression before and after each utterance, the examples of silence are almost unaffected by coarticulation. It is apparent that the distance matrix in Figure 3A shows more structure than that in Figure 3B. To quantify this we measure the average silhouette coefficient for each cluster [27]. The silhouette coefficient is a measure of cluster separation and cohesion that falls between -1 and +1, where a value of +1 denotes perfectly clustered data. From the data we record average silhouette values of -0.32 when using the typical viseme approach and 0.11 for our gestural approach. The latter approach attained both higher minimum and maximum scores, and the difference between the approaches is statistically significant (p < 0.01).

4. Discussion

In this paper we have presented a review of the methods that are traditionally used for viseme classification and highlighted some of the many problems associated with them. We noted that when using subjective methods there is little explicit agreement as to which phonemes are grouped to form which visemes, or how many visemes there should be. However, most have agreed that the relationship between phonemes and visemes is many-to-one, and all studies have ignored the matter of audio-visual asynchrony: the visual labels are usually assumed to align with the underlying acoustic (phone) boundaries.

In Section 3 we introduced our proposed method of viseme classification: a data-driven approach that makes no prior assumption regarding the segmentation of speech or the phonetic labels associated with each segment. We described a novel method of segmenting the speech according to the acceleration of the lips during articulation, and we clustered the segments based on the cost of performing DTW between each pair of gestures. We have found that this approach produces a significantly more structured model of visual speech production than if we assume a many-to-one phoneme-to-viseme mapping.

5. Further work

We are applying machine learning algorithms first to discover whether we can extract a set of gestures that describe the speech space of a single speaker, and then expanding to multiple speakers in the hope that we can produce a generic set of parameterised visemes. This will then be used for comparison against the traditional approaches to visual speech representation employed both in our visual speech synthesisers (talking heads) and our automatic lip-reading systems.

6. Acknowledgements

The authors gratefully acknowledge EPSRC (EP/E028047/1) for funding.

7. References

[1] C. G. Fisher, "Confusions among visually perceived consonants," Journal of Speech, Language, and Hearing Research, vol. 11, December.
[2] T. Chen and R. R. Rao, "Audio-visual integration in multimodal communication," Proceedings of the IEEE, 1998.
[3] E. T. Auer and L. E. Bernstein, "Speechreading and the structure of the lexicon: Computationally modeling the effects of reduced phonetic distinctiveness on lexical uniqueness," Journal of the Acoustical Society of America, vol. 102, Dec.
[4] S. P. Eberhardt, L. E. Bernstein, and M. H. Goldstein, "Speechreading sentences with single-channel vibrotactile presentation of voice fundamental frequency," Journal of the Acoustical Society of America, vol. 88, no. 3.
[5] T. Ezzat and T. Poggio, "Visual speech synthesis by morphing visemes," International Journal of Computer Vision, vol. 38, no. 1.
[6] S. A. Lesner and P. B. Kricos, "Visual vowel and diphthong perception across speakers," Journal of the Academy of Rehabilitative Audiology, vol. 14, 1981.

[7] A. A. Montgomery and P. L. Jackson, "Physical characteristics of the lips underlying vowel lipreading performance," Journal of the Acoustical Society of America, vol. 73, no. 6.
[8] J. Jiang, A. Alwan, L. E. Bernstein, E. T. Auer, Jr., and P. A. Keating, "Similarity structure in perceptual and physical measures for visual consonants across talkers," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, 2002.
[9] J. S. Perkell and M. L. Matthies, "Temporal measures of anticipatory labial coarticulation for the vowel /u/: Within- and cross-subject variability," Journal of the Acoustical Society of America, vol. 91, no. 5.
[10] A. Turkmani, "Visual analysis of viseme dynamics," Ph.D. dissertation, University of Surrey.
[11] R. D. Kent and F. D. Minifie, "Coarticulation in recent speech production models," Journal of Phonetics, vol. 5, no. 2, April.
[12] A. J. Goldschen, O. N. Garcia, and E. Petajan, "Continuous optical automatic speech recognition by lipreading," in Proceedings of the 28th Asilomar Conference on Signals, Systems, and Computers, 1994.
[13] C. Ding and X. He, "Cluster merging and splitting in hierarchical clustering algorithms," in Proceedings of the Second IEEE International Conference on Data Mining, 2002.
[14] J. Luettin, "Visual speech and speaker recognition," Ph.D. dissertation, University of Sheffield.
[15] E. Owens and B. Blazek, "Visemes observed by hearing-impaired and normal-hearing adult viewers," Journal of Speech and Hearing Research, vol. 28.
[16] A. P. Benguerel and M. K. Pichora-Fuller, "Coarticulation effects in lipreading," Journal of Speech and Hearing Research, vol. 25.
[17] P. L. Jackson, "The theoretical minimal unit for visual speech perception: Visemes and coarticulation," Volta Review, vol. 90, no. 5, September.
[18] S. L. Mattys, L. E. Bernstein, and E. T. Auer, "Stimulus-based lexical distinctiveness as a general word-recognition mechanism," Perception and Psychophysics, vol. 64, no. 4, 2002.
[19] M. M. Cohen and D. W. Massaro, "Modeling coarticulation in synthetic visual speech," in Models and Techniques in Computer Animation. Springer-Verlag, 1993.
[20] A. Löfqvist, "Speech as audible gestures," Kluwer Academic Publishers.
[21] J. Melenchón, J. Simó, G. Cobo, and E. Martínez, "Objective viseme extraction and audiovisual uncertainty: estimation limits between auditory and visual modes," in International Conference on Auditory-Visual Speech Processing.
[22] B. Theobald, S. Fagel, G. Bailly, and F. Elisei, "LIPS2008: Visual speech synthesis challenge," in Interspeech, 2008.
[23] T. Cootes, G. Edwards, and C. Taylor, "Active appearance models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, June.
[24] I. Matthews and S. Baker, "Active appearance models revisited," International Journal of Computer Vision, vol. 60, no. 2, November.
[25] K. Wang and T. Gasser, "Alignment of curves by dynamic time warping," The Annals of Statistics, vol. 25, no. 3, June.
[26] G. Karypis, "CLUTO - a clustering toolkit," University of Minnesota, Department of Computer Science, Minneapolis, Tech. Rep., April.
[27] P. J. Rousseeuw, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20.
[28] J. C. Dunn, "Well-separated clusters and optimal fuzzy partitions," Cybernetics and Systems, vol. 4.
[29] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1.
[30] B. Walden, R. Prosek, and A. Montgomery, "Effects of training on the visual recognition of consonants," Journal of Speech and Hearing Research, vol. 20.
