Phonemes Interpolation
Fawaz Y. Annaz *1 and Mohammad H. Sadaghiani *2

*1 Institut Teknologi Brunei, Electrical & Electronic Engineering Department, Faculty of Engineering, Jalan Tungku Link, Gadong, BE 1410, Bandar Seri Begawan, Brunei Darussalam
*2 University of Nottingham, Malaysia Campus, Electrical & Electronic Engineering Department, Jalan Broga, Semenyih, Selangor, Malaysia

Keywords: Lagrange Interpolation, Barycentric Lagrange Interpolation, Phonemes, Viseme.

Abstract

Learning a language starts immediately after birth, in the form of repeating basic sounds and gestures generated by adults (usually the parents). While teaching is achieved by initial pronunciation through exaggerated gestures and sounds, learning is accompanied by memorizing, comprehension and, eventually, reproduction of such gestures and sounds. In fact, parents exaggerate speech by breaking it up into simpler sound and gesture levels that babies can more readily accept. The aim of this paper is to demonstrate methods in which fundamental articulated phonemes are represented by signatures that reflect dynamic mouth-movement contours. The paper starts by explaining the basic (yet limited) Lagrange Interpolation Method to produce fundamental signatures of lip movements by tracking upper-lip and corner-lip feature points. The paper then proposes a method that produces more compact polynomials using the Barycentric Lagrange Interpolation Method, which overcomes the limitations of the earlier method.

1. Introduction

This paper addresses fundamental concepts in human audio-visual communication, and focusses on the mouth shape during speech. This field has received interest from various groups, ranging from English Language Teachers (ELT) in a classroom to users interacting with an animated face or an electromechanical robot-head interface. Thus, the work will be of interest to groups in robotics, the movie industry, biometrics, real-time translation and future machine interaction. The difficulty in this field is to understand and combine the scientific and artistic significance of speech or communication, and the way it should be delivered and perceived between humans and/or machines. Facial animation is one concept that emerges from this science; it was pioneered by Parke [1] in 1974, and since then significant time and effort have been devoted to perfecting the fusion of science and art [2]-[4]. The definition of a generic framework to map speech components onto animated visual pronunciation models is another important concept, going back as far as 1968 and known today as the concept of phonemes. This, in turn, led to the birth of the viseme approach [5]: the study of the visual observations of phonemes, which examines mouth-contour geometry to classify and analyse the statistical and physical characteristics of consonants and vowels. The bridging of audio phonemes and their corresponding visual visemes has thus become the basis for research in speech visualization and perception. It is also interesting to consider the effect coarticulation has on speech, and the effect simple concatenation has on the visual modality of the previous and next phonemes. This is clearly evident in the visual domain; thus, studies of speech-driven systems [6], [7], as well as speech- and text-driven systems [8]-[12], emerged to act as inputs and to generate animated lip movements.
In machine-learning speech animation, speech and the corresponding visual parameters are used to train Hidden Markov Models (HMMs) to create constraints and trajectory functions that determine speech and visual features. The accuracy of the approximated visual trajectories depends on the chosen training set, which directly affects the quality of the synthesized results. In [13], an HMM was used to animate and synchronize a 2D face model by associating speech with AAM (Active Appearance Model) feature parameters. In this paper, each viseme is interpreted as a series of frames that describe a phoneme over a time interval, and is represented by interpolating trajectory paths from control points. Thus, each viseme is expressed in a mathematical form, which may be further simplified by considering ratios or other geometric transformation rules. The authors in [14] proposed 2D trajectories of mouth-cavity area versus aspect ratio to describe Japanese words; however, they did not suggest path formulation over discrete frames or define mathematical signatures of spoken words. Here, the Lagrange interpolation [15] is proposed to construct polynomials by interpolating the sets of control points resulting from lip deformation. This paper will initially describe the suggested approach in Section 2, followed by an introduction to the signature concept and examples of suggested signatures in Sections 3, 4 and 5; a short sketch of the frame-based viseme representation follows below.
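To make the frame-based viseme representation concrete, the following minimal Python sketch stores a viseme as per-frame upper-lip and corner-lip amplitudes and resamples it to a fixed 30-frame length. It is illustrative only: the helper name, the track values and the linear resampling are assumptions for this example, not the authors' code or data.

```python
import numpy as np

def resample_to_fixed_length(samples, n_frames=30):
    """Linearly resample a variable-length feature track to n_frames values."""
    samples = np.asarray(samples, dtype=float)
    src = np.linspace(0.0, 1.0, len(samples))
    dst = np.linspace(0.0, 1.0, n_frames)
    return np.interp(dst, src, samples)

# A fictitious short track: upper-lip and corner-lip amplitude per frame.
upper = [21, 24, 30, 45, 60, 45, 30, 24, 21]
corner = [60, 62, 70, 85, 96, 85, 70, 62, 60]

viseme = np.stack([resample_to_fixed_length(upper),
                   resample_to_fixed_length(corner)])
print(viseme.shape)   # (2, 30): two feature-point tracks over 30 frames
```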
2. The Approach

The main aim of this paper is to determine mathematical signatures for the indexed paths of feature points that correspond to a sequence of mouth movements. In this introductory paper, isolated visual units of speech (phonemes) are examined to explain our approach. Each pronounced phoneme is represented by a set of progressive 2D frames, which some authors refer to as visemes. In this approach, the mouth height and width (per frame) make up unique sets of feature points per spoken phoneme, thus resulting in unique signature sets. It is also proposed that a fixed 30 frames represent the various words, regardless of their length. Figure 1 shows an example of a fictitious word with five framed feature points (visemes), represented by the vector [F_i^U, F_i^C], where the lip is approximated by ellipses. Feature points in each viseme can simply be the pixel coordinates per frame that make up ellipse equations, which can be stored and recovered when necessary.

Figure 1. A Fictitious Word with Five Framed Feature Points (Visemes)

Vowels have a longer visual duration than consonants, and they connect consonants in word structures. Thus, in a speech recognition system, vowels play a significant role in recognition [16]. It is therefore important to derive mathematical expressions for consonant-vowel phonemes, such as those shown in Table 1 of the International Phonetic Alphabet in American English.

Table 1. The IPA and ARPABET Vowels Notations

IPA   ARPABET   Example      IPA   ARPABET   Example
i     IY        beet         ʌ     AH        but
I     IH        bit          ɔ     AO        bought
æ     AE        bat          U     UH        foot
ε     EH        bet          u     UW        boot
a     AA        hot          o     OW        show

3. The Lagrange Interpolation

The Lagrange interpolation is a popular choice for deriving mathematical functions of the feature-point path vectors that describe spoken phonemes. Here, the Lagrange interpolation reconstructs a continuous polynomial L(x) that spans uniformly over an interval [a, b], from a set of samples x_i ∈ R, i ∈ N:

L(x) = \sum_{i=0}^{N-1} f(x_i)\, l_i(x)    (1)

l_i(x) = \frac{\prod_{k=0,\, k \neq i}^{N-1} (x - x_k)}{\prod_{k=0,\, k \neq i}^{N-1} (x_i - x_k)}    (2)

\sum_{i=0}^{N-1} l_i(x) = 1    (3)

where l_i(x) are the basis functions corresponding to the nodes x_i.

In this method, the number of frames increases the degree of the Lagrange polynomial; however, this does not imply an increase in accuracy. For example, a set of 15 samples extracted from pronouncing the vowel /UW/ gives a 14th-degree interpolating polynomial of the form

L(f) = c_{14} f^{14} + c_{13} f^{13} + \cdots + c_1 f + 21    (4)

with coefficients c_i determined by the samples (the leading coefficients are of the order of 10^{-8} to 10^{-5}).

Figure 2. The Vowel Viseme /UW/ Upper and Corner Feature-Points Lagrange Interpolation

The phoneme expression exhibits very high or low amplitudes on the interval boundaries, reducing the accuracy of the interpolation. The method is therefore limited, and becomes impractical when dealing with a large number of samples, inducing very high errors between the function and its interpolating curve. This can be demonstrated by simply substituting f = 0.3 into (4) to obtain
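The construction in (1)-(3) translates directly into code. The following minimal Python sketch evaluates the Lagrange basis and interpolant on 15 equidistant frame indices and checks the partition-of-unity property (3); the sample amplitudes are illustrative placeholders, not the measured /UW/ data behind (4).

```python
import numpy as np

def lagrange_basis(x, nodes, i):
    """l_i(x) from (2): product over k != i of (x - x_k) / (x_i - x_k)."""
    others = np.delete(nodes, i)
    return np.prod((x - others) / (nodes[i] - others))

def lagrange(x, nodes, values):
    """L(x) from (1): sum over i of f(x_i) * l_i(x)."""
    return sum(v * lagrange_basis(x, nodes, i) for i, v in enumerate(values))

# 15 equidistant frame indices; the amplitudes are illustrative placeholders,
# NOT the authors' measured /UW/ feature-point data.
nodes = np.arange(15.0)
values = np.array([21, 24, 30, 45, 80, 120, 150, 170, 185, 190,
                   180, 150, 90, 40, 21], dtype=float)

# Partition of unity (3): the basis functions sum to 1 at any x.
assert abs(sum(lagrange_basis(0.3, nodes, i) for i in range(15)) - 1.0) < 1e-9

# Near the interval boundary the degree-14 fit can swing far from the data,
# which is the Runge effect discussed in the text.
print(lagrange(0.3, nodes, values))
```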
L(f) ≈ 43, an evaluation of a corner feature point whose actual amplitude is approximately 190. This high fluctuation in amplitude at the boundaries, described as the Runge effect [17], results in an error between the function and its interpolating curve. The Lagrange interpolations of the vowel /UW/ (for both the upper and corner feature points) are plotted in Figures 2(a) and 2(b). The Runge phenomenon is clearly visible on the first and last pairs of nodes, appearing in the form of oscillation at the interval boundaries. To exclude these saturated regions of the Lagrange polynomial, a more elegant solution is proposed through the Barycentric Lagrange Polynomial (BLP) interpolation, discussed next.

4. The Barycentric Lagrange Interpolation

The boundary oscillation (Runge phenomenon) was treated by the authors in [18] by rearranging the sample-node positions x_i and modifying the intervals to formulate the Barycentric Lagrange interpolation:

L_B(x) = \frac{\sum_{i=0}^{N-1} f(x_i)\, \frac{w_i}{x - x_i}}{\sum_{i=0}^{N-1} \frac{w_i}{x - x_i}}    (5)

This modification tackles the problem of destructive oscillations on the interval boundaries by a transformation to another domain. The transformation uses Chebyshev points of the second kind, x_i = \cos\!\left(\frac{i\pi}{N-1}\right), spanned on the interval [-1, 1]. The weighting function w_i in (5) can be simplified as [19]:

w_i = (-1)^i \delta_i, \quad \delta_i = \begin{cases} 1/2, & i = 0 \text{ or } i = N-1 \\ 1, & \text{otherwise} \end{cases}    (6)

Equation (5) thus defines an interpolation procedure combining the Lagrange method with Chebyshev nodes over the interval [-1, 1]. Applying the same set of samples used in (4) to the Barycentric Lagrange interpolation method leads to the rational expression

L_B(f) = \frac{P(f)}{Q(f)}    (7)

where P(f) and Q(f) are the polynomials produced by the numerator and denominator sums of (5) for the same 15 samples.

In comparison with the expression in (4), substituting a value of f close to the boundaries into (7) results in an amplitude that is very close to those of the neighbouring samples. For example, letting f = 0.3 in (7) results in L_B(0.3) = 22.25, which lies between the amplitudes of the first and second samples (F_0^U = 21, F_1^U = 24). This definition changes the previous interpolation and allocates a polynomial with proper amplitudes to the same sample set in Figure 2. The final, desirable representation of the phoneme /UW/ curve over a uniformly spanned interval [a, b] is shown in Figure 3, which captures all samples in a single function without the high boundary amplitudes.

Figure 3. The Barycentric Lagrange Interpolations over the Uniformly Spanned Intervals

5. Visual Signatures

Building on the above approach, and aiming to further reduce the number of mathematical expressions so as to yield a more compact signature expression, the ratios of the upper and lower feature points were considered. This is shown in Figure 4 and will be referred to as the Rational Signature.

Figure 4. The BLP for a Set of Feature-Points
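Equations (5) and (6) likewise reduce to a few lines. The Python sketch below (with the same placeholder amplitudes, and assuming the 15 frame samples have been re-indexed onto the Chebyshev nodes in [-1, 1]) evaluates the barycentric form; near the boundary the result stays close to the neighbouring samples, in contrast to the classical Lagrange fit.

```python
import numpy as np

def chebyshev_nodes(N):
    """Chebyshev points of the second kind: x_i = cos(i*pi/(N-1)) on [-1, 1]."""
    return np.cos(np.arange(N) * np.pi / (N - 1))

def barycentric_weights(N):
    """Simplified weights (6): w_i = (-1)^i, halved at the two endpoints."""
    w = (-1.0) ** np.arange(N)
    w[0] *= 0.5
    w[-1] *= 0.5
    return w

def barycentric(x, nodes, values, w):
    """L_B(x) from (5): ratio of the weighted sums over the nodes."""
    d = x - nodes
    hit = np.isclose(d, 0.0)
    if hit.any():                 # x coincides with a node: return that sample
        return values[hit][0]
    q = w / d
    return np.dot(values, q) / np.sum(q)

N = 15
nodes = chebyshev_nodes(N)        # run from +1 down to -1
w = barycentric_weights(N)
# Placeholder amplitudes again, assumed already re-indexed onto the nodes.
values = np.array([21, 24, 30, 45, 80, 120, 150, 170, 185, 190,
                   180, 150, 90, 40, 21], dtype=float)

# Close to the boundary node the interpolant stays near its neighbours,
# suppressing the Runge oscillation of the classical form.
print(barycentric(0.97, nodes, values, w))
```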
6. Conclusions

The experiment of formulating visemes was conducted on feature points extracted from video files filmed at a rate of 30 frames s⁻¹, for a set of three speakers pronouncing the phonemes /IY/, /IH/, /AE/ and /AH/. Clearly, amplitudes vary according to the speaker, and duration (the number of frames) varies according to the pronounced phoneme. However, the main aim here is to show that it is feasible to build patterns or signatures for the various phonemes. The presented work is still under development, and it is highly likely that further rules must be added to clearly distinguish and identify phonemes, words and portions of speech.

Figure 5. The BLP of Feature-Point Ratios for the Phonemes /IY/, /IH/, /AE/ and /AH/

The main aim of this paper was to derive mathematical expressions for some of the basic phonemes. Two methods were considered, the Lagrange Interpolation and the Barycentric Lagrange Interpolation, the latter giving a more accurate representation of a pronunciation envelope. In this analysis, the top-centre and the corner of the lip were selected as the feature points of the derived signatures. The paper finally presented more compact expressions by considering feature-point ratios. As mentioned earlier, the presented work is still under development, and the approach needs further refinement to clearly distinguish and identify phonemes, words and portions of speech.

References

[1]. F. I. Parke, A Parametric Model for Human Faces, Utah: The University of Utah, 1974.
[2]. F. I. Parke, Computer Generated Animation of Faces, Utah: The University of Utah, 1972.
[3]. N. Magnenat-Thalmann and D. Thalmann, "The Direction of Synthetic Actors in the Film 'Rendez-vous à Montréal'," IEEE Computer Graphics and Applications, vol. 7, no. 12, pp. 9-19, 1987.
[4]. L. Xie and Z. Q. Liu, "Realistic Mouth-Synching for Speech-Driven Talking Face Using Articulatory Modelling," IEEE Transactions on Multimedia, vol. 9, no. 3, 2007.
[5]. C. G. Fisher, "Confusions Among Visually Perceived Consonants," Journal of Speech and Hearing Research, vol. 11, no. 4, 1968.
[6]. G. Ananthakrishnan and O. Engwall, "Important Regions in the Articulator Trajectory," in International Seminar on Speech Production, Strasbourg, 2008.
[7]. D. Jiang, I. Ravyse, H. Sahli and W. Verhelst, "Speech Driven Realistic Mouth Animation Based on Multimodal Unit Selection," Journal on Multimodal User Interfaces, vol. 2, 2008.
[8]. R. Gutierrez-Osuna, P. K. Kakumanu, A. Esposito, O. N. Garcia, A. Bojorquez, J. L. Castillo and I. Rudomin, "Speech-Driven Facial Animation with Realistic Dynamics," IEEE Transactions on Multimedia, vol. 7, no. 1, 2005.
[9]. E. Cosatto and H. Graf, "Sample-Based Synthesis of Photorealistic Talking Heads," in Computer Animation, 1998.
[10]. Z. Deng, U. Neumann, J. P. Lewis, T. Y. Kim, M. Bulut and S. Narayanan, "Expressive Facial Animation Synthesis by Learning Speech Coarticulation and Expression Spaces," IEEE Transactions on Visualization and Computer Graphics, vol. 12, no. 6, pp. 1-12, 2006.
[11]. Y. Cao, P. Faloutsos, E. Kohler and F. Pighin, "Real-Time Speech Motion Synthesis from Recorded Motions," in ACM SIGGRAPH/Eurographics Symposium on Computer Animation, New York, 2004.
[12]. S. Morishima, K. Aizawa and H. Harashima, "An Intelligent Facial Image Coding Driven by Speech and Phoneme," in IEEE ICASSP, Glasgow, 1989.
[13]. G. Englebienne, Animating Faces from Speech, PhD thesis, The University of Manchester.
[14]. T. Saitoh and R. Konishi, "Word Recognition Based on Two-Dimensional Lip Motion Trajectory," in IEEE International Symposium on Intelligent Signal Processing and Communication Systems.
[15]. J. L. Lagrange, Leçons élémentaires sur les mathématiques, données à l'École Normale en 1795, Oeuvres VII, Paris: Gauthier-Villars, 1877.
[16]. L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.
[17]. C. Runge, "Über empirische Funktionen und die Interpolation zwischen äquidistanten Ordinaten," Zeitschrift für Mathematik und Physik, vol. 46, 1901.
[18]. H. E. Salzer, "Lagrangian Interpolation at the Chebyshev Points x_{n,ν} = cos(νπ/n), ν = 0(1)n; Some Unnoted Advantages," The Computer Journal, 1972.
[19]. J. P. Berrut and L. N. Trefethen, "Barycentric Lagrange Interpolation," SIAM Review, vol. 46, no. 3, pp. 501-517, 2004.