Expression control using synthetic speech.


Brian Wyvill and David R. Hill
Department of Computer Science, University of Calgary, University Drive N.W., Calgary, Alberta, Canada, T2N 1N4

Abstract

This tutorial paper presents a practical guide to animating facial expressions synchronised to a rule-based speech synthesiser. A description of speech synthesis by rules is given, together with how the set of parameters that drives both the speech synthesis and the graphics is derived. An example animation is described, along with the outstanding problems.

Key words: Computer Graphics, Animation, Speech Synthesis, Face-Animation.

© ACM, 1989. This is the authors' version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published as Course #22 of the Tutorial Section of ACM SIGGRAPH 89, Boston, Massachusetts, 31 July - 4 August 1989. DOI unknown.

Note (drh 2008): Appendix A was added, after publication of these tutorial notes by the ACM, to flesh out some details of the parameter synthesis, and to provide a more complete acoustic parameter table (the original garbled table headings have been corrected in the original paper text that follows, but the data was still incomplete, being intended for discussion in a tutorial). Fairly soon after the animation work using the formant synthesiser was finished, a completely new articulatory speech synthesis system was developed by one of the authors and his colleagues. This system uses an acoustic tube model of the human vocal tract, with associated new posture databases cast in terms of tube radii and excitation, new rules, and so on. Originally a technology spin-off company product, the system was the first complete articulatory real-time text-to-speech synthesis system in the world and was described in [Hill 95]. All the software is now available from the GNU project gnuspeech under a General Public Licence. Originally developed on the NeXT computer, much of the system has since been ported to the Macintosh under OS X, and work on a GNU/Linux version running under GNUstep is well under way.

Reference

[Hill 95] David R. Hill, Leonard Manzara, Craig-Richard Schock. Real-time articulatory speech-synthesis-by-rule. Proc. AVIOS 95, the 14th Annual International Voice Technologies Applications Conference of the American Voice I/O Society, San Jose, September 1995, AVIOS: San Jose, 27-44.

1 Motivation

In traditional hand animation, synchronisation between graphics and speech has been achieved through a tedious process of analysing a speech sound track and drawing corresponding mouth positions (and expressions) at key frames. To achieve a more realistic correspondence, a live actor may be filmed to obtain the correct mouth positions. This method produces good results, but it must be repeated for each new speech, is time consuming, and requires a great deal of specialised skill on the part of the animator. A common approach to computer animation uses a similar analysis to derive key sounds, from which parameters to drive a face model can be found (see [Parke 74]). Such an approach to animation is more flexible than the traditional hand method, since the parameters to drive such a face model correspond directly to the key measurements available from the photographs, rather than requiring the animator to design each expression as needed. However, the process is not automatic, requiring tedious manual procedures for recording and measuring the actor.

In our research we were interested in finding a fully automatic way of producing an animated face to match speech. Given a recording of an actor speaking the appropriate script, it might seem possible to design a machine procedure to recognise the individual sounds and to use acoustic-phonetic and articulatory rules to derive sets of parameters to drive the Parke face model. However, this would require a more sophisticated speech recognition program than is currently available. The simplest way for a computer animator to interact with such a system would be to type in a line of text and have the synthesised speech and expressions automatically generated. This was the approach we decided to try. From the initial input, given the still incomplete state of knowledge concerning speech synthesis by rules, we wanted to allow some audio editing to allow improvements in the speech quality, with the corresponding changes to the expressions being made automatically. Synthetic speech by rules was the most appropriate choice since it can be generated from keyboard input; it is a very general approach which lends itself to the purely automatic generation of speech animation. The major drawback is that speech synthesised in this manner is far from perfect.

2 Background

2.1 The Basis for Synthesis by Rules

Acoustic-phonetic research into the composition of spoken English during the 1950s and 1960s led to the determination of the basic acoustic cues associated with forty or so sound classes. This early research was conducted at the Haskins Laboratory in the US and elsewhere worldwide. The sound classes are by no means homogeneous, and we still do not have complete knowledge of all the variations and their causes. However, broadly speaking, each sound class can be identified with a configuration of the vocal organs in making sounds in the class. We shall refer to this as a speech posture. Thus, if the jaw is rotated a certain amount, and the lips held in a particular position, with the tongue hump moved high or low, and back or forward, a vowel-like noise can be produced that is characterised by the energy distribution in the frequency domain. This distribution contains peaks, corresponding to the resonances of the tube-like vocal tract, called formants.
As the speaker articulates different sounds (the speech posture is thus varying dynamically and continuously), the peaks will move up and down the frequency scale, and the sound emitted will change. Figure 1 shows the parts of the articulatory system involved with speech production.

Figure 1: The Human Vocal Apparatus

2.2 Vowel and Consonant Sounds

The movements are relatively slow during vowel and vowel-like articulations, but are often much faster in consonant articulations, especially for plosive sounds like /b, d, g, p, t, k/ (these are more commonly called the stop consonants). The nasal sounds /m, n/ and the sound at the end of "running" /ŋ/ are articulated very much like the plosive sounds, and not only involve quite rapid shifts in formant frequencies but also a sudden change in general spectral quality, because the nasal passage is very quickly connected and disconnected for nasal articulation by the valve towards the back of the mouth that is formed by the soft palate (the "velum"), hence the phrase nasal sounds. Various hiss-like noises are associated with many consonants, because consonants are distinguished from vowels chiefly by a higher degree of constriction in the vocal tract (completely stopped in the case of the stop consonants). This means that either during, or just after, the articulation of a consonant, air from the lungs is rushing through a relatively narrow opening, in turbulent flow, generating random noise (sounds like /s/ or /f/). Whispered speech also involves turbulent airflow noise as the sound medium but, since the turbulence occurs early in the vocal flow, it is shaped by the resonances and assumes many of the qualities of ordinarily spoken sounds.

2.3 Voiced and Voiceless

When a sound is articulated, the vocal folds situated in the larynx may be wide open and relaxed, or held under tension. In the second case they will vibrate, imposing a periodic flow pattern on the rush of air from the lungs (and making a noise much like a "raspberry" blown under similar conditions at the lips). However, the energy in the noise from the vocal folds is redistributed by the resonant properties of the vocal and nasal tracts, so that it doesn't sound like a raspberry by the time it gets out. Sounds in which the vocal folds are vibrating are termed voiced. Other sounds are termed voiceless, although some further qualification is needed.

It is reasonable to say that the word "cat" is made up of the sounds /k æ t/. However, although a sustained /æ/ can be produced, a sustained /k/ or /t/ cannot. Although stop sounds are articulated as speech postures, the cues that allow us to hear them occur as a result of their environment. When the characteristic posture of /t/ is formed, no sound is heard at all: the stop gap, or silence, is only heard as a result of noises either side, especially the formant transitions (see 2.4 below). The sounds /t/ and /d/ differ only in that the vocal folds vibrate during the /d/ posture, but not during the /t/ posture. The /t/ is a voiceless alveolar stop, whereas the /d/ is a voiced alveolar stop, the alveolar ridge being the place within the vocal tract where the point of maximum constriction occurs, known as the place of articulation. The /k/ is a voiceless velar stop.

2.4 Aspiration

When a voiceless stop is articulated in normal speech, the vocal folds do not begin vibrating immediately on release. Thus, after the noise burst associated with release, there is a period when air continues to rush out, producing the same effect as whispered speech for a short time (a little longer than the initial noise burst of release). This whisper noise is called aspiration, and is much stronger in some contexts and situations than others. At this time, the articulators are moving, and, as a result, so are the formants. These relatively rapid movements are called formant transitions and are, as the Haskins Laboratory researchers demonstrated, a powerful cue to the place of articulation. Again, these powerful cues fall mainly outside the time range conventionally associated with the consonant posture articulation (the quasi-steady state, or QSS) itself.

2.5 Synthesis by Rules

The first speech synthesiser that modelled the vocal tract was the so-called Parametric Artificial Talker (PAT), invented by Walter Lawrence of the Signals Research and Development Establishment (SRDE), a government laboratory in Britain, in the 1950s. This device modelled the resonances of the vocal tract (only the lowest three needed to be variable for good modelling), plus the various energy sources (periodic or random) and the spectral characteristics of the noise bursts and aspiration. Other formulations can serve as a basis for synthesising speech (for example, Linear Predictive Coding, LPC), but PAT was not only the first, it is also more readily linked to the real vocal apparatus than most, and the acoustic cue basis is essentially the same for all of them. It has to be, since the illusion of speech will only be produced if the correct perceptually relevant acoustic cues are present in sufficient number.

Speech may be produced from such a synthesiser by analysing real speech to obtain appropriate parameter values, and then using them to drive the synthesiser. This speech is merely a sophisticated form of compressed recording. It is difficult to analyse speech automatically for the parameters needed to drive synthesisers like PAT, but LPC compression and resynthesis is extremely effective, and serves as the basis of many modern voice response systems. It is speech by copying, however: it always requires preknowledge of what will be said and contains all the variability of real speech. More importantly, it is hard to link directly to articulation. A full treatment of speech analysis is given in [Witten 82].

2.6 Speech Postures and the Face

It is possible, given a specification of the postures (i.e. sound classes) in an intended utterance, to generate the parameters needed to drive a synthesiser entirely algorithmically, i.e. by rules, without reference to any real utterance.
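To make the idea of a formant synthesiser more concrete, here is a minimal sketch in the spirit of PAT: an excitation source (a pulse train for voiced sounds, noise for voiceless ones) is passed through a cascade of second-order resonators whose centre frequencies stand in for the formant parameters. This is an illustration only, not the PAT or DEGAS code; the function names, sample rate and formant values are assumptions.

```python
# Minimal cascade formant synthesiser sketch (illustrative only, not the PAT/DEGAS code).
# Voiced excitation: impulse train at f0; voiceless excitation: white noise.
import numpy as np

def resonator(x, freq, bw, fs):
    """Second-order resonator with centre frequency `freq` (Hz) and bandwidth `bw` (Hz)."""
    r = np.exp(-np.pi * bw / fs)
    a1 = 2.0 * r * np.cos(2.0 * np.pi * freq / fs)
    a2 = -r * r
    b = 1.0 - a1 - a2                     # unity gain at DC, adequate for a sketch
    y = np.zeros_like(x)
    y1 = y2 = 0.0                         # filter memory
    for n in range(len(x)):
        y[n] = b * x[n] + a1 * y1 + a2 * y2
        y2, y1 = y1, y[n]
    return y

def synthesise(duration, f0, formants, bandwidths, voiced=True, fs=16000):
    """Generate one steady posture: excitation through a cascade of formant resonators."""
    n = int(duration * fs)
    if voiced:
        x = np.zeros(n)
        x[::int(fs / f0)] = 1.0           # impulse train at the fundamental frequency
    else:
        x = np.random.randn(n) * 0.1      # aspiration / frication noise
    for freq, bw in zip(formants, bandwidths):
        x = resonator(x, freq, bw, fs)
    return x / (np.max(np.abs(x)) + 1e-9)

# e.g. a rough /æ/-like vowel with three formants near 700, 1700 and 2600 Hz
samples = synthesise(0.3, f0=120, formants=[700, 1700, 2600], bandwidths=[90, 110, 170])
```

In a rule-based system the per-posture values fed to such a synthesiser are not measured from a recording but looked up in a table (Table 1 and Appendix A) and interpolated, as described in section 2.6.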

Figure 2: Upper lip control points

This is the basis of our approach. The target values of the parameters for all the postures are stored in a table (see Table 1), and a simple interpolation procedure is written to mimic the course of variation from one target to the next, according to the class of posture involved. Appropriate noise bursts and energy source changes can also be computed. It should be noted that the values in the table are relevant to the Hill speech structure model (see [Hill 78]). Since the sounds and sound changes result directly from movements of the articulators, and some of these are what cause changes in facial expression (e.g. lip opening, jaw rotation, etc.), we felt that our program for speech synthesis by rule could easily be extended by adding a few additional entries for each posture to control the relevant parameters of Parke's face model.

2.7 Face Parameters

The parameters for the facial movements directly related to speech articulation are currently those specified by Fred Parke. They comprise: jaw rotation; mouth width; mouth expression; lip protrusion; /f/ and /v/ lip tuck; upper lip position; and the x, y and z co-ordinates of one of the two mouth corners (assuming symmetry, which is an approximation). The tongue is not represented, nor are other possible body movements associated with speech. The parameters are mapped onto a group of mesh vertices with appropriate scale factors which weight the effect of the parameter. An example of the polygon mesh representing the mouth is illustrated in Figure 2.
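As a rough illustration of the vertex-weighting scheme just described (not Parke's actual model), each parameter can displace a weighted group of mouth-region vertices. The vertex indices, weights and displacement directions below are invented for the sketch.

```python
# Sketch of parameter-driven vertex displacement in the spirit of a Parke-style face model.
# The vertex groups, weights and directions here are hypothetical.
import numpy as np

class FaceMesh:
    def __init__(self, rest_vertices):
        self.rest = np.asarray(rest_vertices, dtype=float)   # (N, 3) rest positions
        # parameter -> list of (vertex index, weight, unit displacement direction)
        self.influences = {
            "jaw_rotation": [(12, 1.0, np.array([0.0, -1.0, 0.0])),
                             (13, 0.6, np.array([0.0, -1.0, 0.0]))],
            "mouth_width":  [(20, 1.0, np.array([1.0, 0.0, 0.0])),
                             (21, 1.0, np.array([-1.0, 0.0, 0.0]))],
        }

    def apply(self, params):
        """Return deformed vertices for a dict of parameter values (one sample per frame)."""
        v = self.rest.copy()
        for name, value in params.items():
            for idx, weight, direction in self.influences.get(name, []):
                v[idx] += value * weight * direction
        return v

mesh = FaceMesh(np.zeros((100, 3)))                       # placeholder rest shape
frame_vertices = mesh.apply({"jaw_rotation": 0.35, "mouth_width": 0.12})
```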

Figure 3: Animated speech to film system overview

3 Interfacing Speech and Animation

The system is shown diagrammatically in Figure 3. Note that in the diagram the final output is to film; modifications for direct video output are discussed below. The user inputs text, which is automatically translated into phonetic symbols defining the utterance(s). The system also reads various other databases. These are: models of spoken rhythm and intonation; tabular data defining the target values of the parameters for various articulatory postures (both for facial expression and acoustic signal); and a set of composition rules that provide appropriate modelling of the natural movement from one target to the next. The composition program is also capable of taking account of special events like bursts of noise, or suppression of voicing.

3.1 Parametric Output

The output of the system comprises sets of 18 values defining the 18 parameters at successive 2 millisecond intervals. The speech parameters are sent directly to the speech synthesiser, which produces synthetic speech output. This is recorded to provide the sound track. The ten face parameters controlling the jaw and lips are described in [Hill 88] and are taken directly from the Parke face model. Table 2 shows the values associated with each posture.

3.2 Face Parameters

The facial parameter data is stored in a file and processed by a converter program. The purpose of this program is to convert the once-per-two-millisecond sampling rate to a once-per-frame-time sampling rate, based on the known number of frames determined from the magnetic film sound track. This conversion is done by linear interpolation of the parameters and resampling (a sketch of this conversion appears at the end of section 3). The conversion factor is determined by relating the number of two millisecond samples to the number of frames recorded in the previous step. This allows for imperfections in the speed control of our equipment. In practice, calculations based on measuring lengths on the original audio tape have proved equivalent and repeatable for the short lengths we have dealt with so far. Production equipment would be run to standards that avoided this problem for arbitrary lengths.

The resampled parameters are fed to a scripted facial rendering system (part of the Graphicsland animation system [Wyvill 86]). The script controls object rotation and viewpoint parameters whilst the expression parameters control variations in the polygon mesh in the vicinity of the mouth, producing lip, mouth and jaw movements in the final images, one per frame. A sequence of frames covering the whole utterance is rendered and stored on disc. Real-time rendering of the face would be possible with this scheme, given a better workstation.

3.3 Output to film

The sound track, preferably recorded at 15 ips on a full-track tape (e.g. using a NAGRA tape recorder), is formatted by splicing to provide a level-setting noise (the loudest vowel sound, as in "gore") for a few feet; a one-frame-time 1000 Hz tone burst for synchronisation; a 23-frame-time silence; and then the actual sound track noises. The sound track is transferred to magnetic film ready for editing. The actual number of frames occupied by the speech is determined for use in dealing with the facial parameters. Getting the 1000 Hz tone burst exactly aligned within a frame is a problem: we made the tone about 1.5 frames in length to allow for displacement in transferring to the magnetic film. The stored images are converted to film, one frame at a time. After processing, the film and magnetic film soundtrack are edited in the normal way to produce material suitable for transfer. The process fixes edit synch between the sound track and picture based on synch marks placed ahead of the utterance material. The edited media are sent for transfer, which composes picture and sound tracks onto standard film.
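The conversion described in section 3.2 amounts to resampling each 2 ms parameter track at the frame rate implied by the measured sound-track length. A minimal sketch, assuming a hypothetical resample_track helper rather than the original converter program:

```python
# Resample a parameter track from one value per 2 ms to one value per film/video frame.
# The conversion factor comes from relating the number of 2 ms samples to the number of
# frames actually measured from the sound track, as described in section 3.2.
import numpy as np

def resample_track(samples_2ms, num_frames):
    """Linearly interpolate a 2 ms-sampled parameter track to `num_frames` values."""
    samples_2ms = np.asarray(samples_2ms, dtype=float)
    src_times = np.arange(len(samples_2ms)) * 0.002                  # seconds
    # spread the frames evenly over the measured duration of the utterance
    frame_times = np.linspace(0.0, src_times[-1], num_frames)
    return np.interp(frame_times, src_times, samples_2ms)

# Example: a 2-second utterance (1000 samples at 2 ms) rendered at 24 frames/s on film.
track = np.sin(np.linspace(0, 3, 1000))           # stand-in for one face parameter
frames = resample_track(track, num_frames=48)     # 2 s * 24 frames/s
```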

3.4 Video Output

Many animators have direct workstation-to-video output devices. Experience has shown that the speed control on our equipment is better than anticipated, so our present procedure is based on the assumption that speed control is adequate for separate audio and video recordings, according to the procedures outlined above, and for straight dubbing to be carried out onto the video, in real time, once the video has been completed. With enough computer power to operate in real time, it would be feasible to record sound and image simultaneously.

4 Sampling Problems

Some form of temporal antialiasing might seem desirable, since the speech parameters are sampled at 2 ms intervals but the facial parameters at only 41.67 ms (film at 24 frames/sec) or 33.33 ms (video at 30 frames/sec). In practice antialiasing does not appear to be needed. Indeed, the wrong kind of antialiasing could have a very negative effect, by suppressing facial movements altogether. Possibly it would be better to motion-blur the image directly, rather than antialiasing the parameter track definitions. However, this is not simple, as the algorithm would have to keep track of individual pixel movements and processing could become very time consuming. The simplest approach seems quite convincing, as may be seen in the demonstration film.

5 A practical example

5.1 The animated speech process

The first step in manufacturing animated speech with our system is to enter text to the speech program, which computes the parameters from the posture target-values table.

Text entered at keyboard: speak to me now bad kangaroo
Input as (phonetic representation): s p ee k t u m i n ah uu b aa d k aa ng g uh r uu

Continuous parameters are generated automatically from discrete target data to drive the face animation and synthesised speech. The speech may be altered by editing the parameters interactively until the desired speech is obtained. This process requires a good degree of skill and experience to achieve human-like speech. At this point all of the parameters are available for editing. Figure 4 shows a typical screen from the speech editor. The vertical lines represent the different parts of the diphones (transitions between two postures). Three of the eight parameters (the three lowest formants) are shown in the diagram. Altering these with the graphical editor alters the position of the formant peaks, which in turn changes the sound made by the synthesiser or the appearance of the face. Although the graphical editor facilitates this process, obtaining the desired sound requires some skill on the part of the operator. It should be noted that the editing facility is designed as a research tool. On the diagram seven postures have been shown, giving six diphones. In fact the system will output posture information every 2 ms, and these values will then be resampled at each frame time. The posture information is particular to the Parke face model. A full list of parameters corresponding to the phonetic symbols is given in Table 2, as noted above.
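The step from discrete posture targets to continuous parameter tracks can be sketched as follows. This is a simplified reading of the scheme (plain linear interpolation over invented targets and timings), not the composition program itself; the full timing rules are given in Appendix A.

```python
# Build a continuous parameter track from a sequence of posture targets by holding each
# target during its quasi-steady state and interpolating linearly across each transition.
# Targets and durations below are made up; real values live in Tables 1/2 and Appendix A.
import numpy as np

STEP_MS = 2  # the system emits one parameter set every 2 ms

def parameter_track(postures, param):
    """postures: list of dicts with 'targets', 'qss_ms' and 'tt_ms' (transition to next)."""
    track = []
    for i, p in enumerate(postures):
        # quasi-steady state: hold the target value
        track += [p["targets"][param]] * (p["qss_ms"] // STEP_MS)
        if i + 1 < len(postures):
            # transition: linear ramp towards the next posture's target
            nxt = postures[i + 1]["targets"][param]
            steps = p["tt_ms"] // STEP_MS
            track += list(np.linspace(p["targets"][param], nxt, steps, endpoint=False))
    return np.array(track)

postures = [  # toy fragment of "s p ee ..." with invented numbers
    {"targets": {"f1": 300, "jaw": 0.05}, "qss_ms": 80, "tt_ms": 30},   # s
    {"targets": {"f1": 250, "jaw": 0.02}, "qss_ms": 46, "tt_ms": 18},   # p
    {"targets": {"f1": 280, "jaw": 0.20}, "qss_ms": 106, "tt_ms": 0},   # ee
]
f1_track = parameter_track(postures, "f1")     # drives the synthesiser
jaw_track = parameter_track(postures, "jaw")   # drives the face model
```

The editor described above operates on exactly these tracks: moving a target or a transition point changes both what is heard and, for the face entries, what is seen.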

Figure 4: Three of the speech parameters for "Speak to me now, bad kangaroo"

5.2 Historical Significance of "Speak to me now..."

The phrase "Speak to me now, bad kangaroo" was chosen for an initial demonstration of our system for historical reasons. It was the first utterance synthesised by rule by David Hill at the Edinburgh University Department of Phonetics and Linguistics. It was chosen because it was unusual (and therefore hard to hear), and incorporated a good phonetic variety, especially in terms of the stop sounds and nasals which were of particular interest. At that time the parameters to control the synthesiser were derived from analog representations in metal ink that picked up voltages from a device resembling a clothes wringer (a "mangle"), in which one roller was wound with a resistive coil of wire which impressed track voltages proportional to the displacement of silver-ink tracks on a looped mylar sheet. The tracks were made continuous with big globs of the silver ink that also conveyed the voltages through perforations to straight pick-off tracks for the synthesiser controller that ran along the back of the sheet. When these blobs ran under the roller, a violent perturbation of the voltages on all tracks occurred. However, the synthesiser was incapable of a non-vocalic noise so, instead of some electrical cacophony, the result was a most pleasing and natural belch.

6 Conclusion

This paper has presented a practical guide to using the speech-by-rule method to produce input to the Graphicsland animation system. Our approach to automatic lip synch is based on artificial speech synthesised by simple rules, extended to produce not only the varying parameters needed for acoustic synthesis, but also similar parameters to control the visual attributes of articulation as seen in a rendered polygon mesh face (face courtesy of Fred Parke). This joint production process guarantees perfect synchronisation between the lips and other components of facial expression related to speech, and the sound of speech. The chief limitations are: the less than perfect quality of the synthesised speech; the need for more accurate and more detailed facial expression data; and the need for a more natural face, embodying the physical motion constraints of real faces, probably based on muscle-oriented modelling techniques (e.g. [Waters 87]). Future work will tackle these and other topics, including the extension of parameter control to achieve needed speech/body motion synchrony as discussed in [Hill 88].

7 Acknowledgements

The following graduate students and assistants worked hard in the preparation of software and video material shown during the presentation: Craig Schock, Corine Jansonius, Trevor Paquette, Richard Esau and Larry Kamieniecki. We would also like to thank Fred Parke, who gave us his original software and data, and Andrew Pearce for his past contributions. This research is partially supported by grants from the Natural Sciences and Engineering Research Council of Canada.

Table 1: Rule Table for Speech Postures needed for synthesis (male voice)

Columns: Symbol, IPA, SST, TT, SST, TT, Flag, AX, F1, F2, F3, AH2, FH2, AH1.

Postures (symbol / IPA): ^ (silence), H /h/, UH /ə/, A /ʌ/, E /ɛ/, I /ɩ/, O /ɒ/, U /ɷ/, AA /æ/, EE /i/, ER /ɜ/, AR /ɑ/, AW /ɔ/, UU /u/, R /r/, W /w/, L /l/, Y /j/, M /m/, B /b/, P /p/, N /n/, D /d/, T /t/, NG /ŋ/, G /g/, K /k/, S /s/, Z /z/, SH /ʃ/, ZH /ʒ/, F /f/, V /v/, TH /ɵ/, DH /ð/, CH /ʧ/, J /ʤ/.

[Author's post-publication note: This table is incomplete and hard to follow, even with the headings ungarbled from the original. Please see Appendix A for more complete information.]

Table 2: Face Parameter Table for Phonetic Values

Columns: Posture, IPA, and the face parameter values (by parameter number).

Postures (symbol / IPA): ^ (silence), H /h/, UH /ə/, A /ʌ/, E /ɛ/, I /ɩ/, O /ɒ/, U /ɷ/, AA /æ/, AH /a/, EE /i/, ER /ɜ/, AR /ɑ/, AW /ɔ/, UU /u/, R /r/, W /w/, L /l/, LL /ɫ/, Y /j/, M /m/, B /b/, P /p/, N /n/, D /d/, T /t/, NG /ŋ/, G /g/, K /k/, S /s/, Z /z/, SH /ʃ/, ZH /ʒ/, F /f/, V /v/, TH /ɵ/, DH /ð/, CH /ʧ/, J /ʤ/.
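Taken together, Tables 1 and 2 amount to a single posture database keyed by phonetic symbol, each entry holding acoustic targets for the synthesiser and face parameter targets for the Parke model. A hypothetical sketch of such a record (all numeric values invented):

```python
# Hypothetical in-memory form of Tables 1 and 2: one record per speech posture,
# carrying the acoustic targets used by the synthesiser and the face parameter
# targets used by the face model. All numeric values below are invented.
POSTURES = {
    "AA": {  # /æ/ as in "bad"
        "acoustic": {"f1": 748, "f2": 1746, "f3": 2460, "ax": 54},
        "face": {"jaw_rotation": 12.0, "mouth_width": 0.6, "lip_protrusion": 0.0},
    },
    "UU": {  # /u/ as in "kangaroo"
        "acoustic": {"f1": 309, "f2": 939, "f3": 2320, "ax": 52},
        "face": {"jaw_rotation": 5.0, "mouth_width": 0.3, "lip_protrusion": 0.8},
    },
}

def targets_for(symbol):
    """Look up the acoustic and face targets for one phonetic symbol."""
    entry = POSTURES[symbol]
    return entry["acoustic"], entry["face"]
```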

References

[Hill 78] David Hill. A program structure for event-based speech synthesis by rules within a flexible segmental framework. Int. J. Man-Machine Studies, 3(10), May 1978.

[Hill 88] David Hill, Brian Wyvill, and Andrew Pearce. Animating Speech: an automatic approach using speech synthesis by rules. The Visual Computer, 3(5), 1988.

[Parke 74] Fred I. Parke. A Parametric Model for Human Faces. PhD dissertation, University of Utah, Dept. of Computer Science, Dec. 1974.

[Waters 87] Keith Waters. A muscle model for animating three-dimensional facial expression. In Computer Graphics, volume 21. ACM SIGGRAPH, July 1987.

[Witten 82] I.H. Witten. Principles of Computer Speech. Academic Press, London, England, 1982.

[Wyvill 86] Brian Wyvill, Craig McPheeters, and Rick Garbutt. The University of Calgary 3D Computer Animation System. Journal of the Society of Motion Picture and Television Engineers, 95(6), 1986.

APPENDIX A

Table A1: Rule table for speech postures needed for synthesis (male voice), clarified. (Data as used in DEGAS formant-based synthesis [Manzara 92]; numerical entries not in bold type are default values and usually not significant for the posture.)

Columns: Basic posture, IPA, Ì (microintonation effect), f1, f2, f3, f4, fnnf, nb, ax, ah1, ah2, fh2, bwh2, Duration unmarked/marked (msec), Transition unmarked/marked (msec), Categories (all postures have their own identity and all are phones, ph).

Posture  IPA  Duration unmarked/marked (ms)  Transition unmarked/marked (ms)  Categories
^             /140    0/0     si
h        h    /66     26/26   as
uh       ə    /74     66/66   vd,sv,vo
a        ʌ    /198    66/66   vd,sv,vo
e        ɛ    /174    66/66   vd,sv,vo
i        ɩ    /174    66/66   vd,sv,vo
o        ɒ    /156    66/66   vd,sv,vo,ba
u        ɷ    /162    66/66   vd,sv,vo,ba
aa       æ    /182    66/66   vd,vo
ee       i    /212    66/66   vd,lv,vo
er       ɜ    /252    66/66   vd,lv,vo
ar       ɑ    /272    66/66   vd,lv,vo,ba
aw       ɔ    /290    66/66   vd,lv,vo,ba
uu       u    /234    66/66   vd,lv,vo,ba
r        r    /86     40/40   co,gl,vo
w        w    /68     34/34   co,gl,vo
l        l    /84     22/22   co,gl,vo
y        j    /84     38/38   co,gl,vo
m        m    /114    16/16   na,co,vo
b        b    /82     16/16   co,st,ch,vo
p        p    /92     18/18   co,st,ch
n        n    /88     26/26   na,co,vo
d        d    /86     18/18   co,st,ch,vo
t        t    /80     24/24   co,st,ch
ng       ŋ    /68     28/28   na,co,vo
g        g    /80     30/30   co,st,vo
k        k    /104    30/30   co,st
s        s    /112    30/30   co

Continued on next page (with key)

Table A1 (continued): Rule table for speech postures needed for synthesis (male voice)

Posture  IPA  Duration unmarked/marked (ms)  Transition unmarked/marked (ms)  Categories
z        z    /84     30/30   co,vo
sh       ʃ    /124    24/24   co
zh       ʒ    /82     30/30   co,vo
f        f    /118    40/40   co
v        v    /88     44/44   co,vo
th       ɵ    /136    40/40   co
dh       ð    /108    44/44   co,vo
ch       ʧ    /118    24/24   co,af
j        ʤ    /100    40/40   co,vo,af

Categories:
ph phone        ma marked        vd vocoid       gl glide
co contoid      di diphthong     st stopped      ch checked
lv long-vowel   sv short-vowel   as aspirate     fr fricative
vo voiced       af affricate     na nasal        si silence
ba back-vowel

Column headings:
Ì     microintonation effect
f1    formant 1 target frequency
f2    formant 2 target frequency
f3    formant 3 target frequency
f4    formant 4 target frequency
fnnf  nasal formant frequency
nb    nasal formant bandwidth
ax    larynx amplitude
ah1   aspiration amplitude
ah2   frication amplitude
fh2   frication frequency peak
bwh2  frication filter bandwidth

Special features (e.g. noise bursts) are added by rules according to the context. The posture durations are given for both unmarked and marked variants. Provision is made for marked/unmarked transition durations, but in this dataset the transition durations are the same whether marked or not.

Notes

Assume we are constructing a diphone out of postures p and p+1. Then:

    Diphone duration = QSS(p)/2 + TT(p to p+1) + QSS(p+1)/2

(a) The transition time for a vowel to other sound, or other sound to vowel, is specified by the other sound. Thus TT(p to p+1) is given by:

    TT(p to p+1) = TT(p+1)   for vowel to other sound
    TT(p to p+1) = TT(p)     for other sound to vowel

The QSS time for a vowel to other sound, or other sound to vowel, is given by taking the transition time already allocated out of the vowel total time. Then:

    QSS(vowel) = Total(vowel) - TT(p to p+1)
    QSS(other) = Total(other)   (from the table)

(b) For diphthongs, triphthongs, etc. (i.e. vowel-to-vowel), we are dealing with a special case, identified by encountering a special symbol for each part of the compound sound, probably supplied by a preparser. Let dp represent a posture

component of a diphthong, triphthong, etc. Diphthongs are not directly represented in the data tables as they involve a succession of vowel postures; the component vowel postures are close to similar isolated vowels, but not necessarily identical. There are three cases in constructing diphones involving these postures: (i) dp to p, (ii) p to dp, and (iii) dp to dp. Cases (i) and (ii) are handled as described in (a), treating the dp as the vowel. For case (iii), QSS(p) and QSS(p+1) are taken from the table and then:

    TT(p to p+1) = Total(dp) - QSS(p)/2 - QSS(p+1)/2

(c) The transition time from other sound to other sound is a fixed 10 msec, taken out of the QSS of the following sound:

    TT(p to p+1) = 10
    QSS(p+1) = QSS(p+1) - 10

(this is pretty arbitrary, but works and is probably not critical)

(d) The target values for /k, g, ŋ/ must be modified before back vowels /ɒ, ɷ, ɑ, ɔ, u/. The second formant target should be put 300 Hz above the value for the vowel. This is an example of a rule that uses the posture categories.

(e) The target values for /h/ or silence to posture, or posture to /h/ or silence, are taken from the posture (i.e. the transitions are flat).

(f) The shape of a given transition is determined by which of the postures involved is checked or free. There is very little deviation from the steady-state targets of a checked posture during the QSS, and the transition begins or ends fairly abruptly. For a free posture, there is considerable deviation from the target values during the QSS. To obtain an appropriate shape, the slopes are computed so that if the slope during a checked posture is m, then the slope during a free posture is 3m and the slope during the transition is 6m. Then, if the total movement from the target in posture 1 to the target in posture 2 is Δ, we have:

    Δ = s1.QSS(p) + s2.TT(p to p+1) + s3.QSS(p+1)

where s1, s2 and s3 are the slopes as described, according to the type of p and p+1. Since the relationships between s1, s2 and s3 are known in terms of m, as is the value of Δ, the value of m may be calculated and substituted for the individual segment slopes.

[Figure: movement during QSS of p1 (slope s1), transition movement p to p+1 (slope s2), movement during QSS of p2 (slope s3)]

Reference

[Manzara 92] Leonard Manzara and David R. Hill. DEGAS: A system for rule-based diphone synthesis. Proc. 2nd Int. Conf. on Spoken Language Processing, Banff, Alberta, Canada, October 12-16, 1992.
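The timing rules in notes (a) to (c) can be expressed compactly in code. The sketch below is an illustrative reading of those notes, not the DEGAS implementation; the posture records and the vowel test are simplified assumptions.

```python
# Sketch of the diphone timing rules from Appendix A, notes (a)-(c).
# Each posture record carries its total duration, transition time and categories
# (e.g. 'vd', 'co', 'st') as in Table A1; values here would come from that table.

def is_vowel(p):
    return "vd" in p["categories"]        # 'vd' (vocoid) marks the vowels in Table A1

def transition_time(p, p_next):
    """TT(p to p+1): chosen by the 'other' (non-vowel) sound; fixed 10 ms other-to-other."""
    if is_vowel(p) and not is_vowel(p_next):
        return p_next["tt"]               # vowel to other sound: specified by the other sound
    if not is_vowel(p) and is_vowel(p_next):
        return p["tt"]                    # other sound to vowel: specified by the other sound
    if not is_vowel(p) and not is_vowel(p_next):
        return 10                         # other to other: fixed 10 ms (note (c))
    return p_next["tt"]                   # vowel to vowel is the special case of note (b), simplified

def diphone_duration(p, p_next):
    """Diphone duration = QSS(p)/2 + TT(p to p+1) + QSS(p+1)/2."""
    tt = transition_time(p, p_next)
    qss_p = p["total"] - tt if is_vowel(p) else p["total"]
    qss_next = p_next["total"] - tt if is_vowel(p_next) else p_next["total"]
    if not is_vowel(p) and not is_vowel(p_next):
        qss_next -= 10                    # note (c): the 10 ms comes out of the following QSS
    return qss_p / 2 + tt + qss_next / 2

# e.g. /t/ to /u/, using the marked durations and transitions that appear in Table A1
t = {"total": 80, "tt": 24, "categories": {"co", "st", "ch"}}
uu = {"total": 234, "tt": 66, "categories": {"vd", "lv", "vo", "ba"}}
print(diphone_duration(t, uu))            # 80/2 + 24 + (234 - 24)/2 = 169 ms
```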


More information

Computer Graphics. Si Lu. Fall uter_graphics.htm 11/27/2017

Computer Graphics. Si Lu. Fall uter_graphics.htm 11/27/2017 Computer Graphics Si Lu Fall 2017 http://web.cecs.pdx.edu/~lusi/cs447/cs447_547_comp uter_graphics.htm 11/27/2017 Last time o Ray tracing 2 Today o Animation o Final Exam: 14:00-15:30, Novermber 29, 2017

More information

NON-UNIFORM SPEAKER NORMALIZATION USING FREQUENCY-DEPENDENT SCALING FUNCTION

NON-UNIFORM SPEAKER NORMALIZATION USING FREQUENCY-DEPENDENT SCALING FUNCTION NON-UNIFORM SPEAKER NORMALIZATION USING FREQUENCY-DEPENDENT SCALING FUNCTION S. V. Bharath Kumar Imaging Technologies Lab General Electric - Global Research JFWTC, Bangalore - 560086, INDIA bharath.sv@geind.ge.com

More information

Facade Stanford Facial Animation System. Instruction Manual:

Facade Stanford Facial Animation System. Instruction Manual: Facade Stanford Facial Animation System Instruction Manual: A. Introduction 1. What is Facade? 2. What can you do with Facade? 3. What can t you do with Facade? 4. Who can use Facade? B. Overview of Facade

More information

ENDNOTE X7 VPAT VOLUNTARY PRODUCT ACCESSIBILITY TEMPLATE

ENDNOTE X7 VPAT VOLUNTARY PRODUCT ACCESSIBILITY TEMPLATE ENDNOTE X7 VPAT VOLUNTARY PRODUCT ACCESSIBILITY TEMPLATE Updated May 21, 2013 INTRODUCTION Thomson Reuters (Scientific) LLC is dedicated to developing software products that are usable for everyone including

More information

Documentation Addendum (Covers changes up to OS v1.20)

Documentation Addendum (Covers changes up to OS v1.20) Fusion Documentation Addendum (Covers changes up to OS v1.20) This page is intentionally left blank. About this Addendum: The Fusion s operating system is upgradeable and allows us to add features and

More information

Speech-Music Discrimination from MPEG-1 Bitstream

Speech-Music Discrimination from MPEG-1 Bitstream Speech-Music Discrimination from MPEG-1 Bitstream ROMAN JARINA, NOEL MURPHY, NOEL O CONNOR, SEÁN MARLOW Centre for Digital Video Processing / RINCE Dublin City University, Dublin 9 IRELAND jarinar@eeng.dcu.ie

More information

Facial Deformations for MPEG-4

Facial Deformations for MPEG-4 Facial Deformations for MPEG-4 Marc Escher, Igor Pandzic, Nadia Magnenat Thalmann MIRALab - CUI University of Geneva 24 rue du Général-Dufour CH1211 Geneva 4, Switzerland {Marc.Escher, Igor.Pandzic, Nadia.Thalmann}@cui.unige.ch

More information

Animation. Traditional Animation Keyframe Animation. Interpolating Rotation Forward/Inverse Kinematics

Animation. Traditional Animation Keyframe Animation. Interpolating Rotation Forward/Inverse Kinematics Animation Traditional Animation Keyframe Animation Interpolating Rotation Forward/Inverse Kinematics Overview Animation techniques Performance-based (motion capture) Traditional animation (frame-by-frame)

More information

Artificial Visual Speech Synchronized with a Speech Synthesis System

Artificial Visual Speech Synchronized with a Speech Synthesis System Artificial Visual Speech Synchronized with a Speech Synthesis System H.H. Bothe und E.A. Wieden Department of Electronics, Technical University Berlin Einsteinufer 17, D-10587 Berlin, Germany Abstract:

More information

A bouncing ball squashes on its vertical axis and stretches on the horizontal axis as it strikes the ground.

A bouncing ball squashes on its vertical axis and stretches on the horizontal axis as it strikes the ground. Animation Principles The following 12 animation principles are those distilled from the combined wisdom of animators over several decades. Animators developed their own techniques in animating characters,

More information

Chapter 5. Transforming Shapes

Chapter 5. Transforming Shapes Chapter 5 Transforming Shapes It is difficult to walk through daily life without being able to see geometric transformations in your surroundings. Notice how the leaves of plants, for example, are almost

More information

OGIresLPC : Diphone synthesizer using residualexcited linear prediction

OGIresLPC : Diphone synthesizer using residualexcited linear prediction Oregon Health & Science University OHSU Digital Commons CSETech October 1997 OGIresLPC : Diphone synthesizer using residualexcited linear prediction Michael Macon Andrew Cronk Johan Wouters Alex Kain Follow

More information

This is a repository copy of Diphthong Synthesis Using the Dynamic 3D Digital Waveguide Mesh.

This is a repository copy of Diphthong Synthesis Using the Dynamic 3D Digital Waveguide Mesh. This is a repository copy of Diphthong Synthesis Using the Dynamic 3D Digital Waveguide Mesh. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/126358/ Version: Accepted Version

More information

Motion Synthesis and Editing. Yisheng Chen

Motion Synthesis and Editing. Yisheng Chen Motion Synthesis and Editing Yisheng Chen Overview Data driven motion synthesis automatically generate motion from a motion capture database, offline or interactive User inputs Large, high-dimensional

More information

GSM Network and Services

GSM Network and Services GSM Network and Services Voice coding 1 From voice to radio waves voice/source coding channel coding block coding convolutional coding interleaving encryption burst building modulation diff encoding symbol

More information

Facial Animation System Based on Image Warping Algorithm

Facial Animation System Based on Image Warping Algorithm Facial Animation System Based on Image Warping Algorithm Lanfang Dong 1, Yatao Wang 2, Kui Ni 3, Kuikui Lu 4 Vision Computing and Visualization Laboratory, School of Computer Science and Technology, University

More information

OPTIMIZING A VIDEO PREPROCESSOR FOR OCR. MR IBM Systems Dev Rochester, elopment Division Minnesota

OPTIMIZING A VIDEO PREPROCESSOR FOR OCR. MR IBM Systems Dev Rochester, elopment Division Minnesota OPTIMIZING A VIDEO PREPROCESSOR FOR OCR MR IBM Systems Dev Rochester, elopment Division Minnesota Summary This paper describes how optimal video preprocessor performance can be achieved using a software

More information

Increased Diphone Recognition for an Afrikaans TTS system

Increased Diphone Recognition for an Afrikaans TTS system Increased Diphone Recognition for an Afrikaans TTS system Francois Rousseau and Daniel Mashao Department of Electrical Engineering, University of Cape Town, Rondebosch, Cape Town, South Africa, frousseau@crg.ee.uct.ac.za,

More information

Mahdi Amiri. February Sharif University of Technology

Mahdi Amiri. February Sharif University of Technology Course Presentation Multimedia Systems Speech II Mahdi Amiri February 2014 Sharif University of Technology Speech Compression Road Map Based on Time Domain analysis Differential Pulse-Code Modulation (DPCM)

More information