In Pursuit of Visemes

ISCA Archive. AVSP: International Conference on Audio-Visual Speech Processing, Hakone, Kanagawa, Japan, September 30 - October 3, 2010.

Sarah Hilder, Barry-John Theobald, Richard Harvey
School of Computing Sciences, University of East Anglia, UK
{s.hilder, b.theobald, r.w.harvey}@uea.ac.uk

Abstract

We describe preliminary work towards an objective method for identifying visemes. Active appearance model (AAM) features are used to parameterise a speaker's lips and jaw during speech. The temporal behaviour of the AAM features between automatically identified salient points is used to represent visual speech gestures, and visemes are created by clustering these gestures using dynamic time warping (DTW) as a cost function. This method produces a significantly more structured model of visual speech than if a typical phoneme-to-viseme mapping is assumed.

Index Terms: Visemes, visual speech encoding.

1. Introduction

Phonemes are a linguistic unit used to represent speech: replacing one phoneme with another will change the meaning of an utterance. Visemes (visual phonemes) [1] are a supposed unit of visual speech; however, visemes are less well defined and less well understood. Our goal is to identify a representation for visual speech equivalent to the use of phonemes to transcribe the sounds of an acoustic utterance.

Previous efforts to define visemes have proved inconclusive. Many assume a many-to-one relationship between phonemes and visemes [2, 3, 4, 5, 6], but there are a number of limitations with this approach. Firstly, it does not take into account the asynchrony between the acoustic and visual modalities of speech, where the onset of movement does not always correspond to the onset of the acoustic realisation of a phone. Secondly, some phones do not require the use of the visual articulators, so phonemes such as /k/ or /g/, which are velar consonants articulated at the back of the soft palate, are unlikely to have an associated viseme. Finally, previous approaches generally ignore coarticulation effects: the allophones of a phoneme often appear very different visually, yet these obviously different visual gestures are assigned the same meaning in a visual sense.

To overcome these limitations, we break the assumption of a formal link between the acoustic and the visual representations of speech. Instead, we will use machine learning algorithms to analyse visual speech and automatically identify patterns of visual behaviour. This paper describes preliminary work towards this goal. The remainder of this paper is organised as follows. Section 2 reviews previous methods used to identify visemes, and Section 3 describes our approach. Section 4 discusses our findings, and Section 5 outlines future work.

2. Background and Related Work

The sounds of speech are dependent on the formation of the articulatory organs, such as the nasal cavity, tongue, teeth, velum and lips. Since only a small number of these articulators are visible (the lips, and partially the teeth and tongue), it is apparent that a one-to-one mapping between phonemes and visemes results in redundancy. Speech sounds that differ in their voicing or nasality tend to appear visually similar and so are assigned to the same visemic class. For example, /f/ is voiceless and /v/ is voiced, but both have the same place and manner of articulation (labiodental fricative). As they appear the same, they are often considered to form a viseme.
Typically, phonemes are mapped to visemes using some form of subjective assessment based on analysing patterns of confusion in a stimulus-response matrix [2]. Visemes are defined by clustering phonemes such that the within-cluster response accounts for at least 75% of all responses. Auer and Bernstein [3] compiled a set of consonant visemes and vowel visemes (referred to as phonemic equivalence classes, or PECs) by grouping phonemes based on their similarity. Their data were taken from the Eberhardt et al. consonant recognition task, where participants were asked to lip-read from C-/a/ contexts [4], and from Montgomery and Jackson's vowel recognition task, where participants were asked to lip-read the vowel in /h/-V-/g/ contexts [7]. They find that twelve PECs best approximate the data. Similarly, Lesner and Kricos [6] asked subjects to lip-read vowels and diphthongs in /h/-V-/g/ contexts spoken by different speakers. They find that the number and composition of visemes differ across speakers, and that speakers who are easier to speech-read generally produce a larger number of visemes.
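The within-cluster grouping rule described at the start of this section can be made concrete with a small sketch. The phoneme set and response counts below are invented, and the greedy merging procedure is only one plausible reading of the 75% criterion, not the procedure used in any of the studies cited here.

```python
import numpy as np

# Hypothetical stimulus-response confusion matrix: rows are presented
# phonemes, columns are perceived responses (response counts).
phones = ["p", "b", "m", "f", "v"]
confusions = np.array([
    [20,  9,  8,  2,  1],
    [10, 18, 10,  1,  1],
    [ 8, 10, 20,  1,  1],
    [ 1,  1,  1, 25, 12],
    [ 1,  1,  2, 13, 23],
])

def viseme_groups(conf, labels, threshold=0.75):
    """Greedily merge phoneme groups until, within every group, responses
    falling inside the group account for at least `threshold` of the total."""
    groups = [[i] for i in range(len(labels))]

    def within(group):
        return conf[np.ix_(group, group)].sum() / conf[group, :].sum()

    while len(groups) > 1:
        below = [g for g in groups if within(g) < threshold]
        if not below:
            break
        worst = min(below, key=within)
        others = [g for g in groups if g is not worst]
        # merge with the group that attracts most of the remaining responses
        target = max(others, key=lambda g: conf[np.ix_(worst, g)].sum())
        groups.remove(worst)
        groups.remove(target)
        groups.append(sorted(worst + target))

    return [[labels[i] for i in g] for g in groups]

print(viseme_groups(confusions, phones))   # [['p', 'b', 'm'], ['f', 'v']]
```

With these invented counts the rule recovers the familiar bilabial and labiodental groupings, but small changes to the counts (or a different merge order) can change the result, which is one reason the subjectively derived viseme sets disagree.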

A similar experiment was performed by Jiang et al. [8], in which participants were asked to lip-read the consonant in C-V words uttered by four speakers. They also find that visemes are both speaker dependent and context dependent, with the number of visemes varying from four to six. Fisher [1] asked participants to lip-read the initial and final consonants of an utterance using a forced-error approach, where the correct response was omitted from a closed set of possible responses. The results suggest that viseme groupings for initial and final consonants differ, and that initial consonants exhibit directional confusions: for example, /m/ is significantly confused with /b/, but /b/ is not significantly confused with /m/.

For experiments that require human participation, it is necessary for stimuli to be kept simple. Many of the methods used to identify visemes use data where phonemes are presented only in a single context [3, 7]. Natural speech production is much less constrained than this, and visual articulation is driven by what is necessary for acoustic distinctiveness, not for visual distinctiveness [14]. The physical constraints enforced by the human muscular system prevent the articulators from switching position instantaneously, which results in blurring across gesture boundaries [10]. Perkell and Matthies [9] describe this as the superposition of multiple influences on the movement of an articulator. Montgomery and Jackson [7] state this as the reason for selecting the /h/-V-/g/ context in their work: this context produces minimal coarticulation effects.

Coarticulation can be anticipatory (forwards coarticulation) or can reflect the influence of previous gestures (backwards coarticulation). For example, the appearance of /s/ in the words "sue" and "sea" may differ, as the anticipatory labial rounding of /u/ in "sue" begins during or before the articulation of /s/ (forwards coarticulation). Equally, the appearance of /k/ in the words "spook" and "speak" may differ due to the labial rounding of the preceding vowel in "spook" and the widening in "speak" (backwards coarticulation). Figure 1 illustrates the effect of backwards coarticulation on lip shape: the images were taken during the articulation of /t/ in the words "bat" (left) and "jot" (right), and in the latter case the lip roundedness of /dʒ/ and /ɒ/ from "jot" continues to influence /t/.

Figure 1: A frame from /t/ in the word "bat" (left) and "jot" (right). The difference in lip shape highlights the influence of coarticulation on visual speech production.

It is worth noting that coarticulation effects are not functions only of directly neighbouring phones, but have been found to be influenced by phones up to six segments away in either direction [11]. Löfqvist incorporated this idea into a theory of speech production [20], whereby each speech segment has a set of dominance functions, one for each articulator, and speech production involves concatenating and overlapping these dominance functions. The dominance and deformability of a gesture depend on whether fully reaching the articulatory targets is necessary to produce the required sound. This means that not all visual phones are equally affected by coarticulation, as the organs deemed necessary for producing a sound may or may not be visually apparent. For example, the consonants /f/ and /v/ are far less deformable than /k/ and /g/: the former are fricative consonants articulated using the upper teeth and lower lip, granting minimal freedom to the shape of the lips, whereas the latter are velar consonants articulated at the back of the soft palate, granting more freedom to the shape of the lips.
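The dominance-function idea above can be illustrated with a small numerical sketch in the spirit of the Cohen and Massaro coarticulation model [19]: each segment contributes a target value for an articulatory parameter (here a single, hypothetical lip-rounding value) together with an exponential dominance function, and the realised trajectory is the dominance-weighted blend of the overlapping targets. All timings, targets and strengths below are invented for illustration; this is not a model fitted in any of the cited studies.

```python
import numpy as np

# Hypothetical segments: (centre time in seconds, target lip rounding in [0, 1],
# dominance strength, time constant of the exponential dominance function).
segments = [
    (0.10, 0.2, 1.0, 0.06),   # deformable consonant, weak preference for spread lips
    (0.25, 0.9, 3.0, 0.08),   # /u/: dominant, strong demand for lip rounding
    (0.40, 0.3, 1.0, 0.06),   # a following deformable consonant
]

def dominance(t, centre, strength, tau):
    # exponentially decaying dominance around the segment centre
    return strength * np.exp(-np.abs(t - centre) / tau)

t = np.linspace(0.0, 0.5, 101)
num = np.zeros_like(t)
den = np.zeros_like(t)
for centre, target, strength, tau in segments:
    d = dominance(t, centre, target_weight := strength, tau) if False else dominance(t, centre, strength, tau)
    num += d * target
    den += d

rounding = num / den    # dominance-weighted blend of the articulatory targets

# Rounding rises well before the centre of /u/ (anticipatory coarticulation)
# and decays slowly afterwards (carry-over coarticulation).
print(round(float(np.interp(0.10, t, rounding)), 2),
      round(float(np.interp(0.25, t, rounding)), 2))
```

With the strongly dominant /u/ in the middle, the blended rounding at the centre of the preceding consonant is already well above that consonant's own target, mirroring the anticipatory effects discussed above.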
The articulation of the vowel /u/ is a dominant gesture, as lip rounding and protrusion are both essential to produce the sound. Consequently, in consonant recognition tasks, Owens and Blazek [15] and Benguerel and Pichora-Fuller [16] find that the most apparent coarticulation effects are in /u/-C-/u/ contexts, where C is a more deformable consonant (such as /s, z, t, d, l, n, k, g, h, j/). Benguerel and Pichora-Fuller [16] also find that in VCV contexts /u/ attains a near perfect recognition score, whereas /æ/ scores the lowest. Perkell and Matthies [9] measured coarticulation in /i/-C-/u/ utterances by recording the vertical displacement of a point on the upper lip. They find that many speakers begin lip protrusion for /u/ directly after the acoustic offset of /i/.

To date there has been no unequivocal agreement regarding the number of visemic classes, nor how the set of phonemes is clustered to form the visemes. This may be due to the subjective nature of the methods employed, where small variations in stimuli and different participants influence the resulting visemes. For the purposes of automatic lip-reading, Goldschen et al. [12] used an objective approach for identifying visemes using sentences as stimuli. A selection of static and dynamic lip features were extracted from video and manually segmented into phones. These were then clustered using a hidden Markov model similarity measure and the average-linkage hierarchical clustering algorithm [13]. The resulting visemes are consistent with results from perceptual experiments [1], but the notion of a viseme was extended to include lip opening/closing for the consonants /b/, /p/ and /m/, forming the groups /bcl, m, pcl/ and /b, p, r/ (where cl indicates closure).

To account for variation in visual articulation due to phonetic context, a many-to-many relationship between phonemes and visemes is required [17]. Mattys et al. [18] present one of the few attempts to model visemes in this way, using different viseme classes for initial and non-initial consonants. However, a limitation of this approach is that consonants are assumed to have no influence on the articulation of vowels.

We propose to identify visual units of speech independently of a phonetic/acoustic representation of speech. Instead, patterns of behaviour of the articulators will be used to identify visual meaning, and by clustering behaviours that appear similar we will identify a set of visemes. This will overcome three of the major shortcomings identified in previous work. Firstly, the allophones of a phoneme will not be required to have the same visual label. Secondly, the onset and offset of the visual gestures will be identified in the visual modality, so we do not require visemes to align with acoustic labels, as is usually the case [12, 21]. Thirdly, our analysis will be objective in nature, and the visual units will be derived from continuous speech (sentences).

3. Viseme extraction

To produce a set of visual gestures that we will refer to as visemes, we use continuous speech to acknowledge the influence of coarticulation, and adopt a data-driven approach to avoid prior assumptions regarding phonetic alignment or labels.

3.1. Stimuli

The stimuli used in this work are drawn from the LIPS2008 audio-visual corpus [22]. This contains 278 phonetically balanced sentences spoken by a single, female speaker. It was recorded at 50 frames per second in standard definition. The speaker maintained a neutral expression throughout the recording and spoke at a steady pace, and the camera captured a full frontal image of the face.

3.2. Feature extraction and preprocessing

In this work, active appearance models (AAMs) [23] are used to encode visual speech. AAMs provide a compact statistical description of the variation in the shape and appearance of a face. The shape of an AAM is defined by the two-dimensional vertex locations,

s = (x_1, y_1, x_2, y_2, \ldots, x_n, y_n)^T,

of a mesh that delineates the inner and outer lip contours and the jaw. A set of model parameters that control the non-rigid variation allowed by the mesh is derived by hand-labelling a set of training images, then applying principal components analysis (PCA) to give a compact model of the form

s = s_0 + \sum_{i=1}^{m} p_i s_i,    (1)

where s_0 is the mean shape and the vectors s_i are the eigenvectors of the covariance matrix corresponding to the m largest eigenvalues (see Figure 2). The coefficients p_i are the shape parameters, which define the contribution of each mode to the encoding of s.

The appearance of the AAM is an image defined over the pixels x = (x, y)^T that lie inside the base mesh s_0, together with a set of model parameters that control the variation allowed in the image. The appearance is constructed by warping each training image from the manually annotated mesh locations to the base shape, then applying PCA to these shape-normalised images to give a compact model of appearance variation of the form

A(x) = A_0(x) + \sum_{i=1}^{l} \lambda_i A_i(x), \quad \forall x \in s_0,    (2)

where the coefficients \lambda_i are the appearance parameters, A_0(x) is the base appearance, and the appearance images A_i(x) are the eigenvectors corresponding to the l largest eigenvalues of the covariance matrix (see Figure 2).

Figure 2: Top row: the first two modes of shape variation between ±3σ. Bottom row: the first mode of variation of the appearance model between ±3σ.

To encode the visual speech information within the video sequences, the face is tracked using the inverse compositional project-out AAM algorithm [24].
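To make Equation (1) concrete, the sketch below builds a linear shape model with PCA from a set of (here random, stand-in) landmark vectors, then projects a shape into the model and reconstructs it. The appearance model of Equation (2) is built in exactly the same way from shape-normalised pixel vectors. The data sizes and the 95% variance threshold are illustrative assumptions, not values from the paper, and AAM fitting itself (the inverse compositional project-out algorithm of [24]) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical training data: 200 hand-labelled meshes of 34 (x, y) vertices,
# flattened to shape vectors s = (x1, y1, ..., xn, yn); random stand-in values.
shapes = rng.normal(size=(200, 68))

# Linear shape model of Equation (1): s ~ s0 + sum_i p_i s_i.
s0 = shapes.mean(axis=0)                               # mean shape s0
U, sing, Vt = np.linalg.svd(shapes - s0, full_matrices=False)
var = sing ** 2 / (len(shapes) - 1)                    # variance of each mode
m = int(np.searchsorted(np.cumsum(var) / var.sum(), 0.95)) + 1
S = Vt[:m]                                             # rows are the modes s_i

def encode(s):
    """Project a shape vector onto the model to obtain the parameters p_i."""
    return S @ (s - s0)

def decode(p):
    """Reconstruct a shape from its parameters, as in Equation (1)."""
    return s0 + S.T @ p

p = encode(shapes[0])
print(m, float(np.linalg.norm(shapes[0] - decode(p))))  # modes kept, residual
```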
Next, Equations (1) and (2) are solved for the shape and appearance parameters respectively, which are then concatenated and smoothed to reduce the effects of noise. Smoothing is achieved using a cubic smoothing spline with a weighting of 0.7, a value determined by subjectively analysing the curves. The smoothed features are normalised as follows:

b = \begin{pmatrix} W p \\ \lambda \end{pmatrix},    (3)

where

W = \frac{\sum_{i=1}^{l} \sigma^2_{\lambda_i}}{\sum_{i=1}^{m} \sigma^2_{p_i}},    (4)

and where \lambda and p are column vectors of appearance and shape parameters respectively, l and m are the numbers of appearance and shape dimensions respectively, and \sigma^2_{\lambda_i} and \sigma^2_{p_i} represent the variance captured by each dimension of the respective model. A third PCA is applied to these features to model the correlated variation in the shape and appearance parameters, providing a compact, low-dimensional feature vector describing the shape and appearance variation of the lips and jaw during speech.
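A minimal sketch of the normalisation in Equations (3) and (4) and the final combined PCA, assuming the per-frame shape parameters p and appearance parameters λ have already been recovered by the tracker; the arrays below are random placeholders with invented dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Placeholder per-frame AAM parameters for one sentence:
# T frames, m shape parameters and l appearance parameters.
T, m, l = 400, 8, 20
p = rng.normal(scale=2.0, size=(T, m))        # shape parameters per frame
lam = rng.normal(scale=5.0, size=(T, l))      # appearance parameters per frame

# Equation (4): ratio of total appearance variance to total shape variance.
W = lam.var(axis=0).sum() / p.var(axis=0).sum()

# Equation (3): weighted shape parameters stacked with appearance parameters.
b = np.hstack([W * p, lam])                   # one combined vector per frame

# Third PCA over the combined vectors gives the final visual speech features.
b0 = b.mean(axis=0)
_, _, Vt = np.linalg.svd(b - b0, full_matrices=False)
keep = 10                                     # assumed number of retained modes
features = (b - b0) @ Vt[:keep].T
print(features.shape)                         # (400, 10)
```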

3.3. Speech segmentation

To produce the gestures we define to be visemes, the visual speech is first segmented into units. Typically this is done using the phonetic boundaries derived from the acoustic speech [12, 21], but here we use a data-driven approach whereby we locate segment boundaries using only the visual signal. During speech, articulators do not move at a constant rate: they tend to accelerate towards articulatory targets and decelerate as they approach or realise those targets. Consequently, we assume that a salient lip pose is one in which the lips are at their most still, and that a gesture is the transition from one salient lip pose to the next. This involves calculating the acceleration coefficients from the AAM features and extracting the frames where the sign changes from negative to positive. Using this method we extract 8421 gestures from the LIPS2008 corpus. This value falls between the number of phones articulated (11850) and the number of syllables produced (approximately 4500).
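The segmentation rule just described can be sketched on a synthetic one-dimensional trajectory. The paper does not state exactly how the multi-dimensional acceleration is reduced to a single sign, so the version below, which tests the sign change of the second difference of one feature dimension, is only one plausible reading; the signal, noise level and smoothing window are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical 1-D feature trajectory for one utterance (a single AAM feature
# dimension); in practice this would come from the smoothed combined features.
t = np.arange(500) / 50.0                              # 10 s at 50 frames/s
x = np.sin(2 * np.pi * 1.3 * t) + 0.02 * rng.normal(size=t.size)
x = np.convolve(x, np.ones(9) / 9, mode="same")        # simple smoothing stand-in

accel = np.diff(x, n=2)                                # second difference (acceleration)

# Salient poses: frames where the acceleration changes sign from negative to
# positive (approximate frame indices).
salient = np.where((accel[:-1] < 0) & (accel[1:] >= 0))[0] + 1

# A gesture is the stretch of the trajectory between consecutive salient frames.
gestures = [x[a:b + 1] for a, b in zip(salient[:-1], salient[1:])]
print(len(gestures), [len(g) for g in gestures[:5]])
```

Consecutive salient frames delimit the gestures that are subsequently compared with DTW.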
3.4. Clustering

Pair-wise distances between gestures are obtained by measuring the cost associated with performing a dynamic time warp (DTW). Dynamic time warping is a method for measuring the similarity between two time series that may vary in length or speed: a non-linear warp along the time axis is applied to one of the sequences to align it to the other, and the warp is optimised to minimise a cost function. The reader is referred to [25] for a more in-depth discussion of DTW; a simplified stand-in for this step is sketched at the end of this subsection.

We use the clustering toolbox CLUTO [26] (version 2.1.2) to cluster the gesture space. Empirically we find that the graph-based clustering algorithm results in the best clustering, possibly because it is able to model non-spherical clusters. To determine the required number of clusters, k, we measure the silhouette coefficient (SC) [27], Dunn's Index (DI) [28] and the Davies-Bouldin Index (DBI) [29] at n clusters, for a range of values of n. We find that k = 58 is optimal: this is where the SC and DI are maximised and the DBI is minimised.
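The sketch below is a simplified stand-in for this step, not the authors' pipeline: a basic dynamic-programming DTW cost with a Euclidean local distance, average-linkage hierarchical clustering over the precomputed distance matrix in place of CLUTO's graph-based algorithm, and k chosen by maximising the silhouette coefficient alone (the paper also scans Dunn's and Davies-Bouldin indices). The gesture sequences are random placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score

def dtw_cost(a, b):
    """Minimum alignment cost between feature sequences a (Ta, d) and b (Tb, d)."""
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            D[i, j] = local[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)]

def pick_k(dist, candidates):
    """Cluster a precomputed distance matrix and return the k with the best
    average silhouette coefficient (stand-in for the SC/DI/DBI scan)."""
    Z = linkage(squareform(dist, checks=False), method="average")
    scores = {}
    for k in candidates:
        labels = fcluster(Z, t=k, criterion="maxclust")
        if len(set(labels)) > 1:
            scores[k] = silhouette_score(dist, labels, metric="precomputed")
    best = max(scores, key=scores.get)
    return best, scores

# Random placeholder gestures: variable-length sequences of 10-D feature vectors.
rng = np.random.default_rng(3)
gestures = [rng.normal(size=(rng.integers(5, 15), 10)) + i % 3 for i in range(30)]

n = len(gestures)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw_cost(gestures[i], gestures[j])

best_k, scores = pick_k(dist, range(2, 9))
print(best_k, round(scores[best_k], 2))
```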

3.5. Results

Figure 3A shows the distance matrix produced by finding the DTW cost between each pair of gestures and ordering the samples by their cluster ID. In this image, the colours range from blue to red, representing the smallest to largest values respectively. The cluster boundaries are highlighted with black boxes along the leading diagonal, and a perfect clustering would produce a series of blue boxes down the diagonal on a red background. For comparison, Figure 3B presents a visualisation of the distance matrix produced by finding the cost of performing DTW between each pair of visemes, i.e. visual phonemes. To produce this image, the AAM features were segmented via the acoustic phone boundaries, and labels were then assigned to each segment based on the phoneme-to-viseme mapping taken from [30]. In this image, the samples are ordered by viseme group; for example, all segments with phoneme labels /p/, /b/ and /m/ are arranged sequentially in the distance matrix. Again, the group boundaries are outlined with black boxes.

Figure 3: Distance matrices obtained using the DTW cost between each pair of visemes. The visemes were found using: (A) the visual gestural approach described in this paper, and (B) the acoustic phone boundaries and the phoneme-to-viseme mapping from [30] (a traditional approach). Each viseme/gesture is delineated by a black square along the leading diagonal. Red regions correspond to high error, whereas blue regions correspond to low error. Ideally, for both images, the areas bound by the black squares would be more blue (small intra-viseme distance) whilst the other areas would appear more red (higher inter-viseme distance).

We can see that the distance matrix for the traditional visemes appears to lack structure, as the distances between grouped items appear to be no smaller than the distances between ungrouped items. The one exception is the group in the lower right-hand corner of the image, which represents silence: as the speaker was asked to maintain a neutral expression before and after each utterance, the examples of silence are almost unaffected by coarticulation. It is apparent that the distance matrix in Figure 3A shows more structure than that in Figure 3B. To quantify this we measure the average silhouette coefficient for each cluster [27]. The silhouette coefficient is a measure of cluster separation and cohesion that falls between -1 and +1, where a value of +1 denotes perfectly clustered data. From the data we record average silhouette values of -0.32 when using the typical viseme approach and 0.11 for our gestural approach. The latter approach attained both higher minimum and maximum scores, and the difference between the approaches is statistically significant (p < 0.01).

4. Discussion

In this paper we have presented a review of the methods that are traditionally used for viseme classification and highlighted some of the many problems associated with them. We noted that when using subjective methods there is little explicit agreement as to which phonemes are grouped to form which visemes, or how many visemes there should be. However, most have agreed that the relationship between phonemes and visemes is many-to-one, and all studies have ignored the matter of audio-visual asynchrony: the visual labels are usually assumed to align with the underlying acoustic (phone) boundaries.

In Section 3 we introduced our proposed method of viseme classification: a data-driven approach that makes no prior assumption regarding the segmentation of speech or the phonetic labels associated with each segment. We described a novel method of segmenting the speech according to the acceleration of the lips during articulation, and we clustered the segments based on the cost of performing DTW between each pair of gestures. We have found that this approach produces a significantly more structured model of visual speech production than if we assume a many-to-one phoneme-to-viseme mapping.

5. Further work

We are applying machine learning algorithms first to discover whether we can extract a set of gestures that describe the speech space of a single speaker, and then expanding to multiple speakers in the hope that we can produce a generic set of parameterised visemes. This will then be used for comparison against the traditional approaches to visual speech representation employed both in our visual speech synthesisers (talking heads) and our automatic lip-reading systems.

6. Acknowledgements

The authors gratefully acknowledge EPSRC (EP/E028047/1) for funding.

7. References

[1] C. G. Fisher, "Confusions among visually perceived consonants," Journal of Speech, Language, and Hearing Research, vol. 11, December.
[2] T. Chen and R. R. Rao, "Audio-visual integration in multimodal communication," Proceedings of the IEEE, 1998.
[3] E. T. Auer and L. E. Bernstein, "Speechreading and the structure of the lexicon: Computationally modeling the effects of reduced phonetic distinctiveness on lexical uniqueness," Journal of the Acoustical Society of America, vol. 102, Dec.
[4] S. P. Eberhardt, L. E. Bernstein, and M. H. Goldstein, "Speechreading sentences with single-channel vibrotactile presentation of voice fundamental frequency," Journal of the Acoustical Society of America, vol. 88, no. 3.
[5] T. Ezzat and T. Poggio, "Visual speech synthesis by morphing visemes," International Journal of Computer Vision, vol. 38, no. 1.
[6] S. A. Lesner and P. B. Kricos, "Visual vowel and diphthong perception across speakers," Journal of the Academy of Rehabilitative Audiology, vol. 14, 1981.

[7] A. A. Montgomery and P. L. Jackson, "Physical characteristics of the lips underlying vowel lipreading performance," Journal of the Acoustical Society of America, vol. 73, no. 6.
[8] J. Jiang, A. Alwan, L. E. Bernstein, E. T. Auer, Jr., and P. A. Keating, "Similarity structure in perceptual and physical measures for visual consonants across talkers," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, 2002.
[9] J. S. Perkell and M. L. Matthies, "Temporal measures of anticipatory labial coarticulation for the vowel /u/: Within- and cross-subject variability," Journal of the Acoustical Society of America, vol. 91, no. 5.
[10] A. Turkmani, "Visual analysis of viseme dynamics," Ph.D. dissertation, University of Surrey.
[11] R. D. Kent and F. D. Minifie, "Coarticulation in recent speech production models," Journal of Phonetics, vol. 5, no. 2, April.
[12] A. J. Goldschen, O. N. Garcia, and E. Petajan, "Continuous optical automatic speech recognition by lipreading," in Proceedings of the 28th Asilomar Conference on Signals, Systems, and Computers, 1994.
[13] C. Ding and X. He, "Cluster merging and splitting in hierarchical clustering algorithms," in Proceedings of the Second IEEE International Conference on Data Mining, 2002.
[14] J. Luettin, "Visual speech and speaker recognition," Ph.D. dissertation, University of Sheffield.
[15] E. Owens and B. Blazek, "Visemes observed by hearing-impaired and normal-hearing adult viewers," Journal of Speech and Hearing Research, vol. 28.
[16] A. P. Benguerel and M. K. Pichora-Fuller, "Coarticulation effects in lipreading," Journal of Speech and Hearing Research, vol. 25.
[17] P. L. Jackson, "The theoretical minimal unit for visual speech perception: Visemes and coarticulation," Volta Review, vol. 90, no. 5, September.
[18] S. L. Mattys, L. E. Bernstein, and E. T. Auer, "Stimulus-based lexical distinctiveness as a general word-recognition mechanism," Perception and Psychophysics, vol. 64, no. 4, 2002.
[19] M. M. Cohen and D. W. Massaro, "Modeling coarticulation in synthetic visual speech," in Models and Techniques in Computer Animation. Springer-Verlag, 1993.
[20] A. Löfqvist, "Speech as audible gestures," Kluwer Academic Publishers.
[21] J. Melenchón, J. Simó, G. Cobo, and E. Martínez, "Objective viseme extraction and audiovisual uncertainty: estimation limits between auditory and visual modes," in International Conference on Auditory-Visual Speech Processing.
[22] B. Theobald, S. Fagel, G. Bailly, and F. Elisei, "LIPS2008: Visual speech synthesis challenge," in Interspeech, 2008.
[23] T. Cootes, G. Edwards, and C. Taylor, "Active appearance models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, June.
[24] I. Matthews and S. Baker, "Active appearance models revisited," International Journal of Computer Vision, vol. 60, no. 2, November.
[25] K. Wang and T. Gasser, "Alignment of curves by dynamic time warping," The Annals of Statistics, vol. 25, no. 3, June.
[26] G. Karypis, "CLUTO - a clustering toolkit," University of Minnesota, Department of Computer Science, Minneapolis, Tech. Rep., April.
[27] P. J. Rousseeuw, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20.
[28] J. C. Dunn, "Well-separated clusters and optimal fuzzy partitions," Cybernetics and Systems, vol. 4.
[29] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1.
[30] B. Walden, R. Prosek, and A. Montgomery, "Effects of training on the visual recognition of consonants," Journal of Speech and Hearing Research, vol. 20.
