Joint Processing of Audio and Visual Information for Speech Recognition Chalapathy Neti and Gerasimos Potamianos, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598 With Iain Matthews, CMU, USA; Juergen Luettin, IDIAP, Switzerland Baltimore, 07/14/00
JHU Workshop, 2000 Theme: Audio-visual speech recognition Goal: Use visual information to improve audio-based speech recognition, particularly in acoustically degraded conditions TEAM: Senior Researchers: Chalapathy Neti (Lead), Gerasimos Potamianos (IBM), Iain Matthews (CMU), Juergen Luettin (IDIAP, Switzerland), Andreas Andreou (JHU) Graduate students: Herve Glotin (ICP, Grenoble), Eugenio Culurciello (JHU) Undergraduates: Azad Mashari (U. Toronto), June Sison (UCSC) DATA: Subset of IBM VVAV data, LDC Broadcast video data
Perceptual Information Interfaces (PII) Joint processing of audio, visual and other information for active acquisition, recognition and interpretation of objects and events. Is anybody speaking? Speech, music, etc. Who is speaking? What is being said? When is someone speaking? What is the speaker's state? Emotional state. What is the environment? Office, studio, street, etc. Who is in the environment? Crowd, meeting, anchor, etc.
Audio-Visual Processing Technologies Exploit visual information to improve audio-based processing in adverse acoustic conditions Example contexts: Audio-Visual Speech Recognition (MMSP99, ICASSP00, ICME00): Combine visual speech (visemes) with audio speech (phones) recognition Audio-Visual Speaker Recognition (AVSP99, MMSP99, ICME00): Combine audio signatures of a person with visual signatures (face-id) Audio-Visual Speaker Segmentation (RIAO00): Combine visual scene change with audio-based speaker change Audio-Visual Speech-event detection (ICASSP00): Combine visual speech onset cues with audio-based speech energy Audio-Visual Speech synthesis (ICME00) Applications: Human-Computer Interaction, Multimedia content indexing, HCI for people with special needs In this workshop, we propose to attack the problem of large-vocabulary (LV) audio-visual speech recognition. [Diagram: audio and video data streams combined by decision fusion into an A-V descriptor.]
Audio-Visual ASR: Motivation OBJECTIVE: To significantly improve ASR performance by joint use of audio and visual information. FACTS / MOTIVATION: Lack of ASR robustness to noise / mismatched training-testing conditions. Humans speechread to increase intelligibility of noisy speech (Summerfield experiment). Humans fuse audio and visual information in speech recognition (McGurk effect: acoustic /ba/ + visual /ga/ is perceived as /da/). Hearing impaired people speechread well. Visual and audio information are partially complementary: acoustically confusable phonemes belong to different viseme classes. Inexpensive capture of visual information.
Transcription Accuracy - Results from the '98 DARPA Evaluation (IBM Research System) [Bar chart: accuracy (0-100%) by focus condition: prepared, conversation, telephone, background music, background noise, foreign accent, combinations, overall.]
Bimodal Human Speech Recognition (Summerfield, '79) Experiment: Human transcription of audio-visual speech at SNR < 0 dB. [Bar chart: word recognition (%) for A - acoustic only; B - acoustic + full video; C - acoustic + lip region; D - acoustic + 4 lip points.]
Visemes vs. Phonemes 1. f, v 2. th, dh 3. s, z 4. sh, zh 5. p, b, m 6. w 7. r 8. g, k, n, t, d, y 9. l 10. iy 11. ah 12. eh A viseme is a unit of visual speech. Phoneme to viseme is a many-to-one mapping. Viseme classes are different from phonetic classes: e.g., /m/ and /n/ belong to the same acoustic class (nasals), whereas they belong to separate viseme classes.
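The many-to-one phoneme-to-viseme mapping on this slide can be sketched directly as a lookup table; the class numbering below follows the list above, and the helper function is illustrative, not part of the original system:

```python
# Viseme classes from the slide: each class groups phonemes that
# look alike on the lips (many-to-one phoneme -> viseme mapping).
VISEME_CLASSES = {
    1: ["f", "v"],
    2: ["th", "dh"],
    3: ["s", "z"],
    4: ["sh", "zh"],
    5: ["p", "b", "m"],
    6: ["w"],
    7: ["r"],
    8: ["g", "k", "n", "t", "d", "y"],
    9: ["l"],
    10: ["iy"],
    11: ["ah"],
    12: ["eh"],
}

# Invert to a phoneme -> viseme-class lookup.
PHONE_TO_VISEME = {p: v for v, phones in VISEME_CLASSES.items() for p in phones}

def same_viseme(p1, p2):
    """True if two phonemes fall in the same viseme class."""
    return PHONE_TO_VISEME[p1] == PHONE_TO_VISEME[p2]
```

For example, the nasals /m/ and /n/ share an acoustic class but land in different viseme classes (5 and 8), which is exactly the complementarity the slide points to.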
McGurk Effect
Visual-Only ASR Performance Digit / alpha recognition (Potamianos). Single-speaker digits = 97.0%. Single-speaker alphas = 71.1%. 50-speaker alphas = 36.5%. Phonetic classification (Potamianos, Neti, Verma, Iyengar). 162 speakers (VV-AV), LVCSR = 37.3%. [Bar chart: single-speaker digit accuracy (80-100%) by visual feature: lip shape, DWT, PCA, LDA.]
Audio-Visual ASR: Challenges Suitable databases are just emerging. At IBM (Giri, Potamianos, Neti): Collected a one of a kind LVCSR audio-visual database (50 hrs, >260 subjects). Previous work: AT&T (Potamianos): 1.5 hr, 50 subjects, connected letters Surrey (Kittler): M2VTS (37 subjects, connected digits) Visual front end. Face and lip/mouth tracking. Mouth center location and size estimates suffice, and are robust (Senior). Visual features suitable for speechreading. Pixel based, image transforms of video ROI (DCT, DWT, PCA, LDA). Audio-visual fusion (integration) for bimodal LVCSR. Decision fusion, by means of the multi-stream HMM. Exponent training. Confidence estimation of individual streams. Integration level (state, phone, word).
The Audio-Visual Database Previous work (CMU, AT&T, U.Grenoble, (X)M2VTS, Tulips databases). Small vocabulary (digits, alphas). Small number of subjects (1-10). Isolated or connected word speech. The IBM ViaVoice audio-visual database (VV-AV). ViaVoice enrollment sentences. 50 hrs of speech. >260 subjects. MPEG2 compressed, 60 Hz, 704 x 480 pixels, frontal subject view. Largest such database to date.
Facial feature location and tracking
Face and Facial Feature Tracking Locate the face in each frame (Senior, '99). Search image pyramid across scales & locations. Each square m x n region is considered as a face. Statistical, pixel based approach, using LDA and PCA. Locate 29 facial features (6 mouth ones) within located faces. Statistical, pixel based approach. Uses face geometry statistics to prune erroneous feature candidates.
Visual Speech Features (I) Lip shape based features. Lip area, height, width, perimeter (Petajan, Benoit). Lip image moments (Potamianos). Lip contour parametrization. Deformable template parameters (Stork). Active shape models (Luettin, Blake). Fourier descriptors (Potamianos). Fail to capture oral cavity information.
Visual Speech Features (II) Image transform, pixel based approach. Considers video region of interest: V(t) = { V(x,y,z): x = 1...m, y = 1...n, z = t-t_L...t+t_R }. Compresses the region of interest using an appropriate image transform (trained from the data), such as: Discrete cosine transform (DCT). Discrete wavelet transform (DWT). Principal component analysis (PCA). Linear discriminant analysis (LDA). Final feature vector: o_v(t) = S G P' V(t).
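A minimal sketch of the pixel-based idea, using PCA as a stand-in for the full DCT/DWT/LDA cascade (the ROI sizes, training data, and function names here are illustrative, not the authors' implementation):

```python
import numpy as np

# Sketch: learn a linear projection P from flattened training ROIs,
# then compress each m x n mouth ROI V(t) to a short visual feature
# vector o_v(t) = P' (V(t) - mean). All data below is synthetic.
rng = np.random.default_rng(0)
m, n, n_train, n_feat = 16, 16, 200, 10

train_rois = rng.normal(size=(n_train, m * n))   # flattened training ROIs
mean = train_rois.mean(axis=0)

# PCA basis from the training data (rows of vt = top eigenvectors).
_, _, vt = np.linalg.svd(train_rois - mean, full_matrices=False)
P = vt[:n_feat].T                                # shape (m*n, n_feat)

def visual_features(roi):
    """Project one m x n ROI onto the learned basis: o_v(t)."""
    return (roi.ravel() - mean) @ P

o_v = visual_features(rng.normal(size=(m, n)))   # 10-dimensional feature
```

In the actual system the projection would be a cascade (e.g. DCT followed by LDA and a per-frame normalization), but the compression step has the same linear form.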
CASCADE TRANSFORMATION [Bar chart comparing PCA, PCA+LDA, and PCA+LDA+MLLT features.]
AAM Features (Iain) Statistical model of: Shape: point distribution model (PDM). Appearance: shape-free eigenface. Combined shape + appearance: combined appearance model. Iterative fit algorithm: Active Appearance Model (AAM) algorithm. [Figure: modes of variation, -2 s.d. / mean / +2 s.d.]
Point Distribution Model [media clip]
Active Shape Models Active Shape Model (ASM): PDM in action; iterative fitting constrained in statistically learned shape space. Iterative fitting algorithm (Cootes et al.): Point-wise search along normal for edge, or best fit of function. Find pose update and use shape constraints. Reiterate until convergence. Simplex minimisation (Luettin et al.): Concatenate grey-level profiles. Use PCA to calculate grey-level profile distribution model (GLDM): x_p = x̄_p + P_p b_p.
Active Appearance Model (AAM) Extension of ASMs to model both appearance and shape in a single statistical model (Cootes and Taylor, 1998). Shape model: same PDM used with ASM tracking: x = x̄ + P_s b_s. Color appearance model: warp example landmark point model to mean shape; sample color and normalise; use PCA to calculate appearance model: g = ḡ + P_g b_g. Combined shape appearance model: concatenate shape and appearance parameters.
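The combined-model step above can be sketched as a second PCA over concatenated parameter vectors. This is an illustrative toy (synthetic data, made-up dimensions, identity shape weighting), not the Cootes-Taylor implementation:

```python
import numpy as np

# Sketch of the combined AAM model: concatenate shape parameters b_s
# (weighted by W_s to balance units against grey levels) with
# appearance parameters b_g, then run PCA on the joint vectors.
rng = np.random.default_rng(1)
n_ex, d_s, d_g = 100, 8, 20          # examples, shape dims, appearance dims

b_s = rng.normal(size=(n_ex, d_s))   # per-example shape parameters
b_g = rng.normal(size=(n_ex, d_g))   # per-example appearance parameters
W_s = np.eye(d_s)                    # shape weighting (identity here)

b = np.hstack([b_s @ W_s, b_g])      # concatenated parameter vectors
mean_b = b.mean(axis=0)

# PCA on the joint vectors gives the combined appearance modes.
_, _, vt = np.linalg.svd(b - mean_b, full_matrices=False)
Q = vt[:5].T                         # keep 5 combined modes

c = (b - mean_b) @ Q                 # combined appearance parameters c
```

A single parameter vector c then drives both shape and texture, which is what lets the AAM fit appearance and shape jointly.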
Face appearance models
Audio-Visual Sensor Fusion Feature fusion vs. decision fusion
Feature Fusion Method: Concatenate audio and visual observation vectors (after time-alignment): O(t) = [ O_A(t), O_V(t) ]. Train a single-stream HMM: Pr[ O(t) | state j ] = Σ_{m=1...M_j} w_jm N( O(t); μ_jm, S_jm ).
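The concatenation step can be sketched in a few lines; the frame rates and feature dimensions below are illustrative, and the nearest-frame alignment is a simple stand-in for whatever interpolation the real front end uses:

```python
import numpy as np

# Sketch of feature fusion: align the visual stream to the audio
# frame rate, then concatenate per-frame vectors into O(t).
rng = np.random.default_rng(2)
audio = rng.normal(size=(100, 24))   # e.g. 100 Hz cepstral features
video = rng.normal(size=(30, 10))    # e.g. 30 Hz visual features

# Upsample video to the audio rate by nearest-frame repetition.
idx = np.minimum((np.arange(100) * 30) // 100, 29)
video_aligned = video[idx]           # now 100 frames of visual features

# O(t) = [ O_A(t), O_V(t) ]: one concatenated vector per audio frame.
fused = np.hstack([audio, video_aligned])
```

A single-stream HMM with Gaussian-mixture output densities is then trained on the 34-dimensional fused vectors exactly as for audio-only features.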
Decision Fusion Formulation (general setting): S information sources about classes j ∈ J = {1,...,J}. Time-synchronous observations per source: O_s(t) ∈ R^{D_s}, s = 1,...,S. Multi-modal observation vector: O(t) = [ O_1(t), O_2(t), ..., O_S(t) ] ∈ R^D. Single-modal HMMs: Pr[ O_s(t) | j ] = Σ_{m=1...M_js} w_jms N_{D_s}( O_s(t); μ_jms, S_jms ). Multi-stream HMM: Score[ O(t) | j ] = Σ_{s=1...S} λ_js log Pr[ O_s(t) | j ]. Stream exponents: 0 ≤ λ_js ≤ S, Σ_{s=1...S} λ_js = S, j ∈ J.
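The multi-stream score is just an exponent-weighted sum of per-stream log-likelihoods. A minimal sketch (the log-likelihood values and exponents below are made up for illustration):

```python
import numpy as np

def multistream_score(log_likes, exponents):
    """Score[O(t)|j] = sum_s lambda_js * log Pr[O_s(t)|j].

    log_likes: per-stream log-likelihoods for state j.
    exponents: stream exponents lambda_js, constrained to sum to S.
    """
    log_likes = np.asarray(log_likes, dtype=float)
    exponents = np.asarray(exponents, dtype=float)
    # Enforce the slide's constraint: exponents sum to the number of streams.
    assert np.isclose(exponents.sum(), len(exponents)), \
        "stream exponents must sum to S"
    return float(exponents @ log_likes)

# Two streams (audio, visual), audio weighted up in clean acoustics.
score = multistream_score([-4.0, -9.0], [1.5, 0.5])
```

With equal exponents (all 1) the score reduces to the ordinary product of stream likelihoods in the log domain; shifting weight toward the more reliable stream is what the exponent training below is for.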
Multi-Stream HMM Training HMM parameters: Stream parameters: θ_s = [ ( w_jms, μ_jms, S_jms ), m = 1,...,M_js, j ∈ J ]. Stream exponents: θ_str = [ λ_js, j ∈ J, s = 1,...,S ]. Combined parameters: θ = [ θ_s, s = 1,...,S, θ_str ]. Train single-stream HMMs first (EM training): θ_s^(k+1) = argmax_{θ_s} Q( θ^(k), θ_s | O ), s = 1,...,S. Discriminatively train exponents by means of the GPD algorithm. Results (50-subject connected letters, clean speech): Audio-only word accuracy: 85.9%. Visual-only word accuracy: 36.5%. Audio-visual word accuracy: 88.9%.
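One way to picture the exponent step: GPD takes gradient steps on a discriminative loss, and each update must respect the constraints 0 ≤ λ_s ≤ S, Σ_s λ_s = S. The sketch below shows such a constrained update with a made-up gradient value; it is a simplified stand-in, not the actual GPD recipe:

```python
import numpy as np

def update_exponents(lam, grad, step=0.1):
    """One illustrative exponent update: gradient step, then project
    back onto the constraint set 0 <= lambda_s <= S, sum = S."""
    S = len(lam)
    lam = np.asarray(lam, dtype=float) - step * np.asarray(grad, dtype=float)
    lam = np.clip(lam, 0.0, float(S))   # keep each exponent in [0, S]
    return lam * (S / lam.sum())        # renormalize so they sum to S

# Two streams; a loss gradient (illustrative) favoring stream 1.
lam = update_exponents([1.0, 1.0], [-1.0, 1.0])
```

After the step the audio exponent rises and the visual one falls while the sum stays at S = 2, which is the qualitative behavior exponent training should show when one stream is more reliable.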
Multi-stream HMM (Luettin et al.) Can allow for asynchrony between streams. [Figure: acoustic and visual feature streams.]
Audio-visual decoding
Fusion Research Issues Stream independence assumption is unrealistic. Probability is preferable to a "score" formulation. Integration level: feature, state, phone, word. Classes of interest: visemes vs. phonemes. Stream confidence estimation: entropy, SNR based, smoothly time varying? General decision fusion function: Score[ O(t) | j ] = f( Pr[ O_s(t) | j ], s = 1,...,S ). Lattice / n-best rescoring techniques.
Audio-Visual Speech Recognition Word recognition experiments: 5.5 hrs of training, 162 spkrs; 1.4 hrs of testing, 162 spkrs. Audio features: cepstra+LDA+MLLT. Visual features: DWT+LDA+MLLT. Audio-visual feature fusion: (cepstra, DWT)+LDA+MLLT. Noise: speech noise at 10 dB SNR. [Bar chart: word accuracy (%) for audio, visual, and A-V systems, clean vs. 10 dB SNR.]
Summary Workshop concentration will be on decision fusion techniques for audio-visual ASR. Multi-stream HMM. Lattice rescoring. Others. We (IBM) are providing the LVCSR AV database and baseline lattices and models. This workshop is a unique opportunity to advance the state of the art in audio-visual ASR.
Scientific Merit Significantly improve LVCSR robustness Digit/alpha SD -> LVCSR SI First step towards perceptual computing Use of diverse sources of information (audio, visual - gaze, pointing, facial gestures, touch) to robustly extract human activity and intent for human information interaction (HII) An exemplar for developing a generalized framework for information fusion Measures for reliability of information sources and their use in making a joint decision Fusion functions
Applications Perceptual interfaces for HII: speech recognition, speaker identification, speaker segmentation, human intent detection, speech understanding, speech synthesis, scene understanding Multimedia content indexing and retrieval (MPEG7 context) Internet digital media content is exploding Network companies have aggressive plans for digitizing archival content HCI for people with special needs e.g., people with speech impairments due to hearing impairment
Workshop Planning Pre-workshop: Data preparation (IBM VVAV and LDC CNN): data collection, transcription and annotation Audio and visual front ends (DCT) Setup for AAM feature extraction Trained audio-only and visual-only HMMs. Audio-only lattices. Workshop: AAM based visual front end Baseline audio-visual HMM using feature fusion. Lattice rescoring experiments. Time-varying HMM stream exponent estimation. Stream information / confidence measures. Adaptation to CNN data Other recombination functions.