Joint Processing of Audio and Visual Information for Speech Recognition

Joint Processing of Audio and Visual Information for Speech Recognition
Chalapathy Neti and Gerasimos Potamianos, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598
With Iain Matthews (CMU, USA) and Juergen Luettin (IDIAP, Switzerland)
Baltimore, 07/14/00

JHU Workshop, 2000
Theme: Audio-visual speech recognition.
Goal: Use visual information to improve audio-based speech recognition, particularly in acoustically degraded conditions.
TEAM:
- Senior researchers: Chalapathy Neti (lead), Gerasimos Potamianos (IBM), Iain Matthews (CMU), Juergen Luettin (IDIAP, Switzerland), Andreas Andreou (JHU).
- Graduate students: Herve Glotin (ICP, Grenoble), Eugenio Culurciello (JHU).
- Undergraduates: Azad Mashari (U. Toronto), June Sison (UCSC).
DATA: Subset of the IBM ViaVoice audio-visual (VVAV) data, LDC broadcast video data.

Perceptual Information Interfaces (PII)
Joint processing of audio, visual and other information for active acquisition, recognition and interpretation of objects and events.
- Is anybody speaking? Speech, music, etc.
- Who is speaking? What is being said? Speaker state (e.g., emotional state) while speaking?
- What is the environment? Office, studio, street, etc.
- Who is in the environment? Crowd, meeting, anchor, etc.

Audio-Visual Processing Technologies
Exploit visual information to improve audio-based processing in adverse acoustic conditions. Example contexts:
- Audio-Visual Speech Recognition (MMSP99, ICASSP00, ICME00): combine visual speech (visemes) with audio speech (phones) recognition.
- Audio-Visual Speaker Recognition (AVSP99, MMSP99, ICME00): combine audio signatures of a person with visual signatures (face-id).
- Audio-Visual Speaker Segmentation (RIAO00): combine visual scene change with audio-based speaker change.
- Audio-Visual Speech-Event Detection (ICASSP00): combine visual speech onset cues with audio-based speech energy.
- Audio-Visual Speech Synthesis (ICME00).
Applications: human-computer interaction, multimedia content indexing, HCI for people with special needs.
In this workshop, we propose to attack the problem of large-vocabulary (LV) audio-visual speech recognition.
[Block diagram: audio and video data streams combined by decision fusion into an A-V descriptor and decision.]

Audio-Visual ASR: Motivation
OBJECTIVE: To significantly improve ASR performance by joint use of audio and visual information.
FACTS / MOTIVATION:
- Lack of ASR robustness to noise and mismatched training-testing conditions.
- Humans speechread to increase the intelligibility of noisy speech (Summerfield experiment).
- Humans fuse audio and visual information in speech recognition (McGurk effect: acoustic /ga/ + visual /ba/ is perceived as /da/).
- Hearing-impaired people speechread well.
- Visual and audio information are partially complementary: acoustically confusable phonemes belong to different viseme classes.
- Capture of visual information is inexpensive.

Transcription Accuracy
Results from the '98 DARPA evaluation (IBM Research system).
[Bar chart: word accuracy (%) by focus condition: prepared speech, conversation, telephone, background music, background noise, foreign accent, combinations, overall.]

Bimodal Human Speech Recognition (Summerfield, '79)
Experiment: human transcription of audio-visual speech at SNR < 0 dB.
[Bar chart: word recognition (%) under four conditions: A - acoustic only; B - acoustic + full video; C - acoustic + lip region; D - acoustic + 4 lip points.]

Visemes vs. Phonemes
Viseme classes:
1. f, v
2. th, dh
3. s, z
4. sh, zh
5. p, b, m
6. w
7. r
8. g, k, n, t, d, y
9. l
10. iy
11. ah
12. eh
A viseme is a unit of visual speech. Phoneme to viseme is a many-to-one mapping. Viseme classes are different from phonetic classes: e.g., /m/ and /n/ belong to the same acoustic class (nasals), whereas they belong to separate viseme classes.
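
For illustration, a minimal sketch of this many-to-one phoneme-to-viseme mapping as a lookup table (class numbering follows the slide; the dictionary encoding itself is just illustrative, not part of the original system):

```python
# Hypothetical encoding of the phoneme-to-viseme mapping shown on the slide.
VISEME_CLASSES = {
    1: ["f", "v"],
    2: ["th", "dh"],
    3: ["s", "z"],
    4: ["sh", "zh"],
    5: ["p", "b", "m"],
    6: ["w"],
    7: ["r"],
    8: ["g", "k", "n", "t", "d", "y"],
    9: ["l"],
    10: ["iy"],
    11: ["ah"],
    12: ["eh"],
}

# Invert to a phoneme -> viseme lookup (many phonemes map to one viseme).
PHONEME_TO_VISEME = {p: v for v, phones in VISEME_CLASSES.items() for p in phones}

# /m/ and /n/ share an acoustic class (nasals) but fall in different viseme classes.
assert PHONEME_TO_VISEME["m"] != PHONEME_TO_VISEME["n"]
```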

McGurk Effect

Visual-Only ASR Performance
Digit / alpha recognition (Potamianos): single-speaker digits = 97.0%, single-speaker alphas = 71.1%, 50-speaker alphas = 36.5%.
Phonetic classification (Potamianos, Neti, Verma, Iyengar): 162 speakers (VV-AV), LVCSR = 37.3%.
[Bar chart: single-speaker digit accuracy (%) for lip-shape, DWT, PCA and LDA features.]

Audio-Visual ASR: Challenges
- Suitable databases are just emerging. At IBM (Giri, Potamianos, Neti): collected a one-of-a-kind LVCSR audio-visual database (50 hrs, >260 subjects). Previous work: AT&T (Potamianos), 1.5 hrs, 50 subjects, connected letters; Surrey (Kittler), M2VTS (37 subjects, connected digits).
- Visual front end: face and lip/mouth tracking. Mouth center location and size estimates suffice, and are robust (Senior).
- Visual features suitable for speechreading: pixel-based, image transforms of the video ROI (DCT, DWT, PCA, LDA).
- Audio-visual fusion (integration) for bimodal LVCSR: decision fusion by means of the multi-stream HMM; exponent training; confidence estimation of individual streams; integration level (state, phone, word).

The Audio-Visual Database
Previous work (CMU, AT&T, U. Grenoble, (X)M2VTS, Tulips databases): small vocabulary (digits, alphas), small number of subjects (1-10), isolated or connected word speech.
The IBM ViaVoice audio-visual database (VV-AV): ViaVoice enrollment sentences, 50 hrs of speech, >260 subjects, MPEG-2 compressed, 60 Hz, 704 x 480 pixels, frontal subject view. The largest such database to date.

Facial feature location and tracking

Face and Facial Feature Tracking
Locate the face in each frame (Senior, '99): search an image pyramid across scales and locations; each square m x n region is considered as a candidate face; statistical, pixel-based approach using LDA and PCA.
Locate 29 facial features (6 of them on the mouth) within located faces: statistical, pixel-based approach; uses face geometry statistics to prune erroneous feature candidates.
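
A minimal sketch of this kind of pyramid face search, assuming a pre-trained linear (LDA/PCA-style) face scorer with weights w and bias b; all names and parameters below are illustrative, not the authors' implementation:

```python
import numpy as np

def score_patch(patch, w, b):
    # Hypothetical linear face/non-face discriminant on normalized pixels.
    x = (patch - patch.mean()) / (patch.std() + 1e-8)
    return float(w @ x.ravel() + b)

def find_face(frame, w, b, size=16, steps=(1, 2, 4), stride=4):
    """Scan a crude image pyramid; return (score, (row, col, side)) of the best square region."""
    best_score, best_box = -np.inf, None
    for k in steps:                       # pyramid level: subsample the frame by k
        img = frame[::k, ::k]
        for y in range(0, img.shape[0] - size, stride):
            for x in range(0, img.shape[1] - size, stride):
                s = score_patch(img[y:y + size, x:x + size], w, b)
                if s > best_score:
                    best_score, best_box = s, (y * k, x * k, size * k)
    return best_score, best_box
```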

Visual Speech Features (I)
Lip-shape based features:
- Lip area, height, width, perimeter (Petajan, Benoit).
- Lip image moments (Potamianos).
- Lip contour parametrization: deformable template parameters (Stork), active shape models (Luettin, Blake), Fourier descriptors (Potamianos).
These features fail to capture oral cavity information.

Visual Speech Features (II)
Image-transform, pixel-based approach. Considers a video region of interest over a temporal window:
V(t) = { V(x, y, z) : x = 1,...,m; y = 1,...,n; z = t - t_L,...,t + t_R }.
Compresses the region of interest using an appropriate image transform (trained from the data), such as:
- Discrete cosine transform (DCT)
- Discrete wavelet transform (DWT)
- Principal component analysis (PCA)
- Linear discriminant analysis (LDA)
Final feature vector (a cascade of linear transforms applied to the ROI): o_v(t) = S G P' V(t).
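
As an illustration of this pixel-based, image-transform idea, here is a minimal sketch of DCT compression of a mouth ROI; the ROI size, the number of retained coefficients, and the scipy-based implementation are assumptions for the example, not the system described above:

```python
import numpy as np
from scipy.fft import dctn

def dct_roi_features(roi, keep=6):
    """Compress an m x n mouth ROI with a 2-D DCT and keep the
    low-frequency keep x keep block as the visual feature vector."""
    coeffs = dctn(roi.astype(float), norm="ortho")
    return coeffs[:keep, :keep].ravel()

# Example: a 32x32 grayscale mouth ROI -> 36-dimensional feature vector.
roi = np.random.rand(32, 32)
o_v = dct_roi_features(roi)
print(o_v.shape)  # (36,)
```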

Cascade Transformation
[Bar chart: accuracy comparison of PCA, PCA+LDA and PCA+LDA+MLLT feature cascades.]

AAM Features (Iain Matthews)
Statistical model of:
- Shape: point distribution model (PDM).
- Appearance: shape-free eigenface.
- Combined shape + appearance: combined appearance model.
Iterative fit algorithm: the Active Appearance Model (AAM) algorithm.
[Figure: model modes shown at -2 s.d., mean, and +2 s.d.]

Point Distribution Model [media clip]

Active Shape Models
Active Shape Model (ASM): the PDM in action; iterative fitting constrained to a statistically learned shape space.
Iterative fitting algorithm (Cootes et al.): point-wise search along the normal for an edge, or best fit of a function; find the pose update and apply shape constraints; reiterate until convergence.
Simplex minimisation (Luettin et al.): concatenate grey-level profiles; use PCA to calculate a grey-level profile distribution model (GLDM): x_p = x̄_p + P_p b_p.
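
A minimal sketch of the shape-constraint step used in such iterative fitting, assuming a PCA shape model (mean shape, mode matrix and mode variances) has already been trained; the +/-3 s.d. clipping bound is an illustrative convention, not taken from the slide:

```python
import numpy as np

def constrain_shape(x, x_mean, P, eigvals, n_sd=3.0):
    """Project a candidate shape onto the learned shape space and clip
    each mode to +/- n_sd standard deviations (x = x_mean + P b)."""
    b = P.T @ (x - x_mean)                 # shape parameters of the candidate
    limit = n_sd * np.sqrt(eigvals)
    b = np.clip(b, -limit, limit)          # keep the shape statistically plausible
    return x_mean + P @ b
```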

Active Appearance Model (AAM)
Extension of ASMs to model both appearance and shape in a single statistical model (Cootes and Taylor, 1998).
Shape model: the same PDM used for ASM tracking, x = x̄ + P_s b_s.
Colour appearance model: warp each example's landmark-point model to the mean shape, sample the colour and normalise, use PCA to calculate the appearance model, g = ḡ + P_g b_g.
Combined shape-appearance model: concatenate the shape and appearance parameters.
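
A minimal sketch of the combined-model step, assuming the per-example shape parameters b_s and appearance parameters b_g are already available; the unit-balancing weight and the retained-variance threshold are assumptions for the example:

```python
import numpy as np

def combined_appearance_model(B_s, B_g, var_keep=0.98):
    """PCA on concatenated (weighted shape, appearance) parameters.
    B_s: (N, k_s) shape params per example; B_g: (N, k_g) appearance params."""
    w = np.sqrt(B_g.var() / (B_s.var() + 1e-12))   # illustrative weight balancing shape vs. appearance units
    B = np.hstack([w * B_s, B_g])
    B_centered = B - B.mean(axis=0)
    U, S, Vt = np.linalg.svd(B_centered, full_matrices=False)
    var = (S ** 2) / (S ** 2).sum()
    k = int(np.searchsorted(np.cumsum(var), var_keep)) + 1   # modes covering var_keep of the variance
    return Vt[:k], w                                          # combined modes and the shape weight
```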

Face appearance models

Audio-Visual Sensor Fusion Feature fusion vs. decision fusion

Feature Fusion
Method: concatenate the audio and visual observation vectors (after time-alignment):
O(t) = [ O_A(t), O_V(t) ].
Train a single-stream HMM:
P[ O(t) | state j ] = Σ_{m=1..M_j} w_jm N( O(t); μ_jm, S_jm ).
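
A minimal sketch of the concatenation step, assuming the visual features have already been interpolated to the audio frame rate; the feature dimensions in the example are illustrative:

```python
import numpy as np

def fuse_features(audio_feats, visual_feats):
    """Concatenate time-aligned audio and visual observation vectors:
    O(t) = [O_A(t), O_V(t)]."""
    T = min(len(audio_feats), len(visual_feats))   # assumes prior time-alignment
    return np.hstack([audio_feats[:T], visual_feats[:T]])

# Example: 100 frames of 60-d audio cepstra and 41-d visual features -> 101-d fused vectors.
O_av = fuse_features(np.random.rand(100, 60), np.random.rand(100, 41))
print(O_av.shape)  # (100, 101)
```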

Decision Fusion
Formulation (general setting):
- S information sources about classes j ∈ J = {1,...,J}.
- Time-synchronous observations per source: O_s(t) ∈ R^{D_s}, s = 1,...,S.
- Multi-modal observation vector: O(t) = [ O_1(t), O_2(t),..., O_S(t) ] ∈ R^D.
- Single-modal HMMs: Pr[ O_s(t) | j ] = Σ_{m=1..M_js} w_jms N_{D_s}( O_s(t); μ_jms, S_jms ).
- Multi-stream HMM: Score[ O(t) | j ] = Σ_{s=1..S} λ_js log Pr[ O_s(t) | j ].
- Stream exponents: 0 ≤ λ_js ≤ S, Σ_{s=1..S} λ_js = S, j ∈ J.
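
A minimal sketch of the multi-stream score combination, assuming the per-stream log-likelihoods for a given class have already been computed; the example values and the renormalisation of the exponents are illustrative:

```python
import numpy as np

def multistream_score(log_likelihoods, exponents):
    """Score[O(t) | j] = sum_s lambda_js * log Pr[O_s(t) | j],
    with exponents constrained to sum to the number of streams S."""
    lam = np.asarray(exponents, dtype=float)
    lam = lam * len(lam) / lam.sum()          # enforce sum_s lambda_s = S
    return float(np.dot(lam, log_likelihoods))

# Example: audio and visual log-likelihoods, weighting the audio stream more heavily.
print(multistream_score([-12.3, -20.1], [1.4, 0.6]))
```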

Multi-Stream HMM Training
HMM parameters:
- Stream parameters: λ_s = [ ( w_jms, μ_jms, S_jms ), m = 1,...,M_js, j ∈ J ].
- Stream exponents: λ_str = [ λ_js, j ∈ J, s = 1,...,S ].
- Combined parameters: λ = [ λ_s, s = 1,...,S; λ_str ].
Train single-stream HMMs first (EM training): λ_s^(k+1) = argmax_{λ_s} Q( λ^(k), λ_s | O ), s = 1,...,S.
Discriminatively train the exponents by means of the GPD algorithm.
Results (50-subject connected letters, clean speech):
- Audio-only word accuracy: 85.9%
- Visual-only word accuracy: 36.5%
- Audio-visual word accuracy: 88.9%
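
The slide trains the exponents discriminatively with the GPD algorithm; as a simpler, hedged illustration of how stream exponents can be chosen, here is a grid-search sketch on held-out data (this is not the GPD algorithm, and all names are illustrative):

```python
import numpy as np

def pick_exponents(audio_ll, visual_ll, labels, grid=np.linspace(0.0, 2.0, 21)):
    """Choose audio/visual exponents (lambda_a + lambda_v = 2) that maximize
    held-out classification accuracy. audio_ll, visual_ll: (N, J) per-class
    log-likelihoods; labels: (N,) true class indices."""
    best = (-1.0, None)
    for lam_a in grid:
        lam_v = 2.0 - lam_a
        score = lam_a * audio_ll + lam_v * visual_ll      # multi-stream score per class
        acc = float((score.argmax(axis=1) == labels).mean())
        if acc > best[0]:
            best = (acc, (lam_a, lam_v))
    return best
```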

Multistream HMM (Juergen Luettin et al.): can allow for asynchrony between the acoustic and visual feature streams. [Figure: coupled acoustic and visual feature streams.]

Audio-visual decoding

Fusion Research Issues
- The stream independence assumption is unrealistic.
- Probability is preferable to a "score" formulation.
- Integration level: feature, state, phone, word.
- Classes of interest: visemes vs. phonemes.
- Stream confidence estimation: entropy based, SNR based, smoothly time varying?
- General decision fusion function: Score[ O(t) | j ] = f( Pr[ O_s(t) | j ], s = 1,...,S ).
- Lattice / n-best rescoring techniques.

Audio-Visual Speech Recognition
Word recognition experiments: 5.5 hrs of training, 162 speakers; 1.4 hrs of testing, 162 speakers.
Audio features: cepstra + LDA + MLLT. Visual features: DWT + LDA + MLLT. Audio-visual feature fusion: (cepstra, DWT) + LDA + MLLT.
Noise condition: speech noise at 10 dB SNR.
[Bar chart: word accuracy (%) for audio-only, visual-only and audio-visual systems in clean and 10 dB SNR conditions.]

Summary Workshop concentration will be on decision fusion techniques for audio-visual ASR. Multi-stream HMM. Lattice rescoring. Others. We (IBM) are providing the LVCSR AV database and baseline lattices and models. This workshop is a unique opportunity to advance the state of the art in audio-visual ASR.

Scientific Merit
- Significantly improve LVCSR robustness: from speaker-dependent digit/alpha tasks to speaker-independent LVCSR.
- A first step towards perceptual computing: use of diverse sources of information (audio; visual - gaze, pointing, facial gestures; touch) to robustly extract human activity and intent for human information interaction (HII).
- An exemplar for developing a generalized framework for information fusion: measures for the reliability of information sources and their use in making a joint decision; fusion functions.

Applications
- Perceptual interfaces for HII: speech recognition, speaker identification, speaker segmentation, human intent detection, speech understanding, speech synthesis, scene understanding.
- Multimedia content indexing and retrieval (MPEG-7 context): Internet digital media content is exploding; network companies have aggressive plans for digitizing archival content.
- HCI for people with special needs, e.g., speech-impaired people due to hearing impairment.

Workshop Planning
Pre-workshop:
- Data preparation (IBM VVAV and LDC CNN data): data collection, transcription and annotation.
- Audio and visual front ends (DCT); setup for AAM feature extraction.
- Trained audio-only and visual-only HMMs; audio-only lattices.
Workshop:
- AAM-based visual front end.
- Baseline audio-visual HMM using feature fusion.
- Lattice rescoring experiments.
- Time-varying HMM stream exponent estimation.
- Stream information / confidence measures.
- Adaptation to CNN data.
- Other recombination functions.