Xing Fan, Carlos Busso and John H.L. Hansen


Slide 1: Title
Xing Fan, Carlos Busso and John H.L. Hansen
Center for Robust Speech Systems (CRSS)
Erik Jonsson School of Engineering & Computer Science
Department of Electrical Engineering
University of Texas at Dallas, Richardson, Texas 75083-0688, U.S.A.
Email: {xxf064000, busso, John.Hansen}@utdallas.edu
EUSIPCO 2011, August 29 - September 2, 2011, Barcelona, Spain

Slide 2: Whispered Speech (Acoustic Domain)
- Absence of periodic excitation
- Formant shifting in F1 and F2
- Low volume and longer duration
- Significant degradation in performance for ASR systems trained with neutral speech

Slide 3: Whisper-Neutral Audio-Lip Data
- One male native American English-speaking subject
- 300 digit tokens (digits 0-9, randomly ordered) read by the subject in both whispered and neutral modes
- Both audio and visual information are recorded by a camera
- Video: 720x576, color, captured at 25 frames/sec
- Audio: collected with the camera's microphone ~70 cm from the subject at a 44.1 kHz sample rate, then down-sampled to 16 kHz
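The down-sampling step can be done exactly, since 16 kHz is a rational fraction of 44.1 kHz (44100 * 160/441 = 16000). A minimal sketch, with illustrative file names that are not from the paper:

```python
# Hypothetical sketch: down-sample the 44.1 kHz camera audio to 16 kHz,
# as described on this slide. File names are illustrative stand-ins.
import soundfile as sf
from scipy.signal import resample_poly

audio, rate = sf.read("digits_session.wav")
assert rate == 44100
# 44100 * 160 / 441 = 16000, so an exact rational resampling is possible.
audio_16k = resample_poly(audio, up=160, down=441)
sf.write("digits_session_16k.wav", audio_16k, 16000)
```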

Slide 4: Pre-processing (extraction of the lip ROI)
- Step 1: Use gravity-center detection to find the center of the lips
- Step 2: Use RGB color vectors to detect the lip boundary
  - Development lip samples are used to train a lip-color GMM
  - For each input image, every pixel is scored against the GMM, and the boundary is determined from the output score matrix (a sketch of this step follows)
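A minimal sketch of the lip-boundary step, assuming a GMM trained on development lip-pixel RGB samples. sklearn is our choice here (the slide does not name a toolkit), and the sample data, frame, and threshold are stand-ins:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-ins: real inputs would be RGB values from hand-labeled development
# lip regions and an actual 720x576 video frame.
lip_rgb_samples = np.random.rand(500, 3)
frame = np.random.rand(576, 720, 3)

# Train the lip-color GMM on the development lip samples.
lip_gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
lip_gmm.fit(lip_rgb_samples)

# Score every pixel under the lip-color GMM; this score matrix is what the
# slide thresholds to determine the lip boundary.
h, w, _ = frame.shape
score_matrix = lip_gmm.score_samples(frame.reshape(-1, 3)).reshape(h, w)
lip_mask = score_matrix > score_matrix.mean() + score_matrix.std()  # illustrative threshold
```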

Slide 5: Eigenlips
A total of 400 lip samples are employed for calculating the eigenlips.

Slide 6: Steps for calculating the eigenlips (a numpy sketch follows)
1. Calculate the mean lip image
2. Normalize each image (subtract the mean lip)
3. Calculate the covariance matrix C
4. Obtain the eigenvectors and eigenvalues of C
5. Sort the eigenvalues and find the corresponding eigenvectors
6. The eigenvectors with the 15 largest eigenvalues are chosen as the eigenlips
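A minimal numpy sketch of these six steps, including the projection and reconstruction shown on the next slide. The 32x32 ROI size and random data are stand-ins, not the paper's values:

```python
import numpy as np

X = np.random.rand(400, 32 * 32)   # stand-in: 400 vectorized lip-ROI images

mean_lip = X.mean(axis=0)                    # 1) mean lip image
Xc = X - mean_lip                            # 2) normalize: subtract the mean
C = np.cov(Xc, rowvar=False)                 # 3) covariance matrix C
eigvals, eigvecs = np.linalg.eigh(C)         # 4) eigenvalues/eigenvectors of C
order = np.argsort(eigvals)[::-1]            # 5) sort by descending eigenvalue
eigenlips = eigvecs[:, order[:15]]           # 6) keep the top-15 eigenvectors

# Slide 7: project lip images onto the eigenspace and reconstruct them.
coeffs = Xc @ eigenlips                      # (400, 15) eigenlip features
reconstructed = coeffs @ eigenlips.T + mean_lip
```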

Slide 7: [Figure] Mean lip, original lip images, and lip reconstructions obtained by projecting onto the eigenspace.

Slide 8: [Figure] HMM structures used in this study for the MFCC (audio) and eigenlip (video) streams.

Slide 9: Stream HMM Training
Audio and video HMMs are trained independently in this study (a training sketch follows).
Audio:
- Training feature: MFCC_0_D_A (39-d)
- Number of states: 16 (2 non-emitting)
- Number of mixtures: 1
Video:
- Training feature: eigenlips plus deltas (30-d)
- Number of states: 7 (2 non-emitting)
- Number of mixtures: 1
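A hedged sketch of this per-word, per-stream training, with hmmlearn standing in for the unnamed toolkit (an assumption). hmmlearn models only the emitting states and defaults to an ergodic rather than HTK-style left-to-right topology, so only the counts and features below come from the slide:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Stand-ins: per-digit lists of (frames x dim) feature arrays.
rng = np.random.default_rng(0)
mfcc_by_word = {d: [rng.standard_normal((80, 39)) for _ in range(5)] for d in "0123456789"}
eigenlip_by_word = {d: [rng.standard_normal((50, 30)) for _ in range(5)] for d in "0123456789"}

def train_word_hmms(features_by_word, n_states):
    """Train one single-Gaussian HMM per digit from that digit's utterances."""
    models = {}
    for word, utts in features_by_word.items():
        X = np.concatenate(utts)
        lengths = [len(u) for u in utts]
        models[word] = GaussianHMM(n_components=n_states, covariance_type="diag").fit(X, lengths)
    return models

audio_models = train_word_hmms(mfcc_by_word, n_states=14)     # 16 states - 2 non-emitting
video_models = train_word_hmms(eigenlip_by_word, n_states=5)  # 7 states - 2 non-emitting
```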

Slide 10: Decoding
- Synchrony is constrained at the boundary of each word
- Audio and video scores are log-linearly combined
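The slide does not spell out the combination; a standard word-level log-linear fusion, consistent with the weight sweep on the next slide, would be (with lambda the audio-stream weight):

```latex
\log P(w \mid O_A, O_V) \propto \lambda \log P(O_A \mid w) + (1 - \lambda) \log P(O_V \mid w)
```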

Slide 11: Adjusting the weight of the audio score versus the lip score gives the word accuracies plotted on this slide (the training data contains only neutral speech). [Figure: word accuracy vs. stream weight]
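An illustrative sweep of the audio-vs-lip weight, the quantity plotted on this slide. The log-likelihood matrices and labels are random stand-ins; real values would come from the two stream decoders:

```python
import numpy as np

# Stand-ins: per-utterance log-likelihoods of each digit under the audio and
# video word HMMs, plus reference labels.
rng = np.random.default_rng(0)
audio_ll = rng.standard_normal((300, 10))
video_ll = rng.standard_normal((300, 10))
labels = rng.integers(0, 10, 300)

for lam in np.linspace(0.0, 1.0, 11):
    fused = lam * audio_ll + (1.0 - lam) * video_ll   # log-linear combination
    hyp = fused.argmax(axis=1)                        # best digit per utterance
    print(f"lambda={lam:.1f}  word accuracy={100.0 * (hyp == labels).mean():.1f}%")
```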

Slide 12: Overall Word Accuracy

Stream     Training   Test      Word Accuracy (%)
Audio      neutral    neutral   98.7
Audio      whisper    whisper   83.3
Audio      neutral    whisper   42.7
Video      neutral    neutral   70.7
Video      whisper    whisper   68.0
Video      neutral    whisper   54.7
Combined   neutral    whisper   79.7

- The audio-based system achieves a good baseline: 98.7% on neutral speech
- Testing the neutral-trained audio system on whispered speech causes a significant performance loss: -56% (98.7% to 42.7%)
- Combining the audio and visual streams improves whisper-test performance to 79.7% (+38%)

Slide 13: Conclusions
- A small digit corpus was developed for an exploratory study of audio-visual speech recognition of whispered speech.
- An eigenlip-based feature extraction method is applied to the visual data.
- A multi-stream framework is built using audio- and video-stream HMMs.
- Significant improvement in word accuracy is demonstrated with this multi-stream system.