Audio-Visual Speech Processing System for Polish with Dynamic Bayesian Network Models


Proceedings of the World Congress on Electrical Engineering and Computer Systems and Science (EECSS 2015)
Barcelona, Spain, July 13-14, 2015
Paper No. 343

Tomasz Jadczyk, Mariusz Ziółko
AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland, www.dsp.agh.edu.pl
Techmo sp. z o.o., Kraków, Poland, techmo.pl
jadczyk@agh.edu.pl; ziolko@agh.edu.pl

Abstract - In this paper we describe a speech processing system for Polish which uses both acoustic and visual features and is based on Dynamic Bayesian Network (DBN) models. The visual modality extracts information from the speaker's lip movements and is based alternatively on raw pixels compressed with the discrete cosine transform (DCT) or on Active Appearance Model (AAM) features. The acoustic modality is enhanced by using two parametrizations: standard MFCC and discrete wavelet transform (DWT) features, extracted for frequency subbands designed in accordance with human perception. Asynchrony between the audio and visual modalities is also represented within the DBN model. Our system is evaluated on the Audiovisual Speech Corpus of Polish, which contains the most popular phrases from Polish dialogs, uttered by 24 speakers. Experiments under various clean and noisy conditions confirmed that using the double-parametrized audio together with the visual modality (either DCT or AAM) in the DBN model can reduce the Phrase Recognition Error Rate by up to 35% under low signal-to-noise-ratio (SNR) conditions, compared to a standard audio-only, Hidden Markov Model based system with MFCC parametrization.

Keywords: audio-visual speech recognition, visual feature extraction, Dynamic Bayesian Networks

1 Introduction

Speech processing is a key element of natural human-computer interaction. Automatic speech recognition (ASR) systems based on hidden Markov models (HMM) have become an industry standard. Such systems perform well under good noise conditions, but their performance degrades rapidly when the signal-to-noise ratio (SNR) falls. Using deep neural networks (DNN) has led to some improvements in ASR accuracy, but low SNR remains an issue. It can be especially difficult to secure good acoustic conditions in public areas where spoken language processing is particularly applicable, such as automatic information points (kiosks) on streets, at airports, etc.

Incorporating the visual modality into an ASR system can help overcome difficult conditions and yield a robust system, as shown by (Potamianos et al. 2003). Additionally, using multiple parametrizations of the same speech modality can lead to further improvements (Gowdy et al. 2004). The main challenge when using multiple streams is finding an efficient way to fuse them, so that the performance of the combined streams is never worse than that of any single stream used independently. An efficient technique for data fusion is based on dynamic Bayesian networks (Murphy 2002), which allow data integration based on model fusion (Shivappa et al. 2010). Additional improvements can be obtained when the DBN-based model also incorporates constrained de-synchronization between modalities (Saenko and Livescu 2006), while streams from the same modality remain synchronized.

In this work, we propose the use of the DBN algorithm and DBN-based models for combining information from an enhanced, double-parametrized audio stream and an additional visual stream. Moreover, our model is able to represent some level of asynchrony between the audio and visual modalities. This paper is organized as follows: in Section 2 we describe the general DBN algorithm and present the network used in our system; Section 3 presents the scheme of our system, the data flow and the parametrization methods for each stream, and then describes the database used for training and evaluation; results are reported in Section 4 and final conclusions are presented in Section 5.

2 Dynamic Bayesian Network

A Bayesian network encodes dependencies between a set of random variables; the variables are represented as the nodes and the dependencies as the edges of a directed graph. A dynamic Bayesian network is an extension of the plain network used for modeling the evolution of random variables over time. This is achieved by repeating the network structure and connecting corresponding nodes. In speech recognition, the nodes of the network represent hidden variables (words, phonemes, transitions) and observed variables (acoustic features such as MFCC), while the edges correspond to conditional probability functions or deterministic dependencies (Bilmes and Bartels 2005). For example, when modeling the word-internal phone position and the phone transition, the next phone is determined by the word transcription and the occurrence of a phone transition in the previous time slice. A random transition may be used to incorporate language models.

2.1 DBN models used in the AVASR system

The structure of the model used in our system is shown in Figure 1. Observable features are presented as filled circles. For the audio streams, features extracted with MFCC are marked as O_M, while O_D denotes the wavelet features. Features from the speaker's mouth region, used in the video stream, are labeled O_V. To model asynchrony between the modalities, there are two different nodes representing the phone in the audio modality and the viseme in the visual modality, but both modalities share a single word variable. Within the same time slot, the word variable W depends on both the P and V variables, while the word transition depends on the phone transition P_Tr and the viseme transition V_Tr, respectively. The audio signal is parametrized with two independent algorithms, but both audio streams are synchronized on the same phone.

3 AVASR System

The AVASR system is based on our standard ASR system for Polish, called Sarmata (Ziółko et al. 2011). The ASR system works with HMM models and context-dependent phoneme representations. The audio-visual extension is implemented in C#.NET, using the EmguCV framework (Web-1) for image processing and computer vision algorithms and the Infer.NET framework (Minka et al. 2014) for DBN modeling. EmguCV is a .NET wrapper for the well-known C++ library OpenCV.

3.1 Data flow scheme

The general data flow in our system is presented in Figure 2. The first part, Voice Activity Detection (VAD), is based only on audio stream processing and is used to reduce computation. The audio stream is parametrized with two separate algorithms. For the video stream, a region of interest (ROI) containing only the mouth area is extracted first and then parametrized either with the Active Appearance Model algorithm or with simple pixel intensities from the ROI reduced in dimensionality with the DCT (described in Section 3.3). All features are passed to the DBN algorithm, which infers the word hypothesis.
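To make the data flow concrete, the sketch below outlines the pipeline of Fig. 2 in C#, the language of the implementation. All interface and type names here are hypothetical placeholders introduced only for illustration; the actual system is built from Sarmata, EmguCV and Infer.NET components that are not part of this paper, and frame-level alignment between audio and video is omitted for brevity.

```csharp
using System.Collections.Generic;

// Hypothetical placeholder interfaces sketching the Fig. 2 pipeline.
public interface IVoiceActivityDetector { IEnumerable<(int Start, int End)> Detect(float[] audio); }
public interface IAudioFeatureStream { float[][] Extract(float[] audioSegment); }              // MFCC or DWT
public interface IVisualFeatureStream { float[][] Extract(IReadOnlyList<byte[,]> mouthRois); } // DCT or AAM
public interface IDbnDecoder { string Decode(float[][] mfcc, float[][] dwt, float[][] visual); }

public sealed class AvasrPipeline
{
    private readonly IVoiceActivityDetector _vad;
    private readonly IAudioFeatureStream _mfcc, _dwt;
    private readonly IVisualFeatureStream _visual;
    private readonly IDbnDecoder _dbn;

    public AvasrPipeline(IVoiceActivityDetector vad, IAudioFeatureStream mfcc,
                         IAudioFeatureStream dwt, IVisualFeatureStream visual, IDbnDecoder dbn)
    {
        _vad = vad; _mfcc = mfcc; _dwt = dwt; _visual = visual; _dbn = dbn;
    }

    // VAD trims non-speech, then each speech segment is parametrized twice on the
    // audio side (MFCC + DWT) and once on the visual side (mouth-ROI features);
    // all three observation streams go to the DBN decoder.
    public IEnumerable<string> Recognize(float[] audio, IReadOnlyList<byte[,]> mouthRois)
    {
        foreach (var (start, end) in _vad.Detect(audio))
        {
            float[] segment = audio[start..end];   // audio/video frame alignment omitted
            yield return _dbn.Decode(_mfcc.Extract(segment), _dwt.Extract(segment),
                                     _visual.Extract(mouthRois));
        }
    }
}
```

The only point of the sketch is the topology: one VAD stage, two independent audio parametrizations of the same segment, one visual parametrization, and a single DBN decoder consuming all three observation streams.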

Fig. 1: Scheme of the DBN model with two modalities; asynchrony is represented by separate phone/viseme nodes, and the figure is split into audio and visual parts only for clarity. Per time slice (t1, t2, t3) it shows the latent word, word-transition, phone-position and viseme-position (inside the word), phone-transition and viseme-transition, phone and viseme variables, the observed acoustic features (MFCC: O_M, DWT: O_D) and visual features (DCT or AAM: O_V), connected by deterministic and probabilistic relations.

Fig. 2: Data flow in our audio-visual recognition system: speaker signal, VAD, audio and video segmentation with ROI extraction, MFCC parametrization and wavelet decomposition, visual features (AAM / DCT), DBN algorithm, recognition.
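For readers who prefer code to diagrams, the record below mirrors one time slice of the DBN in Fig. 1. It is only an illustrative data holder with our reading of the dependencies written as comments, not the Infer.NET model used in the system; in particular, the word-transition rule is a sketch of how the constrained asynchrony is re-synchronized at word boundaries.

```csharp
// One time slice of the DBN in Fig. 1 (illustrative only; the actual model is
// built with Infer.NET). Field names mirror the nodes of the figure.
public sealed record DbnTimeSlice(
    string Word,             // single word variable shared by both modalities
    int PhonePosition,       // position of the current phone inside the word (audio side)
    int VisemePosition,      // position of the current viseme inside the word (visual side)
    string Phone,            // deterministic given Word and PhonePosition
    string Viseme,           // deterministic given Word and VisemePosition
    bool PhoneTransition,    // probabilistic
    bool VisemeTransition,   // probabilistic
    double[] OMfcc,          // observed MFCC features O_M (depend on Phone)
    double[] ODwt,           // observed DWT features O_D (depend on Phone)
    double[] OVisual)        // observed DCT or AAM features O_V (depend on Viseme)
{
    // Deterministic word-transition rule (our reading of Fig. 1): the word may only
    // advance when both streams have completed it, i.e. the last phone and the last
    // viseme of the word signal a transition in the same slice. Inside the word the
    // phone and viseme positions may drift apart (constrained asynchrony).
    public bool WordTransition(int phonesInWord, int visemesInWord) =>
        PhoneTransition && VisemeTransition &&
        PhonePosition == phonesInWord - 1 &&
        VisemePosition == visemesInWord - 1;
}
```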

Fig. 3: Discrete wavelet parametrization with the optimal decomposition tree, showing the frequency sub-bands used in the audio parametrization. The lowest frequencies (0-125 Hz) are discarded from processing for noise reduction.

3.2 Audio Feature Extraction

The speech input is first windowed with a Hamming window of length 20 ms and an offset of 10 ms. For each frame, 12 MFCC features and an energy feature are extracted. Additionally, first- and second-order derivatives are appended to the MFCC, resulting in a 39-dimensional feature vector O_M.

The second audio stream carries the wavelet parametrization O_D. An eleven-level perceptual decomposition with discrete Meyer wavelet decomposition filters is applied to obtain the power spectrum of the speech signal. The time discretization of all wavelet sub-bands is unified by summing an adequate number of wavelet spectrum samples for the high frequencies. An optimal wavelet tree (Fig. 3) was found to choose the exact boundaries of the frequency sub-bands (Gałka and Ziółko 2009). The signal s is decomposed with discrete Meyer high-pass filters g and low-pass filters h according to the designed perceptual tree. This approach provides a decent psycho-acoustic model of the human ear with Mel-like frequency characteristics. The parametrization is performed by measuring the energy fractions of the different sub-bands and storing their magnitudes in a vector.

3.3 Visual Feature Extraction

Visual features can be broadly categorized into two approaches (Lan et al. 2010). The first, top-down approach is derived from higher-level, model-based features that exploit the shape and appearance of mouth-area images. The second is a bottom-up approach, in which the model is built from low-level, image-based features that are then compressed with a dimensionality reduction algorithm such as the discrete cosine transform (DCT) or principal component analysis (PCA). Both types of features require mouth-region extraction as an initial step. We use the Viola-Jones algorithm, which is based on the idea of a boosted cascade of weak classifiers, each of which has a high detection rate and rarely rejects a true example. The model defining the mouth area was described by (Castrillón Santana et al. 2007) and is provided with the OpenCV framework.
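As a rough illustration of this mouth-detection step, the EmguCV sketch below runs a Viola-Jones cascade on a single grayscale frame. The cascade file name, image path and detection parameters are assumptions made for the example (the paper only states that the mouth model of Castrillón Santana et al. (2007) shipped with OpenCV is used), and enum or overload names may differ slightly between EmguCV versions.

```csharp
using System;
using System.Drawing;
using Emgu.CV;
using Emgu.CV.CvEnum;

// Hypothetical single-frame example: detect mouth-region candidates with a
// Viola-Jones cascade. File names and parameters are placeholders.
using var mouthCascade = new CascadeClassifier("haarcascade_mcs_mouth.xml");
using Mat frame = CvInvoke.Imread("frame_0001.png", ImreadModes.Color);
using var gray = new Mat();
CvInvoke.CvtColor(frame, gray, ColorConversion.Bgr2Gray);
CvInvoke.EqualizeHist(gray, gray);

// Each detection is a rectangle of varying size around the mouth area; the ROI
// is later downsampled to 64x32 pixels for the DCT features (Section 3.3).
Rectangle[] mouths = mouthCascade.DetectMultiScale(gray, 1.1, 4, new Size(40, 20));
foreach (Rectangle roi in mouths)
    Console.WriteLine($"Mouth ROI candidate: {roi}");
```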

Fig. 4: Mouth region and low-level, image-based (DCT) feature extraction.

As a result we obtain a rectangular area of varying size, which is processed further. We test features from both categories. Visual features are extracted at 25 Hz and then interpolated to match the audio feature rate of 100 Hz.

Image-based features (denoted DCT) are extracted by downsampling the ROI to 64x32 pixels and converting it to a gray-scale intensity image. A two-dimensional discrete cosine transform (DCT) is applied; the first 30 coefficients, excluding the DC value, together with their derivatives constitute a 60-dimensional vector. The consecutive steps are presented in Fig. 4.

Model-based features use the Active Appearance Model (AAM) algorithm (Cootes et al. 2001), which exploits both shape and appearance information. The shape component is formed by concatenating the 2D coordinates of a set of n vertices that serve as boundary markers. A shape can also be represented more compactly as the mean shape plus a linear combination of the m largest eigenvectors of the covariance matrix; Principal Component Analysis (PCA) is used to find the model parameters. The appearance is defined by the pixels that lie inside the base shape and can be represented analogously to the shape component, as a base appearance plus a linear combination of k appearance modes (computed by applying PCA to shape-normalized training images). The shape and appearance components are concatenated, reweighted (to account for their different natures: shapes are coordinates, appearances are pixel intensities) and processed by a final PCA to obtain more compact and decorrelated features. A z-score normalization is performed to make the visual features more robust to speaker variability.

3.4 Database

In this work we use the audiovisual database of Polish speech recordings (Igras et al. 2013). It contains three different types of utterances: numbers and commands; 160 short phrases from a list of the most popular questions asked to a Virtual Assistant in spoken language; and parts of read texts (7 different utterances, such as articles, definitions and parts of stories). Longer utterances were split into several-word phrases for training and testing purposes. There are 24 speakers, 13 male and 11 female. The recordings contain only faces, in frontal view, on a bright background with fairly invariant lighting conditions, in Full HD quality at 25 frames per second. Each speaker was recorded for about 10 minutes; the total database length is about 4 hours.
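Before turning to the experiments, the self-contained sketch below illustrates the image-based (DCT) visual features of Section 3.3: a 64x32 gray-scale mouth ROI is transformed with a separable 2-D DCT and the first 30 non-DC coefficients are kept. The low-frequency-first scan order is an assumption (the paper does not specify the ordering), and the temporal derivatives that complete the 60-dimensional vector are not shown.

```csharp
using System;

static class DctVisualFeatures
{
    // Unnormalized 1-D DCT-II of a vector (naive O(N^2) form, fine for 64x32 ROIs).
    static double[] Dct1D(double[] x)
    {
        int n = x.Length;
        var y = new double[n];
        for (int k = 0; k < n; k++)
        {
            double sum = 0.0;
            for (int i = 0; i < n; i++)
                sum += x[i] * Math.Cos(Math.PI / n * (i + 0.5) * k);
            y[k] = sum;
        }
        return y;
    }

    // Separable 2-D DCT of a grayscale image given as [rows, cols].
    static double[,] Dct2D(double[,] img)
    {
        int rows = img.GetLength(0), cols = img.GetLength(1);
        var tmp = new double[rows, cols];
        for (int r = 0; r < rows; r++)                    // DCT along rows
        {
            var line = new double[cols];
            for (int c = 0; c < cols; c++) line[c] = img[r, c];
            var t = Dct1D(line);
            for (int c = 0; c < cols; c++) tmp[r, c] = t[c];
        }
        var outp = new double[rows, cols];
        for (int c = 0; c < cols; c++)                    // DCT along columns
        {
            var line = new double[rows];
            for (int r = 0; r < rows; r++) line[r] = tmp[r, c];
            var t = Dct1D(line);
            for (int r = 0; r < rows; r++) outp[r, c] = t[r];
        }
        return outp;
    }

    // Keep the first nCoeffs coefficients in order of increasing r + c (assumed
    // low-frequency-first scan), skipping the DC term as described in Section 3.3.
    public static double[] StaticFeatures(double[,] grayRoi64x32, int nCoeffs = 30)
    {
        var dct = Dct2D(grayRoi64x32);
        int rows = dct.GetLength(0), cols = dct.GetLength(1);
        var feats = new double[nCoeffs];
        int taken = 0;
        for (int s = 0; s <= rows + cols - 2 && taken < nCoeffs; s++)   // anti-diagonals
            for (int r = 0; r <= s && taken < nCoeffs; r++)
            {
                int c = s - r;
                if (r >= rows || c >= cols) continue;
                if (r == 0 && c == 0) continue;                          // skip DC
                feats[taken++] = dct[r, c];
            }
        return feats;
    }
}
```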

4 Experiments

In our testing environment, the audiovisual database was divided into two sets: the training set consisted of 21 speakers, and the 3 remaining speakers were used as the testing set. This was repeated 8 times, to test all speakers, resulting in about 1000 testing utterances. The audio models were additionally trained with more than 10 hours of clean speech samples. In the following experiments, we compared four setups of the multi-stream DBN AVASR system with a standard HMM-based one, under various SNR conditions. Street noise was artificially incorporated into the testing samples. The first setup was a standard ASR with HMM and MFCC-only parametrization (as described in Section 3.2, denoted HMM-MFCC in Table 1). We then tested a DBN-model system with just a single stream, also parametrized with MFCC (DBN-MFCC), to see the differences between the decoding algorithms. In the third setup (DBN-MFCC and DWT) we tested how multiple parametrizations of a single modality work under different conditions. The two final experiments gather results for the complete system, with both MFCC and DWT features extracted from the audio signal and with DCT (DBN-AV with DCT) or AAM (DBN-AV with AAM) visual features added to the DBN decoder, respectively. All results are summarized in Table 1. They show that incorporating the visual modality within the DBN-based system can improve the phrase recognition rate, especially in noisy conditions, regardless of the visual features used.

Table 1: Phrase recognition rates for all experiments, under various audio conditions (SNR)

Experiment           0 dB     5 dB     10 dB    clean
HMM-MFCC only        63.69%   78.97%   87.50%   96.73%
DBN-MFCC only        63.79%   77.98%   86.21%   92.86%
DBN-MFCC and DWT     64.68%   80.95%   88.69%   93.45%
DBN-AV with DCT      75.69%   81.65%   85.42%   88.10%
DBN-AV with AAM      76.39%   81.55%   85.22%   87.70%

5 Conclusions

In this paper we have described an audio-visual speech recognition system that uses Dynamic Bayesian Networks for modeling multiple streams and modalities. On the basis of the results collected in Table 1 we can state that the DBN model is beneficial when the SNR is low, even when only a single modality with multiple parametrizations is taken into account. However, the use of the more complicated modeling technique (DBN) may be unjustified when the environment does not disturb the communication between human and computer; this is related to the more demanding requirements of DBN model training. These results are consistent with those presented by (Gowdy et al. 2004). Incorporating the visual modality, which is not affected by audio noise, is also beneficial, especially in low-SNR conditions. We obtained very similar results for low-level (DCT) and higher-level (AAM) visual features, which differs from the results presented by (Lan et al. 2010), where the authors prefer the Active Appearance Model. This is probably due to the quality of our audio-visual corpus, where the invariant lighting conditions yield good recognition even for the simpler (DCT) visual features.

Acknowledgments

This work was supported by the Polish National Science Centre (NCN) under decision DEC-2011/03/N/ST7/00443.

References

Bilmes, J. & Bartels, C. (2005), Graphical model architectures for speech recognition, IEEE Signal Processing Magazine 22(5), 89-100.

Castrillón Santana, M., Déniz Suárez, O., Hernández Tejera, M. & Guerra Artal, C. (2007), ENCARA2: Real-time detection of multiple faces at different resolutions in video streams, Journal of Visual Communication and Image Representation, pp. 130-140.

Cootes, T., Edwards, G. & Taylor, C. (2001), Active appearance models, IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6), 681-685.

Gałka, J. & Ziółko, M. (2009), Wavelet parametrization for speech recognition, in Proceedings of the ISCA Tutorial and Research Workshop on Non-Linear Speech Processing NOLISP 2009, Vic.

Gowdy, J., Subramanya, A., Bartels, C. & Bilmes, J. (2004), DBN based multi-stream models for audio-visual speech recognition, in Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP '04), IEEE International Conference on, Vol. 1, pp. I-993-996.

Igras, M., Ziółko, B. & Jadczyk, T. (2013), Audiovisual database of Polish speech recordings, Studia Informatica 33(2B), 163-172.

Lan, Y., Theobald, B.-J., Harvey, R., Ong, E.-J. & Bowden, R. (2010), Improving visual features for lipreading, in AVSP, pp. 7-3.

Minka, T., Winn, J., Guiver, J., Webster, S., Zaykov, Y., Yangel, B., Spengler, A. & Bronskill, J. (2014), Infer.NET 2.6, Microsoft Research Cambridge. http://research.microsoft.com/infernet.

Murphy, K. P. (2002), Dynamic Bayesian networks: representation, inference and learning, PhD thesis, University of California, Berkeley.

Potamianos, G., Neti, C., Gravier, G., Garg, A. & Senior, A. (2003), Recent advances in the automatic recognition of audiovisual speech, Proceedings of the IEEE 91(9), 1306-1326.

Saenko, K. & Livescu, K. (2006), An asynchronous DBN for audio-visual speech recognition, in Spoken Language Technology Workshop, 2006, IEEE, pp. 154-157.

Shivappa, S. T., Trivedi, M. M. & Rao, B. D. (2010), Audiovisual information fusion in human-computer interfaces and intelligent environments: A survey, Proceedings of the IEEE 98(10), 1692-1715.

Ziółko, M., Gałka, J., Ziółko, B., Jadczyk, T., Skurzok, D. & Masior, M. (2011), Automatic speech recognition system dedicated for Polish, Proceedings of Interspeech, Florence.

Web-1: http://www.emgu.com/, consulted 01 Apr. 2015.