Dynamic Stream Weight Estimation in Coupled-HMM-based Audio-visual Speech Recognition Using Multilayer Perceptrons


INTERSPEECH 2014

Dynamic Stream Weight Estimation in Coupled-HMM-based Audio-visual Speech Recognition Using Multilayer Perceptrons

Ahmed Hussen Abdelaziz, Dorothea Kolossa
Institute of Communication Acoustics, Digital Signal Processing Group, Ruhr-Universität Bochum, Germany
{Ahmed.HussenAbdelAziz,

Abstract

Jointly using audio and video features can increase the robustness of automatic speech recognition systems in noisy environments. A systematic and reliable performance gain, however, is only achieved if the contributions of the audio and video stream to the decoding decision are dynamically optimized, for example via so-called stream weights. In this paper, we address the problem of dynamic stream weight estimation for coupled-HMM-based audio-visual speech recognition. We investigate the multilayer perceptron (MLP) for mapping reliability measure features to stream weights. As an input for the multilayer perceptron, we use a feature vector containing different model-based and signal-based reliability measures. Training of the multilayer perceptron has been achieved using dynamic oracle stream weights as target outputs, which are found using a recently proposed expectation maximization algorithm. This new approach of MLP-based stream weight estimation has been evaluated using the Grid audio-visual corpus and has outperformed the best baseline performance, yielding a % average relative error rate reduction.

Index Terms: Audio-visual ASR, coupled HMM, stream weight, reliability measure, multilayer perceptron

1. Introduction

Using visual observations in addition to acoustical observations in automatic speech recognition (ASR) has recently attracted remarkable research interest as a possible solution for the rapid performance drop of audio-only ASR in noisy environments. One reasonable requirement is that under all conditions, the resulting audio-visual (AV) ASR system should perform at least as well as, but typically better than, the best single-modality system. To ensure this, it is necessary to dynamically adapt the contribution of each modality to the classification decisions made by the audio-visual model. This can be done by weighting the contribution of each modality according to its information content and its reliability, using so-called stream weights (SW).

The stream weight estimation problem has been addressed in many prior works. In [1-3], for example, the stream weights have been considered as model parameters and estimated using a generative or a discriminative criterion. In [4, 5], on the other hand, the stream weights have been considered as feature-dependent and estimated per frame, based on different reliability measures via heuristically chosen mapping functions like the sigmoid or the exponential function. However, it has not been shown whether these functions are an optimal choice. Therefore, we propose here to train a multilayer perceptron (MLP) to implicitly choose the appropriate mapping function. In order to train the MLP parameters, a massive amount of training data should be used. The global fixed stream weights, found per condition via a grid search as in [4, 6], are not sufficient as target outputs for the MLP, because the number of estimated data points is far too small. Instead, we train the MLP using frame-dependent (dynamic) oracle stream weights, estimated using the recently proposed algorithm in [7].
The large number of oracle stream weights made available by this algorithm (one per feature vector in the training dataset) also enables the use of multidimensional reliability feature vectors. This allows us to use 31-dimensional feature vectors that combine different signal-based and model-based reliability measures as the input to the MLP. The proposed stream weights are tested using a coupled hidden Markov model (CHMM)-based AVASR system [8-11]. In contrast to other fusion models, e.g., the multistream hidden Markov model (MSHMM) [8-10, 12], the CHMM takes into account the asynchrony between the audio and the visual modality while preserving their temporal dependency by enforcing synchronization at the boundaries of certain speech units, e.g., phonemes, syllables, or words.

The remainder of the paper is organized as follows: In Section 2, we review the use of coupled HMMs as fusion models for AVASR. Next, in Section 3, we discuss using the MLP as a mapping function for dynamic stream weights. The reliability measures from which the stream weights are estimated using the MLP will also be discussed in this section. The algorithm used for estimating the frame-wise oracle stream weights, needed for training the MLP parameters, is discussed in Section 4. With all parts of the system in place, the proposed approach is evaluated on the Grid audio-visual database [13]. The experimental setup and results are presented in Section 5. Finally, we conclude the paper and give an outlook on further work in Section 6.

2. Coupled hidden Markov model

Audio-visual speech models should take into consideration the natural temporal dependencies of audio and video observations. The two modalities do exhibit asynchronicities, because the visual information derived from the articulator movements often precedes the acoustical signal generation [14], but a natural temporal dependency of both modalities results, as they are both the consequence of the same movements of the articulators. The CHMM can be used to model this joint temporal evolution. As shown in Figure 1, each hidden state of the CHMM is composed of a pair of the corresponding audio and video states of two marginal single-stream HMMs. The level of asynchrony allowed by the CHMM is controlled by the transition matrix of the model. The natural dependency over time between both modalities is additionally guaranteed by enforcing synchrony at appropriate model boundaries (here, at word boundaries).

Figure 1: Coupled HMM with N_A x N_V composite states.

The transition probability a_{i,j} between two composite states q_t = i = (i_A, i_V) and q_{t+1} = j = (j_A, j_V) in a CHMM can be computed by

$$a_{i,j} = p(q_{t+1} = j \mid q_t = i) = \prod_{s \in \{A,V\}} a^{s}_{i_s, j_s}. \qquad (1)$$

In (1), $a^{s}_{i_s, j_s}$ denotes the transition probability from state $q_t^s = i_s$ to state $q_{t+1}^s = j_s$ in the marginal single-modality HMM of stream s, where A denotes the audio and V the video stream. The observation likelihood of a composite state q in a CHMM can be computed by

$$b_{q_t = i}(O_t, \lambda_t) = p\bigl(O_t^A \mid q_t^A = i_A\bigr)^{\lambda_t} \cdot p\bigl(O_t^V \mid q_t^V = i_V\bigr)^{1-\lambda_t}. \qquad (2)$$

Here, $p(O_t^s \mid q_t^s)$ is the single-modality observation likelihood. The audio-visual observation O_t at time frame t consists of the acoustical observation O_t^A and the visual observation O_t^V. As can be seen in (2), the stream weight λ_t, which is the key quantity studied in this paper, controls the contribution of each modality to the overall score of the composite state q_t = i. A large stream weight at time frame t means that the acoustical observation is more reliable and contains more information than the visual observation at this time frame, and vice versa.
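To make (1) and (2) concrete, the following minimal numpy sketch (our illustration, not from the paper; all function names are assumptions) builds the composite transition matrix as the Kronecker product of the marginal transition matrices and scores the composite states with the stream-weighted log-likelihood:

```python
import numpy as np

def composite_transitions(A_audio, A_video):
    """Composite CHMM transition matrix as in Eq. (1): the entry for
    ((i_A, i_V), (j_A, j_V)) is a^A_{iA,jA} * a^V_{iV,jV}, which is exactly
    the Kronecker product of the marginal transition matrices."""
    return np.kron(A_audio, A_video)

def composite_log_likelihoods(log_b_audio, log_b_video, lam):
    """Stream-weighted composite observation scores as in Eq. (2), in the
    log domain. log_b_audio (N_A,) and log_b_video (N_V,) hold the marginal
    state log-likelihoods of one frame; lam is the stream weight lambda_t."""
    # One score per composite state (i_A, i_V); ravel() flattens in the
    # same row-major order that np.kron uses above.
    scores = lam * log_b_audio[:, None] + (1.0 - lam) * log_b_video[None, :]
    return scores.ravel()
```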

3. Stream Weight Estimation using MLPs

In many prior works, frame-dependent stream weights λ_t have been estimated using different mapping functions. These functions map single- or multidimensional reliability measure features to the stream weight. Convenient choices of such mapping functions are the sigmoid function [5, 6] or the exponential function [4]. However, these functions are chosen heuristically, and it is not clear whether they are optimal for the stream weight estimation task. As an alternative, we propose to use a multilayer perceptron to implicitly choose the most suitable function for this task.

Since we aspire to dynamically estimate the stream weights, the input reliability measure features of the MLP and their corresponding stream weight outputs should be computed for each frame. This means that large numbers of input-output tuples can be collected from a reasonable number of utterances. This abundance of input-output tuples makes it possible to properly train the MLP parameters in a supervised manner.

As an input to the MLP, we consider a 31-dimensional feature vector, which contains different model-based and signal-based reliability measures. The first and the second model-based reliability measures are the acoustical and the visual entropy, H_A and H_V [15-17]. The entropy is a measure of the decoder uncertainty regarding the discrete state, given an observation O^s. Therefore, small entropies indicate reliable features and vice versa. The entropy can be computed independently for each modality s at time frame t as follows:

$$H_t^s = -\sum_{i=1}^{N_s} p(q_t^s = i_s \mid O_t^s)\,\log\bigl(p(q_t^s = i_s \mid O_t^s)\bigr). \qquad (3)$$

The single-modality dispersions D_A and D_V [4, 6, 18], which are also model-based features, are used as the third and fourth reliability measures. The dispersion indicates how certain the decoder is when allocating the given observation to a model state, so in contrast to the entropy, reliable observations have large dispersion values. The dispersion of a single modality s ∈ {A, V} at time frame t is given via

$$D_t^s = \frac{2}{L_s(L_s - 1)} \sum_{k_s=1}^{L_s} \sum_{l_s = k_s+1}^{L_s} \log\frac{p(q_t^s = k_s \mid O_t^s)}{p(q_t^s = l_s \mid O_t^s)}. \qquad (4)$$

In order to compute the dispersion as in (4), the L_s largest posteriors $p(q_t^s = k_s \mid O_t^s)$ should first be arranged in descending order.
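As a concrete reference, here is a small numpy sketch (ours, not the paper's; it assumes the log-ratio form of the dispersion used in [4], and L is a free parameter here) of the two model-based measures in Eqs. (3) and (4):

```python
import numpy as np

def entropy(posteriors, eps=1e-12):
    """Decoder entropy as in Eq. (3); `posteriors` is the vector of
    p(q_t^s = i | O_t^s) over the N_s states of one stream."""
    p = np.clip(posteriors, eps, 1.0)
    return -np.sum(p * np.log(p))

def dispersion(posteriors, L=5, eps=1e-12):
    """Dispersion as in Eq. (4), computed over the L largest posteriors
    arranged in descending order."""
    p = np.sort(np.clip(posteriors, eps, 1.0))[::-1][:L]
    acc = 0.0
    for k in range(L):
        for l in range(k + 1, L):
            acc += np.log(p[k] / p[l])   # p[k] >= p[l], so each term >= 0
    return 2.0 * acc / (L * (L - 1))
```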
These four model-based reliability measure features are heuristic measures of the mismatch between the given observation O_t^s and the underlying model.

The remaining reliability measure features are signal-based rather than model-based. Specifically, we are using the signal-to-noise ratio (SNR), averaged soft and hard voice activity detection (VAD) cues, and the estimation uncertainty of the acoustical observations. The availability of these signal-based features depends on the chosen acoustic pre-processor. In this study, we use a Wiener filter as a pre-processor to enhance the speech signal before extracting the acoustical observations. Speech enhancement algorithms such as the Wiener filter need an estimate of the noise power. This can be found either by explicitly employing speech pause detection or by using algorithms like improved minima controlled recursive averaging (IMCRA) [19-21] to estimate the noise floor. The latter algorithm, which is used here, provides two types of VAD cues in each time-frequency bin: hard (binary) and probabilistic. The fifth and the sixth components of the reliability measure feature vector are the averages of each of these two VAD cues across all frequency bins. Moreover, the noise power estimate obtained using IMCRA, together with the power of the estimated clean signal from the Wiener filter, is used to compute the frame-wise signal-to-noise ratio via

$$\mathrm{SNR}_t = 10\log_{10}\!\left(\frac{S_t}{N_t}\right), \qquad (5)$$

where S_t and N_t are the estimated signal and noise energies at time frame t, respectively.

The final reliability measure feature is the uncertainty of the enhanced acoustical observations [22, 23]. This uncertainty is an estimate of the residual noise and estimation errors in the acoustical features after applying the Wiener filter. The uncertainty σ² can be computed for each time-frequency bin as

$$\sigma_t^2 = \operatorname{Var}[x_t \mid y_t], \qquad (6)$$

where x_t and y_t are single components of the short-time discrete Fourier transform (DFT) of the clean and noisy acoustical signals, respectively. The dimension of the uncertainty is reduced from the DFT length to 24 by a linear transform using the Mel filterbank matrix.
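Putting the pieces together, the following sketch (names and shapes are our assumptions) shows Eq. (5) and the assembly of the 31-dimensional MLP input, which stacks the two entropies, the two dispersions, the two averaged VAD cues, the frame SNR, and the 24 Mel-domain uncertainties:

```python
import numpy as np

def frame_snr_db(signal_power, noise_power, eps=1e-12):
    """Frame-wise SNR as in Eq. (5), from the Wiener-filter clean-signal
    power estimate and the IMCRA noise-power estimate."""
    return 10.0 * np.log10(max(signal_power, eps) / max(noise_power, eps))

def reliability_features(H_A, H_V, D_A, D_V, vad_soft, vad_hard,
                         snr_db, uncertainty_mel):
    """Concatenate the per-frame reliability measures into the
    31-dimensional MLP input: 2 entropies + 2 dispersions
    + 2 averaged VAD cues + 1 SNR + 24 Mel-domain uncertainties."""
    assert uncertainty_mel.shape == (24,)
    return np.concatenate(([H_A, H_V, D_A, D_V,
                            vad_soft, vad_hard, snr_db], uncertainty_mel))
```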

After defining the reliability measure feature vectors in this way, in order to train the MLP parameters, it is still necessary to find the corresponding optimal stream weight for each time frame of a training data set. For this task, we use an expectation maximization algorithm that we have recently proposed [7]. A brief summary of this algorithm is given in the next section.

4. Oracle stream weight estimation

The question addressed in this section is as follows: given the audio-visual observation sequence O = {O_t}, t = 1, ..., T, of an utterance and its corresponding correct transcription w, how can we estimate the optimal dynamic stream weights λ = {λ_t}, t = 1, ..., T, which we also term oracle stream weights because prior knowledge of the correct word sequence enters into their computation? A possible answer to this question is introduced in [5, 9] for AVASR based on multi-stream HMMs. The approach in [5, 9] requires a-priori knowledge of the frame-state alignment, which is found using forced alignment. However, computing the frame-state alignment requires the state-conditional observation likelihoods given by (2), which already need the dynamic stream weights: a classical chicken-and-egg dilemma. To solve this issue, we have proposed a new algorithm in [7] that needs no prior knowledge of the frame-state alignment for estimating the oracle dynamic stream weights, cf. Algorithm 1.

Algorithm 1: Oracle Dynamic SWs for CHMM-based AVASR
A. Set the prior parameters
  (1) Set μ_λ to the global fixed stream weight λ_G.
  (2) Initialize σ_λ with a small value, e.g., 0.1.
B. Initialization
  (3) Initialize the stream weight set λ̂ = λ̄ (see Eq. (9)).
C. EM algorithm
  (4) Calculate P = p(O, λ̂ | w).
  E step: (5) Use λ̂ to calculate γ_t(i) for all times and 2-D states.
  M step: (6) Update λ̂ using Equation (8).
  Convergence test: (7) Calculate P' = p(O, λ̂ | w).
  (8) If P' − P > ε, set P = P' and go to (5); else, go to (9).
D. Recognition
  (9) Use the estimated λ̂ to recognize the training utterance and calculate the accuracy A(σ_λ).
E. Iteration
  (10) Increase σ_λ and repeat (3)-(10) until all values of λ̂ become 0 or 1.
  (11) Choose the λ̂ that achieves the best accuracy.

The algorithm initializes the stream weight as a Gaussian random variable with mean μ_λ and standard deviation σ_λ. These statistics are assumed to take different values for different noise types and levels. We set the mean value μ_λ, which we call the bias parameter, to the optimal global fixed stream weight λ_G. The global fixed stream weight can be found using a grid search, minimizing the word error rate. We term the standard deviation σ_λ the sensitivity parameter, as it controls the dynamics of the stream weight: large standard deviations lead to strong changes in the stream weights and vice versa. In Step (2) of Algorithm 1, we start with a small σ_λ, then increase it iteratively in Step (10) until all stream weights become either zero or one. Increasing the standard deviation further would not change the values of the stream weights, which are bounded between zero and one. In Steps (3)-(8), the optimal set of stream weights is found given the prior parameters μ_λ and σ_λ of the current iteration. In Step (9), we test the stream weight set estimated in the current iteration by blindly decoding the utterance using an audio-visual speech recognizer.
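The outer loop of Algorithm 1 can be summarized in a compact sketch (ours, not the paper's implementation; the EM inner loop of Steps (3)-(8) and the decoder of Step (9) are abstracted behind placeholder callables):

```python
import numpy as np

def oracle_stream_weights(O, w, lambda_G, sigma_steps, em_update, accuracy):
    """Skeleton of Algorithm 1. `em_update(O, w, mu, sigma)` runs
    Steps (3)-(8) for fixed prior parameters and returns a weight per
    frame; `accuracy(O, w, lam)` decodes the utterance with those
    weights (Step (9)). Both are placeholders for the recognizer."""
    mu = lambda_G                          # Step (1): bias = global fixed SW
    best_lam, best_acc = None, -np.inf
    for sigma in sigma_steps:              # Steps (2), (10): growing sigma
        lam = em_update(O, w, mu, sigma)   # Steps (3)-(8): EM estimate
        acc = accuracy(O, w, lam)          # Step (9): blind decoding
        if acc > best_acc:
            best_lam, best_acc = lam, acc
        if np.all((lam <= 0.0) | (lam >= 1.0)):
            break                          # all weights saturated at 0 or 1
    return best_lam                        # Step (11): best-scoring set
```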
Finally, the stream weight set λ̂ that yields the lowest error rate is selected as the final stream weight set for the given utterance.

In Steps (4)-(8), we use an EM algorithm to find an estimate λ̂ that maximizes the following objective function:

$$F = p(O, \lambda \mid w) \quad \text{s.t.} \quad 0 \le \lambda \le 1. \qquad (7)$$

A local maximum of F can be found by estimating the frame-wise stream weights as follows¹:

$$\hat{\lambda}_t = \mu_\lambda + \sigma_\lambda^2 \sum_{i=(1,1)}^{(N_A, N_V)} \gamma_t(i)\,\log\!\left(\frac{p(O_t^A \mid q_t = i)}{p(O_t^V \mid q_t = i)}\right). \qquad (8)$$

If the resulting λ̂_t in (8) lies outside [0, 1], it is clipped to these boundaries. Equation (8) represents the M step (Step (6)). To compute λ̂_t as in (8), the E step (Step (5)) is first applied to compute the composite state occupation probabilities γ_t(i) for all times and all composite states i ∈ {(1,1), ..., (N_A, N_V)}. The EM algorithm is iterated until a local maximum of (7) is found. In Step (3), a reasonably initialized stream weight set λ̄ is found using a greedy layer-wise approach that solves the optimization problem

$$\bar{\lambda}_t = \operatorname*{argmax}_{\lambda_t} \bigl\{ p\bigl(O_{1,\dots,t}, \lambda_t \mid \bar{\lambda}_{1,\dots,t-1}, w\bigr) \bigr\} \quad \text{s.t.} \quad 0 \le \lambda_t \le 1. \qquad (9)$$

The optimization problem in (9) is convex in the feasible region, i.e., 0 ≤ λ_t ≤ 1, and can be solved by gradient ascent [24].

¹ A rigorous derivation of the EM algorithm and of the initialization approach is given in [7].
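As an illustration, the M step of Eq. (8) for a whole utterance fits in a few lines of numpy (a sketch under our naming assumptions, not the authors' code):

```python
import numpy as np

def m_step_update(mu, sigma, gamma, log_b_audio, log_b_video):
    """M step of Eq. (8). `gamma` has shape (T, N_A*N_V) and holds the
    composite state occupancies gamma_t(i) from the E step; log_b_audio
    and log_b_video hold log p(O_t^A | q_t = i) and log p(O_t^V | q_t = i)
    for every frame and composite state (same shape as gamma)."""
    lam = mu + sigma**2 * np.sum(gamma * (log_b_audio - log_b_video), axis=1)
    return np.clip(lam, 0.0, 1.0)   # clip to the feasible region [0, 1]
```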

5. Experiments and Results

The Grid audio-visual speech corpus has been used to evaluate the proposed approach. The Grid database contains audio-visual recordings uttered by 33 speakers, i.e., 1000 sentences per speaker. The Grid task is to recognize English sentences of the form command-color-preposition-letter-digit-adverb. We have divided the signals into three sets: a training set containing 90% of the signals, and a development and a test set, each of which contains 5% of the corpus signals. The training set has been used to separately train the marginal single-modality HMMs. The development set has mainly been used to train the MLP. To test the proposed approach under different acoustical conditions, we have used eight additional noisy versions of the test and development sets. The noisy signals have been created by adding babble and white noise signals to the clean signals at four SNR levels between 0 dB and 15 dB. The babble and white noise signals stem from the NOISEX corpus [25] and were chosen to represent both stationary and non-stationary noise.

The spectrograms of all signals have been extracted using the configuration parameters recommended in the ETSI-AFE [26]. From the spectrogram, we have extracted the first 13 mel-frequency cepstral coefficients (MFCCs). The acoustical feature vectors are then composed of these 13-dimensional static feature vectors concatenated with their first and second temporal derivatives. The visual features have been extracted as follows: first, the region of interest (ROI), a rectangular area containing the speaker's mouth, has been found with the Viola-Jones algorithm [27]. Next, the two-dimensional discrete cosine transform (DCT) has been applied to the ROI. The first 64 DCT coefficients were then used as the visual features. The dimensions of the acoustical and visual observations have finally been reduced to 31 using linear discriminant analysis (LDA) [28, 29].

The single-modality word HMMs are speaker-dependent, linear models. The number of states in each word HMM is proportional to the number of phonemes contained in the word, with a proportionality factor of 3 for audio HMMs and 1 for video HMMs. The output probability distributions of all emitting states are Gaussian mixture models with 3 mixture components for audio HMMs and 4 for video HMMs. The Java Audiovisual SPEech Recognizer (JASPER) system [30] has been used for training and recognition.

The MLP has an input layer with 31 input units, two hidden layers with 10 neurons each, and a one-dimensional output layer. All neurons of the hidden layers have tan-sigmoid transfer functions; the output neuron has a linear transfer function. Estimated stream weights that are smaller than zero or larger than one are clipped to these boundaries.

We consider the results obtained using the approach in [4] as our baseline. Here, a second-order exponential function,

$$\lambda_t = a e^{b\,\mathrm{RM}_t} + c e^{d\,\mathrm{RM}_t}, \qquad (10)$$

was used to map a one-dimensional reliability measure RM, e.g., RM_t = SNR_t, to a stream weight λ_t at each time frame t. The parameters a, b, c, and d of this mapping function have been estimated using non-linear least squares, as sketched below. One training data point is computed for each development set as follows: the argument of the mapping function is found by averaging a one-dimensional reliability measure, e.g., SNR, entropy, or dispersion, over the whole development set. As the target stream weight of the set, the global fixed stream weight λ_G is used, which is found via grid search. Therefore, the number of training data points for parameter estimation equals the number of development sets, which is typically quite small.
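For concreteness, a minimal scipy sketch of this baseline fit (the numerical data points below are invented placeholders for illustration, not values from the paper):

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_mapping(rm, a, b, c, d):
    """Second-order exponential mapping of Eq. (10)."""
    return a * np.exp(b * rm) + c * np.exp(d * rm)

# Hypothetical training data: one (mean reliability measure, global fixed
# stream weight) pair per development set, e.g., mean SNR in dB vs. lambda_G.
rm_means = np.array([-5.0, 0.0, 5.0, 10.0, 15.0, 30.0])
lambda_G = np.array([0.15, 0.35, 0.55, 0.70, 0.80, 0.95])

# Non-linear least squares for the parameters a, b, c, d.
params, _ = curve_fit(exp_mapping, rm_means, lambda_G,
                      p0=(0.5, 0.01, -0.5, -0.1), maxfev=10000)

# Map one frame's reliability measure to a stream weight, clipped to [0, 1].
lam_t = np.clip(exp_mapping(5.0, *params), 0.0, 1.0)
```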
Table 1: Recognition performance in terms of word accuracy for single-stream and audio-visual ASR systems in a range of acoustical conditions, using different stream weight estimation schemes. (Rows: babble noise, white noise, clean speech, and the average; columns: noise type, SNR [dB], audio-only and video-only results, and audio-visual results for Bayes fusion, the exponential-function mapping with entropy, SNR, and dispersion inputs, the proposed MLP, λ = λ_G, and λ = ODSW. The numerical entries were lost in extraction; asterisks mark results significantly better than the best baseline result.)

Table 1 shows the performance in terms of word accuracy for audio-only, video-only, and audio-visual ASR. The stream weights used with the AVASR system have been computed using different approaches. The first approach is simple Bayes fusion, where audio and video likelihoods are always weighted equally. In the second, third, and fourth approaches, the stream weights have been estimated as in [4]. The fifth approach is the proposed one, in which the stream weights are estimated using an MLP. The sixth and the seventh approaches are upper bounds on the system performance, as their stream weights have been estimated using a-priori knowledge of the correct transcription. The stream weights of the sixth approach are the global fixed stream weights obtained using a grid search with a minimum word error rate criterion. The oracle dynamic stream weights (ODSWs) used in the last approach are those estimated using Algorithm 1, discussed in Section 4.

As seen in Table 1, the proposed approach significantly outperforms the baseline results and approaches the results obtained using optimal fixed stream weights. An asterisk in Table 1 means that the result is significantly better than the respective best result of the baseline; the statistical significance has been tested using Fisher's exact test [31]. However, the gap between the results obtained using the oracle dynamic stream weights and the weights estimated using the MLP is also significant. This points towards further room for optimization, e.g., through the inclusion of other robust and indicative reliability measure features.

6. Conclusions

Automatic speech recognition can become highly noise-robust when visual observations are used in conjunction with acoustical ones. It is, however, important to dynamically control the contribution of each observation to the classification decision, for example by using dynamic stream weights. We have proposed to use an MLP for mapping reliability measure features to dynamic stream weights. In order to train the MLP parameters, we have employed oracle dynamic stream weights, which were estimated using a recently proposed EM algorithm. As the MLP input, we have used composite feature vectors containing signal-based as well as model-based reliability measures. The proposed approach has significantly outperformed the baseline for CHMM-based AVASR. We expect to achieve even better performance when further noise-robust reliability measures are included as additional MLP inputs.

7. References

[1] J. Hernando, "Maximum likelihood weighting of dynamic speech features for CDHMM speech recognition," in ICASSP, Munich, Germany.
[2] G. Gravier, S. Axelrod, G. Potamianos, and C. Neti, "Maximum entropy and MCE based HMM stream weight estimation for audio-visual ASR," in ICASSP, Orlando, Florida, USA.
[3] L. Peng and W. Zuoying, "Stream weight training based on MCE for audio-visual LVCSR," Tsinghua Science and Technology, vol. 10, no. 2.
[4] V. Estellers, M. Gurban, and J.-P. Thiran, "On dynamic stream weighting for audio-visual speech recognition," IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 4.
[5] A. Garg, G. Potamianos, C. Neti, and T. S. Huang, "Frame-dependent multi-stream reliability indicators for audio-visual speech recognition," in International Conference on Multimedia and Expo, Baltimore, Maryland, USA.
[6] M. Heckmann, F. Berthommier, and K. Kroschel, "Noise adaptive stream weighting in audio-visual speech recognition," EURASIP Journal on Applied Signal Processing, vol. 2002.
[7] A. H. Abdelaziz, S. Zeiler, and D. Kolossa, "A new EM estimation of dynamic stream weights for coupled-HMM-based audio-visual ASR," in ICASSP, Florence, Italy.
[8] A. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, "Dynamic Bayesian networks for audio-visual speech recognition," EURASIP Journal on Applied Signal Processing, vol. 11.
[9] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, "Recent advances in the automatic recognition of audio-visual speech," Proceedings of the IEEE, vol. 91, no. 9.
[10] M. Tomlinson, M. Russell, and N. M. Brooke, "Integrating audio and visual information to provide highly robust speech recognition," in ICASSP, Atlanta, Georgia, USA.
[11] J. Luettin, G. Potamianos, and C. Neti, "Asynchronous stream modeling for large vocabulary audio-visual speech recognition," in ICASSP, Salt Lake City, Utah, USA.
[12] H. Bourlard and S. Dupont, "A new ASR approach based on independent processing and recombination of partial frequency bands," in ICSLP, Philadelphia, Pennsylvania, USA.
[13] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, no. 5.
[14] D. W. Massaro and D. G. Stork, "Speech recognition and sensory integration," American Scientist, vol. 86, no. 3.
[15] M. Gurban and J.-P. Thiran, "Using entropy as a stream reliability estimate for audio-visual speech recognition," in European Signal Processing Conference, Lausanne, Switzerland.
[16] G. Potamianos and C. Neti, "Stream confidence estimation for audio-visual speech recognition," in ICSLP, Beijing, China.
[17] H. Misra, H. Bourlard, and V. Tyagi, "New entropy based combination rule in HMM/ANN multi-stream ASR," in ICASSP, Hong Kong.
[18] A. Adjoudani and C. Benoit, "On the integration of auditory and visual parameters in an HMM-based ASR," in NATO ASI Conference on Speech reading by Man and Machine: Models, Systems and Applications, Berlin, Germany.
[19] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5.
[20] I. Cohen and B. Berdugo, "Noise estimation by minima controlled recursive averaging for robust speech enhancement," IEEE Signal Processing Letters, vol. 9, no. 1.
[21] I. Cohen, "Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 5.
[22] L. Deng, J. Droppo, and A. Acero, "Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion," IEEE Trans. Speech and Audio Processing, vol. 13, no. 3.
[23] R. F. Astudillo, D. Kolossa, and R. Orglmeister, "Accounting for the uncertainty of speech estimates in the complex domain for minimum mean square error speech enhancement," in Interspeech, Brighton, United Kingdom.
[24] S. Boyd and L. Vandenberghe, Convex Optimization, 7th ed. Cambridge University Press.
[25] A. Varga and H. J. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3.
[26] Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithms, ETSI ES Std.
[27] G. Bradski and A. Kaehler, Computer Vision with the OpenCV Library. O'Reilly Media.
[28] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 6th ed. Pearson.
[29] D. Kolossa, S. Zeiler, R. Saeidi, and R. Astudillo, "Noise-adaptive LDA: A new approach for speech recognition under observation uncertainty," IEEE Signal Processing Letters, vol. 20, no. 11.
[30] A. Vorwerk, S. Zeiler, D. Kolossa, R. F. Astudillo, and D. Lerch, "Use of Missing and Unreliable Data for Audiovisual Speech Recognition," in Robust Speech Recognition of Uncertain or Missing Data. Springer, 2011.
[31] A. Agresti, "A survey of exact inference for contingency tables," Statistical Science, vol. 7, no. 1.


Markov Random Fields and Gibbs Sampling for Image Denoising Markov Random Fields and Gibbs Sampling for Image Denoising Chang Yue Electrical Engineering Stanford University changyue@stanfoed.edu Abstract This project applies Gibbs Sampling based on different Markov

More information

The Automatic Musicologist

The Automatic Musicologist The Automatic Musicologist Douglas Turnbull Department of Computer Science and Engineering University of California, San Diego UCSD AI Seminar April 12, 2004 Based on the paper: Fast Recognition of Musical

More information

Available online Journal of Scientific and Engineering Research, 2016, 3(4): Research Article

Available online   Journal of Scientific and Engineering Research, 2016, 3(4): Research Article Available online www.jsaer.com, 2016, 3(4):417-422 Research Article ISSN: 2394-2630 CODEN(USA): JSERBR Automatic Indexing of Multimedia Documents by Neural Networks Dabbabi Turkia 1, Lamia Bouafif 2, Ellouze

More information

Lecture 21 : A Hybrid: Deep Learning and Graphical Models

Lecture 21 : A Hybrid: Deep Learning and Graphical Models 10-708: Probabilistic Graphical Models, Spring 2018 Lecture 21 : A Hybrid: Deep Learning and Graphical Models Lecturer: Kayhan Batmanghelich Scribes: Paul Liang, Anirudha Rayasam 1 Introduction and Motivation

More information

Dr Andrew Abel University of Stirling, Scotland

Dr Andrew Abel University of Stirling, Scotland Dr Andrew Abel University of Stirling, Scotland University of Stirling - Scotland Cognitive Signal Image and Control Processing Research (COSIPRA) Cognitive Computation neurobiology, cognitive psychology

More information

CHROMA AND MFCC BASED PATTERN RECOGNITION IN AUDIO FILES UTILIZING HIDDEN MARKOV MODELS AND DYNAMIC PROGRAMMING. Alexander Wankhammer Peter Sciri

CHROMA AND MFCC BASED PATTERN RECOGNITION IN AUDIO FILES UTILIZING HIDDEN MARKOV MODELS AND DYNAMIC PROGRAMMING. Alexander Wankhammer Peter Sciri 1 CHROMA AND MFCC BASED PATTERN RECOGNITION IN AUDIO FILES UTILIZING HIDDEN MARKOV MODELS AND DYNAMIC PROGRAMMING Alexander Wankhammer Peter Sciri introduction./the idea > overview What is musical structure?

More information

Novel Subband Autoencoder Features for Non-intrusive Quality Assessment of Noise Suppressed Speech

Novel Subband Autoencoder Features for Non-intrusive Quality Assessment of Noise Suppressed Speech INTERSPEECH 16 September 8 12, 16, San Francisco, USA Novel Subband Autoencoder Features for Non-intrusive Quality Assessment of Noise Suppressed Speech Meet H. Soni, Hemant A. Patil Dhirubhai Ambani Institute

More information

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves Machine Learning A 708.064 11W 1sst KU Exercises Problems marked with * are optional. 1 Conditional Independence I [2 P] a) [1 P] Give an example for a probability distribution P (A, B, C) that disproves

More information

Visual object classification by sparse convolutional neural networks

Visual object classification by sparse convolutional neural networks Visual object classification by sparse convolutional neural networks Alexander Gepperth 1 1- Ruhr-Universität Bochum - Institute for Neural Dynamics Universitätsstraße 150, 44801 Bochum - Germany Abstract.

More information

application of learning vector quantization algorithms. In Proceedings of the International Joint Conference on

application of learning vector quantization algorithms. In Proceedings of the International Joint Conference on [5] Teuvo Kohonen. The Self-Organizing Map. In Proceedings of the IEEE, pages 1464{1480, 1990. [6] Teuvo Kohonen, Jari Kangas, Jorma Laaksonen, and Kari Torkkola. LVQPAK: A program package for the correct

More information

Automatic Enhancement of Correspondence Detection in an Object Tracking System

Automatic Enhancement of Correspondence Detection in an Object Tracking System Automatic Enhancement of Correspondence Detection in an Object Tracking System Denis Schulze 1, Sven Wachsmuth 1 and Katharina J. Rohlfing 2 1- University of Bielefeld - Applied Informatics Universitätsstr.

More information

Comparing MFCC and MPEG-7 Audio Features for Feature Extraction, Maximum Likelihood HMM and Entropic Prior HMM for Sports Audio Classification

Comparing MFCC and MPEG-7 Audio Features for Feature Extraction, Maximum Likelihood HMM and Entropic Prior HMM for Sports Audio Classification MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Comparing MFCC and MPEG-7 Audio Features for Feature Extraction, Maximum Likelihood HMM and Entropic Prior HMM for Sports Audio Classification

More information

ON THE EFFECT OF SCORE EQUALIZATION IN SVM MULTIMODAL BIOMETRIC SYSTEMS

ON THE EFFECT OF SCORE EQUALIZATION IN SVM MULTIMODAL BIOMETRIC SYSTEMS ON THE EFFECT OF SCORE EQUALIZATION IN SVM MULTIMODAL BIOMETRIC SYSTEMS Pascual Ejarque and Javier Hernando TALP Research Center, Department of Signal Theory and Communications Technical University of

More information