Dynamic Stream Weight Estimation in Coupled-HMM-based Audio-visual Speech Recognition Using Multilayer Perceptrons


INTERSPEECH 2014

Dynamic Stream Weight Estimation in Coupled-HMM-based Audio-visual Speech Recognition Using Multilayer Perceptrons

Ahmed Hussen Abdelaziz, Dorothea Kolossa
Institute of Communication Acoustics, Digital Signal Processing Group, Ruhr-Universität Bochum, Germany
{Ahmed.HussenAbdelAziz,

Abstract

Jointly using audio and video features can increase the robustness of automatic speech recognition systems in noisy environments. A systematic and reliable performance gain, however, is only achieved if the contributions of the audio and video stream to the decoding decision are dynamically optimized, for example via so-called stream weights. In this paper, we address the problem of dynamic stream weight estimation for coupled-HMM-based audio-visual speech recognition. We investigate the multilayer perceptron (MLP) for mapping reliability measure features to stream weights. As an input for the multilayer perceptron, we use a feature vector containing different model-based and signal-based reliability measures. Training of the multilayer perceptron has been achieved using dynamic oracle stream weights as target outputs, which are found using a recently proposed expectation maximization algorithm. This new approach of MLP-based stream weight estimation has been evaluated using the Grid audio-visual corpus and has outperformed the best baseline performance, yielding a % average relative error rate reduction.

Index Terms: Audio-visual ASR, coupled HMM, stream weight, reliability measure, multilayer perceptron

1. Introduction

Using visual observations in addition to acoustical observations in automatic speech recognition (ASR) has recently attracted remarkable research interest as a possible solution for the rapid performance drop of audio-only ASR in noisy environments. One reasonable requirement is that under all conditions, the resulting audio-visual (AV) ASR system should perform at least as well as, but typically better than, the best single-modality system. To ensure this, it is necessary to dynamically adapt the contribution of each modality to the classification decisions made by the audio-visual model. This can be done by weighting the contribution of each modality according to its information content and its reliability, using so-called stream weights (SW).

The stream weight estimation problem has been addressed in many prior works. In [1-3], for example, the stream weights have been considered as model parameters and estimated using a generative or a discriminative criterion. In [4, 5], on the other hand, the stream weights have been considered as feature-dependent and estimated per frame, based on different reliability measures via heuristically chosen mapping functions like the sigmoid or the exponential function. However, it has not been shown whether these functions are an optimal choice. Therefore, we propose here to train a multilayer perceptron (MLP) to implicitly choose the appropriate mapping function. In order to train the MLP parameters, a massive amount of training data should be used. The global fixed stream weights, found per condition via a grid search as in [4, 6], are not sufficient as target outputs for the MLP, because the number of estimated data points is far too small. Instead, we train the MLP using frame-dependent (dynamic) oracle stream weights, estimated using the recently proposed algorithm in [7].
The large number of oracle stream weights made available by this algorithm (one per feature vector in the training dataset) also enables the use of multidimensional reliability feature vectors. This allows us to use 31-dimensional feature vectors that combine different signal-based and model-based reliability measures as the input to the MLP. The proposed stream weights are tested using a coupled hidden Markov model (CHMM)-based AVASR system [8-11]. In contrast to other fusion models, e.g., the multistream hidden Markov model (MSHMM) [8-10, 12], the CHMM takes into account the asynchrony between the audio and the visual modality while preserving their temporal dependency by enforcing synchronization at the boundaries of certain speech units, e.g., phonemes, syllables, or words.

The remainder of the paper is organized as follows: In Section 2, we review the use of coupled HMMs as fusion models for AVASR. Next, in Section 3, we discuss using the MLP as a mapping function for dynamic stream weights. The reliability measures from which the stream weights are estimated using the MLP will also be discussed in this section. The algorithm used for estimating the frame-wise oracle stream weights, needed for training the MLP parameters, is discussed in Section 4. With all parts of the system in place, the proposed approach is evaluated on the Grid audio-visual database [13]. The experimental setup and results are presented in Section 5. Finally, we conclude the paper and give an outlook on further work in Section 6.

2. Coupled hidden Markov model

Audio-visual speech models should take into consideration the natural temporal dependencies of audio and video observations. The two modalities do exhibit asynchronicities, because the visual information derived from the articulator movements often precedes the acoustical signal generation [14], but a natural temporal dependency of both modalities results, as they are both the consequence of the same movements of the articulators. The CHMM can be used to model this joint temporal evolution. As shown in Figure 1, each hidden state of the CHMM is composed of a pair of the corresponding audio and video states of two marginal single-stream HMMs. The level of asynchrony allowed by the CHMM is controlled by the transition matrix of the model. The natural dependency over time between both modalities is additionally guaranteed by enforcing synchrony at appropriate model boundaries (here, at word boundaries).

Figure 1: Coupled HMM with N_A x N_V composite states.

The transition probability a_{i,j} between two composite states q_t = i = (i_A, i_V) and q_{t+1} = j = (j_A, j_V) in a CHMM can be computed by

$$a_{i,j} = p(q_{t+1} = j \mid q_t = i) = \prod_{s \in \{A,V\}} a^{s}_{i_s, j_s}. \qquad (1)$$

In (1), $a^{s}_{i_s, j_s}$ denotes the transition probability from state $q_t^s = i_s$ to state $q_{t+1}^s = j_s$ in the marginal single-modality HMM of stream s, where A denotes the audio and V the video stream. The observation likelihood of a composite state q in a CHMM can be computed by

$$b_{q_t = i}(O_t, \lambda_t) = p\bigl(O_t^A \mid q_t^A = i_A\bigr)^{\lambda_t} \cdot p\bigl(O_t^V \mid q_t^V = i_V\bigr)^{1-\lambda_t}. \qquad (2)$$

Here, $p(O_t^s \mid q_t^s)$ is the single-modality observation likelihood. The audio-visual observation O_t at time frame t consists of the acoustical observation O_t^A and the visual observation O_t^V. As can be seen in (2), the stream weight λ_t, which is the key quantity studied in this paper, controls the contribution of each modality to the overall score of the composite state q_t = i. A large stream weight at time frame t means that the acoustical observation is more reliable and contains more information than the visual observation at this time frame, and vice versa.
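To make (1) and (2) concrete, the following minimal numpy sketch (our illustration, not from the paper; all function names are assumptions) builds the composite transition matrix as the Kronecker product of the marginal transition matrices and scores the composite states with the stream-weighted log-likelihood:

```python
import numpy as np

def composite_transitions(A_audio, A_video):
    """Composite CHMM transition matrix as in Eq. (1): the entry for
    ((i_A, i_V), (j_A, j_V)) is a^A_{iA,jA} * a^V_{iV,jV}, which is exactly
    the Kronecker product of the marginal transition matrices."""
    return np.kron(A_audio, A_video)

def composite_log_likelihoods(log_b_audio, log_b_video, lam):
    """Stream-weighted composite observation scores as in Eq. (2), in the
    log domain. log_b_audio (N_A,) and log_b_video (N_V,) hold the marginal
    state log-likelihoods of one frame; lam is the stream weight lambda_t."""
    # One score per composite state (i_A, i_V); ravel() flattens in the
    # same row-major order that np.kron uses above.
    scores = lam * log_b_audio[:, None] + (1.0 - lam) * log_b_video[None, :]
    return scores.ravel()
```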

3. Stream Weight Estimation using MLPs

In many prior works, frame-dependent stream weights λ_t have been estimated using different mapping functions. These functions map single- or multidimensional reliability measure features to the stream weight. Convenient choices of such mapping functions are the sigmoid function [5, 6] or the exponential function [4]. However, these functions are chosen heuristically, and it is not clear whether they are optimal for the stream weight estimation task. As an alternative, we propose to use a multilayer perceptron to implicitly choose the most suitable function for this task.

Since we aspire to dynamically estimate the stream weights, the input reliability measure features of the MLP and their corresponding stream weight outputs should be computed for each frame. This means that large numbers of input-output tuples can be collected from a reasonable number of utterances. This abundance of input-output tuples makes it possible to properly train the MLP parameters in a supervised manner.

As an input to the MLP, we consider a 31-dimensional feature vector, which contains different model-based and signal-based reliability measures. The first and the second model-based reliability measures are the acoustical and the visual entropy, H_A and H_V [15-17]. The entropy is a measure of the decoder uncertainty regarding the discrete state, given an observation O^s. Therefore, small entropies indicate reliable features and vice versa. The entropy can be computed independently for each modality s at time frame t as follows:

$$H_t^s = -\sum_{i=1}^{N_s} p(q_t^s = i_s \mid O_t^s)\,\log\bigl(p(q_t^s = i_s \mid O_t^s)\bigr). \qquad (3)$$

The single-modality dispersions D_A and D_V [4, 6, 18], which are also model-based features, are used as the third and fourth reliability measures. The dispersion indicates how certain the decoder is when allocating the given observation to a model state, so in contrast to the entropy, reliable observations have large dispersion values. The dispersion of a single modality s ∈ {A, V} at time frame t is given via

$$D_t^s = \frac{2}{L_s(L_s - 1)} \sum_{k_s=1}^{L_s} \sum_{l_s = k_s+1}^{L_s} \log\frac{p(q_t^s = k_s \mid O_t^s)}{p(q_t^s = l_s \mid O_t^s)}. \qquad (4)$$

In order to compute the dispersion as in (4), the L_s largest posteriors $p(q_t^s = k_s \mid O_t^s)$ should first be arranged in descending order.
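As a concrete reference, here is a small numpy sketch (ours, not the paper's; it assumes the log-ratio form of the dispersion used in [4], and L is a free parameter here) of the two model-based measures in Eqs. (3) and (4):

```python
import numpy as np

def entropy(posteriors, eps=1e-12):
    """Decoder entropy as in Eq. (3); `posteriors` is the vector of
    p(q_t^s = i | O_t^s) over the N_s states of one stream."""
    p = np.clip(posteriors, eps, 1.0)
    return -np.sum(p * np.log(p))

def dispersion(posteriors, L=5, eps=1e-12):
    """Dispersion as in Eq. (4), computed over the L largest posteriors
    arranged in descending order."""
    p = np.sort(np.clip(posteriors, eps, 1.0))[::-1][:L]
    acc = 0.0
    for k in range(L):
        for l in range(k + 1, L):
            acc += np.log(p[k] / p[l])   # p[k] >= p[l], so each term >= 0
    return 2.0 * acc / (L * (L - 1))
```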
These four model-based reliability measure features are heuristic measures of the mismatch between the given observation O_t^s and the underlying model.

The remaining reliability measure features are signal-based rather than model-based. Specifically, we are using the signal-to-noise ratio (SNR), averaged soft and hard voice activity detection (VAD) cues, and the estimation uncertainty of the acoustical observations. The availability of these signal-based features depends on the chosen acoustic pre-processor. In this study, we use a Wiener filter as a pre-processor to enhance the speech signal before extracting the acoustical observations. Speech enhancement algorithms such as the Wiener filter need an estimate of the noise power. This can be found either by explicitly employing speech pause detection or by using algorithms like improved minima controlled recursive averaging (IMCRA) [19-21] to estimate the noise floor. The latter algorithm, which is used here, provides two types of VAD cues in each time-frequency bin: hard (binary) and probabilistic. The fifth and the sixth components of the reliability measure feature vector are the averages of each of these two VAD cues across all frequency bins. Moreover, the noise power estimate obtained using IMCRA, together with the power of the estimated clean signal from the Wiener filter, is used to compute the frame-wise signal-to-noise ratio via

$$\mathrm{SNR}_t = 10\log_{10}\!\left(\frac{S_t}{N_t}\right), \qquad (5)$$

where S_t and N_t are the estimated signal and noise energies at time frame t, respectively.

The final reliability measure feature is the uncertainty of the enhanced acoustical observations [22, 23]. This uncertainty is an estimate of the residual noise and estimation errors in the acoustical features after applying the Wiener filter. The uncertainty σ² can be computed for each time-frequency bin as

$$\sigma_t^2 = \operatorname{Var}[x_t \mid y_t], \qquad (6)$$

where x_t and y_t are single components of the short-time discrete Fourier transform (DFT) of the clean and noisy acoustical signals, respectively. The dimension of the uncertainty is reduced from the DFT length to 24 by a linear transform using the Mel filterbank matrix.
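Putting the pieces together, the following sketch (names and shapes are our assumptions) shows Eq. (5) and the assembly of the 31-dimensional MLP input, which stacks the two entropies, the two dispersions, the two averaged VAD cues, the frame SNR, and the 24 Mel-domain uncertainties:

```python
import numpy as np

def frame_snr_db(signal_power, noise_power, eps=1e-12):
    """Frame-wise SNR as in Eq. (5), from the Wiener-filter clean-signal
    power estimate and the IMCRA noise-power estimate."""
    return 10.0 * np.log10(max(signal_power, eps) / max(noise_power, eps))

def reliability_features(H_A, H_V, D_A, D_V, vad_soft, vad_hard,
                         snr_db, uncertainty_mel):
    """Concatenate the per-frame reliability measures into the
    31-dimensional MLP input: 2 entropies + 2 dispersions
    + 2 averaged VAD cues + 1 SNR + 24 Mel-domain uncertainties."""
    assert uncertainty_mel.shape == (24,)
    return np.concatenate(([H_A, H_V, D_A, D_V,
                            vad_soft, vad_hard, snr_db], uncertainty_mel))
```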

After defining the reliability measure feature vectors in this way, in order to train the MLP parameters, it is still necessary to find the corresponding optimal stream weight for each time frame of a training data set. For this task, we use an expectation maximization algorithm that we have recently proposed [7]. A brief summary of this algorithm is given in the next section.

4. Oracle stream weight estimation

The question addressed in this section is as follows: given the audio-visual observation sequence O = {O_t}, t = 1, ..., T, of an utterance and its corresponding correct transcription w, how can we estimate the optimal dynamic stream weights λ = {λ_t}, t = 1, ..., T, which we also term oracle stream weights because prior knowledge of the correct word sequence enters into their computation? A possible answer to this question is introduced in [5, 9] for AVASR based on multi-stream HMMs. The approach in [5, 9] requires a-priori knowledge of the frame-state alignment, which is found using forced alignment. However, computing the frame-state alignment requires the state-conditional observation likelihoods given by (2), which already need the dynamic stream weights: a classical chicken-and-egg dilemma. To solve this issue, we have proposed a new algorithm in [7] that needs no prior knowledge of the frame-state alignment for estimating the oracle dynamic stream weights, cf. Algorithm 1.

Algorithm 1: Oracle Dynamic SWs for CHMM-based AVASR
A. Set the prior parameters
  (1) Set μ_λ to the global fixed stream weight λ_G.
  (2) Initialize σ_λ with a small value, e.g., 0.1.
B. Initialization
  (3) Initialize the stream weight set λ̂ = λ̄ (see Eq. (9)).
C. EM algorithm
  (4) Calculate P = p(O, λ̂ | w).
  E step: (5) Use λ̂ to calculate γ_t(i) for all times and 2-D states.
  M step: (6) Update λ̂ using Equation (8).
  Convergence test: (7) Calculate P' = p(O, λ̂ | w).
  (8) If P' − P > ε, set P = P' and go to (5); else, go to (9).
D. Recognition
  (9) Use the estimated λ̂ to recognize the training utterance and calculate the accuracy A(σ_λ).
E. Iteration
  (10) Increase σ_λ and repeat (3)-(10) until all values of λ̂ become 0 or 1.
  (11) Choose the λ̂ that achieves the best accuracy.

The algorithm initializes the stream weight as a Gaussian random variable with mean μ_λ and standard deviation σ_λ. These statistics are assumed to take different values for different noise types and levels. We set the mean value μ_λ, which we call the bias parameter, to the optimal global fixed stream weight λ_G. The global fixed stream weight can be found using a grid search, minimizing the word error rate. We term the standard deviation σ_λ the sensitivity parameter, as it controls the dynamics of the stream weight: large standard deviations lead to strong changes in the stream weights and vice versa. In Step (2) of Algorithm 1, we start with a small σ_λ, then increase it iteratively in Step (10) until all stream weights become either zero or one. Increasing the standard deviation further would not change the values of the stream weights, which are bounded between zero and one. In Steps (3)-(8), the optimal set of stream weights is found given the prior parameters μ_λ and σ_λ of the current iteration. In Step (9), we test the stream weight set estimated in the current iteration by blindly decoding the utterance using an audio-visual speech recognizer.
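The outer loop of Algorithm 1 can be summarized in a compact sketch (ours, not the paper's implementation; the EM inner loop of Steps (3)-(8) and the decoder of Step (9) are abstracted behind placeholder callables):

```python
import numpy as np

def oracle_stream_weights(O, w, lambda_G, sigma_steps, em_update, accuracy):
    """Skeleton of Algorithm 1. `em_update(O, w, mu, sigma)` runs
    Steps (3)-(8) for fixed prior parameters and returns a weight per
    frame; `accuracy(O, w, lam)` decodes the utterance with those
    weights (Step (9)). Both are placeholders for the recognizer."""
    mu = lambda_G                          # Step (1): bias = global fixed SW
    best_lam, best_acc = None, -np.inf
    for sigma in sigma_steps:              # Steps (2), (10): growing sigma
        lam = em_update(O, w, mu, sigma)   # Steps (3)-(8): EM estimate
        acc = accuracy(O, w, lam)          # Step (9): blind decoding
        if acc > best_acc:
            best_lam, best_acc = lam, acc
        if np.all((lam <= 0.0) | (lam >= 1.0)):
            break                          # all weights saturated at 0 or 1
    return best_lam                        # Step (11): best-scoring set
```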
Finally, the stream weight set λ̂ that yields the lowest error rate is selected as the final stream weight set for the given utterance.

In Steps (4)-(8), we use an EM algorithm to find an estimate λ̂ that maximizes the following objective function:

$$F = p(O, \lambda \mid w) \quad \text{s.t.} \quad 0 \le \lambda \le 1. \qquad (7)$$

A local maximum of F can be found by estimating the frame-wise stream weights as follows¹:

$$\hat{\lambda}_t = \mu_\lambda + \sigma_\lambda^2 \sum_{i=(1,1)}^{(N_A, N_V)} \gamma_t(i)\,\log\!\left(\frac{p(O_t^A \mid q_t = i)}{p(O_t^V \mid q_t = i)}\right). \qquad (8)$$

If the resulting λ̂_t in (8) lies outside [0, 1], it is clipped to these boundaries. Equation (8) represents the M step (Step (6)). To compute λ̂_t as in (8), the E step (Step (5)) is first applied to compute the composite state occupation probabilities γ_t(i) for all times and all composite states i ∈ {(1,1), ..., (N_A, N_V)}. The EM algorithm is iterated until a local maximum of (7) is found. In Step (3), a reasonably initialized stream weight set λ̄ is found using a greedy layer-wise approach that solves the optimization problem

$$\bar{\lambda}_t = \operatorname*{argmax}_{\lambda_t} \bigl\{ p\bigl(O_{1,\dots,t}, \lambda_t \mid \bar{\lambda}_{1,\dots,t-1}, w\bigr) \bigr\} \quad \text{s.t.} \quad 0 \le \lambda_t \le 1. \qquad (9)$$

The optimization problem in (9) is convex in the feasible region, i.e., 0 ≤ λ_t ≤ 1, and can be solved by gradient ascent [24].

¹ A rigorous derivation of the EM algorithm and of the initialization approach is given in [7].
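As an illustration, the M step of Eq. (8) for a whole utterance fits in a few lines of numpy (a sketch under our naming assumptions, not the authors' code):

```python
import numpy as np

def m_step_update(mu, sigma, gamma, log_b_audio, log_b_video):
    """M step of Eq. (8). `gamma` has shape (T, N_A*N_V) and holds the
    composite state occupancies gamma_t(i) from the E step; log_b_audio
    and log_b_video hold log p(O_t^A | q_t = i) and log p(O_t^V | q_t = i)
    for every frame and composite state (same shape as gamma)."""
    lam = mu + sigma**2 * np.sum(gamma * (log_b_audio - log_b_video), axis=1)
    return np.clip(lam, 0.0, 1.0)   # clip to the feasible region [0, 1]
```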

5. Experiments and Results

The Grid audio-visual speech corpus has been used to evaluate the proposed approach. The Grid database contains audio-visual recordings uttered by 33 speakers, i.e., 1000 sentences per speaker. The Grid task is to recognize English sentences of the form command-color-preposition-letter-digit-adverb. We have divided the signals into three sets: a training set containing 90% of the signals, and a development and a test set, each of which contains 5% of the corpus signals. The training set has been used to separately train the marginal single-modality HMMs. The development set has mainly been used to train the MLP. To test the proposed approach under different acoustical conditions, we have used eight additional noisy versions of the test and development sets. The noisy signals have been created by adding babble and white noise signals to the clean signals at four SNR levels between 0 dB and 15 dB. The babble and white noise signals stem from the NOISEX corpus [25] and were chosen to represent both stationary and non-stationary noise.

The spectrograms of all signals have been extracted using the configuration parameters recommended in the ETSI-AFE [26]. From the spectrogram, we have extracted the first 13 mel-frequency cepstral coefficients (MFCCs). The acoustical feature vectors are then composed of these 13-dimensional static feature vectors concatenated with their first and second temporal derivatives. The visual features have been extracted as follows: first, the region of interest (ROI), a rectangular area containing the speaker's mouth, has been found with the Viola-Jones algorithm [27]. Next, the two-dimensional discrete cosine transform (DCT) has been applied to the ROI. The first 64 DCT coefficients were then used as the visual features. The dimensions of the acoustical and visual observations have finally been reduced to 31 using linear discriminant analysis (LDA) [28, 29].

The single-modality word HMMs are speaker-dependent, linear models. The number of states in each word HMM is proportional to the number of phonemes contained in the word, with a proportionality factor of 3 for audio HMMs and 1 for video HMMs. The output probability distributions of all emitting states are Gaussian mixture models with 3 mixture components for audio HMMs and 4 for video HMMs. The Java Audiovisual SPEech Recognizer (JASPER) system [30] has been used for training and recognition.

The MLP has an input layer with 31 input units, two hidden layers with 10 neurons each, and a one-dimensional output layer. All neurons of the hidden layers have tan-sigmoid transfer functions; the output neuron has a linear transfer function. Estimated stream weights that are smaller than zero or larger than one are clipped to these boundaries.

We consider the results obtained using the approach in [4] as our baseline. Here, a second-order exponential function,

$$\lambda_t = a e^{b\,\mathrm{RM}_t} + c e^{d\,\mathrm{RM}_t}, \qquad (10)$$

was used to map a one-dimensional reliability measure RM, e.g., RM_t = SNR_t, to a stream weight λ_t at each time frame t. The parameters a, b, c, and d of this mapping function have been estimated using non-linear least squares, as sketched below. One training data point is computed for each development set as follows: the argument of the mapping function is found by averaging a one-dimensional reliability measure, e.g., SNR, entropy, or dispersion, over the whole development set. As the target stream weight of the set, the global fixed stream weight λ_G is used, which is found via grid search. Therefore, the number of training data points for parameter estimation equals the number of development sets, which is typically quite small.
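For concreteness, a minimal scipy sketch of this baseline fit (the numerical data points below are invented placeholders for illustration, not values from the paper):

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_mapping(rm, a, b, c, d):
    """Second-order exponential mapping of Eq. (10)."""
    return a * np.exp(b * rm) + c * np.exp(d * rm)

# Hypothetical training data: one (mean reliability measure, global fixed
# stream weight) pair per development set, e.g., mean SNR in dB vs. lambda_G.
rm_means = np.array([-5.0, 0.0, 5.0, 10.0, 15.0, 30.0])
lambda_G = np.array([0.15, 0.35, 0.55, 0.70, 0.80, 0.95])

# Non-linear least squares for the parameters a, b, c, d.
params, _ = curve_fit(exp_mapping, rm_means, lambda_G,
                      p0=(0.5, 0.01, -0.5, -0.1), maxfev=10000)

# Map one frame's reliability measure to a stream weight, clipped to [0, 1].
lam_t = np.clip(exp_mapping(5.0, *params), 0.0, 1.0)
```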
Table 1: Recognition performance in terms of word accuracy for single-stream and audio-visual ASR systems in a range of acoustical conditions, using different stream weight estimation schemes. (Rows: babble noise, white noise, clean speech, and the average; columns: noise type, SNR [dB], audio-only and video-only results, and audio-visual results for Bayes fusion, the exponential-function mapping with entropy, SNR, and dispersion inputs, the proposed MLP, λ = λ_G, and λ = ODSW. The numerical entries were lost in extraction; asterisks mark results significantly better than the best baseline result.)

Table 1 shows the performance in terms of word accuracy for audio-only, video-only, and audio-visual ASR. The stream weights used with the AVASR system have been computed using different approaches. The first approach is simple Bayes fusion, where audio and video likelihoods are always weighted equally. In the second, third, and fourth approaches, the stream weights have been estimated as in [4]. The fifth approach is the proposed one, in which the stream weights are estimated using an MLP. The sixth and the seventh approaches are upper bounds on the system performance, as their stream weights have been estimated using a-priori knowledge of the correct transcription. The stream weights of the sixth approach are the global fixed stream weights obtained using a grid search with a minimum word error rate criterion. The oracle dynamic stream weights (ODSWs) used in the last approach are those estimated using Algorithm 1, discussed in Section 4.

As seen in Table 1, the proposed approach significantly outperforms the baseline results and approaches the results obtained using optimal fixed stream weights. An asterisk in Table 1 means that the result is significantly better than the respective best result of the baseline; the statistical significance has been tested using Fisher's exact test [31]. However, the gap between the results obtained using the oracle dynamic stream weights and the weights estimated using the MLP is also significant. This points towards further room for optimization, e.g., through the inclusion of other robust and indicative reliability measure features.

6. Conclusions

Automatic speech recognition can become highly noise-robust when visual observations are used in conjunction with acoustical ones. It is, however, important to dynamically control the contribution of each observation to the classification decision, for example by using dynamic stream weights. We have proposed to use an MLP for mapping reliability measure features to dynamic stream weights. In order to train the MLP parameters, we have employed oracle dynamic stream weights, which were estimated using a recently proposed EM algorithm. As the MLP input, we have used composite feature vectors containing signal-based as well as model-based reliability measures. The proposed approach has significantly outperformed the baseline for CHMM-based AVASR. We expect to achieve even better performance when further noise-robust reliability measures are included as additional MLP inputs.

7. References

[1] J. Hernando, "Maximum likelihood weighting of dynamic speech features for CDHMM speech recognition," in ICASSP, Munich, Germany.
[2] G. Gravier, S. Axelrod, G. Potamianos, and C. Neti, "Maximum entropy and MCE based HMM stream weight estimation for audio-visual ASR," in ICASSP, Orlando, Florida, USA.
[3] L. Peng and W. Zuoying, "Stream weight training based on MCE for audio-visual LVCSR," Tsinghua Science and Technology, vol. 10, no. 2.
[4] V. Estellers, M. Gurban, and J.-P. Thiran, "On dynamic stream weighting for audio-visual speech recognition," IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 4.
[5] A. Garg, G. Potamianos, C. Neti, and T. S. Huang, "Frame-dependent multi-stream reliability indicators for audio-visual speech recognition," in International Conference on Multimedia and Expo, Baltimore, Maryland, USA.
[6] M. Heckmann, F. Berthommier, and K. Kroschel, "Noise adaptive stream weighting in audio-visual speech recognition," EURASIP Journal on Applied Signal Processing, vol. 2002.
[7] A. H. Abdelaziz, S. Zeiler, and D. Kolossa, "A new EM estimation of dynamic stream weights for coupled-HMM-based audio-visual ASR," in ICASSP, Florence, Italy.
[8] A. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, "Dynamic Bayesian networks for audio-visual speech recognition," EURASIP Journal on Applied Signal Processing, vol. 11.
[9] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, "Recent advances in the automatic recognition of audio-visual speech," Proceedings of the IEEE, vol. 91, no. 9.
[10] M. Tomlinson, M. Russell, and N. M. Brooke, "Integrating audio and visual information to provide highly robust speech recognition," in ICASSP, Atlanta, Georgia, USA.
[11] J. Luettin, G. Potamianos, and C. Neti, "Asynchronous stream modeling for large vocabulary audio-visual speech recognition," in ICASSP, Salt Lake City, Utah, USA.
[12] H. Bourlard and S. Dupont, "A new ASR approach based on independent processing and recombination of partial frequency bands," in ICSLP, Philadelphia, Pennsylvania, USA.
[13] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, no. 5.
[14] D. W. Massaro and D. G. Stork, "Speech recognition and sensory integration," American Scientist, vol. 86, no. 3.
[15] M. Gurban and J.-P. Thiran, "Using entropy as a stream reliability estimate for audio-visual speech recognition," in European Signal Processing Conference, Lausanne, Switzerland.
[16] G. Potamianos and C. Neti, "Stream confidence estimation for audio-visual speech recognition," in ICSLP, Beijing, China.
[17] H. Misra, H. Bourlard, and V. Tyagi, "New entropy based combination rule in HMM/ANN multi-stream ASR," in ICASSP, Hong Kong.
[18] A. Adjoudani and C. Benoit, "On the integration of auditory and visual parameters in an HMM-based ASR," in NATO ASI Conference on Speech reading by Man and Machine: Models, Systems and Applications, Berlin, Germany.
[19] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5.
[20] I. Cohen and B. Berdugo, "Noise estimation by minima controlled recursive averaging for robust speech enhancement," IEEE Signal Processing Letters, vol. 9, no. 1.
[21] I. Cohen, "Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 5.
[22] L. Deng, J. Droppo, and A. Acero, "Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion," IEEE Trans. Speech and Audio Processing, vol. 13, no. 3.
[23] R. F. Astudillo, D. Kolossa, and R. Orglmeister, "Accounting for the uncertainty of speech estimates in the complex domain for minimum mean square error speech enhancement," in Interspeech, Brighton, United Kingdom.
[24] S. Boyd and L. Vandenberghe, Convex Optimization, 7th ed. Cambridge University Press.
[25] A. Varga and H. J. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3.
[26] Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithms, ETSI ES Std.
[27] G. Bradski and A. Kaehler, Computer Vision with the OpenCV Library. O'Reilly Media.
[28] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 6th ed. Pearson.
[29] D. Kolossa, S. Zeiler, R. Saeidi, and R. Astudillo, "Noise-adaptive LDA: A new approach for speech recognition under observation uncertainty," IEEE Signal Processing Letters, vol. 20, no. 11.
[30] A. Vorwerk, S. Zeiler, D. Kolossa, R. F. Astudillo, and D. Lerch, "Use of Missing and Unreliable Data for Audiovisual Speech Recognition," in Robust Speech Recognition of Uncertain or Missing Data. Springer, 2011.
[31] A. Agresti, "A survey of exact inference for contingency tables," Statistical Science, vol. 7, no. 1.


Markov Random Fields and Gibbs Sampling for Image Denoising Markov Random Fields and Gibbs Sampling for Image Denoising Chang Yue Electrical Engineering Stanford University changyue@stanfoed.edu Abstract This project applies Gibbs Sampling based on different Markov

More information

The Automatic Musicologist

The Automatic Musicologist The Automatic Musicologist Douglas Turnbull Department of Computer Science and Engineering University of California, San Diego UCSD AI Seminar April 12, 2004 Based on the paper: Fast Recognition of Musical

More information

Available online Journal of Scientific and Engineering Research, 2016, 3(4): Research Article

Available online   Journal of Scientific and Engineering Research, 2016, 3(4): Research Article Available online www.jsaer.com, 2016, 3(4):417-422 Research Article ISSN: 2394-2630 CODEN(USA): JSERBR Automatic Indexing of Multimedia Documents by Neural Networks Dabbabi Turkia 1, Lamia Bouafif 2, Ellouze

More information

Lecture 21 : A Hybrid: Deep Learning and Graphical Models

Lecture 21 : A Hybrid: Deep Learning and Graphical Models 10-708: Probabilistic Graphical Models, Spring 2018 Lecture 21 : A Hybrid: Deep Learning and Graphical Models Lecturer: Kayhan Batmanghelich Scribes: Paul Liang, Anirudha Rayasam 1 Introduction and Motivation

More information

Dr Andrew Abel University of Stirling, Scotland

Dr Andrew Abel University of Stirling, Scotland Dr Andrew Abel University of Stirling, Scotland University of Stirling - Scotland Cognitive Signal Image and Control Processing Research (COSIPRA) Cognitive Computation neurobiology, cognitive psychology

More information

CHROMA AND MFCC BASED PATTERN RECOGNITION IN AUDIO FILES UTILIZING HIDDEN MARKOV MODELS AND DYNAMIC PROGRAMMING. Alexander Wankhammer Peter Sciri

CHROMA AND MFCC BASED PATTERN RECOGNITION IN AUDIO FILES UTILIZING HIDDEN MARKOV MODELS AND DYNAMIC PROGRAMMING. Alexander Wankhammer Peter Sciri 1 CHROMA AND MFCC BASED PATTERN RECOGNITION IN AUDIO FILES UTILIZING HIDDEN MARKOV MODELS AND DYNAMIC PROGRAMMING Alexander Wankhammer Peter Sciri introduction./the idea > overview What is musical structure?

More information

Novel Subband Autoencoder Features for Non-intrusive Quality Assessment of Noise Suppressed Speech

Novel Subband Autoencoder Features for Non-intrusive Quality Assessment of Noise Suppressed Speech INTERSPEECH 16 September 8 12, 16, San Francisco, USA Novel Subband Autoencoder Features for Non-intrusive Quality Assessment of Noise Suppressed Speech Meet H. Soni, Hemant A. Patil Dhirubhai Ambani Institute

More information

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves Machine Learning A 708.064 11W 1sst KU Exercises Problems marked with * are optional. 1 Conditional Independence I [2 P] a) [1 P] Give an example for a probability distribution P (A, B, C) that disproves

More information

Visual object classification by sparse convolutional neural networks

Visual object classification by sparse convolutional neural networks Visual object classification by sparse convolutional neural networks Alexander Gepperth 1 1- Ruhr-Universität Bochum - Institute for Neural Dynamics Universitätsstraße 150, 44801 Bochum - Germany Abstract.

More information

application of learning vector quantization algorithms. In Proceedings of the International Joint Conference on

application of learning vector quantization algorithms. In Proceedings of the International Joint Conference on [5] Teuvo Kohonen. The Self-Organizing Map. In Proceedings of the IEEE, pages 1464{1480, 1990. [6] Teuvo Kohonen, Jari Kangas, Jorma Laaksonen, and Kari Torkkola. LVQPAK: A program package for the correct

More information

Automatic Enhancement of Correspondence Detection in an Object Tracking System

Automatic Enhancement of Correspondence Detection in an Object Tracking System Automatic Enhancement of Correspondence Detection in an Object Tracking System Denis Schulze 1, Sven Wachsmuth 1 and Katharina J. Rohlfing 2 1- University of Bielefeld - Applied Informatics Universitätsstr.

More information

Comparing MFCC and MPEG-7 Audio Features for Feature Extraction, Maximum Likelihood HMM and Entropic Prior HMM for Sports Audio Classification

Comparing MFCC and MPEG-7 Audio Features for Feature Extraction, Maximum Likelihood HMM and Entropic Prior HMM for Sports Audio Classification MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Comparing MFCC and MPEG-7 Audio Features for Feature Extraction, Maximum Likelihood HMM and Entropic Prior HMM for Sports Audio Classification

More information

ON THE EFFECT OF SCORE EQUALIZATION IN SVM MULTIMODAL BIOMETRIC SYSTEMS

ON THE EFFECT OF SCORE EQUALIZATION IN SVM MULTIMODAL BIOMETRIC SYSTEMS ON THE EFFECT OF SCORE EQUALIZATION IN SVM MULTIMODAL BIOMETRIC SYSTEMS Pascual Ejarque and Javier Hernando TALP Research Center, Department of Signal Theory and Communications Technical University of

More information