Dynamic Stream Weight Estimation in Coupled-HMM-based Audio-visual Speech Recognition Using Multilayer Perceptrons
INTERSPEECH 2014

Ahmed Hussen Abdelaziz, Dorothea Kolossa
Institute of Communication Acoustics, Digital Signal Processing Group, Ruhr-Universität Bochum, Germany
{Ahmed.HussenAbdelAziz,

Abstract

Jointly using audio and video features can increase the robustness of automatic speech recognition systems in noisy environments. A systematic and reliable performance gain, however, is only achieved if the contributions of the audio and video stream to the decoding decision are dynamically optimized, for example via so-called stream weights. In this paper, we address the problem of dynamic stream weight estimation for coupled-HMM-based audio-visual speech recognition. We investigate the multilayer perceptron (MLP) for mapping reliability measure features to stream weights. As input for the multilayer perceptron, we use a feature vector containing different model-based and signal-based reliability measures. The multilayer perceptron has been trained using dynamic oracle stream weights as target outputs, which are found using a recently proposed expectation maximization algorithm. This new approach of MLP-based stream weight estimation has been evaluated using the Grid audio-visual corpus and has outperformed the best baseline performance, yielding a % average relative error rate reduction.

Index Terms: Audio-visual ASR, coupled HMM, stream weight, reliability measure, multilayer perceptron

1. Introduction

Using visual observations in addition to acoustical observations in automatic speech recognition (ASR) has recently attracted remarkable research interest as a possible solution for the rapid performance drop of audio-only ASR in noisy environments. One reasonable requirement is that under all conditions, the resulting audio-visual (AV) ASR system should perform at least as well as, and typically better than, the best single-modality system.
To ensure this, it is necessary to dynamically adapt the contribution of each modality to the classification decisions made by the audio-visual model. This can be done by weighting the contribution of each modality according to its information content and its reliability, using so-called stream weights (SWs). The stream weight estimation problem has been addressed in many prior works. In [1-3], for example, the stream weights have been considered as model parameters and estimated using a generative or a discriminative criterion. In [4, 5], on the other hand, the stream weights have been considered as feature-dependent and estimated per frame, based on different reliability measures via heuristically chosen mapping functions like the sigmoid or the exponential function. However, it has not been shown whether these functions are an optimal choice. Therefore, we propose here to train a multilayer perceptron (MLP) to implicitly choose an appropriate mapping function. Training the MLP parameters requires a large amount of training data. The global fixed stream weights, found per condition via a grid search as in [4, 6], are not sufficient as target outputs for the MLP, because the number of estimated data points is far too small. Instead, we train the MLP using frame-dependent (dynamic) oracle stream weights, estimated using the newly proposed algorithm in [7]. The large number of oracle stream weights made available by this algorithm (one per feature vector in the training dataset) also enables the use of multidimensional reliability feature vectors. This allows us to use 31-dimensional feature vectors that combine different signal-based and model-based reliability measures as the input to the MLP. The proposed stream weights are tested using a coupled hidden Markov model (CHMM)-based AVASR system [8-11].
In contrast to other fusion models, e.g., the multi-stream hidden Markov model (MSHMM) [8-10, 12], the CHMM takes into account the asynchrony between the audio and the visual modality while preserving their temporal dependency by enforcing synchronization at the boundaries of certain speech units, e.g., phonemes, syllables, or words.

The remainder of the paper is organized as follows: In Section 2, we review the use of coupled HMMs as fusion models for AVASR. Next, in Section 3, we discuss using the MLP as a mapping function for dynamic stream weights. The reliability measures from which the stream weights are estimated using the MLP are also discussed in this section. The algorithm used for estimating the frame-wise oracle stream weights, needed for training the MLP parameters, is discussed in Section 4. With all parts of the system in place, the proposed approach is evaluated on the Grid audio-visual database [13]. The experimental setup and results are presented in Section 5. Finally, we conclude the paper and give an outlook on further work in Section 6.

2. Coupled hidden Markov model

Audio-visual speech models should take into consideration the natural temporal dependencies of audio and video observations. The two modalities do exhibit asynchronies, because the visual information derived from the articulator movements often precedes the acoustical signal generation [14], but a natural temporal dependency of both modalities results, as they are both consequences of the same movements of the articulators. The CHMM can be used to model this joint temporal evolution. As shown in Figure 1, each hidden state of the CHMM is composed of a pair of corresponding audio and video states of two marginal single-stream HMMs. The level of asynchrony allowed by the CHMM is controlled by the transition matrix of the model.
The natural dependency over time between both modalities is additionally guaranteed by enforcing synchrony at appropriate model boundaries (here, at word boundaries).

Copyright 2014 ISCA, September 2014, Singapore

Figure 1: Coupled HMM with $N^A \cdot N^V$ composite states.

The transition probability $a_{i,j}$ between two composite states $q_t = i = (i^A, i^V)$ and $q_{t+1} = j = (j^A, j^V)$ in a CHMM can be computed as

$$a_{i,j} = p(q_{t+1} = j \mid q_t = i) = \prod_{s \in \{A,V\}} a^s_{i^s, j^s}. \quad (1)$$

In (1), $a^s_{i^s, j^s}$ denotes the transition probability from state $q^s_t = i^s$ to state $q^s_{t+1} = j^s$ in the marginal single-modality HMM of stream $s$. $A$ denotes the audio and $V$ the video stream. The observation likelihood of a composite state $q_t$ in a CHMM can be computed as

$$b_{q_t = i}(O_t, \lambda_t) = p(O^A_t \mid q^A_t = i^A)^{\lambda_t} \cdot p(O^V_t \mid q^V_t = i^V)^{1 - \lambda_t}. \quad (2)$$

Here, $p(O^s_t \mid q^s_t)$ is the single-modality observation likelihood. The audio-visual observation $O_t$ at time frame $t$ consists of the acoustical observation $O^A_t$ and the visual observation $O^V_t$. As can be seen in (2), the stream weight $\lambda_t$, which is the key quantity studied in this paper, controls the contribution of each modality to the overall score of the composite state $q_t = i$. A large stream weight at time frame $t$ means that the acoustical observation is more reliable and contains more information than the visual observation at this time frame, and vice versa.

3. Stream Weight Estimation using MLPs

In many prior works, frame-dependent stream weights $\lambda_t$ have been estimated using different mapping functions. These functions map single- or multidimensional reliability measure features to the stream weight. Convenient choices of such mapping functions are the sigmoid function [5, 6] or the exponential function [4]. However, these functions are chosen heuristically and it is not clear whether they are optimal for the stream weight estimation task. As an alternative, we propose to use a multilayer perceptron to choose the most suitable function for this task.
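As a concrete illustration of Eqs. (1) and (2), the composite transition probability and the weighted observation likelihood can be sketched in a few lines of NumPy. This is our own minimal sketch (function names and array layouts are assumptions), not the authors' implementation; log-likelihoods are used for Eq. (2), as is common in decoders for numerical stability:

```python
import numpy as np

def composite_transition(a_audio, a_video, iA, iV, jA, jV):
    """Eq. (1): the transition probability between composite states
    i = (iA, iV) and j = (jA, jV) is the product of the marginal
    single-stream transition probabilities."""
    return a_audio[iA, jA] * a_video[iV, jV]

def composite_log_likelihood(log_b_audio, log_b_video, lam):
    """Eq. (2) in the log domain: the stream weight lam blends the
    audio and video observation log-likelihoods of a composite state."""
    return lam * log_b_audio + (1.0 - lam) * log_b_video
```

With lam = 1 the score reduces to the audio stream alone, with lam = 0 to the video stream alone, matching the interpretation of the stream weight given above.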
Since we aspire to dynamically estimating the stream weights, the input reliability measure features of the MLP and their corresponding stream weight outputs should be computed for each frame. This means that large numbers of input-output tuples can be collected from a reasonable number of utterances. This abundance of input-output tuples makes it possible to properly train the MLP parameters in a supervised manner.

As input to the MLP, we consider a 31-dimensional feature vector, which contains different model-based and signal-based reliability measures. The first and the second model-based reliability measures are the acoustical and the visual entropy, $H^A$ and $H^V$ [15-17]. The entropy is a measure of the decoder uncertainty regarding the discrete state, given an observation $O^s$. Therefore, small entropies indicate reliable features and vice versa. The entropy can be computed independently for each modality $s$ at time frame $t$ as follows:

$$H^s_t = -\sum_{i^s=1}^{N^s} p(q^s_t = i^s \mid O^s_t) \log\left(p(q^s_t = i^s \mid O^s_t)\right). \quad (3)$$

The single-modality dispersions $D^A$ and $D^V$ [4, 6, 18], which are also model-based features, are used as the third and fourth reliability measures. The dispersion indicates how certain the decoder is when allocating the given observation to a model state, so in contrast to the entropy, reliable observations have large dispersion values. The dispersion of a single modality $s \in \{A, V\}$ at time frame $t$ is given via

$$D^s_t = \frac{2}{L^s (L^s - 1)} \sum_{k^s=1}^{L^s} \sum_{l^s = k^s + 1}^{L^s} \left( p(q^s_t = k^s \mid O^s_t) - p(q^s_t = l^s \mid O^s_t) \right). \quad (4)$$

In order to compute the dispersion as in (4), the $L^s$ largest posteriors $p(q^s_t = k^s \mid O^s_t)$ should first be arranged in descending order. These four model-based reliability measure features are heuristic measures for the mismatch between the given observation $O^s_t$ and the underlying model. The remaining reliability measure features are signal-based, rather than model-based.
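For illustration, the entropy (3) and the dispersion (4) can be computed from a frame's posterior vector as follows; a minimal NumPy sketch with hypothetical function names:

```python
import numpy as np

def frame_entropy(posteriors):
    """Eq. (3): entropy of the state posteriors p(q_t = i | O_t) for one
    modality and one frame; small values indicate a reliable frame."""
    p = np.clip(posteriors, 1e-12, 1.0)  # guard against log(0)
    return -np.sum(p * np.log(p))

def frame_dispersion(posteriors, L):
    """Eq. (4): mean pairwise difference of the L largest posteriors,
    sorted in descending order; large values indicate a reliable frame."""
    p = np.sort(posteriors)[::-1][:L]
    diffs = [p[k] - p[l] for k in range(L) for l in range(k + 1, L)]
    return 2.0 / (L * (L - 1)) * np.sum(diffs)
```

A uniform posterior (maximally uncertain decoder) yields the maximum entropy log N and a dispersion of zero, while a one-hot posterior yields zero entropy and maximal dispersion, consistent with the opposite polarities of the two measures noted above.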
Specifically, we are using the signal-to-noise ratio (SNR), averaged soft and hard voice activity detection (VAD) cues, and the estimation uncertainty of the acoustical observations. The availability of these signal-based features depends on the chosen acoustic pre-processor. In this study, we use a Wiener filter as a pre-processor to enhance the speech signal before extracting the acoustical observations. Speech enhancement algorithms such as the Wiener filter need an estimate of the noise power. This can be found either by explicitly employing speech pause detection or by using algorithms like improved minima controlled recursive averaging (IMCRA) [19-21] to estimate the noise floor. The latter algorithm, which is used here, provides two types of VAD cues in each time-frequency bin: hard (binary) and probabilistic. The fifth and the sixth components of the reliability measure feature vector are the averages of each of these two VAD features across all frequency bins. Moreover, the noise power estimate obtained using IMCRA, together with the power of the estimated clean signal from the Wiener filter, is used to compute the frame-wise signal-to-noise ratio via

$$\mathrm{SNR}_t = 10 \log_{10}\left(\frac{S_t}{N_t}\right), \quad (5)$$

where $S_t$ and $N_t$ are the estimated signal and noise energies at time frame $t$, respectively. The final reliability measure feature is the uncertainty of the enhanced acoustical observations [22, 23]. This uncertainty is an estimate of the residual noise and estimation errors in the acoustical features after applying the Wiener filter. The uncertainty $\sigma^2$ can be computed for each time-frequency bin as

$$\sigma^2_t = \mathrm{Var}[x_t \mid y_t], \quad (6)$$

where $x_t$ and $y_t$ are single components of the short-time discrete Fourier transform (DFT) of the clean and noisy acoustical
signals, respectively. The dimension of the uncertainty is reduced from the DFT length to 24 by a linear transform using the Mel filterbank matrix.

After defining the reliability measure feature vectors in this way, in order to train the MLP parameters, it is still necessary to find the corresponding optimal stream weight for each time frame of a training data set. For this task, we use an expectation maximization algorithm that we have recently proposed [7]. A brief summary of this algorithm is given in the next section.

Algorithm 1: Oracle Dynamic SWs for CHMM-based AVASR
A. Set the prior parameters
(1) Set $\mu_\lambda$ to the global fixed stream weight $\lambda_G$.
(2) Initialize $\sigma_\lambda$ with a small value, e.g., 0.1.
B. Initialization
(3) Initialize the stream weight set $\hat{\lambda} = \lambda^*$.
C. EM Algorithm
(4) Calculate $P = p(O, \hat{\lambda} \mid w)$.
E step
(5) Use $\hat{\lambda}$ to calculate $\gamma_t(i)$ for all times and 2-D states.
M step
(6) Update $\hat{\lambda}$ using Equation (8).
Convergence test
(7) Calculate $P' = p(O, \hat{\lambda} \mid w)$.
(8) If $P' - P > \epsilon$, set $P = P'$ and go to (5); else go to (9).
D. Recognition
(9) Use the estimated $\hat{\lambda}$ to recognize the training utterance and calculate the accuracy $A(\sigma_\lambda)$.
E. Iteration
(10) Increase $\sigma_\lambda$ and repeat (3)-(10) until all values of $\hat{\lambda}$ become 0 or 1.
(11) Choose the $\hat{\lambda}$ that achieves the best accuracy.

4. Oracle stream weight estimation

The question addressed in this section is as follows: Given the audio-visual observation sequence $O = \{O_t\}_{t=1}^{T}$ of an utterance and its corresponding correct transcription $w$, how can we estimate the optimal dynamic stream weight sequence $\lambda = \{\lambda_t\}_{t=1}^{T}$, which we also term the oracle stream weight because prior knowledge of the correct word sequence enters into its computation? A possible answer to this question is introduced in [5, 9] for AVASR based on multi-stream HMMs. The approach in [5, 9] requires a-priori knowledge of the frame-state alignment, which is found using forced alignment.
However, computing the frame-state alignment requires the state-conditional observation likelihoods given by (2), which already need the dynamic stream weights: a classical chicken-and-egg dilemma. To solve this issue, we have proposed a new algorithm in [7] that needs no prior knowledge of the frame-state alignment for estimating the oracle dynamic stream weights, cf. Algorithm 1. The algorithm initializes the stream weight as a Gaussian random variable with mean $\mu_\lambda$ and standard deviation $\sigma_\lambda$. These statistics are assumed to take different values for different noise types and levels. We set the mean value $\mu_\lambda$, which we call the bias parameter, to the optimal global fixed stream weight $\lambda_G$. The global fixed stream weight can be found using a grid search, minimizing the word error rate. We term the standard deviation $\sigma_\lambda$ the sensitivity parameter, as it controls the dynamics of the stream weight. Large standard deviations lead to strong changes in the stream weights and vice versa. In Step (2) of Algorithm 1, we start with a small $\sigma_\lambda$, then increase it iteratively in Step (10) until all stream weights become either zero or one. Increasing the standard deviation further will not change the values of the stream weights, which are bounded between zero and one. In Steps (3)-(8), the optimal set of stream weights is found given the prior parameters $\mu_\lambda$ and $\sigma_\lambda$ of the current iteration. In Step (9), we test the stream weight set estimated in the current iteration by blindly decoding the utterance using an audio-visual speech recognizer. Finally, the stream weight set $\hat{\lambda}$ that yields the lowest error rate is selected as the final stream weight set for the given utterance.

In Steps (4)-(8), we use an EM algorithm to find an estimate $\hat{\lambda}$ that maximizes the following objective function:

$$F = p(O, \lambda \mid w) \quad \text{s.t.} \quad 0 \leq \lambda_t \leq 1. \quad (7)$$

A local maximum of $F$ can be found by estimating the frame-wise stream weights as follows¹:

$$\hat{\lambda}_t = \mu_\lambda + \sigma^2_\lambda \sum_{i=(1,1)}^{(N^A, N^V)} \gamma_t(i) \log\left(\frac{p(O^A_t \mid q_t = i)}{p(O^V_t \mid q_t = i)}\right). \quad (8)$$

If the resulting $\hat{\lambda}_t \notin [0,1]$ in (8), it is clipped to these boundaries. Equation (8) represents the M step (Step (6)). To compute $\hat{\lambda}_t$ as in (8), the E step (Step (5)) is first applied to compute the composite state occupation probabilities $\gamma_t(i)$ for all times and all composite states $i \in \{(1,1), \ldots, (N^A, N^V)\}$. The EM algorithm is iterated until a local maximum of (7) is found. In Step (3), a reasonably initialized stream weight set $\lambda^*$ is found using a greedy layer-wise approach that solves the optimization problem

$$\lambda^*_t = \operatorname*{argmax}_{\lambda_t} \, p(O_{1,\ldots,t}, \lambda_t \mid \lambda^*_{1,\ldots,t-1}, w) \quad \text{s.t.} \quad 0 \leq \lambda_t \leq 1. \quad (9)$$

The objective function in (9) is convex in the feasible region, i.e., $0 \leq \lambda_t \leq 1$, and can be optimized by gradient ascent [24].

5. Experiments and Results

The Grid audio-visual speech corpus has been used to evaluate the proposed approach. The Grid database contains audio-visual recordings of 33 speakers, with 1,000 sentences per speaker. The Grid task is to recognize English sentences of the form command-color-preposition-letter-digit-adverb. We have divided the signals into three sets: a training set containing 90% of the signals, and a development and a test set, each of which contains 5% of the corpus signals. The training set has been used to separately train the marginal single-modality HMMs. The development set has mainly been used to train the MLP. To test the proposed approach under different acoustical conditions, we have used eight additional noisy versions of the test and development sets. The noisy signals have been created by adding babble and white noise signals to the clean signals at four SNR levels between 0 dB and 15 dB. The babble and white noise signals stem from the NOISEX corpus [25] and were chosen to represent both stationary and non-stationary noise.

¹ A rigorous derivation of the EM algorithm and the initialization approach is introduced in [7].
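Noisy evaluation sets of this kind are commonly created by scaling the noise signal so that the clean-to-noise energy ratio matches a target SNR before adding it. The following is our own sketch of that step, not necessarily the authors' exact tooling:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean-to-noise energy ratio equals
    `snr_db`, then add it to `clean` (1-D arrays of equal length)."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # required noise power for the target SNR: p_clean / 10^(snr_db/10)
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```

Applying this at, e.g., 0, 5, 10, and 15 dB with babble and white noise segments reproduces the kind of eight-condition noisy set described above.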
Table 1: Recognition performance in terms of word accuracy for single-stream and audio-visual ASR systems in a range of acoustical conditions (babble noise, white noise, and clean speech), using different stream weight estimation schemes: audio-only, video-only, Bayes fusion, the exponential mapping function of [4] driven by entropy, SNR, or dispersion, the proposed MLP, and the oracle schemes $\lambda = \lambda_G$ and $\lambda = $ ODSW. (The numerical table entries are not recoverable from the source; asterisks mark results that are significantly better than the best baseline result.)

The spectrograms of all signals have been extracted using the configuration parameters recommended in the ETSI-AFE [26]. From the spectrogram, we have extracted the first 13 mel-frequency cepstral coefficients (MFCCs). The acoustical feature vectors are then composed of these 13-dimensional static feature vectors concatenated with their first and second temporal derivatives. The visual features have been extracted as follows: First, the region of interest (ROI), a rectangular area containing the speaker's mouth, has been found with the Viola-Jones algorithm [27]. Next, the two-dimensional discrete cosine transform (DCT) has been applied to the ROI. The first 64 DCT coefficients were then used as the visual features. The dimensions of the acoustical and visual observations have finally been reduced to 31 using linear discriminant analysis (LDA) [28, 29].

The single-modality word HMMs are speaker-dependent, linear models. The number of states in each word HMM is proportional to the number of phonemes contained in the word, with a proportionality factor of 3 for audio HMMs and 1 for video HMMs. The output probability distributions of all emitting states are Gaussian mixture models with 3 mixture components for audio HMMs and 4 for video HMMs. The Java Audiovisual SPEech Recognizer (JASPER) system [30] has been used for training and recognition.

The MLP has an input layer with 31 input units, two hidden layers, each with 10 neurons, and a one-dimensional output layer. All neurons of the hidden layers have tan-sigmoid transfer functions. The output neuron, however, has a linear transfer function.
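The described topology can be sketched as a plain NumPy forward pass. This is an illustration only: the parameter values below are random placeholders, not trained weights, and the function names are ours:

```python
import numpy as np

def mlp_stream_weight(x, weights, biases):
    """Forward pass of the paper's MLP topology: 31 inputs, two hidden
    layers of 10 tan-sigmoid units each, and one linear output unit.
    `weights`/`biases` hold one (W, b) pair per layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(W @ h + b)              # tan-sigmoid hidden layers
    lam = weights[-1] @ h + biases[-1]      # linear output neuron
    return float(np.clip(lam[0], 0.0, 1.0))  # stream weight in [0, 1]

# Randomly initialized parameters, for illustration only; a real system
# trains them on reliability features and oracle stream weight targets.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((10, 31)) * 0.1,
           rng.standard_normal((10, 10)) * 0.1,
           rng.standard_normal((1, 10)) * 0.1]
biases = [np.zeros(10), np.zeros(10), np.zeros(1)]
lam = mlp_stream_weight(rng.standard_normal(31), weights, biases)
```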
Estimated stream weights that are smaller than zero or larger than one are clipped at these values.

We consider the results obtained using the approach in [4] as our baseline. Here, a second-order exponential function,

$$\lambda_t = a e^{b \cdot RM_t} + c e^{d \cdot RM_t}, \quad (10)$$

was used to map a one-dimensional reliability measure $RM$, e.g., $RM_t = \mathrm{SNR}_t$, to a stream weight $\lambda_t$ at each time frame $t$. The parameters of this mapping function, $a$, $b$, $c$, and $d$, have been estimated using non-linear least squares. One training data point is computed for each development set as follows: The argument of the mapping function is found by averaging a one-dimensional reliability measure, e.g., SNR, entropy, or dispersion, over the whole development set. As the target stream weight of the set, the global fixed stream weight $\lambda_G$ is used, which is found via grid search. Therefore, the number of training data points for parameter estimation equals the number of development sets, which is typically quite small.
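The baseline fit of Eq. (10) can be sketched as below, assuming SciPy's `curve_fit` for the non-linear least squares step. The (RM, λ) pairs here are synthetic illustration values standing in for the per-set averages, not data from the paper:

```python
import numpy as np
from scipy.optimize import curve_fit

def mapping(rm, a, b, c, d):
    """Eq. (10): second-order exponential mapping from a scalar
    reliability measure to a stream weight."""
    return a * np.exp(b * rm) + c * np.exp(d * rm)

# Synthetic (RM, lambda) pairs; the "true" parameters are arbitrary.
rm = np.linspace(0.0, 20.0, 21)
lam_target = mapping(rm, 0.4, 0.05, 0.3, -0.1)

# Non-linear least squares fit of a, b, c, d, as in the baseline of [4].
popt, _ = curve_fit(mapping, rm, lam_target, p0=[0.35, 0.06, 0.25, -0.08])
```

With real data, each development set contributes only a single (averaged RM, λ_G) point, which is exactly the data-scarcity problem that motivates the frame-wise oracle targets used for the MLP.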
Table 1 shows the performance in terms of word accuracy for audio-only, video-only, and audio-visual ASR. The stream weights used with the AVASR system have been computed using different approaches. The first approach is simple Bayes fusion, where audio and video likelihoods are always weighted equally. In the second, third, and fourth approaches, the stream weights have been estimated as in [4]. The fifth approach is the proposed one, in which the stream weights are estimated using an MLP. The sixth and seventh approaches are upper bounds on the system performance, as their stream weights have been estimated using a-priori knowledge of the correct transcription. The stream weights of the sixth approach are the global fixed stream weights obtained using grid search with a minimum word error rate criterion. The oracle dynamic stream weights (ODSWs) used in the last approach are those estimated using Algorithm 1, discussed in Section 4.

As seen in Table 1, the proposed approach significantly outperforms the baseline results and approaches the results obtained using optimal fixed stream weights. An asterisk in Table 1 means that the result is significantly better than the respective best result of the baseline. The statistical significance has been tested using Fisher's exact test [31] for p = . However, the gap between the results obtained using the oracle dynamic stream weights and the weights estimated using the MLP is also significant. This points towards further room for optimization, e.g., through the inclusion of other robust and indicative reliability measure features.

6. Conclusions

Automatic speech recognition can become highly noise-robust when visual observations are used in conjunction with acoustical ones. It is, however, important to dynamically control the contribution of each observation to the classification decision, for example by using dynamic stream weights. We have proposed to use an MLP for mapping reliability measure features to dynamic stream weights. In order to train the MLP parameters, we have employed oracle dynamic stream weights, which were estimated using a recently proposed EM algorithm. As the MLP input, we have used composite feature vectors containing signal-based as well as model-based reliability measures. The proposed approach has significantly outperformed the baseline for CHMM-based AVASR. We expect to achieve even better performance when further noise-robust reliability measures are included as additional MLP inputs.
7. References

[1] J. Hernando, "Maximum likelihood weighting of dynamic speech features for CDHMM speech recognition," in ICASSP, Munich, Germany.
[2] G. Gravier, S. Axelrod, G. Potamianos, and C. Neti, "Maximum entropy and MCE based HMM stream weight estimation for audio-visual ASR," in ICASSP, Orlando, Florida, USA.
[3] L. Peng and W. Zuoying, "Stream weight training based on MCE for audio-visual LVCSR," Tsinghua Science and Technology, vol. 10, no. 2.
[4] V. Estellers, M. Gurban, and J.-P. Thiran, "On dynamic stream weighting for audio-visual speech recognition," IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 4.
[5] A. Garg, G. Potamianos, C. Neti, and T. S. Huang, "Frame-dependent multi-stream reliability indicators for audio-visual speech recognition," in International Conference on Multimedia and Expo, Baltimore, Maryland, USA.
[6] M. Heckmann, F. Berthommier, and K. Kroschel, "Noise adaptive stream weighting in audio-visual speech recognition," EURASIP Journal on Applied Signal Processing, vol. 2002.
[7] A. H. Abdelaziz, S. Zeiler, and D. Kolossa, "A new EM estimation of dynamic stream weights for coupled-HMM-based audio-visual ASR," in ICASSP, Florence, Italy.
[8] A. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, "Dynamic Bayesian networks for audio-visual speech recognition," EURASIP Journal on Applied Signal Processing, vol. 11.
[9] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, "Recent advances in the automatic recognition of audio-visual speech," Proceedings of the IEEE, vol. 91, no. 9.
[10] M. Tomlinson, M. Russell, and N. M. Brooke, "Integrating audio and visual information to provide highly robust speech recognition," in ICASSP, Atlanta, Georgia, USA.
[11] J. Luettin, G. Potamianos, and C. Neti, "Asynchronous stream modeling for large vocabulary audio-visual speech recognition," in ICASSP, Salt Lake City, Utah, USA.
[12] H. Bourlard and S. Dupont, "A new ASR approach based on independent processing and recombination of partial frequency bands," in ICSLP, Philadelphia, Pennsylvania, USA.
[13] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, no. 5.
[14] D. W. Massaro and D. G. Stork, "Speech recognition and sensory integration," American Scientist, vol. 86, no. 3.
[15] M. Gurban and J.-P. Thiran, "Using entropy as a stream reliability estimate for audio-visual speech recognition," in European Signal Processing Conference, Lausanne, Switzerland.
[16] G. Potamianos and C. Neti, "Stream confidence estimation for audio-visual speech recognition," in ICSLP, Beijing, China.
[17] H. Misra, H. Bourlard, and V. Tyagi, "New entropy based combination rule in HMM/ANN multi-stream ASR," in ICASSP, Hong Kong.
[18] A. Adjoudani and C. Benoit, "On the integration of auditory and visual parameters in an HMM-based ASR," in NATO ASI Conference on Speechreading by Man and Machine: Models, Systems and Applications, Berlin, Germany.
[19] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5.
[20] I. Cohen and B. Berdugo, "Noise estimation by minima controlled recursive averaging for robust speech enhancement," IEEE Signal Processing Letters, vol. 9, no. 1.
[21] I. Cohen, "Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 5.
[22] L. Deng, J. Droppo, and A. Acero, "Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion," IEEE Trans. Speech and Audio Processing, vol. 13, no. 3.
[23] R. F. Astudillo, D. Kolossa, and R. Orglmeister, "Accounting for the uncertainty of speech estimates in the complex domain for minimum mean square error speech enhancement," in Interspeech, Brighton, United Kingdom.
[24] S. Boyd and L. Vandenberghe, Convex Optimization, 7th ed. Cambridge University Press.
[25] A. Varga and H. J. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3.
[26] "Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithms," ETSI, ES Std.
[27] G. Bradski and A. Kaehler, Computer Vision with the OpenCV Library. O'Reilly Media.
[28] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 6th ed. Pearson.
[29] D. Kolossa, S. Zeiler, R. Saeidi, and R. Astudillo, "Noise-adaptive LDA: A new approach for speech recognition under observation uncertainty," IEEE Signal Processing Letters, vol. 20, no. 11.
[30] A. Vorwerk, S. Zeiler, D. Kolossa, R. F. Astudillo, and D. Lerch, "Use of missing and unreliable data for audiovisual speech recognition," in Robust Speech Recognition of Uncertain or Missing Data. Springer, 2011.
[31] A. Agresti, "A survey of exact inference for contingency tables," Statistical Science, vol. 7, no. 1.
Gender-dependent acoustic models fusion developed for automatic subtitling of Parliament meetings broadcasted by the Czech TV Jan Vaněk and Josef V. Psutka Department of Cybernetics, West Bohemia University,
More informationMulti-pose lipreading and audio-visual speech recognition
RESEARCH Open Access Multi-pose lipreading and audio-visual speech recognition Virginia Estellers * and Jean-Philippe Thiran Abstract In this article, we study the adaptation of visual and audio-visual
More informationResearch Article Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images
Hindawi Publishing Corporation EURASIP Journal on Audio, Speech, and Music Processing Volume 2007, Article ID 64506, 9 pages doi:10.1155/2007/64506 Research Article Audio-Visual Speech Recognition Using
More informationBo#leneck Features from SNR- Adap9ve Denoising Deep Classifier for Speaker Iden9fica9on
Bo#leneck Features from SNR- Adap9ve Denoising Deep Classifier for Speaker Iden9fica9on TAN Zhili & MAK Man-Wai APSIPA 2015 Department of Electronic and Informa2on Engineering The Hong Kong Polytechnic
More informationProbabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information
Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information Mustafa Berkay Yilmaz, Hakan Erdogan, Mustafa Unel Sabanci University, Faculty of Engineering and Natural
More informationComparative Evaluation of Feature Normalization Techniques for Speaker Verification
Comparative Evaluation of Feature Normalization Techniques for Speaker Verification Md Jahangir Alam 1,2, Pierre Ouellet 1, Patrick Kenny 1, Douglas O Shaughnessy 2, 1 CRIM, Montreal, Canada {Janagir.Alam,
More informationThe Nottingham eprints service makes this work by researchers of the University of Nottingham available open access under the following conditions.
Petridis, Stavros and Stafylakis, Themos and Ma, Pingchuan and Cai, Feipeng and Tzimiropoulos, Georgios and Pantic, Maja (2018) End-to-end audiovisual speech recognition. In: IEEE International Conference
More informationConfusability of Phonemes Grouped According to their Viseme Classes in Noisy Environments
PAGE 265 Confusability of Phonemes Grouped According to their Viseme Classes in Noisy Environments Patrick Lucey, Terrence Martin and Sridha Sridharan Speech and Audio Research Laboratory Queensland University
More informationAcoustic to Articulatory Mapping using Memory Based Regression and Trajectory Smoothing
Acoustic to Articulatory Mapping using Memory Based Regression and Trajectory Smoothing Samer Al Moubayed Center for Speech Technology, Department of Speech, Music, and Hearing, KTH, Sweden. sameram@kth.se
More informationEM Algorithm with Split and Merge in Trajectory Clustering for Automatic Speech Recognition
EM Algorithm with Split and Merge in Trajectory Clustering for Automatic Speech Recognition Yan Han and Lou Boves Department of Language and Speech, Radboud University Nijmegen, The Netherlands {Y.Han,
More informationMultifactor Fusion for Audio-Visual Speaker Recognition
Proceedings of the 7th WSEAS International Conference on Signal, Speech and Image Processing, Beijing, China, September 15-17, 2007 70 Multifactor Fusion for Audio-Visual Speaker Recognition GIRIJA CHETTY
More informationConditional Random Fields : Theory and Application
Conditional Random Fields : Theory and Application Matt Seigel (mss46@cam.ac.uk) 3 June 2010 Cambridge University Engineering Department Outline The Sequence Classification Problem Linear Chain CRFs CRF
More informationNote Set 4: Finite Mixture Models and the EM Algorithm
Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for
More informationCS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas
CS839: Probabilistic Graphical Models Lecture 10: Learning with Partially Observed Data Theo Rekatsinas 1 Partially Observed GMs Speech recognition 2 Partially Observed GMs Evolution 3 Partially Observed
More informationClassification: Linear Discriminant Functions
Classification: Linear Discriminant Functions CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Discriminant functions Linear Discriminant functions
More informationBengt J. Borgström, Student Member, IEEE, and Abeer Alwan, Senior Member, IEEE
1 A Low Complexity Parabolic Lip Contour Model With Speaker Normalization For High-Level Feature Extraction in Noise Robust Audio-Visual Speech Recognition Bengt J Borgström, Student Member, IEEE, and
More informationAn Introduction to Pattern Recognition
An Introduction to Pattern Recognition Speaker : Wei lun Chao Advisor : Prof. Jian-jiun Ding DISP Lab Graduate Institute of Communication Engineering 1 Abstract Not a new research field Wide range included
More informationClient Dependent GMM-SVM Models for Speaker Verification
Client Dependent GMM-SVM Models for Speaker Verification Quan Le, Samy Bengio IDIAP, P.O. Box 592, CH-1920 Martigny, Switzerland {quan,bengio}@idiap.ch Abstract. Generative Gaussian Mixture Models (GMMs)
More informationDynamic Time Warping
Centre for Vision Speech & Signal Processing University of Surrey, Guildford GU2 7XH. Dynamic Time Warping Dr Philip Jackson Acoustic features Distance measures Pattern matching Distortion penalties DTW
More informationLecture 7: Neural network acoustic models in speech recognition
CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 7: Neural network acoustic models in speech recognition Outline Hybrid acoustic modeling overview Basic
More informationFurther Studies of a FFT-Based Auditory Spectrum with Application in Audio Classification
ICSP Proceedings Further Studies of a FFT-Based Auditory with Application in Audio Classification Wei Chu and Benoît Champagne Department of Electrical and Computer Engineering McGill University, Montréal,
More informationIntelligent Hands Free Speech based SMS System on Android
Intelligent Hands Free Speech based SMS System on Android Gulbakshee Dharmale 1, Dr. Vilas Thakare 3, Dr. Dipti D. Patil 2 1,3 Computer Science Dept., SGB Amravati University, Amravati, INDIA. 2 Computer
More informationManifold Constrained Deep Neural Networks for ASR
1 Manifold Constrained Deep Neural Networks for ASR Department of Electrical and Computer Engineering, McGill University Richard Rose and Vikrant Tomar Motivation Speech features can be characterized as
More informationMaximum Likelihood Beamforming for Robust Automatic Speech Recognition
Maximum Likelihood Beamforming for Robust Automatic Speech Recognition Barbara Rauch barbara@lsv.uni-saarland.de IGK Colloquium, Saarbrücken, 16 February 2006 Agenda Background: Standard ASR Robust ASR
More informationA Neural Network for Real-Time Signal Processing
248 MalkofT A Neural Network for Real-Time Signal Processing Donald B. Malkoff General Electric / Advanced Technology Laboratories Moorestown Corporate Center Building 145-2, Route 38 Moorestown, NJ 08057
More information10-701/15-781, Fall 2006, Final
-7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly
More informationSpeech Recognition Lecture 8: Acoustic Models. Eugene Weinstein Google, NYU Courant Institute Slide Credit: Mehryar Mohri
Speech Recognition Lecture 8: Acoustic Models. Eugene Weinstein Google, NYU Courant Institute eugenew@cs.nyu.edu Slide Credit: Mehryar Mohri Speech Recognition Components Acoustic and pronunciation model:
More informationAUDIOVISUAL SPEECH RECOGNITION USING MULTISCALE NONLINEAR IMAGE DECOMPOSITION
AUDIOVISUAL SPEECH RECOGNITION USING MULTISCALE NONLINEAR IMAGE DECOMPOSITION Iain Matthews, J. Andrew Bangham and Stephen Cox School of Information Systems, University of East Anglia, Norwich, NR4 7TJ,
More informationHidden Markov Models. Gabriela Tavares and Juri Minxha Mentor: Taehwan Kim CS159 04/25/2017
Hidden Markov Models Gabriela Tavares and Juri Minxha Mentor: Taehwan Kim CS159 04/25/2017 1 Outline 1. 2. 3. 4. Brief review of HMMs Hidden Markov Support Vector Machines Large Margin Hidden Markov Models
More informationDetection of goal event in soccer videos
Detection of goal event in soccer videos Hyoung-Gook Kim, Steffen Roeber, Amjad Samour, Thomas Sikora Department of Communication Systems, Technical University of Berlin, Einsteinufer 17, D-10587 Berlin,
More informationSCALE BASED FEATURES FOR AUDIOVISUAL SPEECH RECOGNITION
IEE Colloquium on Integrated Audio-Visual Processing for Recognition, Synthesis and Communication, pp 8/1 8/7, 1996 1 SCALE BASED FEATURES FOR AUDIOVISUAL SPEECH RECOGNITION I A Matthews, J A Bangham and
More informationChapter 3. Speech segmentation. 3.1 Preprocessing
, as done in this dissertation, refers to the process of determining the boundaries between phonemes in the speech signal. No higher-level lexical information is used to accomplish this. This chapter presents
More informationQuery-by-example spoken term detection based on phonetic posteriorgram Query-by-example spoken term detection based on phonetic posteriorgram
International Conference on Education, Management and Computing Technology (ICEMCT 2015) Query-by-example spoken term detection based on phonetic posteriorgram Query-by-example spoken term detection based
More informationHIERARCHICAL LARGE-MARGIN GAUSSIAN MIXTURE MODELS FOR PHONETIC CLASSIFICATION. Hung-An Chang and James R. Glass
HIERARCHICAL LARGE-MARGIN GAUSSIAN MIXTURE MODELS FOR PHONETIC CLASSIFICATION Hung-An Chang and James R. Glass MIT Computer Science and Artificial Intelligence Laboratory Cambridge, Massachusetts, 02139,
More informationImproving Bottleneck Features for Automatic Speech Recognition using Gammatone-based Cochleagram and Sparsity Regularization
Improving Bottleneck Features for Automatic Speech Recognition using Gammatone-based Cochleagram and Sparsity Regularization Chao Ma 1,2,3, Jun Qi 4, Dongmei Li 1,2,3, Runsheng Liu 1,2,3 1. Department
More informationJPEG compression of monochrome 2D-barcode images using DCT coefficient distributions
Edith Cowan University Research Online ECU Publications Pre. JPEG compression of monochrome D-barcode images using DCT coefficient distributions Keng Teong Tan Hong Kong Baptist University Douglas Chai
More informationICA mixture models for image processing
I999 6th Joint Sy~nposiurn orz Neural Computation Proceedings ICA mixture models for image processing Te-Won Lee Michael S. Lewicki The Salk Institute, CNL Carnegie Mellon University, CS & CNBC 10010 N.
More informationEND-TO-END VISUAL SPEECH RECOGNITION WITH LSTMS
END-TO-END VISUAL SPEECH RECOGNITION WITH LSTMS Stavros Petridis, Zuwei Li Imperial College London Dept. of Computing, London, UK {sp14;zl461}@imperial.ac.uk Maja Pantic Imperial College London / Univ.
More informationSVD-based Universal DNN Modeling for Multiple Scenarios
SVD-based Universal DNN Modeling for Multiple Scenarios Changliang Liu 1, Jinyu Li 2, Yifan Gong 2 1 Microsoft Search echnology Center Asia, Beijing, China 2 Microsoft Corporation, One Microsoft Way, Redmond,
More informationPitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery
Pitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery Achuth Rao MV, Prasanta Kumar Ghosh SPIRE LAB Electrical Engineering, Indian Institute of Science (IISc), Bangalore,
More informationAUTOMATIC speech recognition (ASR) systems suffer
1 A High-Dimensional Subband Speech Representation and SVM Framework for Robust Speech Recognition Jibran Yousafzai, Member, IEEE Zoran Cvetković, Senior Member, IEEE Peter Sollich Matthew Ager Abstract
More informationOptimization of Observation Membership Function By Particle Swarm Method for Enhancing Performances of Speaker Identification
Proceedings of the 6th WSEAS International Conference on SIGNAL PROCESSING, Dallas, Texas, USA, March 22-24, 2007 52 Optimization of Observation Membership Function By Particle Swarm Method for Enhancing
More informationWhy DNN Works for Speech and How to Make it More Efficient?
Why DNN Works for Speech and How to Make it More Efficient? Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering, York University, CANADA Joint work with Y.
More informationA GENERIC FACE REPRESENTATION APPROACH FOR LOCAL APPEARANCE BASED FACE VERIFICATION
A GENERIC FACE REPRESENTATION APPROACH FOR LOCAL APPEARANCE BASED FACE VERIFICATION Hazim Kemal Ekenel, Rainer Stiefelhagen Interactive Systems Labs, Universität Karlsruhe (TH) 76131 Karlsruhe, Germany
More informationA Gaussian Mixture Model Spectral Representation for Speech Recognition
A Gaussian Mixture Model Spectral Representation for Speech Recognition Matthew Nicholas Stuttle Hughes Hall and Cambridge University Engineering Department PSfrag replacements July 2003 Dissertation submitted
More informationMengjiao Zhao, Wei-Ping Zhu
ADAPTIVE WAVELET PACKET THRESHOLDING WITH ITERATIVE KALMAN FILTER FOR SPEECH ENHANCEMENT Mengjiao Zhao, Wei-Ping Zhu Department of Electrical and Computer Engineering Concordia University, Montreal, Quebec,
More informationOptimization of HMM by the Tabu Search Algorithm
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 20, 949-957 (2004) Optimization of HMM by the Tabu Search Algorithm TSONG-YI CHEN, XIAO-DAN MEI *, JENG-SHYANG PAN AND SHENG-HE SUN * Department of Electronic
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationCOPYRIGHTED MATERIAL. Introduction. 1.1 Introduction
1 Introduction 1.1 Introduction One of the most fascinating characteristics of humans is their capability to communicate ideas by means of speech. This capability is undoubtedly one of the facts that has
More informationFace. Feature Extraction. Visual. Feature Extraction. Acoustic Feature Extraction
A Bayesian Approach to Audio-Visual Speaker Identification Ara V Nefian 1, Lu Hong Liang 1, Tieyan Fu 2, and Xiao Xing Liu 1 1 Microprocessor Research Labs, Intel Corporation, fara.nefian, lu.hong.liang,
More informationJoint Processing of Audio and Visual Information for Speech Recognition
Joint Processing of Audio and Visual Information for Speech Recognition Chalapathy Neti and Gerasimos Potamianos IBM T.J. Watson Research Center Yorktown Heights, NY 10598 With Iain Mattwes, CMU, USA Juergen
More informationTemporal Multimodal Learning in Audiovisual Speech Recognition
Temporal Multimodal Learning in Audiovisual Speech Recognition Di Hu, Xuelong Li, Xiaoqiang Lu School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical
More informationCAMBRIDGE UNIVERSITY
CAMBRIDGE UNIVERSITY ENGINEERING DEPARTMENT DISCRIMINATIVE CLASSIFIERS WITH GENERATIVE KERNELS FOR NOISE ROBUST SPEECH RECOGNITION M.J.F. Gales and F. Flego CUED/F-INFENG/TR605 August 13, 2008 Cambridge
More informationMixture Models and the EM Algorithm
Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is
More informationEstimating Human Pose in Images. Navraj Singh December 11, 2009
Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks
More information[N569] Wavelet speech enhancement based on voiced/unvoiced decision
The 32nd International Congress and Exposition on Noise Control Engineering Jeju International Convention Center, Seogwipo, Korea, August 25-28, 2003 [N569] Wavelet speech enhancement based on voiced/unvoiced
More informationBidirectional Truncated Recurrent Neural Networks for Efficient Speech Denoising
Bidirectional Truncated Recurrent Neural Networks for Efficient Speech Denoising Philémon Brakel, Dirk Stroobandt, Benjamin Schrauwen Department of Electronics and Information Systems, Ghent University,
More informationAudio Visual Isolated Oriya Digit Recognition Using HMM and DWT
Conference on Advances in Communication and Control Systems 2013 (CAC2S 2013) Audio Visual Isolated Oriya Digit Recognition Using HMM and DWT Astik Biswas Department of Electrical Engineering, NIT Rourkela,Orrisa
More informationUsing Gradient Descent Optimization for Acoustics Training from Heterogeneous Data
Using Gradient Descent Optimization for Acoustics Training from Heterogeneous Data Martin Karafiát Λ, Igor Szöke, and Jan Černocký Brno University of Technology, Faculty of Information Technology Department
More informationA New Manifold Representation for Visual Speech Recognition
A New Manifold Representation for Visual Speech Recognition Dahai Yu, Ovidiu Ghita, Alistair Sutherland, Paul F. Whelan School of Computing & Electronic Engineering, Vision Systems Group Dublin City University,
More informationMotivation: Shortcomings of Hidden Markov Model. Ko, Youngjoong. Solution: Maximum Entropy Markov Model (MEMM)
Motivation: Shortcomings of Hidden Markov Model Maximum Entropy Markov Models and Conditional Random Fields Ko, Youngjoong Dept. of Computer Engineering, Dong-A University Intelligent System Laboratory,
More informationNeural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani
Neural Networks CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Biological and artificial neural networks Feed-forward neural networks Single layer
More informationPair-wise Distance Metric Learning of Neural Network Model for Spoken Language Identification
INTERSPEECH 2016 September 8 12, 2016, San Francisco, USA Pair-wise Distance Metric Learning of Neural Network Model for Spoken Language Identification 2 1 Xugang Lu 1, Peng Shen 1, Yu Tsao 2, Hisashi
More informationInterfacing of CASA and Multistream recognition. Cedex, France. CH-1920, Martigny, Switzerland
Interfacing of CASA and Multistream recognition Herv Glotin 2;, Fr d ric Berthommier, Emmanuel Tessier, Herv Bourlard 2 Institut de la Communication Parl e (ICP), 46 Av F lix Viallet, 3803 Grenoble Cedex,
More informationSeparating Speech From Noise Challenge
Separating Speech From Noise Challenge We have used the data from the PASCAL CHiME challenge with the goal of training a Support Vector Machine (SVM) to estimate a noise mask that labels time-frames/frequency-bins
More informationMULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER
MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER A.Shabbir 1, 2 and G.Verdoolaege 1, 3 1 Department of Applied Physics, Ghent University, B-9000 Ghent, Belgium 2 Max Planck Institute
More informationApplication of Support Vector Machine Algorithm in Spam Filtering
Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification
More informationMarkov Random Fields and Gibbs Sampling for Image Denoising
Markov Random Fields and Gibbs Sampling for Image Denoising Chang Yue Electrical Engineering Stanford University changyue@stanfoed.edu Abstract This project applies Gibbs Sampling based on different Markov
More informationThe Automatic Musicologist
The Automatic Musicologist Douglas Turnbull Department of Computer Science and Engineering University of California, San Diego UCSD AI Seminar April 12, 2004 Based on the paper: Fast Recognition of Musical
More informationAvailable online Journal of Scientific and Engineering Research, 2016, 3(4): Research Article
Available online www.jsaer.com, 2016, 3(4):417-422 Research Article ISSN: 2394-2630 CODEN(USA): JSERBR Automatic Indexing of Multimedia Documents by Neural Networks Dabbabi Turkia 1, Lamia Bouafif 2, Ellouze
More informationLecture 21 : A Hybrid: Deep Learning and Graphical Models
10-708: Probabilistic Graphical Models, Spring 2018 Lecture 21 : A Hybrid: Deep Learning and Graphical Models Lecturer: Kayhan Batmanghelich Scribes: Paul Liang, Anirudha Rayasam 1 Introduction and Motivation
More informationDr Andrew Abel University of Stirling, Scotland
Dr Andrew Abel University of Stirling, Scotland University of Stirling - Scotland Cognitive Signal Image and Control Processing Research (COSIPRA) Cognitive Computation neurobiology, cognitive psychology
More informationCHROMA AND MFCC BASED PATTERN RECOGNITION IN AUDIO FILES UTILIZING HIDDEN MARKOV MODELS AND DYNAMIC PROGRAMMING. Alexander Wankhammer Peter Sciri
1 CHROMA AND MFCC BASED PATTERN RECOGNITION IN AUDIO FILES UTILIZING HIDDEN MARKOV MODELS AND DYNAMIC PROGRAMMING Alexander Wankhammer Peter Sciri introduction./the idea > overview What is musical structure?
More informationNovel Subband Autoencoder Features for Non-intrusive Quality Assessment of Noise Suppressed Speech
INTERSPEECH 16 September 8 12, 16, San Francisco, USA Novel Subband Autoencoder Features for Non-intrusive Quality Assessment of Noise Suppressed Speech Meet H. Soni, Hemant A. Patil Dhirubhai Ambani Institute
More informationMachine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves
Machine Learning A 708.064 11W 1sst KU Exercises Problems marked with * are optional. 1 Conditional Independence I [2 P] a) [1 P] Give an example for a probability distribution P (A, B, C) that disproves
More informationVisual object classification by sparse convolutional neural networks
Visual object classification by sparse convolutional neural networks Alexander Gepperth 1 1- Ruhr-Universität Bochum - Institute for Neural Dynamics Universitätsstraße 150, 44801 Bochum - Germany Abstract.
More informationapplication of learning vector quantization algorithms. In Proceedings of the International Joint Conference on
[5] Teuvo Kohonen. The Self-Organizing Map. In Proceedings of the IEEE, pages 1464{1480, 1990. [6] Teuvo Kohonen, Jari Kangas, Jorma Laaksonen, and Kari Torkkola. LVQPAK: A program package for the correct
More informationAutomatic Enhancement of Correspondence Detection in an Object Tracking System
Automatic Enhancement of Correspondence Detection in an Object Tracking System Denis Schulze 1, Sven Wachsmuth 1 and Katharina J. Rohlfing 2 1- University of Bielefeld - Applied Informatics Universitätsstr.
More informationComparing MFCC and MPEG-7 Audio Features for Feature Extraction, Maximum Likelihood HMM and Entropic Prior HMM for Sports Audio Classification
MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Comparing MFCC and MPEG-7 Audio Features for Feature Extraction, Maximum Likelihood HMM and Entropic Prior HMM for Sports Audio Classification
More informationON THE EFFECT OF SCORE EQUALIZATION IN SVM MULTIMODAL BIOMETRIC SYSTEMS
ON THE EFFECT OF SCORE EQUALIZATION IN SVM MULTIMODAL BIOMETRIC SYSTEMS Pascual Ejarque and Javier Hernando TALP Research Center, Department of Signal Theory and Communications Technical University of
More information