Speech Recognition, Lecture 8: Acoustic Models
Eugene Weinstein (Google, NYU Courant Institute), eugenew@cs.nyu.edu
Slide Credit: Mehryar Mohri
Speech Recognition Components

Acoustic and pronunciation model:
$\Pr(o \mid w) = \sum_{d,c,p} \Pr(o \mid d)\,\Pr(d \mid c)\,\Pr(c \mid p)\,\Pr(p \mid w)$.
- $\Pr(o \mid d)$: observation seq. → distribution seq. (acoustic model).
- $\Pr(d \mid c)$: distribution seq. → CD phone seq.
- $\Pr(c \mid p)$: CD phone seq. → phoneme seq.
- $\Pr(p \mid w)$: phoneme seq. → word seq.

Language model: $\Pr(w)$, distribution over word seq.
Acoustic Models

Critical component of a speech recognition system.
Different types:
- context-independent (CI) phones vs. context-dependent (CD) phones.
- speaker-independent vs. speaker-dependent.
Complex design and training techniques in large-vocabulary speech recognition.
Context-Dependent Phones

Idea: phoneme pronunciation depends on environment (allophones, co-articulation); model the phone in context.
- Context-dependent rules: $ae/b\ \_\ d$.
- Context-dependent units: $ae_{b,d}$.
- Allophonic rules: $t/V\ \_\ V \to dx$; better accuracy.
- Complex contexts: regular expressions.
(Lee, 1990; Young et al., 1994)
This Lecture

Acoustic models
Training algorithms
Continuous Speech Models

Graph topology: 3-state HMM model for each CD phone, e.g., $ae_{b,d}$ (Rabiner and Juang, 1993).
[Figure: left-to-right transducer over states 0 → 1 → 2 → 3, with self-loops $d_0{:}\varepsilon$, $d_1{:}\varepsilon$, $d_2{:}\varepsilon$ and exit arcs $d_0{:}\varepsilon$, $d_1{:}\varepsilon$, $d_2{:}ae_{b,d}$.]
Interpretation: beginning, middle, and end of the CD phone.
Continuous case: transition weights based on distributions over feature vectors in $\mathbb{R}^N$, typically with $N = 39$.
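A minimal sketch of this topology in Python, encoding each transducer arc as a (source, destination, input, output) tuple; the tuple encoding and the cd_phone_hmm helper are illustrative assumptions, not part of the lecture:

```python
def cd_phone_hmm(phone="ae_{b,d}"):
    """Arcs of a 3-state left-to-right HMM transducer for one CD phone.

    Each state 0, 1, 2 has a self-loop reading its distribution symbol d_i
    and outputting epsilon; the arc leaving state 2 outputs the phone label.
    """
    eps = ""  # epsilon output
    arcs = []
    for state in range(3):
        dist = f"d{state}"                       # distribution symbols d0, d1, d2
        arcs.append((state, state, dist, eps))   # self-loop
        out = phone if state == 2 else eps       # phone emitted on the final arc
        arcs.append((state, state + 1, dist, out))
    return arcs

for arc in cd_phone_hmm():
    print(arc)   # e.g., (2, 3, 'd2', 'ae_{b,d}')
```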
Distributions

Simple cases: e.g., single speaker, single Gaussian distribution:
$N(x; \mu, \sigma) = \frac{1}{(2\pi)^{N/2} |\sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \sigma^{-1} (x - \mu)\right)$,
with covariance matrix $\sigma$ typically diagonal.
General case: mixtures of Gaussians:
$\sum_{k=1}^{M} \lambda_k N(x; \mu_k, \sigma_k)$, with $\lambda_i \geq 0$ and $\sum_{i=1}^{M} \lambda_i = 1$. Typically, $M = 16$.
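To make the formulas concrete, a small numpy sketch evaluating a Gaussian and a mixture density (toy values, with $M = 2$ components rather than the typical $M = 16$; the function names are made up for illustration):

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """N(x; mu, sigma) with full covariance matrix sigma."""
    N = x.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (N / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm

def gmm_density(x, lambdas, mus, sigmas):
    """Mixture density: sum_k lambda_k N(x; mu_k, sigma_k)."""
    return sum(l * gaussian_density(x, m, s)
               for l, m, s in zip(lambdas, mus, sigmas))

x = np.zeros(2)
lambdas = [0.5, 0.5]                       # nonnegative, summing to 1
mus = [np.zeros(2), np.ones(2)]
sigmas = [np.eye(2), 2 * np.eye(2)]        # diagonal covariances, as is typical
print(gmm_density(x, lambdas, mus, sigmas))
```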
GMMs - Illustration
[Figure: GMM illustration.]
Silence Model

Motivation: accounting for pauses between words and sentences.
Model:
- optional pause symbol between words and at the beginning and end of utterances in the language model.
- specific silence acoustic model, which can be context-dependent or not.
Parameter Reduction

Problem: too many parameters (> 200M).
- A large number of GMMs provides better modeling flexibility,
- but requires much more training data.
Solution: tying mixtures, i.e., equality constraints on distributions (see the sketch below):
- within the same HMM, or across distributions in different HMMs (similar CD phone transitions).
- semi-continuous: same distributions, different mixture weights.
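A toy illustration of tying, using a hypothetical data layout (not the lecture's): states hold references to shared distributions, so re-estimating one tied GMM pools the data of every state that points at it:

```python
# Shared pool of GMM parameters, keyed by id.
shared_gmms = {"g0": {"lambdas": [], "mus": [], "sigmas": []}}

# Two HMM states with similar CD phone transitions tied to the same GMM:
# their frames are pooled when re-estimating g0's parameters.
state_to_gmm = {
    ("ae_{b,d}", 1): "g0",   # middle state of ae_{b,d}
    ("ae_{b,t}", 1): "g0",   # middle state of ae_{b,t}, tied
}
```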
Composite HMM model

Composite model: obtained by taking the union and closure of all CD phone models.
[Figure: composite transducer combining the per-phone 3-state HMMs, with arcs $d_i{:}\varepsilon$ and $d_2{:}p$ for each CD phone $p$.]
Tying can reduce the size.
[Figure: smaller composite machine after tying.]
This Lecture

Acoustic models
Training algorithms
Parameter Estimation

Data: sample of $m$ sequences of the form:
- feature vectors: $o_1, o_2, \ldots, o_T \in \mathbb{R}^N$.
- CD phones: $p_1, p_2, \ldots, p_l$, e.g., $ae_{c,t}$.
Parameters:
- mean and variance of Gaussians: $\mu_j, \sigma_j$.
- mixture coefficients: $\lambda_k$.
Problems: segmentation, model initialization.
Estimation Algorithm

Baum-Welch algorithm:
- maximum likelihood principle.
- generalizes to the continuous case with Gaussians and GMMs.
Questions: segmentation, model initialization.
Univariate Gaussian - ML solution

Problem: find the most likely Gaussian distribution, given a sequence of real-valued observations: 3.18, 2.35, .95, 1.175, ...
Normal distribution: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$.
Likelihood: $l(p) = -\frac{m}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2}$.
$l(p)$ is differentiable and concave; setting its gradient to zero:
$\frac{\partial l(p)}{\partial \mu} = 0 \Rightarrow \mu = \frac{1}{m}\sum_{i=1}^{m} x_i$,
$\frac{\partial l(p)}{\partial \sigma^2} = 0 \Rightarrow \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} x_i^2 - \mu^2$.
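A quick numerical check of the closed-form solution on the slide's observation sequence (a sketch; note that numpy's np.var uses the same $1/m$ normalization as the ML estimate):

```python
import numpy as np

obs = np.array([3.18, 2.35, 0.95, 1.175])       # observations from the slide

mu = obs.mean()                                 # mu = (1/m) sum_i x_i
sigma2 = (obs ** 2).mean() - mu ** 2            # sigma^2 = (1/m) sum_i x_i^2 - mu^2
print(mu, sigma2)
print(np.isclose(sigma2, np.var(obs)))          # True: matches the ML variance
```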
Gradients

General identities:
- log of determinant: $\nabla_X \log \det(X) = (X^{-1})^\top = X^{-1}$ for symmetric $X$.
- bilinear form: $\nabla_X (x^\top X x) = x x^\top$.
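These identities can be sanity-checked numerically; a sketch using central finite differences on a random symmetric positive definite matrix (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
X = A @ A.T + 3 * np.eye(3)          # symmetric positive definite

eps = 1e-6
grad = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = eps                # perturb one entry at a time
        grad[i, j] = (np.log(np.linalg.det(X + E))
                      - np.log(np.linalg.det(X - E))) / (2 * eps)

# grad log det(X) = (X^{-1})^T, which equals X^{-1} for symmetric X.
print(np.allclose(grad, np.linalg.inv(X), atol=1e-5))   # True
```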
Multivariate Gaussian - ML solution

Log likelihood: for a sample $x_1, \ldots, x_m$, with
$\Pr[x_i] = \frac{1}{(2\pi)^{N/2} |\sigma|^{1/2}} \exp\left(-\frac{1}{2}(x_i - \mu)^\top \sigma^{-1} (x_i - \mu)\right)$ for each $x_i$,
$L = \sum_{i=1}^{m} \log \Pr[x_i] = \sum_{i=1}^{m} \left[-\frac{N}{2}\log(2\pi) + \frac{1}{2}\log|\sigma^{-1}| - \frac{1}{2}(x_i - \mu)^\top \sigma^{-1} (x_i - \mu)\right]$.
ML solution (using the gradient identities above):
$\frac{\partial L}{\partial \mu} = \sum_{i=1}^{m} \sigma^{-1}(x_i - \mu) = 0 \Rightarrow \mu = \frac{1}{m}\sum_{i=1}^{m} x_i$.
$\frac{\partial L}{\partial \sigma^{-1}} = \sum_{i=1}^{m} \frac{1}{2}\left(\sigma - (x_i - \mu)(x_i - \mu)^\top\right) = 0 \Rightarrow \sigma = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)(x_i - \mu)^\top$.
GMMs - EM Algorithm

Mixture of Gaussians: $p_\theta[x] = \sum_{k=1}^{M} \lambda_k N(x; \mu_k, \sigma_k)$, with log likelihood
$L = \sum_{i=1}^{m} \log \sum_{k=1}^{M} \lambda_k N(x_i; \mu_k, \sigma_k)$.
EM algorithm: let $p^t_{i,k} = N(x_i; \mu^t_k, \sigma^t_k)$.
E-step: $q^{t+1}_{i,k} = p_{\theta^t}[z = k \mid x_i] = \frac{\lambda^t_k\, p^t_{i,k}}{\sum_{k=1}^{M} \lambda^t_k\, p^t_{i,k}}$.
M-step:
$\mu^{t+1}_k = \frac{\sum_{i=1}^{m} q^{t+1}_{i,k}\, x_i}{\sum_{i=1}^{m} q^{t+1}_{i,k}}$,
$\sigma^{t+1}_k = \frac{\sum_{i=1}^{m} q^{t+1}_{i,k}\, (x_i - \mu^{t+1}_k)(x_i - \mu^{t+1}_k)^\top}{\sum_{i=1}^{m} q^{t+1}_{i,k}}$,
$\lambda^{t+1}_k = \frac{1}{m} \sum_{i=1}^{m} q^{t+1}_{i,k}$.
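A compact EM sketch for a GMM with full covariances, following the update equations above (a minimal illustration assuming numpy; em_gmm and its small regularization constants are my own choices, not the lecture's):

```python
import numpy as np

def em_gmm(X, M, steps=50, seed=0):
    """EM for a GMM. X: (m, N) data matrix; M: number of components."""
    m, N = X.shape
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(m, M, replace=False)].copy()    # init means from data
    sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(N) for _ in range(M)])
    lambdas = np.full(M, 1.0 / M)
    for _ in range(steps):
        # E-step: responsibilities q[i, k] proportional to lambda_k N(x_i; ...).
        q = np.zeros((m, M))
        for k in range(M):
            diff = X - mus[k]
            quad = np.einsum('ij,ij->i', diff @ np.linalg.inv(sigmas[k]), diff)
            norm = (2 * np.pi) ** (N / 2) * np.sqrt(np.linalg.det(sigmas[k]))
            q[:, k] = lambdas[k] * np.exp(-0.5 * quad) / norm
        q /= q.sum(axis=1, keepdims=True)
        # M-step: weighted ML re-estimates, matching the updates above.
        for k in range(M):
            w = q[:, k]
            mus[k] = w @ X / w.sum()
            diff = X - mus[k]
            sigmas[k] = (w[:, None] * diff).T @ diff / w.sum() + 1e-6 * np.eye(N)
        lambdas = q.mean(axis=0)      # lambda_k = (1/m) sum_i q_{i,k}
    return lambdas, mus, sigmas

lambdas, mus, sigmas = em_gmm(np.random.randn(500, 2), M=2)
```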
Initialization

Selection of model:
- HMM topology (states and transitions).
- number of mixtures ($M$).
Flat start: all distributions share the same average values (mean and variance) computed over the entire training set; see the sketch below.
Existing hand-labeled segmentation: basis for a partial segmentation.
Uniform segmentation: equal number of HMM transitions per training example.
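A flat-start sketch, under the assumption that training frames are stacked into an (m, N) array (the variable names and random stand-in data are illustrative):

```python
import numpy as np

frames = np.random.randn(1000, 39)     # stand-in for real 39-dim feature vectors
global_mu = frames.mean(axis=0)        # one mean over the entire training set
global_var = frames.var(axis=0)        # diagonal covariance entries

# Every state distribution starts out identical:
init = {state: (global_mu.copy(), global_var.copy()) for state in range(3)}
```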
Forced Alignment

Viterbi training: approximate but faster method to determine the HMM path.
Segmental K-means: approximate but faster method to determine which Gaussian of a mixture each training instance was sampled from.
Viterbi Alignment

Idea: faster alignment based on the most likely path.
- use the current acoustic model.
- align the sequence of feature vectors with the most likely path (best path algorithm, e.g., Viterbi).
[Figure: observations $o_1, \ldots, o_{10}$ aligned to HMM transitions $e_{11}\, e_{11}\, e_{11}\, e_{12}\, e_{22}\, e_{22}\, e_{23}\, e_{33}\, e_{34}\, e_{44}$.]
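A sketch of best-path alignment for a left-to-right topology, assuming per-frame log-likelihoods are already computed; transition weights are omitted (taken as 0 in log space) purely to keep the example short:

```python
import numpy as np

def viterbi_align(log_probs):
    """log_probs[t, s]: log-likelihood of frame t under state s.
    Returns the most likely state sequence for a left-to-right HMM
    (start in state 0, end in the last state)."""
    T, S = log_probs.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0, 0] = log_probs[0, 0]
    for t in range(1, T):
        for s in range(S):
            # Left-to-right: stay in s, or arrive from s - 1.
            candidates = [(s, delta[t - 1, s])]
            if s > 0:
                candidates.append((s - 1, delta[t - 1, s - 1]))
            best_s, best_v = max(candidates, key=lambda c: c[1])
            delta[t, s] = best_v + log_probs[t, s]
            back[t, s] = best_s
    path = [S - 1]                      # backtrace from the final state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

print(viterbi_align(np.log(np.random.rand(10, 4))))  # 10 frames, 4 states
```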
Segmental K-Means

Idea: use a clustering algorithm to determine which Gaussian generated each observation.
Solution: use the K-means clustering algorithm to initialize distribution means (see the sketch below).
- Initialization: select $K$ centroids $c_1, \ldots, c_K$.
- Repeat until no centroid changes:
  - for each point $x_i$, find the closest centroid $c_j$ and assign $x_i$ to cluster $j$.
  - for each cluster $j$, redefine $c_j$ as the center of mass of cluster $j$.
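A minimal K-means sketch matching the procedure above (kmeans is a hypothetical helper name):

```python
import numpy as np

def kmeans(X, K, max_steps=100, seed=0):
    """Return K centroids and the cluster label of each point in X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_steps):
        # Assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Redefine each centroid as the center of mass of its cluster.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(K)])
        if np.allclose(new, centroids):        # no centroid change: stop
            break
        centroids = new
    return centroids, labels

centroids, labels = kmeans(np.random.randn(200, 2), K=3)
```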
Notes

GMM EM algorithm: soft version of K-means.
Variable Number of Mixtures

Problems:
- the number of mixtures required is not known in advance.
- possible over- or underfitting.
Solution (see the splitting sketch below):
- start from a single Gaussian distribution.
- create a two-component mixture with slightly perturbed means $\mu \pm \epsilon$ and the same covariance matrix.
- re-estimate the model parameters and repeat until the desired complexity is reached.
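A sketch of one splitting step (scaling eps along per-dimension standard deviations is a common heuristic and an assumption here, not a detail given in the lecture):

```python
import numpy as np

def split_component(mu, sigma, lam, eps=0.2):
    """Split one Gaussian (weight lam) into two with means mu +/- eps * std
    and the same covariance; weights are halved so the mixture still sums to 1."""
    shift = eps * np.sqrt(np.diag(sigma))
    return [(mu + shift, sigma.copy(), lam / 2),
            (mu - shift, sigma.copy(), lam / 2)]

# Start from a single Gaussian, split, re-estimate (e.g., with EM), repeat.
mu, sigma = np.zeros(2), np.eye(2)
components = split_component(mu, sigma, lam=1.0)
```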
Acoustic Modeling

In practice:
- complicated recipes or heuristics with a large number of ad hoc techniques.
- key: skilled human supervision, e.g., choice of initial parameters to avoid local minima, segmentation, choice and number of parameters.
- computationally very costly: may take many days on several processors in large-vocabulary speech recognition.
Improvements

- Adaptation (VTLN, MLLR).
- Ensemble methods (ROVER).
- Better features (e.g., LDA, MMI).
- Discriminative training.
References

L. E. Baum. An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes. Inequalities, 3:1-8, 1972.
Jeff A. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, 1998.
Chung-an Chan. A Tutorial for the Baum-Welch Algorithm, 2013.
Kai-Fu Lee. Context-Dependent Phonetic Hidden Markov Models for Continuous Speech Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(4):599-609, 1990.
Lawrence Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice Hall, 1993.
Steve Young, J. Odell, and Phil Woodland. Tree-Based State Tying for High Accuracy Acoustic Modelling. In Proceedings of the ARPA Human Language Technology Workshop, Morgan Kaufmann, San Francisco, 1994.