Discriminative Training of Decoding Graphs for Large Vocabulary Continuous Speech Recognition


1 Discriminative Training of Decoding Graphs for Large Vocabulary Continuous Speech Recognition
by Hong-Kwang Jeff Kuo, Brian Kingsbury (IBM Research) and Geoffrey Zweig (Microsoft Research), ICASSP 2007
Presented by: Eugene Weinstein, NYU, April 22nd, 2008

2 Transducers in Speech
Given observation sequence $o$, want word sequence $\hat{w}$:
\[ \hat{w} = \operatorname*{argmax}_{w} \Pr[o \mid w] \, \Pr[w] \]
Constraints modeled between HMM states/distributions $d$, context-dependent phones $c$, phonemes $p$, and words $w$:
\[ \hat{w} = \operatorname*{argmax}_{w} \sum_{d,c,p} \Pr[o \mid d] \, \Pr[d \mid c] \, \Pr[c \mid p] \, \Pr[p \mid w] \, \Pr[w] \]
Constraint set combined by using transducer composition, with the components taken to be transducers over the tropical semiring:
\[ \hat{w} = \operatorname*{argmin}_{w} \Pi_2 [ O \circ H \circ C \circ L \circ G ] \]
[Graphic: Mohri 07] observ. seq. -> H (HMM) -> CD phone seq. -> C (CD Model) -> phoneme seq. -> L (Pron. Model) -> word seq. -> G (Lang. Model) -> word seq.
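The move from an argmax over probabilities to an argmin over the tropical semiring can be illustrated with a minimal sketch. It assumes weights are negative log probabilities, so that multiplying probabilities becomes adding costs and picking the best alternative becomes taking a minimum. The candidate word sequences and their probabilities are hypothetical toy values, not taken from the slides.

```python
import math

def neg_log(p: float) -> float:
    """Map a probability to a tropical-semiring cost."""
    return -math.log(p)

# Hypothetical candidates: word sequence -> (Pr[o|w], Pr[w]).
candidates = {
    "recognize speech": (0.020, 0.30),
    "wreck a nice beach": (0.025, 0.05),
}

# Probability domain: argmax of the product Pr[o|w] * Pr[w].
best_by_prob = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])

# Tropical domain: "times" is +, "plus" is min, so we take the
# argmin of the summed costs instead.
costs = {w: neg_log(ac) + neg_log(lm) for w, (ac, lm) in candidates.items()}
best_by_cost = min(costs, key=costs.get)

print(best_by_prob, best_by_cost)
```

Both selections agree because the negative logarithm is strictly decreasing, which is why a decoder can search for the lowest-cost path instead of the most probable one.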

3 Discriminative Training
Previous work on discriminative training in speech:
  Minimum Classification Error (e.g., [Juang et al. 97]): train acoustic models with a discriminative criterion
  Discriminative learning of language models (e.g., [Roark et al. 06])
Other work extends this to training the entire CLG [Lin and Yvon 05]: construct the full constraint graph, train weights to minimize error
Present paper: same technique; larger-scale experiments

4 Discriminative Formulation
Let $\Gamma$ be the set of transition weights in $C \circ L \circ G$ and $\Lambda$ be the set of acoustic model parameters
Given observations $X = x_1, x_2, \ldots, x_t$ and a word sequence $W$, the log-prob of path $S = S_1, S_2, \ldots, S_t$ is
\[ g(X, W, S; \Lambda, \Gamma) = \alpha \, a(X, W, S; \Lambda) + b(W, S; \Gamma) \]
Decoding problem: find the best word sequence
\[ W_1 = \operatorname*{argmax}_{W,S} g(X, W, S; \Lambda, \Gamma) \]
If $W_0$ is the correct transcription, a discriminant function is
\[ d(X; \Lambda, \Gamma) = -g(X, W_0, S_0; \Lambda, \Gamma) + g(X, W_1, S_1; \Lambda, \Gamma) \]
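A small sketch of how the path score and the discriminant function fit together. The function names, the scale alpha = 0.1, and the log-domain scores are hypothetical illustration values; only the two formulas come from the slide.

```python
def path_score(acoustic: float, graph: float, alpha: float) -> float:
    """g = alpha * a + b for one path, all terms in the log domain."""
    return alpha * acoustic + graph

def discriminant(g_correct: float, g_best: float) -> float:
    """d = -g(X, W0, S0) + g(X, W1, S1)."""
    return -g_correct + g_best

alpha = 0.1  # hypothetical acoustic scale

# Forced alignment to the correct transcription W0 (toy scores).
g0 = path_score(acoustic=-120.0, graph=-8.0, alpha=alpha)
# Best full-search path W1 (toy scores); it scores higher than g0.
g1 = path_score(acoustic=-118.0, graph=-7.5, alpha=alpha)

d = discriminant(g0, g1)
print(d)
```

Since W1 maximizes g over all paths, d is never negative in this setup; it is zero when the decoder's best path is the forced-alignment path itself and grows as the competing hypothesis outscores the correct one.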

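5 Loss Function; Gradient
For utterance $X_i$, the loss is a smooth, differentiable function with range 0 to 1:
\[ l_i(X_i) = l(d(X_i)) = \frac{1}{1 + \exp(-\gamma d(X_i) + \theta)} \]
For a training set $X$, total loss is
\[ l(X) = \sum_i l_i(X_i) \]
Gradient of the loss is
\[ \nabla l(X; \Lambda_t, \Gamma_t) = \sum_i \frac{\partial l_i}{\partial d_i} \, \frac{\partial d(X_i; \Lambda, \Gamma)}{\partial \Gamma} \]
\[ \frac{\partial l_i}{\partial d_i} = -\frac{\exp(-\gamma d_i + \theta)(-\gamma)}{(1 + \exp(-\gamma d_i + \theta))^2} = \frac{\gamma \exp(-\gamma d_i + \theta)}{(1 + \exp(-\gamma d_i + \theta))^2} = \gamma \, l_i (1 - l_i) \]
$\Gamma$ is a vector of transition weights:
\[ \Gamma = (s_1, s_2, \ldots, s_k), \qquad \frac{\partial d_i}{\partial \Gamma} = \left( \frac{\partial d_i}{\partial s_1}, \frac{\partial d_i}{\partial s_2}, \ldots, \frac{\partial d_i}{\partial s_k} \right) \]
\[ \frac{\partial d(X_i; \Lambda, \Gamma)}{\partial s_j} = -I(W_0, s_j) + I(W_1, s_j) \]
where $I(W, s_j)$ counts the occurrences of transition $s_j$ on the path for $W$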
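The closed form of the derivative, gamma * l * (1 - l), can be checked numerically. The sketch below assumes the sigmoid loss stated on this slide; the function names and the default values gamma = 1 and theta = 0 are hypothetical choices for illustration.

```python
import math

def mce_loss(d: float, gamma: float = 1.0, theta: float = 0.0) -> float:
    """Sigmoid loss l = 1 / (1 + exp(-gamma * d + theta))."""
    return 1.0 / (1.0 + math.exp(-gamma * d + theta))

def mce_loss_grad(d: float, gamma: float = 1.0, theta: float = 0.0) -> float:
    """Closed-form derivative dl/dd = gamma * l * (1 - l)."""
    l = mce_loss(d, gamma, theta)
    return gamma * l * (1.0 - l)

def finite_difference(d: float, gamma: float, theta: float, h: float = 1e-6) -> float:
    """Central difference approximation of dl/dd, used only as a check."""
    return (mce_loss(d + h, gamma, theta) - mce_loss(d - h, gamma, theta)) / (2 * h)

# Compare the two at an arbitrary point with arbitrary parameters.
analytic = mce_loss_grad(0.7, gamma=2.0, theta=0.5)
numeric = finite_difference(0.7, gamma=2.0, theta=0.5)
print(analytic, numeric)
```

The derivative peaks where l is near 0.5 and vanishes for utterances with very large or very small d, so utterances close to the decision boundary contribute most to the gradient.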

6 Computing Gradient
$F_i$: HMM state sequence of best forced alignment path of training utterance $X_i$ to the correct word sequence
$B_i$: HMM state sequence of best full search path
$T$: Transducer mapping HMM sequence to transition sequence (each transition in CLG gets distinct label)
$I(W_0, s_j)$ and $I(W_1, s_j)$: counts of $s_j$ in $F_i \circ T$ and $B_i \circ T$
Gradient descent training:
\[ \Gamma_{t+1} = \Gamma_t - \epsilon \, \nabla l(X; \Lambda_t, \Gamma_t) \]
Alternative update: Quickprop:
\[ \Gamma_{t+1} = \Gamma_t - \left[ (\nabla^2 l(\Gamma))^{-1} + \epsilon \right] \nabla l(X; \Lambda_t, \Gamma_t) \]
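The counting step and the plain gradient update can be sketched as follows. This is a simplified illustration, not the authors' implementation: the outputs of composing F_i and B_i with T are assumed to be available as lists of transition labels, and the labels, function names, and step size are hypothetical.

```python
import math
from collections import Counter

def utterance_gradient(forced, best, d, gamma=1.0, theta=0.0):
    """Per-transition gradient gamma * l * (1 - l) * (-I(W0, s) + I(W1, s))."""
    l = 1.0 / (1.0 + math.exp(-gamma * d + theta))
    scale = gamma * l * (1.0 - l)
    counts_forced = Counter(forced)  # I(W0, s): counts on the forced-alignment path
    counts_best = Counter(best)      # I(W1, s): counts on the best full-search path
    grad = {}
    for s in set(counts_forced) | set(counts_best):
        delta = -counts_forced[s] + counts_best[s]
        if delta != 0:
            grad[s] = scale * delta
    return grad

def gradient_step(weights, grad, epsilon):
    """Gamma_{t+1} = Gamma_t - epsilon * gradient."""
    return {s: w - epsilon * grad.get(s, 0.0) for s, w in weights.items()}

# Toy example: the two paths differ in a single transition.
forced_path = ["t1", "t2", "t3"]
best_path = ["t1", "t4", "t3"]
weights = {"t1": 0.0, "t2": 0.0, "t3": 0.0, "t4": 0.0}

grad = utterance_gradient(forced_path, best_path, d=0.0)
updated = gradient_step(weights, grad, epsilon=1.0)
print(grad, updated)
```

Transitions shared by both paths cancel out, so only the transitions where the decoder diverged from the reference are adjusted: the weight on the correct path's transition goes up and the weight on the competing transition goes down.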

7 Experiments
Language model: trained on 132M-word Broadcast News
  Small bigram model to make training feasible
  Large 4-gram model used for lattice rescoring
Acoustic model: trained on 143-hour BN corpus
Test audio: 2.93 hours, 25K words
Effect of training set size (baseline WER 23.0%):
[Table: d threshold, no. of training sentences, WER; row values not recoverable]

8 Experiments, cont'd
Discriminative training (MCE) works better than ML
However, training is infeasible for the full language model
We can first use the small MCE model, then rescore with the large ML model; however, much of the benefit is lost

               beam   ML     MCE     Diff    %Diff
one-pass       --     --     21.7%   1.3%    5.7%
               --     --     21.9%   1.3%    5.6%
               --     --     22.3%   1.5%    6.3%
               --     --     23.7%   2.0%    7.8%
               --     --     30.1%   3.4%    10.1%
LM rescored    --     --     17.6%   0.1%    0.6%
               --     --     18.0%   0.5%    2.7%
               --     --     18.9%   0.5%    2.6%
               --     --     21.0%   1.2%    5.4%
               --     --     28.4%   3.3%    10.4%
(-- marks beam and ML entries that were not recoverable)
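The surviving columns can be cross-checked arithmetically. The sketch below assumes that Diff is the absolute WER reduction (ML minus MCE) and that %Diff is that reduction relative to the ML WER; under that assumption the ML column is implied by MCE + Diff. The implied values are a derivation for checking consistency, not figures quoted from the slide.

```python
# (MCE WER, Diff, %Diff) as they appear on the slide, in percent.
one_pass = [(21.7, 1.3, 5.7), (21.9, 1.3, 5.6), (22.3, 1.5, 6.3),
            (23.7, 2.0, 7.8), (30.1, 3.4, 10.1)]
lm_rescored = [(17.6, 0.1, 0.6), (18.0, 0.5, 2.7), (18.9, 0.5, 2.6),
               (21.0, 1.2, 5.4), (28.4, 3.3, 10.4)]

def implied_rows(rows):
    """Return (implied ML WER, implied %Diff, reported %Diff) per row."""
    out = []
    for mce, diff, pct in rows:
        ml = round(mce + diff, 1)                 # assumed: ML = MCE + Diff
        implied_pct = round(100.0 * diff / ml, 1) # assumed: relative to ML
        out.append((ml, implied_pct, pct))
    return out

for name, rows in (("one-pass", one_pass), ("LM rescored", lm_rescored)):
    for ml, implied_pct, pct in implied_rows(rows):
        print(f"{name:12s} implied ML {ml:5.1f}%  %Diff {implied_pct:4.1f} (reported {pct})")
```

Under these assumptions every implied %Diff matches the reported one, and the first one-pass row implies an ML WER of 23.0%, which agrees with the baseline WER given on the previous slide.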

9 Experiments, cont'd
Quickprop results in faster convergence
[Figure: three panels, Training Set Objective Function, Training Set WER (%), and Test Set WER (%), each plotted against number of epochs for gradient descent and quickprop]

10 References
Biing-Hwang Juang, Wu Chou, and Chin-Hui Lee, Minimum classification error rate methods for speech recognition, IEEE Transactions on Speech and Audio Processing, vol. 5, no. 3, May 1997.
Shiuan-Sung Lin and François Yvon, Discriminative training of finite state decoding graphs, in Proc. Interspeech 2005, Lisbon, Portugal, Sept. 2005.
Brian Roark, Murat Saraclar, and Michael Collins, Discriminative n-gram language modeling, Computer Speech and Language, 21(2).
