Discriminative Training and Adaptation of Large Vocabulary ASR Systems

1 Discriminative Training and Adaptation of Large Vocabulary ASR Systems
Phil Woodland
ICSI Seminar: March 30th 2004

2 Overview
Why use discriminative training for LVCSR?
MMIE/CMLE criterion & simple example
Issues for LVCSR
  optimisation methods: extended Baum-Welch algorithm
  computation: lattice-based training
  generalisation
MMIE performance
  within task
  across task
MPE criterion

3
  error based criterion
  I-smoothing
  performance
Discriminative MAP
  weak-sense auxiliary functions
  task-based adaptation
Discriminative linear transform based adaptation
  supervised adaptation
  discriminative SAT
  unsupervised discriminative adaptation
Current lines of research
Conclusions

4
Many others have worked/are working at CUED on HMM-based discriminative training and adaptation for large vocab ASR systems, including:
Ricky Chan, KK Chin, Ricardo de Cordoba, Mark Gales, Do Yeong Kim, Julian Odell, Dan Povey, Khe Chai Sim, Luis Uebel, Valtcho Valtchev, Lan Wang, Steve Young, Kai Yu

5 Why Discriminative Criteria?
Standard HMM training uses maximum likelihood estimation (MLE)
MLE optimisation criterion is
$$F_{\mathrm{MLE}}(\lambda) = \sum_{r=1}^{R} \log P_\lambda(O_r \mid M_{w_r})$$
$w_r$ is the transcription for utterance $r$ and $M_{w_r}$ the corresponding model.
Would be optimal if several unrealistic assumptions met
  Infinite training set size
  Model correctness
Neither condition met for speech recognition, hence interesting to investigate alternatives, especially discriminative schemes such as MMIE (& MPE)

6 MMIE Basics
Maximum mutual information estimation (MMIE) maximises the sentence-level posterior; in log form
$$F_{\mathrm{MMIE}}(\lambda) = \sum_{r=1}^{R} \log \frac{P_\lambda(O_r \mid M_{w_r})\, P(w_r)}{\sum_{w} P_\lambda(O_r \mid M_{w})\, P(w)}$$
Numerator is likelihood of data given correct transcription (as for MLE)
Denominator expands total likelihood in terms of all word sequences
Can compute denominator by finding likelihood through composite HMM with all recognition constraints (recognition model)
Need to optimise rational objective function (harder than for MLE)
  Maximise numerator (MLE term)
  Simultaneously minimise denominator
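As a rough illustration of the MMIE criterion above, the sketch below evaluates it on a toy problem where the denominator sum runs over an explicit, tiny list of competing word sequences. The function names, the data layout and the toy scores are assumptions made for the example; a real LVCSR system would use lattices rather than enumerating hypotheses.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def mmie_objective(utterances):
    """Sum over utterances of the log posterior of the reference transcription.

    Each utterance is (ref, hyps). hyps maps a word sequence w to a pair
    (log P(O|M_w), log P(w)). The reference must appear among the hypotheses.
    """
    total = 0.0
    for ref, hyps in utterances:
        joint = {w: acoustic + lm for w, (acoustic, lm) in hyps.items()}
        # Numerator: reference score. Denominator: sum over all hypotheses.
        total += joint[ref] - log_sum_exp(list(joint.values()))
    return total

# Hypothetical toy data: one utterance, two competing word sequences.
toy = [("a b", {"a b": (-10.0, math.log(0.5)),
                "a c": (-12.0, math.log(0.5))})]
value = mmie_objective(toy)
```

The value is always at most zero, and it approaches zero as the reference dominates the competing hypotheses, which is what raising the numerator while lowering the denominator achieves.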

7
More closely related to word error rate than MLE
  not optimising an error rate directly
Strictly Conditional Maximum Likelihood Estimator (CMLE), but here equivalent to MMIE, since LM fixed
Widely used on small vocab tasks since late 1980s/early 1990s
Can compute denominator using recognition pass
MMIE weights training data unequally (well classified: small weight)
  MLE gives all training samples equal weight
Simple example shows usefulness with incorrect model assumptions
  Two class static pattern recognition problem
  Two dimensional data from full covariance Gaussian
  Modelled with diagonal covariance Gaussian

8 Simple MMIE Example
[Figure: panels "MLE solution (full covariance)", "MLE solution (diagonal)" and "MMIE solution"; plot of MMIE criterion / error rate against iteration.]

9 MMIE Issues for LVCSR
Need to have effective optimisation technique that scales well to large systems.
Optimisation: Extended Baum-Welch (Gopalakrishnan et al, Normandin)
$$\hat\mu_{jm} = \frac{\{\theta_{jm}^{num}(O) - \theta_{jm}^{den}(O)\} + D\mu_{jm}}{\{\gamma_{jm}^{num} - \gamma_{jm}^{den}\} + D}$$
$$\hat\sigma_{jm}^{2} = \frac{\{\theta_{jm}^{num}(O^2) - \theta_{jm}^{den}(O^2)\} + D(\sigma_{jm}^{2} + \mu_{jm}^{2})}{\{\gamma_{jm}^{num} - \gamma_{jm}^{den}\} + D} - \hat\mu_{jm}^{2}$$
Gaussian occupancies (summed over time) are $\gamma_{jm}$.
$\theta_{jm}(O)$ and $\theta_{jm}(O^2)$ are sums of data and squared data respectively, weighted by occupancy.
num and den denote correct word sequence & recognition model respectively.
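The EBW update formulae above can be sketched for a single Gaussian and a single feature dimension as follows. The dictionary layout for the statistics and the function name are assumptions made for the example; the choice of the constant D is left to the caller.

```python
def ebw_update(num, den, mu, var, D):
    """One extended Baum-Welch update of a scalar Gaussian mean and variance.

    num and den hold numerator and denominator statistics:
      'gamma' occupancy, 'x' occupancy-weighted sum of data,
      'x2' occupancy-weighted sum of squared data.
    mu and var are the current parameters, D the smoothing constant.
    """
    denom = (num["gamma"] - den["gamma"]) + D
    if denom <= 0.0:
        raise ValueError("D is too small: update denominator is not positive")
    new_mu = ((num["x"] - den["x"]) + D * mu) / denom
    new_var = ((num["x2"] - den["x2"]) + D * (var + mu * mu)) / denom - new_mu ** 2
    return new_mu, new_var

# Hypothetical statistics for illustration.
num_stats = {"gamma": 4.0, "x": 8.0, "x2": 20.0}
no_den = {"gamma": 0.0, "x": 0.0, "x2": 0.0}

# With no denominator statistics and D = 0 the update is the ML estimate.
ml_mu, ml_var = ebw_update(num_stats, no_den, mu=0.0, var=1.0, D=0.0)

# A very large D keeps the parameters close to their current values.
slow_mu, slow_var = ebw_update(num_stats, no_den, mu=0.0, var=1.0, D=1e9)
```

The two calls show the role of D: it interpolates between a full step driven by the difference of the statistics and no change at all.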

10
Denominator requires computation of all sentence likelihoods: approximate with lattices
Require good generalisation
  Can reduce training set error rate: need to reduce test-set errors!
  Not just better with small numbers of parameters (as often thought with MMIE)
  Need to increase confusable data for training
    Use acoustic scaling to broaden posterior distribution across denominator
    Weakened language model type to increase confusable data with focus on acoustics (Schlueter et al)
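The effect of acoustic scaling on the denominator posteriors can be seen in a small numerical sketch. The scale factor `kappa`, its value of 0.1 and the toy log-likelihoods are assumptions chosen for illustration, not values taken from the slides.

```python
import math

def posteriors(log_likes, log_priors, kappa=1.0):
    """Hypothesis posteriors with acoustic log-likelihoods scaled by kappa."""
    scores = [kappa * ll + lp for ll, lp in zip(log_likes, log_priors)]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical sentence-level acoustic log-likelihoods for three hypotheses.
log_likes = [-1000.0, -1020.0, -1030.0]
log_priors = [math.log(1.0 / 3.0)] * 3

sharp = posteriors(log_likes, log_priors, kappa=1.0)   # unscaled
broad = posteriors(log_likes, log_priors, kappa=0.1)   # acoustically scaled
```

Without scaling, almost all posterior mass sits on the best hypothesis, so competitors contribute nothing to the denominator statistics. With scaling, the competing hypotheses receive non-negligible weight, which is the sense in which the data become more confusable.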

11 Original Lattice Based MMIE
Introduced by Valtchev, Odell, Woodland & Young (1996)
Use a word-lattice to represent numerator & denominator terms
  Recognise every training sentence with a bigram LM (denominator)
  Accumulate statistics for EBW via forward-backward pass on lattice
  Forward-backward at word-level: Viterbi at state level
  Iterate EBW training using fixed word level lattice
Evaluated on Wall Street Journal (SI284 training)
  Good test-set gains for simpler models
  Small/zero gains for more complex models
  Very effective at reducing training-set error rate (for denominator lattices)

12 Current (2000-) MMIE Implementation
Generate word lattices for training set with a fast recogniser using MLE models
Generate phone-marked lattices with model boundary times
Run EBW algorithm for several iterations
  Exact match lattice search
  Only run forward-backward between boundaries
  Use acoustic scaling of complete segments (by LM probabilities)
  F-B passes use unigram (or v. small bigram) language model scores
Parameter updates
  Standard updating formulae for means/variances
  Gaussian specific D constant with flooring
  Revised updates for mixture weights, which lead to faster convergence

13 NAB/WSJ MMIE Results
Standard HTK LVCSR system (no adaptation), 66 hour training set
[Table: results by number of mixture components (#Mix Comp) on H1 dev and H1 eval, each for MLE and MMIE; numeric entries not preserved.]
Bigger reductions in WER for simpler systems
All model complexities improve

14 Cross-Task NAB/WSJ MMIE Results
Test discriminative training across task
Train on WSJ-type data and test on broadcast news
[Table: columns Train Setup, Avg, F0, F1, F2, F4, FX; rows NAB-C2 MLE, NAB-C2 MMIE, BN-36H MLE, BN-72H MLE, BN-72H MMIE; numeric entries not preserved.]
%WER on BNdev96pe data using trigram, GI, for NAB channel 2 models trained with either MLE or MMIE. The use of BN training data is also shown for comparison.
MMIE always better than MLE even with severe mismatch

15 Error Rates on Conversational Telephone Speech
[Table: rows by iteration number, starting at 0 (MLE); columns eval97sub and eval98 for 68 hour training, eval97sub and eval98 for 265 hour training; numeric entries not preserved.]
%WER from several iterations of MMIE training on CTS data
Sizeable absolute reductions in WER after 4 iterations (eval98)
  2.3% for 68 hour training set
  3.4% for 265 hour training
Larger gains with increased training set sizes is general pattern

16 MPE Objective Function
Maximise the following function:
$$F_{\mathrm{MPE}}(\lambda) = \sum_{r=1}^{R} \frac{\sum_{w} p_\lambda(O_r \mid w)\, P(w)\, \mathrm{RawAccuracy}(w)}{\sum_{w} p_\lambda(O_r \mid w)\, P(w)}$$
RawAccuracy(w) measures the number of phones correctly transcribed in sentence w (derived from word recognition),
i.e. the number of correct phones in w minus the number of inserted phones in w
$F_{\mathrm{MPE}}(\lambda)$ is weighted average of RawAccuracy(w) over all w.
MPE is smoothed approx to phone error in a word recognition context
Error measure reduces sensitivity to outliers
Can use lattice-based implementation (requires time-based alignments for errors) and new statistics computation to still use EBW update formulae
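The reading of the MPE objective as a posterior-weighted average of RawAccuracy can be sketched on an explicit hypothesis list. The tuple layout, function names and toy scores below are assumptions for the example; the lattice-based computation mentioned above is not shown.

```python
import math

def raw_accuracy(num_correct, num_inserted):
    """Phones correct minus phones inserted, as defined above."""
    return num_correct - num_inserted

def mpe_objective(utterances):
    """Sum over utterances of the posterior-weighted average raw accuracy.

    Each utterance is a list of hypotheses (log p(O|w), log P(w), accuracy).
    """
    total = 0.0
    for hyps in utterances:
        scores = [acoustic + lm for acoustic, lm, _ in hyps]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        total += sum(wt * acc for wt, (_, _, acc) in zip(weights, hyps)) / norm
    return total

# Hypothetical utterance: two equally likely hypotheses.
hyps = [(-50.0, math.log(0.5), raw_accuracy(4, 0)),
        (-50.0, math.log(0.5), raw_accuracy(3, 1))]
value = mpe_objective([hyps])
```

With equal posteriors the objective is the plain mean of the two accuracies. Shifting probability mass towards the more accurate hypothesis raises it, without requiring the exactly correct sentence to win.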

17 Improved Generalisation using I-smoothing
Use of discriminative criteria can easily cause over-training
Get smoothed estimates of parameters by combining Maximum Likelihood (ML) and MPE objective functions for each Gaussian
Rather than globally interpolate (H-criterion), amount of ML smoothing depends on the amount of data per Gaussian
I-smoothing adds τ samples of the average ML statistics for each Gaussian. Typically τ = 50.
  For MMI scale numerator counts appropriately
  For MPE need ML counts in addition to other MPE statistics
I-smoothing essential for MPE (& helps a little for MMI)
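One way to picture "adding τ samples of the average ML statistics" is as an adjustment of the numerator statistics of each Gaussian before the EBW update. The sketch below assumes scalar features and a dictionary layout for the statistics; both are choices made for the example.

```python
def i_smooth(num, mle, tau=50.0):
    """Add tau points of average ML statistics to the numerator statistics.

    num and mle hold 'gamma' (occupancy), 'x' (weighted data sum) and
    'x2' (weighted squared-data sum) for one Gaussian.
    """
    if mle["gamma"] <= 0.0:
        return dict(num)  # no ML data seen for this Gaussian: leave unchanged
    return {
        "gamma": num["gamma"] + tau,
        "x": num["x"] + tau * mle["x"] / mle["gamma"],
        "x2": num["x2"] + tau * mle["x2"] / mle["gamma"],
    }

# Hypothetical statistics for one Gaussian.
num_stats = {"gamma": 10.0, "x": 20.0, "x2": 50.0}
mle_stats = {"gamma": 100.0, "x": 300.0, "x2": 1000.0}
smoothed = i_smooth(num_stats, mle_stats, tau=50.0)
```

Because τ is a fixed count, Gaussians with little data are pulled strongly towards their ML estimates while well-trained Gaussians are barely affected. For MMI the numerator statistics are themselves the ML statistics, so the same operation amounts to scaling the numerator counts.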

18 MPE CTS results
[Table: columns % WER Train, % WER eval98, % WER redn (test); rows MLE, MMIE, MMIE (τ = 200), MPE (τ = 50); numeric entries not preserved.]
HMMs trained on 68hr set. Train uses lattice unigram
[Table: columns % WER Train, % WER eval98, % WER redn (test); rows MLE baseline, MMIE, MMIE (τ = 200), MPE (τ = 100); numeric entries not preserved.]
HMMs trained on 265hr train. Train is lattice unigram
I-smoothing reduces the error rate with MMI by [value not preserved]% abs
MPE/I-smoothing gives around 1% abs lower WER than previous MMIE results

19 Discriminative MAP
Maximum A Posteriori (MAP) is a standard adaptation scheme:
  increasing adaptation data tends to Maximum Likelihood estimation;
  referred to as ML-MAP
For discriminative MAP schemes:
  increasing adaptation data tends to discriminative estimation;
  maximum mutual information (MMI-MAP) and minimum phone error (MPE-MAP) adaptation investigated.
Evaluation for task porting from CTS to Voicemail
Also used for creation of gender dependent models

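20 Strong/Weak Sense Auxiliary Functions
[Figure: two panels plotting $F(\lambda)$ and $G(\lambda, \hat\lambda)$ against $\lambda$ around $\hat\lambda$: (a) strong sense, (b) weak sense.]
Strong sense: used for standard EM, guaranteed convergence; requires
$$G(\lambda, \hat\lambda) - G(\hat\lambda, \hat\lambda) \le F(\lambda) - F(\hat\lambda)$$
Weak sense: applicable to MMI, yields Extended BW; requires
$$\left.\frac{\partial}{\partial\lambda} G(\lambda, \hat\lambda)\right|_{\lambda=\hat\lambda} = \left.\frac{\partial}{\partial\lambda} F(\lambda)\right|_{\lambda=\hat\lambda}$$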

21 Weak Sense Auxiliary functions for MMI
MMI criterion may be expressed as
$$F_{\mathrm{MMIE}}(\lambda) = \log p(O \mid M_{num}) - \log p(O \mid M_{den})$$
The weak sense auxiliary function is
$$G_{\mathrm{MMIE}}(\lambda, \hat\lambda) = G^{num}(\lambda, \hat\lambda) - G^{den}(\lambda, \hat\lambda) + G^{sm}(\lambda, \hat\lambda)$$
where $G^{num}(\lambda, \hat\lambda)$ and $G^{den}(\lambda, \hat\lambda)$ are standard strong sense auxiliary functions.
A smoothing term is added to improve stability; it satisfies
$$\left.\frac{\partial}{\partial\lambda} G^{sm}(\lambda, \hat\lambda)\right|_{\lambda=\hat\lambda} = 0$$
This ensures that final function is still a valid weak sense auxiliary function, and appropriate choice yields E-BW

22 Incorporating Prior Information
By definition a function is a weak sense auxiliary function of itself: a log-prior may be directly added to the weak sense auxiliary function.
To make normal discriminative training more robust, the ML estimate of the parameter values can be used as the centre of an appropriately defined prior distribution
This yields I-smoothing
$$\mu_j = \frac{\{\theta_j^{num}(O) - \theta_j^{den}(O)\} + D_j \hat\mu_j + \tau^I \mu_j^{ml}}{\{\gamma_j^{num} - \gamma_j^{den}\} + D_j + \tau^I}$$
(here $\hat\mu_j$ is the current mean, matching $\hat\lambda$ in the auxiliary functions)
$\tau^I$ determines influence of prior (ML estimate) on the final MMI estimate.

23 MMI-MAP
For adaptation/porting the ML estimate may not be robust: use an ML-MAP estimate as the prior
Use count-smoothing ML-MAP with prior parameters ($\tilde\mu_j$)
$$\mu_j = \frac{\{\theta_j^{num}(O) - \theta_j^{den}(O)\} + D_j \hat\mu_j + \tau^I \mu_j^{map}}{\{\gamma_j^{num} - \gamma_j^{den}\} + D_j + \tau^I}$$
where
$$\mu_j^{map} = \frac{\theta_j^{num}(O) + \tau \tilde\mu_j}{\gamma_j^{num} + \tau}$$
Two smoothing variables for MMI-MAP
  $\tau$ determines how close the prior is to the ML estimate
  $\tau^I$ determines how much the prior influences the final estimate.
Similar form may be used for MPE-MAP.
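The two-stage smoothing in MMI-MAP can be sketched for a scalar mean as below: first an ML-MAP estimate is formed from the numerator statistics and the prior mean, then it is used as the centre of the prior in the discriminative update. Function and argument names are assumptions for the example, and passing the ML estimate as `centre` instead gives the I-smoothing form.

```python
def map_mean(theta_num, gamma_num, prior_mean, tau):
    """Count-smoothed ML-MAP mean from numerator statistics."""
    return (theta_num + tau * prior_mean) / (gamma_num + tau)

def smoothed_mean(theta_num, theta_den, gamma_num, gamma_den,
                  current_mean, D, tau_I, centre):
    """Discriminative mean update with a prior centred on `centre`."""
    numer = (theta_num - theta_den) + D * current_mean + tau_I * centre
    denom = (gamma_num - gamma_den) + D + tau_I
    return numer / denom

# Hypothetical adaptation statistics for one Gaussian.
theta_num, gamma_num = 30.0, 10.0   # ML mean would be 3.0
theta_den, gamma_den = 12.0, 4.0

# tau pulls the MAP centre from the ML mean (3.0) towards the prior mean (1.0).
mu_map = map_mean(theta_num, gamma_num, prior_mean=1.0, tau=10.0)

# tau_I controls how strongly that centre influences the final estimate.
mu_new = smoothed_mean(theta_num, theta_den, gamma_num, gamma_den,
                       current_mean=2.5, D=2.0, tau_I=50.0, centre=mu_map)
```

With little adaptation data the occupancies are small relative to τ and τ^I, so the result stays near the prior. As the data grow the statistics dominate and the estimate tends to the purely discriminative one.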

24 Switchboard to Voicemail Porting Results
[Figure: test WER on Voicemail against hours of adaptation data; curves ML > ML-MAP, ML > MMI-MAP, MMI > ML-MAP, MMI > MMI-MAP.]
WERs on Voicemail for varying amounts of adaptation data, (MMI or ML) adapted with (MMI-MAP or ML-MAP)
4.5% relative improvement from MMI-MAP vs. ML-MAP (starting from 30h adaptation data)

25 Discriminative Linear-Transform Based Adaptation
Adaptation by estimating a set of linear transforms for Gaussian means and/or variances
  Normally computed using ML (MLLR)
  Can estimate transforms for the model parameters or apply to the features
Can estimate transforms for various discriminative criteria, including MMI and MPE, using theory of weak-sense auxiliary functions
Investigated for supervised and unsupervised adaptation.
Can also apply to discriminative speaker adaptive training
  training set transforms estimated for each training speaker/condition to account for variability
  estimate canonical model after applying transforms
  use MMI/MPE for both canonical model and transforms

26 Supervised Adaptation with DLT
Adapt native speaker models to non-natives (40 enrollment utterances)
WSJ/NAB 1994 S3 dev/eval sets
Use either interpolation of MMI/ML criteria (H-crit) or MPE
Single iteration WERs shown (more iterations help further)
[Table: rows s3-dev, s3-eval (test sets); columns Unadapt, MLLR, H-crit DLT, MPE-DLT; numeric entries not preserved.]
Large improvements from adaptation
Gains from discriminative adaptation

27 Unsupervised Adaptation with DLT
Unsupervised discriminative adaptation is a challenge!
  DLT can learn supervision information very effectively...
  Use supervision from strong LM and weak LM for denominator
  Include confidence scores on words
Evaluated in part of a CTS transcription system
  Supervision from MLLR-adapted confusion network decoding: 27.0% WER
  Small gains from unsupervised DLT
[Table: columns MLLR, MPE DLT, MPE DLT + conf; numeric entries not preserved.]

28 Other Current Work
Discriminative training is being applied in a range of other ways to various models
For very large datasets using lightly-supervised training methods
  recogniser generated transcriptions with 5-10% WER (strong LM)
  use weaker LM for confusable data
For joint cluster-adaptive training and linear transform estimation
For various types of precision matrix modelling
For other extended forms of HMMs
  parameters of linear predictive HMMs
  to best determine the model structure
Still working on refinements to basic process
  lattice (re-)generation & combination
  forms of smoothing/prior

29 Summary & Outlook
Discriminative training is effective for large vocabulary recognition
Important to address basic issues
  efficient optimisation (EBW + lattice schemes)
  generalisation (acoustic scaling, weakened LMs, stopping overtraining)
Interesting properties
  WER difference to ML is bigger with more data
  Most effective with smaller number of parameters
  Improvements under within-task and cross-task conditions
All leading large vocab research systems now use discriminative training
Minimum Phone Error training more effective than MMIE

30
Std approach to training systems at Cambridge since 2002
Theoretical extensions using concept of weak-sense auxiliary functions
  New derivation of extended Baum-Welch algorithm
  Discriminative MAP adaptation schemes (better task porting)
  Discriminative linear transforms (supervised and unsupervised adaptation)
Still refinements and application to more complex model structures
