
1 Speech Recognition Lecture 12: Lattice Algorithms. Cyril Allauzen Google, NYU Courant Institute Slide Credit: Mehryar Mohri

2 This Lecture Speech recognition evaluation N-best string algorithms Lattice generation Discriminative training

3 Performance Measure Accuracy: based on the edit-distance between the speech recognition transcription and the reference transcription. word or phone accuracy. lattice oracle accuracy: edit-distance between the lattice and the reference transcription. Note: the performance measure (word-error rate) does not match the quantity optimized to learn the models.

4 Word Error Rates [Table: word error rates across tasks; values lost in extraction. Based on 1998 evaluation.]

5 Word Error Rate [Table: word error rate by task and hours of training data, comparing DNN-HMM, GMM-HMM trained on the same data, and GMM-HMM trained on more data, for Switchboard test sets 1 and 2 (GMM-HMM with 2,000 h), Broadcast News, Bing Voice Search (sentence error rate), Google Voice Input (GMM-HMM with >> 5,870 h), and YouTube; numeric entries lost in extraction.] Source: Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, Brian Kingsbury. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, 2012.

6 Edit-Distance Definition: minimal cost of a sequence of edit operations transforming one string into another. Edit operations and costs: standard edit-distance definition: insertions, deletions, substitutions, all with cost one. general case: more general operations, arbitrary non-negative costs. Application: measuring word error rate in speech recognition and other string processing tasks.
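To make the standard definition concrete, here is a minimal Python sketch of the textbook quadratic dynamic program with unit costs; the function name edit_distance and the test strings are ours, not from the lecture.

```python
def edit_distance(x, y):
    """Standard edit-distance: unit-cost insertions, deletions, substitutions."""
    m, n = len(x), len(y)
    # d[i][j] = edit-distance between the prefixes x[:i] and y[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all of x[:i]
    for j in range(n + 1):
        d[0][j] = j                               # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[m][n]

assert edit_distance("kitten", "sitting") == 3   # 2 substitutions + 1 insertion
```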

7 Local Edits Edit operations: insertion: ε → a, deletion: a → ε, substitution: a → b (a ≠ b). Example with 2 insertions, 3 deletions, 1 substitution:
c t t g ε ε a c
ε t a ε g t ε c
This is called an alignment.

8 Edit-Distance Computation Standard case: textbook recursive algorithm (Cormen, Leiserson, and Rivest, 1992), quadratic O(|x||y|) complexity for two strings x and y. General case (Mohri, Pereira, and Riley, 2000; Mohri, 2003): construct a tropical-semiring edit-distance transducer T_e with arbitrary edit costs. represent x and y by automata X and Y. compute the best path of X ∘ T_e ∘ Y. complexity quadratic: O(|T_e| |X| |Y|).
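For comparison with the transducer formulation, here is a hedged Python sketch of the same dynamic program with arbitrary non-negative costs passed in as functions; the ins_cost/del_cost/sub_cost interface is our illustration, not the T_e construction, and the default substitution cost of 2.0 for pairs the next slide leaves undefined is an assumption.

```python
def general_edit_distance(x, y, ins_cost, del_cost, sub_cost):
    """Edit-distance with arbitrary non-negative operation costs.

    ins_cost(b): cost of eps -> b; del_cost(a): cost of a -> eps;
    sub_cost(a, b): cost of a -> b (0 for a match). This computes the
    same value as the best path of X o T_e o Y in the tropical semiring.
    """
    m, n = len(x), len(y)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost(x[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost(y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + del_cost(x[i - 1]),
                          d[i][j - 1] + ins_cost(y[j - 1]),
                          d[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]))
    return d[m][n]

# Costs from the next slide: A<->G costs 1, A<->T and G<->C cost 0.5,
# matches are free, insertions/deletions cost 2 (other pairs: our default).
subs = {frozenset("AG"): 1.0, frozenset("AT"): 0.5, frozenset("GC"): 0.5}
print(general_edit_distance(
    "AGCT", "ACCTG",
    ins_cost=lambda b: 2.0,
    del_cost=lambda a: 2.0,
    sub_cost=lambda a, b: 0.0 if a == b else subs.get(frozenset((a, b)), 2.0)))
# -> 2.5, the cost of the best path shown on slide 10
```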

9 Global Alignment - Example Example: c(A, G) = 1, c(A, T) = c(G, C) = 0.5, no cost for matching symbols, insertion/deletion cost 2 (not pictured). [Figure: edit-distance transducer T_e with arcs A:A/0, C:C/0, G:G/0, T:T/0, A:G/1.0, G:A/1.0, A:T/0.5, T:A/0.5, C:G/0.5, G:C/0.5; linear automata for the strings AGCT and ACCTG.] echo "A G C T" | ngramsymbols > agct.syms; echo "A G C T" | farcompilestrings --symbols=agct.syms --keep_symbols=1 --generate_keys=1 --far_type=fst

10 Global Alignment - Example Program: fstcompose X.fst Te.fst | fstcompose - Y.fst | fstshortestpath --nshortest=1 > A.fst [Figure: resulting best-path alignment A:A/0 → G:C/0.5 → C:C/0 → T:T/0 → ε:G/2, total cost 2.5.]

11 Edit-Distance of Automata Definition: the edit-distance of two automata A and B is the minimum edit-distance of a string accepted by A and a string accepted by B. Computation: best path of A ∘ T_e ∘ B. complexity for acyclic automata: O(|T_e| |A| |B|). Generality: any weighted transducer in the tropical semiring defines an edit-distance. the edit-distance transducer can be learned using the EM algorithm.
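The composition-and-best-path computation can be sketched directly as a shortest-path search over the product of the two automata; the dict-based encoding and unit edit costs below are our assumptions, not OpenFst's API.

```python
import heapq

def automata_edit_distance(A, B):
    """Minimum edit-distance between a string accepted by A and one accepted by B.

    Each automaton is (start, finals, arcs) with arcs: state -> [(label, next)].
    A plain Dijkstra over the product construction, conceptually the best path
    of A o T_e o B with unit edit costs (a sketch, not OpenFst's fstcompose).
    """
    (sa, fa, ta), (sb, fb, tb) = A, B
    dist = {(sa, sb): 0}
    heap = [(0, sa, sb)]
    while heap:
        d, p, q = heapq.heappop(heap)
        if d > dist.get((p, q), float("inf")):
            continue                  # stale queue entry
        if p in fa and q in fb:
            return d                  # first settled final product state wins
        moves = []
        for a, p2 in ta.get(p, []):
            moves.append((1, p2, q))                        # deletion: a -> eps
            for b, q2 in tb.get(q, []):
                moves.append((0 if a == b else 1, p2, q2))  # match / substitution
        for b, q2 in tb.get(q, []):
            moves.append((1, p, q2))                        # insertion: eps -> b
        for c, p2, q2 in moves:
            if d + c < dist.get((p2, q2), float("inf")):
                dist[(p2, q2)] = d + c
                heapq.heappush(heap, (d + c, p2, q2))
    return float("inf")

# A accepts {"ab"}, B accepts {"ab", "b"}; their edit-distance is 0.
A = (0, {2}, {0: [("a", 1)], 1: [("b", 2)]})
B = (0, {2}, {0: [("a", 1), ("b", 2)], 1: [("b", 2)]})
assert automata_edit_distance(A, B) == 0
```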

12 This Lecture Speech recognition evaluation N-best string algorithms Lattice generation Discriminative training

13 N-Best Sequences Motivation: rescoring. first pass using a simple acoustic model and grammar to produce a lattice or N-best list. re-evaluate the alternatives with a more sophisticated model or with new information. General problem: speech recognition, handwriting recognition, information extraction, image processing.

14 N-Shortest-Paths Problem Problem: given a weighted directed graph G, a source state s and a set of destination or final states F, find the N shortest paths in G from s to F. Algorithms: (Dreyfus, 1969): O(|E| + N log(|E|/|Q|)). (Mohri, 2002): shortest-distance algorithm in the N-tropical semiring. (Eppstein, 1998): O(|E| + |Q| log |Q| + N).

15 N-Shortest Strings Problem: given a weighted directed graph G, a source state s and a set of destination or final states F, find the N shortest distinct strings in G from s to F. Example: NAB Eval 95. [Table: number of non-unique paths vs. unique strings per pruning threshold; values lost in extraction.]

16 N-Shortest Paths Program: fstprune --delta=1.5 lat.fst | fstshortestpath --nshortest=10
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required to run
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required around
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required to run
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required to run
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required around
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required around
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required to run
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required around

17 N-Shortest Strings Program: fstprune --delta=1.5 lat.fst | fstshortestpath --nshortest=10 --unique
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required to run
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required around

18 Algorithms Based on N-Best Paths Idea: use a K-best paths algorithm and keep the N distinct strings among the paths (Chow and Schwartz, 1990; Soong and Huang, 1991). Problems: K is not known in advance. in practice, K may sometimes be quite large, exponentially larger than N in the worst case, which affects both time and space complexity.
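A minimal Python sketch of this naive approach (our names and encoding): enumerate paths best-first and deduplicate, observing that the number K of paths consumed is only known after the fact.

```python
def n_best_strings_naive(path_iter, N):
    """Naive approach from the slide: scan paths best-first and keep the
    first N distinct label sequences. Returns the strings found and K,
    the number of paths consumed, which is not known in advance and can
    be much larger than N.
    """
    seen, out, K = set(), [], 0
    for cost, labels in path_iter:       # assumed sorted by increasing cost
        K += 1
        key = tuple(labels)
        if key not in seen:
            seen.add(key)
            out.append((cost, labels))
            if len(out) == N:
                break
    return out, K

paths = [(1.0, ["a"]), (1.2, ["a"]), (1.5, ["b"]), (1.6, ["a"]), (2.0, ["c"])]
print(n_best_strings_naive(iter(paths), 2))   # ([(1.0, ['a']), (1.5, ['b'])], 3)
```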

19 N-Best String Algorithm Idea: apply the N-best paths algorithm to an on-the-fly determinization of the input automaton. But N-best paths algorithms require the shortest distances to F. Weighted determinization (partial) (Mohri and Riley, 2002): eliminates redundancy, no determinizability issue. on-demand computation: only the part needed is computed. on-the-fly computation of the needed shortest-distances to final states.

20 Shortest-Distances to Final States Definition: let d(q, F) denote the shortest distance from q to the set of final states F in the input (nondeterministic) automaton A, and let d'(q', F) be defined in the same way in the resulting (deterministic) automaton B. Theorem: for any state q' = {(q_1, w_1), ..., (q_n, w_n)} of B, the following holds: d'(q', F) = min_{i=1,...,n} (w_i + d(q_i, F)).
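The two ingredients, one on-demand determinization step and the theorem's shortest-distance formula, can be sketched as follows; the dict-based arc encoding is our assumption (a sketch of the construction in (Mohri and Riley, 2002), not OpenFst's implementation).

```python
def det_successor(subset, a, arcs):
    """One on-demand step of weighted determinization (tropical semiring).

    subset: tuple of (state, residual_weight) pairs encoding one state of B;
    arcs: q -> {label: [(weight, next_state)]} in the input automaton A.
    Returns (arc_weight, successor_subset), or None if `a` leaves nowhere.
    """
    reach = {}
    for q, v in subset:
        for w, q2 in arcs.get(q, {}).get(a, []):
            reach[q2] = min(reach.get(q2, float("inf")), v + w)
    if not reach:
        return None
    W = min(reach.values())          # weight assigned to the new arc of B
    return W, tuple(sorted((q2, c - W) for q2, c in reach.items()))

def det_distance_to_finals(subset, d):
    """The theorem: d'(q', F) = min_i (w_i + d(q_i, F))."""
    return min(w + d[q] for q, w in subset)

arcs = {0: {"a": [(1.0, 1), (3.0, 2)]}}
W, succ = det_successor(((0, 0.0),), "a", arcs)
# W = 1.0, succ = ((1, 0.0), (2, 2.0))
print(det_distance_to_finals(succ, {1: 4.0, 2: 1.0}))   # min(0+4, 2+1) = 3.0
```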

21 Simple N-Shortest-Paths Algorithm
for p ← 1 to |Q| do r[p] ← 0
π[(i, 0)] ← NIL
S ← {(i, 0)}
while S ≠ ∅
    do (p, c) ← head(S); Dequeue(S)
       r[p] ← r[p] + 1
       if r[p] = N and p ∈ F then exit
       if r[p] ≤ N
          then for each e ∈ E[p]
               do c′ ← c + w[e]
                  π[(n[e], c′)] ← (p, c)
                  Enqueue(S, (n[e], c′))
Queue ordering: (p, c) < (p′, c′) iff c + d(p, F) < c′ + d(p′, F).
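A runnable Python rendering of this pseudocode, assuming the shortest distances d(q, F) have already been computed (e.g., by Dijkstra on the reversed graph); the data-structure choices and names are ours.

```python
import heapq
from itertools import count

def n_shortest_paths(arcs, start, finals, d, N):
    """N shortest paths from `start` to the final states, as label sequences.

    arcs: state -> [(label, weight, next_state)]; d[q]: shortest distance
    from q to the finals (assumed precomputed). The queue is ordered by
    c + d(q, F), as on the slide, and each state is extracted at most N times.
    """
    r = {}                       # r[q]: number of times q has been extracted
    tie = count()                # tie-breaker so the heap never compares paths
    heap = [(d[start], 0.0, next(tie), start, [])]
    results = []
    while heap and len(results) < N:
        _, c, _, q, path = heapq.heappop(heap)
        r[q] = r.get(q, 0) + 1
        if q in finals:
            results.append((c, path))
        if r[q] <= N:
            for label, w, nxt in arcs.get(q, []):
                c2 = c + w
                heapq.heappush(heap, (c2 + d[nxt], c2, next(tie), nxt, path + [label]))
    return results

arcs = {0: [("a", 1.0, 1), ("b", 2.0, 1)], 1: [("c", 0.5, 2)]}
d = {0: 1.5, 1: 0.5, 2: 0.0}
print(n_shortest_paths(arcs, 0, {2}, d, 2))
# [(1.5, ['a', 'c']), (2.5, ['b', 'c'])]
```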

22 N-Best String Alg. - Experiments [Table: oracle word accuracy vs. ×real-time on the NAB 40K bigram task for the 1-best, several N-best settings, and the full lattice; values lost in extraction.] The additional time to pay for N-best is very small, even for large N.

23 N-Best String Alg. - Properties Simplicity and efficiency: easy to implement: combines two general algorithms. works with any N-best paths algorithm. empirically efficient. Generality: arbitrary input automaton (not necessarily acyclic). incorporated in the OpenFst library (fstshortestpath).

24 This Lecture Speech recognition evaluation N-best string algorithms Lattice generation Discriminative training

25 Speech Recognition Lattices Definition: weighted automaton representing a speech recognizer's alternative hypotheses. [Figure: word lattice with weighted arcs such as hi/…, this/153.0, I'd/143.3, is/70.97, uh/83.34, Mike/192.5, my/19.2, hard/…, card/20.1, number/…, like/….]

26 Lattice Generation (Odell, 1995; Ljolje et al., 1999) Procedure: for each transition e reaching state n[e] at time t during Viterbi decoding, keep in the lattice the transition ((p[e], t'), i[e], o[e], (n[e], t)) with the best start time t'. [Figure: trellis fragment with competing predecessors (r, t') and (r, t'') for the state (q, t).]

27 Lattice Generation Computation time: little extra computation over one-best. Optimization: projection on the output (words or phonemes). epsilon-removal. pruning: keeps the transitions and states lying on paths whose total weight is within a threshold of the best path. garbage-collection (uses the same pruning).
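The pruning step can be made concrete with a short sketch: an arc survives only if the best complete path through it is within threshold of the overall best path. The acyclic-lattice encoding and all names below are our assumptions, not OpenFst's fstprune.

```python
def prune_lattice(arcs, start, finals, order, threshold):
    """Keep only arcs lying on paths within `threshold` of the best path.

    arcs: state -> [(label, weight, next_state)] over an acyclic lattice;
    `order` is a topological ordering of its states. A sketch of the beam
    pruning described on the slide.
    """
    INF = float("inf")
    alpha = {q: INF for q in order}        # best cost start -> q
    alpha[start] = 0.0
    for q in order:                        # forward pass
        for _, w, nxt in arcs.get(q, []):
            alpha[nxt] = min(alpha[nxt], alpha[q] + w)
    beta = {q: 0.0 if q in finals else INF for q in order}
    for q in reversed(order):              # backward pass: best cost q -> finals
        for _, w, nxt in arcs.get(q, []):
            beta[q] = min(beta[q], w + beta[nxt])
    best = beta[start]                     # cost of the overall best path
    return {q: [(lab, w, nxt) for lab, w, nxt in arcs.get(q, [])
                if alpha[q] + w + beta[nxt] <= best + threshold]
            for q in order}
```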

28 Notes Heuristics: not all paths within the beam are kept in the lattice. Lattice quality: oracle accuracy, that is, the best accuracy achieved by any path in the lattice. Optimizations: weighted determinization and minimization. in general, dramatic reduction of redundancy and size. bad for some lattices, typically uncertain cases.

29 Speech Recognition Lattice [Figure: raw lattice for the utterance "which flights leave detroit and arrive at saint petersburg around nine a m", with hundreds of redundant weighted arcs (flights/50.2, leave/51.4, detroit/105, arrive/21.1, around/109, nine/5.61, ...).]

30 Lattice after Determinization (Mohri, 1997) [Figure: the same lattice after weighted determinization: a deterministic automaton with far fewer states and arcs, e.g. which/… → flights/… → leave/… → detroit/105 → ….]

31 Lattice after Minimization (Mohri, 1997) [Figure: the determinized lattice after minimization: the smallest equivalent deterministic weighted automaton.]

32 This Lecture Speech recognition evaluation N-best string algorithms Lattice generation Discriminative training

33 Discriminative Techniques Maximum likelihood: parameters adjusted to increase the joint likelihood of the acoustics and the CD phone or word sequences, irrespective of the probability of other word hypotheses. Discriminative techniques: take into account competing word hypotheses and attempt to reduce the probability of incorrect ones. Main problems: computationally expensive, generalization.

34 Objective Functions Maximum likelihood (joint): F_ML = argmax_θ Σ_{i=1}^m log p_θ(o_i, w_i). Conditional maximum likelihood (CML) (Woodland and Povey, 2002): F_CML = argmax_θ Σ_{i=1}^m log p_θ(w_i | o_i) = argmax_θ Σ_{i=1}^m log [p_θ(o_i, w_i) / p_θ(o_i)]. Maximum mutual information (MMI/MMIE): F_MMI = argmax_θ Σ_{i=1}^m log [p_θ(o_i | w_i) p(w_i) / Σ_ŵ p_θ(o_i | ŵ) p(ŵ)]. Equivalent to CML when the language model p(w) is independent of θ.
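As a worked illustration of the MMI criterion (our toy numbers, not from the lecture): for each utterance, the objective is the log of the reference hypothesis's share of the total score mass over all competitors.

```python
import math

def mmi_objective(utterances):
    """Sum over utterances of log [p(o|w_ref) p(w_ref) / sum_w p(o|w) p(w)].

    Each utterance is (ref, hyps) where hyps maps every competing word
    sequence w (including ref) to its joint score p(o|w) * p(w).
    A toy illustration of the MMI/CML criterion, not a trainer.
    """
    total = 0.0
    for ref, hyps in utterances:
        num = hyps[ref]                 # numerator: reference hypothesis
        den = sum(hyps.values())        # denominator: all competitors
        total += math.log(num / den)
    return total

# One utterance with three competing hypotheses; increasing the reference's
# share of the mass (not just its absolute likelihood) raises the objective.
utts = [("to run", {"to run": 0.5, "around": 0.3, "to ron": 0.2})]
print(mmi_objective(utts))   # log(0.5 / 1.0) = -0.693...
```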

35 Discriminative Training Algorithm Produce the numerator lattice by decoding with only the reference transcript allowed (similar to alignment for MLE training). Produce the denominator lattice by doing an open-ended decode with a simple language model (a unigram model trained on the transcripts of the training data). Gather Gaussian component occupancy counts and first- and second-order statistics of the data points. Perform the model updates using modified Baum-Welch update equations.
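For the update step, a minimal sketch of the extended Baum-Welch mean re-estimation commonly used with MMI (Woodland and Povey, 2002); the smoothing constant D and all variable names are our assumptions, and the variance and weight updates are omitted.

```python
import numpy as np

def ebw_mean_update(num_occ, num_sum, den_occ, den_sum, old_mean, D):
    """Extended Baum-Welch mean update for one Gaussian component.

    num_occ/den_occ: occupancy counts (gamma) from the numerator/denominator
    lattices; num_sum/den_sum: the corresponding first-order statistics
    sum_t gamma_t * o_t. D > 0 smooths toward the old mean and keeps the
    update's denominator positive. A sketch, not a full MMI trainer.
    """
    return (num_sum - den_sum + D * old_mean) / (num_occ - den_occ + D)

new_mean = ebw_mean_update(
    num_occ=10.0, num_sum=np.array([5.0, 2.0]),
    den_occ=4.0,  den_sum=np.array([1.0, 1.0]),
    old_mean=np.array([0.0, 0.0]), D=8.0)
print(new_mean)   # [4/14, 1/14] = [0.2857..., 0.0714...]
```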

36 Minimum Error Objective Functions Minimum Word Error (MWE) (Povey and Woodland, 2002): F_MWE = argmax_θ Σ_{i=1}^m [Σ_ŵ p_θ(o_i | ŵ) p(ŵ) RawAccuracy(w_i, ŵ)] / [Σ_ŵ p_θ(o_i | ŵ) p(ŵ)]. Minimum Phone Error (MPE): uses phone or context-dependent phone accuracy instead of word accuracy. In practice, MPE works better than MWE.

37 References
Y. Chow and R. Schwartz. The N-Best Algorithm: An Efficient Procedure for Finding Top N Sentence Hypotheses. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '90), Albuquerque, New Mexico, April 1990.
S. E. Dreyfus. An Appraisal of Some Shortest Path Algorithms. Operations Research, 17:395-412, 1969.
David Eppstein. Finding the k Shortest Paths. SIAM Journal on Computing, 28(2):652-673, 1998.
Andrej Ljolje, Fernando Pereira, and Michael Riley. Efficient General Lattice Generation and Rescoring. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech '99), Budapest, Hungary, 1999.
Mehryar Mohri. Finite-State Transducers in Language and Speech Processing. Computational Linguistics, 23(2), 1997.
Mehryar Mohri. Statistical Natural Language Processing. In M. Lothaire, editor, Applied Combinatorics on Words. Cambridge University Press, 2005.

38 References
Mehryar Mohri. Edit-Distance of Weighted Automata: General Definitions and Algorithms. International Journal of Foundations of Computer Science, 14(6):957-982, 2003.
Mehryar Mohri and Michael Riley. An Efficient Algorithm for the N-Best-Strings Problem. In Proceedings of the International Conference on Spoken Language Processing (ICSLP '02), Denver, Colorado, September 2002.
Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. The Design Principles of a Weighted Finite-State Transducer Library. Theoretical Computer Science, 231:17-32, January 2000.
Julian Odell. The Use of Context in Large Vocabulary Speech Recognition. Ph.D. thesis, Cambridge University, UK, 1995.
Frank Soong and Eng-Fong Huang. A Tree-Trellis Based Fast Search for Finding the N Best Sentence Hypotheses in Continuous Speech Recognition. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '91), Toronto, Canada, 1991.

39 References
Phil Woodland and Daniel Povey. Large Scale Discriminative Training for Speech Recognition. Computer Speech and Language, 16:25-47, 2002.
Daniel Povey and Phil Woodland. Minimum Phone Error and I-smoothing for Improved Discriminative Training. In Proceedings of ICASSP, 2002.
