Sequence Prediction with Neural Segmental Models. Hao Tang
1 Sequence Prediction with Neural Segmental Models Hao Tang
2 About Me
- Pronunciation modeling [TKL 2012]
- Segmental models [TGL 2014] [TWGL 2015] [TWGL 2016] [TWGL 2016]
- American Sign Language fingerspelling recognition [KWTL 2015]
- Under-resourced speech recognition [LJTMSKHK 2016] [HJMMLDELMTLCHKSL 2016]
- State-tying with CCA [WTL 2016]
- Dialog state tracking [TWMH 2014]
Keywords: finite-state transducers, discriminative training, linear models, structured prediction, neural networks
3 Segments Netflix announces House of Cards return with dark inauguration day promo.
4 Frames and Segments Example: Netflix announces House of Cards return with dark inauguration day promo. A frame is a fixed-length unit; a segment is a variable-length unit.
5 Frame-Based Models Frame-based models assign one label per frame: labels y_1 ... y_7 = B O B I I O O over inputs x_1 ... x_7 (Netflix announces House of Cards return with).
6-7 Frame Labels Frame labels come from BIO tags in named-entity recognition (B O B I I O O over Netflix announces House of Cards return with) or from sub-phonetic states in phonetic recognition (ay-1 ay-2 ay-3 ...). With frame labels alone we cannot express segment-level features such as duration, formants, or indicators like 1[the segment has balanced parentheses].
8 Reduction to Graph Search Many sequence prediction problems (named-entity recognition, speech recognition, parsing, translation) reduce to graph search. Inference: 1. Take input x and build a search graph G. 2. Find the maximum-scoring path in G. A minimal sketch of step 2 follows.
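As a concrete illustration of step 2 (not from the talk), here is a minimal dynamic program over a topologically ordered DAG; the toy graph, labels, and weights are hypothetical placeholders.

```python
def max_scoring_path(vertices, edges, source, sink):
    """vertices: list in topological order; edges: dict tail -> [(head, label, weight)]."""
    best = {v: float("-inf") for v in vertices}
    back = {}                       # head -> (tail, label) of its best incoming edge
    best[source] = 0.0
    for tail in vertices:           # relax outgoing edges in topological order
        if best[tail] == float("-inf"):
            continue
        for head, label, weight in edges.get(tail, []):
            if best[tail] + weight > best[head]:
                best[head] = best[tail] + weight
                back[head] = (tail, label)
    path, v = [], sink              # trace back the label sequence
    while v != source:
        tail, label = back[v]
        path.append(label)
        v = tail
    return best[sink], path[::-1]

edges = {"s": [("a", "B", 1.0), ("b", "O", 0.2)],
         "a": [("t", "I", 0.5)],
         "b": [("t", "O", 0.1)]}
print(max_scoring_path(["s", "a", "b", "t"], edges, "s", "t"))  # (1.5, ['B', 'I'])
```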
9 Frame-Based Models As a search graph: every frame position y_1 ... y_4 (over x_1 ... x_4: Netflix announces House of) offers each label B, I, O as a candidate.
10-15 Segmental Models The same input, Netflix announces House of Cards return with, under alternative segmentations, e.g., (I, I, O, O, O), (I, I, O, O), (I, O, I, O), and (I, O): segmental models score and search over all such variable-length segmentations.
16-17 Segmental Models The same input, Netflix announces House of Cards return with, shown under two segmentations. The difference between frame-based models and segmental models is the search graph: features can be extracted from variable-length units.
18 Problem Definition Example input: Netflix announces House of Cards return with. Search space: G = (V, E). Weight: w_θ(x, e), where x is the input and e is an edge. Inference: find the maximum-scoring path. Learning: find θ that minimizes a loss function.
19 Past Research on Segmental Models
- Network-based digit recognition [Bush and Kopec, 1985]
- SUMMIT [Zue et al., 1989] [Glass, 2003]
- Stochastic segmental models [Ostendorf and Roukos, 1989]
- Hidden semi-Markov models [Sarawagi and Cohen, 2004]
- Segmental conditional random fields (SCRF) [Zweig and Nguyen, 2009] [Zweig et al., 2011] [Zweig, 2012]
- Boundary-factored segmental CRF [He and Fosler-Lussier, 2012]
- Deep segmental neural networks [Abdel-Hamid et al., 2013]
- Discriminative segmental cascades [TWGL 2015]
- Segmental recurrent neural networks [Lu et al., 2016]
20 Problem: Efficiency Runtime for inference is O(|E| c), where c is the time to compute the weight of an edge. Suppose |x| = T and the label set has size L. Frame-based models: |E| = O(TL). Segmental models: |E| = O(TLD), where D is the maximum duration. (Table of L and D by task: named-entity recognition has L = 30, D = 4; L and D are much larger for action recognition, phoneme recognition, and word recognition.) A sketch of the edge enumeration follows.
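The edge count is easy to see by enumerating the segmental search graph directly; a sketch with illustrative sizes:

```python
def segmental_edges(T, labels, max_duration):
    """Yield one edge (start, end, label) per candidate segment."""
    for start in range(T):
        for dur in range(1, max_duration + 1):
            if start + dur > T:
                break
            for label in labels:
                yield (start, start + dur, label)

T, L, D = 300, 48, 30  # illustrative phonetic-recognition sizes
edges = list(segmental_edges(T, range(L), D))
print(len(edges))  # 411,120: close to T * L * D = 432,000, minus right-edge truncation
```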
21 Past Research on Efficiency of Segmental Models
- Bottom-up approach [Zue and Glass, 1988]
- Other ASR systems [Chang and Glass, 1997] [Zweig et al., 2010]
- Augmentation [Glass et al., 1996] [Chang and Glass, 1997]
- Separate pruners [Okanohara et al., 2006]
- Different graph topologies [Andrew, 2006] [Vinh et al., 2011] [He and Fosler-Lussier, 2012]
22 Contribution Desideratum: no HMMs!
Discriminative segmental cascades [TWGL 2015] [TWGL 2016]
- Improved performance with segmental neural networks and higher-order features while maintaining efficiency
- Structured composition for computing higher-order features efficiently
- Speedup in inference and learning without accuracy loss
End-to-end training for segmental models [TWGL 2016]
- Two-stage training can serve as a good initialization for end-to-end training.
- Hinge loss converges the fastest and log loss achieves the best accuracy.
- Marginal log loss achieves strong results without relying on manual alignments.
23-27 Discriminative Segmental Cascades A cascade trades search-space size for feature complexity. Start from the first-pass search space H_1 = Y_1, scored with segmental features; prune it to a smaller search space H_2; then σ-compose H_2 with a bigram LM L_2 to obtain the second-pass search space H_2 ∘_σ L_2 = Y_2, on which higher-order features become affordable.
28 Max-Marginal Pruning [Sixtus and Ortmanns, 1999] [Weiss et al., 2012] The max-marginal of an edge e ∈ E is the score of the best path passing through it: γ(e) = max_{y : e ∈ y} w(x, y). For α ∈ (0, 1), the threshold is t = α max_{e ∈ E} γ(e) + (1 − α) (1/|E|) Σ_{e ∈ E} γ(e), and we prune e if γ(e) < t. At least one path is retained, and all paths with scores higher than t are retained. A sketch follows.
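A minimal sketch of max-marginal pruning on a topologically sorted DAG, assuming `edges` is a list of (tail, head, weight) triples ordered by the topological index of their tail vertices; all names are illustrative.

```python
import math

def max_marginal_prune(vertices, edges, source, sink, alpha):
    fwd = {v: -math.inf for v in vertices}  # best score from source to v
    bwd = {v: -math.inf for v in vertices}  # best score from v to sink
    fwd[source], bwd[sink] = 0.0, 0.0
    for tail, head, w in edges:             # forward max scores
        fwd[head] = max(fwd[head], fwd[tail] + w)
    for tail, head, w in reversed(edges):   # backward max scores
        bwd[tail] = max(bwd[tail], w + bwd[head])
    # Max-marginal of an edge: score of the best complete path through it.
    gamma = [fwd[t] + w + bwd[h] for t, h, w in edges]
    thresh = alpha * max(gamma) + (1 - alpha) * sum(gamma) / len(gamma)
    # Since thresh <= max(gamma), every edge on the best path survives.
    return [e for e, g in zip(edges, gamma) if g >= thresh]
```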
29 Structured Composition (σ-composition) The structured composition of A and B is an FST G with V_G = V_A × V_B and E_G = {(e_1, e_2) ∈ E_A × E_B : o_A(e_1) = i_B(e_2)}, so the weight function has access to a pair of labels. After σ-composition with an n-gram language model over a vocabulary of size L, the search space becomes L^(n−1) times larger.
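A toy sketch of σ-composition with a bigram label model, assuming the LM is given as an acceptor whose states are previous labels; the pairing condition is exactly o_A(e_1) = i_B(e_2) from the definition above. All names are illustrative.

```python
import itertools

def sigma_compose(lattice_edges, lm_edges):
    """lattice_edges: (tail, head, label, weight) tuples;
    lm_edges: (prev_label, label, lm_weight) tuples of a bigram acceptor."""
    composed = []
    for (t, h, lab, w), (prev, cur, lw) in itertools.product(lattice_edges, lm_edges):
        if lab == cur:  # o_A(e1) == i_B(e2)
            # Composed vertices pair a lattice vertex with an LM state, so a
            # weight function on the composed edge sees the label pair.
            composed.append(((t, prev), (h, cur), lab, w + lw))
    return composed
```

With a bigram LM the state space multiplies by the vocabulary size L, matching the L^(n−1) factor above.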
30 Experimental Setup (example phone sequence: iy v eh n ih f ...)
Task: phonetic recognition
Dataset: TIMIT
Size: 6 hours
Ground truth: manual alignments
Loss function: hinge loss
Maximum duration: 30
Label set size: 48
Average input length
31 Beam Pruning vs Max-Marginal Pruning (plots of oracle error (%) against density (edges per gold edge) and against real-time factor, for beam pruning and max-marginal pruning) Beam pruning is faster. Max-marginal pruning produces more compact lattices.
32-35 Beam Search vs Exact Search (plots of dev PER (%) and hit rate (%) against beam width) When the model is well trained, beam search can be as good as exact search. Dual decomposition is not an option, since we only allow a single pass over the edges.
36-39 Learning with Beam Search vs Learning with Cascades (plots of dev PER (%) against epoch for exact search, beam = 10/20/30, and cascades, in the unigram and bigram cases) Learning with beam search is fine in the unigram case but fails in the bigram case. Learning with cascades is both effective and efficient.
40 Phonetic Recognition on TIMIT (table of dev and test PER) Rows: HMM-DNN; 1st-pass segmental model; bigram LM; 2nd-order boundary features; 1st-order segment NN; 1st-order bi-phone NN; bottleneck.
41 American Sign Language Fingerspelling Recognition [Kim et al., 2016] We consider signer-dependent, signer-independent, and signer-adapted recognition; all of the recognizers use deep neural network (DNN) classifiers of letters or handshape features. (Figure: images and ground-truth segmentations of the fingerspelled word TULIP produced by two signers, sub-sampled at the same rate to show the true relative speeds; asterisks mark manually annotated peak frames for each letter; <s> and </s> denote non-signing intervals before/after signing.)
Letter error rates (LER): Tandem HMM 14.6%; rescoring SCRF 11.5%; cascade 1st pass 8.8%; cascade 2nd pass 7.6%.
42 Improving Efficiency Prune the first-pass search space H_1 = Y_1 to obtain the second-pass search space H_2 = Y_2.
43 Improving Efficiency (plots comparing baseline and proposed systems: dev PER (%) against real-time factor for the 1st and 2nd passes, and training hours for baseline 1st pass, proposed 1st pass, and proposed 2nd pass) The proposed approach speeds up inference and training without losing accuracy.
44 Contribution (recap of slide 22)
45-49 Two-Stage vs End-to-End Training (diagram, built up over several slides: the input x is fed through a network f_Λ that outputs a vector of log probabilities per frame, which the segmental model consumes as features; the final build replaces the log prob label with ???, asking what the intermediate representation should be)
50 Two-Stage vs End-to-End Training
Two-stage training: 1. Find Λ by minimizing cross entropy at each frame. 2. Fix Λ; find θ by minimizing the hinge loss
ℓ_hinge(θ, Λ; x, y, z) = max_{(y′, z′) ∈ P} [ cost((y′, z′), (y, z)) − θᵀφ_Λ(x, y, z) + θᵀφ_Λ(x, y′, z′) ].
End-to-end training from scratch: 1. Randomly initialize Λ. 2. Find θ and Λ jointly by minimizing the hinge loss.
End-to-end fine-tuning: 1. Run two-stage training. 2. Continue with end-to-end training.
A sketch of the hinge loss follows.
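To ground the hinge loss, here is a conceptual sketch (not the paper's code) of computing it by cost-augmented decoding: the max over (y′, z′) is itself a search problem, solved by decoding with the cost added to the scores. `decode`, `score`, and `cost` are assumed callables standing in for the search graph, θᵀφ_Λ, and the cost term.

```python
def hinge_loss(x, y, z, score, cost, decode):
    """Structured hinge loss via cost-augmented decoding (sketch).

    decode(x, f) is assumed to return the (y', z') in the search space P
    maximizing f(y', z'); score(x, y, z) stands in for the dot product
    of theta with phi_Lambda(x, y, z).
    """
    y_hat, z_hat = decode(x, lambda yp, zp: score(x, yp, zp) + cost((yp, zp), (y, z)))
    # When the gold (y, z) is in P, this value is nonnegative, since the
    # max is at least its value at (y', z') = (y, z), where the cost is zero.
    return (cost((y_hat, z_hat), (y, z))
            - score(x, y, z)
            + score(x, y_hat, z_hat))
```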
51-52 Two-Stage vs End-to-End Training for Hinge Loss (plots of test PER and training loss for 2-stage, e2e, and fine-tuning) End-to-end training can get stuck at a poor local optimum. Two-stage training provides a better starting point.
53 Log Loss
ℓ_log(θ, Λ; x, y, z) = −log p(y, z | x), where
p(y, z | x) = (1/Z) exp(θᵀφ_Λ(x, y, z)) and Z = Σ_{(y′, z′) ∈ P} exp(θᵀφ_Λ(x, y′, z′)).
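A sketch (not the paper's implementation) of computing the log loss: the partition function Z sums over all paths in the search graph, via a forward pass in the log semiring. `edges` is assumed to be a (tail, head, weight) list sorted in topological order of tails.

```python
import math

def forward_log_z(vertices, edges, source, sink):
    """log Z: log-sum-exp of all path scores from source to sink."""
    alpha = {v: -math.inf for v in vertices}
    alpha[source] = 0.0
    for tail, head, w in edges:
        a = alpha[tail] + w                # extend every path ending at tail
        m = max(alpha[head], a)
        if m > -math.inf:                  # numerically stable log-add
            alpha[head] = m + math.log(math.exp(alpha[head] - m) + math.exp(a - m))
    return alpha[sink]

def log_loss(vertices, edges, source, sink, gold_score):
    """-log p(y, z | x) = log Z - gold_score, where gold_score is the
    score of the gold path (theta dot phi_Lambda(x, y, z))."""
    return forward_log_z(vertices, edges, source, sink) - gold_score
```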
54 Two-Stage vs End-to-End Training for Log Loss (plots of test PER and training loss for 2-stage, e2e, and fine-tuning) End-to-end training with the log loss seems easier to optimize. Two-stage training provides a better starting point.
55 Frame-wise Cross Entropy (plot of train and dev cross entropy, with markers at best dev, best dev + dropout, and best dev + dropout + fine-tuning) End-to-end fine-tuning sticks to the log-probability representation and improves it.
56 Other Loss Functions
Marginal log loss: ℓ_mll(θ, Λ; x, y) = −log p(y | x) = −log Σ_{z ∈ Z} p(y, z | x).
Latent hinge loss: ℓ_latent-hinge(θ, Λ; x, y) = max_{(y′, z′) ∈ P} [ cost((y′, z′), (y, z̃)) − θᵀφ_Λ(x, y, z̃) + θᵀφ_Λ(x, y′, z′) ], where z̃ = argmax_{z′ ∈ Z} θᵀφ_Λ(x, y, z′).
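A sketch of the marginal log loss, which needs no manual alignments because it marginalizes over every segmentation z consistent with y. It reuses `forward_log_z` from the log-loss sketch above; `intersect` is an assumed helper returning the subgraph (same source and sink) whose paths carry exactly the label sequence y.

```python
def marginal_log_loss(vertices, edges, source, sink, y, intersect):
    sub_vertices, sub_edges = intersect(vertices, edges, y)
    log_numer = forward_log_z(sub_vertices, sub_edges, source, sink)  # log sum over z of score(y, z)
    log_z = forward_log_z(vertices, edges, source, sink)              # log sum over all paths
    return log_z - log_numer                                          # = -log p(y | x)
```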
57-58 End-to-End Training without Manual Alignments (plots of test PER for the latent hinge loss and the marginal log loss: MLL + alignments, 2-stage + fine-tuning, and e2e from scratch) End-to-end training with the marginal log loss seems easier. Two-stage training provides a better starting point.
59 Loss Functions (left: training hours for the hinge, log, latent hinge, and marginal log losses, with an LSTM baseline; right: a table comparing the same losses on whether alignments are required, convexity in θ, smoothness, and sparsity of updates)
60 Where are we? (table of TIMIT PER with speaker-independent and speaker-adapted columns)
- HMM-DNN
- HMM-CNN [Tóth, 2015]: 16.5
- Segment-based models [Glass, 2003]: 24.4
- SCRF [Zweig, 2012]: 33.1
- SCRF with shallow NN [He and Fosler-Lussier, 2012]: 26.5
- SCRF with DNN [He, 2015]: 19.1
- Deep segmental NN [Abdel-Hamid et al., 2013]: 21.9
- cascade 1st pass [TWGL 2015]: 21.7
- cascade 2nd pass [TWGL 2015]: 19.9
- End-to-end + two-stage training [TWGL 2016]: 19.7
- Segmental RNN [Lu et al., 2016]
61 Contribution (recap of slide 22)
62 Ongoing and Future Work
- Unsupervised learning: lexical unit discovery; contrastive estimation [Smith and Eisner, 2005]; autoencoders [Ammar et al., 2014] [Tran et al., 2016]; generative adversarial networks [Goodfellow et al., 2016]
- Structure + networks: deep structured models [Chen et al., 2015]; attention [Chorowski et al., 2015]; structured attention networks [Kim et al., 2016]
- Large-scale structured prediction: whole-word speech recognizers; TIDIGITS (4.45% SER); beam search + early update rule [Collins and Roark, 2004]
- First-order methods for inference: Dijkstra's algorithm is steepest descent in the dual [Murota and Shioura, 2010]; structured prediction energy networks [Belanger and McCallum, 2015]
63 Acknowledgements Weiran Wang, Taehwan Kim, Kevin Gimpel, Karen Livescu. This research was supported by a Google faculty research award and NSF grant IIS. The GPUs used for this research were donated by NVIDIA.
64 th ae ng k s