SPEECH TRANSLATION: THEORY AND PRACTICES


1 SPEECH TRANSLATION: THEORY AND PRACTICES Bowen Zhou and Xiaodong He, May 27th, ICASSP 2013

2 Universal Translator: dream (will) come true or yet another over-promise? Spoken communication without a language barrier is mankind's long-standing dream: translate human speech in one language into another (text or speech). Automatic Speech Recognition: ASR; (Statistical) Machine Translation: (S)MT. We have been promised this by sci-fi (Star Trek) for 5 decades. Serious research efforts started in the early 1990s, when statistical ASR (e.g., HMM-based) started showing dominance and SMT was emerging. 2

3 ST Research Evolves Over 2 Decades: from limited domains toward broad domains/large vocabulary, and from formal/concise desktop/laptop settings toward conversational/interactive mobile/cloud settings.
C-STAR: one/two-way; limited-domain spontaneous speech
VERBMOBIL: one-way; limited-domain conversational speech
DARPA BABYLON: two-way; limited-domain spontaneous speech; laptop/handheld
DARPA TRANSTAC: two-way; small-domain spontaneous dialog; mobile/smartphone
IWSLT: one-way; limited-domain dialog and unlimited free-style talk
TC-STAR: one-way; broadcast speech, political speech; server-based
DARPA GALE: one-way; broadcast speech, conversational speech; server-based
DARPA BOLT: one/two-way; conversational speech; server-based
3

4 Are we closer today? Speech translation demos: completely smartphone-hosted speech-to-speech translation (IBM Research); Rashid's live demo (MSR) in the 21st CCC keynote. Statistical approaches to both ASR and SMT are among the keys to this success: large data, discriminative training, better models (e.g., DNN for ASR), etc. 4

5 Our Objectives 1. Review & analyze state-of-the-art SMT theory from ST's perspective 2. Unify ASR and SMT to catalyze joint ASR/SMT for improved ST 3. ST's practical issues & future research topics 5

6 In this talk 1. Overview: ASR, SMT, ST, Metric 2. Learning Problems in SMT 3. Translation Structures for ST [20 min coffee break] 4. Decoding: A Unified Perspective for ASR/SMT 5. Coupling ASR/SMT: Decoding & Modeling 6. Practices 7. Future Directions. Technical papers accompanying this tutorial: Bowen Zhou, Statistical Machine Translation for Speech: A Perspective on Structure, Learning and Decoding, in Proceedings of the IEEE, May 2013; Xiaodong He and Li Deng, Speech-Centric Information Processing: An Optimization-Oriented Approach, in Proceedings of the IEEE, May 2013. Papers are available on the authors' webpages; charts coming up soon. 6

7 Overview & Backgrounds 7

8 Speech Translation Process Input: a source speech signal sequence x_1^T = x_1, ..., x_T. ASR: recognizes it as a set of source word sequences {f_1^J = f_1, ..., f_J}. SMT: translates into the target-language word sequence e_1^I = e_1, ..., e_I:
\hat{e}_1^I = argmax_{e_1^I} P(e_1^I | x_1^T) = argmax_{e_1^I} \sum_{f_1^J} P(e_1^I | f_1^J) P(x_1^T | f_1^J) P(f_1^J)
8

9 First comparison: ASR vs. SMT Connection: both are sequential pattern recognition, determining a sequence of symbols that is regarded as the optimal equivalent in the target domain. ASR: x_1^T -> f_1^J; SMT: f_1^J -> e_1^I. Hence, many techniques are closely related. Difference: ASR is monotonic but SMT is not, so different modeling/decoding formalisms are required. This is one of the key issues we address in this tutorial. 9

10 ASR in a Nutshell ASR decoding: \hat{f}_1^J = argmax_{f_1^J} P(x_1^T | f_1^J) P(f_1^J). Acoustic models (HMM): P(x_1^T | f_1^J) = \sum_{q_1^T} \prod_t p(q_t | q_{t-1}) p(x_t | q_t), where p(x_t | q_t) is given by Gaussian mixture models or a NN. Language models (N-gram): p(f_1^J) = \prod_{j=1}^{J} p(f_j | f_{j-N+1}, ..., f_{j-1}). 10

11 ASR Structures: A Finite-State Problem Search on a composed finite-state space (graph): \hat{f}_1^J = best_path(O o H o C o L o G). G: a weighted acceptor that assigns language model probabilities. L: transduces context-independent phonetic sequences into words. (H: HMM topology; C: context dependency; O: input observations.) 11

12 A Brief History of SMT MT research traces back to the 1950s. Statistical approaches started from Brown et al. (1993), a pioneering work based on the source-channel model; the same approach had succeeded for ASR at IBM and elsewhere. A good example of the speech and language communities converging to drive the transition from rationalism to empiricism (Church & Mercer, 1993; Knight, 1999). A sequence of unsupervised word alignment models, known today as IBM Models 1-5 (Brown et al., 1993). Much progress has been made since then; these are key topics for today's talk. 12

13 A Bird's-Eye View of SMT 13

14 How do we know Translation X is better than Y? It is a hard problem by itself: there is more than one good translation, and even more bad ones. Assumption: the closer to human translation(s), the better. Metrics were proposed to measure this closeness, e.g., BLEU, which measures n-gram precisions (Papineni et al., 2002): BLEU-4 = BP * exp(\sum_{n=1}^{4} (1/4) log p_n). ST measurement is more complicated. Objective: WER + BLEU (or METEOR, TER, etc.). Subjective: HTER (GALE/BOLT), concept transfer rate (TransTac/BOLT). 14
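To make the BLEU-4 definition above concrete, the following is a minimal sketch of sentence-level BLEU with the brevity penalty; the function names, the uniform 1/4 weights, and the lack of smoothing are illustrative assumptions, not a reference implementation of any particular toolkit.

import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(hypothesis, reference):
    """Minimal sentence-level BLEU-4: BP * exp(sum_n 0.25 * log p_n)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, 5):
        hyp_ngrams = ngram_counts(hyp, n)
        ref_ngrams = ngram_counts(ref, n)
        # Clipped n-gram matches, as in modified n-gram precision.
        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        p_n = matches / total
        if p_n == 0:          # zero precision: BLEU is 0 (no smoothing in this sketch)
            return 0.0
        log_prec_sum += 0.25 * math.log(p_n)
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec_sum)

print(bleu4("the initial experiments are done", "the initial experiments are done"))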

15 Further Reading ASR Backgrounds: L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE. F. Jelinek. Statistical Methods for Speech Recognition. MIT Press. J. M. Baker, L. Deng, J. Glass, S. Khudanpur, C.-H. Lee, N. Morgan and D. O'Shaughnessy. Research developments and directions in speech recognition and understanding. IEEE Signal Processing Mag., vol. 26. SMT Backgrounds: P. Brown, S. Pietra, V. Pietra, and R. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, vol. 19, 1993. K. Knight. A Statistical MT Tutorial Workbook. SMT Metrics: K. Papineni, S. Roukos, T. Ward and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proc. ACL. M. Snover, B. Dorr, R. Schwartz, L. Micciulla and J. Makhoul. A study of translation edit rate with targeted human annotation. In Proc. AMTA. S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proc. Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization. C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki and O. F. Zaidan. Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. In Proc. Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR. History of MT and SMT: J. Hutchins. Machine translation: past, present, future. Chichester, Ellis Horwood, Chap. 2. K. W. Church and R. L. Mercer. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics 19.1 (1993).

16 Learning Problems in SMT 1. Overview: ASR, SMT, ST, Metric 2. Learning Problems in SMT 3. Translation Structures for ST 4. Decoding: A Unified Perspective ASR/SMT 5. Coupling ASR/SMT: Decoding & Modeling 6. Practices 7. Summary and Future Directions 16

17 Learning Translation Models Word alignment From words to phrases Log-linear model framework to integrate component models (a.k.a. features) Log-linear model training to optimize translation metrics (e.g., MERT) 17

18 Word Alignment Given a set of parallel sentence pairs (e.g., ENU-CHS), find the word-to-word alignment between each sentence pair. English: Finish the task of recruiting students Chinese: 完成招生工作 word translation models 18

19 HMM for Word Alignment Each word in the Chinese sentence is modeled as an HMM state. The English words, the observations, are generated by the HMM one by one. finish the task of recruiting students 完成招生工作 NULL. The generative story for word alignment: at each step, the current state (a Chinese word) emits one English word, then jumps to the next state. Similar to ASR, an HMM can be used to model the word-level translation process. Unlike ASR, note the non-monotonic jumping between states, and the NULL state. (Vogel et al., 1996) 19

20 Formal Definition f_1^J = f_1, ..., f_J: source sentence (observations); f_j: source word at position j; J: length of the source sentence. e_1^I = e_1, ..., e_I: target sentence (states of the HMM); e_i: target word at position i; I: length of the target sentence. a_1^J = a_1, ..., a_J: alignment (state sequence); a_j in [1, I]: f_j -> e_{a_j}. p(f|e): word translation probability. 20

21 HMM Formulation Given f_1^J and e_1^I, the alignment a_1^J is treated as a hidden variable: p(f_1^J | e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J} p(a_j | a_{j-1}, I) p(f_j | e_{a_j}). Model assumptions: the emission probability p(f_j | e_{a_j}) models the word-to-word translation and depends only on the target word; the transition probability p(a_j | a_{j-1}, I) models the distortion of word order and depends only on the position of the last state (relative distortion). 21

22 ML Training of HMM Maximum likelihood training: \hat{\Lambda}_{ML} = argmax_{\Lambda} p(f_1^J | e_1^I; \Lambda), where \Lambda = {p(a_j | a_{j-1}, I), p(f_j | e_{a_j})}. Expectation-Maximization (EM) training for \Lambda; an efficient Forward-Backward algorithm exists. 22

23 Find the Optimal Alignment Viterbi decoding: \hat{a}_1^J = argmax_{a_1^J} \prod_{j=1}^{J} p(a_j | a_{j-1}, I) p(f_j | e_{a_j}). Other variations: posterior-probability-based decoding; max posterior mode decoding.

24 IBM Models 1-5 A series of generative models (Brown et al., 1993), summarized below (along with the HMM):
IBM-1: convex (non-strictly); efficient E-M training; does not model word-order distortion.
HMM: relative distortion model (captures the intuition that words move in groups); efficient E-M training algorithm.
IBM-2: IBM-1 plus an absolute distortion model (measures the divergence of the target word's position from the ideal monotonic alignment position).
IBM-3: IBM-2 plus a fertility model (models the number of words that a state generates); approximate parameter estimation due to the fertility model; costly training.
IBM-4: like IBM-3, but uses a relative distortion model.
IBM-5: addresses the model deficiency issue (IBM-3 and IBM-4 are deficient).
24
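As a concrete illustration of how such alignment models are estimated, below is a minimal sketch of EM for the simplest case, IBM Model 1 (no distortion model); the variable names and the tiny toy corpus are illustrative assumptions only, and the HMM model adds a forward-backward pass over alignments on top of the same expectation/normalization cycle.

from collections import defaultdict

def ibm1_em(bitext, iterations=10):
    """Toy EM for IBM Model 1: estimate p(f|e) from pairs (e_words, f_words)."""
    # Uniform initialization over all co-occurring word pairs.
    t = defaultdict(float)
    for e_words, f_words in bitext:
        for f in f_words:
            for e in e_words + ["NULL"]:
                t[(f, e)] = 1.0
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for e_words, f_words in bitext:
            states = e_words + ["NULL"]
            for f in f_words:
                # E-step: posterior of each state word generating f.
                z = sum(t[(f, e)] for e in states)
                for e in states:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalize to get new translation probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

bitext = [("完成 工作".split(), "finish the task".split()),
          ("招生".split(), "recruiting students".split())]
t = ibm1_em(bitext)
print(max(t, key=t.get))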

25 Other Word Alignment Models
Extensions of the HMM (Toutanova et al. 2002; Liang et al. 2006; Deng & Byrne 2005; He 2007): model fertility implicitly; regularize the alignment by agreement of the two directions; better distortion models.
Discriminative models (Moore et al. 2006; Taskar et al. 2005): large linear models with many features; discriminative learning based on annotation.
Syntax-driven models (Zhang & Gildea 2005; Fossum et al. 2008; Haghighi et al. 2009): use syntactic features and/or syntactically structured models.
25

26 From Word to Phrase Translation Extract phrase translation pairs from the word alignment, e.g.: 完成 -> finish; 招生 -> recruiting students; 工作 -> task, the task; 招生工作 -> task of recruiting students. (Och & Ney 2004; Koehn, Och and Marcu, 2003) 26

27 Common Phrase Translation Models
Forward phrase translation model: p(e_bar | f_bar) = #(e_bar, f_bar) / #(*, f_bar); sentence-level score h_fp = \prod_k p(e_bar_k | f_bar_k).
Forward lexical translation model: p(e | f) = gamma(e, f) / gamma(*, f); sentence-level score h_fl = \prod_k \prod_m \sum_n p(e_{k,m} | f_{k,n}).
Backward phrase translation model: reverse of the forward counterpart; h_bp = \prod_k p(f_bar_k | e_bar_k).
Backward lexical translation model: reverse of the forward counterpart; h_bl = \prod_k \prod_n \sum_m p(f_{k,n} | e_{k,m}).
27
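A minimal sketch of the forward phrase translation model above, estimated by relative frequency over extracted phrase pairs; the data structures and the toy pair list are illustrative assumptions.

from collections import Counter, defaultdict

def estimate_forward_phrase_table(extracted_pairs):
    """p(target_phrase | source_phrase) = #(target, source) / #(*, source)."""
    pair_counts = Counter(extracted_pairs)                   # counts #(e, f)
    source_counts = Counter(f for _, f in extracted_pairs)   # counts #(*, f)
    table = defaultdict(dict)
    for (e, f), c in pair_counts.items():
        table[f][e] = c / source_counts[f]
    return table

# Toy phrase pairs (target, source) harvested from word-aligned sentences.
pairs = [("finish", "完成"), ("finish", "完成"), ("complete", "完成"),
         ("recruiting students", "招生")]
table = estimate_forward_phrase_table(pairs)
print(table["完成"])   # {'finish': 0.666..., 'complete': 0.333...}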

28 Integration of All Component Models Use a log-linear model to integrate all component models (also known as features): P(E|F) = (1/Z) exp(\sum_i lambda_i h_i(E, F)), where {h_i} are features and lambda_i are the feature weights (Och & Ney 2002). Common features include: phrase translation models (e.g., the four models discussed before); one or more language models in the target language; counts of words and phrases; a distortion model that models word ordering. All features need to be decomposable to the word/n-gram/phrase level. Select the translation with the best integrated score: \hat{E} = argmax_E P(E|F) = argmax_E \sum_i lambda_i h_i(E, F). 28
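A minimal sketch of scoring one hypothesis with such a log-linear combination; the particular feature names and weight values are illustrative assumptions.

def loglinear_score(features, weights):
    """Unnormalized log-linear score sum_i lambda_i * h_i(E, F); argmax over E is unaffected by Z."""
    return sum(weights[name] * value for name, value in features.items())

# Toy feature values h_i(E, F) for one hypothesis (log-probabilities and counts).
features = {"log_p_phrase_fwd": -2.3, "log_p_lm": -5.1, "word_count": 6.0, "distortion": -1.0}
weights = {"log_p_phrase_fwd": 1.0, "log_p_lm": 0.8, "word_count": 0.2, "distortion": 0.6}
print(loglinear_score(features, weights))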

29 Training Feature Weights Denote the weight set {lambda_i} by lambda; lambda is trained to optimize translation performance: \hat{lambda} = argmax_{lambda} BLEU(\hat{E}(lambda, F), E), where \hat{E}(lambda, F) = argmax_E \sum_i lambda_i h_i(E, F) and E denotes the reference translations. A non-convex problem! (Och 2003) 29

30 Minimum Error Rate Training Given an n-best list {E}, find the best lambda, i.e., the lambda under which the top-scored E gives the best BLEU.
n-best list: E_1 with features (v_{1,1}, v_{1,2}, v_{1,3}) and BLEU B_1; E_2 with (v_{2,1}, v_{2,2}, v_{2,3}) and B_2; E_3 with (v_{3,1}, v_{3,2}, v_{3,3}) and B_3.
With S = \sum_i lambda_i h_i and all weights but lambda_1 held fixed: S(E_1) = v_{1,1} lambda_1 + C_1; S(E_2) = v_{2,1} lambda_1 + C_2; S(E_3) = v_{3,1} lambda_1 + C_3 (illustration of the MERT algorithm: score lines intersecting at key values a, b, c along lambda_1).
MERT (Och 2003): the score of each E is linear in the lambda of interest. One only needs to check the few key values of lambda where the lines intersect, since the ranking of hypotheses does not change between two adjacent key values: e.g., when lambda_1 < a, E_3 is top-scored; when a < lambda_1 < c, E_2 is top-scored. 30
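A minimal sketch of the 1-D line search at the heart of MERT: for one feature weight, collect the intersection points of the per-hypothesis score lines, then evaluate the error (here a stand-in per-sentence error function rather than corpus BLEU) once per interval. The data layout, probing offsets, and toy error function are illustrative assumptions.

def mert_line_search(nbest, grad_index, fixed_score, error):
    """nbest: list of (feature_vector, stats); vary only the weight at grad_index.

    Each hypothesis score is a line: slope * lam + intercept, where
    slope = h[grad_index] and intercept = the score of the remaining features.
    """
    lines = [(h[grad_index], fixed_score(h), stats) for h, stats in nbest]
    # Candidate thresholds: intersections of every pair of score lines.
    points = set()
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            (a1, b1, _), (a2, b2, _) = lines[i], lines[j]
            if a1 != a2:
                points.add((b2 - b1) / (a1 - a2))
    best_lam, best_err = 0.0, float("inf")
    for x in sorted(points):
        for lam in (x - 1e-3, x + 1e-3):    # probe just left/right of each intersection
            top = max(lines, key=lambda l: l[0] * lam + l[1])
            err = error(top[2])
            if err < best_err:
                best_lam, best_err = lam, err
    return best_lam, best_err

nbest = [([1.0, 0.5], {"errors": 2}), ([0.2, 0.9], {"errors": 1}), ([0.6, 0.1], {"errors": 3})]
lam, err = mert_line_search(nbest, grad_index=0,
                            fixed_score=lambda h: 0.7 * h[1],
                            error=lambda stats: stats["errors"])
print(lam, err)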

31 More Advanced Training Use a large set of (sparse) features E.g., integrate lexical, POS, syntax features MERT is not effective anymore, use MIRA, PRO etc. for training of feature weights (Watanabe et al., 2007, Chiang et al. 2009, Hopkins & May 2011, Simianer et al. 2012) Better estimation of (dense) translation models E-M style phrase translation model training (Wuebker et al. 2010, Huang & Zhou 2009) Train the translation probability distributions discriminatively (He & Deng 2012, Setiawan & Zhou 2013) A mix of the above two approaches Build a set of new features for each phrase pair, and train them discriminatively by max expected BLEU (Gao & He 2013) 31

32 Further Reading P. Brown, S. D. Pietra, V. J. D. Pietra, and R. L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics. D. Chiang, K. Knight and W. Wang. 11,001 new features for statistical machine translation. In Proceedings of NAACL-HLT. Y. Deng and W. Byrne, 2005, HMM Word and Phrase Alignment For Statistical Machine Translation, in Proceedings of HLT/EMNLP. V. Fossum, K. Knight and S. Abney. Using Syntax to Improve Word Alignment Precision for Syntax-Based Machine Translation. In Proceedings of the WMT. J. Gao and X. He, 2013, Training MRF-Based Phrase Translation Models using Gradient Ascent, in Proceedings of NAACL. A. Haghighi, J. Blitzer, J. DeNero, and D. Klein. Better word alignments with supervised ITG models. In Proceedings of ACL. X. He and L. Deng, 2012, Maximum Expected BLEU Training of Phrase and Lexicon Translation Models, in Proceedings of ACL. M. Hopkins and J. May. Tuning as ranking. In Proceedings of EMNLP. S. Huang and B. Zhou. An EM algorithm for SCFG in formal syntax-based translation. In Proceedings of ICASSP. P. Koehn, F. Och, and D. Marcu. Statistical Phrase-Based Translation. In Proceedings of HLT-NAACL. P. Liang, B. Taskar, and D. Klein, 2006, Alignment by Agreement, in Proceedings of NAACL. R. Moore, W. Yih and A. Bode, 2006, Improved Discriminative Bilingual Word Alignment, In Proceedings of COLING/ACL. F. Och and H. Ney. Discriminative training and Maximum Entropy Models for Statistical Machine Translation, In Proceedings of ACL. F. Och, 2003, Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of ACL. H. Setiawan and B. Zhou. Discriminative Training of 150 Million Translation Parameters and Its Application to Pruning, NAACL. P. Simianer, S. Riezler, and C. Dyer. Joint feature selection in distributed stochastic learning for large-scale discriminative training in SMT. In Proceedings of ACL. K. Toutanova, H. T. Ilhan, and C. D. Manning. Extensions to HMM-based Statistical Word Alignment Models. In Proceedings of EMNLP. S. Vogel, H. Ney, and C. Tillmann. HMM-based Word Alignment In Statistical Translation. In Proceedings of COLING. T. Watanabe, J. Suzuki, H. Tsukuda, and H. Isozaki. Online large-margin training for statistical machine translation. In Proc. EMNLP. J. Wuebker, A. Mauser and H. Ney. Training phrase translation models with leaving-one-out, In Proceedings of ACL. H. Zhang and D. Gildea, 2005, Stochastic Lexicalized Inversion Transduction Grammar for Alignment, In Proceedings of ACL. 32

33 Translation Structures for ST 1. Overview: ASR, SMT, ST, Metric 2. Learning Problems in SMT 3. Translation Structures for ST 4. Decoding: A Unified Perspective ASR/SMT 5. Coupling ASR/SMT: Decoding & Modeling 6. Practices 7. Future Directions 33

34 Translation Equivalents (TE) Usually represented by some synchronous (source and target) grammar. The grammar choice is limited by two factors. Expressiveness: is it adequate to model linguistic equivalence between natural language pairs? Computational complexity: is it practical to build machine translation solutions upon it? Need to balance both, particularly for ST: robust to informal spoken language; critical speed requirement due to ST's interactive nature; additional bonus if it permits effective and efficient integration with ASR. Examples: 初步 X_1 -> initial X_1; X_1 的 X_2 -> The X_2 of X_1. 34

35 Two Dominating Categories of TEs Finite-state-based formalism: e.g., phrase-based SMT (Och and Ney, 2004; Koehn et al., 2003). Synchronous context-free grammar-based formalism: hierarchical phrase-based (Chiang, 2007); tree-to-string (e.g., Quirk et al., 2005; Huang et al., 2006); forest-to-string (Mi et al., 2008); string-to-tree (e.g., Galley et al., 2004; Shen et al., 2008); tree-to-tree (e.g., Eisner 2003; Cowan et al., 2006; Zhang et al., 2008; Chiang 2010). Exceptions: STAG (DeNeefe and Knight, 2009); Tree Sequences (Zhang et al., 2008). Examples: 初步 X_1 -> initial X_1; X_1 的 X_2 -> The X_2 of X_1. 35

36 Phrase-based SMT: a generative process 1. The input sentence f_1^J is segmented into K phrases. 2. Permute the source phrases into the appropriate order. 3. Translate each source phrase into a target phrase. 4. Concatenate target phrases to form the target sentence. 5. Score the target sentence with the target language model. Expressiveness: any sequence of translations is possible if arbitrary reordering is permitted. Complexity: arbitrary reordering leads to an NP-hard problem (Knight, 1999). 36

37 Reordering in Phrasal SMT To constrain the search space, reordering is usually limited by a preset maximum reordering window and skip size (Zens et al., 2004). Reordering models usually penalize non-monotonic movement, proportionally to the jump distance, further refined by lexicalization, i.e., the degree of penalization varies with the surrounding words (Koehn et al. 2007). Reordering moves whole units at the phrase level, i.e., no gaps are allowed in reordering movement. 37

38 Backgrounds: Semiring A semiring is a 5-tuple (Psi, +, x, 0-bar, 1-bar) (Mohri, 2002), where Psi is a set and 0-bar, 1-bar are in Psi, with two closed and associative operators. The product (x) computes the weight of a path (a sequence of edges): 0-bar x a = 0-bar (absorbing), 1-bar x a = a (unit), for all a in Psi. The sum (+) combines the weights of alternative paths (e.g., to pick an optimal one): 0-bar + a = a (unit). Examples used in speech & language: the Viterbi semiring ([0,1], max, *, 0, 1), defined over probabilities; the Tropical semiring (R_+ with +inf, min, +, +inf, 0), which operates on non-negative weights, e.g., negative log probabilities, a.k.a. costs. 38
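A minimal sketch of the two semirings named above, showing how the same path-weight computation is parameterized by the (sum, product) pair; the class layout is an illustrative assumption.

from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable     # combine alternative paths
    times: Callable    # extend a path by one edge
    zero: float        # identity for plus, absorbing for times
    one: float         # identity for times

# Viterbi semiring: probabilities, best path = max of products.
VITERBI = Semiring(plus=max, times=lambda a, b: a * b, zero=0.0, one=1.0)
# Tropical semiring: costs (negative log probs), best path = min of sums.
TROPICAL = Semiring(plus=min, times=lambda a, b: a + b, zero=float("inf"), one=0.0)

def path_weight(edge_weights, sr):
    """Weight of one path = times-product of its edge weights."""
    w = sr.one
    for x in edge_weights:
        w = sr.times(w, x)
    return w

def best_of(paths, sr):
    """Combine alternative path weights with the plus operator (max or min)."""
    total = sr.zero
    for p in paths:
        total = sr.plus(total, path_weight(p, sr))
    return total

print(best_of([[0.5, 0.4], [0.9, 0.1]], VITERBI))   # 0.2
print(best_of([[0.7, 0.9], [0.1, 2.3]], TROPICAL))  # 1.6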

39 Backgrounds: Automata/WFST [Figure: a weighted acceptor encoding a 3-gram LM, and an example weighted transducer.] 39

40 Backgrounds: Formal Languages A finite-state automaton (FSA) is equivalent to a regular grammar, and to a regular language. Weighted finite-state machines are closed under the composition operation. A regular grammar constitutes a strict subset of context-free grammars (CFG) (Hopcroft and Ullman, 1979). The composition of a CFG with an FSM is guaranteed to be a CFG. 40

41 FST-based translation equivalence Source-target equivalence is modeled by a weighted non-nested mapping between word strings. The unit of mapping, (e_i, f_i), is a modeling choice, and a phrase is better than a word due to: reduced word-sense ambiguity with surrounding contexts; appropriate local reordering encoded in the source-target phrase pair. [Arcs labeled: src phrase id : tgt phrase id / cost] 41

42 Phrase-based SMT is Finite-State Each step relates input and output as WFST operations: F: input sentence; P: segmentation; R: permutation; T: translation (phrasal equivalence); W: concatenation; G: target language model. Doing all steps amounts to composing the above WFSTs, guaranteed to produce an FSM since WFSTs are closed under such composition. Similar to ASR, phrase-based SMT amounts to the following operation, computable with a general-purpose FST toolkit (Kumar et al. 2005): E = best_path(F o P o R o T o W o G). 42

43 Why do we care? It is not only a theoretical convenience that conceptually connects ASR and SMT better; it is practically useful for leveraging mature WFST optimization algorithms, and it suggests an integrated framework to address ST, i.e., composing the WFSTs from ASR and SMT. 43

44 Make the WFST Approach Practical The WFST-based system in (Kumar et al., 2005) runs significantly slower than the multiple-stack based decoder (Koehn, 2004): large memory requirements; heavy online computation for each composition. Reordering is a big challenge: the number of states needed is O(2^J) and cannot be bounded as finite-state for arbitrary inputs. Next, we show a framework that addresses both issues to make the approach more practically suitable for ST. 44

45 Folsom: phrase-based SMT by FSM To speed up, first cut the number of online compositions: E = best_path(F o P o R o T o W o G) (1) becomes E = best_path(F o M o G) (2), where M is a WFST encoding a log-linear phrasal translation model, obtained offline by M = Min(Min(Det(P) o T) o W). This is possible by making P determinizable (Zhou et al., 2006). More flexible reordering: F is a WFSA constructed on-the-fly to encode the uncertainty of the input sentence (e.g., reordering); examples on the next slide. A dedicated decoder is designed for further efficiency: Viterbi search on the lazily 3-way composed graph as in (2). 45

46 Reordering, Finite-State & Gaps (a) Monotonic (like ASR): each state denotes a specific input coverage with a bit-vector; arcs are labeled with the input position to be covered in the state transition; weights are given by reordering models. (b) Reordering with skip less than three: every path from the initial to the final state represents an acceptable input permutation. Source segmentation allows options with gaps when searching for the best path to transduce the inputs at the phrase level, e.g., f_1 f_2 f_3 f_4 segmented into the phrases (f_1 f_4) and (f_2 f_3). Such a translation phenomenon is beneficial in many language pairs: translate the disjoint Chinese preposition 在...外 as outside of verb phrases, or 唱...歌 as sing. Reordering with gaps adds flexibility to conventional phrase-based models and has been confirmed useful in many studies. 46

47 What FST-based models lack FST-based models (e.g., phrase-based) remain state-of-the-art for many language pairs/tasks (Zollmann et al., 2008). However, they make no use of the fact that natural language is inherently structural & hierarchical, which leads to poor long-distance reordering modeling & exponential complexity of permutation. PCFGs are effectively used to model linguistic structures in many monolingual tasks (e.g., parsing). Recall: a regular language is a subset of the context-free languages. The synchronous (bilingual) version of the PCFG (i.e., SCFG) is an alternative for modeling translation structures. 47

48 SCFG-based TE An SCFG (Lewis and Stearns, 1968) rewrites the non-terminal (NT) on its left-hand side (LHS) into a <source, target> pair on its right-hand side (RHS), subject to the constraint of one-to-one correspondence (co-indexed by subscripts) between the source and target of every NT occurrence, e.g., VP -> < PP_1 VP_2, VP_2 PP_1 > with probability p. NTs on the RHS can be recursively instantiated simultaneously for both the source and the target, by applying any rule with a matching NT on the LHS. SCFG captures the hierarchical structure of NL & provides a more principled way to model reordering, e.g., the PP and VP above will be reordered regardless of their span lengths. 48

49 Expressiveness vs. Complexity Both depend on the maximum number of NTs on the RHS (a.k.a. the rank of the SCFG grammar). Complexity is polynomial: higher order for higher rank; cubic, O(J^3), for a rank-two (binary) SCFG. Expressiveness: more expressive with higher rank. Specifically for a binary SCFG: rare reordering examples exist (Wu 1997; Wellington et al., 2006) that it cannot cover, but it is arguably sufficient in practice. 49

50 SCFG-based MT: what's in common and what differs? Like ice cream, SCFG models come in different flavors. Linguistic syntax based: utilize structures defined over linguistic theory and annotations (e.g., Penn Treebank); SCFG rules are derived from the parallel corpus guided by explicit parsing of at least one side of the parallel corpus. E.g., tree-to-string (e.g., Quirk et al., 2005; Huang et al., 2006), forest-to-string (Mi et al., 2008), string-to-tree (e.g., Galley et al., 2004; Shen et al., 2008), tree-to-tree (e.g., Eisner 2003; Cowan et al., 2006; Zhang et al., 2008; Chiang 2010). Formal syntax based: utilize only the hierarchical structure of natural language; grammars are extracted from the parallel corpus without using any linguistic knowledge or annotations. E.g., ITG (Wu, 1997) and hierarchical models (Chiang, 2007). 50

51 Formal syntax based SCFG Arguably a better fit for ST: it does not rely on parsing, which might be difficult for informal spoken language. Grammars: only one universal NT X is used (Chiang, 2007). Phrasal rules: X -> < 初步试验, initial experiments >. Abstract rules, with NTs on the RHS: X -> < X_1 的 X_2, The X_2 of X_1 >. Glue rules to generate the sentence symbol S: S -> < S_1 X_2, S_1 X_2 >; S -> < X_1, X_1 >. 51

52 Learning of (Abstract) SCFG Rules Hierarchical SCFG rule extraction (Chiang 2007): replace any aligned sub-phrase pair of a phrase pair with co-indexed NT symbols; many-to-many mapping between phrase pairs and derived abstract rules; conditional probabilities estimated from heuristic counts. Linguistic SCFG rule extraction: syntactic parsing on at least one side of the parallel corpus; rules are extracted along with the parsing structures: constituency (Galley et al., 2004; Galley et al., 2006) and dependency (Shen et al., 2008). Improved extraction and parameterization: expected counts via forced alignment and inside-outside (Huang and Zhou, 2009); leaving-one-out smoothing (Wuebker et al., 2010). Extract additional rules: reduce conflicts between alignment and parsing structures (DeNeefe et al., 2007); from existing rules of high expected counts: rule arithmetic (Cmejrek and Zhou, 2010). 52

53 Practical Considerations of Formal Syntax-based SCFG On speed: finite-state based search with limited reordering usually runs faster than most SCFG-based models, except tree-to-string (e.g., Huang and Mi, 2010), which is faster in practice. On the model: with beam pruning, using a higher-rank (>2) SCFG is possible. Weakness: the lack of constraints on X often leads to over-generalization of SCFG rules. Improvements: add linguistically motivated constraints; refined NTs with direct linguistic annotations (Zollmann and Venugopal, 2006); soft constraints (Marton et al., 2008); enriched features tied to NTs (Zhou et al., 2008b; Huang et al., 2010). 53

54 Further Reading Phrase-based SMT F. Och and H. Ney. The Alignment Template Approach to Statistical Machine Translation. Computational Linguistics. F. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, vol. 29, 2003. F. Och and H. Ney. Discriminative training and Maximum Entropy Models for Statistical Machine Translation, In Proceedings of ACL. P. Koehn, F. J. Och and D. Marcu. Statistical phrase based translation. In Proc. NAACL. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin and E. Herbst. Moses: Open Source Toolkit for Statistical Machine Translation. ACL Demo Paper. S. Kumar, Y. Deng and W. Byrne. A weighted finite state transducer translation template model for statistical machine translation, Journal of Natural Language Engineering, vol. 11, no. 3. R. Zens, H. Ney, T. Watanabe and E. Sumita. Reordering constraints for phrase-based statistical machine translation. In Proc. COLING. B. Zhou, S. Chen and Y. Gao. Folsom: A Fast and Memory-Efficient Phrase-based Approach to Statistical Machine Translation. In Proc. SLT. Computational theory: J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages, and Computation, Addison-Wesley. G. Gallo, G. Longo, S. Pallottino and S. Nguyen. Directed hypergraphs and applications. Discrete Applied Mathematics, 42(2). M. Mohri. Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, vol. 7, no. 3. M. Mohri, F. Pereira and M. Riley. Weighted finite-state transducers in speech recognition. Computer Speech and Language, vol. 16, no. 1.

55 Further Reading: SCFG-based SMT B. Cowan, I. Kučerová and M. Collins. A discriminative model for tree-to-tree translation. In Proc. EMNLP. D. Chiang. Hierarchical phrase-based translation. Computational Linguistics, 33(2). D. Chiang, K. Knight and W. Wang. 11,001 new features for statistical machine translation. In Proc. NAACL-HLT. D. Chiang. Learning to translate with source and target syntax. In Proc. ACL. S. DeNeefe, K. Knight, W. Wang and D. Marcu. What Can Syntax-Based MT Learn from Phrase-Based MT? In Proc. EMNLP-CoNLL. S. DeNeefe and K. Knight. Synchronous tree adjoining machine translation. In Proc. EMNLP 2009. J. Eisner. Learning non-isomorphic tree mappings for machine translation. In Proc. ACL 2003. M. Galley and C. Manning. Accurate nonhierarchical phrase-based translation. In Proc. NAACL. M. Galley, J. Graehl, K. Knight, D. Marcu, S. DeNeefe, W. Wang and I. Thayer. Scalable inference and training of context-rich syntactic translation models. In Proc. COLING-ACL. M. Galley, M. Hopkins, K. Knight and D. Marcu. What's in a translation rule? In Proc. NAACL. L. Huang and D. Chiang. Forest Rescoring: Faster Decoding with Integrated Language Models. In Proc. ACL. G. Iglesias, C. Allauzen, W. Byrne, A. d. Gispert and M. Riley. Hierarchical Phrase-Based Translation Representations. In Proc. EMNLP. G. Iglesias, A. d. Gispert, E. R. Banga and W. Byrne. Hierarchical phrase-based translation with weighted finite state transducers. In Proc. NAACL-HLT. H. Mi, L. Huang and Q. Liu. Forest-based translation. In Proc. ACL 2008. C. Quirk, A. Menezes and C. Cherry. Dependency treelet translation: Syntactically informed phrase-based SMT. In Proc. ACL. L. Shen, J. Xu and R. Weischedel. A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model. In Proc. ACL-HLT. D. Wu. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, vol. 23, no. 3. M. Zhang, H. Jiang, A. Aw, H. Li, C. L. Tan and S. Li. A Tree Sequence Alignment-based Tree-to-Tree Translation Model. In Proc. ACL-HLT.

56 Decoding: A Unified Perspective for ASR/SMT/ST 1. Overview: ASR, SMT, ST, Metric 2. Learning Problems in SMT 3. Translation Structures for ST 4. Decoding: A Unified Perspective ASR/SMT 5. Coupling ASR/SMT: Decoding & Modeling 6. Practices 7. Future Directions 56

57 Unifying ASR/SMT/ST Decoding At first glance, there is a dramatic difference between ASR and SMT due to reordering. Even within SMT there are a variety of paradigms, each of which may call for its own decoding algorithm. Common to all decoding: the search space is usually exponentially large and DP is a must. DP: divide-and-conquer with reusable sub-solutions; the well-known Viterbi search in HMM-based ASR is an instance of DP. We try to unify them by observing from a higher standpoint. Benefits of a unified perspective: it helps us understand the concepts better, reveals a close connection between ASR and SMT, and is beneficial for joint ST decoding. 57

58 Background: Weighted Directed Acyclic Graph Sigma = (V, E, Omega): a vertex set V and an edge set E; a weight mapping function Omega: E -> Psi assigns each edge a weight from Psi, where Psi is defined in a semiring (Psi, +, x, 0-bar, 1-bar); a single source vertex s in V. A path pi in the graph is a sequence of consecutive edges pi = e_1 e_2 ... e_l, where the end vertex of one edge is the start vertex of the subsequent edge. The weight of a path is Omega(pi) = x_{i=1}^{l} Omega(e_i) (the semiring product over its edges). The shortest distance from s to a vertex q, delta(q), is the semiring sum of the weights of all paths from s to q; we define delta(s) = 1-bar. The search space of any finite-state automaton, e.g., the ones used in HMM-based ASR and FST-based translation, can be represented by such a graph. 58

59 Generic Viterbi Search of a DAG Many search problems can be converted to the classical shortest-distance problem in a DAG (Cormen et al., 2001). Complexity is O(|V| + |E|), as each edge needs to be visited exactly once. The search terminates when all vertices reachable from s have been visited, or when some predefined destination vertices are encountered. 59
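A minimal sketch of this generic shortest-distance (Viterbi) search over a weighted DAG in the tropical semiring, visiting vertices in topological order; the edge-list graph representation is an illustrative assumption.

def dag_viterbi(num_vertices, edges, source=0):
    """edges: list of (u, v, cost) with u < v (already topologically numbered).

    Returns (delta, backpointer): tropical-semiring shortest distances from the
    source and the incoming vertex used on each best path.
    """
    delta = [float("inf")] * num_vertices
    back = [None] * num_vertices
    delta[source] = 0.0                      # delta(s) = 1-bar in the tropical semiring
    for u, v, cost in sorted(edges):         # topological order via vertex numbering
        if delta[u] + cost < delta[v]:       # plus = min, times = +
            delta[v] = delta[u] + cost
            back[v] = u
    return delta, back

# Toy graph: 0 -> 1 -> 3 and 0 -> 2 -> 3.
edges = [(0, 1, 1.0), (0, 2, 2.5), (1, 3, 4.0), (2, 3, 1.0)]
delta, back = dag_viterbi(4, edges)
print(delta[3], back[3])   # 3.5, reached via vertex 2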

60 Case Studies: ASR and Multi-stack Phrasal SMT HMM-based ASR: composing the input acceptor with a sequence of transducers (Mohri et al., 2002) defines a finite-state search space represented by a graph; Viterbi search finds the shortest-distance path in this graph: \hat{f}_1^J = best_path(O o H o C o L o G). Multi-stack phrasal SMT, e.g., Moses (Koehn et al., 2007): the source vertex is the state where none of the source words has been translated and the translation hypothesis is empty; the target vertex t is the state where all the source words have been covered. Each edge connects start and end vertices by choosing a consecutive uncovered set of source words and applying one of the source-matched phrase pairs (PPs); the target side of the PP is appended to the hypothesis; the weight of each edge is determined by a log-linear model. Such edges are expanded consecutively until the search arrives at t; the best translation is collected along the best path, with the lowest weight delta(t). 60

61 Hypothesis Recombination in Graph [Illustration: the partial hypotheses "The initial experiments are" and "The original attempts are" share the same signature and are merged.] Key idea: keep the partial hypothesis with the lowest weight and discard others with the same signature; in phrase-based SMT the signature is the same set of source words being covered plus the same n-1 last words (for an n-gram LM) in the hypothesis. Only the partial hypothesis with the lowest cost has a possibility of becoming the best hypothesis in the future, so HR is loss-free for 1-best search. It is clear why this works when viewed from the graph perspective: all partial hypotheses with the same signature arrive at the same vertex; all future paths leaving this vertex are indistinguishable afterwards; under the semiring sum operation (e.g., min in the tropical semiring), only the lowest-cost partial hypothesis is kept. 61
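A minimal sketch of signature-based hypothesis recombination for a phrase-based stack decoder; the hypothesis fields and the LM order used in the example are illustrative assumptions.

def recombine(hypotheses, lm_order=3):
    """Keep only the lowest-cost hypothesis per signature.

    Signature = (covered source positions, last n-1 target words), so merged
    hypotheses are indistinguishable to all future expansion steps.
    """
    best = {}
    for hyp in hypotheses:
        sig = (frozenset(hyp["coverage"]), tuple(hyp["words"][-(lm_order - 1):]))
        if sig not in best or hyp["cost"] < best[sig]["cost"]:
            best[sig] = hyp
    return list(best.values())

hyps = [
    {"coverage": {0, 1, 2}, "words": ["the", "initial", "experiments", "are"], "cost": 2.1},
    {"coverage": {0, 1, 2}, "words": ["the", "original", "attempts", "are"], "cost": 2.6},
]
# With a bigram LM the signature keeps only the last word ("are"), so the two merge.
print(recombine(hyps, lm_order=2))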

62 Case Study II: Folsom Phrasal SMT E = best_path(F o M o G); the graph is expanded by lazy 3-way composition. Source vertex = the composed start states (F_1, M_1, G_1). Target vertices = states where each component state is a final state in its individual WFST. Subsequent vertices are visited in topological order. The weight of an edge e from (F_p, M_p, G_p) to (F_q, M_q, G_q) is Omega(e) = \sum_{m in {F, M, G}} lambda_m Omega(e_m). HR: merge vertices with the same (F_q, M_q, G_q) and keep only the one with the lowest cost. Any path connecting the source to a target vertex is a translation; the best one has the shortest distance. Decoding is an optimized lazy 3-way composition plus minimization, followed by a best-path search. 62

63 Background: Weighted Directed Hypergraph H = (V, E, Psi): a vertex set V and a hyperedge set E; each hyperedge e links an ordered list of tail vertices T(e) to a head vertex. The arity of H is the maximum number of tail vertices over all e. A weight function f_e: Psi^{|T(e)|} -> Psi assigns each hyperedge e a weight from Psi. A derivation d of a vertex q is a sequence of consecutive hyperedges connecting source vertices to q; its weight Omega(d) is computed recursively from the weight functions of each e. The best weight of q is the semiring sum over all of its derivations. A hypergraph, a generalization of a graph, encodes the hierarchical-branching search space expanded over CFG models. 63

64 Generic Viterbi Search of a DAH Decoding of SCFG-based models largely follows CKY, an instance of Viterbi search on a DAH of arity two. Complexity is proportional to O(|E|). 64

65 Case Study: Hypergraph decoding Vertices: [X, i, j], the LHS NT plus its span. Source vertices: apply phrasal rules matching any consecutive input span. Topological sorting ensures that vertices with shorter spans j - i are visited earlier than those with longer spans. The derivation weight delta(q) at the head vertex is updated per line 7 of the algorithm. The process is repeated until the target vertex [S, 0, J] is reached; the best derivation is found by backtracing from the target vertex. Complexity: O(|E|), proportional to O(|R| * J^3). 65

66 Case Study: DAH with LM The search space is still a DAH. Each tail vertex additionally encodes the n-1 boundary words on both its left-most and right-most sides. Each hyperedge updates the LM score and the boundary information. Worst-case complexity is O(|R| * J^3 * |T|^{4(n-1)}); pruning is a must. 66

67 Pruning: A perspective from graph Two options to prune in a graph: 1. Discard some end vertices: sort active vertices (sharing the same coverage vector) based on delta(q), skipping vertices that fall outside either a beam from the best (beam pruning) or the top-k list (histogram pruning). 2. Discard some edges leaving a start vertex (beam or histogram pruning): apply certain reordering constraints (Zens and Ney, 2004); discard higher-cost phrase translation options. Both, unlike HR, may lead to search errors. 67

68 Lazy Histogram Pruning Avoid expanding edges if they fall outside of the top k, provided the weight function is monotonic in each of its arguments and the input list for each argument is sorted. Example: cube pruning for DAH search (Chiang, 2007). Consider a hyperedge bundle {e_j} whose members share the same source side and identical tail vertices. Hyperedges with lower costs are visited earlier: push e into a priority queue sorted by f_e: Psi^{|T(e)|} -> Psi; the hyperedge popped from the priority queue is explored; stop when the top k hyperedges have been popped, and discard all remaining ones. 68
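A minimal sketch of the lazy best-first enumeration underlying cube pruning: given two cost-sorted argument lists for a hyperedge bundle, pop the k cheapest combinations from a priority queue without enumerating the full grid. The purely additive cost combination is an illustrative assumption (a real decoder also adds LM interaction costs, which is why this ordering is only approximately monotonic).

import heapq

def cube_top_k(costs_a, costs_b, k):
    """Enumerate the k cheapest (i, j) combinations of two sorted cost lists."""
    heap = [(costs_a[0] + costs_b[0], 0, 0)]
    seen = {(0, 0)}
    results = []
    while heap and len(results) < k:
        cost, i, j = heapq.heappop(heap)
        results.append((cost, i, j))
        # Push the two neighbors of the popped cell (the growing "cube" frontier).
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(costs_a) and nj < len(costs_b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (costs_a[ni] + costs_b[nj], ni, nj))
    return results

# Two sorted lists of partial-derivation costs (e.g., the two tail vertices of a binary rule).
print(cube_top_k([1.0, 1.5, 3.0], [0.5, 2.0, 2.2], k=4))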

69 Further Reading D. Chiang. Hierarchical phrase-based translation. Computational Linguistics, 33(2). G. Gallo, G. Longo, S. Pallottino and S. Nguyen. Directed hypergraphs and applications. Discrete Applied Mathematics, 42(2). L. Huang. Advanced dynamic programming in semiring and hypergraph frameworks. In Proc. COLING. Survey paper to accompany the conference tutorial. M. Mohri, F. Pereira and M. Riley. Weighted finite-state transducers in speech recognition. Computer Speech and Language, vol. 16, no. 1. D. Klein and C. D. Manning. Parsing and hypergraphs. New developments in parsing technology. K. Knight. Decoding Complexity in Word-Replacement Translation Models. Computational Linguistics, vol. 25, no. 4. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin and E. Herbst. Moses: Open Source Toolkit for Statistical Machine Translation. ACL Demo Paper. L. Huang and D. Chiang. Better k-best parsing. In Proc. the Ninth International Workshop on Parsing Technologies. L. Huang and D. Chiang. Forest Rescoring: Faster Decoding with Integrated Language Models. In Proc. ACL. B. Zhou. Statistical Machine Translation for Speech: A Perspective on Structure, Learning and Decoding, in Proceedings of the IEEE, IEEE. 69

70 Coupling ASR and SMT for ST Joint Decoding 1. Overview: ASR, SMT, ST, Metric 2. Learning Problems in SMT 3. Translation Structures for ST 4. Decoding: A Unified Perspective ASR/SMT 5. Coupling ASR/SMT: Decoding & Modeling 6. Practices 7. Future Directions 70

71 Bayesian Perspective of ST
\hat{e}_1^I = argmax_{e_1^I} P(e_1^I | x_1^T)
= argmax_{e_1^I} \sum_{f_1^J} P(e_1^I | f_1^J) P(x_1^T | f_1^J) P(f_1^J)
≈ argmax_{e_1^I} max_{f_1^J} P(e_1^I | f_1^J) P(x_1^T | f_1^J) P(f_1^J)
≈ argmax_{e_1^I} P(e_1^I | \hat{f}_1^J), with \hat{f}_1^J = argmax_{f_1^J} P(x_1^T | f_1^J) P(f_1^J)
The sum is replaced by a max (a common practice). In the last step f_1^J is determined only by x_1^T and solved as an isolated problem: the foundation of the cascaded approach. The cascaded approach is impaired by the compounding of errors propagated from ASR to SMT. ST can be improved if ASR/SMT interactions are factored in with better coupled components; coupling is achievable with joint decoding and more coherent modeling across components. 71

72 Tight Joint Decoding \hat{e}_1^I = argmax_{e_1^I} max_{f_1^J} P(e_1^I | f_1^J) P(x_1^T | f_1^J) P(f_1^J): a fully integrated search over all possible e_1^I and f_1^J. With the unified view of ASR/SMT, this is feasible for translation models with simplified reordering. WFST-based word-level ST (Matusov et al. 2006): substitute P(e_1^I, f_1^J) for the usual source LM used in ASR, E = best_path(X o H o C o L o M o G), to produce speech translation in the target language. Monotonic joint translation model at the phrase level (Casacuberta et al., 2008). Tight joint decoding using phrase-based SMT can be achieved by Folsom: composing the ASR WFSTs with M and G on-the-fly, followed by a fully integrated search. The limitation is that reordering can only occur within a phrase. 72

73 Loose Joint Decoding \hat{e}_1^I = argmax_{e_1^I} max_{f_1^J} P(e_1^I | f_1^J) P(x_1^T | f_1^J) P(f_1^J): approximate the full search space with a promising subset. N-best ASR hypotheses (Zhang et al., 2004): weak. Word lattices produced by ASR recognizers (Matusov et al., 2006; Mathias and Byrne, 2006; Zhou et al., 2007): the most general & challenging. Confusion networks (Bertoldi et al., 2008): a special case of the lattice. A better trade-off to address the reordering issue: SCFG models can take an ASR lattice for ST; a generalized CKY algorithm translates lattices (Dyer et al., 2008). Nodes in the lattice are numbered such that the end node is always numbered higher than the start node for any edge; a vertex in the HG is labeled with [X, i, j], where i < j are the node numbers in the input lattice spanned by X. If reordering is critical in joint ST, try SCFG-based SMT models: reordering without length constraints, and no need to traverse the lattice to compute distortion costs. 73

74 Coupling ASR and SMT for ST Joint Modeling 1. Overview: ASR, SMT, ST, Metric 2. Learning Problems in SMT 3. Translation Structures for ST 4. Decoding: A Unified Perspective ASR/SMT 5. Coupling ASR/SMT: Decoding & Modeling 6. Practices 7. Future Directions 74

75 End-to-end Modeling of Speech Translation Two modules in conventional speech translation: X -> Speech Recognition -> {F} -> Machine Translation -> E. Problems of inconsistency: SR and MT are optimized for different criteria, inconsistent with the end-to-end ST quality (metric discrepancy); SR and MT are trained without considering the interaction between them (train/test condition mismatch). 75

76 Why End-to-end Modeling Matters The lowest WER does not necessarily give the best BLEU. Fluent English is preferred as the input for MT, despite the fact that this might cause an increase in WER. (He et al., 2011) 76

77 End-to-end ST Model An end-to-end log-linear model for ST enables the incorporation of rich features, a principled way to model the interaction between core components, and globally optimal modeling/feature training: P(E, F | X) = (1/Z) exp(\sum_i lambda_i log h_i(E, F, X)); P(E | X) = \sum_F P(E, F | X); \hat{E} = argmax_E P(E | X). 77

78 Feature Functions for End-to-end ST Each feature function is a probabilistic model (except a few count features); all conventional ASR and MT models are included as features:
Acoustic model: h_AM = p(X | F)
Source language model: h_SLM = P_LM(F)
ASR hypothesis length model: h_SWC = e^{|F|}
Forward phrase translation model: h_F2Eph = \prod_k p(e_bar_k | f_bar_k)
Forward word translation model: h_F2Ewd = \prod_k \prod_m \sum_n p(e_{k,m} | f_{k,n})
Backward phrase translation model: h_E2Fph = \prod_k p(f_bar_k | e_bar_k)
Backward word translation model: h_E2Fwd = \prod_k \prod_n \sum_m p(f_{k,n} | e_{k,m})
Translation phrase count: h_PC = e^{K}
Translation hypothesis length: h_TWC = e^{|E|}
Phrase segmentation/reordering model: h_reorder = P_hr(S | E, F)
Target language model: h_TLM = P_LM(E)
78

79 Learning Feature Weights in the Log-linear Model Jointly optimize the feature weights by MERT (evaluated on a MS commercial data set):
Baseline (the current ASR and MT system): 33.8% BLEU
Global max-BLEU training (all features): 35.2% (+1.4%)
Global max-BLEU training (ASR-only features): 34.8% (+1.0%)
Global max-BLEU training (SMT-only features): 34.2% (+0.4%)
79

80 Case Study Recognition B contains one more insertion error. However, the inserted word "to": i) makes the MT input grammatically more correct; ii) provides critical context for "great"; iii) provides critical syntactic information for the word ordering of the translation. In the second example, recognition B contains two more errors. However, the mis-recognized phrase "want to" is plentifully represented in the formal text that is usually used for MT training and hence leads to a correct translation. 80

81 Learning Parameters Inside the Features First, define a generic differentiable utility function:
U(Lambda) = \sum_{r=1}^{R} \sum_{F_r in hyp(X_r)} \sum_{E_r in hyp(F_r)} p(E_r, F_r | X_r, Lambda) C(E_r, E_r^ref)
Lambda: the set of model parameters of interest; X_r: the r-th speech input utterance; E_r^ref: the translation reference of the r-th utterance; E_r: a translation hypothesis of the r-th utterance; F_r: a speech recognition hypothesis of the r-th utterance. U(Lambda) measures the end-to-end quality of ST; by choosing C(E_r, E_r^ref) properly, U(Lambda) covers a variety of ST metrics: C = BLEU(E_r, E_r^ref) gives the max expected BLEU objective; C = 1 - TER(E_r, E_r^ref) gives the min expected translation error rate objective (He & Deng 2013). 81
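A minimal sketch of evaluating such an expected utility over joint ASR/MT hypothesis pairs for one utterance, using normalized log-linear scores as the posterior p(E, F | X); the data layout and the plug-in sentence-BLEU function are illustrative assumptions.

import math

def expected_utility(hyp_pairs, reference, sent_bleu):
    """hyp_pairs: list of (E, F, score), score = unnormalized log-linear score.

    U = sum_{E,F} p(E, F | X) * C(E, E_ref), where the posterior is obtained by
    exponentiating and renormalizing the scores over the listed hypotheses.
    """
    m = max(score for _, _, score in hyp_pairs)        # for numerical stability
    weights = [math.exp(score - m) for _, _, score in hyp_pairs]
    z = sum(weights)
    return sum((w / z) * sent_bleu(e, reference)
               for (e, _, _), w in zip(hyp_pairs, weights))

pairs = [("she bought the suv", "asr hyp 1", -2.0),
         ("she buys the jeep", "asr hyp 2", -3.5)]
print(expected_utility(pairs, "she bought the suv",
                       sent_bleu=lambda e, r: 1.0 if e == r else 0.3))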

82 Objective, Regularization, and Optimization Optimize the utility function directly, e.g., max expected BLEU, with generic gradient-based methods (Gao & He 2013) and early stopping/cross-validation on a dev set. Or regularize the objective by K-L divergence (suitable for parameters in a probabilistic domain), with training objective O(Lambda) = log U(Lambda) - tau KL(Lambda^0 || Lambda), optimized by extended Baum-Welch (He & Deng 2008, 2012, 2013). 82

83 EBW Formula for the Translation Model Using the lexicon translation model as an example, the EBW update takes the form
p(g | h; Lambda) = [ \sum_{k,m: f_{k,m}=g} \sum_{E,F} p(E, F | X; Lambda') E-bar gamma_h(k, m) + U(Lambda') tau_FP p(g | h; Lambda^0) + D_h p(g | h; Lambda') ] / [ \sum_{k,m} \sum_{E,F} p(E, F | X; Lambda') E-bar gamma_h(k, m) + U(Lambda') tau_FP + D_h ]
where E-bar = C(E) - U(Lambda'), and gamma_h(k, m) = \sum_{n: e_{k,n}=h} p(f_{k,m} | e_{k,n}; Lambda') / \sum_n p(f_{k,m} | e_{k,n}; Lambda'). The ASR score affects the estimation of the translation model, and training is influenced by the translation quality. 83

84 Evaluation on IWSLT 11/TED Challenging open-domain SLT: TED talks, i.e., public speech on anything (tech, entertainment, arts, science, politics, economics, policy, environment, ...). Data (Chinese-English machine translation track): parallel Zh-En data: 110K sentences of TED transcripts and 7.7M sentences from the UN corpus; monolingual English data: 115M sentences from Europarl, Gigaword, etc. Example English: "What I'm going to show you first, as quickly as I can, is some foundational work, some new technology that we brought to..." Chinese: 首先, 我要用最快的速度为大家演示一些新技术的基础研究成果 84

85 Results on IWSLT 11/TED Phrase-based system: 1st phrase table from the TED parallel corpus; 2nd phrase table from 500K parallel sentences selected from the UN corpus; 1st 3-gram LM from the TED English transcriptions; 2nd 5-gram LM from the 115M supplementary English sentences. Max-BLEU training is applied only to the primary (TED) phrase table; fine-tuning of the full lambda set is performed at the end. BLEU scores on the IWSLT test sets:
baseline: 11.48% (Tst2010, dev), 14.68% (Tst2011, test)
Max expected BLEU training (He & Deng 2012): 12.39% (+0.9%) dev, 15.92% (+1.2%) test; the best single system in IWSLT 11/CE_MT
85

86 Further Reading F. Casacuberta, M. Federico, H. Ney and E. Vidal. Recent Efforts in Spoken Language Translation. IEEE Signal Processing Mag., May 2008. C. Dyer, S. Muresan, and P. Resnik. Generalizing Word Lattice Translation. In Proc. ACL. L. Mathias and W. Byrne. Statistical phrase-based speech translation. In Proc. ICASSP. E. Matusov, S. Kanthak and H. Ney. Integrating speech recognition and machine translation: Where do we stand? In Proc. ICASSP. H. Ney. Speech translation: Coupling of recognition and translation. In Proc. ICASSP. R. Zhang, G. Kikui, H. Yamamoto, T. Watanabe, F. Soong and W-K. Lo. A unified approach in speech-to-speech translation: Integrating features of speech recognition and machine translation. In Proc. COLING. B. Zhou, L. Besacier and Y. Gao. On Efficient Coupling of ASR and SMT for Speech Translation. In Proc. ICASSP. B. Zhou, X. Cui, S. Huang, M. Cmejrek, W. Zhang, J. Xue, J. Cui, B. Xiang, G. Daggett, U. Chaudhari, S. Maskey and E. Marcheret. The IBM speech-to-speech translation system for smartphone: Improvements for resource-constrained tasks. Computer Speech & Language, Vol. 27, Issue 2, 2013. 86

87 Further Reading X. He, A. Axelrod, L. Deng, A. Acero, M. Hwang, A. Nguyen, A. Wang, and X. Huang The MSR system for IWSLT 2011 evaluation, in Proc. IWSLT, December X. He and L. Deng, 2013, Speech-Centric Information Processing: An Optimization-Oriented Approach, in Proceedings of the IEEE X. He, L. Deng, and A. Acero, Why Word Error Rate is not a Good Metric for Speech Recognizer Training for the Speech Translation Task?, in Proc. ICASSP, IEEE X. He, L. Deng, and W. Chou, Discriminative learning in sequential pattern recognition, IEEE Signal Processing Magazine, September Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, M. Harper, Enriching speech recognition with automatic detection of sentence boundaries and disfluencies, IEEE Trans. on ASLP M. Ostendorf, B. Favre, R. Grishman, D. Hakkani-Tur, M. Harper, D. Hillard, J. Hirschberg, J. Heng, J. Kahn, L. Yang, S. Maskey, E. Matusov, H. Ney, A. Rosenberg, E. Shriberg, W. Wen, C. Woofers, Speech segmentation and spoken document processing, IEEE Signal Processing Magazine, May, 2008 A. Waibel, C. Fugen, Spoken language translation, IEEE Signal Processing Magazine, vol.25, no.3, pp.70-79, May Y. Zhang, L. Deng, X. He, and A. Acero, 2011, A Novel Decision Function and the Associated Decision- Feedback Learning for Speech Translation, in ICASSP, IEEE 87

88 Practices 1. Overview: ASR, SMT, ST, Metric 2. Learning Problems in SMT 3. Translation Structures for ST 4. Decoding: A Unified Perspective ASR/SMT 5. Coupling ASR/SMT: Decoding & Modeling 6. Practices 7. Future Directions 88

89 Two techniques in depth: 1 Domain adaptation 2 System combination 89

90 Domain adaptation for ST Words differ in meaning across domains/contexts, so domain adaptation is particularly important for ST: ST needs to handle spoken language (colloquial style vs. written style), and ST targets particular scenarios/domains, e.g., travel. 90

91 Domain Adaptation by Data Selection for MT Select data matching the target domain from a large general corpus. MT needs data that match both the source & target languages of the target domain: the adapted system needs to cover domain-specific input and produce domain-appropriate output. E.g., for "Do you know of any restaurants open now?", we need to select data covering not only "open" on the source side, but also the right translation of "open" on the target side. Use the bilingual cross-entropy difference: [H_in_src(s) - H_out_src(s)] + [H_in_tgt(s) - H_out_tgt(s)], where H_in_src(s) is the cross-entropy of s given the in-domain source language model, and likewise for the other terms. (Axelrod, He, and Gao, 2011) 91
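A minimal sketch of ranking candidate sentence pairs by the bilingual cross-entropy difference above; the tiny add-one-smoothed unigram LM stands in for real in-domain/out-of-domain language models and is an illustrative assumption, not a specific toolkit API.

import math
from collections import Counter

class UnigramLM:
    """Tiny add-one-smoothed unigram LM, just enough to compute per-word cross-entropy."""
    def __init__(self, sentences):
        self.counts = Counter(w for s in sentences for w in s.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1
    def cross_entropy(self, sentence):
        words = sentence.split()
        return -sum(math.log((self.counts[w] + 1) / (self.total + self.vocab))
                    for w in words) / max(len(words), 1)

def bilingual_xent_diff(pair, lm_in_src, lm_out_src, lm_in_tgt, lm_out_tgt):
    """Lower is better: pairs that look in-domain on both sides score low."""
    src, tgt = pair
    return ((lm_in_src.cross_entropy(src) - lm_out_src.cross_entropy(src)) +
            (lm_in_tgt.cross_entropy(tgt) - lm_out_tgt.cross_entropy(tgt)))

def select_pseudo_in_domain(general_bitext, lms, top_n):
    """Keep the top_n general-domain sentence pairs most similar to the in-domain data."""
    return sorted(general_bitext, key=lambda p: bilingual_xent_diff(p, *lms))[:top_n]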

92 Multi-model and data selection based domain adaptation for ST Pipeline: spoken-language parallel data and general text parallel data are the inputs; data selection with the bilingual data selection metric produces pseudo spoken-language data; model training yields a spoken-language TM and a general TM; these are combined in a log-linear model: P(E|F) = (1/Z) exp(\sum_i lambda_i log f_i(E, F)). 92

93 Evaluation on commercial ST data English-to-Chinese translation; target: dialog-style travel-domain data. Training data available: 800K in-domain sentences; 12M general-domain sentences. Evaluation: performance on the target domain, and performance on the general domain, i.e., robustness outside the target domain. 93

94 Results and Analysis Results reported in BLEU %, shown as changes relative to the general model:
General model (dev: general): reference
Travel model (dev: travel): +6.1 (travel test), -8.0 (general test)
Multi-TMs (dev: travel): +5.9 (travel test), -4.9 (general test)
Multi-TMs (dev: tvl:gen = 1:1): +5.8 (travel test), -2.0 (general test)
Multi-TMs (dev: tvl:gen = 1:2): +4.0 (travel test), -0.8 (general test)
1) Using multiple translation models together as features helps. 2) The feature weights, determined by the dev set, play a critical role. 3) Big gain on the in-domain test (travel), robust on the out-of-domain test (general). (From He & Deng, 2011) 94

95 Case Studies
Source English: A glass of cold water, please. General model: 一杯冷水请. Travel-domain adapted model: 请来一杯冷的水.
Source English: Please keep the change. General model: 请保留更改. Travel-domain adapted model: 不用找了.
Source English: Thanks, this is for your tip. General model: 谢谢, 这是为您提示. Travel-domain adapted model: 谢谢, 这是给你的小费.
Source English: Do you know of any restaurants open now? General model: 你现在知道的任何打开的餐馆吗? Travel-domain adapted model: 你知道现在还有餐馆在营业的吗?
Source English: I'd like a restaurant with cheerful atmosphere. General model: 我想就食肆的愉悦的气氛. Travel-domain adapted model: 我想要一家气氛活泼的餐厅.
Note the improved translation of ambiguous words (e.g., open, tip, change) and the improved handling of colloquial-style grammar (e.g., "A glass of cold water, please."). 95

96 From domain adaptation to topic adaptation Motivation: topics change from talk to talk and from dialog to dialog, and the meanings of words change too. There is lots of out-of-domain data with broad coverage, but not all of it is relevant. How can we utilize the OOD corpus to enhance translation performance? Method: build a topic model on the target-domain data; select topic-relevant data from the OOD corpus; combine the topic-specific model with the general model at test time. 96

97 Analysis on IWSLT 11: topics of TED talks Build an LDA-based topic model on the TED data, estimated on the source side of the parallel data and restricted to 4 topics (TED only has 775 talks / 110K sentences). Are the topics meaningful? Some of the top keywords in each topic:
Topic 1 (design, computer, data, system, machine): technology
Topic 2 (Africa, dollars, business, market, food, China, society): global
Topic 3 (water, earth, universe, ocean, trees, carbon, environment): planet
Topic 4 (life, love, god, stories, children, music): abstract
Reasonable clustering. 97

98 Topic modeling, data selection, and multi-phrase-table decoding Training (for each topic): a topic model built on the TED parallel corpus provides topic allocations; topic-relevant data are selected from the UN parallel corpus and used to train a topic phrase table, alongside the general TED phrase table. Testing: the input sentence is given a topic allocation, and topic-specific multi-table decoding combines the tables in a log-linear model: P(E|F) = (1/Z) exp(\sum_i lambda_i log phi_i(E, F)). 98

99 Experiments on IWSLT 11/TED For each topic, select 250-400K sentences from the UN corpus and train a topic-specific phrase table. Evaluation results on IWSLT 11 (TED talk dataset), BLEU:
TED only (baseline): 11.3% (Dev 2010), 13.0% (Test 2011)
TED + UN-all: 11.3% (Dev 2010), 13.0% (Test 2011)
TED + UN-4 topics: 11.8% (Dev 2010), 13.5% (Test 2011)
Simply adding an extra UN-driven phrase table did not help; topic-specific multi-phrase-table decoding helps. (From Axelrod et al., 2012) 99

100 System Combination for MT Hypotheses from single systems: E_1: she bought the Jeep; E_2: she buys the SUV; E_3: she bought the SUV Jeep. Combined MT output: she bought the Jeep SUV. System combination: input is a set of translation hypotheses from multiple single MT systems for the same source sentence; output is a better translation derived from the input hypotheses. 100

101 Confusion Network Based Methods Hypotheses from single systems: E_1: she bought the Jeep; E_2: she buys the SUV; E_3: she bought the SUV Jeep. Combined MT output: she bought the SUV. Steps: 1) Select the backbone by minimum Bayes risk: E_B = argmin_{E' in E} \sum_{E in E} P(E | F) L(E', E), e.g., E_2 is selected. 2) Align the hypotheses. 3) Construct and decode the confusion network, e.g., with columns (she | she | she), (bought | buys | bought), (the | the | the), (Jeep | SUV | SUV), (eps | eps | Jeep), decoded as "she bought the SUV". 101
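A minimal sketch of step 1, MBR backbone selection; the TER-like loss is replaced here by a plain word-level edit distance, and the posteriors are illustrative assumptions.

def edit_distance(a, b):
    """Word-level Levenshtein distance, used as a stand-in loss L(E', E)."""
    a, b = a.split(), b.split()
    d = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, wb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (wa != wb))
    return d[-1]

def mbr_backbone(hypotheses, posteriors):
    """Pick E_B = argmin_{E'} sum_E P(E|F) * L(E', E)."""
    def risk(e_prime):
        return sum(p * edit_distance(e_prime, e) for e, p in zip(hypotheses, posteriors))
    return min(hypotheses, key=risk)

hyps = ["she bought the Jeep", "she buys the SUV", "she bought the SUV Jeep"]
print(mbr_backbone(hyps, posteriors=[0.4, 0.35, 0.25]))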

102 Hypothesis Alignment Hypothesis alignment is crucial. GIZA-based approach (Matusov et al., 2006): use GIZA as a generic tool to align hypotheses. TER-based alignment (Sim et al., 2007; Rosti et al., 2007): align one hypothesis to another such that the TER is minimized. HMM-based hypothesis alignment (He et al., 2008, 2009): a fine-grained statistical model; no training needed, as the HMM's parameters are derived from pre-trained bilingual word alignment models. ITG-based approach (Karakos et al., 2008). A recent survey and evaluation: (Rosti et al., 2012). 102

103 HMM based Hypothesis Alignment [Illustration: backbone E_B = e_1 e_2 e_3; hypothesis E_hyp = e'_1 e'_3 e'_2.] The HMM is built on the backbone side; the HMM aligns the hypothesis to the backbone; after alignment, a CN is built over the backbone positions e_1 e_2 e_3 with e'_1 e'_2 e'_3 filled into the corresponding columns. 103

104 HMM Parameter Estimation Emitting probability: P(e'_1 | e_1) models how likely e'_1 and e_1 have similar meanings (via words in the source sentence). Using the source word sequence {f_1, ..., f_M} as a hidden layer, P(e'_1 | e_1) takes a mixture-model form: P_src(e'_1 | e_1) = \sum_m w_m P(e'_1 | f_m), where w_m = P(f_m | e_1) / \sum_{m'} P(f_{m'} | e_1). P(e'_1 | f_m) is from the bilingual word alignment model in the F -> E direction; P(f_m | e_1) is from the E -> F direction. 104

105 HMM Parameter Estimation (cont.) Emitting probability via word surface similarity: a normalized similarity measure s, based on the Levenshtein distance or on the matched prefix length, with s(e'_1, e_1) normalized to [0, 1]; an exponential mapping gives P_simi(e'_1 | e_1) = exp[rho (s(e'_1, e_1) - 1)], e.g., rho = 3. Overall emitting probability: P(e'_1 | e_1) = alpha P_src(e'_1 | e_1) + (1 - alpha) P_simi(e'_1 | e_1). 105

106 HMM Parameter Estimation (cont.) Transition probability: P(a_j | a_{j-1}) models word ordering and takes the same form as in a bilingual word alignment HMM: P(a_j = i | a_{j-1} = i') = d(i, i') / \sum_{i''} d(i'', i'), with distortion d(i, i') = (1 + |i - i'|)^{-kappa}, e.g., kappa = 2, where a_j is the alignment of the j-th word. It strongly encourages monotonic word ordering while still allowing non-monotonic word ordering. 106
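A minimal sketch of the HMM parameters defined on the last two slides: the mixed emission probability and the distance-based transition probability. The bilingual lexicon lookups, smoothing constants, and default alpha are illustrative assumptions.

import math

def prefix_similarity(a, b):
    """Normalized matched-prefix-length similarity in [0, 1]."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n / max(len(a), len(b), 1)

def p_simi(e_hyp, e_bb, rho=3.0):
    """Surface-similarity emission: exp[rho * (s - 1)]."""
    return math.exp(rho * (prefix_similarity(e_hyp, e_bb) - 1.0))

def p_src(e_hyp, e_bb, src_words, p_e_given_f, p_f_given_e):
    """Source-word-mixture emission: P_src(e'|e) = sum_m w_m * P(e'|f_m), w_m ~ P(f_m|e)."""
    weights = [p_f_given_e.get((f, e_bb), 1e-9) for f in src_words]
    z = sum(weights)
    return sum((w / z) * p_e_given_f.get((e_hyp, f), 1e-9)
               for f, w in zip(src_words, weights))

def emission(e_hyp, e_bb, src_words, p_e_given_f, p_f_given_e, alpha=0.7):
    """Overall emission: interpolation of the source-mixture and surface-similarity models."""
    return (alpha * p_src(e_hyp, e_bb, src_words, p_e_given_f, p_f_given_e)
            + (1 - alpha) * p_simi(e_hyp, e_bb))

def transition(i, i_prev, backbone_len, kappa=2.0):
    """P(a_j = i | a_{j-1} = i_prev) with distortion d(i, i') = (1 + |i - i'|)^(-kappa)."""
    d = lambda a, b: (1.0 + abs(a - b)) ** (-kappa)
    z = sum(d(k, i_prev) for k in range(1, backbone_len + 1))
    return d(i, i_prev) / z

print(transition(2, 1, backbone_len=3))   # near-monotonic jumps get higher probability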

107 Find the Optimal Alignment Viterbi decoding: \hat{a}_1^J = argmax_{a_1^J} \prod_{j=1}^{J} p(a_j | a_{j-1}, I) p(e'_j | e_{a_j}). Other variations: posterior probability & threshold based decoding; max posterior mode decoding. 107

108 Decode the Confusion Network Log-linear model based decoding (Rosti et al. 2007) incorporates multiple features (e.g., voting, LM, length): E* = argmax_E ln P(E | F), where ln P(E | F) combines the per-column arc posteriors \sum_{l=1}^{L} ln P_S(e_l | F) with the language model score ln P_LM(E) and other features. Confusion network decoding example (L = 5), columns e_1 ... e_5: (he | he | it | he), (have | has | eps | has), (eps | eps | a | a), (good | nice | nice | eps), (car | sedan | car | sedan). 108

109 Results on 2008 NIST Open MT Eval
The MSR-NRC-SRI entry for Chinese-to-English
[Chart: case-sensitive BLEU of the individual systems vs. the combined system; individual systems are combined incrementally over time, and the combined system achieved the best score in the NIST08 C2E track]
(From He et al., 2008)

110 Related Approaches & Extensions
Incremental hypothesis alignment (Rosti et al., 2008; Li et al., 2009; Karakos et al., 2010)
Other approaches:
  Joint decoding (He and Toutanova, 2009)
  Phrase-level system combination (Feng et al., 2009)
  System combination as target-to-target decoding (Ma & McKeown, 2012)
Survey and evaluation (Rosti et al., 2012)

111 Comparison of Aligners
Results reported in BLEU% (best scores in bold); significance group IDs are marked as superscripts on the results
[Table: BLEU% of the best single system and of system combination with different aligners (GIZA-, TER-, ITG- and IHMM-based, including incremental variants) on NIST MT09 Arabic-English, WMT11 German-English, and WMT11 Spanish-English; numeric scores omitted]
(From Rosti, He, Karakos, Leusch, et al., 2012)

112 Error Analysis
Per-sentence errors were analyzed
A paired Wilcoxon test measured the probability that the error reduction relative to the best system was real
Observations:
  All aligners significantly reduce substitution/shift errors
  No clear trend for insertions/deletions, which are sometimes worse than in the best system
  Further analysis showed that agreement on non-NULL words is an important measure of alignment quality, while agreement on NULL tokens is not

113 Further Reading
Y. Feng, Y. Liu, H. Mi, Q. Liu, and Y. Lü, 2009. Lattice-based system combination for statistical machine translation. In Proceedings of EMNLP.
X. He and K. Toutanova, 2009. Joint Optimization for Machine Translation System Combination. In Proceedings of EMNLP.
X. He, M. Yang, J. Gao, P. Nguyen, and R. Moore, 2008. Indirect-HMM-based Hypothesis Alignment for Combining Outputs from Machine Translation Systems. In Proceedings of EMNLP.
D. Karakos, J. Eisner, S. Khudanpur, and M. Dreyer, 2008. Machine Translation System Combination using ITG-based Alignments. In Proceedings of ACL-HLT.
D. Karakos, J. Smith, and S. Khudanpur, 2010. Hypothesis ranking and two-pass approaches for machine translation system combination. In Proceedings of ICASSP.
C. Li, X. He, Y. Liu, and N. Xi, 2009. Incremental HMM Alignment for MT System Combination. In Proceedings of ACL-IJCNLP.
W. Ma and K. McKeown, 2012. Phrase-level System Combination for Machine Translation Based on Target-to-Target Decoding. In Proceedings of AMTA.
E. Matusov, N. Ueffing, and H. Ney, 2006. Computing consensus translation from multiple machine translation systems using enhanced hypotheses alignment. In Proceedings of EACL.
A. Rosti, B. Xiang, S. Matsoukas, R. Schwartz, N. Fazil Ayan, and B. Dorr, 2007. Combining outputs from multiple machine translation systems. In Proceedings of NAACL-HLT.
A. Rosti, B. Zhang, S. Matsoukas, and R. Schwartz, 2008. Incremental hypothesis alignment for building confusion networks with application to machine translation system combination. In Proceedings of WMT.
A. Rosti, X. He, D. Karakos, G. Leusch, Y. Cao, M. Freitag, S. Matsoukas, H. Ney, J. Smith, and B. Zhang, 2012. Review of Hypothesis Alignment Algorithms for MT System Combination via Confusion Network Decoding. In Proceedings of WMT.
K. Sim, W. Byrne, M. Gales, H. Sahbi, and P. Woodland, 2007. Consensus network decoding for statistical machine translation system combination. In Proceedings of ICASSP.
A. Axelrod, X. He, and J. Gao, 2011. Domain Adaptation via Pseudo In-Domain Data Selection. In Proceedings of EMNLP.
A. Bisazza, N. Ruiz, and M. Federico, 2011. Fill-up versus Interpolation Methods for Phrase-based SMT Adaptation. In Proceedings of IWSLT.
G. Foster and R. Kuhn, 2007. Mixture-Model Adaptation for SMT. In Proceedings of the ACL Workshop on Statistical Machine Translation.
G. Foster, C. Goutte, and R. Kuhn, 2010. Discriminative Instance Weighting for Domain Adaptation in Statistical Machine Translation. In Proceedings of EMNLP.
B. Haddow and P. Koehn, 2012. Analysing the effect of out-of-domain data on SMT systems. In Proceedings of WMT.
X. He and L. Deng, 2011. Robust Speech Translation by Domain Adaptation. In Proceedings of Interspeech.
P. Koehn and J. Schroeder, 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of WMT.
S. Mansour, J. Wuebker, and H. Ney, 2012. Combining Translation and Language Model Scoring for Domain-Specific Data Filtering. In Proceedings of IWSLT.
R. Moore and W. Lewis, 2010. Intelligent Selection of Language Model Training Data. In Proceedings of ACL.
P. Nakov, 2008. Improving English-Spanish statistical machine translation: experiments in domain adaptation, sentence paraphrasing, tokenization, and recasing. In Proceedings of WMT.

114 Summary
Fundamental concepts and technologies in speech translation
Theory of speech translation from a joint modeling perspective
In-depth discussions on domain adaptation and system combination
Practical evaluations on major ST/MT benchmarks

115 ST Future Directions (in addition to ASR and SMT improvements)
Overcome the low-resource problem: how to rapidly develop ST with a limited amount of data, and how to adapt SMT for ST?
Confidence measurement: reduce misleading dialogue turns and lower the risk of miscommunication; complicated by the combination of two sources of uncertainty, ASR and SMT
Active vs. passive ST: be more active, e.g., actively clarify or warn the user when confidence is low; develop suitable dialogue strategies for ST
Context-aware ST: ST on smartphones (e.g., exploiting location and conversation history)
End-to-end modeling of speech translation with new features: e.g., prosodic acoustic features for translation via end-to-end modeling
Other applications of speech translation: e.g., cross-lingual spoken language understanding

116 Further Reading About this Tutorial
This tutorial is mainly based on:
  Bowen Zhou, Statistical Machine Translation for Speech: A Perspective on Structure, Learning and Decoding, in Proceedings of the IEEE, May 2013
  Xiaodong He and Li Deng, Speech-Centric Information Processing: An Optimization-Oriented Approach, in Proceedings of the IEEE, May 2013
Both provide references for further reading
Papers and charts are available on the authors' webpages

117 Resources (play it yourself)
You also need an ASR system: HTK, Kaldi, HTS (TTS for speech-to-speech)
MT toolkits (word alignment, decoder, MERT): GIZA++, Moses, cdec
Benchmark data sets: IWSLT, WMT (includes Europarl), NIST Open MT

118 Q & A
Bowen Zhou, Research Manager, IBM Watson Research Center
Xiaodong He, Researcher, Microsoft Research Redmond; Affiliate Professor, University of Washington, Seattle
