SPEECH TRANSLATION: THEORY AND PRACTICES


1 SPEECH TRANSLATION: THEORY AND PRACTICES Bowen Zhou and Xiaodong He, May 27th, ICASSP 2013

2 Universal Translator: dream (will) come true or yet another over-promise? Spoken communication without a language barrier is mankind's long-standing dream: translate human speech in one language into another (text or speech). Automatic Speech Recognition: ASR; (Statistical) Machine Translation: (S)MT. We have been promised this by sci-fi (Star Trek) for 5 decades. Serious research efforts started in the early 1990s, when statistical ASR (e.g., HMM-based) started showing dominance and SMT was emerging. 2

3 ST Research Evolves Over 2 Decades: from limited domains toward broad domains/large vocabulary, and from formal/concise desktop/laptop settings toward conversational/interactive mobile/cloud settings.
C-STAR: one/two-way; limited-domain spontaneous speech
VERBMOBIL: one-way; limited-domain conversational speech
DARPA BABYLON: two-way; limited-domain spontaneous speech; laptop/handheld
DARPA TRANSTAC: two-way; small-domain spontaneous dialog; mobile/smartphone
IWSLT: one-way; limited-domain dialog and unlimited free-style talk
TC-STAR: one-way; broadcast speech, political speech; server-based
DARPA GALE: one-way; broadcast speech, conversational speech; server-based
DARPA BOLT: one/two-way; conversational speech; server-based
3

4 Are we closer today? Speech translation demos: completely smartphone-hosted speech-to-speech translation (IBM Research); Rashid's live demo (MSR) in the 21st CCC keynote. Statistical approaches to both ASR and SMT are among the keys to this success: large data, discriminative training, better models (e.g., DNN for ASR), etc. 4

5 Our Objectives 1. Review & analyze state-of-the-art SMT theory from ST's perspective 2. Unify ASR and SMT to catalyze joint ASR/SMT for improved ST 3. ST's practical issues & future research topics 5

6 In this talk 1. Overview: ASR, SMT, ST, Metric 2. Learning Problems in SMT 3. Translation Structures for ST [20 min coffee break] 4. Decoding: A Unified Perspective for ASR/SMT 5. Coupling ASR/SMT: Decoding & Modeling 6. Practices 7. Future Directions. Technical papers accompanying this tutorial: Bowen Zhou, Statistical Machine Translation for Speech: A Perspective on Structure, Learning and Decoding, in Proceedings of the IEEE, May 2013; Xiaodong He and Li Deng, Speech-Centric Information Processing: An Optimization-Oriented Approach, in Proceedings of the IEEE, May 2013. Papers are available on the authors' webpages; charts coming up soon. 6

7 Overview & Backgrounds 7

8 Speech Translation Process Input: a source speech signal sequence x_1^T = x_1, ..., x_T. ASR: recognizes it as a set of source word sequences {f_1^J = f_1, ..., f_J}. SMT: translates into the target-language word sequence e_1^I = e_1, ..., e_I:
\hat{e}_1^I = argmax_{e_1^I} P(e_1^I | x_1^T) = argmax_{e_1^I} \sum_{f_1^J} P(e_1^I | f_1^J) P(x_1^T | f_1^J) P(f_1^J)
8

9 First comparison: ASR vs. SMT Connection: both are sequential pattern recognition, determining a sequence of symbols that is regarded as the optimal equivalent in the target domain. ASR: x_1^T -> f_1^J; SMT: f_1^J -> e_1^I. Hence, many techniques are closely related. Difference: ASR is monotonic but SMT is not, so different modeling/decoding formalisms are required. This is one of the key issues we address in this tutorial. 9

10 ASR in a Nutshell ASR decoding: \hat{f}_1^J = argmax_{f_1^J} P(x_1^T | f_1^J) P(f_1^J). Acoustic models (HMM): P(x_1^T | f_1^J) = \sum_{q_1^T} \prod_t p(q_t | q_{t-1}) p(x_t | q_t), where p(x_t | q_t) is given by Gaussian mixture models or a NN. Language models (N-gram): p(f_1^J) = \prod_{j=1}^{J} p(f_j | f_{j-N+1}, ..., f_{j-1}). 10

11 ASR Structures: A Finite-State Problem Search on a composed finite-state space (graph): \hat{f}_1^J = best_path(O o H o C o L o G). G: a weighted acceptor that assigns language model probabilities. L: transduces context-independent phonetic sequences into words. (H: HMM topology; C: context dependency; O: input observations.) 11

12 A Brief History of SMT MT research traces back to the 1950s. Statistical approaches started from Brown et al. (1993), a pioneering work based on the source-channel model; the same approach had succeeded for ASR at IBM and elsewhere. A good example of the speech and language communities converging to drive the transition from rationalism to empiricism (Church & Mercer, 1993; Knight, 1999). A sequence of unsupervised word alignment models, known today as IBM Models 1-5 (Brown et al., 1993). Much progress has been made since then; these are key topics for today's talk. 12

13 A Bird's-Eye View of SMT 13

14 How do we know Translation X is better than Y? It is a hard problem by itself: there is more than one good translation, and even more bad ones. Assumption: the closer to human translation(s), the better. Metrics were proposed to measure this closeness, e.g., BLEU, which measures n-gram precisions (Papineni et al., 2002): BLEU-4 = BP * exp(\sum_{n=1}^{4} (1/4) log p_n). ST measurement is more complicated. Objective: WER + BLEU (or METEOR, TER, etc.). Subjective: HTER (GALE/BOLT), concept transfer rate (TransTac/BOLT). 14
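To make the BLEU-4 definition above concrete, the following is a minimal sketch of sentence-level BLEU with the brevity penalty; the function names, the uniform 1/4 weights, and the lack of smoothing are illustrative assumptions, not a reference implementation of any particular toolkit.

import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(hypothesis, reference):
    """Minimal sentence-level BLEU-4: BP * exp(sum_n 0.25 * log p_n)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, 5):
        hyp_ngrams = ngram_counts(hyp, n)
        ref_ngrams = ngram_counts(ref, n)
        # Clipped n-gram matches, as in modified n-gram precision.
        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        p_n = matches / total
        if p_n == 0:          # zero precision: BLEU is 0 (no smoothing in this sketch)
            return 0.0
        log_prec_sum += 0.25 * math.log(p_n)
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec_sum)

print(bleu4("the initial experiments are done", "the initial experiments are done"))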

15 Further Reading ASR Backgrounds: L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE. F. Jelinek. Statistical Methods for Speech Recognition. MIT Press. J. M. Baker, L. Deng, J. Glass, S. Khudanpur, C.-H. Lee, N. Morgan and D. O'Shaughnessy. Research developments and directions in speech recognition and understanding. IEEE Signal Processing Mag., vol. 26. SMT Backgrounds: P. Brown, S. Pietra, V. Pietra, and R. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, vol. 19, 1993. K. Knight. A Statistical MT Tutorial Workbook. SMT Metrics: K. Papineni, S. Roukos, T. Ward and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proc. ACL. M. Snover, B. Dorr, R. Schwartz, L. Micciulla and J. Makhoul. A study of translation edit rate with targeted human annotation. In Proc. AMTA. S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proc. Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization. C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki and O. F. Zaidan. Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. In Proc. Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR. History of MT and SMT: J. Hutchins. Machine translation: past, present, future. Chichester, Ellis Horwood, Chap. 2. K. W. Church and R. L. Mercer. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics 19.1 (1993).

16 Learning Problems in SMT 1. Overview: ASR, SMT, ST, Metric 2. Learning Problems in SMT 3. Translation Structures for ST 4. Decoding: A Unified Perspective ASR/SMT 5. Coupling ASR/SMT: Decoding & Modeling 6. Practices 7. Summary and Future Directions 16

17 Learning Translation Models Word alignment From words to phrases Log-linear model framework to integrate component models (a.k.a. features) Log-linear model training to optimize translation metrics (e.g., MERT) 17

18 Word Alignment Given a set of parallel sentence pairs (e.g., ENU-CHS), find the word-to-word alignment between each sentence pair. English: Finish the task of recruiting students Chinese: 完成招生工作 word translation models 18

19 HMM for Word Alignment Each word in the Chinese sentence is modeled as an HMM state. The English words, the observations, are generated by the HMM one by one. finish the task of recruiting students 完成招生工作 NULL. The generative story for word alignment: at each step, the current state (a Chinese word) emits one English word, then jumps to the next state. Similar to ASR, an HMM can be used to model the word-level translation process. Unlike ASR, note the non-monotonic jumping between states, and the NULL state. (Vogel et al., 1996) 19

20 Formal Definition f_1^J = f_1, ..., f_J: source sentence (observations); f_j: source word at position j; J: length of the source sentence. e_1^I = e_1, ..., e_I: target sentence (states of the HMM); e_i: target word at position i; I: length of the target sentence. a_1^J = a_1, ..., a_J: alignment (state sequence); a_j in [1, I]: f_j -> e_{a_j}. p(f|e): word translation probability. 20

21 HMM Formulation Given f_1^J and e_1^I, the alignment a_1^J is treated as a hidden variable: p(f_1^J | e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J} p(a_j | a_{j-1}, I) p(f_j | e_{a_j}). Model assumptions: the emission probability p(f_j | e_{a_j}) models the word-to-word translation and depends only on the target word; the transition probability p(a_j | a_{j-1}, I) models the distortion of word order and depends only on the position of the last state (relative distortion). 21

22 ML Training of HMM Maximum likelihood training: \hat{\Lambda}_{ML} = argmax_{\Lambda} p(f_1^J | e_1^I; \Lambda), where \Lambda = {p(a_j | a_{j-1}, I), p(f_j | e_{a_j})}. Expectation-Maximization (EM) training for \Lambda; an efficient Forward-Backward algorithm exists. 22

23 Find the Optimal Alignment Viterbi decoding: \hat{a}_1^J = argmax_{a_1^J} \prod_{j=1}^{J} p(a_j | a_{j-1}, I) p(f_j | e_{a_j}). Other variations: posterior-probability-based decoding; max posterior mode decoding.

24 IBM Models 1-5 A series of generative models (Brown et al., 1993), summarized below (along with the HMM):
IBM-1: convex (non-strictly); efficient E-M training; does not model word-order distortion.
HMM: relative distortion model (captures the intuition that words move in groups); efficient E-M training algorithm.
IBM-2: IBM-1 plus an absolute distortion model (measures the divergence of the target word's position from the ideal monotonic alignment position).
IBM-3: IBM-2 plus a fertility model (models the number of words that a state generates); approximate parameter estimation due to the fertility model; costly training.
IBM-4: like IBM-3, but uses a relative distortion model.
IBM-5: addresses the model deficiency issue (IBM-3 and IBM-4 are deficient).
24
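As a concrete illustration of how such alignment models are estimated, below is a minimal sketch of EM for the simplest case, IBM Model 1 (no distortion model); the variable names and the tiny toy corpus are illustrative assumptions only, and the HMM model adds a forward-backward pass over alignments on top of the same expectation/normalization cycle.

from collections import defaultdict

def ibm1_em(bitext, iterations=10):
    """Toy EM for IBM Model 1: estimate p(f|e) from pairs (e_words, f_words)."""
    # Uniform initialization over all co-occurring word pairs.
    t = defaultdict(float)
    for e_words, f_words in bitext:
        for f in f_words:
            for e in e_words + ["NULL"]:
                t[(f, e)] = 1.0
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for e_words, f_words in bitext:
            states = e_words + ["NULL"]
            for f in f_words:
                # E-step: posterior of each state word generating f.
                z = sum(t[(f, e)] for e in states)
                for e in states:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalize to get new translation probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

bitext = [("完成 工作".split(), "finish the task".split()),
          ("招生".split(), "recruiting students".split())]
t = ibm1_em(bitext)
print(max(t, key=t.get))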

25 Other Word Alignment Models
Extensions of the HMM (Toutanova et al. 2002; Liang et al. 2006; Deng & Byrne 2005; He 2007): model fertility implicitly; regularize the alignment by agreement of the two directions; better distortion models.
Discriminative models (Moore et al. 2006; Taskar et al. 2005): large linear models with many features; discriminative learning based on annotation.
Syntax-driven models (Zhang & Gildea 2005; Fossum et al. 2008; Haghighi et al. 2009): use syntactic features and/or syntactically structured models.
25

26 From Word to Phrase Translation Extract phrase translation pairs from the word alignment, e.g.: 完成 -> finish; 招生 -> recruiting students; 工作 -> task, the task; 招生工作 -> task of recruiting students. (Och & Ney 2004; Koehn, Och and Marcu, 2003) 26

27 Common Phrase Translation Models
Forward phrase translation model: p(e_bar | f_bar) = #(e_bar, f_bar) / #(*, f_bar); sentence-level score h_fp = \prod_k p(e_bar_k | f_bar_k).
Forward lexical translation model: p(e | f) = gamma(e, f) / gamma(*, f); sentence-level score h_fl = \prod_k \prod_m \sum_n p(e_{k,m} | f_{k,n}).
Backward phrase translation model: reverse of the forward counterpart; h_bp = \prod_k p(f_bar_k | e_bar_k).
Backward lexical translation model: reverse of the forward counterpart; h_bl = \prod_k \prod_n \sum_m p(f_{k,n} | e_{k,m}).
27
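A minimal sketch of the forward phrase translation model above, estimated by relative frequency over extracted phrase pairs; the data structures and the toy pair list are illustrative assumptions.

from collections import Counter, defaultdict

def estimate_forward_phrase_table(extracted_pairs):
    """p(target_phrase | source_phrase) = #(target, source) / #(*, source)."""
    pair_counts = Counter(extracted_pairs)                   # counts #(e, f)
    source_counts = Counter(f for _, f in extracted_pairs)   # counts #(*, f)
    table = defaultdict(dict)
    for (e, f), c in pair_counts.items():
        table[f][e] = c / source_counts[f]
    return table

# Toy phrase pairs (target, source) harvested from word-aligned sentences.
pairs = [("finish", "完成"), ("finish", "完成"), ("complete", "完成"),
         ("recruiting students", "招生")]
table = estimate_forward_phrase_table(pairs)
print(table["完成"])   # {'finish': 0.666..., 'complete': 0.333...}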

28 Integration of All Component Models Use a log-linear model to integrate all component models (also known as features): P(E|F) = (1/Z) exp(\sum_i lambda_i h_i(E, F)), where {h_i} are features and lambda_i are the feature weights (Och & Ney 2002). Common features include: phrase translation models (e.g., the four models discussed before); one or more language models in the target language; counts of words and phrases; a distortion model that models word ordering. All features need to be decomposable to the word/n-gram/phrase level. Select the translation with the best integrated score: \hat{E} = argmax_E P(E|F) = argmax_E \sum_i lambda_i h_i(E, F). 28
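A minimal sketch of scoring one hypothesis with such a log-linear combination; the particular feature names and weight values are illustrative assumptions.

def loglinear_score(features, weights):
    """Unnormalized log-linear score sum_i lambda_i * h_i(E, F); argmax over E is unaffected by Z."""
    return sum(weights[name] * value for name, value in features.items())

# Toy feature values h_i(E, F) for one hypothesis (log-probabilities and counts).
features = {"log_p_phrase_fwd": -2.3, "log_p_lm": -5.1, "word_count": 6.0, "distortion": -1.0}
weights = {"log_p_phrase_fwd": 1.0, "log_p_lm": 0.8, "word_count": 0.2, "distortion": 0.6}
print(loglinear_score(features, weights))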

29 Training Feature Weights Denote the weight set {lambda_i} by lambda; lambda is trained to optimize translation performance: \hat{lambda} = argmax_{lambda} BLEU(\hat{E}(lambda, F), E), where \hat{E}(lambda, F) = argmax_E \sum_i lambda_i h_i(E, F) and E denotes the reference translations. A non-convex problem! (Och 2003) 29

30 Minimum Error Rate Training Given an n-best list {E}, find the best lambda, i.e., the lambda under which the top-scored E gives the best BLEU.
n-best list: E_1 with features (v_{1,1}, v_{1,2}, v_{1,3}) and BLEU B_1; E_2 with (v_{2,1}, v_{2,2}, v_{2,3}) and B_2; E_3 with (v_{3,1}, v_{3,2}, v_{3,3}) and B_3.
With S = \sum_i lambda_i h_i and all weights but lambda_1 held fixed: S(E_1) = v_{1,1} lambda_1 + C_1; S(E_2) = v_{2,1} lambda_1 + C_2; S(E_3) = v_{3,1} lambda_1 + C_3 (illustration of the MERT algorithm: score lines intersecting at key values a, b, c along lambda_1).
MERT (Och 2003): the score of each E is linear in the lambda of interest. One only needs to check the few key values of lambda where the lines intersect, since the ranking of hypotheses does not change between two adjacent key values: e.g., when lambda_1 < a, E_3 is top-scored; when a < lambda_1 < c, E_2 is top-scored. 30
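A minimal sketch of the 1-D line search at the heart of MERT: for one feature weight, collect the intersection points of the per-hypothesis score lines, then evaluate the error (here a stand-in per-sentence error function rather than corpus BLEU) once per interval. The data layout, probing offsets, and toy error function are illustrative assumptions.

def mert_line_search(nbest, grad_index, fixed_score, error):
    """nbest: list of (feature_vector, stats); vary only the weight at grad_index.

    Each hypothesis score is a line: slope * lam + intercept, where
    slope = h[grad_index] and intercept = the score of the remaining features.
    """
    lines = [(h[grad_index], fixed_score(h), stats) for h, stats in nbest]
    # Candidate thresholds: intersections of every pair of score lines.
    points = set()
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            (a1, b1, _), (a2, b2, _) = lines[i], lines[j]
            if a1 != a2:
                points.add((b2 - b1) / (a1 - a2))
    best_lam, best_err = 0.0, float("inf")
    for x in sorted(points):
        for lam in (x - 1e-3, x + 1e-3):    # probe just left/right of each intersection
            top = max(lines, key=lambda l: l[0] * lam + l[1])
            err = error(top[2])
            if err < best_err:
                best_lam, best_err = lam, err
    return best_lam, best_err

nbest = [([1.0, 0.5], {"errors": 2}), ([0.2, 0.9], {"errors": 1}), ([0.6, 0.1], {"errors": 3})]
lam, err = mert_line_search(nbest, grad_index=0,
                            fixed_score=lambda h: 0.7 * h[1],
                            error=lambda stats: stats["errors"])
print(lam, err)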

31 More Advanced Training Use a large set of (sparse) features E.g., integrate lexical, POS, syntax features MERT is not effective anymore, use MIRA, PRO etc. for training of feature weights (Watanabe et al., 2007, Chiang et al. 2009, Hopkins & May 2011, Simianer et al. 2012) Better estimation of (dense) translation models E-M style phrase translation model training (Wuebker et al. 2010, Huang & Zhou 2009) Train the translation probability distributions discriminatively (He & Deng 2012, Setiawan & Zhou 2013) A mix of the above two approaches Build a set of new features for each phrase pair, and train them discriminatively by max expected BLEU (Gao & He 2013) 31

32 Further Reading P. Brown, S. D. Pietra, V. J. D. Pietra, and R. L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics. D. Chiang, K. Knight and W. Wang. 11,001 new features for statistical machine translation. In Proceedings of NAACL-HLT. Y. Deng and W. Byrne, 2005, HMM Word and Phrase Alignment For Statistical Machine Translation, in Proceedings of HLT/EMNLP. V. Fossum, K. Knight and S. Abney. Using Syntax to Improve Word Alignment Precision for Syntax-Based Machine Translation. In Proceedings of the WMT. J. Gao and X. He, 2013, Training MRF-Based Phrase Translation Models using Gradient Ascent, in Proceedings of NAACL. A. Haghighi, J. Blitzer, J. DeNero, and D. Klein. Better word alignments with supervised ITG models. In Proceedings of ACL. X. He and L. Deng, 2012, Maximum Expected BLEU Training of Phrase and Lexicon Translation Models, in Proceedings of ACL. M. Hopkins and J. May. Tuning as ranking. In Proceedings of EMNLP. S. Huang and B. Zhou. An EM algorithm for SCFG in formal syntax-based translation. In Proceedings of ICASSP. P. Koehn, F. Och, and D. Marcu. Statistical Phrase-Based Translation. In Proceedings of HLT-NAACL. P. Liang, B. Taskar, and D. Klein, 2006, Alignment by Agreement, in Proceedings of NAACL. R. Moore, W. Yih and A. Bode, 2006, Improved Discriminative Bilingual Word Alignment, In Proceedings of COLING/ACL. F. Och and H. Ney. Discriminative training and Maximum Entropy Models for Statistical Machine Translation, In Proceedings of ACL. F. Och, 2003, Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of ACL. H. Setiawan and B. Zhou. Discriminative Training of 150 Million Translation Parameters and Its Application to Pruning, NAACL. P. Simianer, S. Riezler, and C. Dyer. Joint feature selection in distributed stochastic learning for large-scale discriminative training in SMT. In Proceedings of ACL. K. Toutanova, H. T. Ilhan, and C. D. Manning. Extensions to HMM-based Statistical Word Alignment Models. In Proceedings of EMNLP. S. Vogel, H. Ney, and C. Tillmann. HMM-based Word Alignment In Statistical Translation. In Proceedings of COLING. T. Watanabe, J. Suzuki, H. Tsukuda, and H. Isozaki. Online large-margin training for statistical machine translation. In Proc. EMNLP. J. Wuebker, A. Mauser and H. Ney. Training phrase translation models with leaving-one-out, In Proceedings of ACL. H. Zhang and D. Gildea, 2005, Stochastic Lexicalized Inversion Transduction Grammar for Alignment, In Proceedings of ACL. 32

33 Translation Structures for ST 1. Overview: ASR, SMT, ST, Metric 2. Learning Problems in SMT 3. Translation Structures for ST 4. Decoding: A Unified Perspective ASR/SMT 5. Coupling ASR/SMT: Decoding & Modeling 6. Practices 7. Future Directions 33

34 Translation Equivalents (TE) Usually represented by some synchronous (source and target) grammar. The grammar choice is limited by two factors. Expressiveness: is it adequate to model linguistic equivalence between natural language pairs? Computational complexity: is it practical to build machine translation solutions upon it? Need to balance both, particularly for ST: robust to informal spoken language; critical speed requirement due to ST's interactive nature; additional bonus if it permits effective and efficient integration with ASR. Examples: 初步 X_1 -> initial X_1; X_1 的 X_2 -> The X_2 of X_1. 34

35 Two Dominating Categories of TEs Finite-state-based formalism: e.g., phrase-based SMT (Och and Ney, 2004; Koehn et al., 2003). Synchronous context-free grammar-based formalism: hierarchical phrase-based (Chiang, 2007); tree-to-string (e.g., Quirk et al., 2005; Huang et al., 2006); forest-to-string (Mi et al., 2008); string-to-tree (e.g., Galley et al., 2004; Shen et al., 2008); tree-to-tree (e.g., Eisner 2003; Cowan et al., 2006; Zhang et al., 2008; Chiang 2010). Exceptions: STAG (DeNeefe and Knight, 2009); Tree Sequences (Zhang et al., 2008). Examples: 初步 X_1 -> initial X_1; X_1 的 X_2 -> The X_2 of X_1. 35

36 Phrase-based SMT: a generative process 1. The input sentence f_1^J is segmented into K phrases. 2. Permute the source phrases into the appropriate order. 3. Translate each source phrase into a target phrase. 4. Concatenate target phrases to form the target sentence. 5. Score the target sentence with the target language model. Expressiveness: any sequence of translations is possible if arbitrary reordering is permitted. Complexity: arbitrary reordering leads to an NP-hard problem (Knight, 1999). 36

37 Reordering in Phrasal SMT To constrain the search space, reordering is usually limited by a preset maximum reordering window and skip size (Zens et al., 2004). Reordering models usually penalize non-monotonic movement, proportionally to the jump distance, further refined by lexicalization, i.e., the degree of penalization varies with the surrounding words (Koehn et al. 2007). Reordering moves whole units at the phrase level, i.e., no gaps are allowed in reordering movement. 37

38 Backgrounds: Semiring A semiring is a 5-tuple (Psi, +, x, 0-bar, 1-bar) (Mohri, 2002), where Psi is a set and 0-bar, 1-bar are in Psi, with two closed and associative operators. The product (x) computes the weight of a path (a sequence of edges): 0-bar x a = 0-bar (absorbing), 1-bar x a = a (unit), for all a in Psi. The sum (+) combines the weights of alternative paths (e.g., to pick an optimal one): 0-bar + a = a (unit). Examples used in speech & language: the Viterbi semiring ([0,1], max, *, 0, 1), defined over probabilities; the Tropical semiring (R_+ with +inf, min, +, +inf, 0), which operates on non-negative weights, e.g., negative log probabilities, a.k.a. costs. 38
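A minimal sketch of the two semirings named above, showing how the same path-weight computation is parameterized by the (sum, product) pair; the class layout is an illustrative assumption.

from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable     # combine alternative paths
    times: Callable    # extend a path by one edge
    zero: float        # identity for plus, absorbing for times
    one: float         # identity for times

# Viterbi semiring: probabilities, best path = max of products.
VITERBI = Semiring(plus=max, times=lambda a, b: a * b, zero=0.0, one=1.0)
# Tropical semiring: costs (negative log probs), best path = min of sums.
TROPICAL = Semiring(plus=min, times=lambda a, b: a + b, zero=float("inf"), one=0.0)

def path_weight(edge_weights, sr):
    """Weight of one path = times-product of its edge weights."""
    w = sr.one
    for x in edge_weights:
        w = sr.times(w, x)
    return w

def best_of(paths, sr):
    """Combine alternative path weights with the plus operator (max or min)."""
    total = sr.zero
    for p in paths:
        total = sr.plus(total, path_weight(p, sr))
    return total

print(best_of([[0.5, 0.4], [0.9, 0.1]], VITERBI))   # 0.2
print(best_of([[0.7, 0.9], [0.1, 2.3]], TROPICAL))  # 1.6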

39 Backgrounds: Automata/WFST [Figure: a weighted acceptor encoding a 3-gram LM, and an example weighted transducer.] 39

40 Backgrounds: Formal Languages A finite-state automaton (FSA) is equivalent to a regular grammar, and to a regular language. Weighted finite-state machines are closed under the composition operation. A regular grammar constitutes a strict subset of context-free grammars (CFG) (Hopcroft and Ullman, 1979). The composition of a CFG with an FSM is guaranteed to be a CFG. 40

41 FST-based translation equivalence Source-target equivalence is modeled by a weighted non-nested mapping between word strings. The unit of mapping, (e_i, f_i), is a modeling choice, and a phrase is better than a word due to: reduced word-sense ambiguity with surrounding contexts; appropriate local reordering encoded in the source-target phrase pair. [Arcs labeled: src phrase id : tgt phrase id / cost] 41

42 Phrase-based SMT is Finite-State Each step relates input and output as WFST operations: F: input sentence; P: segmentation; R: permutation; T: translation (phrasal equivalence); W: concatenation; G: target language model. Doing all steps amounts to composing the above WFSTs, guaranteed to produce an FSM since WFSTs are closed under such composition. Similar to ASR, phrase-based SMT amounts to the following operation, computable with a general-purpose FST toolkit (Kumar et al. 2005): E = best_path(F o P o R o T o W o G). 42

43 Why do we care? It is not only a theoretical convenience that conceptually connects ASR and SMT better; it is practically useful for leveraging mature WFST optimization algorithms, and it suggests an integrated framework to address ST, i.e., composing the WFSTs from ASR and SMT. 43

44 Make the WFST Approach Practical The WFST-based system in (Kumar et al., 2005) runs significantly slower than the multiple-stack based decoder (Koehn, 2004): large memory requirements; heavy online computation for each composition. Reordering is a big challenge: the number of states needed is O(2^J) and cannot be bounded as finite-state for arbitrary inputs. Next, we show a framework that addresses both issues to make the approach more practically suitable for ST. 44

45 Folsom: phrase-based SMT by FSM To speed up, first cut the number of online compositions: E = best_path(F o P o R o T o W o G) (1) becomes E = best_path(F o M o G) (2), where M is a WFST encoding a log-linear phrasal translation model, obtained offline by M = Min(Min(Det(P) o T) o W). This is possible by making P determinizable (Zhou et al., 2006). More flexible reordering: F is a WFSA constructed on-the-fly to encode the uncertainty of the input sentence (e.g., reordering); examples on the next slide. A dedicated decoder is designed for further efficiency: Viterbi search on the lazily 3-way composed graph as in (2). 45

46 Reordering, Finite-State & Gaps (a) Monotonic (like ASR): each state denotes a specific input coverage with a bit-vector; arcs are labeled with the input position to be covered in the state transition; weights are given by reordering models. (b) Reordering with skip less than three: every path from the initial to the final state represents an acceptable input permutation. Source segmentation allows options with gaps when searching for the best path to transduce the inputs at the phrase level, e.g., f_1 f_2 f_3 f_4 segmented into the phrases (f_1 f_4) and (f_2 f_3). Such a translation phenomenon is beneficial in many language pairs: translate the disjoint Chinese preposition 在...外 as outside of verb phrases, or 唱...歌 as sing. Reordering with gaps adds flexibility to conventional phrase-based models and has been confirmed useful in many studies. 46

47 What FST-based models lack FST-based models (e.g., phrase-based) remain state-of-the-art for many language pairs/tasks (Zollmann et al., 2008). However, they make no use of the fact that natural language is inherently structural & hierarchical, which leads to poor long-distance reordering modeling & exponential complexity of permutation. PCFGs are effectively used to model linguistic structures in many monolingual tasks (e.g., parsing). Recall: a regular language is a subset of the context-free languages. The synchronous (bilingual) version of the PCFG (i.e., SCFG) is an alternative for modeling translation structures. 47

48 SCFG-based TE An SCFG (Lewis and Stearns, 1968) rewrites the non-terminal (NT) on its left-hand side (LHS) into a <source, target> pair on its right-hand side (RHS), subject to the constraint of one-to-one correspondence (co-indexed by subscripts) between the source and target of every NT occurrence, e.g., VP -> < PP_1 VP_2, VP_2 PP_1 > with probability p. NTs on the RHS can be recursively instantiated simultaneously for both the source and the target, by applying any rule with a matching NT on the LHS. SCFG captures the hierarchical structure of NL & provides a more principled way to model reordering, e.g., the PP and VP above will be reordered regardless of their span lengths. 48

49 Expressiveness vs. Complexity Both depend on the maximum number of NTs on the RHS (a.k.a. the rank of the SCFG grammar). Complexity is polynomial: higher order for higher rank; cubic, O(J^3), for a rank-two (binary) SCFG. Expressiveness: more expressive with higher rank. Specifically for a binary SCFG: rare reordering examples exist (Wu 1997; Wellington et al., 2006) that it cannot cover, but it is arguably sufficient in practice. 49

50 SCFG-based MT: what's in common and what differs? Like ice cream, SCFG models come in different flavors. Linguistic syntax based: utilize structures defined over linguistic theory and annotations (e.g., Penn Treebank); SCFG rules are derived from the parallel corpus guided by explicit parsing of at least one side of the parallel corpus. E.g., tree-to-string (e.g., Quirk et al., 2005; Huang et al., 2006), forest-to-string (Mi et al., 2008), string-to-tree (e.g., Galley et al., 2004; Shen et al., 2008), tree-to-tree (e.g., Eisner 2003; Cowan et al., 2006; Zhang et al., 2008; Chiang 2010). Formal syntax based: utilize only the hierarchical structure of natural language; grammars are extracted from the parallel corpus without using any linguistic knowledge or annotations. E.g., ITG (Wu, 1997) and hierarchical models (Chiang, 2007). 50

51 Formal syntax based SCFG Arguably a better fit for ST: it does not rely on parsing, which might be difficult for informal spoken language. Grammars: only one universal NT X is used (Chiang, 2007). Phrasal rules: X -> < 初步试验, initial experiments >. Abstract rules, with NTs on the RHS: X -> < X_1 的 X_2, The X_2 of X_1 >. Glue rules to generate the sentence symbol S: S -> < S_1 X_2, S_1 X_2 >; S -> < X_1, X_1 >. 51

52 Learning of (Abstract) SCFG Rules Hierarchical SCFG rule extraction (Chiang 2007): replace any aligned sub-phrase pair of a phrase pair with co-indexed NT symbols; many-to-many mapping between phrase pairs and derived abstract rules; conditional probabilities estimated from heuristic counts. Linguistic SCFG rule extraction: syntactic parsing on at least one side of the parallel corpus; rules are extracted along with the parsing structures: constituency (Galley et al., 2004; Galley et al., 2006) and dependency (Shen et al., 2008). Improved extraction and parameterization: expected counts via forced alignment and inside-outside (Huang and Zhou, 2009); leaving-one-out smoothing (Wuebker et al., 2010). Extract additional rules: reduce conflicts between alignment and parsing structures (DeNeefe et al., 2007); from existing rules of high expected counts: rule arithmetic (Cmejrek and Zhou, 2010). 52

53 Practical Considerations of Formal Syntax-based SCFG On speed: finite-state based search with limited reordering usually runs faster than most SCFG-based models, except tree-to-string (e.g., Huang and Mi, 2010), which is faster in practice. On the model: with beam pruning, using a higher-rank (>2) SCFG is possible. Weakness: the lack of constraints on X often leads to over-generalization of SCFG rules. Improvements: add linguistically motivated constraints; refined NTs with direct linguistic annotations (Zollmann and Venugopal, 2006); soft constraints (Marton et al., 2008); enriched features tied to NTs (Zhou et al., 2008b; Huang et al., 2010). 53

54 Further Reading Phrase-based SMT F. Och and H. Ney. The Alignment Template Approach to Statistical Machine Translation. Computational Linguistics. F. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, vol. 29, 2003. F. Och and H. Ney. Discriminative training and Maximum Entropy Models for Statistical Machine Translation, In Proceedings of ACL. P. Koehn, F. J. Och and D. Marcu. Statistical phrase based translation. In Proc. NAACL. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin and E. Herbst. Moses: Open Source Toolkit for Statistical Machine Translation. ACL Demo Paper. S. Kumar, Y. Deng and W. Byrne. A weighted finite state transducer translation template model for statistical machine translation, Journal of Natural Language Engineering, vol. 11, no. 3. R. Zens, H. Ney, T. Watanabe and E. Sumita. Reordering constraints for phrase-based statistical machine translation. In Proc. COLING. B. Zhou, S. Chen and Y. Gao. Folsom: A Fast and Memory-Efficient Phrase-based Approach to Statistical Machine Translation. In Proc. SLT. Computational theory: J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages, and Computation, Addison-Wesley. G. Gallo, G. Longo, S. Pallottino and S. Nguyen. Directed hypergraphs and applications. Discrete Applied Mathematics, 42(2). M. Mohri. Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, vol. 7, no. 3. M. Mohri, F. Pereira and M. Riley. Weighted finite-state transducers in speech recognition. Computer Speech and Language, vol. 16, no. 1.

55 Further Reading: SCFG-based SMT B. Cowan, I. Kučerová and M. Collins. A discriminative model for tree-to-tree translation. In Proc. EMNLP. D. Chiang. Hierarchical phrase-based translation. Computational Linguistics, 33(2). D. Chiang, K. Knight and W. Wang. 11,001 new features for statistical machine translation. In Proc. NAACL-HLT. D. Chiang. Learning to translate with source and target syntax. In Proc. ACL. S. DeNeefe, K. Knight, W. Wang and D. Marcu. What Can Syntax-Based MT Learn from Phrase-Based MT? In Proc. EMNLP-CoNLL. S. DeNeefe and K. Knight. Synchronous tree adjoining machine translation. In Proc. EMNLP 2009. J. Eisner. Learning non-isomorphic tree mappings for machine translation. In Proc. ACL 2003. M. Galley and C. Manning. Accurate nonhierarchical phrase-based translation. In Proc. NAACL. M. Galley, J. Graehl, K. Knight, D. Marcu, S. DeNeefe, W. Wang and I. Thayer. Scalable inference and training of context-rich syntactic translation models. In Proc. COLING-ACL. M. Galley, M. Hopkins, K. Knight and D. Marcu. What's in a translation rule? In Proc. NAACL. L. Huang and D. Chiang. Forest Rescoring: Faster Decoding with Integrated Language Models. In Proc. ACL. G. Iglesias, C. Allauzen, W. Byrne, A. d. Gispert and M. Riley. Hierarchical Phrase-Based Translation Representations. In Proc. EMNLP. G. Iglesias, A. d. Gispert, E. R. Banga and W. Byrne. Hierarchical phrase-based translation with weighted finite state transducers. In Proc. NAACL-HLT. H. Mi, L. Huang and Q. Liu. Forest-based translation. In Proc. ACL 2008. C. Quirk, A. Menezes and C. Cherry. Dependency treelet translation: Syntactically informed phrase-based SMT. In Proc. ACL. L. Shen, J. Xu and R. Weischedel. A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model. In Proc. ACL-HLT. D. Wu. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, vol. 23, no. 3. M. Zhang, H. Jiang, A. Aw, H. Li, C. L. Tan and S. Li. A Tree Sequence Alignment-based Tree-to-Tree Translation Model. In Proc. ACL-HLT.

56 Decoding: A Unified Perspective for ASR/SMT/ST 1. Overview: ASR, SMT, ST, Metric 2. Learning Problems in SMT 3. Translation Structures for ST 4. Decoding: A Unified Perspective ASR/SMT 5. Coupling ASR/SMT: Decoding & Modeling 6. Practices 7. Future Directions 56

57 Unifying ASR/SMT/ST Decoding At first glance, there is a dramatic difference between ASR and SMT due to reordering. Even within SMT there are a variety of paradigms, each of which may call for its own decoding algorithm. Common to all decoding: the search space is usually exponentially large and DP is a must. DP: divide-and-conquer with reusable sub-solutions; the well-known Viterbi search in HMM-based ASR is an instance of DP. We try to unify them by observing from a higher standpoint. Benefits of a unified perspective: it helps us understand the concepts better, reveals a close connection between ASR and SMT, and is beneficial for joint ST decoding. 57

58 Background: Weighted Directed Acyclic Graph Sigma = (V, E, Omega): a vertex set V and an edge set E; a weight mapping function Omega: E -> Psi assigns each edge a weight from Psi, where Psi is defined in a semiring (Psi, +, x, 0-bar, 1-bar); a single source vertex s in V. A path pi in the graph is a sequence of consecutive edges pi = e_1 e_2 ... e_l, where the end vertex of one edge is the start vertex of the subsequent edge. The weight of a path is Omega(pi) = x_{i=1}^{l} Omega(e_i) (the semiring product over its edges). The shortest distance from s to a vertex q, delta(q), is the semiring sum of the weights of all paths from s to q; we define delta(s) = 1-bar. The search space of any finite-state automaton, e.g., the ones used in HMM-based ASR and FST-based translation, can be represented by such a graph. 58

59 Generic Viterbi Search of a DAG Many search problems can be converted to the classical shortest-distance problem in a DAG (Cormen et al., 2001). Complexity is O(|V| + |E|), as each edge needs to be visited exactly once. The search terminates when all vertices reachable from s have been visited, or when some predefined destination vertices are encountered. 59
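A minimal sketch of this generic shortest-distance (Viterbi) search over a weighted DAG in the tropical semiring, visiting vertices in topological order; the edge-list graph representation is an illustrative assumption.

def dag_viterbi(num_vertices, edges, source=0):
    """edges: list of (u, v, cost) with u < v (already topologically numbered).

    Returns (delta, backpointer): tropical-semiring shortest distances from the
    source and the incoming vertex used on each best path.
    """
    delta = [float("inf")] * num_vertices
    back = [None] * num_vertices
    delta[source] = 0.0                      # delta(s) = 1-bar in the tropical semiring
    for u, v, cost in sorted(edges):         # topological order via vertex numbering
        if delta[u] + cost < delta[v]:       # plus = min, times = +
            delta[v] = delta[u] + cost
            back[v] = u
    return delta, back

# Toy graph: 0 -> 1 -> 3 and 0 -> 2 -> 3.
edges = [(0, 1, 1.0), (0, 2, 2.5), (1, 3, 4.0), (2, 3, 1.0)]
delta, back = dag_viterbi(4, edges)
print(delta[3], back[3])   # 3.5, reached via vertex 2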

60 Case Studies: ASR and Multi-stack Phrasal SMT HMM-based ASR: composing the input acceptor with a sequence of transducers (Mohri et al., 2002) defines a finite-state search space represented by a graph; Viterbi search finds the shortest-distance path in this graph: \hat{f}_1^J = best_path(O o H o C o L o G). Multi-stack phrasal SMT, e.g., Moses (Koehn et al., 2007): the source vertex is the state where none of the source words has been translated and the translation hypothesis is empty; the target vertex t is the state where all the source words have been covered. Each edge connects start and end vertices by choosing a consecutive uncovered set of source words and applying one of the source-matched phrase pairs (PPs); the target side of the PP is appended to the hypothesis; the weight of each edge is determined by a log-linear model. Such edges are expanded consecutively until the search arrives at t; the best translation is collected along the best path, with the lowest weight delta(t). 60

61 Hypothesis Recombination in Graph [Illustration: the partial hypotheses "The initial experiments are" and "The original attempts are" share the same signature and are merged.] Key idea: keep the partial hypothesis with the lowest weight and discard others with the same signature; in phrase-based SMT the signature is the same set of source words being covered plus the same n-1 last words (for an n-gram LM) in the hypothesis. Only the partial hypothesis with the lowest cost has a possibility of becoming the best hypothesis in the future, so HR is loss-free for 1-best search. It is clear why this works when viewed from the graph perspective: all partial hypotheses with the same signature arrive at the same vertex; all future paths leaving this vertex are indistinguishable afterwards; under the semiring sum operation (e.g., min in the tropical semiring), only the lowest-cost partial hypothesis is kept. 61
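A minimal sketch of signature-based hypothesis recombination for a phrase-based stack decoder; the hypothesis fields and the LM order used in the example are illustrative assumptions.

def recombine(hypotheses, lm_order=3):
    """Keep only the lowest-cost hypothesis per signature.

    Signature = (covered source positions, last n-1 target words), so merged
    hypotheses are indistinguishable to all future expansion steps.
    """
    best = {}
    for hyp in hypotheses:
        sig = (frozenset(hyp["coverage"]), tuple(hyp["words"][-(lm_order - 1):]))
        if sig not in best or hyp["cost"] < best[sig]["cost"]:
            best[sig] = hyp
    return list(best.values())

hyps = [
    {"coverage": {0, 1, 2}, "words": ["the", "initial", "experiments", "are"], "cost": 2.1},
    {"coverage": {0, 1, 2}, "words": ["the", "original", "attempts", "are"], "cost": 2.6},
]
# With a bigram LM the signature keeps only the last word ("are"), so the two merge.
print(recombine(hyps, lm_order=2))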

62 Case Study II: Folsom Phrasal SMT E = best_path(F o M o G); the graph is expanded by lazy 3-way composition. Source vertex = the composed start states (F_1, M_1, G_1). Target vertices = states where each component state is a final state in its individual WFST. Subsequent vertices are visited in topological order. The weight of an edge e from (F_p, M_p, G_p) to (F_q, M_q, G_q) is Omega(e) = \sum_{m in {F, M, G}} lambda_m Omega(e_m). HR: merge vertices with the same (F_q, M_q, G_q) and keep only the one with the lowest cost. Any path connecting the source to a target vertex is a translation; the best one has the shortest distance. Decoding is an optimized lazy 3-way composition plus minimization, followed by a best-path search. 62

63 Background: Weighted Directed Hypergraph H = (V, E, Psi): a vertex set V and a hyperedge set E; each hyperedge e links an ordered list of tail vertices T(e) to a head vertex. The arity of H is the maximum number of tail vertices over all e. A weight function f_e: Psi^{|T(e)|} -> Psi assigns each hyperedge e a weight from Psi. A derivation d of a vertex q is a sequence of consecutive hyperedges connecting source vertices to q; its weight Omega(d) is computed recursively from the weight functions of each e. The best weight of q is the semiring sum over all of its derivations. A hypergraph, a generalization of a graph, encodes the hierarchical-branching search space expanded over CFG models. 63

64 Generic Viterbi Search of a DAH Decoding of SCFG-based models largely follows CKY, an instance of Viterbi search on a DAH of arity two. Complexity is proportional to O(|E|). 64

65 Case Study: Hypergraph decoding Vertices: [X, i, j], the LHS NT plus its span. Source vertices: apply phrasal rules matching any consecutive input span. Topological sorting ensures that vertices with shorter spans j - i are visited earlier than those with longer spans. The derivation weight delta(q) at the head vertex is updated per line 7 of the algorithm. The process is repeated until the target vertex [S, 0, J] is reached; the best derivation is found by backtracing from the target vertex. Complexity: O(|E|), proportional to O(|R| * J^3). 65

66 Case Study: DAH with LM The search space is still a DAH. Each tail vertex additionally encodes the n-1 boundary words on both its left-most and right-most sides. Each hyperedge updates the LM score and the boundary information. Worst-case complexity is O(|R| * J^3 * |T|^{4(n-1)}); pruning is a must. 66

67 Pruning: A perspective from graph Two options to prune in a graph: 1. Discard some end vertices: sort active vertices (sharing the same coverage vector) based on delta(q), skipping vertices that fall outside either a beam from the best (beam pruning) or the top-k list (histogram pruning). 2. Discard some edges leaving a start vertex (beam or histogram pruning): apply certain reordering constraints (Zens and Ney, 2004); discard higher-cost phrase translation options. Both, unlike HR, may lead to search errors. 67

68 Lazy Histogram Pruning Avoid expanding edges if they fall outside of the top k, provided the weight function is monotonic in each of its arguments and the input list for each argument is sorted. Example: cube pruning for DAH search (Chiang, 2007). Consider a hyperedge bundle {e_j} whose members share the same source side and identical tail vertices. Hyperedges with lower costs are visited earlier: push e into a priority queue sorted by f_e: Psi^{|T(e)|} -> Psi; the hyperedge popped from the priority queue is explored; stop when the top k hyperedges have been popped, and discard all remaining ones. 68
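A minimal sketch of the lazy best-first enumeration underlying cube pruning: given two cost-sorted argument lists for a hyperedge bundle, pop the k cheapest combinations from a priority queue without enumerating the full grid. The purely additive cost combination is an illustrative assumption (a real decoder also adds LM interaction costs, which is why this ordering is only approximately monotonic).

import heapq

def cube_top_k(costs_a, costs_b, k):
    """Enumerate the k cheapest (i, j) combinations of two sorted cost lists."""
    heap = [(costs_a[0] + costs_b[0], 0, 0)]
    seen = {(0, 0)}
    results = []
    while heap and len(results) < k:
        cost, i, j = heapq.heappop(heap)
        results.append((cost, i, j))
        # Push the two neighbors of the popped cell (the growing "cube" frontier).
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(costs_a) and nj < len(costs_b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (costs_a[ni] + costs_b[nj], ni, nj))
    return results

# Two sorted lists of partial-derivation costs (e.g., the two tail vertices of a binary rule).
print(cube_top_k([1.0, 1.5, 3.0], [0.5, 2.0, 2.2], k=4))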

69 Further Reading D. Chiang. Hierarchical phrase-based translation. Computational Linguistics, 33(2). G. Gallo, G. Longo, S. Pallottino and S. Nguyen. Directed hypergraphs and applications. Discrete Applied Mathematics, 42(2). L. Huang. Advanced dynamic programming in semiring and hypergraph frameworks. In Proc. COLING. Survey paper to accompany the conference tutorial. M. Mohri, F. Pereira and M. Riley. Weighted finite-state transducers in speech recognition. Computer Speech and Language, vol. 16, no. 1. D. Klein and C. D. Manning. Parsing and hypergraphs. New developments in parsing technology. K. Knight. Decoding Complexity in Word-Replacement Translation Models. Computational Linguistics, vol. 25, no. 4. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin and E. Herbst. Moses: Open Source Toolkit for Statistical Machine Translation. ACL Demo Paper. L. Huang and D. Chiang. Better k-best parsing. In Proc. the Ninth International Workshop on Parsing Technologies. L. Huang and D. Chiang. Forest Rescoring: Faster Decoding with Integrated Language Models. In Proc. ACL. B. Zhou. Statistical Machine Translation for Speech: A Perspective on Structure, Learning and Decoding, in Proceedings of the IEEE, IEEE. 69

70 Coupling ASR and SMT for ST Joint Decoding 1. Overview: ASR, SMT, ST, Metric 2. Learning Problems in SMT 3. Translation Structures for ST 4. Decoding: A Unified Perspective ASR/SMT 5. Coupling ASR/SMT: Decoding & Modeling 6. Practices 7. Future Directions 70

71 Bayesian Perspective of ST
\hat{e}_1^I = argmax_{e_1^I} P(e_1^I | x_1^T)
= argmax_{e_1^I} \sum_{f_1^J} P(e_1^I | f_1^J) P(x_1^T | f_1^J) P(f_1^J)
≈ argmax_{e_1^I} max_{f_1^J} P(e_1^I | f_1^J) P(x_1^T | f_1^J) P(f_1^J)
≈ argmax_{e_1^I} P(e_1^I | \hat{f}_1^J), with \hat{f}_1^J = argmax_{f_1^J} P(x_1^T | f_1^J) P(f_1^J)
The sum is replaced by a max (a common practice). In the last step f_1^J is determined only by x_1^T and solved as an isolated problem: the foundation of the cascaded approach. The cascaded approach is impaired by the compounding of errors propagated from ASR to SMT. ST can be improved if ASR/SMT interactions are factored in with better coupled components; coupling is achievable with joint decoding and more coherent modeling across components. 71

72 Tight Joint Decoding \hat{e}_1^I = argmax_{e_1^I} max_{f_1^J} P(e_1^I | f_1^J) P(x_1^T | f_1^J) P(f_1^J): a fully integrated search over all possible e_1^I and f_1^J. With the unified view of ASR/SMT, this is feasible for translation models with simplified reordering. WFST-based word-level ST (Matusov et al. 2006): substitute P(e_1^I, f_1^J) for the usual source LM used in ASR, E = best_path(X o H o C o L o M o G), to produce speech translation in the target language. Monotonic joint translation model at the phrase level (Casacuberta et al., 2008). Tight joint decoding using phrase-based SMT can be achieved by Folsom: composing the ASR WFSTs with M and G on-the-fly, followed by a fully integrated search. The limitation is that reordering can only occur within a phrase. 72

73 Loose Joint Decoding \hat{e}_1^I = argmax_{e_1^I} max_{f_1^J} P(e_1^I | f_1^J) P(x_1^T | f_1^J) P(f_1^J): approximate the full search space with a promising subset. N-best ASR hypotheses (Zhang et al., 2004): weak. Word lattices produced by ASR recognizers (Matusov et al., 2006; Mathias and Byrne, 2006; Zhou et al., 2007): the most general & challenging. Confusion networks (Bertoldi et al., 2008): a special case of the lattice. A better trade-off to address the reordering issue: SCFG models can take an ASR lattice for ST; a generalized CKY algorithm translates lattices (Dyer et al., 2008). Nodes in the lattice are numbered such that the end node is always numbered higher than the start node for any edge; a vertex in the HG is labeled with [X, i, j], where i < j are the node numbers in the input lattice spanned by X. If reordering is critical in joint ST, try SCFG-based SMT models: reordering without length constraints, and no need to traverse the lattice to compute distortion costs. 73

74 Coupling ASR and SMT for ST Joint Modeling 1. Overview: ASR, SMT, ST, Metric 2. Learning Problems in SMT 3. Translation Structures for ST 4. Decoding: A Unified Perspective ASR/SMT 5. Coupling ASR/SMT: Decoding & Modeling 6. Practices 7. Future Directions 74

75 End-to-end Modeling of Speech Translation Two modules in conventional speech translation: X -> Speech Recognition -> {F} -> Machine Translation -> E. Problems of inconsistency: SR and MT are optimized for different criteria, inconsistent with the end-to-end ST quality (metric discrepancy); SR and MT are trained without considering the interaction between them (train/test condition mismatch). 75

76 Why End-to-end Modeling Matters The lowest WER does not necessarily give the best BLEU. Fluent English is preferred as the input for MT, despite the fact that this might cause an increase in WER. (He et al., 2011) 76

77 End-to-end ST Model An end-to-end log-linear model for ST enables the incorporation of rich features, a principled way to model the interaction between core components, and globally optimal modeling/feature training: P(E, F | X) = (1/Z) exp(\sum_i lambda_i log h_i(E, F, X)); P(E | X) = \sum_F P(E, F | X); \hat{E} = argmax_E P(E | X). 77

78 Feature Functions for End-to-end ST Each feature function is a probabilistic model (except a few count features); all conventional ASR and MT models are included as features:
Acoustic model: h_AM = p(X | F)
Source language model: h_SLM = P_LM(F)
ASR hypothesis length model: h_SWC = e^{|F|}
Forward phrase translation model: h_F2Eph = \prod_k p(e_bar_k | f_bar_k)
Forward word translation model: h_F2Ewd = \prod_k \prod_m \sum_n p(e_{k,m} | f_{k,n})
Backward phrase translation model: h_E2Fph = \prod_k p(f_bar_k | e_bar_k)
Backward word translation model: h_E2Fwd = \prod_k \prod_n \sum_m p(f_{k,n} | e_{k,m})
Translation phrase count: h_PC = e^{K}
Translation hypothesis length: h_TWC = e^{|E|}
Phrase segmentation/reordering model: h_reorder = P_hr(S | E, F)
Target language model: h_TLM = P_LM(E)
78

79 Learning Feature Weights in the Log-linear Model Jointly optimize the feature weights by MERT (evaluated on a MS commercial data set):
Baseline (the current ASR and MT system): 33.8% BLEU
Global max-BLEU training (all features): 35.2% (+1.4%)
Global max-BLEU training (ASR-only features): 34.8% (+1.0%)
Global max-BLEU training (SMT-only features): 34.2% (+0.4%)
79

80 Case Study Recognition B contains one more insertion error. However, the inserted word "to": i) makes the MT input grammatically more correct; ii) provides critical context for "great"; iii) provides critical syntactic information for the word ordering of the translation. In the second example, recognition B contains two more errors. However, the mis-recognized phrase "want to" is plentifully represented in the formal text that is usually used for MT training and hence leads to a correct translation. 80

81 Learning Parameters Inside the Features First, define a generic differentiable utility function:
U(Lambda) = \sum_{r=1}^{R} \sum_{F_r in hyp(X_r)} \sum_{E_r in hyp(F_r)} p(E_r, F_r | X_r, Lambda) C(E_r, E_r^ref)
Lambda: the set of model parameters of interest; X_r: the r-th speech input utterance; E_r^ref: the translation reference of the r-th utterance; E_r: a translation hypothesis of the r-th utterance; F_r: a speech recognition hypothesis of the r-th utterance. U(Lambda) measures the end-to-end quality of ST; by choosing C(E_r, E_r^ref) properly, U(Lambda) covers a variety of ST metrics: C = BLEU(E_r, E_r^ref) gives the max expected BLEU objective; C = 1 - TER(E_r, E_r^ref) gives the min expected translation error rate objective (He & Deng 2013). 81
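A minimal sketch of evaluating such an expected utility over joint ASR/MT hypothesis pairs for one utterance, using normalized log-linear scores as the posterior p(E, F | X); the data layout and the plug-in sentence-BLEU function are illustrative assumptions.

import math

def expected_utility(hyp_pairs, reference, sent_bleu):
    """hyp_pairs: list of (E, F, score), score = unnormalized log-linear score.

    U = sum_{E,F} p(E, F | X) * C(E, E_ref), where the posterior is obtained by
    exponentiating and renormalizing the scores over the listed hypotheses.
    """
    m = max(score for _, _, score in hyp_pairs)        # for numerical stability
    weights = [math.exp(score - m) for _, _, score in hyp_pairs]
    z = sum(weights)
    return sum((w / z) * sent_bleu(e, reference)
               for (e, _, _), w in zip(hyp_pairs, weights))

pairs = [("she bought the suv", "asr hyp 1", -2.0),
         ("she buys the jeep", "asr hyp 2", -3.5)]
print(expected_utility(pairs, "she bought the suv",
                       sent_bleu=lambda e, r: 1.0 if e == r else 0.3))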

82 Objective, Regularization, and Optimization Optimize the utility function directly, e.g., max expected BLEU, with generic gradient-based methods (Gao & He 2013) and early stopping/cross-validation on a dev set. Or regularize the objective by K-L divergence (suitable for parameters in a probabilistic domain), with training objective O(Lambda) = log U(Lambda) - tau KL(Lambda^0 || Lambda), optimized by extended Baum-Welch (He & Deng 2008, 2012, 2013). 82

83 EBW Formula for the Translation Model Using the lexicon translation model as an example, the EBW update takes the form
p(g | h; Lambda) = [ \sum_{k,m: f_{k,m}=g} \sum_{E,F} p(E, F | X; Lambda') E-bar gamma_h(k, m) + U(Lambda') tau_FP p(g | h; Lambda^0) + D_h p(g | h; Lambda') ] / [ \sum_{k,m} \sum_{E,F} p(E, F | X; Lambda') E-bar gamma_h(k, m) + U(Lambda') tau_FP + D_h ]
where E-bar = C(E) - U(Lambda'), and gamma_h(k, m) = \sum_{n: e_{k,n}=h} p(f_{k,m} | e_{k,n}; Lambda') / \sum_n p(f_{k,m} | e_{k,n}; Lambda'). The ASR score affects the estimation of the translation model, and training is influenced by the translation quality. 83

84 Evaluation on IWSLT 11/TED Challenging open-domain SLT: TED talks, i.e., public speech on anything (tech, entertainment, arts, science, politics, economics, policy, environment, ...). Data (Chinese-English machine translation track): parallel Zh-En data: 110K sentences of TED transcripts and 7.7M sentences from the UN corpus; monolingual English data: 115M sentences from Europarl, Gigaword, etc. Example English: "What I'm going to show you first, as quickly as I can, is some foundational work, some new technology that we brought to..." Chinese: 首先, 我要用最快的速度为大家演示一些新技术的基础研究成果 84

85 Results on IWSLT 11/TED Phrase-based system: 1st phrase table from the TED parallel corpus; 2nd phrase table from 500K parallel sentences selected from the UN corpus; 1st 3-gram LM from the TED English transcriptions; 2nd 5-gram LM from the 115M supplementary English sentences. Max-BLEU training is applied only to the primary (TED) phrase table; fine-tuning of the full lambda set is performed at the end. BLEU scores on the IWSLT test sets:
baseline: 11.48% (Tst2010, dev), 14.68% (Tst2011, test)
Max expected BLEU training (He & Deng 2012): 12.39% (+0.9%) dev, 15.92% (+1.2%) test; the best single system in IWSLT 11/CE_MT
85

86 Further Reading F. Casacuberta, M. Federico, H. Ney and E. Vidal. Recent Efforts in Spoken Language Translation. IEEE Signal Processing Mag., May 2008. C. Dyer, S. Muresan, and P. Resnik. Generalizing Word Lattice Translation. In Proc. ACL. L. Mathias and W. Byrne. Statistical phrase-based speech translation. In Proc. ICASSP. E. Matusov, S. Kanthak and H. Ney. Integrating speech recognition and machine translation: Where do we stand? In Proc. ICASSP. H. Ney. Speech translation: Coupling of recognition and translation. In Proc. ICASSP. R. Zhang, G. Kikui, H. Yamamoto, T. Watanabe, F. Soong and W-K. Lo. A unified approach in speech-to-speech translation: Integrating features of speech recognition and machine translation. In Proc. COLING. B. Zhou, L. Besacier and Y. Gao. On Efficient Coupling of ASR and SMT for Speech Translation. In Proc. ICASSP. B. Zhou, X. Cui, S. Huang, M. Cmejrek, W. Zhang, J. Xue, J. Cui, B. Xiang, G. Daggett, U. Chaudhari, S. Maskey and E. Marcheret. The IBM speech-to-speech translation system for smartphone: Improvements for resource-constrained tasks. Computer Speech & Language, Vol. 27, Issue 2, 2013. 86

87 Further Reading X. He, A. Axelrod, L. Deng, A. Acero, M. Hwang, A. Nguyen, A. Wang, and X. Huang The MSR system for IWSLT 2011 evaluation, in Proc. IWSLT, December X. He and L. Deng, 2013, Speech-Centric Information Processing: An Optimization-Oriented Approach, in Proceedings of the IEEE X. He, L. Deng, and A. Acero, Why Word Error Rate is not a Good Metric for Speech Recognizer Training for the Speech Translation Task?, in Proc. ICASSP, IEEE X. He, L. Deng, and W. Chou, Discriminative learning in sequential pattern recognition, IEEE Signal Processing Magazine, September Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, M. Harper, Enriching speech recognition with automatic detection of sentence boundaries and disfluencies, IEEE Trans. on ASLP M. Ostendorf, B. Favre, R. Grishman, D. Hakkani-Tur, M. Harper, D. Hillard, J. Hirschberg, J. Heng, J. Kahn, L. Yang, S. Maskey, E. Matusov, H. Ney, A. Rosenberg, E. Shriberg, W. Wen, C. Woofers, Speech segmentation and spoken document processing, IEEE Signal Processing Magazine, May, 2008 A. Waibel, C. Fugen, Spoken language translation, IEEE Signal Processing Magazine, vol.25, no.3, pp.70-79, May Y. Zhang, L. Deng, X. He, and A. Acero, 2011, A Novel Decision Function and the Associated Decision- Feedback Learning for Speech Translation, in ICASSP, IEEE 87

88 Practices 1. Overview: ASR, SMT, ST, Metric 2. Learning Problems in SMT 3. Translation Structures for ST 4. Decoding: A Unified Perspective ASR/SMT 5. Coupling ASR/SMT: Decoding & Modeling 6. Practices 7. Future Directions 88

89 Two techniques in depth: 1 Domain adaptation 2 System combination 89

90 Domain adaptation for ST Words differ in meaning across domains/contexts, so domain adaptation is particularly important for ST: ST needs to handle spoken language (colloquial style vs. written style), and ST targets particular scenarios/domains, e.g., travel. 90

91 Domain Adaptation by Data Selection for MT Select data matching the target domain from a large general corpus. MT needs data that match both the source & target languages of the target domain: the adapted system needs to cover domain-specific input and produce domain-appropriate output. E.g., for "Do you know of any restaurants open now?", we need to select data covering not only "open" on the source side, but also the right translation of "open" on the target side. Use the bilingual cross-entropy difference: [H_in_src(s) - H_out_src(s)] + [H_in_tgt(s) - H_out_tgt(s)], where H_in_src(s) is the cross-entropy of s given the in-domain source language model, and likewise for the other terms. (Axelrod, He, and Gao, 2011) 91
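A minimal sketch of ranking candidate sentence pairs by the bilingual cross-entropy difference above; the tiny add-one-smoothed unigram LM stands in for real in-domain/out-of-domain language models and is an illustrative assumption, not a specific toolkit API.

import math
from collections import Counter

class UnigramLM:
    """Tiny add-one-smoothed unigram LM, just enough to compute per-word cross-entropy."""
    def __init__(self, sentences):
        self.counts = Counter(w for s in sentences for w in s.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1
    def cross_entropy(self, sentence):
        words = sentence.split()
        return -sum(math.log((self.counts[w] + 1) / (self.total + self.vocab))
                    for w in words) / max(len(words), 1)

def bilingual_xent_diff(pair, lm_in_src, lm_out_src, lm_in_tgt, lm_out_tgt):
    """Lower is better: pairs that look in-domain on both sides score low."""
    src, tgt = pair
    return ((lm_in_src.cross_entropy(src) - lm_out_src.cross_entropy(src)) +
            (lm_in_tgt.cross_entropy(tgt) - lm_out_tgt.cross_entropy(tgt)))

def select_pseudo_in_domain(general_bitext, lms, top_n):
    """Keep the top_n general-domain sentence pairs most similar to the in-domain data."""
    return sorted(general_bitext, key=lambda p: bilingual_xent_diff(p, *lms))[:top_n]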

92 Multi-model and data selection based domain adaptation for ST Pipeline: spoken-language parallel data and general text parallel data are the inputs; data selection with the bilingual data selection metric produces pseudo spoken-language data; model training yields a spoken-language TM and a general TM; these are combined in a log-linear model: P(E|F) = (1/Z) exp(\sum_i lambda_i log f_i(E, F)). 92

93 Evaluation on commercial ST data English-to-Chinese translation; target: dialog-style travel-domain data. Training data available: 800K in-domain sentences; 12M general-domain sentences. Evaluation: performance on the target domain, and performance on the general domain, i.e., robustness outside the target domain. 93

94 Results and Analysis Results reported in BLEU %, shown as changes relative to the general model:
General model (dev: general): reference
Travel model (dev: travel): +6.1 (travel test), -8.0 (general test)
Multi-TMs (dev: travel): +5.9 (travel test), -4.9 (general test)
Multi-TMs (dev: tvl:gen = 1:1): +5.8 (travel test), -2.0 (general test)
Multi-TMs (dev: tvl:gen = 1:2): +4.0 (travel test), -0.8 (general test)
1) Using multiple translation models together as features helps. 2) The feature weights, determined by the dev set, play a critical role. 3) Big gain on the in-domain test (travel), robust on the out-of-domain test (general). (From He & Deng, 2011) 94

95 Case Studies
Source English: A glass of cold water, please. General model: 一杯冷水请. Travel-domain adapted model: 请来一杯冷的水.
Source English: Please keep the change. General model: 请保留更改. Travel-domain adapted model: 不用找了.
Source English: Thanks, this is for your tip. General model: 谢谢, 这是为您提示. Travel-domain adapted model: 谢谢, 这是给你的小费.
Source English: Do you know of any restaurants open now? General model: 你现在知道的任何打开的餐馆吗? Travel-domain adapted model: 你知道现在还有餐馆在营业的吗?
Source English: I'd like a restaurant with cheerful atmosphere. General model: 我想就食肆的愉悦的气氛. Travel-domain adapted model: 我想要一家气氛活泼的餐厅.
Note the improved translation of ambiguous words (e.g., open, tip, change) and the improved handling of colloquial-style grammar (e.g., "A glass of cold water, please."). 95

96 From domain adaptation to topic adaptation Motivation: topics change from talk to talk and from dialog to dialog, and the meanings of words change too. There is lots of out-of-domain data with broad coverage, but not all of it is relevant. How can we utilize the OOD corpus to enhance translation performance? Method: build a topic model on the target-domain data; select topic-relevant data from the OOD corpus; combine the topic-specific model with the general model at test time. 96

97 Analysis on IWSLT 11: topics of TED talks Build an LDA-based topic model on the TED data, estimated on the source side of the parallel data and restricted to 4 topics (TED only has 775 talks / 110K sentences). Are the topics meaningful? Some of the top keywords in each topic:
Topic 1 (design, computer, data, system, machine): technology
Topic 2 (Africa, dollars, business, market, food, China, society): global
Topic 3 (water, earth, universe, ocean, trees, carbon, environment): planet
Topic 4 (life, love, god, stories, children, music): abstract
Reasonable clustering. 97

98 Topic modeling, data selection, and multi-phrase-table decoding Training (for each topic): a topic model built on the TED parallel corpus provides topic allocations; topic-relevant data are selected from the UN parallel corpus and used to train a topic phrase table, alongside the general TED phrase table. Testing: the input sentence is given a topic allocation, and topic-specific multi-table decoding combines the tables in a log-linear model: P(E|F) = (1/Z) exp(\sum_i lambda_i log phi_i(E, F)). 98

99 Experiments on IWSLT 11/TED For each topic, select 250-400K sentences from the UN corpus and train a topic-specific phrase table. Evaluation results on IWSLT 11 (TED talk dataset), BLEU:
TED only (baseline): 11.3% (Dev 2010), 13.0% (Test 2011)
TED + UN-all: 11.3% (Dev 2010), 13.0% (Test 2011)
TED + UN-4 topics: 11.8% (Dev 2010), 13.5% (Test 2011)
Simply adding an extra UN-driven phrase table did not help; topic-specific multi-phrase-table decoding helps. (From Axelrod et al., 2012) 99

100 System Combination for MT Hypotheses from single systems: E_1: she bought the Jeep; E_2: she buys the SUV; E_3: she bought the SUV Jeep. Combined MT output: she bought the Jeep SUV. System combination: input is a set of translation hypotheses from multiple single MT systems for the same source sentence; output is a better translation derived from the input hypotheses. 100

101 Confusion Network Based Methods Hypotheses from single systems: E_1: she bought the Jeep; E_2: she buys the SUV; E_3: she bought the SUV Jeep. Combined MT output: she bought the SUV. Steps: 1) Select the backbone by minimum Bayes risk: E_B = argmin_{E' in E} \sum_{E in E} P(E | F) L(E', E), e.g., E_2 is selected. 2) Align the hypotheses. 3) Construct and decode the confusion network, e.g., with columns (she | she | she), (bought | buys | bought), (the | the | the), (Jeep | SUV | SUV), (eps | eps | Jeep), decoded as "she bought the SUV". 101
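A minimal sketch of step 1, MBR backbone selection; the TER-like loss is replaced here by a plain word-level edit distance, and the posteriors are illustrative assumptions.

def edit_distance(a, b):
    """Word-level Levenshtein distance, used as a stand-in loss L(E', E)."""
    a, b = a.split(), b.split()
    d = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, wb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (wa != wb))
    return d[-1]

def mbr_backbone(hypotheses, posteriors):
    """Pick E_B = argmin_{E'} sum_E P(E|F) * L(E', E)."""
    def risk(e_prime):
        return sum(p * edit_distance(e_prime, e) for e, p in zip(hypotheses, posteriors))
    return min(hypotheses, key=risk)

hyps = ["she bought the Jeep", "she buys the SUV", "she bought the SUV Jeep"]
print(mbr_backbone(hyps, posteriors=[0.4, 0.35, 0.25]))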

102 Hypothesis Alignment Hypothesis alignment is crucial. GIZA-based approach (Matusov et al., 2006): use GIZA as a generic tool to align hypotheses. TER-based alignment (Sim et al., 2007; Rosti et al., 2007): align one hypothesis to another such that the TER is minimized. HMM-based hypothesis alignment (He et al., 2008, 2009): a fine-grained statistical model; no training needed, as the HMM's parameters are derived from pre-trained bilingual word alignment models. ITG-based approach (Karakos et al., 2008). A recent survey and evaluation: (Rosti et al., 2012). 102

103 HMM based Hypothesis Alignment [Illustration: backbone E_B = e_1 e_2 e_3; hypothesis E_hyp = e'_1 e'_3 e'_2.] The HMM is built on the backbone side; the HMM aligns the hypothesis to the backbone; after alignment, a CN is built over the backbone positions e_1 e_2 e_3 with e'_1 e'_2 e'_3 filled into the corresponding columns. 103

104 HMM Parameter Estimation Emitting probability: P(e'_1 | e_1) models how likely e'_1 and e_1 have similar meanings (via words in the source sentence). Using the source word sequence {f_1, ..., f_M} as a hidden layer, P(e'_1 | e_1) takes a mixture-model form: P_src(e'_1 | e_1) = \sum_m w_m P(e'_1 | f_m), where w_m = P(f_m | e_1) / \sum_{m'} P(f_{m'} | e_1). P(e'_1 | f_m) is from the bilingual word alignment model in the F -> E direction; P(f_m | e_1) is from the E -> F direction. 104

105 HMM Parameter Estimation (cont.) Emitting probability via word surface similarity: a normalized similarity measure s, based on the Levenshtein distance or on the matched prefix length, with s(e'_1, e_1) normalized to [0, 1]; an exponential mapping gives P_simi(e'_1 | e_1) = exp[rho (s(e'_1, e_1) - 1)], e.g., rho = 3. Overall emitting probability: P(e'_1 | e_1) = alpha P_src(e'_1 | e_1) + (1 - alpha) P_simi(e'_1 | e_1). 105

106 HMM Parameter Estimation (cont.) Transition probability: P(a_j | a_{j-1}) models word ordering and takes the same form as in a bilingual word alignment HMM: P(a_j = i | a_{j-1} = i') = d(i, i') / \sum_{i''} d(i'', i'), with distortion d(i, i') = (1 + |i - i'|)^{-kappa}, e.g., kappa = 2, where a_j is the alignment of the j-th word. It strongly encourages monotonic word ordering while still allowing non-monotonic word ordering. 106
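A minimal sketch of the HMM parameters defined on the last two slides: the mixed emission probability and the distance-based transition probability. The bilingual lexicon lookups, smoothing constants, and default alpha are illustrative assumptions.

import math

def prefix_similarity(a, b):
    """Normalized matched-prefix-length similarity in [0, 1]."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n / max(len(a), len(b), 1)

def p_simi(e_hyp, e_bb, rho=3.0):
    """Surface-similarity emission: exp[rho * (s - 1)]."""
    return math.exp(rho * (prefix_similarity(e_hyp, e_bb) - 1.0))

def p_src(e_hyp, e_bb, src_words, p_e_given_f, p_f_given_e):
    """Source-word-mixture emission: P_src(e'|e) = sum_m w_m * P(e'|f_m), w_m ~ P(f_m|e)."""
    weights = [p_f_given_e.get((f, e_bb), 1e-9) for f in src_words]
    z = sum(weights)
    return sum((w / z) * p_e_given_f.get((e_hyp, f), 1e-9)
               for f, w in zip(src_words, weights))

def emission(e_hyp, e_bb, src_words, p_e_given_f, p_f_given_e, alpha=0.7):
    """Overall emission: interpolation of the source-mixture and surface-similarity models."""
    return (alpha * p_src(e_hyp, e_bb, src_words, p_e_given_f, p_f_given_e)
            + (1 - alpha) * p_simi(e_hyp, e_bb))

def transition(i, i_prev, backbone_len, kappa=2.0):
    """P(a_j = i | a_{j-1} = i_prev) with distortion d(i, i') = (1 + |i - i'|)^(-kappa)."""
    d = lambda a, b: (1.0 + abs(a - b)) ** (-kappa)
    z = sum(d(k, i_prev) for k in range(1, backbone_len + 1))
    return d(i, i_prev) / z

print(transition(2, 1, backbone_len=3))   # near-monotonic jumps get higher probability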

107 Find the Optimal Alignment Viterbi decoding: \hat{a}_1^J = argmax_{a_1^J} \prod_{j=1}^{J} p(a_j | a_{j-1}, I) p(e'_j | e_{a_j}). Other variations: posterior probability & threshold based decoding; max posterior mode decoding. 107

108 Decode the Confusion Network Log-linear model based decoding (Rosti et al. 2007) incorporates multiple features (e.g., voting, LM, length): E* = argmax_E ln P(E | F), where ln P(E | F) combines the per-column arc posteriors \sum_{l=1}^{L} ln P_S(e_l | F) with the language model score ln P_LM(E) and other features. Confusion network decoding example (L = 5), columns e_1 ... e_5: (he | he | it | he), (have | has | eps | has), (eps | eps | a | a), (good | nice | nice | eps), (car | sedan | car | sedan). 108

109 Results on 2008 NIST Open MT Eval
The MSR-NRC-SRI entry for Chinese-to-English
[Chart: case-sensitive BLEU of the individual systems vs. the combined system; individual systems are combined incrementally over time, and the combined system achieved the best score in the NIST08 C2E track]
(From He et al., 2008)

110 Related Approaches & Extensions
Incremental hypothesis alignment (Rosti et al., 2008; Li et al., 2009; Karakos et al., 2010)
Other approaches:
  Joint decoding (He and Toutanova, 2009)
  Phrase-level system combination (Feng et al., 2009)
  System combination as target-to-target decoding (Ma & McKeown, 2012)
Survey and evaluation (Rosti et al., 2012)

111 Comparison of Aligners
Results reported in BLEU% (best scores in bold); significance group IDs are marked as superscripts on the results
[Table: BLEU% of the best single system and of system combination with different aligners (GIZA-, TER-, ITG- and IHMM-based, including incremental variants) on NIST MT09 Arabic-English, WMT11 German-English, and WMT11 Spanish-English; numeric scores omitted]
(From Rosti, He, Karakos, Leusch, et al., 2012)

112 Error Analysis
Per-sentence errors were analyzed
A paired Wilcoxon test measured the probability that the error reduction relative to the best system was real
Observations:
  All aligners significantly reduce substitution/shift errors
  No clear trend for insertions/deletions, which are sometimes worse than in the best system
  Further analysis showed that agreement on non-NULL words is an important measure of alignment quality, while agreement on NULL tokens is not

113 Further Reading
Y. Feng, Y. Liu, H. Mi, Q. Liu, and Y. Lü, 2009. Lattice-based system combination for statistical machine translation. In Proceedings of EMNLP.
X. He and K. Toutanova, 2009. Joint Optimization for Machine Translation System Combination. In Proceedings of EMNLP.
X. He, M. Yang, J. Gao, P. Nguyen, and R. Moore, 2008. Indirect-HMM-based Hypothesis Alignment for Combining Outputs from Machine Translation Systems. In Proceedings of EMNLP.
D. Karakos, J. Eisner, S. Khudanpur, and M. Dreyer, 2008. Machine Translation System Combination using ITG-based Alignments. In Proceedings of ACL-HLT.
D. Karakos, J. Smith, and S. Khudanpur, 2010. Hypothesis ranking and two-pass approaches for machine translation system combination. In Proceedings of ICASSP.
C. Li, X. He, Y. Liu, and N. Xi, 2009. Incremental HMM Alignment for MT System Combination. In Proceedings of ACL-IJCNLP.
W. Ma and K. McKeown, 2012. Phrase-level System Combination for Machine Translation Based on Target-to-Target Decoding. In Proceedings of AMTA.
E. Matusov, N. Ueffing, and H. Ney, 2006. Computing consensus translation from multiple machine translation systems using enhanced hypotheses alignment. In Proceedings of EACL.
A. Rosti, B. Xiang, S. Matsoukas, R. Schwartz, N. Fazil Ayan, and B. Dorr, 2007. Combining outputs from multiple machine translation systems. In Proceedings of NAACL-HLT.
A. Rosti, B. Zhang, S. Matsoukas, and R. Schwartz, 2008. Incremental hypothesis alignment for building confusion networks with application to machine translation system combination. In Proceedings of WMT.
A. Rosti, X. He, D. Karakos, G. Leusch, Y. Cao, M. Freitag, S. Matsoukas, H. Ney, J. Smith, and B. Zhang, 2012. Review of Hypothesis Alignment Algorithms for MT System Combination via Confusion Network Decoding. In Proceedings of WMT.
K. Sim, W. Byrne, M. Gales, H. Sahbi, and P. Woodland, 2007. Consensus network decoding for statistical machine translation system combination. In Proceedings of ICASSP.
A. Axelrod, X. He, and J. Gao, 2011. Domain Adaptation via Pseudo In-Domain Data Selection. In Proceedings of EMNLP.
A. Bisazza, N. Ruiz, and M. Federico, 2011. Fill-up versus Interpolation Methods for Phrase-based SMT Adaptation. In Proceedings of IWSLT.
G. Foster and R. Kuhn, 2007. Mixture-Model Adaptation for SMT. In Proceedings of the ACL Workshop on Statistical Machine Translation.
G. Foster, C. Goutte, and R. Kuhn, 2010. Discriminative Instance Weighting for Domain Adaptation in Statistical Machine Translation. In Proceedings of EMNLP.
B. Haddow and P. Koehn, 2012. Analysing the effect of out-of-domain data on SMT systems. In Proceedings of WMT.
X. He and L. Deng, 2011. Robust Speech Translation by Domain Adaptation. In Proceedings of Interspeech.
P. Koehn and J. Schroeder, 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of WMT.
S. Mansour, J. Wuebker, and H. Ney, 2012. Combining Translation and Language Model Scoring for Domain-Specific Data Filtering. In Proceedings of IWSLT.
R. Moore and W. Lewis, 2010. Intelligent Selection of Language Model Training Data. In Proceedings of ACL.
P. Nakov, 2008. Improving English-Spanish statistical machine translation: experiments in domain adaptation, sentence paraphrasing, tokenization, and recasing. In Proceedings of WMT.

114 Summary
Fundamental concepts and technologies in speech translation
Theory of speech translation from a joint modeling perspective
In-depth discussions on domain adaptation and system combination
Practical evaluations on major ST/MT benchmarks

115 ST Future Directions (in addition to ASR and SMT improvements)
Overcome the low-resource problem: how to rapidly develop ST with a limited amount of data, and how to adapt SMT for ST?
Confidence measurement: reduce misleading dialogue turns and lower the risk of miscommunication; complicated by the combination of two sources of uncertainty, ASR and SMT
Active vs. passive ST: be more active, e.g., actively clarify or warn the user when confidence is low; develop suitable dialogue strategies for ST
Context-aware ST: ST on smartphones (e.g., exploiting location and conversation history)
End-to-end modeling of speech translation with new features: e.g., prosodic acoustic features for translation via end-to-end modeling
Other applications of speech translation: e.g., cross-lingual spoken language understanding

116 Further Reading About this Tutorial
This tutorial is mainly based on:
  Bowen Zhou, Statistical Machine Translation for Speech: A Perspective on Structure, Learning and Decoding, in Proceedings of the IEEE, May 2013
  Xiaodong He and Li Deng, Speech-Centric Information Processing: An Optimization-Oriented Approach, in Proceedings of the IEEE, May 2013
Both provide references for further reading
Papers and charts are available on the authors' webpages

117 Resources (play it yourself)
You also need an ASR system: HTK, Kaldi, HTS (TTS for speech-to-speech)
MT toolkits (word alignment, decoder, MERT): GIZA++, Moses, cdec
Benchmark data sets: IWSLT, WMT (includes Europarl), NIST Open MT

118 Q & A
Bowen Zhou, Research Manager, IBM Watson Research Center
Xiaodong He, Researcher, Microsoft Research Redmond; Affiliate Professor, University of Washington, Seattle
