Question of the Day. Machine Translation. Statistical Word Alignment. Centauri/Arcturan (Knight, 1997)


Julia Perry
6 years ago

Question of the Day

Is it possible to learn to translate from plain example translations?

Machine Translation: Statistical Word Alignment

Based on slides by Philipp Koehn and Kevin Knight

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

???

(process of elimination)

Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }

Slide annotations: cognate? zero fertility

Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa

1a. Garcia and associates.
1b. Garcia y asociados.
2a. Carlos Garcia has three associates.
2b. Carlos Garcia tiene tres asociados.
3a. his associates are not strong.
3b. sus asociados no son fuertes.
4a. Garcia has a company also.
4b. Garcia tambien tiene una empresa.
5a. its clients are angry.
5b. sus clientes estan enfadados.
6a. the associates are also angry.
6b. los asociados tambien estan enfadados.
7a. the clients and the associates are enemies.
7b. los clientes y los asociados son enemigos.
8a. the company has three groups.
8b. la empresa tiene tres grupos.
9a. its groups are in Europe.
9b. sus grupos estan en Europa.
10a. the modern groups sell strong pharmaceuticals.
10b. los grupos modernos venden medicinas fuertes.
11a. the groups do not sell zenzanine.
11b. los grupos no venden zanzanina.
12a. the small groups are not modern.
12b. los grupos pequenos no son modernos.

Conclusion

- It is possible to find alignments between words... without prior knowledge
- Translation models can be learned from word alignment

Chicken and Egg Problem

- Statistical alignment models can be used to align data:
$$\arg\max_a p(a \mid \mathbf{e}, \mathbf{f}) = \arg\max_a \frac{p(\mathbf{e}, a \mid \mathbf{f})}{p(\mathbf{e} \mid \mathbf{f})}$$
- Word-aligned data is necessary to estimate model parameters

EM Algorithm

Learning with incomplete data:
- word alignment is hidden
- need to fill the gaps in the data

Expectation Maximization (EM) in a nutshell:
1. initialize model parameters (e.g. uniform)
2. assign probabilities to the missing data
3. estimate model parameters from completed data
4. iterate steps 2-3 until convergence

EM Algorithm

... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...

Initial step: all alignments equally likely. Model learns that, e.g., la is often aligned with the.

After one iteration: alignments, e.g., between la and the are more likely.

EM Algorithm

... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...

After another iteration: it becomes apparent that alignments, e.g., between fleur and flower are more likely.

Convergence: inherent hidden structure revealed by EM.

Parameter estimation from the aligned corpus: p(la | the), p(le | the), p(maison | house), p(bleu | blue) [values not recoverable]

IBM Model 1 and EM

EM Algorithm consists of two steps.

Expectation-Step: Apply model to the data
- parts of the model are hidden (here: alignments)
- using the model, assign probabilities to possible alignments

Maximization-Step: Estimate model from data
- take assigned values as fractional counts
- collect counts (weighted by probabilities)
- estimate model from counts

Iterate these steps until convergence

IBM Model 1 and EM

Probabilities:
p(the | la) = 0.7, p(house | la) = 0.05, p(the | maison) = 0.1, p(house | maison) = 0.8

Alignments for "la maison" / "the house":

alignment                          p(e, a | f)   p(a | e, f)
the-la, house-maison               0.56          0.824
the-la, house-la                   0.035         0.052
the-maison, house-maison           0.08          0.118
the-maison, house-la               0.005         0.007

Counts: c(the | la), c(house | la), c(the | maison), c(house | maison) [values not recoverable]

IBM Model 1 and EM: Expectation Step

We need to compute $p(a \mid \mathbf{e}, \mathbf{f})$. Applying the chain rule:

$$p(a \mid \mathbf{e}, \mathbf{f}) = \frac{p(\mathbf{e}, a \mid \mathbf{f})}{p(\mathbf{e} \mid \mathbf{f})}$$

We already have the formula for $p(\mathbf{e}, a \mid \mathbf{f})$ (definition of Model 1).

We need to compute $p(\mathbf{e} \mid \mathbf{f})$:

$$\begin{aligned}
p(\mathbf{e} \mid \mathbf{f}) &= \sum_a p(\mathbf{e}, a \mid \mathbf{f}) \\
&= \sum_{a(1)=0}^{l_f} \cdots \sum_{a(l_e)=0}^{l_f} p(\mathbf{e}, a \mid \mathbf{f}) \\
&= \sum_{a(1)=0}^{l_f} \cdots \sum_{a(l_e)=0}^{l_f} \frac{\epsilon}{(l_f+1)^{l_e}} \prod_{j=1}^{l_e} t(e_j \mid f_{a(j)})
\end{aligned}$$

$$\begin{aligned}
p(\mathbf{e} \mid \mathbf{f}) &= \sum_{a(1)=0}^{l_f} \cdots \sum_{a(l_e)=0}^{l_f} \frac{\epsilon}{(l_f+1)^{l_e}} \prod_{j=1}^{l_e} t(e_j \mid f_{a(j)}) \\
&= \frac{\epsilon}{(l_f+1)^{l_e}} \sum_{a(1)=0}^{l_f} \cdots \sum_{a(l_e)=0}^{l_f} \prod_{j=1}^{l_e} t(e_j \mid f_{a(j)}) \\
&= \frac{\epsilon}{(l_f+1)^{l_e}} \prod_{j=1}^{l_e} \sum_{i=0}^{l_f} t(e_j \mid f_i)
\end{aligned}$$

Note the trick in the last line:
- removes the need for an exponential number of products!
- this makes IBM Model 1 estimation tractable
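The four-alignment example for "la maison" / "the house" can be checked numerically. The sketch below is only an illustration under stated assumptions: it ignores the NULL word and drops the constant factor $\epsilon / (l_f+1)^{l_e}$, which cancels when normalizing (the slide's values 0.56, 0.035, 0.08, 0.005 are consistent with that). The variable names (`joint`, `posterior`, `counts`) are made up for this example. It also derives the fractional counts that each alignment contributes.

```python
from itertools import product

# Lexical translation probabilities t(e|f) taken from the slide.
t = {
    ("the", "la"): 0.7,
    ("house", "la"): 0.05,
    ("the", "maison"): 0.1,
    ("house", "maison"): 0.8,
}
e_sent = ["the", "house"]
f_sent = ["la", "maison"]


def joint(alignment):
    """Unnormalized p(e, a | f): product of t(e_j | f_a(j)).

    Assumption: the constant eps / (l_f + 1)^l_e is dropped, since it
    cancels in the posterior.
    """
    p = 1.0
    for j, i in enumerate(alignment):
        p *= t[(e_sent[j], f_sent[i])]
    return p


# An alignment maps each English position j to one foreign position i.
# Assumption: no NULL word, so there are 2^2 = 4 alignments.
alignments = list(product(range(len(f_sent)), repeat=len(e_sent)))
joints = {a: joint(a) for a in alignments}

# p(e | f) is the sum over all alignments; the posterior follows by division.
p_e_given_f = sum(joints.values())
posterior = {a: joints[a] / p_e_given_f for a in alignments}

# Fractional counts: each alignment adds its posterior to every link it contains.
counts = {pair: 0.0 for pair in t}
for a, p in posterior.items():
    for j, i in enumerate(a):
        counts[(e_sent[j], f_sent[i])] += p

for a in alignments:
    links = ", ".join(f"{e_sent[j]}-{f_sent[i]}" for j, i in enumerate(a))
    print(f"{links}: joint={joints[a]:.3f} posterior={posterior[a]:.3f}")
for (e, f), c in counts.items():
    print(f"c({e}|{f}) = {c:.3f}")
```

The posteriors agree with the slide's values up to rounding, and each count is the sum of the posteriors of the alignments that contain the corresponding link.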

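The Trick (case $l_e = l_f = 2$)

$$\begin{aligned}
p(\mathbf{e} \mid \mathbf{f}) &= \sum_{a(1)=0}^{2} \sum_{a(2)=0}^{2} \frac{\epsilon}{3^2} \prod_{j=1}^{2} t(e_j \mid f_{a(j)}) \\
&= \frac{\epsilon}{3^2} \big( t(e_1 \mid f_0)\, t(e_2 \mid f_0) + t(e_1 \mid f_0)\, t(e_2 \mid f_1) + t(e_1 \mid f_0)\, t(e_2 \mid f_2) \\
&\qquad + t(e_1 \mid f_1)\, t(e_2 \mid f_0) + t(e_1 \mid f_1)\, t(e_2 \mid f_1) + t(e_1 \mid f_1)\, t(e_2 \mid f_2) \\
&\qquad + t(e_1 \mid f_2)\, t(e_2 \mid f_0) + t(e_1 \mid f_2)\, t(e_2 \mid f_1) + t(e_1 \mid f_2)\, t(e_2 \mid f_2) \big) \\
&= \frac{\epsilon}{3^2} \big( t(e_1 \mid f_0) (t(e_2 \mid f_0) + t(e_2 \mid f_1) + t(e_2 \mid f_2)) \\
&\qquad + t(e_1 \mid f_1) (t(e_2 \mid f_0) + t(e_2 \mid f_1) + t(e_2 \mid f_2)) \\
&\qquad + t(e_1 \mid f_2) (t(e_2 \mid f_0) + t(e_2 \mid f_1) + t(e_2 \mid f_2)) \big) \\
&= \frac{\epsilon}{3^2} (t(e_1 \mid f_0) + t(e_1 \mid f_1) + t(e_1 \mid f_2)) (t(e_2 \mid f_0) + t(e_2 \mid f_1) + t(e_2 \mid f_2))
\end{aligned}$$

IBM Model 1 and EM: Expectation Step

Combine what we have:

$$\begin{aligned}
p(a \mid \mathbf{e}, \mathbf{f}) &= \frac{p(\mathbf{e}, a \mid \mathbf{f})}{p(\mathbf{e} \mid \mathbf{f})} \\
&= \frac{\frac{\epsilon}{(l_f+1)^{l_e}} \prod_{j=1}^{l_e} t(e_j \mid f_{a(j)})}{\frac{\epsilon}{(l_f+1)^{l_e}} \prod_{j=1}^{l_e} \sum_{i=0}^{l_f} t(e_j \mid f_i)} \\
&= \prod_{j=1}^{l_e} \frac{t(e_j \mid f_{a(j)})}{\sum_{i=0}^{l_f} t(e_j \mid f_i)}
\end{aligned}$$

IBM Model 1 and EM: Maximization Step

Now we have to collect counts. Evidence from a sentence pair $(\mathbf{e}, \mathbf{f})$ that word $e$ is a translation of word $f$:

$$c(e \mid f; \mathbf{e}, \mathbf{f}) = \sum_a p(a \mid \mathbf{e}, \mathbf{f}) \sum_{j=1}^{l_e} \delta(e, e_j)\, \delta(f, f_{a(j)})$$

Note that $\delta(a, b) = 1$ if $a = b$, and $0$ otherwise. The formula counts how many times $e$ is aligned to $f$ in alignment $a$ and weights each count by the likelihood $p(a \mid \mathbf{e}, \mathbf{f})$ of that alignment.

After collecting these counts over a corpus, we can estimate the model. The denominator sums over all English words $e'$, so that $t(\cdot \mid f)$ is a probability distribution:

$$t(e \mid f) = \frac{\sum_{(\mathbf{e}, \mathbf{f})} c(e \mid f; \mathbf{e}, \mathbf{f})}{\sum_{e'} \sum_{(\mathbf{e}, \mathbf{f})} c(e' \mid f; \mathbf{e}, \mathbf{f})}$$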
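The factorization used in "The Trick" can be sanity-checked by brute force on small sentences. The following is a minimal sketch, not part of the original slides: the table `t[j][i]` stands for $t(e_j \mid f_i)$ with $i = 0$ playing the role of the NULL word, its entries are arbitrary random numbers (the identity is purely algebraic, so they need not be normalized), and the function names are hypothetical.

```python
import random
from itertools import product
from math import isclose, prod


def p_bruteforce(t, l_e, l_f, eps=1.0):
    """Sum over all (l_f + 1)^l_e alignments of the product of t(e_j | f_a(j))."""
    total = 0.0
    for a in product(range(l_f + 1), repeat=l_e):
        total += prod(t[j][a[j]] for j in range(l_e))
    return eps / (l_f + 1) ** l_e * total


def p_factored(t, l_e, l_f, eps=1.0):
    """Product over English positions of the sum over foreign positions."""
    return eps / (l_f + 1) ** l_e * prod(
        sum(t[j][i] for i in range(l_f + 1)) for j in range(l_e)
    )


random.seed(0)
l_e, l_f = 2, 2  # the case worked out on the slide
# Hypothetical values for t(e_j | f_i); row j = English word, column i = foreign word.
t = [[random.random() for _ in range(l_f + 1)] for _ in range(l_e)]

brute = p_bruteforce(t, l_e, l_f)
fact = p_factored(t, l_e, l_f)
print(brute, fact, isclose(brute, fact))
```

The brute-force version enumerates $(l_f+1)^{l_e}$ alignments, while the factored version needs only $l_e \cdot (l_f+1)$ table lookups, which is the source of the tractability claim.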

IBM Model 1 and EM: Pseudocode

Input: set of sentence pairs (e, f)
Output: translation prob. t(e|f)

initialize t(e|f) uniformly
while not converged do
    // initialize
    count(e|f) = 0 for all e, f
    total(f) = 0 for all f
    for all sentence pairs (e, f) do
        // compute normalization
        for all words e in e do
            s-total(e) = 0
            for all words f in f do
                s-total(e) += t(e|f)
            end for
        end for
        // collect counts
        for all words e in e do
            for all words f in f do
                count(e|f) += t(e|f) / s-total(e)
                total(f) += t(e|f) / s-total(e)
            end for
        end for
    end for
    // estimate probabilities
    for all foreign words f do
        for all English words e do
            t(e|f) = count(e|f) / total(f)
        end for
    end for
end while

Convergence

Training corpus: das Haus / the house, das Buch / the book, ein Buch / a book

Table of t(e|f) over the columns initial, 1st it., 2nd it., 3rd it., ..., final, with one row per pair (e, f): (the, das), (book, das), (house, das), (the, buch), (book, buch), (a, buch), (book, ein), (a, ein), (the, haus), (house, haus) [values not recoverable]

Perplexity

How well does the model fit the data?

Perplexity: derived from probability of the training data according to the model

$$\log_2 PP = -\sum_s \frac{1}{S} \log_2 p(\mathbf{e}_s \mid \mathbf{f}_s)$$

Example ($\epsilon = 1$): table over the columns initial, 1st it., 2nd it., 3rd it., ..., final, with rows p(the haus | das haus), p(the book | das buch), p(a book | ein buch), unnormalized perplexity [values not recoverable]

Higher IBM Models

IBM Model 1    lexical translation
IBM Model 2    adds absolute reordering model
IBM Model 3    adds fertility model
IBM Model 4    relative reordering model
IBM Model 5    fixes deficiency

- Only IBM Model 1 has global maximum
  - training of a higher IBM model builds on previous model
- Computationally biggest change in Model 3
  - trick to simplify estimation does not work anymore!
  - exhaustive count collection becomes computationally too expensive
  - sampling over high probability alignments is used instead
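The pseudocode above translates fairly directly into Python. The sketch below is one possible rendering, not a reference implementation, and makes several assumptions that are not in the slides: there is no NULL word (as in the pseudocode), a fixed number of iterations stands in for the convergence test, and the perplexity helper uses $\epsilon = 1$ with the normalization $l_f^{l_e}$ (no $+1$, since NULL is omitted). The function names `train_ibm1` and `log2_perplexity` are made up for this example.

```python
from collections import defaultdict
from math import log2


def train_ibm1(corpus, iterations=10):
    """EM for IBM Model 1 without a NULL word.

    corpus: list of (english_tokens, foreign_tokens) pairs.
    Returns a dict mapping (e, f) to t(e|f).
    """
    e_vocab = {e for es, _ in corpus for e in es}
    f_vocab = {f for _, fs in corpus for f in fs}
    # Uniform initialization: t(e|f) = 1 / |English vocabulary|.
    t = {(e, f): 1.0 / len(e_vocab) for e in e_vocab for f in f_vocab}

    for _ in range(iterations):  # assumption: fixed count instead of a convergence test
        count = defaultdict(float)
        total = defaultdict(float)
        for es, fs in corpus:
            for e in es:
                # Normalization over all foreign words in this sentence.
                s_total = sum(t[(e, f)] for f in fs)
                for f in fs:
                    frac = t[(e, f)] / s_total  # expected (fractional) count
                    count[(e, f)] += frac
                    total[f] += frac
        # M-step: renormalize counts per foreign word; unseen pairs become 0.
        t = {(e, f): count[(e, f)] / total[f] for e in e_vocab for f in f_vocab}
    return t


def log2_perplexity(corpus, t, eps=1.0):
    """-(1/S) * sum_s log2 p(e_s | f_s), using the factored form of p(e|f)."""
    log_sum = 0.0
    for es, fs in corpus:
        p = eps / len(fs) ** len(es)  # assumption: l_f^l_e, because NULL is omitted
        for e in es:
            p *= sum(t[(e, f)] for f in fs)
        log_sum += log2(p)
    return -log_sum / len(corpus)


corpus = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]

for n in (0, 1, 2, 3, 10):
    t = train_ibm1(corpus, iterations=n)
    print(n, round(t[("the", "das")], 4), round(t[("house", "Haus")], 4),
          round(log2_perplexity(corpus, t), 4))
```

Running the loop for an increasing number of iterations shows the same qualitative behavior as the convergence slide: probability mass moves toward the pairs the/das, house/Haus, book/Buch and a/ein, and the perplexity falls.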

Typical Training Scheme

Iterations over alignment models of increasing complexity:
1. n EM iterations of IBM Model 1 with uniform initialization
2. n EM iterations of IBM Model 2 or HMM initialized with Model 1
3. parameter transfer from IBM Model 2 / HMM to IBM Model 3
4. n hill-climbing iterations of IBM Model 3 based on best alignment
5. parameter transfer from IBM Model 3 to IBM Model 4
6. n hill-climbing iterations of IBM Model 4 based on best alignment

Typical number of iterations: 5

Popular implementation: GIZA++

Conclusion

- IBM Models were the pioneering models in statistical machine translation
- EM training
  - learn from incomplete data by maximizing data likelihood
  - iteratively converge to local maximum
  - approximations needed for IBM 3 and higher

Recommended reading (besides the text book):
- SMT Tutorial Workbook (Kevin Knight 1999)
- Introductory article by Kevin Knight (1997)
- Lecture notes by Michael Collins on IBM Model 1 and 2
- Hardcore: Brown et al., 1993, The Mathematics of Statistical Machine Translation: Parameter Estimation
