Machine Translation and Discriminative Models

1 Machine Translation and Discriminative Models
Tree-to-tree transfer and discriminative learning
Martin Popel, ÚFAL (Institute of Formal and Applied Linguistics), Charles University in Prague
March 23rd 2015, Seminar of Formal Linguistics, Prague

2 Outline
1. Intro: TectoMT schema, isomorphic transfer, Moses
2. Quiz
3. MT as labeling
4. TectoMT over years: 2008 baseline transfer, 2009 HMTM, 2010 MaxEnt, 2012 TectoMoses, 2012 Gibbs, 2013 Easy-first, 2013 Interpol, 2014 VowpalWabbit

3 TectoMT: analysis, transfer, synthesis (rule-based & statistical blocks)
[Figure: the TectoMT pipeline.
ANALYSIS of the source language (English): w-layer (segmentation, tokenization); morphological layer, m-layer (lemmatization, tagger Morce); analytical layer, a-layer (parser: McDonald's MST, analytical functions); tectogrammatical layer, t-layer (mark edges to contract, build t-tree, fill formemes, grammatemes).
TRANSFER on the t-layer: query dictionary, use HMTM.
SYNTHESIS of the target language (Czech): fill morphological categories, impose agreement, add functional words, generate wordforms, concatenate.]

4 TectoMT: isomorphic transfer (1-1 node mapping)
Source sentence (Czech): Strojový překlad by měl být snadný.
Target sentence (English): Machine translation should be easy.
ANALYSIS → source tree (Czech) → TRANSFER → target tree (English) → SYNTHESIS

P(optimal_tree) = P_E(strojový | machine) · P_T(machine | translation) · P_E(překlad | translation) · P_T(translation | be) · P_E(snadný | easy) · P_T(easy | be) · P_E(být | be) · P_T(be | ROOT)

P_E(source | target): emission probabilities (translation model)
P_T(dependent | governing): transition probabilities (target-language tree model)

[Figure: source tree ROOT → být → {překlad → strojový, snadný}, each node with candidate target labels: být: be / have; překlad: translation / arcade; snadný: easy / simple; strojový: machine / engine. Values given in the figure include P_E(být | be) = 0.8, P_E(překlad | arcade) = 0.7, P_E(překlad | translation) = 0.6, P_E(strojový | machine) = 0.4, P_E(strojový | engine) = 0.5.]
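The product on this slide can be spelled out in a few lines of Python. This is a minimal sketch, not TectoMT code: the dictionary layout and the name `tree_probability` are assumptions made for illustration, and all transition values (plus the emission value for snadný) are invented placeholders rather than figures from the talk.

```python
from math import prod

# Target tree for "Machine translation should be easy.": node -> governing node.
parent = {"machine": "translation", "translation": "be", "easy": "be", "be": "ROOT"}

# 1-1 node mapping: target lemma -> aligned source lemma.
source_of = {"machine": "strojový", "translation": "překlad",
             "easy": "snadný", "be": "být"}

# P_E(source | target). 0.4, 0.6 and 0.8 appear on the slide; 0.5 is a placeholder.
P_E = {("strojový", "machine"): 0.4, ("překlad", "translation"): 0.6,
       ("snadný", "easy"): 0.5, ("být", "be"): 0.8}

# P_T(dependent | governing). All values are illustrative placeholders.
P_T = {("machine", "translation"): 0.1, ("translation", "be"): 0.2,
       ("easy", "be"): 0.3, ("be", "ROOT"): 0.5}

def tree_probability(parent, source_of, P_E, P_T):
    """Multiply emission and transition probabilities over all target nodes."""
    return prod(P_E[(source_of[t], t)] * P_T[(t, parent[t])] for t in parent)

p = tree_probability(parent, source_of, P_E, P_T)
```

Each target node contributes exactly one emission factor and one transition factor, which is what makes the transfer a labeling problem over a fixed tree shape.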

5 Phrase-based Statistical Machine Translation
- currently most popular approach (Moses toolkit)
- no linguistic analysis needed (just tokenization)
- translates each phrase independently (except for LM)
- many segmentations to phrases considered, only one used

9 Quiz: English-Czech translation
Is it possible to translate prime as vláda? In which context?
prime minister → předseda vlády
How to formalize such a translation rule?
lemma=prime, formeme=adj:attr, parent_lemma=minister → lemma=vláda, formeme=n:2, parent_lemma=předseda
This is still isomorphic transfer, unlike prime minister → premiér.

14 Quiz: English-Czech translation
Is it possible to translate find as přijít? In which context? (On the t-layer, find ≠ find_out.)
Agatha[n:subj] found that book[n:obj] interesting[adj:compl].
Agátě[n:3] přišla ta kniha[n:1] zajímavá[adj:1].
How to formalize such a translation rule?
lemma=find, child1_formeme=n:subj, child2_formeme=n:obj, child3_formeme=adj:compl → lemma=přijít, child1_formeme=n:3, child2_formeme=n:1, child3_formeme=adj:1
(adj:compl means predicative adjective)
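The two quiz rules can be read as feature tests on a t-node and its neighbourhood. The sketch below is one hypothetical way to encode that; `TNode`, `context_features`, `RULES` and `apply_rules` are illustrative names, not part of TectoMT, and real transfer uses learned models rather than a hand-written rule list.

```python
from dataclasses import dataclass, field

@dataclass
class TNode:
    """A t-layer node with the two attributes used in the quiz."""
    lemma: str
    formeme: str
    children: list = field(default_factory=list)

def context_features(node, parent=None):
    """Collect the node's own attributes plus parent and child context."""
    feats = {"lemma": node.lemma, "formeme": node.formeme}
    if parent is not None:
        feats["parent_lemma"] = parent.lemma
    for i, child in enumerate(node.children, start=1):
        feats[f"child{i}_formeme"] = child.formeme
    return feats

# (source-side condition, target-side output), copied from the quiz slides.
RULES = [
    ({"lemma": "prime", "formeme": "adj:attr", "parent_lemma": "minister"},
     {"lemma": "vláda", "formeme": "n:2", "parent_lemma": "předseda"}),
    ({"lemma": "find", "child1_formeme": "n:subj",
      "child2_formeme": "n:obj", "child3_formeme": "adj:compl"},
     {"lemma": "přijít", "child1_formeme": "n:3",
      "child2_formeme": "n:1", "child3_formeme": "adj:1"}),
]

def apply_rules(node, parent=None):
    """Return the target side of the first rule whose condition holds, else None."""
    feats = context_features(node, parent)
    for condition, output in RULES:
        if all(feats.get(k) == v for k, v in condition.items()):
            return output
    return None

# "Agatha found that book interesting."
find = TNode("find", "v:fin", [
    TNode("Agatha", "n:subj"),
    TNode("book", "n:obj", [TNode("this", "adj:attr")]),
    TNode("interesting", "adj:compl"),
])
result = apply_rules(find)
```

Both rules keep the 1-1 node mapping: they change labels on existing nodes and never add or delete a node.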

15 Representation of t-layer: lemma and formeme as two attributes
find / v:fin
  Agatha / n:subj
  book / n:obj
    this / adj:attr
  interesting / adj:compl

17 Representation of t-layer: lemma and formeme as interleaved sub-nodes
v:fin
  find
    n:subj
      Agatha
    n:obj
      book
        adj:attr
          this
    adj:compl
      interesting
Grammatemes can be:
- translated in postprocessing (current approach)
- represented as sub-nodes (leaves, children of lemmas)
- encoded within the lemma, but only if the grammateme changed

18 Handling non-isomorphic transfer
- preprocessing or postprocessing within transfer (current approach)
- natively in the main transfer algorithm
- convert training data to isomorphic trees [not tried yet]
  - n-1 alignment: add special [delete_node] label to the target side
  - 1-n alignment: encode added nodes (L+F) into the main lemma
  - encode topology change: as_child, as_sibling, as_parent

21 TectoMT transfer over years
year  BLEUdiff  method
2008            initial baseline
2009            HMTM (TreeViterbi, TreeLM)
2010            HMTM + MaxEnt
2012            TectoMoses
2012  NA        Gibbs sampling treelets
2013            Easy-first treelets
2013            Interpol treelets
2014            VowpalWabbit
+0.9 other improvements in QTLeap en→cs in two months

22 2008: baseline TectoMT transfer
- static translation model P(target | source) = #(source, target) / #(source)
- first translate formemes, then lemmas
- use only the top variant
[Table: WMT 2009 en→cs results; columns: BLEU, human score; rows: Moses (CUNI), Google, Moses (UEdin), Eurotran XP, PC Translator, TectoMT]
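The static model is plain relative-frequency estimation over aligned node pairs. A minimal sketch, assuming the training data is available as a list of (source, target) label pairs; the function names and the toy counts are made up for illustration.

```python
from collections import Counter, defaultdict

def train_static_tm(aligned_pairs):
    """Estimate P(target | source) = #(source, target) / #(source)."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(src for src, _ in aligned_pairs)
    model = defaultdict(dict)
    for (src, trg), count in pair_counts.items():
        model[src][trg] = count / source_counts[src]
    return model

def top_variant(model, src):
    """The 2008 baseline keeps only the most probable translation."""
    return max(model[src], key=model[src].get)

# Toy aligned lemma pairs (invented counts).
pairs = [("find", "najít")] * 3 + [("find", "přijít")] + [("prime", "hlavní")] * 2
tm = train_static_tm(pairs)
best = top_variant(tm, "find")
```

Because the model ignores context, `find` always comes out as its most frequent translation, which is exactly the limitation the quiz slides point at.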

23 2009: Hidden Markov Tree Model (HMTM)
- still using static translation models, but also TreeLM (target lemma-formeme and parent-child compatibility)
- best labeling is found via HMTM (Tree Viterbi)

P(optimal_tree) = P_E(strojový | machine) · P_T(machine | translation) · P_E(překlad | translation) · P_T(translation | be) · P_E(snadný | easy) · P_T(easy | be) · P_E(být | be) · P_T(be | ROOT)

P_E(source | target): emission probabilities (translation model)
P_T(dependent | governing): transition probabilities (target-language tree model)

[Figure: the isomorphic-transfer example from slide 4, Strojový překlad by měl být snadný. → Machine translation should be easy., with candidate labels být: be / have; překlad: translation / arcade; snadný: easy / simple; strojový: machine / engine.]
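A compact way to see how Tree Viterbi picks the labeling is a bottom-up dynamic program over the source tree. The sketch below is an assumption-laden illustration, not the TectoMT implementation: the data layout, the `floor` value for unseen transitions, every transition probability and the emission values for snadný and for být → have are invented; only 0.4, 0.5, 0.6, 0.7 and 0.8 come from the slides.

```python
def tree_viterbi(children, root, emit, trans, floor=1e-9):
    """Find the most probable target labeling of a source tree.

    children: source node -> list of child nodes
    emit:     source node -> {target label: P_E(source | target)}
    trans:    (dependent label, governing label) -> P_T
    """
    best = {}    # best[node][label]: max probability of the subtree under node
    choice = {}  # choice[node][label]: {child: best child label}

    def up(node):
        for child in children.get(node, []):
            up(child)
        best[node], choice[node] = {}, {}
        for label, p in emit[node].items():
            picks = {}
            for child in children.get(node, []):
                cl = max(best[child],
                         key=lambda l: trans.get((l, label), floor) * best[child][l])
                p *= trans.get((cl, label), floor) * best[child][cl]
                picks[child] = cl
            best[node][label], choice[node][label] = p, picks

    def down(node, label, labeling):
        labeling[node] = label
        for child, cl in choice[node][label].items():
            down(child, cl, labeling)
        return labeling

    up(root)
    root_label = max(best[root],
                     key=lambda l: trans.get((l, "ROOT"), floor) * best[root][l])
    return down(root, root_label, {})

children = {"být": ["překlad", "snadný"], "překlad": ["strojový"]}
emit = {
    "strojový": {"machine": 0.4, "engine": 0.5},
    "překlad": {"translation": 0.6, "arcade": 0.7},
    "snadný": {"easy": 0.5, "simple": 0.4},  # placeholder values
    "být": {"be": 0.8, "have": 0.2},         # 0.2 is a placeholder
}
trans = {  # all placeholder values
    ("machine", "translation"): 0.1, ("engine", "translation"): 0.001,
    ("machine", "arcade"): 0.002, ("engine", "arcade"): 0.001,
    ("translation", "be"): 0.05, ("arcade", "be"): 0.0001,
    ("easy", "be"): 0.05, ("simple", "be"): 0.03,
    ("be", "ROOT"): 0.3, ("have", "ROOT"): 0.2,
}
labeling = tree_viterbi(children, "být", emit, trans)
```

With these numbers the emission model alone would choose engine and arcade; the tree model's transition probabilities are what pull the result to machine translation.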

24 2010: Maximum Entropy translation model
- still using HMTM (and generative TreeLM), but
- the static model P(lemma | src_lemma) is interpolated with a context-sensitive discriminative (MaxEnt) model P(lemma | src_lemma, other features)

Example: He agreed with the unions to cut all overtime. → Dohodl se s odbory na zrušení všech přesčasů.
[Figure: source t-tree between ANALYSIS, TRANSFER and SYNTHESIS, nodes as lemma / formeme:
agree / v:fin (tense=past, voice=active, negation=0, sempos=v)
  he / n:subj
  union / n:with+x
  cut / v:inf (has_left_child=0, has_right_child=1, sempos=v, tag=vb, position=right, named_entity=0)
    overtime / n:obj
      all / adj:attr
Candidates listed for cut: chop, saw, trim, shorten, lumber, hew, lower, delete, crop, abolish, cancel, ...]
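The interpolation of the two models can be sketched as follows. The slide does not say how the interpolation is done, so this assumes simple linear interpolation with a weight `lam`; the feature `child_lemma=overtime`, all weights and all static probabilities are hypothetical.

```python
import math

def maxent_probs(weights, features, candidates):
    """Multinomial logit: P(trg | features) is a softmax over summed feature weights."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in features) for t in candidates}
    top = max(scores.values())
    exps = {t: math.exp(s - top) for t, s in scores.items()}  # shift for stability
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def interpolate(p_static, p_maxent, lam=0.5):
    """Assumed linear interpolation: lam * static + (1 - lam) * MaxEnt."""
    labels = set(p_static) | set(p_maxent)
    return {t: lam * p_static.get(t, 0.0) + (1 - lam) * p_maxent.get(t, 0.0)
            for t in labels}

# Hypothetical numbers for the source lemma "cut".
p_static = {"sekat": 0.5, "snížit": 0.3, "zrušit": 0.2}
weights = {("child_lemma=overtime", "zrušit"): 2.0,
           ("child_lemma=overtime", "sekat"): -1.0}
features = ["child_lemma=overtime", "formeme=v:inf"]

p_maxent = maxent_probs(weights, features, p_static.keys())
p_final = interpolate(p_static, p_maxent, lam=0.5)
best = max(p_final, key=p_final.get)
```

In this toy setup the static model prefers sekat, while the context feature shifts the interpolated distribution towards zrušit, matching the reference translation on the slide.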

25 2012: TectoMoses
- substitute transfer (MaxEnt+HMTM) in TectoMT with phrase-based decoding (Moses) of linearized t-trees
- 2 factors: lemma and formeme, but joint (L+F) n-gram LM better
- project dependencies and use TectoMT synthesis
[Figure: the TectoMT pipeline from slide 3 (rule-based & statistical blocks; ANALYSIS, TRANSFER, SYNTHESIS across the w-, m-, a- and t-layers).]

26 2012: Gibbs sampling and CRP segmentations
- example of treelet-based transfer, P(trg treelet | src treelet)
- Bayesian approach
- use Chinese Restaurant Process on parallel treebank to learn
  - optimal segmentation to treelet pairs
  - optimal translations
- use Gibbs sampling both in training and decoding
Problems
- Pitman-Yor instead of Chinese Restaurant (heavy tail)
- Slice sampling or annealing instead of Gibbs (local maxima)
- We want cut+grass=sekat+tráva and cut+taxes=snížit+daně, but CRP/PYP prefers more reusable segments grass=tráva, taxes=daně (and cut=sekat, cut=snížit).
- CRP/PYP knows nothing about translation (cf. Chung et al. 2013: Sampling tree fragments from forests).
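For readers unfamiliar with the two priors, the standard predictive ("seating") probabilities show why both favour frequently reused segments. This is a generic textbook sketch, not the sampler used in the experiments; `table_counts` stands for how often each treelet pair has been used so far, and the function name is made up.

```python
def pyp_seating_probs(table_counts, alpha, d=0.0):
    """Predictive probabilities of a Pitman-Yor process.

    table_counts: customers per existing table (here: uses of each treelet pair)
    alpha:        concentration parameter
    d:            discount; d = 0 gives the Chinese Restaurant Process
    Returns (probabilities of joining each existing table, probability of a new table).
    """
    n = sum(table_counts)
    k = len(table_counts)
    existing = [(count - d) / (n + alpha) for count in table_counts]
    new = (alpha + d * k) / (n + alpha)
    return existing, new

crp_existing, crp_new = pyp_seating_probs([3, 1], alpha=1.0)
pyp_existing, pyp_new = pyp_seating_probs([3, 1], alpha=1.0, d=0.5)
```

The rich-get-richer term (`count - d`) rewards small, reusable pairs such as grass=tráva, while the discount `d` moves some mass to new tables, which is the heavy-tail behaviour mentioned above.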

27 2013: Easy-first treelet-based transfer
- easy-first decoding of overlapping treelets
- guided learning of both
  - how to translate each treelet
  - in which order
- Handles non-isomorphic transfer natively (not tried).
- How to integrate TreeLM (for partial translations)?

28 2013: Interpolation treelet-based transfer (Interpol)
- similar to easy-first decoding
- one feature (weight): source treelet size
- many other features (lexicalized, PoS tags, ...)
- combinations and quantizations of features
- all matching rules applied, their scores interpolated
- does not handle non-isomorphism
- no guided learning
Example
score(zajímavá) = P_MaxEnt(zajímavá | interesting)
  + w_L · s(interesting → zajímavá)
  + w_Lf · s(interesting adj:compl → zajímavá adj:1)
  + w_Lf · s(interesting adj:compl → zajímavá adj:attr)
  + w_Lfl · s(interesting adj:compl find → zajímavá adj:1 přijít)
  + w_Lfl · s(interesting adj:compl find → zajímavá adj:attr najít)
  + w_Lflf · s(interesting adj:compl find v:fin → zajímavá adj:1 přijít v:fin)
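The example score is a weighted sum over every matching rule that proposes the same target lemma. A minimal sketch of that accumulation, with invented rule scores and weights; the tuple layout `(rule type, target lemma, s)` and the second candidate `poutavá` are assumptions for illustration.

```python
from collections import defaultdict

def interpol_scores(p_maxent, matched_rules, type_weights):
    """Add w_type * s for each matching rule to the MaxEnt score of its target lemma."""
    scores = defaultdict(float, p_maxent)
    for rule_type, target_lemma, s in matched_rules:
        scores[target_lemma] += type_weights[rule_type] * s
    return dict(scores)

# One weight per source treelet shape: lemma (L), +formeme (Lf),
# +parent lemma (Lfl), +parent formeme (Lflf). Values are invented.
type_weights = {"L": 0.1, "Lf": 0.2, "Lfl": 0.3, "Lflf": 0.4}
p_maxent = {"zajímavá": 0.5, "poutavá": 0.3}
matched_rules = [
    ("L", "zajímavá", 0.7),
    ("Lf", "zajímavá", 0.5),
    ("Lf", "zajímavá", 0.1),
    ("Lfl", "zajímavá", 0.6),
    ("Lfl", "zajímavá", 0.05),
    ("Lflf", "zajímavá", 0.8),
    ("L", "poutavá", 0.2),
]
scores = interpol_scores(p_maxent, matched_rules, type_weights)
```

Rules with different target formemes or parents still add to the same lemma's score, mirroring the adj:1 and adj:attr terms in the example.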

31 How is Interpol (2013) different from MaxEnt (2010)?
The Logit, or There and Back Again
- Interpol can be trained with MaxEnt (multinomial logit)
- HMTM can be applied on top of Interpol
- Interpol is equivalent to the 2010 approach with additional features
Summary features
- feature value precomputed on whole training data
- relative frequency, or rather (#(src,trg) - 1)/#(src)
- dense feature (unlike sparse lexicalized higher-order features)
- Moses also uses summary features (TM-forward, TM-backward, LM)
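The discounted relative frequency can be computed directly from the count tables. This sketch makes two assumptions that the slide does not state: that the "- 1" is there so a training instance does not count as evidence for itself, and that unseen pairs are clipped to 0 instead of going negative. The function name and the toy counts are invented.

```python
from collections import Counter

def summary_feature(pair_counts, source_counts, src, trg):
    """Dense feature value (#(src,trg) - 1) / #(src), clipped at 0 (assumption)."""
    n = source_counts[src]
    if n == 0:
        return 0.0
    return max(pair_counts[(src, trg)] - 1, 0) / n

# Toy aligned lemma pairs (invented counts).
pairs = [("find", "najít")] * 3 + [("find", "přijít")]
pair_counts = Counter(pairs)
source_counts = Counter(src for src, _ in pairs)

f_najit = summary_feature(pair_counts, source_counts, "find", "najít")
f_prijit = summary_feature(pair_counts, source_counts, "find", "přijít")
```

A pair seen only once gets the value 0, so the classifier cannot lean on a summary feature that merely memorizes a single training example.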

32 2014: VowpalWabbit-based transfer
- VW is an ultra-fast and modular machine learning toolkit
- optimized SGD (AdaGrad, dense+sparse features, ...)
- cost-sensitive one-against-all reduction to binary classification
- logistic loss enables probabilistic interpretation (for HMTM)
- all lemmas in one model, fixed memory requirements
- label-dependent features (features shared for more lemmas)
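To make the reduction concrete, here is a from-scratch toy of one-against-all classification with logistic loss and AdaGrad updates. It does not call VowpalWabbit and leaves out cost-sensitivity, label-dependent features and feature hashing; the class name, learning rate and training examples are all assumptions for illustration.

```python
import math
from collections import defaultdict

class OneAgainstAll:
    """One binary logistic classifier per label, trained with AdaGrad SGD."""

    def __init__(self, labels, lr=0.5):
        self.labels = list(labels)
        self.lr = lr
        self.w = defaultdict(float)   # weight per (label, feature)
        self.g2 = defaultdict(float)  # AdaGrad sum of squared gradients

    def prob(self, label, feats):
        """Sigmoid of the summed weights: the probabilistic output of logistic loss."""
        z = sum(self.w[(label, f)] for f in feats)
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, feats, gold):
        for label in self.labels:
            y = 1.0 if label == gold else 0.0
            grad = self.prob(label, feats) - y  # gradient of logistic loss w.r.t. z
            for f in feats:
                key = (label, f)
                self.g2[key] += grad * grad
                self.w[key] -= self.lr * grad / (math.sqrt(self.g2[key]) + 1e-8)

    def predict(self, feats):
        return max(self.labels, key=lambda l: self.prob(l, feats))

# Toy data for the source lemma "cut" (contexts taken from the Gibbs slide).
data = [(["child=grass"], "sekat"), (["child=tax"], "snížit")]
model = OneAgainstAll(["sekat", "snížit"])
for _ in range(20):
    for feats, gold in data:
        model.update(feats, gold)
```

Because each per-label classifier outputs a value in (0, 1), the scores can be handed to HMTM as emission-like quantities, which is the point of using logistic loss here.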

33 Thank you
