TectoMT: Machine Translation System

1 TectoMT: Machine Translation System
Martin Popel, ÚFAL (Institute of Formal and Applied Linguistics), Charles University in Prague
Universität des Saarlandes, November 4th, 2010, Saarbrücken, Germany

2 Outline
- NLP framework Treex
- Demo: translation step by step
- Annotation of translation errors
- Improvements
  - Parsing parentheses
  - Hidden Markov Tree Models (HMTM)
  - Combining dictionaries
  - Maximum Entropy dictionary
- Results: three metrics of translation quality

3 Treex: framework for NLP
Modular, open source, Perl, Linux, OOP-style.
- Data: models for stochastic tools, translation dictionaries, special-purpose lexical databases
- Third-party tools: Malt parser, TreeTagger, fnTBL, CRF++
- Visualization: TrEd (Tree Editor with SVG and PDF export options)
- In-house tools: taggers, parsers, NE recognizers, language models
- Machine learning tools
- API, Treex core classes: Treex::Core::Document, Treex::Core::Node, Treex::Core::Scenario, Treex::Core::Block
- Treex blocks: Tree_tagger, McD_parser, Mark_passives
- Applications: scenarios + format conversions
- Format convertors (to & from tmt): plain text, HTML & various XML; corpora: PDT, PennTB, EMILLE, PADT, CoNLL, vertical
The diagram distinguishes a downloaded shared part from an SVN-versioned part.

4 Translation scheme: transfer over the tectogrammatical layer
ANALYSIS (source language, English) → TRANSFER → SYNTHESIS (target language, Czech)
Layers, from deepest to shallowest:
- t-layer: tectogrammatical layer (deep syntax)
- a-layer: analytical layer (shallow syntax)
- m-layer: morphological layer
- w-layer

5 Translation scheme: rule-based & statistical blocks
ANALYSIS (source language, English), blocks by layer:
- w-layer: segmentation
- m-layer: tokenization, lemmatization, tagger (Morce)
- a-layer: parser (McDonald's MST), analytical functions
- t-layer: mark edges to contract, build t-tree, fill formemes, fill grammatemes
TRANSFER (t-layer): query dictionary, use HMTM
SYNTHESIS (target language, Czech): fill morphological categories, impose agreement, add functional words, generate wordforms, concatenate
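
The scheme above chains many small blocks into a scenario. The following Python sketch only illustrates that idea; Treex itself is written in Perl, and the classes here (`Document`, `Block`, `Scenario`, `Tokenize`, `Lemmatize`) are hypothetical stand-ins, not the Treex API. The tokenization and lemmatization rules are deliberately naive placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    """Toy stand-in for the document object that blocks modify in place."""
    text: str
    tokens: List[str] = field(default_factory=list)
    lemmas: List[str] = field(default_factory=list)
    applied: List[str] = field(default_factory=list)  # names of blocks run so far


class Block:
    """A block performs one small, well-defined processing step."""

    def process_document(self, doc: Document) -> None:
        raise NotImplementedError


class Tokenize(Block):
    def process_document(self, doc: Document) -> None:
        # Naive placeholder: split on whitespace after detaching periods.
        doc.tokens = doc.text.replace(".", " .").split()


class Lemmatize(Block):
    def process_document(self, doc: Document) -> None:
        # Naive placeholder: the "lemma" is just the lowercased token.
        doc.lemmas = [token.lower() for token in doc.tokens]


class Scenario:
    """A scenario is an ordered list of blocks applied one after another."""

    def __init__(self, blocks: List[Block]) -> None:
        self.blocks = list(blocks)

    def run(self, doc: Document) -> Document:
        for block in self.blocks:
            block.process_document(doc)
            doc.applied.append(type(block).__name__)
        return doc


doc = Scenario([Tokenize(), Lemmatize()]).run(
    Document("Machine translation should be easy.")
)
print(doc.applied, doc.lemmas)
```

Keeping each step in its own block is what lets rule-based and statistical components be mixed freely within one scenario.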

6 Demo Translation: Analysis
raw text: Machine translation should be easy.
m-layer (lemmas and tags): machine/NN translation/NN should/MD be/VB easy/JJ ./.
a-layer (analytical functions): machine: Atr; translation: Sb; should: Pred; be: Obj; easy: Pnom; ".": AuxK

7 Demo Translation: Analysis
Step: mark functional words (raw text, m-layer and a-layer as on slide 6)

8 Demo Translation: Analysis
Step: mark edges to be contracted (raw text, m-layer and a-layer as on slide 6)

9 Demo Translation: Analysis
Step: build t-tree (backbone)
t-layer: be governs translation and easy; translation governs machine

10 Demo Translation: Analysis
Step: fill formemes
t-layer: be (v:fin), translation (n:subj), machine (n:attr), easy (adj:compl)

11 Demo Translation: Analysis
Step: fill grammatemes
- machine (n:attr)
- translation (n:subj): number = singular
- be (v:fin): tense = simple, modality, conditional
- easy (adj:compl): degree of comparison = positive

12 Demo Translation: Transfer
Step: build target t-tree by cloning the source t-tree
source t-layer: be (v:fin), translation (n:subj), machine (n:attr), easy (adj:compl)
target t-layer: be (v:fin), translation (n:subj), machine (n:attr), easy (adj:compl)
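
A minimal Python sketch of the cloning step, using the example t-tree from the demo. The `TNode` class and the `source` back-link are assumptions made for illustration; they are not the Treex node API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TNode:
    """Hypothetical t-layer node: lemma, formeme, grammatemes, children."""
    lemma: str
    formeme: str
    grammatemes: Dict[str, str] = field(default_factory=dict)
    children: List["TNode"] = field(default_factory=list)
    source: Optional["TNode"] = None  # assumed link back to the source node


def clone_ttree(node: TNode) -> TNode:
    """Copy the tree topology and labels; each copy remembers its original."""
    copy = TNode(node.lemma, node.formeme, dict(node.grammatemes), source=node)
    copy.children = [clone_ttree(child) for child in node.children]
    return copy


# Source t-tree of "Machine translation should be easy."
machine = TNode("machine", "n:attr")
translation = TNode("translation", "n:subj", {"number": "singular"}, [machine])
easy = TNode("easy", "adj:compl", {"degree": "positive"})
be = TNode("be", "v:fin", children=[translation, easy])

target_root = clone_ttree(be)
print([child.lemma for child in target_root.children])
```

Because the target tree starts as an exact copy, later transfer steps only relabel nodes; this is the t-tree isomorphism assumption that the error annotation later in the talk refers to.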

13 Demo Translation: Transfer
Step: get translation variants for lemmas and formemes
- machine (n:attr): lemmas počítač, stroj, strojový; formemes n:2, n:attr, adj:attr
- translation (n:subj): lemmas překlad, převod; formeme n:1
- be (v:fin): lemmas být, mít; formemes v:fin, v:inf
- easy (adj:compl): lemmas snadný, jednoduchý; formemes adj:compl, n:1, adv:

14 Demo Translation: Transfer
Step: select the best combination of lemmas and formemes (same variants as on slide 13)

15 Demo Translation: Synthesis
Step: build target a-layer by cloning the target t-layer
target t-layer: být (v:fin), překlad (n:1), snadný (adj:compl), strojový (adj:attr)
target a-layer: být, překlad, snadný, strojový

16 Demo Translation: Synthesis
Step: fill morphological categories
- strojový: degree = positive
- překlad: number = singular, gender = masc. inanim., case = nominative
- snadný: degree = positive

17 Demo Translation: Synthesis
Step: impose agreement
- být: number = singular, gender = masc. inanim.
- strojový: degree = positive, number = singular, gender = masc. inanim., case = nominative
- překlad: number = singular, gender = masc. inanim., case = nominative
- snadný: degree = positive, number = singular, gender = masc. inanim., case = nominative
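
The agreement step can be pictured as copying morphological categories from the controlling noun to the dependent words. The sketch below reproduces the values from slide 17; the dict-based node representation and the helper `copy_categories` are assumptions made for illustration, not the actual implementation.

```python
from typing import Dict, Iterable


def copy_categories(source: Dict[str, str], target: Dict[str, str],
                    categories: Iterable[str]) -> None:
    """Copy the listed morphological categories from source to target node."""
    for category in categories:
        target[category] = source[category]


# State after "fill morphological categories" (slide 16).
preklad = {"lemma": "překlad", "number": "singular",
           "gender": "masc. inanim.", "case": "nominative"}
strojovy = {"lemma": "strojový", "degree": "positive"}
snadny = {"lemma": "snadný", "degree": "positive"}
byt = {"lemma": "být"}

# Attribute agrees with its governing noun.
copy_categories(preklad, strojovy, ["number", "gender", "case"])
# Adjectival complement agrees with the subject.
copy_categories(preklad, snadny, ["number", "gender", "case"])
# Finite predicate agrees with the subject in number and gender.
copy_categories(preklad, byt, ["number", "gender"])

print(strojovy, snadny, byt, sep="\n")
```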

18 Demo Translation: Synthesis
Step: add functional words
target t-layer: být (v:fin), překlad (n:1), snadný (adj:compl), strojový (adj:attr)
target a-layer nodes: strojový, překlad, mít, by, být, snadný, "." (newly added: mít, by, ".")

19 Demo Translation: Synthesis
Step: reorder clitics (same nodes as on slide 18)

20 Demo Translation: Synthesis
Step: generate wordforms
mít → měl

21 Demo Translation: Synthesis
Step: concatenate tokens for output
Strojový překlad by měl být snadný.

22 Demo Translation: Real Scenario
SEnglishW_to_SEnglishM::
Tokenization Normalize_forms Fix_tokenization TagMorce Fix_mtags Lemmatize_mtree
SEnglishM_to_SEnglishN::
Stanford_named_entities Distinguish_personal_names
SEnglishM_to_SEnglishA::
McD_parser Fill_is_member_from_deprel Fix_tags_after_parse McD_parser REPARSE=1 Fill_is_member_from_deprel Fix_McD_topology Fix_nominal_groups Fix_is_member Fix_atree Fix_multiword_prep_and_conj Fix_dicendi_verbs Fill_afun_AuxCP_Coord Fill_afun
SEnglishA_to_SEnglishT::
Mark_edges_to_collapse Mark_edges_to_collapse_neg Build_ttree Fill_is_member Move_aux_from_coord_to_members Fix_tlemmas Assign_coap_functors Fix_either_or Fix_is_member Mark_clause_heads Mark_passives Assign_functors Mark_infin Mark_relclause_heads Mark_relclause_coref Mark_dsp_root Mark_parentheses Recompute_deepord Assign_nodetype Assign_grammatemes Detect_formeme Rehang_shared_attr Detect_voice Fix_imperatives Fill_is_name_of_person Fill_gender_of_person Add_cor_act Find_text_coref
SEnglishT_to_TCzechT::
Clone_ttree Translate_LF_phrases Translate_LF_joint_static Delete_superfluous_tnodes Translate_F_try_rules Translate_F_add_variants Translate_F_rerank Translate_L_try_rules Translate_L_add_variants Translate_LF_numerals_by_rules Translate_L_filter_aspect Transform_passive_constructions Prune_personal_name_variants Remove_unpassivizable_variants Translate_LF_compounds Cut_variants Rehang_to_eff_parents Translate_LF_tree_Viterbi Rehang_to_orig_parents Fix_transfer_choices Translate_L_female_surnames Add_noun_gender Add_relpron_below_rc Change_Cor_to_PersPron Add_PersPron_below_vfin Add_verb_aspect Fix_date_time Fix_grammatemes_after_transfer Fix_negation Move_adjectives_before_nouns Move_genitives_to_postposit Move_relclause_to_postposit Move_dicendi_closer_to_dsp Move_PersPron_next_to_verb Move_enough_before_adj Fix_money Recompute_deepord Find_gram_coref_for_refl_pron Neut_PersPron_gender_from_antec Override_pp_with_phrase_translation Valency_related_rules Fill_clause_number Turn_text_coref_to_gram_coref
TCzechT_to_TCzechA::
Clone_atree Distinguish_homonymous_mlemmas Reverse_number_noun_dependency Init_morphcat Fix_possessive_adjectives Mark_subject Impose_pron_z_agr Impose_rel_pron_agr Impose_subjpred_agr Impose_attr_agr Impose_compl_agr Drop_subj_pers_prons Add_prepositions Add_subconjs Add_reflex_particles Add_auxverb_compound_passive Add_auxverb_modal Add_auxverb_compound_future Add_auxverb_conditional Add_auxverb_compound_past Add_clausal_expletive_pronouns Resolve_verbs Project_clause_number Add_parentheses Add_sent_final_punct Add_subord_clause_punct Add_coord_punct Add_apposition_punct Choose_mlemma_for_PersPron Generate_wordforms Move_clitics_to_wackernagel Recompute_ordering Delete_superfluous_prepos Delete_empty_nouns Vocalize_prepositions Capitalize_sent_start Capitalize_named_entities
TCzechA_to_TCzechW::
Concatenate_tokens Ascii_quotes Remove_repeated_tokens

23 Annotation of Translation Errors
Sample of 250 sentences, 1463 errors in total. Each error is annotated with:
- Type: lemma, formeme, gram., w. order, ...
- Subtype (e.g. for gram.): gender, person, tense, ...
- Seriousness: serious, minor
- Circumstances: coordination, named entity, numbers
- Source: tok, lem, tagger, parser, tecto, trans, x, syn, ?
Errors by source:
- ANALYSIS: 30%
- TRANSFER: 67% (errors caused by the assumption of t-tree isomorphism: 8%; other transfer errors: 59%)
- SYNTHESIS: 3%

24 Recent Improvements
- Parsing parentheses
- Hidden Markov Tree Models (HMTM)
- Combining dictionaries
- Maximum Entropy dictionary

25 Parsing Parentheses This sentence (excluding the long parenthesis, which was added as an example) is short.

26 HMTM: Motivation
Task: select the best combination of lemmas and formemes.
- machine (n:attr): lemmas počítač, stroj, strojový; formemes n:2, n:attr, adj:attr
- translation (n:subj): lemmas překlad, převod; formeme n:1
- be (v:fin): lemmas být, mít; formemes v:fin, v:inf
- easy (adj:compl): lemmas snadný, jednoduchý; formemes adj:compl, n:1, adv:

27 HMTM: Motivation
Reformulated task: select the best label for each node.
- machine (n:attr): počítač n:2, počítač n:attr, strojový adj:attr, ...
- translation (n:subj): překlad n:1, převod n:1
- be (v:fin): být v:fin, být v:inf, mít v:fin, mít v:inf
- easy (adj:compl): snadný adj:compl, jednoduchý adj:compl, ...

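28 HMTM in MT
The example on this slide translates from Czech to English.
Source sentence: Strojový překlad by měl být snadný.
Target sentence: Machine translation should be easy.
Source tree (Czech), produced by ANALYSIS: ROOT governs být; být governs překlad and snadný; překlad governs strojový.
Target tree (English), chosen in TRANSFER from candidate labels and passed to SYNTHESIS: být → be / have; překlad → translation / arcade; snadný → easy / simple; strojový → machine / engine.

$$P(\text{optimal\_tree}) = P_E(\text{strojový} \mid \text{machine}) \cdot P_T(\text{machine} \mid \text{translation}) \cdot P_E(\text{překlad} \mid \text{translation}) \cdot P_T(\text{translation} \mid \text{be}) \cdot P_E(\text{snadný} \mid \text{easy}) \cdot P_T(\text{easy} \mid \text{be}) \cdot P_E(\text{být} \mid \text{be}) \cdot P_T(\text{be} \mid \text{ROOT})$$

Probabilities shown on the slide:
- $P_E(\text{být} \mid \text{be}) = 0.8$; $P_E(\text{být} \mid \text{have})$ (value missing)
- $P_E(\text{překlad} \mid \text{translation}) = 0.6$; $P_E(\text{překlad} \mid \text{arcade}) = 0.7$
- $P_E(\text{strojový} \mid \text{machine}) = 0.4$; $P_E(\text{strojový} \mid \text{engine}) = 0.5$
- $P_T(\text{machine} \mid \text{translation})$ (value missing)
Legend:
- $P_E(\text{source} \mid \text{target})$: emission probabilities, i.e. the translation model
- $P_T(\text{dependent} \mid \text{governing})$: transition probabilities, i.e. the target-language tree model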
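
The product above can be maximized with a Viterbi-style dynamic program over the tree. The Python sketch below is one way to do it, not the TectoMT implementation. The five emission values marked "slide" are taken from slide 28; every other number (all transition probabilities and the remaining emissions) is invented here purely so the example runs.

```python
from functools import lru_cache

# Source (Czech) dependency tree: node -> children.
CHILDREN = {"být": ("překlad", "snadný"), "překlad": ("strojový",),
            "snadný": (), "strojový": ()}
CANDIDATES = {"být": ("be", "have"), "překlad": ("translation", "arcade"),
              "snadný": ("easy", "simple"), "strojový": ("machine", "engine")}

EMISSION = {  # P_E(source | target)
    ("být", "be"): 0.8,                # slide
    ("být", "have"): 0.1,              # assumed
    ("překlad", "translation"): 0.6,   # slide
    ("překlad", "arcade"): 0.7,        # slide
    ("snadný", "easy"): 0.5,           # assumed
    ("snadný", "simple"): 0.4,         # assumed
    ("strojový", "machine"): 0.4,      # slide
    ("strojový", "engine"): 0.5,       # slide
}
TRANSITION = {  # P_T(dependent | governing); all values assumed
    ("be", "ROOT"): 0.4, ("have", "ROOT"): 0.3,
    ("translation", "be"): 0.1, ("arcade", "be"): 0.01,
    ("translation", "have"): 0.05, ("arcade", "have"): 0.01,
    ("easy", "be"): 0.2, ("simple", "be"): 0.1,
    ("easy", "have"): 0.01, ("simple", "have"): 0.01,
    ("machine", "translation"): 0.3, ("engine", "translation"): 0.01,
    ("machine", "arcade"): 0.05, ("engine", "arcade"): 0.02,
}


@lru_cache(maxsize=None)
def best_subtree(node, label):
    """Best (probability, labelling) of node's subtree given node's label."""
    prob = EMISSION[(node, label)]
    labelling = {node: label}
    for child in CHILDREN[node]:
        options = []
        for child_label in CANDIDATES[child]:
            sub_prob, sub_labels = best_subtree(child, child_label)
            options.append((TRANSITION[(child_label, label)] * sub_prob, sub_labels))
        child_prob, child_labels = max(options, key=lambda option: option[0])
        prob *= child_prob
        labelling.update(child_labels)  # cached dicts are read, never mutated
    return prob, labelling


def tree_viterbi(root):
    """Pick the root label that maximizes the probability of the whole tree."""
    options = []
    for label in CANDIDATES[root]:
        prob, labelling = best_subtree(root, label)
        options.append((TRANSITION[(label, "ROOT")] * prob, labelling))
    return max(options, key=lambda option: option[0])


best_prob, best_labels = tree_viterbi("být")
print(best_labels, best_prob)
```

With these numbers the locally most probable emissions ("arcade", "engine") lose to "translation" and "machine" once the target-language tree model is taken into account, which is the effect the slide illustrates.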

29 Combining Dictionaries
New general interface (for lemmas and formemes):
$dict->get_translations($input_label, $features)
- returns a list of translation variants including probabilities
- OOP style: a dictionary constructor can take another dictionary (or more) as a parameter, which yields a hierarchy
Four basic types of dictionaries:
- Static (plain): loaded from a file; lemma → lemma
- Context: loaded from a file; lemma, features → lemma
- Derivational: translations derived dynamically from an input dictionary
- Combinational: combination of several input dictionaries
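
A rough Python rendering of this interface may help; the real code is Perl, and only the method name `get_translations` and its two arguments come from the slide. The class names, the lowercasing wrapper and the probabilities below are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Variant = Tuple[str, float]          # (translation, probability)
Features = Optional[Dict[str, str]]  # context features, unused by static dicts


class Dictionary:
    """Common interface shared by all dictionary types."""

    def get_translations(self, input_label: str,
                         features: Features = None) -> List[Variant]:
        raise NotImplementedError


class StaticDict(Dictionary):
    """Plain lemma -> variants table; the dict stands in for a loaded file."""

    def __init__(self, table: Dict[str, List[Variant]]) -> None:
        self.table = table

    def get_translations(self, input_label, features=None):
        return list(self.table.get(input_label, []))


class LowercasingDict(Dictionary):
    """Hypothetical wrapper: takes another dictionary as a constructor
    parameter and queries it with the lowercased input label."""

    def __init__(self, inner: Dictionary) -> None:
        self.inner = inner

    def get_translations(self, input_label, features=None):
        return self.inner.get_translations(input_label.lower(), features)


static = StaticDict({"machine": [("stroj", 0.6), ("počítač", 0.4)]})  # assumed values
wrapped = LowercasingDict(static)
print(wrapped.get_translations("Machine"))
```

Because every dictionary exposes the same method, wrappers can be nested to any depth, which is what makes the hierarchies on the following slides possible.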

30 Hierarchy of lemma dictionaries
Backoff over: 1. MaxEnt (CzEng), then Baseline (CzEng) and Human

31 Hierarchy of lemma dictionaries
Interpolated (weights 0.8, 0.1, 0.1) over: MaxEnt (CzEng), Human, Baseline (CzEng)
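
The two combination strategies can be sketched as follows. The weights 0.8, 0.1 and 0.1 appear on the slide, but which dictionary receives which weight is an assumption here (0.8 for MaxEnt), as are the class names and all dictionary contents.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Variant = Tuple[str, float]


class Static:
    """In-memory stand-in for a dictionary loaded from a file."""

    def __init__(self, table: Dict[str, List[Variant]]) -> None:
        self.table = table

    def get_translations(self, label, features=None):
        return list(self.table.get(label, []))


class Backoff:
    """Return the variants of the first dictionary that knows the label."""

    def __init__(self, *dictionaries) -> None:
        self.dictionaries = dictionaries

    def get_translations(self, label, features=None):
        for dictionary in self.dictionaries:
            variants = dictionary.get_translations(label, features)
            if variants:
                return variants
        return []


class Interpolated:
    """Weighted sum of the probabilities from several dictionaries."""

    def __init__(self, weighted: Sequence[Tuple[float, object]]) -> None:
        self.weighted = list(weighted)

    def get_translations(self, label, features=None):
        scores = defaultdict(float)
        for weight, dictionary in self.weighted:
            for translation, prob in dictionary.get_translations(label, features):
                scores[translation] += weight * prob
        return sorted(scores.items(), key=lambda item: -item[1])


# Toy contents, invented for the example.
maxent = Static({"easy": [("snadný", 0.7), ("jednoduchý", 0.3)]})
human = Static({"easy": [("snadný", 1.0)]})
baseline = Static({"easy": [("jednoduchý", 0.6), ("snadný", 0.4)],
                   "water": [("voda", 1.0)]})

interpolated = Interpolated([(0.8, maxent), (0.1, human), (0.1, baseline)])
backoff = Backoff(maxent, baseline, human)
print(interpolated.get_translations("easy"))
print(backoff.get_translations("water"))
```

Backoff consults a lower-priority dictionary only when the higher-priority ones return nothing, whereas interpolation always blends all sources.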

32 Hierarchy of lemma dictionaries
Dictionaries shown: Interpolated, MaxEnt (CzEng), Human, Prefixes, Baseline (CzEng)
Prefixes example: multi-core → více-jádrový, více-jádro, multi-jádrový, multi-jádro
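
A derivational dictionary such as Prefixes can be sketched as a wrapper that strips a known prefix, translates the stem with its input dictionary, and re-attaches target prefixes. The example reproduces the four variants from the slide; the prefix table, the stem probabilities and the uniform split of probability mass are assumptions.

```python
from typing import Dict, List, Tuple

Variant = Tuple[str, float]


class Static:
    """In-memory stand-in for the input dictionary."""

    def __init__(self, table: Dict[str, List[Variant]]) -> None:
        self.table = table

    def get_translations(self, label, features=None):
        return list(self.table.get(label, []))


class PrefixDict:
    """Derive translations of prefixed lemmas from translations of the stem."""

    # Assumed prefix table: source prefix -> possible target prefixes.
    PREFIXES = {"multi-": ["více-", "multi-"]}

    def __init__(self, inner) -> None:
        self.inner = inner

    def get_translations(self, label, features=None):
        variants = []
        for source_prefix, target_prefixes in self.PREFIXES.items():
            if not label.startswith(source_prefix):
                continue
            stem = label[len(source_prefix):]
            for stem_translation, prob in self.inner.get_translations(stem, features):
                for target_prefix in target_prefixes:
                    # Assumption: split the stem's probability uniformly.
                    variants.append((target_prefix + stem_translation,
                                     prob / len(target_prefixes)))
        return variants


stems = Static({"core": [("jádrový", 0.5), ("jádro", 0.5)]})  # assumed probabilities
prefixes = PrefixDict(stems)
derived = prefixes.get_translations("multi-core")
print(derived)
```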

33 Hierarchy of lemma dictionaries
Dictionaries shown: Interpolated, MaxEnt (CzEng), Human, Prefixes, Deadjectival_adverbs, Nouns_to_adjectives, Baseline (CzEng)
Deadjectival_adverbs example: deaf → hluchý → hluše; necitlivý → necitlivě
Nouns_to_adjectives example: water → voda → vodový, vodní

34 Hierarchy of lemma dictionaries
Dictionaries shown: Interpolated, MaxEnt (CzEng), Human, Prefixes, Deadjectival_adverbs, Hyphen_compounds, Nouns_to_adjectives, Baseline (CzEng)
Hyphen_compounds example: high-water → vodový, vysoko-vodový, vodní, vysoko-vodní

35 Hierarchy of lemma dictionaries
Full hierarchy, dictionaries shown: Numbers, Suffixes, Interpolated, MaxEnt (CzEng), Human, Backoff, Transliterate, Verbs_to_nouns, Prefixes, Deadjectival_adverbs, Baseline (CzEng), Hyphen_compounds, Deverbal_adjectives, Nouns_to_adjectives

36 Maximum Entropy Dictionary
Baseline Dictionary

$$p(y \mid x) = \frac{\mathrm{count}(x, y)}{\mathrm{count}(x)}$$

- maximum likelihood estimates (from the training sections of CzEng 0.9)
- pruned by thresholds on $p(x \mid y)$ and $p(y \mid x)$
- no context used: $x$ = source lemma, $y$ = target lemma
MaxEnt Dictionary

$$p(y \mid x) = \frac{1}{Z(x)} \exp \sum_i \lambda_i f_i(x, y)$$

- one MaxEnt model for each source lemma (same training data as for the Baseline Dictionary)
- interpolated with the Baseline Dictionary (due to pruning)
- context features used ($x$ = source context): local tree context, local linear context, morphological & syntactic categories, ...
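
Both formulas are easy to evaluate on toy data. In the sketch below the training pairs, the Czech candidates, the feature templates and the weights $\lambda_i$ are all invented for illustration; real weights would be learned from CzEng, and the actual feature set is richer.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def baseline_probs(pairs: List[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """Maximum likelihood estimate p(y|x) = count(x, y) / count(x)."""
    pair_counts = Counter(pairs)
    source_counts = Counter(x for x, _ in pairs)
    return {(x, y): c / source_counts[x] for (x, y), c in pair_counts.items()}


def active_features(context: Dict[str, str], y: str) -> List[Tuple[str, str, str]]:
    """Binary features f_i(x, y): one per (context attribute, value, target)."""
    return [(name, value, y) for name, value in context.items()]


def maxent_probs(context: Dict[str, str], candidates: List[str],
                 weights: Dict[Tuple[str, str, str], float]) -> Dict[str, float]:
    """p(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x)."""
    scores = {
        y: math.exp(sum(weights.get(f, 0.0) for f in active_features(context, y)))
        for y in candidates
    }
    z = sum(scores.values())  # normalization constant Z(x)
    return {y: score / z for y, score in scores.items()}


# Toy aligned lemma pairs (invented): "cut" seen 4 times.
pairs = [("cut", "řezat")] * 3 + [("cut", "zrušit")]
baseline = baseline_probs(pairs)

# Toy model for the source lemma "cut" with invented weights.
weights = {
    ("child_lemma", "overtime", "zrušit"): 2.0,
    ("formeme", "v:inf", "řezat"): 0.5,
}
context = {"child_lemma": "overtime", "formeme": "v:inf"}
probs = maxent_probs(context, ["řezat", "zrušit", "snížit"], weights)
print(baseline, probs)
```

The baseline prefers "řezat" regardless of context, while the context-sensitive model can prefer "zrušit" when the dependent is "overtime".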

37 Maximum Entropy Dictionary: example
Source sentence: He agreed with the unions to cut all overtime.
Target sentence: Dohodl se s odbory na zrušení všech přesčasů.
Source t-tree nodes (lemma / formeme) after ANALYSIS:
- agree / v:fin (tense=past, voice=active, negation=0, sempos=v)
- he / n:subj
- union / n:with+x
- cut / v:inf (has_left_child=0, has_right_child=1, sempos=v, tag=vb, position=right, named_entity=0)
- overtime / n:obj
- all / adj:attr
Translation variants for "cut" considered in TRANSFER: chop, saw, trim, shorten, lumber, hew, lower, delete, crop, abolish, cancel, ...

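38 Results: BLEU
WMT = Workshop on Statistical Machine Translation
BLEU by year:
  Year                  BLEU
  2008 (WMT)            6 (decimals lost)
  (year lost) (WMT)     7.3
  2009/09               10.2
  2010/01               10.4
  2010/02               11.3
  2010/03 (WMT)         12.6
[Chart "WMT results" comparing the systems bojar, tectomt, pctrans, uedin, google, eurotran; values not preserved.]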

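39 Results: BLEU vs. Ranks
[Scatter plot: x-axis BLEU (automatic), y-axis Rank (human judgement). Labelled ranks: 1: Google; 2: Bojar; 3-4: Moses (uedin); 6: Eurotran; PC Translator (rank not preserved). The label for rank 5 and the value of R² are not preserved.]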

40 Results: task-based evaluation
Example
Translated text: Bílý dům náčelník štábu Rahm Emanuel je naplánovaný k pojídaný vsedě v budově Kongresu se sněmovnou diskutující Nancy Pelosi v 9 p.m.
Yes/No statements:
- Nancy Pelosi je zaměstnankyní Bílého domu. (Nancy Pelosi is an employee of the White House.)
- Schůzka se koná dopoledne. (The meeting takes place in the morning.)
- Schůzka se koná v Bílém domě. (The meeting takes place in the White House.)
Results (values not preserved): PC Translator, Google, Moses
Jan Berka, Martin Černý, 2010: 18 annotators, 1905 × 3 statements, 4 systems evaluated, 4 domains (dirs, news, meet, quiz)

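41 Examples of Translation
- A miss by an inch is a miss by a mile. → Slečna palec je slečna miliónu. (back-translation: "Miss Inch is Miss of a million.")
- I'd rather be a hammer than a nail. → Spíše bych byl kladivo než nehet. (back-translation: "I would rather be a hammer than a fingernail.")
- A bird in the hand is worth two in the bush. → Pták v ruce je cenný dvakrát v Bushovi. (back-translation: "A bird in the hand is valuable twice in Bush.")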
