Translation Quality Estimation: Past, Present, and Future
1 Translation Quality Estimation: Past, Present, and Future André Martins MT Marathon, Lisbon, August 31st, 2017 André Martins (Unbabel) Quality Estimation MTM, 31/8/17 1 / 69
2 This Talk First part: largely based on Lucia Specia's MTM16 slides. Second part: joint work with Marcin, Fabio, Ramon, Chris, Roman. Third part: my thoughts on the future of QE. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 2 / 69
3 Outline 1 MT Evaluation & Quality Estimation 2 Pushing the Limits of Quality Estimation 3 The Future André Martins (Unbabel) Quality Estimation MTM, 31/8/17 3 / 69
4 Why Do We Care About Evaluation? In the business of developing MT, we need to: measure progress over new/alternative versions; compare different MT systems; decide whether a translation is good enough for a given purpose; optimize parameters of MT systems; understand where systems go wrong (diagnosis)... remember Yvette's lecture on Monday: André Martins (Unbabel) Quality Estimation MTM, 31/8/17 4 / 69
5 Why Do We Care About Evaluation? One should optimize a system using the same metric that will be used to evaluate it. Issue: how to choose a metric? The choice should be related to the system's purpose (not the case in practice). Other aspects are important for tuning (sentence/corpus-level, fast, cheap, differentiable,...) André Martins (Unbabel) Quality Estimation MTM, 31/8/17 5 / 69
7 Complex Problem What does quality mean? Fluent? Adequate? Both? Easy to post-edit? System A better than system B?... Quality for whom/what? End-user (gisting vs dissemination) Post-editor (light vs heavy post-editing) Other applications (e.g. CLIR) MT-system (tuning or diagnosis for improvement)... André Martins (Unbabel) Quality Estimation MTM, 31/8/17 6 / 69
10 Complex Problem MT Do buy this product, it's their craziest invention! HT Do not buy this product, it's their craziest invention! Severe if the end-user does not speak the source language; trivial to post-edit by translators. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 7 / 69
13 Complex Problem MT Six-hours battery, 30 minutes to full charge last. HT The battery lasts 6 hours and it can be fully recharged in 30 minutes. OK for gisting (meaning preserved); very costly for post-editing if style is to be preserved. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 8 / 69
14 A Taxonomy of MT Evaluation Methods Manual Automatic André Martins (Unbabel) Quality Estimation MTM, 31/8/17 9 / 69
15 A Taxonomy of MT Evaluation Methods Manual: Direct assessment (Scoring). Automatic. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 9 / 69
16 Manual Assessment: Scoring Is this translation correct? André Martins (Unbabel) Quality Estimation MTM, 31/8/17 10 / 69
17 A Taxonomy of MT Evaluation Methods Manual: Direct assessment (Scoring, Ranking). Automatic. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 11 / 69
18 Manual Assessment: Ranking André Martins (Unbabel) Quality Estimation MTM, 31/8/17 12 / 69
19 A Taxonomy of MT Evaluation Methods Manual: Direct assessment (Scoring, Ranking), Error annotation. Automatic. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 13 / 69
20 MQM (Multidimensional Quality Metrics) André Martins (Unbabel) Quality Estimation MTM, 31/8/17 14 / 69
21 A Taxonomy of MT Evaluation Methods Manual: Direct assessment (Scoring, Ranking), Error annotation, Task-based (Post-editing). Automatic. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 15 / 69
22 Amount of Post-Editing HTER André Martins (Unbabel) Quality Estimation MTM, 31/8/17 16 / 69
24 A Taxonomy of MT Evaluation Methods Manual: Direct assessment (Scoring, Ranking), Error annotation, Task-based (Post-editing, Reading comprehension). Automatic. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 17 / 69
25 Reading Comprehension André Martins (Unbabel) Quality Estimation MTM, 31/8/17 18 / 69
26 A Taxonomy of MT Evaluation Methods Manual: Direct assessment (Scoring, Ranking), Error annotation, Task-based (Post-editing, Reading comprehension, Eye-tracking). Automatic. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 19 / 69
27 Eye-Tracking André Martins (Unbabel) Quality Estimation MTM, 31/8/17 20 / 69
28 A Taxonomy of MT Evaluation Methods Manual: Direct assessment (Scoring, Ranking), Error annotation, Task-based (Post-editing, Reading comprehension, Eye-tracking). Automatic: Reference-based. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 21 / 69
29 A Taxonomy of MT Evaluation Methods Manual: Direct assessment (Scoring, Ranking), Error annotation, Task-based (Post-editing, Reading comprehension, Eye-tracking). Automatic: Reference-based (BLEU, Meteor, NIST, TER, WER, PER, CDER, BEER, CiDER, Cobalt, RATATOUILLE, RED, AMBER, PARMESAN,...), Quality estimation. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 21 / 69
32 Reference-Based Evaluation Reference(s): a subset of the good translations, usually just one. Some metrics expand matching, e.g. synonyms in Meteor. There is huge variation in reference translations, e.g.:
Source: 不过这一切都由不得你 (literal gloss: However these all totally beyond the control of you.)
MT: But all this is beyond the control of you.
HT 1: But all this is beyond your control
HT 2: However, you cannot choose yourself
HT 3: However, not everything is up to you to decide
HT 4: But you can't choose that
Metrics completely disregard the source segment. Main problem: cannot be applied to MT systems in use. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 22 / 69
39 Quality Estimation (Specia et al., 2013) Quality Estimation (QE): metrics that provide an estimate on the quality of translations on the fly Quality defined by the data: purpose is clear, no comparison to references, source considered Quality = Can we publish it as is? Quality = Can a reader get the gist? Quality = Is it worth post-editing it? Quality = How much effort to fix it? André Martins (Unbabel) Quality Estimation MTM, 31/8/17 24 / 69
40 Related: Confidence in MT (Blatz et al., 2004; Ueffing and Ney, 2007) Goal: augment the MT system to produce a confidence score. Quality Estimation is slightly more general and advantageous: it does not require access to the internals of the MT system (i.e. it can treat it as a black box), and it makes it possible to use several MT systems, several domains, etc. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 25 / 69
41 Quality Estimation: Framework Building a model: X (examples of source texts & their translations) goes through feature extraction to produce features; these, paired with Y (quality scores for the examples in X), are fed to machine learning to produce the QE model. (Slide credit: Lucia Specia) André Martins (Unbabel) Quality Estimation MTM, 31/8/17 26 / 69
42 Quality Estimation: Framework Applying the model: a source text s' and its translation t' from the MT system go through feature extraction; the resulting features are fed to the QE model, which outputs a quality score y'. (Slide credit: Lucia Specia) André Martins (Unbabel) Quality Estimation MTM, 31/8/17 26 / 69
43 Data and Levels of Granularity Sentence level: 1-5 subjective scores, PE time, PE edits. Word level: good/bad, good/delete/replace, MQM. Phrase level: good/bad. Document level: PE effort. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 27 / 69
44 Features and Algorithms Complexity features (source text), adequacy features (source-translation pair, over aligned windows s-1 s s+1 / t-1 t t+1), confidence features (MT system), fluency features (translation). Learning algorithms can be used off-the-shelf. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 28 / 69
47 Sentence-Level QE: Baseline Setting Features: number of tokens in the source and target sentences; average source token length; average number of occurrences of words in the target; number of punctuation marks in source and target sentences; LM probability of source and target sentences; average number of translations per source word; % of seen source n-grams. Learner: SVM regression with an RBF kernel. Implementation: the QuEst framework (Specia et al., 2013). André Martins (Unbabel) Quality Estimation MTM, 31/8/17 29 / 69
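A few of these black-box features can be sketched in a handful of lines. The helper below is illustrative only (the function name and feature subset are mine, and the LM and lexical-translation features are omitted since they need external models):

```python
# Hypothetical sketch of a few WMT-baseline QE features; not the exact
# QuEst implementation.
import string

def baseline_features(source, target):
    """Compute a few shallow QE features for one sentence pair."""
    src_toks = source.split()
    tgt_toks = target.split()
    punct = set(string.punctuation)
    return {
        "n_src_tokens": len(src_toks),
        "n_tgt_tokens": len(tgt_toks),
        "avg_src_token_len": sum(map(len, src_toks)) / max(len(src_toks), 1),
        # average number of occurrences of each distinct target word
        "avg_tgt_occurrences": len(tgt_toks) / max(len(set(tgt_toks)), 1),
        "n_src_punct": sum(t in punct for t in src_toks),
        "n_tgt_punct": sum(t in punct for t in tgt_toks),
    }

feats = baseline_features("the battery lasts 6 hours .",
                          "der Akku hält 6 Stunden .")
print(feats["n_src_tokens"])  # 6
```

In the baseline setting, 17 such features per sentence pair are fed to an SVM regressor with an RBF kernel (e.g. scikit-learn's SVR).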
48 Sentence-Level QE: SOTA Predicting HTER (WMT16), English-German; systems ranked by Pearson/Spearman correlation: YSDA/SNTX+BLEU+SVM, POSTECH/SENT-RNN-QV, SHEF-LIUM/SVM-NN-emb-QuEst, POSTECH/SENT-RNN-QV, SHEF-LIUM/SVM-NN-both-emb, UGENT-LT3/SCATE-SVM, UFAL/MULTIVEC, RTM/RTM-FS-SVR, UU/UU-SVM, UGENT-LT3/SCATE-SVM, RTM/RTM-SVR, Baseline SVM, SHEF/SimpleNets-SRC, SHEF/SimpleNets-TGT. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 30 / 69
49 Predicting HTER (WMT17) Sentence-Level QE: SOTA André Martins (Unbabel) Quality Estimation MTM, 31/8/17 31 / 69
50 Word-Level QE: SOTA Word-level OK/BAD labels (WMT16) André Martins (Unbabel) Quality Estimation MTM, 31/8/17 32 / 69
51 Word-Level QE: SOTA Word-level OK/BAD labels (WMT17) André Martins (Unbabel) Quality Estimation MTM, 31/8/17 33 / 69
52 Outline 1 MT Evaluation & Quality Estimation 2 Pushing the Limits of Quality Estimation 3 The Future André Martins (Unbabel) Quality Estimation MTM, 31/8/17 34 / 69
53 Pushing the Limits of Quality Estimation Recent TACL paper: André F. T. Martins, Marcin Junczys-Dowmunt, Fabio Kepler, Ramón Astudillo, Chris Hokamp, Roman Grundkiewicz. Pushing the Limits of Quality Estimation. TACL, 5: , André Martins (Unbabel) Quality Estimation MTM, 31/8/17 35 / 69
56 In a Nutshell Quality estimation: evaluate a translation on the fly with no reference. Useful for predicting (or sidestepping) human post-editing effort. Until now: not really accurate enough for practical use. This paper: considerable improvements by: stacking a neural system and a linear sequential model; using automatic post-editing as an auxiliary task. Overall (on the WMT16 En-De dataset): 57.47% (a gain of 7.95%) in word-level F_mult, and a gain of 13.26% in sentence-level Pearson score. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 36 / 69
57 But isn't MT nearly indistinguishable from humans now? André Martins (Unbabel) Quality Estimation MTM, 31/8/17 37 / 69
60 Wrong translation! We can fix it with a human post-editor! André Martins (Unbabel) Quality Estimation MTM, 31/8/17 38 / 69
62 quality estimation BAD BAD BAD OK OK OK Le travail de La traduction automatique fonctionne-t-elle? We can fix it with a human post-editor! André Martins (Unbabel) Quality Estimation MTM, 31/8/17 38 / 69
63 Quality Estimation (Blatz et al., 2004; Specia et al., 2013) BAD BAD BAD OK OK OK Le travail de La traduction automatique fonctionne-t-elle? Sentence-level QE: predict edit distance (HTER). Word-level QE: predict an OK/BAD label for each translated word. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 39 / 69
64 Quality Estimation (Blatz et al., 2004; Specia et al., 2013) BAD BAD BAD OK OK OK Le travail de La traduction automatique fonctionne-t-elle? Sentence-level QE: predict edit distance (HTER). Word-level QE: predict an OK/BAD label for each translated word. This paper: we first engineer a strong word-level QE system, then convert it to make sentence-level predictions too. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 39 / 69
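As a rough sketch of the sentence-level target, HTER can be approximated as the word-level edit distance between the MT output and its human post-edit, normalised by post-edit length. Real TER, as computed by TERCOM, also allows block shifts; this illustrative helper (names are mine) omits them:

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over tokens (TER additionally allows block
    shifts; this sketch ignores them)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def hter(mt, post_edit):
    """HTER: edits needed to turn the MT output into its human post-edit,
    normalised by the post-edit length."""
    mt_toks, pe_toks = mt.split(), post_edit.split()
    return word_edit_distance(mt_toks, pe_toks) / max(len(pe_toks), 1)

print(hter("Do buy this product", "Do not buy this product"))  # 0.2
```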
65 Why Quality Estimation? 1 Informs an end user about the reliability of translated content 2 Decides if a translation is good to go or requires a human to fix it 3 Highlights to a human post-editor the words that need to be revised André Martins (Unbabel) Quality Estimation MTM, 31/8/17 40 / 69
66 Unbabel Translation Pipeline (Credit: João Graça's EAMT presentation) André Martins (Unbabel) Quality Estimation MTM, 31/8/17 41 / 69
67 Dataset of Post-Edited Translations WMT 2016: English-German, IT domain 12,000 training sentences, 1,000 dev sentences, 2,000 test sentences Quality labels obtained from post-edited translations via TERCOM (Snover et al., 2006) In the paper: also experiments with WMT 2015 (English-Spanish) André Martins (Unbabel) Quality Estimation MTM, 31/8/17 42 / 69
68 Model #1: Linear Sequential Model First shot: a discriminative, shallow, CRF-like model. BAD BAD BAD OK OK OK The model predicts a label sequence ŷ_1:N ∈ {OK, BAD}^N:
ŷ_1:N = argmax_y Σ_{i=1..N} w·φ_u(s, t, A, y_i) (unigram features) + Σ_{i=1..N+1} w·φ_b(s, t, A, y_i, y_{i-1}) (bigram features)
André Martins (Unbabel) Quality Estimation MTM, 31/8/17 43 / 69
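Decoding this model is plain first-order Viterbi over the two labels. In the sketch below, the toy score tables stand in for the learned w·φ_u and w·φ_b:

```python
# Minimal Viterbi decoder for a first-order sequence model over {OK, BAD};
# the score tables are illustrative stand-ins for learned feature scores.
def viterbi(unigram, bigram, labels=("OK", "BAD")):
    """unigram: list of {label: score}, one dict per position;
    bigram: {(prev_label, label): score}. Returns the best label sequence."""
    n = len(unigram)
    best = [{} for _ in range(n)]
    back = [{} for _ in range(n)]
    for y in labels:
        best[0][y] = unigram[0][y]
    for i in range(1, n):
        for y in labels:
            scores = {p: best[i - 1][p] + bigram.get((p, y), 0.0)
                      for p in labels}
            prev = max(scores, key=scores.get)
            best[i][y] = scores[prev] + unigram[i][y]
            back[i][y] = prev
    y = max(best[-1], key=best[-1].get)
    path = [y]
    for i in range(n - 1, 0, -1):
        y = back[i][y]
        path.append(y)
    return path[::-1]

uni = [{"OK": 1.0, "BAD": 0.2},
       {"OK": 0.1, "BAD": 2.0},
       {"OK": 0.8, "BAD": 0.2}]
bi = {("OK", "OK"): 0.5, ("BAD", "BAD"): 0.5}
print(viterbi(uni, bi))  # ['OK', 'BAD', 'OK']
```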
69 Features
Unigram features (on label y_i): Bias; Word, LeftWord, RightWord; SourceWord, SourceLeftWord, SourceRightWord; LargestNGramLeft/Right, SourceLargestNGramLeft/Right; PosTag, SourcePosTag; Word+LeftWord, Word+RightWord; Word+SourceWord, PosTag+SourcePosTag.
Simple bigram features (on y_i, y_i-1): Bias.
Rich bigram features: all of the above on (y_i, y_i-1); plus Word+SourceWord and PosTag+SourcePosTag on (y_i+1, y_i).
Syntactic features (on y_i): DepRel, Word+DepRel; HeadWord/PosTag+Word/PosTag; LeftSibWord/PosTag+Word/PosTag; RightSibWord/PosTag+Word/PosTag; GrandWord/PosTag+HeadWord/PosTag+Word/PosTag.
(Inputs are referenced relative to the i-th target word.)
André Martins (Unbabel) Quality Estimation MTM, 31/8/17 44 / 69
70 Performance of Linear Model (Dev Set) F1-MULT on the WMT16 dev set, for: unigrams only, +simple bigrams, +rich bigrams, +syntactic (full). Large impact of the rich bigram features (3 points) and the syntactic features (another 2.5 points); the net improvement exceeds 6 points over the unigram model. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 45 / 69
71 Model #2: Neural Model Second shot: a model inspired by QUETCH (Kreutzer et al., 2015), a feedforward neural net whose inputs are target words and their aligned source words. We depart from QUETCH in a few aspects: we add recurrent layers; more depth; embeddings for the POS tags (not just the words); dropout regularization; layer normalization. In the paper: ablation experiments to validate the architecture. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 46 / 69
72 [Architecture diagram] NeuralQE: embedding layers for the target word, aligned source word, and their POS tags (64-dimensional word embeddings and 50-dimensional POS embeddings, each over a window of 3), feeding feedforward layers (400 units), BiGRU layers interleaved with further feedforward layers, and a final softmax producing OK/BAD. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 47 / 69
73 Linear vs Neural Models (Test Set) F1-MULT on WMT16 for the UGENT system, LinearQE, and NeuralQE. UGENT is Tezcan et al. (2016), the best system at WMT16 after ours; both our linear and neural systems outperform it by a big margin. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 48 / 69
74 Model #3: Stacked Architecture Stacked learning is a simple and effective way of ensembling structured models (Cohen and de Carvalho, 2005; Martins et al., 2008). Key idea: include the neural model's prediction as an additional feature in the linear model. Better performance if we use several neural models with different random initializations and data shuffles. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 49 / 69
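The stacking idea itself is tiny. In this illustrative sketch (all names are mine), the neural model's per-word BAD probability is simply appended to the linear model's feature vector; in the paper, the training-set neural predictions come from jackknifing so the stacked feature is not overfit:

```python
# Toy illustration of stacked learning for word-level QE. neural_score is a
# stand-in for p(BAD | word) from a trained NeuralQE model.
def neural_score(word):
    return 0.9 if word == "fonctionne-t-elle" else 0.1

def base_features(words, i):
    """Stand-in for the linear model's own unigram features."""
    return {"word=" + words[i]: 1.0, "bias": 1.0}

def stacked_features(target_words, base_feature_fn):
    rows = []
    for i, w in enumerate(target_words):
        feats = base_feature_fn(target_words, i)
        feats["neural_p_bad"] = neural_score(w)  # the stacked feature
        rows.append(feats)
    return rows

rows = stacked_features("La traduction fonctionne-t-elle".split(),
                        base_features)
print(rows[2]["neural_p_bad"])  # 0.9
```

Using several neural models would simply add one such feature per model.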
76 Stacking Linear and Neural Models F1-MULT on WMT16 for the UGENT system, LinearQE, NeuralQE, and StackedQE. Large improvement (3 points) just by combining the two systems! This shows that they are highly complementary. We won the WMT16 shared task with this approach... but can we still do better? Yes. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 50 / 69
77 Automatic Post-Editing (Simard et al., 2007)... remember Marcin's lecture on Wednesday: André Martins (Unbabel) Quality Estimation MTM, 31/8/17 51 / 69
78 Automatic Post-Editing (Simard et al., 2007) BAD BAD BAD OK OK OK Le travail de La traduction automatique fonctionne-t-elle? Goal: automatically correct the output of MT. While word-level QE detects mistakes, APE seeks to correct them. Still, the two tasks are pretty similar. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 52 / 69
80 Roundtrip and Log-Linear Combination Current best APE system (Junczys-Dowmunt and Grundkiewicz, 2016): generate a large amount of artificial data ("roundtrip translations"); train two NMT systems, s→p and t→p; combine the two systems with a log-linear model. Our strategy: train an APE system tuned for QE; at test time, project the predicted post-edit p onto quality labels. Two key differences with respect to the pure QE systems: learned from finer-grained information (post-edited text); a lot more data (500,000 roundtrip translations). André Martins (Unbabel) Quality Estimation MTM, 31/8/17 53 / 69
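The projection from a predicted post-edit to word labels can be sketched as a monolingual alignment between t and p: MT words that survive unchanged are OK, the rest BAD. The paper derives labels with TER-style alignment; this illustrative sketch uses difflib instead:

```python
# Sketch of APE-to-QE label projection: align the MT output t with the
# predicted post-edit p and mark unmatched MT words as BAD. difflib's
# matcher is a simplification of the TER-based alignment used in the paper.
import difflib

def ape_to_labels(t, p):
    t_toks, p_toks = t.split(), p.split()
    labels = ["BAD"] * len(t_toks)
    sm = difflib.SequenceMatcher(a=t_toks, b=p_toks, autojunk=False)
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            labels[block.a + k] = "OK"
    return labels

print(ape_to_labels("Do buy this product", "Do not buy this product"))
# ['OK', 'OK', 'OK', 'OK']
```

Note that a pure insertion in p (the missing "not") leaves every MT token OK, which matches the WMT word-level task: only the translated words are labeled.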
81 Model #4: APE-Based Quality Estimation F1-MULT on WMT16 for the UGENT system, LinearQE, NeuralQE, StackedQE, and APE-QE. This strategy outperforms the pure QE systems by 5 points!! André Martins (Unbabel) Quality Estimation MTM, 31/8/17 54 / 69
82 Model #5: Combining Linear, Neural, and APE F1-MULT on WMT16 for the UGENT system, LinearQE, NeuralQE, StackedQE, APE-QE, and FullStackedQE... combining all the models together, we get another improvement of 2 points!! André Martins (Unbabel) Quality Estimation MTM, 31/8/17 55 / 69
83 Examples
Source: Combines the hue value of the blend color with the luminance and saturation of the base color to create the result color.
MT: Kombiniert den Farbton Wert der Angleichungsfarbe mit der Luminanz und Sättigung der Grundfarbe zu erstellen.
PE (Reference): Kombiniert den Farbtonwert der Mischfarbe mit der Luminanz und Sättigung der Grundfarbe.
APE: Kombiniert den Farbton der Mischfarbe mit der Luminanz und die Sättigung der Grundfarbe, um die Ergebnisfarbe zu erstellen.
StackedQE: Kombiniert den Farbton Wert der Angleichungsfarbe mit der Luminanz und Sättigung der Grundfarbe zu erstellen.
ApeQE: Kombiniert den Farbton Wert der Angleichungsfarbe mit der Luminanz und Sättigung der Grundfarbe zu erstellen.
FullStackedQE: Kombiniert den Farbton Wert der Angleichungsfarbe mit der Luminanz und Sättigung der Grundfarbe zu erstellen.
André Martins (Unbabel) Quality Estimation MTM, 31/8/17 56 / 69
84 Performance over Sentence Length Figure: Average number of words predicted as BAD by the different systems on the WMT16 gold dev set, for different bins of sentence length. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 57 / 69
85 Sentence-Level Quality Estimation Now that we have a strong word-level system, we apply a very simple procedure to obtain sentence-level predictions: For pure QE: convert the word-level predictions to a sentence-level HTER prediction by counting the fraction of BAD words. For APE: directly use the HTER between t and p. For the combined system: the average of the two above. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 58 / 69
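This word-to-sentence conversion fits in a few lines; the sketch below is a direct transcription of the slide's procedure (the 50/50 weighting for the combined system is my reading of "average"):

```python
# Word-to-sentence conversion for QE, as described on the slide.
def sentence_score_from_words(word_labels):
    """Pure QE: predicted HTER = fraction of words labeled BAD."""
    return sum(l == "BAD" for l in word_labels) / max(len(word_labels), 1)

def combined_score(word_labels, ape_hter):
    """Combined system: average of the pure-QE score and the APE HTER."""
    return 0.5 * (sentence_score_from_words(word_labels) + ape_hter)

labels = ["BAD", "BAD", "BAD", "OK", "OK", "OK"]
print(sentence_score_from_words(labels))  # 0.5
print(combined_score(labels, 0.3))        # 0.4
```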
86 Sentence-Level Quality Estimation Pearson correlation on WMT16 for YANDEX, StackedQE, APE-QE, and FullStackedQE. YANDEX was the best sentence-level system at WMT16 (Kozlova et al., 2016). Our simple conversion led to impressive results, well above the SOTA! André Martins (Unbabel) Quality Estimation MTM, 31/8/17 59 / 69
87 This Year's WMT17: Other Competitive Methods André Martins (Unbabel) Quality Estimation MTM, 31/8/17 60 / 69
88 This Year's WMT17: Other Competitive Methods André Martins (Unbabel) Quality Estimation MTM, 31/8/17 61 / 69
89 Conclusions New SOTA systems for word-level and sentence-level QE, considerably more accurate than previously existing systems. First, we proposed a new pure QE system which stacks a linear and a neural system. Then, by relating the tasks of APE and word-level QE, we derived a new APE-based QE system. Finally, we combined the two systems via a full stacking architecture. The full system was extended to sentence-level QE by virtue of a simple word-to-sentence conversion (no further training or tuning). André Martins (Unbabel) Quality Estimation MTM, 31/8/17 62 / 69
90 Outline 1 MT Evaluation & Quality Estimation 2 Pushing the Limits of Quality Estimation 3 The Future André Martins (Unbabel) Quality Estimation MTM, 31/8/17 63 / 69
91 The Future of QE Word-level: go beyond OK/BAD labels; look at words missing from the source (particularly relevant for NMT); mistakes of NMT systems have different patterns than PBMT; how to estimate adequacy? Sentence-level: multi-task learning with word-level; predict MQM scores (useful for quality assurance of crowd-sourced translation). Document-level: take inter-sentential context into account (very relevant for chat); evaluate the global coherence of a translation. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 64 / 69
92 The Future of QE Beyond MT: human translation quality estimation. Much harder: humans have higher variability, and it is hard to distinguish good non-literal translations from complete rubbish. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 65 / 69
93 We're Hiring! Excited about MT, crowdsourcing and Lisbon? André Martins (Unbabel) Quality Estimation MTM, 31/8/17 66 / 69
94 Acknowledgments EXPERT project (EU Marie Curie ITN No ) Fundação para a Ciência e Tecnologia (FCT), through contracts UID/EEA/50008/2013 and UID/CEC/50021/2013 LearnBig project (PTDC/EEI-SII/7092/2014) GoLocal project (grant CMUPERI/TIC/0046/2014) Amazon Academic Research Awards program André Martins (Unbabel) Quality Estimation MTM, 31/8/17 67 / 69
95 References I Blatz, J., Fitzgerald, E., Foster, G., Gandrabur, S., Goutte, C., Kulesza, A., Sanchis, A., and Ueffing, N. (2004). Confidence estimation for machine translation. In Proc. of the International Conference on Computational Linguistics, page 315. Cohen, W. W. and de Carvalho, V. R. (2005). Stacked sequential learning. In IJCAI. Junczys-Dowmunt, M. and Grundkiewicz, R. (2016). Log-linear combinations of monolingual and bilingual neural machine translation models for automatic post-editing. In Proceedings of the First Conference on Machine Translation, pages , Berlin, Germany. Association for Computational Linguistics. Kozlova, A., Shmatova, M., and Frolov, A. (2016). YSDA Participation in the WMT 16 Quality Estimation Shared Task. In Proceedings of the First Conference on Machine Translation, pages Kreutzer, J., Schamoni, S., and Riezler, S. (2015). Quality estimation from scratch (quetch): Deep learning for word-level translation quality estimation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages Martins, A. F. T., Das, D., Smith, N. A., and Xing, E. P. (2008). Stacking Dependency Parsers. In Proc. of Empirical Methods for Natural Language Processing. Simard, M., Ueffing, N., Isabelle, P., and Kuhn, R. (2007). Rule-based translation with statistical phrase-based post-editing. In Proceedings of the Second Workshop on Statistical Machine Translation, pages André Martins (Unbabel) Quality Estimation MTM, 31/8/17 68 / 69
96 References II Snover, M., Dorr, B., Schwartz, R., Micciulla, L., and Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, pages . Specia, L., Shah, K., de Souza, J. G., and Cohn, T. (2013). QuEst - a translation quality estimation framework. In Proc. of the Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages . Tezcan, A., Hoste, V., and Macken, L. (2016). UGENT-LT3 SCATE submission for WMT16 shared task on quality estimation. In Proceedings of the First Conference on Machine Translation, pages , Berlin, Germany. Association for Computational Linguistics. Ueffing, N. and Ney, H. (2007). Word-level confidence estimation for machine translation. Computational Linguistics, 33(1):9 40. André Martins (Unbabel) Quality Estimation MTM, 31/8/17 69 / 69
More informationTopics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 12: Deep Reinforcement Learning
Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound Lecture 12: Deep Reinforcement Learning Types of Learning Supervised training Learning from the teacher Training data includes
More informationNatural Language Processing with Deep Learning CS224N/Ling284
Natural Language Processing with Deep Learning CS224N/Ling284 Lecture 8: Recurrent Neural Networks Christopher Manning and Richard Socher Organization Extra project office hour today after lecture Overview
More informationImproving the quality of a customized SMT system using shared training data. August 28, 2009
Improving the quality of a customized SMT system using shared training data Chris.Wendt@microsoft.com Will.Lewis@microsoft.com August 28, 2009 1 Overview Engine and Customization Basics Experiment Objective
More informationNeural Machine Translation In Linear Time
Neural Machine Translation In Linear Time Authors: Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, Koray Kavukcuoglu Presenter: SunMao sm4206 YuZheng yz2978 OVERVIEW
More informationLecture 14: Annotation
Lecture 14: Annotation Nathan Schneider (with material from Henry Thompson, Alex Lascarides) ENLP 23 October 2016 1/14 Annotation Why gold 6= perfect Quality Control 2/14 Factors in Annotation Suppose
More informationThe HDU Discriminative SMT System for Constrained Data PatentMT at NTCIR10
The HDU Discriminative SMT System for Constrained Data PatentMT at NTCIR10 Patrick Simianer, Gesa Stupperich, Laura Jehl, Katharina Wäschle, Artem Sokolov, Stefan Riezler Institute for Computational Linguistics,
More informationMetric Learning for Large-Scale Image Classification:
Metric Learning for Large-Scale Image Classification: Generalizing to New Classes at Near-Zero Cost Florent Perronnin 1 work published at ECCV 2012 with: Thomas Mensink 1,2 Jakob Verbeek 2 Gabriela Csurka
More informationA Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition
A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es
More informationSpatial Localization and Detection. Lecture 8-1
Lecture 8: Spatial Localization and Detection Lecture 8-1 Administrative - Project Proposals were due on Saturday Homework 2 due Friday 2/5 Homework 1 grades out this week Midterm will be in-class on Wednesday
More informationInstitute of Computational Linguistics Treatment of Markup in Statistical Machine Translation
Institute of Computational Linguistics Treatment of Markup in Statistical Machine Translation Master s Thesis Mathias Müller September 13, 2017 Motivation Researchers often give themselves the luxury of
More informationImproving Stack Overflow Tag Prediction Using Eye Tracking Alina Lazar Youngstown State University Bonita Sharif, Youngstown State University
Improving Stack Overflow Tag Prediction Using Eye Tracking Alina Lazar, Youngstown State University Bonita Sharif, Youngstown State University Jenna Wise, Youngstown State University Alyssa Pawluk, Youngstown
More informationNatural Language Processing
Natural Language Processing Classification III Dan Klein UC Berkeley 1 Classification 2 Linear Models: Perceptron The perceptron algorithm Iteratively processes the training set, reacting to training errors
More informationA Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition
A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es
More informationDCU System Report on the WMT 2017 Multi-modal Machine Translation Task
DCU System Report on the WMT 2017 Multi-modal Machine Translation Task Iacer Calixto and Koel Dutta Chowdhury and Qun Liu {iacer.calixto,koel.chowdhury,qun.liu}@adaptcentre.ie Abstract We report experiments
More informationLet s get parsing! Each component processes the Doc object, then passes it on. doc.is_parsed attribute checks whether a Doc object has been parsed
Let s get parsing! SpaCy default model includes tagger, parser and entity recognizer nlp = spacy.load('en ) tells spacy to use "en" with ["tagger", "parser", "ner"] Each component processes the Doc object,
More informationIntegration of QuEst and Okapi
Integration of QuEst and Okapi Final Activity Report Participants: Gustavo Paetzold, Lucia Specia and Kashif Shah (Sheffield University) Yves Savourel (Okapi) 1. Introduction The proposed project aimed
More informationA Simple (?) Exercise: Predicting the Next Word
CS11-747 Neural Networks for NLP A Simple (?) Exercise: Predicting the Next Word Graham Neubig Site https://phontron.com/class/nn4nlp2017/ Are These Sentences OK? Jane went to the store. store to Jane
More informationCS395T Project 2: Shift-Reduce Parsing
CS395T Project 2: Shift-Reduce Parsing Due date: Tuesday, October 17 at 9:30am In this project you ll implement a shift-reduce parser. First you ll implement a greedy model, then you ll extend that model
More informationLign/CSE 256, Programming Assignment 1: Language Models
Lign/CSE 256, Programming Assignment 1: Language Models 16 January 2008 due 1 Feb 2008 1 Preliminaries First, make sure you can access the course materials. 1 The components are: ˆ code1.zip: the Java
More informationPredicting Popular Xbox games based on Search Queries of Users
1 Predicting Popular Xbox games based on Search Queries of Users Chinmoy Mandayam and Saahil Shenoy I. INTRODUCTION This project is based on a completed Kaggle competition. Our goal is to predict which
More informationTransition-based Parsing with Neural Nets
CS11-747 Neural Networks for NLP Transition-based Parsing with Neural Nets Graham Neubig Site https://phontron.com/class/nn4nlp2017/ Two Types of Linguistic Structure Dependency: focus on relations between
More informationBBN System Description for WMT10 System Combination Task
BBN System Description for WMT10 System Combination Task Antti-Veikko I. Rosti and Bing Zhang and Spyros Matsoukas and Richard Schwartz Raytheon BBN Technologies, 10 Moulton Street, Cambridge, MA 018,
More informationStatistical Machine Translation Part IV Log-Linear Models
Statistical Machine Translation art IV Log-Linear Models Alexander Fraser Institute for Natural Language rocessing University of Stuttgart 2011.11.25 Seminar: Statistical MT Where we have been We have
More information16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning. Spring 2018 Lecture 14. Image to Text
16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning Spring 2018 Lecture 14. Image to Text Input Output Classification tasks 4/1/18 CMU 16-785: Integrated Intelligence in Robotics
More informationAllstate Insurance Claims Severity: A Machine Learning Approach
Allstate Insurance Claims Severity: A Machine Learning Approach Rajeeva Gaur SUNet ID: rajeevag Jeff Pickelman SUNet ID: pattern Hongyi Wang SUNet ID: hongyiw I. INTRODUCTION The insurance industry has
More informationThe Perceptron. Simon Šuster, University of Groningen. Course Learning from data November 18, 2013
The Perceptron Simon Šuster, University of Groningen Course Learning from data November 18, 2013 References Hal Daumé III: A Course in Machine Learning http://ciml.info Tom M. Mitchell: Machine Learning
More informationSentence-Level Confidence Estimation for MT
Sentence-Level Confidence Estimation for MT Lucia Specia Nicola Cancedda, Marc Dymetman, Craig Saunders - Xerox Marco Turchi, Nello Cristianini Univ. Bristol Zhuoran Wang, John Shawe-Taylor UCL Outline
More informationAutomatic Domain Partitioning for Multi-Domain Learning
Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels
More informationFastText. Jon Koss, Abhishek Jindal
FastText Jon Koss, Abhishek Jindal FastText FastText is on par with state-of-the-art deep learning classifiers in terms of accuracy But it is way faster: FastText can train on more than one billion words
More informationAdversarial Examples and Adversarial Training. Ian Goodfellow, Staff Research Scientist, Google Brain CS 231n, Stanford University,
Adversarial Examples and Adversarial Training Ian Goodfellow, Staff Research Scientist, Google Brain CS 231n, Stanford University, 2017-05-30 Overview What are adversarial examples? Why do they happen?
More informationAutomatic Summarization
Automatic Summarization CS 769 Guest Lecture Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of Wisconsin, Madison February 22, 2008 Andrew B. Goldberg (CS Dept) Summarization
More informationSparse Non-negative Matrix Language Modeling
Sparse Non-negative Matrix Language Modeling Joris Pelemans Noam Shazeer Ciprian Chelba joris@pelemans.be noam@google.com ciprianchelba@google.com 1 Outline Motivation Sparse Non-negative Matrix Language
More informationThe Goal of this Document. Where to Start?
A QUICK INTRODUCTION TO THE SEMILAR APPLICATION Mihai Lintean, Rajendra Banjade, and Vasile Rus vrus@memphis.edu linteam@gmail.com rbanjade@memphis.edu The Goal of this Document This document introduce
More informationPython & Web Mining. Lecture Old Dominion University. Department of Computer Science CS 495 Fall 2012
Python & Web Mining Lecture 6 10-10-12 Old Dominion University Department of Computer Science CS 495 Fall 2012 Hany SalahEldeen Khalil hany@cs.odu.edu Scenario So what did Professor X do when he wanted
More informationVERIZON ACHIEVES BEST IN TEST AND WINS THE P3 MOBILE BENCHMARK USA
VERIZON ACHIEVES BEST IN TEST AND WINS THE P3 MOBILE BENCHMARK USA INTRODUCTION in test 10/2018 Mobile Benchmark USA The international benchmarking expert P3 has been testing the performance of cellular
More informationCorpus Acquisition from the Internet
Corpus Acquisition from the Internet Philipp Koehn partially based on slides from Christian Buck 26 October 2017 Big Data 1 For many language pairs, lots of text available. Text you read in your lifetime
More informationNeural Nets & Deep Learning
Neural Nets & Deep Learning The Inspiration Inputs Outputs Our brains are pretty amazing, what if we could do something similar with computers? Image Source: http://ib.bioninja.com.au/_media/neuron _med.jpeg
More informationPrototype of Silver Corpus Merging Framework
www.visceral.eu Prototype of Silver Corpus Merging Framework Deliverable number D3.3 Dissemination level Public Delivery data 30.4.2014 Status Authors Final Markus Krenn, Allan Hanbury, Georg Langs This
More informationCSCI 5417 Information Retrieval Systems. Jim Martin!
CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 7 9/13/2011 Today Review Efficient scoring schemes Approximate scoring Evaluating IR systems 1 Normal Cosine Scoring Speedups... Compute the
More informationRecurrent Neural Networks
Recurrent Neural Networks 11-785 / Fall 2018 / Recitation 7 Raphaël Olivier Recap : RNNs are magic They have infinite memory They handle all kinds of series They re the basis of recent NLP : Translation,
More informationCS229 Final Project: Predicting Expected Response Times
CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time
More informationSequence Prediction with Neural Segmental Models. Hao Tang
Sequence Prediction with Neural Segmental Models Hao Tang haotang@ttic.edu About Me Pronunciation modeling [TKL 2012] Segmental models [TGL 2014] [TWGL 2015] [TWGL 2016] [TWGL 2016] American Sign Language
More informationEnd-To-End Spam Classification With Neural Networks
End-To-End Spam Classification With Neural Networks Christopher Lennan, Bastian Naber, Jan Reher, Leon Weber 1 Introduction A few years ago, the majority of the internet s network traffic was due to spam
More informationCMU-UKA Syntax Augmented Machine Translation
Outline CMU-UKA Syntax Augmented Machine Translation Ashish Venugopal, Andreas Zollmann, Stephan Vogel, Alex Waibel InterACT, LTI, Carnegie Mellon University Pittsburgh, PA Outline Outline 1 2 3 4 Issues
More informationA General-Purpose Rule Extractor for SCFG-Based Machine Translation
A General-Purpose Rule Extractor for SCFG-Based Machine Translation Greg Hanneman, Michelle Burroughs, and Alon Lavie Language Technologies Institute Carnegie Mellon University Fifth Workshop on Syntax
More informationShow, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks
Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks Zelun Luo Department of Computer Science Stanford University zelunluo@stanford.edu Te-Lin Wu Department of
More informationAlgorithms for NLP. Machine Translation. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley
Algorithms for NLP Machine Translation Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Machine Translation Machine Translation: Examples Levels of Transfer Word-Level MT: Examples la politique
More informationLecture 7: Neural network acoustic models in speech recognition
CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 7: Neural network acoustic models in speech recognition Outline Hybrid acoustic modeling overview Basic
More informationA Space-Efficient Phrase Table Implementation Using Minimal Perfect Hash Functions
A Space-Efficient Phrase Table Implementation Using Minimal Perfect Hash Functions Marcin Junczys-Dowmunt Faculty of Mathematics and Computer Science Adam Mickiewicz University ul. Umultowska 87, 61-614
More informationTALP: Xgram-based Spoken Language Translation System Adrià de Gispert José B. Mariño
TALP: Xgram-based Spoken Language Translation System Adrià de Gispert José B. Mariño Outline Overview Outline Translation generation Training IWSLT'04 Chinese-English supplied task results Conclusion and
More informationSeq2SQL: Generating Structured Queries from Natural Language Using Reinforcement Learning
Seq2SQL: Generating Structured Queries from Natural Language Using Reinforcement Learning V. Zhong, C. Xiong, R. Socher Salesforce Research arxiv: 1709.00103 Reviewed by : Bill Zhang University of Virginia
More informationStructured Learning. Jun Zhu
Structured Learning Jun Zhu Supervised learning Given a set of I.I.D. training samples Learn a prediction function b r a c e Supervised learning (cont d) Many different choices Logistic Regression Maximum
More informationLecture 19: Generative Adversarial Networks
Lecture 19: Generative Adversarial Networks Roger Grosse 1 Introduction Generative modeling is a type of machine learning where the aim is to model the distribution that a given set of data (e.g. images,
More informationQuEst++ Tutorial. Carolina Scarton and Lucia Specia January 19, 2016
QuEst++ Tutorial Carolina Scarton and Lucia Specia January 19, 2016 Abstract In this tutorial we present QuEst++, an open source framework for pipelined Translation Quality Estimation. QuEst++ is the newest
More informationPractical Methodology. Lecture slides for Chapter 11 of Deep Learning Ian Goodfellow
Practical Methodology Lecture slides for Chapter 11 of Deep Learning www.deeplearningbook.org Ian Goodfellow 2016-09-26 What drives success in ML? Arcane knowledge of dozens of obscure algorithms? Mountains
More informationAuthor Verification: Exploring a Large set of Parameters using a Genetic Algorithm
Author Verification: Exploring a Large set of Parameters using a Genetic Algorithm Notebook for PAN at CLEF 2014 Erwan Moreau 1, Arun Jayapal 2, and Carl Vogel 3 1 moreaue@cs.tcd.ie 2 jayapala@cs.tcd.ie
More informationKyoto-NMT: a Neural Machine Translation implementation in Chainer
Kyoto-NMT: a Neural Machine Translation implementation in Chainer Fabien Cromières Japan Science and Technology Agency Kawaguchi-shi, Saitama 332-0012 fabien@pa.jst.jp Abstract We present Kyoto-NMT, an
More informationLecture 20: Neural Networks for NLP. Zubin Pahuja
Lecture 20: Neural Networks for NLP Zubin Pahuja zpahuja2@illinois.edu courses.engr.illinois.edu/cs447 CS447: Natural Language Processing 1 Today s Lecture Feed-forward neural networks as classifiers simple
More informationCS 288: Statistical NLP Assignment 1: Language Modeling
CS 288: Statistical NLP Assignment 1: Language Modeling Due September 12, 2014 Collaboration Policy You are allowed to discuss the assignment with other students and collaborate on developing algorithms
More informationParsing with Dynamic Programming
CS11-747 Neural Networks for NLP Parsing with Dynamic Programming Graham Neubig Site https://phontron.com/class/nn4nlp2017/ Two Types of Linguistic Structure Dependency: focus on relations between words
More informationFinal Project Discussion. Adam Meyers Montclair State University
Final Project Discussion Adam Meyers Montclair State University Summary Project Timeline Project Format Details/Examples for Different Project Types Linguistic Resource Projects: Annotation, Lexicons,...
More informationNLP in practice, an example: Semantic Role Labeling
NLP in practice, an example: Semantic Role Labeling Anders Björkelund Lund University, Dept. of Computer Science anders.bjorkelund@cs.lth.se October 15, 2010 Anders Björkelund NLP in practice, an example:
More informationBayesian model ensembling using meta-trained recurrent neural networks
Bayesian model ensembling using meta-trained recurrent neural networks Luca Ambrogioni l.ambrogioni@donders.ru.nl Umut Güçlü u.guclu@donders.ru.nl Yağmur Güçlütürk y.gucluturk@donders.ru.nl Julia Berezutskaya
More informationTransition-based dependency parsing
Transition-based dependency parsing Syntactic analysis (5LN455) 2014-12-18 Sara Stymne Department of Linguistics and Philology Based on slides from Marco Kuhlmann Overview Arc-factored dependency parsing
More informationTTIC 31190: Natural Language Processing
TTIC 31190: Natural Language Processing Kevin Gimpel Winter 2016 Lecture 2: Text Classification 1 Please email me (kgimpel@ttic.edu) with the following: your name your email address whether you taking
More informationProject 3 Q&A. Jonathan Krause
Project 3 Q&A Jonathan Krause 1 Outline R-CNN Review Error metrics Code Overview Project 3 Report Project 3 Presentations 2 Outline R-CNN Review Error metrics Code Overview Project 3 Report Project 3 Presentations
More informationFeature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.
CS 188: Artificial Intelligence Fall 2008 Lecture 24: Perceptrons II 11/24/2008 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit
More informationA bit of theory: Algorithms
A bit of theory: Algorithms There are different kinds of algorithms Vector space models. e.g. support vector machines Decision trees, e.g. C45 Probabilistic models, e.g. Naive Bayes Neural networks, e.g.
More informationExperimenting in MT: Moses Toolkit and Eman
Experimenting in MT: Moses Toolkit and Eman Aleš Tamchyna, Ondřej Bojar Institute of Formal and Applied Linguistics Faculty of Mathematics and Physics Charles University, Prague Tue Sept 8, 2015 1 / 30
More informationLecture #11: The Perceptron
Lecture #11: The Perceptron Mat Kallada STAT2450 - Introduction to Data Mining Outline for Today Welcome back! Assignment 3 The Perceptron Learning Method Perceptron Learning Rule Assignment 3 Will be
More informationA Quick Guide to MaltParser Optimization
A Quick Guide to MaltParser Optimization Joakim Nivre Johan Hall 1 Introduction MaltParser is a system for data-driven dependency parsing, which can be used to induce a parsing model from treebank data
More informationKnow your data - many types of networks
Architectures Know your data - many types of networks Fixed length representation Variable length representation Online video sequences, or samples of different sizes Images Specific architectures for
More informationCSE 573: Artificial Intelligence Autumn 2010
CSE 573: Artificial Intelligence Autumn 2010 Lecture 16: Machine Learning Topics 12/7/2010 Luke Zettlemoyer Most slides over the course adapted from Dan Klein. 1 Announcements Syllabus revised Machine
More informationTokenization and Sentence Segmentation. Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017
Tokenization and Sentence Segmentation Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017 Outline 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation
More informationDeepPose & Convolutional Pose Machines
DeepPose & Convolutional Pose Machines Main Concepts 1. 2. 3. 4. CNN with regressor head. Object Localization. Bigger to smaller or Smaller to bigger view. Deep Supervised learning to prevent Vanishing
More informationAn Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation
An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio Université de Montréal 13/06/2007
More informationD6.1.2: Second report on scientific evaluations
D6.1.2: Second report on scientific evaluations UPVLC, XEROX, JSI-K4A, RWTH, EML and DDS Distribution: Public translectures Transcription and Translation of Video Lectures ICT Project 287755 Deliverable
More informationDomain-specific Concept-based Information Retrieval System
Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical
More informationLinguistic Resources for Arabic Handwriting Recognition
Linguistic Resources for Arabic Handwriting Recognition Stephanie M. Strassel Linguistic Data Consortium 3600 Market Street, Suite 810 Philadelphia, PA 19104 USA strassel@ldc.upenn.edu Abstract MADCAT
More informationContext Encoding LSTM CS224N Course Project
Context Encoding LSTM CS224N Course Project Abhinav Rastogi arastogi@stanford.edu Supervised by - Samuel R. Bowman December 7, 2015 Abstract This project uses ideas from greedy transition based parsing
More informationD5.6 - Evaluation Benchmarks
Th is document is part of the Project Machine Tr a n sla tion En h a n ced Com pu ter A ssisted Tr a n sla tion (Ma teca t ), fu nded by the 7th Framework Programme of th e Eu r opea n Com m ission th
More informationMetric Learning for Large Scale Image Classification:
Metric Learning for Large Scale Image Classification: Generalizing to New Classes at Near-Zero Cost Thomas Mensink 1,2 Jakob Verbeek 2 Florent Perronnin 1 Gabriela Csurka 1 1 TVPA - Xerox Research Centre
More informationInstance-based Learning
Instance-based Learning Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 15 th, 2007 2005-2007 Carlos Guestrin 1 1-Nearest Neighbor Four things make a memory based learner:
More informationSemantic Estimation for Texts in Software Engineering
Semantic Estimation for Texts in Software Engineering 汇报人 : Reporter:Xiaochen Li Dalian University of Technology, China 大连理工大学 2016 年 11 月 29 日 Oscar Lab 2 Ph.D. candidate at OSCAR Lab, in Dalian University
More informationRegularization and model selection
CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial
More information