THE CUED NIST 2009 ARABIC-ENGLISH SMT SYSTEM

Adrià de Gispert, Gonzalo Iglesias, Graeme Blackwood, Jamie Brunning, Bill Byrne
NIST Open MT 2009 Evaluation Workshop, Ottawa

The CUED SMT system

- Lattice-based hierarchical SMT system implemented with weighted finite-state transducers (WFSTs)
  - HiFST decoder, based on the Google OpenFst toolkit
- Hierarchical translation rules extracted from MTTK-aligned parallel text
  - Training data: identical to NIST MT 2008
  - Arabic-to-English: 6M sentences, 150M words
- Hybrid system based on three Arabic morphological decompositions
  - MADA-D2, MADA-D3, Sakhr
- MET parameter optimization for BLEU, separately for newswire and web text
  - MT06 newswire and web sets for tuning
- Two English language models
  - Top cell pruning: 4-gram Kneser-Ney LM estimated over 0.8B words
  - Lattice generation: zero-cutoff, stupid-backoff 5-gram LM estimated over 4.7B words (scoring sketched below)
  - Exact back-off implementations with failure transitions
- Lattice-based MBR rescoring
- Casing with SRILM
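For readers unfamiliar with stupid backoff (the smoothing used for the large 5-gram LM), the recursion is simple: use the n-gram relative frequency when the n-gram was seen, otherwise back off to the (n-1)-gram score scaled by a fixed factor (0.4 in Brants et al., 2007). A minimal sketch with hypothetical toy count tables, not the CUED implementation:

```python
# Stupid-backoff scoring (Brants et al., 2007): unnormalized scores,
# fixed backoff factor, no discounting. `counts` maps n-gram tuples to
# raw counts; `total` is the corpus word count. Toy stand-ins only.
ALPHA = 0.4

def stupid_backoff(counts, total, words):
    """Score words[-1] given the preceding context words."""
    if len(words) == 1:
        return counts.get(words, 0) / total          # unigram relative frequency
    ngram = counts.get(words, 0)
    context = counts.get(words[:-1], 0)
    if ngram > 0 and context > 0:
        return ngram / context                       # seen: relative frequency
    return ALPHA * stupid_backoff(counts, total, words[1:])  # back off

counts = {("the",): 3, ("cat",): 2, ("the", "cat"): 2}
print(stupid_backoff(counts, total=10, words=("the", "cat")))  # 2/3
print(stupid_backoff(counts, total=10, words=("a", "cat")))    # 0.4 * 2/10
```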

HiFST: Hierarchical Translation with WFSTs [1]

Goal: keep all possible derivations in each cell; efficiently explore the largest hypothesis space T in argmax_{t ∈ T} P(s|t) P(t)

- CYK grid constructed in the usual way
- Build a WFSA in each cell for the target side of the translation rules active in that cell
  - Avoids cube pruning
  - Faster and more efficient to work with sublattices containing many hypotheses than to work individually with distinct hypotheses
- In each cell:
  - for each rule in the cell, build a rule WFSA by concatenating its target elements
  - build the cell WFSA by unioning the rule WFSAs

(Figure: CYK grid over source words s1 s2 s3; cells are annotated with the number of hypotheses they contain, e.g. x8420, x420, x20.)

[1] G. Iglesias, A. de Gispert, E. R. Banga, and W. Byrne. 2009. Hierarchical Phrase-Based Translation with Weighted Finite State Transducers. Proc. of NAACL-HLT.

Building Rule WFSAs by Concatenation

(Figure: example automata for two rules, R5 and R12, each built by concatenating the WFSAs of its target-side elements.)

Building the Cell WFSA by Union

(Figure: the cell WFSA formed as the union of the automata for rules R5 and R12.)

- Can be made compact
- The target language model can be applied if pruning during search is required
- The translation hypothesis space can be built incrementally and efficiently (see the sketch below)
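The concatenation/union construction is easy to see on a toy acceptor. The sketch below deliberately models an acceptor as the set of strings it accepts, which matches the unweighted semantics but not the compact shared structure of real OpenFst lattices; the class, function names, and rule contents are invented for illustration.

```python
# Toy illustration of the HiFST cell construction: a rule WFSA is the
# concatenation of its target elements (terminals and sub-cell automata),
# and a cell WFSA is the union of its rule WFSAs.
import itertools

class FSA:
    def __init__(self, strings):
        self.strings = set(strings)  # each accepted string is a tuple of words

def accept(word):
    return FSA({(word,)})

def concat(a, b):
    # Concatenation: every path of `a` followed by every path of `b`.
    return FSA({x + y for x, y in itertools.product(a.strings, b.strings)})

def union(a, b):
    # Union: accept any path of either automaton.
    return FSA(a.strings | b.strings)

# Rule R5 target side "the X house", where sub-cell X yields two options;
# rule R12 target side "houses". (Rule contents are made up for the demo.)
x_cell = FSA({("blue",), ("red",)})
r5 = concat(concat(accept("the"), x_cell), accept("house"))
r12 = accept("houses")
cell = union(r5, r12)
print(sorted(cell.strings))
# [('houses',), ('the', 'blue', 'house'), ('the', 'red', 'house')]
```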

Delayed Translation

- Avoids expanding and replicating hypotheses until needed
- Easy implementation with the FST Replace operation
- Usual FST operations can be applied to the skeleton lattice, yielding size reductions
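The idea can be shown without FSTs: a skeleton hypothesis keeps symbolic pointers to sub-lattices and expands them only on demand. Everything below (the "$..." pointer naming scheme, the data layout) is a hypothetical stand-in for OpenFst's Replace operation on real lattices.

```python
# Delayed translation, sketched as lazy expansion: the skeleton keeps a
# pointer such as "$X_2_3" in place of a sub-lattice, and full hypotheses
# are generated only when requested.

def expand(skeleton, sub_lattices):
    """Lazily yield full hypotheses from a skeleton sequence."""
    def helper(i):
        if i == len(skeleton):
            yield ()
            return
        sym = skeleton[i]
        # A "$..." symbol points to a sub-lattice; a word stands for itself.
        options = sub_lattices[sym] if sym.startswith("$") else [(sym,)]
        for head in options:
            for tail in helper(i + 1):
                yield tuple(head) + tail
    return helper(0)

skeleton = ("the", "$X_2_3", "house")
sub_lattices = {"$X_2_3": [("blue",), ("very", "old")]}
for hyp in expand(skeleton, sub_lattices):
    print(" ".join(hyp))
# the blue house / the very old house
```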

Building the Rule Set

Hierarchical rules extracted from MTTK-aligned parallel text.

Standard constraints [2]:
- the maximum number of non-terminals is two
- adjacent non-terminals are disallowed in the source language
- unaligned words are not allowed at the edges of a rule
- at least one pair of aligned words is required per rule

Rule filtering [3]:
- by pattern, discarding the monotonic patterns (see the sketch below)
  - ⟨wX, wX⟩ and ⟨Xw, Xw⟩
  - ⟨wX1wX2, wX1wX2⟩ and ⟨X1wX2w, X1wX2w⟩
  - ⟨wX1wX2w, wX1wX2w⟩
- by number of translations (20)

[2] D. Chiang. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. Proc. of ACL.
[3] G. Iglesias, A. de Gispert, E. R. Banga, and W. Byrne. 2009. Rule Filtering by Pattern for Efficient Hierarchical Translation. Proc. of EACL.
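Pattern filtering reduces a rule to its shape: runs of terminals collapse to w, non-terminals keep their index. A minimal sketch of the discard test follows; the rule encoding is invented (the slide lists the patterns, not this implementation), and single-non-terminal patterns written wX/Xw above appear here as wX1/X1w.

```python
# Rule filtering by source/target pattern. Terminal runs collapse to "w";
# non-terminals (X1, X2) are kept. A monotonic rule whose pattern appears
# on the discard list is dropped.
import re

DISCARD = {"wX1", "X1w", "wX1wX2", "X1wX2w", "wX1wX2w"}

def pattern(symbols):
    out = []
    for s in symbols:
        if re.fullmatch(r"X\d", s):
            out.append(s)
        elif not out or out[-1] != "w":
            out.append("w")   # collapse a run of terminals into one "w"
    return "".join(out)

def keep(src, tgt):
    p = pattern(src)
    # Monotonic means the same pattern on both sides (X1, X2 in order).
    return not (p in DISCARD and p == pattern(tgt))

print(keep(["la", "X1", "casa"], ["the", "X1", "house"]))  # True: wX1w is kept
print(keep(["muy", "X1"], ["very", "X1"]))                 # False: wX1 discarded
```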

Grammar configurations: Shallow grammar

Goal: avoid nested hierarchical rules (non-terminals inside non-terminals)

Full hierarchical grammar:
  S → ⟨X, X⟩                                     glue rule
  S → ⟨S X, S X⟩                                 glue rule
  X → ⟨γ, α⟩,   γ, α ∈ ({X} ∪ T)+                hiero rules of any level

Shallow-1 grammar:
  S  → ⟨X1, X1⟩                                  glue rule
  S  → ⟨S X1, S X1⟩                              glue rule
  X1 → ⟨γ0, α0⟩,  γ0, α0 ∈ ({X0} ∪ T)+           hiero rules, level 1
  X0 → ⟨γp, αp⟩,  γp, αp ∈ T+                    regular phrases

- For Arabic-to-English, the shallow-1 grammar performs as well as full hiero
- The search space is constrained, but it can be built exactly and quickly: no pruning required (admissibility check sketched below)
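The shallow-1 restriction is just a constraint on which non-terminals may appear on a rule's right-hand side. A small sketch of the admissibility check, with an invented rule encoding:

```python
# Shallow-1 admissibility: X0 rules contain terminals only (regular
# phrases); X1 rules may nest X0 but never another X1.

def admissible(lhs, rhs):
    nonterms = [s for s in rhs if s in ("X0", "X1")]
    if lhs == "X0":
        return not nonterms                      # terminals only
    if lhs == "X1":
        return all(s == "X0" for s in nonterms)  # no nested X1
    return False

print(admissible("X1", ["ne", "X0", "pas"]))  # True: level-1 hiero rule
print(admissible("X1", ["ne", "X1", "pas"]))  # False: nesting disallowed
```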

Grammar configurations: Low-level rule concatenation for long-range verb movement

- Allows VSO-to-SVO reordering
- The grammar allows long-range movement if the target side contains a verb
- Otherwise, only monotonic rule concatenation is allowed (this includes local reordering)
- Enforced by adding non-terminals: an additional non-terminal M^{n,k} allows monotonic concatenation of k rules X^n
- M^{n,k} is only used in a rule X^{n+1} if ∃ s ∈ T, s ∈ γ^n, such that s is a verb (see the gate sketched below)
- +0.4 BLEU in newswire

Rules included:
  S    → ⟨X2, X2⟩                                      glue rule
  S    → ⟨S X2, S X2⟩                                  glue rule
  X2   → ⟨γ1, α1⟩,  γ1, α1 ∈ ({X1, M1,2, M1,3} ∪ T)+   hiero rules, level 2
  X1   → ⟨γ0, α0⟩,  γ0, α0 ∈ ({X0} ∪ T)+               hiero rules, level 1
  X0   → ⟨γp, αp⟩,  γp, αp ∈ T+                        regular phrases
  M1,2 → ⟨X1 X1, X1 X1⟩                                concatenation of 2 X1
  M1,3 → ⟨X1 M1,2, X1 M1,2⟩                            concatenation of 3 X1
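One way to realise the gate is as an admissibility check at rule-set construction time. The slide states the constraint, not an implementation; the POS tagset, the Buckwalter-style transliteration, and the rule encoding below are all assumptions.

```python
# Verb gate on level-2 rules: a rule that uses the monotonic-concatenation
# non-terminals (M1_2, M1_3) is admitted only if one of its terminals is
# tagged as a verb. Illustrative only.
VERB_TAGS = {"VB", "VBD", "VBP"}   # assumed POS tagset

def admit_level2(side, pos_of):
    uses_m = any(s in ("M1_2", "M1_3") for s in side)
    has_verb = any(pos_of.get(s) in VERB_TAGS
                   for s in side if not s.startswith(("X", "M")))
    return (not uses_m) or has_verb

pos = {"qAl": "VBD"}   # Buckwalter-style "said" (assumed)
print(admit_level2(["qAl", "M1_2"], pos))  # True: verb present, movement allowed
print(admit_level2(["fy", "M1_2"], {}))    # False: no verb, only monotonic rules
```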

Automatic Genre Detection

Goal: train genre-specific Arabic LMs to classify input documents by perplexity
- LM data is the tokenized Arabic portion of the parallel text for the NIST 09 evaluation task
- LM vocabulary is the set of all Arabic words in the respective evaluation set
- Train Witten-Bell 3-gram NW (newswire) and NG (newsgroup) LMs from genre-specific subsets of the parallel text
- Classify an Arabic input document as newswire if PP_nw < PP_ng, and as newsgroup otherwise (a toy classifier is sketched below)

                Classification
  Reference      nw    ng
  nw             73     1
  ng             14    36

Table: Classification results for the 74 newswire (nw) and 50 newsgroup (ng) documents of the NIST mt08 evaluation set. 14 newsgroup documents are incorrectly labeled as newswire by the classifier.

Translation performance is not greatly affected by incorrectly classified documents:

                         mt08-nw                  mt08-ng
                     BLEU   TER   BP         BLEU   TER   BP
  Reference          49.3   43.7  0.979      34.4   55.2  0.977
  Genre Classifier   49.3   43.7  0.979      34.3   54.9  0.969

Table: BLEU score, TER and brevity penalty (BP) for first-pass translation using reference labels and perplexity-based classifier labels for the newswire (mt08-nw) and newsgroup (mt08-ng) evaluation subsets.

If all documents are misclassified, the overall score degrades by approximately 1.0 BLEU.
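The classifier itself is a few lines once the genre LMs exist. Below is a self-contained toy version: one Witten-Bell-smoothed bigram LM per genre (the slide uses 3-grams over far more data) and the PP_nw < PP_ng decision rule; the corpora and all details are invented.

```python
# Perplexity-based genre classification: one smoothed LM per genre, label
# a document with the genre whose LM assigns lower perplexity.
import math
from collections import Counter

class BigramWB:
    def __init__(self, sentences):
        self.hist, self.bi, self.word = Counter(), Counter(), Counter()
        self.follow = {}   # history -> set of distinct continuations
        for s in sentences:
            toks = ["<s>"] + s
            self.word.update(s)
            for h, w in zip(toks, toks[1:]):
                self.hist[h] += 1
                self.bi[(h, w)] += 1
                self.follow.setdefault(h, set()).add(w)
        self.total = sum(self.word.values())
        self.vocab = len(self.word) + 1          # +1 mass for unseen words

    def prob(self, h, w):
        # Witten-Bell: interpolate the bigram MLE with an add-one-floored
        # unigram, weighting the backoff by T(h), the number of distinct
        # continuations observed after history h.
        p_uni = (self.word[w] + 1) / (self.total + self.vocab)
        t, c_h = len(self.follow.get(h, ())), self.hist[h]
        if c_h == 0:
            return p_uni
        return (self.bi[(h, w)] + t * p_uni) / (c_h + t)

    def perplexity(self, doc):
        logp = n = 0
        for s in doc:
            toks = ["<s>"] + s
            for h, w in zip(toks, toks[1:]):
                logp += math.log(self.prob(h, w))
                n += 1
        return math.exp(-logp / n)

def classify(doc, lm_nw, lm_ng):
    return "nw" if lm_nw.perplexity(doc) < lm_ng.perplexity(doc) else "ng"

lm_nw = BigramWB([["president", "announced", "reforms"]])
lm_ng = BigramWB([["lol", "i", "agree"]])
print(classify([["the", "president", "announced"]], lm_nw, lm_ng))  # nw
```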

Arabic Morphological Decompositions

- Three decompositions considered: MADA-D2, MADA-D3 and Sakhr
- Independent rule sets are built for each decomposition
- Individual decompositions are not robust across genres
- Motivation for developing a hybrid system: consider all decompositions simultaneously

BLEU (4-gram LM):

                     newswire            web
  decomposition  mt06-nist  mt08    mt06-nist  mt08
  MD2              51.1     51.4      37.8     36.3
  MD3              51.0     51.4      38.0     36.2
  Sakhr            51.4     51.5      36.9     35.5

Lattice Minimum Bayes Risk for Multiple Source Language Analyses

- Lattice-based Minimum Bayes Risk gives gains over N-best MBR [4]
- MBR can be used to merge hypotheses from multiple analyses [5]
- Hierarchical decoder with one rule set for each analysis and a common target LM
- Scores (posterior distributions over n-grams) derived from each rule set are interpolated to form a single distribution for MBR (see the decision-rule sketch below)
- An alternative to packing multiple analyses into a lattice for lattice-based translation; quite easy if an MBR implementation is available

                              mt02-05-tune    mt02-05-test       mt08
                              BLEU   TER      BLEU   TER      BLEU   TER
  5-gram LM, 1-best
    MD2                       54.2   40.5     53.8   41.0     44.9   48.5
    MD3                       53.8   41.2     53.6   41.4     45.0   48.3
    Sakhr                     54.1   40.7     53.8   40.7     44.7   48.7
  MD2+MD3
    N-best MBR                55.1   40.0     54.7   40.3     46.1   47.7
    LMBR                      55.7   40.1     55.4   40.2     46.7   47.8
  MD2+Sakhr
    N-best MBR                55.4   39.7     54.9   39.9     46.5   47.7
    LMBR                      56.0   39.5     55.9   39.6     46.9   47.4
  MD2+Sakhr+MD3
    N-best MBR                55.3   39.7     54.9   40.0     46.5   47.8
    LMBR                      56.0   39.6     55.7   39.7     47.3   46.9

[4] R. Tromble, S. Kumar, F. Och, and W. Macherey. 2008. Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation. Proc. of EMNLP, pp. 620-629.
[5] A. de Gispert, S. Virpioja, M. Kurimo, and W. Byrne. 2009. Minimum Bayes Risk Combination of Translation Hypotheses from Alternative Morphological Decompositions. Proc. of NAACL-HLT.
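Tromble et al.'s linearised MBR gain makes the combination concrete: each hypothesis is scored by a length term plus n-gram matches weighted by posteriors, and here the posteriors are first interpolated across decompositions. The sketch below scores a plain hypothesis list instead of a lattice, and the θ and λ values are invented for the demo.

```python
# Linearised lattice-MBR decision rule (Tromble et al., 2008) with n-gram
# posteriors interpolated across systems (de Gispert et al., 2009):
#   e_hat = argmax_e  theta0*|e| + sum_u theta_|u| * #_u(e) * p(u)
from collections import Counter

def ngrams(words, max_n=4):
    return Counter(tuple(words[i:i + n])
                   for n in range(1, max_n + 1)
                   for i in range(len(words) - n + 1))

def lmbr_decode(hyps, posteriors_per_system, lambdas, theta0, theta):
    """hyps: token lists; posteriors_per_system: one {ngram: posterior}
    dict per decomposition; theta[n] weights order-n matches."""
    # Interpolate n-gram posteriors: p(u) = sum_i lambda_i * p_i(u)
    p = Counter()
    for lam, post in zip(lambdas, posteriors_per_system):
        for u, q in post.items():
            p[u] += lam * q
    best, best_score = None, float("-inf")
    for e in hyps:
        score = theta0 * len(e)
        for u, cnt in ngrams(e).items():
            score += theta[len(u)] * cnt * p[u]
        if score > best_score:
            best, best_score = e, score
    return best

hyps = [["the", "talks", "ended"], ["talks", "the", "ended"]]
posts = [{("the", "talks"): 0.9, ("talks", "ended"): 0.8}]
print(lmbr_decode(hyps, posts, lambdas=[1.0], theta0=-0.1,
                  theta={1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}))
# ['the', 'talks', 'ended']
```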

Conclusion

New this year:
- Syntax
  - HiFST: a WFST implementation of Hiero
  - Grammar configuration strategies:
    - shallow translation
    - modeling of verb movement for Arabic-to-English MT: good gains in newswire
- Hybrid translation system
  - Good gains from lattice MBR translation based on multiple morphological analyses
  - Particularly effective for web translation

Emphasis on efficient construction of translation search spaces:
- Minimal pruning: aim for few (i.e. no) search errors under the translation grammar
- Why: much richer search spaces lead to greater gains in subsequent decoding steps
- The entire translation process is essentially a front-end for MBR decoding
  - HiFST produces (a) very large translation lattices and (b) n-gram posterior distributions
  - No translation hypotheses are produced until the final LMBR decoding step

N.B. Our Chinese-English SMT system is pretty good, too. Since Chinese-English was offered only as a progress set, we focused on Arabic-English.

Acknowledgments

- This work was supported in part by the GALE program of the Defense Advanced Research Projects Agency, Contract No. HR0011-06-C-0022.
- G. Iglesias was supported by Spanish Government research grant BES-2007-15956 (project TEC2006-13694-C03-03).
- Thanks to Sakhr Inc. for the use of Arabic text processed by their morphological analyzer.