Institute of Computational Linguistics Treatment of Markup in Statistical Machine Translation

Similar documents
Treatment of Markup in Statistical Machine Translation

Transferring Markup Tags in Statistical Machine Translation: A Two-Stream Approach

An efficient and user-friendly tool for machine translation quality estimation

Open and Flexible Metadata in Localization

Understand Localization Standards and Use Them Effectively. John Watkins, President, ENLASO

This document is a preview generated by EVS

Abstract. 1. Introduction

iems Interactive Experiment Management System Final Report

LetsMT!: Cloud-Based Platform for Building User Tailored Machine Translation Engines

A Measurement Design for the Comparison of Expert Usability Evaluation and Mobile App User Reviews

ITS 2.0-ENABLED STATISTICAL MACHINE TRANSLATION AND TRAINING WEB SERVICES

A General Overview of Machine Translation System Development

Marcello Federico MMT Srl / FBK Trento, Italy

Towards Corpus Annotation Standards The MATE Workbench 1

AIM. 10 September

MT at NetApp This is how we do it

CA Productivity Accelerator 12.1 and Later

TMX Wish List. FEISGILTT 2016 Dublin. Angelika Zerfaß

XLIFF. An XML standard for localisation Tony Jewtushenko Oracle Peter Reynolds Bowne Global Solutions. Tuesday, November 12, 2002 LRC 2002 Conference

Building and Annotating Corpora of Collaborative Authoring in Wikipedia

A System of Exploiting and Building Homogeneous and Large Resources for the Improvement of Vietnamese-Related Machine Translation Quality

ISO INTERNATIONAL STANDARD. Financial services Universal financial industry message scheme Part 8: ASN.1 generation

Swordfish III User Guide. Copyright Maxprograms

Enabling better translation TAUS RESEARCH REPORT. Moses Users - Finding Common Ground

System Combination Using Joint, Binarised Feature Vectors

QANUS A GENERIC QUESTION-ANSWERING FRAMEWORK

CRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools

.. Cal Poly CPE/CSC 366: Database Modeling, Design and Implementation Alexander Dekhtyar..

Text clustering based on a divide and merge strategy

INF5820/INF9820 LANGUAGE TECHNOLOGICAL APPLICATIONS. Jan Tore Lønning, Lecture 8, 12 Oct

EuroParl-UdS: Preserving and Extending Metadata in Parliamentary Debates

Find MAC Address. Getting Started. LizardSystems

A Space-Efficient Phrase Table Implementation Using Minimal Perfect Hash Functions

Access Control for Shared Resources

A general framework for minimizing translation effort: towards a principled combination of translation technologies in computer-aided translation

OCR and Automated Translation for the Navigation of non-english Handsets: A Feasibility Study with Arabic

Parsing partially bracketed input

Build Better Business Models for Machine Translation

Security Based Heuristic SAX for XML Parsing

The Prague Bulletin of Mathematical Linguistics NUMBER 91 JANUARY Memory-Based Machine Translation and Language Modeling

XML Support for Annotated Language Resources

XLIFF 2.0. Dr. David Filip. Multilingual Web Workshop CNR PISA, April 4, 2011

Creation of new TM segments: Fulfilling translators wishes

Jay Lofstead under the direction of Calton Pu

Fine-grained Software Version Control Based on a Program s Abstract Syntax Tree

Advanced Control Foundation: Tools, Techniques and Applications. Terrence Blevins Willy K. Wojsznis Mark Nixon

Experimenting in MT: Moses Toolkit and Eman

Integration of UML Profiles into the SiDiff and SiLift Tools

XML APIs Testing Using Advance Data Driven Techniques (ADDT) Shakil Ahmad August 15, 2003

A General-Purpose Rule Extractor for SCFG-Based Machine Translation

Processing Structural Constraints

ISO INTERNATIONAL STANDARD. Language resources management Multilingual information framework

This document is a preview generated by EVS

On-the-Fly Data Race Detection in MPI One-Sided Communication

An UML-XML-RDB Model Mapping Solution for Facilitating Information Standardization and Sharing in Construction Industry

Using Markov Chain Usage Models to Test Complex Systems

Introducing live graphics gems to educational material

The Prague Bulletin of Mathematical Linguistics NUMBER 100 OCTOBER

Standards and Paradigms

ProQuest Dissertations and Theses Overview. Austin McLean and Marlene Coles CGS Summer Workshop, July 2017

Hybrid Models Using Unsupervised Clustering for Prediction of Customer Churn

Archiving and linguistic databases

From to etranslation in CEF

A Latex Template for Independent Work Reports Version 2016v3

MPML: A Multimodal Presentation Markup Language with Character Agent Control Functions

Qanda and the Catalyst Architecture

code pattern analysis of object-oriented programming languages

D1.6 Toolkit and Report for Translator Adaptation to New Languages (Final Version)

0.1 Induction Challenge OMDoc Manager (ICOM)

Participant Packet. Quality Attributes Workshop. Bates College Integrated Knowledge Environment

Inera s Reference Processing Tools: extyles versus Edifix

PROMT Flexible and Efficient Management of Translation Quality

1.

Query-Sensitive Similarity Measure for Content-Based Image Retrieval

Flexible Packet Matching XML Configuration

Event study Deals Stock Prices Additional Financials

TABLE OF CONTENTS PAGE TITLE NO.

Modeling Systems Using Design Patterns

Azureus Plugin for Facebook Integration

1.0 Abstract. 2.0 TIPSTER and the Computing Research Laboratory. 2.1 OLEADA: Task-Oriented User- Centered Design in Natural Language Processing

jldadmm: A Java package for the LDA and DMM topic models

Utilizing a Common Language as a Generative Software Reuse Tool

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana

Temporal Correlations between Spam and Phishing Websites

Formatting Your Paper for the MT Summit 2017 Conference

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai

Recovering Clear, Natural Identifiers from Obfuscated (JavaScript) Names

Reducing Over-generation Errors for Automatic Keyphrase Extraction using Integer Linear Programming

Similarities in Source Codes

A Hybrid Method of Hiding The Text Information Using Stegnography

Review, Adapt, Improve. April 3, 2018

AAB UNIVERSITY. Lecture 5. Use of technology in translation process. Dr.sc. Arianit Maraj

Two Practical Rhetorical Structure Theory Parsers

Acknowledgements...xvii. Foreword...xix

ISO/IEC INTERNATIONAL STANDARD

Space Details. Available Pages

The Muc7 T Corpus. 1 Introduction. 2 Creation of Muc7 T

Automatic Domain Partitioning for Multi-Domain Learning

Design and Development of Japanese Law Translation Database System

How to Write a Thesis

Transcription:

Institute of Computational Linguistics Treatment of Markup in Statistical Machine Translation Master s Thesis Mathias Müller September 13, 2017

Motivation Researchers often give themselves the luxury of pretending that only pure text matters [...] (Joanis et al., 2013) tendency to overlook applied problems September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 2/30

One of those applied problems: markup handling September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 3/30

One of those applied problems: markup handling September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 4/30

One of those applied problems: markup handling September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 5/30

One of those applied problems: markup handling September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 6/30

The essence of markup handling September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 7/30

Why is markup handling important? (1) Translation industry has a need to translate content with markup (2) Markup errors harm translator productivity 1 1 see e.g. O Brien (2011); Joanis et al. (2013); Parra Escartín and Arcedillo (2015) September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 8/30

Existing solutions Complete list of popular machine translation frameworks that have markup handling: September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 9/30

Existing solutions Is the code that implements markup handling available to the public? Are there empirical comparisons of the proposed method? publication code experiments Du et al. (2010) Zhechev and van Genabith (2010) Hudík and Ruopp (2011) Tezcan and Vandeghinste (2011) Joanis et al. (2013) September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 10/30

Existing solutions Is there pseudo code or a clear explanation of the algorithm? solution code experiments algorithm D (2010) Z (2010) H (2011) T (2011) J (2013) ( ) solution code experiments algorithm thesis work September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 11/30

The mtrain framework Code: https://gitlab.cl.uzh.ch/mt/mtrain September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 12/30

Implemented strategies two general classes of strategies, masking and reinsertion how do they solve the problem of markup handling? September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 13/30

How to solve markup handling? September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 14/30

How to solve markup handling? Make tokenization more sophisticated. After that, hide: replace markup with an innocuous string (masking) remove markup alltogether (reinsertion)... in any case, the original content needs to be restored after translation (seek) September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 15/30

How Masking Works September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 16/30

How Masking Works September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 17/30

How Masking Works This is what the final training corpus looks like! September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 18/30

How Reinsertion Works September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 19/30

How Reinsertion Works This is what the final training corpus looks like! September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 20/30

Undoing masking or markup removal September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 21/30

Using word alignment or similar correspondence September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 22/30

Experiments in a nutshell test 5 markup handling methods data sets where markup is abundant train standard Moses SMT systems, vary only the markup handling method manual evaluation of performance: September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 23/30

Manual evaluation of 568 tags in XLIFF data September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 24/30

Manual evaluation of 584 tags in Euromarkup data September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 25/30

Wrapping up The main contributions of my thesis are: a survey of existing solutions for markup handling in SMT implementations of novel and existing methods in a unified framework that is available for free recommendations regarding the choice markup strategy (... and other things I could not mention in all brevity :( ) September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 26/30

Calling the mtrain API example >>> from mtrain. preprocessing. reinsertion import Reinserter >>> reinserter = Reinserter( alignment ) >>> source_segment = Hello <g id ="1" ctype ="x- bold;" > World! </g> # markup removal, then translation... >>> translated_segment = Hallo Welt! >>> alignment = [(0,0),(1,1),(2,2)] >>> reinserter. reinsert( source_segment, translated_segment, alignment) Hallo <g ctype ="x- bold;" id ="1" > Welt! </g> September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 27/30

<b>thanks</b> for listening! First question for a lively discussion: Yeah, but does this work for neural machine translation? September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 28/30

Bibliography I Du, J., Roturier, J., and Way, A. (2010). TMX markup: a challenge when adapting smt to the localisation environment. In EAMT - 14th Annual Conference of the European Association for Machine Translation. European Association for Machine Translation. Hudík, T. and Ruopp, A. (2011). The integration of Moses into localization industry. In 15th Annual Conference of the EAMT, pages 47 53. Joanis, E., Stewart, D., Larkin, S., and Kuhn, R. (2013). Transferring markup tags in statistical machine translation: A two-stream approach. In O Brien, S., Simard, M., and Specia, L., editors, Proceedings of MT Summit XIV Workshop on Post-editing Technology and Practice, pages 73 81. O Brien, S. (2011). Towards predicting post-editing productivity. Machine translation, 25(3):197 215. Parra Escartín, C. and Arcedillo, M. (2015). Machine translation evaluation made fuzzier: A study on post-editing productivity and evaluation metrics in commercial settings. In Proceedings of MT Summit XV, pages 131 144. Association for Machine Translation in the Americas. Tezcan, A. and Vandeghinste, V. (2011). SMT-CAT integration in a Technical Domain: Handling XML Markup Using Pre & Post-processing Methods. Proceedings of EAMT 2011. September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 29/30

Bibliography II Zhechev, V. and van Genabith, J. (2010). Seeding statistical machine translation with translation memory output through tree-based structural alignment. In Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation, pages 43 51, Beijing, China. Coling 2010 Organizing Committee. September 13, 2017 University of Zurich, Institute of Computational Linguistics, MA, Mathias Müller 30/30