MMT Modern Machine Translation


Second Design and Specifications Report

Author(s): Davide Caroselli, Nicola Bertoldi, Mauro Cettolo, Marcello Federico
Dissemination Level: Public
Date: July 1st, 2016

Grant agreement no:
Project acronym: MMT Modern Machine Translation
Project full title: MMT will deliver a language-independent commercial online translation service based on a new open-source machine translation distributed architecture
Funding scheme: Collaborative project
Coordinator: Alessandro Cattelan (TRANSLATED)
Start date, duration: 1 January 2015, 36 months
Distribution: Public
Contractual date of delivery: April 1st, 2016
Actual date of delivery: July 1st, 2016
Deliverable number: 1.2
Deliverable title: Second Design and Specifications Report
Type: Report
Status and version: Final
Number of pages: 20
Contributing partners: TRANSLATED, FBK
WP leader: TRANSLATED
Task leader: TRANSLATED
Authors: Nicola Bertoldi, Davide Caroselli, Mauro Cettolo, Marcello Federico, David Madl, Luca Mastrostefano
EC project officer: Saila Rinne

The partners in MMT are:
- Translated S.r.l. (TRANSLATED), Italy
- Fondazione Bruno Kessler (FBK), Italy
- The University of Edinburgh (UEDIN), United Kingdom
- TAUS B.V. (TAUS), The Netherlands

For copies of reports, updates on project activities and other MMT-related information, contact:

TRANSLATED MMT
Alessandro Cattelan
Via Nepal, 29
I Rome, Italy
Phone: (+39)
Fax: (+39)

Nicola Bertoldi, Davide Caroselli, Mauro Cettolo, Marcello Federico, David Madl, Luca Mastrostefano

No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.

Table of Contents

1. Executive Summary
2. Introduction
3. Use Cases
   3.1 General Requirements
       3.1.1 Linguistic requirements
       3.1.2 Functional requirements
       3.1.3 Performance requirements
       3.1.4 Computational requirements
4. Architecture Specifications
   4.1 Distributed Infrastructure
       4.1.1 Fault tolerance
       4.1.2 Distribution of updates
       4.1.3 Scalability
5. Components Design and Specifications
   5.1 Context Analyzer
   5.2 Word Aligner
   5.3 Adaptive Language Model
   5.4 Adaptive Phrase Table
   5.5 Text Processing
       5.5.1 Tokenizer and detokenizer
       5.5.2 XML Tag Manager

1. Executive Summary

This document presents the updated design and specifications for the ModernMT prototypes and final product, which will be developed during the final 18 months of the project. Starting from the use cases addressed, we outline the overall requirements of the ModernMT product and then explain how these are reflected in the design and specifications of the architecture and of the main components developed in the project.

2. Introduction

This report is an updated version of the previous deliverable D1.1, First Design and Specifications Report, which was prepared before developing the first prototype versions of the ModernMT software. The present document follows extensive development and testing phases, performed first on a minimum viable product of ModernMT and then on several versions of the first prototype. Testing activities also included field tests with real potential customers, represented by large IT companies. The field tests compared each company's current in-house MT solution against ModernMT under fair conditions: the same training data, and MT quality evaluation carried out by external people. Experience gained from the field tests, discussions about requirements with the potential customers, and a preliminary market analysis conducted by the industrial partner inspired the new list of requirements described in this document.

This report is structured as follows. Section 3 defines the two foreseen use cases of ModernMT and describes the main one in detail. The general requirements of the main use case are then defined in terms of linguistic, functional, performance, computational, and architectural requirements. In particular, linguistic requirements define the progression of translation directions that ModernMT will cover; functional requirements, the core operations it will support; performance requirements, the level of MT quality to target; computational requirements, the processing speed perceived by the user; and architectural requirements, the way distributed processing and scalability should be improved during the second part of the project. Section 4 describes the requirements for the ModernMT software architecture, covering distributed processing, reliability, updating policy, and scalability. Finally, Section 5 covers the design and specifications of the main components that the project is developing around the Moses core platform: the context analyser, the word aligner, the language model manager, the translation model manager, and the pre- and post-processors. The specifications of each component are mainly defined in terms of required processing speed, taking into account, where possible, the contribution of that component to the core operations previously defined at the use case level.

3. Use Cases

ModernMT aims to develop an innovative solution for the translation industry by providing both better MT quality for post-editing and a better integration of MT with commercial CAT tools. Two use cases of ModernMT have been identified: (i) the enterprise use case, in which a language service provider or the localisation department of a large company installs ModernMT to manage its translation workflow; (ii) the translator use case, in which single translators install the ModernMT plugin in their favorite CAT tool and use ModernMT as their preferred source of suggestions/matches for their daily workflow.

This document mainly focuses on the translator or plugin use case, which seems the most promising from a commercial perspective, but also the most demanding in terms of requirements and specifications. In fact, we believe that the requirements of the enterprise use case are included in those of the plugin use case.

The plugin use case assumes that professional translators can purchase a plugin that integrates naturally in their CAT tool and that provides suggestions from a machine translation engine that:
- Instantly adapts to the document they are translating
- Quickly learns from their data (translation memories) and their post-editing work

In terms of performance and usability, ModernMT should perform better than, and be simpler to use than, any available customizable commercial MT service (e.g. Microsoft Hub). At the same time, ModernMT should also perform better than popular online MT systems such as Google Translate. In order to meet the market demand implied by a CAT tool like MateCat, ModernMT should offer at least 60 language pairs and support at least 10,000 active users, with the same order of magnitude of translation memories uploaded in the system.

From a functional perspective, the ModernMT plugin should allow a user to perform the following operations:
1. Log in as a user
2. Upload one or more TMs
3. Connect ModernMT to a CAT tool through a key
4. Receive translation suggestions directly in the CAT tool
5. Seamlessly update ModernMT while using the CAT tool

An important aspect of ModernMT in general, and of the plugin scenario in particular, is the data privacy model offered to the customers/users that will upload their TMs through the ModernMT plugin. While encouraging users to share their data for the sake of the overall machine translation quality, we also foresee two data privacy options:
- Standard privacy: data from one user can be used to generate machine translations for other users, but the source will be hidden from them. In other words, users will not know the origin of the translation fragments (phrases) that ModernMT used to assemble the machine translation of their text.
- Strong privacy: data supplied by a user will not be accessed to generate translations for another user.

While the first level of privacy will be guaranteed by default, the strong privacy modality will be offered as an option.

3.1 General Requirements

The use case introduced in the previous section implies requirements at different levels: linguistic, functional, performance, computational, and architectural. We briefly go through each of them.

3.1.1 Linguistic requirements

We define the following progression in terms of required language coverage and of language resources required for each language pair. The exact language pairs will be determined on the basis of commercial criteria by the industrial partner of the project (Translated).

Date | Languages | Resources
2016 Q2 | 5 language pairs | 200 million parallel words, 5 billion monolingual words
2016 Q4 | 15 language pairs | 200 million parallel words, 5 billion monolingual words
2017 Q2 | 30 language pairs | 200 million parallel words, 5 billion monolingual words
2017 Q4 | 60 language pairs | 200 million parallel words, 5 billion monolingual words

3.1.2 Functional requirements

The plugin scenario introduces the following functional requirements on the back end of ModernMT, presented here with their timeline.

Date | Function | Description
Q ¹, Q | Fast training | ModernMT can be quickly trained from scratch starting from a collection of parallel and monolingual data.
 | Context-aware translation | ModernMT translates segments by considering the context in which they occur.
 | Incremental training | ModernMT can be updated by adding a new TM to the available data, without restarting the system.
 | Data privacy models | Both standard and strong privacy models can be applied.
Q | Online learning | ModernMT can be updated from a stream of post-edited data. ModernMT knows where the data comes from and where to add them.

¹ This requirement has been achieved by the time of writing this deliverable.

3.1.3 Performance requirements

Requirements for translation quality are described below for two scenarios: (i) the user does not provide a TM to translate a document; (ii) the user provides a TM to translate a document. For each condition, we have identified two commercial reference competitors, respectively offering a generic online system and a customizable online system. Our goal is to offer better translation quality than the competitors in terms of automatic scores (BLEU) as well as human evaluation based on quality ranking. Performance will be measured on benchmarks based on real translation projects. Improvements have to be statistically significant.

Condition | Target | Description
Translation w/o TM | Better than reference online MT | For all language pairs and available industrial benchmarks
Translation w TM | Better than reference online and commercial customizable engines | For all language pairs and available industrial benchmarks

3.1.4 Computational requirements

From the user perspective, translation quality is not all that counts. The user experience is strongly related to how smoothly and seamlessly ModernMT integrates into the human workflow. In particular, we are concerned with the perceived response time of ModernMT during training, online learning and translating. Our requirements are listed below.

Function | Performance | User experience
TM upload time | 1 million words in 30 sec | Training time must be comparable to upload time of user data.
Translation time | < 5 sec / segment (15 words on average) | Translation time should not cause any delay in the workflow. Translation time should allow prefetching at steps of one segment.
Online learning | < 5 sec / segment (15 words on average) | Updates during online learning should be effective from the second next segment.
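The targets in the table above imply simple throughput and latency budgets. The following is a minimal sketch of that arithmetic; only the numeric targets come from the table, while the constant and function names are hypothetical and chosen for illustration.

```python
from typing import Dict

# Numeric targets taken from the computational requirements table.
TM_UPLOAD_WORDS = 1_000_000   # size of the reference TM, in words
TM_UPLOAD_SECONDS = 30        # allowed time to ingest it
SEGMENT_SECONDS = 5.0         # upper bound per translated segment


def required_ingest_rate() -> float:
    """Words per second the training pipeline must sustain to meet the upload target."""
    return TM_UPLOAD_WORDS / TM_UPLOAD_SECONDS


def meets_segment_budget(component_seconds: Dict[str, float]) -> bool:
    """True if the summed per-component latencies stay strictly under the segment bound."""
    return sum(component_seconds.values()) < SEGMENT_SECONDS
```

For instance, the upload target corresponds to roughly 33,000 words per second, and a hypothetical breakdown such as `{"context": 0.3, "decoding": 2.0}` would fit within the 5-second segment budget.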

3.1.5 Architectural requirements

The following requirements are related to the deployment of the ModernMT architecture as an online commercial service. Requirements are listed according to a temporal progression that gives priority first to the functional/performance aspects and then to the scalability aspects. Notice that the ultimate goal is to permit scalability up to 10K users and 100K TMs.

Date | Feature | Description
Q | Distributed workload; replicated static models | Workload is automatically balanced among a fixed number of nodes. No scalability is allowed. Uses file-based models.
Q | Replicated dynamic models | File-based models can be efficiently updated with new data.
Q | Elastic architecture with scalability | Elastic with respect to workload (users). Performs efficient resync of models across nodes when models are updated. Scalability by replicating resources.
Q | Shared dynamic models; efficient scalability | Distributed models sharing information across the cluster. Scalability with efficient use of resources.

4. Architecture Specifications

The final architecture of ModernMT must be designed with the goal of fulfilling the requirements shown in the previous section. The general guidelines imply an infrastructure that can operate with multiple language pairs and that is scalable and resilient, in the sense that it must be fault tolerant and must support a dynamic reorganization of the cluster in order to redistribute resources where they are most needed.

As a corollary of the main use case, the architecture of ModernMT must support real-time incremental learning: this is a fundamental step towards the project's goals and a very important challenge for the overall interaction between the components, at both high and low levels of abstraction. We have structured the Translation Engine's models into two parts: a background model and a foreground model. The foreground model is a lightweight, incremental data structure that holds customers' data and is used to adjust the probabilities of the background model, a large, static and immutable data structure trained, once per language pair, on a large amount of data collected from the web. This design can be found in the Word Aligner, the Multiplexed Language Model and the Suffix Array Phrase Table. The following sections show in detail the design choices for each component and the improvements made with respect to the previous architecture.

4.1 Distributed Infrastructure

One of the goals of this project is to design a translation technology capable of handling the high workload of a platform with thousands of users. Only an efficient and fault-tolerant distributed infrastructure can achieve this goal: redistributing customers' translation requests across the whole cluster is the best option. In our use case, however, there is another important piece of information that must be spread across the whole cluster: translators' feedback and customers' TMs, which must be delivered to the translation engines with particular attention to data consistency. Failing to do so will result in a cluster that cannot ensure replicable behaviour, with a possible loss in translation quality and, even worse, errors in request handling.

4.1.1 Fault tolerance

Fault tolerance is a generic design principle that indicates the ability of a system to recover after an unexpected condition that has produced an error. In this section we present how ModernMT is designed to be fault tolerant in two distinct situations: a translation request error and the shutdown of a cluster node.

Translation error. The inability of the engine to complete a translation request must be handled so that the user is not alerted with an error message until it is really necessary. A translation error could be due to a temporary situation that prevented the system from operating properly, or to a deterministic error caused by the request itself. In the first case, the engine must silently retry the translation without showing an error to the end user, while in the second the error must be reported immediately to the user. This strategy gives the system the ability to recover after a temporary problem and reduces the number of errors raised during execution.
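The retry policy for translation errors can be sketched as follows. This is an illustration under stated assumptions rather than the ModernMT implementation: the exception classes, the function name and the retry parameters are all hypothetical.

```python
import time


class TransientError(Exception):
    """Temporary condition, e.g. a node that is momentarily unavailable (hypothetical)."""


class RequestError(Exception):
    """Deterministic failure caused by the request itself (hypothetical)."""


def translate_with_retry(translate, segment, max_attempts=3, backoff_seconds=0.0):
    """Call translate(segment), silently retrying on transient failures only."""
    for attempt in range(1, max_attempts + 1):
        try:
            return translate(segment)
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: only now is the error surfaced
            time.sleep(backoff_seconds * attempt)
        # RequestError is deliberately not caught: it propagates immediately.
```

A caller would wrap the engine call, for example `translate_with_retry(engine_call, "Hello world")`, where `engine_call` is whatever function submits the request to the engine.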

Cluster node shutdown. The ability of a cluster to recover after one of its nodes has halted unexpectedly depends on its topology. The topology that ModernMT will implement must allow our product to dynamically redistribute the workload even if one or more nodes shut down, without corrupting the cluster or preventing its normal operation. In more detail, the infrastructure must not have a single point of failure, that is, a configuration that stops working if a particular node halts (e.g. a master-slave configuration could lead to a global inability to operate if the master node goes down). Furthermore, the ModernMT cluster must be able to reintegrate the broken node once it has fully recovered and seamlessly restore the original cluster operation.

4.1.2 Distribution of updates

In the ModernMT use case there are two sources of updates: the user uploading a new private TM, and the translator who is translating documents while working on a translation job. Both data sources are an unbounded stream of parallel segments with source domain information attached; more precisely, in the first case the domain is brand new, while in the second the user is appending new segments to an existing domain. This stream of data must be delivered by the distributed infrastructure to every node of the cluster while ensuring data consistency. This means that a node that had a temporary problem, and lost real-time updates for a particular time window, must be able to reconnect to the data stream from the last update received before crashing.

Figure 4.1

Figure 4.1 shows a possible implementation of an infrastructure that meets the requirements listed above. The update stream, made of either individual segments or entire TMs, is backed by a distributed persistent queue; every node can read from the queue starting from any point in the past. This design allows:
- Regular nodes to receive the contributions from the users in real time.
- A recovering node to start reading updates from the latest checkpoint before crashing.
- A brand new node joining the cluster to build its models starting from the beginning of the queue.

4.1.3 Scalability

The scalability of a distributed system is the key to containing the costs of the infrastructure. Being able to redistribute resources when and where they are most needed allows the cluster to avoid wasting allocated memory and computational power on language pairs with less traffic. The ModernMT distributed architecture should support cluster resizing and language pair redistribution. The size of the cluster can change due to workload variation over a time period (e.g. daily or weekly); being able to allocate new nodes only when needed, and to shut them down as soon as they are no longer useful, can reduce the cost of the infrastructure in a cloud environment. On the other hand, allowing the architecture to give more computational power to the languages that are used most will let the system make better use of its hardware and avoid wasting resources on rarely used language combinations.

Cluster resizing. The dynamic resizing of a fault-tolerant cluster does not add complexity to the infrastructure: a new node joining the cluster can be managed as a node that is recovering after an unexpected shutdown. Similarly, shutting down a node when it is no longer needed can be managed in the same way the cluster handles a node shutting down unexpectedly.

Language pairs balancing. Not all language pairs have the same rate of translation requests. The ModernMT distributed infrastructure should be able to dynamically allocate more resources to the language pairs that are used most, while reducing resources for the rare language pairs. This optimization can be implemented by analyzing the traffic and the efficiency of the system and allowing it to dynamically load and unload translation engine resources (e.g. Language Model, Translation Model).
For example, if the system detects an overload for a particular language pair, it should reassign resources to it, evicting the models of less utilized language pairs.

5. Components Design and Specifications

The following diagram sketches the ModernMT architecture and the components developed during the project. In the following, we concentrate on the design and specifications of each component except the ModernMT core architecture, which was addressed in the previous section, and the Decoder, which will not be changed from the current implementation available in Moses.
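Before turning to the individual components, the update distribution scheme of Section 4.1.2 (a persistent queue that every node can read from any past offset) can be illustrated with a small sketch. It is an in-memory stand-in written under simplifying assumptions; the class and method names are hypothetical, and a real deployment would rely on a distributed persistent queue.

```python
class UpdateLog:
    """In-memory stand-in for a distributed persistent queue (append-only)."""

    def __init__(self):
        self._entries = []

    def append(self, domain, source, target):
        """Store one segment pair and return its offset in the log."""
        self._entries.append((domain, source, target))
        return len(self._entries) - 1

    def read_from(self, offset):
        """Yield (offset, entry) pairs starting at the given offset."""
        for i in range(offset, len(self._entries)):
            yield i, self._entries[i]


class Node:
    """Cluster node that applies updates and remembers its checkpoint."""

    def __init__(self):
        self.next_offset = 0   # first offset not yet applied
        self.segments = {}     # domain -> list of (source, target)

    def sync(self, log):
        """Apply every update not yet seen, advancing the checkpoint."""
        for offset, (domain, source, target) in log.read_from(self.next_offset):
            self.segments.setdefault(domain, []).append((source, target))
            self.next_offset = offset + 1
```

In this sketch a regular node calls `sync` continuously, a recovering node resumes from its stored `next_offset`, and a brand new node starts from offset 0, so that all three end up with the same data.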

5.1 Context Analyzer

One of the main goals of the ModernMT project is to provide context-dependent machine translation. The Context Analyzer is the component responsible for this process: it analyzes the context provided by the user and, through IR algorithms, computes weights that are used to bias the behaviour of the machine translation components in order to generate a contextualized translation of the input sentence.

The Context Analyzer module is trained with the source documents during the training phase of ModernMT. It can be queried either through a REST API or natively in Java. Given as input a text, a few sentences or a path to a local text file, it outputs a distribution of weights over the most similar documents. The current version of the Context Analyzer computes the cosine similarity, in a word space model, between the input text and the trained documents in order to estimate the real similarity of those documents.

A query to the Context Analyzer index should take less than 300 ms.

The Context Analyzer will be able to delete or update a TM that has been previously added to the index during the training phase, or to add a new one at runtime. Updating the content of a TM must take less than 1 s.

5.2 Word Aligner

Word Aligner is the ModernMT module that performs the word alignment of parallel sentences, which is required by the MT module for building its models and by the tag management module (see Section 5.5.2) for re-inserting markup tags in the translated text.
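As an aside on the Context Analyzer of Section 5.1, the cosine-similarity weighting can be sketched as below. This is a simplified illustration: it uses raw word counts, whitespace tokenization and a normalization of the top scores to sum to one, all of which are assumptions made for this sketch, and the function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts; a real system would use a proper tokenizer."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def context_weights(query, documents, top_k=3):
    """Return {doc_id: weight} over the top_k most similar documents, summing to 1."""
    q = bag_of_words(query)
    scores = {doc_id: cosine(q, bag_of_words(text)) for doc_id, text in documents.items()}
    best = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(score for _, score in best)
    if total == 0:
        return {}
    return {doc_id: score / total for doc_id, score in best if score > 0}
```

The returned dictionary plays the role of the distribution of weights that the other components receive as domain information.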

Word Aligner applies to a stream of sentence pairs and generates a stream of triplets that include the original texts (for convenience) and the set of links between source and target words. To this purpose, the module exploits a pre-trained word alignment model.

Word Aligner is also able to estimate a word alignment model, either from scratch or incrementally, given a parallel corpus. Word Aligner first estimates two directional models (source-to-target and target-to-source) and then combines them into one bidirectional (symmetrized) model. In the incremental modality, when new parallel data become available, Word Aligner adapts its models, already estimated on the previously existing data, to the new data, avoiding re-estimation from scratch on old and new data.

Word Aligner provides its functionality for estimating word alignment models and for aligning new sentence pairs by means of APIs and standalone executables.

In order to comply with the computational requirements of the ModernMT system stated in Section 3.1.4, the Word Aligner component is expected to satisfy the following speed requirements:
- The alignment of a segment pair of 15-word average length should take less than 1 ms;
- The overall estimation of the symmetrized word alignment model should take less than 20 s for a corpus of 1M running words and 100K sentence pairs;
- The loading of pre-estimated models trained on 100M running words should take less than 30 seconds.

Word Aligner is expected to satisfy the following quality requirement:
- The translation quality achieved by the decoder exploiting the word alignment model should be no more than 2 BLEU points below that obtainable with GIZA++ up to model …

5.3 Adaptive Language Model

The Adaptive Language Model (LM) module computes the LM scores of the target fragments, which the decoder requires to compute the overall score of the translation alternatives.

As already mentioned in Section 4.1.2, in the ModernMT use case all data are partitioned into pre-defined domains. New training data gathered from customers during the life cycle of the system are associated with either an existing or a brand new domain. The Adaptive LM relies on this partition to create its model.

In the bootstrap phase, the Adaptive LM estimates a language model from scratch, exploiting the domain-specific monolingual target corpora.

In the incremental modality, when new target texts become available, the Adaptive LM shall adapt without re-estimating from scratch on old and new data. If the new data are associated with an existing domain, the Adaptive LM appends the new data to the old data and re-estimates a new domain-specific LM; when ready, the LM module replaces the old foreground LM. Otherwise, if the new data belong to a brand new domain, the Adaptive LM builds a new foreground LM for the new domain using the new data, and adds it to the ensemble of domain-specific LMs. Since the new training data sets are usually small, the Adaptive LM does not update the background LM, because the impact of the new data on it is likely negligible ².

In the current design, the bootstrap and incremental training phases are kept independent of the runtime queries. This solution has several advantages:
- The system is always active and ready to serve translation requests;
- The incremental training can be performed in parallel without blocking the translation service, and possibly on different machines, without overloading those used for the translation service;
- The training of the domain-specific LMs can be performed in parallel;
- The replacement of the foreground LMs is an almost atomic operation, which does not require switching the system off and on.

We intend to support an overall number of 1,000 domains initially in 2016, and to extend this figure to the planned number of active users (10,000, see Section 3).

In order to cope with the overall computational requirements of MateCat, the Adaptive LM is expected to satisfy the following speed requirements:
- The estimation of a domain-specific LM should take less than 5 s for a corpus of 1M running words;
- The Adaptive LM should provide a query response time compatible with the overall translation time constraint.

² In any case, the system administrator can always decide to re-run the bootstrap training phase including new data as well.
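The incremental behaviour of the Adaptive LM described in Section 5.3 (rebuild the affected foreground LM aside, then swap it in, while leaving the background LM untouched) can be sketched as follows. The toy unigram model, the linear interpolation with a fixed background weight and all names are assumptions made for this illustration; they are not taken from the ModernMT code base.

```python
from collections import Counter


class UnigramLM:
    """Toy add-one-smoothed unigram model standing in for a real n-gram LM."""

    def __init__(self, words):
        self.counts = Counter(words)
        self.total = sum(self.counts.values())

    def prob(self, word):
        return (self.counts[word] + 1) / (self.total + len(self.counts) + 1)


class AdaptiveLM:
    def __init__(self, background_words):
        self.background = UnigramLM(background_words)  # static, never updated
        self.corpora = {}      # domain -> all words seen so far
        self.foreground = {}   # domain -> current foreground LM

    def add_data(self, domain, words):
        """Append data to a (possibly new) domain and swap in a rebuilt LM."""
        corpus = self.corpora.setdefault(domain, [])
        corpus.extend(words)
        new_lm = UnigramLM(corpus)         # built aside; queries still see the old LM
        self.foreground[domain] = new_lm   # replacement is one reference assignment

    def prob(self, word, domain_weights, background_weight=0.5):
        """Interpolate the background LM with the domain-weighted foreground LMs."""
        fg = sum(w * self.foreground[d].prob(word)
                 for d, w in domain_weights.items() if d in self.foreground)
        return background_weight * self.background.prob(word) + (1 - background_weight) * fg
```

Here `domain_weights` is assumed to be a distribution over domains such as the one produced by the Context Analyzer.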

5.4 Adaptive Phrase Table

Based on an index of the bilingual training corpus, the Adaptive Phrase Table module provides phrase translations and translation probabilities on the fly. As opposed to a static phrase table, this design allows domain-sensitive probabilities to be computed on the fly, based on domain information passed to the decoder at run time as a domain probability distribution.

The existing suffix-array-based Phrase Table implementation in Moses builds a static index of the training corpus on disk, which is queried in read-only mode. For this reason, the previous design of the suffix array prevents easy addition of training data to the index.

For the plugin use case, the Adaptive Phrase Table has to support incremental addition of new training data. The incremental training data arrive as a stream of individual segment pairs (see Section 4.1). The ability to incrementally add segment pairs naturally extends to the use case of adding an entire Translation Memory for a new project/customer.

New training data should influence the possible phrase translations and their probabilities immediately after their addition. This allows ModernMT to adapt to individual translators' text domains while they are working. Therefore, the index of the training corpus must be incrementally updatable.

A single engine supports multiple translators working at the same time. Therefore, both the updates and the index itself should carry information about which domain the segments belong to. We intend to support an overall number of 1,000 domains initially in 2016, and to extend this figure to the planned number of active users (10,000, see Section 3).

Finally, the Adaptive PT should also support the strong privacy data model, which forbids sampling from private translation memories.

For practicality, adding a new medium-sized Translation Memory (1 million words) to the training corpus and incremental index should not take more than 30 seconds (write latency, see Section 3.1.4). Ideally, a Phrase Table on a single cluster node (see Section 4.1.2) should be able to support thousands of different translators in terms of write throughput and the aforementioned write latency, so that scaling becomes possible even with few cluster nodes. All the while, the Phrase Table must continue to provide read access fast enough to meet the overall translation latency goal of less than 5 s per segment.
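The requirements of Section 5.4 (incremental additions, domain information attached to each entry, domain-weighted probabilities and strong privacy) can be illustrated with a toy index. This sketch replaces the suffix array with a plain dictionary and treats whole strings as phrases; the class, its methods and the `owner`/`user` parameters are hypothetical.

```python
from collections import defaultdict


class AdaptivePhraseTable:
    """Toy updatable index; a dictionary stands in for the suffix array."""

    def __init__(self):
        self.index = defaultdict(list)   # source phrase -> [(target phrase, domain)]
        self.private_owner = {}          # strong-privacy domain -> owning user

    def add(self, source, target, domain, owner=None, strong_privacy=False):
        """Add one pair; it is visible to queries immediately afterwards."""
        if strong_privacy:
            self.private_owner[domain] = owner
        self.index[source].append((target, domain))

    def translations(self, source, domain_weights, user=None):
        """Return {target: probability}, weighting each occurrence by its domain."""
        scores = defaultdict(float)
        for target, domain in self.index.get(source, []):
            if domain in self.private_owner and self.private_owner[domain] != user:
                continue  # strong privacy: never sample another user's data
            scores[target] += domain_weights.get(domain, 0.0)
        total = sum(scores.values())
        return {t: s / total for t, s in scores.items()} if total else {}
```

For example, with the same source phrase stored under two domains, changing `domain_weights` shifts the probability mass between the two translations without rebuilding anything.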

5.5 Text Processing

5.5.1 Tokenizer and detokenizer

The tokenizer and detokenizer are the two most language-dependent components of the pre- and post-processing pipelines. Both need rules or models for tokenization that must be customized for every single language. While the project already has tokenizers for 45 languages, for all but two languages we still need to evaluate and optimize these components on real benchmarks.

Evaluating the tokenizers and the detokenizers in isolation is not very reliable, as their ultimate goal is to have a positive impact on the overall translation process. Moreover, the proper coupling of the two components is also important: a good tokenizer should provide enough information for the detokenizer to reconstruct the original sentence. The latter, in fact, has the duty of joining the tokens that the tokenizer has produced during training into a valid sentence. An overly aggressive tokenization, for example, can turn detokenization into a very hard task, sometimes even an impossible one.

We define two different tests for tokenization/detokenization evaluation: the first aims to estimate the proper coupling of the two components, while the second evaluates the utility of a candidate implementation in terms of translation quality.

The simplest test is to tokenize and detokenize some text and to compare the result with the original. The fewer edits are needed to recover the original text, the better the tokenizer/detokenizer implementation.

A more expensive test is to define a benchmark for training, tuning and evaluating the translations of the ModernMT system. Improvements in the (de)tokenization process should lead to a higher translation quality. In order to also evaluate the quality of the post-processed text, we consider both the BLEU and the Post-Editing score.

The first test should be used to quickly evaluate and iterate over different implementations in order to find out which one is the most promising. Only the second test, however, can definitively prove that the evaluated implementation leads to a real enhancement in the translation quality of the ModernMT engine.

In order to comply with the overall computational requirements of the ModernMT system, the (de)tokenization steps should take less than 2 s for a corpus of 1M running words.
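The first, cheaper test can be sketched as a round trip followed by an edit-distance count. The tokenizer and detokenizer below are deliberately simplistic stand-ins, not the ModernMT components, and character-level Levenshtein distance is an assumed choice of edit metric; all names are hypothetical.

```python
import re


def tokenize(text):
    """Toy tokenizer: split off punctuation and record whether each token
    was attached to the preceding character in the original text."""
    tokens = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        glued = match.start() > 0 and not text[match.start() - 1].isspace()
        tokens.append((match.group(), glued))
    return tokens


def detokenize(tokens):
    """Join tokens with single spaces, except where the tokenizer marked them as glued."""
    return "".join(tok if glued or i == 0 else " " + tok
                   for i, (tok, glued) in enumerate(tokens))


def edit_distance(a, b):
    """Character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def round_trip_edits(text):
    """Number of character edits needed to recover the original after a round trip."""
    return edit_distance(text, detokenize(tokenize(text)))
```

In this sketch a sentence with regular spacing survives the round trip with zero edits, while a run of three spaces costs two edits because the toy tokenizer does not record the amount of whitespace.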

5.5.2 XML Tag Manager

An important requirement in the translation industry is the capability to reproduce the layout of the input document in the output document as faithfully as possible. In particular, XML tags such as formatting tags should be re-inserted in the correct positions, and spaces of any type (tabs, multiple and hidden spaces) should be reproduced perfectly. XML Tag Manager is the ModernMT module that addresses these tasks.

For instance, assuming that the English sentence Who is the ModernMT Team? translates into Chi è il gruppo ModernMT?, given the tagged input Who is the modernmt Team? XML Tag Manager should produce the tagged output Chi è il gruppo modernmt? where the XML formatting tags enclose the correct fragments gruppo ModernMT and ModernMT, respectively. Notice that XML tags can be nested, overlapped, or even self-contained.

In the current version of the ModernMT system, XML Tag Manager:
- identifies, classifies and stores tags and spaces in the input sentence;
- removes tags;
- transforms spaces into standard spaces;
- handles characters with unusual encoding;
- sends the input to the MT decoder and receives the output translation, which does not include any tag and has standard spaces;
- sends input and output to a word aligner and receives back the word alignments in both forward and backward directions;
- symmetrizes the forward and backward alignments;
- re-inserts the stored tags and spaces in the best positions suggested by the symmetrized word alignment.

XML Tag Manager currently does not handle errors in the character encoding of the input.

XML Tag Manager is expected to have an overall accuracy above 80% for all language pairs taken into account by the ModernMT system, where the overall accuracy is the percentage of sentences that are correct in terms of number and positions of XML tags and white spaces, among all sentences without character encoding errors.
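The last step of the list above, re-inserting tags at positions suggested by the word alignment, can be sketched for the simple case of properly nested tag pairs. The tag names `b` and `i`, the span representation and the function names are assumptions made for this illustration; overlapping tags, self-contained tags and whitespace restoration are not covered.

```python
def project_tags(spans, alignment):
    """Map (name, start, end) spans over source tokens to spans over target tokens.

    alignment is a set of (source_index, target_index) links; end is exclusive.
    """
    projected = []
    for name, start, end in spans:
        targets = [t for s, t in alignment if start <= s < end]
        if targets:  # spans with no aligned word are dropped in this sketch
            projected.append((name, min(targets), max(targets) + 1))
    return projected


def render(tokens, spans):
    """Wrap each token span in <name>...</name> (spans assumed nested, not crossing)."""
    opens = {i: [] for i in range(len(tokens))}
    closes = {i: [] for i in range(len(tokens))}
    # Longer spans first, so outer tags open before and close after inner ones.
    for name, start, end in sorted(spans, key=lambda s: s[1] - s[2]):
        opens[start].append("<%s>" % name)
        closes[end - 1].insert(0, "</%s>" % name)
    return " ".join("".join(opens[i]) + tok + "".join(closes[i])
                    for i, tok in enumerate(tokens))
```

With source tokens `Who is the ModernMT Team ?`, a hypothetical `b` span over `ModernMT Team` and an `i` span over `ModernMT`, and an alignment in which `ModernMT` and `Team` swap positions, the projected tags enclose `gruppo ModernMT` and `ModernMT` in the Italian output.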
