MMT Modern Machine Translation. First Evaluation Plan


This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No .

MMT Modern Machine Translation

First Evaluation Plan

Author(s): Ulrich Germann, Barry Haddow, Anna Samiotou, Luca Mastrostefano, Mauro Cettolo
Dissemination Level: Public
Date: June 30th, 2015

Grant agreement no.:
Project acronym: MMT
Project full title: MMT will deliver a language independent commercial online translation service based on a new open-source machine translation distributed architecture
Funding Scheme: Collaborative Project
Coordinator: Alessandro Cattelan (TRANSLATED)
Start date, duration: 1 January 2015, 36 months
Distribution: Public
Contractual date of delivery: June 30th, 2015
Actual date of delivery: July 1st, 2015
Deliverable number: 5.1
Deliverable title: First Evaluation Plan
Type: Report
Status and version: Final
Number of pages: 14
Contributing partners: UEDIN, TAUS, TRANSLATED, FBK
WP leader: UEDIN
Task leader: UEDIN
Authors: Ulrich Germann, Barry Haddow, Anna Samiotou, Luca Mastrostefano, Mauro Cettolo
EC project officer: Saila Rinne

The partners in MMT are: Translated S.r.l. (TRANSLATED), Italy; Fondazione Bruno Kessler (FBK), Italy; The University of Edinburgh (UEDIN), United Kingdom; TAUS B.V. (TAUS), The Netherlands.

For copies of reports, updates on project activities and other MMT-related information, contact:
TRANSLATED, Alessandro Cattelan, alessandro@translated.net, Via Nepal, 26, Rome, Italy. Phone: (+39), Fax: (+39).

© 2015 Ulrich Germann, Barry Haddow, Anna Samiotou, Luca Mastrostefano, Mauro Cettolo. This document has been released under the Creative Commons Attribution-NonCommercial-ShareAlike License v.4.0.

Contents

1 Introduction
2 Evaluation of Data Quality (Task 5.2)
  2.1 Data that will be collected for MMT
  2.2 Data quality evaluation methodology
  2.3 Data quality measures
    2.3.1 Language independent data quality indicators
    2.3.2 Language dependent data quality indicators
  2.4 Metadata and quality evaluation
  2.5 Data quality tests
3 MT component testing
  3.1 Word alignment
  3.2 Automatic measures of translation quality against external benchmarks
  3.3 Human evaluation of translation quality
  3.4 Speed/quality trade-offs
  3.5 Instant domain adaptation vs. static systems
  3.6 Big Data MT
4 Performance and Scalability Testing
  4.1 Performance and scalability test environment
  4.2 Result analysis
5 Schedule

Modern MT
Deliverable 5.1: First Evaluation Plan

Ulrich Germann (a), Barry Haddow (a), Anna Samiotou (b), Luca Mastrostefano (c), and Mauro Cettolo (d)
(a) University of Edinburgh, (b) TAUS, (c) Translated, (d) FBK

June 30, 2015

1 Introduction

This document presents the First Evaluation Plan for the project Modern Machine Translation (MMT) and is intended to guide the project-internal evaluation process until the First Technology Assessment Report (D-5.2), which is due in M17. It will be superseded by the Second Evaluation Plan (D-5.3; M19). In a nutshell, the goals of the project are to provide better translation quality; faster translations without noticeable loss of translation quality; short update cycles with respect to integrating and exploiting parallel data as it is being acquired; and scalability with respect to the volumes of parallel data and translation throughput that can be handled by the MT infrastructure.

To reach these goals, MMT will collect large amounts of parallel and monolingual language data (WP2), improve or replace essential components in the MT pipeline to achieve greater speed, throughput and robustness across a variety of input domains (WP3), and implement these improvements in a robust, scalable, commercial product (WP4). The mandate of WP5 is to monitor progress towards these goals. The following sections cover, in this order, assessment of the quality of data acquired in WP2, testing of the MT component(s) developed in WP3, and performance and scalability tests (for WP4).

2 Evaluation of Data Quality (Task 5.2)

There are three key ingredients to the success of contemporary statistical machine translation (SMT) systems: probabilistic models of the translation process / translation quality, algorithms to train and apply these models, and large amounts of language data (both parallel and monolingual) that are used to construct the knowledge bases and probabilistic models the translation system relies on.
Within the context of this project, the notion of data quality thus refers to how useful the data collected within WP2 is for building and improving SMT systems. The goal of Task 5.2 is to develop criteria that allow us to select data that improves translation quality, while rejecting data that only adds noise. Below, we first describe what types of data will be collected in the course of WP2 and then turn our attention to how their quality will be evaluated.

2.1 Data that will be collected for MMT

We will collect bilingual data for translation modeling, monolingual data for language modeling, and metadata that can help determine the applicability and usefulness of the bilingual and monolingual data collected. The metadata collected as part of the data collection efforts within MMT are described in more detail in Section 2.4 below. Monolingual data consists of the monolingual data contained in the bilingual corpora collected, plus additional data for which no existing translations are known.

Bilingual data come from a variety of sources:

- Translation Memories (TMs) from the TAUS Data and MyMemory repositories. The translated segments are produced by professional translators (either from scratch or by post-editing MT output) or by crowdsourcing methods.
- Open and publicly available data (such as parallel corpora from the EU, UN, etc.). Parallel segments from such sources are usually of good quality, but all data collected will be evaluated to ensure quality.
- Automatically harvested and aligned web data from bi- or multilingual websites included in the Common Crawl data sets. Bilingual data mined from the web are often of low quality. The quality of such data will be thoroughly evaluated and improved in order to avoid degradation of SMT performance.
- Data obtained from third parties, which may arrive in different file formats (such as TMX, doc, Excel, PDF, etc.). Third parties include companies, organizations or institutions that provide data resources in exchange for credits to use the system (Task 2.3). This is the reciprocal model adopted by TAUS for data sharing: users who wish to share their parallel data in different language combinations register with TAUS as users or members in order to upload their data through the TAUS Data platform. In this way, they gain credits to download parallel data in any preferred language combination. The pooling ratio depends on the user status.

2.2 Data quality evaluation methodology

The data quality evaluation and improvement loop can be outlined as follows:

1. check for and identify known and expected problems in the data (cf. Section 2.3 below);
2. identify unforeseen problems in the data;
3. create/update sets of appropriate data quality measures;
4. create/update scripts to clean up the datasets automatically;
5. check the correlation of data quality measures with the impact of data selection on the translation process;
6. return to step 1.

2.3 Data quality measures

In this deliverable (D-5.1) we define general data quality measures that apply to data from all types of sources listed in Section 2.1. In Deliverable D-5.3 (M19) we will develop additional data quality indicators that apply to data from specific data sources. Given the amount of parallel data that we intend to collect, comprehensive human evaluation is impossible; we must develop automatic procedures for determining whether or not specific candidate data is useful for our purposes. Over the next year, we will thus investigate easily observable characteristics of candidate data with respect to their usefulness as data quality indicators (features) in automatic labelling of data as good or bad. This decision will be based on a single quality score computed in a linear or log-linear framework as the weighted sum of individual feature values. Features that we will investigate include the ones listed below. These lists are not final; items may be dropped or added as our investigations progress.

2.3.1 Language independent data quality indicators

The following language independent data properties are signs of poor quality data:

- An unusual length ratio between source and target segments (counted in characters or words).
- Mismatches in the number of sentences.
- Empty source or target segments.
- Identical source and target segments.
- Very long sentences (above a set threshold, e.g., segments with more than 1,000 characters or 200 words).
- An unnatural distribution of word lengths in the candidate segments, for example, sequences of many short words in a row.
- Source segments that contain special characters of the target language, and vice versa.
- Inconsistency in numbers, dates, or URL addresses.
- Source and target segments that have an unusual proportion of numeric characters, or many non-alphabetic characters.
- Extra spaces between words.
- Presence of unwanted tags, symbols (e.g., emoticons) or HTML/XML entities (e.g., &amp;) and markup.
- A high ratio of bilingual segments that do not occur elsewhere in our bilingual database, i.e., items that are only found in the candidate data but nowhere else.

2.3.2 Language dependent data quality indicators

In order to apply the language dependent data properties, we will use state-of-the-art language recognizers 2 to determine the predominant language of text segments automatically. The following language dependent data properties are signs of poor quality data:

- Presence and percentage of foreign words used in segments.
- Presence and percentage of untranslated words in the target segment.
- Merged words. The presence of merged words suggests processing errors somewhere in the data creation pipeline, including the creation of the original data.
- Presence of unusually long words in the data.
- Non-uniform encoding.
- Missing accents.
- Spelling errors as detected by automatic spell checkers. 3
- Presence of named entities. Named entities are often not included in dictionaries and not recognized by spell checkers. Recognition of named entities will help to distinguish named entities (highly informative and valuable) from spelling errors (noise).
- Inconsistency in punctuation, i.e., quotes, double quotes, brackets, braces, parentheses, colons, semicolons, especially with respect to language dependent conventions, e.g. quotation marks in English (“...”) vs. French («...») vs. German („...“).
- Inconsistency in capitalization, taking into account language-specific conventions.
- Acronyms on one side but full names on the other side.

2.4 Metadata and quality evaluation

In addition to the actual data, we will collect, organise and store metadata that describes and characterises the data. A number of common distribution formats for text and language data allow the encoding of such information directly in the actual data file, such as TMX (an industry-standard XML document type for translation memories), HTML, and other XML document types. Metadata will be obtained from the data sources directly if embedded there, or, where applicable, via the user interface of the MMT system. The initial set of metadata fields is loosely modeled on, but not identical to, the Dublin Core® Metadata element set. Table 1 gives an overview of the kinds of metadata that will be collected. This metadata will be used as an additional information source to compute the quality score for data selection by the MMT system (Task 3.1), including the degree to which metadata fields are filled, based on the assumption that more complete metadata is a sign of better-curated data to begin with.

2 E.g., the Google Language Identifier or TextCat.
3 E.g., LanguageTool, Hunspell, etc.
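The single-score decision rule of Section 2.3 can be illustrated with a small sketch. The indicator functions, weights, and threshold below are illustrative assumptions only; the project's actual feature set and weights are to be determined empirically via the methodology loop of Section 2.2.

```python
# Sketch of the (log-)linear quality score from Section 2.3: each indicator
# maps a segment pair to a numeric feature value, and the score is the
# weighted sum of those values. All names and weights are illustrative.

def length_ratio(src: str, tgt: str) -> float:
    """Penalty feature: 1.0 if the character-length ratio is unusual."""
    ratio = (len(src) + 1) / (len(tgt) + 1)
    return 1.0 if ratio > 2.0 or ratio < 0.5 else 0.0

def is_empty(src: str, tgt: str) -> float:
    return 1.0 if not src.strip() or not tgt.strip() else 0.0

def is_identical(src: str, tgt: str) -> float:
    return 1.0 if src.strip() == tgt.strip() else 0.0

def too_long(src: str, tgt: str) -> float:
    return 1.0 if max(len(src), len(tgt)) > 1000 else 0.0

# Illustrative weights; in the plan these would be tuned against the
# impact of data selection on downstream MT quality (Section 2.2, step 5).
FEATURES = [(length_ratio, -1.0), (is_empty, -5.0),
            (is_identical, -3.0), (too_long, -2.0)]

def quality_score(src: str, tgt: str) -> float:
    return sum(w * f(src, tgt) for f, w in FEATURES)

def keep(src: str, tgt: str, threshold: float = -1.0) -> bool:
    """Accept a segment pair if its score is above the rejection threshold."""
    return quality_score(src, tgt) > threshold
```

For example, `keep("The cat sat on the mat.", "Il gatto era seduto sul tappeto.")` accepts the pair, while `keep("Hello", "")` rejects it, since the empty-target penalty pushes the score below the threshold.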

Table 1: Metadata to be collected for data from a variety of sources (TM: Translation Memories; Web: publicly available parallel text collections as well as parallel data harvested automatically; other: data obtained from third parties).

Metadata items: data source origin, domain/subject area, content type, data owner, creation date, change date, source language/locale, target language/locale, document id, data provider, retrieval date, URL, parallel webpage classifier metrics, location (country, city).

2.5 Data quality tests

The test sets for the evaluation of data quality will comprise data from different sources (as described in Section 2.1). The size of the test sets is yet to be specified; for example, a test set could comprise 1,000 bilingual segments from the existing repositories and 1,000 from aligned web-crawled data. The test sets will provide both positive and negative training cases, i.e., good and bad examples. The data cleaning procedure, as well as the data selection procedure, will be evaluated with respect to their ability to:

- identify and filter out data that harms performance (bad data);
- identify and retain data that improves performance (good data).

The difference between data cleaning and data selection is that data cleaning should remove data that is always harmful, whereas data selection should avoid data that may be harmful in a specific case yet select data that is useful. The quality tests will help to identify how strict or loose the quality indicators, the thresholds and the weights should be in order to get large amounts of good data from the data sets. These tests will be performed in an automatic setup and measure how variations in quality score computation and threshold values affect automatic measures of MT performance (such as BLEU, TER, etc.).

3 MT component testing

MT performance can be measured along several dimensions:

- translation speed: how long does it take to produce a translation?
- translation quality: how good are the translations?
- resource use: how many resources (e.g., memory, computational effort) are needed?
- scalability: how well does the system handle an increase in translation volume, and how do the costs scale with the throughput?
- generality: how well is the system able to translate a variety of texts from different genres, domains, and topic areas?

Of course, these aspects of MT performance are inter-related: more thorough exploration of the search space, i.e., considering more translation options, may lead to better translations, but also requires more computational resources (time and/or hardware). Parallelisation allows trading resource use for speed (within limits); the use of resource-intensive statistical models (disk space, memory, and/or computational effort) may lead to better translations but drive up the cost of doing translation, etc. This means that the performance of a system cannot be distilled into a single number in the context of real-world use of MT: trade-offs always have to be considered. In addition to one-dimensional performance measures (e.g., translation quality), we therefore plan to conduct grid evaluations, in which we systematically try out ranges of parameter settings that certainly affect translation speed and may affect translation quality. Specifically, we will conduct the following kinds of evaluation.

3.1 Word alignment

Word alignment (WA) is a crucial processing step in the MT system construction pipeline that automatically establishes correspondences between individual words in a bitext, i.e., word-level translation relations. These word-level correspondences guide the discovery of phrase-level translation relations that are used by the decoder to create sentence-level translation hypotheses. There are currently two established OSS tools for word alignment that are widely used in the field:

- Giza++ / Mgiza 5 re-implements the original work of Brown et al. (1993) at IBM.
- FastAlign (Dyer et al., 2013) is a recent re-parameterization of IBM Model 2.

FastAlign is much faster than Giza++ and Mgiza. The tools' authors claim that the increase in alignment speed comes without deterioration of alignment quality. We intend to confirm this claim empirically with a comparative evaluation of the two tools. We will evaluate both tools on data sets of varying sizes, ranging from tens of thousands of words (corresponding to the volume of adaptation data selected for translating single segments) to hundreds of millions of words, a typical size for training sets for standard SMT models. Alignment quality will be measured in terms of F-measure and Alignment Error Rate (AER).
The impact on translation quality will be assessed by a pair-wise comparison of systems that differ only in the underlying word alignment but are otherwise identical, using the usual automatic quality metrics: BLEU, TER, etc. Experiments will be carried out on public and MMT benchmarks, such as:

- the shared WA tasks of the ACL 2005 workshop on under-resourced languages, for which training and test sets were prepared for the English-{Hindi, Inuktitut, Romanian} pairs;
- training and test sets used by one of the partners (FBK) in previous investigations on WA, involving the English-{French, Italian, Spanish} pairs (legal and parliamentary documents);
- the Edinburgh corpus of manual WAs of German-English bitexts, in the Europarl and News domains.

We expect that our tests will show that FastAlign offers the better speed/quality ratio, and we expect to use it as the basis for planned extensions to our word alignment capabilities. Giza++/Mgiza assume by default no prior knowledge and start with a blank slate (uniform probabilities) for each parallel corpus that they are asked to align. This is sub-optimal for small chunks of parallel text, as small amounts of text often do not provide enough information for accurate alignments. The more (good quality) data is processed, the better the resulting models. Whatever tool is ultimately chosen as the basis of our WA module will therefore be extended to offer incremental training, i.e., models are retained and successively improved as we feed in more and more data. As the project progresses, novel word alignment methods will be designed and implemented, focusing on robustness for less-resourced languages, that is, in conditions of data scarcity. They will be experimentally compared to the state-of-the-art approaches. Each version of the WA module will be evaluated in terms of the Key Performance Indicators specified above.
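The alignment metrics named above can be made concrete with a short sketch, following the standard sure/possible formulation of gold alignment annotations; link sets are represented here as sets of (source index, target index) pairs.

```python
# Sketch of the alignment quality metrics from Section 3.1.
# S is the set of sure gold links, P (a superset of S) the set of possible
# links, and A the hypothesis alignment produced by the tool under test.

def aer(sure, possible, hyp):
    """Alignment Error Rate (Och & Ney): lower is better, 0 is perfect."""
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / (len(hyp) + len(sure))

def alignment_f1(sure, possible, hyp):
    """F-measure with precision against possible and recall against sure links."""
    precision = len(hyp & possible) / len(hyp)
    recall = len(hyp & sure) / len(sure)
    return 2 * precision * recall / (precision + recall)
```

For example, a hypothesis that reproduces all sure links and only possible links scores AER 0.0 and F-measure 1.0; spurious links outside P raise the AER.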
3.2 Automatic measures of translation quality against external benchmarks

These lab tests will compare the end-to-end performance of the MMT decoder against external benchmarks of MT performance. The Shared Translation Task at the annual Workshop on Statistical Machine Translation (WMT) provides such benchmarks. The workshop publishes the results of all entries, and all underlying resources (in the constrained track) are publicly available. For the duration of the project, we will monitor the performance of the MMT decoder against state-of-the-art submissions to WMT. All systems will be trained and evaluated automatically on the data released for this workshop. In this context, evaluating end-to-end performance means testing the entire translation pipeline (training, pre-processing, decoding and post-processing) as opposed to testing individual components. All tests will be scored using several standard automatic metrics, not manual human evaluation, so that they can be re-run on a regular basis.

We do not necessarily expect the MMT decoder to reach or beat the performance of the top performers in WMT-2015, as these systems are carefully tuned to the WMT-2015 shared task and may involve computationally expensive, language-specific pre- and post-processing that is not feasible within the context and vision of MMT. We will measure translation quality for 8 of the 10 translation directions included in the Shared Task at WMT-2015. The translation directions French-English and English-French, on the other hand, will be considered in the big-data tests described in Section 3.6 below. As training data sets we will use all the data available for the constrained track of the tasks, except for the LDC GigaWord data sets, because they greatly increase the size of the language model (see Table 2 for a summary of the data sets). For tuning (where necessary) we will use the newstest2014 set (newsdev2015 for fi-en) and for testing we will use newstest2015.

We will compare MMT against Moses baselines, which will largely follow the University of Edinburgh's phrase-based system submissions. However, for the sake of simplicity we will use a simpler set-up. Specifically, we will not use sparse features, the Operation Sequence Model, or class-based language models. We will also omit some additional language-specific processing, such as pre-reordering for German and transliteration for Russian.

Table 2: WMT2015 data available for testing decoder performance. Word counts are on untokenised text.

                   Parallel data                              Monolingual only (news crawl)
Language pair      Sentences  Words (English)  Words (Other)  Sentences  Words    Language
Czech-English      15.8M      217M             187M           45.1M      635M     Czech
                                                              118M       2380M    English
Finnish-English    2.08M      47.9M            32.6M          1.38M      14.0M    Finnish
French-English     40.8M      1020M            1160M          45.9M      919M     French
German-English     4.52M      104M             96.2M          160M       2440M    German
Russian-English    1.74M      26.9M            23.9M          45.8M      679M     Russian

Translation quality will be measured with three widely-used automatic metrics: BLEU, TER and Meteor. We will apply the metrics to the output of the whole MMT and Moses pipelines, including all post-processing.

5 The difference between the two is that Mgiza (Gao and Vogel, 2008) is able to perform parallel processing of the data, whereas Giza++ (Och and Ney, 2003) is single-threaded.
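As an illustration of what the automatic scoring computes, the following is a minimal reimplementation of single-reference corpus-level BLEU (n-grams up to 4, no smoothing). The actual evaluation would of course use the standard reference implementations of BLEU, TER and Meteor; this sketch only shows the shape of the metric.

```python
from collections import Counter
from math import exp, log

# Minimal single-reference corpus BLEU: modified n-gram precision up to
# 4-grams, geometric mean, and brevity penalty. Illustrative only.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    match = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            match[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if 0 in match:
        return 0.0  # no smoothing in this sketch
    precision = sum(log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else exp(1 - ref_len / hyp_len)
    return bp * exp(precision)
```

A hypothesis identical to its reference scores 1.0; a degenerate hypothesis with no matching 4-grams scores 0.0 under this unsmoothed variant.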
Furthermore, we will apply the technique proposed by Clark et al. (2011) to account for stochastic variation in the components. The output of these automated tests and scoring will be a set of numerical scores with confidence intervals. For the MMT translation setup, these tests will be re-run according to the schedule in Section 5. The Moses baseline system will be the Moses Decoder v3.0, as released in January. We will set up an automated testing framework that downloads and compiles the release from the respective code repository, builds a complete translation pipeline from the respective data set, evaluates it against a set of test documents, and tabulates the results.

3.3 Human evaluation of translation quality

Human evaluation of MT quality is time-consuming and costly, but necessary to put the results of automatic evaluation into perspective. For these reasons we will perform human evaluation only on major releases of the MMT system, where it will provide a key benchmark of our progress. The human evaluation will take two different forms:

- Direct comparison of MMT system output against an established system (A/B preference testing). Evaluators will be presented with the output of the two translation systems (in random order, so that they do not know which system produced which translation) and asked which one they prefer. This evaluation will use the Dynamic Quality Framework (DQF) tools available from TAUS.
- Productivity tests, where we measure the effect on productivity of using the MMT system in a real post-editing project. In these tests, professional translators will use MateCat to post-edit raw machine translation output. The DQF will be used to analyse the post-editing effort (PEE) and time-to-edit (TTE) when using the MMT system for the machine translation component in a MateCat post-editing job.

The human evaluations will be performed by translators working for Translated.
The human evaluation will compare the MMT system against online volume translation systems (e.g., Google Translate). In order to evaluate the impact of the MMT system on translator productivity in a scenario as realistic as possible, productivity tests will be embedded in the regular workflow at Translated on actual translation jobs. The precise scheduling of these tests will be coordinated with the actual release schedule of the MMT system and ongoing business needs.
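One simple way to aggregate the A/B preference judgments from Section 3.3 is an exact two-sided sign test on the counts preferring each system, discarding ties. This is an illustrative analysis, not a prescribed part of the DQF workflow.

```python
from math import comb

# Sketch: exact two-sided sign test for A/B preference judgments.
# prefer_a / prefer_b are the counts of judgments preferring each system;
# ties are assumed to have been discarded beforehand.

def sign_test_p(prefer_a: int, prefer_b: int) -> float:
    """Two-sided exact binomial p-value under H0: no preference (p = 0.5)."""
    n = prefer_a + prefer_b
    k = max(prefer_a, prefer_b)
    # Probability of a split at least this extreme, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, 8 judgments for system A against 2 for system B gives p ≈ 0.109, so such a split would not yet be significant at the 5% level; a larger evaluation (e.g., the 400-sentence tests in the schedule) gives correspondingly more power.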

3.4 Speed/quality trade-offs

As mentioned at the beginning of this section, the choice of search parameter settings can affect both translation speed and translation quality. Speed/quality trade-offs will be measured by grid evaluations, in which key parameters that affect the decoder's exploration of the search space (and thus its speed) are systematically altered. End-to-end translation time and the quality of the raw MT output will be recorded. For the reference system (Moses 3.0), these data need to be collected only once as a reference point, unless there are changes in the underlying setup (e.g., fundamental changes in the common set-up that would make the comparison between the reference system and the MMT system meaningless). The outcome of these grid evaluations will be tuples of translation time, quality metrics, and underlying parameter settings, one for each set of parameter settings in the grid.

In order to eliminate interference from other computations occurring on the same computing infrastructure, speed tests require hardware specifically reserved for this task. Therefore, these tests will be run one at a time on a dedicated machine during each testing cycle. A new cycle of tests will be run with each major upgrade of the MMT software.

3.5 Instant domain adaptation vs. static systems

One of the main aims of MMT is on-the-fly domain adaptation, so this test will compare the static domain adaptation available in Moses with the dynamic adaptation in MMT. The test will apply to two domains (IT and legal) and to the it-en and fr-en language pairs. For the out-of-domain data we will use the Europarl corpora, and the in-domain data will be selected from TAUS data using the classification provided in the repository. To ensure that it is a valid domain adaptation problem, the in-domain training set will be small (100,000 sentences) compared to the out-of-domain set (about 2M sentences).
We will select the tuning and test sets from the same source as the in-domain training data. To create the Moses domain adaptation baseline, we will run experiments with all the static domain adaptation methods available in Moses. These consist of translation and language model interpolation, modified Moore-Lewis (MML) data selection, and provenance features. We will then choose the best performing combination as the static domain adaptation technique for the baseline Moses benchmark.

3.6 Big Data MT

In this test, we will assess the MMT product's ability to scale to very large data sets, and compare it with a state-of-the-art online volume MT system (e.g., Google Translate). We will use data from the WMT fr-en task, as there are close to 40M parallel sentences available for training. For language modelling, we will include the LDC GigaWord corpora along with the WMT monolingual corpora and the target side of the parallel data. The newstest2014 data set will be used for testing, and newstest2013 for tuning (if necessary). For the Moses baseline, we will follow Edinburgh's submission for WMT14 (Durrani et al., 2014). According to the human evaluation (Bojar et al., 2014), this system ranked just ahead of Google in en-fr and just behind Google in fr-en. We will also perform a comparison with Google (scraped on a fixed date), although we note that we cannot be sure that Google have not included the test data in their training set.

4 Performance and Scalability Testing

In order to determine how best to position the MMT product in the market, including a competitive yet profitable pricing scheme, we will perform tests to assess how well the whole system is able to scale to an increasing number of requests as we scale the underlying computing infrastructure.
These tests are of two types:

- Scalability tests: These tests focus on the horizontal scalability of the MMT architecture, that is, the translation speed that can be achieved by expanding and distributing computations over an increasing number of nodes, and by taking advantage of replication and partitioning strategies. The Key Performance Indicator (KPI) for these tests will be the average translation speed for translating 1MB of text (words/second) against the corresponding hardware and infrastructure cost. This measure will indicate how efficiently and effectively the architecture can be distributed over a cluster of machines.
- Performance tests: These tests focus on the vertical scalability of the MMT architecture, namely the translation speed achieved by increasing the power of a single computing node, for instance by adding more CPUs or memory. The KPI will be the average translation speed of a single sentence (words/second) against the corresponding hardware and infrastructure cost. This indicator will include the minimal latency that is expected for translating a single sentence at different levels of cost.

In order to make different versions of the tests comparable, they will always use the same test sets, which will be created by sampling actual requests for automatic translation from MateCat or MyMemory, to approximate a real usage scenario as faithfully as possible. The test set will thus be composed both of individual sentences to be translated, sampled from MyMemory's queries, and of entire documents to be translated from MateCat, which represent more complex processes and will allow us to analyze the throughput of the system in the case of multiple requests for translations that share the same context. Since these tests aim at analyzing the efficiency and cost of the system, and not its quality, it is not necessary that the test set also contain translations of the sampled segments. Instead, it is important that part of the test set be sampled from language pairs with a large quantity of training data, in order to analyze the system's ability to work with big data.

4.1 Performance and scalability test environment

Translated will create the environment necessary for the execution of the tests. In particular, Translated will set up tools for the automatic execution of those tests. By interfacing with cloud computing services, the infrastructure will allow the lead developers (i.e., the project partners responsible for certain components) to easily launch test instances after changes to the software, making development more independent and agile. Translated will periodically perform both scalability and performance tests of the whole system, for the purposes of analysis and for identifying any regressions.
Scalability and performance will be tested with respect to:

- the cost per megabyte of translated text;
- translation throughput in megabytes per second;
- the number of requests served per second.

Specifically, the tests will be carried out as follows:

Scalability: Using cloud computing services, we will initialize the minimum number of nodes required to start the system, installing and configuring the component to be tested. To better stress the system, preventing the test server itself from becoming the bottleneck, a number of virtual test instances will be launched in parallel. These will be sufficient to query the system to the limit of its capacity, given the computational resources available at the current iteration. After a period of time long enough to draw statistically relevant conclusions, the number of nodes on which the component to be tested is installed will be increased and the procedure repeated, until we have enough data to plot the variation of the KPI against the number of nodes in the cluster.

Table 3: Output of a complete scalability test on a certain component.

N nodes    Cost per MB    MB per second    N requests per second
1          ?              ?                ?
2          ?              ?                ?
...        ?              ?                ?
N          ?              ?                ?

Performance: Using cloud computing services, we will initialize the minimum number of nodes required to start the system, installing and configuring the component to be tested. To better stress the system, preventing the test server itself from becoming the bottleneck, we will launch in parallel a number of virtual test instances sufficient to query the system up to the limit of its capacity, given the computational resources available at the current iteration. After a period of time long enough to draw statistically relevant conclusions, the computational resources of the instance on which the component to be tested is installed will be varied, increasing the performance of the limiting resource, such as the CPU, the memory, or the speed of reading from and writing to secondary storage.
This procedure will be repeated in order to provide relevant data on the ability to scale up on a single node, or to identify a bottleneck.

Table 4: Output of a complete performance test on a certain component.

  N CPUs     | Cost per MB | MB per second | N requests per second
  1          | ?           | ?             | ?
  2          | ?           | ?             | ?
  ...        | ...         | ...           | ...
  N          | ?           | ?             | ?

  RAM        | Cost per MB | MB per second | N requests per second
  1 [GB]     | ?           | ?             | ?
  2 [GB]     | ?           | ?             | ?
  ...        | ...         | ...           | ...
  N [GB]     | ?           | ?             | ?

  Disk       | Cost per MB | MB per second | N requests per second
  80 [Mbps]  | ?           | ?             | ?
  160 [Mbps] | ?           | ?             | ?
  ...        | ...         | ...           | ...
  N [Mbps]   | ?           | ?             | ?

4.2 Result analysis

As mentioned at the beginning of this section, the primary purpose of the performance and scalability tests is to help us determine the best positioning of the product in the market. The system will be set up with the computational resources necessary to achieve a translation turnaround time compatible with current market expectations. Once the overall system architecture has been established, and we have reached a satisfactory level of average translation quality through cycles of translation quality testing and subsequent improvements to the system, the translation cost per MB will be the primary criterion in determining which segment of the market our marketing efforts should target.
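As an illustration of how the three KPIs in Tables 3 and 4 could be collected, the following sketch drives the system at capacity for a fixed measurement window, derives cost per MB, MB per second and requests per second, and repeats the measurement while growing the cluster. The client call (`translate_batch`) and the per-node hourly price are hypothetical stand-ins, not part of the MMT codebase.

```python
import time

def measure_kpis(translate_batch, workload, n_nodes,
                 node_hourly_cost=0.50, window_seconds=60.0):
    """Query the system for one measurement window and derive the three
    KPIs of Tables 3 and 4. `translate_batch` stands in for the real
    client call; `node_hourly_cost` is an assumed cloud price per node."""
    start = time.monotonic()
    requests = 0
    megabytes = 0.0
    while time.monotonic() - start < window_seconds:
        for text in workload:
            translate_batch(text)
            requests += 1
            megabytes += len(text.encode("utf-8")) / 1e6
    elapsed = time.monotonic() - start
    # Total cluster cost for the window, spread over the MB translated.
    cluster_cost_per_second = n_nodes * node_hourly_cost / 3600.0
    return {
        "mb_per_second": megabytes / elapsed,
        "requests_per_second": requests / elapsed,
        "cost_per_mb": cluster_cost_per_second * elapsed / max(megabytes, 1e-9),
    }

def scalability_sweep(translate_batch, workload, max_nodes, **kpi_kwargs):
    """Repeat the measurement while growing the cluster, yielding one
    row per node count as in Table 3."""
    return [dict(n_nodes=n,
                 **measure_kpis(translate_batch, workload, n, **kpi_kwargs))
            for n in range(1, max_nodes + 1)]
```

Plotting the rows returned by `scalability_sweep` against the number of nodes gives exactly the KPI-versus-cluster-size curves the scalability procedure calls for; the same per-window measurement can be reused for the single-node performance test by varying CPU, RAM or disk instead of the node count.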

5 Schedule

Sept 1, 2015: First version of the automatic testing framework in place.
Sep-Oct 2015: Testing of the automatic testing framework.
Jul-Oct 2015: Evaluation of static word alignment technology.
Oct-Dec 2015: Speed tests and automatic translation quality evaluation for baseline benchmark systems (based on Moses 3.0).
Dec 31, 2015: All baseline benchmarks available for all translation directions.
Dec 2015: Human A/B evaluation, en↔it or en↔fr, depending on the amount of data available at that time; comparison with Google Translate. Test data: 400 sentences from actual translation jobs in MateCat, 4 translators.
Jan-Mar 2016: First evaluation of data quality and data classification (good/bad).
Mar 2016: First test of dynamic vs. static domain adaptation.
Mar 2016: Productivity evaluation in real production, as an alternative to GT. Compare speed and cost per MB.
Apr 2016: First evaluation of incremental word alignment.
Dec 2016: A/B testing for 10 selected language pairs; which language pairs will be used will be decided closer to the tests.
Mar-May 2017: Performance and scalability tests.
Jul-Sep 2017: A/B testing and productivity tests, compared to GT: one language with volumes of data comparable to Google's, multiple languages with MyMemory, TAUS Data and MMT-collected data.
Aug-Oct 2017: Second evaluation of data quality and data classification (good/bad).
Sep 2017: Second test of dynamic vs. static domain adaptation.
Sep-Oct 2017: A/B testing (10 language pairs) and productivity tests (at least one language pair).
Oct 2017: Second evaluation of incremental word alignment.



Social Business Intelligence in Action

Social Business Intelligence in Action Social Business Intelligence in ction Matteo Francia, nrico Gallinucci, Matteo Golfarelli, Stefano Rizzi DISI University of Bologna, Italy Introduction Several Social-Media Monitoring tools are available

More information

Performance Comparison of NOSQL Database Cassandra and SQL Server for Large Databases

Performance Comparison of NOSQL Database Cassandra and SQL Server for Large Databases Performance Comparison of NOSQL Database Cassandra and SQL Server for Large Databases Khalid Mahmood Shaheed Zulfiqar Ali Bhutto Institute of Science and Technology, Karachi Pakistan khalidmdar@yahoo.com

More information

Accelerate Your Enterprise Private Cloud Initiative

Accelerate Your Enterprise Private Cloud Initiative Cisco Cloud Comprehensive, enterprise cloud enablement services help you realize a secure, agile, and highly automated infrastructure-as-a-service (IaaS) environment for cost-effective, rapid IT service

More information

WebMining: An unsupervised parallel corpora web retrieval system

WebMining: An unsupervised parallel corpora web retrieval system WebMining: An unsupervised parallel corpora web retrieval system Jesús Tomás Instituto Tecnológico de Informática Universidad Politécnica de Valencia jtomas@upv.es Jaime Lloret Dpto. de Comunicaciones

More information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

More information

Composable Infrastructure for Public Cloud Service Providers

Composable Infrastructure for Public Cloud Service Providers Composable Infrastructure for Public Cloud Service Providers Composable Infrastructure Delivers a Cost Effective, High Performance Platform for Big Data in the Cloud How can a public cloud provider offer

More information

Document Structure Analysis in Associative Patent Retrieval

Document Structure Analysis in Associative Patent Retrieval Document Structure Analysis in Associative Patent Retrieval Atsushi Fujii and Tetsuya Ishikawa Graduate School of Library, Information and Media Studies University of Tsukuba 1-2 Kasuga, Tsukuba, 305-8550,

More information

D8.1 Project website

D8.1 Project website D8.1 Project website WP8 Lead Partner: FENIX Dissemination Level: PU Deliverable due date: M3 Actual submission date: M3 Deliverable Version: V1 Project Acronym Project Title EnDurCrete New Environmental

More information

Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment

Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment Shigeo Sugimoto Research Center for Knowledge Communities Graduate School of Library, Information

More information

Monitor Qlik Sense sites. Qlik Sense Copyright QlikTech International AB. All rights reserved.

Monitor Qlik Sense sites. Qlik Sense Copyright QlikTech International AB. All rights reserved. Monitor Qlik Sense sites Qlik Sense 2.1.2 Copyright 1993-2015 QlikTech International AB. All rights reserved. Copyright 1993-2015 QlikTech International AB. All rights reserved. Qlik, QlikTech, Qlik Sense,

More information

FIVE BEST PRACTICES FOR ENSURING A SUCCESSFUL SQL SERVER MIGRATION

FIVE BEST PRACTICES FOR ENSURING A SUCCESSFUL SQL SERVER MIGRATION FIVE BEST PRACTICES FOR ENSURING A SUCCESSFUL SQL SERVER MIGRATION The process of planning and executing SQL Server migrations can be complex and risk-prone. This is a case where the right approach and

More information

First Steps Towards Coverage-Based Document Alignment

First Steps Towards Coverage-Based Document Alignment First Steps Towards Coverage-Based Document Alignment Luís Gomes 1, Gabriel Pereira Lopes 1, 1 NOVA LINCS, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Portugal ISTRION BOX, Translation

More information

Informatica Enterprise Information Catalog

Informatica Enterprise Information Catalog Data Sheet Informatica Enterprise Information Catalog Benefits Automatically catalog and classify all types of data across the enterprise using an AI-powered catalog Identify domains and entities with

More information

Mineração de Dados Aplicada

Mineração de Dados Aplicada Data Exploration August, 9 th 2017 DCC ICEx UFMG Summary of the last session Data mining Data mining is an empiricism; It can be seen as a generalization of querying; It lacks a unified theory; It implies

More information

Fast Innovation requires Fast IT

Fast Innovation requires Fast IT Fast Innovation requires Fast IT Cisco Data Virtualization Puneet Kumar Bhugra Business Solutions Manager 1 Challenge In Data, Big Data & Analytics Siloed, Multiple Sources Business Outcomes Business Opportunity:

More information

Start downloading our playground for SMT: wget

Start downloading our playground for SMT: wget Before we start... Start downloading Moses: wget http://ufal.mff.cuni.cz/~tamchyna/mosesgiza.64bit.tar.gz Start downloading our playground for SMT: wget http://ufal.mff.cuni.cz/eman/download/playground.tar

More information

The Prague Bulletin of Mathematical Linguistics NUMBER 91 JANUARY

The Prague Bulletin of Mathematical Linguistics NUMBER 91 JANUARY The Prague Bulletin of Mathematical Linguistics NUMBER 9 JANUARY 009 79 88 Z-MERT: A Fully Configurable Open Source Tool for Minimum Error Rate Training of Machine Translation Systems Omar F. Zaidan Abstract

More information