Stone Soup Translation

DJ Hovermale, Jeremy Morris, and Andrew Watts
December 3,

1 Introduction

2 Overview of Stone Soup Translation

2.1 Finite State Automata

The Stone Soup Translation model is a phrase-based translation model built on finite-state automata. Finite-state automata have several properties that make them good tools for working with natural language, despite their restriction to modelling regular languages. Standard finite-state transduction models, however, can only perform transductions in order on the words as they appear in the text, which is an impediment to translating between languages that order the concepts in their sentences differently.

Different finite-state models have attempted to solve this issue in different ways. The finite-state head transducer model proposed by Alshawi and Douglas (Alshawi and Douglas, 2000) gives one approach. It proposes a new type of finite-state transducer that can operate in different ways on its output tape and its input tape and is not restricted to starting at the beginning of an input string.

The Stone Soup Translation model proposed by Davis and Brew (Davis and Brew, 2002) takes a different approach. Instead of using a more powerful type of finite-state machine, the Stone Soup model uses standard finite-state automata. To solve the problem of word ordering, it uses what is called a linked automata model. The linked automata model uses a transformation function that takes sequences of words from a source model and transforms them into sequences of words in the target language. In this manner the Stone Soup model is a phrase-based model for translation.

2.2 Transformation Function

The transformation function is built from an aligned bitext corpus. Alignments between the phrases in the bitext are counted and stored in a lookup table.
In the Stone Soup model, however, an alignment between two word sequences is not just a transformation of one sequence of words into another. Instead, the Stone Soup model transforms phrases based on their locations in the source and target sentences. For example, in the sentence pair:

    replace the ink cartridge with a Xerox Ink Cartridge.
    remplacez-la par une cartouche Xerox ;

the phrase "with a Xerox Ink Cartridge" starting at word 5 in the source might be aligned with the phrase "par une cartouche Xerox" starting at word 2 in the target text. All word sequence/location pairs are counted in this manner through the entire bitext.

Once these counts are made, they can be transformed into probabilities. In our case, we simply assign each alignment a probability equal to the proportion of its number of occurrences in the bitext to the total number of alignments in the bitext.

2.3 Language Models

In addition to the Transformation Function, the algorithm makes use of two language models: a source language model and a target language model. The source model is used to restrict input to things that we have training to handle, while the target model is used to restrict our output to valid target language outputs. These models are built from the training data. Each phrase that we discover while building the transformation table is turned into an arc sequence and added to the appropriate finite-state machine. In this way, any phrase that was found in either of the texts should be in the appropriate final language model.

2.4 The Basic Algorithm

Once we have built our transformation function and our language models from the data, we can use them to find a translation for an input sentence. The following algorithm turns a string in the source language into a string in the target language.

1. Take the input source string and turn it into a finite-state automaton (FSA). Check whether the source language model accepts it.
2. Turn the input FSA into a sequence of words and locations in the input string.
3. Query the Transformation Function for all of these word/location combinations and receive a target result.
4. Assemble the target results into all possible FSAs that make use of each of the locations in the original word sequence.
5. Keep the best-scoring target result, based on the scores computed in building the Transformation table.
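
The counting scheme of Section 2.2 and the table lookup in steps 3 and 5 above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the implementation described in this report: phrases are stored as plain tuples in a dictionary rather than as finite-state automata, the function names `build_table` and `best_target` are hypothetical, and the toy counts are invented. Each alignment's probability is its count divided by the total number of alignments, as described in Section 2.2.

```python
from collections import Counter

def build_table(alignments):
    """Map (source phrase, source position) to scored target candidates.

    `alignments` is a list of (src_phrase, src_pos, tgt_phrase, tgt_pos)
    observations. Each probability is count / total number of alignments.
    """
    counts = Counter(alignments)
    total = sum(counts.values())
    table = {}
    for (src, src_pos, tgt, tgt_pos), n in counts.items():
        table.setdefault((src, src_pos), []).append((tgt, tgt_pos, n / total))
    return table

def best_target(table, src, src_pos):
    """Return the highest-probability (tgt_phrase, tgt_pos, prob), or None."""
    candidates = table.get((src, src_pos))
    if not candidates:
        return None  # this phrase/location pair was never seen in training
    return max(candidates, key=lambda c: c[2])

# Toy observations with invented counts; positions are 1-based word indices.
long_src = ("with", "a", "Xerox", "Ink", "Cartridge")
long_tgt = ("par", "une", "cartouche", "Xerox")
observed = [
    (long_src, 5, long_tgt, 2),
    (long_src, 5, long_tgt, 2),
    (("cartridge",), 4, ("cartouche",), 4),
    (("cartridge",), 4, ("cartouche",), 4),
    (("cartridge",), 4, ("la",), 1),
]
table = build_table(observed)
print(best_target(table, ("cartridge",), 4))
```

Because the key includes the source position, the same phrase at a different position is treated as unseen, which mirrors the location sensitivity of the model.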

3 Our Implementation

For our implementation, we used three main tools: GIZA++ to perform the string alignment, the AT&T FSM Toolkit to provide the finite-state automata structure and processing, and Python for the intermediate glue code that holds our system together.

3.1 GIZA++

Why GIZA++?

We chose GIZA++ (Och and Ney, 2003) as our aligner for several reasons. The primary reason is simply that it is what Davis and Brew (Davis and Brew, 2002) used, and we were attempting to replicate their system. Regardless, we needed an aligner, preferably one that is freely available. GIZA++ fits the bill: it is licensed under the GPL and it is cross-platform, running on Linux, IRIX, Solaris and Mac OS X.

About GIZA++

GIZA was written by F. J. Och and H. Ney as part of the EGYPT statistical machine translation project at Johns Hopkins University. They wrote it in C++ using the STL libraries, and Franz Och has maintained and updated it as GIZA++, making several releases in the last few years. It implements IBM Models 1 through 5 (Brown et al., 1993). GIZA++ produces a large amount of output, including a translation table, a fertility table, distortion tables, an HMM table (optionally), alignment files (human readable and machine readable), a perplexity file, revised vocabulary files (with updated counts) and a final configuration file to allow replication of the run. It can be configured to create even more output depending on the user's needs.

How do we use GIZA++ to get alignments?

For our alignments, we used the best alignment of our training sentences as output by GIZA++. These aligned sentences look like this:

    remplacez-la par une cartouche Xerox ;
    NULL ({ }) replace ({ 2 }) the ({ }) ink ({ }) cartridge ({ 4 }) with ({ }) a ({ 3 }) Xerox ({ 5 }) Ink ({ 6 }) Cartridge ({ 1 }) . ({ })
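
To make the alignment format concrete, the following sketch parses one aligned sentence pair into word-level links. It assumes the two-line layout shown above (real GIZA++ alignment files typically carry an additional comment line per sentence pair, which is ignored here), and the function name `parse_alignment` is hypothetical rather than part of GIZA++ or of our TTBuilder object.

```python
import re

# Matches one "word ({ i j ... })" group of a GIZA++ alignment line.
TOKEN_RE = re.compile(r"(\S+) \(\{([\d ]*)\}\)")

def parse_alignment(target_line, source_line):
    """Return (src_word, src_pos, tgt_word, tgt_pos) links, 1-based positions."""
    target = target_line.split()
    links = []
    for src_pos, (word, indices) in enumerate(TOKEN_RE.findall(source_line)):
        if src_pos == 0:
            continue  # position 0 is the NULL token, not a real source word
        for idx in indices.split():
            tgt_pos = int(idx)
            links.append((word, src_pos, target[tgt_pos - 1], tgt_pos))
    return links

target_line = "remplacez-la par une cartouche Xerox ;"
source_line = ("NULL ({ }) replace ({ 2 }) the ({ }) ink ({ }) "
               "cartridge ({ 4 }) with ({ }) a ({ 3 }) Xerox ({ 5 }) "
               "Ink ({ 6 }) Cartridge ({ 1 }) . ({ })")

for link in parse_alignment(target_line, source_line):
    print(link)
```

Source words with an empty index set, such as "the" and "ink" here, simply produce no links.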

In the GIZA++ output shown above, the target sentence comes first and the source sentence second. In this example, the source word cartridge is aligned with the target word cartouche, so the alignment cartridge/4 -> cartouche/4 would be added to our Transformation table. But we also want sequences of words to be in our table, so something like [replace/2 the/3 ink/4 cartridge/5 with/6 a/7] -> [par/2 une/3 cartouche/4] also needs to be counted, as does every other combination of continuous word sequence transformations. Once we have counted every occurrence of each phrase transformation, we determine the probability of a phrase transformation by computing the proportion of its occurrences to the total number of transformations in the training set (as outlined in the previous section).

3.2 Python

Why Python?

Python was an easy choice for the glue code and the program logic not handled by the AT&T toolkit. It is a common language known to at least some degree by all members of the group. It is flexible enough to allow calling out to outside programs, the AT&T Finite State Toolkit in our case. We used it to build the Transformation table as well as to process our source and target strings.

Our Implementation

For our implementation, we used three objects to drive our training and translating processes:

TTBuilder: This object takes an aligned file output from GIZA++ (as described above) and builds a Transformation table out of it.

TrSequence: This helper object takes a string and creates a sequence object that contains, for each word in the string, a tuple of a source state, a destination state, the word string, and a unique label. This data structure proved very useful for building and restructuring finite-state machines.

SSTrans: This object implements the core of the translation algorithm. It takes our source and target language models and our input string and returns the best-scoring (i.e. highest-probability) string produced by the Stone Soup Translation algorithm, along with a score for that string.

3.3 AT&T FSM Toolkit

The AT&T FSM toolkit (Mohri, Pereira, and Riley) is a useful tool for building and using finite-state automata and transducers. We used it everywhere we needed a finite-state model, such as when checking whether a language model accepts a particular string or finding the best-scoring string. The toolkit also contains the Finite State Archive (FAR) tools, which were useful for transforming the initial training texts into finite-state machines for initial language model building.

4 Our Corpora

We developed the system with the HCfrench and HCenglish Xerox printer corpora provided for Homework #1. We also created a larger corpus from a grandfather clock service manual that was available in English and Spanish. There were some large alignment issues between the two versions.

5 Initial Results

Our implementation of the basic Stone Soup algorithm provides 100% accuracy on the data it was built from, but it is limited to the bitexts it was trained on: the basic algorithm rejects input sentences that it hasn't seen before.

6 Extensions

While our initial results were promising, we wanted to see the system do more. In particular, we wanted the system to move beyond translating only the exact texts that it had been trained on. The Stone Soup paper recommended a few generalization techniques to expand coverage:

- Merging of transitions
- Fragment processing
- Unknown Word Fall-Through
- Partial Source Parsing
- Partial Target Parsing

In the time we had allotted for the project, we were not able to implement all of these generalizations, so we restricted ourselves to a few of them. We were able to add Fragment Processing, Partial Source Parsing and Partial Target Parsing to our implementation of the basic algorithm.

6.1 Fragment Processing

Fragment Processing allows us to process substrings of the source language and substrings of the target language instead of just full sentences. In the source and target language models, this is implemented by changing the automaton for each string to also accept any substring of that string. We made this change by adding links to the language model that allow it to start on any word in the string. We also added final states to the model so that the system accepts the termination of any arc as a final state. These automata are then simplified using the AT&T FSM tools to remove unnecessary epsilon transitions.

6.2 Partial Source Processing

Partial source processing is related to fragment processing. If we cannot recognize the entire source string as one entity, but we can recognize each of the words in the source, we can use a greedy algorithm to find the longest substring of the source that can be recognized and translate that. We were able to partially implement this: it works when words are in the same positions in the string as they were in the source training text, but fails when the words are in positions where the system hasn't seen them before. Combining this with an unknown word fall-through heuristic might allow us to overcome this limitation.

6.3 Partial Target Processing

When we get a set of fragments from the partial source parse, we need to piece these together in a reasonable fashion. We piece together these fragments based on where the Transformation table says that they lie: if one fragment begins in position 6 and the second fragment ends in position 4, we put fragment 2 ahead of fragment 1. If there is no clear ordering from the fragments, then we attempt to piece them together based on how far each fragment is from the beginning of the sentence.
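
The two ordering rules of Section 6.3 can be sketched as a comparison function. This is an illustrative reading of the heuristic, not our SSTrans code: the `(start, end, words)` fragment representation and the names `compare_fragments` and `assemble` are assumptions made for the example, and the positions are invented.

```python
from functools import cmp_to_key

def compare_fragments(a, b):
    """Order two fragments given as (start, end, words) in target positions."""
    a_start, a_end, _ = a
    b_start, b_end, _ = b
    if a_end < b_start:
        return -1  # clear ordering: a ends before b begins
    if b_end < a_start:
        return 1   # clear ordering: b ends before a begins
    # No clear ordering: the fragment starting closer to position 0 goes first.
    return a_start - b_start

def assemble(fragments):
    """Concatenate fragment words in the order given by compare_fragments."""
    ordered = sorted(fragments, key=cmp_to_key(compare_fragments))
    return [word for _, _, words in ordered for word in words]

# Fragment 1 begins at position 6; fragment 2 ends at position 4.
fragment_1 = (6, 7, ["Xerox", ";"])
fragment_2 = (2, 4, ["par", "une", "cartouche"])
print(assemble([fragment_1, fragment_2]))
```

As long as every fragment has start <= end, both rules agree with a plain sort on the start position; the explicit comparison is kept only to mirror the wording of the heuristic.
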
As an example of this distance-based fallback, if fragment 1 starts at position 2 and fragment 2 starts at position 4, we place fragment 1 before fragment 2 (because its initial word lies closer to the start of the machine at position 0).

7 Conclusion

References

H. Alshawi and S. Douglas. Learning dependency transduction models from unannotated examples. In Philosophical Transactions of the Royal Society (Series A: Mathematical, Physical, and Engineering Sciences), 2000.

P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19, June 1993.

P. Davis. Stone Soup Translation: The Linked Automata Model. PhD thesis, The Ohio State University.

P. Davis and C. Brew. Stone soup translation. In Proceedings of the 9th Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2002), pages 31-41, 2002.

M. Mohri, F. Pereira, and M. Riley. AT&T finite-state machine library. research.att.com/sw/tools/fsm/.

F. J. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51, 2003.


More information

Lexicographic Semirings for Exact Automata Encoding of Sequence Models

Lexicographic Semirings for Exact Automata Encoding of Sequence Models Lexicographic Semirings for Exact Automata Encoding of Sequence Models Brian Roark, Richard Sproat, and Izhak Shafran {roark,rws,zak}@cslu.ogi.edu Abstract In this paper we introduce a novel use of the

More information

CS 314 Principles of Programming Languages. Lecture 3

CS 314 Principles of Programming Languages. Lecture 3 CS 314 Principles of Programming Languages Lecture 3 Zheng Zhang Department of Computer Science Rutgers University Wednesday 14 th September, 2016 Zheng Zhang 1 CS@Rutgers University Class Information

More information

Exact Matching: Hash-tables and Automata

Exact Matching: Hash-tables and Automata 18.417 Introduction to Computational Molecular Biology Lecture 10: October 12, 2004 Scribe: Lele Yu Lecturer: Ross Lippert Editor: Mark Halsey Exact Matching: Hash-tables and Automata While edit distances

More information

Natural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi) )

Natural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi) ) Natural Language Processing SoSe 2014 Question Answering Dr. Mariana Neves June 25th, 2014 (based on the slides of Dr. Saeedeh Momtazi) ) Outline 2 Introduction History QA Architecture Natural Language

More information

Formal Languages and Compilers Lecture VI: Lexical Analysis

Formal Languages and Compilers Lecture VI: Lexical Analysis Formal Languages and Compilers Lecture VI: Lexical Analysis Free University of Bozen-Bolzano Faculty of Computer Science POS Building, Room: 2.03 artale@inf.unibz.it http://www.inf.unibz.it/ artale/ Formal

More information

Theory of Programming Languages COMP360

Theory of Programming Languages COMP360 Theory of Programming Languages COMP360 Sometimes it is the people no one imagines anything of, who do the things that no one can imagine Alan Turing What can be computed? Before people even built computers,

More information

Reassessment of the Role of Phrase Extraction in PBSMT

Reassessment of the Role of Phrase Extraction in PBSMT Reassessment of the Role of Phrase Extraction in PBSMT Francisco Guzman Centro de Sistemas Inteligentes Tecnológico de Monterrey Monterrey, N.L., Mexico guzmanhe@gmail.com Qin Gao and Stephan Vogel Language

More information

Welcome to CS120 Fall 2012

Welcome to CS120 Fall 2012 Welcome to CS120 Fall 2012 John Magee (jmagee@clarku.edu) 1 Welcome to CS120 Computing is ubiquitous Daily life, news, ecommerce Sciences and engineering fields Social sciences, humanity, Arts, music,

More information

Machine Translation as Tree Labeling

Machine Translation as Tree Labeling Machine Translation as Tree Labeling Mark Hopkins Department of Linguistics University of Potsdam, Germany hopkins@ling.uni-potsdam.de Jonas Kuhn Department of Linguistics University of Potsdam, Germany

More information

Comparing Reordering Constraints for SMT Using Efficient BLEU Oracle Computation

Comparing Reordering Constraints for SMT Using Efficient BLEU Oracle Computation Comparing Reordering Constraints for SMT Using Efficient BLEU Oracle Computation Markus Dreyer, Keith Hall, and Sanjeev Khudanpur Center for Language and Speech Processing Johns Hopkins University 300

More information

TALP: Xgram-based Spoken Language Translation System Adrià de Gispert José B. Mariño

TALP: Xgram-based Spoken Language Translation System Adrià de Gispert José B. Mariño TALP: Xgram-based Spoken Language Translation System Adrià de Gispert José B. Mariño Outline Overview Outline Translation generation Training IWSLT'04 Chinese-English supplied task results Conclusion and

More information

Agenda for today. Homework questions, issues? Non-projective dependencies Spanning tree algorithm for non-projective parsing

Agenda for today. Homework questions, issues? Non-projective dependencies Spanning tree algorithm for non-projective parsing Agenda for today Homework questions, issues? Non-projective dependencies Spanning tree algorithm for non-projective parsing 1 Projective vs non-projective dependencies If we extract dependencies from trees,

More information

The Goal of this Document. Where to Start?

The Goal of this Document. Where to Start? A QUICK INTRODUCTION TO THE SEMILAR APPLICATION Mihai Lintean, Rajendra Banjade, and Vasile Rus vrus@memphis.edu linteam@gmail.com rbanjade@memphis.edu The Goal of this Document This document introduce

More information

CISC 4090 Theory of Computation

CISC 4090 Theory of Computation CISC 4090 Theory of Computation Turing machines Professor Daniel Leeds dleeds@fordham.edu JMH 332 Alan Turing (1912-1954) Father of Theoretical Computer Science Key figure in Artificial Intelligence Codebreaker

More information

Figure 2.1: Role of Lexical Analyzer

Figure 2.1: Role of Lexical Analyzer Chapter 2 Lexical Analysis Lexical analysis or scanning is the process which reads the stream of characters making up the source program from left-to-right and groups them into tokens. The lexical analyzer

More information

A MACHINE LEARNING FRAMEWORK FOR SPOKEN-DIALOG CLASSIFICATION. Patrick Haffner Park Avenue Florham Park, NJ 07932

A MACHINE LEARNING FRAMEWORK FOR SPOKEN-DIALOG CLASSIFICATION. Patrick Haffner Park Avenue Florham Park, NJ 07932 Springer Handbook on Speech Processing and Speech Communication A MACHINE LEARNING FRAMEWORK FOR SPOKEN-DIALOG CLASSIFICATION Corinna Cortes Google Research 76 Ninth Avenue New York, NY corinna@google.com

More information

CSE450 Translation of Programming Languages. Lecture 4: Syntax Analysis

CSE450 Translation of Programming Languages. Lecture 4: Syntax Analysis CSE450 Translation of Programming Languages Lecture 4: Syntax Analysis http://xkcd.com/859 Structure of a Today! Compiler Source Language Lexical Analyzer Syntax Analyzer Semantic Analyzer Int. Code Generator

More information

Natural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi)

Natural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi) Natural Language Processing SoSe 2015 Question Answering Dr. Mariana Neves July 6th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline 2 Introduction History QA Architecture Outline 3 Introduction

More information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

More information

Reassessment of the Role of Phrase Extraction in PBSMT

Reassessment of the Role of Phrase Extraction in PBSMT Reassessment of the Role of Phrase Extraction in Francisco Guzmán CCIR-ITESM guzmanhe@gmail.com Qin Gao LTI-CMU qing@cs.cmu.edu Stephan Vogel LTI-CMU stephan.vogel@cs.cmu.edu Presented by: Nguyen Bach

More information

Massively Parallel Suffix Array Queries and On-Demand Phrase Extraction for Statistical Machine Translation Using GPUs

Massively Parallel Suffix Array Queries and On-Demand Phrase Extraction for Statistical Machine Translation Using GPUs Massively Parallel Suffix Array Queries and On-Demand Phrase Extraction for Statistical Machine Translation Using GPUs Hua He Dept. of Computer Science University of Maryland College Park, Maryland huah@cs.umd.edu

More information

CSE P 501 Compilers. LR Parsing Hal Perkins Spring UW CSE P 501 Spring 2018 D-1

CSE P 501 Compilers. LR Parsing Hal Perkins Spring UW CSE P 501 Spring 2018 D-1 CSE P 501 Compilers LR Parsing Hal Perkins Spring 2018 UW CSE P 501 Spring 2018 D-1 Agenda LR Parsing Table-driven Parsers Parser States Shift-Reduce and Reduce-Reduce conflicts UW CSE P 501 Spring 2018

More information

Formal Languages and Compilers Lecture I: Introduction to Compilers

Formal Languages and Compilers Lecture I: Introduction to Compilers Formal Languages and Compilers Lecture I: Introduction to Compilers Free University of Bozen-Bolzano Faculty of Computer Science POS Building, Room: 2.03 artale@inf.unibz.it http://www.inf.unibz.it/ artale/

More information

A Flexible XML-based Regular Compiler for Creation and Conversion of Linguistic Resources

A Flexible XML-based Regular Compiler for Creation and Conversion of Linguistic Resources A Flexible XML-based Regular Compiler for Creation and Conversion of Linguistic Resources Jakub Piskorski,, Oliver Scherf, Feiyu Xu DFKI German Research Center for Artificial Intelligence Stuhlsatzenhausweg

More information

Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis

Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis Due by 11:59:59pm on Tuesday, March 16, 2010 This assignment is based on a similar assignment developed at the University of Washington. Running

More information

INF5820/INF9820 LANGUAGE TECHNOLOGICAL APPLICATIONS. Jan Tore Lønning, Lecture 8, 12 Oct

INF5820/INF9820 LANGUAGE TECHNOLOGICAL APPLICATIONS. Jan Tore Lønning, Lecture 8, 12 Oct 1 INF5820/INF9820 LANGUAGE TECHNOLOGICAL APPLICATIONS Jan Tore Lønning, Lecture 8, 12 Oct. 2016 jtl@ifi.uio.no Today 2 Preparing bitext Parameter tuning Reranking Some linguistic issues STMT so far 3 We

More information

Lexical Analysis - 2

Lexical Analysis - 2 Lexical Analysis - 2 More regular expressions Finite Automata NFAs and DFAs Scanners JLex - a scanner generator 1 Regular Expressions in JLex Symbol - Meaning. Matches a single character (not newline)

More information

Evaluation of Web Search Engines with Thai Queries

Evaluation of Web Search Engines with Thai Queries Evaluation of Web Search Engines with Thai Queries Virach Sornlertlamvanich, Shisanu Tongchim and Hitoshi Isahara Thai Computational Linguistics Laboratory 112 Paholyothin Road, Klong Luang, Pathumthani,

More information

CS Compiler Construction West Virginia fall semester 2014 August 18, 2014 syllabus 1.0

CS Compiler Construction West Virginia fall semester 2014 August 18, 2014 syllabus 1.0 SYL-410-2014C CS 410 - Compiler Construction West Virginia fall semester 2014 August 18, 2014 syllabus 1.0 Course location: 107 ERB, Evansdale Campus Course times: Tuesdays and Thursdays, 2:00-3:15 Course

More information

RLAT Rapid Language Adaptation Toolkit

RLAT Rapid Language Adaptation Toolkit RLAT Rapid Language Adaptation Toolkit Tim Schlippe May 15, 2012 RLAT Rapid Language Adaptation Toolkit - 2 RLAT Rapid Language Adaptation Toolkit RLAT Rapid Language Adaptation Toolkit - 3 Outline Introduction

More information

2014/09/01 Workshop on Finite-State Language Resources Sofia. Local Grammars 1. Éric Laporte

2014/09/01 Workshop on Finite-State Language Resources Sofia. Local Grammars 1. Éric Laporte 2014/09/01 Workshop on Finite-State Language Resources Sofia Local Grammars 1 Éric Laporte Concordance Outline Local grammar of dates Invoking a subgraph Lexical masks Dictionaries of a text 01/09/2014

More information

Lexical Scanning COMP360

Lexical Scanning COMP360 Lexical Scanning COMP360 Captain, we re being scanned. Spock Reading Read sections 2.1 3.2 in the textbook Regular Expression and FSA Assignment A new assignment has been posted on Blackboard It is due

More information

Theory of Computations Spring 2016 Practice Final Exam Solutions

Theory of Computations Spring 2016 Practice Final Exam Solutions 1 of 8 Theory of Computations Spring 2016 Practice Final Exam Solutions Name: Directions: Answer the questions as well as you can. Partial credit will be given, so show your work where appropriate. Try

More information

Computer Science 236 Fall Nov. 11, 2010

Computer Science 236 Fall Nov. 11, 2010 Computer Science 26 Fall Nov 11, 2010 St George Campus University of Toronto Assignment Due Date: 2nd December, 2010 1 (10 marks) Assume that you are given a file of arbitrary length that contains student

More information

Graph-Based Parsing. Miguel Ballesteros. Algorithms for NLP Course. 7-11

Graph-Based Parsing. Miguel Ballesteros. Algorithms for NLP Course. 7-11 Graph-Based Parsing Miguel Ballesteros Algorithms for NLP Course. 7-11 By using some Joakim Nivre's materials from Uppsala University and Jason Eisner's material from Johns Hopkins University. Outline

More information

Question Answering Systems

Question Answering Systems Question Answering Systems An Introduction Potsdam, Germany, 14 July 2011 Saeedeh Momtazi Information Systems Group Outline 2 1 Introduction Outline 2 1 Introduction 2 History Outline 2 1 Introduction

More information

WFSTDM Builder Network-based Spoken Dialogue System Builder for Easy Prototyping

WFSTDM Builder Network-based Spoken Dialogue System Builder for Easy Prototyping WFSTDM Builder Network-based Spoken Dialogue System Builder for Easy Prototyping Etsuo Mizukami and Chiori Hori Abstract This paper introduces a network-based spoken dialog system development tool kit:

More information

6.080 / Great Ideas in Theoretical Computer Science Spring 2008

6.080 / Great Ideas in Theoretical Computer Science Spring 2008 MIT OpenCourseWare http://ocw.mit.edu 6.8 / 6.89 Great Ideas in Theoretical Computer Science Spring 28 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

13 th Annual Johns Hopkins Math Tournament Saturday, February 18, 2012 Explorations Unlimited Round Automata Theory

13 th Annual Johns Hopkins Math Tournament Saturday, February 18, 2012 Explorations Unlimited Round Automata Theory 13 th Annual Johns Hopkins Math Tournament Saturday, February 18, 2012 Explorations Unlimited Round Automata Theory 1. Introduction Automata theory is an investigation of the intersection of mathematics,

More information

Structure of Programming Languages Lecture 3

Structure of Programming Languages Lecture 3 Structure of Programming Languages Lecture 3 CSCI 6636 4536 Spring 2017 CSCI 6636 4536 Lecture 3... 1/25 Spring 2017 1 / 25 Outline 1 Finite Languages Deterministic Finite State Machines Lexical Analysis

More information