Language in 10 minutes


Language in 10 minutes: http://mt-class.org/jhu/lin10.html. By Friday: group up (optional, max size 2), choose a language (not one y'all speak), and pick a date. First presentation: Yuan on Thursday. Groups and dates will be assigned to anyone who misses the deadline.

Brief review. The noisy-channel objective: e* = argmax_e p(f | e) p(e). Last week: language modeling, p(e). This week: translation modeling, p(f | e); in particular, alignment.

Brief review: language modeling, p(e). Dumb idea: maintain a huge table of complete sentences and their relative frequencies. Better idea: n-grams: apply the chain rule, then assume conditional independence given an (n - 1)-word history. Problems remain, but they work pretty well. Not discussed: smoothing.
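For reference (my phrasing of the standard construction, not on the slide): the chain rule gives p(e_1, ..., e_N) = prod_i p(e_i | e_1, ..., e_{i-1}), and the n-gram assumption truncates each history to the last n - 1 words, p(e_i | e_{i-n+1}, ..., e_{i-1}).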

Alignments. e* = argmax_e p(f | e) p(e). How do we model p(f | e)? What's the dumb idea?

Alignments. Each French word f is generated by exactly one English word e.
NULL And the program has been implemented
Le programme a été mis en application
Alignment vector a = (2, 3, 4, 5, 6, 6, 6)

Alignments. Each French word f is generated by exactly one English word e.
NULL And the program has been implemented
Le programme a été mis en application
Alignment vector a = (0, 0, 0, 0, 2, 2, 2)

Alignments. How many possible alignments are there? Each of the m French words has l + 1 choices (the l = |e| English words plus NULL), so there are (l + 1)^m possible alignments.
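As a quick sanity check (my arithmetic, not on the slide): for the running example, l = 6 and m = 7, so there are (6 + 1)^7 = 7^7 = 823,543 possible alignments for this single short sentence pair.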

A bit more formally:
p(f_1, f_2, ..., f_m | e_1, e_2, ..., e_l, m) = sum over a in A of p(f_1, ..., f_m, a_1, ..., a_m | e_1, ..., e_l, m)
Define a conditional model projecting the translations through the alignments. We also introduce a conditional independence assumption: every word is translated independently.

Brainstorm. 5 minutes, with a neighbor or two: Is this idea a good one? What are some of its limitations? What else should be modeled?

IBM Alignment Models. Proposed by IBM researchers (under CLSP's Fred Jelinek) in the late 80s / early 90s. Aside: the Rip Van Winkle event; transcript at cs.jhu.edu/~post/bitext

IBM Model 1. Input: English words e_1 ... e_l and the French length m. For each French word position i = 1 ... m: choose an English source index a_i uniformly, q(j | i, l, m) = 1 / (l + 1); then choose a translation with probability t(f_i | e_{a_i}).

IBM Model 1. t(f | e) is just a table:
f          e     t(f | e)
le         the   0.42
la         the   0.4
programme  the   0.001
a          has   0.78

IBM Model 1: important notes. Alignment is based on word positions, not word identities. Alignment probabilities are uniform! Words are translated independently.

IBM Model 1.
NULL And the program has been implemented
Le programme a été mis en application
On the board: p(f, a | e) = ?
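As a sketch of the board question (function and table names are mine, and the toy probabilities are invented for illustration): under Model 1 the joint probability factorizes over French positions, p(f, a | e) = prod_i q(a_i | i, l, m) t(f_i | e_{a_i}) with q = 1 / (l + 1).

# Model 1 joint probability of a French sentence and an alignment given English.
def model1_joint_prob(french, english, alignment, t):
    """french: list of French words; english: list with NULL at index 0;
    alignment[i-1] = a_i in 0..l; t: dict mapping (f, e) -> t(f | e)."""
    l = len(english) - 1              # number of real English words (excluding NULL)
    prob = 1.0
    for f_word, a_i in zip(french, alignment):
        q = 1.0 / (l + 1)             # uniform alignment probability in Model 1
        prob *= q * t.get((f_word, english[a_i]), 1e-10)
    return prob

# Toy example (t values invented purely for illustration)
english = ["NULL", "And", "the", "program", "has", "been", "implemented"]
french = ["Le", "programme", "a", "été", "mis", "en", "application"]
alignment = [2, 3, 4, 5, 6, 6, 6]
t = {("Le", "the"): 0.4, ("programme", "program"): 0.5, ("a", "has"): 0.6,
     ("été", "been"): 0.5, ("mis", "implemented"): 0.2,
     ("en", "implemented"): 0.1, ("application", "implemented"): 0.1}
print(model1_joint_prob(french, english, alignment, t))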

IBM Model 2. Input: English words e_1 ... e_l and the French length m. For each French word position i = 1 ... m: choose an English source index a_i with probability q(j | i, l, m); then choose a translation with probability t(f_i | e_{a_i}).

IBM Model 2. Only difference: q(j | i, l, m) is now a table instead of uniform, e.g.:
j   q(j | 1, 6, 7)
1   0.27
2   0.14
What do you think of this model? How many parameters are there?
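To get a feel for the sizes involved (my illustrative numbers, not from the slides): t(f | e) needs an entry for every co-occurring French/English word pair, easily millions for real vocabularies, while q(j | i, l, m) needs an entry for every (j, i, l, m) combination; capping sentence lengths at 50 already gives on the order of 51 x 50 x 50 x 50, roughly 6.4 million entries.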

Tasks. The models tell us how (we pretend) the data came to be. There are now two tasks we care about. Inference: given a sentence pair (e, f), what is the most probable alignment? Estimation: how do we get the parameters t(f | e) and q(j | i, l, m)?

Task 1: Inference.
NULL And the program has been implemented
Le programme a été mis en application
Input: a sentence pair (e, f) and the model (t(·) and q(·)). Knowledge: target words are generated independently.
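Because target words are generated independently, the most probable alignment decomposes position by position: a_i = argmax_j q(j | i, l, m) t(f_i | e_j). A minimal sketch, assuming the same data structures as the earlier snippet (names are mine, not from the course code):

# Most probable alignment under Model 1/2, one position at a time.
def best_alignment(french, english, t, q=None):
    """english includes NULL at index 0; t: dict (f, e) -> prob;
    q: dict (j, i, l, m) -> prob, or None for Model 1 (uniform)."""
    l = len(english) - 1
    m = len(french)
    alignment = []
    for i, f_word in enumerate(french, start=1):
        def score(j):
            q_prob = q.get((j, i, l, m), 1e-10) if q else 1.0 / (l + 1)
            return q_prob * t.get((f_word, english[j]), 1e-10)
        alignment.append(max(range(l + 1), key=score))   # pick the best English index
    return alignment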


Homework 1. The inference task is what you're doing in Homework 1. The metric is Alignment Error Rate (AER). Alignment links are labeled as either (S)ure or (P)ossible, with S a subset of P.
Precision = |A ∩ P| / |A|
Recall = |A ∩ S| / |S|

Homework 1.
Precision = |A ∩ P| / |A|
Recall = |A ∩ S| / |S|
AER(A; S, P) = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
NULL And the program has been implemented
Le programme a été mis en application
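A small sketch of these metrics, assuming alignments are represented as sets of (i, j) link pairs; this is my representation, not necessarily the homework's:

# Precision/recall over Possible (P) and Sure (S) links, and AER.
def aer(A, S, P):
    """A: predicted links; S: sure gold links; P: possible gold links (S is a subset of P)."""
    A, S, P = set(A), set(S), set(P)
    precision = len(A & P) / len(A)
    recall = len(A & S) / len(S)
    error = 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    return precision, recall, error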

Task 2: Parameter Estimation. Computing alignments is useful as an intermediate test of new alignment methods, but we don't actually care about the alignments themselves. What we really need is to compute the parameters of the model: t(f | e) and q(j | i, l, m).

Estimation. Easy! Just like n-grams: count and normalize (forget smoothing). Board exercise: what are the equations and values if this were our corpus?
NULL And the program has been implemented
Le programme a été mis en application


Estimation from (e, f, a). Easy! Just like n-grams: count and normalize (forget smoothing). Board exercise: what are the equations and values if this were our corpus?
t(f | e) = c(e, f) / c(e)
q(j | i, l, m) = c(j, i, l, m) / c(i, l, m)
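If the alignments were observed, these equations amount to a few dictionary counts. A rough sketch under that (hypothetical) supervised setting, with my own data layout:

# Supervised estimation when alignments a are observed: count and normalize.
from collections import defaultdict

def estimate_from_aligned(data):
    """data: list of (french, english, alignment) triples; english[0] is NULL;
    alignment[i-1] = j means French position i was generated by English position j."""
    c_fe, c_e = defaultdict(float), defaultdict(float)
    c_jilm, c_ilm = defaultdict(float), defaultdict(float)
    for french, english, alignment in data:
        l, m = len(english) - 1, len(french)
        for i, (f, j) in enumerate(zip(french, alignment), start=1):
            c_fe[(f, english[j])] += 1
            c_e[english[j]] += 1
            c_jilm[(j, i, l, m)] += 1
            c_ilm[(i, l, m)] += 1
    t = {(f, e): c / c_e[e] for (f, e), c in c_fe.items()}
    q = {(j, i, l, m): c / c_ilm[(i, l, m)] for (j, i, l, m), c in c_jilm.items()}
    return t, q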

Estimation from (e, f). Unfortunately, we don't have alignments! (Even more unfortunately, alignments are a fuzzy concept.) Chicken-and-egg problem: if we had the alignments, we could compute the parameters; if we had the parameters, we could compute the alignments (how?).

Estimation from (e, f). This suggests an iterative solution:

Algorithm 1 (hard EM):
  initialize parameters t and q to something
  repeat until convergence:
    for every sentence pair:
      for every target position j:
        for every source position i:
          if aligned(i, j):
            count(f_j, e_i) += 1
            count(e_i) += 1
            count(j, i, l, m) += 1
            count(i, l, m) += 1
    t(f | e) = count(f, e) / count(e)
    q(j | i, l, m) = count(j, i, l, m) / count(i, l, m)

Estimation. A few problems: we don't actually care about the alignments, and a bad initialization might set us off in the wrong direction. A softer approach: compute expectations over all alignments, weighting the accumulated counts by the alignment probability.

Estimation. Each alignment link has a weight:
P(a_i = j | e, f) = q(j | i, l, m) t(f_i | e_j) / sum over j' = 0..l of q(j' | i, l, m) t(f_i | e_{j'})
Counts now use this soft value instead of a hard count (1 or 0). Any issues here?
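A minimal sketch of this posterior (names are mine; under Model 1, q(j | i, l, m) is just 1 / (l + 1) and cancels in the normalization):

# Posterior probability that French position i aligns to English position j:
#   P(a_i = j | e, f) is proportional to q(j | i, l, m) * t(f_i | e_j)
def alignment_posterior(i, french, english, t, q=None):
    """Returns a list over j = 0..l of P(a_i = j | e, f); english[0] is NULL."""
    l = len(english) - 1
    m = len(french)
    f_word = french[i - 1]                         # i is 1-based, as on the slides
    scores = []
    for j in range(l + 1):
        q_prob = q.get((j, i, l, m), 1e-10) if q else 1.0 / (l + 1)
        scores.append(q_prob * t.get((f_word, english[j]), 1e-10))
    total = sum(scores)
    return [s / total for s in scores]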

Estimation from (e, f). Old solution:

Algorithm 1 (hard EM):
  initialize parameters t and q to something
  repeat until convergence:
    for every sentence pair:
      for every target position j:
        for every source position i:
          if aligned(i, j):
            count(f_j, e_i) += 1
            count(e_i) += 1
            count(j, i, l, m) += 1
            count(i, l, m) += 1
    t(f | e) = count(f, e) / count(e)
    q(j | i, l, m) = count(j, i, l, m) / count(i, l, m)

Estimation from (e, f). New solution:

Algorithm 2 (soft EM):
  initialize parameters t and q to something
  repeat until convergence:
    for every sentence pair:
      for every target position j:
        for every source position i:
          count(f_j, e_i) += P(a_i = j | e, f)
          count(e_i) += P(a_i = j | e, f)
          count(j, i, l, m) += P(a_i = j | e, f)
          count(i, l, m) += P(a_i = j | e, f)
    t(f | e) = count(f, e) / count(e)
    q(j | i, l, m) = count(j, i, l, m) / count(i, l, m)
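Putting the pieces together, here is a compact illustration of soft EM for Model 1 only (no q table, uniform alignment probabilities); it is a sketch of the idea above, not the homework's reference implementation:

from collections import defaultdict

def train_model1(bitext, iterations=10):
    """bitext: list of (french_sentence, english_sentence) pairs, each a list of words;
    each English sentence should start with the special token 'NULL'.
    Returns t as a dict mapping (f, e) -> t(f | e)."""
    # Initialize t(f | e) uniformly over all co-occurring word pairs.
    cooc = defaultdict(set)
    for french, english in bitext:
        for e in english:
            cooc[e].update(french)
    t = {(f, e): 1.0 / len(fs) for e, fs in cooc.items() for f in fs}

    for _ in range(iterations):
        count_fe = defaultdict(float)   # expected count(f, e)
        count_e = defaultdict(float)    # expected count(e)
        # E step: accumulate soft counts weighted by the alignment posterior.
        for french, english in bitext:
            for f in french:
                total = sum(t[(f, e)] for e in english)   # normalizer over all source positions
                for e in english:
                    p = t[(f, e)] / total                 # P(a_i = j | e, f) under Model 1
                    count_fe[(f, e)] += p
                    count_e[e] += p
        # M step: re-estimate t(f | e) = count(f, e) / count(e).
        t = {(f, e): c / count_e[e] for (f, e), c in count_fe.items()}
    return t

Model 2 would add expected counts for q(j | i, l, m) in exactly the same way; real implementations also worry about memory and unseen word pairs, which this sketch ignores.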

Estimation with EM. Why does this work? We are accumulating evidence (soft counts) for totally bogus alignments: all pairs of words that co-occur, e.g., t(streetcar | le). It works for the same reason you were able to solve the alignment exercise from the first day of class: words that co-occur frequently continually steal probability mass from pairs that co-occur less often.

Properties of EM. The EM algorithm guarantees that the data likelihood does not decrease across iterations:
L(t, q; E, F) = prod over n = 1..N of p(f^(n) | e^(n)) = prod over n = 1..N of sum over a in A of p(f^(n), a | e^(n))
EM can get stuck in local optima: subprime peaks in the global likelihood function.

Assorted notes. There are many known problems with these alignment models (garbage collection, initialization). Despite all this blabbing about modeling p(f | e), the IBM models are not actually used for translation! Who cares about alignment?

Thursday's Agenda. Read: Collins' notes on Models 1 and 2, Koehn Chapter 4, Knight's MT workbook. We'll cover new models: IBM Model 3, the HMM model. Time for questions on Homework 1 (due Feb. 17). Language in 10 minutes (Yuan).