Tuning. Philipp Koehn, presented by Gaurav Kumar. 28 September 2017


The Story so Far: Generative Models (slide 1)

The definition of translation probability follows a mathematical derivation:

    $\operatorname{argmax}_e p(e|f) = \operatorname{argmax}_e p(f|e)\, p(e)$

Occasionally, some independence assumptions are thrown in, for instance IBM Model 1: word translations are independent of each other:

    $p(e|f,a) = \frac{1}{Z} \prod_i p(e_i \mid f_{a(i)})$

The generative story leads to straightforward estimation:
- maximum likelihood estimation of the component probability distributions
- EM algorithm for discovering hidden variables (alignment)

Log-linear Models (slide 2)

IBM Models provided mathematical justification for multiplying components:

    $p_{LM} \cdot p_{TM} \cdot p_D$

These may be weighted:

    $p_{LM}^{\lambda_{LM}} \cdot p_{TM}^{\lambda_{TM}} \cdot p_D^{\lambda_D}$

Many components $p_i$ with weights $\lambda_i$:

    $\prod_i p_i^{\lambda_i}$

We typically operate in log space:

    $\sum_i \lambda_i \log(p_i) = \log \prod_i p_i^{\lambda_i}$
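To make the weighted combination concrete, here is a minimal sketch of log-linear scoring in Python; the component probabilities and weights are made up for illustration only:

    import math

    # Hypothetical component probabilities and weights for one candidate translation.
    probs = {"lm": 1e-12, "tm": 3e-8, "distortion": 0.05}
    weights = {"lm": 1.0, "tm": 0.8, "distortion": 0.3}

    # Score in log space: sum_i lambda_i * log(p_i), which equals log of prod_i p_i^lambda_i.
    log_score = sum(weights[k] * math.log(probs[k]) for k in probs)
    print(log_score)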

Knowledge Sources (slide 3)

Many different knowledge sources are useful:
- language model
- reordering (distortion) model
- phrase translation model
- word translation model
- word count
- phrase count
- character count
- drop word feature
- phrase pair frequency
- additional language models

A feature could be any function $h(e,f,a)$, for example:

    $h(e,f,a) = \begin{cases} 1 & \text{if some } e_i \in e \text{ is a verb} \\ 0 & \text{otherwise} \end{cases}$
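As an illustration only, a binary feature like the one above might look as follows in Python; the is_verb predicate is a hypothetical stand-in for whatever POS tagger or word list is actually available:

    # Hypothetical verb list standing in for a real POS tagger.
    VERBS = {"go", "goes", "does"}

    def is_verb(word):
        return word in VERBS

    def h_contains_verb(e, f, a):
        """Binary feature: 1 if the candidate translation e contains a verb, else 0."""
        return 1 if any(is_verb(w) for w in e) else 0

    print(h_contains_verb(["he", "does", "not", "go", "home"],
                          ["er", "geht", "nicht", "nach", "hause"], None))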

Set Feature Weights (slide 4)

Contribution of a component $p_i$ is determined by its weight $\lambda_i$.

Methods:
- manual setting of weights: try a few, take the best
- automate this process

Learning the weights:
- set aside a development corpus
- set the weights so that optimal translation performance is achieved on this development corpus
- requires an automatic scoring method (e.g., BLEU)

Discriminative vs. Generative Models (slide 5)

Generative models:
- the translation process is broken down into steps
- each step is modeled by a probability distribution
- each probability distribution is estimated from data by maximum likelihood

Discriminative models:
- the model consists of a number of features (e.g., the language model score)
- each feature has a weight, measuring its value for judging a translation as correct
- feature weights are optimized on development data, so that the system output matches correct translations as closely as possible

Overview (slide 6)

- Generate a set of possible translations of a sentence (candidate translations)
- Each candidate translation is represented using a set of features
- Each feature derives from one property of the translation
  - feature score: value of the property (e.g., language model probability)
  - feature weight: importance of the feature (e.g., the language model feature is more important than the word count feature)
- Task of discriminative training: find good feature weights
- The highest scoring candidate is the best translation according to the model

Discriminative Training Approaches (slide 7)

Reranking: a two-pass approach
- first pass: run the decoder to generate a set of candidate translations
- second pass: add features, rescore the translations

Tuning
- integrate all features into the decoder
- learn feature weights that lead the decoder to the best translation

Large scale discriminative training (next lecture)
- thousands or millions of features
- optimization over the entire training corpus
- requires different training methods

Finding Candidate Translations (slide 8)

Finding Candidate Translations (slide 9)

- The number of possible translations is exponential in sentence length
- But: we are mainly interested in the most likely ones
- Recall: decoding
  - does not list all possible translations
  - beam search for the best one
  - dynamic programming and pruning
- How can we find the set of best translations?

Search Graph (slide 10)

[Figure: search graph over partial translation hypotheses such as "he", "it", "are", "goes", "does not", "not", "go", "to", "home", "house", each annotated with its accumulated score (e.g., "he" p:-0.556, ..., "home" p:-5.012).]

Decoding explores the space of possible translations by expanding the most promising partial translations; this produces a search graph.

Search Graph (slide 11)

[Figure: the same search graph, now also showing the transitions that were merged by hypothesis recombination.]

Keep transitions from recombinations:
- without them: total number of paths = number of full translation hypotheses
- with them: combinatorial expansion

Example:
- without: 4 full translation hypotheses
- with: 10 different full paths

Typically there are many more paths due to recombination.

Word Lattice (slide 12)

[Figure: the search graph drawn as a word lattice, with states such as "<s> he", "he goes", "does not", "not to", "go home", and weighted transitions labeled with the emitted phrases (e.g., "he" -0.556, "does not" -1.108, "home" -1.439).]

The search graph can be viewed as a finite state machine:
- states: partial translations
- transitions: applications of phrase translations
- weights: scores added by the phrase translation

Finite State Machine (slide 13)

Formally, a finite state machine is a quintuple $(\Sigma, S, s_0, \delta, F)$, where
- $\Sigma$ is the alphabet of output symbols (in our case, the emitted phrases)
- $S$ is a finite set of states
- $s_0 \in S$ is the initial state (in our case, the initial hypothesis)
- $\delta$ is the state transition function $\delta: S \times \Sigma \rightarrow S$
- $F$ is the set of final states (in our case, representing hypotheses that have covered all input words)

Weighted finite state machine: scores for the emissions from each transition, $\pi: S \times \Sigma \times S \rightarrow \mathbb{R}$
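For concreteness, here is a minimal sketch of such a weighted lattice in Python; the states, phrases, and scores are a hypothetical fragment loosely modeled on the example above, not output of an actual decoder:

    # A weighted finite state machine as a dictionary: state -> list of (phrase, next_state, score).
    # States, phrases, and scores are illustrative only.
    lattice = {
        "<s>":         [("he", "he", -0.556), ("it", "it", -0.484)],
        "he":          [("does not", "he does not", -1.108)],
        "it":          [("does not", "it does not", -1.220)],
        "he does not": [("go home", "FINAL", -2.518), ("go to house", "FINAL", -2.670)],
        "it does not": [("go home", "FINAL", -3.543)],
    }
    final_states = {"FINAL"}

    def complete_paths(state="<s>", phrases=(), score=0.0):
        """Enumerate all complete translations with their accumulated scores."""
        if state in final_states:
            yield " ".join(phrases), score
            return
        for phrase, next_state, weight in lattice.get(state, []):
            yield from complete_paths(next_state, phrases + (phrase,), score + weight)

    for sentence, total in sorted(complete_paths(), key=lambda x: -x[1]):
        print(f"{total:7.3f}  {sentence}")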

N-Best List (slide 14)

rank   score    sentence
 1     -4.182   he does not go home
 2     -4.334   he does not go to house
 3     -4.672   he goes not to house
 4     -4.715   it goes not to house
 5     -5.012   he goes not home
 6     -5.055   it goes not home
 7     -5.247   it does not go home
 8     -5.399   it does not go to house
 9     -5.912   he does not to go house
10     -6.977   it does not to go house

The word graph may be too complex for some methods, so we extract the n best translations.

Computing N-Best Lists (slide 15)

[Figure: the word lattice annotated with back transitions along the best path and detour transitions with their additional costs (e.g., -0.152, -0.830, -1.065, -1.730).]

- Represent the graph with back transitions
- Include detours with their cost

Path 1 (slide 16)

[Figure: best path through the lattice: <s> he → he does not → not go → go home.]

- Follow back transitions
- Best path: he does not go home
- Keep note of detours from this path

Base path   Base cost   Detour cost   Detour state
final       -0          -0.152        to house
final       -0          -0.830        not home
final       -0          -1.065        does not
final       -0          -1.730        go house

Path 2 (slide 17)

[Figure: second-best path through the lattice: <s> he → he does not → not go → to house.]

- Take the cheapest detour
- Afterwards, follow back transitions
- Second best path: he does not go to house
- Add its detours to the priority queue

Base path   Base cost   Detour cost   Detour state
to house    -0.152      -0.338        goes not
final       -0          -0.830        not home
final       -0          -1.065        does not
to house    -0.152      -1.065        it
final       -0          -1.730        go house

Path 3 (slide 18)

[Figure: third-best path through the lattice: <s> he → he goes → goes not → to house.]

- Third best path: he goes not to house
- Add its detours to the priority queue

Base path             Base cost   Detour cost   Detour state
to house / goes not   -0.490      -0.043        it goes
final                 -0          -0.830        not home
final                 -0          -1.065        does not
to house              -0.152      -1.065        it
final                 -0          -1.730        go house
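The slides extract the n-best list lazily via detours; as a simpler (and less efficient) illustration of the same goal, the sketch below enumerates complete lattice paths in best-first order with a priority queue, reusing the hypothetical lattice and final_states from the earlier finite state machine sketch:

    import heapq

    def n_best(lattice, final_states, n, start="<s>"):
        """Return the n highest-scoring complete paths (brute-force best-first enumeration)."""
        # Assumes all transition scores are <= 0 (log-probabilities), so path costs are
        # nonnegative and best-first popping yields complete paths in score order.
        # Heap entries: (cost, counter, state, phrases); cost = -score so heapq pops best first.
        counter = 0                      # tie-breaker so tuples never compare states
        heap = [(0.0, counter, start, ())]
        results = []
        while heap and len(results) < n:
            cost, _, state, phrases = heapq.heappop(heap)
            if state in final_states:
                results.append((-cost, " ".join(phrases)))
                continue
            for phrase, next_state, weight in lattice.get(state, []):
                counter += 1
                heapq.heappush(heap, (cost - weight, counter, next_state, phrases + (phrase,)))
        return results

    # Example: top 3 translations from the toy lattice defined earlier.
    # for score, sentence in n_best(lattice, final_states, 3):
    #     print(f"{score:7.3f}  {sentence}")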

Scoring N-Best List (slide 19)

Two opinions about items in the n-best list:
- model score: what the machine translation system thinks is good
- error score: what is actually a good translation

The error score can be computed against a reference translation:
- recall: lecture on evaluation
- canonical metric: BLEU score

Some methods require sentence-level scores; commonly used: BLEU+1, with adjusted precision

    $\frac{\text{correct matches} + 1}{\text{total} + 1}$
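A minimal sketch of a smoothed sentence-level BLEU along these lines is shown below; it applies add-one smoothing to every n-gram precision and a standard brevity penalty, which is one common variant of "BLEU+1" rather than the exact formulation of any particular toolkit:

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu_plus_one(hypothesis, reference, max_n=4):
        """Smoothed sentence-level BLEU: add-one on clipped n-gram precisions, brevity penalty."""
        hyp, ref = hypothesis.split(), reference.split()
        log_prec = 0.0
        for n in range(1, max_n + 1):
            hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
            matches = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
            total = max(len(hyp) - n + 1, 0)
            log_prec += math.log((matches + 1) / (total + 1))
        brevity = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
        return brevity * math.exp(log_prec / max_n)

    print(round(bleu_plus_one("he does not home", "he does not go home"), 3))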

Scored N-Best List (slide 20)

Reference translation: he does not go home

N-best list:

Translation             Feature values                               BLEU+1
it is not under house   -32.22  -9.93  -19.00  -5.08   -8.22  -5     27.3%
he is not under house   -34.50  -7.40  -16.33  -5.01   -8.15  -5     30.2%
it is not a home        -28.49 -12.74  -19.29  -3.74   -8.42  -5     30.2%
it is not to go home    -32.53 -10.34  -20.87  -4.38  -13.11  -6     31.2%
it is not for house     -31.75 -17.25  -20.43  -4.90   -6.90  -5     27.3%
he is not to go home    -35.79 -10.95  -18.20  -4.85  -13.04  -6     31.2%
he does not home        -32.64 -11.84  -16.98  -3.67   -8.76  -4     36.2%
it is not packing       -32.26 -10.63  -17.65  -5.08   -9.89  -4     21.8%
he is not packing       -34.55  -8.10  -14.98  -5.01   -9.82  -4     24.2%
he is not for home      -36.70 -13.52  -17.09  -6.22   -7.82  -5     32.5%

What feature weights push up the correct translation?
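To see how the weights decide which candidate "wins", the sketch below scores a few of the rows above as a dot product of feature values and weights and picks the model-best candidate; the weight values are purely illustrative:

    # Each candidate: (translation, feature values, sentence-level BLEU+1).
    # Feature values are taken from the table above; the weights are made up.
    nbest = [
        ("it is not under house", [-32.22, -9.93, -19.00, -5.08, -8.22, -5], 0.273),
        ("it is not a home",      [-28.49, -12.74, -19.29, -3.74, -8.42, -5], 0.302),
        ("he does not home",      [-32.64, -11.84, -16.98, -3.67, -8.76, -4], 0.362),
    ]
    weights = [1.0, 0.5, 1.0, 0.5, 0.5, -0.2]

    def model_score(features, weights):
        return sum(w * h for w, h in zip(weights, features))

    best = max(nbest, key=lambda cand: model_score(cand[1], weights))
    print("model-best:", best[0], "with BLEU+1", best[2])

With these particular made-up weights the model-best candidate is not the one with the highest BLEU+1, which is exactly the mismatch that tuning tries to reduce.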

Rerank Approach (slide 21)

[Figure: reranking pipeline.
Training: training input sentences → base model → decode → n-best list of translations; combined with additional features and the reference translations → labeled training data → learn reranker.
Testing: test input sentence → base model → decode → n-best list of translations; combined with additional features → rerank with the learned reranker → translation.]

Parameter Tuning (slide 22)

Parameter Tuning (slide 23)

Recall the log-linear model:

    $p(x) = \exp \sum_{i=1}^{n} \lambda_i h_i(x)$

The overall translation score $p(x)$ is a combination of components $h_i(x)$, weighted by parameters $\lambda_i$.

Setting the parameters is a supervised learning problem. Two methods:
- Powell search
- simplex algorithm

Experimental Setup (slide 24)

- Training data for the translation model: 10s to 100s of millions of words
- Training data for the language model: billions of words
- Parameter tuning
  - sets only a few weights (say, 10-15)
  - a tuning set of 1000s of sentence pairs is sufficient
- Finally, a test set is needed

Minimum Error Rate Training (slide 25)

- Optimize the metric directly: e.g., BLEU
- Tuning set of 1000s of sentences; for each we have an n-best list of translations
- Different weight settings → different translations come out on top → different BLEU score
- Even with 10-15 features: a high dimensional space, exhaustive search is intractable

Bad N-Best Lists? (slide 26)

- The n-best list is produced with the initial weight setting
- Decoding with the optimized weight settings may produce completely different translations
- Iterate the optimization, accumulating n-best lists

Parameter Tuning (slide 27)

[Figure: tuning loop. Initial parameters → decoder decodes → n-best list of translations → optimize parameters → if changed, apply the new parameters and decode again; if converged, output the final parameters.]
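This outer loop might be sketched as follows; decode_nbest and optimize_weights are hypothetical callables standing in for the decoder and for an optimizer such as MERT or the simplex method:

    def tune(decode_nbest, optimize_weights, dev_sentences, weights, max_iterations=10):
        """Iterate decode / optimize until the weights stop changing (sketch only)."""
        pools = {s: [] for s in dev_sentences}              # accumulated n-best lists per sentence
        for _ in range(max_iterations):
            for s in dev_sentences:
                pools[s].extend(decode_nbest(s, weights))   # decode with the current weights
            new_weights = optimize_weights(pools, weights)  # e.g. MERT line search over the pools
            if new_weights == weights:                      # converged: no parameter changed
                break
            weights = new_weights                           # apply new parameters, decode again
        return weights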

Powell Search (slide 28)

Och's Minimum Error Rate Training (MERT) (slide 29)

Line search for the best feature weights:

    given: sentences with n-best lists of translations
    iterate n times
        randomize starting feature weights
        iterate until convergence
            for each feature
                find the best feature weight
                update if different from the current value
    return the best feature weights found in any iteration

Find the Best Feature Weight (slide 30)

Core task: find the optimal value for one parameter weight $\lambda$, while leaving all other weights constant.

The score of translation $i$ for a sentence $f$ is then a linear function of $\lambda$:

    $p(e_i|f) = \lambda a_i + b_i$

Recall that:
- we deal with 100s of translations $e_i$ per sentence $f$
- we deal with 100s or 1000s of sentences $f$
- we are trying to find the value of $\lambda$ so that the error score over all sentences is optimized

One Translation for One Sentence (slide 31)

[Figure: plot of $p(x)$ against $\lambda$ showing a single straight line.]

The probability of one translation $p(e_i|f)$ is a function of $\lambda$: $p(e_i|f) = \lambda a_i + b_i$.

N-Best Translations for One Sentence (slide 32)

[Figure: plot of $p(x)$ against $\lambda$ showing several straight lines, one per translation.]

Each translation is a different line.

Upper Envelope (slide 33)

[Figure: plot of $p(x)$ against $\lambda$ with the upper envelope of the lines highlighted.]

Which translation has the highest probability depends on $\lambda$.

Threshold Points (slide 34)

[Figure: plot of $p(x)$ against $\lambda$; the argmax over the lines changes at threshold points $t_1$ and $t_2$.]

There are only a few threshold points $t_j$ where the model-best line changes.

Finding the Optimal Value for λ (slide 35)

- The real-valued $\lambda$ can take an infinite number of values
- But only at the threshold points does the model-best translation change
- Algorithm:
  - find the threshold points
  - for each interval between threshold points
    - find the best translations
    - compute the error score
  - pick the interval with the best error score

BLEU Error Surface (slide 36)

Varying one parameter gives a rugged line with many local optima.

[Figure: two plots of BLEU against the parameter value; the left panel shows the full range (-1 to 1), the right panel zooms in on the peak (-0.01 to 0.01).]

Pseudo Code (slide 37)

Input: sentences with n-best lists of translations, initial parameter values

    repeat
        for all parameters do
            set of threshold points T = {}
            for all sentences do
                for all translations do
                    compute line l: parameter value → score
                end for
                find the line l with the steepest descent
                while there is a line l2 that intersects with l first do
                    add the parameter value at the intersection to the set of threshold points T
                    l = l2
                end while
            end for
            sort the set of threshold points T by parameter value
            compute the score for a value before the first threshold point
            for all threshold points t in T do
                compute the score for a value after threshold point t
                if highest, record the max score and the threshold point t
            end for
            if the max score is higher than the current one, update the parameter value
        end for
    until no changes to the parameter values were applied
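Below is a small, self-contained sketch of the core of this line search for a single parameter: each candidate's score as a function of λ is a line, the points where lines intersect are collected as threshold candidates, and the interval with the best error score wins. The data structures (lists of (slope, intercept, error) triples per sentence) are an illustrative simplification, not the actual Moses implementation; real MERT accumulates BLEU sufficient statistics per interval instead of summing per-sentence errors, but the threshold-point idea is the same:

    def best_lambda(sentences, lo=-10.0, hi=10.0):
        """
        One-dimensional MERT-style line search (brute-force variant).

        `sentences` is a list of n-best lists; each candidate is a triple
        (slope a, intercept b, error), where the model score of the candidate
        is a * lambda + b and `error` is its sentence-level error (e.g. 1 - BLEU+1).
        Returns a lambda value in [lo, hi] minimizing the total error of the
        model-best candidates.
        """
        # 1. Collect threshold candidates: the model-best candidate can only change
        #    where two lines of the same sentence intersect.
        thresholds = {lo, hi}
        for nbest in sentences:
            for i, (a1, b1, _) in enumerate(nbest):
                for a2, b2, _ in nbest[i + 1:]:
                    if a1 != a2:
                        x = (b2 - b1) / (a1 - a2)
                        if lo < x < hi:
                            thresholds.add(x)

        def total_error(lam):
            # Error of the model-best candidate, summed over sentences.
            return sum(min(nbest, key=lambda c: -(c[0] * lam + c[1]))[2] for nbest in sentences)

        # 2. Evaluate one point inside each interval between consecutive thresholds.
        points = sorted(thresholds)
        probes = [(x + y) / 2 for x, y in zip(points, points[1:])]
        return min(probes, key=total_error)

    # Toy example with one sentence and three candidates (slope, intercept, error).
    example = [[(1.0, 0.0, 0.4), (-1.0, 0.5, 0.1), (0.0, 0.2, 0.3)]]
    print(best_lambda(example))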

Simplex Algorithm (slide 38)

Simplex Algorithm (slide 39)

Similar to Powell search, but with fewer evaluations of the current error:
- recall: the error is computed over the entire tuning set
- the brute force method requires reranking 1000s of n-best lists

Similar to gradient descent methods:
- try to find the direction in which the optimum lies
- here: we cannot compute a derivative

Simplex Algorithm (slide 40)

Randomly generate three points in the high dimensional space:
- high dimensional space = each dimension is one of the $\lambda_i$ parameters
- a point in the space = each parameter set to a value

Simplex Algorithm (slide 41)

[Figure: three points labeled best, good, and worst.]

We can score each of these points:
- use the parameter settings to rerank all the n-best lists
- compute the overall tuning set score (BLEU)

Rank the 3 points into best, good, worst.

Simplex Algorithm (slide 42)

[Figure: the three points best, good, and worst connected into a triangle.]

The 3 points form a triangle.

First Idea: Move Away from the Bad Point (slide 43)

[Figure: the triangle with the reflection and extension points on the far side of the midpoint from the worst point.]

Compute 3 additional points:
- mid point: $M = \frac{1}{2}(\text{best} + \text{good})$
- reflection point: $R = M + (M - \text{worst})$
- extension point: $E = M + 2(M - \text{worst})$

Three cases:
1. if error(E) < error(R) < error(worst), replace worst with E
2. else if error(R) < error(worst), replace worst with R
3. else try something else

Second Idea: Well, Not Too Far Away (slide 44)

[Figure: the triangle with the reflected point (labeled "terrible") and the two contraction points.]

Compute 2 additional points:
- $C_1$, a point between worst and M: $C_1 = M - \frac{1}{2}(M - \text{worst})$
- $C_2$, a point between M and R: $C_2 = M + \frac{1}{2}(M - \text{worst})$

Three cases:
1. if error(C1) < error(worst) and error(C1) < error(C2), replace worst with C1
2. if error(C2) < error(worst) and error(C2) < error(C1), replace worst with C2
3. else continue

Third Idea: Move Closer to the Best Point (slide 45)

[Figure: the triangle being shrunk towards the best point.]

Compute 1 additional point:
- $S$, the point between worst and best: $S = \frac{1}{2}(\text{best} + \text{worst})$

Shrink the triangle.

Simplex in High Dimensions (slide 46)

- The process of updates is iterated until the points converge
- Typically very quick
- More dimensions: more points
  - $n + 1$ points for $n$ parameters
  - the midpoint $M$ is the center of all points except the worst
  - in the final (shrinking) case, all non-best points are moved to midpoints closer to the best point
- Once an optimum is found: generate new n-best lists and iterate
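The sketch below implements these update rules for a generic error function; it is a plain Nelder-Mead style search written to mirror the reflect / extend / contract / shrink cases above, not the exact procedure used in any particular MT toolkit:

    import random

    def simplex_search(error, dim, iterations=200):
        """Minimize error(point) over dim parameters with a Nelder-Mead style simplex."""
        # n + 1 random starting points for n parameters.
        points = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim + 1)]

        def combine(a, b, alpha):          # a + alpha * (a - b), componentwise
            return [ai + alpha * (ai - bi) for ai, bi in zip(a, b)]

        for _ in range(iterations):
            points.sort(key=error)                      # best first, worst last
            best, worst = points[0], points[-1]
            # Midpoint of all points except the worst.
            m = [sum(p[i] for p in points[:-1]) / dim for i in range(dim)]
            r = combine(m, worst, 1.0)                  # reflection: M + (M - worst)
            e = combine(m, worst, 2.0)                  # extension:  M + 2(M - worst)
            c1 = combine(m, worst, -0.5)                # contraction between worst and M
            c2 = combine(m, worst, 0.5)                 # contraction between M and R
            if error(e) < error(r) < error(worst):
                points[-1] = e                          # first idea: move far away
            elif error(r) < error(worst):
                points[-1] = r
            elif min(error(c1), error(c2)) < error(worst):
                points[-1] = c1 if error(c1) < error(c2) else c2   # second idea
            else:
                # Third idea: shrink all non-best points towards the best point.
                points = [best] + [[(pi + bi) / 2 for pi, bi in zip(p, best)] for p in points[1:]]
        return min(points, key=error)

    # Toy usage: minimize a simple quadratic instead of a real tuning-set error.
    print(simplex_search(lambda p: sum(x * x for x in p), dim=3))

In real tuning, error(point) would rerank all the accumulated n-best lists with the parameter values given by the point and return, for example, 1 minus the tuning-set BLEU.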

Summary (slide 47)

- Reframing the probabilistic model as a log-linear model with weights
- Discriminative training task: set the weights
- Generate n-best candidate translations from the search graph
- Reranking
- Powell search (Och's MERT)
- Simplex algorithm