Statistical Methods for NLP

Statistical Methods for NLP
Information Extraction, Hidden Markov Models
Sameer Maskey
* Most of the slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny

Project Proposals
Proposals due in 2 weeks (Feb 23)
1 page

Topics for Today
Information Extraction
Hidden Markov Models

Information Extraction
Extract relevant information from large amounts of unstructured text.
The extracted information can be put in structured form and used, for example, to populate databases.
We can define the kind of information we want to extract.

Examples of Information Extraction Tasks
Named Entity Identification
Relation Extraction
Coreference Resolution
Term Extraction
Lexical Disambiguation
Event Detection and Classification

Classifiers for Information Extraction
Sometimes the extracted information is a sequence, e.g. extracting parts of speech for a given sentence:
DET VB AA ADJ NN
This is a good book
What kind of classifier may work well for this kind of sequence classification?

Classification without Memory
[Figure: at each position, a prediction is made between CLASS1 and CLASS2 using only the features X at that position.]

Classification without Memory
P(NN | book) = 0.62
P(VB | book) = 0.14
P(ADV | book) = 0.004
[Figure: a row of class predictions C over observations X1, X2, X3, ..., book, ..., XN, each made independently.]
C(t) (the class) depends only on the current observation X(t).
C(t) can be POS tags, document class, word class, ...
X(t) can be text-based features.
The perceptron is an example of a classifier without memory.
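
To make this concrete, here is a minimal sketch of a memoryless classifier (not part of the original slides; the probability table is illustrative): each word is tagged with the class that maximizes P(class | word), independently of the neighbouring predictions.

```python
# Memoryless classification: each position is tagged independently,
# using only features of the current observation (here, the word itself).
# The probabilities below are illustrative, not from a real trained model.
P_CLASS_GIVEN_WORD = {
    "book": {"NN": 0.62, "VB": 0.14, "ADV": 0.004},
    "good": {"ADJ": 0.85, "NN": 0.10},
    "a":    {"DET": 0.95},
}

def tag_without_memory(words):
    tags = []
    for w in words:
        dist = P_CLASS_GIVEN_WORD.get(w, {"NN": 1.0})  # back off to NN for unknown words
        tags.append(max(dist, key=dist.get))           # argmax_c P(c | w)
    return tags

print(tag_without_memory(["a", "good", "book"]))       # ['DET', 'ADJ', 'NN']
```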

Probability Sequence Computation for Models without Memory
A coin has probability of heads = p, probability of tails = 1-p.
Flip the coin 10 times. Assume an i.i.d. random sequence. There are 2^10 possible sequences.
Sequence: 1 0 1 0 0 0 1 0 0 1
Probability: p(1-p)p(1-p)(1-p)(1-p)p(1-p)(1-p)p = p^4 (1-p)^6
Models without memory: observations are independent. The probability is the same for all sequences with 4 heads and 6 tails. The order of heads and tails does not matter in assigning a probability to the sequence, only the number of heads and the number of tails.
Probability of 0 heads: (1-p)^10; 1 head: p(1-p)^9; ...; 10 heads: p^10
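
As a small illustration (an added sketch, not part of the original slides), the probability of any particular i.i.d. coin sequence depends only on its counts of heads and tails:

```python
# Probability of a specific heads/tails sequence under an i.i.d. coin
# with P(heads) = p: only the counts of heads and tails matter.
def sequence_probability(seq, p):
    heads = sum(seq)
    tails = len(seq) - heads
    return p ** heads * (1 - p) ** tails

seq = [1, 0, 1, 0, 0, 0, 1, 0, 0, 1]          # 4 heads, 6 tails
print(sequence_probability(seq, 0.3))          # p^4 (1-p)^6 = 0.3^4 * 0.7^6
print(sequence_probability(seq[::-1], 0.3))    # same value: order does not matter
```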

Models without Memory: Learning Model Parameters
If p is known, then it is easy to compute the probability of the sequence. Now suppose p is unknown.
We toss the coin N times, obtaining H heads and T tails, where H + T = N. We want to estimate p.
A reasonable estimate is p = H/N. Is this actually the best choice for p? What is best?
Consider the probability of the observed sequence: Prob(seq) = p^H (1-p)^T
[Figure: Prob(seq) as a function of p, peaking at p_mle.]
The value of p for which Prob(seq) is maximized is the Maximum Likelihood Estimate (MLE) of p, denoted p_mle.

Models without Memory: Example, cont'd
Assertion: p_mle = H/N
Proof: Prob(seq) = p^H (1-p)^T
Maximizing Prob is equivalent to maximizing log(Prob):
L = log(Prob(seq)) = H log p + T log(1-p)
dL/dp = H/p - T/(1-p)
L is maximized when dL/dp = 0:
H/p_mle - T/(1-p_mle) = 0
H - H p_mle = T p_mle
H = T p_mle + H p_mle = p_mle (T + H) = p_mle N
p_mle = H/N
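
As a quick numerical check (an added sketch, with made-up counts H = 7, T = 3), the closed-form MLE H/N agrees with a grid search over p that maximizes the log-likelihood:

```python
import math

# Observed flips: H heads and T tails (made-up counts for illustration).
H, T = 7, 3
N = H + T

p_mle = H / N                                  # closed-form MLE

# Grid search over p, maximizing L = H log p + T log(1 - p).
def log_likelihood(p):
    return H * math.log(p) + T * math.log(1 - p)

p_grid = max((i / 1000 for i in range(1, 1000)), key=log_likelihood)

print(p_mle, p_grid)                            # both are (approximately) 0.7
```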

Models without Memory: Example, cont'd
We showed that in this case MLE = relative frequency = H/N.
We will use this idea many times: often, parameter estimation reduces to counting and normalizing.

Models with Memory: Markov Models
Flipping a coin was memory-less: the outcome of each flip did not depend on the outcomes of the other flips.
Adding memory to a memory-less model gives us a Markov model, useful for modeling sequences of events.
For POS tagging, adding memory to the classifier could be useful.

Classification with Memory
[Figure: at each position, a prediction is made between CLASS1 and CLASS2 using the features X at that position together with the previous prediction.]
The current prediction depends on the previous predictions and the current observation.

Classification with Memory
P(NN | book, prev = ADJ) = 0.80
P(VB | book, prev = ADJ) = 0.04
P(VB | book, prev = ADV) = 0.10
[Figure: a chain of classes C over observations X1, X2, ..., good, book, ..., XN, with the ADJ prediction for "good" feeding the prediction for "book".]
C(t) (the class) depends on the current observation X(t) and the previous classification decision C(t-1).
C(t) can be POS tags, document class, word class, ...
X(t) can be text-based features.
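
A minimal sketch of a classifier with memory (again with illustrative probabilities, and greedy left-to-right decoding rather than the exact search introduced later with the Viterbi algorithm): the probability table is now conditioned on the current word and the previous tag.

```python
# Classification with memory: P(class | word, previous class).
# Probabilities are illustrative; decoding here is greedy, left to right.
P_CLASS_GIVEN_WORD_PREV = {
    ("book", "ADJ"): {"NN": 0.80, "VB": 0.04},
    ("book", "ADV"): {"VB": 0.10, "NN": 0.05},
    ("good", "DET"): {"ADJ": 0.90, "NN": 0.05},
    ("a", "<s>"):    {"DET": 0.95},
}

def tag_with_memory(words):
    tags, prev = [], "<s>"                      # <s> marks the sentence start
    for w in words:
        dist = P_CLASS_GIVEN_WORD_PREV.get((w, prev), {"NN": 1.0})
        prev = max(dist, key=dist.get)          # greedy argmax_c P(c | w, prev)
        tags.append(prev)
    return tags

print(tag_with_memory(["a", "good", "book"]))   # ['DET', 'ADJ', 'NN']
```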

Adding Memory to Coin Example
Consider 2 coins.
Coin 1: p_H = 0.9, p_T = 0.1
Coin 2: p_H = 0.2, p_T = 0.8
Experiment: Flip Coin 1. Then for J = 2, 3, 4: if the previous flip == H, flip Coin 1; else flip Coin 2.
Consider the following 2 sequences:
H H T T: prob = 0.9 x 0.9 x 0.1 x 0.8 = .0648
H T H T: prob = 0.9 x 0.1 x 0.2 x 0.1 = .0018
Sequences with consecutive heads or tails are more likely. The sequence has memory: order matters.
Order matters for language too: adjective-noun is probably more common than adjective-adjective.
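
The same computation as a short sketch: follow the coin-selection rule and multiply the per-flip probabilities.

```python
# Two-coin Markov model: which coin is flipped next depends on the previous outcome.
P_HEADS = {1: 0.9, 2: 0.2}                    # coin 1: p_H = 0.9; coin 2: p_H = 0.2

def markov_sequence_probability(outcomes):
    prob, coin = 1.0, 1                        # the experiment starts with coin 1
    for o in outcomes:
        p_h = P_HEADS[coin]
        prob *= p_h if o == "H" else (1 - p_h)
        coin = 1 if o == "H" else 2            # previous H -> coin 1, previous T -> coin 2
    return prob

print(markov_sequence_probability("HHTT"))     # 0.9 * 0.9 * 0.1 * 0.8 = 0.0648
print(markov_sequence_probability("HTHT"))     # 0.9 * 0.1 * 0.2 * 0.1 = 0.0018
```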

Markov Models: State Space Representation
Consider 2 coins.
Coin 1: p_H = 0.9, p_T = 0.1
Coin 2: p_H = 0.2, p_T = 0.8
State-space representation of the previous example:
[Figure: two states. State 1 has a self-loop labeled H/0.9 and an arc labeled T/0.1 to state 2; state 2 has a self-loop labeled T/0.8 and an arc labeled H/0.2 back to state 1.]

Markov Models: State Space Representation (cont'd)
The state sequence can be uniquely determined from the outcome sequence, given the initial state.
The output probability is easy to compute: it is the product of the transition probabilities along the state sequence.
[Figure: the same two-state diagram as on the previous slide.]
Example:
O: H T T T
S: 1 (given) 1 2 2
Prob: 0.9 x 0.1 x 0.8 x 0.8

Back to Memory-Less Models: Hidden Information
Let's return to the memory-less coin flip model.
Consider 3 coins.
Coin 0: p_H = 0.7
Coin 1: p_H = 0.9
Coin 2: p_H = 0.2
Experiment: For J = 1..4: flip coin 0. If the outcome == H, flip coin 1 and record; else flip coin 2 and record.

Hiding Information (cont.)
Coin 0: p_H = 0.7; Coin 1: p_H = 0.9; Coin 2: p_H = 0.2
We cannot uniquely determine the outcomes of the Coin 0 flips: they are hidden.
Consider the sequence H T T T. What is the probability of the sequence?
Order doesn't matter (memory-less):
p(head) = p(head | coin0 = H) p(coin0 = H) + p(head | coin0 = T) p(coin0 = T) = 0.9 x 0.7 + 0.2 x 0.3 = 0.69
p(tail) = 0.1 x 0.7 + 0.8 x 0.3 = 0.31
P(HTTT) = 0.69 x 0.31^3
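
The same marginalization as a short sketch:

```python
# Memory-less model with a hidden coin-0 flip choosing which coin is recorded.
p_coin0_H = 0.7                                 # P(coin 0 = H): coin 1 is then flipped
p_heads = {"coin1": 0.9, "coin2": 0.2}

# Marginalize over the hidden coin-0 outcome.
p_head = p_heads["coin1"] * p_coin0_H + p_heads["coin2"] * (1 - p_coin0_H)   # 0.69
p_tail = 1 - p_head                                                          # 0.31

p_HTTT = p_head * p_tail ** 3                   # 0.69 * 0.31^3, about 0.0206
print(p_head, p_tail, p_HTTT)
```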

Hidden Markov Model
The state sequence is hidden. Unlike Markov models, the state sequence cannot be uniquely deduced from the output sequence.
Experiment: Flip the same two coins. This time, flip each coin twice: the first flip gets recorded as the output, and the second flip determines which coin gets flipped next.
Now consider the output sequence H T T T. There is no way to know the results of the even-numbered flips, so no way to know which coin is flipped each time.
Unlike the previous example, order now matters (start with coin 1, p_H = 0.9):
H H T T T H T = .9 x .9 x .1 x .1 x .8 x .2 x .1 = .0001296
T T T H T T H = .1 x .1 x .8 x .2 x .1 x .1 x .2 = .0000032
Even worse, the same output sequence corresponds to multiple full flip sequences with different probabilities:
H T T H T T T = .9 x .1 x .8 x .2 x .1 x .1 x .8 = .0001152

Hidden Markov Model
The state sequence is hidden. Unlike Markov models, the state sequence cannot be uniquely deduced from the output sequence.
[Figure: the two-coin HMM as a two-state diagram; each arc between states 1 and 2 carries an output distribution (0.9/0.1 for coin 1, 0.2/0.8 for coin 2) together with its transition probability.]

Is a Markov Model Hidden or Not?
A necessary and sufficient condition for being state-observable is that all transitions from each state produce different outputs.
[Figure: two small state diagrams over outputs a, b, c, d: one state-observable (all transitions leaving a state have distinct outputs) and one hidden (two transitions from the same state share an output).]

Three problems of general interest for an HMM
Three problems need to be solved before we can use HMMs:
1. Given an observed output sequence X = x_1 x_2 .. x_T, compute P_θ(X) for a given model θ (scoring)
2. Given X, find the most likely state sequence (Viterbi algorithm)
3. Estimate the parameters of the model (training)
These problems are easy to solve for a state-observable Markov model. They are more complicated for an HMM, because we need to consider all possible state sequences, so we must develop a generalization.

Problem 1
1. Given an observed output sequence X = x_1 x_2 .. x_T, compute P_θ(X) for a given model θ.
Recall the state-observable case:
[Figure: the two-state coin diagram from before (H/0.9 and T/0.1 from state 1, T/0.8 and H/0.2 from state 2).]
Example:
O: H T T T
S: 1 (given) 1 2 2
Prob: 0.9 x 0.1 x 0.8 x 0.8

Problem 1
1. Given an observed output sequence X = x_1 x_2 .. x_T, compute P_θ(X) for a given model θ.
Sum over all possible state sequences: P_θ(X) = Σ_S P_θ(X, S)
The obvious way of calculating P_θ(X) is to enumerate all state sequences that produce X.
Unfortunately, this calculation is exponential in the length of the sequence.
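
To see what this enumeration looks like, here is a brute-force sketch (an added example, not from the slides) on the two-coin HMM from the earlier "Hidden Markov Model" slide: start with coin 1, emit with the current coin, then that coin's second flip chooses the next coin. The cost grows as 2^(T-1) state sequences; the forward algorithm below computes the same quantity without enumerating paths.

```python
from itertools import product

# Two-coin HMM from the earlier slides: emit with the current coin, then that
# coin's second flip chooses the next coin. States: 0 = coin 1, 1 = coin 2.
A = [[0.9, 0.1],                 # transition probabilities P(next state | state)
     [0.2, 0.8]]
B = [{"H": 0.9, "T": 0.1},       # output probabilities P(output | state)
     {"H": 0.2, "T": 0.8}]
START = 0                        # the experiment starts with coin 1

def brute_force_likelihood(obs):
    """Sum P(obs, states) over all 2^(T-1) state sequences that start in START."""
    total = 0.0
    T = len(obs)
    for rest in product(range(2), repeat=T - 1):    # exponential in T
        states = (START,) + rest
        p = B[states[0]][obs[0]]
        for t in range(1, T):
            p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
        total += p                                  # e.g. .0001296 and .0001152 are
                                                    # two of the eight terms for "HTTT"
    return total

print(brute_force_likelihood("HTTT"))               # about 0.0362
```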

Example for Problem 1
[Figure: a three-state HMM over outputs {a, b}; the arcs carry transition probabilities (e.g. 0.5, 0.3, 0.4, 0.2, 0.1) and output distributions over a and b.]
Compute P_θ(X) for X = aabb, assuming we start in state 1.

Example for Problem 1, cont'd
Let's enumerate all possible ways of producing x_1 = a, assuming we start in state 1.
[Figure: an enumeration tree of the paths that produce x_1 = a from state 1, with the partial-path probabilities (e.g. 0.21, 0.08, 0.04) at the leaves.]

Example for Problem 1, cont'd
Now let's think about ways of generating x_1 x_2 = aa, for all paths from state 2 after the first observation.
[Figure: the enumeration tree continues; several distinct paths arrive at the same states, carrying the partial probabilities 0.21, 0.08, and 0.04.]

Example for Problem 1, cont'd
We can save computations by combining paths. This is a result of the Markov property: the future doesn't depend on the past if we know the current state.
[Figure: paths that meet in the same state are merged; the combined probability of being in state 2 after producing 'a' is 0.33.]

Problem 1: Trellis Diagram
Expand the state-transition diagram in time: create a 2-D lattice indexed by state (1, 2, 3) and time (0 through 4, i.e. observations φ, a, aa, aab, aabb). Each state-transition sequence is represented exactly once.
[Figure: the trellis, with each arc labeled by its transition probability times its output probability (e.g. .5 x .8, .3 x .7, .4 x .5, .5 x .3) and null arcs labeled .2 and .1.]

Problem 1: Trellis Diagram, cont'd
Now let's accumulate the scores. Note that the inputs to a node come from the left and from above, so if we work to the right and down, all necessary input scores will be available.
[Figure: the trellis with accumulated node scores; e.g. the state-2 node at time 1 accumulates .21 + .04 + .08 = .33, the state-2 node at time 2 accumulates .084 + .066 + .032 = .182, and the state-3 nodes at the last two times accumulate .033 + .03 = .063 and .0495 + .0182 = .0677.]

Problem 1: Trellis Diagram, cont'd
Boundary condition: score of (state 1, φ) = 1.
Basic recursion:
score of node i = 0
for each predecessor node j:
  score of node i += score of node j x the transition probability from j to i x the observation probability along that transition (if the transition is not null)

Problem 1: Forward Pass Algorithm
Let α_t(s), for t ∈ {1..T}, be the probability of being in state s at time t and having produced the output x_1..x_t.
α_t(s) = Σ_{s'} α_{t-1}(s') P_θ(s | s') P_θ(x_t | s' -> s) + Σ_{s'} α_t(s') P_θ(s | s')
1st term: sum over all output-producing arcs; 2nd term: over all null arcs.
This is called the Forward Pass algorithm.
Important: the computational complexity of the forward algorithm is linear in time (i.e. in the length of the observation sequence).
This calculation allows us to solve Problem 1: P_θ(X) = Σ_s α_T(s)
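
Here is a compact sketch of the forward pass for an HMM without null arcs, using the same two-coin HMM as in the brute-force example above (with null arcs, the second term of the recursion would be added). It returns the same value as the enumeration, but in time linear in the sequence length.

```python
# Forward pass: alpha[t][s] = P(x_1..x_t, state at time t = s).
# Same two-coin HMM as above; it has no null arcs, so only the first
# term of the recursion on the slide is needed.
A = [[0.9, 0.1], [0.2, 0.8]]                        # transitions
B = [{"H": 0.9, "T": 0.1}, {"H": 0.2, "T": 0.8}]    # outputs
pi = [1.0, 0.0]                                     # start in state 0 (coin 1)

def forward(obs):
    n = len(A)
    alpha = [[0.0] * n for _ in obs]
    for s in range(n):                              # t = 0: initial state times output prob
        alpha[0][s] = pi[s] * B[s][obs[0]]
    for t in range(1, len(obs)):                    # linear in the sequence length
        for s in range(n):
            alpha[t][s] = sum(alpha[t - 1][sp] * A[sp][s] for sp in range(n)) * B[s][obs[t]]
    return sum(alpha[-1])                           # P_theta(X) = sum_s alpha_T(s)

print(forward("HTTT"))                              # about 0.0362, as in the brute force
```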

Problem 2
Given the observations X, find the most likely state sequence.
This is solved using the Viterbi algorithm.
Preview: the computation is similar to the forward algorithm, except that we use max() instead of +.
Also, we need to remember which partial path led to the max.

Problem 2: Viterbi algorithm
Returning to our example, let's find the most likely path for producing aabb. At each node, remember the max of (predecessor score x transition probability). Also store the best predecessor for each node.
[Figure: the trellis with Viterbi scores; e.g. the state-1 row is 1, .4, .16, .016, .0016, and the state-2 node at time 1 keeps max(.08, .21, .04) = .21.]

Problem 2: Viterbi algorithm, cont'd
Starting at the end, find the node with the highest score. Trace back the path to the beginning, following the best arc leading into each node along the best path.
[Figure: the same trellis, with the best path highlighted by following the stored backpointers from the highest-scoring final node.]
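
A sketch of the Viterbi algorithm on the same two-coin HMM used above: the structure mirrors the forward pass, but with max in place of the sum and a backpointer table for the traceback.

```python
# Viterbi: same trellis as the forward pass, but take the max over predecessors,
# remember which predecessor achieved it, and trace back the best path.
A = [[0.9, 0.1], [0.2, 0.8]]
B = [{"H": 0.9, "T": 0.1}, {"H": 0.2, "T": 0.8}]
pi = [1.0, 0.0]

def viterbi(obs):
    n = len(A)
    score = [[0.0] * n for _ in obs]
    back = [[0] * n for _ in obs]
    for s in range(n):
        score[0][s] = pi[s] * B[s][obs[0]]
    for t in range(1, len(obs)):
        for s in range(n):
            best = max(range(n), key=lambda sp: score[t - 1][sp] * A[sp][s])
            back[t][s] = best
            score[t][s] = score[t - 1][best] * A[best][s] * B[s][obs[t]]
    last = max(range(n), key=lambda s: score[-1][s])    # highest-scoring final node
    path = [last]
    for t in range(len(obs) - 1, 0, -1):                # follow backpointers
        path.append(back[t][path[-1]])
    return list(reversed(path)), score[-1][last]

print(viterbi("HTTT"))   # ([0, 1, 1, 1], about 0.0295): coin 1, then coin 2 three times
```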

Reference
http://www.cs.jhu.edu/~jason/papers/#tnlp02
http://www.ee.columbia.edu/~stanchen/fall09/e6870/