Motivation: Shortcomings of the Hidden Markov Model. Solution: Maximum Entropy Markov Model (MEMM). Ko, Youngjoong.

Transcription:

Maximum Entropy Markov Models and Conditional Random Fields
Ko, Youngjoong
Dept. of Computer Engineering, Dong-A University
Intelligent System Laboratory, Dong-A University

Motivation: Shortcomings of the Hidden Markov Model
An HMM over states Y_1, ..., Y_n and observations X_1, X_2, ..., X_n models a direct dependence between each state and only its corresponding observation.
NLP example: in a sentence segmentation task, segmentation may depend not just on a single word, but also on features of the whole line such as line length, indentation, amount of white space, etc. (e.g. P(capitalization | tag), P(hyphen | tag), P(suffix | tag)).
There is also a mismatch between the learning objective function and the prediction objective function: the HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task we need the conditional probability P(Y | X).

Solution: Maximum Entropy Markov Model (MEMM)
The MEMM uses the Viterbi algorithm with MaxEnt local models (example: POS tagging).
HMM vs. MEMM: the MEMM can condition on any useful feature of the input observation.
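For reference, the two distributions being contrasted can be written in the standard way (the notation Y_i for states and X_i for observations follows the slides; the equations themselves are not preserved in this transcription):

```latex
% HMM: a generative model, trained to fit the joint distribution
P(Y_{1:n}, X_{1:n}) = \prod_{i=1}^{n} P(Y_i \mid Y_{i-1})\, P(X_i \mid Y_i)

% ...but prediction only needs the conditional distribution
P(Y_{1:n} \mid X_{1:n}) = \frac{P(Y_{1:n}, X_{1:n})}{P(X_{1:n})}
```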

Summary of HMMs vs. MEMMs
(figures: the HMM and MEMM graphical models shown side by side)

Decoding in MEMMs
(shown graphically on the slides; not preserved in this transcription)
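As a concrete illustration of MEMM decoding, here is a minimal Viterbi sketch. It assumes the trained MaxEnt local model is available as a function local_log_prob(prev_state, observations, i, state) returning log P(y_i = state | y_{i-1} = prev_state, x_{1:n}); that function name and signature are hypothetical, introduced only for this sketch.

```python
def memm_viterbi(observations, states, local_log_prob, start_state="<s>"):
    """Most likely state sequence under an MEMM, found by Viterbi decoding.

    local_log_prob(prev_state, observations, i, state) should return
    log P(y_i = state | y_{i-1} = prev_state, x_{1:n}) from a trained
    MaxEnt (logistic regression) local model.
    """
    n = len(observations)
    # delta[i][s]: best log-score of any state prefix ending in state s at position i
    delta = [{} for _ in range(n)]
    backptr = [{} for _ in range(n)]

    for s in states:
        delta[0][s] = local_log_prob(start_state, observations, 0, s)
        backptr[0][s] = None

    for i in range(1, n):
        for s in states:
            scores = {p: delta[i - 1][p] + local_log_prob(p, observations, i, s)
                      for p in states}
            best_prev = max(scores, key=scores.get)
            delta[i][s] = scores[best_prev]
            backptr[i][s] = best_prev

    # Trace back from the best final state to recover the full sequence.
    last = max(delta[n - 1], key=delta[n - 1].get)
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    return list(reversed(path))
```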

MEMM
(figure: each state Y_i depends on the previous state and on the full observation sequence X_{1:n})
Models the dependence between each state and the full observation sequence explicitly; more expressive than HMMs.
Discriminative model: it completely ignores modeling P(X), which saves modeling effort.
Its learning objective function is consistent with the predictive function: P(Y | X).
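Written out (a standard MEMM formulation consistent with the description above; the slide's own equations are not in the transcription):

```latex
P(Y_{1:n} \mid X_{1:n}) = \prod_{i=1}^{n} P(Y_i \mid Y_{i-1}, X_{1:n}),
\qquad
P(Y_i \mid Y_{i-1}, X_{1:n}) =
  \frac{\exp\!\big(\sum_k \lambda_k f_k(Y_{i-1}, Y_i, X_{1:n})\big)}
       {\sum_{y'} \exp\!\big(\sum_k \lambda_k f_k(Y_{i-1}, y', X_{1:n})\big)}
```

Note that each factor is normalized locally, over the next state only; this local normalization is what leads to the label bias problem discussed next.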

(figure: the example state-transition diagram with its local transition probabilities)
What the local transition probabilities say: state 1 almost always prefers to go to state 2, and state 2 almost always prefers to stay in state 2.
Probability of path 1 -> 1 -> 1 -> 1: product of its local transition probabilities = 0.09
Probability of path 2 -> 2 -> 2 -> 2: product of its local transition probabilities = 0.018
Probability of path 1 -> 2 -> 1 -> 2: product of its local transition probabilities = 0.06

Probability of path 1 -> 1 -> 2 -> 2: product of its local transition probabilities = 0.066
All paths so far: 1 -> 1 -> 1 -> 1: 0.09, 2 -> 2 -> 2 -> 2: 0.018, 1 -> 2 -> 1 -> 2: 0.06, 1 -> 1 -> 2 -> 2: 0.066
Most likely path: 1 -> 1 -> 1 -> 1, although locally it seems state 1 wants to go to state 2 and state 2 wants to remain in state 2. Why?
State 1 has only two outgoing transitions, but state 2 has five, so the average transition probability from state 2 is lower.
Label bias problem in MEMMs: preference of states with a lower number of transitions over others.
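A minimal sketch of this effect, using hypothetical locally normalized transition tables (all numbers below are illustrative and are not the ones on the slides): because states 2 through 5 must split their probability mass over five successors while state 1 splits it over only two, the path that never leaves state 1 ends up most likely even though the local tables mostly point toward state 2.

```python
from itertools import product

def state2_table():
    # State 2 strongly prefers to stay in state 2, but has 5 successors.
    return {1: 0.06, 2: 0.34, 3: 0.20, 4: 0.20, 5: 0.20}

def uniform_table():
    # States 3-5: uniform over all 5 states (again 5 successors).
    return {s: 0.20 for s in range(1, 6)}

# One locally normalized table per time step (an MEMM conditions on the
# observation at each position, so the tables can differ per step).
step_trans = []
for p_stay in (0.35, 0.45, 0.52):   # state 1 mostly prefers moving to state 2
    step_trans.append({1: {1: p_stay, 2: 1.0 - p_stay},
                       2: state2_table(),
                       3: uniform_table(), 4: uniform_table(), 5: uniform_table()})

def path_prob(path):
    """Product of the local (per-step) transition probabilities along a path."""
    p = 1.0
    for t, (a, b) in enumerate(zip(path, path[1:])):
        p *= step_trans[t][a].get(b, 0.0)
    return p

# Enumerate every length-4 path over the five states and pick the most likely one.
best = max(product(range(1, 6), repeat=4), key=path_prob)
print(best, round(path_prob(best), 4))   # -> (1, 1, 1, 1) 0.0819
# The all-ones path wins: states with fewer outgoing transitions get an
# unfair advantage under local normalization (the label bias problem).
```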

Solution: Do not normalize probabilities locally
From local probabilities to local potentials: replace the locally normalized transition probabilities with unnormalized local potentials, so that states with fewer transitions do not have an unfair advantage.

From MEMM to CRF
(figure: the CRF graphical model, with each state connected to the entire observation sequence X_{1:n})
A CRF is a partially directed model and, like the MEMM, a discriminative model.
The use of a global normalizer Z(x) overcomes the label bias problem of the MEMM.
It models the dependence between each state and the entire observation sequence (like the MEMM).
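In symbols, the fix is to keep unnormalized local potentials phi_i and normalize only once, globally (a standard formulation, written out here because the slide equations are not in the transcription):

```latex
P(y_{1:n} \mid x_{1:n}) =
  \frac{1}{Z(x_{1:n})} \prod_{i=1}^{n} \phi_i(y_{i-1}, y_i, x_{1:n}),
\qquad
Z(x_{1:n}) = \sum_{y'_{1:n}} \prod_{i=1}^{n} \phi_i(y'_{i-1}, y'_i, x_{1:n})
```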

Conditional Random Fields
General parametric form: (equation shown on the slide)

CRFs: Inference
Given CRF parameters λ and μ, find the y* that maximizes P(y | x).
Z(x) can be ignored because it is not a function of y.
Run the max-product algorithm on the junction tree of the CRF, whose cliques are the neighboring state pairs (Y_1, Y_2), (Y_2, Y_3), ..., (Y_{n-1}, Y_n).
This is the same as the Viterbi decoding used in HMMs.

CRF learning
Given training data {(x_d, y_d)}, d = 1..N, find λ*, μ* that maximize the conditional log-likelihood.
Computing the model expectations appears to require an exponentially large number of summations: is it intractable?
Computing the gradient w.r.t. λ: the gradient of the log-partition function in an exponential family is the expectation of the sufficient statistics, i.e. the expectation of each feature f over the corresponding marginal probability of neighboring nodes. This is tractable: the marginals can be computed using the sum-product algorithm on the chain.
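A standard way to write the general parametric form referred to above, with transition features f_k (weights λ_k) and state features g_l (weights μ_l), matching the λ, μ notation used in the learning slides (the equation itself did not survive the transcription), together with the decoding objective in which Z(x) drops out:

```latex
P(y_{1:n} \mid x_{1:n}) =
  \frac{1}{Z(x_{1:n})}
  \exp\!\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(y_{i-1}, y_i, x_{1:n})
            + \sum_{i=1}^{n} \sum_{l} \mu_l\, g_l(y_i, x_{1:n}) \Big)

y^{*} = \arg\max_{y_{1:n}} \sum_{i=1}^{n}
        \Big( \sum_{k} \lambda_k f_k(y_{i-1}, y_i, x_{1:n})
            + \sum_{l} \mu_l g_l(y_i, x_{1:n}) \Big)
```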

CRF learning (continued)
Marginals are computed using junction-tree calibration: initialize the clique potentials over (Y_{i-1}, Y_i) from x_{1:n}, then calibrate by message passing; on a chain this is also called the forward-backward algorithm.
Feature expectations are then computed from the calibrated potentials, which gives the gradient ∇L(λ, μ), and learning can be done using gradient ascent.
In practice, a Gaussian regularizer is used on the parameter vector to improve generalizability.
In practice, plain gradient ascent also has very slow convergence; alternatives are the Conjugate Gradient method and Limited-Memory Quasi-Newton methods.

CRFs: some empirical results
Comparison of error rates on synthetic data (scatter plots comparing MEMM, CRF, and HMM error rates; the data is increasingly higher order in the direction of the arrow).
CRFs achieve the lowest error rate for higher-order data.
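A minimal sketch of the forward-backward (sum-product) pass used to get these marginals on a chain, assuming per-position unary log scores and pairwise log scores with the observation features already folded in; the function name and argument layout are illustrative rather than the slides' exact setup:

```python
import numpy as np
from scipy.special import logsumexp

def chain_marginals(unary, pairwise):
    """Sum-product (forward-backward) on a linear-chain CRF.

    unary:    list of length n, each a (K,) array of log scores for y_i
              (observation features already folded in).
    pairwise: list of length n-1, each a (K, K) array of log scores for
              the pair (y_i, y_{i+1}).
    Returns (node_marginals, pair_marginals, log_Z).
    """
    n, K = len(unary), len(unary[0])
    alpha = np.zeros((n, K))            # forward log-messages
    beta = np.zeros((n, K))             # backward log-messages

    alpha[0] = unary[0]
    for i in range(1, n):
        alpha[i] = unary[i] + logsumexp(alpha[i - 1][:, None] + pairwise[i - 1], axis=0)

    for i in range(n - 2, -1, -1):
        beta[i] = logsumexp(pairwise[i] + (unary[i + 1] + beta[i + 1])[None, :], axis=1)

    log_Z = logsumexp(alpha[-1])        # log-partition function log Z(x)

    # Node marginals P(y_i | x) and pairwise marginals P(y_i, y_{i+1} | x):
    node = np.exp(alpha + beta - log_Z)
    pair = [np.exp(alpha[i][:, None] + pairwise[i]
                   + (unary[i + 1] + beta[i + 1])[None, :] - log_Z)
            for i in range(n - 1)]
    return node, pair, log_Z
```

The node and pairwise marginals returned here are exactly the expectations needed for the gradient ∇L(λ, μ), and log_Z is the log-partition function.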

CRFs: some empirical results (part-of-speech tagging)
Using the same set of features: HMM >=< CRF > MEMM.
Using additional overlapping features: CRF+ > MEMM+ >> HMM.

Other CRFs
So far we have discussed only one-dimensional chain CRFs, for which inference and learning are exact.
We could also have CRFs for arbitrary graph structures, e.g. grid CRFs, but then inference and learning are no longer tractable and approximate techniques are used: MCMC sampling, variational inference, loopy belief propagation. We will discuss these techniques in the future.

Summary
Conditional Random Fields are partially directed discriminative models.
They overcome the label bias problem of MEMMs by using a global normalizer.
Inference for 1-D chain CRFs is exact: it is the same as max-product (Viterbi) decoding.
Learning is also exact, so globally optimal parameters can be learned; this requires the sum-product (forward-backward) algorithm.
CRFs involving arbitrary graph structure are intractable in general (e.g. grid CRFs); inference and learning then require approximation techniques: MCMC sampling, variational methods, loopy BP.