Conditional Random Fields: Theory and Application

Conditional Random Fields: Theory and Application. Matt Seigel (mss46@cam.ac.uk), Cambridge University Engineering Department, 3 June 2010.

Outline: The Sequence Classification Problem; Linear Chain CRFs; CRF extensions; Summary; Bibliography.

Sequence Classification: Structured input
Task: Given a vector of structured observed features x (potentially multi-valued), what is the probability of assigning a single, atomic label y to this sequence of inputs?
Solution: Model the posterior distribution of a single label given the observation sequence and decide with
y^*(x) = \arg\max_{y} p(y|x)
Approach: Model p(y|x) using a Naïve Bayes or Maximum Entropy model (discussed later).

Sequence Classification: Structured input, structured output
Task: Generalise the problem so that y is also structured. Given a vector of structured observed features x (potentially multi-valued), what is the probability of assigning a corresponding label sequence y to this sequence of inputs?
Solution: Model the posterior distribution of a label sequence given the observation sequence and decide with
y^*(x) = \arg\max_{y} p(y|x)
Approach: Model p(y|x) using an HMM or a CRF (discussed later).
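As a small illustration (not part of the original slides), the sketch below makes the structured decision rule concrete by brute force: it enumerates every candidate label sequence and keeps the one with the highest score under a stand-in scoring function. The names brute_force_decode, score_fn and toy_score are hypothetical; the point is that the label space grows as |Y|^n, which is why dynamic-programming decoders (e.g. Viterbi) replace the enumeration in practice.

    from itertools import product

    def brute_force_decode(x, labels, score_fn):
        """Return the label sequence y maximising score_fn(y, x).

        score_fn is assumed to return something monotone in p(y | x),
        e.g. an unnormalised log score. Enumerating all |labels|**len(x)
        sequences is only feasible for toy inputs.
        """
        best_y, best_score = None, float("-inf")
        for y in product(labels, repeat=len(x)):
            s = score_fn(y, x)
            if s > best_score:
                best_y, best_score = y, s
        return list(best_y)

    # Toy usage with a made-up scorer that prefers label "D" on "the" and "N" elsewhere.
    def toy_score(y, x):
        return sum(1.0 for xi, yi in zip(x, y)
                   if (xi == "the" and yi == "D") or (xi != "the" and yi == "N"))

    print(brute_force_decode(["the", "cat"], ["D", "N"], toy_score))  # ['D', 'N']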

Sequence Classification: Naïve Bayes
A directed graphical model (generative) which factorises the joint distribution p(x_1, ..., x_m, y) as a product of conditionals p(x_i | x_1, ..., x_{i-1}, y).
Simplify the model by making the Naïve Bayes assumption that observations are conditionally independent given the label, yielding the definition of the Naïve Bayes classifier:
p(y|x) \propto p(y, x) = p(y) \prod_{i=1}^{m} p(x_i | y)
Figure: A Naïve Bayes classifier (class node y connected to observation nodes x_1, x_2, ..., x_m).
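A minimal sketch of the factorisation above (my own illustration, with made-up count tables and names): the posterior over the single label y is proportional to the prior p(y) times the product of per-feature conditionals p(x_i|y).

    import math

    # Hypothetical model parameters: class priors and per-feature conditionals.
    prior = {"spam": 0.4, "ham": 0.6}
    cond = {  # p(word | class), toy values for illustration only
        "spam": {"offer": 0.3, "meeting": 0.05},
        "ham":  {"offer": 0.05, "meeting": 0.3},
    }

    def naive_bayes_posterior(x, smoothing=1e-6):
        """Return p(y | x) under the Naive Bayes assumption, normalised over y."""
        log_scores = {}
        for y, p_y in prior.items():
            log_scores[y] = math.log(p_y) + sum(
                math.log(cond[y].get(xi, smoothing)) for xi in x)
        z = math.log(sum(math.exp(s) for s in log_scores.values()))
        return {y: math.exp(s - z) for y, s in log_scores.items()}

    print(naive_bayes_posterior(["offer", "offer"]))  # heavily favours "spam"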

Sequence Classification: Hidden Markov Models
HMMs extend the Naïve Bayes model to operate on label sequences.
Independence assumption: each observation x_i is assumed to depend only on the current class label y_i. It is, however, reasonable to expect dependencies between consecutive elements of the sequence; transition probabilities between labels are used to capture this behaviour.
p(y, x) = \prod_{i=1}^{n} p(y_i | y_{i-1}) \, p(x_i | y_i)
(with y_0 a start state, as in the figure)
Figure: HMM architecture (state chain y_0, y_1, ..., y_n, y_{n+1} with observations x_1, ..., x_n).
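To make the HMM factorisation concrete, here is a short sketch (an assumption-laden illustration, not code from the talk) that evaluates p(y, x) as the product of transition and emission probabilities and recovers the most likely state sequence with the standard Viterbi recursion; all parameter names and the toy numbers are invented.

    import math

    def hmm_log_joint(y, x, trans, emit, start):
        """log p(y, x) for a first-order HMM with dict-based parameters."""
        lp = math.log(start[y[0]]) + math.log(emit[y[0]][x[0]])
        for i in range(1, len(x)):
            lp += math.log(trans[y[i - 1]][y[i]]) + math.log(emit[y[i]][x[i]])
        return lp

    def viterbi(x, states, trans, emit, start):
        """Most probable state sequence argmax_y p(y, x)."""
        V = [{s: math.log(start[s]) + math.log(emit[s][x[0]]) for s in states}]
        back = []
        for t in range(1, len(x)):
            V.append({})
            back.append({})
            for s in states:
                prev, score = max(
                    ((p, V[t - 1][p] + math.log(trans[p][s])) for p in states),
                    key=lambda kv: kv[1])
                V[t][s] = score + math.log(emit[s][x[t]])
                back[t - 1][s] = prev
        last = max(V[-1], key=V[-1].get)
        path = [last]
        for bp in reversed(back):
            path.append(bp[path[-1]])
        return list(reversed(path))

    # Toy usage with invented parameters.
    states = ["C", "H"]
    start = {"C": 0.5, "H": 0.5}
    trans = {"C": {"C": 0.7, "H": 0.3}, "H": {"C": 0.4, "H": 0.6}}
    emit = {"C": {"1": 0.5, "2": 0.4, "3": 0.1}, "H": {"1": 0.1, "2": 0.3, "3": 0.6}}
    print(viterbi(["3", "1", "3"], states, trans, emit, start))  # ['H', 'C', 'H']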

Sequence Classification: Maximum Entropy Models I
An undirected graphical model (discriminative). It is no longer trained to maximise the joint likelihood of the data, but rather the conditional likelihood.
Factorises the distribution as a product of potential functions.
Based on the principle of Maximum Entropy: model the data so as to maximise the entropy subject to the constraints inherent in the training data.
The primal problem:
p^*(y|x) = \arg\max_{p(y|x) \in P} H(y|x)

Sequence Classification: Maximum Entropy Models II
A fundamental aspect of Maximum Entropy models is the representation of characteristics of the training data through a number of feature functions f_k(x, y).
Moment constraints enforce that the expected value of each feature f_k under the empirical distribution equals its expected value under the model distribution: \tilde{E}(f_k) = E(f_k).
Finding p^*(y|x) can then be formulated as a constrained optimisation problem (via a Lagrangian) built from the moment constraints, the standard PDF constraints and the primal objective. This derivation yields the definition of a Maximum Entropy model:
p_\theta(y|x) = \frac{1}{Z_\theta(x)} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(x, y) \right)
The normalisation term Z_\theta(x) is the sum of the numerator over all possible labels y \in \mathcal{Y}.
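The step from the constraints to the exponential form is not spelled out on the slide; a compressed sketch of the standard Lagrangian argument (textbook reasoning, not the speaker's derivation) is:

    \Lambda(p, \lambda, \gamma)
        = H(y|x)
        + \sum_{k=1}^{K} \lambda_k \left( E(f_k) - \tilde{E}(f_k) \right)
        + \sum_{x} \gamma_x \left( \sum_{y} p(y|x) - 1 \right)

    \frac{\partial \Lambda}{\partial p(y|x)} = 0
        \;\Rightarrow\; \log p(y|x) = \sum_{k=1}^{K} \lambda_k f_k(x, y) + \mathrm{const}(x)
        \;\Rightarrow\; p_\theta(y|x) = \frac{1}{Z_\theta(x)} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(x, y) \right)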

Sequence Classification: Graphical Model Comparison
Where do CRFs fit into the picture?
Figure: Graphical model comparison. Naïve Bayes relates to MaxEnt by conditioning and to the HMM by adding sequence structure; the CRF is the conditional counterpart of the HMM and the sequential counterpart of MaxEnt.
CRFs are discriminative sequential models which factorise the conditional distribution p(y|x) into potential functions.

Linear Chain CRFs: Overview I
First proposed by Lafferty, McCallum and Pereira (2001) [3].
The Maximum Entropy Markov Model (MEMM) was the first attempt at a discriminative version of the HMM. An MEMM uses per-state exponential models for the conditional probabilities of next states given the current state.
A CRF instead uses a single exponential model for the joint probability of the entire label sequence given the observation sequence.
CRFs address the restrictive independence assumptions inherent to HMMs and the label bias problem inherent to MEMMs.
Figure: Basic linear-chain CRF architecture (label chain y_0, y_1, ..., y_n, y_{n+1} connected to observations x_1, ..., x_n).

Linear Chain CRFs: Overview II
Define the linear-chain CRF as a distribution of the form:
p_\theta(y|x) = \frac{1}{Z_\theta(x)} \exp\left( \sum_{k} \lambda_k t_k(y, x) + \sum_{j} \mu_j g_j(y, x) \right)
The feature functions t_k and g_j are assumed to be given and fixed; λ_k and µ_j are the associated Lagrange multipliers.
The choice of an exponential family of distributions is natural within the Maximum Entropy framework employed for parameter estimation.
Training: the parameters of the model, θ = (λ_1, ..., λ_K, µ_1, ..., µ_J), must be estimated from the training data D = {x^(p), y^(p)} with empirical distribution p̃(x, y) - details to follow.
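A minimal numerical sketch of the linear-chain form above (my own, with hypothetical names log_potential, log_Z and log_p, and feature functions assumed to have the signature f(y_prev, y_cur, x, i)): clique potentials are exponentiated weighted feature sums, and the normaliser Z_θ(x) is computed with the forward recursion rather than by enumerating all |Y|^n label sequences. A brute-force sanity check on a toy input is included.

    import math
    from itertools import product  # only for the tiny sanity check below

    def log_potential(y_prev, y_cur, x, i, weights, feats):
        """log of the clique potential at position i: sum_k lambda_k * f_k(y_prev, y_cur, x, i)."""
        return sum(w * f(y_prev, y_cur, x, i) for w, f in zip(weights, feats))

    def log_Z(x, labels, weights, feats):
        """Forward recursion for the log normaliser of a linear-chain CRF."""
        alpha = {y: log_potential(None, y, x, 0, weights, feats) for y in labels}
        for i in range(1, len(x)):
            new_alpha = {}
            for y in labels:
                new_alpha[y] = math.log(sum(
                    math.exp(alpha[yp] + log_potential(yp, y, x, i, weights, feats))
                    for yp in labels))
            alpha = new_alpha
        return math.log(sum(math.exp(a) for a in alpha.values()))

    def log_p(y, x, labels, weights, feats):
        """log p(y | x) = sum of log potentials along y minus log Z."""
        score = log_potential(None, y[0], x, 0, weights, feats)
        score += sum(log_potential(y[i - 1], y[i], x, i, weights, feats)
                     for i in range(1, len(x)))
        return score - log_Z(x, labels, weights, feats)

    # Sanity check: probabilities over all label sequences sum to 1.
    labels = ["A", "B"]
    feats = [lambda yp, y, x, i: 1.0 if y == x[i] else 0.0,   # "emission"-like
             lambda yp, y, x, i: 1.0 if yp == y else 0.0]     # "transition"-like
    weights = [1.5, 0.3]
    x = ["A", "B", "B"]
    total = sum(math.exp(log_p(list(ys), x, labels, weights, feats))
                for ys in product(labels, repeat=len(x)))
    print(round(total, 6))  # 1.0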

Linear Chain CRFs: Standard Feature Functions
A natural (HMM-like) starting point is to define one feature for each state pair ("transition") and one for each state-observation pair ("emission"):
t_{y',y}(y, x, i) = \delta(y_{i-1}, y') \, \delta(y_i, y)
g_{y,x}(y, x, i) = \delta(x_i, x) \, \delta(y_i, y)
The parameters corresponding to these functions (λ_{y',y} and µ_{y,x}) play a similar role to the usual HMM parameters p(y|y') and p(x|y). With only these features a CRF essentially reduces to an HMM, but in general CRFs are far more expressive.
More generally, define a generic feature function f_k(y, x, i) (where K is the total number of feature functions), which relates the label sequence y to the observation sequence x at position i.
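To spell out the HMM-like feature set, here is a small sketch (illustrative only; make_hmm_like_features is an invented name) that builds one transition indicator per label pair and one emission indicator per (label, observation) pair, all in the generic f_k(y_prev, y_cur, x, i) signature used above.

    def make_hmm_like_features(labels, vocab):
        """Indicator features mirroring HMM transitions and emissions."""
        feats = []
        for y_prev in labels:                  # "transition" features t_{y', y}
            for y in labels:
                feats.append(lambda yp, yc, x, i, a=y_prev, b=y:
                             1.0 if (yp == a and yc == b) else 0.0)
        for y in labels:                       # "emission" features g_{y, x}
            for w in vocab:
                feats.append(lambda yp, yc, x, i, a=y, b=w:
                             1.0 if (yc == a and x[i] == b) else 0.0)
        return feats

    feats = make_hmm_like_features(["D", "N"], ["the", "cat"])
    print(len(feats))  # 2*2 transitions + 2*2 emissions = 8 feature functions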

Linear Chain CRFs: Training Formulation I
The principle of parameter estimation for CRFs is based on that of Maximum Entropy models.
Employ conditional maximum likelihood training, i.e. maximise the conditional log-likelihood of the training data (N training patterns, sequence length T, K feature functions):

p_\theta(y|x) = \frac{1}{Z_\theta(x)} \exp\left( \sum_{i} \sum_{k} \lambda_k f_k(y, x, i) \right)
L(\theta) = \sum_{p=1}^{N} \log p_\theta(y^{(p)} | x^{(p)}) = \sum_{p=1}^{N} \sum_{i=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_i^{(p)}, y_{i-1}^{(p)}, x^{(p)}, i) - \sum_{p=1}^{N} \log Z_\theta(x^{(p)})
\frac{\partial L(\theta)}{\partial \lambda_k} = \underbrace{\sum_{p=1}^{N} \sum_{i=1}^{T} f_k(y_i^{(p)}, y_{i-1}^{(p)}, x^{(p)}, i)}_{\tilde{E}_{f_k}} - \underbrace{\sum_{p=1}^{N} \sum_{i=1}^{T} \sum_{y, y'} f_k(y, y', x^{(p)}, i) \, p_\theta(y, y' | x^{(p)})}_{E_{f_k}}

Linear Chain CRFs: Training Formulation II
The derivative of the log-likelihood with respect to λ_k is therefore the difference \tilde{E}_{f_k} - E_{f_k}.
The empirical expectation \tilde{E}_{f_k} is trivial to compute. The model expectation E_{f_k} is harder; the forward-backward algorithm is typically used to compute it.
Although the objective is convex, no closed-form solution exists, so iterative numerical techniques are required.
The initial approach [3] used Improved Iterative Scaling (IIS), which converges slowly and makes various assumptions about sequence length. L-BFGS, RPROP and conjugate gradient yield significantly improved convergence times [4] and are typically used instead.
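As a hedged sketch of this gradient (not the authors' code; seq_score and gradient are invented names), the snippet below computes, for one training pair, the empirical feature counts and the model expectations E_{f_k}. For clarity the model expectations are obtained by explicit enumeration of all label sequences; in practice the forward-backward algorithm yields the same pairwise marginals in O(T |Y|^2).

    import math
    from itertools import product

    def seq_score(y, x, weights, feats):
        """sum_i sum_k lambda_k * f_k(y_{i-1}, y_i, x, i), with y_{-1} = None."""
        return sum(w * f(y[i - 1] if i else None, y[i], x, i)
                   for i in range(len(x)) for w, f in zip(weights, feats))

    def gradient(x, y_gold, labels, weights, feats):
        """d log p(y_gold | x) / d lambda_k = empirical count - model expectation."""
        # Empirical feature counts on the observed labelling.
        emp = [sum(f(y_gold[i - 1] if i else None, y_gold[i], x, i)
                   for i in range(len(x))) for f in feats]
        # Model expectations under p_theta(y | x), by enumeration (toy sizes only).
        seqs = [list(ys) for ys in product(labels, repeat=len(x))]
        logps = [seq_score(ys, x, weights, feats) for ys in seqs]
        logZ = math.log(sum(math.exp(s) for s in logps))
        exp = [0.0] * len(feats)
        for ys, lp in zip(seqs, logps):
            p = math.exp(lp - logZ)
            for k, f in enumerate(feats):
                exp[k] += p * sum(f(ys[i - 1] if i else None, ys[i], x, i)
                                  for i in range(len(x)))
        return [e - m for e, m in zip(emp, exp)]

    # Toy check: with zero weights, the model expectation is the uniform average.
    feats = [lambda yp, y, x, i: 1.0 if y == x[i] else 0.0]
    print(gradient(["A", "B"], ["A", "B"], ["A", "B"], [0.0], feats))  # [1.0]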

Linear Chain CRFs: Feature Functions
Binary feature functions may be extended to capture more interesting characteristics of the underlying data, e.g. for POS tagging:
f_{y,x} = \delta(x[0], \mathrm{upper}(x[0])) \, \delta(y, \mathrm{NP})
i.e. a feature that fires when the current word is capitalised and the proposed tag is NP.
Moment constraints with binary feature functions acting on literal observations are natural for many applications (e.g. NLIP).
It is also possible to construct sets of features for discrete-valued observations, with delta functions centred at the discrete points.
Continuous-valued features are more difficult to account for. Approaches are: quantise the real-valued inputs and construct binary feature functions; or, as in recent work [5], use continuous feature functions with Distribution Constraints.
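A concrete (hypothetical) rendering of the capitalisation feature mentioned above, in the generic f_k(y_prev, y_cur, x, i) signature: it fires when the word at position i starts with an upper-case letter and the proposed tag is NP.

    def f_cap_np(y_prev, y_cur, x, i):
        """1 if the current word is capitalised and tagged NP, else 0."""
        word = x[i]
        return 1.0 if word[:1].isupper() and y_cur == "NP" else 0.0

    print(f_cap_np(None, "NP", ["Cambridge", "is", "old"], 0))  # 1.0
    print(f_cap_np("NP", "VB", ["Cambridge", "is", "old"], 1))  # 0.0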

Linear Chain CRFs: Continuous Feature Functions
Most applications use binning/quantisation and moment constraints.
The work in [5] is instead based on a (nonlinear) continuous weighting function λ_i(f_i) for each continuous feature function. This does, however, result in a model which is no longer log-linear. Spline interpolation is used to approximate these weighting functions. With K knots in the spline approximation:
p(y|x, \theta) \propto \exp\left( \sum_{i \in \{\mathrm{continuous}\}, k} \lambda_{ik} \, a_k(f_i(x, y)) \, f_i(x, y) + \sum_{j \in \{\mathrm{binary}\}} \lambda_j f_j(x, y) \right)
where a_k(x) is the scaling value associated with a particular knot k in the spline approximation, and f_i(x, y) could, for instance, be the continuous input value itself.
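The quantisation route mentioned above can be sketched as follows (illustrative bin edges and names, not from the talk): a real-valued input is mapped to one binary indicator per bin, and those indicators then behave exactly like the moment-constrained binary features.

    def quantise_feature(value, edges):
        """Return one binary indicator per bin defined by consecutive edges.

        edges = [e0, e1, ..., eB] defines B bins [e0, e1), [e1, e2), ...;
        values outside the range simply activate no indicator.
        """
        return [1.0 if lo <= value < hi else 0.0
                for lo, hi in zip(edges[:-1], edges[1:])]

    # e.g. a continuous acoustic feature quantised into 4 bins
    print(quantise_feature(0.7, [-2.0, -1.0, 0.0, 1.0, 2.0]))  # [0.0, 0.0, 1.0, 0.0]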

Linear Chain CRFs: Applications
Part-of-Speech Tagging (Lafferty et al. 2001): with HMM-like features, classification error improved from 5.7% to 5.6%; with additional orthographic features, 4.3%.
Named Entity Recognition (McCallum and Li 2003).
Shallow Parsing (Sha and Pereira 2003).
Object Recognition (Quattoni et al. 2004).
Biomedical NER (Settles 2004).
Information Extraction (Peng and McCallum 2004).
Phonetic Recognition (Morris and Fosler-Lussier 2006): consistently showed a 1-1.5% improvement over an HMM baseline.
Word alignment for Machine Translation (Blunsom and Cohn 2006).
etc.

CRF extensions: Hidden CRFs
By including hidden states s in the CRF framework, no a-priori segmentation of the data into substructures is assumed [1]. Labels at individual observations are optimally combined to form a class-conditional estimate:
p(y|x; \theta) = \sum_{s} P(y, s | x, \theta) = \frac{\sum_{s} \exp\left( \sum_{k} \lambda_k f_k(y, s, x) \right)}{\sum_{y', s} \exp\left( \sum_{k} \lambda_k f_k(y', s, x) \right)}
If the marginalisation over the hidden state sequence corresponding to y were not carried out, the result would essentially be a CRF of the form p(y, s | x; θ).
HCRFs are a natural candidate for most sequential classification problems traditionally modelled with HMMs.
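A minimal sketch of the HCRF class posterior (my own, brute-force for clarity; hcrf_posterior and the feature names are hypothetical): for each class y, sum the exponentiated scores over all hidden state sequences s, then normalise over classes. Practical implementations replace the inner enumeration with forward recursions.

    import math
    from itertools import product

    def hcrf_posterior(x, classes, hidden_states, weights, feats):
        """p(y | x) = sum_s exp(score(y, s, x)) / sum_{y', s} exp(score(y', s, x))."""
        def score(y, s):
            return sum(w * f(y, s, x) for w, f in zip(weights, feats))
        class_mass = {}
        for y in classes:
            class_mass[y] = sum(
                math.exp(score(y, list(s)))
                for s in product(hidden_states, repeat=len(x)))
        Z = sum(class_mass.values())
        return {y: m / Z for y, m in class_mass.items()}

    # Toy usage with invented class-specific counting features.
    feats = [lambda y, s, x: float(y == "yes") * s.count("h"),
             lambda y, s, x: float(y == "no") * s.count("l")]
    print(hcrf_posterior(["f1", "f2"], ["yes", "no"], ["h", "l"], [1.0, 0.2], feats))  # favours "yes"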

CRF extensions: Comparative Results, HCRFs vs HMMs in speech

Table: Phone classification, CER on the TIMIT corpus [6], [1]
# Mix Comps.   HMM-ML   HMM-MMI   HCRF-MC   HCRF-DC
10             28.1%    24.8%     21.7%     21.4%
20             26.4%    25.3%     21.3%     20.8%

Table: Phone recognition, CER on the TIMIT corpus [2]
# Mix Comps.   HMM-ML   HMM-MMI   HMM-MPE   HCRF
8              35.9%    33.3%     32.1%     29.4%
32             31.6%    30.8%     30.5%     28.3%

CRF extensions: Other Architectures and Extensions
Semi-Markov CRFs (Microsoft uses segmental CRFs in the SCARF toolkit for speech recognition).
Deep-structured CRFs.
Hierarchical CRFs.
Bayesian CRFs.
Dynamic CRFs.

Summary
CRFs estimate the distribution of a sequence of labels conditioned on an entire observation sequence.
CRFs do not make conditional independence assumptions between elements of the observation sequence (as HMMs do).
CRFs are capable of performing at least as well as HMMs without any feature-design effort.
There are proven algorithms for parameter estimation in CRFs and HCRFs (L-BFGS, RPROP, etc., together with forward-backward).
Arbitrary combinations of input features can be considered: binary, discrete and continuous feature data streams can be used.
HCRFs are a natural extension of the framework which makes it possible to use the CRF framework for more complex tasks.

Bibliography
[1] A. Gunawardana, M. Mahajan, A. Acero, and J. Platt. Hidden conditional random fields for phone classification. Ninth European Conference on Speech Communication and Technology, 2005.
[2] D. Jurafsky. Hidden conditional random fields for phone recognition. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Volume 2 (CVPR 06), pages 1521-1527, 2006.
[3] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 283-289, 2001.
[4] H. Wallach. Efficient training of conditional random fields. Proc. 6th Annual CLUK Research Colloquium, 2002.
[5] D. Yu, L. Deng, and A. Acero. Using continuous features in the maximum entropy model. Pattern Recognition Letters, 30(14):1295-1300, 2009.
[6] D. Yu, L. Deng, and A. Acero. Hidden conditional random field with distribution constraints for phone classification. In Proc. of Interspeech, pages 676-679, 2009.