Conditional Random Fields : Theory and Application

Matt Seigel (mss46@cam.ac.uk)
3 June 2010
Cambridge University Engineering Department

Outline

The Sequence Classification Problem
Linear Chain CRFs
CRF extensions
Summary
Bibliography

Sequence Classification: Structured input

Task: Given a vector of structured observed features $\mathbf{x}$ (potentially multi-valued), what is the probability of assigning an atomic label $y$ to this sequence of inputs?

Solution: Model the posterior distribution of a single label event given an observation sequence and use

$$y^*(\mathbf{x}) = \arg\max_{y} \, p(y \mid \mathbf{x})$$

Approach: Model $p(y \mid \mathbf{x})$ using a Naïve Bayes or Maximum Entropy model (discussed later).

Sequence Classification: Structured input, structured output

Task: Generalise the problem so that the output is also structured. Given a vector of structured observed features $\mathbf{x}$ (potentially multi-valued), what is the probability of assigning a corresponding label sequence $\mathbf{y}$ to this sequence of inputs?

Solution: Model the posterior distribution of a label sequence given an observation sequence and use

$$\mathbf{y}^*(\mathbf{x}) = \arg\max_{\mathbf{y}} \, p(\mathbf{y} \mid \mathbf{x})$$

Approach: Model $p(\mathbf{y} \mid \mathbf{x})$ using an HMM or CRF model (discussed later).

Sequence Classification: Naïve Bayes

A directed graphical model (generative) which factorises the joint distribution $p(x_1, \ldots, x_m, y)$ as a product of conditionals $p(x_i \mid x_1, \ldots, x_{i-1}, y)$.

Simplify the model by making the Naïve Bayes assumption that the observations are conditionally independent given the label, yielding the definition of a Naïve Bayes classifier:

$$p(y \mid \mathbf{x}) \propto p(y, \mathbf{x}) = p(y) \prod_{i=1}^{m} p(x_i \mid y)$$

Figure: A Naïve Bayes Classifier (label node $y$ with an edge to each observation $x_1, x_2, \ldots, x_m$)
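
The factorisation above can be sketched in a few lines of Python. This is a minimal illustration only: the labels "A" and "B", the probability tables and the function name are hypothetical and are not taken from the slides.

```python
import math

def naive_bayes_posterior(prior, likelihood, x):
    """Return p(y | x) for every label y under the Naive Bayes factorisation.

    prior[y] is p(y); likelihood[y][i][v] is p(x_i = v | y).
    Both tables are hypothetical inputs chosen for illustration.
    """
    # Unnormalised joint in log space: log p(y) + sum_i log p(x_i | y).
    log_joint = {
        y: math.log(prior[y])
        + sum(math.log(likelihood[y][i][v]) for i, v in enumerate(x))
        for y in prior
    }
    # Normalise over labels with a numerically stable log-sum-exp.
    m = max(log_joint.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in log_joint.values()))
    return {y: math.exp(s - log_z) for y, s in log_joint.items()}

# Two labels, two binary observations (all numbers are made up).
PRIOR = {"A": 0.5, "B": 0.5}
LIKELIHOOD = {
    "A": [{0: 0.2, 1: 0.8}, {0: 0.5, 1: 0.5}],
    "B": [{0: 0.8, 1: 0.2}, {0: 0.5, 1: 0.5}],
}
posterior = naive_bayes_posterior(PRIOR, LIKELIHOOD, [1, 0])
best_label = max(posterior, key=posterior.get)
```

Here the second observation is equally likely under both labels, so the decision is driven entirely by the first one.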

Sequence Classification: Hidden Markov Models

HMMs are an extension of NB models to operate on label sequences.

Independence assumption: each observation $x_i$ is assumed to depend only on the current class label $y_i$.

It is however reasonable to assume there are dependencies between consecutive observations. Transition probabilities are used to capture this behaviour.

$$p(\mathbf{y}, \mathbf{x}) = \prod_{i=1}^{n} p(y_i \mid y_{i-1}) \, p(x_i \mid y_i)$$

where $y_0$ is a fixed start state.

Figure: HMM Architecture (label chain $y_0, y_1, y_2, \ldots, y_n, y_{n+1}$, with each of $y_1, \ldots, y_n$ emitting the corresponding observation $x_1, x_2, \ldots, x_n$)
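
As a rough sketch of the HMM factorisation, the joint probability of a label sequence and an observation sequence is a running product of one transition and one emission term per position. The start symbol "<s>", the labels, the words and all probabilities below are hypothetical, and the end-state transition shown in the figure is left out for brevity.

```python
import math

def hmm_log_joint(labels, observations, transition, emission, start="<s>"):
    """Return log p(y, x) = sum_i [log p(y_i | y_{i-1}) + log p(x_i | y_i)].

    transition[a][b] is p(b | a); emission[y][x] is p(x | y).
    The start symbol plays the role of y_0.
    """
    log_p = 0.0
    previous = start
    for y, x in zip(labels, observations):
        log_p += math.log(transition[previous][y]) + math.log(emission[y][x])
        previous = y
    return log_p

# Made-up tables for a two-label tagging toy problem.
TRANSITION = {
    "<s>": {"N": 0.6, "V": 0.4},
    "N": {"N": 0.3, "V": 0.7},
    "V": {"N": 0.8, "V": 0.2},
}
EMISSION = {
    "N": {"dogs": 0.9, "bark": 0.1},
    "V": {"dogs": 0.2, "bark": 0.8},
}
joint = math.exp(hmm_log_joint(["N", "V"], ["dogs", "bark"], TRANSITION, EMISSION))
```

Working in log space avoids underflow for long sequences, which is why the function returns a log probability.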

Sequence Classification: Maximum Entropy Models I

An undirected graphical model (discriminative).

No longer trained to maximise the joint likelihood of the data, but rather the conditional likelihood of the data.

Factorises the joint distribution as a product of potential functions.

Based on the principle of Maximum Entropy: model the data so as to maximise the entropy given the inherent constraints of the training data.

The primal problem:

$$p^*(y \mid \mathbf{x}) = \arg\max_{p(y \mid \mathbf{x}) \in P} H(y \mid \mathbf{x})$$

Sequence Classification: Maximum Entropy Models II

A fundamental aspect of Maximum Entropy models is the representation of characteristics of the training data through a number of feature functions, $f_k(\mathbf{x}, y)$.

Moment constraints: enforce that the expected value of each feature $f_k$ under the model distribution equals its expected value under the empirical distribution:

$$E(f_k) = \tilde{E}(f_k)$$

Finding $p^*(y \mid \mathbf{x})$ can then be formulated as a constrained optimisation problem (Lagrangian) using the moment constraints, the standard PDF constraints and the primal.

This derivation yields the definition of a Maximum Entropy model:

$$p_\theta(y \mid \mathbf{x}) = \frac{1}{Z_\theta(\mathbf{x})} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, y) \right)$$

The normalisation term $Z_\theta(\mathbf{x})$ is the sum of the numerator over all possible labels $y \in Y$.
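
A minimal sketch of evaluating a Maximum Entropy model of this form: score each label with the weighted feature sum, exponentiate, and normalise over the label set. The two feature functions, the weights and the label set are hypothetical choices made for illustration.

```python
import math

def maxent_posterior(weights, feature_functions, x, labels):
    """Return p_theta(y | x) = exp(sum_k lambda_k f_k(x, y)) / Z_theta(x)."""
    scores = {
        y: sum(lam * f(x, y) for lam, f in zip(weights, feature_functions))
        for y in labels
    }
    # Z_theta(x): sum of the numerator over all candidate labels.
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

def capitalised_and_np(x, y):
    """Hypothetical binary feature: capitalised word labelled NP."""
    return 1.0 if x[:1].isupper() and y == "NP" else 0.0

def ends_in_s_and_v(x, y):
    """Hypothetical binary feature: word ending in 's' labelled V."""
    return 1.0 if x.endswith("s") and y == "V" else 0.0

FEATURES = [capitalised_and_np, ends_in_s_and_v]
WEIGHTS = [math.log(3.0), math.log(2.0)]  # arbitrary values, not trained
posterior = maxent_posterior(WEIGHTS, FEATURES, "Paris", ["NP", "V", "N"])
```

With these weights the unnormalised scores for NP, V and N are 3, 2 and 1, so the posterior is simply those values divided by 6.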

Sequence Classification: Graphical Model Comparison

Where do CRFs fit into the picture?

Figure: Graphical Model Comparison (NB becomes MaxEnt when made conditional and HMM when made sequential; MaxEnt made sequential, or HMM made conditional, gives the CRF)

CRFs are discriminative sequential models which factorise the joint distribution into conditional potential functions.

Linear Chain CRFs: Overview I

First proposed by Lafferty, McCallum and Pereira (2001) [3].

The Maximum Entropy Markov Model (MEMM) was the first attempt at a discriminative version of an HMM.

The MEMM uses per-state exponential models for the conditional probabilities of next states given the current state.

The CRF uses a single exponential model for the joint probability of the entire label sequence given the observation sequence.

CRFs address the independence assumption issue inherent to HMMs and the label bias problem inherent to MEMMs.

Figure: Basic Linear-Chain CRF architecture (label chain $y_0, y_1, y_2, \ldots, y_n, y_{n+1}$ over observations $x_1, x_2, \ldots, x_n$)

Linear Chain CRFs: Overview II

Define the Linear-Chain CRF as a distribution of the form:

$$p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_\theta(\mathbf{x})} \exp\left( \sum_{k} \lambda_k t_k(\mathbf{y}, \mathbf{x}) + \sum_{j} \mu_j g_j(\mathbf{y}, \mathbf{x}) \right)$$

The feature functions $t_k$ and $g_j$ are assumed to be given and fixed; $\lambda_k$ and $\mu_j$ are the associated Lagrange multipliers.

The choice of an exponential family of distributions is natural within the Maximum Entropy framework employed for parameter estimation.

Training: the parameters of the model, $\theta = (\lambda_1, \ldots, \lambda_k, \mu_1, \ldots, \mu_j)$, must be estimated from the training data $D = \{\mathbf{x}^{(p)}, \mathbf{y}^{(p)}\}$ with empirical distribution $\tilde{p}(\mathbf{x}, \mathbf{y})$ (details to follow).

Linear Chain CRFs: Standard Feature Functions

A natural (HMM-like) starting point is to define one feature for each state pair ("transition") and one for each state-observation pair ("emission"):

$$t_{y',y} = \delta(y_{i-1}, y') \, \delta(y_i, y) = f_k(\mathbf{y}, \mathbf{x}, i), \quad k \in \text{trans}$$

$$g_{y,x} = \delta(x_i, x) \, \delta(y_i, y) = f_k(\mathbf{y}, \mathbf{x}, i), \quad k \in \text{obs}$$

The parameters corresponding to these functions ($\lambda_{y',y}$ and $\mu_{y,x}$) play a role similar to that of the usual HMM parameters $p(y \mid y')$ and $p(x \mid y)$, in log form.

Although CRFs can be reduced to HMMs, they are generally more expressive.

Define a generic feature function as a function $f_k$, $k = 1, \ldots, K$ (where $K$ is the total number of feature functions), which relates the label sequence $\mathbf{y}$ to the observation sequence $\mathbf{x}$ at position $i$.
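
To make the transition and emission indicator features concrete, the sketch below scores a label sequence with one weight per state pair and one per state-observation pair, then normalises by brute-force enumeration of all label sequences. The label set, the start label "<s>" and every weight value are hypothetical, and the enumeration is only practical for toy sequence lengths.

```python
import itertools
import math

LABELS = ["N", "V"]   # hypothetical label set
START = "<s>"         # hypothetical start label standing in for y_0

# lambda_{y',y} for transition features, mu_{y,x} for emission features.
# Missing entries are treated as weight 0; all values are arbitrary.
LAMBDA = {("<s>", "N"): 0.5, ("N", "V"): 1.0, ("V", "N"): 0.3}
MU = {("N", "dogs"): 1.2, ("V", "bark"): 0.9}

def sequence_score(labels, observations):
    """Sum over positions i of lambda_{y_{i-1}, y_i} + mu_{y_i, x_i}."""
    score = 0.0
    previous = START
    for y, x in zip(labels, observations):
        # Exactly one transition and one emission indicator fire per position.
        score += LAMBDA.get((previous, y), 0.0) + MU.get((y, x), 0.0)
        previous = y
    return score

def crf_probability(labels, observations):
    """p(y | x) with Z(x) computed by enumerating every label sequence."""
    z = sum(
        math.exp(sequence_score(candidate, observations))
        for candidate in itertools.product(LABELS, repeat=len(observations))
    )
    return math.exp(sequence_score(labels, observations)) / z

OBS = ["dogs", "bark"]
CANDIDATES = list(itertools.product(LABELS, repeat=len(OBS)))
total = sum(crf_probability(c, OBS) for c in CANDIDATES)
best = max(CANDIDATES, key=lambda c: crf_probability(c, OBS))
```

Because normalisation is over whole label sequences rather than per state, the weights do not need to be log probabilities.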

Linear Chain CRFs: Training Formulation I

The principle of parameter estimation for CRFs is based on that of Maximum Entropy models.

$$p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_\theta(\mathbf{x})} \exp\left( \sum_{i} \sum_{k} \lambda_k f_k(\mathbf{y}, \mathbf{x}, i) \right)$$

Employ Conditional Maximum Likelihood training, i.e. maximise the conditional log likelihood of the training data ($N$ training patterns, sequence length $T$, $K$ feature functions):

$$L(\theta) = \sum_{p=1}^{N} \log p(\mathbf{y}^{(p)} \mid \mathbf{x}^{(p)}) = \sum_{p=1}^{N} \sum_{i=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_i^{(p)}, y_{i-1}^{(p)}, \mathbf{x}^{(p)}, i) - \sum_{p=1}^{N} \log Z_\theta(\mathbf{x}^{(p)})$$

Linear Chain CRFs: Training Formulation II

The derivative of the log likelihood function w.r.t. $\lambda_k$ is equal to the difference $\tilde{E}(f_k) - E(f_k)$.

The empirical expectation $\tilde{E}(f_k)$ is trivial to compute.

The model expectation $E(f_k)$ is difficult to compute. The forward-backward algorithm is typically used to do so.

Although the optimisation problem is convex, no closed-form solution exists. Iterative numerical techniques are required.

The initial approach [3] used Improved Iterative Scaling (IIS), which converges slowly and makes various assumptions on sequence length.

L-BFGS, RPROP and Conjugate Gradient yield significantly improved convergence times [4], and are typically used instead.
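
The forward half of the forward-backward algorithm can be sketched as follows. It computes $\log Z_\theta(\mathbf{x})$ in time linear in the sequence length and is checked here against brute-force enumeration. The weights, labels and start symbol are hypothetical, and the backward pass needed for the feature expectations is not shown.

```python
import itertools
import math

LABELS = ["N", "V"]   # hypothetical label set
START = "<s>"         # hypothetical start label
LAMBDA = {("<s>", "N"): 0.5, ("N", "V"): 1.0, ("V", "N"): 0.3}  # arbitrary
MU = {("N", "dogs"): 1.2, ("V", "bark"): 0.9}                   # arbitrary

def local_score(previous, y, x):
    """Weighted feature sum at one position (transition plus emission)."""
    return LAMBDA.get((previous, y), 0.0) + MU.get((y, x), 0.0)

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition_forward(observations):
    """log Z(x) via the forward recursion over label prefixes."""
    # alpha[y]: log of the summed exp-scores of all prefixes ending in label y.
    alpha = {y: local_score(START, y, observations[0]) for y in LABELS}
    for x in observations[1:]:
        alpha = {
            y: log_sum_exp([alpha[prev] + local_score(prev, y, x) for prev in LABELS])
            for y in LABELS
        }
    return log_sum_exp(list(alpha.values()))

def log_partition_brute_force(observations):
    """log Z(x) by enumerating every label sequence (exponential cost)."""
    totals = []
    for candidate in itertools.product(LABELS, repeat=len(observations)):
        previous, score = START, 0.0
        for y, x in zip(candidate, observations):
            score += local_score(previous, y, x)
            previous = y
        totals.append(score)
    return log_sum_exp(totals)

OBS = ["dogs", "bark", "dogs"]
forward_value = log_partition_forward(OBS)
brute_value = log_partition_brute_force(OBS)
```

The two values agree up to floating-point error, while the forward recursion costs a number of operations proportional to the sequence length times the square of the number of labels.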

Linear Chain CRFs: Feature Functions

Binary feature functions may be extended to capture more interesting characteristics of the underlying data, e.g. for POS tagging:

$$f_{y,x} = \delta(x[0], \text{upper}(x[0])) \, \delta(y, \text{NP})$$

Moment constraints with binary feature functions acting on literal observations are natural for many applications (e.g. NLIP).

It is also possible to construct sets of features for discrete-valued observations, with delta functions centred at discrete points.

It is more difficult to account for continuous-valued features. Approaches are:

Quantise real-valued inputs and construct binary feature functions.

Recent work [5] makes use of continuous feature functions and Distribution Constraints.
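
Both ideas on this slide are easy to sketch: a binary feature testing whether the first character of a word equals its upper-case form when the label is NP, and a quantiser that turns a real value into one-hot binary features. The function names and the bin edges are hypothetical.

```python
def capitalised_np_feature(word, label):
    """delta(x[0], upper(x[0])) * delta(y, NP) as a 0/1 feature."""
    # Mirrors the slide's test literally: first character equals its upper-case form.
    first_is_upper = bool(word) and word[0] == word[0].upper()
    return 1 if first_is_upper and label == "NP" else 0

def quantised_features(value, bin_edges):
    """One binary feature per bin; bin b covers [bin_edges[b], bin_edges[b + 1])."""
    # Values outside every bin activate no feature.
    return [
        1 if low <= value < high else 0
        for low, high in zip(bin_edges, bin_edges[1:])
    ]

np_fires = capitalised_np_feature("Cambridge", "NP")
bins = quantised_features(0.35, [0.0, 0.25, 0.5, 1.0])
```

Each bin then receives its own weight, so the model can learn a piecewise-constant response to the original real-valued input.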

Linear Chain CRFs: Continuous Feature Functions

Most applications use binning/quantisation and moment constraints.

Work in [5] is based on using a (nonlinear) continuous weighting function $\lambda_i(f_i)$ for the continuous feature functions.

This does however result in a model which is no longer log-linear. Spline interpolation is used in order to approximate these weighting functions. With $K$ knots in the spline approximation:

$$p(y \mid \mathbf{x}, \theta) \propto \exp\left( \sum_{i \in \{\text{continuous}\},\, k} \lambda_{ik} \, a_k(f_i(\mathbf{x}, y)) \, f_i(\mathbf{x}, y) + \sum_{j \in \{\text{binary}\}} \lambda_j f_j(\mathbf{x}, y) \right)$$

where $a_k(\cdot)$ is the scaling value associated with a particular knot $k$ in the spline approximation. $f_i(\mathbf{x}, y)$ could for instance be the continuous input value.

Linear Chain CRFs: Applications

Part-of-Speech Tagging (Lafferty et al. 2001): with HMM-like features, classification error improved from 5.7% to 5.6%; with additional orthographic features, 4.3% was achieved.
Named Entity Recognition (McCallum and Li 2003)
Shallow Parsing (Sha and Pereira 2003)
Object Recognition (Quattoni et al. 2004)
Biomedical NER (Settles 2004)
Information Extraction (Peng and McCallum 2004)
Phonetic Recognition (Morris and Fosler-Lussier 2006): consistently showed 1-1.5% improvement over the HMM baseline.
Word alignment for Machine Translation (Blunsom and Cohn 2006)
etc.

CRF extensions: Hidden CRFs

By including hidden states ($\mathbf{s}$) in the CRF framework, no a-priori segmentation of the data into substructures is assumed [1].

Labels at individual observations are optimally combined to form a class conditional estimate:

$$p(y \mid \mathbf{x}; \theta) = \sum_{\mathbf{s}} p(y, \mathbf{s} \mid \mathbf{x}; \theta) \propto \sum_{\mathbf{s} \in y} \exp\left( \sum_{k} \lambda_k f_k(y, \mathbf{s}, \mathbf{x}) \right)$$

If the marginalisation over the hidden state sequence corresponding to $y$ were not carried out, the result would essentially be a CRF of the form $p(y, \mathbf{s} \mid \mathbf{x}; \theta)$.

HCRFs are a natural candidate for most sequential classification problems traditionally modelled with HMMs.
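
The marginalisation over hidden state sequences can be illustrated by brute force on a toy problem. In this sketch the classes, hidden states, observation symbols, feature templates and weights are all hypothetical; a practical HCRF would replace the enumeration with a forward recursion.

```python
import itertools
import math

HIDDEN_STATES = [0, 1]   # hypothetical hidden state set
CLASSES = ["a", "b"]     # hypothetical class labels

def joint_score(y, states, observations, weights):
    """sum_k lambda_k f_k(y, s, x) for two assumed feature templates."""
    score = 0.0
    for i, (s, x) in enumerate(zip(states, observations)):
        # Class-specific state-observation indicator features.
        score += weights["obs"].get((y, s, x), 0.0)
        if i > 0:
            # Class-specific state-transition indicator features.
            score += weights["trans"].get((y, states[i - 1], s), 0.0)
    return score

def hcrf_posterior(observations, weights):
    """p(y | x): sum exp-scores over all hidden sequences, then normalise over y."""
    unnormalised = {
        y: sum(
            math.exp(joint_score(y, s, observations, weights))
            for s in itertools.product(HIDDEN_STATES, repeat=len(observations))
        )
        for y in CLASSES
    }
    z = sum(unnormalised.values())
    return {y: v / z for y, v in unnormalised.items()}

WEIGHTS = {  # arbitrary values, not trained
    "obs": {("a", 0, "lo"): 1.0, ("a", 1, "hi"): 1.0, ("b", 0, "hi"): 0.5},
    "trans": {("a", 0, 1): 0.5},
}
posterior = hcrf_posterior(["lo", "hi"], WEIGHTS)
```

Skipping the inner sum and normalising over pairs of class and state sequence instead would give the joint model $p(y, \mathbf{s} \mid \mathbf{x}; \theta)$ mentioned on the slide.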

CRF extensions: Comparative Results, HCRFs vs HMMs in Speech

Table: Phone Classification - CER on TIMIT corpus [6] and [1]

# Mix. Comps.   HMM-ML   HMM-MMI   HCRF-MC   HCRF-DC
10              28.1%    24.8%     21.7%     21.4%
20              26.4%    25.3%     21.3%     20.8%

Table: Phone Recognition - CER on TIMIT corpus [2]

# Mix. Comps.   HMM-ML   HMM-MMI   HMM-MPE   HCRF
8               35.9%    33.3%     32.1%     29.4%
32              31.6%    30.8%     30.5%     28.3%

CRF extensions: Other Architectures and Extensions

Semi-Markov CRFs (Microsoft use a Segmental CRF in the SCARF toolkit for speech recognition)
Deep-structured CRFs
Hierarchical CRFs
Bayesian CRFs
Dynamic CRFs

Summary

CRFs estimate the distribution of a sequence of labels conditioned on an entire observation sequence.

CRFs do not make conditional independence assumptions between elements of the observation sequence (unlike HMMs).

CRFs are capable of performing at least as well as HMMs without any feature design effort.

There are proven algorithms for parameter estimation in CRFs and HCRFs (L-BFGS, RPROP, etc. together with forward-backward).

Arbitrary combinations of input features can be considered: binary, discrete and continuous feature data streams can be used.

HCRFs are a natural extension of the framework which makes it possible to use the CRF framework for more complex tasks.

Bibliography

[1] A. Gunawardana, M. Mahajan, A. Acero, and J. Platt. Hidden conditional random fields for phone classification. Ninth European Conference on Speech Communication and Technology, 2005.

[2] D. Jurafsky. Hidden Conditional Random Fields for Phone Recognition. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2 (CVPR 06), pages 1521-1527, 2006.

[3] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 283-289. Citeseer, 2001.

[4] H. Wallach. Efficient training of conditional random fields. Proc. 6th Annual CLUK Research Colloquium, 112, 2002.

[5] D. Yu, L. Deng, and A. Acero. Using continuous features in the maximum entropy model. Pattern Recognition Letters, 30(14):1295-1300, 2009.

[6] D. Yu, L. Deng, A. Acero, and A. Modeling. Hidden conditional random field with distribution constraints for phone classification. In Proc. of Interspeech, pages 676-679, 2009.