10-708: Probabilistic Graphical Models 10-708, Spring 2014

Discrete sequential models and CRFs

Lecturer: Eric P. Xing    Scribes: Pankesh Bamotra, Xuanhong Li

1 Case Study: Supervised Part-of-Speech Tagging

Supervised part-of-speech tagging is the task of labeling each word of a sentence with a predefined tag, such as noun, verb, and so on. As shown in Figure 1, we are given data D = \{x^{(n)}, y^{(n)}\}, where x represents the words and y represents the tags.

Figure 1: Supervised Part-of-Speech Tagging

This problem can be approached with many different methods. Here we discuss three of them: the Markov random field, the Bayes network, and the conditional random field (a toy comparison of their normalizers follows Figure 4 below).

Markov Random Field (Figure 2): it models the joint distribution over the tags Y_i and words X_i. The individual factors are not probabilities, so a normalization constant Z is needed.

Bayes Network (Figure 3): it also models the joint distribution over the tags Y_i and words X_i, but here the individual factors are probabilities, so Z = 1.

Conditional Random Field (Figure 4): it models the conditional distribution over the tags Y_i given the words X_i. The factors and Z are specific to the sentence X.
Figure 2: Supervised Part-of-Speech Tagging with Markov Random Field

Figure 3: Supervised Part-of-Speech Tagging with Bayes Network

Figure 4: Supervised Part-of-Speech Tagging with Conditional Random Field
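To make the role of the normalizer Z in each model concrete, here is a minimal sketch; the two-tag set, two-word vocabulary, and random potentials are all made-up assumptions. In the Bayes network case the factors are already conditional probabilities, so the analogous sum is 1 by construction; the sketch therefore contrasts only the MRF and CRF cases.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
theta_t = rng.normal(size=(2, 2))   # log-potential for a tag-tag edge
theta_e = rng.normal(size=(2, 2))   # log-potential for a tag-word edge

def log_score(y, x):
    """Unnormalized log-score of a 2-word sentence x with tags y."""
    return theta_t[y[0], y[1]] + theta_e[y[0], x[0]] + theta_e[y[1], x[1]]

pairs = list(itertools.product(range(2), repeat=2))

# MRF: the factors are not probabilities, so Z must sum over every
# joint configuration of tags AND words.
Z_mrf = sum(np.exp(log_score(y, x)) for y in pairs for x in pairs)

# CRF: Z(x) sums over tag sequences only and is specific to the sentence x.
x = (0, 1)
Z_crf = sum(np.exp(log_score(y, x)) for y in pairs)

print(np.exp(log_score((0, 1), x)) / Z_mrf)   # MRF: joint P(y, x)
print(np.exp(log_score((0, 1), x)) / Z_crf)   # CRF: conditional P(y | x)
```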
2 Review of Inference Algorithms

The forward-backward algorithm (marginal inference) and the Viterbi algorithm (MAP inference) are the two major inference algorithms we have seen so far. It turns out that both are belief propagation algorithms.

Figure 5: Belief Propagation in the Forward-backward Algorithm

For example, in the forward-backward algorithm shown in Figure 5, α is the belief from the forward pass, β is the belief from the backward pass, and ψ is the belief from the evidence x_i. The belief that Y_2 = n is then the product of the beliefs from the three directions: α(n)β(n)ψ(n). A minimal implementation of this computation follows.
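This sketch runs forward-backward on a hypothetical 3-tag chain; the transition matrix, initial distribution, and evidence messages are made-up numbers chosen only for illustration.

```python
import numpy as np

A = np.array([[0.7, 0.2, 0.1],      # A[i, j] = P(Y_{t+1} = j | Y_t = i)
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
pi = np.array([0.5, 0.3, 0.2])      # initial distribution P(Y_1)
psi = np.array([[0.9, 0.05, 0.05],  # psi[t, j] = P(x_t | Y_t = j): the
                [0.1, 0.80, 0.10],  # evidence message from observed word x_t
                [0.3, 0.30, 0.40]])
T, K = psi.shape

alpha = np.zeros((T, K))            # message arriving from the left
beta = np.ones((T, K))              # message arriving from the right
alpha[0] = pi
for t in range(1, T):               # forward pass
    alpha[t] = (alpha[t - 1] * psi[t - 1]) @ A
for t in range(T - 2, -1, -1):      # backward pass
    beta[t] = A @ (psi[t + 1] * beta[t + 1])

# Belief at Y_2 (index 1): the product of the three incoming messages,
# alpha * beta * psi, normalized to give the marginal P(Y_2 | x_{1:T}).
b = alpha[1] * beta[1] * psi[1]
print(b / b.sum())
```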
3 Hidden Markov Model (HMM) and Maximum Entropy Markov Model (MEMM)

3.1 HMM

Figure 6: Hidden Markov Model

The HMM (Figure 6) is a graphical model with latent variables Y and observed variables X. It is a simple model for sequential data such as language and speech. However, there are two issues with the HMM.

Locality of features: an HMM captures only the dependency between each state and its corresponding observation. But in real-world problems such as NLP, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on non-local features of the whole line, such as line length, indentation, and amount of white space.

Mismatch of objective function: an HMM learns the joint distribution of states and observations P(Y, X), but in a prediction task we need the conditional probability P(Y | X).

3.2 MEMM

Figure 7: Maximum Entropy Markov Model

The MEMM (Figure 7) addresses these problems of the HMM. It conditions every state on the full observation sequence, which makes the model more expressive than the HMM. It is also a discriminative model: it completely ignores modeling P(X), and its learning objective is consistent with the prediction task P(Y | X). However, the MEMM suffers from the label bias problem, a preference for states with fewer outgoing transitions over others. One solution to this problem is to not normalize the probabilities locally. This improvement gives us the conditional random field (CRF); the sketch below contrasts the two kinds of normalization.
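The following sketch contrasts local (MEMM-style) and global (CRF-style) normalization on all length-3 tag paths from a fixed start tag. The transition scores are made up, and observation features are omitted to keep the contrast minimal; this illustrates the normalization difference, not any particular tagger.

```python
import itertools
import numpy as np

# Hypothetical transition scores s[i, j] for moving from tag i to tag j.
s = np.array([[1.0, 3.0],
              [0.1, 0.2]])

def memm_prob(path):
    """MEMM: every step is normalized LOCALLY with a softmax."""
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= np.exp(s[a, b]) / np.exp(s[a]).sum()
    return p

def crf_prob(path, paths):
    """CRF: scores stay unnormalized; ONE global Z over whole sequences."""
    score = lambda q: np.exp(sum(s[a, b] for a, b in zip(q, q[1:])))
    return score(path) / sum(score(q) for q in paths)

# All length-3 tag paths starting from tag 0.
paths = [(0,) + t for t in itertools.product((0, 1), repeat=2)]
for p in paths:
    print(p, round(memm_prob(p), 3), round(crf_prob(p, paths), 3))
# Tag 1 has nearly uniform outgoing scores, yet the MEMM must still pass all
# of its mass forward; the CRF's global Z re-weights whole paths instead.
```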
4 Comparison between Markov Random Fields and Conditional Random Fields

4.1 Data

CRF: D = \{x^{(n)}, y^{(n)}\}_{n=1}^N

MRF: D = \{y^{(n)}\}_{n=1}^N

4.2 Model

CRF: P(Y \mid X, \theta) = \frac{1}{Z(X, \theta)} \prod_c \psi_c(Y_c, X_c), where \psi_c(Y_c, X_c) = \exp(\theta_c f_c(Y_c, X_c))

MRF: P(Y \mid \theta) = \frac{1}{Z(\theta)} \prod_c \psi_c(Y_c), where \psi_c(Y_c) = \exp(\theta_c f_c(Y_c))

4.3 Log likelihood

CRF: \ell(\theta; D) = \frac{1}{N} \sum_{n=1}^N \log p(y^{(n)} \mid x^{(n)}, \theta)

MRF: \ell(\theta; D) = \frac{1}{N} \sum_{n=1}^N \log p(y^{(n)} \mid \theta)

4.4 Derivatives

CRF: \frac{\partial \ell}{\partial \theta_{c,k}} = \frac{1}{N} \sum_{n=1}^N f_{c,k}(y_c^{(n)}, x_c^{(n)}) - \frac{1}{N} \sum_{n=1}^N \sum_{y_c} p(y_c \mid x^{(n)}) f_{c,k}(y_c, x_c^{(n)})

MRF: \frac{\partial \ell}{\partial \theta_{c,k}} = \frac{1}{N} \sum_{n=1}^N f_{c,k}(y_c^{(n)}) - \sum_{y_c} p(y_c) f_{c,k}(y_c)

In both cases the gradient is the difference between the empirical feature counts and the feature counts expected under the current model; see the sketch below.
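A minimal sketch of this "empirical counts minus expected counts" gradient for a single feature f_k; the indicator feature, three-example dataset, and deliberately untrained uniform model below are all toy assumptions.

```python
import numpy as np

def crf_grad_k(f_k, data, cond_dist, label_space):
    """data: (x, y) pairs; cond_dist(y, x): current model p(y | x, theta)."""
    empirical = np.mean([f_k(y, x) for x, y in data])
    expected = np.mean([sum(cond_dist(y, x) * f_k(y, x) for y in label_space)
                        for x, _ in data])
    return empirical - expected  # zero once model expectations match the data

label_space = [0, 1]
f_k = lambda y, x: float(y == x)    # indicator feature 1[y == x]
data = [(0, 0), (1, 1), (1, 0)]     # three (x, y) observations
cond_dist = lambda y, x: 0.5        # uniform p(y | x) before training
print(crf_grad_k(f_k, data, cond_dist, label_space))  # 2/3 - 1/2 ~= 0.167
```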
5 Generative vs. Discriminative Models - Recap

In simple terms, the difference between generative and discriminative models is that generative models are based on the joint distribution p(y, x), while discriminative models are based on the conditional distribution p(y | x). A typical generative-discriminative pair [1] is the Naive Bayes classifier and logistic regression. The principal advantage of discriminative models is that they can incorporate rich features with long-range dependencies. For example, in POS tagging we can incorporate features such as capitalization of the word, syntactic properties of neighbouring words, and the location of the word in the sentence. Having such features in generative models generally leads to poor performance.

Figure 8: Relationship between some generative-discriminative pairs

6 CRF Formulation

Conditional random fields are formulated as:

P(y_{1:n} \mid x_{1:n}) = \frac{1}{Z(x_{1:n})} \prod_{i=1}^n \phi(y_i, y_{i-1}, x_{1:n}) = \frac{1}{Z(x_{1:n}, w)} \prod_{i=1}^n \exp(w^T f(y_i, y_{i-1}, x_{1:n}))

An important point to note is that the CRF formulation looks more or less like the MEMM formulation; however, here the partition function is global and sits outside the product of exponential terms.

7 Properties of CRFs

CRFs are partially directed models.

CRFs are discriminative models like MEMMs.

The CRF formulation has a global normalizer Z that overcomes the label bias problem of MEMMs.

Being discriminative models, CRFs can model a rich set of features over the entire observation sequence.

8 Linear Chain CRFs

Figure 9: A linear chain CRF

In a linear chain conditional random field, we model the potentials between adjacent nodes as a Markov random field conditioned on the input x. Thus, we can formulate the linear chain CRF as:

P(y \mid x) = \frac{1}{Z(x, \lambda, \mu)} \exp\left( \sum_{i=1}^n \left( \sum_k \lambda_k f_k(y_i, y_{i-1}, x) + \sum_l \mu_l g_l(y_i, x) \right) \right)

where

Z(x, \lambda, \mu) = \sum_y \exp\left( \sum_{i=1}^n \left( \sum_k \lambda_k f_k(y_i, y_{i-1}, x) + \sum_l \mu_l g_l(y_i, x) \right) \right)

Here f_k and g_l are feature functions that can encode a rich set of features over the entire observation sequence; a brute-force sketch of this model follows.
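In this sketch of the linear chain CRF above, indicator features are assumed, so the weighted sums \sum_k \lambda_k f_k and \sum_l \mu_l g_l collapse into the lookup tables lam and mu; both tables and the toy sentence are made up.

```python
import itertools
import numpy as np

K = 2                                      # number of tags
lam = np.array([[0.5, 1.5],                # lam[i, j]: weight for the pair
                [0.2, 0.3]])               # y_{i-1} = i, y_i = j
mu = np.array([[1.0, 0.1],                 # mu[tag, word]: tag/word
               [0.4, 0.8]])                # compatibility

def log_score(y, x):
    s = sum(lam[y[i - 1], y[i]] for i in range(1, len(y)))  # sum_k lam_k f_k
    s += sum(mu[y[i], x[i]] for i in range(len(y)))         # sum_l mu_l g_l
    return s

def crf_posterior(x):
    ys = list(itertools.product(range(K), repeat=len(x)))
    scores = np.array([log_score(y, x) for y in ys])
    p = np.exp(scores - scores.max())      # stable exponentiation
    return ys, p / p.sum()                 # dividing by Z(x, lam, mu)

ys, p = crf_posterior((0, 1, 1))           # a toy 3-word sentence
print(ys[int(np.argmax(p))], p.max())      # most likely tag sequence
```

Enumerating all K^n tag sequences is only for illustration; in practice Z(x, \lambda, \mu) is computed in O(nK^2) time with the forward algorithm.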
9 CRFs vs. MRFs

9.1 Model

CRF: P(y \mid x, \theta) = \frac{1}{Z(x, \theta)} \exp(\langle \theta, f(y, x) \rangle)

MRF: P(y \mid \theta) = \frac{1}{Z(\theta)} \exp(\langle \theta, f(y) \rangle)

9.2 Average log likelihood

CRF: \ell(\theta; D) = \frac{1}{N} \sum_{n=1}^N \log p(y^{(n)} \mid x^{(n)}, \theta)

MRF: \ell(\theta; D) = \frac{1}{N} \sum_{n=1}^N \log p(y^{(n)} \mid \theta)

9.3 Derivatives

CRF: \frac{d\,\ell(\theta; D)}{d\theta_k} = \frac{1}{N} \sum_{n=1}^N f_k(y^{(n)}, x^{(n)}) - \frac{1}{N} \sum_{n=1}^N \sum_y p(y \mid x^{(n)}) f_k(y, x^{(n)})

MRF: \frac{d\,\ell(\theta; D)}{d\theta_k} = \frac{1}{N} \sum_{n=1}^N f_k(y^{(n)}) - \sum_y p(y) f_k(y)

10 Parameter Estimation

As we saw in the previous sections, CRFs have parameters \lambda_k and \mu_k, which we estimate from the training data D = \{(x^{(n)}, y^{(n)})\}_{n=1}^N with empirical distribution \tilde{p}(x, y). We can use iterative scaling algorithms or gradient descent to maximize the log-likelihood function.

11 Performance

Figure 10: CRFs vs. other sequential models for POS tagging on the Penn Treebank

12 Minimum Bayes Risk Decoding

Decoding in CRFs refers to choosing a particular y from p(y | x) such that some loss function is minimized. A minimum Bayes risk (MBR) decoder returns the assignment that minimizes the expected loss under the model distribution:

h_\theta(x) = \operatorname{argmin}_{\hat{y}} \sum_y p_\theta(y \mid x) L(\hat{y}, y)

Here L(\hat{y}, y) is a loss function such as the 0-1 loss or the Hamming loss, as in the sketch below.
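A sketch of MBR decoding under the Hamming loss, with made-up marginals: because the Hamming loss decomposes over positions, the expected loss is minimized by taking the position-wise argmax of the posterior marginals (which in practice come from forward-backward on the trained CRF). Under the 0-1 loss, the MBR decoder returns the MAP (Viterbi) sequence instead.

```python
import numpy as np

# Posterior marginals p(y_i | x) for a 3-position, 3-tag problem;
# these numbers are invented for illustration.
marginals = np.array([[0.7, 0.2, 0.1],    # p(y_1 = . | x)
                      [0.1, 0.6, 0.3],    # p(y_2 = . | x)
                      [0.4, 0.1, 0.5]])   # p(y_3 = . | x)

y_mbr = marginals.argmax(axis=1)   # position-wise argmax minimizes E[Hamming]
print(y_mbr)                       # -> [0 1 2]
```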
13 Applications - Computer Vision

A few applications of CRFs in computer vision are:

Image segmentation

Handwriting recognition

Pose estimation

Object recognition

13.1 Image segmentation

Image segmentation can be modelled with conditional random fields. We exploit the fact that the foreground and background portions of an image have pixel-wise and local characteristics that can be incorporated as CRF parameters. Thus, image segmentation can be formulated as:

Y^* = \operatorname{argmax}_{y \in \{0,1\}^n} \sum_{i \in S} V_i(y_i, X) + \sum_{i \in S} \sum_{j \in N_i} V_{i,j}(y_i, y_j)

Here Y refers to the labeling of the image into foreground and background, X are the data features, S is the set of pixels, and N_i denotes the neighbours of pixel i. A brute-force sketch of this objective follows.
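A brute-force sketch of this objective on a toy 2x2 image; the intensity-based unary term and the Potts-style smoothness term are hand-picked assumptions, and real systems use graph cuts or loopy belief propagation rather than enumeration.

```python
import itertools
import numpy as np

X = np.array([[0.9, 0.8],                # toy pixel intensities
              [0.2, 0.1]])

def unary(y_i, x_i):                     # V_i: bright pixels prefer foreground
    return x_i if y_i == 1 else 1.0 - x_i

def pairwise(y_i, y_j, w=0.5):           # V_ij: Potts-style smoothness that
    return w if y_i == y_j else 0.0      # rewards agreeing neighbours

pixels = [(r, c) for r in range(2) for c in range(2)]
edges = [((0, 0), (0, 1)), ((1, 0), (1, 1)),
         ((0, 0), (1, 0)), ((0, 1), (1, 1))]

best, best_score = None, -np.inf
for labels in itertools.product([0, 1], repeat=len(pixels)):
    y = dict(zip(pixels, labels))
    score = (sum(unary(y[p], X[p]) for p in pixels)
             + sum(pairwise(y[i], y[j]) for i, j in edges))
    if score > best_score:
        best, best_score = y, score

print(best)  # bright top row -> foreground (1), dark bottom row -> background (0)
```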