Conditional Random Fields - A probabilistic graphical model. Yen-Chin Lee. Advisor: 鮑興國

1 Conditional Random Fields - A probabilistic graphical model. Yen-Chin Lee. Advisor: 鮑興國

2 Outline
Labeling sequence data problem
Introduction to conditional random fields (CRFs)
Different views on building a conditional random field (CRF)
  From directed to undirected graphical models
  From generative to discriminative models
  Sequence models
From HMMs to CRFs
Difference between MEMMs and CRFs
Parameter estimation / inference
Experiment

3 Labeling Sequence Data
X is a random variable over data sequences.
Y is a random variable over label sequences; each component $Y_i$ is assumed to range over a finite label set A.
The problem: learn how to assign labels y from the label set to a data sequence x.
Applications: computational biology, computational linguistics, information extraction.
Example:
  X:  x_1 = Thinking   x_2 = is     x_3 = being
  Y:  y_1 = noun       y_2 = verb   y_3 = noun

4 Conditional Random Fields
A form of discriminative model.
Used successfully in various domains, such as part-of-speech tagging and other natural language processing tasks.
Defined on an undirected acyclic graph:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_t \Big( \sum_i \lambda_i f_i(x, y_t) + \sum_j \mu_j g_j(x, y_t, y_{t-1}) \Big) \Big)$$
Allows some transitions to vote more strongly than others, depending on the corresponding observations.
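
The CRF formula on this slide can be checked numerically on the three-word example from slide 3. The following is a minimal sketch, assuming indicator features and hand-picked weights; the names `LABELS`, `lam`, `mu`, `score`, and `crf_prob` are illustrative and not from the slides, and the transition features here ignore x for simplicity. Z(x) is computed by brute-force enumeration, which is only feasible for tiny examples.

```python
import itertools
import math

LABELS = ["noun", "verb"]  # hypothetical label set for the slide-3 example

def score(x, y, lam, mu):
    """Unnormalized log-score: sum over t of state and transition feature weights."""
    total = 0.0
    for t in range(len(x)):
        # State feature f_i(x, y_t): indicator of (word, label), weighted by lam
        total += lam.get((x[t], y[t]), 0.0)
        if t > 0:
            # Transition feature g_j(x, y_t, y_{t-1}): indicator of (prev label, label)
            total += mu.get((y[t - 1], y[t]), 0.0)
    return total

def crf_prob(x, y, lam, mu, labels=LABELS):
    """P(y | x) = exp(score(x, y)) / Z(x), with Z(x) summed over all label sequences."""
    z = sum(math.exp(score(x, list(cand), lam, mu))
            for cand in itertools.product(labels, repeat=len(x)))
    return math.exp(score(x, y, lam, mu)) / z

# Hand-picked weights, for illustration only (not learned from data)
lam = {("Thinking", "noun"): 1.5, ("is", "verb"): 2.0, ("being", "noun"): 1.0}
mu = {("noun", "verb"): 0.5, ("verb", "noun"): 0.5}
x = ["Thinking", "is", "being"]
p = crf_prob(x, ["noun", "verb", "noun"], lam, mu)
```

Because Z(x) normalizes over whole label sequences, the probabilities of all eight labelings sum to one, and with these weights the labeling noun, verb, noun receives the largest share.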

5 Motivation
[Figure: diagram relating Bayesian Network, Naive Bayes, Logistic Regression, Hidden Markov Model, Linear-Chain Conditional Random Field, General Conditional Random Field, and Markov Random Field.]

6 Directed vs. Undirected Models
Directed models use a conditional probability, such as $P(Y \mid X)$, for each local substructure; such a model is called a Bayesian network.
Undirected models use a potential function, such as $\Psi(X, Y)$, for each local substructure; such a model is called a Markov random field or Markov network.

7 Generative vs. discriminative models
[Figure: diagram relating Bayesian Network, Naive Bayes, Logistic Regression, Hidden Markov Model, Linear-Chain Conditional Random Field, General Conditional Random Field, and Markov Random Field.]

8 Generative vs. discriminative models
                 Generative                      Discriminative
Example          Naïve Bayes                     Logistic regression
Based on         joint distribution P(y, x)      conditional distribution P(y | x)
P(x)             needs to be calculated          not needed

9 Overview: sequence models
[Figure: diagram relating Bayesian Network, Naive Bayes, Logistic Regression, Hidden Markov Model, Linear-Chain Conditional Random Field, General Conditional Random Field, and Markov Random Field.]

10 Sequence models: HMMs
Power of graphical models: they can model many interdependent variables.
An HMM models the joint distribution $P(y, x)$.
It uses two independence assumptions to do so tractably:
  1. Given its direct predecessor, each state is independent of its earlier ancestors.
  2. Each observation depends only on the current state.
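
Written out, the two independence assumptions give the standard HMM factorization of the joint distribution. The convention that $p(y_1 \mid y_0)$ stands for the initial state distribution is an assumption of this sketch, not something stated on the slide.

```latex
p(y, x) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t)
```

The first factor encodes the assumption about states, the second the assumption about observations.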

11 From HMMs to linear-chain CRFs (1)
Key: the conditional distribution $p(y \mid x)$ of an HMM is a CRF with a particular choice of feature functions, with $\lambda_{ij} = \log p(y' = i \mid y = j)$.
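
The slide's equation did not survive extraction. A standard way to make the step explicit, offered here as a sketch and not necessarily the exact formula shown on the slide, is to rewrite the HMM joint distribution in exponential form with indicator features. The emission weights $\mu_{oi}$ and the indicator notation $\mathbf{1}\{\cdot\}$ are assumptions of this sketch; note that here $\lambda$ weights transitions, following this slide, whereas slide 4 uses $\lambda$ for state features.

```latex
p(y, x) = \frac{1}{Z} \exp\Big(
    \sum_t \sum_{i,j} \lambda_{ij} \, \mathbf{1}\{y_t = i\} \, \mathbf{1}\{y_{t-1} = j\}
  + \sum_t \sum_{i} \sum_{o} \mu_{oi} \, \mathbf{1}\{y_t = i\} \, \mathbf{1}\{x_t = o\}
\Big),
\qquad
\lambda_{ij} = \log p(y' = i \mid y = j), \quad
\mu_{oi} = \log p(x = o \mid y = i), \quad
Z = 1
```

Each indicator product is one feature function; conditioning on x then yields a distribution of CRF form.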

12 From HMMs to linear-chain CRFs (2)
Last step: write the conditional probability $P(y \mid x)$ for the HMM.
A linear-chain conditional random field is then a distribution $p(y \mid x)$ that takes the same exponential form with general feature functions, as in the formula on slide 4.
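
The formula that followed "takes the form" is missing from the transcript. The usual textbook definition of a linear-chain CRF, given here as a hedged reconstruction with a single family of feature functions $f_k$ and weights $\lambda_k$ (this indexing is an assumption of the sketch), is:

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t} \sum_{k} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t} \sum_{k} \lambda_k f_k(y'_t, y'_{t-1}, x_t) \Big)
```

The state and transition features of slide 4 are both special cases of $f_k$.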

13 Maximum Entropy Markov Models (MEMMs)
A conditional model that represents the probability of reaching a state given an observation and the previous state.
Observation sequences are treated as events to be conditioned upon.
Given a training set X with label sequences Y: train a model θ that maximizes $P(Y \mid X, \theta)$.
For a new data sequence x, the predicted label sequence y maximizes $P(y \mid x, \theta)$.
Note the per-state normalization.
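
The per-state normalization can be made explicit. The following is a sketch of the usual MEMM form, with feature functions $f_k$ and weights $\lambda_k$ as assumed notation: each transition distribution is normalized separately over the successors of the previous state.

```latex
P(y_t \mid y_{t-1}, x_t) = \frac{\exp\big( \sum_k \lambda_k f_k(x_t, y_t, y_{t-1}) \big)}{Z(x_t, y_{t-1})},
\qquad
Z(x_t, y_{t-1}) = \sum_{y'} \exp\Big( \sum_k \lambda_k f_k(x_t, y', y_{t-1}) \Big),
\qquad
P(y \mid x) = \prod_{t} P(y_t \mid y_{t-1}, x_t)
```

A CRF, by contrast, uses one global normalizer $Z(x)$ over whole label sequences.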

14 MEMMs (cont'd)
MEMMs have all the advantages of conditional models.
Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states.
This makes MEMMs subject to the label bias problem: a bias toward states with fewer outgoing transitions.

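15 Label Bias Problem
Consider this MEMM:
$P(1 \text{ and } 2 \mid ro) = P(2 \mid 1 \text{ and } ro) \, P(1 \mid ro) = P(2 \mid 1 \text{ and } o) \, P(1 \mid r)$
$P(1 \text{ and } 2 \mid ri) = P(2 \mid 1 \text{ and } ri) \, P(1 \mid ri) = P(2 \mid 1 \text{ and } i) \, P(1 \mid r)$
In the training data, label value 2 is the only label value observed after label value 1.
Therefore $P(2 \mid 1) = 1$, so $P(2 \mid 1 \text{ and } x) = 1$ for all x.
Since $P(2 \mid 1 \text{ and } x) = 1$ for all x, it follows that $P(1 \text{ and } 2 \mid ro) = P(1 \text{ and } 2 \mid ri)$.
However, we expect $P(1 \text{ and } 2 \mid ri)$ to be greater than $P(1 \text{ and } 2 \mid ro)$.
Per-state normalization does not allow the model to express this preference.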
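
The argument on this slide can be reproduced with a small count-based stand-in for an MEMM. This is a minimal sketch under stated assumptions: the toy words "rib" and "rob", the state numbering, the training counts, and the back-off rule for unseen (state, observation) pairs are all illustrative choices, and relative-frequency estimates replace a trained maximum entropy model. The back-off still normalizes over the successors of each state, which is the property that causes label bias.

```python
from collections import Counter, defaultdict

def train_memm(data):
    """Estimate P(next | prev, obs) by relative frequency (stand-in for an MEMM)."""
    counts = defaultdict(Counter)      # (prev, obs) -> counts of next states
    successors = defaultdict(Counter)  # prev -> counts of next states, ignoring obs
    for obs_seq, state_seq in data:
        prev = "start"
        for obs, state in zip(obs_seq, state_seq):
            counts[(prev, obs)][state] += 1
            successors[prev][state] += 1
            prev = state
    return counts, successors

def transition_prob(model, prev, obs, nxt):
    """Per-state normalized probability of moving from prev to nxt on obs."""
    counts, successors = model
    # Assumption: for an unseen (prev, obs) pair, back off to observation-free counts.
    table = counts[(prev, obs)] if (prev, obs) in counts else successors[prev]
    total = sum(table.values())
    return table[nxt] / total if total else 0.0

def path_prob(model, obs_seq, state_seq):
    """Probability of a whole state path: product of per-state normalized steps."""
    prob, prev = 1.0, "start"
    for obs, state in zip(obs_seq, state_seq):
        prob *= transition_prob(model, prev, obs, state)
        prev = state
    return prob

# Hypothetical training set: "rib" follows states 1-2-3, "rob" follows 4-5-3
data = [("rib", [1, 2, 3])] * 3 + [("rob", [4, 5, 3])]
model = train_memm(data)
p_rib_path = path_prob(model, "rob", [1, 2, 3])  # wrong branch for input "rob"
p_rob_path = path_prob(model, "rob", [4, 5, 3])  # correct branch for input "rob"
```

State 1 has a single successor, so it passes all of its mass to state 2 whether the observation is "i" or "o". For the input "rob" the wrong branch scores 0.75 against 0.25 for the correct one, purely because the first branch was more frequent in training.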

16 Solve the Label Bias Problem
One option is to change the state-transition structure of the model, but it is not always practical to change the set of states.

17 Principles in parameter estimation
Basic principle: maximum likelihood estimation with the conditional log-likelihood
$$l(\theta) = \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)})$$
Advantage: the conditional log-likelihood is concave, so every local optimum is a global one.
Differentiating the log-likelihood with respect to each parameter $\lambda_j$ gives the gradient.
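
The gradient expression itself is missing from the transcript. For a linear-chain CRF with feature functions $f_j(y_t, y_{t-1}, x_t)$ (assumed notation) and no regularization term, the standard result is the difference between observed and expected feature counts; it is given here as a reconstruction, not as the slide's exact formula.

```latex
\frac{\partial l}{\partial \lambda_j}
= \sum_{i=1}^{N} \sum_{t} f_j\big(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}\big)
- \sum_{i=1}^{N} \sum_{t} \sum_{y, y'} f_j\big(y, y', x_t^{(i)}\big) \,
  p\big(y_t = y, \, y_{t-1} = y' \mid x^{(i)}\big)
```

At the optimum the two terms are equal: the model's expected feature counts match the counts in the training data.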

18 Principles in parameter estimation
There is no analytical solution for the parameters that maximize the log-likelihood: setting the gradient to zero and solving for θ does not, in general, yield a closed-form solution.
An iterative technique is adopted instead: iterative scaling, or gradient-based methods such as gradient descent and quasi-Newton methods.
Runtime is $O(T M^2 N G)$, where T is the length of the sequence, M the number of labels, N the number of training instances, and G the number of required gradient computations.
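
The $T M^2$ factor in the runtime comes from the dynamic program that computes $Z(x)$ and the marginals for one sequence. The following is a minimal sketch of the forward recursion in the log domain, checked against brute-force enumeration; the function names and the score tables are illustrative assumptions, with `state_scores[t][m]` and `trans_scores[a][b]` standing for already-weighted feature sums.

```python
import itertools
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def log_z_forward(state_scores, trans_scores):
    """log Z(x) via the forward recursion: T steps, each costing M * M."""
    m = len(trans_scores)
    alpha = list(state_scores[0])  # log-domain forward messages at t = 0
    for t in range(1, len(state_scores)):
        alpha = [
            state_scores[t][b]
            + logsumexp([alpha[a] + trans_scores[a][b] for a in range(m)])
            for b in range(m)
        ]
    return logsumexp(alpha)

def log_z_brute(state_scores, trans_scores):
    """Same quantity by enumerating all M**T label sequences (for checking only)."""
    m, t_len = len(trans_scores), len(state_scores)
    totals = []
    for y in itertools.product(range(m), repeat=t_len):
        s = sum(state_scores[t][y[t]] for t in range(t_len))
        s += sum(trans_scores[y[t - 1]][y[t]] for t in range(1, t_len))
        totals.append(s)
    return logsumexp(totals)

# Hypothetical scores for T = 3 positions and M = 2 labels
state_scores = [[1.5, 0.0], [0.0, 2.0], [1.0, 0.0]]
trans_scores = [[0.0, 0.5], [0.5, 0.0]]
fast = log_z_forward(state_scores, trans_scores)
slow = log_z_brute(state_scores, trans_scores)
```

Both functions return the same value, but the forward version scales linearly in T, while enumeration grows as $M^T$. Repeating the recursion for N training sequences and G gradient evaluations gives the stated overall cost.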

19 Summary of the relations between structures

20 Experiment

21 Modeling the label bias problem
A run consists of 2,000 training examples and 500 test examples, trained to convergence using the iterative scaling algorithm.
The CRF error is 4.6%, and the MEMM error is 42%.
The MEMM fails to discriminate between the two branches.
The CRF solves the label bias problem.

22 MEMM vs. HMM
The HMM outperforms the MEMM.

23 MEMM vs. CRF
The CRF usually outperforms the MEMM.

24 Summary
Discriminative models with per-state normalization, such as MEMMs, are prone to the label bias problem.
CRFs provide the benefits of discriminative models.
CRFs solve the label bias problem well and demonstrate good performance.

25 Thanks!!
