CRF Feature Induction

Size: px

Start display at page:

Download "CRF Feature Induction"

Leon Sanders
6 years ago
Views:

1 CRF Feature Induction Andrew McCallum Efficiently Inducing Features of Conditional Random Fields Kuzman Ganchev

2 1 Introduction Basic Idea Aside: Transformation Based Learning Notation/CRF Review 2 Arbitrary CRFs 3 Linear CRFs 4 Results

3 Basic Idea: Greedy Search Assume we have a large set of candidate features. Repeat until < something >: Train a CRF using current set of features Add best new feature(s) best means greatest increase in log likelihood

4 Aside: One-slide TBL Input labeled structured data (e.g. POS tagged text) large set of rule candidates (type: if X and Y then Z) Learning method: greedy search Repeat until < something >: Add best new feature(s) where best means greatest error reduction. Output of learning: Ordered set of rules

5 Notation x - an observed instance (input to classifier) y - a label instance (output of classifier) C(x) - the set of cliques of the Markov graph for a particular observed instance. n i - set of neighbors of node y i. Classifier is: arg max p(y x) y ( ) p(y x) = 1 Z(x) exp λ c f c (y c x) c C

6 Linear CRFs Graph is just a chain of nodes Can represent all clique features as features of edges. ( ) p(y x) = 1 Z(x) exp λ i f i (y i, y i 1 x) i

7 Basic Idea: Greedy Search (repeat) Assume we have a large set of candidate features. Repeat until < something >: Train a CRF using current set of features Add best new feature(s) best means greatest increase in log likelihood

8 General CRFs: Log Likelihood Objective function is log likelihood of training data (i is a training instance): L = i = i log(p(y i x i )) ( ) λ c f c (y i,c x i ) logz(x i ) c C Defined for a set of features and their weights Λ. Convex in feature weights: we can take max Λ L Λ.

9 General CRFs: Maximizing Gain We want features that give us best improvement in log likelihood define G(g), gain for feature g: Last term is regularization. G(g) = max µ G Λ(g, µ) = max µ L Λ,g,µ L Λ µ2 2σ 2

10 General CRFs: Assume Λ doesn t change We begin by assuming weights for old features remain unchanged. This gives us (sum over instances omitted): G(g) = max L Λ,g,µ L Λ µ2 µ 2σ 2 = max µ c C µg(y c x) logz Λ,g,µ (x) + logz Λ (x) µ2 2σ 2 No need to retrain each feature considered Still retrain for features actually added

11 General CRFs: Mean Field Approximation p(y x) v V p(y v x) (avoids joint inference) calculate max likelihood distributions for all output values p(y i = y x). To infer distribution at node y, assume distributions at all other nodes are fixed. p Λ,g,µ (y i = y x) y p Λ,g,µ (y i = y n i = y, x)p Λ (n i = y x)

12 General CRFs: Gain only for some nodes With mean field approximation: G(g) = max µ v V µg(y v n v, x) logz Λ,g,µ (x) + logz Λ (x) µ2 2σ 2 Restrict calculation to wrong or borderline label nodes Use only nodes that are mislabeled by the current model or nodes that are within some margin. Quoth McCallum: do exact inference for a while at start.

13 Linear CRFs: Agglomerated Features transition features less important than observation features define agglomerative features g(y i n i 1, x) g(y i x) This gives us: Z i (Λ, g, µ) is easy to compute. p Λ,g,µ (y i = y x) = p Λ(y i = y x)e µg(y i,x) Z i (Λ, g, µ) Exact calculation of p Λ (y i = y x)

14 Linear CRFs: Agglomerated Features (contd) Still use only wrong y i for gain. G Λ,g,µ = ( ) e µg(y i,x) log Z i (Λ, g, µ) i wrong = wrong µe[g] i wrong µ2 2σ 2 log(e Λ (e µ,g x)) µ2 2σ 2 Claim: This converges in just 10 iterations of Newton s method

15 Linear CRFs: Other simplifications Add many features at a time After adding features train for only a few iterations of BFGS Claim: this helps reduce overfitting Does it really?

16 English named entity extraction. Without induction With induction Prec Recall F1 Prec Recall F1 PER LOC ORG MISC Overall

Conditional Random Fields and beyond D A N I E L K H A S H A B I C S U I U C,

Conditional Random Fields and beyond D A N I E L K H A S H A B I C S 5 4 6 U I U C, 2 0 1 3 Outline Modeling Inference Training Applications Outline Modeling Problem definition Discriminative vs. Generative