CS 6784 Paper Presentation
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
John Lafferty, Andrew McCallum, Fernando C. N. Pereira
Presented February 20, 2014

Main Contributions
This 2001 paper introduced the Conditional Random Field (CRF).
- Describes an efficient representation of field potentials in terms of features.
- Provides two algorithms for finding maximum-likelihood parameter values.
- Provides some really unconvincing examples... examples are NOT the strongest point of this paper.

Talk Structure
Brad: CRF in context; the Label Bias Problem.
Stephanie: Parameter Estimation; Experiments; Conclusion.

CRF in Context
Let X = {X_1, X_2, ...} be a set of observed random variables, Y = {Y_1, Y_2, ...} a set of label random variables, and (X, Y) a set of joint observations of X, Y.

              Generative   Discriminative
  Directed    HMM          MEMM
  Undirected  MRF          CRF

In 2001, HMM, MEMM and MRF were well known; the paper presents the CRF.

Generative vs. Discriminative
Generative: maximise the joint P(Y, X) = P(Y | X) P(X).
Discriminative: maximise the conditional P(Y | X).
When is discriminative helpful? Tractability requires independence... but sometimes there are important correlations in X.

Examples: important correlations
- Long-range interactions in human genomics
- Pronoun definition and binding
- Context in whole-scene image recognition
- Recursive structure in language
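
To make the distinction concrete, here is a minimal sketch (my own toy example, not from the paper) on a tiny discrete dataset: the generative objective forces us to estimate P(X) as well, while the discriminative objective touches only P(Y | X).

```python
from collections import Counter

data = [("rainy", "umbrella"), ("rainy", "umbrella"),
        ("sunny", "none"), ("sunny", "umbrella")]   # (x, y) pairs

joint = Counter(data)                       # generative: counts for P(Y, X)...
x_marginal = Counter(x for x, _ in data)    # ...which forces us to model P(X)

def p_joint(x, y):
    return joint[(x, y)] / len(data)

def p_conditional(y, x):
    # discriminative: P(Y | X) directly; P(X) never appears
    return joint[(x, y)] / x_marginal[x]

print(p_joint("rainy", "umbrella"))         # 0.5
print(p_conditional("umbrella", "rainy"))   # 1.0
```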

Directed vs. Undirected
For a graphical model G(E, V) with joint potential Φ(V), let C be the set of cliques (fully connected subgraphs) in G, with c ∈ C having edges E_c and vertices V_c. Finally, Dom(V) is the set of all values assumable by the random variables V (= X ∪ Y).

P(V) = (1/Z) Φ(V),   Z = Σ_{v ∈ Dom(V)} Φ(v)

Directed vs. Undirected, continued
Compactness requires factorization (Hammersley-Clifford, 1971):

Φ(V) = ∏_{c ∈ C} ψ_c(V_c)

Directed: local normalization — each ψ_c is a probability:

∀c ∈ C:  Σ_{v ∈ Dom(V_c)} ψ_c(v) = 1

Undirected: global normalization relaxes this constraint... but what does it buy us?
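
As a sketch of what global normalization means operationally (a toy model of my own, assuming binary variables on a three-node chain): the clique potentials below are arbitrary positive numbers, and a single constant Z computed over all joint assignments turns the product into a distribution.

```python
import itertools

labels = [0, 1]
# Potentials for a 3-node chain V1 - V2 - V3; the cliques are the two edges.
psi_12 = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}
psi_23 = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 1.0}

def phi(v):
    v1, v2, v3 = v
    return psi_12[(v1, v2)] * psi_23[(v2, v3)]   # product over cliques

# Global normalization: one Z over all assignments, not one per clique.
Z = sum(phi(v) for v in itertools.product(labels, repeat=3))

def p(v):
    return phi(v) / Z   # P(V) = Phi(V) / Z

assert abs(sum(p(v) for v in itertools.product(labels, repeat=3)) - 1) < 1e-9
print(p((0, 1, 0)))
```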

The Label Bias Problem: Conditional Markov Model (MEMM)
Toy problem: a fragment of an MEMM. [The slide shows training data and three tables whose numeric entries were lost in transcription: P(Y_1 | X_1 = r) for Y_1 ∈ {1, 2}; the relative joint counts of (Y_2, X_2, Y_1) for Y_2 ∈ {3, 4}, X_2 ∈ {i, o}; and the locally normalised conditional P(Y_2 | X_2, Y_1).]

Viterbi decoding: let's try it for X = (r, o). For each labeling (Y_1, Y_2), compute P(Y_1 | r) · P(Y_2 | Y_1, o) to get P(Y_1, Y_2 | r, o). Which labeling wins? ...but we want the other one. What happened?

Local normalization requires each conditional to be a probability. So a state whose outgoing conditional has low entropy must pass nearly all of its mass to its preferred successor regardless of the observation X_2 — the observation is effectively ignored, and the wrong labeling wins.
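
Since the slide's numeric tables did not survive transcription, here is a sketch of the same effect with assumed numbers: state 1's locally normalised conditional nearly ignores X_2, so its branch keeps winning even when the observation points the other way.

```python
# Assumed values (NOT the slide's): Y1 in {1, 2}, Y2 in {3, 4}, X = ('r', 'o').
p_y1 = {1: 0.6, 2: 0.4}                 # P(Y1 | X1 = 'r'): 'r' is ambiguous
p_y2 = {                                # P(Y2 | X2 = 'o', Y1)
    ('o', 1): {3: 0.9, 4: 0.1},         # state 1 nearly ignores X2
    ('o', 2): {3: 0.1, 4: 0.9},
}

scores = {(y1, y2): p_y1[y1] * p_y2[('o', y1)][y2]
          for y1 in (1, 2) for y2 in (3, 4)}
print(max(scores, key=scores.get), scores)
# (1, 3) wins with 0.54 even though X2 = 'o' points to branch 2 (0.36):
# once Viterbi enters state 1, local normalisation forces the mass onward.
```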

The Label Bias Problem: Potentials
Toy problem: the same fragment as a CRF. [The slide shows the training data and tables of unnormalised potentials ψ(X_1, Y_1), ψ(Y_1, Y_2), ψ(X_2, Y_2); the numeric entries were lost in transcription.]

For each labeling (Y_1, Y_2), aggregate ψ(X_1, Y_1) · ψ(Y_1, Y_2) · ψ(X_2, Y_2) for X = (r, o), then normalise. Which labeling wins, now? Qualitatively, the four labelings score small, tiny, infinitesimal, and ≈100% — the desired labeling now takes essentially all of the mass.

Because potentials do not have to normalize into probabilities until AFTER aggregation, they don't suffer from inappropriate conditioning.
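
And a companion sketch with assumed potential values for the CRF version of the toy: because scores are unnormalised until the very end, the observation potential ψ(X_2, Y_2) can be strong enough to outweigh the first step.

```python
# Assumed potential values (NOT the slide's) for X = ('r', 'o').
psi_x1y1 = {('r', 1): 3.0, ('r', 2): 3.0}
psi_y1y2 = {(1, 3): 9.0, (1, 4): 1.0, (2, 3): 1.0, (2, 4): 9.0}
psi_x2y2 = {('o', 3): 0.1, ('o', 4): 50.0}   # 'o' strongly favours Y2 = 4

score = {(y1, y2):
         psi_x1y1[('r', y1)] * psi_y1y2[(y1, y2)] * psi_x2y2[('o', y2)]
         for y1 in (1, 2) for y2 in (3, 4)}
Z = sum(score.values())                       # normalise once, at the end
probs = {k: v / Z for k, v in score.items()}
print(max(probs, key=probs.get), probs)       # (2, 4) takes ~90% of the mass
```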

Fun fact: we have seen this in class before!
Graphical model G(E, V) with joint potential Φ(V), C the set of cliques in G with c ∈ C having edges E_c and vertices V_c:

P(V) ∝ Φ(V) = ∏_{c ∈ C} ψ_c(V_c)

M³ nets: cliques are pairs, and all conditioned on the observed x:

P(y | x) ∝ ∏_{(i,j) ∈ E} ψ_ij(y_i, y_j, x)

AMNs: cliques are pairs of nodes and singletons:

P(y) = (1/Z) ∏_i ψ_i(y_i) ∏_{(i,j) ∈ E} ψ_ij(y_i, y_j)

Where do Parameters Come From?
CRFs are part of the same general class: P(y | x) ∝ ∏_c ψ_c(y|_c, x).

For trees, the cliques are pairs of vertices sharing an edge (e) and single vertices (v):

Φ = ∏_e ψ_e · ∏_v ψ_v

And because this is a conditional network, each potential can see the whole observation: ψ_e(e, y|_e, x) and ψ_v(v, y|_v, x).

An exponential identity, ψ = exp(log ψ), gives us

P(y | x) ∝ exp( Σ_e log ψ_e(e, y|_e, x) + Σ_v log ψ_v(v, y|_v, x) )

Potentials can be ANY positive values — like exponentiated linear combinations of arbitrary features:

P_θ(y | x) ∝ exp( Σ_{e,k} λ_k f_k(e, y|_e, x) + Σ_{v,k} μ_k g_k(v, y|_v, x) )
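
A minimal sketch of this exponential form for a linear chain, with a single (assumed) edge feature f and vertex feature g; real taggers use thousands of features:

```python
import math

def f_edge(y_prev, y_cur, x, i):    # edge features f_k
    return [1.0 if (y_prev, y_cur) == ('NOUN', 'VERB') else 0.0]

def g_vertex(y, x, i):              # vertex features g_k
    return [1.0 if x[i].istitle() and y == 'NOUN' else 0.0]

lam = [1.5]   # edge weights  (lambda_k)
mu = [2.0]    # vertex weights (mu_k)

def unnormalised_score(y, x):
    s = 0.0
    for i in range(len(x)):
        s += sum(w * g for w, g in zip(mu, g_vertex(y[i], x, i)))
        if i > 0:
            s += sum(w * f for w, f in zip(lam, f_edge(y[i - 1], y[i], x, i)))
    return math.exp(s)

x = ['Alice', 'runs']
print(unnormalised_score(['NOUN', 'VERB'], x))   # exp(2.0 + 1.5) ~ 33.1
# Dividing by Z(x), the sum over all label sequences, gives P(y | x).
```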

Training Algorithms: Improved Iterative Scaling
Want to maximize the log-likelihood with respect to the parameters θ = (λ_1, λ_2, ...; μ_1, μ_2, ...).

Method: Improved Iterative Scaling¹, an extension of Generalised Iterative Scaling (Darroch and Ratcliff, 1972). "Improved" because the features need not sum to a constant.

Idea: find a new set of parameters θ' = θ + Δ = (λ_1 + Δλ_1, ...; μ_1 + Δμ_1, ...) which will not decrease the objective function. Apply iteratively!

Problem: slow, and nobody uses this any more.

¹ Della Pietra et al. (1997)

Training Algorithms: Modern CRF Training with L-BFGS
Generally use the L-BFGS² algorithm. It approximates Newton's method: optimise a multivariate function f(θ) through the updates

θ_{t+1} = θ_t − [H f(θ_t)]⁻¹ ∇f(θ_t)

Quasi-Newton: approximates the Hessian H f(θ). Limited-memory: doesn't store the full (approximate) Hessian.

² Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm
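
In practice one hands the negative log-likelihood and its gradient to an off-the-shelf L-BFGS routine. A sketch using SciPy's optimiser on a stand-in objective (a tiny regularised logistic model, not an actual CRF):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
reg = 1.0   # L2 regulariser keeps the optimum finite

def neg_log_likelihood(theta):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return (-np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
            + 0.5 * reg * theta @ theta)

def gradient(theta):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return X.T @ (p - y) + reg * theta

result = minimize(neg_log_likelihood, x0=np.zeros(2),
                  jac=gradient, method='L-BFGS-B')
print(result.x)   # fitted weights
```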

Performance: Label Bias
Generate data with a noisy HMM: five states (not counting the initial state), with transitions 1 → 2 → 3 and 4 → 5 → 3. Emissions are highly biased: P(X = Y's preferred value | Y) = 29/32 and P(X = other | Y) = 1/32 for each of the three other symbols. Preferred values: 1 → r, 4 → r, 2 → i, 5 → o, 3 → b.

Result: CRF error 4.6%, MEMM error 42%.
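
A sketch of that generating process (the uniform initial branch choice and the terminal handling of state 3 are my assumptions; the slide does not spell them out):

```python
import random

transitions = {0: [1, 4], 1: [2], 4: [5], 2: [3], 5: [3]}   # 0 = initial state
preferred = {1: 'r', 4: 'r', 2: 'i', 5: 'o', 3: 'b'}
alphabet = ['r', 'i', 'o', 'b']

def emit(state):
    # P(preferred symbol | state) = 29/32; each other symbol gets 1/32.
    if random.random() < 29 / 32:
        return preferred[state]
    return random.choice([s for s in alphabet if s != preferred[state]])

def sample_sequence():
    state, xs, ys = 0, [], []
    while state in transitions:              # state 3 is terminal
        state = random.choice(transitions[state])
        ys.append(state)
        xs.append(emit(state))
    return xs, ys

print(sample_sequence())   # e.g. (['r', 'i', 'b'], [1, 2, 3])
```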

Performance: Mixed-Order Sources
Generate data with a mixed-order HMM:
Transitions: p_α(y_i | y_{i−1}, y_{i−2}) = (1 − α) p_1(y_i | y_{i−1}) + α p_2(y_i | y_{i−1}, y_{i−2})
Emissions: p_α(x_i | y_i, x_{i−1}) = (1 − α) p_1(x_i | y_i) + α p_2(x_i | y_i, x_{i−1})
Five labels, 26 observation values. Training/testing: 1000 sequences of length 25. CRF trained with Algorithm S (modified IIS); MEMM trained with iterative scaling; Viterbi used to label the test set.
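
A sketch of the mixed-order source with randomly initialised component distributions (everything beyond the mixture form itself is an assumption of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
L, V, alpha = 5, 26, 0.5   # labels, observation values, mixing weight

def random_dist(*shape):
    d = rng.random(shape)
    return d / d.sum(axis=-1, keepdims=True)   # normalise last axis

p1_trans, p2_trans = random_dist(L, L), random_dist(L, L, L)
p1_emit, p2_emit = random_dist(L, V), random_dist(L, V, V)

def sample(length=25):
    y0 = int(rng.integers(L))
    ys, xs = [y0], [int(rng.choice(V, p=p1_emit[y0]))]
    for i in range(1, length):
        # mix first- and second-order transition distributions
        p2 = p2_trans[ys[-2], ys[-1]] if i >= 2 else p1_trans[ys[-1]]
        pt = (1 - alpha) * p1_trans[ys[-1]] + alpha * p2
        ys.append(int(rng.choice(L, p=pt)))
        # mix first- and second-order emission distributions
        pe = (1 - alpha) * p1_emit[ys[-1]] + alpha * p2_emit[ys[-1], xs[-1]]
        xs.append(int(rng.choice(V, p=pe)))
    return xs, ys

print(sample()[1][:10])
```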

Mixed-Order Sources: Results
[Scatter plot of CRF vs. MEMM error lost in transcription.] Squares mark runs with α < 0.5. CRF sort of wins?

Performance: Part-of-Speech Tagging
Penn Treebank: 45 syntactic tags; label each word in a sentence. Train a first-order HMM, MEMM, and CRF.

Spelling features exploit the conditional framework. Examples: starts with a number or upper case? contains a hyphen? has a particular suffix?
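
A sketch of such spelling features (the feature names are my own); note that they inspect the observation freely, which is fine because we condition on X rather than generate it:

```python
def spelling_features(word):
    return {
        'starts_with_number': word[:1].isdigit(),
        'starts_with_upper': word[:1].isupper(),
        'contains_hyphen': '-' in word,
        'suffix_-ing': word.endswith('ing'),
        'suffix_-ly': word.endswith('ly'),
    }

print(spelling_features('Self-labeling'))
# {'starts_with_number': False, 'starts_with_upper': True,
#  'contains_hyphen': True, 'suffix_-ing': True, 'suffix_-ly': False}
```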

Skip-Chain CRF
Example: the skip-chain CRF³. It has long-range features! Basic idea: extend a linear-chain CRF by joining some distant observations with skip edges, connecting multiple mentions of an entity across a whole document.

³ From "An Introduction to Conditional Random Fields for Relational Learning", Charles Sutton and Andrew McCallum, 2006.

Skip-Chain CRF: Example
[Factor graph figure lost in transcription.] Note: squares denote factors (e.g. potential functions).
Question: ignoring the skip edges (in blue), what potentials does Y_i appear in?
Answer: Ψ(Y_i, Y_{i−1}, X_i) and Ψ(Y_{i+1}, Y_i, X_{i+1}).

Skip-Chain Task
- Data: 485 announcements for seminars at CMU.
- Task: identify start time, end time, location, speaker.
- Model: a linear-chain CRF with skip edges between identical capitalised words (see the sketch below).
- Other word-specific features, e.g. appears in a list of first names, upper case, appears to be part of a time/date (by regex), etc.
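
A sketch of the skip-edge construction (function and variable names are assumptions of mine):

```python
def skip_edges(tokens):
    seen = {}
    edges = []
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():                 # capitalised words only
            for j in seen.get(tok, []):
                edges.append((j, i))          # skip edge between mentions
            seen.setdefault(tok, []).append(i)
    return edges

tokens = 'Speaker : Prof. Smith . Smith will discuss CRFs .'.split()
print(skip_edges(tokens))   # [(3, 5)] joins the two mentions of "Smith"
```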

Skip-Chain Results
[Results table lost in transcription; values are F1 scores.]
Repeated occurrences of the speaker improve skip-chain performance. Tokens are consistently classified by the skip-chain model: the linear chain is inconsistent on 30.2 speakers, the skip chain on only 4.8.

Summary
CRFs combine discriminative (e.g. MEMM) and undirected (e.g. MRF) properties to solve problems:
- Global normalisation avoids label bias.
- Conditioning on observations avoids modelling complex dependencies.
- Enables the use of features built on global structure.
The examples in the paper are strangely insubstantial, but CRFs are widely and successfully used.