Conditional Random Fields. Mike Brodie CS 778


1 Conditional Random Fields Mike Brodie CS 778

2 Motivation Part-Of-Speech Tagger

3 Motivation object

4 Motivation I object!

5 Motivation object Do you see that object?

6 Motivation Part-Of-Speech Tagger - Java CRFTagger. INPUT FILE - Logan.txt: Logan woke up this morning and ate a big bowl of Fruit Loops. On his way to school, a small dog chased after him. Fortunately, Logan's leg had healed and he outran the dog.

7 Motivation Part-Of-Speech Tagger - Java CRFTagger. OUTPUT FILE - Logan.txt.pos: Logan/NNP woke/VBD up/RP this/DT morning/NN and/CC ate/VB a/DT big/JJ bowl/NN of/IN Fruit/NNP Loops/NNP ./. On/IN his/PRP$ way/NN to/TO school/VB ,/, a/DT small/JJ dog/NN chased/VBN after/IN him/PRP ./. Fortunately/RB ,/, Logan/NNP 's/POS leg/NN had/VBD healed/VBN and/CC he/PRP outran/VB the/DT dog/NN ./.

8 Motivation Stanford Named Entity Recognition: recognizes names, places, organizations

9 Motivation Image Applications: Segmentation, Multi-Label Classification. Labels: Hat, Shirt, Cup, Stanley

10 Motivation Bioinformatics Applications: RNA Structural Alignment, Protein Structure Prediction

11 Motivation Cupcakes with Rainbow Filling

12 Background Isn't it about time

13 Background Isn't it about time for Graphical Models? Represent a distribution as a product of local functions, each of which depends on a smaller subset of variables

14 Background Generative vs Discriminative. Generative models describe how a label vector y can probabilistically generate a feature vector x. Discriminative models describe how to take a feature vector x and assign it a label y. Either type of model can be converted to the other type using Bayes' rule: $P(x \mid y)\,P(y) = P(x, y) = P(y \mid x)\,P(x)$
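The Bayes' rule identity can be checked numerically on a small table. The sketch below assumes a made-up joint distribution over one binary feature and one binary label; the numbers and function names are illustrative only.

```python
# Toy joint distribution P(x, y) over a binary feature x and a binary label y.
# The probabilities are invented for illustration.
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

def marginal_x(x):
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(y):
    return sum(p for (_, yi), p in joint.items() if yi == y)

def p_x_given_y(x, y):
    """Generative direction: how likely is the feature given the label."""
    return joint[(x, y)] / marginal_y(y)

def p_y_given_x(y, x):
    """Discriminative direction: how likely is the label given the feature."""
    return joint[(x, y)] / marginal_x(x)

# Both factorizations recover the same joint probability.
for (x, y), p in joint.items():
    assert abs(p_x_given_y(x, y) * marginal_y(y) - p) < 1e-12
    assert abs(p_y_given_x(y, x) * marginal_x(x) - p) < 1e-12
```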

15 Background Joint Distributions P(x, y) Problems: ?

16 Background Joint Distributions P(x, y) Problems: modeling inputs => intractable models; ignoring inputs => poor performance

17 Background Joint Distributions P(x, y) Naive Bayes: generative model; strong assumptions to simplify computation

18 Background Generative vs Discriminative: Naive Bayes, Logistic Regression, Conditional

19 Background Generative vs Discriminative: Naive Bayes, Logistic Regression, Conditional. If trained to maximize conditional likelihood

20 Background Generative vs Discriminative: Naive Bayes, Logistic Regression, Conditional. If trained to maximize joint likelihood

21 Background Sequence Labeling Hidden Markov Model (HMM)

22 Background Sequence Labeling Hidden Markov Model (HMM) ASSUMPTIONS: $y_t \perp y_1, \ldots, y_{t-2} \mid y_{t-1}$ and $x_t \perp Y \setminus \{y_t\},\ X \setminus \{x_t\} \mid y_t$

23 Background Sequence Labeling Hidden Markov Model (HMM) PROBLEMS: later labels cannot influence previous labels; cannot represent overlapping features

24 Background Improvements to HMMs Maximum Entropy Markov Model

25 Background Improvements to HMMs Maximum Entropy Markov Model (MEMM), with states $S_{t-1}$, $S_t$ and observation $O_t$: $P(S_1, \ldots, S_n \mid O_1, \ldots, O_n) = \prod_{t=1}^{n} P(S_t \mid S_{t-1}, O_t)$

26 Background Improvements to HMMs Conditional Markov Model. PROS: more flexible form of context dependence; independence assumptions among labels, not observations. CONS: only models log-linear distribution over component transition distributions

27 Background Improvements to HMMs Label Bias Problem

28 Solution Conditional Random Fields: drop the local normalization constant; normalize GLOBALLY over the full graph. Avoids the label bias problem

29 Solution Conditional Random Fields: calculate a path's probability by normalizing the path score by the sum of the scores of all possible paths. Poorly matching path -> low score (probability). Well-matching path -> high score
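Global normalization can be illustrated by brute force on a two-word sentence. This is a minimal sketch, assuming a two-label tag set and hypothetical emission and transition weights; enumerating every path is only feasible for tiny examples.

```python
import itertools
import math

LABELS = ["NOUN", "VERB"]

def path_score(labels, words, emit, trans):
    """Unnormalized path score: exp of the summed emission and transition weights."""
    total = 0.0
    for t, (label, word) in enumerate(zip(labels, words)):
        total += emit.get((label, word), 0.0)
        if t > 0:
            total += trans.get((labels[t - 1], label), 0.0)
    return math.exp(total)

def path_probability(labels, words, emit, trans):
    """Normalize one path's score by the sum of the scores of all possible paths."""
    z = sum(path_score(p, words, emit, trans)
            for p in itertools.product(LABELS, repeat=len(words)))
    return path_score(labels, words, emit, trans) / z

# Hypothetical weights, chosen only for illustration.
emit = {("NOUN", "dogs"): 2.0, ("VERB", "run"): 2.0}
trans = {("NOUN", "VERB"): 1.0}
words = ["dogs", "run"]

good = path_probability(("NOUN", "VERB"), words, emit, trans)  # well-matching path
bad = path_probability(("VERB", "NOUN"), words, emit, trans)   # poorly matching path
print(good > bad)  # True
```

Because every path shares the same denominator, the probabilities over all label sequences sum to one without any per-state normalization.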

30 Solution Conditional Random Fields. Similar to CMM: discriminative; model the conditional distribution $p(y \mid x)$; allow arbitrary, overlapping features

31 Solution Conditional Random Fields. Similar to CMM: discriminative; model the conditional distribution $p(y \mid x)$; allow arbitrary, overlapping features. TAKEAWAY: retains the advantages of CMM over HMM; overcomes the label bias problem of CMM and MEMM

32 Linear-Chain CRFs Conditional Random Fields Remember the HMM

33 Linear-Chain CRFs Conditional Random Fields Remember the HMM

34 Linear-Chain CRFs Conditional Random Fields By expanding Z, we have the conditional distribution $p(y \mid x)$
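For reference, the standard linear-chain CRF conditional distribution can be written as follows. The symbols $\theta_k$ (weights), $f_k$ (feature functions), $K$ (number of features), and $T$ (sequence length) are assumed notation and may differ from the slide's own.

```latex
p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T}
  \exp\left\{ \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t) \right\},
\qquad
Z(x) = \sum_{y'} \prod_{t=1}^{T}
  \exp\left\{ \sum_{k=1}^{K} \theta_k f_k(y'_t, y'_{t-1}, x_t) \right\}
```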

35 Feature Functions students.cs.byu.edu/~mbrodie/778

37 Linear-Chain CRFs Conditional Random Fields

38 Feature Engineering. Larger feature set benefits: greater prediction accuracy; more flexible decision boundary. However, may lead to overfitting

39 Feature Engineering Label-Observation Features: the feature depends on an observation function of the word rather than on a specific word value. For example: word $x_t$ is capitalized; word $x_t$ ends in "ing"
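A label-observation feature can be sketched as a small closure that pairs an observation function with a label. The helper name `make_label_observation_feature` and the tag pairings below are hypothetical, chosen only to illustrate the idea.

```python
def is_capitalized(word):
    """Observation function: does the word start with an uppercase letter?"""
    return word[:1].isupper()

def ends_in_ing(word):
    """Observation function: does the word end in 'ing'?"""
    return word.lower().endswith("ing")

def make_label_observation_feature(observation, label):
    """Build f(y_t, x_t) = 1 if observation(x_t) holds and y_t equals label, else 0."""
    def feature(y_t, x_t):
        return 1.0 if observation(x_t) and y_t == label else 0.0
    return feature

# Hypothetical pairings of observation functions with Penn Treebank tags.
f_cap_nnp = make_label_observation_feature(is_capitalized, "NNP")
f_ing_vbg = make_label_observation_feature(ends_in_ing, "VBG")

print(f_cap_nnp("NNP", "Logan"))    # 1.0
print(f_ing_vbg("VBG", "dog"))      # 0.0
```

One observation function covers many words at once, so the feature set stays much smaller than one indicator per word-label pair.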

40 Feature Engineering Unsupported Features. For example: word $x_t$ is "with" and label $y_t$ is NOUN. Greatly increases the number of parameters.

41 Feature Engineering Unsupported Features. For example: word $x_t$ is "with" and label $y_t$ is NOUN. Greatly increases the number of parameters. However, usually gives a slight boost in accuracy. Can be useful -> assign negative weights to prevent spurious labels from receiving high probability

42 Feature Engineering Boundary Labels. Boundary labels (start/end of sentence, edge of image) can have different characteristics from other labels. For example: capitalization in the middle of a sentence generally indicates a proper noun (but not necessarily at the beginning of a sentence).

43 Feature Engineering Other Methods. Feature Induction: begin with a number of base features; gradually add conjunctions of those features. Normalize Features. Convert categorical to binary features. Features from Different Time Steps

44 Feature Engineering Other Methods. Features as Model Combination. Input-Dependent Structure, e.g. Skip-Chain CRFs: encourage identical words in a sentence to have the same label by adding an edge in the graph
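The skip-chain idea of linking identical words can be sketched as a function that returns the extra edges for one sentence. The function name `skip_edges` and the choice to link every repeated token (rather than, say, only capitalized ones) are assumptions made for illustration.

```python
from collections import defaultdict

def skip_edges(words):
    """Return index pairs (i, j) with i < j for every pair of identical words."""
    positions = defaultdict(list)
    for i, word in enumerate(words):
        positions[word].append(i)
    edges = []
    for idxs in positions.values():
        for a in range(len(idxs)):
            for b in range(a + 1, len(idxs)):
                edges.append((idxs[a], idxs[b]))
    return edges

words = "Logan saw the dog and the dog saw Logan".split()
print(skip_edges(words))  # [(0, 8), (1, 7), (2, 5), (3, 6)]
```

These edges would be added on top of the usual linear-chain edges between neighboring positions, which makes the graph structure depend on the input sentence.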

45 Linear-Chain CRFs Conditional Random Fields HMM-like CRF: previous observation affects transition

46 Feature Effects Pseudo Weighted Functions on the Brown Corpus: 38 Training, 12 Test Sentences. Features, Accuracy, # Feature Functions: yt-1 and yt xt and yt xt and yt AND xt and yt-1 and yt xt and yt AND yt-1 and yt

47 Linear-Chain CRFs Parameter Estimation. We want to find parameters for the feature functions.

48 Linear-Chain CRFs Parameter Estimation. We want to find parameters for the feature functions. Maximize the conditional log likelihood.

49 Linear-Chain CRFs Parameter Estimation

50 Linear-Chain CRFs Parameter Estimation Which becomes

51 Linear-Chain CRFs Parameter Estimation Which becomes. Penalize large weight vectors
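A sketch of the objective in a standard form, assuming $N$ training pairs $(x^{(i)}, y^{(i)})$, weights $\theta_k$, feature functions $f_k$, and an L2 penalty with variance parameter $\sigma^2$. These symbols are assumed notation, not recovered from the slides.

```latex
\ell(\theta) = \sum_{i=1}^{N} \log p\left(y^{(i)} \mid x^{(i)}\right)

\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K}
  \theta_k f_k\left(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}\right)
  - \sum_{i=1}^{N} \log Z\left(x^{(i)}\right)
  - \sum_{k=1}^{K} \frac{\theta_k^2}{2\sigma^2}
```

The first line is the plain conditional log likelihood; the second expands $\log p$ and subtracts the penalty on large weight vectors.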

52 Linear-Chain CRFs Parameter Estimation. Cannot maximize $\ell(\theta)$ in closed form. Instead use numerical optimization => find derivatives w.r.t. $\theta_k$
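In a standard form, assuming an L2-penalized conditional log likelihood with weights $\theta_k$, feature functions $f_k$, and variance parameter $\sigma^2$ (assumed notation), the partial derivative is the observed feature count minus the expected feature count minus the penalty term:

```latex
\frac{\partial \ell}{\partial \theta_k}
= \sum_{i=1}^{N} \sum_{t=1}^{T} f_k\left(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}\right)
- \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{y, y'} f_k\left(y, y', x_t^{(i)}\right)
  p\left(y_t = y, y_{t-1} = y' \mid x^{(i)}\right)
- \frac{\theta_k}{\sigma^2}
```

The second term requires the pairwise marginals $p(y_t, y_{t-1} \mid x)$, which is why training depends on inference.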


56 Linear-Chain CRFs Parameter Estimation. How to optimize $\ell(\theta)$? Steepest ascent along the gradient; Newton's method; (L-)BFGS
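Steepest ascent can be sketched on the smallest possible case: a two-label log-linear model, which behaves like a CRF over sequences of length one. The data, step size, and regularization constant below are invented for illustration, and this is not the author's implementation.

```python
import math

def log_likelihood_and_grad(theta, data, sigma2=10.0):
    """L2-regularized conditional log likelihood and its gradient.

    data holds (feature_vector, label) pairs with label in {0, 1};
    the model is p(y=1 | x) = sigmoid(theta . x).
    """
    ll = -sum(w * w for w in theta) / (2.0 * sigma2)
    grad = [-w / sigma2 for w in theta]
    for x, y in data:
        s = sum(w * xi for w, xi in zip(theta, x))
        p1 = 1.0 / (1.0 + math.exp(-s))
        ll += math.log(p1 if y == 1 else 1.0 - p1)
        for k, xi in enumerate(x):
            grad[k] += (y - p1) * xi  # observed minus expected feature value
    return ll, grad

def steepest_ascent(data, n_features, step=0.1, epochs=200):
    """Repeatedly move theta a small step along the gradient."""
    theta = [0.0] * n_features
    history = []
    for _ in range(epochs):
        ll, grad = log_likelihood_and_grad(theta, data)
        history.append(ll)
        theta = [w + step * g for w, g in zip(theta, grad)]
    return theta, history

# Made-up training data: feature 0 co-occurs with label 1, feature 1 with label 0.
data = [([1.0, 0.0], 1), ([1.0, 1.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 1)]
theta, history = steepest_ascent(data, n_features=2)
print(history[-1] > history[0])  # True
```

Newton's method and (L-)BFGS would use the same objective and gradient but choose the update direction with curvature information, which typically needs far fewer iterations.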


60 Feature Effects Pseudo Weighted Functions on the Brown Corpus: 38 Training, 12 Test Sentences. Features, Accuracy, # Feature Functions, Trained Accuracy (10 Epochs): yt-1 and yt xt and yt xt and yt AND xt and yt-1 and yt xt and yt AND yt-1 and yt

61 Linear-Chain CRFs Inference. During training: marginal distributions for each edge. During testing: Viterbi algorithm to compute the most likely labeling. Compute Z(x)
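The Viterbi step can be sketched with additive log-scores. The function below assumes hypothetical `emit` and `trans` weight dictionaries keyed by (label, word) and (previous label, label); missing entries count as zero.

```python
def viterbi(words, labels, emit, trans):
    """Return the highest-scoring label sequence under additive log-scores."""
    # best[t][y] is the best score of any path that ends in label y at position t.
    best = [{y: emit.get((y, words[0]), 0.0) for y in labels}]
    back = []
    for t in range(1, len(words)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans.get((p, y), 0.0))
            scores[y] = (best[-1][prev] + trans.get((prev, y), 0.0)
                         + emit.get((y, words[t]), 0.0))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    last = max(labels, key=lambda y: best[-1][y])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Hypothetical weights for a three-word sentence.
labels = ["DT", "NN", "VB"]
words = ["the", "dog", "barks"]
emit = {("DT", "the"): 3.0, ("NN", "dog"): 2.0,
        ("VB", "barks"): 2.0, ("NN", "barks"): 1.5}
trans = {("DT", "NN"): 1.0, ("NN", "VB"): 1.0, ("NN", "NN"): 0.2}

print(viterbi(words, labels, emit, trans))  # ['DT', 'NN', 'VB']
```

The recursion costs time proportional to the sentence length times the square of the number of labels, instead of enumerating every label sequence.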

62 Linear-Chain CRFs Inference Use Forward-Backward Approach in HMM:

63 Linear-Chain CRFs Inference Use Forward-Backward Approach in HMM: Factor Definition:

64 Linear-Chain CRFs Inference Use Forward-Backward Approach in HMM: Factor Definition: Rewrite Using Distributive Law:

65 Linear-Chain CRFs Inference Use Forward-Backward Approach in HMM: This leads to a set of forward variables

66 Linear-Chain CRFs Inference Use Forward-Backward Approach in HMM: This leads to a set of forward variables. Compute $\alpha_t$ by recursion
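The forward recursion can be sketched and checked against brute-force enumeration on a toy problem. The factor `psi`, the weight dictionaries, and the absence of a start-state transition are assumptions made to keep the example small.

```python
import itertools
import math

def psi(y, y_prev, word, emit, trans):
    """Factor for one step: exp(emission weight + transition weight)."""
    return math.exp(emit.get((y, word), 0.0) + trans.get((y_prev, y), 0.0))

def forward_z(words, labels, emit, trans):
    """Compute Z(x) by the forward recursion over alpha_t."""
    alpha = {y: math.exp(emit.get((y, words[0]), 0.0)) for y in labels}
    for word in words[1:]:
        alpha = {y: sum(alpha[p] * psi(y, p, word, emit, trans) for p in labels)
                 for y in labels}
    return sum(alpha.values())

def brute_force_z(words, labels, emit, trans):
    """Compute Z(x) by summing the score of every possible label sequence."""
    total = 0.0
    for path in itertools.product(labels, repeat=len(words)):
        score = math.exp(emit.get((path[0], words[0]), 0.0))
        for t in range(1, len(words)):
            score *= psi(path[t], path[t - 1], words[t], emit, trans)
        total += score
    return total

# Hypothetical weights for illustration.
labels = ["DT", "NN", "VB"]
words = ["the", "dog", "barks"]
emit = {("DT", "the"): 3.0, ("NN", "dog"): 2.0, ("VB", "barks"): 2.0}
trans = {("DT", "NN"): 1.0, ("NN", "VB"): 1.0}

z_forward = forward_z(words, labels, emit, trans)
z_brute = brute_force_z(words, labels, emit, trans)
print(abs(z_forward - z_brute) < 1e-9 * z_brute)  # True
```

The distributive law is what lets the sum over all paths be pushed inside the product, one position at a time.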

67 Linear-Chain CRFs Inference Use Forward-Backward Approach in HMM: Similarly, we compute a set of backward variables. Compute $\beta_t$ by recursion

68 Linear-Chain CRFs Inference Compute Marginals Needed for Gradient Using Forward and Backward Results

69 Linear-Chain CRFs Inference Compute Marginals Needed for Gradient Using Forward and Backward Results
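In a standard formulation, assuming factors $\Psi_t(j, i, x_t)$ for label $j$ at position $t$ following label $i$, forward variables $\alpha_t$, backward variables $\beta_t$ with $\beta_T(i) = 1$, and sequence length $T$ (all assumed notation), the recursions and the pairwise marginal are:

```latex
\alpha_t(j) = \sum_{i} \Psi_t(j, i, x_t)\, \alpha_{t-1}(i),
\qquad
\beta_t(i) = \sum_{j} \Psi_{t+1}(j, i, x_{t+1})\, \beta_{t+1}(j)

p(y_{t-1} = i,\ y_t = j \mid x)
= \frac{\alpha_{t-1}(i)\, \Psi_t(j, i, x_t)\, \beta_t(j)}{Z(x)},
\qquad
Z(x) = \sum_{j} \alpha_T(j)
```

These pairwise marginals are the expected feature counts that appear in the gradient of the log likelihood.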

70 Linear-Chain CRFs Inference How does this compare to CRF training/inference?

73 Breakthrough

74 My Implementation 3 Different Implementations: 1. Tensorflow

75 Tensorflow Tips: Do NOT start with CRFs

76 Tensorflow Tips: Be prepared for awkward, sometimes unintuitive formats

77 Tensorflow Gaaaaah!! len(x) => X.get_shape().dims[0].value. Start out with Theano: more widely used, more intuitive

78 My Implementation 3 Different Implementations: 1. Tensorflow 2. Scipy Optimization (fmin_l_bfgs_b)

79 My Implementation 3 Different Implementations: 1. Tensorflow 2. Scipy Optimization (fmin_l_bfgs_b) 3. Gradient Descent: A. Trainable Version, B. Simplified, Web-based Version

80 Experiments Data Sets. Brown Corpus (news articles): 5,000 POS sentences, 27,596 tags, 75/25 train-test split. Penn Treebank Corpus (WSJ): 3,914-sentence SAMPLE, $1,700 (or free BYU version), 75/25 train-test split

81 Experiments Comparison Models. Python Natural Language Toolkit (NLTK): Hidden Markov Model; Conditional Random Field; Averaged-Multilayer Perceptron (average the weight updates -> prevent radical changes due to different training batches)

82 Experiments Brown Corpus Results (chart: accuracy of HMM, Avg-MLP, NLTK-CRF, Brodie-CRF)

83 Experiments Penn Corpus Results (chart: accuracy of HMM, Avg-MLP, NLTK-CRF, Brodie-CRF)

84 Pros/Cons of Implementation. Pros: can use forward/backward and Viterbi algorithm; learn weights for features. Cons: slow; limited to gradient descent training

85 Extensions: General CRFs, Skip-chain CRFs, Hidden Node CRFs

86 CRF-RNN + CNN. Problem: traditional CNNs produce coarse outputs for pixel-level labels. Solution: apply CRF as a post-processing refinement

87 BI-LSTM-CRF

88 Future Directions: new model combinations of CRFs; better training approximations (i.e. besides MCMC sampling methods); additional hidden architectures

92 Thank You. Questions?
