(Multinomial) Logistic Regression + Feature Engineering

1 10-601 Introduction to Machine Learning, Machine Learning Department, School of Computer Science, Carnegie Mellon University. (Multinomial) Logistic Regression + Feature Engineering. Matt Gormley. Lecture 9, Feb. 14, 2018

2 Reminders
Homework 3: KNN, Perceptron, Lin.Reg. Out: Wed, Feb 7. Due: Wed, Feb 14 at 11:59pm
Homework 4: Logistic Regression. Out: Wed, Feb 14. Due: Fri, Feb 23 at 11:59pm

3 MULTINOMIAL LOGISTIC REGRESSION

4 Multinomial Logistic Regression (Chalkboard)
- Background: Multinomial distribution
- Definition: Multi-class classification
- Geometric intuitions
- Multinomial logistic regression model
- Generative story
- Reduction to binary logistic regression
- Partial derivatives and gradients
- Applying Gradient Descent and SGD
- Implementation w/ sparse features
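For reference, a common way to write the model and the per-example gradient named in the topics above. The notation is an assumption chosen to match the K by M matrix theta used in the exercises that follow, not a transcription of the chalkboard.

```latex
% Multinomial logistic regression: one weight vector \theta_k \in \mathbb{R}^M per class k
p(y = k \mid \mathbf{x}; \boldsymbol{\theta})
  = \frac{\exp(\boldsymbol{\theta}_k^\top \mathbf{x})}
         {\sum_{l=1}^{K} \exp(\boldsymbol{\theta}_l^\top \mathbf{x})}

% Negative log-likelihood of a single training example (x^{(i)}, y^{(i)})
J^{(i)}(\boldsymbol{\theta})
  = -\log p(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta})

% Gradient with respect to the weight vector of class k
\nabla_{\boldsymbol{\theta}_k} J^{(i)}(\boldsymbol{\theta})
  = \bigl( p(y = k \mid \mathbf{x}^{(i)}; \boldsymbol{\theta})
           - \mathbb{I}[y^{(i)} = k] \bigr)\, \mathbf{x}^{(i)}
```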

5 Debug that Program!
In-Class Exercise: Think-Pair-Share. Debug the following program, which is (incorrectly) attempting to run SGD for multinomial logistic regression.
Buggy Program:
    while not converged:
        for i in shuffle([1, ..., N]):
            for k in [1, ..., K]:
                theta[k] = theta[k] - lambda * grad(x[i], y[i], theta, k)
Assume: grad(x[i], y[i], theta, k) returns the gradient of the negative log-likelihood of the training example (x[i], y[i]) with respect to vector theta[k]. lambda is the learning rate. N = # of examples. K = # of output classes. M = # of features. theta is a K by M matrix.

6 Debug that Program!
In-Class Exercise: Think-Pair-Share. Debug the following program, which is (incorrectly) attempting to run SGD for multinomial logistic regression.
Buggy Program:
    while not converged:
        for i in shuffle([1, ..., N]):
            for k in [1, ..., K]:
                for m in [1, ..., M]:
                    theta[k,m] = theta[k,m] + lambda * grad(x[i], y[i], theta, k, m)
Assume: grad(x[i], y[i], theta, k, m) returns the partial derivative of the negative log-likelihood of the training example (x[i], y[i]) with respect to theta[k,m]. lambda is the learning rate. N = # of examples. K = # of output classes. M = # of features. theta is a K by M matrix.
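A minimal sketch of what a working version could look like, offered as one reading of the exercise rather than the official answer. It assumes NumPy, 0-indexed class labels, and a fixed number of epochs in place of the convergence test; the helper names (softmax, nll_gradient, sgd, predict) and the toy data are hypothetical. Two things differ from the buggy programs: the gradient for every class is computed at the current theta before any entry is changed, and the step is subtracted, since the objective is a negative log-likelihood.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array of class scores."""
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


def nll_gradient(x_i, y_i, theta):
    """Gradient of the per-example negative log-likelihood w.r.t. theta (K x M)."""
    probs = softmax(theta @ x_i)      # shape (K,)
    probs[y_i] -= 1.0                 # p(k | x) - 1[y == k]
    return np.outer(probs, x_i)       # shape (K, M)


def sgd(X, y, num_classes, learning_rate=0.1, num_epochs=50, seed=0):
    """Plain SGD. Classes are 0-indexed here, unlike the 1..K pseudocode."""
    rng = np.random.default_rng(seed)
    n_examples, n_features = X.shape
    theta = np.zeros((num_classes, n_features))
    for _ in range(num_epochs):       # assumption: fixed epochs, not a convergence test
        for i in rng.permutation(n_examples):
            # Evaluate the whole gradient at the current theta first ...
            grad = nll_gradient(X[i], y[i], theta)
            # ... then update all K rows at once, stepping downhill.
            theta -= learning_rate * grad
    return theta


def predict(X, theta):
    """Most probable class for each row of X."""
    return np.argmax(X @ theta.T, axis=1)


# Hypothetical toy data: three well-separated classes, last column is a bias feature.
X_toy = np.array([[1.0, 0.0, 1.0],
                  [0.9, 0.1, 1.0],
                  [0.0, 1.0, 1.0],
                  [0.1, 0.9, 1.0],
                  [-1.0, -1.0, 1.0],
                  [-0.9, -1.1, 1.0]])
y_toy = np.array([0, 0, 1, 1, 2, 2])

theta_hat = sgd(X_toy, y_toy, num_classes=3)
train_accuracy = float(np.mean(predict(X_toy, theta_hat) == y_toy))
```

Computing the full K by M gradient in one call also avoids recomputing the softmax for each k (or each k, m pair), which the loop structure in the exercise would otherwise do.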

7 FEATURE ENGINEERING

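8 Handcrafted Features
[Figure: parse of "Egypt - born Proyas directed" (lemmas: egypt - born proyas direct) with POS tags NNP : VBN NNP VBD, phrase labels S, NP, VP, ADJP, entity types LOC and PER, and relation label born-in. The features f(·) extracted from this structure feed the model p(y | x) ∝ exp(θ_y · f(x)).]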

9 Where do features come from?
[Diagram axes: Feature Engineering vs. Feature Learning]
hand-crafted features (Sun et al., 2011; Zhou et al., 2005):
- First word before M1
- Second word before M1
- Bag-of-words in M1
- Head word of M1
- Other word in between
- First word after M2
- Second word after M2
- Bag-of-words in M2
- Head word of M2
- Bigrams in between
- Words on dependency path
- Country name list
- Personal relative triggers
- Personal title list
- WordNet Tags
- Heads of chunks in between
- Path of phrase labels
- Combination of entity types

10 Where do features come from?
[Diagram axes: Feature Engineering vs. Feature Learning]
- hand-crafted features: Sun et al., 2011; Zhou et al., 2005
- word embeddings: Mikolov et al., 2013
[Figure: CBOW model in Mikolov et al. (2013). Input (context words) -> look-up table -> embedding -> classifier -> missing word, trained by unsupervised learning. Similar words get similar embeddings (e.g. cat, dog).]

11 Where do features come from?
[Diagram axes: Feature Engineering vs. Feature Learning]
- hand-crafted features: Sun et al., 2011; Zhou et al., 2005
- word embeddings: Mikolov et al., 2013
- string embeddings: Socher, 2011; Collobert & Weston, 2008
[Figure: two ways to embed the string "The [movie] showed [wars]": Convolutional Neural Networks (CNN) with pooling (Collobert and Weston 2008) and Recursive Auto Encoder (RAE) (Socher 2011).]

12 Where do features come from?
[Diagram axes: Feature Engineering vs. Feature Learning]
- hand-crafted features: Sun et al., 2011; Zhou et al., 2005
- word embeddings: Mikolov et al., 2013
- string embeddings: Socher, 2011; Collobert & Weston, 2008
- tree embeddings: Socher et al., 2013; Hermann & Blunsom, 2013
[Figure: parse tree of "The [movie] showed [wars]" with node labels S, NP, VP and composition matrices W_{NP,VP}, W_{DT,NN}, W_{V,NN}.]

13 Where do features come from?
[Diagram axes: Feature Engineering vs. Feature Learning]
- hand-crafted features: Sun et al., 2011; Zhou et al., 2005
- word embedding features: Turian et al. 2010; Koo et al. 2008
- word embeddings: Mikolov et al., 2013; Hermann et al. 2014
- tree embeddings: Socher et al., 2013; Hermann & Blunsom, 2013
- string embeddings: Socher, 2011; Collobert & Weston, 2008

14 Where do features come from?
[Diagram axes: Feature Engineering vs. Feature Learning]
- hand-crafted features: Sun et al., 2011; Zhou et al., 2005
- word embedding features: Turian et al. 2010; Hermann et al. 2014; Koo et al. 2008
- word embeddings: Mikolov et al., 2013
- tree embeddings: Socher et al., 2013; Hermann & Blunsom, 2013
- string embeddings: Socher, 2011; Collobert & Weston, 2008
best of both worlds?

15 Feature Engineering for NLP
Suppose you build a logistic regression model to predict a part-of-speech (POS) tag for each word in a sentence. What features should you use?
Tags:     deter.  noun   noun   verb     verb      noun
Words:    The     movie  I      watched  depicted  hope

16 Feature Engineering for NLP
Per-word Features:
- is-capital(w_i)
- endswith(w_i, "e")
- endswith(w_i, "d")
- endswith(w_i, "ed")
- w_i == "aardvark"
- w_i == "hope"
Examples: x^(1)   x^(2)  x^(3)  x^(4)    x^(5)     x^(6)
Tags:     deter.  noun   noun   verb     verb      noun
Words:    The     movie  I      watched  depicted  hope
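The per-word templates above can be sketched as a small feature function. This is an illustration only: the function name per_word_features, the feature-name strings, and the reading of is-capital as "first character is uppercase" are assumptions made here.

```python
FEATURE_NAMES = [
    "is-capital",
    "endswith-e",
    "endswith-d",
    "endswith-ed",
    "w==aardvark",
    "w==hope",
]


def per_word_features(word):
    """Return the six binary per-word features as a list of 0/1 values."""
    return [
        int(word[:1].isupper()),      # assumed meaning of is-capital(w_i)
        int(word.endswith("e")),
        int(word.endswith("d")),
        int(word.endswith("ed")),
        int(word == "aardvark"),
        int(word == "hope"),
    ]


sentence = ["The", "movie", "I", "watched", "depicted", "hope"]

# One row per example x^(1), ..., x^(6); columns follow FEATURE_NAMES.
feature_matrix = [per_word_features(w) for w in sentence]
```

Each row of feature_matrix is the feature vector that a logistic regression tagger would see for the corresponding word.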

17 Feature Engineering for NLP
Context Features:
- w_i == "watched"
- w_{i+1} == "watched"
- w_{i-1} == "watched"
- w_{i+2} == "watched"
- w_{i-2} == "watched"
Examples: x^(1)   x^(2)  x^(3)  x^(4)    x^(5)     x^(6)
Tags:     deter.  noun   noun   verb     verb      noun
Words:    The     movie  I      watched  depicted  hope

18 Feature Engineering for NLP
Context Features:
- w_i == "I"
- w_{i+1} == "I"
- w_{i-1} == "I"
- w_{i+2} == "I"
- w_{i-2} == "I"
Examples: x^(1)   x^(2)  x^(3)  x^(4)    x^(5)     x^(6)
Tags:     deter.  noun   noun   verb     verb      noun
Words:    The     movie  I      watched  depicted  hope
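The context templates on the two slides above test the words at fixed offsets around position i. A minimal sketch, assuming 0-indexed positions and treating positions outside the sentence as "no match"; the function name context_features and the feature-name format are hypothetical.

```python
def context_features(sentence, i, target, offsets=(0, 1, -1, 2, -2)):
    """Binary features: does the word at position i + offset equal `target`?"""
    features = {}
    for offset in offsets:
        j = i + offset
        # Guard against negative j, which Python would otherwise read from the end.
        in_range = 0 <= j < len(sentence)
        name = f"w[i{offset:+d}]=={target}"
        features[name] = int(in_range and sentence[j] == target)
    return features


sentence = ["The", "movie", "I", "watched", "depicted", "hope"]

# Features for the word "I" (position 2) with the target word "watched":
features_for_I = context_features(sentence, 2, "watched")

# Features for every position, one dict per word.
all_positions = [context_features(sentence, i, "watched") for i in range(len(sentence))]
```

In practice one such template is instantiated for every word in the vocabulary, which is why the resulting feature vectors are very sparse.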

19 Feature Engineering for NLP
Table from Manning (2011). Table 3. Tagging accuracies with different feature templates and other changes on the WSJ 19-21 development set.
Model | Feature Templates | # Feats | Sent. Acc. | Token Acc. | Unk. Acc.
3gramMemm | See text | 248,… | …% | 96.92% | 88.99%
naacl 2003 | See text and [] | 46,… | …% | 97.5% | 88.6%
Replication | See text and [] | 46,… | …% | 97.8% | 88.92%
Replication | +rarefeaturethresh = 5 | 482,… | …% | 97.9% | 88.96%
5w | + <t_0, w_-2>, <t_0, w_2> | 73,… | …% | 97.2% | 89.3%
5wShapes | + <t_0, s_-1>, <t_0, s_0>, <t_0, s_+1> | 73,… | …% | 97.25% | 89.8%
5wShapesDS | + distributional similarity | 737,… | …% | 97.28% | 9.46%
Tags:     deter.  noun   noun   verb     verb      noun
Words:    The     movie  I      watched  depicted  hope

20 Feature Engineering for NLP
Suppose you want to predict whether the word is the root (i.e. predicate) of the sentence. What features should you use?
Tags:     deter.  noun        noun   verb     verb      noun
Words:    The     [movie]_M1  I      watched  depicted  [hope]_M2

21 Feature Engineering for NLP
Per-word Features:
- f_1: on-path(w_i)
- f_2: is-between(w_i)
- f_3: head-of-M1(w_i)
- f_4: head-of-M2(w_i)
- f_5: before-M1(w_i)
- f_6: before-M2(w_i)
Tags:     deter.  noun        noun   verb     verb      noun
Words:    The     [movie]_M1  I      watched  depicted  [hope]_M2

22 Feature Engineering for NLP
Per-word Features:
- on-path(w_i)
- is-between(w_i)
- head-of-M1(w_i)
- head-of-M2(w_i)
- before-M1(w_i)
- before-M2(w_i)
f_5
Tags:     deter.  noun        noun   verb     verb      noun
Words:    The     [movie]_M1  I      watched  depicted  [hope]_M2

23 Feature Engineering for NLP
Per-word Features (with conjunction):
- on-path(w_i) && w_i == "depicted"
- is-between(w_i) && w_i == "depicted"
- head-of-M1(w_i) && w_i == "depicted"
- head-of-M2(w_i) && w_i == "depicted"
- before-M1(w_i) && w_i == "depicted"
- before-M2(w_i) && w_i == "depicted"
f_5
Tags:     deter.  noun        noun   verb     verb      noun
Words:    The     [movie]_M1  I      watched  depicted  [hope]_M2
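Conjunction features like those above can be generated mechanically from base features. The sketch below is illustrative: the function name conjoin_with_word is hypothetical, and the base feature values passed in are made-up placeholders, since computing on-path, is-between, and the rest would need a parse and mention spans that the slide does not spell out.

```python
def conjoin_with_word(base_features, word, vocabulary):
    """Conjoin each binary base feature with each word-identity test.

    The feature named 'f && w==v' fires only when base feature f is on
    and the current word equals v.
    """
    conjoined = {}
    for name, value in base_features.items():
        for vocab_word in vocabulary:
            fires = bool(value) and word == vocab_word
            conjoined[f"{name} && w=={vocab_word}"] = int(fires)
    return conjoined


# Placeholder base feature values for one word (hypothetical, not taken from the slide).
base_for_word = {
    "on-path": 1,
    "is-between": 1,
    "head-of-M1": 0,
    "head-of-M2": 0,
    "before-M1": 0,
    "before-M2": 1,
}

conjoined = conjoin_with_word(base_for_word, "depicted", ["depicted", "watched"])
```

A linear model cannot express "on the path and equal to depicted" using the two base features alone, so the conjunction gets its own weight at the cost of a larger, sparser feature space.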

24 Feature Engineering for CV
- Edge detection (Canny)
- Corner detection (Harris)
Figures from …

25 Feature Engineering for CV
Scale Invariant Feature Transform (SIFT)
Figure from Lowe (1999) and Lowe (2004)
