Log-linear models. Natural Language Processing: Lecture Kairit Sirts

1 Log-linear models. Natural Language Processing: Lecture Kairit Sirts

2 The goal of today's lecture: introduce the log-linear (maximum entropy) model; explain the model components (features, parameters, log-likelihood function); explain the role of regularization; show how log-linear models are trained.

3 N-gram Language Model. Example sentence: "The girl with the flowers is cute." P(sentence) = P(The) * P(girl | The) * P(with | The girl) * P(the | girl with) * P(flowers | with the) * P(is | the flowers) * P(cute | flowers is)

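4 N-gram Language Model. In general: $P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_{i-k+1}, \dots, w_{i-1})$, where $k$ is the order of the n-gram model. In case of a trigram model ($k = 3$): $P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})$.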
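A minimal sketch of how trigram probabilities like the ones in the example sentence could be estimated by relative frequency from counts. The `<s>` and `</s>` padding symbols, the function names, and the two-sentence toy corpus are assumptions made for illustration and are not part of the lecture.

```python
from collections import Counter


def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts, padding each sentence."""
    trigrams, contexts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(tokens)):
            context = (tokens[i - 2], tokens[i - 1])
            trigrams[context + (tokens[i],)] += 1
            contexts[context] += 1
    return trigrams, contexts


def trigram_prob(word, context, trigrams, contexts):
    """Relative-frequency estimate of P(word | context); 0.0 for an unseen context."""
    if contexts[context] == 0:
        return 0.0
    return trigrams[context + (word,)] / contexts[context]


# Hypothetical toy corpus.
corpus = ["The girl with the flowers is cute", "The girl with the hat is tall"]
trigrams, contexts = train_trigram_counts(corpus)
p_flowers = trigram_prob("flowers", ("with", "the"), trigrams, contexts)
p_is = trigram_prob("is", ("the", "flowers"), trigrams, contexts)
```

Because the context "the flowers" occurs only once in this toy corpus, the estimate for "is" after it is based on a single observation, which is the kind of sparsity the later slides address.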

5 Linear Interpolation. Interpolate the trigram estimates with the bigram and unigram estimates. The interpolation coefficients are typically tuned on a development set. Unknown words are replaced with a dummy word UNK.
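Linear interpolation is commonly written as a weighted sum of the trigram, bigram and unigram estimates, with non-negative coefficients that sum to 1. The sketch below assumes that form; the function names and the coefficient values are placeholders for illustration, not tuned values from the lecture.

```python
def interpolate(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Weighted sum of trigram, bigram and unigram probability estimates."""
    if min(lambdas) < 0 or abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("coefficients must be non-negative and sum to 1")
    l_tri, l_bi, l_uni = lambdas
    return l_tri * p_tri + l_bi * p_bi + l_uni * p_uni


def replace_unknown(tokens, vocabulary, unk="UNK"):
    """Map every out-of-vocabulary token to the dummy word."""
    return [t if t in vocabulary else unk for t in tokens]


# An unseen trigram still gets probability mass from the lower-order estimates.
p = interpolate(p_tri=0.0, p_bi=0.2, p_uni=0.01)
tokens = replace_unknown(["the", "zebra"], {"the", "girl"})
```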

6 What if we could include more information? Consider the whole context of a sentence; consider an even larger context (previous sentences); consider the type of the word (part of speech): NOUN, VERB, ADJECTIVE, ADVERB, etc.; consider morphological categories and syntactic information.

7 [Figure: the sentence "The girl with the flowers is cute" annotated with its full context, POS tags, morphology, and syntax]

9 Extended interpolated model? Problems: the model becomes very heavy and complex; the model is very sparse, since most probabilities are 0; it is difficult to optimize the interpolation coefficients directly. Solution: log-linear modeling.

10 Problem definition. Inputs: $x \in \mathcal{X}$. Outputs: $y \in \mathcal{Y}$. We want to model the conditional probability $p(y \mid x)$.

11 In terms of language modeling. Inputs $x$: all possible sequences, with all related annotations (morphological, syntactic, etc.). Outputs $y$: the words of the vocabulary. We want to model $p(y \mid x)$, the probability of word $y$ given context $x$.

12 The output set should be finite. The input set can be anything: finite, countably infinite, or uncountably infinite.

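13 Log-linear model. Components: inputs and outputs, $x \in \mathcal{X}$ and $y \in \mathcal{Y}$; a finite number of parameters, $d$; a feature mapping function, $f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$; a parameter vector, $\theta \in \mathbb{R}^d$.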

14 Features and parameters. Some terminology: $f(x, y)$ is the feature vector; $f_k(x, y)$ is a feature (feature function). Features make it possible to represent different properties of the input. Each feature is associated with a parameter. The parameter values are estimated during training from the training pairs.

15 Features for language modeling. Each feature is an indicator function that returns 0 or 1.
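As an illustration of indicator features, the sketch below defines three hypothetical feature functions over a word history `x` (a list of preceding words) and a candidate next word `y`. The function names and the specific words they test for are assumptions chosen to match the running example sentence.

```python
def f_trigram(x, y):
    """1 if the two previous words are 'the flowers' and the candidate is 'is'."""
    return 1 if x[-2:] == ["the", "flowers"] and y == "is" else 0


def f_bigram(x, y):
    """1 if the previous word is 'flowers' and the candidate is 'is'."""
    return 1 if x[-1:] == ["flowers"] and y == "is" else 0


def f_unigram(x, y):
    """1 if the candidate word is 'is', regardless of the history."""
    return 1 if y == "is" else 0


history = ["The", "girl", "with", "the", "flowers"]
features = (f_trigram, f_bigram, f_unigram)
vector_is = [f(history, "is") for f in features]
vector_are = [f(history, "are") for f in features]
```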

16 Feature templates. Feature templates group together similar feature functions: trigram feature template; bigram feature template; unigram feature template; templates involving POS tags; templates involving morphological tags; templates involving syntactic relations; etc.
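One common way to implement templates is to generate a feature name per template for each (input, output) pair, so that every distinct name corresponds to one indicator feature. The sketch below assumes this string-naming scheme; the template names, the `<s>` padding symbol, and the POS tags in the example are hypothetical.

```python
def extract_features(history, pos_history, y):
    """Return the names of the features that fire for candidate word y."""
    padded = ["<s>", "<s>"] + list(history)  # guarantee two words of context
    w1, w2 = padded[-2], padded[-1]
    prev_pos = pos_history[-1] if pos_history else "<s>"
    return {
        f"TRIGRAM:{w1}_{w2}_{y}",    # trigram template
        f"BIGRAM:{w2}_{y}",          # bigram template
        f"UNIGRAM:{y}",              # unigram template
        f"POS_BIGRAM:{prev_pos}_{y}",  # template involving the previous POS tag
    }


active = extract_features(["with", "the", "flowers"], ["ADP", "DET", "NOUN"], "is")
```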

17 Feature sparsity. The number of features is potentially very large (even millions). For each training instance, the feature vector is very sparse: mostly 0s, only a few 1s. In practice there is no need to consider features with 0 values. $Z(x, y)$ is the set of non-zero features for the training pair $(x, y)$. Typically $Z(x, y)$ is much smaller than the total number of features.
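With binary features, the inner product of the parameter vector and the feature vector reduces to a sum of the weights of the non-zero features. A minimal sketch, assuming the weights are stored in a dictionary keyed by hypothetical feature names and that features without a stored weight count as 0:

```python
def sparse_score(weights, active_features):
    """Sum the weights of the non-zero features only."""
    return sum(weights.get(name, 0.0) for name in active_features)


# Hypothetical weights; a real model could hold millions of entries.
weights = {
    "TRIGRAM:the_flowers_is": 1.5,
    "BIGRAM:flowers_is": 0.5,
    "UNIGRAM:is": -0.2,
}
active = {"TRIGRAM:the_flowers_is", "BIGRAM:flowers_is", "UNIGRAM:is", "POS_BIGRAM:NOUN_is"}
score = sparse_score(weights, active)
```

The cost of scoring depends on the number of active features, not on the total number of features in the model.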

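18 The model: $p(y \mid x; \theta) = \frac{\exp(\theta \cdot f(x, y))}{\sum_{y'} \exp(\theta \cdot f(x, y'))}$. Here $p(y \mid x; \theta)$ is a probability; $\theta$ is a vector of real numbers that can contain both positive and negative numbers; $f(x, y)$ and $f(x, y')$ are sparse binary vectors.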

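19 Numerator: $\exp(\theta \cdot f(x, y))$. $\theta$ is a real vector; $f(x, y)$ is a sparse binary vector; $\theta \cdot f(x, y)$ is the sum of the weight values for the non-zero features, a real number that can be positive or negative; $\exp(\theta \cdot f(x, y))$ is a positive real number.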

20 Example sentence: "The girl with the flowers is cute."

21 Denominator: $\sum_{y'} \exp(\theta \cdot f(x, y'))$. It sums over all possible $y$ values and serves as the normalisation constant for the probability distribution.
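Putting numerator and denominator together, the conditional probability of each output can be computed as an exponentiated score divided by the sum of exponentiated scores over all outputs. The sketch below assumes dictionary-based weights and a hypothetical `features_for` function; subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import math


def features_for(x, y):
    """Hypothetical feature extractor: one bigram and one unigram feature."""
    return {f"BIGRAM:{x[-1]}_{y}", f"UNIGRAM:{y}"}


def log_linear_probs(weights, x, labels):
    """Return p(y | x) for every y in labels under a log-linear model."""
    scores = {y: sum(weights.get(f, 0.0) for f in features_for(x, y)) for y in labels}
    max_score = max(scores.values())  # shift scores for numerical stability
    exps = {y: math.exp(s - max_score) for y, s in scores.items()}
    normaliser = sum(exps.values())   # the denominator, up to the shift
    return {y: e / normaliser for y, e in exps.items()}


# Hypothetical weights and a three-word output vocabulary.
weights = {"BIGRAM:flowers_is": 2.0, "UNIGRAM:is": 0.5, "UNIGRAM:are": 0.3}
probs = log_linear_probs(weights, ["the", "flowers"], ["is", "are", "cute"])
```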

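22 Why log-linear model? $\log p(y \mid x; \theta) = \theta \cdot f(x, y) - g(x)$, where $g(x) = \log \sum_{y'} \exp(\theta \cdot f(x, y'))$. The log probability depends linearly on the features $f(x, y)$. $g(x)$ depends only on the input and can thus be treated as a constant.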

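23 Parameter estimation: MLE. MLE (Maximum Likelihood Estimation): find the parameters that maximise the log-likelihood of the training data. Training data: $(x_i, y_i)$ for $i = 1, \dots, n$. Log-likelihood: $L(\theta) = \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta)$. The MLE estimate: $\theta_{MLE} = \arg\max_{\theta} L(\theta)$.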

25 Problems with MLE. What if we have seen a feature only once in the training set? Count(the, flowers, is) = 1 and Count(the, flowers) = 1, so the maximum likelihood estimate of P(is | the flowers) is 1/1 = 1. Overfitting! This problem becomes especially relevant when there are lots of features and the number of features is larger than the number of training items.

26 Solution: regularization. A regularization term is added to the objective. Regularization forces the model to prefer smaller parameter values. In regularized models the number of features can be larger than the number of training items.

27 L2-regularization: forces all parameter values to be close to 0; most parameters are non-zero. L1-regularization: forces a sparse solution; most parameter values are 0; performs feature selection.
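The two regularizers are commonly written as penalties subtracted from the log-likelihood. The formulas below are a sketch of that standard form, assuming $L(\theta)$ denotes the log-likelihood, $\theta_k$ the individual parameters, and $\lambda > 0$ a regularization strength chosen by the modeller; the exact scaling used in the lecture (for example the factor 1/2) is not known.

```latex
% L2-regularized objective: penalizes the squared magnitude of the parameters
\theta^{*}_{L2} = \arg\max_{\theta} \left( L(\theta) - \frac{\lambda}{2} \sum_{k} \theta_k^{2} \right)

% L1-regularized objective: penalizes the absolute magnitude of the parameters
\theta^{*}_{L1} = \arg\max_{\theta} \left( L(\theta) - \lambda \sum_{k} \lvert \theta_k \rvert \right)
```

The squared penalty shrinks all parameters smoothly toward 0, while the absolute-value penalty can push individual parameters exactly to 0, which is why it yields sparse solutions.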

28 How to find the optimal parameters? There is no closed-form solution. The objective function is convex. Use gradient-based methods. Typically we minimize the negative log-likelihood.

29 Derivatives. [Figure. Source: Machine Learning 101: https://medium. [...] tech/machine- learning be2e0a86c96a]

30 Convex function. [Figure. Source: Machine Learning 101: https://medium. [...] tech/machine- learning be2e0a86c96a]

31 Gradient Descent. [Figure. Source: Machine Learning 101: https://medium. [...] tech/machine- learning be2e0a86c96a]

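32 Gradient descent algorithm. Set: $\theta^{(0)}$ to an initial value (for example, all zeros). Repeat until convergence: calculate the gradient $\nabla J(\theta^{(t)})$ of the objective $J$ (for example, the negative log-likelihood); set $\theta^{(t+1)} = \theta^{(t)} - \eta \nabla J(\theta^{(t)})$. $\eta$ is the learning rate.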
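The loop described above can be sketched in a few lines of plain Python. The function below is an illustration, not the lecture's implementation: the stopping rule (largest parameter change below a tolerance), the default learning rate, and the toy quadratic objective are all assumptions.

```python
def gradient_descent(grad, theta0, learning_rate=0.1, tol=1e-8, max_iters=10_000):
    """Minimize a function given its gradient, starting from theta0."""
    theta = list(theta0)
    for _ in range(max_iters):
        g = grad(theta)
        new_theta = [t - learning_rate * gi for t, gi in zip(theta, g)]
        # Stop when the parameters barely change between iterations.
        if max(abs(a - b) for a, b in zip(new_theta, theta)) < tol:
            return new_theta
        theta = new_theta
    return theta


def toy_gradient(theta):
    """Gradient of the convex toy objective (t0 - 3)^2 + (t1 + 1)^2."""
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]


theta_hat = gradient_descent(toy_gradient, [0.0, 0.0])
```

For this toy objective the minimum is at (3, -1), and the returned parameters approach it to within the tolerance.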

33 When the learning rate is too large. [Figure. Source: Linear Regression with NumPy]
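A small numeric illustration of the effect, using the hypothetical one-dimensional objective $J(\theta) = \theta^2$ with gradient $2\theta$. Each update multiplies $\theta$ by $(1 - 2\eta)$, so the iterates shrink toward the minimum at 0 when that factor has magnitude below 1 and grow without bound when it exceeds 1. The learning rates 0.1 and 1.1 are arbitrary choices for the demonstration.

```python
def run_gradient_descent(learning_rate, steps=20, theta=1.0):
    """Apply gradient descent to J(theta) = theta**2 and return the final theta."""
    for _ in range(steps):
        gradient = 2 * theta
        theta -= learning_rate * gradient
    return theta


small_rate_result = run_gradient_descent(0.1)  # factor 0.8 per step: converges
large_rate_result = run_gradient_descent(1.1)  # factor -1.2 per step: diverges
```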

34 Gradient-based methods. There are more sophisticated gradient-based methods that converge faster than simple gradient descent. L-BFGS approximates second-derivative (curvature) information from the recent history of gradients, without computing second derivatives explicitly. [Figure. Source: Optimization isn't rocket science in ML: [...] isnt- rocket- science- ml- shamane- siriwardhana]

35 Tools. The LogisticRegression model in the scikit-learn Python package. MegaM, which can be interfaced from NLTK. The Stanford maximum entropy classifier, a Java implementation. SciPy optimize, which offers various optimization algorithms; you must provide implementations of the objective function and its first derivative (with respect to the parameters).
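As a sketch of the SciPy route, the example below passes an objective and its gradient to `scipy.optimize.minimize` with the L-BFGS-B method. The objective is an L2-regularized negative log-likelihood of a log-linear model; the dense feature tensor `F`, the toy data, and the regularization strength 0.1 are assumptions made for illustration. With `jac=True`, the function is expected to return the objective value and the gradient together.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp


def objective_and_grad(theta, F, y, lam):
    """Regularized negative log-likelihood and its gradient.

    F has shape (n_examples, n_labels, n_features); y holds gold label indices.
    """
    scores = F @ theta                       # (n_examples, n_labels)
    log_z = logsumexp(scores, axis=1)        # log normalisation constants
    rows = np.arange(len(y))
    nll = -(scores[rows, y] - log_z).sum() + 0.5 * lam * (theta @ theta)
    probs = np.exp(scores - log_z[:, None])  # p(label | example)
    empirical = F[rows, y].sum(axis=0)       # observed feature counts
    expected = np.einsum("ny,nyk->k", probs, F)  # model-expected feature counts
    return nll, -(empirical - expected) + lam * theta


# Toy data: a binary input and a binary label; feature 2*x + label fires for (x, label).
x = np.array([0, 0, 0, 1, 1, 1])
y = np.array([0, 0, 1, 1, 1, 0])
F = np.zeros((len(x), 2, 4))
for i, xi in enumerate(x):
    for label in range(2):
        F[i, label, 2 * xi + label] = 1.0

result = minimize(objective_and_grad, x0=np.zeros(4), args=(F, y, 0.1),
                  jac=True, method="L-BFGS-B")
theta_hat = result.x
```

In the toy data, label 0 is more frequent for input 0 and label 1 for input 1, so the fitted weights for the matching (input, label) features end up larger than those for the mismatched ones.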

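36 Partial derivatives. $\frac{\partial L(\theta)}{\partial \theta_k} = \sum_{i} f_k(x_i, y_i) - \sum_{i} \sum_{y} p(y \mid x_i; \theta) f_k(x_i, y)$. The first term is the empirical count and the second term is the expected count. The partial derivative of the log-likelihood of a log-linear model is always the difference between the empirical feature counts and the expected feature counts.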
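The claim that the gradient equals empirical minus expected feature counts can be checked numerically. The sketch below computes the gradient of the (unregularized) log-likelihood that way and compares it against finite differences. The data representation, where each training item lists the active feature indices for every label, and the toy values are assumptions for illustration.

```python
import math


def label_probs(theta, feats):
    """p(label | x), where feats[label] lists the active feature indices."""
    scores = [sum(theta[k] for k in active) for active in feats]
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood(theta, data):
    return sum(math.log(label_probs(theta, feats)[y]) for feats, y in data)


def gradient(theta, data):
    """Empirical feature counts minus expected feature counts."""
    grad = [0.0] * len(theta)
    for feats, y in data:
        probs = label_probs(theta, feats)
        for k in feats[y]:
            grad[k] += 1.0              # empirical count
        for label, active in enumerate(feats):
            for k in active:
                grad[k] -= probs[label]  # expected count
    return grad


# Hypothetical toy data: (active features per label, gold label).
data = [([[0, 2], [1]], 0), ([[0], [1, 2]], 1), ([[0, 2], [1]], 1)]
theta = [0.1, -0.2, 0.3]
analytic = gradient(theta, data)

eps = 1e-6
numeric = []
for k in range(len(theta)):
    up = theta[:k] + [theta[k] + eps] + theta[k + 1:]
    down = theta[:k] + [theta[k] - eps] + theta[k + 1:]
    numeric.append((log_likelihood(up, data) - log_likelihood(down, data)) / (2 * eps))
```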

37 Relation to logistic regression. Logistic regression is a log-linear model for binary classification. When the linear score $\theta \cdot x$ is 0, the probabilities of y=0 and y=1 are equal (0.5).
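The connection can be made concrete with a small sketch. Assume a two-label log-linear model in which the feature vector for y=1 is the input vector itself and the feature vector for y=0 is all zeros; this particular feature design is an assumption chosen for the illustration. The probability of y=1 then coincides with the logistic (sigmoid) function of the linear score.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_linear_binary(theta, x):
    """p(y=1 | x) for a two-label log-linear model with f(x, 1) = x and f(x, 0) = 0."""
    score_one = sum(t * xi for t, xi in zip(theta, x))
    score_zero = 0.0
    normaliser = math.exp(score_zero) + math.exp(score_one)
    return math.exp(score_one) / normaliser


theta = [0.5, -1.0]
x = [2.0, 0.25]
p_log_linear = log_linear_binary(theta, x)
p_sigmoid = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
p_at_zero_score = log_linear_binary(theta, [0.0, 0.0])
```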

38 Comparison to N-gram models. Advantages: can use arbitrary features; can model longer context, even beyond the current sentence; do not need the Markov assumption; can also look into the future. Disadvantages: more complex features (syntactic, morphological) need annotated data; the structure of the model and the training are in general more complex; cannot be used to generate text when the features look into the future.

39 Recap: what you should know. The general form of the log-linear model: $p(y \mid x; \theta) = \frac{\exp(\theta \cdot f(x, y))}{\sum_{y'} \exp(\theta \cdot f(x, y'))}$. When to use a log-linear model? For multi-class classification with lots of sparse features. Why is regularization necessary? To avoid overfitting the parameters to the training data.
