AKA: Logistic Regression Implementation


1 AKA: Logistic Regression Implementation

2 Supervised classification is the problem of predicting to which category a new observation belongs. The category is chosen from a list of predefined categories. In unsupervised classification, the categories are unspecified and unknown; the machine learning algorithm discovers categories from an analysis of the training examples. Example cases in which the data analysis task is supervised classification: after having trained on a labeled set of training examples, the learner is presented with a new (unlabeled) e-mail, and must decide if it is spam or not spam (yes or no?); a credit card transaction, and must decide if it is fraudulent or not (yes or no?); an image of a cell, and must decide if it is cancerous or not (malignant or benign?). In the above examples a classifier is constructed to predict categorical labels. These labels are yes or no for e-mail and credit card transactions, and malignant or benign for cells.

3 Difference Between Classification and Regression. CLASSIFICATION: In classification, a classifier assigns a given input to one of a finite set of categories (typically a set of two categories). A classifier learns how to classify an input by undergoing a training phase, in which it trains on a set of training examples and constructs a model. After the learning phase, the classifier uses the learned model to classify new data presented to it. The output is a discrete value (0/1), which indicates its membership in a category. REGRESSION: In regression, the learner predicts a continuous value for each given input. The predictor learns how to predict values by fitting a curve to a set of training examples, thereby constructing a model. After the learning stage, the predictor predicts a value for a new input by applying the input to the learned curve and extracting the output value. The output is a continuous value on the fitted curve.

4 Typically, there are only two categories in supervised classification (i.e., two possible outputs of the classifier): 0: Negative Class (e.g., benign tumor); 1: Positive Class (e.g., malignant tumor). There can also be more than two categories, aka the multiclass classification problem.

5 Try Applying Linear Regression to a Classification Problem. Tumor size is labelled as either Yes (malignant) or No (benign) by the supervised training data. Linear regression produces a straight line to approximate the training data. New inputs would be fitted to the straight line, and generally are not mapped to the labels; some new inputs do not even fall between the labels 0 and 1. But we need to output 0 (benign) or 1 (malignant). Therefore, linear regression, in its unmodified standard form, does not make sense for classification problems.

6 Modifying Linear Regression for Classification Problems. [Figure: fitted line over tumor size, with a threshold at 0.5.] We can try modifying linear regression, for example, by modifying our interpretation of the result. We may interpret it as follows: for all tumor sizes such that $h_\theta(x) < 0.5$, the tumor is benign, and otherwise ($h_\theta(x) \ge 0.5$) it is malignant.
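To make the idea concrete, here is a minimal Octave sketch (not from the slides; the data values and variable names are assumptions) that fits a line to labeled tumor data and thresholds the prediction at 0.5:

tumorsize = [1 2 3 4 5 6 7 8]';   % hypothetical feature values
label     = [0 0 0 0 1 1 1 1]';   % 0 = benign, 1 = malignant
X = [ones(length(tumorsize), 1) tumorsize];
theta = X \ label;                % least-squares line fit
h = [1 4.5] * theta;              % predict for a new tumor of size 4.5
if h >= 0.5
  prediction = 1;                 % malignant
else
  prediction = 0;                 % benign
end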

7 Incorrectly Classified. Relearning with an additional training example, while still using the previous threshold value of 0.5, causes the linear regression model to incorrectly classify some of the training examples. The threshold could be changed, say to a different value. Note: a user sets the threshold value, but it may need to change for each new example added to the training set, especially in online training mode, and it would be inappropriate for a user to change the threshold each time a new input is added. We would rather prefer the threshold be learned by the machine in the process of training. Therefore, linear regression, and even a slight modification of linear regression, is not appropriate for classification problems.

8 Instead of using linear regression for classification problems, let us develop a classifier from scratch. First, let's look at typical linearly separable data for classification purposes, and then determine what function can be used to separate the data. In many real-world cases, the data is not totally separable, due to noise and other causes, so we need to modify the interpretation of the output of the classifier, e.g., in terms of the classifier's certainty or probability that an input belongs to a certain class. Our goal is to use the training data to learn the function that separates the data.

9 Linear Decision Boundary. Consider a classification system in which the data is represented by two features, $x_1$ and $x_2$. As shown, the training examples may be separated by a line embedded in 2-dimensional space; the line forms a decision boundary, $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$. For all feature points that lie above or on the line, classify them as the x class; otherwise as the o class. Note that for all feature points that lie above or on the decision boundary, $\theta_0 + \theta_1 x_1 + \theta_2 x_2 \ge 0$. Likewise, for all feature points below the line, $\theta_0 + \theta_1 x_1 + \theta_2 x_2 < 0$. Predict x if $\theta^T x \ge 0$; predict o if $\theta^T x < 0$.

10 Decision boundary. Note that the LHS of the decision boundary equation, $\theta_0 + \theta_1 x_1 + \theta_2 x_2$, is very similar to the hypothesis $h_\theta(x) = \theta^T x$ used for regression. We will also call $h_\theta(x)$ the hypothesis for this classification problem. To perform the classification task, we introduce a threshold function: if $h_\theta(x) \ge 0$, then $y = 1$; else if $h_\theta(x) < 0$, then $y = 0$. In regression, we used the hypothesis equation to fit a curve to the training data. In classification, we use the hypothesis equation to separate the training data into one of two labels (two, for this example). Let: predict $y = 1$ if $\theta^T x \ge 0$; predict $y = 0$ if $\theta^T x < 0$.
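As a small illustration (the weights here are assumptions, not values from the slides), an Octave sketch of the thresholded linear hypothesis:

theta = [-3; 1; 1];      % hypothetical weights: boundary is x1 + x2 = 3
x = [1; 2; 2.5];         % new input, with bias term x0 = 1
if theta' * x >= 0
  y = 1;                 % predict the "x" class
else
  y = 0;                 % predict the "o" class
end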

11 Non-linear Decision Boundary. Let the hypothesis include quadratic feature terms, e.g. $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2$. Predict $y = 1$ if $h_\theta(x) \ge 0$; predict $y = 0$ if $h_\theta(x) < 0$.

12 Non-linear Decision Boundary. Depending on the polynomials or other non-linear functions used in the hypothesis, other, more general, non-linear boundaries are possible.

13 Note that a step function was used to decide if a given input should be classified as class 1 or class 0, i.e., the classifier makes binary decisions. Input object, binary decision. Classifier: output 1 if $\theta^T x \ge 0$; output 0 if $\theta^T x < 0$. A problem with using a step function is that most real-world classification problems cannot be simply separated by a crisp boundary (i.e., a step function). The world is not black and white; there are different shades of grey. For data points very close to the decision boundary, the classifier may or may not classify them correctly. Real-world problems have incomplete, noisy, and possibly incorrect data, and so the classifier may not correctly classify those inputs.

14 A better approach would be to have the classifier output a measure of its level of certainty that the input data belongs to a given class. For example, consider the Logistic function (aka Sigmoid function): $g(z) = \frac{1}{1 + e^{-z}}$. As the value of $g(z)$ gets closer to the midpoint of its range (i.e., close to 0.5), the classifier becomes increasingly uncertain about its classification, while as the values get very far from the midpoint (i.e., close to 0.0 or 1.0), the classifier becomes very certain about its classification. Note that $0 < g(z) < 1$, i.e., $g(z) \in (0, 1)$. This is desired, because the class labels are $y \in \{0, 1\}$.
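A one-line Octave sketch of the logistic function, evaluated at a few sample points to show the certainty interpretation:

g = @(z) 1.0 ./ (1.0 + exp(-z));   % logistic (sigmoid) function
g(0)     % 0.5     -> maximally uncertain
g(6)     % ~0.9975 -> very certain the input is in class 1
g(-6)    % ~0.0025 -> very certain the input is NOT in class 1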

15 $\theta^T x$ is computed as before in regression, but, in classification, the result is input to a threshold function, in this case the Logistic function: $h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$, which may be interpreted as the probability that $y = 1$, given input $x$. If the logistic classifier gives an output very close to 1.0, then it is very certain that the input training example belongs to the class label 1. If the logistic classifier gives an output near 0.5, then it is uncertain whether the input training example belongs to the class label 1. If the logistic classifier gives an output very close to 0.0, then it is very certain that the input training example does not belong to the class label 1.

16 Let $x = (x_0, x_1)^T = (1, \text{tumor size})^T$, and suppose $h_\theta(x) = 0.7$. Then, tell patient: 70% chance of tumor being malignant. $h_\theta(x) = P(y = 1 \mid x; \theta)$: the probability that $y = 1$, parametrized by $\theta$, given $x$. Usual probability properties: $P(y = 0 \mid x; \theta) + P(y = 1 \mid x; \theta) = 1$.

17 Predict $y = 1$ if $h_\theta(x) \ge 0.5$; predict $y = 0$ if $h_\theta(x) < 0.5$.

18 Training set: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$; $m$ examples, with $y \in \{0, 1\}$ and $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$. How should we choose the values of the weights (parameters) $\theta$?

19 Linear regression: $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$. Logistic regression: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$. For logistic regression, let us consider using the same cost function as in linear regression (we don't need the $\frac{1}{2}$ term). The problem is that the non-linear $h_\theta(x)$ in logistic regression will yield a non-convex cost function, which will have multiple minima. We would like to have a convex cost function, so that there will be only one minimum, i.e., the global minimum.

20 A function is convex if the line segment between any two points on its graph lies on or above the graph. Convex curve: has one minimum, a global minimum. Non-convex curve: has many local minima, as well as a global minimum.

21 Problem: Non-convex Cost Function. Using a linear hypothesis $h_\theta(x) = \theta^T x$, the following definition of $\mathrm{Cost}$ yields a convex cost function in linear regression: $\mathrm{Cost}(h_\theta(x), y) = \frac{1}{2}\left(h_\theta(x) - y\right)^2$. Using the same definition of $\mathrm{Cost}$ with the sigmoid hypothesis $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ would yield a non-convex cost function in logistic regression.
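As an illustration (the toy data here is an assumption, not from the slides), an Octave sketch that sweeps a single weight and plots the squared-error cost of a sigmoid hypothesis; the flat plateaus at the extremes are what make this cost non-convex:

x = [-2 -1 1 2];  y = [1 1 0 0];          % assumed toy training set
thetas = linspace(-10, 10, 401);
J = zeros(size(thetas));
for k = 1:length(thetas)
  h = 1.0 ./ (1.0 + exp(-thetas(k) * x)); % sigmoid hypothesis
  J(k) = mean(0.5 * (h - y).^2);          % squared-error cost
end
plot(thetas, J);                          % note the flat, non-convex plateaus
xlabel('theta'); ylabel('J(theta)');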

22 Consider using the log function to obtain a convex cost function: $\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$ if $y = 1$; $\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$ if $y = 0$.

23 For $y = 1$: $\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$. [Plot of $-\log(h_\theta(x))$ for $h_\theta(x) \in (0, 1)$.]

24 For $y = 1$: this captures the intuition that if $h_\theta(x) \to 0$ while $y = 1$, then we penalize the algorithm by a very large cost ($\mathrm{Cost} \to \infty$). Also, if $h_\theta(x) = 1$ while $y = 1$, then the cost is zero.

25 For $y = 0$: $\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$. [Plot of $-\log(1 - h_\theta(x))$ for $h_\theta(x) \in (0, 1)$.]

26 For $y = 0$: this captures the intuition that if $h_\theta(x) \to 1$ while $y = 0$, then we penalize the algorithm by a very large cost ($\mathrm{Cost} \to \infty$). Also, if $h_\theta(x) = 0$ while $y = 0$, then the cost is zero.

27 The two cases can be combined into a single expression: $\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$, giving the cost function $J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log h_\theta(x^{(i)}) - (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right]$.
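This is exactly the per-example expression that appears in the implementation on slide 41. A minimal vectorized Octave sketch, with assumed variable names (X is m-by-(n+1) with a leading column of ones; y is an m-by-1 vector of 0/1 labels):

sigmoid = @(z) 1.0 ./ (1.0 + exp(-z));
costJ = @(theta, X, y) mean(-y .* log(sigmoid(X * theta)) ...
                            - (1 - y) .* log(1 - sigmoid(X * theta)));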

28 Apply gradient descent to minimize $J(\theta)$. Repeat { for each feature $j$: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ }, updating all $\theta_j$ simultaneously.
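Anticipating the derivative worked out on slides 31 through 35, one vectorized gradient-descent step can be sketched in Octave as follows (the names X, y, theta, alpha, and m are assumptions):

h     = 1.0 ./ (1.0 + exp(-X * theta));   % m-by-1 vector of predictions
grad  = (X' * (h - y)) / m;               % partial derivatives dJ/dtheta_j
theta = theta - alpha * grad;             % simultaneous update of all theta_j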

29 Interpreting the Minimization of $J(\theta)$. For $y = 1$ training examples, the minimizer will attempt to find $\theta$ that will make $h_\theta(x) \to 1$. For $y = 0$ training examples, the minimizer will attempt to find $\theta$ that will make $h_\theta(x) \to 0$. The sweet spot will be somewhere in the middle. [Figure: cost curve $-\log(h_\theta(x))$ for $y = 1$ examples and cost curve $-\log(1 - h_\theta(x))$ for $y = 0$ examples.] The position of the global minimum depends on the training set. The shape of these curves depends on the training set.
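A small Octave sketch that reproduces the shape of these two cost curves as a function of the hypothesis output:

h = linspace(0.001, 0.999, 500);          % hypothesis output in (0, 1)
plot(h, -log(h), h, -log(1 - h));
xlabel('h_\theta(x)'); ylabel('Cost');
legend('-log(h)  (y = 1)', '-log(1-h)  (y = 0)');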

30 Example Minimization of $J(\theta)$. Consider training with 11 positive ($y = 1$) training examples only; their cost can be driven to 0 by a suitable $\theta$. Then add one negative ($y = 0$) example and train again. [Figure: cost curve of the 11 positive training examples (cost 0 at its minimizing $\theta$); cost curve of the 11 positive examples plus one negative example; and the global minimum of the combined cost.] The previous $\theta$ is not optimal, since if the minimizer moves away from it, then it is able to find a lower minimum.

31 Expression for $\frac{\partial}{\partial \theta_j} J(\theta)$: starting from $J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log h_\theta(x^{(i)}) - (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right]$, differentiate with respect to $\theta_j$.

32 Expression for $\frac{\partial}{\partial \theta_j} J(\theta)$: apply the chain rule, using $\frac{\partial}{\partial \theta_j} h_\theta(x) = h_\theta(x)\left(1 - h_\theta(x)\right)x_j$, which follows from the derivative of the sigmoid. See slide 35.

33 Expression for $\frac{\partial}{\partial \theta_j} J(\theta)$: the terms simplify to $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$.

34 Linear regression and logistic regression appear to have the same expression for the derivatives: $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$. However, the expression for the predictor is different. Linear regression: $h_\theta(x) = \theta^T x$. Logistic regression: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$.

35 Expression for $g'(z)$. Let $g(z) = \frac{1}{1 + e^{-z}}$. Then $g'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = g(z)\left(1 - g(z)\right)$, since $\frac{e^{-z}}{1 + e^{-z}} = 1 - g(z)$.
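Putting slides 31 through 35 together, a worked version of the derivation (a reconstruction from the definitions above, writing $h^{(i)} = h_\theta(x^{(i)})$):

\begin{aligned}
\frac{\partial h_\theta(x)}{\partial \theta_j}
  &= g'(\theta^T x)\, x_j = h_\theta(x)\bigl(1 - h_\theta(x)\bigr)\, x_j \\
\frac{\partial J(\theta)}{\partial \theta_j}
  &= -\frac{1}{m}\sum_{i=1}^{m}\left[\frac{y^{(i)}}{h^{(i)}}
     - \frac{1 - y^{(i)}}{1 - h^{(i)}}\right] h^{(i)}\bigl(1 - h^{(i)}\bigr)\, x_j^{(i)} \\
  &= -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\bigl(1 - h^{(i)}\bigr) - \bigl(1 - y^{(i)}\bigr)h^{(i)}\right] x_j^{(i)}
   = \frac{1}{m}\sum_{i=1}^{m}\left(h^{(i)} - y^{(i)}\right) x_j^{(i)}
\end{aligned}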

36 For each feature (j = 0; j <= n; j++) { $\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ }

37 Training Phase: for each feature (j = 0; j <= n; j++) { update $\theta_j$ as above }. Classification Phase: to predict an output for a new input $x$, note that this consists of two steps. The first step is the computation of $z = \theta^T x$, which is exactly the same as for linear regression. In linear regression the predicted output is $\theta^T x$. In logistic classification, $\theta^T x = 0$ is the decision boundary, and the sign of $\theta^T x$ tells us on which side of the decision boundary the input falls: the positive (side) class or the negative (side) class. The second step passes $z$ through the sigmoid function to determine the class of the input and the probability that the new input belongs to the positive class. In logistic classification, the predicted output is the class assigned by the logistic function, with the probability that the new input belongs to the positive class.
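A minimal Octave sketch of the two-step classification phase (xnew, with a leading 1 for the bias term, and the learned theta are assumed names):

z = theta' * xnew;           % step 1: same computation as linear regression
p = 1.0 / (1.0 + exp(-z));   % step 2: probability of the positive class
ypred = (p >= 0.5);          % predicted class: 1 if p >= 0.5, else 0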

38 This example uses a very simple data set that represents scores on a test in the 1st column (x) and the pass/fail indicator in the 2nd column (y); 0 represents fail, and 1 represents pass. [Table of x and y values; the values were not recovered in the transcription.]

39 Opening and Plotting Data Files

clear all; close all; clc
t0 = cputime();
x = load('x.txt');
y = load('y.txt');
m = length(y);
figure; hold on
for i = 1:m
  if (y(i) == 1)
    plot(x(i), y(i), 's', 'color', 'g', 'markersize', 18, 'markerfacecolor', 'g');
  else
    plot(x(i), y(i), 'o', 'color', 'r', 'markersize', 18, 'markerfacecolor', 'r');
  endif
endfor
ylabel('Pass/Fail', 'fontsize', 18, 'fontname', 'Arial');
xlabel('Exam 1 Score', 'fontsize', 18, 'fontname', 'Arial');
title('Pass/Fail vs Exam Score', 'fontsize', 20, 'fontname', 'Arial');

40

x = [ones(m, 1) x];   % Add a column of ones to x
numfeatures = size(x, 2);
theta = zeros(numfeatures, 1);
prevtheta = theta;    % initialize previous weights (used by the training loop)
numtrainsam = size(x, 1);
maxiterations = 1000;
learningrate = 5.0;
errorperiteration = zeros(maxiterations, 1);

41 For each iteration { For each feature (j = 0; j <= n; j++) { ... } }

for t = 1:maxiterations
  toterror = 0;
  for j = 1:numfeatures
    totslope = 0;
    for i = 1:m
      z = 0;
      for jj = 1:numfeatures
        z = z + prevtheta(jj) * x(i, jj);
      end
      h = 1.0 / (1.0 + exp(-z));
      totslope = (totslope + (h - y(i)) * x(i, j));
      toterror = (toterror + -y(i)*log(h) - (1 - y(i))*log(1 - h));
    end
    toterror = toterror / numtrainsam;
    theta(j) = theta(j) - learningrate * (totslope / numtrainsam);
  end
  prevtheta = theta;
  errorperiteration(t) = toterror / numfeatures;
end

42

for i = 1:m
  z = 0;
  for jj = 1:numfeatures
    z = z + prevtheta(jj) * x(i, jj);
  end
  h = 1.0 / (1.0 + exp(-z));
  totslope = (totslope + (h - y(i)) * x(i, j));
  toterror = (toterror + -y(i)*log(h) - (1 - y(i))*log(1 - h));
end

43 For each feature (j = 0; j <= n; j++) { ... }

for t = 1:maxiterations
  toterror = 0;
  for j = 1:numfeatures
    totslope = 0;
    for i = 1:m
      z = 0;
      for jj = 1:numfeatures
        z = z + prevtheta(jj) * x(i, jj);   % h is computed from prevtheta
      end
      h = 1.0 / (1.0 + exp(-z));
      totslope = (totslope + (h - y(i)) * x(i, j));
      toterror = (toterror + -y(i)*log(h) - (1 - y(i))*log(1 - h));
    end
    toterror = toterror / numtrainsam;
    theta(j) = theta(j) - learningrate * (totslope / numtrainsam);
  end
  prevtheta = theta;   % update prevtheta only after all theta(j) are computed
  errorperiteration(t) = toterror / numfeatures;
end

The intermediate updated theta should not be used in the computation of h. After all theta(j) have been computed, then prevtheta should be updated. Note: in the batch update method, when computing each h, the previous theta (prevtheta) is used.
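One way to make that batch-update rule explicit is to compute every gradient from the same prevtheta before any update is applied; a sketch using the names above:

h = 1.0 ./ (1.0 + exp(-(x * prevtheta)));   % predictions use prevtheta only
for j = 1:numfeatures
  theta(j) = prevtheta(j) - learningrate * ((h - y)' * x(:, j)) / numtrainsam;
end
prevtheta = theta;                          % swap in the new weights afterwards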

44 Learning rate = 5.0. Number of iterations = 1000. [Figure: error per iteration.]

45 [Table: training examples x, their labeled outputs y, and the model's predictions; the values were not recovered in the transcription.]
