Logistic Regression

1 Logistic Regression

2 Agenda: Model Specification; Model Fitting; Bayesian Logistic Regression; Online Learning and Stochastic Optimization; Generative versus Discriminative Classifiers

3 Model Specification Plots of sigmoid(w1*x1 + w2*x2)

4 Model Specification The Logistic Function: maps log-odds to a probability
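
As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of the logistic function; the helper name sigmoid and the sample inputs are my own:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Log-odds of 0 corresponds to probability 0.5; large positive or negative
# log-odds saturate toward 1 and 0 respectively.
print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # approximately [0.018, 0.5, 0.982]
```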

5 Model Specification Logistic Regression Example: The Data

6 Model Specification Logistic Regression Example: Prediction

7 Model Specification Logistic Regression Example: The Model This model does not have a problem with collinearity

8 Model Specification Logistic Regression Example: The Model This model does not have a problem with collinearity, because our solver recognized that the matrix was rank-deficient

9 Model Specification Logistic Regression Example: Model 2 This model has a problem with collinearity: interpretation of the coefficients is problematic

10 Model Fitting MLE for Logistic Regression Unfortunately, there is no closed form solution for the MLE!

11 Model Fitting Gradient and Hessian of the NLL

12 Model Fitting Steepest Descent The negative of the gradient is used to update the parameter vector. Eta (η) is the learning rate, which can be difficult to set: a small, constant step size takes a long time to converge; a large one may not converge at all.
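
To make the update rule concrete, here is a minimal NumPy sketch (mine, not the slides' code) of steepest descent on the logistic-regression negative log-likelihood with a fixed learning rate; labels are assumed to be 0/1 and the function names are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_gradient(w, X, y):
    """Gradient of the negative log-likelihood: X^T (mu - y), with mu = sigmoid(Xw)."""
    mu = sigmoid(X @ w)
    return X.T @ (mu - y)

def fit_steepest_descent(X, y, eta=0.1, n_iters=1000):
    """Steepest descent with a fixed step size eta; illustrative only, no line search."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w -= eta * nll_gradient(w, X, y)  # step along the negative gradient
    return w
```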

13 Model Fitting Gradient Descent Example: a step size that is too small versus one that is too large

14 Model Fitting Steepest Descent Example Step size driven by line search: at the minimum along each step, the local gradient is perpendicular to the search direction

15 Model Fitting Newton's Method

16 Model Fitting Newton's Method Example: a convex example and a non-convex example

17 Model Fitting Iteratively Reweighted Least Squares We're applying Newton's algorithm to find the MLE
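
A minimal sketch of IRLS in its usual form (Newton steps with Hessian X^T S X, where S = diag(mu_i (1 - mu_i))); this is my own illustration, assuming 0/1 labels and a bias column already included in X:

```python
import numpy as np

def irls_logistic(X, y, n_iters=20):
    """Iteratively Reweighted Least Squares: Newton's method on the logistic NLL."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
        S = mu * (1.0 - mu)                   # diagonal weights
        g = X.T @ (mu - y)                    # gradient of the NLL
        H = X.T @ (X * S[:, None])            # Hessian X^T S X
        w -= np.linalg.solve(H, g)            # Newton step
    return w
```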

18 Model Fitting Broyden-Fletcher-Goldfarb-Shanno (BFGS) Computing the Hessian matrix explicitly can be expensive. Rather than computing the Hessian explicitly, we can use an approximation; here, a diagonal-plus-low-rank approximation of the Hessian.

19 Model Fitting Limited-Memory BFGS (L-BFGS) Storing the Hessian matrix takes O(D^2) space, where D is the number of dimensions. We can further approximate the Hessian by using only the m most recent (s_k, y_k) pairs, reducing the storage requirement to O(mD).
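
In practice one rarely codes L-BFGS by hand; below is a sketch using SciPy's built-in implementation on the logistic-regression NLL (my own illustration, 0/1 labels, hypothetical function names):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    """Negative log-likelihood for 0/1 labels (no regularizer)."""
    mu = sigmoid(X @ w)
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(mu + eps) + (1 - y) * np.log(1 - mu + eps))

def nll_grad(w, X, y):
    return X.T @ (sigmoid(X @ w) - y)

def fit_lbfgs(X, y):
    w0 = np.zeros(X.shape[1])
    res = minimize(nll, w0, args=(X, y), jac=nll_grad, method="L-BFGS-B")
    return res.x
```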

20 Bayesian Logistic Regression Bayesian Logistic Regression

21 Bayesian Logistic Regression Posterior Predictive Distribution Probit Approximation
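
For reference, the standard probit (moderated-output) approximation replaces the intractable predictive integral with sigmoid(kappa(s2) * m), where m and s2 are the mean and variance of the log-odds under a Gaussian approximation to the posterior. A small sketch follows; m_post and V_post are assumed to come from, e.g., a Laplace approximation and are not defined in the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predictive_prob(x, m_post, V_post):
    """Probit approximation to the posterior predictive p(y=1 | x, D)."""
    mu = x @ m_post                                   # mean of the log-odds
    sigma2 = x @ V_post @ x                           # variance of the log-odds
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma2 / 8.0)
    return sigmoid(kappa * mu)
```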

22 Bayesian Logistic Regression Posterior Predictive Density for SAT Data (with 95% CI)

23 Online Learning and Stochastic Optimization Online Learning and Regret Updating the model as we go. Example: a user is presented with an ad and either clicks or doesn't click

24 Online Learning and Stochastic Optimization Stochastic Gradient Descent The AdaGrad variant uses a per-parameter step size based on the curvature of the loss function (estimated from the accumulated squared gradients), as sketched below
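
A minimal sketch of SGD with AdaGrad-style per-parameter step sizes for logistic regression (my own illustration; 0/1 labels, hypothetical function names):

```python
import numpy as np

def sgd_adagrad(X, y, eta=0.5, n_epochs=10, eps=1e-8):
    """Stochastic gradient descent with AdaGrad per-parameter step sizes."""
    n, d = X.shape
    w = np.zeros(d)
    accum = np.zeros(d)                            # running sum of squared gradients
    for _ in range(n_epochs):
        for i in np.random.permutation(n):         # one example at a time
            mu = 1.0 / (1.0 + np.exp(-(X[i] @ w)))
            g = (mu - y[i]) * X[i]                 # per-example gradient of the NLL
            accum += g ** 2
            w -= eta * g / (np.sqrt(accum) + eps)  # smaller steps for frequently updated parameters
    return w
```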

25 Online Learning and Stochastic Optimization Least Mean Squares Example Also known as the Widrow-Hoff rule or the delta rule. Stochastic Gradient Descent scales very well and is used by Apache Spark

26 Online Learning and Stochastic Optimization Perceptron Converges *if* the data are linearly separable
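
A small sketch of the classic perceptron update (my own, with labels in {-1, +1}); by the perceptron convergence theorem the mistake-driven loop terminates only when the data are linearly separable:

```python
import numpy as np

def perceptron(X, y, n_epochs=100):
    """Perceptron training for labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:   # misclassified (or on the decision boundary)
                w += yi * xi         # perceptron update
                mistakes += 1
        if mistakes == 0:            # all points correctly classified
            break
    return w
```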

27 Generative versus Discriminative Classifiers Generative vs. Discriminative Classifiers: Easy to fit? Fit classes separately? Handle missing features easily? Can handle unlabeled training data? Symmetric in inputs and outputs? Can handle feature preprocessing? Well-calibrated probabilities?

28 Generative versus Discriminative Classifiers Summary of Models

29 Generative versus Discriminative Classifiers Multinomial Logistic Regression: linear vs. non-linear (RBF expansion)
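
One common way to obtain the non-linear version is to expand the inputs with radial basis functions and feed the expanded features to a multiclass logistic regression; a sketch under those assumptions, with hypothetical names (the slides' specific data and centers are not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rbf_features(X, centers, s=1.0):
    """Expand inputs with Gaussian radial basis functions at the given centers."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * s ** 2))

# Hypothetical usage: centers could be a random subset of the training inputs.
# Phi = rbf_features(X, centers)
# clf = LogisticRegression(max_iter=1000).fit(Phi, y)   # handles multiclass y
```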

30 Generative versus Discriminative Classifiers Class-Conditional Density vs. Class Posterior

31 Generative versus Discriminative Classifiers Fisher's Linear Discriminant Analysis Example PCA sometimes makes the problem more difficult; Fisher's LDA projects the data into C - 1 dimensions

32 Generative versus Discriminative Classifiers PCA versus FLDA Projection of Vowel Data Projecting high-dimensional data down to C - 1 dimensions can be problematic; however, FLDA certainly seems to produce better structure than PCA for this example
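
A brief scikit-learn sketch of the comparison (the vowel data themselves are not included here; X, y and the two-dimensional projection are placeholders, and LDA allows at most C - 1 components, so at least three classes are assumed):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def project_2d(X, y):
    """Project labeled data to 2D with unsupervised PCA and supervised Fisher LDA."""
    Z_pca = PCA(n_components=2).fit_transform(X)                             # ignores labels
    Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # uses labels
    return Z_pca, Z_lda
```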

33 Active Learning

34 Active Learning Active learning is a special case of semi-supervised learning, where we are able to obtain labels for some number of unlabeled examples. Popular strategies include uncertainty sampling, query by committee, and balancing exploration and exploitation. To be continued on June 2nd
