Logistic Regression

Logistic Regression ddebarr@uw.edu 2016-05-26

Agenda Model Specification Model Fitting Bayesian Logistic Regression Online Learning and Stochastic Optimization Generative versus Discriminative Classifiers

Model Specification Plots of sigmoid(w1*x1 + w2*x2)

Model Specification The Logistic Function Maps log odds to a probability
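The log-odds-to-probability mapping on this slide can be sketched in a few lines (an illustrative helper, not code from the slides; the function names are my own):

```python
import math

def sigmoid(z):
    """Map log odds z to a probability: sigmoid(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse mapping: take a probability back to log odds."""
    return math.log(p / (1.0 - p))
```

For example, log odds of log(3) (odds of 3:1) map to a probability of 0.75, and `logit` undoes `sigmoid`.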

Model Specification Logistic Regression Example: The Data

Model Specification Logistic Regression Example: Prediction

Model Specification Logistic Regression Example: The Model This model does not have a problem with collinearity

Model Specification Logistic Regression Example: The Model This model does not have a problem with collinearity, because our solver recognized the matrix was rank deficient

Model Specification Logistic Regression Example: Model 2 This model has a problem with collinearity: interpretation of the coefficients is problematic

Model Fitting MLE for Logistic Regression Unfortunately, there is no closed form solution for the MLE!

Model Fitting Gradient and Hessian of the NLL
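The gradient and Hessian of the NLL have the standard closed forms g = X^T(mu - y) and H = X^T S X with S = diag(mu_i(1 - mu_i)); a minimal NumPy sketch (my own function names, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    """Negative log-likelihood of logistic regression."""
    mu = sigmoid(X @ w)
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def nll_grad_hess(w, X, y):
    """g = X^T (mu - y); H = X^T S X with S = diag(mu_i * (1 - mu_i))."""
    mu = sigmoid(X @ w)
    g = X.T @ (mu - y)
    H = X.T @ ((mu * (1.0 - mu))[:, None] * X)  # avoids building diag(S) explicitly
    return g, H
```

Because mu_i(1 - mu_i) > 0, H is positive semi-definite, which is why the NLL is convex.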

Model Fitting Steepest Descent The negative of the gradient is used to update the parameter vector. Eta is the learning rate, which can be difficult to set: small and constant, it takes a long time to converge; too large, it may not converge at all
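The update rule w ← w − η g(w) and the learning-rate trade-off can be sketched on a simple quadratic (a generic illustration under my own assumptions, not the slide's example):

```python
def steepest_descent(grad, w0, eta, n_iters=100):
    """Repeatedly step against the gradient with a fixed learning rate eta."""
    w = w0
    for _ in range(n_iters):
        w = w - eta * grad(w)
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3); minimum at w = 3.
grad = lambda w: 2.0 * (w - 3.0)
w_small = steepest_descent(grad, 0.0, eta=0.01)            # too small: still far from 3
w_good  = steepest_descent(grad, 0.0, eta=0.1)             # converges to 3
w_large = steepest_descent(grad, 0.0, eta=1.1, n_iters=50) # too large: diverges
```

With η = 0.01 the error shrinks by only 2% per step; with η = 1.1 each step overshoots and the error grows by 20% per step.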

Model Fitting Gradient Descent Example Too small? Too large!

Model Fitting Steepest Descent Example Step size driven by line search: the local gradient is perpendicular to the search direction

Model Fitting Newton's Method

Model Fitting Newton's Method Example Convex Example Non-Convex Example

Model Fitting Iteratively Reweighted Least Squares We're applying Newton's algorithm to find the MLE
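Newton's method for the logistic-regression MLE can be sketched directly from the gradient and Hessian formulas (a minimal illustration with my own function names; real implementations add a line search and convergence checks):

```python
import numpy as np

def irls(X, y, n_iters=30):
    """Newton's method for the logistic-regression MLE (IRLS form)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ w)))
        s = mu * (1.0 - mu)                  # working weights, the diagonal of S
        g = X.T @ (mu - y)                   # gradient of the NLL
        H = X.T @ (s[:, None] * X)           # Hessian X^T S X
        w = w - np.linalg.solve(H, g)        # Newton step
    return w

# Non-separable toy data (first column is the intercept); note that on
# linearly separable data the unregularized MLE sends the weights to infinity.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])
w_mle = irls(X, y)
```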

Model Fitting Broyden-Fletcher-Goldfarb-Shanno (BFGS) Computing the Hessian matrix explicitly can be expensive. Rather than computing the Hessian explicitly, we can use an approximation: we're using a diagonal plus low-rank approximation for the Hessian

Model Fitting Limited Memory BFGS (L-BFGS) Storing the Hessian matrix takes O(D^2) space, where D is the number of dimensions. We can further approximate the Hessian by using only the m most recent (s_k, y_k) pairs, reducing the storage requirement to O(mD)
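In practice one rarely hand-codes L-BFGS; a common route is SciPy's `minimize` with the `L-BFGS-B` method, whose `maxcor` option is the memory parameter m from the slide (a sketch assuming SciPy is available; the NLL/gradient helpers are my own):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    mu = sigmoid(X @ w)
    eps = 1e-12                              # guard the logs against 0
    return -np.sum(y * np.log(mu + eps) + (1 - y) * np.log(1 - mu + eps))

def grad(w, X, y):
    return X.T @ (sigmoid(X @ w) - y)

# Same non-separable toy data as before (intercept in the first column).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])
res = minimize(nll, np.zeros(2), args=(X, y), jac=grad,
               method="L-BFGS-B", options={"maxcor": 10})  # keep m = 10 pairs
```

Supplying the analytic gradient via `jac` avoids finite-difference gradient evaluations.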

Bayesian Logistic Regression Bayesian Logistic Regression

Bayesian Logistic Regression Posterior Predictive Distribution Probit Approximation

Bayesian Logistic Regression Posterior Predictive Density for SAT Data includes 95% CI

Online Learning and Stochastic Optimization Online Learning and Regret Updating the model as we go. Example: a user is presented an ad, and either clicks or doesn't click

Online Learning and Stochastic Optimization Stochastic Gradient Descent The AdaGrad variant uses a per-parameter step size based on the curvature of the loss function
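AdaGrad divides a global step size by the root of each parameter's accumulated squared gradients, so frequently-updated parameters get smaller steps; a minimal sketch (my own function names, toy data of my own choosing):

```python
import numpy as np

def adagrad_sgd(grad, w0, stream, eta=0.5, eps=1e-8):
    """SGD where each parameter's step is eta / sqrt(sum of its squared grads)."""
    w = np.array(w0, dtype=float)
    cache = np.zeros_like(w)
    for x_i, y_i in stream:
        g = grad(w, x_i, y_i)
        cache += g * g                           # per-parameter accumulator
        w -= eta * g / (np.sqrt(cache) + eps)    # per-parameter step size
    return w

# Toy stream: fit y = 2x with the per-example squared-error gradient (w.x - y) x.
lms_grad = lambda w, x, y: (w @ x - y) * x
stream = [(np.array([1.0]), 2.0)] * 300
w_hat = adagrad_sgd(lms_grad, [0.0], stream)
```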

Online Learning and Stochastic Optimization Least Mean Squares Example Known as the Widrow-Hoff rule or the delta rule. Stochastic Gradient Descent scales really well; it is used by Apache Spark

Online Learning and Stochastic Optimization Perceptron Converges *if* linearly separable
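The perceptron's mistake-driven update, and its convergence on linearly separable data, can be sketched as follows (an illustrative implementation with my own toy data, labels in {-1, +1}):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Rosenblatt's perceptron; X includes a bias column, y entries are -1 or +1."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:         # misclassified (or on the boundary)
                w += y_i * x_i               # nudge w toward the correct side
                mistakes += 1
        if mistakes == 0:                    # a full clean pass: converged
            break
    return w

# Linearly separable toy data (first column is the bias term).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w = perceptron(X, y)
```

If the data are not linearly separable, the loop never reaches a clean pass and simply stops after `max_epochs`, which is the non-convergence case the slide warns about.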

Generative versus Discriminative Classifiers Generative v Discriminative Classifiers Easy to fit? Fit classes separately? Handle missing features easily? Can handle unlabeled training data? Symmetric in inputs and outputs? Can handle feature preprocessing? Well-calibrated probabilities?

Generative versus Discriminative Classifiers Summary of Models

Generative versus Discriminative Classifiers Multinomial Logistic Regression Linear Non-Linear RBF expansion

Generative versus Discriminative Classifiers Class Conditional Density v Class Posterior

Generative versus Discriminative Classifiers Fisher's Linear Discriminant Analysis Example PCA sometimes makes the problem more difficult; Fisher's LDA projects the data into C-1 dimensions

Generative versus Discriminative Classifiers PCA versus FLDA Projection of Vowel Data Projecting high dimensional data down to C-1 dimensions can be problematic; however, FLDA certainly seems to produce better structure than PCA for this example
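For two classes, C-1 = 1 and Fisher's discriminant direction has the closed form w ∝ S_W^(-1)(m1 - m0); a minimal sketch (my own function names and toy clusters, not the vowel data from the slide):

```python
import numpy as np

def flda_direction(X0, X1):
    """Two-class Fisher LDA: w proportional to S_W^{-1} (m1 - m0),
    where S_W is the pooled within-class scatter matrix."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = (X0 - m0).T @ (X0 - m0)             # within-class scatter, class 0
    S1 = (X1 - m1).T @ (X1 - m1)             # within-class scatter, class 1
    return np.linalg.solve(S0 + S1, m1 - m0)

# Two toy clusters separated along the first axis.
X0 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X1 = np.array([[3.0, 0.0], [4.0, 0.0], [3.0, 1.0], [4.0, 1.0]])
w = flda_direction(X0, X1)
```

Unlike PCA, which keeps the directions of maximum variance regardless of labels, this direction explicitly maximizes between-class separation relative to within-class scatter.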

Active Learning

Active Learning Active learning is a special case of semi-supervised learning, where we are able to obtain labels for some number of unlabeled examples. Popular strategies include uncertainty sampling, query by committee, and balancing exploration and exploitation. To be continued on June 2nd