Logistic Regression. Aarti Singh & Barnabas Poczos. Machine Learning 10-701/15-781, Jan 28, 2014


Logistic Regression. Aarti Singh & Barnabas Poczos. Machine Learning 10-701/15-781, Jan 28, 2014

Linear Regression & Linear Classification. [Figures: a linear fit of Weight vs. Height (regression) and a linear decision boundary (classification).]

Naïve Bayes Recap. NB Assumption: features are conditionally independent given the class label. NB Classifier: assume a parametric form for P(X_i | Y) and P(Y), estimate the parameters using MLE/MAP, and plug them in.
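The NB equations on this slide did not survive transcription; the standard forms they refer to, for features X = (X_1, ..., X_d), are

$$P(X_1, \ldots, X_d \mid Y) = \prod_{i=1}^{d} P(X_i \mid Y) \qquad \text{(NB assumption)}$$

$$f_{NB}(x) = \arg\max_{y} \; P(y) \prod_{i=1}^{d} P(x_i \mid y) \qquad \text{(NB classifier, with MLE/MAP estimates plugged in)}$$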

Gaussian Naïve Bayes (GNB). There are several distributions that can lead to a linear decision boundary. As an example, consider Gaussian Naïve Bayes with Gaussian class-conditional densities. What if we assume the variance is independent of the class, i.e. σ_{i,0} = σ_{i,1} = σ_i?
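For reference (this density is the standard GNB form, not transcribed from the slide): each class-conditional is a univariate Gaussian,

$$P(X_i \mid Y = k) = \frac{1}{\sqrt{2\pi}\,\sigma_{ik}} \exp\!\left(-\frac{(X_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right),$$

and the class-independent-variance assumption sets σ_{i0} = σ_{i1} = σ_i.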

GNB with equal variance is a Linear Classifier! Decision boundary:

$$\prod_{i=1}^{d} P(X_i \mid Y=0)\, P(Y=0) \;=\; \prod_{i=1}^{d} P(X_i \mid Y=1)\, P(Y=1)$$

$$\log \frac{P(Y=0)\,\prod_{i=1}^{d} P(X_i \mid Y=0)}{P(Y=1)\,\prod_{i=1}^{d} P(X_i \mid Y=1)} \;=\; \log\frac{P(Y=0)}{P(Y=1)} \;+\; \sum_{i=1}^{d} \log\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)} \;=\; 0,$$

where the prior ratio is a constant term and the sum is a first-order term in the features.
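The step that makes the sum a first-order term (implicit on the slide) is that, with a shared variance σ_i per feature, the quadratic terms in the Gaussian log-ratio cancel:

$$\log\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)} = \frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2} = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2},$$

which is linear in X_i, so the decision boundary is a hyperplane.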

Gaussian Naïve Bayes (GNB) Decision Boundary. [Figure: two-class example with $X = (x_1, x_2)$, priors $P_1 = P(Y=0)$ and $P_2 = P(Y=1)$, and class-conditional densities $p_1(X) = p(X \mid Y=0) \sim \mathcal{N}(M_1, \Sigma_1)$, $p_2(X) = p(X \mid Y=1) \sim \mathcal{N}(M_2, \Sigma_2)$, showing the resulting decision boundary.]

Generative vs. Discriminative Classifiers. Generative classifiers (e.g. Naïve Bayes): assume some functional form for P(X,Y) (or for P(X|Y) and P(Y)), and estimate the parameters of P(X|Y) and P(Y) directly from training data. But argmax_Y P(X|Y) P(Y) = argmax_Y P(Y|X), so why not learn P(Y|X) directly? Or better yet, why not learn the decision boundary directly? Discriminative classifiers (e.g. Logistic Regression): assume some functional form for P(Y|X) or for the decision boundary, and estimate the parameters of P(Y|X) directly from training data.

Logistic Regression. Assumes the following functional form for P(Y|X): $P(Y=1 \mid X) = \frac{1}{1+\exp\!\big(-(w_0 + \sum_i w_i X_i)\big)}$, i.e. the logistic function applied to a linear function of the data. Logistic function (or sigmoid): $\mathrm{logistic}(z) = \frac{1}{1+e^{-z}}$. [Figure: plot of the sigmoid as a function of z.] Features can be discrete or continuous!
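As a small numeric illustration of this functional form (not from the slides; the weights and feature values below are made up):

import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, w0, w):
    # P(Y=1 | X=x) = sigmoid(w0 + sum_i w_i * x_i)
    return sigmoid(w0 + np.dot(w, x))

# hypothetical weights and feature vector, purely for illustration
w0, w = -1.0, np.array([2.0, -0.5])
x = np.array([1.5, 3.0])
print(p_y1_given_x(x, w0, w))  # prints a probability in (0, 1), here about 0.62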

Logistic Regression is a Linear Classifier! Assumes the functional form $P(Y=1 \mid X) = \frac{1}{1+\exp(-(w_0 + \sum_i w_i X_i))}$. Decision boundary: predict the label with the larger posterior, so the boundary is $P(Y=1 \mid X) = P(Y=0 \mid X)$, i.e. $w_0 + \sum_i w_i X_i = 0$ (a linear decision boundary).

Logistic Regression is a Linear Classifier! (continued) [Figure: the sigmoid posterior P(Y=1|X) over the feature space and the resulting linear decision boundary.]

Logistic Regression for more than 2 classes. Logistic regression in the more general case, where Y ∈ {y_1, ..., y_K}: a separate set of weights defines P(Y = y_k | X) for each k < K, and the last class k = K is obtained by normalization, so no weights are needed for this class. Is the decision boundary still linear?
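The equations elided here are, in their standard form (with weights w_{k0}, ..., w_{kd} for each class k < K):

$$P(Y = y_k \mid X) = \frac{\exp\!\big(w_{k0} + \sum_{i=1}^{d} w_{ki} X_i\big)}{1 + \sum_{j=1}^{K-1} \exp\!\big(w_{j0} + \sum_{i=1}^{d} w_{ji} X_i\big)} \quad \text{for } k < K, \qquad P(Y = y_K \mid X) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp\!\big(w_{j0} + \sum_{i=1}^{d} w_{ji} X_i\big)}.$$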

Training Logistic Regression. We'll focus on binary classification: how do we learn the parameters w_0, w_1, ..., w_d? Given training data, use Maximum Likelihood Estimates. But there is a problem: we don't have a model for P(X) or P(X|Y), only for P(Y|X).

Training Logistic Regression. How do we learn the parameters w_0, w_1, ..., w_d? Given training data, use Maximum (Conditional) Likelihood Estimates. Discriminative philosophy: don't waste effort learning P(X); focus on P(Y|X), which is all that matters for classification!

Expressing Conditional log Likelihood
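The expression itself did not transcribe; for training data {(x^j, y^j)}, j = 1, ..., n, with y^j ∈ {0, 1}, the standard conditional log-likelihood is

$$l(w) = \sum_{j=1}^{n} \ln P(y^j \mid x^j, w) = \sum_{j=1}^{n} \Big[ y^j \big(w_0 + \sum_i w_i x_i^j\big) - \ln\!\Big(1 + \exp\big(w_0 + \sum_i w_i x_i^j\big)\Big) \Big].$$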

Maximizing Conditional log Likelihood. Bad news: there is no closed-form solution that maximizes l(w). Good news: l(w) is a concave function of w, and concave functions are easy to optimize (unique maximum).

Optimizing a concave/convex function. The conditional likelihood for Logistic Regression is concave. Maximum of a concave function = minimum of a convex function: Gradient Ascent (concave) / Gradient Descent (convex). Gradient: $\nabla_w l(w) = \big[\tfrac{\partial l(w)}{\partial w_0}, \ldots, \tfrac{\partial l(w)}{\partial w_d}\big]$. Update rule: $w^{(t+1)} = w^{(t)} + \eta\, \nabla_w l(w)\big|_{w^{(t)}}$, with learning rate η > 0.

Gradient Ascent for Logistic Regression. Gradient ascent algorithm: iterate until change < ε. Update $w_0 \leftarrow w_0 + \eta \sum_j \big[y^j - \hat P(Y=1 \mid x^j, w)\big]$ and, for i = 1, ..., d, $w_i \leftarrow w_i + \eta \sum_j x_i^j \big[y^j - \hat P(Y=1 \mid x^j, w)\big]$, where $\hat P(Y=1 \mid x^j, w)$ is the prediction of what the current weights think the label Y should be. Gradient ascent is the simplest of the optimization approaches; alternatives include, e.g., Newton's method, conjugate gradient ascent, and IRLS (see Bishop 4.3.3).
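A minimal sketch of these updates in Python (batch gradient ascent with a fixed learning rate, under the y ∈ {0, 1} convention above; an illustration, not the course's reference implementation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.1, n_iters=1000, eps=1e-6):
    # X: (n, d) feature matrix, y: (n,) labels in {0, 1}
    n, d = X.shape
    w0, w = 0.0, np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(w0 + X @ w)      # P(Y=1 | x^j, w) for every training point
        err = y - p                  # y^j - P(Y=1 | x^j, w)
        dw0 = err.sum()              # gradient of l(w) w.r.t. w_0
        dw = X.T @ err               # gradient of l(w) w.r.t. w_1, ..., w_d
        w0 += eta * dw0              # gradient ascent step
        w += eta * dw
        if eta * np.sqrt(dw0 ** 2 + dw @ dw) < eps:   # stop when the change is < eps
            break
    return w0, w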

Effect of step-size η. Large η: fast convergence but larger residual error, and possible oscillations. Small η: slow convergence but small residual error.

Gaussian Naïve Bayes vs. Logistic Regression. The set of Gaussian Naïve Bayes parameters (with feature variance independent of the class label) and the set of Logistic Regression parameters have a representational equivalence, but only in a special case (GNB with class-independent variances)! So what's the difference? LR makes no assumptions about P(X|Y) in learning, and the loss functions differ: they optimize different functions and obtain different solutions.

Naïve Bayes vs. Logistic Regression. Consider Y boolean, X_i continuous, X = <X_1 ... X_d>. Number of parameters: NB has 4d+1 (π, the means μ_{1,y}, μ_{2,y}, ..., μ_{d,y} and variances σ²_{1,y}, σ²_{2,y}, ..., σ²_{d,y} for y = 0, 1); LR has d+1 (w_0, w_1, ..., w_d). Estimation method: NB parameter estimates are uncoupled; LR parameter estimates are coupled.

Generative vs. Discriminative. Given infinite data (asymptotically) [Ng & Jordan, NIPS 2001]: if the conditional independence assumption holds, discriminative and generative NB perform similarly; if the conditional independence assumption does NOT hold, discriminative outperforms generative NB.

Generative vs. Discriminative. Given finite data (n data points, d features) [Ng & Jordan, NIPS 2001]: Naïve Bayes (generative) requires n = O(log d) samples to converge to its asymptotic error, whereas Logistic Regression (discriminative) requires n = O(d). Why? Because of the independent class-conditional densities, the NB parameter estimates are not coupled: each parameter is learnt independently, not jointly, from the training data.

Naïve Bayes vs. Logistic Regression: Verdict. Both learn a linear decision boundary. Naïve Bayes makes more restrictive assumptions and has higher asymptotic error, BUT it converges faster to its less accurate asymptotic error.

Experimental Comparison [Ng & Jordan, NIPS 2001]. UCI Machine Learning Repository: 15 datasets, 8 with continuous features and 7 with discrete features; more in the paper. [Figures: error curves comparing Naïve Bayes and Logistic Regression on these datasets.]

What you should know. LR is a linear classifier: the decision rule is a hyperplane. LR is optimized by maximizing the conditional likelihood; there is no closed-form solution, but the objective is concave, so gradient ascent reaches the global optimum. Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR, but the solutions differ because of the objective (loss) function. In general, NB and LR make different assumptions: NB assumes features are independent given the class (an assumption on P(X|Y)); LR assumes a functional form for P(Y|X) and makes no assumption on P(X|Y). Convergence rates: GNB (usually) needs less data; LR (usually) gets to better solutions in the limit.