Logistic Regression
May 28, 2012

1 Logistic Regression

1.1 Decision Boundary

The decision boundary is a property of the hypothesis, not of the data set.

Sigmoid function:

    h_θ(x) = g(θᵀx) = P(y = 1 | x; θ),   where   g(z) = 1 / (1 + e^(-z))

Suppose we predict y = 1 if h_θ(x) ≥ 0.5 and y = 0 if h_θ(x) < 0.5. Since g(z) ≥ 0.5 exactly when z ≥ 0:

    h_θ(x) = g(θᵀx) ≥ 0.5 whenever θᵀx ≥ 0
    h_θ(x) = g(θᵀx) < 0.5 whenever θᵀx < 0

Non-linear Decision Boundaries

Add higher-order polynomial terms, e.g. two extra features:

    h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1^2 + θ_4 x_2^2)

With θ_0 = -1, θ_1 = θ_2 = 0, θ_3 = θ_4 = 1, we predict y = 1 if -1 + x_1^2 + x_2^2 ≥ 0, that is x_1^2 + x_2^2 ≥ 1. If we plot x_1^2 + x_2^2 = 1, it is a circle: inside the circle y = 0, outside y = 1. With even higher polynomial terms, we can get even more complex decision boundaries.
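As a quick check of the circular boundary in Octave/MATLAB (a sketch; the point x_1 = x_2 = 0.5 and the variable names are just for illustration):

    g = @(z) 1 ./ (1 + exp(-z));       % sigmoid, works elementwise
    theta = [-1; 0; 0; 1; 1];          % [theta_0; ...; theta_4] from the example
    x1 = 0.5; x2 = 0.5;                % a point inside the unit circle
    features = [1; x1; x2; x1^2; x2^2];
    h = g(theta' * features);          % about 0.38, i.e. below 0.5
    prediction = (h >= 0.5);           % so we predict y = 0, as expected inside the circle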

1.2 Cost Function

Training set of m examples: {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}, with

    x = [x_0, x_1, ..., x_n]ᵀ ∈ R^(n+1),   x_0 = 1,   y ∈ {0, 1}

    h_θ(x) = 1 / (1 + e^(-θᵀx))

How do we choose the parameters θ? The cost function for logistic regression needs to be convex, and the squared-error cost used for linear regression would not be convex here. Instead, use (6:20):

    Cost(h_θ(x), y) = -log(h_θ(x))        if y = 1
    Cost(h_θ(x), y) = -log(1 - h_θ(x))    if y = 0

Cost = 0 if y = 1 and h_θ(x) = 1, but as h_θ(x) → 0 the cost → ∞. This captures the intuition that if h_θ(x) = 0 (we predict P(y = 1 | x; θ) = 0) but in fact y = 1, we penalize the learning algorithm by a very large cost.

1.3 Simplified cost function and gradient descent

Logistic regression cost function:

    J(θ) = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^(i)), y^(i))

    Cost(h_θ(x), y) = -log(h_θ(x))        if y = 1
    Cost(h_θ(x), y) = -log(1 - h_θ(x))    if y = 0

Note: y is always 0 or 1, so we can rewrite the cost as

    Cost(h_θ(x), y) = -y log(h_θ(x)) - (1 - y) log(1 - h_θ(x))

and when y = 1 or y = 0 this expression loses the corresponding term. The cost function then becomes

    J(θ) = -(1/m) [ Σ_{i=1}^{m} y^(i) log h_θ(x^(i)) + (1 - y^(i)) log(1 - h_θ(x^(i))) ]

To fit the parameters θ: min_θ J(θ).
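A minimal Octave/MATLAB sketch of this cost, assuming X is an m x (n+1) design matrix whose first column is all ones and y is an m x 1 vector of 0/1 labels (the function name logisticCost is illustrative, not from the lecture):

    function J = logisticCost(theta, X, y)
      m = length(y);
      h = 1 ./ (1 + exp(-X * theta));                       % h_theta(x) for every example
      J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));   % the summation above, vectorized
    end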

To make a prediction given a new x, output

    h_θ(x) = 1 / (1 + e^(-θᵀx))

which we interpret as the probability that y = 1: P(y = 1 | x; θ).

We minimize the cost function with gradient descent. min_θ J(θ):

    Repeat {
        θ_j := θ_j - α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x_j^(i)      (simultaneously for j = 0, ..., n)
    }

Vectorized:

    θ := θ - (α/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x^(i)

Although this update rule looks identical to the one for linear regression, it is not the same algorithm, because the definition of h_θ(x) has changed. Apply the same method as in linear regression to make sure logistic regression is converging: plot the cost against the number of iterations.
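A sketch of the full loop under the same assumptions on X and y as before (the function name and the returned cost history are additions here, so that the cost can be plotted against the iteration number):

    function [theta, Jhistory] = gradientDescentLogistic(X, y, theta, alpha, numIters)
      m = length(y);
      Jhistory = zeros(numIters, 1);
      for k = 1:numIters
        h = 1 ./ (1 + exp(-X * theta));                               % predictions at the current theta
        Jhistory(k) = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h)); % cost before this update
        theta = theta - (alpha / m) * (X' * (h - y));                 % vectorized update from above
      end
    end

plot(1:numIters, Jhistory) should decrease at every iteration if α is small enough.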

Note, recall:

    J(θ) = -(1/m) [ Σ_{i=1}^{m} y^(i) log h_θ(x^(i)) + (1 - y^(i)) log(1 - h_θ(x^(i))) ]

Vectorized:

    J(θ) = (1/m) [ -yᵀ log(h_θ(X)) - (1 - y)ᵀ log(1 - h_θ(X)) ]

The gradient of the cost is a vector of the same length as θ, where the j-th element (for j = 0, 1, 2, ..., n) is defined as follows:

    ∂J(θ)/∂θ_j = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x_j^(i)

Vectorized:

    ∇J(θ) = (1/m) Xᵀ (h_θ(X) - y)

1.4 Advanced Optimization

Given θ, we have code that can compute:

- J(θ)
- ∂J(θ)/∂θ_j   (for j = 0, 1, ..., n)

We can use these to minimize the cost function with:

- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS

Advantages of the last three:

- No need to manually pick α: a line search algorithm automatically picks α, possibly a different one for each iteration.
- Often faster than gradient descent.

Disadvantages:

- More complex.

Example: θ = [θ_1; θ_2], minimized at θ_1 = 5, θ_2 = 5:

    J(θ) = (θ_1 - 5)^2 + (θ_2 - 5)^2
    ∂J(θ)/∂θ_1 = 2(θ_1 - 5)
    ∂J(θ)/∂θ_2 = 2(θ_2 - 5)

    function [jval, gradient] = costfunction(theta)
      jval = (theta(1) - 5)^2 + (theta(2) - 5)^2;
      gradient = zeros(2, 1);
      gradient(1) = 2 * (theta(1) - 5);
      gradient(2) = 2 * (theta(2) - 5);
    end

jval gets the value of the cost function; gradient gets a 2x1 vector whose two entries are the two derivative terms.
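A quick sanity check of this example at the Octave/MATLAB prompt (assuming the costfunction above is saved as costfunction.m; the expected values follow directly from the formulas):

    [jval, grad] = costfunction([0; 0])    % jval = 50, grad = [-10; -10]
    [jval, grad] = costfunction([5; 5])    % jval = 0 and grad = [0; 0] at the minimum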

We can then call the advanced optimization function fminunc (function minimization unconstrained), which chooses an optimal value of θ automagically:

    options = optimset('GradObj', 'on', 'MaxIter', 100);
    initialTheta = zeros(2, 1);
    [optTheta, functionVal, exitFlag] = ...
        fminunc(@costfunction, initialTheta, options);

@costfunction is a pointer (function handle) to the costfunction above. We set some options: 'GradObj', 'on' means that we will provide a gradient; 'MaxIter', 100 sets the maximum number of iterations. fminunc requires initialTheta ∈ R^d with d ≥ 2.

How do we apply this to logistic regression? With theta = [θ_0; θ_1; ...; θ_n]:

    function [jval, gradient] = costfunction(theta)
      jval = [code to compute J(θ)];
      gradient(1) = [code to compute ∂J(θ)/∂θ_0];
      gradient(2) = [code to compute ∂J(θ)/∂θ_1];
      ...
      gradient(n+1) = [code to compute ∂J(θ)/∂θ_n];

1.5 Multi-class classification: One-vs-all

How do we get logistic regression to work with multi-class classification problems?

Scenario 1: tagging emails for different folders.
Scenario 2: medical diagnosis: not ill, cold, flu.
Scenario 3: weather: sunny, cloudy, rainy, snow.

There are now more than two groups of points on the plot. Turn the problem into 3 separate binary classification problems: for each one, create a "fake" training set in which, say, Class 1 is assigned to the positive class and Classes 2 and 3 to the negative class.

    h_θ^(1)(x) will denote, say, the triangles
    h_θ^(2)(x) will denote, say, the squares
    h_θ^(3)(x) will denote, say, the x's

We fit 3 classifiers, each of which tries to estimate the probability that y = i:

    h_θ^(i)(x) = P(y = i | x; θ)    (i = 1, 2, 3)

That is, train a logistic regression classifier h_θ^(i)(x) for each class i to predict the probability that y = i. On a new input x, to make a prediction, pick the class i that maximizes h_θ^(i)(x):

    max_i h_θ^(i)(x)
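Putting the pieces together, here is a one-vs-all training and prediction sketch that fills in the costfunction skeleton from section 1.4. It assumes X is m x (n+1) with a leading column of ones and y is m x 1 with labels 1, ..., K; the names lrCostFunction, oneVsAll, predictOneVsAll, and allTheta are all illustrative.

    function [jval, gradient] = lrCostFunction(theta, X, y)
      % logistic regression cost and gradient for 0/1 labels y
      m = length(y);
      h = 1 ./ (1 + exp(-X * theta));
      jval = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));
      gradient = (1/m) * (X' * (h - y));          % gradient(j+1) = dJ/dtheta_j
    end

    function allTheta = oneVsAll(X, y, K)
      % one row of fitted parameters per class, each trained on a "fake" 0/1 training set
      ncols = size(X, 2);                         % n + 1
      allTheta = zeros(K, ncols);
      options = optimset('GradObj', 'on', 'MaxIter', 100);
      for i = 1:K
        yi = double(y == i);                      % class i is positive, all the rest negative
        allTheta(i, :) = fminunc(@(t) lrCostFunction(t, X, yi), zeros(ncols, 1), options)';
      end
    end

    function p = predictOneVsAll(allTheta, X)
      probs = 1 ./ (1 + exp(-X * allTheta'));     % m x K matrix of h_theta^(i)(x)
      [~, p] = max(probs, [], 2);                 % pick the class i that maximizes h^(i)(x)
    end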