

Machine Learning, Topic 5: Linear Discriminants
Bryan Pardo, EECS 349 Machine Learning, 2013
Thanks to Mark Cartwright for his extensive contributions to these slides. Thanks to Alpaydin, Bishop, and Duda/Hart/Stork for images and ideas.

General Classification Learning Task
There is a set of possible examples X = {x_1, ..., x_n}. Each example is a k-tuple of attribute values: x = <a_1, ..., a_k>.
There is a target function f : X → Y that maps X onto some finite set Y.
The DATA is a set of tuples <example, target function value>: D = {<x_1, f(x_1)>, ..., <x_m, f(x_m)>}.
Find a hypothesis h such that, for every example, h(x) approximates f(x).

Where would you draw the line?
[Scatter plot of two classes of examples in the (a_1, a_2) feature space.]

Linear Discriminants
A linear combination of the attributes: g(x | w, w_0) = w_0 + w^T x = w_0 + Σ_{i=1..k} w_i a_i
Easily interpretable.
Optimal when the classes are Gaussian and share a covariance matrix.

Two-Class Classification
g(x) = 0 defines a decision boundary that splits the space in two. In two dimensions, g(x) = w_1 a_1 + w_2 a_2 + w_0 = 0 is a line. If a line exists that does this without error, the classes are linearly separable.
h(x) = +1 if g(x) > 0, -1 otherwise
[Plot: the boundary g(x) = 0 in the (a_1, a_2) plane, with g(x) > 0 on one side and g(x) < 0 on the other.]
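To make the decision rule concrete, here is a minimal sketch (assuming NumPy; the weight values are made up for illustration, not taken from the slides) of evaluating g(x) = w_0 + w^T x and applying the sign rule:

import numpy as np

# Hypothetical weights for a 2-attribute discriminant: g(x) = w0 + w1*a1 + w2*a2
w = np.array([1.0, -2.0])   # weight vector (normal to the decision boundary)
w0 = 0.5                    # threshold / offset term

def g(x):
    """Linear discriminant value for attribute vector x."""
    return w0 + w @ x

def h(x):
    """Two-class decision rule: +1 if g(x) > 0, otherwise -1."""
    return 1 if g(x) > 0 else -1

x = np.array([3.0, 1.0])
print(g(x), h(x))   # 1.5 and 1 for these illustrative values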

Defining a Hyperplane
Any hyperplane can be written as the set of points x satisfying w^T x + w_0 = 0, where w and x are vectors in R^d.
The vector w is a normal vector: it is perpendicular to the hyperplane. It is often referred to as the weight vector.
The parameter w_0 determines the offset of the hyperplane from the origin along the normal vector. It is often referred to as the threshold.
Distance to the origin: d_0 = w_0 / ||w||, where ||w|| is the Euclidean norm of w.

Estimating the Model Parameters
There are many ways to estimate the model parameters:
- Minimize a least-squares criterion (classification via regression)
- Maximize the Fisher linear discriminant criterion
- Minimize the perceptron criterion (using numerical optimization techniques, e.g. gradient descent)
- Many other methods solve for the inequalities using constrained optimization

Classification via Regression
- Label each class by a number; call that the response variable
- Analytically derive a regression line
- Round the regression output to the nearest label number
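As a sketch of this recipe (assuming NumPy and a made-up one-dimensional dataset, not one from the lecture), fit a least-squares line to 0/1 class labels and then round the regression output at 0.5:

import numpy as np

# Hypothetical 1-D training data: attribute values and 0/1 class labels.
x = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5])
r = np.array([0,   0,   0,   0,   1,   1,   1,   1])   # response variable

# Least-squares fit of r ~ w0 + w1*x (design matrix with a bias column).
A = np.column_stack([np.ones_like(x), x])
w0, w1 = np.linalg.lstsq(A, r, rcond=None)[0]

def classify(x_new):
    """Round the regression output to the nearest label (threshold at 0.5)."""
    return int(w0 + w1 * x_new > 0.5)

print(classify(1.2), classify(3.8))   # 0 and 1 for this toy data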

An example:
[Plot of the response r versus x: class-1 training points at r = 1, class-0 points at r = 0, the fitted regression line, and the resulting decision boundary where the line crosses r = 0.5.]

What happens now?
[The same style of r-versus-x plot as the previous slide, with the fitted regression line and its decision boundary at r = 0.5.]

Estimating the Model Parameters
There are many ways to estimate the model parameters:
- Minimize a least-squares criterion (i.e. classification via regression, as shown earlier)
- Maximize the Fisher linear discriminant criterion
- Minimize the perceptron criterion (using numerical optimization techniques, e.g. gradient descent)
- Many other methods solve for the inequalities using constrained optimization

Hill-Climbing (a.k.a. Gradient Descent)
Start somewhere and head up (or down) the hill.
[Plot: objective function J(w) versus w, the value of some parameter.]

Hill-Climbing
Easy to get stuck in local maxima (or minima).
[Plot: objective function J(w) versus parameter value w, with several local optima.]

What's our objective function?
- Minimize the number of misclassifications?
- Maximize the margin between the classes?
- Personal satisfaction?
[Scatter plot of two classes of points.]

Gradient Descent
- Simple first-order numerical optimization method
- Idea: follow the gradient to a minimum
- Finds the global minimum when the objective function is convex, otherwise a local minimum
- The objective function (the function you are minimizing) must be differentiable
- Used when there is no analytical solution for finding the minimum

SSE: Our Objective Function
We're going to use the sum of squared errors (SSE) as our classifier error function. As a reminder, we used SSE in linear regression (next slide).
Note that in linear regression, the line we learn embodies our hypothesis function. Therefore, any change in the line will change our SSE by some amount.

Simple Linear Regression
Typically estimate parameters by minimizing the sum of squared residuals (a.k.a. least squares):
E(w) = (1/2) Σ_{i=1..m} (r_i - h(x_i))^2
where m is the number of training examples, r_i is the response variable for example i, and r_i - h(x_i) is the residual.

SSE: Our Objective Function
In classification (next slide), the line we learn is an INPUT to our hypothesis function. Therefore, a change of the line may not change the SSE objective function, if moving it does not change the classification of any training point.
Note also that in the next slide both the horizontal and vertical dimensions are INPUTS to the hypothesis function, while in the previous slide the vertical dimension was the OUTPUT. Therefore, the dimensions a_1 and a_2 are the elements of the vector x that gets handed to h(x).

Simple Linear Classification
Typically estimate parameters by minimizing the sum of squared residuals (a.k.a. least squares):
E(w) = (1/2) Σ_m (f(x_m) - h(x_m))^2, with h(x) = +1 if g(x) > 0, -1 otherwise
[Plot: the boundary g(x) = 0 in the (a_1, a_2) plane, with g(x) > 0 on one side and g(x) < 0 on the other.]
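A minimal sketch (assuming NumPy and toy ±1-labeled data, not the course's dataset) of evaluating this SSE for a hard-threshold linear classifier; it echoes the point on the earlier SSE slide that moving the line may not change the error if no training point changes class:

import numpy as np

# Toy 2-D data with labels f(x) in {+1, -1} (illustrative values only).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
f = np.array([1, 1, -1, -1])

def sse(w, w0):
    """E(w) = 1/2 * sum_m (f(x_m) - h(x_m))^2 for the hard-threshold classifier."""
    h = np.where(X @ w + w0 > 0, 1, -1)
    return 0.5 * np.sum((f - h) ** 2)

print(sse(np.array([1.0, 1.0]), 0.0))    # 0.0: this boundary classifies every point correctly
print(sse(np.array([1.1, 1.0]), 0.0))    # still 0.0: nudging w flips no training point
print(sse(np.array([-1.0, -1.0]), 0.0))  # 8.0: every point misclassified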

Gradient Descent
Gradient descent is a useful, simple optimization method, but it can be very slow on difficult problems. There are many, many optimization methods out there for different types of problems. Take the optimization courses offered in EECS and IEMS to learn more.

Gradient Descent
Notation: w are the parameters, θ is the convergence threshold, η is the step size (in general, important to choose well), and ∇J(w) is the gradient (the vector of partial derivatives of the objective function with respect to the parameters).
Algorithm:
begin initialize w, threshold θ, step size η
  do w ← w - η ∇J(w)
  until |η ∇J(w)| < θ
  return w
end

Gradient Descent
In batch gradient descent, J is a function of both the parameters and ALL training samples, summing the total error (e.g. the SSE above).
In stochastic gradient descent, J is a function of the parameters and a different single random training sample at each iteration. This is a common choice in machine learning when there is a lot of training data and computing the sum over all samples is expensive.
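A minimal sketch (assuming NumPy, a made-up 1-D regression problem, and hand-picked step size and iteration counts) contrasting the two: the batch update uses the gradient summed over all samples, while the stochastic update uses one randomly chosen sample per step.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, size=50)
r = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=50)   # noisy line: slope 2, intercept 1

def grad(w, xs, rs):
    """Gradient of J(w) = 1/2 * sum (w[0] + w[1]*x - r)^2 over the given samples."""
    err = w[0] + w[1] * xs - rs
    return np.array([np.sum(err), np.sum(err * xs)])

eta = 0.002   # step size

# Batch gradient descent: each step uses the gradient over ALL samples.
w = np.zeros(2)
for _ in range(2000):
    w = w - eta * grad(w, x, r)
print("batch:", w)        # approaches [1, 2]

# Stochastic gradient descent: each step uses a single random sample.
w = np.zeros(2)
for _ in range(20000):
    i = rng.integers(len(x))
    w = w - eta * grad(w, x[i:i+1], r[i:i+1])
print("stochastic:", w)   # also ends up near [1, 2], a bit noisier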

Perceptron Criterion
J_p(w) = Σ_{x ∈ M} -y w^T x
NOTE: x is augmented to include the threshold w_0 (x = <1, a_1, ..., a_k> and w = <w_0, w_1, ..., w_k>), and y ∈ {+1, -1} is the class label, so that we want y w^T x > 0 for every training example; M is the set of misclassified training examples.
We can minimize this criterion to solve for our weight vector w.
- Restricted to a 2-class discriminant
- We can use stochastic gradient descent to solve for this
- Only converges when the data is linearly separable

Perceptron Algorithm
w are the parameters; the step size η is set to 1 (this is fine in this case, since multiplying w by a scalar doesn't affect the decision).
Algorithm:
begin initialize w
  do for each training example x (with label y)
    if x is misclassified (y w^T x ≤ 0) then w ← w + y x
  until all examples are properly classified
  return w
end
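A minimal sketch of this fixed-increment update (assuming NumPy and a small linearly separable toy set; the augmented examples carry a leading 1 so that w[0] plays the role of the threshold w_0):

import numpy as np

# Toy linearly separable data: rows are attribute vectors, labels are +1 / -1.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])

# Augment each example with a leading 1 so w[0] acts as the threshold w0.
Xa = np.column_stack([np.ones(len(X)), X])

w = np.zeros(Xa.shape[1])
converged = False
while not converged:                     # terminates only if the data is separable
    converged = True
    for xi, yi in zip(Xa, y):
        if yi * (w @ xi) <= 0:           # misclassified (or on the boundary)
            w = w + yi * xi              # fixed-increment update, step size 1
            converged = False

print("weights:", w)
print("predictions:", np.sign(Xa @ w))   # matches y once the loop terminates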

Example: Perceptron Algorithm
[Sequence of four plots: red is the positive class, blue is the negative class. The candidate decision boundary is updated over successive perceptron iterations.]

Multi-class Classification
When there are N > 2 classes, you can classify using N discriminant functions: choose the class with the maximum output, i.e. choose C_i if g_i(x) = max_{j=1..N} g_j(x).
Geometrically, this divides the feature space into N convex decision regions.
[Plot: the (a_1, a_2) plane partitioned into convex regions, one per class.]
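A minimal sketch of the maximum-output rule (assuming NumPy and made-up weights for N = 3 classes over two attributes): evaluate every g_i(x) = w_{i0} + w_i^T x and pick the argmax.

import numpy as np

# Hypothetical parameters for N = 3 linear discriminants: g_i(x) = w0[i] + W[i] @ x
W = np.array([[ 1.0, 0.0],
              [-1.0, 0.0],
              [ 0.0, 1.0]])
w0 = np.array([0.0, 0.0, -0.5])

def classify(x):
    """Choose C_i such that g_i(x) is the maximum over all N discriminants."""
    g = w0 + W @ x
    return int(np.argmax(g))

print(classify(np.array([2.0, 0.0])))    # g = [2, -2, -0.5] -> class 0
print(classify(np.array([-1.0, 3.0])))   # g = [-1, 1, 2.5]  -> class 2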

Pairwise Multi-class Classification
If the classes are not linearly separable (into singly connected convex regions), they may still be pairwise separable using N(N-1)/2 linear discriminants:
g_ij(x | w_ij, w_ij0) = w_ij0 + Σ_{l=1..k} w_ijl x_l
Choose C_i if, for all j ≠ i, g_ij(x) > 0.
[Plot: the (a_1, a_2) plane partitioned by pairwise decision boundaries.]

Appendix (stuff I didn't have time to discuss in class)

Fisher Linear Discriminant Criterion
Can be thought of as dimensionality reduction from K dimensions to 1, via the projection w^T x.
Objective: maximize the difference between the class means while minimizing the variance within the classes:
J(w) = (m_2 - m_1)^2 / (s_1^2 + s_2^2)
where s_i and m_i are the sample variance and mean for class i in the projected dimension. We want to maximize J.
[Plot: two classes of points and their projections onto a line.]

Fisher Linear Discriminant Criterion
Solution: w ∝ S_W^{-1} (m_2 - m_1), where S_W is the within-class scatter matrix and m_1, m_2 are the class means.
However, this only finds the direction (w) of the decision boundary; you must still solve for the threshold w_0.
Can be expanded to multiple classes.
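A minimal sketch (assuming NumPy and two made-up Gaussian classes) of computing the Fisher direction w ∝ S_W^{-1}(m_2 - m_1), with the threshold then chosen, as one simple option, at the midpoint of the projected class means:

import numpy as np

rng = np.random.default_rng(1)
# Two hypothetical classes with roughly the same covariance.
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X2 = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(100, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter: sum of the (unnormalized) scatter matrices of the two classes.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction (defined only up to scale).
w = np.linalg.solve(S_W, m2 - m1)
print("projection direction:", w / np.linalg.norm(w))

# One simple threshold choice: the midpoint of the projected class means.
w0 = -0.5 * (w @ m1 + w @ m2)
print("class of [2.5, 1.5]:", 2 if w @ np.array([2.5, 1.5]) + w0 > 0 else 1)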

Logistic Regression (Discrimination)
[Scatter plot of two classes in the (a_1, a_2) plane.]

Logistic Regression (Discrimination)
- Discriminant model, but well grounded in probability
- Flexible assumptions (exponential-family class-conditional densities)
- Differentiable error function ("cross entropy")
- Works very well when classes are linearly separable

Logistic Regression (Discrimination)
Probabilistic discriminative model: it models the posterior probability. To see this, let's start with the 2-class formulation:
p(C_1 | x) = p(x | C_1) p(C_1) / ( p(x | C_1) p(C_1) + p(x | C_2) p(C_2) )
           = 1 / (1 + exp(-a))
           = σ(a)
where a = log [ p(x | C_1) p(C_1) / ( p(x | C_2) p(C_2) ) ] and σ(·) is the logistic sigmoid function.

Logistic Regression (Discrimination)
The logistic sigmoid σ(a) is a squashing function that maps the real line (-∞, ∞) to (0, 1).
[Plot: the sigmoid curve rising from 0 toward 1.]

Logistic Regression (Discrimination)
For exponential-family class-conditional densities, a = log [ p(x | C_1) p(C_1) / ( p(x | C_2) p(C_2) ) ] is a linear function of x.
Therefore we can model the posterior probability as a logistic sigmoid acting on a linear function of the attribute vector, and simply solve for the weight vector w (i.e. treat it as a discriminant model): p(C_1 | x) = σ(w^T x + w_0).
To classify: choose C_1 if p(C_1 | x) > 0.5 (equivalently, if w^T x + w_0 > 0), otherwise choose C_2.
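A minimal sketch (assuming NumPy and illustrative, hand-set weights; fitting w by minimizing cross-entropy is not shown here) of the sigmoid posterior and the resulting decision rule:

import numpy as np

def sigmoid(a):
    """Logistic sigmoid: squashes the real line into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical learned parameters for a 2-attribute problem.
w = np.array([1.5, -0.5])
w0 = -1.0

def posterior(x):
    """Model p(C1 | x) as a sigmoid of a linear function of x."""
    return sigmoid(w @ x + w0)

def classify(x):
    """Choose C1 when p(C1 | x) > 0.5, i.e. when w^T x + w0 > 0."""
    return 1 if posterior(x) > 0.5 else 2

x = np.array([2.0, 1.0])
print(posterior(x), classify(x))   # about 0.82 and class 1 for these values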