Classification: Linear Discriminant Functions
CE-725: Statistical Pattern Recognition
Sharif University of Technology, Spring 2013
Soleymani
Outline
- Discriminant Functions
- Linear Discriminant Functions
  - Least Mean Squared Error Method
  - Sum of Squared Error Method
  - Perceptron
- Multi-class Problems
  - Linear Machine
  - Completely Linearly Separable
  - Pairwise Linearly Separable
- Generalized LDFs
Classification Problem
- Given: a training set of labeled input-output pairs $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$
- Goal: given a new input $x$, assign it to one of $K$ classes
Types of Classifiers
- Probabilistic classification approaches (previous lectures):
  - Generative
  - Discriminative
- Discriminant functions
  - Various procedures exist for determining discriminant functions (some of them statistical)
  - However, they do not require knowledge of the forms of the underlying probability distributions
Discriminant Functions
- Discriminant functions: a popular way of representing a classifier
- One discriminant function $g_i(x)$ for each class $C_i$ ($i = 1, \dots, K$):
  - $x$ is assigned to class $C_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$
- Decision surfaces (boundaries) can also be found using discriminant functions
  - Boundary between regions $\mathcal{R}_i$ and $\mathcal{R}_j$: $\{x \mid g_i(x) = g_j(x)\}$
Probabilistic Discriminant Functions
- Maximum likelihood: $g_i(x) = p(x \mid C_i)$
- Bayesian classifier: $g_i(x) = p(C_i \mid x)$
- Expected loss (conditional risk): $g_i(x) = -R(\alpha_i \mid x)$, where the sign is flipped so that the largest $g_i$ corresponds to the action with the smallest risk
Discriminant Functions: Two-Category
- For a two-category problem, a single function $g: \mathbb{R}^d \to \mathbb{R}$ suffices:
  $g(x) = g_1(x) - g_2(x)$
- Decision surface: $g(x) = 0$
- First, we explain the two-category classification problem and then discuss multi-category problems.
Linear Discriminant Functions
- Linear discriminant functions (LDFs): decision boundaries that are linear in $x$, or linear in some given set of functions of $x$
- Why LDFs?
  - They can be optimal for some problems, e.g., if the underlying distributions $p(x \mid C_i)$ are Gaussians with equal covariance matrices
  - Even when they are not optimal, we can exploit their simplicity: LDFs are relatively easy to compute
  - In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers.
LDFs: Two-Category
- $g(x; w) = w^T x + w_0$
  - $w_0$: bias
  - $w$ contains the parameters we need to set
- Decision rule: if $g(x) \geq 0$ then decide $C_1$, else decide $C_2$
- Augmented form: $x = [1, x_1, \dots, x_d]^T$ and $w = [w_0, w_1, \dots, w_d]^T$, so that $g(x; w) = w^T x$
- Decision surface (boundary): the equation $g(x; w) = 0$ defines the decision surface separating samples of the two categories
  - When $g(x; w)$ is linear, the decision surface is a hyperplane.
LDFs: Two-Category
- The decision boundary $H$ is a $(d-1)$-dimensional hyperplane in the $d$-dimensional feature space
- The orientation of $H$ is determined by the normal vector $w$; the location of the surface is determined by the bias $w_0$
- Any $x$ can be decomposed as $x = x_p + r \frac{w}{\|w\|}$, where $x_p$ is the projection of $x$ onto $H$; then $g(x) = r \|w\|$, so $g(x)$ is proportional to the signed distance $r = g(x) / \|w\|$ from $x$ to $H$
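A quick numeric sketch of the signed-distance formula, using numpy; the particular weights are illustrative (they match the example on the next slide), not part of the derivation:

```python
import numpy as np

# Hypothetical 2-D hyperplane g(x) = w^T x + w0
w = np.array([0.75, 1.0])   # normal vector: sets the orientation of H
w0 = -3.0                   # bias: sets the location of H

x = np.array([4.0, 2.0])    # a query point

g = w @ x + w0              # discriminant value g(x)
r = g / np.linalg.norm(w)   # signed distance from x to H

print(f"g(x) = {g:.2f}, signed distance r = {r:.2f}")  # positive: C1 side
```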
LDFs: Two-Category
- Example (in $\mathbb{R}^2$): decision boundary $-3 + 0.75\,x_1 + x_2 = 0$
- Augmented form: $w = [-3, 0.75, 1]^T$, $x = [1, x_1, x_2]^T$
- Decision rule: if $g(x) = w^T x \geq 0$ decide $C_1$, else $C_2$
[Figure: the line $-3 + 0.75\,x_1 + x_2 = 0$ separating the two classes in the $(x_1, x_2)$ plane]
LDFs: Cost Function
- Finding LDFs is formulated as an optimization problem:
  - A cost function is defined, and a procedure is used to minimize it.
- Criterion or cost functions for classification:
  - Average training error or loss incurred in classifying training samples
  - A small training error does not guarantee a small test error
- We will investigate several cost functions for the classification problem
LDFs: Methods
Many classification methods are based on LDFs:
- Mean Squared Error
- Sum of Squared Error
- Perceptron
- Fisher Linear Discriminant Analysis (LDA) [next lectures]
- SVM [next lecture]
Main Steps in Methods Based on LDFs
- We have specified the class of discriminant functions as linear
- Select how to measure the prediction loss
- Based on the training set $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, a cost function $J(w)$ is defined (e.g., SSE) as a function of the classifier parameters
- Solve the resulting optimization problem to find the optimal parameters:
  $\hat{w} = \arg\min_w J(w; D)$
Bayes vs. LDFs Cost Function
- Bayes minimum-error classifier: $\min_{\alpha(\cdot)} P(\alpha(x) \neq y)$
  - If we know the probabilities in advance, this optimization problem is solved easily:
    $\alpha(x) = \arg\max_i p(C_i \mid x)$
- We only have a set of training samples instead of the true distributions, so we usually optimize the empirical loss:
  $\min_w \sum_{i=1}^{N} \mathrm{Loss}\big(y^{(i)}, g(x^{(i)}; w)\big)$
Mean Squared Error
- Squared loss: $\mathrm{Loss}\big(y, g(x; w)\big) = \big(y - g(x; w)\big)^2$
- Two-category target coding: $y \in \{-1, 1\}$
  - $y^{(i)} = -1$ for $x^{(i)} \in C_2$, $y^{(i)} = 1$ for $x^{(i)} \in C_1$
Sum of Squared Error (SSE)
- Cost function: prediction errors on the training set (empirical loss on the training samples):
  $J(w) = \sum_{i=1}^{N} \big(y^{(i)} - w^T x^{(i)}\big)^2$
Sum of Squared Error (SSE)
- Stack the augmented training samples as the rows of $X \in \mathbb{R}^{N \times (d+1)}$ and the targets as $y = [y^{(1)}, \dots, y^{(N)}]^T$, so that $J(w) = \|y - Xw\|^2$
- Setting $\nabla_w J = 0$ gives the normal equations $X^T X w = X^T y$, hence
  $\hat{w} = (X^T X)^{-1} X^T y = X^\dagger y$
  where $X^\dagger$ is the pseudo-inverse of $X$
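A minimal sketch of the pseudo-inverse solution in numpy, on hypothetical toy data (the two Gaussian blobs and all names here are illustrative, not from the slides):

```python
import numpy as np

# Toy data: two 2-D blobs, labels y in {-1, +1}
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))    # class C1
X2 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))  # class C2
X = np.vstack([X1, X2])
y = np.concatenate([np.ones(20), -np.ones(20)])

# Augment each sample with a leading 1 so the bias w0 is absorbed into w
Xa = np.hstack([np.ones((X.shape[0], 1)), X])

# SSE solution: w_hat = pinv(Xa) @ y (solves the normal equations)
w_hat = np.linalg.pinv(Xa) @ y

y_pred = np.sign(Xa @ w_hat)
print("training accuracy:", np.mean(y_pred == y))
```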
Sum of Squared Error (SSE)
- SSE penalizes "too correct" predictions
  - Too correct: samples that lie a long way on the correct side of the decision boundary, so that $w^T x \gg 1$ while the target is $y = 1$
- It also lacks robustness to noise
[Figure: correct predictions that are penalized by SSE] [Bishop]
Perceptron
- Two-category target coding: $y^{(i)} = -1$ for $x^{(i)} \in C_2$, $y^{(i)} = 1$ for $x^{(i)} \in C_1$
- Goal: find $w$ such that $y^{(i)} w^T x^{(i)} > 0$ for all training samples
- Decision rule: $\hat{y} = \begin{cases} -1, & w^T x < 0 \\ 1, & w^T x \geq 0 \end{cases}$
Perceptron Criterion
- Only misclassified training samples affect the discriminant function:
  $J_P(w) = -\sum_{x^{(i)} \in \mathcal{M}} y^{(i)} w^T x^{(i)}$
  - $\mathcal{M}$: the subset of training data that are misclassified
- Many solutions may exist; which solution among them should we pick?
Perceptron vs. Other Criteria
[Figure: the misclassification-count criterion vs. the perceptron criterion, plotted as functions of the weights] [Duda, Hart & Stork]
Some Classification Criteria
[Figure: the perceptron and SSE criteria, plotted as functions of the weights] [Duda, Hart & Stork]
Batch Perceptron
- Gradient descent on the perceptron criterion, using $\nabla_w J_P = -\sum_{x^{(i)} \in \mathcal{M}} y^{(i)} x^{(i)}$:
  - Initialize $w$, $t \leftarrow 0$
  - Repeat:
    - $w \leftarrow w + \eta \sum_{x^{(i)} \in \mathcal{M}} y^{(i)} x^{(i)}$
    - $t \leftarrow t + 1$
  - Until $\big\|\eta \sum_{x^{(i)} \in \mathcal{M}} y^{(i)} x^{(i)}\big\| < \theta$
- The batch perceptron converges in a finite number of steps for linearly separable data
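A sketch of the batch perceptron in numpy, assuming augmented samples Xa with labels y in {-1, +1}; the function name, stopping rule, and defaults are our own:

```python
import numpy as np

def batch_perceptron(Xa, y, eta=1.0, max_iter=1000):
    """Batch perceptron on augmented samples Xa (N x (d+1)), labels y in {-1,+1}."""
    w = np.zeros(Xa.shape[1])
    for _ in range(max_iter):
        mis = y * (Xa @ w) <= 0         # misclassified: y_i * w^T x_i <= 0
        if not mis.any():
            break                        # all samples correctly classified
        # Gradient step on J_P: w <- w + eta * sum over M of y_i x_i
        w = w + eta * (y[mis] @ Xa[mis])
    return w
```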
Single-Sample Perceptron
- If $x^{(i)}$ is misclassified: $w \leftarrow w + \eta\, y^{(i)} x^{(i)}$
  - $\eta$ can be set to 1
- Fixed-Increment Single-Sample Perceptron:
  - Initialize $w$, $i \leftarrow 0$
  - Repeat:
    - $i \leftarrow (i + 1) \bmod N$
    - If $x^{(i)}$ is misclassified then $w \leftarrow w + y^{(i)} x^{(i)}$
  - Until all patterns are properly classified
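The fixed-increment single-sample variant, sketched the same way (with $\eta = 1$; the epoch cap is our own safeguard for non-separable data):

```python
import numpy as np

def single_sample_perceptron(Xa, y, max_epochs=100):
    """Fixed-increment single-sample perceptron (eta = 1)."""
    N, d = Xa.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):               # cycle through the samples
            if y[i] * (Xa[i] @ w) <= 0:  # x_i is misclassified
                w = w + y[i] * Xa[i]     # w <- w + y_i x_i
                errors += 1
        if errors == 0:                  # all patterns properly classified
            return w
    return w                             # may not converge if not separable
```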
Convergence of Perceptron
- Each update changes $w$ in a direction that corrects the error on the current sample
[Figure: perceptron weight updates in feature space] [Bishop]
Convergence of Perceptron
- If the training data set is linearly separable, the single-sample perceptron algorithm is also guaranteed to find a solution in a finite number of steps [Duda, Hart & Stork]
LDFs: Multi-Category
- Linear discriminant functions for multi-category problems:
  - Linear machine: a discriminant function $g_i(x)$ for each class
  - Converting the problem to a set of two-class problems:
    - One versus rest (one against all): for each class $C_i$, an LDF separates samples of $C_i$ from all the other samples ("totally linearly separable")
    - One versus one: $K(K-1)/2$ LDFs are used, one to separate the samples of each pair of classes ("pairwise linearly separable")
- Converting the problem to a set of two-class problems can lead to regions in which the classification is undefined.
Multi-Category Classification: One-vs-All (One-vs-Rest)
[Figure: three one-vs-rest boundaries in the $(x_1, x_2)$ plane for classes 1, 2, and 3]
Multi-Category Classification: One-vs-One
[Figure: three pairwise boundaries in the $(x_1, x_2)$ plane for classes 1, 2, and 3]
Multi-Category Classification: Ambiguity
[Figure: ambiguous (undefined) regions left by the one-versus-rest and one-versus-one decompositions] [Duda, Hart & Stork]
Multi-Category Classification: Linear Machine
- A discriminant function for each class: $g_i(x) = w_i^T x + w_{i0}$
- $x$ is assigned to class $C_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$
- Decision surfaces (boundaries) can also be found using the discriminant functions
  - Boundary between the contiguous regions $\mathcal{R}_i$ and $\mathcal{R}_j$: $\{x \mid g_i(x) = g_j(x)\}$, i.e.,
    $(w_i - w_j)^T x + (w_{i0} - w_{j0}) = 0$
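A minimal sketch of the linear machine decision rule; storing W as a (d+1)-by-K matrix of augmented per-class weights is our own convention:

```python
import numpy as np

def linear_machine_predict(Xa, W):
    """Assign each augmented sample (row of Xa) to the class with largest g_k(x).

    W: (d+1) x K matrix; column k holds the augmented weights of g_k.
    """
    scores = Xa @ W                    # N x K matrix of g_k(x) values
    return np.argmax(scores, axis=1)   # index of the winning class per sample
```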
Multi-Category Classification: Linear Machine
[Figure: decision regions of a linear machine] [Duda, Hart & Stork]
Multi-Category Classification: Linear Machine
- The decision regions of a linear machine are convex:
  - Convex region: for all $x_A, x_B \in \mathcal{R}$ and $0 \leq \lambda \leq 1$, $\lambda x_A + (1-\lambda) x_B \in \mathcal{R}$
  - If $x_A, x_B \in \mathcal{R}_i$, then since each $g_k$ is linear, for every $j \neq i$:
    $g_i(\lambda x_A + (1-\lambda) x_B) = \lambda g_i(x_A) + (1-\lambda) g_i(x_B) > \lambda g_j(x_A) + (1-\lambda) g_j(x_B) = g_j(\lambda x_A + (1-\lambda) x_B)$
    so $\lambda x_A + (1-\lambda) x_B \in \mathcal{R}_i$
- Linear machines are therefore most suitable for problems where the class-conditional densities $p(x \mid C_i)$ are unimodal.
Multi-Category Classification: Target Coding Scheme
- Target values:
  - Binary classification: a single target variable $y \in \{0, 1\}$
  - Multiple classes ($K > 2$): 1-of-K (one-hot) coding, $y \in \{0, 1\}^K$, where
    $y_k = 1$ if $x \in C_k$, and $y_j = 0$ for all $j \neq k$
SSE Cost Function: Multi-Class
- Stack the augmented samples as the rows of $X \in \mathbb{R}^{N \times (d+1)}$ and the 1-of-K target vectors as the rows of $Y \in \mathbb{R}^{N \times K}$; the weight matrix $W$ has $w_k$ as its $k$-th column
- Minimizing $J(W) = \|Y - XW\|_F^2$ gives
  $\hat{W} = (X^T X)^{-1} X^T Y = X^\dagger Y$
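A sketch of the multi-class SSE fit with 1-of-K targets; the helper function is hypothetical and assumes integer class labels 0..K-1:

```python
import numpy as np

def fit_sse_multiclass(Xa, labels, K):
    """SSE fit with 1-of-K targets. Xa: N x (d+1) augmented samples,
    labels: integer array of class indices in 0..K-1."""
    N = Xa.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), labels] = 1.0   # one-hot target matrix (N x K)
    W = np.linalg.pinv(Xa) @ Y      # W = pinv(X) @ Y; columns are the w_k
    return W

# Prediction then follows the linear machine rule:
# labels_hat = np.argmax(Xa @ W, axis=1)
```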
SSE Cost Function: Multi-Class
[Figure: low performance of the SSE cost function for the classification problem] [Bishop]
Perceptron: Multi-Class
- Weights: $W = [w_1, \dots, w_K]$; $\mathcal{M}$: the subset of training data that are misclassified
- Single-sample multi-class perceptron:
  - Initialize $W = [w_1, \dots, w_K]$, $i \leftarrow 0$
  - Repeat:
    - $i \leftarrow (i + 1) \bmod N$
    - If $x^{(i)}$ (with true class $C_j$) is misclassified as $C_k$, then:
      $w_k \leftarrow w_k - \eta\, x^{(i)}$ and $w_j \leftarrow w_j + \eta\, x^{(i)}$
  - Until all patterns are properly classified
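A sketch of this update rule in numpy; the epoch cap is our own safeguard for non-separable data, and integer labels 0..K-1 are assumed:

```python
import numpy as np

def multiclass_perceptron(Xa, labels, K, eta=1.0, max_epochs=100):
    """Single-sample multi-class perceptron on augmented samples Xa."""
    W = np.zeros((Xa.shape[1], K))           # one weight column per class
    for _ in range(max_epochs):
        errors = 0
        for i in range(Xa.shape[0]):
            k_hat = np.argmax(Xa[i] @ W)     # predicted class
            k = labels[i]                    # true class
            if k_hat != k:                   # x_i is misclassified
                W[:, k] += eta * Xa[i]       # reinforce the true class
                W[:, k_hat] -= eta * Xa[i]   # penalize the wrong winner
                errors += 1
        if errors == 0:                      # all patterns properly classified
            break
    return W
```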
Generalized LDFs
- Linear combination of fixed non-linear functions of the input vector:
  $g(x) = w^T \phi(x) = \sum_j w_j \phi_j(x)$
  - $\phi_j$: a set of basis functions (or features)
- We will discuss these in the next lectures.
Generalized LDFs: Example
- Choose non-linear features, e.g., $\phi(x) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]^T$ for $x = [x_1, x_2]^T$
- The discriminant function is still linear in the parameters
- With $w = [-1, 0, 0, 1, 1, 0]^T$, the boundary is $-1 + x_1^2 + x_2^2 = 0$, a circle in the input space
- Decision rule: if $g(x) = w^T \phi(x) \geq 0$ then $\hat{y} = 1$, else $\hat{y} = -1$
- We will discuss these in the next lectures.
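The quadratic feature map from this example, sketched in numpy to show that $g$ stays linear in $w$ even though the boundary is non-linear in $x$ (function names are our own):

```python
import numpy as np

def phi(x):
    """Quadratic feature map phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])  # weights from the slide

def predict(x):
    # g(x) = w^T phi(x); boundary is the unit circle x1^2 + x2^2 = 1
    return 1 if w @ phi(x) >= 0 else -1

print(predict([0.5, 0.5]))   # inside the circle  -> -1
print(predict([2.0, 0.0]))   # outside the circle -> +1
```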