Boolean Classification
Sanjay Lall and Stephen Boyd
EE104, Stanford University
Boolean classification
Boolean classification

- supervised learning is called boolean classification when the raw output variable $v$ is categorical and can take two possible values
- we denote these $-1$ and $+1$; they often correspond to $\{\text{false}, \text{true}\}$
- for a data record $(u^i, v^i)$, the value $v^i \in \{-1, +1\}$ is called the class or label
- a boolean classifier predicts the label $\hat v$ given the raw input $u$
Classification

(figure: scatter plot of the data points in the plane, colored by class)

- here $u \in \mathbf{R}^2$
- red and blue indicate the two classes, $v^i = +1$ and $v^i = -1$
- we'd like a predictor that maps any $u \in \mathbf{R}^2$ into a prediction $\hat v \in \{-1, +1\}$
Example: nearest neighbor

(figure: data points with the plane partitioned into red and blue prediction regions)

- given $u$, let $k = \mathop{\mathrm{argmin}}_j \|u - u^j\|$, then predict $\hat v = v^k$
- the red and blue regions show the sets of $u$ for which the prediction is one class or the other, matching the point colors
- zero training error (all points classified correctly), but perhaps overfit
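A minimal NumPy sketch of the nearest-neighbor rule above; the arrays `U` and `v` are hypothetical stand-ins for the training data in the figure.

```python
import numpy as np

def nearest_neighbor_predict(u, U, v):
    """Return v^k, where u^k is the training point closest to u."""
    k = np.argmin(np.linalg.norm(U - u, axis=1))
    return v[k]

# hypothetical training set: rows of U are points u^i in R^2, v holds labels in {-1, +1}
U = np.array([[0.0, 1.0], [2.0, -1.0], [-1.5, 0.5]])
v = np.array([+1, -1, +1])
print(nearest_neighbor_predict(np.array([0.2, 0.8]), U, v))  # prints 1
```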
Example: least-squares regression

(figure: data points with the decision boundary of the least squares predictor)

- embed $x = (1, u)$ and $y = v$, apply least squares regression
- gives an affine prediction $\hat y = \theta_1 + \theta_2 u_1 + \theta_3 u_2$, with numerical coefficients from the fit
- predict using $\hat v = \mathrm{sign}(\hat y)$
- unlike nearest neighbor, some points are misclassified at training (nonzero training error)
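A sketch of this approach, assuming plain (unregularized) least squares on the embedded data; the synthetic data here is illustrative only.

```python
import numpy as np

def fit_least_squares(U, v):
    """Fit theta in y_hat = theta^T (1, u) by least squares."""
    X = np.column_stack([np.ones(len(U)), U])      # embed x = (1, u)
    theta, *_ = np.linalg.lstsq(X, v, rcond=None)
    return theta

def predict(theta, U):
    X = np.column_stack([np.ones(len(U)), U])
    return np.sign(X @ theta)                      # un-embed: v_hat = sign(y_hat)

rng = np.random.default_rng(0)
U = rng.standard_normal((200, 2))                  # synthetic points u^i in R^2
v = np.sign(U[:, 0] - U[:, 1] + 0.3 * rng.standard_normal(200))
theta = fit_least_squares(U, v)
print("training error rate:", np.mean(predict(theta, U) != v))
```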
Confusion matrix
The two types of errors

- measure performance of a specific predictor on a set of $n$ data records
- each data point $i$ has $v^i \in \{-1, +1\}$
- and a corresponding prediction $\hat v^i = f(u^i) \in \{-1, +1\}$
- only four possible values for the pair $(\hat v^i, v^i)$:
  - true positive if $\hat v^i = +1$ and $v^i = +1$
  - true negative if $\hat v^i = -1$ and $v^i = -1$
  - false negative or type II error if $\hat v^i = -1$ and $v^i = +1$
  - false positive or type I error if $\hat v^i = +1$ and $v^i = -1$
Confusion matrix

- define the confusion matrix
  $$C = \begin{bmatrix} C_{tn} & C_{fn} \\ C_{fp} & C_{tp} \end{bmatrix} = \begin{bmatrix} \#\text{ true negatives} & \#\text{ false negatives} \\ \#\text{ false positives} & \#\text{ true positives} \end{bmatrix}$$
  (warning: some people use the transpose of $C$)
- $C_{tn} + C_{fn} + C_{fp} + C_{tp} = n$
- $N_n = C_{tn} + C_{fp}$ is the number of negative examples
- $N_p = C_{fn} + C_{tp}$ is the number of positive examples
- diagonal entries give the numbers of correct predictions
- off-diagonal entries give the numbers of incorrect predictions
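The confusion matrix is easy to compute directly from labels and predictions; a small sketch using the row/column layout on this slide (the example arrays are made up):

```python
import numpy as np

def confusion_matrix(v, v_hat):
    """Rows indexed by the prediction, columns by the true label."""
    C_tn = np.sum((v_hat == -1) & (v == -1))
    C_fn = np.sum((v_hat == -1) & (v == +1))
    C_fp = np.sum((v_hat == +1) & (v == -1))
    C_tp = np.sum((v_hat == +1) & (v == +1))
    return np.array([[C_tn, C_fn],
                     [C_fp, C_tp]])

v     = np.array([-1, -1, -1, +1, +1, +1])
v_hat = np.array([-1, +1, -1, +1, -1, +1])
C = confusion_matrix(v, v_hat)
print(C, C.sum())   # the entries sum to n = 6
```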
Some boolean classification measures

- confusion matrix $C = \begin{bmatrix} C_{tn} & C_{fn} \\ C_{fp} & C_{tp} \end{bmatrix}$
- the basic error measures:
  - false positive rate is $C_{fp}/n$
  - false negative rate is $C_{fn}/n$
  - error rate is $(C_{fn} + C_{fp})/n$
- error measures some people use:
  - true positive rate or sensitivity or recall is $C_{tp}/N_p$
  - false alarm rate is $C_{fp}/N_n$
  - specificity or true negative rate is $C_{tn}/N_n$
  - precision is $C_{tp}/(C_{tp} + C_{fp})$
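The measures listed above, computed from a confusion matrix with this layout (a sketch; the example matrix is hypothetical):

```python
import numpy as np

def classification_measures(C):
    (C_tn, C_fn), (C_fp, C_tp) = C
    n   = C_tn + C_fn + C_fp + C_tp
    N_n = C_tn + C_fp                  # number of negative examples
    N_p = C_fn + C_tp                  # number of positive examples
    return {
        "false positive rate": C_fp / n,
        "false negative rate": C_fn / n,
        "error rate": (C_fn + C_fp) / n,
        "recall (sensitivity, TPR)": C_tp / N_p,
        "false alarm rate": C_fp / N_n,
        "specificity (TNR)": C_tn / N_n,
        "precision": C_tp / (C_tp + C_fp),
    }

C = np.array([[40, 6], [4, 50]])       # hypothetical confusion matrix, n = 100
print(classification_measures(C))
```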
Neyman-Pearson error

- the Neyman-Pearson error over a data set is $\kappa\, C_{fn}/n + C_{fp}/n$
- a scalarization of our two objectives, the false positive and false negative rates
- $\kappa$ is how much more false negatives irritate us than false positives
- when $\kappa = 1$, the Neyman-Pearson error is the error rate
- we'll use the Neyman-Pearson error as our scalarized measure
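The Neyman-Pearson error written as a function, matching the definition above (the example counts are made up):

```python
def neyman_pearson_error(C_fn, C_fp, n, kappa=1.0):
    """kappa * C_fn / n + C_fp / n; with kappa = 1 this is the error rate."""
    return kappa * C_fn / n + C_fp / n

print(neyman_pearson_error(C_fn=6, C_fp=4, n=100, kappa=1.0))  # 0.10, the error rate
print(neyman_pearson_error(C_fn=6, C_fp=4, n=100, kappa=5.0))  # 0.34, false negatives weighted 5x
```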
ERM
Embedding

- we embed raw input and output records as $x = \phi(u)$ and $y = \psi(v)$
- $\phi$ is the feature map
- $\psi$ is the identity map, $\psi(v) = v$
- un-embed by $\hat v = \mathrm{sign}(\hat y)$
- equivalent to $\hat v = \mathop{\mathrm{argmin}}_{v \in \{-1,+1\}} |\hat y - \psi(v)|$
- i.e., choose the nearest boolean value to the (real) prediction $\hat y$
ERM

- given a loss function $\ell(\hat y, y)$, the empirical loss on a data set is
  $$L = \frac{1}{n} \sum_{i=1}^n \ell(\hat y^i, y^i)$$
- for the linear model $\hat y = \theta^T x$, with $\theta \in \mathbf{R}^d$,
  $$L(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(\theta^T x^i, y^i)$$
- ERM: choose $\theta$ to minimize $L(\theta)$
- regularized ERM: choose $\theta$ to minimize $L(\theta) + \lambda r(\theta)$, with $\lambda > 0$
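A generic sketch of regularized ERM for the linear model, with the loss left pluggable. It assumes the sum squares regularizer and uses `scipy.optimize.minimize` as a black-box solver; the helper names and synthetic data are mine.

```python
import numpy as np
from scipy.optimize import minimize

def regularized_erm(X, y, loss, lam):
    """Choose theta to minimize (1/n) sum_i loss(theta^T x^i, y^i) + lam * ||theta||^2."""
    n, d = X.shape
    def objective(theta):
        return np.mean(loss(X @ theta, y)) + lam * np.sum(theta ** 2)
    return minimize(objective, np.zeros(d)).x

# example: the logistic loss (see the later slides), with kappa = 1
logistic = lambda y_hat, y: np.log1p(np.exp(-y * y_hat))

X = np.column_stack([np.ones(100), np.random.randn(100, 2)])
y = np.sign(X[:, 1] + 0.2 * np.random.randn(100))
theta = regularized_erm(X, y, logistic, lam=0.01)
```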
Loss functions for boolean classification

- to apply ERM, we need a loss function $\ell(\hat y, y)$ on the embedded variables
- $y$ can only take the values $-1$ or $+1$
- $\hat y = \theta^T x \in \mathbf{R}$ can be any real number
- to specify $\ell$, we only need to give two functions (of a scalar $\hat y$):
  - $\ell(\hat y, -1)$ is how much $\hat y$ irritates us when $y = -1$
  - $\ell(\hat y, +1)$ is how much $\hat y$ irritates us when $y = +1$
- we can take $\ell(\hat y, +1) = \kappa\, \ell(-\hat y, -1)$, to reflect that false negatives irritate us a factor $\kappa$ more than false positives
Neyman-Pearson loss

(figure: plots of $\ell^{np}(\hat y, -1)$ and $\ell^{np}(\hat y, +1)$ versus $\hat y$)

- the Neyman-Pearson loss is
  $$\ell^{np}(\hat y, -1) = \begin{cases} 1 & \hat y \geq 0 \\ 0 & \hat y < 0 \end{cases}
  \qquad
  \ell^{np}(\hat y, +1) = \kappa\, \ell^{np}(-\hat y, -1) = \begin{cases} \kappa & \hat y < 0 \\ 0 & \hat y \geq 0 \end{cases}$$
- the empirical Neyman-Pearson loss $L^{np}$ is the Neyman-Pearson error
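The Neyman-Pearson loss written out in code (a sketch, using the convention above that $\hat y \geq 0$ counts as a positive prediction):

```python
import numpy as np

def np_loss(y_hat, y, kappa=1.0):
    """1 for a false positive, kappa for a false negative, 0 otherwise."""
    y_hat, y = np.asarray(y_hat), np.asarray(y)
    false_pos = (y == -1) & (y_hat >= 0)
    false_neg = (y == +1) & (y_hat < 0)
    return 1.0 * false_pos + kappa * false_neg

# averaging np_loss over a data set gives the Neyman-Pearson error
```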
The problem with Neyman-Pearson loss

- the empirical Neyman-Pearson loss $L^{np}(\theta)$ is not differentiable, or even continuous (and certainly not convex)
- worse, its gradient $\nabla L^{np}(\theta)$ is either zero or undefined
- so an optimizer does not know how to improve the model
Idea of proxy loss

- we get better results using a proxy loss that
  - approximates, or at least captures the flavor of, the Neyman-Pearson loss
  - is more easily optimized (e.g., is convex or has nonzero derivative)
- we want a proxy loss function
  - with $\ell(\hat y, -1)$ small when $\hat y < 0$, and larger when $\hat y > 0$
  - with $\ell(\hat y, +1)$ small when $\hat y > 0$, and larger when $\hat y < 0$
  - which has other nice characteristics, e.g., differentiable or convex
Sigmoid loss

(figure: plots of $\ell(\hat y, -1)$ and $\ell(\hat y, +1)$ versus $\hat y$)

- $\ell(\hat y, -1) = 1/(1 + e^{-\hat y})$, and $\ell(\hat y, +1) = \kappa\, \ell(-\hat y, -1) = \kappa/(1 + e^{\hat y})$
- a differentiable approximation of the Neyman-Pearson loss
- but not convex
Logistic loss

(figure: plots of $\ell(\hat y, -1)$ and $\ell(\hat y, +1)$ versus $\hat y$)

- $\ell(\hat y, -1) = \log(1 + e^{\hat y})$, and $\ell(\hat y, +1) = \kappa\, \ell(-\hat y, -1) = \kappa \log(1 + e^{-\hat y})$
- a differentiable and convex approximation of the Neyman-Pearson loss
Hinge loss

(figure: plots of $\ell(\hat y, -1)$ and $\ell(\hat y, +1)$ versus $\hat y$)

- $\ell(\hat y, -1) = (1 + \hat y)_+$, and $\ell(\hat y, +1) = \kappa\, \ell(-\hat y, -1) = \kappa (1 - \hat y)_+$
- a nondifferentiable but convex approximation of the Neyman-Pearson loss
Square loss

(figure: plots of $\ell(\hat y, -1)$ and $\ell(\hat y, +1)$ versus $\hat y$)

- $\ell(\hat y, -1) = (1 + \hat y)^2$, and $\ell(\hat y, +1) = \kappa\, \ell(-\hat y, -1) = \kappa (1 - \hat y)^2$
- ERM is then a least squares problem
Hubristic loss

(figure: plots of $\ell(\hat y, -1)$ and $\ell(\hat y, +1)$ versus $\hat y$)

- define the hubristic loss (Huber + logistic) as
  $$\ell(\hat y, -1) = \begin{cases} 0 & \hat y < -1 \\ (\hat y + 1)^2 & -1 \leq \hat y \leq 0 \\ 1 + 2\hat y & \hat y > 0 \end{cases}$$
- $\ell(\hat y, +1) = \kappa\, \ell(-\hat y, -1)$

"5/5 stars!" – Yelp
"the biggest loss of 2018" – The Economist
"good, but it's no 1-norm" – The Guardian
"a gamechanger" – Facebook
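The proxy losses on the preceding slides all follow the same pattern: specify $\ell(\hat y, -1)$ and set $\ell(\hat y, +1) = \kappa\, \ell(-\hat y, -1)$. A sketch collecting them (the helper names are mine, and numerical overflow guards are omitted for clarity):

```python
import numpy as np

def make_loss(neg_loss, kappa=1.0):
    """Build l(y_hat, y) from l(y_hat, -1) via l(y_hat, +1) = kappa * l(-y_hat, -1)."""
    def loss(y_hat, y):
        return np.where(y == -1, neg_loss(y_hat), kappa * neg_loss(-y_hat))
    return loss

sigmoid_neg  = lambda z: 1.0 / (1.0 + np.exp(-z))        # sigmoid loss
logistic_neg = lambda z: np.log1p(np.exp(z))             # logistic loss
hinge_neg    = lambda z: np.maximum(0.0, 1.0 + z)        # hinge loss
square_neg   = lambda z: (1.0 + z) ** 2                  # square loss

def hubristic_neg(z):                                    # hubristic loss
    z = np.asarray(z, dtype=float)
    return np.where(z < -1, 0.0,
           np.where(z <= 0, (z + 1.0) ** 2, 1.0 + 2.0 * z))

hinge = make_loss(hinge_neg, kappa=2.0)                  # e.g., false negatives cost 2x
```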
Boolean classifiers
Least squares classifier

- use the square loss
  $$L(\theta) = \frac{1}{n} \left( \sum_{i:\, y^i = -1} (1 + \hat y^i)^2 + \kappa \sum_{i:\, y^i = +1} (1 - \hat y^i)^2 \right)$$
  and your choice of regularizer
- with the sum squares regularizer, this is the least squares classifier
- we can minimize $L(\theta) + \lambda r(\theta)$ using, e.g., QR factorization
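A sketch of the least squares classifier: the $\kappa$-weighted square loss plus sum squares regularizer is one stacked least squares problem, which a single matrix factorization solves. NumPy's `lstsq` uses an SVD internally rather than QR, but the idea is the same; the function name is mine.

```python
import numpy as np

def least_squares_classifier(X, y, kappa=1.0, lam=1e-3):
    """Minimize the kappa-weighted square loss plus lam * ||theta||^2."""
    n, d = X.shape
    w = np.where(y == +1, kappa, 1.0) / n              # per-example weights
    A = np.vstack([np.sqrt(w)[:, None] * X,            # sqrt-weighted data rows
                   np.sqrt(lam) * np.eye(d)])          # rows enforcing the regularizer
    b = np.concatenate([np.sqrt(w) * y, np.zeros(d)])
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)      # one factorization solves it
    return theta
```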
Logistic regression

- use the logistic loss
  $$L(\theta) = \frac{1}{n} \left( \sum_{i:\, y^i = -1} \log(1 + e^{\hat y^i}) + \kappa \sum_{i:\, y^i = +1} \log(1 + e^{-\hat y^i}) \right)$$
  and your choice of regularizer
- we can minimize $L(\theta) + \lambda r(\theta)$ using the prox-gradient method
- we will find an actual minimizer if $r$ is convex
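A minimal prox-gradient sketch for this problem, assuming the sum squares regularizer $r(\theta) = \|\theta\|^2$ and a fixed step size (no line search); the function names are mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, kappa=1.0, lam=1e-3, step=0.5, iters=2000):
    """Prox-gradient for the kappa-weighted logistic loss plus lam * ||theta||^2."""
    n, d = X.shape
    c = np.where(y == +1, kappa, 1.0)                       # weight on each example's loss
    theta = np.zeros(d)
    for _ in range(iters):
        y_hat = X @ theta
        grad = -(X.T @ (c * y * sigmoid(-y * y_hat))) / n   # gradient of the loss term
        z = theta - step * grad                             # gradient step on the loss
        theta = z / (1.0 + 2.0 * step * lam)                # prox step for lam * ||theta||^2
    return theta
```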
Support vector machine (usually abbreviated as SVM)

- use the hinge loss
  $$L(\theta) = \frac{1}{n} \left( \sum_{i:\, y^i = -1} (1 + \hat y^i)_+ + \kappa \sum_{i:\, y^i = +1} (1 - \hat y^i)_+ \right)$$
  and the sum squares regularizer
- $L(\theta) + \lambda r(\theta)$ is convex
- it can be minimized by various methods (but not prox-gradient)
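Since the hinge loss is nondifferentiable, one simple option (among the "various methods", not prox-gradient) is the subgradient method; a sketch with the sum squares regularizer and a diminishing step size, with names of my choosing:

```python
import numpy as np

def svm_subgradient(X, y, kappa=1.0, lam=1e-3, iters=5000):
    """Subgradient method for the kappa-weighted hinge loss plus lam * ||theta||^2."""
    n, d = X.shape
    c = np.where(y == +1, kappa, 1.0)
    theta = np.zeros(d)
    for k in range(1, iters + 1):
        active = (y * (X @ theta) < 1).astype(float)          # examples inside the margin
        g = -(X.T @ (c * y * active)) / n + 2 * lam * theta   # a subgradient of the objective
        theta -= g / np.sqrt(k)                               # diminishing step size
    return theta
```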
Support vector machine

(figure: the hinge losses $\ell(\hat y, -1)$ and $\ell(\hat y, +1)$, and the data with the SVM decision boundary)

- the decision boundary is $\theta^T x = 0$
- the black lines show the points where $\theta^T x = \pm 1$
- what is the training loss here?
ROC
Receiver operating characteristic (always abbreviated as ROC; the name comes from WWII)

- explore the trade-off of false negative versus false positive rates
- create a classifier for many values of $\kappa$
- for each choice of $\kappa$, select the hyper-parameter $\lambda$ via validation on the test set with the Neyman-Pearson loss
- plot the test (and maybe train) false negative and false positive rates against each other
- called the receiver operating characteristic (ROC) (when viewed upside down)
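A sketch of tracing out the ROC curve by sweeping $\kappa$; here `fit_classifier` stands for any of the classifiers above (e.g., the least squares classifier sketched earlier), and the per-$\kappa$ selection of $\lambda$ by validation is omitted for brevity.

```python
import numpy as np

def roc_points(X_train, y_train, X_test, y_test, fit_classifier, kappas):
    """For each kappa, fit a classifier and record its test FN and FP rates."""
    points = []
    for kappa in kappas:
        theta = fit_classifier(X_train, y_train, kappa=kappa)
        v_hat = np.sign(X_test @ theta)
        fn_rate = np.mean((v_hat == -1) & (y_test == +1))   # C_fn / n on the test set
        fp_rate = np.mean((v_hat == +1) & (y_test == -1))   # C_fp / n on the test set
        points.append((fn_rate, fp_rate))
    return np.array(points)   # plot column 1 against column 0 to see the trade-off
```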
Example

(figure: left, false positive rate $C_{fp}/n$ versus false negative rate $C_{fn}/n$ as $\kappa$ varies; right, the data and decision boundary of the minimum-error classifier)

- square loss, sum squares regularizer
- left hand plot shows training errors in blue, test errors in red
- right hand plot shows the minimum-error classifier (i.e., $\kappa = 1$)
Example

(figure: data and decision boundaries for two values of $\kappa$)

- left hand plot shows the predictor when $\kappa = 0.4$
- right hand plot shows the predictor for a larger value of $\kappa$