Boolean Classification


EE104, Spring 2018, S. Lall and S. Boyd

Boolean Classification

Sanjay Lall and Stephen Boyd
EE104, Stanford University

Boolean classification

Boolean classification

- supervised learning is called Boolean classification when the raw output variable v is categorical and can take two possible values
- we denote these -1 and 1; they often correspond to {false, true}
- for a data record (u_i, v_i), the value v_i ∈ {-1, 1} is called the class or label
- a Boolean classifier predicts the label v̂ given a raw input u

Classification

[plot: training data in the plane, red and blue points]

- here u ∈ R²
- red points have v_i = -1, blue points have v_i = 1
- we'd like a predictor that maps any u ∈ R² into a prediction v̂ ∈ {-1, 1}

Example: nearest neighbor

[plot: nearest-neighbor decision regions over the training data]

- given u, let k⋆ = argmin_k ‖u - u_k‖, then predict v̂ = v_{k⋆}
- the red region is the set of u for which the prediction is -1
- the blue region is the set of u for which the prediction is 1
- zero training error (all points classified correctly), but perhaps overfit
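
A minimal sketch of this nearest-neighbor rule in Python. The dataset (U, v) below is synthetic, since the slide's actual data is not given in the transcription:

```python
# Nearest-neighbor Boolean classifier on a synthetic dataset.
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(100, 2))               # raw inputs u_i in R^2
v = np.where(U[:, 0] + U[:, 1] > 0, 1, -1)  # labels v_i in {-1, 1}

def nn_predict(u):
    """Predict v-hat = v_{k*}, where k* = argmin_k ||u - u_k||."""
    k_star = np.argmin(np.linalg.norm(U - u, axis=1))
    return v[k_star]

# zero training error: every training point is its own nearest neighbor
assert all(nn_predict(u) == vi for u, vi in zip(U, v))
```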

Example: least-squares regression

[plot: least-squares decision boundary over the training data]

- embed x = (1, u) and y = v, and apply least squares regression
- this gives an affine predictor ŷ = θ₁ + θ₂u₁ + θ₃u₂
- predict using v̂ = sign(ŷ)
- some points are misclassified at training
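
A sketch of this recipe, again on synthetic data: embed x = (1, u), fit y = v by least squares, then take the sign of the fitted values:

```python
# Least squares regression used as a Boolean classifier.
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(100, 2))
v = np.where(U[:, 0] - U[:, 1] + 0.2 > 0, 1, -1)

X = np.column_stack([np.ones(len(U)), U])      # x_i = (1, u_i)
theta, *_ = np.linalg.lstsq(X, v, rcond=None)  # least squares fit of y = v
y_hat = X @ theta                              # affine predictor
v_hat = np.sign(y_hat)                         # un-embed: v-hat = sign(y-hat)
print("training error rate:", np.mean(v_hat != v))
```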

Confusion matrix

The two types of errors

- measure the performance of a specific predictor on a set of n data records
- each data point i has v_i ∈ {-1, 1} and a corresponding prediction v̂_i = f(u_i) ∈ {-1, 1}
- there are only four possible values for the pair (v̂_i, v_i):
  - true positive if v̂_i = 1 and v_i = 1
  - true negative if v̂_i = -1 and v_i = -1
  - false negative or type II error if v̂_i = -1 and v_i = 1
  - false positive or type I error if v̂_i = 1 and v_i = -1

Confusion matrix

- define the confusion matrix

  C = [ C_tn  C_fn ] = [ # true negatives   # false negatives ]
      [ C_fp  C_tp ]   [ # false positives  # true positives  ]

  (warning: some people use the transpose of C)
- C_tn + C_fn + C_fp + C_tp = n
- N_n = C_tn + C_fp is the number of negative examples
- N_p = C_fn + C_tp is the number of positive examples
- the diagonal entries give the numbers of correct predictions
- the off-diagonal entries give the numbers of incorrect predictions
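
A small helper that builds C in the layout above; note the transpose warning applies to libraries too (scikit-learn, for instance, arranges its confusion matrix differently):

```python
# Confusion matrix in the slides' layout, for labels in {-1, 1}.
import numpy as np

def confusion_matrix(v, v_hat):
    """Return C = [[C_tn, C_fn], [C_fp, C_tp]]."""
    C_tn = int(np.sum((v_hat == -1) & (v == -1)))  # true negatives
    C_fn = int(np.sum((v_hat == -1) & (v ==  1)))  # false negatives
    C_fp = int(np.sum((v_hat ==  1) & (v == -1)))  # false positives
    C_tp = int(np.sum((v_hat ==  1) & (v ==  1)))  # true positives
    return np.array([[C_tn, C_fn], [C_fp, C_tp]])
```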

Some Boolean classification measures

- confusion matrix

  C = [ C_tn  C_fn ]
      [ C_fp  C_tp ]

- the basic error measures:
  - false positive rate is C_fp / n
  - false negative rate is C_fn / n
  - error rate is (C_fn + C_fp) / n
- error measures some people use:
  - true positive rate or sensitivity or recall is C_tp / N_p
  - false alarm rate is C_fp / N_n
  - specificity or true negative rate is C_tn / N_n
  - precision is C_tp / (C_tp + C_fp)
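
The same measures, computed directly from a confusion matrix in the layout above:

```python
def classification_measures(C):
    """Rates from C = [[C_tn, C_fn], [C_fp, C_tp]]."""
    (C_tn, C_fn), (C_fp, C_tp) = C
    n = C_tn + C_fn + C_fp + C_tp
    N_n = C_tn + C_fp     # number of negative examples
    N_p = C_fn + C_tp     # number of positive examples
    return {
        "false positive rate": C_fp / n,
        "false negative rate": C_fn / n,
        "error rate": (C_fn + C_fp) / n,
        "sensitivity (recall)": C_tp / N_p,
        "false alarm rate": C_fp / N_n,
        "specificity": C_tn / N_n,
        "precision": C_tp / (C_tp + C_fp),
    }
```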

Neyman-Pearson error

- the Neyman-Pearson error over a data set is κ C_fn / n + C_fp / n
- a scalarization of our two objectives, the false negative and false positive rates
- κ is how much more false negatives irritate us than false positives
- when κ = 1, the Neyman-Pearson error is the error rate
- we'll use the Neyman-Pearson error as our scalarized measure
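
As a one-liner on the confusion matrix:

```python
def neyman_pearson_error(C, kappa=1.0):
    """kappa * (false negative rate) + (false positive rate)."""
    (C_tn, C_fn), (C_fp, C_tp) = C
    n = C_tn + C_fn + C_fp + C_tp
    return kappa * C_fn / n + C_fp / n
```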

ERM

Embedding

- we embed the raw input and output records as x = φ(u) and y = ψ(v)
- φ is the feature map
- ψ is the identity map, ψ(v) = v
- un-embed by v̂ = sign(ŷ)
- equivalent to v̂ = argmin_{v ∈ {-1, 1}} |ŷ - ψ(v)|
- i.e., choose the nearest Boolean value to the (real) prediction ŷ
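
A tiny check of that equivalence, taking sign(ŷ) and the argmin over {-1, 1} and confirming they agree (away from ŷ = 0, where sign is 0 and the argmin is an arbitrary tie-break):

```python
import numpy as np

def unembed(y_hat):
    # v-hat = argmin over v in {-1, 1} of |y_hat - v|
    return min((-1, 1), key=lambda v: abs(y_hat - v))

for y_hat in (-2.3, -0.4, 0.7, 5.0):
    assert unembed(y_hat) == np.sign(y_hat)
```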

ERM

- given a loss function ℓ(ŷ, y), the empirical loss on a data set is

  L = (1/n) Σ_{i=1}^n ℓ(ŷ_i, y_i)

- for the linear model ŷ = θᵀx, with θ ∈ R^d,

  L(θ) = (1/n) Σ_{i=1}^n ℓ(θᵀx_i, y_i)

- ERM: choose θ to minimize L(θ)
- regularized ERM: choose θ to minimize L(θ) + λ r(θ), with λ > 0
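
A direct transcription of these two objectives into code; the sum squares regularizer is one concrete choice of r(θ):

```python
import numpy as np

def empirical_loss(theta, X, y, loss):
    """L(theta) = (1/n) * sum_i loss(theta^T x_i, y_i)."""
    return np.mean([loss(x @ theta, yi) for x, yi in zip(X, y)])

def regularized_objective(theta, X, y, loss, lam):
    """L(theta) + lam * r(theta), here with r(theta) = ||theta||^2."""
    return empirical_loss(theta, X, y, loss) + lam * np.sum(theta ** 2)

# example usage with the square loss:
#   regularized_objective(theta, X, y, lambda yh, yi: (yh - yi) ** 2, 0.1)
```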

Loss functions for Boolean classification

- to apply ERM, we need a loss function ℓ(ŷ, y) on the embedded variables
- y can only take the values -1 or 1
- ŷ = θᵀx ∈ R can be any real number
- to specify ℓ, we only need to give two functions (of a scalar ŷ):
  - ℓ(ŷ, -1) is how much ŷ irritates us when y = -1
  - ℓ(ŷ, 1) is how much ŷ irritates us when y = 1
- we can take ℓ(ŷ, 1) = κ ℓ(-ŷ, -1), to reflect that false negatives irritate us a factor κ more than false positives

Neyman-Pearson loss

- the Neyman-Pearson loss is

  ℓ^NP(ŷ, -1) = { 1   if ŷ ≥ 0
                { 0   if ŷ < 0

  ℓ^NP(ŷ, 1) = κ ℓ^NP(-ŷ, -1) = { κ   if ŷ < 0
                                { 0   if ŷ ≥ 0

- the empirical Neyman-Pearson loss L^NP is the Neyman-Pearson error

[plots: ℓ^NP(ŷ, -1) and ℓ^NP(ŷ, 1) versus ŷ]

The problem with Neyman-Pearson loss

- the empirical Neyman-Pearson loss L^NP(θ) is not differentiable, or even continuous (and certainly not convex)
- worse, its gradient ∇L^NP(θ) is either zero or undefined
- so an optimizer does not know how to improve the model

Idea of proxy loss

- we get better results using a proxy loss that
  - approximates, or at least captures the flavor of, the Neyman-Pearson loss
  - is more easily optimized (e.g., is convex or has a nonzero derivative)
- we want a proxy loss function
  - with ℓ(ŷ, -1) small when ŷ < 0, and larger when ŷ > 0
  - with ℓ(ŷ, +1) small when ŷ > 0, and larger when ŷ < 0
  - which has other nice characteristics, e.g., differentiable or convex

Sigmoid loss

[plots: ℓ(ŷ, -1) and ℓ(ŷ, 1) versus ŷ]

- ℓ(ŷ, -1) = 1 / (1 + e^{-ŷ}),  ℓ(ŷ, 1) = ℓ(-ŷ, -1) = 1 / (1 + e^{ŷ})
- a differentiable approximation of the Neyman-Pearson loss
- but not convex

Logistic loss

[plots: ℓ(ŷ, -1) and ℓ(ŷ, 1) versus ŷ]

- ℓ(ŷ, -1) = log(1 + e^{ŷ}),  ℓ(ŷ, 1) = ℓ(-ŷ, -1) = log(1 + e^{-ŷ})
- a differentiable and convex approximation of the Neyman-Pearson loss

Hinge loss

[plots: ℓ(ŷ, -1) and ℓ(ŷ, 1) versus ŷ]

- ℓ(ŷ, -1) = (1 + ŷ)_+,  ℓ(ŷ, 1) = ℓ(-ŷ, -1) = (1 - ŷ)_+
- a nondifferentiable but convex approximation of the Neyman-Pearson loss

Square loss

[plots: ℓ(ŷ, -1) and ℓ(ŷ, 1) versus ŷ]

- ℓ(ŷ, -1) = (1 + ŷ)²,  ℓ(ŷ, 1) = ℓ(-ŷ, -1) = (1 - ŷ)²
- ERM is a least squares problem

Hubristic loss

[plots: ℓ(ŷ, -1) and ℓ(ŷ, 1) versus ŷ]

- define the hubristic loss (Huber + logistic) as

  ℓ(ŷ, -1) = { 0           if ŷ < -1
             { (ŷ + 1)²    if -1 ≤ ŷ ≤ 0
             { 1 + 2ŷ      if ŷ > 0

- ℓ(ŷ, 1) = ℓ(-ŷ, -1)

  "5/5 stars!" -- Yelp
  "the biggest loss of 2018" -- The Economist
  "good, but it's no 1-norm" -- The Guardian
  "a gamechanger" -- Facebook
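
A sketch collecting the ℓ(ŷ, -1) branches of the five proxy losses above; the y = 1 versions follow from ℓ(ŷ, 1) = ℓ(-ŷ, -1) (times κ, when scalarizing):

```python
import numpy as np

def sigmoid_loss(y_hat):
    return 1 / (1 + np.exp(-y_hat))

def logistic_loss(y_hat):
    return np.log1p(np.exp(y_hat))

def hinge_loss(y_hat):
    return np.maximum(1 + y_hat, 0)

def square_loss(y_hat):
    return (1 + y_hat) ** 2

def hubristic_loss(y_hat):
    y_hat = np.asarray(y_hat, dtype=float)
    return np.where(y_hat < -1, 0.0,                 # flat for y_hat < -1
           np.where(y_hat <= 0, (y_hat + 1) ** 2,    # quadratic on [-1, 0]
                    1 + 2 * y_hat))                  # linear for y_hat > 0
```

The hubristic pieces match at the joins: (ŷ + 1)² is 0 at ŷ = -1, and equals 1 + 2ŷ in both value and slope at ŷ = 0, so the loss is differentiable everywhere.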

Boolean classifiers

Least squares classifier

- use the square loss and your choice of regularizer:

  L(θ) = (1/n) ( Σ_{i: y_i = -1} (1 + ŷ_i)² + κ Σ_{i: y_i = 1} (1 - ŷ_i)² )

- with the sum squares regularizer, this is the least squares classifier
- we can minimize L(θ) + λ r(θ) using, e.g., QR factorization
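
A sketch of this classifier as a single stacked least squares problem. The row weights carry the 1/n and κ factors, and the √λ I block implements the sum squares regularizer; np.linalg.lstsq handles the factorization internally (the slide suggests QR; lstsq happens to use an SVD-based routine, which solves the same problem):

```python
import numpy as np

def least_squares_classifier(X, y, kappa=1.0, lam=0.0):
    """Minimize (1/n)(sum_{y_i=-1} (1+yhat_i)^2 + kappa sum_{y_i=1} (1-yhat_i)^2)
    + lam * ||theta||^2 as one weighted least squares problem."""
    n, d = X.shape
    w = np.where(y == 1, np.sqrt(kappa / n), np.sqrt(1.0 / n))  # row weights
    A = np.vstack([w[:, None] * X, np.sqrt(lam) * np.eye(d)])
    b = np.concatenate([w * y, np.zeros(d)])
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta
```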

Logistic regression

- use the logistic loss and your choice of regularizer:

  L(θ) = (1/n) ( Σ_{i: y_i = -1} log(1 + e^{ŷ_i}) + κ Σ_{i: y_i = 1} log(1 + e^{-ŷ_i}) )

- can minimize L(θ) + λ r(θ) using the prox-gradient method
- we will find an actual minimizer if r is convex
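
A sketch using plain gradient descent as a simple stand-in for the prox-gradient method the slide mentions (the objective is smooth with the sum squares regularizer, so this suffices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_regression(X, y, kappa=1.0, lam=1e-3, step=0.1, iters=5000):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        y_hat = X @ theta
        # d/dy_hat log(1 + e^{y_hat}) = sigmoid(y_hat) for y = -1;
        # for y = +1 it is -sigmoid(-y_hat), weighted by kappa
        g = np.where(y == -1, sigmoid(y_hat), -kappa * sigmoid(-y_hat))
        grad = X.T @ g / n + 2 * lam * theta
        theta -= step * grad
    return theta
```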

Support vector machine (usually abbreviated as SVM)

- use the hinge loss and the sum squares regularizer:

  L(θ) = (1/n) ( Σ_{i: y_i = -1} (1 + ŷ_i)_+ + κ Σ_{i: y_i = 1} (1 - ŷ_i)_+ )

- L(θ) + λ r(θ) is convex
- it can be minimized by various methods (but not prox-gradient)
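
One of those "various methods" is subgradient descent; here is a fixed-step sketch (it only approaches the minimizer, which is good enough for illustration):

```python
import numpy as np

def svm_fit(X, y, kappa=1.0, lam=1e-2, step=0.05, iters=5000):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        y_hat = X @ theta
        active_neg = (y == -1) & (1 + y_hat > 0)   # (1 + y_hat)_+ active
        active_pos = (y ==  1) & (1 - y_hat > 0)   # (1 - y_hat)_+ active
        g = active_neg.astype(float) - kappa * active_pos.astype(float)
        grad = X.T @ g / n + 2 * lam * theta       # a subgradient
        theta -= step * grad
    return theta
```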

Support vector machine

[plots: hinge losses ℓ(ŷ, -1) and ℓ(ŷ, 1); data with decision boundary]

- the decision boundary is the set of points where θᵀx = 0
- the black lines show the points where θᵀx = ±1
- what is the training loss here?

ROC

Receiver operating characteristic (always abbreviated as ROC; the name comes from WWII)

- explore the trade-off of the false negative versus false positive rates
- create a classifier for many values of κ
- for each choice of κ, select the hyper-parameter λ via validation on a test set with the Neyman-Pearson loss
- plot the test (and maybe train) false negative and false positive rates against each other
- this curve is called the receiver operating characteristic (ROC) (when viewed upside down)
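
A sketch of this sweep on synthetic data, fitting a κ-weighted square-loss classifier for each κ and recording the test error rates (for brevity, λ is fixed here rather than selected by validation as the slide describes):

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(400, 2))
v = np.where(U[:, 0] + 0.5 * U[:, 1] + 0.3 > 0, 1, -1)
X = np.column_stack([np.ones(len(U)), U])
X_tr, X_te, v_tr, v_te = X[:200], X[200:], v[:200], v[200:]

fn_rates, fp_rates = [], []
for kappa in np.logspace(-1.5, 1.5, 25):
    # weighted least squares classifier (square loss scalarized by kappa)
    w = np.where(v_tr == 1, np.sqrt(kappa), 1.0)
    theta, *_ = np.linalg.lstsq(w[:, None] * X_tr, w * v_tr, rcond=None)
    v_hat = np.sign(X_te @ theta)
    n = len(v_te)
    fn_rates.append(np.sum((v_hat == -1) & (v_te == 1)) / n)
    fp_rates.append(np.sum((v_hat == 1) & (v_te == -1)) / n)
# plotting fn_rates against fp_rates traces out the ROC trade-off curve
```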

Example

[plots: test/train C_fp/n versus C_fn/n trade-off curves; data with decision boundary]

- square loss, sum squares regularizer
- the left-hand plot shows training errors in blue, test errors in red
- the right-hand plot shows the minimum-error classifier (i.e., κ = 1)

Example

[plots: data with decision boundaries for two values of κ]

- the left-hand plot shows the predictor when κ = 0.4
- the right-hand plot shows the predictor when κ = 4