Evaluating Classifiers


Reading for this topic: T. Fawcett, An introduction to ROC analysis, Sections 1-4, 7 (linked from class website)

Evaluating Classifiers. What we want: a classifier that best predicts unseen ("test") data. Common assumption: the data are i.i.d. (independent and identically distributed).

Topics: Cross-Validation; Precision and Recall; ROC Analysis; Bias, Variance, and Noise.

Cross-Validation

Accuracy and Error Rate. Accuracy = fraction of correct classifications on unseen data (the test set). Error rate = 1 − Accuracy.

How do we use the available data to best measure accuracy? Split the data into training and test sets. But how should we split? With too little training data we cannot learn a good model; with too little test data we cannot reliably evaluate the learned model. Also, how do we learn the hyper-parameters of the model?

One solution: k-fold cross-validation. It is used to better estimate the generalization accuracy of a model, and to learn the hyper-parameters of a model ("model selection").

Using k-fold cross-validation to estimate accuracy. Each example is used both as a training instance and as a test instance. Instead of splitting the data into a training set and a test set, split the data into k disjoint parts: S_1, S_2, ..., S_k. For i = 1 to k: select S_i to be the test set; train on the remaining data and test on S_i to obtain accuracy A_i. Report (1/k) Σ_{i=1}^{k} A_i as the final accuracy.
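
A minimal sketch of this procedure, assuming Python with scikit-learn; the dataset and the classifier (logistic regression) are placeholders standing in for any learner:

```python
# Sketch of k-fold cross-validation for estimating accuracy.
# Assumes scikit-learn; dataset and classifier are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, random_state=0)
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)

accuracies = []
for train_idx, test_idx in kf.split(X):              # S_i is the held-out fold
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])              # train on the remaining data
    accuracies.append(clf.score(X[test_idx], y[test_idx]))  # accuracy A_i

print("Estimated accuracy:", np.mean(accuracies))    # (1/k) * sum of the A_i
```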

Using k-fold cross-validation to learn hyper-parameters (e.g., learning rate, number of hidden units, SVM kernel, etc.). Split the data into training and test sets; put the test set aside. Split the training data into k disjoint parts: S_1, S_2, ..., S_k. Assume you are learning one hyper-parameter, and choose R possible values for it. For j = 1 to R: for i = 1 to k, select S_i to be the validation set, train the classifier on the remaining data using the j-th value of the hyper-parameter, and test the classifier on S_i to obtain accuracy A_{i,j}; then compute the average of the accuracies, Ā_j = (1/k) Σ_{i=1}^{k} A_{i,j}. Choose the value j of the hyper-parameter with the highest Ā_j. Retrain the model on all the training data using this value of the hyper-parameter. Test the resulting model on the test set.
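
As a hedged sketch, scikit-learn's GridSearchCV carries out essentially this procedure: it runs the inner k-fold loop for each candidate value, picks the best average score, and refits on the full training data. The regularization strength C below is just an example hyper-parameter:

```python
# Sketch of hyper-parameter selection with k-fold cross-validation.
# Assumes scikit-learn; C is an example hyper-parameter with R candidate values.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, random_state=0)

# Split off a test set and put it aside.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Inner k-fold loop: for each candidate value, average accuracy over the k folds.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}            # the R candidate values
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_train, y_train)                           # also refits on all training data

print("Best hyper-parameter:", search.best_params_)
print("Test-set accuracy:", search.score(X_test, y_test))
```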

Precision and Recall

Evaluating classification algorithms. Confusion matrix for a given class c:

                                     Predicted (classified) positive    Predicted (classified) negative
                                     (in class c)                       (not in class c)
Actual positive (in class c)         True Positive                      False Negative (Type 2 error)
Actual negative (not in class c)     False Positive (Type 1 error)      True Negative

Recall and Precision

[Figure: all instances in the test set (x_1 ... x_10), partitioned into predicted-positive and predicted-negative sets.]

Recall: fraction of positive examples predicted as positive. Here: 3/4. Recall = Sensitivity = True Positive Rate.

Precision: fraction of examples predicted as positive that are actually positive. Here: 1/2.

In-class exercise 1

Example: A vs. B. Assume A is the positive class. Results from the perceptron:

Instance   Class   Perceptron Output
1          A       1
2          A       1
3          A       1
4          A       0
5          B       1
6          B       0
7          B       0
8          B       0

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = (2 × Precision × Recall) / (Precision + Recall)
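
A small sketch in plain Python that applies these definitions; the label and prediction lists encode the table above, treating a perceptron output of 1 as "predicted A":

```python
# Sketch: precision, recall, and F-measure for positive class A.
# labels/predictions encode the table above (perceptron output 1 -> "A", 0 -> "B").
labels      = ["A", "A", "A", "A", "B", "B", "B", "B"]
predictions = ["A", "A", "A", "B", "A", "B", "B", "B"]
positive = "A"

tp = sum(1 for y, p in zip(labels, predictions) if y == positive and p == positive)
fp = sum(1 for y, p in zip(labels, predictions) if y != positive and p == positive)
fn = sum(1 for y, p in zip(labels, predictions) if y == positive and p != positive)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
print(precision, recall, f_measure)
```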

Creating a Precision vs. Recall Curve. P = TP / (TP + FP), R = TP / (TP + FN). Results of the classifier (fill in precision and recall at each score threshold):

Threshold   Accuracy   Precision   Recall
.9
.8
.7
.6
.5
.4
.3
.2
.1
-
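
A sketch of filling such a table programmatically, assuming scikit-learn; precision_recall_curve sweeps the distinct scores as thresholds and returns the precision/recall pairs that trace the curve. The labels and scores below are hypothetical classifier outputs:

```python
# Sketch: precision/recall at each score threshold (assumes scikit-learn).
# y_true and scores are hypothetical classifier outputs.
from sklearn.metrics import precision_recall_curve

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]   # classifier score per instance

precision, recall, thresholds = precision_recall_curve(y_true, scores)
for t, p, r in zip(thresholds, precision, recall):
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```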

Multiclass classification: Mean Average Precision. Average precision for a class: the area under the precision/recall curve for that class. Mean average precision: the mean of the average precision over all classes.
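
A minimal sketch for the multiclass case, assuming scikit-learn and one-vs-rest scores; average_precision_score summarizes each class's precision/recall curve (a step-wise approximation of the area under it), and the mean over classes gives the mean average precision. The labels and score matrix are placeholders:

```python
# Sketch: mean average precision over classes (assumes scikit-learn).
# y_true holds class labels; score_matrix[i, c] is a placeholder score for class c.
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 1, 2, 1, 0, 2, 1, 0])
score_matrix = np.random.rand(len(y_true), 3)         # placeholder one-vs-rest scores

ap_per_class = [
    average_precision_score(y_true == c, score_matrix[:, c])  # AP for class c
    for c in range(score_matrix.shape[1])
]
print("Mean average precision:", np.mean(ap_per_class))
```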

ROC Analysis

[Figure: ROC curve; y-axis = recall (sensitivity), x-axis = 1 − specificity.]

Creating a ROC Curve. True Positive Rate (= Recall) = TP / (TP + FN). False Positive Rate = FP / (FP + TN). Results of the classifier (fill in TPR and FPR at each score threshold):

Threshold   Accuracy   TPR   FPR
.9
.8
.7
.6
.5
.4
.3
.2
.1
-

How to create a ROC curve for a binary classifier: run your classifier on each instance in the test data and calculate its score. Get the range [min, max] of your scores. Divide the range into about 200 thresholds, including min and max. For each threshold, calculate the TPR and FPR. Plot TPR (y-axis) vs. FPR (x-axis).
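
A sketch of this recipe using NumPy, with hypothetical labels and scores; it sweeps roughly 200 evenly spaced thresholds between the minimum and maximum score and computes TPR and FPR at each one:

```python
# Sketch: building a ROC curve by sweeping thresholds over classifier scores.
# y_true and scores are hypothetical; plotting is left out.
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])

thresholds = np.linspace(scores.min(), scores.max(), 200)  # ~200 thresholds, incl. min and max
tpr, fpr = [], []
for t in thresholds:
    predicted = scores >= t
    tp = np.sum(predicted & (y_true == 1))
    fp = np.sum(predicted & (y_true == 0))
    fn = np.sum(~predicted & (y_true == 1))
    tn = np.sum(~predicted & (y_true == 0))
    tpr.append(tp / (tp + fn))       # TP / (TP + FN)
    fpr.append(fp / (fp + tn))       # FP / (FP + TN)
# Plot tpr (y-axis) vs. fpr (x-axis) to obtain the ROC curve.
```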

In-Class Exercise 2

Precision/Recall vs. ROC Curves. Precision/recall curves are typically used in detection or retrieval tasks, in which positive examples are relatively rare. ROC curves are used in tasks where positive/negative examples are more balanced.

Bias, Variance, and Noise

Bias, Variance, and Noise. A classifier's error can be separated into three components: bias, variance, and noise.

Bias: related to the ability of the model to learn the target function. High bias: the model is not powerful enough to learn the function; it learns a less-complicated function (underfitting). Low bias: the model is too powerful; it learns a too-complicated function (overfitting). From http://eecs.oregonstate.edu/~tgd/talks/bv.ppt and https://www.quora.com/topic/logistic-regression

Variance: how much the learned function will vary if different training sets are used. Same experiment, repeated with 50 samples of 20 points each. From http://eecs.oregonstate.edu/~tgd/talks/bv.ppt

Noise: the underlying process generating the data is stochastic, or the data has errors or outliers. From http://eecs.oregonstate.edu/~tgd/talks/bv.ppt

Examples of causes of bias? Examples of causes of variance? Examples of causes of noise?

From http://lear.inrialpes.fr/~verbeek/mlcr.10.11.php