Evaluation Measures. Sebastian Pölsterl, Computer Aided Medical Procedures, Technische Universität München. April 28, 2015

Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015

Outline: 1 Classification (1. Confusion Matrix, 2. Receiver Operating Characteristics, 3. Precision-Recall Curve); 2 Regression; 3 Unsupervised Methods; 4 Validation (1. Cross-Validation, 2. Leave-One-Out Cross-Validation, 3. Bootstrap Validation); 5 How to Do Cross-Validation. Sebastian Pölsterl 2 of 49

Performance Measures: Classification. Performance measures fall into measures for deterministic classifiers, based on the confusion matrix, and measures for scoring classifiers. Deterministic classifiers: multi-class measures, either without chance correction (accuracy, error rate, micro/macro average) or with chance correction (Cohen's kappa, Fleiss' kappa), and single-class measures (TP/FP rate, precision, recall, sensitivity, specificity, F1-measure, Dice, geometric mean). Scoring classifiers: graphical measures (ROC curves, PR curves, lift charts, cost curves) and summary statistics (area under the curve, H measure). Sebastian Pölsterl 3 of 49

Test Outcomes. Let us consider a binary classification problem: True positive (TP) = positive sample correctly classified as belonging to the positive class. False positive (FP) = negative sample misclassified as belonging to the positive class. True negative (TN) = negative sample correctly classified as belonging to the negative class. False negative (FN) = positive sample misclassified as belonging to the negative class. Sebastian Pölsterl 4 of 49

Confusion Matrix I. Rows are predictions, columns are the ground truth; let class A indicate the positive class and class B the negative class. Predicting class A for a class A sample gives a true positive; predicting class A for a class B sample gives a false positive (Type I error, α); predicting class B for a class A sample gives a false negative (Type II error, β); predicting class B for a class B sample gives a true negative. Accuracy = (TP + TN) / (TP + FP + TN + FN), Error rate = 1 - Accuracy. Sebastian Pölsterl 5 of 49
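
As a quick illustration, the counts and the accuracy above can be computed directly from two label vectors. A minimal NumPy sketch (labels and variable names are my own, not from the slides):

import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # 1 = positive class (A), 0 = negative class (B)
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

accuracy = (tp + tn) / (tp + fp + tn + fn)
error_rate = 1 - accuracy
print(tp, fp, tn, fn, accuracy, error_rate)  # 3 1 3 1 0.75 0.25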

Confusion Matrix II. Looking at the columns of the confusion matrix (ground truth class A: TP and FN; ground truth class B: FP and TN) gives the following rates: Sensitivity / true positive rate / recall = TP / (TP + FN). Specificity / true negative rate = TN / (TN + FP). False negative rate = FN / (FN + TP) = 1 - sensitivity. False positive rate = FP / (FP + TN) = 1 - specificity. Sebastian Pölsterl 6 of 49

Confusion Matrix III. Looking at the rows of the confusion matrix (prediction class A: TP and FP; prediction class B: FN and TN) gives: Positive predictive value (PPV) / precision = TP / (TP + FP). Negative predictive value (NPV) = TN / (TN + FN). Sebastian Pölsterl 7 of 49
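
Continuing with the counts from the sketch above (tp = 3, fp = 1, tn = 3, fn = 1), the rate measures of the last two slides follow directly; a minimal sketch:

tp, fp, tn, fn = 3, 1, 3, 1              # counts from the earlier toy example

sensitivity = tp / (tp + fn)             # true positive rate / recall
specificity = tn / (tn + fp)             # true negative rate
fnr = fn / (fn + tp)                     # = 1 - sensitivity
fpr = fp / (fp + tn)                     # = 1 - specificity
ppv = tp / (tp + fp)                     # positive predictive value / precision
npv = tn / (tn + fn)                     # negative predictive value
print(sensitivity, specificity, fnr, fpr, ppv, npv)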

Multiple Classes One vs. One. The confusion matrix now has one row and one column per class (here A-D); the diagonal entries (prediction equals ground truth) are correct, all off-diagonal entries are wrong. With k classes the confusion matrix becomes a k × k matrix, and there is no clear notion of positives and negatives. Sebastian Pölsterl 8 of 49

Multiple Classes One vs. All. Choose one of the k classes as positive (here: class A) and collapse all other classes into the negative class to obtain k different 2 × 2 matrices. In each of these matrices the number of true positives is the same as in the corresponding diagonal cell of the original confusion matrix. Sebastian Pölsterl 9 of 49

Micro and Macro Average. Micro average: 1. Construct a single 2 × 2 confusion matrix by summing up TP, FP, TN and FN from all k one-vs-all matrices. 2. Calculate the performance measure based on this aggregated matrix. Macro average: 1. Obtain the performance measure from each of the k one-vs-all matrices separately. 2. Calculate the average of all these measures. Sebastian Pölsterl 10 of 49

F1-Measure. The F1-measure is the harmonic mean of positive predictive value and sensitivity: F1 = 2 · PPV · sensitivity / (PPV + sensitivity) (1). Micro-average F1-measure: 1. Calculate the sums of TP, FP, and FN across all classes. 2. Calculate F1 based on these sums. Macro-average F1-measure: 1. Calculate PPV and sensitivity for each class separately. 2. Calculate the mean PPV and mean sensitivity. 3. Calculate F1 based on these mean values. Sebastian Pölsterl 11 of 49
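
A small sketch of the difference between the two averages, using per-class (TP, FP, FN) counts invented for illustration (three one-vs-all matrices):

counts = [(10, 2, 3), (5, 4, 1), (2, 1, 6)]   # hypothetical (tp, fp, fn) per class

def f1(tp, fp, fn):
    ppv = tp / (tp + fp)
    sens = tp / (tp + fn)
    return 2 * ppv * sens / (ppv + sens)

# micro average: sum the counts first, then compute F1 once
micro_f1 = f1(sum(c[0] for c in counts), sum(c[1] for c in counts), sum(c[2] for c in counts))

# macro average (as on the slide): mean PPV and mean sensitivity, then F1 of the means
mean_ppv = sum(tp / (tp + fp) for tp, fp, fn in counts) / len(counts)
mean_sens = sum(tp / (tp + fn) for tp, fp, fn in counts) / len(counts)
macro_f1 = 2 * mean_ppv * mean_sens / (mean_ppv + mean_sens)
print(micro_f1, macro_f1)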

1 Classification 1. Confusion Matrix 2. Receiver operating characteristics 3. Precision-Recall Curve 2 Regression 3 Unsupervised Methods 4 Validation 1. Cross-Validation 2. Leave-one-out Cross-Validation 3. Bootstrap Validation 5 How to Do Cross-Validation Sebastian Pölsterl 12 of 49

Receiver operating characteristics (ROC). A scoring binary classifier returns a probability or score that represents the degree to which an instance belongs to the positive class. The ROC plot compares sensitivity (y-axis) with the false positive rate (x-axis) for all possible thresholds of the classifier's score. It visualizes the trade-off between benefits (sensitivity) and costs (FPR). [Figure: ROC curve with false positive rate on the x-axis and true positive rate on the y-axis] Sebastian Pölsterl 13 of 49
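
The curve is traced by sweeping a threshold over the scores and recording (FPR, TPR) at each value. A minimal sketch with made-up scores (this is not the code behind the slide's figure):

import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.4, 0.35, 0.8, 0.7, 0.1, 0.6, 0.2])

tpr, fpr = [], []
for t in np.sort(np.unique(scores))[::-1]:        # thresholds from high to low
    pred = (scores >= t).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    tpr.append(tp / np.sum(y_true == 1))
    fpr.append(fp / np.sum(y_true == 0))
print(list(zip(fpr, tpr)))                        # points of the ROC curve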

ROC Curve. The diagonal line from the lower left to the upper right corner corresponds to a random classifier. The curve of a perfect classifier goes through the upper left corner at (0, 1). A single confusion matrix corresponds to one point in ROC space. The ROC curve is insensitive to changes in the class distribution or changes in error costs. [Figure: example ROC curve, false positive rate vs. true positive rate] Sebastian Pölsterl 14 of 49

Area under the ROC curve (AUC). The AUC is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance (Mann-Whitney U test). The Gini coefficient is twice the area that lies between the diagonal and the ROC curve: Gini + 1 = 2 · AUC. [Figure: ROC curve with AUC = 0.89] Sebastian Pölsterl 15 of 49
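
Because of this ranking interpretation, the AUC can be estimated by counting how often a positive instance is scored above a negative one (ties counted as 0.5). A brute-force sketch with the same toy scores as above:

import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.4, 0.35, 0.8, 0.7, 0.1, 0.6, 0.2])

pos = scores[y_true == 1]
neg = scores[y_true == 0]
# fraction of (positive, negative) pairs that are ranked correctly
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
auc = wins / (len(pos) * len(neg))
gini = 2 * auc - 1
print(auc, gini)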

Averaging ROC curves I. Merging: merge the instances of the n tests and their respective scores and sort the complete set, yielding a single ROC curve. Vertical averaging: 1. Take vertical samples of the ROC curves at fixed false positive rates. 2. Construct confidence intervals for the mean of the true positive rates. [Figure: vertically averaged ROC curve with confidence intervals] Sebastian Pölsterl 16 of 49

Averaging ROC curves II. Threshold averaging: 1. Do the merging as described above. 2. Sample based on thresholds instead of points in ROC space. 3. Create confidence intervals for FPR and TPR at each point. [Figure: threshold-averaged ROC curve, average false positive rate vs. average true positive rate] Sebastian Pölsterl 17 of 49

Disadvantages of ROC curves. ROC curves can present an overly optimistic view of an algorithm's performance if there is a large skew in the class distribution, i.e. the data set contains many more samples of one class. In that case a large change in the number of false positives leads to only a small change in the false positive rate: FPR = FP / (FP + TN). Comparing false positives to true positives (precision) rather than to true negatives (FPR) captures the effect of the large number of negative examples: Precision = TP / (TP + FP). Sebastian Pölsterl 18 of 49

1 Classification 1. Confusion Matrix 2. Receiver operating characteristics 3. Precision-Recall Curve 2 Regression 3 Unsupervised Methods 4 Validation 1. Cross-Validation 2. Leave-one-out Cross-Validation 3. Bootstrap Validation 5 How to Do Cross-Validation Sebastian Pölsterl 19 of 49

Precision-Recall Curve. Compares precision (y-axis) to recall (x-axis) at different thresholds. The PR curve of an optimal classifier passes through the upper-right corner. One point in PR space corresponds to a single confusion matrix. The average precision is the area under the PR curve. [Figure: example precision-recall curve] Sebastian Pölsterl 20 of 49
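
A sketch of tracing the PR curve with the same toy scores as before; the average precision is approximated here as the step-wise area (sum of precision times the recall increments), which is one common convention:

import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.4, 0.35, 0.8, 0.7, 0.1, 0.6, 0.2])

order = np.argsort(-scores)                  # sort samples by decreasing score
y_sorted = y_true[order]
tp = np.cumsum(y_sorted)                     # true positives at each cut-off
fp = np.cumsum(1 - y_sorted)                 # false positives at each cut-off
precision = tp / (tp + fp)
recall = tp / y_true.sum()

ap = np.sum(np.diff(np.concatenate(([0.0], recall))) * precision)   # step-wise area
print(list(zip(recall, precision)), ap)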

Relationship to Precision-Recall Curve. Algorithms that optimize the area under the ROC curve are not guaranteed to optimize the area under the PR curve. Example: a dataset has 20 positive examples and 2000 negative examples. [Figure: ROC curves (FPR vs. TPR) and the corresponding PR curves (recall vs. precision) for this data set] Sebastian Pölsterl 21 of 49

1 Classification 1. Confusion Matrix 2. Receiver operating characteristics 3. Precision-Recall Curve 2 Regression 3 Unsupervised Methods 4 Validation 1. Cross-Validation 2. Leave-one-out Cross-Validation 3. Bootstrap Validation 5 How to Do Cross-Validation Sebastian Pölsterl 22 of 49

Evaluating Regression Results. Remember that the predicted value is continuous. Measuring the performance is based on comparing the actual value y_i with the predicted value ŷ_i for each sample. Measures are based on either the squared or the absolute differences. Sebastian Pölsterl 23 of 49

Regression Performance Measures. Sum of absolute errors (SAE): Σ_{i=1}^{n} |y_i - ŷ_i|. Sum of squared errors (SSE): Σ_{i=1}^{n} (y_i - ŷ_i)². Mean squared error (MSE): SSE / n. Root mean squared error (RMSE): √MSE. Sebastian Pölsterl 24 of 49
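
A minimal sketch of all four measures on arbitrary illustration values:

import numpy as np

y = np.array([3.0, 2.5, 4.1, 5.0])       # actual values
y_hat = np.array([2.8, 2.9, 3.9, 5.4])   # predicted values

sae = np.sum(np.abs(y - y_hat))
sse = np.sum((y - y_hat) ** 2)
mse = sse / len(y)
rmse = np.sqrt(mse)
print(sae, sse, mse, rmse)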

1 Classification 1. Confusion Matrix 2. Receiver operating characteristics 3. Precision-Recall Curve 2 Regression 3 Unsupervised Methods 4 Validation 1. Cross-Validation 2. Leave-one-out Cross-Validation 3. Bootstrap Validation 5 How to Do Cross-Validation Sebastian Pölsterl 25 of 49

Unsupervised Methods. Problem: ground truth is usually not available or requires manual assignment. Without ground truth (internal validation): cohesion, separation, silhouette coefficient. With ground truth (external validation): Jaccard index, Dice's coefficient, (normalized) mutual information, (adjusted) Rand index. Sebastian Pölsterl 26 of 49

Cohesion and Separation. Both require the definition of a proximity measure, such as a distance or similarity: cohesion(C_i) = Σ_{x,y ∈ C_i} proximity(x, y), separation(C_i, C_j) = Σ_{x ∈ C_i, y ∈ C_j} proximity(x, y). Sebastian Pölsterl 27 of 49

Silhouette Coefficient. a(i) is the mean distance between the i-th sample and all other points in the same cluster; b(i) is the mean distance to all points in the next nearest cluster. The silhouette coefficient s(i) ∈ [-1, 1] is defined as s(i) = (b(i) - a(i)) / max(a(i), b(i)). s(i) = 1 if the clustering is dense and well separated, s(i) = -1 if the i-th sample was assigned incorrectly, s(i) = 0 if clusters overlap. Sebastian Pölsterl 28 of 49
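
A sketch of the per-sample silhouette computation for a small 1-D example with two clusters (points and assignments are made up; with only two clusters, the other cluster is automatically the next nearest one):

import numpy as np

X = np.array([1.0, 1.2, 1.1, 5.0, 5.2, 4.9])   # 1-D points
labels = np.array([0, 0, 0, 1, 1, 1])          # cluster assignments

def silhouette(i):
    in_cluster = labels == labels[i]
    others = in_cluster & (np.arange(len(X)) != i)
    a = np.mean(np.abs(X[others] - X[i]))       # mean distance within own cluster
    b = np.mean(np.abs(X[~in_cluster] - X[i]))  # mean distance to the next nearest cluster
    return (b - a) / max(a, b)

print([round(silhouette(i), 3) for i in range(len(X))])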

Jaccard Index and Dice's Coefficient. Consider two sets S_1, S_2, where one set is used as the ground truth and the other was predicted. Example: pixels in image classification or segmentation. Jaccard index: Jaccard(S_1, S_2) = |S_1 ∩ S_2| / |S_1 ∪ S_2| ∈ [0, 1]. Dice's coefficient: Dice(S_1, S_2) = 2 |S_1 ∩ S_2| / (|S_1| + |S_2|) ∈ [0, 1]. Sebastian Pölsterl 29 of 49
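
Both overlap measures are one-liners on Python sets; a tiny sketch with made-up pixel indices:

ground_truth = {1, 2, 3, 4, 5}   # e.g. pixel indices of a ground-truth segmentation
predicted = {3, 4, 5, 6}         # pixel indices of the predicted segmentation

jaccard = len(ground_truth & predicted) / len(ground_truth | predicted)
dice = 2 * len(ground_truth & predicted) / (len(ground_truth) + len(predicted))
print(jaccard, dice)             # 0.5 0.666...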

1 Classification 1. Confusion Matrix 2. Receiver operating characteristics 3. Precision-Recall Curve 2 Regression 3 Unsupervised Methods 4 Validation 1. Cross-Validation 2. Leave-one-out Cross-Validation 3. Bootstrap Validation 5 How to Do Cross-Validation Sebastian Pölsterl 30 of 49

Validation Regimes. Without re-sampling: hold-out. With re-sampling: simple re-sampling, i.e. cross-validation (non-stratified, stratified, or leave-one-out) and random sub-sampling; and multiple re-sampling, i.e. bootstrapping, repeated cross-validation (e.g. 5×2 CV, 10×10 CV), and randomization (permutation test). Sebastian Pölsterl 31 of 49

Validation. Test error: the prediction error over an independent sample. Training error: the average loss over the training samples, (1/n) Σ_{i=1}^{n} L(y_i, f̂(x_i)). As the model gets more complex, it infers more information from the training data to represent more complicated underlying structures. Sebastian Pölsterl 32 of 49

Validation Training Error. The training error consistently decreases with increasing model complexity, whereas the test error eventually starts to increase again. The training error is therefore not a good measure of performance. Sebastian Pölsterl 33 of 49

Validation Over- and Underfitting. [Figure: prediction error vs. model complexity, with underfitting, optimum, and overfitting regions] Overfitting: a model with zero or very low training error fits the training data well but generalizes badly (model too complex). Underfitting: the model does not capture the underlying structure and hence performs poorly (model too simple). Sebastian Pölsterl 34 of 49

Validation Ideal Situation. Assume we have access to a large amount of data. Construct three different sets: 1. Training set: used to fit the model. 2. Validation set: used to estimate the prediction error in order to choose the best model (e.g. different costs C for SVMs). 3. Test set: used to assess how well the final model generalizes. [Figure: data split into training, validation, and test set] Sebastian Pölsterl 35 of 49

Cross-Validation. [Figure: 5-fold cross-validation; the data set is split into five folds, each fold is used once for validation while the remaining folds are used for training, and the five performance estimates are averaged] Cross-validation: split the data set into k equally large parts. Stratified cross-validation: ensures that the ratio between the classes in each fold is the same as in the complete dataset. Sebastian Pölsterl 36 of 49
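
A minimal manual sketch of k-fold splitting and averaging; the "classifier" is a trivial majority-vote stand-in simply to keep the example self-contained:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # synthetic features (unused by the stand-in model)
y = rng.integers(0, 2, size=100)              # synthetic binary labels

k = 5
folds = np.array_split(rng.permutation(len(y)), k)

accuracies = []
for i in range(k):
    valid_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    majority = np.bincount(y[train_idx]).argmax()        # "train" the stand-in model
    accuracies.append(np.mean(y[valid_idx] == majority)) # evaluate on the held-out fold

print(np.mean(accuracies))                    # average performance across folds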

Leave-one-out Cross-Validation Use all but one sample for training and assess performance on the excluded sample. For a data set with n samples, leave-one-out cross-validation is equivalent to n-fold cross-validation. Not suitable if data set is very large and/or training the classifier takes a long time. Sebastian Pölsterl 37 of 49

Bootstrap Sampling The bootstrap is a general tool for assessing statistical accuracy. Assumption: Our data set is a representative portion of the overall population. Bootstrap sampling: Randomly draw samples with replacement from the original data set to generate new data sets of the same size. Sebastian Pölsterl 38 of 49

Bootstrap Validation. Bootstrap sampling is repeated B times and the samples not included in each bootstrap sample are recorded. Train a model on each of the B bootstrap samples. For each sample of the original data set, assess the performance only with models trained on bootstrap samples that do not contain this sample: (1/n) Σ_{i=1}^{n} (1/|C^(-i)|) Σ_{b ∈ C^(-i)} L(y_i, f̂_b(x_i)), where C^(-i) is the set of bootstrap samples that do not contain sample i. Sebastian Pölsterl 39 of 49
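
A sketch of this leave-one-out bootstrap estimate, again with a trivial majority-vote model so the code runs on its own:

import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=60)               # synthetic binary labels

n, B = len(y), 50
losses = [[] for _ in range(n)]               # per-sample losses over models whose bootstrap sample excluded it

for b in range(B):
    boot_idx = rng.integers(0, n, size=n)             # draw n samples with replacement
    majority = np.bincount(y[boot_idx]).argmax()      # "train" on the bootstrap sample
    for i in np.setdiff1d(np.arange(n), boot_idx):    # samples left out of this bootstrap sample
        losses[i].append(float(y[i] != majority))     # 0/1 loss

err = np.mean([np.mean(l) for l in losses if l])      # skip samples never left out
print(err)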

1 Classification 1. Confusion Matrix 2. Receiver operating characteristics 3. Precision-Recall Curve 2 Regression 3 Unsupervised Methods 4 Validation 1. Cross-Validation 2. Leave-one-out Cross-Validation 3. Bootstrap Validation 5 How to Do Cross-Validation Sebastian Pölsterl 40 of 49

A Typical Strategy 1. Find a good subset of features that show fairly strong (univariate) correlation with the class labels 2. Using just this subset of features, build a multivariate classifier 3. Use cross-validation to estimate the unknown hyper-parameters and to estimate the prediction error of the final model. Is this the correct way to do cross-validation? Sebastian Pölsterl 41 of 49

Scenario. Consider a data set with 50 samples in two equal-sized classes and 5000 features that are independent of the class labels. The true test error rate of any classifier is 50%. Example: 1. Choose the 100 predictors with the highest correlation with the class labels. 2. Use a 1-nearest-neighbor classifier based on these 100 features. 3. Result: doing 50 simulations in this setting yielded an average CV error rate of 1.4%. Sebastian Pölsterl 42 of 49

What Happened? The classifier had an unfair advantage because the features were selected based on all samples. This violates the requirement that the test set be completely independent of the training set: the classifier has already seen the samples in the test set. Sebastian Pölsterl 43 of 49

What Happened? [Figure: histograms of the correlations of the selected features with the label, for the wrong and for the correct way of performing the feature selection] Sebastian Pölsterl 44 of 49

How to Do It Right? 1. Divide the data set into K folds at random. 2. For each fold k: 2.1 Find a subset of good features using only the samples outside fold k. 2.2 Using this subset, build a multivariate classifier, again using all samples except those in fold k. 2.3 Use the classifier to predict the class labels of the samples in fold k. Result: The estimated mean error rate is 51.2%, which is much closer to the true test error rate. Sebastian Pölsterl 45 of 49

How to Do It Right? Cross-validation must be applied to the entire sequence of modeling steps. Examples: selection of features, tuning of hyper-parameters. Sebastian Pölsterl 46 of 49
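
A sketch of the correct loop for the scenario above: the univariate feature selection lives inside each fold, so the held-out fold never influences which features are chosen. The 1-nearest-neighbor step is written by hand to keep the example dependency-free; names and data are illustrative only:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))               # 50 samples, 5000 label-independent features
y = np.repeat([0, 1], 25)                     # two equal-sized classes

K, n_feat = 5, 100
folds = np.array_split(rng.permutation(len(y)), K)
errors = []
for k in range(K):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    # feature selection on the training part of this fold only
    Xc = X[train_idx] - X[train_idx].mean(axis=0)
    yc = y[train_idx] - y[train_idx].mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    selected = np.argsort(corr)[-n_feat:]
    Xtr, Xte = X[np.ix_(train_idx, selected)], X[np.ix_(test_idx, selected)]
    # 1-nearest-neighbor prediction
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    pred = y[train_idx][d.argmin(axis=1)]
    errors.append(np.mean(pred != y[test_idx]))

print(np.mean(errors))                        # should land near 0.5, the true error rate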

Conclusion. Many different performance measures for classification exist. ROC and precision-recall curves can be applied to binary classifiers that return probabilities or scores. Cross-validation is the most commonly used validation scheme. The bootstrap can be used not only for validation but also in many other applications (e.g. bagging). Important: every performance measure has its advantages and disadvantages; there is no single best measure. Therefore, you have to consider multiple measures to evaluate your model. Sebastian Pölsterl 47 of 49

References (1) Davis, J. and Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06), pages 233-240, New York, NY, USA. ACM. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861-874. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer, second edition. http://www-stat.stanford.edu/~tibs/elemstatlearn/. Sebastian Pölsterl 48 of 49

References (2) Parker, C. (2011). An Analysis of Performance Measures for Binary Classifiers. In 2011 IEEE 11th International Conference on Data Mining, pages 517-526. IEEE. Sebastian Pölsterl 49 of 49