Evaluation. Charles Sutton. Data Mining and Exploration, Spring 2012.


Evaluate what?
Do you want to evaluate a classifier or a learning algorithm? Do you want to predict accuracy, or predict which method is better? Do you have a lot of data or not much? Are you interested in one domain, or in understanding accuracy across domains?

For really large amounts of data...
You could use training error to estimate your test error. But this is stupid, so don't do it. Instead, split the instances randomly into a training set and a test set. But then suppose you need to: compare 5 different algorithms; compare 5 different feature sets; and each of them has different knobs in the training algorithm (e.g., size of neural network, gradient descent step size, k in k-nearest neighbour, etc.).

A: Use a validation set.
Split the data into training, validation, and testing sets. When you first get the data, put the test set away and don't look at it. The validation set lets you compare and tweak the parameters of the different algorithms. This is a fine way to work, if you have lots of data.
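As a concrete illustration, here is a minimal R sketch of such a random split (not from the slides; the 60/20/20 proportions and the data frame `d` are assumptions):

set.seed(1)                        # for reproducibility
n <- nrow(d)
idx <- sample(n)                   # random permutation of the row indices
train <- d[idx[1:floor(0.6 * n)], ]
valid <- d[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
test  <- d[idx[(floor(0.8 * n) + 1):n], ]
# Fit on `train`, tune on `valid`, and touch `test` only once, at the very end.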

1. Hypothesis Testing

Variability
Classifier A: 81% accuracy. Classifier B: 84% accuracy. Which classifier do you think is best? But then suppose I tell you there are only 100 examples in the test set. After 400 more test examples, I get:

  Test examples   0-100   101-200   201-300   301-400   401-500
  A:              0.81    0.77      0.78      0.81      0.78
  B:              0.84    0.75      0.75      0.76      0.78

Sources of Variability
Choice of training set; choice of test set; errors in data labeling; inherent randomness in the learning algorithm.

Key point: your measured testing error is a random variable (you sampled the testing data). We want to infer the true test error based on this sample. This is another learning problem!

Test error is a random variable
Call your test data $x_1, x_2, \ldots, x_N$, where $x_i \sim D$ independently. Let $h$ be the classifier, $f$ the true function, and $\mathbb{1}[\cdot]$ the indicator (delta) function.

  True error:  $e = \Pr_{x \sim D}[f(x) \neq h(x)]$
  Test error:  $\hat{e} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[f(x_i) \neq h(x_i)]$

Theorem: as $N \to \infty$, $\hat{e} \to e$. [Why?] Moreover, the number of test errors $N\hat{e} \sim \mathrm{Binomial}(N, e)$.

Main question
Suppose Classifier A has 81% accuracy and Classifier B has 84% accuracy. Is that difference real?
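A quick way to see this is to simulate it. A minimal R sketch (not from the slides; the true error 0.20 and N = 100 are assumptions):

set.seed(2)
e <- 0.20; N <- 100
err.hat <- rbinom(20, size = N, prob = e) / N   # 20 simulated test sets of size N
round(1 - err.hat, 2)                           # accuracies scatter around 0.80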

Rough-and-ready variability
Classifier A: 81% accuracy. Classifier B: 84% accuracy. Is that difference real? Answer 1: if doing cross-validation, report both the mean and the standard deviation of the error across folds.

[Diagram: learning vs. evaluation. World: the original problem (e.g., the difference between spam and normal emails) and its true error. Sample: inboxes for multiple users and the classifier's performance on each example. Estimation: the classifier and its average error on the test set.]

Hypothesis testing
We want to know whether $\hat{e}_A$ and $\hat{e}_B$ are significantly different.
1. Suppose not. ["null hypothesis"]
2. Define a test statistic, in this case $T = e_A - e_B$.
3. Measure a value of the statistic, $\hat{T} = \hat{e}_A - \hat{e}_B$.
4. Derive the distribution of $\hat{T}$ assuming #1.
5. If $p = \Pr[T > \hat{T}]$ is really low, e.g., < 0.05, reject the null hypothesis. $p$ is your p-value.
If you reject, then the difference is statistically significant.

Example
Classifier A: 81% accuracy. Classifier B: 84% accuracy. So $\hat{T} = \hat{e}_A - \hat{e}_B = 0.03$.
Step 4: derive the distribution of $\hat{T}$ assuming the null. What we know: $N\hat{e}_A \sim \mathrm{Binomial}(N, e_A)$, $N\hat{e}_B \sim \mathrm{Binomial}(N, e_B)$, and $e_A = e_B$.

Approximation to the rescue
Approximate the binomial by a normal: $N\hat{e}_A \approx \mathcal{N}(Ne_A, s_A^2)$ with $s_A^2 = Ne_A(1 - e_A)$, and likewise for B.

Distribution under the null
[Plot: the sampling distribution of $\hat{T}$ under the null.]
Step 4: derive the distribution of $\hat{T}$ assuming the null. What we know: $N\hat{e}_A \approx \mathcal{N}(Ne_A, s_A^2)$, $N\hat{e}_B \approx \mathcal{N}(Ne_B, s_B^2)$, and $e_A = e_B$. Assuming the two are independent, this means

  $\hat{e}_A - \hat{e}_B \approx \mathcal{N}(0, s_{AB}^2)$,  where $s_{AB}^2 = \frac{2\,e_{AB}(1 - e_{AB})}{N}$ and $e_{AB} = \frac{1}{2}(e_A + e_B)$.

Computing the p-value
Step 5: if $p = \Pr[T > \hat{T}]$ is really low, e.g., < 0.05, reject the null hypothesis. In our example $s_{AB}^2 \approx 0.0029$, so one line of R (or MATLAB):

> pnorm(-0.03, mean=0, sd=sqrt(0.0029))
[1] 0.2887343
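Spelling out the same computation as a short R sketch (the numbers N = 100 and accuracies 81% and 84% come from the example above):

N     <- 100
e.A   <- 1 - 0.81                         # error of classifier A
e.B   <- 1 - 0.84                         # error of classifier B
e.AB  <- (e.A + e.B) / 2                  # pooled error under the null
s2.AB <- 2 * e.AB * (1 - e.AB) / N        # variance of e.A - e.B under the null
T.hat <- e.A - e.B                        # observed difference, 0.03
1 - pnorm(T.hat, mean = 0, sd = sqrt(s2.AB))   # Pr[T > T.hat], about 0.29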

Frequentist Statistics
What does $p = \Pr[T > \hat{T}]$ really mean? It refers to the frequency behaviour if the test is applied over and over to different data sets. This is fundamentally different from (and more orthodox than) Bayesian statistics.
[Plot: histogram of $\hat{T}$ over 1000 test sets for classifiers A and B, with the error computed under the null. In our example $\hat{T} = 0.03$; the p-value is the shaded upper-tail area.]

Errors in the hypothesis test
Type I error: false rejects. Type II error: false non-rejects. The logic is to fix the Type I error at α = 0.05 and design the test to minimise the Type II error.

Summary
Call this test the difference-in-proportions test, an instance of a z-test. This is OK, but there are tests that work better in practice...
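The "1000 test sets" histogram can be reproduced by simulation. A sketch in R (assumptions: both classifiers share the pooled true error 0.175 under the null, and N = 100 as in the example):

set.seed(3)
N <- 100; e.null <- 0.175
T.hat.sim <- (rbinom(1000, N, e.null) - rbinom(1000, N, e.null)) / N
mean(T.hat.sim > 0.03)     # roughly 0.29, in line with the normal approximation
hist(T.hat.sim, breaks = 30, main = "Null distribution of T.hat")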

McNemar's Test
Cross-tabulate the test examples by whether each classifier got them right:

                           Classifier A correct   Classifier A wrong
  Classifier B correct            n11                    n10
  Classifier B wrong              n01                    n00

Let $p_A$ be the probability that A is correct GIVEN that A and B disagree. Null hypothesis: $p_A = 0.5$. Test statistic:

  $\frac{(|n_{10} - n_{01}| - 1)^2}{n_{01} + n_{10}}$

Distribution under the null? $\chi^2$ with 1 degree of freedom. [Plot: the $\chi^2_1$ density.]

Pros/Cons of McNemar's test
Pros: it doesn't require the independence assumptions of the difference-of-proportions test, and it works well in practice [Dietterich, 1998]. Cons: it does not assess training set variability.
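Base R implements this test directly. A sketch with made-up counts (only the disagreement cells n10 and n01 drive the result):

tab <- matrix(c(820, 40,     # column "A correct": n11, n01
                 25, 115),   # column "A wrong":   n10, n00
              nrow = 2,
              dimnames = list(B = c("correct", "wrong"),
                              A = c("correct", "wrong")))
mcnemar.test(tab)            # default continuity correction gives (|n10 - n01| - 1)^2 / (n10 + n01)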

Accuracy is not the only measure
Accuracy is great, but not always helpful; e.g., consider a two-class problem where 98% of the instances are negative. Alternative: for every class C, define

  Precision  P = (# instances of C that the classifier got right) / (# instances that the classifier predicted as C)
  Recall     R = (# instances of C that the classifier got right) / (# true instances of C)

F-measure
  $F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R}$

Calibration
Sometimes we care about the confidence of a classification. If the classifier outputs probabilities, we can use the cross-entropy

  $H(p) = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i)$

where $(x_i, y_i)$ are the feature vector and true label for instance $i$, and $p(y_i \mid x_i)$ is the probability output by the classifier.

An aside
We've talked a lot about overfitting. What does this mean for well-known contest data sets (like the ones in your mini-project)? Think about the paper publishing process: I have an idea, implement it, try it on a standard train/test set, and publish a paper if it works. Is there a problem with this?
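A minimal R sketch of these per-class measures (the label vectors are made up for illustration):

prf <- function(truth, pred, class) {
  tp <- sum(pred == class & truth == class)
  P  <- tp / sum(pred == class)      # precision: correct among those predicted as C
  R  <- tp / sum(truth == class)     # recall: correct among the true C
  c(precision = P, recall = R, F1 = 2 * P * R / (P + R))
}
truth <- c("spam", "spam", "ham", "ham", "ham", "spam")
pred  <- c("spam", "ham",  "ham", "spam", "ham", "spam")
prf(truth, pred, "spam")             # P = R = F1 = 2/3 on this toy data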

2. ROC curves (Receiver Operating Characteristic)

Problems in what we've done so far: skewed class distributions; differing costs.

Classifiers as rankers
Most classifiers output a real-valued score as well as a prediction, e.g., decision trees: the proportion of classes at the leaf; logistic regression: P(class | x). Instead of evaluating accuracy at a single threshold, evaluate how good the score is at ranking.

More evaluation measures
Assume two classes. (It is hard to do ROC with more.)

                   Predicted +   Predicted -
  True +               TP            FN
  True -               FP            TN

TP: true positives, FP: false positives, TN: true negatives, FN: false negatives.

  ACC = (TP + TN) / (TP + TN + FP + FN)
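In R, the confusion matrix and accuracy fall out of a single table (reusing the hypothetical `truth` and `pred` vectors from the precision/recall sketch above):

cm <- table(True = truth, Predicted = pred)
cm                          # 2 x 2 confusion matrix
sum(diag(cm)) / sum(cm)     # ACC = (TP + TN) / (TP + TN + FP + FN)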

More evaluation measures
From the same confusion matrix:

  R   = TP / (TP + FN)    a.k.a. recall; the true positive rate (the denominator is the total number of + instances)
  P   = TP / (TP + FP)    precision
  FPR = FP / (FP + TN)    the false positive rate (the denominator is the total number of - instances)

ROC space
[Plot: ROC space, with the true positive rate R on the y-axis and the false positive rate FPR on the x-axis; example classifiers A through E are plotted as points. From Fawcett, 2003.]

ROC curves
Sort the test set by the classifier's score, from most positive to most negative, and sweep a threshold over the scores. At each threshold, count the true positives and false positives among the instances scored above it. Add a point for every possible threshold, and the result is the ROC curve. [Plot: two example ROC curves built this way from a small test set of POSITIVE and NEGATIVE labels, with TP and FP counts marked at two thresholds. From Fawcett, 2003.]

Class skew
ROC curves are insensitive to class skew: R = TP / (TP + FN) is normalised by the total number of + instances, and FPR = FP / (FP + TN) by the total number of - instances.

Area under curve (AUC)
Sometimes you want a single number: the area under the ROC curve. [Plot: two ROC curves, A and B, with different areas under them. From Fawcett, 2003.]
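A bare-bones R sketch of both steps, with made-up scores and labels (1 = positive, 0 = negative):

roc <- function(score, label) {
  ord <- order(score, decreasing = TRUE)               # sort by classifier score
  list(fpr = c(0, cumsum(label[ord] == 0) / sum(label == 0)),
       tpr = c(0, cumsum(label[ord] == 1) / sum(label == 1)))
}
auc <- function(r)                                     # trapezoid rule
  sum(diff(r$fpr) * (head(r$tpr, -1) + tail(r$tpr, -1)) / 2)

score <- c(0.95, 0.85, 0.80, 0.60, 0.55, 0.30, 0.20, 0.10)
label <- c(1,    1,    0,    1,    0,    1,    0,    0)
r <- roc(score, label)
plot(r$fpr, r$tpr, type = "b", xlab = "False Positive rate", ylab = "True Positive rate")
auc(r)                                                 # 0.8125 on this toy data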

3. Cross Validation

Recap: use a validation set (training / validation / testing). When you first get the data, put the test set away and don't look at it. The validation set lets you compare and tweak the parameters of different algorithms. This is a fine way to work, if you have lots of data.

Problem: you don't have lots of data. This causes two problems: you don't want to set aside a test set (a waste of perfectly good data), and there's lots of variability in your estimate of the error.

Cross-validation
Has several goals: don't waste examples by never using them for training; get some idea of the variation due to training sets; allow tweaking classifier parameters without the use of a separate validation set.

Cross-validation
Randomly split the data into K equal-sized subsets (called "folds"); call these D1, D2, ..., DK.
For i = 1 to K: train on all of D1, D2, ..., DK except Di; test on Di and let $e_i$ be the testing error.
Final estimate of the test error: $\hat{e} = \frac{1}{K} \sum_{i=1}^{K} e_i$, the mean of the test error across folds.
[Diagram ("Cross-validation, prettily"): with five subsets, fold 1 tests on subset 1 and trains on subsets 2-5, fold 2 tests on subset 2 and trains on the rest, and so on.]

How to pick K?
Bigger K (e.g., K = N, called leave-one-out) means bigger training sets (good if the training data is small). Smaller K means bigger test sets (good), less computational expense, and less overlap between training sets. I typically use 5 or 10. N.B. you can use more than one fold for testing.

Comments about C-V
Tune the parameters of your learning algorithm via the cross-validation error. Note that the different training sets are (highly) dependent. Sometimes you need to be careful about exactly which data goes into the training-test splits (e.g., fMRI data, University HTML pages).
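The loop above as a minimal R sketch (train.fn and predict.fn are hypothetical stand-ins for whatever learner you use, and the data frame `d` is assumed to have a `label` column):

kfold.cv <- function(d, K = 5, train.fn, predict.fn) {
  folds <- sample(rep(1:K, length.out = nrow(d)))   # random fold assignment
  errs <- sapply(1:K, function(i) {
    model <- train.fn(d[folds != i, ])              # train on every fold but i
    pred  <- predict.fn(model, d[folds == i, ])     # test on fold i
    mean(pred != d$label[folds == i])               # e_i, the error on fold i
  })
  c(mean = mean(errs), sd = sd(errs))               # report both, as suggested earlier
}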

Some questions about C-V
Say I'm doing 5-fold CV. I get 5 classifiers out. Which one is my final one for my problem? Let's say I want to choose the pruning parameter for my decision tree, and I use C-V to do it. How do I then estimate the error of my final classifier?

4. Evaluating clustering

How to evaluate clustering? If you really knew what you wanted, you'd be doing classification instead of clustering.
Option 1: measure how well the clusters do for some other task (e.g., as features for a classifier, or for ranking documents in IR). Not always what you want to do.
Option 2: measure goodness of fit.
Option 3: compare to an external set of labels.

Evaluation for Clustering
Suppose that we do have labeled data for evaluation, called ground truth, that we don't use in the clustering algorithm. (This is more for evaluating an algorithm than a particular clustering.) Each example has features X, a cluster label $C \in \{1, 2, \ldots, K\}$, and a ground truth label $Y \in \{1, 2, \ldots, J\}$. Let $C_k$ be the set of examples in cluster $k$ and $Y_j$ the set of examples with true label $j$.

Purity
Essentially, your best possible accuracy if clusters are mapped to ground truth labels. Reminder: $C_k$ is the set of examples in cluster $k$, $Y_j$ is the set of examples with true label $j$, and $N$ is the number of data points.

  $\text{Purity} = \frac{1}{N} \sum_{k=1}^{K} \max_{j \in [1, J]} |C_k \cap Y_j|$

Purity example
[Figure: three clusters of labelled points, whose majority labels cover 5, 4, and 3 points out of 6, 5, and 5.] Purity is (5 + 4 + 3) / (6 + 5 + 5) = 12/16 = 0.75.

Problem with Purity
[Figure: clusterings produced by Alg A and Alg B over three clusters.] Both of these have the same purity, but Alg B is doing no better than predicting the majority class. Lessons: (a) try simple baselines; (b) look at multiple evaluation metrics.

Rand Index
Consider pairwise decisions: for each pair of examples, is the pair in the same cluster, and does it share the same ground truth label?

                           Ground truth same   Ground truth different
  Clustering same                 TP                   FP
  Clustering different            FN                   TN

Now we can compute P, R, and F1 over pairs. Accuracy in this table is called the Rand Index.
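Purity is one line of R once you have the contingency table. A sketch whose made-up vectors reproduce the 12/16 example above:

purity <- function(cluster, truth) {
  tab <- table(cluster, truth)             # K x J counts |C_k intersect Y_j|
  sum(apply(tab, 1, max)) / length(truth)  # best label per cluster, then average
}
cluster <- c(1,1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3)
truth   <- c("x","x","x","x","x","o", "o","o","o","o","d", "d","d","d","x","o")
purity(cluster, truth)                     # (5 + 4 + 3) / 16 = 0.75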

5. Other issues in evaluation

Ceiling effects
Decision tree: 97%. AdaBoost: 98%. Mystery algorithm: 96%. Moral: if your test set is too easy, it won't tell you anything about the algorithms. Always compare to simpler baselines to evaluate how easy (or hard) the testing problem is, and always ask yourself what chance performance would be.
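Two baselines that are cheap to report alongside any classifier, as a small R sketch (not from the slides):

majority.baseline <- function(truth) max(table(truth)) / length(truth)   # always predict the most common class
chance.baseline   <- function(truth) 1 / length(unique(truth))           # uniform random guessing
majority.baseline(c("ham", "ham", "ham", "spam"))                        # 0.75 on this toy vector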

Floor effects
Similarly, your problem could be so hard that no algorithm does well. Example: stock picking, where there are no experts. One way to get at this is inter-annotator agreement.

Learning Curves
[Plot: accuracy against the number of training instances, rising steeply at first and then flattening out.] It can be interesting to look at how learning performance differs as you get more data. This can tell you whether it's worth spending money to gather more data. Some algorithms are better with small training sets, but worse with large ones. Learning curves usually have this shape. Why? You learn the easy information from the first few examples (e.g., the word "Viagra" usually means the email is spam).

References
Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895-1924.
Fawcett, T. (2003). ROC Graphs: Notes and Practical Considerations for Researchers. HP Labs Tech Report HPL-2003-4. http://home.comcast.net/~tom.fawcett/public_html/papers/roc101.pdf
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/ir-book/html/htmledition/evaluation-of-clustering-1.html