A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995)

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995). Department of Information, Operations and Management Sciences, Stern School of Business, NYU. padamopo@stern.nyu.edu. February 27, 2012.

Goal and Motivation Review accuracy estimation methods and compare the two most common ones: cross-validation and bootstrap. Estimating the accuracy of a classifier is important both to predict its future prediction accuracy, where we would like an estimate with low bias and low variance, and to choose a classifier from a given set (model selection), where we are willing to trade off bias for low variance.

Accuracy The accuracy of a classifier C is the probability of correctly classifying a randomly selected instance, i.e., $\mathrm{acc} = \Pr(C(v) = y)$ for a randomly selected instance $\langle v, y \rangle \in X$.

Holdout The holdout method partitions the data into a training set and a test set (or holdout set). The holdout estimated accuracy is defined as $\mathrm{acc}_h = \frac{1}{h} \sum_{\langle v_i, y_i \rangle \in D_h} \delta(I(D_t, v_i), y_i)$, where $I(D, v)$ is the label assigned to an unlabelled instance $v$ by the classifier built by inducer $I$ on dataset $D$, $D_h$ is the holdout set (a subset of $D$ of size $h$), $D_t = D \setminus D_h$, and $\delta(i, j) = 1$ if $i = j$ and $0$ otherwise.
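
A minimal sketch of the holdout estimate above, assuming a scikit-learn-style inducer with fit/predict methods; the function name holdout_accuracy and the one-third holdout fraction are illustrative choices, not prescribed by the paper.

```python
# Hedged sketch of the holdout accuracy estimate acc_h.
import numpy as np

def holdout_accuracy(inducer, X, y, holdout_fraction=1/3, seed=0):
    n = len(y)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    h = int(n * holdout_fraction)              # size of the holdout set D_h
    test_idx, train_idx = perm[:h], perm[h:]
    inducer.fit(X[train_idx], y[train_idx])    # train on D_t = D \ D_h
    pred = inducer.predict(X[test_idx])        # classify each v_i in D_h
    return np.mean(pred == y[test_idx])        # (1/h) * sum of delta(I(D_t, v_i), y_i)
```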

Holdout The more instances we leave for the test set, the higher the bias of our estimate; however, fewer test set instances mean a wider confidence interval for the accuracy. The holdout estimate also depends on the particular division into a training set and a test set. In random subsampling, the holdout method is repeated k times and the estimated accuracy is derived by averaging the runs; this, however, violates the assumption that the instances in the test set are independent of those in the training set. In practice, the dataset size is always finite, and usually smaller than we would like it to be. The holdout method also makes inefficient use of the data: typically a third of the dataset is not used for training the inducer.

Cross-Validation, Leave-one-out, and Stratification In k-fold cross-validation, the dataset $D$ is split into k mutually exclusive folds $D_1, D_2, \ldots, D_k$ of approximately equal size. Each time $t \in \{1, 2, \ldots, k\}$, the inducer is trained on $D \setminus D_t$ and tested on $D_t$. The cross-validation estimate of accuracy is the overall number of correct classifications divided by the number of instances in the dataset. Repeating cross-validation multiple times using different splits into folds provides a better Monte-Carlo estimate of the complete cross-validation, at an added cost. In stratified cross-validation, the folds are stratified so that they contain approximately the same proportions of labels as the original dataset.
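
A minimal sketch of (stratified) k-fold cross-validation under the same fit/predict assumption, using scikit-learn's KFold and StratifiedKFold splitters; the function name cv_accuracy is illustrative.

```python
# Hedged sketch of the k-fold cross-validation accuracy estimate acc_cv.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

def cv_accuracy(inducer, X, y, k=10, stratified=True, seed=0):
    splitter_cls = StratifiedKFold if stratified else KFold
    splitter = splitter_cls(n_splits=k, shuffle=True, random_state=seed)
    correct = 0
    for train_idx, test_idx in splitter.split(X, y):
        inducer.fit(X[train_idx], y[train_idx])            # train on D \ D_t
        correct += np.sum(inducer.predict(X[test_idx]) == y[test_idx])
    return correct / len(y)                                # correct classifications / n
```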

Cross-Validation, Leave-one-out, and Stratification Proposition (Variance in k-fold cross-validation). If the inducer is stable under the perturbations caused by deleting the instances for the folds in k-fold cross-validation, the cross-validation estimate will be unbiased and the variance of the estimated accuracy will be approximately $\mathrm{acc}_{cv} \cdot (1 - \mathrm{acc}_{cv}) / n$, where $n$ is the number of instances in the dataset.

Corollary (Variance in cross-validation). If the inducer is stable under the perturbations caused by deleting the test instances for the folds in k-fold cross-validation for various values of k, then the variance of the estimates will be the same.
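
As a numeric illustration of the proposition (with made-up values), an estimated accuracy of 0.9 on a dataset of 300 instances gives an approximate standard deviation of about 1.7 percentage points:

```python
# Illustrative evaluation of the approximate variance formula acc_cv * (1 - acc_cv) / n.
import math

acc_cv, n = 0.9, 300
std = math.sqrt(acc_cv * (1 - acc_cv) / n)
print(f"approx. std of acc_cv: {std:.4f}")   # ~0.0173
```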

Bootstrap Given a dataset of size n, a bootstrap sample is created by sampling n instances uniformly from the data, with replacement. Given b, the number of bootstrap samples, let $\varepsilon 0_i$ be the accuracy of the classifier built on bootstrap sample i when tested on the instances not in that sample. The .632 bootstrap estimate is defined as $\mathrm{acc}_{boot} = \frac{1}{b} \sum_{i=1}^{b} \left( 0.632 \cdot \varepsilon 0_i + 0.368 \cdot \mathrm{acc}_s \right)$, where $\mathrm{acc}_s$ is the resubstitution accuracy on the full dataset. The assumptions made by the bootstrap are basically the same as those of cross-validation, i.e., stability of the algorithm on the dataset.
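
A minimal sketch of the .632 bootstrap estimate under the same fit/predict assumption; the function name bootstrap_632_accuracy and the default of b = 50 samples are illustrative choices.

```python
# Hedged sketch of the .632 bootstrap accuracy estimate acc_boot.
import numpy as np

def bootstrap_632_accuracy(inducer, X, y, b=50, seed=0):
    n = len(y)
    rng = np.random.default_rng(seed)
    inducer.fit(X, y)
    acc_s = np.mean(inducer.predict(X) == y)       # resubstitution accuracy on the full data
    estimates = []
    for _ in range(b):
        sample = rng.integers(0, n, size=n)        # n instances drawn with replacement
        out = np.setdiff1d(np.arange(n), sample)   # instances not in the bootstrap sample
        inducer.fit(X[sample], y[sample])
        eps0 = np.mean(inducer.predict(X[out]) == y[out])
        estimates.append(0.632 * eps0 + 0.368 * acc_s)
    return np.mean(estimates)
```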

Methodology They use C4.5 and a Naive-Bayes classifier to conduct a large-scale experiment. Because the target concept is unknown for real-world datasets, the holdout method was used to approximate the true accuracy. Six datasets from a wide variety of domains, chosen so that the learning curve for both algorithms did not flatten out too early, plus a no-information (random) dataset, were used. To see how well an accuracy estimation method performs, they ran the induction algorithm on the training set and tested the classifier on the rest of the instances in the dataset. This was repeated 50 times at points where the learning curve was sloping up. The same folds in cross-validation and the same samples in bootstrap were used for both compared algorithms.

The Bias Figure: (a) C4.5: The bias of cross-validation with varying folds. (b) C4.5: The bias of bootstrap with varying samples.

The Bias The diagrams clearly show that k-fold cross-validation is pessimistically biased, especially for two and five folds. Most of the estimates are reasonably good at 10 folds, and at 20 folds they are almost unbiased. Stratified cross-validation had similar behavior, except for lower pessimism.

The Variance Figure: (a) Cross-validation and (b) .632 bootstrap: standard deviation of accuracy (population).

The Variance Cross-validation has high variance at two folds on both C4.5 and Naive-Bayes. On C4.5, there is high variance at the high end as well, at leave-one-out and leave-two-out, for three of the seven datasets. Stratification reduces the variance slightly, and thus seems to be uniformly better than regular cross-validation, both for bias and variance.

Summary The results indicate that: stratification is generally a better scheme, both in terms of bias and variance, when compared to regular cross-validation; bootstrap has low variance, but extremely large bias on some problems; and the best method to use for model selection is ten-fold stratified cross-validation, even if computational power allows using more folds.
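
Following the ten-fold stratified cross-validation recommendation, a short usage sketch with scikit-learn, which stratifies folds by default when an integer cv is passed together with a classifier; the dataset and classifier below are placeholders, not those used in the paper.

```python
# Hedged example: ten-fold stratified cross-validation for model assessment.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"10-fold stratified CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```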

Thank you!

References Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Volume 2, pp. 1137-1143. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.