Sample 1. Dataset Distribution F Sample 2. Real world Distribution F. Sample k

Size: px

Start display at page:

Download "Sample 1. Dataset Distribution F Sample 2. Real world Distribution F. Sample k"

Miles Williamson
6 years ago
Views:

1 can not be emphasized enough that no claim whatsoever is It made in this paper that all algorithms are equivalent in being in the real world. In particular, no claim is being made practice, one should not use cross-validation in the real world. that Wolpert, 1994 Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection Kohavi Ron University Stanford IJCAI-95 1

2 Motivation & Summary of Results have ten induction algorithms. Which one is the best for a You dataset? given run them all and pick the one with the highest estimated Answer: accuracy. Which accuracy estimation? 1. Holdout? Cross-validation? Bootstrap? Resubstitution? 2. How sure would you be of your choice? If you had spare CPU cycles, would you alway choose leaveone-out? 3. accuracy estimation: stratied 10 to 20-fold CV. For model selection: multiple runs of 3-5 CV. For 2

3 Accuracy estimation methods: ➁ Holdout & random subsampling. F F Bootstrap. Talk Outline ➀ Accuracy estimation: the problem. F Cross-validation. ➂ Experiments, recent experiments. ➃ Summary. 3

4 Basic Denitions 1. Let D be a dataset of n labelled instances. Assume D is an i.i.d. sample from some underlying distribution 2. on the set of labelled instances. Let I be an induction algorithm that produces a classier 3. data D. from 4

5 The Problem Estimate the accuracy of the classier induced from the dataset. accuracy of a classier is the probability of correctly classifying The a randomly selected instance from the distribution. resampling methods (holdout, cross-validation, bootstrap) All use a uniform distribution on the given dataset as a distri- will bution from which they sample. Sample 1 Real world Distribution F Dataset Distribution F Sample 2 Sample k 5

6 Holdout partition the data into two mutually exclusive subsets Holdout a training set and a test set. The training set is given to called the inducer, and the induced classier is tested on the test set. 2/3 Inducer Classifier Training Set Eval 1/3 estimator because only a portion of the data is given Pessimistic the inducer for training. to 6

7 8 >< z < acc h acc acc(1 acc)=h >: 9 >= Condence Bounds classication of each test instance can be viewed as a The trial: correct or incorrect prediction. Bernoulli S be the number of correct classications on the test set of Let h, then S is distributed binomially and S=h is approximately size normal with mean acc and variance of acc (1 acc)=h. Pr q < z >; (1) 7

8 Random Subsampling Repeating the holdout many times is called random subsampling. Common mistake: cannot compute the standard deviation of the samples' One mean accuracy in random subsampling because the test instances are not independent. t-tests you commonly see with signicance levels are The wrong. usually Example: random concept (50% for each class). One induc- algorithm predicts 0, the other 1. Given a sample of 100 tion chances are it won't be 50/50, but something like instances, 53/47. Leaving 1/3 out gives s.d.=8.7%. 8

9 Audience comments? Pedro Domingos et al., please keep the comments to the end. as dened here is the prediction accuracy on instances Accuracy from the parent population, not from the small dataset sampled we have at hand. with random subsampling tests the hypothesis that one t-tests is better than another, assuming the distribution is algorithm uniform and each instance in the dataset has 1=n probability. is not an interesting hypothesis, except that we know how This compute it. to 9

10 Cross Validation k-fold cross-validation, the dataset D is randomly split into k In exclusive subsets (the folds) D1 ; D 2 ; : : : ; D k of approxi- mutually mately equal size. inducer is trained and tested k times; each time t 2 The 2; : : : ; kg it is trained on D n D t and tested on D t. f1; Training Set Train: Inducer Inducer c c Test: 3 2 Eval Eval Avg 3 2 Inducer c 1 Eval 10

11 Standard Deviation Two ways to look at cross-validation: A method that works for stable inducers, i.e., inducers that 1. not change their predictions much when given similar do datasets (see paper). A method of computing the expected accuracy for a dataset 2. size n k=n, where k is the number of folds. of each instance is used only once as a test instance, the Since deviation can be estimated as acccv (1 acccv)=n, standard where n is the number of instances in the dataset. 11

12 Failure of cross-validation/leave-one-out Leave-one-out is n-fold cross-validation. iris dataset contains 50 instances of each class (three Fisher's leading one to expect that a majority inducer should classes), have accuracy about 33%. an instance is deleted from the dataset, its label is a When in the training set; thus the majority inducer predicts minority of the other two classes and always errs in classifying the one instance. test leave-one-out estimated accuracy for a majority inducer The on the iris dataset is therefore 0%0% 12

13 Given a number b, the number of bootstrap samples, let 0 i where acc s acc boot = 1 b bx i=1 (0:632 0 i + :368 acc s ).632 Bootstrap bootstrap sample is created by sampling n instances uniformly A from the data (with replacement). be accuracy estimate for bootstrap sample i on the instances the included in the sample. The.632 bootstrap estimate is not dened as is the resubstitution accuracy estimate on the full dataset. 13

14 Bootstrap failure.632 bootstrap fails to give the expected result when the The can perfectly t the data (e.g., an unpruned decision classier or a one nearest neighbor classier) and the dataset is completely tree random, say with two classes. apparent accuracy is 100%, and the 0 accuracy is about The Plugging these into the bootstrap formula, one gets an 50%. accuracy of about 68.4%, far from the real accuracy estimated 50%. of 14

15 Experimental design: Q & A Which induction algorithm to use? C4.5 and Naive Bayes. 1. FAST inducers. Reason: Which datasets to use? Seven datasets were chosen. 2. Reasons: (a) Wide variety of domains. (b) At least 500 instances. (c) Learning curve that did not atten too early. 15

16 % Accuracy Breast cancer NB C4.5 TS size % Accuracy Chess endgames C4.5 NB TS size % Accuracy Hypothyroid C4.5 NB TS size % Accuracy Mushroom C4.5 NB TS size % Accuracy Soybean (large) NB C TS size % Accuracy Vehicle C4.5 NB TS size Learning Curves Arrow indicate sampling points. 16

17 The bias (CV) bias of a method that estimates a parameter is dened as The expected estimated value minus the value of. An unbiased the % acc Soybean Vehicle Rand folds % acc Mushroom Hypo Chess folds estimation method is a method that has zero bias. C4.5: The bias of cross-validation with varying folds. A negative k folds stands for leave-k-out. Error bars are 95% condence intervals for the mean. The gray regions indicate 95% condence intervals for the true accuracies. 17

18 is right on the mark for three out of six, but biased Bootstrap soybean and extremely biased for vehicle and rand. for Bootstrap Bias % acc Soybean Vehicle Rand Estimated Soybean Vehicle Rand samples % acc Mushroom Hypo Chess samples C4.5: The bias of bootstrap. 18

19 The Variance of CV std dev Vehicle Soybean Rand Breast Hypo Chess Mushroom C folds std dev Vehicle Soybean Rand Breast Hypo Chess Mushroom C folds The variance is high at the ends: two fold and leave-f1,2g-out Regular (left), Stratied (right). Stratication slightly reduces both bias and variance. 19

20 The Variance of Bootstrap std dev C4.5 7 Soybean 6 Rand Vehicle 5 Breast Hypo Mushroom Chess samples std dev Naive-Bayes 7 Rand Vehicle 6 5 Breast 4 3 Soybean 2 Chess 1 Hypo Mushroom samples 20

21 Multiple Times stabilize the cross-validation, we can execute CV multiple To shuing the instances each time. times, std dev Soybean Vehicle Breast Rand Hypo Chess Mushroom C4.5: 5 x CV folds std dev Rand Vehicle Soybean Breast Chess Hypo Mushroom Naive Bayes: 5 x CV folds graphs show the eect on C4.5 and NB when CV was run The times. The variance for lower folds was due to the variance ve because of smaller dataset size. 21

22 Summary Reviewed accuracy estimation: holdout, cross-validation, 1. CV, bootstrap. All methods fail in some cases. stratied 2. Bootstrap has low variance, but is extremely biased. With cross-validation, you can determine the bias/variance 3. that is appropriate, and run multiple times to reduce tradeo variance. Standard deviations can be computed for cross-validation, 4. are incorrect for random subsampling (holdout). but Recommendation accuracy estimation: stratied 10 to 20-fold CV. For model selection: multiple runs of 3-5 folds (dynamic). For 22

23 no. of sample-size no. of C4.5 Naive-Bayes Dataset / total size categories attr. cancer 10 50/ Breast / Chess / Hypothyroid / Mushroom large / Soybean / Vehicle What is the \true" accuracy estimate the true accuracy at the sampling points, we sampled To 500 times and estimated the accuracy using the unseen instances (holdout). Rand /

24 std dev Vehicle Soybean Rand Breast C4.5: 2-fold CV std dev Vehicle Rand Soybean Breast Naive Bayes: 2-fold CV How Many Times Should We Repeat? 3 2 Hypo 1 Chess Mushroom times 3 2 Chess 1 Hypo Mushroom times Multiple runs with two-fold CV. The more, the better. 24

Ron Kohavi. Stanford University. for low variance, assuming the bias aects all classiers. similarly (e.g., estimates are 5% pessimistic).

Ron Kohavi. Stanford University. for low variance, assuming the bias aects all classiers. similarly (e.g., estimates are 5% pessimistic). Appears in the International Joint Conference on Articial Intelligence (IJCAI), 1995 A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection Ron Kohavi Computer Science Department