Cross-validation for detecting and preventing overfitting


Cross-validation for detecting and preventing overfitting
Andrew W. Moore / Anna Goldenberg
School of Computer Science, Carnegie Mellon University
Copyright 2001, Andrew W. Moore. Apr 1st, 2004

We want to learn a model from data. A toy example: can we learn an f(x) that fits our data, y = f(x) + noise?

Which is best?
The candidates are a linear fit, a quadratic fit, and a piecewise linear fit. Why not choose the method with the best fit to the data?

What do we really want?
Not the closest fit to the data we already have, but an answer to: how well are you going to predict future data drawn from the same distribution?

The test set method
1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Learn the model from the training set (in the linear function example, fit a line).

4. Estimate your future performance with the test set.

How can we tell how good the prediction is? Use the MSE (Mean Squared Error). With three observed test points (x_1, y_1^o), (x_2, y_2^o), (x_3, y_3^o) and the model's predictions (x_1, y_1^p), (x_2, y_2^p), (x_3, y_3^p):

MSE = ( (y_1^p - y_1^o)^2 + (y_2^p - y_2^o)^2 + (y_3^p - y_3^o)^2 ) / 3

In general, over n test points:

MSE = (1/n) * sum_{i=1}^{n} (prediction_i - observation_i)^2

The test set method (linear function example): randomly choose 30% of the data to be a test set, learn the model from the remaining training set, and estimate your future performance by the mean squared error on the test set.
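To make the recipe concrete, here is a minimal sketch of the test-set method in Python/numpy. The synthetic dataset and the exact split sizes are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the slides' running example: y = f(x) + noise.
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.shape)

# 1. Randomly choose 30% of the data to be in a test set.
idx = rng.permutation(len(x))
n_test = int(0.3 * len(x))
test, train = idx[:n_test], idx[n_test:]

# 2-3. The remainder is a training set; learn the model from it
# (here a degree-1 polynomial, i.e. the linear hypothesis).
coeffs = np.polyfit(x[train], y[train], 1)

# 4. Estimate future performance: the MSE on the held-out test set.
pred = np.polyval(coeffs, x[test])
print("test-set MSE:", np.mean((pred - y[test]) ** 2))
```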

The same recipe scores the other hypotheses. The test set method (quadratic function): perform model fitting on the training set and estimate future performance on the test set. The test set method (piecewise linear function): likewise, fit on the training set and score on the test set.

The test set method
Good news: it is very, very simple, and you can simply choose the method with the best test-set score.
Bad news: it wastes data (the best model is estimated using 30% less data), and if we don't have much data, our test set might just be lucky or unlucky. We say the test-set estimator of performance has high variance.
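The high-variance complaint is easy to check empirically: re-running the random split on the same small dataset gives noticeably different MSE estimates each time. A sketch, on an assumed synthetic dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)    # a small dataset, where variance bites hardest
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.shape)

estimates = []
for _ in range(20):                # 20 different random 70/30 splits of the SAME data
    idx = rng.permutation(len(x))
    test, train = idx[:15], idx[15:]
    coeffs = np.polyfit(x[train], y[train], 1)
    estimates.append(np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2))

print(f"test-set MSE varies from {min(estimates):.2f} to {max(estimates):.2f}")
```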

LOOCV (Leave-one-out Cross Validation)
For k = 1 to R, where R = # of records:
1. Let (x_k, y_k) be the k-th record.
2. Temporarily remove (x_k, y_k) from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on (x_k, y_k).
When you've done all points, report the mean error.

On the running example: linear model, MSE_LOOCV = 2.12; quadratic model, MSE_LOOCV = 0.962; join-the-dots model, MSE_LOOCV = 3.33.
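A minimal LOOCV sketch following the four steps above; the polynomial models and the synthetic data are stand-ins for the slide figures:

```python
import numpy as np

def loocv_mse(x, y, deg):
    """Leave-one-out CV of a degree-`deg` polynomial fit."""
    errors = []
    for k in range(len(x)):                         # for k = 1 to R
        keep = np.arange(len(x)) != k               # temporarily remove record k
        coeffs = np.polyfit(x[keep], y[keep], deg)  # train on the remaining R-1 points
        errors.append((np.polyval(coeffs, x[k]) - y[k]) ** 2)  # error on (x_k, y_k)
    return np.mean(errors)                          # report the mean error

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.shape)
print("linear   :", round(loocv_mse(x, y, 1), 3))
print("quadratic:", round(loocv_mse(x, y, 2), 3))
```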

Which kind of Cross Validation?

Method          Downside                                               Upside
Test-set        Variance: unreliable estimate of future performance    Cheap
Leave-one-out   Expensive. Has some weird behavior                     Doesn't waste data

...can we get the best of both worlds?

k-fold Cross Validation
Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). For each partition in turn: train on all the points not in that partition, then find the test-set sum of errors on the held-out points. When every partition has served as the test set, report the mean error over the whole dataset.

On the running example: linear model, MSE_3FOLD = 2.05; quadratic model, MSE_3FOLD = 1.11; piecewise linear model, MSE_3FOLD = 2.93.
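The same procedure as a sketch, assuming the partitions are formed from a random permutation (standing in for the slides' red/green/blue coloring):

```python
import numpy as np

def kfold_mse(x, y, deg, k=3, seed=0):
    """k-fold CV: mean squared error over every held-out point."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)         # randomly break the dataset into k partitions
    sq_err = np.empty(len(x))
    for fold in folds:                     # for each partition in turn
        train = np.setdiff1d(idx, fold)    # train on all points NOT in the partition
        coeffs = np.polyfit(x[train], y[train], deg)
        sq_err[fold] = (np.polyval(coeffs, x[fold]) - y[fold]) ** 2
    return sq_err.mean()                   # then report the mean error

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.shape)
for deg in (1, 2):
    print(f"degree {deg}: MSE_3FOLD = {kfold_mse(x, y, deg):.3f}")
```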

Which kind of Cross Validation?

Method          Downside                                                         Upside
Test-set        Variance: unreliable estimate of future performance              Cheap
Leave-one-out   Expensive. Has some weird behavior                               Doesn't waste data
10-fold         Wastes 10% of the data. 10 times more expensive than test set    Only wastes 10%. Only 10 times more expensive instead of R times
3-fold          Wastier than 10-fold. Expensivier than test set                  Slightly better than test-set
R-fold          Identical to leave-one-out                                       (see leave-one-out)

But note: we can use algorithmic tricks to make the expensive variants cheap for some learners.

CV-based Model Selection
We're trying to decide which algorithm to use. We train each machine and make a table:

 i   f_i   TRAINERR   10-FOLD-CV-ERR   Choice
 1   f_1
 2   f_2
 3   f_3
 4   f_4
 5   f_5
 6   f_6

We mark as our Choice the model with the lowest 10-fold CV error, not the lowest training error.

Alternatives to CV-based model selection:
AIC (Akaike Information Criterion): asymptotically equivalent to LOOCV.
BIC (Bayesian Information Criterion): used for finding the best structure rather than for classification; asymptotically equivalent to a carefully chosen k-fold.
VC-dimension (Vapnik-Chervonenkis dimension): only directly applicable to choosing classifiers; wildly conservative.
Their advantage over CV: you only need the training error. There are many alternatives, including proper Bayesian approaches.
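A sketch that reproduces the shape of this table, using polynomial degree as the family of candidate models f_1..f_6; the data and the choice of candidates are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=60)
y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)   # a curvy target

def kfold_mse(x, y, deg, k=10):
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    err = np.empty(len(x))
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        p = np.polyfit(x[train], y[train], deg)
        err[fold] = (np.polyval(p, x[fold]) - y[fold]) ** 2
    return err.mean()

# One row per candidate model f_i (here: polynomial degrees 1..6).
print("deg  TRAINERR  10-FOLD-CV-ERR")
for deg in range(1, 7):
    p = np.polyfit(x, y, deg)
    trainerr = np.mean((np.polyval(p, x) - y) ** 2)
    print(f"{deg:>3}  {trainerr:8.3f}  {kfold_mse(x, y, deg):14.3f}")
# Choice: the row with the lowest CV error. Training error alone always
# favors the most flexible model, so it cannot be used for the choice.
```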

Other kinds of cross-validation
You can do leave-all-pairs-out or leave-all-n-tuples-out if feeling resourceful, or k-folds where each fold is an independently chosen subset of the data.

Cross-validation is useful for:
preventing overfitting;
comparing different learning algorithms;
choosing the number of hidden units in a neural net;
feature selection (see later);
choosing a polynomial degree.

Cross-validation for classification
Instead of computing the sum squared errors on a test set, you should compute the total number of misclassifications on the test set. But there's a more sensitive alternative: compute log P(all test outputs | all test inputs, your model).

Cross-validation for classification is useful for:
choosing the pruning parameter for decision trees;
feature selection (see later);
deciding what kind of Gaussian to use in a Gaussian-based Bayes classifier;
choosing which classifier to use.
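Both scoring rules are short in code. A sketch for binary classification, where `p_pred` is assumed to be the model's predicted probability of class 1 for each test point:

```python
import numpy as np

def misclassification_count(y_true, y_pred):
    """The first score: total number of misclassifications on a test set."""
    return int(np.sum(y_true != y_pred))

def test_log_likelihood(y_true, p_pred):
    """The more sensitive score: log P(all test outputs | all test inputs, model),
    where p_pred[i] is the predicted probability that y_true[i] == 1."""
    p = np.clip(p_pred, 1e-12, 1 - 1e-12)   # guard against log(0)
    return float(np.sum(np.where(y_true == 1, np.log(p), np.log(1 - p))))

y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.6, 0.8, 0.4])
print(misclassification_count(y_true, (p_pred > 0.5).astype(int)))   # -> 0
print(round(test_log_likelihood(y_true, p_pred), 3))
```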

Feature Selection
Suppose you have a learning algorithm LA and a set of input attributes { X_1, X_2, ..., X_m }. You expect that LA will only find some subset of the attributes useful. Question: how can we use cross-validation to find a useful subset? Four ideas (another fun area in which Andrew has spent a lot of time):
1. Forward selection (sketched below)
2. Backward elimination
3. Hill climbing
4. Stochastic search (simulated annealing or GAs)
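A sketch of the first idea, forward selection driven by k-fold CV error. The linear learner, the stopping rule, and the helper names (`cv_mse`, `forward_selection`) are assumptions for illustration, not from the slides:

```python
import numpy as np

def cv_mse(X, y, k=5, seed=0):
    """k-fold CV error of an ordinary least-squares model on the chosen columns of X."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    err = np.empty(len(y))
    for fold in folds:
        tr = np.setdiff1d(idx, fold)
        w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        err[fold] = (X[fold] @ w - y[fold]) ** 2
    return err.mean()

def forward_selection(X, y):
    """Greedily add whichever attribute most improves CV error; stop when none helps."""
    selected, remaining, best = [], list(range(X.shape[1])), np.inf
    while remaining:
        scores = {j: cv_mse(X[:, selected + [j]], y) for j in remaining}
        j = min(scores, key=scores.get)
        if scores[j] >= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = scores[j]
    return selected, best

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 10))
y = 3 * X[:, 2] - 2 * X[:, 7] + rng.normal(scale=0.5, size=80)
print(forward_selection(X, y))   # typically recovers columns 2 and 7
```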

Very serious warning
Intensive use of cross-validation can overfit. How? Imagine a dataset with 50 records and 1000 attributes. You try 1000 linear models, each one using one of the attributes. The best of those 1000 looks good! But you realize it would have looked good even if the output had been purely random!
What can be done about it? Hold out an additional test set before doing any model selection. The best model should perform well even on the additional test set.

What you should know
Why you can't use training-set error to estimate the quality of your learning algorithm on your data.
Why you can't use training-set error to choose the learning algorithm.
Test-set cross-validation.
Leave-one-out cross-validation.
k-fold cross-validation.
Feature selection methods.
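The warning can be reproduced in a few lines: with 50 records, 1000 attributes, and a purely random output, picking the best of 1000 single-attribute models still tends to earn a flattering CV score, while an additional held-out set (here assumed to be a second synthetic sample) exposes it. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))    # 50 records, 1000 attributes
y = rng.normal(size=50)            # the output is purely random noise

def loocv_mse(x1, y):
    """LOOCV of a one-attribute linear model y ~ a*x1 + b."""
    errs = []
    for k in range(len(y)):
        keep = np.arange(len(y)) != k
        a, b = np.polyfit(x1[keep], y[keep], 1)
        errs.append((a * x1[k] + b - y[k]) ** 2)
    return np.mean(errs)

# Try 1000 linear models, each using one attribute; keep the best CV score.
scores = [loocv_mse(X[:, j], y) for j in range(X.shape[1])]
best = int(np.argmin(scores))
print(f"best attribute's CV MSE: {scores[best]:.2f}  (variance of y: {y.var():.2f})")

# The fix: an additional held-out set, untouched during model selection.
X_extra = rng.normal(size=(50, 1000))
y_extra = rng.normal(size=50)
a, b = np.polyfit(X[:, best], y, 1)
print(f"same model on fresh data: MSE {np.mean((a * X_extra[:, best] + b - y_extra) ** 2):.2f}")
```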
