Cross-validation for detecting and preventing overfitting
Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http:// . Comments and corrections gratefully received.

Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
awm@cs.cmu.edu
Copyright 2001, Andrew W. Moore. Oct 15th, 2001

A Regression Problem

y = f(x) + noise. Can we learn f from this data? Let's consider three methods.
Linear Regression

Univariate linear regression with a constant term:

  X   Y
  3   7
  1   3
  :   :

That is, X = (3, 1, ...) and y = (7, 3, ...), so x_1 = (3), y_1 = 7, and so on. Originally discussed in the previous Andrew Lecture: Neural Nets.
To handle the constant term, augment each input with a leading 1, so z_1 = (1, 3), ..., z_k = (1, x_k):

  Z = 1 3      y = 7
      1 1          3
      : :          :

Then the coefficients are

  β = (Z^T Z)^{-1} (Z^T y)

and the prediction is y_est = β_0 + β_1 x.
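To make the closed-form fit concrete, here is a minimal NumPy sketch. Only the rows (3, 7) and (1, 3) come from the slide's table; the remaining data points are hypothetical.

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 2.0])   # inputs (last two rows are made up)
y = np.array([7.0, 3.0, 8.0, 5.0])   # outputs

# Augment each input with a constant term: z_k = (1, x_k)
Z = np.column_stack([np.ones_like(x), x])

# beta = (Z^T Z)^{-1} (Z^T y), solved without forming an explicit inverse
beta = np.linalg.solve(Z.T @ Z, Z.T @ y)

y_est = beta[0] + beta[1] * x        # y_est = beta_0 + beta_1 * x
print(beta, y_est)
```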
Quadratic Regression

Same data, but each input is expanded to z = (1, x, x^2):

  Z = 1 3 9      y = 7
      1 1 1          3
      : : :          :

As before, β = (Z^T Z)^{-1} (Z^T y), and the prediction is y_est = β_0 + β_1 x + β_2 x^2. Much more about this in the future Andrew Lecture: Favorite Regression Algorithms.
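The quadratic fit is the same computation with one extra basis column; a sketch on the same hypothetical toy data:

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 2.0])   # same hypothetical toy data as above
y = np.array([7.0, 3.0, 8.0, 5.0])

# Basis expansion z = (1, x, x^2)
Z = np.column_stack([np.ones_like(x), x, x**2])
beta = np.linalg.solve(Z.T @ Z, Z.T @ y)

y_est = Z @ beta                     # y_est = beta_0 + beta_1*x + beta_2*x^2
print(beta, y_est)
```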
Join-the-dots

Also known as "piecewise linear nonparametric regression," if that makes you feel better.

Which is best?

Why not choose the method with the best fit to the data?
What do we really want?

Why not choose the method with the best fit to the data? Because what matters is how well you are going to predict future data drawn from the same distribution.

The test set method

1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.

Linear regression example: Mean Squared Error = 2.4
Quadratic regression example: Mean Squared Error = 0.9
Join-the-dots example: Mean Squared Error = 2.2
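A sketch of the whole recipe on synthetic data; the data-generating line and sample size are assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 5, 20)                    # hypothetical data
y = 2 * x + 1 + rng.normal(0, 1.0, 20)       # noisy line (an assumption)

idx = rng.permutation(20)
test, train = idx[:6], idx[6:]               # ~30% test, remainder training

# Fit linear regression on the training set only
Z = np.column_stack([np.ones(len(train)), x[train]])
beta = np.linalg.solve(Z.T @ Z, Z.T @ y[train])

# Estimate future performance on the held-out test set
preds = beta[0] + beta[1] * x[test]
print("Mean Squared Error =", np.mean((preds - y[test]) ** 2))
```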
The test set method

Good news: Very very simple. Can then simply choose the method with the best test-set score.

Bad news: Wastes data. We get an estimate of the best method to apply to 30% less data. And if we don't have much data, our test set might just be lucky or unlucky. We say the test-set estimator of performance has high variance.
LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
1. Let (x_k, y_k) be the k-th record.
2. Temporarily remove (x_k, y_k) from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on (x_k, y_k).
When you've done all points, report the mean error.

Linear regression: MSE_LOOCV = 2.12
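A minimal LOOCV sketch for the linear fit, assuming the same kind of hypothetical toy data as above:

```python
import numpy as np

def fit(Z, y):
    return np.linalg.solve(Z.T @ Z, Z.T @ y)

def basis(x):
    return np.column_stack([np.ones_like(x), x])   # linear, with constant term

x = np.array([3.0, 1.0, 4.0, 2.0, 5.0])   # hypothetical data
y = np.array([7.0, 3.0, 8.0, 5.0, 9.0])

errors = []
R = len(x)
for k in range(R):
    train = np.arange(R) != k              # temporarily remove record k
    beta = fit(basis(x[train]), y[train])  # train on the remaining R-1 points
    pred = basis(x[k:k+1]) @ beta
    errors.append((pred[0] - y[k]) ** 2)   # squared error on the held-out point

print("MSE_LOOCV =", np.mean(errors))      # mean error over all points
```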
Running the same procedure for the other two models:

LOOCV for Quadratic Regression: MSE_LOOCV = 0.962
LOOCV for Join The Dots: MSE_LOOCV = 3.33
Which kind of Cross Validation?

                 Downside                                              Upside
  Test-set       Variance: unreliable estimate of future performance   Cheap
  Leave-one-out  Expensive. Has some weird behavior                    Doesn't waste data

..can we get the best of both worlds?

k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we'll have k=3 partitions, colored Red, Green and Blue).
For the red partition: train on all the points not in the red partition, then find the test-set sum of errors on the red points.
For the green partition: train on all the points not in the green partition, then find the test-set sum of errors on the green points.
For the blue partition: train on all the points not in the blue partition, then find the test-set sum of errors on the blue points.
Then report the mean error.

Linear Regression: MSE_3FOLD = 2.05
Quadratic Regression: MSE_3FOLD = 1.11
Join-the-dots: MSE_3FOLD = 2.93
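A sketch of k-fold CV with k=3, where fold labels play the role of the Red/Green/Blue coloring; the data is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 12)                    # hypothetical data
y = 2 * x + 1 + rng.normal(0, 1.0, 12)

k = 3
folds = rng.permutation(np.arange(12) % k)   # random Red/Green/Blue partition

total_sq_error = 0.0
for fold in range(k):
    test = folds == fold
    # Train on all the points not in this partition
    Z_tr = np.column_stack([np.ones(np.sum(~test)), x[~test]])
    beta = np.linalg.solve(Z_tr.T @ Z_tr, Z_tr.T @ y[~test])
    # Sum of errors on this partition's points
    preds = beta[0] + beta[1] * x[test]
    total_sq_error += np.sum((preds - y[test]) ** 2)

print("MSE_3FOLD =", total_sq_error / 12)
```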
Which kind of Cross Validation?

                 Downside                                              Upside
  Test-set       Variance: unreliable estimate of future performance   Cheap
  Leave-one-out  Expensive. Has some weird behavior                    Doesn't waste data
  10-fold        Wastes 10% of the data. 10 times more expensive       Only wastes 10%. Only 10 times more
                 than test-set                                         expensive instead of R times
  3-fold         Wastier than 10-fold. Expensivier than test-set       Slightly better than test-set
  R-fold         Identical to Leave-one-out                            Doesn't waste data

But note: one of Andrew's joys in life is algorithmic tricks for making these cheap.
CV-based Model Selection

We're trying to decide which algorithm to use. We train each machine and make a table, then choose the f_i with the best 10-fold CV error (a sketch of this procedure follows the list of alternatives below):

  i   f_i   TRAINERR   10-FOLD-CV-ERR   Choice
  1   f_1
  2   f_2
  3   f_3
  4   f_4
  5   f_5
  6   f_6

Alternatives to CV-based model selection

Model selection methods:
1. Cross-validation
2. AIC (Akaike Information Criterion)
3. BIC (Bayesian Information Criterion)
4. VC-dimension (Vapnik-Chervonenkis Dimension): only directly applicable to choosing classifiers; described in a future Andrew Lecture
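Returning to the table above: a sketch of choosing among candidate models by 10-fold CV score. The candidates here are polynomial degrees standing in for the machines f_1..f_6, and the data is made up:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 60)
y = x**2 + rng.normal(0, 0.3, 60)             # hypothetical quadratic truth
folds = rng.permutation(np.arange(60) % 10)   # one fixed 10-fold partition

def cv_mse(degree, k=10):
    err = 0.0
    for f in range(k):
        test = folds == f
        coef = np.polyfit(x[~test], y[~test], degree)   # "train each machine"
        err += np.sum((np.polyval(coef, x[test]) - y[test]) ** 2)
    return err / len(x)

scores = {d: cv_mse(d) for d in range(1, 7)}  # f_1 .. f_6 from the table
best = min(scores, key=scores.get)
print(scores, "-> choice:", best)
```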
Which model selection method is best?

1. (CV) Cross-validation
2. AIC (Akaike Information Criterion)
3. BIC (Bayesian Information Criterion)
4. (SRMVC) Structural Risk Minimization with VC-dimension

AIC, BIC and SRMVC have the advantage that you only need the training error, and CV error might have more variance; but SRMVC is wildly conservative. Asymptotically, AIC and leave-one-out CV should be the same; asymptotically, BIC and a carefully chosen k-fold should be the same. You want BIC if you want the best structure instead of the best predictor (e.g. for clustering or Bayes Net structure finding). Many alternatives exist, including proper Bayesian approaches. It's an emotional issue.

Other Cross-validation issues

Can do "leave all pairs out" or "leave-all-n-tuples-out" if feeling resourceful. Some folks do k-folds in which each fold is an independently-chosen subset of the data. Do you know what AIC and BIC are? If so... LOOCV behaves like AIC asymptotically, and k-fold behaves like BIC if you choose k carefully. If not... Nyardely nyardely noo noo.
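For reference, a sketch of AIC/BIC scoring for the polynomial candidates, using the standard Gaussian-noise forms with additive constants dropped; these formulas are a common convention, not something stated on the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, 60)                   # hypothetical data
y = x**2 + rng.normal(0, 0.3, 60)
n = len(x)

for degree in range(1, 7):
    p = degree + 1                           # number of fitted parameters
    coef = np.polyfit(x, y, degree)
    rss = np.sum((np.polyval(coef, x) - y) ** 2)
    aic = n * np.log(rss / n) + 2 * p        # AIC, up to an additive constant
    bic = n * np.log(rss / n) + p * np.log(n)  # BIC, same convention
    print(degree, round(aic, 1), round(bic, 1))
```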
Cross-Validation for regression

Uses:
- Choosing the number of hidden units in a neural net
- Feature selection (see later)
- Choosing a polynomial degree
- Choosing which regressor to use

Supervising Gradient Descent

This is a weird but common use of test-set validation. Suppose you have a neural net with too many hidden units; it will overfit. As gradient descent progresses, maintain a graph of MSE-testset-error vs. iteration. [Figure: training-set and test-set Mean Squared Error plotted against the iteration of gradient descent; use the weights you found on the iteration where test-set error is lowest.]
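A sketch of this early-stopping scheme, with a linear least-squares model standing in for the neural net and all data hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, 30)
Z = np.column_stack([np.ones(30), x])
y = Z @ np.array([1.0, 2.0]) + rng.normal(0, 1.0, 30)
train, test = slice(0, 20), slice(20, 30)    # hold out a test set

w = np.zeros(2)
best_w, best_mse = w.copy(), np.inf
for it in range(500):
    # Gradient of the training-set MSE
    grad = 2 * Z[train].T @ (Z[train] @ w - y[train]) / 20
    w -= 0.01 * grad
    # Track test-set MSE each iteration; keep the best-so-far weights
    test_mse = np.mean((Z[test] @ w - y[test]) ** 2)
    if test_mse < best_mse:
        best_mse, best_w = test_mse, w.copy()

print("best test MSE:", best_mse, "weights:", best_w)
```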
This relies on an intuition that a not-fully-minimized set of weights is somewhat like having fewer parameters. It works pretty well in practice, apparently.
Cross-validation for classification

Instead of computing the sum squared errors on a test set, you should compute the total number of misclassifications on a test set. But there's a more sensitive alternative: compute log P(all test outputs | all test inputs, your model).
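A sketch of both test-set scores for a classifier that outputs class probabilities; the labels and probabilities are hypothetical:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])               # hypothetical test labels
p_class1 = np.array([0.9, 0.2, 0.6, 0.4, 0.1])   # model's P(y=1 | x)

# Score 1: total number of misclassifications on the test set
pred = (p_class1 >= 0.5).astype(int)
n_misclassified = np.sum(pred != y_true)

# Score 2: log P(all test outputs | all test inputs, model),
# assuming the test points are independent given the model
p_of_true = np.where(y_true == 1, p_class1, 1 - p_class1)
log_lik = np.sum(np.log(p_of_true))

print(n_misclassified, log_lik)
```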
Cross-Validation for classification

Uses:
- Choosing the pruning parameter for decision trees
- Feature selection (see later)
- What kind of Gaussian to use in a Gaussian-based Bayes Classifier
- Choosing which classifier to use

Cross-Validation for density estimation

Compute the sum of log-likelihoods of test points. Example uses:
- Choosing what kind of Gaussian assumption to use
- Choosing the density estimator
- NOT feature selection (test-set density will almost always look better with fewer features)
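A sketch of the density-estimation score: fit a 1-D Gaussian (by maximum likelihood) on the training points and sum the log-likelihoods of the test points. The sample is made up:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(10.0, 2.0, 50)         # hypothetical sample
train, test = data[:35], data[35:]

mu, sigma = train.mean(), train.std()    # MLE Gaussian fit on the training set
log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                 - (test - mu) ** 2 / (2 * sigma**2))
print(log_lik)                           # higher is better
```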
Feature Selection

Suppose you have a learning algorithm LA and a set of input attributes {X_1, X_2, ..., X_m}. You expect that LA will only find some subset of the attributes useful. Question: how can we use cross-validation to find a useful subset? Four ideas (a sketch of forward selection follows the list):
- Forward selection
- Backward elimination
- Hill Climbing
- Stochastic search (Simulated Annealing or GAs)
Another fun area in which Andrew has spent a lot of his wild youth.
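A sketch of forward selection scored by k-fold CV: greedily add whichever single attribute most improves the CV error, and stop when nothing helps. Data and stopping rule are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 6))                 # 6 candidate attributes
y = 3 * X[:, 1] - 2 * X[:, 4] + rng.normal(0, 0.5, 50)
folds = rng.permutation(np.arange(50) % 5)   # fixed 5-fold partition

def cv_mse(cols):
    Z = np.column_stack([np.ones(50), X[:, cols]])
    err = 0.0
    for f in range(5):
        te = folds == f
        beta = np.linalg.lstsq(Z[~te], y[~te], rcond=None)[0]
        err += np.sum((Z[te] @ beta - y[te]) ** 2)
    return err / 50

selected, remaining, best_err = [], list(range(6)), np.inf
while remaining:
    trial = {j: cv_mse(selected + [j]) for j in remaining}
    j = min(trial, key=trial.get)
    if trial[j] >= best_err:                 # stop when no attribute helps
        break
    best_err = trial[j]
    selected.append(j)
    remaining.remove(j)

print("selected attributes:", selected, "CV error:", best_err)
```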
Very serious warning

Intensive use of cross validation can overfit. How? Imagine a dataset with 50 records and 1000 attributes. You try 1000 linear regression models, each one using one of the attributes. The best of those 1000 looks good!
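A sketch of the trap, with everything synthetic: 50 records, 1000 attributes, each single-attribute model scored by LOOCV. The minimum of 1000 scores looks impressive purely through selection, which is exactly the warning here:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 50, 1000
X = rng.normal(size=(n, m))                  # 1000 attributes
y = rng.normal(size=n)                       # output is purely random

def loocv_mse(col):
    errs = []
    for k in range(n):
        tr = np.arange(n) != k
        Z = np.column_stack([np.ones(n - 1), X[tr, col]])
        beta = np.linalg.lstsq(Z, y[tr], rcond=None)[0]
        errs.append((beta[0] + beta[1] * X[k, col] - y[k]) ** 2)
    return float(np.mean(errs))

scores = [loocv_mse(j) for j in range(m)]
print("best of 1000:", min(scores), "median:", float(np.median(scores)))
```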
But you'd realize it would have looked good even if the output had been purely random! What can be done about it? Hold out an additional test set before doing any model selection, and check that the best model performs well even on that additional test set.

What you should know

- Why you can't use training-set error to estimate the quality of your learning algorithm on your data.
- Why you can't use training-set error to choose the learning algorithm.
- Test-set cross-validation.
- Leave-one-out cross-validation.
- k-fold cross-validation.
- Feature selection methods.
- CV for classification, regression & densities.