Cross-validation for detecting and preventing overfitting


1 Cross-validation for detecting and preventing overfitting
Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: Comments and corrections gratefully received.
Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University, awm@cs.cmu.edu
Copyright Andrew W. Moore. Slide 1

2 A Regression Problem
$y = f(x) + \text{noise}$
Can we learn f from this data? Let's consider three methods.

3 Linear Regression

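4 Linear Regression
Univariate linear regression with a constant term. The data table lists inputs X = 3, 1, ... and outputs Y = 7, 3, ...:
$X = \begin{pmatrix} 3 \\ 1 \\ \vdots \end{pmatrix}, \quad y = \begin{pmatrix} 7 \\ 3 \\ \vdots \end{pmatrix}, \quad x_1 = (3), \; y_1 = 7, \ldots$
$Z = \begin{pmatrix} 1 & 3 \\ 1 & 1 \\ \vdots & \vdots \end{pmatrix}, \quad z_1 = (1, 3), \ldots, \quad z_k = (1, x_k)$
$\beta = (Z^T Z)^{-1} (Z^T y)$
$y^{est} = \beta_0 + \beta_1 x$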
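As a rough illustration of the formula above, the two-column case can be solved by hand, because $Z^T Z$ is only a 2x2 matrix. The sketch below is not taken from the slides: `fit_linear` is a hypothetical helper name, and the only data used are the two records visible in the slide's table.

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = b0 + b1*x.

    Implements beta = (Z^T Z)^-1 (Z^T y) for rows z_k = (1, x_k),
    using the closed-form inverse of the 2x2 matrix Z^T Z.
    """
    n = len(xs)
    sx = sum(xs)                                # sum of x_k
    sxx = sum(x * x for x in xs)                # sum of x_k^2
    sy = sum(ys)                                # sum of y_k
    sxy = sum(x * y for x, y in zip(xs, ys))    # sum of x_k * y_k
    det = n * sxx - sx * sx                     # determinant of Z^T Z
    if det == 0:
        raise ValueError("need at least two distinct x values")
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1


# The two records shown on the slide: (x, y) = (3, 7) and (1, 3).
b0, b1 = fit_linear([3.0, 1.0], [7.0, 3.0])
print(b0, b1)
```

With only two points the line passes through both of them exactly, so this is a check of the algebra rather than a realistic fit.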

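5 Quadratic Regression

6 Quadratic Regression
The data table lists inputs X = 3, 1, ... and outputs Y = 7, 3, ...:
$X = \begin{pmatrix} 3 \\ 1 \\ \vdots \end{pmatrix}, \quad y = \begin{pmatrix} 7 \\ 3 \\ \vdots \end{pmatrix}, \quad x_1 = 3, \; y_1 = 7, \ldots$
$Z = \begin{pmatrix} 1 & 3 & 9 \\ 1 & 1 & 1 \\ \vdots & \vdots & \vdots \end{pmatrix}, \quad z = (1, x, x^2)$
$\beta = (Z^T Z)^{-1} (Z^T y)$
$y^{est} = \beta_0 + \beta_1 x + \beta_2 x^2$
Much more about this in the future Andrew Lecture: Favorite Regression Algorithms.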
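The quadratic fit uses the same normal equations with three basis terms per record. The following is a minimal sketch under stated assumptions: `solve` and `fit_quadratic` are hypothetical helpers written for this illustration, the solver is plain Gaussian elimination rather than anything the slides prescribe, and the sample data are made up.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                      # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_quadratic(xs, ys):
    """Return [b0, b1, b2] minimising squared error of b0 + b1*x + b2*x^2."""
    z = [[1.0, x, x * x] for x in xs]                 # rows z_k = (1, x_k, x_k^2)
    ztz = [[sum(r[i] * r[j] for r in z) for j in range(3)] for i in range(3)]
    zty = [sum(r[i] * y for r, y in zip(z, ys)) for i in range(3)]
    return solve(ztz, zty)


# Made-up, noise-free data generated from y = 1 + 2x + 3x^2.
beta = fit_quadratic([0.0, 1.0, 2.0, 3.0], [1.0, 6.0, 17.0, 34.0])
print(beta)
```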

7 Join-the-dots
Also known as piecewise linear nonparametric regression, if that makes you feel better.
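Join-the-dots amounts to linear interpolation between neighbouring training points. The sketch below is one possible reading: `join_the_dots` is a hypothetical name, and holding the end values constant outside the training range is an assumption, since the slides do not say how to extrapolate.

```python
from bisect import bisect_right


def join_the_dots(xs, ys):
    """Return a predictor that linearly interpolates between training points."""
    pts = sorted(zip(xs, ys))
    px = [p[0] for p in pts]
    py = [p[1] for p in pts]

    def predict(x):
        # Assumption: outside the training range, reuse the nearest end value.
        if x <= px[0]:
            return py[0]
        if x >= px[-1]:
            return py[-1]
        i = bisect_right(px, x)          # px[i-1] <= x < px[i]
        t = (x - px[i - 1]) / (px[i] - px[i - 1])
        return py[i - 1] + t * (py[i] - py[i - 1])

    return predict


# Made-up training points.
predict = join_the_dots([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])
print(predict(1.5), predict(2.0))
```

Because the predictor reproduces every training point exactly, its training error is zero, which is why fit to the training data alone cannot be used to rank it against the other two methods.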

8 Which is best?
Why not choose the method with the best fit to the data?

9 What do we really want?
Why not choose the method with the best fit to the data?
How well are you going to predict future data drawn from the same distribution?

10 The test set method
1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.
(Linear regression example) Mean Squared Error =
(Quadratic regression example) Mean Squared Error =
(Join the dots example) Mean Squared Error =
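The four steps of the test set method can be sketched as follows. This is an illustration under assumptions, not the code behind the slides: `holdout_mse` and `fit_line` are hypothetical names, the data are made up and noise-free, and the fixed seed only makes the random split repeatable.

```python
import random


def fit_line(points):
    """Least-squares line through (x, y) pairs; returns a predictor function."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def holdout_mse(points, fit, test_fraction=0.3, seed=0):
    """Steps 1 to 4: random split, train on the remainder, score on the test set."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    test, train = shuffled[:n_test], shuffled[n_test:]
    model = fit(train)
    return sum((model(x) - y) ** 2 for x, y in test) / len(test)


# Made-up data lying exactly on y = 2x + 1, so the test error should be ~0.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
mse = holdout_mse(data, fit_line)
print(mse)
```

Passing the fitting routine in as `fit` lets the same split be reused to compare several methods.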

15 The test set method
Good news:
Very very simple.
Can then simply choose the method with the best test-set score.
Bad news:
Wastes data: we get an estimate of the best method to apply to 30% less data.
If we don't have much data, our test set might just be lucky or unlucky.
We say the test-set estimator of performance has high variance.

17 LOOCV (Leave-one-out Cross Validation)
For k = 1 to R:
1. Let $(x_k, y_k)$ be the kth record.
2. Temporarily remove $(x_k, y_k)$ from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on $(x_k, y_k)$.
When you've done all points, report the mean error.
$MSE_{LOOCV} = 2.12$
LOOCV for Quadratic Regression: $MSE_{LOOCV} = 0.962$
LOOCV for Join The Dots: $MSE_{LOOCV} = 3.33$
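The LOOCV loop translates almost line for line into code. In this minimal sketch `loocv_mse` is a hypothetical name, and `fit_mean` (a model that always predicts the mean training output) is only a stand-in chosen to keep the example short; any of the three regression methods could be passed in its place.

```python
def fit_mean(points):
    """Stand-in model (an assumption for brevity): predict the mean training y."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y


def loocv_mse(points, fit):
    """Leave-one-out cross-validation estimate of mean squared error."""
    squared_errors = []
    for k in range(len(points)):
        x_k, y_k = points[k]                        # 1. the kth record
        remaining = points[:k] + points[k + 1:]     # 2. temporarily remove it
        model = fit(remaining)                      # 3. train on the other R-1 points
        squared_errors.append((model(x_k) - y_k) ** 2)  # 4. note the error
    return sum(squared_errors) / len(squared_errors)    # report the mean error


# Made-up data.
points = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
print(loocv_mse(points, fit_mean))
```

The model is retrained R times, which is the expense the comparison table refers to.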

25 Which kind of Cross Validation?
Test-set. Downside: variance, an unreliable estimate of future performance. Upside: cheap.
Leave-one-out. Downside: expensive; has some weird behavior. Upside: doesn't waste data.
...can we get the best of both worlds?

26 k-fold Cross Validation
Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions colored red, green and blue).
For the red partition: train on all the points not in the red partition. Find the test-set sum of errors on the red points.
For the green partition: train on all the points not in the green partition. Find the test-set sum of errors on the green points.
For the blue partition: train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.
Then report the mean error.
Linear Regression: $MSE_{3FOLD} = 2.05$
Quadratic Regression: $MSE_{3FOLD} = 1.11$
Join-the-dots: $MSE_{3FOLD} = 2.93$
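The k-fold procedure can be sketched in the same style. The names `k_fold_mse` and `fit_mean` are hypothetical, the mean-predicting model is a stand-in chosen for brevity, and assigning shuffled points to folds in round-robin order is just one simple way to obtain a random partition.

```python
import random


def fit_mean(points):
    """Stand-in model (an assumption for brevity): predict the mean training y."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y


def k_fold_mse(points, fit, k=3, seed=0):
    """k-fold cross-validation estimate of mean squared error."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]      # random partition into k folds
    total_sq_err, count = 0.0, 0
    for i, held_out in enumerate(folds):
        # Train on every point that is not in the held-out fold.
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = fit(train)
        total_sq_err += sum((model(x) - y) ** 2 for x, y in held_out)
        count += len(held_out)
    return total_sq_err / count                     # mean error over all points


# Made-up data; with k equal to the number of points this is leave-one-out.
points = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
print(k_fold_mse(points, fit_mean, k=3))
```

Setting k to the dataset size R gives one point per fold, which is why R-fold cross-validation is identical to leave-one-out.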

33 Which kind of Cross Validation?
Test-set. Downside: variance, an unreliable estimate of future performance. Upside: cheap.
Leave-one-out. Downside: expensive; has some weird behavior. Upside: doesn't waste data.
10-fold. Downside: wastes 10% of the data; 10 times more expensive than test set. Upside: only wastes 10%; only 10 times more expensive instead of R times.
3-fold. Downside: wastier than 10-fold; expensivier than test set. Upside: slightly better than test-set.
R-fold. Identical to leave-one-out.


34 The bootstrap
CV uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set.
The bootstrap uses sampling with replacement to form the training set:
Sample a dataset of n instances n times with replacement to form a new dataset of n instances.
Use this data as the training set.
Use the instances from the original dataset that don't occur in the new training set for testing.

35 The bootstrap
Also called the 0.632 bootstrap.
On each draw, a particular instance has a probability of $1 - 1/n$ of not being picked.
Thus its probability of ending up in the test data, that is, of never being picked in n draws, is:
$\left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368$
This means the training data will contain approximately 63.2% of the instances.
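Both the limit and the sampling scheme are easy to check numerically. In this sketch `bootstrap_split` is a hypothetical helper, and the dataset size and seed are arbitrary choices made for the illustration.

```python
import math
import random


def bootstrap_split(data, seed=0):
    """Draw n items with replacement for training; unpicked items form the test set."""
    rng = random.Random(seed)
    n = len(data)
    picked = [rng.randrange(n) for _ in range(n)]   # n draws with replacement
    chosen = set(picked)
    train = [data[i] for i in picked]               # may contain repeats
    test = [data[i] for i in range(n) if i not in chosen]
    return train, test


n = 1000
p_never_picked = (1 - 1 / n) ** n                   # close to exp(-1) for large n
train, test = bootstrap_split(list(range(n)))
print(p_never_picked, math.exp(-1), len(test) / n)
```

The training set always has n entries, but only about 63.2% of the original instances appear in it; the rest are repeats.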

36 Estimating error with the bootstrap
The error estimate on the test data will be very pessimistic, because the model is trained on just ~63% of the instances.
Therefore, combine it with the resubstitution error:
$err = 0.632 \cdot e_{\text{test instances}} + 0.368 \cdot e_{\text{training instances}}$
The resubstitution error gets less weight than the error on the test data.
Repeat the process several times with different replacement samples; average the results.
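The weighted combination and the final averaging step can be written as below. The function name `bootstrap_632_error` is hypothetical, and the error rates in the example are invented purely to show the arithmetic.

```python
def bootstrap_632_error(replicates):
    """Average the 0.632 bootstrap estimate over several bootstrap samples.

    replicates: list of (test_error, training_error) pairs, one per sample,
    where training_error is the resubstitution error.
    """
    estimates = [0.632 * e_test + 0.368 * e_train
                 for e_test, e_train in replicates]
    return sum(estimates) / len(estimates)


# Invented error rates from two bootstrap samples.
estimate = bootstrap_632_error([(0.30, 0.10), (0.20, 0.00)])
print(estimate)
```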

37 CV-based Model Selection
We're trying to decide which algorithm to use.
We train each machine and make a table:
i    f_i    TRAINERR    10-FOLD-CV-ERR    Choice
1    f_1
2    f_2
3    f_3
4    f_4
5    f_5
6    f_6

38 CV-based Model Selection
Example: choosing k for a k-nearest-neighbor regression.
Step 1: Compute LOOCV error for six different model classes:
Algorithm    TRAINERR    10-fold-CV-ERR    Choice
K=1
K=2
K=3
K=4
K=5
K=6
Step 2: Whichever model class gave the best CV score: train it with all the data, and that's the predictive model you'll use.
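The two steps can be sketched for a one-dimensional k-nearest-neighbor regressor. The helper names `knn_predict`, `loocv_error` and `choose_k` are hypothetical, the data are made up, and distance ties are broken by list order, which is an arbitrary choice.

```python
def knn_predict(train, x, k):
    """Average the outputs of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def loocv_error(points, k):
    """Leave-one-out mean squared error of k-NN regression."""
    total = 0.0
    for i, (x, y) in enumerate(points):
        rest = points[:i] + points[i + 1:]
        total += (knn_predict(rest, x, k) - y) ** 2
    return total / len(points)


def choose_k(points, candidates):
    """Step 1: score each candidate k by LOOCV and pick the lowest error."""
    scores = {k: loocv_error(points, k) for k in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Made-up data on the line y = x.
points = [(float(x), float(x)) for x in range(10)]
best_k, scores = choose_k(points, range(1, 7))

# Step 2: the final model uses the chosen k with all the data.
def final_model(x):
    return knn_predict(points, x, best_k)

print(best_k, scores[best_k])
```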

39 Note on parameter tuning
It is important that the test data is not used in any way to create the classifier.
Some learning schemes operate in two stages:
Stage 1: build the basic structure.
Stage 2: optimize parameter settings.
The test data can't be used for parameter tuning!
Proper procedure uses three sets: training data, validation data, and test data.
Validation data is used to optimize parameters.
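A three-way split might look like the sketch below. The name `three_way_split` and the 60/20/20 proportions are assumptions made for illustration; the slide does not specify how large each set should be.

```python
import random


def three_way_split(data, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then carve out disjoint test, validation and training sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = round(test_fraction * len(shuffled))
    n_val = round(val_fraction * len(shuffled))
    test = shuffled[:n_test]                        # touched only for the final estimate
    validation = shuffled[n_test:n_test + n_val]    # used to tune parameters
    train = shuffled[n_test + n_val:]               # used to build the model
    return train, validation, test


train, validation, test = three_way_split(list(range(100)))
print(len(train), len(validation), len(test))
```

Parameters are tuned by comparing validation scores, and the test set is consulted once, after all tuning is finished.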

40 CV-based Algorithm Choice
Example: choosing which regression algorithm to use.
Step 1: Compute 10-fold-CV error for six different model classes:
Algorithm    TRAINERR    10-fold-CV-ERR    Choice
1-NN
10-NN
Linear Reg'n
Quad Reg'n
LWR, KW=0.1
LWR, KW=0.5
Step 2: Whichever algorithm gave the best CV score: train it with all the data, and that's the predictive model you'll use.

41 Cross-validation for classification
Instead of computing the sum squared errors on a test set, you should compute the total number of misclassifications on a test set.

43 Counting the cost
In practice, different types of classification errors often incur different costs.
Examples:
Terrorist profiling ("Not a terrorist" is correct 99.99% of the time)
Loan decisions
Oil-slick detection
Fault diagnosis
Promotional mailing

44 Counting the cost
The confusion matrix:
                 Predicted: Yes      Predicted: No
Actual: Yes      True positive       False negative
Actual: No       False positive      True negative
There are many other types of cost! E.g.: cost of collecting training data.
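The four cells of the confusion matrix can be counted directly from paired labels, and a cost-weighted total follows from them. In this sketch `confusion_counts` and `total_cost` are hypothetical names, and the labels and the per-error costs are invented for the example.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count true/false positives and negatives from paired class labels."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            counts["TP"] += 1
        elif a == positive:
            counts["FN"] += 1       # actual yes, predicted no
        elif p == positive:
            counts["FP"] += 1       # actual no, predicted yes
        else:
            counts["TN"] += 1
    return counts


def total_cost(counts, cost_fp, cost_fn):
    """Weight the two kinds of error by their (assumed) costs."""
    return counts["FP"] * cost_fp + counts["FN"] * cost_fn


# Invented labels and costs.
actual = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "no", "yes"]
counts = confusion_counts(actual, predicted)
print(counts, total_cost(counts, cost_fp=1, cost_fn=10))
```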

45 Lift charts
In practice, costs are rarely known.
Decisions are usually made by comparing possible scenarios.
Example: promotional mailout to 1,000,000 households.
Mail to all; 0.1% respond (1000).
Data mining tool identifies subset of 100,000 most promising; 0.4% of these respond (400). 40% of responses for 10% of cost may pay off.
Identify subset of 400,000 most promising; 0.2% respond (800).
A lift chart allows a visual comparison.

46 Generating a lift chart
Sort instances according to predicted probability of being positive. The table on the slide has the columns "Predicted probability" and "Actual class"; the actual classes shown are Yes, Yes, No, Yes.
x axis is sample size.
y axis is number of true positives.
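The points of a lift chart follow from a single pass over the ranked instances. In this sketch `lift_points` is a hypothetical name, and the predicted probabilities are invented; only the Yes, Yes, No, Yes pattern of actual classes is taken from the slide.

```python
def lift_points(scored):
    """Return (sample size, number of true positives) for each prefix of the ranking.

    scored: list of (predicted_probability, is_positive) pairs.
    """
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    points, true_positives = [], 0
    for size, (_, is_positive) in enumerate(ranked, start=1):
        true_positives += int(is_positive)
        points.append((size, true_positives))       # (x axis, y axis)
    return points


# Invented probabilities; actual classes are Yes, Yes, No, Yes.
scored = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
print(lift_points(scored))
```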

47 A hypothetical lift chart
40% of responses for 10% of cost.
80% of responses for 40% of cost.

48 ROC curves
ROC curves are similar to lift charts.
ROC stands for "receiver operating characteristic".
Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel.
Differences to lift chart:
y axis shows percentage of true positives in sample rather than absolute number.
x axis shows percentage of false positives in sample rather than sample size.
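ROC points can be generated from the same ranking as a lift chart by switching both axes to percentages. This sketch assumes the usual normalisation, in which true positives are divided by the total number of positives and false positives by the total number of negatives; `roc_points` is a hypothetical name, the scores are invented, and tied scores are not treated specially.

```python
def roc_points(scored):
    """Return (false positive %, true positive %) after each ranked instance.

    scored: list of (predicted_probability, is_positive) pairs containing
    at least one positive and one negative instance.
    """
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    n_pos = sum(1 for _, is_positive in ranked if is_positive)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((100.0 * fp / n_neg, 100.0 * tp / n_pos))
    return points


# Invented scores.
scored = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
points = roc_points(scored)
print(points)
```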

49 A sample ROC curve
Jagged curve: one set of test data.
Smooth curve: use cross-validation.

50 ROC curves for two schemes
For a small, focused sample, use method A.
For a larger one, use method B.
In between, choose between A and B with appropriate probabilities.

51 Precision and Recall
Typically used in document retrieval.
Precision: how many of the returned documents are correct; precision(threshold).
Recall: how many of the positives does the model return; recall(threshold).
Precision/Recall Curve: sweep thresholds.
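Precision and recall at a threshold, and the curve obtained by sweeping thresholds, can be sketched as follows. The function names and scores are hypothetical, and returning a precision of 1.0 when nothing is returned is a convention assumed here, not something the slide states.

```python
def precision_recall(scored, threshold):
    """Precision and recall when every item scoring >= threshold is returned.

    scored: list of (score, is_positive) pairs.
    """
    returned = [is_positive for score, is_positive in scored if score >= threshold]
    n_positive = sum(1 for _, is_positive in scored if is_positive)
    true_positives = sum(1 for is_positive in returned if is_positive)
    # Assumed convention: an empty result set has precision 1.0.
    precision = true_positives / len(returned) if returned else 1.0
    recall = true_positives / n_positive if n_positive else 0.0
    return precision, recall


def pr_curve(scored):
    """Sweep thresholds over the observed scores, highest first."""
    thresholds = sorted({score for score, _ in scored}, reverse=True)
    return [(t, *precision_recall(scored, t)) for t in thresholds]


# Invented scores.
scored = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
print(precision_recall(scored, 0.7))
print(pr_curve(scored))
```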

