Bias-Variance Decomposition Error Estimators Cross-Validation


Bias-Variance Tradeoff: Intuition. A model that is too simple does not fit the data well: a biased solution. A model that is too complex changes a lot with small changes to the data: a high-variance solution.

Expected Loss. Let h(x) = E[t | x] be the optimal predictor and y(x) our actual predictor, which will incur the following expected loss:

E[L] = \iint (y(x) - t)^2 \, p(x,t) \, dx \, dt

Writing y(x) - t = (y(x) - h(x)) + (h(x) - t) and expanding the square,

E[L] = \int (y(x) - h(x))^2 \, p(x) \, dx + \iint (h(x) - t)^2 \, p(x,t) \, dx \, dt

since the cross term vanishes: E[(y(x) - h(x))(h(x) - t)] = 0, because h(x) = E[t | x] = \int t \, p(t | x) \, dt. The second integral is the noise term, source of error 1.

Sources of error 1: noise. What if we have a perfect learner and infinite data? Then our learned solution satisfies y(x) = h(x). We are still left with a remaining, unavoidable error of \sigma^2 due to the noise \varepsilon, since t = f(x) + \varepsilon with \varepsilon \sim N(0, \sigma^2):

\iint (h(x) - t)^2 \, p(x,t) \, dx \, dt = E[\varepsilon^2] = \sigma^2

Bias-variance decomposition. Focus on the first term, \int (y(x) - h(x))^2 \, p(x) \, dx, and examine the expected loss averaged over data sets. For one data set D and one test point x, write

y(x; D) - h(x) = (y(x; D) - E_D[y(x; D)]) + (E_D[y(x; D)] - h(x))

Squaring and taking the expectation E_D over data sets, the cross term cancels to zero, hence

E_D[(y(x; D) - h(x))^2] = (E_D[y(x; D)] - h(x))^2 + E_D[(y(x; D) - E_D[y(x; D)])^2] = bias^2 + variance

Bias. Suppose you are given a dataset D with m samples from some distribution. You learn a function y(x; D) from the data D. If you sample a different dataset, you will learn a different y(x; D). Expected hypothesis: E_D[y(x; D)]. Bias: the difference between what you expect to learn (E_D[y(x; D)]) and the truth (the true model h(x)). It measures how well you can expect to represent the true solution, and it decreases with a more complex model:

bias^2 = \int (E_D[y(x; D)] - h(x))^2 \, p(x) \, dx

Variance. Variance: the difference between what you expect to learn (E_D[y(x; D)], averaged over datasets) and what you learn from a particular dataset D (y(x; D)). It measures how sensitive the learner is to the specific dataset, and it decreases with a simpler model:

variance = \int E_D[(y(x; D) - E_D[y(x; D)])^2] \, p(x) \, dx

Bias-Variance Tradeoff. The choice of hypothesis class introduces a learning bias: a more complex class means less bias, but a more complex class also means more variance.

The Expected Prediction Error. The expected squared prediction error over fixed-size training sets D drawn from P(X, T) can be expressed as the sum of three components:

expected error = unavoidable error + bias^2 + variance

where

unavoidable error = \sigma^2
bias^2 = \int (E_D[y(x; D)] - h(x))^2 \, p(x) \, dx
variance = \int E_D[(y(x; D) - E_D[y(x; D)])^2] \, p(x) \, dx
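
To make these three components concrete, here is a minimal Monte Carlo sketch (an illustration added here, not part of the original slides). It assumes a toy setup: f(x) = sin(2 pi x), Gaussian noise with sigma = 0.3, and a degree-3 polynomial learner fit to 25 points; bias^2 and variance are estimated by averaging fits over many independently drawn training sets.

# Monte Carlo check of the decomposition: expected error ~ noise + bias^2 + variance.
# Assumed toy setup: f(x) = sin(2*pi*x), t = f(x) + eps, eps ~ N(0, 0.3^2),
# learner = degree-3 polynomial fit on m = 25 points per training set.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
sigma, m, degree, n_datasets = 0.3, 25, 3, 500

x_test = np.linspace(0, 1, 200)               # fixed test inputs
preds = np.empty((n_datasets, x_test.size))   # y(x; D) for each training set D

for d in range(n_datasets):
    x = rng.uniform(0, 1, m)
    t = f(x) + rng.normal(0, sigma, m)
    w = np.polyfit(x, t, degree)              # fit one dataset D
    preds[d] = np.polyval(w, x_test)

mean_pred = preds.mean(axis=0)                # E_D[y(x; D)]
bias2 = np.mean((mean_pred - f(x_test)) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"bias^2 = {bias2:.4f}, variance = {variance:.4f}, noise = {sigma**2:.4f}")

Summing the three printed numbers approximates the expected squared error on a fresh test point; increasing the polynomial degree shifts weight from the bias term to the variance term.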

(Figure: higher variance, lower bias.)

Training Error. Given training data, choose a loss function, e.g., squared error (L2) for regression. Training set error: for a particular set of parameters w, the loss function evaluated on the training data:

Err_train(w) = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} \Big( t_i - \sum_{j} w_j x_i^j \Big)^2
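
As a small illustration (made-up data, assuming the model is a degree-M polynomial fit by least squares), the training error above can be computed directly:

# Training-set error of a degree-M polynomial fit; a minimal sketch on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(0, 1, 20)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, 20)

M = 3
w = np.polyfit(x_train, t_train, M)           # least squares fit minimizes the training error
err_train = np.mean((t_train - np.polyval(w, x_train)) ** 2)
print(f"Err_train = {err_train:.4f}")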

(Figure: training error as a function of model complexity, from low to high.)

Prediction (Generalization) Error. Training set error can be a poor measure of the quality of the solution. Prediction error: we really care about the error over all possible input points, not just the training data:

Err_true(w) = E_X\Big[ \big( t(x) - \sum_{j} w_j x^j \big)^2 \Big] = \int \big( t(x) - \sum_{j} w_j x^j \big)^2 \, p(x) \, dx

(Figure: true error and training error as functions of model complexity, with the complexity we want marked.)

Why doesn't training set error approximate prediction error? Sampling approximation of the prediction error:

Err_true(w) \approx \frac{1}{N} \sum_{i=1}^{N} \Big( t_i - \sum_{j} w_j x_i^j \Big)^2

Training set error:

Err_train(w) = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} \Big( t_i - \sum_{j} w_j x_i^j \Big)^2

Very similar equations!!! So why is the training set a bad measure of prediction error???

Why is the training set a bad measure of prediction error??? Training error is a good estimate of the error for a single, fixed w. But we optimized w with respect to the training error and found a w that is good for this particular set of samples. Training error is therefore an optimistically biased estimate of the prediction error.

Test Error. Given a dataset, randomly split it into two parts: training data {x_1, ..., x_{N_train}} and test data {x_1, ..., x_{N_test}}. Use the training data to optimize the parameters w. Test set error: for the final solution w*, evaluate the error using

Err_test(w*) = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \Big( t_i - \sum_{j} w_j^* x_i^j \Big)^2

This estimate is unbiased, but it has variance. The test set is only unbiased if you never do any learning on the test data. For example, if you use the test set to select the degree of the polynomial, the estimate is no longer unbiased!!!
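
A minimal sketch of this procedure with numpy (synthetic data; the 30% test split matches the one used in the slides that follow):

# Holdout estimate: fit w on the training split only, then evaluate w* once on the test split.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 100)

idx = rng.permutation(x.size)                 # random split
n_test = int(0.3 * x.size)
test, train = idx[:n_test], idx[n_test:]

M = 3
w_star = np.polyfit(x[train], t[train], M)    # learned from the training data only
err_test = np.mean((t[test] - np.polyval(w_star, x[test])) ** 2)
print(f"Err_test = {err_test:.4f}")
# Caveat from the slide: if the test split is also used to pick M, the estimate is no longer unbiased.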

(Figure: test error, true error, and training error as functions of model complexity, with the complexity we want marked.)

How many points to use for training/testing? A very hard question to answer! With too few training points, the learned w is bad. With too few test points, you never know whether you have reached a good solution. Theory proposes error bounds (advanced course). Typically: if you have a reasonable amount of data, pick a test set large enough for a reasonable estimate of the error and use the rest for learning; if you have little data, you need to use special techniques, e.g., bootstrapping.

(Figure: test error and training error as functions of the number of training examples, from little data to infinite data, for a fixed model complexity.)

Overfitting. With too few training examples, our model may achieve zero training error but nevertheless have a large generalization error. When the training error no longer bears any relation to the generalization error, the model overfits the data.
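
A quick numerical illustration of this effect (toy data; the setup is an assumption for illustration, not from the slides): with as many polynomial coefficients as training points, the training error drops to essentially zero while the error on a large fresh sample, standing in for the generalization error, blows up.

# Overfitting demo: interpolating polynomial vs. a modest one, on the same 10 noisy points.
import numpy as np

rng = np.random.default_rng(3)
m = 10
x_train = np.sort(rng.uniform(0, 1, m))
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, m)
x_big = rng.uniform(0, 1, 10000)              # large fresh sample stands in for the true error
t_big = np.sin(2 * np.pi * x_big) + rng.normal(0, 0.3, 10000)

for M in (3, m - 1):                          # modest degree vs. interpolating degree
    w = np.polyfit(x_train, t_train, M)
    err_train = np.mean((t_train - np.polyval(w, x_train)) ** 2)
    err_gen = np.mean((t_big - np.polyval(w, x_big)) ** 2)
    print(f"degree {M}: Err_train = {err_train:.4f}, Err_true approx {err_gen:.2f}")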

Cross-validation for detecting and preventing overfitting. Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received. Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University. www.cs.cmu.edu/~awm, awm@cs.cmu.edu, 412-268-7599. Copyright Andrew W. Moore.

A Regression Problem. y = f(x) + noise. Can we learn f from this data? Let's consider three methods.

Linear Regression.

Quadratic Regression.

Join-the-dots. Also known as piecewise linear nonparametric regression, if that makes you feel better.
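
For concreteness, one plausible way to implement such a join-the-dots regressor is piecewise-linear interpolation of the sorted training points. The sketch below uses numpy's interp and illustrates the idea; it is not necessarily the implementation behind the slides' figures.

# "Join-the-dots": piecewise-linear interpolation of the training points.
import numpy as np

def fit_join_the_dots(x_train, t_train):
    order = np.argsort(x_train)
    xs, ts = x_train[order], t_train[order]
    # np.interp connects consecutive training points with straight lines
    # and extrapolates flat outside the observed range.
    return lambda x_new: np.interp(x_new, xs, ts)

x = np.array([0.0, 0.2, 0.5, 0.9])
t = np.array([0.1, 0.8, -0.3, 0.4])
predict = fit_join_the_dots(x, t)
print(predict(np.array([0.1, 0.35, 0.7])))    # predictions at three query points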

Which is best? Why not choose the method with the best fit to the data?

What do we really want? Why not choose the method with the best fit to the data? What really matters is how well you are going to predict future data drawn from the same distribution.

The test set method. 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set.

The test set method. 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set. 3. Perform your regression on the training set. (Linear regression example.)

The test set method. 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set. 3. Perform your regression on the training set. 4. Estimate your future performance with the test set. (Linear regression example: Mean Squared Error = 2.4.)

The test set method. 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set. 3. Perform your regression on the training set. 4. Estimate your future performance with the test set. (Quadratic regression example: Mean Squared Error = 0.9.)

The test set method. 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set. 3. Perform your regression on the training set. 4. Estimate your future performance with the test set. (Join-the-dots example: Mean Squared Error = 2.2.)

The test set method. Good news: very, very simple; you can then simply choose the method with the best test-set score. Bad news: what's the downside?

The test set method. Good news: very, very simple; you can then simply choose the method with the best test-set score. Bad news: it wastes data (we get an estimate of the best method to apply to 30% less data), and if we don't have much data, our test set might just be lucky or unlucky. We say the test-set estimator of performance has high variance.
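
A small simulation of that "lucky or unlucky" effect (toy data and sizes are assumptions for illustration): redraw the random 30% holdout many times on a small dataset and look at the spread of the resulting error estimates.

# How noisy is a single holdout estimate? Repeat the random 30% split many times and
# measure the spread of the test-set MSE; this is the "high variance" the slide refers to.
import numpy as np

rng = np.random.default_rng(4)
n = 40                                        # "not much data"
x = rng.uniform(0, 1, n)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

estimates = []
for _ in range(200):                          # 200 different random holdouts of the same data
    idx = rng.permutation(n)
    test, train = idx[:int(0.3 * n)], idx[int(0.3 * n):]
    w = np.polyfit(x[train], t[train], 3)
    estimates.append(np.mean((t[test] - np.polyval(w, x[test])) ** 2))

estimates = np.array(estimates)
print(f"holdout MSE estimate: mean = {estimates.mean():.3f}, std = {estimates.std():.3f}")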

LOOCV (Leave-one-out Cross Validation). For k = 1 to R: 1. Let (x_k, y_k) be the k-th record.

LOOCV (Leave-one-out Cross Validation). For k = 1 to R: 1. Let (x_k, y_k) be the k-th record. 2. Temporarily remove (x_k, y_k) from the dataset.

LOOCV (Leave-one-out Cross Validation). For k = 1 to R: 1. Let (x_k, y_k) be the k-th record. 2. Temporarily remove (x_k, y_k) from the dataset. 3. Train on the remaining R-1 datapoints.

LOOCV (Leave-one-out Cross Validation). For k = 1 to R: 1. Let (x_k, y_k) be the k-th record. 2. Temporarily remove (x_k, y_k) from the dataset. 3. Train on the remaining R-1 datapoints. 4. Note your error on (x_k, y_k).

LOOCV (Leave-one-out Cross Validation). For k = 1 to R: 1. Let (x_k, y_k) be the k-th record. 2. Temporarily remove (x_k, y_k) from the dataset. 3. Train on the remaining R-1 datapoints. 4. Note your error on (x_k, y_k). When you've done all points, report the mean error.

LOOCV (Leave-one-out Cross Validation). For k = 1 to R: 1. Let (x_k, y_k) be the k-th record. 2. Temporarily remove (x_k, y_k) from the dataset. 3. Train on the remaining R-1 datapoints. 4. Note your error on (x_k, y_k). When you've done all points, report the mean error. MSE_LOOCV = 2.12.

LOOCV for Quadratic Regression. For k = 1 to R: 1. Let (x_k, y_k) be the k-th record. 2. Temporarily remove (x_k, y_k) from the dataset. 3. Train on the remaining R-1 datapoints. 4. Note your error on (x_k, y_k). When you've done all points, report the mean error. MSE_LOOCV = 0.96.

LOOCV for Join The Dots. For k = 1 to R: 1. Let (x_k, y_k) be the k-th record. 2. Temporarily remove (x_k, y_k) from the dataset. 3. Train on the remaining R-1 datapoints. 4. Note your error on (x_k, y_k). When you've done all points, report the mean error. MSE_LOOCV = 3.33.
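
The LOOCV loop above translates almost line for line into code. This is a minimal sketch assuming a degree-M polynomial learner and synthetic data; it illustrates the procedure rather than reproducing the numbers reported on the slides.

# Leave-one-out cross-validation for a degree-M polynomial, following the slide's loop.
import numpy as np

def loocv_mse(x, t, degree):
    errors = []
    for k in range(x.size):
        mask = np.ones(x.size, dtype=bool)
        mask[k] = False                                   # temporarily remove (x_k, t_k)
        w = np.polyfit(x[mask], t[mask], degree)          # train on the remaining R-1 points
        errors.append((t[k] - np.polyval(w, x[k])) ** 2)  # note the error on the held-out point
    return np.mean(errors)                                # report the mean error

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)
for M in (1, 2, 9):                                       # compare a few model complexities
    print(f"degree {M}: LOOCV MSE = {loocv_mse(x, t, M):.3f}")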

Which kind of Cross Validation?

Test-set: Downside: variance, an unreliable estimate of future performance. Upside: cheap.
Leave-one-out: Downside: expensive, and has some weird behavior. Upside: doesn't waste data.

..can we get the best of both worlds?

k-fold Cross Validation. Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue).

k-fold Cross Validation. Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). For the red partition: train on all the points not in the red partition, and find the test-set sum of errors on the red points.

k-fold Cross Validation. Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). For the red partition: train on all the points not in the red partition, and find the test-set sum of errors on the red points. For the green partition: train on all the points not in the green partition, and find the test-set sum of errors on the green points.

k-fold Cross Validation. Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). For the red partition: train on all the points not in the red partition, and find the test-set sum of errors on the red points. For the green partition: train on all the points not in the green partition, and find the test-set sum of errors on the green points. For the blue partition: train on all the points not in the blue partition, and find the test-set sum of errors on the blue points.

k-fold Cross Validation (Linear Regression): MSE_3FOLD = 2.05. Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). For the red partition: train on all the points not in the red partition, and find the test-set sum of errors on the red points. For the green partition: train on all the points not in the green partition, and find the test-set sum of errors on the green points. For the blue partition: train on all the points not in the blue partition, and find the test-set sum of errors on the blue points. Then report the mean error.

k-fold Cross Validation (Quadratic Regression): MSE_3FOLD = 1.11. Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). For the red partition: train on all the points not in the red partition, and find the test-set sum of errors on the red points. For the green partition: train on all the points not in the green partition, and find the test-set sum of errors on the green points. For the blue partition: train on all the points not in the blue partition, and find the test-set sum of errors on the blue points. Then report the mean error.

k-fold Cross Validation (Join-the-dots): MSE_3FOLD = 2.93. Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). For the red partition: train on all the points not in the red partition, and find the test-set sum of errors on the red points. For the green partition: train on all the points not in the green partition, and find the test-set sum of errors on the green points. For the blue partition: train on all the points not in the blue partition, and find the test-set sum of errors on the blue points. Then report the mean error.
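
And the same idea in code: a minimal 3-fold cross-validation sketch with numpy, again assuming a polynomial learner and synthetic data rather than the dataset behind the slides' figures.

# k-fold cross-validation (k = 3 as in the slides): shuffle, split into k partitions,
# train on the other k-1 partitions, score the held-out partition, and average.
import numpy as np

def kfold_mse(x, t, degree, k=3, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(x.size), k)     # the k "colored" partitions
    sq_errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(x.size), held_out)
        w = np.polyfit(x[train], t[train], degree)          # train on everything not in this partition
        sq_errors.extend((t[held_out] - np.polyval(w, x[held_out])) ** 2)
    return np.mean(sq_errors)                                # then report the mean error

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)
for M in (1, 2):                                             # e.g., linear vs. quadratic
    print(f"degree {M}: 3-fold MSE = {kfold_mse(x, t, M):.3f}")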

Which kind of Cross Validation?

Test-set: Downside: variance, an unreliable estimate of future performance. Upside: cheap.
Leave-one-out: Downside: expensive, and has some weird behavior. Upside: doesn't waste data.
10-fold: Downside: wastes 10% of the data, and is 10 times more expensive than the test-set method. Upside: only wastes 10%, and is only 10 times more expensive instead of R times.
3-fold: Downside: wastier than 10-fold, and expensivier than the test-set method. Upside: slightly better than test-set.
R-fold: Identical to leave-one-out.