Bias-Variance Decomposition Error Estimators
1 Bias-Variance Decomposition Error Estimators Cross-Validation
2 Bias-Variance Tradeoff: Intuition
Model too simple: it does not fit the data well (a biased solution). Model too complex: small changes to the data change the solution a lot (a high-variance solution).
3 Expected Loss
Let $h(x) = E[t \mid x]$ be the optimal predictor and $y(x)$ our actual predictor, which will incur the following expected loss:
$$E[L] = \iint \{y(x) - t\}^2 \, p(x, t) \, dx \, dt.$$
Writing $y(x) - t = \{y(x) - h(x)\} + \{h(x) - t\}$ and expanding the square, the cross term
$$2 \iint \{y(x) - h(x)\}\{h(x) - t\} \, p(x, t) \, dx \, dt = 0$$
vanishes, since $h(x) = E[t \mid x] = \int t \, p(t \mid x) \, dt$. Hence
$$E[L] = \underbrace{\int \{y(x) - h(x)\}^2 \, p(x) \, dx}_{\text{source of error 1}} \; + \; \underbrace{\iint \{h(x) - t\}^2 \, p(x, t) \, dx \, dt}_{\text{noise term}}.$$
4 Sources of Error 1: Noise
What if we have a perfect learner and infinite data? Then our learning solution satisfies $y(x) = h(x)$, yet we still have a remaining, unavoidable error of $\sigma^2$ due to noise. The targets are generated as
$$t = f(x) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2), \qquad E[\varepsilon^2] = \sigma^2,$$
so $h(x) = E[t \mid x] = f(x)$, and the noise term satisfies $\iint \{h(x) - t\}^2 \, p(x, t) \, dx \, dt = \sigma^2$.
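The claim that even a perfect learner incurs error $\sigma^2$ can be checked numerically. The sketch below is not from the slides: it assumes a synthetic target $t = f(x) + \varepsilon$ with $f(x) = \sin(2\pi x)$ and $\sigma = 0.5$, both arbitrary choices, and verifies that the squared error of the predictor $y(x) = h(x) = f(x)$ is close to $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                                 # noise standard deviation (arbitrary)
f = lambda x: np.sin(2 * np.pi * x)         # true function (arbitrary choice)

x = rng.uniform(0.0, 1.0, 100_000)
t = f(x) + rng.normal(0.0, sigma, x.shape)  # t = f(x) + eps, eps ~ N(0, sigma^2)

# A "perfect learner" predicts y(x) = h(x) = f(x); its squared error is pure noise.
residual = np.mean((f(x) - t) ** 2)
print(residual)                             # close to sigma^2 = 0.25
```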
5 Bias-Variance Decomposition
Focus on the first term, $\int \{y(x) - h(x)\}^2 \, p(x) \, dx$. Let us first examine the expected loss averaged over data sets. For one data set $D$ and one test point $x$, add and subtract $E_D[y(x; D)]$:
$$\{y(x; D) - h(x)\}^2 = \{y(x; D) - E_D[y(x; D)] + E_D[y(x; D)] - h(x)\}^2.$$
Taking the expectation over data sets, the cross term vanishes, hence
$$E_D[\{y(x; D) - h(x)\}^2] = \underbrace{\{E_D[y(x; D)] - h(x)\}^2}_{\text{bias}^2} \; + \; \underbrace{E_D[\{y(x; D) - E_D[y(x; D)]\}^2]}_{\text{variance}}.$$
6 Bias
Suppose you are given a dataset $D$ with $m$ samples from some distribution. You learn a function $y(x; D)$ from the data $D$. If you sample different datasets, you will learn different functions. Expected hypothesis: $E_D[y(x; D)]$.
Bias: the difference between what you expect to learn and the truth. It measures how well you can expect to represent the true solution, and it decreases with a more complex model:
$$\text{bias}^2 = \int \{ \underbrace{E_D[y(x; D)]}_{\text{expected to learn}} - \underbrace{h(x)}_{\text{true model}} \}^2 \, p(x) \, dx.$$
7 Variance
Variance: the difference between what you expect to learn and what you learn from a particular dataset. It measures how sensitive the learner is to the specific dataset, and it decreases with a simpler model:
$$\text{variance} = \int E_D[\{ \underbrace{y(x; D)}_{\text{learned from } D} - \underbrace{E_D[y(x; D)]}_{\text{expected to learn, averaged over datasets}} \}^2] \, p(x) \, dx.$$
8 Bias-Variance Tradeoff
The choice of hypothesis class introduces a learning bias: a more complex class means less bias, but a more complex class also means more variance.
9 The Expected Prediction Error
The expected squared prediction error over fixed-size training sets $D$ drawn from $p(X, T)$ can be expressed as the sum of three components:
$$\text{expected error} = \text{unavoidable error} + \text{bias}^2 + \text{variance},$$
where
$$\text{unavoidable error} = \sigma^2,$$
$$\text{bias}^2 = \int \{E_D[y(x; D)] - h(x)\}^2 \, p(x) \, dx,$$
$$\text{variance} = \int E_D[\{y(x; D) - E_D[y(x; D)]\}^2] \, p(x) \, dx.$$
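This decomposition can be estimated empirically by training on many independently drawn datasets. The sketch below is not from the slides; it uses a synthetic sin target, a noise level, and polynomial degrees chosen purely for illustration. It averages predictions over 200 datasets to estimate bias² and variance for a simple (degree-1) and a complex (degree-9) model.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n_sets, m = 0.3, 200, 25             # noise, number of datasets, points per dataset
f = lambda x: np.sin(2 * np.pi * x)         # true function (illustrative assumption)
x_test = np.linspace(0.05, 0.95, 50)

def bias_variance(degree):
    # Collect predictions y(x; D) at x_test over many independent datasets D.
    preds = np.empty((n_sets, x_test.size))
    for s in range(n_sets):
        x = rng.uniform(0.0, 1.0, m)
        t = f(x) + rng.normal(0.0, sigma, m)
        preds[s] = np.polyval(np.polyfit(x, t, degree), x_test)
    avg = preds.mean(axis=0)                      # E_D[y(x; D)]
    bias2 = np.mean((avg - f(x_test)) ** 2)       # bias^2, averaged over x
    variance = np.mean(preds.var(axis=0))         # variance, averaged over x
    return bias2, variance

b1, v1 = bias_variance(1)   # simple model: high bias, low variance
b9, v9 = bias_variance(9)   # complex model: low bias, high variance
print(b1, v1)
print(b9, v9)
```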
10 (Figure: as model complexity increases, variance is higher and bias is lower.)
11 Bias-Variance Decomposition Graphical Representation
12 Training Error
Given training data, choose a loss function, e.g. the squared error $L$ for regression. Training set error: for a particular set of parameters $w$, the loss function evaluated on the training data (here for a degree-$M$ polynomial):
$$\mathrm{Err}_{\text{train}}(w) = \frac{1}{N_{\text{train}}} \sum_{i=1}^{N_{\text{train}}} \Big( t_i - \sum_{j=1}^{M} w_j x_i^j \Big)^2.$$
13 (Figure: training error vs. model complexity. Training error decreases as model complexity increases.)
14 Prediction (Generalization) Error
Training set error can be a poor measure of the quality of the solution. Prediction error: we really care about the error over all possible input points, not just the training data:
$$\mathrm{Err}_{\text{true}}(w) = E_X\Big[\Big( t(x) - \sum_{j=1}^{M} w_j x^j \Big)^2\Big] = \int \Big( t(x) - \sum_{j=1}^{M} w_j x^j \Big)^2 p(x) \, dx.$$
15 (Figure: true error and training error vs. model complexity. Training error keeps decreasing with complexity, while the true error first decreases and then increases; we want the complexity at the minimum of the true error.)
16 Why doesn't training set error approximate prediction error?
Sampling approximation of the prediction error: draw $m$ fresh points $x_i \sim p(x)$ and compute
$$\mathrm{Err}_{\text{true}}(w) \approx \frac{1}{m} \sum_{i=1}^{m} \Big( t_i - \sum_{j=1}^{M} w_j x_i^j \Big)^2.$$
Compare with the training error:
$$\mathrm{Err}_{\text{train}}(w) = \frac{1}{N_{\text{train}}} \sum_{i=1}^{N_{\text{train}}} \Big( t_i - \sum_{j=1}^{M} w_j x_i^j \Big)^2.$$
Very similar equations!!! So why is the training set a bad measure of prediction error???
17 Why is the training set a bad measure of prediction error?
Training error is a good estimate for a single, fixed $w$. But we optimized $w$ with respect to the training error, and found a $w$ that is good for this particular set of samples. Training error is therefore an optimistically biased estimate of the prediction error.
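A small simulation makes this optimistic bias visible (a sketch with arbitrary synthetic choices, not part of the slides): over many training sets, the average training error of the fitted $w$ sits below the average prediction error estimated on fresh samples from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, m, degree = 0.3, 20, 5
f = lambda x: np.sin(2 * np.pi * x)          # true function (illustrative assumption)

train_errs, true_errs = [], []
for _ in range(200):
    x = rng.uniform(0.0, 1.0, m)
    t = f(x) + rng.normal(0.0, sigma, m)
    w = np.polyfit(x, t, degree)             # w is optimized on the training set
    train_errs.append(np.mean((t - np.polyval(w, x)) ** 2))
    # A fresh, large sample from the same distribution approximates the true error.
    xs = rng.uniform(0.0, 1.0, 5000)
    ts = f(xs) + rng.normal(0.0, sigma, xs.size)
    true_errs.append(np.mean((ts - np.polyval(w, xs)) ** 2))

print(np.mean(train_errs), np.mean(true_errs))  # training error is optimistically low
```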
18 Test Error
Given a dataset, randomly split it into two parts: training data $\{x_1, \dots, x_{N_{\text{train}}}\}$ and test data $\{x_1, \dots, x_{N_{\text{test}}}\}$. Use the training data to optimize the parameters $w$. Test set error: for the final solution $w^*$, evaluate
$$\mathrm{Err}_{\text{test}}(w^*) = \frac{1}{N_{\text{test}}} \sum_{i=1}^{N_{\text{test}}} \Big( t_i - \sum_{j=1}^{M} w_j^* x_i^j \Big)^2.$$
This estimate is unbiased, but it has variance. The test set is only unbiased if you never do any learning on the test data: for example, if you use the test set to select the degree of the polynomial, it is no longer unbiased!!!
19 (Figure: test error, training error, and true error vs. model complexity. The test error tracks the true error; we want the complexity at its minimum.)
20 How many points to use for training/testing?
A very hard question to answer! With too few training points, the learned $w$ is bad; with too few test points, you never know whether you reached a good solution. Theory proposes error bounds (advanced course). Typically: if you have a reasonable amount of data, pick a test set large enough for a reasonable estimate of error, and use the rest for learning; if you have little data, then you need to use some special techniques, e.g. bootstrapping.
21 (Figure: error as a function of the number of training examples for a fixed model complexity. With little data the training error is low and the test error high; as the amount of data grows, the two curves converge toward the error attained with infinite data.)
22 Overfitting
With too few training examples, our model may achieve zero training error but nevertheless have a large generalization error. When the training error no longer bears any relation to the generalization error, the model overfits the data.
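A concrete overfitting example, sketched with hypothetical synthetic numbers (not from the slides): fitting a degree-9 polynomial to only 10 noisy points drives the training error to (numerically) zero while the generalization error stays large.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * np.pi * x)         # true function (illustrative assumption)
x = rng.uniform(0.0, 1.0, 10)               # too few training examples
t = f(x) + rng.normal(0.0, 0.3, x.size)

w = np.polyfit(x, t, 9)                     # degree 9 through 10 points: interpolates the data
train_err = np.mean((t - np.polyval(w, x)) ** 2)

xs = np.linspace(0.0, 1.0, 1000)
gen_err = np.mean((f(xs) - np.polyval(w, xs)) ** 2)  # error against the noiseless truth
print(train_err, gen_err)                   # near-zero training error, large generalization error
```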
23 Cross-validation for detecting and preventing overfitting
Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: Comments and corrections gratefully received.
Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University, awm@cs.cmu.edu
Copyright Andrew W. Moore
24 A Regression Problem
$y = f(x) + \text{noise}$. Can we learn $f$ from this data? Let's consider three methods.
25 Linear Regression
26 Quadratic Regression
27 Join-the-dots
Also known as piecewise linear nonparametric regression, if that makes you feel better.
28 Which is best?
Why not choose the method with the best fit to the data?
29 What do we really want?
Why not choose the method with the best fit to the data? Because what actually matters is how well you are going to predict future data drawn from the same distribution.
30-32 The test set method
1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.
Linear regression example: Mean Squared Error = 2.4.
33 The test set method: quadratic regression example. Mean Squared Error = 0.9.
34 The test set method: join-the-dots example. Mean Squared Error = 2.2.
35-36 The test set method
Good news: it is very, very simple, and you can then simply choose the method with the best test-set score.
Bad news: it wastes data (we get an estimate of the best method to apply to 30% less data), and if we don't have much data, our test set might just be lucky or unlucky. We say the test-set estimator of performance has high variance.
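That high variance is easy to see by re-drawing the random 70/30 split many times on the same small dataset. This is a sketch with synthetic data (the sin target, noise level, and degree-3 fit are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0.0, 1.0, 40)               # a small dataset
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)

estimates = []
for _ in range(100):
    idx = rng.permutation(x.size)
    tr, te = idx[:28], idx[28:]              # 70/30 split, re-drawn each time
    w = np.polyfit(x[tr], t[tr], 3)          # train on the training portion only
    estimates.append(np.mean((t[te] - np.polyval(w, x[te])) ** 2))

# The same data yields noticeably different test-set estimates from split to split.
print(min(estimates), max(estimates))
```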
37-42 LOOCV (Leave-one-out Cross Validation)
For k = 1 to R:
1. Let (x_k, y_k) be the k-th record.
2. Temporarily remove (x_k, y_k) from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on (x_k, y_k).
When you've done all points, report the mean error.
Linear regression example: MSE_LOOCV = 2.12.
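The four LOOCV steps above translate directly into code. This is a minimal numpy sketch on synthetic data (the sin target and polynomial degrees are illustrative assumptions, so the printed numbers will not match the slide's MSE):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)

def loocv_mse(x, t, degree):
    errs = []
    for k in range(x.size):                        # for k = 1 to R
        mask = np.arange(x.size) != k              # temporarily remove the k-th record
        w = np.polyfit(x[mask], t[mask], degree)   # train on the remaining R-1 points
        errs.append((t[k] - np.polyval(w, x[k])) ** 2)  # note the error on (x_k, t_k)
    return np.mean(errs)                           # mean error over all points

mse_lin = loocv_mse(x, t, 1)
mse_quad = loocv_mse(x, t, 2)
print(mse_lin, mse_quad)
```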
43 LOOCV for Quadratic Regression: the same procedure gives MSE_LOOCV = 0.962.
44 LOOCV for Join-the-dots: the same procedure gives MSE_LOOCV = 3.33.
45 Which kind of Cross Validation?
- Test-set. Downside: variance (an unreliable estimate of future performance). Upside: cheap.
- Leave-one-out. Downside: expensive; has some weird behavior. Upside: doesn't waste data.
Can we get the best of both worlds?
46-50 k-fold Cross Validation
Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue).
For the red partition: train on all the points not in the red partition, and find the test-set sum of errors on the red points.
For the green partition: train on all the points not in the green partition, and find the test-set sum of errors on the green points.
For the blue partition: train on all the points not in the blue partition, and find the test-set sum of errors on the blue points.
Then report the mean error.
Linear regression example: MSE_3FOLD = 2.05.
51 k-fold Cross Validation: quadratic regression gives MSE_3FOLD = 1.11.
52 k-fold Cross Validation: join-the-dots gives MSE_3FOLD = 2.93.
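The k-fold procedure can be sketched the same way (synthetic data and a degree-3 fit are illustrative assumptions, so the number printed will not match the slides):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 1.0, 60)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)

def kfold_mse(x, t, degree, k=3):
    idx = rng.permutation(x.size)            # randomly break the data into k partitions
    folds = np.array_split(idx, k)
    total, n = 0.0, 0
    for fold in folds:
        mask = np.ones(x.size, dtype=bool)
        mask[fold] = False                   # train on all the points not in this partition
        w = np.polyfit(x[mask], t[mask], degree)
        total += np.sum((t[fold] - np.polyval(w, x[fold])) ** 2)  # sum of errors on held-out points
        n += fold.size
    return total / n                         # mean error over all held-out points

mse_3fold = kfold_mse(x, t, 3)
print(mse_3fold)
```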
53 Which kind of Cross Validation?
- Test-set. Downside: variance (an unreliable estimate of future performance). Upside: cheap.
- Leave-one-out. Downside: expensive; has some weird behavior. Upside: doesn't waste data.
- 10-fold. Downside: wastes 10% of the data; 10 times more expensive than the test-set method. Upside: only wastes 10%, and is only 10 times more expensive instead of R times.
- 3-fold. Downside: wastes more data than 10-fold; more expensive than the test-set method. Upside: slightly better than test-set.
- R-fold. Identical to leave-one-out.
More informationOverfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping
Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Earl Stopping Rich Caruana CALD, CMU 5 Forbes Ave. Pittsburgh, PA 53 caruana@cs.cmu.edu Steve Lawrence NEC Research Institute 4 Independence
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining Fundamentals of learning (continued) and the k-nearest neighbours classifier Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart.
More informationCPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017
CPSC 340: Machine Learning and Data Mining Kernel Trick Fall 2017 Admin Assignment 3: Due Friday. Midterm: Can view your exam during instructor office hours or after class this week. Digression: the other
More information2.4. Families of Polynomial Functions
2. Families of Polnomial Functions Crstal pieces for a large chandelier are to be cut according to the design shown. The graph shows how the design is created using polnomial functions. What do all the
More informationAnswers Investigation 4
Answers Investigation Applications. a. At seconds, the flare will have traveled to a maimum height of 00 ft. b. The flare will hit the water when the height is 0 ft, which will occur at 0 seconds. c. In
More informationScale Invariant Feature Transform (SIFT) CS 763 Ajit Rajwade
Scale Invariant Feature Transform (SIFT) CS 763 Ajit Rajwade What is SIFT? It is a technique for detecting salient stable feature points in an image. For ever such point it also provides a set of features
More informationCOMP 551 Applied Machine Learning Lecture 13: Unsupervised learning
COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning Associate Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551
More informationClassification: Feature Vectors
Classification: Feature Vectors Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just # free YOUR_NAME MISSPELLED FROM_FRIEND... : : : : 2 0 2 0 PIXEL 7,12
More informationCross-validation and the Bootstrap
Cross-validation and the Bootstrap In the section we discuss two resampling methods: cross-validation and the bootstrap. These methods refit a model of interest to samples formed from the training set,
More informationCSE Data Mining Concepts and Techniques STATISTICAL METHODS (REGRESSION) Professor- Anita Wasilewska. Team 13
CSE 634 - Data Mining Concepts and Techniques STATISTICAL METHODS Professor- Anita Wasilewska (REGRESSION) Team 13 Contents Linear Regression Logistic Regression Bias and Variance in Regression Model Fit
More informationSection 1.4 Limits involving infinity
Section. Limits involving infinit (/3/08) Overview: In later chapters we will need notation and terminolog to describe the behavior of functions in cases where the variable or the value of the function
More information2.3 Polynomial Functions of Higher Degree with Modeling
SECTION 2.3 Polnomial Functions of Higher Degree with Modeling 185 2.3 Polnomial Functions of Higher Degree with Modeling What ou ll learn about Graphs of Polnomial Functions End Behavior of Polnomial
More informationGraphing Review. Math Tutorial Lab Special Topic
Graphing Review Math Tutorial Lab Special Topic Common Functions and Their Graphs Linear Functions A function f defined b a linear equation of the form = f() = m + b, where m and b are constants, is called
More information2017 ITRON EFG Meeting. Abdul Razack. Specialist, Load Forecasting NV Energy
2017 ITRON EFG Meeting Abdul Razack Specialist, Load Forecasting NV Energy Topics 1. Concepts 2. Model (Variable) Selection Methods 3. Cross- Validation 4. Cross-Validation: Time Series 5. Example 1 6.
More information9. f(x) = x f(x) = x g(x) = 2x g(x) = 5 2x. 13. h(x) = 1 3x. 14. h(x) = 2x f(x) = x x. 16.
Section 4.2 Absolute Value 367 4.2 Eercises For each of the functions in Eercises 1-8, as in Eamples 7 and 8 in the narrative, mark the critical value on a number line, then mark the sign of the epression
More informationAnnouncements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron
CS 188: Artificial Intelligence Spring 2010 Lecture 24: Perceptrons and More! 4/20/2010 Announcements W7 due Thursday [that s your last written for the semester!] Project 5 out Thursday Contest running
More informationmodel order p weights The solution to this optimization problem is obtained by solving the linear system
CS 189 Introduction to Machine Learning Fall 2017 Note 3 1 Regression and hyperparameters Recall the supervised regression setting in which we attempt to learn a mapping f : R d R from labeled examples
More informationCross- Valida+on & ROC curve. Anna Helena Reali Costa PCS 5024
Cross- Valida+on & ROC curve Anna Helena Reali Costa PCS 5024 Resampling Methods Involve repeatedly drawing samples from a training set and refibng a model on each sample. Used in model assessment (evalua+ng
More informationRegularization and model selection
CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial
More informationLecture 16: High-dimensional regression, non-linear regression
Lecture 16: High-dimensional regression, non-linear regression Reading: Sections 6.4, 7.1 STATS 202: Data mining and analysis November 3, 2017 1 / 17 High-dimensional regression Most of the methods we
More informationIOM 530: Intro. to Statistical Learning 1 RESAMPLING METHODS. Chapter 05
IOM 530: Intro. to Statistical Learning 1 RESAMPLING METHODS Chapter 05 IOM 530: Intro. to Statistical Learning 2 Outline Cross Validation The Validation Set Approach Leave-One-Out Cross Validation K-fold
More informationRecognition Tools: Support Vector Machines
CS 2770: Computer Vision Recognition Tools: Support Vector Machines Prof. Adriana Kovashka University of Pittsburgh January 12, 2017 Announcement TA office hours: Tuesday 4pm-6pm Wednesday 10am-12pm Matlab
More informationIntroduction to Automated Text Analysis. bit.ly/poir599
Introduction to Automated Text Analysis Pablo Barberá School of International Relations University of Southern California pablobarbera.com Lecture materials: bit.ly/poir599 Today 1. Solutions for last
More informationGoing nonparametric: Nearest neighbor methods for regression and classification
Going nonparametric: Nearest neighbor methods for regression and classification STAT/CSE 46: Machine Learning Emily Fox University of Washington May 3, 208 Locality sensitive hashing for approximate NN
More informationTopics in Machine Learning-EE 5359 Model Assessment and Selection
Topics in Machine Learning-EE 5359 Model Assessment and Selection Ioannis D. Schizas Electrical Engineering Department University of Texas at Arlington 1 Training and Generalization Training stage: Utilizing
More informationSection 4.2 Graphing Lines
Section. Graphing Lines Objectives In this section, ou will learn to: To successfull complete this section, ou need to understand: Identif collinear points. The order of operations (1.) Graph the line
More informationTopic 2 Transformations of Functions
Week Topic Transformations of Functions Week Topic Transformations of Functions This topic can be a little trick, especiall when one problem has several transformations. We re going to work through each
More informationKernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Kernels + K-Means Matt Gormley Lecture 29 April 25, 2018 1 Reminders Homework 8:
More information