Bias-Variance Decomposition Error Estimators Cross-Validation


Bias-Variance Tradeoff: Intuition. A model that is too simple does not fit the data well: a biased solution. A model that is too complex changes a lot with small changes to the data: a high-variance solution.

Expected Loss. Let h(x) = E[t | x] be the optimal predictor and y(x) our actual predictor, which will incur the following expected loss:

E[L] = \iint (y(x) - t)^2 \, p(x,t) \, dx \, dt

Writing y(x) - t = (y(x) - h(x)) + (h(x) - t) and expanding the square,

E[L] = \int (y(x) - h(x))^2 \, p(x) \, dx + \iint (h(x) - t)^2 \, p(x,t) \, dx \, dt

since the cross term vanishes: E[(y(x) - h(x))(h(x) - t)] = 0, because h(x) = E[t | x] = \int t \, p(t | x) \, dt. The second integral is the noise term, source of error 1.

Sources of error 1: noise. What if we have a perfect learner and infinite data? Then our learned solution satisfies y(x) = h(x). We are still left with a remaining, unavoidable error of \sigma^2 due to the noise \varepsilon, since t = f(x) + \varepsilon with \varepsilon \sim N(0, \sigma^2):

\iint (h(x) - t)^2 \, p(x,t) \, dx \, dt = E[\varepsilon^2] = \sigma^2

Bias-variance decomposition. Focus on the first term, \int (y(x) - h(x))^2 \, p(x) \, dx, and examine the expected loss averaged over data sets. For one data set D and one test point x, write

y(x; D) - h(x) = (y(x; D) - E_D[y(x; D)]) + (E_D[y(x; D)] - h(x))

Squaring and taking the expectation E_D over data sets, the cross term cancels to zero, hence

E_D[(y(x; D) - h(x))^2] = (E_D[y(x; D)] - h(x))^2 + E_D[(y(x; D) - E_D[y(x; D)])^2] = bias^2 + variance

Bias. Suppose you are given a dataset D with m samples from some distribution. You learn a function y(x; D) from the data D. If you sample a different dataset, you will learn a different y(x; D). Expected hypothesis: E_D[y(x; D)]. Bias: the difference between what you expect to learn (E_D[y(x; D)]) and the truth (the true model h(x)). It measures how well you can expect to represent the true solution, and it decreases with a more complex model:

bias^2 = \int (E_D[y(x; D)] - h(x))^2 \, p(x) \, dx

Variance. Variance: the difference between what you expect to learn (E_D[y(x; D)], averaged over datasets) and what you learn from a particular dataset D (y(x; D)). It measures how sensitive the learner is to the specific dataset, and it decreases with a simpler model:

variance = \int E_D[(y(x; D) - E_D[y(x; D)])^2] \, p(x) \, dx

Bias-Variance Tradeoff. The choice of hypothesis class introduces a learning bias: a more complex class means less bias, but a more complex class also means more variance.

The Expected Prediction Error. The expected squared prediction error over fixed-size training sets D drawn from P(X, T) can be expressed as the sum of three components:

expected error = unavoidable error + bias^2 + variance

where

unavoidable error = \sigma^2
bias^2 = \int (E_D[y(x; D)] - h(x))^2 \, p(x) \, dx
variance = \int E_D[(y(x; D) - E_D[y(x; D)])^2] \, p(x) \, dx
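
To make these three components concrete, here is a minimal Monte Carlo sketch (an illustration added here, not part of the original slides). It assumes a toy setup: f(x) = sin(2 pi x), Gaussian noise with sigma = 0.3, and a degree-3 polynomial learner fit to 25 points; bias^2 and variance are estimated by averaging fits over many independently drawn training sets.

# Monte Carlo check of the decomposition: expected error ~ noise + bias^2 + variance.
# Assumed toy setup: f(x) = sin(2*pi*x), t = f(x) + eps, eps ~ N(0, 0.3^2),
# learner = degree-3 polynomial fit on m = 25 points per training set.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
sigma, m, degree, n_datasets = 0.3, 25, 3, 500

x_test = np.linspace(0, 1, 200)               # fixed test inputs
preds = np.empty((n_datasets, x_test.size))   # y(x; D) for each training set D

for d in range(n_datasets):
    x = rng.uniform(0, 1, m)
    t = f(x) + rng.normal(0, sigma, m)
    w = np.polyfit(x, t, degree)              # fit one dataset D
    preds[d] = np.polyval(w, x_test)

mean_pred = preds.mean(axis=0)                # E_D[y(x; D)]
bias2 = np.mean((mean_pred - f(x_test)) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"bias^2 = {bias2:.4f}, variance = {variance:.4f}, noise = {sigma**2:.4f}")

Summing the three printed numbers approximates the expected squared error on a fresh test point; increasing the polynomial degree shifts weight from the bias term to the variance term.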

(Figure: higher variance, lower bias.)

Training Error. Given training data, choose a loss function, e.g., squared error (L2) for regression. Training set error: for a particular set of parameters w, the loss function evaluated on the training data:

Err_train(w) = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} \Big( t_i - \sum_{j} w_j x_i^j \Big)^2
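
As a small illustration (made-up data, assuming the model is a degree-M polynomial fit by least squares), the training error above can be computed directly:

# Training-set error of a degree-M polynomial fit; a minimal sketch on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(0, 1, 20)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, 20)

M = 3
w = np.polyfit(x_train, t_train, M)           # least squares fit minimizes the training error
err_train = np.mean((t_train - np.polyval(w, x_train)) ** 2)
print(f"Err_train = {err_train:.4f}")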

(Figure: training error as a function of model complexity, from low to high.)

Prediction (Generalization) Error. Training set error can be a poor measure of the quality of the solution. Prediction error: we really care about the error over all possible input points, not just the training data:

Err_true(w) = E_X\Big[ \big( t(x) - \sum_{j} w_j x^j \big)^2 \Big] = \int \big( t(x) - \sum_{j} w_j x^j \big)^2 \, p(x) \, dx

(Figure: true error and training error as functions of model complexity, with the complexity we want marked.)

Why doesn't training set error approximate prediction error? Sampling approximation of the prediction error:

Err_true(w) \approx \frac{1}{N} \sum_{i=1}^{N} \Big( t_i - \sum_{j} w_j x_i^j \Big)^2

Training set error:

Err_train(w) = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} \Big( t_i - \sum_{j} w_j x_i^j \Big)^2

Very similar equations!!! So why is the training set a bad measure of prediction error???

Why is the training set a bad measure of prediction error??? Training error is a good estimate of the error for a single, fixed w. But we optimized w with respect to the training error and found a w that is good for this particular set of samples. Training error is therefore an optimistically biased estimate of the prediction error.

Test Error. Given a dataset, randomly split it into two parts: training data {x_1, ..., x_{N_train}} and test data {x_1, ..., x_{N_test}}. Use the training data to optimize the parameters w. Test set error: for the final solution w*, evaluate the error using

Err_test(w*) = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \Big( t_i - \sum_{j} w_j^* x_i^j \Big)^2

This estimate is unbiased, but it has variance. The test set is only unbiased if you never do any learning on the test data. For example, if you use the test set to select the degree of the polynomial, the estimate is no longer unbiased!!!
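
A minimal sketch of this procedure with numpy (synthetic data; the 30% test split matches the one used in the slides that follow):

# Holdout estimate: fit w on the training split only, then evaluate w* once on the test split.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 100)

idx = rng.permutation(x.size)                 # random split
n_test = int(0.3 * x.size)
test, train = idx[:n_test], idx[n_test:]

M = 3
w_star = np.polyfit(x[train], t[train], M)    # learned from the training data only
err_test = np.mean((t[test] - np.polyval(w_star, x[test])) ** 2)
print(f"Err_test = {err_test:.4f}")
# Caveat from the slide: if the test split is also used to pick M, the estimate is no longer unbiased.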

(Figure: test error, true error, and training error as functions of model complexity, with the complexity we want marked.)

How many points to use for training/testing? A very hard question to answer! With too few training points, the learned w is bad. With too few test points, you never know whether you have reached a good solution. Theory proposes error bounds (advanced course). Typically: if you have a reasonable amount of data, pick a test set large enough for a reasonable estimate of the error and use the rest for learning; if you have little data, you need to use special techniques, e.g., bootstrapping.

(Figure: test error and training error as functions of the number of training examples, from little data to infinite data, for a fixed model complexity.)

Overfitting. With too few training examples, our model may achieve zero training error but nevertheless have a large generalization error. When the training error no longer bears any relation to the generalization error, the model overfits the data.
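
A quick numerical illustration of this effect (toy data; the setup is an assumption for illustration, not from the slides): with as many polynomial coefficients as training points, the training error drops to essentially zero while the error on a large fresh sample, standing in for the generalization error, blows up.

# Overfitting demo: interpolating polynomial vs. a modest one, on the same 10 noisy points.
import numpy as np

rng = np.random.default_rng(3)
m = 10
x_train = np.sort(rng.uniform(0, 1, m))
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, m)
x_big = rng.uniform(0, 1, 10000)              # large fresh sample stands in for the true error
t_big = np.sin(2 * np.pi * x_big) + rng.normal(0, 0.3, 10000)

for M in (3, m - 1):                          # modest degree vs. interpolating degree
    w = np.polyfit(x_train, t_train, M)
    err_train = np.mean((t_train - np.polyval(w, x_train)) ** 2)
    err_gen = np.mean((t_big - np.polyval(w, x_big)) ** 2)
    print(f"degree {M}: Err_train = {err_train:.4f}, Err_true approx {err_gen:.2f}")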

Cross-validation for detecting and preventing overfitting. Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received. Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University. www.cs.cmu.edu/~awm, awm@cs.cmu.edu, 412-268-7599. Copyright Andrew W. Moore.

A Regression Problem. y = f(x) + noise. Can we learn f from this data? Let's consider three methods.

Linear Regression.

Quadratic Regression.

Join-the-dots. Also known as piecewise linear nonparametric regression, if that makes you feel better.
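
For concreteness, one plausible way to implement such a join-the-dots regressor is piecewise-linear interpolation of the sorted training points. The sketch below uses numpy's interp and illustrates the idea; it is not necessarily the implementation behind the slides' figures.

# "Join-the-dots": piecewise-linear interpolation of the training points.
import numpy as np

def fit_join_the_dots(x_train, t_train):
    order = np.argsort(x_train)
    xs, ts = x_train[order], t_train[order]
    # np.interp connects consecutive training points with straight lines
    # and extrapolates flat outside the observed range.
    return lambda x_new: np.interp(x_new, xs, ts)

x = np.array([0.0, 0.2, 0.5, 0.9])
t = np.array([0.1, 0.8, -0.3, 0.4])
predict = fit_join_the_dots(x, t)
print(predict(np.array([0.1, 0.35, 0.7])))    # predictions at three query points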

Which is best? Why not choose the method with the best fit to the data?

What do we really want? Why not choose the method with the best fit to the data? What really matters is how well you are going to predict future data drawn from the same distribution.

The test set method. 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set.

The test set method. 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set. 3. Perform your regression on the training set. (Linear regression example.)

The test set method. 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set. 3. Perform your regression on the training set. 4. Estimate your future performance with the test set. (Linear regression example: Mean Squared Error = 2.4.)

The test set method. 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set. 3. Perform your regression on the training set. 4. Estimate your future performance with the test set. (Quadratic regression example: Mean Squared Error = 0.9.)

The test set method. 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set. 3. Perform your regression on the training set. 4. Estimate your future performance with the test set. (Join-the-dots example: Mean Squared Error = 2.2.)

The test set method. Good news: very, very simple; you can then simply choose the method with the best test-set score. Bad news: what's the downside?

The test set method. Good news: very, very simple; you can then simply choose the method with the best test-set score. Bad news: it wastes data (we get an estimate of the best method to apply to 30% less data), and if we don't have much data, our test set might just be lucky or unlucky. We say the test-set estimator of performance has high variance.
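
A small simulation of that "lucky or unlucky" effect (toy data and sizes are assumptions for illustration): redraw the random 30% holdout many times on a small dataset and look at the spread of the resulting error estimates.

# How noisy is a single holdout estimate? Repeat the random 30% split many times and
# measure the spread of the test-set MSE; this is the "high variance" the slide refers to.
import numpy as np

rng = np.random.default_rng(4)
n = 40                                        # "not much data"
x = rng.uniform(0, 1, n)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

estimates = []
for _ in range(200):                          # 200 different random holdouts of the same data
    idx = rng.permutation(n)
    test, train = idx[:int(0.3 * n)], idx[int(0.3 * n):]
    w = np.polyfit(x[train], t[train], 3)
    estimates.append(np.mean((t[test] - np.polyval(w, x[test])) ** 2))

estimates = np.array(estimates)
print(f"holdout MSE estimate: mean = {estimates.mean():.3f}, std = {estimates.std():.3f}")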

LOOCV (Leave-one-out Cross Validation). For k = 1 to R: 1. Let (x_k, y_k) be the k-th record.

LOOCV (Leave-one-out Cross Validation). For k = 1 to R: 1. Let (x_k, y_k) be the k-th record. 2. Temporarily remove (x_k, y_k) from the dataset.

LOOCV (Leave-one-out Cross Validation). For k = 1 to R: 1. Let (x_k, y_k) be the k-th record. 2. Temporarily remove (x_k, y_k) from the dataset. 3. Train on the remaining R-1 datapoints.

LOOCV (Leave-one-out Cross Validation). For k = 1 to R: 1. Let (x_k, y_k) be the k-th record. 2. Temporarily remove (x_k, y_k) from the dataset. 3. Train on the remaining R-1 datapoints. 4. Note your error on (x_k, y_k).

LOOCV (Leave-one-out Cross Validation). For k = 1 to R: 1. Let (x_k, y_k) be the k-th record. 2. Temporarily remove (x_k, y_k) from the dataset. 3. Train on the remaining R-1 datapoints. 4. Note your error on (x_k, y_k). When you've done all points, report the mean error.

LOOCV (Leave-one-out Cross Validation). For k = 1 to R: 1. Let (x_k, y_k) be the k-th record. 2. Temporarily remove (x_k, y_k) from the dataset. 3. Train on the remaining R-1 datapoints. 4. Note your error on (x_k, y_k). When you've done all points, report the mean error. MSE_LOOCV = 2.12.

LOOCV for Quadratic Regression. For k = 1 to R: 1. Let (x_k, y_k) be the k-th record. 2. Temporarily remove (x_k, y_k) from the dataset. 3. Train on the remaining R-1 datapoints. 4. Note your error on (x_k, y_k). When you've done all points, report the mean error. MSE_LOOCV = 0.96.

LOOCV for Join The Dots. For k = 1 to R: 1. Let (x_k, y_k) be the k-th record. 2. Temporarily remove (x_k, y_k) from the dataset. 3. Train on the remaining R-1 datapoints. 4. Note your error on (x_k, y_k). When you've done all points, report the mean error. MSE_LOOCV = 3.33.
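
The LOOCV loop above translates almost line for line into code. This is a minimal sketch assuming a degree-M polynomial learner and synthetic data; it illustrates the procedure rather than reproducing the numbers reported on the slides.

# Leave-one-out cross-validation for a degree-M polynomial, following the slide's loop.
import numpy as np

def loocv_mse(x, t, degree):
    errors = []
    for k in range(x.size):
        mask = np.ones(x.size, dtype=bool)
        mask[k] = False                                   # temporarily remove (x_k, t_k)
        w = np.polyfit(x[mask], t[mask], degree)          # train on the remaining R-1 points
        errors.append((t[k] - np.polyval(w, x[k])) ** 2)  # note the error on the held-out point
    return np.mean(errors)                                # report the mean error

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)
for M in (1, 2, 9):                                       # compare a few model complexities
    print(f"degree {M}: LOOCV MSE = {loocv_mse(x, t, M):.3f}")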

Which kind of Cross Validation?

Test-set: Downside: variance, an unreliable estimate of future performance. Upside: cheap.
Leave-one-out: Downside: expensive, and has some weird behavior. Upside: doesn't waste data.

..can we get the best of both worlds?

k-fold Cross Validation. Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue).

k-fold Cross Validation. Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). For the red partition: train on all the points not in the red partition, and find the test-set sum of errors on the red points.

k-fold Cross Validation. Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). For the red partition: train on all the points not in the red partition, and find the test-set sum of errors on the red points. For the green partition: train on all the points not in the green partition, and find the test-set sum of errors on the green points.

k-fold Cross Validation. Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). For the red partition: train on all the points not in the red partition, and find the test-set sum of errors on the red points. For the green partition: train on all the points not in the green partition, and find the test-set sum of errors on the green points. For the blue partition: train on all the points not in the blue partition, and find the test-set sum of errors on the blue points.

k-fold Cross Validation (Linear Regression): MSE_3FOLD = 2.05. Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). For the red partition: train on all the points not in the red partition, and find the test-set sum of errors on the red points. For the green partition: train on all the points not in the green partition, and find the test-set sum of errors on the green points. For the blue partition: train on all the points not in the blue partition, and find the test-set sum of errors on the blue points. Then report the mean error.

k-fold Cross Validation (Quadratic Regression): MSE_3FOLD = 1.11. Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). For the red partition: train on all the points not in the red partition, and find the test-set sum of errors on the red points. For the green partition: train on all the points not in the green partition, and find the test-set sum of errors on the green points. For the blue partition: train on all the points not in the blue partition, and find the test-set sum of errors on the blue points. Then report the mean error.

k-fold Cross Validation (Join-the-dots): MSE_3FOLD = 2.93. Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). For the red partition: train on all the points not in the red partition, and find the test-set sum of errors on the red points. For the green partition: train on all the points not in the green partition, and find the test-set sum of errors on the green points. For the blue partition: train on all the points not in the blue partition, and find the test-set sum of errors on the blue points. Then report the mean error.
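
And the same idea in code: a minimal 3-fold cross-validation sketch with numpy, again assuming a polynomial learner and synthetic data rather than the dataset behind the slides' figures.

# k-fold cross-validation (k = 3 as in the slides): shuffle, split into k partitions,
# train on the other k-1 partitions, score the held-out partition, and average.
import numpy as np

def kfold_mse(x, t, degree, k=3, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(x.size), k)     # the k "colored" partitions
    sq_errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(x.size), held_out)
        w = np.polyfit(x[train], t[train], degree)          # train on everything not in this partition
        sq_errors.extend((t[held_out] - np.polyval(w, x[held_out])) ** 2)
    return np.mean(sq_errors)                                # then report the mean error

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)
for M in (1, 2):                                             # e.g., linear vs. quadratic
    print(f"degree {M}: 3-fold MSE = {kfold_mse(x, t, M):.3f}")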

Which kind of Cross Validation?

Test-set: Downside: variance, an unreliable estimate of future performance. Upside: cheap.
Leave-one-out: Downside: expensive, and has some weird behavior. Upside: doesn't waste data.
10-fold: Downside: wastes 10% of the data, and is 10 times more expensive than the test-set method. Upside: only wastes 10%, and is only 10 times more expensive instead of R times.
3-fold: Downside: wastier than 10-fold, and expensivier than the test-set method. Upside: slightly better than test-set.
R-fold: Identical to leave-one-out.