Statistical Machine Learning Notes 2
Overfitting, Model Selection, Cross Validation, Bias-Variance
Instructor: Justin Domke

1 Motivation

Suppose we have some data TRAIN = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} that we want to fit a curve to:

[Figure: scatter plot of the training data.]

Here, we want to fit a polynomial, of the form

y = w_0 + w_1 x + w_2 x^2 + ... + w_p x^p.

In these notes, we are interested in the question of how to choose p. For various p, we can find and plot the best polynomial, in terms of minimizing the squared distance to all the points (x_i, y_i):

min_w Ê_TRAIN[ (w_0 + w_1 X + w_2 X^2 + ... + w_p X^p - Y)^2 ].     (1.1)

We will refer to this mean of squared distances as the training error.
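To make the setup concrete, here is a minimal sketch (not part of the original notes) of how one might compute these least-squares fits and the resulting training error with numpy. The synthetic data-generating lines are an assumed stand-in for the real TRAIN set; only the fitting and error computation matter here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for TRAIN: 60 noisy points on [0, 1].
x_train = rng.uniform(0.0, 1.0, size=60)
y_train = 0.3 * np.sin(2 * np.pi * x_train) + 0.5 + rng.normal(scale=0.05, size=60)

def fit_polynomial(x, y, p):
    """Least-squares fit of y = w_0 + w_1 x + ... + w_p x^p; returns the coefficients."""
    return np.polyfit(x, y, deg=p)  # coefficients come back highest degree first

def mean_squared_error(w, x, y):
    """The objective of equation (1.1): the mean of the squared residuals."""
    return np.mean((np.polyval(w, x) - y) ** 2)

for p in range(1, 21):
    w = fit_polynomial(x_train, y_train, p)
    print(f"p = {p:2d}   train error = {mean_squared_error(w, x_train, y_train):.4f}")
```

As the notes discuss below, this training error alone turns out to be a poor guide for choosing p.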

[Figure: the best-fit polynomial of each degree p, plotted against the training data. Each fit minimizes Ê_TRAIN[ (w_0 + w_1 X + w_2 X^2 + ... + w_p X^p - Y)^2 ].]

[Figure: the best-fit polynomials for the remaining degrees, up to p = 20.]

The question we address in these notes is: what p do we expect to generalize best? Specifically, we will imagine that we will get some new data TEST drawn from the same distribution; this is called the test data. We want to pick p to minimize, on the test data, the same sum-of-squares difference we used for fitting to the training data:

Ê_TEST[ (w_0 + w_1 X + w_2 X^2 + ... + w_p X^p - Y)^2 ].

Here, we see a simple example of the basic game of machine learning. It works like this:

- Input some data TRAIN.

- Fit some model to TRAIN.
- Get some new data TEST.
- Test the model on TEST.

In machine learning, one is generally considered to have won the game if one's performance on TEST is best. This is often in contrast to the attitude in statistics, where success is usually measured by whether the model is close to the true model.

Our assumption here is that the elements of TRAIN and TEST are both drawn from the same distribution. What this means is that there is some true distribution p(x, y). We don't know p, but we assume all the elements of TRAIN and TEST are drawn from it independently. An example should make this clear. Suppose that x is a person's height, and y is a person's age. Then (x, y) ~ p(x, y) means that we got x and y by showing up at a random house on a random day, grabbing a random person, and measuring their height and age. If we got TRAIN by doing this in Rochester, and TEST by doing this in Buffalo, then we would be violating the assumption that TRAIN and TEST came from the same distribution.

Now, return to our example of choosing p. What will happen to the training error as p gets bigger? Clearly, a larger p means more power, and so the training error will strictly decrease. However, the high-degree polynomials appear to display severe artifacts that don't appear likely to capture real properties of the data. What is going wrong here?

2 The Bias-Variance Tradeoff

Let us return to our initial problem of trying to pick the right degree p for our polynomial. The first idea that springs to mind is to pick the p that fits the data best. If we plot

min_w Ê_TRAIN[ (w_0 + w_1 X + w_2 X^2 + ... + w_p X^p - Y)^2 ]

for each p, we see something disturbing:

[Figure: training error as a function of the polynomial degree p; it decreases steadily as p grows.]

Clearly this is not what we want. Instead, suppose we had a giant pile of extra points drawn from the same distribution as the training data. We will call this test data, and the average error on it the test error.

[Figure: test error as a function of the polynomial degree p.]

Things blow up after p = 16. (This is intuitively plausible if you look at the plots of the polynomials above.) There are two things happening here:

- For very low p, the model is very simple, and so can't capture the full complexities of the data. This is called bias.
- For very high p, the model is complex, and so tends to overfit to spurious properties of the data. This is called variance.

Don't think about the names "bias" and "variance" too much. The names derive from technical concepts in fitting least-squares regression models, but we will use them more informally

for all types of models.

Now, when doing model selection, one often sees a bias-variance tradeoff. This is commonly thought of by drawing a picture like this:

[Figure: sketch of error, variance, and bias as a function of model complexity, from low complexity to high complexity.]

This shows the bias, variance, and error of a range of different learning methods, for a given amount of training data. It shows the common situation in practice that (1) for simple models, the bias is high, while (2) for complex models, the variance is high. Since the test error depends on both, the optimal complexity is somewhere in the middle.

Question: Suppose the amount of training data is increased. How would the above curves change?

Answer: The new curves would look something like:

[Figure: the same sketch with more training data; the variance curve is lower and the minimum of the error curve moves toward higher complexity.]

The variance is reduced since there is more data, and so a slightly more complex model minimizes the expected error.

Bias-variance tradeoffs are seen very frequently, in all sorts of problems. We can very often understand the differences in performance between different algorithms as trading off between bias and variance. Usually, if we take some algorithm and change it to reduce bias, we will also increase variance. This doesn't have to happen, though: if you try, you can change an algorithm in such a way as to increase both bias and variance.
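The decomposition can be made concrete by simulation. The sketch below (not part of the original notes) repeatedly draws fresh training sets from an assumed synthetic distribution, fits a degree-p polynomial to each, and estimates the squared bias and the variance of the fitted curve over a grid of test inputs; the functions true_f and sample_train are hypothetical stand-ins. Under these assumptions, squared bias tends to fall and variance tends to rise as p grows, matching the cartoon above.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    # Assumed "true" underlying function (a stand-in for whatever really generated the data).
    return 0.3 * np.sin(2 * np.pi * x) + 0.5

def sample_train(n=60, noise=0.05):
    # Draw one fresh training set of n noisy points.
    x = rng.uniform(0.0, 1.0, size=n)
    return x, true_f(x) + rng.normal(scale=noise, size=n)

x_grid = np.linspace(0.05, 0.95, 50)  # test inputs at which the fits are compared
n_repeats = 200                       # number of independent training sets

for p in [1, 3, 6, 12]:
    preds = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        x, y = sample_train()
        w = np.polyfit(x, y, deg=p)
        preds[r] = np.polyval(w, x_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f(x_grid)) ** 2)  # squared bias, averaged over the grid
    variance = np.mean(preds.var(axis=0))                 # variance of the fit, averaged over the grid
    print(f"p = {p:2d}   bias^2 = {bias_sq:.5f}   variance = {variance:.5f}")
```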

3 Cross Validation

Let's return to the issue of picking p. We saw above that if we had a large pile of extra data, we could just fit the model with each p, and pick the p that does best on the validation data. What would you do if you don't happen to have a spare pile of extra data sitting around?

The first idea that comes to mind is to hold out some of our original training data. This is called hold-out set validation.

1. Split TRAIN into TRAIN_fit and TRAIN_validation (e.g. half in each).
2. Fit each model to TRAIN_fit, and evaluate how well it does on TRAIN_validation.
3. Output the model that has the best score on TRAIN_validation.

This can work reasonably well, but it wastes the data by only training on half, and only testing on half. We can make better use of the data by making several different splits of the data. Each datum is used once for testing, and the other times for training. This algorithm is called K-fold cross validation.

1. Split TRAIN into K chunks.
2. For k = 1, 2, ..., K:
   (a) Set TRAIN_validation to be the kth chunk of data, and TRAIN_fit to be the other K - 1 chunks.
   (b) Fit each model to TRAIN_fit and evaluate how well it does on TRAIN_validation.
3. Pick the model that has the best average validation score.
4. Retrain that model on all of TRAIN, and output it.

If we do 5-fold cross validation on our original set of 60 points, we get:

[Figure: leave-one-out and 5-fold cross-validation error as a function of the polynomial degree p.]
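Here is a minimal numpy sketch (not part of the original notes) of this procedure applied to degree selection; the data-generating lines are the same hypothetical stand-in used earlier, and the folds are fixed with a seed so that every degree is scored on the same splits.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-in for the 60-point training set.
x = rng.uniform(0.0, 1.0, size=60)
y = 0.3 * np.sin(2 * np.pi * x) + 0.5 + rng.normal(scale=0.05, size=60)

def kfold_cv_error(x, y, p, K=5):
    """Average validation error of a degree-p polynomial under K-fold cross validation."""
    idx = np.random.default_rng(0).permutation(len(x))  # same shuffle for every call
    chunks = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val = chunks[k]                                                # step 2(a): kth chunk validates
        fit = np.concatenate([chunks[j] for j in range(K) if j != k])  # the other K - 1 chunks fit
        w = np.polyfit(x[fit], y[fit], deg=p)                          # step 2(b): fit ...
        errors.append(np.mean((np.polyval(w, x[val]) - y[val]) ** 2))  # ... and evaluate
    return np.mean(errors)

degrees = range(1, 21)
cv_errors = [kfold_cv_error(x, y, p, K=5) for p in degrees]
best_p = degrees[int(np.argmin(cv_errors))]  # step 3: best average validation score
print("selected degree:", best_p)

w_final = np.polyfit(x, y, deg=best_p)       # step 4: retrain on all of TRAIN
```

Setting K = len(x) in kfold_cv_error gives the leave-one-out variant discussed next.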

We can see that 5-fold cross validation picks p = 3, rather than the optimal p = 12. Clearly, cross validation is no substitute for a large test set. However, if we only have a limited training set, it is often the best option available.

In this case, cross-validation selected a simpler model than optimal. What do we expect to happen on average? Notice that with 60 points, 5-fold cross validation effectively tries to pick the polynomial that makes the best bias-variance tradeoff for 48 (60 · 4/5) points. If we had done 10-fold cross validation, it would instead try to pick the best polynomial for 54 (60 · 9/10) points. Thus, cross validation biases towards simpler models. We could reduce this bias to the minimum possible by doing 60-fold cross validation. This has a special name: leave-one-out cross validation. For complex models this can be expensive, though there has been research on clever methods for doing leave-one-out cross validation without needing to recompute the full model. In practice, we usually don't see much benefit from doing more than 5- or 10-fold cross validation. In this particular example, 60-fold cross validation still selects p = 3.

[Figure: leave-one-out and 5-fold cross-validation error as a function of the polynomial degree p.]

Cross validation is a good technique, but it doesn't work miracles: there is only so much information in a small dataset. People often ask questions like:

- "If I do clustering, how many clusters should I use?"
- "If I fit a polynomial, what should its order be?"
- "If I fit a neural network, how many hidden units should it have?"
- "If I fit a mixture of Gaussians, how many components should it have?"
- "How do I set my regularization constant?"

- "I want to fit a Support Vector Machine classifier to this dataset. Should I use a Gaussian, polynomial, or linear kernel?"

A reasonable answer to all of these is cross-validation.

Though cross validation is extremely widely used, it has proven difficult to show that it actually works better than just using a single hold-out set of 1/K of the data! There is even a bounty on the question. See:

http://hunch.net/?p=29
http://hunch.net/?p=32