Linear Feature Engineering

3 Overfitting

Remember our dataset from last time. We have a bunch of inputs x_i and corresponding outputs y_i. In the previous lecture notes, we considered how to fit polynomials of different orders, as

    y = w_0 + w_1 x + w_2 x^2 + ... + w_p x^p.

Now, a natural question to ask is: what value of p should we use? To start off, let's simply try fitting this data by least squares with a range of p.
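The notes do not fix a programming language, but as a concrete sketch, here is one way to set this up in Python/NumPy. The helper names (poly_features, fit_poly, mse) are ours, and we take the error R to be the mean squared error; adjust to however R was defined in the earlier notes.

    import numpy as np

    def poly_features(x, p):
        # Features [1, x, x^2, ..., x^p] for a vector of scalar inputs x.
        return np.vander(np.asarray(x, dtype=float), p + 1, increasing=True)

    def fit_poly(x, y, p):
        # Least-squares fit of y ~ w_0 + w_1 x + ... + w_p x^p; returns the weights.
        X = poly_features(x, p)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w

    def mse(x, y, w):
        # Mean squared error of the fitted polynomial on the pairs (x, y).
        X = poly_features(x, len(w) - 1)
        return float(np.mean((X @ w - y) ** 2))

Fitting with a range of p is then just a loop: for p in range(P + 1): w = fit_poly(x, y, p).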


If we plot the error R as a function of p, we see that it just keeps decreasing as p grows. This doesn't seem satisfying. Should we always use the highest-order polynomial possible?

To think about this more, let's suppose for the moment that we happened to have an extra 100 points lying around in a test set. We can then see the errors on this test set.


If we plot the error R as a function of p, we see the best test error is found at p = 2, followed by a rapid increase.

So, this is all well and good. The problem, of course, is that we generally won't have a gigantic extra amount of data sitting around. And if we did, would we want to use it for training or for testing? In reality, we have one dataset, which we will, ourselves, split into a training set and a test set. We might formalize this as a type of algorithm.

Algorithm: Polynomial selection using a test set

    Input some data.
    Take some fraction (e.g. 30%) of the data as the test set; keep the rest as the training set.
    For p = 0, 1, ..., P:
        Fit a p-th order polynomial to the training data.
        Evaluate the error R_test(p) on the test set.
    Find p* = argmin_p R_test(p).
    Re-fit a polynomial of order p* to the entire dataset, and output it.

Note, of course, that this kind of algorithm is not specific to polynomials. In any kind of application where one is considering different classes of predictors, some of which are more powerful than others, this kind of strategy can be used.
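A minimal sketch of this procedure, reusing fit_poly and mse from the earlier sketch (the random shuffling, the 30% test fraction, and the name select_order_holdout are illustrative choices, not something the notes prescribe):

    def select_order_holdout(x, y, max_p, test_frac=0.3, seed=0):
        # Polynomial selection using a single held-out test set.
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        perm = np.random.default_rng(seed).permutation(len(x))
        n_test = int(round(test_frac * len(x)))
        test, train = perm[:n_test], perm[n_test:]

        # Fit each candidate order on the training part, score it on the test part.
        test_errors = [mse(x[test], y[test], fit_poly(x[train], y[train], p))
                       for p in range(max_p + 1)]
        best_p = int(np.argmin(test_errors))

        # Re-fit the chosen order on the entire dataset.
        return best_p, fit_poly(x, y, best_p)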

The previous algorithm is all well and good. However, can we think of any weaknesses of it?

    How should we split between training and test set? A 70/30 split is not obviously optimal.
    It will, on average, pick a p that is too small. (Suppose we have 100 data points. If we use 70% of the data as the training set, the procedure intuitively tries to find the p that generalizes best from 70 points, not 100.)

3.1 Error estimation using cross-validation

Let's start from the beginning, and try to build up approaches to estimate errors. For example, take our original dataset. Suppose that we want to estimate the test error of an affine function we fit to it. The simplest way to do this would be a hold-out set. This means we

1. Take some subset of the data as a test set.
2. Take the rest as the training set.
3. Fit a function to the training set.
4. Evaluate the error on the test set.

Exercise 21. Make a procedure to estimate the errors of an affine function fit to the simple dataset, holding out the first 2 points. Verify that you compute a training error of 0.0513 and a test error of 0.0899.
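A sketch of such a procedure, again reusing fit_poly and mse from above (the simple dataset itself comes from the earlier notes and is assumed to be available as arrays x and y, so the particular numbers quoted in the exercise are not reproduced here):

    def holdout_errors(x, y, n_test=2):
        # Hold out the first n_test points, fit an affine function (p = 1)
        # to the rest, and report the training and test errors.
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        x_test, y_test = x[:n_test], y[:n_test]
        x_train, y_train = x[n_test:], y[n_test:]
        w = fit_poly(x_train, y_train, 1)     # affine: y ~ w_0 + w_1 x
        return mse(x_train, y_train, w), mse(x_test, y_test, w)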

This is not a particularly good estimate. Notice that, above, using the large test set, we had a test error of 0.0245, while here we happen to get a much larger estimate. Notice also that (as we would expect) the training error is below the true test error.

Now, of course, the particular choice of using the first two points as a hold-out set was totally arbitrary. Why not, say, the last two? A key idea of cross-validation is that we can use multiple test sets, and average the results. The procedure works as follows.

Algorithm: K-fold cross validation

1. Divide the data into K chunks.
2. For k = 1, 2, ..., K:
   (a) Take the k-th chunk of the data as a test set.
   (b) Take the other K - 1 chunks as the training set.
   (c) Fit a function to the training set.
   (d) Evaluate the error on the test set.
3. Return the average of all the errors calculated in step 2(d).

Exercise 22. Make a procedure to estimate the errors of an affine function on the simple dataset, using 5 different folds. Check that the mean training error is 0.0822 and the mean test error is 0.0427.

We can look at how much variability there is across the different folds in the following graphs. These show the true training and test errors using the full set of 10 points for training and the 100 points for testing.

3.2 K-Fold Cross-Validation

Now we have a good method for estimating the errors. How can we use this to find the best polynomial? The idea is simple:

Algorithm: Polynomial selection using K-fold cross validation

    For each p, estimate the test error of fitting a p-th order polynomial using cross validation.
    Pick the polynomial order p that has the smallest estimated test error.
    Re-fit a polynomial of that order to the entire dataset, and output it.

Doing N-fold cross-validation on a dataset with N points has the special name of leave-one-out cross-validation.
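Putting the pieces together, here is a sketch of the cross-validated error estimate and the selection loop, still in terms of the fit_poly and mse helpers from the earlier sketches (the contiguous-chunk splitting and the K = 5 default are illustrative assumptions):

    def cv_errors(x, y, p, K=5):
        # K-fold cross-validation estimate of the training and test error
        # of a p-th order polynomial fit.
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        folds = np.array_split(np.arange(len(x)), K)   # K contiguous chunks
        train_errs, test_errs = [], []
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            w = fit_poly(x[train], y[train], p)
            train_errs.append(mse(x[train], y[train], w))
            test_errs.append(mse(x[test], y[test], w))
        return float(np.mean(train_errs)), float(np.mean(test_errs))

    def select_order_cv(x, y, max_p, K=5):
        # Pick the order with the smallest cross-validated test error,
        # then re-fit that order on the entire dataset.
        cv_test = [cv_errors(x, y, p, K)[1] for p in range(max_p + 1)]
        best_p = int(np.argmin(cv_test))
        return best_p, fit_poly(x, y, best_p)

Looping cv_errors over p = 0, 1, ..., 10 is essentially what Exercise 23 below asks for; the exact numbers depend on the dataset and on how the folds are assigned.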

Exercise 23. On the above dataset, for p = 0, 1, ..., 10, estimate the test error using 5-fold cross validation. Verify that you get estimated errors as in the following table:

    order   train error   test error
    0       0.4155        0.1107
    1       0.0822        0.0427
    2       0.0435        0.0324
    3       0.0411        0.0419
    4       0.0364        0.0523
    5       0.0139        0.0499
    6       0.0077        0.2376
    7       0.0000        13.6933
    8       0.0000        25.3749
    9       0.0000        39.2409
    10      0.0000        49.9065

We can visualize this with the following plot, and compare to our earlier plot using the large test set. Cross-validation does a reasonable job. (In particular, in this example, both methods would pick p = 2.) However, it does not work miracles. As practitioners tend to say, more data beats a clever algorithm.

3.3 Discussion

The best way of doing these types of things remains to some degree an active research problem. For more information, see Andrew Moore's tutorial "Cross-validation for detecting and preventing overfitting".