RESAMPLING METHODS. Chapter 05


1 RESAMPLING METHODS Chapter 05

2 Outline
Cross Validation
The Validation Set Approach
Leave-One-Out Cross Validation
K-fold Cross Validation
Bias-Variance Trade-off for k-fold Cross Validation
Cross Validation on Classification Problems

3 What are resampling methods?
Tools that involve repeatedly drawing samples from a training set and refitting a model of interest on each sample, in order to obtain more information about the fitted model.
Model Assessment: estimate test error rates.
Model Selection: select the appropriate level of model flexibility.
They are computationally expensive, but these days we have powerful computers!
Two resampling methods: Cross Validation and Bootstrapping.

4 5.1.1 A Typical Approach: The Validation Set Approach
Suppose that we would like to find the set of variables that gives the lowest test (not training) error rate.
If we have a large data set, we can achieve this goal by randomly splitting the data into training and validation (testing) parts.
We would then use the training part to build each possible model (i.e., the different combinations of variables) and choose the model that gives the lowest error rate when applied to the validation data.
(Figure: the data split into Training Data and Testing Data.)

5 Example: Auto Data
Suppose that we want to predict mpg from horsepower.
Two models:
mpg ~ horsepower
mpg ~ horsepower + horsepower^2
Which model gives a better fit?
Randomly split the Auto data set into training (196 obs.) and validation (196 obs.) parts.
Fit both models using the training data set.
Then evaluate both models using the validation data set.
The model with the lowest validation (testing) MSE is the winner!
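
A minimal sketch of this comparison in Python with scikit-learn (a stand-in for the R workflow used in ISLR; the column names mpg and horsepower match the Auto data, but the file path Auto.csv is an assumption):

```python
# Sketch: validation set approach on the Auto data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

auto = pd.read_csv("Auto.csv")                       # hypothetical path
X, y = auto[["horsepower"]], auto["mpg"]

# Random 196/196 split, as on the slide.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=196, random_state=1)

for degree in (1, 2):                                # linear vs. quadratic model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree {degree}: validation MSE = {mse:.2f}")
```

The model with the lower printed validation MSE would be chosen; with a different random_state the split changes and so do the MSEs, which is exactly the variability discussed on the next slide.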

6 Results: Auto Data
Left: validation error rate for a single split.
Right: the validation method repeated 10 times, each time with a different random split.
There is a lot of variability among the MSEs. Not good! We need more stable methods.

7 The Validation Set Approach
Advantages: simple and easy to implement.
Disadvantages:
The validation MSE can be highly variable.
Only a subset of the observations is used to fit the model (the training data), and statistical methods tend to perform worse when trained on fewer observations.

8 5.1.2 Leave-One-Out Cross Validation (LOOCV)
This method is similar to the Validation Set Approach, but it tries to address the latter's disadvantages.
For each candidate model:
Split the data set of size n into a training set (blue) of size n - 1 and a validation set (beige) of size 1.
Fit the model using the training data.
Validate the model on the single held-out observation and compute the corresponding MSE.
Repeat this process n times, holding out a different observation each time.
The CV estimate for the model is the average of the n MSEs:
$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{MSE}_i$
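
A minimal LOOCV sketch in Python (scikit-learn's LeaveOneOut splitter; the quadratic model and the Auto columns follow the earlier slide, and the file path is again an assumption):

```python
# Sketch: LOOCV estimate for the model mpg ~ horsepower + horsepower^2.
import pandas as pd
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

auto = pd.read_csv("Auto.csv")                       # hypothetical path
X, y = auto[["horsepower"]], auto["mpg"]

model = make_pipeline(PolynomialFeatures(2), LinearRegression())
# One squared error per held-out observation; CV_(n) is their average.
scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print("LOOCV estimate:", -scores.mean())
```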

9 LOOCV vs. the Validation Set Approach
LOOCV has less bias: we repeatedly fit the statistical learning method using training sets that contain n - 1 observations, i.e., almost the entire data set is used each time.
LOOCV produces a less variable MSE estimate: the validation approach yields different MSEs when applied repeatedly due to randomness in the splitting process, whereas performing LOOCV multiple times always gives the same result, because there is no randomness in the splits.
LOOCV is computationally intensive (a disadvantage): we fit each model n times!

10 5.1.3 k-fold Cross Validation
LOOCV is computationally intensive, so we can run k-fold Cross Validation instead.
With k-fold Cross Validation, we divide the data set into K different parts (e.g., K = 5 or K = 10).
We then remove the first part, fit the model on the remaining K - 1 parts, and see how good the predictions are on the left-out part (i.e., compute the MSE on the first part).
We repeat this K times, taking out a different part each time.
Averaging the K different MSEs gives an estimated validation (test) error rate for new observations:
$\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{MSE}_i$
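
The same model comparison with 10-fold CV is a small change to the earlier sketch (illustrative Python, again assuming the hypothetical Auto.csv):

```python
# Sketch: 10-fold CV MSE estimates for the degree-1 and degree-2 models.
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

auto = pd.read_csv("Auto.csv")                       # hypothetical path
X, y = auto[["horsepower"]], auto["mpg"]

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
for degree in (1, 2):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=kfold,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: 10-fold CV MSE = {mse:.2f}")
```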

11 K-fold Cross Validation
(Figure: schematic of the data divided into K folds, each fold held out in turn.)

12 Auto Data: LOOCV vs. K-fold CV
Left: LOOCV error curve.
Right: 10-fold CV run many times; the figure shows the slightly different CV error curves.
LOOCV is a special case of k-fold CV with k = n.
Both are stable, but LOOCV is more computationally intensive!

13 Auto Data: Validation Set Approach vs. K-fold CV Approach
Left: Validation Set Approach.
Right: 10-fold Cross Validation Approach.
Indeed, 10-fold CV is more stable!

14 K-fold Cross Validation on Three Simulated Data Sets
Blue: true test MSE. Black: LOOCV MSE. Orange: 10-fold CV MSE.
Refer to Chapter 2 for the top graphs (Figs. 2.9, 2.10, and 2.11).

15 5.1.4 Bias-Variance Trade-off for k-fold CV
Putting aside the fact that LOOCV is more computationally intensive than k-fold CV: which is better, LOOCV or k-fold CV?
LOOCV has less bias than k-fold CV (when k < n).
But LOOCV has higher variance than k-fold CV (when k < n).
Thus, there is a trade-off between the two.
Conclusion: we tend to use k-fold CV with K = 5 or K = 10. These are the "magic" values of K: it has been empirically shown that they yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

16 5.1.5 Cross Validation on Classification Problems
So far, we have been dealing with CV on regression problems.
We can use cross validation in a classification setting in a similar manner:
Divide the data into K parts.
Hold out one part, fit the model using the remaining data, and compute the error rate on the held-out part.
Repeat K times.
The CV error rate is the average of the K error rates we have computed.
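
A minimal classification sketch (10-fold CV of a logistic regression; the simulated two-feature data set generated here is only a stand-in for the book's simulated data, and the error rate is computed as 1 - accuracy):

```python
# Sketch: k-fold CV error rate for a classifier (logistic regression here).
from sklearn.datasets import make_classification     # stand-in data
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
acc = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="accuracy")
print("10-fold CV error rate:", 1 - acc.mean())
```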

17 CV to Choose the Order of the Polynomial
The data set used is simulated (refer to Fig. 2.13).
The purple dashed line is the Bayes decision boundary.
Bayes error rate: 0.133.

18 Linear logistic regression (degree 1) is not able to capture the Bayes decision boundary (error rate: 0.201).
Quadratic logistic regression does better than linear (error rate: 0.197).

19 Using cubic and quartic predictors, the accuracy of the model improves further: error rates of 0.160 and 0.162, respectively.

20 CV to Choose the Order
Logistic regression (varying polynomial order) and KNN (varying K).
Brown: test error. Blue: training error. Black: 10-fold CV error.
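
A hedged sketch of using 10-fold CV to pick the polynomial order for a logistic regression (illustrative Python; the simulated two-feature data set with label noise stands in for the book's simulated data):

```python
# Sketch: choose the polynomial order of a logistic regression by 10-fold CV.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.15, random_state=0)

cv_error = {}
for degree in range(1, 5):            # degrees 1 through 4, as on the slides
    clf = make_pipeline(PolynomialFeatures(degree), StandardScaler(),
                        LogisticRegression(max_iter=1000))
    acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    cv_error[degree] = 1 - acc.mean()

best = min(cv_error, key=cv_error.get)
print("CV error by degree:", cv_error, "-> chosen degree:", best)
```

The same loop with a KNN classifier and a grid of K values would reproduce the right-hand comparison: the 10-fold CV error curve tracks the test error curve far better than the training error does.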