Statistical Consulting Topics: Using cross-validation for model selection

Cross-validation is a technique that can be used for model evaluation. We often fit a model to a full data set and then perform hypothesis tests on the parameters of interest. But if your main goal is to build a model to be used for prediction, then it would seem like a good idea to split your data into at least two parts, build your model with one part (the training set), and then see how well the model you built performs on the other part (the test set). This is the idea behind cross-validation.
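As a minimal sketch of that idea in R (the data frame my.data and the variables y, x1, x2 are hypothetical placeholders):

## Split the data into a training set and a test set (here roughly 70/30)
set.seed(1)
n <- nrow(my.data)
train.id <- sample(1:n, size = round(0.7 * n))
train <- my.data[train.id, ]
test  <- my.data[-train.id, ]

## Build the model with the training set only
fit <- lm(y ~ x1 + x2, data = train)

## See how well the fitted model predicts the held-out test set
pred <- predict(fit, newdata = test)
mean((test$y - pred)^2)   # test-set MSE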

In cross-validation we often refer to a partition of the data set composed of a training set (the part of the data to which the model is fitted) and a test set (the part of the data on which we evaluate the model that was built using the training set). If the model fits the training set well (e.g., small MSE) but does a poor job on the test set, that says we have overfit the model to the training set, and it will not perform well on a new data set (it will not do a good job at prediction). Validating a chosen model using data that were not used to formulate the model preserves the integrity of the statistical inference.

Cross-validation (CV) methods

Holdout method: The simplest method. The data set is partitioned into two sets: a training set and a test set. The disadvantage is that the evaluation may differ substantially depending on how the division is made.

K-fold cross-validation: The data set is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k - 1 subsets are put together to form the training set. 5-fold or 10-fold CV is commonly used. The disadvantage is that the process has to be rerun k times (computation).
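A minimal k-fold loop written by hand in R (again with a hypothetical data frame my.data and predictors x1, x2; here k = 10):

set.seed(1)
k <- 10
n <- nrow(my.data)
fold <- sample(rep(1:k, length.out = n))   # randomly assign each row to one of k folds

cv.mse <- numeric(k)
for (j in 1:k) {
  train <- my.data[fold != j, ]            # the other k - 1 folds form the training set
  test  <- my.data[fold == j, ]            # fold j is the test set
  fit   <- lm(y ~ x1 + x2, data = train)
  pred  <- predict(fit, newdata = test)
  cv.mse[j] <- mean((test$y - pred)^2)
}
mean(cv.mse)   # k-fold CV estimate of the prediction error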

Leave-one-out cross-validation: Essentially, k-fold cross-validation taken to its extreme. Each test set consists of a single observation, and the process is run n times.

Selection criterion: cross-validation error, the predicted residual sum of squares (PRESS). Define $e_{(ij)} = y_{ij} - \hat{y}_{(ij)}$ as the residual of the $i$th observation in subset $j$, based on a model fitted with the $j$th subset dropped from the data, where $i = 1, 2, \ldots, n_j$ and $j = 1, 2, \ldots, k$. Then

$$\mathrm{PRESS} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} e_{(ij)}^2$$

$$\mathrm{PRESS} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} e_{(ij)}^2$$

PRESS provides a measure of the fit of a model to a sample of observations that were not themselves used to estimate the model. Select the model with the smallest PRESS over all models in the candidate pool.
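For instance, here is a sketch of choosing among a small candidate pool by k-fold PRESS (my.data and the candidate formulas are hypothetical placeholders):

set.seed(1)
k <- 10
fold <- sample(rep(1:k, length.out = nrow(my.data)))

## PRESS for one candidate model: sum of squared out-of-fold residuals
press <- function(form) {
  sq.err <- numeric(0)
  for (j in 1:k) {
    fit  <- lm(form, data = my.data[fold != j, ])
    pred <- predict(fit, newdata = my.data[fold == j, ])
    sq.err <- c(sq.err, (my.data$y[fold == j] - pred)^2)
  }
  sum(sq.err)
}

## Evaluate the candidate pool and pick the model with the smallest PRESS
candidates <- list(y ~ x1, y ~ x2, y ~ x1 + x2)
sapply(candidates, press)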

What about other commonly used variable selection (or model selection) procedures? How does cross-validation compare to those?

Farm injuries and medications study (n = 625).

Variable      Values                                  Description
INJURY        no injury = 0, 1 or more injuries = 1   farmers' self-reported injuries (response variable)
FARMID        102, ..., 912                           identifies each farmer
Farm acres    0, ..., 3320                            number of acres the farmer owns and rents
Hours week    1, 2, ..., 6                            average number of hours worked
gen health    Excellent = 1, Very Good = 2,           status of the farmer's health
              Good = 3, Fair/Poor = 4                 (entered as a continuous variable)
TakingMeds    Yes = 1, No = 2                         current use of any type of prescription or over-the-counter medication
EXP1 dust     Yes = 1, No = 2                         exposure to high levels of dust
EXP2 noise    Yes = 1, No = 2                         exposure to loud noise
EXP3 chem     Yes = 1, No = 2                         exposure to pesticides or other chemicals
EXP4 lift     Yes = 1, No = 2                         exposure to lifting heavy objects

Table 1. Description of variables

The candidate models included all possible subsets of these variables (no interactions). Variables chosen by each method:

Method of variable selection    Variables selected
CV                              EXP1 dust, EXP3 chem
AIC                             EXP1 dust, EXP3 chem, EXP4 lift
BIC                             none
LASSO                           EXP1 dust, EXP4 lift
SCAD                            EXP4 lift

CV, AIC, and BIC were calculated for all possible models (not a step-wise procedure). In R, you can calculate the CV prediction error for a linear or generalized linear model using the cv.glm function in the boot package (you must use glm() to fit the model).

## Fit the model to the sample:
> glm.out = glm(y ~ x1 + x2 + x3 + x4, family = binomial, data = my.data)

## Get the CV error from k-fold CV:
> library(boot)
> cv.err = cv.glm(my.data, glm.out, K = 10)$delta[1]
> cv.err
[1] 0.1631616

There is also a bias-corrected CV error available in the cv.glm object:
> cv.err.c = cv.glm(my.data, glm.out, K = 10)$delta[2]

K-fold cross-validation has an upward bias for the true out-of-sample error, and the bias increases as K decreases. If you use leave-one-out cross-validation, the raw CV error and the corrected CV error should be similar. We know that if we used the training sample itself to estimate the error, we would have a downward bias (it overfits the sample), so the corrected estimate should fall somewhere between these two.

Dichotomous response

When the response is dichotomous, the fitted model (e.g., logistic regression) provides a probability that Y = 1. When the model is used for prediction, a threshold of 0.5 is usually used for predictive classification (predicted probability > 0.5 is classified as 1). The leave-one-out CV predicted probability for each observation can be used to classify the observations as either 0 or 1, and the CV classifications can be compared to the observed outcomes to establish a misclassification rate. ROC curves can also be plotted to evaluate the predictive ability of the model.
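A short sketch of a CV misclassification rate using cv.glm and its cost argument (continuing the hypothetical model fit above; mis.cost is a name chosen here for illustration):

> library(boot)
> glm.out = glm(y ~ x1 + x2 + x3 + x4, family = binomial, data = my.data)

## Cost function: proportion of observations misclassified at the 0.5 threshold
> mis.cost = function(obs, pred.prob) mean(abs(obs - pred.prob) > 0.5)

## Leave-one-out CV misclassification rate (K defaults to n)
> cv.glm(my.data, glm.out, cost = mis.cost)$delta[1]

For an ROC curve, the leave-one-out predicted probabilities (collected in a manual loop, since cv.glm does not return them) can be passed to an ROC routine such as the roc() function in the pROC package.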

[NOTE: If we use the guideline that you should have 6 to 10 times as many cases (n) as predictor variables when fitting the model, then the sample size of the original data set may need to be large if using a holdout method.]