Statistical Consulting Topics: Using Cross-Validation for Model Selection

Cross-validation is a technique that can be used for model evaluation. We often fit a model to a full data set and then perform hypothesis tests on the parameters of interest. But if the main goal is to build a model to be used for prediction, then it is a good idea to:

- Split the data into at least two parts.
- Build the model with one part (the training set).
- See how well the fitted model performs on the other part (the test set).

This is the idea behind cross-validation.
In cross-validation we refer to a partition of the data set into a training set (the part of the data to which the model is fitted) and a test set (the part of the data on which we evaluate the model that was built using the training set). If the model fits the training set well (e.g., small MSE) but does a poor job on the test set, that tells us we have overfit the model to the training set: it will not perform well on new data (and will not do a good job at prediction). Validating a chosen model, using data that were not used to formulate the model, preserves the integrity of the statistical inference.
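A minimal sketch of the idea in R (the data frame my.data, its continuous response y, and the 70/30 split fraction are all hypothetical choices, not from the study discussed later):

## Split the data into a training set and a test set
set.seed(123)                                   ## make the random split reproducible
n <- nrow(my.data)
train.id <- sample(1:n, size = round(0.7 * n))  ## 70% of rows for training
train <- my.data[train.id, ]
test  <- my.data[-train.id, ]

## Build the model on the training set only
fit <- lm(y ~ ., data = train)

## Compare the fit on the training set vs. the test set
mse.train <- mean((train$y - predict(fit, newdata = train))^2)
mse.test  <- mean((test$y  - predict(fit, newdata = test))^2)
## A test MSE much larger than the training MSE suggests overfitting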
Cross-validation (CV) methods

Holdout method
- The simplest method. The data set is partitioned into two sets: a training set and a test set.
- Disadvantage: the evaluation may be significantly different depending on how the division is made.

K-fold cross-validation
- The data set is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k - 1 subsets are put together to form the training set. 5-fold or 10-fold CV is commonly used (a hand-rolled sketch follows).
- Disadvantage: the fitting process has to be re-run k times (computation).
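A hand-rolled sketch of k-fold CV for a linear model (again assuming the hypothetical my.data with response y; in practice a packaged routine such as cv.glm, shown later, is usually preferred):

## 5-fold CV by hand
set.seed(123)
k <- 5
n <- nrow(my.data)
fold <- sample(rep(1:k, length.out = n))   ## randomly assign each row to a fold

cv.mse <- numeric(k)
for (j in 1:k) {
  train <- my.data[fold != j, ]            ## the other k - 1 folds form the training set
  test  <- my.data[fold == j, ]            ## fold j is the test set
  fit   <- lm(y ~ ., data = train)
  cv.mse[j] <- mean((test$y - predict(fit, newdata = test))^2)
}
mean(cv.mse)                               ## overall k-fold CV error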
Leave-one-out cross-validation
- Essentially, k-fold cross-validation taken to its extreme: each test set consists of 1 observation, and the process is run n times.

Selection criterion: cross-validation error

Predicted residual sum of squares (PRESS). Define

e_{(ij)} = y_{ij} - \hat{y}_{(ij)}

as the residual of the i-th observation in subset j, based on a model fitted with the j-th subset dropped from the data, where i = 1, 2, ..., n_j and j = 1, 2, ..., k. Then

\mathrm{PRESS} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} e_{(ij)}^2
PRESS provides a measure of the fit of a model to a sample of observations that were not themselves used to estimate the model. Select the model with the smallest PRESS over all models in the candidate pool.
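For a linear model fit by least squares, PRESS does not require refitting the model n times: the leave-one-out residual equals the ordinary residual divided by one minus the leverage. A sketch (the model formula and data frame are hypothetical, and this shortcut applies only to least-squares linear models):

## PRESS via the leverage shortcut: e_(i) = e_i / (1 - h_i)
fit   <- lm(y ~ x1 + x2, data = my.data)
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
press   ## compare across candidate models; smaller is better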
What about other commonly used variable selection (or model selection) procedures? How does cross-validation compare to those?

Farm injuries and medications study (n = 625).

Variable      Values                      Description
INJURY        no injury = 0,              farmers' self-reported injuries
              1 or more injuries = 1      (response variable)
FARMID        102, ..., 912               identifies each farmer
Farm_acres    0, ..., 3320                number of acres the farmer owns and rents
Hours_week    1, 2, ..., 6                average number of hours worked
gen_health    Excellent = 1,              status of the farmer's health
              Very Good = 2,              (entered as a continuous variable)
              Good = 3, Fair/Poor = 4
TakingMeds    Yes = 1, No = 2             current use of any type of prescription or
                                          over-the-counter medication
EXP1_dust     Yes = 1, No = 2             exposure to high levels of dust
EXP2_noise    Yes = 1, No = 2             exposure to loud noise
EXP3_chem     Yes = 1, No = 2             exposure to pesticides or other chemicals
EXP4_lift     Yes = 1, No = 2             exposure to lifting heavy objects

Table 1. Description of variables
The candidate models included all possible subsets of these variables (no interactions). Variables chosen by each method:

Method of Variable Selection    Variables Selected
CV                              EXP1_dust, EXP3_chem
AIC                             EXP1_dust, EXP3_chem, EXP4_lift
BIC                             none
LASSO                           EXP1_dust, EXP4_lift
SCAD                            EXP4_lift

CV, AIC, and BIC were calculated for all possible models (not a step-wise procedure). In R, you can calculate the CV prediction error for a linear or generalized linear model using the cv.glm function in the boot package (you must use glm() to fit the model).
## Fit the model to the sample:
> glm.out = glm(y ~ x1 + x2 + x3 + x4, family = binomial, data = my.data)

## Get the CV error from K-fold CV:
> library(boot)
> cv.err = cv.glm(my.data, glm.out, K = 10)$delta[1]
> cv.err
[1] 0.1631616

There is also a bias-corrected CV error available in the cv.glm object; since the folds are drawn at random, extract both components from the same call:

> cv.out = cv.glm(my.data, glm.out, K = 10)
> cv.err.c = cv.out$delta[2]

K-fold cross-validation has an upward bias for the true out-of-sample error, and the bias increases as K decreases. If you use leave-one-out cross-validation, the raw CV error and the corrected CV error should be similar. We know that if we used the training sample to estimate the error, we would have a downward bias (the model overfits the sample), so the correction should fall somewhere between these two.
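Putting the pieces together, the all-subsets CV search described above could be sketched as follows (the predictor names x1 through x4 are hypothetical; note that cv.glm draws new random folds on each call, so the seed is reset so that every candidate model sees the same folds):

## CV error for every subset of 4 candidate predictors (including the null model)
library(boot)
vars <- c("x1", "x2", "x3", "x4")
subsets <- c(list(character(0)),
             unlist(lapply(1:4, function(m) combn(vars, m, simplify = FALSE)),
                    recursive = FALSE))

cv.err <- sapply(subsets, function(v) {
  rhs <- if (length(v) == 0) "1" else paste(v, collapse = " + ")
  g   <- glm(as.formula(paste("y ~", rhs)), family = binomial, data = my.data)
  set.seed(1)                             ## same folds for every candidate model
  cv.glm(my.data, g, K = 10)$delta[1]
})
subsets[[which.min(cv.err)]]              ## variables in the model with the smallest CV error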
Dichotomous response

When the response is dichotomous, the fitted model (e.g., logistic regression) provides a probability that Y = 1. When the model is used for prediction, a threshold of 0.5 is usually used for predictive classification (predicted probability > 0.5 classifies the observation as 1). The leave-one-out CV predicted probability for each observation can be used to classify the observations as either 0 or 1, and the CV classification can be compared to the observed outcome to establish a misclassification rate. ROC curves can also be plotted to evaluate the predictive ability of the model (a sketch follows).
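A sketch of this for a logistic regression (the data frame my.data, the 0/1 response y, and the predictors x1 through x4 are hypothetical; the ROC curve uses the pROC package, one of several options):

## Leave-one-out CV predicted probabilities
n <- nrow(my.data)
p.cv <- numeric(n)
for (i in 1:n) {
  fit.i   <- glm(y ~ x1 + x2 + x3 + x4, family = binomial,
                 data = my.data[-i, ])    ## drop observation i and refit
  p.cv[i] <- predict(fit.i, newdata = my.data[i, ], type = "response")
}

yhat <- ifelse(p.cv > 0.5, 1, 0)          ## classify with the 0.5 threshold
mean(yhat != my.data$y)                   ## CV misclassification rate

## ROC curve based on the CV predicted probabilities
library(pROC)
plot(roc(my.data$y, p.cv))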
[NOTE: If we use the guideline that you should have 6 to 10 times as many cases (n) as predictor variables when fitting the model, then the sample size of the original data set may need to be large if using a holdout method.]