Lecture 13: Model selection and regularization
Reading: Sections 6.1-6.2.1
STATS 202: Data mining and analysis
October 23, 2017
What do we know so far

In linear regression, adding predictors always decreases the training error or RSS. However, adding predictors does not necessarily improve the test error.

When $n < p$, there is no least squares solution $\hat\beta = (X^T X)^{-1} X^T y$, because $X^T X$ is singular. So, we must find a way to select fewer predictors if we want to use least squares linear regression.
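A quick numerical illustration of the $n < p$ problem (my own sketch, not from the slides; the variable names are arbitrary):

```python
import numpy as np

# With n < p, the Gram matrix X^T X has rank at most n, so it is singular
# and the usual least-squares formula (X^T X)^{-1} X^T y is not defined.
rng = np.random.default_rng(0)
n, p = 5, 10
X = rng.normal(size=(n, p))
print(np.linalg.matrix_rank(X.T @ X))  # at most 5, yet X^T X is 10 x 10
# np.linalg.inv(X.T @ X) is therefore meaningless (or raises LinAlgError)
```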
Best subset selection

Simple idea: let's compare all models with k predictors. There are $\binom{p}{k} = p!/[k!(p-k)!]$ possible models. Choose the model with the smallest RSS. Do this for every possible k.

[Figure: RSS and $R^2$ of the best model of each size, plotted against the number of predictors.]

Naturally, the RSS and $R^2$ improve as we increase k.
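An illustrative sketch of the exhaustive search (my own code, not the lecture's; it assumes a numeric design matrix `X` and response `y` as NumPy arrays):

```python
import numpy as np
from itertools import combinations

def best_subset(X, y):
    """Exhaustive best subset selection: for each size k, return the
    predictor set with the smallest RSS. Feasible only for small p."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        best_rss, best_vars = np.inf, None
        for vars_ in combinations(range(p), k):
            A = np.column_stack([np.ones(n), X[:, vars_]])   # add intercept
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_vars = rss, vars_
        best[k] = (best_vars, best_rss)
    return best
```

With p predictors this fits all $2^p$ models, which is why the exhaustive search is only practical for small p.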
Best subset selection

To choose k, we want to minimize the test error, not the training error. We could use cross-validation, or alternative estimates of the test error:

1. Akaike Information Criterion (AIC) or $C_p$:
$$\frac{1}{n}\left(\mathrm{RSS} + 2k\hat\sigma^2\right),$$
where $\hat\sigma^2$ is an estimate of the irreducible error, typically $\hat\sigma^2 = \frac{1}{n-p-1}\,\mathrm{RSS}$ of the full model with all p predictors.

2. Bayesian Information Criterion (BIC):
$$\frac{1}{n}\left(\mathrm{RSS} + \log(n)\,k\hat\sigma^2\right)$$

3. Adjusted $R^2$:
$$R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{RSS}/(n-k-1)}{\mathrm{TSS}/(n-1)}, \qquad \text{where } \mathrm{TSS} = \sum_i (y_i - \bar y)^2.$$

(A small computational sketch of these criteria follows below.)

How do they compare to cross-validation?
- They are much less expensive to compute.
- They are motivated by asymptotic arguments and rely on model assumptions (e.g. normality of the errors).
- Equivalent criteria exist for other models (e.g. logistic regression).
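The formulas above translate directly into code (my own illustration; the function name and arguments are hypothetical):

```python
import numpy as np

def selection_criteria(rss, tss, n, k, sigma2_hat):
    """Model-selection criteria for a least-squares fit with k predictors.

    rss, tss   : residual and total sums of squares of the candidate model
    n          : number of observations
    k          : number of predictors in the candidate model
    sigma2_hat : estimate of the irreducible error variance, e.g.
                 RSS_full / (n - p - 1) from the full model
    """
    cp = (rss + 2 * k * sigma2_hat) / n                  # AIC / C_p: smaller is better
    bic = (rss + np.log(n) * k * sigma2_hat) / n         # BIC: smaller is better
    adj_r2 = 1 - (rss / (n - k - 1)) / (tss / (n - 1))   # adjusted R^2: larger is better
    return cp, bic, adj_r2
```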
Example

Best subset selection for the Credit dataset.

[Figure: $C_p$, BIC, and adjusted $R^2$ of the best model of each size, plotted against the number of predictors.]
Example

Cross-validation vs. the BIC.

[Figure: square root of BIC, validation-set error, and cross-validation error, each plotted against the number of predictors.]

Recall: In k-fold cross-validation, we can estimate a standard error (i.e. the accuracy) of our test error estimate and apply the one-standard-error rule.
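A small sketch of that rule (a hypothetical helper of my own, not from the slides): given a cross-validation error estimate and its standard error for each model size, choose the smallest model whose error is within one standard error of the minimum.

```python
import numpy as np

def one_standard_error_rule(cv_mean, cv_se, sizes):
    """Pick the smallest model size whose CV error is within one standard
    error of the best (lowest) CV error."""
    cv_mean, cv_se, sizes = map(np.asarray, (cv_mean, cv_se, sizes))
    best = np.argmin(cv_mean)
    threshold = cv_mean[best] + cv_se[best]
    return sizes[cv_mean <= threshold].min()
```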
Stepwise selection methods

Best subset selection has two problems:
1. It is often very expensive computationally: we have to fit $2^p$ models!
2. For a fixed k there are many possible models, so we increase our chances of overfitting (just like in multiple testing!). The model selected has high variance.

In order to mitigate these problems, we can restrict our search space for the best model. This reduces the variance of the selected model at the expense of an increase in bias.
Forward selection

Start with the empty model and, at each step, add the single predictor that most improves the fit (smallest RSS), producing one candidate model of each size.
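An illustrative sketch of greedy forward selection by RSS (my own code, not the lecture's; it assumes NumPy arrays `X` and `y`):

```python
import numpy as np

def forward_selection(X, y):
    """Greedy forward stepwise selection: start from the empty model and add,
    at each step, the predictor that reduces the RSS the most."""
    n, p = X.shape

    def rss(cols):
        A = np.column_stack([np.ones(n), X[:, list(cols)]]) if cols else np.ones((n, 1))
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.sum((y - A @ beta) ** 2)

    selected, remaining, path = [], list(range(p)), []
    while remaining:
        j = min(remaining, key=lambda j: rss(selected + [j]))
        selected.append(j)
        remaining.remove(j)
        path.append((list(selected), rss(selected)))
    return path   # one best model of each size 1, ..., p
```

The best model of each size is then compared using cross-validation or one of the criteria above.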
Forward selection vs. best subset
Backward selection

Start with the full model and, at each step, remove the predictor whose removal hurts the fit the least.
Forward vs. backward selection

You cannot apply backward selection when p > n. Although it seems like they should, they need not produce the same sequence of models.

Example. Let $X_1, X_2 \sim N(0, \sigma)$ be independent, and let
$$X_3 = X_1 + 3X_2, \qquad Y = X_1 + 2X_2 + \epsilon.$$
Regress Y onto $X_1, X_2, X_3$.

- Forward: $\{X_3\} \to \{X_3, X_2\} \to \{X_3, X_2, X_1\}$
- Backward: $\{X_1, X_2, X_3\} \to \{X_1, X_2\} \to \{X_2\}$
Other stepwise selection strategies

- Mixed stepwise selection: do forward selection, but at every step, remove any variables that are no longer necessary.
- Forward stagewise selection: like forward selection, but when introducing a new variable, do not adjust the coefficients of the previously added variables (a sketch follows below).
- ...
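One plausible reading of that stagewise description, as a sketch of my own (not the lecture's code): each new coefficient is fit to the current residual, and earlier coefficients are never revisited. It assumes centered NumPy arrays `X` and `y`.

```python
import numpy as np

def forward_stagewise(X, y):
    """Forward stagewise selection: add one variable at a time, fitting its
    coefficient to the current residual and leaving earlier coefficients fixed."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)          # work with centered predictors
    residual = y - y.mean()          # start from the intercept-only fit
    beta = np.zeros(p)
    order, remaining = [], list(range(p))
    while remaining:
        def fit(j):
            # simple regression of the current residual on predictor j
            b = Xc[:, j] @ residual / (Xc[:, j] @ Xc[:, j])
            return j, b, np.sum((residual - b * Xc[:, j]) ** 2)
        j, b, _ = min((fit(j) for j in remaining), key=lambda t: t[2])
        beta[j] = b
        residual = residual - b * Xc[:, j]
        order.append(j)
        remaining.remove(j)
    return order, beta
```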
Shrinkage methods

The idea is to perform a linear regression, while regularizing or shrinking the coefficients $\hat\beta$ toward 0.

Why would shrunk coefficients be better? This introduces bias, but may significantly decrease the variance of the estimates. If the latter effect is larger, this would decrease the test error.

There are also Bayesian motivations to do this: the prior tends to shrink the parameters.
Ridge regression

Ridge regression solves the following optimization:
$$\min_\beta \; \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j}\Big)^2 + \lambda \sum_{j=1}^p \beta_j^2$$

The first term is the RSS of the model; the second is the squared $\ell_2$ norm of $\beta$, or $\|\beta\|_2^2$ (shown in blue and red, respectively, on the slide).

The parameter $\lambda$ is a tuning parameter. It modulates the importance of fit vs. shrinkage. We find an estimate $\hat\beta^R_\lambda$ for many values of $\lambda$ and then choose it by cross-validation. Fortunately, this is no more expensive than running a least-squares regression.
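A minimal sketch of the ridge estimator over a grid of λ values (my own illustration, not the lecture's code; it uses the closed-form solution and leaves the intercept unpenalized by centering X and y):

```python
import numpy as np

def ridge_path(X, y, lambdas):
    """Closed-form ridge coefficients over a grid of lambda values.
    Centering X and y keeps the intercept out of the penalty; the intercept
    itself is then the mean of y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    coefs = []
    for lam in lambdas:
        beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
        coefs.append(beta)
    return np.array(coefs)   # shape (len(lambdas), p)

# Example usage: paths = ridge_path(X, y, np.logspace(-2, 4, 100))
```

Each solve is comparable in cost to a single least-squares fit, which is why computing the whole path is cheap.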
Ridge regression

In least-squares linear regression, scaling the variables has no effect on the fit of the model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p.$$
Multiplying $X_1$ by c can be compensated by dividing $\hat\beta_1$ by c, i.e. after doing this we have the same RSS.

In ridge regression, this is not true.

In practice, what do we do? Scale each variable so that it has sample variance 1 before running the regression. This prevents penalizing some coefficients more than others.
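A one-line version of that preprocessing step (a sketch of my own; whether to also center the columns is a separate convention, since centering does not change the sample variance):

```python
import numpy as np

def scale_to_unit_variance(X):
    """Scale each column to sample variance 1 before a ridge fit."""
    return X / X.std(axis=0, ddof=1)   # ddof=1: sample (not population) variance
```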
Example. Ridge regression

Ridge regression on the Credit dataset.

[Figure: standardized coefficients for Income, Limit, Rating, and Student as a function of $\lambda$ (left) and of $\|\hat\beta^R_\lambda\|_2 / \|\hat\beta\|_2$ (right).]
Bias-variance tradeoff

In a simulation study, we compute bias, variance, and test error as a function of $\lambda$.

[Figure: mean squared error components as a function of $\lambda$ (left) and of $\|\hat\beta^R_\lambda\|_2 / \|\hat\beta\|_2$ (right).]

Cross-validation would yield an estimate of the test error.