22s:152 Applied Linear Regression

Chapter 22: Model Selection

In model selection, the idea is to find the smallest set of variables that provides an adequate description of the data. We will consider the available explanatory variables as candidate variables. (Some candidates may be transformations of others.)

Model selection can be challenging. If we have k candidate variables, there are potentially 2^k models to consider (each term being either in or out of a given model). There are many methods for model selection, and we will only talk about a few in this class.

One way to avoid looking at all possible subsets (potentially a very large number of models) is to use a stepwise procedure. For example, consider a backward stepwise method:

1. Start with the largest model under consideration.
2. Choose a measure that quantifies what makes a good model (R^2 is not a good choice; it will just choose the largest model every time).
3. Remove the term whose removal most improves the measure.
4. Continue to remove terms one at a time while each removal still gives a better model.
5. When removing any remaining term would give a worse model, stop the procedure. You've found the best model.

The measure we use to make our choice should consider:

1. The number of explanatory variables in the model (we'll penalize models with too many).
2. The goodness of fit that the model provides.

These express our conflicting interests:
- To describe the data reasonably well (pushes toward more variables).
- To build a model simple enough to be interpretable (pushes toward fewer variables).

Some model selection measures (or criteria)

Adjusted R^2, or Rbar^2:

Rbar^2 = 1 - [RSS/(n-k-1)] / [TSS/(n-1)]

We prefer a model with a large Rbar^2.

Cross-Validation Criterion:

CV = (1/n) * sum_{i=1}^{n} (Y_i - Yhat_(i))^2

where Yhat_(i) is the fitted value for observation i from the model fitted without using observation i. If we use a lot of parameters, we tend to over-fit the data, and we will do poorly at predicting a new Y not in the model-fitting (or training) data set.
We prefer a model with a small CV.
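The two criteria above can be sketched in a few lines of code. Below is a minimal Python sketch on hypothetical toy data (the function names and the data are my own, not part of the crime example used later): adjusted R^2 is computed from RSS and TSS, and the CV criterion is computed by refitting the model n times, each time leaving one observation out.

```python
import numpy as np

def adjusted_r2(y, X, k):
    """Rbar^2 = 1 - [RSS/(n-k-1)] / [TSS/(n-1)], for X with an intercept
    column plus k predictor columns."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - (rss / (n - k - 1)) / (tss / (n - 1))

def loocv(y, X):
    """CV = (1/n) * sum_i (Y_i - Yhat_(i))^2, where each Yhat_(i)
    comes from a fit that excludes observation i."""
    n = len(y)
    sq_errs = []
    for i in range(n):
        keep = np.arange(n) != i          # drop observation i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        sq_errs.append((y[i] - X[i] @ beta) ** 2)
    return np.mean(sq_errs)

# Hypothetical toy data: intercept + 2 predictors, one of them useless.
rng = np.random.default_rng(0)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)
print(adjusted_r2(y, X, k), loocv(y, X))
```

Unlike plain R^2, both measures can get worse when a useless predictor is added, which is what makes them usable for model selection.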
Akaike information criterion (AIC): (this assumes we have normal errors)

AIC = n * log_e(sigma_hat^2) + 2(k+1)

We prefer a model with a small AIC.

Bayesian information criterion (BIC): (this assumes we have normal errors)

BIC = n * log_e(sigma_hat^2) + (k+1) * log_e(n)

We prefer a model with a small BIC.

For both AIC and BIC, more parameters will give a smaller sigma_hat^2, but the last term adds on a penalty related to the number of parameters in the model.

Choosing a best model using AIC in a backward stepwise algorithm:

Example: Crime rate data set

Crime-related and demographic statistics for 47 US states in 1960. The data were collected from the FBI's Uniform Crime Report and other government agencies to determine how the variable crime rate depends on the other variables measured in the study.

VARIABLES
RATE: Crime rate as # of offenses reported to police per million population
Age: The number of males of age 14-24 per 1000 population
S: Indicator variable for Southern states (0 = No, 1 = Yes)
Ed: Mean # of years of schooling x 10 for persons of age 25 or older
Ex0: 1960 per capita expenditure on police by state and local government
Ex1: 1959 per capita expenditure on police by state and local government
LF: Labor force participation rate per 1000 civilian urban males age 14-24
M: The number of males per 1000 females
N: State population size in hundred thousands
NW: The number of non-whites per 1000 population
U1: Unemployment rate of urban males per 1000 of age 14-24
U2: Unemployment rate of urban males per 1000 of age 35-39
W: Median value of transferable goods and assets or family income in tens of $
Pov: The number of families per 1000 earning below 1/2 the median income

Use the step procedure in R to choose a good subset of predictors by subtracting terms one at a time.
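Before turning to the R session, the two formulas can be checked numerically. This is a Python sketch on hypothetical data (not the crime data), taking sigma_hat^2 = RSS/n, the normal-errors MLE that the formulas assume:

```python
import numpy as np

def aic_bic(y, X, k):
    """AIC = n*log(sigma_hat^2) + 2(k+1) and
    BIC = n*log(sigma_hat^2) + (k+1)*log(n), with sigma_hat^2 = RSS/n."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    s2 = rss / n
    aic = n * np.log(s2) + 2 * (k + 1)
    bic = n * np.log(s2) + (k + 1) * np.log(n)
    return aic, bic

# Hypothetical data with the same sample size as the crime data (n = 47).
rng = np.random.default_rng(1)
n = 47
x = rng.normal(size=n)
y = 3 + 2 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
aic, bic = aic_bic(y, X, k=1)
# Both share the n*log(sigma_hat^2) fit term; they differ only in the
# penalty, and for n >= 8 we have log(n) > 2, so BIC penalizes harder.
print(aic, bic)
```

Note these values omit the additive constant (and the counting of sigma_hat^2 itself as a parameter) that R's AIC() function includes, so absolute numbers differ from R's; differences between models, which are all that matter for selection, do not.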
> crime.data=read.delim("crime.txt",sep="\t",header=FALSE)
> dimnames(crime.data)[[2]]=c("RATE","Age","S","Ed","Ex0",
+   "Ex1","LF","M","N","NW","U1","U2","W","Pov")
> attach(crime.data)
> head(crime.data)
   RATE Age S  Ed Ex0 Ex1  LF    M   N  NW  U1 U2   W Pov
1  79.1 151 1  91  58  56 510  950  33 301 108 41 394 261
2 163.5 143 0 113 103  95 583 1012  13 102  96 36 557 194
3  57.8 142 1  89  45  44 533  969  18 219  94 33 318 250
4 196.9 136 0 121 149 141 577  994 157  80 102 39 673 167
5 123.4 141 0 121 109 101 591  985  18  30  91 20 578 174
6  68.2 121 0 110 118 115 547  964  25  44  84 29 689 126

## Fit the model including all candidate variables:
> lm.full.out=lm(RATE ~ Age + S + Ed + Ex0 + Ex1 + LF + M + N
+   + NW + U1 + U2 + W + Pov)

## The vif() function is in the car library.
> vifs=vif(lm.full.out)
> round(vifs,2)
  Age    S   Ed   Ex0   Ex1   LF
 2.70 4.88 5.05 94.63 98.64 3.68
    M    N   NW    U1    U2    W   Pov
 3.66 2.32 4.12  5.94  5.00 9.97  8.41
## The starting AIC is 301.66...
## Remove variables one at a time.
> model.selection=step(lm.full.out)
Start:  AIC=301.66
RATE ~ Age + S + Ed + Ex0 + Ex1 + LF + M + N + NW + U1 + U2 + W + Pov

       Df Sum of Sq     RSS   AIC
- NW    1       6.1 15884.8 299.7
- LF    1      34.4 15913.1 299.8
- N     1      48.9 15927.6 299.8
- S     1     149.4 16028.1 300.1
- Ex1   1     162.3 16041.0 300.1
- M     1     296.5 16175.2 300.5
<none>              15878.7 301.7
- W     1     810.6 16689.3 302.0
- U1    1     911.5 16790.2 302.3
- Ex0   1    1109.8 16988.5 302.8
- U2    1    2108.8 17987.5 305.5
- Age   1    2911.6 18790.3 307.6
- Ed    1    3700.5 19579.2 309.5
- Pov   1    5474.2 21352.9 313.6

## Remove NW and check if we should remove another.
Step:  AIC=299.68
RATE ~ Age + S + Ed + Ex0 + Ex1 + LF + M + N + U1 + U2 + W + Pov

       Df Sum of Sq     RSS   AIC
- LF    1      28.7 15913.4 297.8
- N     1      48.6 15933.4 297.8
- Ex1   1     156.3 16041.0 298.1
- S     1     158.0 16042.8 298.1
- M     1     294.1 16178.9 298.5
<none>              15884.8 299.7
- W     1     820.2 16705.0 300.0
- U1    1     913.1 16797.9 300.3
- Ex0   1    1104.3 16989.1 300.8
- U2    1    2107.1 17991.9 303.5
- Age   1    3365.8 19250.5 306.7
- Ed    1    3757.1 19641.9 307.7
- Pov   1    5503.6 21388.3 311.7

## Remove LF and check if we should remove another.
Step:  AIC=297.76
RATE ~ Age + S + Ed + Ex0 + Ex1 + M + N + U1 + U2 + W + Pov

       Df Sum of Sq     RSS   AIC
- N     1      62.2 15975.6 295.9
- S     1     129.4 16042.8 296.1
- Ex1   1     134.8 16048.2 296.2
- M     1     276.8 16190.2 296.6
<none>              15913.4 297.8
- W     1     801.9 16715.3 298.1
- U1    1     941.8 16855.2 298.5
- Ex0   1    1075.9 16989.4 298.8
- U2    1    2088.5 18001.9 301.6
- Age   1    3407.9 19321.3 304.9
- Ed    1    3895.3 19808.7 306.1
- Pov   1    5621.3 21534.7 310.0

## Remove N and check if we should remove another.
Step:  AIC=295.95
RATE ~ Age + S + Ed + Ex0 + Ex1 + M + U1 + U2 + W + Pov

       Df Sum of Sq     RSS   AIC
- S     1     104.4 16080.0 294.3
- Ex1   1     123.3 16098.9 294.3
- M     1     533.8 16509.4 295.5
<none>              15975.6 295.9
- W     1     748.7 16724.4 296.1
- U1    1     997.7 16973.4 296.8
- Ex0   1    1021.3 16996.9 296.9
- U2    1    2082.3 18057.9 299.7
- Age   1    3425.9 19401.6 303.1
- Ed    1    3887.6 19863.3 304.2
- Pov   1    5896.9 21872.6 308.7

## Remove S and check if we should remove another.
Step:  AIC=294.25
RATE ~ Age + Ed + Ex0 + Ex1 + M + U1 + U2 + W + Pov

       Df Sum of Sq     RSS   AIC
- Ex1   1     171.5 16251.5 292.8
- M     1     563.4 16643.4 293.9
<none>              16080.0 294.3
- W     1     734.7 16814.7 294.4
- U1    1     906.0 16986.0 294.8
- Ex0   1    1162.0 17241.9 295.5
- U2    1    1978.0 18058.0 297.7
- Age   1    3354.5 19434.4 301.2
- Ed    1    4139.1 20219.1 303.0
- Pov   1    6094.8 22174.8 307.4

## Remove Ex1 and check if we should remove another.
Step:  AIC=292.75
RATE ~ Age + Ed + Ex0 + M + U1 + U2 + W + Pov

       Df Sum of Sq     RSS   AIC
- M     1     691.0 16942.5 292.7
<none>              16251.5 292.8
- W     1     759.0 17010.5 292.9
- U1    1     921.8 17173.2 293.3
- U2    1    2018.1 18269.5 296.3
- Age   1    3323.1 19574.5 299.5
- Ed    1    4005.1 20256.5 301.1
- Pov   1    6402.7 22654.2 306.4
- Ex0   1   11818.8 28070.2 316.4

## Remove M and check if we should remove another.
Step:  AIC=292.71
RATE ~ Age + Ed + Ex0 + U1 + U2 + W + Pov

       Df Sum of Sq     RSS   AIC
- U1    1     408.6 17351.1 291.8
<none>              16942.5 292.7
- W     1    1016.9 17959.3 293.4
- U2    1    1548.6 18491.1 294.8
- Age   1    4511.6 21454.1 301.8
- Ed    1    6430.6 23373.0 305.8
- Pov   1    8147.7 25090.1 309.2
- Ex0   1   12019.6 28962.1 315.9

## Remove U1 and check if we should remove another.
Step:  AIC=291.83
RATE ~ Age + Ed + Ex0 + U2 + W + Pov

       Df Sum of Sq   RSS AIC
<none>              17351 292
- W     1      1253 18604 293
- U2    1      1629 18980 294
- Age   1      4461 21812 301
- Ed    1      6215 23566 304
- Pov   1      8932 26283 309
- Ex0   1     15597 32948 320

#########################################
## Procedure stops because removing    ##
## any of the remaining variables      ##
## only increases AIC.                 ##
#########################################

## Get the output from the final chosen model:
> summary(model.selection)

Call:
lm(formula = RATE ~ Age + Ed + Ex0 + U2 + W + Pov)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -618.5028   108.2456  -5.714 1.19e-06 ***
Age            1.1252     0.3509   3.207 0.002640 **
Ed             1.8179     0.4803   3.785 0.000505 ***
Ex0            1.0507     0.1752   5.996 4.78e-07 ***
U2             0.8282     0.4274   1.938 0.059743 .
W              0.1596     0.0939   1.699 0.097028 .
Pov            0.8236     0.1815   4.538 5.10e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.83 on 40 degrees of freedom
Multiple R-squared: 0.7478, Adjusted R-squared: 0.71
F-statistic: 19.77 on 6 and 40 DF, p-value: 1.441e-10

> lm.out=lm(RATE ~ Age + Ed + Ex0 + U2 + W + Pov)
> vif(lm.out)
     Age       Ed      Ex0       U2        W      Pov
2.061942 3.061153 2.875709 1.381671 8.705602 5.559788

You can use the step function with the BIC instead through an option in the step() statement. Setting k = log(n) in the statement changes the criterion to the BIC rather than the AIC (even though some of the output still says AIC).

> model.selection.bic=step(lm.full.out,k=log(47))
##### similar output to previous example... ######
> summary(model.selection.bic)

Call:
lm(formula = RATE ~ Age + Ed + Ex0 + U2 + Pov)

BIC tends to favor smaller models than the AIC does; it has a heavier penalty for using more parameters. The only difference between the two chosen models in this example is that W (for wealth) is also removed from the BIC-chosen model.
You can also use the option direction="forward" in step() to build up a best model, but starting with the full model (backward selection) is generally more reliable.

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -524.3743    95.1156  -5.513 2.13e-06 ***
Age            1.0198     0.3532   2.887 0.006175 **
Ed             2.0308     0.4742   4.283 0.000109 ***
Ex0            1.2331     0.1416   8.706 7.26e-11 ***
U2             0.9136     0.4341   2.105 0.041496 *
Pov            0.6349     0.1468   4.324 9.56e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.3 on 41 degrees of freedom
Multiple R-squared: 0.7296, Adjusted R-squared: 0.6967
F-statistic: 22.13 on 5 and 41 DF, p-value: 1.105e-10
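The backward elimination loop that step() performs can be sketched directly. Below is a minimal Python sketch on hypothetical toy data (the data and names are my own, not the crime data); it mirrors steps 1-5 from the start of the chapter, using the same n*log(RSS/n) + 2p style of AIC that step() reports:

```python
import numpy as np

def aic(y, X):
    """step()-style AIC: n*log(RSS/n) + 2*(number of fitted coefficients)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * X.shape[1]

def backward_step(y, X, names):
    """Repeatedly drop the single predictor whose removal lowers AIC the
    most; stop when every possible removal would raise AIC."""
    cols = list(range(X.shape[1]))
    current = aic(y, X[:, cols])
    while len(cols) > 1:                  # always keep the intercept (col 0)
        candidates = []
        for j in cols[1:]:
            trial = [c for c in cols if c != j]
            candidates.append((aic(y, X[:, trial]), j))
        best_aic, drop = min(candidates)
        if best_aic >= current:           # any removal makes the model worse
            break
        cols.remove(drop)
        current = best_aic
    return [names[c] for c in cols], current

# Hypothetical data: 4 candidate predictors, only x1 and x2 matter.
rng = np.random.default_rng(2)
n = 60
Z = rng.normal(size=(n, 4))
y = 5 + 3 * Z[:, 0] - 2 * Z[:, 1] + rng.normal(size=n)
X = np.column_stack([np.ones(n), Z])
names = ["(Intercept)", "x1", "x2", "x3", "x4"]
kept, final_aic = backward_step(y, X, names)
print(kept)
```

Swapping the 2*X.shape[1] penalty for X.shape[1]*log(n) turns this into backward selection by BIC, exactly as the k=log(n) option does for step().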
We mentioned stepwise procedures as a way to get around looking at every single possible model (there are 2^k possibilities). What about actually considering every possible model? Is this feasible?... It depends on the total number of variables.

We can use the regsubsets function in the leaps library to consider the best model of each possible size (1 predictor, 2 predictors, 3 predictors, etc.).

> library(leaps)
## The '.' below means we'll use all the other variables
## besides RATE as predictors in the largest model.
## For each size of model, we'll only record
## the single best model (nbest=1).
## We'll consider models up to a
## 13 variable model (nvmax=13).
> crime.subsets=regsubsets(RATE ~ ., nbest=1, nvmax=13,
+   data=crime.data)
> summary(crime.subsets)
Subset selection object
Call: regsubsets.formula(RATE ~ ., nbest = 1, nvmax = 13, data = crime.data)
13 Variables (and intercept)
1 subsets of each size up to 13
Selection Algorithm: exhaustive
          Age S   Ed  Ex0 Ex1 LF  M   N   NW  U1  U2  W   Pov
1  ( 1 )  " " " " " " "*" " " " " " " " " " " " " " " " " " "
2  ( 1 )  " " " " " " "*" " " " " " " " " " " " " " " " " "*"
3  ( 1 )  " " " " "*" "*" " " " " " " " " " " " " " " " " "*"
4  ( 1 )  "*" " " "*" "*" " " " " " " " " " " " " " " " " "*"
5  ( 1 )  "*" " " "*" "*" " " " " " " " " " " " " "*" " " "*"
6  ( 1 )  "*" " " "*" "*" " " " " " " " " " " " " "*" "*" "*"
7  ( 1 )  "*" " " "*" "*" " " " " " " " " " " "*" "*" "*" "*"
8  ( 1 )  "*" " " "*" "*" " " " " "*" " " " " "*" "*" "*" "*"
9  ( 1 )  "*" " " "*" "*" "*" " " "*" " " " " "*" "*" "*" "*"
10 ( 1 )  "*" "*" "*" "*" "*" " " "*" " " " " "*" "*" "*" "*"
11 ( 1 )  "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*"
12 ( 1 )  "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*"
13 ( 1 )  "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"

Since Ex0, the 1960 police expenditures, is chosen first and appears in every best model, it may be the most important explanatory variable for crime rate.
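What regsubsets does, finding the best model of each size by exhaustive search, can be sketched by brute force over all 2^k subsets. This is a Python sketch on hypothetical toy data (names and data are my own), workable only for small k:

```python
import itertools
import numpy as np

def rss(y, Xs):
    """Residual sum of squares from a least-squares fit."""
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ beta) ** 2)

def best_subset_per_size(y, Z, names):
    """For each model size, return the predictor set with the smallest RSS.
    RSS is a fair way to compare models of the SAME size; AIC/BIC/CV are
    needed only when comparing across different sizes."""
    n, k = Z.shape
    best = {}
    for size in range(1, k + 1):
        for cols in itertools.combinations(range(k), size):
            Xs = np.column_stack([np.ones(n), Z[:, cols]])
            r = rss(y, Xs)
            if size not in best or r < best[size][0]:
                best[size] = (r, tuple(names[c] for c in cols))
    return best

# Hypothetical data: 5 candidate predictors, only x1 truly matters.
rng = np.random.default_rng(3)
n = 50
Z = rng.normal(size=(n, 5))
y = 1 + 4 * Z[:, 0] + rng.normal(size=n)
names = ["x1", "x2", "x3", "x4", "x5"]
best = best_subset_per_size(y, Z, names)
for size, (r, which) in sorted(best.items()):
    print(size, which)
```

With k = 13 as in the crime data this is 2^13 = 8192 fits, still easy; the approach breaks down quickly as k grows, which is why stepwise shortcuts exist.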
The 5-variable model matches the stepwise BIC best model we saw earlier, and the 6-variable model matches the stepwise AIC best model we saw earlier.

If you use regsubsets to record more than just the single best model of each size, you can see how different the BIC values are for the top X best models of each size in a visual plot...

## We'll keep the best 4 models of each size.
> crime.subsets.2=regsubsets(RATE ~ ., nbest=4, nvmax=8,
+   data=crime.data)
## The next line will give you the graphic
## for models of size 3 to 6.
## The subsets() function is in the car library.
> subsets(crime.subsets.2,min.size=3,max.size=6,legend=FALSE)

[Figure: BIC (roughly -38 to -32) plotted against subset size (3 to 6); each plotted model is labeled by abbreviations of its variables, e.g. A-Ed-E0-U2-P.]

This may be useful if you're deciding between models with similar BIC values, where some models look better for your research in terms of which variables are included. The plot also shows the model with the smallest BIC (if you show all subset sizes).
Here we keep track of the best 4 models of each size, up to a model with all 13 variables included. The graphic becomes a bit hard to read when you look at all recorded models, so subsetting the picture (as on the previous page) is useful.

> crime.subsets.3=regsubsets(RATE ~ ., nbest=4, nvmax=13,
+   data=crime.data)
> subsets(crime.subsets.3,legend=FALSE)

[Figure: BIC (roughly -30 to 0) plotted against subset size (1 to 13) for all recorded models, each labeled by abbreviations of its variables; the labels overlap heavily.]

AIC function

To compare model 1 to model 2 using AIC, you can just use the AIC function directly.

> model.1=lm(RATE ~ Ex0 + Ex1 + LF + M + N)
> model.2=lm(RATE ~ Age + S + Ex0 + Ex1 + U1 + Pov)
> AIC(model.1)
[1] 454.5741
> AIC(model.2)
[1] 445.9138
*Smaller AIC is better.

To compare the two models using BIC...

> AIC(model.1,k=log(47))
[1] 467.5252
> AIC(model.2,k=log(47))
[1] 460.715
*Smaller BIC is better.