The problem we have now is called variable selection or perhaps model selection. There are several objectives.

Doris Bryan
5 years ago

STAT-UB.0103 NOTES for Wednesday 01.APR.04

One of the clues on the library data comes through the VIF values. These VIFs tell you to what extent a predictor is linearly dependent on other predictors. We like these to be close to 1, and we certainly get upset when they exceed 10.

VIF stands for variance inflation factor. The lowest possible value is 1.0, which is considered good. High values represent trouble, in that a variable with a high VIF is likely to be strongly linearly dependent on other independent variables. A related concept is the TOLERANCE, which is provided by some other software. The quantities are related as

$$\text{VIF} = \frac{1}{\text{TOLERANCE}}$$

TOLERANCE values are between 0 (bad) and 1 (good). For a problem in which we regress Y on (A, B, C, D), the TOLERANCE for variable A is defined as

$$\text{TOLERANCE}(A) = 1 - R^2(\text{regression of } A \text{ on } \{B, C, D\})$$

Similarly, the TOLERANCE for variable B is

$$\text{TOLERANCE}(B) = 1 - R^2(\text{regression of } B \text{ on } \{A, C, D\})$$

Thus, we've got a good regression with some problems. The likely possibility is that there are some strong dependencies among A, B, C, and D, since each has a large VIF number. You should observe that the VIF numbers are computed without reference to the dependent variable Y. That is, these VIFs are comments only about the independent variables.

The problem we have now is called variable selection or perhaps model selection. There are several objectives.

(1) We'd like to get a model in which all the variables play an active role. That is, each variable has a small p-value.
(2) We'd like to get a model that's very good for prediction.
(3) We'd like to get a model in which we can figure out the roles of the independent variables in determining Y.

Objectives (1) and (2) are achievable. Objective (3) cannot always be achieved.
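The TOLERANCE and VIF definitions above can be sketched in a few lines of Python. This is only an illustration, not what Minitab does internally: the data in `cols` are made up, the helper names `solve`, `rss`, `tolerance`, and `vif` are hypothetical, and the least-squares fit uses the normal equations in pure Python to keep the example self-contained.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(XtX, Xty)
    return sum((y[i] - sum(X[i][a] * beta[a] for a in range(k))) ** 2 for i in range(n))


def tolerance(predictors, name):
    """TOLERANCE = 1 - R^2 from regressing one predictor on all the others."""
    target = predictors[name]
    others = [col for key, col in predictors.items() if key != name]
    mean = sum(target) / len(target)
    total_ss = sum((v - mean) ** 2 for v in target)
    return rss(others, target) / total_ss  # RSS / TSS equals 1 - R^2


def vif(predictors, name):
    """VIF = 1 / TOLERANCE. Note that the dependent variable Y is never used."""
    return 1.0 / tolerance(predictors, name)


# Hypothetical data: B is nearly a copy of A, C is unrelated to both.
cols = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "B": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1],
    "C": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0],
}
vifs = {name: vif(cols, name) for name in cols}
for name, value in vifs.items():
    print(name, round(value, 2))
```

In this made-up example A and B each get a VIF far above 10 because each is almost a linear function of the other, which is exactly the kind of dependency the notes warn about.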

There are several strategies that can be used. One is to remove independent variables one at a time until the VIF values for the variables remaining are all acceptable. This works sometimes, but many people like to use an automated method. Minitab employs two automated methods, stepwise regression and best subset regression.

In terms of selecting which variables to use, let's look at the automated procedures best subsets and stepwise regression. Let's start with best subsets. This method will list the best (in terms of $R^2$) model for each number of independent variables to be tried. The best subsets option will by default list the two best models for each level of complexity. You will likely find it easier to just list one. You can fix this by

Stat > Regression > Best Subsets > Options > Models of each size to print: (then select 1)

In the best subset regression, the program will show the best model(s) of each level of complexity. Note this:

Quality of fit is measured by $R^2$, $R^2_{\text{adj}}$, and $s_\varepsilon$. Also given is the $C_p$ statistic, discussed just below.

The number of models of each level of complexity to be shown is specified by the user. The Minitab default is 2, but most users just want to see the single best model.

The $C_p$ statistic is frequently used as a measure of fit of any particular model. The p here is 1 + number of independent variables used in the test model. (The phrase "test model" refers to the set of independent variables currently being tried out.) The statistic is defined as

$$C_p = \frac{\text{Residual SUM of squares for test model}}{\text{Residual MEAN square using all the independent variables}} - (n - 2p)$$

It always happens that the model with all the variables has $C_p = p$ exactly. For test models, a good fit is indicated by $C_p \approx p$, with $C_p < p$ even better.

It should be pointed out that $C_p$ measures the quality of a test model relative to the model that uses all available independent variables. It could easily happen that one has a very bad model even while using all the available independent variables. Here's a crude layout:

$C_p \gg p$ (much bigger)    test model is definitely not acceptable
$C_p > p$                    judgment call
$C_p \approx p$              test model is acceptable
$C_p < p$                    test model is excellent

Why does this work? Here's a short digression.

Case 1. The proposed test model is adequate (at least compared to the model that uses all available predictors). That is, the variables that appear among the $K_{\text{all}}$ predictors in the full model but do not appear in the proposed test model are irrelevant. In this case, the residual mean square in the analysis of variance table would still estimate $\sigma^2$. Thus,

Residual mean square in test model estimates $\sigma^2$

or (what is the same thing)

$$\frac{\text{Residual sum of squares in test model}}{n - p} \text{ estimates } \sigma^2.$$

We can rewrite the statement above as

Residual sum of squares in test model estimates $(n - p)\,\sigma^2$.

This final statement corresponds to the numerator of $C_p$. Thus, when the proposed test model is adequate

the numerator of $C_p$ estimates $(n - p)\,\sigma^2$
the denominator of $C_p$ estimates $\sigma^2$
overall, $C_p$ estimates $\dfrac{(n - p)\,\sigma^2}{\sigma^2} - (n - 2p) = p$

Thus $C_p$ should be close to p for any adequate test model.
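The arithmetic behind this argument can be checked with a short Python sketch. It assumes the form $C_p = \text{RSS}_{\text{test}} / \text{MSE}_{\text{full}} - (n - 2p)$, which is the form consistent with the derivation above. The function name `mallows_cp` and all the numbers (sample size, residual sums of squares) are hypothetical, chosen only to illustrate the three situations in the layout.

```python
def mallows_cp(rss_test, p_test, mse_full, n):
    """C_p = RSS(test model) / MSE(full model) - (n - 2p), with p = 1 + number of predictors."""
    return rss_test / mse_full - (n - 2 * p_test)


# Hypothetical setup: n = 30 observations, full model with 4 predictors (p = 5).
n = 30
p_full = 5
rss_full = 120.0
mse_full = rss_full / (n - p_full)  # 120 / 25 = 4.8

# The full model always has C_p equal to its own p.
cp_full = mallows_cp(rss_full, p_full, mse_full, n)

# A test model with 2 predictors (p = 3) whose RSS is barely larger than the full model's.
cp_adequate = mallows_cp(130.0, 3, mse_full, n)  # approximately 3.08, close to p = 3

# A test model with 1 predictor (p = 2) whose RSS is much larger.
cp_inadequate = mallows_cp(400.0, 2, mse_full, n)  # approximately 57.3, far above p = 2

print(cp_full, cp_adequate, cp_inadequate)
```

The second model would count as acceptable in the layout, and the third as definitely not acceptable.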

Case 2. The proposed test model is not adequate. That is, the variables that appear among the $K_{\text{all}}$ predictors in the full model but do not appear in the proposed test model are relevant to the relationship with the dependent variable. In this case, the residual mean square in the analysis of variance table estimates something larger than $\sigma^2$. Why? The residual sum of squares is $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$, where $\hat{Y}_i$ is the i-th fitted value. If the proposed test model is not adequate, then $\hat{Y}_i$ is far away from its best value, and this sum of squares is inflated. Following the logic of Case 1, we see now that $C_p$ will estimate something larger than p.

Thus,

for adequate test models, $C_p$ estimates p
for inadequate test models, $C_p$ estimates something larger than p

In order to select a model, we choose the simplest model (smallest number of predictors) for which $C_p$ is near p. Of course, using the "near p" statement requires some judgment. Finding $C_p < p$ is certainly an indication of a good fit.

You are not morally compelled to obey the dictates of the $C_p$ statistic. It's a very helpful suggestion. You might find the $R^2$ column or the $s_\varepsilon$ column more compelling.

This will be illustrated (first) with the library data and then with the low birth weight example. In the latter, we will keep together the race indicators. These are on separate handouts.

We looked at best subsets as a method for screening out potential regression models. This will list the best model at each level of complexity and leave us with a relatively easy selection job. Stepwise regression pushes this one step further, and actually selects a model for us. Stepwise regression, as performed by Minitab, will start with an empty model (no predictors) and then sequentially add variables to the model as long as it seems that the quality of fit is being improved.
Actually, there is a formal inferential-type step involved in this, requiring that any variable added to the model must do so with a t statistic with a p-value less than or equal to some threshold, called alpha-to-enter, set by default to 0.15. Stepwise regression can even remove a variable from a regression model, if it fails the t criterion; the corresponding threshold on the p-value, called alpha-to-remove, is also set by default to 0.15. Here we'll recommend that these values be set to 0.05, so that the stepwise regression decisions will be more likely to agree with decisions made through best subsets regression.

Here is the set of Minitab commands:

Stat > Regression > Stepwise > Methods > Use alpha values

You might wish to reset the alpha values from 0.15 to 0.05. This tends to make stepwise easier to compare to best subsets.

Before we do this with the low birth weight data set, let's look at the original fitted equation with all variables. This was

The regression equation is
BWT = ... AGE ... LWT ... SMOKE - 49 PTL ... HT ... UI ... FTV - 489 AfAmer - 357 OtherRace

The interpretation of -489 (for AfAmer) must be that, all other things equal, AfAmer babies would be predicted to be 489 g lighter than White babies. Here the White indicator was not used. Similarly, the OtherRace babies would be predicted to be 357 g lighter than White babies.

If you had run this with indicators for White and AfAmer (but not OtherRace), you'd get this:

The regression equation is
BWT = ... AGE ... LWT ... SMOKE - 49 PTL ... HT ... UI ... FTV + 357 White - 133 AfAmer

Now the coefficient on White is 357. This says that, everything else equal, White babies would be predicted to be 357 g heavier than OtherRace babies. Note also that AfAmer babies would be predicted to be [(-133) - 357] = -490, that is, 490 g lighter than White babies, which agrees with the -489 above up to rounding. There are internal consistencies here.

Thus, in using a procedure like Stepwise or Best Subsets, we should keep these indicator sets together!

Here now is the Best Subsets run on the low birth weight data. This is on a separate handout.
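The internal consistency between the two indicator codings discussed above is plain arithmetic, and a short Python sketch can make it explicit. The function `rebase` is a hypothetical helper, not a Minitab feature; it uses only the two race coefficients quoted in the notes and ignores the intercept, which would shift as well.

```python
# Coefficients relative to the omitted White category, as quoted in the notes.
coef_white_baseline = {"AfAmer": -489, "OtherRace": -357}


def rebase(coefs, old_baseline, new_baseline):
    """Re-express indicator coefficients relative to a different omitted category."""
    shift = coefs[new_baseline]
    rebased = {
        name: value - shift
        for name, value in coefs.items()
        if name != new_baseline
    }
    rebased[old_baseline] = -shift  # the old baseline now gets its own indicator
    return rebased


coef_other_baseline = rebase(coef_white_baseline, "White", "OtherRace")
print(coef_other_baseline)  # {'AfAmer': -132, 'White': 357}
```

The rebased AfAmer coefficient of -132 differs from the -133 in the second fitted equation by 1 g, which is rounding in the quoted coefficients. Because the indicators only make sense as a set relative to one omitted category, dropping one of them during selection silently changes the baseline for the others.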

The methods illustrated here, best subsets and stepwise, have some great advantages and disadvantages.

Advantages of best subsets regression and stepwise regression:

- The procedures are automated, so that the user does not have to think about correlations, VIF numbers, residual sums of squares.
- The procedures actually make choices. They are bold enough to actually select a model. (Well, best subset regression only goes as far as selecting the best model for each size, but the user's role thereafter is pretty easy.)
- The procedures do not care about collinearity.
- The procedures (especially stepwise) can be used in cases where there is a great excess of independent variables. Indeed, you can use stepwise regression even when n is less than the number of independent variables! (Minitab will not allow you to do best subsets in this case.)

Disadvantages of best subsets regression and stepwise regression:

- The procedures sometimes select the wrong variables. For example, if A is really the variable that drives Y, you would like the regression to use variable A. If B is a correlated proxy for A, it could very well happen that the procedure uses B and omits A.
- The fit is often too good, in that $s_\varepsilon$ for the selected model may be rather smaller than $\sigma_\varepsilon$, the true-but-unknown noise standard deviation. This occurs because the procedures choose among models which fluctuate around the truth, favoring models with low $s_\varepsilon$.
- The statistical inferential calculations (t, p-values, F) are bogus. They were obtained after several steps of data-torturing and simply do not have the statistical properties of regressions done without all these steps.
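To make concrete what "selecting the best model for each size" involves, here is a minimal Python sketch of a best subsets search. It is an illustration under stated assumptions, not Minitab's algorithm: it tries every subset by brute force, ranks subsets of the same size by residual sum of squares (equivalent to ranking by $R^2$), and reports $C_p$ in the form RSS(test) / MSE(full) - (n - 2p). The data and the helper names `solve`, `rss`, and `best_subsets` are hypothetical.

```python
from itertools import combinations


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(XtX, Xty)
    return sum((y[i] - sum(X[i][a] * beta[a] for a in range(k))) ** 2 for i in range(n))


def best_subsets(predictors, y):
    """For each subset size, return (names, RSS, C_p) of the subset with the lowest RSS."""
    n = len(y)
    names = list(predictors)
    p_full = len(names) + 1
    mse_full = rss([predictors[k] for k in names], y) / (n - p_full)
    best = {}
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            r = rss([predictors[k] for k in subset], y)
            if size not in best or r < best[size][1]:
                p = size + 1
                best[size] = (subset, r, r / mse_full - (n - 2 * p))
    return best


# Hypothetical data: y is driven by x1; x2 is a correlated proxy for x1; x3 is unrelated.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 10.0, 9.0],
    "x3": [5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0, 6.0, 10.0],
}
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2, 18.8, 21.0]

best = best_subsets(data, y)
for size, (subset, r, cp) in best.items():
    print(size, subset, round(r, 3), round(cp, 2))
```

Nothing in the search looks at VIFs or at which variable "really" drives Y, so with noisier data the proxy x2 could win over x1. The number of subsets doubles with each added predictor, which is one reason a brute-force search like this is only practical for a modest number of independent variables.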
