Chapter 10: Variable Selection

November 12, 2018

1 Introduction

1.1 The Model-Building Problem

The variable selection problem is to find an appropriate subset of regressors. It involves two conflicting objectives. (1) We would like the model to include as many regressors as possible, so that the information content in these factors can influence the predicted value of $y$. (2) We would like the model to include as few regressors as possible, because the variance of the prediction $\hat y$ increases as the number of regressors increases; moreover, the more regressors a model contains, the greater the costs of data collection and model maintenance. The process of finding a model that compromises between these two objectives is called selecting the best regression equation.

Unfortunately, there is no unique definition of "best," and different procedures may select different subsets of the candidate regressors as best. In fact, there is usually not a single best equation but rather several equally good ones.

The variable selection problem is often discussed in an idealized setting: it is usually assumed that the correct functional specification of the regressors is known, and that no outliers or influential observations are present. In practice, these assumptions are rarely met, so an iterative approach is often employed: (1) a particular variable selection strategy is applied; (2) the resulting subset model is checked for correct functional specification, outliers, and influential observations. Several iterations may be required to produce an adequate model.

1.2 Consequences of Model Misspecification

Let
$$ y_i = \beta_0 + \sum_{j=1}^{K} \beta_j x_{ij} + \epsilon_i, \quad i = 1, 2, \ldots, n, $$
or equivalently $y = X\beta + \epsilon$, denote the full model, which contains all $K$ candidate regressors. Let
$$ y_i = \beta_0 + \sum_{j=1}^{p-1} \beta_j x_{ij} + \epsilon_i, \quad i = 1, 2, \ldots, n, $$
or equivalently $y = X_p \beta_p + \epsilon$, denote a subset model with $p - 1$ regressors. In this chapter, we assume that all subset models include the intercept term. The properties of the estimates $\hat\beta_p$ and $\hat\sigma_p^2$ can be summarized as follows:

1. $E(\hat\beta_p) = \beta_p + (X_p'X_p)^{-1} X_p' X_r \beta_r = \beta_p + A\beta_r$, where $X_r$ is the matrix of all regressors not included in $X_p$, $\beta_r$ contains the regression coefficients corresponding to $X_r$, and $A = (X_p'X_p)^{-1} X_p' X_r$ is sometimes called the alias matrix. Thus $\hat\beta_p$ is a biased estimate of $\beta_p$ unless $\beta_r = 0$ or $X_p' X_r = 0$.

2. The variances of $\hat\beta_p$ and $\hat\beta$ are $\mathrm{Var}(\hat\beta_p) = \sigma^2 (X_p'X_p)^{-1}$ and $\mathrm{Var}(\hat\beta) = \sigma^2 (X'X)^{-1}$, respectively. The matrix $\mathrm{Var}(\hat\beta_{f,p}) - \mathrm{Var}(\hat\beta_p)$, where $\hat\beta_{f,p}$ denotes the components of the full-model estimate $\hat\beta$ corresponding to $\beta_p$, is positive semidefinite; that is, the variances of the least squares estimates of the parameters in the full model are greater than or equal to the variances of the corresponding estimates in the subset model.

3. Since $\hat\beta_p$ is possibly a biased estimate of $\beta_p$, its precision should be measured in terms of mean square error. Recall that if $\hat\theta$ is an estimate of the parameter $\theta$, the mean square error of $\hat\theta$ is $\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta) + [E(\hat\theta) - \theta]^2$. The mean square error of $\hat\beta_p$ is
$$ \mathrm{MSE}(\hat\beta_p) = \sigma^2 (X_p'X_p)^{-1} + A\beta_r\beta_r'A'. $$

4. The full-model estimate $\hat\sigma_f^2$ is an unbiased estimate of $\sigma^2$. However, for the subset model,
$$ E(\hat\sigma_p^2) = \sigma^2 + \frac{\beta_r' X_r'\,[I - X_p(X_p'X_p)^{-1}X_p']\,X_r \beta_r}{n - p}. $$
That is, $\hat\sigma_p^2$ is generally biased upward as an estimate of $\sigma^2$.
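To see the bias in item 1 concretely, here is a minimal simulation sketch (assuming NumPy; all names and numbers are illustrative, not from the text). It drops a regressor that is correlated with a retained one and compares the average of $\hat\beta_p$ over repeated samples with $\beta_p + A\beta_r$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 2000
beta_p, beta_r = np.array([1.0, 2.0]), np.array([1.5])  # illustrative true coefficients

# Fixed design: intercept, x1 (kept), x2 (omitted, correlated with x1)
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)
Xp = np.column_stack([np.ones(n), x1])      # subset design (intercept + x1)
Xr = x2[:, None]                            # omitted regressor

A = np.linalg.solve(Xp.T @ Xp, Xp.T @ Xr)   # alias matrix (X_p'X_p)^{-1} X_p'X_r
estimates = []
for _ in range(reps):
    y = Xp @ beta_p + Xr @ beta_r + rng.normal(size=n)
    estimates.append(np.linalg.lstsq(Xp, y, rcond=None)[0])

print("mean of beta_p-hat:", np.mean(estimates, axis=0))
print("beta_p + A beta_r :", beta_p + A @ beta_r)
```

The two printed vectors should agree closely, confirming that the subset estimate centers on $\beta_p + A\beta_r$ rather than $\beta_p$.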

1.3 Criteria for Evaluating Subset Regression Models

Coefficient of Multiple Determination

Let $R_p^2$ denote the coefficient of multiple determination for a subset regression model with $p$ terms. Computationally,
$$ R_p^2 = 1 - \frac{SS_{Res}(p)}{SS_T}. $$
Note that there are $\binom{K}{p-1}$ values of $R_p^2$ for each value of $p$, one for each possible subset model of size $p$. $R_p^2$ increases as $p$ increases and is a maximum when $p = K + 1$. Therefore, the analyst uses this criterion by adding regressors to the model up to the point where an additional variable is not useful, in that it provides only a small increase in $R_p^2$.

Aitkin (1974) proposed one solution to this problem: a test by which all subset regression models whose $R^2$ is not significantly different from the $R^2$ of the full model can be identified. Let
$$ R_0^2 = 1 - (1 - R_{K+1}^2)(1 + d_{\alpha,n,K}), \quad \text{where} \quad d_{\alpha,n,K} = \frac{K\,F_{\alpha,K,n-K-1}}{n - K - 1}. $$
Aitkin calls any subset of regressor variables producing an $R^2$ greater than $R_0^2$ an $R^2$-adequate($\alpha$) subset.
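A minimal sketch of Aitkin's cutoff (assuming SciPy; the numbers in the example are illustrative):

```python
from scipy.stats import f

def r2_adequate_cutoff(r2_full, n, K, alpha=0.05):
    """R^2 threshold: subsets with R^2 above this are R^2-adequate(alpha)."""
    d = K * f.ppf(1 - alpha, K, n - K - 1) / (n - K - 1)
    return 1 - (1 - r2_full) * (1 + d)

# Example: full model with K = 10 regressors, n = 40 cases, R^2 = 0.92
print(r2_adequate_cutoff(0.92, n=40, K=10))
```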

Adjusted $R^2$

$$ \mathrm{Adj}R_p^2 = 1 - \frac{SS_{Res}(p)/(n-p)}{SS_T/(n-1)}. $$
The $\mathrm{Adj}R_p^2$ statistic does not necessarily increase as additional regressors are introduced into the model. In fact, it can be shown (Seber, 1977) that if $s$ regressors are added to the model, $\mathrm{Adj}R_{p+s}^2$ will exceed $\mathrm{Adj}R_p^2$ if and only if the partial $F$-statistic for testing the significance of the $s$ additional regressors exceeds 1.

One criterion for selecting an optimum subset model is to choose the model that has a maximum $\mathrm{Adj}R_p^2$.

Residual Mean Square

The residual mean square for a subset regression model is defined as
$$ MS_{Res}(p) = \frac{SS_{Res}(p)}{n - p}. $$
Comparing with $\mathrm{Adj}R_p^2$, we can write $\mathrm{Adj}R_p^2 = 1 - MS_{Res}(p)/[SS_T/(n-1)]$; since $SS_T/(n-1)$ does not depend on the subset, the subset regression model that minimizes $MS_{Res}(p)$ also maximizes $\mathrm{Adj}R_p^2$.
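A quick numeric check of this equivalence, with illustrative numbers (a sketch assuming NumPy):

```python
import numpy as np

n, ss_t = 30, 500.0
p = np.array([2, 3, 4, 5])                       # terms (incl. intercept), illustrative
ss_res = np.array([220.0, 150.0, 148.0, 147.5])  # illustrative SS_Res(p) values

ms_res = ss_res / (n - p)
adj_r2 = 1 - ms_res / (ss_t / (n - 1))
print(np.argmin(ms_res) == np.argmax(adj_r2))    # True: the two criteria agree
```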

Mallows $C_p$ Statistic

The mean squared error of a fitted value is
$$ E[\hat y_i - E(y_i)]^2 = [E(y_i) - E(\hat y_i)]^2 + \mathrm{Var}(\hat y_i), \tag{1} $$
where $E(y_i)$ is the expected response from the true regression equation and $E(\hat y_i)$ is the expected response from the $p$-term subset model. The two terms on the right-hand side of (1) are the squared bias and variance components, respectively, of the mean square error. Since (Appendix C.3)
$$ E\Big[\sum_{i=1}^n (\hat y_i - y_i)^2\Big] = E[SS_{Res}(p)] = SS_B(p) + (n - p)\sigma^2, $$
where
$$ SS_B(p) = \sum_{i=1}^n [E(y_i) - E(\hat y_i)]^2 \quad \text{and} \quad \sum_{i=1}^n \mathrm{Var}(\hat y_i) = p\sigma^2, $$
minimizing the total mean squared error is equivalent to minimizing
$$ \Gamma_p = \frac{\sum_{i=1}^n E[\hat y_i - E(y_i)]^2}{\sigma^2} = \frac{1}{\sigma^2}\big\{E[SS_{Res}(p)] - (n-p)\sigma^2 + p\sigma^2\big\} = \frac{E[SS_{Res}(p)]}{\sigma^2} + 2p - n. $$
Suppose that $\hat\sigma^2$ is a good estimate of $\sigma^2$. Then replacing $E[SS_{Res}(p)]$ by the observed value $SS_{Res}(p)$ produces an estimate of $\Gamma_p$, namely
$$ C_p = \frac{SS_{Res}(p)}{\hat\sigma^2} + 2p - n. $$
If the $p$-term model has negligible bias, then $SS_B(p) = 0$; consequently $E[SS_{Res}(p)] = (n-p)\sigma^2$, and
$$ E[C_p \mid \mathrm{Bias} = 0] = \frac{(n-p)\sigma^2}{\sigma^2} + 2p - n = p. $$
Hence, a good model should have $C_p \approx p$. However, when $K$ is large, an exhaustive search for all models with $C_p \approx p$ is impractical, and the minimum-$C_p$ model is often used for model selection. Mallows (1973) warned that minimizing $C_p$ may lead to the selection of a model that has a larger mean squared prediction error than the full model. Hocking (1976) suggested also considering other models falling below the line $C_p = p$, and Mallows (1995) further suggested that any candidate model with $C_p < p$ should be carefully examined as a potential best model.
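As a sketch of how $C_p$ is computed in practice, the following enumerates all subsets of a small simulated problem, estimating $\sigma^2$ from the full model (assuming NumPy; the data and names are illustrative). Subsets containing both active regressors should show $C_p \approx p$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, K = 60, 4
X = rng.normal(size=(n, K))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)  # x3, x4 inactive

def ss_res(cols):
    """Residual sum of squares for the model with intercept plus columns `cols`."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return resid @ resid

sigma2 = ss_res(range(K)) / (n - K - 1)         # full-model estimate of sigma^2
for size in range(K + 1):
    for cols in itertools.combinations(range(K), size):
        p = len(cols) + 1                       # terms, including the intercept
        cp = ss_res(cols) / sigma2 + 2 * p - n  # Mallows C_p
        print(cols, round(cp, 2))
```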

Information Criteria

Criteria for comparing candidate subsets are based on the lack of fit of a model and its complexity. Lack of fit for a candidate subset of $X$ is measured by its residual sum of squares $SS_{Res,p}$; complexity for multiple linear regression models is measured by the number of terms $p$ in the subset, including the intercept. The most common criterion, useful in multiple linear regression and many other problems where model comparison is at issue, is the Akaike Information Criterion, or AIC:
$$ AIC_p = n\log(SS_{Res,p}) + 2p. $$
Small values of AIC are preferred. An alternative to AIC is the Bayes Information Criterion, or BIC:
$$ BIC_p = n\log(SS_{Res,p}) + p\log(n). $$
Once again, smaller values are preferred.
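The two criteria translate directly into a small helper (a sketch assuming NumPy; additive constants that do not affect the ranking of subsets are omitted, following the definitions above):

```python
import numpy as np

def aic_bic(ss_res_p, n, p):
    """AIC and BIC for a p-term subset with residual sum of squares ss_res_p."""
    aic = n * np.log(ss_res_p) + 2 * p
    bic = n * np.log(ss_res_p) + p * np.log(n)
    return aic, bic

print(aic_bic(148.0, n=30, p=4))  # illustrative values
```

Note that BIC penalizes each additional term by $\log(n)$ rather than 2, so for $n > e^2 \approx 7.4$ it favors smaller subsets than AIC.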

Cross-Validation

Cross-validation can also be used to compare candidate subset mean functions. The most straightforward type of cross-validation splits the data into two parts at random: a construction set and a validation set. The construction set is used to estimate the parameters in the mean function; fitted values from this fit are then computed for the cases in the validation set, and the average of the squared differences between the response and the fitted values on the validation set is used as a summary of fit for the candidate subset. Good candidates will have small cross-validation errors.

Another version of cross-validation uses predicted residuals for the subset mean function based on the candidate $X_p$. For this criterion, compute the fitted value for each case $i$ from a regression that does not include case $i$. The sum of squares of these values is the predicted residual sum of squares, or PRESS:
$$ PRESS_p = \sum_{i=1}^n \big(y_i - x_{p,i}'\hat\beta_{p(i)}\big)^2 = \sum_{i=1}^n \Big(\frac{e_{p,i}}{1 - h_{p,ii}}\Big)^2, $$
where $\hat\beta_{p(i)}$ is estimated with case $i$ deleted, $e_{p,i}$ is the ordinary residual, and $h_{p,ii}$ is the $i$th leverage for the subset model.
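The leverage identity above means PRESS requires only a single fit per subset rather than $n$ refits. A minimal sketch (assuming NumPy; the data are illustrative):

```python
import numpy as np

def press(Xp, y):
    """PRESS for the subset design Xp (intercept column included)."""
    H = Xp @ np.linalg.solve(Xp.T @ Xp, Xp.T)   # hat matrix
    e = y - H @ y                               # ordinary residuals
    return np.sum((e / (1 - np.diag(H))) ** 2)  # squared predicted residuals

rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
print(press(np.column_stack([np.ones(n), x]), y))
```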

2 Computational Techniques for Variable Selection

Highway Accident Data

We will use the highway accident data to demonstrate the variable selection procedures. The response variable, log(rate), and the predictors are described in the following table.

Variable     Description
log(rate)    Base-two logarithm of the 1973 accident rate per million vehicle miles; the response
log(len)     Base-two logarithm of the length of the segment in miles
log(adt)     Base-two logarithm of the average daily traffic count in thousands
log(trks)    Base-two logarithm of truck volume as a percent of the total volume
Slim         1973 speed limit
Lwid         Lane width in feet
Shld         Width in feet of the outer shoulder on the roadway
Itg          Number of freeway-type interchanges per mile in the segment
log(sigs1)   Base-two logarithm of (number of signalized interchanges per mile in the segment + 1)/(length of segment)
Acpt         Number of access points per mile in the segment
Hwy          A factor coded 0 for a federal interstate highway, 1 for a principal arterial highway, 2 for a major arterial, and 3 otherwise

2.1 All Possible Regressions

This procedure requires the analyst to fit all possible candidate models: if there are $K$ candidate regressors, there are $2^K$ candidate models in total.

2.2 Stepwise Regression Methods

Forward selection. The procedure begins with a null model (only the intercept term is included) and attempts to insert regressors one by one according to their partial $F$-statistics. The procedure terminates either when the partial $F$-statistic at a particular step does not exceed the pre-specified cutoff value $F_{IN}$, or when the last candidate regressor has been added to the model. At each step, with regressors $x_1, \ldots, x_i$ already in the model, the candidate regressor $x_{i+1}$ with the largest partial $F$-statistic
$$ F = \frac{SS_R(x_{i+1} \mid x_1, \ldots, x_i)}{MS_{Res}(x_1, \ldots, x_i, x_{i+1})} $$
is chosen for insertion.
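A compact sketch of forward selection with the partial $F$-statistic above (assuming NumPy; the helper names and the default $F_{IN}$ are illustrative):

```python
import numpy as np

def ss_res(cols, y):
    """Residual sum of squares for a model with intercept plus the given columns."""
    Z = np.column_stack([np.ones(len(y))] + cols)
    r = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return r @ r

def forward_selection(X, y, f_in=4.0):
    n, K = X.shape
    chosen, remaining = [], list(range(K))
    while remaining:
        rss_cur = ss_res([X[:, j] for j in chosen], y)
        best_f, best_j = -np.inf, None
        for j in remaining:
            rss_new = ss_res([X[:, k] for k in chosen + [j]], y)
            ms_res = rss_new / (n - len(chosen) - 2)   # n - p for the enlarged model
            f_stat = (rss_cur - rss_new) / ms_res      # partial F for adding x_j
            if f_stat > best_f:
                best_f, best_j = f_stat, j
        if best_f <= f_in:            # no candidate exceeds F_IN: stop
            break
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen
```

Run on the simulated data from the $C_p$ sketch, `forward_selection(X, y)` should recover the two active regressors.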

Backward elimination. Backward elimination begins with a model that includes all $K$ regressors and attempts to eliminate regressors one by one according to their partial $F$-statistics. At each step, the partial $F$-statistic is computed for each regressor in the current model as if it were the last variable to enter the model. If the smallest of these $F$ values is less than the pre-specified cutoff value $F_{OUT}$, the corresponding regressor is removed from the model; the procedure terminates when the smallest partial $F$ value is not less than $F_{OUT}$.
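A matching sketch of backward elimination (assuming NumPy; names and the default $F_{OUT}$ are illustrative):

```python
import numpy as np

def ss_res(cols, y):
    """Residual sum of squares for a model with intercept plus the given columns."""
    Z = np.column_stack([np.ones(len(y))] + cols)
    r = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return r @ r

def backward_elimination(X, y, f_out=4.0):
    n, K = X.shape
    chosen = list(range(K))
    while chosen:
        p = len(chosen) + 1                        # terms, including the intercept
        rss_full = ss_res([X[:, j] for j in chosen], y)
        ms_res = rss_full / (n - p)
        # partial F for each regressor, as if it entered the model last
        f_stats = {j: (ss_res([X[:, k] for k in chosen if k != j], y) - rss_full)
                      / ms_res
                   for j in chosen}
        j_min = min(f_stats, key=f_stats.get)
        if f_stats[j_min] >= f_out:                # smallest F not below F_OUT: stop
            break
        chosen.remove(j_min)
    return chosen
```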

Stepwise regression. Stepwise regression is a modification of forward selection in which, at each step, all regressors previously entered into the model are reassessed via their partial $F$-statistics. A regressor added at an earlier step may now be redundant because of its relationships with the regressors currently in the equation. If the partial $F$-statistic for a variable is less than $F_{OUT}$, that variable is dropped from the model.

Stepwise regression therefore requires two cutoff values, $F_{IN}$ and $F_{OUT}$. Some analysts prefer to choose $F_{IN} = F_{OUT}$, although this is not necessary. Frequently we choose $F_{IN} > F_{OUT}$, making it relatively more difficult to add a regressor than to delete one. A popular setting is $F_{IN} = F_{OUT} = 4.0$, which corresponds roughly to the upper 5% point of the $F$ distribution.
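The rough correspondence between $F = 4.0$ and the upper 5% point of the $F$ distribution with one numerator degree of freedom can be checked directly (assuming SciPy):

```python
from scipy.stats import f

# Upper 5% point of F(1, df2) for a few residual degrees of freedom
for df2 in (20, 60, 120):
    print(df2, round(f.ppf(0.95, 1, df2), 2))   # roughly 4.35, 4.00, 3.92
```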
