STA121: Applied Regression Analysis

1 STA121: Applied Regression Analysis. Variable Selection - Chapter 8 in Dielman. Artin, Department of Statistical Science. October 23, 2009

2 Outline 1. Introduction

3 Variable Selection Model selection is one of the most heavily studied problems in statistics. It is important to be able to identify a good model (containing a set of variables) that explains the response. The number of variables in real-life problems is often very large, leading to a very large number of possible models. Let $k$ be the total number of predictors. There are $2^k$ distinct regression models. We will deal with relatively small problems here, so computation won't be a problem.

4 All Possible Regressions One of the variable selection techniques suggested to aid in choosing the best regression model is called all possible regressions. As the name suggests, we run all possible regressions between the dependent variable and all possible subsets of explanatory variables. For instance, suppose we have 3 predictors, $x_1$, $x_2$ and $x_3$. As mentioned earlier, this leads to $2^3 = 8$ distinct models: a null model, three one-variable models, three two-variable models and a three-variable model. We fit a regression for each model and pick the best one.
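As a minimal sketch of the enumeration step only (in Python rather than the R used in the course, and with the placeholder names x1, x2, x3 standing in for real predictors), the candidate models for three predictors can be listed like this:

```python
from itertools import combinations

def all_subsets(predictors):
    """Return every subset of the predictor names, smallest subsets first."""
    subsets = []
    for size in range(len(predictors) + 1):
        for combo in combinations(predictors, size):
            subsets.append(combo)
    return subsets

# Placeholder predictor names; any list of k names gives 2**k subsets.
models = all_subsets(["x1", "x2", "x3"])
for model in models:
    print(model if model else "(null model)")
print(len(models), "models in total")
```

Each tuple names the predictors of one candidate regression; the empty tuple is the intercept-only (null) model.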

5 Best One? The real question arises when a model is to be picked: what do we base our decision on? $R^2$ and $R^2_{adjusted}$ are common tools to assess a regression model. Recall that $R^2_{adjusted}$ accounts for the number of predictors included in the model and does not have the same meaning as $R^2$. It is used as a guiding quantity for model selection. We will focus on three different criteria to assess a model: $C_p$, AIC and BIC. These are the most popular criteria used for model selection (which should not imply they are the best ones) and are computed by almost every statistical package available.
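To illustrate how the adjustment for the number of predictors works, here is a small Python sketch using the standard textbook definitions (the slide itself does not state the formula, so the definitions below are an assumption about what is meant). Here p counts the fitted parameters including the intercept, and the SSE and SST values are made up for illustration:

```python
def r_squared(sse, sst):
    """Plain R^2: share of the total sum of squares explained by the model."""
    return 1.0 - sse / sst

def adjusted_r_squared(sse, sst, n, p):
    """Adjusted R^2: compares SSE and SST per degree of freedom.

    p is the number of fitted parameters, intercept included (an assumed
    convention, chosen to match the C_p slide).
    """
    return 1.0 - (sse / (n - p)) / (sst / (n - 1))

# Made-up numbers, purely for illustration.
n, p = 25, 3
sse, sst = 20.0, 100.0
r2 = r_squared(sse, sst)
r2_adj = adjusted_r_squared(sse, sst, n, p)
print(r2, r2_adj)
```

Because the adjusted version divides SSE by n - p, adding a predictor that barely reduces SSE can lower it, whereas plain R^2 never decreases when a predictor is added.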

6 $C_p$ $$C_p = \frac{SSE_p}{MSE_F} - (n - 2p)$$ where $p$ is the number of variables in the reduced model (including the intercept) and $MSE_F$ is the mean squared error of the full model. When there is no bias in the regression model with $p - 1$ predictor variables, the expected value of $C_p$ is approximately $p$. In using the $C_p$ criterion, one seeks to identify subsets of variables for which (1) the $C_p$ value is small and (2) the $C_p$ value is near $p$. Effective use of the $C_p$ criterion requires careful development of the pool of all potential variables, with the independent variables expressed in appropriate form (linear, quadratic, transformed, etc.) and useless variables excluded, so that $MSE_F$ provides an unbiased estimate of the error variance $\sigma^2$. Sometimes $C_p$ may be computed to be smaller than $p$, which is a result of random variation in this measure. Sets of variables with small $C_p$ values have a small total mean squared error (smaller risk). When the $C_p$ value is also near $p$, the bias of the regression model is small. $C_p$ values substantially larger than $p$ indicate large bias. Thus, we may sometimes end up picking a slightly larger model with a slightly larger $C_p$ that is closer to $p$.
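A small Python sketch of the computation, reading the formula on this slide as C_p = SSE_p / MSE_F - (n - 2p), with MSE_F taken from the full model. All numbers are made up for illustration. The second call checks a property that follows directly from the formula: for the full model itself, C_p equals its own parameter count.

```python
def mallows_cp(sse_p, mse_full, n, p):
    """C_p = SSE_p / MSE_F - (n - 2p); p counts parameters incl. intercept."""
    return sse_p / mse_full - (n - 2 * p)

# Made-up full model: n = 25 observations, 4 parameters, SSE_F = 105.
n = 25
p_full = 4
sse_full = 105.0
mse_full = sse_full / (n - p_full)  # 105 / 21 = 5.0

# Made-up reduced model with 3 parameters and SSE_p = 100.
cp_reduced = mallows_cp(100.0, mse_full, n, 3)

# The full model compared against itself.
cp_full = mallows_cp(sse_full, mse_full, n, p_full)

print(cp_reduced, cp_full)
```

In this made-up case the reduced model's value falls below its p of 3, the situation the slide attributes to random variation.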

7 AIC and BIC Akaike's information criterion, developed by Hirotsugu Akaike under the name of "an information criterion" (AIC) in 1971 and proposed in Akaike (1974), is a measure of the goodness of fit of an estimated statistical model. It is grounded in the concept of entropy, in effect offering a relative measure of the information lost when a given model is used to describe reality. The Bayesian information criterion (BIC), also called the Schwarz criterion (SBC), is a criterion for model selection among a class of parametric models with different numbers of parameters. It was developed by Gideon E. Schwarz, who gave a Bayesian argument for adopting it. Choosing a model to optimize BIC is a form of regularization. The penalty term in BIC for additional parameters is stronger than that of the AIC, favoring smaller models. Smaller AIC and BIC values point to a better model.

8 AIC and BIC
$$\mathrm{AIC} = n + n \log(2\pi) + n \log\left(\frac{SSE}{n}\right) + 2p$$
$$\mathrm{BIC} = n + n \log(2\pi) + n \log\left(\frac{SSE}{n}\right) + p \log n$$
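The sketch below evaluates the two criteria in Python, reading the slide's formulas as sharing the term n + n log(2π) + n log(SSE/n) and differing only in the penalty (2p for AIC, p log n for BIC). The inputs are made up, and statistical packages may count parameters differently, so values need not match a given package's output.

```python
import math

def fit_term(sse, n):
    """Part shared by both criteria: n + n*log(2*pi) + n*log(SSE/n)."""
    return n + n * math.log(2 * math.pi) + n * math.log(sse / n)

def aic(sse, n, p):
    """AIC with penalty 2p, p = number of parameters in the model."""
    return fit_term(sse, n) + 2 * p

def bic(sse, n, p):
    """BIC with penalty p*log(n)."""
    return fit_term(sse, n) + p * math.log(n)

# Made-up inputs for illustration.
n, p, sse = 25, 3, 100.0
aic_value = aic(sse, n, p)
bic_value = bic(sse, n, p)
print(aic_value, bic_value)
```

The difference BIC - AIC is p (log n - 2), so the BIC penalty is the heavier one whenever log n > 2, that is, for n of 8 or more.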

9 Example - Meddicorp [Figure: $C_p$ plotted against the number of predictors, with the line $C_p = p$.] Here the $C_p$ value for the best model is smaller than $p$ (below the line) due to its random nature. We discard the models that fall significantly above the $C_p = p$ line because of their substantial bias.

10 Example - Meddicorp For BIC, you need to load the library nlme.

11 Example - Meddicorp I also created a simple function, select(formula,data), that computes $R^2$, $R^2_{adjusted}$, $C_p$, AIC and BIC for the best subset of each size. It's on the course website.

12 Forward Selection and Backward Elimination When the number of potential variables is very large and it is not computationally feasible to go through all possible subsets, we resort to an algorithm that searches for a good model. Although these stepwise algorithms, like some other techniques, have their problems, they are very popular. They are not guaranteed to find the best model, but they will find a model that is no worse, by the chosen criterion, than the one we start with.
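A minimal Python sketch of forward selection, under stated assumptions: AIC (in the form n + n log(2π) + n log(SSE/n) + 2p) is the criterion, the search starts from the intercept-only model, and it stops when no single addition lowers AIC. The least-squares helper and the toy data are made up for illustration and are not the course's R code or the Meddicorp data.

```python
import math

def sse_of_fit(rows, y, cols):
    """SSE of the least-squares fit of y on an intercept plus columns `cols`."""
    X = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    # (assumes the chosen columns are not collinear).
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(k):
            if r != i:
                factor = A[r][i] / A[i][i]
                A[r] = [a - factor * b for a, b in zip(A[r], A[i])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * xi for b, xi in zip(beta, x))) ** 2
               for x, yi in zip(X, y))

def aic(sse, n, p):
    """AIC with p = number of parameters including the intercept."""
    return n + n * math.log(2 * math.pi) + n * math.log(sse / n) + 2 * p

def forward_selection(rows, y, n_predictors):
    """Greedily add the predictor that lowers AIC most; stop when none does."""
    n = len(y)
    selected = []
    best = aic(sse_of_fit(rows, y, selected), n, 1)  # intercept-only model
    while True:
        scored = [(aic(sse_of_fit(rows, y, selected + [c]), n,
                       len(selected) + 2), c)
                  for c in range(n_predictors) if c not in selected]
        if not scored:
            break
        score, choice = min(scored)
        if score >= best:
            break
        selected.append(choice)
        best = score
    return selected, best

# Toy data (made up): y is roughly 2 * x0; x1 is an alternating +/-1 column.
noise = [0.1, -0.2, 0.15, -0.05, 0.2, -0.1, -0.15, 0.05]
rows = [[float(i + 1), 1.0 if i % 2 == 0 else -1.0] for i in range(8)]
y = [2.0 * row[0] + e for row, e in zip(rows, noise)]

null_aic = aic(sse_of_fit(rows, y, []), len(y), 1)
selected, final_aic = forward_selection(rows, y, 2)
print(selected, final_aic)
```

Backward elimination runs the same loop in reverse: start from the full model and drop the predictor whose removal lowers the criterion most.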

13 Forward Selection - Meddicorp

14 Backward Elimination - Meddicorp

15 Both - Meddicorp

16 Exhaustive (With Interaction Terms) - Meddicorp

17 Both (With Interaction Terms) - Meddicorp

18 Exhaustive (With Interaction and Quadratic Terms) - Meddicorp

19 Both (With Interaction Terms) - Meddicorp

20 Use the Ozone data set in the library mlbench. If you type help(Ozone), you can see the explanation of the different variables. There are three categorical variables now. Make sure you check the explanation of the data, as the order of the variables is not the same as it was in the take-home exam. Use a model selection procedure to come up with a good model, and check whether the assumptions are satisfied. Consider all main effects, quadratic terms and all two-way interactions (excluding the interaction terms among the categorical variables). Due on October 30.
