BIOL 458 BIOMETRY Lab 10 - Multiple Regression

Many problems in biological science involve the analysis of multivariate data sets. For data sets in which there is a single continuous dependent variable but several continuous independent variables, multiple regression is used. Multiple regression is a method of fitting linear models of the form:

$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k + \varepsilon$

where $\hat{Y}$ is the estimated value of Y, the criterion variable; $X_1, X_2, \ldots, X_k$ are the k predictor variables; and $b_0, b_1, b_2, \ldots, b_k$ are the regression coefficients. The values of the regression coefficients are determined by minimizing the sum of squares of the residuals, i.e., minimizing

$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

Hypothesis tests about the regression coefficients, or about the contribution of particular terms or groups of terms to the fit of the model, are then performed to determine the utility of the model.

In many studies, regression programs are used to generate a series of models; the "best" of these models is then chosen on the basis of a variety of criteria, as discussed in lecture. These models can be generated using a forward, backward, or stepwise regression routine. In forward regression, new independent variables are added to the model if they meet a set significance criterion for inclusion (often p < 0.05 for the partial F-test for the inclusion of the term in the model). In backward regression, all independent variables are initially entered into the model and are sequentially taken out if they do not meet a set significance criterion (often p > 0.1 for the partial F-test for removal of a term). Stepwise regression uses both these techniques: a variable is entered if it meets the p-value to enter, and after each variable is added to the equation, all other variables in the equation are tested against the p-value to remove a term and, if necessary, thrown out of the model. The SPSS, SAS, MINITAB, SYSTAT, BMDP, and other statistical packages include these routines. The computer output generated by these routines consists of a series of models for estimating the value of Y and the goodness-of-fit statistics for each model. Each model estimates the value of Y as a linear combination of the values of the predictor variables included in that model.
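For reference, a single multiple regression model of this form can be fit directly in R with one call to lm(). This is only a minimal sketch; the data frame and variable names below are hypothetical.

# Fit Y as a linear combination of three predictors (hypothetical names)
fit <- lm(yield ~ elev + precip + temp, data = mydata)
summary(fit)    # coefficient estimates b0..bk, their t-tests, and R-squared
anova(fit)      # sequential sums of squares for each term
confint(fit)    # confidence intervals for the regression coefficients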

Further Instructions for Lab 10

In Lab 10 you use the same regression module in SPSS to fit multiple regression models that you used in Lab 9 to fit bivariate regression models. The big difference now is in entering the multiple independent variables, selecting the algorithm for building the model, and evaluating the fit of each model.

In the Linear Regression sub-window, you will see a box with a pull-down arrow called Method that by default is occupied by the word Enter. Enter is one of several model-building algorithms available in the Method box. Enter in SPSS is the equivalent of forcing all the variables in the Independent(s) variable box to be entered into the model simultaneously. The opposite of Enter is Remove, where all variables are removed simultaneously. Other model-building algorithms use various criteria to decide which variables are entered into (or removed from) the model, and when to stop adding or removing variables. SPSS has algorithms named Stepwise, Backward, and Forward.

In the Stepwise algorithm, the variable with the smallest probability of its F statistic (if it meets a criterion, such as p < 0.05) is entered into the model first. This process is then repeated for the variables not yet included in the model: the next variable that meets the criterion is added. The process continues to add variables until there are no variables left whose F statistics meet the user-specified criterion (p < 0.05, for example). As this process progresses, the F statistics for variables already in the model can change. If the significance levels of these F statistics come to exceed the criterion, those variables are removed from the model. Hence, in a Stepwise algorithm, variables can be both added to and removed from a model during the model-building process. The Forward algorithm is identical to the Stepwise algorithm, except that variables can only be added to the model, not removed. The Backward algorithm puts all variables into the model and then attempts to remove them sequentially. The variable with the smallest partial correlation with the dependent variable is removed first if it meets the criterion for removal. If this variable is removed, the variable with the next smallest partial correlation with the dependent variable is considered for removal, and removed if it meets the criterion. Note that in the Backward algorithm, variables are removed because the significance levels of their partial correlations exceed the criterion (p > 0.05), the opposite of the criterion for a Stepwise or Forward algorithm. Unfortunately, none of these algorithms is guaranteed to choose the best model. I prefer the Forward algorithm, but sometimes build models with different algorithms to see if they all choose the same best model.

Occasionally you might wish to enter variables into a model in a specific sequence, or to use different model-building algorithms for different groups of independent variables. To do so you need to look at the text and buttons surrounding the Independent(s) box in the Linear Regression sub-window. Note a light gray line enclosing this region, and blue text that says Block 1 of 1. SPSS allows you to group variables into blocks and to specify a different variable selection method for each block. For example, to build the Analysis of Covariance models that I described in class, you would place the variable name for the covariate into the Independent(s) box and select Enter as the Method (since you don't want SPSS to do any thinking, just put the variable in the model). Then you would click on the Next button. Note that the blue text now says Block 2 of 2.
Here you would enter the names of the dummy variables that define your groupings or factors in the covariance analysis. Again use the Method: Enter. Finally, you would click on the Next button to create the third block of variables. Here you would enter the variable names for the factor-covariate interactions. Once again use the Method: Enter.
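The same block-by-block (hierarchical) idea can be mimicked in R by fitting nested models and comparing them with partial F-tests. This is only a sketch, with hypothetical names (y = response, x = covariate, group = factor):

# Block 1: covariate only
m1 <- lm(y ~ x, data = mydata)
# Block 2: add the grouping factor (R creates the dummy variables for you)
m2 <- lm(y ~ x + group, data = mydata)
# Block 3: add the factor-covariate interaction
m3 <- lm(y ~ x * group, data = mydata)
# Partial F-tests for the contribution of each block of terms
anova(m1, m2, m3)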

Assessing model fit involves all the same procedures used in bivariate regression, since the same assumptions apply. The dependent variable should be normally distributed, scatter plots should indicate linear relationships between the dependent and independent variables, and residual plots should show homoscedasticity (equality of variance in the residuals along the regression line). In addition, one also needs to check for outliers or overly influential data points, and for high intercorrelations between pairs of independent variables (called multicollinearity). If two independent variables are highly correlated (r > 0.9), then including both variables in the model causes problems in parameter estimation. You can pre-screen your independent variables by obtaining a correlation matrix before performing the regression and allowing only one variable of a highly correlated pair to serve as a candidate variable for model building at a time. You could also examine the Tolerance values provided by SPSS in the output table named Excluded Variables; these values also tell you whether you have a problem with multicollinearity. Come to class to find out how to interpret the tolerance values.

Lab 10 Assignment

The exercise to be performed in this lab is to use the SPSS stepwise and forward regression routines to generate a series of models, and to select the "best" model from each series, as discussed in lecture. Two data sets will be provided; you are to perform the analysis on either of these two. You must discuss in detail the reasons for choosing the models that you have selected, including showing plots of residuals, information about the distribution of the response variable, examining outliers, and other metrics to demonstrate goodness-of-fit.

DESCRIPTION OF DATA

The data are stored in a file MULTR2. The variables are as follows (they are in the same order in the data sets):

VARIABLE (UNITS)
Mean elevation (feet)
Mean temperature (degrees F)
Mean annual precipitation (inches)
Vegetative density (percent cover)
Drainage area (miles²)
Latitude (degrees)
Longitude (degrees)
Elevation at temperature station (feet)
1-hour, 25-year precipitation intensity (inches/hour)
Annual water yield (inches) (dependent variable)

The data consist of values of these variables measured on all gauged watersheds in the western region of the USA. The dependent variable, annual water yield, is marked in the list above. Develop and evaluate a model for estimating water yield from un-gauged basins in the western USA.
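The multicollinearity pre-screening and residual diagnostics described above can also be done directly in R once the data are loaded. The sketch below assumes the data have been exported as multr2.csv (the file named in the next section) and that the response column is called yield; the column name is an assumption, not part of the supplied file.

watershed <- read.csv("multr2.csv")       # assumed export of the MULTR2 data
round(cor(watershed), 2)                  # correlation matrix of all variables
# flag predictor pairs with |r| > 0.9 before letting both compete in one model
full <- lm(yield ~ ., data = watershed)   # 'yield' is an assumed column name
par(mfrow = c(2, 2))
plot(full)                                # residual, Q-Q, scale-location, and leverage plots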

Lab 10 in R (use 'multr2.csv')

To Obtain a General Plot of Every Variable Against Every Other Variable

When looking at a data set with multiple variables, this can be a useful tool for seeing correlations between specific pairs of variables:

> plot(dataset)

Stepwise, Backward, and Forward Model Selection Using the StepF Script

This function runs a backward or forward model selection algorithm. It uses the add1 and drop1 functions to select variables to add or drop based on the F statistic. The model is then updated using update. This is repeated until a level of significance is met. You will see a printout of each iteration with the output of add1 or drop1, and the variable selected for addition or deletion.

To use the function, you need to source the script:
1) Save the script StepF.R on your computer.
2) In the R Console window choose File -> Source R Code...
3) Select the StepF.R file and press Open.

Methods of Use:

There are a few ways to use StepF. I will use the pubescence data set to illustrate the different methods. The data set should be attached. However it is used, if you assign the result of StepF to a variable, it will contain the final model selected.

Example:

> mylm <- StepF(dataTable = pubescence, response = "abherb", level = .05, direction = "backward")
<Here you will see the output for each iteration>
> mylm

Call:
lm(formula = abherb ~ srherb)

Coefficients:
(Intercept)      srherb

1) Provide a data set and identify the response variable. StepF will then construct a model. If you are using direction = "backward", the full model based on every column in the data set will be created. If you are using direction = "forward", an empty model will be created. This model will then be run through the algorithm, removing or adding variables based on the level of significance. If you want it to use glm instead of lm, use the argument general = TRUE.

Example:

> StepF(dataTable = pubescence, response = "abherb", level = .05, direction = "backward")

In the first iteration, the model starts with every variable from pubescence. After 8 iterations only srherb is selected.

2) Provide a model you have made (either lm or glm). StepF will run your model through the algorithm as before.

Example:

> plm5 <- lm(formula = abherb ~ site + lfsize + cong + range + aveden + avelen + slarea + srherb)
> StepF(model = plm5, level = .05, direction = "backward")

This is the same full model as StepF created from the data set in the last example, and the result is the same.

3) Whether you provide the model or let StepF make it from a data file, if you want to limit the variables that can be removed you can specify them in a scope, as you would when using add1 or drop1.

Backward example:

> StepF(model = plm5, scope = formula(~ aveden + avelen + slarea), direction = "backward", level = .05)

The model is as before (all variables from pubescence), but only aveden, avelen, and slarea are options for StepF to drop from the model. In this case, all three are removed.

Forward example:

> plm6 <- glm(abherb ~ srherb)
> StepF(model = plm6, scope = formula(~ srherb + slarea + aveden + avelen), direction = "forward", level = .05)

First we create the model we want to start with. In plm6 we are forcing srherb to be included in the model and having StepF check whether any of the other three variables listed in the scope formula should be added. In this case no more are added.

Function arguments and defaults:

StepF <- function(model = NULL, general = FALSE, datatable = NULL, scope = NULL, response = NULL, interactions = FALSE, level = 0.05, direction = "backward", steps = 100)

List of arguments:

model: The lm or glm to use as a starting point for the algorithm. Note: You can use this, or have StepF create a model from datatable and response.
general: When StepF creates the model, whether glm should be used instead of lm.
datatable: Data with variables, both response and predictor(s), to be used in the model. Note: If you use this instead of supplying your own model, you also need to specify response.
scope: Formula to be passed to add1 (list of variables to add) or drop1 (list of variables to drop).
response: The string name of the response variable in the datatable.
interactions: If TRUE, * will be used to link all variables when StepF creates the formula from the datatable; otherwise + will be used (default). Note: Only affects model creation for the backward algorithm.
level: The level of significance against which to test whether a variable should be added or removed.
direction: The algorithm to use, "backward" or "forward".
steps: The maximum number of iterations that will be run.
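For comparison, the base-R functions that the StepF script wraps can be called directly. This short sketch performs one selection step by hand on the plm5 and plm6 models from the examples above; the term removed in the last line is only an illustrative choice.

# One backward step by hand: which term has the least significant partial F-test?
drop1(plm5, test = "F")
# One forward step by hand: which candidate term in the scope would most improve the model?
add1(plm6, scope = ~ srherb + slarea + aveden + avelen, test = "F")
# Remove (or add) the chosen term and refit
plm5b <- update(plm5, . ~ . - slarea)   # 'slarea' chosen here purely for illustration

Base R also provides step(), which automates a similar search but selects terms on AIC rather than on F-test p-values.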
