BIOL 458 BIOMETRY Lab 10 - Multiple Regression
- Kathleen Lester
Many problems in science involve the analysis of multi-variable data sets. For data sets in which there is a single continuous dependent variable but several continuous (and potentially discrete) independent variables, multiple regression is used. Multiple regression fits linear models of the form

Y_i = β_0 + β_1 X_1i + … + β_p X_pi + ε_i,  where ε_i ~ iid N(0, σ²).

Y_i is the response of subject i. β_0 is the y-intercept of the model. β_1, …, β_p are the model coefficients or weights that relate Y_i to the experimental treatments or explanatory variables. X_1i, …, X_pi are the values of the measured independent variables, or codes used to classify a subject as to group membership; for example, 1 for receiving medicine and 0 for placebo might be the codes used in a two-group design. ε_i are the random errors or residuals: the deviations of the observed responses from the model's predictions.

The values of the regression coefficients are determined by minimizing the sum of squares of the residuals (errors), i.e., minimizing

Σ (i = 1 to n) ε_i² = Σ (i = 1 to n) (Y_i − Ŷ_i)² = Σ (i = 1 to n) (Y_i − [β_0 + β_1 X_1i + β_2 X_2i + … + β_p X_pi])²,

where Ŷ_i is the model's predicted value for observation Y_i. Hypothesis tests about the regression coefficients, or about the contribution of particular terms or groups of terms to the fit of the model, are then performed to determine the utility of the model. In many studies, regression programs are used to generate a series of models; the "best" of these models is then chosen on the basis of a variety of criteria. These models can be generated by algorithms that add variables to the model in a stepwise fashion, or by examining a large number (or even all) of the possible regression models given the data.

Stepwise Regression

Stepwise regression builds a model by adding or removing variables one at a time.
Stepwise methods have lost much of their popularity because it has been shown that they are not guaranteed to select the best model.
In a forward stepwise regression, new independent variables are added to the model if they meet a set significance criterion for inclusion (often p < 0.05 for the partial F-test for including the term in the model). The variable with the lowest p-value is added to the model at each step, and the algorithm stops when no new variable meets the significance criterion. In a backward stepwise regression, all independent variables are initially entered into the model. They are then sequentially removed if they do not meet a set significance criterion for retention (often p > 0.10 or p > 0.05 for the partial F-test for removing a term). The variable with the highest p-value is removed from the model at each step until no variable meeting the removal criterion remains. Stepwise regression combines both techniques, with variables added or removed at each step of the process: a variable is entered if it meets the p-value to enter, and after each variable is added all other variables in the equation are tested against the p-value to remove a term, so a variable may subsequently be removed from the model. SPSS, SAS, MINITAB, SYSTAT, BMDP, and other statistical packages include these routines. The output generated by these routines consists of a series of models for estimating the value of Y, together with goodness-of-fit statistics for each model; each model estimates Y as a linear combination of the predictor variables included in that model.

In R, no stepwise regression routine using a partial F-test is available in the base installation. The MASS package, which ships with R, provides a stepwise function called stepAIC that uses Akaike's Information Criterion as the basis for stepwise model selection; we will avoid this information-theoretic approach to model selection for now. However, a former student (Joe Hill Warren) wrote a function called StepF which we can use to examine the behavior of these algorithms.
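The partial F-tests that drive one step of these algorithms are available directly in base R: add1() tests single-term additions and drop1() single-term deletions. Here is a minimal sketch on simulated data (all variable names are illustrative, not the lab's data):

```r
# simulate a small data set in which only x1 truly affects y
set.seed(2)
n  <- 40
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)
d  <- data.frame(y, x1, x2, x3)

# one forward step: partial F-tests for adding each candidate variable
fwd <- add1(lm(y ~ 1, data = d), scope = ~ x1 + x2 + x3, test = "F")
fwd

# one backward step: partial F-tests for deleting each variable from the full model
bwd <- drop1(lm(y ~ x1 + x2 + x3, data = d), test = "F")
bwd
```

A forward or backward selection algorithm is simply a loop that applies one of these steps, updates the model formula, and repeats until no term meets the criterion.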
However, StepF does not perform the full stepwise algorithm, only the forward and backward selection algorithms. To read more about the StepF function, click the link. To use the StepF function, download the file StepF.R and open it in RStudio. From the File menu, open the file and then click Source on the Code menu; alternatively, from the Code menu click Source File and choose the downloaded file StepF.R. This loads the function StepF, and you can then use its features to perform forward or backward selection.

Later in this lab we will address issues in building and assessing regression models. We could at this point use a number of techniques to examine our data before beginning the process of model selection, or we could use those same techniques after developing a set of candidate models to assess. In this demonstration I will take the latter approach, postponing a detailed assessment of whether a model meets the assumptions of regression until later.

To demonstrate multiple regression we will examine data on the species richness of plants in the Galapagos Islands. The data file Galapagos-plants.txt contains species
richness and the number of endemic species for plants on 29 islands, along with data about the physical characteristics of the islands (island name, island area, maximum elevation, distance to nearest island, area of nearest island, distance to Santa Cruz Island, and the number of botanical collecting trips to each island).

# read in data file on Galapagos plants
dat = read.table("k:/biometry/biometry-fall-2015/lab10/Galapagos-plants.txt", header = TRUE)
head(dat)

[head(dat) output: columns Isla, Spec, Area, Elev, DisN, DisS, AreA, Coll, Endm; first rows are the islands Balt, Bart, Cald, Cham, Coam, Daph; numeric values lost in transcription]

It is traditional to examine the relationship between log(number of species) and log(area), so I will create these variables and a new data.frame to hold them along with the other original variables for the analysis.

# create variables to be used in the regression and put them in a new data.frame
logarea = log(dat$Area)
logspec = log(dat$Spec)
elev = dat$Elev
diss = dat$DisS
disn = dat$DisN
coll = dat$Coll
area = dat$Area
dd = data.frame(logspec, logarea, elev, diss, disn, area, coll)
head(dd)

[head(dd) output: columns logspec, logarea, elev, diss, disn, area, coll; numeric values lost in transcription]

As an initial diagnostic step, I obtain the correlation matrix of the variables. I set options(digits = 4) to control how many digits are printed so the matrix will not wrap around.

# obtain correlation matrix of variables
options(digits = 4)
cor(dd)

[cor(dd) output: the 7 x 7 correlation matrix of logspec, logarea, elev, diss, disn, area, and coll; numeric values lost in transcription]

Note that we can already see that logspec is strongly associated with logarea and coll (the number of collecting trips), and less strongly associated with elev, so we might expect these to be the variables entered into the regression models.

Now we will source the StepF.R code file. Note that in R Markdown you need to give the full path and name of the file to be sourced.

source("k:/biometry/biometry-fall-2015/lab10/StepF.R")

Now let's use the forward stepwise approach to select a model. The output reflects a multi-step process. At each step, partial F-tests are reported that test whether the reduction in the residual sum of squares (RSS) from adding each variable to the model individually would be statistically significant. The variable that causes the greatest reduction in the RSS is added to the model. On iteration 1, with only the grand mean in the model, the RSS is 70.6; adding logarea produces the largest reduction in the RSS, and logarea also has the smallest p-value, so logarea is added to the model first. On iteration 2, after logarea is in the model, only the addition of coll results in a statistically significant reduction in the RSS at α = 0.05 (p = 0.017), so coll is added to the model. On iteration 3, none of the remaining variables have p < 0.05, so the algorithm stops after adding logarea and coll to the model.

# perform forward stepwise regression
mod.7 = StepF(datatable = dd, response = "logspec", level = 0.05, direction = "forward")

==================== Iteration #1 ====================
Single term additions

Model: logspec ~ 1
[partial F-tests for adding each variable: logarea (p ~ e-11 ***), elev (p ~ e-05 ***), diss, disn, area, coll (p ~ e-09 ***); other numeric values lost in transcription]
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variable with lowest p-value: logarea
Updating model formula: . ~ . + logarea

==================== Iteration #2 ====================
Single term additions

Model: logspec ~ logarea
[partial F-tests for adding each remaining variable: elev, diss, disn, area, coll (p = 0.017 *); numeric values lost in transcription]
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variable with lowest p-value: coll
Updating model formula: . ~ . + coll

==================== Iteration #3 ====================
Single term additions

Model: logspec ~ logarea + coll
[partial F-tests for adding elev, diss, disn, or area: none significant; numeric values lost in transcription]
========== No further variables significant at 0.05 ==========
Final Model: logspec ~ logarea + coll

We can use the backward stepwise approach as well. This algorithm will not always converge on the same model as the forward approach, but in this instance it does. In the backward approach, all variables are initially put into the model, and those that cause the smallest increase in the RSS are sequentially removed. It takes a couple more iterations than the forward approach, but converges on the same best model.
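Each step's partial F-test is the usual comparison of two nested models, so anova() on a reduced and a full model reproduces what the selection routine computes internally. A sketch on simulated data (variable names are illustrative, not the lab's data):

```r
# simulate data with two predictors
set.seed(3)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + rnorm(n)
d  <- data.frame(y, x1, x2)

reduced <- lm(y ~ x1,      data = d)
full    <- lm(y ~ x1 + x2, data = d)

# partial F-test for adding x2, given that x1 is already in the model
a1 <- anova(reduced, full)
a1

# add1() reports the same F statistic in its x2 row
a2 <- add1(reduced, scope = ~ x1 + x2, test = "F")
a2
```

The F value in the second row of the anova() table and the F value for x2 in the add1() table are identical, which is why forward selection is often described as a sequence of nested-model comparisons.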
# perform backward stepwise regression
mod.8 = StepF(datatable = dd, response = "logspec", level = 0.05, direction = "backward")

==================== Iteration #1 ====================
Single term deletions

Model: logspec ~ logarea + elev + diss + disn + area + coll
[partial F-tests for deleting each variable: logarea (***), elev, diss, disn, area, coll; numeric values lost in transcription]
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variable with highest p-value: disn
Updating model formula: . ~ . - disn

==================== Iteration #2 ====================
Single term deletions

Model: logspec ~ logarea + elev + diss + area + coll
[partial F-tests for deleting each variable: logarea (***), elev, diss, area, coll; numeric values lost in transcription]
Variable with highest p-value: area
Updating model formula: . ~ . - area

==================== Iteration #3 ====================
Single term deletions

Model: logspec ~ logarea + elev + diss + coll
[partial F-tests for deleting each variable: logarea (***), elev, diss, coll (*); numeric values lost in transcription]
Variable with highest p-value: diss
Updating model formula: . ~ . - diss

==================== Iteration #4 ====================
Single term deletions

Model: logspec ~ logarea + elev + coll
[partial F-tests for deleting each variable: logarea (***), elev, coll (*); numeric values lost in transcription]
Variable with highest p-value: elev
Updating model formula: . ~ . - elev

==================== Iteration #5 ====================
Single term deletions

Model: logspec ~ logarea + coll
[partial F-tests for deleting logarea (***) and coll (*); numeric values lost in transcription]
========== All variables significant at 0.05 ==========
Final Model: logspec ~ logarea + coll

There are other ways in which the StepF function can be used. One approach is to build a model containing variables you wish to force into the model, and then check whether any other variables will be added after those forced in. For example, suppose we wanted to force coll into the model and to know whether other variables explain residual variation in logspec after accounting for coll. We could think of coll as a nuisance variable that measures differences in sampling effort across the islands; perhaps we want to know which variables are still useful in explaining variation among the islands in the species richness of plants after accounting for that variable sampling effort. To do this, we first build a linear model with coll only and save it in a model object. Then we call StepF, specifying the model object name as our initial model, and
then the scope argument with coll and any other variable we wish to assess. Note that this process indicates that, even after accounting for the variability in sampling effort among islands, logarea still explains residual variation in logspec.

# to determine if any variables would be added to a model with only coll as the predictor variable
# first build a linear model with coll as the only predictor variable
lm1 = lm(formula = logspec ~ coll)
# then use StepF, specifying the model with coll and a "scope" argument listing coll and the other candidate variables
StepF(model = lm1, scope = formula(~ coll + logarea + elev + disn + diss + area), level = 0.05, direction = "forward")

==================== Iteration #1 ====================
Single term additions

Model: logspec ~ coll
[partial F-tests for adding each variable: logarea (***), elev, disn, diss, area; numeric values lost in transcription]
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variable with lowest p-value: logarea
Updating model formula: . ~ . + logarea

==================== Iteration #2 ====================
Single term additions

Model: logspec ~ coll + logarea
[partial F-tests for adding elev, disn, diss, or area: none significant; numeric values lost in transcription]
========== No further variables significant at 0.05 ==========
Final Model:
Call: lm(formula = logspec ~ coll + logarea)
Coefficients: (Intercept), coll, logarea [numeric values lost in transcription]

The StepF pdf file explains other ways in which you might use the StepF function. Note that the StepF function does not report the regression coefficients, nor does it compute the residuals and other fit statistics for the final model. After using StepF to select models, one must then use the lm function to fit the selected models and evaluate their adequacy.

Best Subsets Regression

An alternative approach to model selection is to compute all possible regressions given a set of candidate explanatory variables, or at least the best subset of models at each level of model complexity. By model complexity, I mean the number of predictor variables included in the model. To calculate the number of possible ordered models with 6 predictor variables, we compute the number of permutations of 6 variables taken 1, 2, 3, 4, 5, or 6 at a time and add them up. A permutation is like a combination except that we consider the case AB different from the case BA. The number of permutations of n things taken k at a time is

P(n, k) = n! / (n − k)!

The calculation of the number of permutations is similar to the calculation of the number of combinations of n things taken k at a time, except that it lacks the factor of k! in the denominator. R code to calculate the number of possible models is given below. Do you want to generate, fit, and assess all possible 1,957 models? Remember that in regression we use Type I sums of squares, so in models with different orderings of the variables the individual variables may explain different amounts of variation in the response variable.

# calculate the number of permutations of 6 variables for models with 1 to 6 predictor variables
n = 6
perm = rep(0, n)
for (k in 1:n) {
  perm[k] = factorial(n) / factorial(n - k)
}
perm

[1]   6  30 120 360 720 720

totperm = sum(perm) + 1   # + 1 for the intercept-only model
totperm

[1] 1957

Rather than tackling the daunting task of examining 1,957 models, we will use the regsubsets function from the package leaps to select the best k models with 1 predictor, 2 predictors, and so on.

# load package leaps
library(leaps)

In leaps we will use the regsubsets function to generate the 3 best models at each level of complexity. You could choose to do more, but the graphical display of the results becomes problematic with large subset sizes. Running the regsubsets function requires the model formula and the specification of the subset size. Performing a summary of the regsubsets object results in a tabulation of the models ranked in order of best fit. An * indicates that the variable is included in the model.

# to get the k best regression models of each size
k = 3
mm = regsubsets(logspec ~ logarea + elev + disn + diss + area + coll, data = dd, nbest = k)
summary(mm)

Subset selection object
Call: regsubsets.formula(logspec ~ logarea + elev + disn + diss + area + coll, data = dd, nbest = k)
6 Variables (and intercept)
        Forced in Forced out
logarea     FALSE      FALSE
elev        FALSE      FALSE
disn        FALSE      FALSE
diss        FALSE      FALSE
area        FALSE      FALSE
coll        FALSE      FALSE
3 subsets of each size up to 6
Selection Algorithm: exhaustive
         logarea elev disn diss area coll
1  ( 1 ) "*"     " "  " "  " "  " "  " "
1  ( 2 ) " "     " "  " "  " "  " "  "*"
1  ( 3 ) " "     "*"  " "  " "  " "  " "
2  ( 1 ) "*"     " "  " "  " "  " "  "*"
2  ( 2 ) "*"     " "  " "  "*"  " "  " "
2  ( 3 ) "*"     "*"  " "  " "  " "  " "
3  ( 1 ) "*"     "*"  " "  " "  " "  "*"
3  ( 2 ) "*"     " "  " "  "*"  " "  "*"
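For comparison, an exhaustive search such as the one regsubsets performs considers unordered subsets of the predictors rather than orderings, and that count is far smaller. A quick check in base R:

```r
# number of distinct (unordered) subsets of 6 predictors, by subset size
n <- 6
counts <- choose(n, 1:n)
counts         # 6 15 20 15 6 1
sum(counts)    # 63 non-empty subsets in an exhaustive search
```

So an exhaustive subset search over 6 candidate variables fits only 63 models, even though the number of distinct variable orderings is far larger.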
3  ( 3 ) "*"     " "  "*"  " "  " "  "*"
4  ( 1 ) "*"     "*"  " "  "*"  " "  "*"
4  ( 2 ) "*"     "*"  "*"  " "  " "  "*"
4  ( 3 ) "*"     "*"  " "  " "  "*"  "*"
5  ( 1 ) "*"     "*"  " "  "*"  "*"  "*"
5  ( 2 ) "*"     "*"  "*"  "*"  " "  "*"
5  ( 3 ) "*"     "*"  "*"  " "  "*"  "*"
6  ( 1 ) "*"     "*"  "*"  "*"  "*"  "*"

Although not printed as part of the summary display, the summary of the regsubsets object contains much more information, including the R² values for each model. I demonstrate below one way to extract those R² values from the summary. You can learn more about the information in the summary by using str() on the summary.

# to get the R-squared values from the regsubsets summary
nummods = k * (n - 1) + 1   # 16 models in this example
a = summary(mm)
m = rep(0, nummods)
for (i in 1:nummods) {
  m[i] = a[[2]][[i]]        # component 2 of the summary holds the R-squared values
}
m

[the 16 R-squared values; numeric values lost in transcription]

There are other options to graphically display the results from regsubsets. For example, one kind of plot shows the model R² on the y-axis and indicates which variables are in each model by shading on the x-axis. White indicates that the variable is not in the model, while shading indicates that it is. Variables that are mostly white contribute to few models, while variables that are mostly black contribute to many models. In our example, as you might expect, logarea, coll, and elev are mostly black while the other variables are mostly white.

# to plot a summary of which variables are in each model, ranked by R-squared
plot(mm, scale = "r2")
plot(mm, scale = "adjr2")
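Once a short list of candidate models is in hand, the same comparison can be made with plain lm fits by extracting each model's adjusted R² from its summary. A sketch on simulated data (the variable names and candidate list are illustrative, not the lab's data):

```r
# simulate data in which y depends on x1 and x2 but not x3
set.seed(6)
n  <- 40
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + x2 + rnorm(n)

# a few candidate models, as a subset search might propose
cands <- list(m1 = lm(y ~ x1),
              m2 = lm(y ~ x1 + x2),
              m3 = lm(y ~ x1 + x2 + x3))

# adjusted R-squared for each candidate
adjr2 <- sapply(cands, function(m) summary(m)$adj.r.squared)
adjr2
names(which.max(adjr2))   # candidate with the highest adjusted R-squared
```

Because adjusted R² penalizes complexity, adding a useless predictor (x3 here) does not automatically raise it the way plain R² would.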
A similar plot can be generated for other goodness-of-fit statistics such as the adjusted R². Remember that

R² = 1 − SS_error / SS_total

and the adjusted R², which penalizes the model for its complexity, is

adjusted R² = 1 − (SS_error / df_error) / (SS_total / df_total).

Finally, there is also another graphical display of the results available in the car package. The function subsets in car will plot the results of a call to regsubsets from leaps.

# plot a summary of the results
library(car)
subsets(mm, statistic = "rsq", abbrev = 1, legend = TRUE, cex = 0.7)
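Both statistics can be computed by hand from any fitted model and checked against what summary() reports. A sketch on simulated data (variable names are illustrative):

```r
# simulate a simple data set and fit a model
set.seed(4)
n   <- 25
x   <- rnorm(n)
y   <- 1 + x + rnorm(n)
fit <- lm(y ~ x)

ss.total <- sum((y - mean(y))^2)   # total sum of squares, df = n - 1
ss.error <- sum(resid(fit)^2)      # residual (error) sum of squares

r2     <- 1 - ss.error / ss.total
adj.r2 <- 1 - (ss.error / fit$df.residual) / (ss.total / (n - 1))

c(by.hand = r2,     from.summary = summary(fit)$r.squared)
c(by.hand = adj.r2, from.summary = summary(fit)$adj.r.squared)
```

Both pairs agree exactly, since summary.lm computes these statistics from the same sums of squares and degrees of freedom.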
The plot abbreviates the variable names:

logarea  l
elev     e
disn     dsn
diss     dss
area     a
coll     c

This summary is similar to the previous one, but labels the models with the included variables. For models with many variables, or for larger subset sizes, the labels overlap and make the plot unreadable. You can also use the regsubsets function with the argument method specified as "forward", "backward", or "seqrep" (sequential replacement) rather than the default, which is "exhaustive".

At this point, after using one of the stepwise algorithms or an exhaustive search, one has a set of candidate models to examine in more detail. Assessing model fit involves all the same procedures used in bivariate regression, since the same assumptions apply. The dependent variable should be normally distributed, scatter plots should indicate linear relationships between the dependent and independent variables, and residual plots should show homoscedasticity (equality of variances of the residuals along the regression line). In addition to these issues, one also needs to check for outliers or overly influential data points, and for high inter-correlations between pairs of independent variables (called multicollinearity). If two independent variables are highly
correlated (r > 0.9), then inclusion of both variables in the model causes problems in parameter estimation. You can pre-screen your independent variables by obtaining a correlation matrix prior to performing the regression and allowing only one variable of each highly correlated pair to serve as a candidate variable for model building at a time. Remember that the tools outlined in Lab 9 for assessing model fit are also applicable to multiple regression models: the norm function from QuantPsyc, plot(modelobj), and plot(data.frame) can provide much useful diagnostic information. Other diagnostic procedures are available in the car package.

Advice on Building and Assessing Regression Models

Building

1. Choose the set of candidate predictor variables to potentially be included in the model.
2. Examine the distribution of the response variable to determine if it meets the assumption of normality. Transform if necessary.
3. Examine scatter plots of the relationships between the response variable y and the predictor (independent) variables x to determine if the relationships are linear. Potentially transform x, y, or both to achieve linearity.
4. Examine the correlations between the predictor variables. High correlations (values of r >> 0.9) might indicate linear dependencies among the predictor variables, which can make the estimates of the regression coefficients unstable and inflate the variance of the estimates. Consider deleting members of these pairs of variables, since they are essentially redundant.
5. Choose the algorithmic approach to fitting a model: in blocks (chunkwise), by forcing entry of variables into the model in a particular sequence, by backward elimination, by forward addition of variables to the model, etc.
6. Decide on the criteria you will use for retaining variables in the model (significant partial t or F statistics at a specified α). Build the model.

Assessing

1.
Obtain a plot of the standardized residuals against the standardized predicted values. Examine this plot for heterogeneity in the distribution of the residuals. A desirable pattern would show both negative and positive residuals of equal magnitude throughout the length of the predicted regression: the envelope of residuals around the regression line should appear rectangular and be centered on the line.
2. Examine the correlations among pairs of predictor variables to check for multicollinearity. If r >> 0.9 for any pair, then try alternative models that eliminate one member of the pair.
3. Examine the diagnostic plots to make sure that there are no observations with high leverage or high influence. Influential data points will have large Cook's D values (a common rule of thumb is D > 1).
4. Compare alternative models to determine if one or more models fit the data equally well.
5. The model with the best residual pattern, that is not beset with collinearity or influential data points, and that has the highest R², is the best model. Note that R² is the last criterion to use in choosing a model, not the first.

Lab 10 Assignment

The exercise to be performed in this lab is to use the StepF and/or regsubsets functions in R to generate a set of candidate models, and to select the individual "best" model, or a set of best models if 2 or more models seem to be equally good. You must discuss in detail the reasons for choosing the models you have selected, including showing plots of residuals, information about the distribution of the response variable, examination of outliers, and other metrics demonstrating goodness-of-fit.

DESCRIPTION OF DATA

The data are stored in the file multr2.csv. The variables are as follows (they appear in this order in the data set):

VARIABLE (UNITS)
- Mean elevation (feet)
- Mean temperature (degrees F)
- Mean annual precipitation (inches)
- Vegetative density (percent cover)
- Drainage area (miles²)
- Latitude (degrees)
- Longitude (degrees)
- Elevation at temperature station (feet)
- 1-hour, 25-year precipitation intensity (inches/hour)
- Annual water yield (inches) (dependent variable)

The data consist of values of these variables measured on all gauged watersheds in the western region of the USA. Develop and evaluate a model for estimating water yield from un-gauged basins in the western USA.
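As a starting point for the assessment steps above, the main diagnostics (residual plot, Cook's distance, and a by-hand variance inflation factor) can be sketched in base R. The data here are simulated and all variable names are illustrative; for the assignment you would apply the same calls to your fitted model for multr2.csv.

```r
# simulate data and fit a two-predictor model
set.seed(5)
n   <- 60
x1  <- rnorm(n); x2 <- rnorm(n)
y   <- 1 + x1 + 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# residual plot: standardized residuals vs fitted values
plot(fitted(fit), rstandard(fit),
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)

# influence: Cook's distance for each observation
cd <- cooks.distance(fit)
which(cd > 1)   # flag highly influential points (D > 1 rule of thumb)

# multicollinearity: VIF for x1 computed by hand as 1 / (1 - R^2)
# from regressing x1 on the remaining predictor(s)
vif.x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
vif.x1
```

The car package's vif() function automates the last calculation for all predictors at once.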
Multivariate Analysis Multivariate Calibration part 2 Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Linear Latent Variables An essential concept in multivariate data
More informationMultiple Regression White paper
+44 (0) 333 666 7366 Multiple Regression White paper A tool to determine the impact in analysing the effectiveness of advertising spend. Multiple Regression In order to establish if the advertising mechanisms
More informationModel selection Outline for today
Model selection Outline for today The problem of model selection Choose among models by a criterion rather than significance testing Criteria: Mallow s C p and AIC Search strategies: All subsets; stepaic
More informationThe problem we have now is called variable selection or perhaps model selection. There are several objectives.
STAT-UB.0103 NOTES for Wednesday 01.APR.04 One of the clues on the library data comes through the VIF values. These VIFs tell you to what extent a predictor is linearly dependent on other predictors. We
More informationModel Selection and Inference
Model Selection and Inference Merlise Clyde January 29, 2017 Last Class Model for brain weight as a function of body weight In the model with both response and predictor log transformed, are dinosaurs
More informationUsing Excel for Graphical Analysis of Data
Using Excel for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters. Graphs are
More informationAnalysis of variance - ANOVA
Analysis of variance - ANOVA Based on a book by Julian J. Faraway University of Iceland (UI) Estimation 1 / 50 Anova In ANOVAs all predictors are categorical/qualitative. The original thinking was to try
More informationTHE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533. Time: 50 minutes 40 Marks FRST Marks FRST 533 (extra questions)
THE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533 MIDTERM EXAMINATION: October 14, 2005 Instructor: Val LeMay Time: 50 minutes 40 Marks FRST 430 50 Marks FRST 533 (extra questions) This examination
More informationInformation Criteria Methods in SAS for Multiple Linear Regression Models
Paper SA5 Information Criteria Methods in SAS for Multiple Linear Regression Models Dennis J. Beal, Science Applications International Corporation, Oak Ridge, TN ABSTRACT SAS 9.1 calculates Akaike s Information
More informationLinear Model Selection and Regularization. especially usefull in high dimensions p>>100.
Linear Model Selection and Regularization especially usefull in high dimensions p>>100. 1 Why Linear Model Regularization? Linear models are simple, BUT consider p>>n, we have more features than data records
More informationEXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY
EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA, 2015 MODULE 4 : Modelling experimental data Time allowed: Three hours Candidates should answer FIVE questions. All questions carry equal
More informationDiscussion Notes 3 Stepwise Regression and Model Selection
Discussion Notes 3 Stepwise Regression and Model Selection Stepwise Regression There are many different commands for doing stepwise regression. Here we introduce the command step. There are many arguments
More informationGraphical Analysis of Data using Microsoft Excel [2016 Version]
Graphical Analysis of Data using Microsoft Excel [2016 Version] Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters.
More informationSection 4 General Factorial Tutorials
Section 4 General Factorial Tutorials General Factorial Part One: Categorical Introduction Design-Ease software version 6 offers a General Factorial option on the Factorial tab. If you completed the One
More information610 R12 Prof Colleen F. Moore Analysis of variance for Unbalanced Between Groups designs in R For Psychology 610 University of Wisconsin--Madison
610 R12 Prof Colleen F. Moore Analysis of variance for Unbalanced Between Groups designs in R For Psychology 610 University of Wisconsin--Madison R is very touchy about unbalanced designs, partly because
More informationData Management - 50%
Exam 1: SAS Big Data Preparation, Statistics, and Visual Exploration Data Management - 50% Navigate within the Data Management Studio Interface Register a new QKB Create and connect to a repository Define
More informationAn introduction to SPSS
An introduction to SPSS To open the SPSS software using U of Iowa Virtual Desktop... Go to https://virtualdesktop.uiowa.edu and choose SPSS 24. Contents NOTE: Save data files in a drive that is accessible
More informationProblem set for Week 7 Linear models: Linear regression, multiple linear regression, ANOVA, ANCOVA
ECL 290 Statistical Models in Ecology using R Problem set for Week 7 Linear models: Linear regression, multiple linear regression, ANOVA, ANCOVA Datasets in this problem set adapted from those provided
More informationUsing the DATAMINE Program
6 Using the DATAMINE Program 304 Using the DATAMINE Program This chapter serves as a user s manual for the DATAMINE program, which demonstrates the algorithms presented in this book. Each menu selection
More informationSalary 9 mo : 9 month salary for faculty member for 2004
22s:52 Applied Linear Regression DeCook Fall 2008 Lab 3 Friday October 3. The data Set In 2004, a study was done to examine if gender, after controlling for other variables, was a significant predictor
More informationRegression on the trees data with R
> trees Girth Height Volume 1 8.3 70 10.3 2 8.6 65 10.3 3 8.8 63 10.2 4 10.5 72 16.4 5 10.7 81 18.8 6 10.8 83 19.7 7 11.0 66 15.6 8 11.0 75 18.2 9 11.1 80 22.6 10 11.2 75 19.9 11 11.3 79 24.2 12 11.4 76
More information[POLS 8500] Stochastic Gradient Descent, Linear Model Selection and Regularization
[POLS 8500] Stochastic Gradient Descent, Linear Model Selection and Regularization L. Jason Anastasopoulos ljanastas@uga.edu February 2, 2017 Gradient descent Let s begin with our simple problem of estimating
More informationMODEL DEVELOPMENT: VARIABLE SELECTION
7 MODEL DEVELOPMENT: VARIABLE SELECTION The discussion of least squares regression thus far has presumed that the model was known with respect to which variables were to be included and the form these
More informationUsing Excel for Graphical Analysis of Data
EXERCISE Using Excel for Graphical Analysis of Data Introduction In several upcoming experiments, a primary goal will be to determine the mathematical relationship between two variable physical parameters.
More informationSTAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem.
STAT 2607 REVIEW PROBLEMS 1 REMINDER: On the final exam 1. Word problems must be answered in words of the problem. 2. "Test" means that you must carry out a formal hypothesis testing procedure with H0,
More informationBivariate Linear Regression James M. Murray, Ph.D. University of Wisconsin - La Crosse Updated: October 04, 2017
Bivariate Linear Regression James M. Murray, Ph.D. University of Wisconsin - La Crosse Updated: October 4, 217 PDF file location: http://www.murraylax.org/rtutorials/regression_intro.pdf HTML file location:
More informationThis electronic supporting information S4 contains the main steps for fitting a response surface model using Minitab 17 (Minitab Inc.).
This electronic supporting information S4 contains the main steps for fitting a response surface model using Minitab 17 (Minitab Inc.). This process was used in Predicting instrumental mass fractionation
More informationFor our example, we will look at the following factors and factor levels.
In order to review the calculations that are used to generate the Analysis of Variance, we will use the statapult example. By adjusting various settings on the statapult, you are able to throw the ball
More informationFurther Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables
Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables
More informationMinitab 17 commands Prepared by Jeffrey S. Simonoff
Minitab 17 commands Prepared by Jeffrey S. Simonoff Data entry and manipulation To enter data by hand, click on the Worksheet window, and enter the values in as you would in any spreadsheet. To then save
More informationNonparametric Classification Methods
Nonparametric Classification Methods We now examine some modern, computationally intensive methods for regression and classification. Recall that the LDA approach constructs a line (or plane or hyperplane)
More informationTwo-Stage Least Squares
Chapter 316 Two-Stage Least Squares Introduction This procedure calculates the two-stage least squares (2SLS) estimate. This method is used fit models that include instrumental variables. 2SLS includes
More informationMiddle School Math Course 3
Middle School Math Course 3 Correlation of the ALEKS course Middle School Math Course 3 to the Texas Essential Knowledge and Skills (TEKS) for Mathematics Grade 8 (2012) (1) Mathematical process standards.
More informationStatistical Bioinformatics (Biomedical Big Data) Notes 2: Installing and Using R
Statistical Bioinformatics (Biomedical Big Data) Notes 2: Installing and Using R In this course we will be using R (for Windows) for most of our work. These notes are to help students install R and then
More informationLecture 24: Generalized Additive Models Stat 704: Data Analysis I, Fall 2010
Lecture 24: Generalized Additive Models Stat 704: Data Analysis I, Fall 2010 Tim Hanson, Ph.D. University of South Carolina T. Hanson (USC) Stat 704: Data Analysis I, Fall 2010 1 / 26 Additive predictors
More information7. Collinearity and Model Selection
Sociology 740 John Fox Lecture Notes 7. Collinearity and Model Selection Copyright 2014 by John Fox Collinearity and Model Selection 1 1. Introduction I When there is a perfect linear relationship among
More information3 Feature Selection & Feature Extraction
3 Feature Selection & Feature Extraction Overview: 3.1 Introduction 3.2 Feature Extraction 3.3 Feature Selection 3.3.1 Max-Dependency, Max-Relevance, Min-Redundancy 3.3.2 Relevance Filter 3.3.3 Redundancy
More informationData Presentation. Figure 1. Hand drawn data sheet
Data Presentation The purpose of putting results of experiments into graphs, charts and tables is two-fold. First, it is a visual way to look at the data and see what happened and make interpretations.
More informationPredicting Porosity through Fuzzy Logic from Well Log Data
International Journal of Petroleum and Geoscience Engineering (IJPGE) 2 (2): 120- ISSN 2289-4713 Academic Research Online Publisher Research paper Predicting Porosity through Fuzzy Logic from Well Log
More informationTHE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann
Forrest W. Young & Carla M. Bann THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA CB 3270 DAVIE HALL, CHAPEL HILL N.C., USA 27599-3270 VISUAL STATISTICS PROJECT WWW.VISUALSTATS.ORG
More informationUsing Machine Learning to Optimize Storage Systems
Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation
More informationPredicting Web Service Levels During VM Live Migrations
Predicting Web Service Levels During VM Live Migrations 5th International DMTF Academic Alliance Workshop on Systems and Virtualization Management: Standards and the Cloud Helmut Hlavacs, Thomas Treutner
More informationRecall the expression for the minimum significant difference (w) used in the Tukey fixed-range method for means separation:
Topic 11. Unbalanced Designs [ST&D section 9.6, page 219; chapter 18] 11.1 Definition of missing data Accidents often result in loss of data. Crops are destroyed in some plots, plants and animals die,
More informationRegression Models Course Project Vincent MARIN 28 juillet 2016
Regression Models Course Project Vincent MARIN 28 juillet 2016 Executive Summary "Is an automatic or manual transmission better for MPG" "Quantify the MPG difference between automatic and manual transmissions"
More informationGeneralized Additive Models
Generalized Additive Models Statistics 135 Autumn 2005 Copyright c 2005 by Mark E. Irwin Generalized Additive Models GAMs are one approach to non-parametric regression in the multiple predictor setting.
More informationSPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL
SPSS QM II SHORT INSTRUCTIONS This presentation contains only relatively short instructions on how to perform some statistical analyses in SPSS. Details around a certain function/analysis method not covered
More informationUnivariate Extreme Value Analysis. 1 Block Maxima. Practice problems using the extremes ( 2.0 5) package. 1. Pearson Type III distribution
Univariate Extreme Value Analysis Practice problems using the extremes ( 2.0 5) package. 1 Block Maxima 1. Pearson Type III distribution (a) Simulate 100 maxima from samples of size 1000 from the gamma
More informationBasics of Multivariate Modelling and Data Analysis
Basics of Multivariate Modelling and Data Analysis Kurt-Erik Häggblom 9. Linear regression with latent variables 9.1 Principal component regression (PCR) 9.2 Partial least-squares regression (PLS) [ mostly
More informationStat 8053, Fall 2013: Additive Models
Stat 853, Fall 213: Additive Models We will only use the package mgcv for fitting additive and later generalized additive models. The best reference is S. N. Wood (26), Generalized Additive Models, An
More informationBayesFactor Examples
BayesFactor Examples Michael Friendly 04 Dec 2015 The BayesFactor package enables the computation of Bayes factors in standard designs, such as one- and two- sample designs, ANOVA designs, and regression.
More informationBuilding Better Parametric Cost Models
Building Better Parametric Cost Models Based on the PMI PMBOK Guide Fourth Edition 37 IPDI has been reviewed and approved as a provider of project management training by the Project Management Institute
More informationApplied Regression Modeling: A Business Approach
i Applied Regression Modeling: A Business Approach Computer software help: SPSS SPSS (originally Statistical Package for the Social Sciences ) is a commercial statistical software package with an easy-to-use
More informationMachine Learning. Topic 4: Linear Regression Models
Machine Learning Topic 4: Linear Regression Models (contains ideas and a few images from wikipedia and books by Alpaydin, Duda/Hart/ Stork, and Bishop. Updated Fall 205) Regression Learning Task There
More informationINF 4300 Classification III Anne Solberg The agenda today:
INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15
More informationMachine Learning / Jan 27, 2010
Revisiting Logistic Regression & Naïve Bayes Aarti Singh Machine Learning 10-701/15-781 Jan 27, 2010 Generative and Discriminative Classifiers Training classifiers involves learning a mapping f: X -> Y,
More informationPackage optimus. March 24, 2017
Type Package Package optimus March 24, 2017 Title Model Based Diagnostics for Multivariate Cluster Analysis Version 0.1.0 Date 2017-03-24 Maintainer Mitchell Lyons Description
More informationStat 5303 (Oehlert): Response Surfaces 1
Stat 5303 (Oehlert): Response Surfaces 1 > data
More informationElemental Set Methods. David Banks Duke University
Elemental Set Methods David Banks Duke University 1 1. Introduction Data mining deals with complex, high-dimensional data. This means that datasets often combine different kinds of structure. For example:
More informationLecture 7: Linear Regression (continued)
Lecture 7: Linear Regression (continued) Reading: Chapter 3 STATS 2: Data mining and analysis Jonathan Taylor, 10/8 Slide credits: Sergio Bacallado 1 / 14 Potential issues in linear regression 1. Interactions
More informationGeneral Factorial Models
In Chapter 8 in Oehlert STAT:5201 Week 9 - Lecture 2 1 / 34 It is possible to have many factors in a factorial experiment. In DDD we saw an example of a 3-factor study with ball size, height, and surface
More information1 StatLearn Practical exercise 5
1 StatLearn Practical exercise 5 Exercise 1.1. Download the LA ozone data set from the book homepage. We will be regressing the cube root of the ozone concentration on the other variables. Divide the data
More informationIQR = number. summary: largest. = 2. Upper half: Q3 =
Step by step box plot Height in centimeters of players on the 003 Women s Worldd Cup soccer team. 157 1611 163 163 164 165 165 165 168 168 168 170 170 170 171 173 173 175 180 180 Determine the 5 number
More informationStat 5303 (Oehlert): Unbalanced Factorial Examples 1
Stat 5303 (Oehlert): Unbalanced Factorial Examples 1 > section
More informationThe RcmdrPlugin.HH Package
Type Package The RcmdrPlugin.HH Package Title Rcmdr support for the HH package Version 1.1-4 Date 2007-07-24 July 31, 2007 Author Richard M. Heiberger, with contributions from Burt Holland. Maintainer
More informationGene Clustering & Classification
BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering
More informationUsing Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers
Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers Why enhance GLM? Shortcomings of the linear modelling approach. GLM being
More information