BIOL 458 BIOMETRY Lab 10 - Multiple Regression


Many problems in science involve the analysis of multi-variable data sets. For data sets in which there is a single continuous dependent variable but several continuous (and potentially discrete) independent variables, multiple regression is used. Multiple regression is a method of fitting linear models of the form

Y_i = β_0 + β_1 X_1i + ... + β_p X_pi + ε_i,    where ε_i ~ iid N(0, σ²).

Y_i is the response of subject i.
β_0 is the y-intercept of the model (one of the model's coefficients).
β_1, ..., β_p are the model coefficients or weights that relate Y_i to the experimental treatments or explanatory variables.
X_1i, ..., X_pi are the values of the measured independent variables, or codes used to classify a subject as to group membership. For example, 1 for receiving medicine and 0 for placebo might be the codes we use in a two-group design.
ε_i are random errors or residuals that arise from the deviation of the observed responses from the model's predictions.

The values of the regression coefficients are determined by minimizing the sum of squares of the residuals (errors), i.e., minimizing

Σ_{i=1..n} ε_i² = Σ_{i=1..n} (Y_i − Ŷ_i)² = Σ_{i=1..n} (Y_i − [β_0 + β_1 X_1i + ... + β_p X_pi])²,

where Ŷ_i is the predicted value from the model associated with observation Y_i. Hypothesis tests about the regression coefficients, or about the contribution of particular terms or groups of terms to the fit of the model, are then performed to determine the utility of the model.

In many studies, regression programs are used to generate a series of models; the "best" of these models is then chosen on the basis of a variety of criteria. These models can be generated using algorithms in which the addition of variables to the model is done in a stepwise fashion, or by examining a large number or even all possible regression models given the data.
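Before turning to model selection, it may help to see such a fit in miniature. The following is a minimal sketch, not part of the lab data: x1, x2, and y are simulated purely for illustration, and the fit shows the least-squares coefficients and their t-tests described above.

# Minimal sketch (simulated data, not from the lab): fit a two-predictor
# multiple regression with lm() and inspect the least-squares estimates.
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)
y  <- 2 + 1.5*x1 - 0.8*x2 + rnorm(50, sd=0.5)   # true coefficients are known here

fit <- lm(y ~ x1 + x2)
summary(fit)        # coefficient estimates, standard errors, t-tests, R-squared
coef(fit)           # the least-squares estimates of beta_0, beta_1, beta_2
sum(resid(fit)^2)   # the minimized residual sum of squares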

Stepwise Regression

Stepwise regression builds a model by adding or removing variables one at a time. These methods have lost much of their popularity because it has been shown that they are not guaranteed to select the best model.

In a forward stepwise regression, new independent variables are added to the model if they meet a set significance criterion for inclusion (often p < 0.05 for the partial F-test for the inclusion of the term in the model). The variable with the lowest p-value is added to the model at each step, and the algorithm stops when no new variable meets the significance criterion.

In a backward stepwise regression, all independent variables are initially entered into the model. They are then sequentially removed if they do not meet a set significance criterion for retention (often p > 0.10 or p > 0.05 for the partial F-test for removal of a term). The variable with the highest p-value is removed from the model at each step until no variable meeting the removal criterion remains.

Stepwise regression combines both techniques, with variables added or removed at each step of the process. A variable is entered if it meets the p-value to enter; after each variable is added to the equation, all other variables in the equation are tested against the p-value to remove, and if necessary a variable is removed from the model. SPSS, SAS, MINITAB, SYSTAT, BMDP, and other statistical packages include these routines. The output generated by these routines consists of a series of models for estimating the value of Y and the goodness-of-fit statistics for each model. Each model estimates the value of Y as a linear combination of the values of the predictor variables included in that model.

In R, no stepwise regression routine using a partial F-test is available in the base installation or as a user-contributed package. R does contain a stepwise function called stepAIC (in the MASS package) that uses Akaike's Information Criterion as the basis for stepwise model selection. We will avoid this information-theoretic approach to model selection for now. However, a former student (Joe Hill Warren) wrote a function called StepF which we can use to examine the behavior of these algorithms. Note that StepF does not perform the combined stepwise algorithm, only the forward and backward selection algorithms. To read more about the StepF function, click the link.

To use the StepF function, download the file StepF.R and open it in RStudio. From the File Menu, open the file, then click on the Code Menu and click Source. Alternatively, from the Code Menu just click Source File and choose the downloaded file StepF.R. This will load the function StepF, and you can use its features to perform stepwise regression.

Later in this lab we will address issues in building and assessing regression models. We could at this point use a number of techniques to examine our data before beginning the process of model selection, or we could use those same techniques after developing a set of candidate models to assess. In this demonstration I will take the latter approach, postponing a detailed assessment of whether a model meets the assumptions of regression until later.
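At each step, these algorithms are simply comparing a model with and without one candidate term, and that comparison can be reproduced directly with anova() on two nested lm() fits, which reports the partial F-test for the added term. A minimal, self-contained sketch with made-up data (not from the lab):

# Hypothetical illustration of a single partial F-test, the comparison made
# at each step of forward or backward selection: compare nested lm() fits.
set.seed(2)
x1 <- rnorm(40)
x2 <- rnorm(40)
y  <- 1 + 0.9*x1 + rnorm(40)     # x2 has no real effect in this simulation
small <- lm(y ~ x1)              # model without the candidate term
large <- lm(y ~ x1 + x2)         # model with the candidate term added
anova(small, large)              # partial F-test for adding x2, given x1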

To demonstrate multiple regression we will examine data on the species richness of plants in the Galapagos Islands. The data file Galapagos-plants.txt contains species richness and the number of endemic species for plants on 29 islands, along with data about the physical characteristics of the islands (island name, island area, maximum elevation, distance to the nearest island, area of the nearest island, distance to Santa Cruz Island, and the number of botanical collecting trips to each island).

# read in data file on Galapagos plants
dat=read.table("k:/biometry/biometry-fall-2015/lab10/galapagos-plants.txt", header=TRUE)
head(dat)

  Isla Spec  Area Elev DisN DisS   AreA Coll Endm
1 Balt   57 25.09  150  0.6  0.6 903.82   10   23
2 Bart   31  1.24  109  0.6 26.3 572.33    6   21
3 Cald    3  0.21  114  2.8 58.7 170.92    1    3
4 Cham   25  0.10   46  1.9 47.4   0.18    4    9
5 Coam    2  0.05   10  1.9  1.9 903.82    1    1
6 Daph   18  0.34   10  8.0  8.0 903.82    6   11

It is traditional to examine the relationship between log(number of species) and log(area), so I will create these variables and a new data.frame to hold them, along with the other original variables, for the analysis.

# create variables to be used in the regression and put them in a new data.frame
logarea=log(dat$Area)
logspec=log(dat$Spec)
elev=dat$Elev
diss=dat$DisS
disn=dat$DisN
coll=dat$Coll
area=dat$AreA
dd=data.frame(logspec,logarea,elev,diss,disn,area,coll)
head(dd)

    logspec    logarea elev diss disn   area coll
1 4.0430513  3.2224694  150  0.6  0.6 903.82   10
2 3.4339872  0.2151114  109 26.3  0.6 572.33    6
3 1.0986123 -1.5606477  114 58.7  2.8 170.92    1
4 3.2188758 -2.3025851   46 47.4  1.9   0.18    4
5 0.6931472 -2.9957323   10  1.9  1.9 903.82    1
6 2.8903718 -1.0788097   10  8.0  8.0 903.82    6
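A quick, optional way to eyeball these relationships before any modeling is a scatterplot matrix; this is not part of the original handout, but one line produces it for the new data.frame:

# optional: scatterplot matrix of all variables in dd to check for linear relationships
pairs(dd)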

As an initial diagnostic step, I obtain the correlation matrix of the variables. I set options(digits=4) to control how many digits are printed so that the matrix does not wrap around.

# obtain correlation matrix of variables
options(digits=4)
cor(dd)

         logspec  logarea     elev      diss      disn     area    coll
logspec  1.00000  0.89060 0.673628 -0.084227  0.150027 -0.01344  0.8522
logarea  0.89060  1.00000 0.817091  0.056801  0.247098  0.06126  0.8294
elev     0.67363  0.81709 1.000000  0.006814  0.008184  0.25368  0.6899
diss    -0.08423  0.05680 0.006814  1.000000  0.614391 -0.10690 -0.1327
disn     0.15003  0.24710 0.008184  0.614391  1.000000 -0.17220  0.0890
area    -0.01344  0.06126 0.253683 -0.106900 -0.172201  1.00000 -0.1425
coll     0.85217  0.82937 0.689866 -0.132707  0.089000 -0.14251  1.0000

Note that we can already see that logspec is strongly associated with logarea and coll (the number of collecting trips), and less strongly associated with elev, so we might expect these to be the variables that are entered into the regression models.

Now we will source the StepF.R code file. Note that in R Markdown you need to give the full path and name of the file to be sourced.

source("k:/biometry/biometry-fall-2015/lab10/stepf.r")

Now let's use the forward stepwise approach to select a model. Note that the output is a multi-step process. At each step, partial F-tests are reported that test whether the reduction in the residual sum of squares obtained by adding each variable to the model individually would be statistically significant. The variable that causes the greatest reduction in the residual sum of squares is added to the model. Note on iteration 1 that with only the grand mean in the model the RSS (residual sum of squares) is 70.6, but that if logarea is added to the model the RSS will be reduced to 14.6. logarea also has the smallest p-value, so logarea is added to the model first. On iteration 2, after logarea has been added, we see that only the addition of coll would result in a statistically significant reduction in the RSS at α = 0.05 (p = 0.017), so coll is added to the model. However, on iteration 3 none of the remaining variables has p < 0.05, so the algorithm stops after adding logarea and coll to the model.

# perform forward stepwise regression
mod.7=StepF(datatable=dd, response="logspec", level=0.05, direction="forward")

==================== Iteration #1 ====================
Single term additions

Model:
logspec ~ 1
        Df Sum of Sq  RSS    AIC F value  Pr(>F)
<none>               70.6  27.81
logarea  1      56.0 14.6 -15.89  103.54 9.8e-11 ***
elev     1      32.0 38.6  12.27   22.43 6.2e-05 ***
diss     1       0.5 70.1  29.60    0.19    0.66
disn     1       1.6 69.0  29.15    0.62    0.44
area     1       0.0 70.6  29.80    0.00    0.94
coll     1      51.3 19.3  -7.76   71.61 4.5e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable with lowest p-value: logarea 9.76507623932042e-11
Updating model formula: .~. +logarea

==================== Iteration #2 ====================
Single term additions

Model:
logspec ~ logarea
        Df Sum of Sq  RSS   AIC F value Pr(>F)
<none>               14.6 -15.9
elev     1     0.621 14.0 -15.2    1.15  0.292
diss     1     1.287 13.3 -16.6    2.51  0.125
disn     1     0.369 14.2 -14.6    0.67  0.419
area     1     0.328 14.3 -14.6    0.60  0.447
coll     1     2.916 11.7 -20.4    6.49  0.017 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable with lowest p-value: coll 0.0171409159531924
Updating model formula: .~. +coll

==================== Iteration #3 ====================
Single term additions

Model:
logspec ~ logarea + coll
        Df Sum of Sq  RSS   AIC F value Pr(>F)
<none>               11.7 -20.4
elev     1     0.728 11.0 -20.2    1.66   0.21
diss     1     0.381 11.3 -19.3    0.84   0.37
disn     1     0.061 11.6 -18.5    0.13   0.72
area     1     0.000 11.7 -18.4    0.00   0.98

========== No further variables significant at 0.05 ==========

Final Model:

We can use the backward stepwise approach as well. This algorithm will not always converge on the same model as the forward approach, but in this instance it does. In the backward approach all variables are initially put into the model, and those that cause the smallest increase in the RSS are sequentially removed. It takes a couple more iterations than the forward approach, but it converges on the same best model.

# perform backward stepwise regression
mod.8=StepF(datatable=dd, response="logspec", level=0.05, direction="backward")

==================== Iteration #1 ====================
Single term deletions

Model:
logspec ~ logarea + elev + diss + disn + area + coll
        Df Sum of Sq  RSS    AIC F value  Pr(>F)
<none>               10.4 -15.69
logarea  1      6.94 17.4  -2.87   14.66 0.00091 ***
elev     1      0.85 11.3 -15.40    1.80 0.19312
diss     1      0.12 10.5 -17.34    0.26 0.61364
disn     1      0.04 10.5 -17.57    0.09 0.76936
area     1      0.05 10.5 -17.55    0.10 0.75309
coll     1      1.90 12.3 -12.82    4.02 0.05749 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable with highest p-value: disn 0.76935945292974
formula: .~. -disn

==================== Iteration #2 ====================
Single term deletions

Model:
logspec ~ logarea + elev + diss + area + coll
        Df Sum of Sq  RSS    AIC F value  Pr(>F)
<none>               10.5 -17.57
logarea  1      7.59 18.1  -3.75   16.69 0.00046 ***
elev     1      0.83 11.3 -17.36    1.82 0.19039
diss     1      0.35 10.8 -18.61    0.77 0.38880
area     1      0.05 10.5 -19.43    0.11 0.73924
coll     1      1.92 12.4 -14.68    4.22 0.05140 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable with highest p-value: area 0.739236282590088
formula: .~. -area

==================== Iteration #3 ====================
Single term deletions

Model:
logspec ~ logarea + elev + diss + coll
        Df Sum of Sq  RSS   AIC F value Pr(>F)
<none>               10.5 -19.4
logarea  1      7.82 18.3  -5.3   17.85 0.0003 ***
elev     1      0.80 11.3 -19.3    1.82 0.1904
diss     1      0.45 11.0 -20.2    1.02 0.3219
coll     1      2.05 12.6 -16.3    4.68 0.0407 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable with highest p-value: diss 0.321891770767111
formula: .~. -diss

==================== Iteration #4 ====================
Single term deletions

Model:
logspec ~ logarea + elev + coll
        Df Sum of Sq  RSS    AIC F value  Pr(>F)
<none>               11.0 -20.22
logarea  1      7.38 18.3  -7.28   16.84 0.00038 ***
elev     1      0.73 11.7 -20.35    1.66 0.20926
coll     1      3.02 14.0 -15.15    6.89 0.01454 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable with highest p-value: elev 0.209259986113194
formula: .~. -elev

==================== Iteration #5 ====================
Single term deletions

Model:
logspec ~ logarea + coll
        Df Sum of Sq  RSS    AIC F value  Pr(>F)
<none>               11.7 -20.35
logarea  1      7.64 19.3  -7.76   17.00 0.00034 ***
coll     1      2.92 14.6 -15.89    6.49 0.01714 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

========== All variables significant at 0.05 ==========

Final Model:

There are other ways in which the StepF function can be used. One approach is to build a model that contains variables you wish to force into the model, and then check whether any other variables would be added after those forced in. For example, suppose we wanted to force coll into the model and wanted to know whether other variables explain residual variation in logspec after accounting for coll. We could think of coll as a nuisance variable that measures the differences in sampling effort across the islands. Perhaps we want to know which other variables are still useful in explaining variation among the islands in the species richness of plants after accounting for the variable sampling effort. To do this, we first build a linear model with coll as the only predictor and save it in a model object.

Then we call StepF, specifying the model object name as our initial model and a scope argument listing coll and any other variables we wish to assess. Note that this process indicates that even after accounting for the variability in sampling effort among islands, logarea still explains residual variation in logspec.

# to determine if any variables would be added to a model with only coll as the predictor variable
# first build a linear model with coll as the only predictor variable
lm1=lm(formula=logspec~coll)
# then use StepF, specifying the model with coll and a "scope" argument listing coll and the other candidate variables
StepF(model=lm1, scope=formula(~ coll+logarea+elev+disn+diss+area), level=0.05, direction="forward")

==================== Iteration #1 ====================
Single term additions

Model:
logspec ~ coll
        Df Sum of Sq  RSS    AIC F value  Pr(>F)
<none>               19.3  -7.76
logarea  1      7.64 11.7 -20.35   17.00 0.00034 ***
elev     1      0.99 18.3  -7.28    1.40 0.24676
disn     1      0.39 18.9  -6.35    0.54 0.46998
diss     1      0.06 19.3  -5.85    0.08 0.77851
area     1      0.84 18.5  -7.05    1.18 0.28692
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable with lowest p-value: logarea 0.000338442216827803
Updating model formula: .~. +logarea

==================== Iteration #2 ====================
Single term additions

Model:
logspec ~ coll + logarea
        Df Sum of Sq  RSS   AIC F value Pr(>F)
<none>               11.7 -20.4
elev     1     0.728 11.0 -20.2    1.66   0.21
disn     1     0.061 11.6 -18.5    0.13   0.72
diss     1     0.381 11.3 -19.3    0.84   0.37
area     1     0.000 11.7 -18.4    0.00   0.98

========== No further variables significant at 0.05 ==========

Final Model:

Call:
lm(formula = logspec ~ coll + logarea)

Coefficients:
(Intercept)         coll      logarea
     2.4490       0.0593       0.2691

The StepF pdf file explains other ways in which you might use the StepF function. Note that StepF does not report the regression coefficients, nor does it compute the residuals and other fit statistics for the final model. After using StepF to select models, one must then use the lm function to fit the selected models and evaluate their adequacy.
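As an example of that follow-up, the model selected above can be refit directly with lm(); this short sketch is not part of the original handout, but it uses only the dd data.frame already created.

# refit the selected model with lm() to obtain coefficients, tests, and R-squared
best <- lm(logspec ~ logarea + coll, data=dd)
summary(best)    # coefficient table, residual SE, R-squared, overall F-test
confint(best)    # 95% confidence intervals for the coefficients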

Best Subsets Regression

An alternative approach to model selection is to compute all possible regressions given a set of candidate explanatory variables, or at least the best subsets of models for various levels of model complexity. By model complexity, I mean the number of predictor variables included in the model. To calculate the number of possible models with 6 predictor variables, we need to compute the number of permutations of 6 variables taken 1, 2, 3, 4, 5, or 6 at a time and add them up. A permutation is like a combination except that we consider the case AB as different from the case BA. The number of permutations of n things taken k at a time is

P(n, k) = n! / (n − k)!

(for example, P(6, 2) = 6!/4! = 30 ordered pairs of predictors). The calculation of the number of permutations is similar to the calculation of the number of combinations of n things taken k at a time, except that it lacks a factor of k! in the denominator. R code to calculate the number of possible models is given below. Do you really want to generate, fit, and assess all possible 1237 models? Remember that in regression we use Type I sums of squares, so in models with different orderings of the variables the individual variables may explain different amounts of variation in the response variable.

# calculate the number of permutations of 6 variables for models with 1 to 6 predictor variables
perm=rep(0,5)
n=6
for (k in 1:(n-1)){
  perm[k]=factorial(n)/factorial(n-k)
}
perm

[1]   6  30 120 360 720

totperm=sum(perm)+1
totperm

[1] 1237

Rather than tackling the daunting task of examining 1237 models, we will use the regsubsets function from the package leaps to select the best k models with 1 predictor, 2 predictors, and so on.

# load package leaps
library(leaps)

In leaps we will use the regsubsets function to generate the 3 best models for each level of complexity. You could choose to do more, but the graphical display of the results becomes problematic with large subset sizes. Running the regsubsets function requires the model formula and the specification of the subset size (nbest). Performing a summary of the regsubsets object produces a tabulation of the models ranked in order of best fit. An * indicates that the variable is included in the model.

# to get the k best regression models of each size
k=3
mm=regsubsets(logspec~logarea+elev+disn+diss+area+coll, data=dd, nbest=k)
summary(mm)

Subset selection object
Call: regsubsets.formula(logspec ~ logarea + elev + disn + diss + area + coll, data = dd, nbest = k)
6 Variables  (and intercept)
        Forced in Forced out
logarea     FALSE      FALSE
elev        FALSE      FALSE
disn        FALSE      FALSE
diss        FALSE      FALSE
area        FALSE      FALSE
coll        FALSE      FALSE
3 subsets of each size up to 6
Selection Algorithm: exhaustive
         logarea elev disn diss area coll
1  ( 1 ) "*"     " "  " "  " "  " "  " "
1  ( 2 ) " "     " "  " "  " "  " "  "*"
1  ( 3 ) " "     "*"  " "  " "  " "  " "
2  ( 1 ) "*"     " "  " "  " "  " "  "*"
2  ( 2 ) "*"     " "  " "  "*"  " "  " "
2  ( 3 ) "*"     "*"  " "  " "  " "  " "
3  ( 1 ) "*"     "*"  " "  " "  " "  "*"
3  ( 2 ) "*"     " "  " "  "*"  " "  "*"
3  ( 3 ) "*"     " "  "*"  " "  " "  "*"
4  ( 1 ) "*"     "*"  " "  "*"  " "  "*"
4  ( 2 ) "*"     "*"  "*"  " "  " "  "*"
4  ( 3 ) "*"     "*"  " "  " "  "*"  "*"
5  ( 1 ) "*"     "*"  " "  "*"  "*"  "*"
5  ( 2 ) "*"     "*"  "*"  "*"  " "  "*"
5  ( 3 ) "*"     "*"  "*"  " "  "*"  "*"
6  ( 1 ) "*"     "*"  "*"  "*"  "*"  "*"

Although not printed as part of the summary display, the summary of the regsubsets object contains much more information, including the R² values for each model. Below I demonstrate one way to extract those R² values from the summary. You can learn more about the information in the summary by using str() on the summary object.

# to get rsq values from the regsubsets object
nummods=k*(n-1)+1
m=rep(0,nummods)
a=summary(mm)
for (i in 1:nummods){
  m[i]=summary(mm)[[2]][[i]]
}
m

 [1] 0.7932 0.7262 0.4538 0.8345 0.8114 0.8020 0.8448 0.8399 0.8353 0.8511
[11] 0.8495 0.8469 0.8518 0.8518 0.8507 0.8524

There are other options for graphically displaying the results from regsubsets. For example, one kind of plot shows the model R² on the y-axis and indicates which variables are in each model by shading on the x-axis. White indicates that the variable is not in the model, while shading indicates that it is. Variables that are mostly white contribute to few models, while variables that are mostly black contribute to many models. In our example, as you might expect, logarea, coll, and elev are mostly black while the other variables are mostly white.

# to plot a summary of which variables are in each model, ranked by r2
plot(mm, scale="r2")
plot(mm, scale="adjr2")
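The same summary object also stores the other fit criteria (adjusted R², Mallows' Cp, and BIC) as vectors, so they can be pulled out numerically as well as plotted. A short sketch, not in the original handout, using the summary object a created above:

# other goodness-of-fit criteria stored in the summary of the regsubsets object
a$adjr2              # adjusted R-squared for each of the 16 models
a$cp                 # Mallows' Cp
a$bic                # BIC
which.max(a$adjr2)   # index of the model with the largest adjusted R-squared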

A similar plot can be generated for other goodness-of-fit statistics such as the adjusted R². Remember that

R² = 1 − (SS_error / SS_total),

and the adjusted R², which penalizes the model for its complexity, is

adjusted R² = 1 − (SS_error / df_error) / (SS_total / df_total).
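To make the complexity penalty concrete, both quantities can be computed by hand for the selected two-predictor model and compared with what summary.lm() reports. A small sketch, not part of the original handout, using the dd data.frame from above:

# R-squared and adjusted R-squared computed by hand for the selected model
best  <- lm(logspec ~ logarea + coll, data=dd)
sse   <- sum(resid(best)^2)                       # SS_error
sst   <- sum((dd$logspec - mean(dd$logspec))^2)   # SS_total
n_obs <- nrow(dd)
r2    <- 1 - sse/sst                              # R-squared
adjr2 <- 1 - (sse/(n_obs - 3)) / (sst/(n_obs - 1))  # df_error = n - 3 (3 coefficients), df_total = n - 1
c(r2, adjr2)
summary(best)$r.squared; summary(best)$adj.r.squared   # should match the hand calculation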

Finally, there is another graphical display of the results available in the car package. The function subsets in car will plot the results of a call to regsubsets from leaps.

# plot a summary of results
library(car)
subsets(mm, statistic="rsq", abbrev=1, legend=TRUE, cex=0.7)

        Abbreviation
logarea l
elev    e
disn    dsn
diss    dss
area    a
coll    c

This summary is similar to the previous one, but labels the models with the included variables. For models with many variables or for larger subset sizes the labels overlap, making the plot unreadable. You can also use the regsubsets function with the argument method specified as "forward", "backward", or "seqrep" (sequential replacement) rather than the default, which is "exhaustive".

At this point, after using one of the stepwise algorithms or an exhaustive search, one has a set of candidate models to examine in more detail. Assessing model fit involves all the same procedures used in bivariate regression, since the same assumptions apply. The dependent variable should be normally distributed, scatter plots should indicate linear relationships between the dependent and independent variables, and residual plots should show homoscedasticity (equality of variance in the residuals throughout the regression line). In addition to these issues, one also needs to check for outliers or overly influential data points, and for high inter-correlations between pairs of independent variables (called multicollinearity). If two independent variables are highly correlated (r > 0.9), then inclusion of both variables in the model causes problems in parameter estimation.
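Pairwise correlations do not catch every linear dependency among predictors; variance inflation factors are a common complementary check, and the car package (already loaded above) provides one. A brief sketch, not part of the original handout:

# variance inflation factors for the full model; values much larger than
# about 10 are often taken as a sign of problematic multicollinearity
full <- lm(logspec ~ logarea + elev + disn + diss + area + coll, data=dd)
vif(full)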

You can pre-screen your independent variables by obtaining a correlation matrix prior to performing the regression and allowing only one variable of a pair of highly correlated variables to serve as a candidate for model building at a time. Remember that the tools outlined in Lab 9 for assessing model fit are also applicable to multiple regression models: the norm function from QuantPsyc, plot(modelobj), and plot(data.frame) can provide much useful diagnostic information. Other diagnostic procedures are available in the car package.

Advice on Building and Assessing Regression Models

Building

1. Choose the set of candidate predictor variables to potentially be included in the model.
2. Examine the distribution of the response variable to determine whether it meets the assumption of normality. Transform if necessary.
3. Examine scatter plots of the relationships between the response variable y and the predictor (independent) variables x to determine whether the relationships are linear. Potentially transform x, y, or both to achieve linearity.
4. Examine the correlations between the predictor variables. High correlations (values of r >> 0.9) might suggest linear dependencies among the predictor variables, which can make the estimates of the regression coefficients unstable and inflate the variance of the estimates. Consider deleting members of these pairs of variables, since they are essentially redundant.
5. Choose the algorithmic approach to fitting a model: in blocks (chunkwise), by forcing entry of variables into the model in a particular sequence, by backward elimination, by forward addition of variables to the model, etc.
6. Decide on the criteria you will use for retaining variables in the model (significant partial t or F statistics at a specified α). Build the model.

Assessing

1. Obtain a plot of the standardized residuals against the standardized predicted values. Examine this plot for heterogeneity in the distribution of the residuals. A desirable pattern would show both negative and positive residuals of equal magnitude throughout the length of the predicted regression; the envelope of residuals around the regression line should appear rectangular and be centered on the line.

2. Examine the correlations among pairs of predictor variables to check for multicollinearity. If for any pair r >> 0.9, try alternative models that eliminate one member of the pair.
3. Examine the diagnostic plots to make sure that there are no observations with high leverage or high influence. Influential data points will have Cook's D values greater than 1.
4. Compare alternative models to determine whether two or more models fit the data equally well.
5. The model with the best residual pattern, that is not beset with collinearity or influential data points, and that has the highest R² is the best model. Note that R² is the last criterion to use in choosing a model, not the first.

Lab 10 Assignment

The exercise to be performed in this lab is to use the StepF and/or regsubsets functions in R to generate a set of candidate models, and to select the single "best" model, or the set of best models if two or more models seem equally good. You must discuss in detail the reasons for choosing the models you have selected, including showing plots of residuals, information about the distribution of the response variable, examination of outliers, and other metrics that demonstrate goodness of fit. A rough starter sketch of this workflow is given after the variable list below.

DESCRIPTION OF DATA

The data are stored in the file multr2.csv. The variables are as follows (they appear in the same order in the data set):

VARIABLE (UNITS)
Mean elevation (feet)
Mean temperature (degrees F)
Mean annual precipitation (inches)
Vegetative density (percent cover)
Drainage area (miles²)
Latitude (degrees)
Longitude (degrees)
Elevation at temperature station (feet)
1-hour, 25-year precipitation intensity (inches/hour)
Annual water yield (inches) (dependent variable)

The data consist of values of these variables measured on all gauged watersheds in the western region of the USA. Develop and evaluate a model for estimating water yield from un-gauged basins in the western USA.
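As a rough starting point only, the steps above can be strung together in a few lines. This is a sketch under stated assumptions, not a solution: the column names wateryield, precip, drainarea, and vegden are placeholders invented here, since the actual headers in multr2.csv may differ, and the file path assumes the file is in your working directory.

# Hypothetical starter sketch for the assignment; the column names below are
# placeholders and must be replaced with the actual headers in multr2.csv.
wat <- read.csv("multr2.csv", header=TRUE)
head(wat)                       # check the real variable names first
cor(wat)                        # screen for highly correlated predictors

library(leaps)
sub <- regsubsets(wateryield ~ ., data=wat, nbest=3)   # best subsets of each size
summary(sub)

# refit a promising candidate and run the usual diagnostics
cand <- lm(wateryield ~ precip + drainarea + vegden, data=wat)  # example subset only
summary(cand)
plot(cand)                      # residual, Q-Q, scale-location, and leverage plots
cooks.distance(cand)            # influence; values greater than 1 are worrisome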