Problem set for Week 7 Linear models: Linear regression, multiple linear regression, ANOVA, ANCOVA



ECL 290 Statistical Models in Ecology using R

Datasets in this problem set are adapted from those provided by Montgomery et al. (Introduction to Linear Regression, 4th ed., Wiley, 2006). URL:

1. Univariate linear regression

We will start with a simple example. Animals often spend less time foraging in the presence of greater predation risk. Forage.csv has data on the time spent foraging on zooplankton (in seconds during a 1-hour trial) for fish in locations with various ambient densities of predators (in mean no. fish per 100 m²). The question: does foraging time vary with predator density?

First, read in the data.

    Forage = read.csv('Forage.csv')

This will create a data frame named 'Forage', which has two fields: Forage$Preds and Forage$Time. R is case-sensitive, so the names used in the code must match these exactly. Always plot the data first to look at the relationship between the two variables.

Now implement a linear model: Time = β0 + β1*Preds + ε. Note the syntax that R uses for this type of model:

    forage.fit = lm(Time~Preds,data=Forage)

The lm function (and many similar functions that we will be using) creates a "model fit" object which holds all sorts of results pertaining to this linear model (obviously you do not need to name the variable xxx.fit, but I will follow that convention for consistency here). The trick to using lm effectively is to know what other functions you have to use to extract the information you want from the fit object.

Use the anova function to obtain an overall test on the model.

    anova(forage.fit)

Then use summary() to get more information on the model fit.

    summary(forage.fit)

Notice the similarity in the results of those two operations (with a single predictor, the F test from anova() and the t test on the slope from summary() give the same p-value). summary() gives you the coefficients and their standard errors. Another way to get the same information is this:

    forage.coefs = coef(forage.fit) # the coefficients
    forage.vcov = vcov(forage.fit) # the var-covar matrix for the coefs
    forage.coef_se = sqrt(diag(forage.vcov)) # the diagonals of vcov are the variances

To plot the regression line, we need to create a vector holding the regression coefficients.

    forage.coefs.vec = as.vector(forage.coefs) # puts the coefs into a vector
    quartz() # opens a new plot window on macOS; use dev.new() on other platforms
    plot(Forage$Preds,Forage$Time) # plot the points
    lines(Forage$Preds,forage.coefs.vec[1]+forage.coefs.vec[2]*Forage$Preds) # add the line

A slightly faster way to do the same thing is:

    forage.predict = predict.lm(forage.fit,se.fit=TRUE) # get the predicted values and their SEs
    lines(Forage$Preds,forage.predict$fit) # plot the predicted values

If we wanted to plot the CIs on the regression prediction line (note this is different from the CIs on the regression parameters themselves):

    CI.lower = forage.predict$fit - forage.predict$se.fit*1.96
    CI.upper = forage.predict$fit + forage.predict$se.fit*1.96
    # notice this gets the CIs using the SEs we just created using predict.lm

To plot the CIs, we need to sort them into either ascending or descending order of the predictor (otherwise the lines look funny):

    o = order(Forage$Preds) # this gets a vector giving the rank order of elements in Forage$Preds. We will plot the points in exactly that order.
    lines(Forage$Preds[o],CI.lower[o],lty=2) # lty = 2 produces a dashed line
    lines(Forage$Preds[o],CI.upper[o],lty=2)

Now it is time to check that the regression complied with all of our assumptions. First, the residuals.

    forage.resids = forage.fit$residuals # note they were in the fit object
    forage.fitted = forage.fit$fitted.values

Plot the residuals.

    plot(forage.fitted,forage.resids)
    abline(h=0)

Nothing looks too crazy, but we should still check the other diagnostics. They are all contained within the fit object - we just need to know what function to use to extract them.

    forage.rstandard = rstandard(forage.fit) # standardized residuals
    forage.rstudent = rstudent(forage.fit) # studentized residuals
    forage.hatvalues = hatvalues(forage.fit) # hat values

To see what the hat values actually do, plot them (versus the fitted values in forage.fitted). Also check the two types of residuals we just created.

    plot(forage.fitted,forage.rstandard,ylim=c(-3,1))
    points(forage.fitted,forage.rstudent,pch=21,bg=1)
    # I used ylim to set y axis limits wide enough to see what is going on

To test for normality, do a normal quantile-quantile plot. There is a function to do this automatically; it defaults to putting the data on the y-axis and the theoretical (normal) quantiles on the x-axis. It doesn't matter which is which, but I'm used to looking at the data on the horizontal axis, so I choose that option.

    qqnorm(forage.resids,datax=TRUE) # 2nd argument puts data quantiles on X axis
    qqline(forage.resids,datax=TRUE)

Finally, another way of making most of these diagnostic plots automatically is to simply use

    plot(forage.fit)

Most of these should look familiar. This also plots Cook's distance, which is a measure of influence: it combines the size of a point's residual with its leverage. We didn't discuss it in class, but it is closely related to Studentized residuals; as a rule of thumb, values > 1 are suspect, highly influential points.

2. ANOVA / ANCOVA

Now we will work a problem with both continuous regressors and categorical factors. In this example, we have three species of fish feeding on zooplankton flowing across a reef. Faster currents deliver more zooplankton per unit time, so foraging time might depend on flow velocity. Foraging might also be species-dependent. The file Flow.csv has fields for flow rate, feeding time, and species. Read in the data as before (the code below assumes it is stored in a data frame named flow). As usual we want to plot the data. In order to tell which points correspond to which species, you can have the data markers be the characters "1", "2", and "3" (corresponding to the 3 species).

    plot(flow$flow_rate,flow$feeding_time,pch=as.character(flow$species))

Note that this only works with single characters ('A', 'B', '4', etc., but not '11' or 'BC'). Looking at the data, it does seem that there could be an effect of both flow and species.
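Before fitting the ANCOVA models, the leverage and influence diagnostics from section 1 can be made concrete with a short sketch. The data here are simulated (the data frame forage_sim and its parameter values are invented for illustration and are not the contents of Forage.csv). The sketch rebuilds the standardized residuals and Cook's distance by hand from the hat values, which shows how the three quantities are linked.

```r
# Simulated stand-in for the foraging data; all values are hypothetical.
set.seed(4)
forage_sim <- data.frame(Preds = runif(30, 0, 20))
forage_sim$Time <- 2000 - 60 * forage_sim$Preds + rnorm(30, sd = 150)

fit <- lm(Time ~ Preds, data = forage_sim)

h <- hatvalues(fit)          # leverage of each observation
p <- length(coef(fit))       # number of fitted coefficients (intercept + slope)
s <- summary(fit)$sigma      # residual standard error

# Standardized residual: raw residual scaled by its estimated standard deviation
r_manual <- residuals(fit) / (s * sqrt(1 - h))

# Cook's distance combines residual size (r^2) with leverage (h / (1 - h))
cook_manual <- r_manual^2 / p * h / (1 - h)

sum(h)                                          # hat values sum to p
all.equal(r_manual, rstandard(fit))             # matches rstandard()
all.equal(cook_manual, cooks.distance(fit))     # matches cooks.distance()
```

A point needs both a large residual and high leverage to get a large Cook's distance, which is why it is read as a measure of influence rather than of leverage alone.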

Now, in fitting the model, there are a number of different options. Some of them are what you might call 'wrong' but we will explore all of them so you will know how to avoid potential pitfalls. First, fit it like the linear regression in exercise 1.

    flow.fit = lm(feeding_time~flow_rate+species,data=flow)

This fit treats Species like a continuous variable (i.e., it takes the values 1, 2, 3). Sometimes this works out OK but usually it is better to force R to treat Species like a categorical factor:

    flow.fit2 = lm(feeding_time~flow_rate+as.factor(species),data=flow)

This fit has an intercept, and there is no term for the effect of Species 1 (or rather, the intercept represents the expected feeding time for Species 1 at a flow rate of zero). The effects of Species 2 and 3 are expressed as 'contrasts' to the Species 1 effect: to get the intercept for Species 2 or 3, add the corresponding term to the overall intercept. It is also possible to make a fit without the intercept, so that the 'Species' terms give the intercept for each species directly. Note the -1 in the model this time.

    flow.fit3 = lm(feeding_time~flow_rate+as.factor(species)-1,data=flow)

The same syntax will also force a pure regression model (i.e., no categorical factors) to run through the origin (i.e., no intercept).

    flow.fit4 = lm(feeding_time~flow_rate+species-1,data=flow)

The logic behind this syntax is this: written out, the model is y = β0 + β1*x1 + β2*x2, where the β's are regression parameters and the x's are the predictor variables in the data. This can also be written as the product of two vectors: Xβ, where X = [1 x1 x2] is a row vector and β = [β0 β1 β2]' is a column vector. Notice the 1 as the first term of X, which is necessary to get 1*β0 = β0 in the regression model. So putting in -1 in the model specification in R essentially subtracts that leading 1, giving you 0*β0 = 0, or no intercept.

Now use summary() and anova() to perform tests on each of those fit objects. Note which models produce identical results and which produce different results.
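To see how these parameterizations relate to each other, here is a minimal sketch on simulated data. The data frame flow_sim, its column names, and the effect sizes are invented for illustration; they are not the contents of Flow.csv. The sketch checks that the contrast-coded fit and the no-intercept fit describe the same model with differently labelled coefficients, while the numeric-species fit is a different (more restrictive) model.

```r
# Simulated stand-in for Flow.csv; column names and effect sizes are assumptions.
set.seed(1)
flow_sim <- data.frame(
  flow_rate = runif(60, 5, 25),
  species   = rep(1:3, each = 20)
)
flow_sim$feeding_time <- 100 + 4 * flow_sim$flow_rate +
  c(0, 30, -20)[flow_sim$species] + rnorm(60, sd = 10)

# Species as a number: one slope for "species", forcing equal steps 1 -> 2 -> 3
fit_num <- lm(feeding_time ~ flow_rate + species, data = flow_sim)

# Species as a factor with an intercept (contrasts against Species 1)
fit_factor <- lm(feeding_time ~ flow_rate + as.factor(species), data = flow_sim)

# Species as a factor without an intercept (one intercept per species)
fit_noint <- lm(feeding_time ~ flow_rate + as.factor(species) - 1, data = flow_sim)

# Rebuild the per-species intercepts from the contrast-coded fit:
# intercept, intercept + Species 2 term, intercept + Species 3 term
b <- coef(fit_factor)
species_intercepts <- unname(c(b[1], b[1] + b[3], b[1] + b[4]))

species_intercepts
unname(coef(fit_noint)[2:4])                      # same three numbers
all.equal(fitted(fit_factor), fitted(fit_noint))  # same fitted values
length(coef(fit_num))                             # only 3 coefficients
```

One thing to watch for when comparing the summary() output: R reports a different R² for a no-intercept fit, because it is computed relative to zero rather than to the mean of the response, even though the fitted values are the same.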
flow.fit2 is the standard way of doing things, so focusing on that one, plot the residuals vs. predictors and do a normal q-q plot to ensure the assumptions of the tests are met. Finally, it is worth knowing how to do a Tukey test to determine whether the various groups are different from each other in an ANOVA. This test is not really meaningful in an ANCOVA or GLM context, because it only operates on categorical factors. So let's make one more fit using categorical factors only:

    flow.fit5 = lm(feeding_time~as.factor(species),data=flow)

The script for the Tukey test is in the stats package, so load that up (stats is attached by default in a standard R session, so this line is harmless but usually unnecessary):

    library(stats)

Now do the Tukey test (called TukeyHSD because the full name of the test is the Tukey Honest Significant Difference test. This sounds like the name for an old-timey patent medicine, which is probably why everyone calls it HSD).

    TukeyHSD(aov(flow.fit5))

A minor quirk is that TukeyHSD only works on fit objects created by aov(). aov is basically the same thing as lm but it is specifically written to do ANOVAs with categorical effects only. In fact, aov actually calls lm to do its thing. To convert an lm fit object to an aov fit object, you just have to run aov on the lm fit object, like I did. Alternatively, you could do this:

    flow.fit6 = aov(feeding_time~as.factor(species),data=flow)
    TukeyHSD(flow.fit6)

Your choice. Notice that aov implements ANOVA models and anova produces ANOVA tables from a fit object. anova also does comparisons on other types of fit objects (like likelihood ratio tests on mle fit objects). Who knows why the names are this way - this might actually be a holdover from S+ that predates R.

3. Multiple linear regression

Now for an example of multiple regression. For this one we will use a dataset with two predictor variables. Let's say these data were collected on foraging ants (evidently I was hungry when writing this problem set, so everything revolves around foraging). The response variable is the time it takes an ant to return to the nest once it finds a food item, and the two predictors are the distance to be traveled and the mass of the food item. Read in the data from Travel_time.csv (the code below assumes it is stored in a data frame named travel). Multivariate data can be hard to visualize, and usually it is necessary to plot different pairwise combinations of the variables. In this case we can quickly plot all three variables together, using a function from the rgl package.

    library(rgl)
    plot3d(travel$mass,travel$distance,travel$time)

This makes a plot that you can resize and also spin around by clicking and dragging. Note a few things about the data - the two predictor variables are collinear (there is a correlation between mass and distance), which is sometimes problematic, and there are a few points with rather extreme values for all 3 variables. On the plus side, things look fairly linear and the cloud has a fairly ordinary shape (no weird bulges or things like that), so a linear model seems appropriate.
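If rgl is not available, the pairwise view mentioned above can be sketched with base R alone. The example below uses simulated data (the data frame travel_sim and its parameter values are made up for illustration and are not the contents of Travel_time.csv), built so that mass and distance are correlated in the same way the text describes.

```r
# Simulated stand-in for the ant travel data; all values are hypothetical.
set.seed(5)
travel_sim <- data.frame(mass = runif(50, 1, 10))
travel_sim$distance <- 5 * travel_sim$mass + rnorm(50, sd = 5)   # correlated with mass
travel_sim$time <- 5 + 2 * travel_sim$mass + 0.5 * travel_sim$distance +
  rnorm(50, sd = 3)

# Scatterplot matrix: every pairwise combination of the three variables
pairs(travel_sim)

# Correlation matrix: the mass-distance entry quantifies the collinearity
cor_mat <- cor(travel_sim)
round(cor_mat, 2)
```

A strong off-diagonal correlation between the two predictors is the numerical counterpart of the elongated cloud seen when spinning the 3-D plot.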

Start by fitting a simple model:

    travel.fit = lm(time~mass+distance,data=travel)

Do all of the standard tests & diagnostics on this fit. Are all of the terms significant? How are the residuals? Are there any weird outliers? Try out plot(travel.fit) as well. It looks like one of the observations is an outlier with high leverage, right? To find out which one it is, do this:

    bad = which.max(travel.rstudent)

Assuming travel.rstudent holds the studentized residuals, this will find the index of the point with the largest studentized residual (use which.max(abs(travel.rstudent)) if the outlier could have a large negative residual). Now try fitting the data without that point. First define a 'logical' vector that has a value of TRUE for the data points we want to use, and a value of FALSE at the one we want to exclude. Instead of creating this by hand, use a bit of programming:

    notbad = seq(1,length(travel$mass)) != bad

The != operator means "not equal to." This code makes a vector of values from 1 to n and then compares each one to the index of the outlier. Now fit the data without the outlier (for each variable, we only use the values for which notbad is TRUE - this is a standard trick for selecting certain values out of a vector):

    travel.fit2 = lm(time[notbad]~mass[notbad]+distance[notbad],data=travel)
    summary(travel.fit2)

Do the diagnostics again. Does excluding that point fix things?

Sometimes the regression fit does not look good in the diagnostic plots because an incorrect model was fitted. In this case, the residuals do not look very normal, do they? Perhaps a better explanatory model would fix that problem. One option is that there is actually an interaction between the two predictors, distance and mass. To test this, fit a model with an interaction effect:

    travel.fit3 = lm(time~mass*distance,data=travel)

This fits the model time = β0 + β1*mass + β2*distance + β3*mass*distance. R assumes you want all of the lower-level effects when you ask for the interaction effect with *.
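The expansion of * can be checked directly. This is a small sketch on simulated data (travel_sim and its coefficients are invented for illustration, not taken from Travel_time.csv) comparing the shorthand formula with the fully written-out one, where mass:distance denotes the interaction term on its own.

```r
# Simulated stand-in for the ant travel data; all values are hypothetical.
set.seed(2)
travel_sim <- data.frame(mass = runif(40, 1, 10), distance = runif(40, 5, 50))
travel_sim$time <- 10 + 2 * travel_sim$mass + 1.5 * travel_sim$distance +
  0.3 * travel_sim$mass * travel_sim$distance + rnorm(40, sd = 5)

# Shorthand: main effects plus interaction
fit_star <- lm(time ~ mass * distance, data = travel_sim)

# Written out term by term; ':' is the interaction alone
fit_long <- lm(time ~ mass + distance + mass:distance, data = travel_sim)

names(coef(fit_star))   # intercept, mass, distance, mass:distance
all.equal(coef(fit_star), coef(fit_long))
```

The two fits have the same four coefficients, so mass*distance is purely a shorter way of writing the full model.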
If you really don't want the lower-level effect, you could use the minus sign to drop it (stored here under a different name so that travel.fit3 is not overwritten):

    travel.fit3b = lm(time~mass*distance-distance,data=travel)

This is just like the procedure for doing a regression with intercept = 0. Only in rare cases should you fit the interaction effect without the lower-level effect (and this is not such a case).

Now do all of the regression diagnostics on the model with the interaction effect. You may find that there is still a high-leverage outlier. For now, let's just keep it in the model, though.

Once you have a fit you're happy with, it is time to plot. Since this is a trivariate problem, we could plot a 3-D surface and the associated points to show the fit to the data. However, that can require a bit more programming skill (try ?persp if you're interested - persp will draw a surface, and you can add points using trans3d and points). Plus 3D plots can actually be hard to look at and really see what's going on with the data - especially if some of the points are far from the surface. Instead we will make what are called partial regression plots. These are bivariate plots showing the relationship between two of the variables, holding all other variables constant.

To make a partial regression plot, first generate two sets of residuals. The first is the residuals of the response variable vs. all of the predictors except the one you're interested in. The second is the residuals of the predictor you're interested in vs. all of the other predictors. In this case we will make a partial regression plot for distance.

    rtime = lm(time~mass,data=travel) # regression of Time vs Mass
    rdist = lm(distance~mass,data=travel) # regression of Distance vs Mass
    # to use lm, the two sets of residuals have to be in a data frame.
    travel2 = data.frame(rtime$residuals,rdist$residuals)
    # now do a regression of the time residuals vs the distance residuals
    travel2.fit = lm(rtime.residuals~rdist.residuals,data=travel2)
    # plot the results just like we did in the first example.
    travel2.predict = predict.lm(travel2.fit,se.fit=TRUE)
    plot(rdist$residuals,rtime$residuals)
    lines(rdist$residuals,travel2.predict$fit,lty=1)

The resulting plot shows the relationship between distance and time, independent of mass. Also, if you use summary(travel2.fit), you will get the partial R² for that variable. Also, the regression coefficient for that variable from travel2.fit should be the same as the coefficient from the full regression model without interaction effects. In general, partial regression plots will show the correct relationship when there is no correlation between the predictor variables and there are no interaction effects. It is not perfect in this case because there is an interaction, but you get the idea. For practice, make a partial regression plot for mass as well.

Finally, as is usually the case, we have created several different regression models (e.g., with and without interaction effects). What you do with these models will depend on the kind of inference you want to make. If you started off with some hypothesis about the presence or absence of an interaction effect (e.g., I predict that the predation rate will depend on the combination of predator density and refuge availability), then you will want to run a model with that interaction effect and pay attention to the significance test for it. If you are merely interested in finding the best predictive relationship for the data, you may not care whether some of the interaction effects are significant, only whether they add explanatory power to the model (e.g., you have 20 predictor variables and few a priori hypotheses about them, and want to find the best model for predicting future observations). In that case, you will want to use AIC (or some other method) to perform model selection. Happily, AICtab works directly on lm fit objects. A word of warning, though: AIC can only be used to compare fits on exactly the same dataset. So if you have some fits that excluded some of the points, you cannot compare those fits to fits using the entire dataset (this is true of any model selection tool, including the likelihood ratio test). Note that travel.fit2 as defined above excluded the outlier, so it would have to be refit on the full dataset before being included in a comparison like the one below. So if you have 3 fit objects all using the same dataset, here is the model selection:

    library(bbmle)
    AICtab(travel.fit,travel.fit2,travel.fit3)
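As a cross-check that does not need bbmle, the same kind of comparison can be sketched with the base R function AIC(). The data below are simulated (the data frame d and its coefficients are invented for illustration), and nobs() is used to confirm that the fits being compared really use the same number of observations.

```r
# Simulated stand-in for the ant travel data; all values are hypothetical.
set.seed(3)
d <- data.frame(mass = runif(40, 1, 10), distance = runif(40, 5, 50))
d$time <- 10 + 2 * d$mass + 1.5 * d$distance +
  0.3 * d$mass * d$distance + rnorm(40, sd = 5)

fit_add <- lm(time ~ mass + distance, data = d)        # additive model
fit_int <- lm(time ~ mass * distance, data = d)        # with interaction
fit_sub <- lm(time ~ mass + distance, data = d[-1, ])  # drops one row: not comparable

# Only compare fits that were made on the same observations
nobs(fit_add) == nobs(fit_int)   # TRUE
nobs(fit_sub)                    # one fewer than the others

# AIC table; df counts the regression coefficients plus the residual variance
aic_tab <- AIC(fit_add, fit_int)
aic_tab$dAIC <- aic_tab$AIC - min(aic_tab$AIC)   # difference from the best model
aic_tab

# AIC written out by hand for one model: -2 * log-likelihood + 2 * df
aic_manual <- -2 * as.numeric(logLik(fit_add)) + 2 * attr(logLik(fit_add), "df")
all.equal(aic_manual, AIC(fit_add))
```

The model with dAIC = 0 is the one AIC prefers; fit_sub is left out of the table on purpose because it was fitted to a different set of points.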
