Problem set for Week 7 Linear models: Linear regression, multiple linear regression, ANOVA, ANCOVA

ECL 290 Statistical Models in Ecology using R

Datasets in this problem set are adapted from those provided by Montgomery et al. (Introduction to Linear Regression Analysis, 4th ed., Wiley, 2006). URL:

1. Univariate linear regression

We will start with a simple example. Animals often spend less time foraging in the presence of greater predation risk. Forage.csv has data on the time spent foraging on zooplankton (in seconds during a 1-hour trial) for fish in locations with various ambient densities of predators (mean number of fish per 100 m²). The question: does foraging time vary with predator density?

First, read in the data.

    Forage = read.csv('Forage.csv')

This will create a data frame named 'Forage', which has two fields: Forage$Preds and Forage$Time. R is case-sensitive, so these names must be typed exactly as shown. Always plot the data first to look at the relationship between the two variables.

Now implement a linear model: $\text{Time} = \beta_0 + \beta_1 \cdot \text{Preds} + \varepsilon$. Note the syntax that R uses for this type of model:

    forage.fit = lm(Time~Preds,data=Forage)

The lm function (and many similar functions that we will be using) creates a "model fit" object which holds all sorts of results pertaining to this linear model (obviously you do not need to name the variable xxx.fit, but I will follow that convention for consistency here). The trick to using lm effectively is to know what other functions you have to use to extract the information you want from the fit object.

Use the anova() function to obtain an overall test on the model.

    anova(forage.fit)

Then use summary() to get more information on the model fit.

    summary(forage.fit)

Notice the similarity in the results of those two operations. summary() gives you the coefficients and their standard errors. Another way to get the same information is this:

    forage.coefs = coef(forage.fit)            # the coefficients
    forage.vcov = vcov(forage.fit)             # the var-covar matrix for the coefs
    forage.coef_se = sqrt(diag(forage.vcov))   # the diagonals of vcov are the variances

To plot the regression line, we need to create a vector holding the regression coefficients.

    forage.coefs.vec = as.vector(forage.coefs) # puts the coefs into a vector
    quartz()                                   # new plot window on a Mac; use dev.new() on other systems
    plot(Forage$Preds,Forage$Time)             # plot the points
    lines(Forage$Preds,forage.coefs.vec[1]+forage.coefs.vec[2]*Forage$Preds) # add the line

A slightly faster way to do the same thing is:

    forage.predict = predict.lm(forage.fit,se.fit=TRUE) # get the predicted values and their SEs
    lines(Forage$Preds,forage.predict$fit)              # plot the predicted values

If we wanted to plot the CIs on the regression prediction line (note this is different from the CIs on the regression parameters themselves):

    CI.lower = forage.predict$fit - forage.predict$se.fit*1.96
    CI.upper = forage.predict$fit + forage.predict$se.fit*1.96
    # notice this gets the CIs using the SEs we just created using predict.lm

To plot the CIs, we need to sort them into either ascending or descending order of the predictor (otherwise the lines look funny).

    o = order(Forage$Preds) # the indices that put Forage$Preds in ascending order. We will plot the points in exactly that order.
    lines(Forage$Preds[o],CI.lower[o],lty=2) # lty = 2 produces a dashed line
    lines(Forage$Preds[o],CI.upper[o],lty=2)

Now it is time to check that the regression complied with all of our assumptions. First, the residuals.

    forage.resids = forage.fit$residuals # note they were in the fit object
    forage.fitted = forage.fit$fitted.values

Plot the residuals.

    plot(forage.fitted,forage.resids)
    abline(h=0)
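The multiplier 1.96 used for the CIs above is a large-sample (normal) approximation. As a minimal sketch, assuming simulated data in place of Forage.csv (the numbers below are made up purely for illustration), the same band can be obtained with a t quantile by asking predict() for interval="confidence". Predicting on a sorted grid of predictor values also avoids the need for order().

```r
# Simulated stand-in for Forage.csv (values are invented for illustration only)
set.seed(1)
Forage = data.frame(Preds = runif(30, 0, 10))
Forage$Time = 1800 - 90 * Forage$Preds + rnorm(30, sd = 120)

forage.fit = lm(Time ~ Preds, data = Forage)

# Predict on an evenly spaced, already sorted grid of predator densities
grid = data.frame(Preds = seq(min(Forage$Preds), max(Forage$Preds), length.out = 50))
band = predict(forage.fit, newdata = grid, interval = "confidence", level = 0.95)

plot(Forage$Preds, Forage$Time)
lines(grid$Preds, band[, "fit"])            # regression line
lines(grid$Preds, band[, "lwr"], lty = 2)   # lower 95% limit
lines(grid$Preds, band[, "upr"], lty = 2)   # upper 95% limit

# The same lower limit by hand: fit minus t quantile times SE
p = predict(forage.fit, newdata = grid, se.fit = TRUE)
tcrit = qt(0.975, df = df.residual(forage.fit))
by.hand.lower = p$fit - tcrit * p$se.fit
```

With 30 observations the t quantile is a little larger than 1.96, so this band is slightly wider than the one built with 1.96; the difference shrinks as the sample size grows.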

Nothing looks too crazy, but we should still check the other diagnostics. They are all contained within the fit object - we just need to know what function to use to extract them.

    forage.rstandard = rstandard(forage.fit) # standardized residuals
    forage.rstudent = rstudent(forage.fit)   # studentized residuals
    forage.hatvalues = hatvalues(forage.fit) # hat values

To see what the hat values actually do, plot them (versus the fitted values in forage.fitted). Also check the two types of residuals we just created.

    plot(forage.fitted,forage.rstandard,ylim=c(-3,1))
    points(forage.fitted,forage.rstudent,pch=21,bg=1)
    # I used ylim to set y axis limits wide enough to see what is going on

To test for normality, do a normal quantile-quantile plot. There is a function to do this automatically; it defaults to putting the data quantiles on the y-axis and the theoretical (predicted) quantiles on the x-axis. It doesn't matter which is which, but I'm used to looking at the data on the horizontal axis, so I choose that option.

    qqnorm(forage.resids,datax=TRUE) # 2nd argument puts data quantiles on X axis
    qqline(forage.resids,datax=TRUE)

Finally, another way of making most of these diagnostic plots automatically is to simply use

    plot(forage.fit)

Most of these should look familiar. This also plots Cook's distance, which measures how much influence each point has on the fit (it combines the size of the residual with the leverage). We didn't discuss it in class, but it is used much like Studentized residuals; values > 1 are suspect, highly influential points.

2. ANOVA / ANCOVA

Now we will work a problem with both continuous regressors and categorical factors. In this example, we have three species of fish feeding on zooplankton flowing across a reef. Faster currents deliver more zooplankton per unit time, so foraging time might depend on flow velocity. Foraging might also be species-dependent. The file Flow.csv has fields for flow rate, feeding time, and species. Read in the data as before. The code below assumes the data frame is named flow, with columns flow_rate, feeding_time, and species; R is case-sensitive, so adjust the names to match the headers in your file.

As usual we want to plot the data. In order to tell which points correspond to which species, you can have the data markers be the characters "1", "2", and "3" (corresponding to the 3 species).

    plot(flow$flow_rate,flow$feeding_time,pch=as.character(flow$species))

Note that this only works with single characters ('A', 'B', '4', etc., but not '11' or 'BC'). Looking at the data, it does seem that there could be an effect of both flow and species.
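For readers without the file at hand, here is a minimal sketch of the same plot on simulated data. The data frame name, column names, and all numeric values are assumptions made for illustration, not the contents of Flow.csv. It also shows one alternative (color plus a legend) for cases where the group labels are longer than one character.

```r
# Simulated stand-in for Flow.csv; names and values are assumptions for illustration
set.seed(2)
flow = data.frame(flow_rate = runif(45, 5, 25),
                  species = rep(1:3, each = 15))
flow$feeding_time = 100 + 8 * flow$flow_rate +
                    c(0, 40, -30)[flow$species] + rnorm(45, sd = 15)

# Species code used directly as the plotting character
plot(flow$flow_rate, flow$feeding_time, pch = as.character(flow$species))

# Alternative: one color per species plus a legend (works for labels of any length)
plot(flow$flow_rate, flow$feeding_time, col = flow$species, pch = 16)
legend("topleft", legend = paste("Species", 1:3), col = 1:3, pch = 16)
```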

Now, in fitting the model, there are a number of different options. Some of them are what you might call 'wrong' but we will explore all of them so you will know how to avoid potential pitfalls. First, fit it like the linear regression in exercise 1.

    flow.fit = lm(feeding_time~flow_rate+species,data=flow)

This fit treats Species like a continuous variable (i.e., it takes the values 1, 2, 3). Sometimes this works out OK but usually it is better to force R to treat Species like a categorical factor:

    flow.fit2 = lm(feeding_time~flow_rate+as.factor(species),data=flow)

This fit has an intercept, and there is no term for the effect of Species 1 (or rather, the intercept represents the mean feeding time for Species 1 at a flow rate of zero). The mean effects of Species 2 and 3 are expressed as 'contrasts' to the Species 1 effect - to get the value for Species 2 or 3, just add that term to the intercept. It is also possible to make a fit without the intercept, so that the 'Species' terms give the mean effect for each species. Note the -1 in the model this time.

    flow.fit3 = lm(feeding_time~flow_rate+as.factor(species)-1,data=flow)

The same syntax will also force a pure regression model (i.e., no categorical factors) to run through the origin (i.e., no intercept).

    flow.fit4 = lm(feeding_time~flow_rate+species-1,data=flow)

The logic behind this syntax is this: written out, the model is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, where the $\beta$'s are regression parameters and the $x$'s are the predictor variables in the data. This can also be written as the product of two vectors: $X\beta$, where $X = [1 \; x_1 \; x_2]$ and $\beta = [\beta_0 \; \beta_1 \; \beta_2]'$. Notice the 1 as the first term of $X$, which is necessary to get $1 \cdot \beta_0 = \beta_0$ in the regression model. So putting in -1 in the model specification in R essentially subtracts that leading 1, giving you $0 \cdot \beta_0 = 0$, or no intercept.

Now use summary() and anova() to perform tests on each of those fit objects. Note which models produce identical results and which produce different results.
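The relationship between the two factor parameterizations can be checked numerically. The sketch below uses simulated data (the data frame, column names, and values are assumptions for illustration, not Flow.csv) and compares the species terms of the intercept model with those of the no-intercept model.

```r
# Simulated stand-in for Flow.csv; names and values are assumptions for illustration
set.seed(3)
flow = data.frame(flow_rate = runif(45, 5, 25), species = rep(1:3, each = 15))
flow$feeding_time = 100 + 8 * flow$flow_rate +
                    c(0, 40, -30)[flow$species] + rnorm(45, sd = 15)

flow.fit  = lm(feeding_time ~ flow_rate + species, data = flow)                # species as a number
flow.fit2 = lm(feeding_time ~ flow_rate + as.factor(species), data = flow)     # factor, with intercept
flow.fit3 = lm(feeding_time ~ flow_rate + as.factor(species) - 1, data = flow) # factor, no intercept

b2 = coef(flow.fit2)
b3 = coef(flow.fit3)

# Species-level terms rebuilt from the contrast parameterization: intercept + contrast
from.contrasts = c(b2["(Intercept)"],
                   b2["(Intercept)"] + b2["as.factor(species)2"],
                   b2["(Intercept)"] + b2["as.factor(species)3"])
# The same terms read directly from the no-intercept fit
from.direct = b3[c("as.factor(species)1", "as.factor(species)2", "as.factor(species)3")]
print(cbind(from.contrasts, from.direct))

# Identical fitted values: the two factor models are one model written two ways
same.fit = isTRUE(all.equal(fitted(flow.fit2), fitted(flow.fit3)))

# Treating species as a number estimates one slope instead of two contrasts
n.coef = c(numeric = length(coef(flow.fit)), factor = length(coef(flow.fit2)))
```

One thing that does differ between flow.fit2 and flow.fit3 is the R² reported by summary(), because R computes R² differently for models without an intercept; the fitted values and residuals are the same.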
flow.fit2 is the standard way of doing things, so focusing on that one, plot the residuals vs. predictors and do a normal q-q plot to ensure the assumptions of the tests are met.

Finally, it is worth knowing how to do a Tukey test to determine whether the various groups are different from each other in an ANOVA. This test is not really meaningful in an ANCOVA or GLM context, because it only operates on categorical factors. So let's make one more fit using categorical factors only:

    flow.fit5 = lm(feeding_time~as.factor(species),data=flow)

The script for the Tukey test is in the stats package, so load that up (stats is attached by default in a standard R session, so this line is harmless but not strictly needed):

    library(stats)

Now do the Tukey test (called TukeyHSD because the full name of the test is the Tukey Honest Significant Difference test. This sounds like the name for an old-timey patent medicine, which is probably why everyone calls it HSD).

    TukeyHSD(aov(flow.fit5))

A minor quirk is that TukeyHSD only works on fit objects created by aov(). aov is basically the same thing as lm but it is specifically written to do ANOVAs with categorical effects only. In fact, aov actually calls lm to do its thing. To convert an lm fit object to an aov fit object, you just have to run aov on the lm fit object, like I did. Alternatively, you could do this:

    flow.fit6 = aov(feeding_time~as.factor(species),data=flow)
    TukeyHSD(flow.fit6)

Your choice. Notice that aov implements ANOVA models and anova produces ANOVA tables from a fit object. anova also does comparisons on other types of fit objects (like likelihood ratio tests on mle fit objects). Who knows why the names are this way - this might actually be a holdover from S+ that predates R.

3. Multiple linear regression

Now for an example of multiple regression. For this one we will use a dataset with two predictor variables. Let's say these data were collected on foraging ants (evidently I was hungry when writing this problem set, so everything revolves around foraging). The response variable is the time it takes an ant to return to the nest once it finds a food item, and the two predictors are the distance to be traveled and the mass of the food item. Read in the data from Travel_time.csv. The code below assumes the data frame is named travel, with columns time, mass, and distance; adjust the names to match the headers in your file.

Multivariate data can be hard to visualize, and usually it is necessary to plot different pairwise combinations of the variables. In this case we can quickly plot all three variables together, using a function from the rgl package.

    library(rgl) # run install.packages('rgl') first if the package is not installed
    plot3d(travel$mass,travel$distance,travel$time)

This makes a plot that you can resize and also spin around by clicking and dragging. Note a few things about the data - the two predictor variables are collinear (there is a correlation between mass and distance), which is sometimes problematic, and there are a few points with rather extreme values for all 3 variables. On the plus side, things look fairly linear and the cloud has a fairly ordinary shape (no weird bulges or things like that), so a linear model seems appropriate.
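If rgl is not available, the pairwise plots mentioned above are a reasonable fallback, and the collinearity between the predictors can be quantified with a correlation coefficient. This is a minimal sketch on simulated data; the data frame, its column names, and the values are assumptions for illustration and are not the contents of Travel_time.csv.

```r
# Simulated stand-in for Travel_time.csv; names and values are assumptions
set.seed(4)
n = 25
travel = data.frame(distance = runif(n, 1, 30))
travel$mass = 2 + 0.3 * travel$distance + rnorm(n, sd = 1.5)  # built to correlate with distance
travel$time = 5 + 1.5 * travel$mass + 0.8 * travel$distance + rnorm(n, sd = 3)

pairs(travel)                          # scatterplot matrix of every pair of variables
r = cor(travel$mass, travel$distance)  # correlation between the two predictors
print(r)
```

A correlation near 1 or -1 between predictors means their individual coefficients will be estimated with large standard errors, which is worth keeping in mind when reading summary() later.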

Start by fitting a simple model:

    travel.fit = lm(time~mass+distance,data=travel)

Do all of the standard tests & diagnostics on this fit. Are all of the terms significant? How are the residuals? Are there any weird outliers? Try out plot(travel.fit) as well. It looks like one of the observations is an outlier with high leverage, right? To find out which one it is, do this:

    bad = which.max(abs(travel.rstudent)) # abs() so that a large negative residual is also caught

Assuming travel.rstudent holds the studentized residuals, this will find the index of the outlier point in the dataset. Now try fitting the data without that point. First define a 'logical' vector that has a value of TRUE for the data points we want to use, and a value of FALSE at the one we want to exclude. Instead of creating this by hand, use a bit of programming:

    notbad = seq(1,length(travel$mass)) != bad

The != operator means "not equal to." This code makes a vector of values from 1 to n and then compares each one to the index of the outlier. Now fit the data without the outlier (for each variable, we only use the values for which notbad is TRUE - this is a standard trick for selecting certain values out of a vector):

    travel.fit2 = lm(time[notbad]~mass[notbad]+distance[notbad],data=travel)
    summary(travel.fit2)

Do the diagnostics again. Does excluding that point fix things?

Sometimes the regression fit does not look good in the diagnostic plots because an incorrect model was fitted. In this case, the residuals do not look very normal, do they? Perhaps a better explanatory model would fix that problem. One option is that there is actually an interaction between the two predictors, distance and mass. To test this, fit a model with an interaction effect:

    travel.fit3 = lm(time~mass*distance,data=travel)

This fits the model $\text{time} = \beta_0 + \beta_1 \cdot \text{mass} + \beta_2 \cdot \text{distance} + \beta_3 \cdot \text{mass} \cdot \text{distance}$. R assumes you want all of the lower-level effects when you ask for the interaction effect with *.
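Two points from this passage can be verified on simulated data (the data frame, column names, and values below are assumptions for illustration). First, mass*distance is shorthand for both main effects plus the product term, which R writes as mass:distance. Second, lm's subset argument is an alternative to indexing every variable with the logical vector, and it gives the same fit.

```r
# Simulated stand-in for Travel_time.csv; names and values are assumptions
set.seed(5)
n = 25
travel = data.frame(distance = runif(n, 1, 30))
travel$mass = 2 + 0.3 * travel$distance + rnorm(n, sd = 1.5)
travel$time = 5 + 1.5 * travel$mass + 0.8 * travel$distance +
              0.05 * travel$mass * travel$distance + rnorm(n, sd = 3)

# mass*distance expands to mass + distance + mass:distance
fit.star  = lm(time ~ mass * distance, data = travel)
fit.colon = lm(time ~ mass + distance + mass:distance, data = travel)
print(names(coef(fit.star)))

# Excluding one observation: indexing each variable vs. the subset argument
travel.rstudent = rstudent(lm(time ~ mass + distance, data = travel))
bad = which.max(abs(travel.rstudent))
notbad = seq_len(nrow(travel)) != bad
fit.indexed = lm(time[notbad] ~ mass[notbad] + distance[notbad], data = travel)
fit.subset  = lm(time ~ mass + distance, data = travel, subset = notbad)
```

The subset form keeps the original variable names in the coefficient table (mass rather than mass[notbad]), which makes the output easier to read and lets predict() work with new data.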
If you really don't want the lower-level effect, you could use the minus sign to remove it (the fit is stored under a different name here so that it does not overwrite travel.fit3):

    travel.fit3b = lm(time~mass*distance-distance,data=travel)

This is just like the procedure for doing a regression with intercept = 0. Only in rare cases should you fit the interaction effect without the lower-level effect (and this is not such a case).

Now do all of the regression diagnostics on the model with the interaction effect. You may find that there is still a high-leverage outlier. For now, let's just keep it in the model, though.

Once you have a fit you're happy with, it is time to plot. Since this is a trivariate problem, we could plot a 3-D surface and the associated points to show the fit to the data. However, that can require a bit more programming skill (try ?persp if you're interested - persp will draw a surface, and you can add points using trans3d and points). Plus 3D plots can actually be hard to look at and really see what's going on with the data - especially if some of the points are far from the surface. Instead we will make what are called partial regression plots. These are bivariate plots showing the relationship between two of the variables, holding all other variables constant.

To make a partial regression plot, first generate two sets of residuals. The first is the residuals of the response variable vs. all of the predictors except the one you're interested in. The second is the residuals of the predictor you're interested in vs. all of the other predictors. In this case we will make a partial regression plot for distance.

    rtime = lm(time~mass,data=travel)     # regression of Time vs Mass
    rdist = lm(distance~mass,data=travel) # regression of Distance vs Mass
    # put the two sets of residuals in a data frame so lm can use them
    travel2 = data.frame(rtime$residuals,rdist$residuals)
    # now do a regression of the time residuals vs the distance residuals
    travel2.fit = lm(rtime.residuals~rdist.residuals,data=travel2)
    # plot the results just like we did in the first example.
    travel2.predict = predict.lm(travel2.fit,se.fit=TRUE)
    plot(rdist$residuals,rtime$residuals)
    lines(rdist$residuals,travel2.predict$fit,lty=1)

The resulting plot shows the relationship between distance and time, independent of mass. Also, if you use summary(travel2.fit), you will get the partial $R^2$ for that variable. The regression coefficient for that variable from travel2.fit should also be the same as the coefficient from the full regression model without interaction effects. In general, partial regression plots will show the correct relationship when there is no correlation between the predictor variables and there are no interaction effects. It is not perfect in this case because there is an interaction, but you get the idea. For practice, make a partial regression plot for mass as well.

Finally, as is usually the case, we have created several different regression models (e.g., with and without interaction effects). What you do with these models will depend on the kind of inference you want to make. If you started off with some hypothesis about the presence or absence of an interaction effect (e.g., I predict that the predation rate will depend on the combination of predator density and refuge availability), then you will want to run a model with that interaction effect and pay attention to the significance test for it. If you are merely interested in finding the best predictive relationship for the data, you may not care whether some of the interaction effects are significant, only whether they add explanatory power to the model (e.g., you have 20 predictor variables and few a priori hypotheses about them, and want to find the best model for predicting future observations). In that case, you will want to use AIC (or some other method) to perform model selection. Happily, AICtab works directly on lm fit objects.

A word of warning, though - AIC can only be used to compare fits on exactly the same dataset. So if you have some fits that excluded some of the points, you cannot compare those fits to fits using the entire dataset (this is true of any model selection tool, including the likelihood ratio test). So for the fit objects that all use the same dataset, here is the model selection (travel.fit2 is left out because it was fitted without the outlier):

    library(bbmle)
    AICtab(travel.fit,travel.fit3)
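If bbmle is not installed, base R's AIC() returns the raw values from which the same ranking can be built by hand. The sketch below uses simulated data (the data frame, column names, and values are assumptions for illustration) and includes a check that both fits use the same number of observations before they are compared.

```r
# Simulated stand-in for Travel_time.csv; names and values are assumptions
set.seed(6)
n = 40
travel = data.frame(distance = runif(n, 1, 30), mass = runif(n, 1, 10))
travel$time = 5 + 1.5 * travel$mass + 0.8 * travel$distance +
              0.2 * travel$mass * travel$distance + rnorm(n, sd = 3)

fit.additive    = lm(time ~ mass + distance, data = travel)
fit.interaction = lm(time ~ mass * distance, data = travel)

# Both fits must be based on the same observations before AICs are compared
stopifnot(nobs(fit.additive) == nobs(fit.interaction))

aic = AIC(fit.additive, fit.interaction)  # data frame with columns df and AIC
aic$dAIC = aic$AIC - min(aic$AIC)         # difference from the lowest (best) AIC
print(aic[order(aic$AIC), ])
```

Because the simulated data were generated with an interaction, the interaction model should come out with the lower AIC here; with real data either model could win.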

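Returning to the partial regression plot for distance: the statement that its slope matches the distance coefficient from the full model without interaction effects can be checked directly. This is a minimal sketch on simulated data (the data frame, column names, and values are assumptions for illustration); the predictors are deliberately correlated.

```r
# Simulated stand-in for Travel_time.csv; names and values are assumptions
set.seed(7)
n = 30
travel = data.frame(mass = runif(n, 1, 10))
travel$distance = 3 + 2 * travel$mass + rnorm(n, sd = 4)  # correlated with mass by construction
travel$time = 5 + 1.5 * travel$mass + 0.8 * travel$distance + rnorm(n, sd = 3)

full.fit = lm(time ~ mass + distance, data = travel)

rtime = resid(lm(time ~ mass, data = travel))      # time with the mass effect removed
rdist = resid(lm(distance ~ mass, data = travel))  # distance with the mass effect removed
partial.fit = lm(rtime ~ rdist)

plot(rdist, rtime)   # the partial regression plot for distance
abline(partial.fit)

slope.full    = unname(coef(full.fit)["distance"])
slope.partial = unname(coef(partial.fit)["rdist"])
print(c(full = slope.full, partial = slope.partial))
```

The two slopes agree to numerical precision. Their reported standard errors differ slightly, because the residual-on-residual regression does not account for the degree of freedom already spent on mass.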

More information

Introduction. About this Document. What is SPSS. ohow to get SPSS. oopening Data

Introduction. About this Document. What is SPSS. ohow to get SPSS. oopening Data Introduction About this Document This manual was written by members of the Statistical Consulting Program as an introduction to SPSS 12.0. It is designed to assist new users in familiarizing themselves

More information

Subset Selection in Multiple Regression

Subset Selection in Multiple Regression Chapter 307 Subset Selection in Multiple Regression Introduction Multiple regression analysis is documented in Chapter 305 Multiple Regression, so that information will not be repeated here. Refer to that

More information

Section 4 General Factorial Tutorials

Section 4 General Factorial Tutorials Section 4 General Factorial Tutorials General Factorial Part One: Categorical Introduction Design-Ease software version 6 offers a General Factorial option on the Factorial tab. If you completed the One

More information

Model selection Outline for today

Model selection Outline for today Model selection Outline for today The problem of model selection Choose among models by a criterion rather than significance testing Criteria: Mallow s C p and AIC Search strategies: All subsets; stepaic

More information

Split-Plot General Multilevel-Categoric Factorial Tutorial

Split-Plot General Multilevel-Categoric Factorial Tutorial DX10-04-1-SplitPlotGen Rev. 1/27/2016 Split-Plot General Multilevel-Categoric Factorial Tutorial Introduction In some experiment designs you must restrict the randomization. Otherwise it wouldn t be practical

More information

Regression Models Course Project Vincent MARIN 28 juillet 2016

Regression Models Course Project Vincent MARIN 28 juillet 2016 Regression Models Course Project Vincent MARIN 28 juillet 2016 Executive Summary "Is an automatic or manual transmission better for MPG" "Quantify the MPG difference between automatic and manual transmissions"

More information

Chapter 7: Linear regression

Chapter 7: Linear regression Chapter 7: Linear regression Objective (1) Learn how to model association bet. 2 variables using a straight line (called "linear regression"). (2) Learn to assess the quality of regression models. (3)

More information

Study Guide. Module 1. Key Terms

Study Guide. Module 1. Key Terms Study Guide Module 1 Key Terms general linear model dummy variable multiple regression model ANOVA model ANCOVA model confounding variable squared multiple correlation adjusted squared multiple correlation

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Rebecca C. Steorts, Duke University STA 325, Chapter 3 ISL 1 / 49 Agenda How to extend beyond a SLR Multiple Linear Regression (MLR) Relationship Between the Response and Predictors

More information

SPSS INSTRUCTION CHAPTER 9

SPSS INSTRUCTION CHAPTER 9 SPSS INSTRUCTION CHAPTER 9 Chapter 9 does no more than introduce the repeated-measures ANOVA, the MANOVA, and the ANCOVA, and discriminant analysis. But, you can likely envision how complicated it can

More information

For our example, we will look at the following factors and factor levels.

For our example, we will look at the following factors and factor levels. In order to review the calculations that are used to generate the Analysis of Variance, we will use the statapult example. By adjusting various settings on the statapult, you are able to throw the ball

More information

Lab 07: Multiple Linear Regression: Variable Selection

Lab 07: Multiple Linear Regression: Variable Selection Lab 07: Multiple Linear Regression: Variable Selection OBJECTIVES 1.Use PROC REG to fit multiple regression models. 2.Learn how to find the best reduced model. 3.Variable diagnostics and influential statistics

More information

Using the DATAMINE Program

Using the DATAMINE Program 6 Using the DATAMINE Program 304 Using the DATAMINE Program This chapter serves as a user s manual for the DATAMINE program, which demonstrates the algorithms presented in this book. Each menu selection

More information

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables

More information

Introduction to Statistical Analyses in SAS

Introduction to Statistical Analyses in SAS Introduction to Statistical Analyses in SAS Programming Workshop Presented by the Applied Statistics Lab Sarah Janse April 5, 2017 1 Introduction Today we will go over some basic statistical analyses in

More information

Reference

Reference Leaning diary: research methodology 30.11.2017 Name: Juriaan Zandvliet Student number: 291380 (1) a short description of each topic of the course, (2) desciption of possible examples or exercises done

More information

Stat 5100 Handout #14.a SAS: Logistic Regression

Stat 5100 Handout #14.a SAS: Logistic Regression Stat 5100 Handout #14.a SAS: Logistic Regression Example: (Text Table 14.3) Individuals were randomly sampled within two sectors of a city, and checked for presence of disease (here, spread by mosquitoes).

More information

36-402/608 HW #1 Solutions 1/21/2010

36-402/608 HW #1 Solutions 1/21/2010 36-402/608 HW #1 Solutions 1/21/2010 1. t-test (20 points) Use fullbumpus.r to set up the data from fullbumpus.txt (both at Blackboard/Assignments). For this problem, analyze the full dataset together

More information

CDAA No. 4 - Part Two - Multiple Regression - Initial Data Screening

CDAA No. 4 - Part Two - Multiple Regression - Initial Data Screening CDAA No. 4 - Part Two - Multiple Regression - Initial Data Screening Variables Entered/Removed b Variables Entered GPA in other high school, test, Math test, GPA, High school math GPA a Variables Removed

More information

Stat 5303 (Oehlert): Unbalanced Factorial Examples 1

Stat 5303 (Oehlert): Unbalanced Factorial Examples 1 Stat 5303 (Oehlert): Unbalanced Factorial Examples 1 > section

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup Random Variables: Y i =(Y i1,...,y ip ) 0 =(Y i,obs, Y i,miss ) 0 R i =(R i1,...,r ip ) 0 ( 1

More information

Week 7: The normal distribution and sample means

Week 7: The normal distribution and sample means Week 7: The normal distribution and sample means Goals Visualize properties of the normal distribution. Learning the Tools Understand the Central Limit Theorem. Calculate sampling properties of sample

More information

Why Should We Care? More importantly, it is easy to lie or deceive people with bad plots

Why Should We Care? More importantly, it is easy to lie or deceive people with bad plots Plots & Graphs Why Should We Care? Everyone uses plots and/or graphs But most people ignore or are unaware of simple principles Default plotting tools (or default settings) are not always the best More

More information

Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242

Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242 Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242 Creation & Description of a Data Set * 4 Levels of Measurement * Nominal, ordinal, interval, ratio * Variable Types

More information

Introductory Guide to SAS:

Introductory Guide to SAS: Introductory Guide to SAS: For UVM Statistics Students By Richard Single Contents 1 Introduction and Preliminaries 2 2 Reading in Data: The DATA Step 2 2.1 The DATA Statement............................................

More information

Centering and Interactions: The Training Data

Centering and Interactions: The Training Data Centering and Interactions: The Training Data A random sample of 150 technical support workers were first given a test of their technical skill and knowledge, and then randomly assigned to one of three

More information

General Factorial Models

General Factorial Models In Chapter 8 in Oehlert STAT:5201 Week 9 - Lecture 2 1 / 34 It is possible to have many factors in a factorial experiment. In DDD we saw an example of a 3-factor study with ball size, height, and surface

More information

Statistical Analysis of MRI Data

Statistical Analysis of MRI Data Statistical Analysis of MRI Data Shelby Cummings August 1, 2012 Abstract Every day, numerous people around the country go under medical testing with the use of MRI technology. Developed in the late twentieth

More information

Introduction to the R Statistical Computing Environment R Programming: Exercises

Introduction to the R Statistical Computing Environment R Programming: Exercises Introduction to the R Statistical Computing Environment R Programming: Exercises John Fox (McMaster University) ICPSR 2014 1. A straightforward problem: Write an R function for linear least-squares regression.

More information

One Factor Experiments

One Factor Experiments One Factor Experiments 20-1 Overview Computation of Effects Estimating Experimental Errors Allocation of Variation ANOVA Table and F-Test Visual Diagnostic Tests Confidence Intervals For Effects Unequal

More information

Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS

Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS ABSTRACT Paper 1938-2018 Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS Robert M. Lucas, Robert M. Lucas Consulting, Fort Collins, CO, USA There is confusion

More information

In this computer exercise we will work with the analysis of variance in R. We ll take a look at the following topics:

In this computer exercise we will work with the analysis of variance in R. We ll take a look at the following topics: UPPSALA UNIVERSITY Department of Mathematics Måns Thulin, thulin@math.uu.se Analysis of regression and variance Fall 2011 COMPUTER EXERCISE 2: One-way ANOVA In this computer exercise we will work with

More information

Stat 8053, Fall 2013: Additive Models

Stat 8053, Fall 2013: Additive Models Stat 853, Fall 213: Additive Models We will only use the package mgcv for fitting additive and later generalized additive models. The best reference is S. N. Wood (26), Generalized Additive Models, An

More information

Graphical Analysis of Data using Microsoft Excel [2016 Version]

Graphical Analysis of Data using Microsoft Excel [2016 Version] Graphical Analysis of Data using Microsoft Excel [2016 Version] Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters.

More information

BIO 360: Vertebrate Physiology Lab 9: Graphing in Excel. Lab 9: Graphing: how, why, when, and what does it mean? Due 3/26

BIO 360: Vertebrate Physiology Lab 9: Graphing in Excel. Lab 9: Graphing: how, why, when, and what does it mean? Due 3/26 Lab 9: Graphing: how, why, when, and what does it mean? Due 3/26 INTRODUCTION Graphs are one of the most important aspects of data analysis and presentation of your of data. They are visual representations

More information

To finish the current project and start a new project. File Open a text data

To finish the current project and start a new project. File Open a text data GGEbiplot version 5 In addition to being the most complete, most powerful, and most user-friendly software package for biplot analysis, GGEbiplot also has powerful components for on-the-fly data manipulation,

More information

Univariate Extreme Value Analysis. 1 Block Maxima. Practice problems using the extremes ( 2.0 5) package. 1. Pearson Type III distribution

Univariate Extreme Value Analysis. 1 Block Maxima. Practice problems using the extremes ( 2.0 5) package. 1. Pearson Type III distribution Univariate Extreme Value Analysis Practice problems using the extremes ( 2.0 5) package. 1 Block Maxima 1. Pearson Type III distribution (a) Simulate 100 maxima from samples of size 1000 from the gamma

More information

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM * Which directories are used for input files and output files? See menu-item "Options" and page 22 in the manual.

More information

Package GWRM. R topics documented: July 31, Type Package

Package GWRM. R topics documented: July 31, Type Package Type Package Package GWRM July 31, 2017 Title Generalized Waring Regression Model for Count Data Version 2.1.0.3 Date 2017-07-18 Maintainer Antonio Jose Saez-Castillo Depends R (>= 3.0.0)

More information

610 R12 Prof Colleen F. Moore Analysis of variance for Unbalanced Between Groups designs in R For Psychology 610 University of Wisconsin--Madison

610 R12 Prof Colleen F. Moore Analysis of variance for Unbalanced Between Groups designs in R For Psychology 610 University of Wisconsin--Madison 610 R12 Prof Colleen F. Moore Analysis of variance for Unbalanced Between Groups designs in R For Psychology 610 University of Wisconsin--Madison R is very touchy about unbalanced designs, partly because

More information

THE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533. Time: 50 minutes 40 Marks FRST Marks FRST 533 (extra questions)

THE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533. Time: 50 minutes 40 Marks FRST Marks FRST 533 (extra questions) THE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533 MIDTERM EXAMINATION: October 14, 2005 Instructor: Val LeMay Time: 50 minutes 40 Marks FRST 430 50 Marks FRST 533 (extra questions) This examination

More information

CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY

CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 23 CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 3.1 DESIGN OF EXPERIMENTS Design of experiments is a systematic approach for investigation of a system or process. A series

More information

GAMs semi-parametric GLMs. Simon Wood Mathematical Sciences, University of Bath, U.K.

GAMs semi-parametric GLMs. Simon Wood Mathematical Sciences, University of Bath, U.K. GAMs semi-parametric GLMs Simon Wood Mathematical Sciences, University of Bath, U.K. Generalized linear models, GLM 1. A GLM models a univariate response, y i as g{e(y i )} = X i β where y i Exponential

More information

Table of Contents (As covered from textbook)

Table of Contents (As covered from textbook) Table of Contents (As covered from textbook) Ch 1 Data and Decisions Ch 2 Displaying and Describing Categorical Data Ch 3 Displaying and Describing Quantitative Data Ch 4 Correlation and Linear Regression

More information

Getting started with simulating data in R: some helpful functions and how to use them Ariel Muldoon August 28, 2018

Getting started with simulating data in R: some helpful functions and how to use them Ariel Muldoon August 28, 2018 Getting started with simulating data in R: some helpful functions and how to use them Ariel Muldoon August 28, 2018 Contents Overview 2 Generating random numbers 2 rnorm() to generate random numbers from

More information

Lecture 25: Review I

Lecture 25: Review I Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,

More information

Regression Analysis and Linear Regression Models

Regression Analysis and Linear Regression Models Regression Analysis and Linear Regression Models University of Trento - FBK 2 March, 2015 (UNITN-FBK) Regression Analysis and Linear Regression Models 2 March, 2015 1 / 33 Relationship between numerical

More information

Estimation of Item Response Models

Estimation of Item Response Models Estimation of Item Response Models Lecture #5 ICPSR Item Response Theory Workshop Lecture #5: 1of 39 The Big Picture of Estimation ESTIMATOR = Maximum Likelihood; Mplus Any questions? answers Lecture #5:

More information

BIOL 458 BIOMETRY Lab 10 - Multiple Regression

BIOL 458 BIOMETRY Lab 10 - Multiple Regression BIOL 458 BIOMETRY Lab 10 - Multiple Regression Many problems in science involve the analysis of multi-variable data sets. For data sets in which there is a single continuous dependent variable, but several

More information

Example Using Missing Data 1

Example Using Missing Data 1 Ronald H. Heck and Lynn N. Tabata 1 Example Using Missing Data 1 Creating the Missing Data Variable (Miss) Here is a data set (achieve subset MANOVAmiss.sav) with the actual missing data on the outcomes.

More information

Chemical Reaction dataset ( https://stat.wvu.edu/~cjelsema/data/chemicalreaction.txt )

Chemical Reaction dataset ( https://stat.wvu.edu/~cjelsema/data/chemicalreaction.txt ) JMP Output from Chapter 9 Factorial Analysis through JMP Chemical Reaction dataset ( https://stat.wvu.edu/~cjelsema/data/chemicalreaction.txt ) Fitting the Model and checking conditions Analyze > Fit Model

More information