BIOL 458 BIOMETRY Lab 10 - Multiple Regression

Many problems in biological science involve the analysis of multivariate data sets. For data sets in which there is a single continuous dependent variable but several continuous independent variables, multiple regression is used. Multiple regression is a method of fitting linear models of the form:

Ŷ = b0 + b1X1 + b2X2 + ... + bkXk + ε

where Ŷ is the estimated value of Y, the criterion variable; X1, X2, ..., Xk are the k predictor variables; and b0, b1, b2, ..., bk are the regression coefficients. The values of the regression coefficients are determined by minimizing the sum of squares of the residuals, i.e., minimizing the sum of (Yi − Ŷi)² over the n observations (i = 1, ..., n). Hypothesis tests about the regression coefficients, or about the contribution of particular terms or groups of terms to the fit of the model, are then performed to determine the utility of the model.

In many studies, regression programs are used to generate a series of models; the "best" of these models is then chosen on the basis of a variety of criteria, as discussed in lecture. These models can be generated using a forward, backward, or stepwise regression routine. In forward regression, new independent variables are added to the model if they meet a set significance criterion for inclusion (often p < 0.05 for the partial F-test for the inclusion of the term in the model). In backward regression, all independent variables are initially entered into the model and sequentially taken out if they do not meet a set significance criterion (often p > 0.1 for the partial F-test for removal of a term). Stepwise regression uses both these techniques: a variable is entered if it meets the p-value to enter, and after each variable is added to the equation, all other variables in the equation are tested against the p-value to remove a term and, if necessary, thrown out of the model. The SPSS, SAS, MINITAB, SYSTAT, BMDP, and other statistical packages include these routines.

The computer output generated by these routines consists of a series of models for estimating the value of Y and the goodness-of-fit statistics for each model. Each model estimates the value of Y as a linear combination of values of the predictor variables included in that model.
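For a concrete sense of what these routines do, the same idea can be sketched in base R (which we use later in this lab). This is a minimal illustration with hypothetical variable names; note that base R's step() selects variables by AIC rather than by the partial F-tests described above.

# Hypothetical data frame 'dat' with response y and predictors x1, x2, x3
full  <- lm(y ~ x1 + x2 + x3, data = dat)   # full multiple regression model
empty <- lm(y ~ 1, data = dat)              # intercept-only starting model

# Forward selection (AIC-based in base R, unlike the partial F criterion above)
fwd <- step(empty, scope = formula(full), direction = "forward")

# Backward elimination starting from the full model
bwd <- step(full, direction = "backward")

summary(fwd)   # coefficients and goodness-of-fit statistics of the chosen model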

Further Instructions for Lab 10

In Lab 10 you use the same regression module to fit multiple regression models in SPSS that you used in Lab 9 to fit bivariate regression models. The big difference now is in entering the multiple independent variables, selecting the algorithm for building the model, and evaluating the fit of each model.

In the Linear Regression sub-window, you will see a box with a pull-down arrow called Method that by default is occupied by the word Enter. Enter is one of several model-building algorithms available in the Method box. Enter in SPSS is the equivalent of forcing all the variables in the Independent(s) box to be entered into the model simultaneously. The opposite of Enter is Remove, where all variables are removed simultaneously. Other model-building algorithms use various criteria to make decisions about which variables are entered into (or removed from) the model, and when to stop adding or removing variables from the model. SPSS has algorithms named Stepwise, Backward, and Forward.

In the Stepwise algorithm, the variable with the smallest probability of its F statistic (if it meets a criterion, such as p < 0.05) is entered into the model first. Then this process is repeated for the variables not yet included in the model, and the next variable that meets this criterion is added to the model. This process continues to add variables to the model until there are no variables left that have F statistics that meet some user-specified criterion (p < 0.05, for example). As this process progresses, the F statistics for variables already in the model can change. If the significance level of these F statistics exceeds the criterion, then these variables are removed from the model. Hence, in a Stepwise algorithm, variables can be both added to and removed from a model in the model-building process.

The Forward algorithm is identical to the Stepwise algorithm, except that variables can only be added to the model, not removed. The Backward algorithm puts all variables into the model, but then attempts to sequentially remove variables. The variable with the smallest partial correlation with the dependent variable is removed first if it meets the criterion for removal. If this variable is removed, then the variable with the next smallest partial correlation with the dependent variable is considered for removal, and removed if it meets the criterion. Note that in the Backward algorithm, variables are removed because their partial correlations exceed the significance criterion (p > 0.05), the opposite of the criterion for a Stepwise or Forward algorithm. Unfortunately, none of these algorithms is guaranteed to choose the best model. I prefer the Forward algorithm, but sometimes build models with different algorithms to see if they all choose the same best model.

Occasionally you might wish to enter variables in a specific sequence into a model, or to use different algorithms for model building for different groups of independent variables. To do so, you need to look at the text and buttons surrounding the Independent(s) box in the Linear Regression sub-window. Note a light gray line enclosing this region, and blue text that says Block 1 of 1. SPSS allows you to group variables into blocks and specify different variable selection methods for each block. For example, to build the Analysis of Covariance models that I described in class, you would place the variable name for the covariate into the Independent(s) box and select Enter as the Method (since you don't want SPSS to do any thinking, just put the variable in the model). Then you would click on the Next button. Note that the blue text now says Block 2 of 2.
Here you would enter the names of the dummy variables that define your groupings or factors in the covariance analysis. Again use the Method: Enter. Finally, you would click on the Next button to create the third block of variables. Here you would enter the variable names for the factor-covariate interactions. Once again use the Method: Enter.
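For comparison, here is a minimal sketch of the same three-block structure in R, with each update() call playing the role of one SPSS block (the data frame and variable names are hypothetical):

# Hypothetical data frame 'dat': response y, covariate x, grouping factor 'group'
m1 <- lm(y ~ x, data = dat)          # Block 1: covariate only
m2 <- update(m1, . ~ . + group)      # Block 2: add the grouping factor
m3 <- update(m2, . ~ . + x:group)    # Block 3: add the factor-covariate interaction

anova(m1, m2, m3)                    # F-tests for each block's contribution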

Assessing model fit involves all the same procedures used in bivariate regression, since the same assumptions apply. The dependent variable should be normally distributed, scatter plots should indicate linear relationships between the dependent and independent variables, and residual plots should show homoscedasticity (equality of variances in the residuals throughout the regression line). In addition to these issues, one also needs to check for outliers or overly influential data points, and for high intercorrelations between pairs of independent variables (called multicollinearity). If two independent variables are highly correlated (r > 0.9), then inclusion of both variables in the model causes problems in parameter estimation. You can pre-screen your independent variables by getting a correlation matrix prior to performing the regression and only allowing one variable of a pair of highly correlated variables to serve as a candidate variable for model building at a time. You could also examine the Tolerance values provided by SPSS in the output table named Excluded Variables. These values also provide you with information about whether you have a problem with multicollinearity. Come to class to find out how to interpret the tolerance values. (These pre-screening and diagnostic checks are sketched in R after the data description below.)

Lab 10 Assignment

The exercise to be performed in this lab is to use the SPSS stepwise and forward regression routines to generate a series of models, and to select the "best" model from each series, as discussed in lecture. Two data sets will be provided; you are to perform the analysis on either of these two. You must discuss in detail the reasons for choosing the models that you have selected, including showing plots of residuals, information about the distribution of the response variable, examining outliers, and other metrics to demonstrate goodness-of-fit.

DESCRIPTION OF DATA

The data are stored in a file MULTR2. The variables are as follows (they are in the same order in the data sets):

VARIABLE (UNITS)
Mean elevation (feet)
Mean temperature (degrees F)
Mean annual precipitation (inches)
Vegetative density (percent cover)
Drainage area (square miles)
Latitude (degrees)
Longitude (degrees)
Elevation at temperature station (feet)
1-hour, 25-year precipitation intensity (inches/hour)
Annual water yield (inches) (dependent variable)

The data consist of values of these variables measured on all gauged watersheds in the western region of the USA. The dependent variable is annual water yield. Develop and evaluate a model for estimating water yield from ungauged basins in the western USA.
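The pre-screening and diagnostic checks described above can be done in R roughly as follows. This is a minimal sketch: 'yield' and 'elev' stand in for whatever names the response and predictor variables carry in your copy of the data file.

multr2 <- read.csv("multr2.csv")      # load the lab data

round(cor(multr2), 2)                 # correlation matrix: flag pairs with |r| > 0.9

fit <- lm(yield ~ ., data = multr2)   # model with all predictors ('yield' is hypothetical)
plot(fitted(fit), resid(fit))         # residual plot: look for non-constant spread
abline(h = 0, lty = 2)

# Tolerance for one predictor: 1 - R^2 from regressing it on the other predictors.
# Values near 0 indicate a multicollinearity problem ('elev' is hypothetical).
1 - summary(lm(elev ~ . - yield, data = multr2))$r.squared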

Lab 10 in R (use 'multr2.csv')

To Obtain a General Plot of Every Variable Against Every Other Variable

When looking at a dataset with multiple variables, this can be a useful tool for seeing correlations between specific pairs of variables:

> plot(dataset)

Stepwise, Backward, and Forward Model Selection Using the StepF Script

This function is for running a backward or forward model selection algorithm. It uses the add1 and drop1 functions to select variables to add or drop based on the F statistic. The model is then updated using update. This is repeated until a level of significance is met. You will see a printout of each iteration with the output of add1 or drop1, and the variable selected for addition or deletion.

To use the function, you need to source the script:

1) Save the script StepF.R on your computer.
2) In the R Console window choose File -> Source R Code...
3) Select the StepF.R file and press Open.

Methods of Use:

There are a few ways to use StepF. I will use the pubescence data set to illustrate the different methods. The data set should be attached. However it is used, if you assign the result of StepF to a variable, it will contain the final model selected.

Example:

> mylm <- StepF(dataTable = pubescence, response = "abherb", level = 0.05, direction = "backward")
<Here you will see the output for each iteration>
> mylm

Call:
lm(formula = abherb ~ srherb)

Coefficients:
(Intercept)       srherb
     -52.08         2.88

1) Provide a data set and identify the response variable. StepF will then construct a model. If you are using direction = "backward", the full model based on every column in the data set will be created. If you are using direction = "forward", an empty model will be created. This model will then be run through the algorithm, removing or adding variables based on the level of significance. If you want it to use glm instead of lm, use the argument general = TRUE.

Example:

> StepF(dataTable = pubescence, response = "abherb", level = 0.05, direction = "backward")

In the first iteration, the model starts with every variable from pubescence. After 8 iterations only srherb is selected.

2) Provide a model you have made (either lm or glm). StepF will run your model through the algorithm as before.

Example:

> plm5 <- lm(formula = abherb ~ site + lfsize + cong + range + aveden + avelen + slarea + srherb)
> StepF(model = plm5, level = 0.05, direction = "backward")

This is the same full model as StepF created from the data set in the last example. The result is the same.

3) Whether you provide the model or let StepF make it from a data file, if you want to limit the variables that can be removed, you can specify them in a scope as you would when using add1 or drop1.

Backward example:

> StepF(model = plm5, scope = formula(~ aveden + avelen + slarea), direction = "backward", level = 0.05)

The model is as before (all variables from pubescence), but only aveden, avelen, and slarea are options for StepF to drop from the model. In this case, all three are removed.

Forward example:

> plm6 <- glm(abherb ~ srherb)
> StepF(model = plm6, scope = formula(~ srherb + slarea + aveden + avelen), direction = "forward", level = 0.05)

First we create the model we want to start with. In plm6 we are forcing srherb to be included in the model, and having StepF check whether any of the other three variables listed in the scope formula should be added. In this case no more are added.

Function arguments and defaults:

StepF <- function(model = NULL, general = FALSE, dataTable = NULL, scope = NULL, response = NULL, interactions = FALSE, level = 0.05, direction = "backward", steps = 100)

List of arguments:

model: The lm or glm to use as a starting point for the algorithm. Note: You can use this, or have StepF create a model from dataTable and response.

general: When StepF creates the model, whether glm should be used instead of lm.

dataTable: Data with variables, both response and predictor(s), to be used in the model. Note: If you use this instead of supplying your own model, you need to also specify response.

scope: Formula to be passed to add1 (list of variables to add) or drop1 (list of variables to drop).

response: The string name of the response variable in dataTable.

interactions: If TRUE, will use * to link all variables when StepF creates the formula from dataTable. Otherwise + will be used (default). Note: Only affects model creation for the backward algorithm.

level: The level of significance against which to test whether a variable should be removed.

direction: The algorithm to use: "backward" or "forward".

steps: The maximum number of iterations that will be run.
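If you want to see the machinery StepF wraps, here is a minimal sketch of a single backward-elimination step done by hand with base R's drop1 and update. This illustrates the idea; it is not the StepF source code:

# One backward step by hand, using the pubescence variables from the examples above
fit <- lm(abherb ~ site + lfsize + cong + srherb, data = pubescence)

d <- drop1(fit, test = "F")          # partial F-test for dropping each term
print(d)

# If the least significant term has p > 0.05, remove it and refit
pvals <- d[["Pr(>F)"]][-1]           # skip the <none> row
worst <- rownames(d)[-1][which.max(pvals)]
if (max(pvals, na.rm = TRUE) > 0.05) {
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

StepF simply repeats this add1/drop1-and-update cycle until no term meets the level criterion or the maximum number of steps is reached.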