BIOL 458 BIOMETRY Lab 10 - Multiple Regression
- Kathleen Lester
Many problems in science involve the analysis of multi-variable data sets. For data sets in which there is a single continuous dependent variable but several continuous (and potentially discrete) independent variables, multiple regression is used. Multiple regression fits linear models of the form

Y_i = β_0 + β_1 X_1i + … + β_p X_pi + ε_i,  where ε_i ~ iid N(0, σ²).

Y_i is the response of subject i. β_0 is the y-intercept of the model. β_1, …, β_p are the model coefficients or weights that relate Y_i to the experimental treatments or explanatory variables. X_1i, …, X_pi are the values of the measured independent variables, or codes used to classify a subject as to group membership; for example, 1 for receiving medicine and 0 for placebo might be the codes used in a two-group design. ε_i are the random errors or residuals: the deviations of the observed responses from the model's predictions.

The values of the regression coefficients are determined by minimizing the sum of squares of the residuals (errors), i.e., minimizing

Σ (i = 1 to n) ε_i² = Σ (i = 1 to n) (Y_i − Ŷ_i)² = Σ (i = 1 to n) (Y_i − [β_0 + β_1 X_1i + β_2 X_2i + … + β_p X_pi])²,

where Ŷ_i is the model's predicted value for observation Y_i. Hypothesis tests about the regression coefficients, or about the contribution of particular terms or groups of terms to the fit of the model, are then performed to determine the utility of the model. In many studies, regression programs are used to generate a series of models; the "best" of these models is then chosen on the basis of a variety of criteria. These models can be generated by algorithms that add variables to the model in a stepwise fashion, or by examining a large number (or even all) of the possible regression models given the data.

Stepwise Regression

Stepwise regression builds a model by adding or removing variables one at a time.
Stepwise methods have lost much of their popularity because it has been shown that they are not guaranteed to select the best model.
In a forward stepwise regression, new independent variables are added to the model if they meet a set significance criterion for inclusion (often p < 0.05 for the partial F-test for including the term in the model). The variable with the lowest p-value is added to the model at each step, and the algorithm stops when no new variable meets the significance criterion. In a backward stepwise regression, all independent variables are initially entered into the model. They are then sequentially removed if they do not meet a set significance criterion for retention (often p > 0.10 or p > 0.05 for the partial F-test for removing a term). The variable with the highest p-value is removed from the model at each step until no variable meeting the removal criterion remains. Stepwise regression combines both techniques, with variables added or removed at each step of the process: a variable is entered if it meets the p-value to enter, and after each variable is added all other variables in the equation are tested against the p-value to remove a term, so a variable may subsequently be removed from the model. SPSS, SAS, MINITAB, SYSTAT, BMDP, and other statistical packages include these routines. The output generated by these routines consists of a series of models for estimating the value of Y, together with goodness-of-fit statistics for each model; each model estimates Y as a linear combination of the predictor variables included in that model.

In R, no stepwise regression routine using a partial F-test is available in the base installation. The MASS package, which ships with R, provides a stepwise function called stepAIC that uses Akaike's Information Criterion as the basis for stepwise model selection; we will avoid this information-theoretic approach to model selection for now. However, a former student (Joe Hill Warren) wrote a function called StepF which we can use to examine the behavior of these algorithms.
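The partial F-tests that drive one step of these algorithms are available directly in base R: add1() tests single-term additions and drop1() single-term deletions. Here is a minimal sketch on simulated data (all variable names are illustrative, not the lab's data):

```r
# simulate a small data set in which only x1 truly affects y
set.seed(2)
n  <- 40
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)
d  <- data.frame(y, x1, x2, x3)

# one forward step: partial F-tests for adding each candidate variable
fwd <- add1(lm(y ~ 1, data = d), scope = ~ x1 + x2 + x3, test = "F")
fwd

# one backward step: partial F-tests for deleting each variable from the full model
bwd <- drop1(lm(y ~ x1 + x2 + x3, data = d), test = "F")
bwd
```

A forward or backward selection algorithm is simply a loop that applies one of these steps, updates the model formula, and repeats until no term meets the criterion.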
However, StepF does not perform the full stepwise algorithm, only the forward and backward selection algorithms. To read more about the StepF function, click the link. To use the StepF function, download the file StepF.R and open it in RStudio. From the File menu, open the file and then click Source on the Code menu; alternatively, from the Code menu click Source File and choose the downloaded file StepF.R. This loads the function StepF, and you can then use its features to perform forward or backward selection.

Later in this lab we will address issues in building and assessing regression models. We could at this point use a number of techniques to examine our data before beginning the process of model selection, or we could use those same techniques after developing a set of candidate models to assess. In this demonstration I will take the latter approach, postponing a detailed assessment of whether a model meets the assumptions of regression until later.

To demonstrate multiple regression we will examine data on the species richness of plants in the Galapagos Islands. The data file Galapagos-plants.txt contains species
richness and the number of endemic species for plants on 29 islands, along with data about the physical characteristics of the islands (island name, island area, maximum elevation, distance to nearest island, area of nearest island, distance to Santa Cruz Island, and the number of botanical collecting trips to each island).

# read in data file on Galapagos plants
dat = read.table("k:/biometry/biometry-fall-2015/lab10/Galapagos-plants.txt", header = TRUE)
head(dat)

[head(dat) output: columns Isla, Spec, Area, Elev, DisN, DisS, AreA, Coll, Endm; first rows are the islands Balt, Bart, Cald, Cham, Coam, Daph; numeric values lost in transcription]

It is traditional to examine the relationship between log(number of species) and log(area), so I will create these variables and a new data.frame to hold them along with the other original variables for the analysis.

# create variables to be used in the regression and put them in a new data.frame
logarea = log(dat$Area)
logspec = log(dat$Spec)
elev = dat$Elev
diss = dat$DisS
disn = dat$DisN
coll = dat$Coll
area = dat$Area
dd = data.frame(logspec, logarea, elev, diss, disn, area, coll)
head(dd)

[head(dd) output: columns logspec, logarea, elev, diss, disn, area, coll; numeric values lost in transcription]

As an initial diagnostic step, I obtain the correlation matrix of the variables. I set options(digits = 4) to control how many digits are printed so the matrix will not wrap around.

# obtain correlation matrix of variables
options(digits = 4)
cor(dd)

[cor(dd) output: the 7 x 7 correlation matrix of logspec, logarea, elev, diss, disn, area, and coll; numeric values lost in transcription]

Note that we can already see that logspec is strongly associated with logarea and coll (the number of collecting trips), and less strongly associated with elev, so we might expect these to be the variables entered into the regression models.

Now we will source the StepF.R code file. Note that in R Markdown you need to give the full path and name of the file to be sourced.

source("k:/biometry/biometry-fall-2015/lab10/StepF.R")

Now let's use the forward stepwise approach to select a model. The output reflects a multi-step process. At each step, partial F-tests are reported that test whether the reduction in the residual sum of squares (RSS) from adding each variable to the model individually would be statistically significant. The variable that causes the greatest reduction in the RSS is added to the model. On iteration 1, with only the grand mean in the model, the RSS is 70.6; adding logarea produces the largest reduction in the RSS, and logarea also has the smallest p-value, so logarea is added to the model first. On iteration 2, after logarea is in the model, only the addition of coll results in a statistically significant reduction in the RSS at α = 0.05 (p = 0.017), so coll is added to the model. On iteration 3, none of the remaining variables have p < 0.05, so the algorithm stops after adding logarea and coll to the model.

# perform forward stepwise regression
mod.7 = StepF(datatable = dd, response = "logspec", level = 0.05, direction = "forward")

==================== Iteration #1 ====================
Single term additions

Model: logspec ~ 1
[partial F-tests for adding each variable: logarea (p ~ e-11 ***), elev (p ~ e-05 ***), diss, disn, area, coll (p ~ e-09 ***); other numeric values lost in transcription]
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variable with lowest p-value: logarea
Updating model formula: . ~ . + logarea

==================== Iteration #2 ====================
Single term additions

Model: logspec ~ logarea
[partial F-tests for adding each remaining variable: elev, diss, disn, area, coll (p = 0.017 *); numeric values lost in transcription]
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variable with lowest p-value: coll
Updating model formula: . ~ . + coll

==================== Iteration #3 ====================
Single term additions

Model: logspec ~ logarea + coll
[partial F-tests for adding elev, diss, disn, or area: none significant; numeric values lost in transcription]
========== No further variables significant at 0.05 ==========
Final Model: logspec ~ logarea + coll

We can use the backward stepwise approach as well. This algorithm will not always converge on the same model as the forward approach, but in this instance it does. In the backward approach, all variables are initially put into the model, and those that cause the smallest increase in the RSS are sequentially removed. It takes a couple more iterations than the forward approach, but converges on the same best model.
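Each step's partial F-test is the usual comparison of two nested models, so anova() on a reduced and a full model reproduces what the selection routine computes internally. A sketch on simulated data (variable names are illustrative, not the lab's data):

```r
# simulate data with two predictors
set.seed(3)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + rnorm(n)
d  <- data.frame(y, x1, x2)

reduced <- lm(y ~ x1,      data = d)
full    <- lm(y ~ x1 + x2, data = d)

# partial F-test for adding x2, given that x1 is already in the model
a1 <- anova(reduced, full)
a1

# add1() reports the same F statistic in its x2 row
a2 <- add1(reduced, scope = ~ x1 + x2, test = "F")
a2
```

The F value in the second row of the anova() table and the F value for x2 in the add1() table are identical, which is why forward selection is often described as a sequence of nested-model comparisons.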
# perform backward stepwise regression
mod.8 = StepF(datatable = dd, response = "logspec", level = 0.05, direction = "backward")

==================== Iteration #1 ====================
Single term deletions

Model: logspec ~ logarea + elev + diss + disn + area + coll
[partial F-tests for deleting each variable: logarea (***), elev, diss, disn, area, coll; numeric values lost in transcription]
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variable with highest p-value: disn
Updating model formula: . ~ . - disn

==================== Iteration #2 ====================
Single term deletions

Model: logspec ~ logarea + elev + diss + area + coll
[partial F-tests for deleting each variable: logarea (***), elev, diss, area, coll; numeric values lost in transcription]
Variable with highest p-value: area
Updating model formula: . ~ . - area

==================== Iteration #3 ====================
Single term deletions

Model: logspec ~ logarea + elev + diss + coll
[partial F-tests for deleting each variable: logarea (***), elev, diss, coll (*); numeric values lost in transcription]
Variable with highest p-value: diss
Updating model formula: . ~ . - diss

==================== Iteration #4 ====================
Single term deletions

Model: logspec ~ logarea + elev + coll
[partial F-tests for deleting each variable: logarea (***), elev, coll (*); numeric values lost in transcription]
Variable with highest p-value: elev
Updating model formula: . ~ . - elev

==================== Iteration #5 ====================
Single term deletions

Model: logspec ~ logarea + coll
[partial F-tests for deleting logarea (***) and coll (*); numeric values lost in transcription]
========== All variables significant at 0.05 ==========
Final Model: logspec ~ logarea + coll

There are other ways in which the StepF function can be used. One approach is to build a model containing variables you wish to force into the model, and then check whether any other variables will be added after those forced in. For example, suppose we wanted to force coll into the model and to know whether other variables explain residual variation in logspec after accounting for coll. We could think of coll as a nuisance variable that measures differences in sampling effort across the islands; perhaps we want to know which variables are still useful in explaining variation among the islands in the species richness of plants after accounting for that variable sampling effort. To do this, we first build a linear model with coll only and save it in a model object. Then we call StepF, specifying the model object name as our initial model, and
then the scope argument with coll and any other variable we wish to assess. Note that this process indicates that, even after accounting for the variability in sampling effort among islands, logarea still explains residual variation in logspec.

# to determine if any variables would be added to a model with only coll as the predictor variable
# first build a linear model with coll as the only predictor variable
lm1 = lm(formula = logspec ~ coll)
# then use StepF, specifying the model with coll and a "scope" argument listing coll and the other candidate variables
StepF(model = lm1, scope = formula(~ coll + logarea + elev + disn + diss + area), level = 0.05, direction = "forward")

==================== Iteration #1 ====================
Single term additions

Model: logspec ~ coll
[partial F-tests for adding each variable: logarea (***), elev, disn, diss, area; numeric values lost in transcription]
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variable with lowest p-value: logarea
Updating model formula: . ~ . + logarea

==================== Iteration #2 ====================
Single term additions

Model: logspec ~ coll + logarea
[partial F-tests for adding elev, disn, diss, or area: none significant; numeric values lost in transcription]
========== No further variables significant at 0.05 ==========
Final Model:
Call: lm(formula = logspec ~ coll + logarea)
Coefficients: (Intercept), coll, logarea [numeric values lost in transcription]

The StepF pdf file explains other ways in which you might use the StepF function. Note that the StepF function does not report the regression coefficients, nor does it compute the residuals and other fit statistics for the final model. After using StepF to select models, one must then use the lm function to fit the selected models and evaluate their adequacy.

Best Subsets Regression

An alternative approach to model selection is to compute all possible regressions given a set of candidate explanatory variables, or at least the best subset of models at each level of model complexity. By model complexity, I mean the number of predictor variables included in the model. To calculate the number of possible ordered models with 6 predictor variables, we compute the number of permutations of 6 variables taken 1, 2, 3, 4, 5, or 6 at a time and add them up. A permutation is like a combination except that we consider the case AB different from the case BA. The number of permutations of n things taken k at a time is

P(n, k) = n! / (n − k)!

The calculation of the number of permutations is similar to the calculation of the number of combinations of n things taken k at a time, except that it lacks the factor of k! in the denominator. R code to calculate the number of possible models is given below. Do you want to generate, fit, and assess all possible 1,957 models? Remember that in regression we use Type I sums of squares, so in models with different orderings of the variables the individual variables may explain different amounts of variation in the response variable.

# calculate the number of permutations of 6 variables for models with 1 to 6 predictor variables
n = 6
perm = rep(0, n)
for (k in 1:n) {
  perm[k] = factorial(n) / factorial(n - k)
}
perm

[1]   6  30 120 360 720 720

totperm = sum(perm) + 1   # + 1 for the intercept-only model
totperm

[1] 1957

Rather than tackling the daunting task of examining 1,957 models, we will use the regsubsets function from the package leaps to select the best k models with 1 predictor, 2 predictors, and so on.

# load package leaps
library(leaps)

In leaps we will use the regsubsets function to generate the 3 best models at each level of complexity. You could choose to do more, but the graphical display of the results becomes problematic with large subset sizes. Running the regsubsets function requires the model formula and the specification of the subset size. Performing a summary of the regsubsets object results in a tabulation of the models ranked in order of best fit. An * indicates that the variable is included in the model.

# to get the k best regression models of each size
k = 3
mm = regsubsets(logspec ~ logarea + elev + disn + diss + area + coll, data = dd, nbest = k)
summary(mm)

Subset selection object
Call: regsubsets.formula(logspec ~ logarea + elev + disn + diss + area + coll, data = dd, nbest = k)
6 Variables (and intercept)
        Forced in Forced out
logarea     FALSE      FALSE
elev        FALSE      FALSE
disn        FALSE      FALSE
diss        FALSE      FALSE
area        FALSE      FALSE
coll        FALSE      FALSE
3 subsets of each size up to 6
Selection Algorithm: exhaustive
         logarea elev disn diss area coll
1  ( 1 ) "*"     " "  " "  " "  " "  " "
1  ( 2 ) " "     " "  " "  " "  " "  "*"
1  ( 3 ) " "     "*"  " "  " "  " "  " "
2  ( 1 ) "*"     " "  " "  " "  " "  "*"
2  ( 2 ) "*"     " "  " "  "*"  " "  " "
2  ( 3 ) "*"     "*"  " "  " "  " "  " "
3  ( 1 ) "*"     "*"  " "  " "  " "  "*"
3  ( 2 ) "*"     " "  " "  "*"  " "  "*"
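For comparison, an exhaustive search such as the one regsubsets performs considers unordered subsets of the predictors rather than orderings, and that count is far smaller. A quick check in base R:

```r
# number of distinct (unordered) subsets of 6 predictors, by subset size
n <- 6
counts <- choose(n, 1:n)
counts         # 6 15 20 15 6 1
sum(counts)    # 63 non-empty subsets in an exhaustive search
```

So an exhaustive subset search over 6 candidate variables fits only 63 models, even though the number of distinct variable orderings is far larger.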
3  ( 3 ) "*"     " "  "*"  " "  " "  "*"
4  ( 1 ) "*"     "*"  " "  "*"  " "  "*"
4  ( 2 ) "*"     "*"  "*"  " "  " "  "*"
4  ( 3 ) "*"     "*"  " "  " "  "*"  "*"
5  ( 1 ) "*"     "*"  " "  "*"  "*"  "*"
5  ( 2 ) "*"     "*"  "*"  "*"  " "  "*"
5  ( 3 ) "*"     "*"  "*"  " "  "*"  "*"
6  ( 1 ) "*"     "*"  "*"  "*"  "*"  "*"

Although not printed as part of the summary display, the summary of the regsubsets object contains much more information, including the R² values for each model. I demonstrate below one way to extract those R² values from the summary. You can learn more about the information in the summary by using str() on the summary.

# to get the R-squared values from the regsubsets summary
nummods = k * (n - 1) + 1   # 16 models in this example
a = summary(mm)
m = rep(0, nummods)
for (i in 1:nummods) {
  m[i] = a[[2]][[i]]        # component 2 of the summary holds the R-squared values
}
m

[the 16 R-squared values; numeric values lost in transcription]

There are other options to graphically display the results from regsubsets. For example, one kind of plot shows the model R² on the y-axis and indicates which variables are in each model by shading on the x-axis. White indicates that the variable is not in the model, while shading indicates that it is. Variables that are mostly white contribute to few models, while variables that are mostly black contribute to many models. In our example, as you might expect, logarea, coll, and elev are mostly black while the other variables are mostly white.

# to plot a summary of which variables are in each model, ranked by R-squared
plot(mm, scale = "r2")
plot(mm, scale = "adjr2")
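Once a short list of candidate models is in hand, the same comparison can be made with plain lm fits by extracting each model's adjusted R² from its summary. A sketch on simulated data (the variable names and candidate list are illustrative, not the lab's data):

```r
# simulate data in which y depends on x1 and x2 but not x3
set.seed(6)
n  <- 40
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + x2 + rnorm(n)

# a few candidate models, as a subset search might propose
cands <- list(m1 = lm(y ~ x1),
              m2 = lm(y ~ x1 + x2),
              m3 = lm(y ~ x1 + x2 + x3))

# adjusted R-squared for each candidate
adjr2 <- sapply(cands, function(m) summary(m)$adj.r.squared)
adjr2
names(which.max(adjr2))   # candidate with the highest adjusted R-squared
```

Because adjusted R² penalizes complexity, adding a useless predictor (x3 here) does not automatically raise it the way plain R² would.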
A similar plot can be generated for other goodness-of-fit statistics such as the adjusted R². Remember that

R² = 1 − SS_error / SS_total

and the adjusted R², which penalizes the model for its complexity, is

adjusted R² = 1 − (SS_error / df_error) / (SS_total / df_total).

Finally, there is also another graphical display of the results available in the car package. The function subsets in car will plot the results of a call to regsubsets from leaps.

# plot a summary of the results
library(car)
subsets(mm, statistic = "rsq", abbrev = 1, legend = TRUE, cex = 0.7)
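Both statistics can be computed by hand from any fitted model and checked against what summary() reports. A sketch on simulated data (variable names are illustrative):

```r
# simulate a simple data set and fit a model
set.seed(4)
n   <- 25
x   <- rnorm(n)
y   <- 1 + x + rnorm(n)
fit <- lm(y ~ x)

ss.total <- sum((y - mean(y))^2)   # total sum of squares, df = n - 1
ss.error <- sum(resid(fit)^2)      # residual (error) sum of squares

r2     <- 1 - ss.error / ss.total
adj.r2 <- 1 - (ss.error / fit$df.residual) / (ss.total / (n - 1))

c(by.hand = r2,     from.summary = summary(fit)$r.squared)
c(by.hand = adj.r2, from.summary = summary(fit)$adj.r.squared)
```

Both pairs agree exactly, since summary.lm computes these statistics from the same sums of squares and degrees of freedom.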
The plot abbreviates the variable names:

logarea  l
elev     e
disn     dsn
diss     dss
area     a
coll     c

This summary is similar to the previous one, but labels the models with the included variables. For models with many variables, or for larger subset sizes, the labels overlap and make the plot unreadable. You can also use the regsubsets function with the argument method specified as "forward", "backward", or "seqrep" (sequential replacement) rather than the default, which is "exhaustive".

At this point, after using one of the stepwise algorithms or an exhaustive search, one has a set of candidate models to examine in more detail. Assessing model fit involves all the same procedures used in bivariate regression, since the same assumptions apply. The dependent variable should be normally distributed, scatter plots should indicate linear relationships between the dependent and independent variables, and residual plots should show homoscedasticity (equality of variances of the residuals along the regression line). In addition to these issues, one also needs to check for outliers or overly influential data points, and for high inter-correlations between pairs of independent variables (called multicollinearity). If two independent variables are highly
correlated (r > 0.9), then inclusion of both variables in the model causes problems in parameter estimation. You can pre-screen your independent variables by obtaining a correlation matrix prior to performing the regression and allowing only one variable of each highly correlated pair to serve as a candidate variable for model building at a time. Remember that the tools outlined in Lab 9 for assessing model fit are also applicable to multiple regression models: the norm function from QuantPsyc, plot(modelobj), and plot(data.frame) can provide much useful diagnostic information. Other diagnostic procedures are available in the car package.

Advice on Building and Assessing Regression Models

Building

1. Choose the set of candidate predictor variables to potentially be included in the model.
2. Examine the distribution of the response variable to determine if it meets the assumption of normality. Transform if necessary.
3. Examine scatter plots of the relationships between the response variable y and the predictor (independent) variables x to determine if the relationships are linear. Potentially transform x, y, or both to achieve linearity.
4. Examine the correlations between the predictor variables. High correlations (values of r >> 0.9) might indicate linear dependencies among the predictor variables, which can make the estimates of the regression coefficients unstable and inflate the variance of the estimates. Consider deleting members of these pairs of variables, since they are essentially redundant.
5. Choose the algorithmic approach to fitting a model: in blocks (chunkwise), by forcing entry of variables into the model in a particular sequence, by backward elimination, by forward addition of variables to the model, etc.
6. Decide on the criteria you will use for retaining variables in the model (significant partial t or F statistics at a specified α). Build the model.

Assessing

1.
Obtain a plot of the standardized residuals against the standardized predicted values. Examine this plot for heterogeneity in the distribution of the residuals. A desirable pattern would show both negative and positive residuals of equal magnitude throughout the length of the predicted regression: the envelope of residuals around the regression line should appear rectangular and be centered on the line.
2. Examine the correlations among pairs of predictor variables to check for multicollinearity. If r >> 0.9 for any pair, then try alternative models that eliminate one member of the pair.
3. Examine the diagnostic plots to make sure that there are no observations with high leverage or high influence. Influential data points will have large Cook's D values (a common rule of thumb is D > 1).
4. Compare alternative models to determine if one or more models fit the data equally well.
5. The model with the best residual pattern, that is not beset with collinearity or influential data points, and that has the highest R², is the best model. Note that R² is the last criterion to use in choosing a model, not the first.

Lab 10 Assignment

The exercise to be performed in this lab is to use the StepF and/or regsubsets functions in R to generate a set of candidate models, and to select the individual "best" model, or a set of best models if 2 or more models seem to be equally good. You must discuss in detail the reasons for choosing the models you have selected, including showing plots of residuals, information about the distribution of the response variable, examination of outliers, and other metrics demonstrating goodness-of-fit.

DESCRIPTION OF DATA

The data are stored in the file multr2.csv. The variables are as follows (they appear in this order in the data set):

VARIABLE (UNITS)
- Mean elevation (feet)
- Mean temperature (degrees F)
- Mean annual precipitation (inches)
- Vegetative density (percent cover)
- Drainage area (miles²)
- Latitude (degrees)
- Longitude (degrees)
- Elevation at temperature station (feet)
- 1-hour, 25-year precipitation intensity (inches/hour)
- Annual water yield (inches) (dependent variable)

The data consist of values of these variables measured on all gauged watersheds in the western region of the USA. Develop and evaluate a model for estimating water yield from un-gauged basins in the western USA.
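As a starting point for the assessment steps above, the main diagnostics (residual plot, Cook's distance, and a by-hand variance inflation factor) can be sketched in base R. The data here are simulated and all variable names are illustrative; for the assignment you would apply the same calls to your fitted model for multr2.csv.

```r
# simulate data and fit a two-predictor model
set.seed(5)
n   <- 60
x1  <- rnorm(n); x2 <- rnorm(n)
y   <- 1 + x1 + 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# residual plot: standardized residuals vs fitted values
plot(fitted(fit), rstandard(fit),
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)

# influence: Cook's distance for each observation
cd <- cooks.distance(fit)
which(cd > 1)   # flag highly influential points (D > 1 rule of thumb)

# multicollinearity: VIF for x1 computed by hand as 1 / (1 - R^2)
# from regressing x1 on the remaining predictor(s)
vif.x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
vif.x1
```

The car package's vif() function automates the last calculation for all predictors at once.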
Multivariate Analysis Multivariate Calibration part 2 Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Linear Latent Variables An essential concept in multivariate data
More informationMultiple Regression White paper
+44 (0) 333 666 7366 Multiple Regression White paper A tool to determine the impact in analysing the effectiveness of advertising spend. Multiple Regression In order to establish if the advertising mechanisms
More informationModel selection Outline for today
Model selection Outline for today The problem of model selection Choose among models by a criterion rather than significance testing Criteria: Mallow s C p and AIC Search strategies: All subsets; stepaic
More informationThe problem we have now is called variable selection or perhaps model selection. There are several objectives.
STAT-UB.0103 NOTES for Wednesday 01.APR.04 One of the clues on the library data comes through the VIF values. These VIFs tell you to what extent a predictor is linearly dependent on other predictors. We
More informationModel Selection and Inference
Model Selection and Inference Merlise Clyde January 29, 2017 Last Class Model for brain weight as a function of body weight In the model with both response and predictor log transformed, are dinosaurs
More informationUsing Excel for Graphical Analysis of Data
Using Excel for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters. Graphs are
More informationAnalysis of variance - ANOVA
Analysis of variance - ANOVA Based on a book by Julian J. Faraway University of Iceland (UI) Estimation 1 / 50 Anova In ANOVAs all predictors are categorical/qualitative. The original thinking was to try
More informationTHE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533. Time: 50 minutes 40 Marks FRST Marks FRST 533 (extra questions)
THE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533 MIDTERM EXAMINATION: October 14, 2005 Instructor: Val LeMay Time: 50 minutes 40 Marks FRST 430 50 Marks FRST 533 (extra questions) This examination
More informationInformation Criteria Methods in SAS for Multiple Linear Regression Models
Paper SA5 Information Criteria Methods in SAS for Multiple Linear Regression Models Dennis J. Beal, Science Applications International Corporation, Oak Ridge, TN ABSTRACT SAS 9.1 calculates Akaike s Information
More informationLinear Model Selection and Regularization. especially usefull in high dimensions p>>100.
Linear Model Selection and Regularization especially usefull in high dimensions p>>100. 1 Why Linear Model Regularization? Linear models are simple, BUT consider p>>n, we have more features than data records
More informationEXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY
EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA, 2015 MODULE 4 : Modelling experimental data Time allowed: Three hours Candidates should answer FIVE questions. All questions carry equal
More informationDiscussion Notes 3 Stepwise Regression and Model Selection
Discussion Notes 3 Stepwise Regression and Model Selection Stepwise Regression There are many different commands for doing stepwise regression. Here we introduce the command step. There are many arguments
More informationGraphical Analysis of Data using Microsoft Excel [2016 Version]
Graphical Analysis of Data using Microsoft Excel [2016 Version] Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters.
More informationSection 4 General Factorial Tutorials
Section 4 General Factorial Tutorials General Factorial Part One: Categorical Introduction Design-Ease software version 6 offers a General Factorial option on the Factorial tab. If you completed the One
More information610 R12 Prof Colleen F. Moore Analysis of variance for Unbalanced Between Groups designs in R For Psychology 610 University of Wisconsin--Madison
610 R12 Prof Colleen F. Moore Analysis of variance for Unbalanced Between Groups designs in R For Psychology 610 University of Wisconsin--Madison R is very touchy about unbalanced designs, partly because
More informationData Management - 50%
Exam 1: SAS Big Data Preparation, Statistics, and Visual Exploration Data Management - 50% Navigate within the Data Management Studio Interface Register a new QKB Create and connect to a repository Define
More informationAn introduction to SPSS
An introduction to SPSS To open the SPSS software using U of Iowa Virtual Desktop... Go to https://virtualdesktop.uiowa.edu and choose SPSS 24. Contents NOTE: Save data files in a drive that is accessible
More informationProblem set for Week 7 Linear models: Linear regression, multiple linear regression, ANOVA, ANCOVA
ECL 290 Statistical Models in Ecology using R Problem set for Week 7 Linear models: Linear regression, multiple linear regression, ANOVA, ANCOVA Datasets in this problem set adapted from those provided
More informationUsing the DATAMINE Program
6 Using the DATAMINE Program 304 Using the DATAMINE Program This chapter serves as a user s manual for the DATAMINE program, which demonstrates the algorithms presented in this book. Each menu selection
More informationSalary 9 mo : 9 month salary for faculty member for 2004
22s:52 Applied Linear Regression DeCook Fall 2008 Lab 3 Friday October 3. The data Set In 2004, a study was done to examine if gender, after controlling for other variables, was a significant predictor
More informationRegression on the trees data with R
> trees Girth Height Volume 1 8.3 70 10.3 2 8.6 65 10.3 3 8.8 63 10.2 4 10.5 72 16.4 5 10.7 81 18.8 6 10.8 83 19.7 7 11.0 66 15.6 8 11.0 75 18.2 9 11.1 80 22.6 10 11.2 75 19.9 11 11.3 79 24.2 12 11.4 76
More information[POLS 8500] Stochastic Gradient Descent, Linear Model Selection and Regularization
[POLS 8500] Stochastic Gradient Descent, Linear Model Selection and Regularization L. Jason Anastasopoulos ljanastas@uga.edu February 2, 2017 Gradient descent Let s begin with our simple problem of estimating
More informationMODEL DEVELOPMENT: VARIABLE SELECTION
7 MODEL DEVELOPMENT: VARIABLE SELECTION The discussion of least squares regression thus far has presumed that the model was known with respect to which variables were to be included and the form these
More informationUsing Excel for Graphical Analysis of Data
EXERCISE Using Excel for Graphical Analysis of Data Introduction In several upcoming experiments, a primary goal will be to determine the mathematical relationship between two variable physical parameters.
More informationSTAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem.
STAT 2607 REVIEW PROBLEMS 1 REMINDER: On the final exam 1. Word problems must be answered in words of the problem. 2. "Test" means that you must carry out a formal hypothesis testing procedure with H0,
More informationBivariate Linear Regression James M. Murray, Ph.D. University of Wisconsin - La Crosse Updated: October 04, 2017
Bivariate Linear Regression James M. Murray, Ph.D. University of Wisconsin - La Crosse Updated: October 4, 217 PDF file location: http://www.murraylax.org/rtutorials/regression_intro.pdf HTML file location:
More informationThis electronic supporting information S4 contains the main steps for fitting a response surface model using Minitab 17 (Minitab Inc.).
This electronic supporting information S4 contains the main steps for fitting a response surface model using Minitab 17 (Minitab Inc.). This process was used in Predicting instrumental mass fractionation
More informationFor our example, we will look at the following factors and factor levels.
In order to review the calculations that are used to generate the Analysis of Variance, we will use the statapult example. By adjusting various settings on the statapult, you are able to throw the ball
More informationFurther Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables
Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables
More informationMinitab 17 commands Prepared by Jeffrey S. Simonoff
Minitab 17 commands Prepared by Jeffrey S. Simonoff Data entry and manipulation To enter data by hand, click on the Worksheet window, and enter the values in as you would in any spreadsheet. To then save
More informationNonparametric Classification Methods
Nonparametric Classification Methods We now examine some modern, computationally intensive methods for regression and classification. Recall that the LDA approach constructs a line (or plane or hyperplane)
More informationTwo-Stage Least Squares
Chapter 316 Two-Stage Least Squares Introduction This procedure calculates the two-stage least squares (2SLS) estimate. This method is used fit models that include instrumental variables. 2SLS includes
More informationMiddle School Math Course 3
Middle School Math Course 3 Correlation of the ALEKS course Middle School Math Course 3 to the Texas Essential Knowledge and Skills (TEKS) for Mathematics Grade 8 (2012) (1) Mathematical process standards.
More informationStatistical Bioinformatics (Biomedical Big Data) Notes 2: Installing and Using R
Statistical Bioinformatics (Biomedical Big Data) Notes 2: Installing and Using R In this course we will be using R (for Windows) for most of our work. These notes are to help students install R and then
More informationLecture 24: Generalized Additive Models Stat 704: Data Analysis I, Fall 2010
Lecture 24: Generalized Additive Models Stat 704: Data Analysis I, Fall 2010 Tim Hanson, Ph.D. University of South Carolina T. Hanson (USC) Stat 704: Data Analysis I, Fall 2010 1 / 26 Additive predictors
More information7. Collinearity and Model Selection
Sociology 740 John Fox Lecture Notes 7. Collinearity and Model Selection Copyright 2014 by John Fox Collinearity and Model Selection 1 1. Introduction I When there is a perfect linear relationship among
More information3 Feature Selection & Feature Extraction
3 Feature Selection & Feature Extraction Overview: 3.1 Introduction 3.2 Feature Extraction 3.3 Feature Selection 3.3.1 Max-Dependency, Max-Relevance, Min-Redundancy 3.3.2 Relevance Filter 3.3.3 Redundancy
More informationData Presentation. Figure 1. Hand drawn data sheet
Data Presentation The purpose of putting results of experiments into graphs, charts and tables is two-fold. First, it is a visual way to look at the data and see what happened and make interpretations.
More informationPredicting Porosity through Fuzzy Logic from Well Log Data
International Journal of Petroleum and Geoscience Engineering (IJPGE) 2 (2): 120- ISSN 2289-4713 Academic Research Online Publisher Research paper Predicting Porosity through Fuzzy Logic from Well Log
More informationTHE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann
Forrest W. Young & Carla M. Bann THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA CB 3270 DAVIE HALL, CHAPEL HILL N.C., USA 27599-3270 VISUAL STATISTICS PROJECT WWW.VISUALSTATS.ORG
More informationUsing Machine Learning to Optimize Storage Systems
Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation
More informationPredicting Web Service Levels During VM Live Migrations
Predicting Web Service Levels During VM Live Migrations 5th International DMTF Academic Alliance Workshop on Systems and Virtualization Management: Standards and the Cloud Helmut Hlavacs, Thomas Treutner
More informationRecall the expression for the minimum significant difference (w) used in the Tukey fixed-range method for means separation:
Topic 11. Unbalanced Designs [ST&D section 9.6, page 219; chapter 18] 11.1 Definition of missing data Accidents often result in loss of data. Crops are destroyed in some plots, plants and animals die,
More informationRegression Models Course Project Vincent MARIN 28 juillet 2016
Regression Models Course Project Vincent MARIN 28 juillet 2016 Executive Summary "Is an automatic or manual transmission better for MPG" "Quantify the MPG difference between automatic and manual transmissions"
More informationGeneralized Additive Models
Generalized Additive Models Statistics 135 Autumn 2005 Copyright c 2005 by Mark E. Irwin Generalized Additive Models GAMs are one approach to non-parametric regression in the multiple predictor setting.
More informationSPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL
SPSS QM II SHORT INSTRUCTIONS This presentation contains only relatively short instructions on how to perform some statistical analyses in SPSS. Details around a certain function/analysis method not covered
More informationUnivariate Extreme Value Analysis. 1 Block Maxima. Practice problems using the extremes ( 2.0 5) package. 1. Pearson Type III distribution
Univariate Extreme Value Analysis Practice problems using the extremes ( 2.0 5) package. 1 Block Maxima 1. Pearson Type III distribution (a) Simulate 100 maxima from samples of size 1000 from the gamma
More informationBasics of Multivariate Modelling and Data Analysis
Basics of Multivariate Modelling and Data Analysis Kurt-Erik Häggblom 9. Linear regression with latent variables 9.1 Principal component regression (PCR) 9.2 Partial least-squares regression (PLS) [ mostly
More informationStat 8053, Fall 2013: Additive Models
Stat 853, Fall 213: Additive Models We will only use the package mgcv for fitting additive and later generalized additive models. The best reference is S. N. Wood (26), Generalized Additive Models, An
More informationBayesFactor Examples
BayesFactor Examples Michael Friendly 04 Dec 2015 The BayesFactor package enables the computation of Bayes factors in standard designs, such as one- and two- sample designs, ANOVA designs, and regression.
More informationBuilding Better Parametric Cost Models
Building Better Parametric Cost Models Based on the PMI PMBOK Guide Fourth Edition 37 IPDI has been reviewed and approved as a provider of project management training by the Project Management Institute
More informationApplied Regression Modeling: A Business Approach
i Applied Regression Modeling: A Business Approach Computer software help: SPSS SPSS (originally Statistical Package for the Social Sciences ) is a commercial statistical software package with an easy-to-use
More informationMachine Learning. Topic 4: Linear Regression Models
Machine Learning Topic 4: Linear Regression Models (contains ideas and a few images from wikipedia and books by Alpaydin, Duda/Hart/ Stork, and Bishop. Updated Fall 205) Regression Learning Task There
More informationINF 4300 Classification III Anne Solberg The agenda today:
INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15
More informationMachine Learning / Jan 27, 2010
Revisiting Logistic Regression & Naïve Bayes Aarti Singh Machine Learning 10-701/15-781 Jan 27, 2010 Generative and Discriminative Classifiers Training classifiers involves learning a mapping f: X -> Y,
More informationPackage optimus. March 24, 2017
Type Package Package optimus March 24, 2017 Title Model Based Diagnostics for Multivariate Cluster Analysis Version 0.1.0 Date 2017-03-24 Maintainer Mitchell Lyons Description
More informationStat 5303 (Oehlert): Response Surfaces 1
Stat 5303 (Oehlert): Response Surfaces 1 > data
More informationElemental Set Methods. David Banks Duke University
Elemental Set Methods David Banks Duke University 1 1. Introduction Data mining deals with complex, high-dimensional data. This means that datasets often combine different kinds of structure. For example:
More informationLecture 7: Linear Regression (continued)
Lecture 7: Linear Regression (continued) Reading: Chapter 3 STATS 2: Data mining and analysis Jonathan Taylor, 10/8 Slide credits: Sergio Bacallado 1 / 14 Potential issues in linear regression 1. Interactions
More informationGeneral Factorial Models
In Chapter 8 in Oehlert STAT:5201 Week 9 - Lecture 2 1 / 34 It is possible to have many factors in a factorial experiment. In DDD we saw an example of a 3-factor study with ball size, height, and surface
More information1 StatLearn Practical exercise 5
1 StatLearn Practical exercise 5 Exercise 1.1. Download the LA ozone data set from the book homepage. We will be regressing the cube root of the ozone concentration on the other variables. Divide the data
More informationIQR = number. summary: largest. = 2. Upper half: Q3 =
Step by step box plot Height in centimeters of players on the 003 Women s Worldd Cup soccer team. 157 1611 163 163 164 165 165 165 168 168 168 170 170 170 171 173 173 175 180 180 Determine the 5 number
More informationStat 5303 (Oehlert): Unbalanced Factorial Examples 1
Stat 5303 (Oehlert): Unbalanced Factorial Examples 1 > section
More informationThe RcmdrPlugin.HH Package
Type Package The RcmdrPlugin.HH Package Title Rcmdr support for the HH package Version 1.1-4 Date 2007-07-24 July 31, 2007 Author Richard M. Heiberger, with contributions from Burt Holland. Maintainer
More informationGene Clustering & Classification
BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering
More informationUsing Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers
Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers Why enhance GLM? Shortcomings of the linear modelling approach. GLM being
More information