BIOL 458 BIOMETRY Lab 10 - Multiple Regression


Many problems in science involve the analysis of multi-variable data sets. For data sets in which there is a single continuous dependent variable but several continuous and potentially discrete independent variables, multiple regression is used. Multiple regression is a method of fitting linear models of the form:

Y_i = β_0 + β_1 X_1i + ... + β_p X_pi + ε_i,   where ε_i ~ iid N(0, σ²).

Y_i is the response of subject i.
β_0 is the y-intercept of the model, one of the model's coefficients.
β_1, ..., β_p are the model coefficients or weights that relate Y_i to the experimental treatments or explanatory variables.
X_1i, ..., X_pi are the values of the measured independent variables, or codes that are used to classify a subject as to group membership. For example, 1 for receiving medicine and 0 for placebo might be the codes we use in a two-group design.
ε_i are random errors or residuals that arise from the deviation of the observed values of the responses from the model's predictions.

The values of the regression coefficients are determined by minimizing the sum of squares of the residuals (errors), i.e., minimizing

Σ_{i=1}^{n} ε_i² = Σ_{i=1}^{n} (Y_i − Ŷ_i)² = Σ_{i=1}^{n} (Y_i − [β_0 + β_1 X_1i + β_2 X_2i + ... + β_p X_pi])²,

where Ŷ_i is the predicted value from the model associated with observation Y_i.

Hypothesis tests about the regression coefficients, or about the contribution of particular terms or groups of terms to the fit of the model, are then performed to determine the utility of the model. In many studies, regression programs are used to generate a series of models; the "best" of these models is then chosen on the basis of a variety of criteria. These models can be generated using algorithms in which variables are added to the model in a stepwise fashion, or by examining a large number, or even all, of the possible regression models given the data.

Stepwise Regression

Stepwise regression builds a model by adding or removing variables one at a time. Stepwise methods have lost much of their popularity because it has been shown that they are not guaranteed to select the best model.
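To make the notation concrete, here is a minimal sketch of fitting such a model in R with simulated data; the variable names and coefficient values are hypothetical and are not part of the lab data used below.

# simulate data from a known two-predictor model and fit it by least squares
set.seed(1)
x1=rnorm(30)
x2=rnorm(30)
y=2 + 1.5*x1 - 0.8*x2 + rnorm(30)   # true coefficients plus random error
fit=lm(y ~ x1 + x2)    # lm() chooses the coefficients that minimize the residual sum of squares
summary(fit)           # partial t-tests for each coefficient, R-squared
anova(fit)             # sequential (Type I) sums of squares and F tests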

In a forward stepwise regression, new independent variables are added to the model if they meet a set significance criterion for inclusion (often p < 0.05 for the partial F-test for the inclusion of the term in the model). The variable with the lowest p-value is added to the model at each step, and the algorithm stops when no new variable meets the significance criterion.

In a backwards stepwise regression, all independent variables are initially entered into the model. They are then sequentially removed if they do not meet a set significance criterion for retention (often p > 0.1 or p > 0.05 for the partial F-test for removal of a term). The variable with the highest p-value is removed from the model at each step until no variable meeting the removal criterion remains.

Stepwise regression combines both these techniques, with variables added or removed on each step of the process. A variable is entered if it meets the p-value to enter. After each variable is added to the equation, all other variables in the equation are tested against the p-value to remove a term, and if necessary a variable is removed from the model.

SPSS, SAS, MINITAB, SYSTAT, BMDP and other statistical packages include these routines. The computer output generated by these routines consists of a series of models for estimating the value of Y and the goodness-of-fit statistics for each model. Each model estimates the value of Y as a linear combination of the values of the predictor variables included in that model.

In R, no stepwise regression module using a partial F-test is available in the base installation or as a user-contributed package. R does contain a stepwise function, stepAIC (in the MASS package), that uses Akaike's Information Criterion as the basis for stepwise model selection. We will avoid this information-theoretic approach to model selection for now. However, a former student (Joe Hill Warren) wrote a function called StepF which we can use to examine the behavior of these algorithms. Note that StepF does not perform the combined stepwise algorithm, only the forward and backward selection algorithms. To read more about the StepF function, click the link.

To use the StepF function, download the file StepF.R and open it in RStudio. From the File Menu, open the file, then click on the Code Menu and click Source. Alternatively, from the Code Menu, click Source File and choose the downloaded file StepF.R. This will load the function StepF, and you can then use its features to perform stepwise regression.

Later in this lab we will address issues in building and assessing regression models. We could at this point use a number of techniques to examine our data before beginning the process of model selection, or we could use those same techniques after developing a set of candidate models to assess. In this demonstration I will take the latter approach, postponing a detailed assessment of whether a model meets the assumptions of regression until later.

To demonstrate multiple regression, we will examine data on the species richness of plants in the Galapagos Islands.

The data file Galapagos-plants.txt contains species richness and the number of endemic species for plants on 29 islands, along with data about the physical characteristics of the islands (island name, island area, maximum elevation, distance to nearest island, area of nearest island, distance to Santa Cruz island, and the number of botanical collecting trips to each island).

# read in data file on Galapagos plants
dat=read.table("k:/biometry/biometry-fall-2015/lab10/galapagos-plants.txt", header=TRUE)
head(dat)

  Isla Spec Area Elev DisN DisS AreA Coll Endm
1 Balt
2 Bart
3 Cald
4 Cham
5 Coam
6 Daph

It is traditional to examine the relationship between the log(number of species) and log(area), so I will create these variables and a new data.frame to hold them, along with the other original variables, for the analysis.

# create variables to be used in the regression and put them in a new data.frame
logarea=log(dat$Area)
logspec=log(dat$Spec)
elev=dat$Elev
diss=dat$DisS
disn=dat$DisN
coll=dat$Coll
area=dat$AreA   # area of the nearest island
dd=data.frame(logspec,logarea,elev,diss,disn,area,coll)
head(dd)

  logspec logarea elev diss disn area coll
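Before any formal selection, a scatterplot matrix of dd gives a quick visual check for linear relationships and obvious outliers. The one-liner below, using base R's pairs(), is a suggested sketch rather than part of the original analysis.

# scatterplot matrix of the response and all candidate predictors
pairs(dd, pch=19, cex=0.7)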

As an initial diagnostic step, I obtain the correlation matrix of the variables. I set options(digits=4) to control how many digits are printed so the matrix will not wrap around.

# obtain correlation matrix of variables
options(digits=4)
cor(dd)

        logspec logarea elev diss disn area coll
logspec
logarea
elev
diss
disn
area
coll

Note that we can already see that logspec is strongly associated with logarea and coll (the number of collecting trips), and less strongly associated with elev, so we might expect these to be the variables that are entered into the regression models.

Now we will source the StepF.R code file. Note that in R Markdown you need to give the full path and name of the file to be sourced.

source("k:/biometry/biometry-fall-2015/lab10/stepf.r")

Now let's use the forward stepwise approach to select a model. Note that the output documents a multi-step process. At each step, partial F-tests are reported that test whether the reduction in the residual sum of squares from adding each variable to the model individually would be statistically significant. The variable that causes the greatest reduction in the residual sum of squares is added to the model. Note on iteration 1 that with only the grand mean in the model the RSS (residual sum of squares) is 70.6, but that adding logarea to the model reduces the RSS the most. Logarea also has the smallest p-value, so logarea is added to the model first. On iteration 2, after logarea has been added, we see that only the addition of coll would result in a statistically significant reduction in the RSS at α = 0.05 (p = 0.017). Therefore, coll is added to the model. However, on iteration 3 none of the remaining variables has p < 0.05, so the algorithm stops after adding logarea and coll to the model.

# perform forward stepwise regression
mod.7=StepF(datatable=dd, response="logspec", level=0.05, direction="forward")

==================== Iteration #1 ====================
Single term additions

Model:
logspec ~ 1
        Df Sum of Sq  RSS  AIC F value Pr(>F)
<none>              70.6
logarea                                 e-11 ***
elev                                    e-05 ***

diss
disn
area
coll                                    e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable with lowest p-value: logarea  e-11
Updating model formula: . ~ . +logarea

==================== Iteration #2 ====================
Single term additions

Model:
logspec ~ logarea
        Df Sum of Sq RSS AIC F value Pr(>F)
<none>
elev
diss
disn
area
coll                                    *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable with lowest p-value: coll
Updating model formula: . ~ . +coll

==================== Iteration #3 ====================
Single term additions

Model:
logspec ~ logarea + coll
        Df Sum of Sq RSS AIC F value Pr(>F)
<none>
elev
diss
disn
area

========== No further variables significant at 0.05 ==========
Final Model:

We can use the backwards stepwise approach as well. This algorithm will not always converge on the same model as the forward approach, but in this instance it does. Again, in the backwards approach all variables are initially put into the model, and those that cause the smallest increase in the RSS when deleted are sequentially removed from the model. It takes a couple more iterations than the forward approach, but it converges on the same best model.
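Incidentally, single-term partial F-tests of this kind can also be obtained directly with base R's add1() and drop1() functions, so you can verify any StepF iteration by hand. A minimal sketch using the variables in dd:

# partial F tests for adding each remaining candidate to the current model
base=lm(logspec~logarea, data=dd)
add1(base, scope=~logarea+elev+diss+disn+area+coll, test="F")

# partial F tests for deleting each term from the full model
full=lm(logspec~logarea+elev+diss+disn+area+coll, data=dd)
drop1(full, test="F")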

# perform backwards stepwise regression
mod.8=StepF(datatable=dd, response="logspec", level=0.05, direction="backward")

==================== Iteration #1 ====================
Single term deletions

Model:
logspec ~ logarea + elev + diss + disn + area + coll
        Df Sum of Sq RSS AIC F value Pr(>F)
<none>
logarea                                 ***
elev
diss
disn
area
coll
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable with highest p-value: disn
formula: . ~ . -disn

==================== Iteration #2 ====================
Single term deletions

Model:
logspec ~ logarea + elev + diss + area + coll
        Df Sum of Sq RSS AIC F value Pr(>F)
<none>
logarea                                 ***
elev
diss
area
coll
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable with highest p-value: area
formula: . ~ . -area

==================== Iteration #3 ====================
Single term deletions

Model:
logspec ~ logarea + elev + diss + coll
        Df Sum of Sq RSS AIC F value Pr(>F)
<none>
logarea                                 ***
elev
diss

coll                                    *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable with highest p-value: diss
formula: . ~ . -diss

==================== Iteration #4 ====================
Single term deletions

Model:
logspec ~ logarea + elev + coll
        Df Sum of Sq RSS AIC F value Pr(>F)
<none>
logarea                                 ***
elev
coll                                    *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable with highest p-value: elev
formula: . ~ . -elev

==================== Iteration #5 ====================
Single term deletions

Model:
logspec ~ logarea + coll
        Df Sum of Sq RSS AIC F value Pr(>F)
<none>
logarea                                 ***
coll                                    *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

========== All variables significant at 0.05 ==========
Final Model:

There are other ways in which the StepF function can be used. One approach is to build a model that contains variables you wish to force into the model, and then check whether any other variables will be added after those forced in. For example, what if we wanted to force coll into the model, and wanted to know whether other variables explain residual variation in logspec after accounting for coll? We could think of coll as a nuisance variable that measures the differences in sampling effort across the islands. Perhaps we want to know which other variables are still useful in explaining variation among the islands in plant species richness after accounting for that variable sampling effort.

To do this, we first build a linear model with coll as the only predictor and save it in a model object. Then we call StepF, specifying the model object name as our initial model, and then the scope argument with coll and any other variables we wish to assess. Note that this process indicates that, even after accounting for the variability in sampling effort among islands, logarea still explains residual variation in logspec.

# to determine if any variables would be added to a model with only coll as the predictor variable
# first build a linear model with coll as the only predictor variable
lm1=lm(formula=logspec~coll)
# then use StepF, specifying the model with coll and a "scope" argument listing coll and the other candidate variables
StepF(model=lm1, scope=formula(~ coll+logarea+elev+disn+diss+area), level=0.05, direction="forward")

==================== Iteration #1 ====================
Single term additions

Model:
logspec ~ coll
        Df Sum of Sq RSS AIC F value Pr(>F)
<none>
logarea                                 ***
elev
disn
diss
area
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable with lowest p-value: logarea
Updating model formula: . ~ . +logarea

==================== Iteration #2 ====================
Single term additions

Model:
logspec ~ coll + logarea
        Df Sum of Sq RSS AIC F value Pr(>F)
<none>
elev
disn
diss
area

========== No further variables significant at 0.05 ==========
Final Model:

Call:
lm(formula = logspec ~ coll + logarea)

Coefficients:
(Intercept)         coll      logarea

The StepF pdf file explains other ways in which you might use the StepF function. Note that the StepF function does not report the regression coefficients, nor does it compute the residuals and other fit statistics for the final model. After using StepF to select models, one must then use the lm function to fit the selected models and evaluate their adequacy.

Best Subsets Regression

An alternative approach to model selection is to compute all possible regressions given a set of candidate explanatory variables, or at least the best subset of models for various levels of model complexity. By model complexity, I mean the number of predictor variables included in the model. To calculate the number of possible models with 6 predictor variables, we need to compute the number of permutations of 6 variables taken 1, 2, 3, 4, 5, or 6 at a time and add them up. A permutation is like a combination except that we consider the case AB different from the case BA. The number of permutations of n things taken k at a time is:

nPk = n! / (n − k)!

The calculation of the number of permutations is similar to the calculation of the number of combinations of n things taken k at a time, except that it lacks the factor of k! in the denominator. R code to calculate the number of possible models is given below. Do you want to generate, fit, and assess all possible 1237 models? Remember that in regression we use Type I sums of squares, so in models with different orderings of the variables the individual variables may explain different amounts of variation in the response variable.

# calculate the number of permutations of 6 variables for models with 1 to 6 predictor variables
perm=rep(0,5)
n=6
for (k in 1:n-1){   # note: 1:n-1 evaluates to (1:n)-1, i.e. 0:5, and the k=0 assignment to perm[0] is silently ignored
  perm[k]=factorial(n)/factorial(n-k)
}

perm

[1]   6  30 120 360 720

totperm=sum(perm)+1
totperm

[1] 1237

Rather than tackling the daunting task of examining 1237 models, we will use the regsubsets function from the package leaps to select the best k models with 1 predictor, 2 predictors, etc.

# load package leaps
library(leaps)

In leaps we will use the regsubsets function and generate the 3 best models for each level of complexity. You could choose to do more, but the graphical display of the results becomes problematic with large subset sizes. Running the regsubsets function requires the model formula and the specification of the subset size. Applying summary to the regsubsets object produces a tabulation of the models, ranked in order of best fit within each size. An * indicates that the variable is included in the model.

# to get the k best regression models of each size
k=3
mm=regsubsets(logspec~logarea+elev+disn+diss+area+coll, data=dd, nbest=k)
summary(mm)

Subset selection object
Call: regsubsets.formula(logspec ~ logarea + elev + disn + diss + area + coll, data = dd, nbest = k)
6 Variables (and intercept)
        Forced in Forced out
logarea     FALSE      FALSE
elev        FALSE      FALSE
disn        FALSE      FALSE
diss        FALSE      FALSE
area        FALSE      FALSE
coll        FALSE      FALSE
3 subsets of each size up to 6
Selection Algorithm: exhaustive
         logarea elev disn diss area coll
1  ( 1 ) "*"     " "  " "  " "  " "  " "
1  ( 2 ) " "     " "  " "  " "  " "  "*"
1  ( 3 ) " "     "*"  " "  " "  " "  " "
2  ( 1 ) "*"     " "  " "  " "  " "  "*"
2  ( 2 ) "*"     " "  " "  "*"  " "  " "
2  ( 3 ) "*"     "*"  " "  " "  " "  " "
3  ( 1 ) "*"     "*"  " "  " "  " "  "*"
3  ( 2 ) "*"     " "  " "  "*"  " "  "*"

3  ( 3 ) "*"     " "  "*"  " "  " "  "*"
4  ( 1 ) "*"     "*"  " "  "*"  " "  "*"
4  ( 2 ) "*"     "*"  "*"  " "  " "  "*"
4  ( 3 ) "*"     "*"  " "  " "  "*"  "*"
5  ( 1 ) "*"     "*"  " "  "*"  "*"  "*"
5  ( 2 ) "*"     "*"  "*"  "*"  " "  "*"
5  ( 3 ) "*"     "*"  "*"  " "  "*"  "*"
6  ( 1 ) "*"     "*"  "*"  "*"  "*"  "*"

Although not printed as part of the summary, the summary of the regsubsets object contains much more information, including the R² values for each model. Below I demonstrate one way to extract those R² values from the summary. You can learn more about the information in the summary by using str() on the summary.

# to get rsq values from the regsubsets object
nummods=k*(n-1)+1   # 3 best models of each of sizes 1-5, plus the single model of size 6
m=rep(0,nummods)
a=summary(mm)
for (i in 1:16){
  m[i]=summary(mm)[[2]][[i]]
}
m

[1]
[11]

There are other options to graphically display the results from regsubsets. For example, one kind of plot shows the model R² on the y-axis and indicates by shading on the x-axis which variables are in each model. White indicates that the variable is not in the model, while shading indicates that it is. Variables that are mostly white contribute to few models, while variables that are mostly black contribute to many models. In our example, as you might expect, logarea, coll, and elev are mostly black while the other variables are mostly white.

# to plot a summary of which variables are in each model ranked by r2
plot(mm, scale="r2")
plot(mm, scale="adjr2")
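If the double-bracket indexing above feels opaque, the same statistics can be pulled out of the summary by name; assuming the usual component names of a leaps summary (rsq, adjr2, cp, bic), for example:

# extract fit statistics by name from the regsubsets summary
ss=summary(mm)
ss$rsq     # R-squared for each of the 16 models
ss$adjr2   # adjusted R-squared for each model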

A similar plot can be generated for other goodness-of-fit statistics, such as the adjusted R². Remember that

R² = 1 − (SS_error / SS_total),

and the adjusted R², which penalizes the model for its complexity, is

adjusted R² = 1 − (SS_error / df_error) / (SS_total / df_total).

Finally, another graphical display of the results is available in the car package. The function subsets in car will plot the results of a call to regsubsets from leaps.

# plot a summary of results
library(car)
subsets(mm, statistic="rsq", abbrev=1, legend=TRUE, cex=0.7)

Abbreviations used in the subsets plot:

logarea  l
elev     e
disn     dsn
diss     dss
area     a
coll     c

This summary is similar to the previous one, but it labels each model with its included variables. For models with many variables, or for larger subset sizes, the labels overlap, making the plot unreadable. You can also use the regsubsets function with the argument method specified as "forward", "backward", or "seqrep" (sequential replacement) rather than the default, which is "exhaustive".
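For example, a sketch of the same search run with the forward algorithm instead of the exhaustive default:

# best-subsets search with the forward algorithm instead of the exhaustive default
mm.fwd=regsubsets(logspec~logarea+elev+disn+diss+area+coll, data=dd, nbest=k, method="forward")
summary(mm.fwd)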

At this point, after using one of the stepwise algorithms or an exhaustive search, one has a set of candidate models to examine in more detail. Assessing model fit involves all the same procedures used in bivariate regression, since the same assumptions apply. The dependent variable should be normally distributed, scatter plots should indicate linear relationships between the dependent and independent variables, and residual plots should show homoscedasticity (equality of variances in the residuals throughout the regression line). In addition to these issues, one also needs to check for outliers or overly influential data points, and for high inter-correlations between pairs of independent variables (called multicollinearity). If two independent variables are highly correlated (r > 0.9), then inclusion of both variables in the model causes problems in parameter estimation. You can pre-screen your independent variables by obtaining a correlation matrix before performing the regression and allowing only one member of a pair of highly correlated variables to serve as a candidate variable for model building at a time.

Remember that the tools outlined in Lab 9 for assessing model fit are also applicable to multiple regression models. The norm function from QuantPsyc, plot(modelobj), and plot(data.frame) can provide much useful diagnostic information. Other diagnostic procedures are available in the car package.
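As a concrete sketch, the code below applies these diagnostics to the model selected earlier. The vif() check from car is one common multicollinearity diagnostic; it is a suggestion here, not a tool prescribed by Lab 9.

# fit the selected model and examine basic diagnostics
fit=lm(logspec~logarea+coll, data=dd)
par(mfrow=c(2,2))
plot(fit)     # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage (with Cook's D contours)
par(mfrow=c(1,1))
library(car)
vif(fit)      # variance inflation factors; large values signal multicollinearity among predictors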

Advice on Building and Assessing Regression Models

Building

1. Choose the set of candidate predictor variables to potentially be included in the model.
2. Examine the distribution of the response variable to determine if it meets the assumption of normality. Transform if necessary.
3. Examine scatter plots of the relationships between the response variable y and the predictor or independent variables x to determine if the relationships are linear. Potentially transform x, y, or both to achieve linearity.
4. Examine the correlations between the predictor variables. High correlations (values of r >> 0.9) might suggest linear dependencies among the predictor variables, which can make the estimates of the regression coefficients unstable and inflate the variances of the estimates. Consider deleting members of these pairs of variables, since they are essentially redundant.
5. Choose the algorithmic approach to fitting a model: in blocks (chunkwise), by forcing entry of variables into the model in a particular sequence, by backwards elimination, by forward addition of variables to the model, etc.
6. Decide on the criteria you will use for retaining variables in the model (significant partial t or F statistics at a specified α). Build the model.

Assessing

1. Obtain a plot of the standardized residuals against the standardized predicted values. Examine this plot for heterogeneity in the distribution of the residuals. A desirable pattern would have both negative and positive residuals of equal magnitude throughout the length of the predicted regression. The envelope of residuals around the regression line should appear rectangular and be centered on the regression line.
2. Examine the correlations among pairs of predictor variables to check for multicollinearity. If r >> 0.9 for any pair, then try alternative models that eliminate one member of the pair.
3. Examine the diagnostic plots to make sure that there are no observations with high leverage or high influence. Influential data points will have large Cook's D values (values greater than 1 are a common benchmark).
4. Compare alternative models to determine if two or more models fit the data equally well.
5. The best model is the one with the best residual pattern, that is not beset with collinearity or influential data points, and that has the highest R². Note that R² is the last criterion to use in choosing a model, not the first.

Lab 10 Assignment

The exercise to be performed in this lab is to use the StepF and/or regsubsets functions in R to generate a set of candidate models, and to select the single "best" model, or the set of best models if two or more models seem to be equally good. You must discuss in detail the reasons for choosing the models that you have selected, including showing plots of residuals, information about the distribution of the response variable, examination of outliers, and other metrics that demonstrate goodness-of-fit.

DESCRIPTION OF DATA

The data are stored in the file multr2.csv. The variables are as follows (they are in the same order in the data set):

VARIABLE (UNITS)
Mean elevation (feet)
Mean temperature (degrees F)
Mean annual precipitation (inches)
Vegetative density (percent cover)
Drainage area (square miles)
Latitude (degrees)
Longitude (degrees)
Elevation at temperature station (feet)
1-hour, 25-year precipitation intensity (inches/hour)
Annual water yield (inches) (dependent variable)

The data consist of values of these variables measured on all gauged watersheds in the western region of the USA. The dependent variable is annual water yield. Develop and evaluate a model for estimating water yield from un-gauged basins in the western USA.
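As a possible starting skeleton for the assignment (the actual column names in multr2.csv are not listed in this handout, so "yield" below is a hypothetical stand-in for whatever the water-yield column is actually called):

# a possible starting skeleton; "yield" is a hypothetical name for the water-yield column
wdat=read.csv("multr2.csv", header=TRUE)
head(wdat)        # check the actual variable names first
hist(wdat$yield)  # distribution of the response
pairs(wdat)       # linearity and outlier screening (assumes all columns are numeric)
cor(wdat)         # check for highly correlated predictor pairs
library(leaps)
wmods=regsubsets(yield ~ ., data=wdat, nbest=3)
summary(wmods)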
