Model Selection and Inference


Merlise Clyde
January 29, 2017

Last Class

Model for brain weight as a function of body weight. In the model with both response and predictor log transformed, are dinosaurs outliers?
- Should you test each one individually or as a group? If as a group, how would you do this using lm?
- Do you think your final model is adequate? What else might you change?

Dummy variables

Create an indicator variable for each of the dinosaurs (uses the dplyr package and pipes %>% with mutate):

library(MASS)    # Animals data
library(dplyr)
Animals = Animals %>%
  mutate(name = row.names(Animals)) %>%
  mutate(Dino.T = (name == "Triceratops")) %>%
  mutate(Dino.D = (name == "Dipliodocus")) %>%
  mutate(Dino.B = (name == "Brachiosaurus")) %>%
  mutate(Dino = (name %in% c("Triceratops", "Brachiosaurus", "Dipliodocus")))

New Dataframe

##       body brain Dino.T Dino.D Dino.B  Dino            name
## 1     1.35   8.1  FALSE  FALSE  FALSE FALSE Mountain beaver
## 2   465.00 423.0  FALSE  FALSE  FALSE FALSE             Cow
## 3    36.33 119.5  FALSE  FALSE  FALSE FALSE       Grey wolf
## 4    27.66 115.0  FALSE  FALSE  FALSE FALSE            Goat
## 5     1.04   5.5  FALSE  FALSE  FALSE FALSE      Guinea pig
## 6 11700.00  50.0  FALSE   TRUE  FALSE  TRUE     Dipliodocus

Dinosaurs as Outliers

brain_out.lm = lm(log(brain) ~ log(body) + Dino.T + Dino.B + Dino.D, data=Animals)
kable(summary(brain_out.lm)$coef)

              Estimate  Std. Error    t value  Pr(>|t|)
(Intercept)  2.1504121   0.2006036  10.719710   0.0e+00
log(body)    0.7522607   0.0457186  16.454141   0.0e+00
Dino.TTRUE  -4.7839476   0.7913326  -6.045432   3.6e-06
Dino.BTRUE  -5.6661781   0.8327593  -6.804101   6.0e-07
Dino.DTRUE  -5.2850740   0.7949261  -6.648510   9.0e-07

Anova with Nested Models

Compare models through the extra sum of squares:
- Each additional predictor reduces the SSE (sum of squared errors), but adds to model complexity (more parameters, so fewer degrees of freedom for error).
- Is the addition worth it? Is the decrease significant? How big is big enough?

$$F = \frac{(SSE_R - SSE_F)/(df_R - df_F)}{SSE_F/df_F} = \frac{\Delta SSE/\Delta df}{\hat\sigma^2_F} \sim F(\Delta df,\; n-p)$$

(subscript $R$ denotes the reduced model, $F$ the full model)
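As a check on the formula, the F statistic can be computed by hand from two nested fits. A minimal sketch, assuming brain.lm = lm(log(brain) ~ log(body), data=Animals) from last class as the reduced model and brain_out.lm above as the full model:

# Extra-sum-of-squares F statistic computed from the residuals of two nested fits
sse_r <- sum(resid(brain.lm)^2)         # SSE of the reduced model
sse_f <- sum(resid(brain_out.lm)^2)     # SSE of the full model
df_r  <- brain.lm$df.residual
df_f  <- brain_out.lm$df.residual
F_stat <- ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
p_val  <- pf(F_stat, df_r - df_f, df_f, lower.tail = FALSE)
c(F = F_stat, p = p_val)                # should agree with anova() on the next slide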

Simultaneous Test: Anova in R

Model: $\log(\text{brain}) = \beta_0 + \log(\text{body})\beta_1 + \text{Dino.T}\,\beta_2 + \text{Dino.B}\,\beta_3 + \text{Dino.D}\,\beta_4 + \epsilon$

Hypothesis test: $\beta_2 = \beta_3 = \beta_4 = 0$

anova(brain_out.lm, brain.lm)

## Analysis of Variance Table
##
## Model 1: log(brain) ~ log(body) + Dino.T + Dino.B + Dino.D
## Model 2: log(brain) ~ log(body)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
## 1     23 12.117
## 2     26 60.988 -3   -48.871 30.921 3.031e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Other Possible Models

1. All animals follow the same regression.
2. The rate of change is the same, but dinosaurs have a different intercept (parallel regression).
3. Different slopes and intercepts for dinosaurs and other animals.
4. All dinosaurs have a different mean (outliers).

brain1.lm = lm(log(brain) ~ log(body), data=Animals)                             # 1
brain2.lm = lm(log(brain) ~ log(body) + Dino, data=Animals)                      # 2
brain3.lm = lm(log(brain) ~ log(body)*Dino, data=Animals)                        # 3
brain4.lm = lm(log(brain) ~ log(body) + Dino.T + Dino.B + Dino.D, data=Animals)  # 4

The interaction model 3 expands to: log(brain) ~ log(body) + Dino + log(body):Dino

Sequential Sum of Squares

anova(brain1.lm, brain2.lm, brain3.lm, brain4.lm)

## Analysis of Variance Table
##
## Model 1: log(brain) ~ log(body)
## Model 2: log(brain) ~ log(body) + Dino
## Model 3: log(brain) ~ log(body) * Dino
## Model 4: log(brain) ~ log(body) + Dino.T + Dino.B + Dino.D
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)
## 1     26 60.988
## 2     25 12.505  1    48.483 92.0248 1.665e-09 ***
## 3     24 12.212  1     0.294  0.5578    0.4627
## 4     23 12.117  1     0.094  0.1788    0.6763
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Selection

- Fail to reject Model 3 in favor of Model 4.
- Fail to reject Model 2 in favor of Model 3.
- Reject Model 1 in favor of Model 2.

Conclusion: same slope for log(body) for all animals, but a different intercept for dinosaurs.

              Estimate  Std. Error   t value  Pr(>|t|)
(Intercept)  2.1616415   0.1949191  11.08994         0
log(body)    0.7485526   0.0442849  16.90310         0
DinoTRUE    -5.2193510   0.5301552  -9.84495         0
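For reference, the table above is the coefficient summary of the selected parallel-lines model brain2.lm; a one-line reproduction (assuming knitr's kable, as used on earlier slides):

kable(summary(brain2.lm)$coef)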

Distribution of Coefficients

Joint distribution under normality:

$$\hat\beta \mid \sigma^2 \sim N\!\left(\beta,\; \sigma^2 (X^TX)^{-1}\right)$$

Distribution of SSE:

$$\frac{SSE}{\sigma^2} \sim \chi^2(n-p)$$

Marginal distribution:

$$\frac{\hat\beta_j - \beta_j}{SE(\hat\beta_j)} \sim t(n-p)$$
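These distributional results are what summary() uses behind the scenes. A sketch (not from the slides) that recovers the standard errors and t statistics for brain2.lm directly from the design matrix:

X      <- model.matrix(brain2.lm)                            # n x p design matrix
sigma2 <- sum(resid(brain2.lm)^2) / brain2.lm$df.residual    # sigma-hat^2 = SSE/(n - p)
se     <- sqrt(diag(sigma2 * solve(t(X) %*% X)))             # SE(beta-hat_j)
t_stat <- coef(brain2.lm) / se                               # t statistics for H0: beta_j = 0
cbind(se, t_stat)                                            # should match summary(brain2.lm)$coef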

Confidence Intervals

A $(1-\alpha)100\%$ confidence interval for $\beta_j$:

$$\hat\beta_j \pm t_{n-p,\,\alpha/2}\, SE(\hat\beta_j)$$

kable(confint(brain2.lm))

                 2.5 %     97.5 %
(Intercept)   1.760198  2.5630848
log(body)     0.657346  0.8397591
DinoTRUE     -6.311226 -4.1274760
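The same interval can be built directly from the t quantile; a sketch, assuming alpha = 0.05 and the brain2.lm fit above:

est   <- coef(summary(brain2.lm))[, "Estimate"]
se    <- coef(summary(brain2.lm))[, "Std. Error"]
tcrit <- qt(0.975, df = brain2.lm$df.residual)               # t_{n-p, alpha/2}
cbind(lower = est - tcrit * se, upper = est + tcrit * se)    # should reproduce confint(brain2.lm)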

Converting to Original Units

$$\text{brain} = e^{\hat\beta_0 + \log(\text{body})\hat\beta_1 + \text{Dino}\,\hat\beta_2} = e^{\hat\beta_0}\, e^{\log(\text{body})\hat\beta_1}\, e^{\text{Dino}\,\hat\beta_2} = e^{\hat\beta_0}\, \text{body}^{\hat\beta_1}\, e^{\text{Dino}\,\hat\beta_2}$$

A 10% increase in body weight multiplies the predicted brain weight by

$$\frac{e^{\hat\beta_0} (1.10\,\text{body})^{\hat\beta_1}\, e^{\text{Dino}\,\hat\beta_2}}{e^{\hat\beta_0}\, \text{body}^{\hat\beta_1}\, e^{\text{Dino}\,\hat\beta_2}} = 1.10^{\hat\beta_1} = 1.074,$$

or a 7.4% increase in brain weight.
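The 7.4% figure follows directly from the fitted slope; a one-line check, assuming brain2.lm:

b1 <- coef(brain2.lm)["log(body)"]
(1.10^b1 - 1) * 100    # about 7.4: percent increase in brain weight for a 10% increase in body weight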

95% Confidence Interval

To obtain a 95% confidence interval for this percent change, apply the same transformation to the confidence interval for $\hat\beta_1$: $(1.10^{CI} - 1) \times 100$.

          2.5 %    97.5 %
body   6.465603  8.332779

A 10% increase in body weight is associated with an estimated 6.5% to 8.3% increase in brain weight.
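A sketch of this transformation applied to the interval returned by confint(brain2.lm):

(1.10^confint(brain2.lm)["log(body)", ] - 1) * 100    # roughly 6.5 to 8.3 (percent)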

Interpretation of Intercept

$$\log(\text{brain}) = \hat\beta_0 + \log(\text{body})\hat\beta_1 + \text{Dino}\,\hat\beta_2$$

For a non-dinosaur with $\log(\text{body}) = 0$ (body weight = 1 kilogram), the expected brain weight is 2.16 in log(grams)? Exponentiate: the predicted brain weight for a non-dinosaur with a 1 kg body weight is $e^{\hat\beta_0} = 8.69$ grams.
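A quick check of the back-transformed intercept, assuming brain2.lm:

exp(coef(brain2.lm)["(Intercept)"])    # about 8.69 grams for a 1 kg non-dinosaur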

Plot of Fitted Values

[Figure: log(brain) versus log(body) with fitted values from the model overlaid.]
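One way to reproduce a figure like this; a sketch assuming the ggplot2 package and the Dino indicator created earlier:

library(ggplot2)
Animals$fitted <- fitted(brain2.lm)                      # fitted values from the parallel-lines model
ggplot(Animals, aes(x = log(body), y = log(brain), colour = Dino)) +
  geom_point() +                                         # observed animals
  geom_line(aes(y = fitted))                             # two parallel fitted lines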

Predictions for a 259 gram cockatoo

[Figure: fitted model on the log scale, used to predict brain weight for the cockatoo.]

Predictions in original units

95% confidence interval for $f(x)$:

newdata = data.frame(body=.0259, Dino=FALSE)
fit = predict(brain2.lm, newdata=newdata, interval="confidence", se=T)

95% prediction interval for brain weight:

pred = predict(brain2.lm, newdata=newdata, interval="predict", se=T)

CI/Predictions in original units for body = 259 g

95% confidence interval for $f(x)$:

exp(fit$fit)
##         fit       lwr      upr
## 1 0.5637161 0.2868832 1.107684

95% prediction interval for brain weight:

exp(pred$fit)
##         fit       lwr     upr
## 1 0.5637161 0.1131737 2.80786

We are 95% confident that the brain weight will be between 0.11 and 2.81 grams.

Summary

- Linear predictors may be based on functions of other predictors (dummy variables, interactions, non-linear terms); remember to convert back to the original units.
- The log transform is useful for non-negative responses (it ensures predictions are non-negative).
- Be careful with units of data: plots should show units and summary statements should include units.
- Goodness-of-fit measures $R^2$ and adjusted $R^2$ depend on the scale. $R^2$ is the percent of variation in $Y$ that is explained by the model:

$$R^2 = 1 - SSE/SST, \qquad SST = \sum_i (Y_i - \bar Y)^2$$
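A sketch of the $R^2$ definition applied to the parallel-lines fit (should match summary(brain2.lm)$r.squared):

y   <- log(Animals$brain)
sse <- sum(resid(brain2.lm)^2)     # sum of squared residuals
sst <- sum((y - mean(y))^2)        # total sum of squares about the mean
1 - sse / sst                      # R^2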