Section 3.4: Diagnostics and Transformations Jared S. Murray The University of Texas at Austin McCombs School of Business 1

Regression Model Assumptions

Y_i = β_0 + β_1 X_i + ε_i

Recall the key assumptions of our linear regression model:

(i) The mean of Y is linear in the X's.
(ii) The additive errors (deviations from the line) are normally distributed, independent of each other, and identically distributed (i.e., they have constant variance).

Equivalently, Y_i | X_i ~ N(β_0 + β_1 X_i, σ²) 2
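
As a quick illustration (not from the slides), here is a minimal R sketch that simulates data satisfying these assumptions; the values of β_0, β_1, and σ are made up.

# Minimal sketch (made-up parameter values): data generated under the model
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
beta0 <- 2; beta1 <- 0.5; sigma <- 1
y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)  # additive, iid N(0, sigma^2) errors
fit <- lm(y ~ x)
plot(x, y); abline(fit, col = "red")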

Regression Model Assumptions

Inference and prediction rely on this model being true! If the model assumptions do not hold, then all betss are off: predictions can be systematically biased, and standard errors, intervals, and t-tests are wrong. We will focus on using graphical methods (plots!) to detect violations of the model assumptions. 3

Example

[Two scatterplots: y1 vs. x1 and y2 vs. x2]

Here we have two datasets... Which one looks compatible with our modeling assumptions? 4

Output from the two regressions...

## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.0001     1.1247   2.667  0.02573 *
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.001      1.125   2.667  0.02576 *
## x2             0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

5
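
The estimates and R² values above match R's built-in anscombe data frame (Anscombe's quartet). Assuming that is indeed the data behind the slides, a sketch to reproduce the two fits:

# Hedged sketch: anscombe is assumed to be the source of y1, x1, y2, x2
fit1 <- lm(y1 ~ x1, data = anscombe)
fit2 <- lm(y2 ~ x2, data = anscombe)
summary(fit1)  # compare with the first output above
summary(fit2)  # compare with the second output above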

Example

The regression output values are exactly the same...

[Scatterplots of y1 vs. x1, y2 vs. x2, y3 vs. x3, and y4 vs. x4]

Thus, whatever decision or action we might take based on the output would be the same in both cases! 6

Example

...but the residuals (plotted against Ŷ) look totally different!!

[Residual plots: resid(fit1) vs. fitted(fit1) and resid(fit2) vs. fitted(fit2)]

Plotting e vs. Ŷ (and the X's) is your #1 tool for finding model fit problems. 7
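
A sketch of how these residual plots can be produced, assuming fit1 and fit2 are the two fitted models from the previous slides:

par(mfrow = c(1, 2))                      # two plots side by side
plot(fitted(fit1), resid(fit1), xlab = "fitted(fit1)", ylab = "resid(fit1)")
abline(h = 0, lty = 2)
plot(fitted(fit2), resid(fit2), xlab = "fitted(fit2)", ylab = "resid(fit2)")
abline(h = 0, lty = 2)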

Residual Plots

We use residual plots to diagnose potential problems with the model. From the model assumptions, the error term (ε) should have a few properties... we use the residuals (e) as a proxy for the errors:

ε_i = y_i − (β_0 + β_1 x_1i + β_2 x_2i + ... + β_p x_pi)
    ≈ y_i − (b_0 + b_1 x_1i + b_2 x_2i + ... + b_p x_pi) = e_i

8

Residual Plots

What kind of properties should the residuals have??

e_i ~ N(0, σ²), iid and independent of the X's

We should see no pattern between e and each of the X's. This can be summarized by looking at the plot of e against Ŷ. Remember that Ŷ is "pure X", i.e., a linear function of the X's. If the model is good, the regression should have pulled out of Y all of its "X-ness"... what is left over (the residuals) should have nothing to do with X. 9

Example: Mid City (Housing)

Left: ŷ vs. y. Right: ŷ vs. e.

[Left: fitted(housing_fit) vs. housing$price. Right: fitted(housing_fit) vs. resid(housing_fit)]

10

Example: Mid City (Housing)

Size vs. e

[housing$size vs. resid(housing_fit)]

11

Example: Mid City (Housing)

In the Mid City housing example, the residual plots (both X vs. e and Ŷ vs. e) showed no obvious problem... This is what we want!! Although these plots don't guarantee that all is well, they are a very good sign that the model is doing a good job. 12
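
For reference, a sketch of the two residual plots above, assuming housing and housing_fit are the data frame and fitted model from the slides (the size column name is taken from the plot labels):

par(mfrow = c(1, 2))
plot(fitted(housing_fit), resid(housing_fit),
     xlab = "fitted(housing_fit)", ylab = "resid(housing_fit)")
abline(h = 0, lty = 2)
plot(housing$size, resid(housing_fit),
     xlab = "housing$size", ylab = "resid(housing_fit)")
abline(h = 0, lty = 2)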

Non-Linearity

Example: Telemarketing. How does length of employment affect productivity (number of calls per day)?

[Scatterplot of calls vs. months]

13

Non-Linearity

Example: Telemarketing. The residual plot highlights the non-linearity!

[resid(telefit) vs. fitted(telefit)]

14
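
A sketch of the straight-line fit and its residual plot, assuming tele is the telemarketing data frame with columns calls and months (as used on the next slides):

telefit <- lm(calls ~ months, data = tele)   # straight-line fit
plot(fitted(telefit), resid(telefit),
     xlab = "fitted(telefit)", ylab = "resid(telefit)")  # a curved pattern indicates non-linearity
abline(h = 0, lty = 2)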

Non-Linearity

What can we do to fix this?? We can use multiple regression and transform our X to create a nonlinear model... Let's try

Y = β_0 + β_1 X + β_2 X² + ε

The data...

months  months²  calls
10      100      18
10      100      19
11      121      22
14      196      23
15      225      25
...     ...      ...

15

Telemarketing: Adding a squared term

In R, the quickest way to add a quadratic term (or other transformation) is using I() in the formula:

telefit2 = lm(calls~months + I(months^2), data=tele)
print(telefit2)

##
## Call:
## lm(formula = calls ~ months + I(months^2), data = tele)
##
## Coefficients:
## (Intercept)       months  I(months^2)
##    -0.14047      2.31020     -0.04012

16

Telemarketing

ŷ_i = b_0 + b_1 x_i + b_2 x_i²

[Scatterplot of calls vs. months with the fitted quadratic curve]

17

Telemarketing

What is the marginal effect of X on Y?

∂E[Y | X] / ∂X = β_1 + 2 β_2 X

To better understand the impact of changes in X on Y you should evaluate different scenarios. Moving from 10 to 11 months of employment raises productivity by 1.47 calls; going from 25 to 26 months only raises the number of calls by 0.27.

This is similar to the variable interactions we saw earlier. The effect of X_1 on the predicted value of Y depends on the value of X_2. Here, X_1 and X_2 are the same variable! 18
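
These two numbers can be checked directly with predict() and the telefit2 model from the previous slide:

# Predicted calls at 10, 11, 25, and 26 months; I(months^2) is computed automatically
pred <- predict(telefit2, newdata = data.frame(months = c(10, 11, 25, 26)))
pred[2] - pred[1]   # about 1.47 calls (10 -> 11 months)
pred[4] - pred[3]   # about 0.27 calls (25 -> 26 months)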

Polynomial Regression

Even though we are limited to a linear mean, it is possible to get nonlinear regression by transforming the X variable. In general, we can add powers of X to get polynomial regression:

Y = β_0 + β_1 X + β_2 X² + ... + β_m X^m

You can fit basically any mean function if m is big enough. Usually, m = 2 does the trick. 19
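
A sketch of a higher-order fit using poly() with the telemarketing data from earlier; the cubic degree here is just for illustration:

# raw = TRUE gives coefficients on months, months^2, months^3 directly
telefit3 <- lm(calls ~ poly(months, 3, raw = TRUE), data = tele)
summary(telefit3)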

Closing Comments on Polynomials We can always add higher powers (cubic, etc) if necessary. Be very careful about predicting outside the data range. The curve may do unintended things beyond the observed data. Watch out for over-fitting... remember, simple models are better. 20

Be careful when extrapolating...

[calls vs. months, with the fitted curve extended beyond the observed range of months]

21

...and, be careful when adding more polynomial terms!

[calls vs. months, with fitted polynomials of degree 2, 3, and 8 overlaid]

22

Non-constant Variance

Example... This violates our assumption that all ε_i have the same σ². 23

Non-constant Variance

Consider the following relationship between Y and X:

Y = γ_0 X^(β_1) (1 + R)

where we think about R as a random percentage error. On average we assume R is 0... but when it turns out to be 0.1, Y goes up by 10%! We often see this: the errors are multiplicative, and the variation is something like ±10% rather than ±10 units. This leads to non-constant variance (or heteroskedasticity). 24
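
A small simulation (with made-up values of γ_0 and β_1) showing how multiplicative errors produce non-constant variance on the original scale but roughly constant variance after taking logs:

set.seed(1)
x <- runif(200, 1, 10)
R <- rnorm(200, mean = 0, sd = 0.10)   # random percentage errors, roughly +/- 10%
y <- 2 * x^1.5 * (1 + R)               # gamma0 = 2, beta1 = 1.5 (illustrative values)
par(mfrow = c(1, 2))
plot(x, y)                             # spread grows with x
plot(log(x), log(y))                   # roughly constant spread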

The Log-Log Model

We have data on Y and X and we still want to use a linear regression model to understand their relationship... what if we take the log (natural log) of Y?

log(Y) = log[ γ_0 X^(β_1) (1 + R) ]
log(Y) = log(γ_0) + β_1 log(X) + log(1 + R)

Now, if we call β_0 = log(γ_0) and ε = log(1 + R), the above leads to

log(Y) = β_0 + β_1 log(X) + ε

a linear regression of log(Y) on log(X)! 25

Price Elasticity

In economics, the slope coefficient β_1 in the regression

log(sales) = β_0 + β_1 log(price) + ε

is called the price elasticity. This is the % change in expected sales per 1% change in price. The model implies that

E[sales] = A · price^(β_1), where A = exp(β_0)

26

Price Elasticity of OJ

A chain of gas station convenience stores was interested in the dependence between price and sales of orange juice... They decided to run an experiment and change prices randomly at different locations. With the data in hand, let's first run a regression of Sales on Price:

Sales = β_0 + β_1 Price + ε

lm(Sales~Price, data=oj)

##
## Call:
## lm(formula = Sales ~ Price, data = oj)
##
## Coefficients:
## (Intercept)        Price
##       89.64       -20.93

27

Price Elasticity of OJ

[Left: Sales vs. Price with the fitted regression line. Right: resid(ojfit) vs. Price]

No good!! 28

Price Elasticity of OJ

But... would you really think this relationship would be linear? Is moving a price from $1 to $2 the same as changing it from $10 to $11??

log(Sales) = γ_0 + γ_1 log(Price) + ε

ojfitelas = lm(log(Sales)~log(Price), data=oj)
coef(ojfitelas)

## (Intercept)  log(Price)
##    4.811646   -1.752383

How do we interpret γ̂_1 = −1.75? (When prices go up 1%, sales go down by 1.75%) 29

Price Elasticity of OJ

print(ojfitelas)

##
## Call:
## lm(formula = log(Sales) ~ log(Price), data = oj)
##
## Coefficients:
## (Intercept)  log(Price)
##       4.812      -1.752

How do we interpret γ̂_1 = −1.75? (When prices go up 1%, sales go down by 1.75%) 30

Price Elasticity of OJ

[Left: Sales vs. Price with the fitted log-log curve. Right: resid(ojfitelas) vs. Price]

Much better!! 31

Making Predictions

What if the gas station store wants to predict their sales of OJ if they decide to price it at $1.80?

The predicted log(Sales) = 4.812 + (−1.752) · log(1.8) = 3.78, so the predicted Sales = exp(3.78) = 43.82.

How about the plug-in prediction interval? On the log scale, our prediction interval is

[log(Sales) − 2s; log(Sales) + 2s] = [3.78 − 2(0.38); 3.78 + 2(0.38)] = [3.02; 4.54]

In terms of actual Sales, the interval is [exp(3.02), exp(4.54)] = [20.5; 93.7]. 32
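
The same calculation can be done with predict(), assuming ojfitelas and oj from the earlier slides (the Price column name is taken from the regression output). Note that predict() uses the exact t-based interval, so the numbers will be close to, but not identical to, the ±2s values above.

pr <- predict(ojfitelas, newdata = data.frame(Price = 1.8), interval = "prediction")
pr        # fit, lwr, upr on the log(Sales) scale
exp(pr)   # back-transformed to the Sales scale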

Making Predictions

[Prediction intervals shown on the original scale (Sales vs. Price) and on the log scale (log(Sales) vs. log(Price))]

In the log scale (right) we have [Ŷ − 2s; Ŷ + 2s]. In the original scale (left) we have [exp(Ŷ)·exp(−2s); exp(Ŷ)·exp(2s)]. 33

Some additional comments...

Another useful transformation to deal with non-constant variance is to take only log(Y) and keep X the same. Clearly the elasticity interpretation no longer holds; in that model the slope is (approximately) the percentage change in Y per one-unit change in X. Always be careful in interpreting models after a transformation. Also be careful in using the transformed model to make predictions. 34

Summary of Transformations

Coming up with a good regression model is usually an iterative procedure. Use plots of residuals vs. X or Ŷ to determine the next step. The log transform is your best friend when dealing with non-constant variance (log(X), log(Y), or both). Add polynomial terms (e.g., X²) to get nonlinear regression.

The bottom line: you should combine what the plots and the regression output are telling you with your common sense and knowledge about the problem. Keep iterating until you have a model that makes sense and has nothing obviously wrong with it. 35

Outliers

Body weight vs. brain weight... X = body weight of a mammal in kilograms, Y = brain weight of a mammal in grams.

[Scatterplot of brain vs. body]

Does a linear model make sense here? 36

Outliers

Let's try logs...

[Scatterplot of log(brain) vs. log(body)]

Better, but could we be missing less obvious outliers? 37

Checking for Outliers with Standardized Residuals

In our model, ε ~ N(0, σ²). The residuals e are a proxy for ε, and the standard error s is an estimate for σ.

Call z = e/s the standardized residuals... We should expect z ~ N(0, 1). (How often should we see an observation with |z| > 3?) 38
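
For reference, the answer to that question comes straight from the normal distribution:

2 * pnorm(-3)   # P(|z| > 3) for z ~ N(0, 1): about 0.0027, i.e. roughly 1 in 370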

Standardized residual plots

plot(rstandard(mamfit)~log(body), data=mammals, ylim=c(-4,4))
abline(h=c(-3,3), col='red')

[Plot of rstandard(mamfit) vs. log(body), with horizontal lines at ±3]

One large positive outlier... 39

Outliers

It turns out that the data had the brain of a chinchilla weighing 64 grams!! In reality, it is 6.4 grams... after correcting it:

[Left: log(brain) vs. log(body). Right: standardized residuals vs. log(body) after the correction]

40
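
If the data are MASS::mammals (an assumption; that data set does record the chinchilla's brain weight as 64 g, and "Chinchilla" is the assumed row label), the correction and refit might look like:

library(MASS)
mammals["Chinchilla", "brain"] <- 6.4                  # correct the suspected typo (64 -> 6.4 g)
mamfit2 <- lm(log(brain) ~ log(body), data = mammals)  # model formula assumed from the slides
plot(rstandard(mamfit2) ~ log(body), data = mammals, ylim = c(-4, 4))
abline(h = c(-3, 3), col = 'red')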

How to Deal with Outliers

When should you delete outliers? Only when you have a really good reason!

There is nothing wrong with running the regression with and without potential outliers to see whether the results are significantly impacted. Any time outliers are dropped, the reasons for removing observations should be clearly noted. 41