Section 2.3: Simple Linear Regression: Predictions and Inference


Jared S. Murray, The University of Texas at Austin, McCombs School of Business

Suggested reading: OpenIntro Statistics, Chapter 7.4

Simple Linear Regression: Predictions and Uncertainty

Two things that we might want to know:

- What value of Y can we expect for a given X?
- How sure are we about this prediction (or forecast)? That is, how different could Y be from what we expect?

Our goal is to measure the accuracy of our forecasts, or how much uncertainty there is in the forecast. One method is to specify a range of Y values that are likely, given an X value.

Prediction Interval: a probable range of Y values for a given X.

We need the conditional distribution of Y given X.

Conditional Distributions vs the Marginal Distribution

For example, consider our house price data. We can look at the distribution of house prices in slices determined by size ranges. (Figure: boxplots of price within size ranges.)

Conditional Distributions vs the Marginal Distribution

What do we see? The conditional distributions are less variable (narrower boxplots) than the marginal distribution. Variation in house sizes explains a lot of the original variation in price.

What does this mean about SST, SSR, SSE, and R² from last time?
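One way to connect the picture to those quantities is to compute them. A minimal R sketch, assuming a data frame `housing` with columns Price and Size (the names that appear in the R output later in this section):

fit <- lm(Price ~ Size, data = housing)

SST <- sum((housing$Price - mean(housing$Price))^2)  # total variation in price
SSE <- sum(resid(fit)^2)                             # variation left over after the fit
SSR <- SST - SSE                                     # variation explained by size
SSR / SST    # equals R-squared (0.8267 in the output below)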

Conditional Distributions vs the Marginal Distribution

When X has no predictive power, the story is different. (Figure: the conditional boxplots look just like the marginal distribution.)

Probability models for prediction

Slicing our data is an awkward way to build a prediction and prediction interval. (Why 500 sq ft slices and not 200 or 1000? What's the tradeoff between large and small slices?)

Instead we build a probability model (e.g., a normal distribution). Then we can say something like "with 95% probability the prediction error will be within ±$28,000."

We must also acknowledge that the fitted line may be fooled by particular realizations of the residuals (an unlucky sample).

The Simple Linear Regression Model

Simple Linear Regression Model: Y = β₀ + β₁X + ε, where ε ~ N(0, σ²)

- β₀ + β₁X represents the "true line"; the part of Y that depends on X.
- The error term ε is independent, idiosyncratic noise; the part of Y not associated with X.

The Simple Linear Regression Model

Y = β₀ + β₁X + ε

(Figure: y versus x.)

The conditional distribution for Y given X is normal (why?):

(Y | X = x) ~ N(β₀ + β₁x, σ²)

The Simple Linear Regression Model: Example

You are told (without looking at the data) that β₀ = 40, β₁ = 45, σ = 10, and you are asked to predict the price of a 1500 square foot house. What do you know about Y from the model?

Y = 40 + 45(1.5) + ε = 107.5 + ε

Thus our prediction for the price is E(Y | X = 1.5) = 107.5 (the conditional expected value), and since (Y | X = 1.5) ~ N(107.5, 10²), a 95% prediction interval for Y is 87.5 < Y < 127.5.
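These numbers are quick to verify in R. A small sketch using only the parameters given above (no data needed):

b0 <- 40; b1 <- 45; sigma <- 10
x  <- 1.5                       # 1500 sq ft, measured in 1000s of sq ft

Ey <- b0 + b1 * x               # conditional mean: 107.5
Ey + c(-2, 2) * sigma           # approximate 95% PI: (87.5, 127.5)
qnorm(c(0.025, 0.975), mean = Ey, sd = sigma)  # exact normal quantiles (±1.96σ)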

Summary of Simple Linear Regression

The model is

Yᵢ = β₀ + β₁Xᵢ + εᵢ, with εᵢ ~ i.i.d. N(0, σ²)

The SLR model has 3 basic parameters:

- β₀, β₁ (the linear pattern)
- σ (the variation around the line)

Assumptions:

- Independence means that knowing εᵢ doesn't affect your views about εⱼ.
- Identically distributed means that we are using the same normal distribution for every εᵢ.

Conditional Distributions vs the Marginal Distribution

You know that β₀ and β₁ determine the linear relationship between X and the mean of Y given X.

σ determines the spread, or variation, of the realized values around the line (i.e., around the conditional mean of Y).

Learning from data in the SLR Model

SLR assumes every observation in the dataset was generated by the model:

Yᵢ = β₀ + β₁Xᵢ + εᵢ

This is a model for the conditional distribution of Y given X. We use least squares to estimate β₀ and β₁:

β̂₁ = b₁ = r_xy (s_y / s_x)

β̂₀ = b₀ = Ȳ − b₁X̄
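As a sketch of these formulas in action, the snippet below computes the least squares estimates "by hand" and compares them with lm(), again assuming the `housing` data frame with Price and Size:

b1 <- cor(housing$Size, housing$Price) * sd(housing$Price) / sd(housing$Size)
b0 <- mean(housing$Price) - b1 * mean(housing$Size)
c(b0 = b0, b1 = b1)

coef(lm(Price ~ Size, data = housing))  # should agree with the values above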

Estimation of Error Variance

We estimate σ² with:

s² = (1 / (n − 2)) Σᵢ eᵢ² = SSE / (n − 2)

(2 is the number of regression coefficients, i.e., one each for β₀ and β₁.)

We have n − 2 degrees of freedom because 2 have been used up in the estimation of b₀ and b₁.

We usually use s = √(SSE / (n − 2)), which is in the same units as Y. It's also called the regression or residual standard error.
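A quick check of this formula in R, assuming the same `housing` data frame as above:

fit <- lm(Price ~ Size, data = housing)
n   <- nrow(housing)
s   <- sqrt(sum(resid(fit)^2) / (n - 2))  # s = sqrt(SSE / (n - 2))
s            # matches "Residual standard error: 14.14" in the output below
sigma(fit)   # R's built-in extractor returns the same value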

Finding s from R output

summary(fit)

##
## Call:
## lm(formula = Price ~ Size, data = housing)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -30.425  -8.618   0.575  10.766  18.498
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   38.885      9.094   4.276 0.000903 ***
## Size          35.386      4.494   7.874 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.14 on 13 degrees of freedom
## Multiple R-squared:  0.8267, Adjusted R-squared:  0.8133
## F-statistic: 62 on 1 and 13 DF,  p-value: 2.66e-06

One Picture Summary of SLR

The plot below has the house data, the fitted regression line (b₀ + b₁X), and ±2s bands. (Figure: price versus size.)

From this picture, what can you tell me about b₀, b₁, and s²? How about β₀, β₁, and σ²?

Sampling Distribution of Least Squares Estimates

How much do our estimates depend on the particular random sample that we happen to observe? Imagine:

- Randomly drawing different samples of the same size.
- For each sample, computing the estimates b₀, b₁, and s (just like we did for sample means in Section 1.4).

If the estimates don't vary much from sample to sample, then it doesn't matter which sample you happen to observe. If the estimates do vary a lot, then it matters which sample you happen to observe. The simulation sketch below makes this thought experiment concrete.
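A minimal R simulation of this idea, reusing the true parameters from the earlier example (β₀ = 40, β₁ = 45, σ = 10; the sample size and X-range here are illustrative assumptions):

set.seed(1)
beta0 <- 40; beta1 <- 45; sigma <- 10
n <- 15
x <- runif(n, 1, 3.5)    # fixed X values, reused in every simulated sample

ests <- t(replicate(1000, {
  y   <- beta0 + beta1 * x + rnorm(n, sd = sigma)  # new errors each draw
  fit <- lm(y ~ x)
  c(b0 = unname(coef(fit)[1]), b1 = unname(coef(fit)[2]), s = sigma(fit))
}))
apply(ests, 2, sd)   # how much b0, b1, and s vary across samples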

Sampling Distribution of Least Squares Estimates

(Figure slides: how the least squares fits vary across repeated samples.)

Sampling Distribution of b₁

The sampling distribution of b₁ describes how the estimator b₁ = β̂₁ varies over different samples with the X values fixed. It turns out that b₁ is (approximately) normally distributed:

b₁ ~ N(β₁, s²_b₁)

- b₁ is unbiased: E[b₁] = β₁.
- s_b₁ is the standard error of b₁. In general, the standard error of an estimate is its standard deviation over many randomly sampled datasets of size n. It determines how close b₁ is to β₁ on average.
- This is a number directly available from the regression output.

Sampling Distribution of b₁

Can we intuit what should be in the formula for s_b₁?

- How should s figure in the formula?
- What about n?
- Anything else?

s²_b₁ = s² / Σ(Xᵢ − X̄)² = s² / ((n − 1)s²ₓ)

Three factors: sample size (n), error variance (s²), and X-spread (sₓ).

Sampling Distribution of b₀

The intercept is also normal and unbiased:

b₀ ~ N(β₀, s²_b₀)

s²_b₀ = Var(b₀) = s² (1/n + X̄² / ((n − 1)s²ₓ))

What is the intuition here?
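Both standard error formulas are easy to verify against R's output. A sketch, again assuming the `housing` data frame from earlier:

fit  <- lm(Price ~ Size, data = housing)
n    <- nrow(housing)
s    <- sigma(fit)
sx2  <- var(housing$Size)        # sample variance of X
xbar <- mean(housing$Size)

se_b1 <- sqrt(s^2 / ((n - 1) * sx2))
se_b0 <- sqrt(s^2 * (1/n + xbar^2 / ((n - 1) * sx2)))
c(se_b0 = se_b0, se_b1 = se_b1)
summary(fit)$coefficients[, "Std. Error"]  # should match (9.094 and 4.494)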

Confidence Intervals

Since b₁ ~ N(β₁, s²_b₁):

- 68% confidence interval: b₁ ± 1 × s_b₁
- 95% confidence interval: b₁ ± 2 × s_b₁
- 99% confidence interval: b₁ ± 3 × s_b₁

Same thing for b₀:

- 95% confidence interval: b₀ ± 2 × s_b₀

The confidence interval provides you with a set of plausible values for the parameter.

Finding standard errors from R output

summary(fit)

##
## Call:
## lm(formula = Price ~ Size, data = housing)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -30.425  -8.618   0.575  10.766  18.498
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   38.885      9.094   4.276 0.000903 ***
## Size          35.386      4.494   7.874 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.14 on 13 degrees of freedom
## Multiple R-squared:  0.8267, Adjusted R-squared:  0.8133
## F-statistic: 62 on 1 and 13 DF,  p-value: 2.66e-06

Confidence intervals in R

In R, you can extract confidence intervals easily:

confint(fit, level=0.95)

##                 2.5 %   97.5 %
## (Intercept) 19.23850 58.53087
## Size        25.67709 45.09484

These are close to what we get by hand, but not exactly the same:

38.885 - 2*9.094; 38.885 + 2*9.094
## [1] 20.697
## [1] 57.073

35.386 - 2*4.494; 35.386 + 2*4.494

Confidence intervals in R

Why don't our answers agree? R is using a slightly more accurate approximation to the sampling distribution of the coefficients, based on the t distribution.

The difference only matters in small samples, and if it changes your inferences or decisions then you probably need more data!

Testing

Suppose we want to assess whether or not β₁ equals a proposed value β₁⁰. This is called hypothesis testing.

Formally we test the null hypothesis

H₀: β₁ = β₁⁰

versus the alternative

H₁: β₁ ≠ β₁⁰

(For example, testing β₁ = 0 vs. β₁ ≠ 0 is testing whether X is predictive of Y under our SLR model assumptions.)

Testing

There are 2 ways we can think about testing:

1. Building a test statistic... the t-stat:

t = (b₁ − β₁⁰) / s_b₁

This quantity measures how many standard errors (SDs of b₁) the estimate b₁ is from the proposed value β₁⁰. If the absolute value of t is greater than 2, we need to worry (why?)... we reject the null hypothesis.
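A sketch of the t-stat "by hand", reusing the housing fit from earlier (the proposed value q below is a hypothetical example, not from the slides):

b1    <- coef(fit)["Size"]
se_b1 <- summary(fit)$coefficients["Size", "Std. Error"]
(b1 - 0) / se_b1   # 35.386 / 4.494 = 7.874 for H0: beta1 = 0, as in the R output

q <- 30            # any other proposed value (hypothetical)
(b1 - q) / se_b1   # general form: (estimate - proposed) / standard error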

Testing

2. Looking at the confidence interval. If the proposed value is outside the confidence interval, you reject the hypothesis.

Notice that this is equivalent to the t-stat: an absolute value of t greater than 2 implies that the proposed value is outside the (approximate 95%) confidence interval... therefore reject. In fact, a 95% confidence interval contains all the values for a parameter that are not rejected by a hypothesis test with a false positive rate of 5%.

This is my preferred approach for the testing problem. You can't go wrong by using the confidence interval!

Example: Mutual Funds

Let's investigate the performance of the Windsor Fund, an aggressive large cap fund by Vanguard. (Figure: Windsor versus SP500.)

The plot shows 6 months of daily returns for Windsor vs. the S&P 500.

Example: Mutual Funds

Consider the following regression model for the Windsor mutual fund:

r_w = β₀ + β₁ r_sp500 + ε

Let's first test

H₀: β₁ = 0 vs. H₁: β₁ ≠ 0

Is the Windsor fund related to the market?

Example: Mutual Funds

##
## Call:
## lm(formula = Windsor ~ SP500, data = windsor)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.42557 -0.11035  0.01057  0.11915  0.50539
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01027    0.01602  -0.641    0.523
## SP500        1.07875    0.03498  30.841   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1777 on 124 degrees of freedom
## Multiple R-squared:  0.8847, Adjusted R-squared:  0.8837
## F-statistic: 951.2 on 1 and 124 DF,  p-value: < 2.2e-16

Example: Mutual Funds

The approximate 95% confidence interval is 1.079 ± 2 × 0.035 = (1.009, 1.149), so we'd reject H₀: β₁ = 0.

confint(fit, level=0.95)

##                   2.5 %   97.5 %
## (Intercept) -0.04197045 0.021435
## SP500        1.00951622 1.147976

The t statistic is (1.079 − 0)/0.035 ≈ 30.8 (see also the R output). Reject!

Example: Mutual Funds

Now let's test β₁ = 1. What does that mean?

H₀: β₁ = 1. Windsor is as risky as the market.
H₁: β₁ ≠ 1. Windsor softens or exaggerates market moves.

We are asking whether Windsor moves in a different way than the market (does it exhibit larger/smaller changes than the market, or about the same?).

Example: Mutual Funds

The approximate 95% confidence interval is still 1.079 ± 2 × 0.035 = (1.009, 1.149), so we'd reject H₀: β₁ = 1 as well.

confint(fit, level=0.95)

##                   2.5 %   97.5 %
## (Intercept) -0.04197045 0.021435
## SP500        1.00951622 1.147976

The t statistic is (1.079 − 1)/0.035 ≈ 2.26. Reject! But...

Testing: Why I like giving an interval

- What if the Windsor beta estimate had been 1.07 with a CI of (0.99, 1.14)? Would our assessment of the fund's market risk really change?
- Now suppose in testing H₀: β₁ = 1 you got a t-stat of 6 and the confidence interval was [1.00001, 1.00002]. Do you reject H₀: β₁ = 1 and conclude Windsor is riskier than the market? Could you justify that to your boss? Probably not! (Why?)

Testing: Why I like giving an interval

- Now suppose in testing H₀: β₁ = 1 you got a t-stat of −0.02 and the confidence interval was [−100, 100]. Do you accept H₀: β₁ = 1? Could you justify that to your boss? Probably not! (Why?)

The confidence interval is your friend when it comes to testing!

P-values

- The p-value provides a measure of how weird your estimate is if the null hypothesis is true.
- Small p-values are evidence against the null hypothesis.
- In the AVG vs. R/G example... H₀: β₁ = 0. How weird is our estimate of b₁ = 33.57?
- Remember: b₁ ~ N(β₁, s²_b₁)... If the null were true (β₁ = 0), then b₁ ~ N(0, s²_b₁).

P-values

Where is 33.57 in the picture below? (Figure: the sampling distribution of b₁ if β₁ = 0, roughly spanning −40 to 40.)

The p-value is the probability of seeing a b₁ equal to or greater than 33.57 in absolute terms. Here, p-value = 0.000000124!!

Small p-value = bad null.

P-values: Windsor fund

R will report p-values for testing each coefficient at β_j = 0, in the column Pr(>|t|):

##
## Call:
## lm(formula = Windsor ~ SP500, data = windsor)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.42557 -0.11035  0.01057  0.11915  0.50539
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01027    0.01602  -0.641    0.523
## SP500        1.07875    0.03498  30.841   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1777 on 124 degrees of freedom
## Multiple R-squared:  0.8847, Adjusted R-squared:  0.8837
## F-statistic: 951.2 on 1 and 124 DF,  p-value: < 2.2e-16

P-values for other null hypotheses

We have to do other tests ourselves. To get a p-value for H₀: β₁ = q versus H₁: β₁ ≠ q, note that b₁ ~ N(q, se(b₁)²) (approximately) under the null, and so

(b₁ − q) / se(b₁) ~ N(0, 1)

P-values for other null hypotheses

Under H₀, the probability of seeing a coefficient at least as extreme as b₁ is

Pr(|Z| > |t|), where t = (b₁ − q)/se(b₁)

(Figure: standard normal density over z, with the observed t-stat = 2.26 marked.)

The p-value for testing H₀: β₁ = 1 in the Windsor data is

2*pnorm(abs(1.079 - 1)/0.035, lower.tail=FALSE)

## [1] 0.02399915

Testing Summary

- A large |t| and a small p-value mean the same thing...
- p-value < 0.05 is equivalent to a t-stat greater than 2 in absolute value.
- A small p-value means the data at hand are unlikely to be observed if the null hypothesis were true...
- Bottom line: small p-value ⇒ REJECT! Large |t| ⇒ REJECT!
- But remember: always look at the confidence interval!

Prediction/Forecasting under Uncertainty

The conditional forecasting problem: given a covariate value X_f and sample data {Xᵢ, Yᵢ}, i = 1, ..., n, predict the future observation Y_f.

The solution is to use our LS fitted value: Ŷ_f = b₀ + b₁X_f.

This is the easy bit. The hard (and very important!) part of forecasting is assessing uncertainty about our predictions.

Forecasting: Plug-in Method

A common approach is to assume that β₀ ≈ b₀, β₁ ≈ b₁, and σ ≈ s... in this case the 95% plug-in prediction interval is:

(b₀ + b₁X_f) ± 2s

It's called "plug-in" because we just plug in the estimates (b₀, b₁, and s) for the unknown parameters (β₀, β₁, and σ).
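A sketch of the plug-in interval for a 1,850 sq ft house, using the estimates from the housing output earlier (b₀ = 38.885, b₁ = 35.386, s = 14.14):

b0 <- 38.885; b1 <- 35.386; s <- 14.14
xf <- 1.85
(b0 + b1 * xf) + c(-2, 2) * s   # about (76.1, 132.6), in $1000s

Compare this with the interval R reports for the same X_f on the next slides, (72.8, 135.9): the correct interval is wider because it also accounts for uncertainty in b₀ and b₁.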

Forecasting: Better intervals in R

But remember that you are uncertain about b₀ and b₁! As a practical matter, if the confidence intervals are big you should be careful! R will give you a larger (and correct) prediction interval.

A larger prediction error variance (higher uncertainty) comes from:

- Large s (i.e., large ε's).
- Small n (not enough data).
- Small sₓ (not enough observed spread in covariates).
- A large difference between X_f and X̄.

Forecasting: Better intervals in R

fit = lm(Price ~ Size, data=housing)

# Make a data.frame with some X_f values
# (X values where we want to forecast)
newdf = data.frame(Size = c(1, 1.85, 3.2, 4.1))

predict(fit, newdata = newdf, interval = "prediction", level = 0.95)

##        fit       lwr      upr
## 1  74.27065  41.65499 106.8863
## 2 104.34871  72.80283 135.8946
## 3 152.11976 117.97174 186.2678
## 4 183.96713 145.61441 222.3199

Forecasting: Better intervals in R

(Figure: Price versus Size with two sets of interval bands.)

- Red lines: prediction intervals
- Green lines: plug-in prediction intervals

House Data one more time!

- R² = 82%
- Great R²! We are happy using this model to predict house prices, right? (Figure: price versus size.)

House Data one more time!

- But s = 14, leading to a predictive interval width of about US$60,000!! How do you feel about the model now?
- As a practical matter, s is a much more relevant quantity than R². Once again, intervals are your friend!

(Figure: price versus size.)