Week 4: Simple Linear Regression III


Marcelo Coca Perraillon
University of Colorado Anschutz Medical Campus
Health Services Research Methods I, HSMP 7607, 2017
© 2017 Perraillon, All Rights Reserved

Outline

- Goodness of fit
- Confidence intervals
- Loose ends

Partitioning variance

Recall from last class that the residuals sum to zero, which implies that the mean of the predicted values is the same as the mean of the observed values: mean(ŷ) = ȳ

Also, for each observation i in our dataset the following is always true by definition of the residual: yᵢ = ŷᵢ + ε̂ᵢ, which is equivalent to yᵢ = ŷᵢ + (yᵢ − ŷᵢ)

In words, an observed outcome value yᵢ is equal to its predicted value plus the residual

We can do a little bit of algebra to go from this equality to something more interesting about the way we can interpret linear regression
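A quick check of both facts in Stata (a sketch; it assumes the college GPA data used in the examples below are loaded):

. qui reg colgpa hsgpa
. predict yhat, xb
. predict ehat, resid
. * the mean of ehat is zero (up to rounding) and the mean of yhat equals the mean of colgpa
. sum colgpa yhat ehat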

Partitioning variance

Subtract ȳ from both sides of the equation and group terms: (yᵢ − ȳ) = (ŷᵢ − ȳ) + (yᵢ − ŷᵢ)

Since ȳ = mean(ŷ): (yᵢ − ȳ) = (ŷᵢ − mean(ŷ)) + (yᵢ − ŷᵢ)

Stare at the last equation for a while. Does it look familiar?

Deviation from the mean observed outcome = deviation from the mean predicted value + model residual

The above equality can also be expressed in terms of sums of squares; the proof is messy and your textbook skips several steps; see Wooldridge Chapter 2

Partitioning variance

We need to define the following terms:

Total sum of squared deviations: SST = Σⁿᵢ₌₁ (yᵢ − ȳ)²
Sum of squares due to the regression: SSR = Σⁿᵢ₌₁ (ŷᵢ − mean(ŷ))²
Sum of squared residuals (errors): SSE = Σⁿᵢ₌₁ (yᵢ − ŷᵢ)²

We can then write (yᵢ − ȳ) = (ŷᵢ − mean(ŷ)) + (yᵢ − ŷᵢ) as SST = SSR + SSE

SST/(n−1) is the sample variance of the outcome y
SSR/(n−1) is the sample variance of the predicted values ŷ
SSE/(n−1) is the sample variance of the residuals

Confusion alert: in SSR, R stands for regression. In SSE, E stands for error, even though it should be residual, not error. Wooldridge reverses the acronyms: he uses SSE for the explained (regression) sum of squares and SSR for the residual sum of squares

Partitioning variance

Digression: by now it should be clear that in statistics, distributions, expectations, and deviations from the mean are central: E(X), var(X) = E[(X − µ)²]; sometimes the statements are about the data and other times about parameters

So SST = SSR + SSE is telling us that the observed variance of the outcome can be partitioned into two parts: one that is explained by our model (SSR) and another that our model cannot explain (SSE)

We could then measure how good our model is by the ratio SSR/SST (goodness of fit)

In other words, the fraction of the total observed variance of the outcome that is explained by our model. That's the famous R²: R² = SSR/SST = 1 − SSE/SST

Another way: total observed variance = explained variance + unexplained variance
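We can verify the partition from scratch using predicted values and residuals. A minimal sketch (assuming the same data as above; yhat2 and ehat2 are just illustrative variable names, and summarize returns r(Var) and r(N)):

. qui reg colgpa hsgpa
. predict yhat2, xb
. predict ehat2, resid
. qui sum colgpa
. scalar SST = r(Var)*(r(N)-1)
. qui sum yhat2
. scalar SSR = r(Var)*(r(N)-1)
. qui sum ehat2
. scalar SSE = r(Var)*(r(N)-1)
. * both should reproduce the R-squared of about 0.17 reported by reg
. di SSR/SST
. di 1 - SSE/SST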

Understanding regression

A Chicago newspaper article argued that Chicago's traffic is among the most unpredictable in the nation: a commute that takes about 20 minutes on average can take 10, 40, 60, or even 120 minutes some days. Is the author right?

Assume that commuting time is the outcome y. What the article meant is that commuting time is highly variable, so SST/(n−1) is high. In other words, the sample variance s² (or the standard deviation) of y is high

But it's not unpredictable. You could develop a statistical model that explains average commuting time using weather (snow, rain) as a predictor, along with accidents, downtown events, day of the week, and road work

Once you estimate this model, the SSE (unexplained variance) will be smaller than in a model without these predictors, and R² will be higher

In other words, our model has explained some of the observed variability in commuting times. I can't emphasize enough how important it is to understand these concepts (!!)

Calculate using Stata

Replicate R² using the Analysis of Variance (ANOVA) table in the Stata output (labels added: R = regression/model sum of squares, E = error/residual, T = total; SS = sum of squares, MS = mean SS):

. reg colgpa hsgpa

      Source |       SS           df       MS      Number of obs   =       141
-------------+----------------------------------   F(1, 139)       =     28.85
   (R) Model |  3.33506006         1  3.33506006   Prob > F        =    0.0000
(E) Residual |  16.0710394       139  .115618988   R-squared       =    0.1719
-------------+----------------------------------   Adj R-squared   =    0.1659
   (T) Total |  19.4060994       140  .138614996   Root MSE        =    .34003

------------------------------------------------------------------------------
      colgpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hsgpa |   .4824346   .0898258     5.37   0.000      .304833    .6600362
       _cons |   1.415434   .3069376     4.61   0.000     .8085635    2.022304
------------------------------------------------------------------------------

. * replicate R^2
. di 3.33506006/19.4060994
.17185628

. di 1 - (16.0710394/19.4060994)
.17185628

Calculate using Stata

Replicate the Mean Square Error, which is the remaining unexplained variability of the model:

. reg colgpa hsgpa

      Source |       SS           df       MS      Number of obs   =       141
-------------+----------------------------------   F(1, 139)       =     28.85
       Model |  3.33506006         1  3.33506006   Prob > F        =    0.0000
    Residual |  16.0710394       139  .115618988   R-squared       =    0.1719
-------------+----------------------------------   Adj R-squared   =    0.1659
       Total |  19.4060994       140  .138614996   Root MSE        =    .34003

<... output omitted ...>

. * replicate MSE
. di (16.0710394/139)
.11561899

. * the root MSE is the remaining unexplained variability
. di sqrt((16.0710394/139))
.34002792

. * compare to the variability without using hsgpa as a predictor
. sum colgpa

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      colgpa |        141    3.056738    .3723103        2.2          4

We can go from the explained and unexplained standard deviations to R². The trick is that the sum command divides by n − 1, so we need to recover the sums of squares:

. qui sum colgpa

. * SSE, from the root MSE above
. di (((.34002792)^2)*139)
16.071039

. * SST, from the sum command
. di ((r(sd)^2)*140)
19.406099

. * replicate R^2
. di 1 - (((.34002792)^2)*139) / ((r(sd)^2)*140)
.17185629

Another way of understanding R²

Again, one way of defining R² is as the ratio of the variation explained by the model (SSR) to the total variation (SST)

It turns out that R² is also the square of the correlation between the observed y and its model-predicted values: corr(y, ŷ)²

. qui reg colgpa hsgpa

. predict colhat
(option xb assumed; fitted values)

. corr colhat colgpa
(obs=141)

             |   colhat   colgpa
-------------+------------------
      colhat |   1.0000
      colgpa |   0.4146   1.0000

. di 0.4146^2
.17189316

The better our model is at predicting y, the higher R² will be

Confidence Intervals

We want to build a confidence interval for β̂

The proper interpretation of a confidence interval in the frequentist approach is that if we repeated the experiment many times, about x% of the time the true value of β would be within the x% confidence interval

By convention, we build 95% confidence intervals, which implies α = 0.05

Intuitively, we need to know the distribution of β̂ and its precision, the standard error. To derive these, we need to assume that the errors ε are iid N(0, σ²)

A formula for the confidence interval of β̂ᵢ is: β̂ᵢ ± t(n−2, α/2) × se(β̂ᵢ)
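We can replicate the 95% CI for hsgpa by hand. A sketch using Stata's invttail() and the estimates from the grades regression above (β̂ = .4824346, se = .0898258, df = n − 2 = 139):

. * lower bound; matches the .304833 reported by reg
. di .4824346 - invttail(139, .025)*.0898258

. * upper bound; matches the .6600362 reported by reg
. di .4824346 + invttail(139, .025)*.0898258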

Confidence Intervals

We use the Student's t distribution, but remember that when n is large (larger than about 120) the t distribution approximates the normal distribution

Recall this graph from Stats 101 and Wikipedia:

[graph not reproduced in this transcription]

Direct relationship between statistical tests and confidence intervals

Confidence intervals and statistical tests are closely related:

. reg colgpa hsgpa

------------------------------------------------------------------------------
      colgpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hsgpa |   .4824346   .0898258     5.37   0.000      .304833    .6600362
       _cons |   1.415434   .3069376     4.61   0.000     .8085635    2.022304
------------------------------------------------------------------------------

. test hsgpa = .304833

 ( 1)  hsgpa = .304833

       F(  1,   139) =    3.91
            Prob > F =    0.0500

. test hsgpa = .31

 ( 1)  hsgpa = .31

       F(  1,   139) =    3.69
            Prob > F =    0.0570

. test hsgpa = .29

 ( 1)  hsgpa = .29

       F(  1,   139) =    4.59
            Prob > F =    0.0339

Ponder this: if the value θ₀ in the null H₀: βⱼ = θ₀ is within the 95% CI, we won't reject the null for that value; if the value is outside the CI, we will reject it
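What is the test command computing? A sketch (the Wald statistic is ((β̂ − θ₀)/se)², compared against an F(1, 139) distribution; the numbers are taken from the output above):

. * about 4.59, the F statistic from test hsgpa = .29
. di ((.4824346 - .29)/.0898258)^2

. * about 0.0339, the reported Prob > F
. di Ftail(1, 139, ((.4824346 - .29)/.0898258)^2)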

As usual, simulations are awesome

. * simulate 9000 observations
. set obs 9000

. gen betahsgpa = rnormal(.4824346, .0898258)

. sum

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   betahsgpa |      9,000    .4826007    .0900429   .1583372   .9124259

. * count the number of times beta is within the confidence interval
. count if betahsgpa >= .304833 & betahsgpa <= .6600362
  8,552

. di 8552/9000
.95022222

. * do the same with a Student's t
. gen zt = rt(139)

. * by default, Stata simulates a standard t, mean zero and sd of about 1
. sum zt

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          zt |      9,000    .0019048    .9972998  -3.567863   3.574553

. * need to retransform
. gen betat = .0898258*zt + .4824346

. sum betat

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       betat |      9,000    .4826057    .0895833   .1619485   .8035216

. count if betat >= .304833 & betat <= .6600362
  8,576

. di 8576/9000
.95288889

What just happened?

We estimated a parameter β̂ and its standard error se(β̂) = √var(β̂)

Because of the assumption that ε is distributed N(0, σ²), we know that asymptotically β̂ is normally distributed (while the Wald test statistic follows a Student's t distribution)

That's all we need to calculate confidence intervals and test hypotheses about the true β in the population

Recall that a probability distribution function describes the values a random variable can take and the probability of those values. So if we know the distribution of the parameter, we can make statements about the probability of the parameter taking certain values or being within an interval

We also know that the Student's t converges to a normal for samples larger than about 120, so we could just use the normal distribution

Distributions

Note the slightly fatter tails of the Student's t:

[graph not reproduced in this transcription]
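A sketch of how to draw this type of comparison in Stata (my own example, not from the original slides; a t with few degrees of freedom makes the fatter tails easier to see than one with 139 df):

. twoway (function y = normalden(x), range(-4 4)) (function y = tden(5, x), range(-4 4)), legend(order(1 "N(0,1)" 2 "t with 5 df")) ytitle("Density")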

We can do more

Other confidence intervals?

. centile betahsgpa, centile(2.5(5)97.5)

                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile    Centile      [95% Conf. Interval]
-------------+-------------------------------------------------------------
   betahsgpa |   9,000        2.5    .3028123     .2985525    .3087895
             |                7.5    .3527713     .3494937    .3565182
             |               12.5    .3789864     .375991     .3828722
             |               17.5    .3990816     .3966145    .4013201
             |               22.5    .4146384     .4117871    .4172412
             |               27.5    .428773      .4263326    .4314667
             |               32.5    .4425001     .440095     .4447364
             |               37.5    .4546066     .4523425    .4566871
             |               42.5    .466314      .4637115    .4679226
             |               47.5    .4773481     .4747118    .4795888
             |               52.5    .4890074     .486511     .4909871
             |               57.5    .4991535     .4969174    .5015363
             |               62.5    .5109765     .5086741    .5133994
             |               67.5    .5230142     .5210292    .5253365
             |               72.5    .5359863     .5334756    .5384137
             |               77.5    .5504502     .5473233    .5534496
             |               82.5    .5673794     .5640415    .5700424
             |               87.5    .5862278     .5831443    .589366
             |               92.5    .6109914     .6076231    .6150107
             |               97.5    .6565534     .6518573    .6633566

Note how closely the 2.5th and 97.5th percentiles match the 95% CI from reg above

Even more

What is the probability that the coefficient for hsgpa is greater than 0.4? Greater than 0.8?

. count if betahsgpa > 0.4
  7,394

. di 7394/9000
.82155556

. count if betahsgpa > 0.8
  2

. di 2/9000
.00022222

Fairly likely and fairly unlikely, respectively (look at the distribution)

We have no evidence that the coefficient could be remotely close to zero, so no surprise about the p-value in the regression output

What about Type II error?

The other error we can make when testing hypotheses is a Type II error

Type II: failing to reject the null when it is in fact false

We saw that the power of a test is 1 − P(Type II error)

When are we going to fail to reject the null even if it's false? Intuitively, when the confidence interval is wide

With a very wide CI, more values are going to be within the CI, so they won't get rejected at α = 0.05

And when is the confidence interval going to be wide? Look at the formula for the CI: when t() or se() are large. Both depend on the sample size

When doing power analysis, we're mainly concerned with determining the sample size we need to avoid a Type II error (see the simulation sketch below)
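A simulation sketch of that sample-size intuition (my own example, not from the slides; the program name simslope and the true slope of 0.2 are arbitrary choices). With a small n the CI is wide and we rarely reject the false null β = 0; with a larger n, power is much higher:

capture program drop simslope
program define simslope, rclass
    syntax [, n(integer 30)]
    clear
    set obs `n'
    gen x = rnormal()
    * the true slope is 0.2, so the null beta = 0 is false
    gen y = 0.2*x + rnormal()
    reg y x
    * 1 if we reject beta = 0 at alpha = 0.05, 0 otherwise
    return scalar reject = abs(_b[x]/_se[x]) > invttail(e(df_r), .025)
end

set seed 1234
simulate reject=r(reject), reps(500) nodots: simslope, n(30)
sum reject    // share of rejections = power with n = 30 (low)
simulate reject=r(reject), reps(500) nodots: simslope, n(300)
sum reject    // power with n = 300 (much higher)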

Loose ends

We have only one thing left to explain and replicate from the regression output: what is that F test?

In the grades example:

      Source |       SS           df       MS      Number of obs   =       141
-------------+----------------------------------   F(1, 139)       =     28.85
       Model |  3.33506006         1  3.33506006   Prob > F        =    0.0000
    Residual |  16.0710394       139  .115618988   R-squared       =    0.1719
-------------+----------------------------------   Adj R-squared   =    0.1659
       Total |  19.4060994       140  .138614996   Root MSE        =    .34003

That's a test of the overall validity of the model

It's the ratio MSR/MSE = 3.33506006/.115618988 = 28.845263. As you know by now, it follows an F distribution because it is the ratio of two (scaled) chi-squared random variables

Once we cover maximum likelihood we will see another, more general approach to comparing models: the likelihood ratio test

Both tests are (asymptotically) equivalent
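A quick sketch replicating the F statistic and its p-value with Stata's Ftail() (degrees of freedom taken from the table above):

. * F = MSR/MSE
. di 3.33506006/.115618988
28.845263

. * Prob > F; effectively zero, as reported
. di Ftail(1, 139, 28.845263)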

Preview

F test versus LRT (I chose male as the predictor so the p-value is not close to zero):

. reg colgpa male

      Source |       SS           df       MS      Number of obs   =       141
-------------+----------------------------------   F(1, 139)       =      0.82
       Model |  .113594273         1  .113594273   Prob > F        =    0.3672
    Residual |  19.2925052       139  .138795001   R-squared       =    0.0059
-------------+----------------------------------   Adj R-squared   =   -0.0013
       Total |  19.4060994       140  .138614996   Root MSE        =    .37255

<... output omitted ...>

------------------------------------------------------------------------------
      colgpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |  -.0568374   .0628265    -0.90   0.367    -.1810567    .0673818
       _cons |   3.086567   .0455145    67.82   0.000     2.996577    3.176557
------------------------------------------------------------------------------

. est sto m1
. qui reg colgpa
. est sto m2

. lrtest m1 m2

Likelihood-ratio test                                 LR chi2(1)  =      0.83
(Assumption: m2 nested in m1)                         Prob > chi2 =    0.3629

Three tests in one (t, F, and LR), all with p-values close to 0.36

Summary

Linear regression can be thought of as partitioning the variance of the outcome into two components, explained and unexplained

We can measure the goodness of fit of a model by comparing these variances

Be careful about the context: here we are talking about a linear model with iid errors ε ~ N(0, σ²)

The same is not true in other types of models, but the main ideas remain valid

Once we know the asymptotic distribution of a parameter and its standard deviation (i.e., its standard error), we have all we need to test hypotheses and build CIs