PubHlth 640 Intermediate Biostatistics
Unit 2 - Regression and Correlation
Simple Linear Regression
Software: Stata v 10.1

Emergency Calls to the New York Auto Club

Source: Chatterjee, S; Handcock MS and Simonoff JS. A Casebook for a First Course in Statistics and Data Analysis. New York, John Wiley, 1995, pp 145-152.

Setting: Calls to the New York Auto Club are possibly related to the weather, with more calls occurring during bad weather. This example illustrates descriptive analyses and simple linear regression to explore this hypothesis in a data set containing information on calendar day, weather, and numbers of calls.

Data File: ERS.dta - This is a Stata data set.

Variable Name   Label                         Coding/
DAY             Date                          using informat MMDDYY6. Example: 016193 is January 16, 1993
CALLS           Calls answered
FHIGH           Forecasted high temperature
FLOW            Forecasted low temperature
HIGH            High temperature
LOW             Low temperature
RAIN            Rain Forecast                 0 = NO    1 = RAIN
SNOW            Snow Forecast                 0 = NO    1 = SNOW
WEEKDAY         Type of Day                   0 = NO    1 = Weekday
YEAR                                          0 = 1993  1 = 1994
SUNDAY                                        0 = NO    1 = SUNDAY
SUBZERO                                       0 = NO    1 = SUBZERO

\stata_howto\simple linear regression ny auto club.doc Page 1 of 10
Key -
  Green: comments (note that a comment begins with an asterisk)
  Black: Stata command syntax. Note: You do not type the leading period.
  Blue: Output
I have also inserted some remarks.

. *
. * Simple Linear Regression Using Stata v 10.1
. * toggle off the screen by screen pausing of output
. set more off
. * Use FILE > OPEN to read in the stata data set ers.dta
. use "/Users/carolbigelow/Desktop/ers.dta"
. * Use the command CODEBOOK followed by a comma and the option COMPACT
. * to see a compact description of the data
. codebook, compact

Variable   Obs  Unique      Mean     Min     Max  Label
---------------------------------------------------------
day         28      28     12258   12069   12447
calls       28      27   4318.75    1674    8947
fhigh       28      21  34.96429      10      53
flow        28      19  24.46429       4      40
high        28      19  37.46429      10      55
low         28      22     21.75      -2      41
rain        28       2  .3214286       0       1
snow        28       2  .2142857       0       1
weekday     28       2  .6428571       0       1
year        28       2        .5       0       1
sunday      28       2  .1428571       0       1
subzero     28       2  .1785714       0       1
---------------------------------------------------------

. * 1. Create dictionary of variable values for readability
. label define rainf 0 "0=no" 1 "1=rain"
. label define snowf 0 "0=no" 1 "1=snow"
. label define weekdayf 0 "0=no" 1 "1=weekday"
. label define yearf 0 "0=1993" 1 "1=1994"
. label define sundayf 0 "0=no" 1 "1=Sunday"
. label define subzerof 0 "0=no" 1 "1=subzero"
. * 2. Associate the discrete variables with their dictionary of value codes
. label values rain rainf
. label values snow snowf
. label values weekday weekdayf
. label values year yearf
. label values sunday sundayf
. label values subzero subzerof
. *2. Use command LIST to produce a listing of the data.
. list day calls fhigh flow high low rain snow weekday year sunday subzero

+-----------------------------------------------------------------------------------------------+
        day  calls  fhigh  flow  high  low  rain    snow    weekday    year    sunday    subzero
  1.  12069   2298     38    31    39   31  0=no    0=no    0=no       0=1993  0=no      0=no
  2.  12070   1709     41    27    41   30  0=no    0=no    0=no       0=1993  1=Sunday  0=no
  3.  12071   2395     33    26    38   24  0=no    0=no    0=no       0=1993  0=no      0=no
  4.  12072   2486     29    19    36   21  0=no    0=no    1=weekday  0=1993  0=no      0=no
  5.  12073   1849     40    19    43   27  0=no    0=no    1=weekday  0=1993  0=no      0=no
  6.  12074   1842     44    30    43   29  0=no    0=no    1=weekday  0=1993  0=no      0=no
  7.  12075   2100     46    40    53   41  1=rain  0=no    1=weekday  0=1993  0=no      0=no
  8.  12076   1752     47    35    46   40  0=no    0=no    0=no       0=1993  0=no      0=no
  9.  12077   1776     53    34    55   38  1=rain  0=no    0=no       0=1993  1=Sunday  0=no
 10.  12078   1812     38    32    43   31  0=no    0=no    1=weekday  0=1993  0=no      0=no
 11.  12079   1842     35    21    35   25  0=no    0=no    1=weekday  0=1993  0=no      0=no
 12.  12080   1674     39    27    44   31  1=rain  1=snow  1=weekday  0=1993  0=no      0=no
 13.  12081   1692     34    28    40   27  0=no    0=no    1=weekday  0=1993  0=no      0=no
 14.  12082   1879     46    28    41   23  0=no    0=no    1=weekday  0=1993  0=no      0=no
 15.  12434   6375     17     9    15    3  0=no    0=no    0=no       1=1994  1=Sunday  1=subzero
 16.  12435   8827     35    15    47   12  1=rain  1=snow  0=no       1=1994  0=no      0=no
 17.  12436   7218     30    32    35    4  1=rain  0=no    1=weekday  1=1994  0=no      0=no
 18.  12437   8810     10     4    10   -2  0=no    0=no    1=weekday  1=1994  0=no      1=subzero
 19.  12438   7841     15     6    15    0  1=rain  1=snow  1=weekday  1=1994  0=no      1=subzero
 20.  12439   7745     24    12    21    6  0=no    0=no    1=weekday  1=1994  0=no      1=subzero
 21.  12440   6454     33    19    32   15  0=no    0=no    0=no       1=1994  0=no      0=no
 22.  12441   4619     32    18    32   18  0=no    0=no    0=no       1=1994  1=Sunday  0=no
 23.  12442   6476     48    30    49   31  0=no    0=no    1=weekday  1=1994  0=no      0=no
 24.  12443   4692     38    32    42   32  0=no    0=no    1=weekday  1=1994  0=no      0=no
 25.  12444   3638     26    23    32    5  1=rain  1=snow  1=weekday  1=1994  0=no      0=no
 26.  12445   8947     29    14    31    0  0=no    0=no    1=weekday  1=1994  0=no      1=subzero
 27.  12446   6564     48    34    55   31  1=rain  1=snow  1=weekday  1=1994  0=no      0=no
 28.  12447   5613     31    40    36   36  1=rain  1=snow  0=no       1=1994  0=no      0=no
+-----------------------------------------------------------------------------------------------+
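In the listing, day appears as integers such as 12069. These look like Stata elapsed dates (days since 01jan1960), which is an assumption here, though it agrees with the example of January 16, 1993 in the variable table. A minimal sketch for displaying them as calendar dates, assuming ers.dta is in memory:

```stata
* Check the assumption: 16 January 1993 as a Stata elapsed date
display mdy(1, 16, 1993)
* Show the first listed value of day as a calendar date
display %td 12069
* Apply a date display format to day; the stored values are unchanged
format day %td
list day calls in 1/3
```

Only the display format changes, so later commands that use day are unaffected.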
. *3. Look at your data first! Plot of Y=calls versus X=low
. * The following command of SET SCHEME is optional. I downloaded this particular scheme previously.
. set scheme lean1
. graph twoway (scatter calls low, symbol(d)), title("Calls to NY Auto Club 1993-1994")
. * At the top bar of the GRAPH window click on the SAVE ICON to save your graph!
. * Suggestion: From the drop down menu, choose the .png extension. It's nice for cut and paste.
. * I saved my graph as "nyauto_graph01.png"

Source: nyauto_graph01.png

The scatterplot suggests, as we might expect, that lower temperatures are associated with more calls to the NY Auto Club.
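The graph can also be saved from the command line instead of the SAVE icon. This is a minimal sketch, assuming ers.dta is in memory and that PNG export is available in your installation; the file is written to the current working directory.

```stata
* Redraw the scatterplot of Y=calls versus X=low
graph twoway (scatter calls low), title("Calls to NY Auto Club 1993-1994")
* Write the graph currently displayed to a PNG file, overwriting any existing copy
graph export "nyauto_graph01.png", replace
```

Keeping the export in a do-file means the saved figure is regenerated whenever the analysis is rerun.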
. *4. Descriptives on the outcome variable Y=calls
. * Use command SUMMARIZE followed by comma and then followed by option DETAIL
. summarize calls, detail

                            calls
-------------------------------------------------------------
      Percentiles      Smallest
 1%         1674           1674
 5%         1692           1692
10%         1709           1709       Obs                  28
25%         1842           1752       Sum of Wgt.          28

50%         3062                      Mean            4318.75
                        Largest       Std. Dev.      2692.564
75%         6520           7841
90%         8810           8810       Variance        7249901
95%         8827           8827       Skewness       .4549129
99%         8947           8947       Kurtosis       1.615947

. *4. continued - Assess assumption of normality both graphically and with hypothesis test
. * There are multiple graphs you might consider.
. * Here I do a histogram with the y-axis defined as frequency and with an overlay normal
. histogram calls, frequency normal title("Histogram of Y=CALLS with overlay NORMAL")
(bin=5, start=1674, width=1454.6)
. * save graph as nyauto_graph02.png

Source: nyauto_graph02.png

The graph shows what we suspected: nonnormality of Y=CALLS.
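Another graph you might consider is a normal quantile plot, where departures from the reference line suggest nonnormality. A minimal sketch, assuming ers.dta is in memory; the title text is only illustrative.

```stata
* Quantiles of calls plotted against quantiles of a normal distribution
qnorm calls, title("Normal quantile plot of Y=CALLS")
```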
. * There are also a variety of tests of normality. One is the Shapiro Wilk Test.
. * See Unit 2 lecture notes page 54
. * Null: Distribution of calls is normal. Under Null, test statistic W is close to 1
. * Evidence of NON normality is reflected in W < 1 and small p-value
. swilk calls

                   Shapiro-Wilk W test for normal data

    Variable |    Obs       W          V         z     Prob>z
-------------+--------------------------------------------------
       calls |     28    0.82916     5.159     3.378    0.00037

The null hypothesis of normality of Y=CALLS is rejected. Take care, sometimes the cure is worse than the problem. For now, we'll continue along anyway; this will give us a chance to see some interesting diagnostics!

. * 6. Least Squares estimation and analysis of variance table.
. regress calls low

      Source |       SS       df       MS              Number of obs =      28
-------------+------------------------------           F(  1,    26) =   27.28
       Model |   100233719     1   100233719           Prob > F      =  0.0000
    Residual |  95513596.2    26  3673599.85           R-squared     =  0.5121
-------------+------------------------------           Adj R-squared =  0.4933
       Total |   195747315    27  7249900.56           Root MSE      =  1916.7

------------------------------------------------------------------------------
       calls |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         low |   -145.154   27.78868    -5.22   0.000    -202.2744   -88.03352
       _cons |   7475.849   704.6304    10.61   0.000      6027.46    8924.237
------------------------------------------------------------------------------

The fitted line is $\widehat{\text{calls}} = 7{,}475.85 - 145.15 \times [\text{low}]$.
$R^2 = .51$ indicates that 51% of the variability in calls is explained.
The overall F test significance level < .0001 suggests that the straight line fit performs better in explaining variability in calls than does Y = average # calls.

From this output, the analysis of variance is the following:
Source               Df            Sum of Squares                                                               Mean Square
Model (Regression)   1             $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$ = 100,233,719                        SS(model)/1 = 100,233,719
Residual (Error)     (n-2) = 26    $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ = 95,513,596.2                           SS(residual)/(n-2) = 3,673,599.85
Total, corrected     (n-1) = 27    $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ = 195,747,315

. *7. Overlay of straight line fit on the scatter plot
. graph twoway (scatter calls low, symbol(d)) (lfit calls low), title("Calls to NY Auto Club 1993-1994") subtitle("Overlay Straight Line Fit")
. * save graph as nyauto_graph03.png

Source: nyauto_graph03.png

The overlay of the straight line fit is reasonable but substantial variability is seen, too. There is a lot we still don't know, including but not limited to the following: case influence, omitted variables, variance heterogeneity, incorrect functional form, etc.
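The summary statistics printed by regress follow directly from the analysis of variance table above. As a small worked check, the sketch below recomputes F, R-squared and Root MSE from the sums of squares; the numbers are copied from the output, and the scalar names are hypothetical ones chosen for illustration.

```stata
* Sums of squares and degrees of freedom copied from the regress output
scalar ss_model    = 100233719
scalar ss_residual = 95513596.2
scalar df_model    = 1
scalar df_residual = 26
* Mean squares are sums of squares divided by their degrees of freedom
scalar ms_model    = ss_model/df_model
scalar ms_residual = ss_residual/df_residual
* Compare with the reported F = 27.28, R-squared = 0.5121, Root MSE = 1916.7
display "F         = " %8.2f ms_model/ms_residual
display "R-squared = " %8.4f ss_model/(ss_model + ss_residual)
display "Root MSE  = " %8.1f sqrt(ms_residual)
```

After a regression, the same quantities are also stored by Stata as e(F), e(r2) and e(rmse), so the hand calculation can be compared against them.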
. *8. Residuals Analysis - Assessment of Normality of Residuals
. * Stata requires that you use post-estimation commands to obtain residuals.
. * We consider a few here...
. * Use command PREDICT varname, RESIDUALS to save residuals to a variable you name
. predict e, residuals
. * Use command PNORM to plot residuals e versus percentiles of normal
. * Reasonableness is suggested by points falling along the line
. pnorm e, title("Normality of Residuals of Y=calls v X=low")
. * save graph as nyauto_graph04.png

Source: nyauto_graph04.png

Not bad, actually.
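The graphical check of the residuals can be paired with the same Shapiro-Wilk test that was applied earlier to Y=calls. A minimal sketch, assuming ers.dta is in memory; e_check is a hypothetical variable name used so that it does not clash with the variable e.

```stata
* Refit the straight line model without redisplaying the output
quietly regress calls low
* Save the residuals under a new (hypothetical) name
predict e_check, residuals
* Shapiro-Wilk test; the null hypothesis is that the residuals are normal
swilk e_check
```

As with Y=calls, evidence of nonnormality would show up as W below 1 together with a small p-value.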
. *9. Residuals analysis - Detection of Outliers Using Cook's Distance
. * See Unit 2 lecture notes page 60
. * Use command PREDICT varname, COOKSD to save Cook's distance values to a variable you name
. predict cook, cooksd
. * Preliminary to plot of Cook's distance, we need to create an ID variable
. * This is because the data set ers.dta does not have an ID variable. Most do.
. * Use command GENERATE varname=_n to save the system variable _n to a variable you name.
. generate id=_n
. * Plot Cook's distance values on Y-axis versus id on the X-axis. Look for extreme values
. graph twoway (scatter cook id, symbol(d)), title("Cook's Distance Values") subtitle("Simple Linear Regression of Y=calls on X=low")
. * save graph as nyauto_graph05.png

Source: nyauto_graph05.png

For straight line regression, the suggestion is to regard Cook's distance values > 1 as significant. Here, there are no unusually large Cook's distance values.

Not shown but useful, too, are examinations of leverage and jackknife residuals.
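Leverage can be examined in much the same way as Cook's distance. The sketch below assumes ers.dta is in memory; lev_check and cook_check are hypothetical variable names, and listing the five largest values is an arbitrary choice made for illustration.

```stata
* Refit the model, then save leverage and Cook's distance under new names
quietly regress calls low
predict lev_check, leverage
predict cook_check, cooksd
* Sort temporarily so the original row order is restored afterwards
preserve
gsort -cook_check
list calls low cook_check lev_check in 1/5
restore
```

Using preserve and restore keeps the data in their original order, which matters if an id variable based on _n has already been created.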
. *10. Assessing Assumptions of Linearity, Homoscedasticity, Independence using Jackknife Residuals
. * See Unit 2 notes page 59
. * In Stata, jackknife residuals are referred to as studentized residuals
. * Use command PREDICT varname, XB to save predicted outcomes to a variable you name
. predict predicted, xb
. * Use command PREDICT varname, RSTUDENT to save jackknife residuals to a variable you name
. predict jack, rstudent
. graph twoway (scatter jack predicted, symbol(d)), title("Jackknife Residuals versus Predicted")
. * save graph as nyauto_graph06.png

Source: nyauto_graph06.png

Recall: A jackknife residual for an individual is a modification of the solution for a studentized residual in which the mean square error is replaced by the mean square error obtained after deleting that individual from the analysis.

This plot in SAS is nice for its inclusion of some useful summaries: the fitted line and the $R^2$.

Departures of this plot from a parallel band about the horizontal line at zero are significant. The plot here is a bit noisy but not too bad considering the small sample size.
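The definition recalled above can be written out as a formula. The notation is assumed here and may differ from the Unit 2 lecture notes: $e_i$ is the ordinary residual for individual $i$, $h_i$ is its leverage, $\mathrm{MSE}$ is the mean square error from the full fit, and $\mathrm{MSE}_{(-i)}$ is the mean square error after deleting individual $i$.

```latex
% Studentized residual (left) and jackknife residual (right).
% The only change is that MSE is replaced by MSE_{(-i)}.
r_i = \frac{e_i}{\sqrt{\mathrm{MSE}\,(1 - h_i)}}
\qquad\longrightarrow\qquad
r_{(-i)} = \frac{e_i}{\sqrt{\mathrm{MSE}_{(-i)}\,(1 - h_i)}}
```

Because an outlying observation inflates $\mathrm{MSE}$ but not $\mathrm{MSE}_{(-i)}$, its jackknife residual tends to be larger in magnitude than its studentized residual.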