PubHlth 640 Intermediate Biostatistics Unit 2 - Regression and Correlation. Simple Linear Regression Software: Stata v 10.1


Emergency Calls to the New York Auto Club

Source: Chatterjee, S; Handcock MS and Simonoff JS. A Casebook for a First Course in Statistics and Data Analysis. New York, John Wiley, 1995, pp 145-152.

Setting: Calls to the New York Auto Club are possibly related to the weather, with more calls occurring during bad weather. This example illustrates descriptive analyses and simple linear regression to explore this hypothesis in a data set containing information on calendar day, weather, and numbers of calls.

Data File: ERS.dta - This is a Stata data set.

Variable Name   Label                         Coding
DAY             Date                          using informat MMDDYY6. Example: 016193 is January 16, 1993
CALLS           Calls answered
FHIGH           Forecasted high temperature
FLOW            Forecasted low temperature
HIGH            High temperature
LOW             Low temperature
RAIN            Rain Forecast                 0 = NO   1 = RAIN
SNOW            Snow Forecast                 0 = NO   1 = SNOW
WEEKDAY         Type of Day                   0 = NO   1 = Weekday
YEAR                                          0 = 1993 1 = 1994
SUNDAY                                        0 = NO   1 = SUNDAY
SUBZERO                                       0 = NO   1 = SUBZERO

\stata_howto\simple linear regression ny auto club.doc

Key
  Green: comments (note that a comment begins with an asterisk)
  Black: Stata command syntax. Note: you do not type the leading period.
  Blue: output
I have also inserted some remarks.

. * Simple Linear Regression Using Stata v 10.1
. * toggle off the screen by screen pausing of output
. set more off

. * Use FILE > OPEN to read in the stata data set ers.dta
. use "/Users/carolbigelow/Desktop/ers.dta"

. * Use the command CODEBOOK followed by a comma and the option COMPACT
. * to see a compact description of the data
. codebook, compact

Variable   Obs  Unique       Mean    Min    Max  Label
-------------------------------------------------------
day         28      28      12258  12069  12447
calls       28      27    4318.75   1674   8947
fhigh       28      21   34.96429     10     53
flow        28      19   24.46429      4     40
high        28      19   37.46429     10     55
low         28      22      21.75     -2     41
rain        28       2   .3214286      0      1
snow        28       2   .2142857      0      1
weekday     28       2   .6428571      0      1
year        28       2         .5      0      1
sunday      28       2   .1428571      0      1
subzero     28       2   .1785714      0      1
-------------------------------------------------------

. * 1. Create dictionary of variable values for readability
. label define rainf 0 "0=no" 1 "1=rain"
. label define snowf 0 "0=no" 1 "1=snow"
. label define weekdayf 0 "0=no" 1 "1=weekday"
. label define yearf 0 "0=1993" 1 "1=1994"
. label define sundayf 0 "0=no" 1 "1=Sunday"
. label define subzerof 0 "0=no" 1 "1=subzero"

. * 2. Associate the discrete variables with their dictionary of value codes
. label values rain rainf
. label values snow snowf
. label values weekday weekdayf
. label values year yearf
. label values sunday sundayf
. label values subzero subzerof
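To confirm that the value labels attached as intended, a quick check along the following lines may help. This is only a sketch: it assumes the label names defined above (rainf, yearf, and so on) and uses the standard commands label list, describe and tabulate. As elsewhere in this handout, the commands are shown without the leading period.

```stata
* Show the codes stored in two of the value-label dictionaries
label list rainf yearf

* Confirm which value label is attached to each discrete variable
describe rain snow weekday year sunday subzero

* Cross-tabulate two labeled variables; the labels should appear in the table margins
tabulate year weekday
```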

. *2. Use command LIST to produce a listing of the data.
. list day calls fhigh flow high low rain snow weekday year sunday subzero

      day    calls  fhigh  flow  high  low   rain    snow    weekday    year    sunday    subzero
  1.  12069   2298     38    31    39   31   0=no    0=no    0=no       0=1993  0=no      0=no
  2.  12070   1709     41    27    41   30   0=no    0=no    0=no       0=1993  1=Sunday  0=no
  3.  12071   2395     33    26    38   24   0=no    0=no    0=no       0=1993  0=no      0=no
  4.  12072   2486     29    19    36   21   0=no    0=no    1=weekday  0=1993  0=no      0=no
  5.  12073   1849     40    19    43   27   0=no    0=no    1=weekday  0=1993  0=no      0=no
  6.  12074   1842     44    30    43   29   0=no    0=no    1=weekday  0=1993  0=no      0=no
  7.  12075   2100     46    40    53   41   1=rain  0=no    1=weekday  0=1993  0=no      0=no
  8.  12076   1752     47    35    46   40   0=no    0=no    0=no       0=1993  0=no      0=no
  9.  12077   1776     53    34    55   38   1=rain  0=no    0=no       0=1993  1=Sunday  0=no
 10.  12078   1812     38    32    43   31   0=no    0=no    1=weekday  0=1993  0=no      0=no
 11.  12079   1842     35    21    35   25   0=no    0=no    1=weekday  0=1993  0=no      0=no
 12.  12080   1674     39    27    44   31   1=rain  1=snow  1=weekday  0=1993  0=no      0=no
 13.  12081   1692     34    28    40   27   0=no    0=no    1=weekday  0=1993  0=no      0=no
 14.  12082   1879     46    28    41   23   0=no    0=no    1=weekday  0=1993  0=no      0=no
 15.  12434   6375     17     9    15    3   0=no    0=no    0=no       1=1994  1=Sunday  1=subzero
 16.  12435   8827     35    15    47   12   1=rain  1=snow  0=no       1=1994  0=no      0=no
 17.  12436   7218     30    32    35    4   1=rain  0=no    1=weekday  1=1994  0=no      0=no
 18.  12437   8810     10     4    10   -2   0=no    0=no    1=weekday  1=1994  0=no      1=subzero
 19.  12438   7841     15     6    15    0   1=rain  1=snow  1=weekday  1=1994  0=no      1=subzero
 20.  12439   7745     24    12    21    6   0=no    0=no    1=weekday  1=1994  0=no      1=subzero
 21.  12440   6454     33    19    32   15   0=no    0=no    0=no       1=1994  0=no      0=no
 22.  12441   4619     32    18    32   18   0=no    0=no    0=no       1=1994  1=Sunday  0=no
 23.  12442   6476     48    30    49   31   0=no    0=no    1=weekday  1=1994  0=no      0=no
 24.  12443   4692     38    32    42   32   0=no    0=no    1=weekday  1=1994  0=no      0=no
 25.  12444   3638     26    23    32    5   1=rain  1=snow  1=weekday  1=1994  0=no      0=no
 26.  12445   8947     29    14    31    0   0=no    0=no    1=weekday  1=1994  0=no      1=subzero
 27.  12446   6564     48    34    55   31   1=rain  1=snow  1=weekday  1=1994  0=no      0=no
 28.  12447   5613     31    40    36   36   1=rain  1=snow  0=no       1=1994  0=no      0=no
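The day values in the listing (12069, 12070, ...) are hard to read as calendar dates. A minimal sketch of how they might be displayed as dates follows. It assumes that day is stored as elapsed days since 01jan1960, which is Stata's daily date scale and is consistent with the data description's example of January 16, 1993 for the first record.

```stata
* Assumption: day counts days elapsed since 01jan1960
format day %td

* The first few records should now show calendar dates instead of integers
list day calls low in 1/3

* Check by hand: mdy(month, day, year) returns the elapsed-day number,
* which should equal 12069, the value of day in row 1
display mdy(1, 16, 1993)
```

Changing the display format does not change the stored values, so the summaries elsewhere in this handout are unaffected.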

. *3. Look at your data first! Plot of Y=calls versus X=low
. * The following command of SET SCHEME is optional. I downloaded this particular scheme previously.
. set scheme lean1
. graph twoway (scatter calls low, symbol(d)), title("Calls to NY Auto Club 1993-1994")
. * At the top bar of the GRAPH window click on the SAVE ICON to save your graph!
. * Suggestion: From the drop down menu, choose the .png extension. It's nice for cut and paste.
. * I saved my graph as "nyauto_graph01.png"

Source: nyauto_graph01.png

The scatterplot suggests, as we might expect, that lower temperatures are associated with more calls to the NY Auto Club.
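Saving from the graph window works well, but the same file can also be written by command, which is convenient when the session is rerun. The following is a sketch under the assumption that the PNG export format is available in your installation; the file name is the one used above and is written to the current working directory.

```stata
* Redraw the scatterplot of calls against the daily low temperature
graph twoway (scatter calls low), title("Calls to NY Auto Club 1993-1994")

* Write the graph currently in memory to a PNG file, overwriting any existing copy
graph export "nyauto_graph01.png", replace
```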

. *4. Descriptives on the outcome variable Y=calls
. * Use command SUMMARIZE followed by comma and then followed by option DETAIL
. summarize calls, detail

                            calls
-------------------------------------------------------------
      Percentiles      Smallest
 1%         1674           1674
 5%         1692           1692
10%         1709           1709       Obs                  28
25%         1842           1752       Sum of Wgt.          28

50%         3062                      Mean            4318.75
                        Largest       Std. Dev.      2692.564
75%         6520           7841
90%         8810           8810       Variance        7249901
95%         8827           8827       Skewness       .4549129
99%         8947           8947       Kurtosis       1.615947

. *4. continued - Assess assumption of normality both graphically and with hypothesis test
. * There are multiple graphs you might consider.
. * Here I do a histogram with the y-axis defined as frequency and with an overlay normal
. histogram calls, frequency normal title("Histogram of Y=CALLS with overlay NORMAL")
(bin=5, start=1674, width=1454.6)
. * save graph as nyauto_graph02.png

Source: nyauto_graph02.png

The graph shows what we suspected: nonnormality of Y=CALLS.
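The histogram is one of the "multiple graphs you might consider". As a hedged illustration of two other standard tools, the sketch below draws a normal quantile plot and runs the skewness/kurtosis test; neither appears in the original session, and the plot title is just an example.

```stata
* Normal quantile plot: points far from the reference line suggest nonnormality
qnorm calls, title("Normal Quantile Plot of Y=CALLS")

* Skewness/kurtosis test of normality, as a companion to the histogram
sktest calls
```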

. * There are also a variety of tests of normality. One is the Shapiro-Wilk Test.
. * See Unit 2 lecture notes page 54
. * Null: Distribution of calls is normal. Under Null, test statistic W is close to 1
. * Evidence of NON normality is reflected in W < 1 and small p-value
. swilk calls

                   Shapiro-Wilk W test for normal data

    Variable |    Obs       W           V         z       Prob>z
-------------+--------------------------------------------------
       calls |     28    0.82916      5.159     3.378    0.00037

The null hypothesis of normality of Y=CALLS is rejected. Take care, sometimes the cure is worse than the problem. For now, we'll continue along anyway; this will give us a chance to see some interesting diagnostics!

. * 6. Least Squares estimation and analysis of variance table.
. regress calls low

      Source |       SS       df       MS              Number of obs =      28
-------------+------------------------------           F(  1,    26) =   27.28
       Model |   100233719     1   100233719           Prob > F      =  0.0000
    Residual |  95513596.2    26  3673599.85           R-squared     =  0.5121
-------------+------------------------------           Adj R-squared =  0.4933
       Total |   195747315    27  7249900.56           Root MSE      =  1916.7

------------------------------------------------------------------------------
       calls |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         low |   -145.154   27.78868    -5.22   0.000    -202.2744   -88.03352
       _cons |   7475.849   704.6304    10.61   0.000      6027.46    8924.237
------------------------------------------------------------------------------

The fitted line is $\widehat{\text{calls}} = 7{,}475.85 - 145.15 \times [\text{low}]$.

$R^2 = .51$ indicates that 51% of the variability in calls is explained.

The overall F test significance level <.0001 suggests that the straight line fit performs better in explaining variability in calls than does Y = average # calls.

From this output, the analysis of variance is the following.
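Several entries in the regression table can be recomputed from the others, which is a useful way to see how the pieces fit together. The sketch below uses only the numbers printed above together with Stata's display command and the invttail() function; the low temperature of 20 in the last line is an arbitrary illustrative value, not something taken from the handout.

```stata
* t statistic for the slope: coefficient divided by its standard error
display -145.154/27.78868

* R-squared: model sum of squares divided by total sum of squares
display 100233719/195747315

* 95% confidence limits for the slope, using the t distribution with 26 df
display -145.154 - invttail(26, 0.025)*27.78868
display -145.154 + invttail(26, 0.025)*27.78868

* Predicted number of calls on a day with a low temperature of 20 (illustrative value)
display 7475.849 - 145.154*20
```

The first four results should agree, up to rounding, with the t, R-squared and confidence interval entries in the output.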

Source               Df            Sum of Squares                                            Mean Square
Model (Regression)   1             $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$ = 100,233,719     SS(model)/1 = 100,233,719
Residual (Error)     (n-2) = 26    $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ = 95,513,596.2        SS(residual)/(n-2) = 3,673,599.85
Total, corrected     (n-1) = 27    $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ = 195,747,315

. *7. Overlay of straight line fit on the scatter plot
. graph twoway (scatter calls low, symbol(d)) (lfit calls low), title("Calls to NY Auto Club 1993-1994") subtitle("Overlay Straight Line Fit")
. * save graph as nyauto_graph03.png

Source: nyauto_graph03.png

The overlay of the straight line fit is reasonable but substantial variability is seen, too. There is a lot we still don't know, including but not limited to the following: case influence, omitted variables, variance heterogeneity, incorrect functional form, etc.
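The analysis of variance entries are tied together by a few identities. As a worked check using only the numbers in the table and the regression output:

```latex
\begin{align*}
SS(\text{total}) &= SS(\text{model}) + SS(\text{residual})
  = 100{,}233{,}719 + 95{,}513{,}596.2 \approx 195{,}747{,}315 \\[4pt]
F &= \frac{MS(\text{model})}{MS(\text{residual})}
  = \frac{100{,}233{,}719}{3{,}673{,}599.85} \approx 27.28 \\[4pt]
R^2 &= \frac{SS(\text{model})}{SS(\text{total})}
  = \frac{100{,}233{,}719}{195{,}747{,}315} \approx 0.5121 \\[4pt]
\text{Root MSE} &= \sqrt{MS(\text{residual})}
  = \sqrt{3{,}673{,}599.85} \approx 1916.7
\end{align*}
```

Each result matches the corresponding value reported by regress.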

. *8. Residuals Analysis - Assessment of Normality of Residuals
. * Stata requires that you use post-estimation commands to obtain residuals
. * We consider a few here...
. * Use command PREDICT varname, RESIDUALS to save residuals to a variable you name
. predict e, residuals
. * Use command PNORM to plot residuals e versus percentiles of normal
. * Reasonableness is suggested by points falling along the line
. pnorm e, title("Normality of Residuals of Y=calls v X=low")
. * save graph as nyauto_graph04.png

Source: nyauto_graph04.png

Not bad, actually.
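The graphical check can be supplemented with numerical ones. The sketch below assumes the residuals were saved in the variable e as above; it applies the same Shapiro-Wilk test used earlier for calls, this time to the residuals.

```stata
* Least squares residuals average to zero when the model includes a constant
summarize e

* Shapiro-Wilk test applied to the residuals rather than to the raw outcome
swilk e
```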

. *9. Residuals analysis - Detection of Outliers Using Cook's Distance
. * See Unit 2 lecture notes page 60
. * Use command PREDICT varname, COOKSD to save Cook's distance values to a variable you name
. predict cook, cooksd
. * Preliminary to plot of Cook's distance, we need to create an ID variable
. * This is because the data set ers.dta does not have an ID variable. Most do.
. * Use command GENERATE varname=_n to save the system variable _n to a variable you name.
. generate id=_n
. * Plot Cook's distance values on Y-axis versus id on the X-axis. Look for extreme values
. graph twoway (scatter cook id, symbol(d)), title("Cook's Distance Values") subtitle("Simple Linear Regression of Y=calls on X=low")
. * save graph as nyauto_graph05.png

Source: nyauto_graph05.png

For straight line regression, the suggestion is to regard Cook's Distance values > 1 as significant. Here, there are no unusually large Cook's Distance values. Not shown but useful, too, are examinations of leverage and jackknife residuals.
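The leverage examination mentioned above as "not shown" could be sketched as follows. This assumes the regression of calls on low is still the most recent estimation and that cook and id were created as above; the variable name lev is hypothetical.

```stata
* Leverage (hat) values; the variable name lev is hypothetical
predict lev, leverage

* Built-in plot of leverage versus squared normalized residuals
lvr2plot

* List the five observations with the largest Cook's distance,
* then restore the original order of the data
gsort -cook
list id calls low cook lev in 1/5
sort id
```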

. *10. Assessing Assumptions of Linearity, Heteroscedasticity, Independence using Jackknife Residuals
. * See Unit 2 notes page 59
. * In Stata, jackknife residuals are referred to as studentized residuals
. * Use command PREDICT varname, XB to save predicted outcomes to a variable you name
. predict predicted, xb
. * Use command PREDICT varname, RSTUDENT to save jackknife residuals to a variable you name
. predict jack, rstudent
. graph twoway (scatter jack predicted, symbol(d)), title("Jackknife Residuals versus Predicted")
. * save graph as nyauto_graph06.png

Source: nyauto_graph06.png

Recall: A jackknife residual for an individual is a modification of the solution for a studentized residual in which the mean square error is replaced by the mean square error obtained after deleting that individual from the analysis.

This plot in SAS is nice for its inclusion of some useful summaries: the fitted line, the $R^2$.

Departures of this plot from a parallel band about the horizontal line at zero are significant. The plot here is a bit noisy but not too bad considering the small sample size.
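Stata also offers built-in post-estimation tools that address the same assumptions. The sketch below is a supplement to the jackknife residual plot, not a replacement for it: rvfplot uses ordinary residuals rather than jackknife residuals, and estat hettest is a formal test that the original session does not run. Both assume that regress calls low is still the most recent estimation.

```stata
* Residual-versus-fitted plot drawn directly from the last regression,
* with a horizontal reference line at zero
rvfplot, yline(0)

* Breusch-Pagan / Cook-Weisberg test; the null hypothesis is constant error variance
estat hettest
```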