------------------ log: \Term 2\Lecture_2s\regression1a.log
log type: text
opened on: 22 Feb 2008, 03:29:09

. cmdlog using "\Term 2\Lecture_2s\regression1a.do"
(cmdlog \Term 2\Lecture_2s\regression1a.do opened)

. use "\Term 2\Lecture_2s\data01.dta", clear

. d

Contains data from \Term 2\Lecture_2s\data01.dta
  obs:    20
 vars:     2          14 Jan 2008 09:27
 size:   180          (99.9% of memory free)

              storage  display   value
variable name   type   format    label   variable label
x               byte   %8.0g
y               float  %9.0g

Sorted by:

. codebook

x                                                            (unlabeled)
          type: numeric (byte)
         range: [1,20]            units: 1
 unique values: 20                missing .: 0/20
          mean: 10.5
      std. dev: 5.91608
   percentiles:    10%    25%    50%    75%    90%
                   2.5    5.5   10.5   15.5   18.5

y                                                            (unlabeled)
          type: numeric (float)
         range: [-1.75,20.74]     units: .01
 unique values: 20                missing .: 0/20
          mean: 10.7645
      std. dev: 6.16524
   percentiles:    10%    25%     50%      75%     90%
                  1.91   7.255   11.25   14.965   18.81

. scatter y x

[scatter plot of y against x]

. *Perform a simple linear regression and check that you understand the output produced.

. regress y x

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =   10.25
       Model |  262.070711     1  262.070711           Prob > F      =  0.0049
    Residual |  460.122976    18  25.5623875           R-squared     =  0.3629
-------------+------------------------------           Adj R-squared =  0.3275
       Total |  722.193687    19  38.0101941           Root MSE      =  5.0559

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .6277669   .1960604     3.20   0.005     .2158593    1.039675
       _cons |   4.172947   2.348637     1.78   0.093    -.7613551     9.10725
------------------------------------------------------------------------------

. *Number of obs = 20 is the number of observations used in the model. This might differ from the total number of observations reported by describe if there are missing values.

. *Small table on the left: SS = Sum of Squares, df = degrees of freedom, MS = Mean Square = SS/df.

. *Total SS = sum of squares of (individual observations of the outcome - mean outcome).

. *Residual SS = sum of squares of (individual observations of the outcome - corresponding fitted values based on the model).

. *Model SS = Total SS - Residual SS.

. *Model df = number of regressors in the model; here only x, so 1.

. *Total df = number of obs - 1 = 20 - 1 = 19.

. *F(1, 18) is the F-statistic, obtained as Model MS / Residual MS and referred to an F-distribution with df 1 and 18.

. *Prob > F is the P-value of the F-statistic (here 10.25) when referred to an F(1,18) distribution. The P-value in this case is 0.0049, indicating that at least one of the coefficients of the terms in the model is significantly different from zero.
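The header statistics on the right of this output can be reproduced from the ANOVA table on the left. As a quick arithmetic cross-check (a Python sketch, not part of the Stata session; all input numbers are copied from the regress output above):

```python
# Reproduce the regress header statistics from the ANOVA table.
model_ss, resid_ss, total_ss = 262.070711, 460.122976, 722.193687
model_df, resid_df, n, p = 1, 18, 20, 2    # p = constant + x

model_ms = model_ss / model_df             # MS = SS/df
resid_ms = resid_ss / resid_df

f_stat = model_ms / resid_ms               # F(1, 18)
r2 = model_ss / total_ss                   # R-squared
adj_r2 = 1 - (n - 1) * (1 - r2) / (n - p)  # adjusted R-squared
root_mse = resid_ms ** 0.5                 # Root MSE = sqrt(Residual MS)

print(round(f_stat, 2), round(r2, 4), round(adj_r2, 4), round(root_mse, 4))
# 10.25 0.3629 0.3275 5.0559
```

Each value matches the corresponding entry in the Stata header, confirming the formulas quoted in the comments.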

. *R-squared measures the amount of variability in the outcome (y) explained by the predictor variables (x). So in this case x explains 36% of the variability in y. It is obtained by dividing Model SS by Total SS.

. *Adjusted R-squared is R-squared adjusted for the number of parameters in the model. It is calculated as adj R-squared = 1 - (n-1)(1-R-squared)/(n-p); here p = 2 (one parameter for the constant term and the other for x), n = 20 and R-squared = 0.3629.

. *Root MSE = sqrt(Residual MS) (see http://en.wikipedia.org/wiki/Mean_squared_error for more details).

. *The columns of the big table are, respectively: name of the variables in the model, estimated coefficients, standard error of the estimated coefficients, t-statistic = Coef./Std. Err., P>|t| the corresponding P-value for the t-statistic when referred to a t-distribution with df = df of the residual = 18, and lower and upper bounds for a 95% CI.

. *The fitted model is y = 4.173 + 0.628 x. Based on the P-value for x, x is a significant predictor of y. This could also have been deduced from the fact that the P-value for the F-statistic is significant and x is the only regressor. We could also have looked at the CI, which does not include 0, where 0 corresponds to the null hypothesis of no association between x and y.

. *The following command shows how the P-value reported for x (you can do the same for the constant) was obtained.

. display 2*ttail(18,3.20)
.00496249

. *Interpretation: an increase of one unit in x leads to an increase of 0.628 units in mean y, with a corresponding 95% CI ranging from 0.216 to 1.040.

. ***Check that the residuals from your model are normal and homoscedastic.

. predict fit
(option xb assumed; fitted values)

. *The above command calculates the predicted values based on the model and generates a new variable called fit to store them.

. *Let us see how it does it, by looking at observation 6:

. list in 6

     +--------------------+
     | x     y        fit |
     |--------------------|
  6. | 6   .24   7.939549 |
     +--------------------+
. *Here x = 6, y = 0.24 and the fitted value according to the model is 7.94; this value was calculated by

. display (4.172947+.6277669*6)
7.9395484

. *Not a great fit for this particular observation.

. *"xb assumed" means the calculations carried out are based on the linear part of the regression model.

. scatter y x || line fit x

. *(scatter y x || line fit x) overlays the fitted line on the scatterplot.

. graph export "F:\Term 2\Lecture_2s\Figure1.2.wmf", as(wmf)
(file F:\Term 2\Lecture_2s\Figure1.2.wmf written in Windows Metafile format)

. *The previous command exports the graph in Windows Metafile format (wmf) so that we can insert it in a Word document, and it will call the graph Figure1.2.wmf. If you do not specify the path it will save the file in the directory C:\Data, unless you have already changed the directory to your work area. Note that if you are working on a network such as the computer lab you will not be able to write to C:\Data because you have no writing privileges for this directory. Therefore, you have either to change the directory or to specify the pathway.

. *Here is the graph

[scatter of y against x with the fitted line overlaid]

. *To check for normality you can plot a qqplot or a histogram of the residuals.

. predict res1, residuals

. qnorm res1

[normal Q-Q plot of the standardized residuals against the inverse normal]

. *The command (predict res1, residuals) predicts the residuals; a residual is simply y - fit.

. *The command (qnorm res1) plots a normal Q-Q plot of the residuals. Based on this graph there are no gross departures from normality.

. *Alternatively, you can look at the standardised residuals.

. predict rstres, rstandard

. qnorm rstres

. *Now let us save this graph.

. graph export "F:\Term 2\Lecture_2s\Figure1.3.wmf", as(wmf) replace
(file F:\Term 2\Lecture_2s\Figure1.3.wmf written in Windows Metafile format)

. *To check for equal variance (homoscedasticity) we plot the residuals versus the fitted values.

. scatter rstres fit
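Since a residual is just y minus the fitted value, the predict results can be reproduced by hand. A quick check (Python, not part of the Stata session) for observation 6, using the coefficients and the list output from earlier:

```python
# Residual = observed y minus fitted value, illustrated for observation 6.
# Coefficients and data values are copied from the Stata output above.
b0, b1 = 4.172947, 0.6277669
x6, y6 = 6, 0.24

fit6 = b0 + b1 * x6   # what 'predict fit' stores for this observation
res6 = y6 - fit6      # what 'predict res1, residuals' stores

print(round(fit6, 7))   # 7.9395484
print(round(res6, 7))   # -7.6995484
```

The large negative residual confirms the earlier remark that the model fits this particular observation poorly.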

[standardized residuals plotted against the fitted values]

. graph export "F:\Term 2\Lecture_2s\Figure1.4.wmf", as(wmf) replace
(file F:\Term 2\Lecture_2s\Figure1.4.wmf written in Windows Metafile format)

. *What do you think?

. ****Plot a graph of the fitted line with 95% confidence bands and tolerance bands, overlaid.

. predict confse, stdp

. *The above command generates the variable confse, which is the standard error of the prediction. This is so because we specified the option stdp.

. generate confup=fit+1.96*confse

. *(generate confup=fit+1.96*confse) generates the variable confup, which is the fitted values + 1.96 * SE of the fitted values. This gives the upper bound of the confidence band.

. generate confdn=fit-1.96*confse

. *(generate confdn=fit-1.96*confse) generates the variable confdn, which is the fitted values - 1.96 * SE of the fitted values. This gives the lower bound of the confidence band.

. scatter y x || line fit x || line confup x || line confdn x

[scatter of y against x with the fitted line and 95% confidence band overlaid]

. *(scatter y x || line fit x || line confup x || line confdn x) overlays a scatter plot of the observed values of y and x with the fitted line based on the model and the corresponding confidence band. Can you tune this graph so that the colours of the confidence band are the same, and add a title to the graph?

. graph export "F:\Term 2\Lecture_2s\Figure1.5.wmf", as(wmf)
(file F:\Term 2\Lecture_2s\Figure1.5.wmf written in Windows Metafile format)

. *For the tolerance band we use the following commands.

. predict tol, stdf

. generate toldn=fit-1.96*tol

. generate tolup=fit+1.96*tol

. scatter y x || line fit x || line tolup x || line toldn x

[scatter of y against x with the fitted line and 95% tolerance band overlaid]

. graph export "F:\Term 2\Lecture_2s\Figure1.6.wmf", as(wmf)
(file F:\Term 2\Lecture_2s\Figure1.6.wmf written in Windows Metafile format)

. *Calculate the (Pearson) correlation coefficient between x and y. Can you notice a connection?

. corr y x
(obs=20)

             |        y        x
-------------+------------------
           y |   1.0000
           x |   0.6024   1.0000

. *The correlation coefficient is 0.6024. It is the square root of the R-squared reported in the regression output; that is, R-squared is the square of this value.

. display 0.6024^2
.36288576

. ***************Q2 Regression and the t-test*************

. use "F:\Term 2\Lecture_2s\data03.dta", clear

. d

Contains data from F:\Term 2\Lecture_2s\data03.dta
  obs:    40
 vars:     2          16 Jan 2008 10:12
 size:   360          (99.9% of memory free)

              storage  display   value
variable name   type   format    label   variable label
z               float  %9.0g
gp              byte   %8.0g

Sorted by:

. sort gp

. by gp: sum z

-> gp = 0

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+--------------------------------------------------------
           z |      20    .0551686    1.145848    -2.07459   2.220229

-> gp = 1

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+--------------------------------------------------------
           z |      20    2.463217    1.273995   -.2312334   4.170232

. *We see that for group 0 the mean is approximately zero with an SD of approximately 1. For group 1 the mean is approximately 2 with an SD of approximately 1.

. regress z gp

      Source |       SS       df       MS              Number of obs =      40
-------------+------------------------------           F(  1,    38) =   39.50
       Model |  57.9869844     1  57.9869844           Prob > F      =  0.0000
    Residual |  55.7845744    38  1.46801511           R-squared     =  0.5097
-------------+------------------------------           Adj R-squared =  0.4968
       Total |  113.771559    39  2.91721946           Root MSE      =  1.2116

------------------------------------------------------------------------------
           z |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          gp |   2.408049   .3831469     6.28   0.000     1.632408    3.183689
       _cons |   .0551686   .2709257     0.20   0.840    -.4932919    .6036291
------------------------------------------------------------------------------

. *Here note that there is no need for an indicator-variable format, as gp is coded as 0, 1.

. *The model that we fitted is z = .0551686 + 2.408049 * gp.

. *Therefore for someone in gp 0 the mean predicted value is .0551686 (is it the same as that reported by the previous command?).

. *For someone in gp 1 the mean predicted value is .0551686 + 2.408049.

. display (.0551686 + 2.408049)
2.4632176

. *Compare to the previous command.

. *Based on the regression output, gp is a significant predictor of z. It explains 51% of the variation in z. Therefore, there is an association between z and gp.
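The equivalence between the regression coefficients and the two group means can be checked arithmetically (a Python sketch, not part of the Stata session; all numbers come from the output above):

```python
# Regressing z on a 0/1 indicator reproduces the group means:
# gp = 0 -> _cons, gp = 1 -> _cons + coefficient of gp.
cons, coef_gp = 0.0551686, 2.408049
mean_gp0, mean_gp1 = 0.0551686, 2.463217   # from 'by gp: sum z'

pred0 = cons + coef_gp * 0
pred1 = cons + coef_gp * 1

print(round(pred0, 6), round(pred1, 6))    # 0.055169 2.463218
```

The predicted means agree with the by-group summaries up to rounding in the displayed coefficients.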

. use "F:\Term 2\Lecture_2s\data04.dta", clear

. d

Contains data from F:\Term 2\Lecture_2s\data04.dta
  obs:    20
 vars:     2          16 Jan 2008 11:13
 size:   240          (99.9% of memory free)

              storage  display   value
variable name   type   format    label   variable label
x               float  %9.0g
y               float  %9.0g

Sorted by:

. sum x y

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+--------------------------------------------------------
           x |      20    .0551686    1.145848    -2.07459   2.220229
           y |      20    2.463217    1.273995   -.2312334   4.170232

. *Observations in x are those for gp 0 and in y are those for gp 1.

. ttest x = y, unpaired

Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      20    .0551686    .2562194    1.145848   -.4811047     .591442
       y |      20    2.463217    .2848739    1.273995    1.866969    3.059465
---------+--------------------------------------------------------------------
combined |      40    1.259193    .2700565    1.707987    .7129522    1.805434
---------+--------------------------------------------------------------------
    diff |           -2.408049    .3831469               -3.183689   -1.632408
------------------------------------------------------------------------------
Degrees of freedom: 38

                  Ho: mean(x) - mean(y) = diff = 0

 Ha: diff < 0               Ha: diff != 0              Ha: diff > 0
   t = -6.2849                t = -6.2849                t = -6.2849
 P < t = 0.0000           P > |t| = 0.0000           P > t = 1.0000

. *The above output corresponds to a t-test of unpaired data assuming equal variances (you can check this by using sdtest). We need to make this assumption in order to compare with the results from the regression.

. *The first part of the table gives summary stats for each group.

. *The second part (combined) gives summary statistics for the combined group; if you do a (ci z) in the previous data set you will get the same output.

. *Here it is:

. ci z

    Variable |     Obs        Mean    Std. Err.    [95% Conf. Interval]
-------------+---------------------------------------------------------------
           z |      40    1.259193    .2700565     .7129522    1.805434

. *The third part (diff) gives summary statistics of the difference, so the mean difference is

. display (.0551686-2.463217)
-2.4080484

. *Compare this to the coefficient of gp in the previous output.

. display (sqrt(.2562194^2 +.2848739^2))
.38314686

. *(display (sqrt(.2562194^2 +.2848739^2))) gives the combined SE. The 95% CI is obtained by

. display ( -2.408049 - invttail(38,0.975) *.3831469)
-1.6324087

. display ( -2.408049 + invttail(38,0.975) *.3831469)
-3.1836893

. display invttail(38,0.975)
-2.0243941

. *This is the critical value of the t-distribution with df = 38 = 40 - 2 at an upper-tail probability of 0.975, as we want a two-sided 95% CI (so 95% in between). Note that invttail returns the negative value here, and that its magnitude is close to 1.96.

. *To use the t-test we are assuming that the variables are normally distributed. Can you check that? Note, however, that the t-test is robust to departures from normality.

. *The last part of the output gives you the P-values of the different hypotheses that one can investigate for the difference between two means. Note that you should not pick and choose your hypotheses according to the displayed P-values; you should have a prior hypothesis that you are interested in. In our case we are interested in the two-sided one, so we look at the column with Ha (which stands for alternative hypothesis): diff != 0. The combined sign != means "not equal". The t is the value of the t-statistic = mean diff / SE diff. The P-value associated with this statistic is P > |t| (read as "absolute t", since we are interested in a two-sided test). It is calculated by referring the t-statistic to the t-distribution with df = 38 = 40 - 2. According to this value the two means are statistically significantly different. In other words, there is an association between the outcome and the group.

. ***************************************************************************************

. ***********Q3: The idea of this exercise is to give you practice in interpreting the distribution of residuals in order to help you check regression model assumptions.
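Before moving on to the Q3 data, the confidence-interval arithmetic just shown can be verified outside Stata (a Python sketch; the critical value is the one displayed by invttail, and all other numbers come from the ttest output above):

```python
# Reconstruct the t-test confidence interval for the difference in means.
se_x, se_y = 0.2562194, 0.2848739
diff = 0.0551686 - 2.463217            # mean(x) - mean(y)

se_diff = (se_x**2 + se_y**2) ** 0.5   # combined SE, as in the display command
t_crit = 2.0243941                     # |invttail(38, 0.975)|, df = 40 - 2

lower = diff - t_crit * se_diff
upper = diff + t_crit * se_diff

print(round(se_diff, 7))               # 0.3831469
print(round(lower, 6), round(upper, 6))
# -3.183689 -1.632408
```

The bounds match both the ttest output and, up to sign convention, the 95% CI for gp in the Q2 regression.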
. use "F:\Term 2\Lecture_2s\data05.dta", clear

. d

Contains data from F:\Term 2\Lecture_2s\data05.dta
  obs:   100
 vars:     6          16 Jan 2008 11:33
 size: 2,800          (99.7% of memory free)

              storage  display   value
variable name   type   format    label   variable label
normvar         float  %9.0g
tvar            float  %9.0g
chivar          float  %9.0g
uvar            float  %9.0g
cauchyvar       float  %9.0g
mixvar          float  %9.0g

Sorted by:

. *The name of each variable reflects which distribution it has been generated from.

. *normvar = normal variable with mean zero and variance one.

. *tvar = t-distributed variable with two df.

. *chivar = chi-squared variable with two df.

. *uvar = uniform variable on the interval zero to one.

. *cauchyvar = Cauchy variable with location 0 and scale 1 (note that the Cauchy distribution is a special case of the t-distribution, namely t with one df).

. *mixvar = variable from a mixture of two distributions: unit-variance normals with means 0 and 4.

. *The following commands graph the histogram for each variable, overlaid with a normal curve, and save each graph separately.

. hist normvar, title("normal(0, 1)") normal
(bin=10, start=-3.0959888, width=.58574033)

. graph save q1.gph, replace
(file q1.gph saved)

. hist tvar, title("t(df(2))") normal
(bin=10, start=-5.5337934, width=4.7400474)

. graph save q2.gph, replace
(file q2.gph saved)

. hist chivar, title("chi(df(2))") normal
(bin=10, start=.00718813, width=1.1191456)

. graph save q3.gph, replace
(file q3.gph saved)

. hist uvar, title("uniform(0,1)") normal
(bin=10, start=.00110148, width=.0975045)

. graph save q4.gph, replace
(file q4.gph saved)

. hist cauchyvar, title("cauchy(0,1)") normal
(bin=10, start=-168.79085, width=21.392432)

. graph save q5.gph, replace
(file q5.gph saved)

. hist mixvar, title("normal(0,1)+normal(4,1)") normal
(bin=10, start=-2.0749254, width=.86990695)

. graph save q6.gph, replace
(file q6.gph saved)

. *The following command combines the different graphs into one graph.

. graph combine q1.gph q2.gph q3.gph q4.gph q5.gph q6.gph, saving(hist)
(file hist.gph saved)
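For reference, variables with these six distributions can be simulated using only the Python standard library (a sketch of how a data set like data05 might have been generated; the 50:50 mixing proportion for mixvar is an assumption, since the log does not state it):

```python
import math
import random

random.seed(1)
n = 100

normvar = [random.gauss(0, 1) for _ in range(n)]

# t with 2 df: Z / sqrt(V/2), where V ~ chi-squared(2).
# A chi-squared(2) variate is exponential with mean 2 (rate 0.5).
tvar = [random.gauss(0, 1) / math.sqrt(random.expovariate(0.5) / 2)
        for _ in range(n)]

chivar = [random.expovariate(0.5) for _ in range(n)]      # chi-squared(2)
uvar = [random.random() for _ in range(n)]                # Uniform(0,1)

# Cauchy(0,1) via the inverse-CDF method: tan(pi*(U - 1/2)).
cauchyvar = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]

# Mixture of N(0,1) and N(4,1); a 50:50 mix is assumed here.
mixvar = [random.gauss(0, 1) if random.random() < 0.5 else random.gauss(4, 1)
          for _ in range(n)]
```

Histograms of these lists should show the same shapes as the six panels below: heavy tails for tvar and cauchyvar, skew for chivar, a flat top for uvar, and two bumps for mixvar.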

[combined histograms with normal overlays: normal(0,1), t(df 2), chi-squared(df 2), uniform(0,1), Cauchy(0,1), and the normal(0,1)+normal(4,1) mixture]

. *This one exports to WMF.

. graph export "F:\Term 2\Lecture_2s\Figure3.1.wmf", as(wmf)
(file F:\Term 2\Lecture_2s\Figure3.1.wmf written in Windows Metafile format)

. graph drop q1.gph q2.gph q3.gph q4.gph q5.gph q6.gph

. *Next we plot the normal Q-Q plots for the above variables.

. qnorm normvar, title("normal(0, 1)")

. graph save q1.gph, replace
(file q1.gph saved)

. qnorm tvar, title("t(df(2))")

. graph save q2.gph
(file q2.gph saved)

. qnorm chivar, title("chi(df(2))")

. graph save q3.gph
(file q3.gph saved)

. qnorm uvar, title("uniform(0,1)")

. graph save q4.gph
(file q4.gph saved)

. qnorm cauchyvar, title("cauchy(0,1)")

. graph save q5.gph
(file q5.gph saved)

. qnorm mixvar, title("normal(0,1)+normal(4,1)")

. graph save q6.gph
(file q6.gph saved)

. graph combine q1.gph q2.gph q3.gph q4.gph q5.gph q6.gph, saving(qnorm)
(file qnorm.gph saved)

[combined normal Q-Q plots for the six variables]

. graph export "F:\Term 2\Lecture_2s\Figure3.2.wmf", as(wmf)
(file F:\Term 2\Lecture_2s\Figure3.2.wmf written in Windows Metafile format)

. *You can do the same using (pnorm).

. *To look at separate subsamples you can use, for example, in 1/20 or in 10/30 (note that these will not be random samples). If you use the command (sample 20) it will give you 20 observations selected randomly.

. ***************************************************************************************

. *************************Q4 Interaction term

. use "F:\Term 2\Lecture_2s\data02.dta", clear

. d

Contains data from F:\Term 2\Lecture_2s\data02.dta
  obs:   100
 vars:     3          16 Jan 2008 12:12
 size: 1,300          (99.9% of memory free)

              storage  display   value
variable name   type   format    label   variable label
x               float  %9.0g
y               float  %9.0g
gp              byte   %8.0g

Sorted by:

. codebook

x                                                            (unlabeled)
          type: numeric (float)
         range: [1,40]            units: .01
 unique values: 100               missing .: 0/100
          mean: 20.5
      std. dev: 11.5167
   percentiles:    10%    25%    50%    75%     90%
                 4.685   10.5   20.5   30.5   36.315

y                                                            (unlabeled)
          type: numeric (float)
         range: [9.29,41.83]      units: .01
 unique values: 98                missing .: 0/100
          mean: 22.9022
      std. dev: 8.57401
   percentiles:    10%      25%      50%      75%     90%
                13.165   15.315   21.555   29.625   35.57

gp                                                           (unlabeled)
          type: numeric (byte)
         range: [0,1]             units: 1

 unique values: 2                 missing .: 0/100
    tabulation: Freq.  Value
                   50  0
                   50  1

. sort gp

. by gp: summ x y

-> gp = 0

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+--------------------------------------------------------
           x |      50        10.5     5.65209          1         20
           y |      50     30.1728    5.586942      18.91      41.83

-> gp = 1

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+--------------------------------------------------------
           x |      50        30.5     5.65209         21         40
           y |      50     15.6316    3.071295       9.29      21.82

. regress y x gp

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  2,    97) =  476.60
       Model |  6605.65019     2  3302.82509           Prob > F      =  0.0000
    Residual |  672.204385    97  6.92994212           R-squared     =  0.9076
-------------+------------------------------           Adj R-squared =  0.9057
       Total |  7277.85457    99  73.5136825           Root MSE      =  2.6325

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .6492031   .0470482    13.80   0.000     .5558255    .7425806
          gp |  -27.52526   1.078244   -25.53   0.000    -29.66528   -25.38525
       _cons |   23.35617   .6185794    37.76   0.000     22.12846    24.58388
------------------------------------------------------------------------------

. *Interpret the output.

. gen xgp = x*gp

. regress y x gp xgp

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  3,    96) =  503.09
       Model |  6842.61542     3  2280.87181           Prob > F      =  0.0000
    Residual |  435.239146    96  4.53374111           R-squared     =  0.9402
-------------+------------------------------           Adj R-squared =  0.9383
       Total |  7277.85457    99  73.5136825           Root MSE      =  2.1293

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9243218   .0538172    17.18   0.000     .8174955    1.031148
          gp |  -16.24539    1.78744    -9.09   0.000    -19.79343   -12.69735

         xgp |  -.5502375    .076109    -7.23   0.000    -.7013128   -.3991623
       _cons |   20.46742   .6403055    31.97   0.000     19.19642    21.73842
------------------------------------------------------------------------------

. *So is there interaction?

. predict fit
(option xb assumed; fitted values)

. scatter y x || line fit x if gp==0 || line fit x if gp==1

[scatter of y against x with separate fitted lines for gp==0 and gp==1]

. graph export "F:\Term 2\Lecture_2s\Figure4.1.wmf", as(wmf)
(file F:\Term 2\Lecture_2s\Figure4.1.wmf written in Windows Metafile format)

. *Alternatively, for the interaction you could have used:

. xi: regress y i.gp*x
i.gp              _Igp_0-1            (naturally coded; _Igp_0 omitted)
i.gp*x            _IgpXx_#            (coded as above)

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  3,    96) =  503.09
       Model |  6842.61542     3  2280.87181           Prob > F      =  0.0000
    Residual |  435.239146    96  4.53374111           R-squared     =  0.9402
-------------+------------------------------           Adj R-squared =  0.9383
       Total |  7277.85457    99  73.5136825           Root MSE      =  2.1293

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------
      _Igp_1 |  -16.24539    1.78744    -9.09   0.000    -19.79343   -12.69735
           x |   .9243218   .0538172    17.18   0.000     .8174955    1.031148
    _IgpXx_1 |  -.5502375    .076109    -7.23   0.000    -.7013128   -.3991623
       _cons |   20.46742   .6403055    31.97   0.000     19.19642    21.73842
------------------------------------------------------------------------------

. *Compare to the previous output.

. *****************************************

. ***Q 5 Functional Form: The idea of this exercise is for you to explore the information that exploratory and residual plots give you about the possible functional form for an explanatory variable.

. clear

. set obs 20
obs was 0, now 20

. *(set obs 20) opens space in memory for 20 observations.

. g x = _n

. *(g x = _n) generates a variable running from 1 to 20.

. gen y = sqrt(x) + 0.1*invnormal(uniform())

. *y = sqrt(x) + an error term from the normal distribution with mean 0 and SD 0.1.

. regress y x

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =  532.62
       Model |  20.0730993     1  20.0730993           Prob > F      =  0.0000
    Residual |  .678376485    18  .037687583           R-squared     =  0.9673
-------------+------------------------------           Adj R-squared =  0.9655
       Total |  20.7514758    19  1.09218294           Root MSE      =  .19413

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .1737386   .0075282    23.08   0.000     .1579226    .1895547
       _cons |   1.254488   .0901808    13.91   0.000     1.065025    1.443951
------------------------------------------------------------------------------

. predict fit
(option xb assumed; fitted values)

. scatter y x || line fit x

[scatter of y against x with the fitted line overlaid]

. graph export "F:\Term 2\Lecture_2s\Figure5.1.wmf", as(wmf)
(file F:\Term 2\Lecture_2s\Figure5.1.wmf written in Windows Metafile format)

. predict rstres, rst

[normal Q-Q plot of the standardized residuals]

. scatter rstres x

[standardized residuals plotted against x]

*So what do you think this is telling you? Are the two linearly related?

*You can also look at other transformations or use other normal distributions; for example, try the following, repeating the steps above:

gen y = sqrt(x) + 2*invnormal(uniform())
gen y = x^2 + 0.2*invnormal(uniform())
gen y = ln(x) + 0.3*invnormal(uniform())

*You can also try different sample sizes by varying (set obs ----).

Mona Kanaan
Feb 2008
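As a postscript to Q5, the residual pattern this exercise is designed to reveal can be reproduced deterministically (a Python sketch; the noise term is omitted so the curvature stands out):

```python
import math

# Fit a straight line to y = sqrt(x) by ordinary least squares, then
# inspect the sign of the residuals at the ends and in the middle.
xs = list(range(1, 21))             # x = 1, ..., 20, as in the exercise
ys = [math.sqrt(x) for x in xs]     # noiseless version of gen y = sqrt(x) + ...

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
intercept = ybar - slope * xbar

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# A concave relationship fitted by a line leaves negative residuals at
# both ends and positive residuals in the middle -- the telltale arch
# in the residuals-versus-x plot above.
print(residuals[0] < 0, residuals[9] > 0, residuals[19] < 0)   # True True True
```

The slope comes out close to the 0.174 estimated in the log (the difference is the simulated noise), and the arched residual pattern is exactly what the scatter of rstres against x shows.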