
DUMMY VARIABLES AND INTERACTIONS

Let's start with an example in which we are interested in discrimination in income. We have a dataset that includes information for about 1,600 people on their income, their education, and their race-ethnic group (as well as additional variables that we shall, for the present, ignore and that were eliminated from this data subset).

. use discrim
. des

Contains data from discrim.dta
  obs:         1,606
 vars:             6
 size:        44,968
---------------------------------
  1. ed       float  %9.0g
  2. income   float  %9.0g
  3. female   float  %9.0g
  4. black    float  %9.0g
  5. hisp     float  %9.0g
  6. white    float  %9.0g
---------------------------------
Sorted by:

MODEL 1: The first model includes only education as a predictor.

. regress income ed

  Source |       SS       df       MS         Number of obs =    1606
---------+------------------------------      F(  1,  1604) =  270.48
   Model |  8.5922e+10     1  8.5922e+10      Prob > F      =  0.0000
Residual |  5.0953e+11  1604   317659125      R-square      =  0.1443
---------+------------------------------      Adj R-square  =  0.1438
   Total |  5.9545e+11  1605   370994885      Root MSE      =    17823

  income |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
      ed |  2314.839    140.757   16.446   0.000      2038.765    2590.914
   _cons | -10358.39   1824.838   -5.676   0.000     -13937.71   -6779.073

. predict mod1
. graph mod1 ed, connect(l) xlabel ylabel l1(model1 predicted income) b1(years of education)

The graph shows, as expected, that education is related to increased income. In fact, it shows a linear relationship between education and income: for every year of education, income is predicted to increase by $2,314.84.

[Graph: model1 predicted income against ed (years of education)]
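The graph command above uses Stata 6 syntax and no longer runs in current Stata. A minimal sketch of the same plot in modern twoway syntax (Stata 8 or later), assuming the same variable names:

. regress income ed
. predict mod1                          // fitted values from Model 1
. twoway connected mod1 ed, sort ytitle("model1 predicted income") xtitle("years of education")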

MODEL 2: Next, however, we introduce a dummy variable. A dummy variable has only two categories - 0 and 1. In this case, the dummy variable white = 1 if the individual is white, and 0 if he or she is nonwhite (in the case of this particular dataset, black or hispanic). The coefficient on this new variable asks whether there is a constant difference between whites and nonwhites in income when they have the same education.

. regress income ed white

  Source |       SS       df       MS         Number of obs =    1606
---------+------------------------------      F(  2,  1603) =  141.65
   Model |  8.9430e+10     2  4.4715e+10      Prob > F      =  0.0000
Residual |  5.0602e+11  1603   315668516      R-square      =  0.1502
---------+------------------------------      Adj R-square  =  0.1491
   Total |  5.9545e+11  1605   370994885      Root MSE      =    17767

  income |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
      ed |  2187.946   145.3798   15.050   0.000      1902.792    2473.101
   white |   3960.29   1187.863    3.334   0.000      1630.281    6290.138
   _cons | -11998.06   1884.423   -6.367   0.000     -15694.25   -8301.868

We want, again, to look at the predicted values, but we can plot them separately for whites and nonwhites. To do so, we set up two new variables: mod2w and mod2n contain the predicted values of income for whites and nonwhites respectively.

. predict mod2
. gen mod2w=mod2 if white==1
(294 missing values generated)
. gen mod2n=mod2 if white==0
(1312 missing values generated)
. graph mod2w mod2n ed, connect(ll) xlabel ylabel l1(model2 predicted income) b1(years of education)

When we graph these values, we find two parallel lines: the lines for whites and nonwhites differ only in their intercept. We can see how this happens by writing out the prediction equations for whites and nonwhites.

FOR NONWHITES: since white=0, b0 + b1 ed + b2 white becomes b0 + b1 ed
FOR WHITES: since white=1, b0 + b1 ed + b2 white becomes (b0 + b2) + b1 ed

What the dummy variable has done is to allow separate intercepts: b0 for nonwhites and b0 + b2 for whites.

[Graph: model2 predicted income against ed; two parallel lines, mod2w above mod2n]
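In Stata 11 or later the same model can be fit without typing the dummy by hand, using factor-variable notation, and the separate command splits the predictions by group; a sketch under those assumptions:

. regress income ed i.white             // i.white enters the 0/1 indicator
. predict mod2
. separate mod2, by(white)              // creates mod20 (nonwhites) and mod21 (whites)
. twoway connected mod20 mod21 ed, sort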

This model allows the intercepts to differ by race, BUT assumes the increase in income for each additional year of education is the same for whites and nonwhites alike.

MODEL 3: But suppose we want to ask whether or not the slope is the same. To do so, we can use an interaction term that is the product of the variable white and the variable education. This variable is 0 for nonwhites, but for whites it is equal to their education.

. gen edw = white*ed
. regress income ed white edw

  Source |       SS       df       MS         Number of obs =    1606
---------+------------------------------      F(  3,  1602) =  103.80
   Model |  9.6910e+10     3  3.2303e+10      Prob > F      =  0.0000
Residual |  4.9854e+11  1602   311196433      R-square      =  0.1628
---------+------------------------------      Adj R-square  =  0.1612
   Total |  5.9545e+11  1605   370994885      Root MSE      =    17641

  income |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
      ed |  1055.047   272.4575    3.872   0.000      520.6365    1589.458
   white | -14036.81   3855.684   -3.641   0.000     -21599.53   -6474.098
     edw |  1574.962   321.2463    4.903   0.000      944.8548    2205.069
   _cons |  267.0359   3124.041    0.086   0.932     -5860.331    6394.943

. predict mod3
. gen mod3w = mod3 if white==1
(294 missing values generated)
. gen mod3n = mod3 if white==0
(1312 missing values generated)
. graph mod3w mod3n ed, connect(ll) xlabel ylabel l1(model3 predicted income) b1(years of education)

What we see is that this method has allowed us to ask whether both the slope and the intercept differ for whites compared to nonwhites.

[Graph: model3 predicted income against ed; mod3w and mod3n are lines with different intercepts and slopes]

FOR NONWHITES: b0 + b1 ed + b2 white + b3 edw becomes, since white=0 and edw=0, b0 + b1 ed
FOR WHITES: b0 + b1 ed + b2 white + b3 edw becomes, since white=1 and edw=ed, (b0 + b2) + (b1 + b3) ed

What the dummy variable for white and its interaction with ed have done is to allow us to estimate separate intercepts and separate slopes for the relationship between education and income for whites and nonwhites. These analyses can be done separately for whites and nonwhites.
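With factor variables the interaction term need not be built by hand: c.ed##i.white expands to ed, white, and their product, and margins recovers the group-specific slopes. A sketch for Stata 11 or later:

. regress income c.ed##i.white          // main effects plus the ed x white product
. margins white, dydx(ed)               // slope of ed for nonwhites and for whites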

MODEL 4: Whites only

. regress income ed if white==1

  Source |       SS       df       MS         Number of obs =    1312
---------+------------------------------      F(  1,  1310) =  214.35
   Model |  7.4312e+10     1  7.4312e+10      Prob > F      =  0.0000
Residual |  4.5415e+11  1310   346679198      R-square      =  0.1406
---------+------------------------------      Adj R-square  =  0.1400
   Total |  5.2846e+11  1311   403098160      Root MSE      =    18619

  income |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
      ed |  2630.009   179.6354   14.641   0.000      2277.605    2982.414
   _cons | -13769.51   2385.149   -5.773   0.000     -18448.64   -9090.379

MODEL 5: Nonwhites only

. regress income ed if white==0

  Source |       SS       df       MS         Number of obs =     294
---------+------------------------------      F(  1,   292) =   30.70
   Model |  4.6664e+09     1  4.6664e+09      Prob > F      =  0.0000
Residual |  4.4387e+10   292   152010560      R-square      =  0.0951
---------+------------------------------      Adj R-square  =  0.0920
   Total |  4.9053e+10   293   167417490      Root MSE      =    12329

  income |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
      ed |  1055.047   190.4222    5.541   0.000      680.2731    1429.821
   _cons |  267.0359   2183.411    0.122   0.903     -4029.912    4564.524

Please note that these separate regressions give the same results as the single analysis in Model 3. The intercept and education coefficients for nonwhites in Model 5 are the same as in Model 3. The intercept in Model 4 is the sum of the intercept and the coefficient for white in Model 3. The coefficient for education in Model 4 is the sum of the coefficient for education and that for edw in Model 3.
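After fitting Model 3, the whites-only intercept and slope can also be recovered directly, with standard errors, using lincom; a sketch:

. regress income ed white edw
. lincom _cons + white                  // whites' intercept: b0 + b2
. lincom ed + edw                       // whites' slope: b1 + b3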

MORE THAN TWO CATEGORIES

MODEL 6: We can extend the analysis to look at blacks and hispanics separately, so that now we have three categories: white, black, hispanic. To carry out this analysis, we need 2 dummy variables. In this case, I choose to use black = 1 if black, hisp = 1 if hispanic, and zero otherwise. Whites are zero on both these variables.

. regress income ed black hisp

  Source |       SS       df       MS         Number of obs =    1606
---------+------------------------------      F(  3,  1602) =   94.52
   Model |  8.9546e+10     3  2.9849e+10      Prob > F      =  0.0000
Residual |  5.0590e+11  1602   315793524      R-square      =  0.1504
---------+------------------------------      Adj R-square  =  0.1488
   Total |  5.9545e+11  1605   370994885      Root MSE      =    17771

  income |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
      ed |  2193.269   145.6749   15.056   0.000      1907.535    2479.002
   black | -4499.993   1486.226   -3.028   0.003     -7415.144   -1584.841
    hisp | -3234.025   1689.554   -1.914   0.056     -6547.993    79.94294
   _cons | -8106.863   1951.455   -4.154   0.000     -11934.54   -4279.189

. predict mod6
. gen mod6w = mod6 if (black+hisp==0)
(294 missing values generated)
. gen mod6b = mod6 if black==1
(1440 missing values generated)
. gen mod6h = mod6 if hisp==1
(1478 missing values generated)
. graph mod6w mod6b mod6h ed, connect(lll) xlabel ylabel l1(model6 predicted income) b1(years of education)

[Graph: model6 predicted income against ed; three parallel lines for mod6w, mod6b, mod6h]

Even though hisp is not significantly different from zero, I used the plots anyway. In this case, we got three parallel lines, one for each race-ethnic group.
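The two dummies can equivalently be packaged as one three-category variable and entered with factor-variable notation (Stata 11 or later); a sketch, where race is a constructed variable not in the original dataset:

. gen race = 1 + black + 2*hisp         // 1 = white, 2 = black, 3 = hispanic
. regress income ed i.race              // white (race==1) is the omitted base category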

MODEL 7: We can, further, allow the slopes to vary by creating the same kind of interaction variable as before:

. gen edb = ed*black
. gen edh = ed*hisp
. regress income ed black edb hisp edh

  Source |       SS       df       MS         Number of obs =    1606
---------+------------------------------      F(  5,  1600) =   62.41
   Model |  9.7176e+10     5  1.9435e+10      Prob > F      =  0.0000
Residual |  4.9827e+11  1600   311419520      R-square      =  0.1632
---------+------------------------------      Adj R-square  =  0.1606
   Total |  5.9545e+11  1605   370994885      Root MSE      =    17647

  income |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
      ed |  2630.009   170.2554   15.447   0.000      2296.062    2963.956
   black |  10850.04   5230.695    2.074   0.038       590.348    21109.77
     edb | -1301.314   437.3397   -2.976   0.003     -2159.133   -443.4946
    hisp |  16558.73   4747.557    3.488   0.000       7246.65    25870.82
     edh | -1803.826    411.535   -4.383   0.000     -2611.022    -996.637
   _cons | -13769.51    2260.65   -6.091   0.000     -18203.57   -9335.451

. predict mod7
. gen mod7w = mod7 if black+hisp==0
(294 missing values generated)
. gen mod7b = mod7 if black==1
(1440 missing values generated)
. gen mod7h = mod7 if hisp==1
(1478 missing values generated)
. graph mod7w mod7b mod7h ed, connect(lll) symbol(iii)

[Graph: model7 predicted income against ed; lines for mod7w, mod7b, mod7h with different intercepts and slopes]

In this case, since all coefficients are significant, we see that the slopes and intercepts differ: there is a different starting value (or intercept) and a different slope for each group. The starting values are higher for blacks and hispanics - i.e., at low levels of education, income is higher. BUT the increase with education is lower (the edb and edh variables have negative coefficients). What this means is that the lines cross, and as education increases, whites outstrip the other groups.

WHAT OTHER VARIABLES MIGHT YOU WANT TO INCLUDE TO HAVE A FULLY DEVELOPED MODEL?
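Using the constructed race variable sketched earlier, Model 7 collapses to a single factor-variable regression, and the three education slopes can be read off with margins (again assuming Stata 11 or later):

. regress income c.ed##i.race
. margins race, dydx(ed)                // one education slope per race-ethnic group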

This part is up to you, the investigator. Statistics can't define issues for you -- using statistics, we can only say whether or not a particular model describes our data well -- or poorly.

CONVERTING AN INTERVAL VARIABLE TO DUMMY VARIABLES

Many of you asked whether the relationship of income and education was really linear. There are a number of ways of looking at that question. The one I want to introduce today is the use of dummy variables - turning education into a categorical variable. We know that education is measured in years:

. sum ed

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+------------------------------------------------------
      ed |    1606     12.5741    3.16768          0         18

. * treat ed as a categorical variable: categories lths, HS, somecl, col
. * need 3 dummy variables
. gen lths=0
. replace lths=1 if ed<12
(370 real changes made)
. gen HS=0
. replace HS=1 if ed==12
(583 real changes made)
. gen somecl=0
. replace somecl=1 if ed>12 & ed<16
(295 real changes made)
. gen col=0
. replace col=1 if ed>=16
(358 real changes made)

. regress income HS somecl col

  Source |       SS       df       MS         Number of obs =    1606
---------+------------------------------      F(  3,  1602) =  100.71
   Model |  9.4478e+10     3  3.1493e+10      Prob > F      =  0.0000
Residual |  5.0097e+11  1602   312714835      R-square      =  0.1587
---------+------------------------------      Adj R-square  =  0.1571
   Total |  5.9545e+11  1605   370994885      Root MSE      =    17684

  income |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
      HS |  4445.251     1175.4    3.782   0.000      2139.768    6750.733
  somecl |  9627.242   1380.299    6.975   0.000      6919.864    12334.62
     col |  21230.94   1310.984   16.195   0.000      18659.52    23802.37
   _cons |  10633.87   919.3341   11.567   0.000      8830.649     12437.1
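The four categories can also be produced in one step; a sketch using egen's cut() function, with the category boundaries taken from the definitions above:

. egen edcat = cut(ed), at(0 12 13 16 19)   // bins [0,12) [12,13) [13,16) [16,19)
. tab edcat, gen(edd)                       // edd1-edd4 reproduce lths, HS, somecl, col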

Our model:

    Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi3 + εi

or

    INCOME = β0 + β1 HS + β2 somecl + β3 col + ε

The estimated model is:

    predicted INCOME = 10633.87 + 4445.251 HS + 9627.242 somecl + 21230.94 col

All of the coefficients are significant. What the results say is that, compared to those with less than a high school education, income for those with a high school education is, on average, $4,445 higher; for those who attend college less than 4 years, $9,627 higher; and for those who have 4+ years of college, $21,231 higher - indicating that the increase is not likely to be linear. To see this more clearly, we could have constructed 18 dummy variables (since we have 19 years, 0-18) and tested it for each year.

I next added in other variables and will call this the small model. It can also be referred to as a main effects model since it contains no interaction terms.

. regress income HS somecl col black hisp female

  Source |       SS       df       MS         Number of obs =    1606
---------+------------------------------      F(  6,  1599) =   99.91
   Model |  1.6236e+11     6  2.7060e+10      Prob > F      =  0.0000
Residual |  4.3309e+11  1599   270848150      R-square      =  0.2727   (small model)
---------+------------------------------      Adj R-square  =  0.2699
   Total |  5.9545e+11  1605   370994885      Root MSE      =    16457

  income |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
      HS |  3720.204   1116.481    3.332   0.000      1530.284    5910.125
  somecl |  8295.908   1297.519    6.394   0.000      5750.892    10840.92
     col |  18724.94   1256.588   14.901   0.000      16260.21    21189.68
   black | -5106.125     1376.6   -3.709   0.000     -7806.255   -2405.995
    hisp | -6095.616    1555.38   -3.919   0.000     -9146.414   -3044.818
  female | -12480.71   826.7567  -15.096   0.000     -14102.35   -10859.07
   _cons |  19397.17   1042.668   18.603   0.000      17352.03    21442.31

The next step is to consider again interactions between education and EACH of the race-ethnic and gender variables. We have to create interactions with EACH dummy variable representing a category of education. I'll refer to the resulting model as the large model. (These nine gen commands could also be written as a loop; see the sketch below.)

. gen HSb = HS*black
. gen HSh = HS*hisp
. gen HSf = HS*female
. gen someclf = somecl*female
. gen someclb = somecl*black
. gen someclh = somecl*hisp
. gen colf = col*female
. gen colh = col*hisp
. gen colb = col*black
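A sketch of the same nine gen commands as a loop over the education dummies (the b/h/f suffix convention is the one used in this handout):

foreach e in HS somecl col {
    gen `e'b = `e'*black
    gen `e'h = `e'*hisp
    gen `e'f = `e'*female
}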

. regress income HS somecl col black hisp female HSb HSh HSf someclf someclb someclh colf colh colb

  Source |       SS       df       MS         Number of obs =    1606
---------+------------------------------      F( 15,  1590) =   44.99
   Model |  1.7742e+11    15  1.1828e+10      Prob > F      =  0.0000
Residual |  4.1802e+11  1590   262907569      R-square      =  0.2980   (large model)
---------+------------------------------      Adj R-square  =  0.2913
   Total |  5.9545e+11  1605   370994885      Root MSE      =    16214

  income |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
      HS |  5006.135    1761.53    2.842   0.005      1551.023    8461.247
  somecl |  8801.515   1991.925    4.419   0.000       4894.44    12708.59
     col |  26456.23    1840.15   14.377   0.000      22846.85    30065.60
   black | -5970.074   2250.831   -2.652   0.008     -10384.98   -1555.166
    hisp | -4607.089   2322.167   -1.984   0.047      -9161.92   -52.25787
  female | -8164.521   1694.129   -4.819   0.000     -11487.48   -4841.559
     HSb |  2901.335   3195.546    0.908   0.364     -3366.592    9169.262
     HSh |  1165.673   3874.066    0.301   0.764     -6433.142    8764.488
     HSf | -2863.494   2173.997   -1.317   0.188     -7127.643    1400.655
 someclf | -900.7478   2537.782   -0.355   0.723     -5878.498    4077.002
 someclb |  932.2547   3942.525    0.236   0.813     -6800.839    8665.349
 someclh |  723.4574   4123.096    0.175   0.861     -7363.819    8810.733
    colf | -13985.45   2413.787   -5.794   0.000     -18719.99   -9250.912
    colh | -18291.49   5141.572   -3.558   0.000     -28376.46    -8206.52
    colb | -2745.436   4837.948   -0.567   0.570     -12234.86    6743.991
   _cons |  16937.81   1388.819   12.196   0.000      14213.71    19661.91

Lots of the variables are now non-significant. Can we DROP all of them? Is it really the case that the coefficients for HSb HSh HSf someclf someclb someclh and colb are ALL not significantly different from zero?

THE F-TEST

Here's where we use the F-test to our advantage. Remember, the F-test asks the question whether R² for the model that includes these variables is significantly greater than the R² for the model that omits them, i.e. it asks whether the difference in the two R² values is significantly different from zero. In our case, the value for the large model is 0.2980 and for the small it's 0.2727. Equivalently, it asks whether the residual sum of squares (RSS) in the large model (here 4.1802e+11) is significantly smaller than the RSS for the model with fewer variables (here 4.3309e+11). This asks the question whether the estimated Y's are closer to the observed ones when we include these additional variables (even though each appears, alone, not to be significant). The test statistic is

        (RSS small model - RSS large model) / (df small model - df large model)
    F = -----------------------------------------------------------------------
                          RSS large model / df large model

The denominator also appears on our printout as the Residual MS or mean square residual. Please note that the difference in degrees of freedom for the two models is equal to the number of new variables introduced when we expand from the small model to the large one. One such test is done automatically for you in every regression output where you see an F value: this is the
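As a check, the overall F printed in the header of the large model can be reproduced by hand from the sums of squares in the output above (a sketch):

. * reproduce F(15, 1590) = 44.99 from Total SS, Residual SS, and df:
. display ((5.9545e+11 - 4.1802e+11)/15) / (4.1802e+11/1590)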

particular test for the small model where all Y's are estimated to have the same value, which we call Model C:

    Model C:  Ŷ = constant

For example, for our large model, F(15, 1590) = 44.99 and Prob > F = 0.0000. We have 15 variables (the X's) more than the model with only a constant, and we have 1,606 observations and 16 parameters, or 1590 df for the large model. Please note that we can calculate the F statistic ourselves from the output when we are comparing to the model with only a constant. Recall that the RSS_C for Model C is the Total SS in the output. In this case, the F statistic for comparing our model to Model C is

    (Total SS - Residual SS)/Model df     Model SS/Model df      Model MS
    ---------------------------------  =  -----------------  =  -----------
              Residual MS                    Residual MS        Residual MS

When we want to compare a large model to a small one that still has some predictors, we have to use the more complicated expression given on the previous page -- or ask STATA to do it for us. After you issue the command

. regress income HS somecl col black hisp female HSb HSh HSf someclf someclb someclh colf colh colb
. test HSb HSh HSf someclf someclb someclh colb

 ( 1)  HSb = 0.0
 ( 2)  HSh = 0.0
 ( 3)  HSf = 0.0
 ( 4)  someclf = 0.0
 ( 5)  someclb = 0.0
 ( 6)  someclh = 0.0
 ( 7)  colb = 0.0

       F(  7,  1590) =    0.52
            Prob > F =    0.8220

Here we do not reject the joint hypothesis that these coefficients are all zero. We can then estimate the model with them omitted:
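The 0.52 that test reports can likewise be reproduced by hand from the formula on the previous page, taking the restricted model's RSS (4.1898e+11 on 1597 df, from the regression that follows) and the large model's RSS (4.1802e+11 on 1590 df); a sketch:

. display ((4.1898e+11 - 4.1802e+11)/(1597 - 1590)) / (4.1802e+11/1590)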

. regress income HS somecl col black hisp female colf colh

  Source |       SS       df       MS         Number of obs =    1606
---------+------------------------------      F(  8,  1597) =   84.08
   Model |  1.7647e+11     8  2.2058e+10      Prob > F      =  0.0000
Residual |  4.1898e+11  1597   262354175      R-square      =  0.2964
---------+------------------------------      Adj R-square  =  0.2928
   Total |  5.9545e+11  1605   370994885      Root MSE      =    16197

  income |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
      HS |  3880.639   1100.882    3.525   0.000      1721.313    6039.966
  somecl |  8562.271   1277.951    6.700   0.000      6055.633    11068.91
     col |  25685.42   1580.294   16.254   0.000      22585.75    28785.09
   black | -4993.614   1355.561   -3.684   0.000      -7652.48   -2334.747
    hisp | -4036.375   1623.255   -2.487   0.013      -7220.31    -852.442
  female |  -9726.11    924.073  -10.525   0.000     -11538.63   -7913.586
    colf | -12380.23   1949.757   -6.350   0.000     -16204.58   -8555.877
    colh | -18696.68   4855.455   -3.851   0.000     -28220.41   -9172.944
   _cons |   17526.3   1058.191   16.563   0.000      15450.71    19601.89

[Graphs: histograms (fraction scale) of income and of loginc, and residual-versus-fitted plots for the model predicting income and the model predicting log income]

The graphs above were created using income itself and then the log of income. The top graphs show that income is not at all normally distributed, while loginc = log(income) is reasonably close to normal. Why is this important? The same predictors were used in two regressions, but the outcomes are quite different. Before we go too far in interpreting the results, however, we should use a non-linear education variable -- either by using the dummy variables generated earlier or by introducing an ed² term. We'll do this as part of the next homework.
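In current Stata the two histograms could be redrawn with the histogram command (Stata 8 or later); a sketch:

. histogram income, fraction
. histogram loginc, fraction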

. use discrim
. * the smallest value of income is -9391 -- add 9392 to every value so that all values are > 0

MODEL 1: INCOME is the dependent variable

. regress inc ed black hisp female edb edh edf

  Source |       SS       df       MS         Number of obs =    1606
---------+------------------------------      F(  7,  1598) =   89.84
   Model |  1.6816e+11     7  2.4023e+10      Prob > F      =  0.0000
Residual |  4.2729e+11  1598   267387523      R-squared     =  0.2824
---------+------------------------------      Adj R-squared =  0.2793
   Total |  5.9545e+11  1605   370994885      Root MSE      =    16352

     inc |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
      ed |  3214.128    207.203   15.512   0.000       2807.71    3620.547
   black |  9480.176   4849.814    1.955   0.051     -32.48913    18992.84
    hisp |  15332.85    4400.84    3.484   0.001      6700.827    23964.88
  female |  6304.197   3364.527    1.874   0.061     -295.1536    12903.55
     edb |  -1204.08   405.4487   -2.970   0.003     -1999.347   -408.8125
     edh | -1773.716   381.4182   -4.650   0.000     -2521.849   -1025.584
     edf | -1493.801   259.0554   -5.766   0.000     -2001.925   -985.6767
   _cons | -14496.23   2772.811   -5.228   0.000     -19934.96    -9057.55

. predict inchat
. predict incres, resid
. graph incres inchat, yline(0) xlabel ylabel b1(model predicting income)

MODEL 2: log income is the dependent variable

. gen loginc = inc+9392
. replace loginc = log(loginc)
. regress loginc ed black hisp female edb edh edf

  Source |       SS       df       MS         Number of obs =    1606
---------+------------------------------      F(  7,  1598) =   73.02
   Model |  161.89468     7  23.1278114      Prob > F      =  0.0000
Residual |  506.156362  1598  .316743656      R-squared     =  0.2423
---------+------------------------------      Adj R-squared =  0.2390
   Total |  668.051042  1605  .416231179      Root MSE      =    .5628

  loginc |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
      ed |  .0816873   .0071315   11.454   0.000      .0676993    .0956754
   black | -.0060227     .16692   -0.036   0.971     -.3334279    .3213825
    hisp |  .3179774   .1514673    2.099   0.036      .0208819    .6150729
  female | -.1216676   .1157997   -1.051   0.294     -.3488029    .1054676
     edb |  -.013138   .0139547   -0.941   0.347      -.040593    .0142334
     edh |  -.044542   .0131276   -3.393   0.001     -.0702911   -.0187929

     edf | -.0234169   .0089161   -2.626   0.009     -.0409054   -.0059284
   _cons |  9.275616   .0954341   97.194   0.000      9.088427    9.462806

. predict loghat
. predict logres, resid
. graph logres loghat, yline(0) xlabel ylabel b1(model predicting log income)
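In current Stata the residual-versus-fitted plot for each model can be drawn immediately after the regression with the post-estimation command rvfplot; a sketch:

. regress loginc ed black hisp female edb edh edf
. rvfplot, yline(0)                     // residuals against fitted values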