DUMMY VARIABLES AND INTERACTIONS

Let's start with an example in which we are interested in discrimination in income. We have a dataset that includes information for about 1,600 people on their income, their education, and their race-ethnic group (as well as additional variables that we shall, for the present, ignore and that were eliminated from this data subset).

. use discrim
. des

Contains data from discrim.dta
  obs:        1,606
 vars:            6
 size:       44,968
------------------------------------------
  1. ed       float   %9.0g
  2. income   float   %9.0g
  3. female   float   %9.0g
  4. black    float   %9.0g
  5. hisp     float   %9.0g
  6. white    float   %9.0g
------------------------------------------
Sorted by:

MODEL 1: The first model includes only education as a predictor.

. regress income ed

[regression output omitted; F(1, 1604), Adj R-squared = 0.1438]

. predict mod1
. graph mod1 ed, connect(l) xlabel ylabel l1(model1 predicted income) b1(years of education)

[graph: model1 predicted income plotted against years of education]

The graph shows, as expected, that education is related to increased income. In fact, it shows a linear relationship between education and income -- for every year of education, income is predicted to increase by the same fixed amount, the coefficient on ed.
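The graph command above is from an older release of Stata and is no longer accepted by current versions. As a minimal sketch only -- assuming a reasonably recent Stata and the mod1 variable created by predict above -- the same picture can be drawn with twoway:

. * sketch: redraw the Model 1 prediction line with current graph syntax
. twoway (line mod1 ed, sort), ytitle("model1 predicted income") xtitle("years of education")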

MODEL 2: Next, however, we introduce a dummy variable. A dummy variable has only two categories - 0 and 1. In this case, the dummy variable white = 1 if the individual is white, and 0 if he or she is nonwhite (in the case of this particular dataset, black or hispanic). The coefficient on this new variable asks whether there is a constant difference between whites and nonwhites in income when they have the same education.

. regress income ed white

[regression output omitted; F(2, 1603), Adj R-squared = 0.1491]

We want, again, to look at the predicted values, but we can plot them separately for whites and nonwhites. To do so, we set up two new variables: mod2w and mod2n contain the predicted values of income for whites and nonwhites respectively.

. predict mod2
. gen mod2w=mod2 if white==1
(294 missing values generated)
. gen mod2n=mod2 if white==0
(1312 missing values generated)
. graph mod2w mod2n ed, connect(ll) xlabel ylabel l1(model2 predicted income) b1(years of education)

[graph: model2 predicted income against years of education, with one line for whites (mod2w), intercept b0 + b2, and one for nonwhites (mod2n), intercept b0]

When we graph these values, we find two parallel lines: the lines for whites and nonwhites differ only in their intercept. We can see how this happens by writing out the prediction equations for whites and nonwhites.

FOR NONWHITES: since white = 0,
    b0 + b1(ed) + b2(white)   becomes   b0 + b1(ed)

FOR WHITES: since white = 1,
    b0 + b1(ed) + b2(white)   becomes   (b0 + b2) + b1(ed)

What the dummy variable has done is to allow us to estimate separate intercepts for whites and nonwhites.
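If you are using Stata 11 or later, factor-variable notation will supply the dummy for you. This is only a sketch; mod2fv is just a name chosen here so the earlier mod2 is not overwritten:

. * sketch: Model 2 with factor-variable notation; i.white enters the 0/1 dummy automatically
. regress income ed i.white
. predict mod2fv
. twoway (line mod2fv ed if white==1, sort) (line mod2fv ed if white==0, sort), ytitle("model2 predicted income") xtitle("years of education")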

This model allows the intercepts to differ by race, BUT assumes the increase in income for each additional year of education is the same for whites and nonwhites alike.

MODEL 3: But suppose we want to ask whether or not the slope is the same. To do so, we can use an interaction term that is the product of the variable white and the variable education. This variable is 0 for nonwhites, but for whites it is equal to their education.

. gen edw = white*ed
. regress income ed white edw

[regression output omitted; F(3, 1602), Adj R-squared = 0.1612]

. predict mod3
. gen mod3w = mod3 if white==1
(294 missing values generated)
. gen mod3n = mod3 if white==0
(1312 missing values generated)
. graph mod3w mod3n ed, connect(ll) xlabel ylabel l1(model3 predicted income) b1(years of education)

[graph: model3 predicted income against years of education, with separate lines for whites (mod3w) and nonwhites (mod3n)]

What we see is that this method has allowed us to ask whether both the slope and the intercept differ for whites compared to nonwhites.

FOR NONWHITES: since white = 0 and edw = 0,
    b0 + b1(ed) + b2(white) + b3(edw)   becomes   b0 + b1(ed)

FOR WHITES: since white = 1 and edw = ed,
    b0 + b1(ed) + b2(white) + b3(edw)   becomes   (b0 + b2) + (b1 + b3)(ed)

What the dummy variable for white and its interaction with ed have done is to allow us to estimate separate intercepts and separate slopes for the relationship between education and income for whites and nonwhites. These analyses can also be done separately for whites and nonwhites.
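Under the same Stata 11+ assumption, the ## operator builds the main effects and the interaction in one step, and lincom then recovers the white intercept and slope without hand computation. A sketch:

. * sketch: Model 3 using the ## interaction operator instead of a hand-made edw variable
. regress income c.ed##i.white
. * intercept for whites, b0 + b2
. lincom _cons + 1.white
. * education slope for whites, b1 + b3
. lincom ed + 1.white#c.ed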

MODEL 4: Whites only

. regress income ed if white==1

[regression output omitted; Number of obs = 1312, F(1, 1310)]

MODEL 5: Nonwhites only

. regress income ed if white==0

[regression output omitted; Number of obs = 294, F(1, 292)]

Please note that these separate regressions give the same results as the single analysis in Model 3. The intercept and education coefficients for nonwhites in Model 5 are the same as in Model 3. The intercept in Model 4 is the sum of the intercept and the coefficient for white in Model 3. The coefficient for education in Model 4 is the sum of the coefficient for education and that for edw in Model 3.
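Since regress accepts the by prefix, Models 4 and 5 can also be requested with a single command; a sketch:

. * sketch: fit the education regression separately within each value of white
. bysort white: regress income ed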

MORE THAN TWO CATEGORIES

MODEL 6: We can extend the analysis to look at blacks and hispanics separately, so that now we have three categories: white, black, hispanic. To carry out this analysis, we need 2 dummy variables. In this case, I choose to use black = 1 if black, hisp = 1 if hispanic, and zero otherwise. Whites are zero on both these variables.

. regress income ed black hisp

[regression output omitted; F(3, 1602), Adj R-squared = 0.1488]

. predict mod6
. gen mod6w = mod6 if (black+hisp==0)
(294 missing values generated)
. gen mod6b = mod6 if black==1
(1440 missing values generated)
. gen mod6h = mod6 if hisp==1
(1478 missing values generated)
. graph mod6w mod6b mod6h ed, connect(lll) xlabel ylabel l1(model6 predicted income) b1(years of education)

[graph: model6 predicted income against years of education, with lines for whites (mod6w), blacks (mod6b), and hispanics (mod6h)]

Even though the coefficient for hisp is not significantly different from zero, I used the plots anyway. In this case, we got three parallel lines, one for each race-ethnic group.
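A sketch of an alternative setup, assuming every case in these data is white, black, or hispanic: build one categorical race-ethnic variable and let factor-variable notation (Stata 11 or later) create the two dummies, with whites again the omitted category. The names raceth and racelbl are simply chosen here:

. * sketch: one categorical variable in place of the separate black and hisp dummies
. gen raceth = 1 if white==1
. replace raceth = 2 if black==1
. replace raceth = 3 if hisp==1
. label define racelbl 1 "white" 2 "black" 3 "hispanic"
. label values raceth racelbl
. regress income ed i.raceth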

MODEL 7: We can, further, allow the slopes to vary by creating the same kind of interaction variables as before:

. gen edb = ed*black
. gen edh = ed*hisp
. regress income ed black edb hisp edh

[regression output omitted; F(5, 1600)]

. predict mod7
. gen mod7w = mod7 if black+hisp==0
(294 missing values generated)
. gen mod7b = mod7 if black==1
(1440 missing values generated)
. gen mod7h = mod7 if hisp==1
(1478 missing values generated)
. graph mod7w mod7b mod7h ed, connect(lll) symbol(iii)

[graph: model7 predicted income against years of education, with lines for whites (mod7w), blacks (mod7b), and hispanics (mod7h)]

In this case, since all coefficients are significant, we see that the slopes and intercepts differ: there is a different starting value (or intercept) and a different slope for each group. The starting values are higher for blacks and hispanics - i.e., at low levels of education, their income is higher. BUT the increase with education is lower (the edb and edh variables have negative coefficients). What this means is that the lines cross, and as education increases, whites outstrip the other groups.

WHAT OTHER VARIABLES MIGHT YOU WANT TO INCLUDE TO HAVE A FULLY DEVELOPED MODEL?
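Continuing the sketch begun after Model 6 (so it assumes the raceth variable defined there and Stata 12 or later for marginsplot), the crossing fitted lines can be drawn from the interaction model directly; the 0-18 range in at() is assumed rather than taken from the output:

. * sketch: Model 7 with factor variables, then plot the fitted line for each group
. regress income c.ed##i.raceth
. margins raceth, at(ed=(0(2)18))
. marginsplot, noci ytitle("model7 predicted income") xtitle("years of education")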

This part is up to you, the investigator. Statistics can't define issues for you -- using statistics, we can only say whether or not a particular model describes our data well -- or poorly.

CONVERTING AN INTERVAL VARIABLE TO DUMMY VARIABLES

Many of you asked whether the relationship of income and education was really linear. There are a number of ways of looking at that question. The one I want to introduce today is the use of dummy variables - turning education into a categorical variable. We know that education is measured in years:

. sum ed

[summary output omitted: Obs, Mean, Std. Dev., Min, and Max for ed]

. * treat ed as a categorical variable: categories lths, HS, somecl, col
. * need 3 dummy variables
. gen lths=0
. replace lths=1 if ed<12
(370 real changes made)
. gen HS=0
. replace HS=1 if ed==12
(583 real changes made)
. gen somecl=0
. replace somecl=1 if ed>12 & ed<16
(295 real changes made)
. gen col=0
. replace col=1 if ed>=16
(358 real changes made)
. regress income HS somecl col

[regression output omitted; F(3, 1602), Adj R-squared = 0.1571]
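As a sketch only (and assuming ed takes whole-year values), the recode command can create the same four categories, with value labels, in a single step; edcat is just a name chosen here, and lths is again the omitted group:

. * sketch: the four education categories in one recode command
. recode ed (min/11 = 1 "lths") (12 = 2 "HS") (13/15 = 3 "somecl") (16/max = 4 "col"), gen(edcat)
. regress income i.edcat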

Our model:

    Y_i = β0 + β1 X_i1 + β2 X_i2 + β3 X_i3 + ε_i

or

    INCOME = β0 + β1 HS + β2 somecl + β3 col + ε

The estimated model is:

    predicted INCOME = b0 + 4445 HS + 9627 somecl + 21231 col

where the estimated intercept b0 is the average income of those with less than a high school education, the omitted category. All of the coefficients are significant. What the results say is that, compared to those with less than high school education, income for those with a high school education is, on average, $4445 higher; for those who attend college less than 4 years, $9627 higher; and for those who have 4+ years of college, $21231 higher - indicating that the increase is not likely to be linear. To see this more clearly, we could have constructed 18 dummy variables (since education takes 19 distinct values of years) and tested the effect of each year.

I next added in other variables and will call this the small model. It can also be referred to as a main effects model since it contains no interaction terms.

. regress income HS somecl col black hisp female

[regression output omitted; F(6, 1599), Residual SS = 4.339e+11, Adj R-squared = 0.2699 -- the "small" model]

The next step is to consider again interactions between education and EACH of the race-ethnic and gender variables. We have to create interactions with EACH dummy variable representing a category of education. I'll refer to the resulting model as the large model.

. gen HSb = HS*black
. gen HSh = HS*hisp
. gen HSf = HS*female
. gen someclf = somecl*female
. gen someclb = somecl*black
. gen someclh = somecl*hisp
. gen colf = col*female
. gen colh = col*hisp
. gen colb = col*black
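In a do-file, the nine gen commands above can also be written as a loop; this is just a sketch, and note that it produces longer names (HSblack rather than HSb) unless you rename them afterwards:

* sketch: build all education-by-group interaction terms with a loop
foreach e of varlist HS somecl col {
    foreach g of varlist black hisp female {
        generate `e'`g' = `e' * `g'
    }
}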

. regress income HS somecl col black hisp female HSb HSh HSf someclf someclb someclh colf colh colb

[regression output omitted; F(15, 1590), Residual SS = 4.182e+11, Adj R-squared = 0.2913 -- the "large" model]

Lots of the variables are now non-significant. Can we DROP all of them? Is it really the case that the coefficients for HSb HSh HSf someclf someclb someclh and colb are ALL not significantly different from zero?

THE F-TEST

Here's where we use the F-test to our advantage. Remember, the F-test asks the question whether R2 for the model that includes these variables is significantly greater than the R2 for the model that omits them, i.e. it asks whether the difference in the two R2 values is significantly different from zero. In our case, the R2 value for the large model is .298; the value for the small model is slightly lower. Equivalently, it asks whether the residual sum of squares (RSS) in the large model (here 4.182e+11) is significantly smaller than the RSS for the model with fewer variables (here 4.339e+11). This asks the question whether the estimated Y's are closer to the observed ones when we include these additional variables (even though each appears, alone, not to be significant). The test statistic is

         (RSS small model - RSS large model) / (df small model - df large model)
    F = -------------------------------------------------------------------------
                         RSS large model / df large model

The denominator also appears on our printout as the Residual MS or mean square residual. Please note that the difference in degrees of freedom for the two models is equal to the number of new variables introduced when we expand from the small model to the large one.
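The same F statistic can be computed from Stata's stored results rather than copied off the printout; a sketch, using the e(rss) and e(df_r) values that regress saves:

. * sketch: incremental F test for the large model against the small model
. regress income HS somecl col black hisp female
. scalar rss_small = e(rss)
. scalar df_small = e(df_r)
. regress income HS somecl col black hisp female HSb HSh HSf someclf someclb someclh colf colh colb
. scalar rss_large = e(rss)
. scalar df_large = e(df_r)
. scalar F = ((rss_small - rss_large)/(df_small - df_large)) / (rss_large/df_large)
. display "F = " F "   Prob > F = " Ftail(df_small - df_large, df_large, F)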

One such test is done automatically for you in every regression output where you see an F value: this is the particular test in which the small model is one where all the Y's are estimated to have the same value -- a model we call Model C:

    Model C:   Ŷ = a constant

For example, for our large model, F(15, 1590) = 44.9 and Prob > F = 0.0000. We have 15 variables (the X's) more than the model with only a constant, and we have 1,606 observations and 16 parameters, or 1,590 df for the large model.

Please note that we can calculate the F statistic ourselves from the output when we are comparing to the model with only a constant. Recall that the RSS for Model C is the Total SS in the output. In this case, the F statistic for comparing our model to Model C is

         (Total SS - Residual SS) / Model df     Model SS / Model df       Model MS
    F = ------------------------------------- = --------------------- = -------------
                     Residual MS                     Residual MS          Residual MS

When we want to compare a large model to a small one that still has some predictors, we have to use the more complicated expression given above -- or ask STATA to do it for us. After you issue the commands

. regress income HS somecl col black hisp female HSb HSh HSf someclf someclb someclh colf colh colb
. test HSb HSh HSf someclf someclb someclh colb

 ( 1)  HSb = 0
 ( 2)  HSh = 0
 ( 3)  HSf = 0
 ( 4)  someclf = 0
 ( 5)  someclb = 0
 ( 6)  someclh = 0
 ( 7)  colb = 0

       F(  7, 1590) =    0.52
            Prob > F =    0.822

Here we do not reject the joint hypothesis that these coefficients are all zero. We can then estimate the model with them omitted:

. regress income HS somecl col black hisp female colf colh

[regression output omitted; F(8, 1597), Adj R-squared = 0.2928]

[Four graphs: histograms (fraction scale) of income and of loginc, and residual-versus-fitted plots for the model predicting income and the model predicting log income]

The graphs above were created using income itself and then the log of income. The top graphs show that income is not at all normally distributed, while loginc = log(income) is reasonably close to normal. Why is this important? The same predictors were used in two regressions, but the outcomes are quite different. Before we go too far in interpreting the results, however, we should use a non-linear education variable -- either by using the dummy variables generated earlier or by introducing an ed-squared term. We'll do this as part of the next homework.
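A sketch of how graphs like these can be produced with standard commands (histogram and rvfplot); capture simply skips the gen if edf already exists, and the log-income versions repeat the same commands once loginc has been created, as shown below:

. * sketch: distribution of income and residual-versus-fitted plot for the income model
. histogram income, fraction
. capture gen edf = ed*female
. regress income ed black hisp female edb edh edf
. rvfplot, yline(0)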

. use discrim
. * the smallest value of income is below zero: add 9392 to every value so that all values are > 0

MODEL 1: INCOME is the dependent variable

. regress inc ed black hisp female edb edh edf

[regression output omitted; F(7, 1598), Adj R-squared = 0.2793]

. predict inchat
. predict incres, resid
. graph incres inchat, yline(0) xlabel ylabel b1(model predicting income)

MODEL 2: log income is the dependent variable

. gen loginc = inc + 9392
. replace loginc = log(loginc)
. regress loginc ed black hisp female edb edh edf

[regression output omitted; F(7, 1598)]

. predict loghat
. predict logres, resid
. graph logres loghat, yline(0) xlabel ylabel b1(model predicting log income)
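The two graph commands above also use the older syntax; a sketch of the same residual-versus-fitted plots in current twoway syntax, using the variables just created:

. * sketch: residual-versus-fitted plots with modern graph syntax
. twoway scatter incres inchat, yline(0) ytitle("Residuals") xtitle("Fitted values") title("model predicting income")
. twoway scatter logres loghat, yline(0) ytitle("Residuals") xtitle("Fitted values") title("model predicting log income")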
