
22s:152 Applied Linear Regression
DeCook, Fall 2008
Lab 3, Friday October 3

1. The data set

In 2004, a study was done to examine whether gender, after controlling for other variables, was a significant predictor of salary for science, technology, engineering, and math (STEM) faculty at Iowa State University. All the information is publicly available, but the names have been removed, and this is a subset of the full set of variables. The subsetted data can be found in salary_ISU_data.csv.

VARIABLES
Department : one of 28 different departments
Rank Code : rank of faculty
    1 Full professor
    2 Associate professor
    3 Assistant professor
Gender : male or female
Salary 9 mo : 9-month salary for faculty member for 2004
Avg Cont Grants : average contracts and grants for fiscal years 2001, 2002, ...

2. Subsetting the data

The data set salary_ISU_data.csv is available from our class website. Download this .csv file to your C://Temp/ directory so we can read it into R.

> salary.original=read.csv("c://temp/salary_isu_data.csv")
> head(salary.original)
  Department Rank_Code Gender Salary_9_mo Avg_Cont_Grants
       CCE E         2      M
       EEOBS         1      M
       EEOBS         3      M
        IMSE         2      M
       COM S         3      M
        AN S         2      M

## We wish to exclude faculty members without any contracts or
## grants for this analysis.

## 1) Use a boolean statement to pull out certain rows:
> salary=salary.original[salary.original[,5]!=0,]

or

## 2) Use a boolean statement within the subset function:
> salary=subset(salary.original, salary.original[,5]!=0)
> head(salary)
  Department Rank_Code Gender Salary_9_mo Avg_Cont_Grants
       CCE E         2      M
       EEOBS         3      M
        AN S         2      M
       FSHNF         2      F
       AGRON         3      M
       COM S         2      M
> attach(salary)

3. Exploring the data

Let's look at the salary variable, the grants variable, and some of their transformations. Income or money variables are often right-skewed. First, let's rename them for ease of scripting:

> ACG=Avg_Cont_Grants
> Sal=Salary_9_mo
> par(mfrow=c(2,2))
> hist(Sal)
> hist(log(Sal))
> hist(ACG)
> hist(log(ACG))

It turns out that using the log-transformed variables will help meet our regression assumptions.
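To see why a log transformation helps with right-skewed money variables, here is a minimal sketch on simulated values (not the ISU data). The vector x and the helper skewness() are hypothetical names introduced only for this illustration.

```r
# Simulated right-skewed "salary-like" values (hypothetical, not the ISU data)
set.seed(1)
x <- rlnorm(500, meanlog = 11, sdlog = 0.5)

# Sample skewness: third central moment divided by sd^3
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew.raw <- skewness(x)       # clearly positive for right-skewed data
skew.log <- skewness(log(x))  # close to zero after taking logs
c(raw = skew.raw, log = skew.log)
```

A skewness near zero on the log scale is what the roughly symmetric histograms of log(Sal) and log(ACG) suggest visually.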

Let's look at the rank variable, which is a categorical variable. How many faculty members are in each category?

> table(Rank_Code)   ## Recall: Full Professor == 1
Rank_Code

What percentage is in each category?

> table(Rank_Code)/length(Rank_Code)
Rank_Code

Why do you think there are so many more full professors?

4. Relationship between log(salary) and log(grants)

> plot(log(ACG),log(Sal))

There doesn't seem to be a strong relationship, and in addition there are three individuals with very large grant amounts. Let's look at these observations:

> subset(salary,log(ACG)>5)

It turns out that these individuals differ from the others in that they are administrators in university centers. In this case, after discussion with those involved, we felt it was justifiable to remove these observations before further analysis. We'll remove these three observations and proceed:

> salary.2=subset(salary,log(ACG)<5)
> detach(salary)           ## The old data set
> attach(salary.2)         ## After removal of the 3
> ACG=Avg_Cont_Grants      ## ACG after removal
> Sal=Salary_9_mo          ## Sal after removal
> plot(log(ACG),log(Sal))
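The two subsetting styles used in this lab (bracket indexing and subset()) can be compared on a tiny made-up data frame. This is only a sketch: the data frame df and its values are hypothetical, and the cutoff 5 simply mirrors the one in the text.

```r
# Hypothetical toy data: two rows have zero grants, one has a very large amount
df <- data.frame(dept   = c("A", "B", "C", "D"),
                 grants = c(0, 12.5, 0, 300))

# 1) Bracket indexing with a boolean statement
by.index <- df[df[, 2] != 0, ]

# 2) The same boolean statement inside subset()
by.subset <- subset(df, df[, 2] != 0)

# Both approaches keep the same rows
same <- identical(by.index, by.subset)

# Dropping rows with very large log(grants), as done for the three administrators
trimmed <- subset(by.subset, log(grants) < 5)
```

One difference worth knowing: when the condition evaluates to NA, subset() drops that row, while bracket indexing returns a row of NAs.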

Fit the simple linear regression:

> SLR.out=lm(log(Sal) ~ log(ACG))
> abline(SLR.out)
> summary(SLR.out)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              < 2e-16 ***
log(ACG)                                         ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*'

Residual standard error:  on 422 degrees of freedom
Multiple R-Squared:  , Adjusted R-squared:
F-statistic: 4.28 on 1 and 422 DF, p-value:

The relationship between log(salary) and log(grants) is significant.

Write the model:

5. Inclusion of Rank

It was known that rank would have an impact on salary. If we're interested in how gender affects salary, we should also include any other variables known to affect salary.

Rank is a categorical variable. Let's create dummy variables. Because there are three categories, we'll need 2 dummy variables:

> rank.dummy.1=rep(0,nrow(salary.2))   ## All zeroes at first
> rank.dummy.1[Rank_Code==3]=1         ## Place 1's appropriately
> rank.dummy.2=rep(0,nrow(salary.2))
> rank.dummy.2[Rank_Code==2]=1

What is the coding we used? (Recall: Full Professor == 1 in the dataset)

            Assistant   Associate   Full
dummy 1
dummy 2

What is the baseline group?

Let's check our coding:

> data.frame(Rank_Code,rank.dummy.1,rank.dummy.2)
  Rank_Code rank.dummy.1 rank.dummy.2

Rank Code was already numeric, so why can't we just use that variable in our model? What model are you fitting if you regress log(Sal) on Rank Code here?

> is.numeric(Rank_Code)
[1] TRUE
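One way to explore the numeric-versus-dummy question is with a small made-up example. In the sketch below, the response values y are hypothetical and chosen so that the three group means (10.0, 9.0, 8.8) are not equally spaced. Using the rank code as a single numeric predictor forces one common step between adjacent ranks, while the two dummies let each rank have its own mean.

```r
# Hypothetical toy data: 4 observations per rank (1 = Full, 2 = Associate, 3 = Assistant)
rank <- rep(1:3, each = 4)
y <- c(10.0, 10.2,  9.8, 10.0,   # rank 1, mean 10.0
        9.0,  9.1,  8.9,  9.0,   # rank 2, mean  9.0
        8.8,  8.7,  8.9,  8.8)   # rank 3, mean  8.8

# Dummy coding as in the lab: dummy 1 flags rank 3, dummy 2 flags rank 2
d1 <- as.numeric(rank == 3)
d2 <- as.numeric(rank == 2)

fit.num <- lm(y ~ rank)      # one slope: equal steps between ranks
fit.dum <- lm(y ~ d1 + d2)   # separate shift for each non-baseline rank

group.means <- tapply(y, rank, mean)
coef(fit.dum)                # intercept = baseline mean, then shifts from baseline
rss.num <- sum(resid(fit.num)^2)
rss.dum <- sum(resid(fit.dum)^2)
```

With the dummies, the intercept is the mean of the baseline group and each dummy coefficient is that group's difference from the baseline; the numeric model cannot reproduce unequally spaced means, so its residual sum of squares is larger.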

Fit an additive model (i.e. no interaction) with grants and rank:

> lm.both.out=lm(log(Sal) ~ log(ACG) + rank.dummy.1 + rank.dummy.2)
> summary(lm.both.out)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)                               < 2e-16 ***
log(ACG)                                          **
rank.dummy.1                              < 2e-16 ***
rank.dummy.2                              < 2e-16 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*'

Residual standard error:  on 420 degrees of freedom
Multiple R-Squared:  , Adjusted R-squared:
F-statistic:  on 3 and 420 DF, p-value: < 2.2e-16

Write the model:

$Y_i = \beta_0 + \beta_{ACG} x_i + \beta_{D1} D_{1i} + \beta_{D2} D_{2i} + \epsilon_i$

What does the hypothesis $H_0: \beta_{D1} = 0$ test? (In the context of the data)

How do I test whether Rank is useful in the model at all?

$H_0: \beta_{D1} = \beta_{D2} = 0$ ... a partial F-test:

> anova(SLR.out,lm.both.out)
Analysis of Variance Table

Model 1: log(Sal) ~ log(ACG)
Model 2: log(Sal) ~ log(ACG) + rank.dummy.1 + rank.dummy.2
  Res.Df RSS Df Sum of Sq F    Pr(>F)
1    422
2    420      2             < 2.2e-16 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*'
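The partial F-test reported by anova() can be reproduced by hand from the residual sums of squares of the reduced and full models. The sketch below uses simulated data (the variables x, g, and y are hypothetical stand-ins for log grants, rank, and log salary), so the numbers are not those of the lab.

```r
# Simulated data: a quantitative predictor x and a 3-level factor g (hypothetical)
set.seed(2)
n <- 120
x <- rnorm(n)
g <- factor(sample(1:3, n, replace = TRUE))
y <- 1 + 0.3 * x + c(0, 0.5, 1)[as.integer(g)] + rnorm(n, sd = 0.5)

reduced <- lm(y ~ x)       # model without the factor
full    <- lm(y ~ x + g)   # model with the factor (2 dummy variables)

# Partial F statistic: drop in RSS per added parameter, over the full-model MSE
rss.r  <- sum(resid(reduced)^2)
rss.f  <- sum(resid(full)^2)
df.num <- df.residual(reduced) - df.residual(full)
F.hand <- ((rss.r - rss.f) / df.num) / (rss.f / df.residual(full))
p.hand <- pf(F.hand, df.num, df.residual(full), lower.tail = FALSE)

# Same test via anova()
tab <- anova(reduced, full)
```

The numerator degrees of freedom equal the number of dummy variables being tested (2 here, as in the rank test above).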

If we wanted to include an interaction between rank and grants, what other variables would be needed in the model? What test would be used to test for interaction?

6. Inclusion of gender

> sex.dummy=rep(0,nrow(salary.2))
> sex.dummy[Gender=="M"]=1
> lm.out.3=lm(log(Sal)~log(ACG)+rank.dummy.1+rank.dummy.2+sex.dummy)
> summary(lm.out.3)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)                               < 2e-16 ***
log(ACG)                                          **
rank.dummy.1                              < 2e-16 ***
rank.dummy.2                              < 2e-16 ***
sex.dummy                                         **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*'

Residual standard error:  on 419 degrees of freedom
Multiple R-Squared:  , Adjusted R-squared:
F-statistic:  on 4 and 419 DF, p-value: < 2.2e-16

In this model, which sex is estimated to have a slight advantage?
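Because the response is log(Sal), the coefficient of a 0/1 dummy such as sex.dummy is a difference on the log scale, so exp(coefficient) estimates a salary ratio between the two groups, holding the other predictors fixed. Here is a minimal sketch with simulated data in which the true ratio is set to 1.05; all names and values are hypothetical.

```r
# Simulated data with a known multiplicative effect (hypothetical, not the ISU data)
set.seed(3)
n <- 2000
male   <- rbinom(n, 1, 0.5)
grants <- rlnorm(n, meanlog = 4, sdlog = 1)

# Salary is 5% higher when male == 1, all else equal
sal <- 60000 * grants^0.05 * 1.05^male * exp(rnorm(n, sd = 0.1))

fit   <- lm(log(sal) ~ log(grants) + male)
b     <- coef(fit)["male"]   # difference in mean log salary
ratio <- exp(b)              # estimated salary ratio, should be near 1.05
pct   <- 100 * (ratio - 1)   # same effect expressed as a percentage
```

A positive coefficient on a dummy coded 1 for males therefore corresponds to a ratio above 1, that is, an estimated advantage for males.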

What if we didn't include rank (only grants and gender)?

> lm.out.norank=lm(log(Sal)~log(ACG)+sex.dummy)
> summary(lm.out.norank)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              < 2e-16 ***
log(ACG)                                    e-05 ***
sex.dummy                                   e-08 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*'

Residual standard error:  on 421 degrees of freedom
Multiple R-Squared:  , Adjusted R-squared:
F-statistic: 23.1 on 2 and 421 DF, p-value: 3.038e-10

Why is gender so much stronger without rank included? (Recall the fundamentals of multiple regression.)

> table(Rank_Code)
Rank_Code

> table(Gender,Rank_Code)
      Rank_Code
Gender 1 2 3
     F
     M

Rank and gender are not independent of each other. Knowing the rank of a randomly chosen individual gives you some information about the likelihood of their gender.

A large proportion of the women are in the lower ranks. If you don't account for rank, it will look like women are paid less (but that's not a good analysis).

It turns out that if we also include department (which is also associated with salary), the significant sex effect disappears.
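The effect of leaving out rank can be illustrated with a small simulation. In this sketch, which uses hypothetical names and only two ranks for simplicity, log salary depends on rank alone, but men are more likely to be full professors. Gender then looks important when rank is omitted and fades once rank is in the model.

```r
# Simulated data (hypothetical): no direct gender effect on salary
set.seed(4)
n <- 2000
male <- rbinom(n, 1, 0.7)

# Rank depends on gender: men are more likely to be full professors
p.full <- ifelse(male == 1, 0.6, 0.2)
full   <- rbinom(n, 1, p.full)

# log-salary depends on rank only
log.sal <- 11 + 0.4 * full + rnorm(n, sd = 0.15)

fit.norank <- lm(log.sal ~ male)          # rank omitted
fit.rank   <- lm(log.sal ~ male + full)   # rank included

b.norank <- coef(fit.norank)["male"]   # picks up the rank effect through gender
b.rank   <- coef(fit.rank)["male"]     # near zero once rank is accounted for
```

This mirrors the comparison above: the coefficient of the sex dummy changes because it is estimated after adjusting for whatever else is in the model.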

7. Lattice Plot

Lattice plots can be useful when considering a quantitative response and categorical predictors, or factors. R can actually make dummy variables on its own using the as.factor() command (more on this later).

Here, we can see how log(Sal) is related to log(ACG) for each of the six combinations of Sex/Rank:

## lattice is an attachable package.
> library(lattice)
> xyplot(log(Sal)~log(ACG) | as.factor(sex.dummy)+as.factor(Rank_Code))

[Figure: lattice plot of log(Sal) against log(ACG), one panel for each Sex/Rank combination]
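As a preview of how as.factor() makes dummy variables on its own, the sketch below fits the same model twice on simulated data: once with hand-built dummies coded as in this lab, and once with as.factor(). The variables rank, x, and y are hypothetical.

```r
# Simulated data (hypothetical): rank codes 1, 2, 3 and a quantitative predictor x
set.seed(5)
n <- 60
rank <- sample(1:3, n, replace = TRUE)
x <- rnorm(n)
y <- 2 + 0.5 * x - 0.3 * (rank == 2) - 0.6 * (rank == 3) + rnorm(n, sd = 0.2)

# Hand-built dummies, coded as in the lab (baseline: rank 1)
d1 <- as.numeric(rank == 3)
d2 <- as.numeric(rank == 2)
fit.hand <- lm(y ~ x + d1 + d2)

# Let R build the dummies; the first level (rank 1) is the default baseline
fit.fac <- lm(y ~ x + as.factor(rank))

coef(fit.hand)
coef(fit.fac)
```

The two fits give the same fitted values, and the coefficient labelled as.factor(rank)3 matches the one for d1, since both compare rank 3 to rank 1.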
