Solutions to Problem Set 2
Andrew Stokes
Fall 2017

This answer key uses both dplyr and base R.

Set the working directory and point R to it.

    input <- "/users/jasoncollins/desktop/gh811/assignments/problemset2"
    setwd(input)

Load any relevant libraries. The solutions below use dplyr, an R package that makes data manipulation faster and more intuitive than base R.

    suppressMessages(library(dplyr))

Now we are ready to read in the Malawi data.

    malawi_raw <- read.csv("malawi2010s.csv")

The first step in processing the data is to apply the inclusion criteria. We restrict the dataset to households in which the primary fuel used for cooking is wood. We also require that household cooking is performed either in the main residence or in a separate building. We exclude households in which cooking is performed outdoors because of the reduced risk of exposure to smoke in these instances. To reduce the size of the dataset, we also subset the data to include only those variables needed for the analysis.

In dplyr, we use the select verb to subset columns and the filter verb to subset rows. (Recall that tbl_df creates a local data frame.)

    # tbl_df() is deprecated in newer dplyr releases; as_tibble() replaces it
    malawi <- tbl_df(malawi_raw)
    # keep only the analysis variables, then apply the inclusion criteria:
    # wood as cooking fuel (hv226 == 8), cooking indoors or in a separate building (hv241 in 1, 2)
    malawi <- malawi %>%
      select(hv024, hv025, hv226, hv227, hv239, hv240, hv241, hv270, hv108_01, shdist) %>%
      filter(hv226 == 8 & hv241 %in% c(1, 2))
    malawi
    ## # A tibble: 15,661 x 10
    ##      hv024  hv025 hv226 hv227 hv239 hv240 hv241   hv270 hv108_01 shdist
    ##     <fctr> <fctr> <int> <int> <int> <int> <int>  <fctr>    <int> <fctr>
    ##  1 central  rural     8     1     1     0     2  middle        0  dedza
    ##  2 central  rural     8     1     1     1     2 richest       15  dedza
    ##  3 central  rural     8     1     1     0     2  richer        3  dedza
    ##  4 central  rural     8     1     1     0     2 richest        2  dedza
    ##  5 central  rural     8     0     1     0     2  middle        3  dedza
    ##  6 central  rural     8     1     1     0     2  poorer        5  dedza
    ##  7 central  rural     8     1     1     0     2 poorest        3  dedza
    ##  8 central  rural     8     0     1     0     2  poorer        5  dedza
    ##  9 central  rural     8     1     1     0     2 richest        0  dedza
    ## 10 central  rural     8     1     1     0     2  richer        3  dedza
    ## # ... with 15,651 more rows

Question 1

How many households were eliminated as a result of applying the stated inclusion criteria, and how many remain in the final analytic dataset?
    dim(malawi_raw)[1]
    ## [1] 24825
    dim(malawi)[1]
    ## [1] 15661
    dim(malawi_raw)[1] - dim(malawi)[1]
    ## [1] 9164

Applying the inclusion criteria eliminated 9,164 households, leaving 15,661 in the analytic dataset.

Question 2

We are now asked to construct a variable that identifies households with an improved wood stove. We consider the stove an improved wood stove if ANY of the following criteria are met:

- Food is cooked on an open stove (hv239=2)
- Food is cooked on a closed stove with chimney (hv239=3)
- Household has a chimney (hv240=1)
- Household has a hood (hv240=2)

We consider it not an improved stove if ALL of the following criteria are met:

- Food is cooked over an open fire (hv239=1)
- Household has neither a chimney nor a hood (hv240=0)

We use a nested ifelse statement to construct the improved wood stove variable. Households that meet neither set of criteria are coded 9. Don't forget to exclude the missing values!

    # 1 = improved stove, 0 = not improved, 9 = neither set of criteria met
    malawi$impstove <- ifelse(malawi$hv239 == 2 | malawi$hv239 == 3 |
                                malawi$hv240 == 1 | malawi$hv240 == 2, 1,
                              ifelse(malawi$hv239 == 1 & malawi$hv240 == 0, 0, 9))
    table(malawi$impstove)
    ##
    ##     0     1     9
    ## 15348   304     7
    # drop the unclassified households; filter() also drops rows where impstove is NA
    malawi <- filter(malawi, impstove != 9)
    table(malawi$impstove)
    ##
    ##     0     1
    ## 15348   304

Question 3

Having created the improved stove variable, we are now asked to compare households with improved stoves to those without on several characteristics:

- Mean value of wealth index (hv270)
- Percent of households with wealth index of poorest or poorer
- Percent of households from the southern region of Malawi (hv024)
- Percent of households that are urban (hv025)
- Percent of households that have a bednet for sleeping (hv227)
We can use the dplyr summarise verb to calculate the values for the table. Let's start by calculating the mean value of the wealth index by improved stove status. First we need to recode the raw data for hv270, which is stored as text.

    # map the wealth categories to 1-5; 9 flags any value left unrecoded
    malawi$wealth_index_num <- 9
    malawi$wealth_index_num[malawi$hv270 == "poorest"] <- 1
    malawi$wealth_index_num[malawi$hv270 == "poorer"] <- 2
    malawi$wealth_index_num[malawi$hv270 == "middle"] <- 3
    malawi$wealth_index_num[malawi$hv270 == "richer"] <- 4
    malawi$wealth_index_num[malawi$hv270 == "richest"] <- 5

Check to make sure it worked!

    table(malawi$wealth_index_num)
    ##
    ##    1    2    3    4    5
    ## 3278 3373 3555 3534 1912

Now that the data are numeric, we can use dplyr to calculate the mean value of the wealth index by improved stove status.

    malawi %>%
      group_by(impstove) %>%
      summarise(mean_wealth = mean(wealth_index_num))
    ##   impstove mean_wealth
    ## 1        0    2.819716
    ## 2        1    3.644737

The next characteristic is the percent of households with a wealth index of poorest or poorer. First we create a new variable, low_ses, which we code as 1 if the household belongs to one of those two categories and 0 otherwise.

    malawi$low_ses <- 9
    malawi$low_ses[malawi$hv270 == "poorest" | malawi$hv270 == "poorer"] <- 1
    malawi$low_ses[malawi$hv270 == "middle" | malawi$hv270 == "richer" |
                     malawi$hv270 == "richest"] <- 0

Then, using code similar to that above, we calculate the proportion poorest or poorer by improved stove status. Because low_ses is coded 0/1, its mean is the proportion of households coded 1.

    malawi %>%
      group_by(impstove) %>%
      summarise(mean_low_ses = mean(low_ses))
    ##   impstove mean_low_ses
    ## 1        0    0.4287204
    ## 2        1    0.2335526

Next up is the percent of households from the southern region of Malawi (hv024). I first construct a new dummy variable, southern, that indicates whether the household is located in the southern region. To find the value to match, we check the levels of the variable identified in the recode book.

    levels(malawi$hv024)
    ## [1] "central"  "northern" "southern"
    malawi$southern <- ifelse(malawi$hv024 == "southern", 1, 0)

Then, I calculate the proportion that reside in the southern region by improved stove status.

    malawi %>%
      group_by(impstove) %>%
      summarise(mean_southern = mean(southern))
    ##   impstove mean_southern
    ## 1        0     0.4175788
    ## 2        1     0.5559211

Next is the percent of households that are urban (hv025).

    malawi$urban <- ifelse(malawi$hv025 == "urban", 1, 0)
    malawi %>%
      group_by(impstove) %>%
      summarise(mean_urban = mean(urban))
    ##   impstove mean_urban
    ## 1        0 0.04854053
    ## 2        1 0.14473684

The final one is the percent of households that have a bednet for sleeping (hv227). This one is simple and doesn't require any recoding, since hv227 is already coded 0/1.

    malawi %>%
      group_by(impstove) %>%
      summarise(mean_net = mean(hv227))
    ##   impstove  mean_net
    ## 1        0 0.6849101
    ## 2        1 0.7697368

Question 5

Which district of Malawi has the highest prevalence of improved wood stoves?

    imp_stove_districts <- malawi %>%
      group_by(shdist) %>%
      summarise(mean_imp_stove = mean(impstove))
    print(imp_stove_districts, n = 30)
    ## # A tibble: 27 x 2
    ##         shdist mean_imp_stove
    ##         <fctr>          <dbl>
    ##  1      balaka    0.024221453
    ##  2    blantyre    0.023255814
    ##  3    chikwawa    0.013856813
    ##  4  chiradzulu    0.013207547
    ##  5     chitipa    0.005235602
    ##  6       dedza    0.014492754
    ##  7        dowa    0.008321775
    ##  8     karonga    0.008869180
    ##  9     kasungu    0.006240250
    ## 10    lilongwe    0.031250000
    ## 11    machinga    0.008771930
    ## 12    mangochi    0.015845070
    ## 13     mchinji    0.007936508
    ## 14     mulanje    0.066780822
    ## 15      mwanza    0.015837104
    ## 16      mzimba    0.021361816
    ## 17        neno    0.016161616
    ## 18   nkhatabay    0.014814815
    ## 19 nkhota kota    0.013468013
    ## 20      nsanje    0.021680217
    ## 21      ntcheu    0.009817672
    ## 22     ntchisi    0.014005602
    ## 23    phalombe    0.030664395
    ## 24      rumphi    0.042586751
    ## 25      salima    0.007042254
    ## 26      thyolo    0.041459370
    ## 27       zomba    0.030303030

From the table above, the highest prevalence is found in Mulanje district (about 6.7% of households).

Question 6

Now we are asked to restrict the districts to those that have a prevalence of improved wood stoves greater than the median and to represent this with a bar chart. We can use base R or dplyr here.

    # column proportions by district; row 2 is the share with impstove == 1
    x <- prop.table(table(malawi$impstove, malawi$shdist), 2)[2, ]
    # keep only districts above the median prevalence
    z <- x[x > median(x)]
    barplot(z, xlab = "District", ylab = "Proportion of Improved Stoves",
            main = "Improved Stoves by District", ylim = c(0, 0.08))
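Since the text notes that either base R or dplyr works for this step, here is a minimal sketch of how the above-median filter might look in dplyr. The data frame toy and its values are made up for illustration; only the column names shdist and impstove follow the answer key.

```r
library(dplyr)

# Toy stand-in for the malawi data (hypothetical values, not the survey data)
toy <- data.frame(
  shdist   = c("a", "a", "b", "b", "c", "c", "d", "d", "d", "d", "e", "e", "e", "e"),
  impstove = c(0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1)
)

# Prevalence per district, then keep districts above the median prevalence.
# After summarise() the result is no longer grouped, so median(prev)
# is taken across all districts.
above_median <- toy %>%
  group_by(shdist) %>%
  summarise(prev = mean(impstove)) %>%
  filter(prev > median(prev))

barplot(above_median$prev,
        names.arg = as.character(above_median$shdist),
        xlab = "District", ylab = "Proportion of Improved Stoves",
        main = "Improved Stoves by District")
```

With the real data, replacing toy with malawi should select the same districts as the base R version, since both compare each district's prevalence with the median across districts.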
[Figure for Question 6: bar chart "Improved Stoves by District"; x-axis: District; y-axis: Proportion of Improved Stoves, 0.00 to 0.08]

Question 7

We are asked to generate a boxplot showing the distribution of education (in single years) of the household head, comparing households that have an improved wood stove to those that do not.

    # keep households with a reported value (codes of 98 and above are excluded)
    malawi <- filter(malawi, hv108_01 < 98)
    malawi$edu <- as.numeric(malawi$hv108_01)
    boxplot(edu ~ impstove, data = malawi,
            main = "Years of school by improved stove status",
            xlab = "Improved Stove", ylab = "Education in Single Years",
            font.lab = 3, col = "darkgreen")
[Figure for Question 7: boxplot "Years of school by improved stove status"; x-axis: Improved Stove (0, 1); y-axis: Education in Single Years, 0 to 15]

Question 8

For Question 8 we are asked to use a for loop to do something we could also do without one: get the proportion of improved stoves within each wealth status category. To do this, we iterate through the wealth categories and calculate the proportion of improved stoves in each one individually.

    imp.wealth <- c()
    for (i in levels(malawi$hv270)) {
      # second element of the proportion table is the share with impstove == 1
      imp.wealth[i] <- prop.table(table(malawi$impstove[which(malawi$hv270 == i)]))[2]
    }
    imp.wealth
    ##      middle      poorer     poorest      richer     richest
    ## 0.013288097 0.012816692 0.008282209 0.021856372 0.056722689

Now we generate a barplot, reordering the categories from poorest to richest.

    barplot(imp.wealth[c("poorest", "poorer", "middle", "richer", "richest")],
            main = "Proportion of Improved Stoves by Wealth Status, Malawi",
            xlab = "Wealth Index", ylab = "Proportion of Improved Stoves",
            ylim = c(0, 0.1))
[Figure for Question 8: bar chart "Proportion of Improved Stoves by Wealth Status, Malawi"; x-axis: Wealth Index (poorest, poorer, middle, richer, richest); y-axis: Proportion of Improved Stoves, 0.00 to 0.10]
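Question 8 observes that the loop repeats something that can be done without one. As a rough sketch of that loop-free route, the example below compares a for loop with a single tapply call on made-up data. The data frame toy and its values are hypothetical, and mean() of the 0/1 indicator is swapped in for the prop.table(table(...))[2] expression used in the answer key, because the latter returns NA when a category contains no improved stoves.

```r
# Toy stand-in for the malawi data (hypothetical values, not the survey data)
toy <- data.frame(
  hv270    = factor(c("poorest", "poorest", "poorer", "poorer",
                      "middle", "middle", "richer", "richest")),
  impstove = c(0, 0, 0, 1, 0, 0, 1, 1)
)

# Loop version: one proportion per wealth category
imp.loop <- c()
for (i in levels(toy$hv270)) {
  imp.loop[i] <- mean(toy$impstove[toy$hv270 == i])
}

# Loop-free version: tapply applies mean() within each level of hv270
imp.vec <- tapply(toy$impstove, toy$hv270, mean)

imp.loop
imp.vec
```

Both objects hold the same proportions under the same names, in the alphabetical order of the factor levels, so the same reordering by name is needed before plotting.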