Solutions to Problem Set 2 Andrew Stokes Fall 2017

This answer key uses both dplyr and base R.

Set the working directory and point R to it.

    input <- "/users/jasoncollins/desktop/gh811/assignments/problemset2"
    setwd(input)

Read in any relevant libraries. The solutions below incorporate dplyr, a new package in R that makes data manipulation faster and more intuitive than base R.

    suppressMessages(library(dplyr))

Now we are ready to read in the Malawi data.

    malawi_raw <- read.csv("malawi2010s.csv")

The first step in processing the data is to apply the inclusion criteria. We restrict the dataset to households in which the primary fuel used for cooking is wood. We also require that household cooking is performed either in the main residence or in a separate building. We exclude households in which cooking is performed outdoors because of the reduced risk of exposure to smoke in these instances. To reduce the size of the dataset, we also keep only the variables needed for the analysis.

In dplyr, we subset columns with the select verb and rows with the filter verb. (Recall that tbl_df creates a local data frame.)

    malawi <- tbl_df(malawi_raw)
    malawi <- malawi %>%
      select(hv024, hv025, hv226, hv227, hv239, hv240, hv241, hv270, hv108_01, shdist) %>%
      filter(hv226 == 8 & hv241 %in% c(1, 2))
    malawi
    ## # A tibble: 15,661 x 10
    ##      hv024  hv025 hv226 hv227 hv239 hv240 hv241   hv270 hv108_01 shdist
    ##     <fctr> <fctr> <int> <int> <int> <int> <int>  <fctr>    <int> <fctr>
    ##  1 central  rural     8     1     1     0     2  middle        0  dedza
    ##  2 central  rural     8     1     1     1     2 richest       15  dedza
    ##  3 central  rural     8     1     1     0     2  richer        3  dedza
    ##  4 central  rural     8     1     1     0     2 richest        2  dedza
    ##  5 central  rural     8     0     1     0     2  middle        3  dedza
    ##  6 central  rural     8     1     1     0     2  poorer        5  dedza
    ##  7 central  rural     8     1     1     0     2 poorest        3  dedza
    ##  8 central  rural     8     0     1     0     2  poorer        5  dedza
    ##  9 central  rural     8     1     1     0     2 richest        0  dedza
    ## 10 central  rural     8     1     1     0     2  richer        3  dedza
    ## # ... with 15,651 more rows

Question 1

How many households were eliminated as a result of applying the stated inclusion criteria, and how many remain in the final analytic dataset?
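As a small illustration of the subsetting step and the row counts this question asks about, here is a sketch on a toy data frame. The data frame `toy_raw`, its column names, and its values are made up for illustration and are not part of the Malawi data; the fuel code 8 and place codes 1 and 2 only mirror the codes used above.

```r
library(dplyr)

# Toy data (hypothetical values): fuel code 8 stands in for wood,
# place codes 1 and 2 for cooking in the house or in a separate building.
toy_raw <- data.frame(
  fuel  = c(8, 8, 3, 8, 8, 5),
  place = c(1, 3, 1, 2, 1, 2),
  extra = letters[1:6]
)

toy <- toy_raw %>%
  select(fuel, place) %>%                  # keep only the columns we need
  filter(fuel == 8 & place %in% c(1, 2))   # apply both inclusion criteria

n_raw     <- nrow(toy_raw)     # rows before the criteria
n_kept    <- nrow(toy)         # rows that remain
n_dropped <- n_raw - n_kept    # rows eliminated
```

Inside a pipe, `filter` refers to columns by bare name, so there is no need to write `toy_raw$fuel`; doing so would test the unfiltered object rather than the data flowing through the pipe.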

    dim(malawi_raw)[1]
    ## [1] 24825
    dim(malawi)[1]
    ## [1] 15661
    dim(malawi_raw)[1] - dim(malawi)[1]
    ## [1] 9164

Applying the inclusion criteria eliminates 9,164 households and leaves 15,661 in the analytic dataset.

Question 2

We are now asked to construct a variable that identifies households with an improved wood stove. We consider the stove an improved wood stove if ANY of the following criteria are met:

- Food is cooked on an open stove (hv239=2)
- Food is cooked on a closed stove with chimney (hv239=3)
- Household has a chimney (hv240=1)
- Household has a hood (hv240=2)

We consider it not an improved stove if ALL of the following criteria are met:

- Food is cooked over an open fire (hv239=1)
- Household has neither chimney nor hood (hv240=0)

We use a nested ifelse statement to construct the improved wood stove variable. Households that satisfy neither set of criteria are coded 9. Don't forget to exclude the missing values!

    malawi$impstove <- ifelse(malawi$hv239==2 | malawi$hv239==3 | malawi$hv240==1 | malawi$hv240==2, 1,
                              ifelse(malawi$hv239==1 & malawi$hv240==0, 0, 9))
    table(malawi$impstove)
    ##
    ##     0     1     9
    ## 15348   304     7
    malawi <- malawi %>% filter(impstove != 9)
    table(malawi$impstove)
    ##
    ##     0     1
    ## 15348   304

Question 3

Having created the improved stove variable, we are now asked to compare households with improved stoves to those without on several characteristics:

- Mean value of wealth index (hv270)
- Percent of households with wealth index of poorest or poorer
- Percent of households from the southern region of Malawi (hv024)
- Percent of households that are urban (hv025)
- Percent of households that have a bednet for sleeping (hv227)
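The recode from Question 2 and the kind of group comparison listed for Question 3 can be tried end to end on a toy data frame. This is only a sketch: `toy`, `stove`, `chimney`, `urban`, and `imp` are hypothetical names with made-up values. It shows that a missing input yields `NA` rather than 9, that `filter` drops both the 9s and the `NA`s, and that the mean of a 0/1 indicator within a group is the group proportion.

```r
library(dplyr)

# Toy data (hypothetical values); stove and chimney mimic hv239 and hv240.
toy <- data.frame(
  stove   = c(1, 2, 1, 3, 1, NA, 4, 1),
  chimney = c(0, 0, 1, 0, 0, 0, 0, 0),
  urban   = c(0, 1, 1, 0, 0, 1, 1, 1)
)

# 1 if ANY improved criterion holds, 0 if open fire AND no chimney/hood,
# 9 otherwise. A row with a missing stove code comes out as NA.
toy$imp <- ifelse(toy$stove == 2 | toy$stove == 3 |
                    toy$chimney == 1 | toy$chimney == 2, 1,
                  ifelse(toy$stove == 1 & toy$chimney == 0, 0, 9))

by_stove <- toy %>%
  filter(imp != 9) %>%           # drops the 9s; rows where imp is NA go too
  group_by(imp) %>%
  summarise(n = n(),
            pct_urban = 100 * mean(urban))  # mean of a 0/1 flag = proportion
```

Printing `by_stove` gives one row per stove status, which is the layout used for the comparison table in Question 3.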

We can use the dplyr summarise verb to calculate the values for the table. Let's start by calculating the mean value of the wealth index by improved stove status. First we need to recode the raw data for hv270, which is stored as text.

    malawi$wealth_index_num <- 9
    malawi$wealth_index_num[malawi$hv270=="poorest"] <- 1
    malawi$wealth_index_num[malawi$hv270=="poorer"] <- 2
    malawi$wealth_index_num[malawi$hv270=="middle"] <- 3
    malawi$wealth_index_num[malawi$hv270=="richer"] <- 4
    malawi$wealth_index_num[malawi$hv270=="richest"] <- 5

Check to make sure it worked!

    table(malawi$wealth_index_num)
    ##
    ##    1    2    3    4    5
    ## 3278 3373 3555 3534 1912

Now that the data are numeric, we can use dplyr to calculate the mean value of the wealth index by improved stove status.

    malawi %>%
      group_by(impstove) %>%
      summarise(mean_wealth = mean(wealth_index_num))
    ##   impstove mean_wealth
    ## 1        0    2.819716
    ## 2        1    3.644737

The next characteristic is the percent of households with a wealth index of poorest or poorer. First we create a new variable, low_ses, which we code as 1 if the household belongs to one of those two categories, else 0.

    malawi$low_ses <- 9
    malawi$low_ses[malawi$hv270=="poorest" | malawi$hv270=="poorer"] <- 1
    malawi$low_ses[malawi$hv270=="middle" | malawi$hv270=="richer" | malawi$hv270=="richest"] <- 0

Then, using similar code as above, we calculate the proportion poorest or poorer by improved stove status.

    malawi %>%
      group_by(impstove) %>%
      summarise(mean_low_ses = mean(low_ses))
    ##   impstove mean_low_ses
    ## 1        0    0.4287204
    ## 2        1    0.2335526

Next up is the percent of households from the southern region of Malawi (hv024). I first construct a new dummy variable, southern, that indicates whether the household is located in the southern region. We find the value to match on by looking at the levels of the variable identified in the recode book.

    levels(malawi$hv024)
    ## [1] "central"  "northern" "southern"

    malawi$southern <- ifelse(malawi$hv024=="southern", 1, 0)

Then I calculate the proportion that reside in the southern region by improved stove status.

    malawi %>%
      group_by(impstove) %>%
      summarise(mean_southern = mean(southern))
    ##   impstove mean_southern
    ## 1        0     0.4175788
    ## 2        1     0.5559211

Next is the percent of households that are urban (hv025).

    malawi$urban <- ifelse(malawi$hv025=="urban", 1, 0)
    malawi %>%
      group_by(impstove) %>%
      summarise(mean_urban = mean(urban))
    ##   impstove mean_urban
    ## 1        0 0.04854053
    ## 2        1 0.14473684

The final one is the percent of households that have a bednet for sleeping (hv227). This one is simple and doesn't require any recoding.

    malawi %>%
      group_by(impstove) %>%
      summarise(mean_net = mean(hv227))
    ##   impstove  mean_net
    ## 1        0 0.6849101
    ## 2        1 0.7697368

Question 5

Which district of Malawi has the highest prevalence of improved wood stoves?

    imp_stove_districts <- malawi %>%
      group_by(shdist) %>%
      summarise(mean_imp_stove = mean(impstove))
    print(imp_stove_districts, n=30)
    ## # A tibble: 27 x 2
    ##    shdist      mean_imp_stove
    ##    <fctr>               <dbl>
    ##  1 balaka         0.024221453
    ##  2 blantyre       0.023255814
    ##  3 chikwawa       0.013856813
    ##  4 chiradzulu     0.013207547
    ##  5 chitipa        0.005235602
    ##  6 dedza          0.014492754
    ##  7 dowa           0.008321775
    ##  8 karonga        0.008869180
    ##  9 kasungu        0.006240250
    ## 10 lilongwe       0.031250000
    ## 11 machinga       0.008771930
    ## 12 mangochi       0.015845070
    ## 13 mchinji        0.007936508
    ## 14 mulanje        0.066780822
    ## 15 mwanza         0.015837104
    ## 16 mzimba         0.021361816
    ## 17 neno           0.016161616
    ## 18 nkhatabay      0.014814815
    ## 19 nkhota kota    0.013468013
    ## 20 nsanje         0.021680217
    ## 21 ntcheu         0.009817672
    ## 22 ntchisi        0.014005602
    ## 23 phalombe       0.030664395
    ## 24 rumphi         0.042586751
    ## 25 salima         0.007042254
    ## 26 thyolo         0.041459370
    ## 27 zomba          0.030303030

From the table above, the highest prevalence is found in Mulanje district (about 6.7%).

Question 6

Now we are asked to restrict the districts to those with a prevalence of improved wood stoves greater than the median and to represent this with a bar chart. We can use base R or dplyr here. The base R version below computes the column proportions of the stove-by-district table, keeps the row for improved stoves, and then keeps the districts above the median.

    x <- prop.table(table(malawi$impstove, malawi$shdist), 2)[2,]
    z <- x[x > median(x)]
    barplot(z, xlab="District", ylab="Proportion of Improved Stoves",
            main="Improved Stoves by District", ylim=c(0,0.08))
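The text above notes that either base R or dplyr can be used for the above-median restriction. A dplyr version might look like the following sketch, shown on a toy data frame; `toy`, `district`, `imp`, and `prop_imp` are hypothetical names with made-up values, not the answer key's own code.

```r
library(dplyr)

# Toy data (hypothetical values): one row per household.
toy <- data.frame(
  district = c("a", "a", "b", "b", "b", "c", "c", "d", "d", "d"),
  imp      = c(1, 0, 0, 0, 0, 1, 1, 0, 1, 0)
)

above_median <- toy %>%
  group_by(district) %>%
  summarise(prop_imp = mean(imp)) %>%      # one proportion per district
  filter(prop_imp > median(prop_imp))      # median is taken across districts
```

After `summarise` the result has one row per district and is no longer grouped, so `median(prop_imp)` is computed over all districts, matching `median(x)` in the base R version. The resulting `prop_imp` column could then be passed to `barplot` with `names.arg` set to the district names.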

[Figure: bar chart "Improved Stoves by District"; x-axis: District; y-axis: Proportion of Improved Stoves, 0.00 to 0.08]

Question 7

We are asked to generate a boxplot showing the distribution of education (in single years) of the household head, comparing households that have an improved wood stove to those that do not. We first drop households with an education value of 98 or higher.

    malawi <- malawi %>% filter(hv108_01 < 98)
    malawi$edu <- as.numeric(malawi$hv108_01)
    boxplot(edu~impstove, data=malawi,
            main = "Years of school by improved stove status",
            xlab = "Improved Stove", ylab="Education in Single Years",
            font.lab=3, col= "darkgreen")

[Figure: boxplot "Years of school by improved stove status"; x-axis: Improved Stove (0, 1); y-axis: Education in Single Years, 0 to 15]

Question 8

For question 8 we are asked to use a for loop to do something we can already do without one: get the proportion of improved stoves within each wealth status category. To do this, we iterate through the wealth categories and calculate the proportion of improved stoves in each one individually.

    imp.wealth <- c()
    for (i in levels(malawi$hv270)) {
      imp.wealth[i] <- prop.table(table(malawi$impstove[which(malawi$hv270 == i)]))[2]
    }
    imp.wealth
    ##      middle      poorer     poorest      richer     richest
    ## 0.013288097 0.012816692 0.008282209 0.021856372 0.056722689

Now we generate a barplot, reordering the categories from poorest to richest.

    barplot(imp.wealth[c("poorest","poorer","middle","richer","richest")],
            main="Proportion of Improved Stoves by Wealth Status, Malawi",
            xlab="Wealth Index", ylab="Proportion of Improved Stoves", ylim=c(0,0.1))

[Figure: bar chart "Proportion of Improved Stoves by Wealth Status, Malawi"; x-axis: Wealth Index (poorest, poorer, middle, richer, richest); y-axis: Proportion of Improved Stoves, 0.00 to 0.10]
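Question 8 notes that the for loop reproduces something that can be done without a loop. As a sketch of that equivalence, the toy example below computes per-category proportions both ways; `toy`, `wealth`, and `imp` are hypothetical names with made-up values, and `tapply` is one possible loop-free alternative rather than the one the answer key necessarily had in mind.

```r
# Toy data (hypothetical values): a wealth category and a 0/1 stove flag.
toy <- data.frame(
  wealth = factor(c("poor", "poor", "poor", "mid", "mid",
                    "rich", "rich", "rich", "rich"),
                  levels = c("poor", "mid", "rich")),
  imp    = c(0, 0, 1, 0, 0, 1, 1, 0, 0)
)

# Loop version: one proportion per wealth category.
prop_loop <- c()
for (w in levels(toy$wealth)) {
  prop_loop[w] <- mean(toy$imp[toy$wealth == w])
}

# Loop-free version: apply mean() within each category.
prop_tapply <- tapply(toy$imp, toy$wealth, mean)
```

Taking the mean of the 0/1 flag also behaves sensibly when a category contains no improved stoves (here "mid"): it returns 0, whereas `prop.table(table(...))[2]` would return `NA` because the table has no second cell.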