Statistical Programming Camp: An Introduction to R

Handout 3: Data Manipulation and Summarizing Univariate Data
Fox Chapters 1-3, 7-8

In this handout, we cover the following new materials:

• Using logical operators: <, <=, >, >=, ==, !=, &, |, and is.na()
• Subsetting data with [ ] and subset() using logical expressions
• Using ifelse() for conditional statements
• More functions for summary statistics: var() (variance), sd() (standard deviation), weighted.mean(), quantile(X, P), and IQR() (interquartile range)
• Applying functions by indexes using tapply()
• Using function() to create user-defined functions
• Common arguments for graphs: main (main title), xlab and ylab (axis labels), xlim and ylim (axis limits), pch (point symbol), lty (line type), lwd (line width), col (color), and cex (sizing)
• Adding features to graphs with lines() and abline() (lines), points() (points), text() (text), and arrows() (arrows)
• Using identify() to identify points on graphs
• Using \n to break lines
• Using par(mfrow = c(X, Y)) at the beginning of graphical commands to produce an X by Y array of figures in one graphical window
• Using hist() to generate histograms
• Calculating a smooth density via density()
• Adding a legend to an existing graph with legend()
• Printing and saving graphs

We will cover the following Statistical Programming Camp Coding Rule:

• Curly Brackets

1 Logical Operators and Values

• Logical operators (<, <=, >, >=, ==, and !=) allow for data manipulation and subsetting by determining whether a specified condition is TRUE or FALSE. Both values must be written in uppercase and are special values in R, just like NA. The operators correspond to standard use. For instance, <= evaluates whether a number is less than or equal to a specified value. The symbol != corresponds to "not equal". The output of a logical statement is of the class logical.

> "Hello" == "hello"
[1] FALSE
> y <- 3 < 4
> y
[1] TRUE
> class(y)
[1] "logical"

• Logical operators may be applied to individual data entries or entire vectors (or even a data frame!). When applied to a vector, logical operators evaluate each element of the vector.

> x <- c(3, 2, 1, -2, -1)
> x != 1
[1]  TRUE  TRUE FALSE  TRUE  TRUE

• Combine logical statements and operations with & (and) and | (or).

> x > 2 | x <= -1
[1]  TRUE FALSE FALSE  TRUE  TRUE
> x > 0 & x <= 2
[1] FALSE  TRUE  TRUE FALSE FALSE

• Combining logical operators with other commands allows us to perform operations only on elements that meet the logical condition. For instance, we can count the number of TRUE statements using sum().

> sum(x > 0 & x <= 2) # Adds up the number of TRUE statements
[1] 2

• The function is.na() identifies missing data by returning TRUE for each missing element. Many functions, such as mean(), accept the argument na.rm = TRUE to remove missing data before computing.
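The operators in this section can also be tried on a small made-up vector. The sketch below is only an illustration: the vector v and its values are hypothetical and not part of the handout's data. It shows that comparisons with a missing value yield NA, and that na.rm = TRUE gives the same mean as dropping missing entries by hand.

```r
# Hypothetical toy vector with one missing entry (not from the handout)
v <- c(5, -3, 0, NA, 8)

# Element-wise comparison returns a logical vector; the NA stays NA
pos <- v > 0

# | is "or", & is "and"
extreme <- v < 0 | v > 6
middle  <- v >= 0 & v <= 6

# is.na() flags the missing entry
miss <- is.na(v)

# Count the TRUE values; na.rm = TRUE drops the NA before summing
n_pos <- sum(pos, na.rm = TRUE)

# Two equivalent ways to average only the non-missing entries
m1 <- mean(v[!is.na(v)])
m2 <- mean(v, na.rm = TRUE)

print(pos)
print(n_pos)
print(c(m1, m2))
```

A different name (v) is used so that the vector x from the examples in this section is left unchanged.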

> x <- c(x, NA)
> is.na(x) # identifies missing data by returning a logical vector
[1] FALSE FALSE FALSE FALSE FALSE  TRUE
> mean(x) # cannot compute the mean due to missing data
[1] NA
> mean(x[!is.na(x)]) # calculates the mean for only non-missing data
[1] 0.6

2 Subsetting with Logical Expressions

For the remainder of this handout, we will use the following data, which is a collection of county-level data used by D. Matthews and J. Prothro in Negroes and the New Southern Politics. We can answer some interesting questions using this data set. Does the state-wide mean value of black voter registration depend on the existence of a poll tax? What about the literacy requirement? Finally, what do you find when considering the four combinations of these two? The variables of the data are:

Variable       Description
state          state name
county         county code
polltax        the existence of polltax (1 = Yes, 0 = No)
litreq         the existence of literacy requirement (1 = Yes, 0 = No)
blackpop       1960 % black of state population (100s)
pblackreg      % black voting age population registered in 1964 (black registration rate)
fedex66        federal examiner present in county in 1966
pincreasereg   % increase in black registration rate from 1964 to 1968

• Previously, we learned that vectors and data frames can be subsetted by using brackets ([ ]). For example, a subset of a data frame can be obtained by specifying row numbers (or row names) and column numbers (or column names) in brackets. Logical expressions can also be used within brackets for subsetting.

> reg <- read.table("registration.txt", header = TRUE)
> ## black registration is lower where polltax is present
> mean(reg$pblackreg[reg$polltax == 1])
[1] 17.95895
> mean(reg$pblackreg[reg$polltax == 0])
[1] 38.24883
> ## black registration is lower where literacy requirement is imposed
> mean(reg$pblackreg[reg$litreq == 1])
[1] 30.35514
> mean(reg$pblackreg[reg$litreq == 0])
[1] 53.07164
> ## black registration is lowest where both requirements are present
> mean(reg$pblackreg[(reg$polltax == 1) & (reg$litreq == 1)])
[1] 17.95895
> ## no observations returns NaN
> mean(reg$pblackreg[(reg$polltax == 1) & (reg$litreq == 0)])
[1] NaN
> mean(reg$pblackreg[(reg$polltax == 0) & (reg$litreq == 1)])
[1] 34.63745
> mean(reg$pblackreg[(reg$polltax == 0) & (reg$litreq == 0)])
[1] 53.07164

• In addition to [ ], subset() may be used to subset data. It takes a vector or a data frame as the first argument. Then, users can specify subset and/or select as arguments. The former should be a logical vector indicating the elements or rows to keep, while the latter should specify the variables to keep (either by a vector of variable names or by a numeric vector indicating column numbers).

> ## counties with a higher than average black population but lower than
> ## average registration rate
> lowreg <- subset(reg, subset = ((reg$blackpop >= mean(reg$blackpop))
+                                 & (reg$pblackreg <= mean(reg$pblackreg))),
+                  select = c("blackpop", "pblackreg", "polltax", "litreq"))
> ## How many impose both polltax and literacy requirement
> nrow(lowreg[(lowreg$polltax == 1) & (lowreg$litreq == 1), ])
[1] 34
> ## Another way
> sum((lowreg$polltax == 1) & (lowreg$litreq == 1))
[1] 34
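The two subsetting styles above can be compared on a tiny made-up data frame. This is only a sketch: the data frame toy and its columns (county, tax, test, rate) are hypothetical stand-ins, not the registration data. It also shows why the mean of an empty subset comes back as NaN.

```r
# Hypothetical toy data, not the registration data used in the handout
toy <- data.frame(
  county = c("a", "b", "c", "d", "e"),
  tax    = c(1, 1, 0, 0, 0),
  test   = c(1, 1, 1, 0, 0),
  rate   = c(10, 20, 30, 50, 60)
)

# Bracket subsetting of a vector with a logical expression
mean_tax <- mean(toy$rate[toy$tax == 1])

# Bracket subsetting of data frame rows: note the comma before ]
both <- toy[toy$tax == 1 & toy$test == 1, ]

# subset() evaluates column names inside the data frame,
# so rate can be written instead of toy$rate
high <- subset(toy, subset = rate >= mean(rate), select = c(county, rate))

# No row has tax == 1 and test == 0, so the subset is empty
empty_mean <- mean(toy$rate[toy$tax == 1 & toy$test == 0])

print(mean_tax)
print(both)
print(high)
print(empty_mean)
```

Within subset(), writing rate or toy$rate gives the same result here; the shorter form is a convenience that brackets do not offer.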

3 Using Conditional Statements via ifelse()

Conditional statements evaluate a logical statement, then perform different actions depending on whether the statement is true or false. The function ifelse(X, Y, Z) performs an action Y and returns its result as the output if the statement X is true, and performs Z and returns its result if X is false.

> ## Creating a new variable indicating counties with higher than average
> ## black population and polltax
> reg$highpoptax <- ifelse((reg$blackpop >= mean(reg$blackpop) & reg$polltax == 1),
+                          "Yes", "No")
> ## a more complex example creating region variable
> reg$region <- ifelse(reg$state == "Alabama" | reg$state == "Georgia" |
+                      reg$state == "Louisiana" | reg$state == "Mississippi" |
+                      reg$state == "South Carolina", "Deep South", "Peripheral South")
> reg$region <- as.factor(reg$region)
> table(reg$region)

      Deep South Peripheral South
             361               76

4 More Functions for Summarizing Data

In addition to the functions we learned last week (i.e., mean(), median(), min(), max(), and range()), we have the following new functions that are useful for summarizing data.

• var() (variance) and sd() (standard deviation) summarize numeric data.

> ## two ways of calculating standard deviation
> sd(reg$pblackreg)
[1] 25.34743
> sqrt(var(reg$pblackreg))
[1] 25.34743

• A weighted mean can be computed using weighted.mean(X, Y), where the output is the mean of X weighted by Y.

> ## overall registration rate should be weighted by county population
> weighted.mean(reg$pblackreg, reg$blackpop)
[1] 32.04816

• The function quantile(X, P) provides the sample quantiles of a numeric vector X for each element of another numeric vector P of probabilities.

> quantile(reg$pblackreg) # the default is quartiles plus min and max
  0%  25%  50%  75% 100%
 0.0 12.3 29.3 52.9 99.9
> quantile(reg$pblackreg, seq(from = 0.2, to = 0.8, by = 0.2)) # quintiles
  20%   40%   60%   80%
 8.70 23.10 36.00 58.18

• The function IQR() returns the interquartile range, i.e., the 75% quantile minus the 25% quantile.

> IQR(reg$pblackreg)
[1] 40.6

5 Applying Functions by Indexes

In many situations, we want to apply the same function repeatedly to different parts of the data. For example, in the black registration data, we may want to compute the registration rate within each state. Doing this manually is a pain, especially if the number of states is large: you have to subset the data for one state, use mean() to compute the registration rate for that state, and repeat this for each state. The function tapply() (t is shorthand for table) enables you to do such a computation in one line. Specifically, tapply(X, INDEX, FUN) applies the function FUN to X for each of the groups defined by a vector INDEX. Replace FUN with mean, median, sd, etc. to generate the desired quantity.

> ## Calculate the mean of % black registration rates by state
> tapply(reg$pblackreg, reg$state, mean)
       Alabama        Florida        Georgia      Louisiana    Mississippi
     24.142424      53.071642      31.707006      37.990476       3.886207
North Carolina South Carolina
     47.977777      37.436956

6 Writing Functions

One of the greatest benefits of R is the flexibility the software allows for users to write their own functions. The syntax takes the form of name <- function(bar1, bar2, ...){ }, where name is the function name, (bar1, bar2, ...) are the inputs, and the commands within the curly brackets { } define the function. We begin with a simple example, creating a function to compute the mean of a vector with missing data. Note that an opening curly brace should never go on its own line. A closing curly brace should always go on its own line. Additionally, code within brackets should be aligned according to the text editor's automatic alignment.

> x <- c(10:22, NA, 1:7, NA, 5)
> mean(x) # cannot compute mean due to missing data
[1] NA

> my.mean <- function(x){
+   x <- x[!is.na(x)] # removes missing data
+   sum <- sum(x)
+   length <- length(x)
+   mean <- sum/length
+   out <- c(sum, length, mean) # define the output
+   names(out) <- c("sum", "length", "mean")
+   out # end function by calling output
+ }
> my.mean(x)
      sum    length      mean
241.00000  21.00000  11.47619

Programming Camp Coding Rule: Curly Brackets

An opening curly brace should never go on its own line. A closing curly brace should always go on its own line. Code within brackets should be properly aligned.

GOOD Code:
name <- function(bar1, bar2, ...){
  command1 <- code1
  command2 <- code2
}

BAD Code:
name <- function(bar1, bar2, ...)
{command1 <- code1
command2 <- code2}

7 Graphs for Univariate Data: Histograms

Graphs are critical tools for summarizing data in a straightforward and easy to understand manner. Great graphics strengthen projects and reports by illustrating central features of the data without much additional explanation. Bad graphics are inefficient (leaving out critical information such as labels), potentially misleading, or too complicated.

• There are several common graphing arguments that specify basic features of the graph, including the number of figures included on a graph, titles, axis labels, data range, etc. The following table summarizes these arguments:

main         Main title of the graph.
xlab, ylab   Labels for the x-axis and y-axis.
xlim, ylim   Specifies the x-limits and y-limits, as in xlim = c(0, 10), for the interval [0, 10].
col          Specifies the color to use, e.g., "blue" or "red".
cex          Specifies size of plotted text or symbols.
cex.axis     Specifies size of axis annotation.
cex.main     Specifies size of plot title.

• The second class of graphing commands adds additional features to an existing graph. These functions include points() for adding points, lines() for lines, and text() for text.

lines()    Adds a plot-line to the figure
           e.g. lines(x, y) where x and y define coordinates
abline()   Adds a straight line
           e.g. abline(h = x) to place a horizontal line at height x
           e.g. abline(v = x) to place a vertical line at point x
           e.g. abline(a = x, b = y) to place a line with intercept x and slope y
points()   Adds points
           e.g. points(x, y) to place dots with x and y as the coordinates
           e.g. points(x, y, type = "l") connects the points as a line
text()     Adds additional text
           e.g. text(x, y, z) to display z as text centered at coordinates (x, y)
arrows()   Adds arrows
           e.g. arrows(x0, y0, x1, y1, length, angle, code) to display an arrow from
           coordinates (x0, y0) to coordinates (x1, y1), with the arrowhead length
           and angle specified, of the arrow type specified by code, and of the
           color specified by col

• The function identify() allows us to click on points in our graphs, and R will return the indexes of those data points. When done, press Esc.

• The character \n will force a line break. This is convenient to use with long plot titles.

• The command par(mfrow = c(X, Y)) will produce an X by Y array of figures in one graphical window. mfrow means the graphs will be filled by row, whereas mfcol means they will be filled by column.

• The function hist() will produce a histogram to summarize the distribution of data. Setting freq = FALSE within hist() will produce a density histogram, in which the bar areas sum to one, rather than a frequency (count) histogram. If you specify a single number as the argument breaks, you will be able to set the number of equally spaced bins. If you give a numeric vector instead, it will specify the breakpoints between histogram cells.

• The function density() will compute a smooth density estimate of a numeric object as an output, which in turn can be an input to the plot() function to draw the smooth histogram (use the lines() function to add it to an existing graph).

• To add a legend to an existing graph, use legend(). The syntax legend(x, y, z) adds a legend with text z at coordinates (x, y), which can also be replaced with a keyword such as "topleft", "bottomright", etc.

> ## begin by subsetting the data
> examiner <- reg[reg$fedex66 == 1, ]
> noexaminer <- reg[reg$fedex66 == 0, ]
> ## side by side histograms of registration rates
> par(mfrow = c(1, 2))
> hist(examiner$pblackreg, freq = FALSE, breaks = 10, xlim = c(0, 100),
+      main = "Federal Examiner Present",
+      xlab = "Registration Rates")
> hist(noexaminer$pblackreg, freq = FALSE, breaks = 10, xlim = c(0, 100),
+      main = "No Federal Examiner Present", cex.main = 0.995, ## smaller plot title
+      xlab = "Registration Rates")

[Figure: side-by-side density histograms of registration rates (x-axis: Registration Rates, y-axis: Density); left panel "Federal Examiner Present", right panel "No Federal Examiner Present".]

> ## return to single graph
> par(mfrow = c(1, 1))
> ## histogram for counties with examiner
> hist(examiner$pblackreg, freq = FALSE, breaks = 10, xlim = c(0, 100),
+      main = "Registration Rates \n Federal Examiner Present", cex.main = 1.5,
+      xlab = "Registration Rates", cex.axis = 1.5)
> ## add counties with no-examiner as smooth density
> lines(density(noexaminer$pblackreg))
> ## add lines to compare median of counties with/without examiner
> abline(v = median(examiner$pblackreg), col = "red", lty = 2)
> abline(v = median(noexaminer$pblackreg), col = "blue", lty = 2)
> ## add legend
> legend("topright", c("Examiner Median", "No Examiner Median"),
+        lty = c(2, 2), col = c("red", "blue"))
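The same layering of a histogram, a density curve, median lines, and a legend can be tried without the registration data. The following is a minimal sketch under stated assumptions: the two groups g1 and g2 are simulated with rnorm() and are purely hypothetical, and the plot is drawn to a temporary PDF file (via pdf() and dev.off()) so that the code also runs where no graphics window is available.

```r
set.seed(1)                            # make the simulated draws reproducible
g1 <- rnorm(200, mean = 40, sd = 10)   # hypothetical group 1
g2 <- rnorm(200, mean = 60, sd = 10)   # hypothetical group 2

# Draw into a temporary PDF file instead of a graphics window
tmp <- tempfile(fileext = ".pdf")
pdf(file = tmp, height = 4, width = 6)

# Density histogram for group 1; \n breaks the title into two lines
h <- hist(g1, freq = FALSE, breaks = 10, xlim = c(0, 100),
          main = "Simulated Group 1 \n with Group 2 Density",
          xlab = "Value")

# Smooth density for group 2, added to the existing histogram
d <- density(g2)
lines(d)

# Dashed vertical lines at each group's median
abline(v = median(g1), col = "red", lty = 2)
abline(v = median(g2), col = "blue", lty = 2)

# Legend entries match the two dashed lines
legend("topright", c("Group 1 Median", "Group 2 Median"),
       lty = c(2, 2), col = c("red", "blue"))

dev.off()                              # close the device to finish the file
```

Because freq = FALSE is used, the bar areas of the histogram sum to one, which puts the histogram and the density curve on the same vertical scale.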

[Figure: density histogram titled "Registration Rates / Federal Examiner Present" (x-axis: Registration Rates, y-axis: Density), with legend entries "Examiner Median" and "No Examiner Median".]

8 Printing and Saving Graphs

There are a few ways to print and save the graphs you create in R.

• In the window of your graph (if you are a Mac user, make sure your graphic window rather than the R console is selected), you can click File: Save as: PDF... or File: Print...

• You can also right-click on a figure in R and copy the image (if you are a Mac user, you need to highlight the graph and type Apple+C to copy it). Then paste that image into Microsoft Word or any other document.

• You can also do it via a command by using pdf() before your plotting commands and then dev.off() afterwards.

> pdf(file = "myplot.pdf", height = 3, width = 5) # height and width are in inches
> ## plotting commands go here
> dev.off() ## This creates a pdf file in the working directory

9 Practice Questions

9.1 Supreme Court Justice Ideal Points

In a 2002 article, Andrew Martin and Kevin Quinn explored the extent to which the ideal points (i.e., policy preferences) of Supreme Court Justices change throughout their tenure on the Court. The data set contains the following:

• term: Supreme Court Term
• justice: Justice's Last Name
• idealpt: Justice's Estimated Ideal Point, where negative values indicate liberal leanings and positive values indicate conservative leanings

• pparty: President's Political Party

1. Using the tapply() function, create a variable for the median ideal point of court justices for each term of the court.

2. Generate a new variable in the justices data set to indicate whether each justice falls on the Conservative or Liberal end of the ideal point spectrum. Using ifelse(), generate a new variable that takes a value of Liberal if the justice's ideal point is less than 0 and a value of Conservative if the justice's ideal point is greater than 0. Using table(), determine how many justices in the data set were Conservative and how many were Liberal.

3. Create a histogram of justices' ideal points. Using tapply(), calculate each justice's median ideal point. Generate a histogram of the justices' ideal points. Be sure to add an informative title and labels. Create a red, vertical dashed line indicating the median. Additionally, add the density line to the plot. Save the graph you created as a pdf file using the file name xxx.pdf, where xxx is your netid. Submit it to Blackboard along with your R script file xxx.r (do not turn in your R console printout).

9.2 The Impact of Increases in the Minimum Wage

Many economists believe that increasing the minimum wage actually hurts the poor, the very part of the population such a policy is supposed to help. The reason is that if employers have to pay higher wages, then they would simply hire fewer people. This means that those who are earning the minimum wage may lose their jobs as a result of increasing the minimum wage. Two researchers, David Card and Alan Krueger, tested this argument using data from the fast food industry in New Jersey and Pennsylvania. We analyze their original data in this precept.

The njmin.txt data file, available at Blackboard, contains the following variables:

Variable     Description
chain        fast food chain
location     store location (southj, centralj, northj, shorej & PA)
wagebefore   starting wage measured before the increase
wageafter    starting wage measured after the increase
fullbefore   number of full-time employees before the increase
fullafter    number of full-time employees after the increase
partbefore   number of part-time employees before the increase
partafter    number of part-time employees after the increase

1. Load the data into R.

2. Create a factor variable called state, which takes two values, NJ and PA. How many stores in NJ and PA does the study sample contain, respectively? Which chain has the largest number of restaurants in NJ and in PA, respectively, in this study sample?

3. Create four histograms in one graph using the starting wage data; starting from the upper left corner in a clockwise manner: NJ before the increase, NJ after the increase, PA after the increase, and PA before the increase. Add informative labels to each graph. Are the starting wages similar between NJ and PA before the increase? What about after the increase? Within each state, does the histogram look similar before and after the increase?

4. Compute the average number of full-time employees in NJ separately before and after the increase. Do the same for PA. What do these numbers tell you about the impact of the increase in minimum wage? Are these average differences large compared to the standard deviation of full-time employees before the change in each state?

5. Calculate the difference in the number of full-time employees before and after the increase within each state. Summarize the data using two smoothed histograms in one plot (red solid line for NJ, and blue solid line for PA), with dashed lines representing the mean difference of each state. Finally, calculate the difference in differences between the two states. (If you are curious, go ahead and conduct the same calculation for part-time employment and see if similar results are obtained.)