LAB #6: DATA HANDLING AND MANIPULATION

NAVAL POSTGRADUATE SCHOOL
Statistics (OA3102)

Goal: Introduce students to various R commands for handling and manipulating data, including resources for learning more about R, the R editor, and loading R packages.
Lab type: Interactive lab demonstration.
Time allotted: Lecture for ~50 minutes.
Data: Rothkopf Data 2004 to 2010.csv
Other information: INFORMS journal paper based on the dataset

DEMONSTRATION

1. Probability Functions.

   a. Before we learn about loops and other types of repeating functions, the table below shows the various discrete and continuous probability functions available in R.

   b. As you've seen in class, R has a specific naming convention for its probability functions. Every probability distribution has an abbreviated name that is preceded by one of four letters:

      i. p: the function returns the cumulative probability P(Y <= y)
      ii. d: the function returns either P(Y = y) for a discrete probability distribution or the density f(y) for a continuous distribution
      iii. r: the function returns one or more random draws from the specified distribution
      iv. q: the function returns the quantile, which for a given cumulative probability p is the smallest y value for which P(Y <= y) is at least p (for a continuous distribution, the y value with P(Y <= y) = p)

2. Manipulating Data in R. Now, let's learn a bit about manipulating data in R. To begin, download the "Rothkopf Data 2004 to 2010.csv" dataset from the course Sakai site.

   a. As you've done before, read the data into R:

          RData <- read.csv(file.choose())

      Note that, rather than use the file.choose() function, you can explicitly specify the path to the file. For example:

          RData <- read.csv("/users/ron Fricker/Desktop/Rothkopf Data 2004 to 2010.csv", header=TRUE)

   b. Now, let's learn how to look at and through the data.

      i. First, let's make sure we know what type of object RData is:

          class(RData)

      ii. Now, let's check over the dataset:

          dim(RData)
          summary(RData)

      iii. As we've talked about in previous labs, we can now view, extract, or print a variable, say the variable School, by typing

          RData$School

         Note that this variable is a "factor." Check that by typing

          class(RData$School)

         (In R 4.0.0 and later, read.csv() leaves text columns as character unless you add stringsAsFactors=TRUE to the call.)

         Factors are a way to store categorical data compactly. A factor assigns each category an integer value, which it stores in the vector, along with a mapping from the integers to the category names. What you see at the bottom of the output are the levels (i.e., names) of the categories, along with how many categories there are. Here we see that there are 192 levels, which correspond to 192 unique school names in the data.
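To see what a factor stores, here is a minimal sketch on a small made-up vector. The object name schools and its values are illustrative only and are not taken from the Rothkopf data.

```r
# Toy character vector; the values are made up for illustration only
schools <- c("NPS", "MIT", "NPS", "Stanford", "MIT", "NPS")

# Convert to a factor: R stores integer codes plus a table of levels
f <- factor(schools)

levels(f)       # the unique category names, in sorted order
as.integer(f)   # the integer code stored for each element
nlevels(f)      # how many categories there are

# Indexing the levels by the codes recovers the original strings
recovered <- levels(f)[as.integer(f)]
identical(recovered, schools)
```

The same three calls (levels(), as.integer(), nlevels()) can be applied to any factor column, which is a quick way to confirm the level count reported at the bottom of the printed output.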

         We can reassign the type of variable to, say, character by:

          RData$School <- as.character(RData$School)

         Check it:

          class(RData$School)

         And note how the output now looks different:

          RData$School

   c. To extract elements of the vector or dataset, we can use square brackets to specify the specific element or elements. For example, to look at the first entry in the School variable, type

          RData$School[1]

      To see the first three entries in the School vector, type

          RData$School[c(1,2,3)]

      or, more succinctly,

          RData$School[1:3]

      i. For a two-dimensional dataset, the notation generalizes. So, for example, if we wanted to look at the element in the first column of the first row of the entire dataset, we type

          RData[1,1]

         where in [i,j] the i corresponds to the row or rows in the data and the j to the column or columns. So now, for example, we can look at the data in the first three columns of the first two rows by typing:

          RData[1:2,1:3]

      ii. To return either all the rows or all the columns, leave either the first or second position in the square brackets blank. For example, to look at the first two rows (observations) of the data, type:

          RData[1:2,]

         Essentially, the blank in the columns position says to return all the columns for rows 1 and 2.

      iii. You can do the same thing for all rows for a given column. For example, since School is the 11th variable in the dataset, the two expressions below return the same thing:

          RData$School
          RData[,11]

         How did I know that the variable School was the 11th column? Well, we can look back at the output from the previous step and just count the columns. Or, we can use the names() function to get the column names (and again count over to figure out where the School variable is located in the data frame):

          names(RData)
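Counting columns by eye gets tedious for wide datasets. As a hedged sketch, the position can also be looked up by name. The data frame df below, with its column names and values, is made up for illustration and stands in for the real dataset.

```r
# Toy data frame; column names and values are made up for illustration
df <- data.frame(Year   = c(2008, 2009, 2010),
                 Author = c("A", "B", "C"),
                 School = c("X", "Y", "X"),
                 stringsAsFactors = FALSE)

# Position of the School column, found by name rather than by counting
school.col <- which(names(df) == "School")
school.col

# These three expressions all return the same vector
by.dollar <- df$School
by.number <- df[, school.col]
by.name   <- df[, "School"]
identical(by.dollar, by.number) && identical(by.dollar, by.name)
```

Indexing by name has the added benefit that the code keeps working if columns are later reordered.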

      iv. Now, as we've discussed in class, the square brackets don't have to just contain numbers. They can also contain logical expressions. For example, to extract all the authors not at a US school, type:

          RData[RData$Country!="United States",7]

         The "!=" is a logical not equals. Another example: to extract all the authors from NPS, type:

          unique(RData[RData$School=="Naval Postgraduate School",7])

         Here the unique() function is helpful: it returns each unique value only once, so that we don't have multiples of the same name. The double equals sign is a logical equals.

         Here's a more complicated example to extract all the authors from the Naval Postgraduate School with papers published from 2008 to 2010:

          RData[RData$School=="Naval Postgraduate School" & (RData$Year==2008 | RData$Year==2009 | RData$Year==2010),]

         Note some of the syntax: the ampersand is a "logical and" and the vertical pipe ("|") is a "logical or". The above statement says, "Return all rows in the dataset for which the School variable contains the value 'Naval Postgraduate School' and the Year variable contains either 2008 or 2009 or 2010."

         That results in a lot of output. What if all we wanted were the specific articles that were published by NPS authors? Here's one way to get them:

          unique( RData[RData$School=="Naval Postgraduate School" & (RData$Year==2008 | RData$Year==2009 | RData$Year==2010),6] )

         And then we might want to sort them in alphabetical order:

          sort( unique( RData[RData$School=="Naval Postgraduate School" & (RData$Year==2008 | RData$Year==2009 | RData$Year==2010),6] ) )

         Note how we can just keep wrapping functions, one around the next, to get the output in the format we want.

   d. So, how does this work? The logical statements above simply produce a vector of logical values (TRUE and FALSE), and wherever the vector takes on a value of TRUE the data is extracted (or acted upon).
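To make the TRUE/FALSE mechanics concrete, here is a small sketch on a made-up data frame. The object toy and its values are hypothetical, and only the column names School and Year mirror the lab data. It also shows %in%, a base R operator that gives a more compact equivalent of the chained "or" comparisons.

```r
# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(School = c("Naval Postgraduate School", "Other U",
                             "Naval Postgraduate School",
                             "Naval Postgraduate School"),
                  Year   = c(2007, 2009, 2010, 2008),
                  stringsAsFactors = FALSE)

# Each comparison yields one TRUE/FALSE value per row
is.nps      <- toy$School == "Naval Postgraduate School"
in.years.or <- toy$Year == 2008 | toy$Year == 2009 | toy$Year == 2010

# %in% is a compact equivalent of the chained "or" comparisons
in.years.in <- toy$Year %in% 2008:2010
identical(in.years.or, in.years.in)

# Combine with "&" and use the resulting logical vector to pick rows
keep <- is.nps & in.years.in
keep
toy[keep, ]
```

Storing the intermediate vectors (is.nps, keep) is optional, but printing them is a handy way to debug a long condition before using it inside the brackets.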

      i. Once you get used to it, this is a very powerful and convenient feature of R. It allows you to work with subsets of data "on the fly" without having to save subsets of the data. Let's illustrate with a small dataset:

          small.data <- c(5,7,2,78,3,11,9,3,2,12,7,3,9,8,4,56,2,6,9,22)

      ii. What if we wanted to know the sum of the observations in small.data that are less than 10? Well, let's start by seeing which observations are less than 10. Type this:

          small.data<10

         Here we see that the first, second, third, and fifth observations, among others, are less than 10. Let's save this logical vector for later use:

          t.f.vector <- small.data<10

         Now, look what we get with

          small.data[t.f.vector]

         We get only those observations for which t.f.vector equals TRUE. So, we can calculate the desired sum with:

          sum(small.data[t.f.vector])

         That took two steps. Here's how we do it in one:

          sum(small.data[small.data<10])

         Did we get the right answer? Check:

          5+7+2+3+9+3+2+7+3+9+8+4+2+6+9

         Yep!

         Now, note that the logical inside the brackets can be based on anything. For example, what if I just wanted a random subset of the observations? Here's one (not particularly useful) way to do that:

          small.data[runif(length(small.data))>0.5]

         Finally, note that we can use this type of querying in lots of useful ways. For example, if I wanted to count the number of observations that meet a particular criterion, say the observation is greater than 7:

          table(small.data>7)

3. Using Logical Expressions to Clean Up and Recode Messy Data. So, we saw in the last lab that the Iraq dataset is pretty messy (as is any real-world dataset). So, let's take what we've just learned and apply it to cleaning up that dataset (a bit).

   a. First, if you didn't save it from the last lab, re-read in the Iraq dataset:

          iraq <- read.csv(file.choose())

   b. Now, let's subset the data to only those casualties with the Country variable equal to US (where for purposes of this exercise we'll assume that the Country variable is accurate):

          iraq.us <- iraq[iraq$country=="us",]
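One thing to watch for when subsetting messy data this way: if the variable being compared contains missing values, the comparison returns NA for those rows, and indexing with NA produces rows filled with NA. A minimal sketch on made-up data follows. The data frame toy and its values are hypothetical, and whether the real Iraq data has missing country values is not something this handout states.

```r
# Toy data frame with a missing country value (all values are made up)
toy <- data.frame(country = c("us", NA, "uk", "us"),
                  state   = c("Ohio", "Texas", NA, "Iowa"),
                  stringsAsFactors = FALSE)

# A comparison against NA yields NA, not FALSE
toy$country == "us"

# Indexing with that vector keeps an all-NA row for the NA entry
with.na.row <- toy[toy$country == "us", ]
nrow(with.na.row)

# which() keeps only the positions that are TRUE, dropping the NA
clean <- toy[which(toy$country == "us"), ]
nrow(clean)

# subset() treats NA in the condition as FALSE, so it gives the same rows
identical(clean$state, subset(toy, country == "us")$state)
```

Wrapping the condition in which() is a small change to the bracket syntax used in this lab. subset() is an alternative that reads more like a query.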

      And now let's see which states appear in this data:

          sort(unique(iraq.us$state))

      So, let's create a vector of actual states against which to match:

          states <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming")

      How many observations do not have states in this list?

          table(is.na(match(iraq.us$state,states)))

      Let's look at them to make sure we're right:

          iraq.us[is.na(match(iraq.us$state,states)),c(10,11,12)]

      And now we further subset the data:

          iraq.us <- iraq.us[!is.na(match(iraq.us$state,states)),]

   c. So, let's create a new variable that corresponds to regions in the country.
      To do so, first we need to define the regions:

          west <- c("California", "Colorado", "Idaho", "Montana", "Nevada", "Oregon", "Utah", "Washington", "Wyoming")
          southwest <- c("Arizona", "New Mexico", "Oklahoma", "Texas")
          southeast <- c("Alabama", "Arkansas", "Florida", "Georgia", "Kentucky", "Louisiana", "Mississippi", "North Carolina", "South Carolina", "Tennessee", "Virginia", "West Virginia")
          northeast <- c("Connecticut", "Delaware", "Maine", "Maryland", "Massachusetts", "New Hampshire", "New Jersey", "New York", "Pennsylvania", "Rhode Island", "Vermont")
          midwest <- c("Illinois", "Indiana", "Iowa", "Kansas", "Michigan", "Minnesota", "Missouri", "Nebraska", "North Dakota", "Ohio", "South Dakota", "Wisconsin")
          ak.hi <- c("Alaska", "Hawaii")

      And now, let's create a new Region variable in the iraq.us dataset:

          iraq.us$region <- "West"
          iraq.us$region[!is.na(match(iraq.us$state,southwest))] <- "Southwest"
          iraq.us$region[!is.na(match(iraq.us$state,southeast))] <- "Southeast"
          iraq.us$region[!is.na(match(iraq.us$state,northeast))] <- "Northeast"

          iraq.us$region[!is.na(match(iraq.us$state,midwest))] <- "Midwest"
          iraq.us$region[!is.na(match(iraq.us$state,ak.hi))] <- "Alaska/Hawaii"

      Finally, let's inspect our handiwork:

          table(iraq.us$region)
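Because the recoding above starts by giving every row the default value "West", a state that is left out of all the region vectors silently ends up in the West. A hedged sketch of two safeguards follows, using deliberately tiny, made-up region lists. The objects all.regions, unassigned, duplicated.states, and region.of are illustrative names, not part of the lab.

```r
# Small made-up region definitions (only a few states, for illustration)
west      <- c("California", "Oregon")
southwest <- c("Arizona", "Texas")
midwest   <- c("Ohio", "Iowa")
states    <- c("Arizona", "California", "Iowa", "Missouri",
               "Ohio", "Oregon", "Texas")

# Sanity checks: every state should appear in exactly one region
all.regions       <- c(west, southwest, midwest)
unassigned        <- setdiff(states, all.regions)           # in no region
duplicated.states <- all.regions[duplicated(all.regions)]   # listed twice
unassigned
duplicated.states

# Named lookup vector: names are states, values are region labels
region.of <- c(setNames(rep("West", length(west)), west),
               setNames(rep("Southwest", length(southwest)), southwest),
               setNames(rep("Midwest", length(midwest)), midwest))

# Indexing by state name recodes a whole vector at once; a state missing
# from every list comes back as NA instead of a silent default
state  <- c("Texas", "Ohio", "Missouri", "Oregon")
region <- unname(region.of[state])
region
```

The lookup-vector approach trades the sequence of assignments for a single indexing step, and the NA it returns for an unlisted state makes gaps in the region definitions visible in the table() output when useNA="ifany" is supplied.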

Name:

INDIVIDUAL EXERCISES

1. First, do some data extraction from the Rothkopf dataset:

   a. Who is the 100th author listed in the database?
   b. What are the names of the papers I've published in Interfaces?
   c. What are the names of the NPS faculty who published one or more "Article"s in Interfaces in 2010?
   d. What are the last names of those authors in the data with first name Michael?

2. Now, returning to the Lab 5 homework and the Iraq casualties dataset, do some revised plots. That is, create the plots below and turn them in with your answers to the above questions.

   a. Create a pie chart of the fraction of casualties by region. Appropriately label and embellish the plot. This time, make a plot that you would feel comfortable actually briefing to a commander.
   b. Now create a bar chart of the same data. Again, appropriately label and embellish the plot, including the axes, for a commander's briefing.
   c. Subset the iraq.us dataset to contain data only from the southeast region and create a horizontal bar chart of the number of casualties by state. Again, appropriately annotate and embellish the plot.
   d. Again subset the iraq.us dataset, but now to contain data only from your home state. Create a horizontal bar chart of the Minor.Cause.of.Death variable. Again, appropriately annotate and embellish the plot.