LAB #6: DATA HANDLING AND MANIPULATION

NAVAL POSTGRADUATE SCHOOL LAB #6: DATA HANDLING AND MANIPULATION Statistics (OA3102)

Lab #6: Data Handling and Manipulation

Goal: Introduce students to various R commands for handling and manipulating data, including resources for learning more about R, the R editor, and loading R packages.
Lab type: Interactive lab demonstration.
Time allotted: Lecture for ~50 minutes.
Data: Rothkopf Data 2004 to 2010.csv
Other information: Informs journal paper based on the dataset

DEMONSTRATION

1. Probability Functions.

a. Before we learn about loops and other types of repeating functions, the table below shows the various discrete and continuous probability functions available in R.

[Table: discrete and continuous probability functions available in R]
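As a minimal sketch of a few of these probability functions in use, the calls below work with the normal and binomial distributions. The choice of distributions, the parameter values, the variable names, and the seed are all assumptions made for illustration; they are not taken from the lab's table.

```r
# Normal distribution (mean 0, sd 1 by default)
dens  <- dnorm(0)          # density f(0)
cum   <- pnorm(1.96)       # cumulative probability P(Y <= 1.96)
quant <- qnorm(0.975)      # y such that P(Y <= y) = 0.975

set.seed(1)                # arbitrary seed so the draws are reproducible
draws <- rnorm(5)          # five random draws

# Binomial distribution with 10 trials and success probability 0.5
pmf <- dbinom(3, size = 10, prob = 0.5)    # P(Y = 3)
cdf <- pbinom(3, size = 10, prob = 0.5)    # P(Y <= 3)
qb  <- qbinom(0.5, size = 10, prob = 0.5)  # smallest y with P(Y <= y) >= 0.5
```

Note that pnorm() and qnorm() are inverses of one another: pnorm(qnorm(0.975)) returns 0.975.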

b. As you've seen in class, R has a specific naming convention for its probability functions. Every probability distribution has an abbreviated name that is preceded by one of four letters:

i. p: the function returns the cumulative probability P(Y ≤ y)
ii. d: the function returns either the P(Y = y) value for a discrete probability distribution or f(y) for a continuous distribution
iii. r: the function returns one or more random draws from the specified distribution
iv. q: the function returns the quantile, which for a given cumulative probability p is the y value such that P(Y ≤ y) = p for a continuous distribution, or the smallest y value with P(Y ≤ y) ≥ p for a discrete distribution

2. Manipulating Data in R. Now, let's learn a bit about manipulating data in R. To begin, download the "Rothkopf Data 2004 to 2010.csv" dataset from the course Sakai site.

a. As you've done before, read the data into R:

    RData <- read.csv(file.choose())

Note that, rather than use the file.choose() function, you can explicitly specify the path to the file. For example:

    RData <- read.csv("/users/ron Fricker/Desktop/Rothkopf Data 2004 to 2010.csv", header=TRUE)

b. Now, let's learn how to look at and through the data.

i. First, let's make sure we know what type of object RData is:

    class(RData)

ii. Now, let's check over the dataset:

    dim(RData)
    summary(RData)

iii. As we've talked about in previous labs, we can now view, extract, or print a variable, say the variable School, by typing

    RData$School

Note that this variable is a "factor." Check that by typing

    class(RData$School)

Factors are a way to store categorical data compactly. A factor assigns each category an integer value, which it stores in the vector, along with a mapping from the integers to the category names. What you see at the bottom of the output are the levels (i.e., names) of the categories, along with how many categories there are. Here we see that there are 192 levels, which correspond to 192 unique school names in the data.
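To see the integer codes and the level mapping that a factor stores, here is a small sketch. The school names are hypothetical stand-ins for the values in RData$School, not values taken from the dataset.

```r
# Hypothetical school names standing in for RData$School
school <- factor(c("MIT", "NPS", "MIT", "Stanford", "NPS"))

levels(school)        # the category names (the mapping)
nlevels(school)       # how many distinct categories there are
as.integer(school)    # the integer codes actually stored in the vector
as.character(school)  # converted back to plain character strings
```

One caveat: in R 4.0.0 and later, read.csv() reads text columns as character by default, so School comes in as a factor only if the file is read with stringsAsFactors=TRUE.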

We can reassign the type of variable to, say, character by:

    RData$School <- as.character(RData$School)

Check it:

    class(RData$School)

And note how the output now looks different:

    RData$School

c. To extract elements of the vector or dataset, we can use square brackets to specify the specific element or elements. For example, to look at the first entry in the School variable, type

    RData$School[1]

To see the first three entries in the School vector, type

    RData$School[c(1,2,3)]

or, more succinctly,

    RData$School[1:3]

i. For a two-dimensional dataset, the notation generalizes. So, for example, if we wanted to look at the element in the first column of the first row of the entire dataset we type

    RData[1,1]

where in [i,j] the i corresponds to the row or rows in the data and the j to the column or columns. So now, for example, we can look at the data in the first three columns of the first two rows by typing:

    RData[1:2,1:3]

ii. To return either all the rows or all the columns, leave either the first or second position in the square brackets blank. For example, to look at the first two rows (observations) of the data, type:

    RData[1:2,]

Essentially, the blank in the columns position says to return all the columns for rows 1 and 2. You can do the same thing for all rows for a given column. For example, since School is the 11th variable in the dataset, the two expressions below return the same thing:

    RData$School
    RData[,11]

iii. How did I know that the variable School was the 11th column? Well, we can look back at the output from the previous step and just count the columns. Or, we can use the names() function to get the column names (and again count over to figure out where the School variable is located in the data frame):

    names(RData)
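The indexing rules above can be tried out on a small data frame. This is only a sketch: the data frame toy and its column names and values are hypothetical, and which() is used here as one way to find a column's position without counting by hand.

```r
# Hypothetical toy data frame; not the layout of the Rothkopf data
toy <- data.frame(
  Author = c("Smith", "Jones", "Lee", "Kim"),
  School = c("NPS", "MIT", "NPS", "Stanford"),
  Year   = c(2008, 2009, 2010, 2007),
  stringsAsFactors = FALSE
)

toy[1, 1]       # row 1, column 1
toy[1:2, 1:3]   # rows 1-2, columns 1-3
toy[1:2, ]      # rows 1-2, all columns
toy[, 2]        # all rows, column 2

# Find the position of a column by name instead of counting
school.col <- which(names(toy) == "School")
identical(toy[, school.col], toy$School)

# Columns can also be selected directly by name
toy[, "School"]
```

Selecting by name is less fragile than selecting by position, since the code keeps working if columns are added or reordered.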

iv. Now, as we've discussed in class, the square brackets don't have to just contain numbers. They can also contain logical expressions. For example, to extract all the authors not at a US school, type:

    RData[RData$Country!="United States",7]

The "!=" is a logical not equals. Another example: to extract all the authors from NPS, type:

    unique(RData[RData$School=="Naval Postgraduate School",7])

Here the unique() function is helpful: it returns each unique value only once, so that we don't have multiples of the same name. The double equals sign is a logical equals. Here's a more complicated example to extract the records for all the authors from the Naval Postgraduate School with papers published from 2008 to 2010:

    RData[RData$School=="Naval Postgraduate School" & (RData$Year==2008 | RData$Year==2009 | RData$Year==2010),]

Note some of the syntax: the ampersand is a "logical and" and the vertical pipe ("|") is a "logical or". The above statement says, "Return all rows in the dataset for which the School variable contains the value "Naval Postgraduate School" and the Year variable contains either "2008" or "2009" or "2010"." That results in a lot of output. What if all we wanted were the specific articles that were published by NPS authors? Here's one way to get them:

    unique( RData[RData$School=="Naval Postgraduate School" & (RData$Year==2008 | RData$Year==2009 | RData$Year==2010),6] )

And then we might want to sort them in alphabetical order:

    sort( unique( RData[RData$School=="Naval Postgraduate School" & (RData$Year==2008 | RData$Year==2009 | RData$Year==2010),6] ) )

Note how we can just keep wrapping functions, one around the next, to get the output in the format we want.

d. So, how does this work? The logical statements above simply produce a vector of logical values (TRUE and FALSE), and wherever the vector takes on a value of TRUE the data is extracted (or acted upon).
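To make that logical vector visible, here is a minimal sketch on a four-row data frame. The data frame toy and its values are hypothetical; the %in% operator is shown as a shorter equivalent of the chained "or" comparisons, not as the form used in the lab.

```r
# Hypothetical toy data; not rows from the Rothkopf dataset
toy <- data.frame(
  School = c("NPS", "MIT", "NPS", "NPS"),
  Year   = c(2008, 2009, 2007, 2010),
  Title  = c("Paper A", "Paper B", "Paper C", "Paper A"),
  stringsAsFactors = FALSE
)

# The compound condition evaluates to one TRUE/FALSE value per row
keep <- toy$School == "NPS" &
  (toy$Year == 2008 | toy$Year == 2009 | toy$Year == 2010)
keep

# Same condition written with %in% in place of the chained "or"
keep2 <- toy$School == "NPS" & toy$Year %in% 2008:2010
identical(keep, keep2)

# Rows where keep is TRUE are the ones extracted
titles <- sort(unique(toy[keep, "Title"]))
titles
```

Here keep is TRUE for the first and fourth rows only, so both rows contribute "Paper A" and unique() reduces them to a single title.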

i. Once you get used to it, this is a very powerful and convenient feature of R. It allows you to work with subsets of data "on the fly" without having to save subsets of the data. Let's illustrate with a small dataset:

    small.data <- c(5,7,2,78,3,11,9,3,2,12,7,3,9,8,4,56,2,6,9,22)

ii. What if we wanted to know the sum of the observations in small.data that are less than 10? Well, let's start by seeing which observations are less than 10. Type this:

    small.data<10

Here we see that the first, second, third, fifth, ... observations are less than 10. Let's save this logical vector for later use:

    t.f.vector <- small.data<10

Now, look what we get with

    small.data[t.f.vector]

Only those observations for which t.f.vector equals TRUE. So, we can calculate the desired sum with:

    sum(small.data[t.f.vector])

That took two steps. Here's how we do it in one:

    sum(small.data[small.data<10])

Did we get the right answer? Check:

    5+7+2+3+9+3+2+7+3+9+8+4+2+6+9

Yep! Both give 79. Now, note that the logical inside the brackets can be based on anything. For example, what if I just wanted a random subset of the observations? Here's one (not particularly useful) way to do that:

    small.data[runif(length(small.data))>0.5]

Finally, note that we can use this type of querying in lots of useful ways. For example, if I wanted to count the number of observations that meet a particular criterion, say the observation is greater than 7:

    table(small.data>7)

3. Using Logical Expressions to Clean Up and Recode Messy Data. So, we saw in the last lab that the Iraq dataset is pretty messy (as is any real-world dataset). So, let's take what we've just learned and apply it to cleaning up that dataset (a bit).

a. First, if you didn't save it from the last lab, re-read in the Iraq dataset:

    iraq <- read.csv(file.choose())

b. Now, let's subset the data to only those casualties with the Country variable equal to US (where for purposes of this exercise we'll assume that the Country variable is accurate):

    iraq.us <- iraq[iraq$Country=="US",]

And now let's see which states appear in this data:

    sort(unique(iraq.us$State))

So, let's create a vector of actual states against which to match:

    states <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming")

How many observations do not have states in this list? The match() function returns NA for any value it cannot find in the list, so we count the NAs:

    table(is.na(match(iraq.us$State,states)))

Let's look at them to make sure we're right:

    iraq.us[is.na(match(iraq.us$State,states)),c(10,11,12)]

And now we further subset the data:

    iraq.us <- iraq.us[!is.na(match(iraq.us$State,states)),]

c. So, let's create a new variable that corresponds to regions in the country. To do so, first we need to define the regions:

    west <- c("California", "Colorado", "Idaho", "Montana", "Nevada", "Oregon", "Utah", "Washington", "Wyoming")
    southwest <- c("Arizona", "New Mexico", "Oklahoma", "Texas")
    southeast <- c("Alabama", "Arkansas", "Florida", "Georgia", "Kentucky", "Louisiana", "Mississippi", "North Carolina", "South Carolina", "Tennessee", "Virginia", "West Virginia")
    northeast <- c("Connecticut", "Delaware", "Maine", "Maryland", "Massachusetts", "New Hampshire", "New Jersey", "New York", "Pennsylvania", "Rhode Island", "Vermont")
    midwest <- c("Illinois", "Indiana", "Iowa", "Kansas", "Michigan", "Minnesota", "Missouri", "Nebraska", "North Dakota", "Ohio", "South Dakota", "Wisconsin")
    ak.hi <- c("Alaska", "Hawaii")

And now, let's create a new Region variable in the iraq.us dataset. Every observation starts out as "West", and then each of the other regions overwrites that value for its own states:

    iraq.us$Region <- "West"
    iraq.us$Region[!is.na(match(iraq.us$State,southwest))] <- "Southwest"
    iraq.us$Region[!is.na(match(iraq.us$State,southeast))] <- "Southeast"
    iraq.us$Region[!is.na(match(iraq.us$State,northeast))] <- "Northeast"
    iraq.us$Region[!is.na(match(iraq.us$State,midwest))] <- "Midwest"
    iraq.us$Region[!is.na(match(iraq.us$State,ak.hi))] <- "Alaska/Hawaii"

Finally, let's inspect our handiwork:

    table(iraq.us$Region)
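The same kind of recoding can also be sketched with a named lookup vector. This is an alternative to the match()-based assignments in the lab, not the lab's method, and the names region.lookup, state, and region, along with the handful of states shown, are hypothetical.

```r
# Hypothetical lookup covering only a few states, for illustration
region.lookup <- c(
  "California" = "West",
  "Texas"      = "Southwest",
  "Florida"    = "Southeast",
  "Ohio"       = "Midwest",
  "Alaska"     = "Alaska/Hawaii"
)

# Hypothetical state values, including one that is not in the lookup
state <- c("Texas", "Ohio", "California", "Guam", "Florida")

# Indexing by name returns NA for any state missing from the lookup
region <- unname(region.lookup[state])
region

# x %in% y is equivalent to !is.na(match(x, y))
in.lookup <- state %in% names(region.lookup)
identical(in.lookup, !is.na(match(state, names(region.lookup))))

table(region, useNA = "ifany")
```

One design difference worth noting: with the lookup, an unlisted value shows up as NA in the table, whereas a default of "West" would silently absorb any state left out of the other region vectors.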

Name:

INDIVIDUAL EXERCISES

1. First, do some data extraction from the Rothkopf dataset:

a. Who is the 100th author listed in the database?
b. What are the names of the papers I've published in Interfaces?
c. What are the names of the NPS faculty who published one or more "Article"s in Interfaces in 2010?
d. What are the last names of those authors in the data with first name Michael?

2. Now, returning to the Lab 5 homework and the Iraq casualties dataset, do some revised plots. That is, create the plots below and turn them in with your answers to the above questions.

a. Create a pie chart of the fraction of casualties by region. Appropriately label and embellish the plot. This time, make a plot that you would feel comfortable actually briefing to a commander.
b. Now create a bar chart of the same data. Again, appropriately label and embellish the plot, including the axes, for a commander's briefing.
c. Subset the iraq.us dataset to contain data only from the southeast region and create a horizontal bar chart of the number of casualties by state. Again, appropriately annotate and embellish the plot.
d. Again subset the iraq.us dataset, but now to contain data only from your home state. Create a horizontal bar chart of the Minor.Cause.of.Death variable. Again, appropriately annotate and embellish the plot.