Introduction to R and the tidyverse

1 Introduction to R and the tidyverse Paolo Crosetto Paolo Crosetto Introduction to R and the tidyverse 1 / 58

2 Lecture 3: merging & tidying data

3 Before we start: tidyverse
- by now you should all have your laptops with you
- so let's go back to the initial setup and do a clean install of the tidyverse
    install.packages("tidyverse")

4 Before we start: tidyverse
- the tidyverse package installs lots of stuff, in particular:
- ggplot2, dplyr -> seen earlier
- tidyr -> seen today
- you load the package using library(tidyverse) and it loads all the needed packages for you

5 Today's topics
today we will deal with three topics:
1 getting data into (and out of) R
2 joining data from different tables
3 tidying data

6 importing data

7 getting data into R: packages
- up to now we have worked with data sets that come from packages
- easy to do: install a package, then call a function with data attached
- all the hard work has been done for you
- if you wish, you can import the data into your workspace, e.g.
    library(nycflights13)
    df <- flights

8 getting data into R: other sources
- life is not always that easy
- you might have data in the form of (aaarg!) Excel files
- you might have comma-separated (csv) data
- you might have data coming from SPSS, SAS, STATA, ...
- or text data from ASCII sources

9 getting data into R: readr vs haven
- when you load the tidyverse (library(tidyverse)) you automatically load readr
- this is a package that gives you (verb) functions to load data into R nicely
- readr provides functions to load most text-based delimited files, especially .csv
- if you want to read in a STATA, SAS or SPSS file, you need the package haven (library(haven))
- readr is automatically loaded by the tidyverse call
- haven needs to be loaded explicitly (not shown here)

10 A simple example
- you find some data here: goo.gl/ kpycfh
- this is the Human Development Index, by country
- higher values (nearer to 1) are better
- save the file to disk, somewhere you can find it again
- save it as HDI.csv
- open it up with a text editor: what do you see?

11 A simple example
- now that your data is saved, how do you import it into R?
- you use read_csv("path_to_file")
- in my case:
    df <- read_csv("/home/paolo/dropbox/public/hdidata.csv")
    ## Parsed with column specification:
    ## cols(
    ##   `HDI Rank` = col_integer(),
    ##   Country = col_character(),
    ##   HDI = col_double()
    ## )

12 there is more but...
- read_csv just did a ton of things for you under the hood
- but that doesn't really matter at your stage
- so you just live with the results
- other useful functions:
- if the separator is ; rather than , use read_csv2
- if the separator is a TAB rather than , use read_tsv
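A minimal sketch of the three readers, assuming nothing beyond readr itself: it writes tiny throwaway files to a temporary location and reads them back. The file contents and the column names country and hdi are made up for illustration. Note that read_csv2 also expects a comma as the decimal mark.

```r
library(readr)

# Hypothetical toy data, written to temporary files so the example is self-contained
tmp_csv  <- tempfile(fileext = ".csv")
tmp_csv2 <- tempfile(fileext = ".csv")
tmp_tsv  <- tempfile(fileext = ".tsv")

writeLines(c("country,hdi", "A,0.9", "B,0.5"), tmp_csv)     # comma-separated
writeLines(c("country;hdi", "A;0,9", "B;0,5"), tmp_csv2)    # semicolon-separated, decimal comma
writeLines(c("country\thdi", "A\t0.9", "B\t0.5"), tmp_tsv)  # tab-separated

# Declaring the column types up front avoids relying on readr's guesses
spec <- cols(country = col_character(), hdi = col_double())

d_comma     <- read_csv(tmp_csv,   col_types = spec)
d_semicolon <- read_csv2(tmp_csv2, col_types = spec)
d_tab       <- read_tsv(tmp_tsv,   col_types = spec)
```

All three calls should return the same two-row table; only the delimiter (and, for read_csv2, the decimal mark) differs.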

13 some hints
- you can always export to .csv from any program
- even from Excel!
- so once you have exported to .csv, it is all downhill from there
- and it is even better to do so, because .csv is universal
- while binary formats (.dta, .xls, ...) force you to have the appropriate tool for reading them
- so try to keep a copy of your data in a text-based format: it stays readable should everything else go wrong
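The same advice applies when getting data out of R. A small sketch, assuming a made-up data frame hdi and a temporary output path, that saves a text copy with readr's write_csv and reads it back:

```r
library(readr)
library(tibble)

# Hypothetical data frame standing in for your own data
hdi <- tibble(country = c("A", "B"), hdi = c(0.9, 0.5))

# Save a plain-text copy; any editor or program can open it later
out_path <- tempfile(fileext = ".csv")
write_csv(hdi, out_path)

# The file is ordinary text: a header line followed by one line per row
csv_lines <- readLines(out_path)

# Reading it back gives the same values
hdi_back <- read_csv(
  out_path,
  col_types = cols(country = col_character(), hdi = col_double())
)
```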

14 Joining datasets

15 data scattered around
- you do not always have all the data you need in one dataset
- it is usually scattered around several datasets
- that might or might not be linked / linkable
- e.g. you might need to merge data coming from different sources (INSEE and Eurostat)
- or you might do some computations / summaries and want to merge these back

16 using the nycflights13 dataset again: planes
    library(nycflights13)
    planes <- nycflights13::planes
    planes
    ## # A tibble: 3,322 x 9
    ##    tailnum year type manufacturer model
    ##    <chr> <int> <chr> <chr> <chr>
    ##  1 N Fixed wing multi engine EMBRAER EMB-145XR
    ##  2 N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A
    ##  3 N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A
    ##  4 N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A
    ##  5 N Fixed wing multi engine EMBRAER EMB-145LR
    ##  6 N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A
    ##  7 N107US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A
    ##  8 N108UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A
    ##  9 N109UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A
    ## 10 N110UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A
    ## # ... with 3,312 more rows, and 4 more variables: engines <int>,
    ## #   seats <int>, speed <int>, engine <chr>

17 using the nycflights13 dataset again: airports
    airports <- nycflights13::airports
    airports
    ## # A tibble: 1,458 x 8
    ##    faa name lat lon alt tz
    ##    <chr> <chr> <dbl> <dbl> <int> <dbl>
    ##  1 04G Lansdowne Airport
    ##  2 06A Moton Field Municipal Airport
    ##  3 06C Schaumburg Regional
    ##  4 06N Randall Airport
    ##  5 09J Jekyll Island Airport
    ##  6 0A9 Elizabethton Municipal Airport
    ##  7 0G6 Williams County Airport
    ##  8 0G7 Finger Lakes Regional Airport
    ##  9 0P2 Shoestring Aviation Airfield
    ## 10 0S9 Jefferson County Intl
    ## # ... with 1,448 more rows, and 2 more variables: dst <chr>, tzone <chr>

18 using the nycflights13 dataset again: flights
    flights <- nycflights13::flights
    flights
    ## # A tibble: 336,776 x 19
    ##   year month day dep_time sched_dep_time dep_delay arr_time
    ##   <int> <int> <int> <int> <int> <dbl> <int>
    ## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
    ## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    ## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
    ## #   minute <dbl>, time_hour <dttm>

19 inspect the datasets
- what do these datasets contain?
- what variables do they have in common?
- do they have some unique identifier (key)?
- how are they related to one another?

20 the datasets
- planes has information on each plane (model, type, date of construction, ...)
- airports has information on each airport (faa code, location: lat, lon)
- flights has information on each flight that departed from a NYC airport in 2013

21 joining different datasets: example
- problem: do newer planes fly the longest routes from NYC?
- to answer this, you need to combine data from two sources:
- flights, to get the route's length in miles
- planes, to get the date the plane was first operational
- how do you join the two data frames?

22 joining two datasets: key
- first you need to find a unique identifier for your data: a key
- a key takes a different value for each observation in the dataset
- to find one, either you use your intuition or you check
    planes %>% count(tailnum) %>% filter(n>1)
    ## # A tibble: 0 x 2
    ## # ... with 2 variables: tailnum <chr>, n <int>
- count(var) counts how many times each value of var appears and stores the result in a new variable n
- by filtering for n>1 you check whether any value appears more than once
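The check above can be tried on toy data. In this sketch the two small tables (planes_toy and flights_toy, both invented for illustration) mimic the situation in nycflights13: a tail number identifies a plane, but the same plane can fly many times.

```r
library(dplyr)

# Hypothetical miniature versions of planes and flights
planes_toy  <- tibble(tailnum = c("N1", "N2", "N3"), year = c(1998L, 1999L, 2004L))
flights_toy <- tibble(tailnum = c("N1", "N1", "N2"), distance = c(1400, 1400, 733))

# No tail number appears twice in planes_toy: tailnum is a key there
dup_planes <- planes_toy %>% count(tailnum) %>% filter(n > 1)

# N1 appears twice in flights_toy: tailnum is NOT a key for flights
dup_flights <- flights_toy %>% count(tailnum) %>% filter(n > 1)
```

An empty result means the variable identifies each row uniquely; any row in the result names a value that is repeated.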


23 joining
- once you know the key, you can use the join family of functions
- imagine you have two datasets with variables and values as follows:
Figure 1:
- there is some overlapping information in the two tables
- but there is also new information
- column D is only in dataset Y

24 joining
- joining always combines data from two tables into one
- the syntax is always the same: join(left, right, by = "key")
- left and right are two data frames
- key is the unique identifier of observations (in one or both data frames)

25 the joining family
- different join functions make different assumptions about what to do with the data that are NOT matched
- full_join() keeps everything, adds NA
- inner_join() keeps only matched data
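To see how the join functions treat unmatched rows, here is a small sketch on two invented tables x and y that share a key column. Keys 1 and 2 match, key 3 exists only in x and key 4 only in y. left_join() and right_join() are the other two members of the dplyr family and are included for comparison.

```r
library(dplyr)

# Hypothetical tables: the names x, y, key, val_x and val_y are made up
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

full  <- full_join(x, y, by = "key")   # keeps keys 1, 2, 3, 4 and fills the gaps with NA
inner <- inner_join(x, y, by = "key")  # keeps only the matched keys 1 and 2
left  <- left_join(x, y, by = "key")   # keeps every row of x: keys 1, 2, 3
right <- right_join(x, y, by = "key")  # keeps every row of y: keys 1, 2, 4
```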

26 the default left_join()
- left_join() is the default because you usually add some variable to a large dataset
- in our case: do newer planes fly the longest routes from NYC?
- we have most of the information in the flights dataset
- we need only the year built from the planes dataset

27 answering our question: joining
    distance <- flights %>% select(tailnum, distance)
    yearbuilt <- planes %>% select(tailnum, year)
    answer <- left_join(distance, yearbuilt, by = "tailnum")
    answer
    ## # A tibble: 336,776 x 3
    ##    tailnum distance year
    ##    <chr> <dbl> <int>
    ##  1 N
    ##  2 N
    ##  3 N619AA
    ##  4 N804JB
    ##  5 N668DN
    ##  6 N
    ##  7 N516JB
    ##  8 N829AS
    ##  9 N593JB
    ## 10 N3ALAA 733 NA
    ## # ... with 336,766 more rows

28 answering our question: the answer
- there does not seem to be any connection between the year and the length of the flight
    answer %>%
      group_by(year) %>%
      summarise(dist = mean(distance, na.rm = TRUE)) %>%
      ggplot(aes(x = year, y = dist)) + geom_point() + geom_smooth(method = "lm")
[Figure: scatter plot of dist (y axis) against year (x axis) with a fitted linear trend]

29 joining exercise
- how many flights from NYC land at an airport whose altitude is > 1000 m?
- note: 1 metre = 3.28084 feet (altitude in airports is recorded in feet)
- altitude is in the airports df
- flights are in the flights df

30 solution
- a lot of flights, since Denver sits at 1600 m!
    alt_df <- airports %>% select(faa,alt) %>% mutate(alt = alt/ ) %>% ren
    answer <- left_join(flights, alt_df, by = "dest") %>% filter(alt>1000)
    answer
    ## # A tibble: 10,291 x 20
    ##   year month day dep_time sched_dep_time dep_delay arr_time
    ##   <int> <int> <int> <int> <int> <dbl> <int>
    ## # ... with 10,281 more rows, and 13 more variables: sched_arr_time <int>,
    ## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    ## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
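The first line of the solution is cut off in this transcript. One plausible way to complete it is sketched below. Two things are assumptions, not the author's confirmed code: the divisor is taken from the feet-per-metre note in the exercise, and the final step is guessed to be a rename of faa to dest, since the join is done by "dest".

```r
library(dplyr)
library(nycflights13)

# Assumption: conversion factor taken from the exercise note (1 metre = 3.28084 feet)
feet_per_metre <- 3.28084

alt_df <- airports %>%
  select(faa, alt) %>%
  mutate(alt = alt / feet_per_metre) %>%  # altitude converted from feet to metres
  rename(dest = faa)                      # assumption: rename so the key matches flights$dest

answer <- left_join(flights, alt_df, by = "dest") %>%
  filter(alt > 1000)

# Number of flights per high-altitude destination, largest first
by_dest <- count(answer, dest, sort = TRUE)
```

Destinations that are missing from airports get an NA altitude after the left join and are dropped by the filter.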

31 solution: a plot to see the impact of Denver
    answer %>% ggplot(aes(dest))+geom_bar()
[Figure: bar chart of flight count by dest (ABQ, BZN, DEN, EGE, HDN, JAC, MTJ, SLC)]

32 joining three datasets
- how old are the planes that fly to airports whose altitude is > 1000 m?

33 joining three datasets, solution
    answer <- left_join(flights,yearbuilt, by = "tailnum")
    answer <- left_join(answer, alt_df, by = "dest")
    answer %>% filter(alt>1000) %>% summarise(avgyear = mean(year.y, na.rm = TRUE))
    ## # A tibble: 1 x 1
    ##   avgyear
    ##   <dbl>
    answer %>% filter(alt<=1000) %>% summarise(avgyear = mean(year.y, na.rm = TRUE))
    ## # A tibble: 1 x 1
    ##   avgyear
    ##   <dbl>
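The name year.y in the solution comes from the join itself: both flights and yearbuilt have a column called year, and it is not part of the key, so dplyr keeps both and adds the suffixes .x (left table) and .y (right table). A small sketch on invented tables, also showing the suffix argument as one way to get more readable names:

```r
library(dplyr)

# Hypothetical miniature tables; both contain a non-key column named year
flights_toy   <- tibble(tailnum = c("N1", "N2"), year = c(2013L, 2013L), dest = c("DEN", "SLC"))
yearbuilt_toy <- tibble(tailnum = c("N1", "N2"), year = c(1999L, 2005L))

# Default behaviour: the clashing columns become year.x (flight year) and year.y (year built)
joined <- left_join(flights_toy, yearbuilt_toy, by = "tailnum")

# Optional: choose the suffixes yourself
joined_named <- left_join(flights_toy, yearbuilt_toy, by = "tailnum",
                          suffix = c("_flight", "_built"))
```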

34 tidy data

35 messy data -> tidy data
"Happy families are all alike; every unhappy family is unhappy in its own way." - Leo Tolstoy
- the data we have worked with so far are all well formatted
- this is not the case in real life
- we need to be able to format data in a convenient way
- if you work with the tools we've seen (dplyr, ggplot2) then you want tidy data

36 a simple dataset in four versions
    table1
    ## # A tibble: 6 x 4
    ##   country year cases population
    ##   <chr> <int> <int> <int>
    ## 1 Afghanistan
    ## 2 Afghanistan
    ## 3 Brazil
    ## 4 Brazil
    ## 5 China
    ## 6 China

37 a simple dataset in four versions
    table2
    ## # A tibble: 12 x 4
    ##    country year type count
    ##    <chr> <int> <chr> <int>
    ##  1 Afghanistan 1999 cases 745
    ##  2 Afghanistan 1999 population
    ##  3 Afghanistan 2000 cases 2666
    ##  4 Afghanistan 2000 population
    ##  5 Brazil 1999 cases
    ##  6 Brazil 1999 population
    ##  7 Brazil 2000 cases
    ##  8 Brazil 2000 population
    ##  9 China 1999 cases
    ## 10 China 1999 population
    ## 11 China 2000 cases
    ## 12 China 2000 population

38 a simple dataset in four versions
    table3
    ## # A tibble: 6 x 3
    ##   country year rate
    ## * <chr> <int> <chr>
    ## 1 Afghanistan /
    ## 2 Afghanistan /
    ## 3 Brazil /
    ## 4 Brazil /
    ## 5 China /
    ## 6 China /

39 a simple dataset in four versions
    table4a #cases
    ## # A tibble: 3 x 3
    ##   country `1999` `2000`
    ## * <chr> <int> <int>
    ## 1 Afghanistan
    ## 2 Brazil
    ## 3 China
    table4b #population
    ## # A tibble: 3 x 3
    ##   country `1999` `2000`
    ## * <chr> <int> <int>
    ## 1 Afghanistan
    ## 2 Brazil
    ## 3 China

40 tidy, untidy data
- tidy data has the following characteristics:
- each variable has its own column
- each observation has its own row
- each value has its own cell
- have a look at the tables. what is an observation? what is a variable? do you see problems in the tables?
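One way to see why the tidy layout pays off: with table1, where every variable is a column and every country-year pair is a row, the usual dplyr verbs work directly. A short sketch using the table1 example data shipped with tidyr; the "per 10,000" scaling and the new column names rate and total_cases are just illustrative choices.

```r
library(dplyr)
library(tidyr)  # provides the example tables table1, table2, ...

# A new variable is one mutate() away: cases per 10,000 inhabitants
rates <- table1 %>%
  mutate(rate = cases / population * 10000)

# Grouped summaries are just as direct: total cases per year
per_year <- table1 %>%
  group_by(year) %>%
  summarise(total_cases = sum(cases))
```

The same two computations on table2, table3 or table4a would first need the reshaping steps covered in the following slides.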

41 tidy data: the tidyr package
- tidyr is part of the tidyverse
- it is automatically loaded with library(tidyverse)
- tidyr provides 4 main verbs
- gather vs. spread
- separate vs. unite

42 gathering: from wide to long
- sometimes variables are in the column names: bad!
    table4a
    ## # A tibble: 3 x 3
    ##   country `1999` `2000`
    ## * <chr> <int> <int>
    ## 1 Afghanistan
    ## 2 Brazil
    ## 3 China
- year is a variable but it sits in the column names
- the content is cases but has no variable name

43 gathering
- we need to reshape the data from wide to long, so that year becomes a variable and 1999 and 2000 become values
- we use gather(vars, key, value)
- vars are the column names that are not actually variables but values
- key is the (new) name to be given to the (new) column that will be created to store the (former) variable names
- value is the (new) name to be given to the (new) column that will be created to store the values that were spread over several variables
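A sketch of the call on table4a, next to its newer counterpart: in tidyr 1.0.0 and later, gather() still works but is superseded by pivot_longer(), so both are shown here for comparison. The result names long_gather and long_pivot are made up.

```r
library(dplyr)
library(tidyr)

# gather(): the columns `1999` and `2000` are stacked into year (names) and cases (values)
long_gather <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases") %>%
  arrange(country, year)

# pivot_longer(): same reshaping, with names_to / values_to playing the role of key / value
long_pivot <- table4a %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases") %>%
  arrange(country, year)
```

In both results year is a character column, because it was built from column names.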

44 gathering
- what happens if we provide NO arguments?
- everything is gathered
- just two columns left (key & value)
    table4a %>% gather()
    ## # A tibble: 9 x 2
    ##   key value
    ##   <chr> <chr>
    ## 1 country Afghanistan
    ## 2 country Brazil
    ## 3 country China

45 gathering
- what if we provide arguments?
    cases <- table4a %>% gather(`1999`,`2000`, key = year, value = cases) %>% ar
    cases
    ## # A tibble: 6 x 3
    ##   country year cases
    ##   <chr> <chr> <int>
    ## 1 Afghanistan
    ## 2 Afghanistan
    ## 3 Brazil
    ## 4 Brazil
    ## 5 China
    ## 6 China

46 gathering
- we can do the same for the population table (table4b)
    pop <- table4b %>% gather(`1999`,`2000`, key = year, value = population)
    pop
    ## # A tibble: 6 x 3
    ##   country year population
    ##   <chr> <chr> <int>
    ## 1 Afghanistan
    ## 2 Brazil
    ## 3 China
    ## 4 Afghanistan
    ## 5 Brazil
    ## 6 China

47 gathering
- we can merge the two tables and we'll get back to table1
    left_join(cases,pop, by = c("country","year"))
    ## # A tibble: 6 x 4
    ##   country year cases population
    ##   <chr> <chr> <int> <int>
    ## 1 Afghanistan
    ## 2 Afghanistan
    ## 3 Brazil
    ## 4 Brazil
    ## 5 China
    ## 6 China

48 spreading: from long to wide
    table2
    ## # A tibble: 12 x 4
    ##    country year type count
    ##    <chr> <int> <chr> <int>
    ##  1 Afghanistan 1999 cases 745
    ##  2 Afghanistan 1999 population
    ##  3 Afghanistan 2000 cases 2666
    ##  4 Afghanistan 2000 population
    ##  5 Brazil 1999 cases
    ##  6 Brazil 1999 population
    ##  7 Brazil 2000 cases
    ##  8 Brazil 2000 population
    ##  9 China 1999 cases
    ## 10 China 1999 population
    ## 11 China 2000 cases
    ## 12 China 2000 population

49 spreading: from long to wide
- we need to reshape the data from long to wide, so that type gets split into the variables cases and population and count values get assigned to the proper column
- we use spread(key, value)
- key is the (existing) name of the column that contains variable names
- value is the (existing) name of the variable that contains values of the (to be created) variables
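A sketch of the reshaping on table2, again next to the newer counterpart: in tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The result names wide_spread and wide_pivot are made up.

```r
library(tidyr)

# spread(): the values of type become column names, the values of count fill them
wide_spread <- spread(table2, key = type, value = count)

# pivot_wider(): same reshaping, with names_from / values_from playing the role of key / value
wide_pivot <- pivot_wider(table2, names_from = type, values_from = count)
```

Both results have one row per country and year, with cases and population side by side, which is the layout of table1.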

50 spreading: from long to wide
    spread(table2, key = type, value = count)
    ## # A tibble: 6 x 4
    ##   country year cases population
    ## * <chr> <int> <int> <int>
    ## 1 Afghanistan
    ## 2 Afghanistan
    ## 3 Brazil
    ## 4 Brazil
    ## 5 China
    ## 6 China

51 separating: from one to more variables
- what is wrong with this table?
    table3
    ## # A tibble: 6 x 3
    ##   country year rate
    ## * <chr> <int> <chr>
    ## 1 Afghanistan /
    ## 2 Afghanistan /
    ## 3 Brazil /
    ## 4 Brazil /
    ## 5 China /
    ## 6 China /

52 separating
- the variable rate contains two pieces of information: number of cases and population
- we need to separate the variable into two (in this case) variables
    separate(table3, col = rate, into = c("cases", "population"))
    ## # A tibble: 6 x 4
    ##   country year cases population
    ## * <chr> <int> <chr> <chr>
    ## 1 Afghanistan
    ## 2 Afghanistan
    ## 3 Brazil
    ## 4 Brazil
    ## 5 China
    ## 6 China

53 separating
- separate() correctly guessed that the point to separate was /
- but this is not always so easy
- so you can provide the actual separator character with sep=
- if we use the wrong one...
    separate(table3, col = rate, into = c("cases", "population"), sep = "7")
    ## Warning: Too many values at 3 locations: 1, 3, 5
    ## Warning: Too few values at 1 locations: 2
    ## # A tibble: 6 x 4
    ##   country year cases population
    ## * <chr> <int> <chr> <chr>
    ## 1 Afghanistan /1998
    ## 2 Afghanistan / <NA>
    ## 3 Brazil
    ## 4 Brazil /
    ## 5 China /
    ## 6 China /

54 separating
- separate() keeps the variables as characters
- this is safe: it doesn't make assumptions
- but sometimes it is best to have it create int or dbl variables
    separate(table3, col = rate, into = c("cases", "population"), convert = TRUE)
    ## # A tibble: 6 x 4
    ##   country year cases population
    ## * <chr> <int> <int> <int>
    ## 1 Afghanistan
    ## 2 Afghanistan
    ## 3 Brazil
    ## 4 Brazil
    ## 5 China
    ## 6 China
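The two options from the last slides, sep= and convert=, can be combined. A minimal sketch on table3, naming the separator explicitly instead of relying on the guess; the result names as_text and as_numbers are made up.

```r
library(tidyr)

# Default types: both new columns stay character
as_text <- separate(table3, col = rate, into = c("cases", "population"), sep = "/")

# Explicit separator plus type conversion: both new columns become numeric
as_numbers <- separate(table3, col = rate, into = c("cases", "population"),
                       sep = "/", convert = TRUE)
```

With convert = TRUE the new columns can be used in arithmetic straight away, for example as_numbers$cases / as_numbers$population.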

55 uniting: from several to one variable
    table5
    ## # A tibble: 6 x 4
    ##   country century year rate
    ## * <chr> <chr> <chr> <chr>
    ## 1 Afghanistan /
    ## 2 Afghanistan /
    ## 3 Brazil /
    ## 4 Brazil /
    ## 5 China /
    ## 6 China /

56 uniting
- the complementary verb to separate() is unite()
    unite(table5, year, century, year)
    ## # A tibble: 6 x 3
    ##   country year rate
    ## * <chr> <chr> <chr>
    ## 1 Afghanistan 19_99 745/
    ## 2 Afghanistan 20_ /
    ## 3 Brazil 19_ /
    ## 4 Brazil 20_ /
    ## 5 China 19_ /
    ## 6 China 20_ /
- by default unite() uses _ as a separator

57 uniting
    unite(table5, year, century, year, sep = "")
    ## # A tibble: 6 x 3
    ##   country year rate
    ## * <chr> <chr> <chr>
    ## 1 Afghanistan /
    ## 2 Afghanistan /
    ## 3 Brazil /
    ## 4 Brazil /
    ## 5 China /
    ## 6 China /
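unite() always returns a character column, so a united year is still text. A small sketch of one way to finish the job: unite century and year without a separator, then convert the result to an integer. The column name full_year is a made-up choice to keep it distinct from the input column year.

```r
library(dplyr)
library(tidyr)

united <- table5 %>%
  unite(col = "full_year", century, year, sep = "") %>%  # "19" and "99" become "1999"
  mutate(full_year = as.integer(full_year))              # text "1999" becomes the number 1999
```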

58 exercise
- look at (messy) Eurostat data on GDP and tidy it
