Transform Data! The Basics Part I continued!

Curtis Edwards

1 Transform Data! The Basics Part I continued!

2 arrange()

3 arrange() Order rows from smallest to largest values
    arrange(.data, ...)
.data: data frame to transform
...: one or more columns to order by (additional columns will be used as tie breakers)

4 Common syntax Each function takes a data frame as the first argument, and returns a data frame
    arrange(.data, ...)
arrange: dplyr function
.data: data frame to transform
...: function-specific arguments

5 arrange() Order rows from smallest to largest values
    arrange(babynames, n)
[Table: example rows for 1899 boys (columns year, sex, name, n, prop). Input order: John, William, James, Lance, Charles. Output order: Lance, Charles, James, William, John.]

6 Your Turn 3 Arrange babynames by n. Add prop as a second (tie breaking) variable to arrange on. Can you tell what the smallest value of n is? How does adding prop affect the arrangement?

7 arrange(babynames, n) arrange(babynames, n, prop)
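
To see what the tie-breaking column does, here is a minimal sketch on a made-up data frame. The `toy` table and its values are hypothetical (they are not real babynames rows); only `arrange()` and `tibble()` come from dplyr.

```r
library(dplyr)

# Hypothetical toy data (not real babynames values), with a tie on n.
toy <- tibble(
  name = c("Ann", "Bea", "Cal", "Dee"),
  n    = c(7, 5, 5, 9),
  prop = c(0.07, 0.06, 0.04, 0.09)
)

# Order by n only: Bea and Cal share n == 5, so n alone
# does not say which of the two should come first.
by_n <- arrange(toy, n)

# Order by n, then by prop: among rows with equal n,
# the row with the smaller prop comes first.
by_n_prop <- arrange(toy, n, prop)

print(by_n)
print(by_n_prop)
```

With the second column added, Cal (prop 0.04) is placed ahead of Bea (prop 0.06), while rows with distinct n are unaffected.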

8 Helper function desc() Change ordering to go from largest to smallest
    arrange(babynames, desc(n))
[Table: example rows for 1899 boys (columns year, sex, name, n, prop). Input order: John, William, James, Lance, Charles. Output order: John, William, James, Charles, Lance.]

9 Your Turn 4 Use desc() to find the names with the highest prop. Then, use desc() to find the names with the highest n.

10 arrange(babynames, desc(prop)) arrange(babynames, desc(n))

11 mutate()

12 mutate() Create new columns
    mutate(.data, ...)
.data: data frame to transform
...: one or more new columns to create

13 mutate() Create new columns
    mutate(babynames, percent = round(prop * 100, 2))
[Table: the example rows for 1899 boys (John, William, James, Lance, Charles) before and after; the output keeps year, sex, name, n, prop and gains a percent column.]

14 mutate() Create new columns
    mutate(babynames, percent = round(prop * 100, 2), nper = round(percent))
[Table: the example rows for 1899 boys (John, William, James, Lance, Charles) before and after; the output gains two columns, percent and nper.]
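
The second example relies on `mutate()` letting a new column refer to one created earlier in the same call. A small sketch, using a hypothetical `toy` data frame with made-up prop values:

```r
library(dplyr)

# Hypothetical toy data; the prop values are invented for illustration.
toy <- tibble(
  name = c("Ann", "Bea"),
  prop = c(0.0712, 0.0049)
)

# percent is created first, then nper is computed from percent
# within the same mutate() call.
out <- mutate(
  toy,
  percent = round(prop * 100, 2),
  nper    = round(percent)
)

print(out)

# mutate() returns a new data frame; toy itself is unchanged.
print(names(toy))
```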

16 Vectorized function min_rank() A popular ranking function (ties share the lowest rank)
    min_rank(c(50, 100, 100, 1000))
    # [1] 1 2 2 4
    min_rank(desc(c(50, 100, 100, 1000)))
    # [1] 4 2 2 1
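
For comparison, dplyr ships other ranking helpers that treat ties differently. The sketch below reuses the slide's vector; the choice of `dense_rank()` and `row_number()` as comparison points is ours, not the slides'.

```r
library(dplyr)

x <- c(50, 100, 100, 1000)

# min_rank(): tied values share the lowest rank, and the next rank is skipped.
r_min <- min_rank(x)

# dense_rank(): tied values share a rank, and no rank is skipped afterwards.
r_dense <- dense_rank(x)

# row_number(): ties are broken by position, so every rank is unique.
r_row <- row_number(x)

# desc() reverses the direction, so the largest value gets rank 1.
r_min_desc <- min_rank(desc(x))

print(r_min)
print(r_dense)
print(r_row)
print(r_min_desc)
```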

17 Your Turn 5 Use min_rank() and mutate() to rank each row in babynames from largest prop to lowest prop

18 mutate(babynames, rank = min_rank(desc(prop)))

19 %>%

20 Multiple steps (composed functions)
    arrange(mutate(filter(babynames, year == 2015, sex == "M"), rank = min_rank(desc(prop))), rank)
1. Filter babynames to just boys born in 2015
2. Rank the names by proportion so that higher proportions have lower rank
3. Arrange the names by rank

21 Multiple steps (intermediate data frames)
    boys_2015 <- filter(babynames, year == 2015, sex == "M")
    boys_2015 <- mutate(boys_2015, rank = min_rank(desc(prop)))
    boys_2015 <- arrange(boys_2015, rank)
    boys_2015

23 The pipe operator %>% Passes the result on its left into the first argument of the function on its right. So, these two lines do the same thing. Try it!
    filter(babynames, n == 99680)
    babynames %>% filter(n == 99680)

24 Multiple steps (pipe operator)
    babynames %>%
      filter(year == 2015, sex == "M") %>%
      mutate(rank = min_rank(desc(prop))) %>%
      arrange(rank)
1. Allows us to eliminate redundant code (assigning to the same data frame over and over) and/or unwanted intermediate data frames
2. Allows us to write code in the same way we think about the problem
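
The three styles (composed functions, intermediate data frames, pipe) should give the same result. A minimal check on a hypothetical `toy` data frame with invented values, rather than the real babynames data:

```r
library(dplyr)

# Hypothetical toy data standing in for babynames.
toy <- tibble(
  year = c(2015, 2015, 2015, 2014, 2015),
  sex  = c("M", "M", "F", "M", "M"),
  name = c("Ari", "Ben", "Cleo", "Dov", "Eli"),
  prop = c(0.002, 0.009, 0.008, 0.010, 0.005)
)

# Style 1: composed (nested) function calls, read from the inside out.
nested <- arrange(
  mutate(filter(toy, year == 2015, sex == "M"), rank = min_rank(desc(prop))),
  rank
)

# Style 2: intermediate data frames, one assignment per step.
stepwise <- filter(toy, year == 2015, sex == "M")
stepwise <- mutate(stepwise, rank = min_rank(desc(prop)))
stepwise <- arrange(stepwise, rank)

# Style 3: the pipe, read top to bottom.
piped <- toy %>%
  filter(year == 2015, sex == "M") %>%
  mutate(rank = min_rank(desc(prop))) %>%
  arrange(rank)

print(piped)
print(all.equal(nested, piped))
print(all.equal(stepwise, piped))
```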

25 Shortcut to type %>% In RStudio: Ctrl + Shift + M (Windows/Linux) or Cmd + Shift + M (Mac)

26 Your Turn 6 Use %>% to write a sequence of functions that:
1. Filter babynames to just the girls born in 1977
2. Mutate to make a percent column rounded to a whole number
3. Arrange the results so that the most popular names, based on the percent column, appear first

27 babynames %>% filter(year == 1977, sex == "F") %>% mutate(percent = round(prop * 100)) %>% arrange(desc(percent))

28 Your Turn 7 Write code to do the following: 1. Trim babynames to just the rows that contain your name and your sex 2. Plot the results as a line graph with year on the x-axis and prop on the y-axis

29 babynames %>% filter(name == "Lance", sex == "M") %>% ggplot() + geom_line(aes(year, prop))

30 What are the most popular names?

31 How should we define popularity? A name is popular if:
1. Sums: a large number of children have the name when you sum across years
2. Ranks: it consistently ranks among the top names from year to year

32 Question Do we have the right tools to: 1. Calculate the total number of children with each name? 2. Rank names within each year?

33 Deriving information
mutate(): create new variables
summarise(): summarise variables
group_by(): group cases

34 summarise()

35 summarise() Compute table of summaries
    babynames %>% summarise(total = sum(n), max = max(n))
[Table: the example rows for 1899 boys (John, William, James, Lance, Charles) are collapsed to a single row with columns total and max.]

36 Your Turn 8 Use summarise() to compute three statistics about the data: 1. The first (minimum) year in the data set 2. The last (maximum) year in the data set 3. The total number of children represented in the data set

37 babynames %>% summarise(first = min(year), last = max(year), total = sum(n))

38 Your Turn 9 Extract the rows where name == "Khaleesi". Then use summarise() and summary functions to find:
1. The first year Khaleesi appeared in the data
2. The total number of children named Khaleesi

39 babynames %>% filter(name == "Khaleesi") %>% summarise(first = min(year), total = sum(n))

41 n() The number of rows in a data set
    babynames %>% summarise(n = n())
[Table: six example rows (M: John, William, James, Lance, Charles; F: John) are collapsed to a single row with n = 6.]

42 n_distinct() The number of distinct values in a variable
    babynames %>% summarise(n = n(), nname = n_distinct(name))
[Table: six example rows (M: John, William, James, Lance, Charles; F: John) are collapsed to a single row with n = 6 and nname = 5.]
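
The difference between the two counts can be reproduced on a small stand-in table. The `toy` data frame below mirrors the slide's six names, but its year and n values are hypothetical:

```r
library(dplyr)

# Hypothetical toy data: six rows, with "John" appearing for both sexes.
toy <- tibble(
  year = rep(1899, 6),
  sex  = c("M", "M", "M", "M", "M", "F"),
  name = c("John", "William", "James", "Lance", "Charles", "John"),
  n    = c(60, 50, 40, 10, 30, 5)
)

# n() counts rows; n_distinct(name) counts unique values of name.
counts <- toy %>%
  summarise(n = n(), nname = n_distinct(name))

print(counts)
```

Because John appears twice, the row count (6) is one larger than the number of distinct names (5).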

43 group_by()

44 group_by() Groups cases by common values of one or more columns babynames %>% group_by(sex)

45 group_by()
    babynames %>% group_by(sex) %>% summarise(total = sum(n))
[Table: six example rows (F: Anne, John, Mary; M: John, Mary, Lance) are collapsed to one row per sex, with columns sex and total.]

46 group_by()
    babynames %>% group_by(year, sex) %>% summarise(total = sum(n))
[Table: six example rows (F: Anne, John, Mary; M: John, Mary, Lance) are collapsed to one row per combination of year and sex, with columns year, sex, and total.]
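
A runnable sketch of grouped summaries on a hypothetical `toy` data frame (all values invented). The `.groups = "drop"` argument in the second call is an addition of ours, available in dplyr 1.0.0 and later, that returns an ungrouped result and avoids the message about the remaining grouping; the slides do not use it.

```r
library(dplyr)

# Hypothetical toy data spanning two years and both sexes.
toy <- tibble(
  year = c(1899, 1899, 1899, 1900, 1900, 1900),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Anne", "Mary", "John", "Anne", "John", "Lance"),
  n    = c(10, 20, 30, 40, 50, 60)
)

# One group per sex: summarise() returns one row per group.
by_sex <- toy %>%
  group_by(sex) %>%
  summarise(total = sum(n))

# One group per (year, sex) combination.
by_year_sex <- toy %>%
  group_by(year, sex) %>%
  summarise(total = sum(n), .groups = "drop")

print(by_sex)
print(by_year_sex)
```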

47 Your Turn 10 Use group_by(), summarise(), and arrange() to display the ten most popular names. Compute popularity as the total number of children of a single gender given a name.

48 babynames %>% group_by(name, sex) %>% summarise(total = sum(n)) %>% arrange(desc(total))

50 babynames %>%
      group_by(name, sex) %>%
      summarise(total = sum(n)) %>%
      arrange(desc(total)) %>%
      ungroup() %>%
      slice(1:10) %>%
      ggplot() +
      geom_col(aes(fct_reorder(name, desc(total)), total / 1000000, fill = sex)) +
      theme_bw() +
      scale_fill_brewer() +
      labs(x = "name", y = "total (in millions)")
