The Tidyverse BIOF 339 9/25/2018

Similar documents
Data Wrangling in the Tidyverse

Chapter 7. The Data Frame

Basic R QMMA. Emanuele Taufer. 2/19/2018 Basic R (1)

A Whistle-Stop Tour of the Tidyverse

WEEK 13: FSQCA IN R THOMAS ELLIOTT

Data Import and Formatting

Accessing Databases from R

getting started in R

Regression Models Course Project Vincent MARIN 28 juillet 2016

Getting started with ggplot2

Stat 241 Review Problems

Introduction to R: Day 2 September 20, 2017

Quick Guide for pairheatmap Package

MBV4410/9410 Fall Bioinformatics for Molecular Biology. Introduction to R

The xtablelist Gallery. Contents. David J. Scott. January 4, Introduction 2. 2 Single Column Names 7. 3 Multiple Column Names 9.

Introducing R/Tidyverse to Clinical Statistical Programming

Lecture 12: Data carpentry with tidyverse

R Visualizing Data. Fall Fall 2016 CS130 - Intro to R 1

Introduction for heatmap3 package

Лекция 4 Трансформация данных в R

Will Landau. January 24, 2013

Modeling in the Tidyverse. Max Kuhn (RStudio)

Data Manipulation. Module 5

Spring 2017 CS130 - Intro to R 1 R VISUALIZING DATA. Spring 2017 CS130 - Intro to R 2

Package assertr. R topics documented: February 23, Type Package

Introduction to R and the tidyverse

STAT 1291: Data Science

Lecture 19: Oct 19, Advanced SQL. SQL Joins dbplyr SQL Injection Resources. James Balamuta STAT UIUC

Getting and Cleaning Data. Biostatistics

Grammar of data. dplyr. Bjarki Þór Elvarsson and Einar Hjörleifsson. Marine Research Institute. Bjarki&Einar (MRI) R-ICES 1 / 29

grammar statistical graphics Building a in Clojure Kevin Lynagh 2012 November 9 Øredev Keming Labs Malmö, Sweden

An Introduction to R. Ed D. J. Berry 9th January 2017

Package descriptr. March 20, 2018

Tidy Evaluation. Lionel Henry and Hadley Wickham RStudio

This is a simple example of how the lasso regression model works.

Pure, predictable, pipeable: creating fluent interfaces with R. Hadley Chief Scientist, RStudio

Introduction to Huxtable David Hugh-Jones

Package purrrlyr. R topics documented: May 13, Title Tools at the Intersection of 'purrr' and 'dplyr' Version 0.0.2

Introduction to R Software

Package postal. July 27, 2018

A Short Guide to R with RStudio

Package modelr. May 11, 2018

Lab #7 - More on Regression in R Econ 224 September 18th, 2018

Package ezsummary. August 29, 2016

Dplyr Introduction Matthew Flickinger July 12, 2017

Package tidyimpute. March 5, 2018

Session 3 Nick Hathaway;

Visualizing Data: Customization with ggplot2

Subsetting, dplyr, magrittr Author: Lloyd Low; add:

Metropolis. A modern beamer theme. Matthias Vogelgesang October 12, Center for modern beamer themes

Objects, Class and Attributes

Resources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes.

Package infer. July 11, Type Package Title Tidy Statistical Inference Version 0.3.0

AN INTRODUCTION TO R FOR MANAGEMENT SCHOLARS

Package dotwhisker. R topics documented: June 28, Type Package

Graphics in R STAT 133. Gaston Sanchez. Department of Statistics, UC Berkeley

Data wrangling. Reduction/Aggregation: reduces a variable to a scalar

Notes based on: Data Mining for Business Intelligence

An Introduction to Tidyverse

Lecture 3. Homework Review and Recoding I R Teaching Team. September 5, 2018

Working with Data. L5-1 R and Databases

Tidy data. Hadley Wickham. Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University.

Reproducible Research with R, L A TEX, & Sweave

Data-informed collection decisions using R or, learning R using collection data

Introduction to R and the tidyverse. Paolo Crosetto

Data Import and Export

R for Libraries. Session 2: Data Exploration. Clarke Iakovakis Scholarly Communications Librarian University of Houston-Clear Lake

Learning SAS. Hadley Wickham

Data Input/Output. Andrew Jaffe. January 4, 2016

Assignment 5.5. Nothing here to hand in

genius Example with Jimi Hendrix and the Beatles

Session 1 Nick Hathaway;

How to Wrangle Data. using R with tidyr and dplyr. Ken Butler. March 30, / 44

Intro to R for Epidemiologists

Lecture 09. Graphics::ggplot I R Teaching Team. October 1, 2018

Handling Missing Values

Stat405. Tables. Hadley Wickham. Tuesday, October 23, 12

Package tidystats. May 6, 2018

Tidy evaluation (hygienic fexprs) Lionel Henry and Hadley Wickham RStudio

Introduction to R. Andy Grogan-Kaylor October 22, Contents

Package ruler. February 23, 2018

Merging, exploring, and batch processing data from the Human Fertility Database and Human Mortality Database

Package visdat. February 15, 2019

Package rcv. August 11, 2017

Package tibble. August 22, 2017

Package anomalize. April 17, 2018

CSSS 512: Lab 1. Logistics & R Refresher

String Manipulation. Module 6

Package crossword.r. January 19, 2018

K-fold cross validation in the Tidyverse Stephanie J. Spielman 11/7/2017

Financial Econometrics Practical

Package oec. R topics documented: May 11, Type Package

Incident Response Programming with R. Eric Zielinski Sr. Consultant, Nationwide

Package zebu. R topics documented: October 24, 2017

Package driftr. June 14, 2018

Importing rectangular text files Importing other types of data Trasforming data

R. Muralikrishnan Max Planck Institute for Empirical Aesthetics Frankfurt. 08 June 2017

Advance Analytics with Power BI and R

Package widyr. August 14, 2017

Analyzing Economic Data using R

Transcription:

The Tidyverse BIOF 339 9/25/2018

What is the Tidyverse? The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures 2/41

What is the Tidyverse? A set of R packages that: help make data more computer friendly while making your code more human friendly Most of these packages are (co-)written by Dr. Hadley Wickham, who has rockstar status in the R world They are supported by the company RStudio 3/41

Tidying data

Tidy data Tidy datasets are all alike, but every messy data is messy in its own way 5/41

Tidy data Tidy data is a computer-friendly format based on the following characteristics: Each row is one observation Each column is one variable Each set of observational unit forms a table All other forms of data can be considered messy data. 6/41

Let us count the ways There are many ways data can be messy. An incomplete list. Column headers are values, not variables Multiple variables are stored in a single column Variables are stored in both rows and columns Multiple types of observational units are saved in the same table A single observational unit is stored in multiple tables 7/41

Ways to have messy (i.e. not tidy) data 1. Column headers contain values Country < $10K $10-20K $20-50K $50-100K > $100K India 40 25 25 9 1 USA 20 20 20 30 10 8/41

Ways to have messy (i.e. not tidy) data 1. Column headers contain values Country Income Percentage India < $10K 40 USA < $10K 20 This is a case of reshaping or melting 9/41

Ways to have messy (i.e. not tidy) data 1. Multiple variables in one column Country Year M_0-14 F_0-14 M_ 15-60 F_15-60 M_60+ F_60+ UK 2010 UK 2011 Country Year Gender Age Count Separating columns into different variables 10/41

Tidying data The typical steps are Transforming data from wide to tall (gather) and from tall to wide (spread) Separating columns into different columns Putting columns together into new variables 11/41

Cleaning data

Some actions on data Creating new variables (mutate) Choose some columns (select) Selecting rows based on some criteria (filter) Sort data based on some variables (arrange) 13/41

Example data data(mtcars) knitr::kable(head(mtcars, 3)) mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Car names are in an attribute of the data.frame called rownames. So it s not in a column We might want to convert fuel economy to metric We might just want to look at the relationship between displacement and fuel economy based on number of cylinders 14/41

Example data (link) link <- 'https://dl.dropboxusercontent.com/s/pqavhcckshqxtjm/brca.csv' download.file(link, 'brca.csv') brca_data <- read.csv('brca.csv', stringsasfactors=false) 15/41

The tidyverse package The tidyverse package is a meta-package that installs a set of packages that are useful for data cleaning, data tidying and data munging (manipulating data to get a computationally attractive dataset) 16/41

The tidyverse package # install.packages('tidyverse') library(tidyverse) ## Attaching packages ## ggplot2 3.0.0 purrr 0.2.5 ## tibble 1.4.2 dplyr 0.7.6 ## tidyr 0.8.1 stringr 1.3.1 ## readr 1.1.1 forcats 0.3.0 ## Conflicts ## dplyr::filter() masks stats::filter() ## dplyr::lag() masks stats::lag() You can specify a function from a particular package as dplyr::filter. Note there are two colons there 17/41

Core tidyverse packages Package Description ggplot2 Data visualization (next week) tibble data.frame on steroids tidyr Data tidying (today) readr Reading text files (CSV) purrr Applying functions to data iteratively(later this sem) dplyr Data cleaning and munging (today) stringr String (character) manipulation forcats Manipulating categorical variables 18/41

Additional tidyverse packages Package Description readxl Read Excel files haven Read SAS, SPSS, Stata files lubridate Deal with dates and times magrittr Provides the pipe operator %>% glue Makes pasting text and data easier 19/41

Pipes

A pipe example (data munging) library(tidyverse) updated_cars <- mtcars %>% rownames_to_column(var = 'Model') %>% mutate(kmpg = mpg * 1.6) %>% select(model, kmpg, cyl, disp) %>% filter(cyl == 6) Model kmpg cyl disp Mazda RX4 33.60 6 160.0 Mazda RX4 Wag 33.60 6 160.0 Hornet 4 Drive 34.24 6 258.0 Valiant 28.96 6 225.0 21/41

A pipe example (data munging) library(tidyverse) updated_cars <- mtcars %>% # Take the data set rownames_to_column(var = 'Model') %>% # Make rownames a column select(model, mpg, cyl, disp) %>% # Keep only certain columns mutate(kmpg = mpg * 1.6) %>% # Create a new variable filter(cyl == 6) # Keep only certain rows The idea is to use verbs to express operations on a dataset, so it is easier to express what you want to do in code. 22/41

The pipe operator The pipe operator %>% (technically from the package magrittr) takes a data.frame or tibble object on the left, then pipes it to a function that takes the data.frame object as its first argument. mtcars %>% mutate(kmpg = mpg * 1.6) would be the same as mutate(mtcars, kmpg = mpg * 1.6) 23/41

The pipe operator With pipes mtcars %>% rownames_to_column(var = "Model") %>% select(model:disp) %>% mutate(kmpg = mpg * 1.6) %>% filter(cyl == 6) Without pipes tmp <- rownames_to_column(mtcars, var="model") tmp3 <- select(tmp2, Model:disp) tmp2 <- mutate(tmp, kmpg = mpg * 1.6) tmp4 <- filter(tmp3, cyl == 6) Both are fine, but I find pipes help translating my thoughts into code better 24/41

The pipe operator With pipes + tidyverse updated_cars <- mtcars %>% rownames_to_column(var = "Model") %>% select(model:disp) %>% mutate(kmpg = mpg * 1.6) %>% filter(cyl == 6) Without tidyverse mtcars[,'model'] <- rownames(mtcars) tmp <- mtcars[,c('model','mpg','cyl','disp')] # Can't use the : operator for tmp[,'kmpg'] <- tmp[,'mpg'] * 1.6 # Note we need to quote the names updated_cars <- tmp[tmp[,'cyl'] == 6,] Idea was to make it easier to write expressive code without getting too hung up with syntax. 25/41

Going through step by step

mtcars mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 27/41

mtcars %>% rownames_to_column(var = "Model") Model mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 28/41

mtcars %>% rownames_to_column(var = "Model") %>% select(model:disp) Model mpg cyl disp Mazda RX4 21.0 6 160 Mazda RX4 Wag 21.0 6 160 Datsun 710 22.8 4 108 29/41

mtcars %>% rownames_to_column(var = "Model") %>% select(model:disp) %>% mutate(kmpg = mpg * 1.6) Model mpg cyl disp kmpg Mazda RX4 21.0 6 160 33.60 Mazda RX4 Wag 21.0 6 160 33.60 Datsun 710 22.8 4 108 36.48 30/41

mtcars %>% rownames_to_column(var = "Model") %>% select(model:disp) %>% mutate(kmpg = mpg * 1.6) %>% filter(cyl == 6) Model mpg cyl disp kmpg Mazda RX4 21 6 160 33.6 Mazda RX4 Wag 21 6 160 33.6 31/41

Data tidying

link <- 'https://dl.dropboxusercontent.com/s/pqavhcckshqxtjm/brca.csv' download.file(link, 'brca.csv') brca_data <- read.csv('brca.csv', stringsasfactors=false) 33/41

library(tidyverse) names(brca_data) ## [1] "id" "diagnosis" ## [3] "radius_mean" "texture_mean" ## [5] "perimeter_mean" "area_mean" ## [7] "smoothness_mean" "compactness_mean" ## [9] "concavity_mean" "concave.points_mean" ## [11] "symmetry_mean" "fractal_dimension_mean" ## [13] "radius_se" "texture_se" ## [15] "perimeter_se" "area_se" ## [17] "smoothness_se" "compactness_se" ## [19] "concavity_se" "concave.points_se" ## [21] "symmetry_se" "fractal_dimension_se" ## [23] "radius_worst" "texture_worst" ## [25] "perimeter_worst" "area_worst" ## [27] "smoothness_worst" "compactness_worst" ## [29] "concavity_worst" "concave.points_worst" ## [31] "symmetry_worst" "fractal_dimension_worst" ## [33] "X" 34/41

Need for tidying 1. For the same variable (e.g., radius) there are 3 columns giving the mean, se and worst value. So the names of the metric are stored in the column names 2. There are really 3 kinds of summaries for each metric mean, se and worst. 35/41

Tidying library(tidyverse) brca_data id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compact 842302 M 17.99 10.38 122.8 1001 0.11840 842517 M 20.57 17.77 132.9 1326 0.08474 84300903 M 19.69 21.25 130.0 1203 0.10960 36/41

Tidying (selecting) library(tidyverse) brca_data %>% select(id, diagnosis, ends_with('mean'), ends_with('se'), ends_with('worst -starts_with('fractal')) # removes columns starting with "fractal" id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compact 842302 M 17.99 10.38 122.8 1001 0.11840 842517 M 20.57 17.77 132.9 1326 0.08474 84300903 M 19.69 21.25 130.0 1203 0.10960 37/41

Tidying (gathering) library(tidyverse) brca_data %>% select(id, diagnosis, ends_with('mean'), ends_with('se'), ends_with('worst'), -starts_with('fractal')) %>% gather(variable, value, -id, -diagnosis) # operate on everything but Column names become variable, everything stays aligned with id and diagnosis id diagnosis variable value 842302 M radius_mean 17.99 842517 M radius_mean 20.57 84300903 M radius_mean 19.69 38/41

Tidying (separating) library(tidyverse) brca_data %>% select(id, diagnosis, ends_with('mean'), ends_with('se'), ends_with('worst'), -starts_with('fractal')) %>% gather(variable, value, -id, -diagnosis) %>% separate(variable, c("variable","stat"), sep="_", remove = T) Split variable into 2 cols, Variable and stat id diagnosis Variable stat value 842302 M radius mean 17.99 842517 M radius mean 20.57 84300903 M radius mean 19.69 39/41

Tidying (spreading) library(tidyverse) brca_data %>% select(id, diagnosis, ends_with('mean'), ends_with('se'), ends_with('worst'), -starts_with('fractal')) %>% gather(variable, value, -id, -diagnosis) %>% separate(variable, c("variable","stat"), sep="_", remove = T) %>% spread(stat, value) id diagnosis Variable mean se worst 8670 M area 748.90000 48.31000 1156.0000 8670 M compactness 0.12230 0.01484 0.2394 8670 M concave.points 0.08087 0.01093 0.1514 40/41

References 1. Introduction to dplyr 2. Tidy data 3. The paper on tidy data 41/41