Transformations. Hadley Wickham. October 2009

Similar documents
A quick introduction to plyr

Statistical Computing 1 Stat 590

Stat405. Tables. Hadley Wickham. Tuesday, October 23, 12

Model visualisation. Hadley Wickham. October 2009

Tidy data. Hadley Wickham. Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University.

Statistical Computing (36-350)

Large data. Hadley Wickham. Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University.

An Exploration of the Dark Arts. reshape/reshape2, plyr & ggplot2

ggplot2 basics Hadley Wickham Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University September 2011

Maps & layers. Hadley Wickham. Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University.

Text & Patterns. stat 579 Heike Hofmann

Transform Data! The Basics Part I continued!

Stat405. Displaying distributions. Hadley Wickham. Thursday, August 23, 12

Transform Data! The Basics Part I!

Statistical Computing (36-350) Importing Data from the Web II. Cosma Shalizi and Vincent Vu November 21, 2011

Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.

The following presentation is based on the ggplot2 tutotial written by Prof. Jennifer Bryan.

Grammar of data. dplyr. Bjarki Þór Elvarsson and Einar Hjörleifsson. Marine Research Institute. Bjarki&Einar (MRI) R-ICES 1 / 29

Stat405. More about data. Hadley Wickham. Tuesday, September 11, 12

Making sense of census microdata

Facets and Continuous graphs

An introduction to ggplot: An implementation of the grammar of graphics in R

Package ggloop. October 20, 2016

the R environment The R language is an integrated suite of software facilities for:

Introduction to Graphics with ggplot2

Testing. Hadley Wickham. Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University. June 2011

Building visualisations Hadley Wickham

Install RStudio from - use the standard installation.

The Art of Defensive Programming: Coping with Unseen Data

Lab #3. Viewing Data in SAS. Tables in SAS. 171:161: Introduction to Biostatistics Breheny

Example1D.1.sas. * Procedures : ; * 1. print to show the dataset. ;

Laboratory: Recursion Basics

Introduction to Data Visualization

Package ggqc. R topics documented: January 30, Type Package Title Quality Control Charts for 'ggplot' Version Author Kenith Grey

PRESENTING DATA. Overview. Some basic things to remember

Workshop: R and Bioinformatics

Work through the sheet in any order you like. Skip the starred (*) bits in the first instance, unless you re fairly confident.

Package reshape2. R topics documented: February 15, Type Package. Title Flexibly reshape data: a reboot of the reshape package. Version 1.2.

Operating systems fundamentals - B07

Lecture 4: Data Visualization I

Running Minitab for the first time on your PC

Getting started with ggplot2

Stat405. Dates and times. Hadley Wickham. Thursday, October 18, 12

Lab: Using R and Bioconductor

Replication of paper: Women As Policy Makers: Evidence From A Randomized Policy Experiment In India

This document is designed to get you started with using R

Surviving SPSS.

Ready To Become Really Productive Using PROC SQL? Sunil K. Gupta, Gupta Programming, Simi Valley, CA

Package lvplot. August 29, 2016

lazyeval A uniform approach to NSE

Integrated Circuit Design Using. Open Cores and Design Tools. Martha SaloméLópez de la Fuente

03 - Intro to graphics (with ggplot2)

Quality Control of Clinical Data Listings with Proc Compare

Notes for week 3. Ben Bolker September 26, Linear models: review

dope, n. information especially from a reliable source [the inside dope]; v. figure out usually used with out; adj. excellent 1

Representing Images as Lists of Spots

Apply. A. Michelle Lawing Ecosystem Science and Management Texas A&M University College Sta,on, TX

Package ggsubplot. February 15, 2013

Precalculus An Investigation of Functions

Introduction to R. Nishant Gopalakrishnan, Martin Morgan January, Fred Hutchinson Cancer Research Center

Data input & output. Hadley Wickham. Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University.

Lecture 3: Basics of R Programming

NCSS Statistical Software. Design Generator

Advanced R. V!aylor & Francis Group. Hadley Wickham. ~ CRC Press

Lecture 09: Feb 13, Data Oddities. Lists Coercion Special Values Missingness and NULL. James Balamuta STAT UIUC

Data 8 Final Review #1

Package raker. October 10, 2017

Lecture 12 Level Sets & Parametric Transforms. sec & ch. 11 of Machine Vision by Wesley E. Snyder & Hairong Qi

Assignment 2: Processing New York City Open Data

S4C03, HW2: model formulas, contrasts, ggplot2, and basic GLMs

Data Visualization Using R & ggplot2. Karthik Ram October 6, 2013

Introduction to C. Sami Ilvonen Petri Nikunen. Oct 6 8, CSC IT Center for Science Ltd, Espoo. int **b1, **b2;

A Quick and focused overview of R data types and ggplot2 syntax MAHENDRA MARIADASSOU, MARIA BERNARD, GERALDINE PASCAL, LAURENT CAUQUIL

Epidemiology Principles of Biostatistics Chapter 3. Introduction to SAS. John Koval

Simultaneous equations 11 Row echelon form

STP 226 ELEMENTARY STATISTICS NOTES

DSC 201: Data Analysis & Visualization

A. Using the data provided above, calculate the sampling variance and standard error for S for each week s data.

March 13, Bellwork. Due Today!

Programming in R. John Verzani. CUNY/the College of Staten Island. October 28, 2008

Checking whether the protocol was followed: gender and age 51

Package rollply. R topics documented: August 29, Title Moving-Window Add-on for 'plyr' Version 0.5.0

Name & Recitation Section:

Transformations with Fred Functions- Packet 1

Data Import and Export

In this tutorial we will see some of the basic operations on data frames in R. We begin by first importing the data into an R object called train.

WEEK 8: FUNCTIONS AND LOOPS. 1. Functions

Let s use Technology Use Data from Cycle 14 of the General Social Survey with Fathom for a data analysis project

Package purrrlyr. R topics documented: May 13, Title Tools at the Intersection of 'purrr' and 'dplyr' Version 0.0.2

Individual Covariates

JavaScript: More Syntax

Microsoft Access - Using Relational Database Data Queries (Stored Procedures) Paul A. Harris, Ph.D. Director, GCRC Informatics.

Package forcats. February 19, 2018

Package ggseas. June 12, 2018

The Average and SD in R

Using Flickr Storm for locating imagery for digital storytelling projects

Joins, NULL, and Aggregation

Lab 4: Distributions of random variables

Repetition Through Recursion

Canadian National Longitudinal Survey of Children and Youth (NLSCY)

Transcription:

Transformations Hadley Wickham October 2009

1. US baby names data 2. Transformations 3. Summaries 4. Doing it by group

Baby names Top 1000 male and female baby names in the US, from 1880 to 2008. 258,000 records (1000 * 2 * 129) But only four variables: year, name, sex and prop. CC BY http://www.flickr.com/photos/the_light_show/2586781132

Getting started library(plyr) bnames <- read.csv("baby-names.csv", stringsasfactors = FALSE) library(ggplot2) qplot(year, prop, colour = sex, data = subset(bnames, name == "Hadley"), geom = "line")

> head(bnames, 15) year name percent sex 1 1880 John 0.081541 boy 2 1880 William 0.080511 boy 3 1880 James 0.050057 boy 4 1880 Charles 0.045167 boy 5 1880 George 0.043292 boy 6 1880 Frank 0.027380 boy 7 1880 Joseph 0.022229 boy 8 1880 Thomas 0.021401 boy 9 1880 Henry 0.020641 boy 10 1880 Robert 0.020404 boy 11 1880 Edward 0.019965 boy 12 1880 Harry 0.018175 boy 13 1880 Walter 0.014822 boy 14 1880 Arthur 0.013504 boy 15 1880 Fred 0.013251 boy > tail(bnames, 15) year name percent sex 257986 2008 Neveah 0.000130 girl 257987 2008 Amaris 0.000129 girl 257988 2008 Hadassah 0.000129 girl 257989 2008 Dania 0.000129 girl 257990 2008 Hailie 0.000129 girl 257991 2008 Jamiya 0.000129 girl 257992 2008 Kathy 0.000129 girl 257993 2008 Laylah 0.000129 girl 257994 2008 Riya 0.000129 girl 257995 2008 Diya 0.000128 girl 257996 2008 Carleigh 0.000128 girl 257997 2008 Iyana 0.000128 girl 257998 2008 Kenley 0.000127 girl 257999 2008 Sloane 0.000127 girl 258000 2008 Elianna 0.000127 girl

Brainstorm Thinking about the data, what are some of the trends that you might want to explore? What additional variables would you need to create? What other data sources might you want to use? Pair up and brainstorm for 2 minutes.

Some of my ideas First/last letter Length Number/percent of vowels Rank Ecdf (how many babies have a name in the top 2, 3, 5, 100 etc) Biblical names? Hurricanes?

letter <- function(x, n = 1) { if (n < 0) { nc <- nchar(x) n <- nc + n + 1 } tolower(substr(x, n, n)) } vowels <- function(x) { nchar(gsub("[^aeiou]", "", x)) } bnames$length <- nchar(bnames$name) table(bnames$length) bnames[bnames$length == 2, ] bnames[bnames$length == 10, ] Very verbose! bnames$first <- letter(bnames$name, 1) bnames$last <- letter(bnames$name, -1) bnames$vowels <- vowels(bnames$name)

Transform, summarise & subset subset(df, subset) transform(df, var1 = expr1,...) summarize(df, var1 = expr1,...) Subset selects rows from a data frame. Transform modifies an existing data frame. Summarise creates a new data frame. All evaluate arguments in scope of df.

bnames <- transform(bnames, first = letter(name, 1), last = letter(name, -1), vowels = vowels(name), length = nchar(name) ) summarise(bnames, min_length = min(length), max_length = max(length) ) subset(bnames, length == 2) subset(bnames, length == 10)

Aside: never use attach! Non-local effects; not symmetric; implicit, not explicit. Makes it very easy to make mistakes. Use with() instead. with(bnames, table(year, length))

Your turn Extract your name from the dataset. Plot the trend over time (hint: use geom="line"). Create a new variable that contains the first three (or four, or five) letters of each name. How many names start the same as yours? Plot the trend over time.

dian <- subset(bnames, substr(name, 1, 4) == "Dian") qplot(year, prop, data = dian, geom = "line") qplot(year, prop, data = dian, geom = "line", group = name) qplot(year, prop, data = dian, geom = "line", group = interaction(name, sex)) qplot(year, prop, data = dian, geom = "line", colour = sex) + facet_wrap(~ name) last_plot() + geom_point()

Group-wise What about group-wise transformations or summaries? e.g. what if we want to compute the rank of a name within a sex and year? This task is easy if we have a single year & sex, but hard otherwise.

Group-wise What about group-wise transformations or summaries? e.g. what if we want to compute the rank of a name within a sex and year? This task is easy if we have a single year & sex, but hard otherwise. Take two minutes to sketch out an approach

one <- subset(bnames, sex == "boy" & year == 2008) one$rank <- rank(-one$prop, ties.method = "first") # or one <- transform(one, rank = rank(-prop, ties.method = "first")) head(one) What if we want to transform every sex and year?

# Split pieces <- split(bnames, list(bnames$sex, bnames$year)) # Apply results <- vector("list", length(pieces)) for(i in seq_along(pieces)) { piece <- pieces[[i]] piece <- transform(piece, rank = rank(-prop, ties.method = "first")) results[[i]] <- piece } # Combine result <- do.call("rbind", results)

# Or equivalently bnames <- ddply(bnames, c("sex", "year"), transform, rank = rank(-prop, ties.method = "first"))

# Or equivalently Input data Way to split up input Function to apply to each piece bnames <- ddply(bnames, c("sex", "year"), transform, rank = rank(-prop, ties.method = "first")) 2 nd argument to transform()

Summaries In a similar way, we can use ddply() for group-wise summaries. There are many base R functions for special cases. Where available, these are often much faster; but you have to know they exist, and have to remember how to use them.

# Explore average length wtd.mean <- function(x, weights) sum(weights * x) / sum(weights) sy <- ddply(bnames, c("sex", "year"), summarise, avg_length = wtd.mean(length, prop)) qplot(year, avg_length, data = sy, colour = sex, geom = "line")

# Explore number of names of each length syl <- ddply(bnames, c("sex", "length", "year"), summarise, prop = sum(prop)) qplot(year, prop, data = syl, colour = sex, geom = "line") + facet_wrap(~ length) twoletters <- subset(bnames, length == 2) unique(twoletters$name) qplot(year, prop, data = twoletters, colour = sex, geom = "line") + facet_wrap(~ name)

Your turn Use these tools to explore how the following have changed over time: The total proportion of babies with names in the top 1000. The number of vowels in a name. The distribution of first (or last) letters.

sy <- ddply(bnames, c("year","sex"), summarise, prop = sum(prop), npop = sum(prop > 1/1000)) qplot(year, prop, data = sy, colour = sex, geom = "line") qplot(year, npop, data = sy, colour = sex, geom = "line") syl <- ddply(bnames, c("sex", "last", "year"), summarise, prop = sum(prop)) qplot(year, prop, data = syl, colour = sex, geom = "line") + facet_wrap(~ last)

More about plyr

Many problems involve splitting up a large data structure, operating on each piece and joining the results back together: split-apply-combine

How you split up depends on the type of input: arrays, data frames, lists How you combine depends on the type of output: arrays, data frames, lists, nothing

array data frame list nothing array aaply adply alply a_ply data frame daply ddply dlply d_ply list laply ldply llply l_ply n replicates raply rdply rlply r_ply function arguments maply mdply mlply m_ply

array data frame list nothing array apply adply alply a_ply data frame daply aggregate by d_ply list sapply ldply lapply l_ply n replicates replicate rdply replicate r_ply function arguments mapply mdply mapply m_ply

Fiddly details Labelling Progress bars Consistent argument names Missing values / Nulls

http://had.co.nz/plyr

This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/ 3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.