Week 4. Big Data Analytics - data.frame manipulation with dplyr

Similar documents
ETC 2420/5242 Lab Di Cook Week 5

Week 8. Big Data Analytics Visualization with plotly for R

The following presentation is based on the ggplot2 tutotial written by Prof. Jennifer Bryan.

Internet Accessibility Continental Comparison. Marcus Leaning Professor of Digital Media Education University of Winchester U.K.

Rstudio GGPLOT2. Preparations. The first plot: Hello world! W2018 RENR690 Zihaohan Sang

Sub-setting Data. Tzu L. Phang

Subsetting, dplyr, magrittr Author: Lloyd Low; add:

POL 345: Quantitative Analysis and Politics

dundee Software Carpentry: R lesson speaker notes

Explore a dataset with Shiny

ITS Introduction to R course

LAB #2: SAMPLING, SAMPLING DISTRIBUTIONS, AND THE CLT

Introduction to R Commander

An Introduction to R- Programming

Introduction to R Programming

The Average and SD in R

Lecture Notes on Memory Layout

file:///users/williams03/a/workshops/2015.march/final/intro_to_r.html

Command Line and Python Introduction. Jennifer Helsby, Eric Potash Computation for Public Policy Lecture 2: January 7, 2016

Lab #9: ANOVA and TUKEY tests

How to Win Coding Competitions: Secrets of Champions. Week 2: Computational complexity. Linear data structures Lecture 5: Stack. Queue.

Dr. Barbara Morgan Quantitative Methods

Econ Stata Tutorial I: Reading, Organizing and Describing Data. Sanjaya DeSilva

Today Function. Note: If you want to retrieve the date and time that the computer is set to, use the =NOW() function.

Draft Proof - do not copy, post, or distribute DATA MUNGING LEARNING OBJECTIVES

GRAPHING IN EXCEL EXCEL LAB #2

R in Linguistic Analysis. Week 2 Wassink Autumn 2012

STAT STATISTICAL METHODS. Statistics: The science of using data to make decisions and draw conclusions

LAB #1: DESCRIPTIVE STATISTICS WITH R

TrelliscopeJS. Ryan Hafen. Modern Approaches to Data Exploration with Trellis Display

Data Manipulation. Module 5

Data Manipulation using dplyr

Multiple-Subscripted Arrays

An Introduction to R. Ed D. J. Berry 9th January 2017

MA 1128: Lecture 02 1/22/2018

THE STATE OF ICT IN SOUTH AND SOUTH-WEST ASIA

A Step by Step Guide to Learning SAS

Today Function. Note: If you want to retrieve the date and time that the computer is set to, use the =NOW() function.

The Energy Data Management Center & Database Structure

How To Test Your Code A CS 1371 Homework Guide

INTRODUCTION TO R. Basic Graphics

Upper left area. It is R Script. Here you will write the code. 1. The goal of this problem set is to learn how to use HP filter in R.

Consolidation and Review

Project III. Today due: Abstracts and description of the data (follow link from course website) schedule so far

Revision of Stata basics in STATA 11:

Working with recursion. From definition to template. Readings: HtDP, sections 11, 12, 13 (Intermezzo 2).

Working with recursion

What is an algorithm?

Oracle Big Data Cloud Service, Oracle Storage Cloud Service, Oracle Database Cloud Service

Univariate descriptives

Introduction to STATA

Intermediate Programming in R Session 4: Avoiding Loops. Olivia Lau, PhD

CSE 341: Programming Languages

THE WORLD IN 2009: ICT FACTS AND FIGURES

Probability and Statistics for Final Year Engineering Students

Excel Basics Rice Digital Media Commons Guide Written for Microsoft Excel 2010 Windows Edition by Eric Miller

Excel Basics Fall 2016

No Name What it does? 1 attach Attach your data frame to your working environment. 2 boxplot Creates a boxplot.

Orientation Assignment for Statistics Software (nothing to hand in) Mary Parker,

6.001 Notes: Section 6.1

Excel Functions & Tables

ScholarOne Manuscripts. COGNOS Reports User Guide

22. LECTURE 22. I can define critical points. I know the difference between local and absolute minimums/maximums.

PS2: LT2.4 6E.1-4 MEASURE OF CENTER MEASURES OF CENTER

Analyzing Economic Data using R

An Introduction to Stata Part II: Data Analysis

Data Import and Formatting

K-fold cross validation in the Tidyverse Stephanie J. Spielman 11/7/2017

Lists, loops and decisions

A Linear-Time Heuristic for Improving Network Partitions

CSci 1113 Lab Exercise 6 (Week 7): Arrays & Strings

Minitab Notes for Activity 1

In this tutorial we will see some of the basic operations on data frames in R. We begin by first importing the data into an R object called train.

Data types and structures

Python for C programmers

CS 211: Recursion. Chris Kauffman. Week 13-1

Python for Data Analysis. Prof.Sushila Aghav-Palwe Assistant Professor MIT

Data Frames and Control September 2014

CS 211: Recursion. Chris Kauffman. Week 13-1

NOTES.md - R lesson THINGS TO REMEMBER. Start the slides TITLE ETHERPAD LEARNING OBJECTIVES

Python review. 1 Python basics. References. CS 234 Naomi Nishimura

Provided by - Microsoft Placement Paper Technical 2012

Telefonica in Latin America: Transforming Investment into Growth and Development The EURO: Global Implications and Relevance for Latin America

Emerging Markets Payment Index: Q Report by Fortumo

Big Data and Scripting mongodb map reduce

Service Line Export and Pivot Table Report (Windows Excel 2010)

Section 13: data.table

QUEEN MARY, UNIVERSITY OF LONDON DCS128 ALGORITHMS AND DATA STRUCTURES Class Test Monday 27 th March

Lecture 1: Getting Started and Data Basics

CSci 1113, Fall 2015 Lab Exercise 7 (Week 8): Arrays! Strings! Recursion! Oh my!

Tutorial on Memory Management, Deadlock and Operating System Types

Statistics for Biologists: Practicals

CSE 20. Lecture 4: Number System and Boolean Function. CSE 20: Lecture2

Lecture 3: Basics of R Programming

Dynamic Memory Allocation

Do not turn this page until you have received the signal to start. In the meantime, please read the instructions below carefully.

Simplified User Rights

2. The histogram. class limits class boundaries frequency cumulative frequency

MAXIMIZING THE WIRELESS OPPORTUNITY TO CLOSE THE DIGITAL GAP

START GUIDE QUICK. RoomWizard Analytics Console.

Transcription:

Week 4. Big Data Analytics - data.frame manipulation with dplyr Hyeonsu B. Kang hyk149@eng.ucsd.edu April 2016 1 Dplyr In the last lecture we have seen how to index an individual cell in a data frame, extract a vector of values from a column of a data frame, and subset a data frame using the age-weight-height data set. Today, we will learn how to use a package (dplyr) to efficiently manipulate data frames. Download the data file from https://goo.gl/pgmbkj and place it in the working directory of your R project. Loading a csv file is pretty simple, you can use the built-in read function of R. > getwd () # find your working directory # place the data file in the directory and > data = read. csv ("") > data NOTE: You can set the working directory using setwd(<new path>) 1.1 Sanity Check Let s first see how the data set is structured: > nrow ( data ) [1] 1704 > ncol ( data ) [1] 6 > colnames ( data ) [1] " country " " year " " pop " " continent " " lifeexp " [6] " gdppercap " There are 1,704 observations and 6 variables. A reasonable check might be to see whether observations have correct years, bigger than 0 population, life expectancy and GDP per capita. > sum ( data $ year > 2016 data $ pop <= 0 data $ lifeexp <= 0 data $ gdppercap <= 0) [1] 0 Good. The data does not seem to have invalid observations. Now, installing the dplyr package is possible as follows: > install. packages (" dplyr ") > library (" dplyr ") 1.2 group by() and summarize() Let us first generate some simple statistics. How can we compute yearly mean GDP per capita of different countries? If we were to use the elementary data frame manipulation operations, we would do something like below using logical indexing: > unique ( data $ year ) [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 > Y1952 = data $ year == 1952 > Y1957 = data $ year == 1957 > Y1962 = data $ year == 1962 >... > GDP1952 = mean ( data [ Y1952,]$ gdppercap ) > GDP1952 [1] 3725. 276 1

Figure 1: Gapminder Data As you can see, there are mainly two steps: (1) make individual variables for each of the observation year, and (2) generate statistics for each of the variables. Of course, you would have to find the unique values of the year variable beforehand. Since the structure of computation is repetitive each year, one might make a module that generates a new variable for each of the unique year numbers and then stores the statistics into it and reuse it. Dplyr makes the matter even simpler, the repetitive procedures are simply done by calling the group by function: > GDPbyYear = data % >% + group _by( year ) % >% + summarize ( MeanGDP = mean ( gdppercap )) > GDPbyYear Source : local data frame [12 x 2] year MeanGDP ( int ) ( dbl ) 1 1952 3725. 276 2 1957 4299. 408 3 1962 4725. 812 4 1967 5483. 653 5 1972 6770. 083 6 1977 7313. 166 7 1982 7518. 902 8 1987 7900. 920 9 1992 8158. 609 10 1997 9090. 175 11 2002 9917. 848 12 2007 11680. 072 There is a new notation to become familiar with: %>%. This dplyr pipeline operator passes object from the left hand side as first argument (or. argument) of the function on right hand side. For example, x % >% f( y) is the same as f(x, y) y % >% f(x,., z) is the same as f(x, y, z) The group by operation by itself simply outputs the grouped data frame. Since the number of variables nor observations changes, it is a little confusing whether or not the grouping operation is performed. This can be verified by looking at the data frames internal representation using str(): 2

> groupeddata = data % >% group _by( year ) > str ( groupeddata ) Classes grouped _df, tbl _df, tbl and data. frame : 1704 obs. of 6 variables : > str ( data ) data. frame : 1704 obs. of 6 variables : So the summarize() function we used before was, in fact, operated over these internal groups formed in the data frame after group by(). Grouping by multiple variables is also possible. For example, we can group the GDP per capita variable with respect of continent and year. Note that the order of grouping affects the representation of the resulting data frame: > GDPByContiByYear = data % >% + group _by( continent, year ) % >% + summarize ( meangdp = mean ( gdppercap )) > GDPByContiByYear Source : local data frame [60 x 3] Groups : continent [?] 1 Africa 1952 1252. 572 2 Africa 1957 1385. 236 3 Africa 1962 1598. 079 4 Africa 1967 2050. 364 5 Africa 1972 2339. 616 6 Africa 1977 2585. 939 7 Africa 1982 2481. 593 8 Africa 1987 2282. 669 9 Africa 1992 2281. 810 10 Africa 1997 2378. 760........... > GDPByYearByConti = data % >% + group _by(year, continent ) % >% + summarize ( meangdp = mean ( gdppercap )) > GDPByYearByConti Source : local data frame [60 x 3] Groups : year [?] year continent meangdp ( int ) ( fctr ) ( dbl ) 1 1952 Africa 1252. 572 2 1952 Americas 4079. 063 3 1952 Asia 5195. 484 4 1952 Europe 5661. 057 5 1952 Oceania 10298. 086 6 1957 Africa 1385. 236 7 1957 Americas 4616. 044 8 1957 Asia 5787. 733 9 1957 Europe 6963. 013 10 1957 Oceania 11598. 522........... 1.3 filter() What if we wanted to compare different continents GDP per capita in year 2002? > Y2002GDPByConti = GDPByContiByYear % >% + filter ( year == 2002) > Y2002GDPByConti Source : local data frame [5 x 3] Groups : continent [5] 1 Africa 2002 2599. 385 3

2 Americas 2002 9287. 677 3 Asia 2002 10174. 090 4 Europe 2002 21711. 732 5 Oceania 2002 26938. 778 We can use the filter() operation. This operation performs the task that is similar to logical indexing in default R. Now we can answer questions like which continent had the highest/lowest GDP per capita in year 2002? or which country had the highest life expectancy in year 1997? > Y2002GDPByConti [ which. max ( Y2002GDPByConti $ meangdp ),] Source : local data frame [1 x 3] Groups : continent [1] 1 Oceania 2002 26938. 78 Similarly, for the lowest GDP per capita we can use which.min(). 1.4 mutate() Appending a new column with, for example, log value of GDP per capita is also possible. To compute and append one or more new columns to a data frame we use mutate() > LogGDPByContiByYear = GDPByContiByYear % >% + mutate ( loggdp = log ( meangdp )) > LogGDPByContiByYear Source : local data frame [60 x 4] Groups : continent [5] loggdp ( dbl ) 1 Africa 1952 1252. 572 7. 132955 2 Africa 1957 1385. 236 7. 233626 3 Africa 1962 1598. 079 7. 376557 4 Africa 1967 2050. 364 7. 625773 5 Africa 1972 2339. 616 7. 757742 6 Africa 1977 2585. 939 7. 857844 7 Africa 1982 2481. 593 7. 816656 8 Africa 1987 2282. 669 7. 733101 9 Africa 1992 2281. 810 7. 732724 10 Africa 1997 2378. 760 7. 774334.............. Let us generate a histogram for logged GDP per capita per country in year 2002. The resulting histogram is fig. 2. 1.5 Practice > Y2002LogGDP = data % >% + mutate ( loggdp = log ( gdppercap )) % >% + filter ( year ==2002) > hist ( Y2002LogGDP $ loggdp ) Problem 1. Which country had the highest life expectancy in 2002 and what was the value of it? Problem 2. Which continent had the lowest GDP per capita in 1967? Problem 3. How does the distribution of GDP per capita look like in 1967? 4

Figure 2: Logged GDP in 2002 5