Introduction to R and the tidyverse
Paolo Crosetto

Lecture 3: merging & tidying data

Before we start: tidyverse
- by now you should all have your laptops with you
- so let's go back to the initial setup and do a clean install of the tidyverse
    install.packages("tidyverse")

Before we start: tidyverse
- the tidyverse package installs lots of stuff, in particular:
- ggplot2, dplyr -> seen earlier
- tidyr -> seen today
- you load the package using library(tidyverse)
- and it loads all the needed packages for you

Today's topics
today we will deal with three topics:
1. getting data into (and out of) R
2. joining data from different tables
3. tidying data

Importing data

Getting data into R: packages
- up to now we have worked with data sets that come from packages
- easy to do: install a package, then call a function with data attached
- all the hard work has been done for you
- if you wish, you can import the data into your workspace, e.g.
    library(nycflights13)
    df <- flights

Getting data into R: other sources
- life is not always that easy
- you might have data in the form of (aaarg!) Excel files
- you might have comma-separated (csv) data
- you might have data coming from SPSS, SAS, STATA, ...
- or text data from ASCII sources

Getting data into R: readr vs haven
- when you load the tidyverse (library(tidyverse)) you automatically load readr
- this is a package that gives you (verb) functions to load data into R nicely
- readr provides functions to load most text-based delimited files, especially .csv
- if you want to read in a STATA, SAS or SPSS file, you need the package haven (library(haven))
- readr is loaded automatically by the tidyverse call
- haven needs to be loaded explicitly (not shown here)
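
The slide mentions haven without showing it, so here is a minimal sketch of what reading a STATA file might look like. The file is a hypothetical toy data set written to a temporary location first, only so that the example runs on its own; with real data you would pass the path of your own .dta file to read_dta().

```r
library(haven)

# Hypothetical toy data, written to a temporary STATA file
# so that there is something to read back in.
dta_path <- tempfile(fileext = ".dta")
write_dta(data.frame(id = 1:3, score = c(0.5, 0.7, 0.9)), dta_path)

# Reading a STATA file; haven also offers read_sav() (SPSS) and read_sas() (SAS).
stata_df <- read_dta(dta_path)
stata_df
```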

A simple example
- you find some data here: goo.gl/kpycfh
- this is the Human Development Index, by country
- higher numbers (nearer to 1) are better
- save the file to disk, somewhere you know about
- save it as HDI.csv
- open it up with a text editor: what do you see?

A simple example
- now that your data is saved, how do you import it into R?
- you use read_csv("path_to_file")
- in my case:
    df <- read_csv("/home/paolo/dropbox/public/hdidata.csv")
    ## Parsed with column specification:
    ## cols(
    ##   `HDI Rank` = col_integer(),
    ##   Country = col_character(),
    ##   HDI = col_double()
    ## )

There is more but...
- read_csv just did a ton of things for you under the hood
- but it doesn't really matter at your stage
- so you just live with the results
- other useful functions:
- if the separator is ; rather than , use read_csv2
- if the separator is a TAB rather than , use read_tsv
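
To make the difference between the three readers concrete, here is a small sketch. The file contents are invented toy data (two made-up countries), written to temporary files only so the example is self-contained.

```r
library(readr)

# Hypothetical toy files, one per separator convention
csv_path  <- tempfile(fileext = ".csv")
csv2_path <- tempfile(fileext = ".csv")
tsv_path  <- tempfile(fileext = ".tsv")

writeLines(c("country,hdi", "A,0.9", "B,0.5"), csv_path)
writeLines(c("country;hdi", "A;0,9", "B;0,5"), csv2_path)  # ';' separator, ',' as decimal mark
writeLines(c("country\thdi", "A\t0.9", "B\t0.5"), tsv_path)

df_comma <- read_csv(csv_path)    # comma separated
df_semi  <- read_csv2(csv2_path)  # semicolon separated, decimal comma
df_tab   <- read_tsv(tsv_path)    # tab separated
```

All three calls should yield the same two-row table; read_csv2 is the one to reach for when the file was exported by software using continental European conventions.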

Some hints
- you can always export to .csv, in all programs
- even in Excel!
- so once you have exported to .csv, it is all downhill from there
- and it is even better to do so, because .csv is universal
- while binary formats (.dta, .xls, ...) force you to have the appropriate tool for reading them
- so try to keep a copy of your data in a text-based format: it is always readable should everything go wrong
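
The lecture promises data "into (and out of)" R, so here is a minimal sketch of the export direction, using readr's write_csv(). The data frame and its values are hypothetical.

```r
library(readr)

# Hypothetical toy data
scores <- data.frame(country = c("A", "B"), hdi = c(0.9, 0.5))

# Write a plain-text copy that any program can open
out_path <- tempfile(fileext = ".csv")
write_csv(scores, out_path)

# Look at the raw text that ended up on disk
readLines(out_path)
```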

Joining datasets

Data scattered around
- you do not always have all the data you need in one dataset
- it is usually scattered around several datasets
- that might or might not be linked / linkable
- e.g. you might need to merge data coming from different sources (INSEE and Eurostat)
- or you might do some computations / summaries and would like to merge these back

Using the nycflights13 dataset again: planes
    library(nycflights13)
    planes <- nycflights13::planes
    planes
    ## # A tibble: 3,322 x 9
    ##    tailnum  year type                    manufacturer     model
    ##    <chr>   <int> <chr>                   <chr>            <chr>
    ##  1 N10156   2004 Fixed wing multi engine EMBRAER          EMB-145XR
    ##  2 N102UW   1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
    ##  3 N103US   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
    ##  4 N104UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
    ##  5 N10575   2002 Fixed wing multi engine EMBRAER          EMB-145LR
    ##  6 N105UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
    ##  7 N107US   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
    ##  8 N108UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
    ##  9 N109UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
    ## 10 N110UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
    ## # ... with 3,312 more rows, and 4 more variables: engines <int>,
    ## #   seats <int>, speed <int>, engine <chr>

Using the nycflights13 dataset again: airports
    airports <- nycflights13::airports
    airports
    ## # A tibble: 1,458 x 8
    ##    faa   name                             lat   lon   alt    tz
    ##    <chr> <chr>                          <dbl> <dbl> <int> <dbl>
    ##  1 04G   Lansdowne Airport
    ##  2 06A   Moton Field Municipal Airport
    ##  3 06C   Schaumburg Regional
    ##  4 06N   Randall Airport
    ##  5 09J   Jekyll Island Airport
    ##  6 0A9   Elizabethton Municipal Airport
    ##  7 0G6   Williams County Airport
    ##  8 0G7   Finger Lakes Regional Airport
    ##  9 0P2   Shoestring Aviation Airfield
    ## 10 0S9   Jefferson County Intl
    ## # ... with 1,448 more rows, and 2 more variables: dst <chr>, tzone <chr>

Using the nycflights13 dataset again: flights
    flights <- nycflights13::flights
    flights
    ## # A tibble: 336,776 x 19
    ##    year month   day dep_time sched_dep_time dep_delay arr_time
    ##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
    ## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
    ## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    ## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
    ## #   minute <dbl>, time_hour <dttm>

Inspect the datasets
- what do these datasets contain?
- what variables do they have in common?
- do they have some unique identifier (key)?
- how are they related to one another?

The datasets
- planes has information on each plane (model, type, date of construction, ...)
- airports has information on each airport (faa code, name, latitude, longitude, altitude, ...)
- flights has information on each flight that departed from a NYC airport

Joining different datasets: example
- problem: do newer planes fly the longest routes from NYC?
- to answer this, you need to combine data from two sources:
- flights, to get the route's length in miles
- planes, to get the date the plane was first operational
- how do you join the two data frames?

Joining two datasets: key
- first you need to find a unique identifier for your data: a key
- a unique identifier takes a different value for every row of the dataset
- to find one, either you use your intuition or you check
    planes %>% count(tailnum) %>% filter(n>1)
    ## # A tibble: 0 x 2
    ## # ... with 2 variables: tailnum <chr>, n <int>
- count(var) counts how many times each value of var appears and stores the count in a new variable n
- by filtering for n>1 you check whether any value appears more than once
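
The planes check above returns zero rows, so it may help to see what a failed check looks like. This sketch uses a hypothetical toy table in which the value "b" deliberately appears twice, so id is not a valid key.

```r
library(dplyr)

# Hypothetical toy table: "b" appears twice, so `id` is NOT a key
toy <- tibble(id = c("a", "b", "b", "c"), value = 1:4)

# Same idiom as on the slide: count each id, keep those seen more than once
dups <- toy %>%
  count(id) %>%
  filter(n > 1)

dups
```

A non-empty result lists exactly the values that break uniqueness; an empty result, as for tailnum in planes, means the variable can serve as a key.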

Joining
- once you know the key, you can use the join family of functions
- imagine you have two datasets with variables and values as follows:
[Figure 1: two example datasets]
- there is some overlapping information in the two tables
- but there is also new information: column D only in dataset Y

Joining
- joining always combines data from two tables into one
- the syntax is always the same: join(left, right, by = "key")
- left and right: two data frames
- key: the unique identifier of observations (in one or both data frames)

The joining family
- different join functions make different assumptions about what to do with the data that are NOT matched
- full_join() keeps everything, adds NA
- inner_join() keeps only matched data
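
A small sketch of how the members of the family differ on unmatched rows. The two tables x and y are hypothetical toy data: keys 1 and 2 match, key 3 exists only in x, key 4 only in y. left_join() and right_join() are included alongside the two functions named on the slide.

```r
library(dplyr)

# Hypothetical toy tables
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

inner <- inner_join(x, y, by = "key")  # only matched keys: 1, 2
left  <- left_join(x, y, by = "key")   # all rows of x: 1, 2, 3 (val_y is NA for 3)
right <- right_join(x, y, by = "key")  # all rows of y: 1, 2, 4 (val_x is NA for 4)
full  <- full_join(x, y, by = "key")   # everything: 1, 2, 3, 4

c(inner = nrow(inner), left = nrow(left), right = nrow(right), full = nrow(full))
```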

The default: left_join()
- left_join() is the default because you usually add some variables to a large dataset
- in our case: do newer planes fly the longest routes from NYC?
- we have most information in the flights dataset
- we need only the year built from the planes dataset

Answering our question: joining
    distance <- flights %>% select(tailnum, distance)
    yearbuilt <- planes %>% select(tailnum, year)
    answer <- left_join(distance, yearbuilt, by = "tailnum")
    answer
    ## # A tibble: 336,776 x 3
    ##    tailnum distance  year
    ##    <chr>      <dbl> <int>
    ##  1 N14228      1400  1999
    ##  2 N24211      1416  1998
    ##  3 N619AA      1089  1990
    ##  4 N804JB      1576  2012
    ##  5 N668DN       762  1991
    ##  6 N39463       719  2012
    ##  7 N516JB      1065  2000
    ##  8 N829AS       229  1998
    ##  9 N593JB       944  2004
    ## 10 N3ALAA       733    NA
    ## # ... with 336,766 more rows

Answering our question: the answer
- there does not seem to be any connection between the year built and the length of the flight
    answer %>%
      group_by(year) %>%
      summarise(dist = mean(distance, na.rm = TRUE)) %>%
      ggplot(aes(x = year, y = dist)) + geom_point() +
      geom_smooth(method = "lm")
[Figure: scatter plot of mean distance (dist) against year, with a linear fit]

Joining: exercise
- how many flights from NYC land in an airport whose altitude is > 1000 m?
- note: 1 metre = 3.28084 feet
- altitude is in the airports df
- flights are in the flights df

Solution
- a lot of flights, since Denver sits at 1600 m!
    alt_df <- airports %>% select(faa,alt) %>% mutate(alt = alt/ ) %>% ren
    answer <- left_join(flights, alt_df, by = "dest") %>% filter(alt>1000)
    answer
    ## # A tibble: 10,291 x 20
    ##    year month   day dep_time sched_dep_time dep_delay arr_time
    ##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
    ## # ... with 10,281 more rows, and 13 more variables: sched_arr_time <int>,
    ## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    ## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
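
The first line of the solution is cut off in the transcription. The sketch below is one plausible completion, not necessarily the author's exact code. It assumes that the missing divisor is the 3.28084 feet-per-metre factor from the exercise note, and that the truncated last step renames faa to dest so that the join by "dest" works.

```r
library(dplyr)
library(nycflights13)

# Assumption: convert altitude from feet to metres, then rename the
# airport code so it matches the `dest` column of flights.
alt_df <- airports %>%
  select(faa, alt) %>%
  mutate(alt = alt / 3.28084) %>%
  rename(dest = faa)

# Attach the destination altitude to every flight, keep the high ones
high_flights <- left_join(flights, alt_df, by = "dest") %>%
  filter(alt > 1000)

nrow(high_flights)
```

Destinations that are missing from airports get an NA altitude after the left join; filter() drops those rows, because a comparison with NA is not TRUE.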

Solution: a plot to see the impact of Denver
    answer %>% ggplot(aes(dest)) + geom_bar()
[Figure: bar chart of the number of flights (count) by destination (dest): ABQ, BZN, DEN, EGE, HDN, JAC, MTJ, SLC]

Joining three datasets
- how old are the planes that fly to airports whose altitude is > 1000 m?

Joining three datasets: solution
    answer <- left_join(flights, yearbuilt, by = "tailnum")
    answer <- left_join(answer, alt_df, by = "dest")
    answer %>% filter(alt>1000) %>% summarise(avgyear = mean(year.y, na.rm = TRUE))
    ## # A tibble: 1 x 1
    ##   avgyear
    ##     <dbl>
    answer %>% filter(alt<=1000) %>% summarise(avgyear = mean(year.y, na.rm = TRUE))
    ## # A tibble: 1 x 1
    ##   avgyear
    ##     <dbl>
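
The solution refers to year.y, a column that exists in neither input. The sketch below, on hypothetical toy tables, shows where that name comes from: when both tables have a non-key column with the same name, the join appends .x (left table) and .y (right table). The suffix argument lets you choose more telling names.

```r
library(dplyr)

# Hypothetical toy tables that both contain a `year` column
trips <- tibble(tailnum = c("P1", "P2"), year = c(2013L, 2013L))  # year of the flight
built <- tibble(tailnum = c("P1", "P2"), year = c(1999L, 2005L))  # year the plane was built

# Default behaviour: clashing names become year.x and year.y
joined <- left_join(trips, built, by = "tailnum")
names(joined)

# Optional: pick explicit suffixes instead
named <- left_join(trips, built, by = "tailnum", suffix = c("_flight", "_built"))
names(named)
```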

Tidy data

Messy data -> tidy data
"Happy families are all alike; every unhappy family is unhappy in its own way." - Leo Tolstoy
- the data we have worked with so far are all well formatted
- this is not the case in real life
- we need to be able to format data in a convenient way
- if you work with the tools we've seen (dplyr, ggplot2), then you want tidy data

A simple dataset in four versions
    table1
    ## # A tibble: 6 x 4
    ##   country      year  cases population
    ##   <chr>       <int>  <int>      <int>
    ## 1 Afghanistan  1999    745   19987071
    ## 2 Afghanistan  2000   2666   20595360
    ## 3 Brazil       1999  37737  172006362
    ## 4 Brazil       2000  80488  174504898
    ## 5 China        1999 212258 1272915272
    ## 6 China        2000 213766 1280428583

A simple dataset in four versions
    table2
    ## # A tibble: 12 x 4
    ##    country      year type            count
    ##    <chr>       <int> <chr>           <int>
    ##  1 Afghanistan  1999 cases             745
    ##  2 Afghanistan  1999 population   19987071
    ##  3 Afghanistan  2000 cases            2666
    ##  4 Afghanistan  2000 population   20595360
    ##  5 Brazil       1999 cases           37737
    ##  6 Brazil       1999 population  172006362
    ##  7 Brazil       2000 cases           80488
    ##  8 Brazil       2000 population  174504898
    ##  9 China        1999 cases          212258
    ## 10 China        1999 population 1272915272
    ## 11 China        2000 cases          213766
    ## 12 China        2000 population 1280428583

A simple dataset in four versions
    table3
    ## # A tibble: 6 x 3
    ##   country      year rate
    ## * <chr>       <int> <chr>
    ## 1 Afghanistan  1999 745/19987071
    ## 2 Afghanistan  2000 2666/20595360
    ## 3 Brazil       1999 37737/172006362
    ## 4 Brazil       2000 80488/174504898
    ## 5 China        1999 212258/1272915272
    ## 6 China        2000 213766/1280428583

A simple dataset in four versions
    table4a  # cases
    ## # A tibble: 3 x 3
    ##   country     `1999` `2000`
    ## * <chr>        <int>  <int>
    ## 1 Afghanistan    745   2666
    ## 2 Brazil       37737  80488
    ## 3 China       212258 213766
    table4b  # population
    ## # A tibble: 3 x 3
    ##   country         `1999`     `2000`
    ## * <chr>            <int>      <int>
    ## 1 Afghanistan   19987071   20595360
    ## 2 Brazil       172006362  174504898
    ## 3 China       1272915272 1280428583

Tidy, untidy data
- tidy data has the following characteristics:
- each variable has its own column
- each observation has its own row
- each value has its own cell
- have a look at the tables: what is an observation? what is a variable?
- do you see problems in the tables?

Tidy data: the tidyr package
- tidyr is part of the tidyverse
- it is automatically loaded with library(tidyverse)
- tidyr provides 4 main verbs:
- gather vs. spread
- separate vs. unite

Gathering: from wide to long
- sometimes variables are in the column names: bad!
    table4a
    ## # A tibble: 3 x 3
    ##   country     `1999` `2000`
    ## * <chr>        <int>  <int>
    ## 1 Afghanistan    745   2666
    ## 2 Brazil       37737  80488
    ## 3 China       212258 213766
- year is a variable, but it sits in the column names
- the content is cases, but it has no variable name

Gathering
- we need to reshape the data from wide to long, so that year becomes a variable and 1999 and 2000 become values
- we use gather(vars, key, value)
- vars: the column names that are not actually variables but values
- key: the (new) name to be given to the (new) column that will store the (former) column names
- value: the (new) name to be given to the (new) column that will store the values that were spread over several columns
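
To see the three arguments at work, here is a sketch on a hypothetical toy table with invented numbers. Recent versions of tidyr (1.0.0 and later) also provide pivot_longer(), which does the same reshaping with differently named arguments; it is shown for comparison and is not part of the original slides.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide table: the years are stored in the column names
wide <- tibble(country = c("A", "B"),
               `1999` = c(10L, 30L),
               `2000` = c(20L, 40L))

# gather(): columns to collect, then names for the new key and value columns
long <- wide %>%
  gather(`1999`, `2000`, key = "year", value = "cases")

# Newer equivalent (tidyr >= 1.0.0)
long2 <- wide %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")

long
```

Both results have one row per country and year; only the row order differs.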

Gathering
- what happens if we provide NO arguments?
- everything is gathered
- just two columns are left (key & value)
    table4a %>% gather()
    ## # A tibble: 9 x 2
    ##   key     value
    ##   <chr>   <chr>
    ## 1 country Afghanistan
    ## 2 country Brazil
    ## 3 country China
    ## 4 1999    745
    ## 5 1999    37737
    ## 6 1999    212258
    ## 7 2000    2666
    ## 8 2000    80488
    ## 9 2000    213766

Gathering
- what if we provide arguments?
    cases <- table4a %>% gather(`1999`,`2000`, key = year, value = cases) %>% ar
    cases
    ## # A tibble: 6 x 3
    ##   country     year   cases
    ##   <chr>       <chr>  <int>
    ## 1 Afghanistan 1999     745
    ## 2 Afghanistan 2000    2666
    ## 3 Brazil      1999   37737
    ## 4 Brazil      2000   80488
    ## 5 China       1999  212258
    ## 6 China       2000  213766
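
The end of the gather() call is cut off in the transcription. Since the printed result is grouped by country, a plausible completion is a final arrange(country); this is an assumption, not the confirmed original code. table4a ships with tidyr, so the sketch runs as is.

```r
library(tidyr)
library(dplyr)

# Assumed completion: sort by country after gathering
cases <- table4a %>%
  gather(`1999`, `2000`, key = year, value = cases) %>%
  arrange(country)

cases
```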

Gathering
- we can do the same for the population table (table4b)
    pop <- table4b %>% gather(`1999`,`2000`, key = year, value = population)
    pop
    ## # A tibble: 6 x 3
    ##   country     year  population
    ##   <chr>       <chr>      <int>
    ## 1 Afghanistan 1999    19987071
    ## 2 Brazil      1999   172006362
    ## 3 China       1999  1272915272
    ## 4 Afghanistan 2000    20595360
    ## 5 Brazil      2000   174504898
    ## 6 China       2000  1280428583

Gathering
- we can merge the two tables and we'll get back to table1
    left_join(cases, pop, by = c("country","year"))
    ## # A tibble: 6 x 4
    ##   country     year   cases population
    ##   <chr>       <chr>  <int>      <int>
    ## 1 Afghanistan 1999     745   19987071
    ## 2 Afghanistan 2000    2666   20595360
    ## 3 Brazil      1999   37737  172006362
    ## 4 Brazil      2000   80488  174504898
    ## 5 China       1999  212258 1272915272
    ## 6 China       2000  213766 1280428583

Spreading: from long to wide
    table2
    ## # A tibble: 12 x 4
    ##    country      year type            count
    ##    <chr>       <int> <chr>           <int>
    ##  1 Afghanistan  1999 cases             745
    ##  2 Afghanistan  1999 population   19987071
    ##  3 Afghanistan  2000 cases            2666
    ##  4 Afghanistan  2000 population   20595360
    ##  5 Brazil       1999 cases           37737
    ##  6 Brazil       1999 population  172006362
    ##  7 Brazil       2000 cases           80488
    ##  8 Brazil       2000 population  174504898
    ##  9 China        1999 cases          212258
    ## 10 China        1999 population 1272915272
    ## 11 China        2000 cases          213766
    ## 12 China        2000 population 1280428583

Spreading: from long to wide
- we need to reshape the data from long to wide, so that type gets split into the variables cases and population, and the count values get assigned to the proper column
- we use spread(key, value)
- key: the (existing) name of the column that contains the variable names
- value: the (existing) name of the column that contains the values of the (to be created) variables
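
A sketch of spread() on a hypothetical toy table with invented numbers. As with gathering, recent tidyr versions (1.0.0 and later) offer a newer equivalent, pivot_wider(), shown here for comparison only.

```r
library(tidyr)
library(dplyr)

# Hypothetical long table: `type` holds variable names, `count` their values
long <- tibble(country = c("A", "A", "B", "B"),
               type    = c("cases", "population", "cases", "population"),
               count   = c(10, 1000, 30, 3000))

# spread(): one new column per distinct value of `type`
wide <- spread(long, key = type, value = count)

# Newer equivalent (tidyr >= 1.0.0)
wide2 <- pivot_wider(long, names_from = type, values_from = count)

wide
```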

Spreading: from long to wide
    spread(table2, key = type, value = count)
    ## # A tibble: 6 x 4
    ##   country      year  cases population
    ## * <chr>       <int>  <int>      <int>
    ## 1 Afghanistan  1999    745   19987071
    ## 2 Afghanistan  2000   2666   20595360
    ## 3 Brazil       1999  37737  172006362
    ## 4 Brazil       2000  80488  174504898
    ## 5 China        1999 212258 1272915272
    ## 6 China        2000 213766 1280428583

Separating: from one to more variables
- what is wrong with this table?
    table3
    ## # A tibble: 6 x 3
    ##   country      year rate
    ## * <chr>       <int> <chr>
    ## 1 Afghanistan  1999 745/19987071
    ## 2 Afghanistan  2000 2666/20595360
    ## 3 Brazil       1999 37737/172006362
    ## 4 Brazil       2000 80488/174504898
    ## 5 China        1999 212258/1272915272
    ## 6 China        2000 213766/1280428583

Separating
- the variable rate contains two pieces of information: number of cases and population
- we need to separate the variable into two (in this case) variables
    separate(table3, col = rate, into = c("cases", "population"))
    ## # A tibble: 6 x 4
    ##   country      year cases  population
    ## * <chr>       <int> <chr>  <chr>
    ## 1 Afghanistan  1999 745    19987071
    ## 2 Afghanistan  2000 2666   20595360
    ## 3 Brazil       1999 37737  172006362
    ## 4 Brazil       2000 80488  174504898
    ## 5 China        1999 212258 1272915272
    ## 6 China        2000 213766 1280428583

Separating
- separate() correctly guessed that the point to separate was /
- but this is not always so easy
- so you can provide the actual separator character with sep=
- if we use the wrong one...
    separate(table3, col = rate, into = c("cases", "population"), sep = "7")
    ## Warning: Too many values at 3 locations: 1, 3, 5
    ## Warning: Too few values at 1 locations: 2
    ## # A tibble: 6 x 4
    ##   country      year cases         population
    ## * <chr>       <int> <chr>         <chr>
    ## 1 Afghanistan  1999               45/1998
    ## 2 Afghanistan  2000 2666/20595360 <NA>
    ## 3 Brazil       1999 3
    ## 4 Brazil       2000 80488/1       4504898
    ## 5 China        1999 212258/12     29152
    ## 6 China        2000 213           66/1280428583

Separating
- separate() keeps the variables as characters
- this is safe: it doesn't make assumptions
- but sometimes it is best to have it create int or dbl variables
    separate(table3, col = rate, into = c("cases", "population"), convert = TRUE)
    ## # A tibble: 6 x 4
    ##   country      year  cases population
    ## * <chr>       <int>  <int>      <int>
    ## 1 Afghanistan  1999    745   19987071
    ## 2 Afghanistan  2000   2666   20595360
    ## 3 Brazil       1999  37737  172006362
    ## 4 Brazil       2000  80488  174504898
    ## 5 China        1999 212258 1272915272
    ## 6 China        2000 213766 1280428583
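
The sep and convert arguments can be combined. The sketch below uses a hypothetical two-row table in the style of table3, names the separator explicitly, and asks for type conversion in the same call. Note that sep is interpreted as a regular expression when it is a string.

```r
library(tidyr)
library(dplyr)

# Hypothetical table in the style of table3
rates <- tibble(id = 1:2, rate = c("745/19987071", "2666/20595360"))

# Explicit separator plus automatic conversion of the new columns
parts <- separate(rates, col = rate,
                  into = c("cases", "population"),
                  sep = "/", convert = TRUE)

parts
```

With convert = TRUE the new columns should come out as integers rather than character strings, so they can be used directly in computations such as cases / population.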

Uniting: from several to one variable
    table5
    ## # A tibble: 6 x 4
    ##   country     century year  rate
    ## * <chr>       <chr>   <chr> <chr>
    ## 1 Afghanistan 19      99    745/19987071
    ## 2 Afghanistan 20      00    2666/20595360
    ## 3 Brazil      19      99    37737/172006362
    ## 4 Brazil      20      00    80488/174504898
    ## 5 China       19      99    212258/1272915272
    ## 6 China       20      00    213766/1280428583

Uniting
- the complementary verb to separate() is unite()
    unite(table5, year, century, year)
    ## # A tibble: 6 x 3
    ##   country     year  rate
    ## * <chr>       <chr> <chr>
    ## 1 Afghanistan 19_99 745/19987071
    ## 2 Afghanistan 20_00 2666/20595360
    ## 3 Brazil      19_99 37737/172006362
    ## 4 Brazil      20_00 80488/174504898
    ## 5 China       19_99 212258/1272915272
    ## 6 China       20_00 213766/1280428583
- by default unite() uses _ as a separator

Uniting
    unite(table5, year, century, year, sep = "")
    ## # A tibble: 6 x 3
    ##   country     year  rate
    ## * <chr>       <chr> <chr>
    ## 1 Afghanistan 1999  745/19987071
    ## 2 Afghanistan 2000  2666/20595360
    ## 3 Brazil      1999  37737/172006362
    ## 4 Brazil      2000  80488/174504898
    ## 5 China       1999  212258/1272915272
    ## 6 China       2000  213766/1280428583
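
The effect of sep in unite() can be checked on a tiny example. The table and the column name full_year are hypothetical.

```r
library(tidyr)
library(dplyr)

# Hypothetical table with the year split in two pieces
parts <- tibble(century = c("19", "20"), year = c("99", "00"))

# Default separator is "_"
with_default <- unite(parts, full_year, century, year)

# An empty separator glues the pieces together directly
glued <- unite(parts, full_year, century, year, sep = "")

glued
```

The united column is a character vector; if a number is needed, as.integer() can be applied afterwards inside a mutate().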

Exercise
- look at (messy) Eurostat data on GDP and tidy it
1 / 44 How to Wrangle Data using R with tidyr and dplyr Ken Butler March 30, 2015 It is said that... 2 / 44 80% of data analysis: getting the data into the right form maybe 20% is making graphs, fitting
More informationThe Data Journalist Chapter 7 tutorial Geocoding in ArcGIS Desktop
The Data Journalist Chapter 7 tutorial Geocoding in ArcGIS Desktop Summary: In many cases, online geocoding services are all you will need to convert addresses and other location data into geographic data.
More informationIntroduction to Stata - Session 1
Introduction to Stata - Session 1 Simon, Hong based on Andrea Papini ECON 3150/4150, UiO January 15, 2018 1 / 33 Preparation Before we start Sit in teams of two Download the file auto.dta from the course
More informationBusiness Analytics Nanodegree Syllabus
Business Analytics Nanodegree Syllabus Master data fundamentals applicable to any industry Before You Start There are no prerequisites for this program, aside from basic computer skills. You should be
More informationHOW TO EXPORT BUYER NAMES & ADDRESSES FROM PAYPAL TO A CSV FILE
HOW TO EXPORT BUYER NAMES & ADDRESSES FROM PAYPAL TO A CSV FILE If your buyers use PayPal to pay for their purchases, you can quickly export all names and addresses to a type of spreadsheet known as a
More informationWorkshop. Import Workshop
Import Overview This workshop will help participants understand the tools and techniques used in importing a variety of different types of data. It will also showcase a couple of the new import features
More informationLecture 12: Data carpentry with tidyverse
http://127.0.0.1:8000/.html Lecture 12: Data carpentry with tidyverse STAT598z: Intro. to computing for statistics Vinayak Rao Department of Statistics, Purdue University options(repr.plot.width=5, repr.plot.height=3)
More informationA framework for data-related skills
Setting the stage for data science: integration of data management skills in introductory and second courses in statistics Nicholas J. Horton, Benjamin S. Baumer, and Hadley Wickham March 25, 2015 Statistics
More informationIntroducing R/Tidyverse to Clinical Statistical Programming
Introducing R/Tidyverse to Clinical Statistical Programming MBSW 2018 Freeman Wang, @freestatman 2018-05-15 Slides available at https://bit.ly/2knkalu Where are my biases Biomarker Statistician Genomic
More informationGetting Our Feet Wet with Stata SESSION TWO Fall, 2018
Getting Our Feet Wet with Stata SESSION TWO Fall, 2018 Instructor: Cathy Zimmer 962-0516, cathy_zimmer@unc.edu 1) REMINDER BRING FLASH DRIVES! 2) QUESTIONS ON EXERCISES? 3) WHAT IS Stata SYNTAX? a) A set
More informationIntroduction to Stata: An In-class Tutorial
Introduction to Stata: An I. The Basics - Stata is a command-driven statistical software program. In other words, you type in a command, and Stata executes it. You can use the drop-down menus to avoid
More informationA whirlwind introduction to using R for your research
A whirlwind introduction to using R for your research Jeremy Chacón 1 Outline 1. Why use R? 2. The R-Studio work environment 3. The mock experimental analysis: 1. Writing and running code 2. Getting data
More informationCS130 Software Tools. Fall 2010 Intro to SPSS and Data Handling
Software Tools Intro to SPSS and Data Handling 1 Types of Analyses When doing data analysis, we are interested in two types of summaries: Statistical Summaries (e.g. descriptive, hypothesis testing) Visual
More informationIntroduction to Functions. Biostatistics
Introduction to Functions Biostatistics 140.776 Functions The development of a functions in R represents the next level of R programming, beyond writing code at the console or in a script. 1. Code 2. Functions
More informationLecture 1: MATLAB - advanced use cases
Lecture 1: MATLAB - advanced use cases Data handling and analysis Juha Kuortti and Heikki Apiola February 10, 2018 Aalto University juha.kuortti@aalto.fi Importing and exporting data: basics Creating and
More informationLAB #1: DESCRIPTIVE STATISTICS WITH R
NAVAL POSTGRADUATE SCHOOL LAB #1: DESCRIPTIVE STATISTICS WITH R Statistics (OA3102) Lab #1: Descriptive Statistics with R Goal: Introduce students to various R commands for descriptive statistics. Lab
More informationBarchard Introduction to SPSS Marks
Barchard Introduction to SPSS 22.0 3 Marks Purpose The purpose of this assignment is to introduce you to SPSS, the most commonly used statistical package in the social sciences. You will create a new data
More informationAn Introduction to Tidyverse
An Introduction to Tidyverse Joey Stanley Doctoral Candidate in Linguistics, University of Georgia joeystanley.com Presented at the UGA Willson Center DigiLab Friday, November 10, 2017 This is the third
More informationplot(seq(0,10,1), seq(0,10,1), main = "the Title", xlim=c(1,20), ylim=c(1,20), col="darkblue");
R for Biologists Day 3 Graphing and Making Maps with Your Data Graphing is a pretty convenient use for R, especially in Rstudio. plot() is the most generalized graphing function. If you give it all numeric
More informationWhat is Stata? A programming language to do sta;s;cs Strongly influenced by economists Open source, sort of. An acceptable way to manage data
Introduc)on to Stata Training Workshop on the Commitment to Equity Methodology CEQ Ins;tute, Asian Development Bank, and The Ministry of Finance Dili May-June, 2017 What is Stata? A programming language
More informationFrances Provan i #)# #%'
!"#$%&#& Frances Provan i ##+), &'!#( $& #)# *% #%' & SPSS Versions... 2 Some slide shorthand... 2 Did you know you could... 2 Nice newish graphs... 2 Population Pyramids... 2 Population Pyramids: categories...
More informationMIS 0855 Data Science (Section 006) Fall 2017 In-Class Exercise (Day 18) Finding Bad Data in Excel
MIS 0855 Data Science (Section 006) Fall 2017 In-Class Exercise (Day 18) Finding Bad Data in Excel Objective: Find and fix a data set with incorrect values Learning Outcomes: Use Excel to identify incorrect
More informationHow to import text files to Microsoft Excel 2016:
How to import text files to Microsoft Excel 2016: You would use these directions if you get a delimited text file from a government agency (or some other source). This might be tab-delimited, comma-delimited
More informationExample how not to do it: JMP in a nutshell 1 HR, 17 Apr Subject Gender Condition Turn Reactiontime. A1 male filler
JMP in a nutshell 1 HR, 17 Apr 2018 The software JMP Pro 14 is installed on the Macs of the Phonetics Institute. Private versions can be bought from
More informationCode Plug Management: Contact List Import/Export. Version 1.0, Dec 16, 2015
Code Plug Management: Contact List Import/Export Version 1.0, Dec 16, 2015 Background This presentation will show how to update and maintain contact lists in the CS750 The following applications will be
More informationReference Guide. Adding a Generic File Store - Importing From a Local or Network ShipWorks Page 1 of 21
Reference Guide Adding a Generic File Store - Importing From a Local or Network Folder Page 1 of 21 Adding a Generic File Store TABLE OF CONTENTS Background First Things First The Process Creating the
More informationCS130/230 Lecture 6 Introduction to StatView
Thursday, January 15, 2004 Intro to StatView CS130/230 Lecture 6 Introduction to StatView StatView is a statistical analysis program that allows: o Data management in a spreadsheet-like format o Graphs
More informationExploratory data analysis
Lecture 4 STATS/CME 195 Matteo Sesia Stanford University Spring 2018 Contents Exploratory data analysis Exploratory data analysis What is exploratory data analysis (EDA) In this lecture we discuss how
More informationSTAT 1291: Data Science
STAT 1291: Data Science Lecture 18 - Statistical modeling II: Machine learning Sungkyu Jung Where are we? data visualization data wrangling professional ethics statistical foundation Statistical modeling:
More informationThis lab will introduce you to MySQL. Begin by logging into the class web server via SSH Secure Shell Client
Lab 2.0 - MySQL CISC3140, Fall 2011 DUE: Oct. 6th (Part 1 only) Part 1 1. Getting started This lab will introduce you to MySQL. Begin by logging into the class web server via SSH Secure Shell Client host
More informationToday Function. Note: If you want to retrieve the date and time that the computer is set to, use the =NOW() function.
Today Function The today function: =TODAY() It has no arguments, and returns the date that the computer is set to. It is volatile, so if you save it and reopen the file one month later the new, updated
More informationAn Introductory Guide to SpecTRM
An Introductory Guide to SpecTRM SpecTRM (pronounced spectrum and standing for Specification Tools and Requirements Methodology) is a toolset to support the specification and development of safe systems
More informationTIPS AND TRICKS: IMPROVE EFFICIENCY TO YOUR SAS PROGRAMMING
TIPS AND TRICKS: IMPROVE EFFICIENCY TO YOUR SAS PROGRAMMING Guillaume Colley, Lead Data Analyst, BCCFE Page 1 Contents Customized SAS Session Run system options as SAS starts Labels management Shortcut
More informationDATA SCIENCE AND MACHINE LEARNING
DATA SCIENCE AND MACHINE LEARNING Introduction to Data Tables Associate Professor in Applied Statistics, Department of Mathematics, School of Applied Mathematical & Physical Sciences, National Technical
More information1. Open the New American FactFinder using this link:
Exercises for Mapping and Using US Census Data MIT GIS Services, IAP 2012 More information, including a comparison of tools available through the MIT Libraries, can be found at: http://libraries.mit.edu/guides/types/census/tools-overview.html
More informationThese are notes for the third lecture; if statements and loops.
These are notes for the third lecture; if statements and loops. 1 Yeah, this is going to be the second slide in a lot of lectures. 2 - Dominant language for desktop application development - Most modern
More informationTutorial 4 - Attribute data in ArcGIS
Tutorial 4 - Attribute data in ArcGIS COPY the Lab4 archive to your server folder and unpack it. The objectives of this tutorial include: Understand how ArcGIS stores and structures attribute data Learn
More informationLecture 1: Overview
15-150 Lecture 1: Overview Lecture by Stefan Muller May 21, 2018 Welcome to 15-150! Today s lecture was an overview that showed the highlights of everything you re learning this semester, which also meant
More informationExcel Functions & Tables
Excel Functions & Tables Winter 2012 Winter 2012 CS130 - Excel Functions & Tables 1 Review of Functions Quick Mathematics Review As it turns out, some of the most important mathematics for this course
More informationMiniBase Workbook. Schoolwires Centricity2
MiniBase Workbook Schoolwires Centricity2 Table of Contents Introduction... 1 Create a New MiniBase... 2 Add Records to the MiniBase:... 3 Add Records One at a Time... 3 Import Records:... 4 Deploy the
More informationImporting data sets in R
Importing data sets in R R can import and export different types of data sets including csv files text files excel files access database STATA data SPSS data shape files audio files image files and many
More informationIntroduction to Stata Getting Data into Stata. 1. Enter Data: Create a New Data Set in Stata...
Introduction to Stata 2016-17 02. Getting Data into Stata 1. Enter Data: Create a New Data Set in Stata.... 2. Enter Data: How to Import an Excel Data Set.... 3. Import a Stata Data Set Directly from the
More informationIncident Response Programming with R. Eric Zielinski Sr. Consultant, Nationwide
Incident Response Programming with R Eric Zielinski Sr. Consultant, Nationwide About Me? Cyber Defender for Nationwide Over 15 years in Information Security Speaker at various conferences FIRST, CEIC,
More informationUsing the CRM Pivot Tables
Using the CRM Pivot Tables Pivot tables have now been added to your CRM system: we hope that these will provide you with an easy way to produce charts and graphs straight from your CRM, using the most
More informationImporting vehicles into Glass-Net from ANY Dealer Management System
Importing vehicles into Glass-Net from ANY Dealer Management System Step 1: Preparing your DMS CSV Ensure you download a Comma Separated Values (CSV) file from your Dealer Management System (DMS) and save
More informationUsing vletter Handwriting Software with Mail Merge in Word 2007
Using vletter Handwriting Software with Mail Merge in Word 2007 Q: What is Mail Merge? A: The Mail Merge feature in Microsoft Word allows you to merge an address file with a form letter in order to generate
More informationExcel Functions & Tables
Excel Functions & Tables Fall 2014 Fall 2014 CS130 - Excel Functions & Tables 1 Review of Functions Quick Mathematics Review As it turns out, some of the most important mathematics for this course revolves
More informationEarthquake data in geonet.org.nz
Earthquake data in geonet.org.nz There is are large gaps in the 2012 and 2013 data, so let s not use it. Instead we ll use a previous year. Go to http://http://quakesearch.geonet.org.nz/ At the screen,
More informationAnalysis and visualization with v isone
Analysis and visualization with v isone Jürgen Lerner University of Konstanz Egoredes Summerschool Barcelona, 21. 25. June, 2010 About v isone. Visone is the Italian word for mink. In Spanish visón. visone
More informationSPSS TRAINING SPSS VIEWS
SPSS TRAINING SPSS VIEWS Dataset Data file Data View o Full data set, structured same as excel (variable = column name, row = record) Variable View o Provides details for each variable (column in Data
More information