R. Muralikrishnan Max Planck Institute for Empirical Aesthetics Frankfurt. 08 June 2017


1 R R. Muralikrishnan Max Planck Institute for Empirical Aesthetics Frankfurt 08 June 2017

2 Introduction

3 What is R?!
R is a programming language for statistical computing and graphics
R is free and open-source software, available on Linux, Windows and OS X
R implements a wide variety of statistical and graphical techniques
User-created packages vastly extend and enhance the capabilities of R

4 Is it useful for me?
If you are going to run an empirical study and would like to
run the statistical analysis independently and
make some beautiful & meaningful graphical depictions of your data,
or if you would like to do some text corpus analysis,
well then: yes, very much!

5 Is it difficult???
Well, let's say the learning curve in the beginning is a bit steep ;-)
But rest assured that it is every bit worth it.
R-related questions / problems???
Most probably someone else has already looked for answers online,
which means the answers are in most cases just an online search away!

6 Installing R
Download and install for free from
Core packages, functions and the R console are installed by default
R commands can then be issued via the text-based R console
Additional packages can be installed on the fly as and when necessary
If you like, you could learn the very basics of R even before installing it: Try-R, a browser-based basic R tutorial

7 Installing RStudio (optional, but recommended)
RStudio is one of the development environments available for R
Download and install RStudio from
Using RStudio, you can use the full capability of R, plus design web apps or even create presentations of this sort

8 Alright, I've installed things, what next?
You're all set to explore, visualise and analyse your data and more.
Just a couple of things to know before starting the R console:
1. If you would like to install a package that is not already installed:
   In RStudio: Tools -> Install Packages
   In R console: Enter install.packages("packagename")
2. Set the working directory to the one in which you have your data files:
   In RStudio: Session -> Set Working Directory -> Choose Directory
   In R console: Use the command setwd("path/to/directory")
3. And any time you have questions about a certain R function:
   Type help(functionname) to read the documentation
   Type example(functionname) to see a usage example

9 R Basics

10 First steps
The > R prompt indicates R is ready to receive & interpret commands
We can now type commands into the R console
>
## [1] 3
> bla <- 100 # Assign the value 100 to the variable / object bla
> # <- is the assignment operator in R
>
> bla # Now, just typing bla would print its value
## [1] 100
> Bla # Bla is not the same as bla! Everything is case sensitive!
> # So this simply returns an error message.

11 More examples
> var_a <- 1
> var_a
## [1] 1
> var_b <- 2
> var_b
## [1] 2
> var_c <- var_a + var_b
> var_c
## [1] 3

12 Objects and data types
Objects / variables are simply handles or names for different kinds of data
Object names may be alphanumeric, but must begin with a letter
No spaces allowed in object names!
Every object is of a certain data type
> var_a <- 1
> typeof(var_a) # typeof(xyz) : returns the type of the object xyz
## [1] "double"
> var_vector <- c(1,2,3) # c() : combines multiple objects
> # of the same type into a vector
> var_name <- "Mr. Bean" # Notice the " "???
> # ==> character string object
> typeof(var_name)
## [1] "character"

13 Let's type
> var_name <- "Mr. Bean"
> var_age <- 45
And now type
> var_x <- var_name + var_age
What is the output you get?

14 Let's try
> var_y <- c(var_name, var_age)
> var_y
## [1] "Mr. Bean" "45"
> typeof(var_y)
## [1] "character"
Why?
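
One way to answer the "Why?" is coercion: c() builds an atomic vector, and all elements of an atomic vector share a single type, so the number is converted to a character string. The following is a minimal sketch; var_list and age_again are illustrative names that do not appear on the slides.

```r
var_name <- "Mr. Bean"
var_age  <- 45

# c() coerces mixed input to the most flexible type (here: character)
var_y <- c(var_name, var_age)
typeof(var_y)                 # "character"

# A list keeps each element's own type
var_list <- list(name = var_name, age = var_age)
typeof(var_list$age)          # "double"

# Convert the character string back to a number explicitly
age_again <- as.numeric(var_y[2])
age_again + 1                 # 46
```

If values of different types need to travel together, a list (or a data frame, introduced a few slides later) is usually the better container.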

15 Why is the type of an object important?
Associating every object with a data type ensures data integrity, because only functions appropriate for that data type can be applied to it.
Refer: Quick-R Data Types

16 Scalars
Scalars are nothing but singleton values
> # Singleton values            # typeof(.) returns the following
> var_numeric_int <- 1          # 'double'!
>
> var_numeric_double <- 1.0     # 'double'
>
> var_char_string <- "A1"       # 'character'
>
> var_logical_tf <- TRUE        # 'logical'
> # Not a character string! No " ", see?
> var_logical_notavailable <- NA # 'logical'!

17 Vectors
Vectors are 1D arrays
The elements of a vector must all be of the identical type!
> var_vector_numeric <- c(1,2,3)
> var_vector_char <- c("a","b","1")
> var_vector_logical <- c(TRUE, FALSE) # Notice the absence of ""?

18 Matrices and Data Frames
Matrices are 2D arrays
All columns of a matrix must be of the identical type and length!
Data frames are more generic than matrices; comparable to Excel tables
Each column can be of any type
Each column is accessible as a vector
Data frames are the most common type of data in R
Refer: Quick-R Data Types
Refer: Quick-R Matrices
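
The difference between the two structures can be sketched with made-up values; the object names m and df and the column names are only illustrative.

```r
# A matrix: every cell has the same type
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)                 # 2 3
typeof(m)              # "integer"

# A data frame: columns may differ in type, but all have the same length
df <- data.frame(
  Subj    = c("S001", "S002", "S003"),
  RT      = c(0.51, 0.62, 0.47),
  Correct = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
df$RT                  # each column is accessible as a vector
sapply(df, typeof)     # one type per column
```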

19 Arithmetic operators
Arithmetic operators work on scalars, vectors and matrices
Also called binary operators in R
Operator    Description
+           addition
-           subtraction
*           multiplication
/           division
**          exponentiation (the circumflex ^ also works)
x %% y      modulus (x mod y); 5 %% 2 is 1
x %/% y     integer division; 5 %/% 2 is 2

20 Logical Operators
Logical operators are for comparing things; they return TRUE / FALSE
Operator    Description
<           less than
<=          less than or equal to
>           greater than
>=          greater than or equal to
==          exactly equal to
!=          not equal to
!x          not x
x && y      short-circuit AND; for single values; used in if checks
x || y      short-circuit OR; for single values; used in if checks
x & y       vectorised AND (applies to all elements in a vector)
x | y       vectorised OR (applies to all elements in a vector)
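
A short sketch of the difference between the vectorised and the short-circuit forms, using made-up values (x, y, value, in_range and safe are illustrative names):

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

x & y      # element-wise AND: TRUE FALSE FALSE
x | y      # element-wise OR:  TRUE TRUE TRUE
!x         # element-wise NOT: FALSE TRUE FALSE

# && and || expect single values and stop evaluating
# as soon as the result is known
value <- 5
in_range <- value > 0 && value < 10        # TRUE

# The right-hand side is never evaluated here, so stop() is not triggered
safe <- FALSE && stop("never evaluated")   # FALSE
```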

21 Loops, condition checks, user-defined functions etc. Base R Cheatsheet base-r.pdf
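
As a small taste of what the cheatsheet covers, here is a sketch of a user-defined function, a condition check and a loop. The function classify_rt, its threshold of 1 and the values in rts are made up for illustration.

```r
# A user-defined function with a condition check
classify_rt <- function(rt, threshold = 1) {
  if (rt > threshold) {
    "slow"
  } else {
    "fast"
  }
}

rts <- c(0.4, 1.3, 0.9)

# A for loop that fills a result vector element by element
labels <- character(length(rts))
for (i in seq_along(rts)) {
  labels[i] <- classify_rt(rts[i])
}
labels     # "fast" "slow" "fast"

# The same result without an explicit loop, using the vectorised ifelse()
labels2 <- ifelse(rts > 1, "slow", "fast")
```

In R, the vectorised form is often shorter and easier to read than the explicit loop.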

22 Data Analysis Workflow

23 Workflow
General workflow: these steps often happen in a repeating cycle
1. Read data into R. Input files can be, among other things:
   an Excel sheet,
   a comma / space / tab separated text file,
   an xml file, or something directly from the web,
   running text such as a corpus
2. Understand the data structure and what you want to do with it
3. Transform data to do what you want
4. Do what you want:
   calculate descriptive statistics
   generate plots
   run various statistical tests
   you name it!
5. Save your R code for later use, say as SomethingMeaningful.R

24 R Scripts
The code we write on the console can be saved as an R Script
In RStudio: File -> New -> R Script opens an editor panel to write & save code
Elsewhere: Simply use any text editor to write & save code
Save the file as, say, SomethingMeaningful.R
To save the output generated by an R script (and not see it on the console), include:
sink("meaningfuloutput.txt") at the beginning of the R script and
sink() at the end of the R script
Now you can source the R Script, meaning execute all the commands in it at once
In RStudio, just click the Source button above the editor panel
In R console, type source("SomethingMeaningful.R")

25 Data formats: Wide-format Data
Each row contains multiple variables of interest for each observation
> WideData
## # A tibble: 4 x 7
## Participant ExpV RT1 RT2 RT3 RT4 YorN
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 S Y
## 2 S N
## 3 S Y
## 4 S Y

26 Data formats: Long-format Data
Each row contains a single variable of interest for a single observation
> LongData
## # A tibble: 16 x 5
## Participant ExpV YorN Trial Measurement
## <chr> <int> <chr> <chr> <dbl>
## 1 S001 1 Y RT
## 2 S001 1 Y RT
## 3 S001 1 Y RT
## 4 S001 1 Y RT
## 5 S002 1 N RT
## 6 S002 1 N RT
## 7 S002 1 N RT
## 8 S002 1 N RT
## 9 S003 2 Y RT
## 10 S003 2 Y RT

27 Which format is best?
In most cases, long-format data is the easiest to work with, because:
each observation is in its own row
each variable is in its own column
Transforming, visualising and analysing long-format data is straightforward
There are R packages to convert between the two formats
Refer: R Cookbook wide to long format and vice versa
We'll learn one of the methods soon.

28 Read data from text files: a basic example
Import tabular text data as a data frame
> BehavData <- read.table("allres.txt")
> # BehavData : typing the data frame name displays the whole df
> head(BehavData) # head() : displays the first few rows of the df
## V1 V2 V3 V4 V5 V6 V7 V8
## 1 NF01 16 FOS F OS C 2
## 2 NF01 25 MOS M OS C 2
## 3 NF01 13 FSO F SO C 2
## 4 NF01 12 MSO M SO C 2
## 5 NF01 4 FOS F OS C 2
## 6 NF01 8 FSO F SO C 2
> # tail(BehavData) # tail() : displays the last few rows of the df
# We won't use this method to import data for very long! We will learn a better method in a bit.

29 Data structure and its dimensions
> str(BehavData) # str() : displays the structure of the object
## 'data.frame': 3478 obs. of 8 variables:
## $ V1: Factor w/ 29 levels "NF01","NF02",..:
## $ V2: int
## $ V3: Factor w/ 4 levels "FOS","FSO","MOS",..:
## $ V4: Factor w/ 2 levels "F","M":
## $ V5: Factor w/ 2 levels "OS","SO":
## $ V6: num
## $ V7: Factor w/ 2 levels "C","X":
## $ V8: int
> dim(BehavData) # dim() : displays the dimensions of the object
## [1] 3478 8

30 Name columns in a data frame
> names(BehavData) <-
+     c("Subj", "Item", "Condition", "WF1", "WF2", "RT",
+       "Accuracy", "Response")
> head(BehavData)
## Subj Item Condition WF1 WF2 RT Accuracy Response
## 1 NF01 16 FOS F OS C 2
## 2 NF01 25 MOS M OS C 2
## 3 NF01 13 FSO F SO C 2
## 4 NF01 12 MSO M SO C 2
## 5 NF01 4 FOS F OS C 2
## 6 NF01 8 FSO F SO C 2
# Again, we won't need this when we learn the better method to import data soon.

31 Access different fields of a data frame
> head(BehavData)
## Subj Item Condition WF1 WF2 RT Accuracy Response
## 1 NF01 16 FOS F OS C 2
## 2 NF01 25 MOS M OS C 2
## 3 NF01 13 FSO F SO C 2
## 4 NF01 12 MSO M SO C 2
## 5 NF01 4 FOS F OS C 2
## 6 NF01 8 FSO F SO C 2
> head(BehavData$RT) # $ : accesses a column by name
## [1]

32 Plots and Statistics

33 Histogram
> library(ggplot2)
> ggplot(BehavData, aes(x = RT)) + geom_histogram(binwidth = 0.2)
[Figure: histogram; x-axis: RT, y-axis: count]

34 Density Plot
> ggplot(BehavData, aes(x = RT)) + geom_density()
[Figure: density plot; x-axis: RT, y-axis: density]

35 Checking for Normality: Q-Q Norm Plot
> ggplot(BehavData) + geom_qq(aes(sample = RT))
[Figure: normal Q-Q plot; x-axis: theoretical, y-axis: sample]

36 Statistical Normality Tests
> # Anderson-Darling normality test
> library(nortest)
> ad.test(BehavData$RT)
##
## Anderson-Darling normality test
##
## data: BehavData$RT
## A = , p-value < 2.2e-16
> # Shapiro-Wilk normality test
> shapiro.test(BehavData$RT)
##
## Shapiro-Wilk normality test
##
## data: BehavData$RT
## W = , p-value < 2.2e-16

37 Statistical Normality Tests
> # Kolmogorov-Smirnov normality test
> ks.test(BehavData$RT, "pnorm")
## Warning in ks.test(BehavData$RT, "pnorm"): ties should not be present for
## the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: BehavData$RT
## D = , p-value < 2.2e-16
## alternative hypothesis: two-sided
Refer: Blog entry on the topic
Refer: Stackexchange page on the topic

38 Mean, Median, Standard Deviation
> # Arithmetic Mean
> mean(BehavData$RT)
## [1]
> # Median
> median(BehavData$RT)
## [1]
> # Standard Deviation
> sd(BehavData$RT)
## [1]
> # Variance = SD^2
> var(BehavData$RT)
## [1]
Refer: Quick-R Descriptive Statistics
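
Since the output values did not survive on this slide, here is a self-contained sketch with made-up reaction times. It also shows how these functions behave when a value is missing; the vector rt is purely illustrative.

```r
rt <- c(0.5, 0.7, 0.6, NA, 0.8)    # made-up values with one missing entry

mean(rt)                           # NA, because missing values propagate

m  <- mean(rt, na.rm = TRUE)       # 0.65
md <- median(rt, na.rm = TRUE)     # 0.65
s  <- sd(rt, na.rm = TRUE)
v  <- var(rt, na.rm = TRUE)

all.equal(v, s^2)                  # TRUE: the variance is the squared SD
```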

39 Aggregating over factors
Calculate mean, sd etc. over specified factor(s): aggregate function
> # Aggregate Variable by a single Factor
> aggregate(Variable ~ Factor, data = XyzData, FUN = mean)
> # FUN = mean => calculate mean;
> # Other possible options: sd, var, length...
>
> # Aggregate Variable by multiple Factors
> aggregate(Variable ~ Factor1 * Factor2, data = XyzData, FUN = mean)
> # The Variable ~ Factors part is referred to as the 'formula'

40 Aggregating over factors
> RT_m_Subj <-
+     aggregate(RT ~ Subj, data = BehavData, FUN = mean, na.rm = TRUE)
> # RT ~ Subj => aggregate RT by the factor Subj
> # na.rm = TRUE => exclude missing values (NA = not available)
> head(RT_m_Subj)
## Subj RT
## 1 NF
## 2 NF
## 3 NF
## 4 NF
## 5 NF
## 6 NF
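
The same pattern can be tried without the original data file. The sketch below uses a made-up data frame named toy whose column names merely mimic those of BehavData.

```r
toy <- data.frame(
  Subj = rep(c("S1", "S2"), each = 4),
  WF1  = rep(c("F", "M"), times = 4),
  RT   = c(0.5, 0.7, 0.6, 0.8, 1.0, 1.2, 1.1, 1.3)
)

# Mean RT per subject: one row per level of Subj
by_subj <- aggregate(RT ~ Subj, data = toy, FUN = mean)
by_subj

# Mean RT per subject and per level of WF1: one row per combination
by_subj_wf1 <- aggregate(RT ~ Subj * WF1, data = toy, FUN = mean)
by_subj_wf1
```

With the formula interface, aggregate() drops rows containing NA by default (na.action = na.omit) before FUN is applied.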

41 ANOVA
> # Repeated Measures ANOVA : Reaction Time -- Analysis by SUBJECTS
> # To test if the SUBJECTS differ significantly between each other
>
> # First calculate a mean per subject per condition.
> RT_m_Subj_WF1_WF2 <- aggregate(RT ~ Subj * WF1 * WF2,
+     data = BehavData, FUN = mean, na.rm = TRUE)
> # Run the ANOVA
> RT_aov_Subj <- aov(RT ~ WF1 * WF2 + Error(Subj/(WF1*WF2)),
+     data = RT_m_Subj_WF1_WF2)
>
> print(summary(RT_aov_Subj))

42 ANOVA
> # Repeated Measures ANOVA : Reaction Time -- Analysis by ITEMS
> # To test if the ITEMS differ significantly between each other
> BehavData$Item <- as.factor(BehavData$Item)
> # First calculate a mean per item per condition.
> RT_m_Item_WF1_WF2 <- aggregate(RT ~ Item * WF1 * WF2,
+     data = BehavData, FUN = mean,
+     na.rm = TRUE)
> # Run the ANOVA
> RT_aov_Item <- aov(RT ~ WF1 * WF2 + Error(Item/(WF1*WF2)),
+     data = RT_m_Item_WF1_WF2)
>
> print(summary(RT_aov_Item))

43 Correlations, t-tests, An exhaustive list of statistical tests

46 Good to know
Many ways to do the same thing
Many common tasks can be accomplished in more than one way in R
This is both appealing and frustrating, depending on the context
Hmmm
This raises the question: wouldn't it be lovely if there were a way to do most of the common tasks in a consistent manner???
Enter The Tidyverse

47 The Tidyverse

48 The Tidyverse
A collection of R packages that share common philosophies and are designed to work together (tidyverse.org)
Goal : Solve complex problems by combining simple, uniform pieces!
Package Design (see Data Science in tidyverse: Hadley Wickham)
One function = one task
Input and output of every function is a tidy dataframe (= tibble)
Consequence: tidyverse functions are pipeable!
> install.packages("tidyverse") # Installs the tidyverse collection
# Curious what pipeable means??? Wait a bit more to know :-)

49 The Tidyverse
> library(tidyverse) # Loads the core tidyverse packages
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages
## filter(): dplyr, stats
## lag(): dplyr, stats
> library(readxl) # Other tidyverse packages loaded when needed

50 Tidy Data
Each variable is a column, each observation / case is a row!
See Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1-23.
> LongData
## # A tibble: 16 x 5
## Participant ExpV YorN Trial Measurement
## <chr> <int> <chr> <chr> <dbl>
## 1 S001 1 Y RT
## 2 S001 1 Y RT
## 3 S001 1 Y RT
## 4 S001 1 Y RT
## 5 S002 1 N RT
## 6 S002 1 N RT
## 7 S002 1 N RT
## 8 S002 1 N RT

51 Read data from text files: readr::read_delim
> library(tidyverse) # This also loads readr, among other packages!
> # For comma separated file with header row present in the input file
> ExpData <- read_delim("filename.csv", delim = ",", col_names = TRUE)
> # delim => delimiter, i.e., the column separator in the input
>
> # For tab separated file with no header row present in the input
> ExpData <- read_delim("filename.txt", delim = "\t",
+     col_names = c("Subject", "Task", "RT"))
> # We provide meaningful column names in the command
>
> # For space separated file:
> ExpData <- read_delim("filename.xyz", delim = " ", col_names = TRUE)
> # For semicolon separated file:
> ExpData <- read_delim("filename.log", delim = ";",
+     col_names = c("Name", "Age"))

52 Read data from Excel files: readxl::read_excel
> library(readxl)
> # Read a single worksheet (the first by default, if there are multiple worksheets)
> ExpData <- read_excel("filename.xlsx", col_names = TRUE)
> # Read a specific worksheet from the file, by index
> ExpData <- read_excel("filename.xlsx", 3,
+     col_names = c("Name", "Age", "RT"))
Attention please!!!
Spaces are bad bad bad in filenames, column names and basically any names!
Bad apples: Exp Data.xlsx, Subj ID, bla bla bla etc.
Instead, use: Exp_Data.xlsx, Subj-ID, bla_blabla etc.

53 Read Data : Tidy Example
> BehavDataTidy <-
+     readr::read_delim("allres.txt", delim = " ",
+         col_names = c("Subj", "Item", "Condition", "WF1",
+                       "WF2", "RT", "Accuracy", "Resp"))
## Parsed with column specification:
## cols(
##   Subj = col_character(),
##   Item = col_integer(),
##   Condition = col_character(),
##   WF1 = col_character(),
##   WF2 = col_character(),
##   RT = col_double(),
##   Accuracy = col_character(),
##   Resp = col_integer()
## )

54 Read Data : Tidy Example
> BehavDataTidy
## # A tibble: 3,478 x 8
## Subj Item Condition WF1 WF2 RT Accuracy Resp
## <chr> <int> <chr> <chr> <chr> <dbl> <chr> <int>
## 1 NF01 16 FOS F OS C 2
## 2 NF01 25 MOS M OS C 2
## 3 NF01 13 FSO F SO C 2
## 4 NF01 12 MSO M SO C 2
## 5 NF01 4 FOS F OS C 2
## 6 NF01 8 FSO F SO C 2
## 7 NF01 6 MOS M OS X 1
## 8 NF01 2 FOS F OS C 2
## 9 NF01 9 MSO M SO C 2
## 10 NF01 28 MOS M OS C 2
## # ... with 3,468 more rows
So what is so tidy about it??? Compare with the dataframe created earlier!

55 Tidy tibble: an enhanced data frame
Most non-tidyverse functions that take a data frame work with tibbles
For legacy functions that won't work with a tibble: use as.data.frame()
See:
> mean(BehavData$RT)
## [1]
> mean(BehavDataTidy$RT)
## [1]
> aggregate(RT ~ WF2, data = BehavData, FUN = mean)
## WF2 RT
## 1 OS
## 2 SO
> aggregate(RT ~ WF2, data = BehavDataTidy, FUN = mean)
## WF2 RT
## 1 OS
## 2 SO

56 Should all data be tidy data? Of course not! Other types of non-tidy data have their uses, too. Not every dataset needs to be wrangled into a tidy dataset! Nevertheless, the tidy format works well for most kinds of rectangular data.

57 Data Wrangling and Transformations

58 Why focus on Data Wrangling?
Some form of data transformation is almost always inevitable prior to analysis
This is usually the most time-consuming and error-prone part
The actual statistical analysis is usually only one or two lines of R code
Most analytical functions work best if the data is in a certain format
Efficient data wrangling techniques are thus very important

59 Wide-format to long-format conversion
Use the gather function from the tidyr package of the tidyverse
> library(tidyr)
> gather(WideData, Trial, Measurement, RT1:RT4)
## # A tibble: 16 x 5
## Participant ExpV YorN Trial Measurement
## <chr> <int> <chr> <chr> <dbl>
## 1 S001 1 Y RT
## 2 S002 1 N RT
## 3 S003 2 Y RT
## 4 S004 2 Y RT
## 5 S001 1 Y RT
## 6 S002 1 N RT
## 7 S003 2 Y RT
## 8 S004 2 Y RT
## 9 S001 1 Y RT

60 Long-format to wide-format conversion
Use the spread function from the tidyr package of the tidyverse
> spread(LongData, Trial, Measurement)
## # A tibble: 4 x 7
## Participant ExpV YorN RT1 RT2 RT3 RT4
## * <chr> <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 S001 1 Y
## 2 S002 1 N
## 3 S003 2 Y
## 4 S004 2 Y
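
In tidyr 1.0.0 and later, gather() and spread() still work but are superseded by pivot_longer() and pivot_wider(). The sketch below assumes such a tidyr version is installed; the data frame wide is made up and only mimics the layout of WideData.

```r
library(tidyr)

# Made-up wide data with two RT columns
wide <- data.frame(
  Participant = c("S001", "S002"),
  RT1 = c(0.5, 0.6),
  RT2 = c(0.7, 0.8)
)

# Wide to long: the counterpart of gather()
long <- pivot_longer(wide, cols = RT1:RT2,
                     names_to = "Trial", values_to = "Measurement")

# Long to wide: the counterpart of spread()
wide_again <- pivot_wider(long, names_from = Trial,
                          values_from = Measurement)
```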

61 Summarising data: dplyr::summarise, dplyr::count
> WideData <- readxl::read_excel("widedata.xlsx", 3, col_names = TRUE)
> TidyData <- gather(WideData, Trial, Measurement, RT1:RT4)
> OverallMeanRT <- summarise(TidyData, MeanRT = mean(Measurement))
> OverallMeanRT
## # A tibble: 1 x 1
## MeanRT
## <dbl>
##
> N_of_Measurements <- count(TidyData, Participant)
> N_of_Measurements
## # A tibble: 4 x 2
## Participant n
## <chr> <int>
## 1 S001 4
## 2 S002 4
## 3 S003 4

62 The real power and elegance of tidyverse: pipeable functions
All functions in the tidyverse share a consistent syntax
Therefore the output of one function can be piped to the next function: magrittr::%>%
Piping avoids having to save temporary intermediate variables
Piping results in code that is:
simple and more efficient
linear, reflecting each simple step that contributed to the complex analysis
concise and more legible
less error-prone overall

63 Pipe versus no pipe
> # The more common non-pipe method =================================
> SomeData_1 <- f1(SomeData_0, param1, param2)
> SomeData_2 <- f2(SomeData_1, bla1, bla2, bla3)
> SomeData_3 <- f3(SomeData_2, whatever1)
> Result_1 <- f4(SomeData_3, younameit)
> # Another method: nested calls, read from the inside out ==========
> Result_2 <- f4( f3( f2( f1(SomeData_0, param1, param2),
+     bla1, bla2, bla3), whatever1), younameit)
> # And now with the pipe! ==========================================
> Result_3 <-
+     SomeData_0 %>%
+     f1(param1, param2) %>%
+     f2(bla1, bla2, bla3) %>%
+     f3(whatever1) %>%
+     f4(younameit)
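
The three styles can be compared on a runnable toy case. This sketch uses base R functions in place of the placeholder functions f1 to f4, and assumes the magrittr package is installed; x and the result names are illustrative.

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# Step by step, with intermediate variables
roots   <- sqrt(x)
total   <- sum(roots)
result1 <- round(total, 1)

# Nested calls: read from the inside out
result2 <- round(sum(sqrt(x)), 1)

# Piped: read from top to bottom, one step per line
result3 <- x %>%
  sqrt() %>%
  sum() %>%
  round(1)

c(result1, result2, result3)   # all three are 10
```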

64 Pipe : Example
Non-pipe version
> WideData <- readxl::read_excel("widedata.xlsx", 3, col_names = TRUE)
> TidyData <- gather(WideData, Trial, Measurement, RT1:RT4)
> N_of_Measurements <- count(TidyData, Participant)
Pipe version
> readxl::read_excel("widedata.xlsx", 3, col_names = TRUE) %>%
+     gather(Trial, Measurement, RT1:RT4) %>%
+     count(Participant) -> N_of_Measurements

65 Grouping data by factor(s): dplyr::group_by
> readxl::read_excel("widedata.xlsx", 3, col_names = TRUE) %>%
+     gather(Trial, Measurement, RT1:RT4) %>%
+     group_by(Participant) %>%
+     count(MeanRT = mean(Measurement))
## Source: local data frame [4 x 3]
## Groups: Participant [?]
##
## Participant MeanRT n
## <chr> <dbl> <int>
## 1 S
## 2 S
## 3 S
## 4 S

66 Renaming a column: dplyr::rename
> readxl::read_excel("widedata.xlsx", 3, col_names = TRUE) %>%
+     gather(Trial, Measurement, RT1:RT4) -> LongData
>
> library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
##     set_names
## The following object is masked from 'package:tidyr':
##
##     extract
> LongData %<>% rename(RT = Measurement)
What's that %<>% thing??? And where did <- go??? Do you see the point?

67 Let's take stock a bit
The tidyverse packages share a consistent syntax such that piping is possible
Piping with %>% feeds the LHS to the RHS
The RHS generates an output to feed further or assign or print or plot
Double-piping with %<>% also feeds the LHS to the RHS, but
the RHS generates an output and feeds (= assigns) it back to the LHS!
There's more: %T>% and %$%
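
A brief sketch of the three additional magrittr operators on made-up values, assuming the magrittr package is installed (x, total, df and r are illustrative names):

```r
library(magrittr)

x <- c(4, 9, 16)

# %<>% pipes x into sqrt() and assigns the result back to x
x %<>% sqrt()
x                        # 2 3 4

# %T>% ("tee") runs the right-hand side for its side effect (here: printing),
# then passes the original left-hand side on to the next step
total <- x %T>% print() %>% sum()
total                    # 9

# %$% exposes the columns of a data frame to the right-hand side
df <- data.frame(a = c(1, 2, 3), b = c(2, 4, 6))
r <- df %$% cor(a, b)    # correlation of columns a and b
```

The %<>% operator is compact, but it overwrites its input, so it is worth using deliberately.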

68 and import a new dataset to work further
> IntData <- readxl::read_excel("intensity-data.xlsx", col_names = T)
> IntData
## # A tibble: 420 x 6
## Participant Note NoteType Time Intensity OnsetInterval
## <chr> <dbl> <chr> <dbl> <dbl> <chr>
## 1 S01 1 NA NA
## 2 S01 2 Note_M
## 3 S01 3 Note_S
## 4 S01 4 Note_S
## 5 S01 5 Note_S
## 6 S01 6 Note_S
## 7 S01 7 Note_M
## 8 S01 8 Note_M
## 9 S01 9 Note_M
## 10 S01 10 Note_M
## # ... with 410 more rows

69 Extract columns by name: dplyr::select
> IntData %>% select(Participant, NoteType, Time, Intensity)
## # A tibble: 420 x 4
## Participant NoteType Time Intensity
## <chr> <chr> <dbl> <dbl>
## 1 S01 NA
## 2 S01 Note_M
## 3 S01 Note_S
## 4 S01 Note_S
## 5 S01 Note_S
## 6 S01 Note_S
## 7 S01 Note_M
## 8 S01 Note_M
## 9 S01 Note_M
## 10 S01 Note_M
## # ... with 410 more rows

70 Extract rows that meet certain criteria: dplyr::filter
> IntData %>% filter(OnsetInterval > 0.75 & OnsetInterval < 0.85)
## # A tibble: 5 x 6
## Participant Note NoteType Time Intensity OnsetInterval
## <chr> <dbl> <chr> <dbl> <dbl> <chr>
## 1 S01 16 Note_L
## 2 S07 16 Note_L
## 3 S08 16 Note_L
## 4 S12 16 Note_L
## 5 S14 16 Note_L
Notice the use of single & : this is the vectorised AND operator
Unlike the scalar AND &&, this applies to all the elements of a column!
There's of course the vectorised OR |, as opposed to the scalar OR ||

71 Compute a new column: dplyr::mutate
> IntData %>%
+     select(Participant, Intensity) %>%
+     mutate(SNo = row_number(),
+            GoodBad = if_else(Intensity >= 120, "Good", "Bad"))
## # A tibble: 420 x 4
## Participant Intensity SNo GoodBad
## <chr> <dbl> <int> <chr>
## 1 S Good
## 2 S Bad
## 3 S Bad
## 4 S Good
## 5 S Good
## 6 S Good
## 7 S Good
## 8 S Good
## 9 S Good

72 Compute a new column, drop others: dplyr::transmute
> IntData %>%
+     select(Participant) %>%
+     distinct() %>% # Get rid of duplicate rows
+     transmute(Subject = Participant,
+               NewSubjID = paste("Drummer", row_number() + 100, sep=""))
## # A tibble: 14 x 2
## Subject NewSubjID
## <chr> <chr>
## 1 S01 Drummer101
## 2 S02 Drummer102
## 3 S03 Drummer103
## 4 S04 Drummer104
## 5 S05 Drummer105
## 6 S06 Drummer106
## 7 S07 Drummer107
## 8 S08 Drummer108

73 Exercise 1
Add a new column with the mean of the OnsetInterval.
This mean should be on a per Participant and per NoteType basis!
Before attempting to do this, see if mean(IntData$OnsetInterval) works
Have a very careful look at the output of typing IntData
Do you see a / the problem?

74 Solution : Know your data well
OnsetInterval contains the string NA in some cases
So read_excel assumed that this column is made up of strings!
Type readxl::read_excel("intensity-data.xlsx", col_names = T)
Study what you see on the console
Now type help(read_excel) to see what could be done
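
The effect can be reproduced without the Excel file. The sketch below uses a made-up character vector oi_chr standing in for a column in which missing values were read as the text "NA".

```r
# A column as it might arrive when "NA" is read as ordinary text
oi_chr <- c("0.80", "NA", "0.76")
typeof(oi_chr)                  # "character": mean() cannot work on this

# as.numeric() turns the string "NA" into a real missing value
oi_num <- as.numeric(oi_chr)
is.na(oi_num)                   # FALSE TRUE FALSE

mean(oi_num)                    # NA: missing values propagate
mean(oi_num, na.rm = TRUE)      # 0.78
```

Telling the import function which strings denote missing values avoids the manual conversion, and na.rm = TRUE controls whether missing values are skipped when averaging.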

75 Exercise 1 : Solution
> IntData <- readxl::read_excel("intensity-data.xlsx", col_names = T,
+     na = "NA") # Missing values are coded as the string "NA" in the input!
> IntData %>% select(-Time, -Intensity) %>% # - => drop these columns
+     group_by(Participant, NoteType) %>%
+     mutate(OIMean = mean(OnsetInterval))
## Source: local data frame [420 x 5]
## Groups: Participant, NoteType [56]
##
## Participant Note NoteType OnsetInterval OIMean
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 S01 1 <NA> NA NA
## 2 S01 2 Note_M
## 3 S01 3 Note_S
## 4 S01 4 Note_S
## 5 S01 5 Note_S
## 6 S01 6 Note_S

76 Exercise 2 Add a new column with the name AdjustedTime This should be the Time for the current Note minus the Time for Note 1. This value should be on a per Participant basis! Do you need something specific to solve this???

77 Solution : Extract first value by position: dplyr::first
> IntData %>%
+     group_by(Participant) %>%
+     mutate(TimeBegin = first(Time)) %>%
+     select(Participant, Time, TimeBegin)
## Source: local data frame [420 x 3]
## Groups: Participant [14]
##
## Participant Time TimeBegin
## <chr> <dbl> <dbl>
## 1 S
## 2 S
## 3 S
## 4 S
## 5 S
## 6 S
## 7 S

78 Exercise 2 : Solution
> IntData %>%
+     group_by(Participant) %>%
+     mutate(TimeBegin = first(Time)) %>%
+     select(Participant, Time, TimeBegin) %>%
+     mutate(AdjustedTime = Time - TimeBegin)
## Source: local data frame [420 x 4]
## Groups: Participant [14]
##
## Participant Time TimeBegin AdjustedTime
## <chr> <dbl> <dbl> <dbl>
## 1 S
## 2 S
## 3 S
## 4 S
## 5 S
## 6 S
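
The same idea can be checked on a small made-up data set. This sketch assumes the dplyr package is installed and that Note 1 is the first row of each participant; the data frame toy is illustrative and is not the original IntData.

```r
library(dplyr)

toy <- data.frame(
  Participant = rep(c("S01", "S02"), each = 3),
  Note        = rep(1:3, times = 2),
  Time        = c(10, 12, 15, 20, 21, 25)
)

adjusted <- toy %>%
  group_by(Participant) %>%                    # work within each participant
  mutate(AdjustedTime = Time - first(Time)) %>% # subtract that participant's first Time
  ungroup()

adjusted$AdjustedTime    # 0 2 5 0 1 5
```

If the rows are not already ordered by Note, sorting with arrange(Participant, Note) before grouping would make first(Time) refer to Note 1.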

79 ..

80 Some resources
R Cheatsheets
R for Data Science
Cookbook for R
Graphs with ggplot2
Tidy Text Mining
Quick R
Advanced R

81 Thanks!
> Thanks <- "Thanks for your attention!"
> Thanks
## [1] "Thanks for your attention!"
> # Command to quit from R Console
> q()
