R Basics / Course Business We ll be using a sample dataset in class today: CourseWeb: Course Documents " Sample Data " Week 2 Can download to your computer before class CourseWeb survey on research/stats background still available. Thanks to everyone who s responded
R Basics
R Basics
R Basics R commands & functions Reading in data Saving R scripts Descriptive statistics Subsetting data Assigning new values Referring to specific cells Types & type conversion NA values Getting help
R Commands Simplest way to interact with R is by typing in commands at the > prompt: R STUDIO R
R as a Calculator Typing in a simple calculation shows us the result: 608 + 28 What s 11527 minus 283? Some more examples: 400 / 65 (division) 2 * 4 (multiplication) 5 ^ 2 (exponentiation)
Functions More complex calculations can be done with functions: sqrt(64) What the function is (square root) Can often read these left to right ( square root of 64 ) What do you think this means? abs(-7) In parenthesis: What we want to perform the function on
Arguments Some functions have settings ( arguments ) that we can adjust: round(3.14) - Rounds off to the nearest integer (zero decimal places) round(3.14, digits=1) - One decimal place
Nested Functions
Nested Functions We can use multiple functions in a row, one inside another - sqrt(abs(-16)) - Square root of the absolute value of -16 Don't get scared when you see multiple parentheses - Can often just read left to right - R first figures out the thing nested in the middle Can you round off the square root of 7?
Using Multiple Numbers at Once When we want to use multiple numbers, we concatenate them c(2,6,16) - A list of the numbers 2, 6, and 16 Sometimes a computation requires multiple numbers - mean(c(2,6,16)) Also a quick way to do the same thing to multiple different numbers: - sqrt(c(16,100,144))
R Basics R commands & functions Reading in data Saving R scripts Descriptive statistics Subsetting data Assigning new values Referring to specific cells Types & type conversion NA values Getting help
Course Documents: Sample Data: Week 2 Reading plausible versus implausible sentences Scott chopped the carrots with a knife. Scott chopped the carrots with a spoon. Measure reading time on final word Note: Simulated data; not a real experiment.
Course Documents: Sample Data: Week 2 Reading plausible versus implausible sentences Reading time on critical word 36 subjects Each subject sees 30 items (sentences) half plausible, half implausible Interested in changes over time, so we ll track serial position (trial 1 vs trial 2 vs trial 3 )
Reading in Data Make sure you have the dataset at this point if you want to follow along: Course Documents " Sample Data " Week 2
Reading in Data RStudio Navigate to the folder in lower-right More -> Set as Working Directory Open a comma-separated value file: - experiment <-read.csv('week2.csv') Name of the dataframe we re creating (whatever we want to call this dataset) read.csv is the function name File name
Reading in Data Regular R Read in a comma-separated value file: - experiment <- read.csv('/users/ scottfraundorf/desktop/week2.csv') Name of the dataframe we re creating (whatever we want to call this dataset) read.csv is the function name Folder & file name Drag & drop the file into R to get the full folder & filename
Looking at the Data: Summary A big picture of the dataset: summary(experiment) summary() is a very important function Basic info & descriptive statistics Check to make sure the data are correct
Looking at the Data: Summary A big picture of the dataset: summary(experiment) We can use $ to refer to a specific column/variable in our dataset: summary(experiment$itemname)
Looking at the Data: Raw Data Let s look at the data experiment Ack That s too much How about just a few rows? head(experiment) head(experiment, n=10)
Reading in Data: Other Formats Excel: - library(gdata) - experiment <- read.xls('/users/ scottfraundorf/desktop/week2.xls') SPSS: - library(foreign) - experiment <- read.spss('/users/ scottfraundorf/desktop/ week2.spss', to.data.frame=true)
R Basics R commands & functions Reading in data Saving R scripts Descriptive statistics Subsetting data Assigning new values Referring to specific cells Types & type conversion NA values Getting help
R Scripts Save & reuse commands with a script R File -> New Document R STUDIO
R Scripts Run commands without typing them all again R Studio: Code -> Run Region -> Run All: Run entire script Code -> Run Line(s): Run just what you ve highlighted/selected R: - Highlight the section of script you want to run - Edit -> Execute Keyboard shortcut for this: - Ctrl+Enter (PC), +Enter (Mac)
R Scripts Saves times when re-running analyses Other advantages? Some: - Documentation for yourself - Documentation for others - Reuse with new analyses/experiments - Quicker to run can automatically perform one analysis after another
R Scripts Comments Add # before a line to make it a comment - Not commands to R, just notes to self (or other readers) Can also add a # to make the rest of a line a comment summary(experiment$subject) #awesome
R Basics R commands & functions Reading in data Saving R scripts Descriptive statistics Subsetting data Assigning new values Referring to specific cells Types & type conversion NA values Getting help
Descriptive Statistics Remember how we referred to a particular variable in a dataframe? - $ Combine that with functions: - mean(experiment$rt) - median(experiment$rt) - sd(experiment$rt) Or, for a categorical variable: - levels(experiment$itemname) - summary(experiment$subject)
Descriptive Statistics We often want to look at a dependent variable as a function of some independent variable(s) - tapply(experiment$rt, experiment $Condition, mean) - Split up the RTs by Condition, then get the mean Try getting the mean RT for each item How about the median RT for each subject? To combine multiple results into one table, column bind them with cbind(): cbind( tapply(experiment$rt, experiment$condition, mean), tapply(experiment$rt, experiment$condition, sd) )
Descriptive Statistics Can have 2-way tables... - tapply(experiment$rt, list(experiment$subject, experiment$condition), mean) - 1 st variable is rows, 2 nd is columns...or more - tapply(experiment$rt, list(experiment$itemname, experiment$condition, experiment $TestingRoom), mean)
Descriptive Statistics Contingency tables for categorical variables: - xtabs (~ Subject + Condition, data=experiment)
R Basics R commands & functions Reading in data Saving R scripts Descriptive statistics Subsetting data Assigning new values Referring to specific cells Types & type conversion NA values Getting help
Subsetting Data Often, we want to examine or use just part of a dataframe Remember how we read our dataframe? - experiment <- read.csv(...) Create a new dataframe that's just a subset of experiment: - experiment.longrtsremoved <- subset(experiment, RT < 2000) New dataframe name Original dataframe Inclusion criterion: RT less than 2000 ms
Subsetting Data: Logical Operators Try getting just the observations with RTs 200 ms or more: - experiment.shortrtsremoved <- subset(experiment, RT >= 200) Why not just delete the bad RTs from the spreadsheet? Easy to make a mistake / miss some of them Faster to have the computer do it We d lose the original data No documentation of how we subsetted the data
Subsetting Data: AND and OR What if we wanted only RTs between 200 and 2000 ms? - Could do two steps: - experiment.temp <- subset(experiment, RT >= 200) - experiment.badrtsremoved <- subset(experiment.temp, RT <= 2000) One step with & for AND: - experiment2 <- subset(experiment, RT >= 200 & RT <= 2000)
Subsetting Data: AND and OR What if we wanted only RTs between 200 and 2000 ms? One step with & for AND: - experiment2 <- subset(experiment, RT >= 200 & RT <= 2000) means OR: - experiment.badrts <- subset(experiment, RT < 200 RT > 2000) - Logical OR ( either or both )
Subsetting Data: == and = Get a match / equals: - experiment.firsttrials <- subset(experiment, SerialPosition == 1) Words/categorical variables need quotes: - experiment.implausiblesentences <- subset(experiment, Condition=='Implausible') = means not equal to : - experiment.badsubjectremoved <- subset(experiment, Subject = 'S23') Note DOUBLE equals sign Drops subject S23
Subsetting Data: %in% Sometimes our inclusion criteria aren't so mathematical Suppose I just want the Ducks, Hornet, and Panther items We can check against any arbitrary list: - experiment.specialitems <- subset(experiment, ItemName %in% c('ducks', 'Hornet', 'Panther')) Or, keep just things that aren't in a list: - experiment.nonnativespeakersremoved <- subset(experiment, Subject %in% c('s10', 'S23') == FALSE)
Logical Operators Review Summary - > Greater than - >= Greater than or equal to - < Less than - <= Less than or equal to - & AND - OR - == Equal to - = Not equal to - %in% Is this included in a list?
R Basics R commands & functions Reading in data Saving R scripts Descriptive statistics Subsetting data Assigning new values Referring to specific cells Types & type conversion NA values Getting help
Assignment Remember the pointing arrow used to create dataframes and subsets? - e.g., experiment <- read.csv(...) This is the assignment operator. It saves results or values in a variable - x <- sqrt(64) - CriticalTrialsPerSubject <- 30 - Remember, typing a name by itself shows you the current value: - CriticalTrialsPerSubject Assigning a new value overwrites the old
Assignment We can use this to create new columns in our dataframe: - experiment$experimentnumber <- 1 - Here, the same number (1) is assigned to every trial Or, compute a value for each row: - experiment$logserialposition <- log(experiment$serialposition) - For each trial, takes the log serial position and saves that into LogSerialPosition - Similar to an Excel formula
ifelse() IF YOU WANT DESSERT, EAT YOUR PEAS OR ELSE
ifelse() ifelse(): Use a test to decide which of two values to assign: experiment$half <- ifelse( experiment$serialposition <= 15, 1, 2) Function name Half is 1 If it s NOT, Half is 2 If serial position IS <= 15 Possible to nest ifelse() if we need more than 2 categories: experiment$third <- ifelse(experiment$serialposition <= 10, 1, ifelse(experiment $SerialPosition <= 20, 2, 3))
ifelse() Instead of specific numbers, can use other columns or a formula: experiment$rt.fixed <- ifelse( experiment$testingroom==2, experiment$rt + 100, experiment$rt) How can we check if this worked? head(experiment) Fixed RTs are indeed 100 ms longer for TestingRoom 2
Which do you like better? - experiment$half <-ifelse(experiment $SerialPosition <= 15, 1, 2) - Shorter & faster to write vs: - CriticalTrialsPerSubject <- 30 - experiment$half <- ifelse(experiment $SerialPosition <= (CriticalTrialsPerSubject / 2), 1, 2) - Explains where the 15 comes from helpful if we come back to this script later - We can also refer to CriticalTrialsPerSubject variable later in the script & this ensure it s consistent - Easy to update if we change the number of critical trials
R Basics R commands & functions Reading in data Saving R scripts Descriptive statistics Subsetting data Assigning new values Referring to specific cells Types & type conversion NA values Getting help
Referring to Specific Cells So far, we ve seen how to - Create a new dataframe that s a subset of an existing dataframe - Modify a dataframe by creating an entire column What if we want to modify a dataframe by adjusting some existing values? e.g., replace all RTs above 2000 ms with the number 2000 ( fencing ) Creating a new subset won t work because we want to change the original dataframe Need a way to edit specific values
Referring to Specific Cells Use square brackets [ ] to refer to specific entries in a dataframe: - Row, column - experiment[3,7] Omit the row or column number to mean all rows or all columns: - experiment[3,] - experiment[,4] Row 3, all columns All rows in column 4 Can also use column names: experiment[,'rt'] All rows in the RT column Remember c()? We can check multiple rows: - experiment[c(1:4),]
Logical Indexing We can look at rows or columns that meet a specific criterion... - experiment[experiment$rt < 200,] Can use this as another way to subset: - experiment.shortrtsremoved <- experiment[experiment$rt > 200, ] - Actually, subset() just does this But we can also set values this way - experiment[experiment$rt < 200, 'RT'] <- 200 - In the dataframe experiment, find the rows where RT < 200, and set the column RT to 200
R Basics R commands & functions Reading in data Saving R scripts Descriptive statistics Subsetting data Assigning new values Referring to specific cells Types & type conversion NA values Getting help
Types R treats continuous & categorical variables differently: These are different data types: - Numeric - Factor: Variable w/ fixed set of categories (e.g., treatment vs. placebo) - Character: Freely entered text (e.g., open response question)
Types R's heuristic when reading in data: - Letters anywhere in the column factor - No letters, purely numbers numeric
Type Conversion: Numeric Factor Sometimes we need to correct this - Room 4 is not twice as much Room 2 Create a new column that's the factor (categorical) version of TestingRoom: - experiment$room.factor <- as.factor(experiment$testingroom) Or, just overwrite the old column: - experiment$testingroom <- as.factor(experiment$testingroom)
Conversion: Character Factor When ifelse() results in words, R creates a character variable rather than a factor - Need to convert it Wrong: - experiment$faveroom <- ifelse(experiment$testingroom==3, 'My favorite room', 'Not favorite') Right: - experiment$faveroom <- as.factor(ifelse(experiment $TestingRoom== 3, 'My favorite room', 'Not favorite'))
Type Conversion: Factor Numeric To change a factor to a number, need to turn it into a character first: - experiment$age.numeric <- as.numeric(as.character(experiment$ Age))
R Basics R commands & functions Reading in data Saving R scripts Descriptive statistics Subsetting data Assigning new values Referring to specific cells Types & type conversion NA values Getting help
NA We might have run into some problems trying to change Age into a numerical variable... NA means not available... - Characters that don't convert to numbers - Missing data in a spreadsheet - Invalid computations
NA If we try to do computations on a set of numbers where any of them is NA, we get NA as a result... - sd(experiment$age.numeric) R wants you to think about how you want to treat these missing values
NA Solutions To ignore the NAs when doing a specific computation, use na.rm=true: - mean(experiment$age.numeric,na.rm=true) To get a copy of the dataframe that excludes all rows with an NA (in any column): - experiment.nonas <- na.omit(experiment) Change NAs to something else with logical indexing: - experiment[is.na(experiment$ Age.Numeric)==TRUE, ]$Age.Numeric <- 23
R Basics R commands & functions Reading in data Saving R scripts Descriptive statistics Subsetting data Assigning new values Referring to specific cells Types & type conversion NA values Getting help
Getting Help Get help on a specific known function: -?sqrt -?write.csv Try to find a function on a particular topic: -??logarithm - External resources: - ling-r-lang-l < https://mailman.ucsd.edu/mailman/listinfo/ling-r-lang-l > - Mailing list for using R in language research
Wrap-Up Can use R for: - Reading in data - Descriptive statistics - Subsetting data - Creating new variables Additional practice: - Baayen (2008) p. 20 - Use commands on pp. 1-2 to get the data Next week: Mixed effects models Baayen, Davidson, & Bates (2008): Introduced mixed effects models to cog psych. Covers fixed and random effects