R Workshop Guide

This guide reviews the examples we will cover in today's workshop. It should be a helpful introduction to R; for more detail, see the more extensive R user guide on the ERC website.

1 Some Programming Basics

You should always write code in a script that you can save and modify as necessary. To start a new script, open the File menu, choose New File, and then choose R Script.

It's always a good idea to start by clearing your workspace.

    rm(list=ls(all=TRUE)) # clear all objects in memory

1.1 Writing and executing code in R

Basic calculations

    4
    [1] 4
    "yes"
    [1] "yes"
    2+3
    [1] 5
    1039/49
    [1] 21.20408
    46^700
    [1] Inf
    (3.5+2.7)/(900*2)
    [1] 0.003444444

Assignment operator

    x <- 3
    x
    [1] 3
    y <- "this is a string"
    y
    [1] "this is a string"
    z <- 2
    z

    [1] 2
    x+z
    [1] 5
    x==5 # this is a logical operator
    [1] FALSE
    x
    [1] 3
    x <- TRUE # assign logical values to variables
    x+z # explain this output: the numeric value of TRUE is 1, so 1 + 2
    [1] 3
    # clear your workspace again
    rm(list=ls(all=TRUE))

1.2 Data objects in R

Vectors

The function c() allows you to concatenate multiple items into a vector.

    x <- c(1,2,3,4)
    x
    [1] 1 2 3 4
    x[2]
    [1] 2
    y <- c(5,6,7,8,9)
    y
    [1] 5 6 7 8 9
    y[5]
    [1] 9

You can append one vector to another.

    z <- c(x,y)
    z
    [1] 1 2 3 4 5 6 7 8 9

Another way to produce a vector containing a sequence of integers:

    q <- 1:5
    q
    [1] 1 2 3 4 5

You can repeat vectors multiple times.

    ab <- rep(1:5, times=3)
    ab
    [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
    ab <- rep(1:5, 3) # you do not need the "times" with rep; also, notice that R lets you overwrite
    cd <- rep(c(1,3,7,9), times=2)
    cd
    [1] 1 3 7 9 1 3 7 9
    a <- seq(from=2, to=100, by=2)
    a
     [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38  40  42
    [22]  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76  78  80  82  84
    [43]  86  88  90  92  94  96  98 100

2 Performing Basic Tasks

2.1 Setting up your work space

See the objects currently in memory.

    ls()
    [1] "a" "ab" "cd" "q" "x" "y" "z"

Clear your workspace. Many R users include this as the first line in any script.

    rm(list=ls(all=TRUE))

Working Directory

The working directory is the location on your computer where R will access and save files. You can see your working directory, and you can set your working directory.

    getwd()
    [1] "/Users/patriciakirkland/Dropbox/Empiprical Reasoning Center/R Workshop"
    setwd("/Users/patriciakirkland/Dropbox/Empiprical Reasoning Center/R Workshop")
    getwd() # check again
    [1] "/Users/patriciakirkland/Dropbox/Empiprical Reasoning Center/R Workshop"
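The working-directory calls above can be tried without touching your own folders. The following is a minimal sketch, assuming the session's temporary folder from tempdir() stands in for a project folder; the file name example.txt is hypothetical.

```r
# Remember where we started so we can return afterwards
old_wd <- getwd()

# tempdir() is a scratch folder that R creates for each session;
# here it stands in for a real project folder (an assumption)
setwd(tempdir())

# A bare file name is resolved relative to the working directory
writeLines("hello", "example.txt")            # hypothetical file name
found_relative <- file.exists("example.txt")

# file.path() builds a full path that works from any working directory
full_path <- file.path(getwd(), "example.txt")

# Return to the original working directory; the full path still resolves
setwd(old_wd)
found_full <- file.exists(full_path)

# Clean up the scratch file
unlink(full_path)
```

Storing the old directory first and restoring it at the end keeps the rest of a script from silently reading or writing in the wrong place.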

2.2 Installing and loading packages

You will need to install packages to handle certain tasks. You only need to install a package once, but you will need to load it in every session in which you want to use it.

    # install.packages("dplyr", dependencies=TRUE)
    # install.packages("ggplot2", dependencies=TRUE)
    # install.packages("foreign", dependencies=TRUE)
    # install.packages("xtable", dependencies = TRUE)
    # install.packages("stargazer", dependencies = TRUE)
    # install.packages("arm", dependencies = TRUE)
    # load packages
    library(foreign)
    library(xtable)
    library(arm)
    Loading required package: Matrix
    Loading required package: lme4
    arm (Version 1.8-6, built: 2015-7-7)
    Working directory is /Users/patriciakirkland/Dropbox/Empiprical Reasoning Center/R Workshop
    Attaching package: arm
    The following object is masked from package:xtable :
    display
    library(ggplot2)
    library(dplyr)
    library(stargazer)
    Please cite as:
    Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
    R package version 5.2. http://cran.r-project.org/package=stargazer

Some useful packages:

    foreign   load data formatted for other software
    xtable    export code to produce tables in LaTeX
    arm       applied regression and multi-level modeling
    ggplot2   make plots and figures
    dplyr     user-friendly data cleaning & manipulation

More packages: http://cran.r-project.org/web/packages/
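The install-once, load-every-session pattern can be sketched as a check that runs at the top of a script. This is one possible approach rather than the workshop's own method; the variable names pkgs, installed, and missing_pkgs are made up for the example, and the install line is left commented out, as in the guide.

```r
# Packages used in this guide
pkgs <- c("foreign", "xtable", "arm", "ggplot2", "dplyr", "stargazer")

# requireNamespace() returns TRUE if a package is installed and can be
# loaded, without attaching it to the search path
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Install only what is missing (needed once per computer)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs, dependencies = TRUE)

# Attach the packages that are available (needed in every session)
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}

missing_pkgs   # anything listed here still needs install.packages()
```

library() normally takes an unquoted name, so character.only = TRUE is needed when the package name is stored in a variable.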

2.3 Read in data

R can read data files in a variety of formats. Today, we will use a .csv file, but see below for code to read other types of data files.

Note: If the data file is stored in your working directory, you need only specify the file name. However, if the file is stored somewhere else on your computer, you will need to include the file path.

    # csv file
    data <- read.csv("teachingratingsexcel.csv", header=TRUE)
    # .dta file (Stata)
    # dtafile <- read.dta("fakedata.dta")
    # dtafile
    # .RData file
    # load("fakedata1.rdata")
    # data

2.4 Looking at data: basic info, printing objects, and generating basic summary stats

See variable names and dimensions of the data.

    names(data)
    [1] "minority" "age" "female" "onecredit" "beauty" "course_eval"
    [7] "intro" "nnenglish"
    dim(data)
    [1] 463 8
    dim(data)[1]
    [1] 463
    dim(data)[2]
    [1] 8

You can refer to specific rows or columns in a data frame by row or column number(s). This allows you to see a subset of your data. You could even assign the result to a new object, and you would have effectively subset your data.

    data[1,] # row 1 only
      minority age female onecredit     beauty course_eval intro nnenglish
    1        1  36      1         0  0.2899157         4.3     0         0
    data[1:3,] # rows 1 to 3 only
      minority age female onecredit     beauty course_eval intro nnenglish
    1        1  36      1         0  0.2899157         4.3     0         0
    2        0  59      0         0 -0.7377322         4.5     0         0
    3        0  51      0         0 -0.5719836         3.7     0         0
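If the workshop data file is not at hand, the row and column indexing rules above can be tried on a small hand-typed data frame. This is only a sketch: the object name toy and its handful of values are stand-ins, not the real dataset.

```r
# A tiny stand-in data frame (hypothetical; not the workshop data)
toy <- data.frame(
  age         = c(36, 59, 51, 40),
  female      = c(1, 0, 0, 1),
  course_eval = c(4.3, 4.5, 3.7, 4.3)
)

first_row   <- toy[1, ]     # row 1, all columns (still a data frame)
first_three <- toy[1:3, ]   # rows 1 to 3, all columns
age_column  <- toy[, 1]     # column 1 only (returned as a plain vector)
two_columns <- toy[, 2:3]   # columns 2 to 3 (a data frame)

dim(first_three)            # rows, then columns
```

The position before the comma selects rows and the position after it selects columns; leaving either one empty means "all".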

    # data[,1] # column 1 only
    # data[,2:4] # columns 2 to 4 only

Print some or all of the data to the console.

    # data
    # data[1:5,]
    # data[,3]
    head(data)
      minority age female onecredit     beauty course_eval intro nnenglish
    1        1  36      1         0  0.2899157         4.3     0         0
    2        0  59      0         0 -0.7377322         4.5     0         0
    3        0  51      0         0 -0.5719836         3.7     0         0
    4        0  40      1         0 -0.6779634         4.3     0         0
    5        0  31      1         0  1.5097940         4.4     0         0
    6        0  62      0         0  0.5885687         4.2     0         0
    # data$course_eval
    # data$female
    # data$beauty
    # course_eval # error! why?

Find out the classification or type of an object such as a data frame or a variable.

    class(data)
    [1] "data.frame"
    class(data$course_eval)
    [1] "numeric"
    class(data$female)
    [1] "integer"

Summarize your dataset or a specific variable.

    # summary() function
    summary(data)
        minority           age            female         onecredit           beauty
     Min.   :0.0000   Min.   :29.00   Min.   :0.0000   Min.   :0.00000   Min.   :-1.4504940
     1st Qu.:0.0000   1st Qu.:42.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:-0.6562689
     Median :0.0000   Median :48.00   Median :0.0000   Median :0.00000   Median :-0.0680143
     Mean   :0.1382   Mean   :48.37   Mean   :0.4212   Mean   :0.05832   Mean   : 0.0000001
     3rd Qu.:0.0000   3rd Qu.:57.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.: 0.5456024
     Max.   :1.0000   Max.   :73.00   Max.   :1.0000   Max.   :1.00000   Max.   : 1.9700230
      course_eval         intro          nnenglish
     Min.   :2.100   Min.   :0.0000   Min.   :0.00000
     1st Qu.:3.600   1st Qu.:0.0000   1st Qu.:0.00000
     Median :4.000   Median :0.0000   Median :0.00000

     Mean   :3.998   Mean   :0.3391   Mean   :0.06048
     3rd Qu.:4.400   3rd Qu.:1.0000   3rd Qu.:0.00000
     Max.   :5.000   Max.   :1.0000   Max.   :1.00000
    summary(data$beauty)
          Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
    -1.4500000 -0.6563000 -0.0680100  0.0000001  0.5456000  1.9700000

Tables

    # table() function
    table(data$female, useNA="always")
       0    1 <NA>
     268  195    0
    crosstab <- table(data$female, data$minority, useNA="always", dnn=c("Gender", "Race or Ethnicity"))
    crosstab <- crosstab[c(2, 1, 3), c(2, 1, 3)]
    row.names(crosstab) <- c("Female", "Male", "NA")
    colnames(crosstab) <- c("Minority", "White", "NA")
    mytable <- table(data$female, data$minority, useNA="always", dnn=c("Female", "Minority"))
    margin.table(mytable, 1)
    Female
       0    1 <NA>
     268  195    0
    margin.table(mytable, 2)
    Minority
       0    1 <NA>
     399   64    0
    prop.table(mytable)
          Minority
    Female          0          1       <NA>
      0    0.51835853 0.06047516 0.00000000
      1    0.34341253 0.07775378 0.00000000
      <NA> 0.00000000 0.00000000 0.00000000
    prop.table(mytable, 1)
          Minority
    Female         0         1      <NA>
      0    0.8955224 0.1044776 0.0000000
      1    0.8153846 0.1846154 0.0000000
      <NA>
    prop.table(mytable, 2)
          Minority
    Female         0         1      <NA>
      0    0.6015038 0.4375000
      1    0.3984962 0.5625000
      <NA> 0.0000000 0.0000000
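The second argument to margin.table() and prop.table() picks the dimension to work along: 1 for rows, 2 for columns, and nothing for the whole table. A minimal sketch on six made-up observations (the vectors below are hypothetical, not the workshop data) shows what each version sums to.

```r
# Hypothetical 0/1 indicators for six people
female   <- c(0, 0, 0, 1, 1, 1)
minority <- c(0, 0, 1, 0, 1, 1)

tab <- table(female, minority, dnn = c("Female", "Minority"))

row_totals <- margin.table(tab, 1)   # counts for each value of Female
col_totals <- margin.table(tab, 2)   # counts for each value of Minority

cell_share <- prop.table(tab)        # cell / grand total; whole table sums to 1
row_share  <- prop.table(tab, 1)     # cell / row total; each row sums to 1
col_share  <- prop.table(tab, 2)     # cell / column total; each column sums to 1

row_share
```

Row shares answer "of the women, what fraction are minorities?", while column shares answer "of the minorities, what fraction are women?".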

2.5 Basic histograms and scatterplots

Here are a few easy ways to see the distribution of your data. We will look at some more complex figures later.

Histogram

    # hist()
    hist(data$course_eval, breaks=25, main="Histogram of Outcome Variable - Course Evaluation", xlab="outcom [line truncated in the source]

[Figure: histogram "Histogram of Outcome Variable - Course Evaluation"; x-axis "Outcome Variable Y", y-axis "Frequency"]

Scatterplot

    # plot()
    plot(data$beauty, data$course_eval, main="Scatterplot of Beauty and Course Evaluations", pch=16)
    abline(v=0, col="red")
    abline(h=3.5, col="grey80", lty=2, lwd=3)

[Figure: scatterplot "Scatterplot of Beauty and Course Evaluations"; x-axis data$beauty, y-axis data$course_eval]

You can save a plot in PDF format. R will save the file to your working directory unless you specify a different file path.

    # save to disk
    pdf("basic_plot.pdf")
    plot(data$beauty, data$course_eval, main="Scatterplot of Beauty and Course Evaluations", pch=16)
    abline(v=0, col="red")
    abline(h=3.5, col="grey80", lty=2, lwd=3)
    dev.off()
    pdf
      2
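The save-to-PDF steps follow an open, draw, close pattern. The sketch below repeats it with made-up points and a temporary file, so nothing in your working directory is overwritten; the object names out_file, x, and y are assumptions for the example.

```r
# A temporary file path stands in for a real output file (an assumption)
out_file <- tempfile(fileext = ".pdf")

# Made-up values, only so the example can run on its own
x <- c(-1.2, -0.4, 0.1, 0.9, 1.6)
y <- c(3.6, 3.9, 4.0, 4.3, 4.4)

pdf(out_file)                                    # 1. open the PDF device
plot(x, y, main = "Toy scatterplot", pch = 16)   # 2. draw into it
abline(v = 0, col = "red")                       #    later calls add to the same page
dev.off()                                        # 3. close the device to finish the file

file.exists(out_file)
```

Everything drawn between pdf() and dev.off() goes to the file instead of the screen, and the file is not complete until dev.off() has run.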

2.6 Basic operators

Arithmetic/Math/Numeric Operators

    +   addition
    -   subtraction
    *   multiplication
    /   division

An example: make a new variable, age_squared.

    data$age_squared <- data$age^2

2.7 Logical Operators

Logical operators test conditions. For example, you might want a subset of data that includes observations for which a specific variable exceeds some value, or you may want to find observations with missing values. You can also use these operators to generate variables and data, often using the if() or ifelse() function.

    <           less than
    <=          less than or equal to
    >           greater than
    >=          greater than or equal to
    ==          exactly equal to
    !=          not equal to
    !x          not x
    x | y       x OR y
    x & y       x AND y
    isTRUE(x)   test if x is TRUE

Example: make a new variable using a logical test to determine which subjects are minorities who are non-native English speakers.

    data$nnenglish_minority <- data$minority == 1 & data$nnenglish == 1
    # data$nnenglish_minority
    data$nnenglish_minority <- as.numeric(data$nnenglish_minority)
    # data$nnenglish_minority

Now make a new variable to indicate whether a subject is older than the average age. We can use the ifelse() function.

    data$older <- ifelse(data$age > mean(data$age), 1, 0)

2.8 Subsetting data

    data[4,3]
    [1] 1
    data[4,]
      minority age female onecredit     beauty course_eval intro nnenglish age_squared
    4        0  40      1         0 -0.6779634         4.3     0         0        1600
      nnenglish_minority older
    4                  0     0
    data[,3]
      [1] 1 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 1 0 1 1 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0
     [43] 1 0 0 0 1 1 1 0 1 0 1 1 0 0 1 1 0 1 0 0 1 1 1 0 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0 0 1 1
     [85] 0 0 0 0 1 0 1 1 0 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
    [127] 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1
    [169] 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1
    [211] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0
    [253] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1
    [295] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1
    [337] 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    [379] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
    [421] 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1
    [463] 1
    data[4:10, 2:3]
       age female
    4   40      1
    5   31      1
    6   62      0
    7   33      1
    8   51      1
    9   33      1
    10  47      0

You can also subset by variables. Designate variables to keep or exclude.

    select.vars <- c("course_eval", "female")
    # data[select.vars]
    # data[ data$female==1,]

Make a new data frame that includes only women.

    female <- data[ data$female==1,]

Here is another way to make a new data frame that includes only women:

    female2 <- subset(data, female==1)

2.9 Writing data to disk

    write.csv(data, "evaluation_data.csv", row.names=FALSE)
    write.dta(data, "evaluation_data.dta")
    save(data, file="evaluation_data.rdata") # save just a data frame
    save.image(file="course_evaluations.rdata") # save your current workspace

2.10 Regression

We could simply proceed, but let's clear the workspace and load the .rdata file we just saved.

    rm(list=ls(all=TRUE)) # clear all objects in memory
    load("evaluation_data.rdata") # load the data

Specify a regression model. The following examples are OLS models. See the more detailed user guide for more information on other classes of models.

    fit_1 <- lm(course_eval ~ female, data=data)
    summary(fit_1)
    Call:
    lm(formula = course_eval ~ female, data = data)
    Residuals:
         Min       1Q   Median       3Q      Max
    -1.96903 -0.36903  0.03097  0.43097  0.99897
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  4.06903    0.03355  121.29  < 2e-16 ***
    female      -0.16800    0.05169   -3.25  0.00124 **
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error: 0.5492 on 461 degrees of freedom
    Multiple R-squared: 0.0224, Adjusted R-squared: 0.02028
    F-statistic: 10.56 on 1 and 461 DF, p-value: 0.001239

To include additional independent variables...

    fit_2 <- lm(course_eval ~ female + beauty + age + minority + nnenglish, data=data)
    summary(fit_2)

    Call:
    lm(formula = course_eval ~ female + beauty + age + minority + nnenglish, data = data)
    Residuals:
         Min       1Q   Median       3Q      Max
    -1.87797 -0.35784  0.04323  0.37956  1.02073
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  4.241680   0.143283  29.604  < 2e-16 ***
    female      -0.207452   0.052563  -3.947 9.17e-05 ***
    beauty       0.140942   0.032938   4.279 2.29e-05 ***
    age         -0.002707   0.002750  -0.984  0.32545
    minority    -0.044374   0.075725  -0.586  0.55817
    nnenglish   -0.313490   0.108630  -2.886  0.00409 **
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error: 0.5324 on 457 degrees of freedom
    Multiple R-squared: 0.08919, Adjusted R-squared: 0.07922
    F-statistic: 8.95 on 5 and 457 DF, p-value: 4.001e-08

To add fixed effects...

    fit_3 <- lm(course_eval ~ factor(intro) + female + beauty + age + minority + nnenglish, data=data)
    summary(fit_3)
    Call:
    lm(formula = course_eval ~ factor(intro) + female + beauty + age + minority + nnenglish, data = data)
    Residuals:
         Min       1Q   Median       3Q      Max
    -1.84713 -0.35266  0.04673  0.38961  1.05248
    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)     4.182596   0.146567  28.537  < 2e-16 ***
    factor(intro)1  0.098401   0.054097   1.819 0.069570 .
    female         -0.197257   0.052730  -3.741 0.000207 ***
    beauty          0.140213   0.032858   4.267 2.41e-05 ***
    age            -0.002238   0.002756  -0.812 0.417182
    minority       -0.070909   0.076930  -0.922 0.357154
    nnenglish      -0.274246   0.110484  -2.482 0.013415 *
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error: 0.5311 on 456 degrees of freedom
    Multiple R-squared: 0.09575, Adjusted R-squared: 0.08385

    F-statistic: 8.047 on 6 and 456 DF, p-value: 2.836e-08

To include an interaction...

    fit_4 <- lm(course_eval ~ female*beauty + age + minority + nnenglish, data=data)
    summary(fit_4)
    Call:
    lm(formula = course_eval ~ female * beauty + age + minority + nnenglish, data = data)
    Residuals:
         Min       1Q   Median       3Q      Max
    -1.84616 -0.34549  0.04303  0.39253  1.05515
    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)    4.217031   0.143705  29.345  < 2e-16 ***
    female        -0.204181   0.052488  -3.890 0.000115 ***
    beauty         0.193681   0.045052   4.299  2.1e-05 ***
    age           -0.002169   0.002763  -0.785 0.432713
    minority      -0.017367   0.077195  -0.225 0.822097
    nnenglish     -0.330643   0.108864  -3.037 0.002525 **
    female:beauty -0.111446   0.065107  -1.712 0.087627 .
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error: 0.5313 on 456 degrees of freedom
    Multiple R-squared: 0.095, Adjusted R-squared: 0.08309
    F-statistic: 7.978 on 6 and 456 DF, p-value: 3.372e-08

You can also export regression results. There are multiple packages you could use, but the example below uses stargazer.

    # stargazer(fit_1, fit_2, fit_3, fit_4, omit=("intro"),
    #           omit.stat=("n"),
    #           add.lines=list(c("Fixed Effects", "No", "intro", "No")),
    #           notes=c("OLS Regression models."),
    #           notes.align="l",
    #           notes.append=TRUE,
    #           covariate.labels = c(),
    #           float=FALSE, dep.var.labels = "Course Evaluation",
    #           out = "course_eval_regressions")

3 More Plots & Figures

We can start by creating factors. Factors designate groups or categories (this is optional, depending on the figures you need).

    data$gender <- factor(data$female, levels=c(0, 1), labels=c("Male","Female"))
    data$minority_status <- factor(data$minority, levels=c(0,1), labels=c("Non-minority","Minority"))
    data$age_status <- factor(data$older, levels=c(0, 1), labels=c("Younger","Older"))

You need the ggplot2 package to use the qplot() & ggplot() functions.

    # Kernel density plots for course evaluations
    # grouped by gender (indicated by color)
    qplot(course_eval, data=data, geom="density", fill=gender, alpha=I(.5),
          main="Distribution of Course Evaluations", xlab="Evaluation Score",
          ylab="Density")

[Figure: density plot "Distribution of Course Evaluations"; x-axis "Evaluation Score", y-axis "Density"; fill by gender (Male, Female)]

    # Histogram for course evaluations
    # grouped by gender (indicated by color)
    qplot(course_eval, data=data, geom="histogram", fill=gender, alpha=I(.75),
          main="Distribution of Course Evaluations", xlab="Evaluation Score",
          ylab="Density")
    stat_bin: binwidth defaulted to range/30. Use binwidth = x to adjust this.

[Figure: histogram "Distribution of Course Evaluations"; x-axis "Evaluation Score", y-axis "Density"; fill by gender (Male, Female)]

    # Scatterplot of course evaluations vs. beauty for each combination of gender and age_status
    # in each facet, gender is represented by shape and color
    # (beauty goes first so that it lands on the x-axis and matches the axis labels)
    qplot(beauty, course_eval, data=data, shape=gender, color=gender,
          facets=age_status~minority_status, size=I(3),
          xlab="Beauty", ylab="Course Evaluation")

[Figure: faceted scatterplot; columns Non-minority and Minority, rows Younger and Older; axes labeled "Beauty" and "Course Evaluation"; shape and color by gender (Male, Female)]

    # Separate regressions of course evaluations on beauty for each gender
    qplot(beauty, course_eval, data=data, geom=c("point", "smooth"),
          method="lm", formula=y~x, color=gender,
          main="Regression of Evaluations on Beauty",
          xlab="Beauty", ylab="Course Evaluation")

[Figure: scatterplot with a fitted line for each gender, "Regression of Evaluations on Beauty"; x-axis "Beauty", y-axis "Course Evaluation"; color by gender (Male, Female)]

    # Boxplots of course evaluations by gender
    # observations (points) are overlaid and jittered
    qplot(gender, course_eval, data=data, geom=c("boxplot", "jitter"),
          fill=gender, main="Course Evaluations by Gender",
          xlab="", ylab="Course Evaluations")

[Figure: boxplots with jittered points, "Course Evaluations by Gender"; x-axis Male and Female, y-axis "Course Evaluations"]

    plot <- ggplot(data, aes(beauty, course_eval)) +
      geom_point(alpha=.5) +
      geom_smooth()
    plot
    geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use method = x to change the smoothing method.

[Figure: scatterplot with loess smooth; x-axis beauty, y-axis course_eval]

    plot <- ggplot(data, aes(beauty, course_eval)) +
      geom_point(colour="green", alpha=1) +
      geom_smooth(method="lm", colour="black", se=FALSE) +
      scale_y_continuous(limits=c(0, 10)) +
      scale_x_continuous(limits=c(-2, 2.5)) +
      theme_bw() +
      xlab("Beauty") +
      ylab("Course Evaluations") +
      ggtitle("Course Evaluations & Beauty") +
      geom_vline(xintercept = 0, colour="grey")
    plot

[Figure: "Course Evaluations & Beauty"; x-axis "Beauty", y-axis "Course Evaluations" on a 0 to 10 scale]

    plot_2 <- plot + theme_bw() +
      ylab("Course Evaluations") +
      xlab("Beauty") +
      ggtitle("Course Evaluations & Beauty") +
      scale_y_continuous(limits=c(0, 6), breaks=seq(1, 6, 1.5)) +
      scale_x_continuous(limits=c(-2, 2), breaks=seq(-2, 2, .5))
    Scale for y is already present. Adding another scale for y, which will replace the existing scale.
    Scale for x is already present. Adding another scale for x, which will replace the existing scale.
    plot_2

[Figure: "Course Evaluations & Beauty" with rescaled axes; x-axis "Beauty", y-axis "Course Evaluations"]

    plot_3 <- ggplot(data, aes(beauty, course_eval)) +
      geom_point(alpha=.5) +
      geom_smooth(se=FALSE) +
      theme_bw() +
      ylab("Course Evaluation") +
      xlab("Beauty") +
      ggtitle("Course Evaluations & Beauty") +
      scale_y_continuous(limits=c(1, 6), breaks=seq(1.5, 6, 1.5)) +
      scale_x_continuous(limits=c(-1, 2.5), breaks=seq(-1.5, 2.5, 1))
    plot_3
    geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use method = x to change the smoothing method.
    Warning: Removed 46 rows containing missing values (stat_smooth).
    Warning: Removed 46 rows containing missing values (geom_point).

[Figure: "Course Evaluations & Beauty" with loess smooth; x-axis "Beauty", y-axis "Course Evaluation"]

    plot_4 <- plot_3 %+% aes(age, course_eval) +
      ylab("Course Evaluation") +
      xlab("Age") +
      ggtitle("Course Evaluations & Age") +
      scale_x_continuous(limits=c(25, 75), breaks=seq(25, 75, 15))
    Scale for x is already present. Adding another scale for x, which will replace the existing scale.
    plot_4
    geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use method = x to change the smoothing method.

[Figure: "Course Evaluations & Age" with loess smooth; x-axis "Age", y-axis "Course Evaluation"]

To save a plot as a PDF...

name the file

    pdf("plot_evals_age.pdf")

print the object (plot)

    print(plot_4)
    geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use method = x to change the smoothing method.

close the figure file

    dev.off()
    pdf
      2
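As an alternative to the pdf(), print(), dev.off() sequence, ggplot2 also provides ggsave(), which writes a plot object to a file in one call. This is not part of the workshop's own workflow. The sketch below assumes ggplot2 is installed, uses a made-up data frame named toy, and writes to a temporary file.

```r
# Run only if ggplot2 is installed; otherwise record that nothing was saved
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Made-up values standing in for the workshop data (hypothetical)
  toy <- data.frame(age         = c(30, 42, 48, 57, 66),
                    course_eval = c(4.4, 4.1, 4.0, 3.8, 3.9))

  p <- ggplot(toy, aes(age, course_eval)) + geom_point()

  # Temporary path so no real file is overwritten; width and height are in inches
  out_file <- tempfile(fileext = ".pdf")
  ggsave(out_file, plot = p, width = 6, height = 4)

  saved <- file.exists(out_file)
} else {
  saved <- NA
}
```

ggsave() picks the file type from the extension, so the same call with a .png path would write a PNG instead.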