Introduction to R Syllabus
Instructor: Grant Cavanaugh
Department of Agricultural Economics, University of Kentucky
E-mail: gcavanugh@uky.edu

Course description
Introduction to R is a short course intended for students with limited or no previous use of R but some familiarity with other stats/math packages. The course presents some of the basic operations in R (importing data, running OLS regressions, etc.) as well as some of the uses of R that distinguish it from its licensed competitors (programming, displaying data, matrix multiplication). After taking this course, students should have an understanding of what R is and why you might want to use it in your research rather than SAS, Stata, or other mathematical packages.

Topics covered
- Advantages and disadvantages of R
- Help
- Importing data
- Program files (.R)
- Object oriented programming: what is it?
- Summarizing and analyzing data
- Linear regression
- Partial autocorrelation of residuals
- Getting new packages in R
- Regression using ARIMA
- Why R is sometimes frustrating
- Making your own function
- Graphics
- Matrix multiplication

References
R's single biggest strength is its online community. There are tons of free tutorials on R. You can find a great list of free online resources for learning R at:
http://jeromyanglim.blogspot.com/2010/05/videos-on-data-analysis-with-r.html

#We start by reading in the data using the command read.csv. I've put the data in a csv file. It's a standard file type that Excel can save in.#
costs<-read.csv("/users/grantcavanaugh/desktop/lopdata.csv", header=TRUE)
#Note that I have read in our data file as the "object" costs, so that if I type "costs" I call up our whole data set.#
costs
> costs
  pchicago ptoledo  trans
1   2,4597  2,3874 0,5222
2   2,4902  2,4163 0,5222
3   2,4902  2,4163 0,5222
#R is an object oriented language, meaning that it can do things like store the number 64 under a letter#
b<-64
#now whenever you type b you get 64#
6+b
> 6+b
[1] 70
#Here we attach our data set to R's search path. This means that all of its variables will be objects automatically.#
attach(costs)
#The function class() tells how R thinks about a given object, i.e. is it a time series object? In this case we are looking at what R calls a "data frame"; it's like a matrix whose columns can hold different types of data.#
class(costs)
> class(costs)
[1] "data.frame"
class(pchicago)
> class(pchicago)
[1] "factor"
#For some strange reason my R is reading all our data as "factors" (factors are generally things like red, blue, and green) rather than as numbers. Let's change that using the function as.numeric()#
nchicago<-as.numeric(pchicago); ntoledo<-as.numeric(ptoledo); ntrans<-as.numeric(trans)
#Let's just look at the data for a moment#
summary(nchicago)
> summary(nchicago)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0   524.8  1046.0  1072.0  1618.0  2238.0
#Here's the mean#
mean(nchicago)
> mean(nchicago)
[1] 1072.427
#Here's the standard deviation#
sd(nchicago)
> sd(nchicago)
[1] 641.16
#Now let's run a regression#
base.reg<-lm(nchicago~ntoledo+ntrans)
summary(base.reg)
> summary(base.reg)

Call:
lm(formula = nchicago ~ ntoledo + ntrans)

Residuals:
     Min       1Q   Median       3Q      Max
-1621.35   -37.42    11.75    43.60   858.46

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.123324   3.875234   4.677 3.06e-06 ***
ntoledo      1.002515   0.003078 325.722  < 2e-16 ***
ntrans      -0.119601   0.016689  -7.166 9.96e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 90.6 on 2637 degrees of freedom
Multiple R-squared: 0.98, Adjusted R-squared: 0.98
F-statistic: 6.477e+04 on 2 and 2637 DF, p-value: < 2.2e-16
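#A word of caution, added here as an aside: as.numeric() applied to a factor returns the underlying integer level codes, not the printed values, which is consistent with the summary above running from 1 up to the number of distinct prices. Assuming the comma decimal separators visible in the printed data are what triggered the factor conversion, two safer approaches are sketched below.#

```r
# as.numeric() on a factor gives level codes; convert via character first,
# swapping the comma decimal separator for a period:
nchicago <- as.numeric(gsub(",", ".", as.character(pchicago)))

# Better still, declare the decimal separator at import time:
costs <- read.csv("/users/grantcavanaugh/desktop/lopdata.csv",
                  header = TRUE, dec = ",")
```

#Either way, summary(nchicago) should then report actual prices rather than level codes.#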
#now let's see if the residuals in our regression have some autocorrelation, meaning that we should really model this as a time series.#
residuals<-ts(base.reg$resid)
#to do this we need the package tseries, which gives us a nice autocorrelation function#
#R has many many many packages with very cool and specialized commands, but the trick is that 1) you have to know which one you want, 2) you have to install it, and 3) you have to tell R that you want to use it.#
#You can find packages just by Google searching or looking at the help menu#
#Here I install the package I want. You could also install it from within the program or download it at http://cran.r-project.org/web/packages/nlme/index.html.#
#install.packages("tseries")
#Here I point the program to that package#
library(tseries)
pacf(residuals)
#from this graph we can see that there is some correlation between regression errors in one term and regression errors in the next, so it's best to model the whole thing as a time series#
#To begin this time series analysis, I'm going to make each of our variables an "object". This means that I can call them up easily. I'm
going to use the command ts() to let R know that these are time series data.#
tspchicago<-ts(pchicago); tsptoledo<-ts(ptoledo); tstrans<-ts(trans)
#we already have the time series package in R's library so let's go ahead and run the model we did in class#
two.lags<-arima(tspchicago, order=c(2,0,0), xreg=cbind(tsptoledo,tstrans))
two.lags
> two.lags

Call:
arima(x = tspchicago, order = c(2, 0, 0), xreg = cbind(tsptoledo, tstrans))

Coefficients:
         ar1     ar2  intercept  tsptoledo  tstrans
      0.4174  0.3105    27.9986     0.9850  -0.0738
s.e.  0.0186  0.0186    10.6595     0.0082   0.0456

sigma^2 estimated as 4750:  log likelihood = -14921.19,  aic = 29854.37
#If you want to know more about a function, you can simply type ? and the function's name#
#?arima
#One of the frustrating things about R is that not all the functions work well. For example, in the previous example you would have to do some manipulation to get R to spit back p-values for that ARIMA regression. Alternatively you could use the function below, but it keeps crashing my computer, so let's skip it.#
#install.packages("nlme")
#library(nlme)
#fit.gls<-gls(tspchicago~tsptoledo + tstrans, correlation=corARMA(p=2),
# method="ml")

http://xkcd.com/196/

#Okay, so now that we've gone through and completed a little task in R, let's look at some of the things that really make R special compared to Stata or SAS#
#First, R is not just a stats package. It's a full programming language, meaning that you can create your own functions.#
#There is a simple example of this in the book "Bayesian Computation with R" by Jim Albert, in which he creates his own code for a t-statistic function. He begins by explaining the function's parts, then puts them all together.#
#These give us the lengths of the 2 vectors on which we want to use the function.#
#m<-length(x)
#n<-length(y)
#Here we get the pooled standard deviation for the two vectors. The function sd() gives us standard deviation#
#sp<-sqrt(((m-1)*sd(x)^2+(n-1)*sd(y)^2)/(m+n-2))
#Here we define the t stat#
#t<-(mean(x)-mean(y))/(sp*sqrt(1/m+1/n))
#now we put them all together in a separate text file labeled tstatistic.r (We label it .r even though it's a text file.)#
#tstatistic=function(x,y)
#{
# m=length(x)
# n=length(y)
# sp=sqrt(((m-1)*sd(x)^2+(n-1)*sd(y)^2)/(m+n-2))
# t=(mean(x)-mean(y))/(sp*sqrt(1/m+1/n))
# return(t)
#}
#Now load this new function into R by pointing R toward the file with the function.#
source("/users/grantcavanaugh/dropbox/tstatistic.r")
#Now we'll use the new function on some made-up data. Note we use the function c() to join up numbers in a vector#
data.x<-c(1,4,3,6,5)
data.y<-c(5,4,7,6,10)
#Now run the function.#
tstatistic(data.x, data.y)
> tstatistic(data.x, data.y)
[1] -1.937926
#Beyond its great community and programmability, R is preferred by stats folks because its data visualization is better than other canned packages. Here we'll go through some very basic graphics.#
hist(nchicago)
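#As a sanity check on the hand-rolled function (my addition, not in Albert's example): base R's t.test() with var.equal=TRUE uses the same pooled standard deviation, so its statistic should match tstatistic() exactly.#

```r
# The pooled-variance two-sample t-test in base R; $statistic extracts
# the t value, which matches the -1.937926 computed above.
t.test(data.x, data.y, var.equal = TRUE)$statistic
```

#This kind of cross-check against a built-in function is a good habit whenever you write your own statistical code.#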
#We can manipulate the size and number of bars in our histogram by specifying breaks#
brk<-c(0,25,125,400,1000,1050,5000)
hist(nchicago, breaks=brk)
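#One more option worth knowing (a sketch of my own, not in the original notes): with unequal break widths like those above, hist() switches to a density scale so that bar areas, rather than heights, stay proportional to counts. You can also overlay a kernel density estimate.#

```r
# freq = FALSE asks for the density scale explicitly; density() gives a
# smooth kernel estimate, and lines() draws it on top of the histogram.
hist(nchicago, breaks = brk, freq = FALSE)
lines(density(nchicago), col = "red")
```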
#you can easily put multiple graphs on a single panel, for example using the function par() and specifying that you want 1 row of charts and 3 columns, in this case using the argument mfrow#
par(mfrow=c(1,3))
boxplot(nchicago)
boxplot(ntoledo)
boxplot(ntrans)
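#If you want the three-panel figure in a file rather than on screen (my own aside, with a made-up filename), wrap the plotting calls in a graphics device such as png():#

```r
# Open a file-based graphics device, draw, then close it to write the file.
png("boxplots.png", width = 900, height = 300)
par(mfrow = c(1, 3))
boxplot(nchicago)
boxplot(ntoledo)
boxplot(ntrans)
dev.off()
```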
#Now reset the window.#
par(mfrow=c(1,1))
#and put all three on the same set of axes#
boxplot(nchicago,ntoledo,ntrans)
#now let's generate some random data, plot that data, and play with the labels on the axes#
cookies<-rnorm(500, mean=50, sd=60)
monsters<-rnorm(500, mean=50, sd=60)
plot(monsters, cookies, ylab="cookies!", xlab="monsters", main="size of cookie predicted by size of monster")
cmreg<-lm(cookies~monsters)
abline(cmreg, col='blue')
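#Since cookies and monsters were drawn independently, the true slope is zero; a quick check of the fitted coefficients, plus a legend for the fitted line (labels are my own), is sketched below.#

```r
# coef() extracts the intercept and slope; the slope should be near 0.
coef(cmreg)
# legend() labels the OLS line we drew with abline() above.
legend("topright", legend = "OLS fit", col = "blue", lty = 1)
```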
#The final thing to mention is that R, unlike Stata or SAS, can do all the same matrix multiplication as MATLAB. That means that you can keep all your work in one program. To show this I'm going to generate 2 vectors and multiply them using the function t() for transpose and the function seq() for a sequence of numbers. Note that you have to use %*% if you are multiplying matrices.#
x<-seq(1:10)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
y<-seq(1:4)
> y
[1] 1 2 3 4
xymatrix<-x%*%t(y)
> xymatrix
      [,1] [,2] [,3] [,4]
 [1,]    1    2    3    4
 [2,]    2    4    6    8
 [3,]    3    6    9   12
 [4,]    4    8   12   16
 [5,]    5   10   15   20
 [6,]    6   12   18   24
 [7,]    7   14   21   28
 [8,]    8   16   24   32
 [9,]    9   18   27   36
[10,]   10   20   30   40
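#To see why these matrix operators matter for econometrics, here is a sketch of my own that reproduces the OLS coefficients from base.reg by hand, assuming the nchicago, ntoledo, and ntrans vectors from earlier are still in memory.#

```r
# OLS by hand: beta-hat = (X'X)^(-1) X'y, using solve() for the matrix
# inverse and %*% for matrix multiplication.
X <- cbind(1, ntoledo, ntrans)   # design matrix with an intercept column
beta.hat <- solve(t(X) %*% X) %*% (t(X) %*% nchicago)
beta.hat                         # should match coef(base.reg)
```

#Being able to write estimators directly like this, instead of relying on canned procedures, is exactly the MATLAB-style flexibility mentioned above.#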