
Introduction to R Syllabus

Instructor
Grant Cavanaugh
Department of Agricultural Economics
University of Kentucky
E-mail: gcavanugh@uky.edu

Course description
Introduction to R is a short course intended for students with limited or no previous use of R but some familiarity with other stats/math packages. The course presents some of the basic operations in R (importing data, running OLS regressions, etc.) as well as some of the uses of R that distinguish the package from its licensed competitors (programming, displaying data, matrix multiplication). After taking this course, students should have an understanding of what R is and why you might want to use it in your research rather than SAS, Stata, or other mathematical packages.

Topics covered
Advantages and disadvantages of R
Help
Importing data
Program files (.R)
Object oriented programming: what is it?
Summarizing and analyzing data
Linear regression
Partial autocorrelation of residuals
Getting new packages in R
Regression using ARIMA
Why R is sometimes frustrating
Making your own function
Graphics
Matrix multiplication

References
R's single biggest strength is its online community. There are tons of free tutorials on R. You can find a great list of free online resources for learning R at:
http://jeromyanglim.blogspot.com/2010/05/videos-on-data-analysis-with-r.html

#We start by reading in the data using the command read.csv. I've put the data in a csv file. It's a standard file type that Excel can save in.#
costs<-read.csv("/users/grantcavanaugh/desktop/lopdata.csv", header=TRUE)
#Note that I have read in our data file as the "object" costs, so that if I type "costs" I call up our whole data set.#
costs
> costs
  pchicago ptoledo  trans
1   2,4597  2,3874 0,5222
2   2,4902  2,4163 0,5222
3   2,4902  2,4163 0,5222
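#A side note: the prices above print with commas where we would normally expect decimal points. If lopdata.csv really stores its numbers with comma decimal marks (an assumption on my part), telling read.csv about that at import time should bring the columns in as numbers rather than as factors. A minimal sketch, reusing the same path as above; str() then confirms the column types:#
#costs<-read.csv("/users/grantcavanaugh/desktop/lopdata.csv", header=TRUE, dec=",")
#str(costs)
#For a European-style csv (semicolon separated, comma decimals), read.csv2() does the same job with those defaults.#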

#R is an object oriented language, meaning that it can do things like store the number 64 as a letter#
b<-64
#now whenever you type b you get 64#
6+b
> 6+b
[1] 70
#Here we load our data set into R's active directory. This means that all of its variables will be objects automatically.#
attach(costs)
#The function class() tells how R thinks about a given object, i.e. is it a time series object? In this case we are looking at what R calls a "data frame": it's like a matrix but without the dimensions.#
class(costs)
> class(costs)
[1] "data.frame"
class(pchicago)
> class(pchicago)
[1] "factor"
#For some strange reason my R is reading all our data as "factors" (factors are generally things like red, blue, and green) rather than as numbers. Let's change that using the function as.numeric()#
nchicago<-as.numeric(pchicago); ntoledo<-as.numeric(ptoledo); ntrans<-as.numeric(trans)
#Let's just look at the data for a moment#
summary(nchicago)
> summary(nchicago)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0   524.8  1046.0  1072.0  1618.0  2238.0
#Here's the mean#
mean(nchicago)
> mean(nchicago)
[1] 1072.427
#Here's the standard deviation#
sd(nchicago)
> sd(nchicago)
[1] 641.16
#Now let's run a regression#
base.reg<-lm(nchicago~ntoledo+ntrans)
summary(base.reg)
> summary(base.reg)
Call:
lm(formula = nchicago ~ ntoledo + ntrans)
Residuals:
     Min       1Q   Median       3Q      Max
-1621.35   -37.42    11.75    43.60   858.46
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.123324   3.875234   4.677 3.06e-06 ***
ntoledo      1.002515   0.003078 325.722  < 2e-16 ***
ntrans      -0.119601   0.016689  -7.166 9.96e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 90.6 on 2637 degrees of freedom
Multiple R-squared: 0.98, Adjusted R-squared: 0.98
F-statistic: 6.477e+04 on 2 and 2637 DF,  p-value: < 2.2e-16
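#A side note on the as.numeric() conversion used above: applied directly to a factor, as.numeric() returns the internal level codes (1, 2, 3, ...) rather than the original values, which is probably why the summary above runs from 1 to 2238. The usual idiom is to go through as.character() first. A minimal sketch, assuming the underlying strings parse as numbers once any decimal commas are swapped for points:#
#nchicago<-as.numeric(as.character(pchicago))
#nchicago<-as.numeric(gsub(",", ".", as.character(pchicago))) #if the values contain decimal commas#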

#Now let's see if the residuals in our regression have some autocorrelation, meaning that we should really model this as a time series.#
residuals<-ts(base.reg$resid)
#To do this we need the package tseries, which gives us a nice autocorrelation function#
#R has many many many packages with very cool and specialized commands, but the trick is that 1) you have to know which one you want, 2) you have to install it and 3) you have to tell R that you want to use it.#
#You can find packages just by Google searching or looking at the help menu#
#Here I install the package I want. You could also install it from within the program or download it at http://cran.r-project.org/web/packages/nlme/index.html.#
#install.packages("tseries")
#Here I point the program to that package#
library(tseries)
pacf(residuals)
#From this graph we can see that there is some correlation between regression errors in one term and regression errors in the next, so it's best to model the whole thing as a time series#
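#A side note: a common pattern is to install a package only if it is missing and then load it, so the same script runs on a fresh machine without re-downloading anything each time. A minimal sketch using the tseries package from above:#
#if(!require(tseries)) install.packages("tseries")
#library(tseries)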

#To begin this time series analysis, I'm going to make each of our variables an "object". This means that I can call them up easily. I'm going to use the command ts() to let R know that these are time series data.#
tspchicago<-ts(pchicago); tsptoledo<-ts(ptoledo); tstrans<-ts(trans)
#We already have the time series package in R's library, so let's go ahead and run the model we did in class#
two.lags<-arima(tspchicago, order=c(2,0,0), xreg=cbind(tsptoledo,tstrans))
two.lags
> two.lags
Call:
arima(x = tspchicago, order = c(2, 0, 0), xreg = cbind(tsptoledo, tstrans))
Coefficients:
         ar1     ar2  intercept  tsptoledo  tstrans
      0.4174  0.3105    27.9986     0.9850  -0.0738
s.e.  0.0186  0.0186    10.6595     0.0082   0.0456
sigma^2 estimated as 4750:  log likelihood = -14921.19,  aic = 29854.37
#If you want to know more about a function, you can simply type ? and the function's name#
#?arima
#One of the frustrating things about R is that not all the functions work well. For example, in the previous example you would have to do some manipulation to get R to spit back p-values for that ARIMA regression. Alternatively you could use the function below, but it keeps crashing my computer, so let's skip it.#
#install.packages("nlme")
#library(nlme)
#fit.gls<-gls(tspchicago~tsptoledo + tstrans, correlation=corARMA(p=2),
#             method="ml")
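#A side note on the p-value manipulation mentioned above: an arima() fit keeps its coefficient estimates and their covariance matrix, so approximate z statistics and two-sided p-values can be pieced together by hand. A minimal sketch using the two.lags fit from above:#
#z<-coef(two.lags)/sqrt(diag(vcov(two.lags)))
#2*pnorm(-abs(z)) #approximate two-sided p-values, one per coefficient#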

method="ml") http://xkcd.com/196/ #Okay so now that we've gone through and completed a little task in R lets look at some of the things that really make R special compared to Stata or SAS# #First, R is not just a stats package. Its a full programming language meaning that you can create your own functions.# #There is a simple example of this in the book "Bayesian Computation with R" by Jim Albert in which he creates his own code for a t-statistic fucntion.he begins by explaining the function's parts then puts them all togeather.# #These give use the lengths of 2 vectors on which we want to use the function.# #m<-length(x) #n<-length(y) #Here we get the pooled standard deviation for the two vectors. the function sd() gives us standard deviation# #sp<-sqrt(((m-1)*sd(x)^2+(n-1)*sd(y)^2)/(m+n-2)) #Here we define the t stat# #t<-(mean(x)-mean(y))/(sp*sqrt(1/m+1/n)) #now we put them all togeather in a separate text file labled tstatistic.r (We label it.r even though its a text file.)# #tstatistic=function(x,y) #{ # m=length(x) # n=length(y) # sp=sqrt(((m-1)*sd(x)^2+(n-1)*sd(y)^2)/(m+n-2)) # t=(mean(x)-mean(y))/(sp*sqrt(1/m+1/n)) return(t) #} #Now load this new function onto R by pointing R toward the file with the function.# source("/users/grantcavanaugh/dropbox/tstatistic.r") #Now we'll use the new function on some made up data. Note we use the function c() to join up numbers in a vector# data.x<-c(1,4,3,6,5) data.y<-c(5,4,7,6,10) #Now run the function.# tstatistic(data.x, data.y) > tstatistic(data.x, data.y) [1] -1.937926 #Beyond its great community and programability, R is prefered by stats folks because it's data visualiztion is better than other canned packages. Here we'll go through some very basic graphics.# hist(nchicago)

#Beyond its great community and programmability, R is preferred by stats folks because its data visualization is better than other canned packages. Here we'll go through some very basic graphics.#
hist(nchicago)

#We can manipulate the size and number of bars in our histogram by specifying breaks#
brk<-c(0,25,125,400,1000,1050,5000)
hist(nchicago, breaks=brk)

#You can easily put multiple graphs on a single panel, for example using the function par() and specifying that you want 1 row of charts and 3 columns, in this case using the argument mfrow#
par(mfrow=c(1,3))
boxplot(nchicago)
boxplot(ntoledo)
boxplot(ntrans)
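#A side note: par() invisibly returns the settings it replaces, so another common pattern is to save those old settings and restore them when you are done, rather than resetting by hand. A minimal sketch (old.par is just a name I made up):#
#old.par<-par(mfrow=c(1,3))
#...draw the three boxplots here...
#par(old.par)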

#Now reset the window.#
par(mfrow=c(1,1))
#and put all three on the same set of axes#
boxplot(nchicago,ntoledo,ntrans)
#Now let's generate some random data, plot that data, and play with the labels on the axes#
cookies<-rnorm(500, mean=50, sd=60)
monsters<-rnorm(500, mean=50, sd=60)
plot(monsters, cookies, ylab="cookies!", xlab="monsters", main="size of cookie predicted by size of monster")
cmreg<-lm(cookies~monsters)
abline(cmreg, col='blue')
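#A side note: rnorm() draws new numbers every time it is called, so this plot will look a little different on each run. Calling set.seed() with any fixed number before generating the data makes the picture reproducible. A minimal sketch (the seed value 42 is arbitrary):#
#set.seed(42)
#cookies<-rnorm(500, mean=50, sd=60)
#monsters<-rnorm(500, mean=50, sd=60)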

#The final thing to mention is that R, unlike Stata or SAS, can do all the same matrix multiplication as MATLAB. That means that you can keep all your work in one program. To show this I'm going to generate 2 vectors and multiply them, using the function t() for transpose and the function seq() for a sequence of numbers. Note that you have to use %*% if you are multiplying matrices.#
x<-seq(1:10)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
y<-seq(1:4)
> y
[1] 1 2 3 4
xymatrix<-x%*%t(y)
> xymatrix
      [,1] [,2] [,3] [,4]
 [1,]    1    2    3    4
 [2,]    2    4    6    8
 [3,]    3    6    9   12
 [4,]    4    8   12   16
 [5,]    5   10   15   20
 [6,]    6   12   18   24
 [7,]    7   14   21   28
 [8,]    8   16   24   32
 [9,]    9   18   27   36
[10,]   10   20   30   40
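#A side note: for two plain vectors, the built-in function outer() builds the same outer-product matrix as x%*%t(y), and can read a little more clearly. A minimal sketch using the x and y from above:#
#outer(x, y) #same 10-by-4 matrix of products as xymatrix#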