R for Wildlife Ecologists (Quick Reference Guide)


Bret Collier, Institute of Renewable Natural Resources, Texas A&M University, College Station, Texas 77845; 979/595/50706

Contents

1 Course Introduction
2 Starting in R
  2.1 R Basics
  2.2 Simple Programming
  2.3 R Objects
  2.4 Classes and Modes
3 R Project and Data Management
  3.1 Working directories
  3.2 Importing and exporting data
  3.3 Creation of, types, and working with data: a super short primer
  3.4 Basic Mathematical/Operators
4 R Creating Graphics
  4.1 Scatterplots
  4.2 Other Simple plots
5 Statistical Models with R
  5.1 Contingency Tables
  5.2 Linear Regression
  5.3 Generalized Linear Models
6 Writing Functions in R
  6.1 Functions
7 Wildlife-Specific Methods
  7.1 Capture-Recapture Analysis
  7.2 Distance Sampling
  7.3 Spatial Models
Contact

Contact after 1 March: School of Renewable Natural Resources, Louisiana State University, bcollier.work@gmail.com or 979/595/5076

8 Literature To Look At! Here is a pretty short list of good books to get on your shelves, R packages that I use regularly, and a few websites that will make your life easier.

1 Course Introduction

First, since you are reading this you have taken the first steps towards freeing yourselves from the forced servitude of point-and-click statistical programs that control your data's structure and drive your data analyses. No longer will you be told what numbers will be available for you to interpret or what statistical tests you should use. Rather, we are going to move, together today, into computing on the data and questions of interest, where you decide how to develop, manipulate, examine, and interpret statistical results. Our philosophy for today is simple: use R to become better analysts, as described by Dr. Harrell:

library(fortunes)
fortune("good data analyst")

Can one be a good data analyst without being a half-good programmer? The short answer to that is, 'No.' The long answer to that is, 'No.'
   -- Frank Harrell
      1999 S-PLUS User Conference, New Orleans (October 1999)

Additionally, for today's work, we are going with the motto that GUIs normally make it simple to accomplish simple actions and impossible to accomplish complex actions. (I read this somewhere but cannot remember who said it; if you do, let me know.) Thus, today is going to be all about programming, get ready! Since everyone is a scientist here, you have probably realized that you are going to need at least a basic understanding of R. But, that's ok, because understanding R will benefit you long-term. The good thing(s) about R, to list a few, include:

1. There is a wealth of online documentation related to the use of R. Just look at the R homepage (https://www.r-project.org/) for a host of useful links.

2. There are huge numbers of freely available R packages that can be used to perform specific analyses, and you can develop packages that archive your data and code so that other folks can see/use it just as easily (https://cran.r-project.org/web/packages/).

3. Because R is a flexible environment, there are entire fields of study (e.g., Analysis of Spatial Data) for which a wide range of approaches has been developed to conduct various analyses (some of which can be seen on CRAN). Additionally, Springer has an entire series called Use R! consisting of books published on various R-related statistical topics.

4. R is not just for analysis, but merges seamlessly into the writing of theses & dissertations, books, articles, presentations, course notes, etc. Our course notes for today were written entirely in LYX (pronounced 'Licks') using the R package knitr to 'knit' the R code and the text (ported via TeX/LaTeX, which is pronounced 'Tech' and 'La Tech'; note that all pronunciation is open to interpretation depending on whether you are an American English or English English speaker, it seems) and is entirely reproducible on your computers. This integration of R into dynamic document presentation is the foundation of literate programming and is well grounded in the process of reproducible research.

5. R is libre (open source) and gratis (freeware) software (https://www.gnu.org/licenses/gpl.html); think: freedom of speech (libre) and free as in beer (gratis).

Now, the R downsides that I want to put right out front for you:

1. R is a programming environment. If you are not used to developing programming code, or just don't have any experience programming, then the learning curve will be steep initially (but we will solve some of that today).

2. Because R is a programming environment, it will not 'do' things for you that you are used to having done for you by various programs. If you want R to do something with your data, you have to tell it to, and you have to know what the outcome should look like so you can ensure that what you told R to give you and what R gave you are the same thing.

Throughout this document, I am using notes that I have pulled together, presented elsewhere to other audiences, borrowed from friends, etc. So, I am probably not giving enough credit where it is due (I am admitting to plagiarism right here), but since these are notes and I want them to be as comprehensive as possible while focused on the issues I think you all need to know, too bad. The following is repeated from the quick start guide I sent out earlier, but I wanted it in here as well just for consistency. Your first stop(s) (preferably before we meet) are listed below; these are the main ones, but there are tons of other sites you can frequent if you're interested and do a little searching:

R Project website: https://www.r-project.org/
R FAQ (general/OS-specific FAQs on here)
R Manuals
CRAN (Comprehensive R Archive Network): https://cran.r-project.org/
R Search
Texas A&M University has worked up some R videos that are interesting: http://dist.stat.tamu.edu/pub/rvideos/

2 Starting in R

2.1 R Basics

First and foremost, these are not notes on how to do your particular kind of statistics. I won't be teaching statistics; I will be talking more about programming in the R language. Thus, you will see more text when I am talking about how R works, and less when I am applying R to a specific instance (e.g., to linear regression). With that in mind, we will not cover every bit of code/text in this document during the workshop; rather, I wanted this reference to be useful to you in the future, but I will highlight the immediately relevant parts as we work through the code examples for today. Working with a language like R requires 2 basic things: time and interaction; thus, you have to practice. So, your first goal will be to stop using whatever program you have been using for data management, manipulation, and analysis. For some of you, this is Excel, which is bad, because Excel sucks, makes bad graphs, and gives wrong answers, and, well, it sucks and you should not use it for anything other than simplifying data to .csv, cause it sucks. Excel SUCKS, DON'T USE IT!

fortune("microsoft excel")

Friends don't let friends use Excel for statistics!
   -- Jonathan D. Cryer (about problems with using Microsoft Excel for statistics)
      JSM 2001, Atlanta (August 2001)

I should note that, as I started to work up these course notes, I am probably starting at a level that is very basic to many of you. The reason for this is multi-fold: 1) I don't know what level of experience you all have, so I figure it is best to begin at the beginning, 2) a thorough understanding of the basics is important, as you will waste considerably more time with data formatting and getting data into R early in your career than you will actually running any analyses, and 3) I want the basics to be included so someone could run through everything without help. However, these notes are by no stretch of the imagination anything near comprehensive.

2.2 Simple Programming

Lets start with the most basic: R as a really nice calculator. For instance, if I need to add, I can add:

2 + 2
[1] 4

Amazing, right? I can, if I want, create a sequence of numbers going one direction or the other.

1:10
 [1]  1  2  3  4  5  6  7  8  9 10
10:1
 [1] 10  9  8  7  6  5  4  3  2  1

Yep, that's cool. I can create a plot (more on this later).

hist(rnorm(100), xlab = "", las = 1, main = "Yay, a plot!")

[Figure: histogram of 100 standard normal draws, titled "Yay, a plot!"; y-axis "Frequency"]

Wow, addition, number sequences, plots, all in little code snippets that are completely reproducible. What is this magic elixir you are showing us... well, I can write a function that tells you...

my.function = function(x) {
    ifelse(x > 1, "Bud Light is not good beer", "R is like good beer")
}
my.function(1)
[1] "R is like good beer"
my.function(2)
[1] "Bud Light is not good beer"

2.3 R Objects

Everything in R is an object, and each object has a set of attributes associated with it that describe the object's contents and how it can and should be used. Frequently, when working in R, several calculations may be dependent upon each other. Thus, you will want to save those results for future use by assigning them to an object. In R, the usual assignment operator is <- (e.g., x <- 1, so 1 is assigned to x, or 'x gets 1'). You are probably wondering why you cannot use an = sign (e.g., x = 1, so x is set equal to 1). Based on my work, they are fairly interchangeable, although I have run into some situations where <- was required when I was doing some simulations that required a for() loop. Most folks use <-, and unless you are writing functions (where you have to use argument = value) or the boolean equality operator (==), you can use either, but I prefer to just use an = sign. So, using the assignment operator, you can assign objects any name you choose. Object names can be upper- or lower-case letters, numbers, underscores (_), or periods (.).

Good programming practices are to give objects names that begin with a letter, not a number or a period. Also, realize that R is case sensitive (I have made this mistake many times). For instance, as I show below, I assigned a lower-case x to be 2 + 2 and an upper-case X to be 3 + 3, then printed both, and R will tell you whether or not x and X are equal (note this is one of those places where the = sign would cause problems: if I had not used the negation character (!), as in typing x = X, then x would get X, i.e., be set equal to 3 + 3).

x = 2 + 2
X = 3 + 3
x
[1] 4
X
[1] 6
x != X
[1] TRUE

Important to note here, there are certain letters, words, etc. that are used in R that you should not use. For instance, never, ever, ever, EVER call your datasets data; quoting B. Ripley from an R fortune, You would not call your dog, dog, would you? data() is actually an R function for loading datasets, so changing what it means is kind of a problem. Another example is c(), which is a function that concatenates data together into a vector and allows you to give the vector a name and operate on it as shown below; thus, changing what it does is probably not a good idea.

x <- c(2, 4, 6, 8)
x
[1] 2 4 6 8
x * 2
[1]  4  8 12 16

Finding out what you can or cannot label things is sometimes a trial and error process, but, when in doubt, use an underscore or a period in your object names as that reduces the chance of mis-naming something. And, when in doubt, you can always type the name into R and see if it is already used, such as:

c
function (..., recursive = FALSE)  .Primitive("c")

2.4 Classes and Modes

This part will be kind of painful, but it's important, so read it once and then move on. Since we now know how R uses the assignment operator to specify an object, we need to consider that each R object has attributes associated with it. Attributes describe the contents of the R

object and how the object can and should be used. Probably the most important attributes of an R object are the class and the mode of the object. There are several functions to evaluate the structure of your data, mainly mode, class, and str (which, as you see below, will tell you the mode and the value(s) of the object, which is very useful when dealing with data frames or lists).

x <- 4
x
[1] 4
mode(x)
[1] "numeric"
class(x)
[1] "numeric"
str(x)
 num 4

Thus, you can see that a number gets mode 'numeric' and class 'numeric'. Additionally, there are several other modes, mainly complex and raw, none of which you should expect to see with any frequency in your work unless you really get into the programming end of things. There is also a logical mode, which has values TRUE and FALSE (never T or F), and a character mode, which is a character string (specified with quotation marks).

mode(TRUE)
[1] "logical"
class(TRUE)
[1] "logical"
mode("Lyla")
[1] "character"
class("Lyla")
[1] "character"

Finally, you can verify whether an object has a particular mode or is a member of a particular class using one of several R functions that test if an object is of a specific type, where you can see that x is in fact numeric (TRUE) and not factor (FALSE). Some of these predicate functions are: is.numeric, is.factor, is.list, and so on.

x <- 2
mode(x)
[1] "numeric"

is.numeric(x)
[1] TRUE
is.factor(x)
[1] FALSE
mode(as.integer(2))
[1] "numeric"
class(as.integer(2))
[1] "integer"

But, you can also have mixed vectors of numeric and character values, which R will convert entirely to character values; you can change character values back to numeric using as.numeric.

test2 <- c("A", 2, "C")
test2
[1] "A" "2" "C"
class(test2)
[1] "character"
as.numeric(c("1", "2", "3"))
[1] 1 2 3

Finally, in addition to having classes and modes, vectors have a length attribute, which you can get using the length function.

length(c("A", 2, "C"))
[1] 3

Now, before we finish, we need to real quick touch on factors and how they are stored in R, as factor objects are of numeric mode, but with a class attribute such that character labels are displayed even though the storage mode is numeric. For example, stealing some notes from my friend Jeff, see the below. What is actually being stored in my.factor is a numeric vector c(2, 1, 3), because the levels are alphabetical; hence B is second so by default it gets a 2, A is first so it gets a 1, and so on.

my.factor <- c("B", "A", "C")
my.factor <- factor(my.factor)
mode(my.factor)
[1] "numeric"
class(my.factor)
[1] "factor"

my.factor
[1] B A C
Levels: A B C
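A classic trap follows from this storage scheme: calling as.numeric() directly on a factor returns the underlying level codes, not the original values. A minimal sketch of the safe round trip (my addition; the example factor is made up):

num.factor <- factor(c("10", "20", "30"))  # numbers stored as a factor
as.numeric(num.factor)                     # level codes: 1 2 3
as.numeric(as.character(num.factor))       # the actual values: 10 20 30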

3 R Project and Data Management

Here I want to talk a bit about managing R projects, importing and managing data, and manipulation of datasets in R. So, first of all, this will be a ridiculously short primer on the topic, as there are entire books written on data manipulation in R (e.g., Spector's Data Manipulation with R book) and there are quite a few really great packages for working with various types of data (if you have not ever heard of Hadley, then see his R packages for data manipulation). I will highlight a few of the base R methods, point out a few functions I find extremely useful, and then you're off on your own to go forth and prosper.

3.1 Working directories

So, now you have R installed and started on your computer. One thing that some folks find handy is to set a working directory, or a place where a particular project will be housed. You don't have to use a working directory, but it can be helpful to set one for projects in R that involve more than 30 seconds of thought. In simple terms, a working directory is exactly that: a directory where all the work on a particular project will be conducted, where your R session information will be saved, where R will look for any files or source functions you want to use when you are working, and where any output you create and write from R will go. There are several ways to set a working directory; for example, in Windows you could open R and go to File > Change dir and set the working directory to any location (for instance, you can create a folder called RCourse and put it in the Documents section of your computer). However, when I use working directories I tend to set the working directory specific to each analysis project that I conduct using setwd(), using the PATH format for each of the 3 standard operating systems (these are based off of my various work machines I use for R package builds and computer programming stuff; your paths will be different). Note that the slashes are forward (/) not back (\) slashes in the PATH name:

Linux: setwd("/home/bret/bretresearch/workshops/txtws_rworkshop/")
Windows: setwd("c:/users/bret.collier/documents/workshops/txtws_rworkshop/")
Mac: setwd("/users/bretcollier/bretresearch/workshops/txtws_rworkshop/")

There are several nice things about working directories, but the main one is that after you set a working directory, when you need to load data into your workspace, or save data or graphs, having the working directory set saves lots of time. For example, you could write a code snippet where you define where you want R to go look for the data you are interested in analyzing:

example.data <- read.csv("F:/Rio209.csv", header = TRUE)
head(example.data)
  ID   Lat   Lon   Date
  ...

Which shows R where the data file you want to load is, tells R to go out and load it, and then gives that data file the R object name example.data. If you type something wrong (which you will), you will get this:

bad.data <- read.csv("F:/RRio209.csv", header = TRUE)
Warning: cannot open file 'F:/RRio209.csv': No such file or directory
Error: cannot open the connection
head(bad.data)
Error: object 'bad.data' not found

However, if you are importing multiple datasets, or planning on exporting multiple datasets or graphics, then perhaps a better option is below, where you set a working directory first; then R knows where to go to look for your data, and where to put anything you output.

setwd("F:/")
same.data <- read.csv("Rio209.csv", header = TRUE)
head(same.data)
  ID   Lat   Lon   Date
  ...

Working directories have some downfalls, in that if you are sourcing in from various workspaces, or if all your R work is housed in a single workspace to simplify project management and package development (like mine is; ask if you want to see my setup), then using setwd() can be a pain. And yes, I know I just showed you how to load data and that was supposed to come later; don't freak out, it was just an example.

3.2 Importing and exporting data

Probably the simplest method for loading a small (or large) dataset when all the data is of the same mode is to use the brilliantly named set of read.foo functions, where 'foo' stands in for the format (.txt, .csv, etc.). So, just as an example, using the Rio209.csv file above, you can read it into your R session in a variety of ways. First, you can just read it straight in:

example.data <- read.csv("F:/Rio209.csv", header = TRUE)
head(example.data)
  ID   Lat   Lon   Date
  ...

You can identify a working directory and read it in from there:

setwd("F:/")
same.data <- read.csv("Rio209.csv", header = TRUE)
head(same.data)
  ID   Lat   Lon   Date
  ...

Ok, you're probably thinking: how in the heck do I know what the function is to read data in? Well, R has a nice little help operator, the question mark, which, when typed into the console in front of a function name, will open the help files for the R function of interest. For instance, ?read.table will open the help files for the read.table() function, and all the other read.foo() functions available in base R. For those of you who work with databases on a regular basis, there is an R package, RODBC, that is extremely useful for opening connections with various ODBC database structures and importing tables of data, either as is or using SQL queries to specify exactly what is needed.
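Exporting works the same way in reverse, via the matching write functions; a minimal sketch (my addition; the file name turkeys_out.csv is made up for illustration):

write.csv(same.data, "F:/turkeys_out.csv", row.names = FALSE)
# or, with the working directory already set, just give the file name:
write.csv(same.data, "turkeys_out.csv", row.names = FALSE)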

3.3 Creation of, types, and working with data: a super short primer

Vectors

We all know you really cannot do too much fancy mathematics on a scalar (a vector with 1 value), so we need to look into the wide variety of other methods for working with data in R. Now, we are going to start stepping into the creation and manipulation of several different types of data within R. You will see a wide variety of things coming up here: creation of data using random number generators, sequences of data, combinations of numeric and factor data, creation and manipulation of vectors and matrices, and operations on those vectors and matrices. This is probably what you were all more interested in, as I will start outlining some specific R functions for doing specific tasks. First, remember that you can create a simple vector as:

c.data <- c(10, 21, 13, 34, 25)
c.data
[1] 10 21 13 34 25

Ok, so now we have a vector called c.data in our workspace. R excels at vectorized operations, so we can do vectorized arithmetic on it, or write some code to estimate summary statistics for the data in the c.data vector:

c.data/2
[1]  5.0 10.5  6.5 17.0 12.5
xbar <- sum(c.data)/length(c.data)
xbar
[1] 20.6
std.dev <- sqrt(sum((c.data - xbar)^2)/(length(c.data) - 1))
std.dev
[1] 9.607289

Or, since R is a statistical program, we could just use the R internal functions for mean and standard deviation to get the same answers:

mean(c.data)
[1] 20.6
sd(c.data)
[1] 9.607289
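R has no built-in standard error function, so this is a natural spot to roll a quick one of your own; a minimal sketch (my addition), reusing c.data:

se <- function(x) sd(x)/sqrt(length(x))  # standard error of the mean
se(c.data)
[1] 4.29651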

Back to vectors for the time being. There are lots of other ways to create vector data using functions that create sequences of data. For example, we can use the colon (:) sequence operator or the sequence and repeat functions (seq, rep). For numeric arguments, a:b will generate a sequence of ordered data from a to b. If a and b are integers, then so is the sequence; if not, the values are of type double.

1:10
 [1]  1  2  3  4  5  6  7  8  9 10
7:11
[1]  7  8  9 10 11
2.5:5
[1] 2.5 3.5 4.5
5:-5
 [1]  5  4  3  2  1  0 -1 -2 -3 -4 -5

The colon (:) operator cannot be used with letters (e.g., A:F will not get you a vector of a, b, c, d, e, f), as R will expect the values to be named objects. So, you would typically work with sequences of letters and numbers by combining them (either brute force or via the interaction function) as factors:

num.factor <- factor(1:4)
alpha.factor <- factor(c("a", "b", "c", "d"))
num.factor:alpha.factor
[1] 1:a 2:b 3:c 4:d
16 Levels: 1:a 1:b 1:c 1:d 2:a 2:b 2:c 2:d 3:a 3:b 3:c 3:d 4:a 4:b ... 4:d
interaction(alpha.factor, num.factor)
[1] a.1 b.2 c.3 d.4
16 Levels: a.1 b.1 c.1 d.1 a.2 b.2 c.2 d.2 a.3 b.3 c.3 d.3 a.4 b.4 ... d.4

The colon operator is a simple method for vector creation, but we could also use the seq (sequence) function, which can be used with numerics, dates, and times, and we can make the sequences change by values other than +1 or -1:

seq(from = 2, to = 6, by = 1)
[1] 2 3 4 5 6
seq(2, 6, 1)
[1] 2 3 4 5 6
seq(2, 6, 0.5)
[1] 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
seq(-2, 2, 0.5)
[1] -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0
seq(from = as.Date("2010-04-02"), to = as.Date("2010-04-30"), by = 5)
[1] "2010-04-02" "2010-04-07" "2010-04-12" "2010-04-17" "2010-04-22"
[6] "2010-04-27"

In addition, there is a general rep (repeat) function that can be used to generate repeated sequences of vectors combining data of any mode. Underlying rep are arguments called 'each' and 'times', both of which affect how your data are put into the sequence:

rep(1:3, each = 3)
[1] 1 1 1 2 2 2 3 3 3
rep(1:3, times = 3)
[1] 1 2 3 1 2 3 1 2 3
rep(alpha.factor, each = 2)
[1] a a b b c c d d
Levels: a b c d
rep(alpha.factor, times = 2)
[1] a b c d a b c d
Levels: a b c d

Because R is so handy, we can actually nest various functions to create data sequences:

rep(seq(1, 4, 1), each = 3)
 [1] 1 1 1 2 2 2 3 3 3 4 4 4
rep(rep(c(1, 2), 2), each = 3)
 [1] 1 1 1 2 2 2 1 1 1 2 2 2
# or alternatively
rep(c(rep(1, 3), rep(2, 3)), 2)
 [1] 1 1 1 2 2 2 1 1 1 2 2 2

Now, one of the things we often want to do is look at a specific value in a vector. Luckily, values in your vector (or whatever data object you are using) are indexed by R, so we can extract a subset of a vector simply and efficiently, which not surprisingly is called subscripting. Subscripting can be inclusive (what to include) or exclusive (what to exclude), and the syntax can use names, numeric, or logical subscripts. So, as an example, consider the sequence of data from 5 to 50 by 5's, and suppose our interest is in extracting the 7th element.

sub.seq <- seq(5, 50, 5)
sub.seq
 [1]  5 10 15 20 25 30 35 40 45 50
sub.seq[7]
[1] 35

We could be interested in every value except for the 4th value, which we want to exclude:

sub.seq[-4]
[1]  5 10 15 25 30 35 40 45 50

You can extract or exclude more than one element:

sub.seq[c(2, 3, 6)]
[1] 10 15 30
sub.seq[-(3:5)]
[1]  5 10 30 35 40 45 50

In addition to regular subscripting, logical subscripting is a powerful method of subsetting data. Remember, there are quite a few logical operators (<, >, <=, >=; less than, greater than, less than or equal to, greater than or equal to) you have seen before. Equality uses a double = (==) and exclusion of equality uses the not (!) operator. A logical operation compares two objects using an operator and returns a logical vector (i.e., the vector will consist of TRUE or FALSE values):

sub.seq
 [1]  5 10 15 20 25 30 35 40 45 50
sub.seq > 30
 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
sub.seq < 30
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
sub.seq == 25
 [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
sub.seq != 25
 [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
sub.seq == 26
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Logical operators can be combined with other operators like & (and), | (or), and ! (not):

lt.50 <- sub.seq < 50
gt.20 <- sub.seq > 20
lt.50 & gt.20
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
!lt.50 | !gt.20
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE

and you can subscript with logicals as well:

sub.seq[lt.50 & gt.20]
[1] 25 30 35 40 45
sub.seq[!lt.50 | !gt.20]
[1]  5 10 15 20 50

Matrices

A matrix is a two-dimensional array of vectors, viewed as row vectors or column vectors, where each vector is of the same length and mode. Matrices are used for many research purposes in statistics, and typically consist of numeric variables. Matrices have 2 dimensions: the number of rows and the number of columns. One convenient way to create matrices is through the matrix function, or by using the diag() function to create a diagonal matrix. For example, consider a simple vector x that we want to put into a matrix with 4 rows and 2 columns:

x <- 1:8
dim(x) <- c(4, 2)
x
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8

Of course, we filled that matrix with data (vector x), but we could have just as easily created a matrix with all the same values, or created the matrix above using the matrix function:

my.matrix <- matrix(1, nrow = 4, ncol = 2)
my.matrix
     [,1] [,2]
[1,]    1    1
[2,]    1    1
[3,]    1    1
[4,]    1    1
my.matrix2 <- matrix(1:8, nrow = 4, byrow = TRUE)
my.matrix2
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8

You can look at the dimensions of your matrix using some of the internal functions in R, and you can alter the dimensions of a matrix as long as the overall size is the same:

ncol(my.matrix)
[1] 2
nrow(my.matrix)
[1] 4
dim(my.matrix)
[1] 4 2
dim(my.matrix) = c(2, 4)
my.matrix
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    1    1    1    1

and we can create diagonal matrices:

my.matrix <- diag(1, nrow = 4, ncol = 4)
my.matrix
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1
diag(my.matrix) <- 1:4
my.matrix
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    2    0    0
[3,]    0    0    3    0
[4,]    0    0    0    4

Obviously you can specify in what order you want the cells of a matrix filled:

matrix(1:16, nrow = 4)
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
matrix(1:16, nrow = 4, byrow = TRUE)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16

Now, we are not going to get into matrix algebra here, although the commands are readily available for things like multiplication (%*%), transpose (t()), and inverse (solve()); see the short sketch at the end of this section. But, we are going to talk about subscripting and some basic mathematics you can do on matrices. Multi-dimensional objects like matrices require a different approach to subsetting than vectors, as there is the option of an empty subscript or a null dimension. Consider the below matrix:

set.seed(10)
mat <- matrix(rpois(16, 4), nrow = 4)
mat
     [,1] [,2] [,3] [,4]
 ...

which has 4 rows and 4 columns. If we were interested in the element in the matrix that was in the 3rd row and the 3rd column, then we would extract that element (5), or, if

we wanted to extract an entire row, we could identify rows 1 through 4 of the first column as below:

mat[3, 3]
[1] 5
mat[1:4, 1]
[1] ...

However, matrix subscripting has been made much simpler: because of the null dimension, we can extract entire rows and columns simply and efficiently. The trick is to use a comma (,). Thus, for accessing entire rows and/or columns, you can just leave out the subscript for the dimension you are not interested in. Remember, R will return these values as a vector, not a matrix, so, if you want the information you extract to remain a matrix, you need to add drop = FALSE to your code (which means you could actually subscript from that matrix as well):

mat[, 1]
[1] ...
mat[2, ]
[1] ...
smaller.mat <- mat[1, , drop = FALSE]
smaller.mat
     [,1] [,2] [,3] [,4]
[1,]    4  ...
smaller.mat[1]
[1] 4

Remember our earlier discussion on logical subscripting? It works here too:

mat > 3
      [,1]  [,2]  [,3]  [,4]
[1,]  TRUE FALSE  TRUE FALSE
[2,] FALSE FALSE FALSE  TRUE
[3,] FALSE FALSE  TRUE FALSE
[4,]  TRUE FALSE  TRUE FALSE
mat[mat > 3]
[1] ...
mat[mat > 3] = -22
mat
     [,1] [,2] [,3] [,4]
 ...
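Since the matrix algebra operators were only name-dropped above, here is the promised minimal sketch (my addition; the small matrix A is made up for illustration):

A <- matrix(c(2, 0, 1, 3), nrow = 2)  # a small invertible matrix
b <- matrix(1:2, nrow = 2)
A %*% b        # matrix multiplication
t(A)           # transpose
solve(A)       # inverse
solve(A) %*% A # recovers the 2 x 2 identity matrix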

Dataframes

Dataframes are the typical structures folks use to store data for analysis (most of you would call a dataframe a spreadsheet). They are similar to matrices in that dataframes have column vectors of the same length (same number of rows), but different in that dataframes can have column vectors of different modes. Most data in ecology are of mixed modes, consisting of some combination of numeric, character, or factor information. So, it benefits us to learn how R treats data and what options there are for managing data. First, just so it is easiest, I am going to use a data file that currently resides in R called iris, which is the famous Anderson/Fisher iris measurement data (cm) for 50 flowers from each of 3 species of iris. There are a number of datasets provided with the base distribution of R, which you can see by typing data() into the R console. iris is actually an internal dataset that is distributed with R, so we will just load it from within R below. For this part, sometimes we just want to look over the data that we imported into R. Luckily, there are a couple of simple ways to look at the data file, or parts of it, within R. Using the iris data, we can extract relevant information on the dataframe using functions such as str and names, for instance:

data(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
[5] "Species"
summary(iris)
  Sepal.Length   Sepal.Width    Petal.Length   Petal.Width
 Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1
 1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3
 Median :5.80   Median :3.00   Median :4.35   Median :1.3
 Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2
 3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8
 Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5
       Species
 setosa    :50
 versicolor:50
 virginica :50

But, we could also be interested in working with specific columns within a dataframe. There are a couple of ways to access and manipulate/summarize specific columns of a dataframe in R. First, you can extract from a specific column by using the $, such as (just showing the first 10 records for simplicity):

iris$Sepal.Length[1:10]
 [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9

You can summarize those columns individually if you so choose:

mean(iris$Sepal.Length)
[1] 5.843333
var(iris$Sepal.Length)
[1] 0.6856935
sd(iris$Sepal.Length)
[1] 0.8280661
summary(iris$Sepal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.300   5.100   5.800   5.843   6.400   7.900

Or, another option that some people like when working in R is to attach their data using the attach function (see ?attach). Then you can directly access your data based on the column names without identifying the dataframe. I tend not to do this, as I don't like having dataframes attached, especially if I am working with multiple frames with the same column names (e.g., GPS data from multiple critters that have the same data columns), but I will quickly do it once for this example:

attach(iris)
Sepal.Length[1:10]
 [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
mean(Sepal.Length)
[1] 5.843333
detach(iris)
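A middle ground I find safer than attach() is with(), which evaluates an expression inside the dataframe without leaving anything attached afterwards; a quick sketch (my addition, base R):

with(iris, mean(Sepal.Length))
[1] 5.843333
with(iris, tapply(Sepal.Length, Species, mean))  # per-species means, as a preview of aggregation
    setosa versicolor  virginica
     5.006      5.936      6.588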

Lists

Lists are the most general structure in R and provide a way for the user to store a collection of data objects in one location, primarily because there is no limitation on the mode of the objects that a list may hold. Lists can have elements that contain any other object, such as a dataframe, a matrix, a vector, a scalar, etc. A list is a vector with mode list. But, lists are often weird for folks to understand, so as an example, first I am going to create a fairly simple list and do some subscripting and manipulation of that list; then, I will compile a more complicated list and show how to manipulate that one. First, consider a simple set of vectors:

x <- c(11, 34, 56, 17)
y <- c("Bret", "Reagan", "Kennedy", "Lyla")
z <- c(10)
x
[1] 11 34 56 17
y
[1] "Bret"    "Reagan"  "Kennedy" "Lyla"
z
[1] 10

Here, I am combining the vectors above into a list, which has a mode of 'list' and 3 uniquely named elements (list.a, list.b, list.c):

simple.list <- list(list.a = x, list.b = y, list.c = z)
mode(simple.list)
[1] "list"
simple.list
$list.a
[1] 11 34 56 17

$list.b
[1] "Bret"    "Reagan"  "Kennedy" "Lyla"

$list.c
[1] 10

Now, we can extract (via subscripting) elements from the list:

simple.list[1]
$list.a
[1] 11 34 56 17

simple.list[2]
$list.b
[1] "Bret"    "Reagan"  "Kennedy" "Lyla"

simple.list[3]
$list.c
[1] 10

Okay, so, based on what we know about R, we should be able to use an internal function like mean() on a list element and get the mean of the list.a portion, for instance.

mean(simple.list[1])
Warning: argument is not numeric or logical: returning NA
[1] NA

What, it gave us an NA? This is because simple.list[1] is actually a list containing the vector x, not the vector itself. So, to apply operations to elements of a list, you have to identify specifically the elements you want to analyze. In our (and most) situations, the elements of the list have been named, so you can access said elements using the name of the element with a dollar sign ($), like you would to extract columns from a dataframe (which makes sense, because, you may or may not have noticed, dataframes are lists where the list elements are the dataframe columns). Additionally, sometimes you want to access list elements via their index or a name; then you can use double brackets ([[ ]]) to subscript lists (this is especially important when writing functions that return lists as the function result).

mean(simple.list$list.a)
[1] 29.5
mean(simple.list[[1]])
[1] 29.5
mean(simple.list[["list.a"]])
[1] 29.5

Well, that is simple enough. Lets try a more complicated list just so we have an example in our notes. My little complicated list will be the c.data from earlier (a numeric vector of length 5), the iris dataframe, and the made-up character vector from earlier with my family's names in it (Bret, Reagan, Kennedy, Lyla):

complicated <- list(c.data = c.data, iris.data = iris, family = y)
str(complicated)
List of 3
 $ c.data   : num [1:5] 10 21 13 34 25
 $ iris.data:'data.frame':  150 obs. of  5 variables:
  ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
  ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
  ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
  ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
  ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ family   : chr [1:4] "Bret" "Reagan" "Kennedy" "Lyla"
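When a list gets large, it can help to limit how much str() prints; a small sketch (my addition) using the max.level argument:

str(complicated, max.level = 1)  # summarize only the top level of the list
List of 3
 $ c.data   : num [1:5] 10 21 13 34 25
 $ iris.data:'data.frame':  150 obs. of  5 variables:
 $ family   : chr [1:4] "Bret" "Reagan" "Kennedy" "Lyla"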

Now, lets assume I want to extract the first 10 rows of the list element iris.data and find the mean and variance of the iris dataframe column Sepal.Length. In addition, lets use the internal R function summary() to summarize the iris data for us as well, using a couple of different approaches.

complicated$iris.data[1:10, ]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
mean(complicated$iris.data$Sepal.Length)
[1] 5.843333
mean(complicated[[2]]$Sepal.Length)
[1] 5.843333
var(complicated[[2]]$Sepal.Length)
[1] 0.6856935
summary(complicated$iris.data)
  Sepal.Length   Sepal.Width    Petal.Length   Petal.Width
 Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1
 1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3
 Median :5.80   Median :3.00   Median :4.35   Median :1.3
 Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2
 3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8
 Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5
       Species
 setosa    :50
 versicolor:50
 virginica :50
summary(complicated[[2]])
  Sepal.Length   Sepal.Width    Petal.Length   Petal.Width
 Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1
 1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3
 Median :5.80   Median :3.00   Median :4.35   Median :1.3
 Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2
 3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8
 Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5
       Species
 setosa    :50
 versicolor:50
 virginica :50

A few thoughts on data manipulation

We need to talk about summarizing and/or aggregating data, as this is probably something that, at one time or another, you will have to do. Now, the different ways you can summarize data are pretty much limited only by your imagination or programming skills, so it is a huge waste of effort to focus on all the different ways to aggregate data; I am just going to scratch the surface here to give you a general idea of what can be done. First, R has a variety of internal functions set up that allow for efficient summarization of the various data types we discussed earlier, things like mean, median, or range, so just to repeat those here using the iris data:

summary(iris)
  Sepal.Length   Sepal.Width    Petal.Length   Petal.Width
 Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1
 1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3
 Median :5.80   Median :3.00   Median :4.35   Median :1.3
 Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2
 3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8
 Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5
       Species
 setosa    :50
 versicolor:50
 virginica :50
mean(iris$Sepal.Length)
[1] 5.843333
median(iris$Sepal.Length)
[1] 5.8
range(iris$Sepal.Length)
[1] 4.3 7.9

Often our interest is in aggregating data, and there are a ton of ways to do that, including table or subset:

dogs <- c("Springer", "Bulldog", "Springer", "Mutt", "Chihuahua", "Bulldog")
dog.table <- table(dogs)
dog.table
dogs
  Bulldog Chihuahua      Mutt  Springer
        2         1         1         2
dog.table["Springer"]
Springer
       2
as.data.frame(dog.table)
       dogs Freq
1   Bulldog    2
2 Chihuahua    1
3      Mutt    1
4  Springer    2

subset(iris, iris$Sepal.Length > mean(iris$Sepal.Length))[1:10, ]
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
51          7.0         3.2          4.7         1.4 versicolor
52          6.4         3.2          4.5         1.5 versicolor
53          6.9         3.1          4.9         1.5 versicolor
55          6.5         2.8          4.6         1.5 versicolor
57          6.3         3.3          4.7         1.6 versicolor
59          6.6         2.9          4.6         1.3 versicolor
62          5.9         3.0          4.2         1.5 versicolor
63          6.0         2.2          4.0         1.0 versicolor
64          6.1         2.9          4.7         1.4 versicolor
66          6.7         3.1          4.4         1.4 versicolor

Ok, getting a bit more complicated: this section is going to be about applying functions, either pre-defined or user-defined, to repeatedly conduct a set of calculations specific to different values of the data. Makes no sense, does it? Well, it will. As a real quick example, consider a simple matrix with 2 rows and 2 columns. Now, based on our previous examples, you would know how to create this matrix, perhaps using the matrix function. Now, matrices have dimensions, as you have often seen them described, such as 2x3, or 3x5, or 1x1 (which is a scalar, by the way). Now, in R, the dimensions of the matrix are referred to as margins, which will be important later. So, consider the following loop set up for getting the row sums and column sums from that matrix:

loop.matrix <- matrix(1:4, nrow = 2, ncol = 2)
loop.matrix
     [,1] [,2]
[1,]    1    3
[2,]    2    4
row.sums <- vector("numeric", nrow(loop.matrix))
# Loop over the rows and sum the elements

for (i in 1:nrow(loop.matrix)) row.sums[i] = sum(loop.matrix[i, ])
row.sums
[1] 4 6
col.sums <- vector("numeric", ncol(loop.matrix))
# Loop over the columns and sum the elements
for (i in 1:ncol(loop.matrix)) col.sums[i] = sum(loop.matrix[, i])
col.sums
[1] 3 7

What do you know, we have written a short piece of code to estimate row and column sums from a matrix. But, come on, this is R; there has to be something better. Luckily, there is: the family of apply statements. Now, you can do ?apply to look at the specifics, but in a nutshell it is apply(your data, the margin you are interested in, the function you want to apply to that margin). Remember that in a matrix there are 2 margins: rows (margin = 1) and columns (margin = 2).

loop.matrix
     [,1] [,2]
[1,]    1    3
[2,]    2    4
apply(loop.matrix, 1, sum)
[1] 4 6
apply(loop.matrix, 2, sum)
[1] 3 7

So, lets make up a little bit bigger matrix so we can mess with some data. Again, we just do a simple sum of the rows (margin = 1) and the columns (margin = 2):

big.matrix <- matrix(1:12, nrow = 3, ncol = 4)
big.matrix
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
apply(big.matrix, 1, sum)  # rows
[1] 22 26 30
apply(big.matrix, 2, sum)  # columns
[1]  6 15 24 33
apply(big.matrix, 1, mean)  # mean of rows
[1] 5.5 6.5 7.5
apply(big.matrix, 2, mean)  # mean of columns
[1]  2  5  8 11
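The third argument of apply() does not have to be a built-in like sum or mean; any function, including an anonymous one written on the spot, will do. A quick sketch (my addition), still using big.matrix:

apply(big.matrix, 2, function(x) max(x) - min(x))  # the range width of each column
[1] 2 2 2 2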

Note, however, that R also has pretty nice little functions for simple cases like sum, mean, etc. that will work in this case as well:

rowSums(big.matrix)
[1] 22 26 30
colSums(big.matrix)
[1]  6 15 24 33
rowMeans(big.matrix)
[1] 5.5 6.5 7.5
colMeans(big.matrix)
[1]  2  5  8 11

Also, note that if NA's exist, you can use na.rm in these apply functions:

big.matrix[2, 2] = NA
big.matrix
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2   NA    8   11
[3,]    3    6    9   12
apply(big.matrix, 1, sum)
[1] 22 NA 30
apply(big.matrix, 1, sum, na.rm = TRUE)
[1] 22 21 30
rowSums(big.matrix, na.rm = TRUE)
[1] 22 21 30

Most of the time, you will probably want the structure of the data you are looping over to be returned to you in the same form as your original data. If you have a list, then lapply is your friend. Making up a quick list of data and evaluating the list using lapply will return a list. Notice that c is a vector of character values, so when you try to take the mean, you should get an NA:

my.list <- list(a = 10:20, b = rnorm(10), c = c("a", "b", "A", "b", "A", "b", "A", "b"))
lapply(my.list, mean)
Warning: argument is not numeric or logical: returning NA
$a
[1] 15

$b
[1] ...

$c
[1] NA

If we don't want a list returned, we could use sapply, which would return a vector or a matrix:

sapply(my.list, mean)
Warning: argument is not numeric or logical: returning NA
  a   b   c
 15 ...  NA

Uses for the various apply statements are wide-ranging, so I am showing only quick examples here, as you will need to just go and play with them some to see what works best for you. Here is an example I use for estimating survival in a simulation model with demographic stochasticity (e.g., everyone survives based on a random draw from a binomial with probability equal to the user-defined survival estimate):

No.alive <- 100
low.survival <- 0.2
high.survival <- 0.7
low <- sapply(lapply(1, function(i) sample(x = c(1, 0), replace = T, size = No.alive,
    prob = c(low.survival, 1 - low.survival))), sum)
high <- sapply(lapply(1, function(i) sample(x = c(1, 0), replace = T, size = No.alive,
    prob = c(high.survival, 1 - high.survival))), sum)
low
[1] 21
high
[1] 73

There are some pretty useful internal R functions for aggregating data, such as, oh, I don't know, aggregate, which works pretty well, for instance, with the iris data:

aggregate(iris[, 1:4], list(species = iris[, 5]), mean)
     species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

3.4 Basic Mathematical/Operators

First, how do we use R as a calculator (and why are we doing this now and not at the beginning)? Since R is interactive, you want to use R to do some basic calculations so you get the hang of it, as the basic calculations are what build up to fairly complex calculations.

So, here are a few really quick examples showing how R can be used to get the result for any equation: type in the equation and it will return the result to you as shown below:

1 + 1
[1] 2
sqrt(8)
[1] 2.828427
exp(1)
[1] 2.718282

For each example, the result is a vector containing a single number. The [1] that you see before each value represents the fact that after R computes a result, it calls a generic (default) print function to display the contents of the vector. For example, you could call the print function explicitly:

print(1 + 1)
[1] 2
print(sqrt(8))
[1] 2.828427
print(sqrt(8), digits = 5)
[1] 2.8284
print(sqrt(8), digits = 10)
[1] 2.828427125

and, to be honest, you can probably get more precision than you would ever need:

print(sqrt(8), digits = 20)
[1] 2.8284271247461902909

R can do pretty much any basic mathematical operation you need.

2 + 2
[1] 4
4 - 2
[1] 2
2 * 2 * 2
[1] 8
2/2

[1] 1
sqrt(16)
[1] 4

In addition, R has a set of logical operators (?Logic) which can be used for a wide variety of manipulations. Consider the made-up data below for log.data and x:

log.data <- 1 + (x <- rpois(20, 1)/3)
x
 [1] ...
log.data
 [1] ...

You can do random number generation pretty simply and quickly (you will see more of this later on):

rnorm(10)
 [1] ...
rpois(10, 10)
 [1] ...

You can work with 'NA' values within your data in different ways. First, note that replacing the 3rd value in the log.data vector with an 'NA' means that using a simple function like 'mean' will return 'NA', because the log.data vector now contains missing values. When this occurs, R has some handy functions for handling data with missing values, usually via na.rm = TRUE or something like that:

log.data[3] = NA
mean(log.data)
[1] NA
mean(log.data, na.rm = TRUE)
[1] ...

Or, alternatively, you could do this and get the same answer:

log.data[3] = NA
newlog.data = na.omit(log.data)
mean(newlog.data)
[1] ...
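A related base tool worth knowing here is is.na(), which returns a logical vector you can use to locate or count the missing values before deciding how to handle them; a small sketch (my addition), on the same log.data vector:

is.na(log.data)        # TRUE at the missing position
which(is.na(log.data))
[1] 3
sum(is.na(log.data))   # how many values are missing
[1] 1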

And remember, R does arithmetic on vectors and matrices just fine:

c.data
[1] 10 21 13 34 25
2 * c.data
[1] 20 42 26 68 50
loop.matrix
     [,1] [,2]
[1,]    1    3
[2,]    2    4
loop.matrix - 5
     [,1] [,2]
[1,]   -4   -2
[2,]   -3   -1

Date and time

This will be a pretty quick section, as there are quite a few different ways to deal with date-time classes, and for the most part when you deal with them it will be in a categorization or subsetting context. Dates in R are pretty simple to deal with, and there are a variety of options for working with them. First, and probably the simplest introduction, is to just create a date in some format and play with it. So, for example:

Sys.time()           # prints the current date and time
as.Date(Sys.time())  # prints the current date
as.Date("2010/04/02")
[1] "2010-04-02"

Now, whats nice about dates is you can manipulate them pretty easily by changing the format string to get them into the format you are needing. Note, very important: when you are re-formatting dates, you have to use the exact same description in the format() command, e.g., if your date has a comma after the day, your format string has to have a comma after the %d as well or you will get an NA (see the example below using September 1, 1973).

as.Date("2010-04-02", format = "%Y-%m-%d")
[1] "2010-04-02"
as.Date("April 2, 2010", format = "%B %d, %Y")

34 [1] " " as.date("2april10", format = "%d%b%y") [1] " " as.date("september 1, 1973", format = "%B %d %Y") [1] NA as.date("september 1, 1973", format = "%b %d, %Y") [1] " " Or, we can also tell R to get out the current time to play with you can do something like this to see the current time and assign it a name: Sys.time() [1] " :56:36 CST" system.time <- Sys.time() str(system.time) POSIXct[1:1], format: " :56:36" system.time [1] " :56:36 CST" So now we have a object called system.time that has a date-time combination. Also not that I used str to look at it and it was of class POSIX, which is a common format for date-time values. I tend to use POSIX classes more frequently than most other date functions (e.g., chron package) because it stores time to the nearest second. POSIX data's input format is year, then month, then day, a space, then time in hours:minutes:seconds. POSIX works similarly to other date functions for manipulating dates between formats as well: time.posix <- c(" :00:30") as.posixct(time.posix) [1] " :00:30 CDT" class.date <- strptime("2/april/2010:08:01:27", format = "%d/%b/%y:%h:%m:%s") str(class.date) POSIXlt[1:1], format: " :01:27" You can work with dates pretty easily. For example, we can create a time sequence of dates and then run some general R functions on them looking at date ranges, mean date, time between dates, etc: seq(as.date(" "), by = "days", length = 5) [1] " " " " " " " " " " 34

dates <- seq(as.Date("2010-04-02"), by = "days", length = 30)
mean(dates)
[1] "2010-04-16"
range(dates)
[1] "2010-04-02" "2010-05-01"
summary(dates)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
"2010-04-02" "2010-04-09" "2010-04-16" "2010-04-16" "2010-04-23" "2010-05-01"
dates[8] - dates[1]
Time difference of 7 days
difftime(dates[8], dates[1], units = "hours")
Time difference of 168 hours
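Going the other direction, the same % codes work with format() to turn a date or date-time back into whatever text layout you need; a small sketch (my addition):

format(as.Date("2010-04-02"), "%B %d, %Y")
[1] "April 02, 2010"
format(as.POSIXct("2010-04-02 08:00:30"), "%d%b%y %H:%M")
[1] "02Apr10 08:00"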

4 R Creating Graphics

Few tools in ecology (or in any field, for that matter) are as powerful as a graphical representation of your data. We should use graphs as an analytical tool to assist with data visualization and analysis, but lots of times folks just use graphs to summarize statistical results from their analysis (e.g., showing means). I can talk on and on about graphs in R (see Collier 2008), but so as to not waste time: concisely, there are not too many graphs you cannot make in R, period. So, to show you some examples of graphs, I am going to create several datasets; some will be simple univariate simulations, some will be a bit more complicated dataframes. The reason I am using simulated data is two-fold: 1) using simulated data helps you to understand the structure of the data because you created it, so you know what it should look like and can use the tools we learned earlier to get other datasets into the correct format, and 2) it's fairly simple to simulate a wide range of data types quickly and efficiently, rather than try to load individual datasets, walk through all the manipulation in this document, and then do some examples (although that is on the horizon). One thing it is important to notice: lots of times you will see the same arguments show up in different plotting commands. That is because R uses the same graphical parameters, like col for defining the color you want, across many plotting functions. Remember this, its handy (see ?par for more details).

4.1 Scatterplots

So, lets start with the easy stuff, a scatterplot. In the simplest sense, we can create plots in R pretty quickly with short statements.

plot(rnorm(100, 10, 1), main = "A scatterplot")

[Figure: "A scatterplot"; y-axis "rnorm(100, 10, 1)", x-axis "Index"]

A few things to notice. First, this is just a very simple scatter of made-up points; there is no rhyme or reason to them. Second, there is really no relationship between the x-axis values and the y-axis values, as basically I just simulated 100 points from a normal

distribution with a mean of 10 and a standard deviation of 1 (?Normal), and the index (x-axis) is the order they were simulated in. Also, note that R provides some default axis labels: the y-axis is basically what was called with the plot command from above, and the x-axis value Index is just the order of simulation, as defined before. Nothing is really formatted, the axis labels are laying the wrong way on the y-axis, the font is weird, so on and so forth. Ok, but we can obviously do more with R than some dumb scatterplot. What if, for example, we have data where we actually have a good reason to label the axes correctly, such as data relating counts of the eyeworms in the eyes of Quaily Mc'OweMyEyesHurt[1] to the mass of Quaily Mc'OweMyEyesHurt (this is Texas and lots of people seem to care about eyeworm numbers right now, so its topical, but see the footnote...). Below I made up a completely ridiculous dataframe for example plotting purposes only:

set.seed(10002)
worms = round(rnorm(50, 66, 10), digits = 0)
presence = factor(round(rbinom(50, 1, 0.7), digits = 2))
mass = worms + 3 * (round(rnorm(50, 0.25, 0.2), digits = 2))
long = worms * (round(rnorm(50, 60, 15), digits = 2))
group <- factor(rep(1:5, 10))
quaily <- data.frame(worms, presence, mass, long, group)
attach(quaily)
The following objects are masked _by_ .GlobalEnv:
    group, long, mass, presence, worms
str(quaily)
'data.frame':   50 obs. of  5 variables:
 $ worms   : num  ...
 $ presence: Factor w/ 2 levels "0","1": ...
 $ mass    : num  ...
 $ long    : num  ...
 $ group   : Factor w/ 5 levels "1","2","3","4",..: ...
head(quaily)
  worms presence mass long group
  ...

So, we can see that our dataframe quaily has a couple of continuous variables, a couple of factor variables, and is all-around ridiculous. But, lets go ahead and plot some data anyway. So, what do we see when we look at this figure?

- The axis values look approximately correct (although notice that there are a few values on the graph >80, yet the x-axis only goes to 80), so we will probably want to adjust those;

[1] There is no such thing as a Quaily, and it obviously does not reference any species in Texas, and as far as I know, Mc'OweMyEyesHurt is not a real word.

- The numbers at each tick mark on the y-axis are parallel to the axis, which makes them harder to read;
- The graph is contained in a box; neither good nor bad, it's more of a preference thing;
- The labels for each axis are correct, but they do not convey much information;
- There is no figure title (not that it's needed).

plot(mass, worms)

[Figure: default scatterplot of worms against mass]

So, there are quite a few things we might want to change with this graph, correct? Well, lets change them. When you want to change things in your graph, ?par is your friend. ?par provides a detailed list of the many options for manipulating graphs in R. So, lets make it pretty:

plot(mass, worms, las = 1, main = "This is a Wormy Quaily Graph", ylab = "Quaily Worms",
    xlab = "Quaily Fatness", pch = 19, col = "red", xlim = c(40, 90), ylim = c(40, 90))

[Figure: "This is a Wormy Quaily Graph"; y-axis "Quaily Worms", x-axis "Quaily Fatness"]

Wow, pretty. Amazingly, when you want to find a relationship, you can! For our example on quaily and eyeworms, it looks as if there is a positive relationship between worm numbers and quaily mass, so what if we wanted to add a plot of the linear regression line to this plot? Well, we could run the regression, fit the line, and, just for kicks, fit the vertical error distances as well.

plot(mass, worms, las = 1, main = "This is a Wormy Quaily Graph", ylab = "Quaily Worms",
    xlab = "Quaily Fatness", pch = 19, col = "red", xlim = c(40, 90), ylim = c(40, 90))
quaily.reg <- lm(worms ~ mass)
abline(quaily.reg, col = "blue", lwd = 2)
fit.quaily <- fitted(quaily.reg)
segments(mass, fit.quaily, mass, worms, col = "blue")

[Figure: the same "Wormy Quaily Graph" with the fitted regression line and vertical error segments added]

So, you get the idea that there are all kinds of cool ways to manipulate data and make graphs. Below I will show a few examples of different types of plots that are typically used. I tried to keep most of these examples fairly consistent with what can be easily found in either the R help files for each plot type, or what you would find when you google 'R barplot', so that you will be able to find some additional examples later and match them to what we did in class.

4.2 Other Simple plots

So, barplots, the workhorse of wildlife ecologists (and often called histograms; don't do that). Using the mtcars dataset in base R, a quick barplot:

data(mtcars)
count = table(mtcars$gear)
barplot(count, main = "Example Barplot", xlab = "Gear number")

[Figure: "Example Barplot"; x-axis "Gear number"]

Wow, that is simple. How about this one, just a slightly different example.

# Grouped Bar Plot
counts <- table(mtcars$cyl, mtcars$gear)
barplot(counts, main = "Car Distribution by Gears and Cylinders", xlab = "Number of Cylinders",
    col = c("red", "yellow", "blue"), legend = rownames(counts), beside = TRUE, las = 1)

[Figure: "Car Distribution by Gears and Cylinders" grouped barplot]

What about confidence intervals? We need to do those, right? Here are a couple of different ways to add confidence intervals to a barplot, or just create confidence intervals (straight from the plotCI help file).

library(plotrix)
data(warpbreaks)
attach(warpbreaks)
err = 0.1
y = runif(10)
wmeans <- by(warpbreaks$breaks, warpbreaks$tension, mean)
wsd <- by(warpbreaks$breaks, warpbreaks$tension, sd)
# note that barplot() returns the midpoints of the bars, which plotCI uses as x-coordinates
plotCI(barplot(wmeans, col = "gray", ylim = c(0, max(wmeans + wsd))), wmeans, wsd, add = TRUE)

[Figure: barplot of mean warp breaks by tension (L, M, H) with standard deviation error bars]

# using labels instead of points
labs <- sample(LETTERS, replace = TRUE, size = 10)
plotCI(1:10, y, err, pch = NA, gap = 0.02, main = "plotCI with labels at points", las = 1)
text(1:10, y, labs)

[Figure: "plotCI with labels at points"; letters plotted at each point with error bars]

Now, there are tons of ways to do this; lots of R packages can be used to add confidence intervals, some more elegantly than others. But, its important to realize that you can do it for many different types of plots, for instance, a logistic regression:

set.seed(123)
mydata = data.frame(response = rbinom(100, 1, 0.5), Predictor = rnorm(100, 100, 50))
attach(mydata)
test.glm = glm(response ~ Predictor, family = "binomial")
predict.data = seq(4, 496, 4)
y = plogis(test.glm$coefficients[1] + test.glm$coefficients[2] * predict.data)
xy = data.frame(Predictor = predict.data)
yhat = predict(test.glm, xy, type = "link", se.fit = TRUE)
upperlogit = yhat$fit + 1.96 * yhat$se.fit
lowerlogit = yhat$fit - 1.96 * yhat$se.fit
ucl = plogis(upperlogit)
lcl = plogis(lowerlogit)
plot(predict.data, y, ylim = c(0, 1), type = "l", lwd = 2, ylab = "Prob(Success)",
    xlab = "Predictor Variable", xaxt = "n", las = 1)
axis(1)
lines(predict.data, ucl, lty = 2, lwd = 2)
lines(predict.data, lcl, lty = 2, lwd = 2)

[Figure: predicted probability of success across the predictor, with dashed 95% confidence bands; y-axis "Prob(Success)", x-axis "Predictor Variable"]

Another simple one is a dotchart:

y1 <- runif(2)
g <- c("0-50", "50-100")
dotchart(y1, g, pch = 20, xlim = c(0, 1))

which can be used to create some pretty elegant graphs rather quickly that show lots of data, for instance using the mtcars dataframe:

x <- mtcars[order(mtcars$mpg), ]  # sort by mpg
x$cyl <- factor(x$cyl)  # it must be a factor

x$color[x$cyl == 4] <- "red"
x$color[x$cyl == 6] <- "blue"
x$color[x$cyl == 8] <- "darkgreen"
dotchart(x$mpg, labels = row.names(x), cex = 0.7, groups = x$cyl,
         main = "Gas Mileage", xlab = "Miles Per Gallon", gcolor = "black",
         color = x$color)

[Figure: "Gas Mileage" dotchart, each car model's Miles Per Gallon, grouped and colored by cylinder count]

Again, there are many ways to create a graph; here are some examples using ggplot2 and the mtcars data.

# create factors with value labels
library(ggplot2)
mtcars$gear <- factor(mtcars$gear, levels = c(3, 4, 5), labels = c("3gears",

46 "4gears", "5gears")) mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "Manual")) mtcars$cyl <- factor(mtcars$cyl, levels = c(4, 6, 8), labels = c("4cyl", "6cyl", "8cyl")) # Scatterplot of mpg vs. hp for each combination of gears and cylinders in # each facet, transmittion type is represented by shape and color qplot(hp, mpg, data = mtcars, shape = am, color = am, facets = gear ~ cyl, size = I(3), xlab = "Horsepower", ylab = "Miles per Gallon") Miles per Gallon cyl 6cyl 8cyl 3gears 4gears 5gears Horsepower am Automatic Manual And another example on the same data qplot(mtcars$gear, mtcars$mpg, data = mtcars, geom = c("boxplot", "jitter"), fill = gear, main = "Mileage by Gear Number", xlab = "", ylab = "Miles per Gallon") 46

We can even plot spatial locations quickly and easily; for instance, here are some Texas turkey GPS locations (more on this later)...

suppressPackageStartupMessages(library(moveud))
data(rawturkey)
par(mfrow = c(2, 1))
plot(rawturkey$lon, rawturkey$lat, main = "Unedited Points", pch = 20,
     col = "red", xlab = "Longitude", ylab = "Latitude", las = 1,
     cex.axis = 0.7)
newrawturkey = rawturkey[rawturkey$lon < -98.1, ]
plot(newrawturkey$lon, newrawturkey$lat, main = "Edited Points", pch = 20,
     col = "red", xlab = "Longitude", ylab = "Latitude", las = 1,
     cex.axis = 0.7)

[Figure: two-panel plot of turkey GPS locations, "Unedited Points" and "Edited Points", Latitude vs. Longitude]

5 Statistical Models with R

5.1 Contingency Tables

Obviously, being able to enter data in contingency tables is a pretty useful skill. As a quick example, so you get a feel for it, let's create a quick 9 by 2 contingency table in R using some nest predator data from Dreibelbis et al. (2008). Note that two-way tables need to be matrix objects. Now, because we wanted the data entered column-wise, we used byrow=F (which would have been the default if we had not included the byrow=F). You can see what happens if you don't define byrow= by changing it from F to T. Ok, so now we have defined the object class.status, but it is just a matrix with no column or row headings. This is important, as many times you will want to add column and row headings to your data. Easiest way: use colnames or rownames.

class.status <- matrix(c(0, 0, 2, 4, 2, 0, 2, 1, 3, 1, 1, 1, 2, 7, 3, 0, 0,
    4), nrow = 9, byrow = F)
class.status

      [,1] [,2]
 [1,]    0    1
 [2,]    0    1
 [3,]    2    1
 [4,]    4    2
 [5,]    2    7
 [6,]    0    3
 [7,]    2    0
 [8,]    1    0
 [9,]    3    4

colnames(class.status) <- c("2006", "2007")
rownames(class.status) <- c("Nine-banded Armadillo", "Bobcat", "Feral hog",
    "Gray fox", "Common raccoon", "Common raven", "Striped skunk",
    "Texas rat snake", "Total multiple predator events")
class.status

                               2006 2007
Nine-banded Armadillo             0    1
Bobcat                            0    1
Feral hog                         2    1
Gray fox                          4    2
Common raccoon                    2    7
Common raven                      0    3
Striped skunk                     2    0
Texas rat snake                   1    0
Total multiple predator events    3    4

Often we will have data in some sort of a dataframe where we have 1 row for each data point in the dataset. So, let's try some examples using our earlier dataset called quaily. Now, since you have quaily loaded, let's play a bit with it by using the function table. Using table() we can look at the raw counts of the number of times parasites were present, a cross-tab of parasite presence by group, and we can even look at the proportion of each count that falls in each category using the function prop.table().
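Before we try those on quaily, one quick aside worth knowing: prop.table() also takes a margin argument when you want row or column proportions rather than overall cell proportions. A quick sketch on a built-in dataset (my example, not the quaily data):

tab <- table(mtcars$cyl, mtcars$gear)
prop.table(tab)              # each cell as a proportion of the grand total
prop.table(tab, margin = 1)  # proportions within each row
prop.table(tab, margin = 2)  # proportions within each column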

head(quaily)

  worms presence mass long group

table(quaily$presence)
table(quaily$presence, quaily$group)
table.quaily <- table(quaily$presence, quaily$group)
prop.table(table.quaily)

Tables can get extremely complicated really quickly, and R can make looking at data using tables pretty easy (e.g., see ftable or xtabs as other options for looking at tabular data). But, what if we are interested in conducting some statistical evaluations on data in tables? The list is endless of what you can do, but let's do an example of a test of independent proportions and a chi-square test on a 2 by 2 contingency table. Let's assume that our data consists of the number of juvenile and adult fish that successfully survived some experimental testing done over in the Biology (nerd's) building.

fish.not.dead <- c(10, 6)
fish.total.tested <- c(20, 21)
prop.test(fish.not.dead, fish.total.tested)

2-sample test for equality of proportions with continuity correction

data:  fish.not.dead out of fish.total.tested
X-squared = 1.179, df = 1, p-value =
alternative hypothesis: two.sided
95 percent confidence interval:
sample estimates:
prop 1 prop 2
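As an aside, with counts this small you might also want an exact test in your pocket. A minimal sketch using Fisher's exact test on the same fish data; the matrix is built as successes and failures, which parallels the chisq.test formatting described next (the fish.matrix name and construction are mine):

fish.matrix <- matrix(c(10, 6, 10, 15), nrow = 2)  # col 1 = survived, col 2 = died
fisher.test(fish.matrix)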

So these prop.test results indicate no difference between the proportions (see the p-values and such in the output). What about a χ² test? First, we have to turn our data into a matrix, as that is the required formatting (see the Arguments section under ?chisq.test). Also note that because we are running this using chisq.test, the second column of the table has to be the number of negative outcomes (failures: 10 & 15) as opposed to the totals (20 & 21) as given above. For a 2 × 2 table, the results using prop.test and chisq.test are equivalent.

chi.data <- matrix(c(10, 6, 10, 15), 2)
chi.data

     [,1] [,2]
[1,]   10   10
[2,]    6   15

chisq.test(chi.data)

Pearson's Chi-squared test with Yates' continuity correction

data:  chi.data
X-squared = 1.179, df = 1, p-value =

We can also do r × c contingency tables. Consider the data from class.status above.

class.status

                               2006 2007
Nine-banded Armadillo             0    1
Bobcat                            0    1
Feral hog                         2    1
Gray fox                          4    2
Common raccoon                    2    7
Common raven                      0    3
Striped skunk                     2    0
Texas rat snake                   1    0
Total multiple predator events    3    4

chisq.test(class.status)

Warning: Chi-squared approximation may be incorrect

Pearson's Chi-squared test

data:  class.status
X-squared = 11.43, df = 8, p-value =

chisq.test(class.status)$expected

Warning: Chi-squared approximation may be incorrect

Nine-banded Armadillo

Bobcat
Feral hog
Gray fox
Common raccoon
Common raven
Striped skunk
Texas rat snake
Total multiple predator events

Notice that chisq.test includes more information than is printed by default. Always remember this about R: you can see what is included in the function using some of the tricks from earlier. For example, you can use str to determine what is included in function chisq.test. But, if you want to know what is included in the information after calling chisq.test on our data, then you could use str and extract the contents of your function call; then, it is simply a matter of identifying what you are interested in extracting, and pulling it from the list identified above. For example, you can see that we have a list containing 9 different objects, all identified using the $ operator.

str(chisq.test)

function (x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)),
    rescale.p = FALSE, simulate.p.value = FALSE, B = 2000)

str(chisq.test(class.status))

Warning: Chi-squared approximation may be incorrect

List of 9
 $ statistic: Named num
  ..- attr(*, "names")= chr "X-squared"
 $ parameter: Named int 8
  ..- attr(*, "names")= chr "df"
 $ p.value  : num
 $ method   : chr "Pearson's Chi-squared test"
 $ data.name: chr "class.status"
 $ observed : num [1:9, 1:2]
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:9] "Nine-banded Armadillo" "Bobcat" "Feral hog" "Gray fox" ...
  .. ..$ : chr [1:2] "2006" "2007"
 $ expected : num [1:9, 1:2]
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:9] "Nine-banded Armadillo" "Bobcat" "Feral hog" "Gray fox" ...
  .. ..$ : chr [1:2] "2006" "2007"
 $ residuals: num [1:9, 1:2]
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:9] "Nine-banded Armadillo" "Bobcat" "Feral hog" "Gray fox" ...
  .. ..$ : chr [1:2] "2006" "2007"
 $ stdres   : num [1:9, 1:2]
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:9] "Nine-banded Armadillo" "Bobcat" "Feral hog" "Gray fox" ...
  .. ..$ : chr [1:2] "2006" "2007"
 - attr(*, "class")= chr "htest"

chisq.test(class.status)$observed

Warning: Chi-squared approximation may be incorrect

                               2006 2007
Nine-banded Armadillo             0    1
Bobcat                            0    1
Feral hog                         2    1
Gray fox                          4    2
Common raccoon                    2    7
Common raven                      0    3
Striped skunk                     2    0
Texas rat snake                   1    0
Total multiple predator events    3    4

chisq.test(class.status)$expected

Warning: Chi-squared approximation may be incorrect

Nine-banded Armadillo
Bobcat
Feral hog
Gray fox
Common raccoon
Common raven
Striped skunk
Texas rat snake
Total multiple predator events

5.2 Linear Regression

The basic aim of this section is to get you comfortable with general approaches to regression analysis. The methods build on each other, but for the most part remain consistent. First, I will outline a simple linear regression with one response and one predictor variable, then discuss how this relates to analysis of variance. I will follow with multiple regression (2 or more predictor variables) and generalized linear models for binary and count data.

Linear regression, the workhorse of statistical methodology, is used to explain the relationship between 2 variables, primarily focused on how one variable impacts the level of another variable. Just because I want to see how to put a formula in LYX, here is the basic equation for linear regression:

y_i = \alpha + \beta x_i + \epsilon_i

You all have seen this, so we will not belabor the point. But, how do we do linear regression in R? We do it well...

lm(quaily$worms ~ quaily$mass)

Call:
lm(formula = quaily$worms ~ quaily$mass)

Coefficients:

(Intercept)  quaily$mass

Doesn't seem like much when you do it like that, does it? I mean, R pretty much just shows us the function call and the estimated beta coefficients. Is that all R did? Why are we here again? Now, R does other things, but remember earlier when I said R would not give you things, you had to ask for them? Well, now it's time to learn how to ask. First, you can use the summary function to extract a little bit more information (you can ignore the useFancyQuotes code; it is so I could output the summary in a pdf, something screwy with Sweave and R). So, what did we get using summary? Well, our call to lm() created a model object (just as the chi-square test we used earlier did) consisting of several parts. First, we have a repeat of the function call, then a summary of the distribution of the residuals, then the model coefficients are printed, followed by some various information on model fit.

options(useFancyQuotes = FALSE)
summary(lm(quaily$worms ~ quaily$mass))

Call:
lm(formula = quaily$worms ~ quaily$mass)

Residuals:
   Min     1Q Median     3Q    Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                    *
quaily$mass                              <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 48 degrees of freedom
Multiple R-squared: 0.996, Adjusted R-squared:
F-statistic: 1.22e+04 on 1 and 48 DF, p-value: <2e-16

Pretty cool, huh. Now, what if we just wanted to extract the coefficients without all the other stuff? Remember earlier I told you we would be using some stuff later? Here goes: first you can use names to see what is contained within the model object.

example.regression <- lm(quaily$worms ~ quaily$mass)
names(example.regression)

 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

Notice there is one in there called coefficients, so we can probably get those out in a couple other ways:

example.regression$coefficients

(Intercept) quaily$mass

coef(example.regression)

(Intercept) quaily$mass

summary(example.regression)$coefficients

            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                 e-02
quaily$mass                                 e-59

str(summary(example.regression)$coefficients)

 num [1:2, 1:4]
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2] "(Intercept)" "quaily$mass"
  ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"

summary(example.regression)$coefficients[2, ]

  Estimate Std. Error    t value   Pr(>|t|)
 1.010e                                e-59

Now, while you saw this example earlier, it is probably worthwhile to redo it here to show how you can also build plots based off of your linear regression analysis simply and efficiently.

plot(mass, worms, las = 1, main = "This is a Wormy Quaily Graph",
     ylab = "Quaily Worms", xlab = "Quaily Fatness", pch = 19, col = "red",
     xlim = c(40, 90), ylim = c(40, 90))
abline(example.regression, col = "blue", lwd = 2)
fit.quaily <- fitted(example.regression)
segments(mass, fit.quaily, mass, worms, col = "blue")

[Figure: the Wormy Quaily scatterplot again, with fitted line and residual segments]
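While we are extracting things from lm objects, confint() is another handy extractor: it returns Wald confidence intervals for the estimated coefficients. A quick sketch (my addition, using the same fitted model as above):

confint(example.regression)               # default 95% intervals
confint(example.regression, level = 0.9)  # or any level you like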

Ok, so what if we wanted to see the values used for developing this plot, or the residuals (the difference between observed and expected)?

fitted(example.regression)
resid(example.regression)
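And since we have the fitted values and residuals in hand, note that R's built-in regression diagnostics are a single plot() call away. A quick sketch (my addition; see ?plot.lm for the details):

par(mfrow = c(2, 2))
plot(example.regression)  # residuals vs. fitted, Q-Q, scale-location, leverage
par(mfrow = c(1, 1))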

You can obviously extend your linear regression to multiple predictor variables following the same approach as above for main effects models.

multi.regression <- lm(quaily$worms ~ quaily$mass + quaily$long)
summary(multi.regression)

Call:
lm(formula = quaily$worms ~ quaily$mass + quaily$long)

Residuals:
   Min     1Q Median     3Q    Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)
quaily$mass                              <2e-16 ***
quaily$long
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 47 degrees of freedom
Multiple R-squared: 0.996, Adjusted R-squared:
F-statistic: 6.02e+03 on 2 and 47 DF, p-value: <2e-16

multi2.regression <- lm(quaily$worms ~ quaily$mass * quaily$long)
summary(multi2.regression)

Call:
lm(formula = quaily$worms ~ quaily$mass * quaily$long)

Residuals:
   Min     1Q Median     3Q    Max

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)
quaily$mass                                          <2e-16 ***
quaily$long
quaily$mass:quaily$long
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 46 degrees of freedom
Multiple R-squared: 0.996, Adjusted R-squared:
F-statistic: 3.97e+03 on 3 and 46 DF, p-value: <2e-16

You can do lots with basic regression; see ?lm for more details.

5.3 Generalized Linear Models

GLMs are characterized by the use of a 'link' function which provides the relationship

between the predictor variables and the expected value of the response variable. Probably the most common GLM is logistic regression, but a GLM with a normal link function would give the same results as linear regression modeling with lm. In R, there is a nice little function called glm for running generalized linear models. As an example, here is some bird count data we can use for doing some logistic regression analysis.

bird.data <- read.table("f:/bretresearch/workshops/txtws_rworkshop/birddata.txt",
    header = TRUE, colClasses = c("numeric", "numeric", "factor", "numeric"))
str(bird.data)

'data.frame': 154 obs. of 4 variables:
 $ present: num
 $ area   : num
 $ reg    : Factor w/ 4 levels "5","6","7","8":
 $ canopy : num

head(bird.data)

  present area reg canopy

So, we have a simple dataset for some bird surveys on which presence or absence was measured, and we want to see if presence/absence is influenced by either the area of habitat (in hectares), the region of the state (factor variable with 4 levels), or the percentage of canopy cover (range from 0-100). Now, glm has a trick to it you have to remember, although if you do ?glm you would see it in the help file. When you specify glm, you have to define a value for 'family', which tells R which link function from the exponential family to use to relate the predictors to the response variable. Since we are dealing with binary data, we will use binomial to define family.

bird.model <- glm(bird.data$present ~ bird.data$area, family = "binomial")

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(bird.model)

Call:
glm(formula = bird.data$present ~ bird.data$area, family = "binomial")

Deviance Residuals:
   Min     1Q Median     3Q    Max

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)
bird.data$area                                    *

---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance:     on 153 degrees of freedom
Residual deviance: on 152 degrees of freedom
AIC:

Number of Fisher Scoring iterations: 8

Maybe we want to see how the predicted probability of presence changes over area. Looking at the summary above, we can see that the intercept and slope are both positive, so we would expect a positive impact of area on presence. We can show that several ways, but probably using a graphic would be best.

plot(bird.data$area, fitted(glm(bird.data$present ~ bird.data$area,
     family = "binomial")), xlab = "Area (ha)", ylab = "Probability present")

[Figure: fitted probability of presence plotted against Area (ha)]

But, this plot is really not that pretty, what with all the dots and stuff. What say we try another way to clean it up a bit. First, we do a bit of data manipulation so that we can use the predict function in R, which is a pretty useful little function. Now, if you look at the above figure you see that we are pretty much plotting the predicted response (presence probability) for each level of area for which we have data. But, what if we wanted to know what the prediction looked like for area sizes we did not observe? Well, that is pretty simple to do. First, we do a bit of data manipulation where we define a new variable for area (Area) which ranges from 0 to 10,000 (it could have been any value: 100, 10,000, etc.), and then we use predict to estimate the presence probability for each value of Area. I used head below to show the first 6 values of the predictions.

attach(bird.data)
bird.predict <- glm(present ~ area, family = "binomial")

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Area <- seq(0, 10000, 1)
new.area <- data.frame(area = Area)
predicted.bird <- predict(bird.predict, new.area, type = "resp")
head(predicted.bird)
plot(predicted.bird ~ Area, type = "l", las = 1, ylab = "Probability present")

[Figure: predicted probability of presence vs. Area, drawn as a smooth line]

Well, what do you know, a prettier graph. But, what would a journal editor say? Where are the confidence limits? This takes a bit of tweaking and there are several ways to do this, but here is the one I tend to like. First, you have to do a little data manipulation using predict. Notice that I changed the type to link; there is a reason. Above, when I used type="resp" I was predicting on the 'real' scale, or the actual predicted probabilities for each level of Area. But, when you build your confidence intervals based on the real-scale values, you can get estimates >1, which cannot happen. This is my crude hack that I use all the time to build confidence intervals when I am working on logistic regression models. So, first, I bring in the non-logit-transformed estimates for each value of Area, and I build a confidence interval for each level.

pred.cl <- predict(bird.predict, new.area, interval = c("confidence"),
    level = 0.95, type = "link", se.fit = TRUE)
uppercl <- pred.cl$fit + 1.96 * pred.cl$se.fit  # 95% limits on the logit scale
lowercl <- pred.cl$fit - 1.96 * pred.cl$se.fit

Now, plotting these is pretty simple using the lines function. Note I am extending the

y-axis here so that I can add data to the graph showing the spread of the sites with detections or not (shown using the points statement below; in green on the graphic).

plot(predicted.bird ~ Area, type = "l", las = 1, ylab = "Probability present",
     ylim = c(0, 1))
lines(plogis(uppercl), col = "blue")
lines(plogis(lowercl), col = "red")
points(area, present, col = "green")

[Figure: predicted probability of presence vs. Area with upper (blue) and lower (red) confidence bands and the raw presence/absence data in green]

In case you're wondering, the reason the confidence intervals are not symmetric around the line, like you are probably used to, is because they are built on the logit scale. But, wait a minute, what the heck is plogis()? It must be important, right? Yes, it is, as it keeps your values bounded between 0 and 1 (see ?plogis) and it's really useful. First, remember that a basic logistic regression looks like this:

\frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}

so, if we have estimates for \beta_0 and \beta_1 then we can actually use plogis to predict each probability. For example,

coef(bird.predict)

(Intercept)        area

then we have an estimate for the intercept and the slope. Say our interest was in predicting the probability of presence given an area estimate of 10. Well, using the above logistic regression formula, it would look like

\frac{e^{\hat{\beta}_0 + \hat{\beta}_1 \cdot 10}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 \cdot 10}}

or, we can get at this a couple of ways,

plogis(coef(bird.predict)[1] + coef(bird.predict)[2] * 10)
plogis(summary(bird.predict)$coefficients[1] +
       summary(bird.predict)$coefficients[2] * 10)

Say we wanted predictions for area estimates of 25:50?

plogis(summary(bird.predict)$coefficients[1] +
       summary(bird.predict)$coefficients[2] * 25:50)

I find that plogis() is a generous friend and I use it every day!
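For the record, plogis() also has an inverse, qlogis(), which takes you from the probability scale back to the logit scale. A two-line sketch (my addition):

p <- plogis(1.5)  # logit value of 1.5 maps to a probability of about 0.82
qlogis(p)         # and back to the logit scale; returns 1.5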

6 Writing Functions in R

6.1 Functions

One of the basics of R is that users can contribute code to conduct various analyses. In R, the standard contribution is a function, or something that the end user applies to their data to get some result. For instance, consider the simple function below to add 2 user-supplied values together:

addtwo = function(a, b) {
    out = a + b
    return(out)
}
addtwo(2, 2)

[1] 4

addtwo(2, 4)

[1] 6

This function works with any 2 numerical values. Functions can be more complex, like the below function that creates summary output:

my.summary = function(x) {
    my.n = length(x)
    my.mean = mean(x, na.rm = TRUE)
    my.var = var(x, na.rm = TRUE)
    my.sd = sd(x, na.rm = TRUE)
    my.median = median(x, na.rm = TRUE)
    out = list(SampleSize = my.n, Mean = my.mean, Variance = my.var,
        StdDev = my.sd, Median = my.median)
    return(out)
}
sum.data = rnorm(10)
my.summary(sum.data)

$SampleSize
[1] 10

$Mean
$Variance
$StdDev
$Median

Functions can do a ton of work for you, so I am barely (and I mean that, barely) scratching the surface. If, for instance, you wanted to see what the function bbmm.polygon() from the

moveud package looks like (and I know exactly what it looks like 'cause I wrote it), then you can just type the name of the function into R and out it pops. bbmm.polygon creates the utilization distribution contours based on bbmm.contour from package BBMM and exports the created contour lines as a polygon shapefile for further analysis in ArcMap (or GIS program of choice). Effectively, it imports a dataframe, reprojects it to UTM, uses brownian.bridge() to create a BBMM and exports the contour lines via bbmm.contour, creates a raster, transforms that raster to a spatial polygon data frame, adds a couple of variables to the data frame, and writes the output to a shapefile appropriate for reading into ArcMap. Not too complicated...

bbmm.polygon

function (x, crs.current, crs.utm, lev, plot = FALSE, path, indid)
{
    coordinates(x) = ~Lon + Lat
    proj4string(x) = CRS(crs.current)
    x = data.frame(spTransform(x, CRS(crs.utm)))
    out.bbmm = brownian.bridge(x = x$Lon, y = x$Lat, time.lag = x$tl[-1],
        location.error = 15, cell.size = 20, max.lag = 180)
    contours = bbmm.contour(out.bbmm, levels = lev, locations = x,
        plot = plot)
    probs <- data.frame(x = out.bbmm$x, y = out.bbmm$y,
        z = out.bbmm$probability)
    out.raster <- rasterFromXYZ(probs, crs = CRS(crs.utm), digits = 5)
    raster.contour <- rasterToContour(out.raster, levels = contours$z)
    raster.contour <- spChFIDs(raster.contour, paste(lev, "% Contour Line",
        sep = ""))
    out = spTransform(raster.contour, CRS(crs.utm))
    out = SpatialLines2PolySet(out)
    out = PolySet2SpatialPolygons(out)
    out = as(out, "SpatialPolygonsDataFrame")
    out$udlevels = paste(rev(lev))
    out$bandid = paste(indid)
    setwd(path)
    writeOGR(obj = out, dsn = ".", layer = paste(indid),
        driver = "ESRI Shapefile")
}
<environment: namespace:moveud>

Or, if, for instance, you wanted to do a little simulation to look at the impacts of detection heterogeneity in deer spotlight survey count data and see how many times you would expect to overestimate, underestimate, or be correct (within 10% error bounds) about how many deer were near the road you were driving down (even though this is fraught with errors), then you could use:

deer.sim = function(survey, reps) {
    x = replicate(reps, {
        pr = rnorm(survey, , )  # per-deer detection probabilities
        pr[pr < 0] = 0
        x = survey/pr
        lower = x[x < mean(x) - 0.1 * mean(x)]
        upper = x[x > mean(x) + 0.1 * mean(x)]
        c(length(lower), length(upper))/length(x)
    })

    ml = mean(x[1, ])
    mu = mean(x[2, ])
    constant = 1 - ml - mu
    x = cbind(Decreased = ml, Constant = constant, Increased = mu,
        Count = survey)
    return(x)
}

And then we can use the function call to estimate how many times we might be too low, too high, or just right based on those numbers (although some would argue about it anyway because deer spotlight surveys are sacrosanct and inviolable in their eyes...).

deer.sim(100, 100)

     Decreased Constant Increased Count
[1,]

That about does it for functions; we could spend time on lexical scoping and such, but that is way beyond this class...
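Well, ok, one more function feature worth flagging since it comes up constantly: arguments can have default values, so users only override what they need to. A minimal sketch (mine, not from the course code):

ci.mean = function(x, level = 0.95) {
    # normal-approximation confidence interval for a mean
    se = sd(x, na.rm = TRUE)/sqrt(sum(!is.na(x)))
    z = qnorm(1 - (1 - level)/2)
    mean(x, na.rm = TRUE) + c(-1, 1) * z * se
}
ci.mean(rnorm(50))       # uses the 0.95 default
ci.mean(rnorm(50), 0.9)  # overrides it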

7 Wildlife-Specific Methods

7.1 Capture-Recapture Analysis

The most comprehensive software package for analysis of capture-recapture data is the program MARK (White and Burnham 1999). While it is unparalleled in the range of models, quality of the user documentation, and active base of user-driven support, the interface for building models can be limiting for large data sets and complex models. While there is some capability for automatic model creation in MARK, most models are built manually with a graphical user interface to specify the parameter structures and design matrices. Manual model creation can be useful during the learning process, but eventually it becomes a time-consuming and sometimes frustrating exercise that may add an unnecessary source of error to the analysis. Finally, for those that analyze data from on-going monitoring programs, there is no way to extend the capture history in MARK, which necessitates manual recreation of all models as data from future sampling occasions are collected.

RMark is an R package that provides a formula-based interface for MARK. RMark has been available since 2005 and is on the Comprehensive R Archive Network (CRAN) (http://cran.r-project.org). RMark contains functionality to build models for MARK from formulas, run the model with MARK, extract the output, and summarize and display the results with automatic labeling. RMark also has functions for model averaging, prediction, variance components, and exporting models back to the MARK interface. In addition, all of the tools in R are available, which enables a completely scripted analysis from data to results and inclusion into a document with Sweave (Leisch 2002) and LaTeX to create a reproducible manuscript such as this one. The report which represents the appropriate citation (effective 2013) for RMark is included in the workshop notes as well.

Here we are going to provide an overview of the RMark package and how it can be used to benefit MARK users. For more detailed documentation, refer to the online documentation and the help within the RMark package. And, just to be fair, a significant portion of these course notes came from various documents Jeff created while explaining or documenting RMark for teaching purposes, and to a lesser extent from some notes I have put together for students at A&M.

Background

RMark does not fit models to data; rather, RMark is an R package that was designed to provide an alternative user interface to MARK and its GUI. RMark uses the R language to construct models, create the input file (.inp), then call MARK, which fits the model(s) to the data, extracts the results from the output file created by MARK, and allows the user to manipulate (via R or some other program) the resultant model output. Thus, RMark is an R interface to MARK, not a stand-alone capture-recapture modeling environment. That said, if results you got using MARK do not match the results you got when you used RMark, then you have made a mistake in one or the other.

Where to find help?

Currently, or at least as best we can tell, MARK supports 140 different modeling options. At present, RMark does not fully replicate every option available in MARK. Although new models are added to RMark fairly regularly, not every model in MARK is available in RMark, and some things you can do in MARK, such as data bootstrapping or computing median c-hat values, are not available through the RMark interface. For a list of models available in RMark, you can use something like

system.file("MarkModels.pdf", package = "RMark")

which will provide you with a PATH statement telling you where you can access the pdf file containing the list of MARK models available in RMark, along with the appropriate code, parameter, and help file names (or, if you have a specific R_LIBS path where your R packages are installed locally, just go there and look in the RMark folder, where the file MarkModels.pdf will be found).

First, it is important to remember that RMark needs MARK, so without an understanding of MARK, you will be limited in your ability to use RMark. So, your first stop should always be the "MARKBOOK", authored/edited by Evan Cooch and Gary White, with contributions from a wide variety of others. The MARKBOOK is freely available (all pages of it) online. Unequivocally, this is the primary desk reference for capture-recapture modeling approaches supported by MARK (although you should never cite it in a manuscript; see the MARK FAQ). Details on RMark are found in Appendix C. Additionally, there is a very active community of ecologists who use MARK regularly that are willing to provide expertise to folks across a wide variety of capture-recapture modeling techniques, and an online forum (managed by Evan Cooch) is available at phidot.org. The user group of the phidot.org forum is typically extremely helpful, given you have read the MARKBOOK and have searched the archives. If you are not already a member, sign up. Finally, RMark operates just like any other R package; if you need the help/reference files for a particular function within RMark, you can access that function using the ? followed by the name of the function you are interested in (e.g., ?mark).

Advantages/Disadvantages

So, why would one want to use RMark as an interface to MARK rather than MARK's GUI? Reasons abound; some are valid, some are not, and lots of it is just individual point of view or project-specific needs. We think that there are some convincing reasons to use a scripted approach for your MARK analysis, but in the end it becomes a personal choice (one I think it is obvious that Jeff and I have already made). A few of the primary reasons we like to use RMark are (but not limited to):

1. RMark provides the user with the ability to automate analysis of monitoring data sets even as monitoring occasions are added. This is a significant benefit that RMark brings to MARK users, as scripted generation of PIMs and DMs allows you to create the script once; as new monitoring data are collected, typically no changes to the script are needed. You just re-run the script with the new data file.

2. Design matrix creation. RMark uses a formula-based approach, which is faster and typically less error-prone (although not entirely error-proof). Thus, there is less need to manually create the PIMs or DM. But, an understanding of what the DM should look like is still necessary.

3. PIM simplification. RMark automatically creates the simplest PIM structure for each model, as opposed to MARK, which uses the full DM even when reduced models are created. This will speed up model evaluation.

4. Collaborative development. MARK and RMark play well together, so you can move analyses back and forth fairly cleanly using functions such as export.mark() and convert.inp().

5. Entire analyses can be scripted. Although this is related to No. 1 above, the scripting of analyses and the ability to use some of the functionality that comes along with R for additional computational support, publication-quality graphing, among other things, is quite beneficial.

6. Reproducible analysis and documentation. Nearly all MARK analyses are reproducible so long as one keeps the .inp/.dbf/.fpt files and documents what was done. One thing that RMark excels at is that the documentation support capabilities for R are widely applicable for MARK analyses. Thus, complete data sets and analyses, with metadata and detailed documentation, can be developed as R packages, or data/code can be seamlessly integrated into LaTeX-style manuscripts and documents (although Evan does a pretty good job with the MARKBOOK). We find it really useful that the entirety of a dataset and analysis can be documented cleanly in one place (see ?dipper for an example). Obviously, good data management protocols for reproducible analyses using only MARK are equally good, so this is more of a personal preference.

Ok, so let's jump in with a quick example. As with most R packages, to access the functionality in RMark you type library(RMark) and R will respond with its appropriate version number and relevant information (I have it in .Rprofile on my system, so no output will be shown below when I do it). For a quick example, we will use the ubiquitous European dipper (Cinclus cinclus) capture-recapture data from many examples in the MARKBOOK and a variety of manuscripts (it is included as a datafile in RMark). For the dipper example, if we look at the structure of the dataset, we can see that it is a dataframe with 2 fields. The first field is the encounter history, which has a required column heading name of 'ch' and must be a character (chr) variable. The field label ch is required for all MARK analyses, and typically a field identifying the number of individuals with that specific encounter history (denoted 'freq') is included; any additional fields are optional. In this example, the field sex specifies group structure (e.g., whether an individual is male or female) and is identified as a factor variable (Factor) with values 1=Female and 2=Male, as ordering is alphabetic and ignores the ordering of the columns in the dipper.inp file, which we can see using levels(). Finally, we can run a simple CJS analysis using the default of constant survival and constant recapture probabilities for the dipper data using the simple code mark(dipper).

library(RMark)

This is RMark

data(dipper)
str(dipper)

'data.frame': 294 obs. of 2 variables:
 $ ch : chr " " " " " " " " ...
 $ sex: Factor w/ 2 levels "Female","Male":

levels(dipper$sex)

[1] "Female" "Male"

ex = mark(dipper)

Output summary for CJS model
Name : Phi(~1)p(~1)
Npar : 2
-2lnL:
AICc :

Beta
                estimate se lcl ucl
Phi:(Intercept)
p:(Intercept)

Real Parameter Phi

Real Parameter p

Importing and Manipulating Data

Now that we have RMark up and running (and we know that it works), the first thing we all want to do is load our data and do some analysis! RMark has several options/ways for one to create or load data for analysis in MARK. As most are familiar with the .inp file structure used by MARK, let's start with the approach that converts an encounter history .inp file to a dataframe for use in RMark. For this demonstration, we will use the dipper.inp file, which on my 64-bit system is located in C:\Program Files (x86)\mark\examples, and the RMark function convert.inp(). Conversion of a .inp file to a dataframe using convert.inp() requires that we specify the input file location and name, group and optional covariate names, and, if the .inp file has commented areas (/* and */ in MARK parlance), that we let

RMark know. So you don't have to go look (or you can look above): the structure of dipper.inp is pretty straightforward; the encounter history has 7 encounter occasions, does include the freq column giving the number of individuals with each specific encounter history, and has 2 groups (columns) representing either Male or Female (1 or 0). Because males are in the first column and females are in the second column, when we define group.df= that will be the order we use. So, converting the dipper.inp data would work as follows:

dipper.convert = convert.inp("C:/Program Files (x86)/mark/examples/dipper.inp",
    group.df = data.frame(sex = c("Male", "Female")))

When we look at the structure of the newly created file dipper.convert, we will see that it is now an R dataframe with 3 fields. The first field is the capture history (ch), which is a character value; the second field is the frequency variable (freq), or the number of individuals with that unique encounter history (a numeric value); and the third field is the grouping variable sex, which is a factor variable with 2 levels and can be shown using levels().

str(dipper.convert)

'data.frame': 294 obs. of 3 variables:
 $ ch  : chr " " " " " " " " ...
 $ freq: num
 $ sex : Factor w/ 2 levels "Female","Male":

levels(dipper.convert$sex)

[1] "Female" "Male"

Once your data is in R as a dataframe, there are some handy options for manipulating data that you can use via standard R functions. A simple example is to add a numeric column representing some covariate (weight is typically used) to the newly created dipper.convert dataframe.

dipper.convert$weight = rnorm(nrow(dipper.convert), mean = 11, sd = 3)
summary(dipper.convert$weight)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

Processing Data

Many of you will be familiar with the MARK model specification window, as it is where you identify the dataset you want to use for analysis, choose the model type specific for your analysis, and provide details on the various descriptors for your dataset such as the number of encounter occasions, name and number of groups, and individual covariates.

RMark (read: Jeff when he wrote it) takes care of some of these specifications, such as number of occasions, group labels, and individual covariate names (drawn from the input file column names), by setting these for you. However, some of the options, such as titles, number of mixtures, and time intervals, among others, are all argument options for the function process.data(), which takes the place of the model specification window from MARK. process.data() does exactly what it sounds like: it processes the specified input data file and creates an R list structure that includes the original dataframe, all the required attribute data, and what model the dataset should be analyzed with:

dipper.proc = process.data(dipper.convert, model = "CJS", groups = "sex",
    begin.time = 1980)
str(dipper.proc)

List of 15
 $ data           :'data.frame': 294 obs. of 5 variables:
  ..$ ch    : chr [1:294] " " " " " " " " ...
  ..$ freq  : num [1:294]
  ..$ sex   : Factor w/ 2 levels "Female","Male":
  ..$ weight: num [1:294]
  ..$ group : Factor w/ 2 levels "1","2":
 $ model          : chr "CJS"
 $ mixtures       : num 1
 $ freq           :'data.frame': 294 obs. of 2 variables:
  ..$ sexFemale: num [1:294]
  ..$ sexMale  : num [1:294]
 $ nocc           : num 7
 $ nocc.secondary : NULL
 $ time.intervals : num [1:6]
 $ begin.time     : num 1980
 $ age.unit       : num 1
 $ initial.ages   : num [1:2]
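So where do you go from a processed data list? The usual next step is make.design.data() to build the design data, and then mark() with a formula list for each parameter. A minimal sketch of a sex-dependent survival model, assuming dipper.proc exists as created above (the object names Phi.sex, p.dot, and sex.model are mine; the Phi/p parameter names are the standard RMark conventions for a CJS model):

dipper.ddl = make.design.data(dipper.proc)
Phi.sex = list(formula = ~sex)  # survival differs by sex
p.dot = list(formula = ~1)      # constant recapture probability
sex.model = mark(dipper.proc, dipper.ddl,
                 model.parameters = list(Phi = Phi.sex, p = p.dot))
summary(sex.model)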


More information

Formulas, LookUp Tables and PivotTables Prepared for Aero Controlex

Formulas, LookUp Tables and PivotTables Prepared for Aero Controlex Basic Topics: Formulas, LookUp Tables and PivotTables Prepared for Aero Controlex Review ribbon terminology such as tabs, groups and commands Navigate a worksheet, workbook, and multiple workbooks Prepare

More information

Introduction to R Benedikt Brors Dept. Intelligent Bioinformatics Systems German Cancer Research Center

Introduction to R Benedikt Brors Dept. Intelligent Bioinformatics Systems German Cancer Research Center Introduction to R Benedikt Brors Dept. Intelligent Bioinformatics Systems German Cancer Research Center What is R? R is a statistical computing environment with graphics capabilites It is fully scriptable

More information

Matrices. Chapter Matrix A Mathematical Definition Matrix Dimensions and Notation

Matrices. Chapter Matrix A Mathematical Definition Matrix Dimensions and Notation Chapter 7 Introduction to Matrices This chapter introduces the theory and application of matrices. It is divided into two main sections. Section 7.1 discusses some of the basic properties and operations

More information

Monday. A few notes on homework I want ONE spreadsheet with TWO tabs

Monday. A few notes on homework I want ONE spreadsheet with TWO tabs CS 1251 Page 1 Monday Sunday, September 14, 2014 2:38 PM A few notes on homework I want ONE spreadsheet with TWO tabs What has passed before We ended last class with you creating a function called givemeseven()

More information

MITOCW ocw f99-lec07_300k

MITOCW ocw f99-lec07_300k MITOCW ocw-18.06-f99-lec07_300k OK, here's linear algebra lecture seven. I've been talking about vector spaces and specially the null space of a matrix and the column space of a matrix. What's in those

More information

Programming in C++ Prof. Partha Pratim Das Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Programming in C++ Prof. Partha Pratim Das Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Programming in C++ Prof. Partha Pratim Das Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture 04 Programs with IO and Loop We will now discuss the module 2,

More information

1 Lecture 5: Advanced Data Structures

1 Lecture 5: Advanced Data Structures L5 June 14, 2017 1 Lecture 5: Advanced Data Structures CSCI 1360E: Foundations for Informatics and Analytics 1.1 Overview and Objectives We ve covered list, tuples, sets, and dictionaries. These are the

More information

Slide 1 CS 170 Java Programming 1 Multidimensional Arrays Duration: 00:00:39 Advance mode: Auto

Slide 1 CS 170 Java Programming 1 Multidimensional Arrays Duration: 00:00:39 Advance mode: Auto CS 170 Java Programming 1 Working with Rows and Columns Slide 1 CS 170 Java Programming 1 Duration: 00:00:39 Create a multidimensional array with multiple brackets int[ ] d1 = new int[5]; int[ ][ ] d2;

More information

Excel Tips and FAQs - MS 2010

Excel Tips and FAQs - MS 2010 BIOL 211D Excel Tips and FAQs - MS 2010 Remember to save frequently! Part I. Managing and Summarizing Data NOTE IN EXCEL 2010, THERE ARE A NUMBER OF WAYS TO DO THE CORRECT THING! FAQ1: How do I sort my

More information

6.001 Notes: Section 6.1

6.001 Notes: Section 6.1 6.001 Notes: Section 6.1 Slide 6.1.1 When we first starting talking about Scheme expressions, you may recall we said that (almost) every Scheme expression had three components, a syntax (legal ways of

More information

CIS4/681 { Articial Intelligence 2 > (insert-sort '( )) ( ) 2 More Complicated Recursion So far everything we have dened requires

CIS4/681 { Articial Intelligence 2 > (insert-sort '( )) ( ) 2 More Complicated Recursion So far everything we have dened requires 1 A couple of Functions 1 Let's take another example of a simple lisp function { one that does insertion sort. Let us assume that this sort function takes as input a list of numbers and sorts them in ascending

More information

Topic C. Communicating the Precision of Measured Numbers

Topic C. Communicating the Precision of Measured Numbers Topic C. Communicating the Precision of Measured Numbers C. page 1 of 14 Topic C. Communicating the Precision of Measured Numbers This topic includes Section 1. Reporting measurements Section 2. Rounding

More information

USE IBM IN-DATABASE ANALYTICS WITH R

USE IBM IN-DATABASE ANALYTICS WITH R USE IBM IN-DATABASE ANALYTICS WITH R M. WURST, C. BLAHA, A. ECKERT, IBM GERMANY RESEARCH AND DEVELOPMENT Introduction To process data, most native R functions require that the data first is extracted from

More information

The first thing we ll need is some numbers. I m going to use the set of times and drug concentration levels in a patient s bloodstream given below.

The first thing we ll need is some numbers. I m going to use the set of times and drug concentration levels in a patient s bloodstream given below. Graphing in Excel featuring Excel 2007 1 A spreadsheet can be a powerful tool for analyzing and graphing data, but it works completely differently from the graphing calculator that you re used to. If you

More information

A Tour of Sweave. Max Kuhn. March 14, Pfizer Global R&D Non Clinical Statistics Groton

A Tour of Sweave. Max Kuhn. March 14, Pfizer Global R&D Non Clinical Statistics Groton A Tour of Sweave Max Kuhn Pfizer Global R&D Non Clinical Statistics Groton March 14, 2011 Creating Data Analysis Reports For most projects where we need a written record of our work, creating the report

More information

STATISTICAL TECHNIQUES. Interpreting Basic Statistical Values

STATISTICAL TECHNIQUES. Interpreting Basic Statistical Values STATISTICAL TECHNIQUES Interpreting Basic Statistical Values INTERPRETING BASIC STATISTICAL VALUES Sample representative How would one represent the average or typical piece of information from a given

More information

Introduction to R. base -> R win32.exe (this will change depending on the latest version)

Introduction to R. base -> R win32.exe (this will change depending on the latest version) Dr Raffaella Calabrese, Essex Business School 1. GETTING STARTED Introduction to R R is a powerful environment for statistical computing which runs on several platforms. R is available free of charge.

More information

Memory Addressing, Binary, and Hexadecimal Review

Memory Addressing, Binary, and Hexadecimal Review C++ By A EXAMPLE Memory Addressing, Binary, and Hexadecimal Review You do not have to understand the concepts in this appendix to become well-versed in C++. You can master C++, however, only if you spend

More information

MITOCW watch?v=w_-sx4vr53m

MITOCW watch?v=w_-sx4vr53m MITOCW watch?v=w_-sx4vr53m The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To

More information

(Refer Slide Time: 1:27)

(Refer Slide Time: 1:27) Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 1 Introduction to Data Structures and Algorithms Welcome to data

More information

OUTLINES. Variable names in MATLAB. Matrices, Vectors and Scalar. Entering a vector Colon operator ( : ) Mathematical operations on vectors.

OUTLINES. Variable names in MATLAB. Matrices, Vectors and Scalar. Entering a vector Colon operator ( : ) Mathematical operations on vectors. 1 LECTURE 3 OUTLINES Variable names in MATLAB Examples Matrices, Vectors and Scalar Scalar Vectors Entering a vector Colon operator ( : ) Mathematical operations on vectors examples 2 VARIABLE NAMES IN

More information

MITOCW watch?v=4dj1oguwtem

MITOCW watch?v=4dj1oguwtem MITOCW watch?v=4dj1oguwtem PROFESSOR: So it's time to examine uncountable sets. And that's what we're going to do in this segment. So Cantor's question was, are all sets the same size? And he gives a definitive

More information

LAB #1: DESCRIPTIVE STATISTICS WITH R

LAB #1: DESCRIPTIVE STATISTICS WITH R NAVAL POSTGRADUATE SCHOOL LAB #1: DESCRIPTIVE STATISTICS WITH R Statistics (OA3102) Lab #1: Descriptive Statistics with R Goal: Introduce students to various R commands for descriptive statistics. Lab

More information

Hi everyone. Starting this week I'm going to make a couple tweaks to how section is run. The first thing is that I'm going to go over all the slides

Hi everyone. Starting this week I'm going to make a couple tweaks to how section is run. The first thing is that I'm going to go over all the slides Hi everyone. Starting this week I'm going to make a couple tweaks to how section is run. The first thing is that I'm going to go over all the slides for both problems first, and let you guys code them

More information

Excel Basics Fall 2016

Excel Basics Fall 2016 If you have never worked with Excel, it can be a little confusing at first. When you open Excel, you are faced with various toolbars and menus and a big, empty grid. So what do you do with it? The great

More information

Introduction to R. Introduction to Econometrics W

Introduction to R. Introduction to Econometrics W Introduction to R Introduction to Econometrics W3412 Begin Download R from the Comprehensive R Archive Network (CRAN) by choosing a location close to you. Students are also recommended to download RStudio,

More information

Introduction to R. Nishant Gopalakrishnan, Martin Morgan January, Fred Hutchinson Cancer Research Center

Introduction to R. Nishant Gopalakrishnan, Martin Morgan January, Fred Hutchinson Cancer Research Center Introduction to R Nishant Gopalakrishnan, Martin Morgan Fred Hutchinson Cancer Research Center 19-21 January, 2011 Getting Started Atomic Data structures Creating vectors Subsetting vectors Factors Matrices

More information

Chapter 1 Introduction

Chapter 1 Introduction Chapter 1 Introduction Why I Am Writing This: Why I am I writing a set of tutorials on compilers and how to build them? Well, the idea goes back several years ago when Rapid-Q, one of the best free BASIC

More information

ECE Lesson Plan - Class 1 Fall, 2001

ECE Lesson Plan - Class 1 Fall, 2001 ECE 201 - Lesson Plan - Class 1 Fall, 2001 Software Development Philosophy Matrix-based numeric computation - MATrix LABoratory High-level programming language - Programming data type specification not

More information

COPYRIGHTED MATERIAL. Starting Strong with Visual C# 2005 Express Edition

COPYRIGHTED MATERIAL. Starting Strong with Visual C# 2005 Express Edition 1 Starting Strong with Visual C# 2005 Express Edition Okay, so the title of this chapter may be a little over the top. But to be honest, the Visual C# 2005 Express Edition, from now on referred to as C#

More information

Computer lab 2 Course: Introduction to R for Biologists

Computer lab 2 Course: Introduction to R for Biologists Computer lab 2 Course: Introduction to R for Biologists April 23, 2012 1 Scripting As you have seen, you often want to run a sequence of commands several times, perhaps with small changes. An efficient

More information

LOOPS. Repetition using the while statement

LOOPS. Repetition using the while statement 1 LOOPS Loops are an extremely useful feature in any programming language. They allow you to direct the computer to execute certain statements more than once. In Python, there are two kinds of loops: while

More information

LAB #2: SAMPLING, SAMPLING DISTRIBUTIONS, AND THE CLT

LAB #2: SAMPLING, SAMPLING DISTRIBUTIONS, AND THE CLT NAVAL POSTGRADUATE SCHOOL LAB #2: SAMPLING, SAMPLING DISTRIBUTIONS, AND THE CLT Statistics (OA3102) Lab #2: Sampling, Sampling Distributions, and the Central Limit Theorem Goal: Use R to demonstrate sampling

More information

These are notes for the third lecture; if statements and loops.

These are notes for the third lecture; if statements and loops. These are notes for the third lecture; if statements and loops. 1 Yeah, this is going to be the second slide in a lot of lectures. 2 - Dominant language for desktop application development - Most modern

More information

Learn a lot beyond the conventional VLOOKUP

Learn a lot beyond the conventional VLOOKUP The Ultimate Guide Learn a lot beyond the conventional VLOOKUP Hey there, Howdy? =IF ( you are first timer at Goodly, Then a very warm welcome here, Else for all my regular folks you know I love you :D

More information

C++ Reference NYU Digital Electronics Lab Fall 2016

C++ Reference NYU Digital Electronics Lab Fall 2016 C++ Reference NYU Digital Electronics Lab Fall 2016 Updated on August 24, 2016 This document outlines important information about the C++ programming language as it relates to NYU s Digital Electronics

More information

ECON 502 INTRODUCTION TO MATLAB Nov 9, 2007 TA: Murat Koyuncu

ECON 502 INTRODUCTION TO MATLAB Nov 9, 2007 TA: Murat Koyuncu ECON 502 INTRODUCTION TO MATLAB Nov 9, 2007 TA: Murat Koyuncu 0. What is MATLAB? 1 MATLAB stands for matrix laboratory and is one of the most popular software for numerical computation. MATLAB s basic

More information

This exam is worth 30 points, or 18.75% of your total course grade. The exam contains

This exam is worth 30 points, or 18.75% of your total course grade. The exam contains CS 60A Final May 16, 1992 Your name Discussion section number TA's name This exam is worth 30 points, or 18.75% of your total course grade. The exam contains six questions. This booklet contains eleven

More information

6.001 Notes: Section 7.1

6.001 Notes: Section 7.1 6.001 Notes: Section 7.1 Slide 7.1.1 In the past few lectures, we have seen a series of tools for helping us create procedures to compute a variety of computational processes. Before we move on to more

More information