
Tutorial 3: Data Manipulation

Anatomy of an R Command

Every command has a unique name. These names are specific to the program and are case-sensitive. In the examples below, command.name is the name of the command. A command name is always followed by a set of parentheses, within which the data are supplied and the parameters for running the command are set.

In general, the first entry in the parentheses specifies (supplies) the dataset. It may be a data object (vector, data frame, matrix) or part of a data object (such as a single row or column of numerical values or measurements; we will see how to do this later in this tutorial) that you want to analyze. Before running any command for the first time, it is a good idea to look at the help topics to see what type(s) of data need to be supplied.

    command.name(vector)
    command.name(dataframe)
    command.name(matrix)

A command may require multiple measurements or vectors.

    command.name(measurement1, measurement2)
    command.name(vector1, vector2)

In some commands, the next entry (or next few entries) specifies additional settings for how your data will be handled by the command. In this example, grouping is a categorical variable (perhaps representing experimental treatments) by which to group the measurement values.

    command.name(measurement, grouping)

Most commands in R have multiple parameters that can be used to customize your analyses. These parameters are called arguments. Their names and default settings are shown in the help topics, and most help topics also list the available options. There are several types of arguments.

Some arguments are like on/off switches and only need to be set equal to TRUE or FALSE (T or F work too). See argument1 in the example below.

Some arguments need to be set equal to a numeric value that corresponds to a certain condition. See argument2 below.
Some arguments need to be set equal to a character string (one or more letters or words in quotation marks). See argument3 below.

If multiple character strings or numbers are required, they must be concatenated with c(), as in argument4 below.

    command.name(measurement, grouping, argument1=TRUE, argument2=3, argument3="word", argument4=c("A", "B", "C"))
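The argument types described above can be seen in a few real base R commands. This is only an illustrative sketch: the measurement, grouping, x, and y values are invented, and the commands were chosen simply because each one takes an argument of the relevant type. It uses <-, R's standard assignment operator, which behaves like the = used elsewhere in this tutorial.

```r
# Hypothetical measurements and groups, made up for illustration only
measurement <- c(2.1, 3.4, NA, 5.0, 4.2, 3.8)
grouping <- c("A", "A", "A", "B", "B", "B")

# Logical (on/off) argument: na.rm = TRUE drops missing values first
avg <- mean(measurement, na.rm = TRUE)

# Numeric argument: digits sets the number of decimal places kept
avg.rounded <- round(avg, digits = 1)

# Character-string argument: method names the type of correlation
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
rho <- cor(x, y, method = "spearman")

# Several values concatenated with c(): probs lists the quantiles wanted
q <- quantile(measurement, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

# A measurement summarized separately for each level of a grouping variable
group.means <- tapply(measurement, grouping, mean, na.rm = TRUE)

avg.rounded
q
group.means
```

Running ?mean, ?cor, or ?quantile opens the help topic for each command, which lists these arguments and their defaults.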

Quickly Review Column Headings

To quickly remind yourself what the column names are in your dataset, you can use the head() command to see just the first several rows of your data.

    head(dataset)

Recalling Columns from a Data Frame or Matrix

The most reliable way to recall and display columns from a data frame is to specify both the name of the dataset and the name of the column, separated by a $ symbol. The example code below tells R to display all records (rows) of a particular variable (one column) from a particular dataset.

    dataset$variable1
    dataset$variable2

This also lets you pass a specific column from a dataset to a command. In addition, any column can be used to create an independent vector with a new name.

    command.name(dataset$variable3)
    new.name=dataset$variable3
    command.name(new.name)

Attach: Another way to recall columns, but not recommended

There is a shorter way to recall columns: attaching the names of the variables to the data. Variable names will be the column headings (first row of cells, header row) from the dataset. This allows you to recall individual columns of data without having to specify the dataset. To see the names of the variables (columns of a data frame), use the names() command.

    attach(dataset)
    names(dataset)

Recall and display variables in the dataset by typing the column names.

    variable1

CAUTION: The attach() method can cause problems if you are working with multiple datasets (read in from separate .csv files) that happen to have some or all of the same column headings. Only the last attached dataset will be recognized, so it is up to you to organize your R code carefully or to give unique names to all columns across all datasets. This is generally not a problem if you are only working with one dataset during an R session. The dataset$variable convention is recommended for avoiding confusion and mistakes in analyses.
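The dataset$variable convention described above can be tried on a small data frame built by hand. This is a minimal sketch: the data frame and its values are invented here as a stand-in for a dataset read from a .csv file, and mean() is used only as an example of a command that accepts a single column.

```r
# Hypothetical data frame standing in for a dataset read from a .csv file
dataset <- data.frame(
  variable1 = c(10, 20, 30, 40),
  variable2 = c("a", "b", "c", "d"),
  variable3 = c(1.5, 2.5, 3.5, 4.5)
)

head(dataset)    # first rows (up to six) shown under the column headings
names(dataset)   # the column names only

# Recall one column with the dataset$variable convention
dataset$variable1

# Pass a column straight to a command...
mean(dataset$variable3)

# ...or store it as an independent vector first
new.name <- dataset$variable3
mean(new.name)
```

Because every column is addressed through its dataset, two datasets with identical column headings cannot be confused with each other.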

Selecting Values from a Vector (Indexing a Vector)

Subsets of data from a vector can be generated by indexing the desired positions within the vector. Each value contained in a vector is located in a particular position. The number of positions depends on the length of the vector, that is, how many values it contains. A set of square brackets directly after the vector name is used to indicate the position(s) of the values you want to select.

    vector[position]

For example, suppose we generate a vector containing the numbers 11 through 20.

    vector=c(11:20)
    vector
    [1] 11 12 13 14 15 16 17 18 19 20

We can see how long the vector is by using the length() command. It tells us how many values are in the vector, and thus how many positions there are. This example vector has 10 values, so it has 10 positions.

    length(vector)
    [1] 10

Now say we want to extract the value in the 4th position. Use square brackets to specify position 4 of the example vector. It will return the number 14, which is the 4th value in the vector.

    vector[4]
    [1] 14

We can also extract values from multiple positions. Consecutive positions can be specified as a range using a colon. Commas separate non-consecutive positions. Note that we have to group our selected positions together using the c() command. Here we have selected values from the 4th through 6th positions, the 8th position, and the 10th position.

    vector[c(4:6, 8, 10)]
    [1] 14 15 16 18 20
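A few further cases of vector indexing follow from the same rules. The sketch below reuses the numbers 11 through 20 but stores them under the hypothetical name vec, chosen here so that R's built-in vector() command is not masked.

```r
# The same values as the example above, under a hypothetical name
vec <- 11:20

# length() gives the last position, so this returns the last value
vec[length(vec)]

# Values come back in the order the positions are listed
vec[c(10, 1)]

# A range and single positions combined with c()
vec[c(4:6, 8, 10)]

# Asking for a position beyond the end returns NA rather than an error
vec[11]
```

Checking length() before indexing is a simple way to avoid unexpected NA values in a subset.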

Selecting Rows and Columns (Indexing a Data Frame or Matrix)

Data frames and matrices can be indexed in a similar way to vectors, but the basic anatomy of the selection code is a little different. The name of the dataset is followed by brackets containing the row positions and the column positions, in that order, separated by a comma.

    dataset[row.position, column.position]

For clarity, consider the matrix of values below. Note how at the left of each row there is a set of brackets containing a number followed by a comma. These are the positions of each row (similar to how rows are numbered in an Excel spreadsheet). Note how the top of each column has a set of brackets containing a comma followed by a number. These are the positions of each column (similar to how columns are labeled with letters in an Excel spreadsheet).

         [,1] [,2] [,3] [,4]
    [1,]   11   12   13   14
    [2,]   21   22   23   24
    [3,]   31   32   33   34
    [4,]   41   42   43   44

To find the dimensions (number of rows and columns) of a data frame or matrix, use the dim() command. The first value returned tells how many rows there are, and the second value tells how many columns. So if this matrix were named dataset, we would find its dimensions with:

    dim(dataset)
    [1] 4 4

The simplest selection is to extract a single value from a matrix. To select the position located in the first row and first column (the value 11), we would use:

    dataset[1,1]

We can also extract multiple values from a matrix. To select the positions located in the first two rows and first column (the values 11 and 21), we would use:

    dataset[1:2, 1]

To select the positions located in the first two rows and first three columns (the values 11, 12, 13, 21, 22, and 23), we would use:

    dataset[1:2, 1:3]
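To try these selections directly, the example matrix can be built with the matrix() command. This is a sketch: the tutorial does not say how the matrix was created, so filling it row by row with byrow = TRUE is an assumption made here to reproduce the layout shown.

```r
# Rebuild the 4 x 4 example matrix; byrow = TRUE fills one row at a time
dataset <- matrix(
  c(11, 12, 13, 14,
    21, 22, 23, 24,
    31, 32, 33, 34,
    41, 42, 43, 44),
  nrow = 4, byrow = TRUE
)

dim(dataset)          # rows first, then columns

dataset[1, 1]         # single value: row 1, column 1
dataset[1:2, 1]       # rows 1-2 of column 1
dataset[1:2, 1:3]     # rows 1-2 of columns 1-3, returned as a smaller matrix
```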

Any selection should be assigned to a new name. You can see your new subset by running the assigned name (subset in this example).

    subset=dataset[row.position, column.position]
    subset

It is not necessary to provide selection criteria for both rows and columns. If you are only choosing based on one, the other can be left blank, but don't forget the comma! The next two sections show examples of how indexing can be used in this way.

    dataset[, column.position]
    dataset[row.position,]

Selecting Dataset Rows (Records) to Create a Subset

Sometimes you want to choose a particular subset of rows (records) to analyze separately from the rest of your data. Three common ways to accomplish this are (1) selecting rows based on position, (2) selecting rows based on a numerical variable, and (3) selecting rows based on a categorical (text or character string) variable.

1. Selecting rows (records) based on positions in a data frame or matrix.

As we saw in the previous section, a set of consecutive rows can be selected by specifying a range (use a colon). For example, to select the first 30 rows in a dataset:

    dataset[1:30,]

We can also select non-consecutive rows by separating the positions with commas and using the c() command to group the selected positions together. This example will select the first, fifth, seventh, and ninth rows.

    dataset[c(1, 5, 7, 9),]

We can select groups of consecutive rows. This example will select the first 30 rows in the dataset and the 50th through 60th rows.

    dataset[c(1:30, 50:60),]

We can select non-consecutive rows and groups of consecutive rows at the same time. This example will select the first, fifth, and tenth rows and the 50th through 60th rows.

    dataset[c(1, 5, 10, 50:60),]

Note that the order of the positions matters. The selected rows will be placed in the subset in the order specified.
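The row selections above can be checked on a small invented data frame. This is only a sketch: the 60-row data frame and its column names (id, value) are hypothetical, and the subsets are given names such as first.rows rather than subset so that R's built-in subset() command is not masked.

```r
# Hypothetical 60-row data frame, invented for illustration
dataset <- data.frame(id = 1:60, value = seq(0.5, 30, by = 0.5))

# Rows by position; leaving the column slot blank keeps every column
first.rows <- dataset[1:30, ]
mixed.rows <- dataset[c(1, 5, 10, 50:60), ]

# Rows are returned in the order the positions are listed
reordered <- dataset[c(3, 1), ]

# Leaving the row slot blank keeps every row of the chosen column
value.only <- dataset[, 2]

nrow(first.rows)
nrow(mixed.rows)
reordered$id
```

Counting rows with nrow() after each selection is a quick check that the subset contains what was intended.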

2. Selecting rows based on a numerical variable contained in one of the columns of the dataset.

Selections can be made to extract rows that contain values less than, equal to, or greater than a designated threshold within a particular column.

To select rows from the dataset that have a value less than 1 in the column named variable:

    dataset[dataset$variable<1,]

To select rows from the dataset that have a value equal to 1 in the column named variable (note the double equals sign, ==, which represents the comparison operator "equal to"):

    dataset[dataset$variable==1,]

To select rows from the dataset that have a value greater than 1 in the column named variable:

    dataset[dataset$variable>1,]

To select rows from the dataset that have a value less than or equal to 1 in the column named variable:

    dataset[dataset$variable<=1,]

To select rows from the dataset that have a value greater than or equal to 1 in the column named variable:

    dataset[dataset$variable>=1,]

3. Selecting rows by a text criterion.

If the dataset has a column containing a categorical variable (text), you can choose rows that belong to a particular category. Rows that contain the desired text in the column named variable will be selected. The ignore.case argument designates whether the selection is case-sensitive; with ignore.case=TRUE, upper and lower case are treated as the same.

    dataset[grep("text", dataset$variable, ignore.case=TRUE),]

Selecting Dataset Columns to Create a Subset

Columns can be selected based on either (1) the column positions or (2) the names of the columns. You can also (3) remove particular columns from a dataset.

1. Selecting based on column position.

Note the use of c() to group the selected positions together. Consecutive columns are selected by specifying a range.

    dataset[, c(1:3)]

Non-consecutive column selection:

    dataset[, c(1, 3, 5)]

A combination of consecutive and non-consecutive column selections:

    dataset[, c(1:3, 6, 8)]

2. Selecting based on column names.

Place the names of the columns in quotation marks. Use c() to group multiple column name selections together. Spelling and case must exactly match the names in the dataset.

    dataset[, "column1"]
    dataset[, c("column1", "column2")]

3. Removing columns from a dataset.

Remove a column by preceding its position with a dash (negative sign). Multiple columns can be removed by using c(). Note that these removals do not affect the original .csv file saved on your computer.

    dataset[, -4]
    dataset[, c(-4, -6)]

Selecting Data Based on Rows and Columns

Both row and column criteria can be used together to select a subset of data. Any of the selection methods can be combined to make even more specific data selections.

Select data that are in rows 1 through 30 and from the columns named column1 and column2.

    dataset[1:30, c("column1", "column2")]

Select data from the column named column1, but only for rows 1, 3, 6, and 8 through 12.

    dataset[c(1, 3, 6, 8:12), "column1"]

Select data from columns 1 and 5, but only for records (rows) with a value greater than one in the column named variable.

    dataset[dataset$variable>1, c(1, 5)]
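The row criteria and column selections described above can be combined on a small invented data frame. This is a minimal sketch: the column names (plot, community, variable) and all values are hypothetical. Note that grep() matches any entry containing the pattern, so "oak" would also select an entry such as "Oakwood".

```r
# Hypothetical data frame, invented for illustration
dataset <- data.frame(
  plot = 1:6,
  community = c("Oak", "Maple", "oak", "Maple", "Oak", "Pine"),
  variable = c(0.5, 1, 2.5, 1, 3, 0.2)
)

# Rows chosen by a numerical criterion
above.one <- dataset[dataset$variable > 1, ]
equal.one <- dataset[dataset$variable == 1, ]

# Rows chosen by text; ignore.case = TRUE matches both "Oak" and "oak"
oak.rows <- dataset[grep("oak", dataset$community, ignore.case = TRUE), ]

# Columns chosen by name, and a column removed by negative position
by.name <- dataset[, c("plot", "variable")]
dropped <- dataset[, -2]

# Row and column criteria together
combined <- dataset[dataset$variable > 1, c(1, 3)]

above.one
oak.rows
combined
```

The original dataset object is unchanged by any of these selections; each result exists only under the new name it was assigned to.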

Renaming Columns

Sometimes it is useful to rename columns in a newly created output dataset or an existing dataset without having to go back to your original data files. The colnames() command takes the name of the dataset in the parentheses, followed by brackets with the positions of the columns you wish to rename. In the example below, the name of the first column in the dataset will be changed to New.Name.

    colnames(dataset)[1]="New.Name"

If you wish to change the names of multiple, consecutive columns, you can specify the range of columns (1:3 in the example below, corresponding to the first three columns of the dataset) and supply multiple new column names using c(). In this example, the first three columns of the dataset will be renamed to Name1, Name2, and Name3, respectively.

    colnames(dataset)[1:3]=c("Name1", "Name2", "Name3")

Note: This method of renaming columns is only temporary. The new names exist only for the duration of your R session, and the original .csv file is unchanged.

Merging Two Data Frames

The merge() command can be used to merge two data frames based on a common column (commonly species or plot). If the column by which the data frames are to be merged has the same name in both data frames, no other columns share a name, and the data frames are the same length (same number of rows), then the simplest use of merge() may work. This example takes two data frames (dataset.x and dataset.y), searches each for a column name common to both, and then merges the datasets row by row by matching up the values in the column the datasets have in common.

    merge(dataset.x, dataset.y)

If the columns by which the datasets are to be merged have different names, then these names must be specified. This example will merge two datasets (dataset.x, dataset.y) based on two columns that have different names.
    merge(dataset.x, dataset.y, by.x="column.x", by.y="column.y", all=TRUE, sort=FALSE)

If the datasets are not the same length (they do not have the same number of rows, or are otherwise not a perfect match), unmatched records will be omitted from the output unless all rows are kept via the all argument. If all=TRUE, unmatched rows will be kept, with NA values inserted for missing data.

If sort=FALSE, the output will generally be organized so that successfully merged records appear first, followed by records from dataset.x that did not match any records in dataset.y, followed by records from dataset.y that did not match any records in dataset.x. (R's documentation describes the row order as unspecified when sort=FALSE, so avoid code that depends on it.)

Note: Input dataset columns that have the same name prior to the merge (other than the columns used for matching) will have .x or .y appended to their names, depending on whether they come from dataset.x or dataset.y.

Exporting Data from R

Data frames, matrices, and other outputs generated in R can be exported to a .csv file to save and use later. If you make modifications to a dataset and want to save your changes, you can generate a .csv file directly from R using the write.csv() command. First, the name of the dataset you want to save is supplied, followed by a file pathname for where you want the file to be saved. Note that the file pathname includes the name of the file that will be created (new.dataset.csv in this example).

    write.csv(new.dataset, file="/Users/username/Desktop/folder/new.dataset.csv")
    write.csv(new.dataset, file="C:/Documents and Settings/Owner/Desktop/new.dataset.csv")

If you have set a working directory, you do not need to provide the full pathname; only the name of the file you want to create is required. The resulting file will be saved in the folder you set as your working directory.

    setwd("/Users/johndoe/Desktop/")
    setwd("C:/Documents and Settings/Owner/Desktop/")
    write.csv(dataset, "dataset.csv")
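The merging, renaming, and exporting steps described above can be sketched end to end on two small invented data frames. Everything here is hypothetical: the data values, the plot labels, and the file location. The file is written to R's temporary folder (tempdir()) rather than the Desktop so the sketch runs on any computer, and row.names=FALSE is an optional setting added here to keep R's row numbers out of the saved file.

```r
# Two hypothetical data frames whose matching columns have different names
dataset.x <- data.frame(column.x = c("p1", "p2", "p3"),
                        richness = c(4, 8, 6))
dataset.y <- data.frame(column.y = c("p2", "p3", "p4"),
                        age = c("Young", "Old", "Young"))

# Default merge: only records found in both datasets are kept
merged.matched <- merge(dataset.x, dataset.y,
                        by.x = "column.x", by.y = "column.y")

# all = TRUE keeps unmatched records and fills the gaps with NA
merged.all <- merge(dataset.x, dataset.y,
                    by.x = "column.x", by.y = "column.y", all = TRUE)

# The matching column takes its name from dataset.x; rename it
colnames(merged.all)[1] <- "Plot"

# Export to a temporary folder, then read the file back in to check it
out.file <- file.path(tempdir(), "new.dataset.csv")
write.csv(merged.all, file = out.file, row.names = FALSE)
check <- read.csv(out.file)

merged.matched
merged.all
check
```

Comparing nrow(merged.matched) with nrow(merged.all) shows how many records failed to match, which is worth checking before any analysis of merged data.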

Tutorial Code

#read in a .csv file using the full pathname; assigns the data frame to the name data
data=read.csv("/Users/johndoe/Desktop/r_example_dataframe.csv")
data #view the dataset
head(data) #view headers of the data dataset

#change the name of the first column of the data dataset to Sample
colnames(data)[1]="Sample"
head(data)

#change the name of the second column of the data dataset to Comm_Type
colnames(data)[2]="Comm_Type"
head(data)

#remove the first column of the data dataset and assign to a new name, remove1
remove1=data[, -1]
remove1

#remove the first three columns of the data dataset and assign to a new name, remove2
remove2=data[, c(-1, -2, -3)]
remove2

setwd("/Users/johndoe/Desktop/") #set working directory
example=read.csv("r_example_dataframe.csv") #read in .csv file
example #view dataset

#take values in the Diversity column of the example dataset and use them to generate a vector named div
div=example$Diversity
div #view the vector

#take values in the Richness column of the example dataset and generate a vector named rich
rich=example[, "Richness"]
rich

#take the Richness and Plot columns of the example dataset and generate a new data frame called rich2
rich2=example[, c("Richness", "Plot")]
rich2

#take values in the first column of the example dataset and use them to generate a vector named plot
plot=example[, 1]
plot

#create a subset of the example dataset called subset1 that contains rows 1 through 16
subset1=example[1:16,]
subset1

#create a subset of the example dataset that only contains rows where Diversity values are greater than 4
subset2=example[example$Diversity>4,]
subset2

#create a subset of the example dataset that only contains rows where Age values are Young
subset3=example[grep("Y", example$Age),]
subset3

#create a subset of the example dataset that contains rows 1 through 16 and rows 25 through 32
subset4=example[c(1:16, 25:32),]
subset4

#create a subset of the example dataset that contains rows 1, 5, 7, and 9 through 14
subset5=example[c(1, 5, 7, 9:14),]
subset5

#create a subset of the example dataset that only contains rows where Richness values are equal to 8
subset6=example[example$Richness==8,]
subset6

#create a subset of the example dataset that only contains plots of the Oak community type
oakplots=example[grep("oak", example$Community, ignore.case=TRUE),]
oakplots

#create a subset of example that contains only the Richness, Plot, and Age data from rows 1 through 16
crazy=example[1:16, c("Richness", "Plot", "Age")]
crazy

#create a subset of example that contains only the data from rows 1 through 16 and 25 through 32, and from columns 1 and 5
crazy2=example[c(1:16, 25:32), c(1, 5)]
crazy2

#create a subset of example that contains only the Maple community type data from columns 1 through 5
crazy3=example[grep("Maple", example$Community), c(1:5)]
crazy3

#create a subset of example that contains only the Young plot data from columns 1 through 5
crazy4=example[grep("Young", example$Age), c(1:5)]
crazy4

#merge two datasets
data.x=example[, c(1, 5, 6)]
head(data.x)
data.y=example[, c(1, 2, 3)]
head(data.y)
merge(data.x, data.y)