Instructions and Result Summary
VU Biostatistics and Experimental Design PLA.216
Exercise 1: Introduction to R & Biostatistics

Name and Student ID: MAXIMILIANE MUSTERFRAU
Name and Student ID: JOHN DUMMY

Work in teams of 2 students only!

0. General Information

!! Read this paragraph carefully !!

Aim: This section provides general information about specific programs and functions, installation instructions, and other tips and tricks. Please read carefully.

R Studio: RStudio is an integrated development environment (IDE) for R. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging and workspace management.

    # The program is installed on your computer.
    # Start the program
    # if you want to clean up everything, this is how you do it
    # but be careful: it deletes every object in your workspace
    rm(list=ls())

R code: Some of the code snippets for this exercise are available in this document. Run each line step by step in R Studio to make sure you understand what it does, and complete the questions in this template.

Report template: You will need to write a report during this exercise and send it to us after the session. Save your report regularly. After the session, send it to us as a PDF. Do not forget to state your names in the report!

Install R packages: Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library.

    # install.packages("<nameofpackage>")
    # install.packages("ggplot2") # should be installed already

Install Bioconductor R packages:

    source("
    # biocLite("<nameofpackage>")

Load R packages:

    # load the following packages
    # library(<nameofpackage>)
    library("ggplot2")

Working directory: Create a folder "L:/Biostatistics/Ex1/" in your home directory before you run the code below. This is the folder where you will find your result files! Also save your R file here. The raw data file (CSV) is available at

You can either download the raw file to this folder or load it via URL.

    # set working directory
    # also save your R file here!
    # write your result files to this directory!
    setwd("L:/Biostatistics/Ex1/")

R help: Use the help to retrieve more information about a function, class or package. Use ? to access the help:

    # ?<nameoffunction>

Save an image/figure in R: To save an image as PDF, do the following:

    # filename is the name (and path) of your file!
    # pdf(file = filename)
    # <create plot here: your code here>
    # dev.off()
    # Or just use Export/Save Plot as PDF in the Plots pane.

Read your data table from disk: Use e.g. the read.table() function.

    # ?read.table
    # tab-separated
    # read.table(file="filenamehere.txt", sep="\t")
    # comma-separated
    # read.csv(file="filenamehere.csv")
    # read.table(file="filenamehere.txt", sep=",")

Write your result table to disk: Use e.g. the write.table() function.

    # ?write.table
    # tab-separated
    # write.table(dataframe, file="filenamehere.txt", sep="\t")
    # comma-separated
    # write.csv(dataframe, file="filenamehere.csv")

Don't worry, we will help you through the exercise!
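The pdf() recipe above can be tried end to end with a few made-up points. This is only a sketch: the data values and the temporary file name are assumptions for illustration, not part of the exercise. The important step is dev.off(), which closes the PDF device so that the file is actually written.

```r
# Write a plot to a temporary PDF file (path chosen by tempfile(), illustrative only)
plot_file <- tempfile(fileext = ".pdf")

pdf(file = plot_file)            # open the PDF graphics device
plot(c(0.01, 0.1, 1),            # made-up x values
     c(15, 10, 5),               # made-up y values
     log = "x",
     xlab = "brain weight [kg]",
     ylab = "total sleep [h]",
     main = "Made-up data")
abline(h = 10)                   # horizontal reference line
dev.off()                        # close the device; the file is complete only now

file.exists(plot_file)
```

Until dev.off() is called, further plotting commands keep going to the PDF file instead of the Plots pane.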

1. Data frames in R

A data analysis usually starts with getting a table of data loaded into R. Here we have a Comma Separated Values (CSV) file. Excel sheets can be converted to CSV files, and CSV files can be easily read into R. More about this file format is available here:

Read CSV-files into R

Let's start here with a CSV file of mammalian sleep data. Read the CSV file msleep.csv into R using the function read.csv(). Call the data.frame tab. More information on the dataset:

    tab <- read.csv("msleep.csv") # have you set the working directory?

The variable tab has the class data.frame, which is R's name for a table of data.

    class(tab)

Three useful things to know are: what does the top of the dataframe look like, what are the dimensions of the dataframe, and what is the structure of the dataframe?

    head(tab)
    dim(tab)
    str(tab)

Type ?read.csv and read the description of the arguments there. Note that the header was assumed to exist because of the default argument header=TRUE. If the CSV file did not have a header, the first line of data would be taken as the header. The fix for this would be to specify the argument header=FALSE.

The $ operator

We can get a column of the data from a dataframe by typing the name of the dataframe followed by a $ symbol and the name of the column, with no spaces in between. First get the column names using colnames(tab) and then extract one of the columns. The column will be returned as a vector (a vector of numbers for a numeric column). Try using autocompletion on the column name using the TAB key on your keyboard: type the name of the dataframe and a $ symbol followed by the first few letters of the column and then hit TAB.
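The effect of the header argument and of the $ operator can be checked without the course file. The sketch below reads a tiny made-up, headerless table from a text string (the animal names and values are illustrative assumptions, not taken from msleep.csv).

```r
# Three made-up rows WITHOUT a header line
no_header_text <- "Cat,12.5\nCow,4.0\nDog,10.1"

# Default header = TRUE: the first data row is consumed as column names
wrong <- read.csv(text = no_header_text)
nrow(wrong)        # only 2 rows are left
colnames(wrong)

# header = FALSE: all three rows are data, columns get default names V1, V2
right <- read.csv(text = no_header_text, header = FALSE)
nrow(right)
colnames(right)

# The $ operator extracts one column as a vector
right$V2
```

Comparing nrow(wrong) and nrow(right) is a quick way to notice that a header setting is off.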

What is the name of the first animal in the table?

The name of the first animal is Cheetah.

Vectors can be combined using the function c(). For example, we can add a number, 12, to the sleep totals:

    c(tab$sleep_total, 12)

The summary() function gives the summary statistics of a set of values.

    summary(tab$sleep_total)

What is the 3rd quartile of the total sleep of all the animals?

The 3rd quartile of the total sleep of the animals is hours.

Indexing and Subsetting

Subsetting a dataframe to the first two rows:

    tab[ c(1,2), ]

The rows where the total sleep is greater than 18 hours:

    tab[ tab$sleep_total > 18, ]

Subsetting a vector looks very similar, but we just remove the comma (because there are no columns now). The first two elements can be subset like so:

    tab$sleep_total[ c(1,2) ]

What is the average total sleep, using the function mean() and vector subsetting, for the animals with total sleep smaller than or equal to 10 hours?

The average total sleep for these animals is 6.67 hours.

The function which() gives us the numeric indexes of the elements that satisfy a logical condition:

    which(tab$sleep_total > 18)
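The relation between a logical condition, which() and subsetting can be seen on a short made-up vector. The values below are assumptions for illustration only and do not come from the sleep dataset.

```r
# Made-up total sleep hours for five animals
sleep <- c(12.5, 4.0, 19.7, 8.0, 19.9)

is_long <- sleep > 18          # logical vector, one TRUE/FALSE per element
is_long

which(is_long)                 # positions of the TRUE values

sleep[is_long]                 # subsetting with the logical vector ...
sleep[which(is_long)]          # ... gives the same result as subsetting with positions

mean(sleep[sleep <= 10])       # mean over the subset with at most 10 hours
```

Logical subsetting and which() agree here because the vector has no missing values; with NA values, which() drops them while logical subsetting returns NA entries.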

For example, let's say we want to get the first value where the total sleep was more than 18 hours. This combines three operations: which() gives the indexes of the values that have total sleep of more than 18 hours; then, on the right side, we index this vector with [1] to get the first index. Then we index the original vector with that number. Take a while to look over this and take the command apart to understand what is going on:

    tab$sleep_total[ which(tab$sleep_total > 18)[1] ]

We can also combine two logical vectors and use which() to see the rows that satisfy both criteria. Logical conditions are combined using the ampersand symbol: & (logical AND).

What is the row number of the animal which has more than 18 hours of total sleep and less than 3 hours of REM sleep?

The row number is 43.

Also try with | instead of & and explain the results.

| is the logical symbol for OR, so the command returns all the animals that have either more than 18 h of total sleep OR less than 3 hours of REM sleep, or both.

The function subset() provides another possibility to subset vectors, matrices and dataframes by a condition. The following code line, for example, reduces the dataframe to contain only 3 columns ("order", "name" and "sleep_total") and only the 22 rows with order being Rodentia.

    subset(tab, subset = tab$order == "Rodentia", select = c("order", "name", "sleep_total"))

Use the function subset() to obtain a new dataframe which contains only rows where order is Primates, with the columns name, sleep_total and bodywt. How many rows does the new dataframe contain?

The subset contains 12 rows.

Now save the dataframe you just created to a tab-separated file (file extension .txt) using the function write.table().

Consult the help page of the function by typing ?write.table in the console. Set the parameters to avoid printing the row names. Also set the separator correctly. In the Windows Explorer, navigate to the directory where you saved the .txt file and open it. Include the table here.

    "name" "sleep_total" "bodywt"
    "Owl monkey"
    "Grivet"
    "Patas monkey"
    "Galago"
    "Human" 8 62
    "Mongoose lemur"
    "Macaque"
    "Slow loris"
    "Chimpanzee"
    "Baboon"
    "Potto"
    "Squirrel monkey"
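One way to check a subset() plus write.table() workflow is to write the file and read it straight back. The sketch below does this on a small made-up dataframe and a temporary file; the data values, the object names (demo, carn, back) and the choice of quote = FALSE are assumptions for illustration, not the required solution.

```r
# Small made-up dataframe (not the msleep data)
demo <- data.frame(
  name        = c("Cat", "Cow", "Dog"),
  order       = c("Carnivora", "Artiodactyla", "Carnivora"),
  sleep_total = c(12.5, 4.0, 10.1)
)

# Keep only the Carnivora rows and two of the columns
carn <- subset(demo, subset = order == "Carnivora",
               select = c("name", "sleep_total"))

# Write a tab-separated file without row names
out_file <- tempfile(fileext = ".txt")
write.table(carn, file = out_file, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Read it back to confirm separator and header were written as intended
back <- read.table(out_file, header = TRUE, sep = "\t")
back
```

If the separator or the row-name setting were wrong, the dimensions or column names of back would differ from those of carn.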

2. Plotting

Regular plots

Let's go ahead and make a plot of the brain weight (brainwt) and the total sleep (sleep_total), to see what the data look like:

    plot(tab$brainwt, tab$sleep_total)

Once more, with a logarithmic scale x-axis:

    plot(tab$brainwt, tab$sleep_total, log="x")
    abline(h=15)

Add axis labels (name & unit) to the plot. Add a title. Change one graphical parameter, e.g. the color (?plot, ?par). Add a horizontal line at y=15 using the abline() function. Include your plot and R code here:

    plot(tab$brainwt, tab$sleep_total, log="x", col="#587498",
         main="sleep vs. Brain weight",
         ylab="total sleep [h]", xlab="brain weight [log(kg)]")
    abline(h=15)

Save your plot as PDF using the pdf() function. State your code here. Hint: Look at 0. General Information for more information on how to do this!

    pdf(file = "your_plot.pdf")
    [...your code for plotting...]
    dev.off()

ggplots (BONUS)

Let's try a different way to plot: ggplots. There are many tutorials for ggplots, e.g. In order to use ggplots, we need to load the package. If the package cannot be loaded, you will have to install the package first.

    library(ggplot2)
    #install.packages("ggplot2") # install if you get an error

The first line of the code below removes all rows with NA values in the brainwt column. Now let's plot the same as above. Every ggplot2 plot has a data layer, which defines the data set to plot (here tab3) and the basic mappings of data to aesthetic elements (aes(x,y)). Then we add a geometry layer to the plot: we would like to get a scatterplot (points), so we use geom_point().

    tab3 <- tab[!is.na(tab$brainwt),]
    ggplot(tab3,aes(x=log(brainwt), y=sleep_total)) + geom_point()

Have a look at how the plot can be manipulated using ggplots:

    ggplot(tab3,aes(x=log(brainwt), y=sleep_total, color=vore)) + geom_point()
    ggplot(tab3,aes(x=log(brainwt), y=sleep_total)) + geom_point(color="#587498")

    g <- ggplot(tab3,aes(x=log(brainwt), y=sleep_total)) + geom_point()
    g1 <- g + geom_smooth(); print(g1)
    g2 <- g + geom_hline(yintercept = 10); print(g2)

Add axis labels (name & unit) to the plot. Add a title. Add a horizontal line at y=15. Include your ggplot here (BONUS):

    g <- ggplot(tab3,aes(x=log(brainwt), y=sleep_total)) +
      geom_point(color="#587498") +
      labs(x = "Brain weight [log(kg)]", y = "Total sleep [h]"); print(g)
    g <- g + ggtitle("sleep vs. Brain weight"); print(g)
    g <- g + geom_hline(yintercept = 15); print(g)

3. For Loop

Simple Examples

A for loop can be used to iterate over the elements in a vector, the rows or columns in a matrix or a dataframe, the elements of a list, etc. In each iteration a block of code is executed on the current element. Let's look at a really simple example:

    for (i in 1:5){
      print(paste("we are in the loop. Iteration #", i))
    }

    # another example
    x <- c(3,4,5,2); x
    for (i in 1:length(x)){
      y <- x[i] + 3
      print(paste("y is", y))
    }

When iterating over the elements of a vector in a for-loop, the expressions in the code block within the for-loop are evaluated in each iteration. This is rather inefficient (can take very long), especially for a large number of elements (~ ). In R many functions are vectorized. Thanks to vectorization we do not need to use a for-loop to add 3 to a vector. We can simply replace the for-loop with x + 3:

    y <- x + 3; y
    cat(paste("y is", x + 3, "\n"))

Now we will use an if statement, a logical NOT (!) and next.

    x <- c(3,4,5,2); x
    for (i in 1:length(x)){
      y <- x[i] + 3
      if (!(y %% 2)){
        next
      }
      print(paste("y is", y))
    }

Try to understand what the line if (!(y %% 2)) does. Hint: modulo operator %%.

Also this could be vectorized, e.g.

    y <- (x + 3) %% 2
    y <- x[y > 0] + 3
    cat(paste("y is", y, "\n"))

Explain the effect of the next statement on the for loop in one sentence.

With the next statement, the rest of the current iteration is skipped and the loop continues with the next iteration.

Write a For Loop

Let's go back to our table. We can also iterate over the rows in our sleep dataset and subtract the REM sleep time from the total sleep time to obtain the non-REM sleep time. To do so, create a new vector with length equal to the number of rows. This has to be done outside the for-loop.

    sleep_other <- numeric(nrow(tab))

Note: The function nrow(tab) returns the number of rows of tab. With the function numeric() a vector of mode numeric and length equal to nrow(tab) is created. The elements of sleep_other are by default initialized with zero.

Now iterate over the rows in the sleep dataset and store the difference of total sleep time and REM sleep time in the corresponding element of the vector.

    for ( ) {
      <your code here>
    }
    str(sleep_other)

State your R code here.

    for (i in 1:length(sleep_other)) {
      sleep_other[i] <- tab$sleep_total[i]-tab$sleep_rem[i]
    }
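The pattern of creating the result vector before the loop and filling it element by element can be sketched on a short made-up vector. The calculation (x squared minus 1) and the use of seq_along() in place of 1:length() are choices made for this illustration only.

```r
x <- c(3, 4, 5, 2)

# Create the result vector once, outside the loop (all zeros at first)
res <- numeric(length(x))

# Fill it one element per iteration
for (i in seq_along(x)) {
  res[i] <- x[i]^2 - 1
}
res

# The vectorized expression gives the same values without a loop
res_vec <- x^2 - 1
identical(res, res_vec)

# Why seq_along() can be safer than 1:length(): for an empty vector,
# 1:length() counts down from 1 to 0 instead of doing nothing
empty <- numeric(0)
1:length(empty)
seq_along(empty)
```

For the vectors in this exercise both forms of the loop index behave the same, since the vectors are never empty.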

If either the total sleep time or the REM sleep time is not available (NA), the difference cannot be calculated and NA is returned. Use the following code lines of nested functions to determine the number of NA values:

    length(which(is.na(sleep_other)))
    table(is.na(sleep_other))

How many values in sleep_other are NA? What does this mean?

22 values in sleep_other are NA. This means that for 22 animals either the total sleep or the REM sleep is not available, and therefore sleep_other cannot be calculated.

Again, thanks to vectorization we do not need to use a for-loop to calculate the values in sleep_other. We can simply replace the for-loop with

    sleep_other2 <- tab$sleep_total - tab$sleep_rem

Double For Loops

Let's have a look at a double for loop now (BONUS):

    # double for loop
    y <- c(5,6,2)
    for (i in 1:length(x)){
      for (j in 1:length(y)){
        z <- x[i] + y[j] + 3
        print(paste("z is", z))
      }
    }
    # here x + y + 3 does not work!

Create a 15 x 15 matrix. For each row and for each column, assign the values of the matrix based on position using the product of the two indexes. When the indexes are equal, set the value to 1 using an if / else statement. Copy your code here. (BONUS)

    mat <- matrix(nrow=15, ncol=15)
    for(i in 1:dim(mat)[1]) {
      for(j in 1:dim(mat)[2]) {
        if (i==j) {
          mat[i,j] = 1
        } else {
          mat[i,j] = i*j
        }
      }
    }
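The double loop can be cross-checked against a loop-free construction. The sketch below is an alternative added for comparison, not the expected exercise answer: it assumes outer() for the table of index products and diag<- for the diagonal, and the object names mat_loop and mat_vec are made up here.

```r
n <- 15

# Double for loop, as in the exercise
mat_loop <- matrix(nrow = n, ncol = n)
for (i in 1:n) {
  for (j in 1:n) {
    if (i == j) {
      mat_loop[i, j] <- 1
    } else {
      mat_loop[i, j] <- i * j
    }
  }
}

# Loop-free version: outer() builds all products i*j at once,
# then the diagonal (where i == j) is overwritten with 1
mat_vec <- outer(1:n, 1:n)
diag(mat_vec) <- 1

all(mat_loop == mat_vec)   # both constructions agree
mat_vec[3, 4]              # off-diagonal entry: product of the indexes
mat_vec[5, 5]              # diagonal entry
```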

4. Dataframe Manipulations

Similar to the function c(), which concatenates the elements of vectors to a single vector, cbind() and rbind() can be used to concatenate vectors, matrices or dataframes into one single dataframe. To concatenate objects with rbind(), i.e. to increase the number of rows, they need to have the same number of columns. To concatenate objects with cbind(), i.e. to increase the number of columns, they need to have the same number of rows.

Add a new column to the sleep dataset containing the sleep hours other than REM (sleep_other) using cbind().

    tab2 <- cbind(tab, sleep_other)

For reasons of clarity we would like all columns containing sleep hours to appear next to each other. Thus we have to reorder the columns in the dataframe.

    tab2 <- tab2[, c("name", "genus", "vore", "order", "conservation",
                     "sleep_total", "sleep_rem", "sleep_other", "sleep_cycle",
                     "awake", "brainwt", "bodywt") ]

Additionally, we want to reorder the rows of the dataframe so that the animal which has the longest sleep of type other than REM is listed at the top of the table and the one with the shortest at the bottom.

Sort and print out the number of sleeping hours other than REM, with the longest sleep at the top, using the sort() function (here you have a vector!).

    sort(tab2$sleep_other, decreasing = TRUE)

How many hours (sleep_other) does the animal with the longest sleep other than REM sleep?

It sleeps for 17.9 h (sleep_other).

Reorder the rows in the dataframe to list the animals that sleep longest at the top of the table.

    tab2[ order(tab2$sleep_other, decreasing = TRUE), ]

Use for example the function head() to display the top rows of the dataframe.
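The difference between sort() and order() is easiest to see on a few rows. The following sketch uses a made-up three-row dataframe; the names demo, demo2 and ord and all values are assumptions for illustration.

```r
# Made-up data (not the msleep data)
demo <- data.frame(name = c("Cat", "Cow", "Dog"),
                   sleep_total = c(12.5, 4.0, 10.1))
sleep_rem <- c(3.2, 0.7, 2.9)

# cbind() needs the same number of rows on both sides
demo2 <- cbind(demo, sleep_rem)

# sort() returns the sorted VALUES of a vector
sort(demo2$sleep_total, decreasing = TRUE)

# order() returns the row POSITIONS that would sort the vector
ord <- order(demo2$sleep_total, decreasing = TRUE)
ord

# Using those positions as row index reorders the whole dataframe
demo2_sorted <- demo2[ord, ]
head(demo2_sorted)
```

Indexing with order() only returns a reordered copy; the result has to be assigned to a name, as done with demo2_sorted here, if it should be kept.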

5. Useful Functions in R

split

split() is a function which takes a vector and splits it into a list by grouping the vector according to a factor. Let's use our mammal sleep data again to try this out. Split the total sleep column by the mammals' Order (here Order means the biological taxonomy, above Family and below Class):

    s <- split(tab$sleep_total, tab$order); s

We can pull out a single vector from the list using the name of the Order or the number at which it occurs in the list (Note: this is the position of the level in the levels of the factor). Lists are indexed with double square brackets [[]], instead of a single square bracket []:

    s[[17]]
    s[["Rodentia"]]

How many hours do rodents sleep (total sleep) on average?

They sleep for hours on average.

apply

The family of apply() functions is used to manipulate slices of data from matrices, arrays, lists and dataframes in a repetitive way. lapply() and sapply() are useful functions for applying a function repeatedly to a vector or list. lapply() returns a list, while sapply() tries to "simplify", returning a vector if possible (if there is only one element returned by the function for each element of the input). Let's use lapply() to get the average total sleep for each Order:

    lapply(s, mean)

As you can see, a list is returned. Let's use sapply() instead:

    sapply(s, mean)
    # the above is equivalent to
    sapply(s, function (x) { mean(x) })
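The interplay of split(), lapply() and sapply() can be followed on a handful of values. The hours and group labels below are made up for illustration and are not taken from the sleep dataset.

```r
# Made-up sleep hours and the group each value belongs to
hours <- c(12.5, 10.1, 4.0, 3.0, 14.9)
grp   <- c("Carnivora", "Carnivora", "Artiodactyla", "Artiodactyla", "Rodentia")

# split() returns a list with one vector per group
s_demo <- split(hours, grp)
names(s_demo)

# Double square brackets pull out one element of the list
s_demo[["Carnivora"]]

# lapply() always returns a list ...
m_list <- lapply(s_demo, mean)
class(m_list)

# ... while sapply() simplifies to a named numeric vector here,
# because mean() returns exactly one value per group
m_vec <- sapply(s_demo, mean)
m_vec
```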

Use lapply() or sapply() to answer the following question: What is the standard deviation of total hours of sleep for the Primates Order?

The standard deviation of total hours of sleep of Primates is 2.21 hours.

Use sapply() to search through the list s and retrieve all indexes where the value equals 10.1. State your R code and the results here. (BONUS)

    sapply(s, function (x) { which(x == 10.1) })

    $Carnivora
    [1] 3
    $Erinaceomorpha
    [1] 1
    $Primates
    [1] 7

6. User defined Functions

A Simple Example

One of the strengths of R is the ability to add functions. The syntax of a function looks like this in R:

    # myfunction <- function(arg1, arg2, ...) {
    #   statements
    #   return(object)
    # }

Now let's write our first function! First we have to define the function and give it a name. We will call it square.value(). Our function will simply compute the square value of a given value.

    # simple example
    # define function
    square.value <- function(x) {
      sqval <- x*x
      return(sqval)
    }

Now that we have defined our function, we can call it with a value of our choosing as argument.

    # call function
    square.value(4)

Note that our function is already vectorized:

    x <- c(3,4,5,2); x
    square.value(x)

Write your own function, which
first takes the square root of a given value,
second adds 10 to the result of the above.
State your R code here.

    calculate.value <- function(x) {
      r <- sqrt(x)
      val <- r + 10
      return(val)
    }
    calculate.value(x)

Write your own function, which
first takes the mean of a given vector (arg1),
second adds a given value (arg2) to the result of the above,
third takes the log2 of the result of the above.
State your R code here. (BONUS)

    calculate.value <- function(x,y) {
      m <- mean(x)
      addval <- m + y
      endval <- log(addval,2)
      return(endval)
    }
    calculate.value(x, 4)
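A function of this kind can be checked by hand with inputs chosen so that the result is a round number. The sketch below is a variant written for illustration: the name shift.log2, the default value for the second argument and the test inputs are all assumptions, and log2() is used as a shorthand for log(..., 2).

```r
# Hypothetical variant: mean of a vector, plus a value, then log base 2
shift.log2 <- function(x, add = 0) {   # 'add' defaults to 0 if not given
  m <- mean(x)
  log2(m + add)                        # the last evaluated expression is returned
}

x <- c(3, 4, 5, 2)

# mean(x) is 3.5; 3.5 + 4.5 = 8; log2(8) = 3
shift.log2(x, 4.5)

# With the default second argument: mean(c(2, 6)) is 4; log2(4) = 2
shift.log2(c(2, 6))

# log2(v) and log(v, 2) compute the same quantity
all.equal(log2(7.5), log(7.5, 2))
```

Choosing inputs whose expected result can be worked out on paper, as with 8 and 4 here, is a simple way to test a new function before using it on real data.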

