Computer lab 2
Course: Introduction to R for Biologists
April 23, 2012

1 Scripting

As you have seen, you often want to run a sequence of commands several times, perhaps with small changes. An efficient way to do this is to store your commands in a text file and run the text file from R. This is called scripting. It is vital for efficient analysis, and it also gives you an exact record of what you did in your computations for later use.

1. Create a directory at some convenient place on your computer, possibly in a specific folder for this course, for storing your R files. It is usually easiest to keep the scripts for a single analysis in a separate folder, so if you prefer, create a sub-folder named lab2 or something similar to collect the files for this lab.

2. Change your working directory to your newly created directory. In RStudio you can find the command Set Working Directory under the Tools or Project menu and browse for the directory. In any R environment you can use the functions getwd and setwd to see which working directory you are currently in and to set the working directory.

3. Objects you create in R can be stored in a workspace when you close R. The workspace is stored in the current working directory, the one you just set above. Try this out: create a couple of objects in R, close R while saving the workspace, then go to the directory you created, right-click on the R icon, and select open with RStudio (this is how to do it under Windows). Your created objects should now be available; try ls(). You can also load workspaces manually by using the load button in the top right workspace panel in RStudio or by using the command load().

4. Now select New R Script from the File menu in RStudio and save it as myscript.r in your chosen directory. If you are not using RStudio, you can create a file of the same name with a text editor of your own choice. In this file, write
       mydata <- c(432, 44, 1)
       mean(mydata)

   and save it. Scripts are run with the source command, which can also be accessed through the Code menu in RStudio. To run your script, write in the R console

       source("myscript.r")

   If you get an error message, try the function dir(); it lists the files in the current directory. If myscript.r is not listed, you should either move the file to your current working directory or change your working directory to the file's location.

   If you do not get any error message, the script was executed, but you will not see any output. If you try ls(), you will see that R now has an object called mydata. By default, commands in R scripts are silent, so the mean of the data was not printed when the script was run. To get R to print the result of a command in a script, you need to write, for example,

       print(mean(mydata))

   Edit your file accordingly and run the script again to see the output. (Don't forget to save your file after you have edited it.)

5. Script files such as myscript.r above are useful for storing sequences of commands. They can also store other text that explains your computations and your thinking, right next to the commands. The symbol # makes R ignore all text following it on the same line; it is therefore called the comment symbol.

   Write a text file containing a solution to the following exercise: Read in the data 34, 54, 25, 53, 24, 41, 49, 32, 26, 51 and analyze it by producing some of the summary statistics and some of the plots you have learned to produce so far. (Hint: again use c() to combine the data into a vector; other useful functions are mean, sd, hist, and summary.) The text file should contain, as comments, an explanation of what each command is doing.

   A very useful feature in RStudio is the ability to run only part of a script. This is done by selecting those lines in the file panel, top left,
   and pressing Ctrl + Enter (Cmd + Enter on a Mac). Try this on some of the code in your script.

NOTE: All your obligatory exercises should be written and handed in using the format above: a text file that can be run as a script by R and that contains, as comments, the additional text that explains the computations.

2 Data structures in R

The data objects we have looked at so far have been either vectors or matrices, containing numbers, text strings, or logical values. We will now look at a few other common data structures: factors, data frames, and lists. From now on, it is suggested that you write the solutions to your exercises in scripts so that you can easily redo steps, change commands, and get back to your solutions in the future.

1. Categorical variables are variables that can take on certain specific levels: the variable sex could have the two levels male and female, and a variable color could have the levels red, green, and blue, for example. Such variables are represented in R with factors. Create a factor as follows:

       > data <- c("woman", "man", "man", "woman", "woman")
       > d <- factor(data)

   A factor is represented in a specific way in the computer; try to guess how by applying the functions levels and as.numeric to d. Also try out the function as.character on d. Storing a vector as a factor changes the behavior of many functions, sometimes in a direction you want, sometimes not. We will return to cases where factors are very useful.

2. Real data sets often come in the form of tables. Often, each row represents an observation, and each column an attribute of each observation. The attributes can be of various types, sometimes represented by numbers, sometimes by text.
   Try out and explain the outputs of the following commands:

       > attribute1 <- c(34, 52, 31)
       > attribute2 <- 1:3
       > attribute3 <- c("man", "woman", "woman")
       > mymatrix1 <- cbind(attribute1, attribute2)
       > mymatrix2 <- cbind(attribute1, attribute2, attribute3)

   NOTE: If you have lines of code in your script that you do not want to run, it can be useful to comment them out by writing a # in front of
   that line. This allows you to keep the code but prevents it from being executed.

3. If you examine mymatrix2 (simply write mymatrix2 in the console), you will see that all the values are treated as character strings, as opposed to numbers as in mymatrix1. Since a matrix cannot contain different data types, R forces all the data into the same type without giving a warning. However, mixing different types of data is often necessary, and this is possible using a data frame. Try

       > myframe <- data.frame(attribute1, attribute2, attribute3)

   You will see when you display it that the data frame has named columns. Use the names function to assign suitable names to the three columns.

4. If you, for example, named the first column Age, you will now be able to access this column in two ways:

       > myframe$Age
       > myframe[,1]

   Use the first type of access to change the first woman's age from 52 to 49, and the second type of access to change the man's age from 34 to 32.

5. Use the class function to investigate the type of the third column. You will find that it is a factor (in R versions before 4.0.0; from R 4.0.0 onward, data.frame keeps text columns as character by default). This may or may not be what you would like. Read the help for data.frame and find a way to re-construct myframe so that the last column gets type character and not factor. This can also be done by replacing the last column with itself while changing the type using as.character.

6. In the data above, each row represented an observation, so naturally all columns had the same length. In other types of data, the data set might be a collection of several vectors of different lengths. In this case you can use a list to collect the data in a single object.

   OPTIONAL: Use help on the list function to find out how you can represent such data as a list. You can, for example, create a list containing mymatrix1, mymatrix2, and myframe using

       > mylist <- list(mymatrix1, mymatrix2, myframe)

   You access the elements of the list using double square brackets; for example, mylist[[1]] gives you the contents of mymatrix1.
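
The difference between matrices, data frames, and lists described in steps 3 to 6 can be illustrated with a minimal sketch. The vectors height and species and the object names m, trees, and samples are made up for this illustration and are not part of the lab data.

```r
# Hypothetical measurements, not taken from the lab
height  <- c(1.62, 1.80, 1.75)
species <- c("oak", "pine", "oak")

# cbind() builds a matrix; mixing numbers and text turns every value into text
m <- cbind(height, species)
class(m[, "height"])      # "character"

# A data frame keeps a separate type for each column
trees <- data.frame(height, species)
class(trees$height)       # "numeric"

# Both access styles can be used on the left-hand side of an assignment
trees$height[2] <- 1.85
trees[3, 1] <- 1.70

# A list can hold components of different lengths and types
samples <- list(short = c(1, 2), long = 1:10, table = trees)
length(samples[[2]])      # 10
samples[["table"]]$height # the height column of the stored data frame
```

Naming the list components, as done here with short, long, and table, is optional; it allows access by name as well as by position.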
3 Input and output of data

We are finally getting to a very important point: input and output of data. Real data sets will most often be the output of some other program. A general way of getting such data into R is to make sure it is in some kind of text format.

1. Download the file Example1.txt from the course homepage and put it into your current R working directory. Open the file in Windows with, for example, Notepad. You will see that it has three columns of data, that the first line contains headings, that the first column is text and the other two columns consist of numbers, and that the columns are separated by tab characters. The data is in fact part of the result from a microarray experiment; the first column consists of names of probes for genes. The file has been produced by Microsoft Excel, using the output option tab-delimited text.

2. A general way to input data in the format of a table is to use the read.table function. Try first

       > mydata <- read.table("Example1.txt")

   You are likely to get an error message, because read.table has to be adapted to this particular type of file. Read help(read.table) to identify the problem or problems, and use the help information to find a way to change the arguments of read.table so that it reads in the data without problems.

3. Investigate your new object using functions you know. A useful function may be head. Other useful functions to apply are dim, class, and names; make sure you understand the output from each. Try also

       > class(mydata$genename)

   which will show that the first column is a factor, and not just a character vector (in R versions before 4.0.0; from R 4.0.0 onward, read.table reads text columns as character by default). Columns of a data frame that are factors may cause unexpected behavior if they are intended to be interpreted just as character vectors. Go back to the help for read.table, or for read.delim, and find an option so that when you re-read the data from file, the first column becomes a character vector, i.e., the last command above responds with character.

4. Try the function

       > newdata <- edit(mydata)
   and change some of the probe names to names you find prettier.

5. Create from newdata a new data set consisting of only the lines where the probe name has "-" as the second character. (Hint: consider the function substr.)

6. To write out data in table format, the function write.table is often useful. Read help(write.table) to find out how to output your data again to a text file; name it newdata.txt. Use Notepad or another text editor to view the data file.

   OPTIONAL: If you like, try to open the new file with Excel. In the help file for write.table, you may find an alternative command that is better suited for outputting a table that is going to be read into Excel.

7. Data can also be contained in packages. For example, the package connected to our textbook, ISwR, contains a number of data sets. Activate the package (for this R session) by writing

       > library(ISwR)

   If you get an error message, it means that the package has not yet been installed on the computer you use. To install it, use either

       > install.packages("ISwR")

   or the Packages menu (under Windows). After the package has been activated with the library function, use

       > help(package=ISwR)

   to see a list of the data sets contained in the package. You can read more about each data set using help; e.g., for the energy data set, write

       > help(energy)

   To activate the data set, so that it appears among your objects when you use the ls function, use the data function; e.g., to activate the energy data set, write

       > data(energy)

   Finally, visualize the data. Try the two commands

       > plot(energy)
       > plot(energy$expend~energy$stature)

   and explain the output.
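
The idea that the arguments of read.table must match the layout of the file can be seen in a small round trip. This is only a sketch under stated assumptions: the data frame probes, its column names, and the ";" separator are invented for the illustration and deliberately differ from the course file, and a temporary file is used so that nothing in your working directory is changed.

```r
# Hypothetical table, not the course data
probes <- data.frame(probe = c("A-01", "B202", "C-17"),
                     red   = c(1200, 8400, 310),
                     green = c(900, 2100, 450))

# Write to a temporary file with ";" between the columns
path <- tempfile(fileext = ".txt")
write.table(probes, file = path, sep = ";", row.names = FALSE, quote = FALSE)

# Look at the raw text, as you would in a text editor
readLines(path, n = 2)

# The reader is told about the same layout: a heading line and ";" as separator
back <- read.table(path, header = TRUE, sep = ";")

dim(back)      # 3 rows and 3 columns
names(back)    # "probe" "red" "green"
head(back, 2)  # the first two rows
```

If the sep or header arguments do not agree with how the file was written, the result will typically have the wrong number of columns or treat the headings as data, which is worth checking with dim and head after every import.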
4 R programming

So far, you have applied R either by using single commands or by using sequences of commands placed together in a script. One of the strengths of R is that you can seamlessly expand the way you use R into using it as a programming language.

1. Even if a script can store a useful sequence of commands, it is not very flexible if you want to apply it to different data or use it multiple times. The standard functions in R, such as plot and sd, offer this flexibility. In R you have the option to write your own functions. As an example, let's assume that you need to go through a vector of words and replace any occurrence of the string 3XSSC with another given string. Write the following code in a script:

       myreplace <- function(v, newstring="yes") {
         index <- v == "3XSSC"
         v[index] <- newstring
         return(v)
       }

   Source the script so that the code gets executed in R. Check the workspace panel in RStudio or use ls() to see that the function has appeared. Now test this function on the first column of the data set read in as mydata above by writing

       > outputvector <- myreplace(mydata[,1])

   Can you tell the difference between mydata[,1] and outputvector? (Hint: use head to look at the first few values of each of them.)

   Now let's go through the meaning of each line in this function.

       myreplace <- function(v, newstring="yes") {
       }

   This part states that we create a function called myreplace that has two input arguments (also known as parameters), v and newstring. newstring also has a default value, "yes", that is used if we do not supply a value for newstring. The curly brackets { and } define what is inside our function. Any code between them is executed when the function is called.

       index <- v == "3XSSC"
       v[index] <- newstring
       return(v)
   These three lines define what the function actually does. It first finds all occurrences of 3XSSC and stores this information in index. Then it replaces all the elements equal to 3XSSC with the value stored in newstring. Finally, the return command states that v should be given as the output of our function call.

   Now try to call the function myreplace on the first column of mydata again, but give the function an additional argument so that it changes 3XSSC into another word.

2. Make your function more general by giving it an extra argument, with default value "No", indicating which word should be replaced. Test this new version of your function by, for example, replacing geno1 in mydata with something else.

3. In some situations you want to perform the same set of commands multiple times. This can be done with loops. There are a few different options for this in R, but the most common one is the for loop. Write the following code in a script and run it:

       for (i in 1:10) {
         print(i)
       }

   This loop runs the code print(i) for each value of i contained in the vector 1:10. Note again the construction 1:10, which is a very quick way of creating a sequence of values. Now create a new loop that prints the first ten gene names in mydata.

4. The final concept to consider is conditional execution of commands. This means having code in your script or function that only gets executed if some condition holds. For example, try running the following code using a script:

       a <- 10
       if (a > 15) {
         print("a is greater than 15")
       } else {
         print("a is less than or equal to 15")
       }

   This code checks the condition after if, in this case whether a is larger than 15, and if it is true, it executes the first block of code. If it is not true, it checks whether an else branch is given and executes the code following that. Try what happens when you change the value of a to something larger than 15.
   Now combine your knowledge of for loops and if statements to write a short script that steps through the first twenty rows of mydata, for each row calculates the difference between the red and green values (columns 2 and 3), and then prints the name of any gene where the difference is larger than 5000. Hint: use the abs function to get around problems with negative differences. The same procedure can be done using only direct vector operations, but try to use a for loop containing an if statement.

   If you want to read more about for loops and if statements, chapter 2.3.1 in Dalgaard covers this. You can also find information in chapter 9 of the text An Introduction to R, accessible through the built-in R help; use help.start() to open it.

These final two exercises are very nice and combine many of the concepts we have covered so far, but they can be slightly demanding. At this point you have the option to head straight to the first three hand-in assignments, labs three to five.

5. OPTIONAL: Transform the data stored in mydata as follows: for all lines where the gene name is duplicated, remove all but the first one. Then sort all lines according to the gene names in the first column. You may find functions such as sort, unique, and duplicated useful.

6. OPTIONAL: We would like to create a new function, similar to myreplace, that does the following: replace the first letter in each word, if it is a capital letter, with the corresponding lower-case letter. You may find the built-in vectors LETTERS and letters useful. The best thing, for speed of execution, is to write your function using vector operations: try this. Alternatively, try to write the function using a for loop.
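
The three constructs introduced in this section (a function with a default argument, a for loop, and an if statement) can be combined as in the following minimal sketch. The function name names_above, its arguments, and the example vectors x and lab are hypothetical and chosen only for illustration; they are not part of the course data or of any exercise solution.

```r
# Hypothetical function: collect the labels of all values whose size exceeds a limit.
# 'limit' has a default value, in the same way as newstring has in myreplace.
names_above <- function(values, labels, limit = 100) {
  found <- character(0)               # start with an empty character vector
  for (i in seq_along(values)) {      # i runs over 1, 2, ..., length(values)
    if (abs(values[i]) > limit) {     # abs() makes the sign irrelevant
      found <- c(found, labels[i])    # append the matching label
    }
  }
  return(found)
}

# Made-up example data
x   <- c(20, -150, 99, 400)
lab <- c("a", "b", "c", "d")

names_above(x, lab)               # default limit of 100: "b" "d"
names_above(x, lab, limit = 10)   # limit supplied by the caller: "a" "b" "c" "d"

# The same selection written as a single vector operation
lab[abs(x) > 100]                 # "b" "d"
```

The loop version makes each step explicit, which helps when learning; the vector version on the last line is shorter and usually faster for long vectors.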