Computer lab 2 Course: Introduction to R for Biologists

April 23, 2012

1 Scripting

As you have seen, you often want to run a sequence of commands several times, perhaps with small changes. An efficient way to do this is to store your commands in a text file and run the text file from R. This is called scripting. It is vital for efficient analyses, and it also documents exactly what you did in your computations for later use.

1. Create a directory at some convenient place on your computer, possibly a specific folder for this course, for storing your R files. It is usually easiest to keep the scripts for a single analysis in a separate folder, so if you prefer, create a sub-folder named lab2 or something similar to collect the files for this lab.

2. Change your working directory to your newly created directory. In RStudio you can find the command Set Working Directory under the Tools or Project menu and browse for the directory. In any R environment you can use the functions getwd and setwd to see which working directory you are currently in and to set the working directory.

3. Objects you create in R can be stored in a workspace when you close R. This workspace is stored in the current directory, the one you just set above. Try this out: create a couple of objects in R, close R while saving the workspace, then go to the directory you created, right-click on the R icon and select open with RStudio (this is how to do it under Windows). Your created objects should now be available; try ls(). You can also load workspaces manually with the load button in the top right workspace panel in RStudio or with the command load().

4. Now select New R Script from the File menu in RStudio and save it as myscript.r in your chosen directory. If you are not using RStudio, you can create a file of the same name with a text editor of your choice. In this file, write

    mydata <- c(432, 44, 1)
    mean(mydata)

and save it. Scripts are run with the source command, which can also be accessed through the Code menu in RStudio. To run your script, write in the R console

    source("myscript.r")

If you get an error message, try the function dir(); it lists the files in the current directory. If myscript.r is not listed, you should either move the file to your current working directory or change your working directory to the file's location. If you do not get an error message, the script executed, but you will not get any output at all. If you try ls(), you will see that R now has an object called mydata. By default, commands in R scripts are silent, so the mean of the data was not printed when the script was run. To get R to print the result of a command in a script you need to write, for example,

    print(mean(mydata))

Edit your file accordingly and run the script again to see the output. (Don't forget to save your file after you have edited it.)

5. Script files such as myscript.r above are useful for storing sequences of commands. They can also store other text that explains your computations and your thinking, right next to the commands. The symbol # makes R ignore all text following it on the same line; it is therefore called the comment symbol. Write a text file containing a solution to the following exercise: Read in the data 34, 54, 25, 53, 24, 41, 49, 32, 26, 51 and analyze it by producing some of the summary statistics and some of the plots you have learned to produce so far. (Hint: again use c() to combine the data into a vector; other useful functions are mean, sd, hist, and summary.) The text file should contain, as comments, an explanation of what each command is doing. A very useful feature in RStudio is the ability to run only part of a script. This is done by selecting those lines in the file panel, top left, and pressing Ctrl + Enter (Cmd + Enter on Mac). Try this on some of the code in your script.

NOTE: All your obligatory exercises should be written and handed in using the format above: a text file that can be run as a script by R, and which contains, as comments, the additional text that explains the computations.

2 Data structures in R

The data objects we have looked at so far have been either vectors or matrices, containing either numbers, text strings, or logical values. We will now look at a few other common data structures: factors, data frames, and lists. From now on it is suggested that you write the solutions to your exercises in scripts, so that you can easily redo steps, change commands, and get back to your solutions in the future.

1. Categorical variables are variables that can take on certain specific "levels": the variable sex could have the two levels male and female, and a variable color could have the levels red, green, and blue, for example. Such variables are represented in R with factors. Create a factor as follows:

    > data <- c("woman", "man", "man", "woman", "woman")
    > d <- factor(data)

A factor is represented in a specific way in the computer; try to guess how by applying the functions levels and as.numeric to d. Also try out the function as.character on d. Storing a vector as a factor changes the behavior of many functions, sometimes in a direction you want, sometimes not. We will return to cases where factors are very useful.

2. Real data sets often come in the form of tables. Often, each row represents an observation and each column an attribute of each observation. The attributes can be of various types, sometimes represented by numbers, sometimes by text.
Try out and explain the outputs of the following commands:

    > attribute1 <- c(34, 52, 31)
    > attribute2 <- 1:3
    > attribute3 <- c("man", "woman", "woman")
    > mymatrix1 <- cbind(attribute1, attribute2)
    > mymatrix2 <- cbind(attribute1, attribute2, attribute3)

NOTE: If you have lines of code in your script that you do not want to run, it can be useful to comment them out by writing a # in front of each line. This lets you keep the code but prevents it from being executed.

3. If you examine mymatrix2 (simply write mymatrix2 in the console) you will see that all the values are treated as characters, as opposed to numbers as in mymatrix1. Since a matrix cannot contain different data types, R forces all the data into the same type without giving a warning. However, mixing different types of data is often necessary, and this is possible using a data frame. Try

    > myframe <- data.frame(attribute1, attribute2, attribute3)

You will see when you display it that the data frame has named columns. Use the names function to assign suitable names to the three columns.

4. If you, for example, named the first column Age, you will now be able to access this column in two ways:

    > myframe$Age
    > myframe[,1]

Use the first type of access to change the first woman's age from 52 to 49, and the second type of access to change the man's age from 34 to 32.

5. Use the class function to investigate the type of the third column. You will find that it is a factor (in R versions before 4.0.0; from R 4.0.0 on, data.frame keeps text columns as character by default). This may or may not be what you would like. Read the help for data.frame and find a way to re-construct myframe so that the last column gets type character and not factor. This can also be done by replacing the last column with itself while changing the type using as.character.

6. In the data above, each row represented an observation, so naturally all columns had the same length. In other types of data, the data set might be a collection of several vectors of different lengths. In this case you can use a list to collect the data in a single object. OPTIONAL: Use help on the list function to find out how you can represent such data as a list. You can, for example, create a list containing mymatrix1, mymatrix2 and myframe using

    > mylist <- list(mymatrix1, mymatrix2, myframe)

You access the elements of the list using double square brackets; for example, mylist[[1]] gives you the contents of mymatrix1.
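
The three data structures in this section can be sketched side by side as follows. This is only an illustration on made-up data: the object names (species, trees, measurements) and the values are assumptions and do not come from the lab files.

```r
# Hypothetical example data, not part of the lab material

# A factor stores each value as an integer code pointing into its levels
species <- factor(c("oak", "pine", "oak", "birch"))
print(levels(species))      # the distinct categories, in alphabetical order
print(as.integer(species))  # the integer code stored for each element

# A data frame can hold columns of different types
trees <- data.frame(
  Height = c(12.5, 20.1, 9.8, 15.0),
  Species = c("oak", "pine", "oak", "birch"),
  stringsAsFactors = FALSE   # keep Species as a character vector
)
trees$Height[2] <- 21.0      # change a value, selecting the column by name
trees[4, 1] <- 14.2          # change a value, selecting by row and column

# A list can collect objects of different types and lengths
measurements <- list(heights = trees$Height, groups = species, table = trees)
print(class(measurements[[3]]))   # double brackets extract one element
```

Setting stringsAsFactors explicitly makes the result the same regardless of the R version's default.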

3 Input and output of data

We are finally getting to a very important point: input and output of data. Real data sets will most often be the output of some other program. A general way of inputting such data into R is to make sure it is in some kind of text format.

1. Download the file Example1.txt from the course homepage and put it into your current R directory. Open the file in Windows with, for example, Notepad. You will see that it has three columns of data, that the first line contains headings, that the first column is text and the other two columns consist of numbers, and that the columns are separated by tab characters. The data is in fact part of the result of a microarray experiment; the first column consists of names of probes for genes. The file has been produced by Microsoft Excel, using the output option tab-delimited text.

2. A general way to input data in the format of a table is to use the read.table function. Try first

    > mydata <- read.table("Example1.txt")

You are likely to get an error message, as you need to adapt the read.table call to this particular type of file. Read help(read.table) to identify the problem or problems, and use the help information to find a way to change the arguments of read.table so that it reads in the data without problems.

3. Investigate your new object using functions you know. A useful function may be head. Other useful functions to apply are dim, class, and names; make sure you understand the output from each. Try also

    > class(mydata$Genename)

which will show that the first column is a factor, and not just a character vector (in R versions before 4.0.0; newer versions read text columns as character by default). Having factor columns in a data frame may cause unexpected behavior if they are intended to be interpreted just as character vectors. Go back to the help for read.table, or for read.delim, and find an option so that when you re-read the data from file, the first column becomes a character vector, i.e., the last command above responds with character.

4. Try the function

    > newdata <- edit(mydata)

and change some of the probe names to names you find prettier.

5. Create from newdata a new data set consisting of only the lines where the probe name has - as the second character. (Hint: consider the function substr.)

6. To write out data in table format, the function write.table is often useful. Read help(write.table) to find out how to output your data again to a text file; name it newdata.txt. Use Notepad or another text editor to view the data file. OPTIONAL: If you like, try to open the new file with Excel. In the help file for write.table you may find an alternative command that is better suited for outputting a table that is going to be read into Excel.

7. Data can also be contained in packages. For example, the package connected to our textbook, ISwR, contains a number of data sets. Activate the package (for this R session) by writing

    > library(ISwR)

If you get an error message, it means that the package has not yet been downloaded to the computer you use. To download it, use either

    > install.packages("ISwR")

or the Packages menu (under Windows). After the package has been activated with the library function, use

    > help(package=ISwR)

to see a list of the data sets contained in the package. You can read more about each data set using help; e.g., for the energy data set, write

    > help(energy)

To activate the data set, so that it appears among your objects when you use the ls function, use the data function; e.g., to activate the energy data set, write

    > data(energy)

Finally, visualize the data. Try the two commands

    > plot(energy)
    > plot(energy$expend~energy$stature)

and explain the output.
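
Writing a table to a text file and reading it back can be sketched as follows. This is a minimal illustration under stated assumptions: the data frame probes, its column names, and its values are made up, and a temporary file stands in for the course files.

```r
# Hypothetical probe data; names and values are invented for illustration
probes <- data.frame(
  Genename = c("A-101", "B2034", "C-77"),
  Red = c(1200, 8300, 450),
  Green = c(900, 2100, 5600),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".txt")   # temporary file, deleted at the end

# Write tab-separated text: header line, no row names, no quotes around text
write.table(probes, file = path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back, stating that the first line holds the column names
# and that columns are separated by tabs
reread <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

print(head(reread))            # first rows of the table
print(dim(reread))             # number of rows and columns
print(class(reread$Genename))  # type of the text column

unlink(path)                   # remove the temporary file
```

Opening the temporary file in a text editor before it is deleted shows the same layout as a tab-delimited export from a spreadsheet program.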

4 R programming

So far, you have applied R either by using single commands or by using sequences of commands placed together in a script. One of the strengths of R is that you can seamlessly expand the way you use it into using it as a programming language.

1. Even if a script can store a useful sequence of commands, it is not very flexible if you want to apply it to different data or use it multiple times. The standard functions in R, such as plot and sd, offer this flexibility. In R you also have the option to write your own functions. As an example, let's assume that you need to go through a vector of words and replace any occurrence of the string 3XSSC with another given string. Write the following code in a script,

    myreplace <- function(v, newstring="yes") {
      index <- v == "3XSSC"
      v[index] <- newstring
      return(v)
    }

Source the script so that the code gets executed in R. Check the workspace panel in RStudio or use ls() to see that the function has appeared. Now test this function on the first column of the data set read in as mydata above by writing

    > outputvector <- myreplace(mydata[,1])

Can you tell the difference between mydata[,1] and outputvector? (Hint: use head to look at the first few values of each of them.) Now let's go through the meaning of each line in this function.

    myreplace <- function(v, newstring="yes") {
    }

This part states that we create a function called myreplace that has two input arguments (also known as parameters), v and newstring. newstring also has a default value, yes, that is used if we do not supply a value for newstring. The curly brackets { and } delimit what is inside our function. Any code between them is executed when the function is called.

    index <- v == "3XSSC"
    v[index] <- newstring
    return(v)

These three lines define what the function actually does. It first finds all occurrences of 3XSSC and stores this information in index. Then it replaces all elements equal to 3XSSC with the value stored in newstring. Finally, the return command states that v should be given as the output of our function call. Now try to call the function myreplace on the first column of mydata again, but give the function an additional argument so that it changes 3XSSC into another word.

2. Make your function more general by giving it an extra argument, with default value No, indicating which word should be replaced. Test this new version of your function by, for example, replacing geno1 in mydata with something else.

3. In some situations you want to perform the same set of commands multiple times. This can be done with loops. There are a few different options for this in R, but the most common one is the for loop. Write the following code in a script and run it,

    for ( i in 1:10 ) {
      print(i)
    }

This loop runs the code print(i) for each value of i contained in the vector 1:10. Note again the construction 1:10, which is a very quick way of creating a sequence of values. Now create a new loop that prints the first ten gene names in mydata.

4. The final concept to consider is conditional execution of commands. This means having code in your script or function that only gets executed if some condition holds. For example, try running the following code using a script,

    a <- 10
    if ( a > 15) {
      print("a is greater than 15")
    } else {
      print("a is less than or equal to 15")
    }

This code checks the statement after if, in this case whether a is larger than 15, and if that is true it executes the first block of code. If it is not true, it checks whether an else command is given and executes any code following that. See what happens when you change the value of a to something larger than 15.

Now combine your knowledge of for loops and if statements to write a short script that steps through the first twenty rows of mydata, calculates for each row the difference between the red and green values (columns 2 and 3), and prints the name of any gene where the difference is larger than 5000. (Hint: use the abs function to get around problems with negative differences.) The same procedure can be done using only direct vector operations, but try to use a for loop containing an if statement.

If you want to read more about for loops and if statements, chapter 2.3.1 in Dalgaard covers this. You can also find information in chapter 9 of the text An Introduction to R, accessible through the built-in R help; use help.start() to start it.

The final two exercises are very nice and combine many of the concepts we have covered so far, but they can be slightly demanding. At this point you have the option to head straight to the first three hand-in assignments, labs three to five.

5. OPTIONAL: Transform the data stored in mydata as follows: for all lines where the Genename is duplicated, remove all but the first one. Then sort all lines according to the Genenames in the first column. You may have use for functions such as sort, unique, and duplicated.

6. OPTIONAL: We would like to create a new function, similar to myreplace, that replaces the first letter in each word, if it is a capital letter, with the corresponding lower-case letter. You may have use for the built-in vectors LETTERS and letters. The best thing, for speed of execution, is to write your function using vectors: try this. Alternatively, try to write the function using a for loop.
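
The three building blocks of this section (a function with a default argument, a for loop, and an if/else statement) can be sketched together as follows. The function is_large, the vectors changes and ids, and all values are hypothetical and chosen only for illustration; this is not a solution to any of the exercises above.

```r
# Hypothetical helper: TRUE where the absolute value exceeds a threshold.
# threshold has a default value, used when the caller does not supply one.
is_large <- function(x, threshold = 10) {
  return(abs(x) > threshold)
}

changes <- c(3, -25, 8, 14, -2)          # made-up measurements
ids <- c("s1", "s2", "s3", "s4", "s5")   # made-up sample names

flagged <- character(0)   # names of samples with a large change
n_small <- 0              # counter for the remaining samples

# Step through the elements one at a time
for (i in seq_along(changes)) {
  if (is_large(changes[i])) {
    flagged <- c(flagged, ids[i])
  } else {
    n_small <- n_small + 1
  }
}
print(flagged)
print(n_small)

# The same selection written as a single vector operation
flagged_vec <- ids[is_large(changes)]
print(identical(flagged, flagged_vec))
```

The loop and the vector version select the same elements; the vector form is shorter and usually faster, while the loop makes each step explicit.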