Computer Exercise - Microarray Analysis using Bioconductor

Introduction

The SWIRL dataset

The SWIRL dataset comes from an experiment using zebrafish to study early development in vertebrates. SWIRL is a point mutant in the BMP2 gene that affects the dorsal/ventral body axis. One of the goals of the SWIRL experiment is to identify genes with altered expression in the BMP2 mutant compared to wild-type zebrafish. The SWIRL dataset is provided by Katrin Wuennenberg-Stapleton from the Ngai Lab at UC Berkeley.

Table 1 shows the experimental setup. R stands for red and G for green, the names usually given to the two dyes. Other common names are Cy5 (red) and Cy3 (green).

Array number   Mutant dye   Wild-type dye
1              Cy3 (G)      Cy5 (R)
2              Cy5 (R)      Cy3 (G)
3              Cy3 (G)      Cy5 (R)
4              Cy5 (R)      Cy3 (G)

Table 1: Experimental setup for the SWIRL dataset.

To download the data, type the following lines in an xterm window:

wget http://www.math.chalmers.se/~erikkr/macourse2008/swirl.1.spot
wget http://www.math.chalmers.se/~erikkr/macourse2008/swirl.2.spot
wget http://www.math.chalmers.se/~erikkr/macourse2008/swirl.3.spot
wget http://www.math.chalmers.se/~erikkr/macourse2008/swirl.4.spot
wget http://www.math.chalmers.se/~erikkr/macourse2008/fish.gal

LIMMA - a package in Bioconductor

LIMMA stands for Linear Models in Microarray Analysis and is a Bioconductor package for microarray analysis. The package is maintained by Gordon Smyth, who has also written several papers in the field of microarray analysis. The LIMMA package contains a broad collection of tools, some of which are designed specifically for the analysis of two-channel spotted cDNA microarray data.

In this lab we will use LIMMA for several reasons. First, LIMMA is developed at a fast pace, which means that new methods are added continuously as they become available. LIMMA is also fairly easy to learn and well documented (at least relative to the other packages in Bioconductor). To load LIMMA in R, simply type

library(limma)

and wait a few seconds.

There are several ways to access the LIMMA documentation. The easiest way is to use the included help files. These can be read directly in R by using the help command:

help("01.Introduction")

In addition, the following sections might be of interest: 02.Classes, 03.ReadingData, 04.Background, 05.Normalization, 06.LinearModels, 07.SingleChannel, 08.Tests, 09.Diagnostics and 10.Other. A user guide for LIMMA is available at http://www.math.chalmers.se/~erikkr/macourse2008/.

Exercises

Exercises marked with a star (*) are a bit more tricky and may be skipped without interrupting the flow of the lab.
Basic input/output in LIMMA

The first step in analyzing the data is to read it into R. This part can be rather tricky depending on the format of the data. In our case, the data is an output file from the image analysis program Spot.

Exercise 1
Use the read.maimages command to load the files into LIMMA. The best way to do this is to save the names of the different files in a vector

> files<-c("swirl.1.spot", "swirl.2.spot", "swirl.3.spot", "swirl.4.spot")

After that, use the read.maimages command to read the files into R

> RG<-read.maimages(files, source="spot")

Use the names command to see which elements the resulting list RG contains. Can you figure out what they stand for? Take a look at the contents of the different elements. What type of objects are they? The command class can be used here in the following way

class(RG[[1]])

The raw numbers from the slides have now been read into LIMMA, but we also need some metadata, that is, some information about the data. Examples of metadata in this case are the layout of the array and an annotation list.

Exercise 2a
In our case, the layout and the annotation list are stored in a so-called GAL-file. Make sure that you have downloaded the GAL-file (fish.gal) and read it into LIMMA by using the readGAL command.

> RG$genes<-readGAL(galfile="fish.gal")

This saves the result in an element called genes in the list RG. Make sure that everything worked by listing the first 15 rows of RG$genes.

Exercise 2b
The next step is to extract the layout information. Since the GAL-file contains this information as well, we can get it directly from RG$genes with the getLayout command.

> RG$printer<-getLayout(RG$genes)
This saves the result in RG$printer. What kind of information does the printer element contain?

Exercise 3a
The MA.RG command can be used to create MA values from our list RG. To do this, simply type

MA<-MA.RG(RG)

As you might remember from the lectures, the M and A values are defined as

M = \log_2(R) - \log_2(G), \qquad A = \frac{\log_2(R) + \log_2(G)}{2}

where LIMMA uses base-2 logarithms. MA values have several advantages over RG values, both for visualization and for statistical analysis. It is also possible to go from MA values back to RG values. Based on the equations above, can you figure out how to do this?

Exercise 3b
Look at the documentation for the MA.RG function - what are the default values of the parameters? How would you create MA values that are not background corrected? (Note that background correction is not always advisable.)

Visualization of microarray data

You should now have two variables: one named RG, which contains the raw data, the annotation list and the layout, and one named MA, which contains all the MA values. Our next step is to get a picture of what the data looks like. Here, a useful command is x11(), which opens a new plot window and thus keeps the current plot.

Exercise 4
We start by examining the RG values for each array. The plotDensities function plots the distribution of spot values for both channels, and such a plot can be used to see whether there is any bias toward either of the dyes. Plot the distribution of the spot values for all four arrays both with and without log
transformation. Can you say something about the dye bias from these plots?

Exercise 4b
To determine which pairs of densities belong together, use the command layout to draw several subplots in the same plot (layout(matrix(1:4,ncol=2))). In each subplot, plot only the densities of one of the arrays (look at the documentation of plotDensities).

Exercise 5
Use plotMA to create an MA-plot for each array. Do the arrays differ? Are there any trends? Use the text command to write BMP2 at the location of the BMP2 gene. Is the gene regulated? Should it be regulated?

Hint: The M-values for the BMP2 gene can be obtained with the command:

MA$M[RG$genes$Name=="BMP2",]

The text command is used here as follows, where BMP2.A and BMP2.M hold the A- and M-values of the BMP2 gene for the array being plotted:

text(x=BMP2.A,y=BMP2.M,labels="BMP2",col="red")

Exercise 6
Since we are going to use the information from all four arrays, it is important to check that none of the arrays deviates markedly from the others. One way to get an easy overview is to make box-plots of the M-values for each array. Create a boxplot of the M-values. The command you need is boxplot, which does not handle matrices properly, so convert the M-values to a data frame:

boxplot(as.data.frame(MA$M))

Interpret the result!

Normalization of microarrays

In the MA-plots created in Exercise 5, it is possible to see a trend that depends on the A-value, that is, on the total intensity. We have also detected some dye bias in the density plots from Exercise 4.

Exercise 7
Create an MA-plot of one of the arrays and add a loess line. As in Exercise 5, the command to create an MA-plot in LIMMA is plotMA. To calculate a loess line, the command lowess is useful; use lines to add the resulting line to an existing plot.
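The lowess-and-lines step mentioned in Exercise 7 can be sketched on simulated values. This is only an illustration in base R: the vectors A and M below are made-up stand-ins with an artificial intensity-dependent trend, not the SWIRL data, and the smoother span f = 0.3 is an arbitrary choice.

```r
# Simulated stand-in for one array (NOT the SWIRL data)
set.seed(1)
n <- 2000
A <- runif(n, min = 6, max = 15)            # average log2 intensity
M <- 0.15 * (A - 10) + rnorm(n, sd = 0.3)   # log-ratio with an artificial trend in A

# Scatter plot of M against A, with a reference line at M = 0
plot(A, M, pch = ".", main = "Simulated MA-plot")
abline(h = 0, col = "grey")

# lowess returns a list with components x (sorted) and y (smoothed values)
fit <- lowess(A, M, f = 0.3)

# Add the smoothed curve to the existing plot
lines(fit, col = "red", lwd = 2)
```

With real data, the corresponding columns of MA$A and MA$M would take the place of A and M; missing or infinite values may need to be dropped before calling lowess.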
Exercise 8
Normalize the data by the global loess method with the normalizeWithinArrays command. This command takes an MA-list and returns a normalized MA-list. For example,

MAnorm<-normalizeWithinArrays(MA, method="loess")

Create an MA-plot of the result for each array. Compare to the plots made in Exercise 5.

Exercise 9
Repeat Exercise 4 with the normalized data. Has the dye bias disappeared? Why? Use RG.MA to convert the normalized MA values to RG values.

Exercise 10
Repeat Exercise 6 with the normalized data. Have the differences increased or decreased? Use the command normalizeBetweenArrays with the quantile method to make a second normalization. Compare the result with a new boxplot. Make a new density plot of the RG values afterwards. Compare to Exercises 4 and 9.

Statistics and ranking

We are now ready to identify the genes that are most likely to be regulated, using several different statistics, for both the non-normalized and the normalized MA values. First, we need to calculate the average fold-change over all the arrays. In LIMMA this is usually done with the lmFit command, which requires two arguments: MA values and a design matrix. The design matrix in our case is a vector containing 1 and -1, indicating the different dyes. In our case, a valid design matrix can be created by

designmatrix<-c(-1,1,-1,1)

Call lmFit in the following way

MAfit<-lmFit(MA, designmatrix) # Call lmFit and save the result in MAfit

Exercise 11
Use lmFit and the design matrix above to calculate the average M-values over all the arrays. Do this for both the non-normalized and the normalized values. Save the results in variables with suitable names.
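To see why a design vector of 1 and -1 yields an average M-value, the following base-R sketch may help. It assumes an ordinary least-squares fit per gene with a single design column, for which the fitted coefficient equals the sign-flipped mean of the M-values. The matrix M_sim and the vector true_effect are hypothetical simulated values, and lm.fit from base R stands in for the per-gene fit; this is not a reproduction of LIMMA's internals.

```r
set.seed(42)
design <- c(-1, 1, -1, 1)      # dye-swap design vector from the text
true_effect <- c(2, 0, -1)     # hypothetical log-ratios for three genes

# 3 genes x 4 arrays: the sign of each effect flips with the dye assignment
M_sim <- outer(true_effect, design) + matrix(rnorm(12, sd = 0.1), nrow = 3)

# Sign-flipped average: multiply each column by its design entry, then average
avg_M <- rowMeans(sweep(M_sim, 2, design, "*"))

# The same quantity from ordinary least squares, one gene (row) at a time
ls_M <- apply(M_sim, 1, function(m) lm.fit(cbind(design), m)$coefficients)

# The two columns should agree up to numerical precision
print(round(cbind(avg_M, ls_M), 3))
```

Flipping the sign before averaging is what keeps the dye-swapped arrays from cancelling each other out.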
Exercise 12
Calculate the moderated statistics with the eBayes command, which takes a result from lmFit as an argument and adds the moderated t-statistics. For example,

MAstat<-eBayes(MAfit)

Do this for both the non-normalized and the normalized values.

Exercise 13
Use the topTable command to create a list of the 50 most regulated genes based on the M-value and on the moderated t-statistic.

topTable(MAstat, number=50)

Do the two lists differ much? Is there any way to see whether one of the lists is more reliable than the other? Do the gene lists for the non-normalized data and the normalized data differ? Can we say which one is more correct?

Exercise 14
Create new MA-plots with the average A-value on the x-axis and the average M-value on the y-axis. Mark the 50 most regulated genes according to the M-value on one of the plots and the 50 most regulated genes according to the moderated t-statistic on the other plot. Do you spot any difference? Why?

The average M-value is available in the result from lmFit, and the moderated t-statistic is available in the result from eBayes. To sort the statistics, use order. The apply command can be used to calculate the average A-values, and the plot function to create a plot. To mark the top 50 genes, use the points command with the argument col="blue".

GOOD LUCK
Functions

LIMMA

backgroundCorrect - background correction
eBayes - calculates statistics
getLayout - extracts the array layout from the annotation list
lmFit - calculates the average M-values over a set of arrays
MA.RG - transforms RG values into MA values
normalizeBetweenArrays - normalization between different arrays
normalizeWithinArrays - normalization within a single array
plotDensities - creates density plots of the colors from an array
plotMA - creates an MA-plot
read.maimages - reads microarray data into LIMMA
readGAL - reads an annotation list into LIMMA
RG.MA - transforms MA values into RG values
topTable - prints the most regulated genes

R

boxplot - creates a boxplot
lines - adds a line to an existing plot
lowess - calculates a loess line
points - adds points to an existing plot
text - adds text to an existing plot