MATH3880 Introduction to Statistics and DNA
MATH5880 Statistics and DNA

Practical Session
Monday, 16 November 2009, 3.00 pm, BRAGG Cluster

This document sets out the tasks to be completed by students taking the modules MATH3880 Introduction to Statistics and DNA and MATH5880 Statistics and DNA during the practical session (3.00 pm, Monday 16 November 2009, Bragg Cluster). A report must be submitted one week after the practical, on Monday 23 November 2009, during the lecture. The report should be written on a computer. If this is a problem for you, for example if you have a disability that considerably restricts your use of a computer, let me know as soon as possible.

1 Preparation

Before we proceed with the practical session, make sure that you have checked or done the following:

- Read the limma package user's guide, available from:
  http://www.bioconductor.org/packages/2.5/bioc/html/limma.html
  especially Chapter 3, Chapter 8 (Sections 8.1, 8.2, and 8.4), and Chapter 10 (Sections 10.1 and 10.2).
- Read the note "How to install Bioconductor packages in the University of Leeds Bragg Cluster". This note is also available from the module webpage:
  http://www.maths.leeds.ac.uk/~arief/math3880-5880
- Following the note, check that you have enough space in your My Documents folder.
- Install the limma package as directed in the note (Section "Extracting and installing the packages").
- Open R and load the limma package as directed in the note (Section "Preparation in R").
- Download the LPS data from the webpage to your Data directory (again, see the note).
- Read the background and objective of the LPS experiment in the handout of Lecture 8.
- Set the working directory in R to M:/Data by typing

> setwd("M:/Data")

2 Reading the LPS data into your R session

2.1 Reading the raw expression data

Once you have done the preparation above, you can start reading the raw microarray data by using the following commands.

> file.list = dir(pattern="gpr")               # list of microarray raw data files
> file.list                                    # check that you have four .gpr files
> f <- function(x) as.numeric(x$Flags > -50)   # weight 0 for spots flagged as bad
> RG = read.maimages(files=file.list, source="genepix", wt.fun=f)
> show(RG)

Answer the following questions:

1. What are the names of the microarray data files? In each file, which experimental condition is labelled with each dye?
2. What components are contained in the object RG?
3. There are four matrices in the RG list: R, G, Rb, and Gb. What information is contained in each of those matrices? What do the rows and columns correspond to?
4. Draw a scatterplot where the horizontal axis represents the expression of the Green channel of array 355-5 and the vertical axis represents the Red channel. What can you say about the plot? Hint: Use pch="." as an argument of the function plot().
5. Draw the same plot with both axes in log (base 2) scale. What can you say about the plot?

2.2 Expression data in log-ratio scale

> MA = MA.RG(RG, bc.method="none")
> show(MA)

The command MA.RG above creates an object called MA from RG without subtracting the background intensity from the foreground spot intensity. No normalisation is performed at this stage; the command simply computes log ratios from the RG list.

Answer the following questions:

6. What information is contained in the MA list?
7. What do the matrices M and A represent? Are they in log scale?
8. Does matrix M contain the log ratios of Red over Green channels or of Treatment (1 hour) over Control (0 hour)?

9. Draw a scatterplot for the first array, where the horizontal axis is the first column of matrix A and the vertical axis is the first column of matrix M. Repeat this for all the other arrays. What can you say about the plots? What would you expect the distribution of log ratios in the figure to look like if many of the genes are not differentially expressed between Control and Treatment? Hint: Use the command par(mfrow=c(2,2)) before drawing the plots, and use the argument pch="." in the function plot().

3 Normalisation

The object MA above contains log ratios of foreground intensities that are neither background-corrected nor normalised. In this section, we use background-adjusted intensities. The following R command normalises the information contained in RG and stores the result in an object called MA.

> MA = normalizeWithinArrays(RG, method="loess",
+     span=0.3, bc.method="subtract")

The information contained in MA is now normalised and background-corrected. The normalisation method used was loess (argument method="loess"). Other available options for this argument are:

- "none": no normalisation is performed;
- "median": median normalisation (see lecture notes);
- "printtiploess": loess normalisation based on the configuration of the microarray printer blocks (this is the default);
- "composite": a combination of loess and printtiploess normalisation;
- "control": normalisation based on control spots;
- "robustspline": normalisation using splines.

Answer the following questions:

10. What information is contained in the object MA?
11. Draw an MA-plot (the type of plot in Question 9) from the object MA for all arrays. What can you say about the plots? Hint: You may use the function plotMA(MA, array=n), where n is the n-th array to be plotted (the n-th column of M).

4 Linear models for cDNA microarray data

In this section, we fit a linear model to the microarray data that we have.
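As a rough illustration of what such a fit involves for a single gene, the following R sketch fits a regression through the origin to four made-up log ratios and cross-checks the result against lm(). The values in y and the sign vector x are assumptions chosen purely for illustration (they are not taken from the LPS data), and the residual degrees of freedom are taken as n - 1 because one coefficient is estimated without an intercept.

```r
# Toy log ratios (M-values) for ONE hypothetical gene on four arrays.
# These numbers are invented for illustration; they are not LPS data.
y <- c(-1.2, 0.9, 1.4, -0.8)

# Hypothetical sign vector: +1 where Treatment is on the red channel,
# -1 where the dyes are swapped.
x <- c(-1, 1, 1, -1)

# Least squares through the origin: beta_hat = (x'x)^{-1} x'y
xtx      <- sum(x * x)
beta_hat <- sum(x * y) / xtx

# Residual standard deviation with n - 1 degrees of freedom
# (one coefficient, no intercept)
df_resid  <- length(y) - 1
res       <- y - x * beta_hat
sigma_hat <- sqrt(sum(res^2) / df_resid)

# Standard error, t-statistic and two-sided p-value
se_beta <- sigma_hat * sqrt(1 / xtx)
t_stat  <- beta_hat / se_beta
p_val   <- 2 * pt(-abs(t_stat), df = df_resid)

# Cross-check against lm() fitted without an intercept
fit_lm  <- lm(y ~ x - 1)
coef_lm <- summary(fit_lm)$coefficients
print(coef_lm)
print(c(beta_hat, se_beta, t_stat, p_val))
```

The four quantities computed by hand should agree with the single row of coef_lm. The same arithmetic, applied gene by gene, is what a least-squares fit to every row of a log-ratio matrix amounts to under these assumptions.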
After the normalisation described in Section 3, the log ratio of expression of RED over GREEN (the software default; rows of matrix M in MA) can be modelled as

y = Xβ + ε

where X is the design matrix, constructed so that β represents differential expression of Treatment (1 hr.) over Control (0 hr.) in these arrays (see the lecture handout). The parameter β is our main interest: it measures differential expression between the two biological groups.

To make β represent differential expression of Treatment (1 hr.) over Control (0 hr.), we need to look at how the log-ratio data are laid out by R, and at the experimental design (file LPS-info.txt):

> colnames(MA$M)
[1] "355-2" "355-5" "358-3" "358-7"
> exp.design <- read.table(file="LPS-info.txt", header=T)
> exp.design
  Array Green Red
1 355-5     0   1
2 355-2     1   0
3 358-3     0   1
4 358-7     1   0

The output above indicates that the order of the files in the object MA is 355-2, 355-5, 358-3, 358-7. Comparing this with the experimental design, the log ratios of RED over GREEN in M, in that order, correspond to the log ratio of Control (0 hr.) over Treatment (1 hr.), Treatment over Control, Treatment over Control, and Control over Treatment. Therefore, to make β represent differential expression of Treatment over Control, the design matrix X must be

X = (-1, 1, 1, -1)^T

Had we set

X = (1, 1, 1, 1)^T

then β would represent the differential expression of RED over GREEN instead of Treatment over Control (remember that y is a vector of log ratios of RED over GREEN, corresponding to a row of matrix M in object MA).

We continue the analysis with the following commands:

> design.matrix = c(-1,1,1,-1)
> fit = lmFit(MA, design=design.matrix)
> fit

The commands above fit a linear model (using least squares) to each row of matrix M in object MA with design matrix X. They do not perform any test or calculate any test statistic. The limma package, by default, uses an empirical Bayes approach to calculate a test statistic (the moderated t-statistic). Our interest here is instead in the ordinary test statistic

t_g = β̂ / SE(β̂).    (1)

To obtain the test statistic, we need to compute it either from the information available in the object fit or by using the standard function lm() on each row of matrix M in object MA (the latter is left as an exercise; see Question 15 below). The object fit contains β̂ (component coefficients), the square root of the diagonal of (X^T X)^{-1} (component stdev.unscaled), and σ̂ (component sigma), where σ̂ is the estimate of the square root of the error variance. From this information we can compute the standard error of β̂ as the product of the components stdev.unscaled and sigma (see the handout from Lecture 7).

Do the following tasks:

12. Calculate the test statistic t_g in Equation (1), and save it as an object called tg in your R session. (Note that tg should be a vector whose length equals the number of rows of the matrix M in object MA.)
13. Calculate the two-sided p-value of the statistic, and save it in an object called pval.tg in your R session. (Note that the degrees of freedom for each gene are contained in the component df.residual of the object fit.)
14. Create a data.frame object in R, called result.table, whose columns contain the following information: Gene ID, β̂, SE(β̂), t_g, and p-value (of t_g). Hint: Information on gene ID can be found in the component genes of the object fit.
15. We can use the standard R function lm() to estimate β̂ and the p-value for each gene, based on the design matrix X. Verify this by analysing the 100-th gene in the list (the 100-th row of matrix M in object MA), and show that the summary of the model fitted using lm() contains the same information as the 100-th row of the object result.table. Hint: By default, lm() adds an intercept to the model. To fit our model without the intercept, add the term -1 before the design matrix in the model formula.
16. Sort the data frame result.table so that the gene with the smallest p-value is at the top, followed by the second most significant gene, and so forth. Show the top 10 genes, and put this in your report.

5 Two-sample t-test for single-colour arrays

To explore the use of the two-sample t-test with single-colour Affymetrix array data, first download the R workspace file (ending with .RData) from the module webpage:

http://www.maths.leeds.ac.uk/~arief/math3880-5880

Go to the section Datasets and then Breast cancer dataset. Save the file in the Data folder within your My Documents folder. Load the .RData file into your R session, and check that it contains the objects er and x.

The object x is a matrix of expression values, where the rows correspond to the genes/probesets and the columns correspond to the arrays. Since each single-colour array contains the expression information from one sample/individual, the columns also correspond to the breast cancer patients. The data in object x are already normalised and in log scale. The object er contains the ER (estrogen receptor) status of the patients. It indicates that the first 5 columns of x are from ER-positive patients (er value 1) and the remaining 5 columns are from ER-negative patients (er value 0). Verify this information by checking the details of the objects er and x.

Our interest in this study is to identify genes that are differentially expressed between ER-positive and ER-negative patients.

Do the following tasks:

17. Calculate the two-sample t-statistics of differential expression between ER-positive and ER-negative patients under the assumption of equal variance in the two groups. Save the result in an object called t2. Note that t2 should be a vector whose length equals the number of rows of x. Hint: Use the argument var.equal=T in the t-test.
18. Compute the p-values associated with the t-statistics, and save them in an object called pval.t2.
19. Create a data.frame object in R, called result.table2, whose columns contain the following information: Gene ID, t-statistic, and p-value. Hint: Information on gene ID can be found in the row names of matrix x.
20. Sort the data frame result.table2 so that the gene with the smallest p-value is at the top, followed by the second most significant gene, and so forth. Show the top 10 genes, and put this in your report.
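A row-by-row equal-variance t-test followed by sorting on the p-value can be sketched on toy data as follows. This is only an illustrative sketch: the matrix x_toy, the labels er_toy, the helper row_t and the gene names are all hypothetical and are generated at random here, so the numbers carry no biological meaning and are unrelated to the breast cancer dataset.

```r
set.seed(1)  # toy data only; NOT the breast cancer dataset

# Hypothetical expression matrix: 6 genes (rows) by 10 arrays (columns)
n_genes <- 6
x_toy <- matrix(rnorm(n_genes * 10), nrow = n_genes,
                dimnames = list(paste0("gene", 1:n_genes), NULL))

# Hypothetical group labels: first 5 columns coded 1, last 5 coded 0
er_toy <- rep(c(1, 0), each = 5)

# Equal-variance two-sample t-test for one row of the matrix
row_t <- function(v, g) {
  tt <- t.test(v[g == 1], v[g == 0], var.equal = TRUE)
  c(t = unname(tt$statistic), p = tt$p.value)
}

# Apply the test to every row; transpose so that genes are rows
res <- t(apply(x_toy, 1, row_t, g = er_toy))

# Collect the results in a data frame and sort by increasing p-value
tab <- data.frame(gene = rownames(x_toy), t = res[, "t"], p = res[, "p"])
tab <- tab[order(tab$p), ]
print(tab)
```

Calling t.test() once per row is slow for many thousands of probesets but keeps the logic transparent; the pooled-variance formula could be vectorised over rows instead if speed matters.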