MATH3880 Introduction to Statistics and DNA
MATH5880 Statistics and DNA
Practical Session
Monday, 16 November 2009, 3.00 pm, BRAGG Cluster

This document contains the tasks to be completed by students taking the modules MATH3880 Introduction to Statistics and DNA and MATH5880 Statistics and DNA. A report must be submitted one week after the practical, on Monday 23 November 2009, during the lecture. The report should be written on a computer. If this is a problem for you, for example if you have a disability that considerably restricts your use of a computer, let me know as soon as possible.

1 Preparation

Before we proceed with the practical session, make sure that you have done the following:

- Read the limma package users guide, available from http://www.bioconductor.org/packages/2.5/bioc/html/limma.html, especially Chapter 3, Chapter 8 (Sections 8.1, 8.2, and 8.4), and Chapter 10 (Sections 10.1 and 10.2).
- Read the note "How to install Bioconductor packages in the University of Leeds Bragg Cluster". This note is also available from the module webpage: http://www.maths.leeds.ac.uk/~arief/math3880-5880
- Following the note, check that you have enough space in your My Documents folder.
- Install the limma package as directed in the note (Section "Extracting and installing the packages").
- Open R and load the limma package as directed in the note (Section "Preparation in R").
- Download the LPS data from the webpage to your Data directory (again, see the note).
- Read the background and objective of the LPS experiment in the handout of Lecture 8.
- Set the working directory in R to M:/Data by typing

> setwd("M:/Data")
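The last two preparation steps can be checked from within R. The following is only a sketch: it uses a temporary directory and four empty placeholder files with made-up names (array1.gpr to array4.gpr) as a stand-in for M:/Data, so that it runs anywhere. On the cluster you would call setwd("M:/Data") and list the real files instead.

```r
# Hypothetical stand-in for the M:/Data folder: a temporary directory
data.dir <- file.path(tempdir(), "Data")
dir.create(data.dir, showWarnings = FALSE)

# Four empty placeholder files with made-up names (illustration only)
placeholder <- paste0("array", 1:4, ".gpr")
invisible(file.create(file.path(data.dir, placeholder)))

old.wd <- setwd(data.dir)               # plays the role of setwd("M:/Data")
file.list <- dir(pattern = "\\.gpr$")   # list files ending in .gpr
n.files <- length(file.list)
print(file.list)
setwd(old.wd)                           # restore the previous working directory
```

If n.files is not 4 on your own Data directory, the download step was probably incomplete.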

2 Reading the LPS data into your R session

2.1 Reading the raw expression data

Once you have completed the preparation above, you can read the raw microarray data with the following commands.

> file.list = dir(patt="gpr")   # list of microarray raw data files
> file.list                     # check that you have four .gpr files
> f <- function(x) as.numeric(x$Flags > -50)   # filter out bad genes
> RG = read.maimages(files=file.list, source="genepix", wt.fun=f)
> show(RG)

Answer the following questions:

1. What are the names of the microarray data files? In each file, which experimental condition is labelled with each dye?
2. What components are contained in the object RG?
3. There are four matrices in the RG list: R, G, Rb, and Gb. What information is contained in each of those matrices? What do the rows and columns correspond to?
4. Draw a scatterplot where the horizontal axis represents the expression of the Green channel of array 355-5 and the vertical axis represents the Red channel. What can you say about the plot? Hint: Use pch="." as an argument of the function plot().
5. Draw the same plot with both axes in log (base 2) scale. What can you say about the plot?

2.2 Expression data in log-ratio scale

> MA = MA.RG(RG, bc.method="none")
> show(MA)

The command MA.RG above creates an object called MA from RG without subtracting the background intensity from the foreground spot intensity. No normalisation is performed at this stage; the command simply computes log ratios from the RG list.

Answer the following questions:

6. What information is contained in the MA list?
7. What do the matrices M and A represent? Are they in log scale?
8. Does matrix M contain the log-ratios of Red over Green channels or of Treatment (1 hour) over Control (0 hour)?
9. Draw a scatterplot from the first array, where the horizontal axis is the first column of matrix A and the vertical axis is the first column of matrix M. Repeat this for all the other arrays. What can you say about the plot? What would you expect from the distribution of the log ratios in the figure if many of the genes are not differentially expressed between Control and Treatment? Hint: Use the command par(mfrow=c(2,2)) before drawing the plot, and use the argument pch="." in the function plot().

3 Normalisation

The object MA above contains log-ratios of foreground intensities without background correction and without normalisation. In this section, we use background-adjusted intensities. The following R command normalises the information contained in RG and stores the result in an object called MA.

> MA = normalizeWithinArrays(RG, method="loess",
+        span=0.3, bc.method="subtract")

The values in MA are now normalised (and background-corrected). The normalisation method used was loess (argument method="loess"). Other available options for this argument are:

- "none": no normalisation is performed;
- "median": median normalisation (see lecture notes);
- "printtiploess": loess normalisation based on the configuration of the microarray printer blocks (this is the default);
- "composite": a combination of loess and printtiploess normalisation;
- "control": normalisation based on control spots;
- "robustspline": normalisation using splines.

Answer the following questions:

10. What information is contained in the object MA?
11. Draw an MA-plot (the type of plot in Question 9) from the object MA for all arrays. What can you say about the plot? Hint: You may use the function plotMA(MA, array=n), where n is the n-th array to be plotted (n-th column of M).

4 Linear models for cDNA microarray data

In this section, we will fit a linear model to the microarray data that we have.
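As a rough sketch of what such a fit involves for a single gene, the following base-R example computes the least-squares estimate, its standard error, the t-statistic and the two-sided p-value by hand, and then compares them with lm() fitted without an intercept. Everything here is an assumption for illustration: the log-ratios y are simulated, the one-column design vector x is an assumed dye-swap layout, and the variable names are not limma components.

```r
set.seed(1)                          # reproducible simulated data
x <- c(-1, 1, 1, -1)                 # assumed one-column design, one entry per array
y <- 0.8 * x + rnorm(4, sd = 0.3)    # simulated log-ratios for one gene

# Least squares by hand for a one-column design
beta.hat <- sum(x * y) / sum(x^2)    # (X'X)^{-1} X'y
res      <- y - beta.hat * x         # residuals
df       <- length(y) - 1            # residual degrees of freedom
sigma    <- sqrt(sum(res^2) / df)    # estimated error standard deviation
unscaled <- sqrt(1 / sum(x^2))       # square root of (X'X)^{-1}
se       <- unscaled * sigma         # standard error of beta.hat
t.stat   <- beta.hat / se
p.value  <- 2 * pt(-abs(t.stat), df = df)   # two-sided p-value

# Same model with lm(); "- 1" removes the intercept
fit.lm <- summary(lm(y ~ x - 1))$coefficients
print(rbind(by.hand = c(beta.hat, se, t.stat, p.value),
            lm      = unname(fit.lm[1, ])))
```

The two printed rows should agree up to rounding, which is the same kind of check asked for with the real data later in this section.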
After the normalisation described in Section 3 above, the log ratio of expression of RED over GREEN (software default, rows of matrix M in MA) can be modelled as

$$y = X\beta + \varepsilon$$

where $X$ is the design matrix, constructed so that $\beta$ represents differential expression of Treatment (1 hr.) over Control (0 hr.) in these arrays (see the lecture handout). $\beta$ here is our main interest: a parameter of differential expression between two biological groups. To make $\beta$ represent differential expression of Treatment (1 hr.) over Control (0 hr.), we need to look at how the log-ratio data are laid out by R and at the experimental design (file LPS-info.txt):

> colnames(MA$M)
[1] "355-2" "355-5" "358-3" "358-7"
> exp.design <- read.table(file="LPS-info.txt", header=TRUE)
> exp.design
  Array Green Red
1 355-5     0   1
2 355-2     1   0
3 358-3     0   1
4 358-7     1   0

The output above indicates that the order of the files in the object MA is 355-2, 355-5, 358-3, 358-7. If we look at the experimental design, the log ratios of RED over GREEN in M, in that order, correspond to the log ratios of Control (0 hr.) over Treatment (1 hr.), Treatment over Control, Treatment over Control, and Control over Treatment. Therefore, to make $\beta$ represent differential expression of Treatment over Control, the design matrix $X$ needs to be

$$X = \begin{pmatrix} -1 \\ 1 \\ 1 \\ -1 \end{pmatrix}.$$

Had we set

$$X = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}$$

then $\beta$ would represent the differential expression of RED over GREEN instead of Treatment over Control (remember, $y$ is a vector of log ratios of RED over GREEN and corresponds to a row of matrix M in object MA). We continue the analysis with the following commands:

> design.matrix = c(-1,1,1,-1)
> fit = lmFit(MA, design=design.matrix)
> fit

The commands above fit a linear model (using least squares) to each row of matrix M in object MA with design matrix $X$. They do not perform any test or calculate any test statistic. The limma package, by default, uses an empirical Bayes approach in calculating a test statistic (the moderated t-statistic). Our interest here is to calculate the test statistic

$$t_g = \frac{\hat\beta}{\mathrm{SE}(\hat\beta)}. \qquad (1)$$

To get the test statistic, we need to compute it either from the information available in the object fit or by using the standard function lm() on each row of matrix M in object MA (the latter is left as an exercise, see Question 15 below). The object fit contains information on $\hat\beta$ (component coefficients), the square root of the matrix $(X^\top X)^{-1}$ (component stdev.unscaled), and $\hat\sigma$ (component sigma). $\hat\sigma$ is the estimate of the square root of the error variance. From this information we can compute the standard error of $\hat\beta$ as the product of the components stdev.unscaled and sigma (see the handout from Lecture 7).

Do the following tasks:

12. Calculate the test statistic $t_g$ in Equation (1), and save it as an object called tg in your R session. (Note that the object tg should be a vector whose length equals the number of rows of the matrix M in object MA.)
13. Calculate the two-sided p-value of the statistic, and save it in an object called pval.tg in your R session. (Note that the degrees of freedom for each gene are contained in the component df.residual of the object fit.)
14. Create a data.frame object in R, called result.table, whose columns contain the following information: Gene ID, $\hat\beta$, $\mathrm{SE}(\hat\beta)$, $t_g$, and p-value (of $t_g$). Hint: Information on gene ID can be found in the component genes of the object fit.
15. We can use the standard R function lm() to estimate $\hat\beta$ and the p-value for each gene, based on the design matrix $X$. Verify this by analysing the 100-th gene in the list (100-th row of matrix M in object MA), and show that the summary of the model fitted using lm() contains the same information as the 100-th row of the object result.table. Hint: By default, lm() adds an intercept to the model. In fitting our model, remove the intercept by adding the term -1 to the model formula before the design matrix.
16. Sort the data frame result.table so that the gene with the smallest p-value is at the top, followed by the second most significant gene, and so forth. Show the top 10 genes, and put this in your report.

5 Two-sample t-test for single-colour arrays

To explore the use of the two-sample t-test with single-colour Affymetrix array data, first download the R workspace file (ending with .RData) from the module webpage: http://www.maths.leeds.ac.uk/~arief/math3880-5880 (go to the section Datasets and then Breast cancer dataset). Save the file in your Data folder within your My Documents folder. Load the .RData file into your R session, and check that it contains the objects er and x.

The object x is a matrix of expression values, where the rows correspond to the genes/probesets and the columns correspond to the arrays. Since each single-colour array contains expression information from one sample/individual, the columns also correspond to the breast cancer patients. The data in object x are already normalised and in log scale. The object er contains information on the ER (estrogen receptor) status of the patients. The object er indicates that the first 5 columns of x are from ER-positive patients (er value 1) and the remaining 5 columns are from ER-negative patients (er value 0). Verify this information by checking the details of the objects er and x.

Our interest in this study is to identify genes that are differentially expressed between ER-positive and ER-negative patients. Do the following tasks:

17. Calculate two-sample t-statistics of differential expression between ER-positive and ER-negative patients under the assumption of equal variance between the two groups. Save the result in an object called t2. Note that t2 should be a vector whose length is the same as the number of rows of x. Hint: Use the argument var.equal=TRUE in the t-test.
18. Compute the p-values associated with the t-statistics, and save them in an object called pval.t2.
19. Create a data.frame object in R, called result.table2, whose columns contain the following information: Gene ID, t-statistics, and p-values. Hint: Information on gene ID can be found in the row names of matrix x.
20. Sort the data frame result.table2 so that the gene with the smallest p-value is at the top, followed by the second most significant gene, and so forth. Show the top 10 genes, and put this in your report.
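The row-by-row pattern behind Tasks 17 to 20 can be sketched on simulated data. The example below does not use the breast cancer dataset: the matrix x.sim, the group labels grp, the gene names, and the helper one.test are all hypothetical, and the group sizes are chosen only for illustration. It applies t.test() with var.equal=TRUE to each row, collects the results in a data frame, and sorts by p-value.

```r
set.seed(2)
n.genes <- 50
# Simulated log-expression matrix: rows are genes, columns are arrays
x.sim <- matrix(rnorm(n.genes * 10), nrow = n.genes,
                dimnames = list(paste0("gene", 1:n.genes), NULL))
grp <- rep(c(1, 0), each = 5)          # hypothetical labels: 1 = group A, 0 = group B
x.sim[1, grp == 1] <- x.sim[1, grp == 1] + 3   # make gene1 differ between groups

# Equal-variance two-sample t-test for a single row (hypothetical helper)
one.test <- function(v) {
  tt <- t.test(v[grp == 1], v[grp == 0], var.equal = TRUE)
  c(t = unname(tt$statistic), p = tt$p.value)
}

res <- t(apply(x.sim, 1, one.test))    # one row of results per gene
tab <- data.frame(gene = rownames(x.sim), t = res[, "t"], p.value = res[, "p"])
tab <- tab[order(tab$p.value), ]       # most significant genes first
print(head(tab, 10))
```

Looping with apply() is simple but not fast. For a matrix with many thousands of probesets, the pooled-variance formula can instead be computed for all rows at once with rowMeans(), at the cost of slightly longer code.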