Computer Exercise - Microarray Analysis using Bioconductor

Introduction

The SWIRL dataset

The SWIRL dataset comes from an experiment using zebrafish to study early development in vertebrates. SWIRL is a point mutant in the BMP2 gene that affects the dorsal/ventral body axis. One of the goals of the SWIRL experiment is to identify genes with altered expression in the BMP2 mutant compared to wild-type zebrafish. The SWIRL dataset is provided by Katrin Wuennenberg-Stapleton from the Ngai Lab at UC Berkeley. Table 1 shows the experimental setup. R stands for red and G for green, the names usually given to the two dyes. Other common names are Cy5 (red) and Cy3 (green).

    Array number    Mutant dye    Wild-type dye
    1               Cy3 (G)       Cy5 (R)
    2               Cy5 (R)       Cy3 (G)
    3               Cy3 (G)       Cy5 (R)
    4               Cy5 (R)       Cy3 (G)

Table 1: Experimental setup for the SWIRL dataset.

To download the data, type the following lines in an xterm window:

    wget http://www.math.chalmers.se/~erikkr/macourse2008/swirl.1.spot
    wget http://www.math.chalmers.se/~erikkr/macourse2008/swirl.2.spot
    wget http://www.math.chalmers.se/~erikkr/macourse2008/swirl.3.spot
    wget http://www.math.chalmers.se/~erikkr/macourse2008/swirl.4.spot
    wget http://www.math.chalmers.se/~erikkr/macourse2008/fish.gal

LIMMA - a package in Bioconductor

LIMMA stands for Linear Models for Microarray Data and is a Bioconductor package for microarray analysis. The package is maintained by Gordon Smyth, who has also written several papers in the field of microarray analysis. The LIMMA package contains a broad collection of tools, some of which are especially designed for the analysis of two-channel spotted cDNA microarray data.

In this lab we will use LIMMA for several reasons. First, LIMMA is developed at a fast pace, which means that new methods are continuously added as they become available. LIMMA is also fairly easy to learn and well documented (at least relative to the other packages in Bioconductor). To load LIMMA in R, simply type

> library(limma)

and wait a few seconds.

There are several ways to access the LIMMA documentation. The easiest way is to use the included help files. These can be read directly in R by using the help command:

> help("01.Introduction")

In addition, the following sections might be of interest: 02.Classes, 03.ReadingData, 04.Background, 05.Normalization, 06.LinearModels, 07.SingleChannel, 08.Tests, 09.Diagnostics and 10.Other. A user guide for LIMMA is available at http://www.math.chalmers.se/~erikkr/macourse2008/.

Exercises

Exercises marked with a star (*) are a bit more tricky and may be skipped without interrupting the flow of the lab.
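If wget is not available, the download step from the Introduction can probably be done from within R instead. The following is only a sketch: it assumes the course URLs listed above are still reachable, and the helper name fetch_missing is hypothetical (it is not part of LIMMA or of the original handout).

```r
# Build the five file names and URLs used in the Introduction.
base_url   <- "http://www.math.chalmers.se/~erikkr/macourse2008/"
data_files <- c(paste0("swirl.", 1:4, ".spot"), "fish.gal")
urls       <- paste0(base_url, data_files)

# Hypothetical helper: download only the files that are not already
# present in the current working directory.
fetch_missing <- function(urls, files) {
  for (i in seq_along(files)) {
    if (!file.exists(files[i])) {
      download.file(urls[i], destfile = files[i], mode = "wb")
    }
  }
  invisible(files)
}

# Uncomment to perform the download:
# fetch_missing(urls, data_files)
```

The files end up in the working directory (see getwd()), which is also where the read commands in the exercises below look for them.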

Basic input/output in LIMMA

To start analyzing the data, the first step is to read the data into R. This part can be rather tricky depending on the format of the data. In our case, the data is an output file from the image analysis program Spot.

Exercise 1

Use the read.maimages command to load the files into LIMMA. The best way to do this is to save the names of the different files in a vector

> files<-c("swirl.1.spot", "swirl.2.spot", "swirl.3.spot", "swirl.4.spot")

After that, use the read.maimages command to read the files into R

> RG<-read.maimages(files, source="spot")

Use the names command to see which elements the resulting list RG contains. Can you figure out what they stand for? Take a look at the contents of the different elements. What type of objects are they? The command class can be used here in the following way

> class(RG[[1]])

The raw numbers from the slide have now been read into LIMMA, but we also need some metadata, that is, some information about the data. Examples of metadata in this case are the layout of the array and an annotation list.

Exercise 2a

In our case, the layout and the annotation list are stored in a so-called GAL-file. Make sure that you have downloaded the GAL-file (fish.gal) and read it into LIMMA by using the readGAL command.

> RG$genes<-readGAL(galfile="fish.gal")

This saves the result in an element called genes in the list RG. Make sure that everything worked by listing the first 15 rows of RG$genes.

Exercise 2b

The next step is to extract the layout information. Since the GAL-file contains this information as well, we can get it directly from RG$genes with the getLayout command.

> RG$printer<-getLayout(RG$genes)

This saves the result in RG$printer. What kind of information does the printer element contain?

Exercise 3a

The MA.RG command can be used to create MA values from our list RG. To do this, simply type

> MA<-MA.RG(RG)

As you might remember from the lectures, the M and A values are defined as

    M = log(R) - log(G)
    A = (log(R) + log(G)) / 2

MA values have several advantages compared to RG values, both when it comes to visualization and statistical analysis. It is also possible to go from MA values back to RG values. Based on the equations above, can you figure out how to do this?

Exercise 3b

Look at the documentation for the MA.RG function - what are the default values of the parameters? How would you create MA values that are not background corrected? (Note that background correction is not always advisable.)

Visualization of microarray data

You should now have two variables: one named RG, which contains the raw data, the annotation list and the layout, and one named MA, which contains all the MA-values. Our next step is to try to get a picture of what the data looks like. Here, a useful command is x11(), which opens a new window to plot in, thus keeping the current plot.

Exercise 4

We start by examining the RG values for each array. The plotDensities function plots the distribution of spot values for both channels, and such a plot can be used to see if there is any bias toward either of the dyes. Plot the distribution of the spot values for all four arrays, both with and without log transformation. Can you say something about the dye bias from these plots?

Exercise 4b

To determine which pairs of densities belong together, use the command layout to place several subplots in the same plot (layout(matrix(1:4,ncol=2))). In each subplot, plot only the density of one of the arrays (look at the documentation of plotDensities).

Exercise 5

Use plotMA to create an MA-plot for each array. Do the arrays differ? Are there any trends? Use the text command to plot "BMP2" at the location of the BMP2 gene. Is the gene regulated? Should it be regulated?

Hint: The M-values for the BMP2 gene can be obtained using the command

> MA$M[RG$genes$Name=="BMP2",]

The text command is here used as follows, where BMP2.A and BMP2.M hold the A- and M-values of the BMP2 gene:

> text(x=BMP2.A,y=BMP2.M,labels="BMP2",col="red")

Exercise 6

Since we are going to use the information from all four arrays, it is important to check that none of the arrays deviates from the others. One way to get an easy overview is to make box-plots of the M-values for each array. Create a boxplot of the M-values. The command you need is boxplot, which does not handle matrices properly, so convert the M-values to a data frame:

> boxplot(as.data.frame(MA$M))

Interpret the result!

Normalization of microarrays

Using the MA-plots that were created in Exercise 5, it is possible to see a trend which depends on the A-value, that is, the total intensity. We have also detected some dye bias in the density plots from Exercise 4.

Exercise 7

Create an MA-plot of one of the arrays and add a loess line. As in Exercise 5, the command to create an MA-plot in LIMMA is plotMA. To calculate a loess line, the command lowess is useful; use lines to add the line to an existing plot.
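The mechanics of lowess and lines can be tried out on synthetic data before moving to the real arrays. The sketch below uses only base R; the vectors A and M are simulated here (they are not the SWIRL values), and the smoother span f = 0.3 is an arbitrary choice for illustration.

```r
set.seed(1)
n <- 500

# Simulated average log-intensities and log-ratios with an
# intensity-dependent trend, roughly mimicking an MA-plot.
A <- runif(n, min = 6, max = 14)
M <- 0.8 - 0.1 * (A - 6) + rnorm(n, sd = 0.3)

# lowess returns a list with components x (sorted) and y (smoothed values).
fit <- lowess(A, M, f = 0.3)

# Scatter plot with the smoothed trend drawn on top.
plot(A, M, pch = ".", xlab = "A", ylab = "M", main = "Synthetic MA-plot")
lines(fit, col = "red", lwd = 2)
```

With real data, A and M would be taken from one column of MA$A and MA$M. Missing values may need to be removed before calling lowess.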

Exercise 8

Normalize the data by the global loess method with the normalizeWithinArrays command. This command takes an MA-list and returns a normalized MA-list. For example,

> MAnorm<-normalizeWithinArrays(MA, method="loess")

Create an MA-plot of the result for each array. Compare to the plots made in Exercise 5.

Exercise 9

Repeat Exercise 4 with the normalized data. Has the dye bias disappeared? Why? Use RG.MA to convert the normalized MA values to RG values.

Exercise 10

Repeat Exercise 6 with the normalized data. Have the differences increased or decreased? Use the command normalizeBetweenArrays with the quantile method to make a second normalization. Compare the result with a new boxplot. Make a new density plot of the RG values afterwards. Compare to Exercises 4 and 9.

Statistics and ranking

We are now ready to identify the genes that are most likely to be regulated, using several different statistics, for both the non-normalized and normalized MA-values. First, we need to calculate the average fold-change over all the arrays. In LIMMA this is usually done by the lmFit command, which requires two arguments: MA-values and a design matrix. The design matrix in our case is a vector containing 1 and -1, indicating the different dyes. In our case, a valid design matrix can be created by

> designmatrix<-c(-1,1,-1,1)

Call lmFit in the following way

> MAfit<-lmFit(MA, designmatrix) # Call lmFit and save the result in MAfit

Exercise 11

Use lmFit and the design matrix above to calculate the average M-values over all the arrays. Do this for both the non-normalized and the normalized values. Save the result in variables with suitable names.
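What the design vector does can be illustrated for a single gene in base R, without LIMMA. This is a sketch under simplifying assumptions (no missing values, no spot weights), and the four M-values are invented for the example. In that setting, the least-squares coefficient for a single column of +1/-1 entries is the mean of the M-values after flipping the sign on the dye-swapped arrays.

```r
# Design vector from the text: -1 for arrays where the mutant is
# labelled with Cy3 (G), +1 where it is labelled with Cy5 (R).
design <- c(-1, 1, -1, 1)

# Invented M-values for one gene on the four arrays.
M_gene <- c(-1.2, 0.9, -1.0, 1.1)

# Least-squares fit of M on the design vector, without an intercept.
fit <- lm(M_gene ~ 0 + design)
coef_lm <- unname(coef(fit))

# The same number computed by hand: mean of the sign-corrected M-values.
coef_manual <- mean(design * M_gene)

print(c(lm = coef_lm, manual = coef_manual))
```

lmFit carries out this kind of fit for every gene at once, so its coefficient for this design can be read as the average log-ratio of mutant versus wild-type.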

Exercise 12

Calculate the moderated statistics with the eBayes command, which takes a result from lmFit as an argument and adds the moderated t-statistics. For example,

> MAstat<-eBayes(MAfit)

Do this for both the non-normalized and the normalized values.

Exercise 13

Use the topTable command to create a list of the 50 most regulated genes based on the M-value and on the moderated t-statistic.

> topTable(MAstat, n=50)

Do they differ much? Is there any way to see if one of the lists is more true than the other one? Do the lists of genes for the non-normalized data and the normalized data differ? Can we say which one is more correct?

Exercise 14

Create new MA-plots with the average A-value on the x-axis and the average M-value on the y-axis. Mark the 50 most regulated genes according to the M-value threshold on one of the plots and the 50 most regulated genes according to the moderated t-statistic on the other plot. Do you spot any difference? Why?

The average M-value is available from the result of lmFit and the moderated t-statistic is available from the result of eBayes. To sort the statistics, use order. apply can be used to calculate the average A-values and the plot function to create a plot. To mark the top 50 genes, use the points command with the argument col="blue".

GOOD LUCK

Functions

LIMMA

    backgroundCorrect - background correction
    eBayes - calculates statistics
    getLayout - extracts the array layout from the annotation list
    lmFit - calculates the average M-values over a set of arrays
    MA.RG - transforms RG values into MA values
    normalizeBetweenArrays - normalization between different arrays
    normalizeWithinArrays - normalization within a single array
    plotDensities - creates density plots of the colors from an array
    plotMA - creates an MA plot
    read.maimages - reads microarray data into LIMMA
    readGAL - reads annotation list into LIMMA
    RG.MA - transforms MA values into RG values
    topTable - prints the top most regulated genes

R

    boxplot - creates a boxplot
    lines - plots a line
    lowess - calculates a loess line
    points - plots a point to an existing plot
    text - plots text to an existing plot