Bioconductor exercises 1. Exploring cdna data. June Wolfgang Huber and Andreas Buness

Similar documents
Exploring cdna Data. Achim Tresch, Andreas Buness, Tim Beißbarth, Florian Hahne, Wolfgang Huber. June 17, 2005

Exploring cdna Data. Achim Tresch, Andreas Buness, Tim Beißbarth, Wolfgang Huber

Exploring cdna Data. Achim Tresch, Andreas Buness, Wolfgang Huber, Tim Beißbarth

Exploring cdna Data. Achim Tresch, Andreas Buness, Tim Beißbarth, Wolfgang Huber

Microarray Data Analysis (V) Preprocessing (i): two-color spotted arrays

Course on Microarray Gene Expression Analysis

Normalization: Bioconductor s marray package

Introduction to the Bioconductor marray package : Input component

Working with Affymetrix data: estrogen, a 2x2 factorial design example

Package OLIN. September 30, 2018

Performance assessment of vsn with simulated data

Gene Expression an Overview of Problems & Solutions: 1&2. Utah State University Bioinformatics: Problems and Solutions Summer 2006

Package stepnorm. R topics documented: April 10, Version Date

Why use R? Getting started. Why not use R? Introduction to R: Log into tak. Start R R or. It s hard to use at first

MATH3880 Introduction to Statistics and DNA MATH5880 Statistics and DNA Practical Session Monday, 16 November pm BRAGG Cluster

Day 4 Percentiles and Box and Whisker.notebook. April 20, 2018

Computer Exercise - Microarray Analysis using Bioconductor

Bioconductor s stepnorm package

CARMAweb users guide version Johannes Rainer

AB1700 Microarray Data Analysis

Introduction to GE Microarray data analysis Practical Course MolBio 2012

Using R for statistics and data analysis

Analysis of Genomic and Proteomic Data. Practicals. Benjamin Haibe-Kains. February 17, 2005

I. Overview of the Bioconductor Project. Bioinformatics and Biostatistics Lab., Seoul National Univ. Seoul, Korea Eun-Kyung Lee

Preprocessing Affymetrix Data. Educational Materials 2005 R. Irizarry and R. Gentleman Modified 21 November, 2009, M. Morgan

MiChip. Jonathon Blake. October 30, Introduction 1. 5 Plotting Functions 3. 6 Normalization 3. 7 Writing Output Files 3

Introduction to the Codelink package

Package twilight. August 3, 2013

ROTS: Reproducibility Optimized Test Statistic

Why use R? Getting started. Why not use R? Introduction to R: It s hard to use at first. To perform inferential statistics (e.g., use a statistical

Package AffyExpress. October 3, 2013

Microarray Technology (Affymetrix ) and Analysis. Practicals

/ Computational Genomics. Normalization

Bioconductor tutorial

The analysis of acgh data: Overview

Package cycle. March 30, 2019

Introduction: microarray quality assessment with arrayqualitymetrics

Package GSRI. March 31, 2019

Topics for today Input / Output Using data frames Mathematics with vectors and matrices Summary statistics Basic graphics

Introduction to R. UCLA Statistical Consulting Center R Bootcamp. Irina Kukuyeva September 20, 2010

From raw data to gene annotations

Identifying differentially expressed genes with siggenes

The crosshybdetector Package

Codelink Legacy: the old Codelink class

Analysis of Spotted Microarray Data

Automated Bioinformatics Analysis System on Chip ABASOC. version 1.1

Analysis of multi-channel cell-based screens

Preprocessing -- examples in microarrays

ReadqPCR: Functions to load RT-qPCR data into R

Introduction to R: Using R for statistics and data analysis

Evaluation of VST algorithm in lumi package

Exploring and Understanding Data Using R.

Analysis of Spotted Microarray Data

Affymetrix Microarrays

Statistics 251: Statistical Methods

Applying Data-Driven Normalization Strategies for qpcr Data Using Bioconductor

Preliminary Figures for Renormalizing Illumina SNP Cell Line Data

Package ABSSeq. September 6, 2018

Install RStudio from - use the standard installation.

Package les. October 4, 2013

Analyse RT PCR data with ddct

Package ffpe. October 1, 2018

CQN (Conditional Quantile Normalization)

Introduction to R: Using R for statistics and data analysis

Mixed models in R using the lme4 package Part 2: Lattice graphics

Package TilePlot. February 15, 2013

BioConductor Overviewr

Package crossmeta. September 5, 2018

limma: A brief introduction to R

Expander Online Documentation

Microarray Data Analysis (VI) Preprocessing (ii): High-density Oligonucleotide Arrays

Lab: An introduction to Bioconductor

NacoStringQCPro. Dorothee Nickles, Thomas Sandmann, Robert Ziman, Richard Bourgon. Modified: April 10, Compiled: April 24, 2017.

Frozen Robust Multi-Array Analysis and the Gene Expression Barcode

Expander 7.2 Online Documentation

Package virtualarray

PROCEDURE HELP PREPARED BY RYAN MURPHY

Agilent Feature Extraction Software (v10.5)

How to use the DEGseq Package

Package dyebias. March 7, 2019

Drug versus Disease (DrugVsDisease) package

affyqcreport: A Package to Generate QC Reports for Affymetrix Array Data

Stats with R and RStudio Practical: basic stats for peak calling Jacques van Helden, Hugo varet and Julie Aubert

Towards an Optimized Illumina Microarray Data Analysis Pipeline

Textual Description of webbioc

Package vsn. September 2, 2018

The analysis of rtpcr data

Package vsn. August 3, 2013

Analysis of two-way cell-based assays

Data Mining - Data. Dr. Jean-Michel RICHER Dr. Jean-Michel RICHER Data Mining - Data 1 / 47

Package polyphemus. February 15, 2013

How to use CNTools. Overview. Algorithms. Jianhua Zhang. April 14, 2011

Example how not to do it: JMP in a nutshell 1 HR, 17 Apr Subject Gender Condition Turn Reactiontime. A1 male filler

STEM. Short Time-series Expression Miner (v1.1) User Manual

Package macat. October 16, 2018

Using oligonucleotide microarray reporter sequence information for preprocessing and quality assessment

Bioconductor s sva package

Package makecdfenv. March 13, 2019

Statistical Software Camp: Introduction to R

From microarray images to Biological knowledge. Junior Barrera BIOINFO-USP DCC/IME-USP

Transcription:

Bioconductor exercises Exploring cdna data June 2004 Wolfgang Huber and Andreas Buness The following exercise will show you some possibilities to load data from spotted cdna microarrays into R, and to explore it using the statistical and visualization facilities of R and Bioconductor..) Preliminaries. To go through this exercise, you need to have installed R >=.9.0, the release.4 versions of the Bioconductor libraries Biobase, marray, multtest, limma, vsn, array- Magic, and the library lymphoma, which contains the excercises and the lymphoma data set (see http://llmpp.nih.gov/lymphoma/). > library(vsn) > library(multtest) > library(marray) > library(limma) > library(arraymagic) > library(lymphoma) 2.) Reading and exploring data files. a. First, you need to find the directory with the data on your harddisk. The path ends in library/lymphoma/data, and this is in the subdirectory where your R resides. On the command line, you can use the commands dir(), getwd and setwd to navigate around. In the GUI, you can use File, Change dir in the menu. b. Open the file lc7b048rex.dat in a text editor. This is the typical file format for the results from the image analysis on a cdna slide. Different image analysis programs use slightly different conventions and column headings, but you can always adapt the input function (see below) to your needs. c. Use the function read.delim to read the file into a data frame (that is a rectangular table of data) in R. > x <- read.delim("lc7b048rex.dat") d. The table is too large to print it out as a whole, but we can find out about its size (with the function dim) and look at individual rows of the table. > dim(x) > colnames(x) > x[:6, ] 3.) Simple plots. a. Let us first look at the histogram of the values in the column CHI, that is the channel foreground intensity (see Fig. ). > hist(x$chi) > hist(log2(x$chi), breaks = 00) > hist(log2(x$chi), breaks = seq(5, 5, by = 0.25), col = "blue") b. Visualization of the spatial homogeneity of the hybridisations may help to assess their quality. The logarithm of the background of channel is shown in Fig. 2. Various transformations of the background intensity values, like the logarithm or rank, emphasize inhomogeneities more or less. Additionally, the foreground of channel is visualized. > bg <- spatiallayout(value = x$chb, row = x$row, col = x$col, + block = x$grid) > plot(log(bg), main = "Log Background of Channel ") > plot(bg) > rankedbg <- spatiallayout(value = rank(x$chb), row = x$row,

Bioconductor exercises 2 Histogram of log2(x$chi) Frequency 0 00 200 300 400 500 600 700 6 8 0 2 4 log2(x$chi) Figure : + col = x$col, block = x$grid) > plot(rankedbg) > fg <- spatiallayout(value = x$chi, row = x$row, col = x$col, + block = x$grid) > plot(log(fg)) Log Background of Channel :length(ylabels) 96 89 82 75 68 6 54 47 40 33 26 9 2 6 4.5 5.0 5.5 6.0 6.5 7.0 6 2 9 26 33 40 47 54 6 68 75 82 89 96 :length(xlabels) Figure 2: c. Save one of the plots as PDF, and as Windows metafile. Copy and paste it into an MS-Office application. 4.) Calibration and variance stabilization. a. Subtract the background intensities CHB, CH2B from the foreground intensities CHI, CH2I, and store the result in a 926 x 2 matrix. > y = cbind(x$chi - x$chb, x$ch2i - x$ch2b)

Bioconductor exercises 3 b. What does the function cbind do? Use the R online help to find out. c. Now we can use the function vsn to calibrate and transform the data, and plot the result (Fig. 3). > ny <- vsn(y) vsn: 926 x 2 matrix ( stratum). Please wait for 0 dots:... > plot(exprs(ny), pch = ".") exprs(ny)[,2] 0 2 4 6 8 0 0 2 4 6 8 0 exprs(ny)[,] Figure 3: d. Have a look at the vignette try the command openvignette("vsn"). a. Reading a collection of files. b. The file phenodata.txt contains information on the samples that were hybridized onto the arrays. Look at it in a text editor. To load it into a phenodata object > samples <- read.phenodata("phenodata.txt", header = TRUE, as.is = TRUE) > pdata(samples) filename sampleid tumortype sex slidenumber lc7b047rex.dat CLL-3 CLL m 2 lc7b048rex.dat CLL-3 CLL m 2 3 lc7b069rex.dat CLL-52 CLL f 3 4 lc7b070rex.dat CLL-39 CLL f 4 5 lc7b09rex.dat DLCL-0032 DLCL f 5 6 lc7b056rex.dat DLCL-0024 DCLC m 6 7 lc7b057rex.dat DLCL-0029 DLCL m 7 8 lc7b058rex.dat DLCL-0023 DLCL <NA> 8 phenodata objects are where the Bioconductor stores information about samples, for example, treatment conditions in a cell line experiment or clinical or histopathological characteristics of tissue biopsies. c. Now we can load the whole set of 8 slides into the data object a. > files <- samples$filename > files [] "lc7b047rex.dat" "lc7b048rex.dat" "lc7b069rex.dat" "lc7b070rex.dat" [5] "lc7b09rex.dat" "lc7b056rex.dat" "lc7b057rex.dat" "lc7b058rex.dat" > thelayout <- new("marraylayout", mangr = 4, mangc = 4, mansr = 24, + mansc = 24) > datdir <- system.file("extdata", package = "lymphoma") > a <- read.marrayraw(files, path = datdir, name.gf = "CHI", name.gb = "CHB", + name.rf = "CH2I", name.rb = "CH2B", layout = thelayout)

Bioconductor exercises 4 d.... and try out different normalization methods:.vsn (affine normalization and variance stabilization) 2.maNorm with global median location normalization 3.maNorm with loess for intensity- or A-dependent location normalization using the loess smoother > na <- vsn(a) > na2 <- manorm(a, norm = "median", echo = T) e. These commands take their time! You can save the results into a file with the save function, and later restore them with the load function. You can use the GUI for the latter. f. Now we want to extract the normalized log-ratios. The first line in the following code creates a three-dimensional array M with space for 926 genes, 8 samples and 3 different normalization methods. na is the result of vsn; the normalized intensities are accessed via the function exprs, and the log-ratios are obtained by subtracting the red intensities from the green ones. na2 and na3 are the output of marraynorm, and the log-ratios are obtained through the slot mam using the accessor @. > odd <- seq(, 5, by = 2) > even <- seq(2, 6, by = 2) > M <- array(na, dim = c(926, 8, 2)) > M[,, ] <- exprs(na)[, even] - exprs(na)[, odd] > M[,, 2] <- na2@mam 5.) Compare the results. Look at scatterplots of the values of M from the same slide, calculated with different normalization methods. Do the values generally agree? How do they differ? > plot(m[, 4, ], M[, 4, 2], pch = ".", xlab = "vsn", ylab = "median") median 4 2 0 2 4 6 4 2 0 2 4 vsn Figure 4: 6.) Testing for differential transcription. Now we are ready to calculate test statistics and to select genes. Note: The number of replicates (4 versus 4) that we are considering here is too small to derive significant conclusions about individual genes. The full data set contains many more chips. Here we restrict ourselves to a few of them in order to keep things simple for the purpose of this course.

Bioconductor exercises 5 a. Look at the built-in function t.test, and at mt.teststat from the package multtest. Here, we use mt.teststat to calculate the t-test statistic for the comparison. The package multtest provides extensive functionality to calculate multiple-testing adjustments. > classlabel = c(0, 0, 0, 0,,,, ) > tstat = mt.teststat(m[,, ], classlabel) > summary(tstat) Min. st Qu. Median Mean 3rd Qu. Max. -24.3400 -.0330 0.86 0.832.3670 28.000 > hist(tstat, breaks = 00, col = "#fb6090") Histogram of tstat False Discovery Count Frequency 0 200 400 600 800 000 false discovery count 0.0 0.5.0.5 2.0 20 0 0 0 20 30 tstat 50 00 50 200 number of selected genes Figure 5: a) left: histogram of t-statistic, b) right: false disocvery count. b. Similar to the FDR (false discovery rate) we estimate the significance of the most extreme t-values. The false discovery count is calculated for several lists of various length of the highest scored genes (Fig. 5 right). > fc <- fdc(m[,, ], factor(classlabel), teststatfun = "rowttests") > plot(fc$nrgenesel, fc$fdc, main = "False Discovery Count", xlab = "number of selected genes", + ylab = "false discovery count", type = "b") c. Now we load the spot (gene) description table > spotdescr <- read.delim("annotationdata.txt") and print the 5 genes with the lowest values of the t-statistic > selection = order(tstat)[:5] > selection [] 4532 8076 6635 4586 739 as well as the 5 genes with the highest values of the t-statistic > selection = order(tstat, decreasing = TRUE)[:5] > selection [] 4323 4069 433 2026 243 In a following step we extract the annotation information for the 5 selected genes and generate a html-report (Fig. 6). > spotids = x$spot[selection] > geneanno = spotdescr[spotdescr[, "spotid"] %in% spotids, ] > write.htmltable(geneanno, filename = "candidategenes", title = "Candidate Genes") 7.) A simplified route to preprocessing and quality control.

Bioconductor exercises 6 Figure 6: a. The package arraymagic can be used to automatically process the image files, to create R- objects and to generate quality diagnostics. See Figs. 7 9. > res <- processarraydata(slidedescriptionfile = "phenodata.txt", + loadpath = datdir, savepath = ".", filenamecolumn = "filename", + slidenumbercolumn = "slidenumber", spotidentifier = "SPOT", + imagefiletype = "ScanAlyze", subtractbackground = TRUE, normalisationmethod = "vsn") > qr <- qualityparameters(arraydataobject = res$arraydataobject, + exprsetrgobject = res$exprsetrgobject, spotidentifier = "SPOT", + hybnamecolumn = "slidenumber", resultfilename = "qualityresult.txt") > qualitydiagnostics(exprsetrgobject = res$exprsetrgobject, arraydataobject = res$arraydataobje + qualityparameterslist = qr, plotoutput = "pdf", savepath = ".") > visualisehybridisations(arraydataobject = res$arraydataobject[, + c(, 2, 6)], namecolumnname = "slidenumber", mappingcolumns = list(block = "GRID", + Column = "COL", Row = "ROW"), type = "raw", savepath = ".", + plotoutput = "pdf") > glratios <- getexprsetgreenminusred(res$exprsetrgobject) > phenod <- phenodataslide(res$exprsetrgobject) distributionofrawdataslidewise quantiles: lower:0.25; middle:0.5; upper:0.75 0 200 400 600 800 000 200 2 2 3 3 4 4 5 5 6 6 7 7 8 8 Figure 7: Interquartile ranges of raw data.

Bioconductor exercises 7 slidedistances 2 3 4 5 6 7 8 0.4 0.5 0.6 0.7 0.8 2 3 4 5 6 7 8 Figure 8: Matrix of distances between all pairs of slides. green Foreground 4 5 6 7 8 9 0 green Background 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 red Foreground 5 6 7 8 9 0 red Background 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 6 2 8 24 30 36 42 48 54 60 66 72 78 84 90 96 9 8 29 40 5 62 73 84 95 6 2 8 24 30 36 42 48 54 60 66 72 78 84 90 96 9 8 29 40 5 62 73 84 95 Figure 9: Spatial distribution of log raw intensities of one slide.