Bayesian Robust Inference of Differential Gene Expression The bridge package

Similar documents
Issues in MCMC use for Bayesian model fitting. Practical Considerations for WinBUGS Users

AMCMC: An R interface for adaptive MCMC

A GENERAL GIBBS SAMPLING ALGORITHM FOR ANALYZING LINEAR MODELS USING THE SAS SYSTEM

Package BAC. R topics documented: November 30, 2018

MCMC Diagnostics. Yingbo Li MATH Clemson University. Yingbo Li (Clemson) MCMC Diagnostics MATH / 24

1 Methods for Posterior Simulation

Bayesian Estimation for Skew Normal Distributions Using Data Augmentation

Linear Modeling with Bayesian Statistics

MCMC Methods for data modeling

Tutorial using BEAST v2.4.1 Troubleshooting David A. Rasmussen

BAYESIAN OUTPUT ANALYSIS PROGRAM (BOA) VERSION 1.0 USER S MANUAL

The Iterative Bayesian Model Averaging Algorithm: an improved method for gene selection and classification using microarray data

Package atmcmc. February 19, 2015

1. Start WinBUGS by double clicking on the WinBUGS icon (or double click on the file WinBUGS14.exe in the WinBUGS14 directory in C:\Program Files).

Stata goes BUGS (via R)

A Short History of Markov Chain Monte Carlo

Bioconductor s stepnorm package

/ Computational Genomics. Normalization

A noninformative Bayesian approach to small area estimation

Bayesian Statistics Group 8th March Slice samplers. (A very brief introduction) The basic idea

MATH3880 Introduction to Statistics and DNA MATH5880 Statistics and DNA Practical Session Monday, 16 November pm BRAGG Cluster

COPULA MODELS FOR BIG DATA USING DATA SHUFFLING

Markov Chain Monte Carlo (part 1)

Computer vision: models, learning and inference. Chapter 10 Graphical Models

STATISTICS (STAT) Statistics (STAT) 1

Assessing the Quality of the Natural Cubic Spline Approximation

ClaNC: The Manual (v1.1)

Discussion on Bayesian Model Selection and Parameter Estimation in Extragalactic Astronomy by Martin Weinberg

Correlation Motif Vignette

Package bsam. July 1, 2017

Analysis of Incomplete Multivariate Data

Package mcmcse. February 15, 2013

Quantitative Biology II!

Bayesian Analysis of Differential Gene Expression

Tutorial using BEAST v2.4.7 MASCOT Tutorial Nicola F. Müller

The boa Package. May 3, 2005

QAnalyzer 1.0c. User s Manual. Features

Modeling Criminal Careers as Departures From a Unimodal Population Age-Crime Curve: The Case of Marijuana Use

Package batchmeans. R topics documented: July 4, Version Date

Exploratory data analysis for microarrays

Running WinBUGS from within R

CHAPTER 1 INTRODUCTION

Journal of Statistical Software

Introduction to WinBUGS

MCMC GGUM v1.2 User s Guide

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Organizing, cleaning, and normalizing (smoothing) cdna microarray data

Bayesian Spatiotemporal Modeling with Hierarchical Spatial Priors for fmri

Missing Data and Imputation

10.4 Linear interpolation method Newton s method

Package hergm. R topics documented: January 10, Version Date

Model validation through "Posterior predictive checking" and "Leave-one-out"

Bayesian Analysis of Extended Lomax Distribution

Clustering Relational Data using the Infinite Relational Model

Package dirmcmc. R topics documented: February 27, 2017

Bayesian Multiple QTL Mapping

Level-set MCMC Curve Sampling and Geometric Conditional Simulation

multiscan: Combining multiple laser scans of microarrays

PIANOS requirements specifications

Package BayesMetaSeq

Approximate Bayesian Computation. Alireza Shafaei - April 2016

Package MfUSampler. June 13, 2017

Heterotachy models in BayesPhylogenies

Course on Microarray Gene Expression Analysis

Package snm. July 20, 2018

From Bayesian Analysis of Item Response Theory Models Using SAS. Full book available for purchase here.

DDP ANOVA Model. May 9, ddpanova-package... 1 ddpanova... 2 ddpsurvival... 5 post.pred... 7 post.sim DDP ANOVA with univariate response

Normalization: Bioconductor s marray package

BHMSMAfMRI User Manual

Package mfa. R topics documented: July 11, 2018

Preprocessing -- examples in microarrays

Package iterativebmasurv

Monte Carlo Methods and Statistical Computing: My Personal E

Online Supplement to Bayesian Simultaneous Edit and Imputation for Multivariate Categorical Data

Model-Based Clustering for Online Crisis Identification in Distributed Computing

Chapter 1. Introduction

Package sspse. August 26, 2018

The GLMMGibbs Package

Variability in Annual Temperature Profiles

Supplementary Material sppmix: Poisson point process modeling using normal mixture models

PROCEDURE HELP PREPARED BY RYAN MURPHY

A guided walk Metropolis algorithm

affyqcreport: A Package to Generate QC Reports for Affymetrix Array Data

Package acebayes. R topics documented: November 21, Type Package

Package beast. March 16, 2018

Simulating from the Polya posterior by Glen Meeden, March 06

RJaCGH, a package for analysis of

Time Series Analysis by State Space Methods

Random Number Generation and Monte Carlo Methods

From raw data to gene annotations

Probabilistic Graphical Models

Manual for the R gaga package

You are free to use this program, for non-commercial purposes only, under two conditions:

Sampling informative/complex a priori probability distributions using Gibbs sampling assisted by sequential simulation

STAT 203 SOFTWARE TUTORIAL

University of Wisconsin-Madison Spring 2018 BMI/CS 776: Advanced Bioinformatics Homework #2

General Factorial Models

General Factorial Models

DPP: Reference documentation. version Luis M. Avila, Mike R. May and Jeffrey Ross-Ibarra 17th November 2017

Introduction to GE Microarray data analysis Practical Course MolBio 2012

Transcription:

Bayesian Robust Inference of Differential Gene Expression The bridge package Raphael Gottardo October 30, 2017 Contents Department Statistics, University of Washington http://www.rglab.org raph@stat.washington.edu 1 Overview 1 2 Installation 2 2.1 Windows.......................................... 2 2.2 Unix/Linux......................................... 2 3 Data Import 3 3.1 bridge.2samples....................................... 3 3.2 bridge.3samples....................................... 3 4 Testing for differential expression between two samples 3 5 Testing for differential expression among three samples 6 6 Computing time and batch mode 6 7 Acknowledgment 6 1 Overview The bridge package consists of several functions for testing for differential expression among multiple samples. As for now, the package can only be used with two and three samples. The extension to more than three samples should be straight forward. However, it is good to note that as the number of sample increases, the complexity of the model increases exponentially. In pratice, it would be hard to deal with a large number of samples. 1

We use a robust Bayesian hierarchical model for testing for differential expression with microarrays. The model is a generalization of a model used by Gottardo et al. (2006a) to estimate the true mean intensities from replicates. Outliers are modeled explicitly using a t-distribution. The model is flexible with an exchangeable prior for the variances which allow different variances for the genes but still shrink extreme variances. Our model can be used for testing for differentially expressed genes among multiple samples. Parameter estimation is carried out using Markov Chain Monte Carlo. The robust estimation is achieved by hierarchical Bayesian modeling (Gottardo et al. 2006b). Realizations are generated from the posterior distribution via Markov chain Monte Carlo (MCMC) algorithms (Gelfand and Smith 1990). Where the full conditionals are of simple form, Gibbs updates are used; where the full conditionals were not of standard form, slice sampling is used. We use the stepping out procedure for the slice sampling as introduced in Neal (2003). Convergence of the MCMC can be assessed using the coda library available from CRAN. Our experience with the software is that 50,000 iterations are usually enough to estimate quantities of interest such as posterior means, credible intervals, etc. Because the number of genes I can be quite large, the computing time can be long. We recommend running the functions provided in the package in batch mode. This can be done via R CMD BATCH. An example is presented in the next section. Help files. As with any R package, detailed information on functions, their arguments and value, can be obtained in the help files. For instance, to view the help file for the function bridge.2samples in a browser, use?bridge.2samples. 2 Installation 2.1 Windows You can install the package using the menu Packages Install Packages from local zip... Then browse you local directory to find the package. Make sure you have root privilegies if you want to install the package in the default directory. 2.2 Unix/Linux Under Unix, go into your favorite shell window and type R CMD INSTALL bridge_xx.tar.gz (the package has to be in the directory you are executing the command from). By default R will install the package in /usr/local/r/lib/r... and you will need to have root privilegies in order to do so. You can also install the package locally by using the option -l. If you want to install the package in /home/user/rlib/ use the command R CMD INSTALL bridge_xx.tar.gz -l /home/user/rlib. If you install the package in a directory different from the R default directory you will need to specify the full path when you load the package, i.e. > library(bridge) 2

3 Data Import 3.1 bridge.2samples You can read your data into R, using the read.table command. It is hard to give a general guideline as different data will have different format. In order to use the package you will need to create two data sets, one for each sample (control and treatment). The data should be on the raw scale, i.e. not transformed. The two datasets should be arranged such that rows correspond to genes and columns to replicates. Table 1 give an example of a data set in the right format. Table 1: Format of the control and treatment datasets used in the rama package. The two samples must be seperated into two datasets. sample 1 1 1 1 2 2 2 2 replicate 1 2 3 4 1 2 3 4 Dataset 1 Dataset 2 gene 1........................ gene 2................................................. gene n........................ 3.2 bridge.3samples In this case, one would need to have three datasets, one for each sample. Again the data should be on the raw scale, i.e. not transformed! 4 Testing for differential expression between two samples Testing for differential between two samples is done via the function bridge.2samples. We demonstrate its functionality using gene expression data from an HIV study of van t Wout et al. (2003). To load the HIV dataset, use data(hiv), and to view a description of the experiments and data, type?hiv. We first load the hiv dataset. > data(hiv) This data set consists of 4 experiments using the same RNA preparation on 4 different slides. The expression levels of about 8000 genes were assessed in CD4-T-cell lines at time t = 24 hour after infection with HIV virus type 1. The first 4 columns correspond to the first treatment state (hiv infected). The second four represent the control state. The experiment is a balanced dye swap experiment. Finally, the last two columns contain the row and column positions of each gene on the array (slide). Our goal is to test for differentially expressed genes between the two samples. To demonstrate the function bridge.2samples we will only use a subset of 640 genes. 3

We model y = log 2 (y) where y are the raw intensities. Testing of differential expression is done using the function bridge.2samples. This function output an object of type bridge2 containing the sampled values from the posterior distribution. In our case the inference will be based on the posterior probability of differential expression, post.p. > bridge.hiv<-bridge.2samples(hiv[1:640,c(1:4)],hiv[1:640,c(5:8)],b=2000,min.iter=0,batch=1,mcm Once you have fitted the model you may view or plot the sampled parameters from the posterior distribution. We can also obtain point estimates of the paramaters of interest such as the gene effects > gamma1<-mat.mean(bridge.hiv$gamma1)[,1] > gamma2<-mat.mean(bridge.hiv$gamma2)[,1] and/or plot the resulting posterior probabilities as a function of the posterior log ratios of γ 1 γ 2. > plot(gamma1-gamma2,bridge.hiv$post.p,col=1,pch=1,main="posterior probability",xlab="log ratio Posterior probability posterior probability 0.0 0.2 0.4 0.6 0.8 1.0 0 2 4 6 8 log ratio 4

Robustness is achieved by using a t-distribution for the errors with a different degree of freedoms for each replicate array. This is formally equivalent to giving each replicate a different weight in the estimation process for each gene. A low number of degrees of freedom for one array will indicate that the corresponding array contains numerous outliers. One could use the function hist to look at the posterior mode of the degrees of freedom for each array. > hist(bridge.hiv$nu1[3,],main="posterior degree of freedoms, array 3",xlab="nu",50) Posterior degree of freedoms, array 3 Frequency 0 100 200 300 400 500 600 5 10 15 20 25 30 nu Fitting a t-distribution can be seen as a weighted least square problem with a special hierarchical structure for the variances. Similarly to a weighted least squares function, our function computes weights associated with each replicate. The weights can be useful as a tool for diagnosing the quality of the replicate measurements. If one knows the exact positions of the genes on the array, it is possible to look at the spatial variation of the weights. This can be done via the function weight.plot of the rama package. We refer the reader to the rama package vignette for further details. 5

5 Testing for differential expression among three samples Because (most) microarray technologies allow the direct comparison of at most two samples, the model used in bridge.2samples needs to be slightly modified as well to accommodate for changes in the experimental design. In bridge.2samples, the log measurements were modeled as a bivariate t-distribution for the log measurements. Depending on the technology used, with three samples, the data will either be ratios (cdna) or raw measurements (Affymetrix) and therefore a natural model is a univariate t-distribution. The estimation of the parameters is done similarly to the function bridge.2samples. The function output an object of type bridge3 containing the sampled values from the posterior distribution. In our case the inference will be based on the posterior probability of differential expression. > sample1<-matrix(exp(rnorm(150)),50,3) > sample2<-matrix(exp(rnorm(200)),50,4) > sample3<-matrix(exp(rnorm(150)),50,3) > mcmc.bridge3<-bridge.3samples(sample1,sample2,sample3,b=10,min.iter=0,batch=1,mcmc.obj=null,a 6 Computing time and batch mode The computing time can be quite long depending on the number of genes and replicates. It takes about 5 hours to run the MCMC on the full HIV data set using a Linux machine with AMD Athlon at 2GHz. However we feel that this additional computational cost is worth the effort and can lead to good results. The code can be run in batch mode without any user intervention; this optimizes computing resources and allows the user to perform other tasks at the same time. R code can be run in batch mode by typing R CMD BATCH myfile.r at the shell prompt. The file myfile.r contains the command to execute. An example file could be the following, Then, when the execution is done one can load the resulting output into R using the load command. For further details about the BATCH command type?batch in R. 7 Acknowledgment This research was supported by NIH Grant 8 R01 EB002137-02. References Gelfand, A. E. and A. F. M. Smith (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398 409. Gottardo, R., A. Raftery, K. Yeung, and R. Bumgarner (2006a, Jan). Quality control and robust estimation for cdna microarrays with replicates. J Am Stat Assoc 101, 30 40. Gottardo, R., A. E. Raftery, K. Y. Yeung, and R. E. Bumgarner (2006b, Mar). Bayesian robust inference for differential gene expression in microarrays with multiple samples. Biometrics 62 (1), 10 8. Neal, R. (2003). Slice sampling. The Annals of Statistics 31, 705 767. 6

van t Wout, A. B., G. K. Lehrma, S. A. Mikheeva, G. C. O Keeffe, M. G. Katze, R. E. Bumgarner, G. K. Geiss, and J. I. Mullins (2003). Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4 + T Cell lines. Journal of Virology 77, 1392 1402. 7