Building R objects from ArrayExpress datasets


Audrey Kauffmann

October 30, 2017

1 ArrayExpress database

ArrayExpress is a public repository for transcriptomics and related data, which is aimed at storing MIAME-compliant data in accordance with MGED recommendations. The Gene Expression Atlas stores gene-condition summary expression profiles from a curated subset of experiments in the repository. Currently, around 917 460 assays are represented in ArrayExpress. Please visit http://www.ebi.ac.uk/arrayexpress/ for more information on the database.

2 MAGE-TAB format

In the repository, for each dataset, ArrayExpress stores a MAGE-TAB document with a standardized format. A MAGE-TAB document contains five different types of files: Investigation Description Format (IDF), Array Design Format (ADF), Sample and Data Relationship Format (SDRF), the raw data files and the processed data files.

The tab-delimited IDF file contains top-level information about the experiment, including title, description, submitter contact details and protocols. The tab-delimited ADF file describes the design of an array, e.g., what sequence is located at each position on an array and what the annotation of this sequence is. The tab-delimited SDRF file contains the sample annotation and the relationship between arrays as provided by the submitter. The raw zip file contains the raw files (for instance the CEL files for Affymetrix chips or the GPR files from the GenePix scanner), and the processed zip file contains all processed data in one generic tab-delimited text format.

3 Querying the database

It is possible to query the ArrayExpress repository using the queryAE function. Two arguments are available, keywords and species. You can use both of them simultaneously or each of them independently, according to your needs. If you want to use several words, they need to be separated by a + without any space.
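As a minimal sketch of the keyword rule above, the following base R snippet joins several search words with + and removes stray spaces. The helper name make_keywords is hypothetical and is not part of the ArrayExpress package; only the resulting strings would be passed on as the keywords or species argument.

```r
# Hypothetical helper (not part of the ArrayExpress package): builds a
# query string in which several words are joined by "+" without spaces.
make_keywords <- function(words) {
  words <- trimws(words)              # drop leading/trailing spaces
  words <- gsub("\\s+", "+", words)   # spaces inside an entry become "+"
  paste(words[nzchar(words)], collapse = "+")
}

kw <- make_keywords(c("breast", "cancer"))
sp <- make_keywords("homo sapiens")
print(kw)
print(sp)
```

The two strings could then be supplied to the query function as keywords = kw and species = sp, which requires an Internet connection.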
Here is an example where the object sets contains the identifiers of all the datasets for which the word pneumonia was found in the description and for which Homo sapiens is the studied species. You need to be connected to the Internet to have access to the database.

> library("ArrayExpress")
> sets = queryAE(keywords = "pneumonia", species = "homo+sapiens")

In August 2012, this query retrieved 21 identifiers. The output is a data frame with the identifiers and 7 columns, the first of which describe the raw and processed data. When raw and processed data are available for the same dataset, both share a single identifier, so there is no way to distinguish between a processed and a raw dataset just from the identifier. The column Raw of the queryAE output indicates whether the raw data are available for a dataset. The following columns contain the release date of the dataset in the database, the PubMed ID of the publication related to the dataset, the species, the experiment design and the experimental factors. The experiment design defines what was studied by the authors, such as time series, disease state or compound treatment. The experimental factors are the values of the studied factors, for instance Day 1 and Day 2.

For a more extended querying mode, the ArrayExpress website offers an advanced query interface with detailed results.

4 Import an ArrayExpress dataset into R

4.1 Call of the function ArrayExpress

Once you know which identifier you wish to retrieve, the function ArrayExpress can be called, using the following arguments:

- accession: an ArrayExpress experiment identifier for which the raw data are available.
- path: the name of the directory in which the files downloaded from the ArrayExpress repository will be extracted. The default is the current directory.
- save: if TRUE, the files downloaded from the database will not be deleted from path after executing the function.
- dataCols: by default, the columns are automatically selected according to the scanner type. If the scanner is unknown, or if the user wants to use columns other than the default, the argument dataCols can be set. For two-colour arrays it must be a list with the fields R, G, Rb and Gb giving the column names to be used for red and green foreground and background. For one-colour arrays, it must be a character string with the column name to be used. These column names must correspond to existing column names of the expression files.

You still need to be connected to the Internet to have access to the database.

4.2 Examples and output format

Simple example

The output object is an AffyBatch if the ArrayExpress identifier corresponds to an Affymetrix experiment, an ExpressionSet if it corresponds to a one-colour non-Affymetrix experiment, and an NChannelSet if it corresponds to a two-colour experiment.
> rawset = ArrayExpress("E-MEXP-1422")
Reading in : /tmp/rtmpdllnex/af15.cel
Reading in : /tmp/rtmpdllnex/af16.cel
Reading in : /tmp/rtmpdllnex/af6.cel
Reading in : /tmp/rtmpdllnex/af14.cel
Reading in : /tmp/rtmpdllnex/af7.cel
Reading in : /tmp/rtmpdllnex/af8.cel

In this example, E-MEXP-1422 being an Affymetrix experiment, rawset is an AffyBatch. The expression values of rawset are the content of the CEL files. The phenoData of rawset contains the whole sample annotation from the MAGE-TAB SDRF file. The featureData of rawset contains the whole feature annotation from the MAGE-TAB ADF file. The experimentData of rawset contains the experiment annotation from the MAGE-TAB IDF file.
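The four components described above are reached with the standard Biobase accessors. Since rawset requires a download, the sketch below builds a small toy ExpressionSet instead. All values and names in it are invented for illustration, and it assumes that the Biobase package is installed. The same accessors apply to an AffyBatch.

```r
library(Biobase)

# Toy data, invented for illustration: 3 features measured on 2 samples.
m <- matrix(c(7.1, 8.2, 5.5, 7.3, 8.0, 6.1), nrow = 3,
            dimnames = list(c("p1", "p2", "p3"), c("s1", "s2")))
pd <- data.frame(Factor.Value.sex = c("female", "male"),
                 row.names = c("s1", "s2"))
fd <- data.frame(symbol = c("A", "B", "C"),
                 row.names = c("p1", "p2", "p3"))

toy <- ExpressionSet(assayData = m,
                     phenoData = AnnotatedDataFrame(pd),
                     featureData = AnnotatedDataFrame(fd),
                     experimentData = MIAME(title = "toy experiment"))

exprs(toy)            # expression matrix (CEL intensities for an AffyBatch)
pData(toy)            # sample annotation (from the SDRF file)
fData(toy)            # feature annotation (from the ADF file)
experimentData(toy)   # experiment description (from the IDF file)
```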

Example when the column names are needed

In the case of non-Affymetrix experiments, ArrayExpress decides automatically, based on known quantitation types, which columns from the scan files are to be used to fill the exprs. However, in some cases the scanner software is not recognized or is unknown, and this decision cannot be made automatically. The argument dataCols is then needed. Here is an example.

> eset = try(ArrayExpress("E-MEXP-1870"))

Here, the object cannot be built because the columns are not recognized. The error message also gives a list of possible column names. We can then call the function again, feeding the argument dataCols with the appropriate column names.

> eset = ArrayExpress("E-MEXP-1870",
+    dataCols=list(R="ScanArray Express:F650 Mean",
+       G="ScanArray Express:F550 Mean",
+       Rb="ScanArray Express:B650 Mean",
+       Gb="ScanArray Express:B550 Mean"))

Then eset is created. However, there is still a warning: the phenoData cannot be built. This means that the object is correctly created but the sample annotation has not been attached to it. It is still possible to manually correct the files and try to build the object. To do so, the functions getAE and ae2bioc, used by the ArrayExpress function, can be called separately.

5 Download an ArrayExpress dataset on your local machine

It is possible to only download the data, by calling the function getAE. The arguments accession and path are the same as for the ArrayExpress function. The argument type determines whether you retrieve the MAGE-TAB files with the raw data only (by setting type to "raw"), the MAGE-TAB files with the processed data only (by setting type to "processed"), or all the MAGE-TAB files, both raw and processed (by setting type to "full"). Here, you also need an Internet connection to access the database.

> mexp1422 = getAE("E-MEXP-1422", type = "full")

Here, the output is a list of all the files that have been downloaded and extracted in the directory path.
6 Build an R object from local raw MAGE-TAB

If you have your own raw MAGE-TAB data, or if you want to build an R object from existing MAGE-TAB data that are stored locally on your computer, the function ae2bioc can convert them into an R object. The arguments of ae2bioc are:

- mageFiles: a list as returned by the getAE function, containing the elements rawFiles (the expression files to use to create the object), sdrf (name of the SDRF file), idf (name of the IDF file), adf (name of the ADF file) and path (the name of the directory containing these files).
- dataCols: by default, the columns are automatically selected according to the scanner type. If the scanner is unknown, the argument dataCols can be set. For two-colour arrays it must be a list with the fields R, G, Rb and Gb giving the column names to be used for red and green foreground and background. For one-colour arrays, it must be a character string with the column name to be used. These column names must correspond to existing column names of the expression files.
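The constraints on the column argument listed above can be checked before building the object. The following base R sketch uses a hypothetical helper, check_datacols, which is not part of the ArrayExpress package. The header vector is made up from the column names used in the E-MEXP-1870 example and stands in for the real header of an expression file.

```r
# Hypothetical helper (not part of ArrayExpress): verifies that a column
# specification is well formed and refers to existing column names.
check_datacols <- function(datacols, available) {
  if (is.list(datacols)) {
    # Two-colour arrays: a list with the fields R, G, Rb and Gb.
    needed <- c("R", "G", "Rb", "Gb")
    missing_fields <- setdiff(needed, names(datacols))
    if (length(missing_fields) > 0)
      stop("missing fields: ", paste(missing_fields, collapse = ", "))
    wanted <- unlist(datacols[needed])
  } else {
    # One-colour arrays: a single character string.
    stopifnot(is.character(datacols), length(datacols) == 1)
    wanted <- datacols
  }
  unknown <- setdiff(wanted, available)
  if (length(unknown) > 0)
    stop("columns not found: ", paste(unknown, collapse = ", "))
  invisible(TRUE)
}

# Assumed header of an expression file, for illustration only.
header <- c("ScanArray Express:F650 Mean", "ScanArray Express:F550 Mean",
            "ScanArray Express:B650 Mean", "ScanArray Express:B550 Mean")
cols <- list(R = header[1], G = header[2], Rb = header[3], Gb = header[4])
ok <- check_datacols(cols, header)

# An incomplete specification is rejected with an informative error.
bad <- try(check_datacols(list(R = header[1]), header), silent = TRUE)
```

Checking the specification first gives a clearer error than a failed build, and the validated list can then be passed on unchanged.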

As an example, we can use the files that we downloaded in the previous example.

> rawset = ae2bioc(mageFiles = mexp1422)
Reading in : /tmp/rtmpwajfp3/rbuild7b2f1ce5b833/arrayexpress/vignettes/af15.cel
Reading in : /tmp/rtmpwajfp3/rbuild7b2f1ce5b833/arrayexpress/vignettes/af16.cel
Reading in : /tmp/rtmpwajfp3/rbuild7b2f1ce5b833/arrayexpress/vignettes/af6.cel
Reading in : /tmp/rtmpwajfp3/rbuild7b2f1ce5b833/arrayexpress/vignettes/af14.cel
Reading in : /tmp/rtmpwajfp3/rbuild7b2f1ce5b833/arrayexpress/vignettes/af7.cel
Reading in : /tmp/rtmpwajfp3/rbuild7b2f1ce5b833/arrayexpress/vignettes/af8.cel

The object rawset is an AffyBatch.

7 Build an R object from local processed MAGE-TAB

Processed data in the database are less uniform, as processing methods vary a lot. To import a processed dataset from ArrayExpress, three steps are required: (i) download the dataset using getAE, (ii) identify which column is of interest using getcolproc, (iii) create the R object with procset.

7.1 Identification of the columns to extract

Once the data are downloaded, we need to know the different column names available in the processed file to be able to choose a relevant one to extract. The function getcolproc needs, as an input, a list containing two slots:

- procfile: the name of the processed expression file.
- path: the name of the directory containing the procfile.

This kind of list is given as an output by the function getAE.

> cn = getcolproc(mexp1422)
> show(cn)
[1] "Composite Element REF"   "Software Unknown:GC-RMA"
[3] "Unknown:Detection Call"

cn is a character vector with the available column names in the procfile.

7.2 Creation of the object

We can now create the ExpressionSet using procset. This function has two arguments:

- files: a list as given by getAE containing the names of the processed, SDRF, IDF and ADF files and the path.
- procol: the name of the column chosen after using getcolproc.
> proset = procset(mexp1422, cn[2])
Reading processed data matrix file, /tmp/rtmpwajfp3/rbuild7b2f1ce5b833/arrayexpress/vignettes/ex

proset is an ExpressionSet containing the processed log(ratio) of the dataset E-MEXP-1422.
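When a processed file has many columns, the choice made in step (ii) can be expressed by pattern rather than by position. The following is a minimal base R sketch that reuses the column names printed above. The pattern GC-RMA is specific to this dataset and serves only as an example.

```r
# Column names as printed by show(cn) for E-MEXP-1422 (copied from above).
cn <- c("Composite Element REF", "Software Unknown:GC-RMA",
        "Unknown:Detection Call")

# Keep the columns whose name mentions the processing method of interest.
procol <- grep("GC-RMA", cn, value = TRUE, fixed = TRUE)

# Guard against ambiguous or empty matches before building the object.
stopifnot(length(procol) == 1)
print(procol)
```

The value of procol could then replace cn[2] in the call to procset.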

8 Example of a standard microarray analysis using data from ArrayExpress

In this section, we briefly describe an analysis of data publicly available on ArrayExpress, from the data import to the identification of differentially expressed genes.

The first step consists of importing a dataset from ArrayExpress. We chose E-MEXP-1416. This dataset studies the transcription profiling of melanized dopamine neurons isolated from male and female patients with Parkinson disease to investigate gender differences.

> AEset = ArrayExpress("E-MEXP-1416")

As AEset is an Affymetrix experiment, we can use RMA normalisation to process the raw data; please read the rma function help.

> library("affy")
> AEsetnorm = rma(AEset)

To check the normalisation efficiency, we can run a quality assessment. For details on the arguments used, please read the arrayQualityMetrics vignette. Please note that, at the time of its release (Oct 2010), Bioconductor 2.7 did not include a 64-bit Windows binary version of the arrayQualityMetrics package (but this might change in the future).

> fac = grep("Factor.Value",colnames(pData(AEsetnorm)), value=TRUE)
> if (suppressWarnings(require("arrayQualityMetrics", quietly=TRUE))) {
+    qanorm = arrayQualityMetrics(AEsetnorm,
+       outdir = "QAnorm",
+       intgroup = fac)
+ }

Now that we have ensured that the data are well processed, we can search for differentially expressed genes using the package limma. To understand the details of each step, please see the limma user guide.
> library("limma")
> facs = pData(AEsetnorm)[,fac]
> facs[facs[,2]=="female",2]="F"
> facs[facs[,2]=="male",2]="M"
> facs[facs[,1]=="Parkinson disease",1]="parkinson"
> facs = paste(facs[,1],facs[,2], sep=".")
> f = factor(facs)
> design = model.matrix(~0+f)
> colnames(design) = levels(f)
> fit = lmFit(AEsetnorm, design)
> cont.matrix = makeContrasts(normal.FvsM = normal.F-normal.M,
+    parkinson.FvsM = parkinson.F-parkinson.M,
+    Diff=(parkinson.F-parkinson.M)-(normal.F-normal.M),
+    levels=design)
> fit2 = contrasts.fit(fit, cont.matrix)
> fit2 = eBayes(fit2)
> res = topTable(fit2, coef = "parkinson.FvsM", adjust = "BH")

Here we end up with a list of genes that are differentially expressed between the female and the male patients having Parkinson's disease, which was the topic of the paper, but one can perform other comparisons with the same data.
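To see what the design and contrast steps above construct, the following sketch reproduces them in base R on invented sample labels, without limma or the real dataset. The group labels mirror those used above. The number of samples per group is an assumption made for illustration.

```r
# Invented sample annotation: disease state and sex for 8 samples.
facs <- data.frame(state = rep(c("normal", "parkinson"), each = 4),
                   sex   = rep(c("F", "M"), times = 4))
f <- factor(paste(facs$state, facs$sex, sep = "."))

# One indicator column per group, no intercept (the "~0+f" formula).
design <- model.matrix(~0 + f)
colnames(design) <- levels(f)

# The "Diff" contrast written out by hand as weights on the group means:
# (parkinson.F - parkinson.M) - (normal.F - normal.M)
diff_contrast <- c(normal.F = -1, normal.M = 1,
                   parkinson.F = 1, parkinson.M = -1)[colnames(design)]

print(colSums(design))   # number of samples in each group
print(diff_contrast)
```

Each row of design has a single 1 marking the group of that sample, so the fitted coefficients are group means and the contrast weights sum to zero.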