How to store and visualize RNA-seq data

Gabriella Rustici, Functional Genomics Group, gabry@ebi.ac.uk
EBI is an Outstation of the European Molecular Biology Laboratory.

Talk summary
- How do we archive RNA-seq data in ArrayExpress
- How do we process RNA-seq data
- How do we display RNA-seq data in the Expression Atlas

Components of a functional genomics experiment

ArrayExpress (www.ebi.ac.uk/arrayexpress/)
- Is a public repository for functional genomics data, mostly generated using microarray or high-throughput sequencing (HTS) assays
- Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ
- Provides easy access to well-annotated data in a structured and standardized format
- Facilitates the sharing of microarray designs and experimental protocols
- Is based on community standards: MIAME guidelines and MAGE-TAB format for microarray data, MINSEQE guidelines for HTS data (http://www.mged.org/minseqe/)

Standards for sequencing: MINSEQE guidelines
MINSEQE: Minimal Information about a high-throughput Nucleotide SEQuencing Experiment
The proposed guidelines for MINSEQE (still a work in progress) are:
1. General information about the experiment
2. Essential sample annotation, including experimental factors and their values (e.g. compound and dose)
3. Experimental design, including sample data relationships (e.g. which raw data file relates to which sample)
4. Essential experimental and data processing protocols
5. Sequence read data with quality scores, raw intensities and processing parameters for the instrument
6. Final processed data for the set of assays in the experiment

Standards for microarray & sequencing: MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment. We adapted it to handle HTS data:
- IDF: Investigation Description Format file. Contains top-level information about the experiment, including title, description, submitter contact details and protocols.
- SDRF: Sample and Data Relationship Format file. Contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.
- Data files: raw and processed data files. The raw data files are the trace data files (.srf or .sff). FASTQ format files are also accepted, but SRF format files are preferred. The trace data files that you submit to ArrayExpress will be stored in the European Nucleotide Archive (ENA). The processed data file is a data matrix file containing processed values, e.g. files in which the expression values are linked to genome coordinates.
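To make the SDRF idea concrete, here is a minimal sketch of reading a tab-delimited SDRF and recovering which data file belongs to which sample. The excerpt is invented for illustration, and the column names ("Source Name", "Factor Value[compound]", "Array Data File") are assumptions that follow MAGE-TAB naming conventions; a real submission may use different columns for sequencing files.

```python
import csv
import io

# Hypothetical SDRF excerpt (tab-delimited). Rows and file names are invented.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[compound]\tArray Data File\n"
    "sample 1\tHomo sapiens\tnone\tsample1.fastq\n"
    "sample 2\tHomo sapiens\tdrugX\tsample2.fastq\n"
    "sample 2\tHomo sapiens\tdrugX\tsample2_rerun.fastq\n"
)


def samples_to_files(sdrf_text, file_column="Array Data File"):
    """Map each sample (Source Name) to the list of data files recorded for it."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    mapping = {}
    for row in reader:
        mapping.setdefault(row["Source Name"], []).append(row[file_column])
    return mapping


def factor_values(sdrf_text):
    """Collect the experimental factor values recorded for each sample."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    factors = {}
    for row in reader:
        factors[row["Source Name"]] = {
            name: value
            for name, value in row.items()
            if name.startswith("Factor Value[")
        }
    return factors


if __name__ == "__main__":
    print(samples_to_files(SDRF_TEXT))
    print(factor_values(SDRF_TEXT))
```

Because the SDRF is plain tab-delimited text, a standard CSV reader is enough; no MAGE-TAB specific library is assumed here.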

Types of data that can be submitted

ArrayExpress two databases

What is the difference between Archive and Atlas?
Archive
- Query by experiment, sample and experimental factor annotations
- Filter on species, array platform, molecule assayed and technology used
Atlas
- Gene and/or condition queries
- Query across experiments and across platforms

ArrayExpress two databases

How much data is in the AE Archive?

Browsing the AE Archive

Browsing the AE Archive
- AE unique experiment ID
- Curated title of experiment
- Number of assays
- Species investigated
- The date when the data were loaded in the Archive
- "Loaded in Atlas" flag
- Raw sequencing data available in ENA
- The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed
- The total number of experiments and assays retrieved
- The direct link to raw and processed data. An icon indicates that this type of data is available.

Browsing the AE Archive

RNA-seq data in AE Archive

HTS data in AE Archive (06.09.2011)

HTS data in AE Archive

Link to raw data in ENA

RNA-seq processing pipeline
[Diagram. Labels: Direct data submissions and GEO import; Short reads (FASTQ files); Summary level data; ArrayExpress Archive; Data Acquisition; FASTQ files; SDRF; RNA-seq processing pipeline; Expression Atlas; RPKMs; BAMs; EGA; ENA; Ensembl]
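The pipeline reports expression as RPKMs (reads per kilobase of transcript per million mapped reads). As a reminder of what that unit means, here is a small sketch of the standard RPKM formula. The function name and the example numbers are illustrative assumptions; this is not the code the pipeline itself runs.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count * 1e9 / (transcript_length_bp * total_mapped_reads)
    The factor 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling.
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Invented example: 500 reads on a 2 kb transcript, 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))
```

Normalizing by both transcript length and sequencing depth is what makes values comparable between genes and between samples.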

RNA-seq processing pipeline: ArrayExpressHTS
ArrayExpressHTS is an R-based pipeline for pre-processing, expression estimation and data quality assessment of RNA-seq datasets.
The pipeline can be used for analyzing:
- private data
- public data, available through ArrayExpress and ENA
It can be used:
- on a local computer
- remotely on the EBI R Cloud, www.ebi.ac.uk/tools/rcloud
Goncalves et al., Bioinformatics 2011

ArrayExpressHTS in Bioconductor

ArrayExpressHTS pipeline
- Reference: transcriptome or genome
- Alignment: Bowtie, BWA or TopHat
- Filtering options (e.g., average base quality, read complexity)
- Expression estimation: cufflinks or MMSEQ
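To illustrate what a filter on average base quality does, here is a minimal sketch that keeps only reads whose mean Phred score meets a threshold. It assumes four-line FASTQ records with Phred+33 quality encoding; the threshold of 20 and the function names are illustrative assumptions, and this is not the filter implemented inside ArrayExpressHTS.

```python
def mean_base_quality(quality_string, offset=33):
    """Mean Phred score of a read, assuming the given ASCII offset (Phred+33)."""
    if not quality_string:
        return 0.0
    return sum(ord(char) - offset for char in quality_string) / len(quality_string)


def filter_fastq_records(lines, min_mean_quality=20.0, offset=33):
    """Yield (header, sequence, quality) for reads passing the quality threshold.

    `lines` is an iterable of FASTQ lines, four lines per record:
    header, sequence, '+' separator, quality string.
    """
    records = [line.rstrip("\n") for line in lines]
    for start in range(0, len(records) - 3, 4):
        header, sequence, _separator, quality = records[start:start + 4]
        if mean_base_quality(quality, offset) >= min_mean_quality:
            yield header, sequence, quality


if __name__ == "__main__":
    # Invented reads: 'I' encodes Phred 40, '!' encodes Phred 0.
    example = ["@read1", "ACGT", "+", "IIII", "@read2", "ACGT", "+", "!!!!"]
    for record in filter_fastq_records(example):
        print(record)
```

A read complexity filter would follow the same pattern, with a different per-read score in place of the mean quality.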

Using ArrayExpressHTS
library("ArrayExpressHTS")
aehts <- ArrayExpressHTS("E-GEOD-16190", usercloud = FALSE)

ArrayExpressHTS on the R cloud
[Diagram. Labels: R-cloud; R-server (x3); ArrayExpressHTS R package; SDRF; IDF; ArrayExpress; References, Index & Annotation; Raw data; Experiment metadata; Pipeline tools (tophat, bowtie, bwa, cufflinks, samtools); ExpressionSet; Quality reports; User Project Storage; ENA]

ArrayExpress two databases

Expression Atlas: Experiment selection criteria
The criteria we use for selecting experiments for inclusion in the Atlas are as follows:
- For microarray-based experiments, array designs must be provided to enable re-annotation using Ensembl or UniProt (or have the potential for this to be done)
- High MIAME/MINSEQE scores
- Experiment must have 6 or more assays
- Sufficient replication and large sample size
- Experimental factors (EF) and experimental factor values (EFV) must be well annotated
- Adequate sample annotation must be provided
- Processed data must be provided, or raw data which can be renormalized must be available

Expression Atlas: Atlas construction
- Data is taken as normalized by the submitter
- Gene-wise linear models (limma) and t-statistics are applied to identify the differentially expressed genes across all biological conditions, in all the experiments
- The result is a two-dimensional matrix where rows correspond to genes and columns correspond to biological conditions
- The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression
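The gene-by-condition matrix of signed p-values can be pictured with a small sketch. The gene names, conditions, values and the 0.05 cutoff below are all invented for illustration, and the nested dictionary is only one possible representation; it is not how the Atlas stores or queries its data.

```python
# Rows are genes, columns are biological conditions.
# Each entry is (sign, p_value): +1 = up-regulated, -1 = down-regulated.
ATLAS_MATRIX = {
    "gene_a": {"liver": (+1, 0.001), "brain": (-1, 0.030)},
    "gene_b": {"liver": (-1, 0.200), "brain": (+1, 0.004)},
    "gene_c": {"liver": (+1, 0.020), "brain": (+1, 0.600)},
}


def genes_for_condition(matrix, condition, direction, p_cutoff=0.05):
    """Return (gene, p_value) pairs significant in `condition` with the given
    direction (+1 up, -1 down), most significant first."""
    hits = []
    for gene, row in matrix.items():
        entry = row.get(condition)
        if entry is None:
            continue  # gene not measured in this condition
        sign, p_value = entry
        if sign == direction and p_value <= p_cutoff:
            hits.append((gene, p_value))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    print(genes_for_condition(ATLAS_MATRIX, "liver", +1))
    print(genes_for_condition(ATLAS_MATRIX, "brain", -1))
```

Keeping the sign alongside the p-value is what lets a query be restricted by direction of differential expression as well as by significance.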

Gene Expression Atlas: Atlas construction

Gene Expression Atlas

Atlas home page (http://www.ebi.ac.uk/gxa/)
- Query for genes
- Restrict query by direction of differential expression
- Query for conditions
- The advanced query option allows building more complex queries

Atlas gene summary page

Atlas heatmap view

Atlas experiment page

View of RNA-seq data in Ensembl

Atlas gene-condition query

Data submission to AE

Submission of HTS gene expression data
Submit via the MAGE-TAB submission route. Submit:
- A MAGE-TAB spreadsheet containing details of the samples and protocols used
- Trace data files for each sample (in SRF, FASTQ or SFF format)
- Processed data files
For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA).
If you have human identifiable sequencing data, you need to submit to the European Genome-phenome Archive and not ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely.

What happens after submission?
- Email confirmation
- Curation: the curation team will review your submission and will email you with any questions
- Possible reopening for editing
- We will send you an accession number when all the required information has been provided
- We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public

To find out more
Email questions regarding ArrayExpressHTS to:
- Angela Goncalves, filimon@ebi.ac.uk
- Andrew Tikhonov, andrew@ebi.ac.uk
Read more at:
- Goncalves et al. (2011). A pipeline for RNA-seq data processing and quality assessment. http://www.ncbi.nlm.nih.gov/pubmed/21233166
- http://www.bioconductor.org/packages/2.9/bioc/html/arrayexpresshts.html
- R-cloud: http://www.ebi.ac.uk/tools/rcloud/
- E-learning courses: http://www.ebi.ac.uk/training/online/