How to store and visualize RNA-seq data

Gabriella Rustici, Functional Genomics Group, gabry@ebi.ac.uk
EBI is an Outstation of the European Molecular Biology Laboratory.

Talk summary
- How we archive RNA-seq data in ArrayExpress
- How we process RNA-seq data
- How we display RNA-seq data in the Expression Atlas

Components of a functional genomics experiment

ArrayExpress (www.ebi.ac.uk/arrayexpress/)
- Is a public repository for functional genomics data, mostly generated using microarray or high-throughput sequencing (HTS) assays
- Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ
- Provides easy access to well-annotated data in a structured and standardized format
- Facilitates the sharing of microarray designs, experimental protocols, ...
- Is based on community standards: MIAME guidelines and MAGE-TAB format for microarray data, MINSEQE guidelines for HTS data (http://www.mged.org/minseqe/)

Standards for sequencing: MINSEQE guidelines
MINSEQE: Minimal Information about a high-throughput Nucleotide SEQuencing Experiment
The proposed MINSEQE guidelines (still a work in progress) are:
1. General information about the experiment
2. Essential sample annotation, including experimental factors and their values (e.g. compound and dose)
3. Experimental design, including sample-data relationships (e.g. which raw data file relates to which sample, ...)
4. Essential experimental and data processing protocols
5. Sequence read data with quality scores, raw intensities and processing parameters for the instrument
6. Final processed data for the set of assays in the experiment

Standards for microarray & sequencing: MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment. We adapted it to handle HTS data:
- IDF: Investigation Description Format file. Contains top-level information about the experiment, including title, description, submitter contact details and protocols.
- SDRF: Sample and Data Relationship Format file. Contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.
- Data files: raw and processed data files. The raw data files are the trace data files (.srf or .sff). FASTQ format files are also accepted, but SRF format files are preferred. The trace data files that you submit to ArrayExpress will be stored in the European Nucleotide Archive (ENA). The processed data file is a data matrix file containing processed values, e.g. files in which the expression values are linked to genome coordinates.
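Since the SDRF is a tab-delimited table that links samples to data files, a small sketch of reading one may help. This is an illustration in Python, not part of ArrayExpress tooling: the two-row SDRF fragment is invented, and using "Comment[FASTQ_URI]" as the raw-data column is an assumption of this sketch.

```python
import csv
import io

# Invented two-sample SDRF fragment, for illustration only.
# "Source Name" and "Factor Value[...]" follow MAGE-TAB column naming;
# the choice of "Comment[FASTQ_URI]" as the raw-data column is an assumption.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[compound]\tComment[FASTQ_URI]\n"
    "sample_1\tHomo sapiens\tnone\tsample_1.fastq.gz\n"
    "sample_2\tHomo sapiens\tdrug_A\tsample_2.fastq.gz\n"
)


def read_sdrf(handle):
    """Return one dict per SDRF row, keyed by column header."""
    return list(csv.DictReader(handle, delimiter="\t"))


def files_by_factor(rows, factor_column, file_column):
    """Group raw data file names by the value of one experimental factor."""
    groups = {}
    for row in rows:
        groups.setdefault(row[factor_column], []).append(row[file_column])
    return groups


rows = read_sdrf(io.StringIO(SDRF_TEXT))
groups = files_by_factor(rows, "Factor Value[compound]", "Comment[FASTQ_URI]")
for value, files in groups.items():
    print(value, files)
```

A real SDRF may repeat column headers (for example several protocol columns), which a dictionary-based reader would collapse, so a production parser would need to handle that case.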

Types of data that can be submitted

ArrayExpress: two databases

What is the difference between Archive and Atlas?
Archive
- Query by experiment, sample and experimental factor annotations
- Filter on species, array platform, molecule assayed and technology used
Atlas
- Gene and/or condition queries
- Query across experiments and across platforms

ArrayExpress: two databases

How much data in AE Archive?

Browsing the AE Archive

Browsing the AE Archive: labelled elements of the results page
- AE unique experiment ID
- Curated title of experiment
- Number of assays
- Species investigated
- The date when the data were loaded in the Archive
- "Loaded in Atlas" flag
- Raw sequencing data available in ENA
- The direct link to raw and processed data. An icon indicates that this type of data is available.
- The total number of experiments and assays retrieved
- The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed

Browsing the AE Archive

RNA-seq data in AE Archive

HTS data in AE Archive (06.09.2011)

HTS data in AE Archive

Link to raw data in ENA

RNA-seq processing pipeline
Diagram labels: direct data submissions and GEO import; short reads (FASTQ files); summary-level data; ArrayExpress Archive; data acquisition; FASTQ files; SDRF; RNA-seq processing pipeline; Expression Atlas; RPKMs; BAMs; EGA; ENA; Ensembl.
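The diagram lists RPKMs (reads per kilobase per million mapped reads) among the pipeline outputs. As a reminder of what that unit means, here is a minimal Python sketch of the standard formula; the gene names, counts and lengths are invented, and this is not the ArrayExpressHTS implementation.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Invented counts for three hypothetical genes in one sample.
counts = {"geneA": 500, "geneB": 1000, "geneC": 0}
lengths_bp = {"geneA": 2000, "geneB": 1000, "geneC": 1500}
total_mapped = 10_000_000

values = {gene: rpkm(counts[gene], lengths_bp[gene], total_mapped) for gene in counts}
print(values)
```

Dividing by gene length and by library size is what makes values comparable between genes of different lengths and between samples sequenced to different depths.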

RNA-seq processing pipeline: ArrayExpressHTS
ArrayExpressHTS is an R-based pipeline for pre-processing, expression estimation and data quality assessment of RNA-seq datasets.
The pipeline can be used for analyzing:
- private data
- public data, available through ArrayExpress and ENA
It can be used:
- on a local computer
- remotely on the EBI R Cloud, www.ebi.ac.uk/tools/rcloud
Goncalves et al., Bioinformatics 2011

ArrayExpressHTS in Bioconductor

ArrayExpressHTS pipeline
- Reference: transcriptome or genome
- Alignment: Bowtie, BWA or TopHat
- Filtering options (e.g. average base quality, read complexity, ...)
- Expression estimation: cufflinks or MMSEQ
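To make the "average base quality" filtering option concrete, the following Python sketch keeps only reads whose mean Phred score reaches a threshold. It is an illustration rather than the pipeline's own filter: the three reads are invented, the threshold of 20 is arbitrary, and a Phred+33 quality encoding is assumed.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def mean_quality(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_by_mean_quality(records, threshold):
    """Keep records whose mean base quality is at least `threshold`."""
    return [rec for rec in records if rec[2] and mean_quality(rec[2]) >= threshold]


# Invented reads: 'I' encodes Phred 40 and '!' encodes Phred 0 under Phred+33.
FASTQ_TEXT = (
    "@read_1\nACGT\n+\nIIII\n"
    "@read_2\nACGT\n+\n!!!!\n"
    "@read_3\nACGT\n+\nII!!\n"
)

kept = filter_by_mean_quality(read_fastq(io.StringIO(FASTQ_TEXT)), threshold=20)
print([header for header, _, _ in kept])
```

Older Illumina data may use a different quality offset, so the `offset` argument would need to match the instrument output.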

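Using ArrayExpressHTS
    # Package names are case-sensitive in R
    library("ArrayExpressHTS")
    # Process a public ArrayExpress accession without using the EBI R Cloud
    aehts <- ArrayExpressHTS("E-GEOD-16190", usercloud = FALSE)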

ArrayExpressHTS on the R cloud
Diagram labels:
- R-cloud: R-server, R-server, R-server
- ArrayExpressHTS R package
- SDRF, IDF
- ArrayExpress
- References, Index & Annotation
- Raw data, experiment metadata
- Pipeline tools: tophat, bowtie, bwa, cufflinks, samtools
- ExpressionSet, quality reports
- User Project Storage
- ENA

RNA-seq processing pipeline (same diagram as shown earlier)

ArrayExpress: two databases

Expression Atlas: experiment selection criteria
The criteria we use for selecting experiments for inclusion in the Atlas are as follows:
- For microarray-based experiments, array designs must be provided to enable re-annotation using Ensembl or UniProt (or have the potential for this to be done)
- High MIAME/MINSEQE scores
- Experiment must have 6 or more assays
- Sufficient replication and large sample size
- EF and EFV must be well annotated
- Adequate sample annotation must be provided
- Processed data must be provided, or raw data which can be renormalized must be available

Expression Atlas: Atlas construction
- Data is taken as normalized by the submitter
- Gene-wise linear models (limma) and t-statistics are applied to identify the differentially expressed genes across all biological conditions, in all the experiments
- The result is a two-dimensional matrix where rows correspond to genes and columns correspond to biological conditions
- The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression
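The genes-by-conditions matrix of signed p-values can be pictured with a small Python sketch. It only shows the data structure and a query against it, not the limma model fitting: the test results are invented, and the 0.05 cutoff and the function names are assumptions of this example.

```python
# Invented per-gene, per-condition results: (gene, condition, t-statistic, p-value).
# In the Atlas these would come from the gene-wise linear models; here they are made up.
results = [
    ("geneA", "liver", 4.2, 0.001),
    ("geneA", "brain", -0.3, 0.770),
    ("geneB", "liver", -3.1, 0.008),
    ("geneB", "brain", 5.0, 0.0004),
]


def build_signed_matrix(records):
    """Return (genes, conditions, matrix) with matrix[gene][condition] = (sign, p_value)."""
    genes = sorted({rec[0] for rec in records})
    conditions = sorted({rec[1] for rec in records})
    matrix = {gene: {cond: None for cond in conditions} for gene in genes}
    for gene, condition, t_stat, p_value in records:
        sign = (t_stat > 0) - (t_stat < 0)  # +1 up, -1 down, 0 if exactly zero
        matrix[gene][condition] = (sign, p_value)
    return genes, conditions, matrix


def differentially_expressed(matrix, condition, direction, alpha=0.05):
    """Genes significant in `condition` in the given direction (+1 up, -1 down)."""
    hits = []
    for gene, row in matrix.items():
        entry = row.get(condition)
        if entry is not None and entry[0] == direction and entry[1] < alpha:
            hits.append(gene)
    return sorted(hits)


genes, conditions, matrix = build_signed_matrix(results)
up_in_liver = differentially_expressed(matrix, "liver", direction=1)
print(up_in_liver)
```

Storing the sign next to the p-value is what lets a query be restricted by direction of differential expression, as on the Atlas home page.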

Gene Expression Atlas: Atlas construction

Gene Expression Atlas

Atlas home page (http://www.ebi.ac.uk/gxa/)
- Query for genes
- Restrict query by direction of differential expression
- Query for conditions
- The advanced query option allows building more complex queries

Atlas gene summary page

Atlas heatmap view

Atlas experiment page

View of RNA-seq data in Ensembl

Atlas gene-condition query

Data submission to AE

Submission of HTS gene expression data
Submit via the MAGE-TAB submission route. Submit:
- A MAGE-TAB spreadsheet containing details of the samples and protocols used
- Trace data files for each sample (in SRF, FASTQ or SFF format)
- Processed data files
For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA).
If you have human identifiable sequencing data, you need to submit to the European Genome-phenome Archive (EGA) and not ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely.

What happens after submission?
- Email confirmation
- Curation: the curation team will review your submission and will email you with any questions.
- Possible reopening for editing
- We will send you an accession number when all the required information has been provided.
- We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public.

To find out more
Email questions regarding ArrayExpressHTS to:
- Angela Goncalves, filimon@ebi.ac.uk
- Andrew Tikhonov, andrew@ebi.ac.uk
Read more at:
- Goncalves et al. (2011). A pipeline for RNA-seq data processing and quality assessment. http://www.ncbi.nlm.nih.gov/pubmed/21233166
- http://www.bioconductor.org/packages/2.9/bioc/html/arrayexpresshts.html
- R-cloud: http://www.ebi.ac.uk/tools/rcloud/
- eLearning courses: http://www.ebi.ac.uk/training/online/