de.nbi and its Galaxy interface for RNA-Seq

Jörg Fallmann
Institute for Bioinformatics, University of Leipzig
Thanks to Björn Grüning (RBC-Freiburg) and Sarah Diehl (MPI-Freiburg)
http://www.bioinf.uni-leipzig.de/ fall/summerschool2016.pdf
30.09.2016

Deutsches Netzwerk für Bioinformatik Infrastruktur (de.nbi)
The German Network for Bioinformatics Infrastructure provides comprehensive first-class bioinformatics services to users in life sciences research, industry and medicine. The de.nbi program coordinates bioinformatics training and education and the cooperation of the German bioinformatics community with international bioinformatics network structures.

de.nbi Structure

The RBC - RNA Bioinformatic Center
Masterminds:
- Peter F. Stadler (Leipzig)
- Rolf Backofen (Freiburg)
- Uwe Ohler (Berlin)
- Nikolaus Rajewsky (Berlin)

Purpose
- Central contact point for RNA bioinformatics
- Offering support
- Maintaining software/databases
- Documentation
- Workshops/training

Aims
- Raise awareness of RNA-based regulation
- Make RNA tools accessible
- Integrate RNA tools into NGS pipelines
- Training, workshops, support

The RBC - Tools
tRNAdb, cERMIT, GraphProt, DoRiNA, antaRNA, PARalyzer, ViennaRNA, CRISPRmap, Mummie, miRDeep2, ExpaRNA-P, RNAz, RNAsnoop, CopraRNA, Snoreport, PicTar2

The RBC - de.nbi interactome

Bridging tools and users
- Make tools available to users
- Move tools to the data
- As little installation as possible
- Scalable
- Reproducible
- Transparent

GALAXY
https://galaxyproject.org/
Data intensive biology for everyone
Galaxy is an open, web-based platform for data-intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses.

RNA-Seq in PubMed
More and more people do NGS experiments. Who does the analysis?
[Figure: number of publications per year, 2010 to 2015, by NGS application (ChIP, CLIP, Epigenome, RNA, Single-cell, sRNA)]

Tools to Data
Hands-on: RNA-Seq analysis with Galaxy

Freiburg Galaxy Instance
http://galaxy.uni-freiburg.de
- Log in with your username and password.
- For this course you get guest accounts.
- If you want to give it a try with your own data later, just register at galaxy@informatik.uni-freiburg.de

[Screenshot: the Galaxy interface. Labels: Upload data; Click category name to expand; Click tool name to use; History options; dataset]

Workflow: Galaxy-Training
Due to time constraints, we will skip some parts, but you are very welcome to try them at home, or with your own datasets once you have created an account. For now we will use RNA-seq data from the study by Brooks et al. 2011, in which the pasilla gene in Drosophila melanogaster was depleted by RNAi and the effects on splicing events were analysed by RNA-seq. The data are available at NCBI Gene Expression Omnibus (GEO) under accession number GSE18508.

Step 1: Inspecting the FASTQ files
- Create a new history for this RNA-seq exercise.
- Import a FASTQ file pair from Zenodo: Fastq1, Fastq2
- Load them into Galaxy: right-click → copy link location, then paste the link in Galaxy: Get Data → Upload File from your computer → paste/fetch data → Start
(Recommended: select the correct file type (fastqsanger) and genome (dm3) directly in the upload dialogue. Many downstream programs require this information. In the upload dialogue you can assign the correct settings for all uploaded files at once!)
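To get a feel for what a FASTQ file contains before handing it to Galaxy, the four-line record layout can be read with a few lines of Python. This is only a minimal sketch: it assumes plain (uncompressed) text with exactly four lines per record, and the function names and the two example reads are made up for illustration.

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


def read_length_histogram(handle):
    """Count how many reads have each length."""
    return Counter(len(sequence) for _, sequence, _ in read_fastq(handle))


# Two invented reads, only to show the call.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTA\n+\nIIIII\n")
print(read_length_histogram(example))
```

For a real file, pass an open file handle instead of the in-memory example.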

[Screenshot: datasets in the history. Labels: download; more info; delete; edit attributes; display in main frame; rerun tool; visualise; links to display in genome browser; preview; Click dataset name to expand]

Both files contain the first 100,000 paired-end reads of one untreated sample.
- Run the tool FastQC on one of the two FASTQ files to check the quality of the reads.
- What is the read length? Is there anything you find striking?
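One of the things FastQC reports is base quality along the read. The idea can be sketched as follows, assuming Sanger-style quality strings (ASCII code minus 33, which is what the fastqsanger type denotes). The function names are hypothetical, and this is not how FastQC itself is implemented.

```python
def phred33_scores(quality_string):
    """Decode a Sanger (Phred+33) quality string into integer scores."""
    return [ord(character) - 33 for character in quality_string]


def mean_quality_per_position(quality_strings):
    """Average the Phred score at each read position over all reads."""
    totals, counts = [], []
    for quality_string in quality_strings:
        for position, score in enumerate(phred33_scores(quality_string)):
            if position == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[position] += score
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


print(mean_quality_per_position(["IIII", "II55"]))
```

A drop of the mean quality towards the 3' end of the reads is the typical pattern that motivates trimming.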

[Screenshot: dataset states. Labels: Waiting to be run; Running; Successfully finished; Failed; Send bug report]

Trim low-quality bases from the 3' end using Trim Galore on both paired-end datasets.
- To use Trim Galore, make sure that the file type is set to fastqsanger (not fastq). Change it if necessary: click on the pencil button displayed in your dataset in the history, choose Datatype → select fastqsanger → Save.
- Re-run FastQC and inspect the differences.
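Trim Galore is a wrapper around Cutadapt, whose documentation describes a running-sum rule for 3' quality trimming. The sketch below follows that description in simplified form. It is not the code of either tool, the function names are invented, and the cutoff of 20 is only an assumed default.

```python
def quality_trim_index(qualities, cutoff):
    """Return the position at which to cut the 3' end of a read.

    Walking in from the 3' end, sum (cutoff - quality) and remember the
    position where that running sum is largest. Stop once the sum turns
    negative, because everything further 5' is good enough to keep.
    """
    running_sum = 0
    best_sum = 0
    stop = len(qualities)
    for i in reversed(range(len(qualities))):
        running_sum += cutoff - qualities[i]
        if running_sum < 0:
            break
        if running_sum > best_sum:
            best_sum = running_sum
            stop = i
    return stop


def trim_read(sequence, quality_string, cutoff=20):
    """Trim a read with a Phred+33 quality string at the computed position."""
    scores = [ord(character) - 33 for character in quality_string]
    stop = quality_trim_index(scores, cutoff)
    return sequence[:stop], quality_string[:stop]


print(trim_read("ACGTAC", "IIII##"))
```

Because the rule works on a running sum rather than on single bases, one good base surrounded by poor ones near the 3' end does not stop the trimming.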

Step 2: Mapping of the reads with TopHat (version 2)
- Import the Ensembl gene annotation for Drosophila melanogaster: Drosophila_melanogaster.BDGP5.78.gtf
- Right-click → copy link location, then paste the link in Galaxy: Upload File from your computer → paste/fetch data → Start

TopHat Parameters
TopHat needs information about the type of quality scores in the FASTQ files. The most common type nowadays is fastqsanger, which signals Sanger-scaled quality scores; these are also used by the current generation of Illumina high-throughput sequencers. Make sure that the type is set correctly.
TopHat also needs to know two important parameters about the sequencing library:
1) the strandedness: unstranded or stranded (if stranded, there are several types), and
2) the inner distance between the two reads for paired-end data.
This information should usually come with your FASTQ files! If not, try to find it on the site where you downloaded the data or in the corresponding publication.
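The inner distance is often not reported directly, but it can be derived from the fragment size and the read length. Here is a small sketch of that arithmetic; the function names are hypothetical, and the numbers follow the example in the TopHat manual (300 bp fragments sequenced with 50 bp from each end).

```python
def mean_inner_distance(mean_fragment_length, read_length):
    """Inner distance = fragment length minus the two sequenced read ends."""
    return mean_fragment_length - 2 * read_length


def implied_fragment_length(inner_distance, read_length):
    """The same relation solved for the fragment length."""
    return inner_distance + 2 * read_length


print(mean_inner_distance(300, 50))
```

The value can be negative when the two reads of a pair overlap, that is, when the fragment is shorter than twice the read length.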

Mapping
Run TopHat with the full parameter set for best mapping results:
- Use paired-end (as individual datasets) and specify the FASTQ files
- Set Mean Inner Distance to 112
- Select the built-in reference genome Drosophila melanogaster dm3
- Set TopHat settings to use: Full parameter list
- Set the correct library type: FR First Strand
- Supply own junction data: Yes; Use Gene Annotation Model: Yes; and select the appropriate Gene Model Annotations: Drosophila_melanogaster.BDGP5.78.gtf
- Enable coverage-based search for junctions: Yes (--coverage-search) to increase sensitivity
TopHat splits reads into segments to map reads across splice junctions. The default minimum length of read segments is 25; would 18 not be more appropriate here?
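For readers who later want to run the same mapping outside Galaxy, the form settings above correspond roughly to TopHat 2 command-line options. The sketch below only assembles such a command as a list and does not run it. The Bowtie index name, file names and output directory are placeholders, and the command that the Galaxy wrapper actually issues may differ.

```python
def build_tophat_command(reads_1, reads_2, bowtie_index, gtf,
                         inner_distance=112, library_type="fr-firststrand",
                         segment_length=None, output_dir="tophat_out"):
    """Assemble a TopHat 2 call that mirrors the Galaxy form settings."""
    command = [
        "tophat2",
        "-o", output_dir,
        "-r", str(inner_distance),       # mean inner distance between mates
        "--library-type", library_type,  # strandedness of the library
        "-G", gtf,                       # gene annotation model
        "--coverage-search",             # coverage-based junction search
    ]
    if segment_length is not None:
        command += ["--segment-length", str(segment_length)]
    command += [bowtie_index, reads_1, reads_2]
    return command


# Placeholder names, for illustration only.
command = build_tophat_command("reads_1.fastq", "reads_2.fastq", "dm3",
                               "Drosophila_melanogaster.BDGP5.78.gtf",
                               segment_length=18)
print(" ".join(command))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` later without worrying about shell quoting.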

Step 3: Inspecting TopHat results
- TopHat returns a BAM file with the mapped reads and three BED files containing splice junctions, insertions and deletions.
- However, these example datasets are too small to give you a good impression of real data.
- Therefore, import 4 files from the TopHat output, restricted to chr4, into your history:
  GSM461177_untreat_paired_chr4.bam
  GSM461177_untreat_paired_deletions_chr4.bed
  GSM461177_untreat_paired_insertions_chr4.bed
  GSM461177_untreat_paired_junctions_chr4.bed
- You may have to change the data type from tabular to bed (use the pencil button).
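The junctions file can also be inspected programmatically. In TopHat's junction BED output, each junction is written as two blocks (the read overhangs on either side of the intron) and the score column holds the number of spanning alignments. The following sketch recovers the intron coordinates under that assumption; the example line is invented and the function names are hypothetical.

```python
def parse_junction_line(line):
    """Turn one 12-column junction BED line into intron coordinates (0-based, half-open)."""
    fields = line.rstrip("\n").split("\t")
    chrom, start = fields[0], int(fields[1])
    score, strand = int(fields[4]), fields[5]
    block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    return {
        "chrom": chrom,
        "intron_start": start + block_sizes[0],  # end of the left overhang
        "intron_end": start + block_starts[1],   # start of the right overhang
        "strand": strand,
        "reads": score,
    }


def read_junctions(lines):
    """Parse all junction lines, skipping the 'track' header and blank lines."""
    return [parse_junction_line(line) for line in lines
            if line.strip() and not line.startswith("track")]


example = [
    "track name=junctions\n",
    "chr4\t1000\t1500\tJUNC00000001\t7\t+\t1000\t1500\t255,0,0\t2\t30,20\t0,480\n",
]
print(read_junctions(example))
```

Sorting the result by the `reads` value is a quick way to find the best-supported junctions before looking at them in a genome browser.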

Visualise mapping files with IGV
- Open the dataset and click on display with IGV: web current
- Open the file with a Java plugin (e.g. IcedTea)
- Go to View → Preferences → Alignments and set the visibility range to >= 50 kb
- Inspect the region on chr4 between 560 kb and 600 kb: copy chr4:560000-600000 into the locus window and click GO
- Now import the BED output into IGV: open the dataset and click on display with IGV: local
- Inspect the results using a Sashimi plot (right-click on the BAM file → select Sashimi Plot from the context menu)

Reproducibility and Transparency
Save your workflow:
- Click on History Options (the little gearwheel on top of your history)
- Choose History Actions → Extract Workflow
- Annotate your workflow and save
- Go to the Workflow section and have a look

Users, users, users
Would you be so kind as to fill out a short survey?
de.nbi summerschool 11/2016 survey

Contact
Jörg Fallmann, fall@bioinf.uni-leipzig.de, http://www.bioinf.uni-leipzig.de
Björn Grüning, bjoern.gruening@gmail.com, http://www.bioinf.uni-freiburg.de

Still time left? Analysing Differential Gene Expression with DESeq2
Proceed from Step 5: Count the number of reads per annotated gene with htseq-count
- htseq-count can be used to count reads per feature in different samples.
- It expects a BAM file as input.
- In case of paired-end reads, the alignments in the BAM file should be sorted by read name.
- Use the tool Sort from NGS: SAM Tools to sort the paired-end BAM file. Sort by read names.
- We need a GFF/GTF file with feature (i.e. gene) annotations: Drosophila_melanogaster.BDGP5.78.gtf
- Apply the tool htseq-count to all samples: select the Drosophila_melanogaster.BDGP5.78.gtf file as feature file, use the Union mode for reads overlapping more than one feature, and set the Minimum Alignment Quality to 10.
- Inspect the result files.
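The Union mode and the minimum alignment quality can be illustrated with a toy counter. This is a deliberately simplified sketch and not htseq-count itself: it assumes a single chromosome, unspliced single-end reads, no strand information and half-open coordinates. The gene names and reads are invented; the special counter names are the ones htseq-count writes at the end of its output.

```python
from collections import Counter


def genes_overlapping(read_start, read_end, genes):
    """Return the set of genes whose interval overlaps the read."""
    return {name for name, (gene_start, gene_end) in genes.items()
            if read_start < gene_end and gene_start < read_end}


def count_union(reads, genes, min_quality=10):
    """Count reads per gene the way the union mode decides between hits."""
    counts = Counter()
    for start, end, mapping_quality in reads:
        if mapping_quality < min_quality:
            counts["__too_low_aQual"] += 1
            continue
        hits = genes_overlapping(start, end, genes)
        if not hits:
            counts["__no_feature"] += 1
        elif len(hits) == 1:
            counts[next(iter(hits))] += 1
        else:
            counts["__ambiguous"] += 1  # overlaps more than one gene
    return counts


genes = {"geneA": (100, 200), "geneB": (150, 300)}
reads = [(110, 140, 50), (160, 190, 50), (400, 430, 50), (250, 280, 5), (210, 240, 30)]
print(count_union(reads, genes))
```

Reads that overlap more than one gene are set aside rather than assigned to either, which is why overlapping annotations can lower the counts of both genes.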

THEN: We counted only reads that mapped to chr4. To get more meaningful results, import the 3 treated and 4 untreated count files from Zenodo (as type tabular!):
GSM461176_untreat_single.counts
GSM461177_untreat_paired.counts
GSM461178_untreat_paired.counts
GSM461179_treat_single.counts
GSM461180_treat_paired.counts
GSM461181_treat_paired.counts
GSM461182_untreat_single.counts
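The file names in this list appear to encode both the treatment and the library type, so a sample table can be derived from them instead of being typed by hand. The sketch below assumes the names follow the pattern accession, "treat" or "untreat", "single" or "paired", joined by underscores. The pattern and the function name are illustrative assumptions.

```python
import re

# Assumed naming scheme: <GSM accession>_<untreat|treat>_<single|paired>.counts
FILENAME_PATTERN = re.compile(r"^(GSM\d+)_(untreat|treat)_(single|paired)\.counts$")


def describe_sample(filename):
    """Derive condition and sequencing type from a count file name."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    accession, treatment, layout = match.groups()
    return {
        "sample": accession,
        "condition": "treated" if treatment == "treat" else "untreated",
        "sequencing": "PE" if layout == "paired" else "SE",
    }


files = [
    "GSM461176_untreat_single.counts",
    "GSM461177_untreat_paired.counts",
    "GSM461178_untreat_paired.counts",
    "GSM461179_treat_single.counts",
    "GSM461180_treat_paired.counts",
    "GSM461181_treat_paired.counts",
    "GSM461182_untreat_single.counts",
]
for row in map(describe_sample, files):
    print(row)
```

A table like this is a convenient checklist when assigning files to factor levels in a differential expression tool.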

Run DESeq2 using the count files as input.
- In addition to the first factor, condition, with the levels treated and untreated, please add a second factor, sequencing, with the levels PE and SE.
- Choose the corresponding count files for each factor and level. The file names contain all the information needed.
- The file with the independent filtering results should be used for further downstream analysis, as it excludes genes with only a few read counts; these genes will not be called significantly differentially expressed.
- Filter the DESeq2 result file for all genes with an adjusted p-value of 0.05 or below (Filter tool: condition c7<=0.05). Please note that the output is already sorted by adjusted p-value.
- Similarly, separate the up- and down-regulated genes (the 3rd column contains the fold changes).
- Select the first 100 lines of the data set.
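The filtering steps can also be written out in a few lines of Python. This is a sketch under assumptions: the result table is tab-separated without a header, column 7 holds the adjusted p-value as in the Filter condition, and column 3 holds fold changes on a log2 scale, so that the sign separates up from down. The rows and function names below are invented.

```python
def filter_significant(rows, padj_column=7, threshold=0.05):
    """Keep rows whose adjusted p-value (1-based column) is <= threshold."""
    kept = []
    for row in rows:
        try:
            padj = float(row[padj_column - 1])
        except ValueError:
            continue  # e.g. 'NA' for genes removed by independent filtering
        if padj <= threshold:
            kept.append(row)
    return kept


def split_by_direction(rows, fold_change_column=3):
    """Split rows into up- and down-regulated by the sign of the log2 fold change."""
    up = [row for row in rows if float(row[fold_change_column - 1]) > 0]
    down = [row for row in rows if float(row[fold_change_column - 1]) < 0]
    return up, down


# Invented rows: id, base mean, log2 fold change, std. error, statistic, p-value, adjusted p-value
table = [
    ["FBgn1", "100", "2.0", "0.1", "5", "1e-6", "1e-4"],
    ["FBgn2", "50", "-1.5", "0.2", "-4", "1e-4", "0.01"],
    ["FBgn3", "10", "0.1", "0.3", "0.3", "0.7", "0.9"],
    ["FBgn4", "1", "0.0", "NA", "NA", "NA", "NA"],
]
significant = filter_significant(table)
up, down = split_by_direction(significant)
top_100 = significant[:100]
print(len(significant), len(up), len(down))
```

Since the table is already sorted by adjusted p-value, taking the first 100 rows after filtering yields the most significant genes.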

Step 7: Functional enrichment among differentially expressed genes
- Use the adjusted p-value filtered data from Step 6 as the input data set for DAVID.
- The identifiers in the first column are FlyBase gene IDs.
- The output of the DAVID tool is an HTML file with a link to the DAVID website.
- There you can, for example, analyse clusters of functional enrichment.
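What "enrichment" means statistically can be sketched with a plain hypergeometric test: how likely is it to see at least k genes of a category among n selected genes, if K of all N genes carry that annotation? DAVID reports its own modified Fisher exact statistic (the EASE score), so its numbers will differ. The function name and the small numbers below are illustrative only.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K are annotated."""
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, min(n, K) + 1))
    return favourable / total


# Toy numbers: 20 genes in total, 5 annotated, 5 selected, 3 of them annotated.
print(hypergeom_upper_tail(k=3, n=5, K=5, N=20))
```

In a real analysis the background N should be the set of genes that could have been detected, and the resulting p-values need correction for the many categories tested.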