de.nbi and its Galaxy interface for RNA-Seq
Jörg Fallmann
Institute for Bioinformatics, University of Leipzig
Thanks to Björn Grüning (RBC-Freiburg) and Sarah Diehl (MPI-Freiburg)
http://www.bioinf.uni-leipzig.de/ fall/summerschool2016.pdf
30.09.2016
Deutsches Netzwerk für Bioinformatik Infrastruktur (de.nbi)
The German Network for Bioinformatics Infrastructure provides comprehensive, first-class bioinformatics services to users in life sciences research, industry and medicine. The de.nbi program coordinates bioinformatics training and education, as well as the cooperation of the German bioinformatics community with international bioinformatics network structures.
de.nbi Structure
The RBC - RNA Bioinformatics Center
Masterminds:
- Peter F. Stadler (Leipzig)
- Rolf Backofen (Freiburg)
- Uwe Ohler (Berlin)
- Nikolaus Rajewsky (Berlin)
Purpose
Central Contact Point for RNA Bioinformatics
- Offering Support
- Maintaining Software/Databases
- Documentation
- Workshops/Training
Aims
- Raise awareness of RNA-based regulation
- Make RNA tools accessible
- Integrate RNA tools into NGS pipelines
- Training, workshops, support
The RBC - Tools
trnadb, cermit, GraphProt, DoRiNA, antarna, PARalyzer, ViennaRNA, CRISPRmap, Mummie, mirdeep2, ExpaRNA-P, RNAz, RNAsnoop, CopraRNA, Snoreport, PicTar2
The RBC - de.nbi interactome
Bridging tools and users
- Make tools available to users
- Move tools to data
- As little installation as possible
- Scalable
- Reproducible
- Transparent
GALAXY
https://galaxyproject.org/
Data intensive biology for everyone
- Galaxy is an open, web-based platform for data intensive biomedical research.
- Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses.
RNA-Seq in PubMed
More and more people do NGS experiments. Who does the analysis?
[Figure: number of publications per year, 2010 to 2015, by NGS application (ChIP, CLIP, Epigenome, RNA, Singlecell, srna)]
Tools to Data
Hands-on: RNA-Seq analysis with Galaxy
Freiburg Galaxy Instance
http://galaxy.uni-freiburg.de
- Log in with your username and password.
- For this course you get guest accounts.
- If you want to give it a try with your own data later, just register at galaxy@informatik.uni-freiburg.de
[Screenshot: Galaxy main interface. Callouts: Upload data; Click category name to expand; Click tool name to use; History options; dataset]
Workflow: Galaxy-Training
Due to time constraints, we will skip some parts, but you are very welcome to try them at home, or with your own datasets once you have created an account.
For now we will use RNA-seq data from the study by Brooks et al. 2011, in which the pasilla gene in Drosophila melanogaster was depleted by RNAi and the effects on splicing events were analysed by RNA-seq. The data is available at NCBI Gene Expression Omnibus (GEO) under accession number GSE18508.
Step 1: Inspecting the FASTQ files
- Create a new history for this RNA-seq exercise.
- Import a FASTQ file pair from Zenodo: Fastq1, Fastq2
- Load them into Galaxy: right-click → copy link location, then paste the link in Galaxy under Get Data → Upload File from your computer → Paste/Fetch data → Start
- Recommended: select the correct file type (fastqsanger) and genome (dm3) directly in the upload dialogue. Many downstream programs require this information, and the upload dialogue lets you assign the correct settings to all uploaded files at once.
Datasets
[Screenshot: a dataset in the history. Callouts: Click dataset name to expand; preview; download; more info; delete; edit attributes; display in main frame; rerun tool; visualise; links to display in genome browser]
- Both files contain the first 100,000 paired-end reads of one untreated sample.
- Run the tool FastQC on one of the two FASTQ files to check the quality of the reads.
- What is the read length? Is there anything you find striking?
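FastQC reports the read length in its Basic Statistics module. As a cross-check outside Galaxy, the same number can be read directly from the file. The following is a minimal sketch, assuming an uncompressed FASTQ with exactly four lines per record; the function name and the two example records are made up for illustration.

```python
from collections import Counter


def read_length_counts(fastq_lines):
    """Count how often each read length occurs in an iterable of FASTQ lines."""
    counts = Counter()
    for line_number, line in enumerate(fastq_lines):
        # Each record has four lines; the sequence is the second one.
        if line_number % 4 == 1:
            counts[len(line.strip())] += 1
    return counts


# Two made-up records, for illustration only.
example = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "ACGTA", "+", "IIIII",
]
print(dict(read_length_counts(example)))
```

For a real file, pass an open file handle instead of the list, since a handle also yields one line at a time.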
Dataset states
- Waiting to be run
- Running
- Successfully finished
- Failed (Send bug report)
- Trim low-quality bases from the 3' end using Trim Galore on both paired-end datasets.
- To use Trim Galore, make sure that the file type is set to fastqsanger (not fastq). Change it if necessary: click the pencil button displayed on your dataset in the history, then choose Datatype → select fastqsanger → Save.
- Re-run FastQC and inspect the differences.
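To make the idea of 3' quality trimming concrete, here is a minimal sketch of a running-sum trimmer in the style described in the Cutadapt documentation (Trim Galore wraps Cutadapt). It is an illustration, not the tool's actual code: the function name, the cutoff of 20 and the example read are assumptions.

```python
def trim_3prime(sequence, quality, cutoff=20, offset=33):
    """Trim low-quality bases from the 3' end using a running sum."""
    scores = [ord(char) - offset for char in quality]
    running_sum = 0
    lowest_sum = 0
    cut_at = len(scores)
    # Walk from the 3' end and remember where the sum of (score - cutoff)
    # reaches its minimum; stop once the sum becomes positive.
    for index in range(len(scores) - 1, -1, -1):
        running_sum += scores[index] - cutoff
        if running_sum > 0:
            break
        if running_sum < lowest_sum:
            lowest_sum = running_sum
            cut_at = index
    return sequence[:cut_at], quality[:cut_at]


# Made-up read: five good bases ('I' = Q40) followed by three poor ones ('#' = Q2).
print(trim_3prime("ACGTACGT", "IIIII###"))  # ('ACGTA', 'IIIII')
```

The running sum lets a single good base inside a poor tail be trimmed along with it, which a simple "cut at the first good base" rule would not do.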
Step 2: Mapping of the reads with TopHat (version 2)
- Import the Ensembl gene annotation for Drosophila melanogaster: Drosophila_melanogaster.bdgp5.78.gtf
- Right-click → copy link location, then paste the link in Galaxy under Upload File from your computer → Paste/Fetch data → Start
TopHat Parameters
- TopHat needs information about the type of quality scores in the FASTQ files. The most common type nowadays is fastqsanger, signalling Sanger-scaled quality scores, which are also used by the current generation of Illumina high-throughput sequencers. Make sure that the type is set correctly.
- TopHat also needs to know two important parameters about the sequencing library: 1) the strandedness, unstranded or stranded (if stranded, there are several types), and 2) the inner distance between the two reads for paired-end data.
- This information should usually come with your FASTQ files! If not, try to find it on the site where you downloaded the data or in the corresponding publication.
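Both parameters come down to simple arithmetic. The sketch below decodes Sanger-scaled (Phred+33) quality characters and computes an inner distance as fragment length minus both read lengths. The function names are made up, and the 300 bp fragment with 50 bp reads is the illustrative case from the TopHat manual, not a property of the course data.

```python
def phred33_scores(quality_string):
    """Decode Sanger-scaled quality characters (ASCII code minus 33)."""
    return [ord(char) - 33 for char in quality_string]


def error_probability(phred_score):
    """Convert a Phred score Q to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)


def mean_inner_distance(fragment_length, read_length):
    """Inner distance between two mates: fragment length minus both reads."""
    return fragment_length - 2 * read_length


print(phred33_scores("I5!"))         # [40, 20, 0]
print(error_probability(20))         # 0.01, i.e. one error in 100 base calls
print(mean_inner_distance(300, 50))  # 200
```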
Mapping
Run TopHat with the full parameter set for best mapping results:
- Use paired-end (as individual datasets) and specify the FASTQ files
- Set Mean Inner Distance to 112
- Select the built-in reference genome Drosophila melanogaster dm3
- Set TopHat settings to use: Full parameter list
- Set the correct library type: FR First Strand
- Supply own junction data: Yes; Use Gene Annotation Model: Yes; select the appropriate Gene Model Annotations: Drosophila_melanogaster.bdgp5.78.gtf
- Enable coverage-based search for junctions: Yes (--coverage-search) to increase sensitivity
- TopHat splits reads into segments to map reads across splice junctions. The default minimum length of read segments is 25; doesn't 18 seem more appropriate?
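A rough way to approach the segment-length question is to count how many segments of the minimum length fit into one read, since a read that yields only a single segment cannot be split across a junction. This is a simplified illustration, not TopHat's actual algorithm. The read length of 37 is an assumed example value; use the length reported by FastQC for your data.

```python
def full_segments(read_length, segment_length):
    """Number of segments of at least segment_length that fit into one read."""
    return read_length // segment_length


read_length = 37  # assumed example value, check FastQC for the real one
for segment_length in (25, 18):
    print(segment_length, full_segments(read_length, segment_length))
# prints "25 1" and "18 2"
```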
Step 3: Inspecting TopHat results
- TopHat returns a BAM file with the mapped reads and three BED files containing splice junctions, insertions and deletions.
- However, these example datasets are too small to give you a good impression of real data.
- Therefore, import 4 files from the TopHat output, restricted to chr4, into your history:
  GSM461177_untreat_paired_chr4.bam
  GSM461177_untreat_paired_deletions_chr4.bed
  GSM461177_untreat_paired_insertions_chr4.bed
  GSM461177_untreat_paired_junctions_chr4.bed
- You may have to change the data type from tabular to bed (use the pencil button).
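To see what the junctions file contains, the following sketch parses one line of it. It assumes the 12-column BED layout described in the TopHat manual, where each junction is drawn as two blocks (the read overhangs) and the score column holds the number of spanning alignments. The example line is made up, and a real file may start with a track line that has to be skipped.

```python
from typing import NamedTuple


class Junction(NamedTuple):
    chrom: str
    intron_start: int  # 0-based, first intronic base
    intron_end: int    # 0-based, exclusive
    strand: str
    read_count: int


def parse_junction_line(line):
    """Extract the intron coordinates from one BED12 junction line."""
    fields = line.rstrip("\n").split("\t")
    chrom, chrom_start = fields[0], int(fields[1])
    read_count, strand = int(fields[4]), fields[5]
    block_sizes = [int(value) for value in fields[10].rstrip(",").split(",")]
    block_starts = [int(value) for value in fields[11].rstrip(",").split(",")]
    # The intron lies between the end of block 1 and the start of block 2.
    intron_start = chrom_start + block_sizes[0]
    intron_end = chrom_start + block_starts[1]
    return Junction(chrom, intron_start, intron_end, strand, read_count)


# Made-up example line, for illustration only.
line = "chr4\t1000\t1500\tJUNC00000001\t12\t+\t1000\t1500\t255,0,0\t2\t30,40\t0,460"
print(parse_junction_line(line))
```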
Visualise mapping files with IGV
- Open the dataset and click on display with IGV: web current
- Open the file with a Java plugin (e.g. IcedTea)
- Go to View → Preferences → Alignments and set the visibility range to >= 50 kb
- Inspect the region on chr4 between 560 kb and 600 kb: copy chr4:560000-600000 into the locus window and click Go
- Now import the BED output into IGV: open the dataset and click on display with IGV: local
- Inspect the results using a Sashimi plot (right-click on the BAM file → select Sashimi Plot from the context menu)
Reproducibility and Transparency
Save your workflow:
- Click on History Options (the little gearwheel on top of your history)
- Choose History Actions → Extract Workflow
- Annotate your workflow and save
- Go to the Workflow section and have a look
Users, users, users
Would you be so kind as to fill out a short survey?
de.nbi summerschool 11/2016 survey
Contact
Jörg Fallmann, fall@bioinf.uni-leipzig.de, http://www.bioinf.uni-leipzig.de
Björn Grüning, bjoern.gruening@gmail.com, http://www.bioinf.uni-freiburg.de
Still time left?
Analysing Differential Gene Expression with DESeq2
Proceed from Step 5
Count the number of reads per annotated gene with htseq-count
- htseq-count can be used to count reads per feature in different samples
- It expects a BAM file as input
- In case of paired-end reads, the alignments in the BAM file should be sorted by read name
- Use the tool Sort from NGS: SAM Tools to sort the paired-end BAM file; sort by read names
- We need a GFF/GTF file with feature (i.e. gene) annotations: Drosophila_melanogaster.bdgp5.78.gtf
- Apply the tool htseq-count to all samples: select the Drosophila_melanogaster.bdgp5.78.gtf file as feature file, use the Union mode for reads overlapping more than one feature, and set the Minimum Alignment Quality to 10
- Inspect the result files
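The Union mode and the alignment quality threshold can be sketched in a few lines. This is a simplified illustration of the counting rule described in the HTSeq documentation (a read is counted only if it overlaps exactly one gene), restricted to single-end, unstranded reads on one chromosome. The gene coordinates, reads and function name are made up.

```python
from collections import Counter


def count_union(reads, genes, min_quality=10):
    """Count reads per gene; reads are (start, end, mapq), ends exclusive."""
    counts = Counter()
    for start, end, mapq in reads:
        if mapq < min_quality:
            counts["__too_low_aQual"] += 1
            continue
        # Union of all genes that overlap any position of the read.
        hits = {
            gene_id
            for gene_id, exons in genes.items()
            if any(start < exon_end and exon_start < end
                   for exon_start, exon_end in exons)
        }
        if not hits:
            counts["__no_feature"] += 1
        elif len(hits) == 1:
            counts[hits.pop()] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Made-up annotation and reads, for illustration only.
genes = {"geneA": [(100, 200)], "geneB": [(180, 300)]}
reads = [(110, 147, 50), (190, 227, 50), (400, 437, 50), (120, 157, 3), (250, 287, 50)]
print(dict(count_union(reads, genes)))
```

The second read overlaps both genes and is therefore reported as ambiguous rather than counted twice.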
THEN: We counted only reads that mapped to chr4. To get more meaningful results:
- Import the 3 treated and 4 untreated count files from Zenodo (as type tabular!):
  GSM461176_untreat_single.counts
  GSM461177_untreat_paired.counts
  GSM461178_untreat_paired.counts
  GSM461179_treat_single.counts
  GSM461180_treat_paired.counts
  GSM461181_treat_paired.counts
  GSM461182_untreat_single.counts
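Each file name encodes both the treatment and the library type, which is what the differential expression step needs as factors. A minimal sketch of reading them off a name follows; the function name is made up, the level names (treated/untreated, PE/SE) follow the DESeq2 instructions in this deck, and the splitting accepts underscores or spaces because the exact separator is an assumption.

```python
import re


def factors_from_name(file_name):
    """Derive condition and sequencing type from a count file name."""
    stem = file_name.rsplit(".", 1)[0]
    _sample, condition, library = re.split(r"[_\s]+", stem)[:3]
    return {
        "condition": "treated" if condition == "treat" else "untreated",
        "sequencing": "PE" if library == "paired" else "SE",
    }


print(factors_from_name("GSM461179_treat_single.counts"))
print(factors_from_name("GSM461177_untreat_paired.counts"))
```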
- Run DESeq2 using the count files as input. In addition to the first factor, condition, with the levels treated and untreated, add a second factor, sequencing, with the levels PE and SE. Choose the corresponding count files for each factor and level; the file names contain all the information needed.
- Use the file with the independent filtering results for further downstream analysis: it excludes genes with only a few read counts, which would not be called significantly differentially expressed anyway.
- Filter the DESeq2 result file for all genes with an adjusted p-value of 0.05 or below (Filter tool, condition c7<=0.05). Note that the output is already sorted by adjusted p-value.
- Similarly, separate the up- and down-regulated genes (the 3rd column contains the fold changes).
- Select the first 100 lines of the data set.
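The two filter steps can be reproduced outside Galaxy on the tab-separated result file. The sketch below assumes, as stated above, that column 7 holds the adjusted p-value and column 3 the fold change (taken here to be on a log scale, so the sign gives the direction). The function name and the example rows are made up, and rows with NA values are skipped.

```python
def split_significant(rows, padj_column=7, fold_change_column=3, alpha=0.05):
    """Split significant rows into up- and down-regulated genes."""
    up, down = [], []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        padj = fields[padj_column - 1]  # Galaxy's c7 is 1-based
        fold_change = fields[fold_change_column - 1]
        if "NA" in (padj, fold_change):
            continue
        if float(padj) <= alpha:
            if float(fold_change) > 0:
                up.append(fields)
            elif float(fold_change) < 0:
                down.append(fields)
    return up, down


# Made-up rows, for illustration only.
rows = [
    "geneA\t100\t2.0\t0.1\t20\t1e-9\t1e-8",
    "geneB\t50\t-1.5\t0.2\t-7\t1e-5\t0.001",
    "geneC\t10\t0.3\t0.5\t0.6\t0.5\t0.7",
    "geneD\t1\tNA\tNA\tNA\tNA\tNA",
]
up, down = split_significant(rows)
print([fields[0] for fields in up], [fields[0] for fields in down])
```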
Step 7: Functional enrichment among differentially expressed genes
- Use the adjusted p-value filtered data from Step 6 as the input data set for DAVID
- The identifiers in the first column are Flybase gene ids
- The output of the DAVID tool is an HTML file with a link to the DAVID website
- There you can, for example, analyse clusters of functional enrichment