Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata


Analysis of RNA sequencing data sets using the Galaxy environment Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata Microarray and Deep-sequencing core facility 30.10.2017

RNA-seq workflow I: Hypothesis (a.k.a. the research question)
Differentially expressed genes across several conditions of an experiment.
Simple case, two conditions: wild type vs. gene-knockout mouse; healthy person vs. cancer patient; control vs. treatment with drug.
Complexity can increase arbitrarily: many conditions, confounding factors, time-course experiments, etc.

RNA-seq workflow I: Experimental design
Important to ensure the (statistical) validity of results.
Depends on the hypothesis: Cell cultures or animals/patients? Phenotypic effect mild or severe? Inclusion of non-coding RNA? ...
Affects the choice of protocols for culturing, RNA extraction, sample preparation, sequencing and bioinformatics, and especially the number of replicates per condition!
Involve a statistician/bioinformatician from the beginning!

RNA-seq workflow I: Sequencing processing
Post-processing of intensity values:
basecalling: convert the sequence of intensities to nucleotide sequences ("reads")
demultiplexing: assign reads to samples based on their adapter sequences ("barcodes")
Result: sample-specific sequence read files.
Fragments can be sequenced from one or both ends (unpaired/single-end vs. paired-end); RNA-seq is often run single-end.

RNA-seq workflow II: FASTQ, the sequencing read file format
Raw reads from sample-specific fragments.
Per-base quality information (Phred scores, encoded with an ASCII offset of 33).
Image source: biocluster.ucr.edu
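To make the file layout concrete, here is a minimal Python sketch of a FASTQ reader. It assumes the common four-lines-per-read layout (header, sequence, "+" separator, quality string) and the Phred+33 encoding mentioned above; the function name `read_fastq` and the example read are made up for illustration and are not part of the course material.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or trailing blank line)
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality_line = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of header, got: {header!r}")
        if len(sequence) != len(quality_line):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Phred+33: quality = ASCII code of the character minus the offset
        qualities = [ord(char) - offset for char in quality_line]
        yield FastqRecord(header[1:], sequence, qualities)


# Hypothetical one-read example: 'I' encodes Q40, '#' encodes Q2
example = io.StringIO("@read1\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))
print(records[0])
```

For real data, a tested parser (for example the one in Biopython) is the safer choice; this sketch only illustrates how the four lines relate to each other.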

RNA-seq workflow II: FASTQ processing
Steps towards identifying differential expression of genes between samples:
1) Quality assessment of raw reads
2) Alignment of reads to the genome
3) Quantification of gene expression
Pipeline: QC of raw reads → read alignment → quantification. How can I do that on my own?

Galaxy
Open-source, web-based platform for data-intensive biomedical research, developed at Penn State and Johns Hopkins University.
Many (NGS) bioinformatics tools are available as plug-ins.
Container-based: the server runs in a container that can be installed and customized on other systems, so many instances of Galaxy are running worldwide.
Users work on histories of data and processes; data can be shared with other users.
Galaxy@GWDG: https://galaxy.gwdg.de/

Galaxy practical I: Open https://galaxy.gwdg.de/ and log in with your GWDG/course account

Galaxy practical I: Uploading data into Galaxy, a sandbox example
1) Go to www.ensembl.org
2) Click Downloads, then Download data via FTP
3) Click on GTF for Human Gene sets
4) Download Homo_sapiens.GRCh38.90.gtf.gz to your PC
5) Go back to Galaxy
6) Click Get Data, then Upload File from your computer
7) Choose the local file from your PC (check the Download folder)
8) If successful, close the window
9) Optional: rename the history (click on "Unnamed history")

You should see this: Your history should look like this:

Galaxy practical I
Uploading data may be time-consuming.
Galaxy allows importing data from public repositories and sharing data with other users.
We shared a data set from a published study (published January 2017).

Galaxy practical I
1) Go to Shared Data → Data Libraries → RNA-Seq_MolBio_Lecture → Raw Data
2) The folder contains 3 control condition samples ("GFP...") and 3 overexpression samples ("PCDH7...")
3) Click any of the files to inspect the data
4) Add all files to your history; there are several options:
   Individually open files and click "to History" (slow)
   Mark files in the folder view and click "to History" (fast)
   Mark the whole folder and click "to History" (fast)
5) Import into the existing history, go to the main menu and click the eye symbol for one of the samples

You should see this:

Zoom in to see the FASTQ file features: read nucleotide sequence, base quality information, read length.

RNA-seq workflow II: Essential questions about quality control
How many reads should I have? >=25 million reads are required for a representative transcriptome profile of model organisms such as human and mouse. PCR introduces many (uninformative) duplicates.
How good are the reads? Assess the signal-to-noise ratio of sequencing; determine the proportion of ambiguous bases ("N"); identify the fraction of adapters, contamination, etc.

RNA-seq workflow II: Phred scores reflecting basecall accuracy
How good are the bases/reads?
Phred scale: logarithmic scale of basecall accuracy
Common threshold for good quality
Phred Quality Score | Probability of Incorrect Basecall | Basecall accuracy
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10000 | 99.99%
50 | 1 in 100000 | 99.999%
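The table follows from the standard Phred definition Q = -10 * log10(P), where P is the probability of an incorrect basecall. The short Python sketch below recomputes the rows; the function names are chosen for illustration only.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a basecall with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Recompute the rows of the table above
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(f"Q{q}: 1 in {round(1 / p)} incorrect, accuracy {100 * (1 - p):.3f}%")
```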

RNA-seq workflow II: Quality control indices
Further quality indices:
Distribution of nucleotide frequencies across the sequences
GC content per sequence
Fraction of N
Length distribution of sequences
Sequence duplication level
Amount of overrepresented sequences and short (6-8 bp) stretches of nucleotides ("k-mers")
Adapter content (trimming may be required)
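A few of these indices are simple enough to compute by hand. The following Python sketch shows GC content, fraction of N, and a naive duplication level for a list of read sequences. The definitions here are simplifying assumptions (for instance, duplicates are exact string matches), and FastQC's own calculations differ in detail; the reads are invented.

```python
from collections import Counter
from typing import List


def gc_content(sequence: str) -> float:
    """Fraction of G/C among the called bases (A, C, G, T); N is ignored."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / called if called else 0.0


def n_fraction(sequence: str) -> float:
    """Fraction of ambiguous bases (N) in a read."""
    return sequence.upper().count("N") / len(sequence) if sequence else 0.0


def duplication_level(reads: List[str]) -> float:
    """Fraction of reads that are exact copies of an earlier read."""
    counts = Counter(reads)
    duplicates = sum(count - 1 for count in counts.values())
    return duplicates / len(reads) if reads else 0.0


# Hypothetical toy reads
reads = ["ACGT", "ACGT", "GGCC", "NNAT"]
for read in reads:
    print(read, gc_content(read), n_fraction(read))
print("duplication level:", duplication_level(reads))
```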

RNA-seq workflow II: FastQC, a quality control tool for high-throughput sequence data
Systematically assess quality for NGS samples in Galaxy with FastQC:
Open-source tool
Runs on all platforms
Assesses various quality parameters, including contamination by adapters
Lets the user provide contamination sequences
Generates intuitively interpretable output and visualization

RNA-seq workflow II FastQC per base quality scores

Galaxy practical II: Quality control with FastQC
1) Select General Sequencing → Quality Control → FastQC and read the description
2) Click Multiple datasets and select all FASTQ files from your history
3) Click Execute

Galaxy practical II: Quality control with FastQC
Execution calls several instances of the FastQC program, which are scheduled by the server; execution time depends on file size, number of files, number of users and server load.
After a few minutes you should see FastQC results in your history (hit the refresh symbol if not).
As soon as any job is finished you can inspect the results: choose Webpage, then the eye symbol.
Scroll through the Webpage. We are here to answer your questions!
FastQC RawData contains detailed reports.

RNA-seq workflow III Short read alignment Goal: determine the origin of sequenced reads w.r.t. the genome http://www.nature.com/nbt/journal/v27/n5/fig_tab/nbt0509-455_f2.html

RNA-seq workflow III: Short read alignment
Sequence alignment: re-arrangement of two or more biological sequences to identify corresponding nucleotides/amino acids.
Example:
sequence 1: ACATCGA
sequence 2: ACTAGCTA
possible alignment:
ACATCG--A
AC-TAGCTA

RNA-seq workflow III: Short read alignment
Terminology:
match: the two residues in a position are identical
mismatch: a residue is substituted by a different residue
gap: residue(s) is/are inserted or deleted
Example alignment (annotated on the slide with match, mismatch, insertion and deletion):
ACATCG--A
AC-TAGCTA

RNA-seq workflow III: Short read alignment
Quality of an alignment: the alignment score is the sum of the scores of all positions.
Example position scores: match = +1, mismatch = -1, gap = -1
possibility 1:
ACATCG--A
AC-TAGCTA
score: 5*1 + 4*(-1) = 1
possibility 2:
ACATC-G--A
AC-T-AGCTA
score: 5*1 + 5*(-1) = 0
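The scoring rule can be written down in a few lines of Python. This sketch only scores two already-aligned rows with the position scores from the example; it does not search for the best alignment, which real aligners do with dynamic programming or index-based heuristics. The function name is chosen for illustration.

```python
def alignment_score(row_a: str, row_b: str,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Sum the per-position scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":   # gap in either row
            score += gap
        elif a == b:               # identical residues
            score += match
        else:                      # substitution
            score += mismatch
    return score


# The two possibilities from the slide
print(alignment_score("ACATCG--A", "AC-TAGCTA"))    # 5 matches, 1 mismatch, 3 gaps
print(alignment_score("ACATC-G--A", "AC-T-AGCTA"))  # 5 matches, 5 gaps
```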

RNA-seq workflow III: Short read alignment
Global vs. local alignment:
Global: align sequences end-to-end
Local: find the optimal placement of (sub)sequence(s) within a longer sequence

RNA-seq workflow III: Short read alignment
Applications of sequence alignment:
Homology detection: identify the best match of a sequence to many sequences in a database, e.g. NCBI BLAST
Identify conserved sites via multiple alignments of related protein sequences, e.g. EMBL-EBI Clustal Omega
Short read alignment ("mapping"): identify the origin of a sequence w.r.t. a genomic reference sequence, e.g. Bowtie, BWA, TopHat, STAR, HISAT, ...

RNA-seq workflow III: Short read alignment
Reference sequence: the full complement of DNA sequences (genome) or mRNA sequences (transcriptome) of an organism.
Usually provided as a (multi-)FASTA file containing one sequence per chromosome/transcript.
Completeness and complexity depend on how far the organism's genome project has advanced:
Organism | Assembly | Length (Mb) | Chromosomes | Genes
Human (Homo sapiens) | GRCh38.p11 | 3253.85 | 22 chromosomes, 2 sex chromosomes and non-nuclear mitochondrial DNA | 60298
African clawed frog (Xenopus laevis) | Xenopus_laevis_v2 | 2718.43 | 18 chromosomes, non-nuclear mitochondrial DNA | 36776

RNA-seq workflow III: Short read alignment
Transcriptome sizes are substantially smaller, e.g. the human transcriptome:
20,338 coding genes
22,521 non-coding genes (5,363 small non-coding, 14,720 long non-coding, 2,222 misc non-coding)
The total number of transcripts can be much higher: 200,310 gene transcripts

RNA-seq workflow III: Short read alignment
Goal: determine the (optimal) mapping of each sequencing read to the reference genome/transcriptome
@SRR2549634.1 SEB9BZKS1:279:C4JALACXX:8:1101:1292:2222/1
NCCCCTTGGTCACCTTGCTTGATTATCGTAGCACCTTTGGGGACGGACTTC
@SRR2549634.2 SEB9BZKS1:279:C4JALACXX:8:1101:1771:2249/1
GTTAGATGCAACTCTTGGCCATAAATCGGCACATTCCTTACCGACTGGACC
@SRR2549634.3 SEB9BZKS1:279:C4JALACXX:8:1101:4645:2229/1
NGAATGGTATGTTGCTGGACCTCAGAAGGATGTTCAAAACCACAGTCAATG
@SRR2549634.4 SEB9BZKS1:279:C4JALACXX:8:1101:4518:2229/1
NTGGATCCTCAAATCCCACCACATCCATCCAAGGATCATGATTAAAAGCGT
@SRR2549634.5 SEB9BZKS1:279:C4JALACXX:8:1101:5231:2241/1
NTGGGTATTCACTGAAAGCTTCAACACACATTGGCTTAGATGGAACGAACT
@SRR2549634.6 SEB9BZKS1:279:C4JALACXX:8:1101:5383:2243/1
TGGGTGTAGACATCTTCAACACCAGCCAATTGCAACAACTTTTTGACAGCT
@SRR2549634.7 SEB9BZKS1:279:C4JALACXX:8:1101:7221:2245/1
TGGAAATGTTGTCCAGAGTTATCTGGATGATCTAACGTGGGGTTATTGTTT
@SRR2549634.8 SEB9BZKS1:279:C4JALACXX:8:1101:8304:2249/1
GCCAGACAGAGGTTTTTCAAATTAGGAAATGTTTGAGCCAATGTGGAAATT
@SRR2549634.9 SEB9BZKS1:279:C4JALACXX:8:1101:9168:2233/1
NCTATTTTCATCATCTGATTGAAAAAAAACATTGAAAATATACTCATCATT
@SRR2549634.10 SEB9BZKS1:279:C4JALACXX:8:1101:9915:2241/1
NGTGGACAAGATTCTTGGAGCCTTACCCTTGTGTGGACCCATACCGAAGTG

RNA-seq workflow III: Short read alignment
Mapping = always local alignment.
Reads from RNA can span exons, so spliced (gapped) alignment is necessary.

RNA-seq workflow III: Short read alignment
Galaxy@GWDG provides three read alignment tools:
RNA STAR*: advantage: one of the most sensitive, precise, versatile and fast read alignment programs; disadvantage: memory-intensive
HISAT2**: fast and sensitive, can be run on a laptop
TopHat***: fast splice junction mapper; uses Bowtie2 and then analyzes the mapping results to identify splice junctions between exons
Genome indexes are precomputed for human and mouse.
*Dobin et al., Bioinformatics, 2013
**Kim et al., Nature Methods, 2015
***Kim et al., Genome Biology, 2013

Galaxy practical part III: Short read alignment
1) Select Transcriptomics → Mapping → HISAT2
2) Select unpaired reads
3) Choose one(!!!) of the six FASTQ files
4) Select Homo_sapiens... as the reference genome
5) Click Execute
6) When the job is scheduled, click on HISAT2 again and read the description
Note: mapping will take a while (~30 min)!

Galaxy practical part III short read alignment

RNA-seq workflow III Short read alignment Visualization of alignments as stacked read sequences:

RNA-seq workflow III: Short read alignment
More flexible: genome browsers
Visualization of reads, splice patterns, mutations etc.
Integration of annotation, public data, known SNPs etc.
UCSC online genome browser: genome.ucsc.edu
Downloadable and usable from Galaxy: IGV from the Broad Institute*, software.broadinstitute.org/software/igv/
*Robinson et al., Nature Biotechnology, 2011

The RNA-seq workflow III Short read alignment

RNA-seq workflow III: Short read alignment
Read coverage: number of reads matching a position/region
Allows statements about the gene expression level (RNA-seq)
High coverage helps to identify genomic variants
Depends on sequencing depth
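As a toy illustration of per-position coverage, the Python sketch below counts how many read intervals cover each position of a region. It assumes ungapped reads given as 0-based, half-open (start, end) intervals, which ignores spliced alignments; the intervals are invented. In practice, tools such as samtools compute this from BAM files.

```python
from typing import List, Tuple


def coverage(read_intervals: List[Tuple[int, int]],
             region_start: int, region_end: int) -> List[int]:
    """Per-position read depth on the half-open region [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for start, end in read_intervals:
        # Clip each read to the region before counting
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth


# Three hypothetical 5-bp reads on a 10-bp region
reads = [(0, 5), (2, 7), (4, 9)]
depth = coverage(reads, 0, 10)
print(depth)
print("mean coverage:", sum(depth) / len(depth))
```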

RNA-seq workflow III: Short read alignment
SAM = Sequence Alignment/Map format
Human-readable standard format for alignment characterization
Contains general information on the alignment program/parameters and the reference sequence used
One entry per alignment with information on location, quality and more
BAM = binary (compressed) version
samtools: popular tool for SAM/BAM file manipulation
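To show what "one entry per alignment" looks like, here is a hedged Python sketch that splits a SAM alignment line into its mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) and keeps a subset of them. The example line is invented for illustration; for real work a library such as pysam or the samtools command line is preferable.

```python
from typing import NamedTuple, Optional


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. 5M
    seq: str     # read sequence
    qual: str    # Phred+33 quality string


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Parse one SAM line; header lines (starting with '@') return None."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9], fields[10])


def is_unmapped(record: SamRecord) -> bool:
    """Flag bit 0x4 marks a read that could not be aligned."""
    return bool(record.flag & 0x4)


# Hypothetical alignment line
line = "read1\t0\tchr4\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII"
record = parse_sam_line(line)
print(record, is_unmapped(record))
```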

RNA-seq workflow III Short read alignment

RNA-seq workflow III: Short read alignment
Several metrics allow statements about the overall alignment quality of a sample:
Total number of mapped reads (coverage) and fraction of reads mapping to the genome...
...uniquely: evidence for a particular gene/transcript
...multiply: paralogs, CNV, ribosomal RNA, ...
...not at all: contamination, genomic DNA, ...
Number of mismatches
Number of novel splice junctions
...

RNA-seq workflow III: Short read alignment
Example mapping output:
Click on the finished job and inspect the mapping statistics.
Click the info icon to access the job details, including the version of the software used.

Galaxy practical part III: Short read alignment
1) Start IGV on your system (search on the Desktop) and open the .bat file
2) Choose Human Hg38 as the reference genome
3) Go to the locus field and enter PCDH7

Galaxy practical part III: Short read alignment
1) Go to Shared Data → Data Libraries → RNA-Seq_MolBio_Lecture → Aligned_Files
2) Import all alignment ("BAM") files into your history; ignore the file Aligned_PCDH7-3.bam with size 770.4 Mb
3) Go to the main view ("Analyze Data")
4) Select one alignment file from GFP and one alignment file from PCDH7, and click "display with IGV local"
5) Go to IGV and zoom in on the first exon of PCDH7
6) Right-click on the data tracks and choose Collapsed

RNAseq-workflow IV - quantification of expression Gene expression quantification Goal: estimate the gene expression level from counting reads overlapping annotated genes discoveringthegenome.org

RNAseq-workflow IV: Quantification of expression
Annotations are often available from genome project websites or Ensembl.
The standard formats for annotations are the general feature format (GFF) and the gene transfer format (GTF).
Both are tab-delimited files with information on gene structures: 9 fields, the last of which is a flexible Attributes field.
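A minimal Python sketch of reading one GTF line follows. It assumes the nine tab-separated columns of the GTF specification (seqname, source, feature, start, end, score, strand, frame, attributes) and attributes written as `key "value";` pairs; values that themselves contain semicolons are not handled. The example line and its identifiers are invented.

```python
from typing import Dict, Optional


def parse_gtf_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one GTF feature line; comment lines (starting with '#') return None."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields
    attributes: Dict[str, str] = {}
    for item in attr_text.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")  # split at the first space
        attributes[key] = value.strip('"')
    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end),  # 1-based, inclusive
        "score": score, "strand": strand, "frame": frame,
        "attributes": attributes,
    }


# Hypothetical exon line
line = 'chr4\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";'
exon = parse_gtf_line(line)
print(exon["feature"], exon["start"], exon["end"], exon["attributes"])
```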

RNAseq-workflow IV quantification of expression The file we down-/uploaded earlier is an annotation in GTF format for the human genome

RNAseq-workflow IV - quantification of expression Standard procedure: count number of reads that overlap features (here: exons of a gene) and summarize on meta-feature (here: gene) level
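The counting procedure can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the algorithm of any particular tool: reads are ungapped intervals, coordinates are 1-based and inclusive, any overlap of at least one base counts, and reads that overlap exons of more than one gene are left unassigned (one possible policy). All names and coordinates are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Read = Tuple[str, int, int]        # (chromosome, start, end)
Exon = Tuple[str, str, int, int]   # (gene_id, chromosome, start, end)


def count_reads_per_gene(reads: List[Read], exons: List[Exon]) -> Dict[str, int]:
    """Count reads overlapping exons (features), summarized per gene (meta-feature)."""
    counts: Dict[str, int] = defaultdict(int)
    for gene_id, _, _, _ in exons:
        counts[gene_id] += 0  # make sure every gene appears, even with zero reads
    for read_chrom, read_start, read_end in reads:
        hit_genes = {
            gene_id
            for gene_id, chrom, start, end in exons
            if chrom == read_chrom and read_start <= end and start <= read_end
        }
        if len(hit_genes) == 1:  # skip unassigned and ambiguous reads
            counts[hit_genes.pop()] += 1
    return dict(counts)


exons = [("GENE_A", "chr1", 100, 200), ("GENE_A", "chr1", 300, 400),
         ("GENE_B", "chr1", 390, 500)]
reads = [("chr1", 150, 199),   # GENE_A, first exon
         ("chr1", 180, 320),   # two exons of GENE_A, counted once
         ("chr1", 395, 420),   # overlaps GENE_A and GENE_B, ambiguous
         ("chr1", 450, 480),   # GENE_B
         ("chr2", 150, 199)]   # no annotated feature
gene_counts = count_reads_per_gene(reads, exons)
print(gene_counts)
```

The nested loop is quadratic in the worst case; real counters use indexed or sorted feature lookups to stay fast on millions of reads.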

RNAseq-workflow IV: Quantification of expression
Questions and pitfalls when counting mapped reads:
Consider multiply mapped reads?
Count on gene or exon/transcript level?
How to count partially mapping reads?
How to treat overlapping features?
...

RNAseq-workflow IV: Quantification of expression
Galaxy@GWDG provides the featureCounts* tool for fast and flexible quantification.
1) Select Transcriptomics → Counting → featureCounts and read the description
2) Click Multiple datasets and select all imported alignment files
3) Load the annotation file (the GTF file) from your history
4) Click Execute
Quantification should take between 1 and 10 min.
*Liao et al., Bioinformatics, 2014

Galaxy practical part IV: Gene expression quantification
1) When any dataset is finished, click on the eye symbol
2) Copy the identifier of a gene with >1000 reads assigned and paste it into the Ensembl search window
3) Optional: rename the files according to the alignment input

RNA workflow addendum: Summary of quality from multiple samples
Quality assessment of 6 samples is easy enough to do one by one. What about more?
Solution: MultiQC
Supports summary logs from multiple software tools, including FastQC, STAR, Bowtie2, featureCounts, etc.
Generates a single HTML file that summarizes all results in one interactive report.

RNA workflow addendum Summary of quality from multiple samples

Galaxy practical addendum quality summary (FastQC)

Galaxy practical addendum quality summary Questions?