Analysis of RNA sequencing data sets using the Galaxy environment
Dr. Gabriela Salinas, Dr. Orr Shomroni, Kaamini Rhaithata
Microarray and Deep-sequencing core facility
30.10.2017

RNA-seq workflow I - Hypothesis (a.k.a. the research question)
- Differentially expressed genes across several conditions of an experiment
- Simple case, two conditions:
  - Wild type vs. gene knockout mouse
  - Healthy person vs. cancer patient
  - Control vs. treatment with drug
- Complexity can increase arbitrarily: many conditions, confounding factors, time course experiments, etc.

RNA-seq workflow I - Experimental design
- Important to ensure (statistical) validity of results
- Depends on the hypothesis:
  - Cell cultures or animals/patients?
  - Phenotypic effect mild or severe?
  - Inclusion of non-coding RNA? ...
- Affects choice of protocols for culturing, RNA extraction, sample preparation, sequencing, bioinformatics and especially the number of replicates per condition!
- Involve a statistician/bioinformatician from the beginning!

RNA-seq workflow I - Sequencing processing
- Post-processing of intensity values:
  - Basecalling: convert sequence of intensities to nucleotide sequences ("reads")
  - Demultiplexing: assign reads to samples based on their adapter sequences ("barcodes")
- Result: sample-specific sequence read files
- Fragments can be sequenced from one or both ends: unpaired/single-end vs. paired-end
- RNA-seq is often run single-end

RNA-seq workflow II - FASTQ, the sequencing read file format
- Raw reads from sample-specific fragments
- Per-base quality information (Phred scores, stored as ASCII characters with an offset of 33)
- Figure source: biocluster.ucr.edu
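To make the format concrete, here is a minimal Python sketch of a FASTQ reader. It assumes the common layout of exactly four lines per read (header, sequence, "+" separator, quality string) and the Phred+33 encoding; the names `FastqRecord`, `read_fastq` and the two example reads are made up for illustration and are not part of the course data.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Phred+33: quality character -> ASCII code minus 33
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Hypothetical two-read example (not taken from the course data).
EXAMPLE_FASTQ = "@read1\nACGTN\n+\nIIII#\n@read2\nGGCA\n+\n5555\n"
records = list(read_fastq(io.StringIO(EXAMPLE_FASTQ)))
for record in records:
    print(record.name, record.sequence, record.qualities)
```

For a real file, `read_fastq(open(path))` would be used instead of the in-memory string; compressed files would need `gzip.open(path, "rt")`.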

RNA-seq workflow II - FASTQ processing
Steps towards identifying differential expression of genes between samples:
1) Quality assessment of raw reads
2) Alignment of reads to the genome
3) Quantification of gene expression
Workflow: QC of raw reads > Read alignment > Quantification
How can I do that on my own?

Galaxy
- Open source, web-based platform for data intensive biomedical research
- Developed at Penn State and Johns Hopkins University
- Many (NGS) bioinformatics tools available as plug-ins
- Container-based: the server runs in a container that can be installed and customized on other systems; many instances of Galaxy are running worldwide
- User works on "histories" of data and processes; data can be shared with other users
- Galaxy@GWDG: https://galaxy.gwdg.de/

Galaxy practical I
- Open https://galaxy.gwdg.de/ and log in with your GWDG/course account

Galaxy practical I - Uploading data into Galaxy, a sandbox example
- Go to www.ensembl.org
- Click "Downloads", then "Download data via FTP"
- Click on "GTF" for Human "Gene sets"
- Download Homo_sapiens.GRCh38.90.gtf.gz to your PC
- Go back to Galaxy
- Click "Get Data", then "Upload File from your computer"
- Choose local file from your PC (check the Download folder)
- If successful, close the window
- Optional: rename history (click on "Unnamed history")

You should see this: [screenshot]
Your history should look like this: [screenshot]

Galaxy practical I
- Uploading data may be time-consuming
- Galaxy allows importing data from public repositories and sharing data with other users
- We shared a data set from a published study (published January 2017)

Galaxy practical I
- "Shared Data" > "Data Libraries" > "RNA-Seq_MolBio_Lecture" > "Raw Data"
- 3 control condition samples ("GFP..."), 3 overexpression samples ("PCDH7...")
- Click any of the files to inspect data
- Add all files to your history; several options:
  - Individually open files and click "to History" (slow)
  - Mark files in folder view and click "to History" (fast)
  - Mark whole folder and click "to History" (fast)
- Import into existing history, go to Main menu and click the eye symbol for one of the samples

You should see this: [screenshot]

Zoom in to see FASTQ file features:
- read nucleotide sequence
- base quality information
- read length

RNA-seq workflow II - Essential questions about quality control
- How many reads should I have?
  - >=25 million reads required for a representative transcriptome profile of model organisms such as human and mouse
  - PCR introduces many (uninformative) duplicates
- How good are the reads?
  - Assess signal-to-noise ratio of sequencing
  - Determine proportion of ambiguous bases ("N")
  - Identify fraction of adapters, contamination, etc.

RNA-seq workflow II - Phred scores reflecting basecall accuracy
- How good are the bases/reads?
- Phred scale: logarithmic scale of basecall accuracy
- Common threshold for good quality (marked on the original slide)

Phred Quality Score   Probability of Incorrect Basecall   Basecall accuracy
10                    1 in 10                             90%
20                    1 in 100                            99%
30                    1 in 1000                           99.9%
40                    1 in 10000                          99.99%
50                    1 in 100000                         99.999%
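The table follows from the standard Phred definition Q = -10 * log10(P), where P is the probability of an incorrect basecall. A minimal Python sketch of the conversion in both directions (the function names are chosen here for illustration):

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Reproduce the rows of the table above.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(f"Q{q}: 1 in {round(1 / p)} wrong, accuracy {100 * (1 - p):.3f}%")
```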

RNA-seq workflow II - Quality control indices
Further quality indices:
- Distribution of nucleotide frequencies across the sequences
- GC content per sequence
- Fraction of N
- Length distribution of sequences
- Sequence duplication level
- Amount of overrepresented sequences and short (6-8 bp) stretches of nucleotides ("k-mers")
- Adapter content: trimming may be required
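Several of these indices are simple to compute by hand. The sketch below is a simplified illustration in Python, not a reimplementation of FastQC (whose duplication estimate in particular is computed differently); the function names and the four toy reads are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List


def gc_content(sequence: str) -> float:
    """Fraction of G and C among the unambiguous bases of one read."""
    bases = [base for base in sequence.upper() if base in "ACGT"]
    if not bases:
        return 0.0
    return sum(base in "GC" for base in bases) / len(bases)


def n_fraction(reads: List[str]) -> float:
    """Fraction of ambiguous bases (N) over all reads."""
    total = sum(len(read) for read in reads)
    return sum(read.upper().count("N") for read in reads) / total if total else 0.0


def length_distribution(reads: List[str]) -> Dict[int, int]:
    """Number of reads observed for each read length."""
    return dict(Counter(len(read) for read in reads))


def duplication_level(reads: List[str]) -> float:
    """Fraction of reads that are exact copies of an earlier read."""
    return 1 - len(set(reads)) / len(reads) if reads else 0.0


# Toy reads, invented for illustration only.
toy_reads = ["ACGT", "ACGT", "NNGC", "AT"]
print([gc_content(read) for read in toy_reads])
print(n_fraction(toy_reads), length_distribution(toy_reads), duplication_level(toy_reads))
```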

RNA-seq workflow II - FastQC: A quality control tool for high throughput sequence data
- Systematically assess quality for NGS samples in Galaxy with FastQC:
  - Open source tool
  - Runs on all platforms
  - Assesses various quality parameters, including contamination by adapters
  - Allows the user to provide contamination sequences
  - Generates intuitively interpretable output and visualization

RNA-seq workflow II - FastQC: per base quality scores [figure]

Galaxy practical II - Quality control with FastQC
- "General Sequencing" > "Quality Control" > "FastQC" and read the description
- Click "Multiple datasets" and select all FASTQ files from your history
- Click "Execute"

Galaxy practical II - Quality control with FastQC
- Execution calls several instances of the FastQC program, which are scheduled by the server
  - Execution time depends on file size, number of files, number of users and server load
- After a few minutes you should see FastQC results in your history (hit the refresh symbol if not)
- As soon as any job is finished you can inspect the results: choose "Webpage", then the eye symbol
- Scroll through the Webpage; we are here to answer your questions!
- "FastQC RawData" contains detailed reports

RNA-seq workflow III - Short read alignment
- Goal: determine the origin of sequenced reads w.r.t. the genome
- Figure source: http://www.nature.com/nbt/journal/v27/n5/fig_tab/nbt0509-455_f2.html

RNA-seq workflow III - Short read alignment
Sequence alignment: re-arrangement of two or more biological sequences to identify corresponding nucleotides/amino acids
Example:
  sequence 1: ACATCGA
  sequence 2: ACTAGCTA
  possible alignment:
    ACATCG--A
    AC-TAGCTA

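RNA-seq workflow III - Short read alignment
Terminology:
- match: two residues in a position match
- mismatch: residue is substituted by a different residue
- gap: residue(s) is/are inserted or deleted
Example:
    ACATCG--A
    AC-TAGCTA
  - match: e.g. column 1 (A/A)
  - deletion: column 3 (gap in the lower sequence)
  - mismatch: column 5 (C vs. A)
  - insertion: columns 7-8 (gap in the upper sequence)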

RNA-seq workflow III - Short read alignment
Quality of an alignment:
- Alignment score: sum of the scores of all aligned positions
- Example position scores: match=+1, mismatch=-1, gap=-1

  possibility 1:
    A C A T C G - - A
    A C - T A G C T A
    score: 5*1 + 4*(-1) = 1

  possibility 2:
    A C A T C - G - - A
    A C - T - A G C T A
    score: 5*1 + 5*(-1) = 0
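The two scores can be checked with a few lines of Python. This is a minimal sketch using the position scores from the slide; the function name `alignment_score` is chosen here for illustration.

```python
def alignment_score(row1: str, row2: str,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Sum the position scores of two already aligned, gapped sequences."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += gap        # insertion or deletion
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


possibility_1 = alignment_score("ACATCG--A", "AC-TAGCTA")
possibility_2 = alignment_score("ACATC-G--A", "AC-T-AGCTA")
print(possibility_1, possibility_2)
```

Under this scoring scheme possibility 1 is the better of the two alignments.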

RNA-seq workflow III - Short read alignment
Global vs. local alignment:
- Global: align sequences end-to-end
- Local: find optimal placement of (sub)sequence(s) within a longer sequence
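The difference shows up directly in the classic dynamic programming formulations: Needleman-Wunsch for global and Smith-Waterman for local alignment. The sketch below computes only the optimal scores, with the simple match=+1, mismatch=-1, gap=-1 scheme assumed here for illustration. Read mappers rely on genome indexes rather than filling a full matrix for every read, so this shows the principle only.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Optimal end-to-end (Needleman-Wunsch) alignment score."""
    previous = [j * gap for j in range(len(b) + 1)]  # first row: only gaps
    for i in range(1, len(a) + 1):
        current = [i * gap]  # first column: only gaps
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def local_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Optimal local (Smith-Waterman) alignment score."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: an alignment may start anywhere.
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# A short "read" placed inside a longer sequence (toy example).
print(global_score("TTTACGTTT", "ACG"), local_score("TTTACGTTT", "ACG"))
```

The global score is dragged down by the gaps needed to cover the whole long sequence, while the local score rewards only the matching region, which is the behaviour wanted when placing a short read on a reference.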

RNA-seq workflow III - Short read alignment
Applications of sequence alignment:
- Homology detection: identify best match of a sequence to many sequences in a database, e.g. NCBI BLAST
- Identify conserved sites via multiple alignments of related protein sequences, e.g. EMBL-EBI Clustal Omega
- Short read alignment ("mapping"): identify origin of a sequence w.r.t. a genomic reference sequence, e.g. Bowtie, BWA, TopHat, STAR, HISAT, ...

RNA-seq workflow III - Short read alignment
Reference sequence:
- Complement (complete set) of DNA sequences (genome) or mRNA sequences (transcriptome) from an organism
- Usually provided as a (multi-)FASTA file containing one sequence per chromosome/transcript
- Completeness and complexity depend on how far the organism's genome project has advanced:

  Human (Homo sapiens)
    Assembly:     GRCh38.p11
    Length (Mb):  3253.85
    Chromosomes:  22 chromosomes, 2 sex chromosomes and non-nuclear mitochondrial DNA
    Genes:        60298

  African clawed frog (Xenopus laevis)
    Assembly:     Xenopus_laevis_v2
    Length (Mb):  2718.43
    Chromosomes:  18 chromosomes, non-nuclear mitochondrial DNA
    Genes:        36776

RNA-seq workflow III - Short read alignment
Transcriptome sizes are substantially smaller, e.g. human transcriptome:
- 20,338 coding genes
- 22,521 non-coding genes
  - 5,363 small non-coding
  - 14,720 long non-coding
  - 2,222 misc non-coding
- Total number of transcripts can be much higher: 200,310 gene transcripts

RNA-seq workflow III - Short read alignment
Goal: determine (optimal) mapping of each sequencing read to reference genome/transcriptome

  @SRR2549634.1 SEB9BZKS1:279:C4JALACXX:8:1101:1292:2222/1
  NCCCCTTGGTCACCTTGCTTGATTATCGTAGCACCTTTGGGGACGGACTTC
  @SRR2549634.2 SEB9BZKS1:279:C4JALACXX:8:1101:1771:2249/1
  GTTAGATGCAACTCTTGGCCATAAATCGGCACATTCCTTACCGACTGGACC
  @SRR2549634.3 SEB9BZKS1:279:C4JALACXX:8:1101:4645:2229/1
  NGAATGGTATGTTGCTGGACCTCAGAAGGATGTTCAAAACCACAGTCAATG
  @SRR2549634.4 SEB9BZKS1:279:C4JALACXX:8:1101:4518:2229/1
  NTGGATCCTCAAATCCCACCACATCCATCCAAGGATCATGATTAAAAGCGT
  @SRR2549634.5 SEB9BZKS1:279:C4JALACXX:8:1101:5231:2241/1
  NTGGGTATTCACTGAAAGCTTCAACACACATTGGCTTAGATGGAACGAACT
  @SRR2549634.6 SEB9BZKS1:279:C4JALACXX:8:1101:5383:2243/1
  TGGGTGTAGACATCTTCAACACCAGCCAATTGCAACAACTTTTTGACAGCT
  @SRR2549634.7 SEB9BZKS1:279:C4JALACXX:8:1101:7221:2245/1
  TGGAAATGTTGTCCAGAGTTATCTGGATGATCTAACGTGGGGTTATTGTTT
  @SRR2549634.8 SEB9BZKS1:279:C4JALACXX:8:1101:8304:2249/1
  GCCAGACAGAGGTTTTTCAAATTAGGAAATGTTTGAGCCAATGTGGAAATT
  @SRR2549634.9 SEB9BZKS1:279:C4JALACXX:8:1101:9168:2233/1
  NCTATTTTCATCATCTGATTGAAAAAAAACATTGAAAATATACTCATCATT
  @SRR2549634.10 SEB9BZKS1:279:C4JALACXX:8:1101:9915:2241/1
  NGTGGACAAGATTCTTGGAGCCTTACCCTTGTGTGGACCCATACCGAAGTG

RNA-seq workflow III - Short read alignment
- Mapping = always local alignment
- Reads from RNA can span exons: spliced (gapped) alignment necessary

RNA-seq workflow III - Short read alignment
Galaxy@GWDG provides three read alignment tools:
- RNA STAR*
  - Advantage: one of the most sensitive, precise, versatile and fast read alignment programs
  - Disadvantage: memory-intensive
- HISAT2**: fast and sensitive, can be run on a laptop
- TopHat***: fast splice junction mapper; uses Bowtie2 and then analyzes the mapping results to identify splice junctions between exons
- Genome indexes are precomputed for human and mouse
*Dobin et al., Bioinformatics, 2013
**Kim et al., Nature Methods, 2015
***Kim et al., Genome Biology, 2013

Galaxy practical part III - Short read alignment
- "Transcriptomics" > "Mapping" > "HISAT2"
- Select unpaired reads
- Choose one(!!!) of the six FASTQ files
- Select "Homo_sapiens..." as a reference genome
- Click "Execute"
- When the job is scheduled, click on HISAT2 again and read the description
- Note: mapping will take a while (~30 min.)!

RNA-seq workflow III - Short read alignment
Visualization of alignments as stacked read sequences: [figure]

RNA-seq workflow III - Short read alignment
- More flexible: genome browsers
  - Visualization of reads, splice patterns, mutations etc.
  - Integration of annotation, public data, known SNPs etc.
- UCSC online genome browser: genome.ucsc.edu
- Downloadable and usable from Galaxy: IGV from Broad Institute* (software.broadinstitute.org/software/igv/)
*Robinson et al., Nature Biotechnology, 2011

RNA-seq workflow III - Short read alignment
Read coverage: number of reads matching a position/region
- Allows statements about gene expression level (RNA-seq)
- High coverage helps to identify genomic variants
- Depends on sequencing depth
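As an illustration, per-position coverage can be computed from the mapped intervals of the reads. The sketch below assumes ungapped reads given as 1-based, inclusive (start, end) pairs on one chromosome; spliced reads would need one interval per aligned block. The function name and the three toy reads are invented for this example.

```python
from typing import List, Tuple


def coverage(reads: List[Tuple[int, int]], region_start: int, region_end: int) -> List[int]:
    """Number of reads covering each position of a region (1-based, inclusive)."""
    depth = [0] * (region_end - region_start + 1)
    for start, end in reads:
        # Clip the read to the region before counting.
        first = max(start, region_start)
        last = min(end, region_end)
        for position in range(first, last + 1):
            depth[position - region_start] += 1
    return depth


toy_alignments = [(1, 5), (3, 8), (4, 6)]
depth = coverage(toy_alignments, 1, 8)
print(depth, sum(depth) / len(depth))  # per-position and mean coverage
```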

RNA-seq workflow III - Short read alignment
- SAM = Sequence Alignment/Map format
  - Human-readable standard format for alignment characterization
  - Contains general information on alignment program/parameters and reference sequence used
  - One entry per alignment with information on location, quality and more
- BAM = binary (compressed) version
- samtools: popular tool for SAM/BAM file manipulation
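Because SAM is tab-delimited text, a single alignment entry can be taken apart with plain Python. The sketch below reads the 11 mandatory SAM fields and the optional TAG:TYPE:VALUE fields and interprets two bits of the FLAG field (0x4 = unmapped, 0x10 = reverse strand). The example line is invented for illustration; for real work, samtools or a dedicated library is the better choice.

```python
from typing import Dict, Optional


def parse_sam_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one SAM alignment line; header lines (starting with '@') give None."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    flag = int(fields[1])
    tags = {}
    for item in fields[11:]:  # optional fields look like NH:i:1
        tag, _value_type, value = item.split(":", 2)
        tags[tag] = value
    return {
        "read_name": fields[0],
        "reference": fields[2],
        "position": int(fields[3]),   # 1-based leftmost mapped position
        "mapping_quality": int(fields[4]),
        "cigar": fields[5],
        "sequence": fields[9],
        "is_unmapped": bool(flag & 0x4),
        "is_reverse": bool(flag & 0x10),
        "tags": tags,
    }


# Invented example entry, not taken from the course data.
EXAMPLE_SAM = "read1\t16\t4\t30720500\t60\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1"
alignment = parse_sam_line(EXAMPLE_SAM)
print(alignment)
```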

RNA-seq workflow III - Short read alignment
Several metrics allow statements about the total sample alignment quality:
- Total number of mapped reads (coverage) and fraction of reads mapping to the genome...
  - ...uniquely: evidence for particular gene/transcript
  - ...multiply: paralogs, CNV, ribosomal RNA, ...
  - ...not at all: contamination, genomic DNA, ...
- Number of mismatches
- Number of novel splice junctions
- ...

RNA-seq workflow III - Short read alignment
Example mapping output:
- Click on the finished job and inspect the mapping statistics
- Click the info icon to access information on the job details, including the version of the software used

Galaxy practical part III - Short read alignment
- Start IGV on your system (search on Desktop)
- Open the .bat file
- Choose "Human Hg38" as a reference genome
- Go to the locus field and enter PCDH7

Galaxy practical part III - Short read alignment
- "Shared Data" > "Data Libraries" > "RNA-Seq_MolBio_Lecture" > "Aligned_Files"
- Import all alignment ("BAM") files into your history
  - Ignore file Aligned_PCDH7-3.bam with size 770.4 Mb
- Go to main view ("Analyze Data")
- Select one alignment file from GFP, one alignment file from PCDH7, and click "display with IGV local"
- Go to IGV, zoom in on the first exon of PCDH7
- Right-click on the data tracks and choose "Collapsed"

RNA-seq workflow IV - Quantification of expression
Gene expression quantification
- Goal: estimate the gene expression level by counting reads that overlap annotated genes
- Figure source: discoveringthegenome.org

RNA-seq workflow IV - Quantification of expression
- Annotations are often available from genome project websites or Ensembl
- Standard formats for annotations are the General Feature Format (GFF) and the Gene Transfer Format (GTF)
- Tab-delimited files with information on gene structures
- 9 fields, including a flexible "attributes" field
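A GTF line can be split into its nine tab-separated columns (seqname, source, feature, start, end, score, strand, frame, attributes) with a few lines of Python. This is a minimal sketch: it only captures attributes written as key "value" pairs, and the example line with its identifiers GENE_A and TX_A1 is invented, not copied from the Ensembl file.

```python
import re
from typing import Dict, NamedTuple, Optional

# Attributes look like: gene_id "GENE_A"; transcript_id "TX_A1";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


class GtfFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: str
    strand: str
    frame: str
    attributes: Dict[str, str]


def parse_gtf_line(line: str) -> Optional[GtfFeature]:
    """Parse one GTF line; comment and empty lines give None."""
    if line.startswith("#") or not line.strip():
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    attributes = dict(ATTRIBUTE.findall(columns[8]))
    return GtfFeature(columns[0], columns[1], columns[2], int(columns[3]),
                      int(columns[4]), columns[5], columns[6], columns[7], attributes)


# Invented example line with hypothetical identifiers.
EXAMPLE_GTF = '4\texample\texon\t1000\t1500\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1"; exon_number "1";'
exon = parse_gtf_line(EXAMPLE_GTF)
print(exon.feature, exon.start, exon.end, exon.attributes["gene_id"])
```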

RNA-seq workflow IV - Quantification of expression
The file we down-/uploaded earlier is an annotation in GTF format for the human genome

RNA-seq workflow IV - Quantification of expression
Standard procedure: count the number of reads that overlap features (here: exons of a gene) and summarize on meta-feature (here: gene) level
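The counting principle can be sketched in a few lines of Python. The example assumes one chromosome, exons given as (gene, start, end) and reads as (start, end) with 1-based inclusive coordinates. Reads overlapping exons of more than one gene are set aside as ambiguous here; that is a choice made for this sketch, and real tools offer several options for such reads. All names and coordinates are invented.

```python
from typing import Dict, List, Tuple

Exon = Tuple[str, int, int]   # (gene identifier, start, end)
Read = Tuple[int, int]        # (start, end)


def count_reads_per_gene(exons: List[Exon], reads: List[Read]) -> Tuple[Dict[str, int], int, int]:
    """Count reads per gene; also return numbers of unassigned and ambiguous reads."""
    counts = {gene: 0 for gene, _, _ in exons}
    unassigned = 0
    ambiguous = 0
    for read_start, read_end in reads:
        # Feature level: which exons does the read overlap?
        # Meta-feature level: collapse the exons to their genes.
        genes_hit = {gene for gene, exon_start, exon_end in exons
                     if read_start <= exon_end and exon_start <= read_end}
        if not genes_hit:
            unassigned += 1
        elif len(genes_hit) > 1:
            ambiguous += 1
        else:
            counts[genes_hit.pop()] += 1  # counted once, even if it spans two exons
    return counts, unassigned, ambiguous


toy_exons = [("geneA", 100, 200), ("geneA", 300, 400), ("geneB", 380, 500)]
toy_reads = [(90, 140), (190, 310), (390, 420), (600, 650)]
gene_counts, n_unassigned, n_ambiguous = count_reads_per_gene(toy_exons, toy_reads)
print(gene_counts, n_unassigned, n_ambiguous)
```

The second toy read spans two exons of geneA but is counted only once, which is the point of summarizing on gene level.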

RNA-seq workflow IV - Quantification of expression
Questions and pitfalls when counting mapped reads:
- Consider multiply mapped reads?
- Count on gene or exon/transcript level?
- How to count partially mapping reads?
- How to treat overlapping features?
- ...

RNA-seq workflow IV - Quantification of expression
- Galaxy@GWDG provides the featureCounts* tool for fast and flexible quantification
- "Transcriptomics" > "Counting" > "featureCounts" and read the description
- Click "Multiple datasets" and select all imported alignment files
- Load the annotation file (the GTF file) from your history
- Click "Execute"
- Quantification should take between 1 and 10 min.
*Liao et al., Bioinformatics, 2014

Galaxy practical part IV - Gene expression quantification
- When any dataset is finished, click on the eye symbol
- Copy the identifier of a gene with >1000 reads assigned and paste it into the Ensembl search window
- Optional: rename files according to alignment input
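With thousands of genes, picking candidates by eye gets tedious. The sketch below lists all genes above a read threshold from a downloaded count table. It assumes a tab-delimited table with the gene identifier in the first column and the count in the last column; the exact layout of the Galaxy output may differ, and the identifiers in the example table are placeholders.

```python
import csv
import io
from typing import List, TextIO, Tuple


def genes_above(handle: TextIO, min_reads: int = 1000) -> List[Tuple[str, int]]:
    """Return (gene, count) pairs with more than min_reads reads, highest first."""
    selected = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip empty and comment lines
        try:
            count = int(row[-1])
        except ValueError:
            continue  # skip the header line
        if count > min_reads:
            selected.append((row[0], count))
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Placeholder table; identifiers and counts are invented.
EXAMPLE_COUNTS = "Geneid\tsample1\nGENE_A\t1520\nGENE_B\t12\nGENE_C\t40311\nGENE_D\t1000\n"
highly_covered = genes_above(io.StringIO(EXAMPLE_COUNTS))
print(highly_covered)
```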

RNA-seq workflow addendum - Summary of quality from multiple samples
- Quality assessment of 6 samples is easy enough to do one by one
- What about more?
- Solution: MultiQC
  - Supports summary logs from multiple software tools, including FastQC, STAR, Bowtie2, featureCounts, etc.
  - Generates a single HTML file, summarizing all results in one interactive report

Galaxy practical addendum - Quality summary (FastQC)

Galaxy practical addendum - Quality summary
Questions?