Short Read Sequencing Analysis Workshop

Day 8: Introduction to RNA-seq Analysis (in-class slides)

Day 7 Homework
1.) 14 GABPA ChIP-seq peaks
2.) Error: Dataset too large (> 100000). Rerun with larger maxsize
    Command should contain maxsize 200000
    The unknown TF is human c-myc

Outline For Today
Brief review of RNA-seq
Discuss TopHat splice-aware aligner
In-class exercise: map human RNA-seq data with TopHat2
Discuss gene quantification
In-class exercise: generate gene-level quantitation using htseq-count

Questions You Can Address With RNA-seq
Catalogue and quantify gene expression
RNA differential expression analysis
Novel transcript discovery; transcriptome assembly

Important Considerations for RNA-seq Libraries
Many different protocols for RNA-seq library preps
What RNA(s) do you want to sequence?
  Remove rRNA by polyA enrichment or rRNA subtraction
What questions do you want to ask?
  Include spike-in controls for better quantification accuracy
  Use longer read lengths for heavily spliced RNAs

How Does Splicing Affect Read Mapping?
[Figure: RNA-seq reads aligned to the RNA sequence and to the genome sequence]
Splicing creates sequences that do not occur in the genome
Reads that span splice junctions will not map end-to-end to the genome
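A toy illustration of the point above, using made-up sequences (the exon and intron strings are hypothetical and far shorter than real ones): a read that spans the exon-exon junction is a substring of the spliced RNA but not of the genome.

```python
# Toy sequences (hypothetical, for illustration only).
exon1 = "ATGGCCAAGT"
intron = "GTAAGTTTTTCTCTCCAG"   # starts with GT and ends with AG
exon2 = "CTGACCGGTA"

genome = exon1 + intron + exon2   # genomic DNA keeps the intron
mrna = exon1 + exon2              # splicing removes it

# A read covering the last 6 bases of exon1 and the first 6 bases of exon2.
junction_read = exon1[-6:] + exon2[:6]
# A read lying entirely inside exon1.
exonic_read = exon1[:8]


def maps_contiguously(read, reference):
    """Return True if the read occurs as an exact, ungapped match."""
    return read in reference


print(maps_contiguously(exonic_read, genome))    # True
print(maps_contiguously(junction_read, mrna))    # True
print(maps_contiguously(junction_read, genome))  # False
```

A splice-aware aligner handles the last case by allowing the read to be split across the intron.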

TopHat Is A Splice-aware Aligner
Developed by Cole Trapnell
Designed to discover splice sites from RNA-seq
Identifies splice junctions from two sources of evidence
  Stitching together independently mapped read segments
  Pairing together coverage islands from continuously mapped reads

Important Note About TopHat2
As of 2/23/16, TopHat2 has entered a low-maintenance, low-support stage, as it is now largely superseded by HISAT2
HISAT2 is more accurate and much more efficient
HISAT2 is a general-purpose DNA and RNA read aligner

Overview of TopHat2
[Flowchart stages: Transcriptome Alignment (optional), Genomic Alignment, Spliced Alignment]
[Flowchart components: Transcriptome Index, Genome Index, Segment Mapping, Coverage Islands, Junction Index]

Coverage Islands Paired To Create Splice Junctions
[Figure: coverage islands, unmapped reads, and the GT and AG splice signals]
Step 1: Map reads to genome using Bowtie
Step 2: Assemble continuous regions
Step 3: Build library of putative splice junctions
Step 4: Map remaining reads to 2kb window around splice junctions
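Steps 2 and 3 can be sketched in a few lines. This is a simplified illustration, not TopHat's implementation: it only merges overlapping read alignments into islands and pairs neighbouring islands, while the real program also checks splice signals such as GT/AG. The read coordinates and the `max_intron` cutoff are hypothetical.

```python
def coverage_islands(alignments):
    """Merge overlapping or touching (start, end) read intervals into islands.

    Coordinates are 0-based and half-open.
    """
    islands = []
    for start, end in sorted(alignments):
        if islands and start <= islands[-1][1]:
            # Read overlaps or touches the current island: extend it.
            islands[-1][1] = max(islands[-1][1], end)
        else:
            islands.append([start, end])
    return [tuple(island) for island in islands]


def putative_junctions(islands, max_intron=2000):
    """Pair each island with the next one if the gap could be an intron."""
    junctions = []
    for (_, left_end), (right_start, _) in zip(islands, islands[1:]):
        if 0 < right_start - left_end <= max_intron:
            junctions.append((left_end, right_start))
    return junctions


# Hypothetical alignments of continuously mapped reads.
reads = [(100, 150), (120, 170), (160, 210),
         (900, 950), (930, 980), (5000, 5050)]
islands = coverage_islands(reads)
print(islands)                      # [(100, 210), (900, 980), (5000, 5050)]
print(putative_junctions(islands))  # [(210, 900)]
```

Reads left unmapped after Step 1 would then be aligned against the sequence flanking each putative junction, as in Step 4.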

TopHat Splice Junction Discovery From Read Segments
Read segment mapping
  Reads of roughly 45bp or longer are broken into segments and mapped
  Stitch segments from the same read that map near one another
  Improves indel discovery and allows detection of gene fusions
[Figure: read segments; donor motifs GT, GC, AT and acceptor motifs AG, AC]
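The idea of segment mapping can be sketched as follows. This is an illustration under stated assumptions, not TopHat's code: the 25-base segment length, the read sequence, and the mapped positions are all chosen for the example. A read is cut into segments, and when two segments of the same read map farther apart on the genome than they sit in the read, the extra distance implies an intron between them.

```python
def split_into_segments(read, segment_length=25):
    """Cut a read into consecutive segments of about segment_length bases.

    Any short remainder is folded into the last segment.
    """
    n_segments = max(1, len(read) // segment_length)
    segments = []
    for i in range(n_segments):
        start = i * segment_length
        end = (i + 1) * segment_length if i < n_segments - 1 else len(read)
        segments.append(read[start:end])
    return segments


def implied_gaps(segments, starts):
    """Report extra genomic distance between consecutive mapped segments.

    starts[i] is the genomic start of segments[i], or None if it did not map.
    Returns (left_segment_index, right_segment_index, extra_bases) tuples.
    """
    mapped = [(i, s) for i, s in enumerate(starts) if s is not None]
    gaps = []
    for (i, start_i), (j, start_j) in zip(mapped, mapped[1:]):
        # Distance between the two segment starts within the read itself.
        offset_in_read = sum(len(seg) for seg in segments[i:j])
        extra = (start_j - start_i) - offset_in_read
        if extra > 0:
            gaps.append((i, j, extra))
    return gaps


read = "ACGTA" * 15                      # hypothetical 75-base read
segments = split_into_segments(read)
print([len(s) for s in segments])        # [25, 25, 25]

# The middle segment spans the junction and fails to map on its own.
print(implied_gaps(segments, [1000, None, 3050]))  # [(0, 2, 2000)]
```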

Important Considerations For Running TopHat2
Has several dependencies
  Appropriate SAMtools and Bowtie modules must be loaded
  Many version compatibility issues for these dependencies
Can run either Bowtie or Bowtie2 (default)
Only performs global, end-to-end Bowtie alignment
Should include read group header information for ID, sample, library type, and platform
Numerous default settings and options to customize

Running TopHat2
General usage statement:
$ tophat2 <options> <index> <singleend.fq>
$ tophat2 <options> <index> \
    <pairedend_1.fq> <pairedend_2.fq>
For paired-end reads, set
  -r/--mate-inner-dist <int>
  --mate-std-dev <int>
Don't forget read group headers
  --rg-id --rg-library --rg-sample --rg-platform

Options For Running TopHat2
To only map to known transcripts (i.e. no novel junctions)
  --no-coverage-search
  --no-novel-juncs
  -G <genes.gtf>
  -T/--transcriptome-only
  --microexon-search
To reduce running time, create a Bowtie index of the transcriptome

Output from TopHat2
TopHat will create several output files and temporary files
TopHat output is written to a directory
  Must make this directory before running tophat
  Give the directory a detailed, unique name
  Use option: -o <directory>
Files
  accepted_hits.bam and unmapped.bam
  junctions.bed
  insertions.bed and deletions.bed

Running TopHat2
We will map a paired-end human RNA-seq dataset
The average inner mate distance is 325bp ± 150
The library is NEBNext dUTP kit
  R1 is reverse and R2 is forward strand
  fr-firststrand
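One way to see how these library details translate into options is to assemble the command line programmatically. The sketch below only builds and prints the argument list; it does not run TopHat2. The inner distance, standard deviation, and library type come from this slide. The index base name, thread count, read-group values, and the R2 file name are placeholders assumed for illustration.

```python
import shlex


def tophat2_command(index, reads_1, reads_2, out_dir, inner_dist, inner_std,
                    library_type, gtf=None, read_group=None, threads=1):
    """Build a tophat2 argument list for one paired-end sample."""
    cmd = ["tophat2",
           "-o", out_dir,
           "-p", str(threads),
           "-r", str(inner_dist),
           "--mate-std-dev", str(inner_std),
           "--library-type", library_type]
    if gtf:
        cmd += ["-G", gtf]
    # read_group keys are expected to be id, sample, library, platform.
    for key, value in sorted((read_group or {}).items()):
        cmd += ["--rg-" + key, value]
    # Index base name first, then mate 1 and mate 2 as separate arguments.
    cmd += [index, reads_1, reads_2]
    return cmd


cmd = tophat2_command(
    index="hg38_chr21",                        # placeholder index base name
    reads_1="FASTQ/Hg_RNA_R1.chr21.fastq",
    reads_2="FASTQ/Hg_RNA_R2.chr21.fastq",     # assumed name of the R2 file
    out_dir="RNA-seq/TopHat/chr21",
    inner_dist=325,
    inner_std=150,
    library_type="fr-firststrand",
    gtf="Hg38.genes.chr21.gtf",
    read_group={"id": "run1", "sample": "Hg_RNA",       # placeholder values
                "library": "dUTP", "platform": "ILLUMINA"},
    threads=4,
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the list first makes it easy to check the options before handing the list to a job script or to `subprocess.run`.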

Running TopHat2
Edits to TopHat.chr21.template.pbs
Change walltime to 45 minutes
Replace <USERNAME> with your username (lines 28, 29)
Change Hg38.refseqGenes.gtf to Hg38.genes.chr21.gtf
Add samtools flagstat command (line 46)
samtools flagstat $TOPHAT/accepted_hits.bam \
    > $TOPHAT/accepted_hits.alignment_stats.txt
Add samtools index command (line 52)
samtools index $TOPHAT/accepted_hits.bam

Running TopHat2
In your Workshop/PBS/ is the script TopHat.chr21.template.pbs
This script will run TopHat2 on human paired-end RNA-seq data
  FASTQ/Hg_RNA_R1.chr21.fastq
  FASTQ/Hg_RNA_R2.chr21.fastq
We will add 2 more commands
  Final TopHat alignment stats
  Create index of TopHat2 output BAM
Submit job; you will see several new files in RNA-seq/TopHat/chr21

Visualize Your TopHat Alignment in IGV
Start up X2Go and open IGV
Make sure you are looking at the Hg38 genome
Load accepted_hits.bam from RNA-seq/TopHat/chr21/

What To Do With Alignment Data
Catalogue and quantify gene expression
  Which genes are expressed or not expressed in a sample
  Which genes are differentially expressed between 2+ samples
Metrics of gene expression from RNA-seq
  Counts: how many reads map to a gene; not normalized
  RPKM/FPKM: reads/fragments per kilobase of transcript per million mapped reads; normalized
  TPM: transcripts per million; normalized

Normalization
Normalization is required to make comparisons in gene expression
  Between 2+ genes in one sample
  Between genes in 2+ samples
Genes will have more reads mapped in a sample with high read coverage than in one with low read coverage
  2x depth gives about 2x the reads with no change in expression
Longer genes will have more reads mapped than shorter genes
  2x length gives about 2x the reads

Normalization: FPKM vs TPM
Gene A: read count = 40; length = 2kb; M = 10
RPKM: divide by millions mapped (40/10) = 4 (RPM); then divide by kilobases (4/2) = 2 RPKM
TPM: divide by kilobases (40/2) = 20 (RPK); then divide by ΣRPK (20/5.5) = 3.63 TPM
Source: StatQuest: RPKM, FPKM and TPM
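The Gene A arithmetic can be reproduced directly. The sketch below uses the slide's numbers as given, including the scaled ΣRPK of 5.5; the variable names are just for illustration. The only difference between the two metrics is the order of the two divisions and what the depth term is.

```python
count = 40               # reads mapped to Gene A
length_kb = 2.0          # gene length in kilobases
millions_mapped = 10.0   # M: total mapped reads, in millions
sum_rpk_scaled = 5.5     # sum of RPK over all genes, scaled as on the slide

# RPKM: normalize for depth first, then for gene length.
rpm = count / millions_mapped
rpkm = rpm / length_kb

# TPM: normalize for gene length first, then for the total of all RPK values.
rpk = count / length_kb
tpm = rpk / sum_rpk_scaled

print(rpm, rpkm)   # 4.0 2.0
print(rpk, tpm)    # 20.0 and about 3.636 (shown as 3.63 on the slide)
```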

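Normalization: FPKM vs TPM
TPM: because you divide every gene by ΣRPK over all genes, the TPM value of a gene is the proportion of length-normalized reads that map to that gene
This gives every sample the same TPM total, so TPM values are directly comparable between samples
RPKM is a scaled value whose total differs from sample to sample
[Figure: Sample 1 vs Sample 2, shown for RPKM = 2 and for TPM = 3.63]
Source: StatQuest: RPKM, FPKM and TPM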
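A small numerical check of this property, using hypothetical counts for three genes in two samples and assuming for simplicity that every mapped read falls in one of those genes: the TPM values add up to one million in both samples, while the RPKM totals differ.

```python
def rpkm(counts, lengths_kb):
    """Reads per kilobase per million mapped reads, one value per gene."""
    per_million = sum(counts) / 1e6
    return [c / per_million / length for c, length in zip(counts, lengths_kb)]


def tpm(counts, lengths_kb):
    """Transcripts per million, one value per gene."""
    rpk = [c / length for c, length in zip(counts, lengths_kb)]
    per_million = sum(rpk) / 1e6
    return [r / per_million for r in rpk]


lengths_kb = [2.0, 4.0, 1.0]   # hypothetical gene lengths
sample1 = [10, 20, 5]          # hypothetical read counts
sample2 = [12, 25, 8]

for name, counts in (("sample1", sample1), ("sample2", sample2)):
    print(name,
          "sum RPKM = %.0f" % sum(rpkm(counts, lengths_kb)),
          "sum TPM = %.0f" % sum(tpm(counts, lengths_kb)))
```

Because each sample's TPM column has the same total, a gene's TPM can be read as its share of that sample's length-normalized signal.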

First Step To Quantification: Read Counts
To calculate RPKM or TPM you first need to know how many reads map to each gene (aka read count)
[Figure: 12 reads (SE); 12 reads (PE = 6 fragments)]
There are many tools available to generate counts from a BAM and annotation file
HTSeq: Python package for sequencing data analysis
Stand-alone scripts: htseq-qa, htseq-count

htseq-count
Usage:
htseq-count <options> <alignments.sam> <genes.gff> > <gene_counts.txt>
Important options:
  -f <file format>            sam | bam
  -r <sort order>             name | pos
  -t <feature>
  -s <library strandedness>   yes | no | reverse
  -a <int>                    ignore reads with mapping quality < <int>
  -m <mode>                   union | intersection-strict | intersection-nonempty
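As a sketch of how these options fit together, the helper below assembles an htseq-count argument list and rejects values the tool would not accept. It only builds and prints the command. The helper name, the default values, and the input file name are assumptions for illustration; `-s reverse` is the setting commonly paired with dUTP (fr-firststrand) libraries.

```python
import shlex

VALID = {
    "format": {"sam", "bam"},
    "order": {"name", "pos"},
    "stranded": {"yes", "no", "reverse"},
    "mode": {"union", "intersection-strict", "intersection-nonempty"},
}


def htseq_count_command(alignments, annotation, fmt="bam", order="name",
                        stranded="reverse", feature="exon", min_qual=10,
                        mode="union"):
    """Build an htseq-count argument list after checking option values."""
    chosen = {"format": fmt, "order": order,
              "stranded": stranded, "mode": mode}
    for name, value in chosen.items():
        if value not in VALID[name]:
            raise ValueError("unsupported %s: %r" % (name, value))
    return ["htseq-count",
            "-f", fmt, "-r", order, "-s", stranded,
            "-t", feature, "-a", str(min_qual), "-m", mode,
            alignments, annotation]


# Hypothetical name-sorted BAM; the GTF name comes from the TopHat exercise.
cmd = htseq_count_command("accepted_hits.name_sorted.bam",
                          "Hg38.genes.chr21.gtf")
print(" ".join(shlex.quote(part) for part in cmd))
```

htseq-count writes its table to standard output, so the printed command would normally be followed by a redirect to a counts file.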

Selecting the HTSeq-count Mode
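The three modes differ in how they combine the sets of features found at each position of a read. The sketch below is a simplified model, not HTSeq's code: each gene is a single hypothetical interval, and reads are plain (start, end) pairs. A read is assigned only when exactly one feature remains after combining.

```python
def features_at(pos, features):
    """Names of all features covering a position (0-based, half-open)."""
    return {name for name, (start, end) in features.items()
            if start <= pos < end}


def assign(read, features, mode="union"):
    """Assign a non-empty read interval to a feature under the given mode."""
    start, end = read
    per_base = [features_at(p, features) for p in range(start, end)]
    if mode == "union":
        hits = set().union(*per_base)
    elif mode == "intersection-strict":
        hits = set.intersection(*per_base)
    elif mode == "intersection-nonempty":
        nonempty = [s for s in per_base if s]
        hits = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError("unknown mode: %r" % mode)
    if not hits:
        return "__no_feature"
    if len(hits) > 1:
        return "__ambiguous"
    return next(iter(hits))


# Two hypothetical, partially overlapping genes.
genes = {"gene_A": (100, 200), "gene_B": (180, 300)}
overhanging = (90, 120)    # starts before gene_A
in_overlap = (150, 190)    # inside gene_A, partly inside gene_B

for mode in ("union", "intersection-strict", "intersection-nonempty"):
    print(mode, assign(overhanging, genes, mode), assign(in_overlap, genes, mode))
```

In this toy case, union calls the read in the overlap ambiguous, while intersection-strict drops the overhanging read; intersection-nonempty assigns both reads to gene_A.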

Important Notes About htseq-count
htseq-count requires several dependencies
module load htseq_0.6.1
module load python_2.7.3
module load numpy_1.9.2
module load pysam_0.8.4

Running htseq-count with TopHat2 Results
In your Workshop/PBS/ is HTseq-counts.chr21.template.pbs
This script will take the output from TopHat, sort the BAM file by read name, and run htseq-count on this new BAM
Edits:
  Replace <USERNAME> with your username (line 28)
  Add in the appropriate path to the TOPHAT path variable (line 29)
Output:
  Hg38.genes.chr21.counts.txt in RNA-seq/TopHat/chr21/
  Run a head and tail on this file

The End
Questions??
Don't forget the homework.
Homework questions will provide additional practice with the ChIP-seq pipeline
Watch Day 8 videos for introduction to RNA-seq analysis
Help sessions: 10-11:30am JSCBB B231

Acknowledgements
Workshop Coordinators: Jamie Prior Kershner and Jessica Vera
Funding: BioFrontiers Institute and Colorado Office of Economic Development and International Trade
Additional Acknowledgments
Compute Resources: BioFrontiers IT Staff
Robin Dowell and Dowell Lab
2016