Short Read Sequencing Analysis Workshop
  Day 8: Introduction to RNA-seq Analysis
  In-class slides
Day 7 Homework
  1.) 14 GABPA ChIP-seq peaks
  2.) Error: Dataset too large (> 100000). Rerun with larger maxsize
      Command should contain maxsize 200000
  The unknown TF is human c-myc
Outline For Today
  Brief review of RNA-seq
  Discuss the TopHat splice-aware aligner
  In-class exercise: map human RNA-seq data with TopHat2
  Discuss gene quantification
  In-class exercise: generate gene-level quantitation using htseq-count
Questions You Can Address With RNA-seq
  Catalogue and quantify gene expression
  Differential expression analysis
  Novel transcript discovery; transcriptome assembly
Important Considerations for RNA-seq Libraries
  Many different protocols for RNA-seq library preps
  What RNA(s) do you want to sequence?
    Remove rRNA by polyA enrichment or rRNA subtraction
  What questions do you want to ask?
    Include spike-in controls for better quantification accuracy
    Use longer read lengths for heavily spliced RNAs
How Does Splicing Affect Read Mapping
  [Figure labels: RNA Sequence; RNA-seq reads; Genome Sequence]
  Splicing creates sequences that do not occur in the genome
  Reads that span splice junctions will not map to the genome
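A toy illustration of this point, as a minimal sketch: the exon and intron strings below are invented for the example (they are not real genomic sequences), and "mapping" is reduced to an exact substring match rather than a real aligner.

```python
# Toy example: a read that spans a splice junction has no contiguous
# match in the genome. All sequences are hypothetical.
exon1 = "ATGGCCAAGTTC"
intron = "GTAAGTCCTTGACCTTTTCAG"   # begins with GT, ends with AG
exon2 = "GGATCCTTGAAA"

genome = exon1 + intron + exon2    # genomic DNA still contains the intron
mrna = exon1 + exon2               # splicing removes the intron


def maps_contiguously(read, reference):
    """Return True if the read occurs as an exact, ungapped substring."""
    return read in reference


exonic_read = exon1[:10]                 # lies entirely within exon 1
junction_read = exon1[-6:] + exon2[:6]   # spans the exon 1 / exon 2 junction

print(maps_contiguously(exonic_read, genome))     # True
print(maps_contiguously(junction_read, mrna))     # True
print(maps_contiguously(junction_read, genome))   # False
```

The junction read matches the spliced RNA but not the genome, which is why an aligner that only looks for contiguous matches leaves such reads unmapped.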
TopHat Is A Splice-aware Aligner
  Developed by Cole Trapnell
  Designed to discover splice sites from RNA-seq
  Identifies splice junctions from two sources of evidence
    Stitching together independently mapped read segments
    Pairing together coverage islands from continuously mapped reads
Important Note About TopHat2
  As of 2/23/16 TopHat2 has entered a low maintenance, low support stage as it is now largely superseded by HISAT2
  HISAT2 is more accurate and much more efficient
  HISAT2 is a general-purpose DNA and RNA read aligner
Overview of TopHat2
  [Workflow figure labels: Transcriptome Alignment (optional); Genomic Alignment; Spliced Alignment; Transcriptome Index; Genome Index; Segment Mapping; Coverage Islands; Junction Index]
Coverage Islands Paired To Create Splice Junctions
  [Figure labels: Unmapped reads; GT; AG]
  Step 1: Map reads to genome using Bowtie
  Step 2: Assemble continuous regions
  Step 3: Build library of putative splice junctions
  Step 4: Map remaining reads to 2kb window around splice junctions
TopHat Splice Junction Discovery From Read Segments
  Read segment mapping
    Reads ≥45bp are broken into segments and mapped
    Stitch segments from the same read that map near one another
    Improves indel discovery and allows detection of gene fusions
  [Figure labels: Read Segments; GT; GC; AT; AG; AC]
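The segment idea can be sketched in a few lines. This is a simplified illustration and not TopHat's actual implementation: segments are mapped by exact string search, the sequences are hypothetical, and the 8-base segment length is chosen only to keep the toy example short.

```python
def split_into_segments(read, segment_length):
    """Cut a read into consecutive segments; the last one may be shorter."""
    return [read[i:i + segment_length]
            for i in range(0, len(read), segment_length)]


def find_junctions(read, genome, segment_length):
    """Map each segment by exact match and report gaps between neighbours.

    Returns (intron_start, intron_end) pairs in 0-based, end-exclusive
    genome coordinates.
    """
    segments = split_into_segments(read, segment_length)
    positions = [genome.find(seg) for seg in segments]  # -1 means unmapped
    junctions = []
    for i in range(len(segments) - 1):
        left, right = positions[i], positions[i + 1]
        if left == -1 or right == -1:
            continue
        left_end = left + len(segments[i])
        if right > left_end:  # neighbouring segments are separated by a gap
            junctions.append((left_end, right))
    return junctions


# Hypothetical sequences: two exons separated by a GT...AG intron.
exon1 = "ATGGCCAAGTTCACGA"
intron = "GTAAGTCCTTGACCTTTTCAG"
exon2 = "GGATCCTTGAAACTGC"
genome = exon1 + intron + exon2
read = exon1[-8:] + exon2[:8]          # spans the splice junction

junctions = find_junctions(read, genome, segment_length=8)
start, end = junctions[0]
print(read in genome)                            # False: no contiguous match
print(junctions)                                 # [(16, 37)]
print(genome[start:start + 2], genome[end - 2:end])  # GT AG
```

Each half of the read maps on its own, and the gap between the two mapped segments recovers the intron boundaries with the GT and AG dinucleotides at its ends.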
Important Considerations For Running TopHat2
  Has several dependencies
    Appropriate SAMtools and Bowtie modules must be loaded
    Many version compatibility issues for these dependencies
  Can run either Bowtie or Bowtie2 (default)
    Only performs global, end-to-end Bowtie alignment
  Should include read group header information for ID, sample, library type, and platform
  Numerous default settings and options to customize
Running TopHat2
  General usage statement:
    $ tophat2 <options> <index> <singleend.fq>
    $ tophat2 <options> <index> \
        <pairedend_1.fq> <pairedend_2.fq>
  For paired-end reads you must include
    -r/--mate-inner-dist <int>
    --mate-std-dev <int>
  Don't forget read group headers
    --rg-id
    --rg-library
    --rg-sample
    --rg-platform
Options For Running TopHat2
  To only map to known transcripts (i.e. no novel junctions)
  --no-coverage-search
  --no-novel-juncs
  -G <genes.gtf>
  -T/--transcriptome-only
  --microexon-search
  [Figure label: Island]
  To reduce running time create a Bowtie index of the transcriptome
Output from TopHat2
  TopHat will create several output files and temporary files
  TopHat output is written to a directory
    Must make this directory before running tophat
    Give the directory a detailed, unique name
    Use option: -o <directory>
  Files
    accepted_hits.bam and unmapped.bam
    junctions.bed
    insertions.bed and deletions.bed
Running TopHat2
  We will map a paired-end human RNA-seq dataset
  The average inner mate distance is 325bp ± 150
  The library is NEBNext dUTP kit
    R1 is reverse and R2 is forward strand
    Library type: fr-firststrand
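One way to see how these dataset parameters become command-line options is to assemble the command as a list. This is a sketch only: the index name, FASTQ names, output directory, GTF path, and read group values below are hypothetical placeholders, and the command is printed rather than executed. The option names (`-o`, `-G`, `-r`, `--mate-std-dev`, `--library-type`, `--rg-*`) are TopHat2 options.

```python
import shlex


def build_tophat2_command(index, reads_1, reads_2, output_dir, gtf,
                          inner_dist, inner_dist_sd, library_type, read_group):
    """Return a tophat2 command line as a list of arguments (not executed)."""
    cmd = [
        "tophat2",
        "-o", output_dir,
        "-G", gtf,
        "-r", str(inner_dist),
        "--mate-std-dev", str(inner_dist_sd),
        "--library-type", library_type,
        "--rg-id", read_group["id"],
        "--rg-library", read_group["library"],
        "--rg-sample", read_group["sample"],
        "--rg-platform", read_group["platform"],
    ]
    # Positional arguments: Bowtie index, then the two mate files.
    cmd += [index, reads_1, reads_2]
    return cmd


cmd = build_tophat2_command(
    index="hg38_index",                    # hypothetical index basename
    reads_1="reads_R1.fastq",              # hypothetical file names
    reads_2="reads_R2.fastq",
    output_dir="tophat_out",               # hypothetical output directory
    gtf="genes.gtf",                       # hypothetical annotation file
    inner_dist=325,                        # from the slide
    inner_dist_sd=150,                     # from the slide
    library_type="fr-firststrand",         # from the slide
    read_group={"id": "rg1", "library": "lib1",
                "sample": "sample1", "platform": "ILLUMINA"},  # hypothetical
)
print(shlex.join(cmd))
```

Building the argument list first makes it easy to check each value before pasting the printed command into a PBS script.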
Running TopHat2
  Edits to TopHat.chr21.template.pbs
    Change walltime to 45 minutes
    Replace <USERNAME> with your username (lines 28, 29)
    Change Hg38.refseqGenes.gtf to Hg38.genes.chr21.gtf
    Add samtools flagstat command (line 46)
      samtools flagstat $TOPHAT/accepted_hits.bam \
        > $TOPHAT/accepted_hits.alignment_stats.txt
    Add samtools index command (line 52)
      samtools index $TOPHAT/accepted_hits.bam
Running TopHat2
  In your Workshop/PBS/ is the script TopHat.chr21.template.pbs
  This script will run TopHat2 on human paired-end RNA-seq data
    FASTQ/Hg_RNA_R1.chr21.fastq
    FASTQ/Hg_RNA_R2.chr21.fastq
  We will add 2 more commands
    Final TopHat alignment stats
    Create index of TopHat2 output BAM
  Submit job; you will see several new files in RNA-seq/TopHat/chr21
Visualize Your TopHat Alignment in IGV
  Start up X2Go and open IGV
  Make sure you are looking at the Hg38 genome
  Load accepted_hits.bam from RNA-seq/TopHat/chr21/
What To Do With Alignment Data
  Catalogue and quantify gene expression
    Which genes are expressed or not expressed in a sample
    Which genes are differentially expressed between 2+ samples
  Metrics of gene expression from RNA-seq
    Counts: how many reads map to a gene; not normalized
    RPKM/FPKM: reads/fragments per kilobase of transcript per million mapped reads; normalized
    TPM: transcripts per million; normalized
Normalization
  Normalization is required to make comparisons in gene expression
    Between 2+ genes in one sample
    Between genes in 2+ samples
  Genes will have more reads mapped in a sample with high read coverage than in one with low read coverage
    2x depth 2x expression
  Longer genes will have more reads mapped than shorter genes
    2x length 2x more reads
Normalization: FPKM vs TPM
  Gene A: read count = 40; length = 2kb; M = 10
  RPKM
    Divide by millions mapped (40/10) = 4 (RPM)
    Divide by kilobases (4/2) = 2 RPKM
  TPM
    Divide by kilobases (40/2) = 20 (RPK)
    Divide by ΣRPK (20/5.5) = 3.63 TPM
  Source: StatQuest: RPKM, FPKM and TPM
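The Gene A arithmetic can be checked with a short sketch. The function and parameter names are illustrative; the values 40, 2, 10, and 5.5 come from the slide, and the scaling factors are used in the same units the slide uses.

```python
def rpkm(read_count, length_kb, mapped_scaled):
    """Normalize for sequencing depth first, then for gene length."""
    rpm = read_count / mapped_scaled      # reads per (scaled) million
    return rpm / length_kb


def tpm(read_count, length_kb, sum_rpk_scaled):
    """Normalize for gene length first, then for the sum of all RPK values."""
    rpk = read_count / length_kb          # reads per kilobase
    return rpk / sum_rpk_scaled


gene_a_rpkm = rpkm(read_count=40, length_kb=2, mapped_scaled=10)
gene_a_tpm = tpm(read_count=40, length_kb=2, sum_rpk_scaled=5.5)

print(gene_a_rpkm)            # 2.0
print(round(gene_a_tpm, 3))   # 3.636 (shown as 3.63 on the slide)
```

The two metrics use the same two divisions; only the order differs, and TPM replaces the total read count with the total of the length-normalized counts.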
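Normalization: FPKM vs TPM
  TPM: because you divide all genes by ΣRPK over all genes, the TPM value of a gene is the proportion of length-normalized reads that map to that gene
  TPM values sum to the same total in every sample, so they are directly comparable between samples
  RPKM is a scaled value whose total differs from sample to sample
  [Figure labels: Sample 1, Sample 2, RPKM = 2; Sample 1, Sample 2, TPM = 3.63]
  Source: StatQuest: RPKM, FPKM and TPM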
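A small two-sample comparison makes the difference concrete. The gene lengths and read counts below are made up for illustration, and the total mapped reads per sample is taken to be the sum of the listed counts, which is a simplification.

```python
import math


def rpkm_values(counts, lengths_kb):
    """RPKM per gene: scale by total reads (in millions), then by length."""
    per_million = sum(counts) / 1_000_000
    return [c / per_million / l for c, l in zip(counts, lengths_kb)]


def tpm_values(counts, lengths_kb):
    """TPM per gene: scale by length, then by the sum of all RPK (in millions)."""
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    scale = sum(rpk) / 1_000_000
    return [r / scale for r in rpk]


lengths_kb = [2, 4, 1, 10]          # hypothetical gene lengths
sample_1 = [10, 20, 5, 0]           # hypothetical read counts
sample_2 = [12, 25, 8, 0]

for name, counts in [("sample 1", sample_1), ("sample 2", sample_2)]:
    total_rpkm = sum(rpkm_values(counts, lengths_kb))
    total_tpm = sum(tpm_values(counts, lengths_kb))
    print(name, round(total_rpkm), round(total_tpm))
```

The TPM column totals one million in both samples, while the RPKM totals differ, so a given RPKM value does not represent the same share of the library in each sample.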
First Step To Quantification: Read Counts
  To calculate RPKM or TPM you first need to know how many reads map to each gene (aka read count)
  [Figure labels: 12 reads (SE); 12 reads (PE = 6 fragments)]
  There are many tools available to generate counts from a BAM and annotation file
  HTSeq: Python package for sequencing data analysis
    Stand-alone scripts: htseq-qa, htseq-count
htseq-count
  Usage:
    htseq-count <options> <alignments.sam> <genes.gff> > <gene_counts.txt>
  Important options:
    -f <file format>           sam | bam
    -r <sort order>            name | pos
    -t <feature>
    -s <library strandedness>  yes | no | reverse
    -a <int>                   ignore reads with mapping quality < <int>
    -m <mode>                  union | intersection-strict | intersection-nonempty
Selecting the HTSeq-count Mode
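The three modes differ in how they combine the features found at each position a read covers. The sketch below follows the set-based description in the HTSeq documentation; it is an illustration, not HTSeq's own code, and the function name and the list-of-sets input are assumptions made for this example.

```python
def assign_read(feature_sets, mode="union"):
    """Assign a read to a feature given the feature set at each covered position.

    feature_sets: list of sets, one per position (or block) the read overlaps.
    Returns a feature name, "__no_feature", or "__ambiguous".
    """
    if not feature_sets:
        return "__no_feature"
    if mode == "union":
        candidates = set().union(*feature_sets)
    elif mode == "intersection-strict":
        candidates = set.intersection(*feature_sets)
    elif mode == "intersection-nonempty":
        nonempty = [s for s in feature_sets if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if len(candidates) == 1:
        return next(iter(candidates))
    return "__no_feature" if not candidates else "__ambiguous"


modes = ["union", "intersection-strict", "intersection-nonempty"]

# Read lies partly in gene_A and partly in unannotated sequence.
partly_outside = [{"gene_A"}, set()]
# Read lies fully in gene_A and partly in an overlapping gene_B.
overlapping = [{"gene_A"}, {"gene_A", "gene_B"}]

print([assign_read(partly_outside, m) for m in modes])
print([assign_read(overlapping, m) for m in modes])
```

For the first read only `intersection-strict` reports no feature; for the second read only `union` reports an ambiguous assignment, which is the kind of trade-off to weigh when choosing `-m`.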
Important Notes About htseq-count
  htseq-count requires several dependencies
    module load htseq_0.6.1
    module load python_2.7.3
    module load numpy_1.9.2
    module load pysam_0.8.4
Running htseq-count with TopHat2 Results
  In your Workshop/PBS/ is HTseq-counts.chr21.template.pbs
  This script will take the output from TopHat, sort the BAM file by read name, and run htseq-count on this new BAM
  Edits:
    Replace <USERNAME> with your username (line 28)
    Add in the appropriate path to the TOPHAT path variable (line 29)
  Output: Hg38.genes.chr21.counts.txt in RNA-seq/TopHat/chr21/
    Run a head and tail on this file
The End
  Questions??
  Don't forget the homework.
    Homework questions will provide additional practice with the ChIP-seq pipeline
    Watch Day 8 videos for introduction to RNA-seq analysis
  Help sessions: 10-11:30am JSCBB B231
Acknowledgements
  Workshop Coordinators: Jamie Prior Kershner and Jessica Vera
  Funding: BioFrontiers Institute and Colorado Office of Economic Development and International Trade
  Additional Acknowledgments
    Compute Resources: BioFrontiers IT Staff
    Robin Dowell and Dowell Lab
  2016