Short Read Sequencing Analysis Workshop

Short Read Sequencing Analysis Workshop
Day 8: Introduction to RNA-seq Analysis
In-class slides

Day 7 Homework
1.) 14 GABPA ChIP-seq peaks
2.) Error: Dataset too large (> 100000). Rerun with larger maxsize
    Command should contain maxsize 200000
The unknown TF is human c-myc

Outline For Today
- Brief review of RNA-seq
- Discuss TopHat splice-aware aligner
- In-class exercise: map human RNA-seq data with TopHat2
- Discuss gene quantification
- In-class exercise: generate gene-level quantitation using htseq-count

Questions You Can Address With RNA-seq
- Catalogue and quantify gene expression
- RNA differential expression analysis
- Novel transcript discovery; transcriptome assembly

Important Considerations for RNA-seq Libraries
- Many different protocols for RNA-seq library preps
- What RNA(s) do you want to sequence?
- Remove rRNA by polyA enrichment or rRNA subtraction
- What questions do you want to ask?
- Include spike-in controls for better quantification accuracy
- Use longer read lengths for heavily spliced RNAs

How Does Splicing Affect Read Mapping
[Figure labels: RNA sequence, RNA-seq reads, genome sequence]
- Splicing creates sequences that do not occur in the genome
- Reads that span splice junctions will not map to the genome as one contiguous alignment
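
The point above can be illustrated with a toy example. This is only a sketch: the exon and intron sequences are made up, and a plain substring test stands in for an unspliced, end-to-end aligner.

```python
# Toy sequences (hypothetical, for illustration only).
EXON1 = "ATGGCCAAGT"
INTRON = "GTAAGTCCCTTTTCTAG"   # begins with GT and ends with AG
EXON2 = "CGATTGACCA"

genome = EXON1 + INTRON + EXON2   # what the aligner indexes
transcript = EXON1 + EXON2        # mature RNA after splicing


def maps_contiguously(read: str, reference: str) -> bool:
    """Return True if the read occurs as one unbroken substring."""
    return read in reference


exonic_read = EXON1[2:10]                # lies entirely inside exon 1
junction_read = EXON1[-5:] + EXON2[:5]   # spans the exon 1 / exon 2 junction

print("exonic read vs genome:      ", maps_contiguously(exonic_read, genome))
print("junction read vs genome:    ", maps_contiguously(junction_read, genome))
print("junction read vs transcript:", maps_contiguously(junction_read, transcript))
```

The junction read is found in the transcript but not in the genome, which is why a splice-aware aligner has to split it across the intron.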

TopHat Is A Splice-aware Aligner
- Developed by Cole Trapnell
- Designed to discover splice sites from RNA-seq
- Identifies splice junctions from two sources of evidence
  - Stitching together independently mapped read segments
  - Pairing together coverage islands from continuously mapped reads

Important Note About TopHat2
- As of 2/23/16 TopHat2 has entered a low maintenance, low support stage as it is now largely superseded by HISAT2
- HISAT2 is more accurate and much more efficient
- HISAT2 is a general purpose DNA and RNA read aligner

Overview of TopHat2
[Figure labels: Transcriptome Alignment (optional), Genomic Alignment, Spliced Alignment, Transcriptome Index, Genome Index, Segment Mapping, Coverage Islands, Junction Index]

Coverage Islands Paired To Create Splice Junctions
[Figure labels: Unmapped reads, GT, AG]
- Step 1: Map reads to genome using Bowtie
- Step 2: Assemble continuous regions
- Step 3: Build library of putative splice junctions
- Step 4: Map remaining reads to 2kb window around splice junctions

TopHat Splice Junction Discovery From Read Segments
- Read segment mapping
- Reads ≥ 45bp are broken into segments and mapped
- Stitch segments from same read that map near one another
- Improves indel discovery and allows detection of gene fusions
[Figure labels: Read Segments; GT, GC, AT, AG, AC]

Important Considerations For Running TopHat2
- Has several dependencies
- Appropriate SAMtools and Bowtie modules must be loaded
- Many version compatibility issues for these dependencies
- Can run either Bowtie or Bowtie2 (default)
- Only performs global, end-to-end Bowtie alignment
- Should include read group header information for ID, sample, library type, and platform
- Numerous default settings and options to customize

Running TopHat2
General usage statement:
    $ tophat2 <options> <index> <singleend.fq>
    $ tophat2 <options> <index> \
        <pairedend_1.fq> <pairedend_2.fq>
For paired-end reads you must include:
    -r/--mate-inner-dist <int>
    --mate-std-dev <int>
Don't forget read group headers:
    --rg-id
    --rg-library
    --rg-sample
    --rg-platform

Options For Running TopHat2
    --no-coverage-search
    --no-novel-juncs
    -G <genes.gtf>
    -T/--transcriptome-only
    --microexon-search
- To only map to known transcripts (i.e. no novel junctions)
- To reduce running time, create a Bowtie index of the transcriptome

Output from TopHat2
- TopHat will create several output files and temporary files
- TopHat output is written to a directory
- Must make this directory before running tophat
- Give the directory a detailed, unique name
- Use option: -o <directory>
- Files
    accepted_hits.bam and unmapped.bam
    junctions.bed
    insertions.bed and deletions.bed

Running TopHat2
- We will map a paired-end human RNA-seq dataset
- The average inner mate distance is 325bp ± 150
- The library is NEBNext dUTP kit
- R1 is reverse and R2 is forward strand
- Library type: fr-firststrand
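
As a rough sketch of how these dataset parameters translate into a command line, the snippet below assembles a tophat2 invocation as an argument list and prints it without running anything. The index prefix, output directory, read group values, annotation path, and the R2 file name are placeholders assumed for illustration; `--library-type` is the TopHat option that takes a value such as fr-firststrand, and the remaining flags are the ones named on the earlier slides.

```python
import shlex


def build_tophat2_command(index, reads_1, reads_2, out_dir, gtf,
                          inner_dist, std_dev, library_type, read_group):
    """Assemble (but do not run) a paired-end tophat2 command."""
    cmd = [
        "tophat2",
        "-o", out_dir,
        "-r", str(inner_dist),            # -r/--mate-inner-dist
        "--mate-std-dev", str(std_dev),
        "--library-type", library_type,
        "-G", gtf,
    ]
    # Read group header fields: --rg-id, --rg-sample, --rg-library, --rg-platform
    for key, value in read_group.items():
        cmd += [f"--rg-{key}", value]
    # Positional arguments: index prefix, then mate 1 and mate 2 files.
    cmd += [index, reads_1, reads_2]
    return cmd


cmd = build_tophat2_command(
    index="Hg38_chr21_index",                 # placeholder index prefix
    reads_1="FASTQ/Hg_RNA_R1.chr21.fastq",
    reads_2="FASTQ/Hg_RNA_R2.chr21.fastq",    # assumed mate file name
    out_dir="RNA-seq/TopHat/chr21",
    gtf="Hg38.genes.chr21.gtf",
    inner_dist=325,
    std_dev=150,
    library_type="fr-firststrand",
    read_group={"id": "run1", "sample": "Hg_RNA",       # placeholder values
                "library": "NEBNext_dUTP", "platform": "Illumina"},
)
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to inspect each flag before pasting the printed line into a PBS script.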

Running TopHat2
Edits to TopHat.chr21.template.pbs
- Change walltime to 45 minutes
- Replace <USERNAME> with your username (lines 28, 29)
- Change Hg38.refseqGenes.gtf to Hg38.genes.chr21.gtf
- Add samtools flagstat command (line 46)
    samtools flagstat $TOPHAT/accepted_hits.bam \
        > $TOPHAT/accepted_hits.alignment_stats.txt
- Add samtools index command (line 52)
    samtools index $TOPHAT/accepted_hits.bam

Running TopHat2
- In your Workshop/PBS/ is the script TopHat.chr21.template.pbs
- This script will run TopHat2 on human paired-end RNA-seq data
    FASTQ/Hg_RNA_R1.chr21.fastq
    FASTQ/Hg_RNA_R2.chr21.fastq
- We will add 2 more commands
  - Final TopHat alignment stats
  - Create index of TopHat2 output BAM
- Submit job; you will see several new files in RNA-seq/TopHat/chr21

Visualize Your TopHat Alignment in IGV
- Start up X2Go and open IGV
- Make sure you are looking at Hg38 genome
- Load accepted_hits.bam from RNA-seq/TopHat/chr21/

What To Do With Alignment Data
- Catalogue and quantify gene expression
  - Which genes are expressed or not expressed in a sample
  - Which genes are differentially expressed between 2+ samples
- Metrics of gene expression from RNA-seq
  - Counts: how many reads map to a gene; not normalized
  - RPKM/FPKM: reads/fragments per kilobase per million mapped reads; normalized
  - TPM: transcripts per million; normalized

Normalization
- Normalization is required to make comparisons in gene expression
  - Between 2+ genes in one sample
  - Between genes in 2+ samples
- Genes will have more reads mapped in a sample with high coverage than in one with low read coverage
  - 2x depth -> 2x apparent expression
- Longer genes will have more reads mapped than shorter genes
  - 2x length -> 2x more reads

Normalization FPKM vs TPM
Gene A; read count = 40; length = 2kb; M = 10
RPKM:
  1. Divide by millions mapped (40/10) = 4 (RPM)
  2. Divide by kilobases (4/2) = 2 RPKM
TPM:
  1. Divide by kilobases (40/2) = 20 (RPK)
  2. Divide by ΣRPK (20/5.5) = 3.63 TPM
Source: StatQuest: RPKM, FPKM and TPM
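
The two orders of operation can be checked with a few lines of Python. This is a sketch under stated assumptions: the slide only gives Gene A, so genes B and C below are invented so that the totals match the slide (100 reads in total, ΣRPK = 55), and the "per million" factor is replaced by a toy factor of 10 so that M = 10 and the scaled ΣRPK is 5.5.

```python
SCALE = 10  # toy stand-in for 1_000_000, chosen to match the slide's M = 10

# gene -> (read count, length in kb); B and C are hypothetical
genes = {
    "A": (40, 2.0),
    "B": (30, 1.0),
    "C": (30, 6.0),
}


def rpkm(genes, scale=SCALE):
    """Depth first, then length: count / (total / scale) / kb."""
    depth = sum(count for count, _ in genes.values()) / scale
    return {g: (count / depth) / kb for g, (count, kb) in genes.items()}


def tpm(genes, scale=SCALE):
    """Length first, then depth: RPK / (sum of RPK / scale)."""
    rpk = {g: count / kb for g, (count, kb) in genes.items()}
    denom = sum(rpk.values()) / scale
    return {g: value / denom for g, value in rpk.items()}


rpkm_values = rpkm(genes)
tpm_values = tpm(genes)
print("RPKM:", {g: round(v, 2) for g, v in rpkm_values.items()})
print("TPM: ", {g: round(v, 2) for g, v in tpm_values.items()})
print("sum of TPM:", round(sum(tpm_values.values()), 2))
```

With these inputs Gene A comes out at 2 RPKM and 20/5.5 TPM, and the TPM values add up to the scale factor, which is the property that makes TPM comparable across samples.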

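Normalization FPKM vs TPM
- TPM: because you divide every gene by ΣRPK over all genes, the TPM value of a gene is the proportion of length-normalized reads that map to that gene
- TPM values sum to the same total in every sample, so they can be compared between samples
- RPKM is a scaled value whose total differs from sample to sample
[Figure: Sample 1 vs Sample 2, shown once with RPKM = 2 and once with TPM = 3.63]
Source: StatQuest: RPKM, FPKM and TPM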

First Step To Quantification: Read Counts
- To calculate RPKM or TPM you first need to know how many reads map to each gene (aka read count)
[Figure: 12 reads (SE); 12 reads (PE = 6 fragments)]
- There are many tools available to generate counts from a BAM and annotation file
- HTSeq: Python package for sequencing data analysis
- Stand-alone scripts: htseq-qa, htseq-count

htseq-count
Usage:
    htseq-count <options> <alignments.sam> <genes.gff> > <gene_counts.txt>
Important options:
    -f <file format>            sam | bam
    -r <sort order>             name | pos
    -t <feature type>
    -s <library strandedness>   yes | no | reverse
    -a <int>                    ignore reads with mapping quality < <int>
    -m <mode>                   union | intersection-strict | intersection-nonempty

SelecAng the HTSeq-count Mode
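
The three modes differ in how the sets of features overlapped at each position of a read are combined. The sketch below is a simplified reimplementation of that rule for illustration, not HTSeq's own code: `assign` and its input format (one set of gene names per aligned base) are assumptions, and details such as strandedness, quality filtering, and multi-mapping are ignored.

```python
def assign(per_base_sets, mode="union"):
    """Decide which gene a read is counted for.

    per_base_sets: list of sets, one per aligned position of the read,
    each holding the names of the genes annotated at that position.
    """
    if mode == "union":
        combined = set().union(*per_base_sets)
    elif mode == "intersection-strict":
        combined = set.intersection(*per_base_sets) if per_base_sets else set()
    elif mode == "intersection-nonempty":
        nonempty = [s for s in per_base_sets if s]
        combined = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not combined:
        return "__no_feature"
    if len(combined) > 1:
        return "__ambiguous"
    return next(iter(combined))


# A read that hangs off the end of gene_A into unannotated sequence.
overhang = [{"gene_A"}, {"gene_A"}, set()]
# A read inside gene_A whose last base also overlaps gene_B.
partial_overlap = [{"gene_A"}, {"gene_A", "gene_B"}]

for mode in ("union", "intersection-strict", "intersection-nonempty"):
    print(mode, assign(overhang, mode), assign(partial_overlap, mode))
```

In this toy case union counts the overhanging read but calls the partially overlapping read ambiguous, intersection-strict does the reverse, and intersection-nonempty counts both for gene_A.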

Important Notes About htseq-count
htseq-count requires several dependencies:
    module load htseq_0.6.1
    module load python_2.7.3
    module load numpy_1.9.2
    module load pysam_0.8.4

Running htseq-count with TopHat2 Results
- In your Workshop/PBS/ is HTseq-counts.chr21.template.pbs
- This script will take the output from TopHat, sort the BAM file by read name, and run htseq-count on this new BAM
- Edits:
  - Replace <USERNAME> with your username (line 28)
  - Add in the appropriate path to the TOPHAT path variable (line 29)
- Output: Hg38.genes.chr21.counts.txt in RNA-seq/TopHat/chr21/
- Run a head and tail on this file
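
When looking at the head and tail of the counts file, it helps to know that htseq-count writes one tab-separated line per gene and then appends special counters whose names start with a double underscore (for example `__no_feature` and `__ambiguous`). The sketch below separates the two; the function name and the toy table (gene names and numbers) are made up for illustration and are not real results.

```python
import io


def split_counts(handle):
    """Split htseq-count style output into gene counts and summary counters."""
    gene_counts, summary = {}, {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        name, value = line.split("\t")
        target = summary if name.startswith("__") else gene_counts
        target[name] = int(value)
    return gene_counts, summary


# Hypothetical content standing in for Hg38.genes.chr21.counts.txt
toy_table = io.StringIO(
    "GENE_1\t120\n"
    "GENE_2\t0\n"
    "GENE_3\t35\n"
    "__no_feature\t40\n"
    "__ambiguous\t5\n"
)

gene_counts, summary = split_counts(toy_table)
expressed = [g for g, n in gene_counts.items() if n > 0]
print("genes with reads:", expressed)
print("reads counted for genes:", sum(gene_counts.values()))
print("summary counters:", summary)
```

For the real file, the same function could be called with `open("Hg38.genes.chr21.counts.txt")` in place of the toy table.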

The End
Questions??
- Don't forget the homework
- Homework questions will provide additional practice with the ChIP-seq pipeline
- Watch Day 8 videos for introduction to RNA-seq analysis
- Help sessions: 10-11:30am JSCBB B231

Acknowledgements
- Workshop Coordinators: Jamie Prior Kershner and Jessica Vera
- Funding: BioFrontiers Institute and Colorado Office of Economic Development and International Trade
Additional Acknowledgments
- Compute Resources: BioFrontiers IT Staff
- Robin Dowell and Dowell Lab
2016