Aligners. J Fass 21 June 2017

Similar documents
Aligners. J Fass 23 August 2017

Introduction to Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015

Short Read Alignment. Mapping Reads to a Reference

RNA-seq. Manpreet S. Katari

Review of Recent NGS Short Reads Alignment Tools BMI-231 final project, Chenxi Chen Spring 2014

Mapping NGS reads for genomics studies

Read Mapping. Slides by Carl Kingsford

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

Under the Hood of Alignment Algorithms for NGS Researchers

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata

RNA-seq Data Analysis

Aligning reads: tools and theory

Long Read RNA-seq Mapper

Differential gene expression analysis using RNA-seq

Read Mapping and Assembly

Scalable RNA Sequencing on Clusters of Multicore Processors

SAM and VCF formats. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016

KisSplice. Identifying and Quantifying SNPs, indels and Alternative Splicing Events from RNA-seq data. 29th may 2013

Our typical RNA quantification pipeline

High-throughout sequencing and using short-read aligners. Simon Anders

BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14)

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

RNA-Seq Analysis With the Tuxedo Suite

Sequence mapping and assembly. Alistair Ward - Boston College

Bioinformatics in next generation sequencing projects

Accelerating Genomic Sequence Alignment Workload with Scalable Vector Architecture

Data Processing and Analysis in Systems Medicine. Milena Kraus Data Management for Digital Health Summer 2017

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching

Galaxy Platform For NGS Data Analyses

A review of RNA-Seq normalization methods

Next generation sequencing: assembly by mapping reads. Laurent Falquet, Vital-IT Helsinki, June 3, 2010

Short Read Alignment Algorithms

David Crossman, Ph.D. UAB Heflin Center for Genomic Science. GCC2012 Wednesday, July 25, 2012

Colorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi

Data: ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_dat a.tgz. Software:

NGS Analysis Using Galaxy

Gene Expression Data Analysis. Qin Ma, Ph.D. December 10, 2017

Read mapping with BWA and BOWTIE

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units

BLAST & Genome assembly

Sequence Analysis Pipeline

Data Preprocessing. Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

A Fast Read Alignment Method based on Seed-and-Vote For Next GenerationSequencing

GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu

Maize genome sequence in FASTA format. Gene annotation file in gff format

Illumina Next Generation Sequencing Data analysis

version /1/2011 Source code Linux x86_64 binary Mac OS X x86_64 binary

NGS FASTQ file format

Quark enables semi-reference-based compression of RNA-seq data

Galaxy workshop at the Winter School Igor Makunin

Our data for today is a small subset of Saimaa ringed seal RNA sequencing data (RNA_seq_reads.fasta). Let s first see how many reads are there:

de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

mrna-seq Basic processing Read mapping (shown here, but optional. May due if time allows) Gene expression estimation

Services Performed. The following checklist confirms the steps of the RNA-Seq Service that were performed on your samples.

Using Galaxy for NGS Analyses Luce Skrabanek

Goal: Learn how to use various tool to extract information from RNAseq reads. 4.1 Mapping RNAseq Reads to a Genome Assembly

Mapping Reads to Reference Genome

A natural encoding of genetic variation in a Burrows-Wheeler Transform to enable mapping and genome inference

NGS Data and Sequence Alignment

The Burrows-Wheeler Transform and Bioinformatics. J. Matthew Holt April 1st, 2015

Ensembl RNASeq Practical. Overview

Bioinformatics for High-throughput Sequencing

Shotgun sequencing. Coverage is simply the average number of reads that overlap each true base in genome.

Data Preprocessing : Next Generation Sequencing analysis CBS - DTU Next Generation Sequencing Analysis

The Burrows-Wheeler Transform and Bioinformatics. J. Matthew Holt

11/8/2017 Trinity De novo Transcriptome Assembly Workshop trinityrnaseq/rnaseq_trinity_tuxedo_workshop Wiki GitHub

RNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013

File Formats: SAM, BAM, and CRAM. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015

RNA Sequencing with TopHat Alignment v1.0 and Cufflinks Assembly & DE v1.1 App Guide

Finding the appropriate method, with a special focus on: Mapping and alignment. Philip Clausen

SpliceTAPyR - An efficient method for transcriptome alignment

Single/paired-end RNAseq analysis with Galaxy

Genomic Files. University of Massachusetts Medical School. October, 2014

Integrated Genome browser (IGB) installation

Variation among genomes

Ballgown. flexible RNA-seq differential expression analysis. Alyssa Frazee Johns Hopkins

RNA-seq. Read mapping and Quantification. Genomics: Lecture #12. Institut für Medizinische Genetik und Humangenetik Charité Universitätsmedizin Berlin

RNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University

USING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012)

Concurrent and Accurate RNA Sequencing on Multicore Platforms

EpiGnome Methyl Seq Bioinformatics User Guide Rev. 0.1

Sequence Preprocessing: A perspective

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

Browser Exercises - I. Alignments and Comparative genomics

From the Schnable Lab:

A Tutorial: Genome- based RNA- Seq Analysis Using the TUXEDO Package

de Bruijn graphs for sequencing data

Cyverse tutorial 1 Logging in to Cyverse and data management. Open an Internet browser window and navigate to the Cyverse discovery environment:

Darwin-WGA. A Co-processor Provides Increased Sensitivity in Whole Genome Alignments with High Speedup

Using Galaxy: RNA-seq

Tiling Assembly for Annotation-independent Novel Gene Discovery

On enhancing variation detection through pan-genome indexing

Resequencing and Mapping. Andreas Gisel Inernational Institute of Tropical Agriculture (IITA) Ibadan, Nigeria

AMAS: optimizing the partition and filtration of adaptive seeds to speed up read mapping

TECH NOTE Improving the Sensitivity of Ultra Low Input mrna Seq

Darwin: A Genomic Co-processor gives up to 15,000X speedup on long read assembly (To appear in ASPLOS 2018)

Exercise 2: Browser-Based Annotation and RNA-Seq Data

Lecture 12. Short read aligners

Transcription:

Aligners J Fass 21 June 2017

Definitions Assembly: I ve found the shredded remains of an important document; put it back together! UC Davis Genome Center Bioinformatics Core J Fass Aligners 2017-06-21

Definitions Alignment: Somebody plagiarized parts of my document; where did they copy paragraphs from and where did each of the words come from?

Definitions Mapping: Somebody plagiarized parts of my document; where did they copy paragraphs from and where did each of the words come from?

Why align (or map)? Gene X align / map Gene X To infer abundance:

Why align (or map)? Individual A Individual B To measure variation

Why align (or map)?

More Definitions: Global and local Global aligners try to align all provided sequence, end to end, both query and subject / target E.g. Aligning two Salmonella genomes Aligning human and gorilla orthologous coding regions

Global and local Local aligners try to find hits or chains of hits within each provided sequence E.g. Finding mitochondrial splinters in nuclear chromosomes Finding genes that share a domain with a gene of interest

Glocal? Short read aligners generally assume that the whole read came from somewhere within the target (reference) sequence so, global with respect to the read, and local with respect to the reference.

Short Read (Non-splicing) Aligners Li, H and Homer, N (2010) Briefings in Bioinformatics 11:473 A survey of sequence alignment algorithms for next-generation sequencing These two were fastest, at ~7 Gbp (vs human) per CPU day HiSeq 2500 generated 50-100 Gbp per day (at the time) (Fall 12-13) 150-180 Gbp per day (Summer 16) 600 Gbp per day (Summer 17) 1-3 Tbp per day https://www.illumina.com/systems/sequencing-platforms.html

Burrows-Wheeler Aligners Burrows-Wheeler Transform used in bzip2 file compression tool; FM-index (Ferragina & Manzini) allow efficient finding of substring matches within compressed text algorithm is sub-linear with respect to time and storage space required for a certain set of input data (reference 'ome, essentially). Reduced memory footprint, faster execution.

BWA BWA is a fast gapped aligner. Long read aligners (bwasw and mem) also fast, and can perform well for 454, Ion Torrent, Sanger, and PacBio reads. BWA is actively maintained and has a strong user community. bio-bwa.sourceforge.net bwa aln (BWA backtrack ) for reads < 70 bp bwa bwasw bwa mem (seeds with maximal exact matches, extends via Smith-Waterman)

Bowtie (now Bowtie 2) comparable to BWA. Bowtie is part of a suite of tools (Bowtie, Tophat, Cufflinks, CummeRbund) that address RNAseq experiments. http://bowtie-bio.sourceforge.net Written by same folks as Tophat so, full compatibility.

Tophat2 Aligns full reads to genome, to determine coverage islands. Creates simulated spliced exons based on these islands, then aligns remaining short reads to the simulated cdna (to find reads crossing splice junctions).

STAR Spliced Transcripts Aligned to a Reference Aligns short reads and full length cdna Similar to BWA MEM algorithm, but searches uncompressed version (suffix array) of genome (faster, but more RAM required!). Claims better sensitivity and specificity than previous short read aligners, and 50x speed vs TopHat2. Requires ~27 GB RAM for human genome!

HISAT2 Improved replacement for Tophat2, multiple high level and low level indexes of compressed sequence (similar to graph-based assemblers using De Bruijn graphs to represent overlapping k-mers). Graph structure can represent multiple sequence paths, i.e. a population of references, allowing rapid genotyping of samples. HISAT2 is currently probably the best combination, in a full aligner, of speed (faster than STAR) and memory (~4 GB for human genome). Kim, Langmead, Salzberg (2015) Nature Methods 12:357

Kallisto A pseudoaligner (mapper? see Pachter blog post) that compares read k-mers (overlapping subsequences) to a transcriptome de Bruijn graph (T-DBG) to find transcripts compatible with the read. Also uses expectation maximization (EM) and bootstrapping to determine most likely transcript and abundance uncertainty. Bray Pachter (2016) Nat Biotech 34:525 See sleuth for DGE analysis. http://pachterlab.github.io/kallisto/ http://pachterlab.github.io/sleuth/

Salmon Supersedes Sailfish, an older k-mer based transcript analysis tool. Performs quasi-mapping to a set of transcripts (not genome), similar in methods to kallisto. Also performs bias correction for multiple modes of bias (sequence, position) to more accurately determine abundance. Patro, Duggard, Kingsford (2015) biorxiv http://dx.doi.org/10.1101/021592

General Alignment Parameters / Concepts

General Alignment Parameters / Concepts

General Alignment Parameters / Concepts Clipping: 134M 16S(H) Splicing: 32M27N59M

General Alignment Parameters / Concepts

General Alignment Parameters / Concepts Multimappers: Reads that align equally well to more than one reference location. Generally, multimappers are discounted in variant detection, and are often discounted in counting applications (like RNA-Seq would cancel out anyway). Note: multimapper rescue in some algorithms (RSEM, Express?). Duplicates: Reads or read pairs arising from the same original library fragment, either during library preparation (PCR duplicates) or colony formation (optical duplicates; not an issue anymore). Generally, duplicates can only be detected reliably with paired-end sequencing. If PE, they re discounted in variant detection, and discounted in counting applications (like RNA-Seq).

General Alignment Parameters / Concepts

Alignment Viewers IGV (Integrated Genomics Viewer) www.broadinstitute.org/igv/ BAMview, tview (in SAMtools), IGB, GenomeView, SAMscope... UCSC Genome Browser, GBrowse

IGV

IGV

IGV More on IGV s interface, file formats, and display can be found here: http://www.broadinstitute.org/igv/alignmentdata More on interpreting and customizing IGV s display can be found here: http://www.broadinstitute.org/software/igv/interpreting_insert_size