Short Read Alignment. Mapping Reads to a Reference

Similar documents
Aligners. J Fass 21 June 2017

Aligners. J Fass 23 August 2017

Introduction to Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015

Read Mapping. Slides by Carl Kingsford

RNA-seq. Manpreet S. Katari

Review of Recent NGS Short Reads Alignment Tools BMI-231 final project, Chenxi Chen Spring 2014

Long Read RNA-seq Mapper

USING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012)

Mapping NGS reads for genomics studies

Sequence mapping and assembly. Alistair Ward - Boston College

Under the Hood of Alignment Algorithms for NGS Researchers

RNA-seq Data Analysis

Mapping Reads to Reference Genome

Short Read Alignment Algorithms

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be

NGS Data and Sequence Alignment

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.

BLAST, Profile, and PSI-BLAST

Read Mapping and Assembly

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

Scalable RNA Sequencing on Clusters of Multicore Processors

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

BLAST & Genome assembly

GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu

High-throughout sequencing and using short-read aligners. Simon Anders

Scoring and heuristic methods for sequence alignment CG 17

Computational Molecular Biology

Galaxy Platform For NGS Data Analyses

Bioinformatics in next generation sequencing projects

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching

Next generation sequencing: assembly by mapping reads. Laurent Falquet, Vital-IT Helsinki, June 3, 2010

Sequence Alignment: Mo1va1on and Algorithms. Lecture 2: August 23, 2012

BRAT-BW: Efficient and accurate mapping of bisulfite-treated reads [Supplemental Material]

BLAST & Genome assembly

Aligning reads: tools and theory

Using Galaxy for NGS Analyses Luce Skrabanek

ICB Fall G4120: Introduction to Computational Biology. Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology

NGS Analysis Using Galaxy

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching

David Crossman, Ph.D. UAB Heflin Center for Genomic Science. GCC2012 Wednesday, July 25, 2012

Illumina Next Generation Sequencing Data analysis

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

Sequence Alignment Heuristics

NGS FASTQ file format

TECH NOTE Improving the Sensitivity of Ultra Low Input mrna Seq

Genomic Files. University of Massachusetts Medical School. October, 2014

Sequence Alignment: Mo1va1on and Algorithms

.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar..

all M 2M_gt_15 2M_8_15 2M_1_7 gt_2m TopHat2

UNIVERSITY OF OSLO. Department of informatics. Parallel alignment of short sequence reads on graphics processors. Master thesis. Bjørnar Andreas Ruud

Lecture 12: January 6, Algorithms for Next Generation Sequencing Data

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units

Proceedings of the 11 th International Conference for Informatics and Information Technology

Standard output. Some of the output files can be redirected into the standard output, which may facilitate in creating the pipelines:

CLC Server. End User USER MANUAL

Genome 373: Mapping Short Sequence Reads I. Doug Fowler

Bioinformatics for High-throughput Sequencing

Sequence Analysis Pipeline

Package Rsubread. July 21, 2013

Next Generation Sequencing

Bioinformatics for Biologists

Dynamic Programming & Smith-Waterman algorithm

Heuristic methods for pairwise alignment:

SpliceTAPyR - An efficient method for transcriptome alignment

Highly Scalable and Accurate Seeds for Subsequence Alignment

Colorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi

Identiyfing splice junctions from RNA-Seq data

EECS730: Introduction to Bioinformatics

Building approximate overlap graphs for DNA assembly using random-permutations-based search.

Package Rbowtie. January 21, 2019

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

RNA-seq. Read mapping and Quantification. Genomics: Lecture #12. Institut für Medizinische Genetik und Humangenetik Charité Universitätsmedizin Berlin

Analysis of ChIP-seq data

Accurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing

Bismark Bisulfite Mapper User Guide - v0.7.3

Benchmarking of RNA-seq aligners

Rsubread package: high-performance read alignment, quantification and mutation discovery

Wilson Leung 05/27/2008 A Simple Introduction to NCBI BLAST

INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS. Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA

From Smith-Waterman to BLAST

HISAT2. Fast and sensi0ve alignment against general human popula0on. Daehwan Kim

Data Processing and Analysis in Systems Medicine. Milena Kraus Data Management for Digital Health Summer 2017

Rsubread package: high-performance read alignment, quantification and mutation discovery

Genomic Files. University of Massachusetts Medical School. October, 2015

Exercise 2: Browser-Based Annotation and RNA-Seq Data

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Pre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

Database Searching Using BLAST

Omixon PreciseAlign CLC Genomics Workbench plug-in

Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs

NGS Data Analysis. Roberto Preste

Bioinformatics explained: BLAST. March 8, 2007

FastA & the chaining problem

Lecture 12. Short read aligners

Concurrent and Accurate RNA Sequencing on Multicore Platforms

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT

Transcription:

Short Read Alignment Mapping Reads to a Reference Brandi Cantarel, Ph.D. & Daehwan Kim, Ph.D. BICF 05/2018

Introduction to Mapping Short Read Aligners DNA vs RNA Alignment Quality Pitfalls and Improvements

History of Sequence Similarity Margaret Dayhoff compiled one of the first protein sequence database, manually aligns sequences and creates the first protein substitution matrices Temple Smith and Michael Waterman proposed an optimal alignment algorithm for local alignments William Pearson and David Lipman implement an optimization of the Smith-Waterman Algorithm using short exact matching words (FASTA) Stephen Altshul, Warren Gish, Webb Miller, Eugene Myers and David Lipman, propose a faster alignment tool to identify high identity matches without GAPS Randall Smith and Temple Smith implement pattern-induced multiple-sequence alignment (PUMA) Sean Eddy implements multiple sequence alignments using hidden Markov Models (HMMER) Warren Gish, branches the development of BLAST with WU-BLAST Gapped BLAST and PSI-BLAST Bowtie, BWA, and TopHat Released STAR Released HISAT Released HISAT2 Released 1979 1981 1985 1990 1992 1995 1996 1997 2009 2012 2013 2015

Whole Genome Shotgun

Utility of Mapping DNASeq Identify Variation RNASeq Estimate the abundance of transcripts and genes ChromatinSeq Determine the structure of DNA (open, close, bound to proteins, etc)

Genetic Variation Cardoso JG, Andersen MR, Herrgård MJ, Sonnenschein N. Analysis of genetic variation and potential applications in genome-scale metabolic modeling. Front Bioeng Biotechnol. 2015 Feb 16;3:13. doi: 10.3389/fbioe.2015.00013. ecollection 2015. Review. PubMed PMID: 25763369; PubMed Central PMCID: PMC4329917.

Chromatin Seq

How to Make an Alignment PMILGYWNVRGL :. PPYTIVYFPVRG PMILGYWNVRGL PM-ILGYWNVRGL :. :. ::: : :. :. ::: PPYTIVYFPVRG PPYTIV-YFPVRG PMILGYWNVRGL P: P: Y. :. T.. I :... V... :. Y. :. F...... P: V... :. R : G : Local Alignment AAPMILGYWNVRGLBB :. :. ::: DDPPYTIVYFPVRGCC

Global Alignments Global Alignment -PMILGYWNVRGL :. :. ::: PPYTIVYFPVRG-

Local Alignments Local Alignment AAPMILGYWNVRGLBB :. :. ::: DDPPYTIVYFPVRGCC

Introduction to Mapping Short Read Aligners DNA vs RNA Alignment Quality Pitfalls and Improvements

Short Read Aligners Short read aligners assume that the read came from intact from the reference So the alignment is global from the read perspective and local from the reference perspective

Short Read Aligners Read GATCGCAGAGCTCGGGCATAGCTAGCGC AGCTATGATCGCAGAGCTCGGGCATAGCTAGCGCTAGAGCTCGCGATCGCAGAGTCGA Genome

Short Read Aligners Read GATCGCAGAGCTCGGGCATAGCTAGCGC Seed AGCTATGATCGCAGAGCTCGGGCATAGCTAGCGCTAGAGCTCGCGATCGCAGAGTCGA Genome

Short Read Aligners GATCGCAGAG AGCTATGATCGCAGAGCTCGGGCATAGCTAGCGCTAGAGCTCGCGATCGCAGAGTCGA Genome CTCGGGCATAGCTAGCGC

Short Read Aligners TCGGGCATAGCTAGCGC GATCGCAGAGC AGCTATGATCGCAGAGCTCGGGCATAGCTAGCGCTAGAGCTCGCGATCGCAGAGTCGA Genome

Short Read Aligners CGGGCATAGCTAGCGC GATCGCAGAGCT AGCTATGATCGCAGAGCTCGGGCATAGCTAGCGCTAGAGCTCGCGATCGCAGAGTCGA Genome

SRA Features: Seeding Seeding represents the first few tens of base pairs of a read. The seed part of a read is expected to contain less erroneous characters due to the specifics of the NGS technologies. Therefore, the seeding property is mostly used to maximize performance and accuracy. The alignments are then extended from the seed.

SRA Features: Base Quality Base quality scores provide a measure on correctness of each base in the read. The base quality score is assigned by a phred-like algorithm. The score Q is equal to -10 log10(e), where e is the probability that the base is wrong. Some tools use the quality scores to decide mismatch locations. Others accept or reject the read based on the sum of the quality scores at mismatch positions.

SRA Features: Gaps Existence of indels necessitates inserting or deleting nucleotides while mapping a sequence to a reference genome (gaps). The complexity of choosing a gap location increases with the read length. Therefore, some tools do not allow any gaps while others limit their locations and numbers.

SRA Features: Paired-End Paired-end reads result from sequencing both ends of a DNA molecule. Mapping paired-end reads increases the confidence in the mapping locations due to having an estimation of the distance between the two ends.

SRA Features: Color-Space Color space read is a read type generated by SOLiD sequencers. In this technology, overlapping pairs of letters are read and given a number (color) out of four numbers [17]. The reads can be converted into bases, however, performing the mapping in the color space has advantages in terms of error detection.

SRA Features: Bisulphite Bisulphite treatment is a method used for the study of the methylation state of the DNA [3]. In bisulphite treated reads, each unmethylated cytosine is converted to uracil. Therefore, they require special handling in order not to misalign the reads.

Short Read Aligners

SRA: Burrows-Wheeler Transform BWT is an efficient data indexing technique that maintains a relatively small memory footprint when searching through a given data block. BWT was extended by Ferragina and Manzini to a newer data structure, named FM-index, to support exact matching. By transforming the genome into an FM-index, the lookup performance of the algorithm improves for the cases where a single read matches multiple locations in the genome. However, the improved performance comes with a significantly large index build up time compared to hash tables.

SRA: BWA BWA is a BWT based tool. The BWA tool uses the Ferragina and Manzini matching algorithm to find exact matches. To find inexact matches, the authors provided a new backtracking algorithm that searches for matches between substring of the reference genome and the query within a certain defined distance. BWA is fast, and can do gapped alignments. When run without seeding, it will find all hits within a given edit distance. Long read aligner is also fast, and can perform well for 454, Ion Torrent, Sanger, and PacBio reads. BWA is actively maintained and has a strong user community.

SRA: Bowtie Bowtie starts by building an FM-index for the reference genome and then uses the modified Ferragina and Manzini [39] matching algorithm to find the mapping location. There are two main versions of Bowtie namely Bowtie and Bowtie 2. Bowtie 2 is mainly designed to handle reads longer than 50 bps. Additionally, Bowtie 2 supports features not handled by Bowtie. Bowtie2 is faster than BWA for some types of alignment, but it takes a hit in sensitivity and specificity in some applications.

Introduction to Mapping Short Read Aligners DNA vs RNA Alignment Quality Pitfalls and Improvements

RNASeq

SRA Features: Splice-Aware Splicing refers to the process of cutting the RNA to remove the non-coding part (introns) and keeping only the coding part (exons) and joining them together. Therefore, when sequencing the RNA, a read might be located across exon-exon junctions. The process of mapping such reads back to the genome is hard due to the variability of the intron length. For instance, the intron length ranges between 250 and 65,130 nt in eukaryotic model organisms [37].

HISAT2 15, junction reads with long, >15-bp anchors in both exons; (iii) 2M_8_15, junction reads with intermediate, 8- to 15-bp anchors; (iv) 2M_1_7, junction reads with short, 1- to 7-bp, anchors; and Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015 Apr;12(4):357-60. doi: 10.1038/nmeth.3317. Epub 2015 Mar 9. PubMed PMID: 25751142; PubMed Central PMCID: PMC4655817.

HISAT2 Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015 Apr;12(4):357-60. doi: 10.1038/nmeth.3317. Epub 2015 Mar 9. PubMed PMID: 25751142; PubMed Central PMCID: PMC4655817.

Introduction to Mapping Short Read Aligners DNA vs RNA Alignment Quality Pitfalls and Improvements

Mapping Quality Probability that a read is mapped incorrectly Factors include: uniqueness (one best scoring alignment) number of mismatches number of gaps quality of the bases (Phred)

Alignment Metrics Alignment Rate (Mapping Rate) Paired Alignments Properly Paired Mapping Comparing the rate of read pairs mapped within a certain proximity Average Insert Size Distance between adapter sequences Duplication Rate Same fragment size can be duplicated during library preparation (PCR) or during sequencing (colony formation)

File Formats: FastQ

File Formats: SAM

File Formats: SAM @SQ = Contigs/Chromosomes @RG = Read Group @PG = Program Info

File Formats: SAM Flags

Introduction to Mapping Short Read Aligners DNA vs RNA Alignment Quality Pitfalls and Improvements

Pitfalls All of the down stream analysis is based on the alignment Improper alignment can lead to false positive variate calls Inaccurate abundance calculations

Misalignment The Human Genome has many duplications Misalignment can result from sequence repetition Misalignment can be improved with: Increased read lengths Finding multiple 35bp matches is more likely than finding multiple 100bp matches Paired Sequencing A mate pair can anchor another when there is multiple mapping of the other mate pair.