Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.

Similar documents
Lecture 12. Short read aligners

Under the Hood of Alignment Algorithms for NGS Researchers

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

Bioinformatics in next generation sequencing projects

Read Naming Format Specification

Illumina Next Generation Sequencing Data analysis

de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

SAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional.

Read Mapping and Assembly

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu

RNA-seq. Manpreet S. Katari

Review of Recent NGS Short Reads Alignment Tools BMI-231 final project, Chenxi Chen Spring 2014

Mapping NGS reads for genomics studies

RCAC. Job files Example: Running seqyclean (a module)

Read Mapping. Slides by Carl Kingsford

Aligners. J Fass 21 June 2017

Long Read RNA-seq Mapper

Omega: an Overlap-graph de novo Assembler for Metagenomics

Scalable RNA Sequencing on Clusters of Multicore Processors

Taller práctico sobre uso, manejo y gestión de recursos genómicos de abril de 2013 Assembling long-read Transcriptomics

SAM / BAM Tutorial. EMBL Heidelberg. Course Materials. Tobias Rausch September 2012

High-throughout sequencing and using short-read aligners. Simon Anders

Sequence Analysis Pipeline

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units

The SAM Format Specification (v1.3 draft)

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Short Read Alignment. Mapping Reads to a Reference

NGS Data and Sequence Alignment

Identiyfing splice junctions from RNA-Seq data

Sequence mapping and assembly. Alistair Ward - Boston College

RNA-seq Data Analysis

de Bruijn graphs for sequencing data

Rsubread package: high-performance read alignment, quantification and mutation discovery

INTRODUCTION AUX FORMATS DE FICHIERS

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching

Aligning reads: tools and theory

The SAM Format Specification (v1.3-r837)

Rsubread package: high-performance read alignment, quantification and mutation discovery

Galaxy Platform For NGS Data Analyses

SMALT Manual. December 9, 2010 Version 0.4.2

Standard output. Some of the output files can be redirected into the standard output, which may facilitate in creating the pipelines:

Scalable Solutions for DNA Sequence Analysis

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner

BLAST & Genome assembly

Short Read Alignment Algorithms

Atlas-SNP2 DOCUMENTATION V1.1 April 26, 2010

Package Rsubread. July 21, 2013

Genomic Files. University of Massachusetts Medical School. October, 2014

SSAHA2 Manual. September 1, 2010 Version 0.3

all M 2M_gt_15 2M_8_15 2M_1_7 gt_2m TopHat2

Package Rbowtie. January 21, 2019

Goal: Learn how to use various tool to extract information from RNAseq reads.

Benchmarking of RNA-seq aligners

Ensembl RNASeq Practical. Overview

Next generation sequencing: assembly by mapping reads. Laurent Falquet, Vital-IT Helsinki, June 3, 2010

Assembly in the Clouds

Bioinformatics for High-throughput Sequencing

IDBA - A practical Iterative de Bruijn Graph De Novo Assembler

Analysis of ChIP-seq data

USING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012)

Genomic Files. University of Massachusetts Medical School. October, 2015

Genome Assembly and De Novo RNAseq

7.36/7.91 recitation. DG Lectures 5 & 6 2/26/14

Tiling Assembly for Annotation-independent Novel Gene Discovery

I519 Introduction to Bioinformatics, Genome assembly. Yuzhen Ye School of Informatics & Computing, IUB

v0.2.0 XX:Z:UA - Unassigned XX:Z:G1 - Genome 1-specific XX:Z:G2 - Genome 2-specific XX:Z:CF - Conflicting

IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler

File Formats: SAM, BAM, and CRAM. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015

Peter Schweitzer, Director, DNA Sequencing and Genotyping Lab

Introduction to Genome Assembly. Tandy Warnow

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata

Computational Architecture of Cloud Environments Michael Schatz. April 1, 2010 NHGRI Cloud Computing Workshop

The Burrows-Wheeler Transform and Bioinformatics. J. Matthew Holt April 1st, 2015

merantk Version 1.1.1a

de novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January /73

KisSplice. Identifying and Quantifying SNPs, indels and Alternative Splicing Events from RNA-seq data. 29th may 2013

Bioinformatics I. Teaching assistant(s): Eudes Barbosa Markus List

Darwin: A Hardware-acceleration Framework for Genomic Sequence Alignment

CLC Server. End User USER MANUAL

Mapping Reads to Reference Genome

TECH NOTE Improving the Sensitivity of Ultra Low Input mrna Seq

RASER: Reads Aligner for SNPs and Editing sites of RNA (version 0.51) Manual

m6aviewer Version Documentation

ABSTRACT USING MANY-CORE COMPUTING TO SPEED UP DE NOVO TRANSCRIPTOME ASSEMBLY. Sean O Brien, Master of Science, 2016

User's Guide to DNASTAR SeqMan NGen For Windows, Macintosh and Linux

Sequence Preprocessing: A perspective

Manual of SOAPdenovo-Trans-v1.03. Yinlong Xie, Gengxiong Wu, Jingbo Tang,

BLAST & Genome assembly

CS 173, Lecture B Introduction to Genome Assembly (using Eulerian Graphs) Tandy Warnow

Practical Bioinformatics for Life Scientists. Week 4, Lecture 8. István Albert Bioinformatics Consulting Center Penn State

Package Rsubread. February 4, 2018

Next Generation Sequencing

Dindel User Guide, version 1.0

Genetics 211 Genomics Winter 2014 Problem Set 4

Sequence Alignment: Mo1va1on and Algorithms. Lecture 2: August 23, 2012

Using Galaxy for NGS Analyses Luce Skrabanek

Mapping, Alignment and SNP Calling

Sequence Alignment: Mo1va1on and Algorithms

Michał Kierzynka et al. Poznan University of Technology. 17 March 2015, San Jose

Transcription:

Sequencing Short Alignment Tobias Rausch 7 th June 2010 WGS RNA-Seq Exon Capture ChIP-Seq Sequencing Paired-End Sequencing Target genome Fragments Roche GS FLX Titanium Illumina Applied Biosystems SOLiD 3 Plus... Sequenced reads Paired-end Genome Analyzer Paired-End Libraries Paired-End Libraries R1 R2 1

Mate-Pair Libraries Mate-Pair Libraries R2 R1 R1 R2 Data Analysis Data Analysis Mapping Reference Assembly Overlap Layout Consensus Assembly String Graph Assembler Overlap - Layout - Consensus assemblers Examples Celera Assembler, Arachne, Atlas Mapping Reference De-Bruijn Graph Assembler Short-read assemblers Examples: Velvet, Abyss, SOAPdenovo Transcriptome assembly: Oases 2

Reference Reference Mapping Mapping Reference Genome 0 n m Reference Genome 0 n m Quadratic algorithm Requires O(m*n) time and space Infeasible for millions of short reads Filtering Filtering Genome Genome Preprocess Preprocess Index Filter Algorithm Index Filtration Phase Potential Matches Genome Filtering Preprocess Filter Algorithm Exact Algorithm Index Filtration Phase Potential Matches Verification Phase AAA ACC CGA AAC ACG AAG ACT GAA AAT AGA ACA TTT True Matches False Matches Size of that table: 4 3 = 64 entries = Σ k 3

AAA ACC CGA AAC ACG 0 AAG ACT GAA AAT AGA ACA TTT AAA ACC CGA 1 AAC ACG 0 AAG ACT GAA AAT AGA ACA TTT Size of that table: 4 3 = 64 entries = Σ k Size of that table: 4 3 = 64 entries = Σ k AAA ACC CGA 1 AAC ACG 0 AAG ACT GAA 2 AAT AGA ACA TTT AAA 3,4 ACC 19 CGA 1 AAC 5 ACG 0 AAG Empty ACT 6,14 GAA 2 AAT Empty AGA ACA Empty TTT Empty Size of that table: 4 3 = 64 entries = Σ k Searching a Verification Algorithm Banded Dynamic Programming AAA 3,4 ACC 19 CGA 1 AAC 5 ACG 0 AAG Empty ACT 6,14 GAA 2 AAT Empty AGA ACA Empty TTT Empty Sequence: ACTG Potential match at position 6 and 14 Reference Genome 0 n m Bandwidth 4

Techniques Index Hash tables, k-mer Index Suffix arrays Burrows-Wheeler-Transformation (BWT) of a suffix array Filtering Algorithms Single or multiple seeds Pigeonhole principle Q-gram filtering Verification Simple seed-and-extend Banded dynamic programming Quality-based dynamic programming Mappers Source: illumina ELANDv2 Gapped Banded Alignment (20bp) ELANDv2 Multiseed Alignment (Seed max. 2 errors) Source: illumina Source: illumina Parallelization Data Decomposition Split the reads Examples: Bowtie, Eland RNA-Seq Functional Decomposition Separate filtering and verification processes 5

RNA-Seq RNA-Seq -Mapping Protocol Alignment against contaminants (rrna) Alignment against splice-junctions Alignment against genome RNA-Seq -Mapping Protocol Alignment against contaminants (rrna) Alignment against splice-junctions Alignment against genome Calling SNPs Direct Alignment against hg18 Splice Junction (60bp) GGCTTCAA 30bp GGCT Exon 1 AGCTCC CTTACT Intron 30bp TCAA Exon 2 Calling SNPs Direct Alignment against hg18 Alignment against rrna (1%) + Alignment against splice junctions (11%) SAM/BAM Generic format for storing large nucleotide sequence alignments SAM Tools Sorting alignments Merging alignments Indexing alignments Viewing alignments 6

SAM record SAM record Tab-delimited format Field 1: Query name Field 2: Flag Field 3: Reference sequence name Field 4: 1-based leftmost coordinate of the clipped sequence Field 5: Mapping quality Field 6: CIGAR strings Field 7: Mate reference sequence name Field 8: 1-based leftmost coordinate of the clipped sequence Field 9: Insert size (5 to 5 ) Field 10: Query sequence Field 11: Sequence qualities Tab-delimited format Field 1: Query name Field 2: Flag Field 3: Reference sequence name Field 4: 1-based leftmost coordinate of the clipped sequence Field 5: Mapping quality Field 6: CIGAR strings Field 7: Mate reference sequence name Field 8: 1-based leftmost coordinate of the clipped sequence Field 9: Insert size (5 to 5 ) Field 10: Query sequence Field 11: Sequence qualities Sequence characters agreeing with the reference are set to. or, for reads aligned to the forward or reverse strand. M: Alignment match or mismatch I: Insertion to the reference D: Deletion from the reference CIGAR Strings 39M 19M1D5M 9M1I23M 9M2I23M 23M2D10M 26M 12M1D15M 16M CIGAR Strings 39M 19M1D5M 9M1I1P23M 9M2I23M 23M2D10M 26M 12M1D15M 16M P: Padding (silent deletion) This is not even implemented by BWA Because it would require a de novo local assembler! 7

N: Skipped region from the reference For spliced reads: ACATGATA.GAGCTTTA (Cigar: 8M56N8M) Two more CIGAR characters S: Soft clip on the read H: Hard clip on the read Flags Bitwise FLAG: f 15 f 14 f 13 f 12 f 11 f 10 f 9 f 8 f 7 f 6 f 5 f 4 f 3 f 2 f 1 f 0 with f i = {0,1} f 0 : 0 = is not paired in sequencing, 1 = is paired in seq. f 1 : 1 = The read is mapped in a proper pair f 2 : 1 = The query sequence itself is unmapped f 3 : 1 = The mate is unmapped f 4 : 0 = forward strand, 1 = reverse strand 8