
Sequence Assembly
BMI/CS 576
www.biostat.wisc.edu/bmi576/
Mark Craven
craven@biostat.wisc.edu

Some sequencing successes
  Yersinia pestis
  Cannabis sativa

The sequencing problem
We want to determine the identity of the base pairs that make up:
  a single large molecule of DNA
  the genome of a single cell/individual organism
  the genome of a species
But we can't (currently) read off the sequence of an entire molecule all at once.

The strategy: substrings
We do have the ability to read or detect short pieces (substrings) of DNA.
Key properties: read length in bp/read, error rate in %, price in $ per 1 million base pairs
  Sanger sequencing: 5-8 bp, 1%, $24
  Next generation technologies:
    454 Genome Sequencer: 25-6 bp, 1%, $1
    Illumina Genome Analyzer: 35-15 bp, 1%, $.15
  3rd generation sequencing:
    Oxford Nanopore: x1 kbp/read, up to 3% error rate, the most portable sequencer

DNA sequencing costs

Shotgun sequencing (fragment assembly)
  Multiple copies of sample DNA
  Randomly fragment DNA
  Sequence sample of fragments
  Assemble reads

Statistics for shotgun sequencing
Given:
  G = genome length (3 × 10^9 nts)
  L = read length (500 nts)
  N = number of reads (to be determined)
Calculate: coverage a = NL/G
Questions to be answered by the statistics (Lander-Waterman):
  How many contigs are there?
  How big are the contigs?
  How many reads are in each contig?
  How big are the gaps?
Requirement: 99% of the genome in contigs, 1% in gaps
Result: a = 4.6, N = 3 × 10^7, mean contig length 10^4, 100 reads/contig on average
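The numbers on this slide can be reproduced with a short calculation. The sketch below assumes the usual simplified Lander-Waterman approximations with no minimum detectable overlap: the probability that a base is uncovered is about e^(-a), the expected number of contigs is about N·e^(-a), and the mean contig length is about L·(e^a - 1)/a. The function name and the dictionary keys are illustrative only.

```python
import math

def lander_waterman(G, L, gap_fraction):
    """Simplified Lander-Waterman estimates (assumes zero minimum overlap)."""
    a = -math.log(gap_fraction)            # coverage needed: e^(-a) = gap fraction
    N = a * G / L                          # number of reads, from a = N*L/G
    contigs = N * math.exp(-a)             # expected number of contigs
    reads_per_contig = N / contigs         # equals e^a
    mean_contig_len = L * (math.exp(a) - 1) / a
    return {
        "coverage": a,
        "reads": N,
        "contigs": contigs,
        "reads_per_contig": reads_per_contig,
        "mean_contig_len": mean_contig_len,
    }

# Genome of 3 * 10^9 nts, 500-nt reads, at most 1% of the genome in gaps
stats = lander_waterman(G=3e9, L=500, gap_fraction=0.01)
for name, value in stats.items():
    print(f"{name}: {value:.3g}")
```

Under these assumptions the coverage comes out near 4.6, the read count near 3 × 10^7, and the mean contig length near 10^4 nts, in line with the slide.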

The fragment assembly problem
Given: a set of reads (strings) {s_1, s_2, ..., s_n}
Do: determine a large string s that best explains the reads
  What do we mean by "best explains"?
  What assumptions might we require?

Shortest superstring problem
Objective: find a string s such that
  all reads s_1, s_2, ..., s_n are substrings of s
  s is as short as possible
Assumptions:
  Reads are 100% accurate
  Identical reads must come from the same location on the genome
  "best" = "simplest"

Shortest superstring example
Given the reads: {ACG, CGA, CGC, CGT, GAC, GCG, GTA, TCG}
What is the shortest superstring you can come up with?
  TCGACGCGTA (length 10)

Algorithms for shortest superstring problem
This problem turns out to be NP-complete.
Simple greedy strategy:
  while # strings > 1 do
    merge two strings with maximum overlap
  loop
Conjectured to give a string with length at most 2 × minimum length
Other approaches are based on graph theory.
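The greedy strategy can be sketched in Python as follows. This is a minimal illustration, not an optimized assembler: the helper names `overlap` and `greedy_superstring` are made up for this example, ties between equally good merges are broken arbitrarily, and the result is a valid superstring that is not guaranteed to be the shortest one.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_superstring(reads):
    """Repeatedly merge the pair of strings with maximum overlap."""
    strings = sorted(set(reads))
    # Reads contained in another read add nothing to the superstring
    strings = [s for s in strings
               if not any(s != t and s in t for t in strings)]
    while len(strings) > 1:
        n, a, b = max(((overlap(a, b), a, b)
                       for a, b in permutations(strings, 2)),
                      key=lambda t: t[0])
        strings.remove(a)
        strings.remove(b)
        strings.append(a + b[n:])          # merge, keeping the overlap once
    return strings[0] if strings else ""

reads = ["ACG", "CGA", "CGC", "CGT", "GAC", "GCG", "GTA", "TCG"]
superstring = greedy_superstring(reads)
print(superstring, len(superstring))
```

Each round scans all ordered pairs, so the sketch is only suitable for small inputs; practical assemblers index overlaps instead of recomputing them.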

Graph basics
A graph (G) consists of vertices (V) and edges (E): G = (V, E)
Edges can either be directed (directed graphs) or undirected (undirected graphs).
[Figure: a directed graph and an undirected graph, each on vertices 1-4]

Vertex degrees
The degree of a vertex: the number of edges incident to that vertex
For directed graphs, we also have the notion of
  indegree: the number of incoming edges
  outdegree: the number of outgoing edges
[Figure: directed graph on vertices 1-4]
  degree(v_2) = 3, indegree(v_2) = 1, outdegree(v_2) = 2

Overlap graph
One representation that is commonly used for sequence assembly is an overlap graph.
For a set of sequence reads S, construct a directed weighted graph G = (V, E, w)
  with one vertex per read (v_i corresponds to s_i)
  edges between all vertices (a complete graph)
  w(v_i, v_j) = overlap(s_i, s_j) = length of longest suffix of s_i that is a prefix of s_j

Overlap graph example
Let S = {AGA, GAT, TCG, GAG}
[Figure: overlap graph on AGA, GAT, TCG, GAG with edge weights 1 and 2]
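As a small illustration, the overlap graph for the example set S can be built as a dictionary of edge weights. The function names are chosen for this sketch only, and zero-weight edges are kept so that the graph stays complete, as in the definition.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads):
    """Complete directed graph: (s_i, s_j) -> overlap(s_i, s_j) for i != j."""
    return {(a, b): overlap(a, b)
            for a in reads for b in reads if a != b}

S = ["AGA", "GAT", "TCG", "GAG"]
graph = overlap_graph(S)
for (a, b), w in sorted(graph.items()):
    if w > 0:
        print(f"{a} -> {b}: {w}")
```

Printing only the positive weights gives the seven labeled edges of the example: three with weight 2 and four with weight 1.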

Assembly as finding a Hamiltonian path
Hamiltonian path: path through graph that visits each vertex exactly once
[Figure: overlap graph on AGA, GAT, TCG, GAG with a Hamiltonian path highlighted]
Path: AGAGATCG

Shortest superstring as TSP
Minimize superstring length = minimize weight of Hamiltonian path in overlap graph with edge weights negated
[Figure: overlap graph with negated edge weights -1 and -2]
  path: GAGATCG
  path weight: -5
  string length: 7
This is essentially the Traveling Salesman Problem (also NP-complete).
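For a handful of reads, the minimum-weight Hamiltonian path can be found by brute force over all vertex orderings. The sketch below is exponential in the number of reads and is meant only to make the correspondence between path weight and string length concrete; it assumes no read is a substring of another, and the function names are illustrative.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def best_hamiltonian_path(reads):
    """Try every ordering; return (weight, superstring, order) of a best one."""
    best = None
    for order in permutations(reads):
        weight = 0
        s = order[0]
        for prev, nxt in zip(order, order[1:]):
            n = overlap(prev, nxt)
            weight -= n                    # negated overlap as edge weight
            s += nxt[n:]                   # append the non-overlapping part
        if best is None or weight < best[0]:
            best = (weight, s, order)
    return best

S = ["AGA", "GAT", "TCG", "GAG"]
weight, superstring, order = best_hamiltonian_path(S)
print(order, weight, superstring, len(superstring))
```

The string length always equals the total read length plus the (negative) path weight, here 12 - 5 = 7. More than one ordering can reach the minimum, so the sketch may print a different length-7 string than the one on the slide.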

Assembly as a Hamiltonian path
Finding a Hamiltonian path is an NP-complete problem.
Nevertheless, overlap graphs are often used for sequence assembly:
  can detect repeats
  heuristic hierarchical decomposition
  unitigs (no forks, no conflicts) solved first
  mate-pairs to scaffold

Sequencing by Hybridization (SBH)
An SBH array has probes for all possible k-mers.
For a given DNA sample, the array tells us whether each k-mer is PRESENT or ABSENT in the sample.
The set of all k-mers present in a string S is called its spectrum.
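The spectrum definition translates directly into a set comprehension. The function name `spectrum` mirrors the notation spectrum(S, k); the sample string ACTGATGCAT and k = 4 are used here purely as an illustration.

```python
def spectrum(s, k):
    """Set of all k-mers that occur in the string s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

S = "ACTGATGCAT"
spec = spectrum(S, 4)
print(sorted(spec))
```

Because the result is a set, a k-mer that occurs several times in S appears only once, which matches the PRESENT/ABSENT readout of an SBH array.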

Example DNA array
S: ACTGATGCAT
spectrum(S, 4) = {ACTG, ATGC, CTGA, GATG, GCAT, TGAT, TGCA}
[Figure: array indexed by 2-mers (AA ... TT) along each axis, with an X marking each 4-mer present in S]

de Bruijn graph
In a de Bruijn graph
  edges represent k-mers that occur in spectrum(S, k)
  vertices correspond to (k-1)-mers
Example 3-mers: {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
[Figure: de Bruijn graph with vertices AT, TG, GG, GC, CA, CG, GT and one edge per 3-mer]
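A de Bruijn graph of this kind can be built by splitting each k-mer into its prefix and suffix (k-1)-mers. The adjacency-list representation and the name `de_bruijn` below are choices made for this sketch.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it points to."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])  # one edge per k-mer
    return dict(graph)

kmers = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
graph = de_bruijn(kmers)
for node, successors in sorted(graph.items()):
    print(node, "->", ", ".join(sorted(successors)))
```

Vertices with no outgoing edge (CA in this example) appear only as successors, not as keys.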

de Bruijn graph
Can we find a DNA sequence containing all k-mers?
In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?

Seven Bridges of Königsberg
Euler answered the question: Is there a walk through the city that traverses each bridge exactly once?

Properties of Eulerian graphs
cycle: a path in a graph that starts/ends on the same vertex
Eulerian cycle: a cycle that visits every edge of the graph exactly once
Theorem: A connected graph has an Eulerian cycle if and only if each of its vertices is balanced.
  A vertex v is balanced if indegree(v) = outdegree(v)
There is a linear-time algorithm for finding Eulerian cycles!

Eulerian cycle algorithm
start at any vertex v, traverse unused edges until returning to v
while the cycle is not Eulerian
  pick a vertex w along the cycle for which there are untraversed outgoing edges
  traverse unused edges until ending up back at w
  join two cycles into one cycle

Finding cycles
1) start at arbitrary vertex
2) start at vertex along cycle with untraversed edges
3) join cycles
4) start at vertex along cycle with untraversed edges
5) join cycles

Joining cycles
[Figure: a cycle through v and a cycle through w are joined at a shared vertex into one cycle]
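One common way to implement the cycle-joining idea is the stack-based form of Hierholzer's algorithm, sketched below. Instead of building sub-cycles and splicing them explicitly, it walks unused edges until stuck and emits vertices while backtracking, which joins the cycles implicitly. The sketch assumes a connected directed graph in which every vertex is balanced; the small example graph is made up for illustration.

```python
def eulerian_cycle(graph, start):
    """graph: dict vertex -> list of successor vertices (directed multigraph)."""
    unused = {v: list(ws) for v, ws in graph.items()}  # copy, edges are consumed
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if unused.get(v):
            stack.append(unused[v].pop())  # follow an untraversed edge
        else:
            cycle.append(stack.pop())      # dead end: emit vertex on the way back
    return cycle[::-1]

# Hypothetical balanced graph: a->b, b->c, b->d, c->a, d->b
example = {"a": ["b"], "b": ["c", "d"], "c": ["a"], "d": ["b"]}
cycle = eulerian_cycle(example, "a")
print(" -> ".join(cycle))
```

Every edge is pushed and popped once, so the running time is linear in the number of edges.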

Assembly as finding Eulerian paths
Eulerian path: path that visits every edge exactly once
We can frame the assembly problem as finding Eulerian paths in a de Bruijn graph; the resulting sequences contain all k-mers.
[Figure: de Bruijn graph with vertices AT, TG, GG, GC, CA, CG, GT and edges ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT]
assembly: ATGGCGTGCA or ATGCGTGGCA

Eulerian paths
A vertex v is semibalanced if |indegree(v) - outdegree(v)| = 1
A connected graph has an Eulerian path if and only if it contains at most two semibalanced vertices and all other vertices are balanced
[Figure: de Bruijn graph on vertices AT, TG, GG, GC, CA, CG, GT]

Eulerian path → Eulerian cycle
If a graph has an Eulerian path starting at w and ending at x, then
  all vertices must be balanced, except for w and x, which may have |indegree(v) - outdegree(v)| = 1
  if w and x are not balanced, add an edge between them (from x to w) to balance
    the graph now has an Eulerian cycle, which can be converted to an Eulerian path by removal of the added edge
[Figure: de Bruijn graph on vertices AT, TG, GG, GC, CA, CG, GT with the added edge]
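Putting the pieces together, a toy assembler can build the de Bruijn graph from a list of k-mers and spell out an Eulerian path. As a shortcut, the sketch below does not add and remove the balancing edge; it starts the traversal at the semibalanced vertex with one surplus outgoing edge, which yields the same path. It assumes a non-empty k-mer list whose graph is connected and has an Eulerian path, and the function name `assemble` is illustrative.

```python
from collections import defaultdict

def assemble(kmers):
    """Spell the sequence along an Eulerian path of the de Bruijn graph."""
    graph = defaultdict(list)
    balance = defaultdict(int)             # outdegree - indegree per vertex
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one extra outgoing edge; if all are balanced,
    # any vertex with an outgoing edge will do (the path is then a cycle).
    start = next((v for v, b in balance.items() if b == 1), next(iter(graph)))
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())   # follow an untraversed edge
        else:
            path.append(stack.pop())       # emit vertex while backtracking
    path.reverse()
    # First vertex in full, then the last character of every later vertex
    return path[0] + "".join(v[-1] for v in path[1:])

kmers = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
assembly = assemble(kmers)
print(assembly)
```

Which of the valid assemblies is returned depends on the order in which edges are taken at branching vertices, so the output is one Eulerian path, not the only one.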

Sequence assembly in practice
Approaches are based on these ideas, but include a lot of heuristics.
The best approach varies depending on the length of reads, the amount of repeats in the genome, and the availability of paired-end reads.

Violating assumptions in de Bruijn graphs
Assume a sequence: a_long_long_long_time
  length m = 21, the sequence contains repeats
  choose k = 5, number of 5-mers n = m - k + 1 = 17
  (taken from Langmead: de Bruijn graph assembly)
Assume different sets of k-mers:
  a) all 5-mers detected: correct assembly
  b) omitting ong_t: two connected components, the overall graph is not Eulerian
  c) extra copy of ong_t: 4 semi-balanced nodes, graph not Eulerian
  d) errors and differences between chromosomes, turn a copy of long_ into lxng_: graph not connected, largest component not Eulerian
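Cases a) to c) can be checked with a few lines of Python. The sketch below treats each 5-mer occurrence as one edge of a multigraph, counts weakly connected components, and counts semi-balanced vertices; the helper names are invented for this illustration, and case d) is left out.

```python
from collections import defaultdict

def kmers_of(s, k):
    """All k-mer occurrences of s, in order, with repeats."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def graph_stats(kmers):
    """Return (connected components, semi-balanced vertices, other unbalanced vertices)."""
    balance = defaultdict(int)             # outdegree - indegree
    adj = defaultdict(set)                 # undirected adjacency for connectivity
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        balance[u] += 1
        balance[v] -= 1
        adj[u].add(v)
        adj[v].add(u)
    semibalanced = sum(1 for b in balance.values() if abs(b) == 1)
    unbalanced = sum(1 for b in balance.values() if abs(b) > 1)
    seen, components = set(), 0
    for node in adj:                       # depth-first search per component
        if node in seen:
            continue
        components += 1
        todo = [node]
        while todo:
            x = todo.pop()
            if x not in seen:
                seen.add(x)
                todo.extend(adj[x] - seen)
    return components, semibalanced, unbalanced

all5 = kmers_of("a_long_long_long_time", 5)
case_a = graph_stats(all5)
case_b = graph_stats([x for x in all5 if x != "ong_t"])
case_c = graph_stats(all5 + ["ong_t"])
print(len(all5), case_a, case_b, case_c)
```

A graph has an Eulerian path only when it has one component, at most two semi-balanced vertices, and no other unbalanced vertices, so case a) passes and cases b) and c) fail.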

[Figure: de Bruijn graphs for cases a) to d)]

de Bruijn graphs: short k-mers
Only short k-mers guarantee that none is missed.
Still, the number of k-mers remains O(N), where N is the total length of reads.
A de Bruijn graph with O(N) edges and O(N) nodes can be constructed in O(N) time, and an Euler cycle found in O(N) time.

Paired end reads
One approach to reducing ambiguity in assembly is to use paired end reads.
[Figure: the genome is cut many times at random; reads are sequenced from the ends of each fragment, a known distance apart; read labels: ~5 bp, ~5 bp]

Repetitive sequences
Most common source of assembly errors
If sequencing technology produces reads > repeat size, impact is much smaller
Most straightforward solution: mate pairs with spacing > largest known repeat
Salzberg S L, and Yorke J A. Bioinformatics 2005;21:4320-4321

Mis-assembly of repetitive sequence
Schatz M C et al. Brief Bioinform 2013;14:213-224

Methods for assembly validation
Nagarajan, N., Pop, M.: Sequence Assembly Demystified. Nature Reviews Genetics 2013;14

The Velvet assembler
Based on de Bruijn graphs
Includes additional tricks for
  reducing the size of the graph
  trying to correct for errors in sequences
  taking advantage of paired-end reads

Compressing the graph in Velvet
reads: AAGA ACTC ACTG ACTT AGAC CCGA CGAC CTCC CTGG CTTT GACT GGAC GGGA TCCG TGGG
[Figure: de Bruijn graph of the reads, with vertices AAG, AGA, GAC, ACT, CTC, TCC, CCG, CGA, CTG, TGG, GGG, GGA, CTT, TTT]
Potential genomes:
  AAGACTCCGACTGGGACTTT
  AAGACTGGGACTCCGACTTT
human genome: ~ 3B nodes, ~1B edges

Compressing the graph in Velvet
collapse linear subgraphs
[Figure: the same graph before and after collapsing, with compressed vertices such as CTCCGA, CTTT and CTGGGA]

Error correction in Velvet
errors at end of read: trim off dead-end tips
errors in middle of read: pop bubbles
chimeric edges: clip short, low coverage nodes
[Figure: graph fragments before and after each correction]

Summary
The sequencing problem
  Sequencing in vitro
  Sequence assembly in silico
  De novo versus resequencing
Approaches: greedy, overlap graph, Euler trail
  Reads, contigs, scaffolding
Assembly validation
  Statistical, viewers, comparative methods
Still open problem
  Costs, efficiency, reliability