De-Novo Genome Assembly and its Current State

Size: px
Start display at page:

Download "De-Novo Genome Assembly and its Current State"

Transcription

1 De-Novo Genome Assembly and its Current State Anne-Katrin Emde April 17, 2013 Freie Universität Berlin, Algorithmische Bioinformatik Max Planck Institut für Molekulare Genetik, Computational Molecular Biology

2 What is genome assembly and why is it hard? Original genome sequence (in multiple copies) Sequence short fragments Difficulties: - Genomes are very long and repeat-rich - Reads are very short and may contain errors and biases Reconstruct original sequence from reads Assembler 3

3 Assembly algorithms - introduction Assemblers use overlaps between reads, to first produce contigs, that are then used to build scaffolds. reads contig NNNN...NNNN scaffold Read pairs contribute long-range linking information, especially in the scaffolding phase. 4

4 2 Paradigmen 1) Overlap Layout Consensus: Focus is on sequence reads 2) De Bruijn graph: Focus is on k-tuples occuring in sequence reads NGS algorithms, March

5 Algorithms Overlap Graph 1) Overlap phase: pairwise overlap alignments 2) Layout phase: overlap graph construction and finding relative placement of reads 3) Consensus phase: Produce multiple read alignment and compute contig consensus sequences Kececioglu and Myers, 1995 ACGTAATT GTAATTCA ATTCAGTC GTCCATGT CATGTTGA TGTTGACT ACGTAATTCAGTCCATGTTGACT 7

6 Algorithms Overlap Graph 1) Overlap phase: for all pairs of reads pairwise overlap alignments read1 read2 Computationally expensive! 8

7 Algorithms Overlap Graph 1) Overlap phase: r4 r5 r3 r6 r2 2) Layout phase: r1 nodes = reads edges = overlaps r r2-5 r3-3 r4-5 r r6 9

8 Algorithms Overlap Graph 1) Overlap phase: r4 r5 r3 r6 r2 2) Layout phase: r1 nodes = reads edges = overlaps r r2-5 r3-3 r4-5 r r6 path through the graph read layout In theory Hamiltonian path (NP-complete), in practice heuristics 10

9 Algorithms Overlap Graph 2) Layout phase: nodes = reads edges = overlaps -6 r1-3 r2-5 r3-3 r4-5 r r6 3) Consensus phase: multiple read alignment contig r1 ACGTAATT r2 GTAATTCA r3 ATTCAGTC r4 GTCCATGT r5 CATGTTGA r6 TGTTGACT ACGTAATTCAGTCCATGTTGACT 11

10 Algorithms Overlap Graph 2) Layout phase: nodes = reads edges = overlaps -6 r1-3 r2-5 r3-3 r4-5 r r6 3) Consensus phase: multiple read alignment contig r1 ACGTAATT r2 GTAATTCA r3 AT - CAGTC r4 GTCCATGT r5 CATATTGA r6 TGTTGACT ACGTAATTCAGTCCATGTTGACT robust with alignment errors 12

11 13

12 14

13 15

14 Overlap Graph There are not only simple paths A R B R C Approximate ordering in the overlap graph: B A R - Coverage is high - Path branches R is a repeat! C 16

15 Overlap Graph There are not only simple paths A R B R C Approximate ordering in the overlap graph: B A R C 17

16 Overlap Graph Now: R1 and R2 are nearly identical A R1 B R2 C A A A T T Approximate ordering in the overlap graph: A Rx A A A T T T B C Overlap strictness: Tradeoff between error tolerance and natural repeat separation 18

17 Overlap Graph Now: R1 and R2 are nearly identical A R1 B R2 C A A A T T Approximate ordering in the overlap graph: A R1 B Overlap strictness: Tradeoff between error tolerance and natural repeat separation R2 C 20

18 Algorithms - de Bruijn graph (Euler assembler) - No overlap phase, no consensus phase, basically just a layout phase Nodes = k-mers Edges = (k+1)-mers Given k = 4 and three read sequences: r1 r2 r3 CGTAATTC ATTC GTAATTCA TACGTAAT r3 TACG r2 ACGT CGTA GTAA TAAT AATT ATTC TTCA r1 contig: TACGTAATTCA r3 TACGTAAT r1 CGTAATTC r2 GTAATTCA In theory de Bruijn Superwalk Problem (NP-hard), in practice heuristics de Bruijn, 1946; Pevzner, 2001; Medvedev

19 de Bruijn graph Again, there are not only simple paths...g ACGT ACGTCA... GACGTACG CGTACGTC GTACGTCA GTCA CGTC GACGTCA is a read-incoherent path CGTA GACG ACGT GTAC Repetitive k-mers introduce cycles TACG 22

20 de Bruijn graph What if we increase k to 5?...GACGTACGTCA... GACGTACG CGTACGTC GTACGTCA GACGT ACGTA CGTAC GTACG TACGT ACGTC CGTCA Back to a linear graph structure Increasing k leads to better repeat resolution 23

21 de Bruijn graph What if we have sequencing errors?...gacgtacgtca... GACGTACG CGTACGTC GTACGGCA GACGT ACGTA CGTAC GTACG TACGT ACGTC CGTCA Additional nodes TACGG ACGGC CGGCA 24

22 A quick example One read: GTCGAGG GTCG (1x) TCGA (1x) CGAG GAGG (1x) (1x) 25

23 26 A quick example AGAT (8x) ATCC (7x) TCCG (7x) CCGA (7x) CGAT (6x) GATG (5x) ATGA (8x) TGAG (9x) GATC (8x) GATT (1x) TAGT (3x) AGTC (7x) GTCG (9x) TCGA (10x) GGCT (11x) TAGA (16x) AGAG (9x) GAGA (12x) GACA (8x) ACAG (5x) GCTT (8x) GCTC (2x) CTTT (8x) CTCT (1x) TTTA (8x) TCTA (2x) TTAG (12x) CTAG (2x) AGAC (9x) AGAA (1x) CGAG (8x) CGAC (1x) GAGG (16x) GACG (1x) AGGC (16x) ACGC (1x) All the others

24 27 A quick example AGAT (8x) ATCC (7x) TCCG (7x) CCGA (7x) CGAT (6x) GATG (5x) ATGA (8x) TGAG (9x) GATC (8x) GATT (1x) TAGT (3x) AGTC (7x) GTCG (9x) TCGA (10x) GGCT (11x) TAGA (16x) AGAG (9x) GAGA (12x) GACA (8x) ACAG (5x) GCTT (8x) GCTC (2x) CTTT (8x) CTCT (1x) TTTA (8x) TCTA (2x) TTAG (12x) CTAG (2x) AGAC (9x) AGAA (1x) CGAG (8x) CGAC (1x) GAGG (16x) GACG (1x) AGGC (16x) ACGC (1x) All the others

25 A quick example After simplification GATT AGAT TAGTCGA CGAG GATCCGATGAG GCTCTAG AGAA CGACGC GAGGCT GCTTTAG TAGA AGAGA AGACAG 28

26 Error removal Tips removed AGAT TAGTCGA CGAG GATCCGATGAG GCTCTAG GAGGCT GCTTTAG TAGA AGAGA AGACAG 29

27 Error removal Bubbles removed GATCCGATGAG AGAT TAGTCGA CGAG GAGGCT GCTTTAG TAGA AGAGA AGACAG 30

28 Error removal Final simplification AGATCCGATGAG TAGTCGAG GAGGCTTTAGA AGAGACAG 31

29 de Bruijn graph What if we have sequencing errors?...gacgtacgtca... GACGTACG CGTACGTC GTACCTCA GACGT ACGTA CGTAC GTACG TACGT ACGTC CGTCA GTACC TACCT ACCTC CCTCA Additional nodes & graph disconnected Choice of k: Tradeoff between error tolerance and repeat resolution 32

30 What is a good assembly? Some assembly evaluation metrics: - High N50 contig size (most commonly used metric) N50 contig size contigs sorted by size - Low number of assembly errors (sequence errors, structural misjoins) original sequence assembled as Only if a reference sequence is known! 36

31 Conclusions - Most assemblers use either the OLC or de Bruijn graph paradigm, both lead to NP-hard assembly models. - However, assembler performance is independent of the underlying paradigm and mainly depends on heuristics for repeat resolution and handling noise. - How to measure assembly accuracy is another important aspect, in general tradeoff between assembly contiguity and correctness. - Evaluations show that assembly is far from solved, assembler performance still quite inconsistent. For large genomes, the choice of assemblers is often limited to those that will run without crashing (GAGE paper, 2012) 39

32 References - GAGE: a critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg et al., Genome Res Assemblathon 2: evaluating de-novo of genome assembly in three vertebrate species, Bradnam et al., not yet published - Assembly algorithms for next-generation sequencing data. Jason R. Miller, Sergey Koren, Granger Sutton, Genomics, Fragment assembly string graph, Myers, 2005 Figures: - geneed.nlm.nih.gov A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies, Zhang et al., PLOS one, 2011 Titel, Datum 40

Convey Application Performance

Convey Application Performance Convey Application Performance Steve Wallach swallach at conveycomputer dot com Convey

More information

Frühjahrstreffen ZKI-Supercomputing , Kaiserlautern

Frühjahrstreffen ZKI-Supercomputing , Kaiserlautern Dataintensive Computing Goes HPC Example: Bioinformatics Applications THE WORLD S FIRST HYBRID-CORE COMPUTER. Frühjahrstreffen ZKI-Supercomputing 19. 20.4.2012, Kaiserlautern Dr. Hans-Joachim Hinz - Vertriebsleiter

More information

Introduction to Genome Assembly. Tandy Warnow

Introduction to Genome Assembly. Tandy Warnow Introduction to Genome Assembly Tandy Warnow 2 Shotgun DNA Sequencing DNA target sample SHEAR & SIZE End Reads / Mate Pairs 550bp 10,000bp Not all sequencing technologies produce mate-pairs. Different

More information

Description of a genome assembler: CABOG

Description of a genome assembler: CABOG Theo Zimmermann Description of a genome assembler: CABOG CABOG (Celera Assembler with the Best Overlap Graph) is an assembler built upon the Celera Assembler, which, at first, was designed for Sanger sequencing,

More information

de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics 27626 - Next Generation Sequencing Analysis Generalized NGS analysis Data size Application Assembly: Compare

More information

CS681: Advanced Topics in Computational Biology

CS681: Advanced Topics in Computational Biology CS681: Advanced Topics in Computational Biology Can Alkan EA224 calkan@cs.bilkent.edu.tr Week 7 Lectures 2-3 http://www.cs.bilkent.edu.tr/~calkan/teaching/cs681/ Genome Assembly Test genome Random shearing

More information

Read Mapping. de Novo Assembly. Genomics: Lecture #2 WS 2014/2015

Read Mapping. de Novo Assembly. Genomics: Lecture #2 WS 2014/2015 Mapping de Novo Assembly Institut für Medizinische Genetik und Humangenetik Charité Universitätsmedizin Berlin Genomics: Lecture #2 WS 2014/2015 Today Genome assembly: the basics Hamiltonian and Eulerian

More information

RESEARCH TOPIC IN BIOINFORMANTIC

RESEARCH TOPIC IN BIOINFORMANTIC RESEARCH TOPIC IN BIOINFORMANTIC GENOME ASSEMBLY Instructor: Dr. Yufeng Wu Noted by: February 25, 2012 Genome Assembly is a kind of string sequencing problems. As we all know, the human genome is very

More information

Building approximate overlap graphs for DNA assembly using random-permutations-based search.

Building approximate overlap graphs for DNA assembly using random-permutations-based search. An algorithm is presented for fast construction of graphs of reads, where an edge between two reads indicates an approximate overlap between the reads. Since the algorithm finds approximate overlaps directly,

More information

Sequence Assembly. BMI/CS 576 Mark Craven Some sequencing successes

Sequence Assembly. BMI/CS 576  Mark Craven Some sequencing successes Sequence Assembly BMI/CS 576 www.biostat.wisc.edu/bmi576/ Mark Craven craven@biostat.wisc.edu Some sequencing successes Yersinia pestis Cannabis sativa The sequencing problem We want to determine the identity

More information

I519 Introduction to Bioinformatics, Genome assembly. Yuzhen Ye School of Informatics & Computing, IUB

I519 Introduction to Bioinformatics, Genome assembly. Yuzhen Ye School of Informatics & Computing, IUB I519 Introduction to Bioinformatics, 2014 Genome assembly Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Contents Genome assembly problem Approaches Comparative assembly The string

More information

Reducing Genome Assembly Complexity with Optical Maps

Reducing Genome Assembly Complexity with Optical Maps Reducing Genome Assembly Complexity with Optical Maps Lee Mendelowitz LMendelo@math.umd.edu Advisor: Dr. Mihai Pop Computer Science Department Center for Bioinformatics and Computational Biology mpop@umiacs.umd.edu

More information

Sequence mapping and assembly. Alistair Ward - Boston College

Sequence mapping and assembly. Alistair Ward - Boston College Sequence mapping and assembly Alistair Ward - Boston College Sequenced a genome? Fragmented a genome -> DNA library PCR amplification Sequence reads (ends of DNA fragment for mate pairs) We no longer have

More information

Performance analysis of parallel de novo genome assembly in shared memory system

Performance analysis of parallel de novo genome assembly in shared memory system IOP Conference Series: Earth and Environmental Science PAPER OPEN ACCESS Performance analysis of parallel de novo genome assembly in shared memory system To cite this article: Syam Budi Iryanto et al 2018

More information

BLAST & Genome assembly

BLAST & Genome assembly BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies May 15, 2014 1 BLAST What is BLAST? The algorithm 2 Genome assembly De novo assembly Mapping assembly 3

More information

A THEORETICAL ANALYSIS OF SCALABILITY OF THE PARALLEL GENOME ASSEMBLY ALGORITHMS

A THEORETICAL ANALYSIS OF SCALABILITY OF THE PARALLEL GENOME ASSEMBLY ALGORITHMS A THEORETICAL ANALYSIS OF SCALABILITY OF THE PARALLEL GENOME ASSEMBLY ALGORITHMS Munib Ahmed, Ishfaq Ahmad Department of Computer Science and Engineering, University of Texas At Arlington, Arlington, Texas

More information

Genome Assembly and De Novo RNAseq

Genome Assembly and De Novo RNAseq Genome Assembly and De Novo RNAseq BMI 7830 Kun Huang Department of Biomedical Informatics The Ohio State University Outline Problem formulation Hamiltonian path formulation Euler path and de Bruijin graph

More information

(for more info see:

(for more info see: Genome assembly (for more info see: http://www.cbcb.umd.edu/research/assembly_primer.shtml) Introduction Sequencing technologies can only "read" short fragments from a genome. Reconstructing the entire

More information

7.36/7.91 recitation. DG Lectures 5 & 6 2/26/14

7.36/7.91 recitation. DG Lectures 5 & 6 2/26/14 7.36/7.91 recitation DG Lectures 5 & 6 2/26/14 1 Announcements project specific aims due in a little more than a week (March 7) Pset #2 due March 13, start early! Today: library complexity BWT and read

More information

Genome 373: Genome Assembly. Doug Fowler

Genome 373: Genome Assembly. Doug Fowler Genome 373: Genome Assembly Doug Fowler What are some of the things we ve seen we can do with HTS data? We ve seen that HTS can enable a wide variety of analyses ranging from ID ing variants to genome-

More information

Taller práctico sobre uso, manejo y gestión de recursos genómicos de abril de 2013 Assembling long-read Transcriptomics

Taller práctico sobre uso, manejo y gestión de recursos genómicos de abril de 2013 Assembling long-read Transcriptomics Taller práctico sobre uso, manejo y gestión de recursos genómicos 22-24 de abril de 2013 Assembling long-read Transcriptomics Rocío Bautista Outline Introduction How assembly Tools assembling long-read

More information

Assembly in the Clouds

Assembly in the Clouds Assembly in the Clouds Michael Schatz October 13, 2010 Beyond the Genome Shredded Book Reconstruction Dickens accidentally shreds the first printing of A Tale of Two Cities Text printed on 5 long spools

More information

Reducing Genome Assembly Complexity with Optical Maps Mid-year Progress Report

Reducing Genome Assembly Complexity with Optical Maps Mid-year Progress Report Reducing Genome Assembly Complexity with Optical Maps Mid-year Progress Report Lee Mendelowitz LMendelo@math.umd.edu Advisor: Dr. Mihai Pop Computer Science Department Center for Bioinformatics and Computational

More information

Shotgun sequencing. Coverage is simply the average number of reads that overlap each true base in genome.

Shotgun sequencing. Coverage is simply the average number of reads that overlap each true base in genome. Shotgun sequencing Genome (unknown) Reads (randomly chosen; have errors) Coverage is simply the average number of reads that overlap each true base in genome. Here, the coverage is ~10 just draw a line

More information

Computational models for bionformatics

Computational models for bionformatics Computational models for bionformatics De-novo assembly and alignment-free measures Michele Schimd Department of Information Engineering July 8th, 2015 Michele Schimd (DEI) PostDoc @ DEI July 8th, 2015

More information

CS 173, Lecture B Introduction to Genome Assembly (using Eulerian Graphs) Tandy Warnow

CS 173, Lecture B Introduction to Genome Assembly (using Eulerian Graphs) Tandy Warnow CS 173, Lecture B Introduction to Genome Assembly (using Eulerian Graphs) Tandy Warnow 2 Shotgun DNA Sequencing DNA target sample SHEAR & SIZE End Reads / Mate Pairs 550bp 10,000bp Not all sequencing technologies

More information

Computational Architecture of Cloud Environments Michael Schatz. April 1, 2010 NHGRI Cloud Computing Workshop

Computational Architecture of Cloud Environments Michael Schatz. April 1, 2010 NHGRI Cloud Computing Workshop Computational Architecture of Cloud Environments Michael Schatz April 1, 2010 NHGRI Cloud Computing Workshop Cloud Architecture Computation Input Output Nebulous question: Cloud computing = Utility computing

More information

Purpose of sequence assembly

Purpose of sequence assembly Sequence Assembly Purpose of sequence assembly Reconstruct long DNA/RNA sequences from short sequence reads Genome sequencing RNA sequencing for gene discovery Amplicon sequencing But not for transcript

More information

Reducing Genome Assembly Complexity with Optical Maps

Reducing Genome Assembly Complexity with Optical Maps Reducing Genome Assembly Complexity with Optical Maps AMSC 663 Mid-Year Progress Report 12/13/2011 Lee Mendelowitz Lmendelo@math.umd.edu Advisor: Mihai Pop mpop@umiacs.umd.edu Computer Science Department

More information

AMemoryEfficient Short Read De Novo Assembly Algorithm

AMemoryEfficient Short Read De Novo Assembly Algorithm Original Paper AMemoryEfficient Short Read De Novo Assembly Algorithm Yuki Endo 1,a) Fubito Toyama 1 Chikafumi Chiba 2 Hiroshi Mori 1 Kenji Shoji 1 Received: October 17, 2014, Accepted: October 29, 2014,

More information

Assembly of the Ariolimax dolicophallus genome with Discovar de novo. Chris Eisenhart, Robert Calef, Natasha Dudek, Gepoliano Chaves

Assembly of the Ariolimax dolicophallus genome with Discovar de novo. Chris Eisenhart, Robert Calef, Natasha Dudek, Gepoliano Chaves Assembly of the Ariolimax dolicophallus genome with Discovar de novo Chris Eisenhart, Robert Calef, Natasha Dudek, Gepoliano Chaves Overview -Introduction -Pair correction and filling -Assembly theory

More information

de novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January /73

de novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January /73 1/73 de novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January 2014 2/73 YOUR INSTRUCTOR IS.. - Postdoc at Penn State, USA - PhD at INRIA / ENS Cachan,

More information

Adam M Phillippy Center for Bioinformatics and Computational Biology

Adam M Phillippy Center for Bioinformatics and Computational Biology Adam M Phillippy Center for Bioinformatics and Computational Biology WGS sequencing shearing sequencing assembly WGS assembly Overlap reads identify reads with shared k-mers calculate edit distance Layout

More information

BLAST & Genome assembly

BLAST & Genome assembly BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies November 17, 2012 1 Introduction Introduction 2 BLAST What is BLAST? The algorithm 3 Genome assembly De

More information

02-711/ Computational Genomics and Molecular Biology Fall 2016

02-711/ Computational Genomics and Molecular Biology Fall 2016 Literature assignment 2 Due: Nov. 3 rd, 2016 at 4:00pm Your name: Article: Phillip E C Compeau, Pavel A. Pevzner, Glenn Tesler. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology 29,

More information

Introduction and tutorial for SOAPdenovo. Xiaodong Fang Department of Science and BGI May, 2012

Introduction and tutorial for SOAPdenovo. Xiaodong Fang Department of Science and BGI May, 2012 Introduction and tutorial for SOAPdenovo Xiaodong Fang fangxd@genomics.org.cn Department of Science and Technology @ BGI May, 2012 Why de novo assembly? Genome is the genetic basis for different phenotypes

More information

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

Genome Assembly Using de Bruijn Graphs. Biostatistics 666 Genome Assembly Using de Bruijn Graphs Biostatistics 666 Previously: Reference Based Analyses Individual short reads are aligned to reference Genotypes generated by examining reads overlapping each position

More information

de Bruijn graphs for sequencing data

de Bruijn graphs for sequencing data de Bruijn graphs for sequencing data Rayan Chikhi CNRS Bonsai team, CRIStAL/INRIA, Univ. Lille 1 SMPGD 2016 1 MOTIVATION - de Bruijn graphs are instrumental for reference-free sequencing data analysis:

More information

Next Generation Sequencing Workshop De novo genome assembly

Next Generation Sequencing Workshop De novo genome assembly Next Generation Sequencing Workshop De novo genome assembly Tristan Lefébure TNL7@cornell.edu Stanhope Lab Population Medicine & Diagnostic Sciences Cornell University April 14th 2010 De novo assembly

More information

Finishing Circular Assemblies. J Fass UCD Genome Center Bioinformatics Core Thursday April 16, 2015

Finishing Circular Assemblies. J Fass UCD Genome Center Bioinformatics Core Thursday April 16, 2015 Finishing Circular Assemblies J Fass UCD Genome Center Bioinformatics Core Thursday April 16, 2015 Assembly Strategies de Bruijn graph Velvet, ABySS earlier, basic assemblers IDBA, SPAdes later, multi-k

More information

Omega: an Overlap-graph de novo Assembler for Metagenomics

Omega: an Overlap-graph de novo Assembler for Metagenomics Omega: an Overlap-graph de novo Assembler for Metagenomics B a h l e l H a i d e r, Ta e - H y u k A h n, B r i a n B u s h n e l l, J u a n j u a n C h a i, A l e x C o p e l a n d, C h o n g l e Pa n

More information

How to apply de Bruijn graphs to genome assembly

How to apply de Bruijn graphs to genome assembly PRIMER How to apply de Bruijn graphs to genome assembly Phillip E C Compeau, Pavel A Pevzner & lenn Tesler A mathematical concept known as a de Bruijn graph turns the formidable challenge of assembling

More information

CS 68: BIOINFORMATICS. Prof. Sara Mathieson Swarthmore College Spring 2018

CS 68: BIOINFORMATICS. Prof. Sara Mathieson Swarthmore College Spring 2018 CS 68: BIOINFORMATICS Prof. Sara Mathieson Swarthmore College Spring 2018 Outline: Jan 31 DBG assembly in practice Velvet assembler Evaluation of assemblies (if time) Start: string alignment Candidate

More information

Sequence Assembly Required!

Sequence Assembly Required! Sequence Assembly Required! 1 October 3, ISMB 20172007 1 Sequence Assembly Genome Sequenced Fragments (reads) Assembled Contigs Finished Genome 2 Greedy solution is bounded 3 Typical assembly strategy

More information

Michał Kierzynka et al. Poznan University of Technology. 17 March 2015, San Jose

Michał Kierzynka et al. Poznan University of Technology. 17 March 2015, San Jose Michał Kierzynka et al. Poznan University of Technology 17 March 2015, San Jose The research has been supported by grant No. 2012/05/B/ST6/03026 from the National Science Centre, Poland. DNA de novo assembly

More information

De novo sequencing and Assembly. Andreas Gisel International Institute of Tropical Agriculture (IITA) Ibadan, Nigeria

De novo sequencing and Assembly. Andreas Gisel International Institute of Tropical Agriculture (IITA) Ibadan, Nigeria De novo sequencing and Assembly Andreas Gisel International Institute of Tropical Agriculture (IITA) Ibadan, Nigeria The Principle of Mapping reads good, ood_, d_mo, morn, orni, ning, ing_, g_be, beau,

More information

debgr: An Efficient and Near-Exact Representation of the Weighted de Bruijn Graph Prashant Pandey Stony Brook University, NY, USA

debgr: An Efficient and Near-Exact Representation of the Weighted de Bruijn Graph Prashant Pandey Stony Brook University, NY, USA debgr: An Efficient and Near-Exact Representation of the Weighted de Bruijn Graph Prashant Pandey Stony Brook University, NY, USA De Bruijn graphs are ubiquitous [Pevzner et al. 2001, Zerbino and Birney,

More information

IDBA - A practical Iterative de Bruijn Graph De Novo Assembler

IDBA - A practical Iterative de Bruijn Graph De Novo Assembler IDBA - A practical Iterative de Bruijn Graph De Novo Assembler Speaker: Gabriele Capannini May 21, 2010 Introduction De Novo Assembly assembling reads together so that they form a new, previously unknown

More information

Characterizing and Optimizing the Memory Footprint of De Novo Short Read DNA Sequence Assembly

Characterizing and Optimizing the Memory Footprint of De Novo Short Read DNA Sequence Assembly Characterizing and Optimizing the Memory Footprint of De Novo Short Read DNA Sequence Assembly Jeffrey J. Cook Craig Zilles 2 Department of Electrical and Computer Engineering 2 Department of Computer

More information

De Novo Draft Genome Assembly Using Fuzzy K-mers

De Novo Draft Genome Assembly Using Fuzzy K-mers De Novo Draft Genome Assembly Using Fuzzy K-mers John Healy Department of Computing & Mathematics Galway-Mayo Institute of Technology Ireland e-mail: john.healy@gmit.ie Abstract- Although second generation

More information

Assembly in the Clouds

Assembly in the Clouds Assembly in the Clouds Michael Schatz November 12, 2010 GENSIPS'10 Outline 1. Genome Assembly by Analogy 2. DNA Sequencing and Genomics 3. Sequence Analysis in the Clouds 1. Mapping & Genotyping 2. De

More information

Gap Filling as Exact Path Length Problem

Gap Filling as Exact Path Length Problem Gap Filling as Exact Path Length Problem RECOMB 2015 Leena Salmela 1 Kristoffer Sahlin 2 Veli Mäkinen 1 Alexandru I. Tomescu 1 1 University of Helsinki 2 KTH Royal Institute of Technology April 12th, 2015

More information

Mar%n Norling. Uppsala, November 15th 2016

Mar%n Norling. Uppsala, November 15th 2016 Mar%n Norling Uppsala, November 15th 2016 Sequencing recap This lecture is focused on illumina, but the techniques are the same for all short-read sequencers. Short reads are (generally) high quality and

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics Adapted from slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen, which are partly from http://bix.ucsd.edu/bioalgorithms/slides.php 582670 Algorithms for Bioinformatics Lecture 3: Graph Algorithms

More information

CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly

CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly Ben Raphael Sept. 22, 2009 http://cs.brown.edu/courses/csci2950-c/ l-mer composition Def: Given string s, the Spectrum ( s, l ) is unordered multiset

More information

Constrained traversal of repeats with paired sequences

Constrained traversal of repeats with paired sequences RECOMB 2011 Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq) 26-27 March 2011, Vancouver, BC, Canada; Short talk: 2011-03-27 12:10-12:30 (presentation: 15 minutes, questions: 5 minutes)

More information

Scalable Solutions for DNA Sequence Analysis

Scalable Solutions for DNA Sequence Analysis Scalable Solutions for DNA Sequence Analysis Michael Schatz Dec 4, 2009 JHU/UMD Joint Sequencing Meeting The Evolution of DNA Sequencing Year Genome Technology Cost 2001 Venter et al. Sanger (ABI) $300,000,000

More information

MITOCW watch?v=zyw2aede6wu

MITOCW watch?v=zyw2aede6wu MITOCW watch?v=zyw2aede6wu The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To

More information

IDBA A Practical Iterative de Bruijn Graph De Novo Assembler

IDBA A Practical Iterative de Bruijn Graph De Novo Assembler IDBA A Practical Iterative de Bruijn Graph De Novo Assembler Yu Peng, Henry C.M. Leung, S.M. Yiu, and Francis Y.L. Chin Department of Computer Science, The University of Hong Kong Pokfulam Road, Hong Kong

More information

Scalable Genomic Assembly through Parallel de Bruijn Graph Construction for Multiple K-mers

Scalable Genomic Assembly through Parallel de Bruijn Graph Construction for Multiple K-mers Scalable Genomic Assembly through Parallel de Bruijn Graph Construction for Multiple K-mers Purdue University, West Lafayette, IN, USA {kmahadik,christopherwright,milind,sbagchi,schaterji}@purdue.edu ABSTRACT

More information

Read Mapping and Assembly

Read Mapping and Assembly Statistical Bioinformatics: Read Mapping and Assembly Stefan Seemann seemann@rth.dk University of Copenhagen April 9th 2019 Why sequencing? Why sequencing? Which organism does the sample comes from? Assembling

More information

ABySS. Assembly By Short Sequences

ABySS. Assembly By Short Sequences ABySS Assembly By Short Sequences ABySS Developed at Canada s Michael Smith Genome Sciences Centre Developed in response to memory demands of conventional DBG assembly methods Parallelizability Illumina

More information

Genomic Finishing & Consed

Genomic Finishing & Consed Genomic Finishing & Consed SEA stages of genomic analysis Draft vs Finished Draft Sequence Single sequencing approach Limited human intervention Cheap, Fast Finished sequence Multiple approaches Human

More information

Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data

Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data 1/39 Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data Rayan Chikhi ENS Cachan Brittany / IRISA (Genscale team) Advisor : Dominique Lavenier 2/39 INTRODUCTION, YEAR 2000

More information

IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler

IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler Yu Peng, Henry Leung, S.M. Yiu, Francis Y.L. Chin Department of Computer Science, The University of Hong Kong Pokfulam Road, Hong Kong {ypeng,

More information

Problem statement. CS267 Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly. Spring Example.

Problem statement. CS267 Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly. Spring Example. CS267 Assignment 3: Problem statement 2 Parallelize Graph Algorithms for de Novo Genome Assembly k-mers are sequences of length k (alphabet is A/C/G/T). An extension is a simple symbol (A/C/G/T/F). The

More information

Data Preprocessing. Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

Data Preprocessing. Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis Data Preprocessing Next Generation Sequencing analysis DTU Bioinformatics Generalized NGS analysis Data size Application Assembly: Compare Raw Pre- specific: Question Alignment / samples / Answer? reads

More information

Tutorial for Windows and Macintosh. De Novo Sequence Assembly with Velvet

Tutorial for Windows and Macintosh. De Novo Sequence Assembly with Velvet Tutorial for Windows and Macintosh De Novo Sequence Assembly with Velvet 2017 Gene Codes Corporation Gene Codes Corporation 525 Avis Drive, Ann Arbor, MI 48108 USA 1.800.497.4939 (USA) +1.734.769.7249

More information

The Value of Mate-pairs for Repeat Resolution

The Value of Mate-pairs for Repeat Resolution The Value of Mate-pairs for Repeat Resolution An Analysis on Graphs Created From Short Reads Joshua Wetzel Department of Computer Science Rutgers University Camden in conjunction with CBCB at University

More information

DNA Fragment Assembly

DNA Fragment Assembly SIGCSE 009 Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri DNA Fragment Assembly

More information

Data Preprocessing : Next Generation Sequencing analysis CBS - DTU Next Generation Sequencing Analysis

Data Preprocessing : Next Generation Sequencing analysis CBS - DTU Next Generation Sequencing Analysis Data Preprocessing 27626: Next Generation Sequencing analysis CBS - DTU Generalized NGS analysis Data size Application Assembly: Compare Raw Pre- specific: Question Alignment / samples / Answer? reads

More information

Reevaluating Assembly Evaluations using Feature Analysis: GAGE and Assemblathons Supplementary material

Reevaluating Assembly Evaluations using Feature Analysis: GAGE and Assemblathons Supplementary material Reevaluating Assembly Evaluations using Feature Analysis: GAGE and Assemblathons Supplementary material Francesco Vezzi, Giuseppe Narzisi, Bud Mishra September 29, 22 Features computation FRC bam computes

More information

On Algorithmic Complexity of Biomolecular Sequence Assembly Problem

On Algorithmic Complexity of Biomolecular Sequence Assembly Problem On lgorithmic Complexity of Biomolecular Sequence ssembly Problem Giuseppe Narzisi 1, Bud Mishra 1,2, and Michael C Schatz 1 1 Simons Center for Quantitative Biology, One Bungtown Road, Cold Spring Harbor

More information

DNA Sequencing. Overview

DNA Sequencing. Overview BINF 3350, Genomics and Bioinformatics DNA Sequencing Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Eulerian Cycles Problem Hamiltonian Cycles

More information

The Realities of Genome Assembly

The Realities of Genome Assembly 2/19/2018 Lecture11 The Realities of Genome Assembly 1 http://localhost:8888/notebooks/comp555s18/lecture11.ipynb# 1/1 From Last Time What we learned from a related "Minimal Superstring" problem Can be

More information

Towards a de novo short read assembler for large genomes using cloud computing

Towards a de novo short read assembler for large genomes using cloud computing Towards a de novo short read assembler for large genomes using cloud computing Michael Schatz April 21, 2009 AMSC664 Advanced Scientific Computing Outline 1.! Genome assembly by analogy 2.! DNA sequencing

More information

Parallelization of MIRA Whole Genome and EST Sequence Assembler [1]

Parallelization of MIRA Whole Genome and EST Sequence Assembler [1] Parallelization of MIRA Whole Genome and EST Sequence Assembler [1] Abhishek Biswas *, Desh Ranjan, Mohammad Zubair Department of Computer Science, Old Dominion University, Norfolk, VA 23529 USA * abiswas@cs.odu.edu

More information

HiPGA: A High Performance Genome Assembler for Short Read Sequence Data

HiPGA: A High Performance Genome Assembler for Short Read Sequence Data 2014 IEEE 28th International Parallel & Distributed Processing Symposium Workshops HiPGA: A High Performance Genome Assembler for Short Read Sequence Data Xiaohui Duan, Kun Zhao, Weiguo Liu* School of

More information

GATB. The Genomic Assembly & Analysis Toolbox. GATB Workshop. D.Lavenier E.Drézen

GATB. The Genomic Assembly & Analysis Toolbox. GATB Workshop. D.Lavenier E.Drézen The Genomic Assembly & Analysis Toolbox GATB Workshop D.Lavenier E.Drézen INTRODUCTION NGS technologies produce terabytes of data Efficient and fast algorithms are essential to analyze such data >read1

More information

Genome Sequencing Algorithms

Genome Sequencing Algorithms Genome Sequencing Algorithms Phillip Compaeu and Pavel Pevzner Bioinformatics Algorithms: an Active Learning Approach Leonhard Euler (1707 1783) William Hamilton (1805 1865) Nicolaas Govert de Bruijn (1918

More information

Parallelization of Velvet, a de novo genome sequence assembler

Parallelization of Velvet, a de novo genome sequence assembler Parallelization of Velvet, a de novo genome sequence assembler Nitin Joshi, Shashank Shekhar Srivastava, M. Milner Kumar, Jojumon Kavalan, Shrirang K. Karandikar and Arundhati Saraph Department of Computer

More information

DNA Fragment Assembly Algorithms: Toward a Solution for Long Repeats

DNA Fragment Assembly Algorithms: Toward a Solution for Long Repeats San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research 2008 DNA Fragment Assembly Algorithms: Toward a Solution for Long Repeats Ching Li San Jose State University

More information

ABSTRACT USING MANY-CORE COMPUTING TO SPEED UP DE NOVO TRANSCRIPTOME ASSEMBLY. Sean O Brien, Master of Science, 2016

ABSTRACT USING MANY-CORE COMPUTING TO SPEED UP DE NOVO TRANSCRIPTOME ASSEMBLY. Sean O Brien, Master of Science, 2016 ABSTRACT Title of thesis: USING MANY-CORE COMPUTING TO SPEED UP DE NOVO TRANSCRIPTOME ASSEMBLY Sean O Brien, Master of Science, 2016 Thesis directed by: Professor Uzi Vishkin University of Maryland Institute

More information

Tutorial: De Novo Assembly of Paired Data

Tutorial: De Novo Assembly of Paired Data : De Novo Assembly of Paired Data September 20, 2013 CLC bio Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 Fax: +45 86 20 12 22 www.clcbio.com support@clcbio.com : De Novo Assembly

More information

ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS

ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS Ivan Vogel Doctoral Degree Programme (1), FIT BUT E-mail: xvogel01@stud.fit.vutbr.cz Supervised by: Jaroslav Zendulka E-mail: zendulka@fit.vutbr.cz

More information

Next generation sequencing: assembly by mapping reads. Laurent Falquet, Vital-IT Helsinki, June 3, 2010

Next generation sequencing: assembly by mapping reads. Laurent Falquet, Vital-IT Helsinki, June 3, 2010 Next generation sequencing: assembly by mapping reads Laurent Falquet, Vital-IT Helsinki, June 3, 2010 Overview What is assembly by mapping? Methods BWT File formats Tools Issues Visualization Discussion

More information

DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies

DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies Chengxi Ye 1, Christopher M. Hill 1, Shigang Wu 2, Jue Ruan 2, Zhanshan (Sam) Ma

More information

A maximum likelihood approach to genome assembly

A maximum likelihood approach to genome assembly A maximum likelihood approach to genome assembly Laureando: Giacomo Baruzzo Relatore: Prof. Gianfranco Bilardi 08/10/2013 UNIVERSITÀ DEGLI STUDI DI PADOVA Dipartimento di Ingegneria dell Informazione -

More information

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems. Sequencing Short Alignment Tobias Rausch 7 th June 2010 WGS RNA-Seq Exon Capture ChIP-Seq Sequencing Paired-End Sequencing Target genome Fragments Roche GS FLX Titanium Illumina Applied Biosystems SOLiD

More information

De Novo Assembly in Your Own Lab: Virtual Supercomputer Using Volunteer Computing

De Novo Assembly in Your Own Lab: Virtual Supercomputer Using Volunteer Computing British Journal of Research www.britishjr.org Original Article De Novo Assembly in Your Own Lab: Virtual Supercomputer Using Volunteer Computing V Uday Kumar Reddy 1, Rajashree Shettar* 2 and Vidya Niranjan

More information

MANUSCRIPT Pages 1 8. Information-Optimal Genome Assembly via Sparse Read-Overlap Graphs

MANUSCRIPT Pages 1 8. Information-Optimal Genome Assembly via Sparse Read-Overlap Graphs MANUSCRIPT Pages 8 Information-Optimal Genome Assembly via Sparse Read-Overlap Graphs Ilan Shomorony, Samuel H. Kim, Thomas Courtade and David N. C. Tse, Department of Electrical Engineering & Computer

More information

Preliminary Studies on de novo Assembly with Short Reads

Preliminary Studies on de novo Assembly with Short Reads Preliminary Studies on de novo Assembly with Short Reads Nanheng Wu Satish Rao, Ed. Yun S. Song, Ed. Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No.

More information

FAssem : FPGA based Acceleration of De Novo Genome Assembly

FAssem : FPGA based Acceleration of De Novo Genome Assembly Assem : PGA based Acceleration of De Novo Genome Assembly Sharat Chandra Varma, Paul Kolin, M. Balakrishnan, Dominique Lavenier To cite this version: Sharat Chandra Varma, Paul Kolin, M. Balakrishnan,

More information

PASQUAL: Parallel Techniques for Next Generation Genome Sequence Assembly

PASQUAL: Parallel Techniques for Next Generation Genome Sequence Assembly 1 PASQUAL: Parallel Techniques for Next Generation Genome Sequence Assembly Xing Liu Pushkar R. Pande Henning Meyerhenke David A. Bader Abstract The study of genomes has been revolutionized by sequencing

More information

Bioinformatics: Fragment Assembly. Walter Kosters, Universiteit Leiden. IPA Algorithms&Complexity,

Bioinformatics: Fragment Assembly. Walter Kosters, Universiteit Leiden. IPA Algorithms&Complexity, Bioinformatics: Fragment Assembly Walter Kosters, Universiteit Leiden IPA Algorithms&Complexity, 29.6.2007 www.liacs.nl/home/kosters/ 1 Fragment assembly Problem We study the following problem from bioinformatics:

More information

AN organism s genome consists of base pairs (bp) from two

AN organism s genome consists of base pairs (bp) from two IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 24, NO. 5, MAY 2013 977 PASQUAL: Parallel Techniques for Next Generation Genome Sequence Assembly Xing Liu, Student Member, IEEE, Pushkar R.

More information

Kermit. Walve, Riku Mikael. Schloss Dagstuhl - Leibniz-Zentrum für Informatik 2018

Kermit. Walve, Riku Mikael.   Schloss Dagstuhl - Leibniz-Zentrum für Informatik 2018 https://helda.helsinki.fi Kermit Walve, Riku Mikael Schloss Dagstuhl - Leibniz-Zentrum für Informatik 2018 Walve, R M, Rastas, P M A & Salmela, L M 2018, Kermit : Guided Long Read Assembly using Coloured

More information

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1 Database Searching In database search, we typically have a large sequence database

More information

Cloud Computing and the DNA Data Race Michael Schatz. April 14, 2011 Data-Intensive Analysis, Analytics, and Informatics

Cloud Computing and the DNA Data Race Michael Schatz. April 14, 2011 Data-Intensive Analysis, Analytics, and Informatics Cloud Computing and the DNA Data Race Michael Schatz April 14, 2011 Data-Intensive Analysis, Analytics, and Informatics Outline 1. Genome Assembly by Analogy 2. DNA Sequencing and Genomics 3. Large Scale

More information

DIME: A Novel De Novo Metagenomic Sequence Assembly Framework

DIME: A Novel De Novo Metagenomic Sequence Assembly Framework DIME: A Novel De Novo Metagenomic Sequence Assembly Framework Version 1.1 Xuan Guo Department of Computer Science Georgia State University Atlanta, GA 30303, U.S.A July 17, 2014 1 Contents 1 Introduction

More information