Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data

Similar documents
de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner

de Bruijn graphs for sequencing data

de novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January /73

Omega: an Overlap-graph de novo Assembler for Metagenomics

Sequence Assembly. BMI/CS 576 Mark Craven Some sequencing successes

Appendix A. Example code output. Chapter 1. Chapter 3

Sequence Assembly Required!

Genome Reconstruction: A Puzzle with a Billion Pieces. Phillip Compeau Carnegie Mellon University Computational Biology Department

by the Genevestigator program ( Darker blue color indicates higher gene expression.

Pyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds

GATB programming day

I519 Introduction to Bioinformatics, Genome assembly. Yuzhen Ye School of Informatics & Computing, IUB

IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler

Scalable Solutions for DNA Sequence Analysis

(for more info see:

Genome 373: Genome Assembly. Doug Fowler

Assembly in the Clouds

KisSplice. Identifying and Quantifying SNPs, indels and Alternative Splicing Events from RNA-seq data. 29th may 2013

IDBA A Practical Iterative de Bruijn Graph De Novo Assembler

Graph Algorithms in Bioinformatics

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.

DNA Fragment Assembly

debgr: An Efficient and Near-Exact Representation of the Weighted de Bruijn Graph Prashant Pandey Stony Brook University, NY, USA

TCGR: A Novel DNA/RNA Visualization Technique

Read Mapping. de Novo Assembly. Genomics: Lecture #2 WS 2014/2015

Techniques for de novo genome and metagenome assembly

Sequencing. Computational Biology IST Ana Teresa Freitas 2011/2012. (BACs) Whole-genome shotgun sequencing Celera Genomics

HP22.1 Roth Random Primer Kit A für die RAPD-PCR

Performance analysis of parallel de novo genome assembly in shared memory system

BLAST & Genome assembly

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

de novo assembly & k-mers

Building approximate overlap graphs for DNA assembly using random-permutations-based search.

Genome Assembly and De Novo RNAseq

Parallel de novo Assembly of Complex (Meta) Genomes via HipMer

Purpose of sequence assembly

Michał Kierzynka et al. Poznan University of Technology. 17 March 2015, San Jose

warm-up exercise Representing Data Digitally goals for today proteins example from nature

6 Anhang. 6.1 Transgene Su(var)3-9-Linien. P{GS.ry + hs(su(var)3-9)egfp} 1 I,II,III,IV 3 2I 3 3 I,II,III 3 4 I,II,III 2 5 I,II,III,IV 3

Reducing Genome Assembly Complexity with Optical Maps

SUPPLEMENTARY INFORMATION. Systematic evaluation of CRISPR-Cas systems reveals design principles for genome editing in human cells

DNA Sequencing. Overview

Computational methods for de novo assembly of next-generation genome sequencing data

Problem statement. CS267 Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly. Spring Example.

GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu

Supplementary Table 1. Data collection and refinement statistics

Taller práctico sobre uso, manejo y gestión de recursos genómicos de abril de 2013 Assembling long-read Transcriptomics

Shotgun sequencing. Coverage is simply the average number of reads that overlap each true base in genome.

BLAST & Genome assembly

Algorithms for Bioinformatics

Next Generation Sequencing Workshop De novo genome assembly

Introduction and tutorial for SOAPdenovo. Xiaodong Fang Department of Science and BGI May, 2012

Meraculous De Novo Assembly of the Ariolimax dolichophallus Genome. Charles Cole, Jake Houser, Kyle McGovern, and Jennie Richardson

RESEARCH TOPIC IN BIOINFORMANTIC

Review of Recent NGS Short Reads Alignment Tools BMI-231 final project, Chenxi Chen Spring 2014

MLiB - Mandatory Project 2. Gene finding using HMMs

DELL EMC POWER EDGE R940 MAKES DE NOVO ASSEMBLY EASIER

Algorithms for Bioinformatics

Constrained traversal of repeats with paired sequences

Mar%n Norling. Uppsala, November 15th 2016

Efficient Selection of Unique and Popular Oligos for Large EST Databases. Stefano Lonardi. University of California, Riverside

Digging into acceptor splice site prediction: an iterative feature selection approach

A THEORETICAL ANALYSIS OF SCALABILITY OF THE PARALLEL GENOME ASSEMBLY ALGORITHMS

A Fast Sketch-based Assembler for Genomes

CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly

Read Mapping and Assembly

Description of a genome assembler: CABOG

LoRDEC: accurate and efficient long read error correction

DNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization

Computational Architecture of Cloud Environments Michael Schatz. April 1, 2010 NHGRI Cloud Computing Workshop

Introduction to Genome Assembly. Tandy Warnow

Manual of SOAPdenovo-Trans-v1.03. Yinlong Xie, Gengxiong Wu, Jingbo Tang,

Machine Learning Classifiers

Reducing Genome Assembly Complexity with Optical Maps

10/15/2009 Comp 590/Comp Fall

Towards a de novo short read assembler for large genomes using cloud computing

Assembling short reads from jumping libraries with large insert sizes

Detecting Superbubbles in Assembly Graphs. Taku Onodera (U. Tokyo)! Kunihiko Sadakane (NII)! Tetsuo Shibuya (U. Tokyo)!

CS681: Advanced Topics in Computational Biology

discosnp++ Reference-free detection of SNPs and small indels v2.2.2

10/8/13 Comp 555 Fall

Gap Filling as Exact Path Length Problem

Parallel and memory-efficient reads indexing for genome assembly

Data Preprocessing. Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

Hybrid Parallel Programming

DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies

Read mapping on de Bruijn graphs

Reducing Genome Assembly Complexity with Optical Maps Mid-year Progress Report

7.36/7.91 recitation. DG Lectures 5 & 6 2/26/14

IDBA - A practical Iterative de Bruijn Graph De Novo Assembler

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units

Finishing Circular Assemblies. J Fass UCD Genome Center Bioinformatics Core Thursday April 16, 2015

Genome Sequencing Algorithms

Assembly in the Clouds

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

Mapping NGS reads for genomics studies

AMemoryEfficient Short Read De Novo Assembly Algorithm

Data Preprocessing : Next Generation Sequencing analysis CBS - DTU Next Generation Sequencing Analysis

arxiv: v1 [cs.dc] 31 May 2017

Transcription:

1/39 Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data Rayan Chikhi ENS Cachan Brittany / IRISA (Genscale team) Advisor : Dominique Lavenier

2/39 INTRODUCTION, YEAR 2000 : HUMAN GENOME PROJECT It s a giant resource that will change mankind, like the printing press. Dr James Watson, co-discoverer of DNA structure

2/39 INTRODUCTION, YEAR 2000 : HUMAN GENOME PROJECT It s a giant resource that will change mankind, like the printing press. First achievement : human sequencing Dr James Watson, co-discoverer of DNA structure the only way to read DNA is through small fragments (called reads) Sequencing process : 1) Obtain many copies of the genome 2) Cut them into millions of short fragments 3) Output the sequences of these fragments genome (unknown) reads: overlapping sub-sequences, covering the genome redundantly

3/39 INTRODUCTION, YEAR 2000 : HUMAN GENOME PROJECT Second achievement :

3/39 INTRODUCTION, YEAR 2000 : HUMAN GENOME PROJECT Second achievement : human de novo assembly (thesis topic) from millions of small fragments of DNA to a single sequence purely computational process required a supercomputer with 64 GB memory result was actually not perfect : assembly was fragmented

CONTEXT, YEAR 2012 : STILL DIFFICULT TO SEQUENCE TODAY? 4/39

CONTEXT, YEAR 2012 : STILL DIFFICULT TO SEQUENCE TODAY? 4/39

N EXT-G ENERATION S EQUENCING T ECHNOLOGIES I NGS = massively parallel sequencing 3 main NGS technologies z HGP technology z } Sanger } { { SOLiD 454 Illumina Proton, PacBio, Oxford 5/39

N EXT-G ENERATION S EQUENCING T ECHNOLOGIES I What everyone uses today : 3 main NGS technologies z HGP technology z } } { { Illumina 90 percent of the world s sequencing output is produced on Illumina instruments. GenomeWeb, February 14, 2012 ; verified with http://omicsmaps.com/stats read length 100 nt, i.e. 0.000003% of the human genome throughput equivalent to 1 human genome per day 5/39

6/39 HOW COMPUTATIONALLY HARD IS assembly TODAY? Tentative comparison of some software methods : 20 de novo assemblers omitted. Datasets : whole human genome, Illumina reads (except for Celera : Sanger reads) We focus on computational difficulty Quality of results : newer assemblies ( 2009) are much more fragmented, because of shorter reads

7/39 OUTLINE Definition of the assembly problem Contributions Contribution 1 : localized assembly Index Traversal Contribution 2 : incorporation of pairing information Monument assembler Results Contribution 3 : ultra-low memory assembly Minia Results Perspectives

7/39 Informal problem GENOME ASSEMBLY Given a set of sequenced reads, retrieve the genome. In computational terms Find an algorithm such that : Input : a set of reads that are sub-strings of the genome Output : the genome Toy example Input : {GAT, ATT, TTA, TAC, ACA, CAT, CAA} Output : GATTACATCAA

7/39 Informal problem GENOME ASSEMBLY Given a set of sequenced reads, retrieve the genome. In computational terms Find an algorithm such that : Input : a set of reads that are sub-strings of the genome Output : the genome Toy example Input : {GAT, ATT, TTA, TAC, ACA, CAT, CAA} Output : GATTACATCAA Immediate questions Q : Is there a single possible output? A : no, s = GATTACATTACAA is another possible output Q : Then, how to choose? A : need to formulate an optimization problem a a optimization problem : problem of finding the best solution from all feasible solutions

8/39 SHORTEST COMMON SUPER-STRING PROBLEM Shortest common super-string (SCS) problem Given a set S of strings, construct a string of minimal length which contains all strings of S as sub-strings. (there can be many solutions) Toy example S = {GAT, ATT, TTA} Trivial super-string : {GATATTTTA} Super-strings of length 3 :

8/39 SHORTEST COMMON SUPER-STRING PROBLEM Shortest common super-string (SCS) problem Given a set S of strings, construct a string of minimal length which contains all strings of S as sub-strings. (there can be many solutions) Toy example S = {GAT, ATT, TTA} Trivial super-string : {GATATTTTA} Super-strings of length 3 : none Super-strings of length 4 :

8/39 SHORTEST COMMON SUPER-STRING PROBLEM Shortest common super-string (SCS) problem Given a set S of strings, construct a string of minimal length which contains all strings of S as sub-strings. (there can be many solutions) Toy example S = {GAT, ATT, TTA} Trivial super-string : {GATATTTTA} Super-strings of length 3 : none Super-strings of length 4 : none Super-strings of length 5 :

8/39 SHORTEST COMMON SUPER-STRING PROBLEM Shortest common super-string (SCS) problem Given a set S of strings, construct a string of minimal length which contains all strings of S as sub-strings. (there can be many solutions) Toy example S = {GAT, ATT, TTA} Trivial super-string : {GATATTTTA} Super-strings of length 3 : none Super-strings of length 4 : none Super-strings of length 5 : {GATTA} solution Problem with SCS-based assembly The genome is not a SCS. Genomes contain long repetitions, e.g. GATTACATTACAA (length = 13). Sequencing yields reads : {GAT, ATT, TTA, TAC, ACA, CAT, CAA} A shortest common super-string is : GATTACATCAA (length = 11).

9/39 A BETTER PROBLEM FORMULATION Overlap graph (simplified definition) [Myers 95] Directed graph, vertices = reads edge r 1 r 2 if r 1 and r 2 exactly overlap over k characters. String graph Remove transitively inferable overlaps from the overlap graph. Toy string graph S = {GAT, ATT, TTA, TAC, ACA, CAT, CAA} k = 2 CAT GAT ATT TTA TAC GAT ATT ACA CAA

10/39 ASSEMBLY USING AN STRING GRAPH Assembly in theory [Nagarajan 09] Return a path of minimal length that traverses each node at least once. Illustration For the previous example, CAT GAT ATT TTA TAC ACA CAA The only solution is GATTACATTACAA. (Recall that SCS was GATTACATCAA) Graphs provide a good framework for assembly.

11/39 Example of ambiguities ASSEMBLY USING AN STRING GRAPH GAG AGT GTG ACT CTG TGA GAC ACC GAA AAT ATG Assembly in practice Return a set of paths covering the graph, such that all possible assemblies contain these paths. Solution of the example above The assembly is the following set of paths : {ACTGA, TGACC, TGAGTGA, TGAATGA}

12/39 ALMOST EVERY ASSEMBLY ALGORITHM Assembly graph with variants & errors [Zerbino, Birney 08 ; Li et al. 09 ; Simpson et al. 12 ;..] 1) The graph is completely constructed. 2) Likely sequencing errors are removed. 3) Known biological events are removed. 4) Finally, simple paths are returned. 1 1 1 1 1 2 3 2 3 2 3 2 3

13/39 Definition of the assembly problem Contributions Contribution 1 : localized assembly Index Traversal Contribution 2 : incorporation of pairing information Monument assembler Results Contribution 3 : ultra-low memory assembly Minia Results Perspectives

14/39 WHOLE-GENOME GRAPHS ARE UNNECESSARY Practically Genome graphs are a better framework than SCS, but they are monolithic, hard to parallelize, and [Simpson et al. 09] require a lot of memory (human : 150+ GB). [Li et al. 09] Contribution 1 : localized assembly Proposed approach : Store reads in a redundancy-filtered index Locally construct portions of the graph at a time

15/39 CONTRIBUTION 1.1 : REDUNDANCY-FILTERED READ INDEX Store reads in a redundancy-filtered index [GC, RC, DL 11] Locally construct portions of the graph

REDUNDANCY-FILTERED READ INDEX : BENCHMARK Store reads in a redundancy-filtered index Locally construct portions of the graph Memory usage (GB) of indexes Construction time SOAP : 41 mins us : 64 mins 16/39

17/39 CONTRIBUTION 1.2 : LOCALIZED TRAVERSAL Store reads in a redundancy-filtered index Locally construct portions of the graph, according to these rules : Will traverse : variant sub-graphs Won t traverse : long branches s t s (max breadth b, max depth d) (min depth d) Example : Whole graph

17/39 CONTRIBUTION 1.2 : LOCALIZED TRAVERSAL Store reads in a redundancy-filtered index Locally construct portions of the graph, according to these rules : Will traverse : variant sub-graphs Won t traverse : long branches s t s (max breadth b, max depth d) (min depth d) Example : Start with an empty graph

17/39 CONTRIBUTION 1.2 : LOCALIZED TRAVERSAL Store reads in a redundancy-filtered index Locally construct portions of the graph, according to these rules : Will traverse : variant sub-graphs Won t traverse : long branches s t s (max breadth b, max depth d) (min depth d) Example : Construct the first portion

17/39 CONTRIBUTION 1.2 : LOCALIZED TRAVERSAL Store reads in a redundancy-filtered index Locally construct portions of the graph, according to these rules : Will traverse : variant sub-graphs Won t traverse : long branches s t s (max breadth b, max depth d) (min depth d) Example : Construct the second portion

17/39 CONTRIBUTION 1.2 : LOCALIZED TRAVERSAL Store reads in a redundancy-filtered index Locally construct portions of the graph, according to these rules : Will traverse : variant sub-graphs Won t traverse : long branches s t s (max breadth b, max depth d) (min depth d) Example : Construct the third portion

17/39 CONTRIBUTION 1.2 : LOCALIZED TRAVERSAL Store reads in a redundancy-filtered index Locally construct portions of the graph, according to these rules : Will traverse : variant sub-graphs Won t traverse : long branches s t s (max breadth b, max depth d) (min depth d) Example : Construct the third portion Summary of Contribution 1 : greedy, localized assembly whole-genome graph assembly

18/39 OUTLINE Contribution 1 : localized assembly Index Traversal Contribution 2 : incorporation of pairing information Monument assembler Results Contribution 3 : ultra-low memory assembly Minia Results

PAIRING INFORMATION A vision of sequencing closer to reality is : genome (unknown) paired reads Sequencing a toy genome with paired reads of length 4 nt (with gaps of length 2). Genome?????????????? ACTA AGAG GATA ACCT CTAG ATAC TAGA TACC In practice : read length 100 nt depending on seq. method, gaps are 0, 300, 2000 or 10000 nt. 18/39

19/39 CONTRIBUTION 2 : STUDYING THE IMPACT OF PAIRING INFORMATION Reads that belong to multiple genome locations complicate analysis. Pairing information contributes to uniquely localize reads. 100 E. coli 100 H. sapiens 75 75 Unique reads (%) 50 25 50 25 0 paired unpaired 5 10 15 20 Read length (nt) 0 20 40 60 80 100 Read length (nt) In this figure, paired reads are separated by (300 2 [read length]) nt.

20/39 CONTRIBUTION 2 : INCORPORATING PAIRING INFORMATION IN ASSEMBLY You are asked to solve one of these two jigsaws. Which one looks easier?

20/39 CONTRIBUTION 2 : INCORPORATING PAIRING INFORMATION IN ASSEMBLY You are asked to solve one of these two jigsaws. Which one looks easier? Both are equally hard (NP-hard). [Demaine 07], [RC, DL 11] We defined the following problems, and showed their NP-hardness : SCS over paired strings paired Hamiltonian path super-walk in a de Bruijn graph over paired strings paired Assembly Problem (introducing paired string graphs)

21/39 CONTRIBUTION 2 : PAIRED STRING GRAPHS Recall that long branches cannot be traversed s Now, add pairing information to the graph (paired string graph) : s t In actual data, pairing is incomplete, with varying distance between mates. Will traverse : pairs-linked simple paths (heuristics) s 1 s 2 s 3 t 1 t 2 t 3 t 4

22/39 IMPLEMENTATION : MONUMENT ASSEMBLER Pairs-linked simple paths Variant sub-graphs de novo genome assembly software for Illumina reads 8,000 lines of Python + 5,000 lines of C code proof of concept of the two previous contributions unreleased, used in-house

23/39 RESULTS : ASSEMBLATHON 1 & 2 Assemblathon 1 [Earl et al. (incl. RC, DL, DN, GC, NM) 11] International competition Research teams are given a set of reads to assemble No knowledge of the solution, no preliminary ranking Synthetic genome, 100 Mb (1/30-th of the human genome) Assemblathon 2 Unknown animal genomes, 1-2 Gb (half of the human genome) Maylandia zebra Red tailed boa constrictor common pet parakeet

24/39 QUALITY OF AN ASSEMBLY contigs : gap-less assembled sequences scaffolds : contigs separated by gaps Fragmentation NG50 : length l at which half of the genome is covered by sequences of length l NG50 = 2 NG50 = 6 accuracy (many ways) and coverage (% of the genome covered)

25/39 RESULTS : ASSEMBLATHON 1 Assemblathon 1 Contiguity of sequences (kbp) : Method contig NG50 (rank) scaffold NG50 (rank) Meraculous 16 (10) 9073 (1) Allpaths 219 (2) 8396 (2)...... Monument 7 (13) 1421 (7)...... Cortex 3 (16) 9.3 (16) Performance (reported by participants) (wall h, GB) : Method Memory (rank) Time (rank) Monument 6.3 (3) 2 (1) Meraculous 4 (1) 6 (2)...... Allpaths 100 12...... Celera 100 120

26/39 For Assemblathon 1, we used : RESULTS : ASSEMBLATHON 2 Prototype of Monument (without variants traversal) Single finishing step : scaffolding (SSPACE) [Boetzer 11] What we changed for Assemblathon 2 : Variant sub-graph traversal [RC, DL 11] More elaborate finishing steps : scaffolding (SuperScaffolder) [RC, DN @ Jobim 12] gap-filling (SOAP) [Li et al. 09] Assemblathon 2 (preliminary) Snake (N50, kbp) : Method ctg. (rank) scaf. (rank) SGA 29 (4) 4505 (1) Phusion 73 (1) 4066 (2)...... Monument 65 (2) 1149 (6)...... CLC 8 (11) 19 (11) PRICE 6 (12) 6 (12) Fish (N50, kbp) : Method ctg. (rank) scaf. (rank) Bayor 31 (1) 4966 (1) Allpaths 20 (4) 4014 (2)...... Monument 31 (2) 1241 (6)...... SGA 8 (8) 110 (10) Ray 9 (12) 47 (12)

27/39 OUTLINE Contribution 1 : localized assembly Index Traversal Contribution 2 : incorporation of pairing information Monument assembler Results Contribution 3 : ultra-low memory assembly Minia Results

27/39 RECENT IMPROVEMENT : LOWER-MEMORY STRUCTURE This is not in the manuscript. de Bruijn graph [Idury, Waterman 95] Nodes are k-mers, edges are (k 1)-overlaps between nodes. GAT ATT TTA TAC ACA CAA Structurally similar to string graphs. How to encode de Bruijn graphs using as little space as possible? Memory usage (illustration for human, k = 25) Explicit list of nodes : 2k n bits 50 bits per node Self-information of n nodes : log 2!! 4 k bits n 20 bits per node.

28/39 RECENT IMPROVEMENT : LOWER-MEMORY STRUCTURE (2) Bloom filter Bit array to describe any set with a precision of ɛ. a proportion of ɛ elements will be wrongly included in the set. First step : stores nodes in a Bloom filter. node hash value ATC 0 CCG 0 TCC 5 CGC 6...... Bloom filter 1 0 0 0 0 1 1 0 0 0

29/39 RECENT IMPROVEMENT : LOWER-MEMORY STRUCTURE (3) Actual set of nodes : {TAT, ATC, CGC, CTA, CCG, TCC, GCT} Graph as stored in the previous Bloom filter : AGC CGC GCT CGA GAG GGA CCG CTA AAA TGG TCC ATC TAT ATT TTG

30/39 RECENT IMPROVEMENT : LOWER-MEMORY STRUCTURE (4) Insight : using localized traversal from black nodes, only small a fraction of the red false positives are troublesome. AGC CGC GCT CGA GAG GGA CCG CTA AAA TGG TCC ATC TAT ATT TTG

31/39 RECENT IMPROVEMENT : LOWER-MEMORY STRUCTURE (4) Proposed method [RC, GR 11] Store nodes on disk for sequential enumeration, and in memory store the Bloom filter + the troublesome FP explicitly. Bloom filter 1 0 0 0 0 1 1 0 0 0 AGC GAG CGC GCT CGA CCG CTA AAA TCC TAT ATC ATT GGA TTG TGG Nodes self-information :! 4 3 log 2 = 30 bits 7 Our structure size : 10 {z} Bloom + {z} 3 6 = 28 bits Crit. false pos.

32/39 RECENT IMPROVEMENT : LOWER-MEMORY STRUCTURE (5) Result statement The de Bruijn graph can be encoded using 1.44 log 2 ( 16k 2.08 ) + 2.08 bits of memory per node. human, k = 25 : 13 bits per node. Effectively below the self-information (20 bits/node) Not magic : it s an over-approximation made exact where it matters Is it possible to perform assembly with this immutable structure? Yes, with localized traversal (Contribution 1). Human genome assembly Minia C. & B. ABySS SOAPdenovo Contig N50 (bp) 1156 250 870 886 > 95% Accuracy (%) 94.6-94.2 - Nb of nodes/cores 1/1 1/8 21/168 1/40 Time (wall-clock, h) 23 50 15 33 Memory (sum of nodes, GB) 5.7 32 336 140

YEAR 2012 : HOW COMPUTATIONNALLY HARD IS assembly TODAY? 33/39

34/39 SUMMARY OF CONTRIBUTIONS Contribution 1 : Redundancy-filtered reads index Localized assembly technique Contribution 2 : Incorporation of pairing information in assembly models Contribution 3 : Space-efficient de Bruijn graph representation Contributions in the manuscript : Analysis of re-sequencing feasibility with exact paired reads Index-free targeted assembly (Mapsembler)

35/39 PERSPECTIVES Applications Why assemble a human genome again? To exhibit novel variations [Iqbal 11] As a benchmark, for the immense number of (meta)genomes that will be sequenced next Future of sequencing Predictions : DNA assembly Relevant until 10-100 kbp high-accuracy read lengths RNA assembly, metagenomics and metatranscriptomics No announced technology other than Illumina permits high depth of sampling. paired short-read assembly will remain a hot topic for at least a few years.

36/39 PERSPECTIVES Extension of localized assembly : Graph-based gap-filling (Monument, with T. Derrien, C. Lemaitre, & F. Legeai) Extension of paired assembly theory : Global scaffolding (SuperScaffolding, with D. Naquin) common sub-paths that appear in all solutions of a Chinese Postman instance Applications of Minia codebase : Huge metagenomic assemblies (with O. Jaillon, JM. Aury) Transcriptome assembly (Inchworm replacement) Alternative splicing detection (KisSplice module replacement) SNP detection (KisSnp 2, with R. Uricaru & P. Peterlongo) Read compression (with G. Rizk & D. Lavenier) Constant-memory k-mer counting (with G. Rizk)

SOFTWARE CONTRIBUTIONS Released software : Mapsembler 1 KisSplice 2 Minia 3 On my github 4 : Paired repetitions analysis package Light-weight, explicit de Bruijn graph construction Internal software : Monument SuperScaffolder 1 http://alcovna.genouest.org/mapsembler 2 http://alcovna.genouest.org/kissplice 3 http://minia.genouest.org 4 http://github.com/rchikhi 37/39

38/39 PUBLICATIONS WABI 2011 RC, DL PBC 2011 GC, RC, DL BMC Bioinformatics 2011 PP, RC Genome Research 2011 Earl et al. (RC, DL, DN, GC, NM) RECOMB-Seq 2012 Sacomoto et al. (RC, RU, PP) WABI 2012 RC, GR Extended abstracts, posters : BMC Bioinformatics, ISCB-SC 2009 RC, DL Jobim 2012 RC, DN

39/39 ACKNOWLEDGMENTS Dominique Lavenier Everyone at Symbiose, GenScale, GenOuest, Dyliss Pierre asked for a special dedicace M-F. Sagot, S. Gnerre, O. Jaillon, E. Rivals, B. Schmidt My family, D. Thank you all for coming!