Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner

Similar documents
Genome Reconstruction: A Puzzle with a Billion Pieces. Phillip Compeau Carnegie Mellon University Computational Biology Department

by the Genevestigator program ( Darker blue color indicates higher gene expression.

HP22.1 Roth Random Primer Kit A für die RAPD-PCR

Pyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds

Appendix A. Example code output. Chapter 1. Chapter 3

6 Anhang. 6.1 Transgene Su(var)3-9-Linien. P{GS.ry + hs(su(var)3-9)egfp} 1 I,II,III,IV 3 2I 3 3 I,II,III 3 4 I,II,III 2 5 I,II,III,IV 3

TCGR: A Novel DNA/RNA Visualization Technique

warm-up exercise Representing Data Digitally goals for today proteins example from nature

Supplementary Table 1. Data collection and refinement statistics

SUPPLEMENTARY INFORMATION. Systematic evaluation of CRISPR-Cas systems reveals design principles for genome editing in human cells

DNA Sequencing. Overview

DNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization

10/15/2009 Comp 590/Comp Fall

CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly

Digging into acceptor splice site prediction: an iterative feature selection approach

Sequence Assembly. BMI/CS 576 Mark Craven Some sequencing successes

Machine Learning Classifiers

10/8/13 Comp 555 Fall

Graph Algorithms in Bioinformatics

Genome 373: Genome Assembly. Doug Fowler

Algorithms for Bioinformatics

Sequence Assembly Required!

Sequencing. Computational Biology IST Ana Teresa Freitas 2011/2012. (BACs) Whole-genome shotgun sequencing Celera Genomics

Genome Sequencing Algorithms

Algorithms for Bioinformatics

MLiB - Mandatory Project 2. Gene finding using HMMs

Efficient Selection of Unique and Popular Oligos for Large EST Databases. Stefano Lonardi. University of California, Riverside

Crick s Hypothesis Revisited: The Existence of a Universal Coding Frame

02-711/ Computational Genomics and Molecular Biology Fall 2016

Degenerate Coding and Sequence Compacting

2 41L Tag- AA GAA AAA ATA AAA GCA TTA RYA GAA ATT TGT RMW GAR C K65 Tag- A AAT CCA TAC AAT ACT CCA GTA TTT GCY ATA AAG AA

Supplementary Materials:

Supplementary Data. Image Processing Workflow Diagram A - Preprocessing. B - Hough Transform. C - Angle Histogram (Rose Plot)

A relation between trinucleotide comma-free codes and trinucleotide circular codes

Purpose of sequence assembly

LABORATORY STANDARD OPERATING PROCEDURE FOR PULSENET CODE: PNL28 MLVA OF SHIGA TOXIN-PRODUCING ESCHERICHIA COLI

DNA Fragment Assembly

Read Mapping. de Novo Assembly. Genomics: Lecture #2 WS 2014/2015

de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

Scalable Solutions for DNA Sequence Analysis

Eulerian tours. Russell Impagliazzo and Miles Jones Thanks to Janine Tiefenbruck. April 20, 2016

de Bruijn graphs for sequencing data

Eulerian Tours and Fleury s Algorithm

Graph Algorithms in Bioinformatics

Assembly in the Clouds

I519 Introduction to Bioinformatics, Genome assembly. Yuzhen Ye School of Informatics & Computing, IUB

Problem statement. CS267 Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly. Spring Example.

Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data

Detecting Superbubbles in Assembly Graphs. Taku Onodera (U. Tokyo)! Kunihiko Sadakane (NII)! Tetsuo Shibuya (U. Tokyo)!

Supporting Information

RESEARCH TOPIC IN BIOINFORMANTIC

de novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January /73

Graphs and Puzzles. Eulerian and Hamiltonian Tours.

How to apply de Bruijn graphs to genome assembly

Structural analysis and haplotype diversity in swine LEP and MC4R genes

debgr: An Efficient and Near-Exact Representation of the Weighted de Bruijn Graph Prashant Pandey Stony Brook University, NY, USA

CSE 549: Genome Assembly De Bruijn Graph. All slides in this lecture not marked with * courtesy of Ben Langmead.

OFFICE OF RESEARCH AND SPONSORED PROGRAMS

DNA Fragment Assembly

Solutions Exercise Set 3 Author: Charmi Panchal

Hybrid Parallel Programming

DNA Fragment Assembly Algorithms: Toward a Solution for Long Repeats

EECS 203 Lecture 20. More Graphs

7.36/7.91 recitation. DG Lectures 5 & 6 2/26/14

Genome Sequencing & Assembly. Slides by Carl Kingsford

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.

1. PURPOSE: to describe the standardized laboratory protocol for molecular subtyping of Salmonella enterica serotype Enteritidis.

QuasiAlign: Position Sensitive P-Mer Frequency Clustering with Applications to Genomic Classification and Differentiation

CSCI 1820 Notes. Scribes: tl40. February 26 - March 02, Estimating size of graphs used to build the assembly.

Sequence Design Problems in Discovery of Regulatory Elements

Algorithms and Data Structures

3. The object system(s)

Towards a de novo short read assembler for large genomes using cloud computing

BMI/CS 576 Fall 2015 Midterm Exam

Programming Applications. What is Computer Programming?

Michał Kierzynka et al. Poznan University of Technology. 17 March 2015, San Jose

Graphs and Genetics. Outline. Computational Biology IST. Ana Teresa Freitas 2015/2016. Slides source: AED (MEEC/IST); Jones and Pevzner (book)

Worksheet 28: Wednesday November 18 Euler and Topology

Computational Architecture of Cloud Environments Michael Schatz. April 1, 2010 NHGRI Cloud Computing Workshop

Bioinformatics: Fragment Assembly. Walter Kosters, Universiteit Leiden. IPA Algorithms&Complexity,

A Novel Implementation of an Extended 8x8 Playfair Cipher Using Interweaving on DNA-encoded Data

Parallel de novo Assembly of Complex (Meta) Genomes via HipMer

Introduction to Genome Assembly. Tandy Warnow

(for more info see:

How to Run NCBI BLAST on zcluster at GACRC

GATB programming day

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao,David Tse Note 8

Discrete Mathematics for CS Spring 2008 David Wagner Note 13. An Introduction to Graphs

Path Finding in Graphs. Problem Set #2 will be posted by tonight

Mining more complex patterns: frequent subsequences and subgraphs. Department of Computers, Czech Technical University in Prague

Discrete mathematics

BLAST & Genome assembly

An Introduction to Graph Theory

Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea

Introduction to Graphs

Scalable Solutions for DNA Sequence Analysis

DELAMANID SUSCEPTIBILITY TESTING IN AN AUTOMATED LIQUID CULTURE SYSTEM

A Genome Assembly Algorithm Designed for Single-Cell Sequencing

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 7

Strings. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Transcription:

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner

Outline I. Problem II. Two Historical Detours III.Example IV.The Mathematics of DNA Sequencing V.Complications

Problem Problem: Given DNA, how do we find the nucleotide sequence? Reduces to two problems: 1. Read generation (Biological) 2. Fragment Assembly (Algorithmic/Mathematical)

Introduction to DNA Sequencing Four Nucleotides: A, G, C, T No known way to read DNA one nucleotide at a time Current technology can only 'read' short segments of DNA At most approximately 100 nucleotides in length Short fragments of length k are called k-mers Biologists generate these k-mers starting at every nucleotide Then use mathematics to attempt to recover the sequence by solving a giant overlap puzzle

Brief Introduction to Read Generation First synthesize all possible 3-mers Attach these to a grid on which each l-mer is assigned a unique location Take the DNA fragment and fluorescently label it Apply this to the DNA array Read the complements of fluorescent grids AAA AGA CAA CGA GAA GGA TAA TGA AAC AGC CAC CGC GAC GGC TAC TGC AAG AGG CAG CGG GAG GGG TAG TGG AAT AGT CAT CGT GAT GGT TAT TGT ACA ATA CCA CTA GCA GTA TCA TTA ACC ATC CCC CTC GCC GTC TCC TTC ACG ATG CCG CTG GCG GTG TCG TTG ACT ATT CCT CTT GCT GTT TCT TTT

Welcome to Konigsberg a) Map of Konigsberg. b) The graph formed by compressing each land mass into a vertex and representing each bridge by an edge. Compeau, Phillip E C, Pavel A. Pevzner, and Glenn Tesler. "How to Apply De Bruijn Graphs to Genome Assembly." Nat Biotechnol Nature Biotechnology 29.11 (2011): 987-91. Web.

Konigsberg Bridge Problem Problem: Is there a walk that traverses each bridge exactly once? Euler solved this problem in the 18th century and spawned the branch of mathematics known as Graph Theory.

Hamilton's Game

From Euler and Hamilton to Genome Assembly Simplifying assumptions: 1.The genome we are reconstructing is cyclic. 2.Every read has the same length. 3.All possible substrings of length l occurring in our genome have been generated as reads 4.The reads have been generated without any errors.

Example Suppose we have the sequence: TAATGCCATGGGATGTT From this sequence, we yield the 3-mers: TAA, AAT, ATG, TGC, GCC, CCA, CAT, ATG,TGG, GGG,GGA,GAT, ATG,TGT,GTT We construct a graph from these 3-mers by: 1. Using the 3-mers as vertices. 2. Placing a directed edge from vertex 1, (v1), to vertex 2, (v2) if the prefix of v2 is the suffix of v1.

Example Prefix of AAT is AA while suffix of TAA is AA, etc.

Example In practice, k-mers are given in lexicographic order: AAT, ATG, ATG, ATG, CAT, CCA, GAT, GCC, GGA, GGG, GTT, TAA, TGC, TGG, TGT We again use the 3-mers as nodes Now we connect two nodes from one to another if the suffix is same as prefix For example, we connect AAT to all ATG nodes We yield a new graph that looks as follows. The goal is now to find a path in the graph that passes through every node exactly once. (Hamiltonian Problem)

Example

Building the Path

Finally

Sequence then becomes: TAATGCCATGGGATGTT

Example Revisited We now approach sequence generation in a new way. Start again with the sequence: TAATGCCATGGGATGTT Generate 3-mers again: TAA, AAT, ATG, TGC, GCC, CCA, CAT, ATG,TGG, GGG,GGA,GAT, ATG,TGT,GTT The 3-mers now become the edges while the prefixes and suffixes become the nodes.

Example Revisited TA is the prefix of a 3-mer with AA as the suffix, so it is connected by an edge labeled by the 3-mer TAT, etc The next step is to paste together nodes that are the same.

Eulerian Problem The goal now is to find a path through the graph that passes through every edge exactly once. (Eulerian Problem) When this path is found, concatenate the edges to retrieve the sequence.

When we read the edges back, we recover the sequence: TAATGCCATGGGATGTT

The Million Dollar Question Is the Hamiltonian Problem or the Eulerian Problem easier to solve?

Million Dollar Question Turns out that the Hamiltonian Problem is intractable NP-complete You can literally win a million dollars by solving it Hamiltonian strategy still used to sequence the Human Genome and others before 2001 Eulerian Problem is very easy to solve Proof of Euler's Theorem gives you a very nice algorithm to find the cycle

Euler's Theorem Theorem: A directed, connected, and finite graph G has an Eulerian cycle if and only if, for every vertex v in G, the indegree and the outdegree of v are equal.

Proof

Complications Eulerian Cycle found might not be unique In our example there is also a cycle that generates the sequence: TAATGGGATGCCATGTT How does the problem change when the sequence is not cyclic, but rather, a linear DNA sequence? How do we adjust for errors in the read generation?

Thank You for Listening. Any Questions?