RESEARCH TOPIC IN BIOINFORMATICS


GENOME ASSEMBLY

Instructor: Dr. Yufeng Wu
Noted by:
February 25, 2012

Genome assembly is a string reconstruction problem. The human genome is very long, about 3 Gb (roughly three billion bases). Second-generation sequencing technologies create multiple copies of the genome and then shear the copies into fragments. Short pieces are then read from these fragments, as shown in Fig. 1. Researchers try to assemble this huge number of pieces back into the original genome; this process is called genome assembly.

Fig. 1: the process of obtaining the reads.

Definition # 1 Reads are the short pieces obtained from the fragments.

Definition # 2 Paired reads are two reads obtained from the two ends of one fragment.

The genome assembly problem now becomes:

Input: short sequencing reads or paired reads.
Goal: reconstruct the reference string from the reads.

Reconstructing the reference sequence is still hard. Several challenges remain.

Challenges:
1. The reference sequence is long, and there is a huge number of short reads.
2. The reads contain errors.
3. The genome has a complex repeat structure.

One idea for reaching the goal of genome assembly is to construct a complete weighted directed graph, called the overlap graph.

Complete weighted directed graph (overlap graph): The nodes denote the reads. The weight of the edge from read a to read b denotes the length of the overlap between the two reads, i.e. the longest suffix of a that is also a prefix of b.

The example in Fig. 2 illustrates the overlap graph.

Input: the reads ACG, CGA, CGC, CGT, GAC, GCG, GTA, TCG.
Output: the overlap graph.

Fig. 2: an example of the overlap graph. Only some of the edges are drawn.

To reduce the complexity of the problem, we make some assumptions and adopt a criterion.
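As a rough illustration of the overlap graph for the reads of Fig. 2, the sketch below computes the weight of every directed edge. The function names overlap_length and build_overlap_graph are illustrative and not part of the notes, and the sketch assumes that only proper overlaps (shorter than both reads) are counted.

```python
from itertools import permutations


def overlap_length(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    # Assumption: only proper overlaps (shorter than both reads) count.
    for length in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_overlap_graph(reads: list[str]) -> dict[tuple[str, str], int]:
    """Map every ordered pair of distinct reads to its overlap length."""
    return {(a, b): overlap_length(a, b) for a, b in permutations(reads, 2)}


reads = ["ACG", "CGA", "CGC", "CGT", "GAC", "GCG", "GTA", "TCG"]
graph = build_overlap_graph(reads)

# For brevity, print only the heaviest edges (overlap of length 2).
for (a, b), weight in sorted(graph.items()):
    if weight == 2:
        print(f"{a} -> {b} (overlap {weight})")
```

The graph is complete, so it has n(n - 1) directed edges for n reads; edges of weight 0 are kept here so that every ordered pair of reads has an entry.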

Assume:
1. There are no errors in the reads.
2. Genome repeats are ignored, i.e. each read occurs only once in the reference sequence.

Criterion: The assembly seeks the most parsimonious solution. In other words, it tries to find the shortest sequence that contains all the reads. Although the shortest sequence may not be the correct one, it is not far from the truth.

From the assumptions and the criterion, we can observe the following:
What is the reference genome? = A path in the overlap graph.
How do we enforce that each read occurs only once? = Find a Hamilton path in the graph, i.e. a path that visits each node once and only once.
How do we find the most parsimonious solution? = Maximize the weight of the path.

Definition # 3 A Hamilton path is a path in the graph that visits each node exactly once.

Thus, the problem becomes a travelling salesman path (TSP) problem: find the Hamilton path of minimum total weight in the overlap graph (OG) with negated edge weights. Unfortunately, this problem is NP-hard. Researchers therefore look for approximate solutions using greedy algorithms. One of these is the overlap-layout-consensus (OLC) heuristic:
1. Overlap: construct the overlap graph.
2. Layout: find a path of large total weight in the overlap graph with a greedy algorithm.
3. Consensus: obtain the consensus sequence. For example, Tab. 1 shows how the consensus works.

Definition # 4 The consensus sequence consists of the most common nucleotide or amino acid at each position after multiple sequences are aligned.

Researchers later found another, more efficient way to assemble genomes, called the Eulerian path approach. As before, to apply the Eulerian path approach we first make some assumptions.
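Before turning to the Eulerian path approach, the greedy layout step of OLC can be sketched as follows. This is one common greedy heuristic (repeatedly merge the two strings with the largest overlap) and is only an assumption about what the greedy algorithm looks like; the notes do not specify it. The names overlap_length and greedy_assemble are illustrative, and the reads are assumed to be error-free and non-empty.

```python
def overlap_length(a: str, b: str) -> int:
    """Length of the longest proper suffix of `a` that is a prefix of `b`."""
    for length in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads: list[str]) -> str:
    """Repeatedly merge the pair of strings with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(contigs) > 1:
        best = (-1, 0, 1)  # (overlap, index of left string, index of right)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    weight = overlap_length(a, b)
                    if weight > best[0]:  # ties keep the first pair found
                        best = (weight, i, j)
        weight, i, j = best
        merged = contigs[i] + contigs[j][weight:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)
    return contigs[0]


reads = ["ACG", "CGA", "CGC", "CGT", "GAC", "GCG", "GTA", "TCG"]
superstring = greedy_assemble(reads)
print(superstring, len(superstring))
```

Each merge keeps both strings as substrings of the result, so the output always contains every read. It is not guaranteed to be the shortest such sequence, which is why the greedy result is only an approximation.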

Aligned sequences   A C G A C T
                    A T G G A G
                    A C C T C C
                    A C G G T T
                    A C C G G C
T                   0 1 0 1 1 2
G                   0 0 3 3 1 1
C                   0 4 2 0 2 2
A                   5 0 0 1 1 0
Consensus           A C G G C T

Tab. 1: consensus sequence from five aligned sequences.

Assume:
1. All the reads have the same length k; we call them k-mers.
2. Each read is distinct, i.e. each read occurs only once.
3. The reads are sheared one nucleobase at a time, so a read starts at every position of the sequence.

Definition # 5 A k-mer is a substring of length k (a k-tuple, or k-gram) of a nucleic acid or amino acid sequence.

The graph in which the Eulerian path is sought is called the de Bruijn graph. In genome assembly, the de Bruijn graph is constructed as follows.

de Bruijn graph: The nodes denote the (k - 1)-length prefixes and suffixes of the reads. An edge connects two nodes if they are the prefix and the suffix of the same read; it is directed from the prefix to the suffix. In other words, the edges represent the reads. The graph is unweighted and directed.

Definition # 6 An Eulerian path is a path in the graph that visits every edge exactly once.

Theorem # 6.1 A connected directed graph has an Eulerian path if and only if at most one node has in-degree exceeding its out-degree by one, at most one node has out-degree exceeding its in-degree by one, and every other node has equal in-degree and out-degree.

For example, Fig. 3 illustrates the de Bruijn graph for the same read set as in the overlap graph example.
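The construction and the condition of Theorem 6.1 can be checked on this read set with a short sketch. The helper names de_bruijn_edges and has_eulerian_path are illustrative, and connectivity is tested on the undirected version of the graph, which is an assumption about how "connected" is meant in the theorem.

```python
from collections import defaultdict


def de_bruijn_edges(reads: list[str]) -> list[tuple[str, str]]:
    """One directed edge per read: (k-1)-prefix -> (k-1)-suffix."""
    return [(read[:-1], read[1:]) for read in reads]


def has_eulerian_path(edges: list[tuple[str, str]]) -> bool:
    """Check the degree and connectivity conditions of Theorem 6.1."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    neighbours = defaultdict(set)
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
        neighbours[u].add(v)
        neighbours[v].add(u)  # direction is ignored for connectivity
    nodes = set(neighbours)
    if not nodes:
        return True
    diffs = [out_deg[n] - in_deg[n] for n in nodes]
    if any(abs(d) > 1 for d in diffs):
        return False
    if diffs.count(1) > 1 or diffs.count(-1) > 1:
        return False
    # Depth-first search to test that all nodes are reachable.
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(neighbours[node] - seen)
    return seen == nodes


reads = ["ACG", "CGA", "CGC", "CGT", "GAC", "GCG", "GTA", "TCG"]
edges = de_bruijn_edges(reads)
print(edges)
print(has_eulerian_path(edges))
```

For these reads, TC has one more outgoing than incoming edge and TA has one more incoming than outgoing edge, so an Eulerian path must start at TC and end at TA.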

Input: the reads ACG, CGA, CGC, CGT, GAC, GCG, GTA, TCG.
Output: the de Bruijn graph.

Fig. 3: an example of the de Bruijn graph.

According to the de Bruijn graph, a possible Eulerian path is

TC -> CG -> GC -> CG -> GA -> AC -> CG -> GT -> TA

Thus, one possible genome sequence is TCGCGACGTA.

A simple method to find an Eulerian path in a graph:
1. Start from a node that still has unused edges.
2. Pick an edge that has not been used before and travel along it to the next node.
3. Repeat Step 2 until a cycle is closed.
4. Repeat Steps 1 through 3 to find a new cycle.
5. Combine two cycles at a node they share by switching from one cycle to the other at that node.

In the real world, however, some of the assumptions are hard to satisfy. There are some challenges in applying the Eulerian path approach to genome assembly.

Challenges:
1. It is hard to obtain a read that starts at every position of the genome sequence.
2. The reads still contain errors.

Thus, to obtain consecutive k-mers, the Euler assembler breaks each given read into k-mers. The genome assembly problem then becomes an Eulerian super-path problem.
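Returning to the cycle-merging procedure listed above, a compact way to implement it is the stack-based form of Hierholzer's algorithm, sketched below. This is a standard technique swapped in for the five steps rather than the exact procedure of the notes. The names eulerian_path and spell are illustrative, and the sketch assumes that an Eulerian path exists.

```python
from collections import defaultdict


def eulerian_path(edges: list[tuple[str, str]]) -> list[str]:
    """Return the node sequence of an Eulerian path (assumed to exist)."""
    successors = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        successors[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if successors[node]:
            # Follow an unused edge out of the current node.
            stack.append(successors[node].pop())
        else:
            # Dead end: this node is final in the path built so far.
            path.append(stack.pop())
    return path[::-1]


def spell(path: list[str]) -> str:
    """Spell the sequence: first node plus the last letter of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ACG", "CGA", "CGC", "CGT", "GAC", "GCG", "GTA", "TCG"]
edges = [(read[:-1], read[1:]) for read in reads]
path = eulerian_path(edges)
sequence = spell(path)
print(" -> ".join(path))
print(sequence)
```

A graph may have several Eulerian paths (TCGACGCGTA is another valid spelling for these reads), so which one is returned depends on the order in which edges are stored.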

Eulerian super-path problem: Given the reads and k, treat the path of k-mers of each read i as a sub-path SP_i. Find an Eulerian super-path P such that P contains each SP_i exactly once.

If we relax the condition of passing each sub-path exactly once to passing each edge of the graph at least once, the problem becomes the Chinese postman problem.

Chinese postman problem: Given a directed graph, find the shortest path that visits each edge at least once.

The Chinese postman problem can be solved in polynomial time. However, the Chinese postman problem constrained by the sub-paths is NP-hard. In the Euler assembler, the authors provide several transformations to approach the expected result, such as the x,y-detachment and the x-cut (Fig. 4).

Fig. 4: Equivalent transformations: (a) x,y-detachment and (b) x-cut.

Definition # 7 The x,y-detachment is a transformation that adds a new edge z = (v_in, v_out) and deletes the edges x and y from G (Fig. 4a).

Definition # 8 An x-cut is a transformation that simply removes x from all the paths that start or end at x, without affecting the graph G itself (Fig. 4b).

Some paths can be merged because they are consistent with each other. Fig. 5 illustrates two possible cases of consistency. A path P that is consistent only with P_{x,y1} (Fig. 5a) is resolvable, while a path P that is consistent with both P_{x,y1} and P_{x,y2} (Fig. 5b) is unresolvable.

Fig. 5: two possible cases of consistency. (a) P is consistent only with P_{x,y1}. (b) P is consistent with both P_{x,y1} and P_{x,y2}.

The unambiguous paths form the contigs, and the paired reads help us build scaffolds (the level above contigs) from the contigs.

Definition # 9 A contig is a set of overlapping DNA segments that together represent a consensus region of DNA.

Fig. 6: cycle graph before merging.

Fig. 7: de Bruijn graph after merging.

However, because of the huge number of reads, the de Bruijn graph is still very complex. For example, consider the cyclic string

S = ATCAGATAGGAC. (1)

The k-mers of S with k = 2 form a cycle graph (Fig. 6). If we merge the identical nodes, this cycle graph becomes the de Bruijn graph (Fig. 7), and the merging makes the graph complex. Is there a way to reduce the merging? Here we introduce the paired de Bruijn graph.

Paired de Bruijn graph: Each node denotes the (k - 1)-length prefixes or suffixes of both reads of a pair. An edge connects two nodes if they are the prefixes and the suffixes of the same paired read. The graph is unweighted and directed.

Again, we make some assumptions.

Assume:
1. The paired reads all have the same insert size.
2. d denotes the distance between the starting points of the two reads of a pair.

For example, consider the same string, Eq. (1), with d = 4 and k = 2. As shown in Fig. 8, only two nodes can be merged because they are identical. Thus, the paired de Bruijn graph reduces the merging considerably.

Fig. 8: the cycle graph with the paired reads.

We can relax the first assumption a little. If the first parts of two paired nodes match and the distance between the second parts of the two nodes is small, the two nodes can be merged. For example, suppose one paired node is AT/GA and another is AT/GG. The first parts of the two nodes are both AT. So, if the distance between the second parts, GA and GG, is small, then the two nodes AT/GA and AT/GG can be merged.
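The effect of pairing on the example of Eq. (1) can be counted with a short sketch. It assumes that node labels are cyclic substrings of length 2 (as in the labels AT/GA above) and that the second part of a paired label starts d = 4 positions after the first. The function names cyclic_kmers and paired_labels are illustrative.

```python
def cyclic_kmers(s: str, k: int) -> list[str]:
    """All length-k substrings of the cyclic string `s`, one per start position."""
    doubled = s + s  # lets substrings wrap around the end (assumes k <= len(s))
    return [doubled[i:i + k] for i in range(len(s))]


def paired_labels(s: str, k: int, d: int) -> list[tuple[str, str]]:
    """Pair the k-mer at each position with the k-mer starting d positions later."""
    kmers = cyclic_kmers(s, k)
    n = len(s)
    return [(kmers[i], kmers[(i + d) % n]) for i in range(n)]


S = "ATCAGATAGGAC"
single = cyclic_kmers(S, 2)
paired = paired_labels(S, 2, 4)

print("positions:             ", len(S))
print("distinct single labels:", len(set(single)))
print("distinct paired labels:", len(set(paired)))
```

Under these assumptions the 12 positions collapse to 8 distinct single labels but to 11 distinct paired labels: only the two nodes labelled AT/GA coincide, which matches the observation that only two nodes are merged.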