RESEARCH TOPIC IN BIOINFORMATICS

GENOME ASSEMBLY

Instructor: Dr. Yufeng Wu
Noted by:
February 25, 2012

Genome assembly is a string reconstruction problem. The human genome is very long, about 3 Gb. Second-generation sequencing techniques create multiple copies of the genome and shear them into fragments. Short pieces are then read from these fragments, as shown in Fig. 1. Researchers try to assemble this huge number of pieces back into the original genome; this process is called genome assembly.

Fig. 1: the process of obtaining the reads.

Definition # 1 Reads are the short pieces obtained from the fragments.

Definition # 2 Paired reads are two reads obtained from the two ends of one fragment.

The genome assembly problem now becomes:

Input: short sequencing reads or paired reads.
Goal: reconstruct the reference string from the reads.

Reconstructing the reference sequence is still hard. Several challenges remain.

Challenges:
1. The reference sequence is long, and there is a huge number of short reads.
2. The reads contain errors.
3. The genome contains repeats with a complex structure.

One idea for genome assembly is to construct a complete weighted directed graph, called the overlap graph.

Complete weighted directed graph (overlap graph): The nodes denote the reads. The weight of an edge denotes the length of the overlap between two reads, that is, the length of the longest suffix of the first read that matches a prefix of the second read.

Fig. 2 illustrates the overlap graph with an example.

Input: the reads: ACG, CGA, CGC, CGT, GAC, GCG, GTA, TCG
Output: construct the overlap graph.

Fig. 2: an example of the overlap graph. Not all edges are drawn; only some of them are shown.

To reduce the complexity of the problem, we make some assumptions and adopt a criterion.
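Before stating those assumptions, here is a minimal Python sketch of the overlap step for the reads of Fig. 2. The names overlap and overlap_graph are illustrative and do not come from the lecture. The sketch assumes that an overlap must be shorter than both reads, and it stores every ordered pair of reads, so edges of weight 0 are included.

```python
from itertools import permutations

# Reads from the example of Fig. 2.
READS = ["ACG", "CGA", "CGC", "CGT", "GAC", "GCG", "GTA", "TCG"]


def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b.

    Assumption: only proper overlaps count, i.e. the overlap is shorter
    than both reads.
    """
    for length in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads):
    """Complete weighted directed graph: one weighted edge per ordered pair."""
    return {(a, b): overlap(a, b) for a, b in permutations(reads, 2)}


graph = overlap_graph(READS)
print(graph[("ACG", "CGA")])  # weight of the edge ACG -> CGA
```

Comparing all ordered pairs this way is quadratic in the number of reads, which is one reason the large number of reads is listed among the challenges above.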

Assume:
1. There are no errors in the reads.
2. Ignore the genome repeats, i.e. each read occurs only once in the reference sequence.

Criterion: Assembly tries to find the most parsimonious (simplest) solution. In other words, the assembly tries to find the shortest sequence that contains all the reads. Though the shortest sequence may not be the correct one, it is not far from the truth.

Given the assumptions and the criterion, we can observe:

What is the reference genome? = a path in the overlap graph.
How to enforce that each read occurs only once? = find a Hamilton path in the graph, i.e. a path that visits each node once and only once.
How to find the most parsimonious solution? = maximize the weight of the path.

Definition # 3 A Hamilton path is a path in the graph that visits each node exactly once.

Thus, the problem reduces to a travelling salesman path problem (TSP) on the overlap graph (OG) with negated weights: maximizing the total overlap is the same as minimizing the total negative weight. Unfortunately, this problem is NP-hard. Researchers therefore look for approximate solutions using greedy algorithms. One of these is the overlap-layout-consensus (OLC) heuristic.

1. Overlap: construct the overlap graph.
2. Layout: search for a maximum-weight path in the overlap graph with a greedy algorithm.
3. Consensus: obtain the consensus sequence. For example, Tab. 1 shows how the consensus works.

Definition # 4 The consensus sequence consists of the most common nucleotide or amino acid at each position after multiple sequences are aligned.

Later, researchers found another efficient way to assemble genomes, called the Eulerian path approach. As before, to apply the Eulerian path approach we first make some assumptions.
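Before turning to the Eulerian path approach, the consensus step of OLC can be sketched in Python using the five aligned sequences of Tab. 1. The function name consensus is illustrative. In the last column T and C both occur twice; this sketch breaks such ties by taking the symbol that appears first in the column, which is an assumption that happens to agree with the table.

```python
from collections import Counter

# The five aligned sequences of Tab. 1.
ALIGNED = ["ACGACT", "ATGGAG", "ACCTCC", "ACGGTT", "ACCGGC"]


def consensus(aligned):
    """Most common symbol in each column of equally long aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must have the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(column)
        # max() returns the first symbol with the highest count, so ties are
        # broken by order of appearance in the column (an assumption).
        result.append(max(counts, key=counts.get))
    return "".join(result)


print(consensus(ALIGNED))  # ACGGCT, as in Tab. 1
```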

Aligned sequences   A C G A C T
                    A T G G A G
                    A C C T C C
                    A C G G T T
                    A C C G G C

T                   0 1 0 1 1 2
G                   0 0 3 3 1 1
C                   0 4 2 0 2 2
A                   5 0 0 1 1 0

Consensus           A C G G C T

Tab. 1: consensus sequence from five aligned sequences.

Assume:
1. All reads have the same length k; we call them k-mers.
2. Each read is distinct, i.e. each read occurs only once.
3. The reads are sheared one nucleobase at a time, so a read starts at every position of the sequence.

Definition # 5 A k-mer is a specific k-tuple (k-gram) of a nucleic acid or amino acid sequence.

The graph in which we look for an Eulerian path is called the de Bruijn graph. In genome assembly, the de Bruijn graph is constructed as follows.

de Bruijn graph: The nodes denote the (k - 1)-length prefixes and suffixes of the reads. An edge connects two nodes if they are the prefix and the suffix of the same read. In other words, the edges represent the reads. The graph is unweighted and directed.

Definition # 6 An Eulerian path is a path in the graph that visits every edge exactly once.

Theorem # 6.1 A directed graph has an Eulerian path if and only if the graph is connected, at most one node has in-degree one greater than its out-degree, at most one node has out-degree one greater than its in-degree, and every other node has equal in-degree and out-degree.

For example, Fig. 3 illustrates the de Bruijn graph for the same read set as in the overlap graph example.

Input: the reads: ACG, CGA, CGC, CGT, GAC, GCG, GTA, TCG
Output: construct the de Bruijn graph.

Fig. 3: an example of the de Bruijn graph.

According to the de Bruijn graph, a possible Eulerian path is

TC -> CG -> GC -> CG -> GA -> AC -> CG -> GT -> TA

Thus, one possible genome sequence is TCGCGACGTA.

A simple method to find an Eulerian path in a graph:
1. Start from a node that still has unused edges.
2. Pick an edge that has not been used before and travel along it to the next node.
3. Repeat Step 2 until a cycle is closed.
4. Repeat Steps 1 through 3 to find a new cycle.
5. Combine the cycles by splicing one cycle into the other at a node they share.

In the real world, however, some of the assumptions rarely hold. Applying the Eulerian path approach to genome assembly raises some challenges.

Challenges:
1. It is hard to obtain a read that starts at every position of the genome sequence.
2. The reads still contain errors.

Thus, to obtain consecutive k-mers, the Euler assembler breaks each given read into k-mers. The genome assembly problem then becomes an Eulerian super-path problem.
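Returning to the example of Fig. 3, the de Bruijn construction and the search for an Eulerian path can be sketched in Python as follows. The function names are illustrative. The sketch assumes that the graph is non-empty and that an Eulerian path exists, and it uses a stack-based form of the cycle-merging idea (commonly known as Hierholzer's algorithm) instead of finding and splicing cycles explicitly as in the five steps above.

```python
from collections import defaultdict

# Reads (3-mers) from the example of Fig. 3.
READS = ["ACG", "CGA", "CGC", "CGT", "GAC", "GCG", "GTA", "TCG"]


def de_bruijn_graph(reads):
    """One edge per read, from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for read in reads:
        graph[read[:-1]].append(read[1:])
    return dict(graph)


def eulerian_path(graph):
    """Return the node sequence of an Eulerian path (assumed to exist)."""
    # balance = out-degree minus in-degree for every node.
    balance = defaultdict(int)
    for node, successors in graph.items():
        balance[node] += len(successors)
        for nxt in successors:
            balance[nxt] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))
    unused = {node: list(succ) for node, succ in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if unused.get(node):
            stack.append(unused[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path):
    """Genome spelled by a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


genome = spell(eulerian_path(de_bruijn_graph(READS)))
print(genome)
```

For these reads the graph has two Eulerian paths, spelling TCGCGACGTA and TCGACGCGTA. Which one is returned depends on the order in which unused edges are taken.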

Eulerian super-path problem: Given the reads and k, treat the path of k-mers of each read i as a sub-path SP_i. Find an Eulerian super-path P such that P contains each SP_i exactly once.

If we relax the condition from passing each sub-path exactly once to passing each edge of the graph at least once, the problem becomes the Chinese postman problem.

Chinese postman problem: Given a directed graph, find the shortest path that visits each edge at least once.

The Chinese postman problem can be solved in polynomial time. However, the Chinese postman problem constrained by the sub-paths is NP-hard. In the Euler assembler, the authors provide several transformations to approach the expected result, such as the x, y-detachment and the x-cut (Fig. 4).

(a) x, y-detachment (b) x-cut
Fig. 4: Equivalent transformations: (a) x, y-detachment and (b) x-cut.

Definition # 7 The x, y-detachment is a transformation that adds a new edge z = (v_in, v_out) and deletes the edges x and y from G (Fig. 4a).

Definition # 8 An x-cut is a transformation that simply removes x from all the paths that start from x or end at x, without affecting the graph G itself (Fig. 4b).

Some paths can be merged because they are consistent with each other. Fig. 5 illustrates two possible cases of consistency. A path P that is consistent only with P_x,y1 (Fig. 5a) is resolvable, while a path P that is consistent with both P_x,y1 and P_x,y2 (Fig. 5b) is unresolvable.

(a) P is consistent only with P_x,y1. (b) P is consistent with both P_x,y1 and P_x,y2.
Fig. 5: two possible cases of consistency.

The unambiguous paths form the contigs, and the paired reads help us build scaffolds (the next level above contigs) from the contigs.

Definition # 9 A contig is a set of overlapping DNA segments that together represent a consensus region of DNA.

Fig. 6: cycle graph before merging.
Fig. 7: de Bruijn graph after merging.

However, because of the huge number of reads, the de Bruijn graph is still very complex. For example, consider the cyclic string

S = ATCAGATAGGAC. (1)

The k-mers with k = 2 form a cycle graph (Fig. 6). If we merge the identical nodes, this cycle graph becomes the de Bruijn graph (Fig. 7), and the merging makes the graph complex. Is there a way to reduce the merging? Here we introduce the paired de Bruijn graph.

Paired de Bruijn graph: Each node denotes the (k - 1)-length prefixes or suffixes of both reads in a pair. An edge connects two nodes if they are the prefixes and the suffixes of the same paired read. The graph is unweighted and directed.

The construction also relies on the following assumptions.

Assume:
1. The paired reads have the same insert size.
2. d denotes the distance between the starting points of the two reads in a pair.

For example, consider the same string, Eq. (1), with d = 4 and k = 2. As shown in Fig. 8, only two nodes can be merged because they are identical. Thus, the paired de Bruijn graph reduces the merging effectively.

Fig. 8: the cycle graph with the paired reads.

We can relax the first assumption slightly. If the first parts of two paired nodes match and the second parts are close to each other, the two nodes can be merged. For example, let one paired node be AT/GA and another be AT/GG. The first parts of both nodes are the same, AT. So, if the distance between the second parts, GA and GG, is small, the two nodes AT/GA and AT/GG can be merged.
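The effect of pairing on the example of Eq. (1) can be checked with a short Python sketch. The function names are illustrative. The sketch assumes that nodes are labelled by 2-mers of the cyclic string, as in the AT/GA example, and that a paired node combines the 2-mer starting at position i with the 2-mer starting at position i + d. It covers only exact merging of identical labels, not the relaxed merging of close second parts.

```python
from collections import Counter

S = "ATCAGATAGGAC"  # cyclic string of Eq. (1)


def cyclic_kmers(s, k):
    """The k-mer starting at every position of the cyclic string s."""
    wrapped = s + s[:k - 1]  # append the start so the last k-mers wrap around
    return [wrapped[i:i + k] for i in range(len(s))]


def paired_labels(s, k, d):
    """Label 'first/second' for each position i, second k-mer taken at i + d."""
    kmers = cyclic_kmers(s, k)
    n = len(kmers)
    return [f"{kmers[i]}/{kmers[(i + d) % n]}" for i in range(n)]


def repeated(labels):
    """Labels that occur more than once, i.e. nodes that would be merged."""
    return {label: count for label, count in Counter(labels).items() if count > 1}


print(repeated(cyclic_kmers(S, 2)))      # AT, CA, AG and GA each occur twice
print(repeated(paired_labels(S, 2, 4)))  # only AT/GA occurs twice
```

With plain 2-mers the 12 nodes of the cycle collapse to 8 distinct labels, while with paired labels and d = 4 only the two AT/GA nodes coincide, leaving 11 distinct nodes.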