BIOINFORMATICS ORIGINAL PAPER

Vol. 27, doi: /bioinformatics/btr321, Genome analysis, Advance Access publication June 2, 2011

A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly

Hieu Dinh and Sanguthevar Rajasekaran
Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA
Associate Editor: Alex Bateman

ABSTRACT
Motivation: Exact-match overlap graphs have been broadly used in the context of DNA assembly and the shortest superstring problem, where the number n of strings ranges from thousands to billions. The length l of the strings is from 25 to 1000, depending on the DNA sequencing technology. However, many DNA assemblers using overlap graphs suffer from the need for too much time and space in constructing the graphs. It is nearly impossible for these DNA assemblers to handle the huge amount of data produced by next-generation sequencing technologies, where the number n of strings could be several billions. If the overlap graph is explicitly stored, it would require Ω(n^2) memory, which could be prohibitive in practice when n is greater than a hundred million. In this article, we propose a novel data structure using which the overlap graph can be compactly stored. This data structure requires only linear time to construct and linear memory to store.
Results: For a given set of input strings (also called reads), we can informally define an exact-match overlap graph as follows. Each read is represented as a node in the graph and there is an edge between two nodes if the corresponding reads overlap sufficiently. A formal description follows. The maximal exact-match overlap of two strings x and y, denoted by ov_max(x,y), is the longest string that is a suffix of x and a prefix of y.
The exact-match overlap graph of n given strings of length l is an edge-weighted graph in which each vertex is associated with a string and there is an edge (x,y) of weight ω = l − |ov_max(x,y)| if and only if ω ≤ λ, where |ov_max(x,y)| is the length of ov_max(x,y) and λ is a given threshold. In this article, we show that exact-match overlap graphs can be represented by a compact data structure that can be stored using at most (2λ−1)(2⌈log n⌉+⌈log λ⌉)n bits, with a guarantee that the basic operation of accessing an edge takes O(log λ) time. We also propose two algorithms for constructing the data structure for the exact-match overlap graph. The first algorithm runs in O(λln log n) worst-case time and requires O(λ) extra memory. The second one runs in O(λln) time and requires O(n) extra memory. Our experimental results on a huge amount of simulated data from sequence assembly show that the data structure can be constructed efficiently in time and memory.
Availability: Our DNA sequence assembler that incorporates the data structure is freely available on the web at
Contact: hdinh@engr.uconn.edu; rajasek@engr.uconn.edu (to whom correspondence should be addressed).
Received on January 24, 2011; revised on May 20, 2011; accepted on May 25, 2011.

1 INTRODUCTION
An exact-match overlap graph of n given strings, each of length l, is an edge-weighted graph defined as follows. Each vertex is associated with a string and there is an edge (x,y) of weight ω = l − |ov_max(x,y)| if and only if ω ≤ λ, where λ is a given threshold and |ov_max(x,y)| is the length of the maximal exact-match overlap of the two strings x and y. τ = l − λ is called the overlap threshold. The formal definition of the exact-match overlap graph is given in Section 2. Storing exact-match overlap graphs efficiently in terms of memory becomes essential when the number of strings is very large. In the literature, there are two common data structures to store a general graph G=(V,E).
The first data structure uses a 2D array of size |V|×|V|. We refer to this as an array-based data structure. One of its advantages is that the time for accessing a given edge is O(1). However, it requires Θ(|V|^2) memory. The second data structure stores the set of edges E. We refer to this as an edge-based data structure. It requires Θ(|V|+|E|) memory and the time for accessing a given edge is O(log Δ), where Δ is the degree of the graph. Both these data structures require Ω(|E|) memory. If the exact-match overlap graphs are stored using these two data structures, we will need Ω(|E|) memory. Even this much memory may not be feasible in cases when the number of strings is over a hundred million. In this article, we focus on data structures for the exact-match overlap graphs that call for much less memory than Ω(|E|).

1.1 Our contributions
We show that there is a compact data structure representing the exact-match overlap graph that needs much less memory than Ω(|E|), with a guarantee that the basic operation of accessing an edge takes O(log λ) time, which is almost a constant in the context of DNA assembly. The data structure can be constructed efficiently in time and memory as well. In particular, we show that:
The data structure takes no more than (2λ−1)(2⌈log n⌉+⌈log λ⌉)n bits.
The data structure can be constructed in O(λln) time.
As a result, any algorithm that uses overlap graphs and runs in time T can be simulated using our compact data structure. In this case, the memory needed is no more than (2λ−1)(2⌈log n⌉+⌈log λ⌉)n bits (for storing the overlap graph) and the time needed is O(T log λ).
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

If λ is a constant or much smaller than n, our data structure will be a perfect solution for any application that does not have enough memory for storing the overlap graph in a traditional way. Our claim may sound contradictory because in some exact-match overlap graphs the number of edges can be Θ(n^2), and it seems as if Θ(n^2) time and memory would be required to construct them. Fortunately, because of some special properties of exact-match overlap graphs, we can construct and store them efficiently. In Section 3.1, we will describe these special properties in detail. Briefly, the idea of storing the overlap graph compactly comes from the following simple observation. If the strings are sorted in lexicographic order, then for any string x the lexicographic ranks of the strings that contain x as a prefix form an integer interval [a,b]. Therefore, the information about the out-neighborhood of a vertex can be described using at most λ intervals. Such intervals have the nice property that they are either disjoint or contain each other. This property allows us to describe the out-neighborhood of a vertex by at most 2λ−1 disjoint intervals. Each interval costs 2⌈log n⌉+⌈log λ⌉ bits, where 2⌈log n⌉ bits are for storing its two bounds and ⌈log λ⌉ bits are for storing the weight. We have n vertices, so the amount of memory required by our data structure is no more than (2λ−1)(2⌈log n⌉+⌈log λ⌉)n bits. Note that this is just an upper bound; in practice, the amount of memory may be much less than that.

1.2 Application: DNA assembly
The main motivation for the exact-match overlap graphs comes from their use in implementing fast approximation algorithms for the shortest superstring problem, which was the very first problem formulation for DNA assembly. The exact-match overlap graphs can be used for other problem formulations for DNA assembly as well.
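The interval observation above is easy to see in code. The following sketch is our own illustration, not code from the paper (the name prefix_range is hypothetical): it finds the interval of sorted reads having a given prefix with two binary searches, using a sentinel character that sorts after every DNA symbol to bound the range from above.

```python
from bisect import bisect_left

def prefix_range(sorted_reads, x):
    """1-based inclusive interval [a, b] of the reads that have x as a prefix,
    or None if no read does. Assumes sorted_reads is in lexicographic order."""
    lo = bisect_left(sorted_reads, x)
    # '~' sorts after 'A', 'C', 'G', 'T', so x + '~' bounds all strings with prefix x.
    hi = bisect_left(sorted_reads, x + "~")
    return (lo + 1, hi) if lo < hi else None

reads = ["AAACCGGGGTTT", "ACCCGAATTTGT", "ACCCTGTGGTAT",
         "ACCGGCTTTCCA", "ACTAAGGAATTT", "TGGCCGAAGAAG"]   # already sorted
print(prefix_range(reads, "AC"))    # (2, 5)
print(prefix_range(reads, "ACCC"))  # (2, 3)
```

The out-neighborhood of a read s is then describable as the union of at most λ such intervals, one per candidate weight ω: prefix_range(reads, s[ω:]) for ω = 1, ..., λ.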
Exact-match overlap graphs have been broadly used in the context of DNA assembly and the shortest superstring problem, where the number n of strings ranges from thousands to billions. The length l of the strings is from 25 to 1000, depending on the DNA sequencing technology used. However, many DNA assemblers using overlap graphs are time and memory intensive. If an overlap graph is explicitly stored, it would require Ω(n^2) memory, which could be prohibitive in practice. In this article, we present a data structure for representing overlap graphs that requires only linear time to construct and linear memory to store. Experimental results have shown that our preliminary DNA sequence assembler that uses this data structure can handle a large number of strings. In particular, it takes about 2.4 days and 62.4 GB of memory to process a set of 660 million DNA strings of length 100 each. The set of DNA strings is drawn uniformly at random from a reference human genome of size 3.3 billion base pairs.

1.3 Related work
Gusfield et al. (1992) and Gusfield (1997) consider the all-pairs suffix-prefix problem, which is actually a special case of computing the exact-match overlap graphs when λ=l. They devised an O(ln+n^2) time algorithm for solving the all-pairs suffix-prefix problem. In this case, the exact-match overlap graph is a complete graph, so the run time of the algorithm is optimal if the exact-match overlap graph is stored in a traditional way. Although the run time of the algorithm by Gusfield et al. is theoretically optimal in that setting, it uses the generalized suffix tree, which has two disadvantages in practice. The first disadvantage is that the space consumption of the suffix tree is quite large (Kurtz, 1999). The second disadvantage is that the suffix tree usually suffers from poor locality of memory references (Ohlebusch and Gog, 2010). Fortunately, Abouelhoda et al.
(2004) have proposed a suffix tree simulation framework that allows any algorithm using the suffix tree to be simulated with enhanced suffix arrays. Ohlebusch and Gog (2010) have made use of properties of the enhanced suffix arrays to devise an algorithm for solving the all-pairs suffix-prefix problem directly, without using the suffix tree simulation framework. The run time of the algorithm by Ohlebusch and Gog is also O(ln+n^2). Please note that our data structure and algorithm can be used to solve the all-pairs suffix-prefix problem in O(λln) time. In the context of DNA assembly, λ is typically much smaller than n, and hence our algorithm will be faster than the algorithms of Gusfield (1997) and Ohlebusch and Gog (2010). Exact-match overlap graphs should be distinguished from approximate-match overlap graphs, which are considered in Myers (2005), Medvedev et al. (2007) and Pop (2009). In the approximate-match overlap graph, there is an edge between two strings x and y if and only if there is a suffix x′ of x and a prefix y′ of y such that the edit distance between x′ and y′ is no more than a certain threshold.

2 BACKGROUND
Let Σ be the alphabet. The size of Σ is a constant. In the context of DNA assembly, Σ = {A,C,G,T}. The length of a string x on Σ, denoted by |x|, is the number of symbols in x. Let x[i] be the i-th symbol of string x, and x[i,j] be the substring of x between the i-th and the j-th positions. A prefix of string x is the substring x[1,i] for some i. A suffix of string x is the substring x[i,|x|] for some i. Given two strings x and y on Σ, an exact-match overlap between x and y, denoted by ov(x,y), is a string that is a suffix of x and a prefix of y (notice that this definition is not symmetric). The maximal exact-match overlap between x and y, denoted by ov_max(x,y), is the longest exact-match overlap between x and y.
Exact-match overlap graphs: informally, an exact-match overlap graph is nothing but a graph in which there is a node for each read and two nodes are connected by an edge if there is a sufficient overlap between the corresponding reads. To be more precise, given n strings s_1, s_2, ..., s_n and a threshold λ, the exact-match overlap graph is an edge-weighted directed graph G=(V,E) in which there is a vertex v_i ∈ V associated with the string s_i, for 1 ≤ i ≤ n. There is an edge (v_i,v_j) ∈ E if and only if |s_i| − |ov_max(s_i,s_j)| ≤ λ. The weight of the edge (v_i,v_j), denoted by ω(v_i,v_j), is |s_i| − |ov_max(s_i,s_j)|. See Figure 1. If all the n input strings have the same length l, then τ = l − λ is called the overlap threshold. If there is an edge (v_i,v_j) in the graph, it implies that the overlap between s_i and s_j is at least τ. The set of out-neighbors of a vertex v is denoted by OutNeigh(v). The size of this set, |OutNeigh(v)|, is called the out-degree of v: deg_out(v) = |OutNeigh(v)|. For simplicity, we assume that all the strings s_1, s_2, ..., s_n have the same length l. Otherwise, let l be the length of the longest string and everything else carries over. The operation of accessing an edge given its two endpoints: given any two vertices v_i and v_j, the operation of accessing the edge (v_i,v_j)
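These definitions can be transcribed directly into code. The sketch below is our own naive illustration (quadratic per pair, for clarity only; the function names are ours, not the paper's):

```python
def ov_max(x, y):
    """Longest string that is both a suffix of x and a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return y[:k]
    return ""

def edge_weight(s_i, s_j, lam):
    """Weight omega = |s_i| - |ov_max(s_i, s_j)| if (v_i, v_j) is an edge
    (i.e. omega <= lam), else None. Note the direction: suffix of s_i,
    prefix of s_j."""
    w = len(s_i) - len(ov_max(s_i, s_j))
    return w if w <= lam else None

print(ov_max("AAACCGG", "CCGGTTT"))          # CCGG
print(edge_weight("AAACCGG", "CCGGTTT", 3))  # 3
```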

is the task of returning ω(v_i,v_j) if (v_i,v_j) is actually an edge of the graph, and returning NULL if it is not.

Fig. 1. An example of an overlap edge.

3 METHODS
3.1 A memory-efficient data structure representing an exact-match overlap graph
In this section, we present a memory-efficient data structure to store an exact-match overlap graph. It only requires at most (2λ−1)(2⌈log n⌉+⌈log λ⌉)n bits, and it guarantees that the time for accessing an edge, given the two endpoints of the edge, is O(log λ). This may sound like a contradictory claim because in some exact-match overlap graphs the number of edges can be Θ(n^2), and it seems as if at least Θ(n^2) time and space should be required to construct them. Fortunately, because of some special properties of exact-match overlap graphs, we can construct and store them efficiently. In the following paragraphs, we describe these special properties. The graph representation we envision is one where we store the neighbors of each node. The difference between the standard adjacency-lists representation of a graph and ours lies in the fact that we only use O(λ) memory to store the neighbors of any node. We are able to do this since we sort the input strings in lexicographic order and employ a data structure called PREFIX, defined below. Another crucial idea we employ is the following: let x be any string and let the input reads be sorted in lexicographic order. If we are interested in identifying all the input strings of which x is a prefix, then these strings will be contiguous in the sorted list. If we use the sorted position of each read as its identifier, then the neighbors of any node can be specified with O(λ) intervals (as we show next). Without loss of generality, we assume that the n input strings s_1, s_2, ..., s_n are sorted in lexicographic order. We can assume this because if they are not sorted, we can sort them using the radix sort algorithm, which runs in O(ln/w) time, where w is the word length of the machine used, assuming that the size of the alphabet is a constant. If the alphabet is not of size O(1), the sorting time will be O(nl log(|Σ|)/w). We associate an identification number with each string s_i and its corresponding vertex v_i in the exact-match overlap graph. This identification number is nothing but the lexicographic rank of s_i. We will access an input string using its identification number; therefore, the identification number and the vertex of an input string are used interchangeably. Also, it is not hard to see that we need ⌈log n⌉ bits to store an identification number. We have the following properties. Given an arbitrary string x, let PREFIX(x) be the set of identification numbers of all the input strings of which x is a prefix. Formally, PREFIX(x) = {i : x is a prefix of s_i}. PREFIX enables us to specify the neighbor list of any node in the graph compactly. In Property 3.1, Property 3.2, Property 3.3 and Lemma 3.1, we prove certain properties of PREFIX and finally show that we can represent the neighbor list of any node as at most 2λ−1 intervals.

Property 3.1. If PREFIX(x) ≠ ∅, then PREFIX(x) = [a,b], where [a,b] is some integer interval containing the integers a, a+1, ..., b−1, b.

Proof. Let a = min_{i ∈ PREFIX(x)} i and b = max_{i ∈ PREFIX(x)} i. Clearly, PREFIX(x) ⊆ [a,b]. On the other hand, we will show that [a,b] ⊆ PREFIX(x). Let i be any identification number in the interval [a,b]. Since the input strings are in lexicographically sorted order, s_a[1,|x|] ≤ s_i[1,|x|] ≤ s_b[1,|x|]. Since a ∈ PREFIX(x) and b ∈ PREFIX(x), s_a[1,|x|] = s_b[1,|x|] = x. Thus, s_a[1,|x|] = s_i[1,|x|] = s_b[1,|x|]. Therefore, x is a prefix of s_i. Hence, i ∈ PREFIX(x).

For example, let
s_1 = AAACCGGGGTTT
s_2 = ACCCGAATTTGT
s_3 = ACCCTGTGGTAT
s_4 = ACCGGCTTTCCA
s_5 = ACTAAGGAATTT
s_6 = TGGCCGAAGAAG
If x = AC, then PREFIX(x) = [2,5]. Similarly, if x = ACCC, then PREFIX(x) = [2,3].
Property 3.1 tells us that PREFIX(x) can be expressed by an interval, which is determined by its lower bound and its upper bound. So we only need 2⌈log n⌉ bits to store PREFIX(x). In the rest of this article, we will refer to PREFIX(x) as an interval. Also, given an identification number i, checking whether i is in PREFIX(x) can be done in O(1) time. In Subsection 3.2.1, we will discuss two algorithms for computing PREFIX(x) for a given string x. The run times of these algorithms are O(|x| log n) and O(|x|), respectively. Property 3.1 leads to the following property.

Property 3.2. OutNeigh(v_i) = ∪_{1 ≤ ω ≤ λ} PREFIX(s_i[ω+1,|s_i|]) for each vertex v_i. In other words, OutNeigh(v_i) is the union of at most λ non-empty intervals.

Proof. Let v_j be a vertex in OutNeigh(v_i). By the definition of the exact-match overlap graph, 1 ≤ |s_i| − |ov_max(s_i,s_j)| = ω(v_i,v_j) ≤ λ. Let ω(v_i,v_j) = ω. Clearly, ov_max(s_i,s_j) = s_i[ω+1,|s_i|] = s_j[1,|ov_max(s_i,s_j)|]. This implies that v_j ∈ PREFIX(s_i[ω+1,|s_i|]). On the other hand, let v_j be any vertex in PREFIX(s_i[ω+1,|s_i|]) for some 1 ≤ ω ≤ λ; it is easy to check that v_j ∈ OutNeigh(v_i). Hence, OutNeigh(v_i) = ∪_{1 ≤ ω ≤ λ} PREFIX(s_i[ω+1,|s_i|]).

From Property 3.2, it follows that we can represent OutNeigh(v_i) by at most λ non-empty intervals, which need at most 2λ⌈log n⌉ bits to store. Therefore, it takes at most 2nλ⌈log n⌉ bits to store the exact-match overlap graph. However, given two vertices v_i and v_j, it takes O(λ) time to retrieve ω(v_i,v_j) because we have to sequentially check whether v_j is in PREFIX(s_i[2,|s_i|]), PREFIX(s_i[3,|s_i|]), ..., PREFIX(s_i[λ+1,|s_i|]). But if OutNeigh(v_i) can be represented by k disjoint intervals, then the task of retrieving ω(v_i,v_j) can be done in O(log k) time by using binary search. In Lemma 3.1, we show that OutNeigh(v_i) is the union of at most 2λ−1 disjoint intervals.

Property 3.3.
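Once an out-neighborhood is stored as disjoint weighted intervals sorted by lower bound, the O(log k) retrieval is a single binary search. The sketch below is our own illustration (the interval list is made up, not derived from real reads):

```python
from bisect import bisect_right

INF = float("inf")

def access_edge(out_intervals, j):
    """out_intervals: disjoint (a, b, weight) triples sorted by lower bound a.
    Return the weight of edge (v_i, v_j) if j lies in some [a, b],
    else None (no edge)."""
    pos = bisect_right(out_intervals, (j, INF, INF)) - 1
    if pos >= 0:
        a, b, w = out_intervals[pos]
        if a <= j <= b:
            return w
    return None

out = [(2, 3, 1), (5, 5, 2), (8, 11, 3)]   # hypothetical OutNeigh(v_i)
print(access_edge(out, 9))   # 3
print(access_edge(out, 4))   # None
```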
For any two strings x and y with |x| < |y|, one of the two following statements is true:
PREFIX(y) ⊆ PREFIX(x)
PREFIX(y) ∩ PREFIX(x) = ∅

Proof. There are only two possible cases for x and y. Case 1: x is a prefix of y. In this case, it is not hard to infer that PREFIX(y) ⊆ PREFIX(x). Case 2: x is not a prefix of y. In this case, it is not hard to infer that PREFIX(y) ∩ PREFIX(x) = ∅.

Fig. 2. A forest illustration in the proof of Lemma 3.1.

Lemma 3.1. Given λ intervals [a_1,b_1], [a_2,b_2], ..., [a_λ,b_λ] satisfying Property 3.3, their union is the union of at most 2λ−1 disjoint intervals. Formally, there exist p ≤ 2λ−1 disjoint intervals [a′_1,b′_1], [a′_2,b′_2], ..., [a′_p,b′_p] such that ∪_{1 ≤ i ≤ λ} [a_i,b_i] = ∪_{1 ≤ i ≤ p} [a′_i,b′_i].

Proof. We say interval [a_i,b_i] is the parent of interval [a_j,b_j] if [a_i,b_i] is the smallest interval containing [a_j,b_j]; we also say interval [a_j,b_j] is a child of interval [a_i,b_i]. Since the intervals [a_i,b_i] are either pairwise disjoint or contain each other, each interval has at most one parent. Therefore, the intervals [a_i,b_i] form a forest in which each vertex is associated with an interval. See Figure 2. For each interval [a_i,b_i], let I_i be the set of maximal intervals that are contained in [a_i,b_i] but disjoint from all of its children. For example, if [a_i,b_i] = [1,20] and its child intervals are [3,5], [7,8] and [12,15], then I_i = {[1,2],[6,6],[9,11],[16,20]}. If the interval [a_i,b_i] is a leaf interval (i.e. an interval having no children), I_i is simply the set containing only the interval [a_i,b_i]. Let A = ∪_{1 ≤ i ≤ λ} I_i. We will show that A is a set of disjoint intervals satisfying the condition of the lemma.

First, we show that ∪_{1 ≤ i ≤ λ} [a_i,b_i] = ∪_{[a′,b′] ∈ A} [a′,b′]. By the construction of the I_i, it is trivial to see that ∪_{[a′,b′] ∈ A} [a′,b′] ⊆ ∪_{1 ≤ i ≤ λ} [a_i,b_i]. Conversely, it is enough to show that [a_i,b_i] ⊆ ∪_{[a′,b′] ∈ A} [a′,b′] for any 1 ≤ i ≤ λ. This can be proved by induction on the vertices of each tree of the forest. For the base case, each leaf interval [a_i,b_i] is obviously in A; therefore, [a_i,b_i] ⊆ ∪_{[a′,b′] ∈ A} [a′,b′] for any leaf interval [a_i,b_i]. For any internal interval [a_i,b_i], assume that all of its child intervals are subsets of ∪_{[a′,b′] ∈ A} [a′,b′]. By the construction of I_i, [a_i,b_i] is the union of all the intervals in I_i and all of its child intervals. Therefore, [a_i,b_i] ⊆ ∪_{[a′,b′] ∈ A} [a′,b′].

Secondly, we show that the intervals in A are pairwise disjoint. It is sufficient to show that any interval in I_i is disjoint from every interval in I_j for i ≠ j. Obviously, the statement is true if [a_i,b_i] ∩ [a_j,b_j] = ∅, so let us consider the case where one contains the other. Without loss of generality, we assume that [a_j,b_j] ⊆ [a_i,b_i]. Consider two cases. Case 1: [a_i,b_i] is the parent of [a_j,b_j]. By the construction of I_i, any interval in I_i is disjoint from [a_j,b_j]. By the construction of I_j, any interval in I_j is contained in [a_j,b_j]. Therefore, they are disjoint. Case 2: [a_i,b_i] is not the parent of [a_j,b_j]. Let [a_j,b_j] = [a_{i_0},b_{i_0}] ⊆ [a_{i_1},b_{i_1}] ⊆ ... ⊆ [a_{i_h},b_{i_h}] = [a_i,b_i], where [a_{i_t},b_{i_t}] is the parent of [a_{i_{t−1}},b_{i_{t−1}}]. From the result in Case 1, any interval in I_{i_t} is disjoint from [a_{i_{t−1}},b_{i_{t−1}}] for 1 ≤ t ≤ h. So any interval in I_i is disjoint from [a_j,b_j]. We already know that any interval in I_j is contained in [a_j,b_j]. Thus, they are disjoint.

Finally, we show that the number of intervals in A is no more than 2λ−1. Clearly, |A| = Σ_{i=1..λ} |I_i|. It is easy to see that the number of intervals in I_i is no more than the number of children of [a_i,b_i] plus one, which is equal to the degree d_i of the vertex associated with [a_i,b_i] if the vertex is not a root of a tree in the forest, and equal to d_i + 1 if the vertex is a root. Let q be the number of trees in the forest. Then, |A| = Σ_{i=1..λ} |I_i| ≤ Σ_{i=1..λ} d_i + q = 2|E_F| + q, where d_i is the degree of the vertex associated with [a_i,b_i] and E_F is the set of edges of the forest. We know that in a tree the number of edges is equal to the number of vertices minus one; thus, |E_F| = λ − q. Therefore, |A| ≤ 2λ − q ≤ 2λ − 1. This completes our proof.

The above proof yields an algorithm for computing the disjoint intervals starting from the forest of intervals. Once the forest is built, outputting the disjoint intervals can be done easily at each vertex. However, designing a fast algorithm for constructing the forest is not trivial. In Subsection 3.2.2, we will discuss an O(λ log λ)-time algorithm for constructing the forest. Thereby, there is an O(λ log λ)-time algorithm for computing the disjoint intervals of Lemma 3.1, given λ intervals satisfying Property 3.3. Also, from Property 3.3 and Lemma 3.1, it is not hard to prove the following theorem.

Theorem 3.1. OutNeigh(v_i) is the union of at most 2λ−1 disjoint intervals. Formally, OutNeigh(v_i) = ∪_{1 ≤ m ≤ p} [a_m,b_m], where p ≤ 2λ−1 and [a_m,b_m] ∩ [a_{m′},b_{m′}] = ∅ for 1 ≤ m ≠ m′ ≤ p. Furthermore, ω(v_i,v_j) = ω(v_i,v_k) for any 1 ≤ m ≤ p and for any v_j, v_k ∈ [a_m,b_m].

Theorem 3.1 suggests a way of storing OutNeigh(v_i) by at most 2λ−1 disjoint intervals. Each interval takes 2⌈log n⌉ bits to store its lower bound and its upper bound, and ⌈log λ⌉ bits to store the weight. Thus, we need 2⌈log n⌉+⌈log λ⌉ bits to store each interval. Therefore, it takes at most (2λ−1)(2⌈log n⌉+⌈log λ⌉) bits to store each OutNeigh(v_i). Overall, we need (2λ−1)(2⌈log n⌉+⌈log λ⌉)n bits to store the exact-match overlap graph. Of course, the disjoint intervals of each OutNeigh(v_i) are stored in sorted order of their lower bounds. Therefore, the operation of accessing an edge (v_i,v_j) can easily be done in O(log λ) time by using binary search.

3.2 Algorithms for constructing the compact data structure
In this section, we describe two algorithms for constructing the data structure representing the exact-match overlap graph.
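To get a feel for the storage bound (2λ−1)(2⌈log n⌉+⌈log λ⌉)n of Theorem 3.1, one can plug in concrete values. In the sketch below (our own), n is taken from the paper's experiments, but λ = 40 is an arbitrary assumption of ours; note this is a worst-case bound, and the paper reports far lower memory usage in practice.

```python
from math import ceil, log2

def storage_bound_bits(n, lam):
    """Worst-case bound (2*lam - 1) * (2*ceil(log n) + ceil(log lam)) * n bits."""
    return (2 * lam - 1) * (2 * ceil(log2(n)) + ceil(log2(lam))) * n

n, lam = 660_000_000, 40     # n from the experiments; lam = 40 is our assumption
gib = storage_bound_bits(n, lam) / 8 / 2**30
print(f"worst-case bound: {gib:.0f} GiB")
```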
The run time of the first algorithm is O(λln log n), and it only uses O(λ) extra memory besides the ln⌈log |Σ|⌉ bits used to store the n input strings. The second algorithm runs in O(λln) time and requires O(n) extra memory. As shown in Section 3.1, the algorithms need two routines: the first routine computes PREFIX(x) and the second one computes the disjoint intervals described in Lemma 3.1.

3.2.1 Computing the interval PREFIX(x)
In this subsection, we consider the problem of computing the interval PREFIX(x), given a string x and n input strings s_1, s_2, ..., s_n of the same length l in lexicographic order. We describe two algorithms for this problem. The first algorithm takes O(|x| log n) time and O(1) extra memory. The second algorithm runs in O(|x|) time and requires O(n) extra memory.

The first algorithm runs in phases. In each phase, it considers one of the symbols in x. In the first phase, it considers x[1] and obtains the list of input strings whose first symbol is x[1]. Since the input strings are in sorted order, this list can be represented as an interval [a_1,b_1]. Following this, in the second phase the algorithm considers x[2]. From among the strings in the interval [a_1,b_1], it identifies the strings whose second symbol is x[2]. These strings form an interval [a_2,b_2]; and so on. The interval that results at the end of the k-th phase (where k = |x|) is PREFIX(x). In each phase, binary search is used to figure out the right interval.

In the second algorithm, a trie is built over all the input strings. Each node in the trie corresponds to a string u, and the node stores the interval for this string (i.e. the node stores the list of input strings for which u is a prefix). For any given x, we walk through the trie tracing the symbols in x. The last node visited in this traversal will have PREFIX(x) (if indeed x is a string that can be traced in the trie).

A binary search based algorithm: let [a_i,b_i] = PREFIX(x[1,i]) for 1 ≤ i ≤ |x|. It is easy to see that PREFIX(x) = [a_{|x|},b_{|x|}] ⊆ [a_{|x|−1},b_{|x|−1}] ⊆ ... ⊆ [a_1,b_1]. Consider the following input strings, for example.
s_1 = AAACCGGGGTTT
s_2 = ACCAGAATTTGT
s_3 = ACCATGTGGTAT
s_4 = ACGGGCTTTCCA
s_5 = ACTAAGGAATTT
s_6 = TGGCCGAAGAAG
x = ACCA
Then, [a_1,b_1] = [1,5], [a_2,b_2] = [2,5], [a_3,b_3] = [2,3] and PREFIX(x) = [a_4,b_4] = [2,3]. We will find [a_i,b_i] from [a_{i−1},b_{i−1}] for i from 1 to |x|, where [a_0,b_0] = [1,n] initially. Thereby, PREFIX(x) is computed. Let Col_i be the string that consists of all the symbols at position i of the input strings. In the above example, Col_3 = ACCGTG. Observe that the symbols in the string Col_i[a_{i−1},b_{i−1}] are in lexicographical order for 1 ≤ i ≤ |x|. Another observation is that [a_i,b_i] is the interval where the symbol x[i] appears consecutively in the string Col_i[a_{i−1},b_{i−1}]. Therefore, [a_i,b_i] is determined by searching for the symbol x[i] in the string Col_i[a_{i−1},b_{i−1}]. This can be done easily by first using binary search to find a position in the string Col_i[a_{i−1},b_{i−1}] where the symbol x[i] appears. If the symbol x[i] is not found, we return the empty interval and stop. If the symbol x[i] is found at position c_i, then a_i (respectively, b_i) can be determined by using the double search routine in the string Col_i[a_{i−1},c_i] (respectively, the string Col_i[c_i,b_{i−1}]) as follows. We consider the symbols in the string Col_i[a_{i−1},c_i] at positions c_i−2^0, c_i−2^1, ..., c_i−2^k, a_{i−1}, where k = ⌈log(c_i − a_{i−1})⌉. We find j such that the symbol Col_i[c_i−2^j] is the symbol x[i] but the symbol Col_i[c_i−2^{j+1}] is not. Finally, a_i is determined by using binary search in the string Col_i[c_i−2^{j+1}, c_i−2^j]. Similarly, b_i is determined. The pseudo-code is given as follows.
1. Initialize [a_0,b_0] = [1,n].
2. for i = 1 to |x| do
3.   Find the symbol x[i] in the string Col_i[a_{i−1},b_{i−1}] using binary search.
4.   if the symbol x[i] appears in the string Col_i[a_{i−1},b_{i−1}] then
5.     Let c_i be the position of the symbol x[i] returned by the binary search.
6.     Find a_i by double search and then binary search in the string Col_i[a_{i−1},c_i].
7.     Find b_i by double search and then binary search in the string Col_i[c_i,b_{i−1}].
8.   else
9.     Return the empty interval.
10.  end if
11. end for
12. Return the interval [a_{|x|},b_{|x|}].

Analysis: as we discussed above, the correctness of the algorithm is easy to see. Let us analyze the memory and time complexity of the algorithm. Since the algorithm only uses binary search and double search, it needs O(1) extra memory. For the time complexity, it is easy to see that computing the interval [a_i,b_i] at step i takes O(log(b_{i−1}−a_{i−1})) = O(log n) time, because both binary search and double search take O(log(b_{i−1}−a_{i−1})) time. Overall, the algorithm takes O(|x| log n) time because there are at most |x| steps.

A trie-based algorithm: as we have seen above, to compute the interval [a_i,b_i] for symbol x[i], we use binary search to find the symbol x[i] in the interval [a_{i−1},b_{i−1}], which takes O(log(b_{i−1}−a_{i−1})) = O(log n) time. We can reduce the O(log n) factor to O(1) in computing the interval [a_i,b_i] by pre-computing all the intervals for each symbol in the alphabet and storing them in a trie. Given the symbol x[i], to find the interval [a_i,b_i] we just retrieve it from the trie, which takes O(1) time. The trie is defined as follows (Fig. 3): at each node of the trie, we store a symbol and its interval. Observe that we do not have to store the nodes that have only one child. These nodes form chains in the trie. We remove such chains and store their lengths in each remaining node. As a result, each internal node in the trie has at least two children.

Fig. 3. An illustration of a trie for the example input strings above.
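The phase-wise binary search algorithm can be sketched as follows. This is our own simplification: it uses plain lower-bound/upper-bound binary searches on the i-th column in place of the paper's double-search refinement, which leaves the per-phase cost at the same O(log n).

```python
def prefix_interval(reads, x):
    """PREFIX(x) as a 1-based inclusive interval over lexicographically
    sorted, equal-length reads, or None if empty. O(|x| log n) time."""
    a, b = 1, len(reads)
    for i, c in enumerate(x):
        # Within reads[a-1 .. b-1] all strings share the prefix x[0:i],
        # so their i-th symbols are sorted and x[i] occupies one run.
        lo, hi = a - 1, b                 # 0-based, half-open
        while lo < hi:                    # leftmost read with i-th symbol >= c
            mid = (lo + hi) // 2
            if reads[mid][i] < c:
                lo = mid + 1
            else:
                hi = mid
        left = lo
        lo, hi = left, b
        while lo < hi:                    # leftmost read with i-th symbol > c
            mid = (lo + hi) // 2
            if reads[mid][i] <= c:
                lo = mid + 1
            else:
                hi = mid
        if left == lo:
            return None                   # x[i] absent: empty interval
        a, b = left + 1, lo
    return (a, b)

reads = ["AAACCGGGGTTT", "ACCAGAATTTGT", "ACCATGTGGTAT",
         "ACGGGCTTTCCA", "ACTAAGGAATTT", "TGGCCGAAGAAG"]
print(prefix_interval(reads, "ACCA"))   # (2, 3), as in the worked example
```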
Because each internal node has at least two children, the number of nodes in the trie is no more than twice the number of leaves, i.e. no more than 2n. Therefore, we need O(n) memory to store the trie. Also, it is well known that the trie can be constructed recursively in O(ln) time. It is easy to see that once the trie is constructed, the task of finding the interval [a_i,b_i] for each symbol x[i] takes O(1) time. Therefore, computing PREFIX(x) takes O(|x|) time.

3.2.2 Computing the disjoint intervals
In this subsection, we consider the problem of computing the maximal disjoint intervals, given k intervals [a_1,b_1], [a_2,b_2], ..., [a_k,b_k] which are either pairwise disjoint or contain each other. As discussed in Section 3.1, it is sufficient to build the forest of the k input intervals. Once the forest is built, outputting the maximal disjoint intervals can be done easily at each vertex of the forest. The algorithm works as follows. First, we sort the input intervals in non-decreasing order of their lower bounds a_i. Among those intervals whose lower bounds are equal, we sort them in decreasing order of their upper bounds b_i. After this step, we have (i) a_1 ≤ a_2 ≤ ... ≤ a_k and (ii) if a_i = a_j then b_i > b_j for 1 ≤ i < j ≤ k. Since the input intervals are either pairwise disjoint or contain each other, for any two consecutive intervals [a_i,b_i] and [a_{i+1},b_{i+1}] (for 1 ≤ i < k), either [a_i,b_i] contains [a_{i+1},b_{i+1}] or they are disjoint. Observe that if [a_i,b_i] contains [a_{i+1},b_{i+1}], then [a_i,b_i] is actually the parent of [a_{i+1},b_{i+1}]. If they are disjoint, then the parent of [a_{i+1},b_{i+1}] is the smallest ancestor of [a_i,b_i] that contains [a_{i+1},b_{i+1}]. If such an ancestor does not exist, then [a_{i+1},b_{i+1}] does not have a parent. Let A_i = {[a_{i_1},b_{i_1}], ..., [a_{i_m},b_{i_m}]} be the set of ancestors of [a_i,b_i], where i_1 < ... < i_m. It is easy to see that [a_{i_1},b_{i_1}] ⊇ ... ⊇ [a_{i_m},b_{i_m}].
Therefore, the smallest ancestor of [a_i, b_i] that contains [a_{i+1}, b_{i+1}] can be found by binary search, which takes at most O(log k) time. Furthermore, if [a_{i_j}, b_{i_j}] is that smallest ancestor, then the set of ancestors of [a_{i+1}, b_{i+1}] is A_{i+1} = {[a_{i_1}, b_{i_1}], ..., [a_{i_j}, b_{i_j}]}. Based on these observations, the algorithm can be described by the following pseudo-code.

1. Sort the input intervals [a_i, b_i] as described above.
2. Initialize A = {[a_1, b_1]}. /* A holds the current interval [a_i, b_i] together with its ancestors */
3. for i = 1 to k-1 do
4. if [a_i, b_i] contains [a_{i+1}, b_{i+1}] then
5. Output [a_i, b_i] as the parent of [a_{i+1}, b_{i+1}].
6. Add [a_{i+1}, b_{i+1}] into A.
7. else
8. Assume that A = {[a_{i_1}, b_{i_1}], ..., [a_{i_m}, b_{i_m}]}.
9. Find the smallest interval in A that contains [a_{i+1}, b_{i+1}].
10. if the smallest interval is found then

11. Assume that the smallest interval is [a_{i_j}, b_{i_j}].
12. Output [a_{i_j}, b_{i_j}] as the parent of [a_{i+1}, b_{i+1}].
13. Set A = {[a_{i_1}, b_{i_1}], ..., [a_{i_j}, b_{i_j}], [a_{i+1}, b_{i+1}]}.
14. else
15. Set A = {[a_{i+1}, b_{i+1}]}.
16. end if
17. end if
18. end for

Analysis: as we argued above, the algorithm is correct. Let us analyze its run time. Sorting the input intervals takes O(k log k) time using any optimal comparison-based sorting algorithm. It is easy to see that finding the smallest containing interval in the set A dominates the running time of each iteration of the loop, and it takes O(log k) time. Since there are at most k iterations, the run time of the algorithm is O(k log k) overall. Notice that the sorted list of the intervals is actually a pre-order traversal of the forest, so the time complexity after sorting can be improved to O(k). However, this improvement does not change the overall time complexity of the algorithm, since sorting the intervals takes O(k log k) time.

Algorithms for constructing the compact data structure

In this subsection, we describe two complete algorithms for constructing the data structure. Both algorithms use the routines of the preceding subsections; the only difference between them is the way PREFIX is computed. The first algorithm computes PREFIX with the binary search-based routine and the second with the trie-based routine. The following pseudo-code describes the first algorithm.

1. for i = 1 to n do
2. for j = 2 to λ+1 do
3. Compute PREFIX(s_i[j, |s_i|]) by the binary search-based routine given in Subsection 3.2.1.
4. end for
5. Output the disjoint intervals from the input intervals PREFIX(s_i[2, |s_i|]), ..., PREFIX(s_i[λ+1, |s_i|]) by using the disjoint-interval routine above.
6. end for

An analysis of the time and memory complexity of the first algorithm follows. Each computation of PREFIX in line 3 takes O(l log n) time and O(1) extra memory.
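To make lines 1-5 concrete, here is a self-contained sketch in Python. It is illustrative only: the bisect-based interval search stands in for the column-wise PREFIX routine, the containment forest is built with a stack (exploiting the pre-order property noted in the analysis above), and all names are ours, not the authors'.

```python
import bisect

def suffix_intervals(sorted_reads, read, lam):
    """Non-empty intervals PREFIX(read[j:]) for j = 1..λ (0-based slicing),
    i.e. for the suffixes obtained by dropping 1..λ leading symbols."""
    out = []
    for j in range(1, lam + 1):
        s = read[j:]
        a = bisect.bisect_left(sorted_reads, s)
        # "\uffff" exceeds every DNA symbol, bounding all strings with prefix s
        b = bisect.bisect_right(sorted_reads, s + "\uffff")
        if a < b:  # keep non-empty intervals only
            out.append((a, b))
    return out

def interval_forest(intervals):
    """Parent pointers for intervals that are pairwise disjoint or nested.
    Sorting by (lower asc, upper desc) lists the forest in pre-order, so a
    stack holding the current nesting chain yields each parent in O(1)."""
    parent, stack = {}, []
    for iv in sorted(set(intervals), key=lambda t: (t[0], -t[1])):
        # pop chain entries that do not contain iv
        while stack and not (stack[-1][0] <= iv[0] and iv[1] <= stack[-1][1]):
            stack.pop()
        parent[iv] = stack[-1] if stack else None
        stack.append(iv)
    return parent

# The maximal disjoint intervals are exactly the roots of the forest.
forest = interval_forest([(1, 10), (2, 5), (6, 9), (12, 15)])
print(sorted(iv for iv, p in forest.items() if p is None))  # [(1, 10), (12, 15)]
```

The stack-based pass is the O(k) post-sort variant mentioned in the analysis; sorting still dominates at O(k log k).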
So the loop of line 2 takes O(λl log n) time and O(λ) extra memory. Computing the disjoint intervals in line 5 takes O(λ log λ) time and O(λ) extra memory. Since λ <= l, the run time of the loop of line 2 dominates the run time of each iteration of the loop of line 1. Therefore, the algorithm takes O(λln log n) time and O(λ) extra memory in total.

The second algorithm is described by the same pseudo-code, except that in line 3 the binary search-based routine for computing PREFIX(s_i[j, |s_i|]) is replaced by the trie-based routine. Let us analyze the second algorithm. Computing PREFIX in line 3 takes O(l) time instead of O(l log n) as in the first algorithm. By an analysis similar to that of the first algorithm, the loop of line 2 takes O(λl) time and O(λ) extra memory. Constructing the trie takes O(ln) time. Therefore, the algorithm runs in O(λln) time. We also need O(n) extra memory to store the trie; since n is much larger than λ in many cases, the algorithm takes O(n) extra memory.

It is possible to develop a third algorithm by revising the steps in lines 2-4 of the pseudo-code, which compute the set of intervals of the suffixes of each input string s_i. The revision is based on suffix trees and the binary search-based routine. The idea is to build a suffix tree for every input string s_i; note that every leaf of the suffix tree corresponds to a suffix of s_i. After building the suffix tree, we populate every leaf with the corresponding interval by traversing the tree and computing the interval of each node from the interval of its parent. Finally, we output the intervals of those suffixes whose lengths are at least the overlap threshold τ = l - λ. Given the interval of its parent, the interval of any node of the suffix tree can be determined in O(log n) time. Because the suffix tree has O(l) nodes, it takes O(l log n) time to compute all the intervals.
In addition, it takes O(l) time to build the suffix tree and O(λ log λ) time to find the disjoint intervals. Since λ <= l <= n, we spend O(l log n) time for each input string s_i. As a result, the entire algorithm runs in O(ln log n) time. It also takes only O(l) extra memory, because each suffix tree takes O(l) memory. In practice, log n is smaller than λ, and hence this algorithm could also be of interest.

4 RESULTS

We have implemented a DNA sequence assembler named Large-scale Efficient DNA Assembly Program (LEAP) that incorporates our data structure for the overlap graphs. The assembler has three stages: preprocessing the input DNA sequences, constructing the overlap graph and assembling. In the context of DNA sequence assembly, the input DNA sequences are called reads. In the first stage, we add the reverse complements of the reads; then we sort them and remove contained reads. The second stage, constructing the data structure of the overlap graph, is the main focus of our article. The last stage analyzes the overlap graph, retrieves unambiguous paths and outputs the contigs accordingly.

We tested our assembler on simulated data as follows. First, we simulated a genome G. Then each read of length l was drawn from a random location in either G or the reverse complement of G. Reads drawn from the genome are error-free. The number n of drawn reads is determined by the coverage depth c through the equation n = c|G|/l. We considered three datasets with the same read length l = 100, the same coverage depth c = 20 and different genome sizes: 238 Mb, 1 Gb and 3.3 Gb. The number of reads in these datasets is 47.6 million, 200 million and 660 million, respectively. The size of the first genome is approximately that of human chromosome 2; the size of the third genome is approximately that of the whole human genome. For the first and the second datasets, we ran our assembler with varying values of the overlap threshold τ: 30, 40, 50, 60 and 70.
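The read counts quoted above follow directly from the coverage equation n = c|G|/l; a quick check against the dataset values from the text:

```python
def num_reads(coverage, genome_len, read_len):
    """n = c * |G| / l: the number of reads implied by coverage depth c,
    genome length |G| and read length l."""
    return coverage * genome_len // read_len

# c = 20 and l = 100 for all three datasets in the text.
for genome_len in (238_000_000, 1_000_000_000, 3_300_000_000):
    print(num_reads(20, genome_len, 100))
# prints 47600000, 200000000 and 660000000, matching the quoted read counts
```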
We only tried the overlap threshold τ = 30 for the last dataset because the run time was quite long, about 2.4 days. To assess the quality of the contigs, we aligned them to the reference genome and found that all the contigs appeared in the reference genome. We ran our assembler on an Ubuntu Linux machine with a 2.4 GHz CPU and 130 GB of RAM. To save memory, we chose the binary search-based algorithm to construct the overlap graph in the second stage. The details are provided in Tables 1 and 2.

The DNA sequence assembler developed by Simpson and Durbin (2010) also employs the overlap graph approach. Their assembler, named String Graph Assembler (SGA), uses the suffix array and FM-index of the entire read set to construct the overlap graph. Their article reported that the bottleneck in terms of time and memory usage was constructing the suffix array and FM-index, which required 8.5 h and about 55 GB of memory on the first dataset; the total processing time was 15.2 h. On the third dataset, they estimated by extrapolation that constructing the suffix array and FM-index would require about 4.5 days and 700 GB of memory, and the total processing time would be more than that. However, SGA has been improved in terms of memory efficiency since its first version was released. Unfortunately, while the second version of SGA improves memory usage, its run time increases. We were able to run the latest version of SGA on the same machine on the datasets; the source code of the latest version of SGA is available online.

Table 1. Detailed results for the first dataset. Columns: genome size (Mb), number of reads (M), overlap threshold (bp), overlap time (h), intervals (M), edges (M), overlap memory (GB), assembly N50 (kb), longest contig (kb) and total time (h). [The numeric entries of this table were lost in extraction.]

Table 2. Detailed results for the second and the third datasets. Columns: genome size (Gb), number of reads (M), overlap threshold (bp), overlap time (h), intervals (B), edges (B), overlap memory (GB), assembly N50 (kb), longest contig (Mb) and total time (h). [The numeric entries of this table were lost in extraction.]

Table 3. Time and memory comparison between the assemblers: time and memory (GB) for LEAP, the latest SGA and the old SGA on each dataset. Surviving entries: LEAP took 0.8 h on the 238 Mb dataset, 6.1 h on the 1 Gb dataset and 2.4 days on the 3.3 Gb dataset; the old SGA used 55 GB of memory on the 238 Mb dataset and 700 GB on the 3.3 Gb dataset. [The remaining entries were lost in extraction.]

Table 3 provides the time and memory comparison between the assemblers. For all of the datasets, we ran the two assemblers with the same overlap threshold τ = 30. The contigs output by the two assemblers were almost the same.

5 CONCLUSION

We have described a memory-efficient data structure that represents the exact-match overlap graph. We have shown that this data structure needs at most (2λ-1)(2⌈log n⌉ + ⌈log λ⌉)n bits, which is a surprising result because the number of edges in the graph can be Ω(n²). Also, it takes O(log λ) time to access an edge through the data structure. We have proposed two fast algorithms to construct the data structure. The first algorithm is based on binary search; it runs in O(λln log n) time and takes O(λ) extra memory. The second algorithm, based on the trie, runs in O(λln) time, which is slightly faster than the first, but it takes O(n) extra memory to store the trie. A nice feature of the first algorithm is that the memory it uses is mostly for the input strings themselves. This feature is crucial for building an efficient DNA assembler. We are also continuing to develop our assembler LEAP, which incorporates the data structure for the overlap graph.
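To get a feel for the space bound, one can evaluate (2λ-1)(2⌈log₂ n⌉ + ⌈log₂ λ⌉)n for parameters in the range used here. This is a worst-case bound; the values of n and λ below are illustrative, not measurements from the tables.

```python
import math

def bound_bits(n, lam):
    """Worst-case size of the data structure in bits:
    (2λ-1) * (2*ceil(log2 n) + ceil(log2 λ)) * n."""
    return (2 * lam - 1) * (2 * math.ceil(math.log2(n)) + math.ceil(math.log2(lam))) * n

# e.g. n = 200 million reads and λ = 70 (illustrative parameter choices):
bits = bound_bits(200_000_000, 70)
print(bits / 8 / 2**30)  # the bound expressed in GiB
```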
The experimental results show that our assembler can efficiently handle datasets whose size equals that of the whole human genome. Currently, our assembler works for error-free reads. In reality, reads usually have errors, and if the error rate is high our assembler may not work well. However, with improving accuracy in sequencing technology, the error rate has been decreasing. If the error rate is low enough, there will be many error-free reads, and our assembler will still work in this case. Alternatively, the reads can first be corrected before being fed to our assembler. In the future, we would like to adapt our assembler to handle reads with errors as well.

ACKNOWLEDGEMENTS

The authors would like to thank Vamsi Kundeti for many fruitful discussions.

Funding: This work has been supported in part by the following grants: National Science Foundation ( ), National Science Foundation ( ) and National Institutes of Health (1R01LM A1).

Conflict of Interest: none declared.

REFERENCES

Abouelhoda,M. et al. (2004) Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms, 2.
Gusfield,D. (1997) Algorithms on Strings, Trees, and Sequences. Cambridge University Press, New York.
Gusfield,D. et al. (1992) An efficient algorithm for the all pairs suffix-prefix problem. Inf. Process. Lett., 41.
Kurtz,S. (1999) Reducing the space requirement of suffix trees. Softw. Pract. Exp., 29.
Medvedev,P. et al. (2007) Computability of models for sequence assembly. In Proceedings of the Workshop on Algorithms in Bioinformatics.
Myers,E.W. (2005) The fragment assembly string graph. Bioinformatics, 21.
Ohlebusch,E. and Gog,S. (2010) Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem. Inf. Process. Lett., 110.
Pop,M. (2009) Genome assembly reborn: recent computational challenges. Brief. Bioinformatics, 10.
Simpson,J.T. and Durbin,R.
(2010) Efficient construction of an assembly string graph using the FM-index. Bioinformatics, 26, i367–i373.


4 Basics of Trees. Petr Hliněný, FI MU Brno 1 FI: MA010: Trees and Forests 4 Basics of Trees Trees, actually acyclic connected simple graphs, are among the simplest graph classes. Despite their simplicity, they still have rich structure and many useful application, such as in

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

Lecture 9 March 4, 2010

Lecture 9 March 4, 2010 6.851: Advanced Data Structures Spring 010 Dr. André Schulz Lecture 9 March 4, 010 1 Overview Last lecture we defined the Least Common Ancestor (LCA) and Range Min Query (RMQ) problems. Recall that an

More information

The Structure of Bull-Free Perfect Graphs

The Structure of Bull-Free Perfect Graphs The Structure of Bull-Free Perfect Graphs Maria Chudnovsky and Irena Penev Columbia University, New York, NY 10027 USA May 18, 2012 Abstract The bull is a graph consisting of a triangle and two vertex-disjoint

More information

Technical University of Denmark

Technical University of Denmark Technical University of Denmark Written examination, May 7, 27. Course name: Algorithms and Data Structures Course number: 2326 Aids: Written aids. It is not permitted to bring a calculator. Duration:

More information

Indexing Variable Length Substrings for Exact and Approximate Matching

Indexing Variable Length Substrings for Exact and Approximate Matching Indexing Variable Length Substrings for Exact and Approximate Matching Gonzalo Navarro 1, and Leena Salmela 2 1 Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl 2 Department of

More information

A SIMPLE APPROXIMATION ALGORITHM FOR NONOVERLAPPING LOCAL ALIGNMENTS (WEIGHTED INDEPENDENT SETS OF AXIS PARALLEL RECTANGLES)

A SIMPLE APPROXIMATION ALGORITHM FOR NONOVERLAPPING LOCAL ALIGNMENTS (WEIGHTED INDEPENDENT SETS OF AXIS PARALLEL RECTANGLES) Chapter 1 A SIMPLE APPROXIMATION ALGORITHM FOR NONOVERLAPPING LOCAL ALIGNMENTS (WEIGHTED INDEPENDENT SETS OF AXIS PARALLEL RECTANGLES) Piotr Berman Department of Computer Science & Engineering Pennsylvania

More information

Multi-Cluster Interleaving on Paths and Cycles

Multi-Cluster Interleaving on Paths and Cycles Multi-Cluster Interleaving on Paths and Cycles Anxiao (Andrew) Jiang, Member, IEEE, Jehoshua Bruck, Fellow, IEEE Abstract Interleaving codewords is an important method not only for combatting burst-errors,

More information

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19 CSE34T/CSE549T /05/04 Lecture 9 Treaps Binary Search Trees (BSTs) Search trees are tree-based data structures that can be used to store and search for items that satisfy a total order. There are many types

More information

Scan Scheduling Specification and Analysis

Scan Scheduling Specification and Analysis Scan Scheduling Specification and Analysis Bruno Dutertre System Design Laboratory SRI International Menlo Park, CA 94025 May 24, 2000 This work was partially funded by DARPA/AFRL under BAE System subcontract

More information

CS 395T Computational Learning Theory. Scribe: Wei Tang

CS 395T Computational Learning Theory. Scribe: Wei Tang CS 395T Computational Learning Theory Lecture 1: September 5th, 2007 Lecturer: Adam Klivans Scribe: Wei Tang 1.1 Introduction Many tasks from real application domain can be described as a process of learning.

More information

CS 6783 (Applied Algorithms) Lecture 5

CS 6783 (Applied Algorithms) Lecture 5 CS 6783 (Applied Algorithms) Lecture 5 Antonina Kolokolova January 19, 2012 1 Minimum Spanning Trees An undirected graph G is a pair (V, E); V is a set (of vertices or nodes); E is a set of (undirected)

More information

Reachability in K 3,3 -free and K 5 -free Graphs is in Unambiguous Logspace

Reachability in K 3,3 -free and K 5 -free Graphs is in Unambiguous Logspace CHICAGO JOURNAL OF THEORETICAL COMPUTER SCIENCE 2014, Article 2, pages 1 29 http://cjtcs.cs.uchicago.edu/ Reachability in K 3,3 -free and K 5 -free Graphs is in Unambiguous Logspace Thomas Thierauf Fabian

More information

Recursively Defined Functions

Recursively Defined Functions Section 5.3 Recursively Defined Functions Definition: A recursive or inductive definition of a function consists of two steps. BASIS STEP: Specify the value of the function at zero. RECURSIVE STEP: Give

More information

Greedy Algorithms CHAPTER 16

Greedy Algorithms CHAPTER 16 CHAPTER 16 Greedy Algorithms In dynamic programming, the optimal solution is described in a recursive manner, and then is computed ``bottom up''. Dynamic programming is a powerful technique, but it often

More information

On Universal Cycles of Labeled Graphs

On Universal Cycles of Labeled Graphs On Universal Cycles of Labeled Graphs Greg Brockman Harvard University Cambridge, MA 02138 United States brockman@hcs.harvard.edu Bill Kay University of South Carolina Columbia, SC 29208 United States

More information

Interleaving Schemes on Circulant Graphs with Two Offsets

Interleaving Schemes on Circulant Graphs with Two Offsets Interleaving Schemes on Circulant raphs with Two Offsets Aleksandrs Slivkins Department of Computer Science Cornell University Ithaca, NY 14853 slivkins@cs.cornell.edu Jehoshua Bruck Department of Electrical

More information

Matching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition.

Matching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition. 18.433 Combinatorial Optimization Matching Algorithms September 9,14,16 Lecturer: Santosh Vempala Given a graph G = (V, E), a matching M is a set of edges with the property that no two of the edges have

More information

Binary Trees. BSTs. For example: Jargon: Data Structures & Algorithms. root node. level: internal node. edge.

Binary Trees. BSTs. For example: Jargon: Data Structures & Algorithms. root node. level: internal node. edge. Binary Trees 1 A binary tree is either empty, or it consists of a node called the root together with two binary trees called the left subtree and the right subtree of the root, which are disjoint from

More information

Lecture 4: Primal Dual Matching Algorithm and Non-Bipartite Matching. 1 Primal/Dual Algorithm for weighted matchings in Bipartite Graphs

Lecture 4: Primal Dual Matching Algorithm and Non-Bipartite Matching. 1 Primal/Dual Algorithm for weighted matchings in Bipartite Graphs CMPUT 675: Topics in Algorithms and Combinatorial Optimization (Fall 009) Lecture 4: Primal Dual Matching Algorithm and Non-Bipartite Matching Lecturer: Mohammad R. Salavatipour Date: Sept 15 and 17, 009

More information

Algorithm and Complexity of Disjointed Connected Dominating Set Problem on Trees

Algorithm and Complexity of Disjointed Connected Dominating Set Problem on Trees Algorithm and Complexity of Disjointed Connected Dominating Set Problem on Trees Wei Wang joint with Zishen Yang, Xianliang Liu School of Mathematics and Statistics, Xi an Jiaotong University Dec 20, 2016

More information

6. Finding Efficient Compressions; Huffman and Hu-Tucker Algorithms

6. Finding Efficient Compressions; Huffman and Hu-Tucker Algorithms 6. Finding Efficient Compressions; Huffman and Hu-Tucker Algorithms We now address the question: How do we find a code that uses the frequency information about k length patterns efficiently, to shorten

More information

The 3-Steiner Root Problem

The 3-Steiner Root Problem The 3-Steiner Root Problem Maw-Shang Chang 1 and Ming-Tat Ko 2 1 Department of Computer Science and Information Engineering National Chung Cheng University, Chiayi 621, Taiwan, R.O.C. mschang@cs.ccu.edu.tw

More information

A Nearest Neighbors Algorithm for Strings. R. Lederman Technical Report YALEU/DCS/TR-1453 April 5, 2012

A Nearest Neighbors Algorithm for Strings. R. Lederman Technical Report YALEU/DCS/TR-1453 April 5, 2012 A randomized algorithm is presented for fast nearest neighbors search in libraries of strings. The algorithm is discussed in the context of one of the practical applications: aligning DNA reads to a reference

More information

Foundations of Computer Science Spring Mathematical Preliminaries

Foundations of Computer Science Spring Mathematical Preliminaries Foundations of Computer Science Spring 2017 Equivalence Relation, Recursive Definition, and Mathematical Induction Mathematical Preliminaries Mohammad Ashiqur Rahman Department of Computer Science College

More information

BIOINFORMATICS APPLICATIONS NOTE

BIOINFORMATICS APPLICATIONS NOTE BIOINFORMATICS APPLICATIONS NOTE Sequence analysis BRAT: Bisulfite-treated Reads Analysis Tool (Supplementary Methods) Elena Y. Harris 1,*, Nadia Ponts 2, Aleksandr Levchuk 3, Karine Le Roch 2 and Stefano

More information

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS TRIE BASED METHODS FOR STRING SIMILARTIY JOINS Venkat Charan Varma Buddharaju #10498995 Department of Computer and Information Science University of MIssissippi ENGR-654 INFORMATION SYSTEM PRINCIPLES RESEARCH

More information

CSE 5095 Topics in Big Data Analytics Spring 2014; Homework 1 Solutions

CSE 5095 Topics in Big Data Analytics Spring 2014; Homework 1 Solutions CSE 5095 Topics in Big Data Analytics Spring 2014; Homework 1 Solutions Note: Solutions to problems 4, 5, and 6 are due to Marius Nicolae. 1. Consider the following algorithm: for i := 1 to α n log e n

More information

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph.

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph. Trees 1 Introduction Trees are very special kind of (undirected) graphs. Formally speaking, a tree is a connected graph that is acyclic. 1 This definition has some drawbacks: given a graph it is not trivial

More information

Fast algorithms for max independent set

Fast algorithms for max independent set Fast algorithms for max independent set N. Bourgeois 1 B. Escoffier 1 V. Th. Paschos 1 J.M.M. van Rooij 2 1 LAMSADE, CNRS and Université Paris-Dauphine, France {bourgeois,escoffier,paschos}@lamsade.dauphine.fr

More information

Paths, Flowers and Vertex Cover

Paths, Flowers and Vertex Cover Paths, Flowers and Vertex Cover Venkatesh Raman, M.S. Ramanujan, and Saket Saurabh Presenting: Hen Sender 1 Introduction 2 Abstract. It is well known that in a bipartite (and more generally in a Konig)

More information

Decision Problems. Observation: Many polynomial algorithms. Questions: Can we solve all problems in polynomial time? Answer: No, absolutely not.

Decision Problems. Observation: Many polynomial algorithms. Questions: Can we solve all problems in polynomial time? Answer: No, absolutely not. Decision Problems Observation: Many polynomial algorithms. Questions: Can we solve all problems in polynomial time? Answer: No, absolutely not. Definition: The class of problems that can be solved by polynomial-time

More information

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42 String Matching Pedro Ribeiro DCC/FCUP 2016/2017 Pedro Ribeiro (DCC/FCUP) String Matching 2016/2017 1 / 42 On this lecture The String Matching Problem Naive Algorithm Deterministic Finite Automata Knuth-Morris-Pratt

More information

Computational Geometry

Computational Geometry Windowing queries Windowing Windowing queries Zoom in; re-center and zoom in; select by outlining Windowing Windowing queries Windowing Windowing queries Given a set of n axis-parallel line segments, preprocess

More information