Top-k Document Retrieval in External Memory

Size: px
Start display at page:

Download "Top-k Document Retrieval in External Memory"

Transcription

1 Top-k Document Retrieval in External Memory Rahul Shah 1, Cheng Sheng 2, Sharma V. Thankachan 1 and Jeffrey Scott Vitter 3 1 Louisiana State University, USA. {rahul,thanks}@csc.lsu.edu 2 The Chinese University of Hong Kong, China. csheng@cse.cuhk.edu.hk 3 The University of Kansas, USA. jsv@ku.edu Abstract. Let D be a given set of (string) documents of total length n. The top-k document retrieval problem is to index D such that when a pattern P of length p, and a parameter k come as a query, the index returns those k documents which are most relevant to P. Hon et al. [22] proposed a linear space framework to solve this problem in O(p+k log k) time. This query time was improved to O(p+k) by Navarro and Nekrich [33]. These results are powerful enough to support arbitrary relevance functions like frequency, proximity, PageRank, etc. Despite of continued progress on this problem in terms of theoretical, practical and compression aspects, any non-trivial bounds in external memory model have so far been elusive. In this paper, we propose the first external memory index supporting top-k document retrieval queries (outputs unsorted) in optimal O(p/B+log B n+k/b) I/Os, where B is the block size. The index space is almost linear O(n log n) words, where log n is the iterated logarithm of n. We also improve the existing internal memory results. Specifically, we propose a linear space index for retrieving top-k documents in O(k) time, once the locus of the pattern match is given. 1 Introduction and Related Work The inverted index is the most fundamental data structure in the field of information retrieval [43]. It is the backbone of every known search engine today. For each word in any document collection, the inverted index maintains a list of all documents in that collection which contain the word. Despite its power to answer various types of queries, the inverted index becomes inefficient, for example, when queries are phrases instead of words. This inefficiency results from inadequate use of word orderings in query phrases [37]. Similar problems also occur in applications when word boundaries do not exist or cannot be identified deterministically in the documents, like genome sequences in bioinformatics and text in many East-Asian languages. These applications call for data structures to answer queries in a more general form, that is, (string) pattern matching. Specifically, they demand the ability to identify efficiently all the documents that contain a specific pattern as a substring. The usual inverted-index approach might require the maintenance of document lists for all possible substrings of the documents. Work supported by National Science Foundation (NSF) Grants CCF (R. Shah and J. S. Vitter) and CCF (R. Shah).

2 This can take quadratic space and hence is neither theoretically interesting nor sensible from a practical viewpoint. The first frameworks for answering document retrieval queries were proposed by Matias et al. [30] and Muthukrishnan [31]. Their data structures solve the document listing problem, where the task is to index a collection D of D documents, such that whenever a pattern P of length p comes as a query, report all those documents containing P exactly once. Muthukrishnan also initiated the study of relevance metric-based document retrieval [31], which was then formalized by Hon et al. [22] as follows: Problem 1 (Top-k document retrieval problem) Let w(p, d) be the score function capturing the relevance of a pattern P with respect to a document d. Given a document collection D= {d 1, d 2,.., d D } of D documents, build an index answering the following query: given P and k, find k documents with the highest w(p,.) values in its sorted (or unsorted) order. Here, instead of reporting all the documents that match a query pattern, the problem is to output the k documents most relevant to the query in sorted order of relevance score. Relevance metrics considered in the problem can be either pattern-independent (e.g., PageRank) or -dependent. In the latter case one can take into account information like the frequency of the pattern occurrences (or term-frequency of popular tf-idf measure, which takes the number of occurrences of P in a document d as w(p, d)) and even the locations of the occurrences (e.g.,min-dist [22] which takes proximity of two closest occurrences of pattern as the score). In general, we assume that other than a static weight which is fixed for each document d, w(p, d) is dependent only on the set of occurrences of P in d. The framework of Hon et al. [22] takes linear space and answers the query in O(p + k log k) time. This was then improved by Navarro and Nekrich [33] to achieve O(p + k) query cost. Both [22] and [33] reduced this problem to a 4- sided orthogonal range query in 3d, which is defined as follows: the data consists of a set S of 3-dimensional points and the query consists of four parameters x, x, y and z, and output is the set of all those points (x i, y i, z i ) S such that x i [x, x ], y i y and z i z. While general 4-sided orthogonal range searching is proved hard [9], the desired bounds can nevertheless be achieved by identifying a special property that one dimension of the reduced subproblem can only have p distinct values. Even though there has been series of work on top-k string, including in theory as well as practical IR [5, 10, 11, 13, 15, 17 25, 33 35, 37, 38, 42] communities, most implementations (as well as theoretical results) have focused on RAM based compressed and/or efficient indexes (See [32] for an excellent survey). In this paper, we introduce an alternative framework for solving this problem and obtain the first non-trivial external memory [3] solution as follows: Theorem 1 In the external memory model, there exists an O(nh)-word structure that solves the top-k (unsorted) document retrieval problem in O(p/B + log B n+log (h) n+k/b) I/Os for any h log n, where log (h) n = log log (h 1) n, log (1) n = log n and B is the block size. 2

3 For h = log n, log (h) n is a constant, and hence we have the following result. Corollary 1 There exists an O(n log n)-word structure for answering top-k (unsorted) document retrieval problem in optimal O(p/B + log B n + k/b) I/Os. Our framework can also be used for improving the existing internal memory results. Specifically, we shall derive a linear space index for retrieving top-k documents in O(k) time, once the locus of the pattern match is given. This result improves the previous work [22, 33] by eliminating the additive term p. In applications like cross-document pattern matching [26], the locus can be computed in much faster O(log log p) time than O(p). In many pattern matching applications, for example in suffix-prefix overlap [41], maximal substring matches [27], and autocompletion search (like in Google Instant T M ) multiple loci are searched with amortized constant time for each locus. In such situations, having extra O(p) as in [22, 33] leads to inefficient solutions, where as our index can support faster queries. The result is summarized as follows: Theorem 2 There exists an O(n) word space data structure in word RAM model for solving (sorted) top-k document retrieval problem in O(k) time, once the locus of the pattern match is given. A related but somewhat orthogonal line of research has been to get top-k queries on general array based ranges. In this, we are given array A of colors with each color is assigned a score, and for a range query (i, j), we have to output k highest scored colors in this range (with each color reported at most once). If the scoring criteria is based on frequency, for example say score of a color is its number of occurrences in A[i..j], then lower-bounds on range-mode problem [8, 16] would imply no efficient (linear space and polylog time) data structures can exist. There are variants considered where each entry in the array has a fixed score or each color (document) has a fixed score, independent of number of occurrences. Recent [29] surprising result of achieving optimal I/Os with O(n log n) space has been for 3-sided categorical range reporting where each entry has another attribute called score, and the query specifies range as well as score threshold. We are supposed to output all colors whose at least one entry within the range satisfies the score criteria. There are easier variants where each entry of the same color gets the same score attribute like PageRank which have been shown to have efficient external memory results [36]. There are even simpler variants, where only top-k scores are to be reported [2, 28, 39] without considering colors or unique colors are to be reported without considering scores (as in document listing). Both these variants lead to 3-sided queries which are easier to solve in external memory. In internal memory, there exists optimal space and time data structures for outputting these scores in the sorted order [7]. 2 Preliminary: Top-k Framework This section briefly explains the linear space framework for top-k document retrieval based on the work of Hon et al. [22], and Navarro and Nekrich [33]. The generalized suffix tree (GST) of a document collection D= {d 1, d 2, d 3,..., d D } is 3

4 the combined compact trie (a.k.a. Patricia trie) of all the non-empty suffixes of all the documents. Use n to denote the total length of all the documents, which is also the number of the leaves in GST. For each node u in GST, consider the path from the root node to u. Let depth(u) be the number of nodes on the path, and prefix(u) be the string obtained by concatenating all the edge labels of the path. For a pattern P that appears in at least one document, the locus of P, denoted as u P, is the node closest to the root satisfying that P is a prefix of prefix(u P ). By numbering all the nodes in GST in the pre-order traversal manner, the part of GST relevant to P (i.e., the subtree rooted at u P ) can be represented as a range. Nodes are marked with documents. A leaf node l is marked with a document d D if the suffix represented by l belongs to d. An internal node u is marked with d if it is the lowest common ancestor of two leaves marked with d. Notice that a node can be marked with multiple documents. For each node u and each of its marked documents d, define a link to be a quadruple (origin, target, doc, score), where origin = u, target is the lowest proper ancestor 4 of u marked with d, doc = d and score = w ( prefix(u), d ). Two crucial properties of the links identified in [22] are listed below. Lemma 1 For each document d that contains a pattern P, there is a unique link whose origin is in the subtree of u P and whose target is a proper ancestor of u P. The score of the link is exactly the score of d with respect to P. Lemma 2 The total number of links is O(n). Based on Lemma 1, the top-k document retrieval problem can be reduced to the problem of finding the top-k links (according to its score) stabbed by u P, where link stabbing is defined as follows: Definition 1 (Link Stabbing) We say that a link is stabbed by node u if it is originated in the subtree of u and targets at a proper ancestor of u. If we order the nodes in GST as per the pre-order traversal order, these constraints translate into finding all the links (i) the numbers of whose origins fall in the number range of the subtree of u P, and (ii) the numbers of whose targets are less than the number of u P. Regarding constraint (i) as a two-sided range constraint on x-dimension, and regarding constraint (ii) as a one-sided range constraint on y-dimension, the problem asks for the top-k weighted points that fall in a three-sided window in 2d space, where weight of a point is the score of the corresponding link [33]. 3 External Memory Structures This section is dedicated for proving Theorem 1. The initial phase of pattern search can be performed in O(p/B + log B n) I/O s using a string B-tree [12]. Once the suffix range of P is identified, we take the lowest common ancestor of 4 Define a dummy node as the parent of the root node, marked with all the documents. 4

5 the left-most and right-most leaves in the suffix range of GST to identify the locus node u P. Hence, the first phase (i.e., finding the locus node u P of P ) takes optimal I/O s and now we focus only on the second phase (i.e., reporting the top-k links stabbed by u P ). Instead of solving the top-k version, we first solve a threshold version in Sec 3.1 where the objective is to retrieve those links stabbed by u P with score at least a given threshold τ. Then in Sec 3.2, we propose a separate structure that converts the original top-k-form query into a thresholdform query so that the structure in Sec 3.1 can now be used to answer the original problem. Finally, we obtain Theorem 1 via bootstrapping on a special structure for handling top-k queries in lesser number of I/Os for small values of k. We shall assume all scores are distinct and are within [1, O(n)]. Otherwise, the ties can be broken arbitrarily and reduce the values into rank-space. 3.1 Breaking Down Into Sub-Problems Instead of solving the top-k version, we first solve a threshold version, where the objective is to retrieve those links stabbed by u P with score at least a given threshold τ. We show that the problem can be decomposed into simpler subproblems, which consists of a 3d dominance reporting and O(log(n/B)) 3- sided range reporting in 2d, both can be solved efficiently using known structures. The main result is captured in Lemma 3 defined below. From now onwards, the origin, target and score of a link L i are represented by o i, t i and w i respectively. Lemma 3 There exists an O(n) space data structure for answering the following query: given a query node u P and a threshold τ, all links stabbed by u P with score τ can be reported in O(log 2 (n/b) + z/b) I/Os, where z is the number of outputs. Rank and Components For any node u in GST, we use u to denote its pre-order rank as well. Let size(u) denotes the number of leaves in the subtree of u, then we define its rank as: rank(u) = log size(u) B Rank Components 0 0 B = 2 Fig. 1. Rank Components Note that rank(.) [0, log n B ]. A contiguous subtree consisting of nodes with the same rank is defined as a component, and the rank of a component is same as the rank of nodes within it (see figure 1). Therefore, a component with rank = 0 is a bottom level subtree of size (number of leaves) at most B. From the definition, it can be seen that a node and at most one of its children can have the same rank. Therefore, a component with rank 1 consists of nodes in a path which goes top-down in the tree

6 The number of links originating within the subtree of any node u is at most 2size(u) 1. Therefore, the number of links originating within a component with rank = 0 is O(B). These O(B) links corresponding to each component with rank = 0 can be maintained separately as a list, taking total O(n) words space. Now, given a locus node u P, if rank(u P ) = 0, the number of links originating within the subtree of u P is also O(B) and all of them can be processed in O(1) I/O s by simply scanning the list of links corresponding to the component to which u P belongs to. The query processing is more sophisticated when rank(u P ) 1. For handling this case, we classify the links into the following 2 types based on the rank of its target with respect to the rank of query node u P : 1. equi-ranked links: links with rank(target) = rank(u P ) 2. high-ranked links: links with rank(target) > rank(u P ) Next we show that the problem of retrieving outputs among equi-ranked links can be reduced to a 3d dominance query, and the problem of retrieving outputs among highranked links can be reduced to at most log n B 3-sided range queries in 2d. Processing equi-ranked links Let C be a component and S C be set of all links L i, such that its target t i is a node in C. Also, for any link L i S C, let pseudo origin s i be the (preorder rank of) lowest ancestor of its origin o i within C (see Figure 2). Then a link L i S C originates in the subtree of any node u within C if and only if s i u. Now if the locus u P is a node in C, then among all equi-ranked links, we need to consider only those links L i S C, because the origin o j of any other equiranked link L j / S C, will not be in the subtree Fig. 2. Pseudo Origin of u P. Based on the above observations, all equi-ranked output links are those L i S C with t i < u P s i and w i τ. To solve this in external memory, we treat each link L i S C as a 3d point (t i, s i, w i ) and maintain a 3d dominance query structure over it. Now the outputs with respect to u P and τ are those links corresponding to the points within (, u P ) [u P, ) [τ, ). Such a structure for S C can be maintained in linear O( S C ) words of space and can answer the query in O(log B S C + z eq /B) I/O s using the result by Afshani [1], where S C is the number of points (corresponding to links in S C ) and z eq be the output size. Thus overall these structures occupies O(n)-word space. Lemma 4 Given a query node u P and a threshold τ, all the equi-ranked links stabbed by u P with score τ can be retrieved in O(log B n + z eq /B) I/Os using an O(n) word space data structure, where z eq is the output size. 6

7 Processing high-ranked links The following is an important observation. Observation 1 Any link L i with its origin o i within the subtree of a node u is stabbed by u if rank(t i ) > rank(u), where t i is the target of L i. This implies, while looking for the outputs among the high-ranked links, the condition of t i being a proper ancestor of u P can be ignored as it is taken care of automatically if o i [ ] u P, u P, where u P be the (pre-order rank of) rightmost leaf in the subtree rooted at u P. Let G r be the set of all links with rank equals r for 1 r log n B. Since there are only O(log(n/B)) sets, we shall maintain separate structures for links in each G r by considering only origin and score values. We treat each link L i G r as a 2d point (o i, w i ), and maintain a 3-sided range query structure over them for r = 1, 2,.., log n B. All highranked output links can be obtained by retrieving those links in L i G r with the corresponding point (o i, w i ) [u P, u P ] [τ, ] for r = rank(u P )+1,.., log n B. By using the linear space data structure in [4], the space and I/O bounds for a particular r is given by O( G r ) words and O(log B G r + z r /B), where z r is the number of output links in G r. Since a link can be a part of at most one G r, the total space consumption is O(n) words and the total query I/Os is O(log B n log(n/b) + z hi /B) = O(log 2 (n/b) + z hi /B), where z hi represents the number of high-ranked output links. Lemma 5 Given a query node u P and a threshold τ, all the high-ranked links stabbed by u P with score τ can be retrieved in O(log 2 (n/b) + z hi /B) I/Os using an O(n) word space data structure, where z hi is the output size. By combining Lemma 4 and Lemma 5, we obtain Lemma Converting Top-k to Threshold via Logarithmic Sketch Here we derive a linear space data structure, such that given a query node u and a parameter k, a threshold τ can be computed in constant I/Os, such that the number of links z stabbed by u with score τ is bounded by, k z 2k + O(log n). Hence query I/Os in Lemma 3 can be modified as O(log 2 (n/b) + z/b) = O(log 2 (n/b) + k/b). From the retrieved z outputs, the actual top-k answers can be computed by selection [6, 40] and filtering in another O(z/B) = O(k/B + log B n) I/O s. We summarize our result in the following lemma. Lemma 6 There exist an O(n) word data structure for answering the following query in O(log 2 (n/b) + k/b) I/O s: given a query point u and an integer k, report the top-k links stabbed by u. The details of top-k to threshold conversion are given below. Marked nodes and Prime Nodes in GST We identify certain nodes in the GST as marked nodes and prime nodes with respect to a parameter g called the grouping factor. The procedure starts by combining every g consecutive leaves (from left to right) together as a group, and marking the lowest common ancestor (LCA) of first and last leaf in each group. Further, we mark the LCA of all pairs of marked nodes recursively. Additionally, we ensure that the root is always 7

8 marked. At the end of this procedure, the number of marked nodes in GST will be O(n/g) [22]. Prime nodes are those which are the children of marked nodes 5. Corresponding to any marked node u (except root), there is a unique prime node u, which is its closest prime ancestor. In case u s parent is marked then u = u. For every prime node u with atleast one marked node in its subtree, the corresponding closest marked descendant u is unique. If u is marked then the closest marked descendant u is same as u. Hon et al. [22] showed that, given any node u with u being its highest marked descendent (if it exists), the number of leaves in the subtree of u, but not in the subtree of u (which we call as fringe leaves) is at most 2g. This means for a given threshold τ, if z is the number of outputs corresponding to u as the locus node, then the number of outputs corresponding to u as the locus is within z ± 2g. This is because of the fact that the number of documents d with w ( prefix(u), d ) w ( prefix(u ), d ) cannot be more than the number of fringe leaves. Therefore, we maintain the following information at every marked node u : the score of q th highest scored link stabbed by u for q = 1, 2, 4, 8,... By choosing g = log n, the total space can be bounded by O((n/g) log n) = O(n) words, and can retrieve any particular entry in O(1) time. Using the above values, the threshold τ corresponding to any given u and k can be computed as follows: first find the highest marked node u in the subtree of u (u = u if u is marked). Now identify i such that 2 i 1 < k + 2g 2 i and choose τ as the score of 2 i -th highest scored link stabbed by u. This ensures that k z < 2k + O(g) = 2k + O(log n). 3.3 Special Structures for Bounded k In this section, we derive a faster data structures for the case when k is upper bounded by a parameter g. The main idea is to identify smaller sets of O(g) links, such that top-g links stabbed by any node u are contained in one of such sets. Thus by constructing the structure described in Lemma 6 over the links in each such sets, the top-k queries for any k g can be answered faster as follows: Lemma 7 There exists a O(n) word data structure for answering top-k queries for k g in O(log 2 (g/b) + k/b) I/O s. Recall the definitions of marked nodes and prime nodes from Sec 3.2. Let u be a prime node and u (if it exists) be the unique highest marked descendent of u by choosing a grouping factor g (which will be fixed later). All the links originated from the subtree of u are categorized into the following (Figure 3). near-links: The links which are stabbed by u, but not by u. far-link: The links which are stabbed by both u and u. small-link: The links which are stabbed neither by u, nor by u. fringe-links: The links originated not from the subtree of u. Lemma 8 The number of fringe-links and the number of near-links of any prime node u is O(g). 5 Note that the number of prime nodes can be Θ(n) in the worst case. 8

9 Proof. The number of leaves in subtree(u )\subtree(u ) is at most 2g [22]. Therefore, the number of fringe-links can be bounded by O(g). For every document d whose link originates from subtree(u ) going out of it ends up as a near-link if and only if d exists at one of the leaves of subtree(u )\subtree(u ). Thus, this can also be bounded by O(g). In the case where u does not exist for u, only fringe-links exist. More over the subtree size of u is O(g) there can be no more than O(g) of these links. Consider the following set, consisting of O(g) links with respect to u : all fringe-links, near-links and g highest scored far-links. We maintain these links at u (as a data structure to be explained later). For any node u, whose closest prime ancestor (including itself) is u, the above mentioned set is called candidate links of u. From each u, we maintain the pointer to its closest prime ancestor where the set of candidate links is stored. Lemma 9 The candidate links of any node u contains top-g highest scored links stabbed by u. Fig. 3. Categorization of Links Proof. Let u be the closest prime ancestor of u. If no marked descendant of u exist, then all the links are stored as candidate links. Otherwise, small-links can never be candidates as they never cross u. Now, if u lies on the path from u to u then all far-links will satisfy both origin and target conditions. Else, far-links do not qualify. Hence, any link which is not among top-g (highest scored) of these far-links, can never be the candidate. Taking a clue from Lemma 8 and 9, for every prime node u, we shall maintain a data structure as in Lemma 6 by considering only the links stored at u, and top-k queries can be answered faster when k g. For this we shall define a candidate tree CT (u ) of node u (except the root) to be a modified version of subtree of u in GST augmented with candidate links stored at u. Firstly, for every candidate link which is targeted above u, we change the target to v, which will be a dummy parent of u in CT (u ). Now CT (u ) consists of those nodes which are either origin or target (after modification) of some candidate link of u. Moreover, all the nodes in subtree(u )\subtree(u ) are included as well. Since only the subset of nodes is selected from subtree(u ), our tree is basically a Steiner tree connecting these nodes. Moreover, the tree is edge-compacted so that no degree-1 node remains. Thus, the size of the tree as well as the number of associated links is O(g). Next we do a rank-space reduction of pre-order rank (w.r.t to GST) of the nodes in CT (u ) as well as the scores of candidate links. 9

10 The candidate tree (no degree-1 nodes) as well as the associated candidate links satisfies all the properties which we have exploited while deriving the structure in Lemma 6. Hence such a structure for CT (u ) can be maintained in O(min(g, size(u )) words space and the top-k links in CT (u ) stabbed by any node u, with u being its lowest prime ancestor can be retrieved in O(log 2 (g/b) + k/b) I/O s. The total space consumption of structures corresponding all prime nodes can be bounded by O(n) words as follows: the number of prime nodes with at least a marked node in its subtree is O(n/g), as each such prime node can be associated with a unique marked node. Thus the associated structures takes O(n/g g) = O(n) words space. The candidate set of a prime node u with no marked nodes in its subtree consists of O(size(u )) links, moreover a link cannot be in the candidate set of two such prime nodes. Thus the total space is O(n) words in this case as well. Note that for g = O(B), we need not store any structure on CT (u ), because such a candidate tree fits entirely in constant number of blocks which can be processed in O(1) I/Os. This completes the proof of Lemma I/O-Optimal Data Structure via Bootstrapping The bounds in Theorem 1 can be achieved by maintaining multiple structures as in Lemma 7. Clearly the structure in Lemma 6 is optimal for k B log 2 (n/b). However, for handling the case when k < B log 2 (n/b), we shall choose the grouping factor g i = B(log (i) (n/b)) 2, for i = 1, 2, 3,.., h log n and maintain h separate structures as in Lemma 7, occupying O(nh) space. Thus top-k query for any k g h can be answered by querying on the structure corresponding to the grouping factor g j, where g j k > g j+1 in O(log 2 (g j /B) + k/b) = O(g j+1 /B + k/b) = O(k/B) I/Os. For k < g h, we shall query on the structure corresponding to the grouping factor g h, and the I/Os are bounded by O(log 2 (g h /B) + k/b) = O(log (h) n + k/b). This completes the proof of Theorem 1. 4 Adapting to Internal Memory Our external memory framework can be adapted to internal memory by choosing B = Θ(1), and by replacing the external memory substructures by the corresponding internal memory counterparts. Retrieving the outputs among highranked links is reduced to O(log n) 3-sided range reporting queries. By using an interval tree like approach, the problem of retrieving outputs among equiranked links also can be reduced to O(log n) 3-sided range reporting queries. By using the linear-space sorted range reporting structure by Brodal et al. [7] for 3-sided range reporting, the outputs can be obtained in the sorted order of score. Further, these sorted outputs from O(log n) different places can be merged using an atomic heap [14], which is capable of performing all heap operations in O(1) time, provided the number of elements in the heap is O(log O(1) n) as in our case. At the beginning of each of these O(log n) queries, we may need to perform a binary search for finding the boundaries, thus resulting in a total query time of O(log 2 n + k), which is O(k) for k log 2 n. The space can be bounded by O(n) words. For the case when k < log 2 n, we obtain a linear space and O(log 2 log n + k) query time structure by using the ideas from Sec 3.3 (here 10

11 we choose grouping factor g = log 2 n). Again, this structure can answer queries in O(k) time for k log 2 log n. We do not continue this bootstrapping further. Instead, we make use of the following observation: the candidate set of a node consists of only O(g) links, hence a pointer to any particular link within the candidate set of any node can be maintained in O(log g) = O(log log n) bits. Thus, at every node u in GST, we shall maintain the top-(log 2 log n) links stabbed by u in the decreasing order of score as a pointer to its location within the candidate set of u. This occupies O(n log 3 log n) bits or o(n) words and top-k queries for any k log 2 log n can be answered in O(k) time by chasing the first k pointers and retrieving the documents associated with the corresponding links. This completes the proof of Theorem 2. References 1. P. Afshani. On dominance reporting in 3d. In ESA, pages 41 51, P. Afshani, G. S. Brodal, and N. Zeh. Ordered and unordered top-k range reporting in large data sets. In SODA, pages , A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Commun. ACM, 31(9): , L. Arge, V. Samoladas, and J. S. Vitter. On two-dimensional indexability and optimal range search indexing. In PODS, pages , D. Belazzougui, G. Navarro, and D. Valenzuela. Improved compressed indexes for full-text document retrieval. volume 18, pages 3 13, M. Blum, R. W. Floyd, V. R. Pratt, R. L. Rivest, and R. E. Tarjan. Time bounds for selection. J. Comput. Syst. Sci., 7(4): , G. S. Brodal, M. G. R. Fagerberg, and A. López-Ortiz. Online sorted range reporting. In ISAAC, pages , T. M. Chan, S. Durocher, K. G. Larsen, J. Morrison, and B. T. Wilkinson. Linearspace data structures for range mode query in arrays. In STACS, pages , B. Chazelle. Lower bounds for orthogonal range searching: I. the reporting case. J. ACM, 37(2): , J. S. Culpepper, G. Navarro, S. J. Puglisi, and A. Turpin. Top-k ranked document search in general text databases. In ESA (2), pages , J. S. Culpepper, M. Petri, and F. Scholer. Efficient in-memory top-k document retrieval. In SIGIR, P. Ferragina and R. Grossi. The string b-tree: A new data structure for string search in external memory and its applications. J. ACM, 46(2): , J. Fischer, T. Gagie, T. Kopelowitz, M. Lewenstein, V. Mäkinen, L. Salmela, and N. Välimäki. Forbidden patterns. In LATIN, pages , M. L. Fredman and D. E. Willard. Trans-dichotomous algorithms for minimum spanning trees and shortest paths. J. Comput. Syst. Sci., 48(3): , T. Gagie, G. Navarro, and S. J. Puglisi. New algorithms on wavelet trees and applications to information retrieval. Theor. Comput. Sci., 426:25 41, M. Greve, A. G. Jørgensen, K. D. Larsen, and J. Truelsen. Cell probe lower bounds and approximations for range mode. In ICALP (1), pages , W.-K. Hon, M. Patil, R. Shah, S. V. Thankachan, and J. S. Vitter. Indexes for document retrieval with relevance. In Munro Festschrift, pages , W.-K. Hon, R. Shah, and S. V. Thankachan. Towards an optimal space-and-querytime index for top-k document retrieval. In CPM, pages ,

12 19. W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter. String retrieval for multi-pattern queries. In SPIRE, pages 55 66, W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter. Document listing for queries with excluded pattern. In CPM, pages , W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter. Faster compressed top-k document retrieval. In DCC, W.-K. Hon, R. Shah, and J. S. Vitter. Space-efficient framework for top-k string retrieval problems. FOCS 09, pages , W.-K. Hon, R. Shah, and J. S. Vitter. Compression, indexing, and retrieval for massive string data. In CPM, pages , M. Karpinski and Y. Nekrich. Top-k color queries for document retrieval. In SODA, pages , R. Konow and G. Navarro. Faster compact top-k document retrieval. In DCC, G. Kucherov, Y. Nekrich, and T. A. Starikovskaya. Cross-document pattern matching. In CPM, pages , M. O. Külekci, J. S. Vitter, and B. Xu. Efficient maximal repeat finding using the burrows-wheeler transform and wavelet tree. IEEE/ACM Trans. Comput. Biology Bioinform., 9(2): , K. G. Larsen and R. Pagh. I/o-efficient data structures for colored range and prefix reporting. In SODA, pages , K. G. Larsen and F. van Walderveen. Near-optimal range reporting structures for categorical data. In SODA, pages , Y. Matias, S. Muthukrishnan, S. C. Sahinalp, and J. Ziv. Augmenting suffix trees, with applications. ESA 98, pages 67 78, London, UK, UK, Springer-Verlag. 31. S. Muthukrishnan. Efficient algorithms for document retrieval problems. In SODA, pages , G. Navarro. Spaces, trees and colors: The algorithmic landscape of document retrieval on sequences. CoRR, abs/ , G. Navarro and Y. Nekrich. Top-k document retrieval in optimal time and linear space. In SODA, pages , G. Navarro and S. J. Puglisi. Dual-sorted inverted lists. In SPIRE, pages , G. Navarro and D. Valenzuela. Space-efficient top-k document retrieval. In SEA, pages , Y. Nekrich. Space-efficient range reporting for categorical data. In PODS, pages , M. Patil, S. V. Thankachan, R. Shah, W.-K. Hon, J. S. Vitter, and S. Chandrasekaran. Inverted indexes for phrases and strings. SIGIR, pages , K. Sadakane. Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms, 5(1):12 22, C. Sheng and Y. Tao. Dynamic top-k range reporting in external memory. In PODS, pages , Y. Tao. Lecture 1: External memory model and sorting. 41. N. Välimäki, S. Ladra, and V. Mäkinen. Approximate all-pairs suffix/prefix overlaps. In CPM, pages 76 87, N. Välimäki and V. Mäkinen. Space-efficient algorithms for document retrieval. In CPM, pages , J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), July

Top-k Document Retrieval in External Memory

Top-k Document Retrieval in External Memory Top-k Document Retrieval in External Memory Rahul Shah 1, Cheng Sheng 2, Sharma V. Thankachan 1, and Jeffrey Scott Vitter 3 1 Louisiana State University, USA {rahul,thanks}@csc.lsu.edu 2 The Chinese University

More information

Linear-Space Data Structures for Range Minority Query in Arrays

Linear-Space Data Structures for Range Minority Query in Arrays Linear-Space Data Structures for Range Minority Query in Arrays Timothy M. Chan 1, Stephane Durocher 2, Matthew Skala 2, and Bryan T. Wilkinson 1 1 Cheriton School of Computer Science, University of Waterloo,

More information

Space-Efficient Framework for Top-k String Retrieval Problems

Space-Efficient Framework for Top-k String Retrieval Problems Space-Efficient Framework for Top-k String Retrieval Problems Wing-Kai Hon Department of Computer Science National Tsing-Hua University, Taiwan Email: wkhon@cs.nthu.edu.tw Rahul Shah Department of Computer

More information

Linear-Space Data Structures for Range Frequency Queries on Arrays and Trees

Linear-Space Data Structures for Range Frequency Queries on Arrays and Trees Noname manuscript No. (will be inserted by the editor) Linear-Space Data Structures for Range Frequency Queries on Arrays and Trees Stephane Durocher Rahul Shah Matthew Skala Sharma V. Thankachan the date

More information

Lecture Notes: External Interval Tree. 1 External Interval Tree The Static Version

Lecture Notes: External Interval Tree. 1 External Interval Tree The Static Version Lecture Notes: External Interval Tree Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong taoyf@cse.cuhk.edu.hk This lecture discusses the stabbing problem. Let I be

More information

Path Queries in Weighted Trees

Path Queries in Weighted Trees Path Queries in Weighted Trees Meng He 1, J. Ian Munro 2, and Gelin Zhou 2 1 Faculty of Computer Science, Dalhousie University, Canada. mhe@cs.dal.ca 2 David R. Cheriton School of Computer Science, University

More information

Lecture Notes: Range Searching with Linear Space

Lecture Notes: Range Searching with Linear Space Lecture Notes: Range Searching with Linear Space Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong taoyf@cse.cuhk.edu.hk In this lecture, we will continue our discussion

More information

Space-efficient Algorithms for Document Retrieval

Space-efficient Algorithms for Document Retrieval Space-efficient Algorithms for Document Retrieval Niko Välimäki and Veli Mäkinen Department of Computer Science, University of Helsinki, Finland. {nvalimak,vmakinen}@cs.helsinki.fi Abstract. We study the

More information

A Fast Algorithm for Optimal Alignment between Similar Ordered Trees

A Fast Algorithm for Optimal Alignment between Similar Ordered Trees Fundamenta Informaticae 56 (2003) 105 120 105 IOS Press A Fast Algorithm for Optimal Alignment between Similar Ordered Trees Jesper Jansson Department of Computer Science Lund University, Box 118 SE-221

More information

Dynamic Range Majority Data Structures

Dynamic Range Majority Data Structures Dynamic Range Majority Data Structures Amr Elmasry 1, Meng He 2, J. Ian Munro 3, and Patrick K. Nicholson 3 1 Department of Computer Science, University of Copenhagen, Denmark 2 Faculty of Computer Science,

More information

Applications of Succinct Dynamic Compact Tries to Some String Problems

Applications of Succinct Dynamic Compact Tries to Some String Problems Applications of Succinct Dynamic Compact Tries to Some String Problems Takuya Takagi 1, Takashi Uemura 2, Shunsuke Inenaga 3, Kunihiko Sadakane 4, and Hiroki Arimura 1 1 IST & School of Engineering, Hokkaido

More information

Adaptive and Approximate Orthogonal Range Counting

Adaptive and Approximate Orthogonal Range Counting Adaptive and Approximate Orthogonal Range Counting Timothy M. Chan Bryan T. Wilkinson Abstract We present three new results on one of the most basic problems in geometric data structures, 2-D orthogonal

More information

Reporting Consecutive Substring Occurrences Under Bounded Gap Constraints

Reporting Consecutive Substring Occurrences Under Bounded Gap Constraints Reporting Consecutive Substring Occurrences Under Bounded Gap Constraints Gonzalo Navarro University of Chile, Chile gnavarro@dcc.uchile.cl Sharma V. Thankachan Georgia Institute of Technology, USA sthankachan@gatech.edu

More information

Compressed Data Structures with Relevance

Compressed Data Structures with Relevance Compressed Data Structures with Relevance Jeff Vitter Provost and Executive Vice Chancellor The University of Kansas, USA (and collaborators Rahul Shah, Wing-Kai Hon, Roberto Grossi, Oğuzhan Külekci, Bojian

More information

Lempel-Ziv Compressed Full-Text Self-Indexes

Lempel-Ziv Compressed Full-Text Self-Indexes Lempel-Ziv Compressed Full-Text Self-Indexes Diego G. Arroyuelo Billiardi Ph.D. Student, Departamento de Ciencias de la Computación Universidad de Chile darroyue@dcc.uchile.cl Advisor: Gonzalo Navarro

More information

Lecture 6: External Interval Tree (Part II) 3 Making the external interval tree dynamic. 3.1 Dynamizing an underflow structure

Lecture 6: External Interval Tree (Part II) 3 Making the external interval tree dynamic. 3.1 Dynamizing an underflow structure Lecture 6: External Interval Tree (Part II) Yufei Tao Division of Web Science and Technology Korea Advanced Institute of Science and Technology taoyf@cse.cuhk.edu.hk 3 Making the external interval tree

More information

Colored Range Queries and Document Retrieval

Colored Range Queries and Document Retrieval Colored Range Queries and Document Retrieval Travis Gagie 1, Gonzalo Navarro 1, and Simon J. Puglisi 2 1 Dept. of Computer Science, Univt. of Chile, {tgagie,gnavarro}@dcc.uchile.cl 2 School of Computer

More information

Compressed Indexes for Dynamic Text Collections

Compressed Indexes for Dynamic Text Collections Compressed Indexes for Dynamic Text Collections HO-LEUNG CHAN University of Hong Kong and WING-KAI HON National Tsing Hua University and TAK-WAH LAM University of Hong Kong and KUNIHIKO SADAKANE Kyushu

More information

Binary Heaps in Dynamic Arrays

Binary Heaps in Dynamic Arrays Yufei Tao ITEE University of Queensland We have already learned that the binary heap serves as an efficient implementation of a priority queue. Our previous discussion was based on pointers (for getting

More information

Computational Geometry

Computational Geometry Windowing queries Windowing Windowing queries Zoom in; re-center and zoom in; select by outlining Windowing Windowing queries Windowing Windowing queries Given a set of n axis-parallel line segments, preprocess

More information

ICS 691: Advanced Data Structures Spring Lecture 8

ICS 691: Advanced Data Structures Spring Lecture 8 ICS 691: Advanced Data Structures Spring 2016 Prof. odari Sitchinava Lecture 8 Scribe: Ben Karsin 1 Overview In the last lecture we continued looking at arborally satisfied sets and their equivalence to

More information

Suffix Trees and Arrays

Suffix Trees and Arrays Suffix Trees and Arrays Yufei Tao KAIST May 1, 2013 We will discuss the following substring matching problem: Problem (Substring Matching) Let σ be a single string of n characters. Given a query string

More information

Indexing Variable Length Substrings for Exact and Approximate Matching

Indexing Variable Length Substrings for Exact and Approximate Matching Indexing Variable Length Substrings for Exact and Approximate Matching Gonzalo Navarro 1, and Leena Salmela 2 1 Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl 2 Department of

More information

An Improved Query Time for Succinct Dynamic Dictionary Matching

An Improved Query Time for Succinct Dynamic Dictionary Matching An Improved Query Time for Succinct Dynamic Dictionary Matching Guy Feigenblat 12 Ely Porat 1 Ariel Shiftan 1 1 Bar-Ilan University 2 IBM Research Haifa labs Agenda An Improved Query Time for Succinct

More information

3 Competitive Dynamic BSTs (January 31 and February 2)

3 Competitive Dynamic BSTs (January 31 and February 2) 3 Competitive Dynamic BSTs (January 31 and February ) In their original paper on splay trees [3], Danny Sleator and Bob Tarjan conjectured that the cost of sequence of searches in a splay tree is within

More information

CMSC 754 Computational Geometry 1

CMSC 754 Computational Geometry 1 CMSC 754 Computational Geometry 1 David M. Mount Department of Computer Science University of Maryland Fall 2005 1 Copyright, David M. Mount, 2005, Dept. of Computer Science, University of Maryland, College

More information

Convex Hull of the Union of Convex Objects in the Plane: an Adaptive Analysis

Convex Hull of the Union of Convex Objects in the Plane: an Adaptive Analysis CCCG 2008, Montréal, Québec, August 13 15, 2008 Convex Hull of the Union of Convex Objects in the Plane: an Adaptive Analysis Jérémy Barbay Eric Y. Chen Abstract We prove a tight asymptotic bound of Θ(δ

More information

A New Approach to the Dynamic Maintenance of Maximal Points in a Plane*

A New Approach to the Dynamic Maintenance of Maximal Points in a Plane* Discrete Comput Geom 5:365-374 (1990) 1990 Springer-Verlag New York~Inc.~ A New Approach to the Dynamic Maintenance of Maximal Points in a Plane* Greg N. Frederickson and Susan Rodger Department of Computer

More information

Notes on Binary Dumbbell Trees

Notes on Binary Dumbbell Trees Notes on Binary Dumbbell Trees Michiel Smid March 23, 2012 Abstract Dumbbell trees were introduced in [1]. A detailed description of non-binary dumbbell trees appears in Chapter 11 of [3]. These notes

More information

arxiv: v2 [cs.ds] 9 Apr 2009

arxiv: v2 [cs.ds] 9 Apr 2009 Pairing Heaps with Costless Meld arxiv:09034130v2 [csds] 9 Apr 2009 Amr Elmasry Max-Planck Institut für Informatik Saarbrücken, Germany elmasry@mpi-infmpgde Abstract Improving the structure and analysis

More information

Hashing. Yufei Tao. Department of Computer Science and Engineering Chinese University of Hong Kong

Hashing. Yufei Tao. Department of Computer Science and Engineering Chinese University of Hong Kong Department of Computer Science and Engineering Chinese University of Hong Kong In this lecture, we will revisit the dictionary search problem, where we want to locate an integer v in a set of size n or

More information

Lecture L16 April 19, 2012

Lecture L16 April 19, 2012 6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture L16 April 19, 2012 1 Overview In this lecture, we consider the string matching problem - finding some or all places in a text where

More information

Simple and Semi-Dynamic Structures for Cache-Oblivious Planar Orthogonal Range Searching

Simple and Semi-Dynamic Structures for Cache-Oblivious Planar Orthogonal Range Searching Simple and Semi-Dynamic Structures for Cache-Oblivious Planar Orthogonal Range Searching ABSTRACT Lars Arge Department of Computer Science University of Aarhus IT-Parken, Aabogade 34 DK-8200 Aarhus N Denmark

More information

PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet

PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet IEICE TRANS. FUNDAMENTALS, VOL.E8??, NO. JANUARY 999 PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet Tetsuo SHIBUYA, SUMMARY The problem of constructing the suffix tree of a tree is

More information

Lecture 7 February 26, 2010

Lecture 7 February 26, 2010 6.85: Advanced Data Structures Spring Prof. Andre Schulz Lecture 7 February 6, Scribe: Mark Chen Overview In this lecture, we consider the string matching problem - finding all places in a text where some

More information

A Simple Linear-Space Data Structure for Constant-Time Range Minimum Query

A Simple Linear-Space Data Structure for Constant-Time Range Minimum Query A Simple Linear-Space Data Structure for Constant-Time Range Minimum Query Stephane Durocher University of Manitoba, Winnipeg, Canada, durocher@cs.umanitoba.ca Abstract. We revisit the range minimum query

More information

Lecture 9 March 4, 2010

Lecture 9 March 4, 2010 6.851: Advanced Data Structures Spring 010 Dr. André Schulz Lecture 9 March 4, 010 1 Overview Last lecture we defined the Least Common Ancestor (LCA) and Range Min Query (RMQ) problems. Recall that an

More information

A more efficient algorithm for perfect sorting by reversals

A more efficient algorithm for perfect sorting by reversals A more efficient algorithm for perfect sorting by reversals Sèverine Bérard 1,2, Cedric Chauve 3,4, and Christophe Paul 5 1 Département de Mathématiques et d Informatique Appliquée, INRA, Toulouse, France.

More information

Theoretical Computer Science

Theoretical Computer Science Theoretical Computer Science 426 427 (2012) 25 41 Contents lists available at SciVerse ScienceDirect Theoretical Computer Science journal homepage: www.vier.com/locate/tcs New algorithms on wavelet trees

More information

Association Rule Mining: FP-Growth

Association Rule Mining: FP-Growth Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong We have already learned the Apriori algorithm for association rule mining. In this lecture, we will discuss a faster

More information

Range Mode and Range Median Queries on Lists and Trees

Range Mode and Range Median Queries on Lists and Trees Range Mode and Range Median Queries on Lists and Trees Danny Krizanc 1, Pat Morin 2, and Michiel Smid 2 1 Department of Mathematics and Computer Science, Wesleyan University, Middletown, CT 06459 USA dkrizanc@wesleyan.edu

More information

Colored Range Queries and Document Retrieval 1

Colored Range Queries and Document Retrieval 1 Colored Range Queries and Document Retrieval 1 Travis Gagie a, Juha Kärkkäinen b, Gonzalo Navarro c, Simon J. Puglisi d a Department of Computer Science and Engineering,Aalto University, Finland. b Department

More information

I/O-Efficient Data Structures for Colored Range and Prefix Reporting

I/O-Efficient Data Structures for Colored Range and Prefix Reporting I/O-Efficient Data Structures for Colored Range and Prefix Reporting Kasper Green Larsen MADALGO Aarhus University, Denmark larsen@cs.au.dk Rasmus Pagh IT University of Copenhagen Copenhagen, Denmark pagh@itu.dk

More information

Greedy Algorithms CHAPTER 16

Greedy Algorithms CHAPTER 16 CHAPTER 16 Greedy Algorithms In dynamic programming, the optimal solution is described in a recursive manner, and then is computed ``bottom up''. Dynamic programming is a powerful technique, but it often

More information

Lecture 19 Apr 25, 2007

Lecture 19 Apr 25, 2007 6.851: Advanced Data Structures Spring 2007 Prof. Erik Demaine Lecture 19 Apr 25, 2007 Scribe: Aditya Rathnam 1 Overview Previously we worked in the RA or cell probe models, in which the cost of an algorithm

More information

CSCI2100B Data Structures Heaps

CSCI2100B Data Structures Heaps CSCI2100B Data Structures Heaps Irwin King king@cse.cuhk.edu.hk http://www.cse.cuhk.edu.hk/~king Department of Computer Science & Engineering The Chinese University of Hong Kong Introduction In some applications,

More information

arxiv: v1 [cs.ds] 21 Jun 2010

arxiv: v1 [cs.ds] 21 Jun 2010 Dynamic Range Reporting in External Memory Yakov Nekrich Department of Computer Science, University of Bonn. arxiv:1006.4093v1 [cs.ds] 21 Jun 2010 Abstract In this paper we describe a dynamic external

More information

Computational Geometry

Computational Geometry Windowing queries Windowing Windowing queries Zoom in; re-center and zoom in; select by outlining Windowing Windowing queries Windowing Windowing queries Given a set of n axis-parallel line segments, preprocess

More information

The Encoding Complexity of Two Dimensional Range Minimum Data Structures

The Encoding Complexity of Two Dimensional Range Minimum Data Structures The Encoding Complexity of Two Dimensional Range Minimum Data tructures Gerth tølting Brodal 1, Andrej Brodnik 2,3, and Pooya Davoodi 4 1 MADALGO, Department of Computer cience, Aarhus University, Denmark.

More information

Deterministic Rectangle Enclosure and Offline Dominance Reporting on the RAM

Deterministic Rectangle Enclosure and Offline Dominance Reporting on the RAM Deterministic Rectangle Enclosure and Offline Dominance Reporting on the RAM Peyman Afshani 1, Timothy M. Chan 2, and Konstantinos Tsakalidis 3 1 MADALGO, Department of Computer Science, Aarhus University,

More information

Interval Stabbing Problems in Small Integer Ranges

Interval Stabbing Problems in Small Integer Ranges Interval Stabbing Problems in Small Integer Ranges Jens M. Schmidt Freie Universität Berlin, Germany Enhanced version of August 2, 2010 Abstract Given a set I of n intervals, a stabbing query consists

More information

Encoding Data Structures

Encoding Data Structures Encoding Data Structures Rajeev Raman University of Leicester ERATO-ALSIP special seminar RMQ problem Given a static array A[1..n], pre-process A to answer queries: RMQ(l, r) : return max l i r A[i]. 47

More information

Space Efficient Data Structures for Dynamic Orthogonal Range Counting

Space Efficient Data Structures for Dynamic Orthogonal Range Counting Space Efficient Data Structures for Dynamic Orthogonal Range Counting Meng He and J. Ian Munro Cheriton School of Computer Science, University of Waterloo, Canada, {mhe, imunro}@uwaterloo.ca Abstract.

More information

New Common Ancestor Problems in Trees and Directed Acyclic Graphs

New Common Ancestor Problems in Trees and Directed Acyclic Graphs New Common Ancestor Problems in Trees and Directed Acyclic Graphs Johannes Fischer, Daniel H. Huson Universität Tübingen, Center for Bioinformatics (ZBIT), Sand 14, D-72076 Tübingen Abstract We derive

More information

On Succinct Representations of Binary Trees

On Succinct Representations of Binary Trees Math. in comp. sc. 99 (9999), 1 11 DOI 10.1007/s11786-003-0000 c 2015 Birkhäuser Verlag Basel/Switzerland Mathematics in Computer Science On Succinct Representations of Binary Trees Pooya Davoodi, Rajeev

More information

Theoretical Computer Science

Theoretical Computer Science Theoretical Computer Science 412 (2011) 3579 3588 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs Cache-oblivious index for approximate

More information

Document Retrieval on Repetitive Collections

Document Retrieval on Repetitive Collections Document Retrieval on Repetitive Collections Gonzalo Navarro 1, Simon J. Puglisi 2, and Jouni Sirén 1 1 Center for Biotechnology and Bioengineering, Department of Computer Science, University of Chile,

More information

Cache-Oblivious String Dictionaries

Cache-Oblivious String Dictionaries Cache-Oblivious String Dictionaries Gerth Stølting Brodal Rolf Fagerberg Abstract We present static cache-oblivious dictionary structures for strings which provide analogues of tries and suffix trees in

More information

Longest Common Extensions in Trees

Longest Common Extensions in Trees Longest Common Extensions in Trees Philip Bille 1, Pawe l Gawrychowski 2, Inge Li Gørtz 1, Gad M. Landau 3,4, and Oren Weimann 3 1 DTU Informatics, Denmark. phbi@dtu.dk, inge@dtu.dk 2 University of Warsaw,

More information

Cache-Oblivious String Dictionaries

Cache-Oblivious String Dictionaries Cache-Oblivious String Dictionaries Gerth Stølting Brodal University of Aarhus Joint work with Rolf Fagerberg #"! Outline of Talk Cache-oblivious model Basic cache-oblivious techniques Cache-oblivious

More information

HEAPS ON HEAPS* Downloaded 02/04/13 to Redistribution subject to SIAM license or copyright; see

HEAPS ON HEAPS* Downloaded 02/04/13 to Redistribution subject to SIAM license or copyright; see SIAM J. COMPUT. Vol. 15, No. 4, November 1986 (C) 1986 Society for Industrial and Applied Mathematics OO6 HEAPS ON HEAPS* GASTON H. GONNET" AND J. IAN MUNRO," Abstract. As part of a study of the general

More information

l Heaps very popular abstract data structure, where each object has a key value (the priority), and the operations are:

l Heaps very popular abstract data structure, where each object has a key value (the priority), and the operations are: DDS-Heaps 1 Heaps - basics l Heaps very popular abstract data structure, where each object has a key value (the priority), and the operations are: l insert an object, find the object of minimum key (find

More information

Heap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the

Heap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the Heap-on-Top Priority Queues Boris V. Cherkassky Central Economics and Mathematics Institute Krasikova St. 32 117418, Moscow, Russia cher@cemi.msk.su Andrew V. Goldberg NEC Research Institute 4 Independence

More information

Course Review for Finals. Cpt S 223 Fall 2008

Course Review for Finals. Cpt S 223 Fall 2008 Course Review for Finals Cpt S 223 Fall 2008 1 Course Overview Introduction to advanced data structures Algorithmic asymptotic analysis Programming data structures Program design based on performance i.e.,

More information

Bichromatic Line Segment Intersection Counting in O(n log n) Time

Bichromatic Line Segment Intersection Counting in O(n log n) Time Bichromatic Line Segment Intersection Counting in O(n log n) Time Timothy M. Chan Bryan T. Wilkinson Abstract We give an algorithm for bichromatic line segment intersection counting that runs in O(n log

More information

Efficient Data Structures for the Factor Periodicity Problem

Efficient Data Structures for the Factor Periodicity Problem Efficient Data Structures for the Factor Periodicity Problem Tomasz Kociumaka Wojciech Rytter Jakub Radoszewski Tomasz Waleń University of Warsaw, Poland SPIRE 2012 Cartagena, October 23, 2012 Tomasz Kociumaka

More information

Distributed minimum spanning tree problem

Distributed minimum spanning tree problem Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with

More information

Compressed Cache-Oblivious String B-tree

Compressed Cache-Oblivious String B-tree Compressed Cache-Oblivious String B-tree Paolo Ferragina and Rossano Venturini Dipartimento di Informatica, Università di Pisa, Italy Abstract In this paper we address few variants of the well-known prefixsearch

More information

A SIMPLE APPROXIMATION ALGORITHM FOR NONOVERLAPPING LOCAL ALIGNMENTS (WEIGHTED INDEPENDENT SETS OF AXIS PARALLEL RECTANGLES)

A SIMPLE APPROXIMATION ALGORITHM FOR NONOVERLAPPING LOCAL ALIGNMENTS (WEIGHTED INDEPENDENT SETS OF AXIS PARALLEL RECTANGLES) Chapter 1 A SIMPLE APPROXIMATION ALGORITHM FOR NONOVERLAPPING LOCAL ALIGNMENTS (WEIGHTED INDEPENDENT SETS OF AXIS PARALLEL RECTANGLES) Piotr Berman Department of Computer Science & Engineering Pennsylvania

More information

arxiv: v2 [cs.ds] 30 Sep 2016

arxiv: v2 [cs.ds] 30 Sep 2016 Synergistic Sorting, MultiSelection and Deferred Data Structures on MultiSets Jérémy Barbay 1, Carlos Ochoa 1, and Srinivasa Rao Satti 2 1 Departamento de Ciencias de la Computación, Universidad de Chile,

More information

1 The range query problem

1 The range query problem CS268: Geometric Algorithms Handout #12 Design and Analysis Original Handout #12 Stanford University Thursday, 19 May 1994 Original Lecture #12: Thursday, May 19, 1994 Topics: Range Searching with Partition

More information

Submatrix Maximum Queries in Monge Matrices are Equivalent to Predecessor Search

Submatrix Maximum Queries in Monge Matrices are Equivalent to Predecessor Search Submatrix Maximum Queries in Monge Matrices are Equivalent to Predecessor Search Paweł Gawrychowski 1, Shay Mozes 2, and Oren Weimann 3 1 University of Warsaw, gawry@mimuw.edu.pl 2 IDC Herzliya, smozes@idc.ac.il

More information

Alphabet-Dependent String Searching with Wexponential Search Trees

Alphabet-Dependent String Searching with Wexponential Search Trees Alphabet-Dependent String Searching with Wexponential Search Trees Johannes Fischer and Pawe l Gawrychowski February 15, 2013 arxiv:1302.3347v1 [cs.ds] 14 Feb 2013 Abstract It is widely assumed that O(m

More information

Lecture 18 April 12, 2005

Lecture 18 April 12, 2005 6.897: Advanced Data Structures Spring 5 Prof. Erik Demaine Lecture 8 April, 5 Scribe: Igor Ganichev Overview In this lecture we are starting a sequence of lectures about string data structures. Today

More information

I/O-Algorithms Lars Arge Aarhus University

I/O-Algorithms Lars Arge Aarhus University I/O-Algorithms Aarhus University April 10, 2008 I/O-Model Block I/O D Parameters N = # elements in problem instance B = # elements that fits in disk block M = # elements that fits in main memory M T =

More information

Faster Average Case Low Memory Semi-External Construction of the Burrows-Wheeler Transform

Faster Average Case Low Memory Semi-External Construction of the Burrows-Wheeler Transform Faster Average Case Low Memory Semi-External Construction of the Burrows-Wheeler German Tischler The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus Hinxton, Cambridge, CB10 1SA, United Kingdom

More information

On k-dimensional Balanced Binary Trees*

On k-dimensional Balanced Binary Trees* journal of computer and system sciences 52, 328348 (1996) article no. 0025 On k-dimensional Balanced Binary Trees* Vijay K. Vaishnavi Department of Computer Information Systems, Georgia State University,

More information

Optimal Parallel Randomized Renaming

Optimal Parallel Randomized Renaming Optimal Parallel Randomized Renaming Martin Farach S. Muthukrishnan September 11, 1995 Abstract We consider the Renaming Problem, a basic processing step in string algorithms, for which we give a simultaneously

More information

Optimal Static Range Reporting in One Dimension

Optimal Static Range Reporting in One Dimension of Optimal Static Range Reporting in One Dimension Stephen Alstrup Gerth Stølting Brodal Theis Rauhe ITU Technical Report Series 2000-3 ISSN 1600 6100 November 2000 Copyright c 2000, Stephen Alstrup Gerth

More information

arxiv: v3 [cs.ds] 29 Jun 2010

arxiv: v3 [cs.ds] 29 Jun 2010 Sampled Longest Common Prefix Array Jouni Sirén Department of Computer Science, University of Helsinki, Finland jltsiren@cs.helsinki.fi arxiv:1001.2101v3 [cs.ds] 29 Jun 2010 Abstract. When augmented with

More information

Path Minima Queries in Dynamic Weighted Trees

Path Minima Queries in Dynamic Weighted Trees Path Minima Queries in Dynamic Weighted Trees Gerth Stølting Brodal 1, Pooya Davoodi 1, S. Srinivasa Rao 2 1 MADALGO, Department of Computer Science, Aarhus University. E-mail: {gerth,pdavoodi}@cs.au.dk

More information

Rank-Pairing Heaps. Bernard Haeupler Siddhartha Sen Robert E. Tarjan. SIAM JOURNAL ON COMPUTING Vol. 40, No. 6 (2011), pp.

Rank-Pairing Heaps. Bernard Haeupler Siddhartha Sen Robert E. Tarjan. SIAM JOURNAL ON COMPUTING Vol. 40, No. 6 (2011), pp. Rank-Pairing Heaps Bernard Haeupler Siddhartha Sen Robert E. Tarjan Presentation by Alexander Pokluda Cheriton School of Computer Science, University of Waterloo, Canada SIAM JOURNAL ON COMPUTING Vol.

More information

Optimal Succinctness for Range Minimum Queries

Optimal Succinctness for Range Minimum Queries Optimal Succinctness for Range Minimum Queries Johannes Fischer Universität Tübingen, Center for Bioinformatics (ZBIT), Sand 14, D-72076 Tübingen fischer@informatik.uni-tuebingen.de Abstract. For a static

More information

Representations of Weighted Graphs (as Matrices) Algorithms and Data Structures: Minimum Spanning Trees. Weighted Graphs

Representations of Weighted Graphs (as Matrices) Algorithms and Data Structures: Minimum Spanning Trees. Weighted Graphs Representations of Weighted Graphs (as Matrices) A B Algorithms and Data Structures: Minimum Spanning Trees 9.0 F 1.0 6.0 5.0 6.0 G 5.0 I H 3.0 1.0 C 5.0 E 1.0 D 28th Oct, 1st & 4th Nov, 2011 ADS: lects

More information

Lecture 8 13 March, 2012

Lecture 8 13 March, 2012 6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture 8 13 March, 2012 1 From Last Lectures... In the previous lecture, we discussed the External Memory and Cache Oblivious memory models.

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

Algorithms for Euclidean TSP

Algorithms for Euclidean TSP This week, paper [2] by Arora. See the slides for figures. See also http://www.cs.princeton.edu/~arora/pubs/arorageo.ps Algorithms for Introduction This lecture is about the polynomial time approximation

More information

Lecture 9 March 15, 2012

Lecture 9 March 15, 2012 6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture 9 March 15, 2012 1 Overview This is the last lecture on memory hierarchies. Today s lecture is a crossover between cache-oblivious

More information

Space Efficient Linear Time Construction of

Space Efficient Linear Time Construction of Space Efficient Linear Time Construction of Suffix Arrays Pang Ko and Srinivas Aluru Dept. of Electrical and Computer Engineering 1 Laurence H. Baker Center for Bioinformatics and Biological Statistics

More information

Lecture 12 March 21, 2007

Lecture 12 March 21, 2007 6.85: Advanced Data Structures Spring 27 Oren Weimann Lecture 2 March 2, 27 Scribe: Tural Badirkhanli Overview Last lecture we saw how to perform membership quueries in O() time using hashing. Today we

More information

Lecture 5: Suffix Trees

Lecture 5: Suffix Trees Longest Common Substring Problem Lecture 5: Suffix Trees Given a text T = GGAGCTTAGAACT and a string P = ATTCGCTTAGCCTA, how do we find the longest common substring between them? Here the longest common

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

Geometric Streaming Algorithms with a Sorting Primitive (TR CS )

Geometric Streaming Algorithms with a Sorting Primitive (TR CS ) Geometric Streaming Algorithms with a Sorting Primitive (TR CS-2007-17) Eric Y. Chen School of Computer Science University of Waterloo Waterloo, ON N2L 3G1, Canada, y28chen@cs.uwaterloo.ca Abstract. We

More information

Quake Heaps: A Simple Alternative to Fibonacci Heaps

Quake Heaps: A Simple Alternative to Fibonacci Heaps Quake Heaps: A Simple Alternative to Fibonacci Heaps Timothy M. Chan Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario N2L G, Canada, tmchan@uwaterloo.ca Abstract. This note

More information

Binary Trees

Binary Trees Binary Trees 4-7-2005 Opening Discussion What did we talk about last class? Do you have any code to show? Do you have any questions about the assignment? What is a Tree? You are all familiar with what

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

An Optimal Dynamic Interval Stabbing-Max Data Structure?

An Optimal Dynamic Interval Stabbing-Max Data Structure? An Optimal Dynamic Interval Stabbing-Max Data Structure? Pankaj K. Agarwal Lars Arge Ke Yi Abstract In this paper we consider the dynamic stabbing-max problem, that is, the problem of dynamically maintaining

More information

Parallel and Sequential Data Structures and Algorithms Lecture (Spring 2012) Lecture 16 Treaps; Augmented BSTs

Parallel and Sequential Data Structures and Algorithms Lecture (Spring 2012) Lecture 16 Treaps; Augmented BSTs Lecture 16 Treaps; Augmented BSTs Parallel and Sequential Data Structures and Algorithms, 15-210 (Spring 2012) Lectured by Margaret Reid-Miller 8 March 2012 Today: - More on Treaps - Ordered Sets and Tables

More information

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi. Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 18 Tries Today we are going to be talking about another data

More information

Scribe: Virginia Williams, Sam Kim (2016), Mary Wootters (2017) Date: May 22, 2017

Scribe: Virginia Williams, Sam Kim (2016), Mary Wootters (2017) Date: May 22, 2017 CS6 Lecture 4 Greedy Algorithms Scribe: Virginia Williams, Sam Kim (26), Mary Wootters (27) Date: May 22, 27 Greedy Algorithms Suppose we want to solve a problem, and we re able to come up with some recursive

More information