Optimal Static Range Reporting in One Dimension

Optimal Static Range Reporting in One Dimension
Stephen Alstrup, Gerth Stølting Brodal, Theis Rauhe
ITU Technical Report Series, November 2000

Copyright © 2000, Stephen Alstrup, Gerth Stølting Brodal, Theis Rauhe. The IT University of Copenhagen. All rights reserved. Reproduction of all or part of this work is permitted for educational or research use on condition that this copyright notice is included in any copy. Copies may be obtained by contacting: The IT University of Copenhagen, Glentevej 67, DK-2400 Copenhagen NV, Denmark.

Optimal Static Range Reporting in One Dimension

Stephen Alstrup, Gerth Stølting Brodal, Theis Rauhe

24th November 2000

Abstract

We consider static one-dimensional range searching problems. These problems are to build static data structures for an integer set S ⊆ U, where U = {0, 1, ..., 2^w − 1}, which support various queries for integer intervals of U. For the query of reporting all integers in S contained within a query interval, we present an optimal data structure with linear space cost and with query time linear in the number of integers reported. This result holds in the unit-cost RAM model with word size w and a standard instruction set. We also present a linear space data structure for approximate range counting. A range counting query for an interval returns the number of integers in S contained within the interval. For any constant ε > 0, our range counting data structure returns in constant time an approximate answer which is within a factor of at most 1 + ε of the correct answer.

Affiliations: The IT University of Copenhagen, Glentevej 67, DK-2400 Copenhagen NV, Denmark; and BRICS (Basic Research in Computer Science), Center of the Danish National Research Foundation, Department of Computer Science, University of Aarhus, Ny Munkegade, DK-8000 Århus C, Denmark. Partially supported by the IST Programme of the EU (ALCOM-FT). gerth@brics.dk.

1 Introduction

Let S be a subset of the universe U = {0, 1, ..., 2^w − 1} for some parameter w. We consider static data structures for storing the set S such that various types of range search queries can be answered for S. Our bounds are valid in the standard unit-cost RAM with word size w and a standard instruction set. We present an optimal data structure for the fundamental problem of reporting all elements from S contained within a given query interval. We also provide a data structure that supports an approximate range counting query, and show how this can be applied to multi-dimensional orthogonal range searching. In particular, we provide new results for the following query operations, for a, b ∈ U:

FindAny(a, b): Report any element in S ∩ [a, b], or ⊥ if there is no such element.

Report(a, b): Report all elements in S ∩ [a, b].

Count_ε(a, b), ε ≥ 0: Return an integer k such that |S ∩ [a, b]| ≤ k ≤ (1 + ε)·|S ∩ [a, b]|.

Let n denote the size of S and let u = 2^w denote the size of the universe U. Our main result is a static data structure with O(n) space cost that supports the query FindAny in constant time. As a corollary, the data structure supports Report in time O(1 + k), where k is the number of elements reported. Furthermore, we give linear space structures for the approximate range counting problem. That is, for any constant ε > 0, we present a data structure that supports Count_ε in constant time and uses O(n) space. The preprocessing time for the mentioned data structures is expected time O(n √(log u)).

1.1 Related work

Efficient static data structures for range searching have been studied intensively over the past 30 years; for surveys and books see e.g. [1, 18, 20]. In one dimension there has been much focus on the following two fundamental problems: the membership problem and the predecessor problem. These problems address the following queries respectively:

Member(a), a ∈ U: Return yes iff a ∈ S.

Pred(a), a ∈ U: Return the predecessor of a, i.e., max{x ∈ S | x ≤ a}, or ⊥ if there is no such element.
The Member query is easily solved by FindAny, Report or Count_ε by restricting the query to a unit-size interval. On the other hand, it is straightforward to compute these three queries by at most two predecessor queries, given an additional sorted (relative to U) list of the points in S, where each point is associated with its list rank. An information-theoretic lower bound implies that any data structure supporting any of the above queries, including Member, requires at least log₂ (u choose n) bits, i.e., has linear space cost in terms of w = log u bit words for n ≤ u^(1−Ω(1)). In [12], Fredman, Komlós and Szemerédi give an optimal solution for the static membership problem, which supports Member in constant time and with space cost O(n). In contrast, the predecessor problem does not permit a data structure with constant query time for a space cost bounded by n^O(1). This was first proved by Ajtai [3]; later Beame and Fich [8] improved Ajtai's lower bound and in addition gave a

matching upper bound of O(min{log log u / log log log u, √(log n / log log n)}) on the query time for space cost O(n^(1+δ)) for any constant δ > 0. Beame and Fich's lower bound holds for exact counting queries, i.e., Count_ε where ε = 0. Our result shows that it is possible to circumvent this lower bound by allowing a slack in the precision of the result of the queries. For data structures with linear space cost, Willard [24] provides a data structure with time O(log log u) for predecessor queries. Andersson and Thorup [7] show how to obtain a dynamic predecessor data structure with query time O(min{log log u · log log n / log log log u, √(log n / log log n)}). For linear space cost, these bounds were previously also the best known for the queries FindAny, Report and Count_ε. However, for superlinear space cost, Miltersen et al. [19] provide a data structure which achieves constant time for FindAny with space cost O(n log u). Miltersen et al. also show that testing for emptiness of a rectangle in two dimensions is as hard as exact counting in one dimension. Hence, there is no hope of achieving constant query time for any of the above query variants, including approximate range counting, in two dimensions using space at most n^O(1).

Approximate data structures. Several papers discuss the approach of obtaining a speed-up of a data structure by allowing a slack of precision in the answers. In [17], Matias et al. study an approximate variant of the dynamic predecessor problem, in which an answer to a predecessor query is allowed to be within a multiplicative or additive error relative to the correct universe position of the answer. They give several applications of this data structure, in particular its use for prototypical algorithms, including Prim's minimum spanning tree algorithm and Dijkstra's shortest path algorithm. The papers [4] and [6] provide approximate data structures for other closely related problems, e.g., for nearest neighbor searching, dynamic indexed lists, and dynamic subset rank.
An important application of our approximate data structure is the static d-dimensional orthogonal range searching problem: given a set of points in U^d, compute for a query the points lying in a d-dimensional box [a_1, b_1] × ... × [a_d, b_d]. Known data structures providing sublinear search time have space cost growing exponentially with the dimension d; this is known as the curse of dimensionality [9]. Hence, for d of moderate size, a query is often most efficiently computed by a linear scan of the input. A straightforward optimization of this approach using space O(dn) is to keep the points sorted by each of the d coordinates. Then, for a given query, we can restrict the scan to the dimension i where fewest points in S have their i-th coordinate within the interval [a_i, b_i]. This approach leads to a time cost of O(t(n) + opt), where opt is the number of points to be scanned and t(n) is the time to compute a range counting query for a given dimension. Using the previously best data structures for the exact range counting problem, this approach has a time cost of O(min{log log u, √(log n / log log n)} + opt). Using our data structure supporting Count_ε and FindAny, we improve the time for this approach to the optimal time O(opt·(1 + ε)) = O(opt) within the same space cost. A linear scan behaves well in computational models which consider a memory hierarchy, see [2]. Hence, even for large values of opt, it is likely that the computation needed to determine the dimension for the scan majorizes the overall time cost.

1.2 Organization

The paper is organized as follows. In Section 2 we define our model of computation and the problems we consider, and state definitions and known results needed in our data structures. In Section 3 we describe our data structure for the range reporting problem, and in Section 4 we describe how to preprocess and build it. Finally, in Section 5 we describe how to extend the range reporting data structure to support approximate range counting queries.

2 Preliminaries

A query Report(a, b) can be implemented by first querying FindAny(a, b). If an x ∈ S ∩ [a, b] is returned, we report the result of recursively applying Report(a, x − 1), then x, and the result of recursively applying Report(x + 1, b). Otherwise the empty set is returned. Code for the reduction is given in Figure 2. If k elements are returned, a straightforward induction shows that there are 2k + 1 recursive calls to Report, i.e. at most 2k + 1 calls to FindAny, and we therefore have the following lemma.

Lemma 1 If FindAny is supported in time at most t, then Report can be supported in time O(t·(1 + k)), where k is the number of elements reported.

The model of computation we assume throughout this paper is a unit-cost RAM with word size w bits, where the set of instructions includes the standard boolean operations on words, the arbitrary shifting of words, and the multiplication of two words. We assume that the model has access to a sequence of truly random bits. For our constructions we need the following definitions and results. Given two words x and y, we let x ⊕ y denote the binary exclusive-or of x and y. If x is a w bit word and i a nonnegative integer, we let x ≫ i and x ≪ i denote the rightmost w bits of the result of shifting x i bits to the right and i bits to the left respectively, i.e. x ≫ i = ⌊x / 2^i⌋ and x ≪ i = (x · 2^i) mod 2^w. For a word x, we let msb(x) denote the most significant bit position in x that contains a one, i.e. msb(x) = max{i | 2^i ≤ x} for x > 0. We define msb(0) = 0. Fredman and Willard in [13] describe how to compute msb in constant time.
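The reduction from Report to FindAny can be sketched as follows; a linear scan over a sorted list stands in for the paper's constant-time FindAny structure, so only the recursion itself is illustrated.

```python
def msb(x):
    # Most significant set-bit position; msb(0) = 0 by the paper's convention.
    return x.bit_length() - 1 if x > 0 else 0

def make_find_any(elements):
    # Stand-in for the constant-time FindAny of Section 3: returns some
    # element of S in [a, b], or None if the intersection is empty.
    s = sorted(elements)
    def find_any(a, b):
        for x in s:
            if a <= x <= b:
                return x
        return None
    return find_any

def report(find_any, a, b):
    # Lemma 1: at most 2k + 1 FindAny calls are made for k reported
    # elements, giving O(t * (1 + k)) total time.
    x = find_any(a, b)
    if x is None:
        return []
    return report(find_any, a, x - 1) + [x] + report(find_any, x + 1, b)
```

Note that the recursion emits the elements in sorted order regardless of which witness FindAny returns.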
Theorem 1 (Fredman and Willard [13]) Given a w bit word x, the index msb(x) can be computed in constant time, provided a constant number of words is known which only depend on the word size w.

Essential to our range reporting data structure is the efficient and compact implementation of sparse arrays. We define a sparse array to be a static array where only a limited number of entries are initialized to contain specific values. All other entries may contain arbitrary information, and, crucially for achieving the compact representation, it is not possible to distinguish initialized and uninitialized entries. For the implementation of sparse arrays we adopt the following definition and result about perfect hash functions.

Definition 1 A function h : {0, ..., m − 1} → {0, ..., n' − 1} is perfect for a set S ⊆ {0, ..., m − 1} if h is 1-1 on S. A family H of such functions is an (m, n, n')-family of perfect hash functions, if for all subsets S ⊆ {0, ..., m − 1} of size n there is a function h ∈ H that is perfect for S.

The question of representing families of perfect hash functions efficiently has been thoroughly studied. Schmidt and Siegel [21] described an (m, n, O(n))-family of perfect hash functions where each hash function can be represented by O(n + log log m) bits. Jacobs and van Emde Boas [16] gave a simpler solution requiring O(n log log n + log log m) bits in the standard unit-cost RAM model augmented with multiplicative arithmetic. Jacobs and van Emde Boas' result suffices for our purposes. The construction in [16] makes repeated use of the data structure in [12], where some primes are assumed to be known. By replacing the applications of the data structures from [12] with applications of the data structure from [10], the randomized construction time in Theorem 2 follows immediately.

Theorem 2 (Jacobs and van Emde Boas [16]) There is an (m, n, O(n))-family of perfect hash functions H such that any hash function h ∈ H can be represented in O((n log log n)/w + 1) words and evaluated in constant time for m ≤ 2^w. The perfect hash function can be constructed in expected time O(n).

A sparse array A can be implemented using a perfect hash function as follows. Assume A has size m and contains n initialized entries, each storing b bits of information. Using a perfect hash function h for the n initialized indices of A, we can store the n initialized entries of A in an array B of size O(n), such that A[i] = B[h(i)] for each initialized entry A[i]. If A[i] is not initialized, B[h(i)] is an arbitrary one of the n initialized entries (depending on the choice of h). From Theorem 2 we immediately have the following corollary.

Corollary 1 A sparse array of size m with n initialized entries, each containing b bits of information, can with expected preprocessing time O(n) be stored using space O(n·(b + log log n)/w + 1) words, and lookups are supported in constant time, provided b ≤ w and m ≤ 2^w.
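The essential semantics of such a sparse array can be sketched as follows; a Python dict stands in for the perfect hash function h, which is only an illustration of the interface, not of the compact representation.

```python
class SparseArray:
    # Sketch of the sparse array of Corollary 1. The key behavioral point is
    # preserved: a lookup on an uninitialized index returns an arbitrary
    # initialized value, so the caller cannot distinguish the two cases and
    # must validate answers (as FindAny does by testing membership in [a, b]).
    def __init__(self, initialized):
        # initialized: mapping index -> value for the n initialized entries.
        self._values = list(initialized.values())
        self._slot = {i: k for k, i in enumerate(initialized)}

    def lookup(self, i):
        # Constant time; an unknown index "hashes" to some arbitrary slot.
        return self._values[self._slot.get(i, 0)]
```
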
For the approximate range counting data structure in Section 5 we need the following result achieved by Fredman and Willard for storing small sets (in [14] denoted Q-heaps; these are actually dynamic data structures, but we only need their static properties). For a set S and an element x we define rank_S(x) = |{y ∈ S | y ≤ x}|.

Theorem 3 (Fredman and Willard [14]) Let S be a set of w bit words and n an integer, where |S| ≤ (log n)^(1/4) and log n ≤ w. Using time O(|S|) and space O(|S|) words, a data structure can be constructed that supports rank_S(x) queries in constant time, given the availability of a table requiring space and preprocessing time O(n).

The result of Theorem 3 can be extended to sets of size (log n)^c for any constant c > 0, by constructing a (log n)^(1/4)-ary search tree of constant height with the elements of S stored at the leaves together with their ranks in S, and where internal nodes are represented by the data structures of Theorem 3. Top-down searches then take time proportional to the height of the tree.

Corollary 2 Let c > 0 be a fixed constant, S a set of w bit words and n an integer, where |S| ≤ (log n)^c and log n ≤ w. Using time O(|S|) and space O(|S|) words, a data structure can be constructed that supports predecessor queries in constant time, given the availability of a table requiring space and preprocessing time O(n).

Figure 1: The binary tree T for the case w = 4 and H = 2, with an example set S inducing the subtree P, the branching nodes V, and the two sparse arrays A and B.

3 Range reporting data structure

In this section we describe a data structure supporting FindAny(a, b) queries in constant time. The basic component of the data structure is (the implicit representation of) a perfect binary tree T with 2^w leaves, i.e. a binary tree where all leaves have depth w, if the root has depth zero. The leaves are numbered from left to right 0, ..., 2^w − 1, and the internal nodes of T are numbered 1, ..., 2^w − 1. The root is the first node and the children of node v are nodes 2v and 2v + 1, i.e. like the numbering of nodes in an implicit binary heap [11, 25]. Figure 1 shows the numbering of the nodes for the case w = 4. The tree T has the following properties (see [15]):

Fact 1 The depth of an internal node v is msb(v), and the i-th ancestor of v is v ≫ i, for 0 ≤ i ≤ depth(v). The parent of leaf i is the internal node 2^(w−1) + (i ≫ 1), for 0 ≤ i < 2^w. For 0 ≤ i < j < 2^w, the nearest common ancestor of the leaves i and j is the (1 + msb(i ⊕ j))-th ancestor of the leaves i and j.

For a node v in T, we let left(v) and right(v) denote the left and right children of v, we let T_v denote the subtree rooted at v, and we let S_v denote the subset of S where x ∈ S_v if, and only if, x ∈ S and leaf x is a descendant of v. We let P be the subtree of T consisting of the union of the internal nodes on the paths from the root to the leaves in S, and we let V be the subset of P consisting of the root of T and the nodes where both children are in P. We denote V the set of branching nodes. Since each leaf-to-root path in T contains w internal nodes, we have |P| ≤ nw, and since V contains the root and the set of nodes of degree two in the subtree defined by P, we have |V| ≤ n − 1 if both children of the root are in P, and otherwise |V| ≤ n. To answer a query FindAny(a, b), the basic idea is to compute the nearest common ancestor v of the leaves a and b in constant time.
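The navigation primitives of Fact 1 can be checked with a few bit operations; the following sketch uses the heap numbering above, addressing leaf i through the virtual node number 2^w + i.

```python
W = 4  # word size: T has 2^W leaves and internal nodes 1 .. 2^W - 1

def msb(x):
    # Most significant set-bit position; msb(0) = 0 by convention.
    return x.bit_length() - 1 if x > 0 else 0

def depth(v):
    # Depth of internal node v; the root (node 1) has depth 0 (Fact 1).
    return msb(v)

def ancestor(v, i):
    # The i-th ancestor of internal node v, for 0 <= i <= depth(v).
    return v >> i

def leaf_parent(i):
    # Parent of leaf i is the internal node 2^(W-1) + (i >> 1).
    return (1 << (W - 1)) + (i >> 1)

def nca(i, j):
    # Nearest common ancestor of distinct leaves i and j: the
    # (1 + msb(i XOR j))-th ancestor, computed on the virtual leaf 2^W + i.
    return ((1 << W) + i) >> (1 + msb(i ^ j))
```

For example, with W = 4 the leaves 5 and 6 differ first at depth 2, so their nearest common ancestor is the internal node 5.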
If S ∩ [a, b] ≠ ∅, then either max S_left(v) or min S_right(v) is contained in [a, b], since [a, b] is contained within the interval spanned by v, and a and b are spanned by the left and right child of v respectively. Otherwise, whatever computation we do cannot identify an integer in S ∩ [a, b]. At most nw nodes satisfy S_v ≠ ∅. For example, in Figure 1 the nearest common ancestor v of the query leaves determines the two candidates max S_left(v) and min S_right(v), and the query returns 12. By storing these nodes in a sparse array together with min S_v and max S_v, we obtain a data structure using space O(nw) words, which supports FindAny

Proc Report(a, b)
    x ← FindAny(a, b)
    if x ≠ ⊥ then
        Report(a, x − 1); output(x); Report(x + 1, b)

Proc FindAny(a, b)
    if a > b then return ⊥
    H ← 1 ≪ ((1 + msb(w)) ≫ 1)
    if a = b then u ← (1 ≪ (w − 1)) + (a ≫ 1)
    else u ← ((1 ≪ w) + a) ≫ (1 + msb(a ⊕ b))
    z ← u ≫ (depth(u) ∧ (H − 1))
    if A[z] then v ← u ≫ B[u] else v ← z ≫ B[z]
    for x ∈ {V[v].left.min, V[v].left.max, V[v].right.min, V[v].right.max} do
        if x ∈ [a, b] then return x
    return ⊥

Figure 2: Implementation of the queries Report and FindAny. Here V[v].left.min denotes the min field of the record pointed to by the left field of V[v], and similarly for the other three candidate values.

in constant time. In the following we describe how to reduce the space usage of this approach to O(n) words. We consider the tree T as partitioned into a set of layers, each consisting of H consecutive levels of T, where H = 1 ≪ ((1 + msb(w)) ≫ 1), i.e. H is the power of two satisfying √w ≤ H < 2√w. For a node u, we let layer(u) denote the nearest ancestor z of u such that depth(z) mod H = 0. If depth(u) mod H = 0, then layer(u) = u. Since H is a power of two, we can compute x mod H as x ∧ (H − 1), i.e. for an internal node u we can compute layer(u) = u ≫ (depth(u) ∧ (H − 1)). E.g. in Figure 1, H = 2 and layer(2) = 2 ≫ (1 ∧ 1) = 1.

The data structure for the set S consists of three sparse arrays A, B and V, each implemented according to Corollary 1. The arrays A and B will be used to find the nearest ancestor of a node in P that is a branching node.

A: A bit-vector that for each node z in P with layer(z) = z (or equivalently depth(z) mod H = 0) has A[z] = 1 if, and only if, there exists a node u in V with layer(u) = z.

B: A vector that for each node u in P where layer(u) = u or A[layer(u)] = 1 stores the distance to the nearest ancestor v in V of u, i.e. B[u] = depth(u) − depth(v).

V: A vector that for each branching node v in V stores a record with the fields left, right, min and max, where V[v].min = min S_v and V[v].max = max S_v, and left (respectively right) is a pointer to the record of the nearest descendant u in V of v in the left (respectively right) subtree of v. If no such u exists, then V[v].left = v (respectively V[v].right = v).

Given the above data structure, FindAny(a, b) can be implemented by the code in Figure 2. If a > b, the query immediately returns ⊥. Otherwise the value H is computed, and the
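Since H is a power of two, the layer-top computation reduces to a shift by a masked depth; a small sketch (with H = 2 matching Figure 1):

```python
def msb(x):
    # Most significant set-bit position; for an internal node u of T,
    # msb(u) is its depth (Fact 1).
    return x.bit_length() - 1 if x > 0 else 0

def layer(u, H):
    # Nearest ancestor z of internal node u with depth(z) mod H == 0.
    # H is a power of two, so depth(u) mod H == depth(u) & (H - 1),
    # and the ancestor is reached by shifting that many positions.
    return u >> (msb(u) & (H - 1))
```
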

nearest common internal ancestor u in T of the leaves a and b is computed, together with z = layer(u). Using A, B and V we then compute the nearest branching-node ancestor v in V of the leaves a and b. In the computation of v an error may be introduced, since the arrays A, B and V are only well defined for a subset of the nodes of T; however, as we show next, this only happens when S ∩ [a, b] = ∅. Finally we check if one of the min and max values of the records V[v].left and V[v].right is in [a, b]. If one of these four values belongs to [a, b], we return such a value. Otherwise ⊥ is returned. As an example, consider a query on the set of Figure 1, where H = 2: the node u and z = layer(u) are computed, v is found via A and B, the four values tested are the min and max values of the records V[v].left and V[v].right, and the answer 12 is returned.

Theorem 4 The data structure supports FindAny in constant time and Report in time O(1 + k), where k is the number of elements reported. The data structure requires space O(|S|) words.

Proof. The correctness of FindAny(a, b) can be seen as follows. If S ∩ [a, b] = ∅, then the algorithm returns ⊥, since before returning an element there is a check to find if the element is contained in the interval. Otherwise S ∩ [a, b] ≠ ∅. If a = b, then by Fact 1 the computed u = 2^(w−1) + (a ≫ 1) is the parent of the leaf a, and z = u ≫ ((w − 1) ∧ (H − 1)) = u ≫ (depth(u) mod H) = layer(u). We now argue that v is the nearest ancestor node of the leaf a that is a branching node. If A[layer(u)] = 0, then T_layer(u) contains no node of V, and v is computed as layer(u) ≫ B[layer(u)], which by the definition of B is the nearest branching ancestor of layer(u); since no branching node lies between the leaf a and layer(u), this is also the nearest branching ancestor of a. Otherwise A[layer(u)] = 1, implying T_layer(u) ∩ V ≠ ∅; by definition B[u] is then defined, such that u ≫ B[u] is the nearest ancestor of u that is a branching node. We conclude that the computed v is the nearest ancestor of the leaf a that is a branching node. If the leaf a is contained in the left subtree of v, then V[v].left = v and a = V[v].min; it follows that a = V[v].left.min is among the tested values. Similarly, if the leaf a is contained in the right subtree of v, then a = V[v].right.max.
For the case where S ∩ [a, b] ≠ ∅ and a < b, we have by Fact 1 that the computed node u is the nearest common ancestor of the leaves a and b, where depth(u) = w − 1 − msb(a ⊕ b), and that z = u ≫ (depth(u) ∧ (H − 1)) = u ≫ (depth(u) mod H) = layer(u). Similarly to the case a = b, the computed node v is the nearest ancestor of the node u that is a branching node. If v = u, i.e. v is the nearest common ancestor of the leaves a and b, then S_left(v) ∩ [a, b] ≠ ∅ or S_right(v) ∩ [a, b] ≠ ∅. If |S_left(v)| ≥ 2 and S_left(v) ∩ [a, b] ≠ ∅, then V[v].left ≠ v and V[v].left.max ∈ [a, b]. If |S_left(v)| = 1 and S_left(v) ∩ [a, b] ≠ ∅, then V[v].left = v and V[v].left.min = V[v].min ∈ [a, b]. Similarly, if S_right(v) ∩ [a, b] ≠ ∅, then either V[v].right.min ∈ [a, b] or V[v].right.max ∈ [a, b]. Finally we consider the case where v ≠ u, i.e. either u ∈ T_left(v) or u ∈ T_right(v). If u ∈ T_left(v) and |S_left(v)| = 1, then V[v].left = v and S_left(v) = {V[v].min} = {V[v].left.min}. Similarly, if u ∈ T_right(v) and |S_right(v)| = 1, then V[v].right = v and S_right(v) = {V[v].max} = {V[v].right.max}. If u ∈ T_left(v) and |S_left(v)| ≥ 2, then T_{V[v].left} is either a subtree of T_left(u) or of T_right(u), implying that V[v].left.max ∈ [a, b] or V[v].left.min ∈ [a, b] respectively. Similarly, if u ∈ T_right(v) and |S_right(v)| ≥ 2, then either V[v].right.max ∈ [a, b] or V[v].right.min ∈ [a, b]. We conclude that if S ∩ [a, b] ≠ ∅, then FindAny returns an element in S ∩ [a, b]. The fact that FindAny takes constant time follows from Theorem 1 and Corollary 1, since only a constant number of boolean and arithmetic operations is performed, plus two calls to msb and three sparse array lookups. The correctness of Report and the O(1 + k) time bound follow from Lemma 1.

The space required by the data structure depends on the sizes of the three sparse arrays A, B and V. The number of internal levels of T with depth mod H = 0 is w/H, and therefore the number of initialized entries in A is at most n·w/H = O(n√w). Similarly, the number of initialized entries in B due to layer(u) = u is at most n·w/H. For the number of initialized entries in B due to A[layer(u)] = 1, we observe that the subtree T_z of height H rooted at such a node z = layer(u) by definition contains at least one node from V. If T_z contains k_z ≥ 1 nodes from V, then at most k_z + 1 leaves of T_z are nodes in P, and we have |T_z ∩ P| ≤ (k_z + 1)·H ≤ 2·k_z·H. Since T_z thus contributes at most 2H·|T_z ∩ V| entries to B and |V| ≤ n, the total number of initialized entries contributed to B due to A[layer(u)] = 1 is bounded by 2Hn. The number of initialized entries in B is therefore bounded by 2Hn + n·w/H = O(n√w). Finally, by definition, V contains at most n initialized entries.

Each entry of A, B and V requires 1, O(log w) and O(w) bits respectively, and A, B and V have O(n√w), O(n√w) and at most n initialized entries respectively. The total number of words for storing the three sparse arrays by Corollary 1 is therefore O((n√w·log w + n·w)/w) = O(n). It follows that the total space required for storing the data structure is O(n) words. □

4 Construction

In this section we describe how to construct the data structure of the previous section in expected time O(n√w).

Theorem 5 Given an unordered set S of n distinct integers, each of w bits, the range reporting data structure of Section 3 can be constructed in expected time O(n√w).

Proof. Initially S can be sorted in space O(n) with the algorithm of Thorup [23] in time O(n·(log log n)²) = O(n√w), or with the randomized algorithm of Andersson et al. [5] in expected time O(n·log log n) = O(n√w). Therefore without loss of generality we can assume S = {e_1, ..., e_n}, where e_i < e_(i+1) for 1 ≤ i < n. We observe that v ∈ V if, and only if, v is the root or v is the nearest common ancestor of e_i and e_(i+1) for some i, where 1 ≤ i < n.
As for the FindAny query, we can by Fact 1 find the nearest common ancestor v_i ∈ V induced by e_i and e_(i+1) in constant time by the expression

v_i = ((1 ≪ w) + e_i) ≫ (1 + msb(e_i ⊕ e_(i+1))).

The nodes v ∈ V form, via the pointers V[v].left and V[v].right, a binary tree T_V. The sequence v_1, ..., v_(n−1) forms an inorder traversal of T_V. Furthermore, the nodes satisfy heap order with respect to their depths in T. Recall that depth(v_i) = msb(v_i) can be computed in constant time. The inorder together with the heap order on the depths of the v_i nodes uniquely defines T_V, since these are exactly the constraints determining the shape of the treaps introduced by Seidel and Aragon [22]. By applying an O(n) time treap construction algorithm [22] to v_1, v_2, ..., v_(n−1), we get the required left and right pointers for V. The min and max fields for the nodes in V can then be constructed in a bottom-up traversal of T_V in time O(n).
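The observation that the branching nodes are exactly the nearest common ancestors of consecutive sorted elements (plus the root) can be sketched directly; the helper names below are illustrative, not taken from the paper's pseudocode.

```python
def msb(x):
    # Most significant set-bit position; msb(v) is the depth of internal node v.
    return x.bit_length() - 1 if x > 0 else 0

def nca(i, j, w):
    # Nearest common ancestor of distinct leaves i and j in the tree with
    # 2^w leaves, via Fact 1 and the virtual leaf number 2^w + i.
    return ((1 << w) + i) >> (1 + msb(i ^ j))

def branching_nodes(sorted_s, w):
    # v_i = nca(e_i, e_{i+1}) for consecutive sorted elements; the resulting
    # sequence is an inorder traversal of T_V and is heap-ordered by depth,
    # which is what makes the O(n) treap construction applicable.
    return [nca(sorted_s[k], sorted_s[k + 1], w)
            for k in range(len(sorted_s) - 1)]
```

For instance, for S = {5, 6, 7} and w = 4 the sequence is [5, 11]: node 11 lies in the subtree of node 5, consistent with the heap order on depths.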

The information to be stored in the arrays A and B can be constructed by another traversal of T_V, in time linear in the number of nodes to be initialized. Consider an edge (u, v) in T_V, where v is the parent of u in T_V, i.e. v is the nearest ancestor of u in T that is a branching node, or v is the root. Let u = u_0, u_1, ..., u_k = v be the nodes on the path from u to v in T, where depth(u_j) = depth(u_(j+1)) + 1. While processing the edge (u, v) we compute the information to be stored in the sparse arrays for the nodes u_0, u_1, ..., u_(k−1), i.e. the nodes on the path from u to v, excluding v. From the definitions of A and B we get the following. For the array A we store A[layer(u)] = 1 if depth(layer(u)) > depth(v), and A[u_j] = 0 for all j, 0 ≤ j < k, where depth(v) < depth(u_j) < depth(layer(u)) and depth(u_j) mod H = 0. For the array B we store B[u_j] = depth(u_j) − depth(v) for all j, 0 ≤ j < k, where depth(u_j) mod H = 0, or ⌊depth(u_j)/H⌋ = ⌊depth(v)/H⌋, or depth(u_j) ≥ depth(layer(u)). Finally, we store for the root A[1] = 1 and B[1] = 0. Constructing the three sparse arrays, after having identified the O(n√w) entries to be initialized, by Corollary 1 takes expected time O(n√w). □

5 Approximate range counting

In this section we provide a data structure for approximate range counting. Let S ⊆ U denote the input set, and let n denote the size of S. The data structure uses space O(n) words such that we can support Count_ε in constant time, for any constant ε > 0. We assume S has been preprocessed such that in constant time we can compute FindAny(a, b) for all a, b ∈ U. In addition we have a sparse array such that for each element x ∈ S we can compute rank_S(x) in constant time. Both these data structures use O(n) space. Define count(a, b) = |S ∩ [a, b]|. We need to build a data structure which for any a, b ∈ U computes an integer k such that count(a, b) ≤ k ≤ (1 + ε)·count(a, b). In the following we use the observation that for a, b ∈ S with a ≤ b it is easy to compute the exact value of count(a, b): this value equals rank_S(b) − rank_S(a) + 1, so the computation amounts to two lookups in the sparse array storing the ranks. We reduce the task of computing Count_ε(a, b) to the case where either a or b is in S.
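The exact count for endpoints in S, and the witness-splitting step of the reduction, can be sketched as follows; a sorted list and binary search stand in for the constant-time rank structure.

```python
import bisect

def count_exact(s_sorted, a, b):
    # |S ∩ [a, b]| via two rank computations; for a, b in S this equals
    # rank_S(b) - rank_S(a) + 1, as in the text.
    return bisect.bisect_right(s_sorted, b) - bisect.bisect_left(s_sorted, a)

def count_split(s_sorted, a, b, x):
    # Splitting at a witness x in S ∩ [a, b]: x is counted in both halves,
    # hence the -1. Approximating each half within a factor 1 + eps
    # approximates the whole interval.
    return count_exact(s_sorted, a, x) + count_exact(s_sorted, x, b) - 1
```
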
First, it is easy to check whether S ∩ [a, b] is empty: FindAny(a, b) returns ⊥, in which case we simply return 0 for the query. Hence, assume S ∩ [a, b] is non-empty and let x ∈ S be an element returned by FindAny(a, b). Then for any integers k_1 and k_2 such that count(a, x) ≤ k_1 ≤ (1 + ε)·count(a, x) and count(x, b) ≤ k_2 ≤ (1 + ε)·count(x, b), it holds that

count(a, b) = count(a, x) + count(x, b) − 1 ≤ k_1 + k_2 − 1 ≤ (1 + ε)·(count(a, x) + count(x, b)) − 1 ≤ (1 + 2ε)·count(a, b),

since count(a, b) ≥ 1; rescaling ε by a constant factor gives the desired bound. Hence, we can return Count_ε(a, x) + Count_ε(x, b) − 1 as the answer for Count_ε(a, b). Clearly, both calls to Count_ε satisfy that one of the endpoints, namely x, is in S. In the following we can thus without loss of generality limit ourselves to queries Count_ε(a, b) with a ∈ S (the other case, b ∈ S, is treated symmetrically). We start by describing the additional data structures needed, and then how to compute the approximate range counting query using these.

Define p = ⌈log n⌉ and J = {x ∈ S | (rank_S(x) + 1) mod p = 0} ∪ {max S}. We construct the following additional data structures (see Figure 3).

JumpR: For each element x ∈ J we store the set JumpR(x) = {y ∈ S | count(x, y) = 2^i for some i, 0 ≤ i ≤ p}.

JnodeR: For each element x ∈ S we store the integer JnodeR(x), the successor of x in J.

LN: For each element x ∈ J we store the set LN(x) = {z ∈ S | x ≤ z ≤ JnodeR(x)}.

Figure 3: Extension of the data structure to support Count_ε queries, illustrating the sets JumpR(10) and LN(10) for an example set S with p = ⌈log n⌉.

Each of the sets JumpR(x) and LN(x) has size bounded by p + 1 ≤ log u + 2, and hence, using the Q-heaps from Corollary 2, we can compute predecessors within these small sets in constant time. These Q-heaps have space cost linear in the set sizes. Since the total number of elements in the structures JumpR and LN is O(|S|), the total space cost for these structures is O(|S|). Furthermore, for the elements of S given in sorted order, the total construction time for these data structures is also O(|S|).

To determine Count_ε(a, b), where a ∈ S, we iterate the following computation until the desired precision of the answer is obtained; the variable c accumulates an exact count for the prefix of [a, b] processed so far. Let j = JnodeR(a). If b ≤ j, we return c + count(a, Pred_LN(a)(b)), which is exact. Otherwise j < b, and we increase c by count(a, j) − 1. Let y = Pred_JumpR(j)(b) and i = rank_JumpR(j)(y) − 1. Now count(j, y) = 2^i and count(j, b) < 2^(i+1). We increase c by 2^i, so that c = count(a, y) and count(y, b) ≤ 2^i. If y = b, we return c. If 2^i ≤ ε·c, we are also satisfied and return c + 2^i. Otherwise we iterate once more, now to determine the remaining count for the interval [y, b].

Theorem 6 The data structure uses space O(|S|) words and supports Count_ε in constant time for any constant ε > 0.

Proof. From the observations above we conclude that the structure uses space O(n) and expected preprocessing time O(n). Each iteration takes constant time, so it remains to bound the number of iterations by a constant depending only on ε. Let 2^(i_t) denote the power of two added to c in the t-th iteration. Since the t-th iteration continues on an interval [y, b] with count(y, b) ≤ 2^(i_t), the exponents satisfy i_1 ≥ i_2 ≥ ..., and after t iterations c ≥ 2^(i_1) + ... + 2^(i_t) ≥ t·2^(i_t). In the l-th iteration we either return the exact value c = count(a, b), or we return c + 2^(i_l), where count(a, b) ≤ c + 2^(i_l) and 2^(i_l) ≤ ε·c ≤ ε·count(a, b), i.e. a value of at most (1 + ε)·count(a, b). Since c ≥ l·2^(i_l), the test 2^(i_l) ≤ ε·c succeeds at the latest when l ≥ ⌈1/ε⌉, and the number of iterations is therefore at most ⌈1/ε⌉ + 1 = O(1). □

References

[1] P. K. Agarwal. Range searching. In Handbook of Discrete and Computational Geometry, CRC Press, 1997.

[2] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116-1127, 1988.

[3] M. Ajtai. A lower bound for finding predecessors in Yao's cell probe model. Combinatorica, 8:235-247, 1988.

[4] A. Amir, A. Efrat, P. Indyk, and H. Samet. Efficient regular data structures and algorithms for location and proximity problems. In 40th IEEE Symposium on Foundations of Computer Science (FOCS), 1999.

[5] A. Andersson, T. Hagerup, S. Nilsson, and R. Raman. Sorting in linear time? Journal of Computer and System Sciences, 57(1):74-93, 1998.

[6] A. Andersson and O. Petersson. Approximate indexed lists. Journal of Algorithms, 29(2), 1998.

[7] A. Andersson and M. Thorup. Tight(er) worst-case bounds on dynamic searching and priority queues. In 32nd ACM Symposium on Theory of Computing (STOC), 2000.

[8] P. Beame and F. Fich. Optimal bounds for the predecessor problem. In 31st ACM Symposium on Theory of Computing (STOC), 1999.

[9] K. L. Clarkson. An algorithm for approximate closest-point queries. In Proceedings of the 10th Annual Symposium on Computational Geometry, Stony Brook, NY, USA, 1994. ACM Press.

[10] M. Dietzfelbinger. Universal hashing and k-wise independent random variables via integer arithmetic without primes. In 13th Annual Symposium on Theoretical Aspects of Computer Science (STACS), volume 1046 of Lecture Notes in Computer Science, pages 569-580. Springer-Verlag, Berlin, 1996.

[11] R. W. Floyd. Algorithm 245: Treesort3. Communications of the ACM, 7(12):701, 1964.

[12] M. L. Fredman, J. Komlós, and E. Szemerédi. Storing a sparse table with O(1) worst case access time. Journal of the ACM, 31(3):538-544, 1984.

[13] M. L. Fredman and D. E. Willard. Surpassing the information theoretic bound with fusion trees. Journal of Computer and System Sciences, 47:424-436, 1993.

[14] M. L. Fredman and D. E. Willard. Trans-dichotomous algorithms for minimum spanning trees and shortest paths. Journal of Computer and System Sciences, 48:533-551, 1994.

[15] D. Harel and R. E. Tarjan.
Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338-355, 1984.

[16] C. T. M. Jacobs and P. van Emde Boas. Two results on tables. Information Processing Letters, 22(1):43-48, 1986.

[17] Y. Matias, J. S. Vitter, and N. E. Young. Approximate data structures with applications. In Proc. 5th ACM-SIAM Symposium on Discrete Algorithms (SODA), 1994.

15 [18] K. Mehlhorn. Data Structures and Algorithms: 3. Multidimensional Searching and Computational Geometry. Springer, [19] P. B. Miltersen, N. Nisan, S. Safra, and A. Wigderson. On data structures and asymmetric communication complexity. Journal of Computer and System Sciences, 57(1):37 49, [20] F. P Preparata and M. Shamos. Computational Geometry. Springer-Verlag, New York, 1985, [21] J. P. Schmidt and A. Siegel. The spatial complexity of oblivious -probe hash functionso. SIAM Journal of Computing, 19(5): , [22] R. Seidel and C. R. Aragon. Randomized search trees. Algorithmica, 16(4/5): , [23] M. Thorup. Faster deterministic sorting and priority queues in linear space. In Proc. 9th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages , [24] D. E. Willard. Log-logarithmic worst-case range queries are possible in space Ƶ. Information Processing Letters, 17(2):81 84, 24 August [25] J. W. J. Williams. Algorithm 232: Heapsort. Communications of the ACM, 7(6): ,


More information

Data Structure. IBPS SO (IT- Officer) Exam 2017

Data Structure. IBPS SO (IT- Officer) Exam 2017 Data Structure IBPS SO (IT- Officer) Exam 2017 Data Structure: In computer science, a data structure is a way of storing and organizing data in a computer s memory so that it can be used efficiently. Data

More information