
Comparison of Compressed Quadtrees and Compressed k-d Tries for Range Search

Nathan Scott

December 19, 2005

1 Introduction

There is an enormous number of data structures in existence to represent spatial data, as evidenced by the survey provided by Samet in [6]. While there are many operations one would like to perform on spatial data, a common one is an orthogonal range search. An orthogonal range query provides a hyperrectangular region, parallel to the axes, and asks what points or objects are contained within it. Performance is usually measured with respect to the number of points contained in the structure, considering the dimension fixed. This project considers two particular data structures, the Compressed Point-Region Quadtree (PRQT) and the Compressed k-d Trie (KDT), and compares their experimental results for range search, with particular attention to how well each structure handles the move to higher dimensions. Sections 2 and 3 provide a description of the compressed PRQT and KDT, respectively. Section 4 presents the experimental results, showing the average size of a PRQT (the size of the KDT is constant for a given number of points) and the average number of nodes visited in range queries, broken down by the number of points found. Finally, Section 5 discusses possible explanations for the results obtained, as well as some commentary on possible improvements to be made.

2 Compressed Point-Region Quadtrees

Point-Region Quadtrees (PRQTs) are a way of storing 2-dimensional point data. A PRQT is a tree with a branching factor of 4. Each node covers a square region of space (referred to hereafter as a quad). Points are stored at the leaf level, and internal nodes divide their quad into 4 subquadrants (North-East, South-East, South-West, and North-West) in such a way that the quads represented by the leaf nodes each contain at most one point. Figure 1 shows a simple PRQT example.

Figure 1: An example of a PRQT and a visualization of how space is divided by it.

PRQTs can be extended to more than 2 dimensions. In 3 dimensions, internal nodes divide their cover space into 8 subregions, and such a tree is called a PR octree. In k dimensions, internal nodes divide their cover space into $2^k$ subregions, and we could refer to the tree as a k-dimensional hyperoctree. Throughout this paper, the terms PRQT, quadtree, and quad refer to the k-dimensional analogues of the 2-d definitions above, despite the slightly inaccurate usage of the word quad. In this report, internal nodes are sometimes referred to as grey nodes. Leaf nodes are either white (when they are empty) or black (when they contain a point). The midpoint of a node is the point within the quad covered by the node that is the midpoint along every dimension of the quad.

One problem with the PRQT is that the size of the resulting tree for a given set of data is highly dependent on the distribution of points. Given two points with arbitrarily small Euclidean distance between them, the height of the resulting tree can be arbitrarily large [6]. Figure 2 shows an example of this.
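To make the node taxonomy concrete, here is a minimal sketch of one possible node representation in C++ (the paper's implementation language). The names, the fixed-size child array, and the encoding of a quad by its midpoint and half-width are assumptions for illustration, not details taken from the paper.

    #include <array>

    constexpr int K = 2;                  // dimension of the data (illustrative)
    constexpr int NUM_CHILDREN = 1 << K;  // branching factor 2^k

    enum class Colour { White, Black, Grey };

    // A PRQT node covering a (hyper)quad encoded by midpoint and half-width.
    struct PRQTNode {
        Colour colour = Colour::White;
        std::array<double, K> mid{};      // the node's midpoint
        double halfWidth = 0.0;           // quad spans mid[d] +/- halfWidth
        std::array<double, K> point{};    // stored point (black nodes only)
        std::array<PRQTNode*, NUM_CHILDREN> child{};  // grey nodes only
    };

    // Which subquad of n contains p: one bit of the index per dimension.
    int subquadIndex(const PRQTNode& n, const std::array<double, K>& p) {
        int idx = 0;
        for (int d = 0; d < K; ++d)
            if (p[d] >= n.mid[d]) idx |= (1 << d);
        return idx;
    }

The bit-per-dimension child index is one common convention; any fixed mapping from subquadrants to array slots works equally well.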

Clarkson [2] introduced a compressed version of the PRQT. The compressed PRQT is identical to the ordinary PRQT, with the additional constraint that every internal node in the tree has at least two non-empty children. In this way, only internal nodes that provide useful information (that is, divide the points within their cover space into different quads) are retained. Figure 3 shows the compressed quadtree resolution of the example from figure 2.

Figure 2: An example of a very tall PRQT with few points.

2.1 Construction

Different algorithms have been presented to construct a compressed PRQT. Clarkson initially described a randomized algorithm that constructs a k-d compressed PRQT from n points in $O(c^k n \log n)$ time. Bern gives a deterministic algorithm with a running time in $O((ck)^k n \log n)$.

Assuming a uniformly distributed data set, a compressed PRQT that contains n points has height $O(\log_{2^k} n)$. The branching factor of a k-d PRQT is $2^k$. Thus a compressed PRQT containing n points in this situation will have O(n) nodes. This presumes a nice, evenly spaced data set where there are no white nodes and the tree is no taller than it needs to be.
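As a quick sanity check on the uniform-case height (this derivation is an addition, not from the original): with n points spread evenly, each level of a tree with branching factor $2^k$ divides the number of points per quad by roughly that factor, so quads shrink to one point per leaf at height h where

$$(2^k)^h \approx n \quad\Longrightarrow\quad h \approx \log_{2^k} n = \frac{\log_2 n}{k}.$$

With at most one point per leaf and no chains of one-child internal nodes, the total node count is dominated by the n leaves, which is the O(n) bound stated above.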

Figure 3: An example of a compressed PRQT.

In the worst case a compressed PRQT can still have height O(n), and so contain $O(n 2^k)$ nodes (each of the at most n − 1 internal nodes contributes $2^k$ children).

This project uses a straightforward, iterative insertion method to construct a compressed PRQT. Given a list of points, each point is inserted sequentially, as if the points were being encountered in a dynamic environment, without any preprocessing. Thus the complexity of the construction algorithm used is O(n f(n)), where f(n) is the complexity of the insertion operation. This project was concerned with analysing performance for range search, so efficient construction was not a priority.

2.2 Insertion

To insert a point P into a compressed PRQT, we start at the root and follow the tree down to the leaf level, similar to searching a binary search tree. At each grey node encountered, the point P may lie outside of that node's quad, because of compression. Presuming this does not happen, we will eventually reach a leaf node A. If A is white, we replace it with a black node covering the same quad, containing the point P. If A is black, we replace it with a new grey node G covering A's quad. If the point contained in A and P lie within different subquads of G, we insert them into the appropriate ones. Otherwise, we shrink G's quad down until the two points lie in different subquads. These three cases are illustrated in figure 4.

If along our traversal of the tree we reach a grey node G and P is outside of G's quad, we have to reconfigure the tree at this point.

Figure 4: Examples of the three different possibilities when inserting a new point into a PRQT.

We construct a new grey node G′ that covers the appropriate subquad of the parent of G. Then we perform the shrinking operation described above until P and the midpoint of G lie within different subquads of G′.

As currently implemented, the number of subdivisions required by the shrinking process depends on the Euclidean distance between the two points. Suppose the width of all of space (the space covered by the root node) is W, the Euclidean distance between the two points being separated via the shrinking process is d, and the number of divisions of the quad is q. Each division reduces the covered space by a factor of 2 along each dimension, so after q divisions, for the two points to necessarily fall into two different subquads (even in the worst case), $d \ge W/2^{q+1}$ must hold. Thus $2^{q+1} \ge W/d$, or $q \ge \log W - \log d - 1$. Therefore $\Omega(\log W)$ divisions will occur in the worst case. Presuming the more favorable height of a compressed PRQT (see section 2.1) of $O(\log_{2^k} n)$, insertion runs in time $\Omega(\log_{2^k} n + \log W)$ in the worst case. This analysis assumes integer points; otherwise it is not even obvious that the shrinking process will terminate (given infinitesimally small d).

2.3 Range Search

Advanced algorithms have been given to perform point searches on compressed PRQTs, and most extensions of the PRQT involve auxiliary data structures [1]. However, this project uses the straightforward range search algorithm implied by [6]. Given a hyperrectangular query region and a node to search, we recursively search the children of that node based on the intersection of the query region and the quads covered by the children. There are three possible outcomes of this intersection, described in figure 5.

Figure 5: The three outcomes of comparing a quad with the query rectangle.
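This three-way classification is the only geometric primitive the search needs. Below is a hedged C++ sketch of how such a test might look, under the same quad encoding assumed earlier (midpoint plus half-width); the names mirror RECT INT and its three outcomes in Algorithm 1, but the implementation details are assumptions, not the paper's code.

    #include <array>

    constexpr int K = 2;  // dimension (illustrative)

    enum class Overlap { NoIntersect, Intersect, Within };

    struct Quad {
        std::array<double, K> mid{};
        double halfWidth = 0.0;
    };

    struct QueryRect {
        std::array<double, K> lo{}, hi{};  // axis-aligned bounds
    };

    // Classify a quad against the query rectangle:
    //   Within      - the quad lies entirely inside the query rectangle
    //   NoIntersect - the quad and the query rectangle are disjoint
    //   Intersect   - anything in between (partial overlap)
    Overlap rectInt(const Quad& q, const QueryRect& r) {
        bool within = true;
        for (int d = 0; d < K; ++d) {
            double qlo = q.mid[d] - q.halfWidth;
            double qhi = q.mid[d] + q.halfWidth;
            if (qhi < r.lo[d] || qlo > r.hi[d])
                return Overlap::NoIntersect;  // disjoint along dimension d
            if (qlo < r.lo[d] || qhi > r.hi[d])
                within = false;               // quad sticks out along d
        }
        return within ? Overlap::Within : Overlap::Intersect;
    }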

Algorithm 1 RANGE_QUERY(PRQTNode N, QueryRectangle Q, PointList result)
{RECT_INT decides whether two rectangles intersect, or if one is contained in the other.}
{POINT_IN_RECT decides if a point is contained within a rectangular region.}
{REPORT_ALL_POINTS adds all points contained within a subtree to the list.}
  if N.Colour = WHITE then
    do nothing
  else if N.Colour = BLACK then
    if POINT_IN_RECT(N.Data, Q) = TRUE then
      add N.Data to result
    end if
  else {N.Colour = GREY}
    for all children C_i of N do
      if RECT_INT(C_i.CoverSpace, Q) = WITHIN then
        REPORT_ALL_POINTS(C_i, result) {This child is entirely within the query rectangle.}
      else if RECT_INT(C_i.CoverSpace, Q) = INTERSECT then
        RANGE_QUERY(C_i, Q, result) {Recurse.}
      else {RECT_INT returned NO_INTERSECT}
        continue
      end if
    end for
  end if

Analysis of range search is tricky. A worst case analysis gives a running time of O(N), where N is the number of nodes in the tree. This is easily seen by constructing a case where the data lies along the perimeter of the shell of space that the tree covers. Assuming a uniform distribution of points, a worst case query would be one that intersects the entire space along all dimensions except for one, i, whose midpoint it just barely straddles. At the root, all $2^k$ children are hit by such a query. Within each of these, the query intersects half of their children (the ones closest to the midpoint along dimension i), that is, $2^{k-1}$ of them. So at level 1 (level 0 being the root) there are $2^k \cdot 2^{k-1} = 2^{2k-1}$ nodes visited. At level 2, each of the $2^{2k-1}$ nodes at level 1 has another $2^{k-1}$ intersecting children,

for a total of $2^{3k-2}$. Thus in the end a total of $\sum_{i=1}^{\log_{2^k} n} 2^{ik-i+1}$ nodes are visited. See figure 6 for illustrations of these examples.

Figure 6: Examples of bad query rectangles for a PRQT with a poor distribution of data and with uniformly distributed data.
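Evaluating that sum makes the growth explicit; the following short derivation is an addition, not from the original. The terms grow geometrically with ratio $2^{k-1} \ge 2$, so the sum is dominated by its last term, where $h = \log_{2^k} n$, i.e. $2^h = n^{1/k}$:

$$\sum_{i=1}^{h} 2^{ik-i+1} = O\!\left(2^{hk-h+1}\right) = O\!\left(n \cdot n^{-1/k}\right) = O\!\left(n^{1-1/k}\right),$$

which is consistent with the classical $O(n^{1-1/k})$ behaviour of orthogonal range search in quadtree-like structures for fixed k.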

3 Compressed k-d Tries

A trie is a well-known data structure that stores finite sequences over some alphabet for efficient search and retrieval. A trie is a tree of nodes, wherein internal nodes at level i branch on the i-th symbol of the sequences stored. The data itself is stored at the leaf level. The Patricia trie (or compressed trie), due to Morrison, is obtained, much as with PRQTs, by adding the constraint that every internal node has at least 2 children. Sequences of symbols that would have formed a chain of one-child internal nodes are collected into a skip string that internal nodes now store. A binary Patricia trie is a Patricia trie on the alphabet {0, 1}, and therefore all of its internal nodes have exactly 2 children.

Shi and Nickerson showed how to use a binary Patricia trie to perform range queries on combined spatial and textual data [7]. This project is only concerned with spatial data, but this is an easy adaptation of their work. Hereafter these binary Patricia tries are referred to as compressed k-d tries or compressed KDTs. A compressed KDT stores k-d point data by representing the points as binary strings, which are then inserted into the trie. The binary strings are obtained by interleaving the bits of each dimension's value. Let P be a point with b bits of precision along k dimensions, so that $P = (p_{11}p_{12}\ldots p_{1b},\; p_{21}p_{22}\ldots p_{2b},\; \ldots,\; p_{k1}p_{k2}\ldots p_{kb})$. Then $P' = p_{11}p_{21}\ldots p_{k1}\; p_{12}p_{22}\ldots p_{k2}\; \ldots\; p_{1b}p_{2b}\ldots p_{kb}$ is the binary string of interleaved bits that the trie will store. See figure 7 for an example KDT for a set of points.

3.1 Construction

As we will see in section 3.2, inserting a point into a KDT creates exactly two more nodes, and a KDT with 1 point has exactly one node. Therefore a KDT containing n points has exactly 2n − 1 nodes. In the worst case, a KDT containing n points will have a height of O(n) nodes. Assuming a uniform distribution of points, we would expect each node to split the points roughly in half, and so a balanced KDT of height O(log n) to result. In terms of the actual bit strings stored, a KDT containing n points of k-d data with b bits per dimension has a constant height of kb bits.
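The interleaving step is exactly the computation of a Morton (Z-order) key. A small sketch follows; the fixed-width types, the function name, and the choice of 16 bits of precision are assumptions for illustration, not details from the paper.

    #include <array>
    #include <cstdint>

    constexpr int K = 2;   // dimensions (illustrative)
    constexpr int B = 16;  // bits of precision per dimension (illustrative)

    // Interleave the bits of a k-d integer point into a single Morton key:
    // the output takes one bit from each dimension in turn, starting with
    // the most significant bit of each coordinate.
    std::uint64_t interleave(const std::array<std::uint32_t, K>& p) {
        std::uint64_t key = 0;
        for (int j = B - 1; j >= 0; --j) {       // most significant bit first
            for (int d = 0; d < K; ++d) {
                key = (key << 1) | ((p[d] >> j) & 1u);
            }
        }
        return key;
    }

    // Example: with B = 4, interleave({4, 12}) = interleave({0100, 1100})
    // yields 01110000, matching the (4, 12) example in figure 8.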

Figure 7: Simple example of a KDT containing 2-d integer points, 4 bits per dimension.

Construction of compressed KDTs was implemented the same way as for PRQTs: repeated insertion of points. Thus the construction algorithm used was again O(n f(n)), where f(n) is the running time of the insertion algorithm. Our insertion algorithm has f(n) proportional to the height of the trie, and so, presuming the resulting KDT is balanced, construction takes O(n log n) time.

3.2 Insertion

The insertion algorithm implemented is an adaptation of the one presented in [7]. Starting at the root, we follow the tree down for as long as the prefix of our point's bit string matches the path that we have followed so far. Once we reach a node where the new point diverges from that node's skip string, we replace that node with a new node with 2 children: one is the remainder of the subtree that followed the node we split on, and one is a new leaf node storing the new point. See figure 8 for an example; a code sketch follows below. The running time of this algorithm is O(h), where h is the height of the trie.
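The split can be written quite compactly. Below is a minimal sketch, assuming keys are equal-length '0'/'1' strings (such as the interleaved bit strings above) and that the branch bit is implied by the child index; none of this is the paper's actual code.

    #include <cstddef>
    #include <memory>
    #include <string>

    struct KDTNode {
        std::string skip;                   // bits consumed entering this node
        std::unique_ptr<KDTNode> child[2];  // both null for a leaf
        bool isLeaf() const { return !child[0] && !child[1]; }
    };

    void insert(std::unique_ptr<KDTNode>& node, const std::string& key,
                std::size_t pos = 0) {
        if (!node) {                        // empty trie: one leaf holds the key
            node = std::make_unique<KDTNode>();
            node->skip = key.substr(pos);
            return;
        }
        // Walk this node's skip string until the key diverges from it.
        std::size_t i = 0;
        while (i < node->skip.size() && node->skip[i] == key[pos + i]) ++i;

        if (i == node->skip.size()) {
            if (node->isLeaf()) return;     // duplicate key: nothing to do
            int b = key[pos + i] - '0';     // branch on the next bit
            insert(node->child[b], key, pos + i + 1);
            return;
        }
        // Diverged inside the skip string: split the node, creating exactly
        // two new nodes (an internal node and a leaf), as section 3.1 states.
        auto split = std::make_unique<KDTNode>();
        split->skip = node->skip.substr(0, i);  // common prefix
        int oldBit = node->skip[i] - '0';
        auto leaf = std::make_unique<KDTNode>();
        leaf->skip = key.substr(pos + i + 1);   // bits after the branch bit
        node->skip.erase(0, i + 1);             // old node keeps its remainder
        split->child[oldBit] = std::move(node);
        split->child[1 - oldBit] = std::move(leaf);
        node = std::move(split);
    }

Building the trie is then just repeated insertion, matching the construction method described above.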

Figure 8: Inserting the point (4, 12) = (0100, 1100) = (01110000) into a KDT.

3.3 Range Search

The range search algorithm implemented is adapted from [7]. At each bit $p_i$ visited along the tree (either via a branch or while scanning a skip string), the cover space of the node is divided in half along dimension $i \bmod k$, according to whether $p_i$ is 1 or 0. That dimension of the cover space is compared against the query region to produce outcomes similar to those in section 2.3. If the node's cover space is within the query region (black) along all k dimensions, then we can report all points in its subtree. If the node's cover space is entirely outside the query region (white) along any of the k dimensions, we can prune the subtree rooted at that node. Otherwise, we recurse and continue searching. Pseudocode is given as algorithm 2. [7] provides a thorough analysis of the theoretical performance of range search on KDTs.

Algorithm 2 RANGE_QUERY(KDTNode N, int level, CoverSpace C, QueryRectangle Q, Colour[] RI, PointList result)
{IN_RANGE returns WHITE, GREY, or BLACK depending on the intervals' overlap along the given dimension.}
{GET_COLOUR looks at an array of Colours and determines if it is all black, mixed black and grey, or white.}
{C and RI are passed by value: each recursive call works on its own copy.}
  if N is a leaf then
    if POINT_IN_RECT(N.Data, Q) then
      add N.Data to result
    end if
  else
    i ← 0
    while i < N.SkipString.length do
      p ← level mod k
      if N.SkipString[i] = 0 then
        C.Upper[p] ← (C.Lower[p] + C.Upper[p]) / 2
      else {N.SkipString[i] = 1}
        C.Lower[p] ← (C.Lower[p] + C.Upper[p]) / 2
      end if
      RI[p] ← IN_RANGE(C, Q, p)
      i ← i + 1
      level ← level + 1
    end while
    colour ← GET_COLOUR(RI)
    if colour = GREY then {We need to recurse; if colour = WHITE the subtree is pruned.}
      p ← level mod k
      if N.Left ≠ nil then
        oldUpper ← C.Upper[p]
        C.Upper[p] ← (C.Lower[p] + C.Upper[p]) / 2
        RI[p] ← IN_RANGE(C, Q, p)
        RANGE_QUERY(N.Left, level + 1, C, Q, RI, result) {the branch bit occupies one level}
        C.Upper[p] ← oldUpper {restore the bound before descending right}
      end if
      if N.Right ≠ nil then
        C.Lower[p] ← (C.Lower[p] + C.Upper[p]) / 2
        RI[p] ← IN_RANGE(C, Q, p)
        RANGE_QUERY(N.Right, level + 1, C, Q, RI, result)
      end if
    else if colour = BLACK then
      REPORT_ALL_POINTS(N, result)
    end if
  end if

4 Results

Both the PRQT and KDT were implemented in C++ using the Standard Template Library (STL). They were compiled with g++ with optimization enabled. Source and makefiles are available online at: http://v5o5jotqkgfu3btr91t7w5fhzedjaoaz8igl.unbf.ca/~y8ju8/spatial/ Alternatively, visit http://people.unb.ca/~y8ju8/ and follow the appropriate link.

The structures were tested for dimensions $2 \le k \le 8$ and for n = 1000, 10000, 100000, 1000000. For each case, 20 random data sets were generated and 500 random range queries were performed. Tests were performed on a machine running Fedora Core Linux, with an AMD processor (2.2 GHz) and 2 gigabytes of RAM.

4.1 Space

The limiting factor in comparing the two structures directly was the failure of the PRQT to scale to higher dimensions. As discussed in section 3.1, the number of nodes in a KDT depends only on the number of points, not on the dimension or distribution. The number of nodes in a PRQT, on the other hand, is highly dependent on both the dimension of the data and the distribution of points. The average size of a PRQT with a given number of points was plotted against the dimension of the data. As the dimension increases, the number of nodes in the tree increases exponentially. See figures 9 and 10.

4.2 Range Search

For each range query performed, the number of points reported (A) was classified into one of 4 groups: $A \in [0, \sqrt{\log n})$, $A \in [\sqrt{\log n}, \log n)$, $A \in [\log n, (\log n)^2)$, and $A \in [(\log n)^2, n]$. For example, with n = 1000000 and logs taken base 2 (log n ≈ 20), the groups are roughly [0, 4.5), [4.5, 20), [20, 400), and [400, 1000000]. The number of nodes visited for each of the searches in these categories was then averaged. Results for the first three groups were plotted against the dimension of the data for n = 1000, 10000, 100000, 1000000. See figures 11, 12, 13, and 14.

Figure 9: Average size of a PRQT with 1000 and 10000 points in k dimensions.

Figure 10: Average size of a PRQT with 100000 and 1000000 points in k dimensions.

Figure 11: Average number of nodes visited during a range search for a PRQT and a KDT with 1000 points in k dimensions.

Figure 12: Average number of nodes visited during a range search for a PRQT and a KDT with 10000 points in k dimensions.

Figure 13: Average number of nodes visited during a range search for a PRQT and a KDT with 100000 points in k dimensions.

Figure 14: Average number of nodes visited during a range search for a PRQT and a KDT with 1000000 points in k dimensions.

5 Conclusions

It seems clear that the KDT outperforms the PRQT for range search. While the two are comparable at lower dimensions, the PRQT quickly becomes unwieldy at dimensions $k \ge 4$. In fact, looking closely at the number of nodes visited by the PRQT for n = 1000, 10000, 100000, at a certain value of k we visit more nodes than there are points in the structure. While an actual performance comparison against a naïve search was not done, it seems clear that for sparse data a naïve search should outperform the PRQT. The reason is that the PRQT contains huge numbers of wasted nodes that represent nothing but empty space, yet must be visited by the search algorithm nonetheless. The PRQT guarantees that every internal node has at least 2 non-empty children. For k = 2, this means that at least half of the children of an internal node contain data; put another way, at most half of the children of internal nodes are wasted. However, as k grows, the ratio of informative nodes to wasted ones shrinks exponentially. For example, with k = 8 the compression only guarantees that $\frac{2}{2^8} = \frac{1}{128}$ of the children actually contain data.

However, for a fixed k it seems that, given a large enough n, the PRQT becomes more efficient in the sense that it spends less time searching empty space. For example, with k = 8 and n = 1000000, the average range query did not visit more than 1000000 nodes, whereas with fewer points more nodes were visited than there were points in the structure. The overall number of nodes visited is still huge compared to the KDT.

The other major factor to consider is storage. The huge number of nodes in the PRQT quickly made it an unfeasible data structure for high-dimensional data. While results are not presented for k > 8, the KDT has been tested successfully on individual cases of 1000000 points of 20-dimensional data. For 1000000 points, PRQTs with k > 8 dimensions could not be successfully constructed on the test system.

Since the branching factor of the PRQT is $2^k$, an exponential (in the dimension) number of nodes is added every time two points lie within the same quad. For small values of k (2, 3) this seems reasonable. For higher values of k the tree quickly becomes too large to handle. An illustrative metaphor is to consider the leaf nodes of a PRQT like buckets in some sort of hashing system. When we insert a k-d point into the PRQT, if there is a collision (in our case, two points lying too close together), it costs us $2^k$ in storage. The total space then depends on the

number of these collisions among thousands of points, rather than directly on the number of points themselves.

There are several different improvements to and implementations of the basic PRQT, some of which are summarized in [1]. It is not clear to what extent the problems with the PRQT described here apply to those other structures. The most obvious improvement I would recommend for the PRQT is to simply not construct the white nodes, and to access the 1 to $2^k$ children of an internal node via some other means, such as a list or a hash table (a sketch of the idea follows below). Accessing a particular child could no longer be considered a constant time operation, but it would likely save an exponential amount of space, as well as an exponential number of nodes visited in range queries. Even if the range search time were slower, the savings in space would at least make the data structure feasible for higher-dimensional data.

In all respects it seems that the KDT outperforms the PRQT, especially with higher-dimensional data. It may be worth noting that analysing the KDT's performance in terms of nodes visited is not entirely accurate, as differing amounts of work are performed at different nodes. However, it still seems to be a reasonable estimate of search performance.
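To make the suggested improvement concrete, here is a hedged sketch of a grey node that stores only its non-empty children in a hash map keyed by subquad index; the names and the choice of std::unordered_map are assumptions for illustration, not part of the original implementation.

    #include <array>
    #include <unordered_map>

    constexpr int K = 8;  // a high dimension, where the dense 2^k array hurts

    struct SparsePRQTNode {
        std::array<double, K> mid{};   // midpoint of the covered quad
        double halfWidth = 0.0;
        // Only non-empty children are stored; an absent key means a white
        // subquad, so no memory or traversal cost is paid for empty space.
        std::unordered_map<unsigned, SparsePRQTNode*> children;
    };

    // Range search then iterates over only the occupied subquads instead of
    // all 2^k slots:
    //   for (auto& [idx, child] : node.children) { /* classify and recurse */ }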

References

[1] Srinivas Aluru and Fatih E. Sevilgen. Dynamic compressed hyperoctrees with application to the n-body problem. In Proceedings of the 19th International Conference on the Foundations of Software Technology and Theoretical Computer Science, pages 21–33, Chennai, India, 1999.

[2] Kenneth Lee Clarkson. Algorithms for Closest-Point Problems. PhD thesis, Stanford University, 1984.

[3] D. Eppstein, M. T. Goodrich, and J. Z. Sun. The skip quadtree: A simple dynamic data structure for multidimensional data. In Proceedings of the 21st ACM Symposium on Computational Geometry, pages 296–305, Pisa, Italy, 2005.

[4] P. Flajolet and C. Puech. Partial match retrieval of multidimensional data. Journal of the Association for Computing Machinery, 33(2):371–401, 1986.

[5] J. A. Orenstein. Multidimensional tries used for associative searching. Information Processing Letters, 14(4):150–157, 1982.

[6] Hanan Samet. Design and Analysis of Spatial Data Structures. Self-published, College Park, MD, USA, 2004.

[7] Q. Shi and B. Nickerson. k-d range search with binary patricia tries. Technical report, University of New Brunswick, Fredericton, NB, 2005.