Indexing Non-uniform Spatial Data


K. V. Ravi Kanth, Divyakant Agrawal, Amr El Abbadi, Ambuj K. Singh
Department of Computer Science
University of California at Santa Barbara
Santa Barbara, CA

Abstract

Non-uniformity in data extents is a general characteristic of spatial data. Indexing such non-uniform data using conventional spatial index structures such as R*-trees is inefficient for two reasons: (1) the non-uniformity increases the likelihood of overlapping index entries, and (2) clustering of non-uniform data is likely to index more dead space than clustering of uniform data. To reduce the impact of these anomalies, we propose a new scheme that promotes data objects to higher levels in tree-based index structures. We examine two criteria for promotion of data objects and evaluate their relative merits using an R*-tree. In experiments on cartographic data, we observe that our promotion criteria yield up to 45% improvement in query performance for an R*-tree.

1 Introduction

Advances in processing technology and storage management over the last decade have opened up new avenues for handling large and complex map and image data. The Alexandria Digital Library (ADL) project is one such effort, with the goal of providing efficient access to data collections (satellite images, census data, aerial photographs, digital line graphs, digital orthophotoquads, SLAR images, and maps) spanning terabytes of storage. While the task of building a digital library is enormous, one important piece of the effort is the design and implementation of efficient index structures. In this paper, we consider the performance of current index structures on geographic and spatial data and present a new approach to indexing that adapts to the data characteristics.

Many indexing schemes [3, 4, 8, 10, 11, 15, 18] have been designed to provide efficient storage and retrieval support for multidimensional data.
Unfortunately, most of these structures are suitable for point data alone. Only a few of them, such as the R-tree variants [1, 8, 16], perform well with rectangular data. Even these structures lose their efficiency when indexing data rectangles with non-uniform areas (extents), which is a common characteristic of geographic data. Indexing such non-uniform data is inefficient for two reasons. First, index entries for non-uniform data are more likely to overlap than those for uniform data. This effect is more pronounced for nested data, where one rectangle encloses several others. Second, clustering of non-uniform data in an index is likely to include more dead space (the part of the data space that is not covered by any data object) than clustering of uniform data. What are the implications for the performance of the index? Overlapping index entries result in a loss of information as to which part of the tree is relevant to a particular query. Indexing dead space implies accessing irrelevant parts of the tree that do not contribute to the result of the query in any way. Thus, the performance of the index structure degrades.

(Work supported in part by a research grant from NSF/ARPA/NASA IRI and NSF instrumentation grant CDA.)

One solution to the above problems is to separate the data into several levels based on their extents and index each level separately. This approach is similar to that of the Level-based Index (LIB) structure [13], which was designed for indexing nested data. The LIB structure decomposes data into several levels based on their nesting and indexes the data at each level separately. Since the data is scattered over several indices, this works well only when the number of indices is small. The same argument applies to any approach that decomposes non-uniform data based on their extents. Hence, we seek to smooth out the non-uniformity in the data by promoting data objects to higher levels in a single index tree.
To appreciate why such an organization can be more efficient than conventional index organizations, consider the example data of Figure 1(a) and examine how it is organized by a popular index structure such as the R*-tree [1]. In this figure, object A contains B and overlaps with C, D, and E. This pattern of data occurs frequently in geographical databases. Assuming a node capacity of 3, an R*-tree organizes the data as shown in Figure 1(b). Note that irrespective of how good the data clustering scheme is in R*-trees (or any conventional index, for that matter), it will have overlapping index entries (g and h, shown in dashed lines in the figure) and cannot avoid indexing a portion of the dead space (e.g., region k in the figure). Berchtold et al. addressed the problem of achieving minimal overlap among entries in their X-tree [2]. However, for extended data objects, which are usually of low dimensionality, the X-tree achieves little or no reduction in overlap and consequently behaves like a conventional R*-tree.

We observe that these anomalies of indexing non-uniform spatial data are a consequence of maintaining all the data objects at the same level (the leaf level) in the tree. Based on this observation, we remove this restriction and reorganize the same set of data objects as in Figure 1(c). Here object A, due to its large extent, is stored at the root level and all others are stored at the leaf level (with i and h as the corresponding index entries). This tree organization solves both problems of non-uniformity: (1) it reduces the overlap among index entries, and (2) it reduces the dead space that is indexed (region k is excluded). Both factors contribute to a decrease in the extents of the index regions in the tree. Since the performance of a tree structure depends on the extents of the index regions, as observed in [12], the promotion-based organization yields better results than the conventional R*-tree.

[Figure 1. Promotion-based indexing in an R*-tree: (a) example of non-uniform data; (b) R*-tree; (c) promotion-based R*-tree.]

In this context, we note that the oversize shelves of Gunther [6, 7] are the first promotion-based scheme in the literature. However, this scheme and its analysis rely heavily on data objects being partitioned to create non-overlapping index regions. Hence, they are suitable only for cell trees [5] and R+-trees [16].
In addition, the oversize shelves do not adapt to the data distribution because promoted objects are never demoted. The analysis also assumes that all nodes at the same level have the same probability of being accessed by a query. In contrast, for a uniform query distribution, Pagel et al. [12] showed that the access probability depends on the area covered by the nodes. In this paper, we propose a new promotion scheme that incorporates Pagel's observations and extends Gunther's promotion schemes to other popular index structures such as R-trees [8], R*-trees [1], and X-trees [2], which may have overlapping index regions. In addition, we adapt to the data distribution by not only promoting but also demoting data objects as and when needed.

The SR-tree [9] is also designed to index non-uniform data. It attempts to smooth out the non-uniformity by cutting every object into several parts: those that enclose or span an index region in the tree (in any dimension), called spanning regions, and those that do not, called remnant regions. The spanning regions are inserted in the node that has an entry for the spanned index region. The remnant regions are simply inserted in the leaf nodes of the tree, just as in an R-tree. Using such an organization, the SR-tree can potentially maintain $O(N)$ information for each object, where $N$ is the number of objects. Consequently, the storage space of the SR-tree is $O(N^2)$. Since all this extra information for each object is maintained in the same tree, the height of the tree may also increase. In contrast, our proposed scheme uses only linear space and is designed to improve index performance without increasing the height of the original tree.

The BV-tree [4] also employs promotion of index regions to disambiguate them. However, the motivation for promotion in BV-trees is to ensure logarithmic access for exact match queries, which are much less common than containment and intersection queries.
In contrast, our scheme is designed to improve the average performance of all types of queries. Besides, it could be combined with BV-trees to achieve worst-case logarithmic performance for exact match queries. Note that the worst-case performance for containment queries on $n$ multidimensional objects in most current structures is $O(n)$ [14].

In this paper, we first examine the effect of promoting data objects on the query performance of an index tree. We observe that query performance depends on two factors: the height of the tree, and the area covered by the index entries of the tree. The height of the tree determines the overhead of accessing a leaf node of the tree. The number of such leaf nodes accessed in a query is subject to the anomalies of indexing non-uniform data. These anomalies are reflected in the area covered by the index entries, and their impact is reduced whenever the area covered is reduced. Therefore, a good index scheme should decrease the area covered without increasing the height of the tree. The promotion-based index of Figure 1 is an example of such an index.

We describe a general scheme for promoting data objects without affecting the height of the tree. We then study two variants of this scheme using different promotion criteria. The first, called nesting-based promotion, uses the concept of spatial nesting to decide on possible candidates for promotion, whereas the second, called extent-based promotion, uses the extents of the data objects to do the same. We experiment with these criteria on R*-trees using real cartographic datasets from the Alexandria Digital Library (ADL) [17]. The ADL has several sources of spatial data (e.g., the Catalog of all its holdings, the Gazetteer, etc.). Each such data collection has a number of objects with varying spatial footprints (the footprint of an object is the minimum bounding rectangle for that object). For instance, the Gazetteer has the spatial footprints for the countries of the world, the states and provinces in each of them, and the cities within each state. We use these collections to evaluate the relative merits of the different promotion criteria and compare them with the conventional R*-tree. In our experiments, we observe that extent-based promotion yields reasonable performance gains, whereas the gains from nesting-based promotion depend on the nesting characteristics of the data. By promoting data objects, both variants reduce the ill effects of non-uniformity in the data without increasing the height of the tree.

The rest of the paper is organized as follows. In the next section, we analyze the impact of promotion on the performance of an index structure. In Section 3, we propose a promotion strategy with two different promotion criteria. In Section 4, we present our experiments with these criteria. Finally, in Section 5, we conclude the paper with pointers for further research on this topic.

2 Effect of Promotion on Index Performance

In this section, we estimate the effect of promoting data objects on the performance of a tree-structured index. To this end, we first estimate the increase in the height of the tree due to promotion of data objects, and then analyze the subsequent query performance.

2.1 Effect of Promotion on Tree Height

We assume the index structure is a balanced tree. Each node of the tree has between a minimum of $m$ and a maximum of $M$ child entries.
Each child entry is a 2-tuple of the form $\langle \text{child footprint}, \text{child node} \rangle$, storing a pointer to the child node and its footprint (child footprint). The footprints represent arbitrary regions in multidimensional space. The exact representation depends on the index structure and is of little interest to us at this point. Consequently, the effects of promotion on the height of the index tree are equally applicable to most spatial index structures such as R-trees [8], R+-trees [16], and R*-trees [1].

Since we are concerned with the worst-case increase in height due to promotion of entries, we assume every node has exactly $m$ children. At any level, we assume that at most a fraction $f$ of the total entries from that level are promoted to the next higher level. We start with the lowest level of the tree, promote entries from this level to the next higher level, and move up the tree in this manner. For a set of $n$ data objects at level 0 (the lowest level), $fn$ entries are promoted to the next level whereas $(1-f)n$ entries remain at that level. These remaining entries are stored in nodes at level 0. Each such node is represented by an index entry at the next level. Hence, with $fn$ entries from level 0 promoted to level 1, the total number of entries at level 1 is $n(1-f)/m + fn$. Of these, only a fraction $k = (1-f)$ of the total entries remain at level 1. The recurrence relation for estimating the number of entries $a_i$ at level $i$ before promotion is therefore:

$$a_i = \left((1-f)/m + f\right) a_{i-1}.$$

Since $a_0 = n$, solving the above recurrence relation yields

$$a_i = \left((1-f)/m + f\right)^i a_0 = \left((1-f)/m + f\right)^i n.$$

Let $c = 1/\left((1-f)/m + f\right)$. The recursion bottoms out at level $i$ whenever the number of entries at that level is no more than $m$. This corresponds to the root level, and no more promotions take place. Using this fact, the height $h$ of the tree is determined as follows:

$$h = \log_c (a_0 / a_h) = \log_c (n/m).$$

Note that when there is no promotion (i.e., $f = 0$), the height is $(\log_m n - 1)$. If one entry out of every $m$ entries (i.e., one from every node) is promoted, the height increases by a factor of $\log_{m/2} 2$. For practical values of $m = 50$, this increase is 17%.

2.2 Performance of an index structure

Next, we estimate the query performance of an index structure in terms of its height and the area covered by its index entries. We consider containment and intersection queries, which are the most common query types for spatial databases. We assume that these queries access all tree nodes whose footprints intersect the query box. The number of such node accesses characterizes the query performance of the index.

Pagel et al. [12] give a performance measure for estimating the number of accesses to the children of a node in terms of the probabilities of a query window intersecting their footprints. They consider different query models based on whether or not the query distribution is dependent on the data distribution. Let $W_{QM}$ denote one of these query models and let $P(q \cap c)$ denote the probability of a query $q$ intersecting the footprint of node $c$. Then, the number of children of node $N$ that intersect query $q$ is denoted by $PM(W_{QM}, q, N)$ and is given by:

$$PM(W_{QM}, q, N) = \sum_{c \in children(N)} P(q \cap c)$$

For simplicity, we assume the query distribution is independent of the data distribution and each query retrieves all data in the window specified. We extend the performance measure to indicate the number of nodes accessed in the whole tree $T$ (rooted at node $T$) for a query $q$ as described below. Since the performance measure is an indication of the node accesses, we refer to it as the node access measure and denote it by $na(q, T)$:

$$na(q, T) = \sum_{n \in children(T)} P((q \cap n) \mid (q \cap T)) \, (1 + na(q, n))$$

Note that we have modified Pagel's performance measure to reflect the conditional probability of a child $n$ of node $T$ being accessed by query $q$ given that node $T$ is already accessed by query $q$. This reflects the order of traversal from a node to its children in a tree structure. Also, note that the access to the root node is not accounted for in this equation. We now simplify the above recurrence relation using the following lemma. We assume $h$ is the height of the tree and $nodes_i(T)$ denotes the set of nodes at level $i$ of the tree rooted at $T$.

Lemma 1

$$\sum_{n \in children(T)} P((q \cap n) \mid (q \cap T)) \, (1 + na(q, n)) = \sum_{0 \le i \le h-1} \; \sum_{n \in nodes_i(T)} P((q \cap n) \mid (q \cap T))$$

Proof: By induction on the height $h$ of the tree.

Basis: For $h = 0$, the tree consists of only the root node with no children, and the above equality is trivially true.

Induction hypothesis: Assume the equality is true for every tree of height no more than $k$. We prove the equality for any tree of height no more than $k + 1$. Let $T$ be the root node of such a tree and let $n_1, \ldots, n_m$ be its children. Then the node access measure for tree $T$ is given by:

$$na(q, T) = \sum_{n \in \{n_1, \ldots, n_m\}} P((q \cap n) \mid (q \cap T)) \, (1 + na(q, n))$$

By the induction hypothesis, we have for each $n \in \{n_1, \ldots, n_m\}$,

$$na(q, n) = \sum_{0 \le i \le k-1} \; \sum_{c \in nodes_i(n)} P((q \cap c) \mid (q \cap n))$$

Therefore, it follows that

$$na(q, T) = \sum_{n \in \{n_1, \ldots, n_m\}} P((q \cap n) \mid (q \cap T)) \left(1 + \sum_{0 \le i \le k-1} \; \sum_{c \in nodes_i(n)} P((q \cap c) \mid (q \cap n))\right)$$

$$= \sum_{n \in nodes_k(T)} P((q \cap n) \mid (q \cap T)) + \sum_{n \in nodes_k(T)} \; \sum_{0 \le i \le k-1} \; \sum_{c \in nodes_i(n)} P((q \cap n) \mid (q \cap T)) \, P((q \cap c) \mid (q \cap n))$$

If node $c$ is a child of node $n$ and node $n$ is a child of node $T$, then the footprint of node $c$ is contained in that of $n$ and the footprint of node $n$ in that of $T$.
Hence the probability that node $c$ intersects a query $q$, given that node $n$ does, is given by:

$$P((q \cap c) \mid (q \cap n)) = P((q \cap c) \mid (q \cap T)) \, / \, P((q \cap n) \mid (q \cap T))$$

Consequently, the node access measure for a tree rooted at node $T$ becomes

$$na(q, T) = \sum_{n \in nodes_k(T)} P((q \cap n) \mid (q \cap T)) + \sum_{n \in nodes_k(T)} \; \sum_{0 \le i \le k-1} \; \sum_{c \in nodes_i(n)} P((q \cap c) \mid (q \cap T))$$

$$= \sum_{n \in nodes_k(T)} P((q \cap n) \mid (q \cap T)) + \sum_{0 \le i \le k-1} \; \sum_{c \in nodes_i(T)} P((q \cap c) \mid (q \cap T))$$

$$= \sum_{0 \le i \le h-1} \; \sum_{n \in nodes_i(T)} P((q \cap n) \mid (q \cap T))$$

Hence the equality is true for trees of height $k + 1$.

Next, we estimate the average number of accesses in a tree $T$, denoted $ana(T)$, for all possible queries. This is proportional to the sum of the node accesses for all possible queries (specified by the query set $Q$), i.e.,

$$ana(T) \propto \sum_{q \in Q} \; \sum_{0 \le i \le h-1} \; \sum_{n \in nodes_i(T)} P(q \cap n)$$

By changing the order of the summations, we have

$$ana(T) \propto \sum_{0 \le i \le h-1} \; \sum_{n \in nodes_i(T)} \; \sum_{q \in Q} P(q \cap n)$$

If we assume a uniform query distribution and small query windows, the innermost summation is proportional to the extent of the index region corresponding to the node $n$. Hence, the average number of node accesses for a uniform query distribution is proportional to the sum of the extents of all the index entries in the tree $T$, i.e.,

$$ana(T) \propto \sum_{0 \le i \le h-1} \; \sum_{n \in nodes_i(T)} extent(n)$$

From this equation, it follows that if we decrease the extents of index entries without changing the height $h$ in the equation, we can decrease the average number of node accesses. In practice, this is possible by promoting some data objects to the unfilled portions of higher-level nodes. However, if too many entries are promoted, the result might be first an increase in the number of nodes at higher levels and then an increase in the height of the tree. As the height of the tree increases, the area covered by the new root is added to the above equation. Clearly, the above equation argues for a good balance between the number of entries promoted and the increase in the height of the tree in order to achieve good performance gains. Hence, it is safe to promote entries to higher levels as long as they lead to little or no increase in the height of the index tree. In addition, demotion of objects is also necessary to maintain the balance.

3 Promotion-based Indexing

Based on the analysis of the previous section, a good index should promote selected data objects to higher levels in the index tree without any substantial increase in the height of the tree. We now propose a promotion scheme that incorporates these observations. In what follows, we describe the scheme using an underlying spatial index structure, the R*-tree [1]. We have chosen the R*-tree because it is one of the best and most versatile spatial structures, incorporating a set of optimizations to aggregate data into good clusters. Although we use an R*-tree, the scheme is equally applicable to other spatial index structures such as R-trees [8] and X-trees [2].

3.1 Promotion Strategy

Data objects may be promoted only at the time of new insertions. Insertion of a new data object A proceeds in the usual way from the root to the leaves, as in an R*-tree. The insertion of A into a node N may result in the promotion of a data object from node N to its parent.

1. If node N has a promoted object B and B is contained in object A, then A is inserted in N as a promoted object.
2. Otherwise, object A is inserted into the subtree that accommodates it best, as in an R*-tree. An additional restriction on that subtree is that it should contain at least one node at the leaf level.

Step 1 ensures that objects with larger extents are maintained in higher levels of the tree by terminating the insertion process early in a non-leaf node. Step 2 simply inserts the object in a subtree. Note that subtrees that contain only promoted entries are not chosen. Since promoted objects reside in non-leaf nodes of the tree, the leaf nodes of such subtrees are not at the lowest level in the tree.
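As a rough illustration of Steps 1 and 2, the following Python sketch decides where a new object would be stored. It is a simplification under stated assumptions, not the paper's implementation: rectangles are 2-dimensional `(xmin, ymin, xmax, ymax)` tuples, the R*-tree subtree choice is replaced by plain least-area-enlargement, and the names `Node`, `choose_node`, and `reaches_leaf_level` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

@dataclass
class Node:
    level: int                # 0 is the leaf level
    footprint: Rect           # bounding rectangle of the node's contents
    objects: List[Rect] = field(default_factory=list)      # data objects stored here
    children: List["Node"] = field(default_factory=list)   # child nodes (non-leaf only)

def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def contains(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def reaches_leaf_level(node: Node) -> bool:
    # A subtree qualifies only if it has at least one node at level 0.
    return node.level == 0 or any(reaches_leaf_level(c) for c in node.children)

def choose_node(node: Node, a: Rect) -> Node:
    """Return the node in which object `a` would be stored."""
    if node.level == 0:
        return node
    # Step 1: stop early if `a` contains an object already promoted to this node.
    if any(contains(a, b) for b in node.objects):
        return node
    # Step 2: descend, skipping subtrees that hold only promoted entries.
    eligible = [c for c in node.children if reaches_leaf_level(c)]
    if not eligible:  # not expected in a well-formed tree; kept as a guard
        return node
    best = min(eligible,
               key=lambda c: area(union(c.footprint, a)) - area(c.footprint))
    return choose_node(best, a)
```

Overflow handling, promotion, and demotion are deliberately left out of this sketch; it only shows how the two insertion steps select a target node.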
Consequently, insertion into such subtrees may lead to an increase in the height of the tree. Such an occurrence is avoided by the above restriction.

Insertion of an entry A into a node N is simply the addition of that entry to that node. If the node overflows as a result (i.e., exceeds its capacity), the overflow is resolved in one of the following ways (in the order specified):

1. If there are any entries that could be demoted using a demotion criterion (described in Section 3.3), demote the one that causes the least increase in the area of the extent of the subtree to which it is demoted.
2. Otherwise, pick a data object, if any, to be promoted using the promotion criterion (described in Section 3.2) and promote it to the next higher level in the tree. This means the promoted object is inserted in the parent of node N.
3. Otherwise, split the node N as specified by the underlying structure (the R*-tree) and propagate the effect of the split to the parent.

Note that by promoting data objects only from overflowing nodes, and by demoting data objects as and when possible, we regulate the number of data objects that are stored at non-leaf levels. This policy ensures that the height of the tree does not increase due to promotion of data objects. At the same time, we can reduce the area covered by the index entries using a specific promotion criterion and thus improve the query performance.

3.2 Promotion Criteria

We now discuss two promotion criteria for reducing the area covered by index entries. The first, called nesting-based promotion, uses spatial nesting of objects to determine whether a data object is a candidate for promotion. This criterion uses the data characteristics but does not explicitly consider the area covered. The nesting-based promotion criterion is formally described as follows:

Nesting-based Promotion: A data object is a potential candidate for promotion if it contains (nests) a threshold number (called the nesting threshold) of other entries in the node.
If there is more than one such data object, then choose the one that contains the maximum number of entries, breaking ties by maximum area.

The second criterion, called extent-based promotion, takes the areas of data objects into consideration.

Extent-based Promotion: Calculate the footprint of the overflowing node and let $a$ be its area. For each data object $e$ that shares a common boundary with the footprint of the overflowing node, do the following: estimate the footprint of the node without the data object $e$, and let $a_e$ be its extent (area). Data object $e$ is a candidate for promotion if the area $a_e$ is less than a fraction (specified by an area threshold value) of the area $a$. The gain of promoting the data object $e$ is the difference in the areas, $a - a_e$. From all the candidates for promotion, choose the one that has the maximum gain.

Note that, unlike in nesting-based promotion, promotion occurs only if it results in a substantial decrease in the area of the node (and its corresponding index entry in the parent). As we see later, the areas of the index entries play a significant role in the overall performance of the index.

[Figure 2. Example data for promotion-based indexing.]

[Figure 3. Promotion-based index for a sequence of insertions: tree after insertion of (a) A, B, C, D, E, F, J, K, L, N, P, and Q; (b) G; and (c) M and H. (d) An equivalent R*-tree.]

We refer to an R*-tree that incorporates nesting-based promotion as a nesting-based R*-tree. Likewise, we refer to an R*-tree that incorporates extent-based promotion as an extent-based R*-tree. We illustrate the creation of a nesting-based R*-tree for the data of Figure 2. Assume the node capacity (M) of the tree is 3 and the nesting threshold is 2. After inserting the objects A, B, C, D, E, F, J, K, L, N, P, and Q, the resulting structure is the same as a conventional R*-tree and is shown in Figure 3(a). However, when object G is inserted, it is stored at level 1, unlike in a conventional R*-tree. This modification is shown in Figure 3(b). The final structure after inserting objects M and H is shown in Figure 3(c). Object H, which is inserted last, is stored in the root. (A similar structure can also be obtained using extent-based promotion.) Contrast this structure with an R*-tree, which stores all data objects at the same level, irrespective of their extents.
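The two promotion criteria of Section 3.2 can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions rather than the authors' code: every entry of the overflowing node is treated as a 2-dimensional data rectangle `(xmin, ymin, xmax, ymax)`, "shares a common boundary" is read as having at least one side on the node's footprint, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])

def mbr(rects: List[Rect]) -> Rect:
    # Minimum bounding rectangle (footprint) of a non-empty list of rectangles.
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def contains(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def on_boundary(r: Rect, box: Rect) -> bool:
    return r[0] == box[0] or r[1] == box[1] or r[2] == box[2] or r[3] == box[3]

def nesting_based_candidate(entries: List[Rect], nesting_threshold: int) -> Optional[int]:
    """Index of the entry to promote under the nesting criterion, or None."""
    best, best_key = None, None
    for i, e in enumerate(entries):
        nested = sum(1 for j, o in enumerate(entries) if j != i and contains(e, o))
        if nested >= nesting_threshold:
            key = (nested, area(e))  # most nested entries first, then largest area
            if best_key is None or key > best_key:
                best, best_key = i, key
    return best

def extent_based_candidate(entries: List[Rect], area_threshold: float = 0.9) -> Optional[int]:
    """Index of the entry to promote under the extent criterion, or None."""
    box = mbr(entries)
    a = area(box)
    best, best_gain = None, 0.0
    for i, e in enumerate(entries):
        rest = entries[:i] + entries[i + 1:]
        if not rest or not on_boundary(e, box):
            continue
        a_e = area(mbr(rest))            # footprint area without e
        gain = a - a_e
        if a_e < area_threshold * a and (best is None or gain > best_gain):
            best, best_gain = i, gain
    return best
```

For a node holding one large rectangle that encloses three small ones, both functions select the large rectangle; for four equal tiles of a square, removing any one tile leaves the footprint unchanged, so neither criterion yields a candidate.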
In an R*-tree, when object G is inserted into the tree of Figure 3(a), it splits the node pointed to by index entry S, giving rise to overlapping index entries in the tree. The same phenomenon repeats when objects M and H are inserted. Figure 3(d) shows the resulting R*-tree. The newly formed index entries I1, ..., I7 overlap one another and also index dead space.

3.3 Demotion Criteria

We now discuss criteria for demoting promoted objects in an index. These demotion criteria correspond to the promotion criteria discussed above and ensure that no entries are unnecessarily stored at higher levels in the tree. They are described below.

Demotion Criterion for Nesting-based Promotion: Check whether there is any promoted data object that is fully contained in an index entry of the same node. If there is more than one such object, choose the one with the smallest area.

Demotion Criterion for Extent-based Promotion: Check for a promoted data object in the overflowing node that has no promotion gains. To calculate the gains of a promoted object $e$, we first determine the best subtree into which $e$ can be inserted. Let $g$ be the child entry in the node that corresponds to this subtree. If $a$ is the area of the index entry $g$, calculate the area $a_e$ that $g$ would have if $e$ were inserted in the subtree. Object $e$ is a candidate for demotion if the area $a$ is at least a fraction (specified by the area threshold) of the area $a_e$. Among all possible candidates, choose the one that has the minimum area increase.

3.4 Query Processing

Processing of queries proceeds in the same way as in R*-trees. The only difference is that we have to check for data objects even in index nodes of the tree. The algorithm is as follows:

If node T is a leaf node (i.e., contains only data objects), determine all the data objects that satisfy the query.
Otherwise, for each child entry e in the node T, do the following:
  If entry e is a promoted object, include it in the result if it satisfies the query.
  Otherwise, process the query on the subtree corresponding to the child entry e.

Consequently, in Figure 3(c), a containment query Q with the same footprint as object G retrieves object G from level 1 along with objects D, E, and F from the leaf level. In this process, it accesses only 3 nodes. In contrast, as seen from Figure 3(d), the same query accesses 4 nodes in an R*-tree.

3.5 Deletions

Deletions are processed in the following manner: If the data object is in a leaf node of the tree, then the deletion is the same as in an R*-tree. If the data object is stored at a non-leaf node of the tree, the data object is removed from the node and the node is adjusted as if an index entry had been deleted.

4 Experimental Results

In our experiments, we implemented the promotion-based index on top of an R*-tree, evaluated the relative merits of the two promotion criteria (nesting-based promotion and extent-based promotion), and compared them with a conventional R*-tree. The code for all the structures is written in C and was run on a Sparc workstation under the SunOS 5.4 operating system. The workstation had 32 MB of main memory and a clock speed of 40 MHz. The disk page size was fixed at 8 KB in all our experiments. This allowed a node to have a maximum capacity of 200 entries.

4.1 Experimental Setup

We used three different real cartographic datasets in our experiments: the first from the Alexandria Gazetteer, the second from the Alexandria Catalog, and the third from the Database Institute, University of Munich, Germany.

The Gazetteer has spatial footprints of most geographical features of the world. The Alexandria Gazetteer is built by merging the GNIS (Geographic Names Information Systems) Gazetteer and the BGN (Board of Geographic Names) Gazetteer.
The GNIS Gazetteer has locations and bounding coordinates (in degrees) for places and other geographical features of interest in the United States, whereas the BGN Gazetteer has worldwide information. The combined Gazetteer has 5 million entries, most having point extents. We ran our experiments on the sample for California, which has 73K entries, of which 68K are just point data.

The second dataset is the Catalog of all holdings in the Alexandria Digital Library. Each holding in the Catalog has an associated spatial footprint, which is used in queries (in conjunction with other information) for searching the Catalog database. To speed up retrieval, we index these spatial extents. Unlike the Gazetteer, most of these spatial extents are rectangular objects and are spread out over a large number of nesting levels. Specifically, the 400K data objects of the Catalog have 36 levels of nesting. Since each data entry is 20 bytes, the total size of the Catalog data is 8MB.

The third dataset is the Tiger data obtained from the archive at the Database Institute, University of Munich, Germany. The 128K data has a nesting depth of 4 and mainly consists of very small rectangles. The total size of the data is 256K.

Since our datasets are 2-dimensional in nature, we experimented with 2-dimensional queries. The generated queries span 0.001% to 10% of the area of the domain (e.g., California for the Gazetteer). For each query area, we generated 100 queries each of the exact match, enclosure, intersection, and containment types. Each query is generated by first generating the center using an independent random distribution in the x- and y-dimensions and then fixing a square box of the required area around the center of the query. We conducted our experiments with only the root nodes of the trees cached. (The behavior is similar when much larger portions of the index are cached.) We measured the number of node accesses per query, averaged over 100 queries.
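The query-generation procedure described above can be sketched in Python as follows. This is an illustrative reconstruction, not the code used in the experiments: it assumes the "independent random distribution" is uniform over the domain, does not clip query boxes that extend past the domain boundary, and uses the hypothetical name `generate_queries`.

```python
import math
import random
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def generate_queries(domain: Rect, area_fraction: float,
                     count: int = 100, seed: int = 0) -> List[Rect]:
    """Generate `count` square query boxes, each covering `area_fraction`
    of the domain area, centered at random points inside the domain."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    xmin, ymin, xmax, ymax = domain
    domain_area = (xmax - xmin) * (ymax - ymin)
    side = math.sqrt(area_fraction * domain_area)  # side of the square box
    queries = []
    for _ in range(count):
        # Center drawn independently in x and y (uniform is an assumption).
        cx = rng.uniform(xmin, xmax)
        cy = rng.uniform(ymin, ymax)
        queries.append((cx - side / 2, cy - side / 2,
                        cx + side / 2, cy + side / 2))
    return queries
```

For example, `generate_queries(domain, 0.00001)` would produce 100 boxes at the 0.001% end of the range and `generate_queries(domain, 0.1)` at the 10% end, where `domain` is the bounding rectangle of the dataset.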
Note that in all our experiments the number of disk (node) accesses for containment and intersection queries in an R*-tree (and in all our index schemes built on top of the R*-tree) is the same. This is because the algorithms for determining containment and intersection differ only at the leaf nodes, where they check for containment or intersection of the data. Likewise, the node accesses for enclosure and exact search queries are the same.

4.2 Effect of Thresholds in Promotion Schemes

The thresholds in the promotion-based index (the area threshold for the extent-based criterion and the nesting threshold for the nesting-based criterion) control the number of entries promoted to the higher levels in the tree. If too many entries are promoted, the height of the tree may increase and the performance advantages due to promotion may vanish. If there are too few promotions, we may not get any performance improvement. In this section, we describe the results of varying the thresholds for each promotion criterion.

First, we study the effect of varying the area threshold for extent-based promotion on the Catalog data. As the area threshold increases from 0.6 to 0.9, the number of entries that are eligible for promotion increases. Consequently, the number of promoted entries increases, reducing the total area covered by the index entries of the tree. These facts are borne out in Table 1.

Table 1. Effect of varying the area threshold (a_t) on extent-based promotion. (Catalog data; columns: a_t = 0.6, a_t = 0.75, a_t = 0.9; rows: number of promoted entries, area covered, height.)

From the analysis of node accesses in the previous section, we expect the decrease in the extents (areas) to be accompanied by a decrease in the average number of disk accesses per query. This is verified in Table 2, where there is a corresponding decrease in the number of disk accesses per query as the area threshold increases from 0.6 to 0.9. However, as the query area increases, the difference in performance between the different area thresholds decreases. This is because with the increase in the query area, the number of objects retrieved by the query increases for all variants. Consequently, the gains from having some of them promoted do not contribute to a decrease in the disk accesses. Similar results were obtained for the Gazetteer and the Tiger datasets.

Table 2. Extent-based promotion with varying area thresholds (a_t). (Catalog data; rows: query range; columns: disk accesses for enclosure/exact search and for containment/intersection, each at a_t = 0.6, a_t = 0.75, a_t = 0.9.)

Next, we vary the nesting threshold for the nesting-based promotion. Tables 3 and 4 show the variation in performance on the Catalog data for two different values of the nesting threshold, M/3 (67) and M/2 (100), where M is the maximum capacity (200) of a tree node.

Table 3. Effect of varying nesting thresholds on nesting-based promotion. (Catalog data; columns: n_t = 67, n_t = 100; rows: number of promoted entries, area covered, height. The heights are 4 and 3, respectively.)

Table 4. Nesting-based promotion with varying nesting thresholds. (Catalog data; rows: query range; columns: disk accesses for enclosure/exact search and for containment/intersection, each at n_t = 67 and n_t = 100.)
From Table 3, we observe that for a nesting threshold of M/3, the number of promoted entries is [...]. This is expected because the Catalog data is highly nested (with 36 levels of nesting). Consequently, this exodus from the leaf levels to higher levels in the tree increases the height of the tree from 3 to 4. According to our analysis in the previous section, the performance should improve if the height increase is avoided. As seen in Table 3, increasing the nesting threshold to M/2 results in the promotion of 2810 data objects with no increase in the height of the tree. Consequently, as seen in Table 4, the performance of the index is at its best for a nesting threshold of 100 (M/2). We also increased the threshold to 150 (3M/4) and found that there were no promoted objects for the Catalog data. This suggests that choosing the optimal nesting threshold can be quite tricky and can depend very much on the data characteristics. In contrast, the area threshold for the extent-based scheme is quite straightforward and can be fixed at 0.9 for all practical instances of non-uniform data.

4.3 Performance Comparison of Promotion-based Schemes with R*-tree

We now compare the performance of our promotion-based schemes with that of an R*-tree using different datasets. For the extent-based promotion scheme, we fixed the area threshold at 0.9, and for the nesting-based promotion, we chose the nesting threshold to be 100 (M/2). These values correspond to the best thresholds for the two schemes. The performance figures for the two promotion criteria along with those for an R*-tree are tabulated in Table 5. For containment queries, we observe that the extent-based promotion has a slight advantage over nesting-based promotion. This is because the extent-based scheme promotes data objects only if there is a substantial gain in doing so, unlike the nesting-based promotion. For enclosure queries, the nesting-based promotion performs better than the extent-based promotion.
This is a consequence of having around 3K promoted entries in nesting-based promotion as compared to 280 entries in extent-based promotion. Consequently, most of the enclosure queries are resolved at higher levels of the tree. However, both schemes yield substantial improvements in performance, although the extent-based promotion makes a better choice of promoted entries (and promotes fewer of them) than the nesting-based scheme.

Table 5. Performance comparison of the R*-tree with promotion-based schemes on Catalog data. (Rows: query range; columns: number of disk accesses for enclosure/exact search and for containment/intersection, each for R*, Nest. Prom R*, and Extent Prom R*.)

Next, we present the performance results on the Gazetteer dataset. Figures 4 and 5 show these results for the three structures. Note that the x- and y-scales in these figures (and in subsequent ones) are logarithmic. We observe that the nesting-based promotion and the extent-based promotion perform equally well. This is because most of the data objects that give rise to the anomalies of non-uniformity are promoted in both structures.

Next, we examine the performance gains for the Tiger data. Figure 6 shows the performance curves. Since the Tiger data has small rectangles, we fixed the area threshold for the extent-based promotion criterion at [...]. The nesting-based promotion did not yield any performance gains over the R*-tree. In contrast, the extent-based promotion scheme yields a 5% improvement for containment queries and up to 10% improvement for enclosure queries. This is because the Tiger data does not have any nesting, which explains the lack of improvement using the nesting-based criterion. The extent-based criterion, on the other hand, shows an improvement in performance because it does not require data to be nested. These experiments indicate that the extent-based promotion is sufficiently robust to adapt to a wide variety of non-uniform data.
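As an aside on the measurement methodology of Section 4.1, the observation that containment and intersection queries incur identical node accesses (and likewise enclosure and exact search) can be illustrated with a small sketch. This is a simplified, hypothetical tree traversal and not our R*-tree implementation. The names `Node`, `search`, `run`, and `QUERY_TYPES` are assumptions for the example, as is the reading that containment returns data objects lying inside the query box and enclosure returns data objects that enclose it. Every visited node is counted, including the root, whereas our experiments keep the root cached.

```python
from dataclasses import dataclass, field


def intersects(a, b):
    """True if rectangles a and b, given as (xmin, ymin, xmax, ymax), overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contains(outer, inner):
    """True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


@dataclass
class Node:
    mbr: tuple                                    # bounding rectangle of the node
    children: list = field(default_factory=list)  # child nodes (internal node)
    entries: list = field(default_factory=list)   # data rectangles (leaf node)


# For each query type: (test used to descend into a child, test applied at leaves).
# Tests within a pair of query types share the descent test and differ only at the leaf.
QUERY_TYPES = {
    "intersection": (intersects, intersects),
    "containment": (intersects, contains),                 # data inside the query
    "enclosure": (contains, lambda q, e: contains(e, q)),  # data encloses the query
    "exact": (contains, lambda q, e: e == q),
}


def search(node, query, descend, leaf_test, stats):
    stats["accesses"] += 1  # one access per visited node
    if not node.children:
        return [e for e in node.entries if leaf_test(query, e)]
    hits = []
    for child in node.children:
        if descend(child.mbr, query):
            hits += search(child, query, descend, leaf_test, stats)
    return hits


def run(kind, root, query):
    """Return (matching data rectangles, number of node accesses)."""
    descend, leaf_test = QUERY_TYPES[kind]
    stats = {"accesses": 0}
    return search(root, query, descend, leaf_test, stats), stats["accesses"]


# A tiny two-level example tree with made-up rectangles.
LEAF_A = Node(mbr=(0, 0, 4, 4), entries=[(0, 0, 1, 1), (2, 2, 3, 3), (3, 1, 4, 4)])
LEAF_B = Node(mbr=(5, 5, 9, 9), entries=[(5, 5, 6, 6), (7, 7, 9, 9)])
ROOT = Node(mbr=(0, 0, 9, 9), children=[LEAF_A, LEAF_B])
```

In this sketch, calling `run("intersection", ROOT, q)` and `run("containment", ROOT, q)` for the same box `q` visits the same nodes, and only the set of returned objects can differ.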
Figure 4. Performance of different promotion schemes with R*-tree on Gazetteer data for containment queries. (x-axis: query area as percentage of total area; y-axis: number of disk accesses; curves: Extent-based Prom R*, Nesting-based Prom R*, R*.)

Figure 5. Performance of different promotion schemes with R*-tree on Gazetteer data for enclosure queries. (x-axis: query area as percentage of total area; y-axis: number of disk accesses; curves: Extent-based Prom R*, Nesting-based Prom R*, R*.)

Figure 6. Performance of different promotion schemes with R*-tree on Tiger data for (a) containment and (b) enclosure queries. (x-axis: query area as percentage of total area; y-axis: number of disk accesses; curves: Extent-based Prom R*, Nesting-based Prom R*, R*.)

5 Conclusions

In this paper, we examined efficient indexing strategies for non-uniform spatial data. We observed that the performance of a tree-based index depends on two factors: the height of the tree, and the area covered by the index entries of the tree. We proposed a promotion scheme that addresses these two factors. This scheme improves index performance by reducing area coverage without affecting the tree height. Within this scheme, we experimented with two promotion criteria: the nesting-based criterion and the extent-based criterion. We evaluated their relative merits in comparison to the conventional R*-tree on real cartographic data. We noted that both promotion criteria yield substantial improvements (up to 45%) in disk access performance over the conventional R*-tree. We found that the extent-based criterion is more robust than the nesting-based criterion. In the future, we plan to extend these optimizations to other spatial index structures such as BANG files [3] and BV-trees [4].

References

[1] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, May.
[2] S. Berchtold, D. A. Keim, and H. P. Kriegel. The X-tree: An index structure for high dimensional data. Proceedings of the Int. Conf. on Very Large Data Bases.
[3] M. W. Freeston. The BANG file: A new kind of grid file. Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, May.
[4] M. W. Freeston. A general solution of the n-dimensional B-tree problem. Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, May.
[5] O. Gunther. The design of the cell tree: An object-oriented index structure for geometric databases. Proc. Int. Conf. on Data Engineering.
[6] O. Gunther. Evaluation of spatial access methods with oversize shelves. Workshop on Geographic Database Management Systems.
[7] O. Gunther and H. Noltemeier. Spatial database indices for large extended objects. Proc. Int. Conf. on Data Engineering.
[8] A. Guttman. R-trees: A dynamic index structure for spatial searching. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 47-57.
[9] C. Kolovson and M. Stonebraker. Segment indexes: Dynamic indexing techniques for multi-dimensional interval data. Proc. ACM SIGMOD Int. Conf. on Management of Data.
[10] D. B. Lomet and B. Salzberg. The hB-tree: A multiattribute indexing method with good guaranteed performance. ACM Transactions on Database Systems, 15(4), December.
[11] J. Nievergelt, H. Hinterberger, and K. C. Sevcik. The grid file: An adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, 9(1):38-71, March.
[12] B. U. Pagel, H. W. Six, H. Toben, and P. Widmayer. Towards an analysis of range query performance. Proc. ACM Symp. on Principles of Database Systems.
[13] K. V. Ravi Kanth, D. Agrawal, A. El Abbadi, Ambuj K. Singh, and T. Smith. Indexing hierarchical data. Technical Report CS-TR-9514, Univ. of California at Santa Barbara.
[14] K. V. Ravi Kanth and Ambuj K. Singh. Optimal dynamic range searching in non-replicating index structures. Manuscript.
[15] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley Publishing Co.
[16] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: A dynamic index for multi-dimensional objects. Proceedings of the Int. Conf. on Very Large Data Bases (13).
[17] T. R. Smith and J. Frew. Alexandria Digital Library. Communications of the ACM, 38(4):61-62, April.
[18] D. White and R. Jain. Similarity indexing with the SS-tree. Proc. Int. Conf. on Data Engineering, 1996.
