Indexing Non-uniform Spatial Data


K. V. Ravi Kanth, Divyakant Agrawal, Amr El Abbadi, Ambuj K. Singh
Department of Computer Science
University of California at Santa Barbara
Santa Barbara, CA

Abstract

Non-uniformity in data extents is a general characteristic of spatial data. Indexing such non-uniform data using conventional spatial index structures such as R*-trees is inefficient for two reasons: (1) the non-uniformity increases the likelihood of overlapping index entries, and (2) clustering of non-uniform data is likely to index more dead space than clustering of uniform data. To reduce the impact of these anomalies, we propose a new scheme that promotes data objects to higher levels in tree-based index structures. We examine two criteria for promotion of data objects and evaluate their relative merits using an R*-tree. In experiments on cartographic data, we observe that our promotion criteria yield up to 45% improvement in query performance for an R*-tree.

1 Introduction

Advances in processing technology and storage management over the last decade have opened up new avenues for handling large and complex map and image data. The Alexandria Digital Library (ADL) project is one such mission with the goal of providing efficient access to data collections (satellite images, census data, aerial photographs, digital line graphs, digital orthophotoquads, SLAR images, and maps) spanning terabytes of storage. While the task of building a digital library is enormous, one important piece of the effort is the design and implementation of efficient index structures. In this paper, we consider the performance of current index structures on geographic and spatial data and present a new approach to indexing that adapts to data characteristics.

Many indexing schemes [3, 4, 8, 10, 11, 15, 18] have been designed to provide efficient storage and retrieval support for multidimensional data.
Unfortunately, most of these structures are suitable for point data alone. Only a few of them, such as the R-tree variants [1, 8, 16], perform well with rectangular data. Even these structures lose their efficiency when it comes to indexing data rectangles with non-uniformity in their areas (extents), which is a common characteristic of geographic data. Indexing such non-uniform data is inefficient for two reasons. First, index entries for non-uniform data are more likely to overlap than those for uniform data. This effect is more pronounced for nested data, where one rectangle encloses several others. Second, clustering of non-uniform data in an index is likely to include more dead space (the part of the data space that is not covered by any data object) than clustering of uniform data.

What are the implications of these anomalies for the performance of the index? Overlapping index entries result in loss of information as to which part of the tree is relevant to a particular query. Indexing dead space implies accessing irrelevant parts of the tree that do not contribute to the result of the query in any way. Thus, the performance of the index structure degrades.

One solution to the above problems is to separate data based on their extents into several levels and index them separately. This approach is similar to that of a Level-based Index (LIB) structure [13], which was designed for indexing nested data. The LIB structure decomposes data into several levels based on their nesting and indexes the data at each level separately. Since the data is scattered over several indices, it works well only when the number of levels is small. The same argument applies to any approach that decomposes non-uniform data based on their extents. Hence, we seek to smooth out the non-uniformity in the data by promoting data objects to higher levels in a single index tree.

(Work supported in part by a research grant from NSF/ARPA/NASA IRI and NSF instrumentation grant CDA.)
To appreciate why such an organization can be more efficient than the conventional index organizations, let us consider the example data of Figure 1(a) and examine how it is organized by popular index structures such as the R*-tree [1]. In this figure, object A contains B and overlaps with C, D and E. This pattern of data occurs frequently in geographical databases. Assuming a node capacity of 3, an R*-tree organizes the data as shown in Figure 1(b). Note that

irrespective of how good the data clustering scheme is in R*-trees (or any conventional index for that matter), it will have overlapping index entries (g and h, shown in dashed lines in the figure) and cannot avoid indexing a portion of the dead space (e.g., region k in the figure). Berchtold et al. addressed the problem of achieving minimal overlapping entries in their X-tree [2]. However, for extended data objects, which are usually of small dimensions, the X-tree achieves little or no reduction in overlap and consequently behaves similarly to a conventional R*-tree.

We observe that these anomalies of indexing non-uniform spatial data are a consequence of maintaining all the data objects at the same level (the leaf level) in the tree. Based on this observation, we remove this restriction and reorganize the same set of data objects as in Figure 1(c). Here object A, due to its large extent, is stored at the root level and all others are stored at the leaf level (with i and h as the corresponding index entries). This tree organization solves both problems of non-uniformity: (1) it reduces the overlaps in index entries, and (2) it reduces the dead space that is indexed (region k is excluded). Both these factors contribute to a decrease in the extents of index regions in the tree. Since the performance of a tree structure depends on the extents of the index regions, as observed in [12], the promotion-based organization yields better results than the conventional R*-tree.

[Figure 1. Promotion-based Indexing in an R*-tree: (a) Example of Non-uniform Data, (b) R*-tree, (c) Promotion-based R*-tree]

In this context, we note that the oversize shelves of Gunther [6, 7] is the first promotion-based scheme in the literature. However, this scheme and its analysis rely heavily on data objects being partitioned to create non-overlapping index regions. Hence, they are suitable only for the cell trees [5] and the R+-trees [16].
In addition, the oversize shelves do not adapt to data distribution because promoted objects are never demoted. The analysis also assumes that all nodes at the same level have the same probability of being accessed by a query. In contrast, for a uniform query distribution, Pagel et al. [12] showed that the access probability depends on the area covered by the nodes. In this paper, we propose a new promotion scheme that incorporates Pagel's observations and extends Gunther's promotion schemes to other popular index structures like the R-trees [8], R*-trees [1], and X-trees [2], which may have overlapping index regions. In addition, we adapt to data distribution by not only promoting but also demoting data objects as and when needed.

The SR-tree [9] is also designed to index non-uniform data. It attempts to smooth out the non-uniformity by cutting every object into several parts: those that enclose or span an index region in the tree (in any dimension), called the spanning regions, and those that do not, called remnant regions. The spanning regions are inserted in the node that has an entry for the spanned index region. The remnant regions are simply inserted in the leaf nodes of the tree just as in an R-tree. Using such an organization, the SR-tree can potentially maintain $O(N)$ information for each object, where $N$ is the number of objects. Consequently, the storage space of the SR-tree is $O(N^2)$. Since all this extra information for each object is maintained in the same tree, the height of the tree may also increase. In contrast, our proposed scheme uses only linear space and is designed to improve index performance without increasing the height of the original tree.

The BV-tree [4] also employs promotion of index regions to disambiguate them. However, the motivation for the promotion in the BV-trees is to ensure logarithmic access for exact match queries, which are much less common than containment and intersection queries.
In contrast, our scheme is designed to improve the average performance of all types of queries. Besides, it could be combined with BV-trees to achieve worst-case logarithmic performance for exact match queries. Note that the worst-case performance for containment queries on $n$ multidimensional objects in most current structures is $O(n)$ [14].

In this paper, we first examine the effect of promoting data objects on the query performance of an index tree. We observe that query performance depends on two factors: the height of the tree, and the area covered by the index entries of the tree. The height of the tree determines the overhead of accessing a leaf node of the tree. The number of such leaf nodes accessed in a query is subject to the anomalies of indexing non-uniform data. These anomalies are reflected in the area covered by the index entries, and their impact is reduced whenever the area covered is reduced. Therefore, a good index scheme should decrease the area covered without increasing the height of the tree. The promotion-based index of Figure 1 is an example of such an index.

We describe a general scheme for promoting data objects without affecting the height of the tree. We then study two variants of this scheme using different promotion criteria. The first, called nesting-based promotion, uses the concept of spatial nesting to decide on possible candidates for promotion whereas the second, called extent-based promotion,

uses the extents of the data objects to do the same. We experiment with these criteria on R*-trees using real cartographic datasets from the Alexandria Digital Library (ADL) [17]. The ADL has several sources of spatial data (e.g., the Catalog of all its holdings, the Gazetteer, etc.). Each such data collection has a number of objects with varying spatial footprints (the footprint of an object is the minimum bounding rectangle for that object). For instance, the Gazetteer has the spatial footprints for the countries of the world, the states and provinces in each of them, and the cities within each state. We use these collections to evaluate the relative merits of the different promotion criteria and compare them with the conventional R*-tree. In our experiments, we observe that extent-based promotion yields reasonable performance gains whereas the gains from nesting-based promotion depend on the nesting characteristics of the data. By promoting data objects, both variants reduce the ill-effects of non-uniformity in the data without increasing the height of the tree.

The rest of the paper is organized as follows. In the next section, we analyze the impact of promotion on the performance of an index structure. In Section 3, we propose a promotion strategy with two different promotion criteria. In Section 4, we present our experiments with these criteria. Finally, in Section 5, we conclude the paper with pointers for further research on this topic.

2 Effect of Promotion on Index Performance

In this section, we estimate the effect of promoting data objects on the performance of a tree-structured index. To this end, we first estimate the increase in the height of the tree due to promotion of data objects, and then analyze the subsequent query performance.

2.1 Effect of Promotion on Tree Height

We assume the index structure is a balanced tree. Each node of the tree has between a minimum of $m$ and a maximum of $M$ child entries.
Each child entry is a 2-tuple of the form $\langle \textit{child footprint}, \textit{child node} \rangle$ storing a pointer to the child node and its footprint (child footprint). The footprints represent arbitrary regions in multidimensional space. The exact representation depends on the index structure and is of little interest to us at this point. Consequently, the effects of promotion on the height of the index tree are equally applicable to most spatial data index structures like the R-trees [8], R+-trees [16], and R*-trees [1].

Since we are concerned with the worst-case increase in height due to promotion of entries, we assume every node has exactly $m$ children. At any level, we assume that at most a fraction $f$ of the total entries from that level are promoted to the next higher level. We start with the lowest level of the tree and promote entries from this level to the next higher level and move up the tree in this manner. For a set of $n$ data objects at level 0 (the lowest level), $fn$ entries are promoted to the next level whereas $(1 - f)n$ entries remain at that level. These remaining entries are stored in nodes at level 0. Each such node is represented by an index entry at the next level. Hence, with $fn$ entries from level 0 promoted to level 1, the total number of entries at level 1 is $n(1 - f)/m + fn$. Of these, only a fraction $k = (1 - f)$ of the total entries remain at level 1. The recurrence relation for estimating the number of entries $a_i$ at level $i$ before promotion is therefore:

$$a_i = \left(\frac{1 - f}{m} + f\right) a_{i-1}.$$

Since $a_0 = n$, solving the above recurrence relation yields

$$a_i = \left(\frac{1 - f}{m} + f\right)^i a_0 = \left(\frac{1 - f}{m} + f\right)^i n.$$

Let $c = 1/((1 - f)/m + f)$. The recursion bottoms out at level $i$ whenever the number of entries at that level is no more than $m$. This corresponds to the root level and no more promotions take place. Using this fact, the height $h$ of the tree is determined as follows:

$$h = \log_c (a_0 / a_h) = \log_c (n/m).$$

Note that when there is no promotion (i.e., $f = 0$), the height is $(\log_m n - 1)$. If one entry out of every $m$ entries (i.e., one from every node) is promoted, the height increases by a factor of $\log_{m/2} 2$. For practical values of $m = 50$, this increase is 17%.

2.2 Performance of an index structure

Next, we estimate the query performance of an index structure in terms of its height and the area covered by its index entries. We consider containment and intersection queries, which are the most common query types for spatial databases. We assume that these queries access all tree nodes whose footprints intersect the query box. The number of such node accesses characterizes the query performance of the index.

Pagel et al. [12] give a performance measure for estimating the number of accesses to the children of a node in terms of the probabilities of a query window intersecting their footprints. They consider different query models based on whether or not the query distribution is dependent on the data distribution. Let $WQM$ denote one of these query models and let $P(q \cap c)$ denote the probability of a query $q$ intersecting the footprint of node $c$. Then, the number of children of node $N$ that intersect query $q$ is denoted by $PM(WQM, q, N)$ and is given by:

$$PM(WQM, q, N) = \sum_{c \in children(N)} P(q \cap c)$$

For simplicity, we assume the query distribution is independent of the data distribution and each query retrieves all data in the window specified. We extend the performance measure to indicate the number of nodes accessed in the

whole tree $T$ (rooted at node $T$) for a query $q$ as described below. Since the performance measure is an indication of the node accesses, we refer to it as the node access measure and denote it by $na(q, T)$.

$$na(q, T) = \sum_{n \in children(T)} P((q \cap n)/(q \cap T)) \, (1 + na(q, n))$$

Note that we have modified Pagel's performance measure to reflect the conditional probability of a child $n$ of node $T$ being accessed by query $q$ given that node $T$ is already accessed by query $q$. This reflects the order of traversal from a node to its children in a tree structure. Also, note that the access to the root node is not accounted for in this equation. Now we simplify the above recurrence relation using the following Lemma. We assume $h$ is the height of the tree and $nodes_i(T)$ denotes the set of nodes at level $i$ of the tree rooted at $T$.

Lemma 1

$$\sum_{n \in children(T)} P((q \cap n)/(q \cap T)) \, (1 + na(q, n)) = \sum_{0 \le i \le h-1} \; \sum_{n \in nodes_i(T)} P((q \cap n)/(q \cap T))$$

Proof: By induction on the height $h$ of the tree.

Basis: For $h = 0$, the tree consists of only the root node with no children and the above equality is trivially true.

Induction Hypothesis: Assume it is true for every tree of height no more than $k$. We prove the equality for any tree of height no more than $k + 1$. Let $T$ be the root node of such a tree and let $n_1, \ldots, n_m$ be its children. Then the node access measure for tree $T$ is given by:

$$na(q, T) = \sum_{n \in \{n_1, \ldots, n_m\}} P((q \cap n)/(q \cap T)) \, (1 + na(q, n))$$

By the induction hypothesis, we have for each $n \in \{n_1, \ldots, n_m\}$,

$$na(q, n) = \sum_{0 \le i \le k-1} \; \sum_{c \in nodes_i(n)} P((q \cap c)/(q \cap n))$$

Therefore, it follows that

$$na(q, T) = \sum_{n \in \{n_1, \ldots, n_m\}} P((q \cap n)/(q \cap T)) \left(1 + \sum_{0 \le i \le k-1} \; \sum_{c \in nodes_i(n)} P((q \cap c)/(q \cap n))\right)$$

$$= \sum_{n \in nodes_k(T)} P((q \cap n)/(q \cap T)) + \sum_{n \in nodes_k(T)} \; \sum_{0 \le i \le k-1} \; \sum_{c \in nodes_i(n)} P((q \cap n)/(q \cap T)) \, P((q \cap c)/(q \cap n))$$

If node $c$ is a child of node $n$ and node $n$ is a child of node $T$, then the footprint of node $c$ is contained in that of $n$ and the footprint of node $n$ in that of $T$.
Hence the probability that node $c$ intersects a query $q$ given that node $n$ does is given by:

$$P((q \cap c)/(q \cap n)) = P((q \cap c)/(q \cap T)) \,/\, P((q \cap n)/(q \cap T))$$

Consequently, the node access measure for a tree rooted at node $T$ becomes

$$na(q, T) = \sum_{n \in nodes_k(T)} P((q \cap n)/(q \cap T)) + \sum_{n \in nodes_k(T)} \; \sum_{0 \le i \le k-1} \; \sum_{c \in nodes_i(n)} P((q \cap c)/(q \cap T))$$

$$= \sum_{n \in nodes_k(T)} P((q \cap n)/(q \cap T)) + \sum_{0 \le i \le k-1} \; \sum_{c \in nodes_i(T)} P((q \cap c)/(q \cap T))$$

$$= \sum_{0 \le i \le h-1} \; \sum_{n \in nodes_i(T)} P((q \cap n)/(q \cap T))$$

Hence the equality is true for trees of height $k + 1$.

Next, we estimate the average number of accesses in a tree $T$, denoted $ana(T)$, for all possible queries. This is proportional to the sum of the node accesses for all possible queries (specified by the query set $Q$), i.e.,

$$ana(T) \propto \sum_{q \in Q} \; \sum_{0 \le i \le h-1} \; \sum_{n \in nodes_i(T)} P(q \cap n)$$

By changing the order of the summations, we have

$$ana(T) \propto \sum_{0 \le i \le h-1} \; \sum_{n \in nodes_i(T)} \; \sum_{q \in Q} P(q \cap n)$$

If we assume a uniform query distribution and small query windows, the innermost summation is proportional to the extent of the index region corresponding to the node $n$. Hence, the average number of node accesses for a uniform query distribution is proportional to the sum of the extents of all the index entries in the tree $T$, i.e.,

$$ana(T) \propto \sum_{0 \le i \le h-1} \; \sum_{n \in nodes_i(T)} extent(n)$$

From this equation, it follows that if we decrease the extents of index entries without changing the height $h$ in the equation, we can decrease the average number of node accesses. In practice, this is possible by promoting some data objects to the unfilled portions of higher-level nodes. However, if too many entries are promoted, then the result might be first an increase in the number of nodes at higher levels and then in the height of the tree. As the height of the tree increases, the area covered by the new root is added to the above equation. Clearly, the above equation argues for a good balance between the number of entries promoted and the increase in the height of the tree to achieve

good performance gains. Hence, it is safe to promote entries to higher levels as long as they lead to little or no increase in the height of the index tree. In addition, demotion of objects is also necessary to maintain the balance.

3 Promotion-based Indexing

Based on the analysis of the previous section, a good index should promote selected data objects to higher levels in the index tree without any substantial increase in the height of the tree. We now propose a promotion scheme that incorporates these observations. In what follows, we describe the scheme using an underlying spatial index structure, the R*-tree [1]. We have chosen the R*-tree because it is one of the best and most versatile spatial structures, incorporating a set of optimizations to aggregate data into good clusters. Although we use an R*-tree, the scheme is equally applicable to other spatial index structures such as the R-trees [8] and X-trees [2].

3.1 Promotion Strategy

Data objects may be promoted only at the time of new insertions. Insertion of a new data object A proceeds in the usual way from the root to the leaves as in an R*-tree. The insertion of A into a node N may result in the promotion of a data object from node N to its parent.

1. If node N has a promoted object B and B is contained in object A, then A is inserted in N as a promoted object.

2. Otherwise, object A is inserted into the subtree that accommodates it best as in an R*-tree. An additional restriction on that subtree is that it should contain at least one node at the leaf level.

Step 1 ensures that larger-extent objects are maintained in higher levels of the tree by a premature termination of the insertion process in a non-leaf node. Step 2 simply inserts the object in a subtree. Note that subtrees that contain only promoted entries are not chosen. Since promoted objects reside in non-leaf nodes of the tree, the leaf nodes of such subtrees are not at the lowest level in the tree.
Consequently, insertion into such subtrees may lead to an increase in the height of the tree. Such an occurrence is avoided by the above restriction.

Insertion of an entry A into a node N is simply the addition of that entry to that node. If by so doing the node overflows (i.e., exceeds its capacity), the overflow is resolved in one of the following ways (in the order specified):

1. If there are any entries that could be demoted using a demotion criterion (described in Section 3.3), demote the one that causes the least increase in the area of the extent of the subtree to which it is demoted.

2. Otherwise, pick a data object, if any, to be promoted using the promotion criterion (described in Section 3.2) and promote it to the next higher level in the tree. This means the promoted object is inserted in the parent of node N.

3. Otherwise, split the node N as specified by the underlying structure (the R*-tree) and propagate the effect of the split to the parent.

Note that by promoting data objects only from overflowing nodes, and by demoting data objects as and when possible, we regulate the number of data objects that are stored at non-leaf levels. This policy ensures that the height of the tree does not increase due to promotion of data objects. At the same time, we can reduce the area covered by the index entries using a specific promotion criterion and thus improve the query performance.

3.2 Promotion Criteria

We now discuss two promotion criteria for reducing the area coverage of index entries. The first, called nesting-based promotion, uses spatial nesting of objects to determine if a data object is a candidate for promotion. This criterion uses the data characteristics but does not explicitly consider the area coverage. The nesting-based promotion criterion is formally described as follows:

Nesting-based Promotion: A data object is a potential candidate for promotion if it contains (nests) a threshold (called the nesting threshold) number of other entries in the node.
If there is more than one such data object, then choose the one that has the maximum number of contained entries and, among those, the maximum area.

The second criterion, called extent-based promotion, takes the areas of data objects into consideration.

Extent-based Promotion: Calculate the footprint of the overflowing node and its area $a$. For each data object $e$ that shares a common boundary with the footprint of the overflowing node, do the following: Estimate the footprint of the node without the data object $e$. Let $a_e$ be its extent (area). Data object $e$ is a candidate for promotion

if the area $a_e$ is less than a fraction (specified by an area threshold value) of the area $a$. The gain of promoting the data object $e$ is the difference in the areas, $a - a_e$. From all the candidates for promotion, choose the one that has the maximum gain.

Note that unlike in nesting-based promotion, promotion occurs only if it results in a substantial decrease in the area of the node (and its corresponding index entry in the parent). As we see later, the areas of the index entries play a significant role in the overall performance of the index.

[Figure 2. Example data for Promotion-based Indexing]

[Figure 3. Promotion-based Index for a Sequence of Insertions: tree after insertion of (a) A, B, C, D, E, F, J, K, L, N, P and Q, (b) G, and (c) M and H. (d) An equivalent R*-tree.]

We refer to an R*-tree that incorporates nesting-based promotion as a nesting-based R*-tree. Likewise, we refer to an R*-tree that incorporates extent-based promotion as an extent-based R*-tree. We illustrate the creation of a nesting-based R*-tree for the data of Figure 2. Assume the node capacity (M) of the tree is 3 and the nesting threshold is 2. After inserting the objects A, B, C, D, E, F, J, K, L, N, P and Q, the resulting structure is the same as a conventional R*-tree and is shown in Figure 3(a). However, when object G is inserted, it is stored at level 1, unlike in a conventional R*-tree. This modification is shown in Figure 3(b). The final structure after inserting objects M and H is shown in Figure 3(c). Object H, which is inserted subsequently, is stored in the root. (A similar structure can also be obtained using extent-based promotion.) Contrast this structure with an R*-tree, which stores all data objects at the same level, irrespective of their extents.
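The two promotion criteria of Section 3.2 can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the authors' C implementation: the overflowing node is modeled as a flat list of two-dimensional data rectangles (no child index entries), and the names `Rect`, `footprint`, `nesting_candidate` and `extent_candidate` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Rect:
    """Axis-aligned footprint (minimum bounding rectangle) of an object."""
    xlo: float
    ylo: float
    xhi: float
    yhi: float

    def area(self) -> float:
        return (self.xhi - self.xlo) * (self.yhi - self.ylo)

    def contains(self, other: "Rect") -> bool:
        return (self.xlo <= other.xlo and self.ylo <= other.ylo
                and self.xhi >= other.xhi and self.yhi >= other.yhi)


def footprint(rects: Sequence[Rect]) -> Rect:
    """Bounding rectangle of a non-empty collection of rectangles."""
    return Rect(min(r.xlo for r in rects), min(r.ylo for r in rects),
                max(r.xhi for r in rects), max(r.yhi for r in rects))


def nesting_candidate(entries: List[Rect], nesting_threshold: int) -> Optional[Rect]:
    """Nesting-based criterion: an object qualifies if it contains at least
    `nesting_threshold` other entries; ties go to the object with the most
    contained entries, then to the one with the largest area."""
    best, best_key = None, None
    for i, e in enumerate(entries):
        nested = sum(1 for j, o in enumerate(entries) if j != i and e.contains(o))
        if nested >= nesting_threshold:
            key = (nested, e.area())
            if best_key is None or key > best_key:
                best, best_key = e, key
    return best


def extent_candidate(entries: List[Rect], area_threshold: float) -> Optional[Rect]:
    """Extent-based criterion: for each object e on the boundary of the node
    footprint (area a), compute the footprint area a_e without e; e qualifies
    if a_e < area_threshold * a, and the largest gain a - a_e wins."""
    node_fp = footprint(entries)
    a = node_fp.area()
    best, best_gain = None, 0.0
    for i, e in enumerate(entries):
        on_boundary = (e.xlo == node_fp.xlo or e.ylo == node_fp.ylo
                       or e.xhi == node_fp.xhi or e.yhi == node_fp.yhi)
        rest = entries[:i] + entries[i + 1:]
        if not on_boundary or not rest:
            continue
        a_e = footprint(rest).area()
        gain = a - a_e
        if a_e < area_threshold * a and gain > best_gain:
            best, best_gain = e, gain
    return best
```

For a node holding one large rectangle that encloses several small ones, both functions select the large rectangle; for a node of equally sized rectangles tiling a square, neither finds a candidate, so the overflow would fall through to a split.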
In an R*-tree, when object G is inserted into the tree of Figure 3(a), it splits the node pointed to by index entry S, giving rise to overlapping index entries in the tree. The same phenomenon repeats when objects M and H are inserted. Figure 3(d) shows the resulting R*-tree. The newly formed index entries I1, ..., I7 overlap and also index dead space.

3.3 Demotion Criteria

We now discuss criteria for demoting promoted objects in an index. These demotion criteria correspond to the promotion criteria discussed above and ensure that no entries are unnecessarily stored at higher levels in the tree. The demotion criteria are described below.

Demotion Criterion for Nesting-based Promotion: Check if there is any promoted data object that is fully contained in an index entry of the same node. If there is more than one such object, choose the one with the smallest area.

Demotion Criterion for Extent-based Promotion: Check for a promoted data object in the overflowing node that has no promotion gains. To calculate the gains of a promoted object $e$, we first determine the best subtree into which $e$ can be inserted. Let $g$ be the child entry in the node that corresponds to this subtree. If $a$ is the area of the index entry $g$, calculate the area $a_e$ that $g$ would have if $e$ were inserted in the subtree. Object $e$ is a candidate for demotion if the area $a$ is at least a fraction (specified by the area threshold) of the area $a_e$. Among all possible candidates, choose the one that has the minimum area increase.

3.4 Query Processing

Processing of queries proceeds in the same way as in R*-trees. The only difference is that we have to check for data objects even in index nodes of the tree. The algorithm is as follows: If node T is a leaf node (i.e., contains only data objects), determine all the data objects that satisfy the query. Otherwise, for each child entry e in the node T do the following:

If entry e is a promoted object, include it in the result if it satisfies the query.
Otherwise, process the query on the subtree corresponding to the child entry e.

Consequently, in Figure 3(c) a containment query Q with the same footprint as object G retrieves object G from level 1 along with objects D, E and F from the leaf level. In this process, it accesses only 3 nodes. In contrast, as seen from Figure 3(d), the same query accesses 4 nodes in an R*-tree.

3.5 Deletions

Deletions are processed in the following manner: If the data object is in a leaf node of the tree, then the deletion is the same as in an R*-tree. If the data object is stored at a non-leaf node of the tree, the data object is removed from the node and the node is adjusted as if it were a deletion of an index entry.

4 Experimental Results

In our experiments we implemented the promotion-based index on top of an R*-tree, evaluated the relative merits of the two promotion criteria (nesting-based promotion and extent-based promotion), and compared them with a conventional R*-tree. The code for all the structures is written in C and was run on a Sparc workstation under the SunOS 5.4 operating system. The workstation had 32MB of main memory and a clock speed of 40MHz. The disk page size was fixed at 8K in all our experiments. This allowed a node to have a maximum capacity of 200 entries.

4.1 Experimental Setup

We used three different real cartographic datasets in our experiments: the first from the Alexandria Gazetteer, the second from the Alexandria Catalog, and the third from the Database Institute, University of Munich, Germany. The Gazetteer has spatial footprints of most geographical features of the world. The Alexandria Gazetteer is built by merging the GNIS (Geographic Names Information Systems) Gazetteer and the BGN (Board of Geographic Names) Gazetteer.
The GNIS Gazetteer has locations and bounding coordinates (in degrees) for places and other geographical features of interest in the United States, whereas the BGN has worldwide information. The combined Gazetteer has 5 million entries, most having point extents. We ran our experiments on the sample for California, which has 73K entries, of which 68K entries are just point data.

The second dataset is the Catalog of all holdings in the Alexandria Digital Library. Each holding in the Catalog has an associated spatial footprint, which is used in queries (in conjunction with other information) for searching the Catalog database. To speed up the retrieval processing, we index these spatial extents. Unlike the Gazetteer, most of these spatial extents are rectangular objects and are spread out over a large number of nesting levels. Specifically, the 400K data of the Catalog has 36 levels of nesting. Since each data entry is 20 bytes, the total size of the Catalog data is 8MB.

The third dataset is the Tiger data obtained from the archive at the Database Institute, University of Munich, Germany. The 128K data has a nesting depth of 4 and mainly consists of very small rectangles. The total size of the data is 256K.

Since our datasets are 2-dimensional in nature, we experimented with 2-dimensional queries. The generated queries span 0.001% to 10% of the area of the domain (e.g., California for the Gazetteer). For each query area, we generated 100 queries of each type: exact match, enclosure, intersection, and containment. Each query is generated by first generating the center using an independent random distribution in the x and y dimensions and then fixing a square box of the required area around the center of the query. We conducted our experiments with only the root nodes of the trees cached. (The behavior is similar when much larger portions of the index are cached.) We measured the number of node accesses per query averaged over 100 queries.
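The query workload described above can be sketched as follows. This is an illustrative Python sketch, not the authors' generator: it assumes the "independent random distribution" for the center is uniform over the domain, leaves query boxes unclipped at the domain boundary, and uses the hypothetical names `Box` and `make_queries`.

```python
import math
import random
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (xlo, ylo, xhi, yhi)


def make_queries(domain: Box, area_fraction: float,
                 count: int = 100, seed: int = 0) -> List[Box]:
    """Generate `count` square query windows, each covering `area_fraction`
    of the domain area, centered at independently drawn x and y coordinates.
    Assumption: centers are uniform over the domain."""
    xlo, ylo, xhi, yhi = domain
    domain_area = (xhi - xlo) * (yhi - ylo)
    side = math.sqrt(area_fraction * domain_area)  # side of the square window
    rng = random.Random(seed)  # fixed seed keeps the workload repeatable
    queries = []
    for _ in range(count):
        cx = rng.uniform(xlo, xhi)
        cy = rng.uniform(ylo, yhi)
        queries.append((cx - side / 2, cy - side / 2,
                        cx + side / 2, cy + side / 2))
    return queries
```

With this sketch, the range quoted in the text corresponds to `area_fraction` values from `1e-5` (0.001%) to `0.1` (10%), with one call per query area.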
Note that in all our experiments the number of disk (node) accesses for containment and intersection queries in an R*-tree (and in all our index schemes that are built on top of the R*-tree) are the same. This is because the algorithms for determining containment and intersection differ only at the leaf nodes, where they check for containment or intersection of the data. Likewise, the node accesses for enclosure and exact search queries are the same.

4.2 Effect of Thresholds in Promotion Schemes

The effect of the thresholds in the promotion-based index (the area threshold for the extent-based criterion and the nesting threshold for the nesting-based criterion) is to control the number of entries promoted to the higher levels in the tree. If too many entries are promoted, it may result in an increase in the height of the tree and consequently the performance advantages due to promotion may vanish. If there are too few promotions, we may not get any performance improvements. In this section, we describe the results of varying the thresholds for each promotion criterion.

First, we study the effect of varying the area threshold for extent-based promotion on the Catalog data. As the area threshold increases from 0.6 to 0.9, the number of

8 Catalog Data a t=0.6 a t=0.75 a t=0.9 # of promoted entries Area covered Height Table 1. Effect of varying area threshold (a t ) on Extent-based Promotion Catalog Data Qry Enclosure/Exact Search Containment/ Intersection Range (Disk Accesses) (Disk Accesses) a t=0.6 a t=0.75 a t=0.9 a t=0.6 a t=0.75 a t= Table 2. Extent-based Promotion with varying area thresholds (a t ) entries that are eligible for promotion increases. Consequently, the number of promoted entries increases reducing the total area covered by the index entries of the tree. These facts are borne out in Table 1. From the analysis for the node accesses in the previous section, we expect that the decrease in the extents (areas) is accompanied by a decrease in the average number of disk accesses per query. This is verified in Table 2, where there is a corresponding decrease in the number of disk accesses per query as the area threshold increases from 0.6 to 0.9. However, as the query area increases, the difference in performance for the different area thresholds decreases. This is because with the increase in the query area, the number of objects retrieved in the query increases for all variants. Consequently, the gains from having some of them promoted do not contribute to a decrease in the disk accesses. Similar results were obtained for the Gazetteer and the Tiger datasets. Catalog Data n t=67 n t=100 # of promoted entries Area covered Height 4 3 Table 3. Effect of varying nesting thresholds on Nesting-based Promotion Next, we vary the nesting-threshold for the nesting-based promotion. Tables 3 and 4 show the performance variance on the Catalog data for two different values of nesting threshold M/3 (67) and M/2 (100), where M is the Catalog Data Qry Enclosure/Exact Search Containment/Intersection Range (Disk Accesses) (Disk Accesses) n t=67 n t=100 n t=67 n t= Table 4. Nesting-based Promotion with varying nesting thesholds maximum capacity (200) of a tree node. 
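As a rough sanity check on the threshold and height values discussed here, the following sketch computes the two nesting thresholds from the node capacity and a lower bound on tree height. It is a back-of-envelope illustration only: the function names are hypothetical, rounding M/3 up to 67 is an assumption chosen to match the reported value, and the height bound assumes every node is packed to full capacity (which real R*-trees are not). How promoted entries occupy slots in internal nodes is not modeled.

```python
import math
from typing import Tuple


def nesting_thresholds(capacity: int) -> Tuple[int, int]:
    """Return (M/3 rounded up, M/2) for a node capacity M.

    Assumption: M/3 is rounded up, which gives 67 for M = 200.
    """
    return math.ceil(capacity / 3), capacity // 2


def min_tree_height(num_entries: int, capacity: int) -> int:
    """Lower bound on the number of levels when every node is full."""
    if num_entries < 1 or capacity < 2:
        raise ValueError("need at least one entry and a capacity of 2 or more")
    height = 1
    nodes = math.ceil(num_entries / capacity)  # leaf nodes
    while nodes > 1:
        nodes = math.ceil(nodes / capacity)    # nodes one level up
        height += 1
    return height


# Values taken from the text: M = 200 and roughly 400K Catalog entries.
catalog_thresholds = nesting_thresholds(200)
catalog_min_height = min_tree_height(400_000, 200)
```

With 400K entries and a capacity of 200, the bound works out to three levels, so an index of height 3 has little slack: moving a large number of entries into the upper levels can plausibly force a fourth level.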
From Table 3, we observe that a nesting threshold of M/3 promotes a large number of entries. This is expected because the Catalog data is highly nested (with 36 levels of nesting). Consequently, this exodus from leaf levels to higher levels in the tree increases the height of the tree from 3 to 4. According to our analysis in the previous section, the performance should improve if the height increase is avoided. As seen in Table 3, increasing the nesting threshold to M/2 results in the promotion of 2810 data objects with no increase in the height of the tree. Consequently, as seen in Table 4, the performance of the index is at its best for a nesting threshold of 100 (M/2). We also increased the threshold to 150 (3M/4) and found that there were no promoted objects for the Catalog data. This suggests that choosing the optimal nesting threshold could be quite tricky and could depend very much on the data characteristics. In contrast, the area threshold for the extent-based scheme is quite straightforward and can be fixed at 0.9 for all practical instances of non-uniform data.

4.3 Performance Comparison of Promotion-based Schemes with R*-tree

We now compare the performance of our promotion-based schemes with that of an R*-tree using different datasets. For the extent-based promotion scheme, we fixed the area threshold at 0.9, and for the nesting-based promotion, we chose the nesting threshold to be 100 (M/2). These values correspond to the best thresholds for the two schemes. The performance figures for the two promotion criteria, along with those for an R*-tree, are tabulated in Table 5. For containment queries, we observe that the extent-based promotion has a slight advantage over nesting-based promotion. This is because the extent-based scheme promotes data objects only if there is a substantial gain in doing so, unlike the nesting-based promotion. For enclosure queries, the nesting-based promotion performs better than the extent-based promotion.
This is a consequence of having around 3K promoted entries in nesting-based promotion as compared to 280 entries in extent-based promotion. Consequently, most of the enclosure queries are resolved at higher levels of the tree. However, both schemes yield substantial improvements in performance, although the extent-based promotion makes a better choice of (and has fewer) promoted entries than the nesting-based scheme.

Table 5. Performance Comparison of R*-tree with Promotion-based schemes on Catalog Data (rows: query range; columns: # of disk accesses for enclosure/exact search and for containment/intersection, each under R*, Nest. Prom R*, and Extent Prom R*).

Next, we present the performance results on the Gazetteer dataset. Figures 4 and 5 show these results for the three structures. Note that the x- and y-scales in these figures (and in subsequent ones too) are logarithmic. We observe that the nesting-based promotion and the extent-based promotion perform equally well. This is because most of the data objects that give rise to anomalies of non-uniformity are promoted in both structures.

Next, we examine the performance gains for the Tiger data. Figure 6 shows the performance curves. Since the Tiger data had small rectangles, we fixed the area threshold for the extent-based promotion criterion accordingly. The nesting-based promotion did not yield any performance gains over the R*-tree. In contrast, the extent-based promotion scheme yields a 5% improvement for containment queries and up to a 10% improvement for enclosure queries. This is because the Tiger data has hardly any nesting, which explains the lack of improvement with the nesting-based criterion. The extent-based criterion, on the other hand, shows an improvement in performance because it does not require data to be nested. These experiments indicate that the extent-based promotion is sufficiently robust to adapt to a wide variety of non-uniform data.
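The earlier note that containment and intersection queries incur the same number of node accesses can be illustrated with a toy index. The sketch below is not the authors' R*-tree or promotion code: the `Node` layout, the helper names, and the reading of a containment query as "data rectangles lying inside the query window" are assumptions made for illustration. Both query types descend into every child whose bounding box intersects the window and differ only in the test applied to data rectangles at the leaves.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contains(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def mbr(rects: List[Rect]) -> Rect:
    """Minimum bounding rectangle of a non-empty list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


@dataclass
class Node:
    box: Rect
    children: List["Node"] = field(default_factory=list)  # internal node
    data: List[Rect] = field(default_factory=list)        # leaf node


def leaf(data: List[Rect]) -> Node:
    return Node(mbr(data), data=list(data))


def inner(children: List[Node]) -> Node:
    return Node(mbr([c.box for c in children]), children=list(children))


def search(root: Node, query: Rect,
           leaf_test: Callable[[Rect, Rect], bool]) -> Tuple[List[Rect], int]:
    """Return (matching data rectangles, number of nodes visited)."""
    hits: List[Rect] = []
    visited = 0
    stack = [root]
    while stack:
        node = stack.pop()
        visited += 1
        # Descent is identical for both query types: prune by intersection.
        stack.extend(c for c in node.children if intersects(c.box, query))
        # Only this leaf-level test depends on the query type.
        hits.extend(r for r in node.data if leaf_test(query, r))
    return hits, visited


def build_demo_tree() -> Node:
    """A hypothetical two-level tree with three leaves."""
    return inner([
        leaf([(0, 0, 1, 1), (2, 2, 3, 3)]),
        leaf([(5, 5, 9, 9), (6, 0, 7, 1)]),
        leaf([(20, 20, 21, 21)]),
    ])
```

Running `search` on `build_demo_tree()` with the same window, once with `intersects` and once with `contains` as the leaf test, returns different result sets but the same visit count, which is why the experiments report one column for both query types.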
Figure 4. Performance of different promotion schemes with R*-tree on Gazetteer Data for Containment queries (x-axis: query area as percentage of total area; y-axis: number of disk accesses; curves: Extent-based Prom R*, Nesting-based Prom R*, R*).

Figure 5. Performance of different promotion schemes with R*-tree on Gazetteer Data for Enclosure queries (same axes and curves as Figure 4).

Figure 6. Performance of different promotion schemes with R*-tree on Tiger Data for (a) Containment, and (b) Enclosure queries (same axes and curves as Figure 4).

5 Conclusions

In this paper, we examined efficient indexing strategies for non-uniform spatial data. We observed that the performance of a tree-based index depends on two factors: the height of the tree, and the area covered by the index entries of the tree. We proposed a promotion scheme that addresses these two factors. This scheme improves index performance by reducing area coverage without affecting the tree height. Within this scheme, we experimented with two promotion criteria: the nesting-based criterion and the extent-based criterion. We evaluated their relative merits in comparison to the conventional R*-tree on real cartographic data. We noted that both promotion criteria yield substantial improvements (up to 45%) in disk access performance over the conventional R*-tree. We found that the extent-based criterion is more robust than the nesting-based criterion. In future, we plan to extend these optimizations to other spatial index structures such as BANG files [3] and BV-trees [4].

References

[1] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, May.
[2] S. Berchtold, D. A. Keim, and H. P. Kriegel. The X-tree: An index structure for high dimensional data. Proceedings of the Int. Conf. on Very Large Data Bases.
[3] M. W. Freeston. The BANG file: A new kind of grid file. Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, May.
[4] M. W. Freeston. A general solution of the n-dimensional B-tree problem. Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, May.
[5] O. Gunther. The design of the cell tree: An object-oriented index structure for geometric databases. Proc. Int. Conf. on Data Engineering.
[6] O. Gunther. Evaluation of spatial access methods with oversize shelves. Workshop on Geographic Database Management Systems.
[7] O. Gunther and H. Noltemeier. Spatial database indices for large extended objects. Proc. Int. Conf. on Data Engineering.
[8] A. Guttman. R-trees: A dynamic index structure for spatial searching. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 47-57.
[9] C. Kolovson and M. Stonebraker. Segment indexes: Dynamic indexing techniques for multi-dimensional interval data. Proc. ACM SIGMOD Int. Conf. on Management of Data.
[10] D. B. Lomet and B. Salzberg. The hB-tree: A multiattribute indexing method with good guaranteed performance. ACM Transactions on Database Systems, 15(4), December.
[11] J. Nievergelt, H. Hinterberger, and K. C. Sevcik. The grid file: An adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, 9(1):38-71, March.
[12] B. U. Pagel, H. W. Six, H. Toben, and P. Widmayer. Towards an analysis of range query performance. Proc. ACM Symp. on Principles of Database Systems.
[13] K. V. Ravi Kanth, D. Agrawal, A. El Abbadi, Ambuj K. Singh, and T. Smith. Indexing hierarchical data. Technical Report CS-TR-9514, Univ. of California at Santa Barbara.
[14] K. V. Ravi Kanth and Ambuj K. Singh. Optimal dynamic range searching in non-replicating index structures. Manuscript.
[15] H. Samet. The design and analysis of spatial data structures. Addison-Wesley Publishing Co.
[16] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: A dynamic index for multi-dimensional objects. Proceedings of the Int. Conf. on Very Large Data Bases, (13).
[17] T. R. Smith and J. Frew. Alexandria Digital Library. Communications of the ACM, 38(4):61-62, April.
[18] D. White and R. Jain. Similarity indexing with the SS-tree. Proc. Int. Conf. on Data Engineering, 1996.


More information

CS301 - Data Structures Glossary By

CS301 - Data Structures Glossary By CS301 - Data Structures Glossary By Abstract Data Type : A set of data values and associated operations that are precisely specified independent of any particular implementation. Also known as ADT Algorithm

More information

1 The range query problem

1 The range query problem CS268: Geometric Algorithms Handout #12 Design and Analysis Original Handout #12 Stanford University Thursday, 19 May 1994 Original Lecture #12: Thursday, May 19, 1994 Topics: Range Searching with Partition

More information

Algorithms for GIS:! Quadtrees

Algorithms for GIS:! Quadtrees Algorithms for GIS: Quadtrees Quadtree A data structure that corresponds to a hierarchical subdivision of the plane Start with a square (containing inside input data) Divide into 4 equal squares (quadrants)

More information

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19 CSE34T/CSE549T /05/04 Lecture 9 Treaps Binary Search Trees (BSTs) Search trees are tree-based data structures that can be used to store and search for items that satisfy a total order. There are many types

More information

Efficiency of Hybrid Index Structures - Theoretical Analysis and a Practical Application

Efficiency of Hybrid Index Structures - Theoretical Analysis and a Practical Application Efficiency of Hybrid Index Structures - Theoretical Analysis and a Practical Application Richard Göbel, Carsten Kropf, Sven Müller Institute of Information Systems University of Applied Sciences Hof Hof,

More information

Figure 4.1: The evolution of a rooted tree.

Figure 4.1: The evolution of a rooted tree. 106 CHAPTER 4. INDUCTION, RECURSION AND RECURRENCES 4.6 Rooted Trees 4.6.1 The idea of a rooted tree We talked about how a tree diagram helps us visualize merge sort or other divide and conquer algorithms.

More information

Indexing Biometric Databases using Pyramid Technique

Indexing Biometric Databases using Pyramid Technique Indexing Biometric Databases using Pyramid Technique Amit Mhatre, Sharat Chikkerur and Venu Govindaraju Center for Unified Biometrics and Sensors (CUBS), University at Buffalo, New York, U.S.A http://www.cubs.buffalo.edu

More information

Chapter 12: Indexing and Hashing (Cnt(

Chapter 12: Indexing and Hashing (Cnt( Chapter 12: Indexing and Hashing (Cnt( Cnt.) Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition

More information

CSE 530A. B+ Trees. Washington University Fall 2013

CSE 530A. B+ Trees. Washington University Fall 2013 CSE 530A B+ Trees Washington University Fall 2013 B Trees A B tree is an ordered (non-binary) tree where the internal nodes can have a varying number of child nodes (within some range) B Trees When a key

More information

II (Sorting and) Order Statistics

II (Sorting and) Order Statistics II (Sorting and) Order Statistics Heapsort Quicksort Sorting in Linear Time Medians and Order Statistics 8 Sorting in Linear Time The sorting algorithms introduced thus far are comparison sorts Any comparison

More information

Multidimensional Indexes [14]

Multidimensional Indexes [14] CMSC 661, Principles of Database Systems Multidimensional Indexes [14] Dr. Kalpakis http://www.csee.umbc.edu/~kalpakis/courses/661 Motivation Examined indexes when search keys are in 1-D space Many interesting

More information

Chapter 4: Trees. 4.2 For node B :

Chapter 4: Trees. 4.2 For node B : Chapter : Trees. (a) A. (b) G, H, I, L, M, and K.. For node B : (a) A. (b) D and E. (c) C. (d). (e).... There are N nodes. Each node has two pointers, so there are N pointers. Each node but the root has

More information

Chapter 13: Query Processing

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

The R+-Tree: A Dynamic Index for Multi- Dimensional Objects

The R+-Tree: A Dynamic Index for Multi- Dimensional Objects Carnegie Mellon University Research Showcase @ CMU Computer Science Department School of Computer Science 9-1987 The R+-Tree: A Dynamic Index for Multi- Dimensional Objects Timos Sellis University of Maryland

More information

Spatial Data Structures

Spatial Data Structures CSCI 480 Computer Graphics Lecture 7 Spatial Data Structures Hierarchical Bounding Volumes Regular Grids BSP Trees [Ch. 0.] March 8, 0 Jernej Barbic University of Southern California http://www-bcf.usc.edu/~jbarbic/cs480-s/

More information

T. Biedl and B. Genc. 1 Introduction

T. Biedl and B. Genc. 1 Introduction Complexity of Octagonal and Rectangular Cartograms T. Biedl and B. Genc 1 Introduction A cartogram is a type of map used to visualize data. In a map regions are displayed in their true shapes and with

More information

Indexing and Hashing

Indexing and Hashing C H A P T E R 1 Indexing and Hashing Solutions to Practice Exercises 1.1 Reasons for not keeping several search indices include: a. Every index requires additional CPU time and disk I/O overhead during

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

Ray Tracing Acceleration Data Structures

Ray Tracing Acceleration Data Structures Ray Tracing Acceleration Data Structures Sumair Ahmed October 29, 2009 Ray Tracing is very time-consuming because of the ray-object intersection calculations. With the brute force method, each ray has

More information

Notes on Binary Dumbbell Trees

Notes on Binary Dumbbell Trees Notes on Binary Dumbbell Trees Michiel Smid March 23, 2012 Abstract Dumbbell trees were introduced in [1]. A detailed description of non-binary dumbbell trees appears in Chapter 11 of [3]. These notes

More information

Main Memory and the CPU Cache

Main Memory and the CPU Cache Main Memory and the CPU Cache CPU cache Unrolled linked lists B Trees Our model of main memory and the cost of CPU operations has been intentionally simplistic The major focus has been on determining

More information

CMSC724: Access Methods; Indexes 1 ; GiST

CMSC724: Access Methods; Indexes 1 ; GiST CMSC724: Access Methods; Indexes 1 ; GiST Amol Deshpande University of Maryland, College Park March 14, 2011 1 Partially based on notes from Joe Hellerstein Outline 1 Access Methods 2 B+-Tree 3 Beyond

More information

Adaptive-Mesh-Refinement Pattern

Adaptive-Mesh-Refinement Pattern Adaptive-Mesh-Refinement Pattern I. Problem Data-parallelism is exposed on a geometric mesh structure (either irregular or regular), where each point iteratively communicates with nearby neighboring points

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

On The Complexity of Virtual Topology Design for Multicasting in WDM Trees with Tap-and-Continue and Multicast-Capable Switches

On The Complexity of Virtual Topology Design for Multicasting in WDM Trees with Tap-and-Continue and Multicast-Capable Switches On The Complexity of Virtual Topology Design for Multicasting in WDM Trees with Tap-and-Continue and Multicast-Capable Switches E. Miller R. Libeskind-Hadas D. Barnard W. Chang K. Dresner W. M. Turner

More information

Framework for Design of Dynamic Programming Algorithms

Framework for Design of Dynamic Programming Algorithms CSE 441T/541T Advanced Algorithms September 22, 2010 Framework for Design of Dynamic Programming Algorithms Dynamic programming algorithms for combinatorial optimization generalize the strategy we studied

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Overview Catalog Information for Cost Estimation $ Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Transformation

More information

TRANSACTION-TIME INDEXING

TRANSACTION-TIME INDEXING TRANSACTION-TIME INDEXING Mirella M. Moro Universidade Federal do Rio Grande do Sul Porto Alegre, RS, Brazil http://www.inf.ufrgs.br/~mirella/ Vassilis J. Tsotras University of California, Riverside Riverside,

More information

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information