Indexing Valid Time Intervals *


Tolga Bozkaya, Meral Ozsoyoglu
Computer Engineering and Science Department
Case Western Reserve University
{bozkaya, ozsoy}@ces.cwru.edu

Abstract

To support temporal operators and to increase the efficiency of temporal queries, indexing based on temporal attributes is required. We consider the problem of indexing the temporal dimension in valid time databases. We assume that the temporal information of data objects is represented as valid time intervals that have to be managed dynamically by an efficient index structure. Unlike the time intervals in transaction time databases, valid time intervals can be inserted, deleted, and modified at any point in time. Furthermore, their lifespans can go beyond the current time point and extend into the future. We propose an indexing scheme that uses augmented B+trees called Interval B+trees for indexing a dynamic set of valid time intervals. Interval B+trees (IB+trees) use the beginning points of the intervals as keys and keep, in each internal node, the maximum end point of each of its subtrees. We introduce an algorithm to apply time-splits at the leaf level of the IB+tree that partitions long valid time intervals into disjoint subintervals and distributes them among several leaf nodes to increase the efficiency of the search operation, especially for timeslice queries. We compared IB+trees with time-splits to one dimensional R-trees and observed that while their performances for timeslice queries are comparable, IB+trees are far superior for many temporal queries that are based on the beginning points of time intervals. This is expected, as IB+trees use the beginning points of intervals as keys and therefore support such queries naturally. We also show extensions to our indexing scheme for handling open ended valid time intervals (valid time intervals whose lifespans extend into the future indefinitely), and valid time intervals whose end points move along the current timeline.
* This work has been partially supported by NSF grants IRI , IRI , and the NSF FAW Award IRI

1. Introduction

In this paper, we are concerned with indexing a dynamic set of valid time intervals. Valid time intervals represent the time span in which the data entities exist in real life. The notion of valid time is very common in many database applications, such as banking, scientific experiments, payroll databases, multimedia databases, etc. In valid time databases, the necessary tools are provided to model, maintain, and query information that varies over time. To efficiently handle queries on the temporal dimension of data in valid time databases, indexing the time intervals that correspond to the lifespans of temporal object versions is required.

The majority of research on indexing temporal databases concentrates on designing efficient index structures for transaction time databases [ST95]. In transaction time databases, temporal objects are inserted into the database in an append-only fashion, and modifications or deletions to

historical data are not allowed. Temporal information of historical versions of entities is represented as transaction time intervals, where each time interval corresponds to the duration of the existence of a historical entity in the database. This append-only behavior provides opportunities to design new index structures, or to fine-tune existing ones to increase efficiency for temporal search queries. We mention some of the related work on indexing transaction time intervals in the next section.

Indexing valid time intervals in a temporal database is quite a different problem from indexing transaction time intervals. First of all, the append-only behavior cannot be assumed in valid time databases. In valid time databases, it is possible to insert, update, or delete past and future information whenever the information becomes available. Modifications and deletions are also possible due to corrections. In short, valid time intervals have to be managed dynamically, unlike transaction time intervals. Second, valid time intervals may extend into the future indefinitely (making them open ended), or they may have end points that move along the current timeline. Coupled with the fact that these intervals have to be managed dynamically, the problem of indexing valid time databases becomes a problem with different requirements.

In bitemporal databases, temporal information about entities is represented in both temporal dimensions (valid time and transaction time). Each entity has a valid time lifespan that was recorded in the database throughout the transaction time lifespan of the entity. Indexing bitemporal (two dimensional) intervals is another interesting problem, where the challenge is to exploit the append-only behavior of the transaction time dimension and to provide dynamic management of intervals for the valid time dimension [KTF95].
In this paper, since our focus is on valid time databases, unless otherwise specified, we use intervals to refer to valid time intervals. Temporal information can be queried in a variety of ways. Conventionally, a temporal index is supposed to support timeslice queries. Since we focus on indexing the temporal dimension only, we are mainly interested in answering pure-timeslice queries efficiently. A pure timeslice query asks for all temporal objects whose time intervals intersect a given query time point (query time interval, in the general case). Other variations of timeslice queries may include specifications on the key dimension such as asking for temporal objects whose key values fall in a given key range/value and whose time intervals intersect a given time instant (range-timeslice queries / pure-key-timeslice queries ). In the most general case, temporal queries may employ any of the temporal operators that specify the 13 possible relationships between intervals [AH85] (See Figure 5.5). Particular index orders among the beginning points, or the end time points of the intervals may come in handy in efficiently answering queries that employ these temporal operators. We propose an indexing scheme that uses B+trees on beginning points of the intervals augmented with maximum end time points of intervals in the internal nodes. We call this augmented structure as the Interval B+tree [BO95] (IB+tree for short). Since the basic structure of the IB+tree is the B+tree, it can efficiently index a dynamic set of valid time intervals. The keys of the IB+tree are beginning points of the intervals, which makes it useful for temporal queries that employ temporal operators that capture temporal relationships based on beginning points of intervals such as right-covered by, equals, right-covers, covered by, left-covered by, right overlaps, met by (See Figure 5.5). 
The augmented information in the internal nodes helps to trim the search for timeslice queries, and may be useful for queries that employ other temporal operators. In IB+trees, the augmented information in an internal node is simply the maximum end points of the intervals that are indexed in the subtrees below that node. This information will not be

very useful when there are several long intervals distributed over the timeline, causing the maximum end points in the subtrees to be too high to trim the search efficiently. To handle this situation, we propose a time-split algorithm to be applied to the intervals in the leaves. These time-splits partition long intervals into several shorter parts, making the augmented information much more useful in trimming the search path. The time-splits do not require any extra operation on the IB+trees, although they increase storage requirements, as the split parts have to be reinserted into the tree. We should also note that time-splits can be applied off-line during nonpeak hours of operation.

We compared IB+trees (with time-splits) with one dimensional R-trees and observed that their performances for timeslice queries are very close. Note that R-trees do not directly support search operations based on the beginning points of the intervals. Most of the time, they require a timeslice search for these operations. We also experimented with queries that employ the covered by and met by operators to demonstrate this point. We show how IB+trees can be extended to index open ended valid time intervals and valid time intervals with moving end points (along the current timeline), together with intervals having fixed end points.

The rest of the paper is organized as follows. In the next section, we discuss the related work on indexing temporal intervals and dynamic interval management. In section 3, we describe the IB+tree structure. Time-splits are explained in section 4. Section 5 shows the results of our experimental work comparing one dimensional R-trees and IB+trees with time-splits. In section 6 we explain how open ended and moving ended intervals could be handled in IB+trees. Section 7 concludes.

2. Related Work

There are quite a number of index structures designed for transaction time databases (see [ST95] for a recent survey).
Index structures that are designed for transaction time databases exploit the append-only behavior of transaction time intervals to provide efficiency in temporal queries. Time Index [EWK90], Append-Only trees [SG93], Snapshot Index [TK95], and AD*-trees [BO95] are a few of the index structures designed for answering queries on the temporal dimension only. Some of these structures are shown to provide optimal query time [TK95, Ram97] (in terms of order) in answering pure timeslice queries. In the literature, there are also index structures that are built on both the key and time dimensions for efficiently answering range-timeslice queries, as well as key-range queries that are purely based on the key dimension. ST-trees [SG93], Multiversion B-trees [BGO+93], Time-split B+trees [LS90], and B+trees on window lists [Ram97] are some of these structures. Some of them are also proven to guarantee optimal query time [BGO+93, ST95]. We do not elaborate on these structures due to limited space.

There are also index structures that are used for dynamic management of time intervals. Segment R-trees [KS91], R-trees [Gut84], and the TP-index [SOL94] are some of them. These structures are mostly multidimensional index structures that can be used directly (such as R-trees) to index time intervals, or variations of multidimensional index structures tailored for indexing temporal domains (such as the TP-index). Being a popular spatial index structure, the R-tree has been heavily used for dynamic interval management, and comparing new index structures against it has become common practice. We take the same approach in this paper, and compare the performance of IB+trees with time-splits to one dimensional R-trees.

3. Interval B+trees

In this section we briefly explain the Interval B+tree structure to show how it keeps and makes use of the augmented information in its internal nodes. There are several structures in the literature for indexing interval (not necessarily temporal) data. Some of these index structures for intervals have been researched in fields other than databases (such as computational geometry). Although they are not suitable for secondary storage, as they are binary tree structures, they have been the inspiration for many secondary storage index structures for database applications. Priority Search trees [Mc85], Segment trees [PS85], and Interval-trees [CLR90] are some of the structures proposed for interval search. Interval B+trees are secondary storage models of the Interval-trees, therefore we discuss Interval-trees in more detail below.

Interval-trees

The Interval-tree [CLR90] is a binary tree that is augmented to support operations on a dynamic set of intervals. The underlying tree structure can be any balanced binary tree structure such as the AVL tree, Red-black tree, etc., provided that the augmented information can be efficiently maintained throughout the dynamic operations (insertions and deletions) that keep the tree balanced. A node x of the Interval-tree contains an interval (int[x]), and the key of x is the beginning point of that interval. Thus, an inorder tree walk of the data structure lists the intervals in sorted order by their beginning points. In addition to this, a node x also contains the maximum value of any interval end point stored in the subtree rooted at that node (which we denote as max[x]). This information is maintained with little effort through all operations (e.g., rotations) that keep the tree balanced. Insertions and deletions can be done in O(log_2 n). The Interval-tree supports the interval search operation, which finds an interval that intersects with a given search interval.
The algorithm is simple and short, as shown below:

INTERVAL-SEARCH(T, I) [CLR90]
(For a given interval I[i_s, i_e], find an interval that intersects with I in the Interval-tree T. left[x] and right[x] stand for the left and right child of a node x.)
1) x = root(T)
2) while x ≠ NIL and I does not intersect the interval int[x] do
   2.1) if left[x] ≠ NIL and max[left[x]] ≥ i_s then x = left[x]
   2.2) else x = right[x]
3) return x if it is not NIL.

Interval B+tree Structure

The Interval B+tree (IB+tree) is a direct generalization of the Interval-tree to a multi-way B+tree structure. It is, basically, a B+tree where each node is augmented with the same kind of information as in the binary Interval-trees. While the properties of the B+tree structure are kept invariant, the internal nodes of an IB+tree keep the maximum end point of the intervals indexed by their subtrees. So, an internal node of order k (with k children) has k maximum end points, one for each of its children, as the augmented information. The leaves of the tree keep the data items and have no children, so they do not have any extra information.

Insertion and deletion operations for the IB+tree are similar to those for B+trees, with the only exception of a little overhead to maintain the augmented information. However, this overhead does not change the complexity of the operations. Most of the time, only the maximum fields of some of the nodes visited (in the worst case, all of them) along the path from the root to

the leaf (the leaf where the insertion or the deletion is made) may need to be updated. The complexity of insertion and deletion operations for IB+trees is O(log_k n) (the same as B+trees), where n is the number of leaf nodes, and k is the average fanout of a node in the tree. The internal node structure of an IB+tree with k keys a_1, a_2, ..., a_k, k child pointers c_1, ..., c_k, and k maximum end points (we will refer to them as maximums) m_1, ..., m_k of the subtrees rooted at each child is shown in Figure 3.1.

Keys:      a_1        a_2        ...  a_k
Children:  c_1 (m_1)  c_2 (m_2)  ...  c_{k-1} (m_{k-1})  c_k (m_k)
Figure 3.1. An internal IB+tree node

The difference from the Interval-trees is that the maximum end points for each of the children are kept in the parent node. So, the child nodes need not be accessed to check the maximum end points in their subtrees, as is required in Interval-trees. Since an IB+tree is a generalization of the binary Interval-tree, we can use the same interval search algorithm above with minor modifications for an IB+tree.

INTERVAL-SEARCH(N, I) (for Interval B+trees)
(For a given search interval I[i_s, i_e], where i_s and i_e are the starting and the ending points of I, find an interval that intersects with I in the Interval B+tree T. Here, N is a node of the Interval B+tree and the initial call is INTERVAL-SEARCH(root(T), I). Let us assume that N has k children (internal node), or k data items (leaf).)
1) if N is a leaf node then check if there is an intersecting interval with I among the intervals in N.
2) else if N is an internal node then
   2.1) i = 1;
   2.2) if I intersects [a_i, m_i] then INTERVAL-SEARCH(c_i, I)
   2.3) else if i < k then i = i + 1, goto 2.2

As the keys of the IB+tree are the beginning points of the indexed intervals, any query on the beginning points of intervals can be answered efficiently using the search algorithms of B+trees.
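The IB+tree INTERVAL-SEARCH steps above can be sketched in Python as follows. This is a minimal in-memory illustration, not the authors' implementation: the `Leaf` and `Internal` classes and the function names are hypothetical, intervals are assumed to be closed with integer end points, and a_i is assumed to be a lower bound on the beginning points stored under child c_i. The second function is a variant that keeps scanning the remaining children instead of stopping at the first qualifying one, so that it collects every intersecting interval.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

Interval = Tuple[int, int]  # closed interval (begin, end); an assumption of this sketch


def intersects(x: Interval, y: Interval) -> bool:
    """Two closed intervals overlap when each begins no later than the other ends."""
    return x[0] <= y[1] and y[0] <= x[1]


@dataclass
class Leaf:
    intervals: List[Interval]  # kept sorted by beginning point


@dataclass
class Internal:
    keys: List[int]    # a_i: assumed lower bound on beginning points under child i
    maxima: List[int]  # m_i: maximum end point under child i
    children: List[Union["Internal", Leaf]]


def interval_search(node, query: Interval) -> Optional[Interval]:
    """Return one interval intersecting `query`, or None if there is none."""
    if isinstance(node, Leaf):
        for iv in node.intervals:
            if intersects(iv, query):
                return iv
        return None
    for a, m, child in zip(node.keys, node.maxima, node.children):
        if intersects((a, m), query):
            # Step 2.2: descend into the first qualifying child only.
            return interval_search(child, query)
    return None


def all_interval_search(node, query: Interval) -> List[Interval]:
    """Variant: visit every qualifying child and collect all matches."""
    if isinstance(node, Leaf):
        return [iv for iv in node.intervals if intersects(iv, query)]
    found: List[Interval] = []
    for a, m, child in zip(node.keys, node.maxima, node.children):
        if intersects((a, m), query):
            found.extend(all_interval_search(child, query))
    return found


# A small illustrative tree: three leaves with two intervals each.
root = Internal(
    keys=[4, 10, 20],
    maxima=[22, 17, 41],
    children=[
        Leaf([(4, 22), (6, 11)]),
        Leaf([(10, 13), (14, 17)]),
        Leaf([(20, 32), (26, 41)]),
    ],
)
print(interval_search(root, (18, 25)))      # (4, 22)
print(all_interval_search(root, (18, 25)))  # [(4, 22), (20, 32)]
```

Because the maximums live in the parent entry, the middle leaf is never read for the query [18, 25]: its entry [10, 17] already rules it out.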
This structure does not fully support queries that are based on the end points of the intervals, but it is still helpful in many cases. We will discuss the algorithms for evaluating general timeslice queries, in which case all intervals intersecting a given query interval are supposed to be retrieved. Note that the INTERVAL-SEARCH algorithm returns one interval (if there exists at least one) that intersects with the given search interval. Actually, that interval is also the one with the minimum beginning point. To find all the intervals that intersect a given query interval, we can still use INTERVAL-SEARCH to find the first intersecting interval (the one with the minimum beginning point), and then use the links between the leaf nodes for a sequential search from that point on. Or, we can follow all the child pointers that satisfy the condition in step 2.2 of the algorithm to find all the intersecting intervals, hoping that we will be able to trim some high level branches of the search tree, which will help us answer the query faster. In the latter case, the INTERVAL-SEARCH algorithm has to be modified for a range search by replacing step 2.3 with the following:

2.3) if i < k then i = i + 1, goto 2.2 (the else is deleted)

With this change, the condition in step 2.2 is checked for all children of the node, and all subbranches that satisfy the condition are visited. We refer to the search algorithm described above as the ALL-INTERVAL-SEARCH algorithm. In the worst case, it may have to trace all the internal nodes in the range, which will be slower than a sequential search on the leaves, but on average, depending on the distribution of the data intervals, it may be able to trim some branches rooted at a level higher than the leaf level, which reduces the number of nodes read. Example 3.1 demonstrates such a case.

R:   (22)  (17)  (41)
C1:  [4,22] [6,11]     C2:  [10,13] [14,17]     C3:  [20,32] [26,41]
Figure 3.2. The IB+tree used in Example 3.1.

Example 3.1: In Figure 3.2, we see an IB+tree of height 2 and of order 3. Let us assume we want to find all the intervals that intersect with the search interval [18, 25]. We can answer this query in two ways. In the first one, we can use the INTERVAL-SEARCH algorithm to find the first intersecting interval with the minimum starting point ([4, 22]) and then carry out a sequential search following the links between the leaf nodes. For this we have to visit the nodes R, C1, C2, C3 in order. In the second way, we can use the ALL-INTERVAL-SEARCH algorithm, in which case we visit the nodes R, C1, C3, but not C2.

4. Time-splits

Although the IB+tree structure is a simple structure that allows dynamic management of valid time intervals, it may not be efficient for some distributions. Consider the case where each leaf node has a very long interval. The augmented information in the nodes of the IB+tree will not be useful to trim the search, as most of the leaves will have at least one long interval that would probably intersect the query interval. In this case, the IB+tree will not be any more helpful than a B+tree on the beginning points of the intervals.
Although such a pathological case is not likely to happen, the efficiency of IB+trees (for timeslice queries) clearly depends on how much the augmented information can be used to trim the search. To improve efficiency, we suggest applying time-splits at the leaf level that partition long intervals into disjoint subintervals and distribute them over several leaf nodes. The time-split operation for IB+trees is different from the conventional time-split operation used in structures for indexing transaction time intervals. In such structures, the time-split partitions (splits) some of the intervals in a node, accommodating the new partitions in a newly created node. In IB+trees, the split parts of the intervals after the time-split are reinserted into the tree with respect to their beginning points as usual. Note that a time-split does not require any extra operation on the B+tree structure (it is only applied at the leaf level), which means it can be implemented with the conventional B+tree operations (just requiring re-insertion of the split parts). Also, since time-splits are done only to increase the efficiency of search operations, they can be avoided entirely during peak hours of operation for better update performance and carried out in batches during off hours.

If a time-split operation is to be applied to a leaf node at time instant t, all data intervals whose end points extend beyond t are split at point t. The split parts are reinserted into the IB+tree, and t is marked as the new maximum end point of the intervals in the leaf. The maximum end point information is posted to the parent level, and this may proceed further up the tree if necessary. Let us give a simple example:

Example 4.1: Assume that there is a leaf node that accommodates the following 6 intervals.

[Figure 4.1: The intervals in leaf node L (Example 4.1), with the split point at 14.]

Leaf node L: ( [2, 6], [3, 46], [4, 10], [6, 58], [8, 11], [12, 14] ). The maximum end point for L is 58, which is kept at its parent node as the augmented end point information. If we decide to time-split this node at time point 14, the intervals [3, 46] and [6, 58] will each be separated into two partitions at time point 14 ([3, 46] will be split into [3, 14] and [15, 46], and [6, 58] will be split into [6, 14] and [15, 58]). After the split, L = ( [2, 6], [3, 14], [4, 10], [6, 14], [8, 11], [12, 14] ) with its maximum end point being 14. The intervals [15, 46] and [15, 58] will be reinserted into another node with respect to their beginning points (which is 15).

The remaining question is to decide when to apply a time-split to a node, and then, if a time-split is to be applied, how to pick the time point of the split. The objective is to end up with a leaf node where the durations of the intervals are comparable to each other.

Algorithm TimeSplit(L)
L: A leaf node with k intervals I_j [b_j, e_j], j = 1..k.
1) if L is underflow then exit; (To avoid splitting the first leaf node)
2) MAXEND = max_{j ∈ {1..k}} (e_j)
3) if L is the rightmost leaf node then goto 8.
4) i = PickSplitPoint(L); (Compute the cost for each end point, return the index of the point with minimum cost (the split point))
5) if e_i = MAXEND then goto 8; (The split point is the maximum end point, so no split is necessary)
6) else MAXEND = e_i;
7) for j = 1 to k: if e_j > e_i then Reinsert([e_i + 1, e_j]); e_j = e_i. (Split the interval I_j at e_i and reinsert the split part)
8) Post MAXEND as the new maximum end point for node L to the parent of L. (This may proceed to the upper levels of the tree)
Figure 4.2: TimeSplit algorithm
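Step 7 of TimeSplit, applied to the leaf of Example 4.1, can be sketched as below. This is only an illustration under stated assumptions: the function name `time_split` is hypothetical, time points are integers so that the split part starts at t + 1, the split point t is assumed to lie beyond every beginning point in the leaf, and the reinsertion into the tree is represented simply by returning the split parts.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # closed interval (begin, end) with integer time points


def time_split(leaf: List[Interval], t: int) -> Tuple[List[Interval], List[Interval], int]:
    """Cut every interval that ends after t.

    Returns (intervals kept in the leaf, parts to reinsert, new maximum end point).
    Assumes t is greater than every beginning point in the leaf.
    """
    kept: List[Interval] = []
    reinsert: List[Interval] = []
    for b, e in leaf:
        if e > t:
            kept.append((b, t))          # the part that stays in this leaf
            reinsert.append((t + 1, e))  # the part keyed by its new beginning point
        else:
            kept.append((b, e))
    new_max_end = max(e for _, e in kept)  # value posted to the parent (step 8)
    return kept, reinsert, new_max_end


leaf_L = [(2, 6), (3, 46), (4, 10), (6, 58), (8, 11), (12, 14)]
kept, reinsert, new_max_end = time_split(leaf_L, 14)
print(kept)         # [(2, 6), (3, 14), (4, 10), (6, 14), (8, 11), (12, 14)]
print(reinsert)     # [(15, 46), (15, 58)]
print(new_max_end)  # 14
```

The leaf keeps the same six keys after the split, which is why no structural change to the B+tree is needed beyond ordinary insertions of the two new parts.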

The algorithm for the time-split operation is given in Figure 4.2. The algorithm TimeSplit(L) is applied to a leaf node L after a new interval is inserted into L, when the end point of the new interval is greater than the maximum end point among the intervals that were already in L. If the maximum end point information for L need not be changed, TimeSplit(L) is not applied; otherwise the new maximum end point is posted to the parent node at the end (step 8). The candidates for split points are the end points of the intervals in the leaf. In other words, the number of candidates is equal to the number of intervals in the leaf. We apply a cost function to each of these candidate points, and pick the one with the minimum cost (step 4). If the minimum cost belongs to the maximum end point, no split is necessary (step 5).

Next, we explain how the cost of splitting a leaf node at a given time point is computed. As seen in Figure 4.2, the cost of splitting the leaf at each candidate point is computed using the PickSplitPoint(L) function, which takes a leaf node as input and returns the index of the interval whose end point is the best split point (with the minimum cost). The algorithm for PickSplitPoint(L) is given in Figure 4.3.

Algorithm PickSplitPoint(L)
L: A leaf node with k intervals I_j [b_j, e_j], j = 1..k.
1) MAXBEGIN = max_{j ∈ {1..k}} (b_j) (Maximum beginning point)
2) for j = 1 to k: if e_j < MAXBEGIN then ENDLIST[j] = MAXBEGIN; else ENDLIST[j] = e_j;
3) for j = 1 to k: $CUTCOST(j) = \sum_{i=1}^{k} |ENDLIST[i] - ENDLIST[j]|$
4) Let MINCOST = min_{i ∈ {1..k}} (CUTCOST(i))
5) for j = 1 to k: NUMSPLIT[j] = number of intervals to be split if e_j is chosen as the split point.
6) for j = 1 to k: FINALCOST(j) = CUTCOST(j) + MINCOST * α * NUMSPLIT[j];
7) Let I_x be the interval such that FINALCOST(x) = min{ FINALCOST(j) | j = 1..k and e_j > MAXBEGIN } (I_x is the interval whose end point e_x will be the split point)
8) Return x;
Figure 4.3: Algorithm PickSplitPoint

Since the IB+tree uses the beginning points of the intervals as keys, the PickSplitPoint function never selects a split point that is less than or equal to the maximum beginning point. So the key information in the IB+tree is never changed due to a time-split. To compute the costs, first a list of the end points of the intervals is kept in ENDLIST[]. For intervals whose end points are less than the maximum key (beginning point), the maximum key is taken as the end point (step 2). An intermediate cost function CUTCOST() computes, for each end point, the accumulation of absolute differences from the other end points (step 3). The minimum value among these intermediate costs is computed and stored in MINCOST (step 4). For each end point, the number of intervals that would have to be split if that end point is chosen as the split point is kept in another list NUMSPLIT[] (step 5). It is used to integrate the number of intervals to be split into the split cost, which is done in step 6. The final cost of a point is its intermediate cost plus a

penalty (MINCOST * α) for each interval it causes to split. Here, α is a parameter (0 < α < 1) which can be tuned to adjust the tradeoff between space (hence update cost) and query time. Higher values lead to fewer splits and hence less storage expansion, but also to worse query efficiency. Smaller values lead to better query efficiency, but increase storage requirements and update costs (due to reinsertion of split parts). In our experiments we have chosen α as 0.2.

As written, the cost of the PickSplitPoint() function is quadratic (because of step 3); however, it can be computed in O(k log k) time, where k is the number of intervals in the leaf node. For this, first, all intervals in the leaf node have to be sorted with respect to their end points in increasing order. If that is done before step 3, step 3 can be computed in linear time since

$CUTCOST(j+1) = CUTCOST(j) + (2j - k)\,(ENDLIST[j+1] - ENDLIST[j]), \quad j = 1, \ldots, k-1$

So, after computing CUTCOST(1), the rest can be computed in linear time using the equation above. In that case, the order of the PickSplitPoint() algorithm becomes O(k log k) due to the initial sort operation. As an example, we show the steps for picking the split point for the node in Figure 4.1.

Example 4.2: We want to find the split point for the leaf node L: ( I_1 [2, 6], I_2 [3, 46], I_3 [4, 10], I_4 [6, 58], I_5 [8, 11], I_6 [12, 14] ). We call the function PickSplitPoint(L) with α = 0.2:
step 1: MAXBEGIN = 12
step 2: ENDLIST[] = <12, 46, 12, 58, 12, 14>
step 3: CUTCOST[] = <82, 146, 82, 194, 82, 82>
step 4: MINCOST = 82 (α * MINCOST ≈ 16)
step 5: NUMSPLIT[] = <3, 1, 3, 0, 3, 2>
step 6: FINALCOST[] = <130, 162, 130, 194, 130, 114>
step 7: The split point is 14, which is the end point of the interval I_6.
step 8: Return 6.

To decrease the number of disk accesses during search queries, it is important to avoid cases where a search query retrieves a leaf node and finds only one interval to put into the result from that leaf.
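The PickSplitPoint steps and the sorted-order recurrence for CUTCOST can be sketched as follows, using the leaf of Example 4.2. The function names are hypothetical, the returned index is 0-based (so 5 corresponds to I_6), and two points are assumptions of this sketch rather than statements from the text: the penalty α * MINCOST is truncated to an integer, which matches the value 16 used in the example, and `None` is returned when no end point exceeds the maximum beginning point.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # closed interval (begin, end)


def cut_costs(endlist: List[int]) -> List[int]:
    """CUTCOST for every entry in O(k log k): sort once, then apply the recurrence."""
    k = len(endlist)
    order = sorted(range(k), key=lambda j: endlist[j])
    s = [endlist[j] for j in order]
    cost = sum(v - s[0] for v in s)  # CUTCOST of the smallest end point
    out = [0] * k
    out[order[0]] = cost
    for j in range(1, k):
        # CUTCOST(j+1) = CUTCOST(j) + (2j - k) * (ENDLIST[j+1] - ENDLIST[j]), 1-based j
        cost += (2 * j - k) * (s[j] - s[j - 1])
        out[order[j]] = cost
    return out


def pick_split_point(leaf: List[Interval], alpha: float = 0.2) -> Optional[int]:
    """Return the 0-based index of the interval whose end point is the split point."""
    maxbegin = max(b for b, _ in leaf)                            # step 1
    endlist = [max(e, maxbegin) for _, e in leaf]                 # step 2
    cutcost = cut_costs(endlist)                                  # step 3
    penalty = int(alpha * min(cutcost))                           # step 4 (truncation assumed)
    numsplit = [sum(1 for v in endlist if v > e) for e in endlist]  # step 5
    final = [c + penalty * n for c, n in zip(cutcost, numsplit)]  # step 6
    candidates = [j for j, (_, e) in enumerate(leaf) if e > maxbegin]
    if not candidates:
        return None  # assumption: nothing to split when no end point exceeds MAXBEGIN
    return min(candidates, key=lambda j: final[j])                # step 7


leaf_L = [(2, 6), (3, 46), (4, 10), (6, 58), (8, 11), (12, 14)]
print(cut_costs([12, 46, 12, 58, 12, 14]))  # [82, 146, 82, 194, 82, 82]
print(pick_split_point(leaf_L, 0.2))        # 5, i.e. I_6 with end point 14
```

NUMSPLIT is counted on the clamped ENDLIST values here, which is what reproduces <3, 1, 3, 0, 3, 2> for the example leaf.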
From the algorithms above, it can be observed that a leaf node is not necessarily split whenever it has long intervals. If there are too many of these long intervals in the node, it is not a good idea to split the node anyway; if many such (long) intervals exist in a node, many of them will contribute to the answer of an interval search query. Besides, a time-split on such a node will cause many new partitions (from the long intervals) to be reinserted, which may be very costly.

5. Experimental Results

In this section, we present our experimental results with IB+trees and one dimensional R-trees. In the experiments, we compare the search efficiency of IB+trees and R-trees in terms of the number of nodes read during search operations. The experiments are done using five different data sets. Each data set contains 100,000 intervals whose beginning points are distributed randomly in the range 0 to 250,000. The distribution of the durations of the intervals is different in each of the five data sets. Table 5.1 below lists the descriptions of these data sets. We will refer to these data sets using the names (D1, ..., D5) shown in Table 5.1. The queries are evaluated after all data intervals are inserted in each test case.

Dataset   Description
D1        The durations of the intervals are distributed exponentially having mean value 100.
D2        The durations of the intervals are distributed exponentially having mean value
D3        This data set is created by merging sets D1 and D2. 20% of the durations are distributed exponentially with mean 2000, and the rest is distributed exponentially with mean 100.
D4        The durations of the intervals are distributed normally having mean as 200 and standard deviation as also 200.
D5        The durations of the intervals are distributed normally having mean as 2000 and standard deviation as
Table 5.1: Descriptions of the 5 data sets each having intervals.

We compared three structures. One of them is the IB+tree as explained in section 3. The next one is the IB+tree with time-splits, where the time-splits are done using the algorithms presented in section 4. The third structure is the one-dimensional R-tree. All of these structures have a maximum fanout of 51 and a minimum fanout of 26 for both internal nodes and leaves. All of these structures have the same node structure. The IB+tree (with and without time-splits) keeps a key, the maximum end point of the subtree below, and a pointer in each entry of an internal node. An R-tree node entry has a minimum bounding interval (a beginning and an end point) and a pointer. So, having the same fanout for both structures is a fair assumption.

In Figure 5.1, we see the storage requirements of the three structures in terms of the total number of nodes they have. The one dimensional R-tree and the IB+tree without time-splits require about the same storage, meaning that the average fanouts of the nodes in both structures are about the same. IB+trees with time-splits require more storage due to the increased number of intervals caused by splits. For data sets D1, D2, D4, and D5, IB+trees with time-splits require 50-60% more storage.
For data set D3, the storage requirement is around 85% more, since D3 has a considerable number of long intervals stored together with short ones, causing more splits to take place.

The query performance results for these index structures are discussed below. These results are obtained by taking averages over 20 different queries. Each query consists of a query interval and the type of temporal relationship employed for the query. To compare the search performances of the index structures, we first tested them on interval timeslice queries, which employ the interval intersection operator (i.e., all data intervals intersecting a query interval are retrieved as the result). The midpoints of the query intervals are picked randomly, and their durations are normally distributed with µ=100 and σ=50.

Two types of search strategies are available for IB+trees. The first one is finding the first interval (with minimum beginning point) that intersects the query interval using the INTERVAL-SEARCH() algorithm of section 3, and then looking at consecutive leaf nodes sequentially to find the others. We will refer to this strategy as the Sequential Search strategy. The second one is using the ALL-INTERVAL-SEARCH() algorithm of section 3. We will refer to this strategy as the Range Search strategy. Generally, the Range Search strategy is superior, but the Sequential Search strategy can be chosen when the leaf nodes are physically clustered in secondary storage. Both strategies can be used for both IB+tree variants, with or without splits.

[Figure 5.1: Storage requirements of the three structures. Total number of nodes for data sets D1-D5; series: IB+tree with time-splits, one-dimensional R-tree, IB+tree without time-splits.]

The average numbers of internal node accesses per timeslice query are shown in Figure 5.2. For all three index structures, the height was 4 (3 internal levels + 1 leaf level). That means the sequential search method for the IB+tree variants makes 3 internal node accesses for each query. The averages displayed in Figure 5.2 are for the range search method. IB+trees with time-splits make fewer internal node accesses than R-trees for data sets D2 and D3 due to the large number of splits applied on mostly long intervals (which can also be seen from Figure 5.1). In these data sets, IB+trees with time-splits pack the resulting set of data intervals (after splits) more tightly into the leaf nodes. For data sets D1 and D5, R-trees make slightly fewer node accesses than IB+trees with time-splits. IB+trees without time-splits perform the most internal node accesses in every data set.

[Figure 5.2: Average number of internal node accesses per timeslice query, for data sets D1-D5; series: IB+tree with time-splits, one-dimensional R-tree, IB+tree without time-splits.]

The average numbers of leaf node accesses for the three index structures are given in Figures 5.3 and 5.4. Figure 5.3 compares the IB+trees with and without time-splits. This chart shows how much improvement time-splits bring to IB+trees. When the durations of the intervals are small and relatively close to each other (data sets D1, D4, and D5), IB+trees with time-splits perform close to IB+trees without time-splits, having a slight edge over them. The difference surfaces when there are long intervals distributed with short intervals (data sets D2 and especially D3). Sequential search on IB+trees without time-splits performs very poorly in such cases. Range search performs better, but it still does not get close to the performance of IB+trees with time-splits. Sequential and range search give close performances on IB+trees with time-splits in every case.

[Figure 5.3: Comparison of IB+trees with and without time-splits in terms of the average number of leaf node accesses for timeslice queries, for data sets D1-D5; series: IB+tree with time-splits (range search), IB+tree with time-splits (sequential search), IB+tree without time-splits (range search), IB+tree without time-splits (sequential search).]

Figure 5.4 shows the comparison of IB+trees with time-splits to one-dimensional R-trees. We see that one-dimensional R-trees have performance comparable to IB+trees with time-splits, especially when the range search strategy is used for IB+trees with time-splits.

[Figure 5.4: Comparison of IB+trees with time-splits to one-dimensional R-trees in terms of the average number of leaf node accesses for timeslice queries, for data sets D1-D5; series: IB+tree with time-splits (range search), IB+tree with time-splits (sequential search), one-dimensional R-tree.]

Although IB+trees do not perform better than one-dimensional R-trees for timeslice queries, they keep the list of intervals ordered with respect to their beginning points, making them superior to R-trees for other temporal query operators such as right-covered by, equals, right-covers, covered by, left-covered by, right-overlaps, and met by [AH85] (shown in Figure 5.5). These operators are either totally based on the beginning points of intervals, or they specify a range for the beginning points of the intervals. R-trees cannot handle such queries as well as they handle timeslice queries.
For example, a simple met by query can be answered by an IB+tree in O(log n) time, while a one-dimensional R-tree has to perform a point inclusion search to answer the same query. To make this point clear, we ran experiments on the three index structures for the covered by and met by operators.
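A minimal sketch of why met by reduces to a key lookup when intervals are ordered by beginning point. A sorted Python list with `bisect` stands in for the B+tree keys; the class name `BeginIndex` is hypothetical, and the sketch assumes that a data interval is met by the query when its beginning point equals the query's end point.

```python
import bisect


class BeginIndex:
    """Intervals kept in order of their beginning points, like B+tree keys."""

    def __init__(self, intervals):
        self.intervals = sorted(intervals)
        self.begins = [begin for begin, _ in self.intervals]

    def met_by(self, query):
        """Data intervals whose beginning point equals the query's end point."""
        _, q_end = query
        lo = bisect.bisect_left(self.begins, q_end)   # O(log n) key search
        hi = bisect.bisect_right(self.begins, q_end)
        return self.intervals[lo:hi]


index = BeginIndex([(3, 20), (6, 31), (8, 9), (11, 16),
                    (14, 18), (20, 32), (26, 40), (35, 49)])
result = index.met_by((10, 20))
print(result)  # [(20, 32)]
```

An index organized by bounding intervals has no such key to search on: every node whose bounding interval contains the point 20 would have to be visited, which here includes the regions holding (6, 31) and (14, 18) even though neither qualifies.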

[Figure 5.5: Temporal relationships between intervals, drawn for a data interval y against a query interval x: y before x, y meets x, y left-overlaps x, y left-covers x, y covers x, y right-covered by x, y equals to x, y right-covers x, y covered by x, y left-covered by x, y right-overlaps x, y met by x, y after x.]

The covered by operator specifies the inclusion relationship between intervals (more precisely, between data intervals and the query interval). Since the beginning points of the qualifying data intervals must fall in the range specified by the query interval, covered by queries can be considered to be partially based on beginning points. Such queries can be answered by IB+trees by checking all data intervals whose beginning points fall into the specified range, which requires a range search on beginning points. For R-trees, the search strategy is no different from the strategy for interval timeslice queries, i.e., all nodes (leaf or internal) with minimum bounding intervals intersecting the query interval should be accessed.

For covered by queries, we used two query sets. The first set (we will refer to it as Q1) has query intervals whose midpoints are picked randomly and whose durations are normally distributed with µ=1000 and σ=500. The intervals in this set have lengths compatible with the data intervals in D1, D3, and D5. The second set (Q2) is similar to Q1, but the lengths of the query intervals are normally distributed with µ=100 and σ=50. So the query intervals in Q2 have lengths compatible with the data intervals in D2, D3, and D4.

[Figure 5.6: Average number of internal node accesses for the two query sets for covered by queries. Two panels (query set Q1, query set Q2) over data sets D1-D5; series: IB+tree with time-splits, one-dimensional R-tree.]
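The covered by strategy described above for a begin-point index can be sketched as a range search on beginning points followed by a filter on end points. This is an illustration only: a sorted list stands in for the leaf level, the function name is hypothetical, and inclusion is taken to be non-strict (the strict variant of Figure 5.5 would use strict comparisons instead).

```python
import bisect


def covered_by(intervals_by_begin, query):
    """Data intervals lying inside the query interval (non-strict inclusion).

    intervals_by_begin must be sorted by beginning point.
    """
    q_begin, q_end = query
    begins = [begin for begin, _ in intervals_by_begin]  # the index keys
    lo = bisect.bisect_left(begins, q_begin)   # first key >= q_begin
    hi = bisect.bisect_right(begins, q_end)    # one past the last key <= q_end
    # Only intervals beginning inside the query are candidates; the end
    # point decides whether each candidate is actually covered.
    return [(b, e) for b, e in intervals_by_begin[lo:hi] if e <= q_end]


data = sorted([(3, 20), (6, 31), (8, 9), (11, 16),
               (14, 18), (20, 32), (26, 40), (35, 49)])
result = covered_by(data, (5, 32))
print(result)  # [(6, 31), (8, 9), (11, 16), (14, 18), (20, 32)]
```

The interval (3, 20) intersects the query but is never examined because its key lies outside the range, which is the saving over a search that visits everything intersecting the query interval.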

Figure 5.6 shows the average number of internal node accesses for IB+trees and one-dimensional R-trees for covered by queries. The number of internal node accesses required is the same for IB+trees with and without time-splits, as covered by queries are answered by checking the intervals whose beginning points fall into the query range. R-trees make more internal node accesses, especially when the data intervals have large durations (data sets D1, D3, and D5).

[Figure 5.7: Average number of leaf node accesses for the two query sets for covered by queries. Two panels (query set Q1, query set Q2) over data sets D1-D5; series: IB+tree with time-splits, one-dimensional R-tree, IB+tree without time-splits.]

Figure 5.7 shows the average number of leaf node accesses for covered by queries for query sets Q1 and Q2. When the data intervals are relatively long (D1, D3, and D5), one-dimensional R-trees perform very poorly, especially for short query intervals (Q2). The R-tree performs slightly better than IB+trees with time-splits when the data intervals are short (D2 and D4) but the query intervals are long (Q1). IB+trees without time-splits always give the best performance, which is an expected result. The ratio of the number of leaf nodes accessed by an IB+tree with time-splits to the number of leaf nodes accessed by an IB+tree without time-splits also reflects the ratio of their storage requirements. Since an IB+tree with time-splits has more intervals to index (due to splits), it makes more leaf node accesses.

[Figure 5.8: Average number of leaf and internal node accesses for the two query sets for met by queries. Two panels (leaf node accesses, internal node accesses) over data sets D1-D5; series: IB+tree with time-splits, one-dimensional R-tree.]

Finally, we show the performances of the three structures for met by queries in Figure 5.8. The met by operator is completely based on the beginning points of intervals. For IB+trees, it is simply a key-based point search. For R-trees it is no simpler than a point timeslice search, since all nodes whose minimum bounding intervals include the end point of the query interval have to be accessed. While IB+trees make, on average, slightly more than one leaf access per query, one-dimensional R-trees have to make an order of magnitude more leaf node accesses.

In the next section, an extension of IB+trees for handling open ended intervals and intervals with moving end points (along the current timeline) is presented.

6. Handling Special Time Variables

In valid time databases, some temporal objects may have valid time intervals with end points that have to be treated specially. Some valid time intervals may be open ended, meaning that their end points can extend into the future indefinitely. We will use the special time variable infinity to denote the end points of such intervals. There may also be intervals whose end points are equal to the variable now, which represents the current time value. The indexing problem becomes more interesting with the introduction of such intervals.

One trivial solution is to index intervals with end point now and open ended intervals (with end point infinity) in separate index structures based on their beginning points. All intervals of the form [x, now] (where x is an absolute time point) can be indexed in one B+tree, and all intervals of the form [y, infinity) can be indexed in another. Intervals with absolute (fixed) end points could be indexed in a third index structure such as an IB+tree. In that case, any search query will proceed down all three indices, and the results have to be merged in the end. Using such an indexing scheme may actually be a good idea if the three index structures can be kept on parallel disks to overlap I/O time. However, since all valid time intervals have to be handled dynamically, the traffic of moving intervals among the three structures due to changes may become overwhelming.
With special treatment of the variables now and infinity, it is possible to index all valid time intervals in the same IB+tree structure. The important part is managing the augmented information in the internal nodes during update and search operations, as explained below:

- If all valid time intervals below a subtree have end points that are either absolute time points (past or future) or infinity, we simply keep the maximum as the augmented information for that subtree. Here infinity is treated as positive infinity when finding the maximum, i.e., it will be the maximum.
- If the valid time intervals below a subtree have end points in the past (values less than the current time) and some end points equal to now, now is used as the augmented information for that subtree.
- If the subtree indexes intervals that have absolute future time points as their end points as well as intervals with end point now, we keep the maximum end point and mark it with a special marker (which requires a bit flag) to denote that the maximum may change as time passes and the variable time point now may become the maximum. If this happens, we modify the maximum for that subtree the next time we access it because of an update operation.

Consider the example given in Figure 6.1. At time 30 (when now is 30), the maximum end point for C3 is 49, marked with the special marker #, as 49 is greater than the current value of now and there is one data interval with end point now ([26, now]) in C3. Let's assume that we invoke an update query at time 52 (inserting the time interval [20, 32]) and access C3. As we go down the tree, we also modify the maximum end point information, as the value of now is greater than 49 at time 52. Note that we do not need to modify the augmented information before that, since during a search operation we will be aware of the fact that the subtree below has max(49, now) as its maximum end point.

[Figure 6.1: An IB+tree for indexing valid-time intervals, drawn at time 30 (now=30) and at time 52 (now=52), with root R and leaves C1, C2, and C3. The intervals shown are [3,20], [6,31], [8,9], [11,16], [14,now], [26,now], and [35,49], plus [20,32] at time 52. The augmented value (#49) at time 30 becomes now at time 52.]

Note that the special value infinity is treated as an absolute time value. Certainly, we would expect the intervals with end point infinity to be updated to definite time values (an absolute time point or now) eventually. The insertion, deletion, and search operations of the IB+tree can be slightly modified in the way described above to handle such updates.

Having valid time intervals with end points now or infinity also slightly changes the procedure for time-splits. When calculating the cost of splitting a leaf node with the function PickSplitPoint(), the current value of now should be used. Similarly, the value of infinity can be taken as the maximum integer (2^32 - 1 if 4 bytes are used to represent a time value). The procedure for calculating the costs and splitting the intervals stays the same. If a leaf is split at a future time point and there are intervals having now as end points, the maximum of the leaf node becomes the split point marked with the special marker #, and that should be posted to the parent node as the maximum end point information. Such a case is illustrated in the following example.

Example 6.1: Consider a leaf node L accommodating the intervals shown in Figure 6.2.

[Figure 6.2: The intervals in leaf node L (Example 6.1). Legible labels: 3,24; 4,now; 6,50; 8,20; 12,now; one further interval ending at 22; now(=16); split point 24.]

If a time-split operation is done on L, 24 will be chosen as the split point. In that case, the interval [6, 50] will be split into two parts, [6, 24] and [25, 50], where the second part will be reinserted.
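The marker rule and the split just described can be sketched as follows. This is a hedged illustration, not the paper's procedure: `AugMax`, `summarize`, `effective_max`, and `time_split` are hypothetical names, the split point is supplied by hand rather than computed by PickSplitPoint(), intervals ending at now are assumed to stay unsplit (as in the example, where now=16 lies before the split point), and only the intervals of L whose labels are fully legible in Figure 6.2 are used.

```python
import math
from dataclasses import dataclass

NOW = "now"          # stand-in for the moving end point
INFINITY = math.inf  # stand-in for an open end point


@dataclass(frozen=True)
class AugMax:
    """Augmented maximum end point of a subtree; marked is the '#' flag."""
    value: object
    marked: bool = False


def summarize(ends, now):
    """Apply the three rules to the end points found below a subtree."""
    fixed = [e for e in ends if e != NOW]
    top = max(fixed) if fixed else None
    if NOW not in ends:
        return AugMax(top)               # rule 1: plain maximum
    if top is None or top <= now:
        return AugMax(NOW)               # rule 2: now dominates
    return AugMax(top, marked=True)      # rule 3: future maximum, may be overtaken


def effective_max(aug, now):
    """Maximum end point a search should assume at the given time."""
    if aug.value == NOW:
        return now
    return max(aug.value, now) if aug.marked else aug.value


def time_split(leaf, split_point):
    """Cut fixed-end intervals that cross the split point."""
    kept, reinsert = [], []
    for begin, end in leaf:
        if end != NOW and begin <= split_point < end:
            kept.append((begin, split_point))
            reinsert.append((split_point + 1, end))  # to be reinserted
        else:
            kept.append((begin, end))
    return kept, reinsert


# Figure 6.1: C3 holds [26, now] and [35, 49].
c3 = summarize([NOW, 49], now=30)
print(c3, effective_max(c3, 30), effective_max(c3, 52))

# Example 6.1 with now = 16 and split point 24.
leaf_l = [(3, 24), (4, NOW), (6, 50), (8, 20), (12, NOW)]
kept, reinsert = time_split(leaf_l, 24)
posted = summarize([end for _, end in kept], now=16)
print(kept, reinsert, posted)
```

For C3 the sketch yields the marked value 49, read as 49 at time 30 and as 52 at time 52; for L it reinserts [25, 50] and posts the marked value 24.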
The maximum end point of L after the split will be 24, but because of the intervals [4, now] and [12, now], (#24) should be posted as L's maximum end point to its parent. This means that L's maximum at any given time is max(24, now).

7. Conclusion

In this paper, we considered the problem of indexing time intervals in valid time databases. Valid time intervals should be indexed dynamically, since updates and deletions are possible in valid time databases, unlike transaction time databases, where intervals are inserted in an append-only fashion in time order. Valid time intervals may also extend into the future, having end points greater than the current time value now. Some of these intervals may be open ended, meaning that their end points are indefinite. Handling intervals with moving end points (end points equal to now) and intervals with open end points together with intervals with fixed end points, while allowing dynamic operations, poses an interesting problem.

We suggest using Interval B+trees to index valid time intervals. Interval B+trees can easily handle dynamic operations, as they are basically B+trees built on the beginning points of the valid time intervals, whose internal nodes are augmented with the maximum end point information of the subtrees below them. To handle skewed distributions efficiently, we introduce a time-split algorithm that is applied to the leaf nodes of the IB+tree to increase the efficiency of search operations. Time-splits help partition relatively long intervals and distribute those partitions among different leaf nodes, so that all the leaves accommodate intervals of comparable lengths. Experimental results show that using time-splits in IB+trees considerably improves search performance, at the cost of some increase in storage due to the larger number of intervals produced by the partitions.
Comparison with one-dimensional R-trees showed that IB+trees with time-splits give performance very close to R-trees for timeslice queries, while IB+trees are far superior for many temporal queries that are based on the beginning points of time intervals (such as met by and covered by). This result was expected, since IB+trees index the intervals with respect to their beginning points.

We have also shown modifications to IB+trees for handling valid time intervals that have moving (now) or indefinite (infinity) end points. Since the beginning point of any valid time interval is always a fixed point, these modifications concern only the handling of the augmented data and do not make any changes to the underlying B+tree structure. On the other hand, it is not clear how useful R-trees can be when open ended intervals are indexed together with intervals with fixed end points, especially in terms of controlling the overlap.

References

[AH85] J. F. Allen, P. J. Hayes, A Common-sense Theory of Time, Proceedings of the International Joint Conference on Artificial Intelligence, August
[BGO+93] B. Becker, S. Gschwind, T. Ohler, B. Seeger, P. Widmayer, On Optimal Multiversion Access Structures, Proceedings of Symposium on Large Spatial Databases, in Lecture Notes in Computer Science, Vol 692, pages , Singapore
[BO95] T. Bozkaya, M. Ozsoyoglu, Indexing Transaction Time Databases, Technical Report CES Computer Engineering and Science Department, CWRU.
[CLR90] T. H. Cormen, C. E. Leiserson, R. L. Rivest, Introduction to Algorithms, McGraw-Hill
[EWK90] R. Elmasri, G. T. J. Wuu, Y. Kim, The Time-Index: An Access Structure for Temporal Data, Proceedings of 16th VLDB Conference, pages 1-12, August
[EWK93] R. Elmasri, G. T. J. Wuu, V. Kouramajiam, The Time-Index and The Monotonic B+tree, In [T93], chapter


More information

Balanced Search Trees

Balanced Search Trees Balanced Search Trees Computer Science E-22 Harvard Extension School David G. Sullivan, Ph.D. Review: Balanced Trees A tree is balanced if, for each node, the node s subtrees have the same height or have

More information

Problem. Indexing with B-trees. Indexing. Primary Key Indexing. B-trees: Example. B-trees. primary key indexing

Problem. Indexing with B-trees. Indexing. Primary Key Indexing. B-trees: Example. B-trees. primary key indexing 15-82 Advanced Topics in Database Systems Performance Problem Given a large collection of records, Indexing with B-trees find similar/interesting things, i.e., allow fast, approximate queries 2 Indexing

More information

Computational Optimization ISE 407. Lecture 16. Dr. Ted Ralphs

Computational Optimization ISE 407. Lecture 16. Dr. Ted Ralphs Computational Optimization ISE 407 Lecture 16 Dr. Ted Ralphs ISE 407 Lecture 16 1 References for Today s Lecture Required reading Sections 6.5-6.7 References CLRS Chapter 22 R. Sedgewick, Algorithms in

More information

I/O-Algorithms Lars Arge

I/O-Algorithms Lars Arge I/O-Algorithms Fall 203 September 9, 203 I/O-Model lock I/O D Parameters = # elements in problem instance = # elements that fits in disk block M = # elements that fits in main memory M T = # output size

More information
