Indexing Valid Time Intervals *


Tolga Bozkaya, Meral Ozsoyoglu
Computer Engineering and Science Department, Case Western Reserve University
{bozkaya, ozsoy}@ces.cwru.edu

Abstract

To support temporal operators and to increase the efficiency of temporal queries, indexing based on temporal attributes is required. We consider the problem of indexing the temporal dimension in valid time databases. We assume that the temporal information of data objects is represented as valid time intervals that have to be managed dynamically by an efficient index structure. Unlike the time intervals in transaction time databases, valid time intervals can be inserted, deleted, and modified at any point in time. Furthermore, their lifespans can go beyond the current time point and extend into the future. We propose an indexing scheme that uses augmented B+trees, called Interval B+trees, for indexing a dynamic set of valid time intervals. Interval B+trees (IB+trees) use the beginning points of the intervals as keys and keep, in each internal node, the maximum end point of each of its subtrees. We introduce an algorithm to apply time-splits at the leaf level of the IB+tree that partitions long valid time intervals into disjoint subintervals and distributes them among several leaf nodes to increase the efficiency of search operations, especially for timeslice queries. We compared IB+trees with time-splits to one-dimensional R-trees and observed that, while their performances for timeslice queries are comparable, IB+trees are far superior for many temporal queries that are based on the beginning points of time intervals. This is expected, as IB+trees use the beginning points of intervals as keys and therefore support such queries naturally. We also show extensions to our indexing scheme for handling open-ended valid time intervals (valid time intervals whose lifespans extend into the future indefinitely) and valid time intervals whose end points move along the current timeline.
1. Introduction

In this paper, we are concerned with indexing a dynamic set of valid time intervals. Valid time intervals represent the time span in which the data entities exist in real life. The notion of valid time is very common in many database applications, such as banking, scientific experiments, payroll databases, multimedia databases, etc. Valid time databases provide the necessary tools to model, maintain, and query information that varies over time. To efficiently handle queries on the temporal dimension of data in valid time databases, indexing the time intervals that correspond to lifespans of temporal object versions is required.

The majority of research on indexing temporal databases concentrates on designing efficient index structures for transaction time databases [ST95]. In transaction time databases, temporal objects are inserted into the database in an append-only fashion, and modifications or deletions to historical data are not allowed. Temporal information of historical versions of entities is represented as transaction time intervals, where each time interval corresponds to the duration of the existence of a historical entity in the database. This append-only behavior provides opportunities to design new index structures, or to fine-tune existing ones, to increase efficiency for temporal search queries. We mention some of the related work on indexing transaction time intervals in the next section.

* This work has been partially supported by NSF grants IRI , IRI , and the NSF FAW Award IRI

Indexing valid time intervals in a temporal database is quite a different problem from indexing transaction time intervals. First of all, the append-only behavior cannot be assumed in valid time databases. In valid time databases, it is possible to insert, update, or delete past and future information whenever the information becomes available. Modifications and deletions are also possible due to corrections. In short, valid time intervals have to be managed dynamically, unlike transaction time intervals. Second, valid time intervals may extend into the future indefinitely (making them open ended), or they may have end points that move along the current timeline. Coupled with the fact that these intervals have to be managed dynamically, the problem of indexing valid time databases becomes a problem with different requirements.

In bitemporal databases, temporal information about entities is represented in both temporal dimensions (valid time and transaction time). Each entity has a valid time lifespan that was recorded in the database throughout the transaction time lifespan of the entity. Indexing bitemporal (two-dimensional) intervals is another interesting problem, where the challenge is to exploit the append-only behavior of the transaction time dimension and to provide dynamic management of intervals for the valid time dimension [KTF95].
In this paper, since our focus is on valid time databases, unless otherwise specified, we use "intervals" to refer to valid time intervals. Temporal information can be queried in a variety of ways. Conventionally, a temporal index is supposed to support timeslice queries. Since we focus on indexing the temporal dimension only, we are mainly interested in answering pure-timeslice queries efficiently. A pure-timeslice query asks for all temporal objects whose time intervals intersect a given query time point (a query time interval, in the general case). Other variations of timeslice queries may include specifications on the key dimension, such as asking for temporal objects whose key values fall in a given key range/value and whose time intervals intersect a given time instant (range-timeslice queries / pure-key-timeslice queries). In the most general case, temporal queries may employ any of the temporal operators that specify the 13 possible relationships between intervals [AH85] (see Figure 5.5). Particular index orders among the beginning points or the end points of the intervals may come in handy in efficiently answering queries that employ these temporal operators.

We propose an indexing scheme that uses B+trees on the beginning points of the intervals, augmented with the maximum end points of intervals in the internal nodes. We call this augmented structure the Interval B+tree [BO95] (IB+tree for short). Since the basic structure of the IB+tree is the B+tree, it can efficiently index a dynamic set of valid time intervals. The keys of the IB+tree are the beginning points of the intervals, which makes it useful for temporal queries that employ operators capturing temporal relationships based on the beginning points of intervals, such as right-covered by, equals, right-covers, covered by, left-covered by, right overlaps, and met by (see Figure 5.5).
The augmented information in the internal nodes helps to trim the search for timeslice queries, and may be useful for queries that employ other temporal operators. In IB+trees, the augmented information in an internal node is simply the maximum end point of the intervals that are indexed in each subtree below that node. This information will not be very useful when there are several long intervals distributed over the timeline, causing the maximum end points in the subtrees to be too high to trim the search efficiently. To handle this situation, we propose a time-split algorithm to be applied to the intervals in the leaves. These time-splits partition long intervals into several shorter parts, making the augmented information very useful in trimming the search path. The time-splits do not require any extra operation on the IB+trees, although they increase storage requirements, as the split parts have to be reinserted into the tree. We should also note that time-splits can be applied off-line during nonpeak hours of operation.

We compared IB+trees (with time-splits) with one-dimensional R-trees and observed that their performances for timeslice queries are very close. Note that R-trees do not directly support search operations based on the beginning points of the intervals. Most of the time, they require a timeslice search for these operations. We also experimented with queries that employ the covered by and met by operators to demonstrate this point. We show how IB+trees can be extended to index open-ended valid time intervals and valid time intervals with moving end points (along the current timeline), together with intervals having fixed end points.

The rest of the paper is organized as follows. In the next section, we discuss the related work on indexing temporal intervals and dynamic interval management. In section 3, we describe the IB+tree structure. Time-splits are explained in section 4. Section 5 shows the results of our experimental work on comparing one-dimensional R-trees and IB+trees with time-splits. In section 6, we explain how open-ended and moving-ended intervals could be handled in IB+trees. Section 7 concludes.

2. Related Work

There are quite a number of index structures designed for transaction time databases (see [ST95] for a recent survey).
Index structures that are designed for transaction time databases exploit the append-only behavior of transaction time intervals to provide efficiency in temporal queries. Time Index [EWK90], Append-Only trees [SG93], Snapshot Index [TK95], and AD*-trees [BO95] are a few of the index structures designed for answering queries on the temporal dimension only. Some of these structures are shown to provide optimal query time [TK95, Ram97] (in terms of order) in answering pure-timeslice queries. In the literature, there are also index structures that are built on both the key and time dimensions for efficiently answering range-timeslice queries, as well as key-range queries that are purely based on the key dimension. ST-trees [SG93], Multiversion B-trees [BGO+93], Time-split B+trees [LS90], and B+trees on window lists [Ram97] are some of these structures. Some of them are also proven to guarantee optimal query time [BGO+93, ST95]. We do not elaborate much on these structures due to limited space.

There are also index structures that are used for dynamic management of time intervals, such as Segment R-trees [KS91], R-trees [Gut84], and the TP-index [SOL94]. These are mostly multidimensional index structures that can be used directly to index time intervals (such as R-trees), or variations of multidimensional index structures tailored for indexing temporal domains (such as the TP-index). Being a popular spatial index structure, R-trees have been heavily used for dynamic interval management, and comparison against their efficiency has been a common practice for new index structures. We take the same approach in this paper and compare the performance of IB+trees with time-splits to one-dimensional R-trees.

3. Interval B+trees

In this section we briefly explain the Interval B+tree structure to show how it keeps and makes use of the augmented information in its internal nodes. There are several structures in the literature for indexing interval (not necessarily temporal) data. Some of these index structures for intervals have been researched in fields other than databases (such as computational geometry). Although they are not suitable for secondary storage, as they are binary tree structures, they have been the inspiration for many secondary storage index structures for database applications. Priority Search trees [Mc85], Segment trees [PS85], and Interval-trees [CLR90] are some of the structures proposed for interval search. Interval B+trees are secondary storage models of Interval-trees; therefore we discuss Interval-trees in more detail below.

Interval-trees

The Interval-tree [CLR90] is a binary tree that is augmented to support operations on a dynamic set of intervals. The underlying tree structure can be any balanced binary tree structure, such as the AVL tree or the Red-black tree, provided that the augmented information can be efficiently maintained throughout the dynamic operations (insertions and deletions) that keep the tree balanced. A node x of the Interval-tree contains an interval (int[x]), and the key of x is the beginning point of that interval. Thus, an inorder tree walk of the data structure lists the intervals in sorted order by their beginning points. In addition, a node x also contains the maximum value of any interval end point stored in the subtree rooted at that node (which we denote as max[x]). This information is easily maintained through all operations (e.g., rotations) that keep the tree balanced. Insertions and deletions can be done in $O(\log_2 n)$ time. The Interval-tree supports the interval search operation, which finds an interval that intersects a given search interval.
The algorithm is simple and short, as shown below:

INTERVAL-SEARCH(T, I) [CLR90]
(For a given interval I = [i_s, i_e], find an interval that intersects I in the Interval-tree T. left[x] and right[x] stand for the left and right child of a node x.)
1) x = root(T)
2) while x ≠ NIL and I does not intersect the interval int[x] do
2.1) if left[x] ≠ NIL and max[left[x]] ≥ i_s then x = left[x]
2.2) else x = right[x]
3) return x if it is not NIL.

Interval B+tree Structure

The Interval B+tree (IB+tree) is a direct generalization of the Interval-tree to a multi-way B+tree structure. It is, basically, a B+tree where each node is augmented with the same kind of information as in the binary Interval-trees. While the properties of the B+tree structure are kept invariant, the internal nodes of an IB+tree keep the maximum end point of the intervals indexed by each of their subtrees. So, an internal node of order k (with k children) has k maximum points, one for each of its children, as the augmented information. The leaves of the tree keep the data items and have no children, so they do not have any extra information.

Insertion and deletion operations for IB+trees are similar to those for B+trees, with the only exception of a little overhead to maintain the augmented information. However, this overhead does not change the complexity of the operations. Only the maximum fields of the nodes visited along the path from the root to the leaf where the insertion or deletion is made may need to be updated (most of the time only some of them, in the worst case all of them). The complexity of insertion and deletion operations for IB+trees is $O(\log_k n)$ (the same as for B+trees), where n is the number of leaf nodes and k is the average fanout of a node in the tree. The internal node structure of an IB+tree with k keys a_1, a_2, ..., a_k, k child pointers c_1, ..., c_k, and k maximum end points m_1, ..., m_k of the subtrees rooted at each child (we will refer to them as maximums for short) is shown in Figure 3.1.

[Figure 3.1: An internal IB+tree node, with keys a_1, a_2, ..., a_k and child pointers c_1 (m_1), c_2 (m_2), ..., c_k-1 (m_k-1), c_k (m_k).]

The difference from Interval-trees is that the maximum end points for each of the children are kept in the parent node. So, the children nodes need not be accessed to check the maximum end points in their subtrees, as is required in Interval-trees. Since an IB+tree is a generalization of the binary Interval-tree, we can use the same interval search algorithm above, with minor modifications, for an IB+tree.

INTERVAL-SEARCH(N, I) (for Interval B+trees)
(For a given search interval I = [i_s, i_e], where i_s and i_e are the starting and the ending points of I, find an interval that intersects I in the Interval B+tree T. Here, N is a node of the Interval B+tree and the initial call is INTERVAL-SEARCH(root(T), I). Assume that N has k children (internal node) or k data items (leaf).)
1) if N is a leaf node then check if there is an interval intersecting I among the intervals in N.
2) else if N is an internal node then
2.1) i = 1;
2.2) if I intersects [a_i, m_i] then INTERVAL-SEARCH(c_i, I)
2.3) else if i < k then i = i + 1, goto 2.2

As the keys of the IB+tree are the beginning points of the indexed intervals, any query on the beginning points of intervals can be answered efficiently using the search algorithms of B+trees.
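The INTERVAL-SEARCH procedure for IB+trees can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class names `Leaf` and `Internal`, the helper `intersects`, and the small example tree are all hypothetical, intervals are assumed to be closed, and each key a_i is assumed to be a lower bound on the beginning points stored under child c_i.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

Interval = Tuple[int, int]  # closed interval (begin, end); an assumption


@dataclass
class Leaf:
    intervals: List[Interval]  # kept sorted by beginning point


@dataclass
class Internal:
    keys: List[int]         # a_i: lower bound on begin points under child i
    maxes: List[int]        # m_i: maximum end point under child i
    children: List["Node"]  # c_i


Node = Union[Leaf, Internal]


def intersects(a: Interval, b: Interval) -> bool:
    """Two closed intervals intersect if neither ends before the other begins."""
    return a[0] <= b[1] and b[0] <= a[1]


def interval_search(node: Node, query: Interval) -> Optional[Interval]:
    """Return one interval intersecting `query`, or None if there is none."""
    if isinstance(node, Leaf):
        # Step 1: scan the leaf in begin-point order.
        for iv in node.intervals:
            if intersects(iv, query):
                return iv
        return None
    # Step 2: descend into the first child whose [a_i, m_i] intersects the
    # query; the maximums are read from the parent, not from the child.
    for a, m, child in zip(node.keys, node.maxes, node.children):
        if intersects((a, m), query):
            return interval_search(child, query)
    return None


# A hypothetical two-level tree with three leaves.
root = Internal(
    keys=[4, 10, 20],
    maxes=[22, 17, 41],
    children=[
        Leaf([(4, 22), (6, 11)]),
        Leaf([(10, 13), (14, 17)]),
        Leaf([(20, 32), (26, 41)]),
    ],
)
print(interval_search(root, (18, 25)))
```

Because children are tested from left to right and leaves are scanned in key order, the interval returned is the intersecting one with the smallest beginning point.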
This structure does not fully support queries that are based on the end points of the intervals, but it is still helpful in many cases. We will discuss the algorithms for evaluating general timeslice queries, in which all intervals intersecting a given query interval are to be retrieved. Note that the INTERVAL-SEARCH algorithm returns one interval (if there exists at least one) that intersects the given search interval. That interval is also the one with the minimum beginning point. To find all the intervals that intersect a given query interval, we can still use INTERVAL-SEARCH to find the first intersecting interval (the one with the minimum beginning point), and then use the links between the leaf nodes for a sequential search from that point on. Alternatively, we can follow all the child pointers that satisfy the condition in step 2.2 of the algorithm to find all the intersecting intervals, hoping that we will be able to trim some high-level branches of the search tree, which will make answering the query faster. In the latter case, the INTERVAL-SEARCH algorithm has to be modified for a range search by replacing step 2.3 with the following:

2.3) if i < k then i = i + 1, goto 2.2 (the else is deleted)

With this change, the condition in step 2.2 is checked for all children of the node, and all subbranches that satisfy the condition are visited. We refer to this search algorithm as the ALL-INTERVAL-SEARCH algorithm. In the worst case, it may have to trace all the internal nodes in the range, which will be slower than a sequential search on the leaves. On average, however, depending on the distribution of the data intervals, it may be able to trim some branches rooted at a level higher than the leaf level, which reduces the number of nodes read. Example 3.1 demonstrates such a case.

[Figure 3.2: The IB+tree used in Example 3.1. The root R keeps the maximum end points 22, 17, and 41 for its leaf children C1 = ([4, 22], [6, 11]), C2 = ([10, 13], [14, 17]), and C3 = ([20, 32], [26, 41]).]

Example 3.1: In Figure 3.2, we see an IB+tree of height 2 and of order 3. Let us assume we want to find all the intervals that intersect the search interval [18, 25]. We can answer this query in two ways. In the first, we use the INTERVAL-SEARCH algorithm to find the first intersecting interval with the minimum starting point ([4, 22]) and then carry out a sequential search following the links between the leaf nodes. For this we have to visit the nodes R, C1, C2, C3 in order. In the second, we use the ALL-INTERVAL-SEARCH algorithm, in which case we visit the nodes R, C1, C3, but not C2.

4. Time-splits

Although the IB+tree is a simple structure that allows dynamic management of valid time intervals, it may not be efficient for some distributions. Consider the case where each leaf node has a very long interval. The augmented information in the nodes of the IB+tree will not be useful to trim the search, as most of the leaves will have at least one long interval that would probably intersect the query interval. In this case, the IB+tree will not be any more helpful than a B+tree on the beginning points of the intervals.
Although such a pathological case is not likely to happen, it is obvious that the efficiency of IB+trees (for timeslice queries) very much depends on how much the augmented information can be used to trim the search. To improve efficiency, we suggest applying time-splits at the leaf level that partition long intervals into disjoint subintervals and distribute them over several leaf nodes. The time-split operation for IB+trees is different from the conventional time-split operation used in structures for indexing transaction time intervals. In such structures, the time-split partitions (splits) some of the intervals in a node, accommodating the new partitions in a newly created node. In IB+trees, the split parts of the intervals after the time-split are reinserted into the tree with respect to their beginning points, as usual. Note that a time-split does not require any extra operation on the B+tree structure (it is only applied at the leaf level), which means it can be implemented with the conventional B+tree operations (just requiring reinsertion of the split parts). Also, since time-splits are done only to increase the efficiency of search operations, they can be avoided entirely during peak hours of operation for better update performance and carried out in batches during off hours.

If a time-split operation is to be applied to a leaf node at time instant t, all data intervals whose end points extend beyond t are split at point t. The split parts are reinserted into the IB+tree, and t is marked as the new maximum end point of the intervals in the leaf. The maximum end point information is posted to the parent level, and this may proceed further up the tree if necessary. Let us give a simple example.

Example 4.1: Assume that there is a leaf node L that accommodates the following 6 intervals.

[Figure 4.1: The intervals in leaf node L (Example 4.1): [2, 6], [3, 46], [4, 10], [6, 58], [8, 11], [12, 14], with the split point at 14.]

Leaf node L: ([2, 6], [3, 46], [4, 10], [6, 58], [8, 11], [12, 14]). The maximum end point for L is 58, which is kept at its parent node as the augmented end point information. If we decide to time-split this node at time point 14, the intervals [3, 46] and [6, 58] will each be separated into two partitions at time point 14 ([3, 46] will be split into [3, 14] and [15, 46], and [6, 58] will be split into [6, 14] and [15, 58]). After the split, L = ([2, 6], [3, 14], [4, 10], [6, 14], [8, 11], [12, 14]), with its maximum end point being 14. The intervals [15, 46] and [15, 58] will be reinserted into another node with respect to their beginning points (which are both 15).

The remaining question is when to apply a time-split to a node and, if a time-split is to be applied, how to pick the time point of the split. The objective is to end up with a leaf node where the durations of the intervals are comparable to each other.

Algorithm TimeSplit(L)
L: A leaf node with k intervals I_j = [b_j, e_j], j = 1..k.
1) if L is underflow then exit; (To avoid splitting the first leaf node)
2) MAXEND = max over j in {1..k} of e_j
3) if L is the rightmost leaf node then goto 8.
4) i = PickSplitPoint(L); (Compute the cost for each end point; return the index of the point with minimum cost, which is the split point)
5) if e_i = MAXEND then goto 8; (The split point is the maximum end point, so no split is necessary)
6) else MAXEND = e_i;
7) for j = 1 to k: if e_j > e_i then Reinsert([e_i + 1, e_j]); e_j = e_i. (Split the interval at e_i and reinsert the split part)
8) Post MAXEND as the new maximum end point for node L to the parent of L. (This may proceed to the upper levels of the tree)

Figure 4.2: TimeSplit algorithm
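Step 7 of the algorithm, applied to the leaf of Example 4.1, can be sketched in Python as below. This is only an illustration: the function name `time_split` is hypothetical, the split point is passed in rather than computed, and the underflow and rightmost-leaf checks (steps 1 and 3) are omitted. Time points are assumed to be integers, so the part after a split at t starts at t + 1, as in the algorithm.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # closed interval (begin, end) over integer time


def time_split(leaf: List[Interval], split_point: int):
    """Cut every interval that extends beyond `split_point`.

    Returns the trimmed leaf, the parts to be reinserted into the tree, and
    the new maximum end point to post to the parent.
    """
    kept: List[Interval] = []
    reinsert: List[Interval] = []
    for begin, end in leaf:
        if end > split_point:
            kept.append((begin, split_point))        # e_j = e_i
            reinsert.append((split_point + 1, end))  # Reinsert([e_i + 1, e_j])
        else:
            kept.append((begin, end))
    new_max_end = max(end for _, end in kept)
    return kept, reinsert, new_max_end


leaf = [(2, 6), (3, 46), (4, 10), (6, 58), (8, 11), (12, 14)]
kept, reinsert, new_max_end = time_split(leaf, 14)
print(kept)         # [(2, 6), (3, 14), (4, 10), (6, 14), (8, 11), (12, 14)]
print(reinsert)     # [(15, 46), (15, 58)]
print(new_max_end)  # 14
```

The beginning points of the intervals left in the leaf are unchanged, so the keys of the tree are not affected; only the reinserted parts go through a regular insertion.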

The algorithm for the time-split operation is given in Figure 4.2. The algorithm TimeSplit(L) is applied to a leaf node L after a new interval is inserted into L and the end point of the new interval is greater than the maximum end point among the intervals that were already in L. If the maximum end point information for L need not be changed, TimeSplit(L) is not applied; otherwise the new maximum end point is posted to the parent node at the end (step 8). The candidates for split points are the end points of the intervals in the leaf. In other words, the number of candidates is equal to the number of intervals in the leaf. We apply a cost function to each of these candidate points and pick the one with the minimum cost (step 4). If the minimum cost belongs to the maximum end point, no split is necessary (step 5).

Next, we explain how the cost of splitting a leaf node at a given time point is computed. As seen in Figure 4.2, the cost of splitting the leaf at each candidate point is computed by the PickSplitPoint(L) function, which takes a leaf node as input and returns the index of the interval whose end point is the best split point (the one with the minimum cost). The algorithm for PickSplitPoint(L) is given in Figure 4.3.

Algorithm PickSplitPoint(L)
L: A leaf node with k intervals I_j = [b_j, e_j], j = 1..k.
1) MAXBEGIN = max over j in {1..k} of b_j (Maximum beginning point)
2) for j = 1 to k: if e_j < MAXBEGIN then ENDLIST[j] = MAXBEGIN; else ENDLIST[j] = e_j;
3) for j = 1 to k: $\mathrm{CUTCOST}(j) = \sum_{i=1}^{k} \left| \mathrm{ENDLIST}[i] - \mathrm{ENDLIST}[j] \right|$
4) Let MINCOST = min over i in {1..k} of CUTCOST(i)
5) for j = 1 to k: NUMSPLIT[j] = number of intervals to be split if e_j is chosen as the split point.
6) for j = 1 to k: FINALCOST(j) = CUTCOST(j) + MINCOST * α * NUMSPLIT[j];
7) Let I_x be the interval such that FINALCOST(x) = min { FINALCOST(j) | j = 1..k and e_j > MAXBEGIN } (I_x is the interval whose end point e_x will be the split point)
8) Return x;

Figure 4.3: Algorithm PickSplitPoint

Since the IB+tree uses the beginning points of the intervals as keys, the PickSplitPoint function never selects a split point that is less than or equal to the maximum beginning point. So the key information in the IB+tree is never changed due to a time-split. To compute the costs, first a list of the end points of the intervals is kept in ENDLIST[]. For intervals whose end points are less than the maximum key (beginning point), the maximum key is taken as the end point (step 2). An intermediate cost function CUTCOST() computes, for each end point, the sum of absolute differences from the other end points (step 3). The minimum value among these intermediate costs is computed and stored in MINCOST (step 4). For each end point, the number of intervals that would have to be split if that end point is chosen as the split point is kept in another list, NUMSPLIT[] (step 5). It is used to integrate the number of intervals to be split into the split cost, which is done in step 6. The final cost of a point is its intermediate cost plus a penalty (MINCOST * α) for each interval it causes to split. Here, α is a parameter (0 < α < 1) which can be tuned to adjust the tradeoff between space (hence update cost) and query time. Higher values lead to fewer splits and hence less storage expansion, but also to worse query efficiency. Smaller values lead to better query efficiency, but increase storage requirements and update costs (due to reinsertion of split parts). In our experiments we have chosen α as 0.2.

The cost of the PickSplitPoint() function as written is quadratic (because of step 3); however, it can easily be computed in $O(k \log k)$ time, where k is the number of intervals in the leaf node. For this, the entries of ENDLIST first have to be sorted in increasing order. If that is done before step 3, step 3 can be computed in linear time, since

$\mathrm{CUTCOST}(j+1) = \mathrm{CUTCOST}(j) + (2j - k)\,(\mathrm{ENDLIST}[j+1] - \mathrm{ENDLIST}[j]), \quad j = 1, \ldots, k-1.$

So, after computing CUTCOST(1), the rest can be computed in linear time using the equation above. In that case, the order of the PickSplitPoint() algorithm becomes $O(k \log k)$ due to the initial sort operation. As an example, we show the steps for picking the split point for the node in Figure 4.1.

Example 4.2: We want to find the split point for the leaf node L: (I_1 [2, 6], I_2 [3, 46], I_3 [4, 10], I_4 [6, 58], I_5 [8, 11], I_6 [12, 14]). We call the function PickSplitPoint(L) with α = 0.2:
step 1: MAXBEGIN = 12
step 2: ENDLIST[] = <12, 46, 12, 58, 12, 14>
step 3: CUTCOST[] = <82, 146, 82, 194, 82, 82>
step 4: MINCOST = 82 (α * MINCOST ≈ 16)
step 5: NUMSPLIT[] = <3, 1, 3, 0, 3, 2>
step 6: FINALCOST[] = <130, 162, 130, 194, 130, 114>
step 7: The split point is 14, which is the end point of the interval I_6.
step 8: Return 6.

To decrease the number of disk accesses during search queries, it is important to avoid cases where a search query retrieves a leaf node and finds only one interval to put into the result from that leaf.
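The steps of PickSplitPoint and the linear-time form of step 3 can be sketched in Python as follows. The function names are hypothetical, indices are 0-based (so the interval I_6 of Example 4.2 has index 5), and NUMSPLIT is counted on the adjusted ENDLIST values, which is an assumption chosen to match the worked example. Returning `None` when no end point exceeds the maximum beginning point is also an assumption; the text does not cover that case.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # closed interval (begin, end)


def cut_costs_direct(endlist: List[int]) -> List[int]:
    """Step 3 as stated: sum of absolute differences, O(k^2)."""
    return [sum(abs(x - y) for x in endlist) for y in endlist]


def cut_costs_sorted(endlist: List[int]) -> List[int]:
    """Step 3 via the recurrence on sorted end points, O(k log k) overall."""
    k = len(endlist)
    order = sorted(range(k), key=lambda j: endlist[j])
    s = [endlist[j] for j in order]
    costs_sorted = [sum(x - s[0] for x in s)]  # CUTCOST of the smallest point
    for j in range(1, k):  # j plays the role of the 1-based index in the text
        costs_sorted.append(costs_sorted[-1] + (2 * j - k) * (s[j] - s[j - 1]))
    costs = [0] * k
    for pos, j in enumerate(order):  # map costs back to the original order
        costs[j] = costs_sorted[pos]
    return costs


def pick_split_point(leaf: List[Interval], alpha: float = 0.2) -> Optional[int]:
    """Return the 0-based index of the interval whose end is the split point."""
    max_begin = max(b for b, _ in leaf)                         # step 1
    endlist = [max(e, max_begin) for _, e in leaf]              # step 2
    cutcost = cut_costs_sorted(endlist)                         # step 3
    mincost = min(cutcost)                                      # step 4
    numsplit = [sum(1 for x in endlist if x > y) for y in endlist]  # step 5
    final = [c + mincost * alpha * n for c, n in zip(cutcost, numsplit)]  # 6
    candidates = [j for j, (_, e) in enumerate(leaf) if e > max_begin]    # 7
    if not candidates:
        return None
    return min(candidates, key=lambda j: final[j])              # step 8


leaf = [(2, 6), (3, 46), (4, 10), (6, 58), (8, 11), (12, 14)]
x = pick_split_point(leaf)
print(x, leaf[x][1])  # index of the chosen interval and its end point
```

For the leaf of Example 4.2 this selects index 5, that is I_6 with split point 14. The sketch keeps α * MINCOST as the exact value 16.4 rather than the rounded 16 used in the example, which does not change the choice.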
From the algorithms above, it can be observed that a leaf node is not necessarily split whenever it has long intervals. If there are too many long intervals in the node, it is not a good idea to split the node anyway: if many long intervals exist in a node, many of them will contribute to the answer of an interval search query. Besides, a time-split on such a node will cause many new partitions (from the long intervals) to be reinserted, which may be very costly.

5. Experimental Results

In this section, we present our experimental results with IB+trees and one-dimensional R-trees. In the experiments, we compare the search efficiency of IB+trees and R-trees in terms of the number of nodes read during search operations. The experiments are done using five different data sets. Each data set contains 100,000 intervals whose beginning points are distributed randomly in the range 0 to 250,000. The distribution of the durations of the intervals is different in each of the five data sets. Table 5.1 below lists the descriptions of these data sets. We will refer to them using the names (D1, ..., D5) shown in Table 5.1. The queries are evaluated after all data intervals are inserted in each test case.

Dataset  Description
D1       The durations of the intervals are distributed exponentially having mean value 100.
D2       The durations of the intervals are distributed exponentially having mean value [...].
D3       This data set is created by merging sets D1 and D2. 20% of the durations are distributed exponentially with mean 2000, and the rest is distributed exponentially with mean 100.
D4       The durations of the intervals are distributed normally having mean as 200 and standard deviation as also 200.
D5       The durations of the intervals are distributed normally having mean as 2000 and standard deviation as [...].

Table 5.1: Descriptions of the 5 data sets, each having 100,000 intervals.

We compared three structures. The first is the IB+tree as explained in section 3. The second is the IB+tree with time-splits, where the time-splits are done using the algorithms presented in section 4. The third is the one-dimensional R-tree. All of these structures have a maximum fanout of 51 and a minimum fanout of 26 for both internal nodes and leaves, and all of them have the same node structure. The IB+tree (with and without time-splits) keeps a key, the maximum end point of the subtree below, and a pointer in each entry of an internal node. An R-tree node entry has a minimum bounding interval (a beginning and an end point) and a pointer. So, having the same fanout for both structures is a fair assumption.

In Figure 5.1, we see the storage requirements of the three structures in terms of the total number of nodes they have. The one-dimensional R-tree and the IB+tree without time-splits require about the same storage, meaning that the average fanouts of the nodes in both structures are about the same. IB+trees with time-splits require more storage due to the increased number of intervals caused by splits. For data sets D1, D2, D4, and D5, the IB+tree with time-splits requires 50-60% more storage.
For data set D3, the storage requirement is around 85% more, as D3 has a considerable number of long intervals stored together with short intervals, causing more splits to take place.

The query performance results for these index structures are discussed below. These results are obtained by taking averages over 20 different queries. Each query consists of a query interval and the type of temporal relationship employed for the query. To compare the search performance of the index structures, we first tested them on interval timeslice queries, which employ the interval intersection operator (i.e., all data intervals intersecting a query interval are retrieved as the result). The midpoints of the query intervals are picked randomly, and their durations are normally distributed with µ = 100 and σ = 50.

Two types of search strategies are available for IB+trees. The first finds the first interval (with the minimum beginning point) that intersects the query interval using the INTERVAL-SEARCH() algorithm of section 3, and then looks at consecutive leaf nodes sequentially to find the others. We will refer to this as the Sequential Search strategy. The second uses the ALL-INTERVAL-SEARCH() algorithm of section 3. We will refer to this as the Range Search strategy. Generally, the Range Search strategy is superior, but the Sequential Search strategy can be chosen when the leaf nodes are physically clustered in secondary storage. Both strategies can be used for both IB+tree variants, with or without splits.
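A workload of the kind described in this section could be generated along the following lines. This is a rough sketch and not the generator used in the experiments: the function names, the seeds, the use of a uniform distribution for the "random" beginning points and midpoints, the rounding of durations to integers, and the clipping of negative normal samples to zero are all assumptions.

```python
import random
from typing import List, Tuple


def make_intervals(n: int, mean_duration: float,
                   horizon: int = 250_000, seed: int = 0
                   ) -> List[Tuple[int, int]]:
    """Intervals with random begin points and exponential durations."""
    rng = random.Random(seed)
    intervals = []
    for _ in range(n):
        begin = rng.randrange(horizon + 1)  # uniform in [0, horizon]: assumed
        duration = int(rng.expovariate(1.0 / mean_duration))
        intervals.append((begin, begin + duration))
    return intervals


def make_queries(n: int, horizon: int = 250_000, mu: float = 100.0,
                 sigma: float = 50.0, seed: int = 1
                 ) -> List[Tuple[float, float]]:
    """Query intervals with random midpoints and normal durations."""
    rng = random.Random(seed)
    queries = []
    for _ in range(n):
        mid = rng.uniform(0, horizon)
        duration = max(0.0, rng.gauss(mu, sigma))  # clip negatives: assumed
        queries.append((mid - duration / 2, mid + duration / 2))
    return queries


data = make_intervals(100_000, mean_duration=100)  # exponential, mean 100
queries = make_queries(20)                         # 20 queries per experiment
print(len(data), len(queries))
```

Fixing the seeds makes a run repeatable, which is useful when comparing the number of nodes read by different index structures on the same intervals and queries.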

Figure 5.1: Storage requirements of the three structures. (Chart: total number of nodes per data set D1-D5 for the IB+tree with time-splits, the one-dimensional R-tree, and the IB+tree without time-splits.)

The average numbers of internal node accesses per timeslice query are shown in Figure 5.2. For all three index structures, the height was 4 (3 internal levels + 1 leaf level). That means the sequential search method for the IB+tree variants makes 3 internal node accesses for each query. The averages displayed in Figure 5.2 are for the range search method. IB+trees with time-splits make fewer internal node accesses than R-trees for data sets D2 and D3 due to the large number of splits applied on mostly long intervals (which can also be seen from Figure 5.1). In these data sets, IB+trees with time-splits pack the resulting set of data intervals (after splits) more tightly into the leaf nodes. For data sets D1 and D5, R-trees make slightly fewer node accesses than IB+trees with time-splits. IB+trees without time-splits perform the largest number of internal node accesses in every data set.

Figure 5.2: Average number of internal node accesses per timeslice query. (Chart: number of internal node accesses per search per data set D1-D5 for the IB+tree with time-splits, the one-dimensional R-tree, and the IB+tree without time-splits.)

The average numbers of leaf node accesses for the three index structures are given in Figures 5.3 and 5.4. Figure 5.3 compares the IB+trees with and without time-splits. This chart shows how much improvement time-splits bring to IB+trees. When the durations of the intervals are small and relatively close to each other (data sets D1, D4, and D5), IB+trees with time-splits perform close to IB+trees without time-splits, having a slight edge over them. The difference surfaces when there are long intervals distributed with short intervals (data sets D2 and especially D3). The sequential search method for IB+trees without time-splits performs very poorly in such cases. The range search method performs better, but it still does not get close to the performance of IB+trees with time-splits. Sequential and range search methods give close performances in IB+trees with time-splits in every case.

Figure 5.3: Comparison of IB+trees with and without time-splits in terms of the average number of leaf node accesses for timeslice queries. (Chart: average number of leaf node accesses per data set D1-D5 for IB+trees with and without time-splits, each under range search and sequential search.)

Figure 5.4 shows the comparison of IB+trees with time-splits to one-dimensional R-trees. We see that one-dimensional R-trees have comparable performance to IB+trees with time-splits, especially when the range search strategy is used for IB+trees with time-splits.

Figure 5.4: Comparison of IB+trees with time-splits to one-dimensional R-trees in terms of the average number of leaf node accesses for timeslice queries. (Chart: average number of leaf node accesses per data set D1-D5 for the IB+tree with time-splits under range search and sequential search, and for the one-dimensional R-tree.)

Although IB+trees do not perform better than one-dimensional R-trees for timeslice queries, they keep the list of intervals ordered with respect to their beginning points, making them superior to R-trees for other temporal query operators such as right-covered by, equals, right-covers, covered by, left-covered by, right-overlaps, and met by [AH85] (shown in Figure 5.5). These operators are either totally based on beginning points of intervals, or they specify a range for the beginning points of the intervals. R-trees cannot handle such queries as well as they handle timeslice queries.
For example, a simple met by query can be answered by an IB+tree in O(log n) time, while one-dimensional R-trees have to make a point inclusion search to answer the same query. To make this point clear, we experimented on these three index structures for covered by and met by operators.
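As a minimal sketch of why a met by query reduces to a key lookup, the example below replaces the IB+tree with a sorted in-memory list of beginning points and uses binary search. It assumes that "y met by x" means the beginning point of y equals the end point of the query interval x; the class name `BeginIndex` and its method are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Interval = Tuple[int, int]  # closed interval (begin, end)


class BeginIndex:
    """Sorted list of intervals keyed on beginning points (a stand-in for B+tree keys)."""

    def __init__(self, intervals: List[Interval]) -> None:
        self.intervals = sorted(intervals)
        self.begins = [b for b, _ in self.intervals]

    def met_by(self, query: Interval) -> List[Interval]:
        """Return every data interval whose beginning point equals the query's end point."""
        key = query[1]
        lo = bisect_left(self.begins, key)   # first position with begin >= key
        hi = bisect_right(self.begins, key)  # first position with begin > key
        return self.intervals[lo:hi]
```

For instance, over the intervals (1, 4), (5, 7), (5, 9), and (8, 12), the query interval (2, 5) yields (5, 7) and (5, 9). Only the beginning points are consulted, so no end point information is needed to answer the query.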

Figure 5.5: Temporal relationships between intervals. (Diagram: for a reference interval x, the position of y under each relationship: y before x, y meets x, y left-overlaps x, y left-covers x, y covers x, y right-covered by x, y equals to x, y right-covers x, y covered by x, y left-covered by x, y right-overlaps x, y met by x, y after x.)

The covered by operator specifies the inclusion relationship between intervals (more precisely, between data intervals and the query interval). Since the beginning points of the qualifying data intervals must fall in the range specified by the query interval, covered by queries can be considered to be partially based on beginning points. IB+trees answer such queries by checking all data intervals whose beginning points fall into the specified range, which requires a range search on beginning points. For R-trees, the search strategy is no different from the strategy for interval timeslice queries, i.e., all nodes (leaf or internal) with minimum bounding intervals intersecting the query interval should be accessed.

For covered by queries, we used two query sets. The first set (we will refer to it as Q1) has query intervals whose midpoints are picked randomly and whose durations are normally distributed with µ=1000 and σ=500. The intervals in this set have lengths comparable to those of the data intervals in D1, D3, and D5. The second set (Q2) is similar to Q1, but the lengths of the query intervals are normally distributed with µ=100 and σ=50. So the query intervals in Q2 have lengths comparable to those of the data intervals in D2, D3, and D4.

Figure 5.6: Average number of internal node accesses for the two query sets for covered by queries. (Two panels, for query set Q1 and for query set Q2: number of internal node accesses per data set D1-D5 for the IB+tree with time-splits and the one-dimensional R-tree.)
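The covered by strategy described above for IB+trees, a range search on beginning points followed by a check of the end points, can be sketched as follows. The sketch uses two parallel sorted lists instead of a tree and assumes non-strict inclusion (a data interval sharing a boundary with the query still qualifies), which may differ from the strict relationship drawn in Figure 5.5. The function name `covered_by` is hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Interval = Tuple[int, int]  # closed interval (begin, end)


def covered_by(begins: List[int], ends: List[int], query: Interval) -> Tuple[List[Interval], int]:
    """Return (qualifying intervals, number of candidates examined).

    `begins` must be sorted and `ends[i]` must be the end point of the
    interval that begins at `begins[i]`.
    """
    qb, qe = query
    lo = bisect_left(begins, qb)    # first interval beginning at or after qb
    hi = bisect_right(begins, qe)   # one past the last interval beginning at or before qe
    result = [(begins[i], ends[i]) for i in range(lo, hi) if ends[i] <= qe]
    return result, hi - lo
```

Over the intervals (1, 4), (3, 20), (5, 7), (6, 9), (8, 12), and (15, 16), the query (4, 10) examines three candidates and returns (5, 7) and (6, 9). The long interval (3, 20) intersects the query but is never examined, because its beginning point lies outside the query range; a timeslice-style search would have to visit it.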

Figure 5.6 shows the average number of internal node accesses for IB+trees and one-dimensional R-trees for covered by queries. IB+trees with time-splits and IB+trees without time-splits require the same number of internal node accesses, as covered by queries are answered by checking the intervals whose beginning points fall into the query range. R-trees make more internal node accesses, especially when the data intervals have large durations (data sets D1, D3, and D5).

Figure 5.7: Average number of leaf node accesses for the two query sets for covered by queries. (Two panels, for query set Q1 and for query set Q2: number of leaf node accesses per data set D1-D5 for the IB+tree with time-splits, the one-dimensional R-tree, and the IB+tree without time-splits.)

Figure 5.7 shows the average number of leaf node accesses for covered by queries for query sets Q1 and Q2. When the data intervals are relatively long (D1, D3, and D5), one-dimensional R-trees perform very poorly, especially for short query intervals (Q2). They perform slightly better than IB+trees with time-splits when the data intervals are short (D2 and D4) but the query intervals are long (Q1). IB+trees without time-splits always give the best performance, which is an expected result. The ratio of the number of leaf nodes accessed by an IB+tree with time-splits to the number of leaf nodes accessed by an IB+tree without time-splits also reflects the ratio of their storage requirements: since an IB+tree with time-splits has more intervals to index (due to splits), it makes more leaf node accesses.

Figure 5.8: Average number of leaf and internal node accesses for the two query sets for met by queries. (Two panels, leaf node accesses and internal node accesses: number of node accesses per data set D1-D5 for the IB+tree with time-splits and the one-dimensional R-tree.)

Finally, we show the performances of the three structures for met by queries in Figure 5.8. The met by operator is completely based on beginning points of intervals.
For IB+trees, it is simply a key-based point search. For R-trees, it is no simpler than a point timeslice search, since all nodes whose minimum bounding intervals include the end point of the query interval have to be accessed. While the IB+trees make on average slightly more than one leaf access per query, one-dimensional R-trees have to make an order of magnitude more leaf node accesses.

In the next section, an extension of IB+trees for handling open ended intervals and intervals with moving end points (along the current timeline) is presented.

6. Handling Special Time Variables

In valid time databases, some temporal objects may have valid time intervals with end points that have to be treated specially. Some valid time intervals may be open ended, meaning that their end points can span into the future indefinitely. We will use the special time variable infinity to denote the end points of such intervals. There may also be intervals whose end points are equal to the variable now, which represents the current time value. The indexing problem becomes more interesting with the introduction of such intervals.

One trivial solution is to index intervals with end point now and open ended intervals (with end point infinity) in separate index structures based on their beginning points. All intervals of the form [x, now] (where x is an absolute time point) can be indexed in one B+tree, and all intervals of the form [y, infinity) can be indexed in another. Intervals with absolute (fixed) end points could be indexed in a third index structure such as an IB+tree. In that case, any search query will proceed down all three indices, and the results have to be merged at the end. Using such an indexing scheme may actually be a good idea if the three index structures can be kept on parallel disks to overlap I/O time. However, since all valid time intervals have to be handled dynamically, the traffic of moving intervals among the three structures due to changes may become overwhelming.
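The three-structure solution and the traffic it causes can be made concrete with a small sketch. It uses sorted Python lists in place of the two B+trees and the IB+tree, represents the special end points by the hypothetical sentinels `NOW` and `INFINITY`, and assumes that every interval of the form [x, now] satisfies x <= now. The class `ThreeIndexStore` and its methods are illustrative names only.

```python
from bisect import bisect_right, insort

NOW = "now"            # sentinel for a moving end point (assumption of this sketch)
INFINITY = "infinity"  # sentinel for an open ended interval (assumption of this sketch)


class ThreeIndexStore:
    def __init__(self):
        self.fixed = []       # sorted (begin, end) pairs with absolute end points
        self.until_now = []   # sorted beginning points of intervals [x, now]
        self.open_ended = []  # sorted beginning points of intervals [y, infinity)

    def insert(self, begin, end):
        if end == NOW:
            insort(self.until_now, begin)
        elif end == INFINITY:
            insort(self.open_ended, begin)
        else:
            insort(self.fixed, (begin, end))

    def delete(self, begin, end):
        if end == NOW:
            self.until_now.remove(begin)
        elif end == INFINITY:
            self.open_ended.remove(begin)
        else:
            self.fixed.remove((begin, end))

    def change_end(self, begin, old_end, new_end):
        """Changing the kind of end point moves the interval between structures."""
        self.delete(begin, old_end)
        self.insert(begin, new_end)

    def intersecting(self, qb, qe, now):
        """Query all three structures for intervals intersecting [qb, qe], then merge."""
        upto = bisect_right(self.fixed, (qe, float("inf")))  # entries with begin <= qe
        hits = [(b, e) for b, e in self.fixed[:upto] if e >= qb]
        if now >= qb:  # an interval [x, now] currently reaches the query only if now >= qb
            hits += [(b, NOW) for b in self.until_now[:bisect_right(self.until_now, qe)]]
        hits += [(b, INFINITY) for b in self.open_ended[:bisect_right(self.open_ended, qe)]]
        return sorted(hits, key=lambda interval: interval[0])
```

Every query touches all three lists before merging, and each call to `change_end` that turns an end point of now or infinity into a fixed value costs one deletion in one structure plus one insertion in another, which is the movement the paragraph above warns about.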
With special treatment of the variables now and infinity, it is possible to index all valid time intervals in the same IB+tree structure. It is important to manage the augmented information in the internal nodes during update and search operations, as explained below:

- If all valid time intervals below a subtree have end points that are either absolute time points (past or future) or infinity, we simply keep the maximum as the augmented information for that subtree. Here infinity is treated as positive infinity when finding the maximum, i.e., it will be the maximum.
- If the valid time intervals below a subtree have end points in the past (with values less than the current time) and some end points equal to now, now is used as the augmented information for that subtree.
- If the subtree indexes intervals that have absolute future time points as their end points as well as intervals with end point now, we keep the maximum end point and mark it with a special marker (which requires a bit flag) to denote that the maximum may change as time passes and the variable time point now may become the maximum. If this happens, we modify the maximum for that subtree the next time we access it because of an update operation.

Consider the example given in Figure 6.1. At time 30 (when now is 30), the maximum end point for C3 is 49, marked with the special marker #, as 49 is greater than the current value of now and there is one data interval with end point now ([26, now]) in C3. Let us assume that we invoke an update query at time 52 (inserting the time interval [20, 32]) and access C3. As we go down the tree, we also modify the maximum end point information, as the value of now is greater than 49 at time 52. Note that we do not need to modify the augmented information before, since we will be aware of the fact that the subtree below has the maximum end point max(49, now) during a search operation.

Figure 6.1: An IB+tree for indexing valid-time intervals. (Diagram: root R over leaf nodes C1, C2, and C3, shown at time 30 (now=30) and at time 52 (now=52). At time 30 the leaves hold [3,20], [6,31], [8,9], [11,16], [14,now], [26,now], and [35,49], and the entry for C3 carries (#49). At time 52 the interval [20,32] has been inserted and the marked maximum #49 has been replaced by now.)

Note that the special value infinity is treated as an absolute time value. Certainly, we would expect the intervals with end point infinity to be updated to definite time values (an absolute time point or now) eventually. The insertion, deletion, and search operations of the IB+tree can be slightly modified in the way described above to handle such updates.

Having valid time intervals with end points now or infinity also slightly changes the procedure for time-splits. When calculating the cost of splitting a leaf node with the function PickSplitPoint(), the current value of now should be used. Similarly, the value of infinity can be taken as the maximum integer (2^32 - 1 if 4 bytes are used to represent a time value). The procedure for calculating the costs and splitting the intervals stays the same. If a leaf is split at a future time point and there are intervals having now as end points, the maximum of the leaf node becomes the split point marked with the special marker #, and that should be posted to the parent node as the maximum end point information. Such a case is illustrated in the following example.

Example 6.1: Consider a leaf node L accommodating the intervals shown in Figure 6.2.

Figure 6.2: The intervals in leaf node L (example 6.1). (Diagram: intervals [..., 22], [3, 24], [4, now], [6, 50], [8, 20], and [12, now], with now = 16 and the split point at 24.)

If a time-split operation is done on L, 24 will be chosen as the split point. In that case, the interval [6, 50] will be split into two parts, [6, 24] and [25, 50], where the second part will be reinserted.
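As a minimal sketch of the bookkeeping described so far in this section, the code below computes the augmented information for a set of end points, resolves it to an effective maximum at a given time, and cuts the intervals of a leaf at a given split point. It is an illustration under stated assumptions, not the paper's implementation: the names `augmented_max`, `effective_max`, and `time_split` are hypothetical, now is represented by the string sentinel `NOW`, the split point is passed in rather than chosen by PickSplitPoint(), and cutting an interval whose end point is now when now already exceeds the split point is a guess that the text does not cover.

```python
NOW = "now"               # sentinel for a moving end point (assumption of this sketch)
INFINITY = float("inf")   # infinity treated as the largest absolute value


def augmented_max(ends, now):
    """Return (value, marked) for the end points below a subtree at time `now`.

    `marked` plays the role of the # flag: the stored maximum is an absolute
    future point, but some interval ends at now and may overtake it later.
    """
    absolute = [e for e in ends if e != NOW]
    has_now = any(e == NOW for e in ends)
    if not has_now:
        return max(absolute), False        # rule 1: plain maximum, infinity wins
    if not absolute or max(absolute) <= now:
        return NOW, False                  # rule 2: all fixed end points are not in the future
    return max(absolute), True             # rule 3: future maximum, marked with #


def effective_max(info, now):
    """Resolve stored augmented information to a concrete value during a search."""
    value, marked = info
    if value == NOW:
        return now
    return max(value, now) if marked else value


def time_split(intervals, split_point, now):
    """Cut intervals that extend past `split_point`; return (kept, to_reinsert)."""
    kept, to_reinsert = [], []
    for begin, end in intervals:
        resolved = now if end == NOW else end
        if begin <= split_point < resolved:
            kept.append((begin, split_point))
            to_reinsert.append((split_point + 1, end))  # second part is reinserted
        else:
            kept.append((begin, end))
    return kept, to_reinsert
```

Applied to the legible intervals of Figure 6.2 with now = 16 and split point 24, `time_split` keeps [3, 24], [4, now], [6, 24], [8, 20], and [12, now] and hands back [25, 50] for reinsertion. Calling `augmented_max` on the end points of the kept intervals then gives 24 with the marker set. For node C3 of Figure 6.1, the same function gives 49 with the marker set at time 30 and plain now at time 52.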
The maximum end point of L after the split will be 24, but because of the intervals [4, now] and [12, now], (#24) should be posted as L's maximum end point to its parent. This means that L's maximum at any given time is max(24, now).

7. Conclusion

In this paper, we considered the problem of indexing time intervals in valid time databases. Valid time intervals should be indexed dynamically, since updates and deletions are possible in valid time databases, unlike the case for transaction time databases, where intervals are inserted in an append-only fashion in time order. Valid time intervals may also span into the future, having end points greater than the current time value now. Some of these intervals that span into the future may be open ended, meaning that their end points are indefinite. Handling intervals with moving end points (intervals with end points equal to now) and intervals with open end points together with intervals with fixed end points, while allowing dynamic operations, poses an interesting problem.

We suggest using Interval B+trees to index valid time intervals. Interval B+trees can easily handle dynamic operations, as they are basically B+trees built on the beginning points of the valid time intervals, whose internal nodes are augmented with the maximum end point information of the subtrees below them. To handle skewed distributions efficiently, we introduce a time-split algorithm that is applied to the leaf nodes of the IB+tree to increase the efficiency of search operations. Time-splits help partition relatively long intervals and distribute those partitions among different leaf nodes, so that all the leaves accommodate intervals of comparable lengths. Experimental results show that using time-splits in IB+trees considerably improves search performance, at the cost of some increase in storage due to the larger number of intervals created by the partitions.
Comparison with one-dimensional R-trees showed that IB+trees with time-splits perform very close to R-trees for timeslice queries; however, IB+trees are far superior for many temporal queries that are based on beginning points of time intervals (such as met by and covered by). This result was expected, since IB+trees index the intervals with respect to their beginning points.

We have also shown modifications to IB+trees for handling valid time intervals that have moving (now) or indefinite (infinity) end points. Since the beginning point of any valid time interval is always a fixed point, these modifications only concern the handling of augmented data and do not make any changes to the underlying B+tree structure. On the other hand, it is not clear how useful R-trees can be when open ended intervals are indexed together with intervals with fixed end points, especially in terms of controlling the overlap.

References

[AH85] J.F. Allen, P.J. Hayes, A Common-sense Theory of Time, Proceedings of the International Joint Conference on Artificial Intelligence, August
[BGO+93] B. Becker, S. Gschwind, T. Ohler, B. Seeger, P. Widmayer, On Optimal Multiversion Access Structures, Proceedings of Symposium on Large Spatial Databases, in Lecture Notes in Computer Science, Vol 692, pages , Singapore
[BO95] T. Bozkaya, M. Ozsoyoglu, Indexing Transaction Time Databases, Technical Report CES Computer Engineering and Science Department, CWRU.
[CLR90] T.H. Cormen, C.E. Leiserson, R.L. Rivest, Introduction to Algorithms, McGraw-Hill
[EWK90] R. Elmasri, G.T.J. Wuu, Y. Kim, The Time-Index: An Access Structure for Temporal Data, Proceedings of 16th VLDB Conference, pages 1-12, August
[EWK93] R. Elmasri, G.T.J. Wuu, V. Kouramajian, The Time-Index and The Monotonic B+tree, In [T93], chapter
