Lecture 6: External Interval Tree (Part II)

Yufei Tao
Division of Web Science and Technology
Korea Advanced Institute of Science and Technology
taoyf@cse.cuhk.edu.hk

3 Making the external interval tree dynamic

Remember that it is not our purpose to design a static structure for the stabbing problem; in fact, a persistent B-tree already fulfills that purpose. Our goal is to have a fully dynamic structure. In the sequel, we discuss how the external interval tree can be updated efficiently. For this purpose, we make the tall-cache assumption $M \geq B^2$, which allows us to avoid some complicated details, so that we can focus on learning several major techniques for dynamizing an external memory structure.

3.1 Dynamizing an underflow structure

Recall that each node of the external interval tree has an underflow structure, which is a persistent B-tree. As mentioned in the previous lecture, in general we cannot efficiently insert or delete an arbitrary interval in a persistent B-tree. However, an underflow structure has a special property: it indexes at most $B^2$ intervals. Next, we will utilize this property to make an underflow structure dynamic.

Consider, in general, a data structure $T$ that manages at most $N$ elements. Assume that $T$ (i) occupies $space(N)$ blocks, (ii) can be constructed in $build(N)$ I/Os, and (iii) supports a query in $query(N) + O(K/B)$ I/Os, where $K$ is the number of elements reported. Then:

Lemma 1. $T$ can be converted into a fully dynamic structure that has size $space(N)$, answers a query in $O(query(N) + K/B)$ I/Os, and supports an insertion or a deletion in $\frac{1}{B} \, build(N)$ I/Os amortized.

To achieve the above, it suffices to associate $T$ with one additional (disk) block that buffers all incoming updates (insertions and deletions). In other words, each update is simply placed in the buffer block without actually modifying $T$. Obviously, the space complexity remains $space(N)$.

To answer a query, we first retrieve from $T$ the set $S$ of qualifying elements in $query(N) + O(|S|/B)$ I/Os. Remember, however, that some elements in $S$ may no longer belong to the dataset, due to the deletions in the buffer block. Conversely, some new elements added to the dataset by the insertions in the buffer may also need to be reported. To account for these changes, it suffices to spend one extra I/O inspecting the buffer block. In any case, $|S|$ cannot differ from $K$ by more than $B$. Hence, the total query cost is $query(N) + 1 + O((K + B)/B) = O(query(N) + K/B)$.

How do we incorporate the buffered updates into $T$? Nothing needs to be done until the buffer gets full, i.e., $B$ updates have accumulated. At this time, we simply rebuild the entire $T$ in $build(N)$ I/Os, and then clear the buffer. Since this happens only once every $B$ updates, on average each update bears only $build(N)/B$ I/Os.
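To make the buffering idea concrete, here is a minimal Python sketch of Lemma 1. The class names, the predicate-based query interface, and the in-memory containers are our own illustrative choices; `StaticStructure` stands in for any static structure $T$ with the stated bounds, and actual disk I/O is not modeled.

    class StaticStructure:
        """Hypothetical stand-in for the static structure T (always rebuilt from scratch)."""
        def __init__(self, elements):
            self.data = sorted(elements)
        def query(self, predicate):
            return [x for x in self.data if predicate(x)]

    class BufferedDynamicStructure:
        def __init__(self, elements, B):
            self.B = B                      # one buffer block holds up to B updates
            self.elements = set(elements)   # the logical dataset
            self.static = StaticStructure(self.elements)   # build(N) I/Os
            self.buffer = []                # pending updates (the buffer block)

        def _update(self, op, x):
            # Record the update in the buffer; rebuild T once B updates accumulate.
            self.buffer.append((op, x))
            if len(self.buffer) == self.B:
                for op2, y in self.buffer:
                    (self.elements.add if op2 == 'ins' else self.elements.discard)(y)
                self.static = StaticStructure(self.elements)   # build(N) I/Os,
                self.buffer.clear()         # once per B updates: build(N)/B amortized

        def insert(self, x): self._update('ins', x)
        def delete(self, x): self._update('del', x)

        def query(self, predicate):
            # Query T, then patch the answer with one extra I/O over the buffer:
            # drop buffered deletions, add qualifying buffered insertions.
            S = set(self.static.query(predicate))
            for op, x in self.buffer:
                if op == 'del':
                    S.discard(x)
                elif predicate(x):
                    S.add(x)
            return S

Note that the buffer is replayed in arrival order, both when patching a query answer and when rebuilding, so an insertion followed by a deletion of the same element (or vice versa) is resolved correctly.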

Lemma 1 is particularly useful when $N = B^2$ and $build(B^2) = O(B)$, i.e., $T$ is a $B^2$-structure that can be built in linear time. In this case, each update can be performed in $O(1)$ I/Os amortized. This is true for an underflow structure in the external interval tree. Notice that the tall-cache assumption permits us to simply read all the (at most $B^2$) elements into memory and construct the persistent B-tree there. The only cost incurred is that of reading the elements and writing the structure back to the disk, i.e., $O(B)$.¹ We therefore have:

Corollary 1. Each underflow structure consumes linear space, answers a stabbing query in $O(1 + K/B)$ I/Os, and supports an insertion or a deletion in constant I/Os amortized.

¹ Strictly speaking, the situation is a bit more complex: besides the elements, the construction algorithm of the persistent B-tree also needs to store additional data, which could push the total memory consumption over $B^2$ words. However, one can show that the algorithm requires only $O(B^2)$ words at any moment. Hence, we can eliminate this issue by constraining an underflow structure to contain at most $B^2/c$ elements, for a proper constant $c$.

3.2 Modifying the external interval tree

We need to slightly modify the static external interval tree of the previous lecture to make it dynamic. First, the base tree $T$ is a weight-balanced B-tree (instead of a normal B-tree), where each leaf node has capacity $B$ and each internal node has at most $\sqrt{B}$ child nodes (implying that the branching parameter is $\sqrt{B}/4$).

Now consider an internal node $u$ in $T$ with child nodes $u_1, \ldots, u_f$. Before, each $L_u(i)$ (resp. $R_u(i)$) was implemented as a linked list; now we implement it as a B-tree indexing the left (resp. right) endpoints. The purpose is to insert/remove an interval in $L_u(i)$ ($R_u(i)$) using a number of I/Os logarithmic in the size of the B-tree. Similarly, we implement each $M_u[i,j]$ as a B-tree (indexing, e.g., the left endpoints of its intervals).

The last change concerns what it means for a multi-slab $\sigma_u[i,j]$ to be underflowing. Before, this was defined as $\sigma_u[i,j]$ having less than $B$ (middle) intervals. Now, we extend the definition: if $M_u[i,j]$ is non-empty, $\sigma_u[i,j]$ underflows when it has less than $B/2$ intervals; otherwise (i.e., all the intervals belonging to $\sigma_u[i,j]$ are indexed by the underflow structure $U_u$), $\sigma_u[i,j]$ underflows as long as it has less than $B$ intervals. We stick to the invariant that if $\sigma_u[i,j]$ underflows, its intervals are managed by $U_u$; otherwise, they are indexed by $M_u[i,j]$. Notice that the modified definition creates a leeway of $B/2$ before the intervals of $\sigma_u[i,j]$ are moved between $M_u[i,j]$ and $U_u$. In any case, $U_u$ still manages at most $B^2$ intervals.

The above changes affect neither the space consumption of the overall structure nor the query algorithm and its cost. We are now ready to clarify the update algorithms.

3.3 Performing an insertion

Let $s$ be the interval being inserted. We first insert the left and right endpoints of $s$ in $T$ (without handling overflows yet, even if they occur) by traversing at most two root-to-leaf paths. In doing so, we have also identified the node $u$ whose stabbing set $S_u$ should receive $s$. Assume that $u$ has $f$ child nodes, and that the left (right) endpoint of $s$ falls in $\sigma(u_i)$ ($\sigma(u_j)$) for some $i, j$. Cut $s$ into a left interval $s_l$, a middle interval $s_m$, and a right interval $s_r$. Insert $s_l$ ($s_r$) into the left (right) structure of $u$, more specifically, $L_u(i)$ ($R_u(j)$).

If $s_m \neq \emptyset$, we check whether the intervals of $\sigma_u[i+1, j-1]$ are currently indexed by $M_u[i+1, j-1]$. If so, $s_m$ is inserted there. Otherwise, we add $s_m$ to $U_u$. Now $\sigma_u[i+1, j-1]$ may have $B$ intervals, so that it no longer underflows. In this case, we find those intervals in $O(1)$ I/Os (by performing a stabbing query on $U_u$), delete all of them from $U_u$ in $O(B)$ amortized I/Os (see Corollary 1), initialize an empty B-tree $M_u[i+1, j-1]$, and insert the $B$ intervals into $M_u[i+1, j-1]$ using $O(B)$ I/Os. We can charge this cost over the at least $B/2$ elements added to $U_u$ since the previous underflow of $\sigma_u[i+1, j-1]$. Therefore, on average, each insertion bears only $O(B)/(B/2) = O(1)$ I/Os for the movement of intervals from $U_u$ to $M_u[i+1, j-1]$. The cost so far is $O(\log_B N)$ amortized.
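The following Python fragment sketches this routing of a middle interval between $U_u$ and $M_u[i,j]$, together with the promotion and demotion dictated by the modified underflow definition. The container choices and method names are illustrative only: in the real structure, $U_u$ is a dynamized persistent B-tree (Corollary 1) and each $M_u[i,j]$ is a B-tree, for which plain Python sets and sorted lists stand in here.

    import bisect

    class NodeSecondary:
        def __init__(self, B):
            self.B = B
            self.U = {}   # underflow structure U_u: multi-slab (i, j) -> set of intervals
            self.M = {}   # non-underflowing multi-slabs: (i, j) -> sorted list (a "B-tree")

        def insert_middle(self, slab, s_m):
            if slab in self.M:                       # sigma_u[i, j] is indexed by M_u[i, j]
                bisect.insort(self.M[slab], s_m)     # logarithmic I/Os in a real B-tree
                return
            self.U.setdefault(slab, set()).add(s_m)  # O(1) amortized I/Os (Corollary 1)
            if len(self.U[slab]) >= self.B:
                # sigma_u[i, j] no longer underflows: move its intervals from U_u into a
                # freshly built B-tree M_u[i, j] in O(B) I/Os, charged to the >= B/2
                # insertions into U_u since the multi-slab last underflowed.
                self.M[slab] = sorted(self.U.pop(slab))

        def delete_middle(self, slab, s_m):
            if slab in self.M:
                self.M[slab].remove(s_m)
                if len(self.M[slab]) < self.B // 2:        # underflows again (new definition)
                    self.U[slab] = set(self.M.pop(slab))   # move back into U_u
            else:
                self.U.get(slab, set()).discard(s_m)

The leeway of $B/2$ is exactly what the charging argument exploits: between two consecutive moves of a multi-slab, at least $B/2$ updates must touch it, so each update is charged $O(1)$ I/Os.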

Now it remains to handle overflows, which may have happened to the nodes on the (at most two) root-to-leaf paths we followed at the beginning. We treat the overflows in a bottom-up manner, namely, first handling the at most two leaf nodes, then their parents, and so on. In general, let $v$ be a node that overflows, and $\hat{v}$ be its parent. Split the elements of $v$ into $v_1$ and $v_2$ following the standard algorithm of the weight-balanced B-tree. Let $\ell$ be the splitting value, i.e., all the elements in $v_1$ ($v_2$) are smaller than (at least) $\ell$. Note that $\ell$ becomes a new slab boundary at $\hat{v}$.

We proceed to fix the secondary structures of $v_1$, $v_2$, and $\hat{v}$. Note that the intervals in $S_v$ (the stabbing set of $v$) can now be divided into three groups: (i) those completely to the left of $\ell$, (ii) those completely to the right of $\ell$, and (iii) those crossing $\ell$. The first group becomes $S_{v_1}$, the second becomes $S_{v_2}$, while the intervals of the third group, denoted as $S_{up}$, should be inserted into $S_{\hat{v}}$. Clearly, $S_{v_1}$, $S_{v_2}$, and $S_{up}$ can be obtained in $O(|S_v|/B)$ I/Os by scanning $S_v$ once. In fact, with this cost, we can obtain two sorted lists for $S_{v_1}$: one sorted by the left endpoints of its intervals and the other by their right endpoints (this detail is left to you; a sketch is given below). Refer to the first (second) copy as the left (right) copy of $S_{v_1}$. The same holds for $S_{v_2}$ and $S_{up}$. Before proceeding, we prove:

Lemma 2. Consider a node $u$ and its stabbing set $S_u$. Given the left and right copies of $S_u$, all the secondary structures of $u$ can be built in $O(\sqrt{B} + |S_u|/B)$ I/Os.

Proof. Assume that $u$ has $f \leq \sqrt{B}$ child nodes. By scanning the left copy of $S_u$ once, we can generate the intervals indexed by $L_u(i)$ for each $i \in [1, f]$, after which $L_u(i)$ can be built in $O(1 + |L_u(i)|/B)$ I/Os. Hence, the left structure of $u$ can be constructed in $O(\sqrt{B} + |S_u|/B)$ I/Os in total. Similarly, its right structure can be constructed at the same cost.

As $M \geq B^2$, and there are less than $f^2 = B$ multi-slabs, by scanning the left copy of $S_u$ once, we can obtain the intervals belonging to each multi-slab in $O(|S_u|/B)$ I/Os, such that if a multi-slab has at least $B$ intervals, all those intervals are stored in a file, sorted by their left endpoints; otherwise, the intervals of the (underflowing) multi-slab remain in memory. Build the underflow structure using the intervals in memory, and write the structure to the disk in cost linear to the number of indexed intervals. Finally, for each non-underflowing multi-slab $\sigma_u[i,j]$, build $M_u[i,j]$ on its intervals in cost linear to their number.

Therefore, given $S_{v_1}$ and $S_{v_2}$, the secondary structures of $v_1$ and $v_2$ can be constructed in $O(\sqrt{B} + |S_{v_1}|/B + |S_{v_2}|/B) = O(\sqrt{B} + |S_v|/B)$ I/Os.
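As a hint for the detail left to you above, the following Python sketch (the function name and the tuple representation of intervals are our own) partitions $S_v$ at $\ell$ while preserving sortedness, so the left and right copies of $S_{v_1}$, $S_{v_2}$, and $S_{up}$ all come out of one scan of each copy of $S_v$, i.e., $O(|S_v|/B)$ I/Os.

    def split_stabbing_set(left_copy, right_copy, ell):
        # left_copy / right_copy: S_v sorted by left / right endpoints; intervals are (x1, x2).
        copies = {'v1': ([], []), 'v2': ([], []), 'up': ([], [])}

        def group(interval):
            x1, x2 = interval
            if x2 < ell:
                return 'v1'    # completely to the left of ell
            if x1 >= ell:
                return 'v2'    # completely to the right of ell
            return 'up'        # crosses ell: to be inserted into the stabbing set of the parent

        for s in left_copy:    # one scan: each group inherits left-endpoint order
            copies[group(s)][0].append(s)
        for s in right_copy:   # one scan: each group inherits right-endpoint order
            copies[group(s)][1].append(s)
        return copies          # left and right copies of S_v1, S_v2, S_up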
Now let us focus on $\hat{v}$. The new $S_{\hat{v}}$ is the union of the original $S_{\hat{v}}$ and $S_{up}$; from now on, we use $S_{\hat{v}}$ to refer to the new set. Given the left and right copies of $S_{up}$, it is easy to generate the corresponding copies of $S_{\hat{v}}$ in $O(|S_{\hat{v}}|/B)$ I/Os, after which the secondary structures of $\hat{v}$ can be rebuilt in $O(\sqrt{B} + |S_{\hat{v}}|/B)$ I/Os.

Now it is time to use the fact that $T$ is a weight-balanced B-tree with leaf capacity $b = B$ and branching parameter $p = \sqrt{B}/4$. Let $w(v)$ and $w(\hat{v})$ be the weights of $v$ and $\hat{v}$, respectively. It thus follows that $w(\hat{v}) \leq 4p \cdot w(v)$. Observe that $|S_v| \leq w(v)$ (as each interval in $S_v$ has both endpoints in the subtree of $v$), and similarly $|S_{\hat{v}}| \leq w(\hat{v})$. In other words, the total cost of re-constructing the secondary structures of $v_1$, $v_2$, and $\hat{v}$ is

$$O(\sqrt{B} + |S_v|/B + |S_{\hat{v}}|/B) = O(\sqrt{B} + w(v)/B + w(\hat{v})/B) = O(\sqrt{B} + w(\hat{v})/B) = O(\sqrt{B} + 4p \cdot w(v)/B) = O(w(v)),$$

where the last equality uses the facts that $p \leq \sqrt{B}$ and $w(v) \geq B$.

Recall that, by the property of the weight-balanced B-tree, when $v$ overflows, $\Omega(w(v))$ updates must have been performed in its subtree. Hence, we can amortize the $O(w(v))$ cost of handling the overflow over those updates, so that each of them accounts for only constant I/Os. As each update may need to bear such an amortized cost $O(\log_B N)$ times, it follows that each insertion can be performed in $O(\log_B N)$ I/Os amortized.

Remarks. Two key ingredients in the above insertion algorithm lead to the nice amortized insertion time of $O(\log_B N)$. The first is the $B^2$-structures, each of which is space efficient and supports updates and queries with constant overhead (see Corollary 1). The second is the usage of the weight-balanced B-tree, which allows us to spend, on the overflow of a node, a cost as large as the number of data elements in the subtree of that node. This technique is known as partial rebuilding. This is the first time in this course that we see the necessity of the weight-balanced B-tree.

3.4 Performing a deletion

As expected, the major difficulty of a deletion is the handling of underflows. Interestingly, we will next see how to circumvent this difficulty altogether by using a technique called global rebuilding.

Recall that the query algorithm reports intervals only from stabbing sets. Hence, as long as we (i) keep the stabbing sets updated, and (ii) make sure that the weight-balanced B-tree $T$ still guides a query to the relevant stabbing sets, we can seek ways to save ourselves some trouble when it comes to removing elements from $T$ itself.

With the above in mind, the deletion algorithm can be made surprisingly simple. To delete an interval $s$, we remove it from the secondary structures of the node whose stabbing set contains $s$. This can be done in $O(\log_B N)$ I/Os by reversing the corresponding steps of an insertion. We are done right here, without even removing the left or right endpoint of $s$ from $T$. It is easy to see that the correctness of the query algorithm is still guaranteed. Moreover, as no element is ever deleted from $T$, underflows can never happen.

There is, however, a minor drawback. Since we permit redundant endpoints to remain in $T$, over time the number of endpoints in $T$ can become so much larger than the current $N$ that the height of $T$ may eventually reach $\omega(\log_B N)$. To avoid this, after $N/2$ updates since the construction of $T$ (where $N$ is the size of the dataset $I$ at the time of that construction), we simply rebuild the entire $T$ from scratch by incrementally inserting each interval currently in $I$ (of course, we need to keep track of $I$ exactly, but this can easily be done with another B-tree). Notice that $I$ can have at most $3N/2$ elements at this moment, so $T$ can be re-constructed in $O(N \log_B N)$ I/Os, or merely $O(\log_B N)$ amortized I/Os per update. It is easy to verify that, with this approach, the height of $T$ is $O(\log_B |I|)$ at all times.
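A minimal Python sketch of this global rebuilding schedule follows. `ExternalIntervalTree` is a hypothetical stub standing in for the dynamic structure of this lecture, and the set `I` stands in for the B-tree that tracks the current dataset; only the rebuild trigger matters here.

    class ExternalIntervalTree:
        """Stub standing in for the dynamic external interval tree."""
        def __init__(self, intervals): self.S = set(intervals)
        def insert(self, s): self.S.add(s)
        def delete(self, s): self.S.discard(s)   # endpoints would stay in the base tree T

    class GloballyRebuiltTree:
        def __init__(self, intervals):
            self._rebuild(set(intervals))

        def _rebuild(self, dataset):
            self.I = dataset               # the dataset I at construction time
            self.N0 = len(self.I)          # N = |I| at the time of this construction
            self.updates = 0               # updates since this construction
            self.tree = ExternalIntervalTree(self.I)   # O(N log_B N) I/Os

        def _after_update(self):
            self.updates += 1
            # After N/2 updates, rebuild from scratch: |I| <= 3N/2 at this point,
            # hence O(N log_B N) I/Os, i.e., O(log_B N) amortized per update.
            if self.updates >= max(self.N0 // 2, 1):
                self._rebuild(self.I)

        def insert(self, s):
            self.I.add(s)
            self.tree.insert(s)
            self._after_update()

        def delete(self, s):
            self.I.discard(s)
            self.tree.delete(s)
            self._after_update()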

Summarizing all the above discussion, we have:

Theorem 1. Under the tall-cache assumption $M \geq B^2$, there exists a structure on a set of $N$ intervals that consumes $O(N/B)$ space, supports a stabbing query in $O(\log_B N + K/B)$ I/Os, and can be updated in $O(\log_B N)$ amortized I/Os per insertion and deletion.

Bibliography

The interval tree in internal memory is due to Edelsbrunner [3]. Its external version was developed by Arge and Vitter [1], who showed that Theorem 1 still holds even without the tall-cache assumption, by giving a clever algorithm that constructs an underflow structure in $O(B)$ I/Os using only two memory blocks (i.e., $M = 2B$). They also explained in [2] how to remove the amortization so that each insertion/deletion can be handled in $O(\log_B N)$ I/Os in the worst case. Finally, the partial and global rebuilding techniques we discussed were invented by Overmars [4].

References

[1] L. Arge and J. S. Vitter. Optimal dynamic interval management in external memory. In FOCS, pages 560-569, 1996.

[2] L. Arge and J. S. Vitter. Optimal external memory interval management. SIAM Journal on Computing, 32(6):1488-1508, 2003.

[3] H. Edelsbrunner. A new approach to rectangle intersections, part I. International Journal of Computer Mathematics, 13:209-219, 1983.

[4] M. H. Overmars. The Design of Dynamic Data Structures. Springer-Verlag, 1987.