I/O-Algorithms Lars Arge - PDF Free Download

I/O-Algorithms Fall 203 September 9, 203

I/O-Model lock I/O D Parameters = # elements in problem instance = # elements that fits in disk block M = # elements that fits in main memory M T = # output size in searching problem We often assume that M> 2 P I/O: Movement of block between memory and disk 2

Scanning: Fundamental ounds Internal External Sorting: Permuting log 2 log M min{, log M } Searching: log 2 log 3

External Search Trees (log 2 ) () FS blocking: lock height O(log 2 )/ O(log 2 ) O(log ) Output elements blocked Rangesearch in (log T ) I/Os Optimal: O(/) space and (log T ) query

-trees Seems very difficult to maintain FS blocking during rotation FS-blocking naturally corresponds to tree with fan-out () -trees balanced by allowing node degree to vary Rebalancing performed by splitting and merging nodes 5

(a,b)-tree T is an (a,b)-tree (a 2 and b 2a-) All leaves on the same level (contain between a and b elements) Except for the root, all nodes have degree between a and b Root has degree between 2 and b (2,)-tree (a,b)-tree uses linear space and has height O(log ) a 6

Insert: (a,b)-tree Insert Search and insert element in leaf v DO v has b+ elements/children Split v: make nodes v and v with b and b a elements b 2 insert element (ref) in parent(v) (make new root if necessary) v=parent(v) Insert touch 2 (log a ) nodes v b v v b b 2 2 7

(a,b)-tree Delete Delete: Search and delete element from leaf v DO v has a- elements/children Fuse v with sibling v : move children of v to v delete element (ref) from parent(v) (delete root if necessary) If v has >b (and a+b-<2b) children split v v=parent(v) v a - v 2a - Delete touch O(log a ) nodes

(a,b)-tree properties: If b=2a- every update can cause many rebalancing operations (a,b)-tree insert delete (2,3)-tree If b 2a update only cause O() rebalancing operations amortized If b>2a O( ) ( b O ) rebalancing operations amortized 2-a a * oth somewhat hard to show If b=a easy to show that update causes O( log ) rebalance a a operations amortized 9

Summary/Conclusion: -tree -trees: (a,b)-trees with a,b = () O(/) space O(log +T/) query O(log ) update -trees with elements in the leaves sometimes called + -tree Construction in ( O log M ) I/Os Sort elements and construct leaves uild tree level-by-level bottom-up 0

Summary/Conclusion: -tree -tree with branching parameter b and leaf parameter k (b,k ) All leaves on same level and contain between /k and k elements Except for the root, all nodes have degree between /b and b Root has degree between 2 and b -tree with leaf parameter O(/) space Height O(log b ) O( k ) amortized leaf rebalance operations ( b k b k () O log ) amortized internal node rebalance operations -tree with branching parameter c, 0<c, and leaf parameter Space O(/), updates O(log ), queries O (log T )

Secondary Structures When secondary structures used, a rebalance on v often require O(w(v)) I/Os (w(v) is weight of v) If ( w( v)) inserts have to be made below v between operations O() amortized split bound (log ) O amortized insert bound odes in standard -tree do not have this property In internal memory [ ]-trees have the desired property ut rebalanced using rotations 2

Weight-balanced -tree Idea: Combination of -tree and [ ]-tree Weight constraint on nodes instead of degree constraint Rebalancing performed using split/fuse as in -tree Weight-balanced -tree with parameters b and k (b>, k ) All leaves on same level and contain between k/ and k elements Internal node v at level l has w(v) < b l k Except for the root, internal node v at level l has w(v)> b l k The root has more than one child Internal node degree between b k / b k l l- b k... b l l- l- b b k... b k k and b l k l l- level l / b k level l- b 3

Weight-balanced -tree Insert Search for relevant leaf u and insert new element Traverse path from u to root: If level l node v now has w(v)=b l k+ then split into nodes v and v with w v l l b k ) - ( ') ( -b and w( v'') k 2 l b k ) l- ( b k 2 l - Algorithm correct since b k b k 3 l 5 l such that w( v') b k and w( v'') b k touch O(log k ) nodes b Weight-balance property: ( b l k) updates below v and v before next rebalance operation l b k... b b l k l l k l- l- b k... b k

Weight-balanced -tree Delete Search for relevant leaf u and delete element Traverse path from u to root: If level l node v now has w( v) b then fuse with sibling into node v 2 l 5 l - with b k - w( v') b k 7 l If now w( v') b k then split into nodes with weight l and b k b l k - 7 l l- 5 l b k --b k b k 6 6-5 l- 6 l k b k b k... b l l b l k - k l- l- b k... b k Algorithm correct and touch O(log b k ) nodes Weight-balance property: ( b l k) updates below v and v before next rebalance operation 5

Summary/Conclusion: Weight-balanced -tree Weight-balanced -tree with branching parameter b and leaf parameter k=ω() O(/) space Height O(log b k ) O(log ) rebalancing operations after update b Ω(w(v)) updates below v between consecutive operations on v Weight-balanced -tree with branching parameter c and leaf parameter Updates in O(log ) and queries in O (log T ) I/Os Construction bottom-up in O ( log M ) I/O 6

Persistent -tree In some applications we are interested in being able to access previous versions of data structure Databases Geometric data structures (later) Partial persistence: Update current version (getting new version) Query all versions We would like to have partial persistent -tree with O(/) space is number of updates performed O(log ) update O (log T ) query in any version 7

Persistent -tree Easy way to make -tree partial persistent Copy structure at each operation Maintain version-access structure (-tree) update i i+ i+2 i i+ i+2 i+3 Good O (log T ) O(/) I/O update O( 2 /) space query in any version, but

Persistent -tree Idea: Elements augmented with existence interval and stored in one structure Persistent -tree with parameter b (>6): Directed graph * odes contain elements augmented with existence interval * At any time t, nodes with elements alive at time t form -tree with leaf and branching parameter b -tree with leaf and branching parameter b on indegree 0 node If b=: Query at any time t in O (log T ) I/Os 9

Updates performed as in -tree Persistent -tree: Updates To obtain linear space we maintain new-node invariant: ew node contains between 3 and 7 alive elements and no dead elements 2 3 7 20

Persistent -tree Insert Search for relevant leaf u and insert new element If u contains + elements: lock overflow Version split: Mark u dead and create new node u with x alive element If x 7 : Strong overflow If 3 : Strong underflow x If 3 x 7 then recursively update parent(l): Delete reference to u and insert reference to u 3 7 3 7 2

Strong overflow ( 7 ) Persistent -tree Insert x Split v into u and u with x elements each ( 3 x 2 ) Recursively update parent(u): Delete reference to l and insert reference to v and v 2 2 3 7 7 3 7 3 3 7 Strong underflow ( 3 ) x Merge x elements with y live elements obtained by version split on sibling ( x y 2 ) If x y 7 then (strong overflow) perform split into nodes with (x+y)/2 elements each ( 7 ( x y)/ 2 ) 6 6 Recursively update parent(u): Delete two insert one/two references 22

Persistent -tree Delete Search for relevant leaf u and mark element dead If u contains x Version split: alive elements: lock underflow Mark u dead and create new node u with x alive element Strong underflow ( 3 ): x Merge (version split) and possibly split (strong overflow) Recursively update parent(u): Delete two references insert one or two references 2 3 7 23

Insert lock overflow Version split Persistent -tree done 0,0 Delete lock underflow Version split done -,+ Strong overflow Split Strong underflow Merge done 2 -,+2 Strong overflow Split done -2,+ 3 7 done -2,+2 2

Update: O(log ) Persistent -tree Analysis Search and rebalance on one root-leaf path Space: O(/) At least updates in leaf in existence interval When leaf u dies * At most two other nodes are created * At most one block over/underflow one level up (in parent(l)) During updates we create: * O( ) leaves * O ) nodes i levels up ( ) ( O i O ) i ( i blocks 3 7 25 2

Summary/Conclusion: Persistent -tree Persistent -tree Update current version Query all versions Efficient implementation obtained using existence intervals Standard technique During operations O(/) space O(log ) update O (log T ) query 26

Other -tree Variants Level-balanced -trees Global instead of local balancing strategy Whole subtrees rebuilt when too many nodes on a level Used when parent pointers and divide/merge operations needed String -trees Used to maintain and search (variable length) strings 27

-tree Construction In internal memory we can sort elements in O( log ) time using a balanced search tree: Insert all elements one-by-one (construct tree) Output in sorted order using in-order traversal Same algorithm using -tree use O( log ) I/Os logm A factor of O ) non-optimal ( log As discussed we could build -tree bottom-up in O( log M ut what about persistent -tree? ) I/Os In general we would like to have dynamic data structure to use in O log ) algorithms O log ) I/O operations ( M ( M 2

uffer-tree Technique M elements fan-out M/ O(log ) M Main idea: Logically group nodes together and add buffers Insertions done in a lazy way elements inserted in buffers When a buffer runs full elements are pushed one level down uffer-emptying in O(M/) I/Os every block touched constant number of times on each level inserting elements (/ blocks) costs O( log M ) I/Os 29

Definition: asic uffer-tree -tree with branching parameter and leaf parameter Size M buffer in each internal node M M... M $m$ blocks M Updates: Add time-stamp to insert/delete element Collect elements in memory before inserting in root buffer Perform buffer-emptying when buffer runs full 30

ote: asic uffer-tree uffer can be larger than M during recursive buffer-emptying * Elements distributed in sorted order at most M elements in buffer unsorted Rebalancing needed when leaf-node buffer emptied * Leaf-node buffer-emptying only performed after all full internal node buffers are emptied M... M $m$ blocks M 3

Internal node buffer-empty: asic uffer-tree Load first M (unsorted) elements into memory and sort them Merge elements in memory with rest of (already sorted) elements Scan through sorted list while * Removing matching insert/deletes * Distribute elements to child buffers Recursively empty full child buffers M... M $m$ blocks M Emptying buffer of size X takes O(X/+M/)=O(X/) I/Os 32

asic uffer-tree uffer-empty of leaf node with K elements in leaves Sort buffer as previously Merge buffer elements with elements in leaves Remove matching insert/deletes obtaining K elements If K <K then * Add K-K dummy elements and insert in dummy leaves Otherwise * Place K elements in leaves * Repeatedly insert block of elements in leaves and rebalance K Delete dummy leaves and rebalance when all full buffers emptied 33

asic uffer-tree Invariant: uffers of nodes on path from root to emptied leaf-node are empty Insert rebalancing (splits) performed as in normal -tree v v v Delete rebalancing: v buffer emptied before fuse of v ecessary buffer emptyings performed before next dummyblock delete Invariant maintained v v v 3

asic uffer-tree Analysis: ot counting rebalancing, a buffer-emptying of node with X M elements (full) takes O(X/) I/Os total full node emptying cost O log M ) I/Os Delete rebalancing buffer-emptying (non-full) takes O(M/) I/Os cost of one split/fuse O(M/) I/Os During updates * O(/) leaf split/fuse * O logm ) internal node split/fuse ( M ( Total cost of operations: ( O log M ) I/Os 35

asic uffer-tree Emptying all buffers after insertions: Perform buffer-emptying on all nodes in FS-order resulting full-buffer emptyings cost ( O log M ) I/Os empty O( ) non-full buffers using O(M/) O(/) I/Os M M... M $m$ blocks M elements can be sorted using buffer tree in ( O log M ) I/Os 36

Summary/Conclusion: uffer-tree atching of operations on -tree using M-sized buffers ( O log ) I/O updates amortized M All buffers emptied in ( O log M ) I/Os One-dim. rangesearch operations can also be supported in ( T O log ) I/Os amortized M Search elements handle lazily like updates All elements in relevant sub-trees reported during buffer-emptying uffer-emptying in O(X/+T /), where T is reported elements $m$ blocks Using buffer technique persistent -tree built in O ( log M ) I/O 37

asic buffer tree can be used in external priority queue To delete minimal element: uffered Priority Queue Empty all buffers on leftmost path Delete M elements in leftmost leaf and keep in memory Deletion of next M minimal elements free Inserted elements checked against minimal elements in memory ( M ) M ( M O log ) I/Os every O(M) delete O log ) amortized ( M 3

Other External Priority Queues uffer technique can be used on other priority queue structures Heap Tournament tree Priority queue supporting update often used in graph algorithms O log ) on tournament tree ( 2 Major open problem to do it in O log ) I/Os Worst case efficient priority queue has also been developed operations require O(log I/Os ) M ( M 39

Other uffer-tree Technique Results Attaching () size buffers to normal -tree can also be use to improve update bound uffered segment tree Has been used in batched range searching and rectangle intersection algorithm Has been used on String -tree to obtain I/O-efficient string sorting algorithms 0

Summary/Conclusions: Fund. Data Structures -tree O(/) space, O(log ) update, O(log +T/) query Weight-balanced -tree Ω(w(v)) updates below v between consecutive operations on v Persistent -tree Query in any previous version uffer tree atching of operations to obtain O log ) bounds ( M

References External Memory Geometric Data Structures Lecture notes by. Section -5 2