I/O-Algorithms Lars Arge

Similar documents
Massive Data Algorithmics

Massive Data Algorithmics

Algorithms for Memory Hierarchies Lecture 3

Level-Balanced B-Trees

Module 4: Index Structures Lecture 13: Index structure. The Lecture Contains: Index structure. Binary search tree (BST) B-tree. B+-tree.

Multi-way Search Trees. (Multi-way Search Trees) Data Structures and Programming Spring / 25

CSE 530A. B+ Trees. Washington University Fall 2013

Algorithms for Memory Hierarchies Lecture 2

Lecture 6: External Interval Tree (Part II) 3 Making the external interval tree dynamic. 3.1 Dynamizing an underflow structure

External Memory Algorithms for Geometric Problems. Piotr Indyk (slides partially by Lars Arge and Jeff Vitter)

(2,4) Trees. 2/22/2006 (2,4) Trees 1

l So unlike the search trees, there are neither arbitrary find operations nor arbitrary delete operations possible.

(2,4) Trees Goodrich, Tamassia (2,4) Trees 1

Search Trees - 1 Venkatanatha Sarma Y

Recall: Properties of B-Trees

ICS 691: Advanced Data Structures Spring Lecture 8

Optimal External Memory Interval Management

Trees. Reading: Weiss, Chapter 4. Cpt S 223, Fall 2007 Copyright: Washington State University

l Heaps very popular abstract data structure, where each object has a key value (the priority), and the operations are:

Algorithms. Deleting from Red-Black Trees B-Trees

I/O-Algorithms Lars Arge Aarhus University

Multiway searching. In the worst case of searching a complete binary search tree, we can make log(n) page faults Everyone knows what a page fault is?

Data Structures and Algorithms

Chapter 5 Data Structures Algorithm Theory WS 2016/17 Fabian Kuhn

CSE 326: Data Structures B-Trees and B+ Trees

Sorting and Searching

Background: disk access vs. main memory access (1/2)

External-Memory Algorithms with Applications in GIS - (L. Arge) Enylton Machado Roberto Beauclair

Overview of Presentation. Heapsort. Heap Properties. What is Heap? Building a Heap. Two Basic Procedure on Heap

Lecture Notes: External Interval Tree. 1 External Interval Tree The Static Version

Sorting and Searching

The B-Tree. Yufei Tao. ITEE University of Queensland. INFS4205/7205, Uni of Queensland

Balanced Binary Search Trees. Victor Gao

Tree-Structured Indexes

Course Review. Cpt S 223 Fall 2009

Course Review for Finals. Cpt S 223 Fall 2008

M-ary Search Tree. B-Trees. B-Trees. Solution: B-Trees. B-Tree: Example. B-Tree Properties. Maximum branching factor of M Complete tree has height =

Computational Geometry

Trees. Eric McCreath

Tree-Structured Indexes. Chapter 10

Lecture 22 November 19, 2015

Tree-Structured Indexes

Tree-Structured Indexes

An AVL tree with N nodes is an excellent data. The Big-Oh analysis shows that most operations finish within O(log N) time

Section 4 SOLUTION: AVL Trees & B-Trees

B-Trees. Disk Storage. What is a multiway tree? What is a B-tree? Why B-trees? Insertion in a B-tree. Deletion in a B-tree

(2,4) Trees Goodrich, Tamassia. (2,4) Trees 1

DDS Dynamic Search Trees

Spring 2017 B-TREES (LOOSELY BASED ON THE COW BOOK: CH. 10) 1/29/17 CS 564: Database Management Systems, Jignesh M. Patel 1

CS350: Data Structures AVL Trees

Tree-Structured Indexes

Principles of Data Management. Lecture #5 (Tree-Based Index Structures)

M-ary Search Tree. B-Trees. Solution: B-Trees. B-Tree: Example. B-Tree Properties. B-Trees (4.7 in Weiss)

Material You Need to Know

Lecture 19 Apr 25, 2007

Tree-Structured Indexes ISAM. Range Searches. Comments on ISAM. Example ISAM Tree. Introduction. As for any index, 3 alternatives for data entries k*:

Tree-Structured Indexes

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

CS301 - Data Structures Glossary By

Computational Optimization ISE 407. Lecture 16. Dr. Ted Ralphs

Laboratory Module X B TREES

CS350: Data Structures B-Trees

Augmenting Data Structures

Extra: B+ Trees. Motivations. Differences between BST and B+ 10/27/2017. CS1: Java Programming Colorado State University

CS711008Z Algorithm Design and Analysis

Dictionaries. Priority Queues

CS711008Z Algorithm Design and Analysis

9. Heap : Priority Queue

Binary Heaps in Dynamic Arrays

2-3 and Trees. COL 106 Shweta Agrawal, Amit Kumar, Dr. Ilyas Cicekli

Priority Queues Heaps Heapsort

9/29/2016. Chapter 4 Trees. Introduction. Terminology. Terminology. Terminology. Terminology

Physical Level of Databases: B+-Trees

B-Trees. Version of October 2, B-Trees Version of October 2, / 22

UNIT III BALANCED SEARCH TREES AND INDEXING

Introduction to Indexing R-trees. Hong Kong University of Science and Technology

HEAPS: IMPLEMENTING EFFICIENT PRIORITY QUEUES

Data Structures and Algorithms

Multiway Search Trees. Multiway-Search Trees (cont d)

Priority Queues and Binary Heaps

CSCI2100B Data Structures Heaps

Trees. Courtesy to Goodrich, Tamassia and Olga Veksler

Data Structures and Algorithms for Engineers

CS Fall 2010 B-trees Carola Wenk

Lecture 3: B-Trees. October Lecture 3: B-Trees

Lecture 8 13 March, 2012

Tree-Structured Indexes. A Note of Caution. Range Searches ISAM. Example ISAM Tree. Introduction

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

Cpt S 223 Fall Cpt S 223. School of EECS, WSU

Course Review. Cpt S 223 Fall 2010

Multi-Way Search Trees

CIS265/ Trees Red-Black Trees. Some of the following material is from:

CS 350 : Data Structures B-Trees

CSE 5311 Notes 4a: Priority Queues

Binary Trees

Heaps. 2/13/2006 Heaps 1

CSE 373 OCTOBER 25 TH B-TREES

Binary heaps (chapters ) Leftist heaps

Tree-Structured Indexes

Operations on Heap Tree The major operations required to be performed on a heap tree are Insertion, Deletion, and Merging.

Transcription:

I/O-Algorithms Fall 203 September 9, 203

I/O-Model lock I/O D Parameters = # elements in problem instance = # elements that fits in disk block M = # elements that fits in main memory M T = # output size in searching problem We often assume that M> 2 P I/O: Movement of block between memory and disk 2

Scanning: Fundamental ounds Internal External Sorting: Permuting log 2 log M min{, log M } Searching: log 2 log 3

External Search Trees (log 2 ) () FS blocking: lock height O(log 2 )/ O(log 2 ) O(log ) Output elements blocked Rangesearch in (log T ) I/Os Optimal: O(/) space and (log T ) query

-trees Seems very difficult to maintain FS blocking during rotation FS-blocking naturally corresponds to tree with fan-out () -trees balanced by allowing node degree to vary Rebalancing performed by splitting and merging nodes 5

(a,b)-tree T is an (a,b)-tree (a 2 and b 2a-) All leaves on the same level (contain between a and b elements) Except for the root, all nodes have degree between a and b Root has degree between 2 and b (2,)-tree (a,b)-tree uses linear space and has height O(log ) a 6

Insert: (a,b)-tree Insert Search and insert element in leaf v DO v has b+ elements/children Split v: make nodes v and v with b and b a elements b 2 insert element (ref) in parent(v) (make new root if necessary) v=parent(v) Insert touch 2 (log a ) nodes v b v v b b 2 2 7

(a,b)-tree Delete Delete: Search and delete element from leaf v DO v has a- elements/children Fuse v with sibling v : move children of v to v delete element (ref) from parent(v) (delete root if necessary) If v has >b (and a+b-<2b) children split v v=parent(v) v a - v 2a - Delete touch O(log a ) nodes

(a,b)-tree properties: If b=2a- every update can cause many rebalancing operations (a,b)-tree insert delete (2,3)-tree If b 2a update only cause O() rebalancing operations amortized If b>2a O( ) ( b O ) rebalancing operations amortized 2-a a * oth somewhat hard to show If b=a easy to show that update causes O( log ) rebalance a a operations amortized 9

Summary/Conclusion: -tree -trees: (a,b)-trees with a,b = () O(/) space O(log +T/) query O(log ) update -trees with elements in the leaves sometimes called + -tree Construction in ( O log M ) I/Os Sort elements and construct leaves uild tree level-by-level bottom-up 0

Summary/Conclusion: -tree -tree with branching parameter b and leaf parameter k (b,k ) All leaves on same level and contain between /k and k elements Except for the root, all nodes have degree between /b and b Root has degree between 2 and b -tree with leaf parameter O(/) space Height O(log b ) O( k ) amortized leaf rebalance operations ( b k b k () O log ) amortized internal node rebalance operations -tree with branching parameter c, 0<c, and leaf parameter Space O(/), updates O(log ), queries O (log T )

Secondary Structures When secondary structures used, a rebalance on v often require O(w(v)) I/Os (w(v) is weight of v) If ( w( v)) inserts have to be made below v between operations O() amortized split bound (log ) O amortized insert bound odes in standard -tree do not have this property In internal memory [ ]-trees have the desired property ut rebalanced using rotations 2

Weight-balanced -tree Idea: Combination of -tree and [ ]-tree Weight constraint on nodes instead of degree constraint Rebalancing performed using split/fuse as in -tree Weight-balanced -tree with parameters b and k (b>, k ) All leaves on same level and contain between k/ and k elements Internal node v at level l has w(v) < b l k Except for the root, internal node v at level l has w(v)> b l k The root has more than one child Internal node degree between b k / b k l l- b k... b l l- l- b b k... b k k and b l k l l- level l / b k level l- b 3

Weight-balanced -tree Insert Search for relevant leaf u and insert new element Traverse path from u to root: If level l node v now has w(v)=b l k+ then split into nodes v and v with w v l l b k ) - ( ') ( -b and w( v'') k 2 l b k ) l- ( b k 2 l - Algorithm correct since b k b k 3 l 5 l such that w( v') b k and w( v'') b k touch O(log k ) nodes b Weight-balance property: ( b l k) updates below v and v before next rebalance operation l b k... b b l k l l k l- l- b k... b k

Weight-balanced -tree Delete Search for relevant leaf u and delete element Traverse path from u to root: If level l node v now has w( v) b then fuse with sibling into node v 2 l 5 l - with b k - w( v') b k 7 l If now w( v') b k then split into nodes with weight l and b k b l k - 7 l l- 5 l b k --b k b k 6 6-5 l- 6 l k b k b k... b l l b l k - k l- l- b k... b k Algorithm correct and touch O(log b k ) nodes Weight-balance property: ( b l k) updates below v and v before next rebalance operation 5

Summary/Conclusion: Weight-balanced -tree Weight-balanced -tree with branching parameter b and leaf parameter k=ω() O(/) space Height O(log b k ) O(log ) rebalancing operations after update b Ω(w(v)) updates below v between consecutive operations on v Weight-balanced -tree with branching parameter c and leaf parameter Updates in O(log ) and queries in O (log T ) I/Os Construction bottom-up in O ( log M ) I/O 6

Persistent -tree In some applications we are interested in being able to access previous versions of data structure Databases Geometric data structures (later) Partial persistence: Update current version (getting new version) Query all versions We would like to have partial persistent -tree with O(/) space is number of updates performed O(log ) update O (log T ) query in any version 7

Persistent -tree Easy way to make -tree partial persistent Copy structure at each operation Maintain version-access structure (-tree) update i i+ i+2 i i+ i+2 i+3 Good O (log T ) O(/) I/O update O( 2 /) space query in any version, but

Persistent -tree Idea: Elements augmented with existence interval and stored in one structure Persistent -tree with parameter b (>6): Directed graph * odes contain elements augmented with existence interval * At any time t, nodes with elements alive at time t form -tree with leaf and branching parameter b -tree with leaf and branching parameter b on indegree 0 node If b=: Query at any time t in O (log T ) I/Os 9

Updates performed as in -tree Persistent -tree: Updates To obtain linear space we maintain new-node invariant: ew node contains between 3 and 7 alive elements and no dead elements 2 3 7 20

Persistent -tree Insert Search for relevant leaf u and insert new element If u contains + elements: lock overflow Version split: Mark u dead and create new node u with x alive element If x 7 : Strong overflow If 3 : Strong underflow x If 3 x 7 then recursively update parent(l): Delete reference to u and insert reference to u 3 7 3 7 2

Strong overflow ( 7 ) Persistent -tree Insert x Split v into u and u with x elements each ( 3 x 2 ) Recursively update parent(u): Delete reference to l and insert reference to v and v 2 2 3 7 7 3 7 3 3 7 Strong underflow ( 3 ) x Merge x elements with y live elements obtained by version split on sibling ( x y 2 ) If x y 7 then (strong overflow) perform split into nodes with (x+y)/2 elements each ( 7 ( x y)/ 2 ) 6 6 Recursively update parent(u): Delete two insert one/two references 22

Persistent -tree Delete Search for relevant leaf u and mark element dead If u contains x Version split: alive elements: lock underflow Mark u dead and create new node u with x alive element Strong underflow ( 3 ): x Merge (version split) and possibly split (strong overflow) Recursively update parent(u): Delete two references insert one or two references 2 3 7 23

Insert lock overflow Version split Persistent -tree done 0,0 Delete lock underflow Version split done -,+ Strong overflow Split Strong underflow Merge done 2 -,+2 Strong overflow Split done -2,+ 3 7 done -2,+2 2

Update: O(log ) Persistent -tree Analysis Search and rebalance on one root-leaf path Space: O(/) At least updates in leaf in existence interval When leaf u dies * At most two other nodes are created * At most one block over/underflow one level up (in parent(l)) During updates we create: * O( ) leaves * O ) nodes i levels up ( ) ( O i O ) i ( i blocks 3 7 25 2

Summary/Conclusion: Persistent -tree Persistent -tree Update current version Query all versions Efficient implementation obtained using existence intervals Standard technique During operations O(/) space O(log ) update O (log T ) query 26

Other -tree Variants Level-balanced -trees Global instead of local balancing strategy Whole subtrees rebuilt when too many nodes on a level Used when parent pointers and divide/merge operations needed String -trees Used to maintain and search (variable length) strings 27

-tree Construction In internal memory we can sort elements in O( log ) time using a balanced search tree: Insert all elements one-by-one (construct tree) Output in sorted order using in-order traversal Same algorithm using -tree use O( log ) I/Os logm A factor of O ) non-optimal ( log As discussed we could build -tree bottom-up in O( log M ut what about persistent -tree? ) I/Os In general we would like to have dynamic data structure to use in O log ) algorithms O log ) I/O operations ( M ( M 2

uffer-tree Technique M elements fan-out M/ O(log ) M Main idea: Logically group nodes together and add buffers Insertions done in a lazy way elements inserted in buffers When a buffer runs full elements are pushed one level down uffer-emptying in O(M/) I/Os every block touched constant number of times on each level inserting elements (/ blocks) costs O( log M ) I/Os 29

Definition: asic uffer-tree -tree with branching parameter and leaf parameter Size M buffer in each internal node M M... M $m$ blocks M Updates: Add time-stamp to insert/delete element Collect elements in memory before inserting in root buffer Perform buffer-emptying when buffer runs full 30

ote: asic uffer-tree uffer can be larger than M during recursive buffer-emptying * Elements distributed in sorted order at most M elements in buffer unsorted Rebalancing needed when leaf-node buffer emptied * Leaf-node buffer-emptying only performed after all full internal node buffers are emptied M... M $m$ blocks M 3

Internal node buffer-empty: asic uffer-tree Load first M (unsorted) elements into memory and sort them Merge elements in memory with rest of (already sorted) elements Scan through sorted list while * Removing matching insert/deletes * Distribute elements to child buffers Recursively empty full child buffers M... M $m$ blocks M Emptying buffer of size X takes O(X/+M/)=O(X/) I/Os 32

asic uffer-tree uffer-empty of leaf node with K elements in leaves Sort buffer as previously Merge buffer elements with elements in leaves Remove matching insert/deletes obtaining K elements If K <K then * Add K-K dummy elements and insert in dummy leaves Otherwise * Place K elements in leaves * Repeatedly insert block of elements in leaves and rebalance K Delete dummy leaves and rebalance when all full buffers emptied 33

asic uffer-tree Invariant: uffers of nodes on path from root to emptied leaf-node are empty Insert rebalancing (splits) performed as in normal -tree v v v Delete rebalancing: v buffer emptied before fuse of v ecessary buffer emptyings performed before next dummyblock delete Invariant maintained v v v 3

asic uffer-tree Analysis: ot counting rebalancing, a buffer-emptying of node with X M elements (full) takes O(X/) I/Os total full node emptying cost O log M ) I/Os Delete rebalancing buffer-emptying (non-full) takes O(M/) I/Os cost of one split/fuse O(M/) I/Os During updates * O(/) leaf split/fuse * O logm ) internal node split/fuse ( M ( Total cost of operations: ( O log M ) I/Os 35

asic uffer-tree Emptying all buffers after insertions: Perform buffer-emptying on all nodes in FS-order resulting full-buffer emptyings cost ( O log M ) I/Os empty O( ) non-full buffers using O(M/) O(/) I/Os M M... M $m$ blocks M elements can be sorted using buffer tree in ( O log M ) I/Os 36

Summary/Conclusion: uffer-tree atching of operations on -tree using M-sized buffers ( O log ) I/O updates amortized M All buffers emptied in ( O log M ) I/Os One-dim. rangesearch operations can also be supported in ( T O log ) I/Os amortized M Search elements handle lazily like updates All elements in relevant sub-trees reported during buffer-emptying uffer-emptying in O(X/+T /), where T is reported elements $m$ blocks Using buffer technique persistent -tree built in O ( log M ) I/O 37

asic buffer tree can be used in external priority queue To delete minimal element: uffered Priority Queue Empty all buffers on leftmost path Delete M elements in leftmost leaf and keep in memory Deletion of next M minimal elements free Inserted elements checked against minimal elements in memory ( M ) M ( M O log ) I/Os every O(M) delete O log ) amortized ( M 3

Other External Priority Queues uffer technique can be used on other priority queue structures Heap Tournament tree Priority queue supporting update often used in graph algorithms O log ) on tournament tree ( 2 Major open problem to do it in O log ) I/Os Worst case efficient priority queue has also been developed operations require O(log I/Os ) M ( M 39

Other uffer-tree Technique Results Attaching () size buffers to normal -tree can also be use to improve update bound uffered segment tree Has been used in batched range searching and rectangle intersection algorithm Has been used on String -tree to obtain I/O-efficient string sorting algorithms 0

Summary/Conclusions: Fund. Data Structures -tree O(/) space, O(log ) update, O(log +T/) query Weight-balanced -tree Ω(w(v)) updates below v between consecutive operations on v Persistent -tree Query in any previous version uffer tree atching of operations to obtain O log ) bounds ( M

References External Memory Geometric Data Structures Lecture notes by. Section -5 2