4.4 B-trees

- Disk access vs. main memory access: background
- B-tree concept
- Node structure
- Structural properties
- Insertion operation
- Deletion operation
- Running time

Background: disk access vs. main memory access (1/2)

- One disk page access takes about as long as executing several hundred thousand machine instructions.
- Accessing one random disk page takes about 10 milliseconds.
- Accessing the next consecutive disk page takes much less.
- Overall, disk access takes about 5 milliseconds per page on average.
- So the performance measure for a disk-resident data structure is the disk I/O cost.
Background: disk access vs. main memory access (2/2)

- A simple measure of the disk I/O cost is the number of disk pages accessed, where an access is a read or a write.
- The performance metric for a main-memory data structure is the CPU cost.
- A simple measure of the CPU cost is the number of machine instructions or the number of primitive operations (e.g., comparisons between two elements).

B-tree concept

- A B-tree is:
  - a disk-resident tree.
  - a multi-way balanced tree for external (i.e., on-disk) searching.
  - balanced to minimize the disk I/O cost.
- A B-tree node represents a disk page.
- The page size is typically on the order of hundreds to thousands of bytes (e.g., 512, 1024, ..., 8192).
B-tree structure: example

Figure 4.60: B-tree of order 5.

In practice, the order is much higher than 5.

B-tree node structure (internal node)

- A B-tree internal node is an index node made of entries with two types of fields:
  - p_i (0 ≤ i ≤ M-1) is a pointer (i.e., link).
  - k_i (1 ≤ i ≤ M-1) is a key.

(Internal) Node: p_0 | k_1 p_1 | k_2 p_2 | ... | k_{M-1} p_{M-1}

class Entry { Key k; Node p; }             // one (k_i, p_i) pair
class Node  { Node p; Entry a[1..M-1]; }   // p is p_0; a[i] holds (k_i, p_i)
B-tree node structure (leaf node)

- A B-tree leaf node is a data node made of data records.
- Note: what the textbook calls a leaf node is not actually part of a B-tree. It represents the data records on which the B-tree is built as an index structure. (We follow the textbook's convention anyway.)

(Leaf) Node: r_1 | r_2 | ... | r_L

class Entry { }                   // data record
class Node  { Entry a[1..L]; }

B-tree structural properties

- Every node (except the root) is between 50% and 100% full.
  - A leaf node contains between ⌈L/2⌉ and L data items. (L: load factor of a leaf node, i.e., the maximum number of data items it holds)
  - An internal node (except the root) contains ⌈M/2⌉ to M pointers (i.e., ⌈M/2⌉-1 to M-1 keys). (M: load factor of an internal node, i.e., the maximum number of pointers it holds)
- The i-th key value is the smallest key value in the (i+1)-th subtree; that is, k_i is the smallest key in the subtree under p_i.
- The root node is either a leaf node or contains 2 to M pointers (i.e., 1 to M-1 keys).
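The node layout and the key property above (k_i is the smallest key under p_i) are enough to drive a search. The following is a minimal in-memory sketch in Python, not the textbook's code: the class names `Internal` and `Leaf` and the function `search` are hypothetical, records are reduced to bare keys, and each node visited stands in for one disk page read.

```python
from bisect import bisect_right


class Leaf:
    """Data node: holds the records r_1 .. r_L (here just their keys)."""

    def __init__(self, records):
        self.records = sorted(records)


class Internal:
    """Index node: keys k_1 .. k_{m-1} and pointers p_0 .. p_{m-1}."""

    def __init__(self, keys, children):
        assert len(children) == len(keys) + 1
        self.keys = keys          # keys[i-1] is k_i
        self.children = children  # children[i] is p_i


def search(node, key):
    """Return (found, pages), where pages counts the nodes visited."""
    pages = 1
    while isinstance(node, Internal):
        # Number of keys <= key gives the index i of the pointer p_i to
        # follow, because k_i is the smallest key in the subtree under p_i.
        node = node.children[bisect_right(node.keys, key)]
        pages += 1
    return key in node.records, pages


root = Internal([10, 20], [Leaf([1, 5]), Leaf([10, 12]), Leaf([20, 25])])
hit = search(root, 12)
miss = search(root, 7)
```

In this two-level example every search visits exactly two nodes, one per level, which is why the number of pages accessed tracks the height of the tree.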
B-tree parameters

Consider:
- disk page size B = 8 KB = 8192 bytes
- pointer size P = 4 bytes
- key size K = 8 bytes
- data record size R = 520 bytes

Then:
- for a leaf node, L = ⌊B/R⌋ = ⌊8192/520⌋ = 15.
- for an internal node, M = ⌊(B-P)/(P+K)⌋ + 1 = ⌊8188/12⌋ + 1 = 683.

B-tree node merge and split

- The key to keeping a B-tree balanced is the node split (on insertion) and node merge (on deletion) operations.
- If a node is 100% full, inserting a new key into it forces a split.
- If a node is only 50% full, deleting another key from it forces a merge with a neighboring node (unless a sibling can spare an entry; see the deletion algorithm).
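The parameter calculation for L and M above can be reproduced with integer division. This is a small sketch; the function names `leaf_capacity` and `internal_order` are made up for illustration, and the formulas are the ones stated on the parameters slide (one extra pointer p_0 per internal node, then one key and one pointer per entry).

```python
def leaf_capacity(page_size, record_size):
    """L: how many whole data records fit in one disk page."""
    return page_size // record_size


def internal_order(page_size, pointer_size, key_size):
    """M: pointers per internal node (p_0 plus one per (key, pointer) entry)."""
    return (page_size - pointer_size) // (pointer_size + key_size) + 1


B, P, K, R = 8192, 4, 8, 520   # bytes, as on the slide
L = leaf_capacity(B, R)        # 15
M = internal_order(B, P, K)    # 683
```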
Side note: bounds for split or merge

- In some environments (e.g., a DBMS), users can adjust the lower and upper bounds to values other than 50% and 100%, respectively.
- In that case, the bounds can be set to control the overhead of node splits and merges.
  - If the upper bound is too low, node splits occur too frequently.
  - If the lower bound is too high, node merges occur too frequently.

B-tree insertion algorithm

Assume duplicate keys are ignored.

1. Search the tree for the leaf node into which the key should be inserted, and insert the key if it is new.
2. If the node overflows as a result, then:
   (1) split the node (i.e., acquire a new node);
   (2) move half of the entries to the new node;
   (3) if the node is a leaf, copy the smallest key of the new node to the parent; if it is an internal node, move the smallest key of the new node to the parent.
   If the parent node overflows as a result, repeat step 2 for the parent.
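The insertion steps above can be sketched as a short recursive routine. This is an illustrative in-memory version under stated assumptions, not the textbook's implementation: the names `Node`, `insert`, `_insert`, and `leaf_keys` are hypothetical, a single `Node` class serves for both leaves and internal nodes, records are reduced to bare keys, and the small defaults L = 4 and M = 4 are chosen only to make splits easy to trigger.

```python
from bisect import bisect_left, bisect_right


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # leaf: data keys; internal: k_1 .. k_{m-1}
        self.children = children  # None for a leaf; else p_0 .. p_{m-1}


def insert(root, key, L=4, M=4):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key, L, M)
    if split is None:
        return root
    sep, right = split
    return Node([sep], [root, right])  # root was split: tree grows one level


def _insert(node, key, L, M):
    """Return None, or (separator, new_node) if this node was split."""
    if node.children is None:                        # step 1: at the leaf
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return None                              # duplicate key ignored
        node.keys.insert(i, key)
        if len(node.keys) <= L:
            return None
        mid = len(node.keys) // 2                    # step 2: leaf overflow
        right = Node(node.keys[mid:])                # move half to new node
        node.keys = node.keys[:mid]
        return right.keys[0], right                  # COPY smallest key up
    i = bisect_right(node.keys, key)
    split = _insert(node.children[i], key, L, M)
    if split is None:
        return None
    sep, right = split
    node.keys.insert(i, sep)
    node.children.insert(i + 1, right)
    if len(node.children) <= M:
        return None
    mid = len(node.keys) // 2                        # internal overflow
    up = node.keys[mid]                              # MOVE this key up
    new = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return up, new


def leaf_keys(node):
    """All data keys, read left to right across the leaf level."""
    if node.children is None:
        return list(node.keys)
    return [k for child in node.children for k in leaf_keys(child)]


root = Node([])
for k in [8, 3, 5, 1, 9, 7, 2, 6, 4, 10, 5]:
    root = insert(root, k, L=3, M=3)
```

After the loop, `leaf_keys(root)` lists the ten distinct keys in sorted order; the repeated 5 is dropped, matching the assumption that duplicate keys are ignored.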
B-tree insertion example (1/4)

Figure 4.60. Insert 57.

B-tree insertion example (2/4)

Figure 4.61. Now, insert 55.
B-tree insertion example (3/4)

Figure 4.62. Inserting 55 caused a split into two leaf nodes. Now, insert 40.

B-tree insertion example (4/4)

Figure 4.63. Inserting 40 caused a split into two leaf nodes and then a split of the parent node.
B-tree deletion algorithm

1. Search for the leaf node containing the key to be deleted and, if found, remove the key from the node.
2. If the node is less than half full as a result:
   (1) Look to the adjacent siblings for one that has enough to share (i.e., it is still at least half full after giving one entry away), and borrow an entry from it.
   (2) If neither sibling can afford it, merge with one of them and remove the smallest key of the larger sibling from the parent node. (Here, the larger sibling is the one of the two merged nodes that holds the larger keys.)
3. If the parent node is less than half full as a result, repeat step 2 for the parent.

B-tree deletion example

Figure 4.64: B-tree after deleting 99 from the B-tree in Figure 4.63.

The last node at the leaf level underflows, and so merges with its neighbor. This merge then causes its parent to underflow, so the parent gets one entry from its neighbor.
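Step 2 of the deletion algorithm, applied at the leaf level, can be sketched as follows. This is a partial illustration under assumptions, not a full deletion routine: `Node` and `fix_leaf_underflow` are hypothetical names, the key has already been removed from the leaf, the parent is assumed to have at least two children, and propagating underflow to the parent (step 3) is left out.

```python
from math import ceil


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # leaf: data keys; internal: separator keys
        self.children = children  # None for a leaf


def fix_leaf_underflow(parent, i, L):
    """Repair leaf parent.children[i] if it holds fewer than ceil(L/2) keys.

    parent.keys[j] is the smallest key in parent.children[j + 1].
    """
    low = ceil(L / 2)
    leaf = parent.children[i]
    if len(leaf.keys) >= low:
        return "ok"
    left = parent.children[i - 1] if i > 0 else None
    right = parent.children[i + 1] if i + 1 < len(parent.children) else None
    # Step 2(1): borrow from a sibling that stays at least half full.
    if left is not None and len(left.keys) > low:
        leaf.keys.insert(0, left.keys.pop())      # take left's largest key
        parent.keys[i - 1] = leaf.keys[0]         # leaf has a new smallest key
        return "borrowed from left"
    if right is not None and len(right.keys) > low:
        leaf.keys.append(right.keys.pop(0))       # take right's smallest key
        parent.keys[i] = right.keys[0]            # right has a new smallest key
        return "borrowed from right"
    # Step 2(2): merge, then drop the larger sibling's key from the parent.
    if left is not None:
        left.keys.extend(leaf.keys)               # leaf is the larger sibling
        del parent.children[i]
        del parent.keys[i - 1]
    else:
        leaf.keys.extend(right.keys)              # right is the larger sibling
        del parent.children[i + 1]
        del parent.keys[i]
    return "merged"


# L = 4, so a leaf needs at least 2 keys; the middle leaf has underflowed.
parent = Node([10, 20], [Node([1, 2, 3]), Node([10]), Node([20, 21])])
outcome = fix_leaf_underflow(parent, 1, L=4)
```

Here the left sibling holds three keys and can spare one, so the outcome is a borrow rather than a merge. A caller would still have to check whether the parent itself underflowed after a merge and repeat the step one level up.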
Running time of B-tree operations

- Given a B-tree of order M with N data nodes, the number of disk pages accessed for a search equals the height of the B-tree.
- The height is approximately:
  - log_{M/2} N in the worst case (every node is half full).
  - log_M N in the best case (every node is completely full).
  - log_{(2/3)M} N in the average case, as confirmed empirically by repeating a large number of random insertions and deletions.

Q: What is the average height if N = 1,000,000 and M = 500?
A: log_{(2/3)·500} 1,000,000 ≈ 2.38, which rounds up to 3. (Only three levels for a million data nodes!)
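The height figures in the running-time discussion can be checked numerically with a change of base, log_b N = ln N / ln b. This is a quick sketch; `height_estimate` is a made-up helper name, and the three fan-out values are the ones given for the worst, average, and best cases.

```python
from math import ceil, log


def height_estimate(n, fanout):
    """log base `fanout` of n, computed by change of base."""
    return log(n) / log(fanout)


N, M = 1_000_000, 500
worst = height_estimate(N, M / 2)        # every node half full
average = height_estimate(N, 2 * M / 3)  # nodes about two-thirds full
best = height_estimate(N, M)             # every node completely full

levels = ceil(average)                   # whole levels actually traversed
```

With these inputs the average-case estimate is about 2.38, and all three cases round up to 3 levels.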