CSE332: Data Abstractions. Lecture 10: More B Trees; Hashing.


B+ Tree Review
CSE332: Data Abstractions, Lecture 10: More B Trees; Hashing. Dan Grossman, Spring 2010

M-ary tree with room for L data items at each leaf
Order property: the subtree between keys x and y contains only data that is >= x and < y (notice the >=)
Balance property: all nodes and leaves are at least half full, and all leaves are at the same height
find and insert are efficient; insert uses splitting to handle overflow, which may require splitting the parent, and so on recursively
[Figure: an internal node with keys 7 and 21 routes x < 7 to the left child, 7 <= x < 21 to the middle child, and 21 <= x to the right child]

Can do a little better with insert: adoption
Eventually we have to split up to the root (the tree will fill)
But we can sometimes avoid splitting via adoption: change which leaf is correct by changing parent keys
This idea in reverse is necessary in deletion (next)
[Figures: the same Insert(1) handled first by splitting a leaf, then by adoption from a neighbor]
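As a concrete sketch (not from the lecture), the order property is exactly what drives find: at each internal node, compare the key against the routing keys to pick a child, then scan the leaf. The node layout below is hypothetical, with int keys and M = 3 (a 2-3 tree).

```java
// Sketch of B+ tree find, assuming the order property from the slide:
// child i holds keys k with keys[i-1] <= k < keys[i].
class BPlusFind {
    static class Node {
        int[] keys;        // sorted routing keys (numKeys of them)
        int numKeys;
        Node[] children;   // null for leaves
        int[] data;        // leaf items (only used when children == null)
        int numItems;
    }

    // Descend to the leaf that must contain k, then scan the leaf.
    static boolean find(Node n, int k) {
        while (n.children != null) {
            int i = 0;
            while (i < n.numKeys && k >= n.keys[i]) i++; // >=, per "notice the >="
            n = n.children[i];
        }
        for (int i = 0; i < n.numItems; i++)
            if (n.data[i] == k) return true;
        return false;
    }
}
```

Note the `>=` in the descent loop: a key equal to a routing key belongs in the subtree to its right, matching the order property above.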

And Now for Deletion
[Figures: a sequence of Delete operations on a small B+ tree with M = 3, L = 3. Deleting from a leaf can leave it under-full ("What's wrong?"); the fix is to adopt from a neighbor. A later Delete reaches the case where the neighbors are at their minimum ("Uh-oh, neighbors at their minimum!")]

Move in together and remove the leaf
[Figures: when the neighbors are at their minimum, merge the two leaves and remove one; now the parent might underflow, but it has neighbors too, so the same adopt-or-merge choice applies one level up. Further Delete examples with M = 3, L = 3]

[Figures: final states of the deletion examples with M = 3, L = 3]

Deletion Algorithm
1. Remove the data from its leaf
2. If the leaf now has ceil(L/2) - 1 items, underflow!
   - If a neighbor has more than ceil(L/2) items, adopt one and update the parent
   - Else merge the node with a neighbor (guaranteed to produce a legal number of items); the parent now has one less node
3. If step (2) caused the parent to have ceil(M/2) - 1 children, underflow!
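The adopt-or-merge decision in step 2 can be sketched as a tiny helper (hypothetical, with L = 3 and ceil(L/2) computed as (L + 1) / 2):

```java
// Step 2 of deletion: decide what to do for a leaf that just lost an item.
// minItems = ceil(L/2); a leaf underflows when it drops below that.
class Underflow {
    static final int L = 3;
    static final int MIN_ITEMS = (L + 1) / 2;  // ceil(3/2) = 2

    // Returns "ok", "adopt", or "merge" given the leaf's and a neighbor's counts.
    static String afterRemove(int numItems, int neighborItems) {
        if (numItems >= MIN_ITEMS) return "ok";        // still legal
        if (neighborItems > MIN_ITEMS) return "adopt"; // neighbor can spare one
        return "merge";                                // both at minimum
    }
}
```

The merge branch is safe because two minimum-size leaves together hold at most L items, so the merged leaf has a legal count, as the slide claims.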

Deletion algorithm continued
If an internal node has ceil(M/2) - 1 children:
- If a neighbor has more than ceil(M/2) children, adopt one and update the parent
- Else merge the node with a neighbor (guaranteed to produce a legal number of children); the parent now has one less node and may need to continue up the tree
If we merge all the way up through the root, that is fine unless the root went from 2 children to 1; in that case, delete the root and make its child the new root. This is the only case that decreases the tree height.

Efficiency of delete
- Find the correct leaf: O(log2 M * logM n)
- Remove from the leaf: O(L)
- Adopt from or merge with a neighbor: O(L)
- Adopt or merge all the way up to the root: O(M logM n)
- Total: O(L + M logM n)
But it is not that bad: merges are not that common, and remember that disk accesses were the name of the game: O(logM n)
Note: worth comparing the insertion and deletion algorithms

B Trees in Java?
For most of our data structures, we have encouraged writing high-level, reusable code, such as Java with generics. It is worth knowing enough about how Java works to understand why this is probably a bad idea for B trees, assuming our goal is an efficient number of disk accesses. Java has many advantages, but it was not designed for this. If you just want a balanced tree with worst-case logarithmic operations, no problem:
- If M = 3, this is called a 2-3 tree
- If M = 4, this is called a 2-3-4 tree
The key issue is extra levels of indirection.

Naïve approaches
Even if we assume data items have int keys, you cannot get the data representation you want for really big data:

    interface Keyed<E> {
        int key(E e);
    }
    class BTreeNode<E extends Keyed<E>> {
        static final int M = 8;
        int[] keys = new int[M-1];
        BTreeNode<E>[] children = (BTreeNode<E>[]) new BTreeNode[M];
        int numChildren = 0;
    }
    class BTreeLeaf<E> {
        static final int L = 2;
        E[] data = (E[]) new Object[L];
        int numItems = 0;
    }

What that looks like
[Figure: the BTreeNode and BTreeLeaf objects from the previous slide, each with header words and a reference to a separate (larger) array object; the leaf's data objects are not in contiguous memory]

The moral
The whole idea behind B trees was to keep related data in contiguous memory. All the red references on the previous slide are inappropriate; as a minor point, beware the extra header words. But that is the best you can do in Java. Again, the advantage is generic, reusable code; but for your performance-critical web index, this is not the way to implement your B-Tree for terabytes of data. C# may have better support for flattening objects into arrays; C and C++ definitely do. Levels of indirection matter!

Possible fixes
- Don't use generics? No help: all non-primitive types are reference types
- For internal nodes, use an array of pairs of keys and references? No, that is even worse!
- Instead of an array, have M fields (key1, key2, key3, ...)? This gets the flattening, but now the code for shifting and binary search cannot use loops (tons of code for large M). A similar issue applies to leaf nodes.

Conclusion: Balanced Trees
Balanced trees make good dictionaries because they guarantee logarithmic-time find, insert, and delete. Essential and beautiful computer science, but only if you can maintain balance within the time bound.
- AVL trees maintain balance by tracking height and allowing all children to differ in height by at most 1
- B trees maintain balance by keeping nodes at least half full and all leaves at the same height
Other great balanced trees (see the text; worth knowing they exist):
- Red-black trees: all leaves have depth within a factor of 2
- Splay trees: self-adjusting; amortized guarantee; no extra space for height information
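The "M fields" fix above can be sketched for a tiny M to show its cost: the keys sit directly in the node (no separate array object), but child selection becomes unrolled comparisons instead of a loop. This is a hypothetical illustration with M = 3, not the lecture's code.

```java
// Hypothetical flattened node for M = 3: routing keys live in fields,
// so they are stored inside the node object itself, with no array indirection.
class FlatNode {
    int key1, key2;                   // M-1 = 2 routing keys
    FlatNode child0, child1, child2;  // M = 3 children
    int numKeys;

    // Unrolled child selection: no loop over an array of keys.
    // For large M this becomes tons of code, as the slide warns.
    FlatNode childFor(int k) {
        if (k < key1) return child0;
        if (numKeys < 2 || k < key2) return child1;
        return child2;
    }
}
```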

Hash Tables
Aim for constant-time (i.e., O(1)) find, insert, and delete, on average, under some reasonable assumptions. A hash table is an array of some fixed size. Basic idea: a hash function maps a key from the key space (e.g., integers, strings) to an index: index = h(key), with 0 <= index < TableSize.
There are m possible keys (m typically large, even infinite), but we expect our table to hold only n items, where n is much less than m (often written n << m). Many dictionaries have this property:
- Compiler: all possible identifiers allowed by the language vs. those used in some file of one program
- Database: all possible student names vs. students enrolled
- AI: all possible chess-board configurations vs. those considered by the current player

Hash functions
An ideal hash function:
- Is fast to compute
- Rarely hashes two used keys to the same index (often impossible in theory; easy in practice; we will handle collisions in the next lecture)

Who hashes what?
Hash tables can be generic. To store elements of type E, we just need E to be:
1. Comparable: order any two E (like with all dictionaries)
2. Hashable: convert any E to an int
When hash tables are a reusable library, the division of responsibility generally breaks down into two roles: the client converts E to an int, and the hash table library converts that int to a table index and performs collision resolution. We will learn both roles, but most programmers in the real world spend more time as clients while understanding the library.
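The two roles above compose naturally. A minimal sketch (the character-sum client hash and table size 11 are arbitrary choices here, not the lecture's):

```java
// Sketch of the two hashing roles: the client converts E to an int,
// the library converts that int to an index in [0, TableSize).
class Roles {
    static final int TABLE_SIZE = 11;  // arbitrary prime for this sketch

    // Client role: convert a String to an int (naive sum of chars).
    static int clientHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) h += s.charAt(i);
        return h;
    }

    // Library role: map any int (possibly negative) to a valid index.
    static int tableIndex(int hash) {
        return Math.floorMod(hash, TABLE_SIZE);
    }
}
```

Using `Math.floorMod` rather than `%` matters in Java because `%` can return a negative result for a negative hash, which would be an invalid array index.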

More on roles
There is some ambiguity in terminology about which parts count as "hashing": the client's E-to-int conversion, the library's int-to-index conversion, or both. The two roles must both contribute to minimizing collisions (heuristically):
- The client should aim for different ints for the expected items; avoid wasting any part of E or the 32 bits of the int
- The library should aim for putting similar ints in different indices; the conversion to an index is almost always mod table-size, and using prime numbers for the table size is common

What to hash?
In lecture we will consider the two most common things to hash: integers and strings. If you have objects with several fields, it is usually best to have most of the identifying fields contribute to the hash to avoid collisions. Example:
    class Person { String first; String middle; String last; int age; }
There is an inherent trade-off: hashing time vs. collision avoidance. Bad idea(?): only use the first name. Good idea(?): only use the middle initial. Admittedly, what-to-hash is often an unprincipled guess.

Hashing integers
Key space = integers. Simple hash function: h(key) = key % TableSize. Client: f(x) = x; library: g(x) = x % TableSize. Fairly fast and natural.
Example: TableSize = 10; insert 7, 3, 41, 34, 10 (as usual, ignoring the data along for the ride). The keys land at indices 7, 3, 1, 4, and 0 respectively.
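The integer example is short enough to run directly. A sketch with TableSize = 10 and sample keys 7, 3, 41, 34, 10 (treat the keys as illustrative):

```java
// h(key) = key % TableSize with TableSize = 10: each non-negative key
// lands at its last decimal digit.
class IntHash {
    static int[] table = new int[10];   // 0 marks an empty slot in this toy sketch
    static int h(int key) { return key % 10; }

    static void insertAll(int[] keys) {
        for (int k : keys) table[h(k)] = k;   // ignoring collisions for now
    }
}
```

With these keys there are no collisions; collision handling is the subject of the next lecture.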

Collision-avoidance
With x % TableSize, the number of collisions depends on the ints inserted (obviously) and on TableSize. A larger table size tends to help, but not always: try 7, 3, 41, 34, 10 with TableSize = 10 and TableSize = 7.
Technique: pick the table size to be prime. Why? Real-life data tends to have a pattern, and multiples of 61 are probably less likely than multiples of 60. Next time we will see that one collision-handling strategy does provably better with a prime table size.

More on prime table size
If TableSize is 60:
- Lots of data items that are multiples of 2 waste 50% of the table
- Lots of data items that are multiples of 5 waste 80% of the table
- Lots of data items that are multiples of 10 waste 90% of the table
If TableSize is 61, collisions can still happen, but multiples of 5 (5, 10, 15, 20, ...) will fill the whole table, as will multiples of 10 and multiples of 2.
In general, if x and y are co-prime (meaning gcd(x,y) == 1), then (a * x) % y == (b * x) % y if and only if a % y == b % y. So it is good to have a TableSize with no common factors with any likely pattern x.

Okay, back to the client
If keys aren't ints, the client must convert them to ints. Trade-off: speed vs. distinct keys hashing to distinct ints.

Specializing hash functions
Very important example: strings. Key space K = s0 s1 s2 ... s(m-1), where the si are characters (si in [0,52], or [0,256], or [0,2^16]). Some choices (which avoids collisions best?):
1. h(K) = s0 % TableSize
2. h(K) = (sum of si for i = 0 to m-1) % TableSize
3. h(K) = (sum of si * 37^i for i = 0 to k-1) % TableSize
How might you hash differently if all your strings were web addresses (URLs)?
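The co-prime claim is easy to check numerically: when x and y are co-prime, a maps to (a * x) % y injectively over 0..y-1, so multiples of x spread across every slot; when they share a factor, many slots go unused. A small check (x = 6, y = 60 vs. y = 61, values chosen here for illustration):

```java
// Count how many distinct table slots the multiples of x hit in a table of size y.
// For co-prime x and y this should be all y slots; otherwise only y / gcd(x, y).
class CoprimeDemo {
    static int distinctSlots(int x, int y) {
        boolean[] hit = new boolean[y];
        int count = 0;
        for (int a = 0; a < y; a++) {
            int slot = (a * x) % y;
            if (!hit[slot]) { hit[slot] = true; count++; }
        }
        return count;
    }
}
```

With y = 60 and x = 6, only 10 of the 60 slots are ever used (wasting the rest); with the prime y = 61, all 61 slots are reachable.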

Combining hash functions
A few rules of thumb / tricks:
1. Use all 32 bits (careful: that includes negative numbers)
2. Use different overlapping bits for different parts of the hash; this is why a factor of 37^i works better than 256^i (example: "abcde" and "ebcda")
3. When smashing two hashes into one hash, use bitwise-xor; bitwise-and produces too many 0 bits, and bitwise-or produces too many 1 bits
4. Rely on the expertise of others; consult books and other resources
5. If the keys are known ahead of time, choose a perfect hash
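Choices 2 and 3 from the string-hashing slide can be compared directly: a plain sum of characters cannot distinguish anagrams like "abcde" and "ebcda", while the 37-based polynomial weights each position differently. A sketch (raw int hashes, before any mod by table size):

```java
// Choice 2 (sum of chars) vs choice 3 (polynomial in 37):
// anagrams always collide under the sum, but generally not under the polynomial.
class StringHash {
    static int sumHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) h += s.charAt(i);
        return h;
    }

    static int polyHash(String s) {
        int h = 0;
        for (int i = s.length() - 1; i >= 0; i--)
            h = h * 37 + s.charAt(i);   // Horner's rule for sum of s_i * 37^i
        return h;
    }
}
```

The loop runs from the last character down so that Horner's rule produces exactly the sum of s_i * 37^i from the slide; Java's own String.hashCode uses the same idea with a factor of 31.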