In-Memory Searching. Linear Search. Binary Search. Binary Search Tree. k-d Tree. Hashing. Hash Collisions. Collision Strategies.

In-Memory Searching: Linear Search, Binary Search, Binary Search Tree, k-d Tree, Hashing, Hash Collisions, Collision Strategies (Chapter 4)

Searching
- A second fundamental operation in computer science
- We review O(n) linear search and O(lg n) binary search
- We next discuss more sophisticated approaches
- Two techniques, trees and hashing, form the basis for searching very large datasets on disk

Linear Search
- One of the simplest search algorithms
- Take a collection of n records; scan from start to end, looking for a record with primary key k
- Best case: the target is the first record, O(1)
- Worst case: the target is the last record or not in the collection, O(n)
- Average case: ~n/2 records searched to find the target, also O(n)
- The purpose of linear search is two-fold:
  - It is simple to implement, and is used when n is small or searches are rare
  - It represents a hard upper bound on acceptable search performance
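As a concrete illustration, here is a minimal Python sketch of linear search over a list of (key, record) pairs; the data layout and names are illustrative, not from the slides:

def linear_search(records, k):
    """Scan from start to end for the first record with primary key k."""
    for key, value in records:
        if key == k:
            return value   # Best case: first record, O(1)
    return None            # Worst case: scanned all n records, O(n)

people = [(17, "alice"), (42, "bob"), (8, "carol")]
print(linear_search(people, 8))    # carol
print(linear_search(people, 99))   # None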

Binary Search
- If a collection is sorted, we can perform binary search
- Discards n/2 records from further consideration on the first comparison
- Discards an additional n/4 records from further consideration on the second comparison
- Continues until the record is found or no data is left to search
- An algorithm that repeatedly splits the collection in half runs in O(lg n)
- However, building a sorted collection requires O(n lg n), so maintaining a collection for binary search costs O(n lg n)

Recursive Binary Search

Algorithm binary_search(k, A, lf, rt)
Input: k, target key; A, sorted array to search; lf, start of search range; rt, end of search range

n = rt - lf + 1
if n <= 0 then
    return -1                              // Searching empty range
end
c = lf + n/2                               // Center of search region (integer division)
if k == A[c] then
    return c                               // Target record found
else if k < A[c] then
    return binary_search(k, A, lf, c - 1)  // Search left half of search region
else
    return binary_search(k, A, c + 1, rt)  // Search right half of search region
end
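The same algorithm in runnable Python, written iteratively; the function name and test data are illustrative:

def binary_search(k, A):
    """Return the index of key k in sorted array A, or -1 if absent."""
    lf, rt = 0, len(A) - 1
    while lf <= rt:                    # Non-empty search range
        c = lf + (rt - lf + 1) // 2    # Center of search region
        if k == A[c]:
            return c                   # Target record found
        elif k < A[c]:
            rt = c - 1                 # Search left half of search region
        else:
            lf = c + 1                 # Search right half of search region
    return -1                          # Searching empty range

print(binary_search(23, [4, 8, 15, 16, 23, 42]))  # 4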

Binary Search Tree
- Choose a data structure to implement binary search
- Sorted array: O(lg n) search performance
  - Adding an element: O(lg n) to find its position, O(n) to make space
  - Similarly, deleting an element requires O(n) to fill the hole
- Common alternative: the binary search tree (BST)
  - A tree in which each node holds a primary key and references to (up to) two child nodes
  - All keys in the left subtree are smaller than the parent's key
  - All keys in the right subtree are larger than the parent's key

BST Search

Algorithm bst_search(k, node)
Input: k, target key; node, node in BST to begin search

if node == null then
    return null                          // Searching empty tree
end
if k == node.key then
    return node                          // Target record found
else if k < node.key then
    return bst_search(k, node.left)      // Search left subtree
else
    return bst_search(k, node.right)     // Search right subtree
end

BST Operations
Insertion. To insert a record with key k, search for k in the BST:
- If k is found, we have a duplicate record; replace the node's record with the new record
- If an empty subtree is reached, insert a new node holding k's record
Deletion. To delete the record with key k, search for k in the BST:
1. If k is not found, the deletion fails
2. If k's node has no children, remove k's node and stop
3. If k's node has one subtree, promote that subtree's root into k's position
4. If k's node has two subtrees:
   a. Find the successor of k's node: the smallest key in the right subtree, found by walking right once, then walking left as far as possible
   b. Remove the successor (since it has an empty left subtree, it must match case 2 or 3 above)
   c. Promote the successor into k's node's position
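A minimal Python sketch of these insertion and deletion cases; the node layout and names are assumptions for illustration:

class BSTNode:
    def __init__(self, key, rec):
        self.key, self.rec = key, rec
        self.left = self.right = None

def bst_insert(node, key, rec):
    """Insert (key, rec), replacing the record on a duplicate key."""
    if node is None:
        return BSTNode(key, rec)         # Empty subtree: new node
    if key == node.key:
        node.rec = rec                   # Duplicate: replace record
    elif key < node.key:
        node.left = bst_insert(node.left, key, rec)
    else:
        node.right = bst_insert(node.right, key, rec)
    return node

def bst_delete(node, key):
    """Delete key's node, promoting a subtree or the successor."""
    if node is None:
        return None                      # Case 1: key not found
    if key < node.key:
        node.left = bst_delete(node.left, key)
    elif key > node.key:
        node.right = bst_delete(node.right, key)
    elif node.left is None:
        return node.right                # Cases 2-3: zero or one child
    elif node.right is None:
        return node.left
    else:                                # Case 4: two subtrees
        succ = node.right                # Walk right once...
        while succ.left is not None:     # ...then left as far as possible
            succ = succ.left
        node.key, node.rec = succ.key, succ.rec
        node.right = bst_delete(node.right, succ.key)
    return node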

BST Deletion Cases (figure): deletion with no children; deletion with one child; deletion with two children

BST Performance
- Search, add, and delete all require an initial search
- Add and delete perform a search followed by a constant-time operation, so they are dominated by search performance
- BST roughly balanced (left subtree height ~ right subtree height throughout the tree): each comparison discards half of the remaining nodes, so search is O(lg n)
- BST unbalanced (many empty subtrees at internal levels): search degenerates into linear search, O(n)
- Inserting nearly sorted data into a BST produces an unbalanced tree
- Balance can be maintained with self-balancing trees (e.g., AVL trees)
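A quick, self-contained demonstration of the balance claim: shuffled keys yield a tree of height near lg n, while sorted keys degenerate into a 1000-level chain (dict-based nodes are just for brevity):

import random, sys
sys.setrecursionlimit(5000)   # The degenerate tree is 1000 levels deep

def insert(node, key):
    if node is None:
        return {"key": key, "left": None, "right": None}
    side = "left" if key < node["key"] else "right"
    node[side] = insert(node[side], key)
    return node

def height(node):
    if node is None:
        return 0
    return 1 + max(height(node["left"]), height(node["right"]))

keys = list(range(1000))
random.shuffle(keys)
balanced = None
for k in keys:
    balanced = insert(balanced, k)
degenerate = None
for k in range(1000):
    degenerate = insert(degenerate, k)
print(height(balanced))    # ~20: search is O(lg n)
print(height(degenerate))  # 1000: search is O(n)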

k-d Tree
- A k-dimensional (k-d) tree is a binary tree that subdivides a collection into ranges over k attributes
- Proposed by Jon Louis Bentley in 1975
- Designed to support associative, or multiattribute, searches
- E.g., consider a collection of weather reports with temp and precip; a k-d tree can efficiently support queries like "Return all records with temp < 0 and precip > 4cm"
- The k-d tree's structure is similar to a BST, but we rotate through the k dimensions at each tree level

k-d Tree Index
- The k-d tree's structure is similar to a BST, but we rotate through the k dimensions at each tree level
- E.g., a 2-d temp + precip tree subdivides by temp at the root level, by precip at the second level, again by temp at the third level, and so on
- Each k-d node contains a key k_c and left and right subtrees
- Unlike a BST, records are not stored in internal nodes
- The target key k_t determines which subtree to enter: k_t <= k_c, left subtree; k_t > k_c, right subtree
- Leaf nodes contain collections of records that satisfy the conditions along the root-to-leaf path

k-d Tree Example
- Suppose we want to use a k-d tree to subdivide dwarves by height ht and weight wt
- Snow White and the seven dwarves define the initial tree structure

Name       ht  wt
Sleepy     36  48
Happy      34  52
Doc        38  51
Dopey      37  54
Grumpy     32  55
Sneezy     35  46
Bashful    33  50
Ms. White  65  98

k-d Tree Index Construction
Construction is identical to a BST, but we rotate between the k = 2 dimensions at each level of the tree:
1. Sleepy is inserted at the root of the tree, which uses ht as its subdivision attribute
2. Happy and Doc are inserted as children of Sleepy
   a. Happy's ht = 34 <= 36, so he is inserted left of the root
   b. Doc's ht = 38 > 36, so he is inserted right of the root
   c. Both Happy and Doc use wt as their subdivision attribute (second level)
3. Dopey is inserted
   a. Dopey's ht = 37 > 36 sends him right of the root, to Doc
   b. Dopey's wt = 54 > Doc's wt = 51 puts him right of Doc
   c. Dopey rotates back to ht as his subdivision attribute (third level)
4. The remaining dwarves and Snow White are inserted using the identical approach

Snow White k-d Tree (figure): the first three insertions (ht subdivides the root, wt subdivides the second level); the fourth insertion (ht subdivides the third level); the remaining insertions (leaves hold individual records)

k-d Tree Record Management
- The k-d index is used to locate records based on ht and wt
- Buckets are placed at each (null) leaf, designed to hold records as they are inserted into the k-d tree
- It is critical to choose index records that represent the distribution of attribute values within the data collection
- A poor choice of index records will produce a poor distribution of records
- Index records are normally the first records inserted into the k-d tree

Spatial Interpretation
- The k-d index subdivides the k-dimensional space of all possible records into subspaces, using (k-1)-dimensional cutting planes that represent entries in the index
- Consider visualizing the ht x wt index: since k = 2, we subdivide a 2D plane using 1D lines
- Each region represents a bucket in the index

Snow White k-d Tree Index (figure)

k-d Tree Search
1. Identify all paths whose internal nodes satisfy the target attribute ranges; this may produce multiple paths
2. Perform an in-memory search of each path's bucket for records that match the target criteria
3. Return the records within those buckets that satisfy the search criteria
We want to control the size of buckets:
- Re-indexing the collection is expensive
- The size of the index dictates the maximum number of buckets, and therefore the average expected bucket size

k-d Tree Search Example
Search for records with ht <= 36 and wt <= 47:
1. At the root, branch left (ht <= 36)
2. At the next node, branch left again (wt <= 49)
3. At the next node, branch both left and right (ht <= 35 and ht > 35 both fall within the target range ht <= 36)
4. Along the right path, we reach bucket 3
5. Along the left path, branch left (wt <= 50), reaching bucket 1
Bucket 1 holds ht <= 35, wt <= 50; bucket 3 holds 35 < ht <= 36, wt <= 52
- Both buckets may include records with ht <= 36 and wt <= 47; no other bucket could contain such records
- Of (Bashful, Sneezy) and (Sleepy), only Sneezy meets the criteria (ht = 35, wt = 46)
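A self-contained Python sketch of a 2-d tree. For brevity it stores each record at an internal node rather than in leaf buckets, but the dimension rotation and the both-subtree branching during range search follow the scheme above; all names are illustrative:

class KDNode:
    def __init__(self, point, name, axis):
        self.point, self.name, self.axis = point, name, axis
        self.left = self.right = None

def kd_insert(node, point, name, k=2, depth=0):
    """Insert a k-dimensional point, cycling the split axis per level."""
    axis = depth % k
    if node is None:
        return KDNode(point, name, axis)
    if point[axis] <= node.point[axis]:
        node.left = kd_insert(node.left, point, name, k, depth + 1)
    else:
        node.right = kd_insert(node.right, point, name, k, depth + 1)
    return node

def kd_range_search(node, lo, hi, out):
    """Collect every point p with lo[d] <= p[d] <= hi[d] for all d."""
    if node is None:
        return out
    a = node.axis
    if all(l <= p <= h for p, l, h in zip(node.point, lo, hi)):
        out.append(node.name)
    if lo[a] <= node.point[a]:          # Query range overlaps left subtree
        kd_range_search(node.left, lo, hi, out)
    if hi[a] > node.point[a]:           # Query range overlaps right subtree
        kd_range_search(node.right, lo, hi, out)
    return out

# The dwarves example: ht is dimension 0, wt is dimension 1.
root = None
for name, ht, wt in [("Sleepy", 36, 48), ("Happy", 34, 52), ("Doc", 38, 51),
                     ("Dopey", 37, 54), ("Grumpy", 32, 55), ("Sneezy", 35, 46),
                     ("Bashful", 33, 50), ("Ms. White", 65, 98)]:
    root = kd_insert(root, (ht, wt), name)

print(kd_range_search(root, (0, 0), (36, 47), []))  # ['Sneezy']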

k-d Tree Performance
- The k-d tree's index has a critical impact on performance
- The index should subdivide the data stored in the tree in a balanced manner:
  - All buckets at the same, or nearly the same, level in the tree
  - The same, or nearly the same, number of elements in each bucket
- If the data is known a priori, median elements are used to construct the index
- E.g., the Snow White k-d tree is designed for individuals with ht <= 37 and wt <= 55; anything outside this range is forced into one of two buckets
- For dynamic trees, maintaining balance is complicated; adaptive k-d trees exist that try to maintain balance

Hashing
- Hashing is a second major class of efficient search algorithms
- A hash function converts a key k_t into a numeric value h on a fixed range 0 .. n-1
- h is used as the location (address) for k_t within a hash table A of size n
- Analogous to array indexing: we can store and retrieve k_t at A[h]
- If the hash function is O(1), then search, insert, and delete are also O(1)
- Unfortunately, the number of possible keys m is normally much larger than the size of A (m >> n)

Hash Function Requirements
- Because m >> n, h is not identical to an array index
- Three important properties distinguish h from an array index:
  - The hash value for k_t should appear random
  - Hash values should be distributed uniformly over the range 0 .. n-1
  - Two different keys k_s and k_t can hash to the same h, a collision
- Collisions are a major issue, especially if each location in A can hold only one key
- Minimizing collisions is a main area of consideration

Perfect Hashing
- Ideally, choose a hash function that does not produce collisions
- Suppose we store credit cards, using the card number as the key
- For card numbers of the form 0000 0000 0000 0000, there are m = 10^16 possible numbers (keys), or 10 quadrillion
- Clearly, it is not possible to create an in-memory A of size n = 10^16
- Of course, not every possible number is in use, but the numbers in use still span roughly 1 x 10^15 to 9 x 10^15, so an array is still not feasible
- Even for a small number of keys, perfect hashing is very difficult
- E.g., to store m = 4000 keys in an A of size n = 5000, only 1 in 10^120000 functions is perfect

Fold-and-Add Hash Function
A common hash function is fold-and-add:
1. Convert k_t to a numeric sequence
2. Fold and add the numbers, correcting for overflow
3. Divide the result by a prime number and return the remainder as h
E.g., k_t = Subramanian; convert to a numeric sequence by mapping characters to ASCII codes and binding pairs of codes:

S  u   b  r   a  m   a  n   i   a  n
83 117 98 114 97 109 97 110 105 97 110

Assume the largest pair is zz, with combined ASCII code 122122. To manage overflow, divide by a prime number slightly larger than 122122 (here 125299) after each add, and keep the remainder.

Fold-and-Add (cont'd)

S  u   b  r   a  m   a  n   i   a  n
83 117 98 114 97 109 97 110 105 97 110

83117 + 98114 = 181231; 181231 mod 125299 = 55932
55932 + 97109 = 153041; 153041 mod 125299 = 27742
27742 + 97110 = 124852; 124852 mod 125299 = 124852
124852 + 10597 = 135449; 135449 mod 125299 = 10150
10150 + 110 = 10260; 10260 mod 125299 = 10260

Finally, we divide the result by the size of the hash table, which should itself be prime. We choose A of size n = 101, producing a final hash of h = 10260 mod 101 = 59.
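A small Python sketch of this fold-and-add scheme, binding ASCII code pairs by decimal concatenation as in the worked example; the helper name and defaults are illustrative:

def fold_and_add(key, table_size=101, fold_mod=125299):
    """Fold-and-add: pair up the key's ASCII codes by decimal
    concatenation, sum the pairs modulo fold_mod (a prime just above
    the largest pair, 'zz' -> 122122), then reduce by the prime
    table size to produce an address h in 0 .. table_size-1."""
    codes = [ord(c) for c in key]
    total = 0
    for i in range(0, len(codes), 2):
        chunk = codes[i:i + 2]                      # A pair, or a lone final code
        pair = int("".join(str(c) for c in chunk))  # 83, 117 -> 83117
        total = (total + pair) % fold_mod           # Keep remainder after each add
    return total % table_size

print(fold_and_add("Subramanian"))  # 59, matching the worked example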

Hash Value Distributions
Given a hash table of size n holding r records, what is the likelihood that:
- No key hashes to a particular address in the table?
- One key hashes to a particular address?
- Two keys hash to a particular address?
Assume the hash function distributes hash values uniformly. For any key, the probability that it hashes to address k is b, and the probability that it does not hash to address k is a:

b = 1/n,  a = 1 - 1/n    (4.1)

Collision Probability
Given a and b, insert two keys into the hash table and compute the individual cases.
Probability the first key hits an address and the second key misses it:

ba = (1/n)(1 - 1/n) = 1/n - 1/n^2    (4.2)

Probability both keys hit the same address (a collision):

bb = (1/n)(1/n) = 1/n^2

Probability that x of r keys hash to a common address:

C = C(r, x) = r! / (x!(r - x)!)
Pr = C b^x a^(r-x) = C (1/n)^x (1 - 1/n)^(r-x)    (4.3)

Estimated Collision Probability
- Because of the r! in its equation, C is expensive to compute
- The Poisson distribution Pr(x) does a good job of estimating C b^x a^(r-x):

Pr(x) = (r/n)^x e^(-(r/n)) / x!    (4.4)

- Since x is normally small, the x! in the denominator is not an issue

Estimated Collision Example
Consider an extreme case: store r = 1000 keys in a hash table of size n = 1000. Here r/n = 1, and we can use this ratio to calculate Pr(0), Pr(1), Pr(2), ...

Pr(0) = 1^0 e^(-1) / 0! = 0.368
Pr(1) = 1^1 e^(-1) / 1! = 0.368    (4.5)
Pr(2) = 1^2 e^(-1) / 2! = 0.184

Given the hash table size n = 1000, we expect:
- n Pr(0) = 1000 x 0.368 = 368 entries that are empty
- n Pr(1) = 1000 x 0.368 = 368 entries holding 1 key
- n Pr(2) = 1000 x 0.184 = 184 entries that try to hold 2 keys, and so on

Estimating Collision Count
Consider the previous example, r = n = 1000:
- n Pr(0) = 368 entries in the table hold no keys
- n Pr(1) = 368 entries in the table hold 1 key
- 1000 - n Pr(0) - n Pr(1) = 264 entries try to hold more than 1 key
- The first key sent to each of those 264 entries is stored; of the 1000 - 368 = 632 keys sent to them, 264 are accepted and 632 - 264 = 368 keys collide
- This is a collision rate of 36.8%
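A short Python check of the Poisson estimate, reproducing the 368/368/264 split and the 36.8% collision rate computed above:

import math

def poisson(x, lam):
    """Equation 4.4: Pr(x) = lam^x e^(-lam) / x!, with lam = r/n."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

r, n = 1000, 1000
lam = r / n
empty  = n * poisson(0, lam)     # ~368 entries hold no keys
single = n * poisson(1, lam)     # ~368 entries hold one key
multi  = n - empty - single      # ~264 entries are sent more than one key
collisions = r - single - multi  # Each multi entry accepts its first key
print(f"collision rate = {collisions / r:.1%}")  # 36.8%

Re-running with n = 2000 (so lam = 0.5) reproduces the roughly 21% rate derived on the next slide.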

Larger Hash Table
Increase n to 2000, reducing the packing density to r/n = 1000/2000 = 0.5:

n Pr(0) = 2000 x (0.5^0 e^(-0.5) / 0!) = 2000 x 0.607 = 1214
n Pr(1) = 2000 x (0.5^1 e^(-0.5) / 1!) = 2000 x 0.304 = 608

- 2000 - n Pr(0) - n Pr(1) = 178 entries try to hold more than 1 key
- Of the 1000 - 608 = 392 keys sent to those entries, 178 are accepted and 392 - 178 = 214 keys collide, a collision rate of 21.4%

Progressive Overflow
Since collisions cannot be avoided, they must be managed.
Progressive overflow insertion:
- Attempt to insert the key at its hash position h
- If h is already occupied and the table is full, the insertion fails
- Otherwise, walk forward through the hash table until an empty position is found
Progressive overflow deletion:
- Start searching for the key at its hash position h
- If n positions are examined and the key is not found, the deletion fails
- Otherwise, mark the key's position as dirty: empty, but previously occupied

Progressive Insertion

Algorithm progressive_insert(rec, tbl, n)
Input: rec, record to insert; tbl, hash table; n, table size

num = 0                             // Number of insertion attempts
h = hash(rec.key)
while num < n do
    if tbl[h] is empty then
        tbl[h] = rec                // Store record
        break
    else
        h = (h + 1) % n             // Try next table position
        num++
    end
end
return (num == n) ? false : true    // Return status of insert attempt

Progressive Search

Algorithm progressive_search(key, tbl, dirty, n)
Input: key, target key; tbl, hash table; dirty, dirty entry table; n, table size

num = 0                                      // Number of search attempts
h = hash(key)
while num < n do
    if key == tbl[h].key then
        return h                             // Target record found at position h
    else if tbl[h] is empty and !dirty[h] then
        return false                         // Search failed
    else
        h = (h + 1) % n                      // Try next table position
        num++
    end
end
return false                                 // Search failed

Progressive Delete

Algorithm progressive_delete(key, tbl, dirty, n)
Input: key, target key; tbl, hash table; dirty, dirty entry table; n, table size

h = progressive_search(key, tbl, dirty, n)
if h != false then
    tbl[h] = empty                  // Set table position empty
    dirty[h] = true                 // Mark table position dirty
    return true
else
    return false
end

Progressive Overflow Search
- To search, hash the key to h and start searching at position h
- If the record is found, return it
- If the entire table is examined, the search fails
- Empty positions are handled using a dirty bit:
  - If a position is empty and its dirty bit is set:
    a. The position may have been occupied when the record was inserted
    b. And we may have walked forward past it looking for an empty position
    c. So we need to keep searching
  - If a position is empty and its dirty bit is not set:
    a. The position was never occupied, so we could not have walked over it
    b. Therefore, the record is not in the table and the search fails
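Pulling the three routines together, a compact Python sketch of progressive overflow with a dirty-bit table; Python's built-in hash stands in for the slides' hash function:

class ProgressiveHashTable:
    """Open addressing: walk forward from the hash position until an
    empty slot (insert) or the key / a clean empty slot (search) is
    found. Deleted slots are marked dirty so searches walk past them."""
    def __init__(self, n):
        self.n = n
        self.tbl = [None] * n       # (key, record) pairs or None
        self.dirty = [False] * n    # True: empty but once occupied

    def _h(self, key):
        return hash(key) % self.n   # Illustrative stand-in hash function

    def insert(self, key, rec):
        h = self._h(key)
        for _ in range(self.n):
            if self.tbl[h] is None:
                self.tbl[h] = (key, rec)
                return True
            h = (h + 1) % self.n    # Try next table position
        return False                # Table full: insertion fails

    def _find(self, key):
        h = self._h(key)
        for _ in range(self.n):
            if self.tbl[h] is not None and self.tbl[h][0] == key:
                return h            # Target record found
            if self.tbl[h] is None and not self.dirty[h]:
                return -1           # Clean empty slot: key cannot be here
            h = (h + 1) % self.n
        return -1

    def search(self, key):
        h = self._find(key)
        return self.tbl[h][1] if h >= 0 else None

    def delete(self, key):
        h = self._find(key)
        if h < 0:
            return False
        self.tbl[h] = None          # Set table position empty
        self.dirty[h] = True        # Mark table position dirty
        return True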

Progressive Overflow Disadvantages
1. The hash table can become full
   - The hash function uses n, so every record's hash value changes if n changes
   - We must remove all records, increase the table size, and re-insert everything
2. Runs form as records are inserted
   - When multiple records hash to the same h, runs of contiguous records form
   - It is expensive to find a record near the start of a run
3. If runs merge with one another, super-runs form
   - Very long stretches of contiguous records
   - Searching may walk over records that do not share the target record's h
   - If the table is more than 75% full, search deteriorates to O(n)

Multi-Record Buckets
- An alternative approach: reduce collisions by storing more than one record in each hash table entry
- Implement each bucket as an expandable array or linked list
- Insertion and deletion are identical to a simple hash table, but we do not need to worry about exceeding an entry's capacity
- To search for k with hash value h, load bucket A[h] and scan it for k
- With buckets, the packing density of A is r/(bn), where n is the size of A and b is the maximum number of entries in each A[i] position
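A minimal Python sketch of a bucketed (chained) hash table, with each entry an expandable list as described; names are illustrative:

class BucketHashTable:
    """Each table entry is a bucket: a list of (key, record) pairs.
    Colliding keys simply append to the bucket."""
    def __init__(self, n=101):
        self.n = n
        self.table = [[] for _ in range(n)]

    def _h(self, key):
        return hash(key) % self.n

    def insert(self, key, rec):
        bucket = self.table[self._h(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:              # Duplicate key: replace record
                bucket[i] = (key, rec)
                return
        bucket.append((key, rec))     # No capacity limit to worry about

    def search(self, key):
        for k, rec in self.table[self._h(key)]:
            if k == key:
                return rec
        return None

    def delete(self, key):
        bucket = self.table[self._h(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False

t = BucketHashTable(5)                # Tiny table to force collisions
t.insert("alpha", 1); t.insert("beta", 2)
print(t.search("beta"))               # 2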

Single vs. Multi-Record Buckets

r = 700, n = 1000, b = 1; r/n = 700/1000 = 0.7
P(0) = 0.7^0 e^(-0.7) / 0! = 0.497
P(1) = 0.7^1 e^(-0.7) / 1! = 0.348
- 497 entries hold 0 keys
- 347 entries hold 1 key (347 records)
- 155 entries try to hold > 1 keys (352 records)
- 197 collisions: 28.1%

r = 700, n = 500, b = 2; r/n = 700/500 = 1.4
P(0) = 1.4^0 e^(-1.4) / 0! = 0.247
P(1) = 1.4^1 e^(-1.4) / 1! = 0.345
P(2) = 1.4^2 e^(-1.4) / 2! = 0.242
- 124 entries hold 0 keys
- 172 entries hold 1 key (172 records)
- 121 entries hold 2 keys (242 records)
- 83 entries try to hold > 2 keys (286 records)
- 120 collisions: 17.1%
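The comparison generalizes: under the Poisson model, the expected number of colliding keys with bucket capacity b is the expected overflow per entry summed over the table, n x sum over x > b of (x - b) Pr(x). A quick Python check (helper names are illustrative) reproduces both rates:

import math

def poisson(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

def expected_collision_rate(r, n, b, max_x=50):
    """Expected fraction of the r keys that overflow their entry,
    assuming each entry receives Poisson(r/n) keys and holds up to b."""
    lam = r / n
    overflow = sum((x - b) * poisson(x, lam) for x in range(b + 1, max_x))
    return n * overflow / r

print(expected_collision_rate(700, 1000, 1))  # ~0.281 (28.1% above)
print(expected_collision_rate(700, 500, 2))   # ~0.170 (17.1% above)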

Bucket Advantages and Disadvantages
- Simply rearranging 1000 table entries into a two-record-bucket table reduced the collision rate from 28.1% to 17.1%
- Multi-record bucket tables still have disadvantages:
  - If r >> n, buckets become long and search deteriorates to O(n)
  - Checking for duplicate keys, and deletion, also deteriorate to O(n)
  - The table size n still cannot be changed efficiently