Chapter 4: In-Memory Searching
Linear Search, Binary Search, Binary Search Tree, k-d Tree, Hashing, Hash Collisions, Collision Strategies
Searching
A second fundamental operation in Computer Science. We review O(n) linear search and O(lg n) binary search, then discuss more sophisticated approaches. Two techniques, trees and hashing, form the basis for searching very large datasets on disk.
Linear Search
One of the simplest search algorithms: take a collection of n records and scan from start to end, looking for a record with primary key k.
Best case: target is the first record, O(1). Worst case: target is the last record or not in the collection, O(n). Average case: search ~n/2 records to find the target, also O(n).
The purpose of linear search is two-fold: it is simple to implement, used when n is small or searches are rare, and it represents a hard upper bound on acceptable search performance.
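As a baseline, the scan described above is only a few lines in any language. A minimal runnable sketch (Python used here for illustration; the names are assumptions, not from the text):

```python
def linear_search(records, k):
    """Scan from start to end for a record with primary key k."""
    for i, key in enumerate(records):
        if key == k:
            return i          # found; best case (i == 0) is O(1)
    return -1                 # scanned all n records: O(n)

keys = [17, 4, 99, 23, 8]
print(linear_search(keys, 23))   # -> 3
print(linear_search(keys, 5))    # -> -1
```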
Binary Search
If a collection is sorted, we can perform binary search: discard n/2 records from further consideration on the first comparison, an additional n/4 records on the second comparison, and continue until the record is found or no data is left to search.
An algorithm that repeatedly splits the collection in half runs in O(lg n). However, building a sorted collection requires O(n lg n), so binary search maintenance requires O(n lg n).
Recursive Binary Search
Algorithm binary_search(k, A, lf, rt)
Input: k, target key; A, sorted array to search; lf, start of search range; rt, end of search range

n = rt - lf + 1
if n <= 0 then
    return -1                               // Searching empty range
end
c = lf + n / 2                              // Center of search region
if k == A[c] then
    return c                                // Target record found
else if k < A[c] then
    return binary_search(k, A, lf, c - 1)   // Search left half of search region
else
    return binary_search(k, A, c + 1, rt)   // Search right half of search region
end
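The recursive pseudocode translates directly into a short runnable sketch (Python used here for illustration; variable names follow the slide):

```python
def binary_search(k, A, lf, rt):
    """Recursively search sorted array A for key k within A[lf..rt]."""
    n = rt - lf + 1
    if n <= 0:
        return -1                      # searching an empty range
    c = lf + n // 2                    # center of the search region
    if k == A[c]:
        return c                       # target record found
    elif k < A[c]:
        return binary_search(k, A, lf, c - 1)   # search left half
    else:
        return binary_search(k, A, c + 1, rt)   # search right half

A = [3, 7, 11, 19, 23, 31, 42]
print(binary_search(19, A, 0, len(A) - 1))   # -> 3
print(binary_search(5, A, 0, len(A) - 1))    # -> -1 (not present)
```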
Binary Search Tree
Choosing a data structure to implement binary search: a sorted array gives O(lg n) search performance, but adding an element takes O(lg n) to find its position and O(n) to make space. Similarly, deleting an element requires O(n) to fill the hole.
A common alternative is the binary search tree (BST): a tree where each node holds a primary key and references to (up to) two child nodes. All keys in a node's left subtree are smaller than its key; all keys in its right subtree are larger.
BST Search
Algorithm bst_search(k, node)
Input: k, target key; node, node in BST to begin search

if node == null then
    return null                          // Searching empty tree
end
if k == node.key then
    return node                          // Target record found
else if k < node.key then
    return bst_search(k, node.left)      // Search left subtree
else
    return bst_search(k, node.right)     // Search right subtree
end
BST Operations
Insertion. To insert a record with key k, search for k in the BST. If k is found, the record is a duplicate: replace the node's record with the new record. If an empty subtree is found, insert a new node there holding k's record.
Deletion. To delete the record with key k, search for k in the BST.
1. If k is not found, the delete fails
2. If k's node has no children, remove k's node and stop
3. If k's node has one subtree, promote that subtree's root
4. If k's node has two subtrees:
   a. Find the successor to k's node: the smallest key in its right subtree, reached by walking right once, then walking left as far as possible
   b. Remove the successor (since it has an empty left subtree, it must match case 2 or 3 above)
   c. Promote the successor to k's node's position
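The insertion rule and all four deletion cases above can be sketched in runnable form. A minimal Python sketch (names are illustrative; keys stand in for full records):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def bst_insert(node, key):
    """Search for key; insert a new node where an empty subtree is found."""
    if node is None:
        return Node(key)                 # empty subtree: new node here
    if key < node.key:
        node.left = bst_insert(node.left, key)
    elif key > node.key:
        node.right = bst_insert(node.right, key)
    # key == node.key: duplicate; the record would be replaced here
    return node

def bst_delete(node, key):
    """Delete key, handling the no-child, one-child, and two-child cases."""
    if node is None:
        return None                      # key not found: delete fails
    if key < node.key:
        node.left = bst_delete(node.left, key)
    elif key > node.key:
        node.right = bst_delete(node.right, key)
    else:
        if node.left is None:
            return node.right            # no children, or right subtree only
        if node.right is None:
            return node.left             # left subtree only: promote it
        succ = node.right                # two subtrees: walk right once...
        while succ.left is not None:
            succ = succ.left             # ...then left as far as possible
        node.key = succ.key              # promote successor into k's position
        node.right = bst_delete(node.right, succ.key)  # remove successor
    return node

def inorder(node):
    """In-order walk returns keys in sorted order."""
    return inorder(node.left) + [node.key] + inorder(node.right) if node else []

root = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    root = bst_insert(root, k)
root = bst_delete(root, 50)       # two-subtree case: successor 60 promoted
print(inorder(root))              # -> [20, 30, 40, 60, 70, 80]
```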
BST Deletion Cases
(figure: deletion with no children; deletion with one child; deletion with two children)
BST Performance
Search, add, and delete all require an initial search. Add and delete follow the search with a constant-time operation, so they are dominated by search performance.
If the BST is roughly balanced (left subtree height ≈ right subtree height throughout the tree), each comparison discards half of the remaining nodes, so search runs in O(lg n).
If the BST is unbalanced (many empty subtrees at internal levels), search degenerates into linear search, O(n). Inserting nearly sorted data into a BST produces an unbalanced tree. Balance can be maintained with self-correcting trees (e.g., AVL).
k-d Tree
The k-dimensional (k-d) tree is a binary tree that subdivides a collection into ranges over k attributes. Proposed by Jon Louis Bentley in 1975, it is designed to support associative, or multiattribute, searches.
E.g., consider a collection of weather reports with temp and precip. A k-d tree can efficiently support queries like: return all records with temp < 0 and precip > 4cm.
The k-d tree structure is similar to a BST, but we rotate through the k dimensions at each tree level.
k-d Tree Index
The k-d tree structure is similar to a BST, but we rotate through the k dimensions at each tree level. E.g., a 2-d temp + precip tree subdivides by temp on the root level, by precip on the second level, again by temp on the third level, and so on.
Each k-d node contains a key k_c and left and right subtrees. Unlike a BST, records are not stored in internal nodes. The target key k_t determines which subtree to enter: k_t ≤ k_c, left subtree; k_t > k_c, right subtree.
Leaf nodes contain collections of records that satisfy the conditions along the root-to-leaf path.
k-d Tree Example
Suppose we want to use a k-d tree to subdivide dwarves by height ht and weight wt. Snow White and the seven dwarves define the initial tree structure.

Name       ht  wt
Sleepy     36  48
Happy      34  52
Doc        38  51
Dopey      37  54
Grumpy     32  55
Sneezy     35  46
Bashful    33  50
Ms. White  65  98
k-d Tree Index Construction
Construction is identical to a BST, but we rotate between the k = 2 dimensions at each level of the tree.
1. Sleepy is inserted into the root of the tree, which uses ht as its subdivision attribute
2. Happy and Doc are inserted as children of Sleepy
   a. Happy's ht = 34 ≤ 36, inserted left of root
   b. Doc's ht = 38 > 36, inserted right of root
   c. Both Happy and Doc use wt as their subdivision attribute (second level)
3. Dopey is inserted
   a. Dopey's ht = 37 > 36 puts him right of root, at Doc, who subdivides by wt = 51
   b. Dopey's wt = 54 > 51 puts him right of Doc
   c. Dopey rotates to use ht as his subdivision attribute (third level)
4. The remaining dwarves and Snow White are inserted using the identical approach
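The dimension-rotating insertion above can be sketched in a few lines. This Python sketch is an illustrative assumption (class and function names are not from the text): depth 0 splits on ht, depth 1 on wt, depth 2 on ht again, with ties going left to match k_t ≤ k_c:

```python
class KDNode:
    def __init__(self, point):
        self.point = point            # e.g., (ht, wt)
        self.left = None
        self.right = None

def kd_insert(node, point, depth=0, k=2):
    """BST-style insert, rotating the discriminating dimension per level."""
    if node is None:
        return KDNode(point)
    d = depth % k                     # rotate through the k dimensions
    if point[d] <= node.point[d]:     # ties go left (k_t <= k_c)
        node.left = kd_insert(node.left, point, depth + 1, k)
    else:
        node.right = kd_insert(node.right, point, depth + 1, k)
    return node

# First four dwarves in slide order: Sleepy, Happy, Doc, Dopey
dwarves = [(36, 48), (34, 52), (38, 51), (37, 54)]
root = None
for p in dwarves:
    root = kd_insert(root, p)
print(root.right.point)        # Doc (38, 51): ht 38 > 36, right of root
print(root.right.right.point)  # Dopey (37, 54): wt 54 > 51, right of Doc
```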
Snow White k-d Tree
(figure: first three insertions, ht subdivides root, wt subdivides second level; fourth insertion, ht subdivides third level; remaining insertions, leaves hold individual records)
k-d Tree Record Management
The k-d index is used to locate records based on ht and wt. Buckets are placed at each (null) leaf, designed to hold records as they are inserted into the k-d tree.
It is critical to choose index records that represent the distributions of attribute values within the data collection. A poor choice of index records will produce a poor distribution of records. Index records are normally the first records inserted into the k-d tree.
Spatial Interpretation
The k-d index subdivides the k-dimensional space of all possible records into subspaces for each dimension, using (k − 1)-dimensional cutting planes representing entries in the index.
Consider visualizing the ht × wt index. Since k = 2, we subdivide a 2D plane using 1D lines. Each region represents a bucket in the index.
Snow White k-d Tree Index
(figure)
k-d Tree Search
1. Identify all paths whose internal nodes satisfy the target attribute ranges; this may produce multiple paths
2. Perform an in-memory search of each path's buckets for records that match the target criteria
3. Return the records within the buckets that satisfy the search criteria
We want to control the size of the buckets, since re-indexing the collection is expensive. The size of the index dictates the maximum number of buckets, and therefore the average expected bucket size.
k-d Tree Search Example
Search for records with ht ≤ 36 and wt ≤ 47
1. At root, branch left (ht ≤ 36)
2. At next node, branch left again (wt ≤ 49)
3. At next node, branch left and right (ht ≤ 35 and ht > 35 both fall within the target range of ht ≤ 36)
4. Along the right path, reach bucket 3
5. Along the left path, branch left (wt ≤ 50), reaching bucket 1
Bucket 1: ht ≤ 35, wt ≤ 50; bucket 3: 35 < ht ≤ 36, wt ≤ 52. Both buckets may include records with ht ≤ 36 and wt ≤ 47; no other buckets could contain these types of records.
From (Bashful, Sneezy) and (Sleepy), only Sneezy meets the criteria (ht = 35, wt = 46)
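The multi-path walk above can be sketched as a recursive range search. This Python sketch makes a simplifying assumption for brevity: records live in the nodes themselves rather than in leaf buckets, and the dwarf tree is hand-built following the construction slide's insertion order (the shape is an illustration, not the text's exact figure):

```python
class KDNode:
    def __init__(self, point, left=None, right=None):
        self.point = point                  # (ht, wt)
        self.left, self.right = left, right

def kd_range_search(node, lo, hi, depth=0, k=2, out=None):
    """Collect points p with lo[d] <= p[d] <= hi[d] in every dimension d."""
    if out is None:
        out = []
    if node is None:
        return out
    d = depth % k                           # dimension cut at this level
    if lo[d] <= node.point[d]:              # range overlaps left half-space
        kd_range_search(node.left, lo, hi, depth + 1, k, out)
    if hi[d] > node.point[d]:               # range overlaps right half-space
        kd_range_search(node.right, lo, hi, depth + 1, k, out)
    if all(lo[i] <= node.point[i] <= hi[i] for i in range(k)):
        out.append(node.point)              # node itself satisfies all ranges
    return out

# Dwarves inserted in slide order: Sleepy, Happy, Doc, Dopey, Grumpy,
# Sneezy, Bashful, giving this hand-built tree shape.
root = KDNode((36, 48),
              left=KDNode((34, 52),
                          left=KDNode((35, 46), left=KDNode((33, 50))),
                          right=KDNode((32, 55))),
              right=KDNode((38, 51), right=KDNode((37, 54))))

print(kd_range_search(root, (0, 0), (36, 47)))   # only Sneezy: [(35, 46)]
```

Note how the search descends both subtrees whenever the target range spans a node's cut value, mirroring step 3 of the slide's walk.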
k-d Tree Performance
The k-d tree's index has a critical impact on performance. The index should subdivide the data stored in the tree in a balanced manner: all buckets at the same or nearly the same level in the tree, with the same or nearly the same number of elements in each bucket.
If the data is known a priori, median elements can be used to construct the index. E.g., the Snow White k-d tree is designed for individuals with ht ≤ 37 and wt ≤ 55; anything outside this range will be forced into one of two buckets.
For dynamic trees, maintaining balance is complicated. Adaptive k-d trees exist to try to maintain balance.
Hashing
Hashing is a second major class of efficient search algorithms. A hash function converts a key k_t into a numeric value h on a fixed range 0 … n − 1. h is used as the location/address for k_t within a hash table A of size n.
Analogous to array indexing, we can store/retrieve k_t at A[h]. If the hash function is O(1), then search, insert, and delete are also O(1).
Unfortunately, the number of possible keys m is normally much larger than the size n of A (m ≫ n).
Hash Function Requirements
Because m ≫ n, h is not identical to an array index. Three important properties distinguish h from an array index:
The hash value for k_t should appear random.
Hash values should be distributed uniformly over the range 0 … n − 1.
Two different keys k_s and k_t can hash to the same h, a collision.
Collisions are a major issue, especially if each location in A can only hold one key. Minimizing collisions is a main area for consideration.
Perfect Hashing
Ideally, we would choose a hash function that does not produce collisions. Suppose we store credit cards and use the card number as the key. For card numbers of the form 0000 0000 0000 0000, there are m = 10^16 possible numbers (keys), or 10 quadrillion. Clearly, it is not possible to create an in-memory A of size n = 10^16.
Of course, not every possible number is in use, but the numbers do span around 1 × 10^15 to 9 × 10^15, so an array is still not feasible.
Even for a small number of keys, perfect hashing is very difficult. E.g., to store m = 4000 keys in an A of size n = 5000, only 1 in 10^120000 functions is perfect.
Fold-and-Add Hash Function
A common function is fold-and-add:
1. Convert k_t to a numeric sequence
2. Fold and add the numbers, correcting for overflow
3. Divide the result by a prime number and return the remainder as h
For k_t = Subramanian, convert to a numeric sequence by mapping characters to ASCII codes and binding pairs of codes:

S  u   b  r   a  m   a  n   i   a  n
85 117 98 114 97 109 97 110 105 97 110

Assume the largest pair is zz, with a combined ASCII code of 122122. To manage overflow, after each add we divide by a prime number slightly larger than 122122 (here 125299) and keep the remainder.
Fold-and-Add (cont'd)

S  u   b  r   a  m   a  n   i   a  n
85 117 98 114 97 109 97 110 105 97 110

85117 + 98114 = 193231; 193231 mod 125299 = 67932
67932 + 97109 = 165041; 165041 mod 125299 = 39742
39742 + 97110 = 136852; 136852 mod 125299 = 11553
11553 + 10597 = 22150;  22150 mod 125299 = 22150
22150 +   110 = 22260;  22260 mod 125299 = 22260

We divide the final result by the size of the hash table, which should itself be prime. Choosing A of size n = 101 produces a final h of

h = 22260 mod 101 = 40
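The scheme generalizes to any string key. This Python sketch implements fold-and-add as described (function name and parameters are illustrative); note that it uses Python's true ASCII codes via ord(), which differ slightly from the codes printed on the slide, so the resulting h for "Subramanian" differs from the slide's worked value of 40:

```python
def fold_and_add(key, table_size=101, overflow_prime=125299):
    """Fold-and-add hash: concatenate character codes in pairs, sum with
    mod-prime overflow correction, then reduce by the table size."""
    codes = [ord(c) for c in key]
    total = 0
    for i in range(0, len(codes), 2):
        if i + 1 < len(codes):
            # Bind a pair of codes by decimal concatenation, e.g. 97,110 -> 97110
            pair = int(str(codes[i]) + str(codes[i + 1]))
        else:
            pair = codes[i]                      # odd trailing code stands alone
        total = (total + pair) % overflow_prime  # correct for overflow after each add
    return total % table_size                    # final reduction by table size

print(fold_and_add("Subramanian"))   # a table position in 0..100
```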
Hash Value Distributions
Given a hash table of size n holding r records, what is the likelihood that: no key hashes to a particular address in the table? One key hashes to a particular address? Two keys hash to a particular address?
Assume the hash function uniformly distributes hash values. For any key, the probability it hashes to address k is b; the probability it does not hash to address k is a:

b = 1/n,  a = 1 − 1/n   (4.1)
Collision Probability
Given a and b, insert two keys into the hash table and compute the individual cases: the probability the first key hits an address and the second key misses it, and the probability both keys hit the same address (a collision):

ba = (1/n)(1 − 1/n) = 1/n − 1/n^2   (4.2)
bb = (1/n)(1/n) = 1/n^2

The probability that x of r keys hash to a common address is

C = r! / (x!(r − x)!)
Pr = C b^x a^(r−x) = C (1/n)^x (1 − 1/n)^(r−x)   (4.3)
Estimated Collision Probability
Because C contains r!, it is expensive to compute. The Poisson distribution Pr(x) does a good job of estimating our probability C b^x a^(r−x):

Pr(x) = ((r/n)^x e^(−r/n)) / x!   (4.4)

Since x is normally small, the x! in the denominator is not an issue.
Estimated Collision Example
Consider an extreme case: store r = 1000 keys in a hash table of size n = 1000. Here r/n = 1, and we can use this ratio to calculate Pr(0), Pr(1), Pr(2), …

Pr(0) = (1^0 e^(−1)) / 0! = 0.368
Pr(1) = (1^1 e^(−1)) / 1! = 0.368   (4.5)
Pr(2) = (1^2 e^(−1)) / 2! = 0.184

Given hash table size n = 1000, we expect
nPr(0) = 1000 × 0.368 = 368 entries that are empty
nPr(1) = 1000 × 0.368 = 368 entries holding 1 key
nPr(2) = 1000 × 0.184 = 184 entries that try to hold 2 keys, and so on
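Equation 4.4 and the example values above can be reproduced directly (Python used here for illustration):

```python
import math

def poisson(x, r, n):
    """Estimated probability that exactly x of r keys hash to a
    given address in a table of size n (Poisson approximation, Eq. 4.4)."""
    lam = r / n
    return (lam ** x) * math.exp(-lam) / math.factorial(x)

r, n = 1000, 1000
for x in range(3):
    p = poisson(x, r, n)
    print(f"Pr({x}) = {p:.3f}, expected entries = {n * p:.0f}")
# Pr(0) = 0.368, 368 entries; Pr(1) = 0.368, 368; Pr(2) = 0.184, 184
```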
Estimating Collision Count
Consider the previous example with r = n = 1000:
nPr(0) = 368 entries in the table hold no keys
nPr(1) = 368 entries in the table hold 1 key
1000 − nPr(0) − nPr(1) = 264 entries try to hold > 1 keys
The 368 single-key entries store 368 keys, leaving 1000 − 368 = 632 keys aimed at the 264 multi-key entries. Those entries accept their first 264 keys, so 632 − 264 = 368 keys collide: a collision rate of 36.8%.
Larger Hash Table
Increase n to 2000, reducing the packing density to r/n = 1000/2000 = 0.5:

nPr(0) = 2000 × (0.5^0 e^(−0.5)) / 0! = 2000 × 0.607 = 1214
nPr(1) = 2000 × (0.5^1 e^(−0.5)) / 1! = 2000 × 0.304 = 608

2000 − nPr(0) − nPr(1) = 178 entries try to hold > 1 keys
The 608 single-key entries store 608 keys, leaving 1000 − 608 = 392 keys aimed at the 178 multi-key entries. Those entries accept their first 178 keys, so 392 − 178 = 214 keys collide: a collision rate of 21.4%.
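The collision-count arithmetic of the last two slides generalizes to a small estimator (Python, illustrative names; results differ from the slides by a key or two because the slides round at each intermediate step):

```python
import math

def expected_collisions(r, n):
    """Estimate collisions when r keys are hashed into n single-record
    slots, using the Poisson approximation."""
    lam = r / n
    p0 = math.exp(-lam)                 # Pr(0 keys at an address)
    p1 = lam * math.exp(-lam)           # Pr(exactly 1 key)
    multi = n - n * p0 - n * p1         # entries receiving > 1 keys
    inserted = r - n * p1               # keys aimed at those entries
    return inserted - multi             # all but the first key in each collide

print(round(expected_collisions(1000, 1000)))  # ~368, i.e., 36.8% of 1000
print(round(expected_collisions(1000, 2000)))  # ~213, close to the slides' 21.4%
```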
Progressive Overflow
Since collisions cannot be avoided, they must be managed.
Progressive overflow insertion: attempt to insert the key at its hash position h. If h is already occupied and the table is full, the insertion fails. Otherwise, walk forward through the hash table until an empty position is found.
Progressive overflow deletion: start searching for the key at its hash position h. If n positions are examined and the key is not found, the deletion fails. Otherwise, mark the key's position as dirty: empty but previously occupied.
Progressive Insertion
Algorithm progressive_insert(rec, tbl, n)
Input: rec, record to insert; tbl, hash table; n, table size

num = 0                              // Number of insertion attempts
h = hash(rec.key)
while num < n do
    if tbl[h] is empty then
        tbl[h] = rec                 // Store record
        break
    else
        h = (h + 1) % n              // Try next table position
        num++
    end
end
return (num == n) ? false : true     // Return status of insert attempt
Progressive Search
Algorithm progressive_search(key, tbl, dirty, n)
Input: key, target key; tbl, hash table; dirty, dirty entry table; n, table size

num = 0                                      // Number of search attempts
h = hash(key)
while num < n do
    if key == tbl[h].key then
        return h                             // Target record found at position h
    else if tbl[h] is empty and !dirty[h] then
        return false                         // Search failed
    else
        h = (h + 1) % n                      // Try next table position
        num++
    end
end
return false                                 // Search failed
Progressive Delete
Algorithm progressive_delete(key, tbl, dirty, n)
Input: key, target key; tbl, hash table; dirty, dirty entry table; n, table size

h = progressive_search(key, tbl, dirty, n)
if h != false then
    tbl[h] = empty               // Set table position empty
    dirty[h] = true              // Mark table position dirty
else
    return false
end
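The three algorithms combine into one small structure. A runnable Python sketch (class and method names are illustrative assumptions; small-integer keys are used because Python hashes them to themselves, which makes the probe sequence easy to follow):

```python
class ProgressiveHashTable:
    """Open addressing with linear probing (progressive overflow) and
    dirty bits marking empty-but-previously-occupied positions."""

    def __init__(self, n):
        self.n = n
        self.tbl = [None] * n            # None means empty
        self.dirty = [False] * n

    def _hash(self, key):
        return hash(key) % self.n

    def insert(self, key, rec):
        h = self._hash(key)
        for _ in range(self.n):
            if self.tbl[h] is None:
                self.tbl[h] = (key, rec)  # store record
                return True
            h = (h + 1) % self.n          # try next table position
        return False                      # table full: insert fails

    def _find(self, key):
        h = self._hash(key)
        for _ in range(self.n):
            entry = self.tbl[h]
            if entry is not None and entry[0] == key:
                return h                  # target record found
            if entry is None and not self.dirty[h]:
                return None               # never-occupied slot: key not present
            h = (h + 1) % self.n          # occupied or dirty: keep probing
        return None

    def search(self, key):
        h = self._find(key)
        return self.tbl[h][1] if h is not None else None

    def delete(self, key):
        h = self._find(key)
        if h is None:
            return False
        self.tbl[h] = None                # set table position empty
        self.dirty[h] = True              # mark table position dirty
        return True

t = ProgressiveHashTable(7)
t.insert(3, "a")     # lands at position 3
t.insert(10, "b")    # 10 % 7 == 3: collides, walks forward to position 4
t.delete(3)          # position 3 becomes empty-but-dirty
print(t.search(10))  # "b": the dirty bit keeps the probe walking past slot 3
```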
Progressive Overflow Search
To search, hash the key to h and start searching at position h. If the record is found, return it; if the entire table is examined, the search fails.
Empty positions are handled using the dirty bit.
If a position is empty and its dirty bit is set:
a. The position may have been occupied when the record was inserted
b. And we may have walked forward past it looking for an empty position
c. So we need to keep searching
If a position is empty and its dirty bit is not set:
a. The position was never occupied, so we could not have walked over it
b. Therefore, the record is not in the table and the search fails
Progressive Overflow Disadvantages
1. The hash table can become full. The hash function uses n, so every record's hash value changes if n changes: we must remove all records, increase the size, and re-insert all records
2. Runs form as records are inserted. When multiple records hash to the same h, runs of contiguous records form, and it is expensive to find a record that hashes near the start of a run
3. If runs merge with one another, super-runs form: very long stretches of contiguous records. Searching may walk over records that do not share the target record's h
If the table is more than 75% full, search deteriorates to O(n).
Multi-Record Buckets
An alternative approach reduces collisions by storing more than one record in each hash table entry, implementing each bucket as an expandable array or linked list.
Insertion and deletion are identical to a simple hash table, but we do not need to worry about exceeding a position's capacity. To search for k with hash value h, load bucket A[h] and scan it for k.
If we use buckets, the packing density of A is r / (bn), where n is the size of A and b is the maximum number of entries in each A[i] position.
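A bucketed table is, if anything, simpler than progressive overflow, since a collision just extends a bucket. A minimal Python sketch (illustrative names; buckets are expandable lists):

```python
class BucketHashTable:
    """Hash table whose entries are buckets that can hold multiple records."""

    def __init__(self, n):
        self.n = n
        self.tbl = [[] for _ in range(n)]   # one expandable bucket per entry

    def insert(self, key, rec):
        # No probing: colliding keys simply share a bucket
        self.tbl[hash(key) % self.n].append((key, rec))

    def search(self, key):
        bucket = self.tbl[hash(key) % self.n]   # load bucket A[h]...
        for k, rec in bucket:                   # ...then scan it for the key
            if k == key:
                return rec
        return None

t = BucketHashTable(5)
t.insert(1, "x")
t.insert(6, "y")          # 6 % 5 == 1: same bucket, no probing needed
print(t.search(6))        # "y"
print(len(t.tbl[1]))      # bucket at entry 1 holds both records: 2
```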
Single vs. Multi-Record Buckets

r = 700, n = 1000, b = 1: r/n = 700/1000 = 0.7
P(0) = (0.7^0 e^(−0.7)) / 0! = 0.497
P(1) = (0.7^1 e^(−0.7)) / 1! = 0.348
497 entries hold 0 keys; 348 entries hold 1 key (348 recs); 155 entries try to hold > 1 keys (352 recs aimed at them, 155 accepted): 197 collisions, 28.1%

r = 700, n = 500, b = 2: r/n = 700/500 = 1.4
P(0) = (1.4^0 e^(−1.4)) / 0! = 0.247
P(1) = (1.4^1 e^(−1.4)) / 1! = 0.345
P(2) = (1.4^2 e^(−1.4)) / 2! = 0.242
124 entries hold 0 keys; 172 entries hold 1 key (172 recs); 121 entries hold 2 keys (242 recs); 83 entries try to hold > 2 keys (286 recs aimed at them, 166 accepted): 120 collisions, 17.1%
Bucket Advantages and Disadvantages
Simply rearranging 1000 single-record entries into a 500-entry table of two-record buckets reduced the collision rate from 28.1% to 17.1%.
Multi-bucket tables still have disadvantages. If r ≫ n, buckets become long and search deteriorates to O(n); checking for duplicate keys and deletion also deteriorate to O(n). And the size of the table n still cannot be efficiently changed.