In-Memory Searching. Linear Search. Binary Search. Binary Search Tree. k-d Tree. Hashing. Hash Collisions. Collision Strategies.


Lydia Cobb
6 years ago

1 In-Memory Searching Linear Search Binary Search Binary Search Tree k-d Tree Hashing Hash Collisions Collision Strategies Chapter 4

2 Searching
A second fundamental operation in Computer Science.
We review O(n) linear search and O(lg n) binary search.
We next discuss more sophisticated approaches.
Two techniques form the basis for very large dataset search on disk:
  Trees
  Hashing

3 Linear Search
One of the simplest search algorithms.
Take a collection of n records.
Scan from start to end, looking for a record with primary key k.
Best case: target is the first record, O(1).
Worst case: target is the last record or not in the collection, O(n).
Average case: search about n/2 records to find the target, also O(n).
Purpose of linear search is two-fold:
  Simple to implement, used when n is small or search is rare
  Represents a hard upper bound on acceptable search performance
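The scan described above can be sketched in a few lines of Python. This is a minimal illustration, assuming records are dictionaries with a hypothetical "key" field; the record layout is not specified in the slides.

```python
from typing import Optional, Sequence

def linear_search(records: Sequence[dict], k) -> Optional[int]:
    """Return the index of the first record whose 'key' equals k, or None."""
    for i, rec in enumerate(records):   # scan from start to end
        if rec["key"] == k:
            return i                    # best case: found at index 0
    return None                         # worst case: all n records examined

# Hypothetical sample collection
records = [{"key": 17, "name": "a"}, {"key": 4, "name": "b"}, {"key": 42, "name": "c"}]
```

A call such as linear_search(records, 42) examines all three records before returning, while a search for a missing key also examines all n records.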

4 Binary Search
If a collection is sorted, we can perform binary search.
Discards n/2 records from further consideration on the first comparison.
Discards an additional n/4 records from further consideration on the second comparison.
Continues until the record is found or no data is left to search.
An algorithm that splits a collection in half repeatedly runs in O(lg n).
However, building a sorted collection requires O(n lg n), so binary search maintenance requires O(n lg n).

5 Recursive Binary Search Algorithm
binary_search(k, A, lf, rt)
Input: k, target key; A, sorted array to search; lf, start of search range; rt, end of search range

    n = rt - lf + 1
    if n <= 0 then
        return -1                                // Searching empty range
    end
    c = lf + floor(n / 2)                        // Center of search region
    if k == A[c] then
        return c                                 // Target record found
    else if k < A[c] then
        return binary_search(k, A, lf, c - 1)    // Search left half of search region
    else
        return binary_search(k, A, c + 1, rt)    // Search right half of search region
    end
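A runnable Python version of the recursive algorithm might look like the sketch below. The default arguments for lf and rt are a convenience added here and are not part of the slide's signature.

```python
def binary_search(k, A, lf=0, rt=None):
    """Return the index of k in sorted list A, or -1 if k is absent."""
    if rt is None:                # convenience default: search the whole list
        rt = len(A) - 1
    n = rt - lf + 1               # number of records in the search range
    if n <= 0:
        return -1                 # searching an empty range
    c = lf + n // 2               # center of the search region
    if k == A[c]:
        return c                  # target record found
    elif k < A[c]:
        return binary_search(k, A, lf, c - 1)   # search left half
    else:
        return binary_search(k, A, c + 1, rt)   # search right half

A = [2, 3, 5, 7, 11, 13]
```

Each call discards about half of the remaining range, which is where the O(lg n) bound comes from.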

6 Binary Search Tree
Choose a data structure to implement binary search.
Sorted array: O(lg n) search performance.
  Adding an element: O(lg n) to find the position, O(n) to make space
  Similarly, deleting an element requires O(n) to fill the hole
Common alternative: binary search tree (BST).
  A tree of nodes; each node holds a primary key and references to (up to) two child nodes
  All keys in the left subtree are smaller than the parent key
  All keys in the right subtree are larger than the parent key

7 BST Search Algorithm
bst_search(k, node)
Input: k, target key; node, node in BST to begin search

    if node == null then
        return null                          // Searching empty tree
    end
    if k == node.key then
        return node                          // Target record found
    else if k < node.key then
        return bst_search(k, node.left)      // Search left subtree
    else
        return bst_search(k, node.right)     // Search right subtree
    end

8 BST Operations
Insertion. To insert a record with key k, search for k in the BST.
  If k is found, the record is a duplicate; replace the node with the new record
  If an empty subtree is found, insert a new node holding k's record
Deletion. To delete a record with key k, search for k in the BST.
  1. If k is not found, delete fails
  2. If k's node has no children, remove k's node, stop
  3. If k's node has one subtree, promote the subtree's root
  4. If k's node has two subtrees:
     a. Find the successor to k's node, the smallest key value in the right subtree, by walking right once, then walking left as far as possible
     b. Remove the successor (since it has an empty left subtree, it must match case 2 or 3 above)
     c. Promote the successor to k's node's position

9 BST Deletion Cases
[Figure: three panels showing deletion with no children, deletion with one child, and deletion with two children]

10 BST Performance
Search, add, and delete all require an initial search.
Add and delete perform a search followed by a constant-time operation, so they are dominated by search performance.
BST roughly balanced: left subtree height ≈ right subtree height throughout the tree.
  Each comparison discards half of the remaining nodes, so search runs in O(lg n)
BST unbalanced: many empty subtrees at internal levels.
  Search degenerates into linear search, O(n)
Inserting nearly sorted data into a BST produces an unbalanced tree.
Balance can be maintained with self-balancing trees (e.g., AVL).
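To make the balanced versus unbalanced contrast concrete, the sketch below inserts the same 255 keys into a plain (non-self-balancing) BST twice, once in sorted order and once shuffled, and records the resulting heights. The list-based node layout, the helper name build_bst, and the random seed are all assumptions made for this illustration.

```python
import random

def build_bst(keys):
    """Insert keys in the given order into a plain BST; return (root, height).

    Nodes are lists [key, left, right]. Insertion is iterative so that a
    degenerate tree does not hit Python's recursion limit.
    """
    root, height = None, 0
    for k in keys:
        if root is None:
            root, height = [k, None, None], 1
            continue
        node, depth = root, 1
        while k != node[0]:                   # duplicates are ignored
            idx = 1 if k < node[0] else 2     # 1 = left child, 2 = right child
            if node[idx] is None:
                node[idx] = [k, None, None]
                height = max(height, depth + 1)
                break
            node, depth = node[idx], depth + 1
    return root, height

keys = list(range(1, 256))                    # 255 keys
_, sorted_height = build_bst(keys)            # sorted input: every node has only a right child
random.seed(0)                                # arbitrary seed, for repeatability only
shuffled = keys[:]
random.shuffle(shuffled)
_, shuffled_height = build_bst(shuffled)
print(sorted_height, shuffled_height)
```

With sorted input the height equals the number of keys, so a search walks a linked list. A perfectly balanced tree over 255 keys would have height 8, and shuffled input typically lands much closer to that than to 255.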

11 k-d Tree
A k-dimensional (k-d) tree is a binary tree that subdivides a collection into ranges for k attributes.
Proposed by Jon Louis Bentley in 1975.
Designed to support associative, or multiattribute, searches.
E.g., consider a collection of weather reports with temp and precip.
A k-d tree can efficiently support queries like: return all records with temp < 0 and precip > 4cm.

12 k-d Tree Index
k-d tree structure is similar to a BST, but we rotate through the k dimensions at each tree level.
E.g., a 2-d temp + precip tree subdivides by temp on the root level, by precip on the second level, again by temp on the third level, and so on.
Each k-d node contains a key k_c and left and right subtrees.
Unlike a BST, records are not stored in internal nodes.
Target key k_t is used to determine which subtree to enter:
  k_t ≤ k_c, left subtree; k_t > k_c, right subtree
Leaf nodes contain collections of records that satisfy the conditions along the root-to-leaf path.

13 k-d Tree Example
Suppose we want to use a k-d tree to subdivide dwarves by height ht and weight wt.
Snow White and the seven dwarves define the initial tree structure.
Table columns: Name, ht, wt. Rows: Sleepy, Happy, Doc, Dopey, Grumpy, Sneezy, Bashful, Ms. White.

14 k-d Tree Index Construction
Construction is identical to a BST, but we rotate between the k = 2 dimensions at each level of the tree.
  1. Sleepy is inserted into the root of the tree, which uses ht as its subdivision attribute
  2. Happy and Doc are inserted as children of Sleepy
     a. Happy's ht = 34 ≤ 36, inserted left of root
     b. Doc's ht = 38 > 36, inserted right of root
     c. Both Happy and Doc use wt as their subdivision attribute (second level)
  3. Dopey is inserted
     a. Dopey's ht = 37 > 36 puts him right of the root
     b. Dopey's wt = 54 > 51 puts him right of Doc
     c. Dopey rotates to use ht as his subdivision attribute (third level)
  4. The remaining dwarves and Snow White are inserted using an identical approach
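The construction steps can be sketched in Python as below. The class and function names are illustrative. The ht values 36, 34, 38, 37 and the wt values 51 and 54 appear in the text above; the wt values used for the first two points are placeholders chosen only so the example runs.

```python
class KDNode:
    def __init__(self, point, depth):
        self.point = point
        self.axis = depth % len(point)    # rotate through the k dimensions by level
        self.left = None
        self.right = None

def kd_insert(root, point):
    """Insert point into the k-d tree rooted at root; return the root."""
    if root is None:
        return KDNode(point, 0)
    node, depth = root, 0
    while True:
        # k_t <= k_c goes left, k_t > k_c goes right, on this level's attribute
        go_left = point[node.axis] <= node.point[node.axis]
        child = node.left if go_left else node.right
        if child is None:
            new_node = KDNode(point, depth + 1)
            if go_left:
                node.left = new_node
            else:
                node.right = new_node
            return root
        node, depth = child, depth + 1

# (ht, wt) pairs; the wt values 48 and 52 are placeholders, not taken from the text
root = None
for p in [(36, 48), (34, 52), (38, 51), (37, 54)]:
    root = kd_insert(root, p)
```

After these four insertions the root subdivides on ht (axis 0), its two children subdivide on wt (axis 1), and the fourth point sits to the right of (38, 51) and subdivides on ht again.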

15 Snow White k-d Tree
[Figure: three panels showing (1) the first three insertions, where ht subdivides the root and wt subdivides the second level; (2) the fourth insertion, where ht subdivides the third level; (3) the remaining insertions, where leaves hold individual records]

16 k-d Tree Record Management
The k-d index is used to locate records based on ht and wt.
Buckets are placed at each (null) leaf.
  Designed to hold records as they are inserted into the k-d tree
It is critical to choose index records that represent the distributions of attribute values within the data collection.
  A poor choice of index records will produce a poor distribution of records
  Index records are normally the first records inserted into the k-d tree

17 Spatial Interpretation
The k-d index subdivides the k-dimensional space of all possible records into subspaces for each dimension.
It does so using (k − 1)-dimensional cutting planes representing entries in the index.
Consider visualizing the ht-wt index.
  Since k = 2, we subdivide a 2D plane using 1D lines
  Each region represents a bucket in the index

18 Snow White k-d Tree Index
[Figure: the ht-wt plane subdivided by the Snow White k-d tree index]

19 k-d Tree Search
  1. Identify all paths whose internal nodes satisfy the target attribute ranges; this may produce multiple paths
  2. Perform an in-memory search of each path's bucket for records that match the target criteria
  3. Return the records within the buckets that satisfy the search criteria
We want to control the size of buckets.
  Re-indexing a collection is expensive
  The size of the index dictates the maximum number of buckets, and therefore the average expected bucket size

20 k-d Tree Search Example
Search for records with ht ≤ 36 and wt ≤ 47.
  1. At the root, branch left (ht ≤ 36)
  2. At the next node, branch left again (wt ≤ 49)
  3. At the next node, branch left and right (ht ≤ 35 and ht > 35 both fall within the target range of ht ≤ 36)
  4. Along the right path, reach bucket 3
  5. Along the left path, branch left (wt ≤ 50), reaching bucket 1
Bucket 1: ht ≤ 35, wt ≤ 50; bucket 3: 35 < ht ≤ 36, wt ≤ 52.
Both buckets may include records with ht ≤ 36 and wt ≤ 47.
No other buckets could contain these types of records.
From (Bashful, Sneezy) and (Sleepy), only Sneezy meets the criteria (ht = 35, wt = 46).
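The branch-left, branch-right, or branch-both decision used in the search can be sketched as a range query. This is a simplified variant that stores each point directly in a tree node instead of in leaf buckets, so it is not the exact structure described in the slides. The sample points are hypothetical, apart from the (35, 46) record mentioned above.

```python
import math

def kd_insert(node, point, depth=0):
    """Insert point into a dict-based k-d tree and return the subtree root."""
    if node is None:
        return {"point": point, "axis": depth % len(point), "left": None, "right": None}
    side = "left" if point[node["axis"]] <= node["point"][node["axis"]] else "right"
    node[side] = kd_insert(node[side], point, depth + 1)
    return node

def kd_range(node, lo, hi, out):
    """Append to out every stored point p with lo[d] <= p[d] <= hi[d] for all d."""
    if node is None:
        return
    p, axis = node["point"], node["axis"]
    if all(lo[d] <= p[d] <= hi[d] for d in range(len(p))):
        out.append(p)
    if lo[axis] <= p[axis]:       # left subtree holds values <= p[axis]
        kd_range(node["left"], lo, hi, out)
    if hi[axis] > p[axis]:        # right subtree holds values > p[axis]
        kd_range(node["right"], lo, hi, out)

# Hypothetical (ht, wt) sample data
points = [(36, 48), (34, 52), (38, 51), (37, 54), (32, 55), (35, 46), (33, 50), (65, 98)]
root = None
for p in points:
    root = kd_insert(root, p)

matches = []
kd_range(root, lo=(-math.inf, -math.inf), hi=(36, 47), out=matches)   # ht <= 36 and wt <= 47
```

When the query range straddles a node's cutting value, both subtrees are searched, which is why a single query can follow multiple root-to-leaf paths.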

21 k-d Tree Performance
A k-d tree's index has a critical impact on performance.
The index should subdivide the data stored in the tree in a balanced manner.
  All buckets at the same or nearly the same level in the tree
  The same or nearly the same number of elements in each bucket
If the data is known a priori, median elements are used to construct the index.
  E.g., the Snow White k-d tree is designed for individuals with ht ≤ 37 and wt ≤ 55
  Anything outside this range will be forced into one of two buckets
For dynamic trees, maintaining balance is complicated.
  Adaptive k-d trees exist to try to maintain balance

22 Hashing
Hashing is a second major class of efficient search algorithms.
A hash function converts key k_t into a numeric value h on a fixed range 0 … n − 1.
h is used as the location/address for k_t within a hash table A of size n.
Analogous to array indexing, we can store/retrieve k_t at A[h].
If the hash function is O(1), then search, insert, and delete are also O(1).
Unfortunately, the number of possible keys m is normally much larger than the size n of A (m ≫ n).

23 Hash Function Requirements
Because m ≫ n, h is not identical to an array index.
Three important properties distinguish h from an array index:
  The hash value for k_t should appear random
  Hash values should be distributed uniformly over the range 0 … n − 1
  Two different keys k_s and k_t can hash to the same h, a collision
Collisions are a major issue, especially if each location in A can only hold one key.
Minimizing collisions is a main area for consideration.

24 Perfect Hashing
Choose a hash function that does not produce collisions.
Suppose we store credit cards and use the card number as the key.
  For 16-digit card numbers, m = 10^16 possible numbers (keys), or 10 quadrillion
  Clearly, it is not possible to create an in-memory A of size n = 10^16
Of course, not every possible number is in use.
  Numbers in use still span a very wide range, so an array is still not feasible
Even for a small number of keys, perfect hashing is very difficult.
  E.g., to store m = 4000 keys in A of size n = 5000, only a vanishingly small fraction of hash functions is perfect

25 Fold-and-Add Hash Function
A common function is fold-and-add:
  1. Convert k_t to a numeric sequence
  2. Fold and add the numbers, correcting for overflow
  3. Divide the result by a prime number, return the remainder as h
Let k_t = Subramanian. Convert it to a numeric sequence by mapping characters to ASCII codes, then binding pairs of codes:
  S u b r a m a n i a n → 83 117 98 114 97 109 97 110 105 97 110
  Pairs: (S u) (b r) (a m) (a n) (i a) (n)
Assume the largest pair is zz, with a combined ASCII code of 122122.
To manage overflow, divide by a prime number slightly larger than 122122 after each add, and keep the remainder.

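26 Fold-and-Add (cont'd)
Starting from the first pair of S u b r a m a n i a n, add each following pair to the running total, taking the result mod the overflow prime after each add (five add-and-mod steps in all).
We divide the final result by the size of the hash table and keep the remainder.
Hash table size should itself be prime.
We choose A of size n = 101, producing a final h = (folded total) mod 101 = 40.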
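A minimal Python sketch of fold-and-add is given below. Two details are assumptions, since the slide's worked numbers did not survive: pairs are bound by concatenating the decimal ASCII codes (so "S", "u" becomes 83117), and the overflow prime is taken to be the first prime above 122122, found by a hypothetical helper next_prime. Because the slide's actual prime is not known, this sketch is not guaranteed to reproduce h = 40.

```python
def next_prime(m):
    """Smallest prime strictly greater than m (trial division; fine for small m)."""
    def is_prime(v):
        if v < 2:
            return False
        i = 2
        while i * i <= v:
            if v % i == 0:
                return False
            i += 1
        return True
    m += 1
    while not is_prime(m):
        m += 1
    return m

def fold_and_add(key, table_size=101):
    """Hash a string key to 0 .. table_size - 1 using fold-and-add."""
    codes = [str(ord(ch)) for ch in key]                  # 1. numeric sequence
    # Bind adjacent pairs of codes (assumed: decimal concatenation)
    chunks = [int("".join(codes[i:i + 2])) for i in range(0, len(codes), 2)]
    largest_pair = int(str(ord("z")) * 2)                 # "zz" -> 122122
    overflow_prime = next_prime(largest_pair)             # assumed choice of prime
    total = 0
    for c in chunks:                                      # 2. fold and add
        total = (total + c) % overflow_prime              #    correct for overflow
    return total % table_size                             # 3. remainder is h

h = fold_and_add("Subramanian")
```

For a short key such as "ab", the single pair is 9798 and 9798 mod 101 = 1, which makes the final step easy to check by hand.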

27 Hash Value Distributions
Given a hash table of size n holding r records, what is the likelihood that:
  No key hashes to a particular address in the table?
  One key hashes to a particular address?
  Two keys hash to a particular address?
Assume the hash function uniformly distributes hash values.
  For any key, the probability it hashes to a given address is b
  For any key, the probability it does not hash to that address is a

$$ b = \frac{1}{n}, \qquad a = 1 - \frac{1}{n} \tag{4.1} $$

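28 Collision Probability
Given a and b, insert two keys into the hash table and compute the individual cases.
  Probability the first key hits an address and the second key misses it: ba
  Probability both keys hit the same address (collision): bb

$$ ba = \frac{1}{n}\left(1 - \frac{1}{n}\right) = \frac{1}{n} - \frac{1}{n^2}, \qquad bb = \frac{1}{n} \cdot \frac{1}{n} = \frac{1}{n^2} \tag{4.2} $$

Probability that x of r keys hash to a common address:

$$ C = \binom{r}{x} = \frac{r!}{x!\,(r-x)!}, \qquad \Pr(x) = C\, b^{x} a^{r-x} = C \left(\frac{1}{n}\right)^{x} \left(1 - \frac{1}{n}\right)^{r-x} \tag{4.3} $$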

29 Estimated Collision Probability
Since r! appears in its equation, C is expensive to compute.
The Poisson distribution Pr(x) does a good job of estimating our probability C b^x a^(r−x):

$$ \Pr(x) = \frac{(r/n)^{x}\, e^{-(r/n)}}{x!} \tag{4.4} $$

Since x is normally small, the x! in the denominator is not an issue.

30 Estimated Collision Example
Consider an extreme case: store r = 1000 keys in a hash table of size n = 1000.
Here r/n = 1, and we can use this ratio to calculate Pr(0), Pr(1), Pr(2):

$$ \Pr(0) = \frac{1^{0} e^{-1}}{0!} = 0.368, \qquad \Pr(1) = \frac{1^{1} e^{-1}}{1!} = 0.368, \qquad \Pr(2) = \frac{1^{2} e^{-1}}{2!} = 0.184 \tag{4.5} $$

Given hash table size n = 1000, we expect:
  nPr(0) = 1000 × 0.368 = 368 entries that are empty
  nPr(1) = 1000 × 0.368 = 368 entries holding 1 key
  nPr(2) = 1000 × 0.184 = 184 entries that try to hold 2 keys, and so on
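The quality of the Poisson estimate can be checked numerically. The sketch below compares the exact binomial probability C b^x a^(r−x) against the Poisson approximation for r = n = 1000; the function names are illustrative.

```python
from math import comb, exp, factorial

def exact_pr(x, r, n):
    """Exact probability that x of r keys hash to one given address."""
    b = 1 / n            # probability a key hits the address
    a = 1 - b            # probability a key misses the address
    return comb(r, x) * b**x * a**(r - x)

def poisson_pr(x, r, n):
    """Poisson estimate of the same probability."""
    lam = r / n          # packing ratio r/n
    return lam**x * exp(-lam) / factorial(x)

r = n = 1000
for x in range(3):
    print(x, round(exact_pr(x, r, n), 4), round(poisson_pr(x, r, n), 4),
          round(n * poisson_pr(x, r, n)))   # expected number of entries holding x keys
```

For these parameters the two columns agree to about three decimal places, and the expected entry counts round to 368, 368, and 184.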

31 Estimating Collision Count
Consider the previous example, r = n = 1000.
  nPr(0) = 368 entries in the table hold no keys
  nPr(1) = 368 entries in the table hold 1 key
  1000 − nPr(0) − nPr(1) = 264 entries try to hold > 1 keys
Of the 1000 − 368 = 632 keys that hash to these 264 entries, the first 264 keys are stored and 632 − 264 = 368 keys collide.
Collision rate of 36.8%.
Summary:
  368 entries, 0 keys
  368 entries, 1 key: 368 keys inserted
  264 entries, > 1 keys: 1000 − 368 = 632 keys inserted, 264 keys accepted, 632 − 264 = 368 keys collide

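32 Larger Hash Table
Increase n to 2000, reducing the packing rate to r/n = 1000/2000 = 0.5.

$$ n\Pr(0) = 2000 \cdot \frac{0.5^{0} e^{-0.5}}{0!} \approx 2000 \times 0.607 = 1214, \qquad n\Pr(1) = 2000 \cdot \frac{0.5^{1} e^{-0.5}}{1!} \approx 2000 \times 0.304 = 608 $$

2000 − nPr(0) − nPr(1) = 178 entries try to hold > 1 keys.
178 keys are stored, 214 keys collide, for a collision rate of 21.4%.
Summary:
  1214 entries, 0 keys
  608 entries, 1 key: 608 keys inserted
  178 entries, > 1 keys: 1000 − 608 = 392 keys inserted, 178 keys accepted, 392 − 178 = 214 keys collide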
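The collision-count estimate can be sketched as a small Python function. It follows the same reasoning as above (empty entries, single-key entries, and entries that receive more than one key) but works with unrounded Poisson values, so its results differ slightly from the rounded slide figures. The function name and return layout are choices made for this illustration.

```python
from math import exp

def estimate_collisions(r, n):
    """Estimate (empty, single, multi, collision_rate) for r keys in n single-key entries."""
    lam = r / n
    empty = n * exp(-lam)                # n * Pr(0): entries holding no keys
    single = n * lam * exp(-lam)         # n * Pr(1): entries holding exactly 1 key
    multi = n - empty - single           # entries that try to hold > 1 keys
    colliding = (r - single) - multi     # each multi entry accepts only 1 key
    return empty, single, multi, colliding / r

for n in (1000, 2000):
    empty, single, multi, rate = estimate_collisions(1000, n)
    print(n, round(empty), round(single), round(multi), f"{rate:.1%}")
```

Doubling the table size roughly halves the packing rate and cuts the estimated collision rate from about 37% to about 21%, at the cost of leaving well over half of the entries empty.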

33 Progressive Overflow
Since collisions cannot be avoided, they must be managed.
Progressive overflow insertion:
  Attempt to insert the key at its hash position h
  If h is already occupied and the table is full, insertion fails
  Otherwise, walk forward through the hash table until an empty position is found
Progressive overflow deletion:
  Start searching for the key at its hash position h
  If n positions are examined and the key is not found, deletion fails
  Otherwise, mark the key's position as dirty: empty but previously occupied

34 Progressive Insertion Algorithm
progressive_insert(rec, tbl, n)
Input: rec, record to insert; tbl, hash table; n, table size

    num = 0                              // Number of insertion attempts
    h = hash(rec.key)
    while num < n do
        if tbl[h] is empty then
            tbl[h] = rec                 // Store record
            break
        else
            h = (h + 1) % n              // Try next table position
            num++
        end
    end
    return (num == n) ? false : true     // Return status of insert attempt

35 Progressive Search Algorithm
progressive_search(key, tbl, dirty, n)
Input: key, target key; tbl, hash table; dirty, dirty entry table; n, table size

    num = 0                                          // Number of positions examined
    h = hash(key)
    while num < n do
        if tbl[h] is not empty and key == tbl[h].key then
            return tbl[h]                            // Target record found
        else if tbl[h] is empty and !dirty[h] then
            return false                             // Search failed
        else
            h = (h + 1) % n                          // Try next table position
            num++
        end
    end
    return false                                     // Search failed

36 Progressive Delete Algorithm
progressive_delete(key, tbl, dirty, n)
Input: key, target key; tbl, hash table; dirty, dirty entry table; n, table size

    h = progressive_search(key, tbl, dirty, n)   // Used here as the table position of the match, or false
    if h != false then
        tbl[h] = empty                           // Set table position empty
        dirty[h] = true                          // Mark table position dirty
        return true
    else
        return false
    end

37 Progressive Overflow Search
To search, hash the key to h and start the search at position h.
  If the record is found, return it
  If the entire table is examined, the search fails
Empty positions are handled using the dirty bit.
If a position is empty and its dirty bit is set:
  a. The position may have been occupied when the record was inserted
  b. And we may have walked forward looking for an empty position
  c. So we need to keep searching
If a position is empty and its dirty bit is not set:
  a. The position was never occupied, so we could not have walked over it
  b. Therefore, the record is not in the table and the search fails
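The insert, search, and delete rules for progressive overflow can be combined into one small Python class. This is a sketch under stated assumptions: the class and method names are illustrative, Python's built-in hash stands in for the hash function, None marks an empty position, and find returns a table position (or -1) so that delete can reuse it.

```python
class ProgressiveTable:
    def __init__(self, n):
        self.n = n
        self.tbl = [None] * n        # None marks an empty position
        self.dirty = [False] * n     # True: empty now, but previously occupied

    def _home(self, key):
        return hash(key) % self.n    # stand-in hash function (assumption)

    def insert(self, key, value):
        """Store (key, value) at the first empty position from the home address."""
        h = self._home(key)
        for _ in range(self.n):      # at most n insertion attempts
            if self.tbl[h] is None:
                self.tbl[h] = (key, value)
                return True
            h = (h + 1) % self.n     # try next table position
        return False                 # table is full

    def find(self, key):
        """Return the position holding key, or -1 if key is not in the table."""
        h = self._home(key)
        for _ in range(self.n):
            slot = self.tbl[h]
            if slot is not None and slot[0] == key:
                return h             # target record found
            if slot is None and not self.dirty[h]:
                return -1            # never occupied: key cannot lie beyond here
            h = (h + 1) % self.n     # occupied or dirty: keep walking
        return -1                    # entire table examined

    def delete(self, key):
        h = self.find(key)
        if h < 0:
            return False
        self.tbl[h] = None           # set table position empty
        self.dirty[h] = True         # mark table position dirty
        return True

t = ProgressiveTable(5)
for k in (1, 6, 11):                 # all three keys share home address 1
    t.insert(k, f"rec{k}")
t.delete(6)                          # leaves a dirty hole at position 2
```

After the delete, a search for key 11 must walk over the dirty hole at position 2 to reach position 3; without the dirty flag the search would stop at the hole and wrongly report that 11 is absent.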

38 Progressive Overflow Disadvantages
  1. The hash table can become full
     The hash function uses n, so every record's hash value changes if n changes
     Must remove all records, increase the size, and re-insert all records
  2. Runs form as records are inserted
     Multiple records hash to the same h, so runs of contiguous records form
     Expensive to find a record near the start of the run
  3. If runs merge with one another, super-runs form
     Very long collections of contiguous records
     Searching may walk over records without the same h as the target record
     If the table is > 75% full, search deteriorates to O(n)

39 Multi-Record Buckets
An alternative approach: reduce collisions by storing more than one record in each hash table entry.
Implement each bucket as an expandable array or linked list.
Insertion and deletion are identical to a simple hash table.
  We do not need to worry about exceeding capacity
To search for k with hash value h, load bucket A[h] and scan it for k.
If we use buckets, the packing density of A is r / (bn).
  n is the size of A, b is the maximum number of entries in each A[i] position

40 Single vs. Multi-Record Buckets

Single-record entries: r = 700, n = 1000, b = 1
  r/n = 700/1000 = 0.7
  P(0) = 0.7^0 e^(−0.7) / 0!, P(1) = 0.7^1 e^(−0.7) / 1!
  1000 − 347 − 155 = 498 entries, 0 keys
  347 entries, 1 key: 347 recs
  155 entries, > 1 keys: 352 recs, 197 collisions
  Collision rate 28.1%

Two-record buckets: r = 700, n = 500, b = 2
  r/n = 700/500 = 1.4
  P(0) = 1.4^0 e^(−1.4) / 0!, P(1) = 1.4^1 e^(−1.4) / 1!, P(2) = 1.4^2 e^(−1.4) / 2!
  124 entries, 0 keys
  172 entries, 1 key: 172 recs
  121 entries, 2 keys: 242 recs
  83 entries, > 2 keys: 286 recs, 120 collisions
  Collision rate 17.1%
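The comparison can be reproduced approximately with a short Python sketch that generalizes the Poisson estimate to buckets of capacity b: an entry that receives x keys stores min(x, b) of them, and the rest collide. The function name and the cutoff max_x are assumptions; results are unrounded, so they differ slightly from the slide's rounded counts.

```python
from math import exp, factorial

def bucket_collision_rate(r, n, b, max_x=60):
    """Estimated fraction of r keys that overflow n buckets of capacity b."""
    lam = r / n
    stored = 0.0
    for x in range(max_x + 1):                    # terms beyond max_x are negligible here
        pr = lam**x * exp(-lam) / factorial(x)    # Poisson Pr(x)
        stored += n * pr * min(x, b)              # each bucket keeps at most b keys
    return (r - stored) / r

single = bucket_collision_rate(700, 1000, 1)      # 1000 single-record entries
double = bucket_collision_rate(700, 500, 2)       # 500 two-record buckets
print(f"{single:.1%} {double:.1%}")
```

Both configurations provide room for 1000 records in total, yet pairing the slots into two-record buckets lowers the estimated collision rate from roughly 28% to roughly 17%.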

41 Bucket Advantages and Disadvantages
Simply rearranging 1000 table entries into a two-bucket table reduced the collision rate from 28.1% to 17.1%.
Multi-bucket tables still have disadvantages:
  If r ≫ n, buckets become long and search deteriorates to O(n)
  Checking for duplicate keys and deletion also deteriorate to O(n)
  The size of the table n still cannot be efficiently changed

Hash Table and Hashing

The tree structures discussed so far assume that we can only work with the input keys by comparing them. No other operation is considered. In practice, it is often true that an input