The worst-case running time of RANDOMIZED-SELECT is Θ(n²), even to find the minimum. The algorithm has a linear expected running time, though, and because it is randomized, no particular input elicits the worst-case behavior.

Let the running time of RANDOMIZED-SELECT on an input array A[p..r] of n elements be a random variable T(n); we obtain an upper bound on E[T(n)] as follows. The procedure RANDOMIZED-PARTITION is equally likely to return any element as the pivot.

MAT-72006 AA+DS, Fall 2014 25-Sep-14 295

Therefore, for each k such that 1 ≤ k ≤ n, the subarray A[p..q] has k elements (all less than or equal to the pivot) with probability 1/n. For k = 1, 2, …, n, define indicator random variables X_k, where

X_k = I{the subarray A[p..q] has exactly k elements}

Assuming that the elements are distinct, we have E[X_k] = 1/n.

When we choose A[q] as the pivot element, we do not know, a priori, 1) if we will terminate immediately with the correct answer, 2) recurse on the subarray A[p..q−1], or 3) recurse on the subarray A[q+1..r].
This decision depends on where the ith smallest element falls relative to A[q]. Assuming T(n) to be monotonically increasing, we can upper-bound the time needed for a recursive call by the time needed for the call on the largest possible input. I.e., to obtain an upper bound, we assume that the ith element is always on the side of the partition with the greater number of elements.

For a given call of RANDOMIZED-SELECT, the indicator random variable X_k has value 1 for exactly one value of k, and it is 0 for all others. When X_k = 1, the two subarrays on which we might recurse have sizes k−1 and n−k.

We have the recurrence

T(n) ≤ Σ_{k=1}^{n} X_k · (T(max(k−1, n−k)) + O(n))
     = Σ_{k=1}^{n} X_k · T(max(k−1, n−k)) + O(n)

Taking expected values, we have

E[T(n)] ≤ E[Σ_{k=1}^{n} X_k · T(max(k−1, n−k)) + O(n)]
        = Σ_{k=1}^{n} E[X_k · T(max(k−1, n−k))] + O(n)
        = Σ_{k=1}^{n} E[X_k] · E[T(max(k−1, n−k))] + O(n)
        = Σ_{k=1}^{n} (1/n) · E[T(max(k−1, n−k))] + O(n)

The first equation on this slide follows by independence of the random variables X_k and T(max(k−1, n−k)).

Consider the expression max(k−1, n−k):

max(k−1, n−k) = k−1 if k > ⌈n/2⌉
                n−k if k ≤ ⌈n/2⌉

If n is even, each term from T(⌈n/2⌉) up to T(n−1) appears exactly twice in the summation, and if n is odd, all these terms appear twice and T(⌊n/2⌋) appears once. Thus, we have

E[T(n)] ≤ (2/n) Σ_{k=⌊n/2⌋}^{n−1} E[T(k)] + O(n)

We can show that E[T(n)] = O(n) by substitution.

In summary, we can find any order statistic, and in particular the median, in expected linear time, assuming that the elements are distinct.
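The selection procedure analyzed above can be sketched in Python. This is a simplified, non-in-place version (the slides' RANDOMIZED-SELECT partitions the array in place); it keeps the essential structure: pick a random pivot, then recurse on only one side.

```python
import random

def randomized_select(a, i):
    """Return the i-th smallest element (1-indexed) of list a,
    assuming distinct elements; expected linear time."""
    if len(a) == 1:
        return a[0]
    pivot = random.choice(a)          # random pivot, as in RANDOMIZED-PARTITION
    low = [x for x in a if x < pivot]
    high = [x for x in a if x > pivot]
    k = len(low) + 1                  # pivot is the k-th smallest element
    if i == k:
        return pivot                  # terminate immediately
    elif i < k:
        return randomized_select(low, i)        # recurse on the low side
    else:
        return randomized_select(high, i - k)   # recurse on the high side
```

Because the pivot is random, no fixed input forces the Θ(n²) worst case; the expected work follows the recurrence analyzed above.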
III Data Structures

Elementary Data Structures
Hash Tables
Binary Search Trees
Red-Black Trees
Augmenting Data Structures

Dynamic sets

Sets are fundamental to computer science. Algorithms may require several different types of operations to be performed on sets. For example, many algorithms need only the ability to insert elements into, delete elements from, and test membership in a set. We call a dynamic set that supports these operations a dictionary.
Operations on dynamic sets

Operations on a dynamic set can be grouped into queries and modifying operations.

SEARCH(S, k): Given a set S and a key value k, return a pointer x to an element in S such that x.key = k, or NIL if no such element belongs to S.

INSERT(S, x): Augment the set S with the element pointed to by x. We assume that any attributes in element x have already been initialized.

DELETE(S, x): Given a pointer x to an element in the set S, remove x from S. Note that this operation takes a pointer to an element x, not a key value.

MINIMUM(S): A query on a totally ordered set S that returns a pointer to the element of S with the smallest key.

MAXIMUM(S): Return a pointer to the element of S with the largest key.

SUCCESSOR(S, x): Given an element x whose key is from a totally ordered set S, return a pointer to the next larger element in S, or NIL if x is the maximum element.

PREDECESSOR(S, x): Given an element x whose key is from a totally ordered set S, return a pointer to the next smaller element in S, or NIL if x is the minimum element.

We usually measure the time taken to execute a set operation in terms of the size of the set. E.g., we later describe a data structure that can support any of the operations on a set of size n in time O(lg n).
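The operations above can be illustrated with a toy dynamic set backed by a sorted Python list (the class and method names are mine, mirroring the operation names; keys stand in for elements, and None stands in for NIL). This sketch is O(n) per modifying operation; the tree structures later in the course achieve O(lg n) for all seven operations.

```python
import bisect

class DynamicSet:
    """Toy dynamic set over distinct, totally ordered keys,
    backed by a sorted list."""
    def __init__(self):
        self.keys = []

    def search(self, k):
        i = bisect.bisect_left(self.keys, k)
        return k if i < len(self.keys) and self.keys[i] == k else None

    def insert(self, k):
        bisect.insort(self.keys, k)

    def delete(self, k):
        self.keys.remove(k)

    def minimum(self):
        return self.keys[0] if self.keys else None

    def maximum(self):
        return self.keys[-1] if self.keys else None

    def successor(self, k):
        i = bisect.bisect_right(self.keys, k)   # first key strictly larger
        return self.keys[i] if i < len(self.keys) else None

    def predecessor(self, k):
        i = bisect.bisect_left(self.keys, k)    # keys[i-1] is strictly smaller
        return self.keys[i - 1] if i > 0 else None
```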
11 Hash Tables

Often one only needs the dictionary operations INSERT, SEARCH, and DELETE. A hash table effectively implements dictionaries. In the worst case, searching for an element in a hash table takes Θ(n) time. In practice, hashing performs extremely well: under reasonable assumptions, the average time to search for an element is O(1).

11.2 Hash tables

The set K of keys stored in a dictionary is usually much smaller than the universe U of all possible keys. A hash table requires storage Θ(|K|), while searching for an element in it only takes O(1) time. The catch is that this bound is for the average-case time.
An element with key k is stored in slot h(k); that is, we use a hash function h to compute the slot from the key k.

h : U → {0, 1, …, m−1}

maps the universe U of keys into the slots of hash table T[0..m−1]. The size m of the hash table is typically much less than |U|. We say that an element with key k hashes to slot h(k), or that h(k) is the hash value of key k. If two keys hash to the same slot, we have a collision; effective techniques resolve the conflict.
The ideal solution avoids collisions altogether. We might try to achieve this goal by choosing a suitable hash function h. One idea is to make h appear to be random, thus minimizing the number of collisions. Of course, h must be deterministic so that a given key k always produces the same output h(k). Because |U| > m, there must be at least two keys that have the same hash value; avoiding collisions altogether is therefore impossible.

Collision resolution by chaining

In chaining, we place all the elements that hash to the same slot into the same linked list. Slot j contains a pointer to the head of the list of all stored elements that hash to j. If no such elements exist, slot j contains NIL. The dictionary operations on a hash table are easy to implement when collisions are resolved by chaining.
CHAINED-HASH-INSERT(T, x)
1. insert x at the head of list T[h(x.key)]

CHAINED-HASH-SEARCH(T, k)
1. search for an element with key k in list T[h(k)]

CHAINED-HASH-DELETE(T, x)
1. delete x from the list T[h(x.key)]

The worst-case running time of insertion is O(1). It is fast in part because it assumes that the element being inserted is not already present in the table. For searching, the worst-case running time is proportional to the length of the list; we analyze this operation more closely soon. We can delete an element (given a pointer to it) in O(1) time if the lists are doubly linked. In singly linked lists, to delete x, we would first have to find x in the list T[h(x.key)].
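The three chained-hash operations above can be sketched in Python. Note one simplification: Python lists stand in for the linked lists, so deletion here costs O(chain length) rather than the O(1) achievable with doubly linked lists; the division-method hash is used for concreteness.

```python
class ChainedHashTable:
    """Hash table with collisions resolved by chaining;
    keys double as elements (no satellite data)."""
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]   # T[0..m-1], each slot a chain

    def _h(self, k):
        return k % self.m                     # division-method hash

    def insert(self, k):
        self.table[self._h(k)].insert(0, k)   # insert at the head of the chain

    def search(self, k):
        return k in self.table[self._h(k)]    # scan only one chain

    def delete(self, k):
        self.table[self._h(k)].remove(k)
```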
Analysis of hashing with chaining

A hash table of m slots stores n elements; define the load factor α = n/m, i.e., the average number of elements in a chain. Our analysis will be in terms of α, which can be < 1, = 1, or > 1.

In the worst case all n keys hash to one slot. The worst-case time for searching is thus Θ(n) plus the time to compute the hash function.

The average-case performance of hashing depends on how well h distributes the set of keys to be stored among the m slots, on the average.

Simple uniform hashing (SUH): Assume that an element is equally likely to hash into any of the m slots, independently of where any other element has hashed to.

For j = 0, 1, …, m−1, let us denote the length of the list T[j] by n_j, so that n = n_0 + n_1 + ⋯ + n_{m−1}, and the expected value of n_j is E[n_j] = α = n/m.
Theorem 11.1 In a hash table which resolves collisions by chaining, an unsuccessful search takes average-case time Θ(1 + α), under the SUH assumption.

Proof Under SUH, any key k not already in the table is equally likely to hash to any of the m slots. The time to search unsuccessfully for k is the expected time to go through list T[h(k)], which has expected length E[n_{h(k)}] = α. Thus, the expected number of elements examined is α, and the total time required (including computing h(k)) is Θ(1 + α).

Theorem 11.2 In a hash table which resolves collisions by chaining, a successful search takes average-case time Θ(1 + α), under SUH.

Proof We assume that the element x being searched for is equally likely to be any of the n elements stored in the table. The number of elements examined during the search for x is one more than the number of elements that appear before x in x's list. Elements before x in the list were all inserted after x was inserted.
Let us take the average (over the n table elements x) of the expected number of elements added to x's list after x was added to the list, plus 1.

Let x_i, i = 1, 2, …, n, denote the ith element inserted into the table, and let k_i = x_i.key. For keys k_i and k_j, define the indicator random variable X_ij = I{h(k_i) = h(k_j)}. Under SUH, Pr{h(k_i) = h(k_j)} = 1/m, and by Lemma 5.1, E[X_ij] = 1/m.

Thus, the expected number of elements examined in a successful search is

E[(1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} X_ij)]
= (1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} E[X_ij])
= (1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} 1/m)
= 1 + (1/(nm)) Σ_{i=1}^{n} (n − i)
= 1 + (1/(nm)) (Σ_{i=1}^{n} n − Σ_{i=1}^{n} i)
= 1 + (1/(nm)) (n² − n(n + 1)/2)
= 1 + (n − 1)/(2m)
= 1 + α/2 − α/(2n)

Thus, the total time required for a successful search is Θ(2 + α/2 − α/(2n)) = Θ(1 + α).

If the number of hash-table slots is at least proportional to the number of elements in the table, we have n = O(m) and, consequently, α = n/m = O(m)/m = O(1). Thus, searching takes constant time on average. Insertion takes O(1) worst-case time. Deletion takes O(1) worst-case time when the lists are doubly linked. Hence, we can support all dictionary operations in O(1) time on average.
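The closed form 1 + (n − 1)/(2m) derived above can be checked empirically. The simulation below (my own illustration, not from the slides) hashes n keys uniformly at random into m chains, inserting at the head, and averages the number of chain elements examined over all successful searches.

```python
import random

def avg_successful_search_cost(n, m, trials=200):
    """Empirical average number of elements examined in a successful
    search, under simple uniform hashing with head insertion.
    Its expectation is exactly 1 + (n - 1) / (2 * m)."""
    total = 0.0
    for _ in range(trials):
        chains = [[] for _ in range(m)]
        for key in range(n):
            chains[random.randrange(m)].insert(0, key)  # SUH + head insert
        # searching for key k examines its position (from the head) in its chain
        cost = sum(chain.index(k) + 1 for chain in chains for k in chain)
        total += cost / n
    return total / trials
```

With n = 100 and m = 50 (α = 2), the predicted cost is 1 + 99/100 = 1.99, and the simulation concentrates near that value.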
11.3 Hash functions

A good hash function satisfies (approximately) the assumption of SUH: each key is equally likely to hash to any of the m slots, independently of the other keys. We typically have no way to check this condition, since we rarely know the probability distribution from which the keys are drawn. The keys might not be drawn independently.

If we, e.g., know that the keys are random real numbers k independently and uniformly distributed in the range 0 ≤ k < 1, then the hash function h(k) = ⌊km⌋ satisfies the condition of SUH.

In practice, we can often employ heuristic techniques to create a hash function that performs well. Qualitative information about the distribution of keys may be useful in this design process.
In a compiler's symbol table the keys are strings representing identifiers in a program. Closely related symbols, such as pt and pts, often occur in the same program. A good hash function minimizes the chance that such variants hash to the same slot.

A good approach derives the hash value in a way that we expect to be independent of any patterns that might exist in the data. For example, the division method computes the hash value as the remainder when the key is divided by a specified prime number.

Interpreting keys as natural numbers

Hash functions assume that the universe of keys is the set of natural numbers N = {0, 1, 2, …}. We can interpret a character string as an integer expressed in suitable radix notation. We interpret the identifier pt as the pair of decimal integers (112, 116), since p = 112 and t = 116 in ASCII. Expressed as a radix-128 integer, pt becomes 112 · 128 + 116 = 14,452.
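The radix-128 interpretation above is a few lines of Python (the function name is mine):

```python
def string_to_nat(s, radix=128):
    """Interpret a character string as a natural number in the given
    radix, using the ASCII code of each character as a digit."""
    n = 0
    for ch in s:
        n = n * radix + ord(ch)   # e.g. "pt" -> 112 * 128 + 116
    return n
```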
11.3.1 The division method

In the division method, we map a key k into one of m slots by taking the remainder of k divided by m. That is, the hash function is h(k) = k mod m. E.g., if the hash table has size m = 12 and the key is k = 100, then h(k) = 4. Since it requires only a single division operation, hashing by division is quite fast.

When using the division method, we usually avoid certain values of m. E.g., m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k. We are better off designing the hash function to depend on all the bits of the key. A prime not too close to an exact power of 2 is often a good choice for m.
Suppose we wish to allocate a hash table, with collisions resolved by chaining, to hold roughly n = 2,000 character strings, where a character has 8 bits. We don't mind examining an average of 3 elements in an unsuccessful search, and so we allocate a hash table of size m = 701. We choose m = 701 because it is a prime near 2000/3 but not near any power of 2 (2⁹ = 512 and 2¹⁰ = 1024). Treating each key k as an integer, our hash function would be h(k) = k mod 701.

11.3.2 The multiplication method

The method operates in two steps. First, multiply the key k by a constant A, 0 < A < 1, and extract the fractional part of kA. Then, multiply this value by m and take the floor of the result. The hash function is

h(k) = ⌊m (kA mod 1)⌋

where (kA mod 1) means the fractional part of kA, that is, kA − ⌊kA⌋.
The value of m is not critical; we typically choose m = 2^p. The function is then easy to implement on most computers. Suppose that the word size of the machine is w bits and that k fits into a single word. Restrict A to be a fraction of the form s/2^w, where s is an integer in the range 0 < s < 2^w. Multiply k by the w-bit integer s = A · 2^w. The result is a 2w-bit value r₁ · 2^w + r₀, where r₁ is the high-order word of the product and r₀ is the low-order word of the product. The desired p-bit hash value consists of the p most significant bits of r₀.
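The word-level computation above can be sketched in Python, which emulates the 2w-bit product with arbitrary-precision integers (the default s below is the fixed-point form of Knuth's constant (√5 − 1)/2, discussed next):

```python
def hash_multiplication(k, p=14, w=32, s=2654435769):
    """Multiplication method with m = 2**p and A = s / 2**w:
    take the low-order word r0 of the product k*s, then
    its p most significant bits."""
    r0 = (k * s) % (2 ** w)   # low-order word of the 2w-bit product
    return r0 >> (w - p)      # p most significant bits of r0
```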
Knuth (1973) suggests that A ≈ (√5 − 1)/2 = 0.6180339887… is likely to work reasonably well.

Suppose that k = 123,456, p = 14, m = 2¹⁴ = 16,384, and w = 32. Let A be the fraction of the form s/2³² closest to (√5 − 1)/2, so that A = 2,654,435,769/2³². Then k · s = 327,706,022,297,664 = 76,300 · 2³² + 17,612,864, and so r₁ = 76,300 and r₀ = 17,612,864. The 14 most significant bits of r₀ yield the value h(k) = 67.

11.4 Open addressing

In open addressing, all elements occupy the hash table itself. That is, each table entry contains either an element of the dynamic set or NIL. A search for an element systematically examines table slots until we find the element or have ascertained that it is not in the table. No lists and no elements are stored outside the table, unlike in chaining.
Thus, the hash table can fill up so that no further insertions can be made; the load factor α can never exceed 1.

We could store the linked lists for chaining inside the hash table, in the otherwise unused hash-table slots. In open addressing, however, instead of following pointers, we compute the sequence of slots to be examined. The extra memory freed by not storing pointers provides the hash table with a larger number of slots for the same amount of memory, potentially yielding fewer collisions and faster retrieval.

To perform insertion, we successively examine, or probe, the hash table until we find an empty slot in which to put the key. Instead of being fixed in the order 0, 1, …, m−1 (which requires Θ(n) search time), the sequence of positions probed depends upon the key being inserted. To determine which slots to probe, we extend the hash function to include the probe number (starting from 0) as a second input.
Thus, the hash function becomes

h : U × {0, 1, …, m−1} → {0, 1, …, m−1}

With open addressing, we require that for every key k, the probe sequence

⟨h(k, 0), h(k, 1), …, h(k, m−1)⟩

be a permutation of ⟨0, 1, …, m−1⟩, so that every position is eventually considered as a slot for a new key as the table fills up. Let us assume that the elements in the hash table are keys with no satellite information; the key k is identical to the element containing key k.

HASH-INSERT either returns the slot number of key k or flags an error because the table is full.

HASH-INSERT(T, k)
1. i = 0
2. repeat
3.   j = h(k, i)
4.   if T[j] == NIL
5.     T[j] = k
6.     return j
7.   else i = i + 1
8. until i == m
9. error "hash table overflow"
A search for key k probes the same sequence of slots that the insertion algorithm examined.

HASH-SEARCH(T, k)
1. i = 0
2. repeat
3.   j = h(k, i)
4.   if T[j] == k
5.     return j
6.   i = i + 1
7. until T[j] == NIL or i == m
8. return NIL

When we delete a key from slot j, we cannot simply mark it as empty by storing NIL in it: we might then be unable to retrieve any key during whose insertion we had probed slot j. Instead we mark the slot with the special value DELETED. We modify HASH-INSERT to treat such a slot as empty, so that we can insert a new key there, while HASH-SEARCH passes over DELETED values. When we use the DELETED value, search times no longer depend on the load factor α; therefore chaining is more commonly selected as a collision resolution technique when keys must be deleted.
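HASH-INSERT, HASH-SEARCH, and DELETED-based deletion can be sketched together in Python. Linear probing (defined below) is used here for concreteness as the probe sequence h(k, i); the class name and sentinel are mine.

```python
NIL = None
DELETED = object()   # sentinel marking a deleted slot

class OpenAddressTable:
    """Open addressing with a DELETED sentinel; keys double as
    elements (no satellite information)."""
    def __init__(self, m):
        self.m = m
        self.T = [NIL] * m

    def _h(self, k, i):
        return (k % self.m + i) % self.m     # linear probing: (h'(k) + i) mod m

    def insert(self, k):
        for i in range(self.m):
            j = self._h(k, i)
            if self.T[j] is NIL or self.T[j] is DELETED:  # DELETED reused
                self.T[j] = k
                return j
        raise OverflowError("hash table overflow")

    def search(self, k):
        for i in range(self.m):
            j = self._h(k, i)
            if self.T[j] == k:
                return j
            if self.T[j] is NIL:             # only NIL ends the probe sequence
                return None
        return None

    def delete(self, k):
        j = self.search(k)
        if j is not None:
            self.T[j] = DELETED              # not NIL, so later probes continue
```

Because search passes over DELETED slots but stops at NIL, a key inserted after a later-deleted key along the same probe sequence remains findable.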
We assume uniform hashing (UH): the probe sequence of each key is equally likely to be any of the m! permutations of ⟨0, 1, …, m−1⟩. UH generalizes the notion of SUH to a hash function that produces not just a single number, but a whole probe sequence. True uniform hashing is difficult to implement, however, and in practice suitable approximations (such as double hashing, defined below) are used.

We examine three common techniques to compute the probe sequences required for open addressing: 1) linear probing, 2) quadratic probing, and 3) double hashing. These techniques all guarantee that ⟨h(k, 0), h(k, 1), …, h(k, m−1)⟩ is a permutation of ⟨0, 1, …, m−1⟩ for each key k. No technique fulfills the assumption of UH; none of them is capable of generating more than m² different probe sequences. Double hashing has the greatest number of probe sequences and gives the best results.
Linear probing

Given an auxiliary hash function h′ : U → {0, 1, …, m−1}, linear probing uses the hash function

h(k, i) = (h′(k) + i) mod m

Given key k, we first probe T[h′(k)], the slot given by the auxiliary hash function. We next probe slots T[h′(k) + 1], …, T[m−1], and then wrap around to slots T[0], T[1], …, T[h′(k) − 1]. The initial probe determines the entire probe sequence, so there are only m distinct probe sequences.

Linear probing is easy to implement, but it suffers from a problem known as primary clustering. Long runs of occupied slots build up, increasing the average search time. Clusters arise because an empty slot preceded by i full slots gets filled next with probability (i + 1)/m. Long runs of occupied slots tend to get longer, and the average search time increases.
Quadratic probing

Quadratic probing uses a hash function of the form

h(k, i) = (h′(k) + c₁i + c₂i²) mod m

where h′ is an auxiliary hash function and c₁ and c₂ are positive auxiliary constants. The initial position probed is T[h′(k)]; later positions are offset by amounts that depend in a quadratic manner on the probe number i. This works much better than linear probing, but to make full use of the hash table, the values of c₁, c₂, and m are constrained.

Also, if two keys have the same initial probe position, then their probe sequences are the same, since h(k₁, 0) = h(k₂, 0) implies h(k₁, i) = h(k₂, i). This property leads to a milder form of clustering, called secondary clustering. As in linear probing, the initial probe determines the entire sequence, and so only m distinct probe sequences are used.
Double hashing

Double hashing is one of the best methods available for open addressing; the permutations produced have many of the characteristics of randomly chosen permutations. It uses a hash function of the form

h(k, i) = (h₁(k) + i · h₂(k)) mod m

where h₁ and h₂ are auxiliary hash functions. The initial probe goes to position T[h₁(k)]; successive probes are offset from previous positions by the amount h₂(k), modulo m.

Here we have a hash table of size 13 with h₁(k) = k mod 13 and h₂(k) = 1 + (k mod 11). Since 14 mod 13 = 1 and 14 mod 11 = 3, we insert the key 14 into empty slot (1 + 2 · 4) mod 13 = 9, after examining slots 1 and 5 and finding them to be occupied.
The value h₂(k) must be relatively prime to the hash-table size m for the entire hash table to be searched. A convenient way to ensure this condition is to let m be a power of 2 and to design h₂ so that it always produces an odd number. Another way is to let m be prime and to design h₂ so that it always returns a positive integer less than m. For example, we could choose m prime and let

h₁(k) = k mod m
h₂(k) = 1 + (k mod m′)

where m′ is chosen to be slightly less than m (say, m − 1).

E.g., if k = 123,456, m = 701, and m′ = 700, we have h₁(k) = 80 and h₂(k) = 257. We first probe position 80, and then we examine every 257th slot (modulo m) until we find the key or have examined every slot.

When m is prime or a power of 2, double hashing improves over linear or quadratic probing in that Θ(m²) probe sequences are used, rather than Θ(m): each possible (h₁(k), h₂(k)) pair yields a distinct probe sequence.
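The double-hashing probe sequence above, with the prime-m choice of h₁ and h₂, can be sketched as a generator (the function name is mine):

```python
def double_hash_probes(k, m, m_prime):
    """Yield h(k, i) = (h1(k) + i * h2(k)) mod m for i = 0..m-1,
    with h1(k) = k mod m and h2(k) = 1 + (k mod m_prime);
    m prime and m_prime < m make h2(k) relatively prime to m."""
    h1 = k % m
    h2 = 1 + (k % m_prime)
    for i in range(m):
        yield (h1 + i * h2) % m
```

For the slide's example k = 123,456, m = 701, m′ = 700, the sequence starts 80, 337, 594, … (step 257), and because gcd(257, 701) = 1 it visits all 701 slots. For the size-13 table above, key 14 yields probes 1, 5, 9, matching the figure.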