Implementation with Ruby features. Sorting, Searching and Hashing. Quick Sort. Algorithm of Quick Sort
Implementation with Ruby features: Sorting, Searching and Hashing
Bruno MARTIN, University of Nice - Sophia Antipolis
mailto:bruno.martin@unice.fr
http://www.i3s.unice.fr/~bmartin/mathmods.html

It uses the ideas of quicksort:

    def qsort
      return self if empty?
      select { |x| x < first }.qsort +
        select { |x| x == first } +
        select { |x| x > first }.qsort
    end

How can we replace the select operator from Ruby?

Quick Sort: Algorithm of Quick Sort

Invented by C.A.R. Hoare in 1960; easy to implement, it is a good general-purpose internal sort. It is a divide-and-conquer algorithm:
- take at random an element of the array, say v
- divide the array into two partitions: one contains the elements smaller than v, the other contains the elements greater than v
- put the elements < v at the beginning of the array (say, at indexes between 0 and m - 1) and the elements > v at the end of the array (indexes between m + 1 and n - 1)
- then you have found the place to put v: between the two partitions (at position m)
- recursively call QuickSort on [a_0, ..., a_{m-1}] and [a_{m+1}, ..., a_{n-1}]
- stop when the partition is reduced to a single element

For example, the pivot can be the leftmost or the rightmost element; we choose the rightmost. Our QuickSort runs on a subarray [a_left, ..., a_right]:

    def quick!(left, right)
      if left < right
        m = self.partition(left, right)
        self.quick!(left, m - 1)
        self.quick!(m + 1, right)
      end
    end
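The select-based quicksort from the slide can be run as a small self-contained program; reopening Array and the method name qsort follow the slide, and a local pivot variable is used only for readability:

```ruby
# Functional quicksort on Array, following the slide's select-based idea:
# concatenate the sorted smaller elements, the pivots, and the sorted
# greater elements.
class Array
  def qsort
    return self if empty?
    pivot = first
    select { |x| x < pivot }.qsort +
      select { |x| x == pivot } +
      select { |x| x > pivot }.qsort
  end
end

p [3, 5, 1, 2, 4].qsort  # => [1, 2, 3, 4, 5]
```

Note that this version allocates new arrays at each level; the in-place variant discussed next avoids that, which is the point of replacing select by a partition step.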
Algorithm of the Partition of the Array

Example: [3,5,1,2,4].qsort!
- Scan (index i) from the left until you find an element >= v (a[i] >= v)
- Scan (index j) from the right until you find an element <= v (a[j] <= v)
- Both elements are obviously out of place: swap a[i] and a[j]
- Continue until the scan pointers cross (j <= i)
- Exchange v (a[right]) with the element a[i]

    until j <= i do
      i += 1 until self[i] >= v   # scans for i: self[i] >= v
      j -= 1 until self[j] <= v   # scans for j: self[j] <= v
      if i <= j
        self.swap!(i, j)          # exchange both elements
        i += 1; j -= 1            # modify indexes: clean recursion
      end
    end

Best seen from /Users/bmartin/Documents/Enseignement/Mathmods/Programs with trirapide!

The big picture:

    def qsort!
      def lqsort(left, right)             # sort from left to right
        if left < right
          v, i, j = self[right], left, right
          until j <= i do
            i += 1 until self[i] >= v     # scans for i: self[i] >= v
            j -= 1 until self[j] <= v     # scans for j: self[j] <= v
            if i <= j
              self.swap!(i, j)            # exchange both elements
              i += 1; j -= 1              # modify indexes: clean recursion
            end
          end
          self.lqsort(left, j)            # sort left part
          self.lqsort(i, right)           # sort right part
        end
      end
      self.lqsort(0, self.length - 1)
      self
    end

Quick Sort

We must test that neither i nor j crosses the array bounds left and right. Because v = self[right], you are sure that the loop on i stops at the latest when i = right. But if v = self[right] happens to be the smallest element between left and right, the loop on j might pass the left end of the array. To avoid these tests, you can choose another solution (median of three):
- Take three elements of the array: the leftmost, the rightmost and the middle one
- Sort them
- Put the smallest at the leftmost position, the greatest at the rightmost position, and take the middle one as v
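The scheme above can be sketched as a complete runnable program. The swap! helper is not defined in the slides, so it is added here, and explicit bound checks guard the two scans (the slides discuss avoiding those checks via the pivot choice instead):

```ruby
# In-place quicksort sketch following the slides' lqsort scheme
# (rightmost element as pivot, two scans, recursion on [left..j] and
# [i..right]). swap! is a helper added for this sketch.
class Array
  def swap!(i, j)
    self[i], self[j] = self[j], self[i]
    self
  end

  def qsort!(left = 0, right = length - 1)
    return self unless left < right
    v, i, j = self[right], left, right
    until j <= i
      i += 1 while i < right && self[i] < v   # scan from the left
      j -= 1 while j > left && self[j] > v    # scan from the right
      if i <= j
        swap!(i, j)                           # exchange both elements
        i += 1
        j -= 1
      end
    end
    qsort!(left, j)    # sort left part
    qsort!(i, right)   # sort right part
    self
  end
end

p [3, 5, 1, 2, 4].qsort!  # => [1, 2, 3, 4, 5]
```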
Quick Sort on Average-Case and Worst-Case Partitioning

The average performance of Quick Sort is about 1.38 n log2 n: a very efficient algorithm with a very small constant. Quick Sort is a divide-and-conquer algorithm which splits the problem into two recursive calls and combines the results. Divide-and-conquer is a good method every time you can split your problem into smaller pieces and combine the results to obtain the global solution. But divide-and-conquer leads to an efficient algorithm only when the problem is divided without overlap.

Quick Sort is very inefficient on already sorted sets: O(n^2).
- Suppose a[0], ..., a[n-1] is sorted, without equal elements
- At the first call, v = a[n-1]
- The while on i continues until i = n-1 and stops because a[n-1] = v: the scan does n-1 comparisons
- The while on j stops at j = n-2 because a[n-2] < v: 1 comparison
- We exchange a[n-1] with itself: 1 exchange
- We call QuickSort on a[0], ..., a[n-2] and on a[n], ..., a[n-1], which immediately stops
- So the total is (n + 1) + n + ... + 2 = n(n + 3)/2: QuickSort is in O(n^2) on sorted sets

C_n: average number of comparisons for sorting n elements:

    C_n = (n + 1) + (1/n) * sum_{k=1..n} (C_{k-1} + C_{n-k})

- n + 1 comparisons during the two inner whiles ((n - 1) + 2; the 2 extra when i and j cross)
- plus the average number of comparisons on the two sub-arrays: ((C_0 + C_{n-1}) + (C_1 + C_{n-2}) + ... + (C_{n-1} + C_0))/n
- By symmetry: C_n = (n + 1) + (2/n) * sum_{k=1..n} C_{k-1}
- Subtract (n - 1) C_{n-1} from n C_n to get: n C_n = (n + 1) C_{n-1} + 2n
- Divide both sides by n(n + 1) to obtain the recurrence:

    C_n/(n + 1) = C_{n-1}/n + 2/(n + 1)

- Approximation: C_n/(n + 1) ~ 2 * sum_{k=1..n} 1/k ~ 2 * integral from 1 to n of dx/x = 2 ln n
- So C_n ~ 2 n ln n = 2 ln(2) n log2 n ~ 1.38 n log2 n

Intuition for the performance of quick sort: the quicksort running time depends on whether the partitioning is balanced.
- The worst-case partitioning occurs when the partitioning produces one region with 1 element and one with n-1 elements: O(n^2)
- The best-case partitioning occurs when the partitioning produces two regions with n/2 elements (C_n = n + 2 C_{n/2}): O(n log n)

    worst-case (height n)        best-case (height log n)
          n                             n
         / \                          /   \
        1  n-1                     n/2     n/2
           / \                    /   \   /   \
          1  n-2                n/4  n/4 n/4  n/4
             ...                      ...
              1                    1  ...  1
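The quadratic behaviour on sorted input can be checked experimentally. The sketch below counts comparisons in a simplified last-pivot quicksort (one comparison per non-pivot element per partition, so the sorted-input total is n(n-1)/2 rather than the slides' n(n+3)/2, which also counts the crossing tests); qsort_count is a name invented for this sketch:

```ruby
# Comparison-counting quicksort (last element as pivot), illustrating
# that already sorted input costs ~n^2/2 comparisons, while a random
# permutation costs ~2 n ln n on average.
def qsort_count(a)
  return [a.dup, 0] if a.size <= 1
  pivot = a.last
  rest = a[0...-1]
  less, geq = rest.partition { |x| x < pivot }  # one comparison per element
  sorted_less, c_less = qsort_count(less)
  sorted_geq, c_geq = qsort_count(geq)
  [sorted_less + [pivot] + sorted_geq, rest.size + c_less + c_geq]
end

sorted, comps = qsort_count((1..100).to_a)
puts comps  # => 4950, i.e. n(n-1)/2 for n = 100
_, rand_comps = qsort_count((1..100).to_a.shuffle)
puts rand_comps  # typically well below 1000
```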
Lower Bound for Sorting

Is sorting an array of size n possible in fewer than n log n operations? If you only use element comparisons, it is impossible. You need to model your computation problem:
- You express each sort by a decision tree where each internal node represents the comparison between two elements
- The left child corresponds to the negative answer and the right child to the positive one
- Each leaf represents a given permutation

Representing the decision tree model: for the set to sort {a1, a2, a3}, the corresponding decision tree is:

                     a1 > a2
                   /         \
            a2 > a3           a1 > a3
           /       \         /       \
    (a1,a2,a3)  a1 > a3  (a2,a1,a3)  a2 > a3
                /     \              /     \
        (a1,a3,a2) (a3,a1,a2) (a2,a3,a1) (a3,a2,a1)

The decision tree to sort n elements has n! leaves (all possible permutations). A binary tree with n! leaves has a height of order log2(n!), which is approximately n log2 n (Stirling). So n log n is a lower bound for comparison-based sorting.

Introduction to Searching

Searching is a fundamental operation in many tasks: retrieving a particular piece of information among a large amount of stored data. The stored data can be viewed as a set. Information is divided into records with a key field used for searching. The goal of searching: find the records whose key matches a given searched key. Dictionaries and symbol tables are two examples of data structures needed for searching.
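The lower bound log2(n!) ~ n log2 n can be checked numerically without computing huge factorials, by summing logarithms; log2_factorial is a helper name invented for this sketch:

```ruby
# Numeric check of the sorting lower bound: log2(n!) grows like
# n * log2(n) (Stirling), computed as a sum of logarithms.
def log2_factorial(n)
  (2..n).sum { |k| Math.log2(k) }
end

[10, 100, 1000].each do |n|
  printf("n=%5d  log2(n!)=%10.1f  n*log2(n)=%10.1f\n",
         n, log2_factorial(n), n * Math.log2(n))
end
```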
Sequential Search in a Sorted List is in O(n)

The time complexity often depends on the structure given to the set of records (e.g. lists, sets, arrays, trees, ...). So, when programming a searching algorithm on a structure, one often needs to provide operations like insertion, deletion and sometimes sorting of the set of records. In any case, the time complexity of the searching algorithm might be sensitive to operations like comparison of keys, insertion of one record in the set, shift of records, exchange of records, ...

Sequential searching in a sorted list uses approximately n/2 comparisons for both a successful and an unsuccessful search.
- The (average) complexity of the successful search in a sorted list equals that of the successful search in an array in the average case
- For the unsuccessful search: the search can be ended at each of the n + 1 positions of the list. We do 1 comparison if the searched key is less than the first element, ..., n + 1 comparisons if the key is greater than the last one (the sentinel). Since 1 + 2 + ... + (n + 1) = (n + 1)(n + 2)/2, the average is (n + 2)/2

Sequential Search in an Array is O(n)
- n + 1 comparisons for an unsuccessful search in the best, average and worst case
- (n + 1)/2 comparisons for a successful search on the average(1): suppose the records all have the same probability of being found; we do 1 comparison to find the first one, ..., n to find the last one, so on the average: (1 + 2 + ... + n)/n = (n + 1)/2

(1) average = mean = (sum of all the entries)/(number of entries)

An Elementary Algorithm: the Binary Search

When the set of records gets large and the records are ordered, to reduce the searching time, use a divide-and-conquer strategy:
- Divide the set into two parts
- Determine in which part the key might belong
- Repeat the search on this part of the set
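The three divide-and-conquer steps above can be sketched as an iterative binary search over a sorted Array; the method name bsearch_index is invented for this sketch (it mirrors, but does not use, Ruby's built-in Array#bsearch_index):

```ruby
# Iterative binary search over a sorted Array: returns the index of
# key, or nil when the search is unsuccessful.
def bsearch_index(a, key)
  left, right = 0, a.length - 1
  while left <= right
    middle = left + (right - left) / 2
    case key <=> a[middle]
    when 0  then return middle       # found
    when -1 then right = middle - 1  # key is in the left part
    else         left = middle + 1   # key is in the right part
    end
  end
  nil
end

p bsearch_index([1, 3, 4, 7, 9], 7)  # => 3
p bsearch_index([1, 3, 4, 7, 9], 5)  # => nil
```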
Application to Numerical Analysis

For finding an approximation of the zeroes of a continuous function by binary search:

Theorem (Intermediate value theorem). If the function f(x) = y is continuous on [a, b] and u is a number s.t. f(a) < u < f(b), then there is a c in [a, b] s.t. f(c) = u.

If one can evaluate the sign of f((a + b)/2), let f be strictly increasing on [a, b] with f(a) < 0 < f(b). The binary search allows one to find y s.t. f(y) = 0:
1. start with the pair (a, b)
2. evaluate v = f((a + b)/2)
3. if v < 0, replace a by (a + b)/2; otherwise replace b by (a + b)/2
4. iterate on the new pair until the difference between the two values is less than an arbitrary given precision

Performance of Binary Search

Proof 1: Consider the tree of the recursive calls of the search. At each call the array is split into two halves, so the tree is a full binary tree. The number of comparisons equals the tree height: log2 n.

Proof 2: The number of comparisons for n elements equals the number of comparisons in one subarray plus 1, because you compare with the root. Solve the recurrence C_n = C_{n/2} + 1 for n >= 2, with C_1 = 0. For n = 2^k: C_{2^k} = C_{2^{k-1}} + 1 = ... = C_1 + k = k = log2 n.

Order of magnitude on the average case: binary search uses approximately log2 n comparisons for both successful and unsuccessful search in the best, average and worst case. The maximal number of comparisons is reached when the search is unsuccessful.

A successful sequential search in a set of 10000 elements takes about 5000 comparisons; a successful binary search in the same set takes 14 comparisons. BUT inserting an element:
- in an unsorted array takes 1 operation
- in a sorted array takes n operations: to find the place and shift the other elements right
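The four bisection steps above can be sketched as a short program; bisect is a name invented for this sketch, and the function is passed as a block:

```ruby
# Bisection sketch for the zero of an increasing continuous function
# on [a, b] with f(a) < 0 < f(b), iterated to a given precision eps.
def bisect(a, b, eps = 1e-9)
  while b - a > eps
    m = (a + b) / 2.0
    if yield(m) < 0
      a = m   # the zero is in the right half
    else
      b = m   # the zero is in the left half
    end
  end
  (a + b) / 2.0
end

root = bisect(1.0, 2.0) { |x| x * x - 2 }
puts root  # => 1.4142135... (an approximation of sqrt(2))
```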
An Elementary Algorithm: Interpolation Search

Dictionary search: if the word begins with B you look near the beginning, and if the word begins with T you turn a lot of pages. Suppose you search the key k. In the binary search you cut the array in the middle:

    middle = left + (1/2) * (right - left)

In the interpolation search you take the values of the keys into account, replacing 1/2 by a better progression:

    position = left + ((k - A[left].key) / (A[right].key - A[left].key)) * (right - left)

Performance of the Interpolation Search

The interpolation search uses approximately log(log n) comparisons for both (un)successful search in the array. But interpolation search heavily depends on the fact that the keys are well distributed over the interval. The method requires some computation; for small sets, the log n of binary search is close to log(log n). So interpolation search should be used for large sets, in applications where comparisons are particularly expensive, or for external methods where access costs are high.

Hashing

Hashing is a completely different method of searching. The idea is to access the record directly in a table using its key, the same way an index accesses an entry in an array. We use a hash function that computes a table index from the key. Basic operations: insert, remove, search.
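The position formula above can be sketched as a runnable interpolation search over a sorted Array of numbers; interp_search is a name invented for this sketch, and a guard handles the case where all keys in the range are equal (which would divide by zero):

```ruby
# Interpolation search sketch: probe where the key is expected to sit,
# assuming the keys are roughly uniformly distributed.
def interp_search(a, k)
  left, right = 0, a.length - 1
  while left <= right && k >= a[left] && k <= a[right]
    if a[right] == a[left]                       # avoid dividing by zero
      return a[left] == k ? left : nil
    end
    pos = left +
          ((k - a[left]).to_f / (a[right] - a[left]) * (right - left)).floor
    return pos if a[pos] == k
    if a[pos] < k
      left = pos + 1
    else
      right = pos - 1
    end
  end
  nil
end

p interp_search([10, 20, 30, 40, 50], 40)  # => 3
```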
The steps in hashing:
1. compute a hash function which maps keys to table addresses. Since there are more records (n) than indexes (M) in the table, two or more keys may hash to the same table address: it is the collision problem
2. the collision resolution process

Good hash functions should uniformly distribute the entries in the table since, if the function uniformly distributes the keys, the complexity of searching is approximately divided by the table's size.

Why does M have to be prime? An example of hash function is

    hash(key) = (key[0] * (2^k)^0 + key[1] * (2^k)^1 + ... + key[n] * (2^k)^n) mod M

Suppose you choose M = 2^k; then XXX mod M is unaffected by adding multiples of 2^k to XXX, so hash(key) = key[0]: the hash only depends on the 1st char of the key. The simplest way to ensure that the hash function takes all the characters of a key into account is to take M prime.

Transform Keys into Integers in [[0, M - 1]]
- If your key is already a large integer, choose M to be a prime and compute key mod M
- If your key is an uppercase character string, encode each char in a 5-bit code (5 bits (2^5 = 32) are enough to encode 26 items): each letter is encoded by the binary value of its rank in the alphabet; then compute the modulo of the corresponding decimal value

Example: ABC -> 1 * (2^5)^2 + 2 * (2^5)^1 + 3 * (2^5)^0 = 1091, and 1091 mod M gives the index in the table

How to Handle the Collision Process

We have an array of size M, called the hash table, and a hash function which gives for any key a possible entry in this array. Problem: decide what to do when 2 keys hash to the same address. A first simple method is to build, for each table entry, a linked list of the records whose keys hash to the same entry. Colliding records are chained together: we call it separate chaining. At initialization, the hash table is an array of M pointers to empty linked lists.
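The 5-bit encoding can be sketched as follows; hash5 is a name invented for this sketch. Note that it treats the first character as the high-order digit (the slide's formula puts key[0] at the low-order end), so with M = 2^5 this version collapses to the code of the last character, which illustrates the same degeneracy:

```ruby
# 5-bit string hash sketch: letters A..Z are encoded by their rank
# (A=1, B=2, ...), combined in base 2^5 = 32, then reduced mod M.
def hash5(key, m)
  key.chars.reduce(0) { |acc, c| acc * 32 + (c.ord - 'A'.ord + 1) } % m
end

M = 97  # a prime table size (illustrative choice)
puts hash5("ABC", M)       # (1*32**2 + 2*32 + 3) mod 97 = 1091 mod 97 = 24
puts hash5("ABC", 2**5)    # = 3: with M = 2^5 only one character matters
```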
Performances

Good hash functions uniformly distribute the entries over the table; the expected search costs are in O(alpha), where alpha = n/M is the table's filling rate:
- Unsuccessful search: Q-(M, n) = (1/M) * sum_{i=1..M} (1 + L_i) = 1 + alpha, since sum_{i=1..M} L_i = n (L_i is the length of the list at entry i)
- Successful search: searching for an element in the table costs the same as inserting it when only the elements inserted before it were already in the table:

    Q+(M, n) = (1/n) * sum_{i=0..n-1} Q-(M, i)
             = (1/n) * sum_{i=0..n-1} (1 + i/M)
             = 1 + (n - 1)/(2M) ~ 1 + alpha/2

The interest of hashing is that it is efficient and easy to program.

Searching a record in a Hash Table with linked lists

Main operation on a hash table: search a record with its key:
- compute the hash value of the key: hash(key) = i
- access the linked list at position i: HashTable[i]
- if there is more than your record in the list, you have collisions
- searching becomes a search in a list: iterate on each record, comparing the keys
- unsuccessful search: you iterate down the list without finding your record

Operations of insertion and removal of records in a hash table become linked list operations.

Alternative proof for the successful search
- x_i is the i-th element inserted into the table and k_i = key[x_i]
- X_ij = 1{h(k_i) = h(k_j)} for all i, j (indicator random variables)
- simple uniform hashing: Pr{h(k_i) = h(k_j)} = 1/M, so E[X_ij] = 1/M
- expected number of elements examined in a successful search:

    E[(1/n) * sum_{i=1..n} (1 + sum_{j=i+1..n} X_ij)]     (1)

  where sum_{j=i+1..n} X_ij is the number of elements inserted after x_i into the same slot as x_i
- by linearity of expectation:

    (1) = (1/n) * sum_{i=1..n} (1 + sum_{j=i+1..n} E[X_ij])
        = (1/n) * sum_{i=1..n} (1 + (n - i)/M)
        = 1 + (1/(nM)) * sum_{i=1..n} (n - i)
        = 1 + (1/(nM)) * n(n - 1)/2
        = 1 + (n - 1)/(2M)
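The separate-chaining operations described above can be sketched as a small class; the names ChainedHash, insert, search and remove are invented for this sketch, and Ruby's built-in Object#hash stands in for the hash function:

```ruby
# Separate-chaining hash table sketch: an Array of M buckets, each a
# list of [key, value] pairs chained at the same table address.
class ChainedHash
  def initialize(m = 97)
    @m = m
    @table = Array.new(m) { [] }   # M pointers to empty lists
  end

  def insert(key, value)
    bucket = @table[key.hash % @m]
    pair = bucket.find { |k, _| k == key }
    if pair
      pair[1] = value              # key already present: update
    else
      bucket << [key, value]       # chain the colliding record
    end
  end

  def search(key)
    pair = @table[key.hash % @m].find { |k, _| k == key }
    pair && pair[1]                # nil on unsuccessful search
  end

  def remove(key)
    @table[key.hash % @m].reject! { |k, _| k == key }
  end
end

h = ChainedHash.new(7)             # small M to force collisions
h.insert("06000", "Nice")
p h.search("06000")  # => "Nice"
```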
Expected cost interpretation
- if n = O(M), then alpha = n/M = O(M)/M = O(1): searching takes constant time on the average
- insertion is O(1) in the worst case
- deletion takes O(1) worst-case time with doubly linked lists
- hence, all dictionary operations take O(1) time on average with hash tables with chaining

Another structure for Hash Tables: Linear Probing

When the number n of elements can be estimated in advance, you can avoid using any linked list: you store the records in a table of size M > n, and the empty places in the table help you for collision resolution. This is called linear probing.

Searching and Inserting with Linear Probing

If the place HashTable[hash(key)] is already busy:
- if the keys match, the search is successful
- else there is a collision: you search at the next place, i + 1
  - if that place is free, the search is unsuccessful and you have found a place to insert your record
  - else if the keys match, the search is successful
  - if the keys differ, try the next position, i + 2
- but be careful: the position after i is (i + 1) mod M
- and check that the table is not full, otherwise the iteration won't terminate
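The probing rules above, including the wrap-around and the full-table check, can be sketched as a small class; ProbingHash and its method names are invented for this sketch (deletion is deliberately omitted, since it is the subject of the next section):

```ruby
# Linear-probing sketch: open addressing in arrays of size M > n,
# probing (i + 1) mod M until a free slot or the key is found.
class ProbingHash
  def initialize(m)
    @m = m
    @keys = Array.new(m)           # nil marks a free place
    @vals = Array.new(m)
    @count = 0
  end

  def insert(key, value)
    i = key.hash % @m
    probes = 0
    until @keys[i].nil? || @keys[i] == key
      raise "table full" if (probes += 1) >= @m   # avoid looping forever
      i = (i + 1) % @m             # the position after i, wrapping around
    end
    @count += 1 if @keys[i].nil?
    @keys[i] = key
    @vals[i] = value
  end

  def search(key)
    i = key.hash % @m
    probes = 0
    while probes < @m && !@keys[i].nil?
      return @vals[i] if @keys[i] == key
      i = (i + 1) % @m
      probes += 1
    end
    nil                            # a free place: unsuccessful search
  end
end
```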
Problem with Linear Probing

Suppose you would like to perform the operation of deletion. To delete an element from the hash table, you search for it, you remove it from the array, and the place is free again. Is it so simple?
- suppose key1 and key2 (different) hash to the same address i
- you insert key1 first, at position i
- you try to insert key2 at position i, you find it busy, and you finally insert it at position i + 1
- now you delete key1: the place i becomes free
- you search for key2: it hashes to a free position i, so its search is unsuccessful, although key2 is in the table!

A place must therefore have one of three statuses: free, busy and deleted.

Eliminating the Clustering Problem

Instead of examining each successive entry, we use a second hash function to compute a fixed increment for the probe sequence (instead of the increment 1 of linear probing). Depending on the choice of the second hash function, the program may not work: obviously, an increment of 0 leads to an infinite loop.

Performances of Hash Tables with Linear Probing

This hashing works because it guarantees that when you search for a particular key, you look at every key that hashes to the same table address. In linear probing, when the table begins to fill up, you also look at other keys: two different collision sets may be stuck together, which is the clustering problem. Linear probing becomes very slow when the table is almost full because of clustering, and when the table is full you cannot continue to use it.

Conclusion on Searching

Searching is a classical problem in CS: various algorithms have been studied and are widely used. There are many empirical and analytic results that make the utility of hashing evident for a broad variety of applications. Hashing is preferred to binary tree searches for many applications because it is simple to implement and can provide very fast, constant searching times when space is available for a large enough table.
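The double-hashing probe sequence can be sketched numerically; probe_sequence and the table size are invented for this sketch. With M prime, any nonzero increment is coprime with M, so the sequence visits every slot (while an increment of 0, as the text notes, would loop forever):

```ruby
# Double-hashing probe sequence sketch: the second hash function gives
# a fixed nonzero increment; with a prime M the sequence covers all
# M slots before repeating.
def probe_sequence(key_hash, increment, m)
  raise ArgumentError, "increment must be nonzero mod m" if increment % m == 0
  (0...m).map { |t| (key_hash + t * increment) % m }
end

M = 11  # prime table size (illustrative)
p probe_sequence(3, 4, M)  # visits all 11 slots exactly once
```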
Hashing in Ruby

    zip = Hash.new
    zip = {"06000" => "Nice", "06100" => "Nice", "06110" => "Le Cannet",
           "06130" => "Grasse", "06140" => "Coursegoules",
           "06140" => "Tourrettes sur Loup", "06140" => "Vence",
           "06190" => "Rocquebrune Cap Martin", "06200" => "Nice",
           "06230" => "Saint Jean Cap Ferrat",
           "06230" => "Villefranche sur Mer"}
    # for the duplicate keys "06140" and "06230", only the last value is kept
    zip["06300"] = "Nice"   # adds a new entry
    zip.keys
    # => ["06140", "06130", "06230", "06110", "06000", "06100", "06200", "06300", "06190"]
    zip.values
    # => ["Vence", "Grasse", "Villefranche sur Mer", "Le Cannet", "Nice",
    #     "Nice", "Nice", "Nice", "Rocquebrune Cap Martin"]
    zip.select { |key, val| val == "Nice" }
    # => [["06000", "Nice"], ["06100", "Nice"], ["06200", "Nice"], ["06300", "Nice"]]
    zip.index "Nice"        # => "06000" (Ruby 1.8 API; modern Ruby uses zip.key)
    zip.each { |k, v| puts "#{k}/#{v}" }
    # 06140/Vence
    # 06130/Grasse
    # 06230/Villefranche sur Mer
    # 06110/Le Cannet
    # 06000/Nice
    # 06100/Nice
    # 06200/Nice
    # 06300/Nice
    # 06190/Rocquebrune Cap Martin
Module 2: Divide and Conquer Divide and Conquer Control Abstraction for Divide &Conquer 1 Recurrence equation for Divide and Conquer: If the size of problem p is n and the sizes of the k sub problems are
More informationTree-Structured Indexes
Tree-Structured Indexes Yanlei Diao UMass Amherst Slides Courtesy of R. Ramakrishnan and J. Gehrke Access Methods v File of records: Abstraction of disk storage for query processing (1) Sequential scan;
More informationDATA STRUCTURES AND ALGORITHMS
LECTURE 11 Babeş - Bolyai University Computer Science and Mathematics Faculty 2017-2018 In Lecture 9-10... Hash tables ADT Stack ADT Queue ADT Deque ADT Priority Queue Hash tables Today Hash tables 1 Hash
More informationSymbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management
Hashing Symbol Table Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management In general, the following operations are performed on
More information( D. Θ n. ( ) f n ( ) D. Ο%
CSE 0 Name Test Spring 0 Multiple Choice. Write your answer to the LEFT of each problem. points each. The time to run the code below is in: for i=n; i>=; i--) for j=; j
More information4. Sorting and Order-Statistics
4. Sorting and Order-Statistics 4. Sorting and Order-Statistics The sorting problem consists in the following : Input : a sequence of n elements (a 1, a 2,..., a n ). Output : a permutation (a 1, a 2,...,
More informationHashing. Yufei Tao. Department of Computer Science and Engineering Chinese University of Hong Kong
Department of Computer Science and Engineering Chinese University of Hong Kong In this lecture, we will revisit the dictionary search problem, where we want to locate an integer v in a set of size n or
More informationUnit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION
DESIGN AND ANALYSIS OF ALGORITHMS Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION http://milanvachhani.blogspot.in EXAMPLES FROM THE SORTING WORLD Sorting provides a good set of examples for analyzing
More informationINSTITUTE OF AERONAUTICAL ENGINEERING
INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad - 500 043 COMPUTER SCIENCE AND ENGINEERING TUTORIAL QUESTION BANK Course Name Course Code Class Branch DATA STRUCTURES ACS002 B. Tech
More information) $ f ( n) " %( g( n)
CSE 0 Name Test Spring 008 Last Digits of Mav ID # Multiple Choice. Write your answer to the LEFT of each problem. points each. The time to compute the sum of the n elements of an integer array is: # A.
More informationCSE 332: Data Structures & Parallelism Lecture 12: Comparison Sorting. Ruth Anderson Winter 2019
CSE 332: Data Structures & Parallelism Lecture 12: Comparison Sorting Ruth Anderson Winter 2019 Today Sorting Comparison sorting 2/08/2019 2 Introduction to sorting Stacks, queues, priority queues, and
More informationSorting: Given a list A with n elements possessing a total order, return a list with the same elements in non-decreasing order.
Sorting The sorting problem is defined as follows: Sorting: Given a list A with n elements possessing a total order, return a list with the same elements in non-decreasing order. Remember that total order
More informationThe divide-and-conquer paradigm involves three steps at each level of the recursion: Divide the problem into a number of subproblems.
2.3 Designing algorithms There are many ways to design algorithms. Insertion sort uses an incremental approach: having sorted the subarray A[1 j - 1], we insert the single element A[j] into its proper
More informationlogn D. Θ C. Θ n 2 ( ) ( ) f n B. nlogn Ο n2 n 2 D. Ο & % ( C. Θ # ( D. Θ n ( ) Ω f ( n)
CSE 0 Test Your name as it appears on your UTA ID Card Fall 0 Multiple Choice:. Write the letter of your answer on the line ) to the LEFT of each problem.. CIRCLED ANSWERS DO NOT COUNT.. points each. The
More informationChapter 7. Space and Time Tradeoffs. Copyright 2007 Pearson Addison-Wesley. All rights reserved.
Chapter 7 Space and Time Tradeoffs Copyright 2007 Pearson Addison-Wesley. All rights reserved. Space-for-time tradeoffs Two varieties of space-for-time algorithms: input enhancement preprocess the input
More informationIntroduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far
Chapter 5 Hashing 2 Introduction hashing performs basic operations, such as insertion, deletion, and finds in average time better than other ADTs we ve seen so far 3 Hashing a hash table is merely an hashing
More informationLecture 8: Mergesort / Quicksort Steven Skiena
Lecture 8: Mergesort / Quicksort Steven Skiena Department of Computer Science State University of New York Stony Brook, NY 11794 4400 http://www.cs.stonybrook.edu/ skiena Problem of the Day Give an efficient
More informationToday s Outline. CS 561, Lecture 8. Direct Addressing Problem. Hash Tables. Hash Tables Trees. Jared Saia University of New Mexico
Today s Outline CS 561, Lecture 8 Jared Saia University of New Mexico Hash Tables Trees 1 Direct Addressing Problem Hash Tables If universe U is large, storing the array T may be impractical Also much
More informationRandomized Algorithms, Hash Functions
Randomized Algorithms, Hash Functions Lecture A Tiefenbruck MWF 9-9:50am Center 212 Lecture B Jones MWF 2-2:50pm Center 214 Lecture C Tiefenbruck MWF 11-11:50am Center 212 http://cseweb.ucsd.edu/classes/wi16/cse21-abc/
More informationUNIT-2. Problem of size n. Sub-problem 1 size n/2. Sub-problem 2 size n/2. Solution to the original problem
Divide-and-conquer method: Divide-and-conquer is probably the best known general algorithm design technique. The principle behind the Divide-and-conquer algorithm design technique is that it is easier
More informationof characters from an alphabet, then, the hash function could be:
Module 7: Hashing Dr. Natarajan Meghanathan Professor of Computer Science Jackson State University Jackson, MS 39217 E-mail: natarajan.meghanathan@jsums.edu Hashing A very efficient method for implementing
More informationCS 303 Design and Analysis of Algorithms
Mid-term CS 303 Design and Analysis of Algorithms Review For Midterm Dong Xu (Based on class note of David Luebke) 12:55-1:55pm, Friday, March 19 Close book Bring your calculator 30% of your final score
More informationComparison Sorts. Chapter 9.4, 12.1, 12.2
Comparison Sorts Chapter 9.4, 12.1, 12.2 Sorting We have seen the advantage of sorted data representations for a number of applications Sparse vectors Maps Dictionaries Here we consider the problem of
More informationFinal Examination CSE 100 UCSD (Practice)
Final Examination UCSD (Practice) RULES: 1. Don t start the exam until the instructor says to. 2. This is a closed-book, closed-notes, no-calculator exam. Don t refer to any materials other than the exam
More informationChapter 12: Indexing and Hashing. Basic Concepts
Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition
More informationHash Open Indexing. Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1
Hash Open Indexing Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1 Warm Up Consider a StringDictionary using separate chaining with an internal capacity of 10. Assume our buckets are implemented
More informationHASH TABLES. Hash Tables Page 1
HASH TABLES TABLE OF CONTENTS 1. Introduction to Hashing 2. Java Implementation of Linear Probing 3. Maurer s Quadratic Probing 4. Double Hashing 5. Separate Chaining 6. Hash Functions 7. Alphanumeric
More informationCS 561, Lecture 2 : Randomization in Data Structures. Jared Saia University of New Mexico
CS 561, Lecture 2 : Randomization in Data Structures Jared Saia University of New Mexico Outline Hash Tables Bloom Filters Skip Lists 1 Dictionary ADT A dictionary ADT implements the following operations
More informationHash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1
Hash Tables Outline Definition Hash functions Open hashing Closed hashing collision resolution techniques Efficiency EECS 268 Programming II 1 Overview Implementation style for the Table ADT that is good
More informationCSE373: Data Structure & Algorithms Lecture 21: More Comparison Sorting. Aaron Bauer Winter 2014
CSE373: Data Structure & Algorithms Lecture 21: More Comparison Sorting Aaron Bauer Winter 2014 The main problem, stated carefully For now, assume we have n comparable elements in an array and we want
More informationECE 122. Engineering Problem Solving Using Java
ECE 122 Engineering Problem Solving Using Java Lecture 27 Linear and Binary Search Overview Problem: How can I efficiently locate data within a data structure Searching for data is a fundamental function
More informationComputational Optimization ISE 407. Lecture 16. Dr. Ted Ralphs
Computational Optimization ISE 407 Lecture 16 Dr. Ted Ralphs ISE 407 Lecture 16 1 References for Today s Lecture Required reading Sections 6.5-6.7 References CLRS Chapter 22 R. Sedgewick, Algorithms in
More informationSorting. Popular algorithms: Many algorithms for sorting in parallel also exist.
Sorting Popular algorithms: Selection sort* Insertion sort* Bubble sort* Quick sort* Comb-sort Shell-sort Heap sort* Merge sort* Counting-sort Radix-sort Bucket-sort Tim-sort Many algorithms for sorting
More informationFundamental Algorithms
Fundamental Algorithms Chapter 7: Hash Tables Michael Bader Winter 2011/12 Chapter 7: Hash Tables, Winter 2011/12 1 Generalised Search Problem Definition (Search Problem) Input: a sequence or set A of
More informationEECS 2011M: Fundamentals of Data Structures
M: Fundamentals of Data Structures Instructor: Suprakash Datta Office : LAS 3043 Course page: http://www.eecs.yorku.ca/course/2011m Also on Moodle Note: Some slides in this lecture are adopted from James
More informationSorting Algorithms. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University
Sorting Algorithms CptS 223 Advanced Data Structures Larry Holder School of Electrical Engineering and Computer Science Washington State University 1 QuickSort Divide-and-conquer approach to sorting Like
More informationChapter 12: Indexing and Hashing
Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL
More informationLecture 9: Sorting Algorithms
Lecture 9: Sorting Algorithms Bo Tang @ SUSTech, Spring 2018 Sorting problem Sorting Problem Input: an array A[1..n] with n integers Output: a sorted array A (in ascending order) Problem is: sort A[1..n]
More informationSorting is a problem for which we can prove a non-trivial lower bound.
Sorting The sorting problem is defined as follows: Sorting: Given a list a with n elements possessing a total order, return a list with the same elements in non-decreasing order. Remember that total order
More information( ) + n. ( ) = n "1) + n. ( ) = T n 2. ( ) = 2T n 2. ( ) = T( n 2 ) +1
CSE 0 Name Test Summer 00 Last Digits of Student ID # Multiple Choice. Write your answer to the LEFT of each problem. points each. Suppose you are sorting millions of keys that consist of three decimal
More informationIS 2610: Data Structures
IS 2610: Data Structures Searching March 29, 2004 Symbol Table A symbol table is a data structure of items with keys that supports two basic operations: insert a new item, and return an item with a given
More informationLecture 17. Improving open-addressing hashing. Brent s method. Ordered hashing CSE 100, UCSD: LEC 17. Page 1 of 19
Lecture 7 Improving open-addressing hashing Brent s method Ordered hashing Page of 9 Improving open addressing hashing Recall the average case unsuccessful and successful find time costs for common openaddressing
More informationHow many leaves on the decision tree? There are n! leaves, because every permutation appears at least once.
Chapter 8. Sorting in Linear Time Types of Sort Algorithms The only operation that may be used to gain order information about a sequence is comparison of pairs of elements. Quick Sort -- comparison-based
More information