CS61B Lecture #5

Today: Sorting: standard methods and their properties; selection.
Readings for today: DS(IJ), Chapter 8. Readings for next topic (balanced searches): DS(IJ).

Shell's Sort

Idea: Improve insertion sort by first sorting distant elements:
- First sort subsequences of elements that are some large gap, g, apart: sort items #0, g, 2g, 3g, ..., then items #1, 1+g, 1+2g, ..., then items #2, 2+g, 2+2g, ..., and so on, ending with items #(g-1), 2g-1, 3g-1, ....
- Each time an item moves across a gap of g, it can remove up to 2g-1 inversions.
- Now repeat with a smaller gap, again sorting the subsequences of elements that far apart.
- End at plain insertion sort (elements 1 apart), but with most inversions already gone.
- With gaps of the form 2^k - 1 (..., 15, 7, 3, 1), the sort is Θ(N^1.5) (take CS170 for why!).

Example of Shell's Sort

[Worked example omitted: a table tracing each pass of Shell's sort over a sample array, with columns #I (inversions left) and #C (comparisons needed to sort the subsequences).]

Sorting by Selection: Heapsort

Idea: Keep selecting the smallest (or largest) remaining element.
- Really bad idea on a simple list or vector.
- But we've already seen it in action: use a heap.
- Gives an O(N lg N) algorithm (N remove-first operations).
- Since we remove items from the end of the heap, we use that area to accumulate the result.

[Worked example omitted: a sample array, its heapified form, and the array after each successive removal of the largest remaining element.]

Last modified: Thu Oct 8 — CS61B: Lecture #5
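The gapped-insertion idea above can be sketched in a few lines. This is a minimal Python sketch (the function name and the choice of the 2^k - 1 gap sequence mentioned on the slide are mine, not from the slides); each pass is just insertion sort restricted to items a fixed gap apart:

```python
def shell_sort(a):
    """Sort list a in place using gaps ..., 15, 7, 3, 1 (the 2^k - 1 sequence)."""
    n = len(a)
    # Largest gap of the form 2^k - 1 that is less than n.
    gap = 1
    while 2 * gap + 1 < n:
        gap = 2 * gap + 1
    while gap >= 1:
        # Gapped insertion sort: sorts each subsequence of items `gap` apart.
        for i in range(gap, n):
            x, j = a[i], i
            while j >= gap and a[j - gap] > x:
                a[j] = a[j - gap]
                j -= gap
            a[j] = x
        gap //= 2          # (2^k - 1) // 2 == 2^(k-1) - 1, so gaps stay in the sequence
    return a
```

The final iteration (gap of 1) is plain insertion sort, but by then most inversions are gone, which is the whole point.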
Merge Sorting

Idea: Divide the data into two equal parts; recursively sort the halves; merge the results. We've already seen the analysis: Θ(N lg N).
- Good for external sorting: first break the data into chunks small enough to fit in memory and sort them; then repeatedly merge the sorted runs into bigger and bigger sequences. Can merge K sequences of arbitrary size on secondary storage using Θ(K) internal storage.
- For internal sorting, use a binomial comb to orchestrate the merges.

Illustration of Internal Merge Sort

[Worked example omitted: the binomial comb's lists after successive prefixes of a short sample list have been processed, ending with the fully merged, sorted sequence.]

Quicksort: Speed through Probability

Idea:
- Partition the data into pieces: everything > a pivot value at the high end of the sequence to be sorted, and everything <= the pivot on the low end.
- Repeat recursively on the high and low pieces.
- For speed, stop when pieces are small enough and do insertion sort on the whole thing. Reason: insertion sort has low constant factors. By design, no item will move out of its piece [why?], so when the pieces are small, #inversions is, too.
- Have to choose the pivot well. E.g.: median of the first, last, and middle items of the sequence.

Example of Quicksort

In this example, we continue until pieces are of size <= 4. Pivots for the next step are starred. Arrange to move the pivot to the dividing line each time. The last step is insertion sort.

[Worked example omitted: successive partitionings of a sample array; after partitioning, everything is close to its final position, so a single insertion sort finishes the job.]
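The quicksort recipe above (median-of-three pivot, small-piece cutoff, one final insertion-sort pass) can be sketched as follows. This is a minimal Python sketch under the slide's design; the names and the cutoff value 4 are illustrative choices, not prescribed by the slides:

```python
def quicksort(a, cutoff=4):
    """Sort list a in place: quicksort down to small pieces, then one
    insertion-sort pass over the whole array."""
    def partition_sort(lo, hi):
        if hi - lo + 1 <= cutoff:
            return                        # leave small pieces for insertion sort
        mid = (lo + hi) // 2
        # Median of first, middle, and last items as pivot; move it to a[hi].
        _, p = sorted([(a[lo], lo), (a[mid], mid), (a[hi], hi)])[1]
        a[p], a[hi] = a[hi], a[p]
        pivot, i = a[hi], lo
        for j in range(lo, hi):           # partition around the pivot
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]         # pivot lands on the dividing line
        partition_sort(lo, i - 1)
        partition_sort(i + 1, hi)
    partition_sort(0, len(a) - 1)
    # Final insertion sort: no item needs to move out of its piece,
    # so few inversions remain and this pass is cheap.
    for i in range(1, len(a)):
        x, j = a[i], i
        while j > 0 and a[j - 1] > x:
            a[j] = a[j - 1]
            j -= 1
        a[j] = x
    return a
```

Note the design choice from the slide: rather than insertion-sorting each small piece as it appears, we do one insertion sort over the whole array at the end; since no item ever leaves its piece, the two are equivalent in cost.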
Performance of Quicksort

Probabilistic time:
- If the choice of pivots is good, we divide the data in two each time: Θ(N lg N), with a good constant factor relative to merge or heap sort.
- If the choice of pivots is bad, most items end up on one side each time: Θ(N^2).
- Ω(N lg N) even in the best case, so insertion sort is better for nearly ordered input sets.
- Interesting point: randomly shuffling the data before sorting makes Ω(N^2) time very unlikely!

Quick Selection

The Selection Problem: for given k, find the k-th smallest element in the data.
- Obvious method: sort, select element #k; time Θ(N lg N).
- If k is some constant, can easily do it in Θ(N) time: go through the array, keeping the smallest k items seen so far.
- Can get probably-Θ(N) time for all k by adapting quicksort:
  - Partition around some pivot, p, as in quicksort, arranging that the pivot ends up at the dividing line.
  - Suppose that in the result the pivot is at index m; all elements <= the pivot have indices <= m.
  - If m = k, you're done: p is the answer.
  - If m > k, recursively select the k-th from the left part of the sequence.
  - If m < k, recursively select the (k - m - 1)-th from the right part of the sequence.

Selection Example

Problem: find just one item of the sorted version of an array.

[Worked example omitted: successive partitionings narrow the search to one side of each pivot, ending with a two-element piece that is simply sorted to yield the answer.]

Selection Performance

For this algorithm, if m lands roughly in the middle each time, the cost is

    C(N) = 1,            if N = 1,
           N + C(N/2),   otherwise
         = N + N/2 + ... + 1
         = 2N - 1 ∈ Θ(N).

But in the worst case, we get Θ(N^2), as for quicksort. By another, non-obvious algorithm, one can get Θ(N) worst-case time for all k (take CS170).
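The partition-and-recurse selection procedure above can be sketched iteratively. This is a minimal Python sketch; the function name is mine, and I use a random pivot (in the spirit of the slide's remark that shuffling makes bad behavior unlikely) rather than any particular pivot rule from the slides:

```python
import random

def quickselect(a, k):
    """Return the k-th smallest (0-indexed) element of sequence a."""
    a = list(a)                           # work on a copy
    lo, hi = 0, len(a) - 1
    while True:
        if lo == hi:
            return a[lo]
        p = random.randint(lo, hi)        # random pivot makes Θ(N^2) unlikely
        a[p], a[hi] = a[hi], a[p]
        pivot, m = a[hi], lo
        for j in range(lo, hi):           # partition; pivot ends at index m
            if a[j] < pivot:
                a[m], a[j] = a[j], a[m]
                m += 1
        a[m], a[hi] = a[hi], a[m]
        if m == k:                        # pivot is the answer
            return pivot
        elif m > k:                       # answer lies left of the pivot
            hi = m - 1
        else:                             # answer lies right of the pivot
            lo = m + 1
```

Each round discards one side of the partition, giving the N + N/2 + ... cost sum from the slide when the pivot lands near the middle.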
Better than N lg N?

Can prove that if all you do to keys is compare them, then sorting must take Ω(N lg N).

Basic idea: there are N! possible ways the input data could be scrambled.
- Therefore, your program must be prepared to do N! different combinations of move operations.
- Therefore, there must be N! possible combinations of outcomes of all the if-tests in your program (we're assuming that comparisons are 2-way).
- Since each if-test goes two ways, the number of possible different outcomes for k if-tests is 2^k.
- Thus, we need enough tests so that 2^k >= N!, which means k ∈ Ω(lg N!).
- Using Stirling's approximation,

      m! = sqrt(2πm) (m/e)^m (1 + Θ(1/m)),

  this tells us that k ∈ Ω(N lg N).

Beyond Comparison: Distribution Counting

But suppose we do more than compare keys? For example, how might we sort a set of N integer keys whose values range from 0 to kN, for some small constant k?
- One technique: count the number of items < 1, < 2, etc.
- If M_p = #items with value < p, then in sorted order, the j-th item with value p must be item #(M_p + j).
- Gives a linear-time algorithm.

Distribution Counting Example

[Worked example omitted: an array of small integer keys, a counts line giving the number of occurrences of each key, and a running sum giving the cumulative count of keys below each value.]

The counts line gives the number of occurrences of each key. The running sum gives the cumulative count of keys below each value, which tells us where to put each key: the first instance of key k goes into slot m, where m is the number of key instances that are < k.

Radix Sort

Idea: sort keys one character at a time.
- Can use distribution counting for each digit.
- Can work either right to left (LSD radix sort) or left to right (MSD radix sort).
- LSD radix sort is venerable: it was used for punched cards.

[Worked example omitted: three LSD passes over a list of three-letter words, sorting by character #2, then #1, then #0.]
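The M_p + j rule above translates directly into code. This is a minimal Python sketch of distribution counting (the names are mine); `pos[p]` plays the role of M_p, the number of items with value < p:

```python
def counting_sort(a, key_range):
    """Stable linear-time sort of integer keys in [0, key_range)."""
    counts = [0] * key_range
    for x in a:
        counts[x] += 1
    # pos[p] = number of items with value < p  (the M_p of the slide).
    pos = [0] * key_range
    for p in range(1, key_range):
        pos[p] = pos[p - 1] + counts[p - 1]
    out = [None] * len(a)
    for x in a:                 # the j-th item with value p lands in slot M_p + j
        out[pos[x]] = x
        pos[x] += 1
    return out
```

Because we scan the input in order and bump `pos[x]` as we place each item, equal keys keep their relative order; that stability is what makes this usable as the per-digit pass inside radix sort.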
MSD Radix Sort

A bit more complicated: we must keep the lists from each step separate. But we can stop processing 1-element lists.

[Worked example omitted: successive MSD passes over a list of words, splitting the list by position #0, then refining each sublist by later positions until every sublist has one element.]

Performance of Radix Sort

- Radix sort takes Θ(B) time, where B is the total size of the key data.
- We have measured the other sorts as a function of #records. How to compare?
- To have N different records, keys must be at least Θ(lg N) long [why?].
- Furthermore, a comparison actually takes time Θ(K) in the worst case, where K is the size of a key [why?].
- So "N lg N comparisons" really means N (lg N)^2 operations, while radix sort takes B = N lg N time.
- On the other hand, one must work to get good constant factors with radix sort.

And Don't Forget Search Trees

Idea: a search tree is in sorted order when read in inorder.
- Need balance to really use it for sorting [next topic].
- Given balance, same performance as heapsort: N insertions in time lg N each, plus Θ(N) to traverse, gives Θ(N + N lg N) = Θ(N lg N).

Summary

- Insertion sort: Θ(Nk) comparisons and moves, where k is the maximum amount data is displaced from its final position. Good for small data sets or almost-ordered data sets.
- Quicksort: Θ(N lg N) with a good constant factor if the data is not pathological. Worst case O(N^2).
- Merge sort: Θ(N lg N) guaranteed. Good for external sorting.
- Heapsort, treesort with guaranteed balance: Θ(N lg N) guaranteed.
- Radix sort, distribution sort: Θ(B) (number of bytes). Also good for external sorting.
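Tying the last two ideas together, LSD radix sort is just a stable distribution-counting pass per digit, least significant first. This is a minimal Python sketch for non-negative integers using base-256 "characters" (the byte-at-a-time framing and names are my illustration, not taken from the slides):

```python
def lsd_radix_sort(a, nbytes=4):
    """Sort non-negative ints < 256**nbytes by one stable counting
    pass per byte, least significant byte first."""
    for shift in range(0, 8 * nbytes, 8):
        counts = [0] * 257
        for x in a:                       # count occurrences of each byte value
            counts[((x >> shift) & 0xFF) + 1] += 1
        for d in range(1, 257):           # running sum: #items with byte < d
            counts[d] += counts[d - 1]
        out = [0] * len(a)
        for x in a:                       # stable placement; stability is what
            d = (x >> shift) & 0xFF       # lets later passes preserve earlier ones
            out[counts[d]] = x
            counts[d] += 1
        a = out
    return a
```

Each pass costs Θ(N + 256), and there are as many passes as bytes per key, matching the Θ(B) bound on the slide (B = total bytes of key data).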