II (Sorting and) Order Statistics


Heapsort, Quicksort, Sorting in Linear Time, Medians and Order Statistics

8 Sorting in Linear Time

The sorting algorithms introduced thus far are comparison sorts. Any comparison sort must make Ω(n lg n) comparisons in the worst case to sort n elements; merge sort and heapsort are therefore asymptotically optimal. Here we examine counting sort, radix sort, and bucket sort, which run in linear time.

MAT-72006 AADS, Fall 2017, 20-Sep-17

8.1 Lower bounds for sorting

We can view comparison sorts abstractly in terms of decision trees: a full binary tree that represents the comparisons between elements performed by a particular sorting algorithm. The execution of the sorting algorithm corresponds to tracing a simple path from the root of the decision tree down to a leaf. Each internal node represents a comparison a_i ≤ a_j. When we come to a leaf, the sorting algorithm has established the ordering a_{π(1)} ≤ a_{π(2)} ≤ … ≤ a_{π(n)}. Any correct sorting algorithm must be able to produce each permutation of its input, so each of the n! permutations must appear as one of the leaves of the decision tree for a comparison sort to be correct.

A lower bound for the worst case

The length of the longest path from the root to any reachable leaf is the worst-case number of comparisons: the worst-case number of comparisons for a given algorithm equals the height of its decision tree. Hence, a lower bound on the heights of all decision trees in which each permutation appears as a reachable leaf is a lower bound on the running time of any comparison sort algorithm.

Theorem 8.1 Any comparison sort algorithm requires Ω(n lg n) comparisons in the worst case.

Proof. It suffices to determine the height of a decision tree in which each permutation appears as a reachable leaf. Consider a decision tree of height h with l reachable leaves corresponding to a comparison sort on n elements. Each of the n! permutations of the input appears as some leaf.

Therefore, we have n! ≤ l. Since a binary tree of height h has no more than 2^h leaves, we have n! ≤ l ≤ 2^h, which, by taking logarithms, implies h ≥ lg(n!), since the lg function is monotonically increasing. Furthermore, h = Ω(n lg n), because lg(n!) = Θ(n lg n). ∎

Hence, heapsort and merge sort are asymptotically optimal comparison sorts.

8.2 Counting sort

Assume that each of the n input elements is an integer in the range 0 to k, for some integer k. When k = O(n), the sort runs in Θ(n) time. Counting sort determines, for each input element x, the number of elements less than x. It uses this information to place element x directly into its position in the output array. E.g., if 17 elements are less than x, then x belongs in output position 18.
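The key chain of inequalities in the proof can be written out compactly, using the bound n! ≥ (n/2)^{n/2}, which follows by keeping only the n/2 largest factors of n!:

```latex
h \;\ge\; \lg(n!) \;\ge\; \lg\!\left(\left(\tfrac{n}{2}\right)^{n/2}\right)
  \;=\; \tfrac{n}{2}\,\lg\tfrac{n}{2} \;=\; \Omega(n \lg n).
```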

In addition to the input array A[1..n], we require two other arrays: B[1..n] holds the sorted output, and C[0..k] provides temporary working storage.

COUNTING-SORT(A, B, k)
1.  let C[0..k] be a new array
2.  for i = 0 to k
3.      C[i] = 0
4.  for j = 1 to A.length
5.      C[A[j]] = C[A[j]] + 1
6.  // C[i] now contains the number of elements equal to i
7.  for i = 1 to k
8.      C[i] = C[i] + C[i − 1]
9.  // C[i] now contains the number of elements ≤ i
10. for j = A.length downto 1
11.     B[C[A[j]]] = A[j]
12.     C[A[j]] = C[A[j]] − 1
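A minimal Python sketch of the pseudocode above, assuming 0-based lists rather than the 1-based arrays of the slides:

```python
def counting_sort(a, k):
    """Stable counting sort of integers in the range 0..k."""
    c = [0] * (k + 1)
    for x in a:                  # count occurrences of each value
        c[x] += 1
    for i in range(1, k + 1):    # prefix sums: c[i] = # of elements <= i
        c[i] += c[i - 1]
    b = [0] * len(a)
    for x in reversed(a):        # right-to-left pass preserves stability
        c[x] -= 1
        b[c[x]] = x
    return b
```

The right-to-left final pass is what makes the sort stable, which radix sort (Section 8.3) relies on.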

How much time does counting sort require? The for loop of lines 2–3 takes time Θ(k), the for loop of lines 4–5 takes time Θ(n), the for loop of lines 7–8 takes time Θ(k), and the for loop of lines 10–12 takes time Θ(n). Thus, the overall time is Θ(n + k). In practice, we usually use counting sort when we have k = O(n), in which case the running time is Θ(n).

Counting sort beats the lower bound of Ω(n lg n) because it is not a comparison sort. In fact, no comparisons between input elements occur anywhere in the code. Instead, counting sort uses the actual values of the elements to index into an array. An important property of counting sort is that it is stable: numbers with the same value appear in the output array in the same order as they do in the input array.

8.3 Radix sort

Radix sort was used by the card-sorting machines you now find only in computer museums. Radix sort solves the problem of card sorting, counter-intuitively, by sorting on the least significant digit first. In order for radix sort to work correctly, the digit sorts must be stable.

RADIX-SORT(A, d)
1. for i = 1 to d
2.     use a stable sort to sort array A on digit i
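A sketch of least-significant-digit radix sort for nonnegative base-10 integers; for brevity it uses Python's `sorted`, which is guaranteed stable, as the per-digit stable sort (the slides would use counting sort here):

```python
def radix_sort(a, d):
    """LSD radix sort of nonnegative integers with at most d decimal digits."""
    for i in range(d):  # digit 0 is the least significant
        a = sorted(a, key=lambda x: (x // 10**i) % 10)
    return a
```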

Lemma 8.3 Given n d-digit numbers in which each digit can take on up to k possible values, RADIX-SORT correctly sorts these numbers in Θ(d(n + k)) time if the stable sort it uses takes Θ(n + k) time.

Proof. The correctness of radix sort follows by induction on the column being sorted. When each digit is in the range 0 to k − 1 (so that it can take on k possible values) and k is not too large, counting sort is the obvious choice. Each pass over n d-digit numbers then takes time Θ(n + k). There are d passes, and so the total time for radix sort is Θ(d(n + k)). ∎

8.4 Bucket sort

Bucket sort assumes that the input is drawn from a uniform distribution and has an average-case running time of O(n). Counting sort and bucket sort are fast because they assume something about the input. Counting sort: the input consists of integers in a small range. Bucket sort: the input is drawn from a random process that distributes elements uniformly and independently over the interval [0, 1).

Bucket sort divides the interval [0, 1) into n equal-sized subintervals, or buckets, and then distributes the n input numbers into the buckets. Since the inputs are uniformly and independently distributed over [0, 1), we do not expect many numbers to fall into each bucket. We sort the numbers in each bucket and then go through the buckets in order, listing the elements in each. The input is an n-element array A, and each element A[i] in the array satisfies 0 ≤ A[i] < 1. The code requires an auxiliary array B[0..n−1] of linked lists (buckets).

BUCKET-SORT(A)
1. let B[0..n−1] be a new array
2. n = A.length
3. for i = 0 to n − 1
4.     make B[i] an empty list
5. for i = 1 to n
6.     insert A[i] into list B[⌊n · A[i]⌋]
7. for i = 0 to n − 1
8.     sort list B[i] with insertion sort
9. concatenate the lists B[0], B[1], …, B[n − 1] together in order
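A minimal Python sketch of the procedure, assuming inputs in [0, 1); the per-bucket insertion sort of line 8 is replaced by Python's built-in sort for brevity:

```python
def bucket_sort(a):
    """Bucket sort for floats in [0, 1)."""
    n = len(a)
    buckets = [[] for _ in range(n)]
    for x in a:
        buckets[int(n * x)].append(x)  # bucket index = floor(n * x)
    result = []
    for b in buckets:                  # concatenate sorted buckets in order
        result.extend(sorted(b))
    return result
```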

Consider two elements A[i] and A[j], and assume w.l.o.g. that A[i] ≤ A[j]. Since ⌊n · A[i]⌋ ≤ ⌊n · A[j]⌋, element A[i] either goes into the same bucket as A[j] or into a bucket with a lower index. If A[i] and A[j] go into the same bucket, the for loop of lines 7–8 puts them into the proper order. If A[i] and A[j] go into different buckets, line 9 puts them into the proper order. Therefore, bucket sort works correctly.

Observe that all lines except line 8 take O(n) time in the worst case. We need to analyze the total time taken by the n calls to insertion sort in line 8. Let n_i be the random variable denoting the number of elements placed in bucket B[i]. Since insertion sort runs in quadratic time, the running time of bucket sort is

T(n) = Θ(n) + Σ_{i=0}^{n−1} O(n_i²).

We now compute the expected value of the running time, where we take the expectation over the input distribution. Taking expectations of both sides and using linearity of expectation, we have

E[T(n)] = Θ(n) + Σ_{i=0}^{n−1} E[O(n_i²)] = Θ(n) + Σ_{i=0}^{n−1} O(E[n_i²]).

Claim: E[n_i²] = 2 − 1/n for i = 0, 1, …, n − 1.

It is no surprise that each bucket i has the same value of E[n_i²], since each value in the input array A is equally likely to fall into any bucket. To prove the claim, define indicator random variables

X_{ij} = I{A[j] falls in bucket i}

for i = 0, 1, …, n − 1 and j = 1, 2, …, n. Thus, n_i = Σ_{j=1}^{n} X_{ij}, and

E[n_i²] = E[(Σ_{j=1}^{n} X_{ij})²]
        = E[Σ_{j=1}^{n} Σ_{k=1}^{n} X_{ij} X_{ik}]
        = Σ_{j=1}^{n} E[X_{ij}²] + Σ_{1≤j≤n} Σ_{1≤k≤n, k≠j} E[X_{ij} X_{ik}].

Each indicator random variable X_{ij} is 1 with probability 1/n and 0 otherwise; therefore

E[X_{ij}²] = 1 · (1/n) + 0 · (1 − 1/n) = 1/n.

When k ≠ j, the variables X_{ij} and X_{ik} are independent, and hence

E[X_{ij} X_{ik}] = E[X_{ij}] · E[X_{ik}] = (1/n)(1/n) = 1/n².

Substituting these two expected values, we obtain

E[n_i²] = Σ_{j=1}^{n} (1/n) + Σ_{1≤j≤n} Σ_{1≤k≤n, k≠j} (1/n²)
        = n · (1/n) + n(n − 1) · (1/n²)
        = 1 + (n − 1)/n
        = 2 − 1/n,

which proves the claim.

Using this expected value, we conclude that the average-case running time for bucket sort is

Θ(n) + Σ_{i=0}^{n−1} O(2 − 1/n) = Θ(n).

Even if the input is not drawn from a uniform distribution, bucket sort may still run in linear time: as long as the input has the property that the sum of the squares of the bucket sizes is linear in the total number of elements, bucket sort runs in linear time.

9 Medians and Order Statistics

The ith order statistic of a set of n elements is the ith smallest element. E.g., the minimum of a set of elements is the first order statistic (i = 1), and the maximum is the nth order statistic (i = n). A median is the halfway point of the set. When n is odd, the median is unique, occurring at i = (n + 1)/2.

When n is even, there are two medians, occurring at i = n/2 and i = n/2 + 1. Thus, regardless of the parity of n, medians occur at i = ⌊(n + 1)/2⌋ (the lower median) and i = ⌈(n + 1)/2⌉ (the upper median). For simplicity, we use "the median" to refer to the lower median.

We consider the problem of selecting the ith order statistic from a set of n distinct numbers. We assume that the set contains distinct numbers; virtually everything extends to the situation in which the set contains repeated values. We formally specify the problem as follows:

Input: A set A of n (distinct) numbers and an integer i, with 1 ≤ i ≤ n.
Output: The element x ∈ A that is larger than exactly i − 1 other elements of A.

We can solve the problem in O(n lg n) time by sorting with heapsort or merge sort and then simply indexing the ith element in the output array. There are faster algorithms. First, we examine the problem of selecting the minimum and maximum of a set of elements. Then we analyze a practical randomized algorithm that achieves an O(n) expected running time, assuming distinct elements.

9.1 Minimum and maximum

How many comparisons are necessary to determine the minimum of a set of n elements? We can easily obtain an upper bound of n − 1 comparisons: examine each element of the set in turn and keep track of the smallest element seen so far. In the following procedure, we assume that the set resides in array A, where A.length = n.

MINIMUM(A)
1. min = A[1]
2. for i = 2 to A.length
3.     if min > A[i]
4.         min = A[i]
5. return min

We can, of course, find the maximum with n − 1 comparisons as well.

Simultaneous minimum and maximum Sometimes, we must nd both the minimum and the maximum of a set of elements For example, a graphics program may need to scale a set of (, ) data to t onto a rectangular display screen or other graphical output device To do so, the program must rst determine the minimum and maximum value of each coordinate MAT-72006 AADS, Fall 2017 20-Sep-17 228 () comparisons is asymptotically optimal: Simply nd the minimum and maximum independently, using 1comparisons for each, for a total of 2 2comparisons In fact, we can nd both the minimum and the maximum using at most 3 /2 comparisons by maintaining both the minimum and maximum elements seen thus far Rather than processing each element of the input by comparing it against the current minimum and maximum, we process elements in pairs MAT-72006 AADS, Fall 2017 20-Sep-17 229 18

We compare pairs of input elements first with each other, and then compare the smaller with the current minimum and the larger with the current maximum, at a cost of 3 comparisons for every 2 elements.

If n is odd, we set both the minimum and the maximum to the value of the first element, and then process the rest of the elements in pairs. If n is even, we perform 1 comparison on the first 2 elements to determine the initial values of the minimum and maximum, and then process the rest of the elements in pairs as in the case for odd n.

If n is odd, then we perform 3⌊n/2⌋ comparisons in total. If n is even, we perform 1 initial comparison followed by 3(n − 2)/2 comparisons, for a total of 3n/2 − 2. Thus, in either case, the total number of comparisons is at most 3⌊n/2⌋.
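The pairing scheme above can be sketched in Python as follows (a minimal version; the function name is our own):

```python
def min_max(a):
    """Find (min, max) of a with at most 3*floor(n/2) element comparisons
    by processing the input in pairs."""
    n = len(a)
    if n % 2 == 1:              # odd n: both start at the first element
        lo = hi = a[0]
        start = 1
    else:                       # even n: 1 comparison initializes both
        lo, hi = (a[0], a[1]) if a[0] < a[1] else (a[1], a[0])
        start = 2
    for i in range(start, n, 2):
        x, y = a[i], a[i + 1]
        if x < y:               # 1 comparison within the pair,
            lo = min(lo, x)     # then smaller vs. current minimum
            hi = max(hi, y)     # and larger vs. current maximum
        else:
            lo = min(lo, y)
            hi = max(hi, x)
    return lo, hi
```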

9.2 Selection in expected linear time

The selection problem appears more difficult than finding a minimum, but the asymptotic running time for both problems is the same: Θ(n). The divide-and-conquer algorithm RANDOMIZED-SELECT is modeled after the quicksort algorithm. Unlike quicksort, however, RANDOMIZED-SELECT works on only one side of the partition. Whereas quicksort has an expected running time of Θ(n lg n), the expected running time of RANDOMIZED-SELECT is Θ(n), assuming that the elements are distinct.

RANDOMIZED-SELECT uses the procedure RANDOMIZED-PARTITION of RANDOMIZED-QUICKSORT:

RANDOMIZED-PARTITION(A, p, r)
1. i = RANDOM(p, r)
2. exchange A[r] with A[i]
3. return PARTITION(A, p, r)

Partitioning the array:

PARTITION(A, p, r)
1. x = A[r]
2. i = p − 1
3. for j = p to r − 1
4.     if A[j] ≤ x
5.         i = i + 1
6.         exchange A[i] with A[j]
7. exchange A[i + 1] with A[r]
8. return i + 1

PARTITION always selects the element x = A[r] as a pivot element around which to partition the subarray A[p..r]. As the procedure runs, it partitions the array into four (possibly empty) regions. At the start of each iteration of the for loop in lines 3–6, these regions satisfy the following properties for any array index k:

1. If p ≤ k ≤ i, then A[k] ≤ x.
2. If i + 1 ≤ k ≤ j − 1, then A[k] > x.
3. If k = r, then A[k] = x.

Indices between j and r − 1 are not covered by any of the three cases, and the values in these entries have no particular relationship to the pivot x. The running time of PARTITION on the subarray A[p..r] is Θ(n), where n = r − p + 1.

RANDOMIZED-SELECT returns the ith smallest element of A[p..r]:

RANDOMIZED-SELECT(A, p, r, i)
1. if p == r
2.     return A[p]
3. q = RANDOMIZED-PARTITION(A, p, r)
4. k = q − p + 1
5. if i == k // the pivot value is the answer
6.     return A[q]
7. elseif i < k
8.     return RANDOMIZED-SELECT(A, p, q − 1, i)
9. else return RANDOMIZED-SELECT(A, q + 1, r, i − k)

Line 1 checks for the base case of the recursion. Otherwise, RANDOMIZED-PARTITION partitions A[p..r] into two (possibly empty) subarrays A[p..q−1] and A[q+1..r] such that each element of the former is less than or equal to A[q], which in turn is less than each element of the latter. Line 4 computes the number k of elements in the subarray A[p..q]. Line 5 checks whether A[q] is the ith smallest element. If not, we determine in which of the two subarrays the ith smallest element lies. If i < k, then the desired element lies on the low side of the partition, and line 8 recursively selects it.
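A self-contained Python sketch of the procedure, translated to 0-based indexing with the partition step inlined (the rank i remains 1-based, as in the pseudocode):

```python
import random

def randomized_select(a, p, r, i):
    """Return the i-th smallest (1-based) element of a[p..r], in place.

    Assumes distinct elements; runs in expected O(n) time.
    """
    if p == r:
        return a[p]
    # randomized partition: move a random pivot to position r, then partition
    j = random.randint(p, r)
    a[j], a[r] = a[r], a[j]
    x, q = a[r], p - 1
    for m in range(p, r):
        if a[m] <= x:
            q += 1
            a[q], a[m] = a[m], a[q]
    a[q + 1], a[r] = a[r], a[q + 1]
    q += 1                      # pivot's final index
    k = q - p + 1               # rank of the pivot within a[p..r]
    if i == k:                  # the pivot value is the answer
        return a[q]
    elif i < k:
        return randomized_select(a, p, q - 1, i)
    else:
        return randomized_select(a, q + 1, r, i - k)
```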

If i > k, then the desired element lies on the high side of the partition. Since we already know k values that are smaller than the ith smallest element of A[p..r] (namely, the elements of A[p..q]), the desired element is the (i − k)th smallest element of A[q+1..r], which line 9 finds recursively.

The worst-case running time for RANDOMIZED-SELECT is Θ(n²), even to find the minimum. The algorithm has a linear expected running time, though, and because it is randomized, no particular input elicits the worst-case behavior. Let the running time of RANDOMIZED-SELECT on an input array A[p..r] of n elements be a random variable T(n). We obtain an upper bound on E[T(n)] as follows.

The procedure RANDOMIZED-PARTITION is equally likely to return any element as the pivot. Therefore, for each k such that 1 ≤ k ≤ n, the subarray A[p..q] has k elements (all less than or equal to the pivot) with probability 1/n. For k = 1, 2, …, n, define indicator random variables

X_k = I{the subarray A[p..q] has exactly k elements}.

Assuming that the elements are distinct, we have E[X_k] = 1/n. When we choose A[q] as the pivot element, we do not know, a priori, (1) whether we will terminate immediately with the correct answer, (2) recurse on the subarray A[p..q−1], or (3) recurse on the subarray A[q+1..r]. This decision depends on where the ith smallest element falls relative to A[q].

Assuming T(n) to be monotonically increasing, we can upper-bound the time needed for a recursive call by the time needed for a recursive call on the largest possible input. That is, to obtain an upper bound, we assume that the ith element is always on the side of the partition with the greater number of elements. For a given call of RANDOMIZED-SELECT, the indicator random variable X_k has the value 1 for exactly one value of k, and it is 0 for all other k. When X_k = 1, the two subarrays on which we might recurse have sizes k − 1 and n − k. Hence, we have the recurrence

T(n) ≤ Σ_{k=1}^{n} X_k · (T(max(k − 1, n − k)) + O(n))
     = Σ_{k=1}^{n} X_k · T(max(k − 1, n − k)) + O(n).

Taking expected values, we have

E[T(n)] ≤ E[Σ_{k=1}^{n} X_k · T(max(k − 1, n − k))] + O(n)
        = Σ_{k=1}^{n} E[X_k · T(max(k − 1, n − k))] + O(n)

        = Σ_{k=1}^{n} E[X_k] · E[T(max(k − 1, n − k))] + O(n)
        = Σ_{k=1}^{n} (1/n) · E[T(max(k − 1, n − k))] + O(n).

The second-to-last equation follows from the independence of the random variables X_k and T(max(k − 1, n − k)). Consider the expression

max(k − 1, n − k) = k − 1 if k > ⌈n/2⌉, and n − k if k ≤ ⌈n/2⌉.

If n is even, each term from T(⌈n/2⌉) up to T(n − 1) appears exactly twice in the summation, and if n is odd, all these terms appear twice and T(⌊n/2⌋) appears once. Thus, we have

E[T(n)] ≤ (2/n) Σ_{k=⌊n/2⌋}^{n−1} E[T(k)] + O(n).

We can show that E[T(n)] = O(n) by substitution. In summary, we can find any order statistic, and in particular the median, in expected linear time, assuming that the elements are distinct.

MAT-72006 AADS, Fall 2017, 20-Sep-17
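The substitution step can be sketched as follows. Assume inductively that E[T(k)] ≤ ck for all k < n and some constant c > 0, and let the O(n) term be at most an. Then

```latex
\mathrm{E}[T(n)]
  \le \frac{2}{n}\sum_{k=\lfloor n/2\rfloor}^{n-1} ck + an
  =   \frac{2c}{n}\left(\sum_{k=1}^{n-1}k-\sum_{k=1}^{\lfloor n/2\rfloor-1}k\right)+an
  \le \frac{2c}{n}\left(\frac{(n-1)n}{2}-\frac{(n/2-2)(n/2-1)}{2}\right)+an
  =   c\left(\frac{3n}{4}+\frac{1}{2}-\frac{2}{n}\right)+an,
```

which is at most cn once c is chosen large enough relative to the constant a (so that the slack cn/4 absorbs the remaining terms for sufficiently large n), completing the induction.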