Lecture 3: Sorting 1

 Ernest Oliver
Transcription
1 Lecture 3: Sorting 1
2 Sorting
Sorting: arranging an unordered collection of elements into monotonically increasing (or decreasing) order.
$S = \langle a_1, a_2, \ldots, a_n \rangle$ is a sequence of $n$ elements in arbitrary order.
After sorting, $S$ is transformed into a permutation $S' = \langle a'_1, a'_2, \ldots, a'_n \rangle$ of $S$ such that $a'_i \le a'_j$ for $1 \le i \le j \le n$.
Internal vs. external sorting:
  Internal sorting: the elements to be sorted fit into the processor's main memory.
  External sorting: a huge number of elements, stored in auxiliary storage (hard disks).
3 Comparison-based vs. Non-comparison-based Sorting
Comparison-based sorting: repeatedly compare pairs of elements and, if they are out of order, exchange them => compare-exchange operation.
  Theorem: the lower bound of any comparison-based sorting algorithm is $\Omega(n \log n)$, i.e. no comparison-based algorithm can be asymptotically faster than $n \log n$ in the worst case.
Non-comparison-based sorting: uses certain known properties of the elements (e.g. their distribution).
  Lower bound: $\Omega(n)$
4 Issues in Parallel Sorting
Where are the input and output sequences stored?
  Stored on only one process.
  Distributed among the processes => assumed here.
How are comparisons performed?
  One element per process:
5 Compare-Split: More than One Element Per Process
Each process is assigned a block of $n/p$ elements; block $A_i$ is assigned to process $P_i$.
$A_i \le A_j$ if every element of $A_i$ is less than or equal to every element of $A_j$.
After sorting, each $P_i$ holds a set $A'_i$ such that $A'_i \le A'_j$ for $i \le j$.
Compare-split operation (assume the blocks are already sorted at each process):
  1. Each process sends its block to the other process.
  2. Each process merges the two sorted blocks and retains only the appropriate half of the merged block.
Communication cost: $t_s + t_w (n/p)$
Total time: $\Theta(n/p)$
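The compare-split step above can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function name `compare_split` and the use of two in-memory lists in place of two communicating processes are assumptions.

```python
from heapq import merge


def compare_split(block_low, block_high):
    """Simulate one compare-split between two processes.

    Both inputs are assumed to be sorted already. The process that owns
    block_low keeps the smaller half of the merged data, and the process
    that owns block_high keeps the larger half.
    """
    merged = list(merge(block_low, block_high))  # linear-time merge of sorted blocks
    k = len(block_low)                           # each process keeps its original block size
    return merged[:k], merged[k:]


low, high = compare_split([1, 6, 8, 11, 13], [2, 7, 9, 10, 12])
```

After the call every element of `low` is less than or equal to every element of `high`, which is exactly the block ordering $A_i \le A_j$ defined on the slide.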
6 Compare-Split
7 Comparators: Sorting Networks
Increasing comparator
Decreasing comparator
8 Sorting Networks
Depth of a network = number of columns.
The speed of a network is determined by its depth.
9 Bitonic Sort
Bitonic sequence: a sequence of elements $\langle a_0, a_1, \ldots, a_{n-1} \rangle$ with the property that either:
  1. There exists an index $i$, $0 \le i \le n-1$, such that $\langle a_0, a_1, \ldots, a_i \rangle$ is monotonically increasing and $\langle a_{i+1}, a_{i+2}, \ldots, a_{n-1} \rangle$ is monotonically decreasing, or
  2. There exists a cyclic shift of indices so that (1) is satisfied.
Examples:
  <1, 2, 4, 7, 6, 0>
  <8, 9, 2, 1, 0, 4>, a cyclic shift of <0, 4, 8, 9, 2, 1>
10 Bitonic Split
Let $s = \langle a_0, a_1, \ldots, a_{n-1} \rangle$ be a bitonic sequence such that $a_0 \le a_1 \le \cdots \le a_{n/2-1}$ and $a_{n/2} \ge a_{n/2+1} \ge \cdots \ge a_{n-1}$.
Bitonic split:
  $s_1 = \langle \min\{a_0, a_{n/2}\}, \min\{a_1, a_{n/2+1}\}, \ldots, \min\{a_{n/2-1}, a_{n-1}\} \rangle$
  $s_2 = \langle \max\{a_0, a_{n/2}\}, \max\{a_1, a_{n/2+1}\}, \ldots, \max\{a_{n/2-1}, a_{n-1}\} \rangle$
There exists $b_i$ in $s_1$ such that all the elements before $b_i$ are from the increasing part of $s$ and all the elements after $b_i$ are from the decreasing part of $s$.
There exists $b_i$ in $s_2$ such that all the elements before $b_i$ are from the decreasing part of $s$ and all the elements after $b_i$ are from the increasing part of $s$.
=> $s_1$ and $s_2$ are bitonic.
Every element of $s_1$ is less than or equal to every element of $s_2$.
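A minimal Python sketch of the bitonic split, assuming the sequence length is even. The function name `bitonic_split` and the sample sequence are illustrative choices, not taken from the slides.

```python
def bitonic_split(s):
    """Split a bitonic sequence of even length into (s1, s2).

    s1 collects the element-wise minima of the two halves and s2 the
    element-wise maxima, as in the definition of the bitonic split.
    """
    half = len(s) // 2
    s1 = [min(s[i], s[i + half]) for i in range(half)]
    s2 = [max(s[i], s[i + half]) for i in range(half)]
    return s1, s2


# An increasing run followed by a decreasing run, so the input is bitonic.
s = [3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0]
s1, s2 = bitonic_split(s)
```

For this input `s1` is `[3, 5, 8, 9, 10, 12, 14, 0]` and `s2` is `[95, 90, 60, 40, 35, 23, 18, 20]`: both are bitonic, and no element of `s1` exceeds any element of `s2`.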
11 Bitonic Split
[Figure: a bitonic sequence $a_0, \ldots, a_{11}$ and the two subsequences $s_1$ and $s_2$ produced by the bitonic split]
12 Bitonic Merge
A bitonic split reduces the problem of rearranging a bitonic sequence of size $n$ to that of rearranging two smaller bitonic sequences and concatenating the results.
Bitonic merge: apply the procedure recursively until the length of the sequences is 1 => sorted sequence.
Takes $\log n$ levels of bitonic splits.
Can be implemented on a network of comparators => bitonic merging network.
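The recursion can be sketched as follows. This is an illustrative serial version, assuming the input is bitonic and its length is a power of two; the name `bitonic_merge` is hypothetical.

```python
def bitonic_merge(s):
    """Sort a bitonic sequence whose length is a power of two.

    Each call performs one bitonic split and recurses on both halves,
    so the recursion depth is log2(len(s)).
    """
    n = len(s)
    if n <= 1:
        return list(s)
    half = n // 2
    s1 = [min(s[i], s[i + half]) for i in range(half)]  # smaller elements, still bitonic
    s2 = [max(s[i], s[i + half]) for i in range(half)]  # larger elements, still bitonic
    return bitonic_merge(s1) + bitonic_merge(s2)


s = [3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0]
result = bitonic_merge(s)
```

Concatenating the two recursive results is valid because every element of `s1` is less than or equal to every element of `s2`.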
13 Bitonic Merge: Example
14 Bitonic Merging Network
⊕BM[16]: $\log n$ columns of $n/2$ comparators
15 Bitonic Sorting Network
16 Bitonic Sorting Network
First three stages detailed
17 Bitonic Sorting Network: Complexity
Depth $d(n)$:
  $d(n) = d(n/2) + \log n$
  $d(n) = 1 + 2 + \cdots + \log n = (\log^2 n + \log n)/2$
  $d(n) = \Theta(\log^2 n)$
On a serial computer => $\Theta(n \log^2 n)$
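The closed form for the depth follows by unrolling the recurrence $d(n) = d(n/2) + \log n$. The derivation below is a sketch that assumes $n$ is a power of two and takes $d(1) = 0$ as the base case, which the slide does not state explicitly.

```latex
\begin{align*}
d(n) &= d(n/2) + \log n \\
     &= d(n/4) + \log(n/2) + \log n \\
     &= d(1) + \sum_{k=0}^{\log n - 1} \log\frac{n}{2^k} \\
     &= \sum_{k=0}^{\log n - 1} (\log n - k)
      = \sum_{i=1}^{\log n} i
      = \frac{\log^2 n + \log n}{2}
\end{align*}
```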
18 Mapping Bitonic Sort on Parallel Computers
Bitonic sort is communication intensive => we need to take into account the topology of the interconnection network.
One element per process:
  Each wire of the bitonic sorting network represents a distinct process.
  The compare-exchange operations are performed by $n/2$ pairs of processes.
Good mapping: minimizes the distance that the elements travel during a compare-exchange operation.
Observation: wires whose labels differ in the $i$th least-significant bit perform a compare-exchange operation $(\log n - i + 1)$ times.
19 Bitonic Sort on a Hypercube
Bitonic sort property: compare-exchange operations occur only between wires whose labels differ in exactly one bit.
Hypercube property: processes whose labels differ in exactly one bit are neighbors.
Optimal mapping: wire $l$ to process $l$, $l = 0, 1, \ldots, n-1$.
In the $i$th step of the final stage, processes communicate along the $(d - (i - 1))$th dimension, where $d = \log n$ is the dimension of the hypercube.
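The mapping can be checked with a small serial simulation in which list index `rank` plays the role of hypercube process `rank`. This is a sketch under stated assumptions: one element per process, $n$ a power of two, and dimensions numbered from 0 (least-significant bit) rather than from 1 as on the slide. The function name `hypercube_bitonic_sort` is hypothetical.

```python
def hypercube_bitonic_sort(values):
    """Simulate bitonic sort with one element per hypercube process.

    Returns the sorted list and the sequence of hypercube dimensions
    (0 = least-significant bit) used by the compare-exchange steps.
    """
    n = len(values)
    d = n.bit_length() - 1                  # hypercube dimension, n = 2**d assumed
    a = list(values)
    dims_used = []
    for stage in range(d):                  # stage merges bitonic runs of size 2**(stage+1)
        for dim in range(stage, -1, -1):    # one step per dimension, highest first
            dims_used.append(dim)
            for rank in range(n):
                partner = rank ^ (1 << dim)  # neighbor across dimension dim
                if rank < partner:
                    # Bit (stage+1) of the label selects increasing or decreasing order.
                    ascending = ((rank >> (stage + 1)) & 1) == 0
                    if (a[rank] > a[partner]) == ascending:
                        a[rank], a[partner] = a[partner], a[rank]
    return a, dims_used


sorted_values, dims = hypercube_bitonic_sort([5, 2, 7, 1, 8, 3, 6, 4])
```

For $n = 8$ the dimensions used are 0, 1, 0, 2, 1, 0: every partner differs in exactly one bit, the final stage walks the dimensions from highest to lowest, and the least-significant dimension is used $\log n = 3$ times while the most-significant one is used once.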
20 Bitonic Sort on a Hypercube
Communication during the last stage of bitonic sort
21 Bitonic Sort on a Hypercube
Communication during the last stage of bitonic sort
22 Bitonic Sort on a Hypercube
Hypercube dimensions involved in bitonic sort communication
23 Bitonic Sort on a Hypercube
$T_p = 1 + 2 + \cdots + \log n = (\log^2 n + \log n)/2$
$T_p = \Theta(\log^2 n)$
Cost $= \Theta(n \log^2 n)$
=> cost optimal with respect to the sequential bitonic sort
=> not cost optimal with respect to the best comparison-based algorithm
24 Bitonic Sort on a Mesh
A mesh has lower connectivity than a hypercube.
It is impossible to map wires to processes such that each compare-exchange operation occurs only between neighboring processes.
Map wires such that the most frequent compare-exchange operations occur between neighbors.
25 Bitonic Sort on a Mesh
Mappings: row-major, row-major snakelike, row-major shuffled
26 Bitonic Sort on a Mesh
Row-major shuffled mapping
27 Bitonic Sort on a Mesh
Property: wires that differ in the $i$th least-significant bit are mapped onto mesh processes that are $2^{\lfloor (i-1)/2 \rfloor}$ links away.
Assuming store-and-forward routing:
  $t_{comm} = \sum_{i=1}^{\log n} \sum_{j=1}^{i} 2^{\lfloor (j-1)/2 \rfloor} \approx 7\sqrt{n} = \Theta(n^{1/2})$
Parallel run time $= \Theta(\log^2 n) + \Theta(n^{1/2})$
Cost $= \Theta(n^{1.5})$ => not cost optimal
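The double sum for the communication distance is easy to evaluate numerically. The sketch below assumes $n$ is a power of two and takes $\log n$ as its argument; the function name `mesh_comm_hops` is hypothetical.

```python
def mesh_comm_hops(log_n):
    """Total links traversed per process under the row-major shuffled mapping.

    Evaluates sum_{i=1}^{log n} sum_{j=1}^{i} 2^floor((j-1)/2).
    """
    return sum(2 ** ((j - 1) // 2)
               for i in range(1, log_n + 1)
               for j in range(1, i + 1))


hops = mesh_comm_hops(10)  # n = 1024, sqrt(n) = 32
```

For $n = 1024$ the sum is 197, compared with $7\sqrt{n} = 224$. When $\log n$ is even the sum works out to $7\sqrt{n} - 7 - 2\log n$, so the $7\sqrt{n}$ term dominates as $n$ grows.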
28 More than One Element Per Process
Simple way: $n/p$ virtual processes on each processor; on a hypercube, $T_p = \Theta((n/p) \log^2 n)$ => not a cost-optimal system.
Another way: use compare-split operations.
  The problem reduces to bitonic sorting of $p$ blocks using compare-split operations, plus a local sort of $n/p$ elements.
  Total number of steps required: $(1 + \log p)(\log p)/2$
29 Bitonic Sort on a Hypercube ($p < n$)
A compare-split operation takes $\Theta(n/p)$ computation time and $\Theta(n/p)$ communication time.
Analysis:
  Initial local sort: $\Theta((n/p) \log (n/p))$
  Comparisons: $\Theta((n/p) \log^2 p)$
  Communication: $\Theta((n/p) \log^2 p)$
$T_p = \Theta((n/p) \log (n/p)) + \Theta((n/p) \log^2 p)$
Cost optimal if $\log^2 p = \Theta(\log n)$ => $p = \Theta(2^{\sqrt{\log n}})$
Isoefficiency function: $W = \Theta(p^{\log p} \log^2 p) = \Theta(2^{\log^2 p} \log^2 p)$ => for large $p$ this is worse than any polynomial => poor scalability
30 Bitonic Sort on a Mesh ($p < n$)
A compare-split operation takes $\Theta(n/p)$ computation time and $\Theta(n/p)$ communication time.
Analysis:
  Initial local sort: $\Theta((n/p) \log (n/p))$
  Comparisons: $\Theta((n/p) \log^2 p)$
  Communication: $\Theta((n/p)\, p^{1/2}) = \Theta(n/p^{1/2})$
$T_p = \Theta((n/p) \log (n/p)) + \Theta((n/p) \log^2 p) + \Theta(n/p^{1/2})$
Cost optimal if $p^{1/2} = \Theta(\log n)$ => $p = \Theta(\log^2 n)$
Isoefficiency function: $W = \Theta(2^{p^{1/2}}\, p^{1/2})$ => exponential function => very poor scalability!
31 Bubble Sort and its Variants
Sequential bubble sort:
  First iteration: $n-1$ compare-exchange operations $(a_1, a_2), (a_2, a_3), \ldots, (a_{n-1}, a_n)$ => moves the largest element to the end.
  Second iteration: $n-2$ compare-exchange operations on the first $n-1$ elements of the resulting sequence.
  After $n-1$ iterations the sequence is sorted.
32 Sequential Bubble Sort
$T_s = \Theta(n^2)$
It compares all adjacent pairs in order => inherently sequential => difficult to parallelize!
33 Odd-Even Transposition
Sorts $n$ elements in $n$ phases ($n$ even), each phase using up to $n/2$ compare-exchange operations.
Odd phase: compare-exchange between the elements with odd indices and their right neighbors: $(a_1, a_2), (a_3, a_4), \ldots, (a_{n-1}, a_n)$
Even phase: compare-exchange between the elements with even indices and their right neighbors: $(a_2, a_3), (a_4, a_5), \ldots, (a_{n-2}, a_{n-1})$
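A serial Python sketch of the odd and even phases. The slides index elements from 1, while the code indexes from 0, so the odd phase starts at list index 0; the function name `odd_even_transposition_sort` is an illustrative choice.

```python
def odd_even_transposition_sort(seq):
    """Sort a copy of seq using n alternating odd and even phases."""
    a = list(seq)
    n = len(a)
    for phase in range(1, n + 1):
        # Odd phase pairs (a_1, a_2), (a_3, a_4), ...: 0-based indices 0, 2, ...
        # Even phase pairs (a_2, a_3), (a_4, a_5), ...: 0-based indices 1, 3, ...
        start = 0 if phase % 2 == 1 else 1
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:               # compare-exchange
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


result = odd_even_transposition_sort([3, 2, 3, 8, 5, 6, 4, 1])
```

All compare-exchange operations inside one phase touch disjoint pairs, which is what lets a parallel version run each phase in constant time.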
34 Sequential Odd-Even Transposition
$T_s = \Theta(n^2)$
35 Parallel Odd-Even Transposition
Assumes an $n$-processor ring and one element per process.
During each phase the compare-exchange operations are performed simultaneously => $\Theta(1)$ per phase.
$T_p = \Theta(n)$
Cost $= \Theta(n^2)$ => not cost-optimal
36 Parallel Odd-Even Transposition
More than one element per process ($p < n$):
  Initial local sort of $n/p$ elements => $\Theta((n/p) \log (n/p))$
  $p$ phases ($p/2$ odd, $p/2$ even) using compare-split operations.
  Each phase: computation $\Theta(n/p)$, communication $\Theta(n/p)$ => total for $p$ phases: computation $\Theta(n)$, communication $\Theta(n)$
$T_p = \Theta((n/p) \log (n/p)) + \Theta(n) + \Theta(n)$
Cost optimal when $p = \Theta(\log n)$
Isoefficiency function: $\Theta(p\, 2^p)$ => exponential => poorly scalable!
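The block version can be simulated serially by replacing each compare-exchange with a compare-split. This sketch assumes equal-sized blocks and models the $p$ processes as a list of lists; the name `block_odd_even_sort` is hypothetical.

```python
from heapq import merge


def block_odd_even_sort(blocks):
    """Simulate odd-even transposition with one block per process."""
    blocks = [sorted(b) for b in blocks]      # initial local sort on every process
    p = len(blocks)
    for phase in range(1, p + 1):             # p phases in total
        start = 0 if phase % 2 == 1 else 1    # odd phase starts at the first process
        for i in range(start, p - 1, 2):
            merged = list(merge(blocks[i], blocks[i + 1]))
            k = len(blocks[i])
            # Compare-split: left process keeps the small half, right the large half.
            blocks[i], blocks[i + 1] = merged[:k], merged[k:]
    return blocks


result = block_odd_even_sort([[9, 4], [7, 1], [8, 3], [2, 6]])
```

Each compare-split moves and merges $\Theta(n/p)$ elements, which is where the $\Theta(n)$ totals over $p$ phases come from.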
37 Parallel Shellsort
Compare-exchange between non-adjacent elements.
First stage: processes that are far away from each other in the array compare-split their elements.
Second stage: odd-even transposition in which the odd and even phases are performed only as long as the blocks on the processes are changing.
In the first stage the elements are moved close to their final destination => fewer odd and even phases in the second stage.
38 Parallel Shellsort
First phase
39 Parallel Shellsort: Analysis
Initially: local sort of $n/p$ elements: $\Theta((n/p) \log (n/p))$
First stage: each process performs $\log p$ compare-split operations.
  Assuming a bisection bandwidth of $\Theta(p)$, each compare-split operation takes $\Theta(n/p)$ => the first stage takes $\Theta((n/p) \log p)$
Second stage: $l < p$ odd and even phases are performed, each taking $\Theta(n/p)$ => the second stage takes $\Theta(l\,(n/p))$
$T_p = \Theta((n/p) \log (n/p)) + \Theta((n/p) \log p) + \Theta(l\,(n/p))$
If $l$ is small => better than odd-even transposition
If $l = \Theta(p)$ => similar to odd-even transposition
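Both stages can be simulated serially as below. This is a sketch under assumptions that the slides do not spell out: $p$ is a power of two, blocks have equal size, the first stage pairs each process with its mirror image inside groups of size $p, p/2, \ldots, 2$, and the second stage stops after one odd and one even phase in a row change nothing. The names `compare_split` and `parallel_shellsort` are hypothetical.

```python
from heapq import merge


def compare_split(blocks, i, j):
    """Process i keeps the smaller half, process j the larger; True if data moved."""
    merged = list(merge(blocks[i], blocks[j]))
    k = len(blocks[i])
    low, high = merged[:k], merged[k:]
    changed = low != blocks[i] or high != blocks[j]
    blocks[i], blocks[j] = low, high
    return changed


def parallel_shellsort(blocks):
    """Return (sorted blocks, number of odd/even phases run in stage two)."""
    blocks = [sorted(b) for b in blocks]      # initial local sort
    p = len(blocks)
    size = p
    while size > 1:                           # stage 1: log p long-distance compare-splits
        for start in range(0, p, size):
            for k in range(size // 2):
                compare_split(blocks, start + k, start + size - 1 - k)
        size //= 2
    phases, quiet, odd = 0, 0, True
    while quiet < 2:                          # stage 2: stop after two unchanged phases
        changed = False
        for i in range(0 if odd else 1, p - 1, 2):
            changed = compare_split(blocks, i, i + 1) or changed
        quiet = 0 if changed else quiet + 1
        phases += 1
        odd = not odd
    return blocks, phases


result, phases = parallel_shellsort(
    [[15, 3], [9, 12], [1, 8], [14, 6], [2, 11], [7, 16], [5, 10], [13, 4]])
```

The returned phase count includes the two final phases that only detect termination, so it is an upper bound on the $l$ of the analysis.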
40 Sequential Quicksort
Complexity in the average case: $\Theta(n \log n)$
Divide-and-conquer algorithm:
  First step => divide: $A[q \ldots r]$ is partitioned into two nonempty subsequences $A[q \ldots s]$ and $A[s+1 \ldots r]$ such that each element of the first is smaller than or equal to each element of the second.
  Second step => conquer: sort the two subsequences by recursively applying quicksort.
41 Sequential Quicksort Algorithm
Pivot: assumed to be the first element of the subsequence.
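The algorithm listing itself did not survive in this transcript. The Python sketch below is one common way to write quicksort with the first element as pivot; it is not necessarily the exact pseudocode from the slide. In particular, it leaves the pivot out of both recursive calls, an assumption made here so that inputs with repeated keys always terminate.

```python
def quicksort(a, q=0, r=None):
    """Sort a[q..r] in place, using a[q] as the pivot; returns a for convenience."""
    if r is None:
        r = len(a) - 1
    if q < r:
        pivot = a[q]
        s = q                                  # a[q+1..s] holds elements <= pivot
        for i in range(q + 1, r + 1):
            if a[i] <= pivot:
                s += 1
                a[s], a[i] = a[i], a[s]
        a[q], a[s] = a[s], a[q]                # move the pivot to its final position
        quicksort(a, q, s - 1)                 # conquer the left part
        quicksort(a, s + 1, r)                 # conquer the right part
    return a


data = quicksort([3, 2, 1, 5, 8, 4, 3, 7])
```

The partitioning loop visits each element of the subsequence once, which is the $\Theta(k)$ cost for a subsequence of size $k$.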
42 Sequential Quicksort: Analysis
Complexity of partitioning a sequence of size $k$: $\Theta(k)$
Worst case: always partition into one sequence of one element and one of $k-1$ elements => $T(n) = T(n-1) + \Theta(n)$ => $T(n) = \Theta(n^2)$
Best case: always partition into two subsequences of roughly equal size => $T(n) = 2T(n/2) + \Theta(n)$ => $T(n) = \Theta(n \log n)$
Average case: equally likely to break the sequence into partitions of size 0 and $n-1$, 1 and $n-2$, etc. => $T(n) = \frac{1}{n} \sum_{k=0}^{n-1} (T(k) + T(n-1-k)) + \Theta(n)$ => $T(n) = \Theta(n \log n)$
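The average-case recurrence can be evaluated numerically to see the $n \log n$ growth. The sketch below makes two assumptions not on the slide: the $\Theta(n)$ partitioning term is taken to be exactly $n - 1$ comparisons, and $T(0) = T(1) = 0$. The function name `average_case_cost` is hypothetical.

```python
from math import log2


def average_case_cost(n_max):
    """Tabulate T(n) = (n - 1) + (1/n) * sum_{k=0}^{n-1} (T(k) + T(n-1-k))."""
    t = [0.0] * (n_max + 1)
    prefix = 0.0                       # running sum T(0) + ... + T(n-1)
    for n in range(1, n_max + 1):
        prefix += t[n - 1]
        # By symmetry the sum of T(k) + T(n-1-k) equals twice the prefix sum.
        t[n] = (n - 1) + 2.0 * prefix / n
    return t


t = average_case_cost(1000)
ratio = t[1000] / (1000 * log2(1000))
```

The ratio of $T(n)$ to $n \log_2 n$ stays bounded by a small constant, consistent with $T(n) = \Theta(n \log n)$.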
43 Parallelizing Quicksort
Naïve parallelization: execute quicksort initially on a single process, then assign one of the subproblems to another process, and so on.
  Upon termination each process holds one element.
  Major drawback: one process must partition the initial array of size $n$ => the run time is bounded below by $\Omega(n)$ => not cost optimal.
Efficient parallelization: do the partitioning steps in parallel!
44 Parallelizing Quicksort
45 Parallelizing Quicksort
46 Efficient Global Rearrangement
47 Parallelizing Quicksort
Four steps for splitting:
  1. Determine and broadcast the pivot => $\Theta(\log p)$
  2. Locally rearrange the array assigned to each process => $\Theta(n/p)$
  3. Determine the locations in the globally rearranged array that the local elements will go to => two prefix-sum operations => $\Theta(\log p)$
  4. Perform the global rearrangement => $\Theta(n/p)$ (lower bound)
These four steps are executed $\log p$ times => total time for splitting: $\Theta((n/p) \log p) + \Theta(\log^2 p)$
The last phase is to sort the elements locally => $\Theta((n/p) \log (n/p))$
$T_p = \Theta((n/p) \log (n/p)) + \Theta((n/p) \log p) + \Theta(\log^2 p)$
Isoefficiency: $\Theta(p \log^2 p)$
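Steps 2 to 4 of one split can be sketched serially, with a list of lists standing in for the per-process blocks and one flat list standing in for the globally rearranged array. The function name `parallel_split` and the sample data are illustrative assumptions; the pivot (step 1) is simply passed in.

```python
from itertools import accumulate


def parallel_split(blocks, pivot):
    """Return (globally rearranged array, number of elements <= pivot)."""
    # Step 2: every process partitions its own block around the pivot.
    smaller = [[x for x in b if x <= pivot] for b in blocks]
    larger = [[x for x in b if x > pivot] for b in blocks]
    # Step 3: two exclusive prefix sums give each process its write offsets.
    s_counts = [len(s) for s in smaller]
    l_counts = [len(l) for l in larger]
    s_offsets = [0] + list(accumulate(s_counts))[:-1]
    l_offsets = [0] + list(accumulate(l_counts))[:-1]
    total_small = sum(s_counts)
    out = [None] * (total_small + sum(l_counts))
    # Step 4: each process writes its two pieces to disjoint destination ranges.
    for i in range(len(blocks)):
        out[s_offsets[i]:s_offsets[i] + s_counts[i]] = smaller[i]
        start = total_small + l_offsets[i]
        out[start:start + l_counts[i]] = larger[i]
    return out, total_small


blocks = [[7, 13, 18, 2], [17, 1, 14, 20], [6, 10, 15, 9], [3, 16, 19, 4], [11, 12, 5, 8]]
rearranged, boundary = parallel_split(blocks, pivot=7)
```

Because the destination ranges are disjoint, all processes can write at the same time; the first `boundary` positions then go to one group of processes and the rest to the other for the next split.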
48 Pivot Selection
The split should be done such that each partition has a nontrivial fraction of the original array.
Random: during the $i$th split, one process in each group randomly selects one of its elements to be the pivot.
  Bad for the parallel implementation: a poorly chosen pivot leaves the process groups with unbalanced loads.
Median: assumes a uniform distribution of the elements in the initial array.
  The $n/p$ elements stored at each process form a representative sample of all $n$ elements.
  The distribution of elements at each process is the same as the overall distribution.
  The median at each process approximates the overall median.
  Pivot = local median.
  Maintains balanced subsequences throughout the execution of the algorithm.
49 Bucket Sort
Sorts an array of $n$ elements whose values are uniformly distributed over an interval $[a, b]$.
Sequential bucket sort:
  Divide $[a, b]$ into $m$ equal-sized subintervals => buckets.
  Place each element in the appropriate bucket => approximately $n/m$ elements per bucket.
  Sort the elements in each bucket.
$T_s = \Theta(n \log (n/m))$
If $m = \Theta(n)$ => $T_s = \Theta(n)$: linear time!
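A minimal Python sketch of sequential bucket sort, assuming all values lie in $[a, b]$ with $a < b$. The function name `bucket_sort` and its parameter order are illustrative choices.

```python
def bucket_sort(values, a, b, m):
    """Sort values from the interval [a, b] using m equal-width buckets."""
    width = (b - a) / m
    buckets = [[] for _ in range(m)]
    for x in values:
        # Clamp so that x == b falls into the last bucket instead of bucket m.
        index = min(int((x - a) / width), m - 1)
        buckets[index].append(x)
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))   # each bucket holds about n/m elements
    return result


result = bucket_sort([0.42, 0.07, 0.99, 0.5, 0.31, 1.0, 0.0], a=0.0, b=1.0, m=4)
```

The output is correct for any input distribution; the uniformity assumption only matters for the running time, since it keeps the buckets close to $n/m$ elements each.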
50 Parallel Bucket Sort
Each process represents a bucket ($m = p$).
Parallel bucket sort:
  Each process partitions its block into $p$ sub-blocks, one for each process (bucket).
  Each process sends its sub-blocks to the appropriate processes.
  Each process sorts its bucket.
The uniform distribution assumption is not realistic.
If the distribution is not uniform => the buckets have significantly different numbers of elements => degraded performance.
51 Sample Sort
Does not assume a uniform distribution.
A sample of size $s$ is selected from the $n$-element sequence.
Sort the sample and select $m-1$ elements (splitters).
The splitters determine the ranges of the buckets.
After the buckets are determined, the algorithm proceeds as in bucket sort.
52 Parallel Sample Sort
53 Parallel Sample Sort: Analysis
Assumes a message-passing computer with $p$ processes and $O(p)$ bisection bandwidth.
Parallel sample sort:
  Internal sort => $\Theta((n/p) \log (n/p))$
  Selection of $p-1$ samples => $\Theta(p)$
  Send $p-1$ elements to $P_0$ (gather) => $\Theta(p^2)$
  Internal sort of $p(p-1)$ elements at $P_0$ => $\Theta(p^2 \log p)$
  Select $p-1$ splitters at $P_0$ => $\Theta(p)$
  Broadcast $p-1$ splitters => $\Theta(p \log p)$
  Each process partitions its block into $p$ sub-blocks, one for each bucket => $p-1$ binary searches => $\Theta(p \log (n/p))$
  Each process sends its sub-blocks to the appropriate processes. What's the algorithm? If the distribution is close to uniform, the size of each sub-block is approximately $m = n/p^2$; assuming a hypercube: $(t_s + t_w m p/2) \log p$ => $\Theta((n/p) \log p)$
$T_p = \Theta((n/p) \log (n/p)) + \Theta(p^2 \log p) + \Theta(p \log (n/p)) + \Theta((n/p) \log p)$
Isoefficiency function: $\Theta(p^3 \log p)$
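The steps of the analysis can be traced in a serial simulation. This is a sketch, not the lecture's implementation: the processes are modeled as a list of non-empty lists, samples and splitters are picked at evenly spaced positions (one possible choice), and each process finishes by merging the sorted sub-blocks it receives. The name `parallel_sample_sort` is hypothetical.

```python
from bisect import bisect_right
from heapq import merge


def parallel_sample_sort(blocks):
    """Simulate sample sort with p = len(blocks) processes; returns p sorted buckets."""
    p = len(blocks)
    local = [sorted(b) for b in blocks]                  # internal sort on each process
    samples = []
    for b in local:                                      # p-1 evenly spaced samples each
        samples.extend(b[k * len(b) // p] for k in range(1, p))
    samples.sort()                                       # sort of p(p-1) samples at P_0
    splitters = [samples[k * (p - 1)] for k in range(1, p)]  # p-1 splitters, then broadcast
    buckets = [[] for _ in range(p)]
    for b in local:
        # p-1 binary searches cut the sorted block into p sub-blocks.
        cuts = [0] + [bisect_right(b, s) for s in splitters] + [len(b)]
        for j in range(p):
            buckets[j].append(b[cuts[j]:cuts[j + 1]])    # sub-block j goes to process j
    # Each process merges the sorted sub-blocks it received.
    return [list(merge(*parts)) for parts in buckets]


blocks = [[22, 7, 13, 18, 2, 17, 1, 14],
          [20, 6, 10, 24, 15, 9, 21, 3],
          [16, 19, 23, 4, 11, 12, 5, 8]]
result = parallel_sample_sort(blocks)
```

Concatenating the returned buckets in process order gives the sorted sequence, because bucket $j$ only receives elements that lie between splitter $j-1$ and splitter $j$.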