5 Sorting Atomic Items


5.1 The merge-based sorting paradigm
    Stopping recursion
    Snow Plow
    From binary to multi-way Mergesort
5.2 Lower bounds
    A lower bound for Sorting
    A lower bound for Permuting
5.3 The distribution-based sorting paradigm
    From two- to three-way partitioning
    Pivot selection
    Bounding the extra working space
    From binary to multi-way Quicksort
    The Dual-Pivot Quicksort
5.4 Sorting with multi-disks

This lecture will focus on the very well-known problem of sorting a set of atomic items; the case of variable-length items (aka strings) will be addressed in the following chapter. Atomic means that the items occupy a constant, fixed number of memory cells: typically they are integers or reals represented with a fixed number of bytes, say 4 (32 bits) or 8 (64 bits) bytes each.

The sorting problem. Given a sequence of n atomic items S[1, n] and a total ordering between each pair of them, sort S in increasing order.

We will consider two complementary sorting paradigms: the merge-based paradigm, which underlies the design of Mergesort, and the distribution-based paradigm, which underlies the design of Quicksort. We will adapt them to work in the disk model (see Chapter 1), analyze their I/O-complexities, and propose some useful tools that can speed up their execution in practice, such as the Snow Plow technique and data compression. We will also demonstrate that these disk-based adaptations are I/O-optimal by proving a sophisticated lower bound on the number of I/Os any external-memory sorter must execute to produce an ordered sequence. In this context we will relate the sorting problem with the so-called permuting problem, typically neglected when dealing with sorting in the RAM model.

The permuting problem. Given a sequence of n atomic items S[1, n] and a permutation π[1, n] of the integers {1, 2, ..., n}, permute S according to π, thus obtaining the new sequence S[π[1]], S[π[2]], ..., S[π[n]].
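In 0-based Python notation, the permuting problem is simply the following (a sketch; here `pi` holds 0-based positions, rather than the 1-based ones used in the definition above):

```python
# Sketch of the permuting problem, 0-based: given S and a permutation pi
# of range(len(S)), produce the sequence S[pi[0]], S[pi[1]], ...
def permute(S, pi):
    return [S[p] for p in pi]

S = [10, 30, 20, 40]
pi = [2, 0, 3, 1]
print(permute(S, pi))  # [20, 10, 40, 30]
```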
Clearly sorting includes permuting as a sub-task: to order the sequence S we need to determine its sorted permutation and then implement it (possibly these two phases are intricately intermingled). So sorting should be more difficult than permuting. And indeed in the RAM model we know that sorting n atomic items takes Θ(n log n) time (via Mergesort or Heapsort), whereas permuting them takes Θ(n) time. The latter time bound can be obtained by just moving one item at a time
© Paolo Ferragina

according to what the array π indicates. Surprisingly, we will show that this complexity gap does not exist in the disk model, because these two problems exhibit the same I/O-complexity under some reasonable conditions on the input and model parameters n, M, B. This elegant and deep result was obtained by Aggarwal and Vitter in 1988 [1], and it is surely the result that spurred the huge amount of algorithmic literature on the I/O subject. Philosophically speaking, AV's result formally proves the intuition that moving items in the disk is the real bottleneck, rather than finding the sorted permutation. And indeed researchers and software engineers typically speak about the I/O-bottleneck to characterize this issue in their (slow) algorithms.

We will conclude this lecture by briefly mentioning two solutions for the problem of sorting items on D disks: the disk-striping technique, which is at the base of RAID systems and turns any efficient/optimal 1-disk algorithm into an efficient D-disk algorithm (typically losing its optimality, if any), and the Greed-sort algorithm, which is specifically tailored for the sorting problem on D disks and achieves I/O-optimality.

5.1 The merge-based sorting paradigm

We recall the main features of the external-memory model introduced in Chapter 1: it consists of an internal memory of size M and allows blocked access to disk by reading/writing B items at once.

Algorithm 5.1 The binary merge-sort: MergeSort(S, i, j)
1: if (i < j) then
2:   m = (i + j)/2;
3:   MergeSort(S, i, m − 1);
4:   MergeSort(S, m, j);
5:   Merge(S, i, m, j);
6: end if

Mergesort is based on the Divide&Conquer paradigm. Step 1 checks if the array to be sorted consists of at least two items, otherwise it is already ordered and nothing has to be done. If there are two or more items, it splits the input array S into two halves and then recurses on each part. As recursion ends, the two halves S[i, m − 1] and S[m, j] are ordered, so Step 5 fuses them in S[i, j] by invoking procedure Merge.
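A minimal in-memory sketch of Algorithm 5.1 and its two-pointer Merge in Python (0-based indices, out-of-place for clarity; an illustrative rendering, not the book's exact procedure):

```python
def merge(left, right):
    # Two-pointer fusion: compare the front items, output the smaller,
    # advance its pointer; O(len(left) + len(right)) comparisons overall.
    out, x, y = [], 0, 0
    while x < len(left) and y < len(right):
        if left[x] <= right[y]:
            out.append(left[x]); x += 1
        else:
            out.append(right[y]); y += 1
    out.extend(left[x:]); out.extend(right[y:])   # one side is exhausted
    return out

def merge_sort(S):
    if len(S) <= 1:          # Step 1: arrays of 0 or 1 items are sorted
        return S
    m = len(S) // 2          # Step 2: split point
    return merge(merge_sort(S[:m]), merge_sort(S[m:]))  # Steps 3-5

print(merge_sort([5, 2, 8, 1, 9, 3]))  # [1, 2, 3, 5, 8, 9]
```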
This merging step needs an auxiliary array of size n, so that MergeSort is not an in-place sorting algorithm (unlike Heapsort and Quicksort) but needs O(n) extra working space. Given that at each recursive call we halve the size of the input array to be sorted, the total number of recursion levels is O(log n). The Merge procedure can be implemented in O(j − i + 1) time by using two pointers, say x and y, that start at the beginning of the two halves S[i, m − 1] and S[m, j]. Then S[x] is compared with S[y], the smaller is written out in the fused sequence, and its pointer is advanced. Given that each comparison advances one pointer, the total number of steps is bounded above by the total number of pointer advancements, which is upper bounded by the length of S[i, j]. So the time complexity of MergeSort(S, 1, n) can be modeled via the recurrence relation T(n) = 2T(n/2) + O(n) = O(n log n), as well known from any basic algorithms course.¹

Let us assume now that n > M, so that S must be stored on disk and I/Os become the most important resource to be analyzed. In practice every I/O takes 5 ms on average, so one could think

¹ In all our lectures, when the base of the logarithm is not indicated, it is 2.

that every item comparison takes one I/O, and thus one could estimate the running time of Mergesort on a massive S as: 5 ms × Θ(n log n). If n is of the order of a few Gigabytes, say n ≈ 2^30, which is actually not that massive for the current memory size of commodity PCs, the previous time estimate would be more than 10^8 ms, namely more than 1 day of computation. However, if we run Mergesort on a commodity PC it completes in a few hours. This is not surprising, because the previous evaluation totally neglected the existence of the internal memory, of size M, and the sequential pattern of memory accesses induced by Mergesort.

Let us therefore analyze the Mergesort algorithm in a more precise way within the disk model. First of all we notice that O(z/B) I/Os is the cost of merging two ordered sequences of z items in total. This holds if M ≥ 2B, because the Merge procedure in Algorithm 5.1 can keep in internal memory the 2 pages that contain the two pointers scanning S[i, j], where z = j − i + 1. Every time a pointer advances into another disk page, an I/O-fault occurs, the page is fetched in internal memory, and the fusion continues. Given that S is stored contiguously on disk, S[i, j] occupies O(z/B) pages, and this is the I/O-bound for merging two sub-sequences of total size z. Similarly, the I/O-cost for writing the merged sequence is O(z/B), because the writing occurs sequentially from the smallest to the largest item of S[i, j] by using an auxiliary array. As a result, the recurrence relation for the I/O-complexity of Mergesort can be written as T(n) = 2T(n/2) + O(n/B) = O((n/B) log n) I/Os. But this formula does not completely explain the good behavior of Mergesort in practice, because it does not yet account for the memory hierarchy. In fact, as Mergesort recursively splits the sequence S, smaller and smaller sub-sequences are generated that have to be sorted.
So when a subsequence of length z fits in internal memory, namely z = O(M), then it will be entirely cached by the underlying operating system using O(z/B) I/Os, and the subsequent sorting steps will not incur any I/Os. The net result of this simple observation is that the I/O-cost of sorting a sub-sequence of z = O(M) items is no longer Θ((z/B) log z), as accounted for in the previous recurrence relation, but O(z/B) I/Os, which accounts only for the cost of loading the subsequence in internal memory. This saving applies to all of S's subsequences of size Θ(M) on which Mergesort is recursively run, which are Θ(n/M) in total. So the overall saving is Θ((n/B) log M), which leads us to re-formulate the Mergesort complexity as Θ((n/B) log (n/M)) I/Os. This bound is particularly interesting because it relates the I/O-complexity of Mergesort not only to the disk-page size B but also to the internal-memory size M, and thus to the caching available at the sorter. Moreover, this bound suggests three immediate optimizations to the classic pseudocode of Algorithm 5.1, which we discuss below.

Stopping recursion

The first optimization consists of introducing a threshold on the subsequence size, say j − i < cM, which triggers the stop of the recursion, the fetching of that subsequence entirely in internal memory, and the application of an internal-memory sorter on this sub-sequence (see Figure 5.1). The value of the parameter c depends on the space occupancy of the sorter, which must be guaranteed to work entirely in internal memory. As an example, c is 1 for in-place sorters such as Insertionsort and Heapsort, it is much closer to 1 for Quicksort (because of its recursion), and it is less than 0.5 for Mergesort (because of the extra array used by Merge). As a result, we should write cM instead of M in the I/O-bound above, because recursion is stopped at cM items: thus obtaining Θ((n/B) log (n/cM)). This substitution is useless when dealing with asymptotic analysis, given that c is a constant, but it is important when considering the real performance of algorithms.
In this setting it is desirable to make c as close as possible to 1, in order to reduce the logarithmic factor in the I/O-complexity, thus preferring in-place sorters such as Heapsort or Quicksort. We remark that Insertionsort could also be a good choice (and indeed it is) whenever M is small, as it occurs when considering the sorting of items over the two cache levels, L1 and L2, and the internal memory. In this case M would be a few Megabytes.
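This stopping rule can be sketched as follows in Python, where the threshold constant `C_M` (standing in for cM, measured in items) and the choice of Insertionsort as the internal-memory sorter are illustrative assumptions:

```python
C_M = 32  # illustrative stand-in for cM: below this size we sort "in memory"

def insertion_sort(S):
    # In-place sorter: c = 1, only O(1) extra working space.
    for i in range(1, len(S)):
        key, j = S[i], i - 1
        while j >= 0 and S[j] > key:
            S[j + 1] = S[j]; j -= 1
        S[j + 1] = key
    return S

def hybrid_merge_sort(S):
    if len(S) <= C_M:              # recursion stops: the run fits in memory
        return insertion_sort(S)
    m = len(S) // 2
    left, right = hybrid_merge_sort(S[:m]), hybrid_merge_sort(S[m:])
    out, x, y = [], 0, 0           # usual two-pointer merge
    while x < len(left) and y < len(right):
        if left[x] <= right[y]: out.append(left[x]); x += 1
        else: out.append(right[y]); y += 1
    return out + left[x:] + right[y:]
```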

Snow Plow

Looking at the I/O-complexity of Mergesort, i.e. Θ((n/B) log (n/M)), it is clear that the larger M is, the smaller the number of merge passes over the data. These passes are clearly the bottleneck for the efficient execution of the algorithm, especially in the presence of disks with low bandwidth. In order to circumvent this problem we can either buy a larger memory, or try to exploit the one we have available as much as possible. As algorithm engineers we opt for the second possibility, and thus propose two techniques that can be combined in order to (virtually) enlarge M.

The first technique is based on data compression, and builds upon the observation that the runs are increasingly sorted. So, instead of representing items via a fixed-length coding (e.g. 4 or 8 bytes), we can use integer-compression techniques that squeeze those items into fewer bits, thus allowing us to pack more of them in internal memory. A following lecture will describe in detail several approaches to this problem (see Chapter ??); here we content ourselves with mentioning the names of some of these approaches: γ-code, δ-code, Rice/Golomb coding, etc. In addition, since the smaller an integer is, the fewer bits are used for its encoding, we can enforce the presence of small integers in the sorted runs by encoding not just their absolute value but the difference between one integer and the previous one in the sorted run (the so-called delta-coding). This difference is surely non-negative (it equals zero if the run contains equal items), and smaller than the item to be encoded. This is the typical approach to the encoding of integer sequences used in modern search engines, which we will discuss in a following lecture (see Chapter ??).

FIGURE 5.1: When a run fits in the internal memory of size M, we apply qsort over its items. In gray we depict the recursive calls that are executed in internal memory, and thus do not elicit I/Os. Above them are the calls based on classic Mergesort; only the call on 2M items is shown.
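The delta-coding observation can be illustrated in Python (the plain gap representation below is a simplified stand-in for the γ/δ/Rice codes named above, which would then encode each small gap in few bits):

```python
def delta_encode(run):
    # A sorted run is stored as its first item plus the non-negative gaps
    # between consecutive items: gaps are small, so a variable-length
    # integer code packs them into few bits each.
    return [run[0]] + [b - a for a, b in zip(run, run[1:])]

def delta_decode(deltas):
    out, acc = [], 0
    for d in deltas:
        acc += d               # prefix sums rebuild the original items
        out.append(acc)
    return out

run = [1000, 1002, 1002, 1010, 1050]
enc = delta_encode(run)
print(enc)                     # [1000, 2, 0, 8, 40] -- mostly small gaps
assert delta_decode(enc) == run
```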
The second technique is based on an elegant idea, called the Snow Plow and due to D. Knuth [3], that allows to virtually increase the memory size by a factor of 2 on average. This technique scans the input sequence S and generates sorted runs whose length has variable size, longer than M and 2M on average. Its use needs to change the sorting scheme, because it first creates these sorted runs, of variable length, and then repeatedly applies the Merge procedure over them. Although the runs will have different lengths, Merge will operate as usual, requiring an optimal number of I/Os for their merging. Hence O(n/B) I/Os will suffice to halve the number of runs, and thus a total of O((n/B) log (n/2M)) I/Os will be used on average to produce the totally ordered sequence. This corresponds to a saving of 1 pass over the data, which is not negligible if the sequence S is very long.

For ease of description, let us assume that items are transferred one at a time from disk to memory, instead of block-wise. Eventually, since the algorithm scans the input items, it will be apparent that the number of I/Os required by this process is linear in their number (and thus optimal). The algorithm proceeds in phases, each phase generating a sorted run (see Figure 5.2 for an illustrative

example).

FIGURE 5.2: An illustration of four steps of a phase in Snow Plow. The leftmost picture shows the starting step in which U is heapified, then a picture shows the output of the minimum element in H, hence the two possible cases for the insertion of the new item, and finally the stopping condition in which H is empty and U fills the internal memory entirely.

A phase starts with the internal memory filled with M (unsorted) items, stored in a heap data structure called H. Since the array-based implementation of heaps requires no additional space beyond the indexed items, we can fit in H as many items as we have memory cells available. The phase scans the input sequence S (which is unsorted) and at each step it writes to the output the minimum item within H, say min, and loads in memory the next item from S, say next. Since we want to generate a sorted output, we cannot store next in H if next < min, because it would be the new heap minimum and thus it would be written out at the next step, destroying the property of ordered run. So in that case next must be stored in an auxiliary array, called U, which stays unsorted. Of course the total size of H and U is M over the whole execution of a phase. A phase stops whenever H is empty, and thus U consists of M unsorted items; the next phase can then start (storing those items in a new heap H and emptying U). Two observations are in order: (i) during the phase execution, the minimum of H is non-decreasing, and so the output run is non-decreasing too; (ii) the items in H at the beginning of the phase will eventually be written to the output, which thus is longer than M. Observation (i) implies the correctness, observation (ii) implies that this approach is not less efficient than the classic Mergesort.
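One phase of this process can be sketched in Python with the standard `heapq` module (an illustrative rendering of the pseudocode below; exhaustion of the input, which the pseudocode does not cover, is handled here by simply draining the heap):

```python
import heapq

def snow_plow_phase(U, stream):
    """One Snow-Plow phase: U holds M unsorted items, stream yields the
    rest of S. Returns (sorted run, unsorted buffer for the next phase)."""
    H = list(U)
    heapq.heapify(H)                 # min-heap over U's items
    U = []                           # empty the unsorted buffer
    run = []
    while H:
        mn = heapq.heappop(H)
        run.append(mn)               # write the heap minimum to the run
        nxt = next(stream, None)     # read the next input item
        if nxt is None:
            continue                 # input exhausted: just drain the heap
        if nxt < mn:                 # would break the sorted run:
            U.append(nxt)            # park it for the next phase
        else:
            heapq.heappush(H, nxt)   # it can still go in this run
    return run, U

run, U = snow_plow_phase([5, 1, 9, 4], iter([3, 7, 0, 8, 2, 6]))
print(run)  # [1, 3, 4, 5, 7, 8, 9] -- length 7 > M = 4
print(U)    # [0, 2, 6] -- seed of the next phase
```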
Algorithm 5.2 A phase of the Snow-Plow technique
Require: U is an unsorted array of M items
1: H = build a min-heap over U's items;
2: Set U = ∅;
3: while (H ≠ ∅) do
4:   min = extract minimum from H;
5:   Write min to the output run;
6:   next = read the next item from the input sequence;
7:   if (next < min) then
8:     write next in U;
9:   else
10:    insert next in H;
11:  end if
12: end while

Actually it is more efficient than that on average. Suppose that a phase reads τ items in total from S. By the while-guard in Step 3 and our comments above, we can derive that a phase ends when H is empty and |U| = M. We know that the read items go in part to H and in part to U. But since items are added to U and never removed during a phase, M of the τ items end up in U.

Consequently (τ − M) items are inserted in H and eventually written to the output (sorted) run. So the length of the sorted run at the end of the phase is M + (τ − M) = τ, where the first addendum accounts for the items in H at the beginning of a phase, whereas the second addendum accounts for the items read from S and inserted in H during the phase. The key issue now is to compute the average of τ. This is easy if we assume a random distribution of the input items. In this case we have probability 1/2 that next is smaller than min, and thus equal probability that a read item is inserted either in H or in U. Overall it follows that τ/2 items go to H and τ/2 items go to U. But we already know that the items inserted in U are M in number, so we can set M = τ/2 and thus we get τ = 2M.

FACT 5.1 Snow-Plow builds O(n/M) sorted runs, each longer than M and actually of length 2M on average. Using Snow-Plow for the formation of sorted runs in a merge-based sorting scheme achieves an I/O-complexity of O((n/B) log₂ (n/2M)) on average.

From binary to multi-way Mergesort

The previous optimizations deployed the internal-memory size M to reduce the number of recursion levels by increasing the size of the initial (sorted) runs. But the merging was binary, in that it fused two input runs at a time. This binary merge impacted on the base 2 of the logarithm in the I/O-complexity of Mergesort. Here we wish to increase that base to a much larger value, and in order to reach this goal we need to deploy the memory M also in the merging phase, by enlarging the number of runs that are fused at a time. In fact the merge of 2 runs uses only 3 blocks of the internal memory: 2 blocks are used to cache the current disk pages that contain the compared items, namely S[x] and S[y] in the notation above, and 1 block is used to cache the output items, which are flushed when the block is full (so as to allow a block-wise writing to disk of the merged run). But the internal memory contains a much larger number of blocks, i.e. M/B − 3, which remain unused over the whole merging process.
The third optimization we propose therefore consists of deploying all those blocks by designing a k-way merging scheme that fuses k runs at a time, with k ≥ 2. Let us set k = (M/B) − 1, so that k blocks are available to read block-wise the k input runs, and 1 block is reserved for a block-wise writing of the merged run to disk. This scheme poses a challenging merging problem, because at each step we have to select the minimum among k candidate items, and this obviously cannot be done by brute-force iteration among them. We need a smarter solution that again hinges on the use of a min-heap data structure, which contains k pairs (one per input run), each consisting of two components: one denoting an item and the other denoting its origin run. Initially the items are the minimum items of the k runs, so the pairs have the form ⟨R_i[1], i⟩, where R_i denotes the i-th input run and i = 1, 2, ..., k. At each step, we extract the pair containing the current smallest item in H (given by the first component of its pairs), write that item to the output, and insert in the heap the next item of its origin run. As an example, if the minimum pair is ⟨R_m[x], m⟩, then we write to output R_m[x] and insert in H the new pair ⟨R_m[x+1], m⟩, provided that the m-th run is not exhausted, in which case no pair replaces the extracted one. If the disk page containing R_m[x+1] is not cached in internal memory, an I/O-fault occurs and that page is fetched, thus guaranteeing that the next B reads from run R_m will not elicit any further I/O. It should be clear that this merging process takes O(log₂ k) time per item, and again O(z/B) I/Os to merge k runs of total length z. As a result, the merging scheme recalls a k-way tree with O(n/M) leaves (runs), which can have been formed using any of the optimizations above (possibly via Snow Plow). Hence the total number of merging levels is now O(log_{M/B} (n/M)), for a total volume of I/Os equal to O((n/B) log_{M/B} (n/M)).
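The k-way merging core can be sketched in Python, again with `heapq` over ⟨item, run⟩ pairs (in-memory lists stand in for the disk-resident runs, so the block-wise I/O is not modeled):

```python
import heapq

def kway_merge(runs):
    # Heap of (item, run index) pairs, one per non-exhausted run;
    # pos[i] tracks how far run i has been consumed.
    H = [(R[0], i) for i, R in enumerate(runs) if R]
    heapq.heapify(H)
    pos = [1] * len(runs)
    out = []
    while H:
        item, i = heapq.heappop(H)   # current overall minimum: O(log k)
        out.append(item)
        if pos[i] < len(runs[i]):    # run i not exhausted: refill the heap
            heapq.heappush(H, (runs[i][pos[i]], i))
            pos[i] += 1
    return out

runs = [[1, 5, 9], [2, 2, 8], [0, 7]]
print(kway_merge(runs))  # [0, 1, 2, 2, 5, 7, 8, 9]
```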
We observe that sometimes we will also write the formula as O((n/B) log_{M/B} (n/B)), as it typically occurs in the literature, because log_{M/B} (n/M) can be written as log_{M/B} (n/(B · (M/B))) = (log_{M/B} (n/B)) − 1 = Θ(log_{M/B} (n/B)). This makes

no difference asymptotically, given that log_{M/B} (n/M) = Θ(log_{M/B} (n/B)).

THEOREM 5.1 Multi-way Mergesort takes O((n/B) log_{M/B} (n/M)) I/Os and O(n log n) comparisons/time to sort n atomic items in a two-level memory model in which the internal memory has size M and the disk page has size B.

The use of Snow-Plow or integer compressors would virtually increase the value of M, with a twofold advantage in the final I/O-complexity, because M occurs twice in the I/O-bound. In practice the number of merging levels will be very small: assuming a block size B = 4KB and a memory size M = 4GB, we get M/B = 2^32/2^12 = 2^20, so that the number of passes is 1/20th of the ones needed by binary Mergesort. Probably more interesting is to observe that one pass is able to sort n = M items, but two passes are able to sort M²/B items, since we can merge M/B runs each of size M. It goes without saying that in practice the internal-memory space which can be dedicated to sorting is smaller than the physical memory available (typically MBs versus GBs). Nevertheless it is evident that M²/B is of the order of Terabytes already for M = 128MB and B = 4KB.

5.2 Lower bounds

At the beginning of this lecture we commented on the relation existing between the sorting and the permuting problems, concluding that the former is more difficult than the latter in the RAM model. The gap in time complexity is given by a logarithmic factor. The question we address in this section is whether this gap also exists when measuring I/Os. Surprisingly enough, we will show that sorting is equivalent to permuting in terms of I/O-volume. This result is amazing because it can be read as saying that the I/O-cost of sorting does not lie in the computation of the sorted permutation but rather in the movement of the data on the disk to realize it. This is the so-called I/O-bottleneck, which finds in this result its mathematical proof and quantification.
Before digging into the proof of this lower bound, let us briefly show how a sorter can be used to permute a sequence of items S[1, n] in accordance with a given permutation π[1, n]. This will allow us to derive an upper bound on the number of I/Os which suffice to solve the permuting problem on any ⟨S, π⟩. Recall that this means to generate the sequence S[π[1]], S[π[2]], ..., S[π[n]]. In the RAM model we can jump among S's items according to permutation π and create the new sequence S[π[i]], for i = 1, 2, ..., n, thus taking O(n) optimal time. On disk we have actually two different algorithms, which induce two incomparable I/O-bounds. The first algorithm consists of mimicking what is done in RAM, paying one I/O per moved item and thus taking O(n) I/Os. The second algorithm consists of generating a proper set of tuples and then sorting them. Precisely, the algorithm creates the sequence P of pairs ⟨i, π[i]⟩, where the first component indicates the position i where the item S[π[i]] must be stored. Then it sorts these pairs according to the π-component, and via a parallel scan of S and P substitutes π[i] with the item S[π[i]], thus creating the new pairs ⟨i, S[π[i]]⟩. Finally, another sort is executed according to the first component of these pairs, thus obtaining a sequence of items correctly permuted. The algorithm uses two scans of the data and two sorts, so it needs O((n/B) log_{M/B} (n/M)) I/Os.

THEOREM 5.2 Permuting n items takes O(min{n, (n/B) log_{M/B} (n/M)}) I/Os in a two-level memory model in which the internal memory has size M and the disk page has size B.

In what follows we will show that this algorithm, in its simplicity, is I/O-optimal. The two upper bounds for sorting and permuting equal each other whenever n = Ω((n/B) log_{M/B} (n/M)). This occurs when

B > log_{M/B} (n/M), which always holds in practice because that logarithmic term is about 2 or 3 for values of n up to many Terabytes. So programmers should not be afraid to find sophisticated strategies for moving their data in the presence of a permutation: just sort them, you cannot do better!

             time complexity (RAM model)   I/O complexity (two-level memory model)
Permuting    O(n)                          O(min{n, (n/B) log_{M/B} (n/M)})
Sorting      O(n log₂ n)                   O((n/B) log_{M/B} (n/M))

TABLE 5.1 Time and I/O complexities of the permuting and sorting problems in a two-level memory model in which M is the internal-memory size, B is the disk-page size, and D = 1 is the number of available disks. The case of multi-disks presents the multiplicative term n/D in place of n.

A lower bound for Sorting

There are some subtle issues here that we do not wish to investigate too much, so we hereafter give only the intuition which underlies the lower bounds for both sorting and permuting.² We start by resorting to the comparison-tree technique for proving comparison-based lower bounds in the RAM model. An algorithm corresponds to a family of such trees, one per input size (so infinite in number). Every node is a comparison between two items. The comparison has two possible results, so the fan-out of each internal node is two and the tree is binary. Each leaf of the tree corresponds to a solution of the underlying problem to be solved: so in the case of sorting, we have one leaf per permutation of the input. Every root-to-leaf path in the comparison tree corresponds to a computation, so the longest path corresponds to the worst-case number of comparisons executed by the algorithm. In order to derive a lower bound, it is therefore enough to compute the depth of the shallowest binary tree having that number of leaves. The shallowest binary tree with l leaves is the (quasi-)perfectly balanced tree, for which the height h is such that 2^h ≥ l. Hence h ≥ log₂ l. In the case of sorting, l = n!,
so the classic lower bound h = Ω(n log₂ n) is easily derived by applying logarithms to both sides of the inequality and using Stirling's approximation for the factorial.

In the two-level memory model the use of comparison trees is more sophisticated. Here we wish to account for I/Os, and exploit the fact that the information available in internal memory can be used for free. As a result, every node corresponds to one I/O; the number of leaves still equals n!; but the fan-out of each internal node equals the number of comparison results that this single I/O can generate among the items it reads from disk (i.e. B) and the items available in internal memory (i.e. M − B). These B items can be distributed in at most (M choose B) ways among the other M − B items present in internal memory, so one I/O can generate no more than (M choose B) different results for those comparisons. But this is an incomplete answer, because we are not considering the permutations among those items! However, some of these permutations have already been counted by some previous I/O, and thus we must not recount them. These permutations are the ones concerning items that have already passed through internal memory, and thus have been fetched by some previous I/O. So we have to count only the permutations among the new items, namely the ones

² There are two assumptions that are typically introduced in these arguments. One concerns item indivisibility, so items cannot be broken up into pieces (hence hashing is not allowed!), and the other concerns the possibility to only move items and not create/destroy/copy them, which actually implies that exactly one copy of each item exists during their sorting or permuting.

that have never been considered by a previous I/O. We have n/B input pages, and thus n/B I/Os accessing new items. So each of these I/Os generates (M choose B) · (B!) results, by comparing those new B items with the M − B ones in internal memory.

Let us now consider a computation with t I/Os, and thus a path in the comparison tree with t nodes. n/B of those nodes must access the input items, which must surely be read to generate the final permutation. The other t − n/B nodes read pages containing already processed items. Any root-to-leaf path has this form, so we can look at the comparison tree as having the new-I/Os at the top and the other nodes at its bottom. Hence if the tree has depth t, its number of leaves is at least ((M choose B))^t · (B!)^{n/B}. By imposing that this number is at least n!, and applying logarithms to both members, we derive that t = Ω((n/B) log_{M/B} (n/M)). It is not difficult to extend this argument to the case of D disks, thus obtaining the following.

THEOREM 5.3 In a two-level memory model with internal memory of size M, disk-page size B and D disks, a comparison-based sorting algorithm must execute Ω((n/(DB)) log_{M/B} (n/(DB))) I/Os.

It is interesting to observe that the number of available disks D does not appear in the denominator of the base of the logarithm, although it appears in the denominator of all the other terms. If this were the case, instead, D would somewhat penalize the sorting algorithms, because it would reduce the logarithm's base. In the light of Theorem 5.1, multi-way Mergesort is I/O- and time-optimal on one disk, so D linearly boosts its performance: having more disks is linearly advantageous (at least from a theoretical point of view). But Mergesort is no longer optimal on multi-disks, because the simultaneous merging of k > 2 runs should take O(n/(DB)) I/Os in order to be optimal. This means that the algorithm should be able to fetch D pages per I/O, hence one per disk.
This cannot be guaranteed at every step by the current merging scheme, because whatever the distribution of the k runs among the D disks, and even if we know which are the next DB items to be loaded in the heap H, it could be the case that more than B of these items reside on the same disk, thus requiring more than one I/O from that disk and hence preventing the parallelism of the read operation. In the following Section 5.4 we will address this issue by proposing the disk-striping technique, which comes close to the I/O-optimal bound via a simple data layout on disks, and the Greedsort algorithm, which achieves full optimality by devising an elegant and sophisticated merging scheme.

A lower bound for Permuting

Let us assume that at any time the memory of our model, hence the internal memory of size M and the unbounded disk, contains a permutation of the input items, possibly interspersed with empty cells. No more than n blocks will be non-empty during the execution of the algorithm, because n steps (and thus n I/Os) is an obvious upper bound to the I/O-complexity of permuting (obtained by mimicking on disk the permuting algorithm for the RAM model). We denote by P_t the number of permutations generated by an algorithm with t I/Os, where P_0 = 1, since at the beginning we have the input order as the initial permutation. In what follows we estimate P_t and then set P_t ≥ n!, in order to derive the minimum number of steps t needed to realize any possible permutation given in input.

Permuting is different from sorting because the permutation to be realized is provided in input, and thus we do not need any computation. So in this case we distinguish three types of I/Os, which contribute differently to the number of generated permutations:

Write I/O: This may increase P_t by a factor O(n), because we have at most n + 1 possible ways to write the output page among the at most n non-empty pages available on disk. Any written page is touched, and touched pages are no more than n at any instant of the permuting process.

Read I/O on an untouched page: If the page is an input page never read before, the read operation imposes to account for the permutations among the read items, hence B! in number, and also for the permutations that these B items can realize by distributing them among the M − B items present in internal memory (similarly as done for sorting). So this read I/O can increase P_t by a factor O((M choose B) · (B!)). The number of input (hence untouched) pages is n/B. After a read I/O, they become touched.

Read I/O on a touched page: If the page was already read or written, we have already accounted in P_t for the permutations among its items, so this read I/O can increase P_t only by a factor O((M choose B)), due to the shuffling of the B read items with the M − B ones present in internal memory. The number of touched pages is at most n.

If t_r is the number of reads and t_w is the number of writes executed by a permuting algorithm, where t = t_r + t_w, then we can bound P_t as follows (here big-Ohs have been dropped to ease the reading of the formulas):

P_t ≤ (n · (M choose B) · (B!))^{n/B} · (n · (M choose B))^{t_r − n/B} · n^{t_w} ≤ (n · (M choose B))^t · (B!)^{n/B}

In order to generate every possible permutation of the input items, we need P_t ≥ n!. We can thus derive a lower bound on t by imposing n! ≤ (n · (M choose B))^t · (B!)^{n/B} and resolving with respect to t:

t = Ω( n log (n/B) / (B log (M/B) + log n) )

We distinguish two cases. If B log (M/B) ≤ log n, then the above equation becomes t = Ω(n log (n/B) / log n) = Ω(n); otherwise it is t = Ω( n log (n/B) / (B log (M/B)) ) = Ω((n/B) log_{M/B} (n/B)). As for sorting, it is not difficult to extend this proof to the case of D disks.

THEOREM 5.4 In a two-level memory model with internal memory of size M, disk-page size B and D disks, permuting n items needs Ω(min{n/D, (n/(DB)) log_{M/B} (n/(DB))}) I/Os.

Theorems 5.3 and 5.4 prove that the I/O-bounds provided in Table 5.1 for the sorting and permuting problems are optimal. Comparing these bounds, we notice that they are asymptotically different whenever B log (M/B) < log n.
Given the current values for B and M, respectively a few KBs and a few GBs, this inequality holds if n = Ω(2^B), and hence when n is more than a Yottabyte (= 2^80). This is indeed an unreasonable situation to deal with using one CPU and a few disks. Probably in this context it would be more reasonable to use a cloud of PCs, and thus analyze the proposed algorithms via a distributed model of computation which takes into account many CPUs and more than 2 memory levels. It is therefore not surprising that researchers typically assume Sorting = Permuting in the I/O setting.

5.3 The distribution-based sorting paradigm

Like Mergesort, Quicksort is based on the divide&conquer paradigm, so it proceeds by dividing the array to be sorted into two pieces, which are then sorted recursively. But unlike Mergesort, Quicksort does not explicitly allocate extra working space, its combine step is absent, and its divide step is sophisticated and impacts on the overall efficiency of this sorting algorithm. Algorithm 5.3

Algorithm 5.3 The binary quick-sort: QuickSort(S, i, j)
1: if (i < j) then
2:   r = pick the position of a "good pivot";
3:   swap S[r] with S[i];
4:   p = Partition(S, i, j);
5:   QuickSort(S, i, p − 1);
6:   QuickSort(S, p + 1, j);
7: end if

reports the pseudocode of Quicksort; this will be used to comment on its complexity and to argue for some optimizations or tricky issues which arise when implementing it over hierarchical memories. The key idea is to partition the input array S[i, j] in two pieces such that one contains items which are smaller than (or equal to) the items contained in the other piece. This partition is order preserving, because no subsequent steps are necessary to recombine the ordered pieces after the two recursive calls. Partitioning is typically obtained by selecting one input item as a pivot, and by distributing all the other input items into two sub-arrays according to whether they are smaller/greater than the pivot. Items equal to the pivot can be stored anywhere. In the pseudocode the pivot is forced to occur in the first position S[i] of the array to be sorted (steps 2–3): this is obtained by swapping the real pivot S[r] with S[i] before the procedure Partition(S, i, j) is invoked. We notice that step 2 does not detail the selection of the pivot, because this will be the topic of a subsequent section. There are two issues for achieving efficiency in the execution of Quicksort: one concerns the implementation of Partition(S, i, j), and the other one the ratio between the sizes of the two formed pieces, because the more balanced they are, the closer Quicksort comes to Mergesort and thus to the optimal time complexity of O(n log n). In the case of a totally unbalanced partition, in which one piece is possibly empty (i.e. p = i or p = j), the time complexity of Quicksort is O(n^2), thus matching the cost of Insertion sort.
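As a concrete counterpart of Algorithm 5.3, here is a minimal Python sketch (not from the book): it uses a random pivot and a classic two-way partition that first forces the pivot into position S[i], as in steps 2–4.

```python
import random

def partition(S, i, j):
    """Two-way partition of S[i..j]; assumes the pivot is already in S[i].
    Returns the final position p of the pivot: S[i..p-1] <= pivot <= S[p+1..j]."""
    pivot = S[i]
    p = i
    for c in range(i + 1, j + 1):
        if S[c] < pivot:          # smaller items move into the left block
            p += 1
            S[p], S[c] = S[c], S[p]
    S[i], S[p] = S[p], S[i]       # place the pivot between the two blocks
    return p

def quicksort(S, i, j):
    if i < j:
        r = random.randint(i, j)  # one possible "good pivot" choice
        S[r], S[i] = S[i], S[r]   # force the pivot into position S[i]
        p = partition(S, i, j)
        quicksort(S, i, p - 1)
        quicksort(S, p + 1, j)
```

Calling `quicksort(S, 0, len(S) - 1)` sorts S in place.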
Let us comment on these two issues in detail in the following subsections.

From two- to three-way partitioning

The goal of Partition(S, i, j) is to divide the input array into two pieces, one containing items which are smaller than the pivot, and the other containing items which are larger than the pivot. Items equal to the pivot can be arbitrarily distributed among the two pieces. The input array is therefore permuted so that the smaller items are located before the pivot, which in turn precedes the larger items. At the end of Partition(S, i, j), the pivot is located at S[p], the smaller items are stored in S[i, p−1], and the larger items are stored in S[p+1, j]. This partition can be implemented in many ways, taking O(n) optimal time, but each of them offers a different cache usage and thus different performance in practice. We present below a tricky algorithm which actually implements a three-way distribution and takes into account the presence of items equal to the pivot. They are detected and stored aside in a special sub-array which is located between the two smaller/larger pieces. It is clear that the central sub-array, which contains the items equal to the pivot, can be discarded from the subsequent recursive calls, similarly as we discard the pivot. This reduces the number of items to be sorted recursively, but needs a change in the (classic) pseudo-code of Algorithm 5.3, because Partition must now return the pair of indices which delimit the central sub-array instead of just the position p of the pivot. The following Algorithm 5.4 details an implementation of the three-way partitioning of S[i, j] which uses three pointers that move rightward over this array and maintain the following invariant: P is the pivot driving the three-way distribution, S[c] is the item currently compared against P, and S[i, c−1] is the part of the input array already scanned and three-way partitioned in its elements. In particular, S[i, c−1] consists of three parts: S[i, l−1] contains items

smaller than P, S[l, r−1] contains items equal to P, and S[r, c−1] contains items larger than P. It may be the case that any one of these sub-arrays is empty.

Algorithm 5.4 The three-way partitioning: Partition(S, i, j)
1: P = S[i]; l = i; r = i + 1;
2: for (c = r; c ≤ j; c++) do
3:   if (S[c] == P) then
4:     swap S[c] with S[r];
5:     r++;
6:   else if (S[c] < P) then
7:     swap S[c] with S[r];
8:     swap S[l] with S[r];
9:     r++; l++;
10:  end if
11: end for
12: return ⟨l, r − 1⟩;

Step 1 initializes P to the first item of the array to be partitioned (which is the pivot), while l and r are set to guarantee that the smaller/greater pieces are empty, whereas the piece containing items equal to the pivot consists of the only item P. Next the algorithm scans S[i+1, j] trying to maintain the invariant above. This is easy if S[c] > P, because it suffices to extend the part of the larger items by advancing c. In the other two cases (i.e. S[c] ≤ P) we have to insert S[c] in its correct position among the items of S[i, r−1], in order to preserve the invariant on the three-way partition of S[i, c]. The cute idea is that this can be implemented in O(1) time by means of at most two swaps, as described graphically in Figure 5.3. The three-way partitioning algorithm takes O(n) time and offers two positive properties: (i) stream-like access to the array S, which allows the pre-fetching of the items to be read; (ii) the items equal to the pivot can be eliminated from the following recursive calls.

Pivot selection

The selection of the pivot is crucial to get balanced partitions, reduce the number of recursive calls, and achieve the optimal O(n log n) time complexity. The pseudo-code of Algorithm 5.3 does not detail the way the pivot is selected, because this may occur in many different ways, each offering pros and cons.
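Before turning to pivot selection, the single-scan, two-swap distribution of Algorithm 5.4 can be transcribed in Python as follows (a sketch, with 0-based indices):

```python
def partition3(S, i, j):
    """Three-way partition of S[i..j] around the pivot P = S[i], as in
    Algorithm 5.4. Returns (l, e) such that S[i..l-1] < P, S[l..e] == P
    and S[e+1..j] > P."""
    P = S[i]
    l, r = i, i + 1               # smaller/larger pieces empty, equal piece = {P}
    for c in range(i + 1, j + 1):
        if S[c] == P:             # one swap: extend the equal run
            S[c], S[r] = S[r], S[c]
            r += 1
        elif S[c] < P:            # two swaps: push S[c] before the equal run
            S[c], S[r] = S[r], S[c]
            S[l], S[r] = S[r], S[l]
            l += 1
            r += 1
        # if S[c] > P there is nothing to do: the larger run grows by itself
    return l, r - 1
```

The recursion then proceeds only on S[i..l−1] and S[e+1..j], discarding the central run of pivot-equal items.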
As an example, if we choose the pivot as the first item of the input array (namely r = i), the selection is fast but it is easy to instantiate the input array so as to induce unbalanced partitions: just take S to be an increasing or decreasing ordered sequence of items. Worse than this is the observation that any deterministic choice incurs this drawback. One way to circumvent bad inputs is to select the pivot randomly among the items in S[i, j]. This prevents the case that a given input is bad for Quicksort, but makes the behavior of the algorithm unpredictable in advance and dependent on the random selection of the pivot. We can show that the average time complexity is the optimal O(n log n), with a hidden constant small and equal to 1.45. This fact, together with the in-place nature of Quicksort, makes this approach much appealing in practice (cfr. qsort below).

THEOREM 5.5 The random selection of the pivot drives Quicksort to compare no more than 1.45 n log n items, on average.
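Theorem 5.5 lends itself to a quick empirical check. The sketch below (not from the book; function names are illustrative) counts the comparisons made by a random-pivot Quicksort, averaged over a few seeded runs, against the 1.45 n log2 n bound:

```python
import math
import random

def quicksort_count(S, i, j, rng):
    """Random-pivot Quicksort on S[i..j]; returns the number of comparisons."""
    if i >= j:
        return 0
    r = rng.randint(i, j)
    S[r], S[i] = S[i], S[r]
    pivot, p, comps = S[i], i, 0
    for c in range(i + 1, j + 1):   # each loop iteration is one comparison
        comps += 1
        if S[c] < pivot:
            p += 1
            S[p], S[c] = S[c], S[p]
    S[i], S[p] = S[p], S[i]
    return (comps
            + quicksort_count(S, i, p - 1, rng)
            + quicksort_count(S, p + 1, j, rng))

def average_comparisons(n, runs=20, seed=42):
    """Average comparison count over `runs` random permutations of size n."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        S = list(range(n))
        rng.shuffle(S)
        total += quicksort_count(S, 0, n - 1, rng)
    return total / runs
```

For n = 1024 the theoretical average is about 2n ln n − Θ(n) ≈ 11300 comparisons, comfortably below 1.45 · n · log2 n = 14848.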

FIGURE 5.3: The two cases and the corresponding swapping. On the arrow we specify the value of the moved item with respect to the pivot.

Proof  The proof is deceptively simple if attacked from the correct angle. We wish to compute the number of comparisons executed by Partition over the sequence S. Let X_{u,v} be the random binary variable which indicates whether S[u] and S[v] are compared by Partition, and denote by p_{u,v} the probability that this event occurs. The average number of comparisons executed by Quicksort can then be computed as

E[ Σ_{u,v} X_{u,v} ] = Σ_{u=1}^{n} Σ_{v=u+1}^{n} ( 1 · p_{u,v} + 0 · (1 − p_{u,v}) ) = Σ_{u=1}^{n} Σ_{v=u+1}^{n} p_{u,v}

by linearity of expectation. To estimate p_{u,v} we concentrate on the random choice of the pivot S[r], because two items are compared by Partition only if one of them is the pivot. So we distinguish three cases. If S[r] is smaller or larger than both S[u] and S[v], then the two items S[u] and S[v] are not compared to each other and they are passed to the same recursive call of Quicksort. So the problem presents itself again on a smaller subset of items containing both S[u] and S[v]. This case is therefore not interesting for estimating p_{u,v}, because we cannot conclude anything at this recursive point about the execution or not of the comparison between S[u] and S[v]. In the case that either S[u] or S[v] is the pivot, then they are compared by Partition. In all other cases, the pivot is taken among the items of S whose value lies strictly between S[u] and S[v]; so these two items go to two different partitions (hence two different recursive calls of Quicksort) and will never be compared. As a result, to compute p_{u,v} we have to consider as interesting pivot-selections the two last situations. Among them, two are the good cases; but how many are all these interesting pivot-selections? We need to consider the sorted version of S, denoted by S′. There is an obvious bijection between pairs of items in S and pairs of items in S′.
Let us assume that S[u] is mapped to S′[u′] and S[v] is mapped to S′[v′]; then it is easy to derive the number of interesting pivot-selections as v′ − u′ + 1, which corresponds to the number of items (hence pivot candidates) whose value is between S[u] and S[v] (extremes included). So p_{u,v} = 2/(v′ − u′ + 1). This formula may appear complicated because we have on the left u, v and on the right u′, v′. Given the bijection between S and S′, we can rephrase the statement "considering all pairs (u, v) in S" as "considering all pairs (u′, v′) in S′", and thus write:

E[ Σ_{u,v} X_{u,v} ] = Σ_{u′=1}^{n} Σ_{v′=u′+1}^{n} 2/(v′ − u′ + 1) < 2 Σ_{u′=1}^{n} Σ_{k=1}^{n} 1/k ≈ 2 n ln n,

where the last inequality comes from the properties of the n-th harmonic number. The statement of the theorem then follows by observing that 2 n ln n ≤ 1.45 n log_2 n.

The next question is how we can enforce the average behavior. The natural answer is to sample more than one pivot. Typically 3 pivots are randomly sampled from S and the central one (i.e. the median) is taken, thus requiring just two comparisons in O(1) time. Taking more than 3 pivots makes the selection of a good one more robust, as proved in the following theorem [2].

THEOREM 5.6 If Quicksort partitions around the median of 2s+1 randomly selected elements, it sorts n distinct elements in n ln n / (H_{2s+2} − H_{s+1}) + O(n) expected comparisons, where H_z is the z-th harmonic number Σ_{i=1}^{z} 1/i.

By increasing s, we can push the expected number of comparisons close to n log n + O(n); however, the selection of the median incurs a higher cost. In fact this can be implemented either by sorting the samples in O(s log s) time and taking the one in the middle position s+1 of the ordered sequence; or in O(s) worst-case time via a sophisticated algorithm (not detailed here). Randomization helps in simplifying the selection while still guaranteeing O(s) time on average. We detail this approach here because its analysis is elegant and its structure general enough to be applied not only to the selection of the median of an unordered sequence, but also to the selection of the item of any rank k.

Algorithm 5.5 Selecting the k-th ranked item: RandSelect(S, k)
1: r = pick a random item from S;
2: S_< = items of S which are smaller than S[r];
3: S_> = items of S which are larger than S[r];
4: n_< = |S_<|;
5: n_= = |S| − (|S_<| + |S_>|);
6: if (k ≤ n_<) then
7:   return RandSelect(S_<, k);
8: else if (k ≤ n_< + n_=) then
9:   return S[r];
10: else
11:  return RandSelect(S_>, k − n_< − n_=);
12: end if

Algorithm 5.5 is randomized and selects the item of the unordered S having rank k.
It is interesting to see that the algorithmic scheme mimics the one used in the Partitioning phase of Quicksort: here the selected item S[r] plays the same role as the pivot in Quicksort, because it is used to partition the input sequence S into three parts consisting of the items smaller/equal/larger than S[r]. But unlike Quicksort, RandSelect recurses only on one of these three parts, namely the one containing the k-th ranked item. This part can be determined by just looking at the sizes of those parts, as done in Steps 6 and 8. There are two specific issues that deserve a comment. We do not need to recurse on S_= because it consists of items all equal to S[r]. If recursion occurs on S_>, we need to update the rank k because we are dropping from the original sequence the items belonging to the set S_< ∪ S_=.
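A direct Python transcription of Algorithm 5.5 reads as follows (a sketch; ranks are 1-based as in the pseudocode):

```python
import random

def rand_select(S, k, rng=random):
    """Return the item of rank k (1-based) in the unordered sequence S."""
    r = rng.randrange(len(S))
    pivot = S[r]
    S_lt = [x for x in S if x < pivot]   # items smaller than S[r]
    S_gt = [x for x in S if x > pivot]   # items larger than S[r]
    n_lt = len(S_lt)
    n_eq = len(S) - (n_lt + len(S_gt))   # items equal to S[r]
    if k <= n_lt:
        return rand_select(S_lt, k, rng)
    elif k <= n_lt + n_eq:               # the k-th item equals the pivot
        return pivot
    else:                                # drop S_lt and S_=, update the rank
        return rand_select(S_gt, k - n_lt - n_eq, rng)
```

For instance, `rand_select(S, (len(S) + 1) // 2)` returns a median of S.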

Correctness is therefore immediate, so we are left with computing the average time complexity of this algorithm, which turns out to be the optimal O(n), given that S is unsorted and thus all of its items have to be examined to find the one having rank k among them.

THEOREM 5.7 Selecting the k-th ranked item in an unordered sequence of size n takes O(n) average time in the RAM model, and O(n/B) I/Os in the two-level memory model.

Proof  Let us call a good selection one that induces a partition in which n_< and n_> are not larger than 2n/3. We do not care about the size of S_= since, if it contains the searched item, that item is returned immediately as S[r]. It is not difficult to observe that S[r] must have rank in the range [n/3, 2n/3] in order to ensure that n_< ≤ 2n/3 and n_> ≤ 2n/3. This occurs with probability 1/3, given that S[r] is drawn uniformly at random from S. So let us denote by T̂(n) the average time complexity of RandSelect when run on an array S[1, n]. We can write

T̂(n) ≤ O(n) + (1/3) T̂(2n/3) + (2/3) T̂(n),

where the first term accounts for the time complexity of Steps 2–5, the second term accounts for the average time complexity of a recursive call on a good selection, and the third term is a crude upper bound to the average time complexity of a recursive call on a bad selection (which is actually assumed to recurse on the entire S again). This is not a classic recurrence relation because the term T̂(n) occurs on both sides; nevertheless, we observe that the two occurrences have different constants in front. Thus we can simplify the relation by subtracting those terms, so getting (1/3) T̂(n) ≤ O(n) + (1/3) T̂(2n/3), which gives T̂(n) = O(n) + T̂(2n/3) = O(n). If this algorithm is executed in the two-level memory model, the equation becomes T̂(n) = O(n/B) + T̂(2n/3) = O(n/B), given that the construction of the three subsets can be done via a single pass over the input items.

We can use RandSelect in many different ways within Quicksort.
For example, we can select the pivot as the median of the entire array S (setting k = n/2), or as the median among an over-sampled set of 2s+1 pivots (setting k = s+1, where s ≪ n/2); or, finally, RandSelect could be subtly used to select a pivot that generates a balanced partition in which the two parts have different sizes, both being a fraction of n, say αn and (1−α)n with α < 0.5. This last choice k = αn seems meaningless, because the three-way partitioning still takes O(n) time but increases the number of recursive calls from log_2 n to log_{1/(1−α)} n. But this observation neglects the sophistication of modern CPUs, which are parallel, pipelined and superscalar. These CPUs execute instructions in parallel, but if there is an event that impacts on the instruction flow, their parallelism is broken and the computation slows down significantly. Particularly slow are branch mispredictions, which occur in the execution of Partition(S, i, j) whenever an item smaller than or equal to the pivot is encountered. If we reduce these cases, then we reduce the number of branch mispredictions, and thus deploy the full parallelism of modern CPUs. Thus the goal is to properly set α in a way that the reduced number of mispredictions balances the increased number of recursive calls. The right value for α is clearly architecture dependent; recent results have shown that a reasonable value is

Bounding the extra-working space

QuickSort is frequently named an in-place sorter because it does not use extra space for ordering the array S. This is true if we limit ourselves to the pseudocode of Algorithm 5.3, but it is no longer true if we consider the cost of managing the recursive calls. In fact, at each recursive call, the OS must allocate space to save the local variables of the caller, in order to retrieve them whenever the recursive call ends. Each recursive call has a space cost of Θ(1), which has to be multiplied by

the number of nested calls Quicksort can issue on an array S[1, n]. This number can be Ω(n) in the worst case, thus making the extra working space Θ(n) on some bad inputs (such as the already-sorted ones pointed out above).

Algorithm 5.6 The binary quick-sort with bounded recursive-depth: BoundedQS(S, i, j)
1: while (j − i > n_0) do
2:   r = pick the position of a "good pivot";
3:   swap S[r] with S[i];
4:   p = Partition(S, i, j);
5:   if (p ≤ (i + j)/2) then
6:     BoundedQS(S, i, p − 1);
7:     i = p + 1;
8:   else
9:     BoundedQS(S, p + 1, j);
10:    j = p − 1;
11:  end if
12: end while
13: InsertionSort(S, i, j);

We can circumvent this behavior by restructuring the pseudocode of Algorithm 5.3 as specified in Algorithm 5.6. This algorithm is cryptic at a first glance, but the underlying design principle is pretty smart and elegant. First of all we notice that the while-body is executed only if the input array is longer than n_0; otherwise Insertion-sort is called in Step 13, thus deploying the well-known efficiency of this sorter over very small sequences. The value of n_0 is typically chosen of a few tens of items. If the input array is longer than n_0, a modified version of the classic binary Quicksort is executed that mixes one single recursive call with an iterative while-loop. The rationale underlying this code re-factoring is that the correctness of classic Quicksort does not depend on the order of the two recursive calls, so we can reshuffle them in such a way that the first call is always executed on the smaller part of the two/three-way partition. This is exactly what the if-statement in step 5 guarantees. In addition to that, the pseudo-code above drops the recursive call on the larger part of the partition in favor of another execution of the body of the while-loop, in which we properly change the parameters i and j to reflect the new extremes of that larger part. This change is well known in the literature of compilers under the name of elimination of tail recursion.
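Algorithm 5.6 translates almost verbatim into Python (a sketch: the pivot here is picked at random, the partition is a simple two-way one, and n_0 = 16 is just an illustrative cutoff):

```python
import random

def insertion_sort(S, i, j):
    """Sort the small segment S[i..j] in place."""
    for a in range(i + 1, j + 1):
        x = S[a]
        b = a
        while b > i and S[b - 1] > x:
            S[b] = S[b - 1]
            b -= 1
        S[b] = x

def bounded_qs(S, i, j, n0=16):
    """Quicksort with O(log n) recursion depth: recurse on the smaller part,
    loop on the larger one (tail-recursion elimination), as in Algorithm 5.6."""
    while j - i > n0:
        r = random.randint(i, j)      # pick a (hopefully good) pivot
        S[r], S[i] = S[i], S[r]
        pivot, p = S[i], i            # two-way partition; pivot ends at p
        for c in range(i + 1, j + 1):
            if S[c] < pivot:
                p += 1
                S[p], S[c] = S[c], S[p]
        S[i], S[p] = S[p], S[i]
        if p <= (i + j) // 2:         # left part is the smaller one
            bounded_qs(S, i, p - 1, n0)
            i = p + 1
        else:                         # right part is the smaller one
            bounded_qs(S, p + 1, j, n0)
            j = p - 1
    insertion_sort(S, i, j)           # finish the short residual segment
```

Since every recursive call works on at most half of the current segment, the nesting depth never exceeds log2 n, even on already-sorted inputs.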
The net result is that the recursive call is executed on a sub-array whose size is no more than half of the input array. This guarantees an upper bound of O(log_2 n) on the number of nested recursive calls, and thus on the size of the extra space needed to manage them.

THEOREM 5.8 BoundedQS sorts n atomic items in the RAM model taking O(n log n) average time, and using O(log n) additional working space.

We conclude this section by observing that the C89 and C99 ANSI standards define a sorting algorithm, called qsort, whose implementation encapsulates most of the algorithmic tricks detailed above.³ This witnesses further the efficiency of the distribution-based sorting scheme over the

³ Actually qsort is based on a different two-way partitioning scheme that uses two iterators, one moves forward and the


More information

Solution printed. Do not start the test until instructed to do so! CS 2604 Data Structures Midterm Spring, Instructions:

Solution printed. Do not start the test until instructed to do so! CS 2604 Data Structures Midterm Spring, Instructions: CS 604 Data Structures Midterm Sprig, 00 VIRG INIA POLYTECHNIC INSTITUTE AND STATE U T PROSI M UNI VERSI TY Istructios: Prit your ame i the space provided below. This examiatio is closed book ad closed

More information

. Written in factored form it is easy to see that the roots are 2, 2, i,

. Written in factored form it is easy to see that the roots are 2, 2, i, CMPS A Itroductio to Programmig Programmig Assigmet 4 I this assigmet you will write a java program that determies the real roots of a polyomial that lie withi a specified rage. Recall that the roots (or

More information

Algorithm Design Techniques. Divide and conquer Problem

Algorithm Design Techniques. Divide and conquer Problem Algorithm Desig Techiques Divide ad coquer Problem Divide ad Coquer Algorithms Divide ad Coquer algorithm desig works o the priciple of dividig the give problem ito smaller sub problems which are similar

More information

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000. 5-23 The course that gives CM its Zip Memory Maagemet II: Dyamic Storage Allocatio Mar 6, 2000 Topics Segregated lists Buddy system Garbage collectio Mark ad Sweep Copyig eferece coutig Basic allocator

More information

Module 8-7: Pascal s Triangle and the Binomial Theorem

Module 8-7: Pascal s Triangle and the Binomial Theorem Module 8-7: Pascal s Triagle ad the Biomial Theorem Gregory V. Bard April 5, 017 A Note about Notatio Just to recall, all of the followig mea the same thig: ( 7 7C 4 C4 7 7C4 5 4 ad they are (all proouced

More information

A graphical view of big-o notation. c*g(n) f(n) f(n) = O(g(n))

A graphical view of big-o notation. c*g(n) f(n) f(n) = O(g(n)) ca see that time required to search/sort grows with size of We How do space/time eeds of program grow with iput size? iput. time: cout umber of operatios as fuctio of iput Executio size operatio Assigmet:

More information

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis Outlie ad Readig Aalysis of Algorithms Iput Algorithm Output Ruig time ( 3.) Pseudo-code ( 3.2) Coutig primitive operatios ( 3.3-3.) Asymptotic otatio ( 3.6) Asymptotic aalysis ( 3.7) Case study Aalysis

More information

Computer Science Foundation Exam. August 12, Computer Science. Section 1A. No Calculators! KEY. Solutions and Grading Criteria.

Computer Science Foundation Exam. August 12, Computer Science. Section 1A. No Calculators! KEY. Solutions and Grading Criteria. Computer Sciece Foudatio Exam August, 005 Computer Sciece Sectio A No Calculators! Name: SSN: KEY Solutios ad Gradig Criteria Score: 50 I this sectio of the exam, there are four (4) problems. You must

More information

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015. Presetatio for use with the textbook Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Hash Tables xkcd. http://xkcd.com/221/. Radom Number. Used with permissio uder Creative

More information

The Magma Database file formats

The Magma Database file formats The Magma Database file formats Adrew Gaylard, Bret Pikey, ad Mart-Mari Breedt Johaesburg, South Africa 15th May 2006 1 Summary Magma is a ope-source object database created by Chris Muller, of Kasas City,

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter The Processor Part A path Desig Itroductio CPU performace factors Istructio cout Determied by ISA ad compiler. CPI ad

More information

Major CSL Write your name and entry no on every sheet of the answer script. Time 2 Hrs Max Marks 70

Major CSL Write your name and entry no on every sheet of the answer script. Time 2 Hrs Max Marks 70 NOTE:. Attempt all seve questios. Major CSL 02 2. Write your ame ad etry o o every sheet of the aswer script. Time 2 Hrs Max Marks 70 Q No Q Q 2 Q 3 Q 4 Q 5 Q 6 Q 7 Total MM 6 2 4 0 8 4 6 70 Q. Write a

More information

A New Morphological 3D Shape Decomposition: Grayscale Interframe Interpolation Method

A New Morphological 3D Shape Decomposition: Grayscale Interframe Interpolation Method A ew Morphological 3D Shape Decompositio: Grayscale Iterframe Iterpolatio Method D.. Vizireau Politehica Uiversity Bucharest, Romaia ae@comm.pub.ro R. M. Udrea Politehica Uiversity Bucharest, Romaia mihea@comm.pub.ro

More information

Operating System Concepts. Operating System Concepts

Operating System Concepts. Operating System Concepts Chapter 4: Mass-Storage Systems Logical Disk Structure Logical Disk Structure Disk Schedulig Disk Maagemet RAID Structure Disk drives are addressed as large -dimesioal arrays of logical blocks, where the

More information

Computational Geometry

Computational Geometry Computatioal Geometry Chapter 4 Liear programmig Duality Smallest eclosig disk O the Ageda Liear Programmig Slides courtesy of Craig Gotsma 4. 4. Liear Programmig - Example Defie: (amout amout cosumed

More information

Big-O Analysis. Asymptotics

Big-O Analysis. Asymptotics Big-O Aalysis 1 Defiitio: Suppose that f() ad g() are oegative fuctios of. The we say that f() is O(g()) provided that there are costats C > 0 ad N > 0 such that for all > N, f() Cg(). Big-O expresses

More information

A Generalized Set Theoretic Approach for Time and Space Complexity Analysis of Algorithms and Functions

A Generalized Set Theoretic Approach for Time and Space Complexity Analysis of Algorithms and Functions Proceedigs of the 10th WSEAS Iteratioal Coferece o APPLIED MATHEMATICS, Dallas, Texas, USA, November 1-3, 2006 316 A Geeralized Set Theoretic Approach for Time ad Space Complexity Aalysis of Algorithms

More information

Minimum Spanning Trees

Minimum Spanning Trees Miimum Spaig Trees Miimum Spaig Trees Spaig subgraph Subgraph of a graph G cotaiig all the vertices of G Spaig tree Spaig subgraph that is itself a (free) tree Miimum spaig tree (MST) Spaig tree of a weighted

More information

Lecture 1: Introduction and Strassen s Algorithm

Lecture 1: Introduction and Strassen s Algorithm 5-750: Graduate Algorithms Jauary 7, 08 Lecture : Itroductio ad Strasse s Algorithm Lecturer: Gary Miller Scribe: Robert Parker Itroductio Machie models I this class, we will primarily use the Radom Access

More information

Improved Random Graph Isomorphism

Improved Random Graph Isomorphism Improved Radom Graph Isomorphism Tomek Czajka Gopal Paduraga Abstract Caoical labelig of a graph cosists of assigig a uique label to each vertex such that the labels are ivariat uder isomorphism. Such

More information

Order statistics. Order Statistics. Randomized divide-andconquer. Example. CS Spring 2006

Order statistics. Order Statistics. Randomized divide-andconquer. Example. CS Spring 2006 406 CS 5633 -- Sprig 006 Order Statistics Carola We Slides courtesy of Charles Leiserso with small chages by Carola We CS 5633 Aalysis of Algorithms 406 Order statistics Select the ith smallest of elemets

More information

6.851: Advanced Data Structures Spring Lecture 17 April 24

6.851: Advanced Data Structures Spring Lecture 17 April 24 6.851: Advaced Data Structures Sprig 2012 Prof. Erik Demaie Lecture 17 April 24 Scribes: David Bejami(2012), Li Fei(2012), Yuzhi Zheg(2012),Morteza Zadimoghaddam(2010), Aaro Berstei(2007) 1 Overview Up

More information

Chapter 3 Classification of FFT Processor Algorithms

Chapter 3 Classification of FFT Processor Algorithms Chapter Classificatio of FFT Processor Algorithms The computatioal complexity of the Discrete Fourier trasform (DFT) is very high. It requires () 2 complex multiplicatios ad () complex additios [5]. As

More information

Alpha Individual Solutions MAΘ National Convention 2013

Alpha Individual Solutions MAΘ National Convention 2013 Alpha Idividual Solutios MAΘ Natioal Covetio 0 Aswers:. D. A. C 4. D 5. C 6. B 7. A 8. C 9. D 0. B. B. A. D 4. C 5. A 6. C 7. B 8. A 9. A 0. C. E. B. D 4. C 5. A 6. D 7. B 8. C 9. D 0. B TB. 570 TB. 5

More information

Combination Labelings Of Graphs

Combination Labelings Of Graphs Applied Mathematics E-Notes, (0), - c ISSN 0-0 Available free at mirror sites of http://wwwmaththuedutw/ame/ Combiatio Labeligs Of Graphs Pak Chig Li y Received February 0 Abstract Suppose G = (V; E) is

More information

DATA STRUCTURES. amortized analysis binomial heaps Fibonacci heaps union-find. Data structures. Appetizer. Appetizer

DATA STRUCTURES. amortized analysis binomial heaps Fibonacci heaps union-find. Data structures. Appetizer. Appetizer Data structures DATA STRUCTURES Static problems. Give a iput, produce a output. Ex. Sortig, FFT, edit distace, shortest paths, MST, max-flow,... amortized aalysis biomial heaps Fiboacci heaps uio-fid Dyamic

More information

University of Waterloo Department of Electrical and Computer Engineering ECE 250 Algorithms and Data Structures

University of Waterloo Department of Electrical and Computer Engineering ECE 250 Algorithms and Data Structures Uiversity of Waterloo Departmet of Electrical ad Computer Egieerig ECE 250 Algorithms ad Data Structures Midterm Examiatio ( pages) Istructor: Douglas Harder February 7, 2004 7:30-9:00 Name (last, first)

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 19 Query Optimizatio Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Query optimizatio Coducted by a query optimizer i a DBMS Goal:

More information

Pattern Recognition Systems Lab 1 Least Mean Squares

Pattern Recognition Systems Lab 1 Least Mean Squares Patter Recogitio Systems Lab 1 Least Mea Squares 1. Objectives This laboratory work itroduces the OpeCV-based framework used throughout the course. I this assigmet a lie is fitted to a set of poits usig

More information

ECE4050 Data Structures and Algorithms. Lecture 6: Searching

ECE4050 Data Structures and Algorithms. Lecture 6: Searching ECE4050 Data Structures ad Algorithms Lecture 6: Searchig 1 Search Give: Distict keys k 1, k 2,, k ad collectio L of records of the form (k 1, I 1 ), (k 2, I 2 ),, (k, I ) where I j is the iformatio associated

More information

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved.

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved. Chapter 11 Frieds, Overloaded Operators, ad Arrays i Classes Copyright 2014 Pearso Addiso-Wesley. All rights reserved. Overview 11.1 Fried Fuctios 11.2 Overloadig Operators 11.3 Arrays ad Classes 11.4

More information

Algorithm. Counting Sort Analysis of Algorithms

Algorithm. Counting Sort Analysis of Algorithms Algorithm Coutig Sort Aalysis of Algorithms Assumptios: records Coutig sort Each record cotais keys ad data All keys are i the rage of 1 to k Space The usorted list is stored i A, the sorted list will

More information

Massachusetts Institute of Technology Lecture : Theory of Parallel Systems Feb. 25, Lecture 6: List contraction, tree contraction, and

Massachusetts Institute of Technology Lecture : Theory of Parallel Systems Feb. 25, Lecture 6: List contraction, tree contraction, and Massachusetts Istitute of Techology Lecture.89: Theory of Parallel Systems Feb. 5, 997 Professor Charles E. Leiserso Scribe: Guag-Ie Cheg Lecture : List cotractio, tree cotractio, ad symmetry breakig Work-eciet

More information

CSC165H1 Worksheet: Tutorial 8 Algorithm analysis (SOLUTIONS)

CSC165H1 Worksheet: Tutorial 8 Algorithm analysis (SOLUTIONS) CSC165H1, Witer 018 Learig Objectives By the ed of this worksheet, you will: Aalyse the ruig time of fuctios cotaiig ested loops. 1. Nested loop variatios. Each of the followig fuctios takes as iput a

More information

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence _9.qxd // : AM Page Chapter 9 Sequeces, Series, ad Probability 9. Sequeces ad Series What you should lear Use sequece otatio to write the terms of sequeces. Use factorial otatio. Use summatio otatio to

More information

Analysis of Algorithms

Analysis of Algorithms Presetatio for use with the textbook, Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Aalysis of Algorithms Iput 2015 Goodrich ad Tamassia Algorithm Aalysis of Algorithms

More information

CS 683: Advanced Design and Analysis of Algorithms

CS 683: Advanced Design and Analysis of Algorithms CS 683: Advaced Desig ad Aalysis of Algorithms Lecture 6, February 1, 2008 Lecturer: Joh Hopcroft Scribes: Shaomei Wu, Etha Feldma February 7, 2008 1 Threshold for k CNF Satisfiability I the previous lecture,

More information

Convergence results for conditional expectations

Convergence results for conditional expectations Beroulli 11(4), 2005, 737 745 Covergece results for coditioal expectatios IRENE CRIMALDI 1 ad LUCA PRATELLI 2 1 Departmet of Mathematics, Uiversity of Bologa, Piazza di Porta Sa Doato 5, 40126 Bologa,

More information

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Virtual Memory Prof. Yajig Li Uiversity of Chicago A System with Physical Memory Oly Examples: most Cray machies early PCs Memory early all embedded systems

More information

Lecture 28: Data Link Layer

Lecture 28: Data Link Layer Automatic Repeat Request (ARQ) 2. Go ack N ARQ Although the Stop ad Wait ARQ is very simple, you ca easily show that it has very the low efficiecy. The low efficiecy comes from the fact that the trasmittig

More information

Threads and Concurrency in Java: Part 1

Threads and Concurrency in Java: Part 1 Cocurrecy Threads ad Cocurrecy i Java: Part 1 What every computer egieer eeds to kow about cocurrecy: Cocurrecy is to utraied programmers as matches are to small childre. It is all too easy to get bured.

More information

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers *

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers * Load balaced Parallel Prime umber Geerator with Sieve of Eratosthees o luster omputers * Soowook Hwag*, Kyusik hug**, ad Dogseug Kim* *Departmet of Electrical Egieerig Korea Uiversity Seoul, -, Rep. of

More information

Threads and Concurrency in Java: Part 1

Threads and Concurrency in Java: Part 1 Threads ad Cocurrecy i Java: Part 1 1 Cocurrecy What every computer egieer eeds to kow about cocurrecy: Cocurrecy is to utraied programmers as matches are to small childre. It is all too easy to get bured.

More information

Sorting Atomic Items. © Paolo Ferragina

5 Sorting Atomic Items
5.1 The merge-based sorting paradigm ... 5-2
    Stopping recursion · Snow Plow · From binary to multi-way Mergesort
5.2 Lower bounds ... 5-7
    A lower-bound for Sorting · A lower-bound for Permuting