Sorting on Clusters of SMPs (Extended Abstract)

Size: px

Start display at page:

Download "Sorting on Clusters of SMPs (Extended Abstract)"

Reginald Boyd
6 years ago
Views:

1 Sortig o Clusters of SMPs (Exteded Abstract) David R. Helma Joseh JáJá Istitute for Advaced Comuter Studies & Deartmet of Electrical Egieerig, Uiversity of Marylad, College Park, MD fhelma, joseh g@umiacs.umd.edu Abstract We itroduce a efficiet algorithm for sortig o clusters of symmetric multirocessors (SMPs). This algorithm relies o a ovel scheme for stably sortig o a sigle SMP couled with balaced regular commuicatio o the cluster. Our SMP algorithm seems to be asymtotically faster tha ay of the ublished algorithms. The algorithms were imlemeted i C usig POSIX threads ad the SIMPLE library of commuicatio rimitives ad ru o a cluster of DEC AlhaServer 2100A systems. Our exerimetal results verify the scalability ad efficiecy of our roosed solutio ad illustrate the imortace of cosiderig both memory hierarchy ad the overhead of shiftig to multile odes. 1 Itroductio Clusters of symmetric multirocessors (SMPs) have emerged as a rimary cadidate for large scale multirocessor systems. I site of this tred, relatively little work has bee doe to develo techiques for desigig algorithms that make effective use of the resources available o such latforms. This task is made difficult by the cotrastig requiremets of the latform. O the oe had, each SMP must be viewed o its ow as a hierarchical shared memory machie. Good erformace requires both good load distributio ad the miimizatio of the overhead icurred by simultaeous mai memory access. O the other had, from the ersective of the cluster, each ode is i effect a suerrocessor, ad therefore the cluster of SMPs is a collectio of owerful rocessors coected by a commuicatio et- Suorted i art by NSF grat No. CCR ad NSF HPCC/GCAG grat No. BIR Thaks to David Bader of the Uiversity of New Mexico for the use of his SIMPLE commuicatio rimitives. work. Maximizig the erformace of such a distributed memory machie requires both efficiet load balacig ad regular balaced commuicatio. I this aer, we examie the roblem of sortig o SMP clusters because of its demadig requiremets for irregular memory access ad iterrocessor commuicatio. From the ersective of the cluster, ay algorithm which erforms well for distributed memory machies would be a reasoable cadidate. I articular, we have idetified two such algorithms i our revious work [8, 7, 10]. The first is a variatio o samle sort ad the other is a variatio o the aroach of sortig by regular samlig. O the other had, from the ersective of the idividual SMP, there are fewer choices for sortig o hierarchical shared memory machies. These iclude the algorithms described i [3, 11, 12, 2, 6, 13]. Ufortuately, oe of these algorithms by itself is sufficiet to achieve efficiet erformace o a SMP cluster. Efficiet algorithms for distributed memory machies ted to cofie iterrocessor commuicatio to a miimum umber of regular balaced exchages. By cotrast, algorithms for shared memory machies tyically caitalize o the fact that accessig data associated with aother rocessor is o more exesive tha accessig its ow data i the shared memory. We itroduce a sortig algorithm for clusters of SMPs which is a hybrid of our modified algorithms for arallel sortig by radom samlig ad arallel sortig by determiistic samlig. Our algorithm was imlemeted i C usig POSIX threads ad rus o a cluster of DEC AlhaServer 2100A systems. We ra our code usig a variety of bechmarks that we have idetified to examie the deedece of our algorithm o the iut distributio. Our exerimetal results verify the scalability ad efficiecy of our roosed solutio ad illustrate the imortace of cosiderig both memory hierarchy ad the overhead of shiftig to multile odes.

2 The orgaizatio of our abstract is as follows. Sectio 2 resets our comutatioal model for aalyzig algorithms for SMP clusters. Sectio 3 describes our sortig algorithm for this latform. Fially, Sectio 4 describes the exerimetal erformace of our sortig algorithm o a SMP cluster. 2 Comutatioal Model For our uroses, the cost of a algorithm eeds to iclude a measure that reflects the umber ad tye of memory accesses. A memory access ca either be to a local cache, a local mai memory, or a remote mai memory. A umber of models have already bee roosed to cature the erformace of multilevel hierarchical memories, icludig the D-disk model of Vitter ad Shriver [14], the Hierarchical Memory with Block Trasfer Model of Aggarwal et al. [1], ad the Uiform Memory Hierarchy Model of Aler et al. [4]. However, we believe these models are uecessarily comlicated for use with clusters of SMPs ad would require sigificat refiemets to cature the corresodig hybrid architecture. I our comlexity model, each SMP is viewed as a twolevel hierarchy for which good erformace requires both good load distributio ad the miimizatio of the overhead icurred by secodary memory access. The cluster is viewed as a collectio of owerful rocessors coected by a commuicatio etwork. Maximizig erformace o the cluster requires both efficiet load balacig ad regular balaced commuicatio. Hece, our erformace model combies two searate but comlemetary models. More secifically, i our SMP model, memory at each SMP is see as cosistig of two levels: cache ad mai memory. A block of w cotiguous, words ca be read from or writte to mai memory i + wr time, where is the latecy of the bus, r is the umber of rocessors cometig for access to ossibly distict blocks i mai memory, ad is the badwidth. By cotrast,, the trasfer of w ocotiguous words would require w + r time. We cature erformace o a SMP cluster i exactly the same way as described i [8, 7, 10]. We view a arallel algorithm as a sequece of local SMP comutatios iterleaved with commuicatio stes, where we allow comutatio ad commuicatio to overla. Assumig o cogestio, the trasfer of a block cosistigof w cotiguous words betwee two odes takes + w time, where is the latecy of the etwork ad is the badwidth er ode. We assume that the bisectio badwidth is sufficietly high to suort block ermutatio routig amogst the N odes at the rate of 1=. I articular, for ay subset of r odes, a block ermutatio amogst the r odes takes + w time, where w is the size of the largest block. Usig this cost model, we ca evaluate the commuicatio time T comm (; N ) of a algorithm as a fuctio of the iut size, the umber of odes N, ad the arameters ad. The overall comlexity of the cluster T (;N ) is give by the sum of T SMP ad T comm (; N ). See [9] for a more extesive discussio of comutatioal models. 3 Sortig Algorithms For simlicity of resetatio, we first cosider a algorithm for sortig o a sigle SMP, ad the describe a algorithm for sortig o clusters of SMPs Sortig o a Sigle SMP Our algorithm is a adatatio of our sortig by regular samlig [8, 10], which we origially develoed for distributed memory machies. As modified for a SMP, this algorithm is similar to the arallel sortig by regular samlig (PSRS) algorithm, excet that our algorithm ca be easily imlemeted as a stable iteger sort. The seudocode for our algorithm is as follows, where C is the size of the cache ad L is the cache lie size: (1) Each rocessor P i (1 i ) sorts the subsequece of the iut elemets with idices (i,1) +1 through i as follows: (A)Sort each block of m iut elemets usig a aroriate sequetial algorithm, where m C. For itegers we use the radix sort algorithm, whereas for floatig oit umbers we use the merge sort algorithm. log(=m) (B)For j = 1 u to, merge the sorted blocks of size, mz (j,1) usig z-way merge, where z C L. th (i,1) + k s (2) Each rocessor P i selects each elemet as a samle, for (1 k s) ad a give value of s s. 2 (3) Processor P merges the sorted subsequeces of samles ad the selects the (ks) th samle as Slitter[k], for(1 k, 1). By default, the th slitter is the largest value allowed by the data tye used. Additioally, biary search is used to comute for the set of samles with idices 1 through (ks) the umberof samles Est[k] which share the same value as Slitter[k]. Ste (4): Each rocessor P i uses biary search to defie a idex b (i;j) for each of the sorted iut sequeces created i Ste (1). If we defie T (i;j) as 2

3 a subsequece cotaiig the first b (i;j) elemets i the j th sorted iut sequece, the the set of subsequeces ft (i;1);t (i;2); :::; T (i;)g will cotai all those values i the iut setwhich are strictly less tha Slitter[i] ad at most Est[i] s elemets with the same value as Slitter[i]. Ste (5): Each rocessor P i merges those subsequeces of the sorted iut sequeces which lie betwee idices b ((i,1);j) ad b (i;j) usig -way merge. It ca be show [8, 10] that o rocessor will merge more tha + s, elemets. It is straightforward to see how this algorithm ca be imlemeted as a stable iteger sort, ad [9] cotais a iformal roof. The aalysis of our algorithm is as follows. The cost of sequetially sortig elemets i blocks of size m i Ste (1A) deeds o the data tye - sortig itegers usig radix sort requires O + + time, whereas sortig floatig oit umbers usig merge sort requires O log m + + time. Ste (1B) ivolves merge begiig with requires oly O log because of satial locality. log(=m) m m + m + log(=m) rouds of z-way blocks of size m, which time The readig of s oco-,, tiguous samles i Ste (2) requires O s + s + time. Ste (3) ivolves a -way merge of blocks of size s followed by biary searches o segmets of size s. Sice oly, oe rocessor is active i Ste (3), it requires O s log() + log(s) + s time. Ste (4) ivolves biary searches o segmets of size. Sice these reads are ocotiguous, they require O log + log, + time. Fially, Ste (5), ivolves a -way merge of blocks of total maximum size + s,, requirig at most log s O + s time. It ca be show that the overall comlexity ca be aroximated by: 0 T (; ) log log + log (z) m 1 A The aalysis suggests that the arameters m ad z should be as large as ossible, subject to the stated costraits. Ideed, i such a case we believe our algorithm to be asymtotically faster tha the erformace achieved by ay of the other kow algorithms o our model. However, our exerimetal ivestigatio shows that the otimal choices for m ad z are actually slightly smaller tha suggested by our aalysis, sice they deed o a variety of factors besides those catured i our model Sortig O a Cluster of SMPs We have already develoed both a determiistic [8, 10] ad a radomized [8, 7] samle sort algorithm which we have show to be very efficiet for message assig latforms. Either would be a aroriate choice for a cluster of SMPs, but we chose the radomized samle sort because it roved to be slightly faster i its imlemetatio. We reeat the seudocode here for coveiece, where we relace each sequetial ste where aroriate by a multithreaded SMP imlemetatio. Note that the commuicatio rimitives metioed are described i detail i [9]. Ste (1): Usig threads, each ode N i (1 i N ) radomly assigs each of its N elemets to oe of N buckets. With high robability, o bucket will receive more tha c 1 N 2 elemets, for (c 1 2). Ste (2): Each ode N i routes the cotets of bucket j to ode N j,for(1 i; j N ). Sice with high robability o bucket will receive more tha c 1 N 2 elemets, this is equivalet to erformig a trasose oeratio with block size c 1 N 2. Ste (3): Usig threads, each ode N i sorts at most 1 N ( 1 c 1 ) values received i Ste (2) with the aroriate versio of our SMP sortig algorithm, deedig o the data tye. Ste (4): From its sorted list, of N ( 1) elemets, ode N 1 selects each j N th elemet as 2 Slitter[j], for (1 j N, 1). By default, Slitter[N ] is the largest value allowed by the data tye used. Additioally, for each Slitter[j], biary search is used to determie the values Frac L [j] ad Frac R [j], which are resectively the fractios of the total umber of elemets at ode N 1 with the same value as Slitter[j,, 1] ad Slitter[j] which also lie betwee idex (j, 1) N 2 +1, ad idex j N 2, iclusively. Ste (5): Node N 1 broadcasts the Slitter, Frac L, ad Frac R arrays to the other N, 1 odes. Ste (6): Each ode N i uses biary search o its sorted local array to defie for each of the N slitters a subsequece S j. The subsequece associated with Slitter[j] cotaisall thosevalueswhich are greater tha Slitter[j,1] ad less tha Slitter[j], as well as Frac L [j] ad Frac R [j] of the total umber of elemets i the local array with the same value as Slitter[j, 1] ad Slitter[j], resectively. 3

4 Ste (7): Each ode N i routes the subsequece associated with Slitter[j] to rocessor N j,for (1 i; j N ). Sice with high robability o sequece will cotai more tha c 2 N 2 elemets, for (c 2 5:42), this is equivalet to erformig a trasose oeratio with block size c 2 N 2. Ste (8): Usig threads, each ode N i merges the N sorted subsequeces received i Ste (7) to roduce the i th colum of the sorted array. Note that, with high robability, o ode has received more tha 2 N elemets, for ( 2 2:62). A detailed aalysis of this algorithm ca be foud i [9], ad we merely reeat its coclusio here for coveiece. With high robability, the overall comlexity of our samle sort algorithm is give (for floatig oit umbers) by T (; ) T com (; N; ) +T comm (; N; ) for N 2 < = O 3l. N log + 4 Performace Evaluatio log Nm log (z) N + + N Our algorithms were imlemeted usig POSIX threads ad ru o a DEC Alha Cluster. Our DEC Alha cluster cosists of 10 AlhaServer 2100A systems, each of which holds 4 Alha 21064A rocessors ruig each at 275 MHz. Each Alha 21064A rocessor has a 16KB rimary data cache ad a 4MB secodary data cache. The AlhaServers are coected usig the Digital Gigaswitch/ATM ad OC-3c adater cards, which have a eak badwidth ratig of Mbs. Iterode commuicatio is effected by calls to the SIMPLE collective commuicatio rimitives [5]. We tested our code o a variety of bechmarks, each of which had both a 32-bit iteger versio ad a 64-bit double recisio floatig oit umber (double) versio. See [9] for a detailed descritio ad justificatio of these bechmarks Exerimetal Results for a Sigle SMP We begi by exerimetally determiig the otimal values of the arameters m ad z, wherem is the block size i Ste (1A) ad z is the z-way merge used i Ste (1B). Table 1 dislays the times required to sort 4M doubles usig four threads as a fuctio of m ad z. Clearly, erformace suffers dramatically whe the block size reaches 4MB (512K eight byte double recisio umbers), which 1 A is the limit of the secodary cache. This is exected, sice sortig a block i Ste (1A) ow requires that data be reeatedly swaed to mai memory. It is more difficult to exlai why the otimal values for m ad z were 32K ad 32, resectively, sice this block size exceeds the 16KB rimary cache but falls well short of the secodary cache size. Part of the exlaatio might lie i the fact that the overall comlexity of O log + log( m ) obscures the fact that the comlexity of the block sort i Ste (1a) is O log m + +, which is a icreasig fuctio of m. Thevalueofz =32isreasoable whe we ote tha the actual cache lie size is 32 bytes. Our tree of losers is imlemeted with z 12-byte records, so the etire rocess could easily take lace i the 16KB rimary cache. For the remaiig discussio, the times reorted are for the exerimetally otimal values of m ad z ad for the umber of samles s =8. The relative erformace of the algorithm o differet bechmarks seems to cofirm that this was a reasoable choice for s. Block Deomiatio of z-way Merge Size K K K K K K K K K M (No z-way merge is ecessary for this block size) Table 1. Time (i secods) required to sort 4M doubles usig four threads as a fuctio of M ad z. Table 1 also rovides a iterestig illustratio i the imortace of miimizig secodary memory access. If we cosider the data for a block size of 2K, the executio time dros as we move from z = 2 to z = 4 to z = 8. This is reasoable sice we require 9 rouds of 2-way merge, 5 rouds of 4-way merge, ad oly 3 rouds of 8-way merge, ad each roud of z-way merge is obviously aother roud where all the iut elemets must be brought i from mai memory. Movig from z = 8 to z = 16 has little effect o the executio time sice it does othig to reduce the memory requiremets, but movig to z =32saves a roud of memory access ad, hece, the executio time is further reduced. However, the most dramatic illustratio of the imortace of miimizig secodary memory access ca be foud by comarig the otimal sortig time of 3:41 secods for m =32K ad z =32with the time of 9:86 secods required to sort usig oly biary merge sort. Reducig memory access by a combiatio of block sortig ad 4

5 z-way mergig imroved erformace early threefold. Tables 2 dislays the erformace of our sortig algorithm as a fuctio of iut distributio for a variety of iut sizes. Notice that the he erformace is essetially similar for bechmarks [U], [G], ad[wr] ad for bechmarks [Z], [DD], ad[rd]. The reaso why the bechmarks i the secod grou ra sigificatly faster tha the bechmarks i the first grou is that the secod grou of bechmarks all cotaied values restricted to a very small rage (f0g,f0; 1; 2g, ad f0; 1; :::; 32g) which results i sigificat savigs i time required for sortig ad mergig. Because there was little differece i the erformace of the bechmarks i the first grou, the remaider of this sectio will oly discuss the erformace of our sortig algorithm o the sigle bechmark [U] (uiform distributio). Iut Bechmark Size [U] [G] [Z] [WR] [DD] [RD] 1M M M M Table 2. Sortig itegers (i secods) usig 4 threads o a DEC AlhaServer 21000A. The results i Figure 1 examie the scalability of our sortig algorithm as a fuctio of the umber of threads. Results are show for sortig both itegers ad doubles. Bearig i mid that these grahs are log-log lots, they show that for a fixed iut size the executio time early halves whe the umber of threads is doubled. The grah i Figure 2 examies the scalability of our sortig algorithm as a fuctio of roblem size, for differig umbers of threads. They show that for a fixed umber of threads there is a almost liear deedece betwee the executio time ad the total umber of elemets. Figure 1. Scalability of sortig itegers ad doubles with resect to the umber of threads Exerimetal Results for a Cluster of SMPs For each exerimet, the iut is evely distributed amogst the odes. The outut cosists of the elemets i o-descedig order arraged amogst the odes so that the elemets at each ode are i sorted order ad o elemet at ode N i is greater tha ay elemet at rocessor N j, for all i<j. Note that i all cases the results show for a sigle ode were obtaied usig the sortig algorithm for a sigle SMP. Table 3 dislays the erformace of our sortig algorithm as a fuctio of iut distributio for a variety of iut sizes. I each case, the erformace is essetially ideedet of the iut distributio. Because of this ideedece, the remaider of this sectio will oly discuss the erformace of our sortig algorithm o the sigle bechmark [U] (uiform distributio). Figure 2. Scalability of sortig doubles with resect to the roblem size, for differig umbers of threads. 5

6 Iut Bechmark Size [U] [G] [2-G] [B] [S] [Z] [WR] [DD] [RD] 4M M M M Table 3. Total executio time (i secods) required to sort a variety of doubles bechmarks o a 8 ode cluster usig 4 threads. The results i Figure 3 examies the scalability of our sortig algorithm as a fuctio of the umber of odes, for a variety of threads. To uderstad these results, cosider the ste by ste breakdow of the executio times show i Table 4 for sortig 8M itegers with both 1 ad 4 threads. Movig from oe ode to two itroduces the overhead of Stes 1-2 ad 4-8, which together accout for aroximately 35% ad 50% of the total executio time o oe ode with 1 ad 4 threads, resectively. This cosumes the majority of the time we could hoe to save by sharig the work of sortig amogst two odes. The effect is more roouced for multile threads because as our model redicts iterode commuicatio is ideedet of the umber of threads. The effect of this overhead would be eve more roouced were it ot for the fact that the time required for Ste 3 for both 1 ad 4 odes is cosiderably higher tha we would exect from sortig 4M itegers o a sigle ode. But movig betwee 1 ad 4 odes ad 1 ad 8 odes, the time required for Ste (3) scales iversely with the umber of odes, which is the exectatio of our model. The failure of commuicatio i Stes 2 ad 7 to scale iversely with the umber of odes might at first aear surrisig. However, this erformace is actually quite reasoable if we recall that for 2, 4, ad 8 odes, each ode has to sed aroximately 2M, 1.5M, ad 0.875M itegers across the etwork, resectively. The clear imlicatio of these results is that a algorithm must be both efficiet ad scalable to justify the use of multile odes. Oe Thread Four Threads Ste(s) Total Table 4. Time required for each ste of sortig 8M itegers with resect to the umber of odes usig 1 ad 4 threads. The grahs i Figures 4 ad 5 examie the scalability of our sortig algorithm as a fuctio of roblem size, for Figure 3. Scalability of sortig itegers with resect to the umber of odes, for differig umbers of threads. differig umbers of odes ad for 1 ad 4 threads. For oe thread, they show that for a fixed umber of odes there is a almost liear deedece betwee the executio time ad the total umber of elemets. The results for 4 threads i Figure 5 are seemigly more roblematic. However, a ste by ste breakdow of the ruig times i [9] shows that the commuicatio costs domiate the executio time ad that as the roblem size icreases from 1M to 2M itegers the commuicatio costs actually decrease,resumablybecause of a chage i the commuicatio rotocol. Oce the rotocol is switched, the relative costs of commuicatio declie ad the executio time scales with roblem size as our model aticiates. Refereces [1] A. Aggarwal, A. Chadra, ad M. Sir. Heirarchical Memory with Block Trasfer. I Proceedigs of the 28th Aual IEEE Symosium o Foudatios of Comuter Sciece, ages , October [2] A. Aggarwal ad G. Plaxto. Otimal Parallel Sortig i Multi-Level Storage. I Proceedigs of the Fifth Aual ACM-SIAM Symosium o Discrete Algorithms, ages , [3] A. Aggarwal ad J. Vitter. The Iut/Outut Comlexity of Sortig ad Related Problems. Commuicatios of the ACM, 31: ,

7 Figure 4. Scalability of sortig itegers with resect to the roblem size, for differig umbers of odes ad a sigle thread. Figure 5. Scalability of sortig itegers with resect to the roblem size, for differig umbers of odes ad four threads. [4] B. Aler, L. Carter, E. Feig, ad T. Selker. The Uiform Memory Hierarchy Model of Comuatatio. Algorithmica, 12:72 109, [5] D.A. Bader ad J. JáJá. SIMPLE: A Methodology for Programmig High Performace Algorithms o Clusters of Symmetric Multirocessors. UMIACS-TR Techical Reort, UMIACS, Uiversity of Marylad, [6] R. Barve, E. Grove, ad J. Vitter. Simle Radomized Mergesort o Parallel Disks. I Proceedigs of the Eighth Aual ACM Symosium o Parallel Algorithms ad Architectures, ages , Padua, Italy, Jue [7] D.R. Helma, D.A. Bader, ad J. JáJá. A Radomized Parallel Sortig Algorithm With a Exerimetal Study. Techical Reort UMIACS-TR-96-53, UMI- ACS, Uiversity of Marylad, To aear i the Joural of Parallel ad Distributed Comutig. [8] D.R. Helma, D.A. Bader, ad J. JáJá. Parallel Algorithms for Persoalized Commuicatio ad Sortig With a Exerimetal Study. I Proceedigs of the Eighth Aual ACM Symosium o Parallel Algorithms ad Architectures, ages , Padua, Italy, Jue [9] D.R. Helma ad J. JáJá. Sortig o Clusters of SMPSs. Techical Reort UMIACS-TR-97-69, UMI- ACS, Uiversity of Marylad, [10] D.R. Helma, J. JáJá, ad D.A. Bader. A New Determiistic Parallel Sortig Algorithm With a Exerimetal Evaluatio. Techical Reort UMIACS-TR , UMIACS, Uiversity of Marylad, [11] M. Nodie ad J. Vitter. Large-Scale Sortig i Parallel Memories. I Proceedigs of the Third Aual ACM Symosium o Parallel Algorithms ad Architectures, ages 29 39, Newort, RI, Jue [12] M. Nodie ad J. Vitter. Determiistic Distributio Sort i Shared ad Distributed Memory Multirocessors. I Proceedigs of the Fifth Aual ACM Symosium o Parallel Algorithms ad Architectures, ages , Vele, Germay, Jue [13] P. Varma, B. Iyer, D. Haderle, ad S. Du. Parallel Mergig: Algorithm ad Imlemetatio Results. Parallel Comutig, 15: , [14] J. Vitter ad E. Shriver. Algorithms for Parallel Memory I: Two-Level Memories. Algorithmica, 12: ,

A NOTE ON COARSE GRAINED PARALLEL INTEGER SORTING

Chater 26 A NOTE ON COARSE GRAINED PARALLEL INTEGER SORTING A. Cha ad F. Dehe School of Comuter Sciece Carleto Uiversity Ottawa, Caada K1S 5B6 æ {acha,dehe}@scs.carleto.ca Abstract Keywords: We observe