Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms


Timothy Furtak (furtak@cs.ualberta.ca), José Nelson Amaral (amaral@cs.ualberta.ca), Robert Niewiadomski (niewiado@cs.ualberta.ca)
Department of Computing Science, University of Alberta, Edmonton, AB, Canada

ABSTRACT

Most contemporary processors offer some version of Single Instruction Multiple Data (SIMD) machinery: vector registers and instructions to manipulate data stored in such registers. The central idea of this paper is to use these SIMD resources to improve the performance of the tail of recursive sorting algorithms. When the number of elements to be sorted reaches a set threshold, data is loaded into the vector registers, manipulated in-register, and the result stored back to memory. Three implementations of sorting with two different SIMD machineries (x86-64's SSE2 and G5's AltiVec) demonstrate that this idea delivers significant speed improvements. The improvements provided are orthogonal to the gains obtained through empirical search for a suitable sorting algorithm [11]. When integrated with the Dynamically Tuned Sorting Library (DTSL) this new code generation strategy reduces the time spent by DTSL by up to 22% for moderately-sized arrays, with greater relative reductions for small arrays. Wall-clock performance of d-heaps is improved by up to 39% using a similar technique.

Categories and Subject Descriptors: C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors): Single-instruction-stream, multiple-data-stream processors (SIMD)

General Terms: Algorithms, Performance

Keywords: Quicksort, Sorting, Sorting Networks, SIMD, Instruction-Level Parallelism, Vectorization.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SPAA'07, June 9-11, 2007, San Diego, California, USA. Copyright 2007 ACM /07/ $5.00.

1. INTRODUCTION

This paper addresses the automatic generation of efficient code to sort short sequences of values. The idea is that an ahead-of-time optimizer searches for fast code for several sequence lengths and machine configurations. Then the compiler can simply instantiate such code when generating an optimized library. While algorithm-specific optimizations and empirical search have long been used both for scientific computation and for large parallel machines [4, 5, 19, 21], only recently were these techniques applied to integer-intensive, symbolic computation. Li et al. developed the Dynamically Tuned Sorting Library that adapts to the characteristics of the input to be sorted [11].

The main contribution of this paper is the insight that the resources implemented in contemporary processors to enable SIMD computations can be put to good use to improve the performance of sorting short sequences. As demonstrated in this work, the effective use of these SIMD resources improves performance through the reduction of memory references and an increase in instruction-level parallelism.

The initial inspiration for this work was the need for fast sorting of short sequences in the implementation of graphics rendering in interactive video-game applications. In such applications it is often necessary to decide, for each pixel of the image, what is the order of the elements that should be displayed [2]. Even though Z-buffer pixel-ordering computations are typically handled by a specialized Graphics Processing Unit (GPU), there are plenty of similar ordering computations that are done by the Central Processing Unit (CPU) in computer games. For instance, sorting is used to characterize the intensity of the various light sources that illuminate a character. Moreover, contemporary video-game applications have at their disposal a rich supply of SIMD registers and instructions. For example, the PowerPC-based XBox 360 hardware features 128 AltiVec registers on each of its three cores along with an expanded set of AltiVec instructions. In addition to interactive video-game applications, sorting of short sequences is also present in particle-physics simulation applications. Thus, using SIMD registers and instructions to sort small sequences is natural.
Once a solution was created, applying it to the sequences that must be sorted at the tail-end of standard recursive sorting algorithms was the next logical step. The experimental evaluation of the new vector-register-based sorting algorithms presented in this paper used commodity processors (x86-64 and G5) and extensions to the DTSL library because these machines and algorithms are more readily available and exploitable than proprietary video-game hardware and software. The algorithms presented are effective for sorting short sequences of floating-point or integer values (keys), and pairs comprised of a key and a memory address, i.e. key-pointer pairs, as well as for computing the index of a minimum (maximum) element.

Three new SIMD-based algorithms use the concept of sorting networks, which are effective for sorting small sets of numbers. Section 2 describes: (1) the operation of standard sorting networks; (2) how the SIMD vectors can be used to implement sorting networks; and (3) how a code generator can instantiate optimized vector code for sorting networks operating on sequences of any length.

The main contributions in this paper are: three algorithms that use the SIMD machinery of contemporary processors for efficient in-register sorting of short sequences and their integration into an optimized general-purpose sorting library; a method to use iterative-deepening search to find fast instruction sequences to move data within the SIMD registers; a method to compute the minimum element in an array, with applications to d-heaps; and an extensive experimental study on three different processors that demonstrates up to 22% improvement in the performance of DTSL for moderately-sized arrays, and up to 39% in d-heaps. This study also indicates that the elimination of loads, stores, branches, and branch mispredictions correlates well with the improved performance.

Section 3 describes two algorithms that combine first-pass sorting in the SIMD registers with second-pass sorting in memory. Section 4 describes an algorithm that sorts shorter sequences completely within the SIMD registers, thus eliminating branch instructions altogether. Section 5 describes how to extend these key-sorting algorithms to sort key-pointer pairs, and Section 6 uses similar techniques to speed up heapify-down operations in d-heaps. The experimental evaluation is presented in Section 7.

2. SORTING NETWORKS

The inputs to an in-place comparator, COMP(a, b), are two storage units (memory locations, registers, or vector-register elements) a and b, each containing a numerical input. After the comparator executes, the lower numerical value is stored in one of the two storage units and the higher numerical value in the other. Knuth describes a comparator network as a device that applies a fixed sequence of comparator operators to an input vector of a given size [8].
When a comparator network produces a sorted output for any possible input sequence, it is called a sorting network. The size of a sorting network is the total number of comparators in the network. The depth of a sorting network is the length of the critical path in its dependence graph. Therefore the depth provides a bound for the parallel execution of the sorting network, while the size provides a bound for sequential execution.

An example of a sorting network with size 5 and depth 3 is shown in Fig. 1. The network is depicted as a set of value-carrying vertical rails and comparators. Values flow from top to bottom. A heavy dot at a line crossing indicates that the value at the vertical rail is an input to the comparator represented by the horizontal line. A comparator moves the larger value to the left, and the smaller value to the right.

Figure 1: A 4-element sorting network.

For instance, if the inputs are a = 7, b = 2, c = 5, d = 9, then the sorted output at the bottom of the sorting network is a = 9, b = 7, c = 5, and d = 2. The value 9 moves from rail d to rail b at COMP(b, d), and then moves from rail b to rail a at COMP(a, b).

Although several algorithms are available to generate code for sorting networks, Batcher's odd-even mergesort algorithm is often chosen for its efficiency [1]. Batcher's algorithm uses $O(n \log^2 n)$ comparators and has a depth of $O(\log^2 n)$.

Sorting networks can be efficiently implemented in processors that provide a min and a max instruction. Sorting networks implemented with these instructions avoid the performance penalties of branch mispredictions incurred by traditional branch-based sorting implementations. The experimental results in Section 7 indicate that eliminating branches in the code of sorting networks is a significant win in contemporary processors.

2.1 Supporting Hardware

Consider a machine that has the following min and max instructions:

$$\min(a, b) = \begin{cases} a & \text{if } a \le b \\ b & \text{otherwise} \end{cases} \qquad \max(a, b) = \begin{cases} a & \text{if } a \ge b \\ b & \text{otherwise} \end{cases}$$

The comparator required by a sorting network is easily constructed using these two operations, a copy instruction, and a temporary variable.
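As an illustration of this construction, a scalar comparator and a 4-input network can be sketched in portable C. This is a minimal sketch and not the generated code discussed in this paper: the names `min_f`, `max_f`, `comp`, and `sort4` are hypothetical, the ascending output order is an assumption, and the order of the five comparators (pairs two positions apart first) is one valid choice among several.

```c
/* Minimum and maximum written without explicit if statements; on targets
   with scalar min/max instructions a compiler can lower these ternaries to
   them, although that is not guaranteed. */
float min_f(float a, float b) { return a <= b ? a : b; }
float max_f(float a, float b) { return a >= b ? a : b; }

/* Comparator built from min, max, a copy, and one temporary.
   After the call *a holds the smaller value and *b the larger
   (ascending order is an assumption of this sketch). */
void comp(float *a, float *b) {
    float t = *a;          /* copy of the first input */
    *a = min_f(*a, *b);    /* smaller value */
    *b = max_f(*b, t);     /* larger value  */
}

/* A 4-input sorting network of size 5 and depth 3.
   The same fixed sequence of comparators runs for every input. */
void sort4(float v[4]) {
    comp(&v[0], &v[2]); comp(&v[1], &v[3]);   /* layer 1 */
    comp(&v[0], &v[1]); comp(&v[2], &v[3]);   /* layer 2 */
    comp(&v[1], &v[2]);                       /* layer 3 */
}
```

Calling `sort4` on the values 7, 2, 5, 9 leaves them in ascending order; because the comparator sequence is fixed, the control flow does not depend on the data.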
For instance, such instructions are available in the x86-64 architectures supporting the SSE2 min and max operations that return the minimum (maximum) packed single-precision floating-point values [6].¹

The extension of sorting networks to operate on vector instructions requires the definition of vectorized min and max instructions.² For input vectors $A$ and $B$, $|A| = |B| = n$, let $C = \min(A, B)$ be the element-wise minimum vector, such that $C_i = \min(A_i, B_i)$, $1 \le i \le n$. The vectorized max instruction is defined similarly.

The width of a (vectorized) sorting network refers to the number of vectors being sorted. Given an ordered list of vectors $X^1, X^2, \dots, X^n$, a stream of data is formed by selecting the $i$-th element from each vector in order, thus the $i$-th stream is $X^1_i, X^2_i, \dots, X^n_i$. For instance, the x86-64 architecture has 16 XMM vector registers, and each register can hold 4 floating-point values. Therefore, sorting the values in $n$ XMM registers using a sorting network produces 4 sorted streams of data of length $n$. Up to 15 XMM registers can be used, i.e. $1 \le n < 16$, because one register must be reserved as temporary storage for the swap of values in the comparator.

This compare-and-swap machinery offers several advantages for sorting a small set of values that fits within the SIMD registers: (1) its operation is unconditional and data independent; (2) it is inherently branch-free, and thus free of branch-prediction performance penalties; (3) it increases the bandwidth of sorting by enabling the SIMD instruction-level parallelism; and (4) each compare-and-swap requires the execution of only 3 instructions.

¹ SSE stands for Streaming SIMD Extensions. SSE2 improves upon the original SSE.
² These vector instructions are called SIMD extensions.

A code generator must be able to generate code to sort sequences of any length in a machine with $n+1$ SIMD registers. The solution is to define size-optimal sorting networks that use $1, 2, \dots, n$ registers. The optimal code for the implementation of each of these sorting networks is pre-generated and stored in a small codebase available to the code generator for deployment. Once data has been loaded into the SIMD registers the code generator instantiates the code to perform the comparator operations specified by the sorting network, and integrates the resulting streams.

3. STREAM-BASED TWO-PASS SORTING

The first two SIMD-based sorting algorithms discussed in this paper operate in two phases. In the first phase the SIMD registers and instructions are used to generate a partially-sorted output. In the second phase a standard sorting algorithm (insertion sort and mergesort are investigated in this paper) finishes the sorting. The choice of algorithm for the second phase dictates the best data organization for the first one.

For the first phase, consider the use of the SIMD sorting machinery described in Section 2 for the task of sorting a sequence of $k \times n$ values using $n$ SIMD registers, each register capable of storing $k$ values. Each group of $k$ values is loaded from memory into a separate SIMD register. For a moment, assume that the start of the sequence is aligned for such a load operation. The sorting machinery is then applied to produce $k$ sorted streams of length $n$, and the sorted streams are written back in-place to memory in an interleaved form. The organization of the data in memory for $k = 4$ is shown in Fig. 2. After sorting, $A_1 \le A_2 \le \dots \le A_n$, $B_1 \le B_2 \le \dots \le B_n$, etc. After this initial sorting the ordering relationship between elements from separate streams, $A_i$, $B_i$, $C_i$, and $D_i$, is still unknown. Now the output from the vectorized sorting network must undergo an additional sorting pass.
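The first phase can be illustrated with a scalar emulation in C for $k = 4$ and $n = 4$. This is a minimal sketch under stated assumptions, not the code produced by the code generator: the `Vec` type stands in for a 4-lane SIMD register, the names `vec_min`, `vec_max`, `vec_comp`, and `sort_streams_4x4` are hypothetical, and the loops over lanes stand in for single packed min/max instructions.

```c
#include <string.h>

enum { K = 4 };                        /* values per emulated SIMD register */
typedef struct { float lane[K]; } Vec; /* stand-in for one vector register  */

/* Element-wise minimum: C_i = min(A_i, B_i). */
Vec vec_min(Vec a, Vec b) {
    Vec c;
    for (int i = 0; i < K; i++)
        c.lane[i] = a.lane[i] <= b.lane[i] ? a.lane[i] : b.lane[i];
    return c;
}

/* Element-wise maximum, defined similarly. */
Vec vec_max(Vec a, Vec b) {
    Vec c;
    for (int i = 0; i < K; i++)
        c.lane[i] = a.lane[i] >= b.lane[i] ? a.lane[i] : b.lane[i];
    return c;
}

/* Vector comparator: copy, min, max (three operations, one temporary). */
void vec_comp(Vec *a, Vec *b) {
    Vec t = *a;
    *a = vec_min(*a, *b);
    *b = vec_max(*b, t);
}

/* Sorts 16 values in place into K interleaved sorted streams:
   afterwards data[j*K + i] is nondecreasing in j for every lane i. */
void sort_streams_4x4(float data[16]) {
    Vec r[4];
    memcpy(r, data, sizeof r);                        /* load 4 registers */
    vec_comp(&r[0], &r[2]); vec_comp(&r[1], &r[3]);   /* 4-input network  */
    vec_comp(&r[0], &r[1]); vec_comp(&r[2], &r[3]);   /* applied to whole */
    vec_comp(&r[1], &r[2]);                           /* registers        */
    memcpy(data, r, sizeof r);                        /* store in place   */
}
```

After the call each of the four streams is sorted, but nothing is known about the relative order of elements from different streams, which is why a second pass is still required.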
Let us examine the use of insertion sort and mergesort to finish sorting this partially sorted output. If the start of the sequence is not aligned, the technique used in this paper will be to sort the aligned vector blocks that overlap the target region. The extra fringe elements will be saved to a temporary array, replaced by positive/negative infinity as appropriate, and resorted upon completion.

A_1 B_1 C_1 D_1 | A_2 B_2 C_2 D_2 | ... | A_n B_n C_n D_n

Figure 2: Interleaved sorted streams from $n$ 4-element SIMD registers. The first register contains elements $A_1$, $B_1$, $C_1$, and $D_1$. $A_1 \le A_2 \le \dots \le A_n$, etc.

3.1 Second Pass with Insertion Sort

A standard insertion-sort algorithm may be used to sort the output of the SIMD-based sorting network. Insertion sort delivers the best performance when its input is mostly sorted because the algorithm does not have to move elements very far. Thus a potential issue with using insertion sort as a second pass is how the data should be loaded into the SIMD vectors in the first phase to produce the most favorable input for insertion sort.

Consider an input sequence of $S$ values, and a machine with $n + 1$ SIMD vectors. Each vector can store up to $k$ values. Let $m = S/k$. If $m \le n$ the entire array can be loaded into the SIMD registers, sorted, and written back in-place. Then a call to insertion sort will finish sorting the entire sequence.

If $m > n$, an in-place algorithm divides the array into subsets small enough to fit in the vector registers, sorts them with a sorting network, and writes each sorted subset back to the same locations. A naive approach would simply divide the array into $m/n$ almost equal-sized blocks. However, if the data is uniformly distributed this partition results in $m/n$ similar blocks, one after the other. The problem is that small elements from the last block would have similar values to the small elements from the first block, and would require insertion sort to move many elements to far positions to combine these blocks.

A better approach is to load the blocks into the SIMD registers in a strided fashion. Consider for example $n = 4$ and $m = 12$, which requires three sorting network calls.
Instead of the first call acting on elements $A_1$, $A_2$, $A_3$, and $A_4$, it acts on $A_1$, $A_4$, $A_7$, and $A_{10}$. The second call acts on elements $A_2$, $A_5$, $A_8$, and $A_{11}$, and the third on $A_3$, $A_6$, $A_9$, and $A_{12}$. In this way the small values in the array are likely to end up in $A_1$, $A_2$, and $A_3$. A stride width greater than one improves insertion sort performance in cases of uniform or mostly-sorted distributions. In this paper, this strided version of the vectorized sorting network followed by an insertion sort pass is called ISort.

3.2 Second Pass with Mergesort

The mergesort algorithm, called MSort, uses a fixed-size block of temporary storage T that is large enough to hold the entire array A. Because the SIMD-based sorting is applied to small sequences this array will not be large in practice. MSort proceeds as follows. Compute the number of blocks of data to be sorted, $m/n$, and allocate temporary space T. Call the sorting network on each block from A and store the sorted streams to T. The Q-MERGE algorithm described by Wickremesinghe et al. [20], based on work by [14], is now used to store the sorted data into A: (1) Build a heap containing the first element in each stream, and associate with each element a pointer to the next element in its stream; (2) Repeatedly extract the minimum element from the heap. During the extraction, replace the removed element with the next element in its stream, and rebuild the heap.

With a small number of streams, sufficient registers may be available to contain the entire heap. Heapify operations are then efficient and the only flow of data to/from memory is to fetch the next item from a stream or to store the next value to A. For heaps that are too large to fit within the available registers, in-memory heap code may be used. Maintenance operations on small heaps may be written using the known register locations of elements, avoiding potentially costly memory accesses and pointer indirections.

MSort uses one merge heap, with the number of inputs being a multiple of v. That is, each heap completely handles the output from one or more vectorized sorting network calls. Further, only heaps which may be contained within the available registers are considered.

Additional optimizations include placing a sentinel value of infinity at the end of each stream to avoid checking if streams are empty [20]. Once the sentinel is loaded into the heap it will sink to the bottom. When any sentinel is extracted from the heap the sorting is complete. Each sorting network call places elements from the same stream a constant distance away from each other. Thus the next element on a stream can be found by adding a constant offset to the address of the current element, which makes the maintenance of the next-element pointer in the heap straightforward.

4. ONE-PASS VECTOR SORTING

The third SIMD-based sorting algorithm accomplishes the sorting in a single pass. Intuitively this is possible by loading all of the n elements to be sorted into the vector registers, applying the comparators for an n-element (scalar) sorting network, and writing the elements back to memory in-place.

Table 1: SSE2 instructions used in the example of Fig. 4.

Instruction          Description
movaps Ra, Rb        Copy the contents of Rb to Ra.
shufps Ra, Rb, i     Copy 2 elements of Ra to the 2 low-order words of Ra, and 2 elements of Rb to the 2 high-order words of Ra. The elements to be copied are specified by i.
movhlps Ra, Rb       Copy the 2 high-order words from Rb to the 2 low-order words of Ra.
movlhps Ra, Rb       Copy the 2 low-order words from Rb to the 2 high-order words of Ra.

The difficulty with this approach lies in repositioning elements within the vector registers such that the vector comparator operations do not corrupt the values of elements not involved in the comparison. Moreover, simply aligning comparator inputs may be challenging, depending on the fragmentation of free locations within the vector registers.

Since the cost of applying a vector comparator remains the same regardless of the number of relevant values in each input vector, a natural optimization is to execute more than one (scalar) sorting-network comparator at a time.
However, the cost of additional data-movement instructions to properly position multiple comparator inputs in each vector register may outweigh the benefit of parallelization. In practice, for the sorting networks considered, it did not appear to be the case that aligning as many elements as possible³ was ever detrimental to the resulting sequence of operations. However, the algorithm we present does provide the ability to balance such alignment costs for the target architecture.

³ With the optimization that the sorting networks corresponding to Batcher's Merge Exchange are explicitly separated into layers.

4.1 Searching to Align Vector Elements

We will first describe the algorithm used for finding a sequence of alignment instructions, and then show how this applies to a small 4-element sorting network.

Figure 3: An 8-element sorting network produced by Batcher's Merge Exchange, with breaks indicated between layers.

4.1.1 Algorithm Input

The input to our algorithm is a sequence of comparators corresponding to a sorting network. In our case the sorting networks were produced by Batcher's Merge Exchange, not to be confused with Batcher's Bitonic Sort. Merge Exchange has the property of producing an initial sequence of comparators connecting elements that are separated by powers of 2. This allows for executing a large number of parallel comparators at the start without any need for alignment instructions.

The data dependencies in the sorting network define a partial ordering for the execution of the comparisons. The comparators can thus be partitioned into sets in such a way that all the comparators in each set can be executed in parallel. This partition corresponds to the computation of the maximal anti-chains in a data-dependency graph [18].

One natural optimization, considering multiple legal orderings of the comparator sequence, was not implemented due to the combinatorial increase in the search space. While we present no formal approximation bounds, we feel that the resulting suboptimality of the instruction sequences produced is not significant.

One important optimization which reduces both the number of assembly instructions and the time needed to search for a sequence is to insert explicit breaks between levels of the Merge Exchange sorting network.
That is, to disallow executing scalar comparators from different levels within one parallel comparator. For this purpose we consider levels to be the results of the innermost loop in Batcher's Merge Exchange algorithm as described in [8]. An 8-element Merge Exchange network is shown in Fig. 3 with such layer breaks indicated.

The sequence of alignment instructions within a layer is often repeated for subsequent blocks of scalar comparators. Forcing breaks between levels may be thought of as helping to maintain this repeating pattern of element positions within vectors. This repetition is not exploited directly, but it does seem to introduce less noise which may propagate when rearranging elements.

4.1.2 Initial State

For convenience we will assume that we have an unbounded number of vector registers. The resulting sequence of assembly instructions may be restricted to a small number of physical vector registers as a post-processing step by spilling and loading values to and from memory as appropriate.
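The layered comparator sequence that serves as input to the search can be sketched in C by following the loop structure of the merge exchange algorithm as given by Knuth [8]. This is a minimal sketch under stated assumptions: the `Comparator` type and the function `merge_exchange` are hypothetical names, elements are labeled from 0, and each pass of the innermost loop is treated as one layer, which is our reading of the layer breaks described above.

```c
/* One scalar comparator between element labels lo < hi, tagged with the
   layer (pass of the innermost loop) that produced it. */
typedef struct { int lo, hi, layer; } Comparator;

/* Generates the merge exchange comparators for n elements.
   Writes at most cap entries to out and returns the total number
   of comparators in the network. */
int merge_exchange(int n, Comparator *out, int cap) {
    int t = 0;
    while ((1 << t) < n) t++;              /* t = ceil(lg n) */
    if (t == 0) return 0;                  /* 0 or 1 element: nothing to do */

    int count = 0, layer = 0;
    for (int p = 1 << (t - 1); p > 0; p >>= 1) {
        int q = 1 << (t - 1), r = 0, d = p;
        for (;;) {
            /* Innermost loop: all comparators emitted here are independent
               of each other and form one layer. */
            for (int i = 0; i + d < n; i++) {
                if ((i & p) == r) {
                    if (count < cap) {
                        out[count].lo = i;
                        out[count].hi = i + d;
                        out[count].layer = layer;
                    }
                    count++;
                }
            }
            layer++;
            if (q == p) break;
            d = q - p;
            q >>= 1;
            r = p;
        }
    }
    return count;
}
```

For 4 elements this yields five comparators in three layers, starting with the pairs that are two positions apart; for 8 elements it yields 19 comparators. Within a layer no element label appears twice, so the comparators of a layer are candidates for execution by a single vector comparator once their inputs are aligned.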

We will also assume that the elements are located in a continuous region of memory, are appropriately aligned, and that the number of elements is a multiple of the size of a vector. These restrictions are for simplification only and may be lifted by making small changes to the algorithm.

Note that the process of searching for a sequence of alignment instructions is only concerned with keeping track of the labels of the elements contained within the vector registers; we will refer to manipulations of elements only for convenience.

The first step is to load all of the elements from memory into the vector registers. It is natural and convenient to assume a sequential labeling, such that the first memory location is labeled 0 and the last location $n - 1$. Given realistic constraints on the capabilities of the vector manipulation instructions, a number of empty vector registers are required as swap space for rearranging elements. In our experiments having 5 empty vector registers in addition to those registers holding the initial values was seen to be sufficient.

4.1.3 Aligning a Set of Comparators

While all of the comparators in the sorting network have not yet been executed, select the next $k$ comparators that do not cross a layer and such that $k$ is no larger than the number of elements in a vector register. The task is then to rearrange elements such that all the low elements from the $k$ comparators are in one vector, and all the high elements in another⁴, and aligned element-wise with their partner.

Such an alignment is only valid if applying a vector comparator will not erase the last copy of any element. An erasure must necessarily occur when comparing an element with either an empty (garbage) value or another element with an unknown ordering relation. Note that applying a vector comparator will also invalidate copies of compared elements that are located in other registers.

Finding a sequence of assembly instructions to accomplish this alignment is performed using a standard iterative-deepening search. The legal actions in a state are all vector assembly instructions which do not completely eliminate an element from the set of vector registers.
Due to feasibility concerns, each iterative-deepening search is divided into two phases: moving the low half of each comparator into one vector, and then moving the high half into alignment. If the maximum search depth in any one phase reaches 3, then that task is further subdivided into moving the first 2 elements into a vector, the next 2 elements into another, and finally combining them. Even with these incremental stages, due to the massive branching factor a naive implementation of this search would take a significant amount of time for even moderately large networks. Our implementation makes use of several admissible heuristics to prune portions of the search space.

To address the tradeoff between the cost of executing a vector comparator and the cost of alignment instructions, the above search is repeated for smaller values of $k$, and the final cost becomes a combination of the number of alignment instructions and a penalty for including fewer scalar comparators than is possible. Intuitively this attempts to select the sequence with the best ratio of number of alignments produced versus instructions required, with an additional bias towards producing more alignments since more alignments will reduce the total number of comparison steps.

⁴ The partitioning of low and high elements may be dropped if relabelling is performed when applying the vector comparator, based on whether the scalar comparator is inverted.

When all appropriate values of $k$ have been searched, the choice of how many comparators to include is made greedily and is not revisited. The vector comparator is then applied and the search continues using the remaining comparators.

4.1.4 Writing Values Back to Memory

After the final comparator has been applied the elements are sorted but are not located within the vector registers in an order in which they can be written back to memory. A similar iterative-deepening search now finds an instruction sequence to obtain the correct alignment.

4.2 Example Search

The sorting network shown in Fig. 1 will be used to illustrate the sequence of events in the alignment algorithm for single-pass in-register sorting. This network has four elements and requires the execution of five comparison instructions.
An in-register sorting instance of this network using the x86-64 SSE(2) SIMD machinery is shown in Fig. 4. The instructions used in this instance are described in Table 1.⁵ The sorting network of Fig. 1 produces the following partitions: P1 = {COMP(a, c), COMP(b, d)}; P2 = {COMP(a, b), COMP(c, d)}; and P3 = {COMP(b, c)}.

First the elements of XMM0 are assigned the four elements to be sorted (a, b, c, and d). Then a low-cost sequence of vector instructions is searched for to align a with c and b with d. Here this may be done with a single movlhps instruction in step 1. This allows for executing the COMP(a, c) and COMP(b, d) comparators in parallel (step 2).⁶ After this comparison the two elements of each compared pair are in sorted order relative to each other.

In Fig. 4 a blank square represents a vector element that contains an unknown value that is not relevant to the sorting process. For instance, after the comparison in step 2 the values that were in the low-order words of XMM0 may have moved. As they are not part of the sorting process they are now represented by blank squares. If the inputs to the sorting network are a = 7, b = 2, c = 5, and d = 9, this comparison would leave the highest-order words of XMM0 and XMM1 intact and would swap the contents of the second highest-order words. It may also swap the values in the two low-order words of these registers, but the contents of those words are irrelevant.

Now the two comparators in partition P2 are candidates for the next vector alignment. The initial state for this search is the position of the elements in the vectors at the end of step 2. In the example in Fig. 4 a sequence of two instructions, movhlps and shufps, is selected to align elements a with b and c with d. Thus both comparators of P2 can be executed in parallel in step 5.

A penultimate search is performed to execute the last comparator, resulting in steps 6 and 7, at which point the element values are sorted. Finally, the elements must be properly positioned within one register (in this case XMM1) before the sorted sequence can be written back to memory with a movaps instruction.

⁵ Other SSE2 instructions frequently used for data movement but not included in this example are: pshufd, unpckhps, and unpcklps.
⁶ For SSE2, a comparator between the contents of two registers Ra and Rb requires a temporary register T and the execution of three instructions: movaps T, Ra; minps Ra, Rb; and maxps Rb, T.

    Step 1:  movhlps xmm1, xmm0
    Step 2:  COMP(0, 1)
    Step 3:  movhlps xmm0, xmm1
    Step 4:  shufps xmm1, xmm0, 0x88
    Step 5:  COMP(0, 1)
    Step 6:  movhlps xmm2, xmm1
    Step 7:  movaps xmm3, xmm0
    Step 8:  COMP(2, 3)
    Step 9:  shufps xmm0, xmm2, 0x13
    Step 10: movlhps xmm1, xmm3
    Step 11: shufps xmm1, xmm0, 0x2
    Step 12: movaps [rsi+(0)], xmm1    (store to main memory)

Figure 4: Instruction sequence to apply an in-register 4-element sorting network in an x86-64 architecture. The associated sorting network is shown in Fig. 1. (The per-step contents of XMM0 to XMM3 shown in the original figure are not reproduced here.)

The vectorization of a sorting network only needs to be done once for each sorting network and for each architecture's set of vector instructions. Thus all the searches described above should be performed once and offline. The resulting schedule can then be used whenever a sequence of the corresponding size needs to be sorted.

5. SORTING KEY-POINTER PAIRS

So far this paper addresses the problem of sorting an array of floating-point values. A more general problem is that of sorting an array of data structures. Consider the case where each structure has a well-defined floating-point key value. Efficient algorithms sort an array of key-pointer pairs to avoid moving large data structures. This section describes an extension of the vectorized sorting networks to handle key-pointer pairs with floating-point keys and a byte sequence representing the pointer.

The solution to the key-pointer sorting problem consists of storing the keys and the pointers into separate SIMD vectors. If keys and pointers appear interleaved in memory then they must be swizzled when loaded into the SIMD vectors and this swizzling must be reversed when storing the sorted result to memory. With the keys and pointers in separate vectors, the standard sorting network solution is implemented for the keys, while the pointers move in synchrony with the key movements.
This is accomplished by using a bitmask to apply the swap operations only to selected elements in the pointer vector: specifically, those elements which correspond to changes in the key vector after applying the key comparator. The construction of this bitmask is supported in architectures that support SIMD operations. For instance, this may be done in a straightforward manner using the AltiVec vsel instruction, while x86-64 architectures must make use of a sequence of boolean operations to mask and combine registers as shown in Fig. 5.

    asm("pshufd xmm15, xmm1, 0xE4");  // xmm15 := copy of key_a
    asm("minps xmm1, xmm2");          // key_a := min(key_a, key_b)
    asm("maxps xmm2, xmm15");         // key_b := max(key_a, key_b), using the saved copy of key_a
    asm("cmpps xmm15, xmm1, 4");      // xmm15 := bitmask of lanes where key_a changed (old key_a != new key_a)
    asm("pshufd xmm14, xmm3, 0xE4");  // xmm14 := copy of ptr_a
    asm("xorps xmm14, xmm4");         // q := ptr_a XOR ptr_b
    asm("andps xmm15, xmm14");        // q := q AND bitmask
    asm("xorps xmm3, xmm15");         // ptr_a := ptr_a XOR q
    asm("xorps xmm4, xmm15");         // ptr_b := ptr_b XOR q

Figure 5: Key-pointer comparator using SSE2 assembly instructions. Vector registers xmm1 and xmm2 hold keys, registers xmm3 and xmm4 hold the respective pointers. Registers xmm14 and xmm15 are used as temporary storage.
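The effect of the sequence in Fig. 5 can be illustrated lane by lane in portable C. This is a minimal sketch, not the SSE2 code itself: the function name `kp_comp` is hypothetical, the vectors are emulated as arrays of `K = 4` lanes, and the pointers are assumed to be 32-bit payloads so that each one occupies a single lane.

```c
#include <stdint.h>

enum { K = 4 };   /* lanes per emulated SIMD register */

/* Key-pointer comparator. After the call, ka[i] <= kb[i] in every lane and
   each pointer has followed its key. The pointer swap is done with a
   mask-and-XOR sequence instead of a data-dependent branch. */
void kp_comp(float ka[K], float kb[K], uint32_t pa[K], uint32_t pb[K]) {
    for (int i = 0; i < K; i++) {
        float old_a = ka[i];                               /* copy of key_a */
        ka[i] = old_a <= kb[i] ? old_a : kb[i];            /* key_a := min  */
        kb[i] = kb[i] >= old_a ? kb[i] : old_a;            /* key_b := max  */

        /* All ones where key_a changed, i.e. where the keys were swapped. */
        uint32_t mask = (old_a != ka[i]) ? 0xFFFFFFFFu : 0u;

        uint32_t q = (pa[i] ^ pb[i]) & mask;               /* q := (ptr_a XOR ptr_b) AND mask */
        pa[i] ^= q;                                        /* ptr_a := ptr_a XOR q */
        pb[i] ^= q;                                        /* ptr_b := ptr_b XOR q */
    }
}
```

In lanes where the mask is zero, `q` is zero and both pointers are left untouched; in lanes where the mask is all ones, the two XOR operations exchange the pointers. Lanes with equal keys are not swapped.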

6. VECTORIZING D-HEAPS

d-heaps are a straightforward generalization of binary heaps where each internal node has $d$ children instead of 2. Increasing the value of $d$ results in a shallower tree at the expense of requiring delete-min operations to perform more work when searching for the child node with minimum key value. For concreteness assume min-heaps.

Assume an implicit heap layout, with all elements stored in a contiguous array. The root node is located at index 0, and the $n$th child of a node at index $i$ is located at index $i \cdot d + n$, with $1 \le n \le d$. The parent of any node may be similarly computed by dividing its index $- 1$ by $d$.

In [9, 10] LaMarca and Ladner investigate the performance of traditional implicit heaps and how they are affected by data caches. They suggest increasing the branching factor as well as the data alignment techniques described here and used in our implementation.

We present here a method for increasing d-heap performance by using SIMD vector instructions to quickly compute the index of the child with minimum key value. This computation is used within heapify-down operations. This method is similar to the one used for sorting key-pointer pairs in that it relies on the synchronous movement of values within a second set of registers. In this situation the values moving in synchrony are the indices of each child node (specifically the offset from the first child, such that the values range from 0 to $d - 1$).

For simplicity, assume that $d$ is a multiple of $k$, the number of elements in a SIMD vector. This assumption also aligns a node's children on both cache-line and SIMD vector boundaries. This alignment requires that the root node be located at the end of a cache-line such that its first child is at the beginning of the next cache-line.

If the nodes in the heap are key-pointer pairs, rather than just keys, loading into a SIMD vector may require additional swizzle instructions to interleave the keys from 2 separate vector loads. Only the key values are required; the associated pointer data may be discarded. When a block of keys is loaded into a SIMD vector, the index offsets for those keys are loaded into another vector from a constant and static array containing values $0, 1, \dots, d - 1$.
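The implicit layout and the idea of index offsets travelling with the keys can be illustrated with a scalar emulation in C. This is a minimal sketch under stated assumptions, not the SIMD implementation: the names `dheap_child`, `dheap_parent`, and `min_child_offset` are hypothetical, a register is emulated as `K = 4` lanes, $d$ is assumed to be a multiple of `K`, and the order in which lanes are combined at the end is one possible choice.

```c
#include <stddef.h>

enum { K = 4 };   /* lanes per emulated SIMD register */

/* Index of the n-th child (1 <= n <= d) of the node at index i. */
size_t dheap_child(size_t i, size_t d, size_t n) { return i * d + n; }

/* Index of the parent of the node at index i (i > 0). */
size_t dheap_parent(size_t i, size_t d) { return (i - 1) / d; }

/* Returns the offset (0 .. d-1) of the child with minimum key among the
   d keys of one node. The offsets are updated in step with the keys,
   emulating a second vector that moves in synchrony with the key vector. */
size_t min_child_offset(const float *keys, size_t d) {
    float  a[K];      /* running lane-wise minimum          */
    size_t ia[K];     /* offset of the key held in each lane */

    for (size_t l = 0; l < K; l++) { a[l] = keys[l]; ia[l] = l; }

    /* Lane-wise minimum against every further block of K keys. */
    for (size_t b = K; b < d; b += K) {
        for (size_t l = 0; l < K; l++) {
            if (keys[b + l] < a[l]) { a[l] = keys[b + l]; ia[l] = b + l; }
        }
    }

    /* Compare one half of the lanes against the other until one remains. */
    for (size_t half = K / 2; half >= 1; half /= 2) {
        for (size_t l = 0; l < half; l++) {
            if (a[l + half] < a[l]) { a[l] = a[l + half]; ia[l] = ia[l + half]; }
        }
    }
    return ia[0];
}
```

In a heapify-down step the returned offset would be added to the index of the node's first child, `dheap_child(i, d, 1)`, to locate the child to compare against the parent. In the vectorized setting the inner `if` statements correspond to masked selects rather than branches.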
The synhronous movement of the inex offsets is implemente in the sme mnner s the movement of the pointer vlues in Setion 5. The loing n movement of these offsets is omitte from the lgorithm esription for revity. The lgorithm proees s follows: (1) lo the first k keys into one SIMD vetor, ll this register A; (2) while unre keys remin, re the next k keys into SIMD vetor B n set A := min(a, B); (3) finlly, repetely ompre one hlf of the vlues in A ginst the other hlf until only one element remins; (4) return the inex of this element. If the noe eing exmine oes not hve hilren (this my only our t lst internl hep noe) then the vetorize serh is reple y strightforwr liner sn. 7. EXPERIMENTAL EVALUATION 7.1 Sorting Algorithms The three versions of vetorize sorting esrie in this pper were evlute y integrting them s the low-level lgorithms for DTSL s quiksort. The min finings of this experimentl evlution re: Signifint reutions in exeution time re possile for sorting on the Pentium 4, with lesser reutions on the G5 n Core 2 Duo, epening on rry size. The integrtion of SIMD-se sorting lgorithms to sort sequenes smller thn fixe threshol improves the performne of DTSL when sorting element rrys of floting-point key-pointer pirs y up to 22%. This performne improvement is ue not only to reution in the numer of los, stores, n rnh instrutions, ut lso to signifint erese in the numer of rnh mispreitions Integrting Algorithms into DTSL Tle 2: Algorithms stuie Algorithm Desription MSortX - Y MSort lgorithm with X strems pplie t Y threshol. ISortX - Y ISort lgorithm with X strems pplie t Y threshol. RSort - Y One-pss register sort pplie t Y threshol. DTSL - Y Originl DTSL quiksort with SN pplie t Y threshol. Ins - Y Stnr insertion sort pplie t Y threshol. Brnh misp. 
[Figure 6: Reduction of branch mispredictions on 64-bit 3.40 GHz Pentium 4. Plot title: "P4 Reduction of Quicksort Branch Mispredictions per Low-Level Algorithm (Key-Pointer pairs, 5000 trials)". x-axis: low-level algorithm used; y-axis: branch misprediction decrease from DTSL (%).]

The SIMD-based algorithms presented in this paper were integrated in the quicksort implementation of DTSL. DTSL's quicksort is not recursive. Instead it maintains an in-function stack of current partitions. When the number of elements to be sorted drops below a threshold, DTSL switches to a low-level sorting algorithm. The version of quicksort that produces the best, or close to the best, performance when sorting elements in DTSL uses a scalar sorting network, SN, as the low-level algorithm [11]. The single-element comparators in this sorting network are written in the C language and use branch instructions to conditionally perform element interchanges. The default threshold to switch to this low-level algorithm is sixteen elements. This version of DTSL's quicksort is the baseline for the comparative performance study in this paper. Table 2 lists the algorithms used in this performance evaluation. The standard insertion-sort algorithm, Ins - Y, is included to provide a familiar comparison point.

[Figure 7: Quicksort cycle counts relative to DTSL on 64-bit 3.40 GHz Pentium 4. Plot title: "P4 Reduction of Quicksort Cycles per Low-Level Algorithm (Key-Pointer pairs, 5000 trials)". x-axis: low-level algorithm used; y-axis: time decrease from DTSL (%).]

[Figure 8: Quicksort wall-clock times relative to DTSL on 2.7 GHz Power Mac G5. Plot title: "G5 Reduction of Quicksort Cycles per Low-Level Algorithm (Key-Pointer pairs, 5000 trials)". x-axis: low-level algorithm used; y-axis: time decrease from DTSL (%).]

[Figure 9: Quicksort cycle counts relative to DTSL on 3.2 GHz Core 2 Duo E6400. Plot title: "Core 2 Reduction of Quicksort Cycles per Low-Level Algorithm (Key-Pointer pairs, 5000 trials)". x-axis: low-level algorithm used; y-axis: time decrease from DTSL (%).]

7.1.2 Wall-Clock Execution Time

Experiments were performed on a 64-bit 3.4 GHz Pentium 4, an IBM 2.7 GHz PowerPC G5, and a 3.2 GHz Core 2 Duo E6400. Figs. 7, 8, and 9 show the relative wall-clock execution times for the sorting of a vector of key-pointer pairs in relation to the DTSL baseline. Each bar represents the average runtime over 5000 trials on uniformly distributed keys relative to the DTSL baseline. The large thresholds for MSort, ISort, and RSort, extending beyond what can concurrently fit within the physical vector registers, are the result of the register spilling mentioned in Sec. 4.
Time reductions for the Pentium 4 are quite strong for a range of array sizes, with the greatest reduction of 58% for 200 elements, where all elements are immediately sorted by MSort. RSort - 96 becomes the better alternative for larger arrays, with a time reduction of up to 22%. Large time reductions on the Core 2 Duo and the G5 are limited to small array sizes. For 200 elements MSort8-509 has respective time reductions of 43% and 33%. For the largest array RSort - 32 achieves only 7% and 4% relative improvement on these architectures.

As seen in Fig. 6, the number of branch mispredictions for each algorithm tends to decrease as the threshold becomes larger, reflecting the reduced number (or lack) of branch instructions involved. However, larger RSort thresholds require additional functions, much more so than for the two-pass algorithms. These functions grow in proportion to the size of their respective sorting networks and include alignment operations. Some of the generated object files are several megabytes in size. The first-pass sorting instructions for MSort and ISort do not require nearly as much space.

7.1.3 Low-Level Algorithm Timing

[Figure 10: Low-level algorithm cycle counts on 64-bit 3.40 GHz Pentium 4. Plot title: "P4 Low-level Sorting Algorithm Timing for Key-Pointer Pairs". Series: RSort, MSort4, MSort12, ISort4, ISort12, DTSL SN, Insertion. x-axis: array size; y-axis: PAPI TOT_CYC event count.]

Fig. 10 shows the number of clock cycles, obtained through the PAPI library, required by each algorithm as the number of elements to be sorted varies up to the maximum (as implemented) for each algorithm. Each point in the graph is an average over repeated trials with uniformly distributed keys. This graph shows that RSort is significantly superior to both the branch-intensive SN algorithm and standard insertion sort, and confirms that MSort is also an excellent choice for the sorting of short sequences. The performance of RSort on the G5 and Core 2 is roughly the same as that of the Pentium 4 results shown in Fig. 10 for sequences smaller than 32 elements.
A detailed study of other performance counters showed a correlation between the reduction in the number of branches, loads, and stores executed and the relative performance of the algorithms.
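The integration scheme described in Section 7.1.1, a non-recursive quicksort that keeps its own stack of partitions and hands small partitions to a low-level sorter, can be sketched in C++ as follows. This is a minimal sketch and not DTSL's code: the function names are hypothetical, the pivot rule (median of three) and the three-way partition are assumptions chosen for simplicity, and a plain insertion sort stands in for the low-level algorithm where the paper would use SN, MSort, ISort, or RSort. The default threshold of sixteen follows the DTSL default mentioned in the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in low-level sorter for the half-open range [lo, hi).
// In the paper this slot is filled by SN or by a SIMD register sort.
template <typename T>
void low_level_sort(std::vector<T>& v, std::size_t lo, std::size_t hi) {
    assert(lo <= hi && hi <= v.size());
    for (std::size_t i = lo + 1; i < hi; ++i) {
        T x = std::move(v[i]);
        std::size_t j = i;
        while (j > lo && x < v[j - 1]) {
            v[j] = std::move(v[j - 1]);
            --j;
        }
        v[j] = std::move(x);
    }
}

// Non-recursive quicksort: pending partitions live on an explicit stack, and any
// partition with at most `threshold` elements goes to the low-level sorter.
template <typename T>
void threshold_quicksort(std::vector<T>& v, std::size_t threshold = 16) {
    std::vector<std::pair<std::size_t, std::size_t>> pending;  // half-open ranges
    pending.emplace_back(std::size_t{0}, v.size());
    while (!pending.empty()) {
        const std::size_t lo = pending.back().first;
        const std::size_t hi = pending.back().second;
        pending.pop_back();
        if (hi - lo <= threshold) {
            low_level_sort(v, lo, hi);
            continue;
        }
        // Median-of-three pivot value (an assumption, not taken from DTSL).
        const T a = v[lo], b = v[lo + (hi - lo) / 2], c = v[hi - 1];
        const T pivot = std::max(std::min(a, b), std::min(std::max(a, b), c));
        const auto first = v.begin() + static_cast<std::ptrdiff_t>(lo);
        const auto last = v.begin() + static_cast<std::ptrdiff_t>(hi);
        // Three-way split: [less than pivot | equal to pivot | greater than pivot].
        const auto less_end = std::partition(
            first, last, [&pivot](const T& x) { return x < pivot; });
        const auto equal_end = std::partition(
            less_end, last, [&pivot](const T& x) { return !(pivot < x); });
        pending.emplace_back(lo, static_cast<std::size_t>(less_end - v.begin()));
        pending.emplace_back(static_cast<std::size_t>(equal_end - v.begin()), hi);
    }
}
```

Because the elements equal to the pivot are excluded from both sub-partitions, every pushed range is strictly smaller than its parent, so the loop terminates for any threshold. Swapping in a different low-level sorter only requires replacing `low_level_sort`, which mirrors how the paper compares alternatives at different thresholds.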

7.2 D-Heaps

The performance of d-heaps was investigated by comparing highly optimized versions with different branching factors against SIMD variants where vector instructions were used during heapify-down operations. The main finding is a significant reduction in cycle count for larger heaps when comparing the best SIMD d-heap against the best non-SIMD d-heap.

[Figure 12: Ratio of the best SIMD heap times relative to the best non-SIMD heap times from Fig. 11. Plot title: "Ratio of Best d-Heap Times". Series: P4, G5, Core 2. x-axis: heap size (log2); y-axis: SIMD time versus non-SIMD (%).]

All source code was written in C++ and was compiled using gcc 3.4.6 on the Pentium 4 and gcc 4.0.0 on the G5, and with gcc on the Core 2 Duo, in each case with full optimizations and loop unrolling. The branching factor was known at compile time. The heap itself was aligned in memory such that the root node's children began on a cache-line boundary. A consequence is that all children are aligned for SIMD vector accesses. The binary heap had further optimized index computations.

As with the previous experiments, heap elements are key-pointer pairs. Heaps have sizes which are powers of 2, from 2^4 to 2^26, and are initialized by inserting n elements, where n is the maximum size of the heap. Keys for the initial elements are drawn uniformly from 0, ..., n − 1. 10,000,000 iterations based on the Hold model as described in [7] were then performed. Each iteration consists of a call to delete-min followed by insert-element. The key of the new element is equal to the key of the element last removed plus a value drawn uniformly from 0, ..., n − 1.

As seen in Fig. 11, when the heap size becomes 2^18 there is a crossover between values of d in the performance of traditional heaps. For small heaps d = 2 performs better, while d = 8 or d = 16 performs better for larger heaps, resulting from better locality of each node's children as well as decreased heap depth. All graphs in Fig. 11 show only the best or near-best values of d for clarity. Fig. 12 shows the ratio of execution times between the best SIMD heap at each size versus the best traditional heap.
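The Hold-model workload described above can be sketched in a few lines of C++. This is only an illustration of the benchmark loop under stated assumptions: `std::priority_queue` (a binary heap) stands in for the d-heaps under test, keys are 64-bit integers without the associated pointers, and `run_hold_model`, `HoldResult`, and the seed parameter are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <random>
#include <vector>

struct HoldResult {
    std::size_t final_size;  // heap size after all iterations
    bool monotone;           // true if removed keys never decreased
    std::uint64_t last_key;  // key removed in the final iteration
};

// Hold model: fill the heap with n keys drawn uniformly from 0 .. n-1, then
// repeat "delete-min, insert (removed key + uniform increment from 0 .. n-1)".
HoldResult run_hold_model(std::size_t n, std::size_t iterations, std::uint32_t seed = 1) {
    assert(n > 0);
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::uint64_t> draw(0, n - 1);

    // Min-heap stand-in; the experiments use implicit d-heaps instead.
    std::priority_queue<std::uint64_t, std::vector<std::uint64_t>,
                        std::greater<std::uint64_t>> heap;
    for (std::size_t i = 0; i < n; ++i) heap.push(draw(rng));

    std::uint64_t prev = 0;
    bool monotone = true;
    for (std::size_t it = 0; it < iterations; ++it) {
        const std::uint64_t key = heap.top();  // delete-min
        heap.pop();
        if (key < prev) monotone = false;
        prev = key;
        heap.push(key + draw(rng));            // insert-element
    }
    return {heap.size(), monotone, prev};
}
```

The heap size stays constant at n throughout, and because each inserted key is at least as large as the key just removed, the sequence of removed keys never decreases. Timing such a loop for each heap size and branching factor yields measurements of the kind plotted in Fig. 11.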
For the Pentium 4, G5, and Core 2 Duo, the SIMD heaps have an average reduction in cycles of 31%, 18%, and 15% respectively, with the largest reductions occurring at the 2^18 crossover point for the Pentium 4 and G5, and at 2^20 for the Core 2 Duo.

8. RELATED WORK

The implementation of sorting in large-scale vector machines has been extensively studied. Siegel produced one of the earliest descriptions of how to implement Batcher's sorting network, also known as bitonic sorting, in SIMD machines [17]. Bitton et al. provide an extensive description of such implementations [3]. The new contribution of this paper is to demonstrate how the well-known sorting networks can be implemented in the SIMD machinery of contemporary processors and to indicate that code generators can instantiate such implementations to improve the performance of recursive sorting algorithms and heaps.

The idea of making better use of register resources within the processor to reduce the number of loads and stores, in our case to put the SIMD resources to good use in sorting, is also explored by Arge et al. [20]. Their idea of forming cache-load-size runs with quicksort is similar to our idea of switching to SIMD-register-based sorting at an appropriate threshold. The contrast is that we also benefit from the SIMD machinery, which allows more parallelism in the execution and the elimination of branches, while they use the general-purpose registers and the storage available in a cache line.

Recently compilers have been used more often to improve the code generation for SIMD machinery in contemporary processors. Ren et al.'s approach of using an optimization algorithm to improve the data permutations is more general than our specific iterative-deepening search [15]. Nuzman et al. describe a compiler framework to generate vectorized code for interleaved data [13].

With respect to the development of DTSL, the SIMD-register-based sorting algorithms presented in this paper are an orthogonal improvement to a library generator [11]. Li et al. focused on the dynamic identification of the best sorting algorithm for a given input sequence [12]. They selected an efficient algorithm for the tail of their recursive method.
This paper offers a better solution for the sorting of sequences that are small enough to benefit from the use of the SIMD machinery. Similarly, we provide a faster mechanism for selecting the minimum (maximum) child in the implicit d-heaps studied by LaMarca and Ladner [9, 10].

Our SIMD-register-based sorting could also improve partition-based sorting methods. For instance, Shen and Ding use an adaptive partitioning scheme to attempt to evenly partition data into chunks smaller than the cache size and then use quicksort or insertion sort to finish sorting each bucket [16]. For buckets that are small enough, the SIMD machinery again offers a better way to finish the sort.

9. CONCLUSIONS

This paper proposes the use of the SIMD machinery provided in modern processors to improve the performance of recursion tails. The idea is that whenever the number of elements to be processed fits within the SIMD registers available in the processor, these values should be loaded once into the SIMD registers and then an efficient SIMD execution should be used. While the feasibility of this idea was demonstrated with the integration of a more efficient algorithm for sorting short sequences into DTSL, the idea should be generally applicable to recursive computation. Once efficient low-level SIMD algorithms are crafted, they can be collected into a solution database to be instantiated by code generators into optimized libraries. Alternatively, if a suitable identification algorithm is created, the compiler should be able to integrate these solutions directly into general programs.

[Figure 11: Cycle count / wall-clock time for different heap sizes and values of d on: (a) 64-bit 3.40 GHz Pentium 4; (b) 2.7 GHz Power Mac G5; (c) 3.20 GHz Core 2 Duo E6400. Times cover the insertions and deletions. Plot titles: "P4 d-Heap Timing (Hold Model)", "G5 d-Heap Timing (Hold Model)", "Core 2 Duo d-Heap Timing (Hold Model)". x-axis: heap size (log2); y-axis: PAPI TOT_CYC event count for (a) and (c), time in microseconds for (b).]

Acknowledgments

The experimental evaluation of these ideas was made possible thanks to David Padua's generous sharing of his group's DTSL code. This research is supported by grants from the Natural Sciences and Engineering Research Council (NSERC) of Canada, and by IBM Corporation.

10. REFERENCES

[1] K. E. Batcher. Sorting networks and their applications. In AFIPS Spring Joint Computing Conference.
[2] L. Bishop, D. Eberly, T. Whitted, M. Finch, and M. Shantz. Designing a PC game engine. IEEE Computer Graphics and Applications, 18(1):46-53.
[3] D. Bitton, D. J. DeWitt, D. K. Hsiao, and J. Menon. A taxonomy of parallel sorting. Computing Surveys, 16(3), September.
[4] J. D. Frens and D. S. Wise. Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. In Principles and Practice of Parallel Programming PPoPP, Las Vegas, Nevada.
[5] M. Frigo. A fast Fourier transform compiler. In Programming Language Design and Implementation PLDI, Atlanta, GA, June.
[6] Intel. IA-32 Intel 64 and IA-32 architectures software developer's manual, volume 1: Basic architecture.
[7] Douglas W. Jones. An empirical comparison of priority-queue and event-set implementations. Commun. ACM, 29(4).
[8] Donald Ervin Knuth. The Art of Computer Programming, Vol. 3 - Sorting and Searching. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[9] A. LaMarca and R. E. Ladner. The influence of caches on the performance of heaps. ACM Journal of Experimental Algorithmics, 1:4.
[10] A. LaMarca and R. E. Ladner. The influence of caches on the performance of sorting. In SODA: ACM-SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms).
[11] X. Li, M. Garzarán, and D. Padua. A dynamically tuned sorting library. In Code Generation and Optimization CGO, Palo Alto, CA.
[12] X. Li, M. J. Garzarán, and D. Padua. Optimizing sorting with genetic algorithms. In Code Generation and Optimization CGO, San Jose, CA, March.
[13] D. Nuzman, I. Rosen, and A. Zaks. Auto-vectorization of interleaved data for SIMD. In Programming Language Design and Implementation PLDI.
[14] A. Ranade, S. Kothari, and R. Udupa. Register efficient mergesorting. In High Performance Computing HiPC, volume 1970 of LNCS. Springer.
[15] Gang Ren, Peng Wu, and David Padua. Optimizing data permutations for SIMD devices. In Programming Language Design and Implementation PLDI.
[16] Xipeng Shen and Chen Ding. Adaptive data partition for sorting using probability distribution. In ICPP 04: Proceedings of the 2004 International Conference on Parallel Processing (ICPP 04), Washington, DC, USA. IEEE Computer Society.
[17] H. J. Siegel. The universality of various types of SIMD machine interconnection networks. In Proceedings of the 4th Annual Symposium on Computer Architecture, pages 23-25, Silver Spring, MD, March. ACM SIGARCH/IEEE-CS.
[18] S. A. A. Touati. Register saturation in instruction level parallelism. International Journal of Parallel Programming, 33(4).
[19] R. Whaley, A. Petitet, and J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1-2):3-35.
[20] R. Wickremesinghe, L. Arge, J. S. Chase, and J. S. Vitter. Efficient sorting using registers and caches. ACM Journal of Experimental Algorithmics, 7:9.
[21] J. Xiong, J. Johnson, R. Johnson, and D. Padua. SPL: A language and compiler for DSP algorithms. In Programming Language Design and Implementation PLDI, Snowbird, Utah, June 2001.
