International Conference on Parallel Processing, St. Charles, IL, August 1994


COMMUNICATION OPTIMIZATIONS USED IN THE PARADIGM COMPILER FOR DISTRIBUTED-MEMORY MULTICOMPUTERS

Daniel J. Palermo, Ernesto Su, John A. Chandy, and Prithviraj Banerjee
Center for Reliable and High-Performance Computing
University of Illinois at Urbana-Champaign
Urbana, IL 61801, U.S.A.
{palermo, ernesto, chandy, banerjee}@crhc.uiuc.edu

Abstract. The PARADIGM (PARAllelizing compiler for DIstributed-memory General-purpose Multicomputers) project at the University of Illinois provides a fully automated means to parallelize programs, written in a serial programming model, for execution on distributed-memory multicomputers. To provide efficient execution, PARADIGM automatically performs various optimizations to reduce the overhead and idle time caused by interprocessor communication. Optimizations studied in this paper include message coalescing, message vectorization, message aggregation, and coarse grain pipelining. To separate the optimization algorithms from machine-specific details, parameterized models are used to estimate communication and computation costs for a given machine. The models are also used in coarse grain pipelining to automatically select a task granularity that balances the available parallelism with the costs of communication. To determine the applicability of the optimizations on different machines, we analyzed their performance on an Intel iPSC/2, an Intel iPSC/860, and a Thinking Machines CM-5.

1. INTRODUCTION

Distributed-memory multicomputers such as the Intel iPSC/860, the Intel Paragon, the IBM SP-1, the NCUBE/2, and the Thinking Machines CM-5 offer significant advantages over shared-memory multiprocessors in terms of cost and scalability. However, lacking a global address space, they present a very difficult programming model in which the user must specify how data and computation are to be partitioned across processors and determine which sections of data need to be communicated among which processors. To overcome this difficulty, significant research effort has been aimed at source-to-source parallelizing compilers for multicomputers that relieve the programmer from the task of program partitioning and communication generation, while the specification of data distributions remains a responsibility of the programmer. These compilers take a program written in a sequential or shared-memory parallel language and, based on a user-specified partitioning of the data, generate code for a given multicomputer. Examples include Fortran D [1], Fortran 90D [2], the SUIF compiler [3], and the Superb compiler [4]. However, many of these research efforts are now looking into automated data partitioning. Many researchers in this area are also currently involved in defining High Performance Fortran (HPF) [5] to standardize parallel programming with data distribution directives.

Some related work on the evaluation of compiler optimizations performed by the Fortran D compiler has previously appeared in [6]. The compiler optimizations that they described were selected manually, applied to one-dimensional partitionings, and only evaluated on an iPSC/860. The novel aspects we present in this paper are the automatic selection of multidimensional data partitions, the development of an estimation framework for coarse grain pipelining, and the comparison of existing optimizations on different architectures.

The remainder of this paper is organized as follows. Section 2 provides an overview of the PARADIGM compiler.

(This research was supported in part by the Office of Naval Research under Contract N00014-91-J-1096, and in part by the National Aeronautics and Space Administration under Contract NASA NAG.)
The various communication optimizations used in the compiler, as well as the techniques used to select the granularity for the pipelining transformation, are described in Section 3. An analysis of the results using the presented optimizations is performed in Section 4, and conclusions are presented in Section 5.

2. OVERVIEW OF PARADIGM

Figure 1 presents a functional view of the major components in the PARADIGM compiler. The compiler accepts a sequential program (currently FORTRAN 77) and produces an SPMD (Single Program Multiple Data) parallel program with message passing. Following are brief descriptions of some of the major areas in the compilation strategy:

Program Analysis. Parafrase-2 [7] is used as a preprocessing platform to parse the sequential program into an intermediate representation, to perform useful analysis (such as generating flow, dependence, and call graphs), as well as to facilitate various code transformations (such as constant propagation, induction variable substitution, loop distribution, loop interchange, and scalar expansion). To describe partitioned sets of iterations and regions of data, Processor Tagged Descriptors (PTDs) [8] are used to provide a uniform representation of the partitioning for every processor. Operations on PTDs are extremely efficient, capturing the effect on all processors in a given dimension simultaneously. PTDs are easily extended to an arbitrary number of dimensions and are independent of the total number of processors.
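The PTD representation itself is detailed in [8]; as a rough, hypothetical illustration of the kind of per-processor index sets such descriptors capture, the following Python sketch computes the block of indices owned by each processor in one distributed dimension (the function name and layout here are illustrative assumptions, not PARADIGM's actual interface):

    # Illustrative sketch only: per-processor index ranges for a BLOCK
    # distribution of N elements over P processors in one dimension.
    # PARADIGM's actual PTD operations are described in reference [8].

    def block_range(proc, N, P):
        """Return the (lo, hi) index range owned by processor `proc`
        when N elements are block-distributed over P processors."""
        size = (N + P - 1) // P          # ceiling(N / P) elements per block
        lo = proc * size + 1             # 1-based, Fortran-style indexing
        hi = min((proc + 1) * size, N)
        return lo, hi

    # Every processor can evaluate the ranges of all processors at once,
    # which is the flavor of operation such a descriptor must support.
    ranges = [block_range(p, N=100, P=4) for p in range(4)]
    print(ranges)   # [(1, 25), (26, 50), (51, 75), (76, 100)]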

Figure 1: PARADIGM Compiler Overview (the sequential program is parsed by Parafrase-2 and processed by the automatic data distribution module, which produces data distribution specifications, and by the communication and optimizations module; code generation through a generic library interface produces the SPMD parallel program)

Data Partitioning. Distribution of data is determined automatically by the compiler using a constraint-based algorithm [9, 10], which selects an abstract multidimensional mesh topology along with how program data is to be distributed on the mesh. To minimize the execution time for a particular machine, parameterized models are used to estimate computation and communication costs. For each target machine, there is a set of parameters which interface with the cost models, effectively isolating the partitioning algorithm from a specific architecture.

Computation Partitioning. Computation is divided among processors using the owner computes rule. A direct application of this rule without further optimizations leads to run-time resolution, which results in code that computes the ownership and communication for each reference at run time. An efficient implementation of the owner computes rule, however, can avoid the overhead of computing ownership at run time. For computations enclosed in a loop nest, the loops can be partitioned (known as loop bounds reduction [1]), allowing processors to execute only those iterations which have assignments that write to local memory.

Communication Analysis. The references in assignment statements are also analyzed to detect the need for communication. PTDs are constructed to describe the iterations requiring communication of non-local data, the processors involved, and the exact regions of the arrays to be sent or received. Once the communication descriptors have been computed for individual references, various communication optimizations can be performed (see Section 3). Data dependence and flow information is also used to determine whether a given optimization is applicable, and if so, to what extent.

Processor Mapping. The compiler generates code that views the target machine as a multidimensional mesh of processors. The exact configuration of this mesh is chosen during the automatic data-partitioning phase. Since a mesh topology can be easily mapped onto other topologies, machine-dependent processor mapping is achieved through library support to efficiently map the mesh to a given architecture [11].

Generic Library Interface. Support for specific communication libraries is provided through a generic library interface. For each supported library, abstract functions are mapped to corresponding library-specific code generators at compile time. Library interfaces have been implemented for the Intel iPSC communication library, Thinking Machines CMMD, Parasoft Express [12], PVM [13], and the Portable Instrumented Communication Library (PICL) [14, 15]. Express, PVM, and PICL also provide execution tracing and support for many different machines. The portability of this interface allows the compiler to generate code for a wide variety of machines.

Summary. When fully implemented, PARADIGM will be capable of performing all of the following tasks automatically: generation of data partitioning directives [9, 10], partitioning of computation and generation of communication [11], synthesis of high-level communication [16], exploitation of functional parallelism [17], support of a multithreaded execution model [18], and support of irregular computations [19].
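To make the contrast between run-time resolution and loop bounds reduction concrete, the following sketch (illustrative Python, not compiler output; `owner` and the block bounds are assumptions matching the block distribution sketched earlier) shows the two forms of a partitioned loop:

    # Illustrative sketch: run-time resolution vs. loop bounds reduction
    # for a loop over N elements block-distributed over P processors.
    # `me` is this processor's id; the loop bodies are stand-ins.

    N, P, me = 100, 4, 2
    size = (N + P - 1) // P

    def owner(i):                 # which processor owns element i
        return (i - 1) // size

    # Run-time resolution: every processor scans the FULL iteration
    # space and tests ownership of each reference at run time.
    for i in range(1, N + 1):
        if owner(i) == me:
            pass                  # compute a(i) locally

    # Loop bounds reduction: the compiler shrinks the bounds so each
    # processor visits only the iterations it owns, with no run-time test.
    lo, hi = me * size + 1, min((me + 1) * size, N)
    for i in range(lo, hi + 1):
        pass                      # compute a(i) locally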
3. COMMUNICATION OPTIMIZATIONS

The first three communication optimizations examined in this paper, message coalescing, message vectorization, and message aggregation, are targeted at reducing the overhead associated with communication [1, 6]. These optimizations rely on the fact that the start-up cost of communication for distributed-memory multicomputers is much greater than the per-byte transmission cost (see Table 1). In general, given the start-up latency and transmission rate for a specific architecture (see Table 1), the compiler uses a communication model in which the transfer cost (in µs) of a message of m bytes is defined as:

    transfer(m) = ovhd + rate * m                                   (1)

For a machine such as the iPSC/2, the parameters depend on the length of the message, so the model becomes piecewise: one (ovhd, rate) pair applies to short messages (m <= 100 bytes) and another to long messages (m > 100 bytes).

Table 1: Communication Model Parameters [6] (start-up overhead ovhd, in µs, and transmission rate, in µs per byte, for the CM-5, the iPSC/860, and the iPSC/2, tabulated separately for message sizes m <= 100 and m > 100 bytes)

Several optimizations can be employed to increase the performance of the parallel program by combining messages in various ways to reduce the total amount of communication overhead.
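As a small illustration of how such a parameterized model is evaluated (the numeric parameters below are placeholders chosen only to show the piecewise form, not the measured values of Table 1), the transfer cost can be coded directly:

    # Sketch of the communication cost model of Equation (1).
    # The (ovhd, rate) values are PLACEHOLDERS; the measured machine
    # parameters appear in Table 1 of the paper.

    def transfer(m, params):
        """Transfer cost (in microseconds) of an m-byte message."""
        ovhd, rate = params(m)
        return ovhd + rate * m

    def ipsc2_like(m):
        """Piecewise parameters: short messages (<= 100 bytes) use one
        (ovhd, rate) pair, long messages another."""
        return (350.0, 0.7) if m <= 100 else (660.0, 0.36)

    # Start-up dominates for short messages; only for very long messages
    # does the per-byte term take over.
    for m in (1, 100, 10000):
        print(m, transfer(m, ipsc2_like))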

Figure 2: Message Vectorization (itemwise messages between processors P1 and P2 before, and a single combined message after)

3.1 Message Coalescing

Separate communication for different references to the same data is unnecessary if the data has not been modified between uses. When statically analyzing the access patterns, these redundant communications are detected and coalesced into a single message, allowing the data to be reused rather than communicated for every reference. For sections of arrays, unions of overlapping PTD index sets ensure that each unmodified data element is communicated only once. Coalescing is always beneficial since entire communication operations can be eliminated.

3.2 Message Vectorization

Non-local elements of an array that are indexed within a loop nest can also be vectorized into a single larger message instead of being communicated individually (see Figure 2). Dependence analysis is used to determine the outermost loop at which the combining can be applied. The "itemwise" messages are combined, or vectorized, as they are lifted out of the enclosing loop nests to the selected vectorization level. (Management of available memory may require that large regions of data be only partially vectorized.) Vectorization reduces the total number of communication operations, but also increases the message length.

3.3 Message Aggregation

Messages (corresponding to several array sections) to be communicated between the same source and destination can also be aggregated into a single larger message. Multiple communication operations (to be performed at the same point) are sorted by their destinations during the communication analysis. Messages with identical destinations can then be collected into a single communication operation (see Figure 3). The gain from aggregation is similar to vectorization in that multiple communication operations can be eliminated at the cost of increasing the message length. Aggregation can be performed on communication operations of individual data references as well as on vectorized communication operations. Both of these applications of message aggregation will be examined in Section 4.

Figure 3: Message Aggregation (several messages with the same source and destination before, and one combined message after)

3.4 Coarse Grain Pipelining

In loops where there are no cross-iteration dependences, parallelism is extracted by independently executing groups of iterations on separate processors. However, in cases where there are cross-iteration dependences due to recurrences, it is not possible to immediately execute every iteration. Often, there is the opportunity to overlap parts of the loop execution, using some form of synchronization to ensure that the data dependences are enforced. In Figure 4a, the first processor is performing an operation on every element of the rows it owns before sending the border row to the waiting processor, thereby serializing execution of the entire computation. In the example in Figure 4b, a loop interchange has been applied such that the first processor now can compute one partitioned column of elements. It can then send the border element of that column to the next processor so that processor can begin computation immediately.

Figure 4: Code Example for Loop Pipelining

    (a) Before Transformation:
        do i = 1, Y/p
          do j = 1, X
            a(i, j) = a(i-1, j) + a(i, j-1)

    (b) After Loop Interchange:
        do j = 1, X
          do i = 1, Y/p
            a(i, j) = a(i-1, j) + a(i, j-1)

Such techniques have been used in the design of systolic arrays [23, 24] as well as in software pipelining. Ideally, if communication has zero overhead, this is the most efficient form of computation, since no processor will wait unnecessarily. Unfortunately, this assumption is not valid for distributed-memory systems. When overhead is considered, the cost of performing numerous single element communications can be quite expensive. To address this problem, the total communication overhead can again be reduced by increasing the granularity of the communication.
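The same start-up-versus-volume arithmetic that motivates vectorization and aggregation applies here: communicating a border of X elements as X single-element messages pays X start-ups, while strips of s elements pay only ceil(X/s). A small sketch under the placeholder cost model used earlier (the element size is also an assumption):

    # Sketch: modeled communication cost of a border of X elements sent
    # as single elements vs. as strips of s elements, using Equation (1)
    # with the placeholder parameters from before.
    import math

    OVHD, RATE = 350.0, 0.7      # placeholder start-up (µs) and µs/byte
    ELEM_BYTES = 8               # assumed element size

    def border_cost(X, s):
        """Cost of sending X elements in messages of s elements each."""
        n_msgs = math.ceil(X / s)
        return n_msgs * (OVHD + RATE * s * ELEM_BYTES)

    X = 512
    for s in (1, 8, 64, X):      # fine grain ... fully serialized
        print(s, border_cost(X, s))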
This procedure has become known as coarse grain pipelining [6].

3.4.1 Execution Analysis

In Figure 5, an execution framework of a generalized two-level pipelined loop nest (similar to that in Figure 4) is presented. In this model, coarse grain pipelining is performed by strip-mining the outer dimension of the two-dimensional loop nest, while chaining is applied between consecutive pipelines at an outer iteration level. Definitions of all variables which will be used in the analysis are also listed along with the figure. Since the amount of available parallelism is reduced as the granularity, or strip size s, is increased, this value must be carefully selected.

Figure 5: Estimation Framework for Coarse Grain Pipelining (a timeline of processors P0 through P3 showing strip computation, send overhead (Sovhd), receive overhead (Rovhd), strip transfers, and forward reference synchronization, annotated with the quantities below)

    startup  = (p - 1)(s * comp + comm)
    pipeline = X * comp + (ceil(X/s) - 1) * overhead(s)
    sync     = s * comp + 3 * comm - overhead(s) + scomm
    total    = L * pipeline + (L - 1) * sync   (chained pipelines)

    L           = outer loop iterations
    X, Y        = number of columns and rows
    c           = cost of inner loop instructions
    p           = number of processors
    s           = strip size (s_b = strip size in bytes)
    transfer(m) = communication time of m elements
    overhead(m) = communication overhead for m elements
    comp        = computation time of one column = ceil(Y/p) * c
    comm        = time for transfer(s) (strip of a row)
    scomm       = time for transfer(X) (entire row)

Figure 6: Cross Processor Loop Pipelining

    (a) Fine grain pipelining:
        do l = 1, L
          comm(size = X)
          do j = 1, X
            if (my$p > 0) recv(size = 1)
            do i = 1, ceil(Y/p)
              computation
            if (my$p < p - 1) send(size = 1)

    (b) Coarse grain pipelining:
        do l = 1, L
          comm(size = X)
          do j = 1, X, s
            bb = min(s, X - j + 1)
            if (my$p > 0) recv(size = bb)
            do jj = j, j + bb - 1
              do i = 1, ceil(Y/p)
                computation
            if (my$p < p - 1) send(size = bb)

If s is one, we have fine grain pipelining, and if s is equal to the bounds of the serial loop, X, the execution is serialized in the inner dimensions (no pipelining). Somewhere in between lies an optimal s that maximizes the overlap of communication and computation. In order to investigate this tradeoff, an execution time estimate is developed from the framework to allow automatic selection of a strip size that yields the highest performance.

The first major phase of execution is the time required to fill the pipeline. This is related to the number of processors as well as the strip size. From the diagram, it can be seen that:

    startup = (p - 1)(s * comp + comm)

The next portion of execution is the time spent in the pipeline. Ideally, with no communication overhead, this time should be equal to the amount of computation (X * comp). However, because of the presence of communication, the time for each message communicated in the pipeline must also be taken into account. The number of communication operations is ceil(X/s). Therefore,

    pipeline = X * comp + (ceil(X/s) - 1) * overhead(s)

For loops with flow dependences caused by forward references in the computation, the execution of a sequence of pipelined loop nests will also generate communication, incurring some extra synchronization cost to communicate the required data. Since most parallel machines support only a single channel for memory transfer operations through the communication network, this causes some extra delay, which can be seen during multiple sends in Figure 5. (In some applications there are no forward references and, therefore, no need to synchronize between outer loop iterations. In these cases each processor can proceed without any further synchronization.) The "scomm" term is used to represent the amount of synchronizing communication (if an entire row is communicated, this cost is transfer(X)). Note that this will also be present in the start-up synchronization for the loop nest (see Figure 5).

    sync = s * comp + 3 * comm - overhead(s) + scomm

Since the pipeline is entered L times, the total execution time is then:

    total cost = scomm + startup + L * pipeline + (L - 1) * sync
               = [LX + s(p + L - 2)] * ceil(Y/p) * c
                 + (p + 3L - 4) * transfer(s)
                 + [L(ceil(X/s) - 2) + 1] * overhead(s)
                 + L * transfer(X)                                  (2)

Using the communication cost model previously presented (see Equation (1)), with b denoting the number of bytes per element (so s_b = s * b and X_b = X * b), the total cost becomes:

    total cost = [LX + s(p + L - 2)] * ceil(Y/p) * c
                 + (p + 3L - 4)(ovhd + s_b * rate)
                 + [L(ceil(X/s) - 2) + 1] * ovhd
                 + L(ovhd + X_b * rate)

The total cost can then be minimized with respect to the strip size to select the optimal granularity. Setting the derivative with respect to s to zero,

    d(total cost)/ds = (p + L - 2) * ceil(Y/p) * c
                       + (p + 3L - 4) * b * rate
                       - (X * L * ovhd) / s^2 = 0

yields

    s = sqrt( X * L * ovhd / [ (p + L - 2)(b * rate + ceil(Y/p) * c)
                               + 2b(L - 1) * rate ] )               (3)

Verification of the second derivative will show that this value of s is indeed a minimum. Since ovhd and rate are actually functions of s, Equation (3) is evaluated iteratively.
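Equation (3) lends itself to a simple fixed-point computation: pick a strip size, look up the (ovhd, rate) pair that message size implies, re-evaluate s, and repeat. A sketch under assumed parameters (the piecewise values are again placeholders; the paper's measured values are in Table 1):

    # Sketch: iterative evaluation of the strip size of Equation (3).
    # Machine parameters are placeholders with the same piecewise shape
    # as Table 1; all problem sizes are illustrative.
    import math

    def params(m_bytes):                  # placeholder (ovhd µs, rate µs/byte)
        return (350.0, 0.7) if m_bytes <= 100 else (660.0, 0.36)

    def strip_size(X, Y, L, p, c, b, iters=10):
        """Fixed-point iteration of Equation (3); s in elements."""
        s = 1.0
        for _ in range(iters):
            ovhd, rate = params(s * b)    # ovhd and rate depend on s
            denom = ((p + L - 2) * (b * rate + math.ceil(Y / p) * c)
                     + 2 * b * (L - 1) * rate)
            s = math.sqrt(X * L * ovhd / denom)
            s = min(max(s, 1.0), X)       # keep 1 <= s <= X
        return round(s)

    # e.g. a 512x512 recurrence, 10 outer iterations, 8 processors,
    # c = 1 µs of inner-loop work per element, 8-byte elements:
    print(strip_size(X=512, Y=512, L=10, p=8, c=1.0, b=8))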

3.4.2 Serial Computation Cost Estimation

Since the cost, c, of the inner loop computations appears in this expression, it is necessary to estimate the computation costs. Two techniques can be used: source-level [9] and assembly-level cost estimation. For each method of estimation, a given machine's operation costs are expressed in terms of clock cycles. In the case of source-level estimation, these costs must take into account any support instructions (address computation, register loads, stores, etc.) performed for the actual computation. Assembly-level estimation requires the timing costs of individual machine instructions as well as being able to perform pre-compilation of the source under examination. In both cases, estimation of the cost of a fixed block of code (which may contain loops and other control flow structures) requires computing a dynamic cycle count.
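A minimal sketch of what source-level estimation involves (the per-operation cycle costs are invented for illustration; a real machine parameter set would also charge the address computation and load/store support instructions noted above):

    # Sketch: source-level dynamic cycle count for a fixed loop body.
    # Cycle costs per abstract operation are INVENTED placeholders.

    CYCLES = {"fadd": 3, "fmul": 4, "load": 2, "store": 2}

    def body_cost(op_counts):
        """Cycles for one execution of a loop body, given operation counts."""
        return sum(CYCLES[op] * n for op, n in op_counts.items())

    def dynamic_count(op_counts, trip_counts):
        """Dynamic cycle count: body cost scaled by the product of the
        (compile-time known) trip counts of the enclosing loops."""
        total_trips = 1
        for t in trip_counts:
            total_trips *= t
        return body_cost(op_counts) * total_trips

    # a(i,j) = a(i-1,j) + a(i,j-1): two loads, one add, one store
    print(dynamic_count({"load": 2, "fadd": 1, "store": 1}, [512, 512]))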
Table 2: Data Partitioning for Initial Tests (array sizes and the distributions selected for each machine; one-dimensional block distributions were chosen for ADI and EXPL, and a two-dimensional (block, block) distribution for Jacobi)

4. EVALUATION OF OPTIMIZATIONS

A group of small scientific program kernels is used to examine the performance of the presented communication optimizations. The selected program fragments include:

- ADI Integration (ADI, Livermore kernel 8) [25]
- 2-D Explicit Hydrodynamics (EXPL, Livermore kernel 18) [25]
- Jacobi's Iterative Method

Three other programs which exhibit cross-iteration dependences are selected to examine the pipelining optimization:

- 2-D Implicit Hydrodynamics (IMPL, Livermore kernel 23) [25]
- Successive Over-Relaxation Iterative Method (SOR)
- Block Lower Triangular Solver (BLTS) [26]

These programs will be examined separately in Section 4.6. Array dimensions are statically specified, and loop bounds are determined at compile time when possible. Both the virtual mesh configuration as well as the data partitioning of the arrays are automatically selected by the compiler. The sizes of the major arrays and the chosen distributions are shown in Table 2 for each of the first three programs.

The evaluation of the overhead optimizations is performed on an Intel iPSC/860 as well as on a Thinking Machines CM-5. Larger arrays were used on the CM-5 to reliably obtain measurable times. Traces taken using PICL [15] are analyzed using the ParaGraph [14] visualization tool to examine the effect of each optimization. The ParaGraph "Spacetime diagram" is used to further examine the execution profile, viewing the frequency and amount of communication that takes place. Continuous lines indicate uninterrupted execution, while a break in a line indicates that a processor has blocked awaiting communication. Communication operations are indicated as vertical lines between the cooperating processors' execution profiles (see Figures 7 to 10). Snapshot views of the execution of EXPL compiled with the selected optimizations are shown in each figure, while performance data is presented for all three test programs.

For comparison purposes, the reported execution times have been normalized to the serial execution of the corresponding program and are further separated into two quantities:

- the amount of time spent on useful computation (where useful refers to only the code which carries out the actual computation)
- the time spent executing code related to computation partitioning and communication

The relative effectiveness of each optimization is determined by examining the amount of overhead eliminated as the optimizations are incrementally applied.

Figure 7: Evaluation of Run-time Resolution
Figure 8: Reduced Loop Bounds & Coalescing
Figure 9: Evaluation of Message Vectorization
Figure 10: Message Vectorization & Aggregation

(Each of Figures 7 to 10 pairs normalized performance data for the Intel iPSC/860 and the Thinking Machines CM-5 with a spacetime snapshot; legend: Uniprocessor Execution, Run-time Resolution, Reduced Loop Bounds, Communication Vectorization, Communication Aggregation, All Optimizations applied. Traces taken from Explicit Hydrodynamics (Livermore Kernel 18) using a 16 processor iPSC/860; time in µs.)

4.1 Results of Run-time Resolution

A direct application of the owner computes rule without any further optimization leads to run-time resolution. Since each processor must execute the entire iteration space (to determine if any other processor needs locally owned data), different instances of a communication operation are effectively serialized across the processors in a given mesh dimension (see Figure 7). Multiple communication operations appear to be pipelined, but the time spent between successive messages can be quite large. Examining the execution time for each program, it can be seen that traversing the entire iteration space to compute ownership results in large amounts of overhead. Communication for run-time resolution programs is also very inefficient, since it is comprised of a large number of small (sometimes even redundant) messages, resulting in high communication overhead. The net result is a reduction in performance when compared to the serial case.

4.2 Results of Loop Bounds Reduction and Message Coalescing

By applying loop bounds reduction and statically generating communication operations, the serialization present in the baseline resolution cases can be eliminated. The statically generated communication operations are effectively collapsed in time as compared to the serialized communication present in run-time resolution (see the spacetime diagram in Figure 8). Static analysis of communication allows the loop bounds to be partitioned instead of requiring every processor to check the ownership of each reference for the entire iteration space. The overhead is dramatically reduced, as all ownership and communication is now statically determined. Note, however, that the size of the messages is still identical to that in run-time resolution (single elements), but redundant messages have been eliminated by application of the coalescing optimization. It is still apparent that there is an excessive amount of small messages being communicated (as seen in Figure 8), since communication overhead is now the dominant factor in the execution overhead.

4.3 Results of Message Vectorization

It is also possible to vectorize the communication operations after loop bounds reduction and message coalescing have been applied. Using dependence information to determine when vectorization is applicable, communication operations can be lifted out of the inner-most loops, thereby reducing the communication frequency (this can be seen in Figure 9). Recall that this also increases the size of the messages, but since the start-up cost is much greater than the per-byte cost, there is a large gain in performance. For ADI, it is possible to vectorize all communication completely out of the loop nest. Since there is no longer any synchronization due to communication within the loops, the reduction of the array bounds resulted in super-linear speedup, which can be attributed to a gain in cache performance (also see Figure 11).

4.4 Results of Message Aggregation

Aggregation also reduces communication frequency by grouping multiple communication operations, increasing the size of the resulting message (see Figure 10). Applying aggregation after loop bounds reduction and message coalescing, a performance gain is seen for both ADI and EXPL.

Figure 11: Comparison of Combined Optimizations (normalized execution time for ADI, EXPL, and Jacobi: (a) eight processor Intel iPSC/860, (b) Thinking Machines CM-5)
Snce groups of messages are combned, the performance mprovement s related to the number of derent array sectons that need to be communcated at the same pont n the program. For the small test programs examned, only a few arrays were nvolved n the communcaton, and hence the overall performance mproved only slghtly.. Results of Combned Optmzatons Fgure 11 shows the relatve speedup and the amount of overhead measured wth each optmzaton. Elmnatng the overhead of traversng the entre teraton space to compute ownershp, the combnaton of statc communcaton generaton and loop bounds reducton attaned roughly half of the total avalable performance for most programs. By applyng message vectorzaton t was possble to reduce the maorty of the remanng communcaton overhead resultng n near performance. Message aggregaton was benecal for the two programs (ADI, EXPL) whch contaned references to a number of derent arrays. Due to the more lmted scope of applcaton, ts eect tended to not be as dramatc as the other optmzatons. Snce the compler selected a two-dmensonal parttonng for Jacob's method, an extra run s shown n Fgure 11 where the compler was forced to generate a one-dmensonal parttonng. For the eght processor PSC/, the executon tme ncreased by about %. For the larger, processor, the executon tme has ncreased by over 3% and wll become more apparent wth larger numbers of processors. The hghest performance was acheved by allowng the compler to select the best dstrbuton based on cost models to estmate communca-

Table 3: Mesh Configurations (virtual mesh topologies selected by the automatic partitioning pass for ADI/EXPL and for Jacobi at each machine size N)

Each of the fully optimized test programs was also executed with larger numbers of processors to examine scalability. In Table 3, each program's mesh configuration is shown as selected by the automatic partitioning pass. The speedup curves for each of the test programs run on an iPSC/2, an iPSC/860, and a CM-5 can be seen in Figure 12. The super-linear speedup is again observed for ADI and can be attributed to the cache effect previously described. The execution of these three programs was also compared to an existing data-parallel compiler available on the CM-5 using a language known as Connection Machine Fortran (CMF 2.1 Final, CMOST 7.3). (The vector units were not utilized for either the CMF or message passing runs, since the node compilers would not produce vector code for the message passing programs.) The reduction in performance of the programs compiled with CMF can be attributed to the fact that it uses a SIMD (Single Instruction Multiple Data) model of execution for compilation (carried over from the CM-2). Simulating a SIMD architecture on a MIMD multicomputer (such as the CM-5) incurs fairly high synchronization costs between blocks of computation. Future versions of the CMF compiler will most likely use a more asynchronous model of computation (i.e., SPMD) which is better suited to the CM-5.

Figure 12: Performance Comparison (speedup of (a) ADI Integration, (b) Explicit Hydrodynamics, and (c) Jacobi's Iterative Method on the iPSC/2, iPSC/860, CM-5, and CM-5 with CMF)

4.6 Results of Coarse Grain Pipelining

To evaluate the quality of the strip size estimate developed in Section 3.4.1 (see Equation (3)), each of the test programs (IMPL, SOR, BLTS) was executed with varying strip sizes to compare the estimate with the actual minimum. We also examined a simpler one-level strip size estimate developed in the Fortran D project [6]. It is interesting to note that our Two-Level estimate reduces to their One-Level estimate when several assumptions are applied. In its most general form, the simpler estimate assumes that chaining does not occur between consecutive pipelines, communication is modeled as constant overhead, and the array dimensions are square:

    One-Level = sqrt( (p / (p - 1)) * (ovhd / c) )

Given the two forms of strip size estimates, the optimal strip size is compared with the estimated minimums in Table 5. The execution time using the Two-Level estimate can be seen to be no more than a few percent away from the optimal time, while the most general form of the One-Level estimate was, at times, more than 30% worse than the actual minimum. The One-Level estimate predicted strip sizes of roughly half the size of the Two-Level estimate. In fact, a further approximation is made in Fortran D, resulting in an estimate of sqrt(ovhd / c), which proves to be even farther from the minimum.

The major gain seen with the Two-Level estimate comes from the chaining of consecutive pipelines. In the One-Level estimate, which only models a single pipeline, the pipeline start-up costs have a more significant contribution and tend to reduce the strip size in order to minimize the total execution time. When chaining is taken into account, as in the Two-Level estimate, the overall contribution of the start-up phase is much less. For both machines examined, however, the communication rate did not have a significant effect on the Two-Level estimate, since only small messages were communicated (the overhead of communication was two orders of magnitude greater than the rate). For programs in which larger amounts of data need to be communicated within the pipeline, the communication rate would most likely become more significant.
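The relationship between the two estimates is easy to check numerically: with L = 1, rate = 0 (constant overhead), and square arrays, the Two-Level expression of Equation (3) collapses exactly to the One-Level form. A small sketch, with the same placeholder parameters as before:

    # Sketch: comparing the Two-Level strip size estimate (Equation (3))
    # with the Fortran D One-Level estimates. Parameters are placeholders.
    import math

    def two_level(X, Y, L, p, c, b, ovhd, rate):
        denom = ((p + L - 2) * (b * rate + math.ceil(Y / p) * c)
                 + 2 * b * (L - 1) * rate)
        return math.sqrt(X * L * ovhd / denom)

    def one_level(p, c, ovhd):
        return math.sqrt(p / (p - 1) * ovhd / c)

    def one_level_approx(c, ovhd):
        return math.sqrt(ovhd / c)

    X = Y = 512; L = 10; p = 8; c = 1.0; b = 8; ovhd, rate = 350.0, 0.7
    print(two_level(X, Y, L, p, c, b, ovhd, rate))   # ~38 elements
    print(one_level(p, c, ovhd))                     # ~20, roughly half
    print(one_level_approx(c, ovhd))                 # cruder still

    # Setting L = 1 and rate = 0 with X = Y recovers the One-Level form:
    print(two_level(X, Y, 1, p, c, 0, ovhd, 0.0), one_level(p, c, ovhd))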
In Figure 13, speedup curves are shown for the measured data as well as for the predictions of the Two-Level estimate (with the optimal point indicated for each). On the iPSC/860 the ratio of communication to computation is fairly high, whereas this ratio is much lower on the iPSC/2.

Table 4: Data Partitioning for Pipeline Tests (array sizes and the block distributions selected for IMPL, SOR, and BLTS)

Table 5: Comparison of Estimates (strip size and execution time for IMPL, SOR, and BLTS at each machine and processor count: optimal, Two-Level, and One-Level)

Figure 13: Performance of Pipelining vs. Granularity (measured and estimated speedup as a function of strip size for (a) Implicit Hydrodynamics, (b) Successive Over-Relaxation, and (c) Sparse Block Lower Triangular Solver on the iPSC/2 and iPSC/860)

For this reason, the performance curves tend to be steeper on the iPSC/860, making the selection of the correct strip size more critical. (Comparing the iPSC/2 to the iPSC/860, the communication overhead changes by a considerably smaller factor than the computation, which changes by about a factor of 10.) In Figure 14, traces from the SOR kernel also show the correlation of the execution profile to the framework presented in Section 3.4.1.

Since there tends to be a range of strip sizes which all provide similar performance, predicting the trend is more important than predicting the exact execution time. With the correct trend, it is possible to automatically select a granularity that approaches the minimum execution time. Estimates for the iPSC/860 followed the trend quite well but did not match the measured times, so they are scaled in magnitude to facilitate comparison; this is expected due to the advanced optimizations performed by the target compiler. The estimates for the iPSC/2 were fairly accurate when directly compared to the measured data (except for the BLTS estimate, which was also scaled). For IMPL and SOR, communication is more tightly coupled with computation than in BLTS; for these programs, control of the granularity had a greater effect on the resulting performance. On the other hand, the computation in BLTS was not as fine grained and required little or no increase in grain size.

5. CONCLUSIONS

One of the most complex tasks facing a user in parallelizing serial programs is dealing with interprocessor communication. In this paper, it has been shown that this task can be performed by the compiler through the use of good estimates for communication and computation costs. It has also been shown that the application of the presented optimization techniques yields high performance on several different distributed-memory multicomputers. By applying the presented optimizations, it was possible to amortize communication overhead, obtaining near-linear performance. Through the use of good computation and communication estimates, the compiler was able to select the best distribution even for small differences in performance. For larger machine sizes and more complex programs, the utility of automatic data distribution becomes more apparent, as the communication costs become greater for inferior data distributions.

The performance of coarse grain pipelining is directly influenced by the relative costs of communication and computation on a given machine. The estimates presented in this paper allow for variation in both computational power as well as the communication latency and bandwidth of different machines. Comparing the estimated minimum to the measured data, it is apparent that the compiler is able to automatically select a granularity that gives near-optimal performance. Currently, PARADIGM automatically performs loop bounds reduction, message coalescing, message vectorization, and message aggregation.
The coarse grain pipeline transformation will be integrated into the compiler in the near future.

Acknowledgements: We would like to thank the reviewers for their helpful input, and to thank Antonio Lain, Christy Palermo, and Shankar Ramaswamy for their insightful comments and suggestions.

Figure 14: Traces of SOR with Pipelining (fine grain and coarse grain)

REFERENCES

[1] S. Hiranandani, K. Kennedy, and C. Tseng, "Compiling Fortran D for MIMD Distributed Memory Machines," Communications of the ACM, vol. 35, no. 8, pp. 66-80, Aug. 1992.

[2] Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, and S. Ranka, "Fortran 90D/HPF Compiler for Distributed Memory MIMD Computers: Design, Implementation, and Performance Results," in Proceedings of the 1993 ACM International Conference on Supercomputing, July 1993.

[3] S. P. Amarasinghe and M. S. Lam, "Communication Optimization and Code Generation for Distributed Memory Machines," in Proceedings of the ACM SIGPLAN'93 Conference on Programming Language Design and Implementation, pp. 126-138, June 1993.

[4] B. Chapman, P. Mehrotra, and H. Zima, "Programming in Vienna Fortran," in Third Workshop on Compilers for Parallel Computers, 1992.

[5] High Performance Fortran Forum, "High Performance Fortran Language Specification, version 1.0," Tech. Rep. CRPC-TR92225, CRPC, Rice University, Houston, TX, May 1993.

[6] S. Hiranandani, K. Kennedy, and C.-W. Tseng, "Evaluation of Compiler Optimizations for Fortran D on MIMD Distributed-Memory Machines," in Proceedings of the 1992 ACM International Conference on Supercomputing, (Washington, DC), July 1992.

[7] C. D. Polychronopoulos, M. Girkar, M. R. Haghighat, C. L. Lee, B. Leung, and D. Schouten, "Parafrase-2: An Environment for Parallelizing, Partitioning, Synchronizing and Scheduling Programs on Multiprocessors," in Proceedings of the 1989 International Conference on Parallel Processing, pp. II:39-48, Aug. 1989.

[8] E. Su, D. J. Palermo, and P. Banerjee, "Processor Tagged Descriptors: A Data Structure for Compiling for Distributed-Memory Multicomputers," to appear in the 1994 International Conference on Parallel Architectures and Compilation Techniques, 1994.

[9] M. Gupta and P. Banerjee, "Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers," IEEE Transactions on Parallel and Distributed Systems, vol. 3, pp. 179-193, Mar. 1992.

[10] M. Gupta and P. Banerjee, "PARADIGM: A Compiler for Automated Data Partitioning on Multicomputers," in Proceedings of the 1993 ACM International Conference on Supercomputing, (Tokyo, Japan), July 1993.

[11] E. Su, D. J. Palermo, and P. Banerjee, "Automating Parallelization of Regular Computations for Distributed Memory Multicomputers in the PARADIGM Compiler," in Proceedings of the 1993 International Conference on Parallel Processing, pp. II:30-38, Aug. 1993.

[12] Parasoft Corporation, Pasadena, CA, Express Reference Guide for FORTRAN Programmers, 1992.

[13] G. A. Geist, A. Beguelin, J. J. Dongarra, W. Jiang, R. Manchek, and V. S. Sunderam, "PVM 3 User's Guide and Reference Manual," Oak Ridge National Laboratory, Oak Ridge, TN, Feb. 1993.

[14] M. T. Heath and J. A. Etheridge, "Visualizing the Performance of Parallel Programs," IEEE Software, vol. 8, pp. 29-39, Sept. 1991.

[15] G. A. Geist, M. T. Heath, B. W. Peyton, and P. H. Worley, "PICL: A Portable Instrumented Communication Library, C reference manual," Tech. Rep. ORNL/TM-11130, Oak Ridge National Laboratory, Oak Ridge, TN, July 1990.

[16] J. Li and M. Chen, "Compiling Communication-Efficient Programs for Massively Parallel Machines," IEEE Transactions on Parallel and Distributed Systems, vol. 2, pp. 361-376, July 1991.

[17] S. Ramaswamy, S. Sapatnekar, and P. Banerjee, "A Convex Programming Approach for Exploiting Data and Functional Parallelism on Distributed Memory Multicomputers," to appear in the 1994 International Conference on Parallel Processing, 1994.

[18] J. G. Holm, A. Lain, and P. Banerjee, "Compilation of Scientific Programs into Multithreaded and Message Driven Computation," to appear in the 1994 Scalable High Performance Computing Conference, 1994.
[19] A. Lain and P. Banerjee, "Techniques to Overlap Computation and Communication in Irregular Iterative Applications," to appear in the 1994 ACM International Conference on Supercomputing, 1994.

[20] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, "Active Messages: a Mechanism for Integrated Communication and Computation," in Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 256-266, May 1992.

[21] M. Gerndt, "Updating Distributed Variables in Local Computations," Concurrency: Practice and Experience, vol. 2, pp. 171-193, Sept. 1990.

[22] V. Balasundaram, G. Fox, K. Kennedy, and U. Kremer, "An Interactive Environment for Data Partitioning and Distribution," in Proceedings of the 5th Distributed Memory Computing Conference, (Charleston, SC), Apr. 1990.

[23] H. T. Kung, "Why Systolic Architectures?," Computer, vol. 15, no. 1, pp. 37-46, 1982.

[24] D. I. Moldovan, Parallel Processing: From Applications to Systems. Morgan Kaufmann, 1993.

[25] F. McMahon, "The Livermore Fortran Kernels: A computer test of the numerical performance range," Tech. Rep. UCRL-53745, Lawrence Livermore National Laboratory, 1986.

[26] D. Bailey, J. Barton, T. Lasinski, and H. Simon, "The NAS Parallel Benchmarks," Tech. Rep. RNR-91-002, NASA Ames Research Center, 1991.


More information

Cache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access

Cache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access Agenda Cache Performance Samra Khan March 28, 217 Revew from last lecture Cache access Assocatvty Replacement Cache Performance Cache Abstracton and Metrcs Address Tag Store (s the address n the cache?

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries Run-Tme Operator State Spllng for Memory Intensve Long-Runnng Queres Bn Lu, Yal Zhu, and lke A. Rundenstener epartment of Computer Scence, Worcester Polytechnc Insttute Worcester, Massachusetts, USA {bnlu,

More information

Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution

Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution Dynamc Voltage Scalng of Supply and Body Bas Explotng Software Runtme Dstrbuton Sungpack Hong EE Department Stanford Unversty Sungjoo Yoo, Byeong Bn, Kyu-Myung Cho, Soo-Kwan Eo Samsung Electroncs Taehwan

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Efficient Distributed File System (EDFS)

Efficient Distributed File System (EDFS) Effcent Dstrbuted Fle System (EDFS) (Sem-Centralzed) Debessay(Debsh) Fesehaye, Rahul Malk & Klara Naherstedt Unversty of Illnos-Urbana Champagn Contents Problem Statement, Related Work, EDFS Desgn Rate

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning Parallel Inverse Halftonng by Look-Up Table (LUT) Parttonng Umar F. Sddq and Sadq M. Sat umar@ccse.kfupm.edu.sa, sadq@kfupm.edu.sa KFUPM Box: Department of Computer Engneerng, Kng Fahd Unversty of Petroleum

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT

DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT Bran J. Wolf, Joseph L. Hammond, and Harlan B. Russell Dept. of Electrcal and Computer Engneerng, Clemson Unversty,

More information

Design of a Real Time FPGA-based Three Dimensional Positioning Algorithm

Design of a Real Time FPGA-based Three Dimensional Positioning Algorithm Desgn of a Real Tme FPGA-based Three Dmensonal Postonng Algorthm Nathan G. Johnson-Wllams, Student Member IEEE, Robert S. Myaoka, Member IEEE, Xaol L, Student Member IEEE, Tom K. Lewellen, Fellow IEEE,

More information

Evaluation of Parallel Processing Systems through Queuing Model

Evaluation of Parallel Processing Systems through Queuing Model ISSN 2278-309 Vkas Shnde, Internatonal Journal of Advanced Volume Trends 4, n Computer No.2, March Scence - and Aprl Engneerng, 205 4(2), March - Aprl 205, 36-43 Internatonal Journal of Advanced Trends

More information

AMath 483/583 Lecture 21 May 13, Notes: Notes: Jacobi iteration. Notes: Jacobi with OpenMP coarse grain

AMath 483/583 Lecture 21 May 13, Notes: Notes: Jacobi iteration. Notes: Jacobi with OpenMP coarse grain AMath 483/583 Lecture 21 May 13, 2011 Today: OpenMP and MPI versons of Jacob teraton Gauss-Sedel and SOR teratve methods Next week: More MPI Debuggng and totalvew GPU computng Read: Class notes and references

More information

EFFICIENT SYNCHRONOUS PARALLEL DISCRETE EVENT SIMULATION

EFFICIENT SYNCHRONOUS PARALLEL DISCRETE EVENT SIMULATION EFFICIENT SYNCHRONOUS PARALLEL DISCRETE EVENT SIMULATION WITH THE ARMEN ARCHITECTURE C. Beaumont, B. Potter J.M. Flloque LIBr I.U.T. de Brest and LIBr Unversté de Bretagne Occdentale Télécom Bretagne BP

More information

A One-Sided Jacobi Algorithm for the Symmetric Eigenvalue Problem

A One-Sided Jacobi Algorithm for the Symmetric Eigenvalue Problem P-Q- A One-Sded Jacob Algorthm for the Symmetrc Egenvalue Problem B. B. Zhou, R. P. Brent E-mal: bng,rpb@cslab.anu.edu.au Computer Scences Laboratory The Australan Natonal Unversty Canberra, ACT 000, Australa

More information

Real-Time Systems. Real-Time Systems. Verification by testing. Verification by testing

Real-Time Systems. Real-Time Systems. Verification by testing. Verification by testing EDA222/DIT161 Real-Tme Systems, Chalmers/GU, 2014/2015 Lecture #8 Real-Tme Systems Real-Tme Systems Lecture #8 Specfcaton Professor Jan Jonsson Implementaton System models Executon-tme analyss Department

More information

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z. TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS Muradalyev AZ Azerbajan Scentfc-Research and Desgn-Prospectng Insttute of Energetc AZ1012, Ave HZardab-94 E-mal:aydn_murad@yahoocom Importance of

More information

A Parallel Gauss-Seidel Algorithm for Sparse Power System. Matrices. D. P. Koester, S. Ranka, and G. C. Fox

A Parallel Gauss-Seidel Algorithm for Sparse Power System. Matrices. D. P. Koester, S. Ranka, and G. C. Fox A Parallel Gauss-Sedel Algorthm for Sparse Power System Matrces D. P. Koester, S. Ranka, and G. C. Fox School of Computer and Informaton Scence and The Northeast Parallel Archtectures Center (NPAC) Syracuse

More information

WCET-Directed Dynamic Scratchpad Memory Allocation of Data

WCET-Directed Dynamic Scratchpad Memory Allocation of Data WCET-Drected Dynamc Scratchpad Memory Allocaton of Data Jean-Franços Deverge and Isabelle Puaut Unversté Européenne de Bretagne / IRISA, Rennes, France Abstract Many embedded systems feature processors

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

124 Chapter 8. Case Study: A Memory Component ndcatng some error condton. An exceptonal return of a value e s called rasng excepton e. A return s ssue

124 Chapter 8. Case Study: A Memory Component ndcatng some error condton. An exceptonal return of a value e s called rasng excepton e. A return s ssue Chapter 8 Case Study: A Memory Component In chapter 6 we gave the outlne of a case study on the renement of a safe regster. In ths chapter wepresent the outne of another case study on persstent communcaton;

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

Distributed Resource Scheduling in Grid Computing Using Fuzzy Approach

Distributed Resource Scheduling in Grid Computing Using Fuzzy Approach Dstrbuted Resource Schedulng n Grd Computng Usng Fuzzy Approach Shahram Amn, Mohammad Ahmad Computer Engneerng Department Islamc Azad Unversty branch Mahallat, Iran Islamc Azad Unversty branch khomen,

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

A Parallelization Design of JavaScript Execution Engine

A Parallelization Design of JavaScript Execution Engine , pp.171-184 http://dx.do.org/10.14257/mue.2014.9.7.15 A Parallelzaton Desgn of JavaScrpt Executon Engne Duan Huca 1,2, N Hong 2, Deng Feng 2 and Hu Lnln 2 1 Natonal Network New eda Engneerng Research

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

CACHE MEMORY DESIGN FOR INTERNET PROCESSORS

CACHE MEMORY DESIGN FOR INTERNET PROCESSORS CACHE MEMORY DESIGN FOR INTERNET PROCESSORS WE EVALUATE A SERIES OF THREE PROGRESSIVELY MORE AGGRESSIVE ROUTING-TABLE CACHE DESIGNS AND DEMONSTRATE THAT THE INCORPORATION OF HARDWARE CACHES INTO INTERNET

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

ELEC 377 Operating Systems. Week 6 Class 3

ELEC 377 Operating Systems. Week 6 Class 3 ELEC 377 Operatng Systems Week 6 Class 3 Last Class Memory Management Memory Pagng Pagng Structure ELEC 377 Operatng Systems Today Pagng Szes Vrtual Memory Concept Demand Pagng ELEC 377 Operatng Systems

More information

Parallel Incremental Graph Partitioning Using Linear Programming

Parallel Incremental Graph Partitioning Using Linear Programming Syracuse Unversty SURFACE College of Engneerng and Computer Scence - Former Departments, Centers, Insttutes and roects College of Engneerng and Computer Scence 994 arallel Incremental Graph arttonng Usng

More information

Real-time Fault-tolerant Scheduling Algorithm for Distributed Computing Systems

Real-time Fault-tolerant Scheduling Algorithm for Distributed Computing Systems Real-tme Fault-tolerant Schedulng Algorthm for Dstrbuted Computng Systems Yun Lng, Y Ouyang College of Computer Scence and Informaton Engneerng Zheang Gongshang Unversty Postal code: 310018 P.R.CHINA {ylng,

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

UB at GeoCLEF Department of Geography Abstract

UB at GeoCLEF Department of Geography   Abstract UB at GeoCLEF 2006 Mguel E. Ruz (1), Stuart Shapro (2), June Abbas (1), Slva B. Southwck (1) and Davd Mark (3) State Unversty of New York at Buffalo (1) Department of Lbrary and Informaton Studes (2) Department

More information

Agenda & Reading. Simple If. Decision-Making Statements. COMPSCI 280 S1C Applications Programming. Programming Fundamentals

Agenda & Reading. Simple If. Decision-Making Statements. COMPSCI 280 S1C Applications Programming. Programming Fundamentals Agenda & Readng COMPSCI 8 SC Applcatons Programmng Programmng Fundamentals Control Flow Agenda: Decsonmakng statements: Smple If, Ifelse, nested felse, Select Case s Whle, DoWhle/Untl, For, For Each, Nested

More information

CHAPTER 4 PARALLEL PREFIX ADDER

CHAPTER 4 PARALLEL PREFIX ADDER 93 CHAPTER 4 PARALLEL PREFIX ADDER 4.1 INTRODUCTION VLSI Integer adders fnd applcatons n Arthmetc and Logc Unts (ALUs), mcroprocessors and memory addressng unts. Speed of the adder often decdes the mnmum

More information

Computer models of motion: Iterative calculations

Computer models of motion: Iterative calculations Computer models o moton: Iteratve calculatons OBJECTIVES In ths actvty you wll learn how to: Create 3D box objects Update the poston o an object teratvely (repeatedly) to anmate ts moton Update the momentum

More information

APPLICATION OF A COMPUTATIONALLY EFFICIENT GEOSTATISTICAL APPROACH TO CHARACTERIZING VARIABLY SPACED WATER-TABLE DATA

APPLICATION OF A COMPUTATIONALLY EFFICIENT GEOSTATISTICAL APPROACH TO CHARACTERIZING VARIABLY SPACED WATER-TABLE DATA RFr"W/FZD JAN 2 4 1995 OST control # 1385 John J Q U ~ M Argonne Natonal Laboratory Argonne, L 60439 Tel: 708-252-5357, Fax: 708-252-3 611 APPLCATON OF A COMPUTATONALLY EFFCENT GEOSTATSTCAL APPROACH TO

More information

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION 24 CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION The present chapter proposes an IPSO approach for multprocessor task schedulng problem wth two classfcatons, namely, statc ndependent tasks and

More information