2.1. The Program Model


Hyperplane Partitioning: An Approach to Global Data Partitioning for Distributed Memory Machines

S. R. Prakash and Y. N. Srikant
Department of CSA, Indian Institute of Science, Bangalore, India
(Author footnote: with Hewlett-Packard India Software Operation, Bangalore.)

Abstract

Automatic Global Data Partitioning for Distributed Memory Machines (DMMs) is a difficult problem. In this work, we present a partitioning strategy called 'Hyperplane Partitioning' which also works well for loops with non-uniform dependences. Several optimizations and an implementation on IBM-SP are described.

1. Introduction

Data Partitioning (or Distribution) for Distributed Memory Machines (DMMs) has been a difficult problem. A significant amount of work has been done on extracting more and more parallelism from programs []. However, unless the communication overhead in the programs is reduced, they cannot be expected to run efficiently. There have been efforts in which the programmers are asked to give the data allocation themselves and the compiler generates code automatically []. This might prove very difficult for programmers, especially when the loops contain numerous array references. Reduction in communication overhead has been studied by many researchers, such as Ramanujam and Sadayappan [6], who tried to transform programs to gain parallelism as well as to reduce communication overhead.

It has been seen, as in [7], that for a loop whose access patterns cannot be statically analyzed to get the best partitioning, compilers have traditionally generated sequential code. Although this pessimistic strategy is safe and simple, it essentially precludes the automatic parallelization of an entire class of programs with irregular domains and/or dynamically changing interactions. For such loops, the general strategy adopted is to use inspector, scheduler and executor codes [7]. However, such techniques are for shared-memory machines, where there is no problem of data partitioning.

In this paper, we propose a method by which we can find a partition of nested do-loops which reduces communication when executed on distributed memory machines.
Our method differs from other works in the following ways. Firstly, no assumptions are made regarding whether loops are doall or not, as assumed in []. Secondly, the array references in the loops can be any linear functions of the induction variables, not just functions of the form $\vec{i} + \vec{c}$ as assumed in [] and others, where $\vec{i}$ is the index vector and $\vec{c}$ is a constant vector. Such functions of loop indices can lead to non-uniform dependences. According to an empirical study, .% of two-dimensional array references contain coupled subscripts [8], and most compilers run them sequentially due to the difficulty of analyzing such loops.

2. Hyperplane Partitioning

We first present the program model assumed in this work and then show how to partition the iteration space using Hyperplane Partitioning.

2.1. The Program Model

In general, scientific programs contain a large number of array references in nested loops. In such programs, nested loops are the main source of parallelism and are the most time-consuming parts. A normalized n-nested loop [] is considered in this work. The body of the loop, $H[i_1, i_2, \ldots, i_n]$, contains a set of assignment statements possibly containing array references. The array references considered in this work are of the form

$$X(a_{11} i_1 + a_{12} i_2 + \cdots + a_{1n} i_n + a_{10},\ \ldots,\ a_{m1} i_1 + \cdots + a_{mn} i_n + a_{m0}),$$

which can be compactly written as $X(A\vec{i} + \vec{a}_0)$, where

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}, \qquad \vec{a}_0^{\,T} = (a_{10}, a_{20}, \ldots, a_{m0}).$$

In this work, without loss of generality, we assume that m = n. This assumption does not reduce generality, since we can add dummy dimensions if m < n, or dummy loops which run for a single iteration if n < m.

2.2. Iteration Space Partitioning

Consider the following example.

Example 1:

    for = to / do
    for j = to / do
    begin
    [+j,+j] = [,j] + [,j]
    [,j] = [,j]*
    end;

The dependence graph for Example 1 is given in Figure 1(a). The dependence graph shows how the index points depend on each other. The problem here is to find the iteration partition such that the communication incurred in executing these partitions is as small as possible. Finding the best possible partition, which is a zero-communication partition, requires finding the vertical partition of the dependence graph [], which in general requires spanning the entire iteration space. This takes an enormous amount of time, especially when there are many loops, each running over a large number of iterations.

2.3. Computing the Hyperplane of Partition

The method to compute the Hyperplane of Partition, in brief, is as follows. First, for every pair of references, the dependence equation is computed. For each pair, the direction of dependence is computed. From these directions of dependence, we compute a hyperplane which is used to partition the iteration space into as many partitions as there are logical processors. These dependence directions induce data space partitions in every array used in the loop. Each logical processor executes a different partition in parallel, keeping the corresponding data partitions local, and synchronization and non-local accesses are handled at runtime.

Consider the loop given in Example 1. Two arrays are accessed in the loop. The need for communication arises only when there is a dependence between iterations. In such a situation, different processors try to access the same data, which is owned by only one of them.
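The dependence graph described above can be enumerated by brute force for a small iteration space. The following is a minimal sketch under stated assumptions: the two references (a write to X[i + j, 2i + j] and a read of X[i, j]) are hypothetical coefficients chosen for illustration and are not those of Example 1, and the helper names `subscript`, `dependence_edges` and `cut_edges` are invented here. It also counts how many dependences cross processor boundaries under a simple row-block ownership, which is the communication that partitioning tries to avoid.

```python
from itertools import product


def subscript(matrix, offset, point):
    """Evaluate the affine subscript matrix * point + offset."""
    return tuple(
        sum(coef * idx for coef, idx in zip(row, point)) + off
        for row, off in zip(matrix, offset)
    )


def dependence_edges(write_ref, read_ref, lower, upper):
    """List (source, target) pairs of iterations touching the same element.

    The source is the iteration that runs first in lexicographic order.
    """
    points = list(product(range(lower, upper + 1), repeat=2))
    written_by = {}
    for p in points:
        written_by.setdefault(subscript(*write_ref, p), []).append(p)
    edges = []
    for q in points:
        for p in written_by.get(subscript(*read_ref, q), []):
            if p != q:
                edges.append((min(p, q), max(p, q)))
    return edges


def cut_edges(edges, owner):
    """Count dependences whose two iterations run on different processors."""
    return sum(1 for src, dst in edges if owner(src) != owner(dst))


# Hypothetical references (assumption, not the paper's example):
# the loop body writes X[i + j, 2i + j] and reads X[i, j].
WRITE_REF = ([[1, 1], [2, 1]], (0, 0))
READ_REF = ([[1, 0], [0, 1]], (0, 0))

edges = dependence_edges(WRITE_REF, READ_REF, 1, 8)
distances = {(dst[0] - src[0], dst[1] - src[1]) for src, dst in edges}


def row_block(point):
    """Assumed ownership: two processors, four rows of the 8 x 8 space each."""
    return (point[0] - 1) // 4


print(len(edges), "dependences,", len(distances), "distinct distance vectors")
print(cut_edges(edges, row_block), "dependences cross the row-block boundary")
```

Every dependence in this toy loop has a different distance vector, which is what makes the dependences non-uniform. The quadratic cost of this enumeration in larger spaces is the reason an analytical method is preferable.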
If we localize such dependences (i.e., run both the source iteration and the target iteration on the same processor) and keep the data accessed by those iterations local, then we remove the need to communicate both for synchronization between processors (to satisfy inter-iteration dependences) and for non-local data, thus reducing the overall communication. For Example 1, the dependence graph for =6 is shown in Figure 1(a). As can be seen, the dependence directions tend to align themselves along a particular direction (in this case the direction of (-.67,)), provided the conditions (given later) hold. By partitioning the iteration space along that direction and by placing the data accessed by these iterations locally, we can expect the communication to be reduced significantly. Further, since the dependence directions eventually align along a particular direction, we expect the communication cost to reduce as the size of the iteration space increases. Figure 1(b) shows this effect for Example 1 (the curve labelled 'bm.c'). Note also that the same phenomenon does not occur with some standard HPF partitions such as block distributions; the curve labelled 'bm.sc' in Figure 1(b) shows this.

The direction of convergence can be computed analytically for every pair of references in a loop, and the hyperplane which "best fits" this set of directions is taken as the hyperplane of partition for the iteration space. This hyperplane is induced into the different data spaces that are referenced in the loop to get the data partitions.

Theorem (Dependence equation): The general dependence equation for the array references $X[A\vec{i} + \vec{a}_0]$ and $X[B\vec{i} + \vec{b}_0]$ is given by

$$\vec{d} = D\vec{i} + \vec{c}, \qquad D = B^{-1}(A - B), \qquad \vec{c} = B^{-1}(\vec{a}_0 - \vec{b}_0),$$

and any iteration $\vec{j}$ depends on $\vec{i}$ if $\vec{j} = \vec{i} + \vec{d}$. (See [] for the proof.)

In the above theorem, it is assumed that the coefficient matrices are non-singular, i.e., that their inverses exist. The case of singular matrices is dealt with in [].

Definition (Trajectory of index points): For a given pair of references, and for a given loop with lower bounds $\vec{lb}$ and upper bounds $\vec{ub}$, we can build a trajectory of index points by repeatedly applying the dependence equation from an initial index point $\vec{s}$, which takes us to the final index point $\vec{f}$, where both $\vec{s}$ and $\vec{f}$ lie between $\vec{lb}$ and $\vec{ub}$.
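The dependence equation and the trajectory definition can be made concrete with a small two-dimensional sketch. It reads the theorem as $\vec{d} = D\vec{i} + \vec{c}$ with $D = B^{-1}(A - B)$ and $\vec{c} = B^{-1}(\vec{a}_0 - \vec{b}_0)$; the symbol $D$, the function names, and the sample references (X[i + j, 2i + j] and X[i, j]) are assumptions made for illustration, not taken from the paper. Exact rational arithmetic is used so that non-integer points, which are not iterations, can be detected.

```python
from fractions import Fraction


def inverse_2x2(m):
    """Exact inverse of a 2 x 2 integer matrix; rejects singular input."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("coefficient matrix is singular")
    det = Fraction(det)
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def mat_vec(m, v):
    """Multiply a 2 x 2 matrix by a 2-vector."""
    return [sum(m[r][k] * v[k] for k in range(2)) for r in range(2)]


def dependence_equation(A, a0, B, b0):
    """Return (D, c) with D = B^-1 (A - B) and c = B^-1 (a0 - b0)."""
    b_inv = inverse_2x2(B)
    diff = [[A[r][k] - B[r][k] for k in range(2)] for r in range(2)]
    D = [[sum(b_inv[r][t] * diff[t][k] for t in range(2)) for k in range(2)]
         for r in range(2)]
    c = mat_vec(b_inv, [a0[k] - b0[k] for k in range(2)])
    return D, c


def trajectory(D, c, start, lower, upper, max_steps=1000):
    """Apply j = i + (D i + c) repeatedly while staying inside the bounds."""
    points = [tuple(start)]
    for _ in range(max_steps):  # guard against trajectories that never leave
        i = points[-1]
        d = [x + y for x, y in zip(mat_vec(D, i), c)]
        nxt = tuple(x + y for x, y in zip(i, d))
        inside = all(lower[k] <= nxt[k] <= upper[k] for k in range(2))
        integral = all(Fraction(v).denominator == 1 for v in nxt)
        if nxt == i or not inside or not integral:
            break
        points.append(tuple(int(v) for v in nxt))
    return points


# Hypothetical pair of references (assumption): X[i + j, 2i + j] and X[i, j].
D, c = dependence_equation([[1, 1], [2, 1]], (0, 0), [[1, 0], [0, 1]], (0, 0))
path = trajectory(D, c, (1, 1), (1, 1), (100, 100))
steps = [(q[0] - p[0], q[1] - p[1]) for p, q in zip(path, path[1:])]

print(path)
for step in steps:
    print(step, "slope", step[1] / step[0])
```

For this sample pair the successive dependence vectors have slopes that settle quickly, which illustrates the alignment of dependence directions discussed in the text.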

Figure 1. (a) Dependence Graph (b) The Nature of Dependences (curves 'bm.c' and 'bm.sc') (c) Best-fit line for given lines

Definition (Direction of Dependence): The direction of dependence for a pair of references having subscript functions $A\vec{i} + \vec{a}_0$ and $B\vec{i} + \vec{b}_0$ is defined as the direction $\vec{d}_k$ such that $\vec{d}_{k+1} = \alpha\,\vec{d}_k$, where $\vec{d}_{k+1}$ and $\vec{d}_k$ are the dependence directions at two adjacent points on the trajectory of index points for the given pair of references and $\alpha$ is a constant, provided such a constant exists. Otherwise, the direction of dependence is said to be oscillatory.

Theorem (Direction of Dependence): Suppose $M = D + I$, where $D$ is the matrix of the dependence equation $\vec{d} = D\vec{i} + \vec{c}$. The direction of dependence for a pair of references having subscript functions $A\vec{i} + \vec{a}_0$ and $B\vec{i} + \vec{b}_0$ is the eigenvector of $M$ corresponding to the dominant eigenvalue of that matrix, if $M$ has n linearly independent eigenvectors and has a dominant eigenvalue. (See [] for the proof.)

The theorem also says that this need not always happen; the other case is when such an eigenvalue does not exist. Then the trajectory either revolves around the origin spirally or diverges, depending on whether the eigenvalues are not real or real, respectively. For such cases, refer to [].

In this section, we see how to get the hyperplane which partitions the iteration space into as many tiles as there are processors. These tiles can be used to induce a data partition in the data space, using a particular reference in the loop for every array in the loop. The tiles and the corresponding data partitions can be placed in the local memory of the respective processor, and we can run those partitions in parallel by introducing synchronizing messages to handle dependences. The hyperplane that minimizes the deviations from the given directions is the one which minimizes the sum of the squares of the sines of the angles between the hyperplane and the directions (see Figure 1(c)). See [] for details.

Theorem (The best-fit hyperplane): Given p lines in n dimensions, with direction cosines $x_{ij}$, $1 \le i \le p$, $1 \le j \le n$, passing through the origin, the hyperplane $\sum_{j=1}^{n} a_j x_j = 0$, which also passes through the origin and is the best fit for the points at unit distance from the origin, has the coefficients $a_j$, $1 \le j \le n$, such that $Xa = b$, where

$$X_{kj} = \sum_{i=1}^{p} x_{ik}\,x_{ij} \quad (1 \le j, k \le n-1), \qquad b_k = -\sum_{i=1}^{p} x_{ik}\,x_{in} \quad (1 \le k \le n-1), \qquad a_n = 1.$$

(See [] for the proof.)

Finally, we need to partition the array data space from the iteration partition. The hyperplane obtained from the above analysis can be induced into the array data spaces as well.

3. Compiler Optimizations

3.1. Space Optimization

We should use memory efficiently by keeping in each processor only the amount of memory required by the partition that resides there. Consider an array in a loop with a distribution as shown in Figure 2. Every processor uses just the partition which it owns. This results in what are known as holes in memory: the locations which a processor does not own are called holes. In general, for processor p, the locations in sections $P_k$ are holes for all $k \ne p$. The holes are not used, i.e., they are neither read nor written. The hyperplane which partitions the iteration space, and which in turn induces a partition in the data space, can be thought of as a plane expressed in the standard basis vectors. Now, if we change the basis so that the hyperplane becomes parallel to one of the axes-planes, the partitions become very simple to compute.

Figure 2. Illustrating holes in the memory: (a) sections in the (i, j) indices, (b) sections in the (k, l) indices within PQRS

Example 2: Consider Figure 2. For the sections in panel (a), the corresponding sections after the change of basis are shown in panel (b). The indices have been changed from (i, j) to (k, l). Every processor now holds one-fourth of PQRS instead of the complete array. There are still holes in the new sections, but fewer than before. Similarly, we can construct a basis which has (n - 1) vectors lying on the plane and one normal to the plane, so that the new partition plane, parallel to this hyperplane, is one of the axes-planes in the new basis (as shown in the figure). The algorithm which constructs such a basis is given in []. We can also prove the linear independence of the basis so computed (see [] for details).

3.2. Time Optimization or Uniform Scheduling

Figure 3. The Rectangular Iteration Space (sides a and b; r strips, s strips and r strips cut by lines at angle $\theta$ to the X-axis)

Consider the rectangular iteration space shown in Figure 3. We have to partition it with a hyperplane which makes an angle of $\theta$ with the X-axis. The idea is to partition the iteration space into as many equal parts (by area) as there are logical processors (p). If these parts have (roughly) the same area, then they have (roughly) the same number of iteration points, and thus scheduling is uniform across processors []. We partition the iteration space into p + 1 parts, with the first and the last part owned by the same processor. With that, the middle strip covers the origin and localizes many dependences (since the dependences align themselves along the eigenvector passing through the origin). We partition the strips into three sets, the first and the last set with r strips each and the second with s strips, so that 2r + s = p + 1. The strips are placed at distances h, as marked in Figure 3. The algorithm to compute the strip sizes, h, is given in []. Essentially, we place the strips at such distances that the area of each strip is (roughly) the same.
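The equal-area criterion can be illustrated numerically. The sketch below is not the strip-size algorithm of []; it is a generic stand-in that assumes a rectangle with corners (0, 0) and (a, b), cuts it with parallel lines at angle theta to the X-axis, and finds the line offsets by bisection on the clipped area. The function names and the choice of measuring offsets from the line through the origin are assumptions made for this example.

```python
import math


def clipped_area(a, b, theta, h):
    """Area of the part of [0, a] x [0, b] whose signed offset is at most h.

    The signed offset of (x, y) from the line through the origin at angle
    theta to the X-axis is y * cos(theta) - x * sin(theta).
    """
    s, c = math.sin(theta), math.cos(theta)
    rect = [(0.0, 0.0), (a, 0.0), (a, b), (0.0, b)]
    kept = []
    for idx, cur in enumerate(rect):
        nxt = rect[(idx + 1) % 4]
        t_cur = cur[1] * c - cur[0] * s - h
        t_nxt = nxt[1] * c - nxt[0] * s - h
        if t_cur <= 0:
            kept.append(cur)
        if (t_cur < 0 < t_nxt) or (t_nxt < 0 < t_cur):
            w = t_cur / (t_cur - t_nxt)  # where the edge meets the cut line
            kept.append((cur[0] + w * (nxt[0] - cur[0]),
                         cur[1] + w * (nxt[1] - cur[1])))
    twice_area = 0.0
    for idx, (x1, y1) in enumerate(kept):
        x2, y2 = kept[(idx + 1) % len(kept)]
        twice_area += x1 * y2 - x2 * y1  # shoelace formula
    return abs(twice_area) / 2.0


def equal_area_cuts(a, b, theta, parts):
    """Offsets of the parts - 1 cut lines that give strips of equal area."""
    s, c = math.sin(theta), math.cos(theta)
    offsets = [y * c - x * s for x, y in ((0, 0), (a, 0), (a, b), (0, b))]
    cuts = []
    for k in range(1, parts):
        target = k * a * b / parts
        lo, hi = min(offsets), max(offsets)
        for _ in range(60):  # bisection: clipped_area is monotone in h
            mid = (lo + hi) / 2.0
            if clipped_area(a, b, theta, mid) < target:
                lo = mid
            else:
                hi = mid
        cuts.append((lo + hi) / 2.0)
    return cuts


cuts = equal_area_cuts(6.0, 3.0, 0.5, 5)
areas = [clipped_area(6.0, 3.0, 0.5, h) for h in cuts] + [6.0 * 3.0]
strips = [hi - lo for lo, hi in zip([0.0] + areas, areas)]
print([round(h, 4) for h in cuts])
print([round(s, 4) for s in strips])
```

The strips near the corners come out wider than those in the middle, because a line at an angle cuts a shorter segment there. Assigning the first and last of p + 1 strips to one processor, as the text describes, would need a different area target for those two strips and is not attempted here.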
Thus we achieve uniform scheduling. Scheduling for non-rectangular iteration spaces and for iteration spaces of higher dimensions, with examples, is given in [].

3.3. Message Optimizations

Since sending a message is far costlier than a computation, efforts have to be made to reduce the number of messages, even if that amounts to delaying a few messages. We have implemented Message Vectorization and Message Aggregation by a single strategy in our tool []. Whenever a processor p has to send a message to another processor q, instead of sending the message immediately, it waits to see whether there are any more messages for the same destination q. The amount of time it can wait is critical, however, since indefinite waiting can cause deadlocks. So processor p waits until either the buffer in which it stores the pending messages is full, or p has to wait on some processor s for another message containing either data or a synchronization signal. We have seen that the message optimization reduced the number of messages by up to 6%, and thus improved performance significantly. See [] for details.

4. Performance

The Hyperplane Partitioning technique explained above is a local optimization, i.e., it applies to a single loop. The same technique can be applied to a sequence of loops so as to ensure minimal communication for all the loops taken together when run on a DMM. This needs a good communication cost estimator, which estimates the communication that would be incurred if a given loop were run with given iteration and data partitions []. The tool we have developed, Hyperplane Partitioner, performs the Global Data Partitioning. The Hyperplane Partitioner was tested for performance on some benchmark programs from S and on some programs designed by us to show the merits of the tool. The benchmarks selected were ADI (Alternating Direction Implicit) and SYRK. We give the results for ADI here (refer to [] for SYRK results). The ADI program has 6 loops, both single-nested and two-nested. Since the ADI loops are very short, we unroll the loops a few times and then run them in parallel.
The Global Data Partitioner found that the best way to run the program is to partition the loop sequence into two regions, the first with the first three loops and the second with the last three loops; notationally, {{1,2,3},{4,5,6}}. The partition planes found for the first and the second set correspond to the (BLOCK,*) and (*,BLOCK) partitions, respectively. Figure 4(c) shows the speed-ups for this benchmark for different sizes of arrays and different numbers of processors. In the figure, 'bmx.n.y' gives the performance of benchmark BM-x on 'n' processors. The number of times the loops were unrolled is encoded by 'y', which varies from benchmark to benchmark and is found experimentally.

Figure 4. The speed-ups for different benchmarks: (a) speed-ups for BM-1, (b) speed-ups for BM-2, (c) speed-ups for ADI

The first of our programs (BM-1) has five loops with three two-dimensional arrays. All the loops access the arrays in a similar manner (they have the same dependence direction). This program shows that the tool finds static distributions when those are the best. Figure 4(a) shows the performance on IBM-SP for different sizes of arrays and different numbers of processors. For IBM-SP, the tool chose to run all the loops with the same partition of the data spaces, i.e., {{1-5}}, with the same hyperplane.

The second program (BM-2) has six loops with no static partition. There are three arrays, accessed in different ways in different loops. For IBM-SP, the tool decided to partition the loops into three regions. Figure 4(b) shows the performance on IBM-SP for this partition. We also saw by experiment that the speed-ups for these programs with both BLOCK and CYCLIC distributions were inferior to those obtained with the partitions suggested by our tool. See [] for more details.

5. Conclusions

We have seen that there are many cases where we encounter loops which have coupled subscripts and which we want to run effectively on DMMs. The implementation results show good performance for our tool on loops with such non-uniform dependences. The tool also finds HPF-like distributions whenever such distributions are good.
Interprocedural data partitioning analysis and a good scheme for global data partitioning for a general program are lacking in the current implementation.

References

[1] Ananth Agarwal, David Kranz, and Venkat Natarajan. Automatic partitioning of parallel loops and data distribution for distributed shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 6(9):943-962, September 1995.
[2] Utpal Banerjee. Loop Transformations for Restructuring Compilers: The Foundations. Norwell, Mass.: Kluwer Academic Publishers, 1993.
[3] Charles H. Koelbel and Piyush Mehrotra. Compiling global name-space parallel loops for distributed execution. IEEE Transactions on Parallel and Distributed Systems, 2(4):440-451, October 1991.
[4] Prakash S R. Hyperplane partitioning: An approach to global data partitioning for distributed memory machines. Ph.D. dissertation, submitted to Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore, July 1998.
[5] Prakash S R and Y N Srikant. Communication cost estimation and global data distribution for distributed memory machines. In International Conference on High Performance Computing, Bangalore, India, December 1997.
[6] J. Ramanujam and P. Sadayappan. Compile-time techniques for data distribution in distributed memory machines. IEEE Transactions on Parallel and Distributed Systems, 2(4):472-482, October 1991.
[7] Lawrence Rauchwerger, Nancy M. Amato, and David A. Padua. A scalable method for run-time loop parallelization. International Journal of Parallel Programming, ():7{76, May 99.
[8] Zhiyu Shen, Zhiyuan Li, and Pen-Chung Yew. An empirical study of array subscripts and data dependences. In 1989 International Conference on Parallel Processing, volume II, St. Charles, Ill., August 1989.
