A Data-Parallel Implementation of O(N) Hierarchical N-body Methods

The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters.

Citation: Hu, Yu, and S. Lennart Johnsson. A Data-Parallel Implementation of O(N) Hierarchical N-body Methods. Harvard Computer Science Group TR.

Terms of Use: This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at nrs.harvard.edu/urn-3:hul.instrepos:dash.current.terms-ofuse#laa

A Data-Parallel Implementation of O(N) Hierarchical N-body Methods
Yu Hu and S. Lennart Johnsson
TR
May 1996
Parallel Computing Research Group, Center for Research in Computing Technology, Harvard University, Cambridge, Massachusetts
To appear in Proceedings of Supercomputing '96.

Yu Hu† and S. Lennart Johnsson†*
† Division of Applied Sciences, Harvard University, Cambridge, Massachusetts
* Department of Computer Sciences, University of Houston, Houston, Texas
Email: hu@das.harvard.edu, johnsson@cs.uh.edu

Abstract

The O(N) hierarchical N-body algorithms and Massively Parallel Processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We present a data-parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10-25%, and the overall efficiency is about 35%. The evaluation of the potential field of a system of 100 million particles takes 3 minutes and 15 minutes on a 256-node CM-5E, giving expected four and seven digits of accuracy, respectively. The speed of the code scales linearly with the number of processors and the number of particles.

Keywords: N-body simulation, multipole algorithms, hierarchical N-body methods, data-parallel programming, massively parallel processors.

1 Introduction

The problem of computing the force (or the potential) exerted on one another by a system of electrical charges (or masses interacting gravitationally) has been widely studied and has applications in areas such as celestial mechanics, plasma physics, and molecular dynamics. Algorithms that compute the forces for a system of N particles in O(N) operations have been devised [13, 14, 12, 6, 36, 1]. The constant of proportionality is in the range 1,000-10,000. Earlier hierarchical algorithms, such as those proposed by Appel [2] and by Barnes and Hut [4], were believed to have an arithmetic complexity of O(N log N) and did not have a rigorous error bound, although Appel's version was later proved to be of O(N) [10]. Both methods were later extended to O(N) complexity with analytical error bounds by combining them with the idea of multipole expansions [11, 35].
Parallel implementations of the O(N log N) or O(N) hierarchical N-body methods have been of great interest, since Massively Parallel Processors (MPPs) offer the primary storage and compute power needed to simulate systems with several hundred million particles using these fast algorithms. Table 1 gives a summary of sequential and parallel implementations of hierarchical N-body methods. In comparing performance results from implementations of different N-body methods, often with different parameters, on different platforms running at different clock speeds, we propose the use of efficiency of floating-point operations and cycles per particle as the standard measures. Efficiency alone is insufficient in comparing different algorithms that require different numbers of operations. Cycles per particle incorporates machine size, clock rate, and the arithmetic complexities of the different methods, but it does not distinguish nodal architecture; e.g., superscalar architectures can perform multiple operations per cycle.

| Author | Method & error | Prog. model | N | Eff. | Cycles/particle | P | System |
|---|---|---|---|---|---|---|---|
| Adaptive O(N log N) methods | | | | | | | |
| Salmon [27] | BH, quadrupole | MP | | | | | Ncube |
| Warren, Salmon [33] | BH, quadrupole | MP | 8.78M | 26% | 180K | 512 | Intel Delta |
| Warren, Salmon [34] | BH, error bound 10^-3 | MP | 8.78M | 28% | 266K | 512 | Intel Delta |
| Warren, Salmon [35] | BH, error bound 10^-2 | MP | 2M | | 111K | 256 | CM-5E |
| Liu, Bhatt [24] | BH, quadrupole | MP | 10M | 30% | 97K | 256 | CM-5 |
| Singh et al. [29] | BH | SM | | | | | DASH, KSR-1 |
| Nonadaptive O(N) methods | | | | | | | |
| Leathrum, Board [23] | GR, p=8 | | 100K | 65% | 250K | 1 | RS/ |
| | GR, p=8 | SM | 1M | 20% | | 32 | KSR-1 |
| Elliott, Board [9] | GR, FFT, p=8 | | 100K | 73% | 200K | 1 | RS/ |
| | GR, FFT, p=8 | SM | 1M | 14% | | 32 | KSR-1 |
| Schmidt, Lee [28] | GR, p=8 | | 40K | 39% | 312K | 1 | Cray YMP 8/864 |
| | GR, p=16 | | 40K | 22% | 1034K | 1 | Cray YMP 8/864 |
| Zhao, Johnsson [37] | Zhao, p=3 | DP | 16K | 12% | 560K | 8K | CM-2 |
| Hu, Johnsson (this work) | Anderson, D=5 | DP | 100M | 27% | 37K | 256 | CM-5E |
| | Anderson, D=14 | DP | 100M | 35% | 183K | 256 | CM-5E |
| Adaptive O(N) methods | | | | | | | |
| Singh et al. [30] | GR, 2-D, adaptive | SM | | | | | DASH, KSR-1 |
| Nyland et al. [26] | GR, 3-D, adaptive | DP | | | | | |

Table 1: A summary of sequential and parallel implementations of hierarchical N-body methods. All performance numbers are for uniform particle distributions. Methods used are for three dimensions, unless otherwise stated. BH and GR denote the Barnes-Hut and Greengard-Rokhlin methods. The error bound listed for BH entries is per partial acceleration, relative to the mean acceleration of the system. Empty entries imply unavailable data. MP, SM, and DP are short for message passing, shared memory, and data parallel, respectively.

Barnes and Hut's O(N log N) method has been implemented using the message-passing programming paradigm by Salmon and Warren [27, 33, 34] on the Intel Touchstone Delta and by Liu and Bhatt [24] on the CM-5. Both groups used assembly language for time-critical kernels. Salmon and Warren achieved efficiencies in the range 24-28%, while Liu and Bhatt achieved 30% efficiency. Recently, Warren and Salmon [35] extended their code to incorporate multipole and local expansions and made it portable to a variety of parallel machines.

For nonadaptive O(N) methods, Greengard and Gropp [12] implemented the Greengard-Rokhlin method in 2-D on a shared-memory machine (the Encore Multimax 320), but the data is not sufficiently complete for inclusion in Table 1. Zhao and Johnsson [37] developed a data-parallel implementation of Zhao's method on the CM-2 and achieved an efficiency of 12% for expansions in Cartesian coordinates, which yield more costly multipole expansion calculations than polar coordinates. Leathrum and Board [23, 5] and Elliott and Board [9] achieved efficiencies in the range 14-20% in implementing the Fast Fourier Transform accelerated Greengard-Rokhlin method [15] on the KSR-1. Schmidt and Lee [28] vectorized this method for the Cray Y-MP and achieved an efficiency of 39% on a single processor. For comparison, we have also included the results reported in this paper.

Little progress has been made in the implementation of adaptive O(N) methods on distributed-memory machines. Singh et al. [29, 30] implemented both O(N log N) and O(N) methods on the Stanford DASH machine, but no measures of the achieved efficiency are available. Nyland et al. [26] discussed how to express the three-dimensional adaptive version of the Greengard-Rokhlin method [6] in a data-parallel subset of the Proteus language, which is still under implementation on parallel machines.

In this paper, we describe a data-parallel implementation of Anderson's method for N-body simulations. The implementation is made in Connection Machine Fortran (CMF) [31] because no High Performance Fortran (HPF) [16] compiler was available at the time of this project. All but one of the features of CMF that we use are also available in HPF. Data motion is managed through the use of data distribution directives and control of the storage-to-sequence association in mapping arrays to the MPP memory units. Additional performance gains are achieved through aggregation of computations, and by a careful trade-off between communication and redundant computation.

Our novel contributions to the implementation of O(N) hierarchical N-body methods on MPPs are
- minimal data motion in parent-children interactions,
- low data motion in neighbor interactions for interactive-field computations,
- redundant computation/communication trade-offs,
- representing translation operations as matrix-vector multiplications (level-2 BLAS),
- aggregating multiple independent translation operations into multiple instances of matrix-matrix multiplications (level-3 BLAS),
- reducing the number of translation operations through the use of supernodes,
- expressing hierarchical operations on flattened data structures efficiently in a data-parallel language,
- efficient memory usage.

Most of our optimization techniques apply to any distributed-memory machine. However, the relative merit of the techniques depends upon machine metrics. We report on the performance trade-offs on the CM-5/5E.

To our knowledge, this work represents the first implementation of Anderson's method on a parallel machine as well as the first implementation of an O(N) N-body algorithm in a data-parallel language. Moreover, the efficiency of our implementation for particle systems with uniform distributions is competitive with that of highly efficient parallel implementations of the Barnes-Hut algorithm that use low-level message passing and assembly language programming. Our implementation is also memory efficient. To our knowledge, this is the first long-range simulation of systems of 100 million particles.

Section 2 briefly describes the computational structure of O(N) N-body methods and the computational elements used in Anderson's method, and defines several of the terms used in the above summary of contributions. Our optimization techniques for programming hierarchical methods in CMF are presented in Section 3. Section 4 reports the performance results of our implementation and the measured accuracy. Section 5 concludes the paper.

[Figure 1: Recursive domain decompositions, the near field, and the interactive field in two dimensions. Panels show mesh levels 0, 1, 2, l, and l+1.]

2 O(N) N-body Methods

The O(N) hierarchical N-body methods [14, 36, 1] share the same computational structure; they differ only in the computational elements used in approximating the aggregated potential or force due to a cluster of faraway particles. In this section we briefly describe the computational structure of the O(N) methods and the computational elements used in Anderson's method.

2.1 Domain Decomposition

The O(N) methods start by refining the computational domain into a hierarchy of smaller and smaller subdomains (see Figure 1). Mesh level 0 represents the entire domain (box). Mesh level l + 1 is obtained from level l by subdividing each subdomain at level l (parent box) into four (in two dimensions) or eight (in three dimensions) equally sized subdomains (child boxes). In an adaptive method, only subdomains with sufficiently many particles are further subdivided. Boxes that are not further subdivided are leaves. Hierarchical methods can be easily extended to rectangular domains in two dimensions and parallelepipedic domains in three dimensions [1].

With respect to each subdomain (box) in the hierarchy, the whole domain is partitioned into three regions. The definition of the three regions has a significant impact on the constant in the O(N) asymptotic arithmetic complexity, as well as on the accuracy of the method. In the original formulation of multipole-based methods [14], the near field in two dimensions is defined as those subdomains that share a boundary point with the considered subdomain. In three dimensions it consists of those subdomains that share a boundary point with the considered subdomain, together with the second-nearest-neighbor subdomains that share a boundary point with the nearest-neighbor subdomains. We refer to these two kinds of near fields as one-separation and two-separation, respectively. In general, the d-separation near field in two or three dimensions contains (2d + 1)^2 or (2d + 1)^3 subdomains, respectively, counting the considered subdomain itself.
The far field of a subdomain is the entire domain excluding the subdomain and its near-field subdomains. The interactive field of a subdomain at level l is the part of the far field that is contained in its parent's near field. In three dimensions, these definitions yield 7(2d + 1)^3 interactive-field subdomains. In the rest of the paper, a two-separation near field is assumed unless otherwise stated.
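The box counts above can be checked by brute force. The sketch below uses hypothetical helper names that do not come from the paper. Working in child-level coordinates, it enumerates the boxes covered by the parent's near field (8(2d + 1)^3 of them in three dimensions) and discards the (2d + 1)^3 boxes in the child's own near field. It assumes, as the (2d + 1)^3 count implies, that a near field includes the box itself.

```python
from itertools import product


def near_field_size(d, dim=3):
    """Number of boxes in a d-separation near field, counting the box itself."""
    return (2 * d + 1) ** dim


def interactive_field_offsets(child, d):
    """Offsets (in child-level boxes) from a child box to its interactive field.

    `child` is a tuple of 0/1 values giving the child's position inside its
    parent along each axis. Illustrative helper, not code from the paper.
    """
    dim = len(child)
    # The parent's near field spans parents -d..d, i.e. child-level
    # coordinates -2d .. 2d+1 when the parent occupies coordinates 0..1.
    lo, hi = -2 * d, 2 * d + 1
    offsets = []
    for pos in product(range(lo, hi + 1), repeat=dim):
        rel = tuple(p - c for p, c in zip(pos, child))
        # Keep boxes outside the child's own near field (Chebyshev distance > d).
        if max(abs(r) for r in rel) > d:
            offsets.append(rel)
    return offsets


if __name__ == "__main__":
    d = 2
    field = interactive_field_offsets((0, 0, 0), d)
    print(near_field_size(d), len(field))
    print(min(o[0] for o in field), max(o[0] for o in field))
```

For d = 2 the enumeration gives 125 near-field boxes and 875 interactive-field boxes, matching 7(2d + 1)^3, and the offsets along one axis run from -4 to 5 for a child in the low corner of its parent.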

2.2 Computational Structure

There are two key ideas in O(N) methods that lead to the linear arithmetic complexity. The first, also used in O(N log N) methods, is to represent a cluster of particles sufficiently far away from an evaluation point by a single computational element, called the far-field potential representation. The exact computational elements are represented by an infinite number of terms in the multipole-based methods, or by sphere integrations in Anderson's method, and hence, in practice, are approximated by elements represented by a finite number of terms or by discretized integrations. The O(N) methods also introduce a local-field potential representation, a second kind of computational element. This element approximates the potential field in a local domain due to particles in the far domain. The second key idea is to hierarchically form and use as few computational elements as possible. The method is to hierarchically combine the children's far-field potentials to form the parent's, and to pass the parent's local-field potential on to its children, as shown in the algorithm below.

O(N) methods can be abstracted in terms of three functions $G$, $\Phi$, $\Psi$, three translation operators $T_1$, $T_2$, and $T_3$, and a set of recursive equations. The physical meanings of $T_1$, $T_2$, and $T_3$ are: shifting a far-field potential, converting a far-field potential to a local-field potential, and shifting a local-field potential. $G$ is the potential function in an explicit Newtonian formulation. $\Phi^l_i$ is the contribution of subdomain $i$ at level $l$ to the potential field in domains in its far field. $\Psi^l_i$ represents the contribution to the potential field in subdomain $i$ at level $l$ due to particles in the far-field region of subdomain $i$, i.e., the local-field potential in subdomain $i$ at level $l$. The computational structure is described as follows [21].

Algorithm: (A generic hierarchical method)

1. Compute $\Phi^h_i$ for all boxes $i$ at the leaf level $h$.

2. Upward pass: for $l = h-1, h-2, \ldots, 2$, compute
$$\Phi^l_n = \sum_{i \in \{\mathrm{children}(n)\}} T_1(\Phi^{l+1}_i).$$

3. Downward pass: for $l = 2, 3, \ldots, h$, compute
$$\Psi^l_i = T_3\bigl(\Psi^{l-1}_{\mathrm{parent}(i)}\bigr) + \sum_{j \in \{\text{interactive field}(i)\}} T_2(\Phi^l_j).$$

4. Far field: evaluate the local-field potential at every particle $k$ inside every leaf-level subdomain,
$$\phi_{k;\,\text{far field}} = \Psi^h_{\mathrm{box}(k)}(k).$$

5. Near field: evaluate the potential field due to the particles in the near field of leaf-level subdomains, using a direct evaluation of the Newtonian interactions with nearby particles,
$$\phi_{k;\,\text{near field}} = \sum_{j \in \{\text{near field}(\mathrm{box}(k))\}} G_j(k).$$

2.3 Optimal Hierarchy Depth

For N uniformly distributed particles and a hierarchy of depth $h$ having $M = 8^h$ leaf-level boxes, the total number of operations required for the above generic hierarchical method is
$$T_{total}(N, M, p) = O(Np) + O(f_1(p)M) + O\bigl((N_{int} f_2(p) + f_3(p))M\bigr) + O(Np) + O\!\left(\frac{N^2}{M}\right),$$

where $p$ is the number of coefficients in the field representation for a computational element, $f_1(p)$, $f_2(p)$, and $f_3(p)$ are the operation counts for the three translation operators, respectively, and $N_{int}$ is the number of interactive-field boxes for interior nodes, i.e., $N_{int} = 875$ for a three-dimensional problem using the Greengard-Rokhlin neighborhood definition. The five terms correspond to the operation counts for the five steps of the method. The minimum value of $T_{total}$ is O(N) for $M = cN$, i.e., the number of leaf-level boxes for the optimal hierarchy depth is proportional to the number of particles. Since the terms linear in $M$ represent the operation counts in traversing the hierarchy, and the term $O(N^2/M)$ represents the operation count in the direct evaluation in the near field, the optimal hierarchy depth balances the time of the hierarchy traversal against that of the direct evaluation. In three dimensions, converting the far-field potentials of interactive-field boxes to local-field potentials dominates the time in traversing the hierarchy. The use of supernodes with two-separation [36, 12] reduces the effective value of $N_{int}$ in three dimensions from 875 to 189, which brings about a dramatic improvement in the overall performance, at the cost of slightly decreased accuracy.

2.4 Anderson's Method

Anderson [1] uses Poisson's formula for representing solutions of Laplace's equation. One advantage of this formulation is that the component operations of the multipole method (the translation operators in equations (2)-(4)) are very easy to formulate for approximations based on Poisson's formula. Another advantage is that the computations in two and three dimensions are very similar. Therefore, a code for three dimensions is easily obtained from a code for two dimensions, or vice versa. Let $g(x, y, z)$ denote potential values on a sphere of radius $a$, and denote by $\Psi$ the harmonic function external to the sphere with these boundary values.
Given a sphere of radius $a$ and a point $\vec{x}$ with spherical coordinates $(r, \theta, \phi)$ outside the sphere, let $\vec{x}_p = (\cos\theta \sin\phi, \sin\theta \sin\phi, \cos\phi)$ be the point on the unit sphere along the vector from the origin to the point $\vec{x}$. The potential value at $\vec{x}$ is (equation (14) in [1])
$$\Psi(\vec{x}) = \frac{1}{4\pi} \int_{S^2} \left[ \sum_{n=0}^{\infty} (2n+1) \left(\frac{a}{r}\right)^{n+1} P_n(\vec{s} \cdot \vec{x}_p) \right] g(a\vec{s}) \, ds, \tag{1}$$
where the integration is carried out over $S^2$, the surface of the unit sphere, and $P_n$ is the $n$th Legendre function. Given a numerical formula for integrating functions on the surface of the sphere with $K$ integration points $\vec{s}_i$ and their corresponding weights $w_i$, the following formula (equation (15) in [1]) is used to approximate the potential at $\vec{x}$:
$$\Psi(\vec{x}) \approx \sum_{i=1}^{K} \left[ \sum_{n=0}^{M} (2n+1) \left(\frac{a}{r}\right)^{n+1} P_n(\vec{s}_i \cdot \vec{x}_p) \right] g(a\vec{s}_i) \, w_i. \tag{2}$$
This approximation is called an outer sphere approximation. Compared with Equation (1), it involves two approximations: the series is truncated, and the integral is discretized.

In approximating Poisson's formula, one first chooses an integration order $D$, which determines the error decay rate of the approximation. One then chooses, among different integration formulas, the one requiring the fewest integration points, which translates into the fewest arithmetic operations for the integration. The optimal choices of $K$, $M$, and $a$ in Table 2 are given by Anderson [1].

The approximation used to represent potentials inside a given region is (equation (16) in [1])
$$\Psi(\vec{x}) \approx \sum_{i=1}^{K} \left[ \sum_{n=0}^{M} (2n+1) \left(\frac{r}{a}\right)^{n} P_n(\vec{s}_i \cdot \vec{x}_p) \right] g(a\vec{s}_i) \, w_i, \tag{3}$$
and is called an inner sphere approximation.

[Table 2: Parameter selections and expected error decay rate of outer/inner sphere approximations in Anderson's method. Columns: order of integration D, K, M, outer sphere radius, inner sphere radius (both relative to the side length of a box), and expected error decay rate. The table entries did not survive in the source.]

The outer sphere and inner sphere approximations define the computational elements in Anderson's method. Outer sphere approximations are constructed for clusters of particles in the leaf-level boxes. During the upward pass, outer sphere approximations of child boxes are combined into a single outer sphere approximation of their parent box ($T_1$) by simply evaluating the potential induced by the component outer sphere approximations at the integration points of the parent outer sphere, as shown in Figure 2. The situation is similar for the other two translations used in the method: shifting a parent box's inner sphere approximation to add to its children's inner sphere approximations ($T_3$), and converting the outer sphere approximations of a box's interactive-field boxes to add to the box's inner sphere approximation ($T_2$).

[Figure 2: Translations as evaluations of the approximations. (a) Translations T1 and T3. (b) Translation T2.]

3 A Data-Parallel Implementation

In this section, we present a data-parallel implementation of Anderson's method in CMF on the CM-5/5E. The optimizations mainly focus on minimizing data movement through careful management of data distribution and data references, and on improving arithmetic efficiency through aggregating field translation operations into high-level BLAS operations. Most optimizations make use of the array aliasing feature of CMF [32]. Since a single processing node of the CM-5/5E has four (virtual) Vector Units (VUs), each with its own ALU, register file, and memory, for clarity we will use VUs instead of processing nodes in the following discussion.

[Figure 3: Embedding of a hierarchy of grids in two 4-D arrays (one for the leaf level, one for the nonleaf levels).]

3.1 Data Structure and Distribution

Maximizing concurrency and minimizing communication among nodes are crucial in achieving high performance on distributed-memory machines, in addition to exploiting spatial and temporal locality in the local memory hierarchies. The fact that the data distribution, or layout, usually is not known until run time further complicates memory management on distributed-memory architectures. Run-time data allocation is the norm when (parallel) codes can be executed on systems with different configurations without recompilation.

There are two main data structures in a hierarchical method: one for storing the potential field in the hierarchy and the other for storing particle information. Far-field potentials are stored for all levels of the hierarchy, since they are computed in the upward pass and used in the downward pass. We embed the hierarchy of far-field potentials in one five-dimensional (5-D) array as follows (see Figure 3): the leaf level is embedded in one 4-D layer of the array, i.e., FAR_POT(1, :, :, :, :), and level $(h - i)$ is embedded in FAR_POT(2, :, $2^{i-1}$ : L : $2^i$, $2^{i-1}$ : M : $2^i$, $2^{i-1}$ : N : $2^i$). Within each 4-D layer, three of the axes represent the organization of the boxes in the three spatial dimensions, while the fourth axis is used to represent data local to a box. The embedding preserves locality between a box and its descendants in the hierarchy. If at some level there is at least one box per VU, then for each box, all its descendants will be allocated to the same VU as the box itself.

Given an array declaration with compiler directives that only specify whether an axis is distributed (parallel) or local to a VU (serial), the Connection Machine Run-Time System attempts to balance subgrid extents and minimize the surface-to-volume ratio. Since communication is minimized for nonadaptive hierarchical methods when the surface-to-volume ratio of the subgrids is minimized, the default layout is ideal. Let the extents of the three spatial axes of the 5-D potential arrays be L, M, and N, respectively.
The extents are equal to the number of leaf-level boxes along the three spatial dimensions, and hence are powers of 2 for a nonadaptive method. The global address space, denoted by $b_{p+n-1} b_{p+n-2} \ldots b_n b_{n-1} b_{n-2} \ldots b_0$, is mapped onto the underlying physical machine. With block allocation, the address field is broken into two parts: the high-order bits form the VU address and the low-order bits form the local memory address. For a multidimensional array with block allocation for each axis, the VU address field and the local memory address field are both broken into segments, one for each axis. Moreover, since on the Connection Machine systems the number of VUs along any axis is constrained to be a power of two, and the number of leaf-level boxes along any axis is a power of two as well, it suffices to consider address bits in studying the layout of boxes. The address fields for each of the two slices of 4-D subarrays in the 5-D potential array are shown in Figure 4.
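The embedding and the block allocation can be illustrated along a single spatial axis. The following is a minimal sketch with hypothetical helper names, not the CMF code: `embedded_positions` lists the 1-based positions selected by the triplet 2^(i-1) : extent : 2^i, and `split_address` applies block allocation by splitting a position into high-order (VU) and low-order (local) bits. The parameter `local_bits` and the assumption that the k-th box of a level occupies the k-th listed position are illustrative choices.

```python
def embedded_positions(i, extent):
    """1-based positions along one spatial axis that hold level h - i (i >= 1).

    Mirrors the Fortran-style triplet 2**(i-1) : extent : 2**i used for the
    nonleaf layer of the potential array.
    """
    return list(range(2 ** (i - 1), extent + 1, 2 ** i))


def split_address(position, local_bits):
    """Block allocation along one axis: (VU index, local offset).

    `position` is 1-based; `local_bits` is the number of low-order address
    bits that stay inside a VU (an assumed, illustrative parameter).
    """
    index = position - 1  # 0-based global address
    return index >> local_bits, index & ((1 << local_bits) - 1)


if __name__ == "__main__":
    extent, local_bits = 8, 2  # 8 leaf boxes along the axis, 4 per VU, 2 VUs
    for i in (1, 2, 3):
        slots = embedded_positions(i, extent)
        print(i, slots, [split_address(p, local_bits)[0] for p in slots])
```

With 8 leaf boxes and 4 boxes per VU, the two boxes of level h - 2 land at positions 2 and 6, that is, on VU 0 and VU 1, the same VUs that hold their descendant leaf boxes 1-4 and 5-8. The positions used by different levels never collide, since level h - i only uses positions divisible by 2^(i-1) but not by 2^i.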

| Axis | Extent | VU address | Local memory address |
|---|---|---|---|
| 0 | K | | b..b |
| 1 | L | b..b | b..b |
| 2 | M | b..b | b..b |
| 3 | N | b..b | b..b |

Figure 4: The allocation of the local potential arrays LOCAL_POT to processing nodes. The global address is $b_{p+n-1} b_{p+n-2} \ldots b_n$ (VU address) followed by $b_{n-1} b_{n-2} \ldots b_0$ (local memory address).

The input to the program consists of a bounding box and relevant particle data given in the form of a collection of 1-D arrays, one for each attribute. For particle-box interactions (steps 1 and 4 of the algorithm) and the direct evaluation in the near field (step 5), it is, however, both convenient and efficient to represent particle attributes as 4-D arrays, with three of the axes representing the domains of the leaf-level potential array boxes, and the fourth representing the particles in those boxes. The particle attributes in the 4-D arrays will be allocated to the same VU as the leaf-level box of the hierarchy to which the particles belong. In the next section we discuss how to accomplish this form of alignment of particle attributes with leaf-level boxes.

3.2 Particle-box Interactions at the Leaf Level

Particle-box interactions occur in forming the far-field potential for leaf-level boxes before traversing the hierarchy, and in evaluating the local-field potential of leaf-level boxes at the particles inside each box after traversing the hierarchy. For the leaf-level particle-box interactions before traversing the hierarchy, the contributions of all particles in a box to each integration point on the sphere in Anderson's method (or to each term in the multipole expansion for the box) must be accumulated. Different boxes have different numbers of particles, and therefore the number of terms added varies across the leaf-level boxes. Once the particles are sorted such that particles belonging to the same box are ordered together, a segmented +-scan is a convenient way of adding the contributions of all the particles within each of the boxes in parallel. A send communication is needed to move data between the 1-D sorted particle arrays used for the +-scan and the 4-D potential arrays for particle-box interactions.
Similarly, a scan and a send communication are required in evaluating the local-field potential at the particles inside the leaf-level boxes. Both scan and send communication may be quite time consuming on the CM-5/5E. However, if the particles are sorted in such a way that they are allocated to the same VU as the leaf-level boxes to which they belong, neither the scan nor the send requires communication. The segmented scan becomes a set of scans local to each VU and can be implemented very efficiently. The sends become local memory references (copies).

Unfortunately, as long as an array assignment involves arrays of different shapes, e.g., 1-D particle arrays and 4-D potential arrays, the CMF compiler generates run-time system calls that handle the most general case, i.e., the case that involves inter-node data movement. Such run-time system calls incur a high overhead even if inter-node data movement never occurs. Using the 4-D particle array representation avoids this scenario. The 4-D particle arrays have the same three parallel axes as the 4-D potential arrays. The scan operations and the send communication become indexing on the fourth, local axis, and no communication is required.

The problem now becomes how to make the 1-D to 4-D reshaping of particle arrays efficient. Since the input particles have to be sorted once to bring particles belonging to the same leaf-level box together, and the cost of sorting is relatively independent of the distributions of sources and destinations, we want the sort to maximize the locality in reshaping the sorted 1-D particle arrays to the 4-D particle arrays, in addition to bringing particles belonging to the same box together. The following coordinate sort (see Figure 5) sorts particles based on keys constructed from the particle locations, the leaf-level box coordinates, and their allocation, and accomplishes this task.

[Figure 5: Sorting particles for maximum locality in reshaping particle arrays. Box addresses: x1x0, y1y0; keys in sorting: y1x1 y0x0.]

Algorithm: (Coordinate sort)

1. Find the layout of the 4-D potential arrays using intrinsic mapping functions, e.g., the number of bits for the VU address and the local memory address for each axis;
2. For each particle, generate the coordinates of the box to which it belongs, denoted by xx..x, yy..y, and zz..z;
3. Split the box coordinates into VU address and local memory address according to the layout of the potential arrays, denoted as x..x|x..x, y..y|y..y, z..z|z..z;
4. Form keys for sorting by concatenating the VU addresses with the local memory addresses, denoted as z..zy..yx..x|z..zy..yx..x;
5. Sort.

Particles belonging to the same box are adjacent to each other after sorting. Furthermore, for a uniform particle distribution, if there is at least one leaf-level box per VU, then each particle in the sorted 1-D array will be allocated to the same VU as the leaf-level box to which it belongs in the local-field potential array. Therefore, no communication is needed in copying particle attributes from the sorted 1-D array to the 4-D array of particle attributes, which has the same layout as the 4-D potential array. For a near-uniform particle distribution, it is expected that the coordinate sort will leave most particles in the same VU memory as the leaf boxes to which they belong.
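The key construction in steps 2-4 of the coordinate sort can be sketched as follows. This is an illustration in Python rather than the CMF implementation; the function names are hypothetical, and the per-axis bit counts `vu_bits` and `local_bits` are passed in as assumed parameters instead of being queried from the run-time system as in step 1.

```python
def box_coords(position, lower, side, boxes_per_axis):
    """Integer (x, y, z) coordinates of the leaf box containing `position`."""
    coords = []
    for p, lo in zip(position, lower):
        c = int((p - lo) / side * boxes_per_axis)
        coords.append(min(max(c, 0), boxes_per_axis - 1))  # clamp the upper face
    return tuple(coords)


def sort_key(box, vu_bits, local_bits):
    """Key z..z y..y x..x | z..z y..y x..x: VU bits first, then local bits.

    `vu_bits` and `local_bits` give the number of address bits per axis in
    (x, y, z) order; both are assumed inputs in this sketch.
    """
    vu_part = 0
    local_part = 0
    # Process z first so that it ends up in the most significant positions.
    for axis in (2, 1, 0):
        lb = local_bits[axis]
        vu_part = (vu_part << vu_bits[axis]) | (box[axis] >> lb)
        local_part = (local_part << lb) | (box[axis] & ((1 << lb) - 1))
    return (vu_part << sum(local_bits)) | local_part


def coordinate_sort(positions, lower, side, boxes_per_axis, vu_bits, local_bits):
    """Return (key, position) pairs sorted by the coordinate-sort key."""
    keyed = [
        (sort_key(box_coords(p, lower, side, boxes_per_axis), vu_bits, local_bits), p)
        for p in positions
    ]
    keyed.sort(key=lambda pair: pair[0])
    return keyed


if __name__ == "__main__":
    particles = [(0.9, 0.1, 0.1), (0.6, 0.1, 0.1), (0.1, 0.3, 0.1),
                 (0.3, 0.1, 0.1), (0.1, 0.1, 0.1)]
    # Unit cube, 4 boxes per axis, 2 VUs per axis with 2 boxes each.
    for key, p in coordinate_sort(particles, (0.0, 0.0, 0.0), 1.0, 4,
                                  (1, 1, 1), (1, 1, 1)):
        print(key, p)
```

In this example the particle in box (0, 1, 0) sorts ahead of the particles in boxes (2, 0, 0) and (3, 0, 0), because it shares a VU with boxes (0, 0, 0) and (1, 0, 0). A plain row-major sort on box coordinates would instead interleave particles from different VUs.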
3.3 Box-box Interactions during Hierarchy Traversal

During the upward pass, combining the far-field potentials of child boxes to form the far-field potential of the parent box ($T_1$) requires parent-child (box-box) interactions. During the downward pass, converting the local-field potentials of parent boxes to those of child boxes ($T_3$) also requires parent-child (box-box) interactions. The downward pass also requires neighbor (box-box) interactions for the conversion of the far-field potentials of interactive-field boxes to local-field potentials ($T_2$).

In Anderson's variant of the fast multipole method, each of the three translation operators used in traversing the hierarchy can be aggregated into matrices, and their actions on the potential field further aggregated into multiple-instance matrix-matrix multiplications. Since there are no other computations in the hierarchy, the entire hierarchical part takes the form of a collection of matrix-matrix multiplications, which are implemented efficiently on most computers as part of the Basic Linear Algebra Subprograms (BLAS) [8, 7, 22]. The Connection Machine Scientific Software Library (CMSSL) [31] supports both single-instance and multiple-instance BLAS.

For the entire hierarchy traversal (steps 2 and 3 of the algorithm), our techniques for optimizing the computations result in an arithmetic efficiency of 40% for K = 12 and a hierarchy of depth eight, and an arithmetic efficiency of 69% for K = 72 and a hierarchy of depth seven. Table 3 summarizes the leaf-level arithmetic efficiencies for K = 12 and K = 72 on a 256-node CM-5E. The peak arithmetic efficiency of about 74% for K = 12 and 85% for K = 72 at the leaf level is degraded by the cost of copying and masking, and by a relatively lower efficiency at the higher levels of the hierarchy. Since the matrix multiplication has complexity O(K^2) and the cost of copying and masking is linear in K, the arithmetic efficiency when including copying and masking decreases more for K = 12 than for K = 72.

| Operation | K = 12, h = 8 | K = 72, h = 7 |
|---|---|---|
| T1, T3: arithmetic | 54% | 60% |
| T2: arithmetic | 74% | 85% |
| T2: arithmetic incl. copy | 60% | 79% |
| T2: arithmetic incl. copy and masking | 44% | 74% |

Table 3: Leaf-level arithmetic efficiencies on a 256-node CM-5E. The aggregation of T2 translations involves copying and masking.
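The step from matrix-vector to matrix-matrix form can be made concrete with a small sketch. It assumes that one K x K translation matrix is shared by all the boxes involved, and it uses plain Python loops where the implementation would call BLAS or CMSSL routines. The function names and the numbers in the matrix are purely illustrative.

```python
def apply_translation(T, g):
    """One translation as a matrix-vector product (level-2 BLAS shape)."""
    return [sum(t * x for t, x in zip(row, g)) for row in T]


def apply_translation_batch(T, G):
    """The same K x K matrix applied to B boxes at once (level-3 BLAS shape).

    G is a K x B matrix whose column b is the potential vector of box b.
    """
    K, B = len(G), len(G[0])
    return [[sum(T[j][i] * G[i][b] for i in range(K)) for b in range(B)]
            for j in range(len(T))]


if __name__ == "__main__":
    T = [[1, 0, 2],
         [0, 1, 0],
         [3, 0, 1]]                      # hypothetical translation matrix, K = 3
    boxes = [[1, 2, 3], [4, 5, 6]]       # potential vectors of two boxes
    # Pack the box vectors as columns of a K x B matrix.
    G = [[box[i] for box in boxes] for i in range(3)]
    batched = apply_translation_batch(T, G)
    for b, box in enumerate(boxes):
        # Column b of the batched result equals the per-box product.
        print(apply_translation(T, box), [row[b] for row in batched])
```

Each column of the batched product equals the corresponding per-box matrix-vector product, so the aggregation changes only how the work is organized, not the result. The larger matrix-matrix form is what allows a level-3 BLAS routine to reuse the matrix entries across many boxes.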
On the communication side, we use techniques for avoiding excess data movement in prefetching interactive-field boxes in neighbor (box-box) interactions, and for extracting/embedding parent and child boxes from/into the embedded potential arrays in parent-child interactions. With these techniques, communication contributes only 12% of the total time for hierarchy traversal for K = 12 and a hierarchy of depth eight. For K = 72 and a hierarchy of depth seven, communication amounts to 25%.

3.3.1 Interactive-field Box-box Communication

The interactive-field computation dominates the hierarchical part of the code. The interactive field of a child box contains all boxes inside a subgrid centered at the center of the child box's parent box, but outside a subgrid centered at the child box. Depending upon which child box of a parent is the target, the interactive field extends two or three boxes beyond the near field at the level of the child box in the positive and negative directions along each axis. The near fields and interactive fields of siblings differ. Each box needs to fetch the potential vectors of its 875 interactive-field boxes during the interactive-field computation.

The simplest way to fetch the potential vectors of neighbor boxes is to use individual CSHIFTs, one for each neighbor, as shown in Figure 6(a). In the Connection Machine Run-Time System, CSHIFTs along more than one axis are implemented as a sequence of independent shifts, one for each axis, resulting in excessive data motion. A better way to structure the CSHIFTs is to impose a linear ordering upon the interactive-field boxes, as shown in Figure 6(b). The potential vectors of neighbor boxes are shifted through each target box, using a CSHIFT with unit offset at every step.

[Figure 6: Optimizing communication in neighbor interactions. (a) Individual CSHIFTs. (b) CSHIFTs with unit offset. (c) Excessive data movement. (d) Stencil communication.]

The three axes use different bits in their VU addresses. In the default axes ordering, the rightmost axis uses the lowest-order bits and the leftmost axis uses the highest-order bits. In some networks, like meshes and hypercubes, node addresses are often defined such that nodes that differ in their lowest-order bits are adjacent. In such networks, a linear ordering that uses CSHIFTs along the rightmost axis most often implies less data motion between nodes than other orderings. This shift order is also advantageous in our implementation due to the bandwidth limitations and the addressing of the fat-tree network of the CM-5/5E.

Unfortunately, the scheme just outlined results in excessive data motion, though less so than the direct use of CSHIFTs. Assume that in two dimensions every VU has an S1 × S2 subgrid of boxes (see footnote 1), and that the CSHIFTs are made most often along the y-axis. Every CSHIFT with unit offset involves a physical shift of boundary elements off-VU and a local copying of the remaining elements. After shifting six steps along the y-axis, the CSHIFT makes a turn and moves along the x-axis in the next step, followed by a sequence of steps along the y-axis in the opposite direction (see Figure 6(c)). All the elements in a VU, except the ones in the last row before the turn, are moved back through the same VUs during the steps after the turn. Thus, this seemingly efficient way of expressing neighbor communication in CMF involves excessive communication in addition to the local data movement. Nevertheless, on a CM-5E it is 7.4 times faster than direct use of CSHIFTs for a subgrid with axes extents 16 and K = 12.
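The linear ordering with unit-offset shifts can be pictured as a back-and-forth path over the neighbor offsets. The sketch below is a two-dimensional illustration only: the window of 5 × 7 offsets is an assumed size chosen so that each column takes six steps along the y-axis, and the function names are hypothetical. It generates such a path and confirms that every step moves by exactly one box, mostly along y.

```python
def snake_order(x_offsets, y_offsets):
    """Visit every (x, y) offset once, moving one box per step, mostly along y."""
    path = []
    ys = list(y_offsets)
    for column, x in enumerate(x_offsets):
        # Alternate the direction of travel along y from column to column.
        run = ys if column % 2 == 0 else ys[::-1]
        path.extend((x, y) for y in run)
    return path


def unit_shifts(path):
    """Shift (dx, dy) applied between consecutive positions of the path."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(path, path[1:])]


if __name__ == "__main__":
    path = snake_order(range(-2, 3), range(-3, 4))
    shifts = unit_shifts(path)
    along_y = sum(1 for dx, dy in shifts if dx == 0)
    print(len(path), len(shifts), along_y)
```

For this window the path visits 35 offsets with 34 unit shifts, 30 of them along y and 4 along x. Each shift in the sketch corresponds to one CSHIFT with unit offset applied to the whole array, which is why data near a turn travels back through VUs it has already passed.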
In order to eliminate excess data movement, we identify the non-local interactive-field boxes for all boxes in the local subgrid, and structure the communication to fetch only those boxes. For a child box on the boundary of the subgrid in a VU, the interactive-field box furthest away from it is at distance four along the axis normal to the boundary of the subgrid. Hence, the ghost region is four boxes deep on each face of the subgrid. Using the array aliasing feature of CMF, the ghost boxes can be easily addressed by creating an array alias that separates the VU address from the local memory address.

¹ We ignore the local axis in this section since communication only happens on parallel axes.

With the subgrids identified explicitly, fetching boxes in ghost regions in three dimensions requires fetching six surface regions, 12 edge regions, and eight corner regions. Fetching the potential vectors of the boxes in the ghost region directly through array sections and CSHIFTs in principle results in no excess data motion. However, due to the implementation of CSHIFTs on the CM-5/5E, excess data motion is incurred, as mentioned above. Hence, this seemingly optimal way of fetching ghost boxes is not the most efficient technique on the CM-5/5E. Creating a linear ordering through all the VUs containing ghost boxes and using CSHIFTs to move whole subgrids, followed by array sectioning after the subgrids are moved to the destination VU, reduces the communication time by a factor of about 1.5. Moving whole subgrids is necessary in order to keep the continuity of the linear ordering of the subgrids. Although some redundant data motion takes place, it is considerably reduced compared to using a linear ordering on an unaliased array. Table 4 summarizes the data motion requirements for the four methods for a subgrid of shape S1 × S2 × S3 with S1 = S2 = S3 = 8. Note that for subgrid extents of less than four along any axis, communication beyond nearest-neighbor VUs is required.

Table 4: Comparison of data motion needs for interactive-field evaluation on a 32-node CM-5E, with the local subgrid of extents 8 and ghost boxes stored in a subgrid when using aliased arrays.
Columns: Method; Number of non-local boxes fetched; Number of local box moves; Number of CSHIFTs; Relative time (K = 12, K = 72).
  Direct on unaliased arrays:
  Linearized unaliased arrays:   85, ,608 1,
  Direct on aliased arrays:      3,584 7,
  Linearized aliased arrays:     4,352 6,

Parent-child Box-box Communication

Using the embedding described in Section 3.1, the far-field potentials of boxes at all levels of the hierarchy are embedded in two layers of a 4-D array.
During traversal of the hierarchy, temporary arrays of a size equal to the number of boxes at the current level of the hierarchy are used in the computation. We abstract two generic functions, Multigrid-embed and Multigrid-extract, for embedding/extracting a temporary array of potential vectors corresponding to some level of the hierarchy into/from the 4-D array. The reduction operator used in the upward pass is abstracted as a Multigrid-reduce operator, while the distribution operator used in the downward pass is abstracted as a Multigrid-distribute operator.

By first creating an array alias separating the local address from the physical address, then using array sectioning, the embedding/extraction operation is performed as a local copy operation if both source and destination locations are local to a VU. If source and destination addresses are on different VUs, which occurs close to the root of the hierarchy of grids, then a two-step procedure is used. First, a temporary array is allocated that corresponds to the level of the hierarchy with the smallest number of boxes that is still larger than the number of VUs, i.e., at least one box on each VU. Then, for Multigrid-embed, the source is first embedded in this temporary array, which is then embedded in the 4-D destination array. The second step is a local copy operation, as before, while the first requires communication. But this communication is much more efficient than the communication in embedding the source directly in the destination array, because the overhead in computing send addresses, which is about linear in the array size, is smaller. This overhead may dominate the actual communication, which is proportional to the number of elements selected. On the CM-5E, the performance of Multigrid-embed is improved by a factor of up to two orders of magnitude using the local copying or the two-step scheme, as shown in Figure 7.

[Figure 7: Performance improvement of Multigrid-embed using array sectioning and aliasing. The two-step scheme was used for the first two cases and local copying was used for the remaining cases. Axes: time (seconds) versus boxes in the temporary array. Curves: use of send in CMF; local copying or two-step scheme.]

Translations as BLAS Operations

In Anderson's method, the translation operators evaluate the approximations of the field on the source spheres at the integration points of the destination spheres (see Figure 2). A field approximation (equation (2) or (3)) can be rewritten as

$$\Phi(\vec{x}_j) \approx \sum_{i=1}^{K} f(\vec{s}_i, \vec{x}_j)\, g(a\vec{s}_i), \qquad j = 1, \ldots, K, \tag{4}$$

where $f(\vec{s}_i, \vec{x}_j)$ represents the inner summation in the original approximation. $f(\vec{s}_i, \vec{x}_j)$ is a function of the unit vector $\vec{s}_i$ from the origin of the source sphere to its ith integration point and the unit vector $\vec{x}_j$ from the origin of the source sphere to the jth integration point on the destination sphere. The evaluation of the field at one integration point on the destination sphere due to the field values at all integration points on the source sphere is an inner product computation. Hence, the evaluation of the field at all integration points on the destination sphere due to the field values at all the integration points of the source sphere constitutes a matrix-vector multiplication, where the matrix is of shape K × K. We refer to this matrix as a translation matrix, since the net effect can be interpreted as a translation of the field from the integration points on the source sphere to the integration points on the destination sphere. The entries of the translation matrix depend only on the relative locations of the source and destination spheres.
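To make the matrix-vector view of equation (4) concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the paper's implementation: `toy_kernel` is a hypothetical stand-in for f (it is not Anderson's inner summation), the three points per sphere are arbitrary, and all function names are invented for this example.

```python
import math


def toy_kernel(s, x):
    """Hypothetical stand-in for f(s_i, x_j); not Anderson's actual inner summation."""
    return 1.0 / (1.0 + math.dist(s, x))


def translation_matrix(source_points, dest_points, kernel):
    """Row j, column i holds kernel(s_i, x_j), so the matrix has shape K x K."""
    return [[kernel(s, x) for s in source_points] for x in dest_points]


def apply_translation(matrix, g):
    """Matrix-vector product: one inner product per destination integration point."""
    return [sum(t * g_i for t, g_i in zip(row, g)) for row in matrix]


def shift(points, delta):
    """Translate every point by the same displacement."""
    return [tuple(p + d for p, d in zip(point, delta)) for point in points]


# K = 3 toy integration points on a source sphere and on a destination sphere
source = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
dest = shift(source, (4, 0, 0))
g = [1.0, 2.0, 3.0]                      # field values at the source points

matrix = translation_matrix(source, dest, toy_kernel)
phi = apply_translation(matrix, g)

# Direct evaluation of the sum in equation (4), one destination point at a time
phi_direct = [sum(toy_kernel(s, x) * g_i for s, g_i in zip(source, g)) for x in dest]

# Moving both spheres by the same displacement leaves the matrix unchanged
moved = (7, -2, 5)
shifted_matrix = translation_matrix(shift(source, moved), shift(dest, moved), toy_kernel)
```

Because the toy kernel depends only on the difference between a source and a destination point, the matrix is the same for every pair of spheres with the same relative location, which is the property that lets one matrix serve all boxes.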
Translation Matrices for T1 and T3

Since in three dimensions a parent has eight children, each of the translation operators T1 and T3 can be represented by eight matrices, one for each of the different parent-child translations. The same matrices can be used for all levels, and for the translations between any parent and its children irrespective of location. Combining the far fields of eight child boxes to form the far field of their parent (T1) can be expressed as eight matrix-vector multiplications, followed by an addition of the resulting vectors.

Due to the symmetry of the distribution of the integration points on the spheres, the eight matrices required to represent T1 (T3) are permutations of each other. One matrix can be obtained through suitable row and column permutations of another. Based on this fact, the above combining translation can be expressed as a matrix-matrix multiplication of the translation matrix and a matrix containing eight permuted potential vectors of the children as columns, followed by permutations of the columns of the product matrix, which are then added to form the potential vector of the parent box. A similar approach can be used for T3. This approach saves on the computation and storage of translation matrices and may achieve better arithmetic efficiency through the aggregated matrix-matrix multiplication. However, on the CM-5E, the time for the permutations exceeds the gain in arithmetic efficiency. In our code we store all eight matrices for each translation operator.

Even though permutations are not used in applying the translation operators to the potential field, they could be used in the precomputation phase. Since the permutations depend on K (the number of integration points) in a nontrivial fashion, using permutations in the precomputation stage would require storage of the permutations for all different values of K. In order to conserve memory we explicitly compute all matrices at run time (when K is known). We discuss redundant computation versus communication trade-offs in the section on precomputing translation matrices.

Translation Matrices for T2

As described before, each box, except boxes sufficiently close to the boundaries, has 875 boxes in its interactive field. Though each of the eight children of a parent requires 875 matrices, the siblings share many matrices. The interactive-field boxes of the eight siblings have offsets in the range

$$[-5+i,\,4+i] \times [-5+j,\,4+j] \times [-5+k,\,4+k] \;\setminus\; [-2,2] \times [-2,2] \times [-2,2], \qquad i, j, k \in \{0, 1\},$$

respectively. Each offset corresponds to a different translation matrix. The union of the interactive fields of the eight siblings has $11^3 - 5^3 = 1206$ boxes with 1206 offsets in the range $[-5,5] \times [-5,5] \times [-5,5] \setminus [-2,2] \times [-2,2] \times [-2,2]$.
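The box counts quoted above can be checked by enumerating the offsets directly. The short Python sketch below does this; the function and variable names are illustrative only, and the child position (i, j, k) follows the offset ranges given in the text.

```python
from itertools import product

NEAR_FIELD = range(-2, 3)  # offsets -2..2 along each axis


def interactive_field(i, j, k):
    """Offsets of the interactive-field boxes for the child at position (i, j, k)."""
    ranges = [range(-5 + c, 5 + c) for c in (i, j, k)]  # -5+c .. 4+c inclusive
    return {o for o in product(*ranges) if not all(c in NEAR_FIELD for c in o)}


# One offset set per sibling, keyed by its position within the parent
siblings = {pos: interactive_field(*pos) for pos in product((0, 1), repeat=3)}
union = set().union(*siblings.values())

sizes = sorted({len(field) for field in siblings.values()})
shared_by_all = len(set.intersection(*siblings.values()))

print(sizes)            # boxes in the interactive field of each sibling
print(len(union))       # distinct offsets, i.e. distinct translation matrices
print(shared_by_all)    # offsets common to all eight siblings
```

Each sibling has 875 offsets and the union has 1206, so the eight siblings together need far fewer than 8 × 875 distinct matrices.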
For ease of indexing, we also generate the translation matrices for the 125 subdomains excluded from the interactive field, for a total of $11^3 = 1331$ matrices.

Aggregation of Translations

Aggregation of computations lowers the overheads in computations. In addition, the aggregation of computations may allow for additional optimizations by providing additional degrees of freedom in scheduling operations at a given time. In Anderson's method, aggregation of computations in parent-child interactions results in multiple-instance matrix-matrix multiplications. Since the same translation matrix is used for the interaction of every parent with its child in a given position, the matrix-vector multiplications can be aggregated into a matrix-matrix multiplication. With the data layout used for the 4-D potential arrays, aggregation can only be performed along one of the three space dimensions without a data reallocation in local memory. Thus, for a K × K matrix, aggregation without data reallocation results in a K × K by K × S matrix multiplication, where S is the extent of the axis of aggregation. But S_m such matrix multiplications can be treated as one multiple-instance matrix-matrix multiplication, where S_m is the length of the axis chosen for the multiple-instance call to a CMSSL multiple-instance matrix multiplication routine.

On a CM-5E, this aggregation improved the performance of the T1 and T3 translations from 58 Mflops/s/PN to 87 Mflops/s/PN for K = 12 and a subgrid of extents [...]. The matrices being multiplied are of shape [...] and 12 × 8, with 16 such instances handled in a single call. For K = 72 and a subgrid of extents [...], the performance improved from 95 Mflops/s/PN to 96 Mflops/s/PN. Matrix shapes are [...] and 72 × 4, and the number of instances is 8. The minor improvement for this case is because the matrices are large enough that, even before aggregation, the matrix-vector multiply already achieves good performance.

For the far-field to local-field conversions in the interactive-field interactions, the union of the interactive fields of all the boxes in a subgrid is four boxes deep on all faces of the subgrid, and is prefetched and stored in a subgrid of shape (S1 + 8) × (S2 + 8) × (S3 + 8). Similar to parent-child interactions, each interactive-field conversion for a single box pair is performed as a matrix-vector multiplication. Since all box pairs with the same relative location use the same translation matrix, conversions for all local boxes and their corresponding interactive-field boxes with the same relative location can be aggregated into a single matrix-matrix multiplication. For the interactive-field computations we rearrange the arrays in local memory via copying such that a single-instance matrix multiplication is performed on matrices of shape K × K and K × (S1/2 · S2/2 · S3/2).

For S1 = S2 = 32, S3 = 16 and K = 12, the execution rate of the [...] by [...] matrix multiplication is 119 Mflops/s/PN. If there are no DRAM page faults, the copying requires 2K cycles for a potential vector, for which the matrix multiplication ideally takes K² cycles. Thus, the relative time for copying is 2/K. For K = 12 this amounts to about 17%. With the cost of copying included, the measured performance of the translation is 85 Mflops/s/PN. For S1 = 16, S2 = 16, S3 = 8 and K = 72, the execution rate of the [...] by [...] matrix multiplication is 136 Mflops/s/PN. Including the cost of copying, the measured performance is 124 Mflops/s/PN.

The copying cost can be reduced by copying a whole column block of (S1 + 8) × S2/2 × S3/2 boxes into two linear memory blocks: one for even slices of the column, and the other for odd slices. Each local column copy can be used on average 8.75 times. The cost of copying is therefore reduced to (S1 + 8)/(S1 K) of that of matrix multiplication, assuming no page faults.
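The aggregation of matrix-vector products into one matrix-matrix product, and the 2/K copy overhead, can be illustrated as follows. This is a plain-Python sketch with made-up integer data and hypothetical helper names; it says nothing about how the CMSSL routine is called or how the arrays are laid out in local memory.

```python
def matvec(t, g):
    """K x K matrix times a length-K potential vector."""
    return [sum(a * b for a, b in zip(row, g)) for row in t]


def matmat(t, m):
    """K x K matrix times a K x n matrix given as a list of rows."""
    cols = list(zip(*m))  # columns of the right-hand matrix
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in t]


def aggregate(vectors):
    """Pack n length-K potential vectors as the columns of a K x n matrix."""
    return [list(row) for row in zip(*vectors)]


def copy_overhead(k):
    """Copy cost relative to the multiply: 2K cycles against K^2 cycles, no page faults."""
    return 2.0 * k / (k * k)


K, n_boxes = 4, 6                                   # toy sizes, not the paper's
t = [[(r + 1) * (c + 2) % 5 for c in range(K)] for r in range(K)]
vectors = [[(b + 1) * (i + 1) for i in range(K)] for b in range(n_boxes)]

one_by_one = [matvec(t, g) for g in vectors]        # n separate matrix-vector products
result = matmat(t, aggregate(vectors))              # one K x K by K x n product
aggregated = [list(col) for col in zip(*result)]    # back to one vector per box

print(aggregated == one_by_one)
print(round(100 * copy_overhead(12)), "percent copy overhead for K = 12")
```

The aggregated product returns the same vectors as the individual products, so the gain is purely in how the arithmetic is scheduled. For K = 12 the estimated copy overhead rounds to the 17% quoted in the text, and it shrinks as K grows.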
Including the cost of copying, the performance of translations in neighbor interactions reaches 96 and 127 Mflops/s/PN for K = 12 and K = 72, respectively. Copying of sections of subgrids to allow for a K × K by K × (S1/2 · S2/2 · S3/2) matrix multiplication is also an alternative in parent-child interactions, but is not used due to its relatively higher cost. See [17] for details.

Precomputing Translation Matrices

All translation matrices are precomputed. Since the translation matrices are shared by all boxes at all levels, only one copy of each matrix is needed on each VU. Two extreme ways of computing these translation matrices are:

1. to compute all the translation matrices on every VU,
2. to compute each translation matrix only once, with different VUs computing different matrices, followed by a spread to all other VUs as a matrix is needed.

In the first method the computations are embarrassingly parallel and no communication is needed. However, redundant computations are performed. In the second method there is no redundant computation, but replication is required. If there are fewer matrices to be computed than there are VUs, then the VUs can be partitioned into groups with as many VUs in a group as there are matrices to be computed. Each group computes the entire collection of matrices, followed by spreads within that group when a matrix is needed. The replication may also be performed as an all-to-all broadcast [20]. The load balance with this amount of redundant computation is the same as with no redundancy, but the communication cost may be reduced.

On the CM-5E, for K varying from 12 to 72, replicating a K × K translation matrix to all nodes is about three to twelve times faster than computing it. Thus, computing the matrices in parallel followed by replication is always a winning choice.

[Figure 8: Computation vs. replication in precomputing translation matrices for T1 (T3) on a 256-node CM-5E. Axes: time (seconds) versus number of integration points on the sphere. Curves: compute 8 matrices on each VU; compute + replicate without grouping; compute + replicate with grouping; replicate portion without grouping; replicate portion with grouping.]

[Figure 9: Computation vs. replication in precomputing translation matrices for T2 on the CM-5E. Axes: time (seconds) versus number of integration points on the sphere. (a) Compute 1331 matrices on every VU; compute in parallel + replicate on 256 PN. (b) Replicate on 32, 64 and 256 PN; compute in parallel on 32, 64 and 256 PN.]

For T1 and T3, replication among eight VUs instead of all VUs is an option. Figure 8 shows the performance of the three methods: no replication, replication among groups of eight VUs, and replication to all VUs. The cost of computing the matrices in parallel followed by replication without grouping is 66% to 24% of that of computing all matrices on each VU, as K varies from 12 to 72. With grouping, the computation cost is the same as without grouping, but the cost of replication is reduced by a factor of 1.75 to 1.26 as K varies from 12 to 72. The difference decreases as K increases because for larger K the replication time is dominated by bandwidth, while for small K latency and overhead dominate.

For T2, computing one copy of each of the 1331 translation matrices and replicating it across all the nodes is up to an order of magnitude faster than computing all of them on every VU, as shown in Figure 9(a) for a 256-node CM-5E. The time for computing 1331 matrices in parallel decreases on larger CM-5Es, as shown in Figure 9(b), while the replication time, which dominates the total time, increases by about 10 to 20% for large K as the number of nodes doubles. As a result, the total time for the method increases by at most 62% as the number of nodes changes from 32 to [...]

Storing all 1331 translation matrices in double precision on each VU requires [...]K² bytes of memory, i.e., 1.53 Mbytes for K = 12 and 53.9 Mbytes for K = 72. Therefore, replication of a matrix is delayed until it is needed. The replication is made through one-to-all broadcast rather than all-to-all broadcast. The total number of replications is 1331(h − 1), where h is the depth of the hierarchy, since the T2 translations are used first at level two.

3.4 Particle-particle Interactions in the Near-field

Since the direct evaluation in the near field accounts for about half of the total arithmetic operations at optimal hierarchy depth, its efficiency is crucial to the overall performance of O(N) methods. It is both efficient and convenient to use 4-D particle arrays in the direct evaluation in the near field. The particle-particle interactions can then be viewed as neighbor box-box interactions: each box interacts with its 124 neighbor boxes in the near field. Each neighbor box-box interaction involves all-to-all interactions between particles in one box and particles in the other. Exploiting the symmetry of interaction (Newton's third law) results in 62 instead of 124 box-box interactions.

One way of exploiting symmetry is shown in a 2-D example in Figure 10. As box 0 traverses boxes 1 to 4, the interactions between box 0 and each of the four boxes are computed. The interactions from the four boxes to box 0 are accumulated and travel along with box 0. In data-parallel programming, while box 0 traverses boxes 1 to 4, boxes 5 to 8 traverse box 0, and the interactions between them and box 0 are computed. The interactions from these four boxes to box 0 are accumulated and stored in box 0. Finally, the two contributions to box 0 are combined with the interactions among particles in box 0. A similar idea for exploiting symmetry, applied directly to particles instead of boxes, was used by Applegate et al. [3].

[Figure 10: Exploiting symmetry in the direct evaluation in the near field.]
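The halving of box-box interactions can be sketched in a few lines of Python. The sketch below is a serial illustration under assumptions: `toy_force` is a hypothetical antisymmetric interaction between whole boxes (one integer "charge" per box), not the particle-level evaluation of the paper, and the helper names are invented. It picks one offset from each pair {d, −d} and applies the equal and opposite contribution to the partner box.

```python
from itertools import product


def neighbor_offsets(dim, radius):
    """All nonzero offsets within `radius` boxes along every axis."""
    return [o for o in product(range(-radius, radius + 1), repeat=dim) if any(o)]


def half_offsets(dim, radius):
    """One offset from each {d, -d} pair (the lexicographically positive one)."""
    zero = (0,) * dim
    return [o for o in neighbor_offsets(dim, radius) if o > zero]


def toy_force(qa, qb, offset):
    """Hypothetical antisymmetric 'force' on box a from box b located at a + offset."""
    return tuple(-qa * qb * c for c in offset)


def accumulate(charges, offsets, symmetric):
    """Sum box-box contributions; with symmetric=True also apply the reaction to b."""
    forces = {box: (0,) * len(box) for box in charges}
    for a, qa in charges.items():
        for d in offsets:
            b = tuple(x + c for x, c in zip(a, d))
            if b not in charges:
                continue  # neighbor lies outside the (non-periodic) toy grid
            f = toy_force(qa, charges[b], d)
            forces[a] = tuple(u + v for u, v in zip(forces[a], f))
            if symmetric:
                forces[b] = tuple(u - v for u, v in zip(forces[b], f))
    return forces


# 4 x 4 grid of boxes in 2-D with nearest neighbors only, as in the 2-D example
charges = {(x, y): 1 + x + 3 * y for x, y in product(range(4), repeat=2)}
full = accumulate(charges, neighbor_offsets(2, 1), symmetric=False)
half = accumulate(charges, half_offsets(2, 1), symmetric=True)

print(len(neighbor_offsets(3, 2)), len(half_offsets(3, 2)))  # 3-D near field
print(full == half)
```

In two dimensions with nearest neighbors the half set has four offsets, matching boxes 1 to 4 of the 2-D example; in three dimensions with a near field two boxes deep it has 62 of the 124 offsets.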
In three dimensions, the boxes involved in box-box interactions of a target box can be ordered linearly and brought to the target box through 62 single-step CSHIFTs. Another way is to fetch non-local near-field boxes from other VUs using 4-D arrays aliased into local subgrids through array aliasing, much in the same way as in fetching non-local interactive-field boxes, as described in the section on interactive-field box-box communication. Due to an optimization trading memory requirements for arithmetic efficiency, described below, the memory requirements in the near-field interactions are high. For this reason we choose the first method, since it requires less temporary storage. Moreover, the CSHIFTs account for only about 10-15% of the time for the direct evaluation.

3.5 Load balancing issues in nonadaptive O(N) Methods

Nonadaptive hierarchical methods use nonadaptive domain decomposition, and the hierarchy of recursively decomposed domains is balanced. Three sources of parallelism exist in traversing the hierarchy, namely, among all the boxes at the same level in parent-child and interactive-field interactions, among each box's interactive-field boxes in the far-field to local-field conversions, and among all boxes at all levels in the [...]
