Chapter 1. Comparison of an O(N) and an O(N log N) N-body solver

Gavin J. Pringle

Abstract

In this paper we compare the performance characteristics of two 3-dimensional hierarchical N-body solvers: an O(N) and an O(N log N) solver. We present the execution times for numerous N-body force evaluations using the two methods, with various values of N and $\epsilon$, where $\epsilon$ is the prescribed error. We find that the O(N log N) method is more suited to problems which demand a high precision and large N. We then consider how parallelisation affects the algorithms' relative performance.

1 Introduction

The N-body problem consists of a collection of N particles, each exerting a force upon one another. Each particle is acted upon by the remaining (N - 1) particles, hence the time to compute the forces acting on all N particles is $O(N^2)$. A number of N-body solvers reduce the time to compute the N-body problem by introducing approximations. In this paper we compare the performance characteristics of two 3-dimensional hierarchical N-body solvers: an O(N) and an O(N log N) solver.

Our O(N) method derives from the Greengard-Rokhlin Fast Multipole Method (FMM) [5, 6]. Examples of other O(N) N-body solvers may be found in [17, 1, 10]. The O(N log N) method utilises the same framework as the FMM, but in a manner which is analogous to the Barnes-Hut Algorithm (BHA) [3]. Both the FMM and the BHA are readily parallelised and have been implemented on a wide range of parallel computers [4, 7, 8, 9, 14, 15, 16, 17].

The basic notion behind these algorithms is that a cluster of particles is replaced by a single source, described by a multipole expansion. The force field exerted by the cluster can be approximated by the force field exerted by this multipole source, provided that the distance between the point of evaluation and the cluster is large enough. Moreover, as the distance to the cluster increases, we may increase the radial size of the multipole source. This idea is effected by delineating the clusters with a hierarchy, or `tree', of cells.
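For reference, the exact pairwise evaluation that the hierarchical solvers approximate can be sketched as below. This is a minimal illustration rather than code from the paper: the function name `direct_potentials`, the choice G = 1 and the gravitational sign convention are assumptions made for the example.

```python
import math


def direct_potentials(positions, masses):
    """Potential at each particle due to all the others (G = 1 assumed).

    Each pair is visited once and contributes to both particles,
    so the cost grows as N * (N - 1) / 2, i.e. O(N^2).
    """
    n = len(positions)
    phi = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(positions[i], positions[j])
            phi[i] -= masses[j] / d  # particle j acting on particle i
            phi[j] -= masses[i] / d  # particle i acting on particle j
    return phi


if __name__ == "__main__":
    print(direct_potentials([(0, 0, 0), (2, 0, 0)], [1.0, 3.0]))
```

Visiting each pair only once halves the work but does not change the quadratic growth, which is what motivates the approximate solvers.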
The difference in the order of the two algorithms lies in the manner in which the hierarchy of cells is utilised. In the O(N log N) method, the interaction model may be described as a `particle-cell interaction' model, that is, the force exerted on each particle is approximated by its interaction with the multipole sources contained in the cell hierarchy. For the O(N) method, however, the interaction model may be described as a `cell-cell interaction' model. This is not strictly true, as the sources contained in the cell hierarchy do not actually interact, but this nomenclature does elucidate the character of the method. In this case, a local expansion is created for every cell in the hierarchy, in terms of the multipole expansions.

Mathematics Department, Napier University, Edinburgh, EH14 1DJ, SCOTLAND

It would appear that this O(N) method is quicker than the O(N log N) method; however, the execution time also depends strongly on the precision required by the calculation and on other specific implementation details. In this paper we present the execution times for numerous N-body force evaluations using the two methods, with various values of N and $\epsilon$, where $\epsilon$ is the prescribed error. Care is taken to ensure that all other parameters are optimised with respect to N and $\epsilon$. These include tree depth and the number of terms taken in each multipole expansion. From the resulting execution times, we are able to determine which of the two methods, the O(N) or O(N log N) method, is faster for a given N and $\epsilon$. We then consider how parallelisation affects the algorithms' relative performance, as the larger memory resource allows for an extended N-space.

2 Informal description of the two solvers.

This section attempts to give a rough illustration of both of the N-body solvers. A fully detailed description of the O(N) solver, and the relevant mathematical operators, may be found in [10]. Both the O(N) method and the O(N log N) method utilise the same hierarchy of multipole expansions, which is described in the following section.

2.1 The hierarchy of multipole expansions

Particles may be grouped together into clusters and represented by a list of coefficients which describes their distribution, namely a multipole expansion. Consider the case of interacting gravitational particles, as shown in figure 1. Suppose we have a cluster of particles of mass $m_i$ at positions $\mathbf{r}'_i$ from the origin.

Fig. 1. A cluster of particles.

If $|\mathbf{r}| > |\mathbf{r}'_i|\ \forall i$, then the scalar potential, $\Phi$, at a point of evaluation, R, located at position $\mathbf{r}$ is given by

$$\Phi(R) = -G \sum_i \frac{m_i}{|\mathbf{r} - \mathbf{r}'_i|} = -\frac{G}{|\mathbf{r}|} \sum_i \sum_{n=0}^{\infty} m_i \left( \frac{|\mathbf{r}'_i|}{|\mathbf{r}|} \right)^{n} P_n(\cos\gamma_i), \tag{1}$$

where $P_n(x)$ is the Legendre polynomial of degree n, $\gamma_i$ is the angle subtended between the vectors $\mathbf{r}$ and $\mathbf{r}'_i$, and G is the gravitational constant. We assume that G = 1 without loss of generality. Eqn.(1) describes the potential as a multipole expansion centred about the origin.
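The expansion in eqn.(1) can be checked numerically against the direct sum. The sketch below is illustrative only and is not the paper's implementation: it evaluates the Legendre form directly rather than through spherical harmonics, takes G = 1, and the names `legendre`, `multipole_potential` and `direct_potential` are hypothetical.

```python
import math


def legendre(n_max, x):
    """Return [P_0(x), ..., P_n_max(x)] via the three-term recurrence."""
    p = [1.0, x]
    for n in range(1, n_max):
        # (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x)
        p.append(((2 * n + 1) * x * p[n] - n * p[n - 1]) / (n + 1))
    return p[: n_max + 1]


def multipole_potential(sources, masses, r_vec, p):
    """Eqn.(1) truncated after the order-p term, with G = 1 (assumption)."""
    r = math.sqrt(sum(c * c for c in r_vec))
    total = 0.0
    for s, m in zip(sources, masses):
        rp = math.sqrt(sum(c * c for c in s))
        if rp == 0.0:
            total += m  # a source at the origin contributes only the n = 0 term
            continue
        cos_gamma = sum(a * b for a, b in zip(s, r_vec)) / (r * rp)
        pn = legendre(p, cos_gamma)
        total += m * sum((rp / r) ** n * pn[n] for n in range(p + 1))
    return -total / r


def direct_potential(sources, masses, r_vec):
    """Exact potential at r_vec by direct summation, with G = 1."""
    return -sum(m / math.dist(s, r_vec) for s, m in zip(sources, masses))


if __name__ == "__main__":
    src = [(0.3, 0.1, -0.2), (-0.4, 0.2, 0.1), (0.0, 0.0, 0.0)]
    mass = [1.0, 2.0, 0.5]
    point = (3.0, 1.0, 2.0)
    exact = direct_potential(src, mass, point)
    for order in (0, 2, 4, 8):
        print(order, abs(multipole_potential(src, mass, point, order) - exact))
```

The evaluation point must lie further from the origin than every source for the series to converge, which is the condition stated before eqn.(1).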
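If we ensure that all the particles lie within a sphere of radius a, i.e. $|\mathbf{r}'_i| < a$, and insist that $|\mathbf{r}| = ca$, for some $c > 1$, then $|\mathbf{r}'_i| / |\mathbf{r}| < 1/c$. If p is the highest order retained in the multipole expansion, then it may be shown that the truncation error is given by

$$\epsilon_{abs} = \left| \frac{1}{|\mathbf{r}|} \sum_i \sum_{n=p+1}^{\infty} m_i \left( \frac{|\mathbf{r}'_i|}{|\mathbf{r}|} \right)^{n} P_n(\cos\gamma_i) \right| \le \frac{A}{|\mathbf{r}|(c - 1)} \left( \frac{1}{c} \right)^{p}, \tag{2}$$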

where $A = \sum_i |m_i|$. Hence the value of p required to achieve a given relative precision, $\epsilon = |\mathbf{r}|\,\epsilon_{abs} / A$, may be calculated from

$$p = \lceil -\log_c(\epsilon(c - 1)) \rceil, \tag{3}$$

where c is calculated in terms of the distance to a point within the cluster and the cluster's radius, i.e. $c = |\mathbf{r}| / a$. This is described in full in [13].

As the distance to a cluster of bodies increases, the radius of this multipole expansion may also increase. This idea is effected by delineating the clusters with a hierarchy of cells. The cells are constructed by recursively subdividing the computational domain. In 3 dimensions, the entire domain is enclosed by a cube which is then subdivided into eight equal cubes. This subdivision is performed recursively until there is only a small number of particles per cell. The top level is labelled level 0; hence at any particular level l there are $8^l$ cubes; this is known as an oct-tree. The total number of levels, n, employed by the mesh has a strong influence over the execution time and is dependent on N, p and the distribution of particles [10, 12].

At this point, it is necessary to introduce some terminology which is relevant to both algorithms. At any level of refinement l, a cell x is subdivided into 8 cells, which are located at level l + 1. The 8 cells are known as the children of x; x is known as the parent. Cells which have no children are known as leaf cells. Cells which lie at the same level of refinement as cell x are known as near-field cells, provided that the centre of these cells lies less than a radial distance of 3d away from the centre of cell x, where d is the length of one side of a cube. This gives a maximum of 92 near-field cells. The interaction set of a cell x is defined as those cells outwith the near-field cells, which are the children of the neighbours of x's parent.

For both our N-body solvers, a truncated multipole expansion is created for every cell in the hierarchy that contains at least one particle. The method of determining the value of p required to achieve a given precision is similar to that employed in the FMM.
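Both eqn.(3) and the near-field definition lend themselves to a quick numerical check. The sketch below is illustrative only: the names `terms_required` and `count_near_field` are hypothetical, clamping p at zero is an assumption not stated in the text, and cell offsets are measured in units of the cell side d.

```python
import math


def terms_required(eps, c):
    """Eqn.(3): p = ceil(-log_c(eps * (c - 1))), valid for c > 1.

    The result is clamped at 0 (an assumption) for the loose-tolerance
    case where eps * (c - 1) exceeds 1.
    """
    if c <= 1.0:
        raise ValueError("c must exceed 1")
    return max(0, math.ceil(-math.log(eps * (c - 1.0), c)))


def count_near_field(radius_in_cells=3):
    """Count same-level cells whose centres lie strictly within
    radius_in_cells * d of a given cell's centre, excluding the cell itself."""
    count = 0
    offsets = range(-radius_in_cells, radius_in_cells + 1)
    for i in offsets:
        for j in offsets:
            for k in offsets:
                if (i, j, k) == (0, 0, 0):
                    continue
                if i * i + j * j + k * k < radius_in_cells ** 2:
                    count += 1
    return count


if __name__ == "__main__":
    print(count_near_field())
    for c in (2.0, 3.0, 4.0):
        print(c, terms_required(1e-3, c))
```

Under these assumptions the count is 92, in agreement with the maximum quoted above, and for a fixed `eps` the number of terms falls as c grows, that is, as the cluster becomes more distant relative to its radius.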
In this case the multipole expansions are centred on the geometric centre of each cell in the hierarchy. The BHA, on the other hand, centres the multipole expansion at the cluster's `centre-of-mass'. This latter system is beneficial for problems where the `strengths' of the particles are all positive. When this is the case, the dipole moment is identically zero; thus if only the first term is taken from the multipole expansion, the error behaves as if both the first and the second terms were taken. However, many N-body simulations, such as Vortex Methods in CFD, require both positive and negative `strengths'. Moreover, the error which arises from the use of the latter system is typically much smaller than the prescribed error in practice. This is due to the fact that the least upper bound to this error must be calculated in terms of the worst possible distribution of particles, unlike the FMM [13]. Thus one may predetermine the run-time error with greater control if the multipole expansion is centred on the centre of the cell.

The hierarchy of cells is created by first forming multipole moments for each leaf cell in terms of the particles which lie therein. The Legendre polynomial in eqn.(1) is expanded exactly in terms of spherical harmonics via the Addition Theorem [6]. The multipole moments are given in terms of these spherical harmonics. Multipole moments are not calculated for empty cells. The tree is then traversed to the top, level 0, creating multipole moments for each cell in terms of the multipole moments associated with the 8 child cells. This is performed systematically and efficiently, since the multipole expansions are centred on the cells' centres.
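The remark about the dipole moment can be checked numerically. The following sketch, which is not taken from the paper, computes the monopole and Cartesian dipole moments of a small cluster about two candidate centres; the function names and the sample data are hypothetical.

```python
def monopole_and_dipole(positions, strengths, centre):
    """Return (total strength, dipole vector) of a cluster about `centre`."""
    total = sum(strengths)
    dipole = [
        sum(s * (x[k] - centre[k]) for x, s in zip(positions, strengths))
        for k in range(3)
    ]
    return total, dipole


def centre_of_mass(positions, strengths):
    """Strength-weighted mean position; requires a non-zero total strength."""
    total = sum(strengths)
    return [
        sum(s * x[k] for x, s in zip(positions, strengths)) / total
        for k in range(3)
    ]


if __name__ == "__main__":
    pos = [(0.1, 0.2, 0.3), (0.9, 0.4, 0.5), (0.3, 0.8, 0.7)]
    strength = [1.0, 2.0, 3.0]
    cell_centre = (0.5, 0.5, 0.5)  # geometric centre of a unit cell
    print(monopole_and_dipole(pos, strength, centre_of_mass(pos, strength)))
    print(monopole_and_dipole(pos, strength, cell_centre))
```

For all-positive strengths the dipole about the centre of mass vanishes to rounding error, whereas about the geometric centre it generally does not. `centre_of_mass` divides by the total strength, so it is ill-defined when positive and negative strengths cancel, which is one way to see why a fixed geometric centre suits problems with mixed-sign strengths.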

2.2 Informal description of the O(N log N) solver.

The O(N log N) solver proceeds as follows. Each cell in the tree is considered in turn, starting at the coarsest level, level 0. If the cell is not in the near-field of the cell which contains the particle, then the associated multipole moments are used to approximate the potential. The next level of refinement is then considered, where all cells which are not part of the near-field, and which have not yet been accounted for, will contribute their associated expansions. For any cell x, the set of cells which contribute a potential to particles in cell x is the interaction set of cell x. Thus the interaction set need only be located once per cell, and not once per particle. Once at the finest level, the only particles which have not yet contributed to the potential will be the particles which reside in the near-field leaf cells. The pairwise interactions between these particles are summed directly.

Consider a cell in the hierarchy, cell i say, which is to contribute an approximated potential to a particular particle. We calculate the distance, $r_i$ say, between the particle and the centre of that cell, and the radius of the sphere which circumscribes it, $a_i$ say. By eqn.(3), and since the multipole expansion is centred at the geometric centre of the cell, we require $p_i$ terms to achieve a certain relative precision, $\epsilon$, such that

$$p_i = \lceil -\log_{c_i}(\epsilon(c_i - 1)) \rceil,$$

where $c_i = r_i / a_i$. Thus, the more distant a cell in an interaction set, the fewer terms will be required to achieve a specific precision.

2.3 Informal description of the O(N) solver

The O(N) solver is identical to the O(N log N) solver up to the point where the multipole moments are calculated for every cell in the hierarchy. At this stage, in the case of the O(N) solver, local expansions are created for every cell in the tree (starting at the root cell, level 0). A cell's associated local expansion describes the potential in that cell due to the particles which lie outwith itself and its near-field cells.
The local expansion of cell x is formed from the multipole moments associated with all the cells in the interaction set of cell x, and from the local expansion moments associated with the parent cell of cell x. Local expansions are not computed for empty cells. Once at the finest level, each leaf cell has an associated local expansion, which is then evaluated at each of the particles which lie therein. As with the O(N log N) solver, the `direct' pairwise summation method is used to evaluate the potential due to particles which lie in the near-field leaf cells.

When employing the multipole moments to form the local expansions, only $p_i$ multipole moments are required, where

$$p_i = \lceil -\log_{c_i}(\epsilon(c_i - 1)) \rceil, \tag{4}$$

where $c_i = \ldots - 1$, and $a_i$ is the distance between the centre of the cell associated with the local expansion and the centre of the cell associated with the multipole expansion [10, 13]. Each local expansion has p terms, where $p = \max_i(p_i)$. Note that, for both methods, the same value of p is required to achieve the prescribed precision.

The symmetry inherent in the oct-tree is exploited to reduce the amount of computation. When a local expansion is used to form the local expansions of a cell's 8 child cells, only one local expansion is formed. The remaining 7 local expansions are formed by multiplying the first local expansion by a shifting vector. This technique reduces the operation count for this operation from $O(8p^4)$ to $O(p^4 + 8p^2)$.

Another element of symmetry is exploited. If cell j lies in the interaction set of cell i say, then the opposite is also true; in set notation,

$$\text{cell } i \in \mathrm{int}(\text{cell } j) \Rightarrow \text{cell } j \in \mathrm{int}(\text{cell } i).$$

Thus, once the interaction set cells have been located, their associated local expansions may also be calculated at the same time. This is similar to the pairwise interaction used in the `direct' method and reduces this computation by a factor of order 2.

The inherent symmetry of spherical harmonics is also utilised. If cell $i \in \mathrm{int}(\text{cell } j)$ and cell i lies directly above, or directly below, cell j, then the computation is reduced substantially. Moreover, for both methods presented in this paper, this symmetry is also used to reduce (i) the amount of memory one requires, (ii) the amount of calculation to be performed and (iii) the size of messages to be passed in a parallel implementation, all by a factor of 2 [10].

3 Results and Conclusions

A large number of N-body force evaluations are performed using the two methods, with $N = \{1 \times 10^{k}, 2 \times 10^{k}, 5 \times 10^{k};\ k = 2, 3, 4, 5\}$, for p = 0 (monopole term only), p = 1 (dipole moments), and p = 2, 4, 7, 9 and 12, the latter for $\epsilon = 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}$ and $10^{-5}$ respectively, where $\epsilon$ is the least upper bound to the error which is defined a priori, cf. eqn.(4). Care is taken to ensure that all other parameters, such as tree depth, are optimised with respect to N and $\epsilon$. The resulting execution times were produced on Sun IPC Workstations, and from these `wall-clock' times we are able to determine which of the two methods, the O(N) or O(N log N) method, is faster for a given N and $\epsilon$. The programs are timed from the moment after the locations and strengths of the particles have been read from file, until the time at which the final potential has been evaluated.

Two different distributions were used to compare the two methods: a uniform distribution of particles over a unit cube, and a distribution over the surface of a sphere, where the $\theta$ and $\phi$ of the particles' spherical coordinates are uniformly distributed over $[0, \pi]$ and $[0, 2\pi]$ respectively.

Fig. 2. Execution times using a spherical distribution of particles. Axes: Time (secs) against N, the number of particles.

Figure 2 shows the execution times for the O(N log N) method for p = 2 and p = 7, the `direct' $O(N^2)$ method and the O(N) method for p = 2 and p = 7, using the spherical distribution. The execution times for the same set of parameters, but using the uniform distribution, produce a very similar graph. For both distributions, the O(N) method was substantially slower than the O(N log N)

method for $N \in [10, 500\mathrm{K}]$ and p = 4, 7, 9 and 12. Using the spherical distribution, with p = 0, 1 and 2, the O(N) method became faster than the O(N log N) method when N = 20, 350 and 4500 respectively. When the uniform distribution was employed, with p = 0 and 1, the O(N) method became faster than the O(N log N) method when N = 50 and 400 respectively. For p = 2, the two methods executed in approximately the same time for $N \in [10, 500\mathrm{K}]$.

From these results we have concluded that, for problems characterised by either type of distribution, the O(N) method is faster for problems with low precision and large N, such as some astrophysical simulations, whereas the O(N log N) method is more suited to problems which demand a large N and a high precision, such as the Vortex Methods in CFD, which require a high precision in order to minimise numerically induced instabilities.

3.0.1 Parallelisation.

Consider a MIMD distributed memory machine, using a local domain decomposition, where the computational domain is distributed evenly over the nodes [11]. The two methods discussed in this paper require the same communication routines, and send the same data between the same nodes, i.e. the multipole moments; thus the parallel codes will only differ in a computation which is independent of interprocessor communications. Therefore this manner of parallelisation will not affect the relative performance of the two methods. However, parallelisation will allow for an extended N-space due to the larger memory resource. Moreover, to ensure a balanced load for problems where the distribution is non-uniform, a scattered domain decomposition should be used [2, 14, 16].

References

[1] Anderson, C.R., SIAM J. Sci. Stat. Comput., 13, 4, p923-947, July 1992.
[2] Baden, S.B., Vortex Methods, Lecture Notes in Mathematics, Springer Verlag, p96-119, 1988.
[3] Barnes, J., Hut, P., Nature, 324, p446-449, 1986.
[4] Greengard, L., Gropp, W.L., Parallel Proc. for Sci. Comp., SIAM, p213-222, 1989.
[5] Greengard, L., Rokhlin, V., J. Comp. Phys., 73, p325-348, 1987.
[6] Greengard, L., The Rapid Evaluation of Potential Fields in Particle Systems, MIT Press, 1988.
[7] Leathrum, J.F., Board, J.A., The Parallel Fast Multipole Algorithm in Three Dimensions, Technical Report, Duke University, April 1992.
[8] Lustig, S.R., Cristy, J.J., Pensak, D.A., Materials Research Society Symposium Proceedings Series, 278, Symp. on Comp. Methods in Mat. Sci., April 1992.
[9] Nyland, L.S., Prins, J.F., Reif, J.H., 2nd Symposium on Issues and Obstacles in the Practical Implementation of Parallel Algorithms and the Use of Parallel Machines (DAGS'93), Hanover, N.H., June 1993.
[10] Pringle, G.J., Ph.D. Thesis, Napier University, Edinburgh, 1994.
[11] Pringle, G.J., J. Future Generation of Computing Systems, 104, Jan., 1995.
[12] Pringle, G.J., User Guide to the 3-dimensional Fast Multipole Method, Technical Report, Napier University, Edinburgh, 1994.
[13] Pringle, G.J., Error Analysis of the Multipole Methods, Technical Report, Napier University, Edinburgh, 1994.
[14] Salmon, J.K., Warren, M.S., Winckelmans, G.S., Intl. J. Supercomputer Appl., 8, 2, 1994, (to appear).
[15] Schmidt, K.E., Lee, M.A., J. of Stat. Phys, 63, Nos. 5/6, 1991.
[16] Singh, J.P., Ph.D. Thesis, Stanford University, 1993.
[17] Zhao, F., Johnson, S.L., J. Sci. Stat. Comput., 12, 6, Nov. 1991.