Chapter 1. Comparison of an O(N) and an O(N log N) N-body solver

Gavin J. Pringle

Abstract

In this paper we compare the performance characteristics of two 3-dimensional hierarchical N-body solvers: an O(N) and an O(N log N) solver. We present the execution times for numerous N-body force evaluations using the two methods, with various values of N and ε, where ε is the prescribed error. We find that the O(N log N) method is more suited to problems which demand a high precision and large N. We then consider how parallelisation affects the algorithms' relative performance.

1 Introduction

The N-body problem consists of a collection of N particles each exerting a force upon one another. Each particle is acted upon by the remaining (N − 1) particles, hence the time to compute the forces acting on all N particles is O(N^2). There is a collection of N-body solvers which reduce the time to compute the N-body problem by introducing approximations. In this paper we compare the performance characteristics of two 3-dimensional hierarchical N-body solvers: an O(N) and an O(N log N) solver. Our O(N) method derives from the Greengard-Rokhlin Fast Multipole Method (FMM) [5, 6]. Examples of other O(N) N-body solvers may be found in [17, 1, 10]. The O(N log N) method utilises the same framework as the FMM, but in a manner which is analogous to the Barnes-Hut Algorithm (BHA) [3]. Both the FMM and the BHA are readily parallelised and have been implemented on a wide range of parallel computers [4, 7, 8, 9, 14, 15, 16, 17].

The basic notion behind these algorithms is that a cluster of particles is replaced by a single source, described by a multipole expansion. The force field exerted by the cluster can be approximated by the force field exerted by this multipole source, provided that the distance between the point of evaluation and the cluster is large enough. Moreover, as the distance to the cluster increases, we may increase the radial size of the multipole source. This idea is effected by delineating the clusters with a hierarchy, or 'tree', of cells.
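For reference, the 'direct' O(N^2) evaluation that both hierarchical solvers aim to beat can be sketched in a few lines. This is a minimal illustration and not the paper's code: the function name `direct_potentials` and the sample data are hypothetical, and the sketch evaluates the scalar potential (with G = 1 by default) rather than the force.

```python
import math

def direct_potentials(positions, masses, G=1.0):
    """Potential at each particle due to all the others, by direct pairwise summation."""
    n = len(positions)
    phi = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            # Each pair is visited once; both particles receive a contribution.
            phi[i] -= G * masses[j] / r
            phi[j] -= G * masses[i] / r
    return phi

# Hypothetical three-particle example.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
masses = [1.0, 2.0, 3.0]
phi = direct_potentials(positions, masses)
```

The double loop visits N(N − 1)/2 pairs, which is the quadratic cost referred to above.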
The difference in the order of the two algorithms lies in the manner in which the hierarchy of cells is utilised. In the O(N log N) method, the interaction model may be described as a 'particle-cell interaction' model, that is, the force exerted on each particle is approximated by its interaction with the multipole sources contained in the cell hierarchy. For the O(N) method, however, the interaction model may be described as a 'cell-cell interaction' model. This is not strictly true, as the sources contained in the cell hierarchy do not actually interact, but this nomenclature does elucidate the character of the method. In this case, a local expansion is created for every cell in the hierarchy, in terms of the multipole expansions.

Mathematics Department, Napier University, Edinburgh, EH14 1DJ, Scotland

It would appear that this O(N) method is quicker than the O(N log N) method; however, the execution time also depends strongly on the precision required by the calculation and on other specific implementation details. In this paper we present the execution times for numerous N-body force evaluations using the two methods, with various values of N and ε, where ε is the prescribed error. Care is taken to ensure that all other parameters are optimised with respect to N and ε. These include tree depth and the number of terms taken in each multipole expansion. From the resulting execution times, we are able to determine which of the two methods, the O(N) or O(N log N) method, is faster for a given N and ε. We then consider how parallelisation affects the algorithms' relative performance, as the larger memory resource allows for an extended N-space.

2 Informal description of the two solvers

This section attempts to give a rough illustration of both of the N-body solvers. A fully detailed description of the O(N) solver, and the relevant mathematical operators, may be found in [10]. Both the O(N) method and the O(N log N) method utilise the same hierarchy of multipole expansions, which is described in the following section.

2.1 The hierarchy of multipole expansions

Particles may be grouped together into clusters and represented by a list of coefficients which describes their distribution, namely a multipole expansion. Consider the case of interacting gravitational particles, as shown in figure 1.

[Fig. 1. A cluster of particles, showing the origin O, the particle positions r_i' and the point of evaluation R at position r.]

Suppose we have a cluster of particles of mass $m_i$ at positions $\mathbf{r}_i'$ from the origin. If $|\mathbf{r}| > |\mathbf{r}_i'|$ for all $i$, then the scalar potential, $\Phi$, at a point of evaluation, R, located at position $\mathbf{r}$ is given by

$$ \Phi(R) = -G \sum_i \frac{m_i}{|\mathbf{r} - \mathbf{r}_i'|} = -\frac{G}{|\mathbf{r}|} \sum_i m_i \sum_{n=0}^{\infty} \left( \frac{|\mathbf{r}_i'|}{|\mathbf{r}|} \right)^{n} P_n(\cos\gamma_i), \tag{1} $$

where $P_n(x)$ is the Legendre polynomial of degree n, $\gamma_i$ is the angle subtended between the vectors $\mathbf{r}$ and $\mathbf{r}_i'$, and G is the gravitational constant. We assume that G = 1 without loss of generality. Eqn. (1) describes the potential as a multipole expansion centred about the origin.
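The expansion in eqn. (1) can be checked numerically by truncating the inner sum at order p and comparing against the exact pairwise sum. The sketch below assumes G = 1. The helper names (`legendre`, `exact_potential`, `multipole_potential`) and the sample cluster are hypothetical. It also sums over particles directly rather than forming cluster moments via spherical harmonics, as a real solver would.

```python
import math

def legendre(n_max, x):
    """Return [P_0(x), ..., P_n_max(x)] using Bonnet's recurrence."""
    p = [1.0, x]
    for n in range(1, n_max):
        p.append(((2 * n + 1) * x * p[n] - n * p[n - 1]) / (n + 1))
    return p[: n_max + 1]

def exact_potential(r_vec, sources, masses):
    """Left-hand form of eqn. (1) with G = 1."""
    return -sum(m / math.dist(r_vec, s) for s, m in zip(sources, masses))

def multipole_potential(r_vec, sources, masses, p):
    """Right-hand form of eqn. (1), truncated after the term of order p."""
    r = math.hypot(*r_vec)
    total = 0.0
    for s, m in zip(sources, masses):
        rs = math.hypot(*s)
        if rs == 0.0:
            total += m  # a source at the origin contributes only the n = 0 term
            continue
        cos_gamma = sum(a * b for a, b in zip(r_vec, s)) / (r * rs)
        P = legendre(p, cos_gamma)
        total += m * sum((rs / r) ** n * P[n] for n in range(p + 1))
    return -total / r

# Hypothetical cluster inside a sphere of radius 0.5, evaluated at distance 3.
sources = [(0.3, 0.1, -0.2), (-0.2, 0.4, 0.1), (0.1, -0.3, 0.3)]
masses = [1.0, 2.0, 3.0]
r_vec = (2.0, 1.0, 2.0)
error_p4 = abs(exact_potential(r_vec, sources, masses)
               - multipole_potential(r_vec, sources, masses, 4))
```

With the evaluation point six cluster radii away, a handful of terms already reproduces the exact potential closely.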
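If we ensure that all the particles lie within a sphere of radius a, i.e. $|\mathbf{r}_i'| < a$, and insist that $|\mathbf{r}| = ca$, for some $c > 1$, then $|\mathbf{r}_i'| / |\mathbf{r}| < 1/c$. If p is the highest order retained in the multipole expansion, then it may be shown that the truncation error satisfies

$$ \epsilon_{abs} \le \frac{1}{|\mathbf{r}|} \sum_i \sum_{n=p+1}^{\infty} |m_i| \left( \frac{1}{c} \right)^{n} |P_n(\cos\gamma_i)| \le \frac{A}{|\mathbf{r}|(c-1)} \left( \frac{1}{c} \right)^{p}, \tag{2} $$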

where $A = \sum_i |m_i|$. Hence the value of p required to achieve a given relative precision, $\epsilon = |\mathbf{r}|\,\epsilon_{abs} / A$, may be calculated from

$$ p = \lceil -\log_c(\epsilon(c-1)) \rceil, \tag{3} $$

where c is calculated in terms of the distance to a point within the cluster and the cluster's radius, i.e. $c = |\mathbf{r}| / a$. This is described in full in [13].

As the distance to a cluster of bodies increases, the radius of this multipole expansion may also increase. This idea is effected by delineating the clusters with a hierarchy of cells. The cells are constructed by recursively subdividing the computational domain. In 3 dimensions, the entire domain is enclosed by a cube which is then subdivided into eight equal cubes. This subdivision is performed recursively until there is only a small number of particles per cell. The top level is labelled level 0; hence at any particular level l there are 8^l cubes. This is known as an oct-tree. The total number of levels, n, employed by the mesh has a strong influence over the execution time and is dependent on N, p and the distribution of particles [10, 12].

At this point, it is necessary to introduce some terminology which is relevant to both algorithms. At any level of refinement l, a cell x is subdivided into 8 cells, which are located at level l + 1. The 8 cells are known as the children of x; x is known as the parent. Cells which have no children are known as leaf cells. Cells which lie at the same level of refinement as cell x are known as near-field cells, provided that the centre of these cells lies less than a radial distance of 3d away from the centre of cell x, where d is the length of one side of a cube. This gives a maximum of 92 near-field cells. The interaction set of a cell x is defined as those cells outwith the near-field cells, which are the children of the neighbours of x's parent.

For both our N-body solvers, a truncated multipole expansion is created for every cell in the hierarchy that contains at least one particle. The method of determining the value of p required to achieve a given precision is similar to that employed in the FMM.
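The figure of 92 near-field cells quoted in the terminology above can be reproduced by counting the same-level cells whose centres lie strictly within 3d of a given cell's centre. In the sketch below the function names are hypothetical and cell offsets are measured in units of the side length d.

```python
def near_field_offsets(radius_in_cells=3):
    """Integer offsets (i, j, k) of same-level cells whose centres lie
    strictly within radius_in_cells * d of the centre of the cell at (0, 0, 0)."""
    offsets = []
    span = range(-radius_in_cells, radius_in_cells + 1)
    for i in span:
        for j in span:
            for k in span:
                inside = i * i + j * j + k * k < radius_in_cells ** 2
                if inside and (i, j, k) != (0, 0, 0):
                    offsets.append((i, j, k))
    return offsets

def cells_at_level(level):
    """Number of cubes at a given level of the oct-tree (level 0 is the root)."""
    return 8 ** level

near_field_count = len(near_field_offsets())
```

The count is a maximum because cells close to the boundary of the computational domain have fewer neighbours.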
In our solvers, the multipole expansions are centred on the geometric centre of each cell in the hierarchy. The BHA, on the other hand, centres the multipole expansion at the cluster's 'centre-of-mass'. This latter system is beneficial for problems where the 'strengths' of the particles are all positive. When this is the case, the dipole moment is identically zero, thus if only the first term is taken from the multipole expansion, the error behaves as if both the first and the second terms were taken. However, many N-body simulations, such as Vortex Methods in CFD, require both positive and negative 'strengths'. Moreover, the error which arises from the use of the latter system is typically much smaller than the prescribed error in practice. This is due to the fact that the least upper bound to this error must be calculated in terms of the worst possible distribution of particles, unlike the FMM [13]. Thus one may predetermine the run-time error with greater control if the multipole expansion is centred on the centre of the cell.

The hierarchy of cells is created by first forming multipole moments for each leaf cell in terms of the particles which lie therein. The Legendre polynomial in eqn. (1) is expanded exactly in terms of spherical harmonics via the Addition Theorem [6]. The multipole moments are given in terms of these spherical harmonics. Multipole moments are not calculated for empty cells. The tree is then traversed to the top, level 0, creating multipole moments for each cell in terms of the multipole moments associated with the 8 child cells. This is performed systematically and efficiently, since the multipole expansions are centred on the cells' centres.
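The remark about the vanishing dipole moment can be illustrated with Cartesian moments. The sketch below is an assumption-laden illustration: it uses the Cartesian dipole vector instead of the spherical-harmonic moments used by the solvers, and the names and sample data are hypothetical.

```python
def centre_of_mass(positions, strengths):
    """Strength-weighted mean position; requires a non-zero total strength."""
    total = sum(strengths)
    return tuple(
        sum(q * x[k] for x, q in zip(positions, strengths)) / total
        for k in range(3)
    )

def dipole_moment(positions, strengths, centre):
    """Cartesian dipole moment of the cluster about the chosen expansion centre."""
    return tuple(
        sum(q * (x[k] - centre[k]) for x, q in zip(positions, strengths))
        for k in range(3)
    )

# Hypothetical cluster in the unit cube, all strengths positive.
positions = [(0.1, 0.2, 0.3), (0.7, 0.4, 0.9), (0.5, 0.8, 0.2)]
strengths = [1.0, 2.0, 3.0]

com = centre_of_mass(positions, strengths)
d_com = dipole_moment(positions, strengths, com)                 # vanishes up to rounding
d_centre = dipole_moment(positions, strengths, (0.5, 0.5, 0.5))  # generally non-zero
```

If the strengths have mixed signs and sum to zero, `centre_of_mass` divides by zero, which is one way of seeing why centring on the centre of mass is awkward for such problems.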

2.2 Informal description of the O(N log N) solver

The O(N log N) solver proceeds as follows. Each cell in the tree is considered in turn, starting at the coarsest level, level 0. If the cell is not in the near-field of the cell which contains the particle, then the associated multipole moments are used to approximate the potential. The next level of refinement is then considered, where all cells which are not part of the near-field, and which have not yet been accounted for, will contribute their associated expansions. For any cell x, the set of cells which contribute a potential to particles in cell x is the interaction set of cell x. Thus the interaction set need only be located once per cell, and not once per particle. Once at the finest level, the only particles which have not yet contributed to the potential will be the particles which reside in the near-field leaf cells. The pairwise interactions between these particles are summed directly.

Consider a cell in the hierarchy, cell i say, which is to contribute an approximated potential to a particular particle. We calculate the distance, $r_i$ say, between the particle and the centre of that cell, and the radius of the sphere which circumscribes it, $a_i$ say. By eqn. (3), and since the multipole expansion is centred at the geometric centre of the cell, we require $p_i$ terms to achieve a certain relative precision, $\epsilon$, such that

$$ p_i = \lceil -\log_{c_i}(\epsilon(c_i - 1)) \rceil, $$

where $c_i = r_i / a_i$. Thus, the more distant a cell in an interaction set, the fewer terms will be required to achieve a specific precision.

2.3 Informal description of the O(N) solver

The O(N) solver is identical to the O(N log N) solver up to the point where the multipole moments are calculated for every cell in the hierarchy. At this stage, in the case of the O(N) solver, local expansions are created for every cell in the tree (starting at the root cell, level 0). A cell's associated local expansion describes the potential in that cell due to the particles which lie outwith itself and its near-field cells.
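Returning briefly to the term-count rule of section 2.2, the dependence of p on the separation ratio c is easy to tabulate. The function name `terms_required` is hypothetical, and clamping the result at zero (monopole term only) when ε(c − 1) ≥ 1 is an assumption of this sketch rather than something stated in the text.

```python
import math

def terms_required(eps, c):
    """Order p = ceil(-log_c(eps * (c - 1))) needed for relative precision eps,
    where c > 1 is the ratio of the distance to the cluster's radius."""
    if c <= 1.0:
        raise ValueError("the expansion only converges for c > 1")
    p = math.ceil(-math.log(eps * (c - 1.0), c))
    # Assumption: never return fewer than zero terms beyond the monopole.
    return max(0, p)

# Fewer terms are needed as the cell becomes more distant (larger c).
orders = [terms_required(1e-3, c) for c in (2.0, 4.0, 10.0)]
```

For ε = 10^-3 this gives 10, 5 and 3 terms at c = 2, 4 and 10 respectively, in line with the observation that distant cells need shorter expansions.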
The local expansion of cell x is formed from the multipole moments associated with all the cells in the interaction set of cell x, and from the local expansion moments associated with the parent cell of cell x. Local expansions are not computed for empty cells. Once at the finest level, each leaf cell has an associated local expansion, which is then evaluated at each of the particles which lie therein. As with the O(N log N) solver, the 'direct' pairwise summation method is used to evaluate the potential due to particles which lie in the near-field leaf cells.

When employing the multipole moments to form the local expansions, only $p_i$ multipole moments are required, where

$$ p_i = \lceil -\log_{c_i}(\epsilon(c_i - 1)) \rceil, \tag{4} $$

where $c_i = [\ldots] - 1$, and $a_i$ is the distance between the centre of the cell associated with the local expansion, and the centre of the cell associated with the multipole expansion [10, 13]. Each local expansion has p terms, where $p = \max_i(p_i)$. Note that, for both methods, the same value of p is required to achieve the prescribed precision.

The symmetry inherent in the oct-tree is exploited to reduce the amount of computation. When a local expansion is used to form the local expansions of a cell's 8 child cells, only one local expansion is formed. The remaining 7 local expansions are formed by multiplying the first local expansion by a shifting vector. This technique reduces the operation count for this operation from O(8p^4) to O(p^4 + 8p^2).

Another element of symmetry is exploited. If cell j lies in the interaction set of cell i say, then the opposite is also true; in set notation,

$$ \text{cell } i \in \mathrm{int}(\text{cell } j) \;\Rightarrow\; \text{cell } j \in \mathrm{int}(\text{cell } i). $$

Thus, once the interaction set cells have been located, their associated local expansions may also be calculated at the same time. This is similar to the pairwise interaction used in the 'direct' method and reduces this computation by a factor of order 2. The inherent symmetry of spherical harmonics is also utilised. If cell i ∈ int(cell j) and cell i lies directly above, or directly below, cell j, then the computation is reduced substantially. Moreover, for both methods presented in this paper, this symmetry is also used to reduce (i) the amount of memory one requires, (ii) the amount of calculation to be performed and (iii) the size of messages to be passed in a parallel implementation, all by a factor of 2 [10].

3 Results and Conclusions

A large number of N-body force evaluations are performed using the two methods, with N = {1 × 10^i, 2 × 10^i, 5 × 10^i; i = 2, 3, 4, 5} for p = 0 (monopole term only), p = 1 (dipole moments), and p = 2, 4, 7, 9 and 12 for ε = 10^-1, 10^-2, 10^-3, 10^-4 and 10^-5 respectively, where ε is the least upper bound to the error which is defined a priori, cf. eqn. (4). Care is taken to ensure that all other parameters, such as tree depth, are optimised with respect to N and ε. The resulting execution times were produced on Sun IPC Workstations, and from these 'wall-clock' times we are able to determine which of the two methods, the O(N) or O(N log N) method, is faster for a given N and ε. The programs are timed from the moment after the locations and strengths of the particles have been read from file, until the time at which the final potential has been evaluated.

Two different distributions were used to compare the two methods: a uniform distribution of particles over a unit cube, and a distribution over the surface of a sphere, where the θ and φ of the particles' spherical coordinates are uniformly distributed over [0, π] and [0, 2π] respectively.

[Fig. 2. Execution times using a spherical distribution of particles. Horizontal axis: N, the number of particles (1000 to 100000). Vertical axis: time in seconds (10 to 10000). Curves: O(N log N) with p = 2 and p = 7; the O(N^2) 'direct' method; O(N) with p = 2 and p = 7.]

Figure 2 shows the execution times for the O(N log N) method for p = 2 and p = 7, the 'direct' O(N^2) method and the O(N) method for p = 2 and p = 7, using the spherical distribution. The execution times for the same set of parameters, but using the uniform distribution, produce a very similar graph.

For both distributions, the O(N) method was substantially slower than the O(N log N) method for N ∈ [10, 500K] and p = 4, 7, 9 and 12. Using the spherical distribution, with p = 0, 1 and 2, the O(N) method became faster than the O(N log N) method when N = 20, 350 and 4500 respectively. When the uniform distribution was employed, with p = 0 and 1, the O(N) method became faster than the O(N log N) method when N = 50 and 400 respectively. For p = 2, the two methods executed in approximately the same time for N ∈ [10, 500K].

From these results we have concluded that, for problems characterised by either type of distribution, the O(N) method is faster for problems with low precision and large N, such as some astrophysical simulations, whereas the O(N log N) method is more suited to problems which demand a large N and a high precision, such as the Vortex Methods in CFD, which require a high precision in order to minimise numerically induced instabilities.

3.0.1 Parallelisation

Consider a MIMD distributed memory machine, using a local domain decomposition, where the computational domain is distributed evenly over the nodes [11]. The two methods discussed in this paper require the same communication routines, and send the same data between the same nodes, i.e. the multipole moments, thus the parallel codes will only differ in a computation which is independent of interprocessor communications. Therefore this manner of parallelisation will not affect the relative performance of the two methods. However, parallelisation will allow for an extended N-space due to the larger memory resource. Moreover, to ensure a balanced load for problems where the distribution is non-uniform, a scattered domain decomposition should be used [2, 14, 16].

References

[1] Anderson, C.R., SIAM J. Sci. Stat. Comput., 13, 4, p923-947, July 1992.
[2] Baden, S.B., Vortex Methods, Lecture Notes in Mathematics, Springer Verlag, p96-119, 1988.
[3] Barnes, J., Hut, P., Nature, 324, p446-449, 1986.
[4] Greengard, L., Gropp, W.L., Parallel Proc. for Sci. Comp., SIAM, p213-222, 1989.
[5] Greengard, L., Rokhlin, V., J. Comp. Phys., 73, p325-348, 1987.
[6] Greengard, L., The Rapid Evaluation of Potential Fields in Particle Systems, MIT Press, 1988.
[7] Leathrum, J.F., Board, J.A., The Parallel Fast Multipole Algorithm in Three Dimensions, Technical Report, Duke University, April 1992.
[8] Lustig, S.R., Cristy, J.J., Pensak, D.A., Materials Research Society Symposium Proceedings Series, 278, Symp. on Comp. Methods in Mat. Sci., April 1992.
[9] Nyland, L.S., Prins, J.F., Reif, J.H., 2nd Symposium on Issues and Obstacles in the Practical Implementation of Parallel Algorithms and the Use of Parallel Machines (DAGS'93), Hanover, N.H., June 1993.
[10] Pringle, G.J., Ph.D. Thesis, Napier University, Edinburgh, 1994.
[11] Pringle, G.J., J. Future Generation of Computing Systems, 104, Jan., 1995.
[12] Pringle, G.J., User Guide to the 3-dimensional Fast Multipole Method, Technical Report, Napier University, Edinburgh, 1994.
[13] Pringle, G.J., Error Analysis of the Multipole Methods, Technical Report, Napier University, Edinburgh, 1994.
[14] Salmon, J.K., Warren, M.S., Winckelmans, G.S., Intl. J. Supercomputer Appl., 8, 2, 1994, (to appear).
[15] Schmidt, K.E., Lee, M.A., J. of Stat. Phys., 63, Nos. 5/6, 1991.
[16] Singh, J.P., Ph.D. Thesis, Stanford University, 1993.
[17] Zhao, F., Johnson, S.L., J. Sci. Stat. Comput., 12, 6, Nov. 1991.