Parallel Implementation of 3D FMA using MPI


Eric Jui-Lin Lu and Daniel I. Okunbor
Computer Science Department, University of Missouri - Rolla, Rolla, MO
jlu@umr.edu, okunbor@umr.edu

This work was supported in part by an NSF CCR grant. Experiments were conducted using the resources of the Cornell Theory Center.

Abstract

The simulation of N-body systems has been used extensively in biophysics and chemistry to investigate the dynamics of biomolecules and in astrophysics to study the chaotic characteristics of the galactic system. However, the long-range force calculation has a time complexity of O(N^2), where N is the number of particles in the system. The fast multipole algorithm (FMA), proposed by Greengard and Rokhlin, reduces the time complexity to O(N). Our goal is to build a parallel FMA library which is portable, scalable, and efficient. We use the Message Passing Interface as the communication back-end. An effective communication scheme that reduces communication overhead and a partitioning technique that yields good load balancing among processors were also implemented in the library.

1 Introduction

We consider the simulation of N-body systems using the fast multipole algorithm (FMA) on homogeneous and heterogeneous distributed parallel systems. The simulation of N-body systems has a wide range of applications and has been used extensively in biophysics and chemistry to investigate the dynamics of biomolecules [4] and in astrophysics to study the chaotic characteristics of the galactic system [3]. N-body simulations typically require long time integrations, and every time step during the simulation involves force calculations among particles. The execution time of the force calculation contributes ninety percent of the total simulation time during a single time step. This is due mainly to the calculation of long-range forces, such as the Coulombic and Keplerian forces, which have a time complexity of O(N^2), where N is the number of particles in the system.

Many algorithms have been proposed to reduce the time complexity to O(N^{3/2}), O(N log N), and O(N) [3, 9, 18, 2]. The fast multipole algorithm [8, 6], proposed by Greengard and Rokhlin, exploits multipole expansions and local expansions to collectively represent the interaction force between a particle within a cluster and particles that are relatively distant from this cluster. The time complexity of the fast multipole algorithm is O(N). A tremendous amount of work has been devoted to the parallelization of the fast multipole algorithm. Greengard and Gropp [7] presented a parallel version of FMA in two dimensions (2D). Board et al. [11] have done a great deal of work on the parallelization of FMA in three dimensions (3D).

Our goal is to develop a parallel FMA library which is portable, scalable, and efficient. To assure portability, the Message Passing Interface (MPI) is used as the back-end communication library of the parallel FMA library MPIFMA. Since only primitive communication functions such as point-to-point communication and broadcast are used, it is easy to port MPIFMA to other communication libraries. For efficiency, we implemented the optimum communication scheme proposed in [13]. The advantages of the optimum communication scheme are that (1) a minimum number of messages is communicated among processors and (2) computation and communication are overlapped as much as possible. Our preliminary results are remarkable.
The force calculation for one million particles took approximately 2 minutes per time step on an IBM SP2 with 64 nodes. To assure that the workload is equally distributed among processors when the particle system is irregular, we incorporated the weighted subtrees technique proposed in [12] into the library. The weighted subtrees technique is designed for parallel FMA in a message-passing environment. The advantages of this technique are that (1) the communication among processors is simpler than with ORB [16] and costzones [17], and (2) the communication overhead in the upward phase of FMA can be totally eliminated when the roots of the subtrees are at level 2. Our results on the Intel iPSC/860 show that (1) the performance of parallel FMA with the weighted subtrees scheme is approximately 3 times faster than the conventional parallel FMA on the benchmark irregular particle system, and (2) the load is about equally distributed among processors.
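As a baseline for both cost and accuracy (this is also what the -d option of MPIFMA, described in Section 3, computes for comparison), the direct O(N^2) evaluation can be written in a few lines. The sketch below is our own minimal illustration with placeholder positions and charges, not code from PMTA or MPIFMA:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Direct O(N^2) Coulomb force evaluation:
 * F_i = sum_{j != i} q_i q_j (r_i - r_j) / |r_i - r_j|^3.  Every pair of
 * particles is visited, which is what the fast multipole algorithm avoids. */
int main(void) {
    const int N = 2000;
    double *x = malloc(3 * N * sizeof(double));   /* positions */
    double *q = malloc(N * sizeof(double));       /* charges   */
    double *f = calloc(3 * N, sizeof(double));    /* forces    */

    srand(12345);                                 /* placeholder seed */
    for (int i = 0; i < N; i++) {
        for (int d = 0; d < 3; d++)
            x[3 * i + d] = (double)rand() / RAND_MAX - 0.5;   /* [-0.5, 0.5] */
        q[i] = (i % 2) ? 1.0 : -1.0;
    }

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            if (j == i) continue;
            double dx = x[3*i]   - x[3*j];
            double dy = x[3*i+1] - x[3*j+1];
            double dz = x[3*i+2] - x[3*j+2];
            double r2 = dx*dx + dy*dy + dz*dz;
            double s  = q[i] * q[j] / (r2 * sqrt(r2));
            f[3*i]   += s * dx;
            f[3*i+1] += s * dy;
            f[3*i+2] += s * dz;
        }

    printf("force on particle 0: (%g, %g, %g)\n", f[0], f[1], f[2]);
    free(x); free(q); free(f);
    return 0;
}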

Figure 1. The 2D tree structure. Figure 2. The 3D tree structure.

This paper is organized as follows. In Section 2, we describe the fast multipole algorithm. For the mathematical foundation of this algorithm, refer to [8, 6, 5, 10]. Section 3 describes the implementation of MPIFMA in detail; partial results are also presented. Possible enhancements of MPIFMA and the goal of this project are listed in the last section.

2 Fast multipole algorithm

The fast multipole algorithm recursively divides a computational domain into 8 subdomains in 3D (4 subdomains in 2D) until either the specified tree level is reached or a domain contains one or no particle. A domain which is divided into subdomains is called the parent domain of the subdomains, and each subdomain is called a child domain of its parent. The 2D and 3D FMA trees of height 3 are depicted in Figures 1 and 2, respectively. The fast multipole algorithm then utilizes the tree structure to compute multipole expansions at the finest level t-1 in step 1 and to shift the multipole expansions of the child domains to form the multipole expansion of their parent domain from level t-2 to level 2 in step 2. In steps 3 and 4 (i.e., the downward phase), local expansions are then computed from (1) the multipole expansions, obtained in the upward phase (i.e., steps 1 and 2), of the boxes in the interaction list and (2) the local expansions shifted from the upper level. The following is the sequential version of the fast multipole algorithm on which our parallel algorithm is based.

Initialization: Assign the number of levels of the tree, denoted t, and the number of terms in the multipole and local expansions, denoted p. Distribute each particle into its corresponding box and construct the interaction list of each box. By Greengard's definition, the interaction list of box i at level l is the set of boxes which are children of the nearest and second nearest neighbors of i's parent and which are not nearest or second nearest neighbors of i (a small code illustration of this definition follows the step listing below).

Step 1: Compute multipole expansion coefficients (MECs) for each box at the finest level t-1. The computational domain at the l-th level of the tree has 8^l boxes.

Step 2: For each level l from t-2 down to 2, form the MECs of each box by shifting the MECs of its child boxes at level l+1.

Step 3: For each level l from 2 to t-2, (1) construct local expansion coefficients (LECs) for each box i from the MECs of the boxes in the interaction list of box i plus the LECs shifted from the parent box of box i (at level 2, the LECs from the parent level are assumed to be zero); and (2) shift the LECs of box i into its child boxes.

Step 4: For each box at the finest level, compute its LECs from the MECs of the boxes in its interaction list and add the LECs shifted from its parent box.

Step 5: Compute the far-field force for each particle by inserting the particle's information into the local expansion.
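The interaction-list definition in the initialization step above is easy to state in terms of box indices on the uniform grid of a given level. The following self-contained C sketch is our own illustration (the index ordering and helper names are placeholders, not MPIFMA code); for an interior box it produces 875 entries, and fewer near the domain boundary:

#include <stdio.h>
#include <stdlib.h>

/* Chebyshev (max-norm) distance between two box indices. */
static int cheb(int ax, int ay, int az, int bx, int by, int bz) {
    int dx = abs(ax - bx), dy = abs(ay - by), dz = abs(az - bz);
    int m = dx > dy ? dx : dy;
    return m > dz ? m : dz;
}

/* Fill list[] with the linear indices of the interaction list of box
 * (ix,iy,iz) on a level with nb boxes per dimension; returns the count.
 * A box j qualifies if its parent is a nearest or second nearest neighbor
 * of i's parent (parent Chebyshev distance <= 2) while j itself is not a
 * nearest or second nearest neighbor of i (Chebyshev distance > 2). */
static int interaction_list(int ix, int iy, int iz, int nb, int *list) {
    int px = ix / 2, py = iy / 2, pz = iz / 2;   /* parent box index */
    int n = 0;
    for (int qx = px - 2; qx <= px + 2; qx++)
    for (int qy = py - 2; qy <= py + 2; qy++)
    for (int qz = pz - 2; qz <= pz + 2; qz++) {              /* candidate parents */
        if (qx < 0 || qy < 0 || qz < 0 ||
            qx >= nb / 2 || qy >= nb / 2 || qz >= nb / 2) continue;
        for (int cx = 0; cx < 2; cx++)                       /* their 8 children */
        for (int cy = 0; cy < 2; cy++)
        for (int cz = 0; cz < 2; cz++) {
            int jx = 2 * qx + cx, jy = 2 * qy + cy, jz = 2 * qz + cz;
            if (cheb(ix, iy, iz, jx, jy, jz) <= 2) continue; /* too close to i */
            list[n++] = jx + nb * (jy + nb * jz);
        }
    }
    return n;
}

int main(void) {
    int list[1000];
    /* Level 4 of the tree has 16 boxes per dimension (8^4 boxes in total). */
    printf("interior box: %d boxes\n", interaction_list(8, 8, 8, 16, list)); /* 875 */
    printf("corner box:   %d boxes\n", interaction_list(0, 0, 0, 16, list)); /* 189 */
    return 0;
}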

Step 6: For each particle j, compute the near-field force due to the particles in the same box as j and the particles in its neighbor boxes. In 2D, the neighbor boxes of box b are the boxes which share a boundary point with box b. In 3D, the neighbor boxes of box b are the boxes which share a boundary point with box b (called nearest neighbors) and those which share a boundary point with the nearest neighbors (called second nearest neighbors).

Step 7: For each particle, sum up the near-field and far-field forces obtained in steps 5 and 6.

3 The MPIFMA library

The MPIFMA library is based on the sequential code of PMTA version 4.0 developed at Duke University and uses MPICH, developed by Argonne National Laboratory and Mississippi State University, as the back-end communication library. Our goal is to develop a parallel FMA library that is portable, scalable, and efficient. The current implementation of MPIFMA includes a sample main program, which provides the user interface and input checking, and a set of function calls that compute the long-range forces using the fast multipole algorithm. The program can be run as

mpifma -n <number of particles> <seed> [-b] [-d] -p <number of terms> -l <length of comp. domain> -t <height of FMA tree>

where mpifma is the name of the executable program, -p denotes the number of terms used in MEs and LEs, -l denotes the length of the computational domain in three dimensions, and -t denotes the height of the FMA tree. The -d argument is optional; when -d is specified, the direct force computation among all particles is also executed. This is used to measure the accuracy of the approximate forces calculated by FMA, and the force and potential errors are printed when -d is specified. The current implementation of MPIFMA randomly generates particle positions and charges based on the seed provided with -n. Reading particle information from a file can easily be incorporated into the library and will be implemented in the future. MPIFMA generates two types of particle systems. By default, mpifma generates particle positions within -0.5 and 0.5. On the other hand, if IRREGULAR is specified in the Makefile, particle positions O are randomly generated between -0.5 and 0.5 with the restriction that all coordinates of a particle are either positive (0 <= O <= 0.5) or negative (-0.5 <= O < 0). The option -b enables the weighted subtrees partitioning technique, providing better load balancing among processors when the particle system is not uniform.

Figure 3. (a) The processor mapping for 4 processors in 2D. (b) The processor mapping with ghost cells for 4 processors in 2D.

3.1 Parallel Domain Decomposition

In the initialization step of FMA, all boxes at the finest level of the FMA tree are mapped to processors. By using parallel domain decomposition, processors receive the same number of boxes at the finest level together with their ancestors at the higher levels. A box-processor mapping for 4 processors in 2D is shown in Figure 3(a). The advantages of the parallel domain decomposition are that (1) the computation load among processors is about equal when the particle system is uniform (or close to uniform) and (2) the computation of MEs in steps 1 and 2 can be done without communication among processors. The latter is due to the fact that all information required for the calculation of MEs exists locally within the processor. A small illustration of the particle binning and box-to-processor assignment is given below.
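The self-contained C sketch below bins randomly generated particles (uniform in [-0.5, 0.5], as in the default particle system above) into finest-level boxes and assigns each box to a processor through the level-2 subtree that contains it. The row-major subtree ordering and the block mapping of subtrees to processors are our own simplifications for illustration, not the actual MPIFMA mapping:

#include <stdio.h>
#include <stdlib.h>

/* Bin N random particles into the boxes of the finest level (level t-1) of a
 * 3D FMA tree of height t, and assign each box to a processor by contiguous
 * blocks of level-2 subtrees (64 subtrees in 3D).  Illustrative only. */
enum { t = 4, P = 4, N = 100000 };

int main(void) {
    const int nb = 1 << (t - 1);     /* boxes per dimension at level t-1      */
    const int shift = t - 3;         /* finest-level index -> level-2 index   */
    long owned[P];
    for (int p = 0; p < P; p++) owned[p] = 0;

    srand(12345);                    /* fixed seed (MPIFMA takes a seed with -n) */
    for (int i = 0; i < N; i++) {
        double x = (double)rand() / RAND_MAX - 0.5;   /* uniform in [-0.5, 0.5] */
        double y = (double)rand() / RAND_MAX - 0.5;
        double z = (double)rand() / RAND_MAX - 0.5;

        int ix = (int)((x + 0.5) * nb);  if (ix == nb) ix = nb - 1;
        int iy = (int)((y + 0.5) * nb);  if (iy == nb) iy = nb - 1;
        int iz = (int)((z + 0.5) * nb);  if (iz == nb) iz = nb - 1;

        /* Level-2 subtree containing this box (4 level-2 boxes per dimension). */
        int subtree = (ix >> shift) + 4 * ((iy >> shift) + 4 * (iz >> shift));
        int proc = subtree * P / 64;     /* contiguous blocks of subtrees */
        owned[proc]++;
    }

    for (int p = 0; p < P; p++)
        printf("processor %d owns %ld particles\n", p, owned[p]);
    return 0;
}

With this uniform distribution the printed per-processor counts are nearly equal. With the IRREGULAR distribution described above, only a small fraction of the level-2 subtrees contain particles, so only a few processors end up with any work; this is the problem addressed by the weighted subtrees technique in Section 3.3.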
As shown in step 6, the computation of the near-field force on particle i requires knowledge of the particles in the box containing particle i and of the particles in the neighbor boxes. In other words, processor 1 in Figure 3(a) requires the information of the shaded boxes from other processors to compute the near-field force, which results in communication among processors. By loading extra ghost boxes (the shaded boxes shown in Figure 3(b)), each processor can compute the near-field force in step 6 without communicating with other processors (a simplified illustration of this exchange is given below).
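Loading the ghost boxes amounts to a one-time halo exchange before step 6. The MPI program below is a deliberately simplified, self-contained illustration of such an exchange for a 1D row of boxes with a halo of width 2 (mirroring the nearest and second nearest neighbors used in 3D); the data layout and decomposition are placeholders, not the MPIFMA data structures:

#include <mpi.h>
#include <stdio.h>

#define LOCAL 8   /* finest-level boxes owned by each rank (1D toy example) */
#define HALO  2   /* ghost width: nearest and second nearest neighbors      */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* box[HALO .. HALO+LOCAL-1] are owned boxes; the ends are ghost boxes.
     * Here each "box" is just one double (e.g., its total charge). */
    double box[LOCAL + 2 * HALO];
    for (int i = 0; i < LOCAL + 2 * HALO; i++) box[i] = 0.0;
    for (int i = 0; i < LOCAL; i++) box[HALO + i] = rank * 100.0 + i;

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my leftmost owned boxes to the left neighbor while receiving the
     * right neighbor's leftmost owned boxes into my right ghost region;
     * then the mirror exchange for the other side. */
    MPI_Sendrecv(&box[HALO], HALO, MPI_DOUBLE, left, 0,
                 &box[LOCAL + HALO], HALO, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&box[LOCAL], HALO, MPI_DOUBLE, right, 1,
                 &box[0], HALO, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Near-field work (step 6) can now read box[0 .. LOCAL+2*HALO-1] locally. */
    printf("rank %d: left ghosts %.0f %.0f, right ghosts %.0f %.0f\n",
           rank, box[0], box[1], box[LOCAL + HALO], box[LOCAL + HALO + 1]);

    MPI_Finalize();
    return 0;
}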

Figure 4. Butterfly communication scheme.

Table 1. Running time using the optimal scheme on the iPSC/860 (columns: number of processors, total run time, ME2LE time, near-field force time, communication cost, and efficiency; the numeric entries are not reproduced here).

Table 2. Running time using the broadcast scheme on the iPSC/860 (same columns as Table 1; the numeric entries are not reproduced here).

Parallel domain decomposition has a serious load-balancing problem when particles are not distributed uniformly. The weighted subtrees technique presented later is adopted to overcome this problem.

3.2 Communication and Synchronization Overhead

The conversion of MEs to LEs in steps 3 and 4 requires communication among processors, and this can be accomplished by using either broadcast or a butterfly switch. Broadcast is easy to implement but expensive. The butterfly switch communication scheme [1], shown in Figure 4, is better than broadcast and has a time complexity of O(log_2 P), where P is the number of processors. However, it requires synchronization among processors at each stage, which is expensive when the number of processors is large. To minimize the communication and synchronization overhead, MEs are transmitted asynchronously right after the computation in steps 1 and 2, and then received for the computation of LEs in steps 3 and 4. This communication scheme overlaps computation and communication. MPIFMA also ensures that a minimum number of messages is communicated among processors. This is based on the fact that if box j is in the interaction list of box i, then box i is also in the interaction list of box j. In this case, processor P_i, which contains box i, sends box i's ME directly to processor P_j and vice versa, as long as P_i != P_j. Therefore, for each box i, one can build a list of processors to which box i's MEs will be sent. Since only the required MEs are transmitted, redundancies are eliminated, bringing about minimal transmission. The details can be found in [13]. A sketch of the overlapped, nonblocking exchange is given at the end of this subsection.

Experiments were conducted on an Intel iPSC/860 system with 1, 2, 4, 8, and 16 nodes. The benchmark system is a 3D particle system with particle positions O randomly generated between -0.5 and 0.5. The height of the FMA tree is 4 and the number of terms used in the multipole and local expansions is 8. The running results are listed in Tables 1 and 2, respectively. As shown in the tables, the total force calculation time (excluding initialization time) is dominated by the conversion from multipole expansions to local expansions (ME2LE) and the computation of the near-field force. From the ME2LE column in Tables 1 and 2, the conversion time is reduced by close to a half when the number of processors is doubled. This indicates that the parallel domain decomposition is load balanced in a uniform system. The communication time (see footnote 1) using the optimal communication scheme grows only slightly when the number of processors is increased, an indication that communication and computation were overlapped successfully.

We also ran MPIFMA on an IBM SP2 with 64 nodes. For a system of one million particles, the force calculation took less than 2 minutes per time step using the optimal communication scheme, while it took more than 14 minutes using the broadcast scheme. The communication overhead of the optimal communication scheme is approximately 12 seconds, while that of the broadcast scheme is about 350 seconds. Detailed timing results are summarized in Table 3.
Footnote 1: The communication time includes the time to pack data into a buffer, send, receive, unpack the buffer, and assign it to the proper data structures.
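To make the overlap concrete, the self-contained MPI program below sketches the pattern described above: each rank posts nonblocking receives, sends its multipole coefficients asynchronously to a symmetric list of destination ranks, performs local work while the messages are in flight, and only then waits for completion. The ring-shaped destination list and the flat coefficient buffer are placeholder simplifications; in MPIFMA the destinations are derived per box from the interaction lists, as explained above:

#include <mpi.h>
#include <stdio.h>

#define NTERMS 64   /* stand-in for the packed multipole coefficients of a box */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Symmetric destination list (placeholder): my ring neighbors.  Because
     * the list is symmetric, every destination is also a source. */
    int ndest = 0, dest[2];
    if (size > 1) dest[ndest++] = (rank + 1) % size;
    if (size > 2) dest[ndest++] = (rank + size - 1) % size;

    double my_me[NTERMS], recv_me[2][NTERMS];
    for (int k = 0; k < NTERMS; k++) my_me[k] = rank + 0.001 * k; /* "steps 1-2" */

    MPI_Request req[4];
    int nreq = 0;

    /* Post receives first, then send the MEs asynchronously right after the
     * upward pass: exactly one message per (buffer, destination) pair. */
    for (int d = 0; d < ndest; d++)
        MPI_Irecv(recv_me[d], NTERMS, MPI_DOUBLE, dest[d], 0,
                  MPI_COMM_WORLD, &req[nreq++]);
    for (int d = 0; d < ndest; d++)
        MPI_Isend(my_me, NTERMS, MPI_DOUBLE, dest[d], 0,
                  MPI_COMM_WORLD, &req[nreq++]);

    /* Local work that needs no remote data (e.g., ME2LE conversions among
     * locally owned boxes) proceeds while the messages are in flight. */
    double local_sum = 0.0;
    for (int k = 0; k < NTERMS; k++) local_sum += my_me[k];

    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

    /* Remote MEs are now available for the remaining ME2LE conversions. */
    printf("rank %d: local_sum=%.3f, first remote ME=%.3f\n",
           rank, local_sum, ndest > 0 ? recv_me[0][0] : 0.0);

    MPI_Finalize();
    return 0;
}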

Table 3. One-million-particle simulation on a 64-node SP2 with p = 8 and t = 6 (total run time and communication cost for the broadcast and optimal schemes; the numeric entries are not reproduced here).

Figure 5. Weighted subtrees partitioning among four processors.

Figure 6. Performance comparison between the parallel domain decomposition and the weighted subtrees technique with factors equal to 0.98, 0.95, and 0.90 (execution time of the force calculation, in seconds, versus the number of processors).

3.3 Weighted Subtrees

As shown in the previous sections, the parallel domain decomposition technique performs well if particles are distributed uniformly. Performance degrades when the particles are not distributed uniformly. This is because an equal number of boxes at the finest level together with their ancestors at the higher levels (we call them subtrees) is assigned to each processor, without recognizing that some of the subtrees may be empty. We implemented the weighted subtrees technique, proposed in [12], in MPIFMA to partially alleviate this problem. The weighted subtrees technique keeps track of the number of particles in each box at level 2. It starts from box 1 and stops at box i if the number of particles contained in all boxes from box 1 to box i is approximately equal to N/P, where P is the number of processors used in the simulation. Then, the subtrees with roots at boxes 1, 2, ..., i are assigned to processor 1. The same procedure continues from box i+1 and stops at box j; the subtrees with roots at boxes i+1, ..., j are assigned to processor 2. This procedure continues until all subtrees are assigned to processors. It is assumed that each processor is assigned at least one subtree. A 2D example is shown in Figure 5. Just like the parallel domain decomposition scheme, there is no communication in the upward phase when the weighted subtrees scheme is used, while the load on each processor is more balanced.

In practice, one would like each processor to be responsible for the same number of particles. One obvious, but impractical, way to achieve this is to have at most one particle in each box at the finest level. For non-uniformly distributed systems this is unrealizable, and having at most one particle per box at the finest level increases the number of levels of the hierarchical tree structure, which in turn increases the ME2LE conversion time. For all practical purposes, one would like to have more than one particle per box at the finest level; experiments have shown that better performance is obtained when there are 40 to 50 particles in each box at the finest level [15, 14]. Because each box at the finest level may contain more than one particle, it is not likely that each processor contains exactly N/P particles. Therefore, the subtrees with roots at boxes i+1, ..., j are assigned to a processor once the total number of particles in these boxes exceeds a constant factor times N/P; the factor is suggested to be in the range of 0.80 to 1.0. The values of the factor used in the experiments were chosen arbitrarily. A sketch of this assignment procedure is given below.

The benchmark system is a 3D particle system with particle positions O randomly generated between -0.5 and 0.5 with the restriction that all coordinates of a particle are either positive (0 <= O <= 0.5) or negative (-0.5 <= O < 0). Several particle-system sizes were used, the smallest containing 1600 particles. The height of the FMA tree is 4 and the number of terms used in the multipole and local expansions is 8.
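The following self-contained C sketch shows one way to implement the assignment just described. The level-2 particle counts, the constant factor, and the requirement that every processor receive at least one subtree come from the description above; the skewed test distribution and the tie-breaking details are our own placeholders:

#include <stdio.h>

#define NSUB 64   /* number of level-2 boxes (subtree roots) in 3D */

/* Assign contiguous runs of level-2 subtrees to P processors: a processor
 * keeps taking subtrees until it holds at least factor*N/P particles, and
 * the last processor takes whatever remains.  Assumes P <= NSUB. */
static void weighted_subtrees(const long count[NSUB], int P, double factor,
                              int owner[NSUB]) {
    long N = 0;
    for (int b = 0; b < NSUB; b++) N += count[b];
    double target = factor * (double)N / P;

    int proc = 0;
    long held = 0;
    for (int b = 0; b < NSUB; b++) {
        owner[b] = proc;
        held += count[b];
        int boxes_left = NSUB - b - 1;     /* boxes after b          */
        int procs_left = P - 1 - proc;     /* processors after proc  */
        /* Move to the next processor once the target is met, or as soon as
         * only enough boxes remain to give each later processor one box. */
        if (procs_left > 0 && (held >= target || boxes_left <= procs_left)) {
            proc++;
            held = 0;
        }
    }
}

int main(void) {
    long count[NSUB] = {0};
    int owner[NSUB];
    /* Placeholder irregular system: all particles crowded into 8 subtrees. */
    for (int b = 0; b < 8; b++) count[b] = 800;

    weighted_subtrees(count, 4, 0.95, owner);

    long per_proc[4] = {0};
    for (int b = 0; b < NSUB; b++) per_proc[owner[b]] += count[b];
    for (int p = 0; p < 4; p++)
        printf("processor %d holds %ld particles\n", p, per_proc[p]);
    return 0;
}

On this placeholder distribution each processor ends up with 1600 particles, whereas the plain parallel domain decomposition (16 consecutive subtrees per processor) would place all 6400 particles on one processor and leave the other three idle, which is the kind of imbalance described in the following paragraphs.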
The execution time of the force calculation includes the execution times of all FMA steps except initialization. Figure 6 compares the execution times of the force calculation for a particle system of size 6400 between the parallel domain decomposition and the weighted subtrees parallel FMAs with factors of 0.98, 0.95, and 0.90. The execution time is reduced by approximately a half when the number of processors is increased from one to two, no matter which scheme is employed.

However, for the parallel domain decomposition scheme, the execution time remains almost fixed even when P is increased to 4 or 8. This is because an equal number of subtrees is assigned to each processor regardless of the number of particles in each subtree. At most two processors are assigned subtrees which contain particles, and only those two processors do any computation; the other processors remain idle. When P = 16, there are four processors which were assigned subtrees with particles, and the execution time is close to what the weighted subtrees scheme achieves when P = 4 and the factor is 0.90. For the weighted subtrees scheme, the execution time decreased as the number of processors increased, for all factors used in the experiments. Although not shown here, similar results were observed for the other system sizes, including 1600, 3200, and 12800 particles. The observed speedup of the weighted subtrees scheme is approximately 3 times that of the parallel domain decomposition.

4 Concluding Remarks

The goal of this project is to provide a tool with which users can simulate dynamical systems using their own input. So far, MPIFMA provides the building block for the calculation of long-range forces, such as the Coulomb force, within one time step. We plan to include the calculation of short-range forces in the future.

5 Acknowledgment

We would like to thank John Board at Duke University for the sequential FMA code on which our parallel code is based. We would also like to thank Mark Underwood at the University of Missouri-Rolla for his help in collecting part of the running results shown in this paper.

References

[1] S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall.
[2] C. R. Anderson. An implementation of the fast multipole method without multipoles. SIAM Journal on Scientific and Statistical Computing, 13(4), July 1992.
[3] A. Appel. An efficient program for many-body simulation. SIAM Journal on Scientific and Statistical Computing, 6:85-103, 1985.
[4] J. A. Board, L. Kale, K. Schulten, R. D. Skeel, and T. Schlick. Modeling biomolecules: Larger scales, longer durations. IEEE Computational Science and Engineering, pages 19-30, Winter 1994.
[5] J. Carrier, L. Greengard, and V. Rokhlin. A fast adaptive multipole algorithm for particle simulations. SIAM Journal on Scientific and Statistical Computing, 9(4), July 1988.
[6] L. Greengard. The Rapid Evaluation of Potential Fields in Particle Systems. The MIT Press, 1988.
[7] L. Greengard and W. D. Gropp. A parallel version of the fast multipole method. Computers and Mathematics with Applications, 20(7):63-71, 1990.
[8] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73:325-348, 1987.
[9] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324:446-449, 1986.
[10] J. Katzenelson. Computational structure of the N-body problem. SIAM Journal on Scientific and Statistical Computing, 10(4):787-815, July 1989.
[11] J. F. Leathrum Jr. and J. A. Board Jr. The parallel fast multipole algorithm in three dimensions. Technical report, Electrical Engineering Department, Duke University, April.
[12] E. J.-L. Lu and D. I. Okunbor. An efficient load balancing technique for parallel FMA in message passing environment. Submitted to the Eighth IEEE Symposium on Parallel and Distributed Processing, October 1996.
[13] E. J.-L. Lu and D. I. Okunbor. Massively parallel fast multipole algorithm in three dimensions. In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, August 1996 (to appear).
[14] D. Okunbor and E. J. Lu. Parallel fast multipole algorithm using MPI. In Proceedings of the MPI Developers Conference 1995, July 1995.
[15] G. J. Pringle. Numerical Study of Three-Dimensional Flow using Fast Parallel Particle Algorithms. PhD thesis, Napier University, February.
[16] J. K. Salmon. Parallel Hierarchical N-Body Methods. PhD thesis, California Institute of Technology, December.
[17] J. P. Singh. Parallel Hierarchical N-Body Methods and Their Implications for Multiprocessors. PhD thesis, Stanford University, February 1993.
[18] F. Zhao. An O(N) algorithm for three-dimensional N-body simulations. Master's thesis, Department of Electrical Engineering and Computer Science, M.I.T., October 1987.
