EXPERIENCES WITH FRACTILING IN N-BODY SIMULATIONS

Ioana Banicescu and Rong Lu
Mississippi State University
Department of Computer Science and NSF Engineering Research Center for Computational Field Simulation
Mailstop 9637, Mississippi State, MS

(This work is supported in part by the NSF Grant ASC and by the MSU Research Initiation Grant. The research was conducted using the resources of the Maui High Performance Computing Center, which is managed under a cooperative agreement between the United States Air Force Phillips Laboratory and the University of New Mexico.)

Keywords: parallel algorithms, dynamic scheduling, load balancing, performance evaluation, scalability

ABSTRACT

N-body simulations pose load balancing problems mainly due to the irregularity of the distribution of particles and to the different processing requirements of particles in the interior and of those near the boundary of the computation space. In the past, most methods to overcome performance degradation due to load imbalance used profiling of work from a previous time step. The overhead of these methods increases with the problem size and the number of processors. Moreover, these methods are not robust to load imbalances caused by systemic variances (data access latency and operating system interference). Recently, Fractiling, a new dynamic scheduling technique based on a probabilistic analysis, has considerably improved the performance of N-body simulations in a distributed memory shared-address space environment. This technique adapts to algorithmic as well as systemic variances. Our goal is to experimentally extend this technique and evaluate its benefits in a message passing environment. Here we present our experiences with scheduling N-body simulations with Fractiling on the IBM SP2 and the SuperMSPARC, where the parallel code execution time was improved by up to 40%.

1 INTRODUCTION

Scientific problems are often irregular, large, and computationally intensive. An interesting class of irregular scientific problems is the N-body problem. N-body simulations arise in many areas of science, ranging from astrophysics to molecular biology. Given the initial positions and velocities of N particles, the problem is to find the positions and velocities of the particles after a number of time steps. The naive sequential N-body algorithm has O(N^2) complexity per time step.
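
For concreteness, the O(N^2) cost per step comes from evaluating all pairwise interactions directly. The following is a minimal sketch of one such force-accumulation step, our own illustration with a generic softened inverse-square kernel rather than code from the paper:

    #include <math.h>

    /* One naive O(N^2) force-accumulation step: every particle i
     * receives a contribution from every other particle j.  A generic
     * softened inverse-square kernel stands in for the real physics. */
    void accumulate_forces(int n, const double (*pos)[3], const double *mass,
                           double (*force)[3])
    {
        const double eps2 = 1e-9;  /* softening to avoid division by zero */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                double dx = pos[j][0] - pos[i][0];
                double dy = pos[j][1] - pos[i][1];
                double dz = pos[j][2] - pos[i][2];
                double r2 = dx*dx + dy*dy + dz*dz + eps2;
                double inv_r3 = 1.0 / (r2 * sqrt(r2));
                force[i][0] += mass[j] * dx * inv_r3;
                force[i][1] += mass[j] * dy * inv_r3;
                force[i][2] += mass[j] * dz * inv_r3;
            }
    }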

Recently, approximation algorithms of O(N log N) and O(N) complexity that compute the interactions between particles within a specified accuracy have been developed (Barnes and Hut 1986; Appel 1985; Greengard 1987; Anderson 1992). N-body algorithms are amenable to parallel execution, since the calculation of forces on each particle during a time step is, for the most part, independent (Greengard and Rokhlin 1987; Leathrum Jr. 1992; Hu and Johnsson 1996; Lu and Okunbor 1997).

Performance gains from parallel execution of N-body simulations are difficult to obtain due to load imbalance. Imbalance can be caused by the irregularity of the distribution of particles and by the different processing requirements of particles in the interior and near the boundary of the computation space. In addition, the distribution of particles varies at each time step. Various methods have previously been employed to balance processor loads and to exploit locality for the next time step. Most of them use profiling, gathering information on the work load from a previous time step in order to estimate the optimal work load distribution at the present time step. The cost of these methods increases with the number of processors and particles (Singh 1993; Warren and Salmon 1993). Moreover, these methods employ a static assignment of work load to processors during a time step, based on the assumption that the distribution of particles changes slowly between time steps. These assumptions are not valid across the entire spectrum of applications that use N-body simulations. Processor load imbalances are induced not only by application features, such as irregular data and conditional statements, but also by system effects, such as data access latency and operating system interference. Adapting to system-induced load imbalances requires dynamic work assignment. Dynamic scheduling schemes attempt to maintain balanced loads by assigning work to idle processors at run time, during a time step. Thus, they accommodate systemic as well as algorithmic variances. There is a tension between exploiting data locality and balancing loads dynamically during a time step, as the re-assignment of work may necessitate access to remote data. In general, the cost of dynamic schemes is overhead and, potentially, loss of locality.

An effective combined scheduling technique that balances processor loads and maintains locality, by exploiting the self-similarity properties of fractals, is Fractiling (Hummel, Schonberg, and Flynn 1992; Banicescu and Hummel 1995a). Fractiling is based on a probabilistic analysis. It thus accommodates load imbalances caused by predictable events (such as irregular data) as well as unpredictable events (such as data access latency). Fractiling adapts to algorithmic and system-induced load imbalances while maximizing data locality. In Fractiling, work and the corresponding data are initially placed on processors in tiles, to maximize locality. Processors that finish early "borrow" decreasing-size subtiles of work units from slower processors to balance loads. The sizes of these subtiles are chosen so that they have a high probability of finishing before the optimal time. The subtile assignments are computed efficiently by exploiting the self-similarity property of fractals.

Previous work on load balancing N-body simulations with Fractiling applied the technique to a parallel implementation of Greengard's 3-d Fast Multipole Algorithm on a distributed memory shared-address space environment, the KSR-1 at the Cornell Theory Center (Banicescu 1996). This paper attempts to experimentally extend the validity and test the benefits of this technique in a message passing environment, on a SuperMSPARC at the NSF Engineering Research Center for Computational Field Simulation and on an IBM SP2. Our approach to load balancing N-body simulations in this environment is to incorporate fractiling into an N-body code that discretizes the space into an oct-tree. We compare implementations of a parallel and a fractiled N-body simulation on uniform and nonuniform distributions of particles of various sizes, using up to 64 processors. Fractiling could be applied to each level in the oct-tree. However, we choose to fractile only the leaf level, since it is computationally the most intensive and imbalanced part of the code. Experimental work confirmed that the fractiled N-body simulation code consistently improved performance for both uniform and nonuniform distributions of particles.

The next section reviews some of the common techniques for scheduling N-body simulations on parallel and distributed machines. Section 3 describes dynamic scheduling with fractiling and outlines our implementation of the fractiled N-body algorithm in a message passing environment. We discuss experimental results and draw a few conclusions in Sections 4 and 5.

2 BACKGROUND AND RELATED WORK

Previous work on load balancing the N-body problem uses information about the distribution of particles to guide the static assignment of particles to processors (Singh, Holt, Totsuka, et al. 1993; Warren and Salmon 1993; Salmon and Warren 1997; Board, Causey, Jr., et al. 1992; Board, Hakura, Elliot, et al. 1995). The assignment is recomputed after each time step as particles move over time. Some of these techniques include the orthogonal recursive bisection (ORB) and the costzones methods (Warren and Salmon 1992; Singh, Holt, Totsuka, et al. 1993). Others use a hash function to build the hashed oct-tree (HOT), which employs Morton order, a space-filling numbering scheme (Warren and Salmon 1993).
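
A Morton key of this kind is typically obtained by interleaving the bits of a cell's integer coordinates, so that nearby cells tend to receive nearby keys. The sketch below is our own illustration of this standard construction, not code from any of the cited systems; the function names and the 10-bits-per-dimension limit are assumptions:

    #include <stdint.h>

    /* Spread the low 10 bits of v so that two zero bits separate
     * consecutive bits: ...b2 b1 b0 -> ...b2 0 0 b1 0 0 b0. */
    static uint32_t spread_bits3(uint32_t v)
    {
        uint32_t r = 0;
        for (int i = 0; i < 10; i++)
            r |= ((v >> i) & 1u) << (3 * i);
        return r;
    }

    /* Morton (Z-order) key for a 3-d cell with integer coordinates
     * (x, y, z), each in [0, 1024): the bits of x, y, z are interleaved. */
    uint32_t morton3(uint32_t x, uint32_t y, uint32_t z)
    {
        return spread_bits3(x) | (spread_bits3(y) << 1) | (spread_bits3(z) << 2);
    }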

Random assignment of subtiles of a certain size to processors has also been considered to counter the load imbalance of N-body simulations (Grama, Kumar, and Sameh 1994). With random assignment, the load imbalances of individual subtiles mute each other out to some extent. Some experimentation with new scheduling schemes applied to scientific problems has been presented in (Hummel, Schonberg, and Flynn 1992; Banicescu 1996). These schemes combine static techniques that exploit data locality with dynamic techniques that improve load balancing. In these schemes, work units and their associated data are initially placed on the same processor. Each processor executes its units in decreasing-size chunks to preserve load balance. After exhausting its local work, each processor acquires decreasing-size chunks of work from other processors. These decreasing chunks are represented by multidimensional subtiles of the same shape, selected to maximize data reuse. The subtiles are combined in Morton order into larger subtiles, thus preserving the self-similarity property (see Figure 2 in Section 3). In this way, a complex history of executed subtiles does not need to be maintained. In scheduling N-body simulations with Fractiling on distributed shared-address space machines, the performance of the parallel code has shown considerable improvement.

3 SCHEDULING N-BODY SIMULATIONS WITH FRACTILING

Initially, work is divided into P tiles, and each tile is assigned to a processor. The tile shape can be any parallelepiped. In general, N-body simulations used tiles in blocked dimensions. In Fractiling, new work is acquired by processors in the form of decreasing-size subtiles, called fractiles.

The fractile sizes are chosen such that there is a high probability of their finishing before the optimal time. To simplify the allocation of decreasing-size fractiles, the size of each consecutive batch of P fractiles is half that of the previous P fractiles. This choice results from an approximation of factoring rules, which select sizes that ensure a high probability of finishing before the optimal time, based on the mean and variance of individual element execution times.

Fractiling exploits both the locality and the self-similarity properties of the Morton ordering that is used to guide the assignment of fractiles (see Figure 2). Previously, Morton ordering had been used only for rapid indexing and efficient addressing. The self-similarity property allows fractiling to be applied efficiently to multidimensional problems. In addition, when a processor needs to execute subtiles from another processor's locality, counters can be used directly as indexes into the tile to indicate the start position for the next subtile execution, since all subtiles are shuffled and linearized (see Figure 1).

Figure 1: 2-dimensional shuffled row-major numbering and its fractal.
Figure 2: Illustration of self-similarity.

The effectiveness of applying fractiling to N-body simulation codes was first revealed by experiments with non-fractiled and fractiled codes on a KSR-1 (Banicescu and Hummel 1995b). The performance of 2-d and 3-d N-body simulation codes based on the parallelization of Greengard's Fast Multipole Algorithm (FMA) (Greengard and Rokhlin 1987) was improved by as much as 53% by fractiling, on both uniform and nonuniform distributions of particles. The Greengard algorithm performs two passes over a quad-tree per time step. During the upward pass, the summary effects of particles in the subtrees are propagated up the tree, and during the downward pass, the local expansions and direct interactions are computed. Although programming this algorithm involves a large amount of work, much progress has been made towards the parallelization of the 3-d FMA in message passing environments in recent years (Lu and Okunbor 1997; Rankin 1995).
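
As a concrete reading of the batch-halving rule described above, the following sketch enumerates a fractile-size schedule: each batch consists of P fractiles, each half the size of the previous batch's, starting from half a tile (cf. the half subtile each processor first works on, described below). This is our own illustration, not the authors' scheduler; the fallback to unit-size work at the end is an assumption:

    #include <stdio.h>

    /* Print a batch-halving fractile schedule: each batch consists of P
     * fractiles, and each batch's fractile size is half of the previous
     * batch's, starting from half a tile.  Illustrative sketch only. */
    void fractile_schedule(long tile_size, int P)
    {
        long remaining = tile_size * P;   /* total work units across all tiles */
        long size = tile_size / 2;        /* first fractile: half a tile */
        int batch = 0;

        while (remaining > 0 && size > 0) {
            printf("batch %d: %d fractiles of %ld work units\n", batch, P, size);
            remaining -= (long)P * size;
            size /= 2;                    /* next batch is half as large */
            batch++;
        }
        if (remaining > 0)
            printf("final %ld units scheduled one at a time\n", remaining);
    }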

Here we present our implementation of a parallel 3-d FMA in a message passing environment using scheduling with fractiling. We implemented a straightforward parallelization of the 3-d FMA (PFMA) code and its fractiled version (Fract), and performed our experiments on both the IBM SP2 at the Maui High Performance Computing Center and the SuperMSPARC at the National Science Foundation Engineering Research Center for Computational Field Simulation at Mississippi State University.

The IBM SP2 has a scalable distributed memory architecture using the RS/6000 family of superscalar workstations and servers. An SP2 system is composed of one or more frames, each frame containing multiple processor nodes configured with various amounts of memory and adapter slots. IBM SP2 networks are bidirectional multistage interconnection networks with high performance crossbar switches that allow all processors to send messages simultaneously. Communication between nodes may cross a variety of network interfaces with, for instance, 500 nanoseconds of latency per switch hop. The SuperMSPARC is a 32-processor multicomputer consisting of eight 4-processor clusters (designed and constructed at the NSF Engineering Research Center for Computational Field Simulation). The clusters are tightly coupled Sun SPARCstation 10s with 90 MHz CPU upgrades. The clusters are connected using ATM or Myrinet networks, characterized by low latency and high bandwidth.

Our implementation using MPI is based on the initial PVM code from the Parallel Multipole Tree Library developed by the Scientific Computing Group at Duke University (Rankin 1995). We exploited Duke's experience and modified most of the communication code to improve efficiency in the MPI environment. The parallel 3-d FMA (PFMA) algorithm was implemented using common strategies for interleaving computation with communication to reduce the overhead of sharing data in a distributed multiprocessor environment. Furthermore, the use of an inverse interaction list mechanism allows processors to have a priori knowledge of which data will be needed by other processors, and to send that specific data to them without being prompted. In this way, the synchronization overhead often incurred in irregular computations is avoided.
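
The inverse interaction list idea can be sketched as follows: since the tree decomposition is known to every processor, each processor can invert the interaction lists to determine which of its own cells other processors will need, and push those cells eagerly with nonblocking sends. The sketch below is our illustration of this pattern, not the authors' code; the data layout (cells_needed_by, the message tag) is hypothetical:

    #include <mpi.h>

    #define TAG_CELL_DATA 100  /* hypothetical message tag */

    /* Eagerly push cell data to every processor whose interaction lists
     * reference cells we own.  cells_needed_by[p] holds the local cell
     * payload that processor p will need; this inversion of the
     * interaction lists is computed locally from the known tree. */
    void push_inverse_interactions(int nprocs, int myrank,
                                   double **cells_needed_by,
                                   int *count_needed_by,
                                   MPI_Request *reqs)
    {
        for (int p = 0; p < nprocs; p++) {
            if (p == myrank || count_needed_by[p] == 0)
                continue;
            /* Nonblocking send: the receiver already expects this data,
             * so no request message or synchronization is required. */
            MPI_Isend(cells_needed_by[p], count_needed_by[p], MPI_DOUBLE,
                      p, TAG_CELL_DATA, MPI_COMM_WORLD, &reqs[p]);
        }
    }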

Even when using the best decomposition strategies, the performance of N-body simulations may suffer due to load imbalances created by nonuniform distributions of particles, boundary effects, and systemic variances. Designing an efficient 3-d fractiled code was a challenging task, since in a distributed message passing environment there is a high potential for increased overhead attributed to communication and synchronization. Fractiling requires consistency of a small set of shared variables. As in the distributed memory shared-address space implementation, space is discretized into P tiles (sub-rectangles or sub-cubes), and each tile is assigned to a processor. The leaf level of the downward pass was selected to be fractiled, since it is computationally the most intensive and imbalanced step. The sparse distribution of particles was represented as a dense cell table, and auxiliary arrays contained the particles in each cell.

There are a few possible implementation methods for fractiling in a distributed message passing environment. Here we concentrate on a centralized management approach, where one processor is selected as a master to manage global variables. In this scheme, only the master has the authority to access and modify shared variables. Thus, this scheme guarantees data consistency and reduces programming complexity. However, as the number of processors increases, a bottleneck may occur as the master tries to process a larger number of requests from slaves. Other possible implementations, presently being pursued by our research group, are beyond the scope of this paper.

Master/slave communication patterns in our implementation are depicted in Figure 3. At the beginning, after dividing the computation space into P tiles, one tile per processor, each processor first works on its half subtile. When a processor finishes its subtile, it sends a FRACT_ASK message to the master. The master looks up its tables, updates the set of shared variables, and then assigns a new subtile size to the requesting processor with a FRACT_REPLY. The requesting processor receives the answer and continues to work. If the requesting processor has completed its own tile and there is work left in another processor's tile, the master assigns a subtile in that processor's tile and sends a FRACT_COMM message to that processor, indicating that it should send its data to the requesting processor. The master also sends a FRACT_REPLY to the requesting processor indicating which processor is to be helped. After receiving the message from the master, the processor being helped packs its data and sends it to the helper using FRACT_ORG_DATA. Upon completion of the work on this data, the helper processor sends a FRACT_ASK to the master to announce completion and request a new assignment. It also sends the results of the computation to the processor that owns the data (the owner). The above steps are repeated until there is no work left in any processor's tile. When assigning subtiles to slaves, the master processor always observes the following rules: (i) a processor must have completed all the work in its own tile before starting to help another processor; (ii) after completing its own tile, a processor always works on the tile with the largest available unfinished subtile. With this combination of features, fractiling improves data locality and reduces load imbalance. Furthermore, we may always choose the least loaded processor to serve as the master, in order to compensate for the management overhead incurred on the master processor.
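
The protocol above reduces to a message loop on the master. The following skeleton is our schematic rendering of that loop, not the authors' code: the message tags follow the names in the paper, but the payload layout (three integers: tile owner, start index, subtile size) and the helper next_subtile are assumptions:

    #include <mpi.h>

    /* Hypothetical tag values, named after the paper's protocol. */
    enum { FRACT_ASK = 1, FRACT_REPLY, FRACT_COMM };

    /* Hypothetical helper: applies rules (i) and (ii) and fills
     * assign = {tile owner, start index, size}; returns 0 when all
     * tiles are exhausted. */
    extern int next_subtile(int requester, int assign[3]);

    /* Schematic master loop: answer FRACT_ASK requests with new
     * subtile assignments until no work remains in any tile. */
    void master_loop(int nslaves)
    {
        int active = nslaves;
        while (active > 0) {
            MPI_Status st;
            int dummy;
            /* Wait for any slave to finish its current subtile. */
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, FRACT_ASK,
                     MPI_COMM_WORLD, &st);

            int assign[3];
            if (next_subtile(st.MPI_SOURCE, assign)) {
                /* If the subtile lives in another processor's tile, tell
                 * that owner to ship its data to the requester. */
                if (assign[0] != st.MPI_SOURCE)
                    MPI_Send(&st.MPI_SOURCE, 1, MPI_INT, assign[0],
                             FRACT_COMM, MPI_COMM_WORLD);
                MPI_Send(assign, 3, MPI_INT, st.MPI_SOURCE,
                         FRACT_REPLY, MPI_COMM_WORLD);
            } else {
                int done[3] = { -1, -1, -1 };  /* sentinel: no work left */
                MPI_Send(done, 3, MPI_INT, st.MPI_SOURCE,
                         FRACT_REPLY, MPI_COMM_WORLD);
                active--;
            }
        }
    }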

Figure 3: Master/slave communication in fractiling (messages FRACT_ASK, FRACT_REPLY, FRACT_COMM, FRACT_ORG_DATA, FRACT_FIN_DATA between the master, Proc 1 and Proc 2).

4 EXPERIMENTAL RESULTS

To test the effectiveness of fractiling, we ran the non-fractiled (PFMA) and the fractiled (Fract) versions of the parallel 3-d FMA code on uniform and nonuniform distributions of particles, on both the IBM SP2 (on up to 64 processors) and the SuperMSPARC (on up to 32 processors). There was small variance in the parallel execution times between runs; thus, we ran each program only 5 times and averaged the results. There was considerable variance in processor finishing times in all PFMA runs. We expected the performance gains of the fractiled code to be reflected in more even processor finishing times within a run. The uniform and nonuniform distributions used in our experiments involved between 10K and 100K particles, resulting in 4- and 5-level oct-trees (respectively), with average densities of particles per leaf box ranging from 12 to 39. The nonuniform distribution of particles, called "corner", was created by shifting a Gaussian distribution's center into one of the space octants (see Figure 4). The performance of the fractiled (Fract) versus the nonfractiled (PFMA) code was measured in terms of execution time, speedup, percent improvement, and cost (= time x P).
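
For reference, these metrics reduce to simple formulas over the measured times; the coefficient of variation (c.o.v.) of processor finishing times reported below is the standard deviation divided by the mean. A minimal sketch, ours and purely illustrative:

    #include <math.h>

    /* cost = parallel time x number of processors. */
    double cost(double time, int P) { return time * P; }

    /* Percent improvement of the fractiled time over the nonfractiled one. */
    double percent_improvement(double t_pfma, double t_fract)
    {
        return 100.0 * (t_pfma - t_fract) / t_pfma;
    }

    /* Coefficient of variation of the P processor finishing times:
     * standard deviation divided by mean.  A lower c.o.v. indicates
     * better load balance. */
    double cov(const double *finish, int P)
    {
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < P; i++) mean += finish[i];
        mean /= P;
        for (int i = 0; i < P; i++)
            var += (finish[i] - mean) * (finish[i] - mean);
        var /= P;
        return sqrt(var) / mean;
    }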

Figure 4: Nonuniform distribution (corner).

Our experiments indicated that the fractiled code was scalable on both uniform and nonuniform distributions. Detailed information on the results may be found in (Lu 1997). On uniform distributions with a high number of processors and large problem sizes, the fractiled code outperformed the nonfractiled code. For large problems on smaller numbers of processors, improvements were not substantial, due to muting effects. On nonuniform distributions, the fractiled code consistently outperformed the nonfractiled code on both the IBM SP2 and the SuperMSPARC. Figures 5 and 6 illustrate the difference in cost between the fractiled and nonfractiled codes on two nonuniform distributions, on the IBM SP2 and the SuperMSPARC.

Figure 5: Simulation on a nonuniform distribution of 100k particles on a SuperMSPARC (cost in seconds versus number of processors, pfma versus fract).

Experimental work conducted so far on a parallel 3-d Fast Multipole Algorithm shows that the performance of the parallel code was improved by up to 40% by Fractiling (see Figure 6). In addition, the implementation scales well and is work efficient on both uniform and nonuniform distributions of particles. It exhibits impressive load balancing improvements over the nonfractiled code, in terms of considerably smaller variance in processor execution times. The recorded coefficient of variation (c.o.v.) for the fractiled code was always less than 0.39, whereas that of the nonfractiled code was substantially higher.

Figure 6: Simulation on a nonuniform distribution of 50k particles on an IBM SP2 (cost in seconds versus number of processors, non-fractiling versus fractiling).

5 CONCLUSION AND FUTURE WORK

In general, schemes to load balance N-body simulations have primarily addressed algorithmic variability. Moreover, most of these schemes used profiling, gathering data structures and load information from a previous time step. However, experience has shown that these methods incur a large overhead that increases with the problem size and the number of processors. These techniques are not robust for distributions that change rapidly and unpredictably (e.g., the radiosity problem). In addition, these techniques do not address the problem of considerable variance in processor execution times due to unpredictable system interference. Recently, dynamic scheduling techniques based on a probabilistic analysis, such as Fractiling, have been used to address these concerns in N-body simulations on distributed memory shared-address space environments such as the KSR-1. Here we reported on experiments used to experimentally extend the method and evaluate the benefits of this approach in a distributed message-passing environment. In this way, Fractiling becomes a robust, competitive dynamic scheduling technique for irregular computations such as N-body simulations under both distributed memory programming paradigms: shared-address space and message passing. Our experiments with N-body simulation codes on the IBM SP2 and the SuperMSPARC, on uniform and nonuniform distributions of particles, revealed that the fractiled code consistently outperformed the nonfractiled code. Future work involves the use of a hierarchical scheme to reduce overhead when the number of processors increases.

6 ACKNOWLEDGMENTS

We thank John Board and the Scientific Computing Group at Duke University for providing us with the PVM version of the parallel 3-d FMA code, which was the basis of our earlier implementations. We are grateful to Bassem Medawar, Donna Reese and Anthony Skjellum for valuable suggestions and comments. We would like to acknowledge the Maui High Performance Computing Center and the National Science Foundation Engineering Research Center for Computational Field Simulation, where the implementations and all the experiments were conducted. The support of the National Science Foundation through the ASC grant and of Mississippi State University through the MSU Research Initiation Grant is gratefully acknowledged.

REFERENCES

Anderson, C. (1992, July). An Implementation of the Fast Multipole Method Without Multipoles. SIAM J. Sci. Stat. Comput. 13(4), 923-947.
Appel, A. W. (1985). An Efficient Program for Many-Body Simulations. SIAM J. Sci. Stat. Comput. 6.
Banicescu, I. (1996, January). Load Balancing and Data Locality in the Parallelization of the Fast Multipole Algorithm. Ph.D. thesis, Polytechnic University.
Banicescu, I. and S. F. Hummel (1995a, February). Balancing Processor Loads and Exploiting Data Locality in Irregular Computations. Technical Report RC19934, IBM.
Banicescu, I. and S. F. Hummel (1995b, December). Balancing Processor Loads and Exploiting Data Locality in N-Body Simulations. In Proceedings of the Supercomputing'95 Conference.
Barnes, J. and P. Hut (1986). A Hierarchical O(N log N) Force-Calculation Algorithm. Nature 324.
Board, J. A., J. Causey, J. F. Leathrum Jr., et al. (1992). Accelerated molecular dynamics simulation with the parallel fast multipole algorithm. Chemical Physics Letters 198, 23-34.
Board, J. A., Z. S. Hakura, W. D. Elliot, et al. (1995, February). Scalable variants of multipole-based algorithms for molecular dynamics applications. In Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, Philadelphia, pp. 295-300. SIAM.
Grama, A. Y., V. Kumar, and A. Sameh (1994, November). Scalable Parallel Formulations of the Barnes-Hut Method for N-Body Simulations. In Proc. of Supercomputing'94, pp. 439-448.
Greengard, L. (1987). The Rapid Evaluation of Potential Fields in Particle Systems. ACM Press.
Greengard, L. and V. Rokhlin (1987, May). A fast algorithm for particle simulations. Journal of Computational Physics 73, 325-348.
Hu, Y. and S. L. Johnsson (1996, November). A Data-Parallel Implementation of O(N) Hierarchical N-body Methods. In Supercomputing'96.
Hummel, S. F., E. Schonberg, and L. E. Flynn (1992, August). Factoring: A Practical and Robust Method for Scheduling Parallel Loops. Communications of the ACM 35(8), 90-101.
Leathrum Jr., J. F. (1992). Parallelization of the Fast Multipole Algorithm: Algorithm and Architecture Design. Ph.D. thesis, Duke University.
Lu, E. and D. I. Okunbor (1997, March). An Efficient Load Balancing Technique for Parallel FMA in Message Passing Environment. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing.
Lu, R. (1997). Parallelization of the Fast Multipole Algorithm with Fractiling in Distributed Memory Architectures. Master's thesis, Mississippi State University.
Rankin, W. T. (1995). A Distributed Implementation of the Parallel Multipole Tree Algorithm - Version. Duke University, Department of Electrical Engineering.
Salmon, J. and M. S. Warren (1997). Parallel, Out-of-core Methods for N-body Simulation. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing. SIAM.
Singh, J. (1993). Parallel Hierarchical N-body Methods and their Implications for Multiprocessors. Ph.D. thesis, Stanford University.
Singh, J., C. Holt, T. Totsuka, et al. (1993). A Parallel Adaptive Fast Multipole Algorithm. In Proc. of Supercomputing'93, pp. 54-65.
Warren, M. and J. Salmon (1992). Astrophysical N-Body Simulation using Hierarchical Tree Structures. In Proc. of Supercomputing'92.
Warren, M. and J. Salmon (1993). A Parallel Hashed Oct-Tree N-body Algorithm. In Proceedings of Supercomputing'93, pp. 12-21. IEEE Computer Society.


More information

Progress in Image Analysis and Processing III, pp , World Scientic, Singapore, AUTOMATIC INTERPRETATION OF FLOOR PLANS USING

Progress in Image Analysis and Processing III, pp , World Scientic, Singapore, AUTOMATIC INTERPRETATION OF FLOOR PLANS USING Progress in Image Analysis and Processing III, pp. 233-240, World Scientic, Singapore, 1994. 1 AUTOMATIC INTERPRETATION OF FLOOR PLANS USING SPATIAL INDEXING HANAN SAMET AYA SOFFER Computer Science Department

More information

Using the Holey Brick Tree for Spatial Data. in General Purpose DBMSs. Northeastern University

Using the Holey Brick Tree for Spatial Data. in General Purpose DBMSs. Northeastern University Using the Holey Brick Tree for Spatial Data in General Purpose DBMSs Georgios Evangelidis Betty Salzberg College of Computer Science Northeastern University Boston, MA 02115-5096 1 Introduction There is

More information

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERATIONS KAM_IL SARAC, OMER E GEC_IO GLU, AMR EL ABBADI

More information

Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP

Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP Ewa Deelman and Rajive Bagrodia UCLA Computer Science Department deelman@cs.ucla.edu, rajive@cs.ucla.edu http://pcl.cs.ucla.edu

More information

MOBILE VIDEO COMMUNICATIONS IN WIRELESS ENVIRONMENTS. Jozsef Vass Shelley Zhuang Jia Yao Xinhua Zhuang. University of Missouri-Columbia

MOBILE VIDEO COMMUNICATIONS IN WIRELESS ENVIRONMENTS. Jozsef Vass Shelley Zhuang Jia Yao Xinhua Zhuang. University of Missouri-Columbia MOBILE VIDEO COMMUNICATIONS IN WIRELESS ENVIRONMENTS Jozsef Vass Shelley Zhuang Jia Yao Xinhua Zhuang Multimedia Communications and Visualization Laboratory Department of Computer Engineering & Computer

More information

PROJECTION MODELING SIMPLIFICATION MARKER EXTRACTION DECISION. Image #k Partition #k

PROJECTION MODELING SIMPLIFICATION MARKER EXTRACTION DECISION. Image #k Partition #k TEMPORAL STABILITY IN SEQUENCE SEGMENTATION USING THE WATERSHED ALGORITHM FERRAN MARQU ES Dept. of Signal Theory and Communications Universitat Politecnica de Catalunya Campus Nord - Modulo D5 C/ Gran

More information

Steering. Stream. User Interface. Stream. Manager. Interaction Managers. Snapshot. Stream

Steering. Stream. User Interface. Stream. Manager. Interaction Managers. Snapshot. Stream Agent Roles in Snapshot Assembly Delbert Hart Dept. of Computer Science Washington University in St. Louis St. Louis, MO 63130 hart@cs.wustl.edu Eileen Kraemer Dept. of Computer Science University of Georgia

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen

More information

2 Rupert W. Ford and Michael O'Brien Parallelism can be naturally exploited at the level of rays as each ray can be calculated independently. Note, th

2 Rupert W. Ford and Michael O'Brien Parallelism can be naturally exploited at the level of rays as each ray can be calculated independently. Note, th A Load Balancing Routine for the NAG Parallel Library Rupert W. Ford 1 and Michael O'Brien 2 1 Centre for Novel Computing, Department of Computer Science, The University of Manchester, Manchester M13 9PL,

More information

Extensions to RTP to support Mobile Networking: Brown, Singh 2 within the cell. In our proposed architecture [3], we add a third level to this hierarc

Extensions to RTP to support Mobile Networking: Brown, Singh 2 within the cell. In our proposed architecture [3], we add a third level to this hierarc Extensions to RTP to support Mobile Networking Kevin Brown Suresh Singh Department of Computer Science Department of Computer Science University of South Carolina Department of South Carolina Columbia,

More information

Between PVM and TreadMarks. Honghui Lu. Sandhya Dwarkadas, Alan L. Cox, and Willy Zwaenepoel. Rice University

Between PVM and TreadMarks. Honghui Lu. Sandhya Dwarkadas, Alan L. Cox, and Willy Zwaenepoel. Rice University Quantifying the Performance Dierences Between PVM and TreadMarks Honghui Lu Department of Electrical and Computer Engineering Sandhya Dwarkadas, Alan L. Cox, and Willy Zwaenepoel Department of Computer

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

High Performance Synchronization Algorithms for. Multiprogrammed Multiprocessors. (Extended Abstract)

High Performance Synchronization Algorithms for. Multiprogrammed Multiprocessors. (Extended Abstract) High Performance Synchronization Algorithms for Multiprogrammed Multiprocessors (Extended Abstract) Robert W. Wisniewski, Leonidas Kontothanassis, and Michael L. Scott Department of Computer Science University

More information