FORSCHUNGSZENTRUM JÜLICH GmbH
Zentralinstitut für Angewandte Mathematik
D-52425 Jülich, Tel. (02461) 61-6402

Interner Bericht

Particle Simulations on Cray MPP Systems

Christian M. Dury (*), Renate Knecht, Gerald H. Ristow (*)

FZJ-ZAM-IB-9714
September 1997

(*) Fachbereich Physik, Philipps-Universität Marburg, Renthof 6, D Marburg, Germany

Third European CRAY-SGI MPP Workshop, Paris
Particle Simulations on Cray MPP Systems

Christian M. Dury (a), Renate Knecht (b), Gerald H. Ristow (a)

(a) Fachbereich Physik, Philipps-Universität Marburg, Renthof 6, D Marburg, Germany, dury@mailer.uni-marburg.de, ristow@physik.uni-marburg.de
(b) Zentralinstitut für Angewandte Mathematik, Forschungszentrum Jülich GmbH, D-52425 Jülich, Germany, r.knecht@fz-juelich.de

Abstract

Particle simulations were among the first applications to be implemented on scalar computers over forty years ago and have since played an important role in many science and engineering applications. Because of the inherent parallelism in all particle algorithms, the advent of parallel computers has revolutionized this field: basically, the same set of calculations has to be performed for every particle in the system. At present, realistic simulations with a few million particles are possible using large, general-purpose parallel computers. In this paper the parallel simulation of the size segregation of a binary mixture of granular materials in a half-filled, three-dimensional rotating drum, using the discrete element method with linear contact forces, is investigated. Performance results of an implementation in Fortran 90 using MPI for data communication on the CRAY T3D, CRAY T3E-600, and CRAY T3E-900 are presented. They have been determined with the help of the Cray tools MPP Apprentice and the performance analysis tool PAT, as well as the message passing visualization tool VAMPIR developed at the Research Centre Jülich.

1 Introduction

The study of granular materials has long been an active field of research, partly due to the many interesting physical phenomena which granular materials give rise to and partly because of their importance for industrial applications [1, 6]. With the advent of more powerful computers, many scientists and engineers believe that some of the phenomena known in this field can be better understood through well-planned computer simulations [2].
This belief rests on the premise that these phenomena are collective or emergent in nature, i.e., the constituent grains experience simple, well-understood interactions with each other, but unexpected behavior emerges due to the large number of grains involved. Hence, if the grain-grain interactions can be programmed efficiently enough that a sufficiently large system can be simulated, then it should be possible to study phenomena which are still poorly understood. In the past such simulations were performed on vector computers. However, since most of the computational time is spent calculating the collision forces acting between the particles, this limits the consideration of interactions to those which can be vectorized. For many problems of scientific interest this limitation is not very restrictive, especially if one ignores factors like the price/performance ratio of the computation. One of the promises of massively parallel computers consisting of scalar or super-scalar processors is the ability to perform cost-effective simulations of systems with more complicated, and more realistic, interactions.

Parallelization techniques for particle algorithms depend on the range of the particle interactions and the number of particles. For short-range interactions and simulations with more than a few thousand particles, the link-cell approach, a form of domain parallelization, is the most appropriate choice. This method divides the physical space into small cells and assigns each particle to a given cell. If the cell size is larger than the particles' interaction radius, then only the neighboring cells need to be checked in order to find all possible collision partners. Parallelization is then accomplished by allocating all cells within a given physical domain to a given processor. For homogeneous systems, and systems where fluctuations in the particle density are small, a static allocation of the domains to the processors is adequate. In the general case, however, statically allocated partitions lead to poorly distributed computing loads. This problem can be overcome by mapping the domains to the processors dynamically [7].

Nevertheless, the basic physical understanding of granular materials is far from complete. One of their most intriguing properties is the tendency to segregate. It is observed in many industrial particle-handling situations, such as transporting grains or mixing pharmaceutical pills. The rotating-cylinder geometry is an archetype of numerous devices used in industrial material processing, where radial segregation can occur on short time scales and axial segregation is observed on longer time scales.
The mechanism of the segregation process is based on the surface flow: small particles are more likely than large ones to get stuck along the inclined surface and hence accumulate near the center of the rotating drum (see figure 1). Many parameters are involved in the process of radial segregation and mixing, such as size, shape, mass, frictional forces, angular velocity, and the filling of the drum.

Figure 1: 2D drum; small particles are drawn as filled circles and large particles as open circles. (a) Snapshot of the drum right before the first avalanche. (b) Snapshot of the drum after rotating for t = 60 s with angular velocity ω = 1.0 Hz, i.e., after 9 rotations.
2 Parallelization

Distinct element simulations are based on the use of distinct, individual elements, each of which is free to move according to some given rules [3]. For granular materials, the most important interactions are the inelastic, soft-sphere collisions. For such short-range interactions, the link-cell algorithm is the most efficient programming technique [2] (see figure 2). This method starts by dividing the physical space into either square or cubic cells, depending on the dimension of the physical space, with a side length R_L. For polydisperse systems, i.e. systems with particles of varying diameter, one normally takes R_L = R_max + ε, where ε is a small positive number and R_max is the diameter of the largest particle. For the monodisperse case, where all particles have the same diameter, it is more efficient to take R_L = R_max − ε [5].

Figure 2: Link-cell algorithm (2-D)

Once the space has been sectioned, all particles whose physical coordinates lie inside a given cell are placed into a linked list associated with that cell (see figure 3). The problem of finding all particles colliding with a given particle is then reduced to searching over all neighboring cells, for the case R_L > R_max. (In practice one searches only over half the neighboring cells because the collisions are symmetric.) All interacting particle pairs can now be placed into a list which can then be processed efficiently in order to determine the forces acting on each particle due to the collisions. Usually one tries to find an ε such that this list needs to be recreated at most every 10th time step. After the forces acting on each element are calculated, the Hamiltonian equations of motion are integrated to find the new position of each particle. Normally a simple leap-frog integration method suffices; however, predictor-corrector schemes are also in widespread use [1].
Typically, the time spent integrating the equations of motion is negligible compared to the time needed for calculating the particle interactions.

In this paper the parallel simulation of the size segregation of a binary mixture of granular material in a half-filled three-dimensional rotating drum, using the distinct element method with linear contact forces, is investigated. The rotation axis in this study is the x-axis, and the cylinder is parallelized along this axis (see figure 4).
Figure 3: Linked list

Each processing element (PE) owns the data of its local particles and the data of the halo regions, which contain the particles from the neighboring PEs. The particles in these halo cells are not updated; rather, their positions are used for the force calculations of the particles in the true cells. During the course of the simulation, particles will migrate outside of the spatial region controlled by the processor on which they reside. Such particles need to be removed from the list in which they are registered and transmitted to the appropriate processor, where they are then registered. The performance measurements presented here have been performed without dynamic load balancing, because in this application the flow of particles from one PE to another is approximately balanced, so no accumulation of particles on one PE can occur.

Figure 4: Parallelization of the 3D drum (rotation axis along the cylinder axis)

This approach has been implemented in Fortran 90 using MPI [8] for data communication on the systems CRAY T3D, CRAY T3E-600, and CRAY T3E-900 [9]. The numerical methods and the parameters used, as well as a quantitative analysis of the segregation for different rotational velocities, are described in [4].

3 Performance Investigations

Performance measurements have been carried out on a CRAY T3D, on a CRAY T3E-600 with stream buffers disabled, and on a CRAY T3E-900 with stream buffers both enabled and disabled. External stream buffers in a CRAY T3E system are used to maximize local memory bandwidth, leading to better performance for vector-like data references. The CRAY T3E-600 at the Research Centre Jülich is equipped with the older PE modules. A hardware design problem in the memory control chip may lead to stability problems of the system when the stream buffers are activated. Therefore they are disabled and may not be activated via user-controlled environment variables. The characteristics of the Cray MPP systems used here are shown in table 1.
                          T3D             T3E-600      T3E-900
  Processor               DEC Alpha EV4   EV5          EV5
  Clock                   150 MHz         300 MHz      450 MHz
  Peak performance        150 MFLOPS      600 MFLOPS   900 MFLOPS
  3D torus clock          150 MHz         150 MHz      150 MHz
  3D torus link bandwidth 300 MB/s        500 MB/s     500 MB/s
  Primary cache           8 KB            8 KB         8 KB
  Secondary cache         -               96 KB        96 KB
  Memory bandwidth        300 MB/s        1200 MB/s    1200 MB/s

Table 1: Cray MPP systems characteristics

On the CRAY T3E-600 the processor clock rate is doubled in comparison to the CRAY T3D. Furthermore, the CRAY T3E processor can perform 2 operations per clock period as opposed to 1 operation on a CRAY T3D. On the CRAY T3E, applications can additionally benefit from the secondary cache, which is not available on the CRAY T3D.

The application's performance has been investigated using the Cray tools MPP Apprentice and the Performance Analysis Tool (PAT), as well as the message passing visualization tool VAMPIR (Visualization and Analysis of MPI Resources) developed at the Research Centre Jülich. MPP Apprentice and PAT can be used to identify the most time-consuming routines. MPP Apprentice assists the user in determining the performance characteristics of a parallel application on a CRAY T3D or T3E system and gives some indication of the causes of the observed behavior. Due to the large overhead induced by the MPP Apprentice run-time library, the reported timings are only an indication of the real execution times. Moreover, the reported MFLOPS or integer operation rates cannot be used to measure the real performance. To provide more thorough information, the PAT performance analysis tool is available on CRAY T3E systems. PAT uses hardware performance counters and the profil(2) system call on UNICOS/mk systems. It provides a fast, low-overhead method for estimating the amount of time consumed in procedures, determining load balance across PEs, generating and viewing trace files, timing individual calls to routines, and displaying hardware performance counter information.
A program that gathers PAT performance data runs much faster than a program instrumented to collect performance data for MPP Apprentice: on average, a program instrumented for MPP Apprentice runs three times slower than the uninstrumented program. VAMPIR, on the other hand, provides detailed information on the message passing communication and the load balancing on the PEs. VAMPIR translates a trace file generated on a Cray MPP system at runtime into a variety of graphical views, e.g. state diagrams, activity charts, time-line displays (see figure 5), and statistics. Time-line displays are helpful to get an overview of the load balancing of the program. Colors are used to represent different kinds of activities; in this example MPI routines are shown in blue, whereas the computation part is shown in green. Zooming is possible to analyze the program on any level of detail, and each message sent from one PE to another can be identified. The execution time for one iteration in this example is about 24 ms and can be determined using a VAMPIR popup menu. To generate trace data in the current version, the source code has to be instrumented with calls to a run-time library. A future version of PAT will be capable of object code instrumentation, which will make the usage of VAMPIR independent of preprocessors for special programming languages.

Figure 5: Time-line display showing one iteration out of 200 with particles on 16 PEs of a CRAY T3E-900 (stream buffers activated)

Table 2 shows the measured execution times of the application without I/O. 200 iterations of the simulation were performed for a drum with particles on 16 PEs. The performance gain of the CRAY T3E-900 over the CRAY T3E-600 can be at most 50 % because of the higher clock rate. Moreover, the stream buffer usage may additionally speed up the program.

                                        T3D   T3E-600       T3E-900       T3E-900
                                              (streams off) (streams off) (streams on)
  Execution time                        s     5.86 s        5.19 s        4.39 s
  Speedup in relation to CRAY T3D
  Speedup in relation to CRAY T3E-600

Table 2: 200 iterations with particles on 16 PEs

The upper window in figure 6 shows the sum of the execution times of all user routines on 16 PEs of a CRAY T3D in comparison to a CRAY T3E-600 without stream buffer usage. As mentioned above, the most time-consuming routine is the computation of the particle-particle interactions; this part of the program is about 3 times faster on the CRAY T3E-600. The window below displays the sum of the MPI routines, showing a considerable amount of synchronization overhead (MPI barrier). In figure 7 the effect of the stream buffer usage can be seen. The overhead induced by MPI communication routines is about the same on both CRAY T3E systems; only the amount of barrier synchronization is reduced by 50 % on the CRAY T3E-900. The most time-consuming barrier synchronization is at the beginning of the program: PE 0 has to read the input data and broadcast the appropriate subsets to the other PEs, which have to wait until PE 0 has finished this preparatory work.

The performance counters of PAT give about 90 to 100 million integer operations per second per PE for a large system of particles and 50 iterations of the whole program including I/O on 32 PEs of a CRAY T3E-600, which is about 16 % of the theoretical peak performance. The measured wall-clock time is about 2.5 minutes for the iterations of the simulation.

4 Summary and Discussion

We have studied the performance of a parallel algorithm simulating the size segregation of a binary mixture of granular materials in a half-filled three-dimensional rotating drum. The algorithm has been implemented on the Cray MPP systems CRAY T3D, CRAY T3E-600, and CRAY T3E-900. The measurements on the CRAY T3E-600 have been performed without stream buffer usage, whereas on the CRAY T3E-900 the effect of the stream buffers has been considered as well. The CRAY T3E-600 is about 2.6 times faster than the CRAY T3D for the application described in this paper. Using a CRAY T3E-900 without stream buffers, a speedup of 11 % can be achieved in comparison to the CRAY T3E-600. Furthermore, for this application the stream buffer usage gives an additional speedup of 15 % compared to a CRAY T3E-900 with the stream buffers not activated.
The performance improvement is then about 25 % in comparison to a CRAY T3E-600 with stream buffers disabled. These performance measurements confirm the results which have been observed for other applications and benchmark tests on Cray MPP systems.

Acknowledgements

The authors are grateful to the University of Rostock and the Konrad-Zuse-Zentrum für Informationstechnik Berlin for granting access to their CRAY T3E-900 and CRAY T3D, respectively.
Figure 6: Timings for calculation and MPI overhead on CRAY T3D (upper bars) and CRAY T3E-600 (lower bars) without stream buffer usage, summed over 16 PEs
Figure 7: Timings for calculation and MPI overhead on a CRAY T3E-900 with stream buffers disabled (upper bars) and enabled (lower bars), summed over 16 PEs
References

1. M. P. Allen and D. J. Tildesley, Computer Simulations of Liquids, Clarendon Press, Oxford.
2. D. M. Beazley and P. S. Lomdahl, Message-Passing Multi-Cell Molecular Dynamics on the Connection Machine 5, Parallel Computing 20, 2 (1994).
3. P. A. Cundall and O. D. L. Strack, A discrete numerical model for granular assemblies, Géotechnique 29, 1 (1979).
4. C. M. Dury and G. H. Ristow, Radial Segregation in a Two-Dimensional Rotating Drum, Journal de Physique I France 7 (1997).
5. W. Form, N. Ito, and G. A. Kohring, Vectorized and Parallelized Algorithms for Multi-Million Particle MD-Simulations, Int. J. Mod. Phys. C 4 (1993).
6. R. W. Hockney and J. W. Eastwood, Computer Simulation Using Particles, Adam Hilger, Bristol.
7. R. Knecht and G. A. Kohring, Dynamic Load Balancing for the Simulation of Granular Materials, Proceedings of ICS 95, Barcelona, 3-7 July 1995.
8. Message Passing Interface Forum, MPI: A Message-Passing Interface Standard.
9. T3E overview, obtainable from:
Tutorial 4. Simulation of Flow Development in a Pipe Introduction The purpose of this tutorial is to illustrate the setup and solution of a 3D turbulent fluid flow in a pipe. The pipe networks are common
More informationInfluence of mesh quality and density on numerical calculation of heat exchanger with undulation in herringbone pattern
Influence of mesh quality and density on numerical calculation of heat exchanger with undulation in herringbone pattern Václav Dvořák, Jan Novosád Abstract Research of devices for heat recovery is currently
More informationCFD MODELING FOR PNEUMATIC CONVEYING
CFD MODELING FOR PNEUMATIC CONVEYING Arvind Kumar 1, D.R. Kaushal 2, Navneet Kumar 3 1 Associate Professor YMCAUST, Faridabad 2 Associate Professor, IIT, Delhi 3 Research Scholar IIT, Delhi e-mail: arvindeem@yahoo.co.in
More informationResearch Collection. Localisation of Acoustic Emission in Reinforced Concrete using Heterogeneous Velocity Models. Conference Paper.
Research Collection Conference Paper Localisation of Acoustic Emission in Reinforced Concrete using Heterogeneous Velocity Models Author(s): Gollob, Stephan; Vogel, Thomas Publication Date: 2014 Permanent
More informationAPS Sixth Grade Math District Benchmark Assessment NM Math Standards Alignment
SIXTH GRADE NM STANDARDS Strand: NUMBER AND OPERATIONS Standard: Students will understand numerical concepts and mathematical operations. 5-8 Benchmark N.: Understand numbers, ways of representing numbers,
More informationNIA CFD Futures Conference Hampton, VA; August 2012
Petascale Computing and Similarity Scaling in Turbulence P. K. Yeung Schools of AE, CSE, ME Georgia Tech pk.yeung@ae.gatech.edu NIA CFD Futures Conference Hampton, VA; August 2012 10 2 10 1 10 4 10 5 Supported
More informationPerformance Prediction for Parallel Local Weather Forecast Programs
Performance Prediction for Parallel Local Weather Forecast Programs W. Joppich and H. Mierendorff GMD German National Research Center for Information Technology Institute for Algorithms and Scientific
More informationSIMULATION OF FLOW FIELD AROUND AND INSIDE SCOUR PROTECTION WITH PHYSICAL AND REALISTIC PARTICLE CONFIGURATIONS
XIX International Conference on Water Resources CMWR 2012 University of Illinois at Urbana-Champaign June 17-22, 2012 SIMULATION OF FLOW FIELD AROUND AND INSIDE SCOUR PROTECTION WITH PHYSICAL AND REALISTIC
More informationBenchmark 1.a Investigate and Understand Designated Lab Techniques The student will investigate and understand designated lab techniques.
I. Course Title Parallel Computing 2 II. Course Description Students study parallel programming and visualization in a variety of contexts with an emphasis on underlying and experimental technologies.
More informationSingle Pass Connected Components Analysis
D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected
More informationParallel Computer Architecture II
Parallel Computer Architecture II Stefan Lang Interdisciplinary Center for Scientific Computing (IWR) University of Heidelberg INF 368, Room 532 D-692 Heidelberg phone: 622/54-8264 email: Stefan.Lang@iwr.uni-heidelberg.de
More informationCMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
More informationPeta-Scale Simulations with the HPC Software Framework walberla:
Peta-Scale Simulations with the HPC Software Framework walberla: Massively Parallel AMR for the Lattice Boltzmann Method SIAM PP 2016, Paris April 15, 2016 Florian Schornbaum, Christian Godenschwager,
More informationUsing a Single Rotating Reference Frame
Tutorial 9. Using a Single Rotating Reference Frame Introduction This tutorial considers the flow within a 2D, axisymmetric, co-rotating disk cavity system. Understanding the behavior of such flows is
More informationKinematics of Machines Prof. A. K. Mallik Department of Mechanical Engineering Indian Institute of Technology, Kanpur. Module 10 Lecture 1
Kinematics of Machines Prof. A. K. Mallik Department of Mechanical Engineering Indian Institute of Technology, Kanpur Module 10 Lecture 1 So far, in this course we have discussed planar linkages, which
More informationA geometric algorithm for discrete element method to generate composite materials
A geometric algorithm for discrete element method to generate composite materials J.F. Jerier, F.V. Donzé, D. Imbault & P. Doremus Laboratoire Sols, Solides, Structures, Risques Grenoble, France Jerier@hmg.inpg.fr
More informationUCLA UCLA Previously Published Works
UCLA UCLA Previously Published Works Title Parallel Markov chain Monte Carlo simulations Permalink https://escholarship.org/uc/item/4vh518kv Authors Ren, Ruichao Orkoulas, G. Publication Date 2007-06-01
More informationLecture 2 Parallel Programming Platforms
Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple
More informationLecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationConstrained Diffusion Limited Aggregation in 3 Dimensions
Constrained Diffusion Limited Aggregation in 3 Dimensions Paul Bourke Swinburne University of Technology P. O. Box 218, Hawthorn Melbourne, Vic 3122, Australia. Email: pdb@swin.edu.au Abstract Diffusion
More informationATLAS NOTE. December 4, ATLAS offline reconstruction timing improvements for run-2. The ATLAS Collaboration. Abstract
ATLAS NOTE December 4, 2014 ATLAS offline reconstruction timing improvements for run-2 The ATLAS Collaboration Abstract ATL-SOFT-PUB-2014-004 04/12/2014 From 2013 to 2014 the LHC underwent an upgrade to
More informationAge Related Maths Expectations
Step 1 Times Tables Addition Subtraction Multiplication Division Fractions Decimals Percentage & I can count in 2 s, 5 s and 10 s from 0 to 100 I can add in 1 s using practical resources I can add in 1
More informationSoftware and Performance Engineering for numerical codes on GPU clusters
Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3
More informationComputation of Three-Dimensional Electromagnetic Fields for an Augmented Reality Environment
Excerpt from the Proceedings of the COMSOL Conference 2008 Hannover Computation of Three-Dimensional Electromagnetic Fields for an Augmented Reality Environment André Buchau 1 * and Wolfgang M. Rucker
More informationCHRONO::HPC DISTRIBUTED MEMORY FLUID-SOLID INTERACTION SIMULATIONS. Felipe Gutierrez, Arman Pazouki, and Dan Negrut University of Wisconsin Madison
CHRONO::HPC DISTRIBUTED MEMORY FLUID-SOLID INTERACTION SIMULATIONS Felipe Gutierrez, Arman Pazouki, and Dan Negrut University of Wisconsin Madison Support: Rapid Innovation Fund, U.S. Army TARDEC ASME
More informationpc++/streams: a Library for I/O on Complex Distributed Data-Structures
pc++/streams: a Library for I/O on Complex Distributed Data-Structures Jacob Gotwals Suresh Srinivas Dennis Gannon Department of Computer Science, Lindley Hall 215, Indiana University, Bloomington, IN
More informationCS 475: Parallel Programming Introduction
CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.
More information1 Serial Implementation
Grey Ballard, Razvan Carbunescu, Andrew Gearhart, Mehrzad Tartibi CS267: Homework 2 1 Serial Implementation For n particles, the original code requires O(n 2 ) time because at each time step, the apply
More informationA Chromium Based Viewer for CUMULVS
A Chromium Based Viewer for CUMULVS Submitted to PDPTA 06 Dan Bennett Corresponding Author Department of Mathematics and Computer Science Edinboro University of PA Edinboro, Pennsylvania 16444 Phone: (814)
More informationScope and Sequence for the New Jersey Core Curriculum Content Standards
Scope and Sequence for the New Jersey Core Curriculum Content Standards The following chart provides an overview of where within Prentice Hall Course 3 Mathematics each of the Cumulative Progress Indicators
More informationIntroduction to Parallel Performance Engineering
Introduction to Parallel Performance Engineering Markus Geimer, Brian Wylie Jülich Supercomputing Centre (with content used with permission from tutorials by Bernd Mohr/JSC and Luiz DeRose/Cray) Performance:
More informationENERGY-224 Reservoir Simulation Project Report. Ala Alzayer
ENERGY-224 Reservoir Simulation Project Report Ala Alzayer Autumn Quarter December 3, 2014 Contents 1 Objective 2 2 Governing Equations 2 3 Methodolgy 3 3.1 BlockMesh.........................................
More informationPulsating flow around a stationary cylinder: An experimental study
Proceedings of the 3rd IASME/WSEAS Int. Conf. on FLUID DYNAMICS & AERODYNAMICS, Corfu, Greece, August 2-22, 2 (pp24-244) Pulsating flow around a stationary cylinder: An experimental study A. DOUNI & D.
More informationComparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster G. Jost*, H. Jin*, D. an Mey**,F. Hatay*** *NASA Ames Research Center **Center for Computing and Communication, University of
More informationCHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song
CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS Xiaodong Zhang and Yongsheng Song 1. INTRODUCTION Networks of Workstations (NOW) have become important distributed
More informationParallel Computer Architecture and Programming Written Assignment 3
Parallel Computer Architecture and Programming Written Assignment 3 50 points total. Due Monday, July 17 at the start of class. Problem 1: Message Passing (6 pts) A. (3 pts) You and your friend liked the
More informationUnit 1: Area Find the value of the variable(s). If your answer is not an integer, leave it in simplest radical form.
Name Per Honors Geometry / Algebra II B Midterm Review Packet 018-19 This review packet is a general set of skills that will be assessed on the midterm. This review packet MAY NOT include every possible
More informationData Analytics on RAMCloud
Data Analytics on RAMCloud Jonathan Ellithorpe jdellit@stanford.edu Abstract MapReduce [1] has already become the canonical method for doing large scale data processing. However, for many algorithms including
More informationDomain Decomposition for Colloid Clusters. Pedro Fernando Gómez Fernández
Domain Decomposition for Colloid Clusters Pedro Fernando Gómez Fernández MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2004 Authorship declaration I, Pedro Fernando
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationTwo main topics: `A posteriori (error) control of FEM/FV discretizations with adaptive meshing strategies' `(Iterative) Solution strategies for huge s
. Trends in processor technology and their impact on Numerics for PDE's S. Turek Institut fur Angewandte Mathematik, Universitat Heidelberg Im Neuenheimer Feld 294, 69120 Heidelberg, Germany http://gaia.iwr.uni-heidelberg.de/~ture
More informationSystems Programming and Computer Architecture ( ) Timothy Roscoe
Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture
More informationDistributed Individual-Based Simulation
Distributed Individual-Based Simulation Jiming Liu, Michael B. Dillencourt, Lubomir F. Bic, Daniel Gillen, and Arthur D. Lander University of California Irvine, CA 92697 bic@ics.uci.edu http://www.ics.uci.edu/
More informationWorkloads Programmierung Paralleler und Verteilter Systeme (PPV)
Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment
More informationA Source Localization Technique Based on a Ray-Trace Technique with Optimized Resolution and Limited Computational Costs
Proceedings A Source Localization Technique Based on a Ray-Trace Technique with Optimized Resolution and Limited Computational Costs Yoshikazu Kobayashi 1, *, Kenichi Oda 1 and Katsuya Nakamura 2 1 Department
More information