CLUSTER-BASED MOLECULAR DYNAMICS PARALLEL SIMULATION IN THERMOPHYSICS


JIWU SHU (1,*), BING WANG (1), JINZHAO WANG (2), MIN CHEN (2), WEIMIN ZHENG (1)
(1) Dept. of Computer Science and Technology, Tsinghua Univ., Beijing, 100084
(2) Dept. of Engineering Mechanics, Tsinghua Univ., Beijing, 100084
(*) Correspondence should be addressed to Jiwu Shu (shujw@tsinghua.edu.cn)

ABSTRACT
Molecular dynamics simulation is an important method for research in thermophysics. However, traditional serial algorithms struggle with such simulations because of the heavy numerical computation involved. This paper proposes a cluster-based spatial decomposition algorithm for large-scale Molecular Dynamics simulation in thermophysics. With an efficient domain decomposition strategy and a fast method for locating neighboring particles, we greatly reduce the computation and communication cost and successfully carry out the MD simulation for the PVT property calculation of a large-scale system with 4,000,000 particles. The spatial decomposition algorithm is implemented on 125 processors with a speedup (Sp) of 86.4 and an efficiency (E) of 69.1%. The numerical results indicate that the proposed parallel algorithm can simulate a thermophysical system with many more particles than before, and can provide a more efficient way to simulate thermophysical problems by computer.

KEY WORDS
Parallel computing, Molecular Dynamics, Cluster, Thermophysics.

1. INTRODUCTION
Recently, more and more researchers have been attracted by microscale problems in thermophysics, which include atomic beam bombardment, liquid-vapor interfaces and nucleation. These problems occur on microscopic time and space scales, so their processes are very difficult to control. Up to now, no traditional experimental method can measure these processes directly and accurately [1].
Molecular Dynamics (MD) simulation provides a new method for research in microscale thermophysics, by which researchers can understand these problems at the level of molecules and atoms. The basic principle of MD is to solve Newton's equations for the particles in the simulated system, simulate the microscale evolution of the system, and then take statistics of its macroscopic parameters. MD simulation has two important characteristics when it is used to study microscale thermophysics. Firstly, the simulation must handle a great number of particles; secondly, the simulation must run for a very large number of time steps (say, millions or tens of millions of steps). Furthermore, MD simulation must solve the kinetic equations of many particles, which requires complex numerical calculation. All of these factors lead to an enormous amount of computation, especially when MD simulation is applied to a very large system. Researchers have spent much time simplifying the MD model and improving MD algorithms, but these fast methods cannot provide a satisfactory solution for the MD simulation of large-scale systems. A thorough solution must rely on parallel computer systems and on efficient, scalable parallel algorithms. Up to now, however, only a little work has been done to apply parallel MD simulation to microscale thermophysics research, and the parallel algorithms are not well developed. Designing an efficient and scalable parallel MD simulation algorithm is therefore very useful to the research of microscale thermophysics. In recent years, Cluster technology has been well developed and commodity components have been widely used, so the price-to-performance ratio of a Cluster is much lower than that of PVP and MPP systems.
Cluster systems are widely used internationally in scientific and engineering computing. This paper proposes a new spatial decomposition MD algorithm for Cluster systems, which can be used efficiently to simulate a large-scale multi-particle thermophysical system. Firstly, we provide three different domain division strategies and analyze their efficiency and scalability. Secondly, we propose a method called FLNP (Fast Location of Neighboring Particles) to accelerate the location of neighboring particles, which greatly reduces the cost of interaction calculation and communication. The FLNP method has the advantages of both LinkCell and NeighborList and performs well in practical simulation. With the above algorithm and strategies, we simulate a large system with 4,000,000 particles and get satisfactory

results on the PVT property calculation of this system.

2. MD SIMULATION MODEL
In our work, we simulate a multi-particle system and calculate its PVT properties. The N particles are simulated in a 3D cubic space with periodic boundary conditions at the state point defined by a reduced density ρ* and a reduced temperature T*. The simulation begins with the particles on an fcc lattice with randomized velocities. A roughly uniform spatial density persists for the duration of the simulation. The simulation is run at constant N, volume V and temperature T, a statistical sampling from the canonical ensemble. The computational task in MD simulation is to solve Newton's equations [2], given by

$$ F_i(t) = m_i \frac{dv_i}{dt} = \sum_j F_2(r_i, r_j) + \sum_j \sum_k F_3(r_i, r_j, r_k) + \cdots, \qquad \frac{dr_i}{dt} = v_i \qquad (1) $$

where m_i is the mass of particle i, and r_i and v_i are its position and velocity vectors. F_2 is a force function describing pair-wise interactions. F_k (k >= 3) describes the multi-body interactions, which are ignored in our simulation. The most time-consuming part of MD simulation is the calculation of interactions, which usually requires 90% of the total simulation time. The force terms in Equation (1) are typically non-linear functions of the distance between particle i and the other particles. In our simulation, the interaction is modeled with the Lennard-Jones potential energy

$$ \varphi(r) = 4\varepsilon \left[ \left( \frac{\sigma}{r} \right)^{12} - \left( \frac{\sigma}{r} \right)^{6} \right] \qquad (2) $$

where r is the distance between two interacting particles, and ε and σ are constants. In a long-range model, each particle interacts with all the other (N-1) particles, which leads to a computational complexity of O(N^2). But many physical problems can be modeled with short-range interactions, that is, the summations in Equation (1) can be restricted to particles within some small region surrounding particle i. We implement this with a cutoff distance r_c, beyond which all interactions are ignored. In this case, the complexity of the interaction calculation is reduced to O(N).
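As an illustration of the short-range model, the following is a minimal sketch of the Lennard-Jones potential, the corresponding radial force, and a brute-force O(N^2) pair sum with a cutoff. The function names, the reduced units (ε = σ = 1 by default) and the minimum-image handling of the periodic box are assumptions made for this sketch, not details taken from the paper's code.

```python
import math


def lj_potential(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones potential 4*eps*[(sigma/r)^12 - (sigma/r)^6]."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_force_magnitude(r, epsilon=1.0, sigma=1.0):
    """Radial force -d(phi)/dr; positive values are repulsive."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r


def pair_energy_with_cutoff(positions, box, r_c, epsilon=1.0, sigma=1.0):
    """Brute-force reference: sum phi over all pairs closer than r_c.

    positions is a list of (x, y, z) tuples in a cubic periodic box of
    side `box`; distances use the minimum-image convention.
    """
    total = 0.0
    n = len(positions)
    for i in range(n - 1):
        for j in range(i + 1, n):
            d2 = 0.0
            for a in range(3):
                d = positions[i][a] - positions[j][a]
                d -= box * round(d / box)  # nearest periodic image
                d2 += d * d
            if d2 < r_c * r_c:
                total += lj_potential(math.sqrt(d2), epsilon, sigma)
    return total
```

This reference loop still visits every pair; the cutoff only removes the potential evaluations. Reaching O(N) additionally requires a way to avoid visiting distant pairs at all.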
In our simulation, the cutoff distance is 4.0σ. How to minimize the number of neighboring particles that must be checked for possible interactions is an important problem, which can greatly influence the speed of short-range MD simulation.

3. PARALLEL ALGORITHM AND OPTIMIZING STRATEGY
In the past twenty years, researchers have developed three classes of parallel MD simulation algorithms. The first class is called Atom Decomposition (AD) [3]. This kind of algorithm gives a pre-determined subgroup of particles to each processor, and each processor calculates the interactions on its own particles and updates their velocities and positions. The biggest shortcoming of this algorithm is its enormous memory requirement, because each processor must maintain the positions of all the particles. The AD algorithm performs well only for MD systems with a small number of particles on shared-memory machines. The second class is the Force Decomposition (FD) algorithm [3], in which each processor is assigned two subgroups of particles and calculates the interactions between these two groups. This kind of algorithm does not need the positions of all the particles, so it requires much less memory than the AD algorithm. But the FD algorithm cannot maintain load balance as easily as the AD algorithm can; it achieves a good load balance only when the force matrix has a uniform sparse structure. The last class is Spatial Decomposition (SD) [4]. The whole simulating domain is divided into as many sub-domains as there are processors, and each sub-domain is assigned to one processor. Each processor computes only the forces on the particles in its sub-domain. The main benefit of SD is that it takes full advantage of the local nature of the inter-particle forces and performs only local communication. Thus, in large-scale MD simulation, it achieves optimal O(N/P) scaling and performs better on a Cluster than the AD and FD algorithms.
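The ownership rule of Spatial Decomposition can be sketched as follows, assuming a cubic periodic box cut into a regular px × py × pz grid of equal sub-domains with row-major rank numbering. The function names and the numbering scheme are hypothetical choices for this sketch; the paper does not specify them.

```python
def owner_rank(pos, box, grid):
    """Rank of the processor whose sub-domain contains position `pos`.

    box  : side length of the cubic periodic simulation domain
    grid : (px, py, pz) processor grid; (P, 1, 1) gives slabs,
           (a, b, 1) gives columns, (a, b, c) gives small boxes
    """
    idx = []
    for x, p in zip(pos, grid):
        x = x % box                # wrap into the periodic box
        i = int(x / box * p)       # index of the sub-domain along this axis
        idx.append(min(i, p - 1))  # guard against round-off at the upper edge
    ix, iy, iz = idx
    _, py, pz = grid
    return (ix * py + iy) * pz + iz


def partition(positions, box, grid):
    """Group particle indices by owning rank."""
    px, py, pz = grid
    owned = [[] for _ in range(px * py * pz)]
    for n, pos in enumerate(positions):
        owned[owner_rank(pos, box, grid)].append(n)
    return owned
```

With a roughly uniform density each rank receives about N/P particles, which is where the O(N/P) scaling of the force computation comes from.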
Therefore, we chose the SD algorithm as our simulation method and propose three kinds of domain division strategies [5] to make the SD algorithm more efficient and more scalable.

3.1 Domain division strategies

Figure 1. Domain division strategies: (1.a) 1-dimension, (1.b) 2-dimension, (1.c) 3-dimension.

Figure 1 shows three typical strategies of domain division. For convenience of discussion, suppose the whole simulating domain is divided into n sub-domains Σ_i (i = 1, 2, ..., n) and the number of processors is also n, denoted P_i (i = 1, 2, ..., n). The n sub-domains are assigned to the n processors one to one. That is, processor P_i computes the interactions on the particles in sub-domain Σ_i and updates their positions and velocities. Below we discuss the

differences of these three strategies in load balance, communication and scalability.

Firstly, we discuss the load balance of the three strategies. The 1-dimension division shown in Figure (1.a) achieves the best load balance, for two reasons. (1) Load imbalance is caused mainly by the non-uniformity of particle density, which affects the 1-dimension division least of the three strategies. Suppose the simulating domain extends along x, y and z, and the 1-dimension division in Figure (1.a) is made along the x dimension. Then only non-uniformity of the particle density in the x direction can influence the load balance of the 1-dimension division. In contrast, the load balance of the 3-dimension division in Figure (1.c) can be greatly influenced no matter which dimension the non-uniformity occurs in. (2) The algorithm with the 1-dimension division strategy can implement dynamic load balancing more easily than the one with the 3-dimension division strategy. The communication structure of the 1-dimension division is simpler than that of the 3-dimension division, so the algorithm with the 1-dimension division can easily re-divide the sub-domains locally or globally when the particle density changes. The algorithm with the 3-dimension division, on the other hand, cannot easily implement dynamic load balancing because of its complex communication.

The communication cost of a parallel algorithm is determined mainly by two factors: how much data must be transferred, and how many times communication happens. The smaller the data volume and the fewer the communications, the lower the communication cost. There are two kinds of communication in the Spatial Decomposition algorithm. (1) When a particle moves from sub-domain Σ_i to sub-domain Σ_j, processor P_i must send all the information of this particle to processor P_j. This kind of communication is usually called particle move, and is illustrated in Figure (2.a).
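The particle-move step can be sketched for the 1-dimension division as follows. The sketch assumes equal-width slabs along x, a periodic ring of processors, unwrapped x coordinates, and particles that travel less than one slab width per time step; the function name and return layout are hypothetical.

```python
def migrate_1d(x_coords, rank, n_procs, box):
    """Classify the particles owned by `rank` after a position update.

    The sub-domain of rank k is the slab [k*w, (k+1)*w) with w = box / n_procs.
    x_coords holds the unwrapped x position of each owned particle, so a
    particle leaving through the lower face of rank 0 has x < 0.
    Returns (stay, (left_rank, to_left), (right_rank, to_right)),
    where the lists contain local particle indices.
    """
    w = box / n_procs
    lo, hi = rank * w, (rank + 1) * w
    stay, to_left, to_right = [], [], []
    for n, x in enumerate(x_coords):
        if lo <= x < hi:
            stay.append(n)
        elif x < lo:
            to_left.append(n)    # crossed the lower face of the slab
        else:
            to_right.append(n)   # crossed the upper face of the slab
    left_rank = (rank - 1) % n_procs   # periodic neighbors
    right_rank = (rank + 1) % n_procs
    return stay, (left_rank, to_left), (right_rank, to_right)
```

In a message-passing implementation the two outgoing lists would each be packed into one message to the corresponding neighbor, and the receiver would wrap the coordinates back into the periodic box.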
The communication of particle move is simple because it always happens between neighboring processors. (2) The calculation of the interactions on particles located near the boundary of a sub-domain requires the positions of other particles that may belong to another processor, which leads to an exchange of particle positions called boundary copy, illustrated in Figure (2.b). In short-range MD simulation, the exchange involves only those particles whose distance to the boundary is within the cutoff distance r_c.

Figure 2. Communications in the parallel SD algorithm: (2.a) particle move, (2.b) boundary copy.

We compare the communication cost of the three domain division strategies in terms of both the number of communications and the communication data volume. (1) Under the 1-dimension division, processor P_i need only communicate with its two neighboring processors P_{i-1} and P_{i+1}, and there are at most two communications in each time step. Under the 2-dimension division, processor P_i must communicate with 8 neighboring processors, which requires 4 communications even when using the fold [3] technique. Under the 3-dimension division, the number of neighboring processors is 26 and the number of communications is 6. Thus, if we consider only the number of communications, the 1-dimension division has the lowest communication cost and the 3-dimension division the highest. (2) The main task of communication is boundary copy, so we analyze the data volume of boundary copy. For convenience of discussion, suppose that the simulated system has uniform density. The data volume can then be expressed as

$$ C_1 = 2\rho \left( \frac{N}{\rho} \right)^{2/3}, \qquad C_2 = \frac{4\rho}{P^{1/2}} \left( \frac{N}{\rho} \right)^{2/3}, \qquad C_3 = \frac{6\rho}{P^{2/3}} \left( \frac{N}{\rho} \right)^{2/3} \qquad (4) $$

where C_1, C_2 and C_3 are the communication data volumes per processor under the 1-dimension, 2-dimension and 3-dimension division strategies respectively, N is the number of particles in the simulated system, P is the number of processors in the Cluster system and ρ is the number of particles in each box.
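A small numerical sketch of how the boundary-copy volume depends on the division strategy, assuming per-processor volumes of the face-counting form 2ρ(N/ρ)^(2/3), 4ρ(N/ρ)^(2/3)/P^(1/2) and 6ρ(N/ρ)^(2/3)/P^(2/3) for the 1-, 2- and 3-dimension divisions. These forms ignore edge and corner boxes, and the function name is hypothetical.

```python
def boundary_copy_volume(n_particles, rho, p):
    """Estimated particles copied per processor and time step.

    n_particles : N, number of particles in the system
    rho         : number of particles per box
    p           : number of processors
    Returns (c1, c2, c3) for the 1-, 2- and 3-dimension divisions.
    """
    face = (n_particles / rho) ** (2.0 / 3.0)  # boxes on one full face of the domain
    c1 = 2.0 * rho * face                      # 2 full faces per slab
    c2 = 4.0 * rho * face / p ** 0.5           # 4 faces, each 1/sqrt(P) of a full face
    c3 = 6.0 * rho * face / p ** (2.0 / 3.0)   # 6 faces, each 1/P^(2/3) of a full face
    return c1, c2, c3


if __name__ == "__main__":
    for procs in (8, 27, 64, 125):
        print(procs, boundary_copy_volume(4_000_000, 50, procs))
```

Under these assumptions c1 does not change with P, while c2 and c3 shrink as processors are added.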
We have

$$ C_1 : C_2 : C_3 = 1 : \frac{2}{P^{1/2}} : \frac{3}{P^{2/3}} \qquad (5) $$

In particular, when P is large enough (from ratio (5), 2/P^{1/2} > 3/P^{2/3} requires P > (3/2)^6 ≈ 11.4, a condition that is easily achieved), we have

$$ C_1 > C_2 > C_3 \qquad (6) $$

Equation (6) shows that the communication data volume of the 3-dimension division is the least and that of the 1-dimension division the greatest. Further experimental results show that the communication data volume is the dominant factor in the total communication cost of large-scale parallel MD simulation. Thus the total communication cost of the 3-dimension division is the lowest of the three domain division strategies.

Finally, we compare the scalability of the three strategies. Generally speaking, the 3-dimension division strategy performs better than the 1-dimension division. Two facts lead to this conclusion. (1) Equation (4) shows that as P becomes larger, the communication cost falls rapidly with the 3-dimension division but remains constant with the 1-dimension division. Thus, as more and more processors are used, the algorithm with the 1-dimension division becomes more and more inefficient. The 1-dimension division strategy limits the scalability of the parallel algorithm. (2) When N is fixed, the number of sub-domains in SD algorithms is limited by the boundary copy. In detail, a sub-domain must be longer than an individual box in the dividing direction; otherwise the communication becomes more complex. The length of a box is r_c, or r_s when using the FLNP method, which is discussed below. Because the 1-dimension strategy can divide in only one direction, its number of sub-domains is limited most severely. So the maximal number of processors that can be used with the 1-dimension division is much smaller than with the 2-dimension and 3-dimension divisions. From the discussion in (1) and (2), we conclude that the algorithm with the 3-dimension division strategy is the most scalable MD simulation algorithm on a Cluster.

From the above discussion, we conclude that when the load balance is well maintained, the 3-dimension division is the best domain decomposition strategy for parallel MD simulation on a Cluster.

3.2 The FLNP method
In short-range MD simulation of a system with N particles, we need not check all of the other (N-1) particles to calculate the force on particle i, because only those particles within the cutoff distance r_c contribute to the force on particle i. Two basic techniques are used to accomplish this. In the first, the LinkCell [3] method, the simulating domain is divided into many 3D cells of side length d, where d is equal to r_c or slightly larger, as illustrated in Figure 3, and each particle is mapped to a cell. This reduces the task of finding the neighbors of a given particle to checking 27 cells, that is, the cell that contains the particle and the 26 surrounding ones. Since mapping the particles to cells requires only O(N) work, the original O(N^2) work required by the force calculation is greatly reduced.

Figure 3. LinkCell (cells of side length r_c).

The other technique used for speeding up MD calculation is known as NeighborList.
For each particle, a neighboring particle list is maintained, which includes all of the particles that may contribute to the force on the given particle, as illustrated in Figure 4. When the list is built, all of the nearby particles within an extended cutoff distance r_s = r_c + δ are stored. The list is used to calculate interactions for a few time steps. Then, before any particle could have moved from a distance r > r_s to r < r_c, the list is rebuilt. The advantage of the NeighborList method is that once the list has been built, checking the possible neighboring particles in the list is much faster than checking all the particles in the simulated system. However, building and rebuilding the list still requires checking all of the simulated particles.

Figure 4. NeighborList (cutoff distance r_c and extended cutoff distance r_s).

Based on the analysis of LinkCell and NeighborList, we propose a new speedup technique called FLNP (Fast Location of Neighboring Particles). With this method, the whole simulated domain is divided into many cells with side length r_s, not r_c, and at the same time a neighbor list is maintained for each particle. The new method has obvious advantages over the basic LinkCell and NeighborList techniques. Firstly, compared to the LinkCell method, it reduces the number of particles that must be checked, because there are far fewer particles in a sphere of volume (4/3)π r_s^3 than in a cube of volume 27 r_c^3. Secondly, compared to the basic NeighborList method, there is a significant saving when the list is rebuilt, because the checking volume is reduced from the whole simulated domain to 27 r_s^3.

δ, which determines the relation between r_c and r_s, is an important parameter of the FLNP method and can significantly influence the efficiency of an algorithm that uses FLNP. If δ is too large, the volume of particle checking is enlarged, so the force calculation time increases.
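The combination of cells of side r_s with per-particle neighbor lists can be sketched serially as follows. This is a minimal single-processor illustration, not the authors' implementation: the function names, the half-list convention (each pair stored once, under the lower index) and the rebuild test (rebuild once twice the largest displacement since the last build reaches δ, a common conservative criterion) are assumptions made for this sketch.

```python
import itertools
import math


def min_image_dist2(p, q, box):
    """Squared distance between p and q under the minimum-image convention."""
    d2 = 0.0
    for a, b in zip(p, q):
        d = a - b
        d -= box * round(d / box)
        d2 += d * d
    return d2


def build_neighbor_lists(positions, box, r_c, delta):
    """Neighbor lists within r_s = r_c + delta, searched through cells of side >= r_s."""
    r_s = r_c + delta
    n_side = max(1, int(box / r_s))  # cells per axis; cell side = box / n_side
    cells = {}
    for n, pos in enumerate(positions):
        key = tuple(min(int((x % box) / box * n_side), n_side - 1) for x in pos)
        cells.setdefault(key, []).append(n)
    neighbors = [[] for _ in positions]
    for key, members in cells.items():
        # The cell itself plus its 26 periodic neighbors (duplicates removed
        # when the box holds fewer than 3 cells per axis).
        near = {tuple((k + o) % n_side for k, o in zip(key, off))
                for off in itertools.product((-1, 0, 1), repeat=3)}
        for i in members:
            for ckey in near:
                for j in cells.get(ckey, ()):
                    if j > i and min_image_dist2(positions[i], positions[j], box) < r_s * r_s:
                        neighbors[i].append(j)
            neighbors[i].sort()
    return neighbors


def needs_rebuild(positions, positions_at_build, box, delta):
    """True once two particles could have closed the gap delta between r_s and r_c."""
    max_d2 = max(min_image_dist2(p, q, box)
                 for p, q in zip(positions, positions_at_build))
    return 2.0 * math.sqrt(max_d2) >= delta
```

Between rebuilds the force loop only walks `neighbors[i]` and applies the r_c test to each stored pair, so the cell search cost is paid once every few time steps rather than at every step.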
On the other hand, if δ is too small, the neighbor list has to be rebuilt frequently, so the advantage of NeighborList is wasted and the efficiency of the algorithm is reduced. Although δ is always chosen to be small relative to r_c, the optimal value depends on the parameters (e.g. temperature, diffusivity, density) of the particular simulation.

The FLNP method not only greatly decreases the volume of particle checking, but also reduces the calculation and communication cost due to particle move and boundary copy, for two main reasons. (1) FLNP maintains a neighbor list for each particle, which stores all of the particles that can possibly contribute to the force calculation. In algorithms that do not use FLNP, the neighbor list has to be rebuilt when particles near the boundary move from one sub-domain to another. In algorithms that use FLNP, the rebuild can be postponed as long as these moving particles do not come within the extended cutoff distance of other particles. So the communication cost of particle move can be reduced to some extent. (2) When copying boundaries, processors must check which particles are near the boundary and must be

sent to the neighboring processor. In algorithms that do not use FLNP, this work must be done at every time step. In algorithms that use FLNP, however, it needs to be done only once every few time steps, when the neighbor lists are rebuilt. During all the other time steps, we simply send the latest positions of the particles that were identified as boundary particles at the last rebuild.

4. RESULTS AND ANALYSIS
The parallel MD algorithm of Section 3 was tested on our Cluster system. This Cluster is made up of 6 SMP nodes. Each node has 4 Intel Xeon PIII700 CPUs, 6 Gbytes of hard disk, and Gbytes of memory. The SMP nodes communicate through a Myrinet switch with a bandwidth of 2.56 Gb/s. The software environment is Redhat Linux 7.2 (SMP kernel), MPICH-.2.7 and gm-.5pre4, the network protocol running on Myrinet.

4.1 Comparison of Domain Division Strategies

Figure 5. Speedup (Sp) and efficiency (E, %) versus number of CPUs: (5.a) 1-dimension division, (5.b) 2-dimension division, (5.c) 3-dimension division.

Figures (5.a), (5.b) and (5.c) show the performance curves of the three domain division strategies. Generally speaking, the algorithm with the 3-dimension division achieves the highest performance and the one with the 1-dimension division the lowest. Figure 5 also shows that the three domain division strategies have similar parallel efficiency when P is small (say, P up to 9 processors). As more and more processors are used, the efficiency of the 1-dimension division drops quickly, whereas the efficiency of the 2-dimension and 3-dimension divisions declines only slightly. We conclude that the algorithm with the 3-dimension division is the most efficient and most scalable for parallel MD simulation on a Cluster system. The algorithm with the 2-dimension division also scales well, but it is less efficient than the one with the 3-dimension division.
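For reference, the two metrics plotted in Figure 5 can be computed as in the sketch below, assuming the usual definitions Sp = T_1 / T_P and E = Sp / P. The function name and the timings in the usage example are hypothetical and are not measurements from the paper.

```python
def speedup_and_efficiency(t_serial, t_parallel, n_procs):
    """Return (Sp, E in percent) from wall-clock times per time step.

    t_serial   : time of the one-processor run
    t_parallel : time of the run on n_procs processors
    """
    sp = t_serial / t_parallel
    efficiency = 100.0 * sp / n_procs
    return sp, efficiency


if __name__ == "__main__":
    # Hypothetical timings: 100 s serial, 2 s on 64 processors.
    print(speedup_and_efficiency(100.0, 2.0, 64))
```

An efficiency well below 100% at large P reflects the share of each time step spent in communication and load imbalance rather than in force calculation.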
The algorithm with the 1-dimension division is the worst because of its poor efficiency and scalability; it provides acceptable performance only when P is small.

4.2 Influence of FLNP on parallel efficiency
In Figure 6, we plot the computing time per step of the 3-dimension algorithm under different values of δ. The number of processors is 8 and the number of particles is 4,000,000. The experimental result shows that the FLNP method improves the speed of the parallel algorithm much more than the two basic techniques, LinkCell and NeighborList. Firstly, LinkCell corresponds to the result obtained when δ equals zero in Figure 6, which shows that the algorithm with FLNP is about twice as fast as the one with LinkCell. Secondly, the basic NeighborList technique cannot be used on its own for MD simulation of a system as large as 4,000,000 particles. In fact, when 8 processors are used, each processor must handle 500,000 particles on average. If the basic NeighborList technique were used, building the neighbor list once would take dozens of hours.

Figure 6. CPU timings (ms / time step) under different values of δ with FLNP.

The result also shows that the value of δ influences the speed of the parallel algorithm, so δ must be chosen carefully. The optimal value of δ for our simulation is

in the range [0.4σ, 0.5σ].

5. CONCLUSION
We design and implement a Cluster-based spatial decomposition algorithm that is suitable for large-scale MD simulation of microscale thermophysical problems. Firstly, we eliminate inefficient global communication from our algorithm by exploiting the local nature of MD simulation. Secondly, we propose three kinds of domain division strategies, which provide different efficiency and scalability. Both the theoretical analysis and the experimental results show that the 3-dimension domain division is the best one, especially when the load balance is well maintained, and that spatial decomposition with this strategy is well suited to large-scale MD simulation on a Cluster because of its scalability and high efficiency.

Another important optimizing strategy in short-range MD simulation is to minimize the number of neighboring particles that must be checked for possible interactions. This paper proposes and implements a new method called Fast Location of Neighboring Particles (FLNP), which combines the benefits of both LinkCell and NeighborList and can greatly accelerate the calculation of interactions. δ is the most important parameter of this new method and can greatly influence the efficiency of the parallel algorithm.

In the MD simulation of thermophysical problems, it is important to maintain the load balance of the parallel SD algorithm. In the future, we will improve the load balance strategy and make the SD algorithm applicable to all kinds of thermophysical MD simulations.

ACKNOWLEDGEMENT
This work was supported by the Foundation Research Fund from Tsinghua University of China (Grant No. Jc200024).

REFERENCES
[1] Chou F. C., Lukes J. R., Liang X. G., et al., Molecular Dynamics in Microscale Thermophysical Engineering, Annual Review of Heat Transfer, 10, 1999, 141-176
[2] M. Pütz, A. Kolb, Optimization techniques for parallel molecular dynamics using domain decomposition, Computer Physics Communications, (2-3), 1998
[3] S. Plimpton, Fast parallel algorithms for short-range molecular dynamics, Journal of Computational Physics, 117(1), 1995, 1-19
[4] Koradi R., Billeter M., Güntert P., Point-centered domain decomposition for parallel molecular dynamics simulation, Computer Physics Communications, 124(2-3), 2000, 139-147
[5] Ryoko Hayashi, Susumu Horiguchi, Parallel molecular dynamics simulations of polymers (in Japanese), Transactions of Information Processing Society of Japan, 39(6), 1998
