A Fully Concurrent DSMC Implementation with Adaptive Domain Decomposition

C.D. Robinson and J.K. Harvey
Department of Aeronautics, Imperial College, London, SW7 2BY, U.K.
(CDR gratefully acknowledges the support of DERA for this work.)

A concurrent implementation of the direct simulation Monte Carlo method (DSMC) for the solution of complex gas flows, coupled with an adaptive domain decomposition algorithm for unstructured meshes, is described. An example indicates that use of the dynamic domain decomposition technique significantly increases the parallel efficiency of the DSMC code. A clear direction for further work is indicated.

1. Introduction

A single solution method for gas flows ranging from the continuum to the rarefied regime would be of great use to engineers and scientists. The direct simulation Monte Carlo method (DSMC) has this capability, but it is computationally expensive. Until recently, computer power was insufficient to apply the method to gases which were not rarefied. Multi-processor (MP) computers provide an answer to this problem.

DSMC is a particle-based gas simulation method in which a computer is used to track simulator molecules. The particles are phenomenological models of the molecules in the real gas being computed. A mesh is used to discretise the flowfield, and consequently flows around complex bodies can be simulated if the correct meshing technique is used.

A popular method of parallelising DSMC is the use of a spatial mesh decomposition over the processor array of an MP machine [5], [3]. The method then conforms to the single program, multiple data (SPMD) paradigm, and the only addition required to the original serial algorithm is the inclusion of message passing for simulators which cross the sub-domain boundaries. The computational load exerted by DSMC on a processor depends largely on the number of particles simulated upon it. As the particles are free to move throughout the flowfield, the load across the processor array will be unbalanced during at least part of the computation, leading to inefficient use of computational resources. Since there is no way to predict the distribution of particles a priori, an automatic adaptive scheme that balances the load during the runtime of the DSMC computation is clearly required. As domain decomposition is used, redistribution of load is equivalent to altering the mesh decomposition over the processors.
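To make the SPMD structure concrete, the sketch below shows the shape of one such parallel timestep: the serial DSMC phases are unchanged and the only parallel addition is the exchange of simulators that cross sub-domain boundaries. This is a minimal illustration with hypothetical routine names (move_particles, exchange_with_neighbours, collide, sample), not the authors' code.

    # Minimal sketch of one SPMD DSMC timestep (hypothetical names, not the
    # authors' code): every processor runs the same loop on its own sub-domain.
    def dsmc_timestep(subdomain, particles, dt):
        # Move phase: free flight of all simulators over the timestep; any
        # simulator ending up in a halo cell is flagged as a boundary crosser.
        crossers = move_particles(particles, subdomain, dt)
        # The only parallel step: hand boundary crossers to the neighbouring
        # processors and absorb the ones arriving from them.
        particles.extend(exchange_with_neighbours(crossers))
        # Collision phase: cell by cell, entirely local to the sub-domain.
        collide(particles, subdomain.cells, dt)
        # Sample macroscopic quantities (density, velocity, temperature).
        sample(particles, subdomain.cells)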

2. DSMC Implementation

The key assumption behind the DSMC method, due largely to Bird [1], is the splitting of the movement and collisions of the simulators over a small timestep. The simulators are allowed to move freely over the timestep. Once the move phase is completed and boundary interactions have been computed, a number of collisions are calculated such that the collision rate in the simulated volume of gas is commensurate with that in the gas being modelled. The collision partners are chosen by taking pairs of particles at random which are in close spatial proximity. The cell structure is usually used for the definition of this proximity, with only pairs of particles from the same cell being considered as potential collision partners. Pairs of simulators are accepted for collision based on appropriate probabilities and the collision is modelled using standard two-body mechanics. Splitting of collisions and movement is valid if the modelled gas is dilute, that is, if the mean spacing between molecules is much greater than the size of the molecules.

DSMC has the advantage that it is applicable to flows which are not in thermodynamic or chemical equilibrium. Chemical reactions are naturally treated within the collision framework, and pose no numerical difficulties other than an increased computational load. A computation is usually started impulsively and advances in real time via time-stepping, eventually reaching a steady state if appropriate boundary conditions are specified.

A parallel DSMC system has been developed using the domain-decomposition and SPMD paradigms. The program consists of four classes of routines: physical modelling, geometric modelling, management and parallel. There are also several pre- and post-processing elements. Each of these classes is functionally independent of the others, with interfacing carried out by the management routines. This approach aids code maintainability and allows ease of scalability, with no modifications to the physical routines being required to run them in a parallel environment. The codes are able to run in a serial or parallel environment without source code modifications. Unstructured meshes are used because they can represent complex geometries readily. Generation of these meshes requires minimal effort on the part of the user and they can also adapt automatically to flow features such as shock waves. The code has been used to simulate a number of flows ranging from hypersonic shock/boundary-layer interaction [7] through to low-speed flow instabilities [10] and modelling of the flow in a chemical vapour deposition reactor chamber used for epitaxial silicon growth.

The meshed computational domain is decomposed over the number of processors required for the calculation. Each processor possesses a sub-domain on which the DSMC computation is run. Each sub-domain consists of a computational domain and a one-deep halo layer of cells which surrounds the computational domain. The halo cells are in the computational domain of neighbouring processors. The interface between the computational domain and the halo layer represents an inter-processor boundary (IPB). The geometric data is localised for each sub-domain and each processor knows nothing of the domain outside of its halo. Message passing for the codes is implemented using MPI primitives. The only message passing required for the DSMC calculation is related to particles crossing IPBs. All particles crossing IPBs in one timestep are collected into a group for sending en masse rather than individually. This message passing at each timestep ensures the calculation progresses in a synchronous manner across the processor array. If the timestep is of the correct order, particles should not move more than one cell width in the interval. Consequently, if a particle crosses an IPB during a timestep it is generally possible to identify its final position in the halo. Sure knowledge of the simulator's destination significantly reduces communication overhead. Outgoing particles are sorted according to the destination processor and sent using non-blocking, localised MPI routines. Use of this technique enhances scalability.
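As an illustration of this exchange, the sketch below groups the outgoing simulators by destination rank and posts one non-blocking send per neighbour, matching the en-masse strategy described above. It is a minimal sketch using mpi4py with hypothetical particle and neighbour structures, not the authors' MPI code.

    # Sketch only: group boundary-crossing particles by destination rank and
    # exchange them with one message per neighbour (mpi4py, hypothetical data).
    from collections import defaultdict
    from mpi4py import MPI

    def exchange_boundary_crossers(comm, crossers, neighbour_ranks):
        # crossers: list of (dest_rank, particle_state) pairs for simulators
        # that ended the move phase in a halo cell owned by a neighbour.
        outgoing = defaultdict(list)
        for dest, particle in crossers:
            outgoing[dest].append(particle)

        # One non-blocking send per neighbour (possibly an empty list).
        send_reqs = [comm.isend(outgoing[nbr], dest=nbr, tag=0)
                     for nbr in neighbour_ranks]

        # Receive from every neighbour; all ranks take part once per timestep,
        # which keeps the calculation synchronous across the processor array.
        received = []
        for nbr in neighbour_ranks:
            received.extend(comm.recv(source=nbr, tag=0))

        MPI.Request.waitall(send_reqs)
        return received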

3. Dynamic Mesh Partitioning

The mesh decomposition problem is usually approached as a graph partitioning exercise. A mesh is represented as an undirected graph G(V, E), where V is the set of vertices v in the graph and E is the set of edges connecting them. Each vertex can be given a weight w, which is an indication of the computational load associated with it. The vertices in the graph are assigned to different processors by the partition function P. This divides the domain into smaller sub-domains and induces a contracted sub-domain graph G^(s)(P, ε), where P is the set of sub-domains p induced by P and ε is the set of edges connecting the sub-domains. An edge in G is termed cut, and denoted E_c, if it joins two vertices v and u in V such that P(v) = p and P(u) = q, where p and q are different sub-domains in P. The edge between two different sub-domains p and q is defined as ε_pq = ∪ E_c,pq, where E_c,pq is a cut edge connecting a vertex v belonging to p and a vertex u belonging to q. The aim is to partition the graph such that the total number of cut edges, Σ E_c, is minimised while the weight of each sub-domain is approximately equal. This problem, in the static case, is known to be NP-complete, and so approximate heuristic procedures are always employed to obtain a near-optimal solution [4], [12], [2]. In the case of a dynamic loading scenario the static problem must be solved iteratively in order to maintain efficient use of computing resources.

3.1. Issues Relevant to DSMC

DSMC computations are cell-centred. Hence the graph vertices are the cells and E is equivalent to the cell connectivity. This type of graph is sometimes referred to as the dual graph of the mesh. The load imbalance during a parallel computation arises from the particle movement. The spatial position of the particles can change relatively quickly, and consequently the domain adaption may have to be carried out many times during a computation. Therefore the cost of re-mapping the domain must not be expensive in comparison to the DSMC calculation. The load variation between timesteps can be fairly significant within a parallel DSMC calculation, largely due to fluctuations in the number of particles per processor. Even a serial calculation may have loading differences between timesteps due to the probabilistic nature of the collision algorithm, and for these reasons DSMC cannot be load balanced to within very fine tolerances as is the case with conventional CFD codes. The number of particles crossing IPBs depends on the length (area) of the IPB. Hence, in common with other applications, the communication cost depends on E_c, but is more variable. Given all these factors it is clear that an automatic load balancing strategy is required that monitors the true load on a processor. Since the data structure on each sub-domain is localised, the re-mapping policy should be localised in order to minimise communication costs and re-use the existing partition. This means that vertices should only be exchanged over connected edges in G^(s).
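Taken together, Sections 3 and 3.1 amount to the following optimisation problem. This is a compact restatement using the notation defined above; the imbalance tolerance τ is introduced here purely for illustration, since in practice the load is balanced only approximately.

    \min_{P} \; |E_c|,
    \qquad E_c = \{\, (u,v) \in E \;:\; P(u) \neq P(v) \,\},
    \quad \text{subject to} \quad
    \max_{p \in \mathcal{P}} \sum_{v \,:\, P(v)=p} w(v)
      \;\le\; (1+\tau)\,\frac{1}{|\mathcal{P}|} \sum_{v \in V} w(v)

where, for DSMC, the vertex weight w(v) is the (time-varying) number of simulators in cell v and τ is a small imbalance tolerance.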

3.2. Current Implementation

The approach taken by the authors has concentrated on implementing a robust, parallelised dynamic domain decomposition scheme in the first instance, with approximate load balancing and edge minimisation heuristics employed. It is intended that these heuristics will be improved once the robust scheme is proven. The method is similar to that of other researchers in the area [2]. An initial decomposition is performed cheaply and the DSMC computation is initiated on it. This decomposition is then adapted in a localised, diffusive fashion, by moving vertices lying along the IPBs, throughout the DSMC computation in order to approximately balance the load between processors and minimise surface area. The adaptive domain decomposition program is called adder.

A continuous load monitoring policy is employed by adder. At a fixed interval of timesteps, each processor broadcasts its current load state and this is fed into a decision scheme such as the stop-at-rise (SAR) formula of Nicol and Saltz [6], which compares the cost of re-mapping with the cost of not re-mapping. Use of this re-mapping decision process enables automation and the use of true timing data to gauge load, which would otherwise be very difficult to model.

Once the decision is made for load balancing, there are several steps taken on each processor. Firstly, each processor computes a localised "load table" for its neighbourhood. It uses this to define the direction of load flow across an edge ε_pq. Each processor then identifies its vertices lying on an IPB by a search through the halo. The next step is the identification of vertices which should be shed in order to improve the geometric shape of the domain. This is based on a geometric criterion and only border vertices are allowed to be shed. Consequently, a sub-domain can only receive vertices from within its halo. Each sub-domain carries out this operation in parallel. In order that conflicts do not occur in these "geometry sheds", an update of the neighbourhood is carried out so that each processor is aware of the cells in its halo that will be transferred to it. The next step is the identification of vertices to be transferred for load equalisation. Once again, only border vertices are allowed to be transferred. Currently, all vertices along the edge ε_pq are ear-marked for movement. Allowed geometry sheds are also included in the sending lists at this stage. Once the vertices are ear-marked, the quality of the resulting boundary is examined and further vertices are added to the shedding list if their inclusion reduces E_c. This candidate-choosing strategy is admittedly crude. Once all the vertices are marked, the actual inter-processor communication takes place. The DSMC data associated with the vertices must also be sent at this stage. Note that the geometric data associated with the vertices is not required, since the received vertices are already in the halo. Once the transfer of computational cells is complete, the haloes on the sub-domains need reconstructing, and this results in the final section of message passing.
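The re-mapping decision just described can be sketched as follows. This is our reading of the stop-at-rise (SAR) test of Nicol and Saltz [6] applied to the timing data gathered at fixed intervals; the function and variable names are hypothetical and the adder source may differ.

    # Sketch of a stop-at-rise (SAR) re-mapping decision in the spirit of [6];
    # hypothetical names, not the adder implementation.
    def sar_should_remap(max_times, mean_times, remap_cost):
        # max_times[i] / mean_times[i]: maximum and mean processor time for the
        # i-th monitoring interval since the last re-map; remap_cost: measured
        # cost of one re-mapping.
        w_prev = None
        degradation = 0.0
        for n, (t_max, t_mean) in enumerate(zip(max_times, mean_times), start=1):
            # Idle time accumulated because the slowest processor sets the pace.
            degradation += t_max - t_mean
            # Average cost per interval if we were to re-map now.
            w_n = (degradation + remap_cost) / n
            if w_prev is not None and w_n > w_prev:
                return True   # the running average has started to rise: re-map
            w_prev = w_n
        return False

In the runs reported in Section 4 this test was combined with a simple threshold, with re-mapping considered only when the measured imbalance exceeded about 20%.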

4. Results

The results in this paper are all from computations of rarefied driven cavity flows, computed on the AP1000 at Imperial College. A typical computational domain is shown in Figure 1(a). The upper, left and right-hand walls remain stationary whilst the bottom wall can be moved from left to right. All boundaries are at the same temperature and are fully diffuse. When the wall is moving, the gas is compressed into the bottom right-hand corner whilst a region of low density develops near the centre of the cavity due to the induced vortex. These densities translate into computational workload for the processors. Figure 1(a) shows the normalised density contours for the case where the wall is moving at approximately Mach 8. The flow and its loading effects have been described in detail elsewhere [8].

[Figure 1. (a) Normalised Density Contours; (b) Comparison of Speedup.]

If the bottom wall is stationary, the only velocities present in the cavity are due to the thermal motions of the simulators and the particles will be uniformly distributed throughout the domain. Under these conditions, with a uniform number of cells per processor, the computation is as perfectly load balanced as possible. If the decomposition approaches minimal E_c, the results will represent the peak parallel performance of the DSMC code. Figure 1(b) shows the speedup curve for calculations on the mesh shown in Figure 3(a), which has 10,000 cells, with 135,000 simulators and the bottom wall stationary. All decompositions are static and approximately satisfy the conditions indicated above. The partitions are effectively a square mesh overlay, and are of the type shown in [8]. It is seen that the code scales well and compares favourably with the ideal linear speedup curve. The drop-off observed at higher processor counts is due to load imbalance and communication costs, but the curve could be scaled by increasing the number of simulators and hence the total computational load.

Figure 2(a) shows the fractions of load imbalance and of simulators relative to the total number, versus the number of processors, for the calculations which gave the speedup curve in Figure 1(b). It is surprising to observe that even in this "perfectly" balanced case there are still load imbalances of over 20% for some numbers of processors. This relatively large level of load imbalance can be explained by fluctuations in the number of simulators per processor. As the number of processors increases, the number of particles per processor falls and the statistical scatter in these numbers has more effect on the load balance. This serves to illustrate the fact that it is pointless trying to load balance a DSMC computation to within a fine tolerance.

[Figure 2. (a) Load Imbalance and Particles per Node Fractions; (b) Speedup Comparisons.]

When the bottom wall is allowed to move, chronic load imbalances occur with uniform static decompositions. Speedup curves are shown in Figure 2(b), in which the bottom wall is moving at approximately Mach 8. The DSMC calculation with a uniform static decomposition, indicated by the circles, shows fairly poor performance, as might be expected, reaching a maximum speedup of 20 at 121 processors. The "ideal" DSMC curve shown in Figure 1(b) is also shown in Figure 2(b) for comparison. When the DSMC code is run with adder, the parallel performance of the code pairing increases dramatically up to 36 processors but falls off thereafter. The code pairing is called DSMC+ (DSMC Parallel Load balanced Unstructured Solver). It can be seen that DSMC+ does not achieve the performance of the ideal case, although it is not too far off at the lower numbers of processors. Achieving the ideal performance is, however, an unrealistic goal, since the cases with the moving wall possess substantially higher message passing overheads than those with the stationary wall as a result of the bulk gas velocities.

An example of the final domain decomposition obtained for the 25 processor case is shown in Figure 3(a). It can be seen that the lines of partition, shown in bold, have changed substantially from the initial uniform square mesh overlay. Note the reduced size of the sub-domain in the bottom right-hand corner and the increased size of the sub-domains in the centre of the cavity, reflecting the high and low gas densities respectively.

During the runs described, adder was only invoked when the load imbalance was greater than 20% and the SAR formula indicated that balancing should be done. The time interval for checking the timing statistics was ten timesteps. Up to 36 processors the load imbalance is kept at around 20% by adder; however, past this number of processors the load imbalance grows. The reason that adder is unable to keep a check on the imbalance is that the weight within the borders of the sub-domains becomes too large, and transferral of the entire border results in a large perturbation to the load on the processor. This is illustrated by the fact that past 36 processors the domain is repartitioned at virtually every opportunity. Hence an instability develops in which borders are flipped back and forth between neighbouring sub-domains in an effort to balance the load.

[Figure 3. (a) Adapted Decomposition; (b) Time per Step Comparison.]

A very encouraging feature of these results is that, although adder was called many times during the runs at higher processor numbers, it accounted for a small fraction of the overall computational cost. This is illustrated in Figure 3(b), which shows a comparison of the real times per step of the DSMC code and adder. The run time for each code is of the same order, and the cost of adder does not increase significantly with the number of processors, showing it to be highly scalable. A further point is that most of the time taken up within adder is due to message passing [7]. This indicates that there is scope for the inclusion of a more effective load balancing procedure without a deleterious effect on performance. A scheme suitable for DSMC load balancing is Song's iterative asynchronous procedure [11], and this will be implemented in the near future.

5. Conclusions

DSMC is a flexible computational technique capable of simulating a great variety of complex gas flows. A parallel DSMC tool has been described which utilises unstructured grids for geometric modelling flexibility and has a modular structure enabling ease of software maintenance and extensibility. A domain decomposition technique is used and the program runs under the SPMD paradigm. In the case of a perfectly load balanced calculation the DSMC implementation shows good scalability. This is not the case for a general flow, in which the load across the processor array becomes highly imbalanced due to the movement of the simulators and hence of the load. A heuristic, diffusive, hybrid graph-geometric, localised, concurrent scheme has been outlined for the purpose of adaptive domain decomposition, in order to balance the load between the processors during run time. Results indicate that the method holds significant promise for greatly increasing the parallel scalability of the DSMC implementation.

However, the crude load balancing heuristics currently applied lead to an instability in the loading characteristic of the processor array. This can be alleviated by application of an exact load balancing scheme, which will be implemented in future versions of the code. DSMC+ shows near-optimal parallel scalability when the instability is not present.

REFERENCES

1. G.A. Bird. Molecular Gas Dynamics and the Direct Simulation of Gas Flows. Oxford University Press, 1994.
2. C. Walshaw, M. Cross and M.G. Everett. A Localised Algorithm for Optimising Unstructured Mesh Partitions. International Journal of Supercomputer Applications, 9(4):280–295, 1995.
3. S. Dietrich and I. Boyd. Scalar and Parallel Optimised Implementation of the Direct Simulation Monte Carlo Method. J. Comp. Phys., 126:328–342, 1996.
4. G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. Technical Report TR 95-035, Computer Science Dept., University of Minnesota, Minneapolis, MN 55455, U.S.A., 1995. Available from http://www.cs.umn.edu/~karypis.
5. M. Ivanov, G. Markelov, S. Taylor and J. Watts. Parallel DSMC strategies for 3D computations. In P. Schiano, A. Ecer, J. Periaux and N. Satofuka, editors, Parallel Computational Fluid Dynamics: Algorithms and Results using Advanced Computers, pages 485–492. Elsevier, 1997.
6. D.M. Nicol and J.H. Saltz. Dynamic Remapping of Parallel Computations with Varying Resource Demands. IEEE Trans. Comput., 37(9):1073–1087, 1988.
7. C.D. Robinson. Particle Simulations on Parallel Computers with Dynamic Load Balancing. PhD thesis, Imperial College, London, 1997. Under preparation.
8. C.D. Robinson and J.K. Harvey. Adaptive Domain Decomposition for Unstructured Meshes Applied to the Direct Simulation Monte Carlo Method. In P. Schiano, A. Ecer, J. Periaux and N. Satofuka, editors, Parallel Computational Fluid Dynamics: Algorithms and Results using Advanced Computers, pages 469–476. Elsevier, 1997.
9. C.D. Robinson and J.K. Harvey. A Parallel DSMC Implementation on Unstructured Meshes with Adaptive Domain Decomposition. In C. Shen, editor, Rarefied Gas Dynamics. Peking University Press, in press. Proceedings of the Twentieth International Symposium on Rarefied Gas Dynamics, 1996.
10. C.D. Robinson and J.K. Harvey. Two-Dimensional DSMC Calculations of the Rayleigh-Benard Instability. In C. Shen, editor, Rarefied Gas Dynamics. Peking University Press, in press. Proceedings of the Twentieth International Symposium on Rarefied Gas Dynamics, 1996.
11. J. Song. A Partially Asynchronous and Iterative Algorithm for Distributed Load Balancing. Par. Comput., 4(2):15–25, 1994.
12. D. Vanderstraeten and R. Keunings. Optimized Partitioning of Unstructured Finite Element Meshes. Intl. J. Num. Meth. Engng., 38(3):433–450, 1995.