Dynamic load balancing in OSIRIS

R. A. Fonseca [1,2]
[1] GoLP/IPFN, Instituto Superior Técnico, Lisboa, Portugal
[2] DCTI, ISCTE-Instituto Universitário de Lisboa, Portugal
Maintaining parallel load balance is crucial

- Full-scale 3D modeling of relevant scenarios requires scalability to a large number of cores
- Code performance must be sustained throughout the entire simulation
  - The overall performance is limited by the slowest node
- Simulation time is dominated by particle calculations
  - Some nodes may have more particles than others
- If the distribution of particles remained approximately constant throughout the simulation, this could be tackled at initialization (static load balancing)
  - However, the particle distribution will usually depend on the dynamics of the simulation
- Shared-memory parallelism can help mitigate the problem
  - Use a particle domain decomposition inside the shared-memory region
  - Smears out localized computational load peaks
- Dynamic load balancing is required for optimal efficiency

[Figure: LWFA simulation, parallel partition of 94 × 24 × 24 = 55k cores; load imbalance (max/avg load) = 9.04; average performance ~12% of peak; particles/node shown at iz = 12]
Adjusting processor load dynamically

The code can change node boundaries dynamically to attempt to maintain an even load across nodes:
- Determine the best possible partition from the current particle distribution
- Reshape the simulation

Algorithm (see the sketch below):
1. Begin: gather the integral of the load (global communication, cost ~log2(# nodes))
2. If the load imbalance (max/avg load) exceeds a threshold, compute the best partition
3. If the new partition yields a significant improvement, reshape the simulation, communicating only with overlapping nodes
4. End

- Uses a minimal number of messages
- Only the first communication step depends on the number of nodes

[Figure: unbalanced vs. balanced partitions on 4 cores (processes 0-3); after moving the node boundaries the particle load is even across processes]
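A minimal sketch of this decision loop in Python; the `sim` object and its methods (`gather_integral_load`, `best_partition`, `reshape`, etc.) are hypothetical stand-ins for the flowchart steps above, not the OSIRIS API:

```python
def maybe_rebalance(sim, threshold=1.2, min_gain=0.05):
    """One pass of the dynamic load-balance decision loop (sketch).

    All `sim` methods are hypothetical placeholders for the steps in
    the flowchart above.
    """
    # 1. Gather the global integral of the load -- the only global
    #    communication, with cost ~log2(# nodes).
    integral_load = sim.gather_integral_load()

    # 2. Imbalance = max node load / average node load.
    if sim.imbalance(integral_load) <= threshold:
        return False                      # still balanced: nothing to do

    # 3. Compute the best partition from the current distribution.
    new_partition = sim.best_partition(integral_load)

    # 4. Reshape only if the improvement is significant...
    gain = sim.imbalance(integral_load) - sim.predicted_imbalance(
        integral_load, new_partition)
    if gain < min_gain:
        return False

    # 5. ...exchanging data only with overlapping nodes.
    sim.reshape(new_partition)
    return True
```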
Calculate the cumulative load along the load-balance direction

The algorithm for calculating the best partition requires the cumulative computational load. This is essentially a scan operation, implemented as an all-reduce followed by local calculations (see the sketch below):
1. Get the load along the load-balancing direction on each node
2. Add the load from all nodes and distribute the result to all nodes (a single MPI_ALLREDUCE of a few kB). This is the main overhead (~90%), but it scales well for large numbers of nodes, ~log2(# nodes)
3. Integrate the load along the load-balancing direction locally

All nodes know the result (the global integral load), which simplifies the best-partition calculations.

[Figure: per-node loads on nodes 0-3 are combined so that the global integral load is known by all nodes]
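A minimal Python sketch of this reduce-then-integrate pattern; the all-reduce over nodes is emulated serially here, standing in for the single MPI_ALLREDUCE:

```python
from itertools import accumulate

def global_integral_load(contributions, n_cells):
    """Reduce-then-scan pattern for the cumulative load (sketch).

    `contributions` holds, for each node, a {cell: load} dict covering
    its slab along the load-balance direction.  Step 1 emulates the
    MPI_ALLREDUCE; step 2 is the cheap node-local integration.
    """
    # Step 1: "all-reduce" -- sum every node's contribution per cell,
    # so all nodes end up with the same global load-per-cell array.
    load = [0.0] * n_cells
    for contrib in contributions:
        for cell, weight in contrib.items():
            load[cell] += weight

    # Step 2: node-local prefix sum -> cumulative (integral) load.
    # Every node performs this identical calculation, so no further
    # communication is needed.
    return list(accumulate(load))

# Example: 3 nodes, 8 cells along the load-balance direction.
nodes = [{0: 4, 1: 6, 2: 1}, {3: 9, 4: 9}, {5: 2, 6: 1, 7: 0}]
print(global_integral_load(nodes, 8))  # [4, 10, 11, 20, 29, 31, 32, 32]
```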
Finding the best partition

Binary search algorithm for 1D partitions (see the sketch below):
- Only the 1D integral load is required (less communication)
- Find the point where the integral load is halfway between the endpoints
- Repeat with the two resulting intervals until all node partitions are calculated
- The same calculation is done by all nodes, so no extra communication is needed

Properties:
- Fast algorithm
- Well-balanced distribution
- Handles impossible scenarios well: when no perfectly load-balanced partition exists, it finds the best available solution
- Can handle extra constraints, e.g. a minimum number of cells per node, or requiring the cells per node to be an integer multiple of some quantity

[Figure: successive bisection of the cumulative load curve]
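A Python sketch of the recursive bisection, assuming a power-of-two node count and omitting the extra constraints; `cum_load` is the global integral load from the previous step:

```python
from bisect import bisect_left

def best_partition(cum_load, n_nodes):
    """1D best partition by recursive bisection of the integral load.

    cum_load[i] = total load in cells 0..i.  Returns the exclusive
    upper boundary of each node's slab.  Assumes n_nodes is a power of
    two; constraints (minimum cells per node, etc.) are omitted.
    """
    def split(lo, hi, load_lo, load_hi, k):
        if k == 1:
            return [hi]
        # Binary search for the cell where the integral load is halfway
        # between the endpoints -- when no perfectly balanced split
        # exists, this is still the best available midpoint.
        target = 0.5 * (load_lo + load_hi)
        mid = bisect_left(cum_load, target, lo, hi)
        mid = min(max(mid, lo + 1), hi - 1)   # >= one cell per side
        return (split(lo, mid, load_lo, cum_load[mid - 1], k // 2) +
                split(mid, hi, cum_load[mid - 1], load_hi, k // 2))

    return split(0, len(cum_load), 0.0, cum_load[-1], n_nodes)

# Using the integral load from the previous sketch, for 4 nodes:
cum = [4, 10, 11, 20, 29, 31, 32, 32]
print(best_partition(cum, 4))   # [1, 3, 4, 8]
```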
Partition reshaping

All nodes know both the old and the new partition. For each node:
- Detect the overlap between the local node in the old partition and the other nodes in the new partition
- Detect the overlap between the local node in the new partition and the other nodes in the old partition
- Send data to / receive data from the appropriate nodes (see the sketch below)

For particles:
- Remove from / add to the particle buffer

For grids:
- Send/get data to/from the other nodes
- Allocate a new grid with the new shape
- Copy in values from the old grid and data from the other nodes
- Free the old grid

[Figure: communication pattern for node 1 between old and new partitions (nodes 0-2): send data to node 0, get data from node 2]
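Since every node knows both partitions, the send/receive pattern can be computed locally with a simple 1D interval intersection; a sketch (hypothetical helper, not OSIRIS code):

```python
def overlaps(my_range, partition):
    """List the nodes of `partition` whose slab intersects `my_range`.

    `partition` is the full list of (lo, hi) cell ranges, known by all
    nodes, so each node computes its message pattern locally.
    """
    lo, hi = my_range
    hits = []
    for node, (plo, phi) in enumerate(partition):
        olo, ohi = max(lo, plo), min(hi, phi)
        if olo < ohi:                      # non-empty intersection
            hits.append((node, (olo, ohi)))
    return hits

old = [(0, 30), (30, 60), (60, 100)]       # old node boundaries
new = [(0, 40), (40, 80), (80, 100)]       # new node boundaries

# Node 1: parts of its old slab that belong to others in the new
# partition (send), and parts of its new slab held by others (receive);
# the self-overlap (node 1 with itself) is the data that stays local.
print(overlaps(old[1], new))  # [(0, (30, 40)), (1, (40, 60))]
print(overlaps(new[1], old))  # [(1, (40, 60)), (2, (60, 80))]
```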
Including cells in the computational load

Cell calculations (field solve + grid filtering) can be considered by including an effective cell weight in the load estimate (sketched below). In this example, each cell is considered to have the same computational load as α = 1.5 particles.

[Figure: node boundaries from a load estimate considering only particles per vertical slice vs. particles + cells with α = 1.5]
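A one-function sketch of the weighted load estimate, with the cell weight α as a tunable parameter:

```python
def slab_load(n_particles, n_cells, alpha=1.5):
    """Load estimate for a slab: particles plus cells weighted by
    alpha, the cost of one cell (field solve + grid filtering)
    measured in units of one particle push."""
    return n_particles + alpha * n_cells

# With alpha = 1.5 an empty region is no longer "free":
print(slab_load(0, 1000))      # 1500.0 -- cells alone carry load
print(slab_load(5000, 1000))   # 6500.0
```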
Hydrodynamic nano-plasma explosion

Simulation setup:
- 2D Cartesian geometry, 3000 iterations
- Grid: 1024² cells, 102.4² (c/ωp)²
- Particles: 2 species (electrons + positrons), 8² particles/species/cell, plasma sphere of radius 12.8 c/ωp
- Generalized thermal velocity u_th = 0.1 c

Load:
- The algorithm maintains the best possible load balance: with dynamic load balancing the min. and max. particle loads stay close together, while with fixed boundaries they drift apart
- Repartitioning the simulation is a time-consuming task: don't do it too often!

Performance:
- Including particles & grids in the load estimate is ~50% better than particles only
- n_loadbalance ~ 64 iterations yields good results
- Performance boost of ~2×; other scenarios can be as high as 4×

[Figure: number of particles (min./max. load) vs. iteration for fixed boundaries and dynamic load balancing; total simulation time [s] vs. iterations between load balance (8-128) for no dynamic load balance, particles only, and particles and grids]
Multi-dimensional dynamic load balancing

A difficult task:
- A single best-partition solution exists only for 1D parallel partitions
- Improving load balance in one direction might result in worse results in another direction

Multidimensional partition:
- Assume a separable load function, i.e. the load is a product fx(x) fy(y) fz(z)
- Each partition direction then becomes independent
- To account for the transverse directions, use the max / sum value across the perpendicular partition (see the sketch below)

[Figure: node boundaries of a 2D partition at t = 0 and t > 0]
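A sketch of the reduction onto one axis under the separability assumption; `best_partition` refers to the 1D routine sketched earlier, and the nested-list layout of `load` is illustrative:

```python
def axis_load(load, axis, combine=sum):
    """Collapse a 3D load[ix][iy][iz] onto one axis, combining the two
    perpendicular directions with `combine` (sum, or max to highlight
    hot spots)."""
    nx, ny, nz = len(load), len(load[0]), len(load[0][0])
    out = []
    for i in range((nx, ny, nz)[axis]):
        plane = [load[ix][iy][iz]
                 for ix in range(nx)
                 for iy in range(ny)
                 for iz in range(nz)
                 if (ix, iy, iz)[axis] == i]
        out.append(combine(plane))
    return out

# Tiny example: 2x2x2 load with a hot spot at (1, 1, 1).
load = [[[1, 1], [1, 1]], [[1, 1], [1, 9]]]
print(axis_load(load, 0))               # [4, 12]  (sum over y, z)
print(axis_load(load, 0, combine=max))  # [1, 9]   (max flags the hot spot)

# Each 1D profile is then integrated and fed to the 1D best_partition
# routine independently for x, y and z.
```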
Large-scale LWFA run: close, but no cigar

The ASCR problems are very difficult to load balance:
- Very small problem size per node
- When choosing partitions with fewer nodes along the propagation direction, the imbalance degrades significantly
  - Not enough room to dynamically load balance along the propagation direction
- Dynamic load balancing in the transverse directions does not yield big improvements
  - Very different distributions from slice to slice; dynamic load balancing in just one direction does not improve the imbalance
- Using the max node load helps to highlight the hot spots for the algorithm
  - >30% improvement in imbalance, but no overall speedup

Best result: dynamic load balancing along x2 and x3, using max load
- 30% improvement in imbalance, but... it led to a ~5% overall slowdown!

[Figure: laser pulse and x1-x2 density slice at the box center; the partition along x3 is similar]
Alternative: patch-based load balancing

Implemented in PSC / SMILEI:
- Partition the space into many more domains (patches) than processing elements (PEs), typically 10-100×
- Dynamically assign patches to PEs (see the sketch below)
  - Assign a similar load to each PE
  - Attempt to keep neighboring patches on the same PE

[Figure: LWFA simulation with a standard MPI partition vs. patch-based partitioning; Germaschewski et al.: traditional domain decomposition vs. patch-based load balancing for a density increased across the diagonal, with thick black lines demarcating the subdomains assigned to each MPI process and thin grey lines the underlying patches; Beck et al.: example of a 32 × 32 patch decomposition among 7 MPI processes, with MPI domains delimited by uniform colors along a Hilbert curve]

Germaschewski et al., JCP 318, 305 (2016); Beck et al., NIMPR-A 829, 418 (2016)
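A sketch of the assignment step: patches are ordered along a space-filling curve (row-major here for brevity; SMILEI uses a Hilbert curve) and the ordered list is cut into contiguous chunks of roughly equal load, which tends to keep neighboring patches on the same PE:

```python
def assign_patches(patch_loads, n_pe):
    """Cut a curve-ordered list of patch loads into n_pe contiguous
    chunks of roughly equal load; returns the owner PE of each patch."""
    total = sum(patch_loads)
    owners, acc, pe = [], 0.0, 0
    for load in patch_loads:
        # Advance to the next PE once this patch would mostly fall
        # beyond the current PE's fair share of the total load.
        if pe < n_pe - 1 and acc + load / 2 > (pe + 1) * total / n_pe:
            pe += 1
        owners.append(pe)
        acc += load
    return owners

# 12 patches along the curve, 3 PEs; the heavy patch (load 8) drives
# the cuts, and each PE still gets a contiguous run of patches.
loads = [1, 1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1]
print(assign_patches(loads, 3))  # [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2]
```

Ordering the patches along a Hilbert curve rather than row-major keeps each PE's chunk spatially compact in 2D/3D, which reduces the amount of patch-boundary synchronization between PEs.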
Overview

- Maintaining load balance is crucial for large-scale simulations
  - The lowest-performing node dominates
  - The particle load per PE evolves dynamically along the simulation
  - Shared-memory parallelism helps smear out load spikes, but does not solve the problem
- Dynamic load balancing is implemented by moving node boundaries
  - Works well for many problems
  - Difficult to extend to multi-dimensional partitions
  - Unable to handle difficult LWFA scenarios
- Patch-based dynamic load balancing appears to be a viable solution
  - Requires moving to a tiling scheme to organize the data
  - Lots of work, but has other advantages, e.g. improved data locality and similarity with GPU code
  - Current implementations rely on MPI_THREAD_MULTIPLE