Peta-Scale Simulations with the HPC Software Framework walberla: Massively Parallel AMR for the Lattice Boltzmann Method
SIAM PP 2016, Paris, April 15, 2016
Florian Schornbaum, Christian Godenschwager, Martin Bauer, Ulrich Rüde
Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Outline
- Introduction
  - The walberla Simulation Framework
  - An Example Using the Lattice Boltzmann Method
- Parallelization Concepts
  - Domain Partitioning & Data Handling
- Dynamic Domain Repartitioning
  - AMR Challenges
  - Distributed Repartitioning Procedure
  - Dynamic Load Balancing
  - Benchmarks / Performance Evaluation
- Conclusion
Introduction The walberla Simulation Framework An Example Using the Lattice Boltzmann Method
Introduction
walberla (widely applicable Lattice Boltzmann framework from Erlangen):
- main focus on CFD (computational fluid dynamics) simulations based on the lattice Boltzmann method (LBM); now also implementations of other methods, e.g., phase field
- at its very core designed as an HPC software framework:
  - scales from laptops to current petascale supercomputers
  - largest simulation: 1,835,008 processes (IBM Blue Gene/Q @ Jülich)
  - hybrid parallelization: MPI + OpenMP, vectorization of compute kernels
- written in C++(11), growing Python interface
- support for different platforms (Linux, Windows) and compilers (GCC, Intel XE, Visual Studio, llvm/clang, IBM XL)
- automated build and test system
Introduction
AMR for the LBM example (vocal fold phantom geometry):
- DNS (direct numerical simulation), Reynolds number: 2500, D3Q27 TRT
- 24,054,048 → 315,611,120 fluid cells / 1 → 5 levels
- without refinement: 311 times more memory and 701 times the workload
Parallelization Concepts Domain Partitioning & Data Handling
Parallelization Concepts
- domain partitioning into blocks (the actual simulation domain occupies only part of the bounding box)
- static block-level refinement
- empty blocks are discarded
Parallelization Concepts
- domain partitioning into blocks (the actual simulation domain occupies only part of the bounding box)
- octree partitioning within every block of the initial partitioning (→ forest of octrees)
- static block-level refinement
- empty blocks are discarded
Parallelization Concepts
- static block-level refinement (→ forest of octrees)
- static load balancing
- allocation of block data (→ grids)
- load balancing can be based on either space-filling curves (Morton or Hilbert order) using the underlying forest of octrees, or graph partitioning (METIS, ...): whatever fits the needs of the simulation best (see the sketch below)
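A minimal sketch of the space-filling-curve idea in plain C++ (illustrative only, not the walberla interface; the block representation and weights are assumptions): blocks are ordered along a Morton curve and the curve is cut into contiguous pieces of roughly equal total weight, one per process.

```cpp
// Illustrative sketch only: assign blocks to processes by sorting them along a
// Morton space-filling curve and cutting the curve into pieces of roughly
// equal total weight.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct Block {
    uint32_t x, y, z;   // block coordinates on its level
    double   weight;    // e.g., number of fluid cells
};

// Interleave the bits of x, y, z -> Morton code (here: 10 bits per axis).
uint64_t mortonCode(uint32_t x, uint32_t y, uint32_t z) {
    uint64_t code = 0;
    for (int bit = 0; bit < 10; ++bit) {
        code |= ((uint64_t(x >> bit) & 1u) << (3 * bit + 0)) |
                ((uint64_t(y >> bit) & 1u) << (3 * bit + 1)) |
                ((uint64_t(z >> bit) & 1u) << (3 * bit + 2));
    }
    return code;
}

// Sort blocks along the curve and assign contiguous chunks of (approximately)
// equal accumulated weight to the given number of processes.
std::vector<int> balanceByCurve(std::vector<Block>& blocks, int numProcesses) {
    std::sort(blocks.begin(), blocks.end(), [](const Block& a, const Block& b) {
        return mortonCode(a.x, a.y, a.z) < mortonCode(b.x, b.y, b.z);
    });
    double total = 0.0;
    for (const Block& b : blocks) total += b.weight;
    std::vector<int> owner(blocks.size());
    double accumulated = 0.0;
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        owner[i] = std::min<int>(numProcesses - 1,
                                 int(accumulated / total * numProcesses));
        accumulated += blocks[i].weight;
    }
    return owner;
}

int main() {
    std::vector<Block> blocks = {{0, 0, 0, 1}, {1, 0, 0, 1}, {0, 1, 0, 2}, {1, 1, 1, 2}};
    for (int p : balanceByCurve(blocks, 2)) std::cout << p << ' ';
    std::cout << '\n';
}
```

Note that for the LBM on refined grids each refinement level has to be balanced separately (see the AMR challenges later in the talk), which this sketch does not show.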
Parallelization Concepts
- static block-level refinement (→ forest of octrees)
- static load balancing
- allocation of block data (→ grids)
- optional separation of domain partitioning from simulation: the partitioning can be saved to and later loaded from disk as a compact (KiB/MiB) binary file via MPI IO
Parallelization Concepts
- static block-level refinement (→ forest of octrees)
- static load balancing
- allocation of block data (→ grids)
- data & data structure stored perfectly distributed: no replication of (meta)data!
- separation of domain partitioning from simulation (optional), via compact (KiB/MiB) binary MPI IO to/from disk
Parallelization Concepts
- all parts customizable via callback functions in order to adapt to the underlying simulation (see the sketch below):
  1) discarding of blocks
  2) (iterative) refinement of blocks
  3) load balancing
  4) block data allocation
- support for an arbitrary number of block data items (each of arbitrary type)
- compact (KiB/MiB) binary MPI IO to/from disk
- separation of domain partitioning from simulation (optional)
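A hypothetical sketch of what such callback-based customization can look like, using std::function; all names and signatures here are made up for illustration and are not the actual walberla API.

```cpp
// Hypothetical sketch of callback-based customization of the domain setup
// (illustrative only; not the walberla interface).
#include <functional>
#include <vector>

struct SetupBlock {
    double x, y, z, size;   // axis-aligned bounding box of the block (simplified)
    int    level;
    double workload;        // weight used by the load balancer
};

struct SetupForest {
    std::function<bool(const SetupBlock&)> discardBlock;            // 1) discarding of blocks
    std::function<bool(const SetupBlock&)> refineBlock;             // 2) (iterative) refinement
    std::function<void(std::vector<SetupBlock>&, int)> balance;     // 3) load balancing
    // 4) block data allocation would register factories that create the
    //    per-block grids once the final partitioning is known.
};

int main() {
    SetupForest forest;
    // discard blocks that lie entirely outside the region of interest (here: z > 10)
    forest.discardBlock = [](const SetupBlock& b) { return b.z > 10.0; };
    // refine blocks close to a boundary of interest, up to some maximum level
    forest.refineBlock  = [](const SetupBlock& b) { return b.level < 3 && b.x < 1.0; };
    // distribute blocks, e.g., along a space-filling curve (see the earlier sketch)
    forest.balance      = [](std::vector<SetupBlock>&, int /*numProcesses*/) {};
}
```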
Parallelization Concepts
different views on / representations of the domain partitioning:
- 2:1 balanced grid (used for the LBM on refined grids)
- distributed graph: nodes = blocks, edges explicitly stored as <block ID, process rank> pairs
- forest of octrees: the octrees are not explicitly stored, but implicitly defined via the block IDs (see the sketch below)
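A sketch of how block IDs can encode a forest of octrees implicitly (illustrative only; the exact bit layout used by walberla may differ): the ID stores the root block of the tree plus three bits per refinement level, so parent and child IDs can be derived purely from bit operations without storing the trees.

```cpp
// Illustrative block ID encoding: root block index followed by three bits per
// octree descent step (child index 0..7).
#include <cstdint>
#include <iostream>

using BlockID = uint64_t;

// Append one octree descent step (child index 0..7) to an ID.
BlockID childID(BlockID parent, unsigned child) { return (parent << 3) | child; }

// Remove the last descent step -> ID of the parent block.
BlockID parentID(BlockID id) { return id >> 3; }

int main() {
    BlockID root = 5;                                  // root block #5 of the initial partitioning
    BlockID c    = childID(childID(root, 7), 2);       // descend two levels: child 7, then child 2
    std::cout << parentID(parentID(c)) << '\n';        // prints 5 again
}
```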
Parallelization Concepts
- different views on / representations of the domain partitioning: 2:1 balanced grid (used for the LBM on refined grids), distributed graph (nodes = blocks, edges explicitly stored as <block ID, process rank> pairs), forest of octrees (not explicitly stored, but implicitly defined via block IDs)
- our parallel implementation [1] of local grid refinement for the LBM, based on [2], shows excellent performance:
  - simulations with in total close to one trillion cells
  - close to one trillion cells updated per second (with 1.8 million threads)
  - strong scaling: more than 1000 time steps / sec., i.e., 1 ms per time step
[1] F. Schornbaum and U. Rüde, Massively Parallel Algorithms for the Lattice Boltzmann Method on Non-Uniform Grids, SIAM Journal on Scientific Computing (accepted for publication) [http://arxiv.org/abs/1508.07982]
[2] M. Rohde, D. Kandhai, J. J. Derksen, and H. E. A. van den Akker, A generic, mass conservative local grid refinement technique for lattice-Boltzmann schemes, International Journal for Numerical Methods in Fluids
Dynamic Domain Repartitioning AMR Challenges Distributed Repartitioning Procedure Dynamic Load Balancing Benchmarks / Performance Evaluation
AMR Challenges
challenges because of the block-structured partitioning:
- only entire blocks split/merge (only few blocks per process), causing a sudden increase/decrease of memory consumption by a factor of 8 (in 3D) (← octree partitioning & same number of cells for every block), so "split first, balance afterwards" probably won't work
- for the LBM, all levels must be load-balanced separately
- for good scalability, the entire pipeline should rely on perfectly distributed algorithms and data structures: no replication of (meta)data of any sort!
Dynamic Domain Repartitioning
(in the figures, different colors (green/blue) illustrate the process assignment; shown are a split, a merge, and a forced split to maintain 2:1 balance)
1) split/merge decision: callback function to determine which blocks must split and which blocks may merge (see the sketch below)
2) skeleton data structure creation: lightweight blocks (few KiB) with no actual data; 2:1 balance is automatically preserved
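An illustrative sketch of step 1 (again not the actual walberla interface): a per-block callback marks blocks for splitting or merging based on a user-defined indicator; the forced splits needed for 2:1 balance and the skeleton creation of step 2 are assumed to happen inside the framework afterwards.

```cpp
// Illustrative split/merge marking based on a per-block indicator.
#include <cstddef>
#include <vector>

enum class Action { keep, split, merge };

struct BlockInfo {
    int    level;
    double indicator;   // e.g., maximum velocity gradient inside the block
};

// Decide for every block whether it must split or may merge; merging only
// actually happens if all eight siblings agree, splitting for 2:1 balance is
// enforced by the framework afterwards.
std::vector<Action> markBlocks(const std::vector<BlockInfo>& blocks, int maxLevel,
                               double splitThreshold, double mergeThreshold) {
    std::vector<Action> actions(blocks.size(), Action::keep);
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        if (blocks[i].indicator > splitThreshold && blocks[i].level < maxLevel)
            actions[i] = Action::split;
        else if (blocks[i].indicator < mergeThreshold && blocks[i].level > 0)
            actions[i] = Action::merge;
    }
    return actions;
}

int main() {
    std::vector<BlockInfo> blocks = {{0, 0.9}, {2, 0.01}, {1, 0.3}};
    std::vector<Action> actions = markBlocks(blocks, 3, 0.5, 0.05);
    return actions.size() == blocks.size() ? 0 : 1;
}
```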
Dynamic Domain Repartitioning
3) load balancing: callback function to decide to which process blocks must migrate (skeleton blocks actually move to this process)
Dynamic Domain Repartitioning
3) load balancing: lightweight skeleton blocks allow multiple migration steps to different processes (→ enables balancing based on diffusion)
Dynamic Domain Repartitioning
3) load balancing: links between skeleton blocks and their corresponding real blocks are kept intact when skeleton blocks migrate
Dynamic Domain Repartitioning
3) load balancing: for global load balancing algorithms, balance is achieved in one step; skeleton blocks immediately migrate to their final processes
Dynamic Domain Repartitioning
4) data migration: the links between skeleton blocks and their corresponding real blocks are used to perform the actual data migration (includes refinement and coarsening of block data)
Dynamic Domain Repartitioning
4) data migration, implementation for grid data (see the sketch below):
- coarsening: senders coarsen the data before sending it to the target process
- refinement: receivers refine on the target process(es)
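A minimal sketch of what coarsening on the sender and refinement on the receiver can look like for plain cell data (illustrative only; real LBM distribution functions additionally need rescaling when they change grid level): 2x2x2 fine cells are averaged into one coarse cell before sending, and each received coarse cell is expanded back into 2x2x2 fine cells.

```cpp
// Illustrative coarsening (sender side) and refinement (receiver side) of a
// cubic grid of scalar cell data.
#include <cstddef>
#include <vector>

using Grid = std::vector<double>;  // n*n*n cells, x running fastest

inline std::size_t idx(std::size_t x, std::size_t y, std::size_t z, std::size_t n) {
    return x + n * (y + n * z);
}

// sender side: coarsen an n^3 grid (n even) to an (n/2)^3 grid by averaging
Grid coarsen(const Grid& fine, std::size_t n) {
    const std::size_t h = n / 2;
    Grid coarse(h * h * h, 0.0);
    for (std::size_t z = 0; z < n; ++z)
        for (std::size_t y = 0; y < n; ++y)
            for (std::size_t x = 0; x < n; ++x)
                coarse[idx(x / 2, y / 2, z / 2, h)] += fine[idx(x, y, z, n)] / 8.0;
    return coarse;
}

// receiver side: refine an h^3 grid to a (2h)^3 grid by injection
Grid refine(const Grid& coarse, std::size_t h) {
    const std::size_t n = 2 * h;
    Grid fine(n * n * n);
    for (std::size_t z = 0; z < n; ++z)
        for (std::size_t y = 0; y < n; ++y)
            for (std::size_t x = 0; x < n; ++x)
                fine[idx(x, y, z, n)] = coarse[idx(x / 2, y / 2, z / 2, h)];
    return fine;
}

int main() {
    Grid block(16 * 16 * 16, 1.0);
    Grid sent = coarsen(block, 16);   // done on the sender before communication
    Grid recv = refine(sent, 8);      // done on the receiver after communication
    return recv.size() == block.size() ? 0 : 1;
}
```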
Dynamic Domain Repartitioning
key parts customizable via callback functions in order to adapt to the underlying simulation:
1) decision which blocks split/merge
2) dynamic load balancing
and the data migration itself (implementation for grid data: coarsening: senders coarsen data before sending to the target process; refinement: receivers refine on the target process(es))
Dynamic Load Balancing
1) space filling curves (Morton or Hilbert):
- every process needs global knowledge (→ all-gather)
- scaling issues (even if it's just a few bytes from every process)
2) load balancing based on diffusion (see the sketch below):
- iterative procedure (= repeat the following multiple times): communication with neighboring processes only, calculate a flow for every process-process connection, and use this flow as a guideline to decide where blocks need to migrate in order to achieve balance
- runtime & memory independent of the number of processes (true in practice? → benchmarks)
- useful extension (benefits outweigh the costs): all-reduce to check for early abort & to adapt the flow
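A serial mock-up of the diffusion scheme (illustrative only, no MPI; a chain of "processes" with made-up loads): each connection gets a flow proportional to the load difference of its two endpoints, and repeating this for a few iterations drives all loads towards the average, which is exactly the guideline used to decide where blocks migrate.

```cpp
// Serial mock-up of diffusion-based load balancing on a chain of "processes".
// In the distributed version every process only communicates with its direct
// neighbors; an optional all-reduce can detect early convergence.
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> load = {12, 2, 6, 1, 9};   // blocks per "process" (chain topology)
    const double alpha = 0.4;                      // diffusion constant per connection
    for (int iteration = 0; iteration < 20; ++iteration) {
        std::vector<double> flowToRight(load.size() - 1);
        for (std::size_t i = 0; i + 1 < load.size(); ++i)
            flowToRight[i] = alpha * (load[i] - load[i + 1]);  // desired flow i -> i+1
        for (std::size_t i = 0; i + 1 < load.size(); ++i) {
            load[i]     -= flowToRight[i];   // blocks would migrate along this flow
            load[i + 1] += flowToRight[i];
        }
    }
    for (double l : load) std::cout << l << ' ';   // all values approach the average (6)
    std::cout << '\n';
}
```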
LBM AMR - Performance
Benchmark Environments:
- JUQUEEN (5.0 PFLOP/s): Blue Gene/Q, 459K cores, 1 GB/core, compiler: IBM XL / IBM MPI
- SuperMUC (2.9 PFLOP/s): Intel Xeon, 147K cores, 2 GB/core, compiler: Intel XE / IBM MPI
Benchmark (LBM D3Q19 TRT): lid-driven cavity, 4 grid levels (figure: domain partitioning)
LBM AMR - Performance
(figures: the benchmark's refresh sequence: coarsen, refine, uphold 2:1 balance)
LBM AMR - Performance
during this refresh process, all cells on the finest level are coarsened and the same amount of fine cells is created by splitting coarser cells → 72 % of all cells change their size
LBM AMR - Performance
avg. blocks/process (max. blocks/proc.):
level | initially  | after refresh | after load balance
  0   | 0.383 (1)  | 0.328 (1)     | 0.328 (1)
  1   | 0.656 (1)  | 0.875 (9)     | 0.875 (1)
  2   | 1.313 (2)  | 3.063 (11)    | 3.063 (4)
  3   | 3.500 (4)  | 3.500 (16)    | 3.500 (4)
LBM AMR - Performance (SuperMUC, space filling curve: Morton)
[chart: time in seconds required for the entire refresh cycle (uphold 2:1 balance, dynamic load balancing, split/merge blocks, migrate data) vs. number of cores (1024 to 65,536), for 209,671 / 497,000 / 970,703 cells per core, i.e., up to 14 / 33 / 64 billion cells in total]
LBM AMR - Performance (SuperMUC, diffusion load balancing)
[chart: refresh cycle time vs. number of cores for the same problem sizes; the time is almost independent of the number of processes!]
LBM AMR - Performance (JUQUEEN, space filling curve: Morton)
[chart: refresh cycle time in seconds vs. number of cores (256 to 458,752), for 31,062 / 127,232 / 429,408 cells per core, i.e., up to 14 / 58 / 197 billion cells in total; hybrid MPI+OpenMP version with SMP: 1 process on 2 cores with 8 threads]
LBM AMR - Performance (JUQUEEN, diffusion load balancing)
[chart: refresh cycle time vs. number of cores for the same problem sizes; the time is almost independent of the number of processes!]
LBM AMR - Performance (JUQUEEN, diffusion load balancing)
[chart: number of diffusion iterations until the load is perfectly balanced vs. number of cores (256 to 458,752)]
LBM AMR - Performance
impact on performance / overhead of the entire dynamic repartitioning procedure? It depends
- on the number of cells per core,
- on the actual runtime of the compute kernels (D3Q19 vs. D3Q27, additional force models, etc.),
- and on how often dynamic repartitioning happens.
previous lid-driven cavity benchmark: overhead of 1 to 3 (diffusion) or 1.5 to 10 (curve) time steps
In practice, a lot of time is spent just to determine whether or not the grid must be adapted, i.e., whether or not refinement must take place: this is often the entire overhead of AMR.
LBM AMR - Performance
AMR for the LBM example (vocal fold phantom geometry):
- DNS (direct numerical simulation), Reynolds number: 2500, D3Q27 TRT
- 24,054,048 → 315,611,120 fluid cells / 1 → 5 levels
- processes: 3584 (on SuperMUC phase 2)
- runtime: ca. 24 h (3 × ca. 8 h)
LBM AMR - Performance
AMR for the LBM example (vocal fold phantom geometry):
- load balancer: space filling curve (Hilbert order)
- time steps: 180,000 / 2,880,000 (finest grid)
- refresh cycles: 537 (→ a refresh roughly every 335 time steps)
- without refinement: 311 times more memory and 701 times the workload
Conclusion
Conclusion & Outlook
- the approach for massively parallel grid repartitioning, using a block-structured domain partitioning and employing a lightweight copy of the data structure during dynamic load balancing, is paying off and working extremely well: we can handle 10^11 cells (> 10^12 unknowns) with 10^7 blocks and 1.83 million threads
- outlook, resilience (using ULFM): store redundant, in-memory snapshots → if one or multiple process(es) fail → restore data on different processes → perform dynamic repartitioning → continue :-)
THANK YOU FOR YOUR ATTENTION! QUESTIONS?