Peta-Scale Simulations with the HPC Software Framework walberla: Massively Parallel AMR for the Lattice Boltzmann Method

Peta-Scale Simulations with the HPC Software Framework walberla: Massively Parallel AMR for the Lattice Boltzmann Method
SIAM PP 2016, Paris, April 15, 2016
Florian Schornbaum, Christian Godenschwager, Martin Bauer, Ulrich Rüde
Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

Outline
- Introduction: the walberla simulation framework, an example using the lattice Boltzmann method
- Parallelization concepts: domain partitioning & data handling
- Dynamic domain repartitioning: AMR challenges, distributed repartitioning procedure, dynamic load balancing, benchmarks / performance evaluation
- Conclusion

Introduction: The walberla Simulation Framework, An Example Using the Lattice Boltzmann Method

Introduction
walberla (widely applicable Lattice Boltzmann framework from Erlangen):
- main focus on CFD (computational fluid dynamics) simulations based on the lattice Boltzmann method (LBM); by now also implementations of other methods, e.g., phase field
- at its very core designed as an HPC software framework:
  - scales from laptops to current petascale supercomputers; largest simulation: 1,835,008 processes (IBM Blue Gene/Q at Jülich)
  - hybrid parallelization: MPI + OpenMP, vectorization of compute kernels
  - written in C++(11), growing Python interface
  - support for different platforms (Linux, Windows) and compilers (GCC, Intel XE, Visual Studio, llvm/clang, IBM XL)
  - automated build and test system

Introduction
AMR for the LBM, example (vocal fold phantom geometry):
- DNS (direct numerical simulation), Reynolds number: 2500, D3Q27 TRT
- 24,054,048 to 315,611,120 fluid cells, 1 to 5 grid levels
- without refinement: 311 times more memory and 701 times the workload

Parallelization Concepts: Domain Partitioning & Data Handling

Parallelization Concepts
- domain partitioning into blocks (the actual simulation domain covers only part of the initially meshed region)
- static block-level refinement
- empty blocks are discarded

Parallelization Concepts
- domain partitioning into blocks
- octree partitioning within every block of the initial partitioning (forest of octrees)
- static block-level refinement
- empty blocks are discarded

Parallelization Concepts
- static block-level refinement (forest of octrees)
- allocation of block data (grids)
- static load balancing: can be based either on space-filling curves (Morton or Hilbert order) using the underlying forest of octrees, or on graph partitioning (METIS, ...), whatever fits the needs of the simulation best (see the sketch below)
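
The space-filling-curve idea can be pictured with the following minimal sketch. It is not waLBerla's implementation; `Block`, `morton()`, and `assignByCurve()` are illustrative names, and the assumption is that each block knows its integer coordinates on its level and carries a workload weight. Blocks are sorted along a Morton (Z-order) curve and the curve is cut into contiguous chunks of roughly equal total weight, one per process.

```cpp
// Sketch only: Morton-order load balancing of blocks onto processes.
// Block, morton(), and assignByCurve() are illustrative names, not waLBerla API.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct Block {
    uint32_t x, y, z;   // block coordinates on its refinement level
    double   weight;    // e.g., number of fluid cells in the block
    int      process;   // assigned MPI rank
};

// Interleave the bits of (x, y, z) to obtain the Morton (Z-order) index.
static uint64_t morton(uint32_t x, uint32_t y, uint32_t z) {
    uint64_t code = 0;
    for (int bit = 0; bit < 21; ++bit) {
        code |= (uint64_t((x >> bit) & 1u) << (3 * bit))
              | (uint64_t((y >> bit) & 1u) << (3 * bit + 1))
              | (uint64_t((z >> bit) & 1u) << (3 * bit + 2));
    }
    return code;
}

// Sort blocks along the curve and cut it into chunks of ~equal total weight.
void assignByCurve(std::vector<Block>& blocks, int numProcesses) {
    std::sort(blocks.begin(), blocks.end(), [](const Block& a, const Block& b) {
        return morton(a.x, a.y, a.z) < morton(b.x, b.y, b.z);
    });
    double total = 0.0;
    for (const Block& b : blocks) total += b.weight;
    const double perProcess = total / numProcesses;
    double accumulated = 0.0;
    for (Block& b : blocks) {
        b.process = std::min(numProcesses - 1, int(accumulated / perProcess));
        accumulated += b.weight;
    }
}

int main() {
    std::vector<Block> blocks = {{0,0,0,1.0,-1}, {1,0,0,1.0,-1}, {0,1,1,2.0,-1}, {3,2,1,1.0,-1}};
    assignByCurve(blocks, 2);
    for (const Block& b : blocks)
        std::cout << "block (" << b.x << "," << b.y << "," << b.z << ") -> rank " << b.process << "\n";
}
```

A Hilbert-order variant only differs in the key used for sorting; graph partitioning (METIS) would instead operate on the distributed block graph.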

Parallelization Concepts
- static block-level refinement (forest of octrees)
- static load balancing
- the resulting domain partitioning can be written to disk as a compact (KiB/MiB) binary file via MPI IO and read back before block data is allocated
- allocation of block data (grids)
- separation of domain partitioning from simulation (optional)

Parallelization Concepts
- data & data structure are stored perfectly distributed: no replication of (meta) data!

Parallelization Concepts
- all parts are customizable via callback functions in order to adapt to the underlying simulation (see the sketch below):
  1) discarding of blocks
  2) (iterative) refinement of blocks
  3) load balancing
  4) block data allocation
- support for an arbitrary number of block data items, each of arbitrary type
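
The callback-based customization can be pictured roughly as follows. This is a hypothetical sketch with made-up names (`SetupBlock`, `InitPipeline`), not waLBerla's actual interface: the framework owns the block structure, while the application plugs in its policies via std::function hooks for discarding, refining, balancing, and allocating block data.

```cpp
// Hypothetical sketch of a callback-driven setup pipeline (names are illustrative,
// not the waLBerla API): the framework owns the block structure, the application
// supplies the policy via std::function callbacks.
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

struct AABB { double lo[3], hi[3]; };   // axis-aligned bounding box of a block

struct SetupBlock {
    AABB aabb;
    int  level   = 0;
    int  process = -1;
};

struct InitPipeline {
    std::function<bool(const SetupBlock&)>             discardBlock;  // 1) discard empty blocks
    std::function<bool(const SetupBlock&)>             refineBlock;   // 2) (iterative) refinement
    std::function<void(std::vector<SetupBlock>&, int)> balance;       // 3) static load balancing
    std::function<void(SetupBlock&)>                   allocateData;  // 4) block data allocation
};

int main() {
    InitPipeline pipeline;
    pipeline.discardBlock = [](const SetupBlock& b) {
        return b.aabb.lo[2] > 10.0;                 // e.g., no fluid above z = 10
    };
    pipeline.refineBlock = [](const SetupBlock& b) {
        return b.level < 2 && b.aabb.lo[0] < 1.0;   // refine near x = 0 up to level 2
    };
    pipeline.balance = [](std::vector<SetupBlock>& blocks, int numProcesses) {
        for (std::size_t i = 0; i < blocks.size(); ++i)   // trivial round robin
            blocks[i].process = int(i) % numProcesses;
    };
    pipeline.allocateData = [](SetupBlock&) { /* allocate LBM grid, flag field, ... */ };

    std::cout << "pipeline configured: " << bool(pipeline.discardBlock) << "\n";
}
```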

Parallelization Concepts
Different views on / representations of the domain partitioning:
- 2:1 balanced grid (used for the LBM on refined grids)
- distributed graph: nodes = blocks, edges explicitly stored as <block ID, process rank> pairs
- forest of octrees: the octrees are not explicitly stored, but implicitly defined via the block IDs (see the sketch below)
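
How a block ID can make the octrees implicit is sketched below under simple assumptions: a marker bit for the root block and three bits appended per level encode the full octree path, so parent and child IDs follow from bit arithmetic alone. The encoding is illustrative, not waLBerla's exact ID layout.

```cpp
// Sketch: a block ID that implicitly encodes the octree path (3 bits per level
// below the root block), so parents/children are pure bit arithmetic and no
// explicit octree has to be stored. Illustrative, not waLBerla's exact encoding.
#include <cstdint>
#include <iostream>
#include <vector>

using BlockID = uint64_t;

// The root block of one octree gets ID 1 (a marker bit); each descent appends
// the octant number (0..7) as three bits.
constexpr BlockID rootID() { return 1u; }

constexpr BlockID childID(BlockID parent, unsigned octant) {
    return (parent << 3) | (octant & 7u);
}

constexpr BlockID parentID(BlockID id) { return id >> 3; }

// The level is the number of 3-bit digits below the marker bit.
inline int level(BlockID id) {
    int l = 0;
    while (id > 1u) { id >>= 3; ++l; }
    return l;
}

int main() {
    BlockID b = childID(childID(rootID(), 5), 2);   // root -> octant 5 -> octant 2
    std::cout << "ID = " << b << ", level = " << level(b)
              << ", parent = " << parentID(b) << "\n";

    std::vector<BlockID> children;
    for (unsigned o = 0; o < 8; ++o) children.push_back(childID(b, o));
    std::cout << "first child = " << children.front() << "\n";
}
```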

Parallelization Concepts
Different views on / representations of the domain partitioning:
- 2:1 balanced grid (used for the LBM on refined grids)
- our parallel implementation [1] of local grid refinement for the LBM, based on [2], shows excellent performance:
  - simulations with in total close to one trillion cells
  - close to one trillion cells updated per second (with 1.8 million threads)
  - strong scaling: more than 1000 time steps / sec., i.e., about 1 ms per time step
- distributed graph: nodes = blocks, edges explicitly stored as <block ID, process rank> pairs
- forest of octrees: octrees are not explicitly stored, but implicitly defined via block IDs

[1] F. Schornbaum and U. Rüde, Massively Parallel Algorithms for the Lattice Boltzmann Method on Non-Uniform Grids, SIAM Journal on Scientific Computing (accepted for publication), http://arxiv.org/abs/1508.07982
[2] M. Rohde, D. Kandhai, J. J. Derksen, and H. E. A. van den Akker, A generic, mass conservative local grid refinement technique for lattice-Boltzmann schemes, International Journal for Numerical Methods in Fluids

Dynamic Domain Repartitioning: AMR Challenges, Distributed Repartitioning Procedure, Dynamic Load Balancing, Benchmarks / Performance Evaluation

AMR Challenges
Challenges due to the block-structured partitioning:
- only entire blocks can split or merge, and there are only few blocks per process, which causes a sudden increase/decrease of memory consumption by a factor of 8 (in 3D) (octree partitioning & same number of cells in every block)
- "split first, balance afterwards" probably won't work
- for the LBM, all levels must be load-balanced separately
- for good scalability, the entire pipeline should rely on perfectly distributed algorithms and data structures: no replication of (meta) data of any sort!

Dynamic Domain Repartitioning
(different colors in the figures, green/blue, illustrate the process assignment; blocks split, merge, or are force-split to maintain 2:1 balance)
1) split/merge decision: a callback function determines which blocks must split and which blocks may merge (see the sketch below)
2) skeleton data structure creation: lightweight blocks (few KiB) with no actual data; 2:1 balance is automatically preserved
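
The split/merge decision of step 1 could be supplied by the application roughly as in the following sketch. The indicator and thresholds are invented for illustration (in an LBM code this might be a velocity-gradient criterion); they are not the criterion used in the talk's benchmarks, and in the real procedure a block only merges if all eight siblings agree.

```cpp
// Sketch of step 1 (split/merge decision): the application marks blocks, the
// framework acts on the marks. The indicator and thresholds are made up for
// illustration.
#include <iostream>
#include <vector>

enum class Mark { keep, split, mergeCandidate };

struct Block {
    int    level;
    double indicator;   // e.g., max. velocity gradient inside the block
    Mark   mark;        // set by markBlocks()
};

void markBlocks(std::vector<Block>& blocks, double refineAbove, double coarsenBelow,
                int maxLevel) {
    for (Block& b : blocks) {
        if (b.indicator > refineAbove && b.level < maxLevel)
            b.mark = Mark::split;             // block must be split (into 8 children)
        else if (b.indicator < coarsenBelow && b.level > 0)
            b.mark = Mark::mergeCandidate;    // may merge if all 8 siblings agree
        else
            b.mark = Mark::keep;
    }
}

int main() {
    std::vector<Block> blocks = {{0, 0.9}, {1, 0.05}, {2, 0.4}};
    markBlocks(blocks, 0.8, 0.1, 3);
    for (const Block& b : blocks)
        std::cout << "level " << b.level << " -> "
                  << (b.mark == Mark::split ? "split"
                      : b.mark == Mark::mergeCandidate ? "merge candidate" : "keep") << "\n";
}
```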

Dynamic Domain Repartitioning
3) load balancing: a callback function decides to which process each block must migrate (the skeleton blocks actually move to this process)

Dynamic Domain Repartitioning
3) load balancing: the lightweight skeleton blocks allow multiple migration steps to different processes (this enables balancing based on diffusion)

Dynamic Domain Repartitioning
3) load balancing: the links between skeleton blocks and the corresponding real blocks are kept intact when skeleton blocks migrate

Dynamic Domain Repartitioning
3) load balancing: for global load balancing algorithms, balance is achieved in one step; the skeleton blocks immediately migrate to their final processes

Dynamic Domain Repartitioning
4) data migration: the links between skeleton blocks and the corresponding real blocks are used to perform the actual data migration (includes refinement and coarsening of block data)

Dynamic Domain Repartitioning
4) data migration, implementation for grid data (see the sketch below):
- coarsening: senders coarsen the data before sending it to the target process
- refinement: receivers refine the data on the target process(es)
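
A minimal sketch of this sender-coarsens / receiver-refines convention for cell-centered grid data is shown below. It assumes one scalar value per cell and cubic blocks; real LBM data additionally needs a per-direction rescaling of the PDFs, which is omitted here.

```cpp
// Sketch of step 4 for uniform grid block data (3D, cell-centered, one value per
// cell): the sender averages 2x2x2 fine cells before sending ("coarsen"), the
// receiver copies each coarse value into 2x2x2 fine cells ("refine").
#include <cstddef>
#include <iostream>
#include <vector>

using Grid = std::vector<double>;   // n*n*n values, x running fastest

inline std::size_t idx(std::size_t x, std::size_t y, std::size_t z, std::size_t n) {
    return x + n * (y + n * z);
}

// Fine block (edge length 2n) -> coarse block (edge length n), done on the sender.
Grid coarsen(const Grid& fine, std::size_t n) {
    Grid coarse(n * n * n, 0.0);
    for (std::size_t z = 0; z < n; ++z)
        for (std::size_t y = 0; y < n; ++y)
            for (std::size_t x = 0; x < n; ++x) {
                double sum = 0.0;
                for (std::size_t dz = 0; dz < 2; ++dz)
                    for (std::size_t dy = 0; dy < 2; ++dy)
                        for (std::size_t dx = 0; dx < 2; ++dx)
                            sum += fine[idx(2*x+dx, 2*y+dy, 2*z+dz, 2*n)];
                coarse[idx(x, y, z, n)] = sum / 8.0;
            }
    return coarse;
}

// Coarse block (edge length n) -> fine block (edge length 2n), done on the receiver.
Grid refine(const Grid& coarse, std::size_t n) {
    Grid fine(8 * n * n * n, 0.0);
    for (std::size_t z = 0; z < 2*n; ++z)
        for (std::size_t y = 0; y < 2*n; ++y)
            for (std::size_t x = 0; x < 2*n; ++x)
                fine[idx(x, y, z, 2*n)] = coarse[idx(x/2, y/2, z/2, n)];
    return fine;
}

int main() {
    const std::size_t n = 2;                 // coarse block: 2^3 cells, fine block: 4^3 cells
    Grid fine(8 * n * n * n, 1.0);
    Grid coarse = coarsen(fine, n);
    Grid back   = refine(coarse, n);
    std::cout << "coarse cells: " << coarse.size() << ", fine cells: " << back.size() << "\n";
}
```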

Dynamic Domain Repartitioning
Key parts are customizable via callback functions in order to adapt to the underlying simulation:
- the decision which blocks split/merge (step 1)
- dynamic load balancing (step 3)
- data migration, i.e., refinement and coarsening of block data (step 4): for grid data, senders coarsen before sending, receivers refine on the target process(es)

Dynamic Load Balancing
1) space filling curves (Morton or Hilbert):
- every process needs global knowledge (all-gather), which causes scaling issues (even if it's just a few bytes from every process)
2) load balancing based on diffusion (see the sketch below):
- iterative procedure (= repeat the following multiple times), with communication between neighboring processes only:
  - calculate a flow for every process-process connection
  - use this flow as a guideline to decide where blocks need to migrate in order to achieve balance
- runtime & memory independent of the number of processes (true in practice? see the benchmarks)
- useful extension (benefits outweigh the costs): an all-reduce to check for early abort & adapt the flow
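
The diffusion idea can be illustrated with the following serial toy model (a 1D chain of "processes", a made-up diffusion constant, and continuous loads): in every iteration each process looks only at its direct neighbors, computes a flow proportional to the load difference on each connection, and would migrate blocks along edges with positive outflow. This is a generic first-order diffusion scheme, not waLBerla's actual algorithm, and in the real scheme only whole blocks (per level) can move.

```cpp
// Sketch of diffusive load balancing on the process graph (serial toy model, no
// MPI): per iteration, the flow on every process-process connection is
// alpha * (load_i - load_j); blocks would migrate along edges with positive outflow.
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    // Toy process graph: a 1D chain of 8 "processes" with their current load
    // (e.g., number of blocks on one level, possibly weighted).
    std::vector<double> load = {12, 2, 1, 9, 3, 3, 1, 1};
    const double alpha = 0.4;   // diffusion constant per edge (illustrative)

    for (int iteration = 0; iteration < 20; ++iteration) {
        std::vector<double> flowOut(load.size(), 0.0);
        for (std::size_t i = 0; i + 1 < load.size(); ++i) {
            const double flow = alpha * (load[i] - load[i + 1]);
            flowOut[i]     += flow;   // positive: process i sends load towards i+1
            flowOut[i + 1] -= flow;
        }
        double maxFlow = 0.0;
        for (std::size_t i = 0; i < load.size(); ++i) {
            load[i] -= flowOut[i];    // in the real scheme: migrate whole blocks
            maxFlow = std::max(maxFlow, std::abs(flowOut[i]));
        }
        std::cout << "iteration " << iteration << ": max. flow " << maxFlow << "\n";
        if (maxFlow < 0.01) break;    // mimics the early-abort check (an all-reduce in parallel)
    }
    for (double l : load) std::cout << l << " ";
    std::cout << "\n";
}
```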

LBM AMR - Performance
Benchmark environments:
- JUQUEEN (5.0 PFLOP/s): Blue Gene/Q, 459K cores, 1 GB/core, compiler: IBM XL / IBM MPI
- SuperMUC (2.9 PFLOP/s): Intel Xeon, 147K cores, 2 GB/core, compiler: Intel XE / IBM MPI
Benchmark (LBM D3Q19 TRT): lid-driven cavity, 4 grid levels

LBM AMR - Performance
(figures: the benchmark's domain partitioning, the regions where blocks are coarsened and refined, and the forced splits that maintain 2:1 balance)

LBM AMR - Performance
During each refresh cycle of the benchmark, all cells on the finest level are coarsened and the same amount of fine cells is created by splitting coarser cells: 72 % of all cells change their size.

LBM AMR - Performance
Average blocks per process (maximum blocks per process in parentheses):

level   initially    after refresh   after load balance
0       0.383 (1)    0.328 (1)       0.328 (1)
1       0.656 (1)    0.875 (9)       0.875 (1)
2       1.313 (2)    3.063 (11)      3.063 (4)
3       3.500 (4)    3.500 (16)      3.500 (4)

LBM AMR - Performance
(plot: SuperMUC, space filling curve (Morton order); y-axis: time in seconds (0 to 3.5) required for the entire refresh cycle (uphold 2:1 balance, dynamic load balancing, split/merge blocks, migrate data); x-axis: 1024, 8192, 65,536 cores; series: 209,671 / 497,000 / 970,703 cells per core, i.e., up to 14 / 33 / 64 billion cells in total)

LBM AMR - Performance
(plot: SuperMUC, diffusion load balancing; same setup: 209,671 / 497,000 / 970,703 cells per core, up to 14 / 33 / 64 billion cells, 1024 to 65,536 cores, y-axis: seconds, 0 to 3.5; the refresh time is almost independent of the number of processes!)

LBM AMR - Performance
(plot: JUQUEEN, space filling curve (Morton order); hybrid MPI+OpenMP version with SMP: 1 process = 2 cores = 8 threads; 31,062 / 127,232 / 429,408 cells per core, i.e., up to 14 / 58 / 197 billion cells in total; 256 to 458,752 cores; y-axis: seconds, 0 to 12)

LBM AMR - Performance
(plot: JUQUEEN, diffusion load balancing; same setup: 31,062 / 127,232 / 429,408 cells per core, up to 14 / 58 / 197 billion cells, 256 to 458,752 cores, y-axis: seconds, 0 to 12; the refresh time is almost independent of the number of processes!)

LBM AMR - Performance
(plot: JUQUEEN, diffusion load balancing; number of diffusion iterations until the load is perfectly balanced, for 256 to 458,752 cores; y-axis: iterations, 0 to 12)

LBM AMR - Performance
What is the impact on performance, i.e., the overhead of the entire dynamic repartitioning procedure? It depends
- on the number of cells per core,
- on the actual runtime of the compute kernels (D3Q19 vs. D3Q27, additional force models, etc.),
- and on how often dynamic repartitioning happens.
For the previous lid-driven cavity benchmark, the overhead corresponds to roughly 1 to 3 (diffusion) or 1.5 to 10 (space filling curve) time steps.
In practice, a lot of time is spent just to determine whether or not the grid must be adapted, i.e., whether or not refinement must take place; this check alone can often cost as much as the entire overhead of the AMR procedure (see the sketch below).
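
The "do we need to adapt at all?" check can be sketched as follows, assuming a made-up per-block indicator and thresholds: every process evaluates its local blocks cheaply and a single global reduction yields the decision. Only MPI_Allreduce is real MPI here; everything else is illustrative and not the criterion used in the talk's benchmarks.

```cpp
// Sketch: deciding whether a refresh (repartitioning) cycle is needed at all.
// Every process checks its local blocks against a cheap indicator; one global
// reduction yields the decision for everyone.
#include <mpi.h>
#include <iostream>
#include <vector>

struct Block {
    double indicator;   // e.g., max. velocity gradient, updated during time stepping
    int    level;
};

bool refreshNeeded(const std::vector<Block>& localBlocks, double refineAbove,
                   double coarsenBelow, int maxLevel, MPI_Comm comm) {
    int localVote = 0;
    for (const Block& b : localBlocks) {
        if ((b.indicator > refineAbove && b.level < maxLevel) ||
            (b.indicator < coarsenBelow && b.level > 0)) {
            localVote = 1;
            break;
        }
    }
    int globalVote = 0;   // one all-reduce decides for all processes
    MPI_Allreduce(&localVote, &globalVote, 1, MPI_INT, MPI_MAX, comm);
    return globalVote == 1;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<Block> blocks = {{0.3 + 0.2 * rank, 1}, {0.1, 2}};
    const bool refresh = refreshNeeded(blocks, 0.8, 0.05, 3, MPI_COMM_WORLD);
    if (rank == 0)
        std::cout << (refresh ? "run refresh cycle" : "keep current partitioning") << "\n";

    MPI_Finalize();
}
```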

LBM AMR - Performance
AMR for the LBM, example (vocal fold phantom geometry):
- DNS (direct numerical simulation), Reynolds number: 2500, D3Q27 TRT
- 24,054,048 to 315,611,120 fluid cells, 1 to 5 grid levels
- processes: 3584 (on SuperMUC Phase 2)
- runtime: c. 24 h (3 jobs of c. 8 h each)

LBM AMR - Performance
AMR for the LBM, example (vocal fold phantom geometry):
- load balancer: space filling curve (Hilbert order)
- time steps: 180,000 / 2,880,000 (finest grid)
- refresh cycles: 537 (i.e., a refresh approximately every 335 time steps)
- without refinement: 311 times more memory and 701 times the workload

Conclusion

Conclusion & Outlook
The approach to massively parallel grid repartitioning, using a block-structured domain partitioning and employing a lightweight copy of the data structure during dynamic load balancing, is paying off and working extremely well: we can handle 10^11 cells (> 10^12 unknowns) with 10^7 blocks and 1.83 million threads.

Conclusion & Outlook
Outlook, resilience (using ULFM): store redundant, in-memory snapshots; if one or multiple processes fail, restore the data on different processes, perform dynamic repartitioning, and continue :-) (see the sketch below)
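
A purely speculative sketch of the snapshot part of this outlook: each process periodically sends a serialized copy of its block data to a partner rank and keeps that partner's copy in memory, so that after a failure the surviving copy can be restored elsewhere and repartitioning continues. The ring-shaped partner scheme and the payload are assumptions; the actual ULFM failure handling (failure detection, communicator shrinking) is omitted.

```cpp
// Sketch of the resilience idea (redundant in-memory snapshots): every process
// keeps a copy of a neighbor's block data so that, if one process fails, another
// process still holds its data. Partner scheme and payload are illustrative;
// ULFM failure handling is not shown.
#include <mpi.h>
#include <iostream>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Stand-in for serialized block data of this process.
    std::vector<double> myBlocks(1024, double(rank));

    // Simple ring scheme: send my snapshot to the next rank, keep a snapshot of
    // the previous rank (deadlock-free with MPI_Sendrecv for any process count).
    const int next = (rank + 1) % size;
    const int prev = (rank - 1 + size) % size;

    std::vector<double> partnerSnapshot(myBlocks.size(), 0.0);
    MPI_Sendrecv(myBlocks.data(), int(myBlocks.size()), MPI_DOUBLE, next, 0,
                 partnerSnapshot.data(), int(partnerSnapshot.size()), MPI_DOUBLE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0)
        std::cout << "rank 0 holds a snapshot of rank " << prev
                  << " (first value: " << partnerSnapshot.front() << ")\n";

    MPI_Finalize();
}
```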

THANK YOU FOR YOUR ATTENTION! QUESTIONS?