Performance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla


1 Performance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla
SIAM PP 2016, April 13th, 2016
Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer, Harald Köstler and Ulrich Rüde
Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

2 Outline
- Motivation
- walberla Framework
- Phase-Field Method in walberla
- Single-Core Optimizations
- Asynchronous Communication
- I/O and Post-processing
- In-Situ Processing with Python
- Summary and Outlook

3 Motivation
- large domains are required to reduce the influence of the boundaries
- some physical patterns only occur in highly resolved simulations ("spiral" growth)
- goal: simulate big domains in 3D
- an unoptimized, general-purpose phase field code from KIT is available
- goal: write an optimized parallel version for this specific model

4 The walberla Framework

5 walberla Framework
- walberla: "widely applicable Lattice-Boltzmann from Erlangen"
- HPC software framework, originally developed for CFD simulations with the Lattice Boltzmann Method (LBM)
- evolved into a general framework for algorithms on block-structured grids
- example applications: Vocal Fold Study (Florian Schornbaum), Free Surface Flow, Fluid Structure Interaction (Simon Bogner)

6 walberla Framework
- written in C++ with Python extensions
- hybrid parallelization (MPI + OpenMP)
- no data structures that grow with the number of processes
- scales from a laptop to recent petascale machines
- parallel I/O
- portable across compilers (e.g. llvm/clang) and operating systems
- automated tests / CI servers
- open source release planned

7 Block-structured Grids
Domain decomposition and distribution to processes:
- regular decomposition into blocks containing uniform grids
- grid refinement: octree-like decomposition (forest of octrees); each block contains a uniform grid of the same size
- 2:1 balance between neighboring cells at level transitions
- in most cases, if a regular decomposition of a uniform grid is used, exactly one block is assigned to each process (see the sketch below)
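A minimal sketch of such a regular decomposition, assuming a uniform grid and one block per MPI rank; this is illustrative only and not the walberla block data structure.

```python
# Regular decomposition of a uniform 3D grid into blocks, one block per rank.
import itertools

def decompose(domain_cells, blocks_per_dim):
    """Split a uniform grid of domain_cells into blocks_per_dim blocks per axis."""
    cells_per_block = [d // b for d, b in zip(domain_cells, blocks_per_dim)]
    blocks = {}
    for rank, (i, j, k) in enumerate(itertools.product(*map(range, blocks_per_dim))):
        offset = (i * cells_per_block[0], j * cells_per_block[1], k * cells_per_block[2])
        blocks[rank] = {"index": (i, j, k), "offset": offset, "size": tuple(cells_per_block)}
    return blocks

# e.g. a 512^3 domain split into 4 x 4 x 2 = 32 blocks, i.e. 32 processes
print(decompose((512, 512, 512), (4, 4, 2))[0])
```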

8 Hybrid Parallelization
Distributed memory parallelization with MPI:
- data exchange on the borders between blocks via ghost layers (figure: sender process / receiver process)
- slightly more complicated for non-uniform domain decompositions, but the same general ideas still apply
- support for overlapping communication and computation
- some advanced models require more complex communication patterns (e.g. free-surface flow and fluid-structure interaction)
A minimal ghost-layer exchange sketch follows below.
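The following sketch uses mpi4py and numpy to exchange one ghost layer between neighboring blocks along one direction; it is a simplified stand-in, not the walberla communication scheme.

```python
# Exchange one ghost layer of a block field with the left/right neighbor process.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nx, ny, nz = 16, 16, 16                                    # interior cells of this block
field = np.full((nx + 2, ny + 2, nz + 2), float(rank))     # +1 ghost layer on each side

left  = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

ghost_from_left  = np.empty((ny + 2, nz + 2))
ghost_from_right = np.empty((ny + 2, nz + 2))

# send the outermost interior slices, receive the neighbor slices into the ghost layers
comm.Sendrecv(np.ascontiguousarray(field[nx]), dest=right, recvbuf=ghost_from_left,  source=left)
comm.Sendrecv(np.ascontiguousarray(field[1]),  dest=left,  recvbuf=ghost_from_right, source=right)

if left  != MPI.PROC_NULL: field[0]      = ghost_from_left
if right != MPI.PROC_NULL: field[nx + 1] = ghost_from_right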

9 Phase field in walberla

10 Phase Field Algorithm
- two lattices (fields): phase field φ with 4 components, chemical potential μ with 2 components
- two time steps are stored in src and dst fields (see the sketch below)
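A minimal numpy sketch of this field layout and the src/dst double buffering; the kernel bodies are placeholders, not the actual phase-field update rules.

```python
# Two fields phi (4 components) and mu (2 components), each stored twice (src/dst)
# and swapped after every time step instead of copying.
import numpy as np

shape = (64, 64, 64)
phi_src, phi_dst = (np.zeros(shape + (4,)) for _ in range(2))
mu_src,  mu_dst  = (np.zeros(shape + (2,)) for _ in range(2))

def update_mu(phi_src, mu_src, mu_dst):    # placeholder for the real stencil kernel
    mu_dst[...] = mu_src

def update_phi(phi_src, mu_dst, phi_dst):  # placeholder for the real stencil kernel
    phi_dst[...] = phi_src

for step in range(10):
    update_mu(phi_src, mu_src, mu_dst)
    update_phi(phi_src, mu_dst, phi_dst)
    phi_src, phi_dst = phi_dst, phi_src    # pointer swap, no data movement
    mu_src,  mu_dst  = mu_dst,  mu_src
```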

11 Optimizations of the Phase Field Algorithm

12 Optimization Layers
Model layer:
- moving window technique
- simplifications due to the special setup (e.g. analytical temperature gradient)
- shortcuts: some terms can be neglected in certain cell types
Algorithm layer:
- access patterns / stencils
- overlapping computation and communication
- eliminate common subexpressions
Hardware layer:
- SIMDification
- memory layout (AoS vs. SoA; see the sketch below)
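A minimal numpy illustration of the AoS vs. SoA choice mentioned on the hardware layer; the field name and sizes are made up for the example.

```python
# Array-of-structs (AoS) vs. struct-of-arrays (SoA) layout for a 4-component field.
import numpy as np

nx, ny, nz, ncomp = 64, 64, 64, 4

# AoS: the components of one cell are adjacent in memory (innermost axis)
phi_aos = np.zeros((nx, ny, nz, ncomp))

# SoA: each component is one contiguous block (outermost axis), which keeps a
# stencil sweep over a single component unit-stride and therefore SIMD-friendly
phi_soa = np.zeros((ncomp, nx, ny, nz))

# the same cell (i, j, k), component c, in both layouts:
i, j, k, c = 10, 20, 30, 2
assert phi_aos[i, j, k, c] == phi_soa[c, i, j, k]
```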

13 Single Core Optimization Results
(chart: performance of the φ-kernel in MLUP/s for successive variants: general purpose C code, basic walberla implementation, with SIMD intrinsics, single cell with T(z) optimization, with staggered buffer, with shortcuts (cellwise branching); measured speedups of 6, 28, 59 and 67 over the baseline; legend: interface / solid / liquid cells)
- test system: one SuperMUC core (Intel Xeon E C)
- for both kernels: systematic performance engineering leads to 80x faster code (the MLUP/s metric is illustrated below)
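For reference, a minimal sketch of how the MLUP/s figure (million lattice-cell updates per second) quoted on this slide is computed from cell count and runtime; the numbers in the example are arbitrary.

```python
# MLUP/s = (cells updated) / (wall-clock seconds) / 1e6
def mlups(cells, time_steps, seconds):
    return cells * time_steps / seconds / 1e6

# e.g. a 200^3 block, 100 time steps, 8 s wall-clock time
print(f"{mlups(200**3, 100, 8.0):.1f} MLUP/s")
```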

14 Algorithm Layer Optimization

15 Communication Overlap (Algorithm Layer Optimization)
- the communication of μ can be overlapped with computation without kernel adaptations (see the sketch below)
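A minimal mpi4py sketch of the overlap pattern, assuming periodic neighbors and placeholder kernels; it is not the walberla scheme, only the general idea: start the μ ghost-layer exchange, compute on the interior, then finish the boundary cells.

```python
# Overlap the ghost-layer exchange of mu with computation on the block interior.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
right = (rank + 1) % size
left  = (rank - 1) % size

n = 32
mu = np.zeros((n + 2, n + 2, n + 2))      # interior + 1 ghost layer per side
ghost = np.empty((n + 2, n + 2))

# 1. start non-blocking communication of the mu ghost layer
reqs = [comm.Isend(np.ascontiguousarray(mu[n]), dest=right),
        comm.Irecv(ghost, source=left)]

# 2. compute on interior cells that do not need the ghost layer
#    (placeholder for the real kernels)
mu[2:n, 1:-1, 1:-1] *= 1.0

# 3. wait for the communication, then compute the boundary cells
MPI.Request.Waitall(reqs)
mu[0] = ghost
mu[1, 1:-1, 1:-1] *= 1.0
```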

16 Communication Overlap (Algorithm Layer Optimization)

17 Algorithm Layer Optimization

18 Algorithm Layer Optimization

19 Scaling Results (plots: scaling on SuperMUC and JUQUEEN)

20 I/O and Postprocessing

21 Managing I/O
- I/O is necessary to store results (frequently) and for checkpointing (seldom)
- for highly parallel simulations the output of results quickly becomes a bottleneck
- example: storing one time step of a 2420 x 2420 x 1474 domain as a voxel file takes 386 GB
- solution: generate a surface mesh from the voxel data during the simulation, locally on each process, using a marching cubes algorithm; one mesh for each phase boundary
- mesh size: < 10 MB (see the sketch below)
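A minimal sketch of the marching cubes step using scikit-image on dummy data; this stands in for the walberla in-situ mesh generation, which runs on each process's local block.

```python
# Extract a surface mesh for one phase boundary from local voxel data.
import numpy as np
from skimage.measure import marching_cubes

# hypothetical local phase-field block: a sphere-shaped "phase" as dummy data
x, y, z = np.mgrid[-1:1:64j, -1:1:64j, -1:1:64j]
phi = 1.0 - np.sqrt(x**2 + y**2 + z**2)          # > 0 inside the phase

# triangulate the phi = 0 iso-surface (the phase boundary)
verts, faces, normals, values = marching_cubes(phi, level=0.0)
print(f"{len(verts)} vertices, {len(faces)} triangles")
```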

22 Managing I/O
- the surface meshes are still unnecessarily finely resolved: one triangle per interface cell

23 Managing I/O
- mesh reduction with a quadric edge collapse algorithm (cglib)
- crucial: the mesh reduction step preserves boundary vertices
- hierarchical mesh coarsening and reduction during the simulation: the local fine meshes generated by marching cubes are merged and reduced until one coarse mesh remains on the root process
- result: one coarse mesh with a size on the order of several MB (see the sketch below)
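A minimal sketch of the hierarchical reduction towards the root process with mpi4py; `merge` and `decimate` are hypothetical stand-ins for the actual mesh merging and quadric-edge-collapse reduction.

```python
# Hierarchical reduction of local surface meshes to one coarse mesh on rank 0.
from mpi4py import MPI

def merge(a, b):      # hypothetical: concatenate two triangle meshes
    return a + b

def decimate(mesh):   # hypothetical: coarsen the mesh, preserving boundary vertices
    return mesh[: max(1, len(mesh) // 2)]

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
mesh = [("triangle", rank, i) for i in range(100)]   # dummy local fine mesh

step = 1
while step < size:                                   # binary reduction tree
    if rank % (2 * step) == 0:
        partner = rank + step
        if partner < size:
            mesh = decimate(merge(mesh, comm.recv(source=partner)))
    elif rank % (2 * step) == step:
        comm.send(mesh, dest=rank - step)
        break
    step *= 2

if rank == 0:
    print(f"coarse mesh on root: {len(mesh)} triangles")
```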

24 Python Coupling
- extract relevant data while the simulation is running
- direct, efficient array access via the Python numpy package: data is shared, not copied
- the boost::python library connects the C++ code with Python
- further applications: flexible configuration; model development (Matlab-like functionality available)
(diagram: walberla python_coupling module, boost::python, libpython)
A sketch of an in-situ analysis callback follows below.
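The following sketch shows the kind of in-situ callback this enables, using plain numpy; the callback name and signature are hypothetical and not the actual walberla python_coupling interface. The key point is that `phi_view` wraps the simulation's field memory, so no copy is made.

```python
# In-situ analysis: compute a reduction on the shared phi array every N steps.
import numpy as np

def at_end_of_time_step(step, phi_view):
    """Hypothetical callback invoked from the C++ side with a shared numpy view."""
    solid_fraction = float(np.mean(phi_view[..., 0] > 0.5))
    if step % 100 == 0:
        print(f"step {step}: local solid fraction = {solid_fraction:.3f}")

# stand-alone usage example with dummy data
at_end_of_time_step(0, np.random.rand(32, 32, 32, 4))
```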

25 Simplify Workflow

26 Python Coupling
- Method 1: using Python from C++ (host language: C++)
- Method 2: using C++ from Python (host language: Python)
(diagram: walberla python_coupling module, boost::python, libpython, Python interpreter, walberla.so)

27 Python Coupling: Demo (same two coupling methods and diagram as on the previous slide)

28 Summary

29 Summary / Outlook
Summary:
- an efficient phase field algorithm is necessary to simulate certain physical effects ("spiral" growth)
- systematic performance engineering on several levels: speedup by a factor of 80 compared to the original version
- parallel output and data processing during the simulation to reduce the result file size
- coupling to the Python scripting language for in-situ processing
Outlook:
- GPU implementation
- coupling to the Lattice Boltzmann Method
- improved discretization scheme (implicit method)

30 Thank you! Questions?
