First Steps of YALES2 Code Towards GPU Acceleration on Standard and Prototype Cluster

1 YALES2: Semi-industrial code for turbulent combustion and flows
Jean-Matthieu Etancelin, ROMEO, NVIDIA GPU Application Lab, University of Reims
GTC Europe, München, October 11th, 2017

2 Outline
1. Introduction: Context
   - ROMEO HPC Center
   - GPU Application Lab
   - YALES2
2. Existing code profiling
3. Code porting
   - Porting strategies
   - Internal kernels performances
4. Benchmarks
5. Conclusions
   - Limitations and future work

3 ROMEO HPC Center
University of Reims
- About [...] students
- Multidisciplinary university (undergraduate, graduate, PhD, research labs)
- HPC resources for both academic and industrial research
- Expertise and teaching in HPC and GPU technologies
- Integrated in the European HPC ecosystem (French Tier 1.5 equip@meso, ETP4HPC)
- Full hybrid cluster (2 Intel Ivy Bridge + 2 K20 + IB QDR)

4 GPU Application Lab
Objectives: intensive exploitation of ROMEO
- Expertise in hybrid HPC, in particular in GPU technologies
- GPU code porting
- Optimization and scaling-up towards a large number of GPUs
- Training and teaching for ROMEO users
Activities
- Optimization of GPU, hybrid and parallel codes
- Algorithm improvements for the targeted architectures
- Adaptation of numerical methods to hybrid and parallel architectures
Various collaborations
- Local URCA laboratories, and some external collaborations (ONERA, Univ. of Normandy)
- Several domains of application (fluid mechanics, chemistry, computer science, applied maths, ...)

5 Massively parallel solver for multi-physics problems in fluid dynamics, from primary atomisation to pollutant dispersion in complex geometries
- Code developed at CORIA (University of Normandy) since 2007
- V. Moureau, G. Lartigue, P. Bénard (project leaders)
- 10 developers (engineers, researchers, PhD students, ...) + contributors
Code
- Simulations of two-phase and reactive flows at low Mach number on complex geometries
- LES and DNS solvers on unstructured meshes
- 3D flow simulations on massively parallel architectures
- Used by more than 160 academic and industrial researchers
- 60+ scientific publications

6 YALES2, a complete library
Main features
- 350 000 lines of code (f90 and f03)
- Portable
- Python interface
- Main solvers:
  - Scalar solver (SCS)
  - Level set solver (LSS)
  - Lagrangian solver (LGS)
  - Incompressible solver (ICS)
  - Variable density solver (VDS)
  - Spray solver (SPS)
  - Magneto-Hydrodynamic solver (MHD)
  - Heat transfer solver (HTS)
  - Chemical reactor solver (CRS)
  - Darcy solver (DCY)
  - Mesh movement solver (MMS)
  - ALE solver (ALE)
  - Linear acoustics solver (ACS)
  - 5+ solvers in progress

7 HPC with YALES2 in combustion
Multi-scale and multi-physics applications
- More than 85% of used energy comes from combustion
- Related to many fields (transportation, industry, energy, ...)
Examples in aeronautics [figures]

8 HPC with YALES2
HPC
- Using up to [...] cores on national French clusters (IDRIS, CINES, ...), regional (CRIANN) and local machines
- Using advanced parallel programming techniques (hybrid computing, automatic mesh adaptation, ...)
- Collaborations with Exascale Lab, INTEL/CEA/GENCI/UVSQ
- Code used as benchmark on prototypes (IDRIS, Ouessant: Power8+P100), Cellule de Veille Technologique GENCI
- Collaboration on GPU porting, GPU Application Lab, ROMEO

10 Profiling the existing code
Specific tools (MAQAO + TAU + PAPI)
- In-depth profiling:
  - Computational time (per function, per internal and external loop)
  - Number of floating point operations
  - Number of cache misses
  - ...
- Hot-spot: matrix-vector product in the Preconditioned Conjugate Gradient (PCG)
Figures: functions profile; external loops profile

11 Profiling existing code
Identifying the hot-spot
- Preconditioned conjugate gradient: 250 lines of code for 55% of total time
- Matrix-vector product: 30 lines of code for 30% of total time
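The two profiling figures above bound what porting can achieve. As a rough sketch, Amdahl's law gives the best-case whole-application speedup when only the profiled fraction is accelerated. The function name `amdahl_speedup` and the x20 kernel speedup are illustrative assumptions, not values from the slides, and the estimate ignores host-device transfer costs.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """Whole-application speedup when a fraction of runtime is accelerated."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


# Fractions taken from the profiling slide: 30% (matrix-vector product)
# and 55% (whole preconditioned conjugate gradient).
MATVEC_FRACTION = 0.30
PCG_FRACTION = 0.55

# Upper bounds: the accelerated part costs nothing (infinite speedup).
matvec_limit = amdahl_speedup(MATVEC_FRACTION, float("inf"))
pcg_limit = amdahl_speedup(PCG_FRACTION, float("inf"))

# Hypothetical finite kernel speedup (x20 is an assumption for illustration).
matvec_x20 = amdahl_speedup(MATVEC_FRACTION, 20.0)

print(f"matrix-vector product only, ideal: x{matvec_limit:.2f}")
print(f"whole PCG, ideal:                  x{pcg_limit:.2f}")
print(f"matrix-vector product at x20:      x{matvec_x20:.2f}")
```

Under these assumptions, porting the matrix-vector product alone cannot speed up the application by more than about 1.4, which is one way to read the motivation for a version that moves the whole conjugate gradient to the GPU.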

13 How to port the hot-spot to GPU?
Code main feature: data-centered structure
- Hierarchical, well-defined data structures based on block decomposition of the mesh
- Every computing loop follows the same skeleton (two levels of nested loops: over mesh blocks, then over vertices, edges or elements)
Three major possibilities:
- OpenACC with PGI compilers
  - Non-intrusive for the code (macros)
  - Complementary with the in-progress OpenMP version
  - Strong potential with unified memory
  - No deep copy for complex data structures
  - No support for Fortran pointers
- CUDA/C with Intel compilers
  - Fine management of the GPU (code + data)
  - Passing through intermediary C interfaces
  - No deep copy for complex data structures
  - Code rewriting (only for computational loops)
- CUDA/Fortran with PGI (not tested)
  - Similar to CUDA/C without interface

14 Code porting with CUDA
Key points
Data management
- Exploiting the Fortran/C interoperability for data structures
- Fortran derived types translated to C typedefs (automatic, YALES2-specific translation tool)
GPU memory management
- Allocation and management of GPU-specific data and utility arrays
- CPU-GPU transfers optimized with a buffer array (in pinned memory)
Execution model
- Mapping the mesh decomposition and hierarchical data structure to CUDA blocks/threads
Algorithm adaptation: inverse connectivity for mesh exploration
- Loop first over vertices instead of edges (the Finite Volume method works on edges by construction)

15 CUDA code porting
Inverse connectivity for mesh exploration
Matrix-vector product computing (op product)
- Initial algorithm (not well suited to GPU):

    Foreach block b of mesh                  // blocks
      Foreach edge e of b                    // threads
        vs, ve = vertex(e)
        result(vs) += f(value(e), data(vs), data(ve))
        result(ve) -= f(value(e), data(vs), data(ve))

- Algorithm with inverse connectivity:

    Foreach block b of mesh                  // blocks
      Foreach vertex v of b                  // threads
        r = 0                                // register
        Foreach edge e from vertex v
          ve = end(e)
          r += f(value(e), data(v), data(ve))
        Foreach edge e to vertex v
          vs = start(e)
          r -= f(value(e), data(vs), data(v))
        result(v) = r
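The two loops above can be checked against each other on a tiny mesh. The following is a serial Python sketch for a single block, not the YALES2 implementation: the edge function `f`, the helper names and the sample mesh are all assumptions made for illustration. In the edge loop, two edges sharing a vertex both write to the same `result` entry, which on a GPU would require atomic updates. In the vertex loop, each vertex accumulates into a private value and writes once.

```python
def f(edge_value, data_start, data_end):
    # Hypothetical edge contribution; the slides do not define f.
    return edge_value * (data_end - data_start)


def matvec_edge_loop(n_vertices, edges, edge_values, data):
    """Initial algorithm: one iteration per edge, scattering to both ends."""
    result = [0.0] * n_vertices
    for e, (vs, ve) in enumerate(edges):
        contribution = f(edge_values[e], data[vs], data[ve])
        result[vs] += contribution  # concurrent writes here on a GPU
        result[ve] -= contribution
    return result


def build_inverse_connectivity(n_vertices, edges):
    """For each vertex, list the edges leaving it and the edges reaching it."""
    edges_from = [[] for _ in range(n_vertices)]
    edges_to = [[] for _ in range(n_vertices)]
    for e, (vs, ve) in enumerate(edges):
        edges_from[vs].append(e)
        edges_to[ve].append(e)
    return edges_from, edges_to


def matvec_vertex_loop(n_vertices, edges, edge_values, data, edges_from, edges_to):
    """Inverse-connectivity algorithm: one iteration per vertex, single write."""
    result = [0.0] * n_vertices
    for v in range(n_vertices):
        r = 0.0  # plays the role of the per-thread register
        for e in edges_from[v]:
            ve = edges[e][1]
            r += f(edge_values[e], data[v], data[ve])
        for e in edges_to[v]:
            vs = edges[e][0]
            r -= f(edge_values[e], data[vs], data[v])
        result[v] = r
    return result


# Sample block: a square with one diagonal (illustrative values).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
edge_values = [1.0, 2.0, 1.0, 3.0, 0.5]
data = [1.0, 4.0, 2.0, 8.0]

edges_from, edges_to = build_inverse_connectivity(4, edges)
by_edges = matvec_edge_loop(4, edges, edge_values, data)
by_vertices = matvec_vertex_loop(4, edges, edge_values, data, edges_from, edges_to)
print(by_edges)
print(by_vertices)
```

The vertex loop evaluates `f` twice per edge (once from each end), trading extra arithmetic for conflict-free writes.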

16 Kernel performances
Performance comparison for Fortran loops and CUDA kernels
Figure: speedup of the Conjugate Gradient internal loops on GPU (16 MPI processes vs. 2 GPUs). Bar labels, in extraction order:
- op product: x28.1
- compute p, compute gamma, update scal res: x16.3, x18.4, x19.3
- exact residual: x24.5
- residual norm: x28.3
- compute final rho: x[...]

18 Overall algorithm
Overall process
Main loop of the conjugate gradient, containing:
- Computing functions from the previous figure (CUDA kernels)
- Synchronization and data management host-side functions (Fortran + MPI)
GPU management
- Host-device data transfers between computation kernels and host-side data management functions
- Partial overlap of transfers by computations
Code versions
- CPU: initial algorithm and Fortran code
- GPU: algorithm with inverse connectivity (better for GPU), only the hot-spot internal function on GPU
- GPU-PCG: GPU version with all Conjugate Gradient computations on GPU
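For readers unfamiliar with the structure of such a main loop, here is a minimal serial sketch of a preconditioned conjugate gradient in pure Python. It is not the YALES2 solver: the Jacobi (diagonal) preconditioner, the dense matrix storage and all names are assumptions chosen to keep the example short. The comments mark which steps are the matrix-vector product (the profiled hot-spot) and which are reductions that, in a distributed run, would typically involve host-side MPI synchronization.

```python
import math


def mat_vec(matrix, vector):
    # Matrix-vector product: the hot-spot identified by profiling.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    # Reduction; across MPI ranks this would need a global sum.
    return sum(a * b for a, b in zip(u, v))


def pcg(matrix, rhs, tol=1e-10, max_iter=100):
    """Jacobi-preconditioned conjugate gradient for a symmetric positive definite matrix."""
    n = len(rhs)
    inv_diag = [1.0 / matrix[i][i] for i in range(n)]  # assumed preconditioner
    x = [0.0] * n
    r = list(rhs)                                  # residual for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]     # preconditioned residual
    p = list(z)                                    # search direction
    rho = dot(r, z)
    iterations = 0
    while iterations < max_iter and math.sqrt(dot(r, r)) > tol:
        q = mat_vec(matrix, p)                     # hot-spot
        alpha = rho / dot(p, q)                    # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rho_new = dot(r, z)
        beta = rho_new / rho
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rho = rho_new
        iterations += 1
    return x, iterations


# Small symmetric positive definite system (illustrative values).
matrix = [[4.0, 1.0, 0.0],
          [1.0, 3.0, 1.0],
          [0.0, 1.0, 2.0]]
rhs = [1.0, 2.0, 3.0]

solution, iterations = pcg(matrix, rhs)
residual = max(abs(a - b) for a, b in zip(mat_vec(matrix, solution), rhs))
print(solution, iterations, residual)
```

In terms of the code versions listed above, the GPU version would offload only `mat_vec`, so vectors cross the host-device boundary around every call, whereas the GPU-PCG version would keep the vector updates and local reductions on the device as well.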

19 Various cluster architectures
Cluster configurations for single-node comparison
Comparison of local GPU acceleration versus CPU code
- Basic runtime: N MPI processes (reference) vs. N MPI processes using N GPUs (1-to-1 association)
- MPS runtime: 16 MPI processes (reference) vs. 16 MPI processes with N GPUs
Machines
- ROMEO: 2 Intel E[...] v2 8-core, 2 K20x (PCIe)
- Myria: 2 Intel E[...] v4 14-core, 2 P100 (PCIe)
- Ouessant: 2 IBM Power S822LC 10-core, 4 P100 (NVLink)
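The two runtime configurations differ in how MPI processes are associated with GPUs. The sketch below illustrates one possible association, a round-robin on the node-local rank. This mapping rule and the function name `assign_gpus` are assumptions; the slides only state that the basic runtime is 1-to-1 and that the MPS runtime lets 16 processes use N GPUs.

```python
def assign_gpus(n_ranks, n_gpus):
    """Return the GPU index used by each node-local MPI rank (round-robin, assumed)."""
    return [rank % n_gpus for rank in range(n_ranks)]


# Basic runtime: as many MPI processes as GPUs, 1-to-1 association.
basic = assign_gpus(n_ranks=2, n_gpus=2)

# MPS runtime: 16 MPI processes sharing 2 GPUs.
shared = assign_gpus(n_ranks=16, n_gpus=2)
ranks_per_gpu = [shared.count(gpu) for gpu in range(2)]

print(basic)
print(shared)
print(ranks_per_gpu)
```

With 16 processes and 2 GPUs this places 8 processes on each device, which is the situation where a GPU has to serve several client processes at once.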

20 Benchmark results
Application speedup on the different architectures
Figure: PCG speedup for different mesh sizes, code versions and runtime configurations.
- Panels: Intel-K20, Intel-P100, IBM-P100
- Horizontal axis: mesh elements
- Series: gpu, gpu-pcg, MPS+gpu, MPS+gpu-pcg
- Annotated speedups, in extraction order: x3.8, x4.4; x3.6, x4.5; x2.8, x[...]

21 Limitations and future work
Discussion
Overall successful study: performance improvement for the entire application with GPU-accelerated code
- Recent technologies help performance (the more recent the hardware, the higher the speedup)
- MPS is of limited interest (wait for the Volta version: internal support and number of clients)
- Data transfers still strongly limit performance
- Intrusive overlapping of transfers by non-GPU computations
- Porting more functions to GPU (synchronization and data management for MPI)
Future work and code development perspectives
- GPU porting of data management functions (using GPU-aware MPI)
- Introduce a 3rd level of parallelism: OpenMP (to accelerate the CPU part of data management)

22 Thank you for your attention
