Acceleration of HPC applications on hybrid CPU-GPU systems: When can Multi-Process Service (MPS) help?

GTC 2018, March 28, 2018
Olga Pearce (Lawrence Livermore National Laboratory), Max Katz (NVIDIA), Leopold Grinberg (IBM)

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DEAC52-07NA
Lawrence Livermore National Security, LLC
LLNL-PRES

Slide 2: Multi-Process Service (MPS)
- Allows kernels launched from different (MPI) processes to be processed concurrently on the same GPU
- Share in space: utilize inactive SMs when the work is small
- Share in time: processes take turns if every SM is occupied
[Figure: two GPU schedule diagrams, "Share in space" and "Share in time"; axes: SMs, Time]
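
The two sharing modes on this slide can be illustrated with a toy scheduling model. This is only a sketch under simplifying assumptions, not how the MPS scheduler actually works: each process launches one kernel, a kernel occupies a fixed number of SMs for a fixed duration, and kernels are packed in launch order. The `Kernel` struct and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One kernel launched by one MPI process (hypothetical toy description).
struct Kernel {
  int sms_needed;   // number of SMs the kernel can occupy
  double duration;  // run time in arbitrary units
};

// Share in time: kernels from different processes run one after another.
double time_shared_makespan(const std::vector<Kernel>& kernels) {
  double total = 0.0;
  for (const Kernel& k : kernels) total += k.duration;
  return total;
}

// Share in space: pack kernels, in launch order, into batches that fit on the
// GPU together; a batch lasts as long as its longest kernel.
double space_shared_makespan(const std::vector<Kernel>& kernels, int total_sms) {
  assert(total_sms > 0);
  double total = 0.0;
  double batch_longest = 0.0;
  int sms_in_use = 0;
  for (const Kernel& k : kernels) {
    int need = std::min(k.sms_needed, total_sms);
    if (sms_in_use + need > total_sms) {  // batch is full: close it
      total += batch_longest;
      batch_longest = 0.0;
      sms_in_use = 0;
    }
    sms_in_use += need;
    batch_longest = std::max(batch_longest, k.duration);
  }
  return total + batch_longest;
}
```

In this model, four kernels that each need a quarter of the SMs finish in one time unit when sharing in space, while four kernels that each fill the whole GPU still take four units: the processes simply take turns.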

Slide 3: Sierra system architecture finalized and currently under deployment at LLNL
Compute System:
- 4,320 nodes
- 1.29 PB Memory
- 240 Compute Racks
- 125 PFLOPS
- 12 MW
Compute Node:
- 2 IBM POWER9 CPUs
- 4 NVIDIA Volta GPUs
- 256 GiB DDR4
- NVMe-compatible PCIe 1.6 TB SSD
- 16 GiB Globally addressable HBM2 associated with each GPU
- Coherent Shared Memory
FLOPS split: CPUs 5% of FLOPS, GPUs 95% of FLOPS

Slide 4: Ways to utilize a node of Sierra (showing one socket)
- MPI process/core
- MPI process/GPU
- MPI process/core, MPS for GPU
[Figure: one diagram of the socket's CPU and GPUs per configuration]

Slide 5: Parallel performance of multiphysics simulations
Decide how to run each phase of the multiphysics simulation:
- On GPU vs. on CPU
- How many MPI processes (one per CPU core or one per GPU?)
- If some phases use one MPI process per CPU core, can we use MPS for the accelerated phases?
Outline:
- Tools used for measurement
- Multiphysics application
- How the application is accelerated
- Results: MPI process/GPU vs. 4 MPI processes/GPU + MPS
  - Impact on kernel performance
  - Impact on communication

Slide 6: Tool: Caliper [SC '16]
- Performance analysis toolbox, leverages existing tools
- Developed at LLNL
- Caliper team is responsive to our needs
- JSON output format
1. Annotate: begin/end API similar to timer libraries; annotation of libraries (e.g., SAMRAI, hypre) combined seamlessly
2. Collect: runtime parameters instruct Caliper what to measure:
   - MPI function calls
   - Linux perf_event sampling (Libpfm)
   - CUDA driver/runtime calls (using CUPTI)
3. Analyze Using
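
To make the "begin/end API similar to timer libraries" idea concrete, here is a minimal sketch of such an annotation interface in plain C++. It is a hypothetical stand-in and not Caliper's API: the class `RegionProfile` and its methods are invented for illustration, and it records only visit counts and accumulated wall-clock time per named region.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical stand-in for a begin/end annotation API; this is not Caliper.
class RegionProfile {
 public:
  // Mark the start of a named region.
  void begin(const std::string& name) {
    assert(open_.count(name) == 0);  // same-name regions must not nest here
    open_[name] = Clock::now();
  }

  // Mark the end of a named region and accumulate its elapsed time.
  void end(const std::string& name) {
    auto it = open_.find(name);
    assert(it != open_.end());  // end() must match an earlier begin()
    seconds_[name] +=
        std::chrono::duration<double>(Clock::now() - it->second).count();
    ++visits_[name];
    open_.erase(it);
  }

  int visits(const std::string& name) const {
    auto it = visits_.find(name);
    return it == visits_.end() ? 0 : it->second;
  }

  double seconds(const std::string& name) const {
    auto it = seconds_.find(name);
    return it == seconds_.end() ? 0.0 : it->second;
  }

 private:
  using Clock = std::chrono::steady_clock;
  std::map<std::string, Clock::time_point> open_;
  std::map<std::string, double> seconds_;
  std::map<std::string, int> visits_;
};
```

An application would wrap each phase of a time step in `begin("phase")` / `end("phase")` and query the totals afterwards; a real tool additionally attaches MPI and CUDA measurements to the same regions.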

Slide 7: Application: ARES is a massively parallel, multi-dimensional, multi-physics code at LLNL
Physics Capabilities:
- ALE-AMR Hydrodynamics
- High-order Eulerian Hydrodynamics
- Elastic-Plastic flow
- 3T plasma physics
- High-Explosive modeling
- Diffusion, SN Radiation
- Particulate flow
- Laser ray-tracing
- Magnetohydrodynamics (MHD)
- Dynamic mixing
- Non-LTE opacities
Applications:
- Inertial Confinement Fusion (ICF)
- Pulsed power
- National Ignition Facility debris
- High-Explosive experiments

Slide 8: ARES uses RAJA
- 800k lines of C/C++ with MPI
- 22 years old, used daily on our current supercomputers
- Single code base effectively utilizes all HPC platforms
- Use RAJA as an abstraction layer for on-node parallelization
- RAJA is a collection of C++ software abstractions
- Separation of concerns
C-style for-loop:
    double* x; double* y;
    double a;
    for (int i = begin; i < end; ++i) {
      y[i] += a * x[i];
    }
RAJA-style loop:
    double* x; double* y;
    double a;
    RAJA::forall<exec_policy>(begin, end, [=] (int i) {
      y[i] += a * x[i];
    });
- Use different RAJA backends (CUDA, OpenMP)
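
The separation of concerns on this slide, where the loop body is written once and an execution policy selects the backend, can be sketched in plain C++ without RAJA. Everything in the `sketch` namespace is hypothetical and is not RAJA's implementation or API; it only mimics the shape of a `forall` call. The OpenMP pragma is guarded so the code also compiles, and runs sequentially, without OpenMP.

```cpp
#include <cassert>
#include <vector>

namespace sketch {  // hypothetical names; this is not RAJA's implementation

struct sequential_policy {};
struct openmp_policy {};

// Sequential backend: a plain for-loop around the body.
template <typename Body>
void forall(sequential_policy, int begin, int end, Body body) {
  for (int i = begin; i < end; ++i) body(i);
}

// OpenMP backend: same body, iterations divided among threads when OpenMP
// is enabled at compile time.
template <typename Body>
void forall(openmp_policy, int begin, int end, Body body) {
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (int i = begin; i < end; ++i) body(i);
}

}  // namespace sketch

// The loop body is written once; the Policy type picks the backend.
template <typename Policy>
void daxpy(std::vector<double>& y, const std::vector<double>& x, double a) {
  assert(x.size() == y.size());
  double* yp = y.data();
  const double* xp = x.data();
  sketch::forall(Policy{}, 0, static_cast<int>(y.size()),
                 [=](int i) { yp[i] += a * xp[i]; });
}
```

Calling `daxpy<sketch::sequential_policy>(y, x, a)` or `daxpy<sketch::openmp_policy>(y, x, a)` runs the same body through different backends, which is the property that lets a single code base target several platforms.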

Slide 9: Results
- 3D Sedov blastwave problem
- Hydrodynamics calculation
- 80 kernels
Pre-SIERRA machine (rzmanta), Minsky nodes:
- 2x Power8+ CPUs (20 cores)
- 4x NVIDIA P100 (Pascal) GPUs with 16GB memory each
- NVLINK 1.0
All results shown use 4 Minsky nodes (16 GPUs)
* Some results generated with pre-release versions of compilers; improvements in performance expected in future releases

Slide 10: Domain decomposition with and without MPS
[Figure: domain decompositions for MPI process/GPU and 4 MPI processes/GPU + MPS]
Differences:
- Computation: work per MPI process
- Communication: neighbors and surface to volume ratio

Slide 11: Overall runtime with and without MPS
[Figure: time (sec) vs. problem size (zones^3); series: MPI process/GPU, 4 MPI processes/GPU + MPS]
Differences:
- Computation
- Communication
- Memory

Slide 12: Computation time: Small kernels
[Figure: time (sec) vs. problem size (zones^3); series: MPI process/GPU, 4 MPI processes/GPU + MPS]
- Few zones, small amount of work per zone
- Dominated by kernel launch overhead
- MPS may be slightly slower
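
A rough way to see why small kernels are "dominated by kernel launch overhead" is a two-term cost model: a fixed cost per launch plus a cost proportional to the number of zones. The sketch below is an assumption made for illustration, not a measurement from the talk; the function name and all parameter values are hypothetical.

```cpp
#include <cassert>

// Fraction of one kernel's time spent on launch overhead, assuming
// time = launch_overhead + zones * seconds_per_zone (a toy model).
double launch_overhead_fraction(long long zones, double seconds_per_zone,
                                double launch_overhead) {
  assert(zones >= 0 && seconds_per_zone >= 0.0 && launch_overhead > 0.0);
  double compute = static_cast<double>(zones) * seconds_per_zone;
  return launch_overhead / (launch_overhead + compute);
}
```

With few zones the fraction approaches 1, so splitting the same work across more MPI processes shrinks each kernel further without reducing the per-launch cost. That is consistent with the slide's observation that MPS may be slightly slower in this regime.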

Slide 13: Computation time: Large kernels
[Figure: time (sec) vs. problem size (zones^3); series: MPI process/GPU, 4 MPI processes/GPU + MPS]
- MPS is faster, especially when problem size is large
- Utilizing GPU better? GPU utilization? GPU occupancy?
- Utilizing CPU better? More parallelization? Better utilization of CPU memory bandwidth?

Slide 14: Waiting on the GPU: cudaDeviceSynchronize
[Figure: time (sec) vs. problem size (zones^3); series: MPI process/GPU, 4 MPI processes/GPU + MPS]
- Processes appear to wait on the GPU longer without MPS

Slide 15: Domain decomposition with and without MPS
[Figure: domain decompositions for MPI process/GPU and 4 MPI processes/GPU + MPS]
Differences in communication:
- Number of neighbors in halo exchange
- Surface to volume ratio
- Processor mapping
- Other
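
The surface to volume ratio mentioned here can be worked out for a box-shaped subdomain. The sketch below assumes a structured block of zones and counts surface in zone faces; the `Box` type and the example sizes are hypothetical and do not come from the ARES runs.

```cpp
#include <cassert>

// Zones per side of one MPI process's subdomain (hypothetical helper type).
struct Box {
  int nx, ny, nz;
};

// Number of zones owned by the process.
long long volume(Box b) { return 1LL * b.nx * b.ny * b.nz; }

// Number of zone faces on the boundary of the subdomain.
long long surface(Box b) {
  return 2LL * (1LL * b.nx * b.ny + 1LL * b.ny * b.nz + 1LL * b.nx * b.nz);
}

// Boundary faces per owned zone: a proxy for halo data per unit of work.
double surface_to_volume(Box b) {
  assert(b.nx > 0 && b.ny > 0 && b.nz > 0);
  return static_cast<double>(surface(b)) / static_cast<double>(volume(b));
}
```

For example, one process owning 64 x 64 x 64 zones has a ratio of 0.09375, while splitting the same block among four processes as 32 x 32 x 64 each raises it to 0.15625: each process exchanges more halo data relative to the work it does.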

Slide 16: Communication time (MPI)
[Figure: time (sec) vs. problem size (zones^3); series: MPI process/GPU; 4 MPI processes/GPU + MPS, decomp1; 4 MPI processes/GPU + MPS, decomp2; 4 MPI processes/GPU + MPS, decomp]
- More MPI processes = more communication
- Not all decompositions result in the same communication time
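
One reason decompositions differ in communication cost is the total area of the internal cuts. The slides do not say how decomp1 and decomp2 were constructed, so the sketch below is only an illustration under stated assumptions: a cubic block of n x n x n zones split into px x py x pz subdomains, with a hypothetical function name.

```cpp
#include <cassert>

// Total area, in zone faces, of the internal cutting planes when an
// n x n x n block of zones is split into px x py x pz subdomains.
long long internal_interface_area(int n, int px, int py, int pz) {
  assert(n > 0 && px > 0 && py > 0 && pz > 0);
  long long face = 1LL * n * n;  // area of one full cutting plane
  return face * ((px - 1) + (py - 1) + (pz - 1));
}
```

For a 64^3 block, splitting four ways as 4 x 1 x 1 cuts 12288 zone faces, while 2 x 2 x 1 cuts 8192, so two decompositions with the same process count can exchange different amounts of halo data. Neighbor counts and processor mapping, which this sketch ignores, also matter.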

Slide 17: Conclusions
- MPS can be useful if non-accelerated portions of the code need all CPU cores
- MPS can help to utilize the GPU better
- However, using more MPI processes makes communication more expensive; many factors may have impact
- Caliper measures many aspects of performance, but there are more questions:
  - Does using more CPU cores increase CPU memory bandwidth utilization?
  - How well am I utilizing/occupying the GPU?
  - What is the bottleneck now: the CPU or the GPU?
  - Other issues on new platforms

Slide 18: Thank you
- Lawrence Livermore National Laboratory: ARES team, Caliper team, RAJA team
- NVIDIA: Steve Rennich, Max Katz
- IBM: Leopold Grinberg
