Mapping MPI+X Applications to Multi-GPU Architectures

Size: px

Start display at page:

Download "Mapping MPI+X Applications to Multi-GPU Architectures"

Warren Russell
5 years ago
Views:

Mapping MPI+X Applications to Multi-GPU Architectures A

León Computer Scientist San Jose, CA March 28, 2018 GPU

1 Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA Lawrence Livermore National Security, LLC

2 Application developers face greater complexity Hardware architectures SMT GPUs FPGAs NVRAM NUMA, multi-rail Programming abstractions MPI OpenMP, POSIX threads CUDA, OpenMP 4.5, OpenACC Kokkos, RAJA Applications Need the hardware topology to run efficiently Need to run on more than one architecture Need multiple programming abstractions Hybrid applications 2

3 How do we map hybrid applications to increasingly complex hardware? Compute power is not the bottleneck Data movement dominates energy consumption HPC applications dominated by the memory system Latency and bandwidth Capacity tradeoffs (multi-level memories) Leverage local resources Avoid remote accesses More than compute resources, it is about the memory system! 3

4 The Sierra system that will replace Sequoia features a GPU-accelerated architecture Components IBM POWER9 Gen2 NVLink Compute Node Compute Rack Standard 19 2 IBM POWER9 CPUs Warm water cooling 4 NVIDIA Volta GPUs NVMe-compatible PCIe 1.6 TB SSD 256 GiB DDR4 16 GiB Globally addressable HBM2 associated with each GPU Coherent Shared Memory Compute System 4320 nodes 1.29 PB Memory 240 Compute Racks 125 PFLOPS ~12 MW NVIDIA Volta 7 TFlop/s HBM2 Gen2 NVLink Mellanox Interconnect Single Plane EDR InfiniBand 2 to 1 Tapered Fat Tree GPFS File System 154 PB usable storage 1.54 TB/s R/W bandwidth 4

Machine (256GB total) NUMANode P#0 (128GB) Package P#0 PCI 15b3:1013 ib0 mlx5_0 PCI 1c58:0003 PCI 10de:15f9 card0 renderd128 PCI 10de:15f9 card1 renderd129 2017 CORAL EA Core

PU P#20 PU P#21 PU P#22 PU P#23 Core P#32 PU P#24 PU P#25 PU P#26 PU P#27 PU P#28 PU P#29 PU P#30 PU P#31 NVMe SSD Core P#40 PU P#32 PU P#33 PU P#34 PU P#35 PU P#36 PU P#37

P#46 PU P#47 Core P#112 PU P#72 PU P#73 PU P#74 PU P#75 PU P#76 PU P#77 PU P#78 PU P#79 Core P#72 PU P#48 PU P#49 PU P#50 PU P#51 PU P#52 PU P#53 PU P#54 PU P#55 Private,,

PCI 10de:15f9 PCI 10de:15f9 Core P#136 PU P#80 PU P#81 PU P#82 PU P#83 PU P#84 PU P#85 PU P#86 PU P#87 Core P#176 PU P#112 PU P#113 PU P#114 PU P#115 Core P#144 PU P#88 PU

P#103 Core P#208 PU P#128 PU P#129 PU P#130 PU P#131 Core P#168 PU P#104 PU P#105 PU P#106 PU P#107 PU P#108 PU P#109 PU P#110 PU P#111 Core P#216 PU P#136 PU P#137 PU P#138

NVIDIA Pascal Tesla P100 PU P#116 PU P#117 PU P#118 PU P#119 PU P#124 PU P#125 PU P#126 PU P#127 PU P#132 PU P#133 PU P#134 PU P#135 PU P#140 PU P#141 PU P#142 PU P#143 Core

5 Machine (256GB total) NUMANode P#0 (128GB) Package P#0 PCI 15b3:1013 ib0 mlx5_0 PCI 1c58:0003 PCI 10de:15f9 card0 renderd128 PCI 10de:15f9 card1 renderd CORAL EA Core P#8 PU P#0 PU P#1 PU P#2 PU P#3 PU P#4 PU P#5 PU P#6 PU P#7 Core P#16 PU P#8 PU P#9 PU P#10 PU P#11 PU P#12 PU P#13 PU P#14 PU P#15 Core P#24 PU P#16 PU P#17 PU P#18 PU P#19 PU P#20 PU P#21 PU P#22 PU P#23 Core P#32 PU P#24 PU P#25 PU P#26 PU P#27 PU P#28 PU P#29 PU P#30 PU P#31 NVMe SSD Core P#40 PU P#32 PU P#33 PU P#34 PU P#35 PU P#36 PU P#37 PU P#38 PU P#39 Core P#96 PU P#64 PU P#65 PU P#66 PU P#67 PU P#68 PU P#69 PU P#70 PU P#71 NUMANode P#1 (128GB) Core P#48 PU P#40 PU P#41 PU P#42 PU P#43 PU P#44 PU P#45 PU P#46 PU P#47 Core P#112 PU P#72 PU P#73 PU P#74 PU P#75 PU P#76 PU P#77 PU P#78 PU P#79 Core P#72 PU P#48 PU P#49 PU P#50 PU P#51 PU P#52 PU P#53 PU P#54 PU P#55 Private,, Core P#88 PU P#56 PU P#57 PU P#58 PU P#59 PU P#60 PU P#61 PU P#62 PU P#63 1 IB NIC per socket 10 cores per socket IBM Power8+ SL822LC Package P#1 PCI 15b3:1013 PCI 1b4b:9235 PCI 10de:15f9 PCI 10de:15f9 Core P#136 PU P#80 PU P#81 PU P#82 PU P#83 PU P#84 PU P#85 PU P#86 PU P#87 Core P#176 PU P#112 PU P#113 PU P#114 PU P#115 Core P#144 PU P#88 PU P#89 PU P#90 PU P#91 PU P#92 PU P#93 PU P#94 PU P#95 Core P#200 PU P#120 PU P#121 PU P#122 PU P#123 Core P#152 PU P#96 PU P#97 PU P#98 PU P#99 PU P#100 PU P#101 PU P#102 PU P#103 Core P#208 PU P#128 PU P#129 PU P#130 PU P#131 Core P#168 PU P#104 PU P#105 PU P#106 PU P#107 PU P#108 PU P#109 PU P#110 PU P#111 Core P#216 PU P#136 PU P#137 PU P#138 PU P#139 ib1 mlx5_1 2 Ethernet NICs PCI 1a03:2000 card4 controld64 PCI 14e4:1657 enp5p7s0f0 PCI 14e4:1657 enp5p7s0f1 card2 renderd130 card3 renderd131 2 GPUs per socket NVIDIA Pascal Tesla P100 PU P#116 PU P#117 PU P#118 PU P#119 PU P#124 PU P#125 PU P#126 PU P#127 PU P#132 PU P#133 PU P#134 PU P#135 PU P#140 PU P#141 PU P#142 PU P#143 Core P#224 PU P#144 PU P#145 PU P#146 PU P#147 Core P#232 PU P#152 PU P#153 PU P#154 PU P#155 SMT-8 PU P#148 PU P#149 PU P#150 PU P#151 PU P#156 PU P#157 PU P#158 PU P#159 Figure generated with hwloc 5

6 Existing approaches and their limitations MPI/RM approaches By thread By core By socket Latency (IBM Spectrum MPI) Bandwidth (IBM Spectrum MPI) OpenMP approaches Policies Spread, Close, Master Predefined places Threads, Cores, Sockets Limitations Memory system is not primary concern No coherent mapping across programming abstractions No heterogeneous devices support 6

G 2 Pascal GPUs G G Core P#0 PU P#0 PU P#1 PU

7 A portable algorithm for multi-gpu architectures: mpibind Node 2 IBM Power8+ processors Memory Memory Per socket Socket Socket Machine 10 SMT-8 cores 1 InfiniBand NIC G 2 Pascal GPUs G G Core P#0 PU P#0 PU P#1 PU P#2 PU P#3 PU P#4 PU P#5 PU P#6 PU P#7 NVMe SSD G Private,, per core 7

8 l 0 l 1 mpibind s primary consideration: The memory hierarchy Workers = 8 Vertices(l 2 ) = 20 NIC-0 GPU-0 GPU-1 NVM-0 NUMA Start workers / 2 NUMA 4 workers / 2 GPUs NUMA NIC-1 GPU-2 GPU-3 l 2... l 3 l 4 c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c19 l 5 p0-7 p8-15 p16-23 p24-31 p32-39 p40-47 p48-55 p56-63 p64-71 p72-79 p80-87 p

9 The algorithm l 0 l 1 NUMA Start GPU0 NIC0 NUMA... GPU1 NIC1 Get hardware topology l 2... Devise memory tree G Assign devices to memory vertices l 3 l 4 Core0 Core1 p4 p5 p6 p7 Calculate # workers w (all processes and threads) Traverse tree to determine level k with at least w vertices Traverse subtrees selecting compute resources for each vertex m : vertices(k) PU Map workers to vertices respecting NUMA boundaries m: workers vertices(k) PU 9

10 Example mapping: One task per GPU Node Memory Memory Socket Socket G G G3 G1 1 3 def lat bw mb ,8,16,24, ,48,56,64, ,88,96,104, ,128,136,144,152 10

11 Evaluation: Synchronous collectives, GPU compute and bandwidth, app benchmark Machine Affinity Benchmarks Number of Nodes Processes (tasks) per node CORAL EA system Spectrum-MPI default Spectrum-MPI latency Spectrum-MPI bandwidth mpibind MPI Barrier MPI Allreduce Bytes&Flops Compute Bytes&Flops Bandwidth SW4lite 1, 2, 4, 8, 16 4, 8, 20 11

12 Enabled uniform access to GPU resources Compute micro-benchmark* Mean GFlops per Process Target PPN: 4 Target PPN: PPN GPU 0 Better Execute multiple instances concurrently 4 and 8 PPN Measure GPU FLOPS Processes time-share GPUs by default Performance without mpibind severely limited because of GPU mapping bandwidth default latency mpibind bandwidth default latency mpibind *kokkos/benchmarks/bytes_and_flops 12

13 Enabled access to the memory of all GPUs without user intervention Memory bandwidth micro-benchmark* 500 Target PPN: 4 Target PPN: PPN GPU 0 Execute multiple instances concurrently 4 or 8 PPN Mean GBps per Process Better Measure GPU global memory bandwidth Processes time-share GPUs by default Without mpibind some processes fail running out of memory bandwidth default latency mpibind bandwidth default latency mpibind *kokkos/benchmarks/bytes_and_flops 13

14 Impact on SW4lite Earthquake ground motion simulation Mean Execution Time (secs) CPU: MPI + OpenMP GPU: MPI + RAJA Better Forcing Supergrid Scheme BC_phys BC_comm Simplified version of SW4 Layer over half space (LOH.2) 17 million grid points (h100) Multiple runs, calculate mean 6 times for GPU 10 times for CPU Performance speedup CPU: mpibind over default: 3.7x GPU: mpibind over bandwidth: 2.2x GPU over CPU: 9.7x 200 TPP CPP x PPN CPN bandwidth under 0 default over bandwidth default latency mpibind bandwidth default latency mpibind latency under mpibind TPP: Threads per process, CPP: Cores per process, PPN: Processes per node, CPN: Cores per node 14

15 mpibind: A memory-driven mapping algorithm for multi-gpu architectures Focuses on hierarchical nature of memory system Provides portability and user transparency Same algorithm on GPU-based, KNL-based, and commoditybased systems Encompasses Hybrid programming abstractions Heterogeneous devices Outperforms existing approaches without user intervention Reduces runtime variability Competitive performance on collective operations Enables uniform access to all GPU resources 15

16 Bibliography and related GTC talks mpibind: A memory-centric affinity algorithm for hybrid applications. MEMSYS System noise revisited: Enabling application scalability and reproducibility with SMT. IPDPS D ground motion simulation in basins. Final report to Pacific Earthquake Engineering Research Center, SW4lite, Kokkos, and RAJA github.com/geodynamics/sw4lite github.com/kokkos/kokkos github.com/llnl/raja S8270 Acceleration of an LLNL production Fortran application on the Sierra supercomputer S8470 Using RAJA for accelerating LLNL production applications on the Sierra supercomputer S8489 Scaling molecular dynamics across 25,000 GPUs on Sierra & Summit 16

Modernizing OpenMP for an Accelerated World

Modernizing OpenMP for an Accelerated World Tom Scogland Bronis de Supinski This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract