Towards Automatic Heterogeneous Computing Performance Analysis. Carl Pearson Adviser: Wen-Mei Hwu

Size: px

Start display at page:

Download "Towards Automatic Heterogeneous Computing Performance Analysis. Carl Pearson Adviser: Wen-Mei Hwu"

Neil Johnson
5 years ago
Views:

1 Towards Automatic Heterogeneous Computing Performance Analysis Carl Pearson Adviser: Wen-Mei Hwu

2 Outline High Performance Computing Challenges Vision CUDA Allocation and Transfer Background System Characterization Software Characterization 2

3 Out-of-Order execution Vectorization Cache effects DRAM 3

4 Out-of-Order execution Vectorization Cache effects Multi-threading DRAM 4

5 Out-of-Order execution Vectorization Cache effects Multi-threading Non-uniform memory access DRAM DRAM 5

6 Out-of-Order execution Vectorization Cache effects Multi-threading Non-uniform memory access Bulk-synchronous programming DRAM DRAM 6

7 Out-of-Order execution Vectorization Cache effects Multi-threading Non-uniform memory access Bulk-synchronous programming Communication Data placement Compute placement DRAM DRAM 7

8 Flash Out-of-Order execution Vectorization Cache effects Multi-threading Non-uniform memory access Bulk-synchronous programming Communication Data placement Compute placement FPGA FPGA Near-/In-memory Compute DRAM NVM HBM 8

9 System Graph Communication Data placement Compute placement Mem Mem Graph: Nodes: {Compute, Storage} Edges: {Communication Links} 9

10 Application Graph System Graph V V C V V C Mem Mem Nodes: {Compute, Values} Edges: {Dependence} Values have a size Nodes: {Compute, Storage} Edges: {Communication Links} Edges have a performance model 10

11 System Characterization 1) Enumerate system topology Hwloc [1] Portable interface for hardware affinity P100 NVLink 80 GB/s P100 NVIDIA Management Library Power8 Power8 Cutting-edge NVIDIA hardware P100 P100 IBM Minsky Topology (not shown: network interfaces, disks, PCIe devices ) [1] Broquedis, François, et al. "hwloc: A generic framework for managing hardware affinities in HPC applications." Parallel, Distributed and Network-Based Processing (PDP), th Euromicro International Conference on. IEEE,

12 System Characterization 2) Characterize communication CUDA Application P100 NVLink 80 GB/s P100 Power8 Power8 libcublas libnccl libcudnn P100 P100 libcuda OS IBM Minsky Topology (not shown: network interfaces, disks, PCIe devices ) 12

13 CUDA 1-2 / CC 1.x :: Early CUDA cudamalloc malloc/new cudamemcpy cudamemcpypeer Pageable Allocation Allocation cudamallochost Pinned Allocation Address Space 0 Address Space 1 Address Space

14 CUDA Memcopy Flow (Pageable) 1) Allocate pageable memory DRAM APPLICATION ADDRESS SPACE KERNEL ADDRESS SPACE DRAM a_h a_h = new int[] a_d cudamalloc(&a_d) 14

15 CUDA Memcopy Flow (Pageable) 2) Initiate CUDA Memcpy DRAM APPLICATION ADDRESS SPACE KERNEL ADDRESS SPACE DRAM a_h cudamemcpy(a_d, a_h) a_d 15

16 CUDA Memcopy Flow (Pageable) 3) Driver copies to pinned internal buffer DRAM APPLICATION ADDRESS SPACE KERNEL ADDRESS SPACE DRAM a_h a_d 16

17 CUDA Memcopy Flow (Pageable) 4) instructs to begin Direct Memory Access copy DRAM APPLICATION ADDRESS SPACE KERNEL ADDRESS SPACE DRAM a_h a_d 17

18 CUDA Memcopy Flow (Pinned) 1) Allocate pinned memory DRAM APPLICATION ADDRESS SPACE KERNEL ADDRESS SPACE DRAM a_h cudamallochost(&a_h) a_d cudamalloc(&a_d) 18

19 CUDA Memcopy Flow (Pinned) 2) instructs to begin Direct Memory Access copy DRAM APPLICATION ADDRESS SPACE KERNEL ADDRESS SPACE DRAM a_h a_d 19

20 CUDA 3 / CC 1.x :: Basic CUDA cudamalloc malloc/new Pageable cudamemcpy Allocation cudamemcpypeer Allocation cudamallochost Pinned Allocation cudahostalloc(..., cudahostallocwritecombi ned) Write- Combined Allocation cudahostalloc(..., cudahostallocmapped) Mapped cudahostgetdevicepointer Mapping Mapping Address Space 0 Address Space 1 Address Space

21 CUDA 4.0 / CC 2.0+ :: Unified Virtual Addressing Pageable Allocation cudamemcpy Allocation Allocation cudapeeraccessenable Mapping Pinned Allocation Write- Combined Allocation Mapped implicit Mapping implicit Mapping Unified Address Space

22 CUDA 6.0 / CC 3.0+ :: Unified Memory cudamallocmanaged() access from any device, any time* Allocation (CUDA 4.0-style APIs still exist) * some restrictions apply, see CUDA Programing Guide for details 22

23 CUDA 6.0 / CC 3.0+ :: Unified Memory Bulk transfers on access 0 1 cudasetdevice(0); cudamallocmanaged(&a,...); a a[page0] = 0; // gpu0 a cudadevicesynchronize(); a[page1] = 1; // gpu1 a cudadevicesynchronize(); a[page2] = 2; // cpu a 23

24 CUDA 8.0 / CC 6.0+ :: Unified Memory Page faults on access Automatic prefetching (not shown) cudasetdevice(0); cudamallocmanaged(&a,...); a[page0] = 0; // gpu0 0 1 a[page1] = 1; // gpu1 a[page2] = 2; // cpu cudamemadvise(a, gpu1, cudamemadvisesetpreferredlocation); a[page1] = 1; // cpu cudamemprefetcasync(a, gpu1); Page fault and migration Page fault and migration Write served over NVLink Bulk page migration 24

25 CUDA 9.2 / CC 7.0+ / Power9 / NVLink2 :: Unified Memory Address Translation Services: can access page tables Access Counters for mapping vs migration Atomic operations over NVLink2 25

26 26

27 Identical CUDA Memcopies mapped pageable, cudamemcpy pinned, cudamemcpy write-combined, cudamemcpy 27

28 Prefetch Bandwidth and Demand Bandwidth 28

29 IMB Minsky Page Fault Latency ( to 0) ~20μs per page fault 29

30 Software Characterization Unmodified CUDA Application CUPTI profiler LD_PRELOAD LD_PRELOAD Observe generic shared library calls libcublas libnccl libcudnn CUDA Profiling Tools Interface Observe CUDA runtime calls libcuda OS 30

31 1: {"build":" ","git":"dirty","version":"0.1.0"} 2: {"device":0,"name":"cudasetdevice",... 3: {"name":"cudamalloc","ptr": ,"size":409600,... 4: {"name":"cudamalloc","ptr": ,"size":819200,... 5: {"name":"cudamalloc","ptr": ,"size":819200,... 6: {"count":409600,"dst": ,"name":"cudamemcpy","src": ,"... 7: {"count":819200,"dst": ,"name":"cudamemcpy","src": ,"... 8: {"block_dim":...,"grid_dim":...,"name":"cudaconfigurecall","shared_mem":0,"stream":0,... 9: {"arg": ,"name":"cudasetupargument","offset":0,"size":8,... 10: {"arg": ,"name":"cudasetupargument","offset":8,"size":8,... 11: {"arg": ,"name":"cudasetupargument","offset":16,"size":8,... 12: {"arg": ,"name":"cudasetupargument","offset":24,"size":4,... 13: {"arg": ,"name":"cudasetupargument","offset":28,"size":4,... 14: {"func": ,"name":"cudalaunch"} PROFILE FORMAT 2: device 0 is associated with calls from this thread 3: Allocation , bytes 4: Allocation , bytes 5: Allocation , bytes 6: Infer >= allocation Copy A6 -> A3 7: Infer >= allocation Copy A7 -> A4 9-13: Record arguments to upcoming kernel (including A3, A4, A5) 14: Launch address ENGLISH 31

32 Dynamic Dependence Graph from Profile 3: Allocation , bytes 4: Allocation , bytes 5: Allocation , bytes 6: Infer >= initialized allocation Copy A6 -> A3 7: Infer >= initialized allocation Copy A7 -> A4 9-13: Set up arguments (including A3,A4,A5) 14: Launch function Allocations A3 uninit v1 v4 A4 uninit v3 v5 A5 uninit v6 A6 v0 A7 v2 32

33 Replay with Tweaked Components IBM Minsky w/ Power8, P100s, NVLink IBM w/ Power9, V100s, NVLink Kernel performance model? 33

34 Replay with Tweaked Components IBM Minsky Power8, P100s, NVLink Illinois FutureArch RISCV, V100, NuLink, FPGA, NVM Scheduling problem Heterogeneous performance model 34

35 Questions / Comments / Concerns Special thanks to the Center for Cognitive Computing Systems Research (C3SR), NCSA, and Dominic Grande. 35

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems Carl Pearson 1, I-Hsin Chung 2, Zehra Sura 2, Wen-Mei Hwu 1, and Jinjun Xiong 2 1 University of Illinois Urbana-Champaign, Urbana