DEEP DIVE INTO DYNAMIC PARALLELISM

1 DEEP DIVE INTO DYNAMIC PARALLELISM
Shankara Rao Thejaswi Nanditale, NVIDIA
Christoph Angerer, NVIDIA
April 4-7, 2016, Silicon Valley

2 OVERVIEW AND INTRODUCTION

3 WHAT IS DYNAMIC PARALLELISM?
The ability to launch new kernels from the GPU:
  Dynamically - based on run-time data
  Simultaneously - from multiple threads at once
  Independently - each thread can launch a different grid
Introduced with CUDA 5.0 and compute capability 3.5 and up.
Fermi: only the CPU can generate GPU work.
Kepler: the GPU can generate work for itself.

4 DYNAMIC PARALLELISM
[Figure: CPU and GPU diagrams]

5 AN EASY TO PARALLELIZE PROGRAM
for i = 1 to N
    for j = 1 to M
        convolution(i, j)
    next j
next i
[Figure axes: N, M]

6 A DIFFICULT TO PARALLELIZE PROGRAM
for i = 1 to N
    for j = 1 to x[i]
        convolution(i, j)
    next j
next i

7 A DIFFICULT TO PARALLELIZE PROGRAM
for i = 1 to N
    for j = 1 to x[i]
        convolution(i, j)
    next j
next i
Bad alternative #1: Idle Threads [Figure axes: N, max(x[i])]
Bad alternative #2: Tail Effect [Figure axis: N]

8 DYNAMIC PARALLELISM
Serial Program
for i = 1 to N
    for j = 1 to x[i]
        convolution(i, j)
    next j
next i
CUDA Program (With Dynamic Parallelism)
__global__ void convolution(int x[]) {
    for j = 1 to x[blockIdx]
        kernel<<<...>>>(blockIdx, j)
}
void main() {
    setup(x);
    convolution<<<N, 1>>>(x);
}
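The slide's CUDA program is pseudocode. A minimal compilable sketch of the same idea might look like the following. It assumes a device of compute capability 3.5 or higher and a build with relocatable device code (for example `nvcc -rdc=true ... -lcudadevrt`). The names `child`, `parent`, and `run_nested_demo` are hypothetical, and the "convolution" is replaced by a counter so the result can be checked. Unlike the slide, which launches one kernel per `j`, this variant launches a single child grid with `x[i]` threads per row.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for convolution(i, j): one child thread per j.
__global__ void child(int i, int n, int *out)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n) {
        atomicAdd(&out[i], 1);  // count the (i, j) pairs that were processed
    }
}

// One parent block (a single thread) per row i. The child grid size is
// taken from x[i] at run time, so no thread idles on short rows.
__global__ void parent(const int *x, int *out)
{
    int i = blockIdx.x;
    int n = x[i];
    if (n > 0) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        child<<<blocks, threads>>>(i, n, out);
    }
    // No explicit synchronization: the parent grid is not considered
    // complete until all of its child grids have completed.
}

// Host helper (assumes non-negative entries in x). Returns true when
// every row i was processed exactly x[i] times.
bool run_nested_demo(const std::vector<int> &x)
{
    int rows = static_cast<int>(x.size());
    if (rows == 0) return true;

    size_t bytes = rows * sizeof(int);
    int *d_x = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_x, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_x); return false; }

    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);

    parent<<<rows, 1>>>(d_x, d_out);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    std::vector<int> out(rows, -1);
    if (ok) {
        ok = (cudaMemcpy(out.data(), d_out, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess);
    }
    cudaFree(d_x);
    cudaFree(d_out);
    return ok && out == x;
}
```

Calling `run_nested_demo({3, 0, 500, 17})` from `main` exercises rows of very different lengths, which is the unbalanced case the slide targets.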

9 EXPERIMENT
[Figure: time (ms, lower is better) versus matrix size for dynpar, idlethreads, and taileffect]
* Device/SDK = K40m/v7.5
* K40m-CPU = E

10 LAUNCH EXAMPLE
Block A0 of Grid A executes B<<<1,1>>>()  ->  cudaLaunchDevice(B, 1, 1);
[Figure, repeated on slides 10-22: Task Tracking Structures, Grid Scheduler, A0 Tracking Structure, and four SMs]

11 LAUNCH EXAMPLE
Allocate Task data structure.

12 LAUNCH EXAMPLE
Fill out Task data structure (Task B).

13 LAUNCH EXAMPLE
Track Task B in Block A0.

14 LAUNCH EXAMPLE
Launch Task B to GPU.

15 LAUNCH EXAMPLE
Grid A and Grid B are resident (blocks A0 and B0). Block A0 executes C<<<1,1>>>()  ->  cudaLaunchDevice(C, 1, 1);

16 LAUNCH EXAMPLE
Allocate, fill out, and track Task C in block A0.

17 LAUNCH EXAMPLE
Task C is not yet runnable. Track C to run after B.

18 LAUNCH EXAMPLE
Task B completes. SKED runs the Scheduler kernel.

19 LAUNCH EXAMPLE
Scheduler searches for work.

20 LAUNCH EXAMPLE
Scheduler completes B, and identifies C as ready-to-run.

21 LAUNCH EXAMPLE
Scheduler frees B for re-use, and launches C to the Grid Scheduler.

22 LAUNCH EXAMPLE
Task C now executes (blocks A0 and C0 are resident).

23 BASIC RULES
Programming Model
Essentially the same as CUDA.
Launch is per-thread and asynchronous.
Sync is per-block.
CUDA primitives are per-block (cannot pass streams/events to children).
cudaDeviceSynchronize() != __syncthreads()
Events allow inter-stream dependencies.
Streams are shared within a block.
Implicit NULL stream results in ordering within a block; use named streams.
[Figure: timeline of the CPU thread, Grid A (parent) threads, and Grid B (child) threads, marking Grid A launch, Grid B launch, Grid B complete, and Grid A complete]
CUDA API available on the device:
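The advice to use named streams can be sketched as follows. Launches from one block into the implicit NULL stream are ordered one after another, whereas each thread below creates its own stream so the child grids are free to overlap. Device-side streams have to be created with `cudaStreamCreateWithFlags` and the `cudaStreamNonBlocking` flag. The kernel and helper names (`child_set`, `parent_streams`, `run_stream_demo`) are hypothetical; the same build assumptions apply (compute capability 3.5 or higher, `-rdc=true`, `-lcudadevrt`).

```cuda
#include <vector>
#include <cuda_runtime.h>

// Trivial child: writes one value so the host can verify that it ran.
__global__ void child_set(int *slot, int value)
{
    *slot = value;
}

// Every thread of the parent block launches a child into its own
// named stream instead of the block-wide implicit NULL stream.
__global__ void parent_streams(int *out, int *err)
{
    cudaStream_t s;
    cudaError_t e = cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    if (e == cudaSuccess) {
        child_set<<<1, 1, 0, s>>>(&out[threadIdx.x], threadIdx.x + 1);
        e = cudaGetLastError();  // launch errors are reported per thread
        cudaStreamDestroy(s);    // resources are released once the work is done
    }
    if (e != cudaSuccess) {
        atomicExch(err, static_cast<int>(e));
    }
}

// Host helper: returns true if all children ran and no launch failed.
// 'threads' must be a valid block size (1 to 1024).
bool run_stream_demo(int threads)
{
    int *d_out = nullptr, *d_err = nullptr;
    size_t bytes = threads * sizeof(int);
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_err, sizeof(int)) != cudaSuccess) { cudaFree(d_out); return false; }
    cudaMemset(d_out, 0, bytes);
    cudaMemset(d_err, 0, sizeof(int));

    parent_streams<<<1, threads>>>(d_out, d_err);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    std::vector<int> out(threads, 0);
    int err = -1;
    if (ok) {
        cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
        cudaMemcpy(&err, d_err, sizeof(int), cudaMemcpyDeviceToHost);
        ok = (err == 0);
        for (int t = 0; ok && t < threads; ++t) {
            ok = (out[t] == t + 1);
        }
    }
    cudaFree(d_out);
    cudaFree(d_err);
    return ok;
}
```

The stream handle stays local to the launching block, in line with the rule that streams cannot be passed to children.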

24 MEMORY CONSISTENCY RULES
Memory Model
Launch implies membar (child sees parent state at time of launch).
Sync implies invalidate (parent sees child writes after sync).
Texture changes by child are visible to parent after sync (i.e. sync == tex cache invalidate).
Constants are immutable.
Local & shared memory are private: cannot be passed as child kernel args.
[Figure: timeline of the CPU thread, Grid A (parent) threads, and Grid B (child) threads, marking Grid A launch, Grid B launch, Grid B complete, and Grid A complete; label: Fully consistent]
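A small sketch of the launch-time visibility rule, under the same build assumptions (compute capability 3.5 or higher, `-rdc=true`, `-lcudadevrt`). The child is guaranteed to see the global-memory writes of the launching thread, so the parent block uses `__syncthreads()` to make the writes of all its threads complete before thread 0 launches. The buffer lives in global memory because local and shared memory cannot be passed to a child. All names (`child_sum`, `parent_fill`, `run_visibility_demo`) are hypothetical.

```cuda
#include <cuda_runtime.h>

// Child: sums the buffer that the parent block filled before the launch.
__global__ void child_sum(const int *in, int n, int *result)
{
    int s = 0;
    for (int k = 0; k < n; ++k) {
        s += in[k];
    }
    *result = s;
}

// Parent: all threads write to global memory, then one thread launches.
__global__ void parent_fill(int *buf, int n, int *result)
{
    for (int k = threadIdx.x; k < n; k += blockDim.x) {
        buf[k] = k + 1;
    }
    // Without this barrier only thread 0's own writes would be
    // guaranteed to be visible to the child grid.
    __syncthreads();
    if (threadIdx.x == 0) {
        child_sum<<<1, 1>>>(buf, n, result);
    }
}

// Host helper: returns the sum computed by the child, or -1 on error.
// The host reads the result only after the parent grid (and therefore
// its child) has completed.
int run_visibility_demo(int n)
{
    if (n <= 0) return -1;
    int *d_buf = nullptr, *d_result = nullptr;
    if (cudaMalloc(&d_buf, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_result, sizeof(int)) != cudaSuccess) { cudaFree(d_buf); return -1; }
    cudaMemset(d_result, 0, sizeof(int));

    parent_fill<<<1, 64>>>(d_buf, n, d_result);

    int result = -1;
    if (cudaDeviceSynchronize() == cudaSuccess) {
        cudaMemcpy(&result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_buf);
    cudaFree(d_result);
    return result;
}
```

For `n = 100` the child should report 1 + 2 + ... + 100 = 5050.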

25 EXPERIMENTS

26 DIRECTED BENCHMARKS
Kernels written to measure specific aspects of dynamic parallelism:
  Launch throughput
  Launch latency
As a function of different configurations:
  SDK versions
  Varying clocks

27 RESULTS - LAUNCH THROUGHPUT

28 LAUNCH THROUGHPUT
[Figure: grids/sec versus number of child kernels launched, for K40m and K40m-CPU]
* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/875
* K40m-CPU = E
* Host launches are with 32 streams

29 LAUNCH THROUGHPUT
Observations
About an order of magnitude higher than from host.
Dynamic parallelism is very useful when there are a lot of child kernels.
Two major limiters of launch throughput:
  Pending Launch Count
  Grid Scheduler Limit

30 PENDING LAUNCH COUNT
[Figure: grids/sec versus number of child kernels launched]
* Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875
* Different curves represent different pending launch count limits

31 PENDING LAUNCH COUNT
Observations
Pre-allocated buffer in global memory to store kernels before their launch.
Default value: 2048 kernels.
Buffer overflow implies a resize performed on the go.
  Substantial reduction in launch throughput!
Know the number of pending child kernels!

32 PENDING LAUNCH COUNT
CUDA APIs
Setting the limit: cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, yourlimit);
Querying the limit: cudaDeviceGetLimit(&yourlimit, cudaLimitDevRuntimePendingLaunchCount);
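A host-side sketch of the two calls, with error checking. The helper name `set_pending_launch_count` is hypothetical. The limit should be set before launching any kernel that uses the device runtime.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Requests room for 'wanted' pending child launches and reads the limit
// back. Returns the value reported by the runtime, or 0 on error.
size_t set_pending_launch_count(size_t wanted)
{
    if (cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount,
                           wanted) != cudaSuccess) {
        return 0;
    }
    size_t actual = 0;
    if (cudaDeviceGetLimit(&actual,
                           cudaLimitDevRuntimePendingLaunchCount) != cudaSuccess) {
        return 0;
    }
    return actual;
}
```

An application expecting, say, 4096 outstanding child kernels could call `set_pending_launch_count(4096)` once at start-up and so avoid the on-the-go buffer resize.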

33 GRID SCHEDULER LIMIT
[Figure: grids/sec versus number of device streams]
* Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875
* Different curves represent the total number of child kernels launched

34 GRID SCHEDULER LIMIT
Observations
The grid scheduler can track only a limited number of concurrent kernels.
The limit is currently 32.
If this limit is crossed, up to 50% loss in launch throughput.

35 RESULTS - LAUNCH LATENCY

36 LAUNCH LATENCY
[Figure: time (ns) of initial and subsequent launches, for K40m and K40m-CPU]
* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875
* K40m-CPU = E
* Host launches are with 32 streams

37 LAUNCH LATENCY
Observations
Initial and subsequent launches from the device are about 2-3x slower than launches from the host.
Dynamic parallelism may not be a good choice currently when there are:
  A few child kernels
  Serial kernel launches
We are working towards improving this.**
** Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications, Jin Wang and Sudhakar Yalamanchili, 2014 IEEE International Symposium on Workload Characterization (IISWC).

38 LAUNCH LATENCY - STREAMS
[Figure, two panels: time (ns) versus host streams, and time (ns) versus device streams]
* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875

39 LAUNCH LATENCY - STREAMS
Observations
Host streams affect device-side launch latency.
Prefer device streams for dynamic parallelism.

40 RESULTS - DEVICE SYNCHRONIZE

41 DEVICE SYNCHRONIZE
cudaDeviceSynchronize is costly. Avoid it when possible, as in the example below:
__global__ void parent() {
    dosomeinitialization();
    childkernel<<<grid,blk>>>();
    cudaDeviceSynchronize();  // Unnecessary. Implicit join enforced by the programming model!
}

42 DEVICE SYNCHRONIZE - COST
[Figure: time (ms) for sync and nosync versus amount of work per thread (the higher the number, the more the work)]
* Device/SDK = K40/v7.5

43 DEVICE SYNCHRONIZE DEPTH
Deepest recursion level at which cudaDeviceSynchronize works.
The CUDA limit cudaLimitDevRuntimeSyncDepth controls it.
Default is level 2.
Comes at the cost of extra global memory reserved for storing parent blocks.

44 DEVICE SYNCHRONIZE DEPTH
Memory Usage
[Figure: memory reserved (MB) versus device synchronize depth]

45 DEVICE SYNCHRONIZE DEPTH
Error Handling
cudaDeviceSynchronize fails silently beyond the set SyncDepth.
Use cudaGetLastError on device to inspect the error.
[Figure: chain of kernels from depth=1 to depth=5, with SyncDepth=2]

46 DYNAMIC PARALLELISM - LIMITS

47 DYNAMIC PARALLELISM
Limits
Recursion depth is currently 24.
Maximum size of formal parameters in the child kernel is 4096 B.
  Violation causes a compile-time error.
Runtime exceptions in child kernel are only visible from host-side.

48 ERROR HANDLING
Runtime exceptions in child kernels
Visible only from host-side.
Use the -lineinfo option of nvcc along with cuda-memcheck to locate the error.
__global__ void child(float* arr) {
    arr[0] = 1.0f;
}
__global__ void parent() {
    child<<<1,1>>>(NULL);
    cudaDeviceSynchronize();
    printf("%d\n", cudaGetLastError());  // Control never reaches here!
}
// On the host:
parent<<<1,1>>>();
cudaError_t err = cudaDeviceSynchronize();  // Error caught here

49 SUCCESS STORIES

50 FMM
Fast Multipole Method
Solving the N-body problem
Computational complexity O(n)
Tree-based approach

51 FMM (2)
Performance
[Figure: performance comparison, lower is better]
Dynamic 1: launch child grids for neighbors and children
Dynamic 2: launch child grids for children only
Dynamic 3: launch child grids for children only; start only p^2 kernel threads; use shared GPU memory
From: FMM goes GPU: A smooth trip or bumpy ride?, B. Kohnke, I. Kabadshow, MPI BPC Göttingen & Jülich Supercomputing Centre, GTC 2015

52 PANDA
anti-proton ANnihilation at DArmstadt
State-of-the-art hadron particle physics experiment

53 PANDA (2)
Performance and Reasons for Improvements
Avoiding extra PCI-e data transfers.
  Launch configuration data dependencies
Higher launch throughput.
Reducing false dependencies between kernel launches.
  Waiting on stream prevents enqueuing of work into other streams
Source: A CUDA Dynamic Parallelism Case Study: PANDA, Andrew Adinetz

54 SUMMARY

55 WHEN TO USE CUDA DYNAMIC PARALLELISM
Three Good Reasons
Algorithmic: Dynamically Formed Pockets of Structured Parallelism*
  Unbalanced load (e.g., vertex expansion in graphs, compressed sparse row)
  Tree traversal (fat and shallow computation trees)
  Adaptive Mesh Refinement
Performance:
  Improve launch throughput
  Reduce PCIe traffic and false dependencies
Maintenance:
  Simplified, more natural program flow
* From: Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications, J. Wang and S. Yalamanchili, IISWC 2014

56 REFERENCES
CUDA-C Programming Guide
Adaptive Parallel Computation with CUDA Dynamic Parallelism
FMM goes GPU, B. Kohnke and I. Kabadshow, GTC 2015

57 THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join
