High-Productivity CUDA Programming. Levi Barnes, Developer Technology Engineer, NVIDIA

Size: px

Start display at page:

Download "High-Productivity CUDA Programming. Levi Barnes, Developer Technology Engineer, NVIDIA"

Theodora White
5 years ago
Views:

1 High-Productivity CUDA Programming Levi Barnes, Developer Technology Engineer, NVIDIA

2 MORE RESOURCES

3 How to learn more GTC -- March 2014 San Jose, CA gputechconf.com Video archives, too Qwiklabs nvlabs.qwiklabs.com Hands on labs/tutorials More soon MOOCs udacity.com coursera.com

4 HIGH-PRODUCTIVITY PROGRAMMING

5 High-Productivity Programming What does this mean? What s the goal? Do Less Work (Expend Less Effort)

6 High-Productivity Programming What does this mean? What s the goal? Do Less Work (Expend Less Effort) Make Fewer Mistakes

7 High-Productivity Programming What does this mean? What s the goal? Do Less Work (Expend Less Effort) Make Fewer Mistakes Find Mistakes Sooner

8 High-Productivity Programming What does this mean? What s the goal? Do Less Work (Expend Less Effort) Make Fewer Mistakes Find Mistakes Sooner Discover/Publish More

9 High-Productivity Programming What kinds of tools exist to meet those goals? Languages/compilers Libraries Integrated development environments (IDEs) Profiling, correctness-checking, and debugging tools

10 HIGH-PRODUCTIVITY CUDA: Programming Languages, Compilers

11 GPU Programming Languages Numerical analytics Fortran C C++ Python.NET MATLAB, Mathematica, LabVIEW OpenACC, CUDA Fortran OpenACC, CUDA C CUDA C++, Thrust, Hemi, ArrayFire NumbaPro, PyCUDA, Copperhead CUDAfy.NET, Alea.cuBase, NMath developer.nvidia.com/language-solutions

12 HIGH-PRODUCTIVITY CUDA: GPU-Accelerated Libraries

13 GPU Accelerated Libraries Drop-in Acceleration for your Applications NVIDIA cublas NVIDIA cusparse NVIDIA NPP NVIDIA cufft Matrix Algebra on GPU and Multicore GPU Accelerated Linear Algebra Vector Signal Image Processing NVIDIA curand IMSL Library CenterSpace NMath Building-block Algorithms C++ Templated Parallel Algorithms

14 HIGH-PRODUCTIVITY CUDA: Integrated Development Environments

NVIDIA Nsight, Eclipse Edition for Linux and MacOS CUDA-Aware Editor Automated CPU to GPU code refactoring Semantic highlighting of CUDA code Integrated code samples & docs Nsight Debugger

15 NVIDIA Nsight, Eclipse Edition for Linux and MacOS CUDA-Aware Editor Automated CPU to GPU code refactoring Semantic highlighting of CUDA code Integrated code samples & docs Nsight Debugger Simultaneously debug CPU and GPU Inspect variables across CUDA threads Use breakpoints & single-step debugging Nsight Profiler Quickly identifies performance issues Integrated expert system Source line correlation developer.nvidia.com/nsight

NVIDIA Nsight Visual Studio Edition, CUDA Debugger Debug CUDA kernels directly

conditional breakpoints to locate errors CUDA Memory Checker Enables precise

deep kernel analysis to detect factors limiting maximum performance CUDA

16 NVIDIA Nsight Visual Studio Edition, CUDA Debugger Debug CUDA kernels directly on GPU hardware Examine thousands of threads executing in parallel Use on-target conditional breakpoints to locate errors CUDA Memory Checker Enables precise error detection System Trace Review CUDA activities across CPU and GPU Perform deep kernel analysis to detect factors limiting maximum performance CUDA Profiler Advanced experiments to measure memory utilization, instruction throughput and stalls

17 HIGH-PRODUCTIVITY CUDA: Profiling and Debugging Tools

18 Debugging Solutions Command Line to Cluster-Wide NVIDIA Nsight Eclipse & Visual Studio Editions NVIDIA CUDA-GDB for Linux & Mac NVIDIA CUDA-MEMCHECK for Linux & Mac Allinea DDT with CUDA Distributed Debugging Tool TotalView for CUDA for Linux Clusters developer.nvidia.com/debugging-solutions

19 Performance Analysis Tools Single Node to Hybrid Cluster Solutions NVIDIA Nsight Eclipse & Visual Studio Editions NVIDIA Visual Profiler Vampir Trace Collector TAU Performance System PAPI CUDA Component Under Development developer.nvidia.com/performance-analysis-tools

20 Want to know more? Visit DeveloperZone developer.nvidia.com/cuda-tools-ecosystem

21 High-Productivity CUDA Programming Know the tools at our disposal Use them wisely Develop systematically

22 SYSTEMATIC CUDA DEVELOPMENT

23 APOD: A Systematic Path to Performance Assess

24 Assess HOTSPOTS Profile the code, find the hotspot(s) Focus your attention where it will give the most benefit

25 Applications Libraries OpenACC Directives Programming Languages

26 Profile-driven optimization Tools: nsight NVIDIA Nsight IDE nvvp NVIDIA Visual Profiler nvprof Command-line profiling cuobjdump Decompiler

27 Productize Check API return values Run cuda-memcheck tools Library distribution Cluster management Early gains Subsequent changes are evolutionary

28 Assess SYSTEMATIC CUDA DEVELOPMENT APOD Case Study

29 Assess 1 APOD CASE STUDY Round 1: Assess

30 Assess Assess 1 Profile the code, find the hotspot(s) Focus your attention where it will give the most benefit

31 Assess Assess 1 We ve found a hotspot to work on! What percent of our total time does this represent? How much can we improve it? What is the speed of light?

32 Assess Assess 1 ~36%? ~93% What percent of total time does our hotspot represent?

33 Assess Assess 1 We ve found a hotspot to work on! It s 93% of GPU compute time, but only 36% of total time What s going on during that other ~60% of time? Maybe it s one-time startup overhead not worth looking at Or maybe it s significant we should check

34 Assess: Asynchronicity Assess 1 This is the kind of case we would be concerned about Found the top kernel, but the GPU is mostly idle that is our bottleneck Need to overlap CPU/GPU computation and PCIe transfers

35 Parts of your system CPU Network Card GPU

36 Parts of your system CPU Network Card GPU Copy engine (2 directions)

37 Parts of your system CPU Network Card GPU Copy engine (2 directions) Compute engine

38 Parts of your system CPU Network Card GPU Copy engine (2 directions) Compute engine Integer ops Floating point ops Memory read/write

39 Asynchronicity = Overlap = Parallelism DMA DMA Heterogeneous system: overlap work and data movement Kepler/CUDA 5: Hyper-Q and CPU Callbacks make this fairly easy

40 Assess 1 APOD CASE STUDY Round 1:

41 : Achieve Asynchronicity Assess 1 What we want to see is maximum overlap of all engines How to achieve it? Use the CUDA APIs for asynchronous copies, stream callbacks Or use CUDA Proxy and multiple tasks/node to approximate this

42 Further or Move On? Assess 1 Even after we fixed overlap, we still have some pipeline bubbles CPU time per iteration is the limiting factor here So our next step should be to parallelize more

43 Further or Move On? Assess 1 Here s what we know so far: Found the top kernel

44 Further or Move On? Assess 1 Here s what we know so far: Found the top kernel But, the GPU was mostly idle Used asynchronous API to overlap

45 Further or Move On? Assess 1 Here s what we know so far: Found the top kernel But, the GPU was mostly idle Used asynchronous API to overlap CPU is the new bottleneck Lesson : Big gains even w/o optimizing the kernel

46 Further or Move On? Assess 1 Here s what we know so far: Found the top kernel But, the GPU was mostly idle Used asynchronous API to overlap CPU is the new bottleneck Lesson : Big gains even w/o optimizing the kernel Skip ahead to

47 Assess 1 APOD CASE STUDY Round 1:

48 Assess 1 Let s reap the benefits of this sooner rather than later!

49 Assess 1 Let s reap the benefits of this sooner rather than later! Functionality remains the same as before We re just keeping as many units busy at once as we can

50 Assess 1 Let s reap the benefits of this sooner rather than later! Functionality remains the same as before We re just keeping as many units busy at once as we can CUDA is no excuse for bad coding practices

51 Assess 1 Let s reap the benefits of this sooner rather than later! Functionality remains the same as before We re just keeping as many units busy at once as we can CUDA is no excuse for bad coding practices Add error checking, range bounds and deploy

52 Assess 2 APOD CASE STUDY Round 2: Assess

53 Assess Assess 2 Our first round already gave us a glimpse at what s next: CPU compute time is now dominating No matter how well we tune our kernels or reduce our PCIe traffic at this point, it won t reduce total time even a little

54 Assess Assess 2 We need to tune our CPU code somehow Tune neglected code Poor asynchronicity use idle CPU cores, if any Tune MPI communication Offload more work to the GPU If so, which approach will work best?

55 Assess Assess 2 We need to tune our CPU code somehow Tune neglected code Poor asynchronicity use idle CPU cores, if any Tune MPI communication Offload more work to the GPU If so, which approach will work best? Notice that these basically say parallelize

56 Assess: Offloading Work to GPU Assess 2 Applications Libraries OpenACC Directives Programming Languages Pick the best tool for the job

57 Assess 2 APOD CASE STUDY Round 2:

58 : e.g., with OpenACC Assess 2 CPU GPU Simple Compiler hints Program myscience... serial code...!$acc kernels do k = 1,n1 do i = 1,n2... parallel code... enddo enddo!$acc end kernels... End Program myscience OpenACC Compiler Hint Compiler s code Works on many-core GPUs & multicore CPUs Your original Fortran or C code

Backends for CUDA, OpenMP, TBB Extensible and customizable Integrates with existing software Open source // generate 32M random numbers on host thrust::host_vector<int>

59 : e.g., with Thrust Assess 2 Similar to C++ STL High-level interface Flexible Enhances developer productivity Enables performance portability between GPUs and multicore CPUs Backends for CUDA, OpenMP, TBB Extensible and customizable Integrates with existing software Open source // generate 32M random numbers on host thrust::host_vector<int> h_vec(32 << 20); thrust::generate(h_vec.begin(), h_vec.end(), rand); // transfer data to device (GPU) thrust::device_vector<int> d_vec = h_vec; // sort data on device thrust::sort(d_vec.begin(), d_vec.end()); // transfer data back to host thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin()); thrust.github.com or developer.nvidia.com/thrust

60 : e.g., with CUDA C Assess 2 Standard C Code Parallel C Code void saxpy_serial(int n, float a, float *x, float *y) { } for (int i = 0; i < n; ++i) y[i] = a*x[i] + y[i]; // Perform SAXPY on 1M elements saxpy_serial(4096*256, 2.0, x, y); global void saxpy_parallel(int n, float a, float *x, float *y) { int i = blockidx.x * blockdim.x + threadidx.x; if (i < n) y[i] = a*x[i] + y[i]; } // Perform SAXPY on 1M elements saxpy_parallel<<<4096,256>>>(n,2.0,x,y); developer.nvidia.com/cuda-toolkit

61 Assess 2 APOD CASE STUDY Round 2:

62 Optimizing OpenACC Assess 2!$acc data copy(u, v) do t = 1, 1000!$acc kernels u(:,:) = u(:,:) + dt * v(:,:) Usually this means adding some extra directives to give the compiler more information do y=2, ny-1 do x=2,nx-1 v(x,y) = v(x,y) + dt * c * end do end do!$acc end kernels!$acc update host(u(1:nx/4,1:2)) call BoundaryCondition(u)!$acc update device(u(1:nx/4, 1:2) This is an entire session: S Optimizing OpenACC Codes S Hands-on Lab: OpenACC Optimization end do!$acc end data

63 Assess 2 APOD CASE STUDY Round 2:

64 Assess 2 We ve removed (or reduced) a bottleneck Our app is now faster while remaining fully functional * Let s take advantage of that! *Don t forget to check correctness at every step

65 Assess 3 APOD CASE STUDY Round 3: Assess

66 Assess Assess 3 ~93% Finally got rid of those other bottlenecks Time to go dig in to that top kernel

67 Assess Assess 3 ~93% What percent of our total time does this represent? How much can we improve it? What is the speed of light? How much will this improve our overall performance?

68 Assess Assess 3 Let s investigate Strong scaling and Amdahl s Law Weak scaling and Gustafson s Law Expected perf limiters: Bandwidth? Computation? Latency?

69 Assess: Understanding Scaling Assess 3 Strong Scaling Fixed problem size Seek a faster solution with more processors Perfect scaling: run time α 1/N Amdahl s Law: S = 1 1 P + P N 1 (1 P)

70 Assess: Understanding Scaling Assess 3 Weak Scaling Adding more processor and more work Perfect scaling: run time stays constant Gustafson s Law: S = N + (1 P)(1 N)

71 Assess: Applying Strong and Weak Scaling Assess 3 Which scaling is important? Determine possible speedup

72 Assess: Applying Strong Scaling Assess 3 Recall that in this case we are wanting to optimize an existing kernel with a pre-determined workload That s strong scaling, so Amdahl s Law will determine the maximum speedup ~93%

73 Assess: Applying Strong Scaling Assess 3 Now that we ve removed the other bottlenecks, our kernel is ~93% of total time Speedup S = 1 1 P + P S P (S P = speedup in parallel part) In the limit when S P is huge, S will approach In practice, it will be less than that depending on the S P achieved ~93%

74 Assess: Speed of Light Assess 3 What s the limiting factor? Memory bandwidth? Compute throughput? Latency? For our example kernel, SpMV, we think it should be bandwidth We re getting only ~38% of peak bandwidth. If we could get this to 65% of peak, that would mean 1.7 for this kernel, 1.6 overall S = ~93%

75 Assess: Limiting Factor Assess 3 What s the limiting factor? Memory bandwidth Compute throughput Latency

76 Assess: Limiting Factor Assess 3 Compute bytes per instruction, compare to device specs

77 Assess: Limiting Factor Assess 3 Compute bytes per instruction, compare to device specs Compare to actual achieved GB/s vs. theory and achieved Ginstr/s

78 Assess: Limiting Factor Assess 3 Compute bytes per instruction, compare to device specs Compare to actual achieved GB/s vs. theory and achieved Ginstr/s Underperformance indicates you re latency bound

79 Assess: Limiting Factor Assess 3 Compute bytes per instruction, compare to device specs Compare to actual achieved GB/s vs. theory and achieved Ginstr/s Underperformance indicates you re latency bound Profiler can help

80 Know your GPU K20 = GK110 Compute capability: 3.5 Threads / Warp: 32 Max Warps / Multi Processor : 64 Max Threads / MP :?? Max Threadblocks / MP : bit registers/ MP : Max registers / thread : 255 Max threads / threadblock : 1024 Shared memory / MP : 48k

81 Specs K20X Memory bandwidth: 250 GB/s Instruction throughput: 3.95 Tflops SP / 1.31TFlops DP

82 SpMV Sparse Matrix-Vector multiply # of bytes read? # of floating point operations (flops)? Ratio : K20 crossover? 0.25 TB/4Tflops = bytes/flop

83 SpMV Sparse Matrix-Vector multiply # of bytes read? 2 int + 1 float + 1 int + 1 float = 5*4 = 20 bytes per pair # of floating point operations (flops)? Ratio : K20 crossover? 0.25 TB/4Tflops = bytes/flop

84 SpMV Sparse Matrix-Vector multiply # of bytes read? 2 int + 1 float + 1 int + 1 float = 5*4 = 20 bytes per pair # of floating point operations (flops)? 1 multiply + 1 add per pair = 2 flops Ratio : K20 crossover? 0.25 TB/4Tflops = bytes/flop

85 SpMV Sparse Matrix-Vector multiply # of bytes read? 2 int + 1 float + 1 int + 1 float = 5*4 = 20 bytes per pair # of floating point operations (flops)? 1 multiply + 1 add per pair = 2 flops Ratio : 10 bytes/flop K20 crossover? 0.25 TB/4Tflops = bytes/flop

86 SpMV Sparse Matrix-Vector multiply # of bytes read? 2 int + 1 float + 1 int + 1 float = 5*4 = 20 bytes per pair # of floating point operations (flops)? 1 multiply + 1 add per pair = 2 flops Ratio : 10 bytes/flop K20 crossover? 0.25 TB/4Tflops = bytes/flop

87 Assess: Limiting Factor Assess 3 Profiler indicates our bandwidth is 38% of peak Below indicates problems

88 Assess 3 APOD CASE STUDY Round 3:

89 Assess 3 Our optimization efforts should all be profiler-guided Eliminate memory replays Boost occupancy Handle more than 1 item/thread (increases ILP) Reduce block synchronizes Reduce warp divergence

90 HIGH-PRODUCTIVITY CUDA: Wrap-up

91 High-Productivity CUDA Programming Recap: Know the tools at our disposal Use them wisely Develop systematically with APOD Assess

92 Online Resources devtalk.nvidia.com developer.nvidia.com docs.nvidia.com

High-Productivity CUDA Programming. Cliff Woolley, Sr. Developer Technology Engineer, NVIDIA

High-Productivity CUDA Programming Cliff Woolley, Sr. Developer Technology Engineer, NVIDIA HIGH-PRODUCTIVITY PROGRAMMING High-Productivity Programming What does this mean? What s the goal? Do Less Work