Accelerate your Application with Kepler. Peter Messmer


1 Accelerate your Application with Kepler Peter Messmer

2 Goals How to analyze/optimize an existing application for GPUs What to consider when designing new GPU applications How to optimize performance with Kepler/CUDA 5

3 GPU Acceleration

4 GPU acceleration approaches: Libraries, OpenACC Directives, Programming Languages. Here: focus on Programming Languages


6

7

8 APOD: A Systematic Path to Performance - Assess, Parallelize, Optimize, Deploy

9 Starting point: Matrix transpose

    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[j][i] = in[i][j];

10 Matrix transpose on CPU

    void transpose(float in[], float out[]) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[i*N + j] = in[j*N + i];
    }

    float in[], out[];
    transpose(in, out);

11 An initial CUDA version

    __global__ void transpose(float in[], float out[]) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[i*N + j] = in[j*N + i];
    }

    float in[], out[];
    cudaMemcpy(in, in_host, N*N*sizeof(float), cudaMemcpyHostToDevice);
    transpose<<<1,1>>>(in, out);


13 An initial CUDA version

    __global__ void transpose(float in[], float out[]) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[i*N + j] = in[j*N + i];
    }

    transpose<<<1,1>>>(in, out);

+ Quickly implemented
- Performance weak: a single thread does all the work
=> Express parallelism!

14 Recap: Kernel Execution Model

Thread: sequential execution unit
- All threads execute the same sequential program
- Threads execute in parallel

Block: a group of threads
- Executes on a single Streaming Multiprocessor (SM)
- Threads within a block can cooperate: light-weight synchronization, data exchange

Grid: a group of thread blocks
- Communication between blocks is expensive
- Thread blocks of a grid execute on multiple SMs

15 First Parallelization: Inner Loop

Process source rows independently, one thread each:

    __global__ void transpose(float in[], float out[]) {
        int tid = threadIdx.x;
        for (int j = 0; j < N; j++)
            out[tid*N + j] = in[j*N + tid];
    }

    transpose<<<1, N>>>(in, out);

16 Second Parallelization: One Block per Row

    __global__ void transpose(float in[], float out[]) {
        int tid = threadIdx.x;
        int bid = blockIdx.x;
        out[tid*N + bid] = in[bid*N + tid];
    }

    transpose<<<N, N>>>(in, out);

17 NVVP: NVIDIA Visual Profiler
- Application analysis
- Kernel properties

18 Application Assessment with NVVP


21 Source-Level Hot-spot Analysis in NVVP


23 What does "Uncoalesced Store" mean?

- Global memory access happens in transactions of 32 bytes
- Coalesced access: a group of 32 threads (a "warp") accessing adjacent bytes
- Uncoalesced access: a group of 32 threads accessing scattered bytes - results in up to 32 transactions

24 Memory Access Patterns

Array access:
- ~OK: x[i] = a[i+1] - a[i]
- Bad: x[i] = a[64*i] - a[i]

SoA vs AoS:
- OK: point.x[i]
- Bad: point[i].x

Random access:
- Bad: a[rand_fun(i)]


26 How can we improve the write?

- Coalesced read, scattered write (stride N)
- Process a matrix tile, not a single row/column
- Transpose the matrix tile within the block
=> Need threads in a block to cooperate
=> Use shared memory

27 Shared Memory

- Accessible by all threads in a block (each SM has its own SMEM alongside its registers; global memory (DRAM) is shared by all SMs)
- Fast compared to global memory: low access latency, high bandwidth (almost like registers)
- Common uses: software-managed cache, data layout conversion

28 Transpose with coalesced read/write

    __global__ void transpose(float in[], float out[]) {
        __shared__ float tile[TILE][TILE];
        int glob_in  = xIndex + yIndex*N;
        int glob_out = xIndex + yIndex*N;  // as on the slide; built from the transposed block coordinates

        tile[threadIdx.y][threadIdx.x] = in[glob_in];
        __syncthreads();
        out[glob_out] = tile[threadIdx.x][threadIdx.y];
    }

    dim3 grid(N/TILE, N/TILE, 1);
    dim3 threads(TILE, TILE, 1);
    transpose<<<grid, threads>>>(in, out);


31 What happens at the barrier?

Synchronization in the kernel:

    tile[y][x] = in[in_data];
    __syncthreads();
    out[out_index] = tile[x][y];

- Keep the number of threads blocked at the barrier to a minimum


33 Thread Serialization

    tile[y][x] = in[in_data];
    __syncthreads();
    out[out_index] = tile[x][y];

- Keep the number of threads blocked at the barrier to a minimum
- Use more thread blocks, but: # blocks per SM is limited by # threads/block
- Solution: reduce the number of threads per block

34 Impact of Reduced Serialization


36 Shared Memory Organization

- Organized in 32 independent banks
- Optimal access: disjoint banks (any 1:1 mapping) or multicast
- Multiple accesses to the same bank: serialization
- Solution for transpose: padding, tile[16][16] => tile[16][17]

37 Final Solution


39 APOD Cycle Summary

- Assessment: algorithm highly parallel
- Parallelization: 1 thread per column: 12 GB/s
- Parallelization: 1 thread per element: 99 GB/s
- Optimization: memory access coalescing: 93 GB/s
- Optimization: latency hiding: 124 GB/s
- Optimization: bank conflict resolution: 170 GB/s
=> Ready for deployment
APOD applied at each P/O step

40 Additional Metrics

41 Control Flow

    if (...) {
        // then-clause
    } else {
        // else-clause
    }

42 Execution within warps is coherent (a warp is a vector of threads that issues instructions together)


44 Execution diverges within a warp. Solution: group threads with similar control flow.

45 Occupancy

- Need many independent threads per SM to hide latencies (memory access, instructions)
- Hardware resources determine the maximum number of threads/thread blocks per SM
- Consumed resources determine the actual number
- Occupancy = N_actual / N_max

46 Occupancy

Limiting resources:
- Number of threads
- Number of registers per thread
- Number of blocks
- Amount of shared memory per block

No need for 100% occupancy - depends on the kernel

47 Occupancy Calculator Analyze effect of resource consumption on occupancy

48 Alternatives to NVVP: nvprof

Command-line profiler with access to hardware counters.

    % nvprof --print-gpu-trace ./transpose

Prints one line per GPU activity: start time, duration, grid size, block size, registers, memcpy size and throughput, and the kernel or memcpy name.

    % nvprof --print-gpu-trace --aggregate-mode-off --events sm_cta_launched ./transpose

Prints per-device event values (e.g. sm_cta_launched) for each kernel.

- List of supported counters: --query-events

49 Alternatives to NVVP: nvprof

50 Alternatives to NVVP: Instrumentation

    cudaEventRecord(start, 0);
    transpose<<<grid, threads>>>(...);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);

51 Characteristics of an Ideal GPU Candidate

- Sufficient parallelism: K20X keeps tens of thousands of threads in flight; rather >1000-way than ~10-way
- Memory access patterns: ideally close to stride 1 possible
- Control-flow patterns: low divergence, at least for groups of threads


53

54 Streams

[Animation, slides 54-60: the host app enqueues Grids 1-3 on Stream 0; the Grid Management Unit dispatches them to the SMs one after another]

Kernel launches within the same stream are in-order.

61 Streams

[Animation, slides 61-64: grids from Stream 1 and Stream 2 are dispatched to the SMs concurrently]

Kernel launches:
- within the same stream are in-order
- in different streams can be concurrent
All kernel launches are asynchronous to the host.

65 Asynchronous Data Transfer / Pipelining

[Animation, slides 65-70: CPU/GPU timelines; splitting a transfer + kernel into chunks lets copies and compute overlap]

71 Asynchronous Data Transfer / Pipelining

    cudaStream_t stream1, stream2;
    cudaMemcpyAsync(dst, src, size, dir, stream1);
    kernel<<<grid, block, 0, stream2>>>(...);

72 Hyper-Q Enables Efficient Scheduling Grid management unit can select most appropriate grid from 32 streams Improves scheduling of concurrently executed grids Particularly interesting for MPI applications

73 Strong Scaling of MPI Application GPU parallelizable part CPU parallel part Serial part N=1 Multicore CPU only

74 Strong Scaling of MPI Application GPU parallelizable part CPU parallel part Serial part N=1 N=2 Multicore CPU only

75 Strong Scaling of MPI Application GPU parallelizable part CPU parallel part Serial part N=1 N=2 N=4 Multicore CPU only

76 Strong Scaling of MPI Application GPU parallelizable part CPU parallel part Serial part N=1 N=2 N=4 N=8 Multicore CPU only

77 GPU Accelerated MPI Application GPU parallelizable part CPU parallel part Serial part N=1 N=2 N=4 N=8 N=1 Multicore CPU only GPU accelerated CPU

78 GPU Accelerated Strong Scaling GPU parallelizable part CPU parallel part Serial part With Hyper-Q/Proxy Available in K20 N=1 N=2 N=4 N=8 Multicore CPU only N=1 N=2 N=4 N=8 GPU accelerated CPU


80 Example: Hyper-Q/Proxy for CP2K

81 How to use Hyper-Q

- No application modifications necessary
- Proxy process sits between user processes and the GPU
- Start the proxy daemon: nvidia-proxy-server-control -d

82 Don't Forget Large-Scale Behavior

- Profile in a realistic environment; get the profile at scale (Tau, Scalasca, VampirTrace+Vampir, CrayPAT, ...)
- Fix messaging problems first! (Example timeline with N_rank = 384: much time is waste, not compute)
- GPUs will accelerate your compute and amplify messaging problems
- Fixing them will also help CPU-only code

83 CUDA Dynamic Parallelism CPU GPU GPU as Co-Processor

84 CUDA Dynamic Parallelism CPU GPU CPU GPU GPU as Co-Processor Autonomous, Dynamic Parallelism

85 Dynamic Work Generation

- Fixed grid: statically assign a conservative worst-case grid to the initial grid
- Dynamic grid: dynamically assign performance where accuracy is required

88 CUDA Dynamic Parallelism

Kernels launch grids with syntax identical to the host; CUDA runtime functions live in the cudadevrt library.

    __global__ void childKernel() {
        printf("Hello %d", threadIdx.x);
    }

    __global__ void parentKernel() {
        childKernel<<<1,10>>>();
        cudaDeviceSynchronize();
        printf("World!\n");
    }

    int main(int argc, char *argv[]) {
        parentKernel<<<1,1>>>();
        cudaDeviceSynchronize();
        return 0;
    }

89 Characteristics of an Ideal GPU Candidate Sufficient parallelism K20X: Up to threads in flight Rather > way than ~10-way Concurrent grids Memory access patterns Ideally close to stride 1 possible Control-flow patterns Low divergence, at least for groups of threads

90

91 Before CUDA 5: Whole-Program Compilation main.cpp + a.cu b.cu c.cu program.exe #include all files together to build Earlier CUDA releases required a single source file for each kernel Linking with external code was not supported

92 CUDA 5: Separate Compilation & Linking a.cu b.cu c.cu main.cpp + program.exe a.o b.o c.o Separate compilation allows building independent object files CUDA 5 can link multiple object files into one program

93 Benefits of Separate Compilation & Linking

- Easier to reuse your existing code: no need to #include all files together any more; the extern attribute is respected
- Incremental compilation reduces build time (e.g. a 47,000-line single file: 50s down to 4s)
- Use 3rd-party GPU-callable libraries or create your own: a GPU-callable BLAS library (libcublas_device.a) is included in CUDA Toolkit 5.0 and uses Dynamic Parallelism

94 CUDA 5: GPU Callable Libraries main.cpp + a.cu b.cu foo.cu + ab.a a.o + b.o program.exe Can combine object files into static libraries Link and externally call device code

95 CUDA 5: GPU Callable Libraries main.cpp main2.cpp + + foo.cu bar.cu a.cu b.cu + + ab.a ab.a a.o + b.o program.exe program2.exe Combine object files into static libraries Facilitates code reuse, reduces compile time

96 CUDA 5: Callbacks main.cpp + foo.cu Enables closed-source device libraries to call user-defined device callback functions + vendor.a + callback.cu program.exe


98 Device Linker Invocation

Introduction of an optional link step for device code:

    nvcc -arch=sm_20 -dc a.cu b.cu
    nvcc -arch=sm_20 -dlink a.o b.o -o link.o
    g++ a.o b.o link.o -L<path> -lcudart

Link the device-runtime library for dynamic parallelism:

    nvcc -arch=sm_35 -dc a.cu b.cu
    nvcc -arch=sm_35 -dlink a.o b.o -lcudadevrt -o link.o
    g++ a.o b.o link.o -L<path> -lcudadevrt -lcudart

Currently, the link occurs at the cubin level (PTX not yet supported).

99 GPUDirect enables GPU-aware MPI

- GPU-to-GPU transfer across the NIC, without CPU participation
- Unified Virtual Addressing lets the MPI library detect whether a buffer pointer refers to host or device memory

100 GPUDirect enables GPU-aware MPI

    cudaMemcpy(s_buf_h, s_buf_d, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
    MPI_Recv(r_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(r_buf_d, r_buf_h, size, cudaMemcpyHostToDevice);

Simplifies to:

    MPI_Send(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
    MPI_Recv(r_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

(for CPU and GPU buffers alike)

101 GPU Management: nvidia-smi

Multi-GPU systems are widely available, and different systems are set up differently. nvidia-smi gives quick information on:
- Approximate GPU utilization
- Approximate memory footprint
- Number of GPUs
- ECC state
- Driver version

Sample report (two idle Tesla K20X GPUs): per GPU it lists name, bus ID, display mode, ECC state, fan, temperature, performance state (P8), power usage/cap (28W / 235W), memory usage (12MB / 6143MB), GPU utilization (0%), compute mode, and any running compute processes.

nvidia-smi can inspect and modify GPU state.

102 Where to find additional information

- CUDA documentation: Best Practice Guide, Kepler Tuning Guide
- Kepler whitepaper

103 Where to find additional information: GTC

- Kepler architecture: GTC12 session S0642, "Inside Kepler"
- Assessing performance limiters: GTC10 session 2012, "Analysis-driven Optimization" (slides 5-19)
- Profiling tools: GTC12 sessions S0419, "Optimizing Application Performance with CUDA Performance Tools", and S0420, "Nsight IDE for Linux and Mac", ...
- CUPTI documentation (describes all the profiler counters), included in every CUDA toolkit (/cuda/extras/cupti/doc/cupti_users_guide.pdf)
- GPU computing webinars in general

104 Kepler and CUDA 5: Powerful yet Easy

- Kepler and CUDA 5 simplify GPU acceleration
- Bypass optimization trial/error with APOD
- Profile and analyze efficiently with NVVP
- Improve MPI scalability with Hyper-Q/Proxy
- Parallelize with CUDA Dynamic Parallelism

105 Thank you!

106 Backup Slides


More information

Heidi Poxon Cray Inc.

Heidi Poxon Cray Inc. Heidi Poxon Topics GPU support in the Cray performance tools CUDA proxy MPI support for GPUs (GPU-to-GPU) 2 3 Programming Models Supported for the GPU Goal is to provide whole program analysis for programs

More information

NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010

NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Analysis-Driven Optimization Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Performance Optimization Process Use appropriate performance metric for each kernel For example,

More information

CUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)

CUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5) CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23 Outline 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration

More information

Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Summary

Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Summary Optimizing CUDA Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Summary NVIDIA Corporation 2009 2 Optimize Algorithms for the GPU Maximize

More information

Introduc)on to GPU Programming

Introduc)on to GPU Programming Introduc)on to GPU Programming Mubashir Adnan Qureshi h3p://www.ncsa.illinois.edu/people/kindr/projects/hpca/files/singapore_p1.pdf h3p://developer.download.nvidia.com/cuda/training/nvidia_gpu_compu)ng_webinars_cuda_memory_op)miza)on.pdf

More information

ECE 574 Cluster Computing Lecture 17

ECE 574 Cluster Computing Lecture 17 ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux

More information

CUDA Workshop. High Performance GPU computing EXEBIT Karthikeyan

CUDA Workshop. High Performance GPU computing EXEBIT Karthikeyan CUDA Workshop High Performance GPU computing EXEBIT- 2014 Karthikeyan CPU vs GPU CPU Very fast, serial, Low Latency GPU Slow, massively parallel, High Throughput Play Demonstration Compute Unified Device

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Performance Optimization Process

Performance Optimization Process Analysis-Driven Optimization (GTC 2010) Paulius Micikevicius NVIDIA Performance Optimization Process Use appropriate performance metric for each kernel For example, Gflops/s don t make sense for a bandwidth-bound

More information

An Introduction to GPU Architecture and CUDA C/C++ Programming. Bin Chen April 4, 2018 Research Computing Center

An Introduction to GPU Architecture and CUDA C/C++ Programming. Bin Chen April 4, 2018 Research Computing Center An Introduction to GPU Architecture and CUDA C/C++ Programming Bin Chen April 4, 2018 Research Computing Center Outline Introduction to GPU architecture Introduction to CUDA programming model Using the

More information

Maximizing Face Detection Performance

Maximizing Face Detection Performance Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount

More information

Profiling of Data-Parallel Processors

Profiling of Data-Parallel Processors Profiling of Data-Parallel Processors Daniel Kruck 09/02/2014 09/02/2014 Profiling Daniel Kruck 1 / 41 Outline 1 Motivation 2 Background - GPUs 3 Profiler NVIDIA Tools Lynx 4 Optimizations 5 Conclusion

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

CS 314 Principles of Programming Languages

CS 314 Principles of Programming Languages CS 314 Principles of Programming Languages Zheng Zhang Fall 2016 Dec 14 GPU Programming Rutgers University Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing computations

More information

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D

More information

Preparing seismic codes for GPUs and other

Preparing seismic codes for GPUs and other Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of

More information

More CUDA. Advanced programming

More CUDA. Advanced programming More CUDA 1 Advanced programming Synchronization and atomics Warps and Occupancy Memory coalescing Shared Memory Streams and Asynchronous Execution Bank conflicts Other features 2 Synchronization and Atomics

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Tuning Performance on Kepler GPUs

Tuning Performance on Kepler GPUs Tuning Performance on Kepler GPUs An Introduction to Kepler Assembler and Its Usage in CNN optimization Cheng Wang, Zhe Jia, Kai Chen Alibaba Convolution Layer Performance on K40 Height = 16; Width = 16;

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

GPU MEMORY BOOTCAMP III

GPU MEMORY BOOTCAMP III April 4-7, 2016 Silicon Valley GPU MEMORY BOOTCAMP III COLLABORATIVE ACCESS PATTERNS Tony Scudiero NVIDIA Devtech Fanatical Bandwidth Evangelist The Axioms of Modern Performance #1. Parallelism is mandatory

More information

Advanced CUDA Optimizations

Advanced CUDA Optimizations Advanced CUDA Optimizations General Audience Assumptions General working knowledge of CUDA Want kernels to perform better Profiling Before optimizing, make sure you are spending effort in correct location

More information

Introduction to CUDA Programming

Introduction to CUDA Programming Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel

More information

CUDA. Matthew Joyner, Jeremy Williams

CUDA. Matthew Joyner, Jeremy Williams CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel

More information

Introduction to GPU Computing. Design and Analysis of Parallel Algorithms

Introduction to GPU Computing. Design and Analysis of Parallel Algorithms Introduction to GPU Computing Design and Analysis of Parallel Algorithms Sources CUDA Programming Guide (3.2) CUDA Best Practices Guide (3.2) CUDA Toolkit Reference Manual (3.2) CUDA SDK Examples Part

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY This lecture: Programming model and language Memory spaces and memory access Shared memory Examples Lecture questions: 1. Suggest two significant

More information

Data Parallel Execution Model

Data Parallel Execution Model CS/EE 217 GPU Architecture and Parallel Programming Lecture 3: Kernel-Based Data Parallel Execution Model David Kirk/NVIDIA and Wen-mei Hwu, 2007-2013 Objective To understand the organization and scheduling

More information

CUDA Architecture & Programming Model

CUDA Architecture & Programming Model CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New

More information

A Tutorial on CUDA Performance Optimizations

A Tutorial on CUDA Performance Optimizations A Tutorial on CUDA Performance Optimizations Amit Kalele Prasad Pawar Parallelization & Optimization CoE TCS Pune 1 Outline Overview of GPU architecture Optimization Part I Block and Grid size Shared memory

More information

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes

More information

Programming in CUDA. Malik M Khan

Programming in CUDA. Malik M Khan Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement

More information

GPU programming. Dr. Bernhard Kainz

GPU programming. Dr. Bernhard Kainz GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling

More information

Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015

Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15: Profiling and Parallelizing with the OpenACC Toolkit

More information

CSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller. Computer Science Department Stony Brook University CSE 591: GPU Programming Using CUDA in Practice Klaus Mueller Computer Science Department Stony Brook University Code examples from Shane Cook CUDA Programming Related to: score boarding load and store

More information

Dense Linear Algebra. HPC - Algorithms and Applications

Dense Linear Algebra. HPC - Algorithms and Applications Dense Linear Algebra HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 6 th 2017 Last Tutorial CUDA Architecture thread hierarchy:

More information

Lecture 9. Outline. CUDA : a General-Purpose Parallel Computing Architecture. CUDA Device and Threads CUDA. CUDA Architecture CUDA (I)

Lecture 9. Outline. CUDA : a General-Purpose Parallel Computing Architecture. CUDA Device and Threads CUDA. CUDA Architecture CUDA (I) Lecture 9 CUDA CUDA (I) Compute Unified Device Architecture 1 2 Outline CUDA Architecture CUDA Architecture CUDA programming model CUDA-C 3 4 CUDA : a General-Purpose Parallel Computing Architecture CUDA

More information

CUDA Parallelism Model

CUDA Parallelism Model GPU Teaching Kit Accelerated Computing CUDA Parallelism Model Kernel-Based SPMD Parallel Programming Multidimensional Kernel Configuration Color-to-Grayscale Image Processing Example Image Blur Example

More information

GPU programming: Code optimization part 1. Sylvain Collange Inria Rennes Bretagne Atlantique

GPU programming: Code optimization part 1. Sylvain Collange Inria Rennes Bretagne Atlantique GPU programming: Code optimization part 1 Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr Outline Host-side task and memory management Recap: the GPU memory hierarchy Asynchronism

More information

CUDA Performance Optimization

CUDA Performance Optimization Mitglied der Helmholtz-Gemeinschaft CUDA Performance Optimization GPU Programming with CUDA April 25-27, 2016 Jiri Kraus (NVIDIA) based on work by Andrew V. Adinetz What you will learn: What is memory

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions

2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions Administrative L6: Memory Hierarchy Optimization IV, Bandwidth Optimization Next assignment available Goals of assignment: simple memory hierarchy management block-thread decomposition tradeoff Due Tuesday,

More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

Tutorial: Parallel programming technologies on hybrid architectures HybriLIT Team

Tutorial: Parallel programming technologies on hybrid architectures HybriLIT Team Tutorial: Parallel programming technologies on hybrid architectures HybriLIT Team Laboratory of Information Technologies Joint Institute for Nuclear Research The Helmholtz International Summer School Lattice

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Unrolling parallel loops

Unrolling parallel loops Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:

More information

Advanced Research Computing. ARC3 and GPUs. Mark Dixon

Advanced Research Computing. ARC3 and GPUs. Mark Dixon Advanced Research Computing Mark Dixon m.c.dixon@leeds.ac.uk ARC3 (1st March 217) Included 2 GPU nodes, each with: 24 Intel CPU cores & 128G RAM (same as standard compute node) 2 NVIDIA Tesla K8 24G RAM

More information

Profiling & Tuning Applica1ons. CUDA Course July István Reguly

Profiling & Tuning Applica1ons. CUDA Course July István Reguly Profiling & Tuning Applica1ons CUDA Course July 21-25 István Reguly Introduc1on Why is my applica1on running slow? Work it out on paper Instrument code Profile it NVIDIA Visual Profiler Works with CUDA,

More information

CUDA Parallel Programming Model Michael Garland

CUDA Parallel Programming Model Michael Garland CUDA Parallel Programming Model Michael Garland NVIDIA Research Some Design Goals Scale to 100s of cores, 1000s of parallel threads Let programmers focus on parallel algorithms not mechanics of a parallel

More information

CUDA Exercises. CUDA Programming Model Lukas Cavigelli ETZ E 9 / ETZ D Integrated Systems Laboratory

CUDA Exercises. CUDA Programming Model Lukas Cavigelli ETZ E 9 / ETZ D Integrated Systems Laboratory CUDA Exercises CUDA Programming Model 05.05.2015 Lukas Cavigelli ETZ E 9 / ETZ D 61.1 Integrated Systems Laboratory Exercises 1. Enumerate GPUs 2. Hello World CUDA kernel 3. Vectors Addition Threads and

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information

CS 179 Lecture 4. GPU Compute Architecture

CS 179 Lecture 4. GPU Compute Architecture CS 179 Lecture 4 GPU Compute Architecture 1 This is my first lecture ever Tell me if I m not speaking loud enough, going too fast/slow, etc. Also feel free to give me lecture feedback over email or at

More information

State-of-the-art in Heterogeneous Computing

State-of-the-art in Heterogeneous Computing State-of-the-art in Heterogeneous Computing Guest Lecture NTNU Trond Hagen, Research Manager SINTEF, Department of Applied Mathematics 1 Overview Introduction GPU Programming Strategies Trends: Heterogeneous

More information

Introduction to Parallel Computing with CUDA. Oswald Haan

Introduction to Parallel Computing with CUDA. Oswald Haan Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries

More information

Performance Optimization: Programming Guidelines and GPU Architecture Reasons Behind Them

Performance Optimization: Programming Guidelines and GPU Architecture Reasons Behind Them Performance Optimization: Programming Guidelines and GPU Architecture Reasons Behind Them Paulius Micikevicius Developer Technology, NVIDIA Goals of this Talk Two-fold: Describe how hardware operates Show

More information