CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA

Size: px

Start display at page:

Download "CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA"

Alexia Bennett
6 years ago
Views:

1 CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION Julien Demouth, NVIDIA Cliff Woolley, NVIDIA

2 WHAT WILL YOU LEARN? An iterative method to optimize your GPU code A way to conduct that method with NVIDIA Nsight EE

3 A WORD ABOUT THE APPLICATION Grayscale Blur Edges

4 A WORD ABOUT THE APPLICATION Grayscale Conversion // r, g, b: Red, green, blue components of the pixel p foreach pixel p: p = f*r f*g f*b;

A WORD ABOUT THE APPLICATION Blur: 7x7 Gaussian Filter 1 2 3 2 4 6 3 6 9 4 8 12 3 6 9 2 4 6 1 2 3 4 3 2 1 8 6 4 2 12 9

5 A WORD ABOUT THE APPLICATION Blur: 7x7 Gaussian Filter foreach pixel p: p = weighted sum of p and its 48 neighbors Image from Wikipedia

6 A WORD ABOUT THE APPLICATION Edges: 3x3 Sobel Filters foreach pixel p: Gx = weighted sum of p and its 8 neighbors Gy = weighted sum of p and its 8 neighbors p = sqrt(gx + Gy) Weights for Gx: Weights for Gy:

7 OPTIMIZATION METHOD Trace the Application Identify the Hot Spot and Profile it Identify the Performance Limiter Memory Bandwidth Instruction Throughput Latency Optimize the Code Iterate We focus on the Assess and Optimize steps of the APOD method. We do not talk about the Parallelize and Deploy steps

8 ENVIRONMENT NVIDIA Tesla K20c (GK110, SM3.5) without ECC Microsoft Windows 7 x64 Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 Identical results are obtained on Linux

BEFORE WE START Some slides are for background Performance Optimization: Programming Guidelines and GPU Architecture Details Behind Them, GTC 2013 http://on-demand.gputechconf.

9 BEFORE WE START Some slides are for background Performance Optimization: Programming Guidelines and GPU Architecture Details Behind Them, GTC CUDA Best Practices Guide Chameleon from Creative Commons

10 BEFORE WE START Instructions are executed by warps of threads It is a hardware concept There are 32 threads per warp

11 ITERATION 1

12 TRACE THE APPLICATION

13 TIMELINE

14 EXAMINE INDIVIDUAL KERNELS Launch

15 KERNEL OPTIMIZATION PRIORITIES Hotspot The Hotspot is gaussian_filter_7x7_v0 Kernel Time Speedup Original version 6.265ms

16 PROFILE THE HOTSPOT Select the Kernel Launch

17 IDENTIFY THE MAIN LIMITER Is it limited by the memory bandwidth? Is it limited by the instruction throughput? Is it limited by latency? Nsight EE helps us perform that analysis

18 LATENCY The kernel is limited by latency Hint: Memory related

19 LATENCY GPUs cover latencies by having a lot of work in flight The warp issues The warp waits (latency) warp 0 warp 1 warp 2 warp 3 warp 4 warp 5 warp 6 warp 7 warp 8 warp 9 Fully covered latency Exposed latency No warp issues

20 LATENCY ANALYSIS Launch

21 STALL REASONS Indicate why a warp cannot issue There are 6 stall reasons reported by Nsight EE Execution dependency Data request Texture Synchronization Instruction fetch Other

22 STALL REASONS Stall reasons: Why warps cannot issue

23 STALL REASONS: EXECUTION DEPENDENCY a = b + c; // ADD a = b[i]; // LOAD d = a + e; // ADD d = a + e; // ADD Memory accesses may influence execution dependencies Global accesses create longer dependencies than shared ones Read-only/texture dependencies are counted in Texture Instruction level parallelism help reduce dependencies a = b + c; // Independent ADDs d = e + f;

24 ILP AND MEMORY ACCESSES float a = 0.0f; for( int i = 0 ; i < N ; ++i ) a += logf(b[i]); c = b[0] No ILP a += logf(c) 2-way ILP (with loop unrolling) float a, a0 = 0.0f, a1 = 0.0f; for( int i = 0 ; i < N ; i += 2 ) { a0 += logf(b[i]); a1 += logf(b[i+1]); } a = a0 + a1 c = b[1] a += logf(c) c = b[2] a += logf(c) c = b[3]... c0 = b[0] a0 += logf(c0) c0 = b[2] a0 += logf(c0) c1 = b[1] a1 += logf(c1) c1 = b[3] a1 += logf(c1) a += logf(c) a = a0 + a1 #pragma unroll is useful to extract ILP

25 STALL REASONS: DATA REQUEST High percentage of data request: Many memory replays To do: Improve coalescing of memory accesses Improve alignment of memory accesses

26 MEMORY TRANSACTIONS Warp of threads (32 threads) L1 transaction: 128B Alignment: 128B (0, 128, 256, ) L2 transaction: 32B Alignment: 32B (0, 32, 64, 96, )

27 MEMORY TRANSACTIONS A warp issues 32x4B aligned and consecutive loads/stores Threads read different elements of the same 128B segment 1x 128B L1 transaction per warp 4x 32B L2 transactions per warp 1x L1 transaction: 128B needed / 128B transferred 4x L2 transactions: 128B needed / 128B transferred

28 MEMORY TRANSACTIONS Threads in a warp read/write 4B words, 128B between words Each thread reads the first 4B of a 128B segment 32x 1x 128B L1 transaction per thread 1x 32B L2 transaction per thread 32x L1 transactions: 128B needed / 32x 128B transferred 32x L2 transactions: 128B needed / 32x 32B transferred

29 REPLAYS A warp reads from addresses spanning 3 lines of 128B Threads 0-7 Threads Threads 8-15 Threads instr. executed and 2 replays = 1 request and 3 transactions Instruction issued Instruction re-issued 1 st replay Instruction re-issued 2 nd replay Time Threads 0-7/24-31 Threads 8-15 Threads 16-23

30 REPLAYS With replays, requests take more time and use more resources More instructions issued More memory traffic Increased execution time Execution time Inst. 0 Issued Inst. 1 Issued Inst. 2 Issued Inst. 0 Completed Inst. 1 Completed Inst. 2 Completed Extra work (SM) Extra latency Transfer data for inst. 0 Transfer data for inst. 1 Transfer data for inst. 2 Extra memory traffic Threads 0-7/24-31 Threads 8-15 Threads Threads 0-7/24-31 Threads 8-15 Threads 16-23

31 STALL REASONS: DATA REQUEST Data request is also influenced by shared memory replays See CUDA Programming Guide, Sections and G.5.3 (Kepler) Shared memory bank conflicts Data request is also influenced by local memory replays See CUDA Programming Guide, Section Local memory and register spilling

32 COLLECT METRICS AND EVENT Collect Global load/store transactions per request metrics

33 TRANSACTIONS PER REQUEST Transactions per Request: 4.20 (Load) / 4.00 (Store) Too many memory transactions (too much pressure on LSU)

34 TRANSACTIONS PER REQUEST Our blocks are 8x8 Warp 0 Warp threadidx.x We should use blocks of size 32x

35 IMPROVED MEMORY ACCESSES Improved memory accesses: Blocks of size 32x2 It runs faster: 3.930ms Kernel Time Speedup Original version 6.265ms Better memory accesses 3.930ms 1.59x

36 ITERATION 2

37 TRACE THE APPLICATION Hotspot The hotspot is still gaussian_filter_7x7_v0 Kernel Time Speedup Original version 6.265ms Better memory accesses 3.930ms 1.59x We use the same block size for the Sobel filter kernel. That s the reason why it also improves (2 nd row of the Nsight table).

38 LATENCY The kernel is limited by latency

39 STALL REASONS High execution dependency It may not be obvious what to change

40 OCCUPANCY Nsight also warns us about Achieved occupancy

41 LATENCY: LACK OF OCCUPANCY Not enough active warps warp 0 warp 1 warp 2 warp 3 No warp issues The schedulers cannot find eligible warps at every cycle

42 OCCUPANCY We are limited by the size of the blocks Let s change our block size to 32x4

43 OCCUPANCY Blocks of size 32x4 It runs slightly faster: 3.606ms Kernel Time Speedup Original version 6.265ms Better memory accesses 3.930ms 1.59x Larger blocks 3.606ms 1.74x

44 ITERATION 3

TRACE THE APPLICATION Hotspot The hotspot is still gaussian_filter_7x7_v0 Kernel Time Speedup Original version 6.265ms Better memory accesses 3.930ms 1.

45 TRACE THE APPLICATION Hotspot The hotspot is still gaussian_filter_7x7_v0 Kernel Time Speedup Original version 6.265ms Better memory accesses 3.930ms 1.59x Larger blocks 3.606ms 1.74x We use the same block size for the Sobel filter kernel. That s the reason why it also improves (2 nd row of the Nsight table).

46 LATENCY The kernel is limited by latency Highest memory pressure is on the L2 cache

47 LATENCY High execution dependency (+ L2 cache pressure) Are we loading the same data several times from L2?

48 L2 CACHE HIT RATE The L2 cache hit rate is high: 99% To do: Reduce the amount of memory transferred Move the data closer to the SM (actually, to the SM)

49 SHARED MEMORY Adjacent pixels access have neighbors in common We should use shared memory to store those common pixels shared unsigned char smem_pixels[10][64];

50 SHARED MEMORY Shared memory: 1.210ms Kernel Time Speedup Original version 6.265ms Better memory accesses 3.930ms 1.59x Larger blocks 3.606ms 1.74x Shared memory 1.210ms 5.18x

51 ITERATION 4

52 TRACE THE APPLICATION Hotspot The hotspot is still gaussian_filter_7x7_v1 Kernel Time Speedup Original version 6.265ms Better memory accesses 3.930ms 1.59x Larger blocks 3.606ms 1.74x Shared memory 1.210ms 5.18x

53 COMPUTE AND MEMORY BOUND The highest pressure is on the Load-Store Unit

54 INSTRUCTION THROUGHPUT SM Registers Sched Sched Sched Sched Pipes Pipes Pipes Pipes SMEM/L1$ Each SM has 4 schedulers (Kepler) Schedulers issue instructions to pipes A scheduler issues up to 2 instructions/cycle Sustainable peak is 7 instructions/cycle per SM (not 4x2 = 8) A scheduler issues inst. from a single warp Cannot issue to a pipe if its issue slot is full

55 INSTRUCTION THROUGHPUT Schedulers saturated Sched Sched Sched Sched Utilization: 90% Pipe saturated Sched Sched Sched Sched Utilization: 64% Schedulers and pipe saturated Sched Sched Sched Sched Utilization: 92% Load Store Texture Control Flow ALU Load Store Texture Control Flow ALU Load Store Texture Control Flow ALU 90% 65% 78% 11% 8% 4% 24% 27% 6% 4%

56 READ-ONLY CACHE (TEXTURE UNITS) SM SM Registers SMEM/L1$ Texture Units Skip LSU Cache loads Registers SMEM/L1$ Texture Units L2$ Global Memory (Framebuffer)

57 READ-ONLY PATH Annotate our pointer with const restrict global void gaussian_filter_7x7_v2(int w, int h, const uchar * restrict src, uchar *dst) The compiler generates LDG instructions: 1.018ms Kernel Time Speedup Original version 6.265ms Better memory accesses 3.930ms 1.59x Larger blocks 3.606ms 1.74x Shared memory 1.210ms 5.18x Read-only path 1.018ms 6.15x

58 INSTRUCTION THROUGHPUT Things to investigate next Improve memory efficiency Reduce computational intensity (separable filter)

59 BACK TO STALL REASONS The stall reasons for that kernel

60 STALL REASONS: TEXTURE Texture/read-only load execution dependencies a = ldg(&b[i]); // Read-only LOAD d = a + e; // ADD Texture unit is saturated and does not accept requests Does not necessarily mean that the Texture issue slot is saturated To do: Reduce memory accesses Improve coalescing and memory alignment

61 STALL REASONS: SYNCHRONIZATION Time spent waiting at barriers ( syncthreads/ threadfence) Load imbalance (some warps work, others wait) if( is_first_warp ) do_something_expensive(); syncthreads(); To do: Make sure you do not have unneeded barriers Reduce load imbalance Rethink the algorithm to reduce synchronization points Fewer threads per block

62 STALL REASONS: INSTRUCTION FETCH Instructions are stored in global memory The SM has to load instructions before executing them Instructions are cached on the SM (I$) To do: Reduce branching to avoid I$ misses Reduce unrolling and function inlining Merge kernels (short kernels are more impacted by I$ misses)

63 MORE IN OUR COMPANION CODE Kernel Time Speedup Original version 6.265ms Larger blocks 6.075ms 1.03x Better memory accesses 3.605ms 1.74x Fewer registers 1.949ms 3.21x Shared memory 1.211ms 5.17x Read-Only path 1.019ms 6.15x Separable filter 0.656ms 9.55x Process two pixels per thread (improve memory efficiency + add ILP) 0.511ms 12.26x Use 64-bit shared memory (remove bank conflicts) 0.499ms 12.56x Use float instead of int (increase instruction throughput) 0.434ms 14.44x Your next idea!!!

64 CONCLUSION

65 OPTIMIZATION METHOD Trace the Application Identify the Hot Spot and Profile it Identify the Performance Limiter Memory Bandwidth Instruction Throughput Latency Optimize the Code Iterate

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION WHAT YOU WILL LEARN An iterative method to optimize your GPU code Some common bottlenecks to look out for Performance diagnostics with NVIDIA Nsight