CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION Julien Demouth, NVIDIA Cliff Woolley, NVIDIA
WHAT WILL YOU LEARN? An iterative method to optimize your GPU code A way to conduct that method with NVIDIA Nsight EE https://github.com/jdemouth/nsight-gtc2014
A WORD ABOUT THE APPLICATION Grayscale Blur Edges
A WORD ABOUT THE APPLICATION Grayscale Conversion // r, g, b: Red, green, blue components of the pixel p foreach pixel p: p = 0.298839f*r + 0.586811f*g + 0.114350f*b;
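As a reference for the formula above, a minimal CPU sketch of the conversion — the function name and the one-component-per-array layout are illustrative choices, not taken from the slides:

```cpp
#include <cstddef>

// CPU sketch of the grayscale conversion above; r, g, b hold one
// component per pixel and out receives the luminance.
void grayscale(const unsigned char *r, const unsigned char *g,
               const unsigned char *b, unsigned char *out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = (unsigned char)(0.298839f * r[i] + 0.586811f * g[i]
                               + 0.114350f * b[i]);
}
```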
A WORD ABOUT THE APPLICATION Blur: 7x7 Gaussian Filter

Weights (the outer product of [1 2 3 4 3 2 1] with itself):
1  2  3  4  3  2  1
2  4  6  8  6  4  2
3  6  9 12  9  6  3
4  8 12 16 12  8  4
3  6  9 12  9  6  3
2  4  6  8  6  4  2
1  2  3  4  3  2  1

foreach pixel p: p = weighted sum of p and its 48 neighbors

Image from Wikipedia
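A CPU sketch of the blur, using the 7x7 weights above (which sum to 256) — the clamped border handling is an illustrative choice, not taken from the slides:

```cpp
#include <algorithm>

// CPU sketch of the 7x7 Gaussian blur. The kernel is the outer product of
// [1 2 3 4 3 2 1] with itself (weights sum to 256); out-of-image neighbors
// are clamped to the nearest edge pixel (illustrative border choice).
void gaussian_7x7(const unsigned char *src, unsigned char *dst, int w, int h)
{
    static const int wt[7] = {1, 2, 3, 4, 3, 2, 1};
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int sum = 0;
            for (int dy = -3; dy <= 3; ++dy)
                for (int dx = -3; dx <= 3; ++dx) {
                    int sx = std::min(std::max(x + dx, 0), w - 1);
                    int sy = std::min(std::max(y + dy, 0), h - 1);
                    sum += wt[dy + 3] * wt[dx + 3] * src[sy * w + sx];
                }
            dst[y * w + x] = (unsigned char)(sum / 256);  // normalize by 256
        }
}
```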
A WORD ABOUT THE APPLICATION Edges: 3x3 Sobel Filters

foreach pixel p:
  Gx = weighted sum of p and its 8 neighbors
  Gy = weighted sum of p and its 8 neighbors
  p = sqrt(Gx*Gx + Gy*Gy)

Weights for Gx:
-1  0  1
-2  0  2
-1  0  1

Weights for Gy:
 1  2  1
 0  0  0
-1 -2 -1
OPTIMIZATION METHOD Trace the Application Identify the Hot Spot and Profile it Identify the Performance Limiter Memory Bandwidth Instruction Throughput Latency Optimize the Code Iterate We focus on the Assess and Optimize steps of the APOD method. We do not talk about the Parallelize and Deploy steps http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#assess-parallelize-optimize-deploy
ENVIRONMENT NVIDIA Tesla K20c (GK110, SM3.5) without ECC Microsoft Windows 7 x64 Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 Identical results are obtained on Linux
BEFORE WE START Some slides are for background Performance Optimization: Programming Guidelines and GPU Architecture Details Behind Them, GTC 2013 http://on-demand.gputechconf.com/gtc/2013/video/s3466-performance-optimization-guidelines-gpu-architecture-details.mp4 http://on-demand.gputechconf.com/gtc/2013/presentations/s3466-programming-guidelines-gpu-architecture.pdf CUDA Best Practices Guide http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ Chameleon from http://www.vectorportal.com, Creative Commons
BEFORE WE START Instructions are executed by warps of threads It is a hardware concept There are 32 threads per warp
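For a 1D thread block, this hardware grouping can be written down directly — the helper names are illustrative:

```cpp
// Threads are grouped into warps of 32 consecutive threads (hardware concept).
// For a 1D block, these helpers mirror that grouping.
int warp_of(int tid) { return tid / 32; }  // which warp the thread belongs to
int lane_of(int tid) { return tid % 32; }  // thread's position inside its warp
```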
ITERATION 1
TRACE THE APPLICATION
TIMELINE
EXAMINE INDIVIDUAL KERNELS Launch
KERNEL OPTIMIZATION PRIORITIES Hotspot The Hotspot is gaussian_filter_7x7_v0 Kernel Time Speedup Original version 6.265ms
PROFILE THE HOTSPOT Select the Kernel Launch
IDENTIFY THE MAIN LIMITER Is it limited by the memory bandwidth? Is it limited by the instruction throughput? Is it limited by latency? Nsight EE helps us perform that analysis
LATENCY The kernel is limited by latency Hint: Memory related
LATENCY GPUs cover latencies by having a lot of work in flight: while a warp waits on a long-latency operation, the schedulers issue instructions from other warps. With many resident warps (warps 0-9 in the slide's timeline), the latency is fully covered; with too few, there are cycles on which no warp issues and the latency is exposed.
LATENCY ANALYSIS Launch
STALL REASONS Indicate why a warp cannot issue There are 6 stall reasons reported by Nsight EE Execution dependency Data request Texture Synchronization Instruction fetch Other
STALL REASONS Stall reasons: Why warps cannot issue
STALL REASONS: EXECUTION DEPENDENCY

Arithmetic dependency:
a = b + c; // ADD
d = a + e; // ADD (waits for a)

Memory dependency:
a = b[i];  // LOAD
d = a + e; // ADD (waits for a)

Memory accesses may influence execution dependencies: global accesses create longer dependencies than shared ones, and read-only/texture dependencies are counted in Texture. Instruction-level parallelism helps reduce dependencies:
a = b + c; // Independent ADDs
d = e + f;
ILP AND MEMORY ACCESSES

No ILP:
float a = 0.0f;
for( int i = 0 ; i < N ; ++i )
  a += logf(b[i]);

The loads and adds form a single dependency chain:
c = b[0]; a += logf(c); c = b[1]; a += logf(c); c = b[2]; a += logf(c); ...

2-way ILP (with loop unrolling):
float a, a0 = 0.0f, a1 = 0.0f;
for( int i = 0 ; i < N ; i += 2 ) {
  a0 += logf(b[i]);
  a1 += logf(b[i+1]);
}
a = a0 + a1;

The two chains are independent and can overlap:
c0 = b[0]; a0 += logf(c0); c0 = b[2]; a0 += logf(c0); ...
c1 = b[1]; a1 += logf(c1); c1 = b[3]; a1 += logf(c1); ...

#pragma unroll is useful to extract ILP
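The same unrolling can be sketched on the CPU; assuming N is even, the two accumulators form independent dependency chains:

```cpp
#include <cmath>

// 2-way unrolled sum of log(b[i]): the two accumulators are independent,
// so consecutive iterations no longer wait on each other (assumes N even).
float sum_log_unrolled(const float *b, int N)
{
    float a0 = 0.0f, a1 = 0.0f;
    for (int i = 0; i < N; i += 2) {
        a0 += std::log(b[i]);      // dependency chain 0
        a1 += std::log(b[i + 1]);  // dependency chain 1
    }
    return a0 + a1;  // combine the independent partial sums
}
```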
STALL REASONS: DATA REQUEST High percentage of data request: Many memory replays To do: Improve coalescing of memory accesses Improve alignment of memory accesses
MEMORY TRANSACTIONS A warp of threads (32 threads) issues memory requests. L1 transaction: 128B, aligned to 128B (0, 128, 256, ...). L2 transaction: 32B, aligned to 32B (0, 32, 64, 96, ...).
MEMORY TRANSACTIONS A warp issues 32x4B aligned and consecutive loads/stores: threads read different elements of the same 128B segment. 1x 128B L1 transaction per warp, or 4x 32B L2 transactions per warp. 1x L1 transaction: 128B needed / 128B transferred. 4x L2 transactions: 128B needed / 128B transferred.
MEMORY TRANSACTIONS Threads in a warp read/write 4B words with 128B between consecutive words: each thread reads the first 4B of a different 128B segment. 1x 128B L1 transaction per thread (32x per warp), or 1x 32B L2 transaction per thread (32x per warp). 32x L1 transactions: 128B needed / 32x 128B transferred. 32x L2 transactions: 128B needed / 32x 32B transferred.
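A small host-side model of this transaction counting — it counts the distinct 128B L1 lines touched by a warp of 32 4B loads (the function name and the simplified line model are illustrative):

```cpp
#include <cstdint>
#include <set>

// Counts the distinct 128B L1 cache lines touched when each of the 32
// threads of a warp loads 4 bytes at the given byte addresses
// (a simplified model of the transaction counting described above).
int l1_transactions(const std::uint64_t addr[32])
{
    std::set<std::uint64_t> lines;
    for (int t = 0; t < 32; ++t)
        lines.insert(addr[t] / 128);  // 128B-aligned line index
    return (int)lines.size();
}
```

With consecutive 4B addresses the warp touches a single line; with a 128B stride every thread touches its own line.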
REPLAYS A warp reads from addresses spanning 3 lines of 128B: threads 0-7 and 24-31 hit one line, threads 8-15 a second, threads 16-23 a third. The instruction is issued, then re-issued twice (1st and 2nd replay): 1 instruction executed and 2 replays = 1 request and 3 transactions.
REPLAYS With replays, requests take more time and use more resources: more instructions issued (extra work for the SM), extra memory traffic (one transfer per transaction), and increased execution time (extra latency until the last transaction completes).
STALL REASONS: DATA REQUEST Data request is also influenced by shared memory replays See CUDA Programming Guide, Sections 5.3.2 and G.5.3 (Kepler) Shared memory bank conflicts Data request is also influenced by local memory replays See CUDA Programming Guide, Section 5.3.2 Local memory and register spilling
COLLECT METRICS AND EVENTS Collect the global load/store transactions-per-request metrics
TRANSACTIONS PER REQUEST Transactions per Request: 4.20 (Load) / 4.00 (Store) Too many memory transactions (too much pressure on LSU)
TRANSACTIONS PER REQUEST Our blocks are 8x8: threads are numbered x-fastest, so warp 0 spans four 8-thread rows of the block, i.e. four different rows of the image, and its 32 accesses touch several 128B segments. We should use blocks of size 32x2: each warp then reads 32 consecutive pixels of a single image row.
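The effect of the block shape can be modeled on the host: for a row-major 1-byte-per-pixel image, count the 128B L1 lines touched by warp 0 of a block (the pitch value and the simplified model are illustrative):

```cpp
#include <cstdint>
#include <set>

// Models L1 transactions-per-request for warp 0 of a thread block over a
// row-major, 1-byte-per-pixel image. Threads are numbered x-fastest, as in
// CUDA, so an 8x8 block gives warp 0 four different image rows, while a
// 32x2 block gives it a single contiguous row.
int warp0_transactions(int block_w, std::uint64_t pitch_bytes)
{
    std::set<std::uint64_t> lines;
    for (int t = 0; t < 32; ++t) {
        int x = t % block_w, y = t / block_w;
        lines.insert((y * pitch_bytes + x) / 128);  // 128B L1 line index
    }
    return (int)lines.size();
}
```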
IMPROVED MEMORY ACCESSES Improved memory accesses: Blocks of size 32x2 It runs faster: 3.930ms Kernel Time Speedup Original version 6.265ms Better memory accesses 3.930ms 1.59x
ITERATION 2
TRACE THE APPLICATION Hotspot The hotspot is still gaussian_filter_7x7_v0 Kernel Time Speedup Original version 6.265ms Better memory accesses 3.930ms 1.59x We use the same block size for the Sobel filter kernel. That's why it also improves (2nd row of the Nsight table).
LATENCY The kernel is limited by latency
STALL REASONS High execution dependency It may not be obvious what to change
OCCUPANCY Nsight also warns us about Achieved occupancy
LATENCY: LACK OF OCCUPANCY Not enough active warps: with only a few resident warps (warps 0-3 in the slide's timeline), the schedulers cannot find eligible warps at every cycle and no warp issues.
OCCUPANCY We are limited by the size of the blocks. Let's change our block size to 32x4
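Why 32x4 helps can be seen with a back-of-the-envelope occupancy computation, considering only the Kepler limits of 2048 threads and 16 blocks per SM (registers and shared memory usage are ignored here):

```cpp
#include <algorithm>

// Occupancy as limited by the block-count cap alone: an SM35 SM runs at
// most 2048 threads and 16 blocks, so small blocks cap the resident warps.
// Register and shared-memory limits are deliberately ignored in this sketch.
double occupancy(int threads_per_block)
{
    const int max_threads = 2048, max_blocks = 16;
    int blocks = std::min(max_blocks, max_threads / threads_per_block);
    return (double)(blocks * threads_per_block) / max_threads;
}
```

A 32x2 block (64 threads) caps the SM at 16 x 64 = 1024 threads, i.e. 50% occupancy; a 32x4 block (128 threads) allows the full 2048.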
OCCUPANCY Blocks of size 32x4 It runs slightly faster: 3.606ms Kernel Time Speedup Original version 6.265ms Better memory accesses 3.930ms 1.59x Larger blocks 3.606ms 1.74x
ITERATION 3
TRACE THE APPLICATION Hotspot The hotspot is still gaussian_filter_7x7_v0 Kernel Time Speedup Original version 6.265ms Better memory accesses 3.930ms 1.59x Larger blocks 3.606ms 1.74x We use the same block size for the Sobel filter kernel. That's why it also improves (2nd row of the Nsight table).
LATENCY The kernel is limited by latency Highest memory pressure is on the L2 cache
LATENCY High execution dependency (+ L2 cache pressure) Are we loading the same data several times from L2?
L2 CACHE HIT RATE The L2 cache hit rate is high: 99% To do: Reduce the amount of memory transferred Move the data closer to the SM (actually, into the SM: shared memory)
SHARED MEMORY Adjacent pixels have neighbors in common We should use shared memory to store those common pixels: __shared__ unsigned char smem_pixels[10][64];
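The data reuse can be sketched on the host by staging a tile plus its 3-pixel halo once, so that every 7x7 neighborhood is then read from the staged copy; on the GPU the staging buffer is the shared-memory array above (the tile layout and clamped borders here are illustrative choices):

```cpp
#include <algorithm>

// Stages a (bh+6) x (bw+6) tile around block origin (bx, by) so that every
// 7x7 neighborhood needed by a bh x bw block is read from global memory only
// once. On the GPU this buffer would live in shared memory; out-of-image
// pixels are clamped to the nearest edge (illustrative border choice).
void stage_tile(const unsigned char *src, int w, int h,
                int bx, int by, int bw, int bh, unsigned char *tile)
{
    for (int ty = 0; ty < bh + 6; ++ty)
        for (int tx = 0; tx < bw + 6; ++tx) {
            int sx = std::min(std::max(bx + tx - 3, 0), w - 1);
            int sy = std::min(std::max(by + ty - 3, 0), h - 1);
            tile[ty * (bw + 6) + tx] = src[sy * w + sx];
        }
}
```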
SHARED MEMORY Shared memory: 1.210ms Kernel Time Speedup Original version 6.265ms Better memory accesses 3.930ms 1.59x Larger blocks 3.606ms 1.74x Shared memory 1.210ms 5.18x
ITERATION 4
TRACE THE APPLICATION Hotspot The hotspot is still gaussian_filter_7x7_v1 Kernel Time Speedup Original version 6.265ms Better memory accesses 3.930ms 1.59x Larger blocks 3.606ms 1.74x Shared memory 1.210ms 5.18x
COMPUTE AND MEMORY BOUND The highest pressure is on the Load-Store Unit
INSTRUCTION THROUGHPUT Each SM has 4 schedulers (Kepler) that issue instructions to the pipes. A scheduler issues up to 2 instructions/cycle; the sustainable peak is 7 instructions/cycle per SM (not 4x2 = 8). A scheduler issues instructions from a single warp, and cannot issue to a pipe if its issue slot is full.
INSTRUCTION THROUGHPUT Three cases shown in the slide's diagrams: schedulers saturated (utilization: 90%), a pipe saturated while the schedulers are not (utilization: 64%), and schedulers and pipe saturated together (utilization: 92%). Each diagram also shows the utilization of the Load Store, Texture, Control Flow and ALU pipes.
READ-ONLY CACHE (TEXTURE UNITS) The read-only data path goes through the texture units: loads skip the LSU and are cached in the texture units, which fetch from the L2$ and global memory (framebuffer).
READ-ONLY PATH Annotate our pointer with const __restrict__:
__global__ void gaussian_filter_7x7_v2(int w, int h, const uchar *__restrict__ src, uchar *dst)
The compiler generates LDG instructions: 1.018ms Kernel Time Speedup Original version 6.265ms Better memory accesses 3.930ms 1.59x Larger blocks 3.606ms 1.74x Shared memory 1.210ms 5.18x Read-only path 1.018ms 6.15x
INSTRUCTION THROUGHPUT Things to investigate next Improve memory efficiency Reduce computational intensity (separable filter)
BACK TO STALL REASONS The stall reasons for that kernel
STALL REASONS: TEXTURE Texture/read-only load execution dependencies:
a = __ldg(&b[i]); // Read-only LOAD
d = a + e;        // ADD
The texture unit is saturated and does not accept requests. This does not necessarily mean that the Texture issue slot is saturated. To do: Reduce memory accesses Improve coalescing and memory alignment
STALL REASONS: SYNCHRONIZATION Time spent waiting at barriers (__syncthreads()/__threadfence()) Load imbalance (some warps work, others wait):
if( is_first_warp )
  do_something_expensive();
__syncthreads();
To do: Make sure you do not have unneeded barriers Reduce load imbalance Rethink the algorithm to reduce synchronization points Fewer threads per block
STALL REASONS: INSTRUCTION FETCH Instructions are stored in global memory The SM has to load instructions before executing them Instructions are cached on the SM (I$) To do: Reduce branching to avoid I$ misses Reduce unrolling and function inlining Merge kernels (short kernels are more impacted by I$ misses)
MORE IN OUR COMPANION CODE
Kernel / Time / Speedup
Original version / 6.265ms / -
Larger blocks / 6.075ms / 1.03x
Better memory accesses / 3.605ms / 1.74x
Fewer registers / 1.949ms / 3.21x
Shared memory / 1.211ms / 5.17x
Read-Only path / 1.019ms / 6.15x
Separable filter / 0.656ms / 9.55x
Process two pixels per thread (improve memory efficiency + add ILP) / 0.511ms / 12.26x
Use 64-bit shared memory (remove bank conflicts) / 0.499ms / 12.56x
Use float instead of int (increase instruction throughput) / 0.434ms / 14.44x
Your next idea!!!
https://github.com/jdemouth/nsight-gtc2014
CONCLUSION
OPTIMIZATION METHOD Trace the Application Identify the Hot Spot and Profile it Identify the Performance Limiter Memory Bandwidth Instruction Throughput Latency Optimize the Code Iterate https://github.com/jdemouth/nsight-gtc2014