
CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION
Julien Demouth, NVIDIA
Cliff Woolley, NVIDIA

WHAT WILL YOU LEARN?
- An iterative method to optimize your GPU code
- A way to conduct that method with NVIDIA Nsight EE
https://github.com/jdemouth/nsight-gtc2014

A WORD ABOUT THE APPLICATION
The image pipeline has three stages: Grayscale -> Blur -> Edges

A WORD ABOUT THE APPLICATION
Grayscale Conversion
    // r, g, b: red, green, blue components of the pixel p
    foreach pixel p:
        p = 0.298839f*r + 0.586811f*g + 0.114350f*b;
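As a hedged aside, the pseudocode above maps onto a one-thread-per-pixel CUDA kernel along these lines (the kernel name, the uchar3 channel layout, and the rounding are illustrative assumptions, not the repository's actual code):

    // Sketch only: one thread per pixel; assumes packed 8-bit RGB in a uchar3.
    __global__ void grayscale_kernel(int w, int h, const uchar3 *rgb, unsigned char *gray)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h)
            return;
        uchar3 p = rgb[y * w + x];  // p.x = r, p.y = g, p.z = b (assumed layout)
        // Weighted sum from the slide; +0.5f rounds to nearest.
        gray[y * w + x] = (unsigned char)(0.298839f * p.x + 0.586811f * p.y + 0.114350f * p.z + 0.5f);
    }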

A WORD ABOUT THE APPLICATION
Blur: 7x7 Gaussian Filter
Weights (the outer product of [1 2 3 4 3 2 1] with itself):
     1  2  3  4  3  2  1
     2  4  6  8  6  4  2
     3  6  9 12  9  6  3
     4  8 12 16 12  8  4
     3  6  9 12  9  6  3
     2  4  6  8  6  4  2
     1  2  3  4  3  2  1
    foreach pixel p:
        p = weighted sum of p and its 48 neighbors
Image from Wikipedia
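For reference, the unoptimized hotspot examined throughout this talk can be pictured like this: one thread per pixel, 49 global loads each. A minimal sketch, not the actual gaussian_filter_7x7_v0 source (the constant-memory weight array, the border clamping, and the normalization are assumptions):

    // Sketch of a naive 7x7 convolution: 49 global loads per thread.
    __constant__ int gauss_7x7[49];  // the weights above, row-major; they sum to 256

    __global__ void gaussian_7x7_naive(int w, int h, const unsigned char *src, unsigned char *dst)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h)
            return;
        int sum = 0;
        for (int j = -3; j <= 3; ++j)
            for (int i = -3; i <= 3; ++i) {
                int xx = min(max(x + i, 0), w - 1);  // clamp at the image borders
                int yy = min(max(y + j, 0), h - 1);
                sum += gauss_7x7[(j + 3) * 7 + (i + 3)] * src[yy * w + xx];
            }
        dst[y * w + x] = (unsigned char)(sum / 256);
    }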

A WORD ABOUT THE APPLICATION
Edges: 3x3 Sobel Filters
    foreach pixel p:
        Gx = weighted sum of p and its 8 neighbors
        Gy = weighted sum of p and its 8 neighbors
        p = sqrt(Gx^2 + Gy^2)
Weights for Gx:
    -1  0  1
    -2  0  2
    -1  0  1
Weights for Gy:
     1  2  1
     0  0  0
    -1 -2 -1
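The edge stage follows the same one-thread-per-pixel pattern; a hedged sketch with the taps written out (illustrative, not the repository's kernel; the 1-pixel border is simply skipped here):

    __global__ void sobel_3x3(int w, int h, const unsigned char *src, unsigned char *dst)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1)
            return;  // skip the border for brevity
        // Load the 3x3 neighborhood once, then apply both masks.
        int p00 = src[(y-1)*w + x-1], p01 = src[(y-1)*w + x], p02 = src[(y-1)*w + x+1];
        int p10 = src[ y   *w + x-1],                         p12 = src[ y   *w + x+1];
        int p20 = src[(y+1)*w + x-1], p21 = src[(y+1)*w + x], p22 = src[(y+1)*w + x+1];
        int gx = -p00 + p02 - 2*p10 + 2*p12 - p20 + p22;
        int gy =  p00 + 2*p01 + p02 - p20 - 2*p21 - p22;
        float mag = sqrtf((float)(gx*gx + gy*gy));
        dst[y*w + x] = (unsigned char)fminf(mag, 255.0f);
    }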

OPTIMIZATION METHOD
- Trace the Application
- Identify the Hot Spot and Profile it
- Identify the Performance Limiter: Memory Bandwidth, Instruction Throughput, or Latency
- Optimize the Code
- Iterate
We focus on the Assess and Optimize steps of the APOD method; we do not cover the Parallelize and Deploy steps.
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#assess-parallelize-optimize-deploy

ENVIRONMENT
- NVIDIA Tesla K20c (GK110, SM 3.5) without ECC
- Microsoft Windows 7 x64
- Microsoft Visual Studio 2012
- NVIDIA CUDA 6.0
Identical results are obtained on Linux.

BEFORE WE START
Some slides are for background:
- "Performance Optimization: Programming Guidelines and GPU Architecture Details Behind Them", GTC 2013
  http://on-demand.gputechconf.com/gtc/2013/video/s3466-performance-optimization-guidelines-gpu-architecture-details.mp4
  http://on-demand.gputechconf.com/gtc/2013/presentations/s3466-programming-guidelines-gpu-architecture.pdf
- CUDA Best Practices Guide: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
Chameleon image from http://www.vectorportal.com, Creative Commons.

BEFORE WE START
Instructions are executed by warps of threads. The warp is a hardware concept: there are 32 threads per warp.

ITERATION 1

TRACE THE APPLICATION

TIMELINE

EXAMINE INDIVIDUAL KERNELS Launch

KERNEL OPTIMIZATION PRIORITIES
The hotspot is gaussian_filter_7x7_v0.

Kernel                   Time      Speedup
Original version         6.265 ms  --

PROFILE THE HOTSPOT Select the Kernel Launch

IDENTIFY THE MAIN LIMITER
- Is it limited by the memory bandwidth?
- Is it limited by the instruction throughput?
- Is it limited by latency?
Nsight EE helps us perform that analysis.

LATENCY
The kernel is limited by latency. Hint: it is memory related.

LATENCY
GPUs cover latency by keeping a lot of work in flight.
[Diagram: warps 0-9 alternate between issuing and waiting (latency); with enough warps in flight the latency is fully covered, with too few there are cycles where no warp issues and the latency is exposed.]

LATENCY ANALYSIS Launch

STALL REASONS
Stall reasons indicate why a warp cannot issue. Nsight EE reports six of them:
- Execution dependency
- Data request
- Texture
- Synchronization
- Instruction fetch
- Other

STALL REASONS Stall reasons: Why warps cannot issue

STALL REASONS: EXECUTION DEPENDENCY
An instruction cannot issue because one of its inputs is not ready yet:
    a = b + c;  // ADD
    d = a + e;  // ADD (depends on the previous ADD)

    a = b[i];   // LOAD
    d = a + e;  // ADD (depends on the LOAD)
Memory accesses may influence execution dependencies:
- Global accesses create longer dependencies than shared ones
- Read-only/texture dependencies are counted in Texture
Instruction-level parallelism helps reduce dependencies:
    a = b + c;  // Independent ADDs
    d = e + f;

ILP AND MEMORY ACCESSES
No ILP:
    float a = 0.0f;
    for (int i = 0; i < N; ++i)
        a += logf(b[i]);
One dependent instruction stream: c = b[0]; a += logf(c); c = b[1]; a += logf(c); c = b[2]; a += logf(c); ...

2-way ILP (with loop unrolling):
    float a, a0 = 0.0f, a1 = 0.0f;
    for (int i = 0; i < N; i += 2) {
        a0 += logf(b[i]);
        a1 += logf(b[i+1]);
    }
    a = a0 + a1;
Two independent streams: c0 = b[0]; a0 += logf(c0); c0 = b[2]; a0 += logf(c0); ... interleaved with c1 = b[1]; a1 += logf(c1); c1 = b[3]; a1 += logf(c1); ...; finally a = a0 + a1.

#pragma unroll is useful to extract ILP.

STALL REASONS: DATA REQUEST
A high percentage of Data Request stalls means many memory replays. To do:
- Improve coalescing of memory accesses
- Improve alignment of memory accesses

MEMORY TRANSACTIONS
A memory request from a warp of threads (32 threads) is served by:
- L1 transactions: 128B, aligned to 128B (0, 128, 256, ...)
- L2 transactions: 32B, aligned to 32B (0, 32, 64, 96, ...)

MEMORY TRANSACTIONS
A warp issues 32x 4B aligned and consecutive loads/stores, i.e., the threads read different elements of the same 128B segment:
- 1x 128B L1 transaction per warp: 128B needed / 128B transferred
- 4x 32B L2 transactions per warp: 128B needed / 128B transferred

MEMORY TRANSACTIONS
Threads in a warp read/write 4B words with 128B between consecutive words, i.e., each thread reads the first 4B of a different 128B segment:
- 32x 128B L1 transactions (one per thread): 128B needed / 32x 128B transferred
- 32x 32B L2 transactions (one per thread): 128B needed / 32x 32B transferred
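The two patterns above are easy to reproduce; a small hedged sketch contrasting them (kernel names are illustrative, not from the talk's code):

    // Coalesced: consecutive threads touch consecutive 4B words
    // -> 1x 128B L1 transaction per warp.
    __global__ void copy_coalesced(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Strided: consecutive threads are 32 floats (128B) apart
    // -> 32 transactions per warp; most transferred bytes are wasted.
    __global__ void copy_strided(const float *in, float *out)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
        out[i] = in[i];
    }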

REPLAYS
A warp reads from addresses spanning 3 lines of 128B: threads 0-7 and 24-31 fall into one line, threads 8-15 into a second, threads 16-23 into a third.
1 instruction executed and 2 replays = 1 request and 3 transactions.
[Diagram: the instruction is issued once, then re-issued for the 1st and 2nd replays, each (re-)issue serving one group of threads.]

REPLAYS
With replays, requests take more time and use more resources:
- More instructions issued: extra work for the SM
- More memory traffic: extra latency while the data for each replay is transferred
- Increased execution time
[Diagram: execution timeline showing instructions 0-2 being issued, the transactions of their replays, and their delayed completions.]

STALL REASONS: DATA REQUEST
Data Request is also influenced by:
- Shared memory replays (shared memory bank conflicts): see CUDA Programming Guide, Sections 5.3.2 and G.5.3 (Kepler)
- Local memory replays (local memory and register spilling): see CUDA Programming Guide, Section 5.3.2

COLLECT METRICS AND EVENTS
Collect the "Global load/store transactions per request" metrics.

TRANSACTIONS PER REQUEST
Transactions per request: 4.20 (load) / 4.00 (store). That is too many memory transactions, putting too much pressure on the LSU.

TRANSACTIONS PER REQUEST
Our blocks are 8x8, so each warp covers four 8-pixel rows of the image: every request touches four separate memory segments, which matches the measured ~4 transactions per request.
[Diagram: threads 0-63 numbered row by row in an 8x8 block vs. in a 32x2 block.]
We should use blocks of size 32x2 instead: each warp then accesses 32 consecutive pixels of a single row (see the launch sketch below).
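The fix is purely a launch-configuration change; the per-thread indexing stays the same. A minimal sketch (the host-side variable names are assumptions):

    // Per-thread indexing is unchanged:
    //   int x = blockIdx.x * blockDim.x + threadIdx.x;
    //   int y = blockIdx.y * blockDim.y + threadIdx.y;
    // 8x8 blocks: a warp spans 4 image rows -> ~4 transactions per request.
    // 32x2 blocks: a warp reads 32 consecutive pixels of one row -> 1 segment.
    dim3 block(32, 2);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    gaussian_filter_7x7_v0<<<grid, block>>>(w, h, src, dst);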

IMPROVED MEMORY ACCESSES
With blocks of size 32x2, the kernel runs faster: 3.930 ms.

Kernel                   Time      Speedup
Original version         6.265 ms  --
Better memory accesses   3.930 ms  1.59x

ITERATION 2

TRACE THE APPLICATION
The hotspot is still gaussian_filter_7x7_v0.

Kernel                   Time      Speedup
Original version         6.265 ms  --
Better memory accesses   3.930 ms  1.59x

We use the same block size for the Sobel filter kernel; that is why it also improves (2nd row of the Nsight table).

LATENCY The kernel is limited by latency

STALL REASONS
High execution dependency. It may not be obvious what to change.

OCCUPANCY Nsight also warns us about Achieved occupancy

LATENCY: LACK OF OCCUPANCY
Not enough active warps: the schedulers cannot find eligible warps at every cycle.
[Diagram: with only warps 0-3 in flight, there are cycles where no warp issues.]

OCCUPANCY
We are limited by the size of the blocks. Let's change our block size to 32x4 (see the sketch below).
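The arithmetic behind the warning, assuming Kepler's limits of 16 resident blocks and 2048 resident threads per SM: 32x2 blocks hold 64 threads, so 16 blocks cap an SM at 1024 threads (50% theoretical occupancy), while 32x4 blocks (128 threads) can reach the full 2048. A minimal sketch of the change:

    dim3 block(32, 4);  // 128 threads/block: 16 blocks x 128 = 2048 resident threads/SM
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    gaussian_filter_7x7_v0<<<grid, block>>>(w, h, src, dst);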

OCCUPANCY
With blocks of size 32x4, the kernel runs slightly faster: 3.606 ms.

Kernel                   Time      Speedup
Original version         6.265 ms  --
Better memory accesses   3.930 ms  1.59x
Larger blocks            3.606 ms  1.74x

ITERATION 3

TRACE THE APPLICATION
The hotspot is still gaussian_filter_7x7_v0.

Kernel                   Time      Speedup
Original version         6.265 ms  --
Better memory accesses   3.930 ms  1.59x
Larger blocks            3.606 ms  1.74x

We use the same block size for the Sobel filter kernel; that is why it also improves (2nd row of the Nsight table).

LATENCY
The kernel is limited by latency. The highest memory pressure is on the L2 cache.

LATENCY
High execution dependency (+ L2 cache pressure). Are we loading the same data several times from L2?

L2 CACHE HIT RATE
The L2 cache hit rate is high: 99%. To do:
- Reduce the amount of memory transferred
- Move the data closer to the SM (into the SM's shared memory)

SHARED MEMORY
Adjacent pixels have neighbors in common; we should use shared memory to store those common pixels:
    __shared__ unsigned char smem_pixels[10][64];
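How the tile might be staged, as a hedged sketch (the companion code may stage it differently; the 10x64 shape is read here as 4 block rows plus a 3-pixel halo on each side, with the 38 used columns padded to 64, reusing the gauss_7x7 weights from the earlier sketch):

    __global__ void gaussian_7x7_smem(int w, int h, const unsigned char *src, unsigned char *dst)
    {
        __shared__ unsigned char smem_pixels[10][64];
        int x0 = blockIdx.x * 32, y0 = blockIdx.y * 4;
        // Cooperative load: the 32x4 threads fill the 10x38 tile (38 = 32 + 2*3 halo).
        for (int j = threadIdx.y; j < 10; j += 4)
            for (int i = threadIdx.x; i < 38; i += 32) {
                int xx = min(max(x0 + i - 3, 0), w - 1);  // clamp at the borders
                int yy = min(max(y0 + j - 3, 0), h - 1);
                smem_pixels[j][i] = src[yy * w + xx];
            }
        __syncthreads();  // the whole tile must be in place before anyone reads it
        int x = x0 + threadIdx.x, y = y0 + threadIdx.y;
        if (x >= w || y >= h)
            return;
        int sum = 0;
        for (int j = 0; j < 7; ++j)      // all 49 taps now hit shared memory
            for (int i = 0; i < 7; ++i)
                sum += gauss_7x7[j * 7 + i] * smem_pixels[threadIdx.y + j][threadIdx.x + i];
        dst[y * w + x] = (unsigned char)(sum / 256);
    }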

SHARED MEMORY
With shared memory: 1.210 ms.

Kernel                   Time      Speedup
Original version         6.265 ms  --
Better memory accesses   3.930 ms  1.59x
Larger blocks            3.606 ms  1.74x
Shared memory            1.210 ms  5.18x

ITERATION 4

TRACE THE APPLICATION
The hotspot is still gaussian_filter_7x7_v1.

Kernel                   Time      Speedup
Original version         6.265 ms  --
Better memory accesses   3.930 ms  1.59x
Larger blocks            3.606 ms  1.74x
Shared memory            1.210 ms  5.18x

COMPUTE AND MEMORY BOUND The highest pressure is on the Load-Store Unit

INSTRUCTION THROUGHPUT
[Diagram: an SM with its registers, 4 schedulers feeding their pipes, and SMEM/L1$.]
- Each SM has 4 schedulers (Kepler)
- Schedulers issue instructions to pipes
- A scheduler issues up to 2 instructions/cycle; the sustainable peak is 7 instructions/cycle per SM (not 4x2 = 8)
- A scheduler issues instructions from a single warp
- It cannot issue to a pipe if that pipe's issue slot is full

INSTRUCTION THROUGHPUT
[Diagram: three cases, each showing the 4 schedulers and the Load Store, Texture, Control Flow, and ALU pipes with their utilization bars.]
- Schedulers saturated: utilization 90%
- Pipe saturated: utilization 64%
- Schedulers and pipe saturated: utilization 92%

READ-ONLY CACHE (TEXTURE UNITS)
Loads through the texture units skip the LSU and are cached.
[Diagram: data flows from global memory (framebuffer) through the L2$ to the texture units, alongside each SM's registers and SMEM/L1$.]

READ-ONLY PATH
Annotate our pointer with const __restrict__:
    __global__ void gaussian_filter_7x7_v2(int w, int h, const uchar * __restrict__ src, uchar *dst)
The compiler generates LDG instructions: 1.018 ms.

Kernel                   Time      Speedup
Original version         6.265 ms  --
Better memory accesses   3.930 ms  1.59x
Larger blocks            3.606 ms  1.74x
Shared memory            1.210 ms  5.18x
Read-only path           1.018 ms  6.15x
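On SM 3.5 the same path can also be requested explicitly with the __ldg() intrinsic when adding qualifiers is inconvenient; a minimal sketch (the kernel name is illustrative):

    // Explicit read-only load; equivalent in spirit to const __restrict__ above.
    __global__ void copy_via_ldg(int n, const unsigned char *src, unsigned char *dst)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            dst[i] = __ldg(&src[i]);
    }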

INSTRUCTION THROUGHPUT
Things to investigate next:
- Improve memory efficiency
- Reduce computational intensity (separable filter; see the sketch below)
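Because the 7x7 weights are the outer product of [1 2 3 4 3 2 1] with itself, the blur can run as a horizontal pass followed by a vertical pass: 7 + 7 taps per pixel instead of 49. A hedged sketch, not the companion code's version (the 16-bit intermediate buffer is an assumption; un-normalized row sums fit since 255 x 16 = 4080):

    __constant__ int gauss_1d[7] = {1, 2, 3, 4, 3, 2, 1};  // sums to 16

    // Pass 1: horizontal; tmp holds un-normalized row sums.
    __global__ void blur_rows(int w, int h, const unsigned char *src, unsigned short *tmp)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h)
            return;
        int sum = 0;
        for (int i = -3; i <= 3; ++i)
            sum += gauss_1d[i + 3] * src[y * w + min(max(x + i, 0), w - 1)];
        tmp[y * w + x] = (unsigned short)sum;
    }

    // Pass 2: vertical; normalize by 16 x 16 = 256.
    __global__ void blur_cols(int w, int h, const unsigned short *tmp, unsigned char *dst)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h)
            return;
        int sum = 0;
        for (int j = -3; j <= 3; ++j)
            sum += gauss_1d[j + 3] * tmp[min(max(y + j, 0), h - 1) * w + x];
        dst[y * w + x] = (unsigned char)(sum / 256);
    }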

BACK TO STALL REASONS The stall reasons for that kernel

STALL REASONS: TEXTURE
Texture/read-only loads create execution dependencies too:
    a = __ldg(&b[i]); // Read-only LOAD
    d = a + e;        // ADD
The texture unit is saturated and does not accept requests; this does not necessarily mean that the Texture issue slot is saturated.
To do:
- Reduce memory accesses
- Improve coalescing and memory alignment

STALL REASONS: SYNCHRONIZATION
Time spent waiting at barriers (__syncthreads / __threadfence), typically due to load imbalance (some warps work, others wait):
    if (is_first_warp)
        do_something_expensive();
    __syncthreads();
To do:
- Make sure you do not have unneeded barriers
- Reduce load imbalance
- Rethink the algorithm to reduce synchronization points
- Use fewer threads per block

STALL REASONS: INSTRUCTION FETCH
Instructions are stored in global memory; the SM has to load them before executing them, and they are cached on the SM (I$).
To do:
- Reduce branching to avoid I$ misses
- Reduce unrolling and function inlining
- Merge kernels (short kernels are more impacted by I$ misses)

MORE IN OUR COMPANION CODE

Kernel                                                                 Time      Speedup
Original version                                                       6.265 ms  --
Larger blocks                                                          6.075 ms  1.03x
Better memory accesses                                                 3.605 ms  1.74x
Fewer registers                                                        1.949 ms  3.21x
Shared memory                                                          1.211 ms  5.17x
Read-only path                                                         1.019 ms  6.15x
Separable filter                                                       0.656 ms  9.55x
Process two pixels per thread (improve memory efficiency + add ILP)    0.511 ms  12.26x
Use 64-bit shared memory (remove bank conflicts)                       0.499 ms  12.56x
Use float instead of int (increase instruction throughput)             0.434 ms  14.44x
Your next idea!!!

https://github.com/jdemouth/nsight-gtc2014

CONCLUSION

OPTIMIZATION METHOD
- Trace the Application
- Identify the Hot Spot and Profile it
- Identify the Performance Limiter: Memory Bandwidth, Instruction Throughput, or Latency
- Optimize the Code
- Iterate
https://github.com/jdemouth/nsight-gtc2014