
CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES Stephen Jones, GTC 2017

The art of doing more with less

RULE #1: DON'T TRY TOO HARD

[Chart: performance vs. time invested, an S-curve climbing toward peak performance. Annotations, in order along the curve: hire an intern; premature excitement; "wait, it's going slower??"; the trough of despair ("4 weeks and this is it?" -- most people give up here); the steep climb; the point of diminishing returns; "here be ninjas". Chasing peak performance itself is an unrealistic effort/reward trade.]

The lesson: don't waste the early flat time, get onto the steep part of the curve, and reduce the time it takes to get there.

PERFORMANCE CONSTRAINTS

[Pie chart] Memory: 75% | Compute intensity: 10% | Occupancy: 10% | Divergence: 3% | Instruction: 2%

The memory slice breaks down further into: CPU <> GPU transfer, coalescence, divergent access, cache inefficiency, and register spilling.

MEMORY ORDERS OF MAGNITUDE
Registers & shared memory (on the SM): ~20,000 GB/sec
L2 cache: ~2,000 GB/sec
GDRAM: ~300 GB/sec
CPU DRAM: ~150 GB/sec
PCIe bus: ~16 GB/sec

TALK BREAKDOWN
In no particular order:
1. Why Didn't I Think Of That?
2. CPU Memory to GPU Memory (the PCIe Bus)
3. GPU Memory to the SM
4. Registers & Shared Memory
5. Occupancy, Divergence & Latency
6. Weird Things You Never Thought Of (and probably shouldn't try)

WHERE TO BEGIN?

THE OBVIOUS
Start with the NVIDIA Visual Profiler.
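(For instance -- a typical invocation, not from the slides -- the command-line profiler gives the same data and is an easy first look, assuming your binary is ./myapp:)

nvprof ./myapp                     # summary of per-kernel and memcpy times
nvprof --print-gpu-trace ./myapp   # timeline of every launch and copy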

CPU <> GPU DATA MOVEMENT

PCIe ISSUES
Moving data over the PCIe bus: at ~16 GB/sec, it is the slowest link in the system.

PIN YOUR CPU MEMORY
The GPU's DMA controller copies data directly out of CPU memory, but it can only touch pinned (page-locked) pages -- an ordinary page could be swapped out mid-transfer.
So for pageable memory, the CPU first allocates and pins a staging page, then copies the data into it locally before the DMA can begin: an extra copy on every transfer.
Pinning your memory yourself avoids the staging copy:

cudaHostAlloc( &data, size, cudaHostAllocMapped );
cudaHostRegister( data, size, cudaHostRegisterDefault );

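(A minimal sketch of the difference in practice -- buffer names and sizes are made up; cudaHostAlloc/cudaFreeHost are the standard API:)

size_t bytes = 64 << 20;                       // hypothetical 64 MB buffer
float *d_buf;
cudaMalloc(&d_buf, bytes);

float *pageable = (float *)malloc(bytes);      // pageable: DMA needs a hidden staging copy
cudaMemcpy(d_buf, pageable, bytes, cudaMemcpyHostToDevice);

float *pinned;
cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);       // page-locked
cudaMemcpy(d_buf, pinned, bytes, cudaMemcpyHostToDevice);  // direct DMA, faster

free(pageable);
cudaFreeHost(pinned);
cudaFree(d_buf);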

REMEMBER: PCIe GOES BOTH WAYS

STREAMS & CONCURRENCY
Hiding the cost of data transfer: operations in a single stream are ordered, but the hardware can copy and compute at the same time.
Single stream: [Copy data to GPU] -> [Compute] -> [Copy data to Host], strictly in sequence.
Two streams: each stream still runs copy-up, work, copy-back in order, but stream 2's copies overlap stream 1's compute -- the same work finishes sooner.

STREAMS & CONCURRENCY
You can keep breaking the work into smaller chunks and saving time: compare 1 stream, 2 streams, 8 streams.
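(A sketch of the chunked overlap pattern -- the chunk count, kernel, and sizes are hypothetical, and h_in/h_out must be pinned for the async copies to actually overlap:)

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void overlapped(float *h_in, float *h_out, float *d_buf, int N)
{
    const int CHUNKS = 8;
    int chunk = N / CHUNKS;                    // assume N divides evenly
    cudaStream_t streams[CHUNKS];
    for (int i = 0; i < CHUNKS; i++) cudaStreamCreate(&streams[i]);

    for (int i = 0; i < CHUNKS; i++) {
        int off = i * chunk;
        cudaMemcpyAsync(d_buf + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        process<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(d_buf + off, chunk);
        cudaMemcpyAsync(h_out + off, d_buf + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    for (int i = 0; i < CHUNKS; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}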

SMALL PCIe TRANSFERS
PCIe is designed for large data transfers, but fine-grained copy/compute overlap prefers small transfers. So how small can we go?
[Chart: 1, 2, 8, and "too many" streams]

APPARENTLY NOT THAT SMALL

FROM GPU MEMORY TO GPU THREADS

FEEDING THE MACHINE
From GPU memory to the SMs.

USE THE PARALLEL ARCHITECTURE
The hardware is optimized to use all SIMT threads at once:
Threads run in groups of 32 (warps).
The L2 cache is sized to service sets of 32 requests at a time -- one cache line per warp.
High-speed GPU memory works best with linear access.

VECTORIZE MEMORY LOADS
Go multi-word as well as multi-thread. With plain int loads, threads T0-T31 together fill one cache line per fetch. With int2 loads, T0-T15 and T16-T31 each cover a cache line, filling two lines in a single fetch; with int4, groups of eight threads (T0-T7, T8-T15, T16-T23, T24-T31) fill four cache lines per fetch.

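(A sketch of the idea -- a hypothetical kernel; it assumes the element count is a multiple of 4 and the buffers are 16-byte aligned:)

__global__ void copyVec4(const int4 *in, int4 *out, int n4)
{
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < n4) {
        out[id] = in[id];   // one 16-byte load and store per thread
    }
}

// Reinterpreting an existing int buffer (alignment permitting):
// copyVec4<<<blocks, threads>>>(reinterpret_cast<const int4 *>(d_in),
//                               reinterpret_cast<int4 *>(d_out), n / 4);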

DO MULTIPLE LOADS PER THREAD
Multi-Thread, Multi-Word AND Multi-Iteration

One copy per thread (maximum overhead):

__global__ void copy(int2 *input, int2 *output, int max)
{
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < max) {
        output[id] = input[id];
    }
}

Multiple copies per thread (amortizes the overhead):

__global__ void copy(int2 *input, int2 *output, int max, int loadsPerThread)
{
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    for (int n = 0; n < loadsPerThread; n++) {
        if (id >= max) {
            break;
        }
        output[id] = input[id];
        id += blockDim.x * gridDim.x;
    }
}

MAXIMAL LAUNCHES ARE BEST
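(One way to size such a grid-stride launch -- a sketch, not from the slides; the blocks-per-SM factor is an assumption, the attribute query is standard CUDA:)

int device = 0, numSMs = 0;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
int threads = 256;                                   // hypothetical block size
int blocks  = numSMs * 8;                            // assumption: ~8 blocks per SM
int loadsPerThread = (max + threads * blocks - 1) / (threads * blocks);
copy<<<blocks, threads>>>(input, output, max, loadsPerThread);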

COALESCED MEMORY ACCESS
It's not good enough just to use all SIMT threads.
Coalesced: sequential memory accesses are adjacent (1 2 3 4).
Uncoalesced: sequential memory accesses are unassociated (1 2 4 3).

SIMT PENALTIES WHEN NOT COALESCED
x = data[threadIdx.x]   ->  a single 32-wide operation
x = data[rand()]        ->  32 one-wide operations

SCATTER & GATHER
Gathering: reading randomly, writing sequentially.
Scattering: reading sequentially, writing randomly.

AVOID SCATTER/GATHER IF YOU CAN

SORTING MIGHT BE AN OPTION
If reading non-sequential data is expensive, is it worth sorting it first to make it sequential?
Gathering (slow): random reads. Sort first, and the gather becomes a coalesced read (fast).

SORTING MIGHT BE AN OPTION
Even if you're only going to read it twice, then yes!

PRE-SORTING TURNS OUT TO BE GOOD
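(For example, a gather's index array can be pre-sorted on the device with Thrust -- a sketch; the function and variable names are hypothetical, thrust::sort_by_key is the real API:)

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

void presortIndices(thrust::device_vector<int> &readIdx)
{
    // Pair each random read index with its original position, then sort by
    // read index: the gather becomes a sequential sweep, and the saved
    // positions let you scatter results back into the original order.
    thrust::device_vector<int> origPos(readIdx.size());
    thrust::sequence(origPos.begin(), origPos.end());
    thrust::sort_by_key(readIdx.begin(), readIdx.end(), origPos.begin());
}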

DATA LAYOUT: AOS vs. SOA
Sometimes you can't just sort your data.

Array-of-Structures -- single-thread code prefers this, for cache efficiency:

#define NPTS 1024 * 1024

struct Coefficients_AOS {
    double u[3];
    double x[3][3];
    double p;
    double rho;
    double eta;
};

Coefficients_AOS gridData[NPTS];

Structure-of-Arrays -- SIMT code prefers this, for execution & memory efficiency:

struct Coefficients_SOA {
    double u[3][NPTS];
    double x[3][3][NPTS];
    double p[NPTS];
    double rho[NPTS];
    double eta[NPTS];
};

Coefficients_SOA gridData;

DATA LAYOUT: AOS vs. SOA
Conceptually, each point's data is laid out as: u0 u1 u2 x00 x01 x02 x10 x11 x12 x20 x21 x22 p rho eta. In the AOS definition above, that whole record repeats once per point.

AOS: STRIDED ARRAY ACCESS
The GPU reads data one element at a time, but in parallel across the 32 threads of a warp. With an array of structures, consecutive threads read addresses a whole structure apart -- strided, not coalesced:

double u0 = gridData[threadIdx.x].u[0];

SOA: COALESCED BUT COMPLEX
With a structure of arrays, the same read is coalesced -- the 32 threads of a warp touch 32 adjacent elements of one array -- at the cost of a more awkward layout:

double u0 = gridData.u[0][threadIdx.x];

BLOCK-WIDE LOAD VIA SHARED MEMORY
Read the data linearly as bytes, and use shared memory to convert it to structs:
1. The block cooperatively copies the data from device memory to shared memory with coalesced loads.
2. The threads that own the data then grab their structs from shared memory.
A sketch of the two phases follows.
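(A hypothetical kernel built on the Coefficients_AOS struct above; it assumes blockDim.x threads each own one struct, and a dynamic shared allocation of blockDim.x * sizeof(Coefficients_AOS) bytes:)

__global__ void blockWideLoad(const Coefficients_AOS *in)
{
    extern __shared__ double sh[];   // blockDim.x structs' worth of doubles
    const int wordsPerStruct = sizeof(Coefficients_AOS) / sizeof(double);
    const int totalWords = blockDim.x * wordsPerStruct;
    const double *src = (const double *)&in[blockIdx.x * blockDim.x];

    // Phase 1: the whole block copies linearly -- fully coalesced.
    for (int i = threadIdx.x; i < totalWords; i += blockDim.x)
        sh[i] = src[i];
    __syncthreads();

    // Phase 2: each thread reads its own struct from fast shared memory.
    Coefficients_AOS c = ((const Coefficients_AOS *)sh)[threadIdx.x];
    // ... compute with c ...
}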

CLEVER AOS/SOA TRICKS
Helps for any data size.

HANDY LIBRARY TO HELP YOU
Trove: a utility library for fast AOS/SOA access and transposition: https://github.com/bryancatanzaro/trove

(AB)USING THE CACHE

MAKING THE MOST OF L2 CACHE
L2 cache is fast (~2,000 GB/sec vs. ~300 GB/sec for GDRAM) but small:

Architecture   L2 Cache Size   Total Threads   Cache Bytes per Thread
Kepler         1536 KB         30,720          51
Maxwell        3072 KB         49,152          64
Pascal         4096 KB         114,688         36

TRAINING DEEP NEURAL NETWORKS

LOTS OF PASSES OVER DATA
[Diagram: the same input passes through a 3x3 convolution (weights W1), an FFT, a 5x5 convolution (W2), and a 7x7 convolution (W3), combining into the output: "Cat!"]

MULTI-RESOLUTION CONVOLUTIONS
Pass 1: 3x3. Pass 2: 5x5. Pass 3: 7x7.

TILED, MULTI-RESOLUTION CONVOLUTION
Do all 3 passes per tile, with each tile sized to fit in the L2 cache.

LAUNCHING FEWER THAN MAXIMUM THREADS

SHARED MEMORY: DEFINITELY WORTH IT

USING SHARED MEMORY WISELY
Shared memory is arranged into banks for concurrent SIMT access: 32 threads can read simultaneously so long as they hit separate banks. Shared memory supports 4-byte and 8-byte bank sizes.
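(The classic pitfall is column access of a 2D tile: in a 32-wide tile every element of a column lands in the same bank. Padding by one element fixes it -- a standard trick, not from the slides; the kernel below is a hypothetical single-tile transpose for a 32x32 block:)

__global__ void transpose32(const float *in, float *out)
{
    // 33 columns instead of 32: column accesses now fall in different banks.
    __shared__ float tile[32][33];
    int x = threadIdx.x, y = threadIdx.y;

    tile[y][x] = in[y * 32 + x];     // coalesced row write, no conflicts
    __syncthreads();
    out[y * 32 + x] = tile[x][y];    // column read, conflict-free thanks to padding
}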

STENCIL ALGORITHM
Many algorithms have high data re-use: potentially good for shared memory.
A stencil accumulates data from neighbours onto a central point. For a stencil of width W (above, W=5), adjacent threads share (W-1) items of data -- good potential for re-use.

STENCILS IN SHARED MEMORY
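(A minimal 1D sketch of the technique -- hypothetical kernel with W=5, i.e. RADIUS=2; launch with (blockDim.x + 2*RADIUS) * sizeof(float) bytes of dynamic shared memory:)

#define RADIUS 2   // stencil width W = 2*RADIUS + 1 = 5

__global__ void stencil1D(const float *in, float *out, int n)
{
    extern __shared__ float sh[];                    // blockDim.x + 2*RADIUS floats
    int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int l = threadIdx.x + RADIUS;                    // local index incl. halo offset

    // Each thread loads its own element; the first RADIUS threads also load the halos.
    sh[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS) {
        sh[l - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
        int h = g + blockDim.x;
        sh[l + blockDim.x] = (h < n) ? in[h] : 0.0f;
    }
    __syncthreads();

    // Each output point re-uses (W-1) values loaded by its neighbours.
    if (g < n) {
        float acc = 0.0f;
        for (int o = -RADIUS; o <= RADIUS; o++)
            acc += sh[l + o];
        out[g] = acc;
    }
}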

SIZE MATTERS

PERSISTENT KERNELS
Revisiting the tiled convolutions: avoid multiple kernel launches by caching in shared memory instead of L2.

Separate kernel launches with L2 re-use:

void tiledConvolution()
{
    convolution<3><<< numBlocks, blockDim, 0, s >>>(ptr, chunkSize);
    convolution<5><<< numBlocks, blockDim, 0, s >>>(ptr, chunkSize);
    convolution<7><<< numBlocks, blockDim, 0, s >>>(ptr, chunkSize);
}

A single launch of a persistent kernel:

__global__ void convolutionShared(int *data, int count, int sharedElems)
{
    extern __shared__ int shData[];
    shData[threadIdx.x] = data[threadIdx.x + blockDim.x * blockIdx.x];
    __syncthreads();

    convolve<3>(threadIdx.x, shData, sharedElems);
    __syncthreads();
    convolve<5>(threadIdx.x, shData, sharedElems);
    __syncthreads();
    convolve<7>(threadIdx.x, shData, sharedElems);
}


OPERATING DIRECTLY FROM CPU MEMORY
Can save memory copies. It's obvious when you think about it...
Copy model: copy data to GPU -> compute -> copy data to host. Compute only begins when the first copy has finished, and the task only ends when the second copy has finished.
Direct model: the kernel reads from and writes to CPU memory itself. Compute begins after the first fetch, uses lots of threads to cover the host-memory access latency, and takes advantage of bi-directional PCIe.

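(The plumbing for this is mapped, pinned memory -- a sketch; the kernel and sizes are hypothetical, while cudaHostAlloc/cudaHostGetDevicePointer are the standard zero-copy API:)

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;      // each access crosses PCIe
}

void zeroCopyExample(int n)
{
    cudaSetDeviceFlags(cudaDeviceMapHost);             // enable mapped memory
    float *h_data, *d_alias;
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_alias, h_data, 0);     // GPU view of the same pages

    scale<<<(n + 255) / 256, 256>>>(d_alias, n);       // no cudaMemcpy anywhere
    cudaDeviceSynchronize();                           // results are now in h_data
    cudaFreeHost(h_data);
}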

OCCUPANCY AND REGISTER LIMITATIONS
The register file is bigger than shared memory and the L1 cache!
Occupancy can kill you if you use too many registers. It is often worth forcing fewer registers to allow more blocks per SM -- but watch out for math functions:

__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
__global__ void compute()
{
    y = acos(pow(log(fdivide(tan(cosh(erfc(x))), 2)), 3));
}

Register cost of math functions:

Function     float   double
log            7       18
cos           16       28
acos           6       18
cosh           7       10
tan           15       28
erfc          14       22
exp            7       10
log10          6       18
normcdf       16       26
cbrt           8       20
sqrt           6       12
rsqrt          5       12
y0            20       30
y1            22       30
fdivide       11       20
pow           11       24
grad. desc.   14       22
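(To see what a register budget buys you, the occupancy API reports how many blocks of a given size fit per SM -- a sketch using the standard runtime call; the kernel is a stand-in for any __global__ function, e.g. one compiled with __launch_bounds__:)

__global__ void kernel() { }

void checkOccupancy()
{
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, 256, 0);
    // More resident blocks -> more warps to hide latency; register use is
    // often what caps this number.
}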

THANK YOU!