CUDA Performance Optimization. Patrick Legresley

Optimizations
- Kernel optimizations
  - Maximizing global memory throughput
  - Efficient use of shared memory
  - Minimizing divergent warps
  - Intrinsic instructions
- Optimizations of CPU/GPU interaction
  - Maximizing PCIe throughput
  - Asynchronous memory copies

Coalescing global memory accesses
- A coordinated load/store by a half-warp (16 threads) to a contiguous region of global memory:
  - 64 bytes - each thread accesses a 32-bit word: int, float, ...
  - 128 bytes - each thread accesses a 64-bit double-word: int2, float2, ...
  - 256 bytes - each thread accesses a 128-bit quad-word: int4, float4, ...
- Additional restrictions:
  - The starting address of the region must be a multiple of the region size
  - The k-th thread in a half-warp must access the k-th element in the block
  - Exception: not all threads need to participate (predicated access, divergence within a half-warp)
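To make the rule concrete, here is a minimal sketch (kernel and variable names are illustrative, not from the slides) of an access pattern that coalesces on compute 1.0/1.1 hardware, and one that does not:

    // Coalesced: thread k of each half-warp reads element k of a
    // 64-byte-aligned region (cudaMalloc allocations are suitably aligned).
    __global__ void copyCoalesced(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];        // one 64-byte transaction per half-warp
    }

    // Uncoalesced on compute 1.0/1.1: the +1 offset misaligns the starting
    // address, so each half-warp is serviced by 16 separate transactions.
    __global__ void copyMisaligned(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i + 1];
    }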

Coalesced Access: floats
[Diagram: threads t0-t15 of a half-warp reading consecutive 4-byte addresses 128-188; shown once with all threads participating and once with some threads not participating - both cases coalesce.]

Uncoalesced Access: floats
[Diagram: two uncoalesced cases - threads t0-t15 accessing addresses 128-188 in permuted order, and a misaligned starting address that is not a multiple of 64.]

Coalesced Access: float3
- Use shared memory to allow coalescing
  - Threads read a block of floats into shared memory in a coalesced way
  - Needs sizeof(float3) * (threads per block) bytes of shared memory
- Processing
  - Each thread retrieves its float3 from shared memory
  - The rest of the compute code does not change
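A sketch of this staging pattern, assuming a fixed block size of 256 threads and a placeholder computation (names are illustrative):

    __global__ void accessFloat3(float3 *d_in, float3 *d_out)
    {
        __shared__ float s_data[256 * 3];   // sizeof(float3) * threads per block

        // Stage 1: coalesced loads - treat the float3 array as plain floats.
        float *g_in = (float *)d_in + blockIdx.x * 256 * 3;
        s_data[threadIdx.x]       = g_in[threadIdx.x];
        s_data[threadIdx.x + 256] = g_in[threadIdx.x + 256];
        s_data[threadIdx.x + 512] = g_in[threadIdx.x + 512];
        __syncthreads();

        // Stage 2: each thread grabs its own float3 from shared memory,
        // does its work, and writes the result back to shared memory.
        float3 a = ((float3 *)s_data)[threadIdx.x];
        a.x += 2.0f; a.y += 2.0f; a.z += 2.0f;   // placeholder computation
        ((float3 *)s_data)[threadIdx.x] = a;
        __syncthreads();

        // Stage 3: coalesced stores, mirroring stage 1.
        float *g_out = (float *)d_out + blockIdx.x * 256 * 3;
        g_out[threadIdx.x]       = s_data[threadIdx.x];
        g_out[threadIdx.x + 256] = s_data[threadIdx.x + 256];
        g_out[threadIdx.x + 512] = s_data[threadIdx.x + 512];
    }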

Coalescing: Timing Results
Experiment: kernel reads a float, increments it, and writes it back; 3M floats (12 MB); times averaged over 10K runs.
12K blocks x 256 threads reading floats:
- 356 µs - coalesced
- 357 µs - coalesced, some threads don't participate
- 3,494 µs - permuted/misaligned thread access
4K blocks x 256 threads reading float3s:
- 3,302 µs - float3 uncoalesced
- 359 µs - float3 coalesced through shared memory

Coalescing: Structures of size 4, 8, or 16 bytes
- Use a Structure of Arrays (SoA) instead of an Array of Structures (AoS)
- If SoA is not viable:
  - Force structure alignment: __align__(X), where X = 4, 8, or 16
  - Use shared memory to achieve coalescing
[Diagram: a Point structure with fields x, y, z, stored as "x y z x y z x y z" (AoS) versus "x x x y y y z z z" (SoA).]
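A minimal illustration of the three options (the Point structure comes from the slide's diagram; the type names are placeholders):

    // Array of Structures: thread i touching aos[i].x generates
    // strided, uncoalesced accesses on compute 1.0/1.1 hardware.
    struct PointAoS { float x, y, z; };

    // Structure of Arrays: thread i touching xs[i] is coalesced.
    struct PointsSoA { float *xs; float *ys; float *zs; };

    // If AoS must stay: align the struct to 16 bytes so each element
    // can be fetched as a single aligned vector load.
    struct __align__(16) PointAligned { float x, y, z; };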

Coalescing (Compute 1.2+ GPUs)
- Much improved coalescing capabilities in the 10-series architecture
- Hardware combines the addresses within a half-warp into one or more aligned segments of 32, 64, or 128 bytes
- All threads with addresses within a segment are serviced with a single memory transaction, regardless of ordering or alignment within the segment

Compute 1.2+ Coalesced Access: Reading floats
[Diagram: the permuted-access and misaligned-address patterns shown earlier - on compute 1.2+ hardware both are serviced as coalesced accesses.]

Compute 1.2+ Coalesced Access: Reading floats
[Diagram: threads t0-t15 reading floats starting at the misaligned address 116 (not a multiple of 64) are serviced by one 32-byte segment plus one 64-byte segment - the transaction size is recursively reduced to minimize wasted bandwidth.]

Textures in CUDA
- Texture is an object for reading data
- Benefits:
  - Data is cached (optimized for 2D locality) - helpful when coalescing is a problem
  - Filtering: linear / bilinear / trilinear, in dedicated hardware
  - Wrap modes (for out-of-bounds addresses): clamp to edge / repeat
  - Addressable in 1D, 2D, or 3D, using integer or normalized coordinates
- Usage:
  - CPU code binds data to a texture object
  - Kernel reads the data by calling a fetch function

Texture Addressing
[Diagram: a texture indexed by (x, y) coordinates, with in-range fetches at (2.5, 0.5) and (1.0, 1.0).]
- Wrap: an out-of-bounds coordinate is wrapped (modulo arithmetic)
- Clamp: an out-of-bounds coordinate is replaced with the closest boundary
[Diagram: the out-of-bounds fetch (5.5, 1.5) shown under both wrap and clamp modes.]

Two CUDA Texture Types
- Bound to linear memory
  - A global memory address is bound to a texture
  - Only 1D, integer addressing
  - No filtering, no addressing modes
- Bound to CUDA arrays
  - A CUDA array is bound to a texture
  - 1D, 2D, or 3D
  - Float addressing (size-based or normalized)
  - Filtering and addressing modes (clamp, repeat)
- Both return either the element type or a normalized float

CUDA Texturing Steps
- Host (CPU) code:
  - Allocate/obtain memory (global linear memory or a CUDA array)
  - Create a texture reference object (currently must be at file scope)
  - Bind the texture reference to the memory/array
  - When done: unbind the texture reference and free resources
- Device (kernel) code:
  - Fetch using the texture reference
  - Linear memory textures: tex1Dfetch()
  - Array textures: tex1D(), tex2D(), or tex3D()
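A minimal end-to-end sketch using the texture reference API these slides describe (since deprecated in modern CUDA); the kernel, names, and sizes are placeholders:

    // File-scope texture reference bound to linear memory (1D, integer addressing).
    texture<float, 1, cudaReadModeElementType> texRef;

    __global__ void scale(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * tex1Dfetch(texRef, i);  // read through the texture cache
    }

    void run(float *d_in, float *d_out, int n)
    {
        cudaBindTexture(0, texRef, d_in, n * sizeof(float));  // host: bind
        scale<<<(n + 255) / 256, 256>>>(d_out, n);
        cudaUnbindTexture(texRef);                            // host: unbind
    }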

Global memory optimization
- Coalescing greatly improves throughput
- Prefer Structures of Arrays over AoS; if SoA is not viable, read/write through shared memory
- Try textures for read patterns that cannot be coalesced
- Batching can help performance: a thread that reads/writes multiple elements increases overlap opportunities
- Strive for 50% or higher occupancy
  - Occupancy = number of threads running concurrently / maximum number of threads that can run concurrently

Occupancy and Register Pressure
- Latency is hidden by running many threads per multiprocessor
- Factors limiting the number of concurrent threads:
  - Number of registers: 8192 or 16384 per multiprocessor, partitioned among concurrent threads
  - Amount of shared memory: 16 KB per multiprocessor, partitioned among concurrent thread blocks
- Compile with the --ptxas-options=-v flag to see per-kernel resource usage
- Use the --maxrregcount=N flag to nvcc, where N = desired maximum registers per thread
  - At some point spilling into local memory occurs, which reduces performance: local memory is slow (physically in DRAM)
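For example, a hypothetical build line (the file name is a placeholder):

    nvcc --ptxas-options=-v --maxrregcount=32 kernel.cu

ptxas then reports each kernel's register and shared memory usage at compile time, which feeds directly into the occupancy calculation above.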

Shared Memory
- Orders of magnitude faster than global memory
- Uses:
  - Inter-thread communication within a block
  - Avoiding non-coalesced access (see the Matrix Transpose SDK example)
- Organization:
  - 16 banks, each 32 bits wide
  - Successive 32-bit words belong to different banks

Shared memory bank conflicts
- No bank conflicts:
  - All threads in a half-warp access different banks
  - All threads in a half-warp read the same address (broadcast)
- Bank conflicts:
  - Multiple threads in a half-warp access the same bank
  - The access is serialized
- Detecting conflicts:
  - warp_serialize profiler signal
  - Bank checker macro in the SDK
- Performance impact:
  - Shared memory intensive apps: up to 30%
  - Little to no impact if performance is limited by global memory
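A common fix, sketched below under the 16-bank layout above: pad a shared-memory tile by one column so that column-wise accesses hit different banks. This is the trick used in the Matrix Transpose SDK example mentioned earlier (the sketch assumes a square matrix whose dimensions are multiples of 16):

    __global__ void transpose(float *out, const float *in, int width)
    {
        // The +1 pad puts tile[i][j] and tile[i+1][j] in different banks,
        // so the column-wise reads below are conflict-free.
        __shared__ float tile[16][16 + 1];

        int x = blockIdx.x * 16 + threadIdx.x;
        int y = blockIdx.y * 16 + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
        __syncthreads();

        x = blockIdx.y * 16 + threadIdx.x;                    // swap block indices
        y = blockIdx.x * 16 + threadIdx.y;
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
    }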

Bank Addressing Examples
[Diagram: two conflict-free patterns - linear addressing with stride 1 (thread k maps to bank k) and a random 1:1 permutation of threads 0-15 onto banks 0-15.]

Bank Addressing Examples
[Diagram: two conflicting patterns - linear addressing with stride 2 (2-way bank conflicts: pairs of threads share a bank) and stride 8 (8-way bank conflicts: eight threads map to bank 0 and eight to bank 8).]

Assessing Memory Performance
- Global memory: how close are you to the maximum bandwidth?
- Textures: throughput alone can be tricky - some codes exceed the theoretical bus bandwidth (due to the cache)
  - How close to the theoretical fetch rate? G80: ~18 billion fetches per second
- Shared memory: check for bank conflicts (warp_serialize profiler signal, bank-checker macros)

Runtime Math Library and Intrinsics
- Two types of runtime math library operations:
  - __func(): direct mapping to hardware ISA; fast but lower accuracy (see the programming guide for details)
    - Examples: __sinf(x), __expf(x), __powf(x, y)
  - func(): compiles to multiple instructions; slower but higher accuracy (5 ulp or less)
    - Examples: sinf(x), expf(x), powf(x, y)
- A number of additional intrinsics: __sincosf(), __frcp_rn(), ...
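A minimal sketch of the trade-off (kernel and names are illustrative):

    __global__ void fastVsAccurate(const float *x, float *fast, float *accurate, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            fast[i]     = __sinf(x[i]);  // hardware intrinsic: fast, lower accuracy
            accurate[i] = sinf(x[i]);    // library version: slower, within a few ulp
        }
    }

Note also that compiling with nvcc's -use_fast_math option forces all func() calls to their __func() equivalents for the whole file.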

Control Flow
- Divergent branches:
  - Threads within a single warp take different paths (if-else, ...)
  - The different execution paths are serialized
- Divergence can be avoided when a branch condition based on the thread ID is aligned to warp boundaries
  - Example with divergence: if (threadIdx.x > 2) {...} else {...} - branch granularity < warp size
  - Example without divergence: if (threadIdx.x / WARP_SIZE > 2) {...} else {...} - branch granularity is a whole multiple of the warp size

Grid Size Heuristics
- # of blocks > # of multiprocessors, so all multiprocessors execute at least one block
- # of blocks / # of multiprocessors > 2, so multiple blocks can run concurrently on a multiprocessor
  - Blocks that aren't waiting at a __syncthreads() keep the hardware busy
- # of blocks > 100 to scale to future devices; 1000s of blocks per grid will scale across multiple generations

Optimizing threads per block
- Thread block size should be a multiple of the warp size
  - Avoids wasting computation and on-chip resources on under-populated warps
- Heuristics:
  - Minimum: 64 threads per block, and only if there are multiple concurrent blocks
  - 128, 192, or 256 threads is usually a better choice, and usually still leaves enough registers to compile and launch successfully
- This all depends on your computation, so experiment!

Page-Locked System Memory
- Enables the highest cudaMemcpy performance
  - 3.2 GB/s on PCIe x16 Gen1
  - 5.2 GB/s on PCIe x16 Gen2
- Allocated and freed with the cudaMallocHost() / cudaFreeHost() calls
- See the bandwidthTest CUDA SDK sample
- Use with caution: it reduces the RAM available to the virtual memory system
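A minimal sketch (the buffer size is arbitrary):

    float *h_data;
    cudaMallocHost((void **)&h_data, 1 << 20);  // 1 MB of pinned host memory
    // ... cudaMemcpy to/from h_data now runs at full PCIe bandwidth ...
    cudaFreeHost(h_data);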

Asynchronous memory copy
- Asynchronous host/device memory copy, for pinned memory (allocated with cudaMallocHost in C), frees up the CPU on all CUDA-capable devices
- Overlap is implemented by using a stream
  - Stream = a sequence of operations that execute in order
- Stream API (0 = default stream):
  - cudaMemcpyAsync(dst, src, size, direction, 0);
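For instance (a sketch; d_dst is device memory and h_src is pinned host memory, both placeholders):

    cudaMemcpyAsync(d_dst, h_src, bytes, cudaMemcpyHostToDevice, 0);
    // ... CPU work here overlaps with the copy ...
    cudaThreadSynchronize();  // 2008-era API: block until the copy completes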

Kernel and memory copy overlap
- Concurrent execution of a kernel and a host/device memory copy, for pinned memory
- Requires a device with compute capability >= 1.1 (G84 and up)
- Overlaps kernel execution in one stream with a memory copy in another stream
- Stream API:
  - cudaStreamCreate(&stream1);
  - cudaStreamCreate(&stream2);
  - cudaMemcpyAsync(dst, src, size, dir, stream1);  <- overlapped with the kernel below
  - kernel<<<grid, block, 0, stream2>>>( );
  - cudaStreamQuery(stream2);
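Putting the pieces together, a minimal self-contained sketch (sizes, names, and the trivial kernel are illustrative, not from the slides):

    #include <cuda_runtime.h>

    __global__ void busyKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * 2.0f + 1.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *h_a, *d_a, *d_b;
        cudaMallocHost((void **)&h_a, bytes);  // pinned host memory
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);

        cudaStream_t stream1, stream2;
        cudaStreamCreate(&stream1);
        cudaStreamCreate(&stream2);

        // The copy in stream1 and the kernel in stream2 can overlap
        // on devices of compute capability 1.1 and up.
        cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1);
        busyKernel<<<n / 256, 256, 0, stream2>>>(d_b, n);

        cudaStreamSynchronize(stream1);  // wait for the copy
        cudaStreamSynchronize(stream2);  // wait for the kernel

        cudaStreamDestroy(stream1);
        cudaStreamDestroy(stream2);
        cudaFree(d_a); cudaFree(d_b); cudaFreeHost(h_a);
        return 0;
    }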