Information Coding / Computer Graphics, ISY, LiTH. CUDA memory: Coalescing, Constant memory, Texture memory, Pinned memory. 26(86)

CUDA memory. We already know... Global memory is slow. Shared memory is fast and can be used as a manual cache. There were some other kinds of memory... 27(86)

Coalescing. Always access global memory in order. If threads access global memory in order of thread numbers, performance will be improved. [Figure: threads 1-8 reading RAM locations 1-8 in thread order (good) vs. scattered (bad).] 28(86)

WTF? How can performance depend on what order I access my data??? Isn't it random access? Yes... You can access in any order you want, but ordered access helps the GPU to read more data in one access. Why? Because the GPU can fetch a lot of data in a single transaction, and neighboring threads are tested for accessing the same area. 29(86)

Coalescing. Example: Assume that we can get 4 data items per transaction. [Figure: threads 1-8 reading RAM locations 1-8 in order needs one access per group of four (good); scattered access needs eight separate accesses (bad).] 30(86)
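
To make the picture concrete, here is a minimal sketch (not from the slides) contrasting the two access patterns; the kernel names are illustrative:

// Good: consecutive threads read consecutive addresses -> one wide transaction.
__global__ void copyCoalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Bad: consecutive threads touch addresses far apart -> one transaction each.
__global__ void copyStrided(float *out, const float *in, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}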

Coalescing on Fermi & later. Effect reduced by caches - but not removed. Coalescing is still needed for maximum performance. "Perhaps the single most important performance consideration... is coalescing of global memory accesses." (CUDA C Best Practices Guide 2016) 31(86)

Accelerating by coalescing. Pure memory transfers can be 10x faster by taking advantage of memory coalescing. Example: Matrix transpose. No computations, only memory accesses. 32(86)

Matrix transpose. Naive implementation:

__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;

    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in  = xIndex + width * yIndex;   // read along rows
        unsigned int index_out = yIndex + height * xIndex;  // write along columns
        odata[index_out] = idata[index_in];
    }
}

How can this be bad? 33(86)

Matrix transpose. Coalescing problems: the kernel accesses one matrix row-by-row and the other column-by-column. The column accesses are non-coalesced. 34(86)

Matrix transpose. Coalescing solution: Read from global memory to shared memory - in order from global, any order to shared. Write to global memory - in-order write to global, any order from shared. 35(86)

Better CUDA matrix transpose kernel:

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM+1]; // shared memory for temporary storage

    // read the matrix tile into shared memory (temporary buffer)
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if ((xIndex < width) && (yIndex < height))
    {
        unsigned int index_in = yIndex * width + xIndex;
        block[threadIdx.y][threadIdx.x] = idata[index_in];
    }

    __syncthreads();

    // write the transposed matrix tile to global memory
    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if ((xIndex < height) && (yIndex < width))
    {
        unsigned int index_out = yIndex * height + xIndex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];
    }
}

36(86)
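
For reference, a host-side launch might look like this; BLOCK_DIM = 16 and the device pointer names are assumptions, not from the slides:

#define BLOCK_DIM 16

// Hypothetical launch: one BLOCK_DIM x BLOCK_DIM thread block per tile,
// rounding the grid up so partial tiles at the edges are covered.
dim3 threads(BLOCK_DIM, BLOCK_DIM);
dim3 grid((width + BLOCK_DIM - 1) / BLOCK_DIM, (height + BLOCK_DIM - 1) / BLOCK_DIM);
transpose<<<grid, threads>>>(d_odata, d_idata, width, height);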

Coalescing rules of thumb: The data block should start on a multiple of 64. It should be accessed in order (by thread number). It is allowed to have threads skipping their item. Data should be in blocks of 4, 8 or 16 bytes. 37(86)

Shared memory. Split into multiple memory banks (32). Fastest if you access different banks with each thread. Interleaved, 32-bit chunks. Thus: address in 32-bit steps between threads for best performance. [Figure: the address space interleaved across banks 0-7.] 38(86)
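
This is also why the transpose kernel above pads its tile with one extra column; a minimal sketch of the idea, assuming 32 banks and 4-byte words (the kernel is illustrative and should be launched with a 32x32 block):

__global__ void bankFriendly(float *out, const float *in)
{
    // Rows of exactly 32 floats would map each column index to the same bank;
    // the +1 padding shifts every row by one bank, so reading a column of the
    // tile hits 32 different banks instead of the same bank 32 times.
    __shared__ float tile[32][32 + 1];

    tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * 32 + threadIdx.x];
    __syncthreads();
    // column-wise read from shared memory: conflict-free thanks to the padding
    out[threadIdx.x * 32 + threadIdx.y] = tile[threadIdx.x][threadIdx.y];
}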

Constant memory. Sounds boring... but has its uses. Read-only (for kernels). __constant__ modifier. Use for input data, obviously. 39(86)

Benefits of constant memory. No cudaMemcpy() needed - just use it from the kernel, write from the CPU. For data read by all threads, significantly faster than global memory. Read-only memory is easy to cache. 41(86)
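
A minimal sketch of the pattern; the Sphere struct, the array size, and the kernel are illustrative assumptions (the host writes the symbol with cudaMemcpyToSymbol()):

#define NUM_SPHERES 20
struct Sphere { float x, y, z, radius; };

__constant__ Sphere spheres[NUM_SPHERES]; // lives in constant memory, read-only in kernels

__global__ void readSpheres(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = spheres[0].radius; // all threads read the same element -> one broadcast
}

// Host side: no cudaMalloc/cudaMemcpy for the sphere data, just copy to the symbol:
// cudaMemcpyToSymbol(spheres, hostSpheres, NUM_SPHERES * sizeof(Sphere));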

Why faster access? When? All threads reading the same data: one read can be broadcast to all nearby threads. Nearby? All threads in the same half-warp (16 threads). But no help if threads are reading different data. 42(86)

Example of using constant memory: Ray-caster. Demo from CUDA by Example. With and without using const. 43(86)

Ray-caster example. Every thread renders one pixel. Loop through all spheres, find the closest with intersection. Write the result to an image buffer. Image buffer displayed with OpenGL. Non-const: uploads the sphere array with cudaMemcpy(). Const: declares the array __constant__, uses it directly from the kernel. (Slightly simpler code.) 44(86)

Ray-caster example. Resulting time: Without const: 13.9 ms. With const: 10.6 ms. Significant difference - for something that simplified the code. 45(86)

Constant memory conclusions. Relatively fast memory access - for the case when all threads read the same memory. Some advantage for code complexity. NOT something we use for everything. 46(86)

Texture memory / texture units. Using texture units to access memory. [Figure: G80 processor hierarchy.] 47(86)

Texture memory / texture units. Texture memory: yet another kind of memory (or memory access method). But didn't we hide the graphics heritage...? Access global memory through the texturing units. Lets CUDA take advantage of the strong points of the texturing units. 48(86)

Texture memory features. Read-only (writable using "surface objects"). Cached - can be fast if data access patterns are good. Texture filtering, linear interpolation. Edge handling. Especially good for handling 4 floats at a time (float4). cudaBindTextureToArray() binds data to a texture unit. 49(86)

Texture memory for graphics. Texture data is mostly for rendering textures. One texel is used by 4 neighbor pixels (when not at exact integer coordinates). Designed for spatial locality. 50(86)

Varying access patterns - but neighbors are still neighbors. 51(86)

Spatial locality for other things than textures. Image filters of a local nature. Physics simulations with local updates: transfer of heat, liquids, pressure... Big jumps, no gain. 52(86)

Using texture memory in CUDA. Allocate with cudaMalloc(). Bind to a texture unit using cudaBindTexture2D(). Read from the data using tex2D(). Drawback: just like in OpenGL, it is messy to keep track of which texture unit/texture reference holds which data. 53(86)
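
A minimal sketch of these steps, using the legacy texture-reference API the slide names (deprecated in later CUDA versions and removed in CUDA 12; the variable names are assumptions):

texture<float, 2, cudaReadModeElementType> texRef; // file-scope texture reference

__global__ void readTexture(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f); // +0.5 samples the texel center
}

// Host side (error checking omitted):
// float *devPtr; size_t pitch;
// cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
// cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
// cudaBindTexture2D(NULL, texRef, devPtr, desc, width, height, pitch);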

Clamp and repeat. [Figure: texture coordinates outside the image, clamped vs. repeated.] Texture access needs no boundary checks. 54(86)

Clamp and repeat. You are used to this: accesses outside the image are errors. Now you can get clamping (the edge texels are extended) or repetition (the texture tiles). [Figure: a 2x2 texture sampled out of bounds - errors vs. clamp vs. repeat.] 55(86)
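
With the texture reference from the earlier sketch, the edge behavior is a host-side setting (a hedged sketch; texRef is the assumed name from above):

// Clamp: coordinates outside the image return the nearest edge texel.
texRef.addressMode[0] = cudaAddressModeClamp;
texRef.addressMode[1] = cudaAddressModeClamp;

// Repeat: the texture tiles (wrap mode requires normalized coordinates).
// texRef.normalized = 1;
// texRef.addressMode[0] = cudaAddressModeWrap;
// texRef.addressMode[1] = cudaAddressModeWrap;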

Interpolation. Computation tricks when optimizing: texture access provides hardware-accelerated linear interpolation. Access texture data at non-integer coordinates and the texture hardware will do the linear interpolation automatically. Can be used for many calculations, e.g. filters. 56(86)

Interpolation. [Figure: a sample point x between texels A and B; with distance a from A and distance b from B (unit spacing), the hardware returns the blend B·a + A·b.] Texture accesses and calculations are hardware accelerated. 57(86)
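
Turning on the hardware blend is one more host-side flag on the texture reference (sketch, same assumed texRef as above):

// With linear filtering, tex2D() at a fractional coordinate returns the
// hardware-computed blend of the neighboring texels instead of the nearest one.
texRef.filterMode = cudaFilterModeLinear;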

Hardware interpolation: too good to be true... The interpolation trick sounds kind of useful (for some cases)... but isn't as useful as it seems. Why? It is meant for interpolating between texels, visually. Small errors are not a problem then. May have low precision, like 10 steps. Not as fun then... 58(86)

Demo using texture memory. Heat transfer demo. 59(86)

Demo using texture memory. Heat transfer demo: makes local operations modelling heat dissipation. 60(86)

That's all folks! Next: Sorting on the GPU. 61(86)