CUDA Memories
James Gain, Michelle Kuttel, Sebastian Wyngaard, Simon Perkins and Jason Brownbridge
{jgain, mkuttel, sperkins, jbrownbr}@cs.uct.ac.za, swyngaard@csir.co.za
3-6 May 2011

Introduction
Memory usage is abstracted away on a CPU, e.g. main memory, L1 and L2 cache, paging. Not so on the GPU: different types of memory must be utilised and programmed for maximum performance. However, some memory abstraction is appearing on the new Fermi cards.

Memory hierarchy
Each thread can:
  Read/write per-thread registers
  Read/write per-thread local memory
  Read/write per-block shared memory
  Read/write per-grid global memory
  Read-only per-grid constant memory

Memory hierarchy: declaring variables
  __device__ __local__     -> per-thread local memory
  __device__ __shared__    -> per-block shared memory
  __device__ __constant__  -> per-grid constant memory
  __device__               -> per-grid global memory
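A minimal sketch of these declarations in use (the variable and kernel names, the initial values, and the block size of 64 are invented for illustration):

__device__ int globalVar = 3;                  // global memory, per grid
__device__ __constant__ int constantVar = 7;   // constant memory, per grid (read-only in kernels)

__global__ void qualifier_demo(int* out)
{
    int registerVar = threadIdx.x;             // automatic variable: lives in a per-thread register
    __shared__ int sharedVar[64];              // shared memory, per block (assumes blockDim.x == 64)

    sharedVar[threadIdx.x] = registerVar + constantVar + globalVar;
    __syncthreads();
    out[threadIdx.x] = sharedVar[threadIdx.x];
}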

Some suggestions
Carefully divide data according to access patterns:
  Read-only: constant memory (very fast if in cache)
  Read/write, shared within a block: shared memory (very fast)
  Read/write in each thread: registers (very fast), provided they don't spill (cf. the profiling/performance talks)
  Read/write inputs/results: global memory (very slow)
Put another way: where do you declare a variable?
  Host accessible? Make it a global/constant variable and put it outside all the functions.
  No? Then it is a register/shared/local variable and belongs in a kernel.

Note on pointer use
Pointers can only point to memory allocated or declared in global memory:
  Allocated on the host and passed to the kernel: __global__ void cuda_kernel(float* p)
  Obtained as the address of a global variable: float* p = &g_var;
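A minimal sketch covering both cases (g_var, the kernel name and the launch configuration are invented for illustration):

__device__ float g_var;                   // a variable declared in global memory

__global__ void cuda_kernel(float* p)     // p: global memory allocated on the host
{
    float* q = &g_var;                    // pointer obtained from a global variable
    *q = 1.0f;                            // note: every thread writes the same location
    p[threadIdx.x] = *q;
}

int main(void)
{
    float* d_a;
    cudaMalloc((void**)&d_a, 256 * sizeof(float));
    cuda_kernel<<<1, 256>>>(d_a);
    cudaFree(d_a);
    return 0;
}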

Global memory
The main GDDR3/GDDR4/GDDR5 memory on the GPU. It is not close to the GPU chip and is slow to access: 400-600 cycles. But there's a lot of it, and latency hiding by the thread scheduler can ameliorate access times.

Coalescing global memory accesses
CUDA devices with compute capability 1.0 and 1.1 can fetch data in a single 64-byte or 128-byte transaction. Access patterns where threads access a chunk of contiguous data are very efficient: the memory reads/writes are coalesced into a single access. Scattered access patterns require more memory transactions.
  s[threadIdx.x] = a[threadIdx.x];      // coalesces
  s[threadIdx.x] = a[threadIdx.x * 7];  // does not
More on this in the performance session.
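A minimal sketch contrasting the two patterns as whole kernels (the kernel names and the stride parameter are invented for illustration; the exact coalescing rules depend on compute capability):

__global__ void copy_coalesced(float* out, const float* in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                       // consecutive threads touch consecutive addresses
}

__global__ void copy_strided(float* out, const float* in, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];              // consecutive threads touch addresses 'stride' elements apart
}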

Quadro FX 5600 specs
Memory: 1.5 GB GDDR3
  DDR => transfers 2 memory values per clock cycle: 1600 MHz effective (2 * 800 MHz, cf. DDR above)
  384-bit bus width
Chip clock: 600 MHz
PCI-Express x16: 16 lanes, full duplex, ~4 GB/s total
  (Proviso: not sure whether the card uses PCIe 1.0 or 2.0; the latter would mean ~8 GB/s total.)

Quadro FX 5600
Question 1: How many CUDA cores are there?
  128 = 16 SM * 8 SP (since there are 8 SPs per SM)
Question 2: What is the GPU's memory bandwidth?
  1600 MHz * 384 bits / 8 bits per byte = 76800 MB/s = 76.8 GB/s
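The same figure can be computed at run time from the device properties. A minimal sketch (the memoryClockRate and memoryBusWidth fields are reported by newer CUDA runtimes than the one discussed here, in kHz and bits respectively; the factor of 2 assumes DDR memory):

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // 2 transfers/clock (DDR) * memory clock (kHz -> Hz) * bus width (bits -> bytes)
    double gbps = 2.0 * prop.memoryClockRate * 1000.0 * (prop.memoryBusWidth / 8.0) / 1.0e9;
    printf("%s: %.1f GB/s peak memory bandwidth\n", prop.name, gbps);
    return 0;
}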

Quadro FX 5600
These calculations have importance. Suppose you expect your kernel to perform 1 single-precision floating-point operation for every floating-point value read from global memory. What is the peak performance you can achieve?
  You can expect to process 76.8 GB/s / 4 bytes per value = 19.2 giga single-precision values per second, i.e. 19.2 GFLOPS.
For an 8800 GTX (roughly a Quadro FX 5600):
  Theoretical peak performance is ~500 GFLOPS.
  In practice, some sources suggest ~340 GFLOPS.
Moral: avoid global memory, calculate!

CHPC GPU: Quadro FX 5600
Note that the bandwidth between GPU memory and the GPU chip is ~19x (or ~9x) greater than that between CPU memory and the GPU offered through the PCI Express bus. So if your code depends on processing massive amounts of data, you may not be able to feed the GPU enough to take advantage of its compute power. One may be able to ameliorate this using asynchronous transfers.

Texture memory
Buffers sections of global memory. Up to 8 KB of texture cache is present in each SM. Read-only within an executing kernel. On a cache miss, data around the missed location is loaded into the cache. A good choice for spatially local accesses (1D/2D), i.e. data that is almost coalesced. Same latency as DRAM, but special cases deliver data in < 100 cycles. However, don't count on blocks on the same multiprocessor using spatially similar accesses: it is not a good idea to run only one or two blocks per SM when heavy texture dependence is expected.

Texture memory

#define N 256    // example array length

texture<int, 1> texSrc;

__global__ void the_kernel(int* array)
{
    // access the texture in some pattern
    int index = threadIdx.x;
    array[threadIdx.x] = tex1Dfetch(texSrc, index) + tex1Dfetch(texSrc, index + 3);
}

int main()
{
    int* d_a;
    cudaMalloc((void**)&d_a, sizeof(int) * N);
    cudaBindTexture(0, texSrc, d_a, sizeof(int) * N);
    the_kernel<<<1, N>>>(d_a);
    cudaUnbindTexture(texSrc);
}

Texture memory
One very important use is hardware-based interpolation. For example, suppose you have a piecewise-linear 2D function whose values you need over some domain. First load a set of sample points of that function (say a uniform sampling over a grid) into texture memory, then set up linear interpolation for that texture. Result: texture fetches now interpolate between samples.
  tex2D(texSrc, 1.5f, 2.6f);
estimates the value at coordinate (1.5, 2.6) from the values around it.
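A minimal sketch of that interpolation setup using the texture-reference API of this CUDA generation (the names, the 64x64 grid and the kernel launch left as a comment are all illustrative; linear filtering requires the samples to live in a cudaArray rather than plain linear memory):

texture<float, 2, cudaReadModeElementType> texSrc2D;

int main()
{
    const int W = 64, H = 64;
    float h_samples[W * H];                        // uniform grid of samples, filled elsewhere

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray* arr;
    cudaMallocArray(&arr, &desc, W, H);
    cudaMemcpyToArray(arr, 0, 0, h_samples, sizeof(h_samples), cudaMemcpyHostToDevice);

    texSrc2D.filterMode = cudaFilterModeLinear;    // enable hardware interpolation
    cudaBindTextureToArray(texSrc2D, arr, desc);

    // ... launch a kernel that calls tex2D(texSrc2D, x, y) with fractional coordinates ...

    cudaUnbindTexture(texSrc2D);
    cudaFreeArray(arr);
    return 0;
}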

Shared memory
Really fast: 1 cycle. Shared amongst all threads in a block. 16 KB per SM (48 KB on new cards). Use as much as possible: split the problem into chunks of shared memory and read each chunk in from global memory.
Watch out for bank conflicts! These occur when different threads of a (half-)warp access different addresses that fall in the same shared memory bank, so the hardware has to serialise the access. Rule of thumb to avoid them: either all threads access the same memory address, or the threads access distinct addresses that map to distinct banks (e.g. consecutive 32-bit words).

Shared memory

__global__ void the_kernel(int* a)
{
    __shared__ int s[N];

    // Read from slow global memory into fast shared memory
    s[threadIdx.x] = a[threadIdx.x];
    __syncthreads();

    // Do lots of processing with s[threadIdx.x] ...
    __syncthreads();

    // Write back to global memory
    a[threadIdx.x] = s[threadIdx.x];
}
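A common way to avoid bank conflicts when threads walk a shared 2D tile by column is to pad its leading dimension. A minimal sketch for a tile-based transpose (the tile size, kernel name, and the assumption of a square width x width matrix with width a multiple of TILE are all illustrative; the 16-bank figure applies to compute capability 1.x):

#define TILE 16

__global__ void transpose_tile(float* out, const float* in, int width)
{
    // The +1 padding shifts each row by one bank, so a column read by the
    // threads of a half-warp hits 16 different banks instead of one.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}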

Registers
Really fast: 1 cycle to access. They are split among the available threads, e.g. 16384 registers per SM / 256 threads = 64 registers per thread. The compiler will spill registers into local memory if it doesn't have enough; recall that local memory is a thread's unique and private chunk of global memory, so access to it is slow! Inspect the PTX assembly code (e.g. compile with nvcc --ptxas-options=-v, which reports per-kernel register usage) to know for sure. Large structs get split across multiple registers.

Constant memory
64 KB of constant memory, with up to 8 KB of constant cache per SM. If data is cached, access can be as low as 1 cycle; it can be hundreds of cycles if not. No penalty for the first access due to precaching. Great for lookup tables, e.g. precalculated trig values.
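The register budget per thread can also be derived at run time from the device properties; a minimal sketch working the same division as above (the block size of 256 is just an example; regsPerBlock reports the registers available to a block, which on the devices of this era equals the per-SM register file):

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threadsPerBlock = 256;  // example block size
    printf("registers available per block/SM: %d\n", prop.regsPerBlock);
    printf("register budget per thread at %d threads: %d\n",
           threadsPerBlock, prop.regsPerBlock / threadsPerBlock);
    return 0;
}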

Constant memory

__device__ __constant__ ParamStructType params;

__global__ void the_kernel(int* a, int* b)
{
    a[threadIdx.x] = params.snippet * b[threadIdx.x];
}

int main(void)
{
    ParamStructType host_params;
    // ... fill in host_params ...
    cudaMemcpyToSymbol(params, &host_params, sizeof(ParamStructType));
    // ... allocate a and b, launch the_kernel ...
}

CUDA streams
So far, the paradigm for calling GPU code has been synchronous: we do a memcpy or call a kernel and wait for the operation to finish, but the CPU or GPU can be idle during these operations. CUDA streams allow concurrent execution of host and device code. This requires the programmer to understand asynchronous programming; non-blocking network I/O is the most obvious example.

CUDA streams
Overlap computation as much as possible to maximise use of the computational resources.

CUDA streams: modifications to existing code
  Kernel execution: a cudaStream_t is passed as the fourth kernel launch configuration parameter.
  Memory allocation: cudaMallocHost (pinned memory).
  Memory copies: cudaMemcpyAsync.

CUDA streams

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

// N floats per stream; d_a is a device buffer of 2*N floats allocated with cudaMalloc
float* a;
cudaMallocHost((void**)&a, 2 * N * sizeof(float));   // pinned host memory

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(d_a + i * N, a + i * N, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream[i]);

for (int i = 0; i < 2; ++i)
    kernel<<<100, 512, 0, stream[i]>>>(d_a + i * N, N);

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(a + i * N, d_a + i * N, N * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[i]);

CUDA streams
After the asynchronous calls, the CPU can do other work or invoke further asynchronous calls. Once its own tasks are completed, the CPU can invoke calls to see whether the asynchronous streams have finished processing:
  Wait for one stream: cudaStreamSynchronize()
  Wait for all streams: cudaThreadSynchronize()
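A minimal sketch of those synchronisation calls, continuing the snippet above (cudaStreamQuery is an optional non-blocking check; the names carry over from the previous code):

// Poll without blocking: cudaSuccess means all work queued on the stream is done
if (cudaStreamQuery(stream[0]) == cudaSuccess) {
    // results of stream 0 are ready in a[0 .. N-1]
}

cudaStreamSynchronize(stream[0]);   // block until stream 0 has finished
cudaThreadSynchronize();            // block until all streams have finished

for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);
cudaFreeHost(a);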