GPU Programming Using CUDA. Samuli Laine NVIDIA Research

Today
- GPU vs CPU
  - Different architecture, different workloads
- Basics of CUDA
  - Executing code on GPU
  - Managing memory between CPU and GPU
- CUDA API
  - Quick look at the most important parts
- Basic optimization techniques
  - Memory access pattern optimization
  - Shared memory and block-wide cooperation

GPU vs CPU

CPU
- Workloads used to be sequential
  - And still mostly are
- Thus, CPU design is latency-oriented
  - Aim for low latency
  - Make sequentially dependent code run as fast as possible
  - Try to avoid the memory access bottleneck with large caches
- Parallelism almost an afterthought
  - SIMD instruction sets: explicit programmer control
  - Hyperthreading and multiple cores: max speedup 2 × #cores

GPU
- Workloads have always been massively parallel and mostly coherent (vertices, pixels, etc.)
- Thus, GPU design is throughput-oriented
  - Aim for high throughput
  - Finish as many instructions per clock cycle as possible
  - Need a lot of computation on chip, so cannot afford large caches
- Solution: More parallelism
  - Parallelism deeply ingrained in the design
  - Single-threaded execution is not a valid use of the hardware

Multicore CPU: Run ~10 threads fast
[Diagram: a few processors, each with its own memory, connected to global memory]
- Few processors, each supporting 1-2 hardware threads
- Large on-chip caches near processors for hiding latency
- Each thread gets to run instructions close to back-to-back

Manycore GPU: Run ~10,000 threads fast
[Diagram: many processors, each with its own memory, connected to global memory]
- Hundreds of processors, each supporting dozens of hardware threads
- Small on-chip memories near processors
  - Use as explicit local storage, allow thread co-operation
- Hide latency by switching between threads

Analogy
- GPU: mostly payload (threads doing computation); high latency, high throughput
- CPU: mostly boat (keeping few threads going fast); low latency, low throughput

When is using a GPU a good idea? (1)
- Large workload
  - Preferably 10,000s of threads per launch
- Coherent execution
  - Each thread does approximately the same thing
  - No gratuitous branching, pointer chasing, etc.
  - More about this in next week's lecture
- Coherent data access
  - GPUs have very small amounts of cache per thread
  - Spatial reuse is captured well, temporal reuse is captured badly

When is using a GPU a good idea? (2)
- Not all tasks are suitable for GPUs
  - Compilers, PowerPoint, running an OS kernel, etc.
- But many are
  - Image / video / audio processing
  - Deep learning, neural networks
    - AlphaGo, self-driving cars, ...: GPUs are used by everyone using DNNs
  - Supercomputing applications
  - Anything with a lot of number crunching

Questions?

Basics of CUDA

Documentation: docs.nvidia.com/cuda
- Main document: Programming Guide
  - Language extensions
  - Pre-defined variables in kernels
  - Execution model: parallelism, synchronization points, etc.
- API reference: CUDA Runtime API
  - CUDA functions for memory management, etc.
- Plus a lot more
  - Best Practices Guide, Tuning Guides, etc.

Quick note about API flavors
- We are using the CUDA Runtime API on this course
  - Functions are prefixed with cuda..., for example cudaMalloc()
- Don't confuse it with the CUDA Driver API
  - Low-level API that we will NOT use
  - Functions are prefixed with cu..., for example cuMemAlloc()
- Just something to watch out for when seeking information

Median5 walkthrough
- We will look at code snippets from a simple GPU-based Median5 program throughout this part
- Simplified: no error checks, no file I/O, etc.

CUDA machine model
[Diagram: several GPUs, each with its own GPU memory (on the GPU board), connected over the PCIE bus to the CPUs and main memory (on the motherboard)]

CUDA machine model: Terminology
- GPU = CUDA device
- GPU memory (on GPU board) = device memory
- CPUs = host
- Main memory (on motherboard) = host memory

CUDA machine model: Memory
- GPU memory is not directly accessible from the CPU, and vice versa
- They can be made accessible, but it's relatively slow because of the PCIE bus
- Yet sometimes this is the fastest solution, e.g., when data is read exactly once

Manual memory management
- Programs must explicitly
  - Allocate memory buffers on both host and device
  - Transfer data between host and device
- Transferring data takes time
  - There has to be enough work per transferred data to justify the transfer time
  - However, transfers can be overlapped with computation
  - Advanced topic for next week

Median5: Allocating host memory

    int main()
    {
        const int width = 16000;
        const int height = 16000;
        const int size = width * height;

        // Allocate CPU buffers for input and output. Using cudaHostAlloc() gives
        // page-locked memory that is faster to transfer between CPU and GPU than
        // memory allocated with malloc().
        int* inputCPU = 0;
        int* outputCPU = 0;
        cudaHostAlloc((void**)&inputCPU, size * sizeof(int), cudaHostAllocMapped);
        cudaHostAlloc((void**)&outputCPU, size * sizeof(int), cudaHostAllocMapped);

        ... Fill inputCPU[] with something ...

Median5: Allocating device memory

    int main()
    {
        ...
        // Allocate GPU buffers for input and output.
        int* inputGPU = 0;
        int* outputGPU = 0;
        cudaMalloc((void**)&inputGPU, size * sizeof(int));
        cudaMalloc((void**)&outputGPU, size * sizeof(int));

Median5: Copying input data, host → device

    int main()
    {
        ...
        // Copy input buffer from CPU to GPU.
        cudaMemcpy(inputGPU, inputCPU, size * sizeof(int), cudaMemcpyHostToDevice);

        // Now we have:
        // inputCPU  = Host-side buffer containing the input.
        // inputGPU  = Device-side buffer containing the input.
        // outputCPU = Host-side buffer containing garbage.
        // outputGPU = Device-side buffer containing garbage.

Launching a kernel
- GPU work is launched by invoking a large number of threads
- In particular, we launch N thread blocks, each having M threads
  - In CUDA parlance, N = grid size, M = block size
- The function being executed on the GPU is called the kernel

Launch syntax

    kernelFunc<<<gridSize, blockSize>>>(param1, param2, ...)

- kernelFunc is a function annotated with the keyword __global__
  - So the compiler knows to compile it for the GPU
- gridSize and blockSize are integers
- param1 etc. are parameters passed to every launched thread

Median5: Kernel specification
- threadIdx.x = thread index in block
- blockIdx.x = block index in grid
- blockDim.x = block size

    __global__ void medianKernel(int* output, const int* input, const int width, const int height, const int size)
    {
        int idx = threadIdx.x + blockIdx.x * blockDim.x; // Who am I?
        int x = idx % width;    // X coordinate in image
        int y = idx / width;    // Y coordinate in image
        if (y >= height)        // Exit if outside the image.
            return;

        int p0 = x + width * y; // Index of center pixel.
        int a0 = input[p0];     // Read input value at center pixel.

        ... Read other input pixels and compute median ...

        output[p0] = result;    // Write result into output buffer.
    }

Median5: Kernel launch

    __global__ void medianKernel(int* output, const int* input, const int width, const int height, const int size)
    {
        ...
    }

    int main()
    {
        ...
        // Determine block and grid sizes for the kernel launch.
        int szBlock = 64;                            // 64 threads per block.
        int szGrid = (size + szBlock - 1) / szBlock; // Determine number of blocks.

        // Launch the kernel on GPU.
        medianKernel<<<szGrid, szBlock>>>(outputGPU, inputGPU, width, height, size);
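The rounding-up division in the launch code can be isolated into a small helper. The sketch below is only an illustration and not part of the Median5 program: `blocksFor`, `fillIndexKernel`, and `launchCoversAll` are hypothetical names, and error handling is reduced to a boolean result.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: number of blocks needed to cover n elements.
inline int blocksFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize; // Integer division rounded up.
}

// Each thread writes its own global index; surplus threads in the last block exit.
__global__ void fillIndexKernel(int* out, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n)
        return;
    out[idx] = idx;
}

// Launches the kernel over n elements (n > 0) and checks that every element was written.
bool launchCoversAll(int n, int blockSize)
{
    int* outGPU = nullptr;
    size_t bytes = size_t(n) * sizeof(int);
    if (cudaMalloc((void**)&outGPU, bytes) != cudaSuccess)
        return false;

    fillIndexKernel<<<blocksFor(n, blockSize), blockSize>>>(outGPU, n);

    std::vector<int> outCPU(n, -1);
    cudaError_t err = cudaMemcpy(outCPU.data(), outGPU, bytes, cudaMemcpyDeviceToHost);
    cudaFree(outGPU);
    if (err != cudaSuccess)
        return false;

    for (int i = 0; i < n; i++)
        if (outCPU[i] != i)
            return false;
    return true;
}
```

With n = 1000 and 64 threads per block, this launches 16 blocks (1024 threads), and the last 24 threads exit at the bounds check. That surplus is the reason the kernel needs an "outside the data" test at all.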

Why thread blocks?
- Why don't we have a single number telling the number of threads we want to launch?
  - Instead of the current two (grid size, block size)
- Would get rid of annoying calculations when
  - Determining how many blocks to launch
  - Computing the thread index inside the kernel

Why thread blocks
- Thread block is the basic unit when launching work on the GPU
- All threads of one block run concurrently, and on the same core
  - Can co-operate through block-wide shared memory, do block-wide synchronization
  - More about these later
- Each block must be independent
  - No guarantees about execution order or concurrency between blocks
- Block-wide co-operation is essential for many algorithms
  - An example will be shown next week

Thread block sizes
- Say you want to launch 400 × 400 threads, i.e., 160000 in total.
- Can you put all threads in the same block?
  - No. A block has a maximum size of 1024 threads
- Can you create 160000 blocks with 1 thread in each?
  - Don't do it! Technically possible, but hideously slow
  - A block has a practical minimum size of 64 threads
- So what then?
  - Start with blocks of 64 threads, set grid size as needed
  - Try increasing block size (multiples of 64 work best), benchmark

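Note: Thread block access patterns
- In the kernel, we computed x = idx % width and y = idx / width
- We could have done it the other way around, too (BAD!): x = idx / height and y = idx % height
- This can have a dramatic effect on memory coherence!
  - With the first mapping, consecutive threads read consecutive addresses; with the swapped mapping, they read addresses a full image row apart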
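To make the difference concrete, the two index mappings can be written as small functions and compared for neighbouring threads. This is a sketch under the assumption of a row-major image layout, as in the Median5 kernel; the function names are hypothetical.

```cuda
#include <cuda_runtime.h>

// Mapping used in the kernel: consecutive idx values walk along an image row.
// (height is unused here; it is kept so both functions share a signature.)
__host__ __device__ inline int pixelIndexRowWise(int idx, int width, int height)
{
    int x = idx % width;
    int y = idx / width;
    return x + width * y;
}

// Swapped mapping (the BAD variant): consecutive idx values walk down a column.
__host__ __device__ inline int pixelIndexColumnWise(int idx, int width, int height)
{
    int x = idx / height;
    int y = idx % height;
    return x + width * y;
}
```

For a 16000-pixel-wide image, threads 0 and 1 touch elements 0 and 1 under the first mapping, but elements 0 and 16000 under the second, so neighbouring threads no longer share cache lines.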

Note 2: 2D and 3D thread blocks and grids
- Syntactic sugar in the CUDA interface
  - Can specify grid and block sizes as 2D or 3D ranges instead of integers
  - That's why it's threadIdx.x instead of threadIdx
- Suggestion: Don't use these constructs
  - Historically useful, not so anymore
  - You get better control by just using integer sizes and doing the indexing manually

Median5: Copying results, device → host

    int main()
    {
        ...
        // Copy output buffer from GPU to CPU.
        cudaMemcpy(outputCPU, outputGPU, size * sizeof(int), cudaMemcpyDeviceToHost);

        ... Consume the output on CPU, free the buffers ...

        return 0; // Done, exit program.
    }

Done!

Questions?

CUDA language extensions and API

Code organization
- Source files may contain a mixture of host and device code
  - Recommended to use the extension .cu for these
- These source files are compiled with NVIDIA's NVCC toolchain
  - Host code gets compiled with the native C++ compiler (gcc, msvc, ...)
  - Device code gets compiled with NVIDIA tools
- NVCC patches CUDA kernel launches in host code into standard C++
  - Detect GPU architecture, upload relevant binaries to GPU, etc.
- Everything in a .cu file is linked together in a single .o file

Most important language extensions
- Function and variable decorators
  - __global__ : Executes on GPU, callable from CPU (= kernel)
  - __device__ : Executes on GPU, callable from GPU
- Storage specifiers
  - __shared__ : Block-wide shared memory
  - __constant__ : Constant memory
- Special variables in device code
  - threadIdx, blockIdx, blockDim, gridDim, ...
- Kernel launch syntax
  - <<<, >>>

API function error codes
- Every CUDA API function returns an error code
  - Type cudaError_t
  - Value is cudaSuccess (= 0) if no error occurred
- Helpers to convert an error code to a string
  - const char* cudaGetErrorName(cudaError_t error)
  - const char* cudaGetErrorString(cudaError_t error)
- To get the last error (e.g., from a kernel launch)
  - cudaError_t cudaGetLastError(void)
- If something doesn't work, check errors first!
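One common way to act on these error codes is to wrap every API call in a checking helper. The following is a minimal sketch: `cudaOk` and `CUDA_CHECK` are hypothetical names introduced here, not part of the CUDA API, and aborting on the first error is just one possible policy.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: prints a readable message for a failed call
// and reports whether the call succeeded.
inline bool cudaOk(cudaError_t err, const char* what)
{
    if (err == cudaSuccess)
        return true;
    fprintf(stderr, "%s failed: %s (%s)\n",
            what, cudaGetErrorName(err), cudaGetErrorString(err));
    return false;
}

// Hypothetical macro: abort the program on the first failing CUDA call.
#define CUDA_CHECK(call) \
    do { if (!cudaOk((call), #call)) exit(EXIT_FAILURE); } while (0)
```

A call site would then read, for example, `CUDA_CHECK(cudaMalloc((void**)&buffer, bytes));`, where `buffer` and `bytes` are placeholders for the caller's own variables.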

Device management
- cudaGetDeviceCount(int* count)
- cudaGetDeviceProperties(cudaDeviceProp* prop, int device)
- cudaSetDevice(int device)
- Device 0 is selected by default
  - Driver always places the best device in slot 0
  - So, often no need to call cudaSetDevice at all
- Device selection is a low-overhead call
  - Using multiple GPUs from the same CPU thread is possible
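As an illustration of the device management calls, the sketch below enumerates the available devices and prints a few of their properties. The function name `listDevices` is hypothetical; the `cudaDeviceProp` fields used are a small selection of those the runtime provides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints basic properties of every CUDA device.
// Returns the device count, or -1 if a query fails (e.g., no usable device).
int listDevices()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess)
        return -1;

    for (int dev = 0; dev < count; dev++)
    {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
            return -1;
        printf("Device %d: %s, %d multiprocessors, max %d threads per block, "
               "%zu bytes of shared memory per block\n",
               dev, prop.name, prop.multiProcessorCount,
               prop.maxThreadsPerBlock, prop.sharedMemPerBlock);
    }
    return count;
}
```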

Device memory allocation
- cudaMalloc(void** devPtr, size_t size)
- cudaFree(void* devPtr)
- cudaMalloc gives a C/C++ pointer that can't be dereferenced on the host
  - Mixing up host and device pointers → crash
- Valid uses for a device pointer:
  - CUDA API calls (e.g. cudaMemcpy)
  - Using it in a kernel, probably having passed it in as a parameter

Host memory allocation (optimization)
- cudaMallocHost(void** ptr, size_t size)
- cudaFreeHost(void* ptr)
- Returns a host pointer to page-locked memory
- When copying from / to memory allocated with malloc or new[], the driver will automatically page-lock and unlock at every copy
  - This works fine, just takes a bit of time
- With cudaMallocHost, page-locking is done only once
  - Good thing if you have frequent CPU ↔ GPU transfers
- Page-locked memory cannot be swapped out by the OS, so be sure your program behaves nicely

Zero-copy memory
- CPU memory can be accessed directly by the GPU
  - Slow due to every access going through the PCIE bus
- Makes sense only in specific circumstances
  - Data is read / written once, with a good access pattern
  - Sparse access into a large buffer
  - Asynchronous communication

Mapping host memory on device
- cudaHostAlloc(void** pHost, size_t size, unsigned int flags)
- cudaHostRegister(void* ptr, size_t size, unsigned int flags)
- cudaHostGetDevicePointer(void** pDevice, void* pHost, unsigned int flags)
- cudaHostUnregister(void* ptr)
- cudaFreeHost(void* ptr)
- Set flags=cudaHostAllocMapped to get "zero-copy"-eligible memory
- Allocate mapped, page-locked memory
  - Or register (i.e., page-lock + map) an existing range
- Query a device pointer to that range
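The allocate, query, and use sequence for mapped memory might look like the sketch below. It assumes a device that supports mapped host memory; on older setups a call to `cudaSetDeviceFlags(cudaDeviceMapHost)` before any other CUDA work may also be required. The names `addOneKernel` and `zeroCopyAddOne` are hypothetical.

```cuda
#include <cuda_runtime.h>

// Adds 1 to every element; data may point to mapped host memory.
__global__ void addOneKernel(int* data, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        data[idx] += 1;
}

// Fills a mapped host buffer with 0..n-1, increments it from the GPU without
// any cudaMemcpy, and verifies the result on the host. Assumes n > 0.
bool zeroCopyAddOne(int n)
{
    int* hostPtr = nullptr;
    if (cudaHostAlloc((void**)&hostPtr, size_t(n) * sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess)
        return false;
    for (int i = 0; i < n; i++)
        hostPtr[i] = i;

    // Query the device-side pointer that aliases the host buffer.
    int* devPtr = nullptr;
    bool ok = cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0) == cudaSuccess;
    if (ok)
    {
        addOneKernel<<<(n + 63) / 64, 64>>>(devPtr, n);
        // Wait for the kernel before reading the buffer on the host.
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }
    for (int i = 0; ok && i < n; i++)
        ok = (hostPtr[i] == i + 1);

    cudaFreeHost(hostPtr);
    return ok;
}
```

Each element here is read and written exactly once with a contiguous access pattern, which is the kind of situation where zero-copy access can be reasonable.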

Copying memory
- cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind)
- One function for all directions
  - Plain memcpy is of course fine for host → host copy
- dst and src must be either host or device pointers depending on usage
- kind is cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice

Note: CUDA API is mostly asynchronous
- Many CUDA calls, and all kernel launches, are asynchronous
  - Place a request in the GPU command stream
  - Host thread continues running immediately
  - Kernels actually execute when the device frees up
- Implicit synchronization at certain points
  - E.g., memory transfers
- Explicit synchronization is also possible
  - Matters when benchmarking
  - Also, when checking for errors!

Explicit synchronization
- cudaDeviceSynchronize(void)
- Returns after all kernel launches and API calls have been completed
- Checking errors after this will detect if a kernel did something bad
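Because launches are asynchronous, checking a kernel for errors takes two steps: one right after the launch and one after synchronization. The sketch below illustrates this with a hypothetical `launchStatus` helper and an empty kernel; a block size above the 1024-thread limit is used only to provoke a launch failure.

```cuda
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

// Launches one block of the given size and returns the first error seen.
cudaError_t launchStatus(int blockSize)
{
    emptyKernel<<<1, blockSize>>>();

    // Errors detected at launch time (e.g., an invalid launch configuration).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        return err;

    // Errors that occur while the kernel is running show up only after
    // the host has waited for the device.
    return cudaDeviceSynchronize();
}
```

Calling `launchStatus(64)` should succeed on a working setup, whereas `launchStatus(2048)` is expected to report an error without ever running the kernel.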

Benchmarking pitfalls, part 1
1. Start CPU timer
2. Copy data from host to device
3. Launch kernel
4. Copy results from device to host
5. Stop CPU timer
Problem: step 2 may need to wait until a previous kernel launch is complete

Benchmarking pitfalls, part 2
1. Copy data from host to device
2. Start CPU timer
3. Launch kernel
4. Stop CPU timer
5. Copy results from device to host
Problem: Measures only the API overhead of a kernel launch
- Close to constant time regardless of what the kernel does!

Benchmarking pitfalls, part 3
1. Copy data from host to device
2. Call cudaDeviceSynchronize()
3. Start CPU timer
4. Launch kernel
5. Call cudaDeviceSynchronize()
6. Stop CPU timer
7. Copy results from device to host
Problem: Low accuracy for short launches, includes the sync overhead of step 5

Benchmarking done right
- For accurate GPU benchmarking, must use CUDA events
  - Provides the exact time when the GPU passed a given point in the command stream
  - Not going into details now
- Spent time on this only because the asynchronous API often surprises programmers trying to see where the time is spent
- Using an interactive profiler is also an option, but not always feasible
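For readers who want a starting point, event-based timing with the runtime API can be sketched roughly as follows. The kernel and the function name `timeScaleKernelMs` are hypothetical, and error handling is kept minimal.

```cuda
#include <cuda_runtime.h>

__global__ void scaleKernel(float* data, int n, float factor)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        data[idx] *= factor;
}

// Returns the GPU time of one kernel launch in milliseconds, or -1 on error.
// Assumes n > 0.
float timeScaleKernelMs(int n)
{
    float* dataGPU = nullptr;
    if (cudaMalloc((void**)&dataGPU, size_t(n) * sizeof(float)) != cudaSuccess)
        return -1.0f;
    cudaMemset(dataGPU, 0, size_t(n) * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events are placed in the command stream around the launch, so the
    // timestamps are taken by the GPU and not by the host thread.
    cudaEventRecord(start);
    scaleKernel<<<(n + 63) / 64, 64>>>(dataGPU, n, 2.0f);
    cudaEventRecord(stop);

    float ms = -1.0f;
    if (cudaEventSynchronize(stop) == cudaSuccess) // Wait until stop is reached.
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dataGPU);
    return ms;
}
```

In practice one would usually run the kernel a few times before measuring and average over several launches, since the first launch tends to include one-time setup costs.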

Questions?

Basic optimization techniques
- Optimizing memory access patterns
- Shared memory and block-wide cooperation

Memory access patterns
- Just as important in GPU programming as in CPU programming
  - But the rules are slightly different
- Threads in the same block should access nearby memory locations at about the same time
  - This can have a huge effect on performance

Access patterns example
- Example: We use blocks of 64 threads, and want each thread to process 10 elements of a linear array
  - Thus, each block processes 640 elements
- It is best to have each block process a contiguous 640-element chunk of the array, just like on CPU
- Question: Which elements should be processed by each thread in the block?

Improving access patterns with striding

No striding (bad access pattern):

    for (i = 0; i < 10; i++) {
        arrayIdx = blockIdx.x * 640 + threadIdx.x * 10 + i;
        ...
    }

    Thread 0:    0   1   2   3   4   5   6   7   8   9
    Thread 1:   10  11  12  13  14  15  16  17  18  19
    ..
    Thread 63: 630 631 632 633 634 635 636 637 638 639

Stride of 64 (optimal access pattern):

    for (i = 0; i < 10; i++) {
        arrayIdx = blockIdx.x * 640 + threadIdx.x + i * 64;
        ...
    }

    Thread 0:    0  64 128 192 256 320 384 448 512 576
    Thread 1:    1  65 129 193 257 321 385 449 513 577
    ..
    Thread 63:  63 127 191 255 319 383 447 511 575 639

Access patterns, GPU vs CPU
- Note how the optimal access pattern differs from CPU
- CPU: Each thread should process a contiguous chunk of memory
  - Because threads run on different cores, on separate L1 caches
- GPU: Threads in the same block should access contiguous chunks of memory at the same time
  - In reality, many threads in the same block do run in lock-step
  - Memory accesses can be coalesced if they land on the same cache line
  - Potential for a 32× cost difference for each memory fetch!

Address space view

No striding (good for CPU, bad for GPU):

    Address:  0       1       2       ...  10      11      12      ...  20      21      22
    Access:   T0 i=0  T0 i=1  T0 i=2  ...  T1 i=0  T1 i=1  T1 i=2  ...  T2 i=0  T2 i=1  T2 i=2

Stride of 64 (bad for CPU, good for GPU):

    Address:  0       1       2       ...  64      65      66      ...  128     129     130
    Access:   T0 i=0  T1 i=0  T2 i=0  ...  T0 i=1  T1 i=1  T2 i=1  ...  T0 i=2  T1 i=2  T2 i=2
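The two indexing schemes from the striding example can be written out as a complete kernel. This is a sketch that assumes 64 threads per block and 10 elements per thread, as in the example; all function and constant names are hypothetical, and doubling the values merely stands in for real per-element work.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 64; // Threads per block.
constexpr int kPerThread = 10; // Elements processed by each thread.

// No striding: each thread owns 10 consecutive elements (bad on GPU).
__host__ __device__ inline int contiguousIndex(int block, int thread, int i)
{
    return block * kBlockSize * kPerThread + thread * kPerThread + i;
}

// Stride of 64: in every iteration the block touches 64 consecutive elements.
__host__ __device__ inline int stridedIndex(int block, int thread, int i)
{
    return block * kBlockSize * kPerThread + thread + i * kBlockSize;
}

// Doubles every element using the strided (coalescing-friendly) pattern.
__global__ void doubleStridedKernel(int* data, int n)
{
    for (int i = 0; i < kPerThread; i++)
    {
        int arrayIdx = stridedIndex(blockIdx.x, threadIdx.x, i);
        if (arrayIdx < n)
            data[arrayIdx] *= 2;
    }
}

// Host wrapper: doubles all values in place; returns false on a CUDA error.
bool doubleAll(std::vector<int>& values)
{
    int n = (int)values.size();
    if (n == 0)
        return true;
    size_t bytes = size_t(n) * sizeof(int);
    int* dataGPU = nullptr;
    if (cudaMalloc((void**)&dataGPU, bytes) != cudaSuccess)
        return false;
    cudaMemcpy(dataGPU, values.data(), bytes, cudaMemcpyHostToDevice);

    int perBlock = kBlockSize * kPerThread; // 640 elements per block.
    doubleStridedKernel<<<(n + perBlock - 1) / perBlock, kBlockSize>>>(dataGPU, n);

    cudaError_t err = cudaMemcpy(values.data(), dataGPU, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dataGPU);
    return err == cudaSuccess;
}
```

Swapping `stridedIndex` for `contiguousIndex` in the kernel produces the same result, so timing both variants on a large array is a simple way to observe the effect of the access pattern.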

Shared memory
- Specified using the __shared__ keyword
- Not a per-thread resource but a per-block resource
  - Completely unlike ordinary variables
  - All threads in a block see the same variable / array
- To avoid race conditions, threads in a block must often synchronize between accesses to shared memory
  - There are no guarantees that all threads in a block progress at exactly the same speed

Shared memory, continued
- Shared memory is physically close to the execution units
  - Much faster than ordinary device memory
- It is also a limited resource
  - One limiter for how many threads the GPU can have resident at once
  - If you use way too much shared memory (>48 KB per thread block), the kernel cannot be launched at all

Shared memory, typical scenario
1. Each thread moves a piece of data from device memory into shared memory
2. Threads in the block synchronize to ensure data is valid
3. Each thread looks at a window of data in shared memory
   - Read-only access poses no hazards, so no need to synchronize
4. Each thread writes its result into its own location in device memory

Block-wide synchronization
- Achieved with a call to __syncthreads() in the kernel
- Calling threads will stall until all non-terminated threads have reached that call. After that, all threads in the thread block are released.
  - If a thread has been terminated by return; it is not waited for
- If some threads call __syncthreads() and some do not, the kernel will almost certainly hang
  - Conditional synchronization is allowed only if all threads in the block agree on the condition
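The typical shared memory scenario (load, synchronize, read a window, write) can be sketched with a three-element sum. This example is an illustration, not code from the lecture: the names `sum3Kernel` and `sum3` are hypothetical, the block size is fixed at 64, and out-of-range neighbours are treated as 0. The kernel guards its loads and stores instead of returning early, so every thread reaches `__syncthreads()`.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 64; // The kernel must be launched with this block size.

// out[i] = in[i-1] + in[i] + in[i+1], with out-of-range neighbours counted as 0.
__global__ void sum3Kernel(int* out, const int* in, int n)
{
    __shared__ int tile[kBlock + 2]; // Block's elements plus one halo cell per side.
    int t = threadIdx.x;
    int idx = t + blockIdx.x * blockDim.x;

    // Step 1: each thread moves one element into shared memory;
    // the first and last thread also fetch the halo cells.
    tile[t + 1] = (idx < n) ? in[idx] : 0;
    if (t == 0)
        tile[0] = (idx >= 1 && idx - 1 < n) ? in[idx - 1] : 0;
    if (t == kBlock - 1)
        tile[kBlock + 1] = (idx + 1 < n) ? in[idx + 1] : 0;

    // Step 2: wait until the whole tile is valid.
    __syncthreads();

    // Steps 3 and 4: read a window from shared memory, write one result.
    if (idx < n)
        out[idx] = tile[t] + tile[t + 1] + tile[t + 2];
}

// Host wrapper: returns the three-element sums, or an empty vector on error.
std::vector<int> sum3(const std::vector<int>& in)
{
    int n = (int)in.size();
    std::vector<int> out(n);
    if (n == 0)
        return out;
    size_t bytes = size_t(n) * sizeof(int);
    int* inGPU = nullptr;
    int* outGPU = nullptr;
    if (cudaMalloc((void**)&inGPU, bytes) != cudaSuccess)
        return {};
    if (cudaMalloc((void**)&outGPU, bytes) != cudaSuccess)
    {
        cudaFree(inGPU);
        return {};
    }
    cudaMemcpy(inGPU, in.data(), bytes, cudaMemcpyHostToDevice);
    sum3Kernel<<<(n + kBlock - 1) / kBlock, kBlock>>>(outGPU, inGPU, n);
    cudaError_t err = cudaMemcpy(out.data(), outGPU, bytes, cudaMemcpyDeviceToHost);
    cudaFree(inGPU);
    cudaFree(outGPU);
    return (err == cudaSuccess) ? out : std::vector<int>();
}
```

Each input element is fetched from device memory about once per block and then read three times from shared memory, which is the kind of reuse that makes the extra synchronization worthwhile.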

Shared memory, final thoughts
- Shared memory is a block-wide construct
  - It is useful and necessary only when doing cooperation between threads in a thread block
- If all your threads are truly independent, you probably don't want to use shared memory
  - Hence no need for block-wide synchronization either
- Useful for some applications, useless for some
  - The Programming Guide has an example of matrix multiplication that takes advantage of shared memory

Next week
- GPU architecture
  - How the GPU actually executes code
  - Spoiler: It's almost like SIMD but more powerful
- More optimization techniques

Thank you! Questions?