Learn CUDA in an Afternoon
Alan Gray, EPCC, The University of Edinburgh

Overview
- Introduction to CUDA
- Practical Exercise 1: Getting started with CUDA
- GPU Optimisation
- Practical Exercise 2: Optimising a CUDA Application

Introduction
- Graphics Processing Units (GPUs) offer higher performance than CPUs.
- More of the silicon on the chip is dedicated to computation: many cores in each GPU chip.
- Graphics memory (GDDR) has higher bandwidth than the DRAM memory used by CPUs.
- GPUs cannot be used alone; they must be used in combination with a CPU.
- GPUs accelerate the computationally demanding sections of code, which we call kernels. Kernels are decomposed to run in parallel on the multiple cores.
- The CPU and GPU each have their own memory spaces.

[Diagram: CPU and GPU, each with its own memory, connected by a bus. The main program code runs on the CPU; the key kernel code runs on the GPU.]

NVIDIA CUDA
Traditional languages alone are not sufficient for programming GPUs. CUDA is an extension to C/C++ that allows programming of NVIDIA GPUs:
- language extensions for defining kernels
- API functions for memory management

Stream Computing
- Data set decomposed into a stream of elements.
- A single computational function (the kernel) operates on each element; a thread is defined as the execution of the kernel on one data element.
- Multiple cores can process multiple elements in parallel, i.e. many threads running in parallel.
- Suitable for data-parallel problems.

Hardware
[Diagram: as before, CPU and GPU connected by a bus, but the GPU is now shown as multiple Streaming Multiprocessors (SMs) with a shared memory in each.]

NVIDIA GPUs have a 2-level hierarchy:
- Multiple Streaming Multiprocessors (SMs), each with multiple cores.
- The number of SMs, and cores per SM, varies across generations.

In CUDA, this is abstracted as a Grid of Thread Blocks:
- The multiple blocks in a grid map onto the multiple SMs.
- Each block in a grid contains multiple threads, mapping onto the cores in an SM.
We don't need to know the exact details of the hardware (number of SMs, cores per SM). Instead, oversubscribe, and the system will perform the scheduling automatically:
- Use more blocks than SMs, and more threads than cores.
- The same code will be portable and efficient across different GPU versions.

CUDA dim3 type
CUDA introduces a new dim3 type, which simply contains a collection of 3 integers, corresponding to each of the X, Y and Z directions:

    dim3 my_xyz_values(xvalue, yvalue, zvalue);

The X component can be accessed as my_xyz_values.x, and similarly for Y and Z. E.g. for

    dim3 my_xyz_values(6, 4, 12);

my_xyz_values.z has the value 12.


Analogy
You check in to the hotel, as do your classmates. Rooms are allocated in order. The receptionist realises the hotel is less than half full and decides you should all move from your room number i to room number 2i, so that no one has a neighbour to disturb them.

Serial solution: the receptionist works out each new room number in turn.

Parallel solution: everybody checks their room number, multiplies it by 2, and moves to that room.

Serial solution in code:

    for (i = 0; i < N; i++) {
        result[i] = 2*i;
    }

We can parallelise this by assigning each iteration to a separate CUDA thread.

CUDA C Example

    __global__ void myKernel(int *result) {
        int i = threadIdx.x;
        result[i] = 2*i;
    }

- Replace the loop with a function.
- Add the __global__ specifier, to specify that this function is to form a GPU kernel.
- Use internal CUDA variables to specify array indices: threadIdx.x is an internal variable unique to each thread in a block. It is the X component of a dim3 type; since our problem is 1D, we are not using the Y or Z components (more later).

CUDA C Example
We launch this kernel by calling the function on multiple CUDA threads using the <<< >>> syntax:

    dim3 blocksPerGrid(1,1,1);   // use only one block
    dim3 threadsPerBlock(N,1,1); // use N threads in the block
    myKernel<<<blocksPerGrid, threadsPerBlock>>>(result);

CUDA Example
The previous example only uses 1 block, i.e. only 1 SM on the GPU, so performance will be very poor. In practice, we need to use multiple blocks to utilise all SMs, e.g.:

    __global__ void myKernel(int *result) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        result[i] = 2*i;
    }
    ...
    dim3 blocksPerGrid(N/256,1,1); // assuming 256 divides N exactly
    dim3 threadsPerBlock(256,1,1);
    myKernel<<<blocksPerGrid, threadsPerBlock>>>(result);
    ...

We have chosen to use 256 threads per block, which is typically a good number (see practical).

CUDA C Example
A more realistic 1D example: vector addition.

    __global__ void vectorAdd(float *a, float *b, float *c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] = a[i] + b[i];
    }
    ...
    dim3 blocksPerGrid(N/256,1,1); // assuming 256 divides N exactly
    dim3 threadsPerBlock(256,1,1);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
    ...
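The examples above assume that 256 divides N exactly. A common variant (a minimal sketch, not from the original slides) rounds the number of blocks up and guards against the surplus threads in the last block:

    __global__ void vectorAdd(float *a, float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)               // guard: the last block may contain threads beyond the array end
            c[i] = a[i] + b[i];
    }
    ...
    dim3 threadsPerBlock(256,1,1);
    dim3 blocksPerGrid((N+255)/256,1,1); // round up so every element is covered
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, N);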

CUDA C Internal Variables
For a 1D decomposition (e.g. the previous examples):
- blockDim.x: number of threads per block (takes the value 256 in the previous example).
- threadIdx.x: unique to each thread in a block (ranges from 0 to 255 in the previous example).
- blockIdx.x: unique to every block in the grid (ranges from 0 to N/256-1 in the previous example).

2D Example
2D or 3D CUDA decompositions are also possible, e.g. for matrix addition (2D):

    __global__ void matrixAdd(float a[N][N], float b[N][N], float c[N][N]) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        c[i][j] = a[i][j] + b[i][j];
    }

    int main() {
        dim3 blocksPerGrid(N/16,N/16,1); // (N/16)x(N/16) blocks/grid (2D)
        dim3 threadsPerBlock(16,16,1);   // 16x16=256 threads/block (2D)
        matrixAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
    }

Memory Management - allocation
- The GPU has a separate memory space from the host CPU.
- Data accessed in kernels must be in GPU memory.
- We need to manage GPU memory and copy data to and from it explicitly.
- cudaMalloc is used to allocate GPU memory; cudaFree releases it again:

    float *a;
    cudaMalloc(&a, N*sizeof(float));
    cudaFree(a);

Memory Management - cudaMemcpy
Once we've allocated GPU memory, we need to be able to copy data to and from it. cudaMemcpy does this:

    cudaMemcpy(array_device, array_host, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(array_host, array_device, N*sizeof(float), cudaMemcpyDeviceToHost);

The first argument always corresponds to the destination of the transfer. Transfers between host and device memory are relatively slow and can become a bottleneck, so they should be minimised where possible.
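Putting the pieces so far together, here is a minimal sketch (not from the slides) of a complete vector-addition program: allocate device memory, copy the inputs across, launch the kernel, copy the result back, and free everything. The problem size N and the host initialisation are assumptions for illustration.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define N 2048  // assumed problem size; a multiple of 256 for simplicity

    __global__ void vectorAdd(float *a, float *b, float *c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] = a[i] + b[i];
    }

    int main(void) {
        float *a_h, *b_h, *c_h;          // host copies
        float *a_d, *b_d, *c_d;          // device copies
        size_t size = N * sizeof(float);

        a_h = (float *)malloc(size); b_h = (float *)malloc(size); c_h = (float *)malloc(size);
        for (int i = 0; i < N; i++) { a_h[i] = i; b_h[i] = 2*i; }

        // allocate device memory and copy the inputs host -> device
        cudaMalloc(&a_d, size); cudaMalloc(&b_d, size); cudaMalloc(&c_d, size);
        cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
        cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);

        // launch the kernel, then copy the result device -> host
        dim3 threadsPerBlock(256,1,1);
        dim3 blocksPerGrid(N/256,1,1);
        vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a_d, b_d, c_d);
        cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);  // blocking: also waits for the kernel

        printf("c[1] = %f\n", c_h[1]);   // expect 3.0

        cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
        free(a_h); free(b_h); free(c_h);
        return 0;
    }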

Synchronisation between host and device
Kernel calls are non-blocking: the host program continues immediately after it calls the kernel. This allows overlap of computation on the CPU and GPU. Use cudaThreadSynchronize() to wait for the kernel to finish:

    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
    // do work on host (that doesn't depend on c)
    cudaThreadSynchronize(); // wait for kernel to finish

Standard cudaMemcpy calls are blocking; non-blocking variants exist.

Synchronisation between CUDA threads
Within a kernel, to synchronise between threads in the same block, use the __syncthreads() call. Threads in the same block can therefore communicate through memory spaces that they share. E.g., assuming x is local to each thread and array is in a shared memory space:

    if (threadIdx.x == 0) array[0] = x;
    __syncthreads();
    if (threadIdx.x == 1) x = array[0];

It is not possible to communicate between different blocks in a kernel: you must instead exit the kernel and start a new one.
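The slides refer to "a shared memory space" without showing how one is declared. As a minimal sketch (an illustration, not from the slides; kernel and array names are hypothetical), a per-block shared array is declared with the __shared__ qualifier. Here each block reverses its own 256-element portion of an array, using __syncthreads() so that no thread reads an element before it has been written:

    __global__ void reverseWithinBlock(float *d_in, float *d_out) {
        __shared__ float s[256];            // one copy per block, visible to all threads in the block
        int t = threadIdx.x;
        int i = blockIdx.x * blockDim.x + t;

        s[t] = d_in[i];                     // each thread stages one element in shared memory
        __syncthreads();                    // wait until the whole block has written

        d_out[i] = s[blockDim.x - 1 - t];   // now safe to read another thread's element
    }

This assumes 256 threads per block, matching the size of the shared array.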

Compiling CUDA Code
CUDA code is compiled using nvcc:

    nvcc -o example example.cu
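As with an ordinary C compiler, the usual options can be added, for example an optimisation level and a target architecture (the sm_70 value here is an assumption and should match your GPU):

    nvcc -O3 -arch=sm_70 -o example example.cu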

Overview
- Introduction to CUDA
- Practical Exercise 1: Getting started with CUDA (see the Practical PDF document)
- GPU Optimisation
- Practical Exercise 2: Optimising a CUDA Application

GPU performance inhibitors
- Copying data to/from the device
- Device under-utilisation / GPU memory latency
- GPU memory bandwidth
- Code branching
This lecture will address each of these and advise how to maximise performance. It concentrates on NVIDIA, but many concepts are transferable to e.g. AMD.

Host-Device Data Copy
The CPU (host) and GPU (device) have separate memories. All data read or written on the device must be copied to/from the device (over the PCIe bus). This is very expensive, so we must try to minimise copies and keep data resident on the device. That may involve porting more routines to the device, even if they are not computationally expensive. It might even be quicker to calculate something from scratch on the device instead of copying it from the host.

Data copy optimisation example
Original:

    Loop over timesteps
        inexpensive_routine_on_host(data_on_host)
        copy data from host to device
        expensive_routine_on_device(data_on_device)
        copy data from device to host
    End loop over timesteps

Port the inexpensive routine to the device and move the data copies outside of the loop:

    copy data from host to device
    Loop over timesteps
        inexpensive_routine_on_device(data_on_device)
        expensive_routine_on_device(data_on_device)
    End loop over timesteps
    copy data from device to host
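In CUDA C, the restructured version might look like the following fragment. The kernel names, data pointers, sizes and timestep count are hypothetical placeholders for whatever the application actually does:

    // placeholders: inexpensiveKernel, expensiveKernel, data_d, data_h, size, nsteps
    cudaMemcpy(data_d, data_h, size, cudaMemcpyHostToDevice);   // one copy in, before the loop
    for (int step = 0; step < nsteps; step++) {
        inexpensiveKernel<<<blocksPerGrid, threadsPerBlock>>>(data_d);
        expensiveKernel<<<blocksPerGrid, threadsPerBlock>>>(data_d);
    }
    cudaMemcpy(data_h, data_d, size, cudaMemcpyDeviceToHost);   // one copy out, after the loop

Kernels launched on the same stream execute in order, and the final blocking cudaMemcpy waits for them, so the data stays resident on the device for the whole loop.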

Exposing parallelism
GPU performance relies on the parallel use of many threads, and the degree of parallelism is much higher than on a CPU. Effort must be made to expose as much parallelism as possible within the application; this may involve rewriting/refactoring. If significant sections of code remain serial, the effectiveness of GPU acceleration will be limited (Amdahl's law).

Occupancy and Memory Latency Hiding
- The programmer decomposes loops in the code to threads.
- Obviously, there must be at least as many total threads as cores, otherwise cores will be left idle.
- For best performance, we actually want #threads >> #cores.
- Accesses to GPU memory have several hundred cycles of latency: when a thread stalls waiting for data, this latency can be hidden if another thread can switch in.
- NVIDIA GPUs have very fast thread switching, and support many concurrent threads.

Exposing parallelism example
Original code:

    Loop over i from 1 to 512
        Loop over j from 1 to 512
            independent iteration

1D decomposition (512 threads):

    Calc i from thread/block ID
    Loop over j from 1 to 512
        independent iteration

2D decomposition (262,144 threads):

    Calc i & j from thread/block ID
    independent iteration
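A minimal CUDA C sketch of the 2D decomposition (an illustration; the kernel name and the per-iteration work are hypothetical), launched with 16x16 thread blocks covering the 512x512 range:

    __global__ void iterate2D(/* application data */) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 511
        int i = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. 511
        // work(i, j);  // each of the 262,144 independent iterations becomes one thread
    }

    dim3 threadsPerBlock(16,16,1);
    dim3 blocksPerGrid(512/16, 512/16, 1);   // 32x32 blocks
    iterate2D<<<blocksPerGrid, threadsPerBlock>>>();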

Memory coalescing
GPUs have high peak memory bandwidth, but the maximum is only achieved when data is accessed for multiple threads in a single transaction: memory coalescing. To achieve this, ensure that consecutive threads access consecutive memory locations. Otherwise, memory accesses are serialised, significantly degrading performance. Adapting code to allow coalescing can dramatically improve performance.

Memory coalescing example
Consecutive threads are those with consecutive threadIdx.x values. Do consecutive threads access consecutive memory locations?

    index = blockIdx.x*blockDim.x + threadIdx.x;
    output[index] = 2*input[index];

Coalesced: consecutive threadIdx values correspond to consecutive index values.

Memory coalescing examples
Do consecutive threads read consecutive memory locations? In C, the rightmost (last) index runs fastest in memory: j here.

    i = blockIdx.x*blockDim.x + threadIdx.x;
    for (j=0; j<N; j++)
        output[i][j] = 2*input[i][j];

Not coalesced: consecutive threadIdx.x corresponds to consecutive i values.

    j = blockIdx.x*blockDim.x + threadIdx.x;
    for (i=0; i<N; i++)
        output[i][j] = 2*input[i][j];

Coalesced: consecutive threadIdx.x corresponds to consecutive j values.

Memory coalescing examples
What about when using 2D or 3D CUDA decompositions? The same procedure applies: the X component of threadIdx is always the one that increments with consecutive threads. E.g., for matrix addition, coalescing is achieved as follows:

    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    c[i][j] = a[i][j] + b[i][j];

Code Branching
On NVIDIA GPUs, there are fewer instruction scheduling units than cores. Threads are scheduled in groups of 32, called a warp. Threads within a warp must execute the same instruction in lock-step (on different data elements). The CUDA programming model allows branching, but this results in all cores following all branches, with only the required results saved. This is obviously suboptimal, so intra-warp branching must be avoided wherever possible (especially in key computational sections).

Branching example
E.g. you want to split your threads into 2 groups:

    i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i%2 == 0)
        ...
    else
        ...

Threads within a warp diverge.

    i = blockIdx.x*blockDim.x + threadIdx.x;
    if ((i/32)%2 == 0)
        ...
    else
        ...

Threads within a warp follow the same path.

CUDA Profiling
Simply set the COMPUTE_PROFILE environment variable to 1. A log file, e.g. cuda_profile_0.log, is created at runtime with timing information for kernels and data transfer:

    # CUDA_PROFILE_LOG_VERSION 2.0
    # CUDA_DEVICE 0 Tesla M1060
    # CUDA_CONTEXT 1
    # TIMESTAMPFACTOR fffff6e2e9ee8858
    method,gputime,cputime,occupancy
    method=[ memcpyHtoD ] gputime=[ 37.952 ] cputime=[ 86.000 ]
    method=[ memcpyHtoD ] gputime=[ 37.376 ] cputime=[ 71.000 ]
    method=[ memcpyHtoD ] gputime=[ 37.184 ] cputime=[ 57.000 ]
    method=[ _Z23inverseEdgeDetect1D_colPfS_S_ ] gputime=[ 253.536 ] cputime=[ 13.000 ] occupancy=[ 0.250 ]
    ...

It is possible to output more metrics (cache misses etc.). See the doc/compute_profiler.txt file in the main CUDA installation.
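For example, assuming a bash-like shell and the executable built earlier:

    export COMPUTE_PROFILE=1
    ./example
    cat cuda_profile_0.log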

Overview
- Introduction to CUDA
- Practical Exercise 1: Getting started with CUDA
- GPU Optimisation
- Practical Exercise 2: Optimising a CUDA Application (see the Practical PDF document)