
CUDA Exercises: CUDA Programming Model
05.05.2015
Lukas Cavigelli, ETZ E 9 / ETZ D 61.1
Integrated Systems Laboratory

Exercises
1. Enumerate GPUs
2. Hello World: a first CUDA kernel
3. Vector Addition: threads and blocks
4. 1D Stencil: benefits and usage of shared memory on the GPU
5. Matrix Multiplication: assigning the right data index to threads
6. Matrix Transpose: coalescing and memory accesses without bank conflicts

Get the code and environment setup
Copy and extract the source code of the lab exercises:
  $ tar xvfp ~soc_master/4_cuda/ex.tar
  $ cd ex
Documentation:
  $ evince ~soc_master/4_cuda/doc/cuda_c_programming_guide_4.2.pdf &
A newer version, including all libraries and packages, is available at http://docs.nvidia.com/cuda
Now we are ready to start!
Reminder: complete the lecture evaluation survey!

Exercise 1: Enumerate GPUs
This is a very simple application that displays the properties of and information about the GPU in your workstation.
We will use this simple exercise to:
  get familiar with the framework
  learn how to compile and launch applications
  review the application output and code
Follow these simple instructions:
  $ cd CUDA_EXERCISES
  $ cd 1.Enumerate_GPUs      (enter the application folder)
  $ make                     (compile the source code)
  $ ./enum_gpu.x             (launch the application)

Reviewing the application output
The output should look like the listing below. Take your time to review the different fields.

--- General Information for device 0 ---
Name:                     xxxx
Compute capability:       xxxx
Clock rate:               xxxx
Device copy overlap:      xxxx
Kernel execution timeout: xxxx
--- Memory Information for device 0 ---
Total global mem:         xxxx
Total constant mem:       xxxx
Max mem pitch:            xxxx
Texture alignment:        xxxx
--- MP Information for device 0 ---
Multiprocessor count:     xxxx
Shared mem per mp:        xxxx
Registers per mp:         xxxx
Threads in warp:          xxxx
Max threads per block:    xxxx
Max thread dimensions:    (xxxx, xxxx, xxxx)
Max grid dimensions:      (xxxx, xxxx, xxxx)

Reviewing the application code
Open and review the file enum_gpu.cu. It uses the Device Management API.
An application can query and select GPUs:
  cudaGetDeviceCount(int *count)
  cudaSetDevice(int device)
  cudaGetDevice(int *device)
  cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
Multiple host threads can share a device, and a single host thread can manage multiple devices:
  cudaSetDevice(i) to select the current device
  cudaMemcpy(...) for peer-to-peer copies
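To illustrate these calls, a device enumeration loop typically looks like the following. This is a minimal sketch, not the contents of enum_gpu.cu:

  #include <stdio.h>
  #include <cuda_runtime.h>

  int main(void) {
      int count = 0;
      cudaGetDeviceCount(&count);               // number of CUDA-capable GPUs
      for (int dev = 0; dev < count; dev++) {
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, dev);  // fill in the property struct
          printf("Device %d: %s, compute capability %d.%d\n",
                 dev, prop.name, prop.major, prop.minor);
      }
      return 0;
  }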

Reviewing the application code
It also uses the Error Reporting API. All CUDA API calls return an error code (cudaError_t), which can indicate:
  an error in the API call itself, OR
  an error in an earlier asynchronous operation (e.g. a kernel)
Get the error code for the last error:
  cudaError_t cudaGetLastError(void)
Get a string describing the error:
  const char *cudaGetErrorString(cudaError_t)
Example:
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
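A common convenience is to wrap every API call in a checking macro. This is a sketch of a widely used pattern, not part of the exercise code:

  #include <stdio.h>
  #include <stdlib.h>
  #include <cuda_runtime.h>

  #define CUDA_CHECK(call)                                           \
      do {                                                           \
          cudaError_t err = (call);                                  \
          if (err != cudaSuccess) {                                  \
              fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                      __FILE__, __LINE__, cudaGetErrorString(err));  \
              exit(EXIT_FAILURE);                                    \
          }                                                          \
      } while (0)

  // Usage: CUDA_CHECK(cudaMalloc((void **)&d_a, size));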

Exercise 2: Hello World
Hello World is the typical basic application. Move to the Hello World application folder and compile:
  $ cd ..
  $ cd 2.Hello_World
  $ make
This application project contains two Hello World versions:
  hello_world.cu: CPU version
  simple_kernel.cu: GPU version

CPU: Hello World
Let's review the code by opening the hello_world.cu source file:

  int main(void) {
      printf("Hello World!\n");
      return 0;
  }

This is standard C that runs on the host. The NVIDIA compiler (nvcc) can be used to compile programs with no device code.
To run this version, type the following command:
  $ ./hello_world.x

GPU: Hello World with Device Code
Let's review the code by opening the simple_kernel.cu source file:

  __global__ void kernel(void) {
  }

  int main(void) {
      kernel<<<1,1>>>();
      printf("Hello World!\n");
      return 0;
  }

The CUDA C/C++ keyword __global__ indicates a function that:
  runs on the device
  is called from host code
nvcc separates the source code into host and device components:
  device functions (e.g. kernel()) are processed by the NVIDIA compiler
  host functions (e.g. main()) are processed by the standard host compiler (gcc, cl.exe)
Triple angle brackets mark a call from host code to device code, also called a kernel launch. We'll return to the parameters (1,1) in a moment.
That's all that is required to execute a function on the GPU! Note that kernel() does nothing.
To run this version, type the following command:
  $ ./simple_kernel.x

Parallel Programming in CUDA C/C++
But wait... GPU computing is about massive parallelism! We need a more interesting example.
We'll start by adding two integers and build up to vector addition.
Move to the 3.Vectors_Addition application folder and compile with the make command. You will get several application executables:
  add_simple.x
  add_simple_blocks.x
  add_simple_threads.x
  add_simple_last.x

Exercise 3: Vector Addition on the Device
Open and review the file add_simple.cu. It contains a simple kernel to add two integers:

  __global__ void add(int *a, int *b, int *c) {
      *c = *a + *b;
  }

As before, __global__ is a CUDA C/C++ keyword meaning that:
  add() will execute on the device
  add() will be called from the host
Note that we use pointers for the variables: add() runs on the device, so a, b and c must point to device memory. We therefore need to allocate memory on the GPU.

Memory Management
Host and device memory are separate entities.
Device pointers point to GPU memory:
  they may be passed to/from host code
  they may not be dereferenced in host code
Host pointers point to CPU memory:
  they may be passed to/from device code
  they may not be dereferenced in device code
A simple CUDA API handles device memory:
  cudaMalloc(), cudaFree(), cudaMemcpy()
  similar to the C equivalents malloc(), free(), memcpy()

Addition on the Device: main()

  int main(void) {
      int a, b, c;              // host copies of a, b, c
      int *d_a, *d_b, *d_c;     // device copies of a, b, c
      int size = sizeof(int);

      // Allocate space for device copies of a, b, c
      cudaMalloc((void **)&d_a, size);
      cudaMalloc((void **)&d_b, size);
      cudaMalloc((void **)&d_c, size);

      // Setup input values
      a = 2;
      b = 7;

Addition on the Device: main()

      // Copy inputs to device
      cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
      cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

      // Launch add() kernel on GPU
      add<<<1,1>>>(d_a, d_b, d_c);

      // Copy result back to host
      cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

      // Cleanup
      cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
      return 0;
  }

Moving to Parallel
Open and review the file add_simple_blocks.cu.
GPU computing is about massive parallelism. So how do we run code in parallel on the device?
  add<<< 1, 1 >>>();   becomes   add<<< N, 1 >>>();
Instead of executing add() once, we execute it N times in parallel.

Vector Addition on the Device
With add() running in parallel we can do vector addition.
Terminology: each parallel invocation of add() is referred to as a block; the set of blocks is referred to as a grid. Each invocation can refer to its block index using blockIdx.x:

  __global__ void add(int *a, int *b, int *c) {
      c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
  }

By using blockIdx.x to index into the array, each block handles a different element of the array.

Vector Addition on the Device

  __global__ void add(int *a, int *b, int *c) {
      c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
  }

On the device, each block can execute in parallel.

Vector Addition on the Device: main()

  #define N 512
  int main(void) {
      int *a, *b, *c;                // host copies of a, b, c
      int *dev_a, *dev_b, *dev_c;    // device copies of a, b, c
      int size = N * sizeof(int);

      // Alloc space for device copies of a, b, c
      cudaMalloc((void **)&dev_a, size);
      cudaMalloc((void **)&dev_b, size);
      cudaMalloc((void **)&dev_c, size);

      // Alloc space for host copies of a, b, c and setup input values
      a = (int *)malloc(size); random_ints(a, N);
      b = (int *)malloc(size); random_ints(b, N);
      c = (int *)malloc(size);

Vector Addition on the Device: main()

      // Copy inputs to device
      cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
      cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

      // Launch add() kernel on GPU with N blocks
      add<<<N,1>>>(dev_a, dev_b, dev_c);

      // Copy result back to host
      cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

      // Cleanup
      free(a); free(b); free(c);
      cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
      return 0;
  }

What we have learnt so far
Difference between host and device:
  host = CPU, device = GPU
Using __global__ to declare a function as device code:
  executes on the device
  is called from the host
Passing parameters from host code to a device function.
Basic device memory management:
  cudaMalloc(), cudaMemcpy(), cudaFree()
Launching parallel kernels:
  launch N copies of add() with add<<<N,1>>>(...)
  use blockIdx.x to access the block index

CUDA Threads
Open and review the file add_simple_threads.cu.
Terminology: a block can be split into parallel threads.
Let's change add() to use parallel threads instead of parallel blocks:

  __global__ void add(int *a, int *b, int *c) {
      c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
  }

We use threadIdx.x instead of blockIdx.x. We also need to make one change in main().

Vector Addition Using Threads: main()

  #define N 512
  int main(void) {
      int *a, *b, *c;                // host copies of a, b, c
      int *dev_a, *dev_b, *dev_c;    // device copies of a, b, c
      int size = N * sizeof(int);

      // Alloc space for device copies of a, b, c
      cudaMalloc((void **)&dev_a, size);
      cudaMalloc((void **)&dev_b, size);
      cudaMalloc((void **)&dev_c, size);

      // Alloc space for host copies of a, b, c and setup input values
      a = (int *)malloc(size); random_ints(a, N);
      b = (int *)malloc(size); random_ints(b, N);
      c = (int *)malloc(size);

Vector Addition Using Threads: main()

      // Copy inputs to device
      cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
      cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

      // Launch add() kernel on GPU with N threads
      add<<<1,N>>>(dev_a, dev_b, dev_c);

      // Copy result back to host
      cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

      // Cleanup
      free(a); free(b); free(c);
      cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
      return 0;
  }

Combining Blocks and Threads
Open and review the file add_simple_last.cu.
We have seen parallel vector addition using:
  several blocks with one thread each
  one block with several threads
Let's adapt vector addition to use both blocks and threads.

Indexing Arrays with Blocks and Threads
Indexing is no longer as simple as using blockIdx.x or threadIdx.x alone.
Consider indexing an array with one element per thread (8 threads per block).
With M threads per block, a unique index for each thread is given by:
  int index = threadIdx.x + blockIdx.x * M;

Indexing Arrays: Example
Which thread will operate on the red element of the figure? With 8 threads per block, it is the thread with threadIdx.x = 5 in the block with blockIdx.x = 2:
  int index = threadIdx.x + blockIdx.x * M
            = 5 + 2 * 8
            = 21;

Vector Addition with Blocks and Threads
Use the built-in variable blockDim.x for the number of threads per block:
  int index = threadIdx.x + blockIdx.x * blockDim.x;
Combined version of add() using parallel threads and parallel blocks:

  __global__ void add(int *a, int *b, int *c) {
      int index = threadIdx.x + blockIdx.x * blockDim.x;
      c[index] = a[index] + b[index];
  }

What changes need to be made in main()?

Addition with Blocks and Threads: main()

  #define N 512
  int main(void) {
      int *a, *b, *c;                // host copies of a, b, c
      int *dev_a, *dev_b, *dev_c;    // device copies of a, b, c
      int size = N * sizeof(int);

      // Alloc space for device copies of a, b, c
      cudaMalloc((void **)&dev_a, size);
      cudaMalloc((void **)&dev_b, size);
      cudaMalloc((void **)&dev_c, size);

      // Alloc space for host copies of a, b, c and setup input values
      a = (int *)malloc(size); random_ints(a, N);
      b = (int *)malloc(size); random_ints(b, N);
      c = (int *)malloc(size);

Addition with Blocks and Threads: main()

      // Copy inputs to device
      cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
      cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

      // Launch add() kernel on GPU
      add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

      // Copy result back to host
      cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

      // Cleanup
      free(a); free(b); free(c);
      cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
      return 0;
  }

Handling Arbitrary Vector Sizes
Typical problem sizes are not friendly multiples of blockDim.x. Avoid accessing beyond the end of the arrays:

  __global__ void add(int *a, int *b, int *c, int n) {
      int index = threadIdx.x + blockIdx.x * blockDim.x;
      if (index < n)
          c[index] = a[index] + b[index];
  }

Update the kernel launch (with M threads per block):
  add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);

Why Bother with Threads?
Threads seem unnecessary: they add a level of complexity, so what do we gain?
Unlike parallel blocks, threads have mechanisms to efficiently:
  communicate
  synchronize

Exercise 4: 1D Stencil
Consider applying a 1D stencil to a 1D array of elements: each output element is the sum of the input elements within a radius.
If the radius is 3, then each output element is the sum of 7 input elements (3 to the left, the element itself, and 3 to the right).

Implementing Within a Block
Each thread processes one output element, i.e. blockDim.x elements per block.
Input elements are read several times: with radius 3, each input element is read seven times.

Sharing Data Between Threads
Terminology: within a block, threads share data via shared memory.
Shared memory is extremely fast, on-chip, and user-managed. It is declared using __shared__ and allocated per block. The data is not visible to threads in other blocks.

Implementing With Shared Memory
Cache data in shared memory:
  read (blockDim.x + 2 * radius) input elements from global memory into shared memory
  compute blockDim.x output elements
  write blockDim.x output elements to global memory
Each block needs a halo of radius elements at each boundary (a halo on the left and a halo on the right of the blockDim.x output elements).

Question
Find the right size for the temp array and complete the code:

  __global__ void stencil_1d(int *in, int *out) {
      __shared__ int temp[/* WHAT SIZE? */];
      int gindex = threadIdx.x + blockIdx.x * blockDim.x;
      int lindex = threadIdx.x + RADIUS;

      // Read input elements into shared memory
      temp[lindex] = in[gindex];
      if (threadIdx.x < RADIUS) {
          temp[lindex - RADIUS] = in[gindex - RADIUS];
          temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
      }

Stencil Kernel

      // Apply the stencil
      int result = 0;
      for (int offset = -RADIUS; offset <= RADIUS; offset++)
          result += temp[lindex + offset];

      // Store the result
      out[gindex - RADIUS] = result;
  }

Data Race!
The stencil example will not work as written: suppose thread 15 reads the halo before thread 0 has fetched it.
From thread 15's perspective:

  temp[lindex] = in[gindex];                          // store at temp[18]
  if (threadIdx.x < RADIUS) {                         // skipped: threadIdx.x > RADIUS
      temp[lindex - RADIUS] = in[gindex - RADIUS];
      temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
  }
  int result = 0;
  ...
  result += temp[lindex + 1];                         // load from temp[19], possibly not yet written

__syncthreads()

  void __syncthreads();

Synchronizes all threads within a block. It is used to prevent RAW / WAR / WAW hazards.
All threads must reach the barrier; in conditional code, the condition must be uniform across the block.
QUESTION: add the __syncthreads() call on the right line of the stencil kernel.
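For reference, the general pattern is to place the barrier between the cooperative load into shared memory and the first read of data written by other threads. The following is a minimal sketch of that pattern, independent of the exercise files; the kernel name smooth and the constant BLOCK_SIZE are illustrative, and the kernel assumes it is launched with blockDim.x == BLOCK_SIZE:

  #define BLOCK_SIZE 256

  __global__ void smooth(const int *in, int *out) {
      __shared__ int tile[BLOCK_SIZE];
      int i = threadIdx.x + blockIdx.x * blockDim.x;

      tile[threadIdx.x] = in[i];   // cooperative load: each thread fills one slot
      __syncthreads();             // barrier: all loads complete before any thread reads

      // only now is it safe to read an element loaded by a neighbouring thread
      int left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : tile[threadIdx.x];
      out[i] = left + tile[threadIdx.x];
  }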

Exercise 5: Matrix Multiplication
The application performs a matrix-matrix multiplication on the GPU.
Complete the kernel code so that the right data index is assigned to each thread: assign one thread per output element.

  __global__ void matmatmulgpu(double *a, double *b, double *c, int lda) {
      unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;

      // assign one thread per output - row major over whole matrix
      int row = /* WHAT INDEX? */ ;
      int col = /* WHAT INDEX? */ ;

      double sum = 0.0;
      for (int k = 0; k < lda; k++) {
          sum += a[row*lda + k] * b[k*lda + col];
      }
      c[tid] = sum;
      return;
  }
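As a hint, a kernel of this form is typically driven by a 1D launch with one thread per output element of the square lda x lda result. The fragment below is only a sketch of such a launch configuration; the names n, threads, blocks, d_a, d_b and d_c are illustrative and not taken from the exercise files:

  int n = lda * lda;        // one thread per output element
  int threads = 256;        // threads per block; n should be an exact multiple,
                            // since the kernel has no bounds check on tid
  int blocks = n / threads;
  matmatmulgpu<<<blocks, threads>>>(d_a, d_b, d_c, lda);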

Exercise 6: Matrix Transpose
This example transposes arbitrary-size matrices. It compares a naive transpose kernel that suffers from non-coalesced writes with an optimized transpose with fully coalesced memory accesses and no shared-memory bank conflicts.
Complete the GPU kernel. Gain additional insight using the profiler (start_nvvp.sh in the ~/ex directory) and configure what is being measured (e.g. DRAM write bandwidth).

  // This kernel is optimized to ensure all global reads and writes are coalesced,
  // and to avoid bank conflicts in shared memory. This kernel is up to 11x faster
  // than the naive kernel below. Note that the shared memory array is sized to
  // (BLOCK_DIM+1)*BLOCK_DIM. This pads each row of the 2D block in shared memory
  // so that bank conflicts do not occur when threads address the array column-wise.
  __global__ void transpose(float *odata, float *idata, int width, int height) {
      __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];

      // read the matrix tile into shared memory
      // load one element per thread from device memory (idata) and store it
      // in transposed order in block[][]
      unsigned int xIndex = /* COMPLETE */ ;
      unsigned int yIndex = /* COMPLETE */ ;
      if ((xIndex < width) && (yIndex < height)) {
          unsigned int index_in = /* COMPLETE */ ;
          block[threadIdx.y][threadIdx.x] = /* COMPLETE */ ;
      }

      // synchronise to ensure all writes to block[][] have completed
      /* COMPLETE */ ;

      // write the transposed matrix tile to global memory (odata) in linear order
      xIndex = /* COMPLETE */ ;
      yIndex = /* COMPLETE */ ;
      if ((xIndex < height) && (yIndex < width)) {
          unsigned int index_out = /* COMPLETE */ ;
          odata[index_out] = /* COMPLETE */ ;
      }
  }
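The naive kernel referenced in the comment above is not reproduced in this handout. For comparison, a typical naive formulation looks like the following sketch; it is a common textbook version and not necessarily identical to the exercise's naive kernel. Reads from idata are coalesced (consecutive threads read consecutive x positions), but writes to odata are strided by height and therefore non-coalesced:

  __global__ void transpose_naive(float *odata, float *idata, int width, int height) {
      unsigned int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
      unsigned int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
      if ((xIndex < width) && (yIndex < height)) {
          unsigned int index_in  = yIndex * width + xIndex;    // coalesced read
          unsigned int index_out = xIndex * height + yIndex;   // strided, non-coalesced write
          odata[index_out] = idata[index_in];
      }
  }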