HPCSE II. GPU programming and CUDA

What is a GPU? Specialized for compute-intensive, highly parallel computation, which is exactly what graphics rendering requires. Its evolution was pushed by the gaming industry. CPU: large die area spent on control logic and caches. GPU: large die area spent on data processing. [Figure: CPU vs. GPU die layout showing control, cache, ALUs and DRAM; picture source: Nvidia]

GPGPUs. Graphics Processing Unit (GPU): hardware designed for output to a display. General-Purpose computing on GPUs (GPGPU): using GPUs for non-graphics tasks such as physics simulation, signal processing, computational geometry, computer vision, computational biology, computational finance, and meteorology.

Why GPUs? The GPU has evolved into a very flexible and powerful processor. It is programmable using high-level languages, and it offers more GFLOP/s and more GB/s than CPUs. picture source: http://hemprasad.wordpress.com/category/computing-technology/parallel/gpu-cuda/

Low Latency or High Throughput? CPU: optimized for low-latency access to cached data sets; control logic for out-of-order and speculative execution. GPU: optimized for data-parallel, throughput computation; tolerant of memory latency.

Low Latency or High Throughput? The CPU minimizes latency within each thread; the GPU hides latency with computation from other thread warps. [Figure: a low-latency CPU core processing threads T1...Tn one after another versus a high-throughput GPU streaming multiprocessor context-switching between warps W1-W4 that are waiting for data or ready to be processed.]

Kepler GK110 Block Diagram: 7.1B transistors, > 1 TFLOP FP64, 1.5 MB L2 cache, 15 streaming multiprocessors (SMX units).

GK110 Streaming Multiprocessor (SMX): contains many specialized cores; up to 2048 threads concurrently; 192 fp32 ops/clock, 64 fp64 ops/clock, 160 int32 ops/clock; 48 KB shared memory; 64K 32-bit registers.

GPU Parallelism, software to hardware mapping: the many threads of a block run on the many cores of one streaming multiprocessor; the many blocks of a grid are distributed over the streaming multiprocessors of one GPU; one grid runs on one GPU.

Simple Processing Flow, over the PCI bus: 1. Copy input data from CPU memory/NIC to GPU memory. 2. Load the GPU program and execute it. 3. Copy results from GPU memory back to CPU memory/NIC.

3 Ways to Accelerate Applications: libraries (drop-in acceleration); OpenACC directives (easily accelerate applications, similar to OpenMP); programming languages (maximum flexibility).

GPU accelerated libraries include NVIDIA cuBLAS, cuRAND, cuSPARSE, cuFFT and NPP, the IMSL Library, and libraries for vector signal and image processing, GPU-accelerated linear algebra, matrix algebra on GPU and multicore, sparse linear algebra, building-block algorithms for CUDA, and C++ STL features for CUDA.

CUDA Math Libraries, included in the free CUDA Toolkit: cuFFT, Fast Fourier Transforms library; cuBLAS, complete BLAS library; cuSPARSE, sparse matrix library; cuRAND, random number generation (RNG) library; NPP, performance primitives for image and video processing; Thrust, templated C++ parallel algorithms and data structures; math.h, C99 floating-point library.
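
As an illustration of the drop-in style, here is a minimal Thrust sketch of the vector addition developed later in this lecture (N and the fill values 2 and 7 are arbitrary choices for the example):

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/functional.h>

    int main(void) {
        const int N = 512;
        thrust::device_vector<int> a(N, 2), b(N, 7), c(N);
        // c[i] = a[i] + b[i], executed on the GPU
        thrust::transform(a.begin(), a.end(), b.begin(), c.begin(), thrust::plus<int>());
        return 0;
    }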

Programming for GPUs. Early days: OpenGL (a graphics API). Now: CUDA, an Nvidia proprietary API that works only on Nvidia GPUs; OpenCL, an open standard for heterogeneous computing; OpenACC, an open standard based on compiler directives.

CUDA Example: a single main.cu file contains both device code (runs on the GPU) and host code (runs on the CPU); the code is compiled and then run on a GPU node.

Nvidia CUDA: a C extension to write GPU code, only supported by Nvidia GPUs. Code compilation (nvcc) and linking: device.cu (e.g. __global__ void kernel() { /* do something */ }) is compiled by nvcc into device.o; host.cpp (e.g. int main() { }) is compiled by gcc into host.o; the two object files are then linked by gcc into the final program.

Hello World

Hello World!

    #include <stdio.h>

    int main(void) {
        printf("Hello World!\n");
        return 0;
    }

Standard C that runs on the host. The NVIDIA compiler (nvcc) can be used to compile programs with no device code.

Hello World! with Device Code

    __global__ void mykernel(void) { }

    int main(void) {
        mykernel<<<1,1>>>();
        printf("Hello World!\n");
        return 0;
    }

Two new syntactic elements.

Hello World! with Device Code

    mykernel<<<gridDim, blockDim>>>( );

Triple angle brackets mark a call from host code to device code, also called a kernel launch. Here gridDim is the number of instances of the kernel (blocks) and blockDim is the number of threads within each instance. Both may be 2D or 3D vectors (type dim3) to simplify application programs.
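
For example, a 2D launch configuration could look like this minimal sketch (the 4 x 4 and 8 x 8 sizes are arbitrary):

    dim3 grid(4, 4);     // 16 blocks arranged as 4 x 4
    dim3 block(8, 8);    // 64 threads per block arranged as 8 x 8
    mykernel<<<grid, block>>>();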

Hello World! with Device Code

    __global__ void mykernel(void) { }

The CUDA C/C++ keyword __global__ indicates a function that runs on the device and is called from host code. nvcc separates the source code into host and device components: device functions (e.g. mykernel()) are processed by the NVIDIA compiler, host functions (e.g. main()) by the standard host compiler.

GPU Kernel qualifiers. Function qualifiers: __global__: called from the CPU, runs on the GPU; __device__: called from the GPU, runs on the GPU; __host__: called from the CPU, runs on the CPU. __host__ and __device__ can be combined.
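
A small sketch of the qualifiers in use (the helper functions square() and twice() are made up for illustration):

    __device__ float square(float x) { return x * x; }            // callable from GPU code only
    __host__ __device__ float twice(float x) { return 2.0f * x; } // compiled for both CPU and GPU
    __global__ void demo(float *out) {                            // launched from the host, runs on the GPU
        out[threadIdx.x] = square(twice(1.0f));
    }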

Parallel Programming in CUDA C/C++. We need a more interesting example: GPU computing is about massive parallelism! We'll start by adding two integers (a + b = c) and build up to vector addition.

Addition on the Device. A simple kernel to add two integers:

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

As before, __global__ is a CUDA C/C++ keyword meaning that add() will execute on the device and will be called from the host.

Addition on the Device. Note that we use pointers for the variables:

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

add() runs on the device, so a, b and c must point to device memory. We need to allocate memory on the GPU.

Memory Management. Host and device memory are separate entities. Device pointers point to GPU memory: they may be passed to/from host code but may not be dereferenced in host code. Host pointers point to CPU memory: they may be passed to/from device code but may not be dereferenced in device code. There is a simple CUDA API for handling device memory: cudaMalloc(), cudaFree(), cudaMemcpy(), similar to the C equivalents malloc(), free(), memcpy().

Data Transfer:

    cudaMemcpy(void *dst, void *src, size_t num_bytes, enum cudaMemcpyKind direction)

where direction can be one of cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice.

Addition on the Device: main()

    int main(void) {
        int a, b, c;              // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = sizeof(int);

        // Allocate space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Setup input values
        a = 2;
        b = 7;

        // Copy inputs to device
        cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU
        add<<<1,1>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

Going parallel

Moving to Parallel. GPU computing is about massive parallelism. How do we run code in parallel on the device? Instead of add<<<1,1>>>(), launch add<<<N,1>>>(): instead of executing add() once, execute it N times in parallel.

Vector Addition on the Device. With add() running in parallel we can do vector addition. Each parallel invocation of add() is referred to as a block, and the set of blocks is referred to as a grid. Each invocation can refer to its block index using blockIdx.x:

    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

By using blockIdx.x to index into the array, each block handles a different element of the array.

Vector Addition on the Device

    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

On the device, each block can execute in parallel: block 0 computes c[0] = a[0] + b[0], block 1 computes c[1] = a[1] + b[1], block 2 computes c[2] = a[2] + b[2], block 3 computes c[3] = a[3] + b[3].

Block Scheduling. [Figure: a multithreaded CUDA program with blocks 0-7 is distributed over the available streaming multiprocessors: on a GPU with 2 SMs each SM executes four of the blocks, on a GPU with 4 SMs each SM executes two.]

Special variables. Some variables are predefined: gridDim, the size (or dimensions) of the grid of blocks; blockIdx, the index (or 2D/3D indices) of the block; blockDim, the size (or dimensions) of each block; threadIdx, the index (or 2D/3D indices) of the thread.
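
A minimal sketch that prints these variables from the device (device-side printf is available on compute capability 2.0 and later; the 2 x 4 launch is arbitrary):

    #include <stdio.h>

    __global__ void show_ids(void) {
        printf("gridDim.x=%d blockDim.x=%d blockIdx.x=%d threadIdx.x=%d\n",
               gridDim.x, blockDim.x, blockIdx.x, threadIdx.x);
    }

    // show_ids<<<2, 4>>>();        // 2 blocks of 4 threads
    // cudaDeviceSynchronize();     // wait so the output appears before the program exits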

Vector Addition on the Device: main()

    #define N 512

    int main(void) {
        int *a, *b, *c;           // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = N * sizeof(int);

        // Alloc space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Alloc space for host copies of a, b, c and setup input values
        a = (int *)malloc(size); random_ints(a, N);
        b = (int *)malloc(size); random_ints(b, N);
        c = (int *)malloc(size);

Vector Addition on the Device: main()

        // Copy inputs to device
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU with N blocks
        add<<<N,1>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

CUDA Threads. Terminology: a block can be split into parallel threads. Let's change add() to use parallel threads instead of parallel blocks:

    __global__ void add(int *a, int *b, int *c) {
        c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
    }

We use threadIdx.x instead of blockIdx.x. We also need to make one change in main().

One change in main(). The only line that changes is the kernel launch: instead of N blocks of one thread each, we launch one block of N threads:

    // Launch add() kernel on GPU with N threads
    add<<<1,N>>>(d_a, d_b, d_c);

The copies, the cleanup and the rest of main() stay exactly as before.

Vector addition with add<<<B,T>>>( ), where B is the number of blocks and T the number of threads per block. add<<<N,1>>>( ): B = N = 16 blocks, T = 1 thread per block. add<<<1,N>>>( ): B = 1 block, T = N = 16 threads.

Combining Blocks and Threads. We've seen parallel vector addition using several blocks with one thread each, and one block with several threads. Let's adapt vector addition to use both blocks and threads. Why? We'll come to that. First let's discuss data indexing.

Indexing with Blocks and Threads. No longer as simple as using blockIdx.x and threadIdx.x. Consider indexing an array with one element per thread (8 threads/block). With M threads per block, a unique index for each thread is given by: int index = blockIdx.x * M + threadIdx.x; [Figure: 32 array elements covered by 4 blocks of 8 threads; threadIdx.x runs from 0 to 7 inside each block, and blockIdx.x runs from 0 to 3.]

Using Blocks and Threads. Use the built-in variable blockDim.x for the number of threads per block: int index = blockIdx.x * blockDim.x + threadIdx.x; The combined version of add() using parallel threads and parallel blocks:

    __global__ void add(int *a, int *b, int *c) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        c[index] = a[index] + b[index];
    }

What changes need to be made in main()?

Addition with Blocks and Threads: main()

    #define N (2048*2048)
    #define THREADS_PER_BLOCK 512

    int main(void) {
        int *a, *b, *c;           // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = N * sizeof(int);

        // Alloc space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Alloc space for host copies of a, b, c and setup input values
        a = (int *)malloc(size); random_ints(a, N);
        b = (int *)malloc(size); random_ints(b, N);
        c = (int *)malloc(size);

Addition with Blocks and Threads: main()

        // Copy inputs to device
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU
        add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

Handling Arbitrary Vector Sizes. Typical problem sizes are not even multiples of blockDim.x. Avoid accessing beyond the end of the arrays:

    __global__ void add(int *a, int *b, int *c, int n) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        if (index < n)
            c[index] = a[index] + b[index];
    }
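
The launch then has to round the number of blocks up and pass the vector length; a minimal sketch, reusing THREADS_PER_BLOCK from the earlier main():

    int M = THREADS_PER_BLOCK;
    // Enough blocks to cover all N elements even when N is not a multiple of M
    add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);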

Why Bother with Threads? Threads seem unnecessary: they add a level of complexity, so what do we gain? Unlike parallel blocks, threads have mechanisms to efficiently communicate and synchronize. To look closer, we need a new example.

Using stencils

1D Stencil. Consider applying a 1D stencil to a 1D array of elements: each output element is the sum of the input elements within a radius. If the radius is 3, then each output element is the sum of 7 input elements.

Shared memory within a block. Each thread processes one output element, blockDim.x elements per block. Input elements are read several times: with radius 3, each input element is read seven times. Within a block, threads can share data via shared memory: extremely fast on-chip memory, but very small, like a user-managed cache. Declare it using __shared__; it is allocated per block, and the data is not visible to threads in other blocks.

Implementing With Shared Memory. Cache data in shared memory: read (blockDim.x + 2 * radius) input elements from global memory to shared memory, compute blockDim.x output elements, and write the blockDim.x output elements back to global memory. Each block needs a halo of radius elements at each boundary (a halo on the left and a halo on the right of the blockDim.x output elements).

Stencil Kernel (1 of 2)

    __global__ void stencil_1d(int *in, int *out) {
        __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
        int gindex = threadIdx.x + blockIdx.x * blockDim.x;
        int lindex = threadIdx.x + RADIUS;

        // Read input elements into shared memory
        temp[lindex] = in[gindex];
        if (threadIdx.x < RADIUS) {
            temp[lindex - RADIUS] = in[gindex - RADIUS];
            temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
        }
        // Race condition! The loop below may read elements of temp
        // before other threads have finished writing them.

        // Apply the stencil
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[lindex + offset];

        // Store the result
        out[gindex] = result;
    }

__syncthreads(): void __syncthreads(); synchronizes all threads within a block. All threads must reach the barrier; in conditional code (if statements), the condition must be uniform across the block.

Stencil Kernel (2 of 2)

    __global__ void stencil_1d(int *in, int *out) {
        __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
        int gindex = threadIdx.x + blockIdx.x * blockDim.x;
        int lindex = threadIdx.x + RADIUS;

        // Read input elements into shared memory
        temp[lindex] = in[gindex];
        if (threadIdx.x < RADIUS) {
            temp[lindex - RADIUS] = in[gindex - RADIUS];
            temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
        }

        // Synchronize (ensure all the data is available)
        __syncthreads();

        // Apply the stencil
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[lindex + offset];

        // Store the result
        out[gindex] = result;
    }

Review. Launching parallel threads: launch N blocks with M threads per block with kernel<<<N,M>>>( ); use blockIdx.x to access the block index within the grid and threadIdx.x to access the thread index within the block. Assign elements to threads: int index = blockIdx.x * blockDim.x + threadIdx.x; Use __shared__ to declare a variable/array in shared memory: the data is shared between threads in a block and not visible to threads in other blocks. Use __syncthreads() as a barrier to avoid race conditions.

Managing the Device

Coordinating Host & Device. Kernel launches are asynchronous, so the CPU needs to synchronize before consuming the results. cudaMemcpy(): blocks the CPU until the copy is complete; the copy begins when all preceding CUDA calls have completed. cudaMemcpyAsync(): asynchronous, does not block the CPU. cudaDeviceSynchronize(): blocks the CPU until all preceding CUDA calls have completed.
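
A minimal sketch of the pattern, assuming d_a, d_b, d_c, c and size as in the earlier vector-addition example:

    add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);   // returns immediately
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);   // blocking: waits for the kernel, then copies
    // alternatively, when no copy is needed:
    cudaDeviceSynchronize();   // block the CPU until all preceding CUDA calls have completed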

Reporting Errors. All CUDA API calls return an error code (cudaError_t). Get the error code for the last error: cudaError_t cudaGetLastError(). Get a string describing the error: const char *cudaGetErrorString(cudaError_t). Note that cudaGetLastError() also resets the error state, so store its result before using it:

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        std::cerr << cudaGetErrorString(err) << std::endl;
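
In practice it is convenient to wrap every API call in a small checking macro; a minimal sketch (the name CUDA_CHECK is our own choice, not part of the CUDA API):

    #include <cstdio>
    #include <cstdlib>

    #define CUDA_CHECK(call)                                                  \
        do {                                                                  \
            cudaError_t err = (call);                                         \
            if (err != cudaSuccess) {                                         \
                std::fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                             cudaGetErrorString(err), __FILE__, __LINE__);    \
                std::exit(EXIT_FAILURE);                                      \
            }                                                                 \
        } while (0)

    // Usage: CUDA_CHECK(cudaMalloc((void **)&d_a, size));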

Device Management. An application can query and select GPUs: cudaGetDeviceCount(int *count), cudaSetDevice(int device), cudaGetDevice(int *device), cudaGetDeviceProperties(cudaDeviceProp *prop, int device). Multiple host threads can share a device, and a single host thread can manage multiple devices: cudaSetDevice(i) selects the current device, and cudaMemcpy( ) can perform peer-to-peer copies.
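
A minimal enumeration sketch using these calls (compile with nvcc, which pulls in the CUDA runtime headers automatically):

    #include <cstdio>

    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            std::printf("Device %d: %s, compute capability %d.%d\n",
                        i, prop.name, prop.major, prop.minor);
        }
        if (count > 0) cudaSetDevice(0);   // make device 0 the current device
        return 0;
    }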

Multiplying matrices

GPU GEMM: Matrix-Matrix Multiplication, A x B = C. On the CPU:

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i*n+j] += a[i*n+k] * b[k*n+j];

Goal: parallelize this loop nest as a GPU kernel.

A matrix multiplication kernel. Let us use a 2D grid of blocks and threads:

    __global__ void matrix(float *a, float *b, float *c, int N) {
        int ix = threadIdx.x + blockIdx.x * blockDim.x;
        int iy = threadIdx.y + blockIdx.y * blockDim.y;
        if (ix < N && iy < N) {
            c[ix*N + iy] = 0;
            for (int k = 0; k < N; k++)
                c[ix*N + iy] += a[ix*N + k] * b[k*N + iy];
        }
    }

Can you optimize it using shared memory?

A matrix multiplication kernel. And call it by creating a 2D grid:

    dim3 blocks(N/4, N/4);
    dim3 threads(4, 4);
    matrix<<<blocks, threads>>>(dev_a, dev_b, dev_c, N);

What number of threads per block is ideal on your GPU?
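
As one possible answer to the shared-memory question above, here is a minimal sketch of a tiled version (TILE is our own name; it assumes N is a multiple of TILE and a launch with dim3 threads(TILE, TILE) and dim3 blocks(N/TILE, N/TILE)):

    #define TILE 16

    __global__ void matrix_tiled(float *a, float *b, float *c, int N) {
        __shared__ float as[TILE][TILE];
        __shared__ float bs[TILE][TILE];

        int ix = threadIdx.x + blockIdx.x * TILE;   // row index, as in the kernel above
        int iy = threadIdx.y + blockIdx.y * TILE;   // column index
        float sum = 0.0f;

        // Walk over the tiles of a and b that contribute to element (ix, iy)
        for (int t = 0; t < N / TILE; t++) {
            as[threadIdx.x][threadIdx.y] = a[ix*N + (t*TILE + threadIdx.y)];
            bs[threadIdx.x][threadIdx.y] = b[(t*TILE + threadIdx.x)*N + iy];
            __syncthreads();                        // wait until the tile is fully loaded
            for (int k = 0; k < TILE; k++)
                sum += as[threadIdx.x][k] * bs[k][threadIdx.y];
            __syncthreads();                        // wait before overwriting the tile
        }
        c[ix*N + iy] = sum;
    }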