GPU Programming. Alan Gray, James Perry. EPCC, The University of Edinburgh

Contents
- NVIDIA CUDA: C, proprietary interface to the NVIDIA architecture
- CUDA Fortran: provided by PGI
- OpenCL: cross-platform API

NVIDIA CUDA
CUDA allows NVIDIA GPUs to be programmed in C and C++:
- it defines language extensions for defining kernels, which execute in multiple threads concurrently on the GPU
- it provides API functions for, e.g., device memory management
PGI provides a commercial Fortran version of CUDA.


GPGPU: Stream Computing
- The data set is decomposed into a stream of elements.
- A single computational function (the kernel) operates on each element; a thread is defined as the execution of the kernel on one data element.
- Multiple cores can process multiple elements in parallel, i.e. many threads running in parallel.
- This approach is suitable for data-parallel problems.

NVIDIA GPUs have a two-level hierarchy: multiple Streaming Multiprocessors (SMs), each with multiple cores.

In CUDA, this hierarchy is abstracted as a grid of thread blocks:
- The multiple blocks in a grid map onto the multiple SMs.
- Each block in a grid contains multiple threads, mapping onto the cores in an SM.
We don't need to know the exact details of the hardware (number of SMs, cores per SM). Instead, we oversubscribe, and the system performs the scheduling automatically: use more blocks than SMs, and more threads than cores. The same code is then portable and efficient across different GPU versions.

Example: vector addition. Consider the following loop:

    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

This is parallel in nature: we can assign each iteration to a separate CUDA thread.

CUDA C Example

    __global__ void vectorAdd(float *a, float *b, float *c) {
        int i = threadIdx.x;
        c[i] = a[i] + b[i];
    }

- Replace the loop with a function.
- Add the __global__ specifier, to specify that this function forms a GPU kernel.
- Use built-in CUDA variables to specify the array indices: threadIdx.x is a built-in variable unique to each thread in a block.

CUDA C Example
Launch this kernel on multiple CUDA threads by calling the function with the <<< >>> syntax:

    int blocksPerGrid = 1;    // use only one block
    int threadsPerBlock = N;  // use N threads in the block
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);

CUDA C Example
The previous example uses only 1 block, i.e. only 1 SM on the GPU. In practice, we need to use multiple blocks, e.g.:

    __global__ void vectorAdd(float *a, float *b, float *c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] = a[i] + b[i];
    }
    ...
    int blocksPerGrid = N/256;  // assuming 256 divides N exactly
    int threadsPerBlock = 256;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
    ...

- blockIdx.x: unique to every block in the grid
- blockDim.x: the number of threads per block
We have chosen to use 256 threads per block, which is typically a good number (see practical).

CUDA C Example
You can think of this as restructuring the original loop

    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

as

    for (i0 = 0; i0 < (N/256); i0++) {
        for (i1 = 0; i1 < 256; i1++) {
            i = i0*256 + i1;
            c[i] = a[i] + b[i];
        }
    }

and parallelising the inner loop over threads in a block, and the outer loop over blocks.

2D Example
The previous examples were one-dimensional. Each thread block can be 1D, 2D or 3D to best fit the algorithm, e.g. for matrix addition:

    __global__ void matrixAdd(float a[N][N], float b[N][N], float c[N][N]) {
        int i = threadIdx.x;
        int j = threadIdx.y;
        c[i][j] = a[i][j] + b[i][j];
    }

    int main() {
        dim3 blocksPerGrid(1);        /* 1 block per grid (1D) */
        dim3 threadsPerBlock(N, N);   /* NxN threads per block (2D) */
        matrixAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
    }

dim3 is a CUDA type containing 3 integers (x, y and z components).

Multiple Block 2D Example
The grid can also be 1D, 2D or 3D:

    __global__ void matrixAdd(float a[N][N], float b[N][N], float c[N][N]) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        c[i][j] = a[i][j] + b[i][j];
    }

    int main() {
        dim3 blocksPerGrid(N/16, N/16);  // (N/16)x(N/16) blocks/grid (2D)
        dim3 threadsPerBlock(16, 16);    // 16x16 threads/block (2D)
        matrixAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
    }
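Note that the examples above assume the block size (16) divides N exactly. Not covered on the slides, but worth knowing: when it doesn't, a common pattern is to round the grid size up and guard against out-of-range indices. A minimal sketch, here using a flat array and a runtime size n rather than the fixed-size 2D arrays above:

    __global__ void matrixAdd(float *a, float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < n && j < n)                    // guard: surplus threads do nothing
            c[i*n + j] = a[i*n + j] + b[i*n + j];
    }
    ...
    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((n + 15) / 16, (n + 15) / 16);  // round up
    matrixAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);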

Memory Management - allocation
The GPU has a separate memory space from the host CPU, so we cannot simply pass normal C pointers to CUDA threads: we need to manage GPU memory and copy data to and from it explicitly. cudaMalloc is used to allocate GPU memory, and cudaFree releases it again:

    float *a;
    cudaMalloc(&a, N*sizeof(float));
    ...
    cudaFree(a);

Memory Management - cudaMemcpy
Once we've allocated GPU memory, we need to be able to copy data to and from it. cudaMemcpy does this:

    cudaMemcpy(array_device, array_host, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(array_host, array_device, N*sizeof(float), cudaMemcpyDeviceToHost);

The first argument always corresponds to the destination of the transfer. Transfers between host and device memory are relatively slow and can become a bottleneck, so they should be minimised when possible.
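Putting the pieces so far together, here is a minimal host-side sketch of the complete vector addition. The structure (allocate, copy in, launch, copy out, free) follows the slides; names such as h_a and d_a are our own convention:

    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void vectorAdd(float *a, float *b, float *c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] = a[i] + b[i];
    }

    int main(void) {
        int N = 1024;                        /* assumes 256 divides N exactly */
        size_t size = N * sizeof(float);
        float *h_a = (float *)malloc(size);  /* host copies of the arrays */
        float *h_b = (float *)malloc(size);
        float *h_c = (float *)malloc(size);
        float *d_a, *d_b, *d_c;              /* device copies */

        /* allocate device memory */
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        /* ... initialise h_a and h_b here ... */

        /* copy inputs to the device */
        cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

        /* launch the kernel, then copy the result back */
        vectorAdd<<<N/256, 256>>>(d_a, d_b, d_c);
        cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }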

Synchronisation between host and device
Kernel calls are non-blocking: the host program continues immediately after it calls the kernel. This allows overlap of computation on the CPU and GPU. Use cudaThreadSynchronize() to wait for the kernel to finish:

    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
    // do work on host (that doesn't depend on c)
    cudaThreadSynchronize();  // wait for kernel to finish

Standard cudaMemcpy calls are blocking; non-blocking variants exist.

Synchronisation between CUDA threads
Within a kernel, to synchronise between threads in the same block, use the __syncthreads() call. Threads in the same block can therefore communicate through memory spaces that they share, e.g. assuming x is local to each thread and array is in a shared memory space:

    if (threadIdx.x == 0) array[0] = x;
    __syncthreads();
    if (threadIdx.x == 1) x = array[0];

It is not possible to communicate between different blocks within a kernel: you must instead exit the kernel and start a new one.
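For completeness, a minimal sketch of how such a shared array would be declared and used in a full kernel (the __shared__ qualifier is standard CUDA C; the kernel itself is our illustration, assuming a launch with a single block of at least two threads):

    __global__ void shareExample(float *in, float *out) {
        __shared__ float array[1];           // shared: visible to all threads in the block
        float x = in[threadIdx.x];           // x is local (private) to each thread
        if (threadIdx.x == 0) array[0] = x;  // thread 0 publishes its value
        __syncthreads();                     // all threads in the block wait here
        if (threadIdx.x == 1) x = array[0];  // thread 1 safely reads thread 0's value
        out[threadIdx.x] = x;
    }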

CUDA Fortran
Allows Fortran codes to take advantage of GPU acceleration. Available in the commercial PGI Fortran compiler from the Portland Group. Very similar to CUDA C:
- define kernels (subroutines) to be executed on the device with the attributes(global) specifier
- invoke them using the <<< >>> syntax
- the same API functions are available (cudaMalloc, cudaFree, cudaMemcpy, etc.)
Some features are more intuitive than in CUDA C, e.g. you can use Fortran array syntax to copy to/from the GPU instead of API functions.

CUDA Fortran Example

    ! Kernel declaration
    attributes(global) subroutine vectoradd(a, b, c)
      real, dimension(*) :: a, b, c
      integer :: i
      i = (blockIdx%x-1)*blockDim%x + threadIdx%x
      c(i) = a(i) + b(i)
    end subroutine

    ! Kernel invocation
    blocksPerGrid = dim3(N/256, 1, 1)
    threadsPerBlock = dim3(256, 1, 1)
    call vectoradd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c)

Multi-block 2D Example

    ! Kernel declaration
    attributes(global) subroutine matrixadd(N, a, b, c)
      integer, value :: N
      real, dimension(N,N) :: a, b, c
      integer :: i, j
      i = (blockIdx%x-1)*blockDim%x + threadIdx%x
      j = (blockIdx%y-1)*blockDim%y + threadIdx%y
      c(i,j) = a(i,j) + b(i,j)
    end subroutine

    ! Kernel invocation
    blocksPerGrid = dim3(N/16, N/16, 1)
    threadsPerBlock = dim3(16, 16, 1)
    call matrixadd<<<blocksPerGrid, threadsPerBlock>>>(N, a, b, c)

CUDA Fortran Data Management
Can use allocate() and deallocate() as for host memory:

    real, device, allocatable, dimension(:) :: d_a
    allocate( d_a(N) )
    ...
    deallocate( d_a )

Can copy data between host and device using assignment:

    d_a = a(1:N)

though this is a blocking operation. Can also use the API directly:

    istat = cudaMemcpy(d_a, a, N)

Compiling CUDA Code
CUDA C code is compiled using nvcc:

    nvcc -o example example.cu

CUDA Fortran is compiled using the PGI compiler: either use the .cuf filename extension for CUDA files and/or pass -Mcuda on the compiler command line:

    pgf90 -Mcuda -o example example.cuf

OpenCL
Open Computing Language (OpenCL): the open standard for heterogeneous parallel programming.
- An open, cross-platform framework for programming modern multicore and heterogeneous systems.
- Supports a wide range of applications and architectures, including GPUs.
- Supported on NVIDIA Tesla and AMD FireStream.
See http://www.khronos.org/opencl/

OpenCL vs CUDA on NVIDIA
- NVIDIA supports both CUDA and OpenCL as APIs to its hardware, but puts much more effort into CUDA: CUDA is more mature, better documented, and performs better.
- OpenCL and C for CUDA are conceptually very similar: very similar abstractions and basic functionality, with different names, e.g. a thread in CUDA is a work item in OpenCL. Porting between the two should in principle be straightforward.
- OpenCL is a lower-level API than C for CUDA: more work for the programmer.
- OpenCL is obviously portable to other systems, but in reality work will still need to be done for efficiency on a different architecture.
- OpenCL may well catch up with CUDA given time.

Example OpenCL Kernel

    __kernel void vectoradd(__global const float *a,
                            __global const float *b,
                            __global float *c) {
        int i = get_global_id(0);
        c[i] = a[i] + b[i];
    }

- __kernel and __global are OpenCL extensions to C.
- get_global_id is an OpenCL function performing a similar role to CUDA's blockIdx/threadIdx.
- Invoking kernels is a much more involved process than in CUDA.
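To illustrate that last point, a heavily abbreviated sketch (ours, not from the slides) of the host-side setup OpenCL requires before a single kernel launch; error handling is omitted and source is assumed to hold the kernel text as a C string:

    #include <CL/cl.h>

    cl_platform_id platform;  cl_device_id device;  cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* the kernel source is compiled at run time */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "vectoradd", &err);

    /* buffers replace cudaMalloc/cudaMemcpy; arguments are set one by one */
    cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                N*sizeof(float), a, &err);
    /* ... create d_b and d_c similarly ... */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
    /* ... set arguments 1 and 2 similarly ... */

    size_t global_size = N;  /* total work items, cf. blocks*threads in CUDA */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, N*sizeof(float), c, 0, NULL, NULL);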

Summary
- CUDA allows NVIDIA GPUs to be programmed using C/C++: it defines language extensions and APIs to enable this.
- PGI provides a commercial Fortran version of CUDA.
- OpenCL provides another means of programming GPUs in C: conceptually similar to CUDA, but less mature and lower-level; supports other hardware as well as NVIDIA GPUs.