High Performance Computing - Introduction to GPGPU Programming. Prof Matt Probert. http://www-users.york.ac.uk/~mijp1


Overview: What are GPUs? GPU Architecture. Memory. Programming in CUDA. The Future.

What is a GPU? A GPU is a massive vector processor: hundreds of processing units, large memory bandwidth, and processing units that can collaborate. Some problems can be solved more efficiently on vector processors (remember Crays). GPUs have some big advantages: cheap and very fast for vector problems; computation is asynchronous with the CPU; hardware tricks such as texture filtering (zero overhead!). Other accelerators (e.g. Xeon Phi) are available.

Value for Money (2016 prices): P100 ~ 9k, Ebor 40k, K80 5k, Xeon E7 5k. However, Ebor is a distributed-memory machine with 16 nodes of 16 Xeon cores each and a total of 256 GB of RAM etc. These are different tools for different problems.

When is GPU Programming useful? The GPU devotes more transistors to data processing (nvidia C Programming Guide). Good at doing lots of numeric calculations simultaneously - this is actually required for the GPU to be efficient - there must be many more numeric operations than memory operations to break even. Useful as a co-processor - can offload GPU-efficient calculations while the CPU continues with the rest. Brute forcing! - sometimes a brute force method is more efficient with many processors than an elegant solution on just one.

Floating Point Standard: nvidia architectures 2.x onwards (i.e. Fermi onwards) are IEEE 754 compliant. Old nvidia architectures were mostly compliant: generally, exceptions were handled in a non-compliant way, and some mathematics, e.g. FMAD, division and sqrt, was not standard. Standard-compliant intrinsic functions are available but at a large computational penalty (software implemented).

nvidia Architectures:
V1 = Tesla (2008) - introduced CUDA, with performance of 0.5 GFLOP/watt
V2 = Fermi (2010) - 64-bit floating point, performance = 2 GFLOP/W
V3 = Kepler (2012) - dynamic parallelism
V5 = Maxwell (2014) - faster Kepler: 3.8 TFLOP DP, 24 GB GDRAM, 2 GPU, 500 GB/s, 300 W
V6 = Pascal (2016) - unified memory, stacked DRAM, direct interconnect between GPU and main RAM to eliminate a common bottleneck
V7 = Volta (2018) - new tensor cores with half-precision math for machine learning; now 7.8 TFLOP DP, 15.7 TFLOP SP, 125 TFLOP HP

GPU Architecture: Lots of arithmetic units sharing small caches. Execution blocks are scheduled across processing cores. Large memory bandwidth but slow memory access. Fermi onwards have a full cache hierarchy.

Threads: Each thread runs one instance of a kernel. Threads are organised into blocks (which can be up to 3D); blocks are scheduled on and off the processors. A block executes on the processors as groups of threads called warps. Grids (which can be 2D) contain many blocks, which are executed on the device.
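As an illustrative sketch (CUDA C; the kernel name my_kernel2d and the doubling operation are invented for this example), each thread computes its own global coordinates from the built-in block and thread indices:

__global__ void my_kernel2d( float *data, int nx, int ny )
{
    // Global (x,y) position of this thread within the whole grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if ( x < nx && y < ny )          // guard: the grid may overhang the array
        data[y * nx + x] *= 2.0f;    // each thread handles one element
}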

Memory Model:
- Texture: write via CPU; allows hardware interpolation; 2D locality of arrays
- Constant: write via CPU; small, used for random access instructions
- Global: write via CPU and GPU
- Shared: local to block; low latency; fastest comms between threads
- Local: per-thread-only memory
- Registers: thread only; fastest memory available; limited space

Memory Scope: Registers are local to a thread and have thread lifetime. Shared memory is shared between the threads in a block and has block lifetime. Global memory is accessible everywhere and is persistent.
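A small sketch (CUDA C; scope_demo is hypothetical and assumes blocks of exactly 256 threads) showing all three scopes in one kernel:

__global__ void scope_demo( float *gdata )      // gdata: global memory, persistent
{
    float r = gdata[threadIdx.x];               // r: a register, thread lifetime
    __shared__ float s[256];                    // s: shared memory, block lifetime
    s[threadIdx.x] = r;
    __syncthreads();                            // make s visible to the whole block
    gdata[threadIdx.x] = s[255 - threadIdx.x];  // threads exchange data via s
}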

Tensor Cores: One of the key operations when training a neural network or doing ML is matrix-matrix multiplication (DGEMM). Each tensor core has a 4x4x4 matrix processing array, and can do 64 mixed-precision OPs/clock with half-precision inputs and either HP or SP output.
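Tensor cores are exposed in CUDA C via the WMMA API; below is a minimal sketch (assuming CUDA 9+, a single warp, and the canonical 16x16x16 fragment shape with half-precision inputs and single-precision accumulation):

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void tensor_tile( half *a, half *b, float *c )
{
    // Per-warp fragments for one 16x16x16 matrix-multiply-accumulate
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
    wmma::fill_fragment( fc, 0.0f );
    wmma::load_matrix_sync( fa, a, 16 );   // 16 = leading dimension
    wmma::load_matrix_sync( fb, b, 16 );
    wmma::mma_sync( fc, fa, fb, fc );      // fc = fa*fb + fc on the tensor cores
    wmma::store_matrix_sync( c, fc, 16, wmma::mem_row_major );
}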

Programming Technologies (I): CUDA (Compute Unified Device Architecture) - Nvidia's proprietary platform. Offers both an API (extensions to C/C++) and driver-level programming. The proprietary PGI Fortran compiler allows Fortran development in both directive and kernel modes. OpenCL (Open Computing Language) - general platform for computing platforms (GPUs and CPUs). Implemented at driver level, e.g. built into MacOS. The specification is manufacturer (and device) independent: write once, run anywhere.

Programming Technologies (II): OpenACC - open standard version of the directives approach of PGI. Spec v2.0 (July 2013) has better support for control of data movement, calling external functions, and separate compilation for host & device, so can build libraries. Led by Cray, nvidia, PGI and CAPS, with support for Fortran and C/C++. Now in gcc and gfortran (since v6.1). OpenMP v4.0 - pushed by Intel to support Xeon Phi etc. A subset of OpenACC functionality at present; in gcc/gfortran since v4.9.1. Supposed to include all of OpenACC in future, but difficulties with Intel vs nvidia.

CUDA Program Structure: Serial (host) code; kernel (device) code - grid, blocks, threads. Must remember to allocate data on both host and device. The kernel executes asynchronously.
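Putting those pieces together, a minimal end-to-end sketch in CUDA C (the scale kernel and the sizes are invented for illustration):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale( float *x, float s, int n )   // kernel (device) code
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n ) x[i] *= s;
}

int main( void )                                    // serial (host) code
{
    const int N = 1024;
    float h_x[N], *d_x;                             // host and device copies
    for ( int i = 0; i < N; i++ ) h_x[i] = (float)i;
    cudaMalloc( (void **)&d_x, N * sizeof(float) ); // allocate on the device too
    cudaMemcpy( d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice );
    scale<<< N/256, 256 >>>( d_x, 2.0f, N );        // asynchronous kernel launch
    cudaDeviceSynchronize();                        // wait for the kernel to finish
    cudaMemcpy( h_x, d_x, N * sizeof(float), cudaMemcpyDeviceToHost );
    printf( "h_x[10] = %f\n", h_x[10] );
    cudaFree( d_x );
    return 0;
}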

Starting CUDA: Headers must be included; in this introduction only the API will be demonstrated.
use cudafor
#include <cuda.h>
#include <cuda_runtime.h>
C/C++ files with CUDA kernels must have the extension .cu; Fortran CUDA files must have the extension .cuf. See references at end for more complete examples...

Device Kernels: A kernel is a global subroutine (static function) which runs on the device. One instance of the kernel will be executed by every thread which is invoked.
attributes(global) subroutine my_kernel( a, b, c )
static __global__ void my_kernel( float *a, float *b, float *c );
Kernels can only address device memory spaces. Each executing kernel's behaviour differs by using the thread index, which uniquely identifies the thread:
tx = threadidx%x + (blockidx%x * blockdim%x)
int tx = threadIdx.x + (blockIdx.x * blockDim.x);

Host Code: Device memory must be allocated in advance by the host:
real, device, allocatable :: Adev(:,:)
... allocate( Adev(m,n), stat=istat )
double *devptr = NULL;
... cudaMalloc( (void **)&devptr, N*sizeof(double) );
Data must be copied (over the bus) to the device, and if necessary copied back to the host later. This is very slow! (Host memory can be allocated in non-paged (pinned) memory to avoid one copy operation.)
Adev = A   ! Host to device
cudaMemcpy( devptr, hostptr, N*sizeof(double), cudaMemcpyHostToDevice );
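A sketch of the pinned-memory variant (CUDA C; the buffer names are hypothetical) - page-locked host memory lets the driver DMA directly from the buffer, avoiding the extra staging copy:

const size_t N = 1 << 20;
double *h_buf, *d_buf;
cudaMallocHost( (void **)&h_buf, N * sizeof(double) );  // pinned host memory
cudaMalloc( (void **)&d_buf, N * sizeof(double) );      // device memory
cudaMemcpy( d_buf, h_buf, N * sizeof(double), cudaMemcpyHostToDevice );
cudaFreeHost( h_buf );   // pinned memory must be freed with cudaFreeHost
cudaFree( d_buf );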

Kernel Execution:
call my_kernel<<<grid,block>>>( arg1, arg2, ... )
my_kernel<<<grid,block>>>( arg1, arg2, ... );
grid specifies the dimensions of the grid (i.e. the number of blocks launched will be the product of the dimensions). block specifies the dimensions of each block (i.e. the number of threads per block is the product of the dimensions). dim3 is a derived type which has three members:
type(dim3) :: grid
integer :: x=5, y=5, z=1
grid = dim3(x,y,z)
int x=5, y=5, z=1;
dim3 grid(x,y,z);

Kernel Compilation and Run: CUDA kernels must be compiled with a CUDA compiler (pgfortran or nvcc). At runtime the kernel will be copied to the device on first execution - note that for accurate profiling, a kernel should be executed at least once before timing. OpenCL kernels are usually compiled at runtime; this allows the kernel to be optimised for the running context but has a larger initial overhead.
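A timing sketch using CUDA events (reusing the hypothetical scale kernel, d_x and N from the earlier sketch) - note the untimed warm-up launch so the one-off copy of the kernel to the device is excluded:

cudaEvent_t t0, t1;
float ms;
cudaEventCreate( &t0 ); cudaEventCreate( &t1 );
scale<<< N/256, 256 >>>( d_x, 2.0f, N );   // warm-up launch (untimed)
cudaDeviceSynchronize();
cudaEventRecord( t0 );
scale<<< N/256, 256 >>>( d_x, 2.0f, N );   // timed launch
cudaEventRecord( t1 );
cudaEventSynchronize( t1 );
cudaEventElapsedTime( &ms, t0, t1 );       // elapsed time in milliseconds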

Advanced Features: Streaming - data can be moved across the system bus while computation is happening; this is good for large data structures which don't fit in device memory. Texture/Surface memory - memory can be accessed via non-integer or surface coordinates, and linear interpolation can also be performed. Fast intrinsic functions - some special functions exist which allow certain operations to be performed very quickly although they are not IEEE compliant, for example rsqrt, the reciprocal square root. These can be useful for areas of the code where precision is less important.
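A streaming sketch (CUDA C; again reusing the hypothetical scale kernel, d_x and N, and assuming h_x was allocated with cudaMallocHost, since async copies only overlap from pinned memory) - the array is split in two so the copy for one half overlaps the compute on the other:

cudaStream_t s[2];
for ( int i = 0; i < 2; i++ ) cudaStreamCreate( &s[i] );
for ( int i = 0; i < 2; i++ ) {
    int off = i * (N/2);
    cudaMemcpyAsync( d_x + off, h_x + off, (N/2)*sizeof(float),
                     cudaMemcpyHostToDevice, s[i] );
    scale<<< (N/2)/256, 256, 0, s[i] >>>( d_x + off, 2.0f, N/2 );  // 4th <<<>>> arg = stream
    cudaMemcpyAsync( h_x + off, d_x + off, (N/2)*sizeof(float),
                     cudaMemcpyDeviceToHost, s[i] );
}
cudaDeviceSynchronize();   // wait for both streams to drain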

CUDA Libraries: V6 (2014+) is a drop-in replacement for BLAS etc. with auto offload from CPU to GPU, so free speed-up!
CUBLAS - BLAS library; includes all S,D,C,Z level 1-3 BLAS routines
CUFFT - FFT library (free); 1D, 2D, 3D complex and real; stream enabled for parallel data movement & computation
CUSPARSE - sparse matrix library; BLAS-style routines between sparse & dense matrices
CURAND - random number library
Now also XT versions for multi-GPU support (non-free)
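Calling CUBLAS from C is straightforward; a sketch with the v2 API (assuming m, n, k are set and dA, dB, dC are device pointers already holding column-major m x k, k x n and m x n matrices):

#include <cublas_v2.h>

cublasHandle_t h;
cublasCreate( &h );
float alpha = 1.0f, beta = 0.0f;
cublasSgemm( h, CUBLAS_OP_N, CUBLAS_OP_N,   // C = alpha*A*B + beta*C, no transposes
             m, n, k, &alpha,
             dA, m,                         // device pointer + leading dimension
             dB, k, &beta,
             dC, m );
cublasDestroy( h );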

CUBLAS Fortran Interfacing: Interfacing the CUBLAS library (written in C) with CUDA Fortran from PGI requires an interface to be written using the C-interoperability part of F2003, e.g.

module cublas
  interface cuda_gemm
    subroutine cuda_sgemm( cta, ctb, m, n, k, alpha, A, &
        &  lda, B, ldb, beta, C, ldc ) bind(c,name='cublasSgemm')
      use iso_c_binding
      character(1,c_char), value :: cta, ctb
      integer(c_int), value :: m, n, k, lda, ldb, ldc
      real(c_float), value :: alpha, beta
      real(c_float), device, dimension(lda,*) :: A
      real(c_float), device, dimension(ldb,*) :: B
      real(c_float), device, dimension(ldc,*) :: C
    end subroutine cuda_sgemm
  end interface cuda_gemm
end module cublas

Fortran Example:

subroutine mmul( A, B, C )
  use cudafor
  real, dimension(:,:) :: A, B, C
  integer :: N, M, L
  real, device, allocatable, dimension(:,:) :: Adev, Bdev, Cdev
  type(dim3) :: dimGrid, dimBlock
  N = size(A,1) ; M = size(A,2) ; L = size(B,2)
  allocate( Adev(N,M), Bdev(M,L), Cdev(N,L) )
  Adev = A(1:N,1:M) ; Bdev = B(1:M,1:L)
  dimGrid = dim3( N/16, L/16, 1 )
  dimBlock = dim3( 16, 16, 1 )
  call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )
  C(1:N,1:L) = Cdev
  deallocate( Adev, Bdev, Cdev )
end subroutine

Fortran Kernel:

attributes(global) subroutine MMUL_KERNEL( A, B, C, N, M, L )
  real, device :: A(N,M), B(M,L), C(N,L)
  integer, value :: N, M, L
  integer :: i, j, kb, k, tx, ty
  real, shared :: Ab(16,16), Bb(16,16)
  real :: Cij
  tx = threadidx%x ; ty = threadidx%y
  i = (blockidx%x-1) * 16 + tx ; j = (blockidx%y-1) * 16 + ty
  Cij = 0.0
  do kb = 1, M, 16
    ! Fetch one element each into Ab and Bb; NB the 16x16 = 256 threads
    ! in this thread-block are fetching separate elements of Ab and Bb
    Ab(tx,ty) = A(i,kb+ty-1)
    Bb(tx,ty) = B(kb+tx-1,j)
    ! Wait until all elements of Ab and Bb are filled
    call syncthreads()
    do k = 1, 16
      Cij = Cij + Ab(tx,k) * Bb(k,ty)
    enddo
    ! Wait until all threads in the thread-block finish with this Ab and Bb
    call syncthreads()
  enddo
  C(i,j) = Cij
end subroutine

OpenMP v4.5: Spec finalized in Nov 2015 (C/C++/Fortran). Extensions to SIMD and TASK; array reduction now allowed; extended TARGET attributes for better accelerator performance; lots of new stuff for DEVICE. Available in GNU v6.1 and Intel v18.0. V5.0 due any day now...

OpenMP v4.5: On CPU: #pragma omp parallel for. On GPU: #pragma omp target teams distribute parallel for. [Performance figure: IBM XL compiler with nvidia offloading (Dec 16), LULESH benchmark from OpenMP.org]
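A sketch of what the GPU directive looks like around an actual loop (C; x and n are hypothetical, and the map clause moves the array to the device and back):

#pragma omp target teams distribute parallel for map(tofrom: x[0:n])
for ( int i = 0; i < n; i++ )
    x[i] = 2.0f * x[i];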

OpenACC (I): The compiler can auto-generate kernels for a section of code (e.g. multiple loops and/or Fortran array operations). Maximum flexibility; auto-detects dependencies:
!$acc kernels
   ! fortran loop(s) to be executed on device
!$acc end kernels
#pragma acc kernels
{
   /* C loop(s) to be executed on device */
}

OpenACC (II): Or the user can assert that a single loop is dependency-free and hence safe to be parallelized:
!$acc parallel loop
do i = 1, n   ! single fortran loop; NB no matching end-parallel
#pragma acc parallel loop
for (i=0; i<n; i++) { /* single C loop to be executed on device */ }   // NB no block braces
NB data may be copied back to the host between successive parallel sections, but stays on the device for the duration of a kernels section.
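A sketch in C (x and n hypothetical) combining the parallel loop assertion with a data region, so that x stays on the device across two successive loops instead of being copied back between them:

#pragma acc data copy(x[0:n])
{
    #pragma acc parallel loop
    for ( int i = 0; i < n; i++ ) x[i] *= 2.0f;   // first device loop

    #pragma acc parallel loop
    for ( int i = 0; i < n; i++ ) x[i] += 1.0f;   // second device loop; no copy-back in between
}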

Further Reading:
CUDA Developer Zone: https://developer.nvidia.com/cuda-education-training
OpenCL standard: http://www.khronos.org/opencl
OpenACC standard: http://www.openacc.org
OpenMP v4.5: http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf