Using CUDA Oswald Haan ohaan@gwdg.de

A first Example: Adding two Vectors

Host code for add routine:

    void add( int N, float *a, float *b, float *c ) {
        int i;
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }
    }

Device code for add kernel routine:

    __global__ void add_d( int N, float *a, float *b, float *c ) {
        int i = threadIdx.x + blockIdx.x*blockDim.x;
        if (i < N) c[i] = a[i] + b[i];
    }

A first Example: Adding two Vectors

Host code for calling sequential routine:

    int main( void ) {
        int N = 3, i;
        float a[N], b[N], c[N];
        for (i = 0; i < N; i++) {
            a[i] = -i;
            b[i] = i * i;
        }
        add( N, a, b, c );
    }

Host code for calling kernel routine:

    int main( void ) {
        int N = 3, i;
        float a[N], b[N], c[N];
        float *a_d, *b_d, *c_d;
        cudaMalloc( &a_d, sizeof(float)*N );
        cudaMalloc( &b_d, sizeof(float)*N );
        cudaMalloc( &c_d, sizeof(float)*N );
        for (i = 0; i < N; i++) {
            a[i] = -i;
            b[i] = 3+i;
        }
        cudaMemcpy( a_d, a, sizeof(float)*N, cudaMemcpyHostToDevice );
        cudaMemcpy( b_d, b, sizeof(float)*N, cudaMemcpyHostToDevice );
        add_d<<<1,N>>>( N, a_d, b_d, c_d );
        cudaMemcpy( c, c_d, sizeof(float)*N, cudaMemcpyDeviceToHost );
        cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    }

Code in ~ohaan/cuda_kurs/add_vectors.cu

Managing Memory on Host and on Device

    float a[3], *a_d;
    cudaMalloc( &a_d, 3*sizeof(float) );

Allocates memory for three floats at address a_d in device memory.
Stores this address at address &a_d in host memory.

                                      host memory                  device memory
    name of value in memory cell      a[0]   a[1]   a[2]   a_d     a_d[0]   a_d[1]   a_d[2]
    name of address of memory cell    a      a+1    a+2    &a_d    a_d      a_d+1    a_d+2

    cudaMemcpy( a_d, a, 3*sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( a, a_d, 3*sizeof(float), cudaMemcpyDeviceToHost );

Arguments in order: destination address, source address, size of data to be copied; the fourth argument gives the copy direction.

Compiling CUDA codes

CUDA source files must have the extension .cu
The compiler nvcc is provided in the CUDA toolkit.
The CUDA toolkit is available on GWDG's cluster frontends gwdu101, gwdu102, gwdu103 by loading the CUDA toolkit module:

    module load cuda80

Compiling the CUDA source file add_vectors.cu with

    nvcc add_vectors.cu -o add_vectors

produces the executable add_vectors.

Execution environment for CUDA executables

GWDG's compute cluster provides nodes with different types of NVIDIA GPUs:

    gwdo161-gwdo180, each with one GeForce GTX 770
    dge001-dge007, each with two GeForce GTX 1080
    dge008-dge014, each with four GeForce GTX 980
    dge015, with two GeForce GTX 980
    dte001-dte010, each with two Tesla K40m

All nodes with GPU devices belong to the LSF queue gpu.
Jobs are managed on the compute cluster by the Load Sharing Facility (LSF), which provides commands for submitting jobs and enquiring their status.

Submitting CUDA jobs with bsub

A batch job is submitted to the queue gpu with the command

    bsub -q gpu -n 1 -R "rusage[ngpus_shared=1]" ./add_vectors

With the option -n 1 it will use 1 core on a host in the queue.
A node in the gpu queue provides as many gpu-shares as it has cores: 8 for the older gwdoxxx nodes and 24 for the new dgexxx and dtexxx nodes.
With -R "rusage[ngpus_shared=1]" the job will share the gpu resources of the node with other jobs running on this node that have requested gpu-shares.
With -R "rusage[ngpus_shared=24]" the job will use the gpu resources of a node with 24 cores exclusively.
In order to run jobs interactively, an interactive shell can be requested by

    bsub -ISs -q gpu -n 1 -R "rusage[ngpus_shared=1]" /bin/bash

More options for the bsub command and the description of other LSF commands can be found at www.gwdg.de -> Services -> High Performance Computing -> Docs

A special Queue course for this Course

Submitting a job to queue kurs-gpu:

    bsub -q kurs-gpu -o out.%J -n 1 -R "rusage[ngpus_shared=1]" ./add_vectors

Starting an interactive session:

    bsub -ISs -q kurs-gpu -n 1 -R "rusage[ngpus_shared=1]" /bin/bash

Submitting with a jobfile lsf.job:

    bsub < lsf.job

Content of lsf.job:

    #!/bin/sh
    #BSUB -q course
    #BSUB -W 1:00
    #BSUB -o out.%J
    #BSUB -n 1
    #BSUB -R "rusage[ngpus_shared=1]"
    ./add_vectors

Enquiring Device Properties

    cudaGetDeviceCount(&ndevices);

Sets int ndevices to the number of devices available in the node.

    cudaGetDeviceProperties(&prop, i);

Delivers in the members of the structure cudaDeviceProp prop the values for various properties of device number i.
The definition of cudaDeviceProp is given in the section CUDA Runtime API 5.3 of the CUDA Toolkit Documentation.

    int main() {
        int ndevices;
        cudaGetDeviceCount(&ndevices);
        for (int i = 0; i < ndevices; i++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("Device Number: %d\n", i);
            printf(" Device name: %s\n", prop.name);
            ...

Complete code for enquiring in ~ohaan/cuda_kurs/device_properties.cu

Output from program device_properties.cu

    Device Number: 0
     Device name: GeForce GTX 980
     Device capability major revision number: 5
     Device capability minor revision number: 2
     Clock Rate (KHz): 1240500
     total Global Memory (byte): 4294770688
     Shared Memory per Block (byte): 49152
     total Constant Memory (byte): 65536
     size of L2 cache (byte): 2097152
     32-bit Registers per Block: 65536
     max. Threads per Block: 1024
     number of Threads in Warp: 32
     number of Multiprocessors: 16
     Memory Clock Rate (KHz): 3505000
     Max Grid Size: 2147483647 65535 65535
     Max Block Size: 1024 1024 64
     Memory Bus Width (bits): 256
     Peak Memory Bandwidth (GB/s): 224.320000
    Device Number: 1
     Device name: GeForce GTX 980
     Device capability major revision number: 5
     Device capability minor revision number: 2
     Clock Rate (KHz): 1240500
     total Global Memory (byte): 4294770688
     Shared Memory per Block (byte): 49152
     total Constant Memory (byte): 65536
     size of L2 cache (byte): 2097152
     32-bit Registers per Block: 65536
     max. Threads per Block: 1024
     number of Threads in Warp: 32
     number of Multiprocessors: 16
     Memory Clock Rate (KHz): 3505000
     Max Grid Size: 2147483647 65535 65535
     Max Block Size: 1024 1024 64
     Memory Bus Width (bits): 256
     Peak Memory Bandwidth (GB/s): 224.320000
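The peak memory bandwidth reported in this output can be reproduced from the memory clock rate and the memory bus width. The sketch below assumes double data rate memory (two transfers per clock cycle); that factor is an assumption, not something the output states, and the helper name peak_bandwidth_gbs is hypothetical.

```cuda
#include <stdio.h>

// Hypothetical helper: peak bandwidth in GB/s from the memory clock rate (kHz)
// and the memory bus width (bits). The factor 2 assumes double data rate memory.
double peak_bandwidth_gbs(int mem_clock_khz, int bus_width_bits) {
    return 2.0 * mem_clock_khz * (bus_width_bits / 8) / 1.0e6;
}

int main(void) {
    // Values taken from the GeForce GTX 980 output: 3505000 kHz, 256 bits
    printf("Peak Memory Bandwidth (GB/s): %f\n", peak_bandwidth_gbs(3505000, 256));
    return 0;
}
```

With the GTX 980 values this gives 2 * 3505000 * 32 / 10^6 = 224.32 GB/s, the figure shown in the output.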

GPU-Properties of different nodes

    GWDG node                     gwdo161-gwdo180   dge001-dge007   dge008-dge015   dte001-dte015
    NVIDIA model                  GeForce 770       GeForce 1080    GeForce 980     Tesla K40
    Graphics chip                 GK104             GP104           GM204           GK110
    Compute capability            3.0               6.1             5.2             3.5
    Clock rate [MHz]              1110              1733            1126            745
    Device memory [GB]            2                 8               4               12
    Bandwidth [GB/s]              224               320             224             288
    Number of SMXes               8                 20              16              15
    CUDA cores per SMX (SP)       192               128             128             192
    CUDA cores per SMX (SP-SF)    32                32              32              32
    CUDA cores per SMX (DP)       0                 4               4               64
    Perf. ratio FP64/FP32         1:24              1:32            1:32            1:3

Some Properties of different Compute Capabilities

    Specification                                            Compute capability
                                                             3.0     3.5     5.2     6.1
    Maximum x-dimension of a grid of thread blocks           2^31-1 = 2147483647 (all versions)
    Maximum y- or z-dimension of a grid of thread blocks     65535 (all versions)
    Maximum number of threads per block                      1024 (all versions)
    Maximum x- or y-dimension of a block                     1024 (all versions)
    Maximum z-dimension of a block                           64 (all versions)
    Maximum number of resident blocks per SMX                16      16      32      32
    Maximum number of resident threads per SMX               2048 (all versions)
    Number of 32-bit registers per thread                    255 (all versions)

All specifications for compute capabilities can be found at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities

Selecting Different GPUs

Compiling:

    nvcc -arch=[sm_30|sm_35|sm_52|sm_61]    according to compute capability of target GPU

Without setting this flag, nvcc compiles for compute capability 2.0.

Submitting:

    -R "nvgen=1"         selects a node with a Kepler GPU (GeForce 770 or Tesla K40)
    -R "nvgen=2"         selects a node with a Maxwell GPU (GeForce 980)
    -R "nvgen=3"         selects a node with a Pascal GPU (GeForce 1080)
    -R tesla             selects a node with Tesla K40 GPU
    -R "ngpus=2"         selects a node with two GPUs (GeForce 980/1080 or Tesla K40)
    -R "ngpus=4"         selects a node with four GPUs (GeForce 980)
    -m "gwdo[161-180]"   selects one of the gwdoxxx nodes (GeForce 770)

How to Use 2 GPUs simultaneously

A device can be selected with cudaSetDevice(device_number).
Prepare two executables:

    exe0 including cudaSetDevice(0)
    exe1 including cudaSetDevice(1)

Jobfile:

    #!/bin/sh
    #BSUB -q gpu
    #BSUB -W 1:00
    #BSUB -o out.%J
    #BSUB -n 2
    #BSUB -R "ngpus=2"
    #BSUB -R "rusage[ngpus_shared=24]"
    ./exe0 > out0 &
    ./exe1 > out1 &

-R "ngpus=2" selects a node with 2 GPUs.
-R "rusage[ngpus_shared=24]" grants exclusive use of the GPUs.
The last two lines start the two executables asynchronously.

Using Unified Memory (compute capability >= 3.0, CUDA version >= 6.0)

Host code:

    int main(void) {
        int N = 6, i;
        float *a, *b, *c;
        // allocates memory in host- and device-memory
        cudaMallocManaged( &a, sizeof(float)*N );
        cudaMallocManaged( &b, sizeof(float)*N );
        cudaMallocManaged( &c, sizeof(float)*N );
        // initializes data in host-memory
        for (i = 0; i < N; i++) {
            a[i] = -i;
            b[i] = i+3;
        }
        add_d<<<1,N>>>( N, a, b, c );
        // synchronization is necessary, because the host must not access
        // unified memory until the device is inactive
        cudaDeviceSynchronize();
        // reads data from device memory
        for (i = N-3; i < N; i++) {
            printf( "%f + %f = %f\n", a[i], b[i], c[i] );
        }
        cudaFree(a); cudaFree(b); cudaFree(c);
    }

Code in ~ohaan/cuda_kurs/add_vectors_um.cu

Unified Memory with Static Allocation

    #include <stdio.h>
    const int N = 6;
    __device__ __managed__ float a[N], b[N], c[N];

    __global__ void add_d() {
        int i = threadIdx.x + blockIdx.x*blockDim.x;
        if (i < N) c[i] = a[i] + b[i];
    }

    int main(void) {
        int i;
        for (i = 0; i < N; i++) {
            a[i] = -i;
            b[i] = i+3;
        }
        add_d<<<1,N>>>();
        cudaDeviceSynchronize();
        for (i = N-3; i < N; i++) {
            printf( "%f + %f = %f\n", a[i], b[i], c[i] );
        }
    }

Code in ~ohaan/cuda_kurs/add_vectors_um_static.cu

Large Vectors

At most 1024 threads fit in a single block:

    add_d<<<1,N>>>( N, a_d, b_d, c_d )

gives unpredictable results for N > 1024.

Use N_blks blocks. Modify the host code:

    N_thrpb = 1024;
    N_blks = (N+N_thrpb-1)/N_thrpb;
    add_d<<<N_blks,N_thrpb>>>( N, a_d, b_d, c_d );

No change in device code:

    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i < N) c[i] = a[i] + b[i];
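The rounding in the block count can be checked with a few lines of host code. This is a minimal sketch; the helper name blocks_needed is hypothetical, and N = 6000 is chosen only as an example value.

```cuda
#include <stdio.h>

// Hypothetical helper: number of blocks needed to cover N elements (rounds up).
int blocks_needed(int N, int N_thrpb) {
    return (N + N_thrpb - 1) / N_thrpb;
}

int main(void) {
    int N = 6000, N_thrpb = 1024;
    int N_blks = blocks_needed(N, N_thrpb);
    // 6 blocks provide 6144 threads; the 144 surplus threads do nothing
    // because of the test "if (i < N)" in the kernel
    printf("N_blks: %d, surplus threads: %d\n", N_blks, N_blks * N_thrpb - N);
    return 0;
}
```

Plain integer division N/N_thrpb would give 5 blocks here and leave the last 880 elements unprocessed, which is why N_thrpb-1 is added before dividing.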

CUDA Error Handling

CUDA functions return an error code of type cudaError_t

    cudaError_t err = cudaMalloc(...)

which can be translated into an error message by calling

    cudaGetErrorString(err)

Errors in kernel functions can be enquired by

    kernel<<<grids,threads>>>(...);
    cudaDeviceSynchronize();
    cudaError_t err = cudaGetLastError();
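As an illustration, the following sketch provokes a launch error on purpose by requesting more than 1024 threads per block, and prints the message for a valid and an invalid launch. The names empty_kernel and try_launch are hypothetical. As a small variation on the sequence shown here, it checks cudaGetLastError() directly after the launch and then uses the return value of cudaDeviceSynchronize().

```cuda
#include <stdio.h>

__global__ void empty_kernel(void) { }

// Hypothetical helper: launches one block with the given number of threads
// and returns the resulting error code.
cudaError_t try_launch(int threads_per_block) {
    empty_kernel<<<1, threads_per_block>>>();
    cudaError_t err = cudaGetLastError();   // errors of the launch itself
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();         // errors raised while the kernel ran
}

int main(void) {
    int sizes[2] = {256, 2048};             // 2048 exceeds the limit of 1024 threads per block
    for (int k = 0; k < 2; k++) {
        cudaError_t err = try_launch(sizes[k]);
        printf("%4d threads per block: %s\n", sizes[k], cudaGetErrorString(err));
    }
    return 0;
}
```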

cudaCheckError() from https://gist.github.com/jefflarkin/5390993

    //Macro for checking cuda errors following a cuda launch or api call
    #define cudaCheckError() {                                       \
        cudaError_t e = cudaGetLastError();                          \
        if (e != cudaSuccess) {                                      \
            printf("Cuda failure %s:%d: '%s'\n",                     \
                   __FILE__, __LINE__, cudaGetErrorString(e));       \
        }                                                            \
    }

Macro code in ~ohaan/cuda_kurs/errchk.ut

Add Large Vectors with Error Checking

    #include "errchk.ut"
    int main(void) {
        int N = 6000, N_thrpb = 1024, N_blks = (N+N_thrpb-1)/N_thrpb, i;
        float *a, *b, *c;
        printf( "N: %i, N_blks: %i, N_thrpb: %i\n", N, N_blks, N_thrpb );
        cudaMallocManaged( &a, sizeof(float)*N ); cudaCheckError();
        cudaMallocManaged( &b, sizeof(float)*N ); cudaCheckError();
        cudaMallocManaged( &c, sizeof(float)*N ); cudaCheckError();
        for (i = 0; i < N; i++) {
            a[i] = -i;
            b[i] = i+3.;
        }
        add_d<<<N_blks,N_thrpb>>>( N, a, b, c ); cudaCheckError();
        cudaDeviceSynchronize(); cudaCheckError();
        for (i = N-3; i < N; i++) {
            printf( "%14e + %e = %e\n", a[i], b[i], c[i] );
        }
        cudaFree(a); cudaFree(b); cudaFree(c);
    }

Complete code in ~ohaan/cuda_kurs/add_largevectors.cu

3-dim Grids and Blocks

3-dim grids and blocks can be configured with the CUDA type dim3:

    dim3 gdims(gdim_x,gdim_y,gdim_z);
    dim3 bdims(bdim_x,bdim_y,bdim_z);
    kernel<<<gdims,bdims>>>(...);

This will launch a total number of gdim_x*gdim_y*gdim_z*bdim_x*bdim_y*bdim_z threads on the device.
At most (number of SMXes)*2048 threads will be executing at any time.

Vector Addition with 3-dim Grids and Blocks

    __global__ void add_d( int N, int *a, int *b, int *c ) {
        int gridsize = gridDim.x * gridDim.y * gridDim.z;
        int blocksize = blockDim.x * blockDim.y * blockDim.z;
        int id_thr = threadIdx.x + threadIdx.y * blockDim.x
                   + threadIdx.z * blockDim.x * blockDim.y;
        int id_blk = blockIdx.x + blockIdx.y * gridDim.x
                   + blockIdx.z * gridDim.x * gridDim.y;
        int i = id_thr + blocksize * id_blk;
        if (i < N) c[i] = a[i] + b[i];
    }
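The slide shows only the device code. A possible host side could look like the sketch below, which uses unified memory to keep it short. The driver run_add_3d, the chosen grid and block sizes, and N = 500 are assumptions for illustration, not taken from the course code.

```cuda
#include <stdio.h>

// Kernel as on the slide: flattens the 3-dim block and thread indices into one index.
__global__ void add_d(int N, int *a, int *b, int *c) {
    int blocksize = blockDim.x * blockDim.y * blockDim.z;
    int id_thr = threadIdx.x + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
    int id_blk = blockIdx.x + blockIdx.y * gridDim.x
               + blockIdx.z * gridDim.x * gridDim.y;
    int i = id_thr + blocksize * id_blk;
    if (i < N) c[i] = a[i] + b[i];
}

// Hypothetical driver: adds a[i] = -i and b[i] = i + 3 on the device
// and returns the number of results that differ from 3.
int run_add_3d(int N, dim3 gdims, dim3 bdims) {
    int *a, *b, *c, errors = 0;
    cudaMallocManaged(&a, N * sizeof(int));
    cudaMallocManaged(&b, N * sizeof(int));
    cudaMallocManaged(&c, N * sizeof(int));
    for (int i = 0; i < N; i++) { a[i] = -i; b[i] = i + 3; }
    add_d<<<gdims, bdims>>>(N, a, b, c);
    cudaDeviceSynchronize();
    for (int i = 0; i < N; i++) if (c[i] != 3) errors++;
    cudaFree(a); cudaFree(b); cudaFree(c);
    return errors;
}

int main(void) {
    dim3 gdims(2, 2, 2), bdims(4, 4, 4);   // 8 blocks of 64 threads = 512 threads
    int N = 500;                           // must not exceed the number of launched threads
    printf("wrong results: %d\n", run_add_3d(N, gdims, bdims));
    return 0;
}
```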

Two-dimensional Arrays

Host code:

    int main(void) {
        int i, j, n = 4, m = 3;
        int a[n][m], b[n][m], c[n][m];
        int *a_d, *b_d, *c_d;
        size_t sizea = n*m*sizeof(int);
        for (i = 0; i < n; i++) {
            for (j = 0; j < m; j++) {
                a[i][j] = -i - j;
                b[i][j] = i + j + 3;
            }
        }
        cudaMalloc( &a_d, sizea );
        cudaMalloc( &b_d, sizea );
        cudaMalloc( &c_d, sizea );
        cudaMemcpy( a_d, a, sizea, cudaMemcpyHostToDevice );
        cudaMemcpy( b_d, b, sizea, cudaMemcpyHostToDevice );
        dim3 block(5,5);
        add_d<<<1,block>>>( n, m, a_d, b_d, c_d );
        cudaMemcpy( c, c_d, sizea, cudaMemcpyDeviceToHost );
        cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    }

Complete code in ~ohaan/cuda_kurs/add_arrays.cu

Adding Two-dimensional Arrays

Device code:

    __global__ void add_d( int n, int m, int *a, int *b, int *c ) {
        int i, j, index;
        j = threadIdx.x;
        i = threadIdx.y;
        if ( i<n && j<m ) {
            index = i*m + j;
            c[index] = a[index] + b[index];
        }
    }

n rows: i = 0,...,n-1
m columns: j = 0,...,m-1
threadIdx.x = 0,...,blockDim.x-1
threadIdx.y = 0,...,blockDim.y-1
array index = threadIdx.x + threadIdx.y*m
thread index = threadIdx.x + threadIdx.y*blockDim.x

Large Two-dimensional Arrays

Host code:

    int main(void) {
        int i, j, n = 10000, m = 5000;
        int bdim_x = 16, bdim_y = 16;
        int gdim_x = (m+bdim_x-1)/bdim_x, gdim_y = (n+bdim_y-1)/bdim_y;
        // three automatic arrays of 200 MB each: requires a large stack limit
        int a[n][m], b[n][m], c[n][m];
        int *a_d, *b_d, *c_d;
        size_t sizea = n*m*sizeof(int);
        for (i = 0; i < n; i++) {
            for (j = 0; j < m; j++) {
                a[i][j] = -i - j;
                b[i][j] = i + j + 3;
            }
        }
        cudaMalloc( (void **) &a_d, sizea );
        cudaMalloc( (void **) &b_d, sizea );
        cudaMalloc( (void **) &c_d, sizea );
        cudaMemcpy( a_d, a, sizea, cudaMemcpyHostToDevice );
        cudaMemcpy( b_d, b, sizea, cudaMemcpyHostToDevice );
        dim3 blk(bdim_x,bdim_y), grd(gdim_x,gdim_y);
        add_d<<<grd,blk>>>( n, m, a_d, b_d, c_d );
        cudaMemcpy( c, c_d, sizea, cudaMemcpyDeviceToHost );
        cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    }

Complete code in ~ohaan/cuda_kurs/add_largearrays.cu

Adding Large Two-dimensional Arrays

Device code:

    __global__ void add_d( int n, int m, int *a, int *b, int *c ) {
        int i, j, index;
        j = blockIdx.x*blockDim.x+threadIdx.x;
        i = blockIdx.y*blockDim.y+threadIdx.y;
        if ( i<n && j<m ) {
            index = i*m + j;
            c[index] = a[index] + b[index];
        }
    }

n rows: i = 0,...,n-1
m columns: j = 0,...,m-1
2-dim array index: (i,j)
2-dim thread index: (blockIdx.x*blockDim.x+threadIdx.x, blockIdx.y*blockDim.y+threadIdx.y)

Timing of CUDA Codes

Read the internal clock before and after a code segment in host code.
Since kernel calls from host are asynchronous, host and device must be synchronized by cudaDeviceSynchronize() before reading the internal clock.

    double tstart = int_clock();
    ...
    kernel<<<grids,threads>>>(...);
    cudaDeviceSynchronize();
    double tend = int_clock();
    printf( "cpu time : %lf \n", tend-tstart );

Internal Clock for Elapsed Time

The C function gettimeofday returns the elapsed time with microsecond precision.

    #include <sys/time.h>
    double get_el_time() {
        struct timeval et;
        gettimeofday( &et, NULL );
        // seconds plus microseconds converted to seconds
        return (double)et.tv_sec + 1.e-6*(double)et.tv_usec;
    }

Code for get_el_time in ~ohaan/cuda_kurs/time.ut

Timing of CUDA Code with CUDA Events

Record CUDA events before and after a code segment in host code and query the time elapsed between them.

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord( start, 0 );
    ...
    kernel<<<grids,threads>>>(...);
    ...
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );
    float et;
    cudaEventElapsedTime( &et, start, stop );
    cudaEventDestroy( start ); cudaEventDestroy( stop );
    printf( "cpu time on device : %3.1f millisec \n", et );

Measuring Bandwidth for Adding Arrays

GPU: GeForce GTX 1080, nominal bandwidth 320 GB/s
Array size: 10 000 x 10 000
Performance depends on the memory access method and on the layout of blocks.

Separate memory for host and device:

    bdim_x = 1024, bdim_y = 1    : 244 GB/s
    bdim_x = 32,   bdim_y = 32   : 242 GB/s
    bdim_x = 1,    bdim_y = 1024 :  59 GB/s

Code for bandwidth measurement in ~ohaan/cuda_kurs/add_arrays_perf.cu

Unified memory:

    bdim_x = 1024, bdim_y = 1    :   5 GB/s

Code for bandwidth measurement with um in ~ohaan/cuda_kurs/add_arrays_um_perf.cu
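The slide does not say how the bandwidth figures were derived from the measured times. A plausible convention, assumed here, counts two reads and one write of 4 bytes per array element and takes 1 GB as 10^9 bytes. The helper name effective_bandwidth_gbs and the time of 5 ms are hypothetical.

```cuda
#include <stdio.h>

// Hypothetical helper: bandwidth in GB/s for c = a + b on n x m int arrays.
// Assumes two reads and one write per element and 1 GB = 1e9 bytes.
double effective_bandwidth_gbs(int n, int m, float millisec) {
    double bytes = 3.0 * n * m * sizeof(int);
    return bytes / (millisec * 1.0e-3) / 1.0e9;
}

int main(void) {
    // 5.0 ms is a made-up kernel time for illustration, not a measurement
    printf("%f GB/s\n", effective_bandwidth_gbs(10000, 10000, 5.0f));
    return 0;
}
```

Under these assumptions a 10 000 x 10 000 addition moves 1.2 GB, so a kernel time of 5 ms corresponds to 240 GB/s. The time in milliseconds would come from cudaEventElapsedTime.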

Memory Organization, Hardware View

[Diagram: the host consists of CPU, cache and main memory. The graphics device consists of main memory, an L2 cache and the Streaming Multiprocessors SMX 1 ... SMX n. Each SMX contains a control unit (schedule, dispatch), 32 bit registers, shared memory / L1 cache, and a constant + texture cache.]

Device Memory, Software View

[Diagram: every thread has its own local mem; the threads of a block share that block's shared mem; all blocks access global mem, constant mem and texture mem.]

Types of Kernel Variables: Local

Variables (scalars and arrays) defined in the scope of a kernel are local:

    __global__ void ker1(int laloc2, ..) {
        int iloc1, iloc2;
        float aloc1[6], *aloc2;
        aloc2 = (float *)malloc(sizeof(float)*laloc2);
        ...

Each thread has its own set of local variables. They are placed in the register files of the SMXes, or in global memory if there are not enough registers or if the variable is an indexed array.

Number of 32 bit registers per SMX: 2^16 = 65536
Maximal number of registers per thread: 255
With the maximal number of 2048 threads per SMX, the number of registers per thread is 65536/2048 = 32

Types of Kernel Variables: Global

Global variables (scalars and arrays) are defined in the scope of the application, reside in device main memory and are shared by all threads in the kernels.
If allocated dynamically by calling cudaMalloc() from host, they can be accessed from host by cudaMemcpy(...).
If allocated statically with the __device__ qualifier, they can be accessed from host by cudaMemcpyToSymbol(...), cudaMemcpyFromSymbol(...).
Accessing the same global variable from a kernel by different threads is not deterministic, since the order of execution for different blocks of threads is not prescribed.
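The static case can be sketched as follows. The array name a_dev, the kernel scale and the driver scale_on_device are hypothetical names chosen for this example.

```cuda
#include <stdio.h>

#define N 4
__device__ float a_dev[N];            // statically allocated global array

// Every thread multiplies one element of the global array.
__global__ void scale(float factor) {
    int i = threadIdx.x;
    if (i < N) a_dev[i] *= factor;
}

// Hypothetical driver: copies a_h into the device symbol, scales it on the
// device and copies the result back.
void scale_on_device(float *a_h, float factor) {
    cudaMemcpyToSymbol(a_dev, a_h, N * sizeof(float));
    scale<<<1, N>>>(factor);
    // this copy waits until the kernel in the default stream has finished
    cudaMemcpyFromSymbol(a_h, a_dev, N * sizeof(float));
}

int main(void) {
    float a_h[N] = {1.0f, 2.0f, 3.0f, 4.0f};
    scale_on_device(a_h, 2.0f);
    for (int i = 0; i < N; i++) printf("a[%d] = %f\n", i, a_h[i]);
    return 0;
}
```

Note that the symbol itself is passed to cudaMemcpyToSymbol and cudaMemcpyFromSymbol, not its address.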

Accessing Global Variables

Device memory is accessed by load/store operations for aligned memory segments of size 32, 64, or 128 bytes.
If the 32 threads of a warp access 32 int or float variables lying consecutively in memory (32 x 4 = 128 bytes), 4 load/store operations of 32 byte segments serve all 32 accesses (coalesced access).

Compare the performance of 2-dim array addition:

    blockDim.x = 32, blockDim.y = 32   : 242 GB/s
    blockDim.x = 1,  blockDim.y = 1024 :  59 GB/s

Types of Kernel Variables: Constant

Constant memory variables (scalars and arrays) are defined in the scope of the application, are read only, reside in device main memory, are cached in the constant cache of each SMX and are shared by all threads in the kernels.
They are allocated on the device with the __device__ __constant__ qualifier:

    __device__ __constant__ int sconst;
    __device__ __constant__ float aconst[1024];

They can be initialized from host with

    cudaMemcpyToSymbol(aconst, a_h, 1024*sizeof(float));

If all threads in a kernel read the same data, the use of constant memory variables reduces the accesses to device memory by employing the SMX's 8 KB sized constant caches.
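A typical case where all threads read the same data is a set of coefficients. The sketch below evaluates one polynomial at many points; the names coef, poly and eval_on_device are hypothetical, and unified memory is used only to keep the host code short.

```cuda
#include <stdio.h>

#define NCOEF 4
__constant__ float coef[NCOEF];       // coefficients read by all threads

// Each thread evaluates the same polynomial at its own x[i] (Horner scheme).
__global__ void poly(int n, const float *x, float *y) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        float r = 0.0f;
        for (int k = NCOEF - 1; k >= 0; k--) r = r * x[i] + coef[k];
        y[i] = r;
    }
}

// Hypothetical driver: y_h[i] = c_h[0] + c_h[1]*x + c_h[2]*x^2 + c_h[3]*x^3 at x = x_h[i].
void eval_on_device(const float *c_h, int n, const float *x_h, float *y_h) {
    float *x, *y;
    cudaMemcpyToSymbol(coef, c_h, NCOEF * sizeof(float));
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) x[i] = x_h[i];
    poly<<<(n + 255) / 256, 256>>>(n, x, y);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; i++) y_h[i] = y[i];
    cudaFree(x); cudaFree(y);
}

int main(void) {
    float c_h[NCOEF] = {1.0f, 2.0f, 3.0f, 4.0f};
    float x_h[3] = {0.0f, 1.0f, 2.0f}, y_h[3];
    eval_on_device(c_h, 3, x_h, y_h);
    for (int i = 0; i < 3; i++) printf("p(%f) = %f\n", x_h[i], y_h[i]);
    return 0;
}
```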

Types of Kernel Variables: Shared

Shared variables (scalars and arrays) are defined in the scope of a block of threads of a single kernel function and reside in the shared memory of the SMX executing the block of threads.
All threads of a block have access to the block's shared variables.
Threads of other blocks cannot access a block's shared variables.

Static allocation of one or more shared arrays in a kernel function:

    __global__ void ker1(...) {
        __shared__ float sh_float[64];
        __shared__ int sh_int[64];
        ...

Dynamic allocation of shared memory (in one single shared array):

Declaration outside kernel:

    extern __shared__ float sh_array[];

Allocation in host code via extended execution configuration:

    size_t N_sh_bytes = 64*sizeof(float);
    ker1<<<grid,block,N_sh_bytes>>>( );
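A small sketch of the dynamic variant: one block reverses an array by staging it in shared memory whose size is set at launch time. The kernel reverse and the driver reverse_on_device are hypothetical names; the example is limited to a single block, so n must not exceed 1024.

```cuda
#include <stdio.h>

extern __shared__ float sh_array[];   // size is set by the third launch parameter

// Reverses n floats within a single block, staging them in shared memory.
__global__ void reverse(int n, float *d) {
    int t = threadIdx.x;
    if (t < n) sh_array[t] = d[t];
    __syncthreads();                  // all writes to sh_array are done before any read
    if (t < n) d[t] = sh_array[n - 1 - t];
}

// Hypothetical driver: reverses a_h[0..n-1] in place on the device (n <= 1024).
void reverse_on_device(int n, float *a_h) {
    float *d;
    size_t N_sh_bytes = n * sizeof(float);
    cudaMalloc(&d, N_sh_bytes);
    cudaMemcpy(d, a_h, N_sh_bytes, cudaMemcpyHostToDevice);
    reverse<<<1, n, N_sh_bytes>>>(n, d);
    cudaMemcpy(a_h, d, N_sh_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
}

int main(void) {
    float a_h[5] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f};
    reverse_on_device(5, a_h);
    for (int i = 0; i < 5; i++) printf("%f\n", a_h[i]);
    return 0;
}
```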

Device Synchronization from Host

Synchronous calls: cudaMalloc, cudaMemcpy, ...
Asynchronous calls: kernel<<<...>>>(...), cudaMemcpyAsync, ...
A call in a host program to

    cudaDeviceSynchronize();

will block the host until all previously started activities of the device have completed.

Thread Synchronization from Device

In a device function, threads within a block can be synchronized by calling the barrier

    __syncthreads();

It waits until all threads in a block have reached this instruction and all accesses to global and shared memory from these threads are completed.
Danger of stalled execution:

    if (i < cut) __syncthreads();

will hang unless all threads in a block have i < cut or all have i >= cut.
The barrier is used to coordinate memory access from threads within a single block.
__syncthreads() cannot coordinate the execution of threads from different blocks.
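One way to stay clear of the stalled-execution problem is to keep the barrier outside of any condition that differs between threads of a block. The sketch below sums up to 256 values within one block; only some threads do work in each step, but every thread reaches the barrier. The names block_sum and sum_on_device and the block size are assumptions for this example.

```cuda
#include <stdio.h>

#define BLOCK 256

// Sums up to BLOCK values within one block by pairwise reduction in shared memory.
// Must be launched with exactly BLOCK threads in a single block.
__global__ void block_sum(int n, const float *b, float *result) {
    __shared__ float sh[BLOCK];
    int t = threadIdx.x;
    sh[t] = (t < n) ? b[t] : 0.0f;
    __syncthreads();
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (t < stride) sh[t] += sh[t + stride];   // only some threads work ...
        __syncthreads();                           // ... but every thread reaches the barrier
    }
    if (t == 0) *result = sh[0];
}

// Hypothetical driver: returns the sum of b_h[0..n-1], n <= BLOCK.
float sum_on_device(int n, const float *b_h) {
    float *b, *result, s;
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&result, sizeof(float));
    for (int i = 0; i < n; i++) b[i] = b_h[i];
    block_sum<<<1, BLOCK>>>(n, b, result);
    cudaDeviceSynchronize();
    s = *result;
    cudaFree(b); cudaFree(result);
    return s;
}

int main(void) {
    float b_h[100];
    for (int i = 0; i < 100; i++) b_h[i] = i + 1;   // 1 + 2 + ... + 100 = 5050
    printf("sum = %f\n", sum_on_device(100, b_h));
    return 0;
}
```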

Atomic Operations

Example: accumulate the content of array b into memory location a

Sequential on host:

    for (i = 0; i < n; i++) a = a + b[i];

Parallel in a kernel:

    if (i < n) a = a + b[i];

If several threads modify the content of the same address, the result depends on the temporal order of their operations.

Atomic Operations

Both threads read a before either writes (one update is lost):

    Thread 0                             Thread 1
    Read r1 from a    r1 = 0             Read r1 from a    r1 = 0
    r2 = r1 + b[0]    r2 = b[0]          r2 = r1 + b[1]    r2 = b[1]
    write r2 to a     a = b[0]           write r2 to a     a = b[1]

Thread 1 starts after thread 0 has finished (correct result):

    Thread 0                             Thread 1
    Read r1 from a    r1 = 0
    r2 = r1 + b[0]    r2 = b[0]
    write r2 to a     a = b[0]
                                         Read r1 from a    r1 = b[0]
                                         r2 = r1 + b[1]    r2 = b[0]+b[1]
                                         write r2 to a     a = b[0]+b[1]

Atomic Operations

An atomic function performs a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory.
The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads.

Atomic add:

    int atomicAdd(int* address, int val);

Many more atomic operations are supported: cf. CUDA Toolkit Programming Guide B12
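Applied to the accumulation example, the unsafe update a = a + b[i] can be replaced by a call to atomicAdd. The following sketch uses hypothetical names (accumulate, sum_with_atomics) and fills b with 1, 2, ..., n so that the expected result n(n+1)/2 is easy to check.

```cuda
#include <stdio.h>

// Every thread adds its element of b to the single memory location *a.
__global__ void accumulate(int n, const int *b, int *a) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) atomicAdd(a, b[i]);
}

// Hypothetical driver: sums b[i] = i + 1 for i = 0..n-1 on the device.
int sum_with_atomics(int n) {
    int *a, *b, result;
    cudaMallocManaged(&a, sizeof(int));
    cudaMallocManaged(&b, n * sizeof(int));
    *a = 0;
    for (int i = 0; i < n; i++) b[i] = i + 1;
    accumulate<<<(n + 255) / 256, 256>>>(n, b, a);
    cudaDeviceSynchronize();
    result = *a;
    cudaFree(a); cudaFree(b);
    return result;
}

int main(void) {
    int n = 10000;
    printf("sum = %d, expected %d\n", sum_with_atomics(n), n * (n + 1) / 2);
    return 0;
}
```

The result is correct for any execution order, but all updates to the same address are serialized, so a single accumulator of this kind can become a bottleneck for large n.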