CUDA Workshop: High Performance GPU Computing. EXEBIT 2014. Karthikeyan
CPU vs GPU. CPU: very fast serial execution, low latency. GPU: slower per thread, massively parallel, high throughput. (Play demonstration)
Compute Unified Device Architecture (CUDA). Exposes the GPU for general-purpose computing. Flexible and scalable architecture. Based on industry-standard C/C++, with a small set of extensions to enable heterogeneous programming. Straightforward APIs to manage devices, memory, etc. For NVIDIA GPUs only.
Concepts to be covered: heterogeneous computing; blocks and threads; indexing; shared memory; __syncthreads(); warps and divergence; asynchronous operation; handling errors; managing devices.
Heterogeneous Computing. CPU = Host, CPU RAM = Host Memory. GPU = Device, GPU RAM = Device Memory. (Figure: www.nvidia.com)
Hello World! GPU code (the kernel). __global__ indicates the function runs on the device. Triple angle brackets mark a call from host code to device code (a kernel launch). Kernels return void.
__global__ void mykernel(void) { cuPrintf("Hello World!\n"); }
int main(void) { mykernel<<<1,1>>>(); printf("CPU Hello World!\n"); return 0; }
Hello World! Compile and run: $ nvcc helloworld.cu $ ./a.out
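For reference, a minimal self-contained variant is sketched below; it assumes a device of compute capability 2.0 or later, so the built-in device-side printf can stand in for the cuPrintf library used on the slide.

#include <cstdio>

// Kernel: runs on the GPU; each launched thread prints one line.
__global__ void mykernel(void) {
    printf("Hello World from GPU thread %d!\n", threadIdx.x);
}

int main(void) {
    mykernel<<<1, 4>>>();        // one block of four threads
    cudaDeviceSynchronize();     // wait for the kernel so its output gets flushed
    printf("CPU Hello World!\n");
    return 0;
}

Compile with nvcc hello.cu and run ./a.out; without the cudaDeviceSynchronize() the program can exit before the GPU output appears.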
Working with the codes. Open a terminal and log in with ssh -X: user#@10.21.1.166 (users 1-25), guest@10.6.5.254 or guest@192.168.1.211 (users 26-50). Then: $ cd codes/helloworld/ $ make $ ./helloworld (use gedit & to edit the files).
Hello World! Parallel. Change the launch to mykernel<<<N,1>>>(); this launches N blocks. The CPU calls the kernel and then continues its own work.
__global__ void mykernel(void) { cuPrintf("Hello World!\n"); }
int main(void) { int N = 100; mykernel<<<N,1>>>(); printf("CPU Hello World!\n"); return 0; }
Compile: $ cd helloworld_blocks $ make $ ./helloworld_blocks
Processing Flow (1): copy input data over the PCI bus from Host Memory (CPU) to Device Memory (GPU).
Processing Flow (2): the CPU launches the kernel; the kernel accesses device memory at a much higher rate and utilizes on-chip cache memory.
Processing Flow (3): copy results back over the PCI bus from Device Memory (GPU) to Host Memory (CPU).
Device Memory Management.
cudaError_t cudaMalloc(void **devPtr, size_t size_bytes)
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)
cudaMemcpyHostToHost: Host -> Host; cudaMemcpyHostToDevice: Host -> Device; cudaMemcpyDeviceToHost: Device -> Host; cudaMemcpyDeviceToDevice: Device -> Device.
Example:
int a[100], *dev_a;
cudaMalloc(&dev_a, sizeof(int)*100);
cudaMemcpy(dev_a, a, sizeof(int)*100, cudaMemcpyHostToDevice);
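A minimal end-to-end sketch of the allocate/copy/free pattern (the 100-element size matches the slide; the kernel launch is left as a placeholder):

#include <cstdio>

int main(void) {
    const int N = 100;
    int a[N], b[N];
    int *dev_a;

    for (int i = 0; i < N; i++) a[i] = i;                            // fill host input
    cudaMalloc(&dev_a, sizeof(int) * N);                             // allocate device memory
    cudaMemcpy(dev_a, a, sizeof(int) * N, cudaMemcpyHostToDevice);   // host -> device
    /* ... launch kernels that operate on dev_a ... */
    cudaMemcpy(b, dev_a, sizeof(int) * N, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(dev_a);                                                 // release device memory

    printf("b[10] = %d\n", b[10]);
    return 0;
}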
Vector Addition. How does a thread identify which block it is in? Each block takes care of one element: blockIdx.x. Blocks can be laid out in 3 dimensions: blockIdx.x, blockIdx.y, blockIdx.z, with the grid size in gridDim (up to dim3(65535, 65535, 1024)). (Diagram: one vector element per block.)
Serial version: void vectoradd(int *a, int *b, int *c) { for (int i = 0; i < 100; i++) c[i] = a[i] + b[i]; }
CUDA version: __global__ void vectoradd(int *a, int *b, int *c) { int i = blockIdx.x; c[i] = a[i] + b[i]; }
Vector Addition (host code).
__global__ void vectoradd(int *a, int *b, int *c) { int i = blockIdx.x; c[i] = a[i] + b[i]; }
int main(void) {
    int host_a[100], host_b[100], host_c[100];
    int *dev_a, *dev_b, *dev_c;
    // Memory allocation
    cudaMalloc(&dev_a, sizeof(int)*100);
    cudaMalloc(&dev_b, sizeof(int)*100);
    cudaMalloc(&dev_c, sizeof(int)*100);
    // Memory copy
    cudaMemcpy(dev_a, host_a, sizeof(int)*100, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, host_b, sizeof(int)*100, cudaMemcpyHostToDevice);
    vectoradd<<<100,1>>>(dev_a, dev_b, dev_c);
    cudaMemcpy(host_c, dev_c, sizeof(int)*100, cudaMemcpyDeviceToHost);
    return 0;
}
Compile: $ cd vectoradd/ $ make $ ./vectoradd
Threads. A block can have many threads. For vector addition, the kernel launch would be vectoradd<<<1,N>>>(d_a, d_b, d_c); Maximum thread dimensions (3-dimensional): (1024, 1024, 64). Inside the kernel use threadIdx.x and blockDim.x.
__global__ void vectoradd(int *a, int *b, int *c) { int i = threadIdx.x; c[i] = a[i] + b[i]; }
Compile: $ cd vectoradd_threads/ $ make $ ./vectoradd_threads
Threads. A 3D mesh of threads inside a 3D mesh of blocks. Why threads? Threads in a block can communicate and synchronize; blocks cannot.
Built-in Variables. threadIdx.x, threadIdx.y, threadIdx.z. blockIdx.x, blockIdx.y, blockIdx.z. blockDim.x, .y, .z (max (1024, 1024, 64)): number of threads per block. gridDim.x, .y, .z (max (65535, 65535, 1024)): number of blocks in a kernel call (called a grid of blocks).
Index Calculation. Using blocks and threads simultaneously: i = threadIdx.x + blockDim.x * blockIdx.x; (Diagram: four blocks of eight threads; in each block threadIdx.x runs 0..7, with blockIdx.x = 0, 1, 2, 3.) blockDim.x = 8 (number of threads in a block), gridDim.x = 4 (number of blocks in the kernel launch). Launch: add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(...);
Boundary Conditions. Usually blockDim.x is a multiple of 32. Always put boundary conditions on the data size.
__global__ void vectoradd(int *a, int *b, int *c, int N) { int i = threadIdx.x + blockDim.x * blockIdx.x; if (i < N) c[i] = a[i] + b[i]; }
Compile: $ cd vectoradd_full/ $ make $ ./vectoradd_full
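On the host side, the block count is usually rounded up so that an N that is not a multiple of the block size is still covered; the if (i < N) guard inside the kernel then discards the excess threads. A short fragment, assuming the kernel and device pointers above (THREADS_PER_BLOCK = 256 is only an illustrative choice):

#define THREADS_PER_BLOCK 256

// Round up: e.g. N = 1000 launches 4 blocks (1024 threads); the last 24 threads do nothing.
int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
vectoradd<<<blocks, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c, N);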
For Very Large N. Very large N (N > 10^6): use a grid-stride loop.
__global__ void vectoradd(int *a, int *b, int *c, long N) { long i = threadIdx.x + blockDim.x * blockIdx.x; for (; i < N; i += gridDim.x * blockDim.x) c[i] = a[i] + b[i]; }
Compile: $ cd vectoradd_large/ $ make $ ./vectoradd_large
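With this loop the launch no longer needs one thread per element; any fixed grid size gives correct results, as long as it is large enough to keep the SMs busy (the 128 x 256 configuration below is just an illustrative choice):

// Each thread now processes many elements, striding by gridDim.x * blockDim.x.
long N = 10000000;
vectoradd<<<128, 256>>>(dev_a, dev_b, dev_c, N);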
Block Scheduling. Streaming Multiprocessors (SMs) are the executing units. Different GPUs have different numbers of SMs. There is communication among threads within a block, but no communication among blocks, and no specific order in block scheduling.
Block Scheduling. All threads in a block execute on a single SM. There is no guarantee on the order of execution; the hardware schedules blocks onto the available SMs. (Animation: 4 blocks scheduled onto 3 available SMs; BLOCK 4 is dispatched once an SM becomes free.)
1-D Stencil. Compute b(i) = a(i) + a(i+1) + a(i+2).
__global__ void stencil(int *a, int *b) { int i = threadIdx.x; b[i] = a[i] + a[i+1] + a[i+2]; }
(Diagram: thread 0 reads a[0..2], thread 1 reads a[1..3], and so on.)
Compile: $ cd 1dstencil/ $ make $ ./1dstencil
Global Memory. Till now we have been using global memory for our computations. It is very slow to access. Allocated using cudaMalloc(...).
1-D Stencil Revisited. Compute b(i) = a(i) + a(i+1) + a(i+2).
__global__ void stencil(int *a, int *b) { int i = threadIdx.x; b[i] = a[i] + a[i+1] + a[i+2]; }
Data could be shared among threads: adjacent threads read overlapping elements. (Diagram: threads 0 and 1 both read a[1] and a[2].) Cost: 3 global reads + 1 global write per thread.
Shared Memory. Memory shared among the threads inside a block; it cannot be accessed from another block. Declared inside kernel code: __shared__ int a[100]; On-chip, very fast.
1-D Stencil, Shared.
__global__ void stencil(int *a, int *b) {
    int i = threadIdx.x;
    __shared__ int sa[100];
    sa[i] = a[i];                          // copy to shared memory
    __syncthreads();                       // make every load visible before neighbours are read (see __syncthreads() below)
    b[i] = sa[i] + sa[i+1] + sa[i+2];      // write the result to global memory
}
Compile: $ cd 1dstencil_shared $ make $ ./1dstencil_shared
Shared memory is visible within a block only; it cannot be accessed by other blocks or by the CPU.
Access Times Registers (1-2 cycles) Shared memory (10 cycles) Global memory (100s of cycles) Local memory (100s of cycles)
Run time Comparison. Global-memory version: 3 global reads + 1 global write per thread = 3*100 + 100 = 400 cycles. Shared-memory version: 1 global read + 3 shared reads + 1 global write per thread = 1*100 + 3*10 + 1*100 = 230 cycles. Use nvprof ./file_name to see the runtime of the programs.
Memory Hierarchy. Registers: per thread, on-chip; data lifetime = thread lifetime. Local memory: per thread, off-chip (DRAM); data lifetime = thread lifetime. Shared memory: per thread block, on-chip; data lifetime = block lifetime. Global (device) memory: accessible by all threads and the host (CPU); data lifetime = entire program, from allocation to de-allocation. Host (CPU) memory: not directly accessible by CUDA threads.
__syncthreads(). Synchronizes all threads within a block: each thread waits until all threads in the block have reached the __syncthreads(). Used to prevent RAW, WAR, and WAW hazards (RAW = Read After Write, WAR = Write After Read, WAW = Write After Write). Synchronize to commit all memory writes, reads, and computation: __syncthreads();
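An illustrative kernel (not from the slides) where the barrier is essential: every thread reads a shared-memory element that a different thread wrote, so the reads must not begin until all writes are done.

#define BLOCK 256

// Reverse a BLOCK-element array in place via shared memory.
__global__ void reverse(int *d) {
    __shared__ int s[BLOCK];
    int i = threadIdx.x;
    s[i] = d[i];                  // write phase
    __syncthreads();              // without this, the read below may see stale data (RAW hazard)
    d[i] = s[BLOCK - 1 - i];      // read an element written by another thread
}

Launched as reverse<<<1, BLOCK>>>(dev_d); on a device array of BLOCK ints.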
Reduction. Addition of N numbers; other operations are possible: +, *, AND, OR, XOR, maximum, minimum, etc.
Serial version: void reduce(int *a, int *result) { *result = 0; for (int i = 0; i < 100; i++) *result = *result + a[i]; }
How to parallelize? Using the associative property: a+b+c+d = (a+b) + (c+d).
Reduction. N numbers take log2(N) steps to compute: share the results of the 1st step with other threads in the 2nd step, and so on. (Diagram: input 0..7; step 1 gives 1, 5, 9, 13; step 2 gives 6, 22; step 3 gives 28.) Some algorithms are not straightforward to implement in parallel.
Reduction kernel.
__global__ void reduce(int *a, int *result) {
    int i = threadIdx.x;
    __shared__ int s_a[N];
    s_a[i] = a[i];                       // read into shared memory
    __syncthreads();
    for (int stride = 1; stride < N; stride *= 2) {
        if (i % stride == 0 && 2*i + stride < N)
            s_a[2*i] = s_a[2*i] + s_a[2*i + stride];   // operate and write to shared memory
        __syncthreads();
    }
    if (i == 0) *result = s_a[0];        // write to global memory
}
Compile: $ cd reduction/ $ make $ ./reduction
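As written, the kernel reduces within a single block (N is assumed here to be a compile-time constant, a power of two, no larger than the block size), so a matching launch is one block of N threads:

// Launch one block of N threads; dev_result holds a single int.
reduce<<<1, N>>>(dev_a, dev_result);
cudaMemcpy(&host_result, dev_result, sizeof(int), cudaMemcpyDeviceToHost);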
CUDA programming model
CUDA programming model Blocks mapped to SM
Warps. Inside an SM, threads are split into groups of 32 threads called warps. All threads in a single warp execute in parallel. If the executing warp needs to wait or hits a barrier, it is put on hold and another warp is dispatched for execution; this is taken care of by the warp scheduler. All threads in a warp execute the SAME instruction.
Warps. No guarantee on the order in which warps are dispatched. GPU architectures: Tesla, Fermi, Kepler. Warp size = 32. Fermi: 2 warp schedulers, 2 instruction units.
Divergence. Alternate threads in a warp take different branches, so each warp takes 2 time steps:
if (threadIdx.x % 2 == 0) a[threadIdx.x] += 1; else a[threadIdx.x] += 2;
(Diagram: within Warp 1, even threads 0, 2, 4, 6, 8, ... take the if path while odd threads 1, 3, 5, 7, 9, ... take the else path.)
Divergence (avoided). All threads in a warp execute the same instruction, so each warp takes 1 time step:
if (threadIdx.x < 32) a[threadIdx.x] += 1; else a[threadIdx.x] += 2;
(Warp 1 takes the if path, Warp 2 the else path.)
Reduction revisited. Divergence at all strides: threads within a warp execute different instructions. Solution: modify the condition. (Diagram: input 0..7; partial sums 1, 5, 9, 13, then 6, 22.)
for (int stride = 1; stride < N; stride *= 2) {
    if (i % stride == 0)
        sa[2*i] = sa[2*i] + sa[2*i + stride];
    __syncthreads();
}
*result = sa[0];
Reduction (No Divergence). Add elements that are a stride apart. (Diagram: input 0..7; stride 4 gives 4, 6, 8, 10; stride 2 gives 12, 16; stride 1 gives 28.) No divergence for stride >= 32.
for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (i < stride)
        sa[i] = sa[i] + sa[i + stride];
    __syncthreads();
}
if (i == 0) *result = sa[0];
Compile: $ cd reduction_nodiv/ $ make $ ./reduction_nodiv
Resource allocation. Split your program into small kernels. Why? Each SM has limited registers and shared memory; the amounts depend on the compute capability of the GPU: 1.0, 1.1, 1.2, 1.3, 2.x, 3.0, 3.5, 5.0 (Tesla, Fermi, Kepler generations). Global memory is large (>512 MB). Check per-kernel usage with: nvcc -Xptxas=-v filename.cu
Resource limits. The number of thread blocks resident on an SM is limited by register usage, shared memory usage, the blocks-per-SM limit, and the threads-per-SM limit:
Limits            | 1.3   | 2.x   | 3.x   | 5.0
Registers/SM      | 16K   | 32K   | 64K   | 64K
Shared Memory/SM  | 16KB  | 48KB  | 48KB  | 64KB
Blocks/SM         | 8     | 8     | 16    | 32
Threads/SM        | 1024  | 1536  | 2048  | 2048
These limits determine occupancy.
Asynchronous. Kernel launches are asynchronous. cudaMemcpy and cudaMalloc are synchronous. cudaMemcpyAsync() is asynchronous and does not block the CPU. cudaDeviceSynchronize() blocks the CPU until all preceding CUDA calls have completed. Asynchronous calls let the CPU keep working while the GPU is busy.
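A minimal sketch of overlapping CPU work with an asynchronous copy. It assumes pinned host memory (cudaMallocHost) and a user-created stream; both are standard requirements for the copy to be truly asynchronous.

#include <cstdio>

int main(void) {
    const int N = 1 << 20;
    int *h_a, *d_a;
    cudaStream_t stream;

    cudaMallocHost(&h_a, sizeof(int) * N);   // pinned host memory, needed for async copies
    cudaMalloc(&d_a, sizeof(int) * N);
    cudaStreamCreate(&stream);

    // Returns immediately; the copy proceeds in the background on 'stream'.
    cudaMemcpyAsync(d_a, h_a, sizeof(int) * N, cudaMemcpyHostToDevice, stream);

    printf("CPU is free to do other work here\n");

    cudaStreamSynchronize(stream);           // block until the copy has finished
    cudaFree(d_a);
    cudaFreeHost(h_a);
    cudaStreamDestroy(stream);
    return 0;
}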
Handling Errors. All CUDA API calls return an error code (cudaError_t): an error in the API call itself, or an error in an earlier asynchronous operation (e.g. a kernel). Get the error code for the last error: cudaError_t cudaGetLastError(void). Get a string describing the error: const char *cudaGetErrorString(cudaError_t). Example: printf("%s\n", cudaGetErrorString(cudaGetLastError()));
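A common convenience, not specific to these slides, is to wrap every API call in a checking macro (the name CUDA_CHECK is arbitrary):

#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a CUDA API call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            printf("CUDA error: %s at %s:%d\n",                       \
                   cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&dev_a, bytes));
//   mykernel<<<blocks, threads>>>(dev_a);
//   CUDA_CHECK(cudaGetLastError());   // catches kernel-launch errors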
Device Management. An application can query and select GPUs: cudaGetDeviceCount(int *count), cudaSetDevice(int device), cudaGetDevice(int *device), cudaGetDeviceProperties(cudaDeviceProp *prop, int device). Multiple host threads can share a device, and a single host thread can manage multiple devices: cudaSetDevice(i) to select the current device, cudaMemcpy(...) for peer-to-peer copies.
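A short device-query sketch; the fields printed are standard members of cudaDeviceProp.

#include <cstdio>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d, %d SMs\n",
               dev, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    cudaSetDevice(0);   // make device 0 the current device for subsequent CUDA calls
    return 0;
}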
Summary. Write and launch CUDA C/C++ kernels: __global__, <<<>>>, blockIdx, threadIdx, blockDim. Manage GPU memory: cudaMalloc(), cudaMemcpy(), cudaFree(). Manage communication and synchronization: __shared__, __syncthreads(), cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize(). Resource limits: registers, shared memory, blocks/SM, threads/SM.
Advanced concepts (not covered): memory coalescing, constant memory, streams, atomics, shared memory bank conflicts, texture memory.
Tools. nvcc: NVIDIA compiler. nvprof: command-line profiler. nvvp: visual profiler. cuda-memcheck: memory bugs. Nsight: Visual Studio, Eclipse editions. Allinea DDT.
Libraries. CUBLAS: CUDA-accelerated Basic Linear Algebra Subprograms. CUFFT: Fast Fourier Transform (1D, 2D, 3D). Thrust: C++ template library (similar to the C++ STL). CULA: dense and sparse linear algebra. OpenCV: computer vision, image processing. AccelerEyes ArrayFire. Bindings: MATLAB, LabVIEW, Mathematica, Python. Applications: ABACUS, AMBER, ANSYS, GROMACS, LAMMPS, NAMD, ...
Online Resources. http://developer.nvidia.com/cuda-training. Coursera: Heterogeneous computing. Udacity CS344: Intro to Parallel Programming. GPU computing webinars. CUDA documentation. Books: CUDA by Example; Programming Massively Parallel Processors: A Hands-on Approach; GPU Gems.
Questions?