Debugging and Optimization strategies

Debugging and Optimization strategies
Philip Blakely
Laboratory for Scientific Computing, Cambridge

Writing correct CUDA code
You should start with a functional CPU/serial code.
If possible, port small parts to the GPU at a time.
The CPU and GPU codes should give exactly the same results at all times.
If possible, display intermediate results to test individual kernels.
A lot of the difficulty with CUDA is thinking about parallelism in the right way, and working out which threads and blocks should do what work.

Error checking
Always make use of error-return codes.
Use cudaGetLastError().
Common errors:
Mixing up host and device pointers - use e.g. an hData[] / dData[] naming convention.
Reversing the direction of copying in cudaMemcpy(): cudaMemcpy(dest, src, numBytes, cudaMemcpyKind).
Allocating too little memory - cudaMalloc() needs the number of bytes, so use sizeof():
    float3* a;
    cudaMalloc((void**)&a, sizeof(float3) * N);
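
As a concrete illustration of this pattern, here is a minimal sketch of checking API return codes and kernel launches; the CHECK_CUDA macro name and the scale kernel are illustrative, not part of the course code:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CHECK_CUDA(call)                                            \
      do {                                                              \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
          fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                  cudaGetErrorString(err), __FILE__, __LINE__);         \
          exit(EXIT_FAILURE);                                           \
        }                                                               \
      } while (0)

    __global__ void scale(float* dData, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) dData[idx] *= 2.0f;
    }

    int main()
    {
        const int N = 1024;
        float* dData = nullptr;                         // device pointer (dData naming convention)
        CHECK_CUDA(cudaMalloc((void**)&dData, sizeof(float) * N));
        scale<<<(N + 255) / 256, 256>>>(dData, N);
        CHECK_CUDA(cudaGetLastError());                 // catches launch-time errors
        CHECK_CUDA(cudaDeviceSynchronize());            // catches asynchronous execution errors
        CHECK_CUDA(cudaFree(dData));
        return 0;
    }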

Floating point mismatch
Moving from CPU code to GPU code will lead to changes in floating-point results.
Early CUDA-enabled GPUs did not conform to IEEE 754 exactly.
This is unlikely to lead to noticeable errors, except in the last decimal place.
Recall that floating-point addition is non-associative.
Upshot: if you are summing in parallel, the result may differ from the serial result.
Extra FMA (fused multiply-add) instructions on GPUs can also affect the result. See Floating Point on NVIDIA GPU.pdf.
Consider whether you need single-precision or double-precision arithmetic. The former is faster (by at least a factor of 8) but may not be accurate enough.
If any of this leads to stability problems, rethink your algorithm.
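
A minimal host-side sketch of the non-associativity point; the particular values are chosen only to make the effect obvious:

    #include <cstdio>

    int main()
    {
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
        // In exact arithmetic both expressions equal 1, but in single
        // precision c is absorbed when it is added to b first.
        printf("(a + b) + c = %f\n", (a + b) + c);   // prints 1.000000
        printf("a + (b + c) = %f\n", a + (b + c));   // prints 0.000000
        return 0;
    }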

Indexing errors
Always check the calculation of indices.
Usually blockIdx.x * blockDim.x + threadIdx.x.
You may have different expressions depending on the distribution of data between threads/thread-blocks; see the 2D sketch below.
Always check for out-of-bounds access.
It's common to have redundant threads that shouldn't access data - make sure they have an appropriate if statement so that they do nothing.
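
A sketch of a 2D index calculation with the out-of-bounds guard described above; the add2D kernel and the row-major layout are illustrative assumptions:

    __global__ void add2D(float* out, const float* a, const float* b, int nx, int ny)
    {
        // One thread per element of an nx-by-ny array, launched with a 2D grid.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < nx && j < ny) {          // redundant threads do nothing
            int idx = j * nx + i;        // row-major flattening
            out[idx] = a[idx] + b[idx];
        }
    }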

Race conditions
If your code occasionally gives errors, or gives different results for the same input, you've probably got a race condition, i.e. the final state of the code depends on the order of execution of threads/blocks.
Avoiding race conditions:
Never assume thread blocks are executed in a particular order.
Distinct thread blocks should not depend on each other's results.
The best approach is for a kernel to take data from one global array and compute a result into another array. Do not use the same array for input and output (see the sketch below).
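
A minimal sketch of the separate input/output arrays pattern, ping-ponging between two device buffers; the smooth kernel and the buffer handling are illustrative assumptions:

    #include <utility>
    #include <cuda_runtime.h>

    __global__ void smooth(const float* in, float* out, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < N - 1)
            out[i] = 0.5f * in[i] + 0.25f * (in[i - 1] + in[i + 1]);
        else if (i < N)
            out[i] = in[i];   // copy boundary values unchanged
    }

    void runSmoothing(float* dIn, float* dOut, int N, int nSteps)
    {
        int threads = 256, blocks = (N + threads - 1) / threads;
        for (int step = 0; step < nSteps; ++step) {
            smooth<<<blocks, threads>>>(dIn, dOut, N);  // never read and write the same array
            std::swap(dIn, dOut);                       // ping-pong between the two buffers
        }
        cudaDeviceSynchronize();
    }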

Global memory race conditions
These usually occur when two or more threads update a global memory value independently.
There is no guarantee in which order the updates will occur, and no guarantee which will succeed.
General rule: avoid updating the same global memory location from distinct threads.
(Unless you use atomic instructions, in which case you may have performance problems instead.)

Shared memory race conditions
These occur when distinct threads within a thread-block update a single shared memory location independently.
They can be overcome by correct use of __syncthreads(), which ensures that all threads in the block have reached the same point in the kernel.
General advice: when using shared memory, check your use of __syncthreads(); see the reduction sketch below.
It may be possible to avoid some of this because threads in the same warp advance in lock-step.
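
A minimal sketch of correct __syncthreads() placement in a block-level sum reduction; the blockSum kernel and the fixed block size of 256 are illustrative assumptions:

    __global__ void blockSum(const float* in, float* blockSums, int N)
    {
        __shared__ float cache[256];                 // assumes blockDim.x == 256 (a power of two)
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        cache[tid] = (i < N) ? in[i] : 0.0f;
        __syncthreads();                             // all loads complete before any thread reads cache

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                cache[tid] += cache[tid + stride];
            __syncthreads();                         // needed on every iteration, by every thread
        }
        if (tid == 0)
            blockSums[blockIdx.x] = cache[0];
    }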

Volatile variables
In the absence of __syncthreads() etc., the compiler is free to assume that no other thread will update a memory location.
It can therefore assume that adjacent reads give the same value, and reuse a previous value without re-reading memory.
To avoid this, use volatile:
    volatile float* aVol = a;
    float tmp1 = aVol[i];
    // Some calculation...
    float tmp2 = aVol[i];
Without volatile, the compiler may/will issue only one memory-read instruction, even if in practice the location might have been updated by another thread.

Basic debugging
Often the easiest form of debugging is printf statements.
This is allowed in CUDA device code for compute capability 2.x and later.
Output for all printf statements encountered will be emitted on the host.
There is no guarantee about the order in which print statements from different threads will appear.
The output buffer is of limited size (1MB), and is dumped to the screen at kernel launch or at cudaDeviceSynchronize().
If the buffer fills up, further output will overwrite the earliest output.
The buffer size can be increased using cudaDeviceSetLimit().
This feature should therefore only be used as a quick-and-dirty debugging tool.
Upshot: printf may be helpful, but absence of output does not imply absence of a thread reaching a particular point.
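
A minimal sketch of device-side printf and of enlarging the printf buffer with cudaDeviceSetLimit(); the whereAmI kernel and the 8 MB figure are illustrative:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void whereAmI()
    {
        printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
    }

    int main()
    {
        // Optionally enlarge the printf FIFO from the default 1 MB to 8 MB.
        cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 8 * 1024 * 1024);
        whereAmI<<<2, 4>>>();
        cudaDeviceSynchronize();   // flushes the device printf buffer to the host
        return 0;
    }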

Use of a debugger
The cuda-gdb debugger is an extension of gdb.
You need to compile with the -g -G options.
It allows explicit switching between threads/blocks and checking of values.
It allows stepping through kernels line-by-line.
It is not part of the practicals, as I've had problems using it in the past, and it locks up the compute card on which a kernel is being debugged.
See /lsc/opt/cuda-8.0/doc/pdf/cuda-gdb.pdf

cuda-memcheck
In a similar way to valgrind, cuda-memcheck can check for simple out-of-bounds memory access.
Ill-formed CUDA program (no bounds check on idx):
    __global__ void f(float* a, int N)
    {
        int idx = blockDim.x * blockIdx.x + threadIdx.x;
        a[idx] = sinf(idx);
    }

Memcheck output
    ========= CUDA-MEMCHECK
    ========= Invalid __global__ write of size 4
    =========     at 0x00000498 in f
    =========     by thread (32,0,0) in block (2,0)
    =========     Address 0x00201080 is out of bounds
    =========
    ========= ERROR SUMMARY: 1 error
See /lsc/opt/cuda-8.0/doc/pdf/CUDA_Memcheck.pdf

Memcheck continued
memcheck is capable of a large number of checks:
Out-of-bounds access checks
Malloc/free error-checks
CUDA API checks (although you should be doing this anyway)
Memory leaks
Race-condition checks (--tool racecheck)

CUDA features
A miscellany of CUDA features:
Textures
Warp-vote functions
Atomic operations
Streams

Textures
Textures are 1D, 2D, or 3D constant arrays whose elements have 1, 2, or 4 components, each an 8-bit, 16-bit, or 32-bit integer or float.
They provide some performance advantages.
They can linearly interpolate floating-point values.
They can deal gracefully with out-of-range indices.
The terminology and reason for existence are obviously left over from graphics.
They can be useful for some applications.

Textures - example application
MRI scanners produce visualizations of e.g. tumours.
Images taken at different times need to be matched up.
Medical Image Registration is used to find a transformation mapping the new image onto the old one.
It automatically optimizes the transformation parameters by computing a distance function between the images and minimizing it.
Determination of the distance function is very compute-intensive.
Ansorge et al. used CUDA with texture arrays to give a speed-up of 750. (Disclaimer: the serial code was not optimized!)

Textures - allocation
At file scope:
    texture<int4, 2, cudaReadModeElementType> picTexture;
In the main program:
    cudaChannelFormatDesc chanDesc = cudaCreateChannelDesc<int4>();
    cudaArray* picArray;
    cudaMallocArray(&picArray, &chanDesc, width, height);
    // (the array must also be bound to the texture reference, e.g. with
    // cudaBindTextureToArray(picTexture, picArray, chanDesc), before use)
The CUDA array can now be accessed as a texture within a kernel:
    int4 pixel = tex2D(picTexture, xTrans, yTrans);

Textures - setting attributes
Various attributes can be set for accessing a texture:
    picTexture.addressMode[0] = cudaAddressModeWrap;
    picTexture.addressMode[1] = cudaAddressModeWrap;
    picTexture.filterMode = cudaFilterModePoint;
    picTexture.normalized = true; // access with floating-point texture coordinates
Normalized:
true - access using [0, 1] coordinates
false - access using integer [0, N-1] coordinates
Wrap/Clamp bounds:
Wrap - wrap out-of-bounds coordinates to the other side
Clamp - clamp out-of-bounds coordinates to the nearest end of the range
Filter mode:
cudaFilterModePoint - just access the exact memory value
cudaFilterModeLinear - perform (9-bit) linear interpolation
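
A sketch of reading the texture inside a kernel with normalized coordinates; the sampleKernel name and launch layout are illustrative, and the picTexture declaration and binding from the previous slides are assumed:

    __global__ void sampleKernel(int4* out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            // With picTexture.normalized == true, coordinates lie in [0, 1];
            // the +0.5f offset samples the centre of each texel.
            float u = (x + 0.5f) / width;
            float v = (y + 0.5f) / height;
            out[y * width + x] = tex2D(picTexture, u, v);
        }
    }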

Warp vote functions
The following functions allow communication between all threads in a single warp (32 threads):
    int __all(int predicate);
returns a non-zero value iff predicate is non-zero on all active threads in the warp.
    int __any(int predicate);
returns a non-zero value iff predicate is non-zero on at least one (active) thread in the warp.
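
A minimal sketch of __any() in use (the pre-CUDA-9 form of the intrinsic); the warpHasNegative kernel and the warpFlags array are illustrative assumptions:

    __global__ void warpHasNegative(const float* data, int* warpFlags, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int pred = (idx < N) && (data[idx] < 0.0f);
        int anyNeg = __any(pred);          // non-zero if any active thread in this warp saw a negative
        if ((threadIdx.x & 31) == 0)       // lane 0 of each warp records the result
            warpFlags[idx / 32] = anyNeg;  // warpFlags must hold one int per launched warp
    }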

Warp vote functions ctd.
    unsigned int __ballot(int predicate);
returns an integer whose Nth bit is set iff predicate is non-zero on thread N.
Note that an unsigned int has 32 bits and there are 32 threads in a warp.
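
A sketch combining __ballot() with the __popc() population-count intrinsic to count, per warp, how many threads satisfy a predicate; the kernel and the counts array are illustrative:

    __global__ void countNegativesPerWarp(const float* data, int* counts, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int pred = (idx < N) && (data[idx] < 0.0f);
        unsigned int mask = __ballot(pred);   // bit N is set iff lane N's predicate is non-zero
        if ((threadIdx.x & 31) == 0)          // lane 0 of each warp writes the warp's count
            counts[idx / 32] = __popc(mask);  // number of set bits in the ballot mask
    }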

Warp-size note
From the Kepler Tuning Guide (Kepler Tuning Guide.pdf):
"Warp-synchronous programs are unsafe and easily broken by evolutionary improvements to the optimization strategies used by the CUDA compiler toolchain. Such programs also tend to assume that the warp size is 32 threads, which may not necessarily be the case for all future CUDA-capable architectures. Therefore, programmers should avoid warp-synchronous programming to ensure future-proof correctness in CUDA applications."
On the other hand, the preprocessor definition WARP_SIZE is available, and you could raise a compile-time error if this is not 32.

Atomic functions
In machine code, an operation such as i += 1, where i is in global or shared memory, requires three instructions:
Read i into a register
Increment i
Write i back to memory
which results in a potential race condition.
atomicAdd(&i, 1); uses only one instruction for global/shared memory.
Other atomic instructions: atomicSub(), atomicExch(), ...
Example use: keep a count of how many thread-blocks have completed (see the sketch below).
This nearly avoids the global memory access issue - but you still don't know in what order the updates will occur.
Atomics are only available for integer and 32-bit float operations, except for atomicAdd() for 64-bit floats on the Pascal architecture (CC 6.0).
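
A sketch of the block-counting use of atomicAdd() mentioned above; the work kernel and the blocksDone counter are illustrative (the counter must be zeroed before launch):

    __global__ void work(float* data, int N, unsigned int* blocksDone)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            data[idx] *= 2.0f;
        __syncthreads();                 // wait until the whole block has finished its work
        if (threadIdx.x == 0)
            atomicAdd(blocksDone, 1u);   // one atomic increment per block; update order is unspecified
    }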

Streams
Kernels can be run concurrently on the latest Fermi and Kepler GPUs.
This is done via streams.
Each stream is independent - no communication between them is possible.
Any attempt at communication between streams via GPU memory is very likely to give race conditions, with no way to reason about them.

Streams ctd.
Streams example:
    cudaStream_t stream[2];
    for (int i = 0; i < 2; i++)
        cudaStreamCreate(&stream[i]);
    for (int i = 0; i < 2; i++)
        f<<<100, 200, 0, stream[i]>>>(myArray[i]);
If possible, the two calls of f will execute concurrently, on separate data.
To synchronize all kernel calls and memory copies for a stream: cudaStreamSynchronize()
To make sure all streams have completed: cudaThreadSynchronize()
To destroy streams: cudaStreamDestroy(stream[i])
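
Putting the pieces above together, a minimal sketch of the full stream lifecycle, including asynchronous copies; the kernel f, the device/host buffers, and numBytes are illustrative assumptions, and the host buffers are assumed to be pinned (allocated with cudaMallocHost) so that cudaMemcpyAsync can overlap with kernel execution:

    cudaStream_t stream[2];
    for (int i = 0; i < 2; i++)
        cudaStreamCreate(&stream[i]);

    for (int i = 0; i < 2; i++) {
        // Copy input, run the kernel, copy output - all queued on stream[i].
        cudaMemcpyAsync(dArray[i], hInput[i], numBytes, cudaMemcpyHostToDevice, stream[i]);
        f<<<100, 200, 0, stream[i]>>>(dArray[i]);
        cudaMemcpyAsync(hOutput[i], dArray[i], numBytes, cudaMemcpyDeviceToHost, stream[i]);
    }

    for (int i = 0; i < 2; i++) {
        cudaStreamSynchronize(stream[i]);   // wait for everything queued on this stream
        cudaStreamDestroy(stream[i]);
    }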