COMP528: Multi-core and Multi-Processor Computing

1 COMP528: Multi-core and Multi-Processor Computing Dr Michael K Bane, G14, Computer Science, University of Liverpool m.k.bane@liverpool.ac.uk 21

2 You should compute all values of x[i] and of f(x[i]) before you do anything to find the MAXIMUM of f(x[i]). Your CUDA kernel is required to determine f(x[i]), and these values should be available to the host after the device has finished calculating them. Note that this CUDA kernel should not do anything to determine the MAXIMUM of f(x[i]). Only when x[i] and f(x[i]) have been calculated for all values of i should you determine, in any efficient way you desire, the MAXIMUM value of f(x[i]) and the x[i] to which it relates. That is, the output of your program is the approximate MAXIMUM of the continuous function f(x) between -100 and +100. You should parallelise the formation of x[i] but may use OpenMP and/or CUDA, explaining your choice.
Diagram (work split between CPU and GPU):
- determine (in parallel) x[i]: on ???
- calc f(x[i])
- have access to f(x[i])
- determine MAXIMUM of f(x[i]): on ???
- output MAXIMUM of f(x) between -100 and +100, and for which value of x

3 Recap of GPU & CUDA
- GPU specifically very good for specific workloads
- hw thread switching at very low cost
- run async to CPU
- directives & OpenCL & CUDA
- CUDA threads, block of threads, grid of blocks
- thread index, block index, grid index ==> control the parallelism of the hardware
- CUDA cores, Streaming Multiprocessors, GPU
- kernel: call syntax; function syntax; use of thread/block index
- allocating & copying data, synchronously

4 Steps to Porting Serial to CUDA
1. Determine work that has inherent parallelism
2. Move (serial) work to a "kernel":
   __global__ void cuda_kernel(x, y, z) {
     // parallel control via varying index
     my_i = threadIdx.x + blockIdx.x*blockDim.x;
     z[my_i] = x[my_i] + y[my_i];
     // note there is NO 'for' loop over index
   }
3. (a) Allocate kernel vars
   (b) Initialise kernel vars, typically by copying data from h->d using cudaMemcpy()
   (c) Add call to parallel CUDA kernel using <<<blks, thdsperblk>>>; the kernel runs on the device asynchronously
   (d) Copy results from d->h using cudaMemcpy()
Based upon "Steps to CUDA", High End Compute Ltd
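The steps above can be sketched as one small, complete program. This is a minimal sketch rather than the lecture's own code: the names vecAddKernel, runVecAdd and the problem size are illustrative assumptions, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Step 2: the kernel. One thread per element, so there is no 'for' loop.
__global__ void vecAddKernel(const float *x, const float *y, float *z, int n)
{
    int my_i = threadIdx.x + blockIdx.x * blockDim.x;
    if (my_i < n) {                      // guard: the grid may overshoot n
        z[my_i] = x[my_i] + y[my_i];
    }
}

// Hypothetical helper: runs steps 3(a)-(d) and returns the number of wrong results.
static int runVecAdd(int n)
{
    const size_t bytes = (size_t)n * sizeof(float);
    float *h_x = (float *)malloc(bytes);
    float *h_y = (float *)malloc(bytes);
    float *h_z = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_x[i] = 1.0f * i; h_y[i] = 2.0f * i; }

    // 3(a) allocate device variables
    float *d_x, *d_y, *d_z;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);
    cudaMalloc((void **)&d_z, bytes);

    // 3(b) initialise device variables by copying h->d
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // 3(c) launch the kernel (asynchronous with respect to the host)
    const int thdsPerBlk = 256;
    const int blks = (n + thdsPerBlk - 1) / thdsPerBlk;
    vecAddKernel<<<blks, thdsPerBlk>>>(d_x, d_y, d_z, n);

    // 3(d) copy results d->h; this cudaMemcpy waits for the kernel to finish
    cudaMemcpy(h_z, d_z, bytes, cudaMemcpyDeviceToHost);

    int errors = 0;
    for (int i = 0; i < n; i++) {
        if (h_z[i] != 3.0f * i) errors++;
    }

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
    free(h_x); free(h_y); free(h_z);
    return errors;
}

int main(void)
{
    printf("mismatches: %d\n", runVecAdd(1 << 20));
    return 0;
}
```

The bounds check in the kernel is needed because the number of blocks is rounded up, so the last block may contain threads whose index is beyond n.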

5 Further Information
- CUDA by Example: An Introduction to General-Purpose GPU Programming, Sanders & Kandrot (2011)
  - Chapter 3: calling a CUDA kernel
  - Section 4.2.1: data transfers
  - Section 5.3: shared memory & synchronisation
  - Section 6.3: events / timing
  - Chapter 10: streams (asynchronisation)
- NVIDIA's CUDA web/resources
- GPU Gems, eg download from NVIDIA

6 based on what we saw in lecture 20
initcpu.cu
- initialised x[] and y[] on CPU then transferred to GPU
- 1 kernel vecaddkernel: forms z[i] = x[i] + y[i]
- copy z[] from GPU to CPU
initgpu.cu
- 1 kernel: initialises x & y, then forms z=x+y
- copy z[] from GPU to CPU
how much faster?

7 Timing CUDA
- makes use of the CUDA event API
- events make use of CUDA streams
- CUDA streams: items in a given stream are executed in that order, eg start timer, run kernel, stop timer
- possible (with hw support) to run 2 (or more) streams per GPU: further asynchronicity
- need to be aware of where synchronisations are so that we time what we want to time

8 CUDA Event: timing example
  cudaEvent_t start, stop;
  cudaEventCreate(&start); cudaEventCreate(&stop);   // set-up
  cudaEventRecord(start, 0);    // timestamp to start (stream 0)
  { thing to time }
  cudaEventRecord(stop, 0);     // timestamp to stop
  cudaEventSynchronize(stop);   // wait until the stop event has been recorded
  float etime;
  cudaEventElapsedTime(&etime, start, stop);   // get time in milliseconds
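A complete version of the event timing pattern might look like the following. It is a sketch under assumptions: scaleKernel and timeKernelMs are hypothetical names chosen for illustration, and the kernel is just a placeholder for the "thing to time".

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder workload (hypothetical): doubles every element.
__global__ void scaleKernel(float *a, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) a[i] = 2.0f * a[i];
}

// Returns the elapsed time of one kernel launch in milliseconds,
// or a negative value if the device allocation fails.
static float timeKernelMs(int n)
{
    float *d_a = NULL;
    if (cudaMalloc((void **)&d_a, (size_t)n * sizeof(float)) != cudaSuccess) {
        return -1.0f;
    }
    cudaMemset(d_a, 0, (size_t)n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                  // enqueue start marker in stream 0
    scaleKernel<<<(n + 255) / 256, 256>>>(d_a, n);
    cudaEventRecord(stop, 0);                   // enqueue stop marker in stream 0
    cudaEventSynchronize(stop);                 // host waits until stop is reached

    float etime = 0.0f;
    cudaEventElapsedTime(&etime, start, stop);  // result is in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    return etime;
}

int main(void)
{
    printf("kernel took %f ms\n", timeKernelMs(1 << 20));
    return 0;
}
```

Because both events sit in the same stream as the kernel, the measured interval covers the kernel's execution on the device, not just the asynchronous launch call on the host.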

9 initgpu+cudatime.cu
- 1 kernel: initialises x & y, then forms z=x+y
- copy z[] from GPU to CPU
- CUDA timers: kernel launch time (since async launch)
- so where is the rest of the 5 seconds?
initgpu+cudatime+openmptimer.cu
- 1 kernel: initialises x & y, then forms z=x+y
- copy z[] from GPU to CPU
- CUDA timers
- OpenMP timer: easy to time whole/segments of code
- missing 5 s is pro/epi-logue
NOTES POST LECTURE
- it was pointed out by student/s (with thanks) that all initial CUDA calls were taking about 5 seconds
- investigation with chadwick sys admin determined the need to set persistence on the cards, see
- having applied this, the 5 seconds per initial call has disappeared
- the timings on these slides, and comments about the missing 5 seconds, need to be re-evaluated

10 now we can instrument initcpu likewise then time both initcpu and initgpu to compare

11 initgpu+cudatime+openmptimer.cu (as before but now apply timers to initcpu)
- 1 kernel: initialises x & y, then forms z=x+y
- copy z[] from GPU to CPU
- CUDA & OpenMP timers
initcpu.cu
- initialised x[] and y[] on CPU then transferred to GPU
- 1 kernel vecaddkernel: forms z[i] = x[i] + y[i]
- copy z[] from GPU to CPU
- CUDA & OpenMP timers
COMPARISON
- initcpu: init on CPU, transfer, do kernel & transfer result = 1.2 ms, of which kernel = 0.28 ms
- initgpu: kernel & transfer result = 0.30 ms (NB init & addition in kernel)

12 what about async copy?
- CPU: init(x), then async copy whilst init(y)
- initcpu_async+timers.cu actually goes slower
- seems we need to be very mindful of:
  - pinning host memory
  - # of engines (doing copying) for the hardware

13 Asynchronous
- By use of CUDA streams we can have overlapping events
- Such as async memcpy: cudaMemcpyAsync(*dest, *src, countbytes, direction, stream)
- CPU: init(x), then async copy whilst init(y) ==> maybe get cost of copying x for free, dependent on cost of init(y) wrt cost of copying x in background
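The overlap of init(y) with the copy of x could be sketched as below. This is an illustration under assumptions, not the lecture's initcpu_async code: initArray and asyncCopyDemo are hypothetical names, and the host buffers are allocated as pinned memory with cudaMallocHost, since the copy is only expected to overlap with host work when the host memory is pinned.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical CPU-side initialisation of an array.
static void initArray(float *a, int n, float value)
{
    for (int i = 0; i < n; i++) a[i] = value;
}

// Copies x to the device in the background while the CPU initialises y.
// Returns 0 if the data arrived on the device as expected, 1 otherwise.
static int asyncCopyDemo(int n)
{
    const size_t bytes = (size_t)n * sizeof(float);
    float *h_x, *h_y, *d_x, *d_y;

    cudaMallocHost((void **)&h_x, bytes);   // pinned (page-locked) host memory
    cudaMallocHost((void **)&h_y, bytes);
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    initArray(h_x, n, 1.0f);
    // Returns to the host immediately; the copy proceeds in the stream.
    cudaMemcpyAsync(d_x, h_x, bytes, cudaMemcpyHostToDevice, stream);

    initArray(h_y, n, 2.0f);                // CPU work that may overlap the copy of x
    cudaMemcpyAsync(d_y, h_y, bytes, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);          // wait for both copies to complete

    // Read back one element of each array to check the transfers.
    float cx = 0.0f, cy = 0.0f;
    cudaMemcpy(&cx, d_x + (n - 1), sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&cy, d_y + (n - 1), sizeof(float), cudaMemcpyDeviceToHost);

    cudaStreamDestroy(stream);
    cudaFree(d_x); cudaFree(d_y);
    cudaFreeHost(h_x); cudaFreeHost(h_y);
    return (cx == 1.0f && cy == 2.0f) ? 0 : 1;
}

int main(void)
{
    printf("async copy check: %s\n", asyncCopyDemo(1 << 20) == 0 ? "ok" : "failed");
    return 0;
}
```

Whether this is actually faster depends on the relative cost of init(y) and of the copy, and on the hardware's copy engines, so it is worth timing both variants.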

14 streams => task parallelism threads => data parallelism

15 Kernel: local variables
- Desirable. WHY?
- Yes, possible BUT if statically allocated, the size needs to be a compile-time constant, eg
    #define maxn 1099
    __global__ void kernel() { int x[maxn]; }
- CC>=2.0 allows in-kernel dynamic memory allocation via malloc
- HOWEVER, take care, since each thread now allocates its own memory and may not need everything it is allocating
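Both styles of per-thread local array can be sketched side by side. The kernel names, MAXN and localArrayDemo are illustrative assumptions; the dynamic version draws from the device heap, so the allocation is checked and freed inside the kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAXN 64   // size must be known at compile time for the static case

// Statically sized per-thread array.
__global__ void staticLocalKernel(int *out)
{
    int x[MAXN];
    int sum = 0;
    for (int k = 0; k < MAXN; k++) x[k] = k;
    for (int k = 0; k < MAXN; k++) sum += x[k];
    out[threadIdx.x + blockIdx.x * blockDim.x] = sum;
}

// Dynamically sized per-thread array (in-kernel malloc, CC >= 2.0).
__global__ void dynamicLocalKernel(int *out, int len)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int *x = (int *)malloc(len * sizeof(int));   // every thread allocates its own block
    if (x == NULL) { out[tid] = -1; return; }    // the device heap can run out
    int sum = 0;
    for (int k = 0; k < len; k++) x[k] = k;
    for (int k = 0; k < len; k++) sum += x[k];
    out[tid] = sum;
    free(x);
}

// Returns 0 if every thread produced the expected sums, 1 otherwise.
static int localArrayDemo(void)
{
    const int nThreads = 64, len = 10;
    int h_out[64];
    int *d_out;
    int bad = 0;
    cudaMalloc((void **)&d_out, nThreads * sizeof(int));

    staticLocalKernel<<<2, nThreads / 2>>>(d_out);
    cudaMemcpy(h_out, d_out, nThreads * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < nThreads; i++) if (h_out[i] != MAXN * (MAXN - 1) / 2) bad = 1;

    dynamicLocalKernel<<<2, nThreads / 2>>>(d_out, len);
    cudaMemcpy(h_out, d_out, nThreads * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < nThreads; i++) if (h_out[i] != len * (len - 1) / 2) bad = 1;

    cudaFree(d_out);
    return bad;
}

int main(void)
{
    printf("local array check: %s\n", localArrayDemo() == 0 ? "ok" : "failed");
    return 0;
}
```

The dynamic version trades flexibility for cost: the total memory requested grows with the number of threads launched, which is the concern raised on the slide.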

16 __global__ qualifier: CUDA kernel called from the host
__device__ qualifier: CUDA function called from the GPU (ie from a __global__ or another __device__ qualified function)
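A short sketch of the two qualifiers working together follows. The function names square, squareKernel and qualifierDemo are hypothetical, chosen only for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __device__: runs on the GPU and can only be called from GPU code.
__device__ float square(float v)
{
    return v * v;
}

// __global__: a kernel, launched from the host with <<<...>>>.
__global__ void squareKernel(float *a, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) a[i] = square(a[i]);   // device function called from the kernel
}

// Returns 0 if all n results are correct, 1 otherwise (n <= 256 here).
static int qualifierDemo(int n)
{
    float h_a[256];
    float *d_a;
    if (n > 256) return 1;
    for (int i = 0; i < n; i++) h_a[i] = (float)i;

    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    squareKernel<<<(n + 63) / 64, 64>>>(d_a, n);
    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);

    for (int i = 0; i < n; i++) {
        if (h_a[i] != (float)(i * i)) return 1;
    }
    return 0;
}

int main(void)
{
    printf("qualifier check: %s\n", qualifierDemo(100) == 0 ? "ok" : "failed");
    return 0;
}
```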

17 CUDA Memory

18 CUDA Memory Model
thread level
- registers (rw). Tesla K80: 64K of 32-bit regs
- local memory (rw)
thread block level
- shared memory (rw). Tesla K80: 48 KB / block
grid level (the GPU)
- global memory (rw), generally used for CPU-GPU comms. Tesla K80: 11.4 GB
- constant memory (r only). Tesla K80: 64 KB
- texture memory (r only)
Image downloaded from: https://www.3dgep.
speed of memory:
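To illustrate the block-level shared memory in the model above, here is a minimal sketch of a per-block sum. It is an assumption-laden example rather than course code: blockSumKernel, sharedSumDemo and THREADS are hypothetical names, and the block size is assumed to be a power of two.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS 256   // threads per block; must be a power of two for this reduction

// Each block sums its slice of 'in' using shared memory visible to all its threads.
__global__ void blockSumKernel(const float *in, float *blockSums, int n)
{
    __shared__ float cache[THREADS];      // one copy per thread block
    int tid = threadIdx.x;
    int i = tid + blockIdx.x * blockDim.x;

    cache[tid] = (i < n) ? in[i] : 0.0f;  // global memory -> shared memory
    __syncthreads();                      // all loads done before anyone reads

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) cache[tid] += cache[tid + stride];
        __syncthreads();                  // finish this level before the next
    }
    if (tid == 0) blockSums[blockIdx.x] = cache[0];
}

// Sums n ones on the GPU (per block) and finishes the sum on the host.
static float sharedSumDemo(int n)
{
    const int blks = (n + THREADS - 1) / THREADS;
    float *h_in = (float *)malloc(n * sizeof(float));
    float *h_sums = (float *)malloc(blks * sizeof(float));
    for (int i = 0; i < n; i++) h_in[i] = 1.0f;

    float *d_in, *d_sums;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_sums, blks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSumKernel<<<blks, THREADS>>>(d_in, d_sums, n);
    cudaMemcpy(h_sums, d_sums, blks * sizeof(float), cudaMemcpyDeviceToHost);

    float total = 0.0f;
    for (int b = 0; b < blks; b++) total += h_sums[b];

    cudaFree(d_in); cudaFree(d_sums);
    free(h_in); free(h_sums);
    return total;
}

int main(void)
{
    printf("sum = %f\n", sharedSumDemo(1000));
    return 0;
}
```

The input and the per-block results live in global memory, which is what the host can copy to and from; the shared array exists only for the lifetime of each block.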

Dr Michael K Bane, G14, Computer Science, University of Liverpool m.k.bane@liverpool.ac.uk https://cgi.csc.liv.ac.uk/~mkbane/comp528