COMP528: Multi-core and Multi-Processor Computing



1 COMP528: Multi-core and Multi-Processor Computing Dr Michael K Bane, G14, Computer Science, University of Liverpool m.k.bane@liverpool.ac.uk 19

2 Logistics
09:00, Monday lectures: now in ASHTON LECTURE THEATRE
Your feedback on lectures:
Your assignment I will marks per category. Mean score (of those returning, ignoring late penalties): 60
Assignment 2 submission deadline: 11am, Friday 16 Nov
Extenuating Circumstances/Sickness: need to speak to Lindsay/Student Office, maybe complete a form, ideally before deadline
Assignment 3: available w/c Mon 19 Nov. Assignment 4 after that.

3 Why GPU?
Doing the same thing, on a LOT of data items: weak but specific cores
As quickly as possible: concurrently
Apply this concurrency to numerical processing
Different ways to program:
  directives
  CUDA
  OpenCL

4 Top500 (Nov 2018): what do you notice?

5 Green500

6 CES 2016 Chadwick nodes: 33 TF

7 Terminology
Host: the CPU (and its memory)
  maybe several cores
  at given time, each core could be doing something different
Device: the GPU (and its memory)
  many cores
  designed to run 1000s of threads, lightweight switching
  threads run same code ("kernel")
  the host code will call N copies of the kernel, one running on each of N threads
Threads
  lightweight
  will have unique identifier: helps with how to parallelise work of the kernel

8 NVIDIA GPU Hierarchy
Hardware
CUDA core
  where threads run and the work is performed
  traditionally lots of support for integer & single precision (float)
  increasingly more double precision (double) support AND half-int (for?)
Streaming Multiprocessor (SM, (SMP))
  a gang of threads
the GPU
  a gang of SMs
deviceQuery (CUDA SDK examples) -> gives details of actual GPU

9 Chadwick
2 nodes with GPUs
  visu1: 2 * 6 cores 2.53GHz, i.e. 12 cores total; 1 * Quadro 5000
  visu2: 2 * 8 cores HaswellE GHz, i.e. 16 cores total; 2 * Tesla K80
1. only 2 compute nodes: so if these busy you may have to wait
2. each of these has 2 [NVIDIA] GPUs, but different models
Which is best model?

10 Chadwick: GPU Models
With CUDA SDK examples: deviceQuery
  lots of CUDA calls to query GPU devices, then print out properties
    qrsh -V -cwd -pe smp 1 -l gputype=\* -now no ./shownodegpu.sh
    qrsh -V -cwd -pe smp 1 -l gputype=tesla -now no ./shownodegpu.sh
for timings, would use -pe smp NN -l exclusive with NN=12 or 16
perhaps, would pin (bind) your code to a given core too
DEMO
    module load cuda-8.0
    qrsh -V -cwd -pe smp 1 -l gputype=\* -now no ./shownodegpu.sh

11
                   K80        Quadro 5000
CUDA capability    3.7        2.0
Global memory      11.4 GB    2.5 GB
SMs                13         11
CUDA cores per SM  192        32
CUDA cores         2496       352
Warp size          32         32
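Most of the properties listed above can be read with the CUDA runtime call cudaGetDeviceProperties. The following is a minimal sketch of that idea, not the SDK's deviceQuery itself; bytes_to_gib is a hypothetical helper introduced only for printing. Note that the cudaDeviceProp structure has no "CUDA cores per SM" field: to my understanding the SDK sample derives that figure from the compute capability, so it is left out here.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: convert a byte count to GiB for printing.
double bytes_to_gib(size_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        std::printf("Device %d: %s\n", dev, prop.name);
        std::printf("  CUDA capability:       %d.%d\n", prop.major, prop.minor);
        std::printf("  Global memory:         %.1f GiB\n", bytes_to_gib(prop.totalGlobalMem));
        std::printf("  SMs:                   %d\n", prop.multiProcessorCount);
        std::printf("  Warp size:             %d\n", prop.warpSize);
        std::printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}
```

Assuming the file is saved as query.cu, it could be built with "nvcc query.cu -o query" after "module load cuda-8.0" and run on a GPU node.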

12 Threads etc
Hardware
  # CUDA cores in a SM
  # SM in a GPU
  3 levels: CUDA cores, SM, GPU
[Image: CUDA core, SM]
Software [CUDA runtime] mirrors this
  threads
  thread blocks (or "blocks")
  kernel grid

13 Threads, Blocks & Warps
Definitive ref: http://docs.nvidia.com/cuda/cuda-c-programmingguide/index.
For today
  The thread block is the quantum to consider
  A block is placed on a single SM
  Can have multiple blocks per SM
  SM splits a thread block into warps (32 threads)
  If a warp is blocked (mem access), switch to another warp: latency hiding
  Warps run same instructions, in lockstep fashion
KEY PRINCIPLE: threads run concurrently: parallelism
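As a small worked example of how a block is split into warps: a block whose size is not a multiple of 32 still occupies a whole number of warps, so the count rounds up. The sketch below uses a hypothetical helper, warps_per_block, and block sizes chosen only for illustration.

```cuda
#include <cstdio>

// Hypothetical helper: number of warps an SM needs for one thread block.
// Rounds up, because a partly filled warp still occupies a warp.
__host__ __device__ int warps_per_block(int threads, int warp_size) {
    return (threads + warp_size - 1) / warp_size;
}

int main() {
    const int warp_size = 32;                // warp size reported for both GPUs above
    const int sizes[] = {5, 32, 100, 256};   // illustrative block sizes
    for (int t : sizes) {
        std::printf("%3d threads per block -> %d warp(s)\n",
                    t, warps_per_block(t, warp_size));
    }
    return 0;
}
```

For instance, 100 threads per block need 4 warps (128 thread slots), so 28 slots do no useful work; this is one reason block sizes are often chosen as multiples of 32.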

14 What about Memory? we will return to this next lecture!

15 CUDA

16 Reading/Background Materials
CUDA by example: an introduction to general-purpose GPU programming, Sanders & Kandrot (2011)
  hard copies in the library
NVIDIA's CUDA web/resources https://www.nvidia.
GPU Gems
  e.g. download from NVIDIA: GPUGems/gpugems_pref01.html
  only the parts on computation (not the parts on graphics!)

17 Steps to CUDA
1. Determine work that has inherent parallelism
2. Move (serial) work to a "kernel"
3. Invoke a parallel kernel by use of CUDA
Based upon "Steps to CUDA", High End Compute Ltd

18 CUDA by Example: CUDA kernel
Serial version:
    serial_kernel(x, y, z, num) {
      for (i = 0; i < num; i++) {
        z[i] = x[i] + y[i];
      }
    }

    start = clock();
    serial_kernel(x, y, z, num);
    finish = clock();
CUDA version:
    __global__ cuda_kernel(x, y, z) {
      // parallel control via varying index
      my_i = threadIdx.x + blockIdx.x*blockDim.x;
      z[my_i] = x[my_i] + y[my_i];
      // note there is NO 'for' loop over index
    }

    start = clock();
    cuda_kernel<<<blks, threadsperblock>>>(x, y, z);
    finish = clock();

19 CUDA by Example: CUDA kernel
    __global__ cuda_kernel(x, y, z) {
      // parallel control via varying index
      my_i = threadIdx.x + blockIdx.x*blockDim.x;
      z[my_i] = x[my_i] + y[my_i];
      // note there is NO 'for' loop over index
    }

    start = clock();
    cuda_kernel<<<blks, threadsperblock>>>(x, y, z);
    finish = clock();
each thread is in a block
blocks are uniquely numbered: blockIdx.x
each thread in a given block has a unique number: threadIdx.x
therefore my_i will be numbered 0, 1, 2, 3 (each on a different thread, perhaps in a different thread block)
the launch requests blks thread blocks with threadsperblock threads per block, with each thread running an instance of cuda_kernel (with given args) on a separate thread
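The slide's kernel is untyped pseudocode. A minimal runnable sketch of the same idea follows, under some stated assumptions: the element type is float, the array length is exactly blks * threadsperblock (so no bounds check is needed), and managed memory (cudaMallocManaged) is used purely to sidestep explicit host/device copies. run_vector_add is a hypothetical wrapper, not part of the lecture code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same shape as the slide's kernel; float is an assumed element type.
__global__ void cuda_kernel(const float *x, const float *y, float *z) {
    int my_i = threadIdx.x + blockIdx.x * blockDim.x;  // unique index per thread
    z[my_i] = x[my_i] + y[my_i];                       // no 'for' loop over index
}

// Hypothetical wrapper: returns the number of wrong elements, or -1 on a CUDA error.
int run_vector_add(int blks, int threadsperblock) {
    const int num = blks * threadsperblock;  // exactly one element per thread
    float *x = nullptr, *y = nullptr, *z = nullptr;
    // Managed memory is an assumption made to avoid explicit data transfers.
    if (cudaMallocManaged(&x, num * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&y, num * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&z, num * sizeof(float)) != cudaSuccess) {
        cudaFree(x); cudaFree(y); cudaFree(z);
        return -1;
    }
    for (int i = 0; i < num; ++i) {
        x[i] = static_cast<float>(i);
        y[i] = 2.0f * static_cast<float>(i);
    }

    cuda_kernel<<<blks, threadsperblock>>>(x, y, z);
    // Kernel launches are asynchronous: wait before reading z on the host.
    cudaError_t err = cudaDeviceSynchronize();

    int wrong = 0;
    if (err == cudaSuccess) {
        for (int i = 0; i < num; ++i) {
            if (z[i] != 3.0f * static_cast<float>(i)) ++wrong;
        }
    }
    cudaFree(x); cudaFree(y); cudaFree(z);
    return err == cudaSuccess ? wrong : -1;
}

int main() {
    int wrong = run_vector_add(4, 64);
    std::printf("wrong elements: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

Managed memory needs a GPU of compute capability 3.0 or later, so this sketch would be expected to run on the K80 but not on the Quadro 5000; on older devices the arrays would have to be copied explicitly.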

20 CUDA by Example: CUDA kernel
    __global__ cuda_kernel(x, y, z) {
      // parallel control via varying index
      my_i = threadIdx.x + blockIdx.x*blockDim.x;
      z[my_i] = x[my_i] + y[my_i];
      // note there is NO 'for' loop over index
    }

    start = clock();
    cuda_kernel<<<2, 5>>>(x, y, z);
    finish = clock();

x:  x0 x1 x2 x3 x4 x5 x6 x7 x8 x9
y:  y0 y1 y2 y3 y4 y5 y6 y7 y8 y9
z:  z0 z1 z2 z3 z4 z5 z6 z7 z8 z9

block  thread  my_i
0      0       0 + 0*5 = 0
0      1       1 + 0*5 = 1
0      2       2 + 0*5 = 2
0      3       3 + 0*5 = 3
0      4       4 + 0*5 = 4
1      0       0 + 1*5 = 5
1      1       1 + 1*5 = 6
1      2       2 + 1*5 = 7
1      3       3 + 1*5 = 8
1      4       4 + 1*5 = 9
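The block/thread to index mapping for the <<<2, 5>>> launch can be checked on a real device. In the sketch below each thread records which block and thread computed its index, and the host prints the result in index order. record_ids and mapping_matches are hypothetical names, and managed memory is again an assumption made to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread stores its own block and thread number at its global index.
__global__ void record_ids(int *block_of, int *thread_of) {
    int my_i = threadIdx.x + blockIdx.x * blockDim.x;
    block_of[my_i] = blockIdx.x;
    thread_of[my_i] = threadIdx.x;
}

// Hypothetical helper: returns true if index i was produced by
// block i / threadsperblock and thread i % threadsperblock, for every i.
bool mapping_matches(int blks, int threadsperblock, bool print) {
    const int num = blks * threadsperblock;
    int *block_of = nullptr, *thread_of = nullptr;
    if (cudaMallocManaged(&block_of, num * sizeof(int)) != cudaSuccess ||
        cudaMallocManaged(&thread_of, num * sizeof(int)) != cudaSuccess) {
        cudaFree(block_of); cudaFree(thread_of);
        return false;
    }
    record_ids<<<blks, threadsperblock>>>(block_of, thread_of);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);  // wait for the kernel
    for (int i = 0; ok && i < num; ++i) {
        if (print) {
            std::printf("my_i = %d: block %d, thread %d\n", i, block_of[i], thread_of[i]);
        }
        ok = (block_of[i] == i / threadsperblock) &&
             (thread_of[i] == i % threadsperblock);
    }
    cudaFree(block_of); cudaFree(thread_of);
    return ok;
}

int main() {
    return mapping_matches(2, 5, true) ? 0 : 1;
}
```

Printing from the host rather than from inside the kernel keeps the output in index order; the order in which the threads themselves run is not guaranteed.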

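21 What are we missing?
getting data on to and off the GPU device
num
  e.g. num=100, blks=3, threadsperblock=3: only 3*3 = 9 threads, so most elements are never computed
  e.g. num=120, blks=10, threadsperblock=32: 10*32 = 320 threads, so threads with my_i >= 120 fall outside the num elements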
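One common way to handle the num issue is to pass num into the kernel, guard the update with a bounds check, and round the number of blocks up so that every element gets a thread. The sketch below illustrates this for num=120 with 32 threads per block. cuda_kernel_guarded and blocks_for are hypothetical names, float is an assumed element type, and managed memory is used only to avoid the data transfers that are not covered here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Guarded variant of the kernel: threads past the end of the arrays do nothing.
__global__ void cuda_kernel_guarded(const float *x, const float *y, float *z, int num) {
    int my_i = threadIdx.x + blockIdx.x * blockDim.x;
    if (my_i < num) {
        z[my_i] = x[my_i] + y[my_i];
    }
}

// Hypothetical helper: smallest number of blocks that covers num elements.
int blocks_for(int num, int threadsperblock) {
    return (num + threadsperblock - 1) / threadsperblock;
}

int main() {
    const int num = 120, threadsperblock = 32;
    const int blks = blocks_for(num, threadsperblock);  // 4 blocks = 128 threads

    float *x = nullptr, *y = nullptr, *z = nullptr;
    // Managed memory is an assumption made to avoid explicit data transfers.
    if (cudaMallocManaged(&x, num * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&y, num * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&z, num * sizeof(float)) != cudaSuccess) {
        std::printf("allocation failed\n");
        return 1;
    }
    for (int i = 0; i < num; ++i) {
        x[i] = static_cast<float>(i);
        y[i] = 1.0f;
    }

    cuda_kernel_guarded<<<blks, threadsperblock>>>(x, y, z, num);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        std::printf("kernel failed\n");
        return 1;
    }

    int wrong = 0;
    for (int i = 0; i < num; ++i) {
        if (z[i] != static_cast<float>(i) + 1.0f) ++wrong;
    }
    std::printf("blks = %d, wrong elements = %d\n", blks, wrong);
    cudaFree(x); cudaFree(y); cudaFree(z);
    return wrong == 0 ? 0 : 1;
}
```

With these values the last 8 of the 128 threads fail the my_i < num test and simply return, which is cheaper than launching too few threads and safer than writing past the end of z.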

22 Steps to CUDA
1. Determine work that has inherent parallelism
2. Move (serial) work to a "kernel"
3. Invoke a parallel kernel by use of CUDA
and now we would compile and run in parallel!

23 NEXT TIME
Data transfers, synchronisation constructs
Asynchronous data transfer
LAB: simple examples


COMP528: Multi-core and Multi-Processor Computing Dr Michael K Bane, G14, Computer Science, University of Liverpool m.k.bane@liverpool.ac.uk https://cgi.csc.liv.ac.uk/~mkbane/comp528 21 You should compute