COMP528: Multi-core and Multi-Processor Computing

COMP528: Multi-core and Multi-Processor Computing
Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k.bane@liverpool.ac.uk
https://cgi.csc.liv.ac.uk/~mkbane/comp528

So far
- Why and what: HPC / multi-core / multi-processing [#1 - #3]
- Use of HPC facility: batch, timing variations
- Theory: Amdahl's Law, Gustafson; deadlock, livelock
- Message Passing Interface (MPI): distributed memory; many nodes, scaling up nodes & memory; wrappers for compiling and launching
- OpenMP: shared (globally addressable) memory, single node

So far... but not only CPU
- GPU [#18 - #24]
- CUDA: <<<blocks, threads>>> & writing kernels
- Directives: OpenACC [#23], OpenMP 4.0+ [#24]
- OpenCL [#24]

Still to come
- Vectorisation, including some optimisation
- Hybrid programming: how to use OpenMP from MPI
- Libraries
- Black box codes: using what we have learned to understand how these can help
- Profiling & Optimisation

Still to come
- a summary lecture: what's key to remember, opportunity to ask Qs
- what might be interesting but would need another course
- and did somebody say cloud?

HYBRID

Today's High End Architectures
- processors: many cores, each with a vector unit; maybe specialised units, e.g. TPU, Tensor cores (etc) for Machine Learning
- nodes: one or more processors; zero or more GPUs; potentially the likes of Xeon Phi, FPGA, custom ASIC, ...

Today's High End Architectures
- i.e. an eclectic mix, needing appropriate programming for max performance:
  - MPI for inter-node
  - MPI or OpenMP for intra-node
  - CUDA / OpenCL / OpenACC / OpenMP for accelerators
- BUT heterogeneous arch ==> heterogeneous use of languages

MPI across nodes, OpenMP on a node? Or MPI per processor & OpenMP across cores?
Already done (assignment #3): OpenMP for CPU + CUDA for GPU
- a single thread calls the CUDA kernel for the GPU to run
- (calling a CUDA kernel in a parallel region would launch many instances of the kernel, each requesting <<<blocks, threads>>>)

MPI + OMP: Simple Case
- MPI code => runs a copy on each process; put one process per node
- when we need to accelerate (e.g. a for loop), use OpenMP: the master OpenMP thread is the MPI process, and the other cores run the slave OpenMP threads
- (inter-process) comms is only via MPI
- why may we wish to use OMP rather than MPI? e.g. for dynamic load balancing via schedule(dynamic)
- consider each OpenMP team independent of (and without any knowledge of) other OpenMP teams

[Diagrams: an OpenMP program (no MPI); MPI with 1 process launching OpenMP parallel regions; MPI with 4 processes, each launching OpenMP parallel regions]

MPI with 4 processes, each launching OpenMP parallel regions: REDUCTION TO ROOT
- there is no reason we have to have the same size OpenMP team on each MPI process
- data exchange between MPI processes is via MPI comms: pt-to-pt, collectives
- easiest if OUTSIDE of the OpenMP regions

Example / DEMO
~/MPI/hybrid/ex1.c
- how to compile hybrid? run to illustrate
~/MPI/hybrid/ex2.c
- v. simple example of summation over MPI*OMP
- MPI_Scatter, #pragma omp parallel for reduction, MPI_Reduce
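The course files ex1.c / ex2.c are not reproduced here, but the summation demo described above follows a standard pattern. A minimal sketch, assuming a root-initialised array whose size divides evenly among the processes (names and sizes are illustrative):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000              /* illustrative problem size */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int chunk = N / nprocs;    /* assumes N divides evenly by nprocs */
    double *data = NULL;
    double *local = malloc(chunk * sizeof(double));

    if (rank == 0) {           /* root initialises the full array */
        data = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) data[i] = 1.0;
    }

    /* distribute one chunk to each MPI process */
    MPI_Scatter(data, chunk, MPI_DOUBLE, local, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* each process sums its chunk with its own OpenMP team */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < chunk; i++)
        local_sum += local[i];

    /* combine the per-process partial sums on the root */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %f\n", global_sum);

    free(local);
    if (rank == 0) free(data);
    MPI_Finalize();
    return 0;
}

Compiling "hybrid" typically just means combining the MPI compiler wrapper with the OpenMP flag, e.g. mpicc -fopenmp ex2.c (or the equivalent flag for the compiler in use), then launching with mpirun while setting OMP_NUM_THREADS for the per-process team size.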

Other Options: HANDLE WITH CARE
- a single OMP thread (e.g. in a master or single region) sends info via MPI
- generally okay: it will be to another master thread, and is pretty much like sending outside the OMP region

Other Options: HANDLE WITH CARE
- one or more OMP threads in an OMP parallel region doing MPI comms (or at the same time as them)
- threaded MPI requires MPI_Init_thread rather than MPI_Init:
  MPI_Init_thread(&argc, &argv, required, &provided)
- requires provided support (implementation dependent) of one of: MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, MPI_THREAD_MULTIPLE
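A minimal sketch of the threaded initialisation; requesting MPI_THREAD_FUNNELED here is an illustrative choice, matching the "only the master thread calls MPI" model of the simple case:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int provided;

    /* FUNNELED: only the thread that called MPI_Init_thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    /* the library may provide less than requested, so check before relying on it */
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library only provides thread level %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    /* ... hybrid MPI + OpenMP work ... */

    MPI_Finalize();
    return 0;
}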

Performance, Batch etc
- how many cores to request in a batch job? different batch systems would require a request for:
  - 4 processors * 7 cores (MPI: per processor, OMP: per core), or
  - 24 cores (& then worry re placement)
- Chadwick: 24 cores, place MPI per node via mpirun (SHOW)
- is it an efficient use of resources? depends if it runs faster, but there is dead time (cf Amdahl)
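A common sanity check for placement is a tiny hybrid "hello world" that reports which host each MPI rank and OpenMP thread ended up on. The following is an illustrative sketch, not a course file:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &namelen);

    /* every thread reports its MPI rank, OpenMP thread id and host name */
    #pragma omp parallel
    {
        printf("host %s: MPI rank %d, OMP thread %d of %d\n",
               host, rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

Running it with, say, OMP_NUM_THREADS=7 mpirun -np 4 ./a.out (exact flags depend on the MPI and batch system in use) shows whether the 4 x 7 layout actually placed one process per node.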

Can think of some tricks:

#pragma omp parallel
{
    if (omp_get_thread_num() == 0) {
        MPI_Send( )   // or other MPI, e.g. MPI_Recv on a different MPI process
    } else {
        // do some OMP work on remaining threads
    }
}

what can we NOT do here?
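Filled out, the trick might look like the sketch below (the exchanged value, counts and ranks are illustrative, and it assumes the library provides at least MPI_THREAD_FUNNELED). It also hints at the closing question: worksharing constructs such as #pragma omp for, and barriers, cannot appear inside either branch, because not all threads of the team reach them.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* needs at least MPI_THREAD_FUNNELED (check 'provided' as on the earlier slide) */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double halo = 0.0;          /* illustrative value exchanged between ranks 0 and 1 */
    double work[1000] = {0.0};

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            /* only the master thread touches MPI */
            if (rank == 0)
                MPI_Send(&halo, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(&halo, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            /* remaining threads overlap computation with the communication;
               no "#pragma omp for" or barrier here - not all threads reach this branch */
            int tid = omp_get_thread_num(), nth = omp_get_num_threads();
            for (int i = tid - 1; i < 1000; i += nth - 1)
                work[i] += 1.0;
        }
    }

    if (rank == 0) printf("work[0] = %f\n", work[0]);
    MPI_Finalize();
    return 0;
}

Run with at least two MPI processes (e.g. mpirun -np 2) so the send has a matching receive.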

Further Reading
https://www.intertwineproject.eu/sites/default/files/images/intertwine_best_practice_guide_mpi%2bopenmp_1.2.pdf (Archer?)