Advanced Message-Passing Interface (MPI)
Bart Oldeman, Calcul Québec – McGill HPC
Bart.Oldeman@mcgill.ca

Outline of the workshop
Morning: Advanced MPI
- Revision
- More on Collectives
- More on Point-to-Point
- Datatypes and Packing
- Communicators and Groups
- Topologies

Afternoon: Hybrid MPI/OpenMP
- Theory and benchmarking
- Examples

What is MPI?
MPI is a specification for a standardized library: http://www.mpi-forum.org
- You use its subroutines.
- You link it with your code.
History: MPI-1 (1994), MPI-2 (1997), MPI-3 (2012).
Different implementations: MPICH(2), MVAPICH(2), OpenMPI, HP-MPI, ...
MPI-3 contains a more modern Fortran interface (use mpi_f08), less prone to errors. It is still very new, but implemented, for instance, in OpenMPI 1.7.
Review: MPI routines we know
- Startup and exit: MPI_Init, MPI_Finalize
- Information on the processes: MPI_Comm_rank, MPI_Comm_size
- Point-to-point communications: MPI_Send, MPI_Recv, MPI_Isend, MPI_Irecv, MPI_Wait
- Collective communications: MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Gather

Example: Hello from N cores

Fortran:

    PROGRAM hello
    USE mpi
    INTEGER ierr, rank, size
    CALL MPI_Init(ierr)
    CALL MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    CALL MPI_Comm_size(MPI_COMM_WORLD, size, ierr)
    WRITE(*,*) 'Hello from processor', rank, 'of', size
    CALL MPI_Finalize(ierr)
    END PROGRAM hello

C:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from processor %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

More on Collectives
- "All" functions: MPI_Allgather, MPI_Allreduce. They combine MPI_Gather/MPI_Reduce with MPI_Bcast: all ranks receive the resulting data.
- MPI_Alltoall: every rank gathers subsequent blocks; works like a matrix transpose.
- "v" functions: MPI_Scatterv, MPI_Gatherv, MPI_Allgatherv, MPI_Alltoallv. Instead of a single count argument, they take counts and displs arrays that specify the counts and array displacements for every rank involved.
- MPI_Barrier: synchronization.
- MPI_Abort: abort with an error code.
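To make the counts and displs arguments of the "v" functions concrete, here is a minimal sketch (not part of the workshop files; the data and per-rank counts are made-up assumptions): each rank contributes rank+1 integers and rank 0 gathers them with MPI_Gatherv.

    /* Minimal MPI_Gatherv sketch: each rank sends rank+1 integers;
       rank 0 gathers them using per-rank counts and displacements. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int mycount = rank + 1;              /* this rank sends rank+1 values */
        int *sendbuf = malloc(mycount * sizeof(int));
        for (int i = 0; i < mycount; i++)
            sendbuf[i] = rank;

        int *counts = NULL, *displs = NULL, *recvbuf = NULL;
        if (rank == 0) {
            counts = malloc(size * sizeof(int));
            displs = malloc(size * sizeof(int));
            int total = 0;
            for (int r = 0; r < size; r++) {
                counts[r] = r + 1;           /* how many elements rank r sends */
                displs[r] = total;           /* where rank r's data starts     */
                total += counts[r];
            }
            recvbuf = malloc(total * sizeof(int));
        }

        MPI_Gatherv(sendbuf, mycount, MPI_INT,
                    recvbuf, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            for (int i = 0; i < size * (size + 1) / 2; i++)
                printf("%d ", recvbuf[i]);
            printf("\n");
            free(counts); free(displs); free(recvbuf);
        }
        free(sendbuf);
        MPI_Finalize();
        return 0;
    }

MPI_Scatterv, MPI_Allgatherv and MPI_Alltoallv take the same kind of counts/displs arrays, on the sending and/or receiving side.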
Exercise 1: MPI_Alltoall
Log in and compile the file alltoall.f90 or alltoall.c:

    cp /software/workshop/advancedmpi/* .
    module add ifort icc openmpi
    mpicc alltoall.c -o alltoall
    mpif90 alltoall.f90 -o alltoall

There are errors. Can you fix them? Hint: type man MPI_Alltoall to obtain the syntax for the MPI function.
To submit the job, use

    msub -q class alltoall.pbs

Exercise 2: Matrix-vector multiplication
Complete the multiplication in mv.f90 or mv.c using MPI_Allgatherv. Rows of the matrix are distributed among processors. Example: rows 1 and 2 in rank 0, row 3 in rank 1:

$$ v = Ax = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} a_{1,1}x_1 + a_{1,2}x_2 + a_{1,3}x_3 \\ a_{2,1}x_1 + a_{2,2}x_2 + a_{2,3}x_3 \\ a_{3,1}x_1 + a_{3,2}x_2 + a_{3,3}x_3 \end{pmatrix} = \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} $$

Exercise 3: Matrix-vector multiplication
Complete the multiplication in mv2.f90 or mv2.c using MPI_Alltoallv. Columns of the matrix and of the input vector are distributed among processors. Example: columns 1 and 2 in rank 0, column 3 in rank 1. Each rank first computes partial sums from its own columns:

$$ \begin{pmatrix} a_{1,1}x_1 + a_{1,2}x_2 \\ a_{2,1}x_1 + a_{2,2}x_2 \\ a_{3,1}x_1 + a_{3,2}x_2 \end{pmatrix} \text{ (rank 0)} \qquad \begin{pmatrix} a_{1,3}x_3 \\ a_{2,3}x_3 \\ a_{3,3}x_3 \end{pmatrix} \text{ (rank 1)} $$

After MPI_Alltoallv, each rank holds the partial sums for its own rows and adds them:

$$ \begin{pmatrix} a_{1,1}x_1 + a_{1,2}x_2 + a_{1,3}x_3 \\ a_{2,1}x_1 + a_{2,2}x_2 + a_{2,3}x_3 \\ a_{3,1}x_1 + a_{3,2}x_2 + a_{3,3}x_3 \end{pmatrix} = \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} $$

Note: one could also use MPI_Reduce or MPI_Allreduce here.
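As a hedged illustration of the Exercise 2 pattern (this is not the workshop's mv.c; the matrix size, test data and block distribution are made-up assumptions), a self-contained row-distributed matrix-vector product using MPI_Allgatherv could look like this:

    /* Row-distributed matrix-vector product: each rank computes the product
       of its rows with the full vector, then all ranks gather the full result. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        const int n = 8;                       /* global matrix size (assumed) */
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* contiguous block of rows per rank; first ranks take the remainder */
        int base = n / size, rem = n % size;
        int myrows = base + (rank < rem ? 1 : 0);
        int myfirst = rank * base + (rank < rem ? rank : rem);

        double *a = malloc(myrows * n * sizeof(double));
        double *x = malloc(n * sizeof(double));
        double *vloc = malloc(myrows * sizeof(double));
        double *v = malloc(n * sizeof(double));
        for (int j = 0; j < n; j++) x[j] = 1.0;
        for (int i = 0; i < myrows; i++)
            for (int j = 0; j < n; j++)
                a[i*n + j] = (double)(myfirst + i + j);   /* arbitrary test data */

        /* local rows times full vector */
        for (int i = 0; i < myrows; i++) {
            vloc[i] = 0.0;
            for (int j = 0; j < n; j++)
                vloc[i] += a[i*n + j] * x[j];
        }

        /* every rank gathers all the row blocks of v */
        int *counts = malloc(size * sizeof(int));
        int *displs = malloc(size * sizeof(int));
        for (int r = 0, off = 0; r < size; r++) {
            counts[r] = base + (r < rem ? 1 : 0);
            displs[r] = off;
            off += counts[r];
        }
        MPI_Allgatherv(vloc, myrows, MPI_DOUBLE,
                       v, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

        if (rank == 0) printf("v[0] = %g, v[%d] = %g\n", v[0], n-1, v[n-1]);
        free(a); free(x); free(vloc); free(v); free(counts); free(displs);
        MPI_Finalize();
        return 0;
    }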
More on point-to-point
- MPI_Ssend: synchronous send, forced to complete only when the matching receive has been posted.
- MPI_Bsend: buffered send, using a user-provided buffer.
- MPI_Rsend: ready send, must come after the matching receive was posted. Rarely used.
- MPI_Issend, MPI_Ibsend, MPI_Irsend: non-blocking versions.
- MPI_Sendrecv[_replace]: sends and receives in one call, avoiding deadlock (like MPI_Isend, MPI_Irecv, MPI_Wait).
Note: generally plain MPI_Send and MPI_Recv are best.

Packing and Datatypes
These functions create new data types:
- MPI_Type_contiguous, MPI_Type_vector, MPI_Type_indexed: transfer parts of a matrix directly.
- MPI_Type_struct: transfer a struct.
- MPI_Pack, MPI_Unpack: pack and send heterogeneous data.
Note: double precision variables can (on all current machines) contain 53-bit integers without loss of precision, so an alternative is to pack manually into a double precision array.

Pack example

    integer m
    double precision x(m)
    ! (other declarations, e.g. rank, ierr, pos, bufsize and buffer, omitted on the slide)
    call MPI_Pack_size(1, MPI_INTEGER, MPI_COMM_WORLD, size_int, ierr)
    call MPI_Pack_size(m, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, size_double, ierr)
    bufsize = size_int + size_double
    allocate(buffer(bufsize))
    pos = 0
    if (rank == 0) then
       call MPI_Pack(m, 1, MPI_INTEGER, buffer, bufsize, pos, MPI_COMM_WORLD, ierr)
       call MPI_Pack(x, m, MPI_DOUBLE_PRECISION, buffer, bufsize, pos, &
                     MPI_COMM_WORLD, ierr)
    endif
    call MPI_Bcast(buffer, bufsize, MPI_PACKED, 0, MPI_COMM_WORLD, ierr)
    if (rank > 0) then
       call MPI_Unpack(buffer, bufsize, pos, m, 1, &
                       MPI_INTEGER, MPI_COMM_WORLD, ierr)
       call MPI_Unpack(buffer, bufsize, pos, x, m, &
                       MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, ierr)
    endif

Communicators
So far we have only used MPI_COMM_WORLD. This communicator can be split into subsets, to allow collective operations on a subset of ranks. The easiest routine to use is
MPI_Comm_split(comm, color, key, newcomm[, ierror]):
- comm: old communicator
- color: all processes with the same color go into the same communicator
- key: rank within the new communicator (can be 0 for automatic determination)
- newcomm: resulting new communicator
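A minimal MPI_Comm_split sketch in C (illustrative, not from the workshop files): even-numbered and odd-numbered ranks go into separate sub-communicators, and each group then runs its own MPI_Allreduce.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int world_rank, world_size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        int color = world_rank % 2;          /* same color -> same new communicator */
        MPI_Comm subcomm;
        MPI_Comm_split(MPI_COMM_WORLD, color, 0, &subcomm);  /* key 0: keep old order */

        int sub_rank, sub_size, sum;
        MPI_Comm_rank(subcomm, &sub_rank);
        MPI_Comm_size(subcomm, &sub_size);
        MPI_Allreduce(&world_rank, &sum, 1, MPI_INT, MPI_SUM, subcomm);

        printf("world rank %d is rank %d of %d in group %d; group sum = %d\n",
               world_rank, sub_rank, sub_size, color, sum);

        MPI_Comm_free(&subcomm);
        MPI_Finalize();
        return 0;
    }

The collective operation inside subcomm only involves the ranks of that group, which is exactly what the block-distributed matrix-vector exercises below exploit.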
Topologies
Topologies group processes in an n-dimensional grid (Cartesian) or a graph. Here we restrict ourselves to a Cartesian 2D grid. This helps the programmer and (sometimes) the hardware.
- MPI_Dims_create(p, n, dims): create a balanced n-dimensional grid for p processes in the n-element array dims.
- MPI_Cart_create(oldcomm, n, dims, periodic, reorder, newcomm): creates a new communicator for a grid with n dimensions given in dims, with periodicity given in the array periodic. reorder specifies whether the ranks may change in the new communicator.
- MPI_Cart_rank(comm, coords, rank): given n-dimensional coordinates, return the rank.
- MPI_Cart_coords(comm, rank, n, coords): given the rank, return the n coordinates.

Exercise 4: Matrix-vector multiplication
Complete the multiplication in mv3.f90 or mv3.c using a Cartesian topology. Blocks of the matrix are distributed among processors. Example:
- rows 1-2, columns 1-2 in rank 0, at grid position (0,0)
- rows 1-2, column 3 in rank 1, at grid position (0,1)
- row 3, columns 1-2 in rank 2, at grid position (1,0)
- row 3, column 3 in rank 3, at grid position (1,1)

$$ v = Ax = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} $$

Each process multiplies its block by its part of x, producing partial sums; as in Exercise 3, these partial sums

$$ \begin{pmatrix} a_{1,1}x_1 + a_{1,2}x_2 \\ a_{2,1}x_1 + a_{2,2}x_2 \\ a_{3,1}x_1 + a_{3,2}x_2 \end{pmatrix} + \begin{pmatrix} a_{1,3}x_3 \\ a_{2,3}x_3 \\ a_{3,3}x_3 \end{pmatrix} = \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} $$

are combined using an MPI_Reduce call to obtain v. Advantage: both the vectors and the matrix can be distributed in memory.
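A minimal 2D Cartesian topology sketch in C (illustrative; the grid is chosen automatically and the per-row sub-communicator is just one possible use): build a balanced process grid, report each rank's coordinates, and split off one communicator per grid row.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int dims[2] = {0, 0};                 /* let MPI choose a balanced grid */
        MPI_Dims_create(size, 2, dims);

        int periodic[2] = {0, 0};             /* non-periodic in both dimensions */
        MPI_Comm cartcomm;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periodic, 1, &cartcomm);

        int cart_rank, coords[2];
        MPI_Comm_rank(cartcomm, &cart_rank);
        MPI_Cart_coords(cartcomm, cart_rank, 2, coords);

        printf("rank %d -> grid position (%d,%d) in a %dx%d grid\n",
               cart_rank, coords[0], coords[1], dims[0], dims[1]);

        /* one communicator per grid row, e.g. for the per-row MPI_Reduce in Exercise 4 */
        MPI_Comm rowcomm;
        int remain[2] = {0, 1};               /* keep only the column dimension */
        MPI_Cart_sub(cartcomm, remain, &rowcomm);

        MPI_Comm_free(&rowcomm);
        MPI_Comm_free(&cartcomm);
        MPI_Finalize();
        return 0;
    }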
Hybrid MPI and OpenMP
Most clusters, including Guillimin, contain multicore nodes; Guillimin has 12 cores per node. Idea: use hybrid MPI and OpenMP, with MPI for internode communication and OpenMP intranode, eliminating intranode communication. This may or may not run faster than pure MPI code.

First step: measure efficiency
Insert MPI_Wtime calls to measure wall-clock time, and run for various values of p to determine the scaling (a minimal timing sketch follows the Karp-Flatt metric below).

Amdahl's law
Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup ψ achievable by a parallel computer with p processors performing the computation is

$$ \psi \le \frac{1}{f + (1 - f)/p} $$

Example: if f = 0.0035, the maximum speedup is about 285 as p grows without bound, and for p = 1024, ψ ≈ 223.

Karp-Flatt metric
We can also determine the experimentally determined serial fraction e from the measured speedup ψ:

$$ e = \frac{1/\psi - 1/p}{1 - 1/p} $$

Examples: p = 2 and ψ = 1.95 give e = 0.026; p = 1024 and ψ = 200 give e = 0.0040.
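Here is the minimal MPI_Wtime timing sketch referred to above (illustrative; the placeholder loop stands in for your real computation): time a region, take the slowest rank's time as the wall-clock time, and print it once.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);          /* start everyone together */
        double t0 = MPI_Wtime();

        /* ... the computation and communication to be measured go here ... */
        double s = 0.0;
        for (long i = 0; i < 10000000L; i++)  /* placeholder work */
            s += 1.0 / (i + 1.0);

        double t = MPI_Wtime() - t0;

        double tmax;                          /* wall-clock time = slowest rank */
        MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("elapsed time: %f s (check value %g)\n", tmax, s);

        MPI_Finalize();
        return 0;
    }

Running this for several values of p gives the measured speedup ψ = T(1)/T(p), from which the Karp-Flatt serial fraction e above can be computed.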
When to consider hybrid?
- If the serial portion is too expensive to parallelize using MPI but can be done using threads: definitely!
- If the problem does not scale well due to excessive communication (e increases significantly as p increases): maybe. Perhaps MPI performance can be improved instead:
  - fewer messages (less latency);
  - shorter messages;
  - replacing communication by computation where possible.
  Example: for broadcasts, tree-like communication is much more efficient than sending from the master process directly to all other processes (fewer messages in the master process). Analysts are here to help you optimize your code!
- Otherwise pure MPI can be just as fast. Also, one must look out for OpenMP pitfalls: caching, false sharing, synchronization overhead, races.

Example job script for Guillimin
For 48 CPU cores on 4 nodes with 12 cores each:

    #!/bin/bash
    #PBS -l nodes=4:ppn=12
    #PBS -V
    #PBS -N jobname
    cd $PBS_O_WORKDIR
    export IPATH_NO_CPUAFFINITY=1
    export OMP_NUM_THREADS=12
    mpiexec -n 4 -npernode 1 ./yourcode

The particular features of this submission script are as follows:
- export IPATH_NO_CPUAFFINITY=1: tells the underlying software not to pin each process to one CPU core, which would effectively disable OpenMP parallelism.
- export OMP_NUM_THREADS=12: specifies the number of OpenMP threads used by all 4 processes.
- mpiexec -n 4 -npernode 1 ./yourcode: starts the program yourcode, compiled with MPI, in parallel on 4 nodes, with 1 process per node.
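As a hedged sketch of what such a yourcode might look like at its simplest (the file name and compile line are assumptions, not workshop files), a hybrid hello world initializes MPI with thread support and then opens an OpenMP parallel region inside each process:

    /* Compile e.g. with: mpicc -fopenmp hybrid_hello.c -o hybrid_hello */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char *argv[])
    {
        int provided, rank, size;
        /* FUNNELED: only the main thread makes MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        #pragma omp parallel
        {
            printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
                   rank, size, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

With the job script above, each of the 4 MPI processes would print 12 lines, one per OpenMP thread.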
OpenMP example: parallel for (C)

    void addvectors(const int *a, const int *b, int *c, const int n)
    {
        int i;
    #pragma omp parallel for
        for (i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

Here i is automatically made private because it is the loop variable. All other variables are shared. The loop is split between the threads; for example, for n=10 and two threads, thread 0 does indices 0 to 4 and thread 1 does indices 5 to 9.

OpenMP example: parallel do (Fortran)

    subroutine addvectors(a, b, c, n)
    integer n, a(n), b(n), c(n)
    integer i
    !$OMP PARALLEL DO
    do i = 1, n
       c(i) = a(i) + b(i)
    enddo
    !$OMP END PARALLEL DO
    end subroutine

Here i is automatically made private because it is the loop variable. All other variables are shared. The loop is split between the threads; for example, for n=10 and two threads, thread 0 does indices 1 to 5 and thread 1 does indices 6 to 10.

Exercise 5: Matrix-vector multiplication
Consider again mv.c and mv.f90. Add a parallel for or parallel do pragma to the inner for/do loop to obtain a hybrid code, and submit it. Measure its performance. Optional: do the same for the other two matrix-vector multiplication codes.
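A hedged sketch of the Exercise 5 idea in C (this is not the actual mv.c; the routine name and arguments are assumptions): the local part of the matrix-vector product with its inner loop threaded by OpenMP.

    #include <omp.h>

    /* myrows x n local block 'a', full vector 'x', local result 'vloc',
       assumed to be set up by the surrounding MPI code. */
    void local_matvec(int myrows, int n, const double *a,
                      const double *x, double *vloc)
    {
        for (int i = 0; i < myrows; i++) {        /* loop over local rows */
            double sum = 0.0;
            /* inner loop split over threads; partial sums combined by the reduction */
            #pragma omp parallel for reduction(+:sum)
            for (int j = 0; j < n; j++)
                sum += a[i*n + j] * x[j];
            vloc[i] = sum;
        }
    }

Depending on the matrix size, moving the parallel region to the outer loop over rows instead can reduce the fork/join overhead; measuring both variants with MPI_Wtime, as above, is a good way to decide.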