Review of MPI, Part 2
Russian-German School on High Performance Computer Systems, June 7th until July 6th, 2005, Novosibirsk
3rd Day, 9th of June, 2005
HLRS, University of Stuttgart
Slide 1
Chap. 5 Virtual Topologies

Course outline:
1. MPI Overview
2. Process model and language bindings: MPI_Init(), MPI_Comm_rank()
3. Messages and point-to-point communication
4. Non-blocking communication
5. Virtual topologies
8. Collective communication

Slide 2
MPI Communicators

- A group is a subset of the processes of the program.
- Communicators are the handles on groups that are used for communication.
- The initial communicator MPI_COMM_WORLD consists of all processes.

[Figure: processes with ranks 0 to 7 in MPI_COMM_WORLD]

- Every process within a communicator has a distinct rank.
- There are functions to split communicators / cut processes out of communicators.

[Figure: MPI_COMM_WORLD split into even ranks {0,2,4,6} and odd ranks {1,3,5,7}]

MPI_Comm_split(MPI_COMM_WORLD, comm_rank%2, comm_rank, &new_comm);

Slide 3
MPI Communicators

C:       int MPI_Comm_split(MPI_Comm old, int color, int key, MPI_Comm *new)
Fortran: MPI_COMM_SPLIT(old, color, key, new, IERROR)
         INTEGER OLD, COLOR, KEY, NEW, IERROR

- Splits the communicator into separate communicators, one per color value.
- Each subgroup contains all processes of the same color.
- The value of key specifies the rank order in the new communicator.

[Figure: split by comm_rank%3 into the groups {0,3,6}, {1,4,7}, {2,5}]

MPI_Comm_split(MPI_COMM_WORLD, comm_rank%3, comm_rank, &new_comm);

Slide 4
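As a concrete illustration of the call above, a minimal self-contained sketch (the printf and the MPI_Comm_free cleanup are additions, not part of the slide):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int comm_rank, new_rank;
    MPI_Comm new_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);

    /* color = comm_rank%3 puts ranks {0,3,6,...}, {1,4,7,...}, {2,5,8,...}
       into three separate communicators; key = comm_rank keeps the
       original rank order inside each new communicator. */
    MPI_Comm_split(MPI_COMM_WORLD, comm_rank % 3, comm_rank, &new_comm);
    MPI_Comm_rank(new_comm, &new_rank);

    printf("world rank %d -> color %d, new rank %d\n",
           comm_rank, comm_rank % 3, new_rank);

    MPI_Comm_free(&new_comm);
    MPI_Finalize();
    return 0;
}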
MPI Communicators

C:       int MPI_Comm_group(MPI_Comm comm, MPI_Group *group)
Fortran: MPI_COMM_GROUP(comm, group, IERROR)
         INTEGER comm, group, IERROR

C:       int MPI_Group_excl(MPI_Group group, int n, int *ranks, MPI_Group *new)
Fortran: MPI_GROUP_EXCL(group, n, ranks, new, IERROR)
         INTEGER group, n, ranks(*), new, IERROR

- Extract the group of processes out of comm with MPI_Comm_group.
- Then one may modify it with MPI_Group_incl, MPI_Group_excl, MPI_Group_range_incl, ...
- Afterwards, convert the group into a communicator through the global operation MPI_Comm_create (see the sketch below).

Slide 5
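A sketch of the three-step group recipe, assuming we exclude rank 0 as an arbitrary example:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Group world_group, sub_group;
    MPI_Comm  sub_comm;
    int excl_ranks[1] = {0};   /* exclude rank 0, as an example */

    MPI_Init(&argc, &argv);

    /* 1. extract the group behind MPI_COMM_WORLD */
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    /* 2. build a new group without rank 0 */
    MPI_Group_excl(world_group, 1, excl_ranks, &sub_group);

    /* 3. global operation: every process of MPI_COMM_WORLD must call it;
          processes outside sub_group receive MPI_COMM_NULL */
    MPI_Comm_create(MPI_COMM_WORLD, sub_group, &sub_comm);

    if (sub_comm != MPI_COMM_NULL) {
        int sub_rank;
        MPI_Comm_rank(sub_comm, &sub_rank);
        printf("member of sub_comm with rank %d\n", sub_rank);
        MPI_Comm_free(&sub_comm);
    }
    MPI_Group_free(&sub_group);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}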
MPI Virtual Topologies

- Convenient process naming.
- Naming scheme chosen to fit the communication pattern.
- Simplifies writing of code.
- Can allow MPI to optimize communications.
- Creating a topology produces a new communicator.
- MPI provides mapping functions: to compute process ranks based on the topology naming scheme, and vice versa.
- Two types: graph and Cartesian topologies.

Slide 6
MPI Topology Types

Cartesian topologies:
- each process is connected to its neighbours in a virtual grid,
- boundaries can be cyclic, or not,
- processes are identified by Cartesian coordinates,
- of course, communication between any two processes is still allowed.

int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart);

comm_old = MPI_COMM_WORLD
ndims    = 2
dims     = (4, 3)
periods  = (1/.true., 0/.false.)
reorder  = see next slide

[Figure: 4x3 grid, rank (row,column):
   0 (0,0)    1 (0,1)    2 (0,2)
   3 (1,0)    4 (1,1)    5 (1,2)
   6 (2,0)    7 (2,1)    8 (2,2)
   9 (3,0)   10 (3,1)   11 (3,2)]

Slide 7
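A possible complete program for the 4x3 grid of the slide (reorder = 1 is chosen here; the printf is illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm comm_cart;
    int dims[2]    = {4, 3};   /* 4 x 3 grid, as on the slide      */
    int periods[2] = {1, 0};   /* cyclic in dimension 0 only       */
    int reorder    = 1;        /* MPI may renumber the processes   */
    int my_rank, coords[2];

    MPI_Init(&argc, &argv);

    /* needs at least 4*3 = 12 processes; surplus processes receive
       MPI_COMM_NULL */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &comm_cart);

    if (comm_cart != MPI_COMM_NULL) {
        MPI_Comm_rank(comm_cart, &my_rank);
        MPI_Cart_coords(comm_cart, my_rank, 2, coords);
        printf("rank %d has coordinates (%d,%d)\n",
               my_rank, coords[0], coords[1]);
        MPI_Comm_free(&comm_cart);
    }
    MPI_Finalize();
    return 0;
}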
MPI: A 2-dimensional Cylinder

Ranks and Cartesian process coordinates in comm_cart.

[Figure: the 4x3 cylinder, cyclic in the first dimension; MPI_Cart_rank maps Cartesian coordinates to ranks, MPI_Cart_coords maps ranks back to coordinates]

- Ranks in comm and comm_cart may differ if reorder = 1 or .true.
- This reordering may allow MPI to optimize communications.

Slide 8
MPI_Cart_shift

[Figure: the 4x3 grid with ranks 0..11 and coordinates (0,0)..(3,2)]

Invisible input argument: my_rank in cart.

MPI_Cart_shift(cart, direction, displace, rank_source, rank_dest, ierror)

Example for process rank = 7:
  direction 0, displace +1: rank_source = 4, rank_dest = 10
  direction 1, displace +1: rank_source = 6, rank_dest = 8

Slide 9
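A hedged sketch combining MPI_Cart_shift with MPI_Sendrecv on the same cylinder; the grid setup repeats the earlier MPI_Cart_create example:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm comm_cart;
    int dims[2] = {4, 3}, periods[2] = {1, 0};
    int my_rank, rank_source, rank_dest, rcv_buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm_cart);

    if (comm_cart != MPI_COMM_NULL) {
        MPI_Comm_rank(comm_cart, &my_rank);

        /* neighbours one step away in direction 0; my own rank in
           comm_cart is the invisible input argument */
        MPI_Cart_shift(comm_cart, 0, 1, &rank_source, &rank_dest);

        /* the cyclic dimension wraps around; on a non-cyclic boundary
           MPI_Cart_shift would return MPI_PROC_NULL instead */
        MPI_Sendrecv(&my_rank, 1, MPI_INT, rank_dest,   0,
                     &rcv_buf, 1, MPI_INT, rank_source, 0,
                     comm_cart, &status);

        printf("rank %d received %d from rank %d\n",
               my_rank, rcv_buf, rank_source);
        MPI_Comm_free(&comm_cart);
    }
    MPI_Finalize();
    return 0;
}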
MPI Cartesian Partitioning

- Cut a grid up into slices.
- A new communicator is produced for each slice.
- Each slice can then perform its own collective communications.

int MPI_Cart_sub(MPI_Comm comm_cart, int *remain_dims, MPI_Comm *comm_slice);

Splitting into slices with the first dimension remaining:

[Figure: the 4x3 grid split into three column slices {0,3,6,9}, {1,4,7,10}, {2,5,8,11}]

Slide 10
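A minimal sketch of this partitioning; remain_dims = (1,0) keeps the first dimension, so each slice is one column, and the MPI_Allreduce inside the slice is an illustrative choice of collective:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm comm_cart, comm_slice;
    int dims[2] = {4, 3}, periods[2] = {1, 0};
    int remain_dims[2] = {1, 0};   /* keep dimension 0: one slice per column */
    int my_rank, slice_sum;

    MPI_Init(&argc, &argv);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm_cart);

    if (comm_cart != MPI_COMM_NULL) {
        MPI_Comm_rank(comm_cart, &my_rank);
        MPI_Cart_sub(comm_cart, remain_dims, &comm_slice);

        /* each column slice performs its own collective operation */
        MPI_Allreduce(&my_rank, &slice_sum, 1, MPI_INT, MPI_SUM, comm_slice);
        printf("cart rank %d: sum over my column = %d\n", my_rank, slice_sum);

        MPI_Comm_free(&comm_slice);
        MPI_Comm_free(&comm_cart);
    }
    MPI_Finalize();
    return 0;
}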
Chap. 6 Collective Communication

Course outline:
1. MPI Overview
2. Process model and language bindings: MPI_Init(), MPI_Comm_rank()
3. Messages and point-to-point communication
4. Non-blocking communication
5. Virtual topologies
8. Collective communication

Slide 11
MPI Collective Communication

- Communications involving a group of processes.
- Called by all processes in a communicator.
- Examples:
  - Barrier synchronization.
  - Broadcast, scatter, gather.
  - Reduction: global sum, global maximum, etc.

Collective action over a communicator:
- All processes of the communicator must communicate, i.e. must call the collective routine.
- Synchronization may or may not occur, therefore all processes must be able to start the collective routine.
- All collective operations are blocking.
- No tags.

Slide 12
MPI Barrier Synchronization

MPI_Barrier is normally never needed: all synchronization is done automatically by the data communication; a process cannot continue before it has the data that it needs.

- If used for debugging: remove the barriers in the production code.
- If used for profiling: to separate the time measurement of
  - load imbalance of the computation: MPI_Wtime(); MPI_Barrier(); MPI_Wtime();
  - communication epochs: MPI_Wtime(); MPI_Allreduce(); MPI_Wtime();
- If used for synchronizing external communication (e.g. I/O): exchanging tokens may be more efficient and scalable than a barrier on MPI_COMM_WORLD.

Slide 13
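A sketch of the profiling pattern above: the barrier separates the load-imbalance wait time from the communication time of the following collective; compute() is a hypothetical placeholder for any local work:

#include <stdio.h>
#include <mpi.h>

/* hypothetical placeholder for any local computation */
static void compute(void) { /* ... */ }

int main(int argc, char **argv)
{
    double t0, t1, t2;
    int in = 1, out;

    MPI_Init(&argc, &argv);
    compute();

    t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);   /* wait time here = load imbalance  */
    t1 = MPI_Wtime();
    MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    t2 = MPI_Wtime();              /* t2-t1 = pure communication time  */

    printf("imbalance: %f s, communication: %f s\n", t1 - t0, t2 - t1);
    MPI_Finalize();
    return 0;
}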
MPI Broadcast

int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

[Figure: before the bcast, only the root (e.g. root = 1) holds "red"; after the bcast, every process holds "red"]

root: rank of the sending process (i.e., root process).
Must be given identically by all processes.

Slide 14
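A runnable version of the picture, assuming root = 1 broadcasts the string "red" (buffer length 4 includes the terminating '\0'):

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char buf[4] = "";
    int my_rank, root = 1;   /* root as in the slide's example */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank == root)
        strcpy(buf, "red");

    /* every process calls MPI_Bcast with the identical root value */
    MPI_Bcast(buf, 4, MPI_CHAR, root, MPI_COMM_WORLD);

    printf("rank %d now has \"%s\"\n", my_rank, buf);
    MPI_Finalize();
    return 0;
}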
MPI Gather

[Figure: before the gather, the processes hold one item each, A B C D E; after the gather, the root (e.g. root = 1) additionally holds the collected vector ABCDE]

int MPI_Gather(void *sbuf, int scount, MPI_Datatype stype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm);

Slide 15
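A sketch matching the gather picture: each process contributes one character and the root receives them in rank order (the 64-byte receive buffer assumes fewer than 64 processes):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int my_rank, size, root = 1;
    char sbuf, rbuf[64];   /* recvbuf only needs to be valid at the root */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sbuf = 'A' + my_rank;   /* rank 0 sends 'A', rank 1 'B', ... */

    /* recvcount is the count received from EACH process, not the total */
    MPI_Gather(&sbuf, 1, MPI_CHAR, rbuf, 1, MPI_CHAR, root, MPI_COMM_WORLD);

    if (my_rank == root) {
        rbuf[size] = '\0';
        printf("root gathered \"%s\"\n", rbuf);
    }
    MPI_Finalize();
    return 0;
}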
MPI Scatter

[Figure: before the scatter, the root (e.g. root = 1) holds the vector ABCDE; after the scatter, each process holds one item, A B C D E]

int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype stype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

Slide 16
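The mirror image of the gather sketch: root 1 distributes one character to each process; run with 5 processes so every character of "ABCDE" finds a receiver:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int my_rank, root = 1;
    char sendbuf[] = "ABCDE";   /* significant only at the root */
    char recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* sendcount is the count sent to EACH process */
    MPI_Scatter(sendbuf, 1, MPI_CHAR, &recvbuf, 1, MPI_CHAR,
                root, MPI_COMM_WORLD);

    printf("rank %d received '%c'\n", my_rank, recvbuf);
    MPI_Finalize();
    return 0;
}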
MPI Global Reduction Operations

To perform a global reduce operation across all members of a group:

  d_0 o d_1 o d_2 o d_3 o ... o d_(s-2) o d_(s-1)

d_i = data in the process with rank i (a single variable, or a vector)
o   = associative operation

Examples:
- global sum or product
- global maximum or minimum
- global user-defined operation

Floating-point rounding may depend on the usage of the associativity law:

  [(d_0 o d_1) o (d_2 o d_3)] o [... o (d_(s-2) o d_(s-1))]
versus
  ((((((d_0 o d_1) o d_2) o d_3) o ...) o d_(s-2)) o d_(s-1))

Slide 17
MPI Operators for Global Reduction

Example: the sum of all inbuf values is returned in resultbuf (only at root):

MPI_Reduce(&inbuf, &resultbuf, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

Predefined operation handles:
  MPI_MAX      Maximum
  MPI_MIN      Minimum
  MPI_SUM      Sum
  MPI_PROD     Product
  MPI_LAND     Logical AND
  MPI_BAND     Bitwise AND
  MPI_LOR      Logical OR
  MPI_BOR      Bitwise OR
  MPI_LXOR     Logical exclusive OR
  MPI_BXOR     Bitwise exclusive OR
  MPI_MAXLOC   Maximum and location of the maximum
  MPI_MINLOC   Minimum and location of the minimum

Slide 18
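A runnable version of the MPI_Reduce call above; each process contributes its rank, so rank 0 prints 0+1+...+(size-1):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int inbuf, resultbuf, my_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    inbuf = my_rank;

    /* count = 1, root = 0: only rank 0 receives the global sum */
    MPI_Reduce(&inbuf, &resultbuf, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (my_rank == 0)
        printf("sum of all ranks = %d\n", resultbuf);
    MPI_Finalize();
    return 0;
}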
MPI Reduce

[Figure: before MPI_REDUCE, the five processes hold the inbuf vectors (A,B,C), (D,E,F), (G,H,I), (J,K,L), (M,N,O); after the reduce with root = 1, the root's result buffer holds AoDoGoJoM in its first element, and correspondingly for the other vector elements]

Slide 19
MPI Variants of Reduction Operations

- MPI_ALLREDUCE: no root; returns the result in all processes.
- MPI_REDUCE_SCATTER: the result vector of the reduction operation is scattered to the processes into the real result buffers.
- MPI_SCAN: prefix reduction; the result at the process with rank i := reduction of the inbuf values from rank 0 to rank i.

Slide 20
MPI Allreduce

[Figure: before MPI_ALLREDUCE, the five processes hold the inbuf vectors (A,B,C) ... (M,N,O); after the allreduce, every process holds the result AoDoGoJoM in its first element, and correspondingly for the other vector elements; there is no root]

Slide 21
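The same reduction without a root, as a small sketch; every process receives the global sum:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int my_rank, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* no root argument: the result arrives in ALL processes */
    MPI_Allreduce(&my_rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d sees the global sum %d\n", my_rank, sum);
    MPI_Finalize();
    return 0;
}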
MPI Scan

[Figure: before MPI_SCAN, the five processes hold the inbuf vectors (A,B,C) ... (M,N,O); after the scan, the results at ranks 0..4 are A, AoD, AoDoG, AoDoGoJ, AoDoGoJoM; done in parallel]

Slide 22
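A small sketch of the prefix reduction with MPI_SUM; rank i ends up with the sum 0+1+...+i:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int my_rank, prefix;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* inclusive prefix reduction: result on rank i = 0 + 1 + ... + i */
    MPI_Scan(&my_rank, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: prefix sum = %d\n", my_rank, prefix);
    MPI_Finalize();
    return 0;
}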
Exercise: Rotating information around a ring

- A set of processes is arranged in a ring.
- Each process stores its rank in MPI_COMM_WORLD in an integer variable snd_buf.
- Each process passes this on to its neighbour on the right.
- Each process calculates the sum of all values.
- Keep passing it around the ring until the value is back where it started, i.e. each process calculates the sum of all ranks.
- Use non-blocking MPI_Issend to avoid deadlocks and to verify the correctness, because a blocking synchronous send would cause a deadlock.

[Figure: ring of processes; each process holds the variables my_rank, snd_buf, and sum, with sum initialized to 0]

Slide 23
Exercise: Rotating information around a ring

[Figure: initialization, then in each iteration snd_buf is passed to the right neighbour and received into rcv_buf; each process holds my_rank, snd_buf, rcv_buf, sum]

Fortran: dest   = mod(my_rank+1, size)
         source = mod(my_rank-1+size, size)
C:       dest   = (my_rank+1) % size;
         source = (my_rank-1+size) % size;

Single Program!!!

Slide 24
(see also the login slides)
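One possible solution sketch of the exercise, using the Issend / Recv / Wait scheme named on the slide; this is an illustration, not necessarily the course's reference solution:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int my_rank, size, dest, source, i;
    int snd_buf, rcv_buf, sum = 0;
    MPI_Request request;
    MPI_Status  status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    dest    = (my_rank + 1) % size;
    source  = (my_rank - 1 + size) % size;
    snd_buf = my_rank;

    for (i = 0; i < size; i++) {
        /* non-blocking synchronous send avoids the deadlock that a
           blocking MPI_Ssend would cause on this cyclic pattern */
        MPI_Issend(&snd_buf, 1, MPI_INT, dest, 0, MPI_COMM_WORLD, &request);
        MPI_Recv(&rcv_buf, 1, MPI_INT, source, 0, MPI_COMM_WORLD, &status);
        MPI_Wait(&request, &status);

        snd_buf = rcv_buf;   /* pass the received value on in the next round */
        sum += rcv_buf;
    }

    printf("rank %d: sum of all ranks = %d\n", my_rank, sum);
    MPI_Finalize();
    return 0;
}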
Advanced Exercises: Irecv instead of Issend

- Substitute the Issend/Recv/Wait method by the Irecv/Ssend/Wait method in your ring program.
- Or: substitute the Issend/Recv/Wait method by the Irecv/Issend/Waitall method in your ring program.
- Or: use the collective call MPI_Reduce to reduce the information.

Slide 25