Distributed Systems + Middleware Advanced Message Passing with MPI

Wendy Jordan
6 years ago

1 Distributed Systems + Middleware Advanced Message Passing with MPI Gianpaolo Cugola Dipartimento di Elettronica e Informazione Politecnico, Italy cugola@elet.polimi.it

2 References Tutorial on MPI Lawrence Livermore National Laboratory Middleware: MPI 2


3 MPI basics
MPI (Message Passing Interface) is a middleware for high-performance computing scenarios
- It only standardizes an API
- Several C/C++ implementations exist (OpenMPI, MPICH, etc.)
The goal of MPI is to provide a standard API for high-performance distributed computing that is:
- Easy to use
- Portable
- Efficient
- Flexible

4 How an MPI program is organized

5 Format of MPI calls
General format: rc = MPI_Xxxxx(parameter, ...)
Example: rc = MPI_Bsend(&buf, count, type, dest, tag, comm)
The error code is returned as "rc", which equals MPI_SUCCESS if the call was successful

6 Example: Hello world in MPI
    #include "mpi.h"
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        printf("hello, world!\n");
        MPI_Finalize();
        return 0;
    }

7 Compiling and running MPI programs
How to compile and run MPI programs depends on the specific implementation used
In OpenMPI and MPICH:
- Compiling: mpicc -o hello hello.c
- Running: mpirun [-oversubscribe] -np <N> hello
  where -np <N> specifies that the code has to be executed by N processes in parallel

8 Communicators, groups and ranks
MPI uses objects called communicators and groups to organize collections of processes and define the scope of communication
- Most MPI routines require you to specify a communicator as an argument (more on this later)
- MPI_COMM_WORLD is the predefined communicator that includes all MPI processes
Within a communicator, every process has its own rank
- A unique integer ID assigned by the system
- Ranks are contiguous and begin at zero
- Used by the programmer to specify the source and destination of messages
- Often also used conditionally to control program execution (if rank == 0 do this / if rank == 1 do that)

9 Discovering the MPI environment
MPI provides several functions to manage and query the environment
- int MPI_Comm_size(MPI_Comm comm, int *size)
  Gets the total number of MPI processes in the specified communicator
- int MPI_Comm_rank(MPI_Comm comm, int *rank)
  Gets the rank of the calling MPI process within the specified communicator
- int MPI_Get_processor_name(char *name, int *resultlen)
  Returns the processor name, which is implementation dependent and may not be the same as the output of the "hostname" shell command

10 Example: Hello world v2
    #include "mpi.h"
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        /* Ask the environment how many processes exist and which one we are */
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("hello, world! ");
        printf("i am process %d out of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

11 Point-to-point communication
MPI = Message Passing
    Process 0: Send(data)
    Process 1: Receive(data)
MPI defines:
- How data is represented
- How the message recipient is identified
- How the actual communication is implemented

12 Blocking vs. non-blocking calls
Most MPI p2p routines can be used in blocking or non-blocking mode
Blocking:
- A blocking send routine only "returns" after it is safe to modify the application buffer for reuse
- Safe does not imply that the data was actually received: it may very well be sitting in a system buffer on the receiving host
- A blocking send can be synchronous (with a confirmation from the receiver) or asynchronous (if a system buffer is used to hold the data for eventual delivery)
- A blocking receive only "returns" after the data has arrived and is ready for use by the program
Non-blocking:
- Non-blocking send and receive calls return immediately, without waiting for any communication to happen
- It is unsafe to modify the application buffer until you know the requested non-blocking operation was actually performed
- There are "wait" routines used to do this
- Non-blocking communications are primarily used to overlap computation with communication and exploit possible performance gains

13 Ordering
MPI guarantees that messages will not overtake each other
- If a sender sends two messages (Message 1 and Message 2) in succession to the same destination, and both match the same receive, the receive operation will receive Message 1 before Message 2
- If a receiver posts two receives (Receive 1 and Receive 2) in succession, and both are looking for the same message, Receive 1 will receive the message before Receive 2
- Order rules do not apply if there are multiple threads participating in the communication operations

14 MPI data types
For reasons of portability, MPI predefines its elementary data types
- A specific data type exists for each primitive C type: MPI_INT, MPI_DOUBLE, MPI_CHAR, ...
- Programmers may also create their own structured data types

15 Recipients and message tags
For point-to-point communication, the recipient is specified through its rank and communicator
- Receive routines also ask for the rank of the sender
- This may be set to the wildcard MPI_ANY_SOURCE to receive a message from any task
Message tags must also be specified to identify a message
- Send and receive operations must use matching message tags
- For a receive operation, the wildcard MPI_ANY_TAG can be used to receive any message regardless of its tag

16 Base blocking routines
    MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
    MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
status provides additional information about the sender and message tag:
    int recvd_tag, recvd_from, recvd_count;
    MPI_Status status;
    MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status);
    recvd_tag = status.MPI_TAG;
    recvd_from = status.MPI_SOURCE;
    MPI_Get_count(&status, datatype, &recvd_count);

17 Example: Ping
    #include "mpi.h"
    #include <stdio.h>

    /* Run with at least 2 processes */
    int main(int argc, char *argv[]) {
        int numtasks, rank, dest, source, rc, count, tag = 1;
        char inmsg, outmsg = 'x';
        MPI_Status stat;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Rank 0 sends first, then waits for the reply */
            dest = 1; source = 1;
            rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
            rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &stat);
        } else if (rank == 1) {
            /* Rank 1 receives first, then replies */
            dest = 0; source = 0;
            rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &stat);
            rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        }

        /* Only ranks 0 and 1 received a message, so only they hold a valid status */
        if (rank < 2) {
            rc = MPI_Get_count(&stat, MPI_CHAR, &count);
            printf("task %d: received %d char(s) from task %d with tag %d\n",
                   rank, count, stat.MPI_SOURCE, stat.MPI_TAG);
        }

        MPI_Finalize();
        return 0;
    }

18 Additional blocking routines
- MPI_Ssend
  Synchronous blocking send: sends a message and blocks until the destination process has started to receive the message
- MPI_Rsend
  Blocking ready send: should only be used if the programmer is certain that the matching receive has already been posted
- MPI_Sendrecv
  Sends a message and posts a receive before blocking; blocks until the sending application buffer is free for reuse and the receiving application buffer contains the received message

19 Deadlock
    Process 0: Send(1); Recv(1)
    Process 1: Send(0); Recv(0)
- With synchronous send there is always a deadlock
- Even with the standard, asynchronous send, a deadlock may occur with large messages if the system buffer is not big enough
- The code is unsafe: its correctness depends on a system choice

20 Other cases of deadlock
    int a[10], b[10], myrank;
    MPI_Status status;
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        MPI_Ssend(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
        MPI_Ssend(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        /* Deadlock: rank 1 waits for tag 2 while rank 0 is still blocked sending tag 1 */
        MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
        MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    }
    ...

21 Other cases of deadlock
    int a[10], b[10], np, myrank;
    MPI_Status status;
    ...
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    /* Every process blocks in the synchronous send; nobody reaches the receive */
    MPI_Ssend(a, 10, MPI_INT, (myrank+1)%np, 1, MPI_COMM_WORLD);
    /* + np keeps the source rank non-negative for rank 0 */
    MPI_Recv(b, 10, MPI_INT, (myrank-1+np)%np, 1, MPI_COMM_WORLD, &status);
    ...
We have a circular dependency, i.e., a deadlock!

22 Deadlock-free circular communication
    int a[10], b[10], np, myrank;
    MPI_Status status;
    ...
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank%2 == 1) {
        /* Odd ranks send first, then receive */
        MPI_Ssend(a, 10, MPI_INT, (myrank+1)%np, 1, MPI_COMM_WORLD);
        MPI_Recv(b, 10, MPI_INT, (myrank-1+np)%np, 1, MPI_COMM_WORLD, &status);
    } else {
        /* Even ranks receive first, then send */
        MPI_Recv(b, 10, MPI_INT, (myrank-1+np)%np, 1, MPI_COMM_WORLD, &status);
        MPI_Ssend(a, 10, MPI_INT, (myrank+1)%np, 1, MPI_COMM_WORLD);
    }
    ...


23 Non-blocking routines
    MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
    MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
Two blocking routines are provided to wait for the non-blocking send and receive to end:
    MPI_Wait(MPI_Request *request, MPI_Status *status)
    MPI_Waitall(int count, MPI_Request *array_of_requests, MPI_Status *array_of_statuses)

24 Example: Non-blocking routines
    int x, myrank, msgtag = 1;
    MPI_Request req;
    MPI_Status status;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req);
        compute(); // do something (must not modify x)
        MPI_Wait(&req, &status);
    } else if (myrank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
    }

25 Non-blocking routines and deadlock
Non-blocking routines reduce the chance of deadlocking
    Process 0: Isend(1); Irecv(1); Waitall
    Process 1: Isend(0); Irecv(0); Waitall

26 Example
    #include "mpi.h"

    int main(int argc, char *argv[]) {
        int numtasks, rank, next, prev, buf[2], tag1 = 1, tag2 = 2;
        MPI_Request reqs[4];
        MPI_Status stats[4];
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Neighbors in a ring */
        prev = rank - 1;
        next = rank + 1;
        if (rank == 0) prev = numtasks - 1;
        if (rank == (numtasks - 1)) next = 0;
        /* Post all receives and sends without blocking */
        MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(&rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD, &reqs[3]);
        /* Do some work here */
        MPI_Waitall(4, reqs, stats);
        MPI_Finalize();
        return 0;
    }

27 Collective communication routines
- Allow a group of processes to collaborate in achieving a common goal
- Include routines to synchronize processes, broadcast messages, scatter, gather, and reduce data

28 Synchronization
    MPI_Barrier(MPI_Comm comm)
- Creates a barrier synchronization in a group
- Each process, when reaching the MPI_Barrier call, blocks until all processes in the group reach the same MPI_Barrier call
- Then all tasks are free to proceed

29 Broadcast & Reduce
[Figure: with Bcast (root=0), the value A held by process 0 is copied to processes 1, 2 and 3. With Reduce (root=0), the values A, B, C and D in the send buffers of processes 0 to 3 are combined into X = A op B op C op D, which is stored in the receive buffer of process 0.]

30 Broadcast & Reduce
    MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
- Sends a message from root to all processes that are part of communicator comm (including the root itself)
- The function must be invoked by all processes (the sender, i.e., root, and the receivers)
- On each process, the function returns once the message is available in that process's local buffer
    MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
- The function is invoked by all processes in a communicator
- Every process sends its value (kept in sendbuf)
- All data sent are combined using operation op, and the function ends with the final value available to root in recvbuf
- There is also an MPI_Allreduce routine, which makes the result available to all processes

31 Reduce operators
    MPI Name      Operation
    MPI_MAX       Maximum
    MPI_MIN       Minimum
    MPI_PROD      Product
    MPI_SUM       Sum
    MPI_LAND      Logical and
    MPI_LOR       Logical or
    MPI_LXOR      Logical exclusive or (xor)
    MPI_BAND      Bitwise and
    MPI_BOR       Bitwise or
    MPI_BXOR      Bitwise xor
    MPI_MAXLOC    Maximum value and location
    MPI_MINLOC    Minimum value and location
It is also possible to add user-defined operators through the function MPI_Op_create

32 Example: MPI_Reduce & MPI_MAXLOC
    struct {
        double val;
        int rank;
    } in, out;
    int i, myrank, root;
    double myval;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    in.val = myval;
    in.rank = myrank;
    /* MPI_DOUBLE_INT matches the (double, int) pair that MPI_MAXLOC operates on */
    MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, root, comm);
    /* Result is now available at root */
    if (myrank == root) {
        cout << "Max is " << out.val << " from process " << out.rank << endl;
    }

33 Scatter & Gather
[Figure: with Scatter (root=0), the send buffer ABCD of process 0 is split so that processes 0 to 3 receive A, B, C and D respectively. With Gather (root=0), the values A, B, C and D in the send buffers of processes 0 to 3 are collected as ABCD in the receive buffer of process 0.]

34 Scatter & Gather
    MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
    MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
- Invoked by all processes in the communicator
- sendbuf in MPI_Scatter and recvbuf in MPI_Gather are only relevant for the root process
- There is also an MPI_Allgather routine

35 Example: scatter of a matrix
    #include "mpi.h"
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int numtasks, rank, sendcount, recvcount, source;
        float sendbuf[4][4] = {
            {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}
        };
        float recvbuf[4];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        if (numtasks == 4) {
            /* The root sends one row of the matrix to each of the 4 processes */
            source = 0;
            sendcount = 4;
            recvcount = 4;
            MPI_Scatter(sendbuf, sendcount, MPI_FLOAT, recvbuf, recvcount,
                        MPI_FLOAT, source, MPI_COMM_WORLD);
            printf("rank= %d Results: {%f, %f, %f, %f}\n", rank,
                   recvbuf[0], recvbuf[1], recvbuf[2], recvbuf[3]);
        } else {
            printf("error, you must specify 4 tasks\n");
        }
        MPI_Finalize();
        return 0;
    }

36 Communicators and groups
Communicators define the scope of a communication
Apart from the MPI_COMM_WORLD communicator, it is possible to create new communicators starting from an existing communicator and a group of processes
MPI functions to create groups and communicators:
- MPI_Comm_create
- MPI_Comm_group
- MPI_Comm_split
