Basic MPI Communications

MPI provides two non-blocking routines (the "I" in MPI_Isend and MPI_Irecv stands for "Initiate"):

MPI_Isend(buf,cnt,type,dst,tag,comm,reqHandle)
  buf:       source of the data to be sent
  cnt:       number of data elements to be sent
  type:      type of each data element to be sent
  dst:       rank of the receiving process
  tag:       application-specific identifier for the type of message sent
  comm:      communicator to be used; determines which processes can
             receive the message (rank is relative to the communicator used)
  reqHandle: pointer to a request object used to find out when the send
             is complete

11/1/07 COMP4510 - Introduction to Parallel Computation 56

Basic MPI Communications (cont'd)

MPI_Irecv(buf,cnt,type,src,tag,comm,reqHandle)
  buf:       destination buffer for the data to be received
  cnt:       number of data elements to be received
  type:      type of each data element to be received
  src:       rank of the sending process, or MPI_ANY_SOURCE to receive a
             message from anyone
  tag:       identifier for the type of message received, or MPI_ANY_TAG
             to receive a message with any tag
  comm:      communicator to be used; determines which processes can
             receive the message
  reqHandle: pointer to a request object used to find out when the recv
             is complete

Note: unlike MPI_Recv, there is no status parameter; status is delivered
later, by MPI_Wait.
Basic MPI Communications (cont'd)

It is fine to use non-blocking communication primitives (allowing us to
overlap communication and computation), but we must know when a
send/recv completes:
- for a send, we need to know when we can change the buffer
- for a recv, we need to know when we can process the received data

This is accomplished using MPI_Wait.

Basic MPI Communications (cont'd)

MPI_Wait(reqHandle,status)
  reqHandle: pointer to the request object returned by the MPI_Isend or
             MPI_Irecv
  status:    used to provide various bits of status information

MPI_Wait waits until the communication operation associated with the
request object reqHandle has completed; status information is returned
in status.

Thus we see pairs of operations: an MPI_Isend or MPI_Irecv paired with
MPI_Wait.
Basic MPI Communications (cont'd)

MPI_Isend(buf,sz,type,dst,tag,comm,&reqHndle);
// do some useful computation here
MPI_Wait(&reqHndle,&status);
// safe to rewrite buf here

or:

MPI_Irecv(buf,sz,type,src,tag,comm,&reqHndle);
// do some useful computation here
MPI_Wait(&reqHndle,&status);
// process received data in buf here

MPI Collective Communications

In many applications, it is common to need to move multiple pieces of
data in certain specific ways between the processes in the virtual
machine, e.g. distribute copies of data to all nodes during
initialization, or collect partial results from all nodes.

MPI provides collective communication functions to support several such
common patterns of data movement between processes. These are much
easier to use than writing your own code without them.
MPI provides five basic routines for doing collective communications:
- MPI_Gather:    collects data from each process and delivers it all to
                 one MPI process
- MPI_Allgather: collects data from each process and delivers it all to
                 each MPI process
- MPI_Bcast:     sends a single piece of data from one process to all
                 processes
- MPI_Scatter:   takes data from one process and distributes it, in
                 parts, to all processes
- MPI_Alltoall:  takes selected data from each process and distributes
                 it to all processes

Figure courtesy of Boston University:
http://scv.bu.edu/tutorials/mpi/more/collectives.html
int MPI_Bcast(
  void         *message,  /* in-out */
  int          count,     /* in */
  MPI_Datatype datatype,  /* in */
  int          root,      /* in */
  MPI_Comm     comm       /* in */
)

MPI_Bcast is used to distribute copies of data to all MPI processes.

E.g. each process holds a row of a matrix and we wish to scale the
elements by a given value:
- One MPI process reads in the scaling value (parallel input from
  multiple processes isn't advised)
- That process uses MPI_Bcast to distribute it
- Each process iterates through the elements in the row it stores,
  multiplying each element by the scaling factor
float scale;

if (rank==0) { /* master */
  /* read scaling factor into scale */
}
MPI_Bcast(&scale,1,MPI_FLOAT,0,MPI_COMM_WORLD);
if (rank>0) { /* slaves */
  /* do scale x row */
}

MPI_Bcast is executed on all MPI nodes: on node 0, scale is sent; on the
others it is received.

MPI_Gather(
  void         *sendbuf,   /* in */
  int          sendcount,  /* in */
  MPI_Datatype sendtype,   /* in */
  void         *recvbuf,   /* out @ root */
  int          recvcount,  /* in */
  MPI_Datatype recvtype,   /* in */
  int          root,       /* in */
  MPI_Comm     comm        /* in */
)
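The "scale x row" step each slave performs after the broadcast is just a
loop over its locally stored row. A minimal serial sketch (the helper
name scale_row and the row length are invented for illustration; the
scaling itself is the per-process work, not an MPI call):

```c
#include <stddef.h>

/* Multiply each element of a locally held row by the broadcast
   scaling factor -- the work each rank does after MPI_Bcast
   delivers `scale`. */
void scale_row(float *row, size_t n, float scale)
{
    for (size_t i = 0; i < n; i++)
        row[i] *= scale;
}
```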
MPI_Gather is commonly used to collect partial results from MPI
processes.

E.g. each process has computed one row of a result matrix and we wish
to collect all the rows on one node for output:
- Each MPI process (including the master) computes its row of the matrix
- The MPI processes then use MPI_Gather to collect the rows from each
  process
- The master process then outputs the result (again, I/O is normally
  limited to one machine)

#define R numberOfRowsInTheMatrix
#define C numberOfColumnsInTheMatrix
int i,j;
float vals[C],collectedVals[R][C];

/* compute values (including process 0) */
MPI_Gather(vals,C,MPI_FLOAT,
           collectedVals,C,MPI_FLOAT, /* N.B. number of elements to
                                         receive from EACH node */
           0,MPI_COMM_WORLD);
if (rank==0) { /* master */
  for (i=0;i<R;i++)
    for (j=0;j<C;j++)
      /* output the collected array value */
}
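The layout rule behind this call: with recvcount C, the contribution
from rank i lands at offset i*C in the receive buffer, i.e. at
collectedVals[i]. A serial sketch of that placement rule (the helper
name gather_rows and the toy sizes are invented for illustration; real
MPI_Gather does the copies over the network):

```c
#include <string.h>
#include <stddef.h>

/* Mimic MPI_Gather's layout at the root: the contribution from
   rank r (ncols floats) is copied to row r of the receive buffer. */
void gather_rows(float *recvbuf, const float *sendbufs[],
                 int nranks, int ncols)
{
    for (int r = 0; r < nranks; r++)
        memcpy(recvbuf + (size_t)r * ncols, sendbufs[r],
               (size_t)ncols * sizeof(float));
}
```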
MPI_Allgather(
  void         *sendbuf,   /* in */
  int          sendcount,  /* in */
  MPI_Datatype sendtype,   /* in */
  void         *recvbuf,   /* out */
  int          recvcount,  /* in */
  MPI_Datatype recvtype,   /* in */
  MPI_Comm     comm        /* in */
)

MPI_Allgather is commonly used to distribute intermediate results to
all processes.

E.g. each process has computed one row of a result matrix and we need
to distribute the entire result matrix to all processes for further
calculations:
- Each MPI process computes its row
- Each MPI process uses MPI_Allgather to collect the partial results
  from the other processes
- Each process then has its own copy of the entire result matrix and
  can do whatever it needs to with it
#define R numberOfRowsInTheMatrix
#define C numberOfColumnsInTheMatrix
float vals[C],collectedVals[R][C];

/* compute values (including process 0) */
MPI_Allgather(vals,C,MPI_FLOAT,
              collectedVals,C,MPI_FLOAT,
              MPI_COMM_WORLD);  /* N.B. no root specified */
/* All nodes now have the entire matrix */

int MPI_Scatter(
  void         *sendbuf,   /* in @ root */
  int          sendcount,  /* in */
  MPI_Datatype sendtype,   /* in */
  void         *recvbuf,   /* out */
  int          recvcount,  /* in */
  MPI_Datatype recvtype,   /* in */
  int          root,       /* in */
  MPI_Comm     comm        /* in */
)
MPI_Scatter is commonly used to distribute data subsets to the
processes that will operate on them.

E.g. one process needs to send each row of a matrix to the other
processes for computation:
- The process computes/reads the matrix
- It then uses MPI_Scatter (executed by each process) to send each row,
  in turn, to the other processes for subsequent processing in parallel

#define R numberOfRowsInTheMatrix
#define C numberOfColumnsInTheMatrix
float rows[C],ar[R][C];

/* process 0 loads the array, ar */
MPI_Scatter(ar,C,MPI_FLOAT,rows,C,
            MPI_FLOAT,0,MPI_COMM_WORLD);
/* Each node has its own row in rows */
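Scatter is the inverse of the gather layout above: rank r receives the
r-th block of C floats from the root's send buffer. A serial sketch of
that rule (the helper name scatter_row and the toy sizes are invented
for illustration):

```c
#include <string.h>
#include <stddef.h>

/* Mimic MPI_Scatter's layout rule: rank `rank` receives the
   rank-th block of ncols floats from the root's send buffer. */
void scatter_row(float *recvbuf, const float *sendbuf,
                 int rank, int ncols)
{
    memcpy(recvbuf, sendbuf + (size_t)rank * ncols,
           (size_t)ncols * sizeof(float));
}
```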
MPI_Alltoall(
  void         *sendbuf,   /* in */
  int          sendcount,  /* in */
  MPI_Datatype sendtype,   /* in */
  void         *recvbuf,   /* out */
  int          recvcount,  /* in */
  MPI_Datatype recvtype,   /* in */
  MPI_Comm     comm        /* in */
)

MPI_Alltoall distributes equal shares of data originating at all nodes
in the cluster to all other nodes in the cluster.

Sounds powerful, but what is it good for?
Let's think about transposing a matrix, i.e. interchanging the rows and
the columns:

  7 4 2        7 3 7
  3 1 9   T =  4 1 0
  7 0 5        2 9 5

This is a common operation:

  a b c d        a e i m
  e f g h   T =  b f j n
  i j k l        c g k o
  m n o p        d h l p
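When each rank holds one row, the transpose maps onto MPI_Alltoall
directly: every rank sends element j of its row to rank j, and the
received elements form its row of the transpose. The underlying index
rule, out[j][i] = in[i][j], can be checked serially on the 3x3 example
above (the function name and the fixed size N are illustrative only):

```c
#define N 3  /* square matrix, matching the 3x3 example above */

/* Transpose: element (i,j) of the input becomes element (j,i)
   of the output -- the same movement MPI_Alltoall performs when
   rank i sends its j-th element to rank j. */
void transpose(float in[N][N], float out[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[j][i] = in[i][j];
}
```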
As an exercise, let's rewrite the vector sum operation to use
MPI_Scatter and MPI_Gather to distribute the vector and gather the
partial sums, respectively. (Example online.)

Reductions

MPI also provides reduction (i.e. aggregation) operators, similar to
those we saw in OpenMP but based around collective communications.
These are useful when we want to aggregate simple results together as
we collect them from other processes, e.g. summing up a number of
partial sums.
Reductions (cont'd)

MPI_Reduce(
  void         *operand,  /* in */
  void         *result,   /* out */
  int          count,     /* in */
  MPI_Datatype dt,        /* in */
  MPI_Op       operator,  /* in */
  int          root,      /* in */
  MPI_Comm     comm       /* in */
)

A simple example of the use of MPI_Reduce is, again, the computation of
the sum of the elements in a vector. Each process is assigned a part of
the vector to sum, and then the partial sums must be added together to
give the final sum. Instead of explicitly sending each partial sum to
the master process and adding them up, we can use reduction via
MPI_Reduce.
Reductions (cont'd)

Pictorially:

  Master: distribute vector
  Slaves: do i=1,25    do i=26,50   ...
          sum0 = ...   sum1 = ...
  Master: MPI_Reduce: sum = sum0 + sum1 + ...

int sum;  /* total sum   */
int psum; /* partial sum */

/* compute partial sums */
MPI_Reduce(&psum, &sum, 1, MPI_INT,
           MPI_SUM, 0, MPI_COMM_WORLD);
if (rank==0)
  printf("Sum is %d\n", sum);
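The picture above can be checked serially: split the vector into one
slice per rank, sum each slice, then combine the partials, which is
what MPI_Reduce with MPI_SUM does at the root. A sketch (the function
name is invented; it assumes, for simplicity, that nranks divides n):

```c
/* Serial analogue of the MPI_SUM reduction: each "rank" sums its
   slice of the vector, then the partials are combined, as
   MPI_Reduce would do at the root.  Assumes nranks divides n. */
int reduce_sum(const int *v, int n, int nranks)
{
    int total = 0;
    int chunk = n / nranks;
    for (int r = 0; r < nranks; r++) {
        int psum = 0;                       /* rank r's partial sum */
        for (int i = r * chunk; i < (r + 1) * chunk; i++)
            psum += v[i];
        total += psum;                      /* the MPI_SUM combine  */
    }
    return total;
}
```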
Reductions (cont'd)

As an exercise, let's rewrite the vector sum operation to replace the
MPI_Gather that was used to gather the partial sums with MPI_Reduce,
which will gather and sum them at once. (Example online.)

Reductions (cont'd)

MPI also provides an MPI_Allreduce operation, which collects the
aggregated result at all nodes:

MPI_Allreduce(
  void         *sendbuf,  /* in */
  void         *recvbuf,  /* out */
  int          count,     /* in */
  MPI_Datatype datatype,  /* in */
  MPI_Op       op,        /* in */
  MPI_Comm     comm       /* in */
)
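The difference from MPI_Reduce is only where the result ends up: every
rank's receive buffer gets the combined value, not just the root's. A
serial sketch of that semantics (the function name is invented; one
partial value per rank, combined with MPI_SUM):

```c
/* Serial analogue of MPI_Allreduce with MPI_SUM: combine one
   partial value per rank, then deliver the combined result to
   every rank's receive slot (not just the root's). */
void allreduce_sum(const int *partials, int *results, int nranks)
{
    int total = 0;
    for (int r = 0; r < nranks; r++)
        total += partials[r];
    for (int r = 0; r < nranks; r++)
        results[r] = total;   /* every rank gets the same value */
}
```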