PROGRAMMING WITH MESSAGE PASSING INTERFACE
J. Keller, Feb 26, 2018

Structure
- Message Passing Programs
- Basic Operations for Communication
- Message Passing Interface Standard
- First Examples
- Collective Communication Operations
- Send/Receive Variants

Message-Passing Programs I
- Message passing program = multiple processes, each with its own address space
- Explicit parallelism via multiple processes
- Each input is stored with one process: no replication of data if possible
  - The same holds for intermediate results and final results
- Necessary: a mapping of data to processes
  - The mapping is known to all processes

Message-Passing Programs II
- Map data to processes such that the majority of accesses is local
- Access to non-local data: communication between processes
- Communication is explicit and two-sided
  - two-sided = both sender and receiver are involved
- Read access by process i to non-local data (stored with process j): two possible cases

Message-Passing Programs III
- Case 1: process j knows which data are wanted, who wants them, and when
  - Process j sends the data, process i receives
- Case 2: some of this information is unknown to process j
  - Process i sends a request to process j
  - Process j receives the request
  - Process j sends the data, process i receives
- Write access to non-local data:
  - Process i sends the data (and meta information if needed) to process j
  - Process j receives (and stores locally)
- Process j participates actively in each case!

Message-Passing Programs IV
- Process j must plan calls to communication routines although they are not part of its own computation
- Difficult for dynamic and/or unstructured communication
- Goal: well-structured communication
- Communicate as seldom as possible

Message-Passing Programs V
- Advantages:
  - Communication and communication cost are explicit, which simplifies optimization
  - Synchronization of processes comes for free with communication
  - The paradigm fits almost any collection of computers
- Consequence: almost all programs on high-performance computers are message passing programs
- Today often coupled with OpenMP

Basic Operations for Comm. I
- Basic operations for communication: send and receive
- Three relevant parameters:
  - Pointer to data
  - Size of data
  - ID of communication partner
- Send: data to be sent and target ID
- Receive: buffer for data to be received
  - The ID can be a wildcard = receive from any process

Basic Operations for Comm. II
- Send/Receive come in several variants:
  - non-blocking
  - (possibly) blocking
  - synchronous
- Non-blocking send: forwards pointer, size and target ID to the communication system and returns
  - Advantage: very fast
  - Problem: unclear when the data are transmitted/copied, hence unclear when the local data can be modified or deleted
  - Solution: send returns an identifier that can be used to check for completion

Basic Operations for Comm. III
- Synchronous send: returns only when the receiving process invokes the receive function
  - Advantage: local send buffer is free upon return
  - Advantage: processes are synchronized, like a barrier
  - Disadvantage: possibly has to wait for the receiving process
- (Possibly) blocking send: returns when the data are transmitted or copied to a system buffer
  - Advantage: local send buffer is free upon return
  - Disadvantage: time depends on the availability of buffers; possibly almost as fast as a non-blocking send, possibly as slow as a synchronous send

Basic Operations for Comm. IV
- Non-blocking receive:
  - if no data have arrived yet, return with status
  - if data have arrived, return with the data
  - Advantage: fast and flexible (the process can do something in between)
  - Disadvantage: the programmer is responsible for checking regularly
- Blocking receive: wait until the data are received in the system buffer, copy them to the local buffer and return
  - Advantage: message is handled as soon as possible
  - Disadvantage: additional buffers, additional copy, additional time

Basic Operations for Comm. V
- Beware of deadlock! Example:

    Process 0                    Process 1
    send(dataptr1,1,proc1);      send(dataptr3,1,proc0);
    recv(dataptr2,1,proc1);      recv(dataptr4,1,proc0);

- The example works for blocking send with buffer, but not for synchronous send or blocking send without buffer

Basic Operations for Comm. VI
- The same can happen with blocking receive! Example:

    Process 0                    Process 1
    recv(dataptr2,1,proc1);      recv(dataptr4,1,proc0);
    send(dataptr1,1,proc1);      send(dataptr3,1,proc0);
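
The two deadlock examples can be checked with a small single-process model. This is only a sketch under the assumption of synchronous (rendezvous) semantics, where a send completes only when the partner is at the matching receive; `op_t` and `deadlocks_sync` are hypothetical names and not part of any message passing library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: each of two processes executes a fixed sequence
   of operations, all addressed to the other process. */
typedef enum { OP_SEND, OP_RECV } op_t;

/* Returns 1 if the two sequences deadlock under synchronous (rendezvous)
   semantics, 0 if both run to completion. A step is possible only when
   one process is at a send and the other one at a receive. */
int deadlocks_sync(const op_t *p0, size_t n0, const op_t *p1, size_t n1)
{
    size_t i = 0, j = 0;
    assert(p0 != NULL && p1 != NULL);
    while (i < n0 && j < n1) {
        if (p0[i] == p1[j])
            return 1;   /* send/send or recv/recv: both wait forever */
        i++;            /* a matching send/recv pair completes together */
        j++;
    }
    /* Leftover operations have no partner and would block forever. */
    return (i < n0 || j < n1) ? 1 : 0;
}
```

Calling it with the sequence {OP_SEND, OP_RECV} for both processes reports a deadlock, as does {OP_RECV, OP_SEND} for both; letting one process send first and the other receive first resolves it.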

Basic Operations for Comm. VII
- If send and receive are executed at about the same time: better without buffer, especially for large data
- If send is executed earlier than receive: better with buffer, as send need not wait
- If receive is executed earlier than send: receive must wait, so it is better to change the algorithm

MPI Standard I
- MPI = Message Passing Interface
- A standard, not a product
- Maintained by the MPI Forum, a consortium of companies and universities: www.mpi-forum.org
- Versions:

    Version   Year   Remarks
    1.0       1994
    1.1       1995
    1.2       1997   as part of MPI 2.0; used in the course
    1.3       2008   as part of MPI 2.1, end of MPI 1.x
    2.0       1997   MPI 1.2 + parallel I/O, one-sided communication, dynamic processes
    2.1       2008   comprises MPI 1.3
    2.2       2009
    3.0       2012   e.g. non-blocking collective operations
    3.1       2015
    4.0              under discussion

MPI Standard II
- Free implementations are available, e.g.:
  - LAM/MPI: www.lam-mpi.org
  - MPICH: www.mpich.org
- An implementation comprises
  - daemon processes on all cluster nodes: start processes, implement communication
  - a library with an API: allows the MPI program to communicate

MPI Standard III
- MPI program = SPMD program in C with extensions
- The program starts with a fixed number of processes
- Compiling and starting depend on the implementation
  - Example when on the console: mpirun -np 4 <exe-name>
  - With a batch system such as Torque: give the meta-data to the batch system
- MPI 1.2: more than 125 functions
- Minimum: 6 functions

MPI Standard IV
- Use the header file: #include <mpi.h>
- int MPI_Init(int *argc,char ***argv)
  - Parameters = pointers to the parameters of main()
  - Call in the program before calling other MPI functions and before evaluating argc and argv
  - Returns MPI_SUCCESS or an error code
- int MPI_Finalize(void)
  - Call towards the end of the program; no other MPI function may be called afterwards
  - Return value: like MPI_Init

(Slide footer: Parallel Programming with MPI, LG Parallelität und VLSI, Prof. Dr. J. Keller)

MPI Standard V
- int MPI_Comm_size(MPI_Comm comm,int *size)
- int MPI_Comm_rank(MPI_Comm comm,int *rank)
- size gives the number of processes
- rank gives the ID of the calling process, in the range 0 ... size-1
- First parameter: communicator, declares a set of processes
  - Normally MPI_COMM_WORLD
  - Processes can be partitioned; subsets are denoted by other communicators
  - The communicator is important for collective communication like broadcast, i.e. sending a message to all processes

MPI Standard VI
- int MPI_Send(void *buf,int cnt,MPI_Datatype dt,int dest,int tag,MPI_Comm comm)
- int MPI_Recv(void *buf,int cnt,MPI_Datatype dt,int src,int tag,MPI_Comm comm,MPI_Status *st)
- Datatypes: MPI_CHAR, MPI_INT, MPI_FLOAT, MPI_DOUBLE
- Tag: distinguishes different messages from one sender
  - Range: 0 ... MPI_TAG_UB (at least 32767)
- Wildcards: MPI_ANY_SOURCE, MPI_ANY_TAG

MPI Standard VII
- Status: gives sender and tag, relevant when using MPI_ANY_*
  - struct MPI_Status { int MPI_SOURCE, MPI_TAG, MPI_ERROR; }
- Return value: MPI_SUCCESS or an error code
  - e.g. MPI_ERR_TRUNCATE at recv if the message is longer than the buffer
- If the message could be shorter than the buffer:
  - int MPI_Get_count(MPI_Status *st,MPI_Datatype dt,int *cnt) gives the number of received elements
- MPI_Send and MPI_Recv are possibly blocking; buffering depends on the implementation, i.e. they might behave synchronously

MPI First Examples I

    #include <mpi.h>
    int main(int argc,char **argv){
      int size, rank, tmp;
      if(MPI_Init(&argc,&argv)!=MPI_SUCCESS) return -1;
      MPI_Comm_size(MPI_COMM_WORLD,&size);
      if(size<2){ MPI_Finalize(); return -2; } // 2 processes needed
      MPI_Comm_rank(MPI_COMM_WORLD,&rank);
      // communication code, see MPI First Examples II
      MPI_Finalize();
      return 0;
    }

MPI First Examples II

    if(rank==0){
      tmp = 1234;
      MPI_Send((void*)&tmp,1,MPI_INT,1,15,MPI_COMM_WORLD);
      MPI_Recv((void*)&tmp,1,MPI_INT,1,16,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
      if(tmp != 1235) return -3;
    }else if(rank==1){ // further processes stay idle
      MPI_Recv((void*)&tmp,1,MPI_INT,0,15,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
      tmp++;
      MPI_Send((void*)&tmp,1,MPI_INT,0,16,MPI_COMM_WORLD);
    }

MPI First Examples III
- Multiplication of a 4 x n matrix A and an n-vector b
- The cluster has 4 nodes, i.e. the program is started with 4 processes
- 3 questions:
  - Which node gets which part of A and b? (input)
  - Which node computes which part of c = A*b? (output)
  - Which non-local data will it need? (communication)
- Computation: $c = (c_1, \dots, c_4)^T$ with $c_i = a_{i1} b_1 + \dots + a_{in} b_n$, $i = 1, \dots, 4$

MPI First Examples IV
- Input, matrix A: node i stores the i-th row of A in local memory, i = 1, ..., 4
- Output: node i computes element c_i of the result vector
- Each node needs the complete vector b
- Either: replicate the vector
  - 2*n doubles per node instead of 1.25*n, no communication
- Alternative: each node stores ¼ of the vector locally
  - Communication: rotate the parts in a ring communication
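
The memory figures on this slide follow from a short count, sketched here under the assumption that every entry is one double and that n is divisible by 4:

```latex
\begin{align*}
\text{replicated vector:}  \quad & \underbrace{n}_{\text{row of } A} + \underbrace{n}_{\text{all of } b} = 2n \\
\text{distributed vector:} \quad & \underbrace{n}_{\text{row of } A} + \underbrace{n/4}_{\text{part of } b} = 1.25\,n
\end{align*}
```

The saving of 0.75 n doubles per node is paid for with one message of n/4 doubles per node in every step of the ring communication.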

MPI First Examples V
[Figure: initial vector distribution and the computation that is possible with it]

MPI First Examples VI
[Figure: vector distribution after one ring communication and the next computation]
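
MPI First Examples VII
Algorithm:

    res = 0;
    for(j=0,rnd=rank; j<4; j++,rnd=(rnd+1)%4){
      // vector currently holds part rnd of b
      for(i=0;i<n/4;i++)
        res += row[rnd*n/4+i]*vector[i];
      // pass the part to the left neighbour, get the next part from the right;
      // (rank+3)%4 because (rank-1)%4 is -1 for rank 0 in C;
      // tag j is the same round number on both sides
      MPI_Send(vector,n/4,MPI_DOUBLE,(rank+3)%4,j,MPI_COMM_WORLD);
      MPI_Recv(vector,n/4,MPI_DOUBLE,(rank+1)%4,j,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
    }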

MPI First Examples VIII
- Important: a synchronous send would lead to deadlock, because all processes send before any of them receives
- Cannon's algorithm for matrix-matrix multiplication uses a similar approach to rotate matrix rows and columns
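
The ring rotation can be tried out without a cluster. The following is a single-process simulation, not an MPI program: the constants `P`, `N`, `B` and the function `ring_matvec` are assumptions made for this sketch, and the message exchange is replaced by copying blocks between arrays.

```c
#include <assert.h>
#include <string.h>

#define P 4          /* number of simulated processes */
#define N 8          /* vector length, assumed divisible by P */
#define B (N / P)    /* size of the vector part held by each process */

/* Process r owns row r of A and part r of b. In every round each
   process multiplies the part it currently holds with the matching
   section of its row, then passes the part to its left neighbour. */
void ring_matvec(double A[P][N], const double b[N], double c[P])
{
    double blk[P][B], next[P][B];
    int r, j, i;

    assert(N % P == 0);
    for (r = 0; r < P; r++) {
        memcpy(blk[r], &b[r * B], sizeof blk[r]);
        c[r] = 0.0;
    }
    for (j = 0; j < P; j++) {
        for (r = 0; r < P; r++) {
            int rnd = (r + j) % P;      /* index of the part held now */
            for (i = 0; i < B; i++)
                c[r] += A[r][rnd * B + i] * blk[r][i];
        }
        /* "Communication": r sends to (r + P - 1) % P and thereby
           receives from (r + 1) % P. Adding P avoids a negative
           operand, since (0 - 1) % 4 evaluates to -1 in C. */
        for (r = 0; r < P; r++)
            memcpy(next[(r + P - 1) % P], blk[r], sizeof blk[r]);
        memcpy(blk, next, sizeof blk);
    }
}
```

After P rounds every process has seen every part of b exactly once, so c[r] equals the full dot product of row r with b.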

Collective Communication I
- Collective communication: more than one sender or more than one receiver
- Example: reduction = all MPI processes send data to one receiver, and the data are combined through an operation
  - Allows e.g. to add up partial sums

Collective Communication II
- Broadcast: MPI_Bcast(buffer, count, datatype, root, comm)
  - One sender (= root) sends the same data to all processes
  - Sender: buffer points to the data to be sent
  - All others: buffer points to the memory where the received data are stored
- Scatter: MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
  - One sender (= root) sends data to all processes, but different data to each receiver
- Gather: MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
  - All processes send different data to one receiver (= root)
  - The data are concatenated

Collective Communication III
- Reduce: MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)
  - All processes send data to one receiver (= root)
  - The data are combined by op, e.g. sum or max
- All-to-All: MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
  - Each process sends data to each process
- Barrier: MPI_Barrier(comm)
  - Returns only when all processes of the communicator have reached the barrier
  - Not for data exchange but for flow control, e.g. to separate rounds in iterative algorithms

Collective Communication IV
- Prefix: int MPI_Scan(sendbuf, recvbuf, cnt, datatype, op, comm)
  - The process with rank i receives the reduction of the data from processes 0 to i
  - Each process is sender and receiver, hence no root!
  - Example: 4 processes, data: 4, 3, 2, 1, operation: sum
  - Result: Proc0: 4, Proc1: 7, Proc2: 9, Proc3: 10
- Additionally:
  - MPI_Allgather = gather + broadcast, no root
  - MPI_Allreduce = reduce + broadcast, no root
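
The prefix example can be reproduced with a few lines of plain C. This is a single-process model of the result only, with the hypothetical name `scan_sum`; it says nothing about how an MPI library computes the scan internally.

```c
#include <assert.h>

/* Inclusive prefix reduction with op = sum:
   out[i] is the value that the process with rank i would receive,
   i.e. in[0] + ... + in[i]. */
void scan_sum(const int *in, int *out, int p)
{
    int i, acc = 0;
    assert(p >= 0);
    for (i = 0; i < p; i++) {
        acc += in[i];
        out[i] = acc;
    }
}
```

For the input 4, 3, 2, 1 this fills out with 4, 7, 9, 10, matching the result on the slide.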

Collective Communication V
- Why collective communication functions?
- NOT a new functionality!
  - All collective operations can be implemented with send/recv plus some local computation
- Higher performance!
  - Sophisticated algorithms, e.g. for fast reduction
  - The implementation can use platform specifics, e.g. support for synchronization
- Relieves the programmer from re-inventing the wheel!
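
As an illustration of what a "sophisticated algorithm for fast reduction" can look like, here is a single-process model of a binomial-tree sum reduction. It is only a sketch: `tree_reduce_sum` is a hypothetical name, and a real MPI implementation is free to choose a different algorithm. A naive reduction, in which the root receives from all other p-1 processes one after the other, needs p-1 steps; the tree needs only about log2(p) rounds because several pairs communicate at the same time.

```c
#include <assert.h>

/* vals[r] is the contribution of rank r and is overwritten with
   partial sums; the total ends up in vals[0] (the root).
   Returns the number of communication rounds used. */
int tree_reduce_sum(int *vals, int p)
{
    int rounds = 0, step, r;
    assert(p >= 1);
    for (step = 1; step < p; step *= 2) {
        /* In one round, every rank r that is a multiple of 2*step
           "receives" from rank r + step, if that rank exists. */
        for (r = 0; r + step < p; r += 2 * step)
            vals[r] += vals[r + step];
        rounds++;
    }
    return rounds;
}
```

With 8 processes the model needs 3 rounds instead of the 7 sequential receives of the naive scheme.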

Collective Communication VI
- Particular to collective communication in MPI:
  - All processes of the communicator call the same routine, no matter whether they are involved as sender or receiver
  - Disadvantage: large number of parameters

MPI One More Example

    // compute sum over array in parallel
    if(rank==0){ p=(int*)malloc(size*n*sizeof(int)); init(p); }
    else p=(int*)malloc(n*sizeof(int));
    // send and receive buffer must not overlap: root keeps its part in place
    MPI_Scatter(p,n,MPI_INT,rank==0?MPI_IN_PLACE:(void*)p,n,MPI_INT,0,MPI_COMM_WORLD);
    for(lsum=0,i=0;i<n;i++) lsum += p[i];
    MPI_Gather(&lsum,1,MPI_INT,p,1,MPI_INT,0,MPI_COMM_WORLD);
    if(rank==0) for(lsum=0,i=0;i<size;i++) lsum += p[i];
    // Alternative to gather and final loop:
    // MPI_Reduce(&lsum,p,1,MPI_INT,MPI_SUM,0,MPI_COMM_WORLD);

Send/Recv Variants I
- Send and receive in one call:

    int MPI_Sendrecv(void *sbuf,int scnt,MPI_Datatype sdt,int dst,int stag,
                     void *rbuf,int rcnt,MPI_Datatype rdt,int src,int rtag,
                     MPI_Comm comm,MPI_Status *s)

- Same buffer, count, and data type for send and receive:

    int MPI_Sendrecv_replace(void *buf,int cnt,MPI_Datatype dt,int dst,int stag,
                             int src,int rtag,MPI_Comm comm,MPI_Status *s)

Send/Recv Variants II
Simplifies the example code. Before:

    if(rank==0){
      tmp = 1234;
      MPI_Send((void*)&tmp,1,MPI_INT,1,15,MPI_COMM_WORLD);
      MPI_Recv((void*)&tmp,1,MPI_INT,1,16,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
      if(tmp != 1235) return -3;
    }else if(rank==1){
      MPI_Recv((void*)&tmp,1,MPI_INT,0,15,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
      tmp++;
      MPI_Send((void*)&tmp,1,MPI_INT,0,16,MPI_COMM_WORLD);
    }

Send/Recv Variants III

    if(rank==0){
      tmp = 1234;
      // send to rank 1 with tag 15, receive from rank 1 with tag 16
      MPI_Sendrecv_replace((void*)&tmp,1,MPI_INT,1,15,1,16,
                           MPI_COMM_WORLD,MPI_STATUS_IGNORE);
      if(tmp != 1235) return -3;
    }else if(rank==1){
      MPI_Recv((void*)&tmp,1,MPI_INT,0,15,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
      tmp++;
      MPI_Send((void*)&tmp,1,MPI_INT,0,16,MPI_COMM_WORLD);
    }

Works also for synchronous send

Send/Recv Variants IV
- Parallel algorithms often work in rounds:
  - Local computation
  - Communication
  - Local computation
  - Communication
- Performance improves when computation and communication are overlapped
- Example: matrix-vector product

Send/Recv Variants V

    for(j=0,rnd=rank;j<4;j++,rnd=(rnd+1)%4){
      for(i=0;i<n/4;i++) res += row[rnd*n/4+i]*vector[i];
      // (rank+3)%4 because (rank-1)%4 is -1 for rank 0 in C
      MPI_Send(vector,n/4,MPI_DOUBLE,(rank+3)%4,j,MPI_COMM_WORLD);
      MPI_Recv(vector,n/4,MPI_DOUBLE,(rank+1)%4,j,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
    }

- The vector is not modified in the local computation, so the send can start prior to the local computation
- But: MPI_Send is possibly blocking
- And if it were non-blocking, the vector might be overwritten in Recv before the send is complete

Send/Recv Variants VI
- Non-blocking variants
  - MPI_Isend: additional parameter MPI_Request *rq
  - MPI_Irecv: request parameter instead of status
- MPI_Wait(MPI_Request *rq,MPI_Status *st)
  - Blocks until the operation specified by the request is complete
  - Returns the status and resets the request
- MPI_Test(MPI_Request *rq,int *flag,MPI_Status *st)
  - If the operation specified by the request is complete: flag != 0 and the request is reset
  - Otherwise: flag == 0
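
Send/Recv Variants VII
Adapt the example code:

    MPI_Request req;
    for(j=0,rnd=rank;j<4;j++,rnd=(rnd+1)%4){
      // start sending, then compute while the transfer is in progress
      MPI_Isend(vector,n/4,MPI_DOUBLE,(rank+3)%4,j,MPI_COMM_WORLD,&req);
      for(i=0;i<n/4;i++) res += row[rnd*n/4+i]*vector[i];
      // vector may be overwritten only after the send has completed
      MPI_Wait(&req,MPI_STATUS_IGNORE);
      MPI_Recv(vector,n/4,MPI_DOUBLE,(rank+1)%4,j,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
    }

The combination of non-blocking send and (possibly) blocking recv is allowed, as are other combinations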

Send/Recv Variants VIII
- Use of different communicators: affects send/recv, and is essential for collective communication
- In MPI: partition the processes of one communicator into groups, each with a new communicator
- Previous communicators remain visible, i.e. a process may belong to several communicators
- Note: the rank of a process may differ between communicators, and the order relation between ranks is not necessarily maintained!

Send/Recv Variants IX
- int MPI_Comm_split(MPI_Comm ca,int c,int k,MPI_Comm *cn)
- Must be called by all processes of the current communicator ca
- All processes with the same value of c (color) belong to one common, new communicator
  - Number and size of the new communicators are flexible
- Ranks within the new communicator are assigned according to the value of k (key)
  - For equal keys, the order in the current communicator is maintained

Send/Recv Variants X
Example with 5 processes:

    MPI_Comm nc;
    int nsize, nrank;
    int c, k;
    // size and rank in MPI_COMM_WORLD have already been computed
    if(rank<3){ c=5; k=4-rank; }
    else      { c=4; k=1; }
    MPI_Comm_split(MPI_COMM_WORLD,c,k,&nc);
    MPI_Comm_size(nc,&nsize);
    MPI_Comm_rank(nc,&nrank);

Send/Recv Variants XI

    rank    0  1  2  3  4
    color   5  5  5  4  4
    key     4  3  2  1  1
    nrank   2  1  0  0  1
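
The nrank row of the table can be recomputed from the color and key rows. The function below is a single-process model of the ranking rule described for MPI_Comm_split (order by key, ties broken by the old rank); `split_ranks` is a hypothetical helper for this sketch and not an MPI function.

```c
#include <assert.h>

/* nrank[r] receives the rank that old rank r gets inside the new
   communicator of its color. */
void split_ranks(const int *color, const int *key, int p, int *nrank)
{
    int r, s;
    assert(p >= 0);
    for (r = 0; r < p; r++) {
        nrank[r] = 0;
        for (s = 0; s < p; s++) {
            if (color[s] != color[r])
                continue;       /* s ends up in another communicator */
            /* s precedes r if it has a smaller key, or the same key
               and a smaller rank in the old communicator */
            if (key[s] < key[r] || (key[s] == key[r] && s < r))
                nrank[r]++;
        }
    }
}
```

Applied to color = 5, 5, 5, 4, 4 and key = 4, 3, 2, 1, 1 it yields nrank = 2, 1, 0, 0, 1, as in the table.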

Master-Worker Example I
- Implement a task pool via the master-worker approach
- Master: knows all tasks
- Workers: ask when idle and receive a task (or a hint that all tasks are completed)
- The master does not know which worker asks next: use a wildcard in recv
- A task result is an int > 0, which the worker sends to the master

Master-Worker Example II
- Between MPI_Init and MPI_Finalize, function main() comprises only the distinction between master and workers:
  if(rank==0) master(size); else worker();
- The master initializes the tasks
  - Data structure tasktype comprises only an int > 0, to avoid complex packing for send
- The master needs the communicator size to count whether all workers have been informed that all tasks are completed
- The master sends common information to all workers

Master-Worker Example III

    void master(int size){
      Pool pd; int info;
      // allocate and initialize pool, send common info
      init(&pd);
      MPI_Bcast((void*)&info,1,MPI_INT,0,MPI_COMM_WORLD);
      do{
        providetask(&pd);
      }while(pd.tasknum > 0);
      size--; // master does not need notification
      while(size--) notifyworker();
    }

Master-Worker Example IV

    void providetask(Pool *p){
      int res; MPI_Status st;
      // wait for a request from any worker; it carries the last result
      MPI_Recv((void*)&res,1,MPI_INT,MPI_ANY_SOURCE,27,
               MPI_COMM_WORLD,&st);
      if(res>0) storeresult(res);
      // answer the worker that asked with the next task
      MPI_Send((void*)&(p->queue[p->index]),1,MPI_INT,
               st.MPI_SOURCE,28,MPI_COMM_WORLD);
      p->index++; p->tasknum--;
    }

Master-Worker Example V

    void notifyworker(){
      int res; int note=-1; // indicates: no more tasks
      MPI_Status st;
      MPI_Recv((void*)&res,1,MPI_INT,MPI_ANY_SOURCE,27,
               MPI_COMM_WORLD,&st);
      if(res>0) storeresult(res);
      MPI_Send((void*)&note,1,MPI_INT,
               st.MPI_SOURCE,28,MPI_COMM_WORLD);
    }

Master-Worker Example VI

    void worker(void){
      int res=-1; int task;
      int info; // for master specific information
      MPI_Bcast((void*)&info,1,MPI_INT,0,MPI_COMM_WORLD);
      do{
        // request a task; res is -1 at first, later the last result
        MPI_Send((void*)&res,1,MPI_INT,0,27,MPI_COMM_WORLD);
        MPI_Recv((void*)&task,1,MPI_INT,0,28,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
        if(task!=-1) res=performtask(task);
      }while(task!=-1);
    }
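
The request/reply protocol of master and workers can be traced in a single process. The following is a simulation sketch, not an MPI program: `simulate_pool` and `perform` are hypothetical names, workers take turns in a fixed order instead of asking at arbitrary times, and the "task" is simply doubled to produce a result > 0.

```c
#include <assert.h>

#define MAXW 64   /* assumed upper bound on the number of workers */

/* Hypothetical stand-in for performtask(); the result must be > 0. */
static int perform(int task) { return task * 2; }

/* Workers send a request (carrying their previous result, or -1 on the
   first request) and get back either a task or the note -1. Returns
   the sum of all stored results; *requests counts the messages that
   the master received. */
long simulate_pool(const int *tasks, int ntasks, int nworkers, int *requests)
{
    int res[MAXW];            /* last result of each worker */
    int done[MAXW] = { 0 };   /* 1 once a worker has been notified */
    int next = 0, active = nworkers, w;
    long sum = 0;

    assert(nworkers >= 1 && nworkers <= MAXW && ntasks >= 0);
    for (w = 0; w < nworkers; w++)
        res[w] = -1;
    *requests = 0;
    while (active > 0) {
        for (w = 0; w < nworkers; w++) {
            if (done[w])
                continue;
            (*requests)++;                 /* master receives, tag 27 */
            if (res[w] > 0)
                sum += res[w];             /* storeresult(res) */
            if (next < ntasks) {
                res[w] = perform(tasks[next++]);   /* reply: a task */
            } else {
                done[w] = 1;               /* reply: note -1 */
                active--;
            }
        }
    }
    return sum;
}
```

In this model the master receives exactly one request per task plus one final request per worker, and no result is lost because each worker's last result travels with the request that is answered by the note.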

Master-Worker Example VII
- Variants:
  - If the task pool is static: a worker can collect its results and send all of them together to the master at the end
  - Possible: request a new task before executing the current task
    - Example for the overlap of computation and communication
    - Saves the round-trip time
    - Can be done with MPI_Isend

Master-Worker Example VIII
- If the pool is dynamic, i.e. if task results can produce new tasks: send k results together
  - The optimal value of k depends on the initial size of the pool and the number of workers: choose it large (to avoid communication) but small enough that the pool does not run empty in between
- Many workers may need several master processes
  - The mapping of workers to masters can be static, dynamic, or probabilistic

Master-Worker Example IX
- Static mapping: a worker always asks the same master
  - If a master has few tasks: the master may request tasks from another master
- Probabilistic mapping: a worker asks each master with a certain probability
  - prob = 1/n for each of n masters: very dynamic
  - prob = w close to 1 for one master, = (1-w)/(n-1) for the others: almost static
  - Other distributions, like geometric, are possible
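
The probabilistic mapping can be written down directly. The sketch below assumes that the caller supplies a uniform random number u in [0,1); `master_prob` and `pick_master` are hypothetical names chosen for this illustration.

```c
#include <assert.h>

/* Probability that a worker asks master m out of n masters, when its
   preferred master is chosen with probability w and the remaining n-1
   masters share 1-w equally. w = 1.0/n gives the uniform case. */
double master_prob(int m, int n, int preferred, double w)
{
    assert(n >= 1 && m >= 0 && m < n);
    assert(preferred >= 0 && preferred < n);
    if (n == 1)
        return 1.0;
    return (m == preferred) ? w : (1.0 - w) / (n - 1);
}

/* Maps a uniform random number u in [0,1) to a master index by walking
   along the cumulative distribution. */
int pick_master(double u, int n, int preferred, double w)
{
    double acc = 0.0;
    int m;
    assert(u >= 0.0 && u < 1.0);
    for (m = 0; m < n; m++) {
        acc += master_prob(m, n, preferred, w);
        if (u < acc)
            return m;
    }
    return n - 1;   /* guards against rounding in the last step */
}
```

With n = 4, preferred master 2 and w = 0.7, the masters are asked with probabilities 0.1, 0.1, 0.7, 0.1, which is the "almost static" case; w = 0.25 gives the uniform, very dynamic case.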

Master-Worker Example X
- Dynamic mapping:
  - A worker always asks one master first (static)
  - If no task is available, it asks other masters (probabilistic)
- Static mapping: one communicator per master and its assigned workers (because of the broadcast)
- Dynamic and probabilistic mappings:
  - one communicator per master and all workers, i.e. each worker is in all communicators
  - or: one communicator, and each master must participate in the broadcasts of the other masters