Parallel Computing: MPI Collective Communication

Parallel Computing: MPI Collective Communication
Thorsten Grahs, 18 May 2015

Table of contents
- Collective Communication
- Communicator
- Intercommunicator

Collective Communication
- Communication involving a group of processes
- The collective group is selected by a suitable communicator
- All participating members issue an identical call; there are no tags
- Collective communication does not necessarily involve all processes (i.e. it is not automatically global communication)

Collective Communication
- The amount of data sent must exactly match the amount of data received
- Collective routines are collective across an entire communicator and must be called in the same order by all processes within the communicator
- Collective routines are all blocking: the buffer can be reused upon return
- Collective routines may return as soon as the calling process's participation is complete
- Collective and point-to-point communication must not be mixed

Collective Communication functions
- Barrier operation MPI_Barrier(): all tasks wait for each other
- Broadcast operation MPI_Bcast(): one task sends to all
- Accumulation operation MPI_Reduce(): one task combines (reduces) data distributed over the tasks
- Gather operation MPI_Gather(): one task collects/gathers data
- Scatter operation MPI_Scatter(): one task scatters data (e.g. a vector)

Multi-task functions
- Multi-broadcast operation MPI_Allgather(): all participating tasks make their data available to all other participating tasks
- Multi-accumulation operation MPI_Allreduce(): all participating tasks receive the result of the operation
- Total exchange MPI_Alltoall(): each involved task sends to and receives from all others
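MPI_Allreduce gets no dedicated example later in the deck, so here is a minimal sketch (my own illustration, not from the slides): every task contributes one double and every task receives the global sum.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = (double)(rank + 1);   /* each task contributes one value */

    /* like MPI_Reduce, but every task receives the result */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: global sum = %f\n", rank, global);
    MPI_Finalize();
    return 0;
}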

Synchronisation
Barrier operation MPI_Barrier(comm)
- All tasks in comm wait for each other at a barrier
- The only collective routine that provides explicit synchronisation
- Returns at any process only after all processes have entered the call
- A barrier can be used to ensure that all processes have reached a certain point in the computation
- Mostly used to synchronise a sequence of tasks (e.g. for debugging)
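A minimal sketch of the synchronisation use just described (my own illustration; the local work is only indicated):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* ... some local work ... */

    /* make sure every process has finished its local work before rank 0 reports;
       the same pattern is common around timers and debugging output */
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("all %d processes passed the barrier\n", size);

    MPI_Finalize();
    return 0;
}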

Example: MPI_Barrier
(figure) The tasks wait for each other at the barrier; an MPI_Isend has not yet completed, so its data cannot be accessed yet.

Broadcast operation
MPI_Bcast(buffer, count, datatype, root, communicator)
- All processes in the communicator use the same function call
- The data of the process with rank root are distributed to all processes in the communicator
- The call is blocking, but does not imply synchronisation
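A minimal sketch (not from the slides), assuming rank 0 is the root and a single integer parameter is to be distributed:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        n = 42;                    /* only the root has the value initially */

    /* every process, including the root, issues the identical call */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d: n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}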

Accumulation operation
MPI_Reduce(sendbf, recvbf, count, type, op, master, comm)
- The process with rank master receives the result
- op is the reduction operation (e.g. summation)
- All processes involved put their local data into sendbf
- The master collects the combined result in recvbf

Reduce operation
Pre-defined operations:
  MPI_MAX     maximum
  MPI_MAXLOC  maximum and index of the maximum
  MPI_MIN     minimum
  MPI_SUM     summation
  MPI_PROD    product
  MPI_LXOR    logical exclusive OR
  MPI_BXOR    bitwise exclusive OR
  ...

Example: Reduce
Summation of the local partial results (teil) into s on rank 0:
MPI_Reduce(teil, s, 1, MPI_DOUBLE, MPI_SUM, 0, comm)
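A minimal runnable sketch of such a sum reduction (illustrative only; here the local partial result is simply derived from the rank):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double teil, s = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    teil = (double)rank;       /* local partial result */

    /* combine all partial results with MPI_SUM; rank 0 receives the sum in s */
    MPI_Reduce(&teil, &s, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", s);

    MPI_Finalize();
    return 0;
}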

Gather operation
MPI_Gather(sbf, scount, stype, rbf, rcount, rtype, ma, comm)
- sbf: local send buffer
- rbf: receive buffer on the master ma
- Each process sends scount elements of data type stype to the master ma; rcount and rtype describe the data received from each single process
- The order of the data in rbf corresponds to the rank order in the communicator comm
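A minimal sketch (not from the slides): each process sends one integer, which rank 0 gathers in rank order.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, sbf;
    int *rbf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sbf = rank * rank;                        /* local contribution */
    if (rank == 0)
        rbf = malloc(size * sizeof(int));     /* only the root needs a receive buffer */

    /* rcount = 1 refers to the amount received from each single process */
    MPI_Gather(&sbf, 1, MPI_INT, rbf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++)
            printf("from rank %d: %d\n", i, rbf[i]);
        free(rbf);
    }
    MPI_Finalize();
    return 0;
}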

Scatter operation
MPI_Scatter(sbf, scount, stype, rbf, rcount, rtype, ma, comm)
- The master ma distributes/scatters the data in sbf
- Each process receives its sub-block of sbf (rcount elements of rtype) in its local receive buffer rbf
- The master ma also sends to itself
- The order of the data blocks in sbf corresponds to the rank order in the communicator comm

Example: Scatter
Three processes involved in comm
Send buffer:    int sbuf[6] = {3, 14, 15, 92, 65, 35};
Receive buffer: int rbuf[2];
The function call
  MPI_Scatter(sbuf, 2, MPI_INT, rbuf, 2, MPI_INT, 0, comm);
leads to the following distribution:
  Process   rbuf
  0         {3, 14}
  1         {15, 92}
  2         {65, 35}
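The same example as a minimal runnable program, assuming it is started with exactly 3 processes and using MPI_COMM_WORLD in place of comm:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    int sbuf[6] = {3, 14, 15, 92, 65, 35};   /* only significant on the root */
    int rbuf[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* rank 0 scatters two ints to every process (including itself) */
    MPI_Scatter(sbuf, 2, MPI_INT, rbuf, 2, MPI_INT, 0, MPI_COMM_WORLD);

    printf("process %d: rbuf = {%d, %d}\n", rank, rbuf[0], rbuf[1]);

    MPI_Finalize();
    return 0;
}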

Example Scatter-Gather: Averaging

/* rand_nums, elements_per_proc, world_rank, world_size and the helpers
   create_rand_nums() / compute_avg() are defined in the surrounding program */
float *rand_nums = NULL;
if (world_rank == 0)
    rand_nums = create_rand_nums(elements_per_proc * world_size);

// Create a buffer that will hold a subset of the random numbers
float *sub_rand_nums = malloc(sizeof(float) * elements_per_proc);

// Scatter the random numbers to all processes
MPI_Scatter(rand_nums, elements_per_proc, MPI_FLOAT, sub_rand_nums,
            elements_per_proc, MPI_FLOAT, 0, MPI_COMM_WORLD);

// Compute the average of your subset
float sub_avg = compute_avg(sub_rand_nums, elements_per_proc);

// Gather all partial averages down to the root process
float *sub_avgs = NULL;
if (world_rank == 0)
    sub_avgs = malloc(sizeof(float) * world_size);
MPI_Gather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

// Compute the total average of all numbers (root only)
if (world_rank == 0) {
    float avg = compute_avg(sub_avgs, world_size);   /* avg now holds the overall average */
}

Multi-broadcast operation
MPI_Allgather(sbuf, scount, stype, rbuf, rcount, rtype, comm)
- The data in the local sbuf of every process are gathered into rbuf on all processes
- No root/master argument is needed, since all processes receive the same data
- MPI_Allgather corresponds to an MPI_Gather followed by an MPI_Bcast

Example Allgather: Averaging

// Gather all partial averages to all the processes
float *sub_avgs = (float *)malloc(sizeof(float) * world_size);
MPI_Allgather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT,
              MPI_COMM_WORLD);

// Compute the total average of all numbers (now possible on every process)
float avg = compute_avg(sub_avgs, world_size);

Output:
/home/th/: mpirun -n 4 ./average 100
Avg of all elements from proc 1 is 0.479736
Avg of all elements from proc 3 is 0.479736
Avg of all elements from proc 0 is 0.479736
Avg of all elements from proc 2 is 0.479736

Total exchange
MPI_Alltoall(sbuf, scount, stype, rbuf, rcount, rtype, comm)
Matrix view:
- Before MPI_Alltoall, process k holds row k of the matrix
- After MPI_Alltoall, process k holds column k of the matrix
- MPI_Alltoall corresponds to an MPI_Gather followed by an MPI_Scatter
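A minimal sketch of this matrix view (not from the slides), with one matrix element per process pair: each process starts with one row of a size x size matrix and ends up with one column.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *row = malloc(size * sizeof(int));   /* row `rank` of the matrix */
    int *col = malloc(size * sizeof(int));   /* will hold column `rank`  */

    for (int j = 0; j < size; j++)
        row[j] = rank * size + j;            /* matrix entry A(rank, j) */

    /* element j of row goes to process j; element i of col comes from process i */
    MPI_Alltoall(row, 1, MPI_INT, col, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d received column %d:", rank, rank);
    for (int i = 0; i < size; i++)
        printf(" %d", col[i]);               /* col[i] = A(i, rank) */
    printf("\n");

    free(row); free(col);
    MPI_Finalize();
    return 0;
}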

Variable exchange operations
Variable scatter & gather variants: MPI_Scatterv & MPI_Gatherv
The following may vary per process:
- the number of data elements distributed to the individual process
- the position of its data block in the send buffer sbuf

Variable Scatter & Gather
Variable scatter:
MPI_Scatterv(sbf, scount, displs, styp, rbf, rcount, rtyp, ma, comm)
- scount[i] contains the number of data elements to be sent to process i
- displs[i] gives the start of the data block for process i, relative to sbf (in elements of styp)
Variable gather:
MPI_Gatherv(sbuf, scount, styp, rbuf, rcount, displs, rtyp, ma, comm)
Variable variants also exist for Allgather and Alltoall (MPI_Allgatherv, MPI_Alltoallv)

Example MPI_Scatterv

/* Initialising */
if (myrank == root) init(sbuf, N);

/* Splitting work and data */
MPI_Comm_size(comm, &size);
Nopt = N / size;
Rest = N - Nopt * size;
displs[0] = 0;
for (i = 0; i < size; i++) {
    scount[i] = Nopt;
    if (i > 0) displs[i] = displs[i-1] + scount[i-1];   /* displacements in elements, not bytes */
    if (Rest > 0) { scount[i]++; Rest--; }
}

/* Distributing data */
MPI_Scatterv(sbuf, scount, displs, MPI_DOUBLE, rbuf,
             scount[myrank], MPI_DOUBLE, root, comm);

Comparison between BLAS & Reduce
Matrix-vector multiplication

Example comparison
Compare different approaches for y = A x with A ∈ R^(N x M) (N rows, M columns):
- row-wise distribution, using a BLAS routine
- column-wise distribution, using a reduction operation

Example row-wise
(figure) Row-wise distribution; the result vector y is distributed over the processes.

Example row-wise: BLAS building block
Matrix-vector multiplication with a BLAS (Basic Linear Algebra Subprograms) kernel, cf. dgemv:

void local_mv(int n, int m, double *y, const double *a, int lda, const double *x)
{
    int i, j;
    double s;
    /* partial sums -- local operation */
    for (i = 0; i < m; i++) {
        s = 0.0;
        for (j = 0; j < n; j++)
            s += a[i*lda + j] * x[j];
        y[i] = s;
    }
}

Timing:
  arithmetic     2 N M T_a
  memory access  x: M T_m(N,1),  y: T_m(M,1),  A: M T_m(N,1)

Example row-wise: vector
Task:
- Initial distribution: all data reside at process 0
- The result vector y is expected at process 0

Example row-wise: matrix
Operations:
- Distribute x to all processes: MPI_Bcast, cost (p-1) T_k(N)
- Distribute the rows of A: MPI_Scatter, cost (p-1) T_k(M N)

Example row-wise: results
  Arithmetic      2 N M T_a
  Communication   (p-1) [T_k(N) + T_k(M N) + T_k(M)]
  Memory access   2 M T_m(N,1) + T_m(M,1)

Example column-wise
Task:
- Column-wise distribution of A
- The solution vector is assembled by a reduction operation

Example column-wise: vector
Distributing vector x: MPI_Scatter, cost (p-1) T_k(M)

Example column-wise: matrix
Distributing matrix A:
- Packing the column blocks into a buffer: N T_m(M,1) + M T_m(N,1)
- Sending: (p-1) T_k(M N)

Example column-wise: result
Assembling vector y with MPI_Reduce
Cost of the reduction of y: log2(p) (T_k(N) + N T_a + 2 T_m(N,1))
  Arithmetic      2 N M T_a
  Communication   (p-1) [T_k(M) + T_k(M N)] + log2(p) T_k(N)
  Memory access   N T_m(M,1) + M T_m(N,1) + 2 log2(p) T_m(N,1)
The column-wise algorithm is slightly faster.
Parallelisation only pays off if the corresponding data distribution is already available before the algorithm starts.

Communicator

Communicators
Motivation: a communicator distinguishes different communication contexts
- Conflict-free organisation of process groups
- Integration of third-party software
- Example: distinction between library functions and application code
Predefined communicators:
- MPI_COMM_WORLD
- MPI_COMM_SELF
- MPI_COMM_NULL

Duplicate communicators
MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm);
- Creates a copy newcomm of comm
- Identical process group, but a separate communication context
- Allows a clear delineation/characterisation of process groups
Example:
MPI_Comm myworld;
...
MPI_Comm_dup(MPI_COMM_WORLD, &myworld);

Splitting communicators
MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm);
- Divides the communicator comm into multiple communicators with disjoint process groups
- MPI_Comm_split has to be called by all processes in comm
- Processes with the same value of color form a new communicator group
- The key argument determines the rank order within each new group

Example: Splitting a communicator

int size, rank, i, j;
MPI_Comm comm1, comm2, newcomm;
MPI_Comm_size(comm, &size);
MPI_Comm_rank(comm, &rank);
i = rank % 3;
j = size - rank;
if (i == 0)
    MPI_Comm_split(comm, MPI_UNDEFINED, 0, &newcomm);
else if (i == 1)
    MPI_Comm_split(comm, i, j, &comm1);
else
    MPI_Comm_split(comm, i, j, &comm2);

For color MPI_UNDEFINED the call returns the null handle MPI_COMM_NULL.

Example: Splitting a communicator

MPI_COMM_WORLD:
  Rank    P0  P1  P2  P3  P4  P5  P6  P7  P8
  color   -   1   2   -   1   2   -   1   2
  key     8   7   6   5   4   3   2   1   0

Resulting groups:
  comm1:  P1  P4  P7   (new ranks 2, 1, 0)
  comm2:  P2  P5  P8   (new ranks 2, 1, 0)
  P0, P3, P6 (color MPI_UNDEFINED) obtain MPI_COMM_NULL
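A minimal runnable sketch of this example (my own illustration; start it with 9 processes to reproduce the table, color and key follow the code above):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, i, j, newrank;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    i = rank % 3;                 /* color: 0 -> MPI_UNDEFINED, 1 -> comm1, 2 -> comm2 */
    j = size - rank;              /* key: reverses the rank order inside each group */

    MPI_Comm_split(MPI_COMM_WORLD, (i == 0) ? MPI_UNDEFINED : i, j, &newcomm);

    if (newcomm != MPI_COMM_NULL) {
        MPI_Comm_rank(newcomm, &newrank);
        printf("world rank %d -> color %d, new rank %d\n", rank, i, newrank);
        MPI_Comm_free(&newcomm);
    } else {
        printf("world rank %d obtained MPI_COMM_NULL\n", rank);
    }

    MPI_Finalize();
    return 0;
}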

Freeing a communicator
Clean up with MPI_Comm_free(MPI_Comm *comm);
- Deletes the communicator comm; the resources occupied by comm are released by MPI
- After the call the communicator handle has the value MPI_COMM_NULL
- MPI_Comm_free has to be called by all processes that belong to comm

Grouping communicators
MPI_Comm_group(MPI_Comm comm, MPI_Group *grp)
Creates a process group from a communicator.
More group constructors:
- MPI_Comm_create: generates a communicator from a group
- MPI_Group_incl: includes processes in a group
- MPI_Group_excl: excludes processes from a group
- MPI_Group_range_incl: forms a group from a simple pattern
- MPI_Group_range_excl: excludes processes from a group by a simple pattern

Example: create a group
Group grp = (a,b,c,d,e,f,g), n = 3, rank = [5,0,2]

MPI_Group_incl(grp, n, rank, &newgrp)
- Include in the new group newgrp the n = 3 processes given by rank = [5,0,2]
- newgrp = (f,a,c)

MPI_Group_excl(grp, n, rank, &newgrp)
- Exclude from grp the n = 3 processes given by rank = [5,0,2]
- newgrp = (b,d,e,g)

Example: create a group II
Group grp = (a,b,c,d,e,f,g,h,i,j), n = 3, ranges = [[6,7,1],[1,6,2],[0,9,4]]
Each range is a triple [start, end, stride]

MPI_Group_range_incl(grp, 3, ranges, &newgrp)
- Include in the new group newgrp the ranks covered by the three range triples [[6,7,1],[1,6,2],[0,9,4]]
- newgrp = (g,h,b,d,f,a,e,i)

MPI_Group_range_excl(grp, 3, ranges, &newgrp)
- Exclude from grp the ranks covered by the three range triples
- newgrp = (c,j)
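A minimal sketch (not from the slides) tying groups back to communicators: a sub-group is built from the explicitly listed ranks 5, 0, 2, as in the first group example, and turned into a communicator with MPI_Comm_create (run with at least 6 processes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    int ranks[3] = {5, 0, 2};
    MPI_Group world_grp, newgrp;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);        /* group of MPI_COMM_WORLD */
    MPI_Group_incl(world_grp, 3, ranks, &newgrp);      /* sub-group of ranks 5, 0, 2 */
    MPI_Comm_create(MPI_COMM_WORLD, newgrp, &newcomm); /* collective over MPI_COMM_WORLD */

    if (newcomm != MPI_COMM_NULL) {                    /* only group members get a real communicator */
        int newrank;
        MPI_Comm_rank(newcomm, &newrank);
        printf("world rank %d has rank %d in the new communicator\n", rank, newrank);
        MPI_Comm_free(&newcomm);
    }

    MPI_Group_free(&newgrp);
    MPI_Group_free(&world_grp);
    MPI_Finalize();
    return 0;
}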

Operations on communicator groups
More grouping functions:
- Merging groups:          MPI_Group_union
- Intersection of groups:  MPI_Group_intersection
- Difference of groups:    MPI_Group_difference
- Comparing groups:        MPI_Group_compare
- Deleting/freeing groups: MPI_Group_free
- Size of a group:         MPI_Group_size
- Rank within a group:     MPI_Group_rank
- ...

Intercommunicator

Intercommunicator
Intracommunicator:
- Up to now we have only handled communication inside a single group of processes
- This communication takes place inside (intra) a communicator
Intercommunicator:
- A communicator that establishes a context between two groups
- Intercommunicators are associated with two groups of disjoint processes
- Intercommunicators are associated with a remote group and a local group
- The target process (destination for a send, source for a receive) is addressed by its rank in the remote group
- A communicator is either intra or inter, never both

Create an intercommunicator
MPI_Intercomm_create(local_comm, local_bridge, bridge_comm, remote_bridge, tag, &newcomm)
- local_comm: local intracommunicator (handle)
- local_bridge: rank of a distinguished (bridge-head) process in local_comm (integer)
- bridge_comm: peer communicator containing both bridge heads (e.g. MPI_COMM_WORLD); over it local_comm is connected to the remote group in the newly built intercommunicator newcomm
- remote_bridge: rank of the remote group's bridge-head process in bridge_comm

Communication between the groups
The function uses point-to-point communication with the specified tag between the two processes defined as bridge heads.
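A minimal, self-contained sketch (my own illustration, complementing the example on the next slides, which only builds and frees intercommunicators): the world is split into an even and an odd group, an intercommunicator is created between them, and the two bridge heads exchange one integer. Note that destination and source ranks refer to the remote group.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int wrank, lrank, value = 0;
    MPI_Comm localcomm, intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    /* two disjoint groups: even and odd world ranks (needs >= 2 processes) */
    int color = wrank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, wrank, &localcomm);

    /* bridge heads: world rank 0 (even group) and world rank 1 (odd group) */
    int remote_bridge = (color == 0) ? 1 : 0;
    MPI_Intercomm_create(localcomm, 0, MPI_COMM_WORLD, remote_bridge,
                         99, &intercomm);

    MPI_Comm_rank(intercomm, &lrank);   /* rank within the local group */

    /* destination/source ranks refer to the remote group */
    if (color == 0 && lrank == 0) {
        value = 17;
        MPI_Send(&value, 1, MPI_INT, 0, 0, intercomm);   /* to rank 0 of the odd group */
    } else if (color == 1 && lrank == 0) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, intercomm, MPI_STATUS_IGNORE);
        printf("odd-group leader received %d from the even group\n", value);
    }

    MPI_Comm_free(&intercomm);
    MPI_Comm_free(&localcomm);
    MPI_Finalize();
    return 0;
}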

Example

int main(int argc, char **argv)
{
    MPI_Comm myComm;        /* intra-communicator of the local sub-group */
    MPI_Comm myFirstComm;   /* inter-communicator */
    MPI_Comm mySecondComm;  /* second inter-communicator (group 1 only) */
    int memberKey, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* User code must generate memberKey in the range [0, 1, 2] */
    memberKey = rank % 3;

    /* Build intra-communicator for the local sub-group */
    MPI_Comm_split(MPI_COMM_WORLD, memberKey, rank, &myComm);

Example (continued)

    /* Build inter-communicators. Tags are hard-coded. */
    if (memberKey == 0)
    {   /* Group 0 communicates with group 1. */
        MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 1,
                             01, &myFirstComm);
    }
    else if (memberKey == 1)
    {   /* Group 1 communicates with groups 0 and 2. */
        MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 0,
                             01, &myFirstComm);
        MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 2,
                             12, &mySecondComm);
    }
    else if (memberKey == 2)
    {   /* Group 2 communicates with group 1. */
        MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 1,
                             12, &mySecondComm);
    }

Example (continued)

    /* Do work ... */

    switch (memberKey)   /* free the communicators each group actually created */
    {
    case 0:
        MPI_Comm_free(&myFirstComm);
        break;
    case 1:
        MPI_Comm_free(&myFirstComm);
        MPI_Comm_free(&mySecondComm);
        break;
    case 2:
        MPI_Comm_free(&mySecondComm);
        break;
    }

    MPI_Finalize();
}

Motivation: Intercommunicator
Used for:
- meta-computing
- cloud computing
- coupling components with low bandwidth between them, e.g. cluster <-> PC
The bridge-head process controls the communication with the remote computer.