COSC 6374 Parallel Computation
Communication Performance Modeling (II)
Edgar Gabriel
Fall 2015

Overview
- Impact of communication costs on speedup
  - Cartesian stencil communication
  - All-to-all communication
- Impact of collective communication operations
- Use case: broadcast based matrix multiply operations for dense matrices
- Double buffering and non-blocking collective operations
- Implementation aspects
Cartesian stencil communication
Assumptions:
- 2-D Cartesian grid, all processes have 4 communication partners
- Communication occurs sequentially: 4 sends + 4 receives
- Problem size: N*N, message length to each neighbor: N/sqrt(p)
Using Hockney's model:
  T_comm(N, p) = 8 * (l + N / (b * sqrt(p)))

Cartesian stencil communication (II)
Speedup: S = T(1) / T(p), p: number of processes
- T(p) = T_comp + T_comm
- T_comp scales with the number of processes, T(1) = T_comp => no communication
Thus:
  S_stencil = T_comp / (T_comp/p + 8 * (l + N / (b * sqrt(p))))
2-D stencil communication
                Gigabit Ethernet    QDR InfiniBand
  Latency       50 µs               1 µs
  Bandwidth     5 MB/s              5 GB/s
Assumptions in the graph:
- T_comp = 100 s
- 1 stencil communication every 2 s

All-to-all communication
Assumptions:
- p processes
- Problem size: N, problem size per process: N/p
- Message length per process pair: N/p^2
- Each process sends p messages and receives p messages sequentially
- Congestion and concurrency of messages not considered
Communication costs using Hockney's model:
  T_comm(N, p) = 2p * (l + N / (p^2 * b)) = 2pl + 2N / (pb)
All-to-all communication
Speedup: S = T(1) / T(p), p: number of processes
- T(p) = T_comp + T_comm
- T_comp scales with the number of processes, T(1) = T_comp => no communication
Thus:
  S_alltoall = T_comp / (T_comp/p + 2pl + 2N/(pb))

All-to-all communication
                Gigabit Ethernet    QDR InfiniBand
  Latency       50 µs               1 µs
  Bandwidth     5 MB/s              5 GB/s
Assumptions in the graph:
- T_comp = 100 s
- 1 all-to-all communication every 2 s
Conclusions
- Communication costs limit the speedup
  - dramatically for all-to-all communication
  - moderately for stencil communication
- The impact of stencil communication gets worse when increasing the number of operations per second
- How to increase the scalability of the application? The communication costs have to go to zero -> non-blocking communication
- Non-blocking collective operations introduced in MPI-3
  - High-level abstraction separating functionality from implementation
  - Non-blocking execution allows to hide communication costs

Collective operations
- Offer a higher-level abstraction for frequently occurring communication patterns
- Separate the desired data movement from the actual implementation
- Allow for numerous optimizations internally
  - e.g. O(p) vs. O(log(p)) algorithms; O(p) linear algorithms are often found in applications not using collective operations - it is simple
  - Hardware topology based algorithms: minimize utilization of the lowest performing network connections
  - Hardware based optimizations: some networking cards have built-in support for some collective operations
O(p) vs. O(log(p)) collective algorithms
[Graph: estimated execution time of a Bcast operation of 1 MB message length using QDR InfiniBand parameters]

Conclusions
- Collective operations are essential for the scalability of applications at large process counts
- Simplify code maintenance and readability
- Reduce communication costs compared to the (trivial) linear algorithms often used by applications that do not use collective operations
- But the communication costs are not zero
Non-blocking broadcast operation
MPI_Ibcast (void *buf, int cnt, MPI_Datatype dt, int root,
            MPI_Comm comm, MPI_Request *req);
- The process with the rank root distributes the data stored in buf to all other processes in the communicator comm
- Initiates the collective communication; completion has to be enforced separately using MPI_Test or MPI_Wait
- The MPI_Request handle allows to uniquely identify the currently ongoing operation

Non-blocking collective operations
- The completion operation only indicates local completion
  - e.g. you do not know whether another process has finished that operation
- Completion of a non-blocking collective operation does not imply completion of another non-blocking individual or collective operation posted earlier
- Multiple non-blocking collective operations can be outstanding on a single communicator
- Unlike point-to-point operations, non-blocking collective operations do not match with blocking collective operations
- All processes must call collective operations (blocking and non-blocking) in the same order per communicator
Non-blocking operations: progression
Problem: how to ensure that non-blocking operations (individual or collective) continue execution in the background?
Two options:
- Using a progress thread: a separate thread executes the non-blocking operation in a blocking manner
  - Problem: the number of threads can be large (e.g. 1,000 non-blocking send operations to different processes)
  - Problem: communication through a separate thread increases the network latency (thread synchronization etc.)
- Regularly invoking the progress engine: send a large message chunk-by-chunk but avoid blocking
  - Progress is only made inside the MPI library

Use case: dense matrix multiply
- Block-column wise data distribution (height * lwidth)
- Assuming square matrices, even data distribution
- Example: 4x4 matrices for 2 processes
[Figure: matrices A, B and C (A * B = C), each split block-column wise; rank 0 holds the first lwidth columns, rank 1 the last lwidth columns]
Broadcast based parallel dense matrix multiply
- In iteration i, the process with rank=i broadcasts its portion of the matrix A to all processes
[Figure: iterations 0 and 1 for two ranks; in iteration 0 rank 0's block of A is broadcast, in iteration 1 rank 1's block]

Overall code structure:

double A[height][lwidth], B[height][lwidth];
double C[height][lwidth], tmp[height][lwidth];

MPI_Comm_size (comm, &size);
MPI_Comm_rank (comm, &rank);
for ( i=0; i<size; i++ ) {
    if ( rank == i ) {
        memcpy (tmp, A, lwidth*height*sizeof(double));
    }
    MPI_Bcast (tmp, lwidth*height, MPI_DOUBLE, i, comm);
    localmatmul ( tmp, B, C, i);
}
Use case: dense matrix multiply
Broadcast based algorithm:
- In iteration it, the process with rank=it broadcasts its portion of the matrix A to all processes
- Multiply the portion of matrix B with the current portion of the matrix A

void localmatmul ( double **X, double **Y, double **Z, int it )
{
    for ( i=0; i<height; i++ ) {
        for ( j=0; j<lwidth; j++ ) {
            for ( k=0; k<lwidth; k++ ) {
                Z[i][j] += X[i][k] * Y[it*lwidth+k][j];
            }
        }
    }
}

Overall code structure
- Alternating sequence of communication and computation
- The MPI_Bcast in every iteration has a different root process
- Communication uses collective operations
  - Allows for optimization
- For a fixed problem size, the speedup will be limited by the communication costs
- Not shown: initialization of C to zero
Double buffering
- Use two buffers for the same variable
  - One used for computation
  - One used for communication
- Allows to overlap communication and computation
  -> not allowed to touch the communication buffer while a non-blocking operation is ongoing
- Buffer pointers are used to alternate the communication and computation buffer in each iteration
- Avoids an additional memcpy operation

Revised code structure:

double tmp[height][localwidth], tmp2[height][localwidth];
double **commbuf=tmp, **compbuf=tmp2;
MPI_Request req;

if ( rank == 0 ) {
    memcpy (compbuf, A, localwidth*height*sizeof(double));
}
MPI_Bcast (compbuf, localwidth*height, MPI_DOUBLE, 0, comm);
for ( i=1; i<size; i++ ) {
    if ( rank == i ) {
        memcpy (commbuf, A, localwidth*height*sizeof(double));
    }
    MPI_Ibcast (commbuf, localwidth*height, MPI_DOUBLE, i, comm, &req);
    localmatmul ( compbuf, B, C, i-1 );      // Perform computations
    MPI_Wait ( &req, MPI_STATUS_IGNORE );    // Wait for completion
    double **tpoint=commbuf;                 // Swap pointers
    commbuf=compbuf;
    compbuf=tpoint;
}
localmatmul ( compbuf, B, C, size-1 );
Revised code structure
- Double buffering overlaps two iterations in the execution
- Requires an additional buffer
- Complicates the code (marginally)
- Communication of iteration 0 can be blocking
  -> cannot continue before it finishes anyway
- Computation is always performed on the data of iteration i-1
- Computation of the last iteration happens after the loop
- Remember the overlap problem? localmatmul needs to be modified as well

Revised code structure:

void localmatmul ( double **X, double **Y, double **Z, int it, MPI_Request *req )
{
    int flag;
    for ( i=0; i<height; i++ ) {
        for ( j=0; j<lwidth; j++ ) {
            for ( k=0; k<lwidth; k++ ) {
                Z[i][j] += X[i][k] * Y[it*lwidth+k][j];
            }
        }
        if ( (i % 4) == 0 ) {
            MPI_Test ( req, &flag, MPI_STATUS_IGNORE );
        }
    }
}
Performance results
- Comparing the blocking and the non-blocking version for two different matrix sizes
- Tests performed on the crill cluster at the University of Houston (16 nodes, 48 cores per node, 2 DDR IB HCAs per node)

Implementation aspects: LibNBC
- Implements non-blocking versions of all MPI collective operations
- Schedule based design: a process-local schedule of p2p operations is created
[Figure: broadcast tree; rank 0 is the root, rank 1 receives from 0 and forwards to ranks 3 and 5]
Pseudocode for the schedule at rank 1:
    NBC_Sched_recv (buf, cnt, dt, 0, sched);
    NBC_Sched_barr (sched);
    NBC_Sched_send (buf, cnt, dt, 3, sched);
    NBC_Sched_barr (sched);
    NBC_Sched_send (buf, cnt, dt, 5, sched);
See http://www.unixer.de/publications/img/hoefler-hlrs-nbc.pdf for more details
Implementation aspects: LibNBC
- Schedule execution is represented as a state machine
- State and schedule are attached to every request
- Schedules might be cached/reused
- Progress is most important for efficient overlap
  - Progression in NBC_Test/NBC_Wait

Other non-blocking collective operations
- Virtually all collective communication operations from MPI-1 have a non-blocking counterpart in MPI-3
  - e.g. MPI_Iscatter, MPI_Igather, MPI_Ireduce, MPI_Iallreduce, MPI_Ialltoall, ...
- New topology collective communications
  - e.g. MPI_Ineighbor_alltoall, MPI_Ineighbor_allgather
- Some communicator constructors
  - e.g. MPI_Comm_idup
Other non-blocking collective operations
Collective operations which do not have a non-blocking counterpart:
- Some communicator creators: there is no MPI_Comm_icreate or MPI_Comm_isplit
- There is no MPI_Cart_icreate or MPI_Dist_graph_icreate
- Collective file I/O operations: there is MPI_File_iread_all, MPI_File_iwrite_all, etc. (non-blocking collective file I/O was added in MPI-3.1)