COSC 6374 Parallel Computation. Dense Matrix Operations

Edgar Gabriel, Fall

Terminology
- Dense matrix: all elements of the matrix contain relevant values; typically stored as a 2-D array (e.g. double a[16][16]).
- Sparse matrix: most elements of the matrix are zero. Optimized storage techniques:
  - Band matrices: store only the relevant diagonals of the matrix.
  - Highly irregular sparse matrices: store the coordinates of every non-zero element together with the content.
  - Boeing-Harwell format: exploit certain regularities (e.g. a nearly constant number of entries per row or column).
  - Jagged Diagonal storage format: see the Boeing-Harwell format.
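As a small illustration of the dense vs. sparse storage distinction above, the following sketch converts a dense 4x4 array into coordinate storage; the struct name, array sizes and values are illustrative assumptions, not a prescribed format:

    #include <stdio.h>

    /* Dense storage: every element is kept, even the zeros. */
    double dense[4][4] = {
        {1.0, 0.0, 0.0, 2.0},
        {0.0, 3.0, 0.0, 0.0},
        {0.0, 0.0, 4.0, 0.0},
        {5.0, 0.0, 0.0, 6.0},
    };

    /* Coordinate storage for highly irregular sparse matrices:
       only the non-zero entries, each with its (row, col) index. */
    typedef struct { int row, col; double val; } coo_entry;

    int main(void)
    {
        coo_entry coo[16];
        int nnz = 0;

        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                if (dense[i][j] != 0.0) {
                    coo[nnz].row = i;
                    coo[nnz].col = j;
                    coo[nnz].val = dense[i][j];
                    nnz++;
                }

        printf("dense storage: 16 doubles, coordinate storage: %d entries\n", nnz);
        return 0;
    }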

Replication vs. Communication
- Large data items are typically distributed across multiple processes. What is "large"?
- Small data items can be replicated on all processes or communicated whenever required.
- Costs for communication: ~ network latency.
- Costs for replication: ~ memory consumption + ~ repeated computation operations.

Matrix operations: B = c * A (multiplying a matrix with a constant)
- The constant is definitely small and is thus replicated on all processes, e.g. compiled into the code or read from a configuration file.
- The operation does not require any communication to be performed: trivially parallel.
- The operation can be performed independently of the way the matrix has been distributed across the processes.
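A minimal sketch of this trivially parallel case, assuming each process already holds some local portion of A; the function name, the local block size and the constant are illustrative assumptions:

    #include <mpi.h>

    #define NLOCAL 4   /* illustrative size of the local portion */

    /* Each process scales whatever part of A it owns; no communication,
       and the result is correct regardless of how A was distributed. */
    void scale_local(double a_local[NLOCAL][NLOCAL],
                     double b_local[NLOCAL][NLOCAL], double c)
    {
        for (int i = 0; i < NLOCAL; i++)
            for (int j = 0; j < NLOCAL; j++)
                b_local[i][j] = c * a_local[i][j];
    }

    int main(int argc, char **argv)
    {
        double a[NLOCAL][NLOCAL] = {{0}}, b[NLOCAL][NLOCAL];

        MPI_Init(&argc, &argv);
        scale_local(a, b, 2.5);   /* constant replicated on all processes */
        MPI_Finalize();
        return 0;
    }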

Matrix operations: B = A^T (transpose a matrix)
- Often not necessary, since the operations (e.g. matrix-vector multiply) can be (easily) reformulated as matrix-transpose-vector multiply operations and avoid the data transpose.
- Operations requiring the transpose: multi-dimensional FFT.
- Assumption: matrices A, B are square.
- Element B[x][y] should end up on the same process as element A[x][y] -> requires communication across the processes.

B = A^T: one element per process
- Initial data distribution: one element of the matrix per process (processes 0 .. 15 arranged as a 4x4 grid).
- The process with coordinates (x,y) needs to send its data item to the process with coordinates (y,x) and receive its data item from (y,x).

B = A^T: one element per process

    // Assumptions:
    // newcomm has been created using MPI_Cart_create
    // double a, b are the elements of the matrices owned by
    // each process; a is already set.
    int coords[2];      // my coordinates in the 2-D topology
    int rem_coords[2];  // coordinates of my counterpart
    MPI_Request req[2];
    MPI_Status stats[2];

    // Determine my own rank in newcomm
    MPI_Comm_rank (newcomm, &rank);
    // Determine my own coordinates in newcomm
    MPI_Cart_coords (newcomm, rank, ndims, coords);
    // Determine the coordinates of my counterpart
    rem_coords[0] = coords[1];
    rem_coords[1] = coords[0];

    // Determine the rank of my counterpart using his coordinates
    MPI_Cart_rank ( newcomm, rem_coords, &rem_rank);
    // Initiate non-blocking communication to send a
    MPI_Isend ( &a, 1, MPI_DOUBLE, rem_rank, 0, newcomm, &req[0]);
    // Initiate non-blocking communication to receive b
    MPI_Irecv ( &b, 1, MPI_DOUBLE, rem_rank, 0, newcomm, &req[1]);
    // Wait on both non-blocking operations to finish
    MPI_Waitall ( 2, req, stats);

Notes:
- Using non-blocking communication avoids having to schedule the messages to avoid deadlock.
- Processes on the main diagonal send a message to themselves.

B = A^T: column-wise data distribution
- One column per process.
- Element a[i] needs to be sent to process i.
- Element b[i] will be received from process i.

    MPI_Request *reqs;
    MPI_Status *stats;
    int rank, size;
    double a[N], b[N];

    // Determine the number of processes working on the problem
    // and my rank in the communicator
    MPI_Comm_size ( comm, &size);
    MPI_Comm_rank ( comm, &rank);

    // Allocate the required number of Requests and Statuses. Since
    // the code is supposed to work for arbitrary numbers of
    // processors, you can not use static arrays for reqs and stats
    reqs  = (MPI_Request *) malloc ( 2*size*sizeof(MPI_Request) );
    stats = (MPI_Status *)  malloc ( 2*size*sizeof(MPI_Status) );

B = A^T: column-wise data distribution

    // Start all non-blocking communication operations
    for (i=0; i<size; i++ ) {
        MPI_Isend (&a[i], 1, MPI_DOUBLE, i, 0, comm, &reqs[2*i]);
        MPI_Irecv (&b[i], 1, MPI_DOUBLE, i, 0, comm, &reqs[2*i+1]);
    }
    // Wait for all non-blocking operations to finish
    MPI_Waitall ( 2*size, reqs, stats);

Notes:
- Identical approach and code for the row-wise data distribution, as long as the local portions of both a and b are stored as one-dimensional arrays.
- Number of messages: N^2 = np^2.

B = A^T: block column-wise data distribution
- Each process holds N_local columns of each matrix, with N = sum_{i=0}^{np-1} N_local(i), assuming N can be divided evenly onto the np processes (so N = np * N_local).

B = A^T: block column-wise data distribution
- Element a[i][j] has to become element b[j][i] (i, j are global indexes).
- Variable declarations on each process:
    double a[N][N_local];
    double b[N][N_local];
- a[i][j] is located on the process with rank r = j/N_local and has the local indexes [i1][j1] with i1 = i and j1 = j%N_local.
- b[j][i] is located on the process with rank s = i/N_local and has the local indexes [j2][i2] with j2 = j and i2 = i%N_local.

    // Code fragment for the communication
    for ( j1=0; j1<N_local; j1++) {
        for (i=0; i<N; i++ ) {
            dest = i / N_local;
            MPI_Isend ( &(a[i][j1]), 1, MPI_DOUBLE, dest, 0, comm, &reqs[ ]);
        }
    }
    for ( j=0; j<N; j++ ) {
        for ( i2=0; i2<N_local; i2++ ) {
            src = j / N_local;
            MPI_Irecv ( &(b[j][i2]), 1, MPI_DOUBLE, src, 0, comm, &reqs[ ]);
        }
    }
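The owner and local-index mapping stated above can be captured in two small helpers; the function names and the example values are illustrative assumptions:

    #include <stdio.h>

    /* Owner rank and local column index of a global column index
       under a block column-wise distribution with n_local columns per rank. */
    static int owner_rank(int col, int n_local) { return col / n_local; }
    static int local_col(int col, int n_local)  { return col % n_local; }

    int main(void)
    {
        int n_local = 4;      /* illustrative: 4 columns per process */
        int i = 5, j = 10;    /* a global element a[i][j] */

        /* a[i][j] is held by rank j/n_local as a[i][j % n_local] ... */
        printf("a[%d][%d] -> rank %d, local index [%d][%d]\n",
               i, j, owner_rank(j, n_local), i, local_col(j, n_local));
        /* ... and its transposed element b[j][i] by rank i/n_local
           as b[j][i % n_local]. */
        printf("b[%d][%d] -> rank %d, local index [%d][%d]\n",
               j, i, owner_rank(i, n_local), j, local_col(i, n_local));
        return 0;
    }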

B = A^T: block column-wise data distribution
- The algorithm on the previous slide is good because it doesn't require any additional temporary storage.
- The algorithm on the previous slide is bad because:
  - it sends N^2 messages, with N >> np, and the cost of each message is proportional to the network latency for short messages;
  - the matrix has to be traversed in a non-contiguous manner: C stores multi-dimensional arrays in row-major order, so accessing a[0][0] and then a[1][0] means that we jump in main memory and cause a large number of cache misses.

Memory layout of multi-dimensional arrays
[Figure: a 2-D matrix stored row by row (memory layout in C) vs. column by column (memory layout in Fortran).]
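A short, self-contained illustration of the row-major argument (array size and names are arbitrary): consecutive elements of a row are adjacent in memory, while consecutive elements of a column are a full row apart.

    #include <stdio.h>

    #define N 4

    int main(void)
    {
        double a[N][N];
        char *base = (char *) &a[0][0];

        /* In C, a[i][j] and a[i][j+1] are adjacent in memory (row-major),
           while a[i][j] and a[i+1][j] are a full row of N doubles apart. */
        printf("offset of a[0][1]: %zu bytes\n",
               (size_t)((char *) &a[0][1] - base));
        printf("offset of a[1][0]: %zu bytes\n",
               (size_t)((char *) &a[1][0] - base));
        return 0;
    }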

B = A^T: block column-wise data distribution, alternative algorithm
- Each process in reality sends N_local*N_local elements to every other process.
- So send an entire block of N_local*N_local elements at once.
- The block has to be transposed either at the sender or at the receiver.

    // Send the matrix block-by-block; the destination rank and the
    // request index are derived from the block number i/N_local
    for ( i=0; i<N; i+=N_local ) {
        MPI_Isend ( &(a[i][0]), N_local*N_local, MPI_DOUBLE, i/N_local, 0,
                    comm, &reqs[2*(i/N_local)]);
        MPI_Irecv ( &(b[i][0]), N_local*N_local, MPI_DOUBLE, i/N_local, 0,
                    comm, &reqs[2*(i/N_local)+1]);
    }
    MPI_Waitall ( 2*size, reqs, stats);

    // Now transpose each block in place
    for ( i=0; i<N; i+=N_local ) {
        for ( k=0; k<N_local; k++ ) {
            for ( j=k; j<N_local; j++ ) {
                temp      = b[i+k][j];
                b[i+k][j] = b[i+j][k];
                b[i+j][k] = temp;
            }
        }
    }

B = A^T: other 1-D data distributions
- Block row-wise data distribution: algorithm very similar to the block column-wise data distribution.
- Cyclic column-wise data distribution: the process with rank r gets the columns r, r+np, r+2*np, etc. Advantage: none for the matrix transpose operation; for some other operations this data distribution often leads to a better load balance than the block column-wise distribution (a small sketch follows below).
- Cyclic row-wise data distribution.
- Block-cyclic column-wise data distribution.
- Block-cyclic row-wise data distribution.

B = A^T: 2-D data distribution
- Each process holds a block of N_local*N_local elements.
- The 2-D distribution avoids skinny matrices; it is often easier to create load balance than with a 1-D block column/row distribution.
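The sketch referenced above prints which global columns each rank owns under the cyclic column-wise distribution; N, np and the variable names are illustrative assumptions:

    #include <stdio.h>

    /* Rank r owns columns r, r+np, r+2*np, ... under a cyclic
       column-wise distribution. */
    int main(void)
    {
        int N = 12, np = 4;

        for (int r = 0; r < np; r++) {
            printf("rank %d owns columns:", r);
            for (int col = r; col < N; col += np)
                printf(" %d", col);
            printf("\n");
        }
        return 0;
    }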

B = A^T: 2-D data distribution
- Assumption: using a 2-D Cartesian communicator.
- Algorithm (a code sketch follows after the matrix-vector example below):
  - determine your rank using MPI_Comm_rank;
  - determine your coordinates using MPI_Cart_coords;
  - determine the coordinates of your communication partner by reverting the x and y coordinates of your own coordinates;
  - determine the rank of your communication partner using MPI_Cart_rank;
  - send a block of N_local*N_local elements to the communication partner;
  - receive a block of N_local*N_local elements from the communication partner;
  - transpose the block that has been received.
- The algorithm combines techniques from the one-element-per-process distribution and the block column-wise distribution.

c = A * b: block row-wise distribution, replicating the vector b

    double a[nlocal][n], b[n];
    double c[nlocal], cglobal[n];
    int i, j;

    for (i=0; i<nlocal; i++) {
        for ( j=0; j<n; j++ ) {
            c[i] = c[i] + a[i][j]*b[j];
        }
    }
    MPI_Allgather( c, nlocal, MPI_DOUBLE, cglobal, nlocal,
                   MPI_DOUBLE, MPI_COMM_WORLD );
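The following is a minimal sketch of the 2-D distributed transpose outlined in the algorithm list above. It assumes the number of processes forms a square 2-D grid and that each process holds one NLOCAL x NLOCAL block, and it uses MPI_Sendrecv in place of the Isend/Irecv pair for brevity, so it is an illustration rather than the exact code of the slides.

    #include <mpi.h>

    #define NLOCAL 4   /* illustrative local block size */

    /* Exchange the local block with the process at the mirrored (y,x)
       coordinates and transpose the received block locally. */
    void transpose_block_2d(double a[NLOCAL][NLOCAL], double b[NLOCAL][NLOCAL],
                            MPI_Comm cartcomm)
    {
        int rank, rem_rank, coords[2], rem_coords[2];
        double temp[NLOCAL][NLOCAL];

        MPI_Comm_rank(cartcomm, &rank);              /* own rank */
        MPI_Cart_coords(cartcomm, rank, 2, coords);  /* own coordinates */

        rem_coords[0] = coords[1];                   /* partner = (y,x) */
        rem_coords[1] = coords[0];
        MPI_Cart_rank(cartcomm, rem_coords, &rem_rank);

        /* Send my block, receive the partner's block */
        MPI_Sendrecv(a, NLOCAL*NLOCAL, MPI_DOUBLE, rem_rank, 0,
                     temp, NLOCAL*NLOCAL, MPI_DOUBLE, rem_rank, 0,
                     cartcomm, MPI_STATUS_IGNORE);

        /* Transpose the received block */
        for (int i = 0; i < NLOCAL; i++)
            for (int j = 0; j < NLOCAL; j++)
                b[i][j] = temp[j][i];
    }

    int main(int argc, char **argv)
    {
        int size, dims[2] = {0, 0}, periods[2] = {0, 0};
        MPI_Comm cartcomm;
        double a[NLOCAL][NLOCAL] = {{0}}, b[NLOCAL][NLOCAL];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Dims_create(size, 2, dims);   /* assumes size is a perfect square */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cartcomm);

        transpose_block_2d(a, b, cartcomm);

        MPI_Comm_free(&cartcomm);
        MPI_Finalize();
        return 0;
    }

The MPI_Sendrecv call plays the same role as the Isend/Irecv/Waitall sequence used in the one-element-per-process version; processes on the main diagonal again exchange data with themselves.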

c = A * b: block row-wise distribution
Why replicate the vector?
- The memory requirement is O(N), with N being the size of the vector, in contrast to O(N^2) for the matrix or other higher-dimensional arrays.
- It increases the performance of the matrix-vector multiply operation.
Why do we need the Allgather at the end?
- Most applications require uniform treatment of similar objects: e.g. if one vector is replicated, all vectors should be replicated.
- If the result vector is used in a subsequent operation, you would otherwise need different implementations in the code depending on whether the vector is distributed or replicated.

c = A * b: block column-wise distribution

    int main( int argc, char **argv) {
        double a[n][nlocal], b[nlocal];
        double c[n], ct[n];
        int i, j;

        for (i=0; i<n; i++) {
            for ( j=0; j<nlocal; j++ ) {
                ct[i] = ct[i] + a[i][j]*b[j];
            }
        }
        MPI_Allreduce ( ct, c, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
    }

c = A * b: block column-wise distribution
Why not replicate the vector in this distribution?
- There is no benefit in doing that for this operation.
- There might be other operations in the code that mandate it, though.
But the result vector is replicated?
- Sure, the algorithm mandates that.
- You can still drop the elements that you don't need afterwards.

C = A * B: matrix-matrix multiply, block column-wise data distribution
- Example for 2 processes and a 4x4 matrix.
- The example uses global indices for the matrices.
[Figure: matrices A, B and C, each split into a block of columns for rank=0 and a block for rank=1.]

C = A * B: block column-wise distribution
1st step: each process calculates a part of the result elements and stores it in matrix C.
[Figure: partial sums for all 16 elements of the 4x4 result matrix.]

C = A * B: block column-wise distribution
2nd step: processes 0 and 1 swap their portions of matrix A; matrices B and C are unchanged.
[Figure: the two block columns of matrix A after the swap between rank=0 and rank=1.]

C = A * B: block column-wise distribution
Finish the matrix multiply operation.
[Figure: the completed sums for all 16 elements of the 4x4 result matrix.]

C = A * B: block column-wise distribution
Generalization for np processes.
[Figure: A, B and C split into block columns 0 .. np-1; in iteration it = 0, 1, ..., np-1 each process multiplies the block column of A it currently holds with its local columns of B and accumulates into its local columns of C, while the block columns of A rotate among the processes.]

C = A * B: block column-wise distribution
- np-1 steps are required to give every process access to the entire matrix A.
- The algorithm does not require a process to hold the entire matrix A at any point in time.
- A final shift operation is required in order for each process to get its original portion of matrix A back.
- Communication between the processes often uses a ring, e.g. process x sends to x-1 and receives from x+1, with a special case for processes 0 and np-1.
- A temporary buffer is needed if the matrix is sent and received simultaneously.

    // nlocalcols: no. of columns held by a process
    // nrows:      no. of rows of the matrix
    // np:         number of processes
    // rank:       rank of this process
    sendto   = rank-1;
    recvfrom = rank+1;
    if ( rank == 0 )    sendto   = np-1;
    if ( rank == np-1 ) recvfrom = 0;

    MPI_Isend ( a,    nrows*nlocalcols, MPI_DOUBLE, sendto,   0, comm, &req[0]);
    MPI_Irecv ( temp, nrows*nlocalcols, MPI_DOUBLE, recvfrom, 0, comm, &req[1]);
    MPI_Waitall ( 2, req, statuses );

    // Copy the data from the temporary buffer back into a
    memcpy (a, temp, nrows*nlocalcols*sizeof(double));

C = A * B: block column-wise distribution
- A mapping of global to local indices is required, since a C data structure can not start at an arbitrary value but has to start at index 0.
- We need to know from which process the currently held portion of A originates in order to know which elements of matrix B to use.
- The mapping depends on the direction of the ring communication.

    // nlocalcols: no. of columns held by a process
    // nrows:      no. of rows of the matrix
    // np:         number of processes
    // rank:       rank of this process
    for ( it=0; it < np; it++ ) {
        offset = ((rank+it)%np) * nlocalcols;
        for (i=0; i<nrows; i++) {
            for ( j=0; j<nlocalcols; j++ ) {
                for (k=0; k<nlocalcols; k++) {
                    C[i][j] += a[i][k] * b[offset+k][j];
                }
            }
        }
        // Communication as shown on the previous slides
    }

C = A * B: block column-wise distribution
- Alternative communication pattern for the block column-wise distribution: in iteration it, the process with rank=it broadcasts its portion of matrix A to all processes.
- The mapping of global to local indices is a bit simpler.
- The communication costs are higher than for the ring communication.

C = A * B: block row-wise distribution
- Similar algorithm as for the block column-wise distribution, e.g. in the 1st step each process computes partial sums from its local rows.
[Figure: partial sums for the 4x4 example, split between rank=0 and rank=1.]
- The 2nd step is omitted here; the only difference to the block column-wise distribution is that matrix B is rotated among the processes, and the mapping of local to global indices is relevant for matrix A.
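A sketch of the broadcast-based variant, written as a code fragment in the same style as the ring version above; it assumes square matrices stored row-major as 1-D arrays with nrows = np * nlocalcols, and the function and variable names are illustrative assumptions:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Broadcast-based block column-wise matrix-matrix multiply sketch.
       a, b, c are the local block columns (nrows x nlocalcols each),
       stored row-major as 1-D arrays; atmp holds the broadcast block. */
    void matmul_bcast(const double *a, const double *b, double *c,
                      int nrows, int nlocalcols, int np, int rank, MPI_Comm comm)
    {
        double *atmp = malloc((size_t)nrows * nlocalcols * sizeof(double));

        for (int it = 0; it < np; it++) {
            /* Rank 'it' provides its portion of A to everybody */
            if (rank == it)
                memcpy(atmp, a, (size_t)nrows * nlocalcols * sizeof(double));
            MPI_Bcast(atmp, nrows * nlocalcols, MPI_DOUBLE, it, comm);

            /* The broadcast block holds the global columns starting at
               it*nlocalcols, which select the matching global rows of B */
            int offset = it * nlocalcols;
            for (int i = 0; i < nrows; i++)
                for (int j = 0; j < nlocalcols; j++)
                    for (int k = 0; k < nlocalcols; k++)
                        c[i*nlocalcols + j] +=
                            atmp[i*nlocalcols + k] * b[(offset + k)*nlocalcols + j];
        }
        free(atmp);
    }

Note how the global-to-local mapping becomes simpler: the offset depends only on the broadcasting rank it, not on the direction of a ring.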

C = A * B: 2-D data distribution
- Both matrices need to be rotated among the processes:
  - only processes holding a portion of the same rows of matrix A need to rotate amongst each other;
  - only processes holding a portion of the same columns of matrix B need to rotate amongst each other.

C = A * B: 2-D data distribution
- E.g. for the 2nd step, assuming a 2-D process topology:
  - matrix A is communicated in a ring to the left neighbor;
  - matrix B is communicated in a ring to the upper neighbor.

C = A * B: 2-D data distribution
Cannon's algorithm for square matrices:
- set up a 2-D process topology;
- determine nlocalcols and nlocalrows for each process;
- perform an initial shift operation such that each process can multiply its local submatrices (see next slide);
- for i = 0; i < number of processes in a row (or column); i++:
  - calculate the local part of the matrix-matrix multiply operation;
  - send the local portion of A to the left neighbor, receive the next portion of A from the right neighbor;
  - send the local portion of B to the upper neighbor, receive the next portion of B from the lower neighbor.

Initial assignment of matrices A and B (blocks labeled by their global block indices):

    A:  (0,0) (0,1) (0,2) (0,3)        B:  (0,0) (0,1) (0,2) (0,3)
        (1,0) (1,1) (1,2) (1,3)            (1,0) (1,1) (1,2) (1,3)
        (2,0) (2,1) (2,2) (2,3)            (2,0) (2,1) (2,2) (2,3)
        (3,0) (3,1) (3,2) (3,3)            (3,0) (3,1) (3,2) (3,3)

Initial shift of matrices A and B such that:
- matrix A is shifted by i processes to the left for processes in the i-th row of the process topology;
- matrix B is shifted by j processes up for processes in the j-th column of the process topology.

    A:  (0,0) (0,1) (0,2) (0,3)        B:  (0,0) (1,1) (2,2) (3,3)
        (1,1) (1,2) (1,3) (1,0)            (1,0) (2,1) (3,2) (0,3)
        (2,2) (2,3) (2,0) (2,1)            (2,0) (3,1) (0,2) (1,3)
        (3,3) (3,0) (3,1) (3,2)            (3,0) (0,1) (1,2) (2,3)
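A skeleton of Cannon's algorithm following the steps above, written with MPI_Cart_shift and MPI_Sendrecv_replace; the assumption of a square process count, one NLOCAL x NLOCAL block per process, and a periodic Cartesian grid are mine, so this is an illustrative sketch rather than the slides' code:

    #include <mpi.h>

    #define NLOCAL 4   /* illustrative local block size */

    /* Cannon's algorithm skeleton: one NLOCAL x NLOCAL block of A, B and C
       per process on a square q x q Cartesian grid with periodic boundaries. */
    void cannon(double a[NLOCAL][NLOCAL], double b[NLOCAL][NLOCAL],
                double c[NLOCAL][NLOCAL], MPI_Comm grid, int q)
    {
        int rank, coords[2], src, dst;

        MPI_Comm_rank(grid, &rank);
        MPI_Cart_coords(grid, rank, 2, coords);

        /* Initial skew: row i of the grid shifts its A block i steps left,
           column j of the grid shifts its B block j steps up */
        if (coords[0] != 0) {
            MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
            MPI_Sendrecv_replace(a, NLOCAL*NLOCAL, MPI_DOUBLE, dst, 0, src, 0,
                                 grid, MPI_STATUS_IGNORE);
        }
        if (coords[1] != 0) {
            MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
            MPI_Sendrecv_replace(b, NLOCAL*NLOCAL, MPI_DOUBLE, dst, 0, src, 0,
                                 grid, MPI_STATUS_IGNORE);
        }

        for (int it = 0; it < q; it++) {
            /* Local part of the matrix-matrix multiply */
            for (int i = 0; i < NLOCAL; i++)
                for (int j = 0; j < NLOCAL; j++)
                    for (int k = 0; k < NLOCAL; k++)
                        c[i][j] += a[i][k] * b[k][j];

            /* Send A one step left / receive from the right,
               send B one step up / receive from below */
            MPI_Cart_shift(grid, 1, -1, &src, &dst);
            MPI_Sendrecv_replace(a, NLOCAL*NLOCAL, MPI_DOUBLE, dst, 0, src, 0,
                                 grid, MPI_STATUS_IGNORE);
            MPI_Cart_shift(grid, 0, -1, &src, &dst);
            MPI_Sendrecv_replace(b, NLOCAL*NLOCAL, MPI_DOUBLE, dst, 0, src, 0,
                                 grid, MPI_STATUS_IGNORE);
        }
    }

    int main(int argc, char **argv)
    {
        int size, dims[2] = {0, 0}, periods[2] = {1, 1};
        MPI_Comm grid;
        double a[NLOCAL][NLOCAL] = {{0}}, b[NLOCAL][NLOCAL] = {{0}},
               c[NLOCAL][NLOCAL] = {{0}};

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Dims_create(size, 2, dims);   /* assumes a square process count */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);

        cannon(a, b, c, grid, dims[0]);

        MPI_Comm_free(&grid);
        MPI_Finalize();
        return 0;
    }

Compared with the 1-D ring algorithm, each process only ever communicates within its row of the grid (for A) and within its column (for B), which is exactly the restriction stated on the previous slide.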