Parallel matrix-vector multiplication


Appendix A: Parallel matrix-vector multiplication

The reduced transition matrix of the three-dimensional cage model for gel electrophoresis, described in section 3.2, becomes excessively large for polymer lengths of more than L = 12. Parallel machines often have more memory than commonly used sequential machines such as workstations or PCs, and this memory can be used to solve larger problems. Our task is then to distribute the matrix over the processors such that the problem can be solved as efficiently as possible, hopefully also improving the performance by a factor close to the number of processors used.

A.1 BSP

A bulk synchronous parallel (BSP) program operates by alternating between a phase where all processors simultaneously compute local results and a phase where they communicate with each other. A superstep in a BSP algorithm consists of a computation phase followed by a communication phase. Before and after each communication phase a global synchronization is carried out. The BSPlib library (for the programming language C) [78, 79] consists of only 20 primitives and is based on one-sided communications. One-sided communications, as opposed to two-sided communications, cannot create deadlock situations. The communication mechanisms built into the BSP library are remote write, remote read and bulk synchronous message passing. In all three cases the remote processor is, at least conceptually, passive in the current superstep. The basic communication primitives are summarized below.

Remote write: the processor that executes a put statement copies a block of memory to a remote memory address at the time of the next synchronization.

Remote read: the processor that executes a get statement copies a block of memory from a remote memory address at the time of the next synchronization.

Bulk synchronous message passing: the processor that executes a send statement sends a message, consisting of a tag and a payload part, to the buffer of a remote processor at the time of the next synchronization. The messages can be read from the buffer by a move operation after the next synchronization.

The BSP cost model consists of four parameters: the number of processors p, the speed of the processors s, the communication time g and the synchronization time l. The speed of the processors is measured as the number of floating-point operations per second. The communication time is measured as the average time taken to communicate a single word to a remote processor when all the processors are communicating simultaneously; the unit of time is the time per floating-point operation (flop). The synchronization time is the amount of time needed for all processors to synchronize, also measured in flop time.

As mentioned earlier, a BSP program is either in a computation phase or in a communication phase. This makes predicting the performance of algorithms much easier than in the case of parallel programming models where computation and communication are interleaved in a less structured fashion. The analysis of the cost of a superstep is relatively simple. For each processor $i$ we count the number of flops $w_i$, the number of words sent to other processors $h_i^{(s)}$ and the number of words received $h_i^{(r)}$. The time taken by processor $i$ for computation is $w_i$ and for communication is $h_i = \max(h_i^{(s)}, h_i^{(r)})$. The cost of the superstep is $\max_i(w_i) + \max_i(h_i)\,g + l$. This shows that optimally we should divide the problem to be solved into equal parts, in the sense that the calculations and communications are evenly distributed over the available processors. Of course, we should also take care to reduce the total amount of communication.
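As an illustration of these primitives, the following is a minimal sketch (not taken from the thesis code) of a single superstep in C, assuming the standard BSPlib header bsp.h and its put/sync primitives [78, 79]: every processor writes one double into a registered variable on its right-hand neighbour with bsp_put, and the transfer only becomes visible after the bsp_sync that ends the superstep. The variable names are illustrative.

    #include <stdio.h>
    #include <bsp.h>                      /* BSPlib primitives                  */

    int main(int argc, char **argv)
    {
        bsp_begin(bsp_nprocs());          /* start the SPMD part on all processors */

        int p   = bsp_nprocs();           /* number of processors               */
        int pid = bsp_pid();              /* my processor number                */

        double my_val = (double) pid;     /* local value to communicate         */
        double neighbour_val = -1.0;      /* receives the left neighbour's value */

        /* Make the destination visible for remote writes (one superstep). */
        bsp_push_reg(&neighbour_val, sizeof(double));
        bsp_sync();

        /* Superstep: computation phase (trivial here), then communication phase.
           The put is only guaranteed to have arrived after the next bsp_sync(). */
        bsp_put((pid + 1) % p, &my_val, &neighbour_val, 0, sizeof(double));
        bsp_sync();

        printf("processor %d received %g\n", pid, neighbour_val);

        bsp_pop_reg(&neighbour_val);
        bsp_end();
        return 0;
    }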

A.2 Matrix distribution

A good way to distribute an $n \times n$ dense matrix over p = MN processors is a generalized $M \times N$ block/cyclic distribution: the rows are divided into p row blocks of equal size and the columns into N column blocks of equal size; then the matrix elements $a_{ij}$ are assigned to the processors as follows:
\[
  \phi_0(i) = \Bigl( i \;\mathrm{div}\; \tfrac{n}{p} \Bigr) \bmod M, \qquad
  \phi_1(j) = j \;\mathrm{div}\; \tfrac{n}{N}, \qquad
  a_{ij} \;\longmapsto\; P\bigl(\phi_0(i) + M\,\phi_1(j)\bigr), \tag{A.1}
\]
as shown in figure A.1. The vector elements are best distributed to the same processor as the diagonal of the matrix. Note that for each generalized block/cyclic distribution: all processors have an equally large part of the matrix; each column is distributed over M processors; each row is distributed over N processors; each processor has the same number of submatrices; and each processor has the same number of diagonal elements. This scheme fits within the general Cartesian framework of the work of Bisseling and McColl [80]; it is similar but not identical to the block/cyclic distribution.

Figure A.1: $M \times N$ generalized block/cyclic distribution for matrices on p = MN = 6 processors. The rows have a block-cyclic distribution, with p blocks which are cyclically numbered 0, 1, ..., M-1, 0, 1, ..., and the columns have a block distribution, with N blocks numbered 0, 1, ..., N-1. From left to right: M = 6, N = 1; M = 3, N = 2; M = 2, N = 3; and M = 1, N = 6.

The approach of Bisseling and McColl to the matrix-vector product $r = Ax$ can be divided into four stages:

- fan-out: the elements $x_j$ are communicated to the processors containing the values $a_{ij}$;

- local matrix-vector multiplications: the partial results $u_{it} = \sum_j a_{ij} x_j$ are computed, with the sum taken over only the local values of $a_{ij}$, which all have the same $t = \phi_1(j)$;

- fan-in: the partial results $u_{i,\phi_1(j)}$ of the processors are sent to the processor that possesses the corresponding element $r_i$;

- summation of the partial results: $r_i = \sum_{t=0}^{N-1} u_{it}$.
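The assignment of Eq. (A.1) is easy to express in code. The sketch below is illustrative, not the thesis implementation; it computes the processor that owns element $a_{ij}$, assuming n is divisible by p and by N as in figure A.1.

    #include <stdio.h>

    /* Owner of matrix element a(i,j) under the generalized M x N
       block/cyclic distribution of Eq. (A.1), with p = M*N processors
       and an n x n matrix; n is assumed divisible by p and by N. */
    static int owner(int i, int j, int n, int M, int N)
    {
        int p    = M * N;
        int phi0 = (i / (n / p)) % M;   /* cyclically numbered row block     */
        int phi1 =  j / (n / N);        /* column block                      */
        return phi0 + M * phi1;         /* element goes to P(phi0 + M*phi1)  */
    }

    int main(void)
    {
        /* Print the owner map for M = 3, N = 2, n = 6; this should
           reproduce the second panel of figure A.1. */
        int n = 6, M = 3, N = 2;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                printf("%d ", owner(i, j, n, M, N));
            printf("\n");
        }
        return 0;
    }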

If the matrix is divided into rows (which is the special case N = 1 of our generalized block/cyclic distribution), the fan-in and the summation of partial sums are avoided; this saves some communication, but all processors then have to communicate with all other processors in the fan-out part. On the other hand, if the matrix is divided into columns (M = 1), then the fan-out communication is avoided and the fan-in communication is an all-to-all operation. For the general $M \times N$ distribution, the fan-out is an M-to-M communication and the fan-in an N-to-N communication. The communication then takes $O((M+N)\frac{n}{p})\,g$ time, instead of $O(MN\frac{n}{p})\,g$. The communication is minimal if $M = N = \sqrt{p}$ is used.

For a sparse matrix, the algorithm is adapted to avoid computations and communications involving zero elements: elements $x_j$ are only sent if the corresponding $a_{ij} \neq 0$; partial sums are only computed using products $a_{ij} x_j$ with $a_{ij} \neq 0$; and the partial sums are only sent and summed if they are nonzero.
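For illustration, a minimal sketch of the local multiplication stage with this sparsity filtering is given below; it is hypothetical code, assuming the local nonzeros are stored as triplets with local row and column indices, and the fan-out and fan-in communication supersteps are not shown.

    #include <stddef.h>

    /* One locally stored nonzero a(i,j), with i and j given as local indices
       into the parts of r and x held on (or fanned out to) this processor. */
    typedef struct {
        int    i;    /* local row index    */
        int    j;    /* local column index */
        double a;    /* nonzero value a_ij */
    } nonzero;

    /* Local stage of the sparse matrix-vector product: accumulate the partial
       sums u_i = sum_j a_ij * x_j over the local nonzeros only. */
    void local_sparse_matvec(const nonzero *A, size_t nnz,
                             const double *x, double *u, size_t nrows_local)
    {
        for (size_t i = 0; i < nrows_local; i++)
            u[i] = 0.0;
        for (size_t k = 0; k < nnz; k++)
            u[A[k].i] += A[k].a * x[A[k].j];   /* only nonzero a_ij contribute */
    }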

The next section shows how advantage is taken of the specific sparsity structure of the matrix.

A.3 Exploiting the sparsity structure

In our problem, for L > 12, we cannot afford to store the complete matrix on a single processor, so we need to distribute it over a number of processors. The matrix we have to deal with is sparse and we exploit this in our computations, since we only handle nonzero elements $A_{ij}$. The standard approach to communicating a subset of the elements of a vector is to gather all elements and their global indices in separate arrays, and then to send those arrays to the processors that need them. The overhead of repeatedly sending the same arrays of indices may be removed by sending them only the first time the matrix-vector multiplication is performed, but the overhead of repeatedly packing and unpacking the vector elements cannot be removed in general.

Our transition matrix has a particular structure, with patches containing many nonzero elements. We exploit this to make communications faster by sending contiguous subvectors, avoiding the packing and unpacking overhead. Consider a rectangular patch (i.e., a contiguous submatrix). A value $x_j$ must be sent to the owner of the patch if an element $A_{ij}$ in column j of the patch is nonzero. It is likely that most columns of the patch have at least one nonzero, so we might as well send all $x_j$ for that patch. This makes it possible to send a contiguous subvector of x, which is more efficient than sending separate components; this comes at the expense of a few unnecessary communications. The trade-off can be shifted by increasing or decreasing the patch size.

To find suitable patches, we first divide the state vector into contiguous subvectors. We use a heuristic to partition the matrix into blocks of rows with approximately the same number of nonzeros. If we use P processors, and we want each processor to have K subvectors, we have to divide the vector into KP subvectors. (The factor K is the overpartitioning factor.) This initial division tries to minimize the computation time. Next, we adjust the divisions to reduce communication: a suitable patch in the matrix corresponds to an input subvector of kink representations in which only the last few bits differ, and also to an output subvector with that property. Therefore, we search for a pair of adjacent kink representations that differ in a bit as far to the left as possible; this is a suitable place to split. We try to keep the distance from the starting point as small as possible.
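A possible form of this adjustment step is sketched below. This is illustrative code, assuming the kink representations are available as unsigned integers in state-vector order; the search window and the tie-breaking rule are assumptions, not details taken from the thesis.

    #include <stddef.h>

    /* Position (counted from bit 0) of the highest bit in which a and b
       differ; returns -1 if a == b. */
    static int highest_differing_bit(unsigned long a, unsigned long b)
    {
        unsigned long d = a ^ b;
        int bit = -1;
        while (d != 0) { bit++; d >>= 1; }
        return bit;
    }

    /* Adjust an initial split position `start` (from the nonzero-balancing
       heuristic) to a nearby position where the adjacent kink representations
       differ in a bit as far to the left as possible.  Candidates further than
       `window` positions away are not considered, so the computational balance
       is only perturbed slightly; ties favour the smallest distance. */
    static size_t adjust_split(const unsigned long *kink, size_t nstates,
                               size_t start, size_t window)
    {
        size_t best = start;
        int    best_bit = -1;

        for (size_t q = (start > window ? start - window : 1);
             q <= start + window && q < nstates; q++) {
            int bit = highest_differing_bit(kink[q - 1], kink[q]);
            size_t dist      = (q > start) ? q - start : start - q;
            size_t best_dist = (best > start) ? best - start : start - best;
            if (bit > best_bit || (bit == best_bit && dist < best_dist)) {
                best_bit = bit;
                best     = q;
            }
        }
        return best;   /* the vector is split between states best-1 and best */
    }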

As an example of the structure of the reduced transition matrices and the division into submatrices, we show the nonzero structure of the matrix for L = 5 in figure A.2 and its corresponding communication matrix in figure A.3 (left). The communication matrix is built from the partitioned transition matrix by considering each submatrix as a single element. It is a sparse matrix of much smaller size, which determines the communication requirements. Our communication matrix for L = 13 is given in figure A.3 (right).

Figure A.2: Reduced transition matrix for polymer length L = 5. The size of the matrix is $37 \times 37$ and it has 233 nonzero elements, shown as black squares. To the left of each row is the corresponding kink representation written as a binary number, with black circles denoting 1 and open ones 0. The horizontal lines on the left show the initial division of the reduced state vector into eight contiguous parts, optimized to balance the number of nonzeros in the corresponding matrix rows. The jumps of these lines indicate slight adjustments to make the division fit the nonzero structure of the matrix. The resulting vector division induces a division of the rows and columns of the matrix, and hence a partitioning into 64 submatrices, shown by the gray checkerboard pattern. Complete submatrices are now assigned to the processors of a parallel computer.

Figure A.3: Communication matrix for L = 5 (left) and L = 13 (right). Note that the matrix for L = 5 can be obtained by replacing each nonempty submatrix in figure A.2 by a single nonzero element. The communication matrix for L = 13, of size $320 \times 320$, is distributed over 16 processors in a row distribution.

A.4 Timings

Our computations were performed on a Cray T3E computer. The peak performance of a single node of the Cray T3E is 600 Mflop/s for computations. The bsp_probe benchmark shows a performance of 47 Mflop/s per node [78]. The peak interprocessor bandwidth is 500 Mbyte/s (bidirectional). The bsp_probe benchmark shows a sustained bidirectional performance of 94 Mbyte/s per processor when all 64 processors communicate at the same time. This is equivalent to a BSP parameter g = 3.8, where g is the cost, in flop time units, of one 64-bit word leaving or entering a processor. The measured global synchronization time for 64 processors is 48 µs, which is equivalent to l = 2259 flop time units.
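These flop-unit values follow, up to rounding, from the measured rates. The small check below is illustrative and uses only the rounded numbers quoted above, so it reproduces g and l approximately rather than exactly.

    #include <stdio.h>

    int main(void)
    {
        double s         = 47.0e6;    /* measured computation rate, flop/s       */
        double bandwidth = 94.0e6;    /* sustained rate per processor, byte/s    */
        double word      = 8.0;       /* one 64-bit word, in bytes               */
        double sync_time = 48.0e-6;   /* measured global synchronization time, s */

        double g = s * word / bandwidth;   /* flops that fit in the time of one word */
        double l = s * sync_time;          /* flops that fit in one synchronization  */

        /* Prints roughly g = 4.0 and l = 2256, close to the quoted
           g = 3.8 and l = 2259 (the quoted values were measured directly). */
        printf("g = %.1f flop units, l = %.0f flop units\n", g, l);
        return 0;
    }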

Table A.1 presents the execution time of one iteration of the algorithm in two forms: the BSP cost a + bg + cl counts the flops and the communications and thus gives the time on an arbitrary computer with BSP parameters g and l, whereas the time in milliseconds gives the measured time on this particular architecture, split into computation and communication time. (The total measured synchronization time is negligible.)

     L    P   BSP cost                          time (ms)    efficiency   speedup
    12    8     545 156 +    64 716 g + 2l       47 +   4.3      85%         6.8
    13   16   1 002 824 +   187 347 g + 2l       89 +  13        81%        13.0
    14   32   1 836 920 +   425 152 g + 2l      169 +  44        73%        23.4
    15   64   3 452 776 + 1 380 415 g + 2l      330 + 112        67%        42.9

Table A.1: BSP cost, time, efficiency, and speedup for one matrix-vector multiplication.

The BSP cost can be used to predict the run time of our algorithm on different architectures. Table A.1 also gives the efficiency and speedup relative to a sequential program. Peak computation performance is often only reached for dense matrix-matrix multiplication; the performance for sparse matrix-vector multiplication is always much lower. Comparing the flop count and the measured computation time for the largest problem, L = 15, we see that we achieve about 10.5 Mflop/s per processor. Comparing the communication count with the measured communication time, we obtain a g-value of 0.081 µs per word (or g = 3.8 flop units; see above). This means that we attain the maximum sustainable communication speed. This is due to the design of our algorithm, which communicates contiguous subvectors instead of single components. Furthermore, the results show that our choice to optimize mainly the computation (by choosing a row distribution) is justified for this architecture: the communication time is always less than a third of the total time. For a different machine, with a higher value of g, more emphasis must be placed on optimizing the communication, leading to a two-dimensional distribution.

Each iteration of our computation contains one matrix-vector multiplication. The number of iterations needed for convergence depends on the length of the polymer and on the applied electric field. The iteration was stopped when either the accuracy was better than $10^{-10}$, or the number of iterations exceeded 100 000. In the latter case, the accuracy was computed at termination. Typically, for L = 15 and a low electric field strength, 50 000 iterations are needed, taking about 6 hours per data point. Only computed values with accuracy $10^{-4}$ or better are shown in figure 5.3. For L = 12, we compared the output of the parallel program with that of the sequential program and found the difference to be within rounding errors. The total speedup for L = 15, compared to a naive implementation (for which one would need 38.5 Tbyte of memory), is a factor $1.5 \times 10^6$: a factor of 17 248 by using a reduced state space, a factor of 2 by shifting the eigenvalues of the reduced transition matrix, and a factor of 42.9 by using a parallel program on 64 processors.
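As a worked example of reading Table A.1, the snippet below (illustrative only) converts the BSP cost for L = 15 into milliseconds using the rates quoted above: the achieved sparse matrix-vector rate of 10.5 Mflop/s per processor for the computation term and 0.081 µs per word for the communication term. This reproduces the measured 330 ms of computation and 112 ms of communication, and confirms that the synchronization term is negligible.

    #include <stdio.h>

    int main(void)
    {
        /* BSP cost of one matrix-vector multiplication for L = 15 (Table A.1):
           a + b*g + c*l with a = 3 452 776 flops, b = 1 380 415 words, c = 2. */
        double a = 3452776.0, b = 1380415.0, c = 2.0;

        double flop_rate = 10.5e6;     /* achieved sparse matvec rate, flop/s */
        double word_time = 0.081e-6;   /* measured time per 64-bit word, s    */
        double sync_time = 48.0e-6;    /* global synchronization time, s      */

        double t_comp = a / flop_rate;    /* ~0.33 s, cf. 330 ms measured      */
        double t_comm = b * word_time;    /* ~0.11 s, cf. 112 ms measured      */
        double t_sync = c * sync_time;    /* ~0.1 ms, negligible               */

        printf("computation     %.0f ms\n", 1e3 * t_comp);
        printf("communication   %.0f ms\n", 1e3 * t_comm);
        printf("synchronization %.2f ms\n", 1e3 * t_sync);
        return 0;
    }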