CS475 Parallel Programming

1 CS475 Parallel Programming: Dense Matrix Multiply
  Wim Bohm, Colorado State University
  Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.

2 Block mapping a matrix onto p PEs
  Blocked / checkerboard: n^2 matrix on p PEs
  Map n/sqrt(p) x n/sqrt(p) blocks onto PEs
  Maps well onto a 2D mesh
  Finest granularity: 1 element per PE (p = n*n)
  Many matrix algorithms allow a block formulation
    Matrix add
    Matrix multiply
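The checkerboard mapping above can be sketched as a small index computation. This is only an illustration: the helper name `block_owner`, the row-major numbering of PEs, and the requirement that sqrt(p) divides n are assumptions, not part of the slides.

```c
#include <assert.h>

/* Hypothetical helper: rank of the PE that owns element (i, j) of an
   n x n matrix under a q x q checkerboard mapping, where q = sqrt(p).
   Assumes q divides n and PEs are numbered row-major (assumption). */
int block_owner(int i, int j, int n, int q)
{
    assert(q > 0 && n % q == 0);
    int bs = n / q;                  /* block edge length n/sqrt(p) */
    return (i / bs) * q + (j / bs);  /* block row * q + block column */
}
```

With n = 8 and q = 2, element (0, 4) falls in block row 0, block column 1, so it belongs to PE 1. With q = n every element has its own PE, which is the finest granularity mentioned above.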

3 n x n Matrix Multiply
    for i = 0 to n-1
      for j = 0 to n-1
        Cij = 0
        for k = 0 to n-1
          Cij += Aik * Bkj
  We do not consider recursive < O(n^3) algorithms (such as Strassen).
  These could be the top level, driving the O(n^3) algorithms described here.

4 n x n Matrix Multiply
    for i = 0 to n-1
      for j = 0 to n-1
        Cij = 0
        for k = 0 to n-1
          Cij += Aik * Bkj
  The outer i,j indices of A and B determine the target C element or block

5 n x n Matrix Multiply
    for i = 0 to n-1
      for j = 0 to n-1
        Cij = 0
        for k = 0 to n-1
          Cij += Aik * Bkj
  A and B elements or blocks are multiplied only if their inner indices (k) are equal
  + is associative and commutative, and * distributes over +:
  the order in which we compute and accumulate the block products does not matter!

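6 n^2 inner products
  Cij = Ai* . B*j
  [Figure: row Ai* of A and column B*j of B combine into element Cij of C]

7 n outer products
  [Figure: column A*k of A and row Bk* of B contribute to all of C]
    for k = 0 to n-1
      forall i in [0,n-1]
        forall j in [0,n-1]
          Cij += Aik * Bkj
  The outer i,j indices of A and B are the target C indices
  The inner indices k of A and B are equal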
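To make the two loop orders concrete, here is a minimal C sketch of the inner-product order (i, j, k) and the outer-product order (k, i, j). The function names and the fixed size `N` are illustrative choices, not from the slides; both functions produce the same C because only the order of accumulation differs.

```c
#include <assert.h>

#define N 3   /* illustrative matrix size (assumption) */

/* Inner-product order: each C[i][j] is completed before the next one. */
void matmul_inner(int C[N][N], int A[N][N], int B[N][N])
{
    assert(C != A && C != B);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

/* Outer-product order: step k adds (column k of A) x (row k of B) to all of C. */
void matmul_outer(int C[N][N], int A[N][N], int B[N][N])
{
    assert(C != A && C != B);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0;
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)      /* the two forall loops are */
            for (int j = 0; j < N; j++)  /* independent for a fixed k */
                C[i][j] += A[i][k] * B[k][j];
}
```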

8 3x3 example: inner product
    A00 A01 A02       B00 B01 B02
    A10 A11 A12   X   B10 B11 B12   =
    A20 A21 A22       B20 B21 B22

    A00 B00 + A01 B10 + A02 B20    A00 B01 + A01 B11 + A02 B21    A00 B02 + A01 B12 + A02 B22
    A10 B00 + A11 B10 + A12 B20    A10 B01 + A11 B11 + A12 B21    A10 B02 + A11 B12 + A12 B22
    A20 B00 + A21 B10 + A22 B20    A20 B01 + A21 B11 + A22 B21    A20 B02 + A21 B12 + A22 B22

9 3x3 example: outer 1
  [Figure: the same 3x3 product; the highlighting did not survive extraction. Presumably outer product 1 (column A*0 times row B0*) supplies the first term Ai0 B0j of every Cij.]

10 3x3 example: outer 2
  [Figure: the same 3x3 product; presumably outer product 2 (column A*1 times row B1*) supplies the second term Ai1 B1j of every Cij.]

11 3x3 example: outer 3
  [Figure: the same 3x3 product; presumably outer product 3 (column A*2 times row B2*) supplies the third term Ai2 B2j of every Cij.]

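12 Blocked outer product
  [Figure: block column A*k of A and block row Bk* of B contribute to block Cij of C]

13 Blocked outer product
  [Figure: next step of the same animation]

14 Blocked outer product
  Notice: the width of an A block row does not need to be equal to the width of a B block column
  [Figure: A*k, Bk* and Cij with differently sized blocks]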

15 Blocked Matrix Multiply
  The standard inner product algorithm can be blocked
  p processors: n/sqrt(p) x n/sqrt(p) sized blocks
  PEij has blocks Aij and Bij and computes block Cij
  Cij needs Aik and Bkj, k = 0 to sqrt(p)-1
  Assuming a blocked data distribution for all three matrices, some form of communication is needed
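A sequential sketch of the blocked formulation may help before the parallel versions: the three outer loops walk over block indices, and the three inner loops do the work one PE would do for one pair of blocks. The sizes `N` and `BS` and the function name are illustrative assumptions; `BS` is assumed to divide `N`.

```c
#include <assert.h>

#define N  4   /* matrix size (illustrative) */
#define BS 2   /* block edge n/sqrt(p); assumed to divide N */

/* Blocked inner-product multiply: block Cij accumulates Aik * Bkj over k. */
void blocked_matmul(int C[N][N], int A[N][N], int B[N][N])
{
    assert(N % BS == 0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0;

    for (int bi = 0; bi < N; bi += BS)          /* block row of C    */
        for (int bj = 0; bj < N; bj += BS)      /* block column of C */
            for (int bk = 0; bk < N; bk += BS)  /* inner block index */
                /* C block (bi,bj) += A block (bi,bk) * B block (bk,bj) */
                for (int i = bi; i < bi + BS; i++)
                    for (int j = bj; j < bj + BS; j++)
                        for (int k = bk; k < bk + BS; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

In the parallel setting, the (bi, bj) pair identifies the PE, and the bk loop is where the A and B blocks owned by other PEs have to be communicated.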

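16 [Figure: block row Ai* of A and block column B*j of B combine into block Cij of C]

17 [Figure: next step of the same animation]

18 [Figure: next step of the same animation, etc.]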

19 Simple Block Matrix Multiply
  All PEs in a row need all row blocks of A
    One block: n^2/p
    All-to-all block broadcast of A in a row of PEs: O(sqrt(p) * (n^2/p))
  All PEs in a column need all column blocks of B
    All-to-all block broadcast of B in a column of PEs: O(sqrt(p) * (n^2/p))
  Compute block Cij in PEij: n^3/p time
  Space use for A and B:
    per PE: 2*sqrt(p)*(n^2/p) = 2n^2/sqrt(p)
    total: 2n^2*sqrt(p), a replication factor of sqrt(p)
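As a worked instance of the space formulas above, take n = 1024 and p = 16. These values are chosen here only to make the arithmetic concrete; they do not come from the slides.

```latex
\begin{align*}
\text{one block} &= n^2/p = 1024^2/16 = 65{,}536 \text{ elements} \\
\text{per PE (A and B)} &= 2\sqrt{p}\,(n^2/p) = 2 \cdot 4 \cdot 65{,}536 = 524{,}288 = 2n^2/\sqrt{p} \\
\text{total} &= p \cdot 2n^2/\sqrt{p} = 2n^2\sqrt{p} = 8{,}388{,}608 \\
\text{replication factor} &= \frac{2n^2\sqrt{p}}{2n^2} = \sqrt{p} = 4
\end{align*}
```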

20 Cannon's Matrix Multiply
  Avoids space overhead
    a processor has not more than 1 A block, 1 B block, and 1 C block
    interleaves block moves and computation
  PEij computes block Cij
  Initial alignment of data
    Circular left shift block Aij by i steps
    Circular up shift block Bij by j steps
  Interleave computation and communication
    Compute: block matrix multiplication
    Communicate: circular shift left A blocks, circular shift up B blocks
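The alignment and shift pattern above can be traced with a sequential simulation. This is a sketch under simplifying assumptions, not an MPI implementation: each "block" is a single int (block size 1, so n = sqrt(p) = `Q`), and the shifts are done by copying arrays rather than by messages. The name `cannon` and the constant `Q` are illustrative.

```c
#include <assert.h>
#include <string.h>

#define Q 4   /* sqrt(p): the simulated PE grid is Q x Q (assumption) */

/* Sequential simulation of Cannon's algorithm with 1x1 blocks. */
void cannon(int C[Q][Q], int A[Q][Q], int B[Q][Q])
{
    int a[Q][Q], b[Q][Q], t[Q][Q];   /* a[i][j], b[i][j]: blocks held by PEij */
    assert(C != A && C != B);

    /* Initial alignment: row i of A shifts left by i, column j of B up by j. */
    for (int i = 0; i < Q; i++)
        for (int j = 0; j < Q; j++) {
            a[i][j] = A[i][(j + i) % Q];
            b[i][j] = B[(i + j) % Q][j];
            C[i][j] = 0;
        }

    for (int step = 0; step < Q; step++) {
        /* Compute: every PE multiplies the two blocks it currently holds. */
        for (int i = 0; i < Q; i++)
            for (int j = 0; j < Q; j++)
                C[i][j] += a[i][j] * b[i][j];

        /* Communicate: circular shift A left by one ... */
        for (int i = 0; i < Q; i++)
            for (int j = 0; j < Q; j++)
                t[i][j] = a[i][(j + 1) % Q];
        memcpy(a, t, sizeof a);

        /* ... and circular shift B up by one. */
        for (int i = 0; i < Q; i++)
            for (int j = 0; j < Q; j++)
                t[i][j] = b[(i + 1) % Q][j];
        memcpy(b, t, sizeof b);
    }
}
```

After alignment PEij holds A[i][(i+j) mod Q] and B[(i+j) mod Q][j], so the inner indices match; each shift advances that common index by one, and Q steps cover every k. The final shift in the loop is redundant and is kept only to keep the loop body uniform.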

21 Initial state: PEij owns blocks ij
    A00 A01 A02 A03     B00 B01 B02 B03
    A10 A11 A12 A13     B10 B11 B12 B13
    A20 A21 A22 A23     B20 B21 B22 B23
    A30 A31 A32 A33     B30 B31 B32 B33

22 Cannon: align blocks so all can compute
    A00 A01 A02 A03     B00 B01
                        B10 B11
                        B20 B21
                        B30 B31
  The first row of A and the first column of B are in the right place, but which B block does A01 need to compute? B11
  So rotate column B*1 up by 1

23 Cannon: align blocks so all can compute
    A00 A01 A02 A03     B00 B11
                        B10 B21
                        B20 B31
                        B30 B01

24 Cannon: align blocks so all can compute
    A00 A01 A02 A03     B00 B11 B02
                        B10 B21 B12
                        B20 B31 B22
                        B30 B01 B32
  Which B block does A02 need to compute? B22
  So rotate column B*2 up by 2

25 Cannon: align blocks so all can compute
    A00 A01 A02 A03     B00 B11 B22
                        B10 B21 B32
                        B20 B31 B02
                        B30 B01 B12

26 Cannon: align blocks so all can compute
    A00 A01 A02 A03     B00 B11 B22 B33
                        B10 B21 B32 B03
                        B20 B31 B02 B13
                        B30 B01 B12 B23
  and rotate column B*3 up by 3

27 Cannon: align blocks so all can compute
    A00 A01 A02 A03     B00 B11 B22 B33
                        B10 B21 B32 B03
                        B20 B31 B02 B13
                        B30 B01 B12 B23
  and now align the A rows with B

28 Cannon: align blocks so all can compute
    A00 A01 A02 A03     B00 B11 B22 B33
    A11 A12 A13 A10     B10 B21 B32 B03
    A22 A23 A20 A21     B20 B31 B02 B13
    A33 A30 A31 A32     B30 B01 B12 B23
  Now all blocks are in the right place to multiply
  For the next step: As cyclic shift left, and Bs cyclic shift up

29 Cannon: align blocks so all can compute
    A01 A02 A03 A00     B10 B21 B32 B03
    A12 A13 A10 A11     B20 B31 B02 B13
    A23 A20 A21 A22     B30 B01 B12 B23
    A30 A31 A32 A33     B00 B11 B22 B33

30 Cannon: align blocks so all can compute
    A02 A03 A00 A01     B20 B31 B02 B13
    A13 A10 A11 A12     B30 B01 B12 B23
    A20 A21 A22 A23     B00 B11 B22 B33
    A31 A32 A33 A30     B10 B21 B32 B03

31 Cannon: align blocks so all can compute
    A03 A00 A01 A02     B30 B01 B12 B23
    A10 A11 A12 A13     B00 B11 B22 B33
    A21 A22 A23 A20     B10 B21 B32 B03
    A32 A33 A30 A31     B20 B31 B02 B13

32 Cost of Cannon's Matrix Multiply
  Initial data alignment
    Aligning A or B: worst distance * size ≈ sqrt(p) * n^2/p
    Total ≈ 2 * sqrt(p) * n^2/p
  Interleave computation and communication
    Compute: total = n^3/p
    Communicate: A blocks circular shift left, B blocks circular shift up
      Total cost = number of shifts * size ≈ 2 * sqrt(p) * n^2/p
  Space: 3n^2/p per PE (A, B and C)

33 Fox's Matrix Multiply
  Also avoids space overhead
    interleaves broadcasts for A blocks, block shifts for B, and computation
  Initial data distribution: standard block
  Initial computation
    Broadcast Aii in row i
    Compute: block matrix multiplication
  for j = 1 to sqrt(p) - 1
    Circular up shift B blocks
    Broadcast Aik block in row i, where k = (j+i) mod sqrt(p)
    Compute: block matrix multiplication and add to C block
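The steps above can be traced with a sequential simulation in the same spirit. Again this is a sketch under simplifying assumptions: each block is a single int (so n = sqrt(p) = `Q`), the row broadcast is modelled by reading one value per row, and the B shift is an array copy. The names `fox` and `Q` are illustrative.

```c
#include <assert.h>
#include <string.h>

#define Q 4   /* sqrt(p): the simulated PE grid is Q x Q (assumption) */

/* Sequential simulation of Fox's algorithm with 1x1 blocks. */
void fox(int C[Q][Q], int A[Q][Q], int B[Q][Q])
{
    int b[Q][Q], t[Q][Q];            /* b[i][j]: B block currently in PEij */
    assert(C != A && C != B);
    memcpy(b, B, sizeof b);          /* standard block distribution */
    for (int i = 0; i < Q; i++)
        for (int j = 0; j < Q; j++)
            C[i][j] = 0;

    for (int step = 0; step < Q; step++) {
        if (step > 0) {
            /* Circular up shift of the B blocks. */
            for (int i = 0; i < Q; i++)
                for (int j = 0; j < Q; j++)
                    t[i][j] = b[(i + 1) % Q][j];
            memcpy(b, t, sizeof b);
        }
        for (int i = 0; i < Q; i++) {
            /* Broadcast A[i][k] along row i, with k = (step + i) mod Q. */
            int a_bcast = A[i][(step + i) % Q];
            /* Compute: multiply and add to the C block. */
            for (int j = 0; j < Q; j++)
                C[i][j] += a_bcast * b[i][j];
        }
    }
}
```

At step 0 the broadcast block is the diagonal block Aii, matching the initial computation. After `step` shifts PEij holds B[(i+step) mod Q][j], which is exactly the block that pairs with the broadcast A[i][(i+step) mod Q].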

34 Initial state: PEij owns blocks ij
    A00 A01 A02 A03     B00 B01 B02 B03
    A10 A11 A12 A13     B10 B11 B12 B13
    A20 A21 A22 A23     B20 B21 B22 B23
    A30 A31 A32 A33     B30 B31 B32 B33

35 Fox: broadcast diagonal block in row
    A00 A01 A02 A03     B00 B01 B02 B03
    A10 A11 A12 A13     B10 B11 B12 B13
    A20 A21 A22 A23     B20 B21 B22 B23
    A30 A31 A32 A33     B30 B31 B32 B33

36 Fox: next diagonal, B rotates up
    A00 A01 A02 A03     B10 B11 B12 B13
    A10 A11 A12 A13     B20 B21 B22 B23
    A20 A21 A22 A23     B30 B31 B32 B33
    A30 A31 A32 A33     B00 B01 B02 B03

37 Fox: next diagonal, B rotates up
    A00 A01 A02 A03     B20 B21 B22 B23
    A10 A11 A12 A13     B30 B31 B32 B33
    A20 A21 A22 A23     B00 B01 B02 B03
    A30 A31 A32 A33     B10 B11 B12 B13

38 Fox: next diagonal, B rotates up
    A00 A01 A02 A03     B30 B31 B32 B33
    A10 A11 A12 A13     B00 B01 B02 B03
    A20 A21 A22 A23     B10 B11 B12 B13
    A30 A31 A32 A33     B20 B21 B22 B23

39 Cost of Fox's Matrix Multiply
  A: sqrt(p) times sqrt(p) broadcasts of blocks sized n^2/p (one-to-sqrt(p)): total volume n^2 (all of A)
  B: sqrt(p) circular shifts
    Each circular shift (nearest neighbor): volume = n^2/p
  Computation time: O(n^3/p)

40 Dekel, Nassimi, Sahni Matrix Multiply
  3D mesh formulation: Z planes have equal k values
    A's columns distributed/replicated over Y planes
    B's rows distributed/replicated over X planes
    lots of data replication
  Do all point-to-point multiplies in parallel
  Collapse: sum reduction up / down the Z planes
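The 3D formulation above can be sketched sequentially by giving every product its own cell of a 3D array. The assumptions here are loud ones: one matrix element per PE (n^3 simulated PEs), an array `P[k][i][j]` standing in for the PE at position (i, j) of Z plane k, and plain loops standing in for the replication and the reduction. The names `dns`, `P`, and `N` are illustrative.

```c
#include <assert.h>

#define N 3   /* matrix size; the simulated mesh is N x N x N (assumption) */

/* Sequential sketch of the Dekel-Nassimi-Sahni idea with one element per PE. */
void dns(int C[N][N], int A[N][N], int B[N][N])
{
    int P[N][N][N];   /* P[k][i][j]: product held by PE (i, j) in Z plane k */
    assert(C != A && C != B);

    /* Replicate and multiply: plane k sees column k of A (copied along j)
       and row k of B (copied along i); all N^3 products are independent. */
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                P[k][i][j] = A[i][k] * B[k][j];

    /* Collapse: sum-reduce along the Z direction into C. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++)
                C[i][j] += P[k][i][j];
        }
}
```

On a real mesh the first loop nest is a single parallel multiply step and the reduction takes a logarithmic number of steps with a tree; the price is the replication of A and B across planes.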

41 Dekel, Nassimi, Sahni Matrix Multiply
  [Figure: 3D mesh; replicate A, replicate B, block multiply Aik * Bkj in plane k, sum reduce up]

42 Sum Reduce 1
  [Figure: 3D mesh; plane 1 holds the products Ai1 * B1j]

43 Sum Reduce k
  [Figure: 3D mesh; plane k holds the products Aik * Bkj, above Ai1 * B1j]

44 Sum Reduce n
  [Figure: 3D mesh; plane n holds the products Ain * Bnj, above Aik * Bkj and Ai1 * B1j]

45 Cij += Ai* B*j
  [Figure: 3D mesh; the per-plane products are summed into Cij]

46 Dekel, Nassimi, Sahni Matrix Multiply
  [Figure: 3D mesh; the result blocks C11 ... Cij ... Cnn above the A and B planes]

47 MPI 2x2 block matrix multiply

48 four processes, four blocks per matrix
    PE 0: A00 B00 C00          PE 1: A01 B01 C01
    PE 2: A10 B10 C10          PE 3: A11 B11 C11

49 exchange rows
    PE 0: A00,A01 B00 C00      PE 1: A00,A01 B01 C01
    PE 2: A10,A11 B10 C10      PE 3: A10,A11 B11 C11

50 exchange columns
    PE 0: A00 A01 B00 B10 C00  PE 1: A00 A01 B01 B11 C01
    PE 2: A10 A11 B00 B10 C10  PE 3: A10 A11 B01 B11 C11

51 multiply
    PE 0: A00 A01 B00 B10 C00  PE 1: A00 A01 B01 B11 C01
    PE 2: A10 A11 B00 B10 C10  PE 3: A10 A11 B01 B11 C11

52 gather
    PE 0: C00 C01 C10 C11

53 multiply a block
    /* block size */
    #define b 8

    /* C += A * B for b x b blocks; the caller must zero C first */
    void multblock(int C[b][b], int A[b][b], int B[b][b])
    {
        int i, j, k;
        for (i = 0; i < b; i++) {
            for (j = 0; j < b; j++) {
                for (k = 0; k < b; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

54 sequential main: initialize
    int main(int argc, char *argv[])
    {
        int i, j, k, ioff, joff;
        int A00[b][b], A01[b][b], A10[b][b], A11[b][b];
        int B00[b][b], B01[b][b], B10[b][b], B11[b][b];
        int C00[b][b], C01[b][b], C10[b][b], C11[b][b];

        /* initialize A, B and C blocks; ioff and joff are the global
           row and column indices of the second block row and column */
        for (i = 0, ioff = b; i < b; i++, ioff++) {
            for (j = 0, joff = b; j < b; j++, joff++) {
                A00[i][j] = i + j;       A01[i][j] = i + joff;
                A10[i][j] = ioff + j;    A11[i][j] = ioff + joff;
                B00[i][j] = i - j;       B01[i][j] = i - joff;
                B10[i][j] = ioff - j;    B11[i][j] = ioff - joff;
                C00[i][j] = 0;           C01[i][j] = 0;
                C10[i][j] = 0;           C11[i][j] = 0;
            }
        }

55 initial A and B
  [Figure: the initial values of A and B]

56 sequential main: compute
        multblock(C00, A00, B00);
        multblock(C00, A01, B10);
        printf("\n C00: "); printblock(C00);

        multblock(C01, A00, B01);
        multblock(C01, A01, B11);
        printf("\n C01: "); printblock(C01);

        multblock(C10, A10, B00);
        multblock(C10, A11, B10);
        printf("\n C10: "); printblock(C10);

        multblock(C11, A10, B01);
        multblock(C11, A11, B11);
        printf("\n C11: "); printblock(C11);

57 MPI code
  all PEs declare all blocks (easiest)
  each PE initializes its A, B and C blocks
  exchange A row blocks
  exchange B col blocks
  compute
  gather C blocks
  I only used blocking block sends and recvs, making sure sends and recvs are correctly ordered

58 PE 0 initialize
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Barrier(MPI_COMM_WORLD);
    switch (my_id) {
    case 0:
        printf("pe0: Init\n");
        /* Initialize A00, B00 and C00 */
        for (i = 0; i < b; i++) {
            for (j = 0, joff = b; j < b; j++, joff++) {
                A00[i][j] = i + j;
                B00[i][j] = i - j;
                C00[i][j] = 0;
            }
        }

59 some exchanges
        /* Row Exchange 0 <--> 1: PE 0 receives first, then sends */
        printf("pe0: <--> PE1: Row Exchange\n");
        MPI_Recv((int *)A01, b*b, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);
        MPI_Send((int *)A00, b*b, MPI_INT, 1, 2, MPI_COMM_WORLD);
        /* Col Exchange 0 <--> 2 */
        printf("pe0: <--> PE2: Col Exchange\n");
        MPI_Recv((int *)B10, b*b, MPI_INT, 2, 3, MPI_COMM_WORLD, &status);
        MPI_Send((int *)B00, b*b, MPI_INT, 2, 4, MPI_COMM_WORLD);

  Row exchange in PE 1:
        /* Row Exchange 0 <--> 1: PE 1 sends first, then receives */
        printf("pe1: <--> PE0: Row Exchange\n");
        MPI_Send((int *)A01, b*b, MPI_INT, 0, 1, MPI_COMM_WORLD);
        MPI_Recv((int *)A00, b*b, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);

60 PE 0 computes
        /* Block Multiply C00 = A00*B00 + A01*B10 */
        multblock(C00, A00, B00);
        multblock(C00, A01, B10);

61 PE 0 gathers
        /* Gather */
        printf("pe0: Gather C01 <-- PE1\n");
        MPI_Recv((int *)C01, b*b, MPI_INT, 1, 5, MPI_COMM_WORLD, &status);
        printf("pe0: Gather C10 <-- PE2\n");
        MPI_Recv((int *)C10, b*b, MPI_INT, 2, 6, MPI_COMM_WORLD, &status);
        printf("pe0: Gather C11 <-- PE3\n");
        MPI_Recv((int *)C11, b*b, MPI_INT, 3, 7, MPI_COMM_WORLD, &status);

62 PE 0 prints
        /* Print */
        printf("pe0: Print blocks\n");
        printf("\n C00: "); printblock(C00);
        printf("\n C01: "); printblock(C01);
        printf("\n C10: "); printblock(C10);
        printf("\n C11: "); printblock(C11);
        printf("\n");
        break;

63 all PEs happy
    EXIT: MPI_Finalize();

Parallelizing The Matrix Multiplication. 6/10/2013 LONI Parallel Programming Workshop

Serial version
  $A_{m \times d} \times B_{d \times n} = C_{m \times n}$, with $c_{i,j} = \sum_{k=1}^{d} a_{i,k}\, b_{k,j}$
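The serial version for rectangular matrices can be sketched in C as follows. The function name `matmul_serial`, the use of `double`, and the row-major flat-array layout are assumptions made for the example; the source only gives the formula.

```c
#include <assert.h>
#include <stddef.h>

/* C (m x n) = A (m x d) * B (d x n); all matrices are row-major flat arrays. */
void matmul_serial(size_t m, size_t d, size_t n,
                   const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < d; k++)   /* inner index runs over d */
                sum += A[i * d + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```

For a 2 x 3 matrix A = [1 2 3; 4 5 6] and a 3 x 2 matrix B = [7 8; 9 10; 11 12], the result is the 2 x 2 matrix [58 64; 139 154].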