Parallel Numerical Algorithms


1 Parallel Numerical Algorithms Chapter 3 Dense Linear Systems Section 3.1 Vector and Matrix Products Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 1 / 77

2 Outline 1 BLAS (vector and matrix products: inner product, outer product, matrix-vector product, matrix-matrix product) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 2 / 77

3 Basic Linear Algebra Subprograms Basic Linear Algebra Subprograms (BLAS) are building blocks for many other matrix computations BLAS encapsulate basic operations on vectors and matrices so they can be optimized for particular computer architecture while high-level routines that call them remain portable BLAS offer good opportunities for optimizing utilization of memory hierarchy Generic BLAS are available from netlib, and many computer vendors provide custom versions optimized for their particular systems Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 3 / 77

4 Examples of BLAS
Level  Work   Examples  Function
1      O(n)   daxpy     Scalar times vector plus vector
              ddot      Inner product
              dnrm2     Euclidean vector norm
2      O(n²)  dgemv     Matrix-vector product
              dtrsv     Triangular solve
              dger      Outer product
3      O(n³)  dgemm     Matrix-matrix product
              dtrsm     Multiple triangular solves
              dsyrk     Symmetric rank-k update
γ_1 = effective sec/flop for BLAS 1,  γ_2 = effective sec/flop for BLAS 2,  γ_3 = effective sec/flop for BLAS 3
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 4 / 77
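For concreteness, the routines in the table can be invoked from Python through SciPy's low-level BLAS wrappers. The short sketch below is illustrative only: the scipy.linalg.blas interface, the matrix sizes, and the random data are assumptions, not part of the slides.

```python
import numpy as np
from scipy.linalg import blas  # thin wrappers over an optimized BLAS library

n = 1000
x = np.random.rand(n)
y = np.random.rand(n)
A = np.random.rand(n, n)
B = np.random.rand(n, n)

s = blas.ddot(x, y)           # BLAS 1: inner product, O(n) work
nrm = blas.dnrm2(x)           # BLAS 1: Euclidean vector norm
z = blas.daxpy(x, y, a=2.0)   # BLAS 1: y + 2.0*x
v = blas.dgemv(1.0, A, x)     # BLAS 2: matrix-vector product, O(n^2) work
C = blas.dgemm(1.0, A, B)     # BLAS 3: matrix-matrix product, O(n^3) work
```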

5 BLAS Inner product of two n-vectors x and y given by
  x^T y = Σ_{i=1}^n x_i y_i
Computation of inner product requires n multiplications and n − 1 additions
  M_1 = Θ(n),  Q_1 = Θ(n),  T_1 = Θ(γn)
Effectively as hard as a scalar reduction; can be done via binary or binomial tree summation
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 5 / 77

6 BLAS Partition:
  For i = 1, ..., n, fine-grain task i stores x_i and y_i, and computes their product x_i y_i
Communicate:
  Sum reduction over n fine-grain tasks
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 6 / 77

7 Fine-Grain
  z_i = x_i y_i                                { local scalar product }
  reduce z_i across all tasks i = 1, ..., n    { sum reduction }
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 7 / 77

8 Agglomeration and Mapping
Agglomerate:
  Combine k components of both x and y to form each coarse-grain task, which computes the inner product of these subvectors
  Communication becomes sum reduction over n/k coarse-grain tasks
Map:
  Assign (n/k)/p coarse-grain tasks to each of p processors, for a total of n/p components of x and y per processor
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 8 / 77

9 Coarse-Grain
  z_i = x_[i]^T y_[i]                              { local inner product }
  reduce z_i across all processors i = 1, ..., p   { sum reduction }
[ x_[i] denotes the subvector of x assigned to processor i ]
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 9 / 77
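A minimal mpi4py sketch of this coarse-grain inner product is shown below; the vector length, the random local data, and the use of allreduce (so every process ends up with the result) are illustrative assumptions.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
n = 1 << 20                      # assumed global vector length, divisible by p
rng = np.random.default_rng(comm.Get_rank())

x_local = rng.random(n // p)     # this process's n/p components of x
y_local = rng.random(n // p)     # this process's n/p components of y

z_local = float(x_local @ y_local)        # local inner product x_[i]^T y_[i]
z = comm.allreduce(z_local, op=MPI.SUM)   # sum reduction across all p processes
if comm.Get_rank() == 0:
    print("x^T y =", z)
```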

10 Performance BLAS The parallel costs (S_p, W_p, F_p) for the inner product are given by:
  Computational cost F_p = Θ(n/p) regardless of network
  The latency and bandwidth costs depend on network:
    1-D mesh: S_p, W_p = Θ(p)
    2-D mesh: S_p, W_p = Θ(√p)
    hypercube: S_p, W_p = Θ(log p)
For a hypercube or fully connected network, time is
  T_p = αS_p + βW_p + γF_p = Θ(α log(p) + γn/p)
Efficiency and scaling are the same as for binary tree sum
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 10 / 77

11 Inner product on 1-D Mesh For 1-D mesh, total time is
  T_p = Θ(γn/p + αp)
To determine strong scalability, we set the efficiency constant and solve for p_s:
  const = E_{p_s} = T_1 / (p_s T_{p_s}) = Θ( γn / (γn + αp_s²) ) = Θ( 1 / (1 + (α/γ)p_s²/n) )
which yields p_s = Θ(√((γ/α)n))
1-D mesh weakly scalable to p_w = Θ((γ/α)n) processors:
  E_{p_w}(p_w n) = Θ( 1 / (1 + (α/γ)p_w²/(p_w n)) ) = Θ( 1 / (1 + (α/γ)p_w/n) )
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 11 / 77

12 Inner product on 2-D Mesh For 2-D mesh, total time is
  T_p = Θ(γn/p + α√p)
To determine strong scalability, we set the efficiency constant and solve for p_s:
  const = E_{p_s} = T_1 / (p_s T_{p_s}) = Θ( γn / (γn + αp_s^{3/2}) ) = Θ( 1 / (1 + (α/γ)p_s^{3/2}/n) )
which yields p_s = Θ((γ/α)^{2/3} n^{2/3})
2-D mesh weakly scalable to p_w = Θ((γ/α)²n²), since
  E_{p_w}(p_w n) = Θ( 1 / (1 + (α/γ)p_w^{3/2}/(p_w n)) ) = Θ( 1 / (1 + (α/γ)√p_w/n) )
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 12 / 77

13 BLAS Outer product of two n-vectors x and y is the n × n matrix Z = xy^T whose (i, j) entry is z_ij = x_i y_j
Computation of the outer product requires n² multiplications, so
  M_1 = Θ(n²),  Q_1 = Θ(n²),  T_1 = Θ(γn²)
(in this case, we should treat M_1 as the output size, or define the problem as in the BLAS: Z = Z_input + xy^T)
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 13 / 77

14 BLAS Partition:
  For i, j = 1, ..., n, fine-grain task (i, j) computes and stores z_ij = x_i y_j, yielding a 2-D array of n² fine-grain tasks
  Assuming no replication of data, at most 2n fine-grain tasks store components of x and y, say either
    for some j, task (i, j) stores x_i and task (j, i) stores y_i, or
    task (i, i) stores both x_i and y_i, i = 1, ..., n
Communicate:
  For i = 1, ..., n, task that stores x_i broadcasts it to all other tasks in ith task row
  For j = 1, ..., n, task that stores y_j broadcasts it to all other tasks in jth task column
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 14 / 77

15 Fine-Grain Tasks and Communication Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 15 / 77

16 Fine-Grain
  broadcast x_i to tasks (i, k), k = 1, ..., n    { horizontal broadcast }
  broadcast y_j to tasks (k, j), k = 1, ..., n    { vertical broadcast }
  z_ij = x_i y_j                                  { local scalar product }
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 16 / 77

17 Agglomeration BLAS Agglomerate With n n array of fine-grain tasks, natural strategies are 2-D: Combine k k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k) 2 coarse-grain tasks 1-D column: Combine n fine-grain tasks in each column into coarse-grain task, yielding n coarse-grain tasks 1-D row: Combine n fine-grain tasks in each row into coarse-grain task, yielding n coarse-grain tasks Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 17 / 77

18 2-D Agglomeration BLAS Each task that stores portion of x must broadcast its subvector to all other tasks in its task row Each task that stores portion of y must broadcast its subvector to all other tasks in its task column Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 18 / 77

19 2-D Agglomeration BLAS Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 19 / 77

20 1-D Agglomeration BLAS If either x or y stored in one task, then broadcast required to communicate needed values to all other tasks If either x or y distributed across tasks, then multinode broadcast required to communicate needed values to other tasks Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 20 / 77

21 1-D Column Agglomeration Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 21 / 77

22 1-D Row Agglomeration Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 22 / 77

23 Mapping BLAS Map 2-D: Assign (n/k) 2 /p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating target network as 2-D mesh 1-D: Assign n/p coarse-grain tasks to each of p processors using any desired mapping, treating target network as 1-D mesh Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 23 / 77

24 2-D Agglomeration with Block Mapping Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 24 / 77

25 1-D Column Agglomeration with Block Mapping Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 25 / 77

26 1-D Row Agglomeration with Block Mapping Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 26 / 77

27 Coarse-Grain
  broadcast x_[i] to ith process row       { horizontal broadcast }
  broadcast y_[j] to jth process column    { vertical broadcast }
  Z_[i][j] = x_[i] y_[j]^T                 { local outer product }
[ Z_[i][j] denotes the submatrix of Z assigned to process (i, j) by the mapping ]
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 27 / 77
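The following hedged mpi4py sketch follows the coarse-grain algorithm above, under the assumptions that p is a perfect square, that the diagonal process (i, i) initially owns x_[i] and y_[i], and that the block size nb stands in for n/√p.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))
assert q * q == p, "sketch assumes a square q x q process grid"

cart = comm.Create_cart([q, q])
i, j = cart.Get_coords(cart.Get_rank())
row_comm = cart.Sub([False, True])   # processes in row i; ranks ordered by column index
col_comm = cart.Sub([True, False])   # processes in column j; ranks ordered by row index

nb = 4                               # assumed block size n/sqrt(p)
if i == j:                           # diagonal process (i, i) owns x_[i] and y_[i]
    x_blk = np.arange(nb, dtype='d') + i * nb
    y_blk = 2.0 * x_blk
else:
    x_blk = np.empty(nb, dtype='d')
    y_blk = np.empty(nb, dtype='d')

row_comm.Bcast(x_blk, root=i)        # horizontal broadcast of x_[i] along process row i
col_comm.Bcast(y_blk, root=j)        # vertical broadcast of y_[j] along process column j
Z_blk = np.outer(x_blk, y_blk)       # local outer product Z_[i][j] = x_[i] y_[j]^T
```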

28 Performance BLAS The parallel costs (S_p, W_p, F_p) for the outer product are given by:
  Computational cost F_p = Θ(n²/p) regardless of network
  The latency and bandwidth costs can be derived from the cost of broadcast:
    1-D agglomeration: S_p = Θ(log p), W_p = Θ(n)
    2-D agglomeration: S_p = Θ(log p), W_p = Θ(n/√p)
For 1-D agglomeration, execution time is
  T_p^{1-D} = T_p^{bcast}(n) + Θ(γn²/p) = Θ(α log(p) + βn + γn²/p)
For 2-D agglomeration, execution time is
  T_p^{2-D} = 2 T_p^{bcast}(n/√p) + Θ(γn²/p) = Θ(α log(p) + βn/√p + γn²/p)
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 28 / 77

29 Strong Scaling 1-D agglomeration is strongly scalable to
  p_s = Θ(min( (γ/α)n² / log((γ/α)n²), (γ/β)n ))
processors, since
  E^{1-D}_{p_s} = Θ( 1 / (1 + (α/γ) log(p_s) p_s/n² + (β/γ) p_s/n) )
2-D agglomeration is strongly scalable to
  p_s = Θ(min( (γ/α)n² / log((γ/α)n²), (γ/β)²n² ))
processors, since
  E^{2-D}_{p_s} = Θ( 1 / (1 + (α/γ) log(p_s) p_s/n² + (β/γ) √p_s/n) )
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 29 / 77

30 Weak Scaling 1-D agglomeration is weakly scalable to
  p_w = Θ(min( 2^{(γ/α)n²}, (γ/β)²n² ))
processors, since
  E^{1-D}_{p_w}(√p_w · n) = Θ( 1 / (1 + (α/γ) log(p_w)/n² + (β/γ) √p_w/n) )
2-D agglomeration is weakly scalable to
  p_w = Θ( 2^{(γ/α)n²} )
processors, since
  E^{2-D}_{p_w}(√p_w · n) = Θ( 1 / (1 + (α/γ) log(p_w)/n² + (β/γ)/n) )
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 30 / 77

31 Memory Requirements The memory requirements are dominated by storing Z, which in practice means the outer product is a poor primitive (local flop-to-byte ratio of 1)
If possible, Z should be operated on as it is computed, e.g. if we really need v_i = Σ_j f(x_i y_j) for some scalar function f
If Z does not need to be stored, vector storage dominates:
  Without further modification, the 1-D algorithm requires M_p = Θ(np) overall storage for the vector
  Without further modification, the 2-D algorithm requires M_p = Θ(n√p) overall storage for the vector
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 31 / 77
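As a small serial illustration of operating on Z as it is computed, the sketch below evaluates v_i = Σ_j f(x_i y_j) over column blocks of Z so that only an n × b slice of Z exists at any time; the block size b and the test data are arbitrary assumptions.

```python
import numpy as np

def fused_outer_reduce(x, y, f, b=256):
    """Compute v[i] = sum_j f(x[i] * y[j]) without storing the full n x n matrix Z."""
    v = np.zeros_like(x)
    for s in range(0, y.size, b):
        Z_slice = np.outer(x, y[s:s + b])   # only an n x b slice of Z = x y^T at a time
        v += f(Z_slice).sum(axis=1)
    return v

x = np.random.rand(1000)
y = np.random.rand(1000)
v = fused_outer_reduce(x, y, np.square)
```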

32 Consider matrix-vector product y = Ax, where A is an n × n matrix and x and y are n-vectors
Components of vector y are given by the inner products
  y_i = Σ_{j=1}^n a_ij x_j,   i = 1, ..., n
The sequential memory, work, and time are
  M_1 = Θ(n²),  Q_1 = Θ(n²),  T_1 = Θ(γn²)
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 32 / 77

33 BLAS Partition:
  For i, j = 1, ..., n, fine-grain task (i, j) stores a_ij and computes a_ij x_j, yielding a 2-D array of n² fine-grain tasks
  Assuming no replication of data, at most 2n fine-grain tasks store components of x and y, say either
    for some j, task (j, i) stores x_i and task (i, j) stores y_i, or
    task (i, i) stores both x_i and y_i, i = 1, ..., n
Communicate:
  For j = 1, ..., n, task that stores x_j broadcasts it to all other tasks in jth task column
  For i = 1, ..., n, sum reduction over ith task row gives y_i
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 33 / 77

34 Fine-Grain Tasks and Communication (figure: 6 × 6 array of fine-grain tasks with matrix entries a_11 through a_66) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 34 / 77

35 Fine-Grain
  broadcast x_j to tasks (k, j), k = 1, ..., n     { vertical broadcast }
  y_i = a_ij x_j                                   { local scalar product }
  reduce y_i across tasks (i, k), k = 1, ..., n    { horizontal sum reduction }
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 35 / 77

36 Agglomeration BLAS Agglomerate With n n array of fine-grain tasks, natural strategies are 2-D: Combine k k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k) 2 coarse-grain tasks 1-D column: Combine n fine-grain tasks in each column into coarse-grain task, yielding n coarse-grain tasks 1-D row: Combine n fine-grain tasks in each row into coarse-grain task, yielding n coarse-grain tasks Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 36 / 77

37 2-D Agglomeration BLAS Subvector of x broadcast along each task column Each task computes local matrix-vector product of submatrix of A with subvector of x Sum reduction along each task row produces subvector of result y Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 37 / 77

38 2-D Agglomeration BLAS (figure: 2-D agglomeration of the 6 × 6 array of tasks for entries a_11 through a_66) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 38 / 77

39 1-D Agglomeration BLAS
1-D column agglomeration:
  Each task computes the product of its component of x times its column of the matrix, with no communication required
  Sum reduction across tasks then produces y
1-D row agglomeration:
  If x stored in one task, then broadcast required to communicate needed values to all other tasks
  If x distributed across tasks, then multinode broadcast required to communicate needed values to other tasks
  Each task computes inner product of its row of A with entire vector x to produce its component of y
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 39 / 77

40 1-D Column Agglomeration (figure: 1-D column agglomeration of the 6 × 6 array of tasks for entries a_11 through a_66) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 40 / 77

41 1-D Row Agglomeration (figure: 1-D row agglomeration of the 6 × 6 array of tasks for entries a_11 through a_66) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 41 / 77

42 1-D Agglomeration BLAS Column and row algorithms are dual to each other:
  Column algorithm begins with communication-free local vector scaling (daxpy) computations, whose partial results are then combined across processors by a reduction
  Row algorithm begins with a broadcast, followed by communication-free local inner-product (ddot) computations
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 42 / 77

43 Mapping BLAS Map 2-D: Assign (n/k) 2 /p coarse-grain tasks to each of p processes using any desired mapping in each dimension, treating target network as 2-D mesh 1-D: Assign n/p coarse-grain tasks to each of p processes using any desired mapping, treating target network as 1-D mesh Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 43 / 77

44 2-D Agglomeration with Block Mapping (figure: block mapping of the 6 × 6 matrix a_11 through a_66 onto a 2-D process grid) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 44 / 77

45 1-D Column Agglomeration with Block Mapping (figure: block mapping of the 6 × 6 matrix a_11 through a_66 onto processes by columns) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 45 / 77

46 1-D Row Agglomeration with Block Mapping (figure: block mapping of the 6 × 6 matrix a_11 through a_66 onto processes by rows) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 46 / 77

47 Coarse-Grain
  broadcast x_[j] to jth process column     { vertical broadcast }
  y_[i] = A_[i][j] x_[j]                    { local matrix-vector product }
  reduce y_[i] across ith process row       { horizontal sum reduction }
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 47 / 77
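A hedged mpi4py sketch of this coarse-grain matrix-vector product follows, assuming a square process grid, that the diagonal process (j, j) owns x_[j], that y_[i] is reduced to process (i, i), and an illustrative block size.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))
assert q * q == p, "sketch assumes a square q x q process grid"

cart = comm.Create_cart([q, q])
i, j = cart.Get_coords(cart.Get_rank())
row_comm = cart.Sub([False, True])   # process row i; ranks ordered by column index
col_comm = cart.Sub([True, False])   # process column j; ranks ordered by row index

nb = 4                               # assumed block size n/sqrt(p)
A_blk = np.random.rand(nb, nb)       # local block A_[i][j]

# diagonal process (j, j) owns x_[j]; broadcast it down column j
x_blk = np.random.rand(nb) if i == j else np.empty(nb)
col_comm.Bcast(x_blk, root=j)        # vertical broadcast of x_[j]

y_part = A_blk @ x_blk               # local matrix-vector product A_[i][j] x_[j]
y_blk = np.empty(nb) if i == j else None
row_comm.Reduce(y_part, y_blk, op=MPI.SUM, root=i)   # horizontal sum reduction to (i, i)
```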

48 Performance BLAS The parallel costs (S_p, W_p, F_p) for the matrix-vector product are:
  Computational cost F_p = Θ(n²/p) regardless of network
  The latency and bandwidth costs can be derived from the cost of broadcast:
    1-D agglomeration: S_p = Θ(log p), W_p = Θ(n)
    2-D agglomeration: S_p = Θ(log p), W_p = Θ(n/√p)
For 1-D row agglomeration, execution time is
  T_p^{1-D} = T_p^{bcast}(n) + Θ(γn²/p) = Θ(α log(p) + βn + γn²/p)
For 2-D agglomeration, using T^{bcast} = T^{reduce}, time is
  T_p^{2-D} = 2 T_p^{bcast}(n/√p) + Θ(γn²/p) = Θ(α log(p) + βn/√p + γn²/p)
So scalability is essentially the same as for outer product
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 48 / 77

49 Consider matrix-matrix product C = AB, where A, B, and result C are n × n matrices
Entries of matrix C are given by
  c_ij = Σ_{k=1}^n a_ik b_kj,   i, j = 1, ..., n
Each of n² entries of C requires n multiply-add operations, so model serial time as T_1 = γn³
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 49 / 77
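For reference, a plain triple-loop sketch (illustrative only) that performs exactly the n multiply-add operations per entry counted by this serial time model:

```python
import numpy as np

def matmul_naive(A, B):
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):               # n multiply-adds per entry c_ij
                C[i, j] += A[i, k] * B[k, j]
    return C
```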

50 Matrix-matrix product can be viewed as n 2 inner products, or sum of n outer products, or n matrix-vector products and each viewpoint yields different algorithm One way to derive parallel algorithms for matrix-matrix product is to apply parallel algorithms already developed for inner product, outer product, or matrix-vector product However, considering the problem as a whole yields the best algorithms Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 50 / 77

51 BLAS Partition:
  For i, j, k = 1, ..., n, fine-grain task (i, j, k) computes product a_ik b_kj, yielding a 3-D array of n³ fine-grain tasks
  Assuming no replication of data, at most 3n² fine-grain tasks store entries of A, B, or C, say
    task (i, j, j) stores a_ij, task (i, j, i) stores b_ij, and task (i, j, k) stores c_ij for i, j = 1, ..., n and some fixed k
  We refer to subsets of tasks along i, j, and k dimensions as rows, columns, and layers, respectively, so kth column of A and kth row of B are stored in kth layer of tasks
(figure: 3-D array of fine-grain tasks with axes i, j, k)
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 51 / 77

52 BLAS Communicate Broadcast entries of jth column of A horizontally along each task row in jth layer Broadcast entries of ith row of B vertically along each task column in ith layer For i, j = 1,..., n, result c ij is given by sum reduction over tasks (i, j, k), k = 1,..., n Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 52 / 77

53 Fine-Grain Algorithm
  broadcast a_ik to tasks (i, q, k), q = 1, ..., n     { horizontal broadcast }
  broadcast b_kj to tasks (q, j, k), q = 1, ..., n     { vertical broadcast }
  c_ij = a_ik b_kj                                     { local scalar product }
  reduce c_ij across tasks (i, j, q), q = 1, ..., n    { lateral sum reduction }
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 53 / 77

54 Agglomeration BLAS Agglomerate: With n × n × n array of fine-grain tasks, natural strategies are
  3-D: Combine q × q × q subarray of fine-grain tasks
  2-D: Combine q × q × n subarray of fine-grain tasks, eliminating sum reductions
  1-D column: Combine n × 1 × n subarray of fine-grain tasks, eliminating vertical broadcasts and sum reductions
  1-D row: Combine 1 × n × n subarray of fine-grain tasks, eliminating horizontal broadcasts and sum reductions
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 54 / 77

55 Mapping BLAS Map Corresponding mapping strategies are 3-D: Assign (n/q) 3 /p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating target network as 3-D mesh 2-D: Assign (n/q) 2 /p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating target network as 2-D mesh 1-D: Assign n/p coarse-grain tasks to each of p processors using any desired mapping, treating target network as 1-D mesh Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 55 / 77

56 Agglomeration with Block Mapping (figure: block mappings for 1-D row, 1-D column, 2-D, and 3-D agglomerations) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 56 / 77

57 Coarse-Grain 3-D
  broadcast A_[i][k] to ith processor row       { horizontal broadcast }
  broadcast B_[k][j] to jth processor column    { vertical broadcast }
  C_[i][j] = A_[i][k] B_[k][j]                  { local matrix product }
  reduce C_[i][j] across processor layers       { lateral sum reduction }
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 57 / 77
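Below is a hedged mpi4py sketch of the 3-D coarse-grain algorithm. The initial placement is an assumption made for illustration only: A_[i][k] starts on the j = 0 face of the grid, B_[k][j] on the i = 0 face, and C_[i][j] is reduced to the k = 0 face; p is assumed to be a perfect cube and nb stands in for n/p^{1/3}.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** (1.0 / 3.0)))
assert q ** 3 == p, "sketch assumes a q x q x q processor grid"

cart = comm.Create_cart([q, q, q])
i, j, k = cart.Get_coords(cart.Get_rank())
j_comm = cart.Sub([False, True, False])   # vary j: used to broadcast A_[i][k]
i_comm = cart.Sub([True, False, False])   # vary i: used to broadcast B_[k][j]
k_comm = cart.Sub([False, False, True])   # vary k: used to reduce C_[i][j]

nb = 4                                    # assumed block size n/p^(1/3)
A_blk = np.random.rand(nb, nb) if j == 0 else np.empty((nb, nb))   # A_[i][k] on j = 0 face
B_blk = np.random.rand(nb, nb) if i == 0 else np.empty((nb, nb))   # B_[k][j] on i = 0 face

j_comm.Bcast(A_blk, root=0)               # broadcast A_[i][k] along the j dimension
i_comm.Bcast(B_blk, root=0)               # broadcast B_[k][j] along the i dimension
C_part = A_blk @ B_blk                    # local matrix product
C_blk = np.empty((nb, nb)) if k == 0 else None
k_comm.Reduce(C_part, C_blk, op=MPI.SUM, root=0)   # lateral sum reduction over layers
```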

58 Agglomeration with Block Mapping For a 2 × 2 blocking,
  [ A_11 A_12 ] [ B_11 B_12 ]   [ A_11 B_11 + A_12 B_21   A_11 B_12 + A_12 B_22 ]
  [ A_21 A_22 ] [ B_21 B_22 ] = [ A_21 B_11 + A_22 B_21   A_21 B_12 + A_22 B_22 ]
(the slide shows this same block product under 2-D, 1-D column, and 1-D row agglomerations of the blocks)
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 58 / 77

59 Coarse-Grain 2-D
  allgather A_[i][j] in ith processor row        { horizontal broadcast }
  allgather B_[i][j] in jth processor column     { vertical broadcast }
  C_[i][j] = 0
  for k = 1, ..., √p
    C_[i][j] = C_[i][j] + A_[i][k] B_[k][j]      { sum local products }
  end
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 59 / 77
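A hedged mpi4py sketch of this allgather-based 2-D algorithm is given below (square process grid, illustrative block size, random local blocks); note how each process buffers all √p blocks of its row of A and its column of B, which is the memory cost the SUMMA variant on the next slide avoids.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))
assert q * q == p, "sketch assumes a square q x q process grid"

cart = comm.Create_cart([q, q])
i, j = cart.Get_coords(cart.Get_rank())
row_comm = cart.Sub([False, True])     # process row i
col_comm = cart.Sub([True, False])     # process column j

nb = 4                                 # assumed block size n/sqrt(p)
A_blk = np.random.rand(nb, nb)         # local block A_[i][j]
B_blk = np.random.rand(nb, nb)         # local block B_[i][j]

A_row = np.empty((q, nb, nb))          # holds A_[i][0..q-1] after the allgather
B_col = np.empty((q, nb, nb))          # holds B_[0..q-1][j] after the allgather
row_comm.Allgather(A_blk, A_row)       # gather all blocks of A in processor row i
col_comm.Allgather(B_blk, B_col)       # gather all blocks of B in processor column j

C_blk = np.zeros((nb, nb))
for k in range(q):
    C_blk += A_row[k] @ B_col[k]       # C_[i][j] += A_[i][k] B_[k][j]
```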

60 SUMMA Algorithm BLAS
Algorithm just described requires excessive memory, since each process accumulates √p blocks of both A and B
One way to reduce memory requirements is to
  broadcast blocks of A successively across processor rows, and
  broadcast blocks of B successively across processor columns
  C_[i][j] = 0
  for k = 1, ..., √p
    broadcast A_[i][k] in ith processor row       { horizontal broadcast }
    broadcast B_[k][j] in jth processor column    { vertical broadcast }
    C_[i][j] = C_[i][j] + A_[i][k] B_[k][j]       { sum local products }
  end
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 60 / 77
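The same setup expressed as a SUMMA-style loop of broadcasts, again as an mpi4py sketch with an assumed grid shape and block size; only one block of A and one block of B is buffered per step.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))
assert q * q == p, "sketch assumes a square q x q process grid"

cart = comm.Create_cart([q, q])
i, j = cart.Get_coords(cart.Get_rank())
row_comm = cart.Sub([False, True])         # process row i; ranks ordered by column index
col_comm = cart.Sub([True, False])         # process column j; ranks ordered by row index

nb = 4                                     # assumed block size n/sqrt(p)
A_blk = np.random.rand(nb, nb)             # local block A_[i][j]
B_blk = np.random.rand(nb, nb)             # local block B_[i][j]
C_blk = np.zeros((nb, nb))

for k in range(q):
    A_k = A_blk.copy() if j == k else np.empty((nb, nb))
    B_k = B_blk.copy() if i == k else np.empty((nb, nb))
    row_comm.Bcast(A_k, root=k)            # broadcast A_[i][k] across processor row i
    col_comm.Bcast(B_k, root=k)            # broadcast B_[k][j] down processor column j
    C_blk += A_k @ B_k                     # accumulate A_[i][k] B_[k][j]
```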

61 SUMMA Algorithm BLAS (figure: SUMMA broadcast pattern for blocks of A and B on a 4 × 4 grid of 16 CPUs) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 61 / 77

62 Cannon Algorithm BLAS Another approach, due to Cannon (1969), is to circulate blocks of B vertically and blocks of A horizontally in ring fashion Blocks of both matrices must be initially aligned using circular shifts so that correct blocks meet as needed Requires less memory than SUMMA and replaces line broadcasts with shifts, lowering the number of messages Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 62 / 77
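A hedged mpi4py sketch of Cannon's algorithm on a periodic (torus) process grid follows; the block size and the random local blocks are assumptions, and the circular shifts are expressed with the Cartesian Shift and Sendrecv_replace operations.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))
assert q * q == p, "sketch assumes a square q x q process grid"

# periodic (torus) grid so that shifts wrap around
cart = comm.Create_cart([q, q], periods=[True, True])
i, j = cart.Get_coords(cart.Get_rank())

nb = 4                                     # assumed block size n/sqrt(p)
A_blk = np.random.rand(nb, nb)             # local block A_[i][j]
B_blk = np.random.rand(nb, nb)             # local block B_[i][j]
C_blk = np.zeros((nb, nb))

# initial alignment: shift A left by i positions, B up by j positions
src, dst = cart.Shift(1, -i)               # dimension 1 = column index
cart.Sendrecv_replace(A_blk, dest=dst, source=src)
src, dst = cart.Shift(0, -j)               # dimension 0 = row index
cart.Sendrecv_replace(B_blk, dest=dst, source=src)

for _ in range(q):
    C_blk += A_blk @ B_blk                 # multiply currently aligned blocks
    src, dst = cart.Shift(1, -1)           # shift A one position left
    cart.Sendrecv_replace(A_blk, dest=dst, source=src)
    src, dst = cart.Shift(0, -1)           # shift B one position up
    cart.Sendrecv_replace(B_blk, dest=dst, source=src)
```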

63 Cannon Algorithm BLAS Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 63 / 77

64 Fox Algorithm BLAS It is possible to mix techniques from SUMMA and Cannon s algorithm: circulate blocks of B in ring fashion vertically along processor columns step by step so that each block of B comes in conjunction with appropriate block of A broadcast at that same step This algorithm is due to Fox et al. Shifts, especially in Cannon s algorithm, are harder to generalize to nonsquare processor grids than collectives in algorithms like SUMMA Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 64 / 77

65 Execution Time for 3-D Agglomeration For 3-D agglomeration, computing each of p blocks C_[i][j] requires a matrix-matrix product of two (n/p^{1/3}) × (n/p^{1/3}) blocks, so
  F_p = (n/p^{1/3})³ = n³/p
On a 3-D mesh, each broadcast or reduction takes time
  T^{bcast}_{p^{1/3}}((n/p^{1/3})²) = O(α log p + βn²/p^{2/3})
Total time is therefore
  T_p = O(α log p + βn²/p^{2/3} + γn³/p)
However, overall memory footprint is
  M_p = Θ(p(n/p^{1/3})²) = Θ(p^{1/3}n²)
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 65 / 77

66 Strong Scaling of 3-D Agglomeration The 3-D agglomeration efficiency is given by
  E_p(n) = T_1(n) / (p T_p(n)) = O( 1 / (1 + (α/γ)p log p/n³ + (β/γ)p^{1/3}/n) )
For strong scaling to p_s processors we need E_{p_s}(n) = Θ(1), so 3-D agglomeration strong scales to
  p_s = O(min( (γ/α)n³/log(n), (γ/β)n³ ))
processors
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 66 / 77

67 Weak Scaling of 3-D Agglomeration For weak scaling (with constant input size per processor) to p_w processors, we need E_{p_w}(n√p_w) = Θ(1), which holds
However, note that the algorithm is not memory-efficient, as M_p = Θ(p^{1/3}n²); if keeping the memory footprint per processor constant, we must grow n with p^{1/3}
Scaling with constant memory footprint per processor (M_p/p = const) is possible to p_m processors, where E_{p_m}(n p_m^{1/3}) = Θ(1), which holds for p_m = Θ(2^{(γ/α)n³}) processors
The isoefficiency function is Q(p) = Θ(p log(p))
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 67 / 77

68 Execution Time for 2-D Agglomeration For 2-D agglomeration (SUMMA), computation of each block C_[i][j] requires √p matrix-matrix products of (n/√p) × (n/√p) blocks, so
  F_p = √p (n/√p)³ = n³/p
For broadcasts among rows and columns of the processor grid, communication time is
  2√p T^{bcast}_{√p}(n²/p) = Θ(α√p log(p) + βn²/√p)
Total time is therefore
  T_p = O(α√p log(p) + βn²/√p + γn³/p)
The algorithm is memory-efficient: M_p = Θ(n²)
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 68 / 77

69 Strong Scaling of 2-D Agglomeration The 2-D agglomeration efficiency is given by
  E_p(n) = T_1(n) / (p T_p(n)) = O( 1 / (1 + (α/γ)p^{3/2} log p/n³ + (β/γ)√p/n) )
For strong scaling to p_s processors we need E_{p_s}(n) = Θ(1), so 2-D agglomeration strong scales to
  p_s = O(min( (γ/α)n²/log(n)^{2/3}, (γ/β)n² ))
processors
For weak scaling to p_w processors with n²/p matrix elements per processor, we need E_{p_w}(√p_w · n) = Θ(1), so 2-D agglomeration (SUMMA) weak scales to
  p_w = O(2^{(γ/α)n³})
processors
Cannon's algorithm achieves unconditional weak scalability
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 69 / 77

70 Scaling for 1-D Agglomeration For 1-D agglomeration on a 1-D mesh, total time is
  T_p = O(α log(p) + βn² + γn³/p)
The corresponding efficiency is
  E_p = O( 1 / (1 + (α/γ)p log(p)/n³ + (β/γ)p/n) )
Strong scalability is possible to at most p_s = O((γ/β)n) processors
Weak scalability is possible to at most p_w = O((γ/β)²n²) processors
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 70 / 77

71 Rectangular Matrix Multiplication If C is m × n, A is m × k, and B is k × n, choosing a 3-D grid that optimizes the volume-to-surface-area ratio yields bandwidth cost
  W_p(m, n, k) = O( min_{p_1 p_2 p_3 = p} [ mk/(p_1 p_2) + kn/(p_1 p_3) + mn/(p_2 p_3) ] )
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 71 / 77

72 Communication vs. Memory Tradeoff Communication cost for 2-D algorithms for matrix-matrix product is optimal, assuming no replication of storage
If explicit replication of storage is allowed, then lower communication volume is possible via a 3-D algorithm
Generally, we assign an n/p_1 × n/p_2 × n/p_3 computation subvolume to each processor
The largest face of the subvolume gives the communication cost; the smallest face gives the minimal memory usage, since we can keep the smallest face local while successively computing slices of the subvolume
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 72 / 77

73 Leveraging Additional Memory in Matrix Multiplication Provided M total available memory, 2-D and 3-D algorithms achieve bandwidth cost
  W_p(n, M) = n²/√p       if M = Θ(n²)
  W_p(n, M) = n²/p^{2/3}  if M = Θ(n²p^{1/3})
  (the problem does not fit if M < n²)
For general M, it is possible to pick the processor grid to achieve
  W_p(n, M) = O( n³/(√p M^{1/2}) + n²/p^{2/3} )
and an overall execution time of
  T_p(n, M) = O( α(log p + n³√p/M^{3/2}) + βW_p(n, M) + γn³/p )
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 73 / 77

74 Strong Scaling using All Available Memory The efficiency of the algorithm for a given amount of total memory M_p is
  E_p(n, M_p) = O( 1 / (1 + (α/γ)(p log p/n³ + p^{3/2}/M_p^{3/2}) + (β/γ)(√p/M_p^{1/2} + p^{1/3}/n)) )
For strong scaling assuming M_p = p M_1, we need
  E_{p_s}(n, p_s M_1) = T_1(n, M_1) / (p_s T_{p_s}(n, p_s M_1)) = Θ(1)
In this case, we obtain
  p_s = Θ(min( (γ/α)n³/log(n), (γ/β)n³ ))
as good as the 3-D algorithm, but more memory-efficient
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 74 / 77

75 References BLAS
R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar, A three-dimensional approach to parallel matrix multiplication, IBM J. Res. Dev. 39, 1995
J. Berntsen, Communication efficient matrix multiplication on hypercubes, Parallel Comput. 12, 1989
J. Demmel, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger, Communication-optimal parallel recursive rectangular matrix multiplication, IPDPS, 2013
J. W. Demmel, M. T. Heath, and H. A. van der Vorst, Parallel numerical linear algebra, Acta Numerica 2, 1993
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 75 / 77

76 References BLAS
R. Dias da Cunha, A benchmark study based on the parallel computation of the vector outer-product A = uv^T operation, Concurrency: Practice and Experience 9, 1997
G. C. Fox, S. W. Otto, and A. J. G. Hey, Matrix algorithms on a hypercube I: matrix multiplication, Parallel Comput. 4:17-31, 1987
D. Irony, S. Toledo, and A. Tiskin, Communication lower bounds for distributed-memory matrix multiplication, J. Parallel Distrib. Comput. 64, 2004
S. L. Johnsson, Communication efficient basic linear algebra computations on hypercube architectures, J. Parallel Distrib. Comput. 4(2), 1987
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 76 / 77

77 References BLAS
S. L. Johnsson, Minimizing the communication time for matrix multiplication on multiprocessors, Parallel Comput. 19, 1993
B. Lipshitz, Communication-avoiding parallel recursive algorithms for matrix multiplication, Tech. Rept. UCB/EECS, University of California at Berkeley, 2013
O. McBryan and E. F. Van de Velde, Matrix and vector operations on hypercube parallel processors, Parallel Comput. 5, 1987
R. A. Van De Geijn and J. Watts, SUMMA: Scalable universal matrix multiplication algorithm, Concurrency: Practice and Experience 9(4), 1997
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 77 / 77


More information

12 Dynamic Programming (2) Matrix-chain Multiplication Segmented Least Squares

12 Dynamic Programming (2) Matrix-chain Multiplication Segmented Least Squares 12 Dynamic Programming (2) Matrix-chain Multiplication Segmented Least Squares Optimal substructure Dynamic programming is typically applied to optimization problems. An optimal solution to the original

More information

Tools and Primitives for High Performance Graph Computation

Tools and Primitives for High Performance Graph Computation Tools and Primitives for High Performance Graph Computation John R. Gilbert University of California, Santa Barbara Aydin Buluç (LBNL) Adam Lugowski (UCSB) SIAM Minisymposium on Analyzing Massive Real-World

More information

Graph Partitioning for Scalable Distributed Graph Computations

Graph Partitioning for Scalable Distributed Graph Computations Graph Partitioning for Scalable Distributed Graph Computations Aydın Buluç ABuluc@lbl.gov Kamesh Madduri madduri@cse.psu.edu 10 th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering

More information

1 Definition of Reduction

1 Definition of Reduction 1 Definition of Reduction Problem A is reducible, or more technically Turing reducible, to problem B, denoted A B if there a main program M to solve problem A that lacks only a procedure to solve problem

More information

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.

More information

Divide and Conquer Algorithms. Sathish Vadhiyar

Divide and Conquer Algorithms. Sathish Vadhiyar Divide and Conquer Algorithms Sathish Vadhiyar Introduction One of the important parallel algorithm models The idea is to decompose the problem into parts solve the problem on smaller parts find the global

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

Parallel Programming with MPI and OpenMP

Parallel Programming with MPI and OpenMP Parallel Programming with MPI and OpenMP Michael J. Quinn (revised by L.M. Liebrock) Chapter 7 Performance Analysis Learning Objectives Predict performance of parallel programs Understand barriers to higher

More information

Algorithm-Based Fault Tolerance for Fail-Stop Failures

Algorithm-Based Fault Tolerance for Fail-Stop Failures IEEE TRANSCATIONS ON PARALLEL AND DISTRIBUTED SYSYTEMS Algorithm-Based Fault Tolerance for Fail-Stop Failures Zizhong Chen, Member, IEEE, and Jack Dongarra, Fellow, IEEE Abstract Fail-stop failures in

More information

Lecture 4: Principles of Parallel Algorithm Design (part 4)

Lecture 4: Principles of Parallel Algorithm Design (part 4) Lecture 4: Principles of Parallel Algorithm Design (part 4) 1 Mapping Technique for Load Balancing Minimize execution time Reduce overheads of execution Sources of overheads: Inter-process interaction

More information

Parallel Graph Algorithms

Parallel Graph Algorithms Parallel Graph Algorithms Design and Analysis of Parallel Algorithms 5DV050/VT3 Part I Introduction Overview Graphs definitions & representations Minimal Spanning Tree (MST) Prim s algorithm Single Source

More information

Performance without Pain = Productivity Data Layout and Collective Communication in UPC

Performance without Pain = Productivity Data Layout and Collective Communication in UPC Performance without Pain = Productivity Data Layout and Collective Communication in UPC By Rajesh Nishtala (UC Berkeley), George Almási (IBM Watson Research Center), Călin Caşcaval (IBM Watson Research

More information

Maunam: A Communication-Avoiding Compiler

Maunam: A Communication-Avoiding Compiler Maunam: A Communication-Avoiding Compiler Karthik Murthy Advisor: John Mellor-Crummey Department of Computer Science Rice University! karthik.murthy@rice.edu February 2014!1 Knights Tour Closed Knight's

More information