Parallel Numerical Algorithms


1 Parallel Numerical Algorithms Chapter 3 Dense Linear Systems Section 3.1 Vector and Matrix Products Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 1 / 77

2 Outline 1 BLAS (vector and matrix products: inner product, outer product, matrix-vector product, matrix-matrix product) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 2 / 77

3 Basic Linear Algebra Subprograms Basic Linear Algebra Subprograms (BLAS) are building blocks for many other matrix computations BLAS encapsulate basic operations on vectors and matrices so they can be optimized for particular computer architecture while high-level routines that call them remain portable BLAS offer good opportunities for optimizing utilization of memory hierarchy Generic BLAS are available from netlib, and many computer vendors provide custom versions optimized for their particular systems Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 3 / 77

4 Examples of BLAS
Level  Work   Examples  Function
1      O(n)   daxpy     Scalar times vector plus vector
              ddot      Inner product
              dnrm2     Euclidean vector norm
2      O(n²)  dgemv     Matrix-vector product
              dtrsv     Triangular solve
              dger      Outer product
3      O(n³)  dgemm     Matrix-matrix product
              dtrsm     Multiple triangular solves
              dsyrk     Symmetric rank-k update
γ_1 = effective sec/flop for BLAS 1,  γ_2 = effective sec/flop for BLAS 2,  γ_3 = effective sec/flop for BLAS 3
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 4 / 77
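For concreteness, the routines in the table can be invoked from Python through SciPy's low-level BLAS wrappers. The short sketch below is illustrative only: the scipy.linalg.blas interface, the matrix sizes, and the random data are assumptions, not part of the slides.

```python
import numpy as np
from scipy.linalg import blas  # thin wrappers over an optimized BLAS library

n = 1000
x = np.random.rand(n)
y = np.random.rand(n)
A = np.random.rand(n, n)
B = np.random.rand(n, n)

s = blas.ddot(x, y)           # BLAS 1: inner product, O(n) work
nrm = blas.dnrm2(x)           # BLAS 1: Euclidean vector norm
z = blas.daxpy(x, y, a=2.0)   # BLAS 1: y + 2.0*x
v = blas.dgemv(1.0, A, x)     # BLAS 2: matrix-vector product, O(n^2) work
C = blas.dgemm(1.0, A, B)     # BLAS 3: matrix-matrix product, O(n^3) work
```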

5 BLAS Inner product of two n-vectors x and y given by
  x^T y = Σ_{i=1}^n x_i y_i
Computation of inner product requires n multiplications and n − 1 additions
  M_1 = Θ(n),  Q_1 = Θ(n),  T_1 = Θ(γn)
Effectively as hard as a scalar reduction; can be done via binary or binomial tree summation
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 5 / 77

6 BLAS Partition:
  For i = 1, ..., n, fine-grain task i stores x_i and y_i, and computes their product x_i y_i
Communicate:
  Sum reduction over n fine-grain tasks
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 6 / 77

7 Fine-Grain
  z_i = x_i y_i                                { local scalar product }
  reduce z_i across all tasks i = 1, ..., n    { sum reduction }
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 7 / 77

8 Agglomeration and Mapping
Agglomerate:
  Combine k components of both x and y to form each coarse-grain task, which computes the inner product of these subvectors
  Communication becomes sum reduction over n/k coarse-grain tasks
Map:
  Assign (n/k)/p coarse-grain tasks to each of p processors, for a total of n/p components of x and y per processor
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 8 / 77

9 Coarse-Grain
  z_i = x_[i]^T y_[i]                              { local inner product }
  reduce z_i across all processors i = 1, ..., p   { sum reduction }
[ x_[i] denotes the subvector of x assigned to processor i ]
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 9 / 77
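A minimal mpi4py sketch of this coarse-grain inner product is shown below; the vector length, the random local data, and the use of allreduce (so every process ends up with the result) are illustrative assumptions.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
n = 1 << 20                      # assumed global vector length, divisible by p
rng = np.random.default_rng(comm.Get_rank())

x_local = rng.random(n // p)     # this process's n/p components of x
y_local = rng.random(n // p)     # this process's n/p components of y

z_local = float(x_local @ y_local)        # local inner product x_[i]^T y_[i]
z = comm.allreduce(z_local, op=MPI.SUM)   # sum reduction across all p processes
if comm.Get_rank() == 0:
    print("x^T y =", z)
```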

10 Performance BLAS The parallel costs (S_p, W_p, F_p) for the inner product are given by:
  Computational cost F_p = Θ(n/p) regardless of network
  The latency and bandwidth costs depend on network:
    1-D mesh: S_p, W_p = Θ(p)
    2-D mesh: S_p, W_p = Θ(√p)
    hypercube: S_p, W_p = Θ(log p)
For a hypercube or fully connected network, time is
  T_p = αS_p + βW_p + γF_p = Θ(α log(p) + γn/p)
Efficiency and scaling are the same as for binary tree sum
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 10 / 77

11 Inner product on 1-D Mesh For 1-D mesh, total time is
  T_p = Θ(γn/p + αp)
To determine strong scalability, we set the efficiency constant and solve for p_s:
  const = E_{p_s} = T_1 / (p_s T_{p_s}) = Θ( γn / (γn + αp_s²) ) = Θ( 1 / (1 + (α/γ)p_s²/n) )
which yields p_s = Θ(√((γ/α)n))
1-D mesh weakly scalable to p_w = Θ((γ/α)n) processors:
  E_{p_w}(p_w n) = Θ( 1 / (1 + (α/γ)p_w²/(p_w n)) ) = Θ( 1 / (1 + (α/γ)p_w/n) )
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 11 / 77

12 Inner product on 2-D Mesh For 2-D mesh, total time is
  T_p = Θ(γn/p + α√p)
To determine strong scalability, we set the efficiency constant and solve for p_s:
  const = E_{p_s} = T_1 / (p_s T_{p_s}) = Θ( γn / (γn + αp_s^{3/2}) ) = Θ( 1 / (1 + (α/γ)p_s^{3/2}/n) )
which yields p_s = Θ((γ/α)^{2/3} n^{2/3})
2-D mesh weakly scalable to p_w = Θ((γ/α)²n²), since
  E_{p_w}(p_w n) = Θ( 1 / (1 + (α/γ)p_w^{3/2}/(p_w n)) ) = Θ( 1 / (1 + (α/γ)√p_w/n) )
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 12 / 77

13 BLAS Outer product of two n-vectors x and y is the n × n matrix Z = xy^T whose (i, j) entry is z_ij = x_i y_j
Computation of the outer product requires n² multiplications, so
  M_1 = Θ(n²),  Q_1 = Θ(n²),  T_1 = Θ(γn²)
(in this case, we should treat M_1 as the output size, or define the problem as in the BLAS: Z = Z_input + xy^T)
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 13 / 77

14 BLAS Partition:
  For i, j = 1, ..., n, fine-grain task (i, j) computes and stores z_ij = x_i y_j, yielding a 2-D array of n² fine-grain tasks
  Assuming no replication of data, at most 2n fine-grain tasks store components of x and y, say either
    for some j, task (i, j) stores x_i and task (j, i) stores y_i, or
    task (i, i) stores both x_i and y_i, i = 1, ..., n
Communicate:
  For i = 1, ..., n, task that stores x_i broadcasts it to all other tasks in ith task row
  For j = 1, ..., n, task that stores y_j broadcasts it to all other tasks in jth task column
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 14 / 77

15 Fine-Grain Tasks and Communication Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 15 / 77

16 Fine-Grain
  broadcast x_i to tasks (i, k), k = 1, ..., n    { horizontal broadcast }
  broadcast y_j to tasks (k, j), k = 1, ..., n    { vertical broadcast }
  z_ij = x_i y_j                                  { local scalar product }
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 16 / 77

17 Agglomeration BLAS Agglomerate With n n array of fine-grain tasks, natural strategies are 2-D: Combine k k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k) 2 coarse-grain tasks 1-D column: Combine n fine-grain tasks in each column into coarse-grain task, yielding n coarse-grain tasks 1-D row: Combine n fine-grain tasks in each row into coarse-grain task, yielding n coarse-grain tasks Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 17 / 77

18 2-D Agglomeration BLAS Each task that stores portion of x must broadcast its subvector to all other tasks in its task row Each task that stores portion of y must broadcast its subvector to all other tasks in its task column Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 18 / 77

19 2-D Agglomeration BLAS Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 19 / 77

20 1-D Agglomeration BLAS If either x or y stored in one task, then broadcast required to communicate needed values to all other tasks If either x or y distributed across tasks, then multinode broadcast required to communicate needed values to other tasks Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 20 / 77

21 1-D Column Agglomeration Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 21 / 77

22 1-D Row Agglomeration Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 22 / 77

23 Mapping BLAS Map 2-D: Assign (n/k) 2 /p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating target network as 2-D mesh 1-D: Assign n/p coarse-grain tasks to each of p processors using any desired mapping, treating target network as 1-D mesh Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 23 / 77

24 2-D Agglomeration with Block Mapping Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 24 / 77

25 1-D Column Agglomeration with Block Mapping Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 25 / 77

26 1-D Row Agglomeration with Block Mapping Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 26 / 77

27 Coarse-Grain
  broadcast x_[i] to ith process row       { horizontal broadcast }
  broadcast y_[j] to jth process column    { vertical broadcast }
  Z_[i][j] = x_[i] y_[j]^T                 { local outer product }
[ Z_[i][j] denotes the submatrix of Z assigned to process (i, j) by the mapping ]
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 27 / 77
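The following hedged mpi4py sketch follows the coarse-grain algorithm above, under the assumptions that p is a perfect square, that the diagonal process (i, i) initially owns x_[i] and y_[i], and that the block size nb stands in for n/√p.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))
assert q * q == p, "sketch assumes a square q x q process grid"

cart = comm.Create_cart([q, q])
i, j = cart.Get_coords(cart.Get_rank())
row_comm = cart.Sub([False, True])   # processes in row i; ranks ordered by column index
col_comm = cart.Sub([True, False])   # processes in column j; ranks ordered by row index

nb = 4                               # assumed block size n/sqrt(p)
if i == j:                           # diagonal process (i, i) owns x_[i] and y_[i]
    x_blk = np.arange(nb, dtype='d') + i * nb
    y_blk = 2.0 * x_blk
else:
    x_blk = np.empty(nb, dtype='d')
    y_blk = np.empty(nb, dtype='d')

row_comm.Bcast(x_blk, root=i)        # horizontal broadcast of x_[i] along process row i
col_comm.Bcast(y_blk, root=j)        # vertical broadcast of y_[j] along process column j
Z_blk = np.outer(x_blk, y_blk)       # local outer product Z_[i][j] = x_[i] y_[j]^T
```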

28 Performance BLAS The parallel costs (S_p, W_p, F_p) for the outer product are given by:
  Computational cost F_p = Θ(n²/p) regardless of network
  The latency and bandwidth costs can be derived from the cost of broadcast:
    1-D agglomeration: S_p = Θ(log p), W_p = Θ(n)
    2-D agglomeration: S_p = Θ(log p), W_p = Θ(n/√p)
For 1-D agglomeration, execution time is
  T_p^{1-D} = T_p^{bcast}(n) + Θ(γn²/p) = Θ(α log(p) + βn + γn²/p)
For 2-D agglomeration, execution time is
  T_p^{2-D} = 2 T_p^{bcast}(n/√p) + Θ(γn²/p) = Θ(α log(p) + βn/√p + γn²/p)
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 28 / 77

29 Strong Scaling 1-D agglomeration is strongly scalable to
  p_s = Θ(min( (γ/α)n² / log((γ/α)n²), (γ/β)n ))
processors, since
  E^{1-D}_{p_s} = Θ( 1 / (1 + (α/γ) log(p_s) p_s/n² + (β/γ) p_s/n) )
2-D agglomeration is strongly scalable to
  p_s = Θ(min( (γ/α)n² / log((γ/α)n²), (γ/β)²n² ))
processors, since
  E^{2-D}_{p_s} = Θ( 1 / (1 + (α/γ) log(p_s) p_s/n² + (β/γ) √p_s/n) )
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 29 / 77

30 Weak Scaling 1-D agglomeration is weakly scalable to
  p_w = Θ(min( 2^{(γ/α)n²}, (γ/β)²n² ))
processors, since
  E^{1-D}_{p_w}(√p_w · n) = Θ( 1 / (1 + (α/γ) log(p_w)/n² + (β/γ) √p_w/n) )
2-D agglomeration is weakly scalable to
  p_w = Θ( 2^{(γ/α)n²} )
processors, since
  E^{2-D}_{p_w}(√p_w · n) = Θ( 1 / (1 + (α/γ) log(p_w)/n² + (β/γ)/n) )
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 30 / 77

31 Memory Requirements The memory requirements are dominated by storing Z, which in practice means the outer product is a poor primitive (local flop-to-byte ratio of 1)
If possible, Z should be operated on as it is computed, e.g. if we really need v_i = Σ_j f(x_i y_j) for some scalar function f
If Z does not need to be stored, vector storage dominates:
  Without further modification, the 1-D algorithm requires M_p = Θ(np) overall storage for the vector
  Without further modification, the 2-D algorithm requires M_p = Θ(n√p) overall storage for the vector
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 31 / 77
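As a small serial illustration of operating on Z as it is computed, the sketch below evaluates v_i = Σ_j f(x_i y_j) over column blocks of Z so that only an n × b slice of Z exists at any time; the block size b and the test data are arbitrary assumptions.

```python
import numpy as np

def fused_outer_reduce(x, y, f, b=256):
    """Compute v[i] = sum_j f(x[i] * y[j]) without storing the full n x n matrix Z."""
    v = np.zeros_like(x)
    for s in range(0, y.size, b):
        Z_slice = np.outer(x, y[s:s + b])   # only an n x b slice of Z = x y^T at a time
        v += f(Z_slice).sum(axis=1)
    return v

x = np.random.rand(1000)
y = np.random.rand(1000)
v = fused_outer_reduce(x, y, np.square)
```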

32 Consider matrix-vector product y = Ax, where A is an n × n matrix and x and y are n-vectors
Components of vector y are given by the inner products
  y_i = Σ_{j=1}^n a_ij x_j,   i = 1, ..., n
The sequential memory, work, and time are
  M_1 = Θ(n²),  Q_1 = Θ(n²),  T_1 = Θ(γn²)
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 32 / 77

33 BLAS Partition:
  For i, j = 1, ..., n, fine-grain task (i, j) stores a_ij and computes a_ij x_j, yielding a 2-D array of n² fine-grain tasks
  Assuming no replication of data, at most 2n fine-grain tasks store components of x and y, say either
    for some j, task (j, i) stores x_i and task (i, j) stores y_i, or
    task (i, i) stores both x_i and y_i, i = 1, ..., n
Communicate:
  For j = 1, ..., n, task that stores x_j broadcasts it to all other tasks in jth task column
  For i = 1, ..., n, sum reduction over ith task row gives y_i
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 33 / 77

34 Fine-Grain Tasks and Communication (figure: 6 × 6 array of fine-grain tasks with matrix entries a_11 through a_66) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 34 / 77

35 Fine-Grain
  broadcast x_j to tasks (k, j), k = 1, ..., n     { vertical broadcast }
  y_i = a_ij x_j                                   { local scalar product }
  reduce y_i across tasks (i, k), k = 1, ..., n    { horizontal sum reduction }
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 35 / 77

36 Agglomeration BLAS Agglomerate With n n array of fine-grain tasks, natural strategies are 2-D: Combine k k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k) 2 coarse-grain tasks 1-D column: Combine n fine-grain tasks in each column into coarse-grain task, yielding n coarse-grain tasks 1-D row: Combine n fine-grain tasks in each row into coarse-grain task, yielding n coarse-grain tasks Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 36 / 77

37 2-D Agglomeration BLAS Subvector of x broadcast along each task column Each task computes local matrix-vector product of submatrix of A with subvector of x Sum reduction along each task row produces subvector of result y Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 37 / 77

38 2-D Agglomeration BLAS (figure: 2-D agglomeration of the 6 × 6 array of tasks for entries a_11 through a_66) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 38 / 77

39 1-D Agglomeration BLAS
1-D column agglomeration:
  Each task computes the product of its component of x times its column of the matrix, with no communication required
  Sum reduction across tasks then produces y
1-D row agglomeration:
  If x stored in one task, then broadcast required to communicate needed values to all other tasks
  If x distributed across tasks, then multinode broadcast required to communicate needed values to other tasks
  Each task computes inner product of its row of A with entire vector x to produce its component of y
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 39 / 77

40 1-D Column Agglomeration (figure: 1-D column agglomeration of the 6 × 6 array of tasks for entries a_11 through a_66) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 40 / 77

41 1-D Row Agglomeration (figure: 1-D row agglomeration of the 6 × 6 array of tasks for entries a_11 through a_66) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 41 / 77

42 1-D Agglomeration BLAS Column and row algorithms are dual to each other:
  Column algorithm begins with communication-free local vector scaling (daxpy) computations, whose partial results are then combined across processors by a reduction
  Row algorithm begins with a broadcast, followed by communication-free local inner-product (ddot) computations
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 42 / 77

43 Mapping BLAS Map 2-D: Assign (n/k) 2 /p coarse-grain tasks to each of p processes using any desired mapping in each dimension, treating target network as 2-D mesh 1-D: Assign n/p coarse-grain tasks to each of p processes using any desired mapping, treating target network as 1-D mesh Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 43 / 77

44 2-D Agglomeration with Block Mapping (figure: block mapping of the 6 × 6 matrix a_11 through a_66 onto a 2-D process grid) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 44 / 77

45 1-D Column Agglomeration with Block Mapping (figure: block mapping of the 6 × 6 matrix a_11 through a_66 onto processes by columns) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 45 / 77

46 1-D Row Agglomeration with Block Mapping (figure: block mapping of the 6 × 6 matrix a_11 through a_66 onto processes by rows) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 46 / 77

47 Coarse-Grain
  broadcast x_[j] to jth process column     { vertical broadcast }
  y_[i] = A_[i][j] x_[j]                    { local matrix-vector product }
  reduce y_[i] across ith process row       { horizontal sum reduction }
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 47 / 77
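A hedged mpi4py sketch of this coarse-grain matrix-vector product follows, assuming a square process grid, that the diagonal process (j, j) owns x_[j], that y_[i] is reduced to process (i, i), and an illustrative block size.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))
assert q * q == p, "sketch assumes a square q x q process grid"

cart = comm.Create_cart([q, q])
i, j = cart.Get_coords(cart.Get_rank())
row_comm = cart.Sub([False, True])   # process row i; ranks ordered by column index
col_comm = cart.Sub([True, False])   # process column j; ranks ordered by row index

nb = 4                               # assumed block size n/sqrt(p)
A_blk = np.random.rand(nb, nb)       # local block A_[i][j]

# diagonal process (j, j) owns x_[j]; broadcast it down column j
x_blk = np.random.rand(nb) if i == j else np.empty(nb)
col_comm.Bcast(x_blk, root=j)        # vertical broadcast of x_[j]

y_part = A_blk @ x_blk               # local matrix-vector product A_[i][j] x_[j]
y_blk = np.empty(nb) if i == j else None
row_comm.Reduce(y_part, y_blk, op=MPI.SUM, root=i)   # horizontal sum reduction to (i, i)
```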

48 Performance BLAS The parallel costs (S_p, W_p, F_p) for the matrix-vector product are:
  Computational cost F_p = Θ(n²/p) regardless of network
  The latency and bandwidth costs can be derived from the cost of broadcast:
    1-D agglomeration: S_p = Θ(log p), W_p = Θ(n)
    2-D agglomeration: S_p = Θ(log p), W_p = Θ(n/√p)
For 1-D row agglomeration, execution time is
  T_p^{1-D} = T_p^{bcast}(n) + Θ(γn²/p) = Θ(α log(p) + βn + γn²/p)
For 2-D agglomeration, using T^{bcast} = T^{reduce}, time is
  T_p^{2-D} = 2 T_p^{bcast}(n/√p) + Θ(γn²/p) = Θ(α log(p) + βn/√p + γn²/p)
So scalability is essentially the same as for outer product
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 48 / 77

49 Consider matrix-matrix product C = AB, where A, B, and result C are n × n matrices
Entries of matrix C are given by
  c_ij = Σ_{k=1}^n a_ik b_kj,   i, j = 1, ..., n
Each of n² entries of C requires n multiply-add operations, so model serial time as T_1 = γn³
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 49 / 77
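For reference, a plain triple-loop sketch (illustrative only) that performs exactly the n multiply-add operations per entry counted by this serial time model:

```python
import numpy as np

def matmul_naive(A, B):
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):               # n multiply-adds per entry c_ij
                C[i, j] += A[i, k] * B[k, j]
    return C
```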

50 Matrix-matrix product can be viewed as n 2 inner products, or sum of n outer products, or n matrix-vector products and each viewpoint yields different algorithm One way to derive parallel algorithms for matrix-matrix product is to apply parallel algorithms already developed for inner product, outer product, or matrix-vector product However, considering the problem as a whole yields the best algorithms Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 50 / 77

51 BLAS Partition:
  For i, j, k = 1, ..., n, fine-grain task (i, j, k) computes product a_ik b_kj, yielding a 3-D array of n³ fine-grain tasks
  Assuming no replication of data, at most 3n² fine-grain tasks store entries of A, B, or C, say
    task (i, j, j) stores a_ij, task (i, j, i) stores b_ij, and task (i, j, k) stores c_ij for i, j = 1, ..., n and some fixed k
  We refer to subsets of tasks along i, j, and k dimensions as rows, columns, and layers, respectively, so kth column of A and kth row of B are stored in kth layer of tasks
(figure: 3-D array of fine-grain tasks with axes i, j, k)
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 51 / 77

52 BLAS Communicate Broadcast entries of jth column of A horizontally along each task row in jth layer Broadcast entries of ith row of B vertically along each task column in ith layer For i, j = 1,..., n, result c ij is given by sum reduction over tasks (i, j, k), k = 1,..., n Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 52 / 77

53 Fine-Grain Algorithm
  broadcast a_ik to tasks (i, q, k), q = 1, ..., n     { horizontal broadcast }
  broadcast b_kj to tasks (q, j, k), q = 1, ..., n     { vertical broadcast }
  c_ij = a_ik b_kj                                     { local scalar product }
  reduce c_ij across tasks (i, j, q), q = 1, ..., n    { lateral sum reduction }
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 53 / 77

54 Agglomeration BLAS Agglomerate: With n × n × n array of fine-grain tasks, natural strategies are
  3-D: Combine q × q × q subarray of fine-grain tasks
  2-D: Combine q × q × n subarray of fine-grain tasks, eliminating sum reductions
  1-D column: Combine n × 1 × n subarray of fine-grain tasks, eliminating vertical broadcasts and sum reductions
  1-D row: Combine 1 × n × n subarray of fine-grain tasks, eliminating horizontal broadcasts and sum reductions
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 54 / 77

55 Mapping BLAS Map Corresponding mapping strategies are 3-D: Assign (n/q) 3 /p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating target network as 3-D mesh 2-D: Assign (n/q) 2 /p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating target network as 2-D mesh 1-D: Assign n/p coarse-grain tasks to each of p processors using any desired mapping, treating target network as 1-D mesh Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 55 / 77

56 Agglomeration with Block Mapping (figure: block mappings for 1-D row, 1-D column, 2-D, and 3-D agglomerations) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 56 / 77

57 Coarse-Grain 3-D
  broadcast A_[i][k] to ith processor row       { horizontal broadcast }
  broadcast B_[k][j] to jth processor column    { vertical broadcast }
  C_[i][j] = A_[i][k] B_[k][j]                  { local matrix product }
  reduce C_[i][j] across processor layers       { lateral sum reduction }
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 57 / 77
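Below is a hedged mpi4py sketch of the 3-D coarse-grain algorithm. The initial placement is an assumption made for illustration only: A_[i][k] starts on the j = 0 face of the grid, B_[k][j] on the i = 0 face, and C_[i][j] is reduced to the k = 0 face; p is assumed to be a perfect cube and nb stands in for n/p^{1/3}.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** (1.0 / 3.0)))
assert q ** 3 == p, "sketch assumes a q x q x q processor grid"

cart = comm.Create_cart([q, q, q])
i, j, k = cart.Get_coords(cart.Get_rank())
j_comm = cart.Sub([False, True, False])   # vary j: used to broadcast A_[i][k]
i_comm = cart.Sub([True, False, False])   # vary i: used to broadcast B_[k][j]
k_comm = cart.Sub([False, False, True])   # vary k: used to reduce C_[i][j]

nb = 4                                    # assumed block size n/p^(1/3)
A_blk = np.random.rand(nb, nb) if j == 0 else np.empty((nb, nb))   # A_[i][k] on j = 0 face
B_blk = np.random.rand(nb, nb) if i == 0 else np.empty((nb, nb))   # B_[k][j] on i = 0 face

j_comm.Bcast(A_blk, root=0)               # broadcast A_[i][k] along the j dimension
i_comm.Bcast(B_blk, root=0)               # broadcast B_[k][j] along the i dimension
C_part = A_blk @ B_blk                    # local matrix product
C_blk = np.empty((nb, nb)) if k == 0 else None
k_comm.Reduce(C_part, C_blk, op=MPI.SUM, root=0)   # lateral sum reduction over layers
```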

58 Agglomeration with Block Mapping For a 2 × 2 blocking,
  [ A_11 A_12 ] [ B_11 B_12 ]   [ A_11 B_11 + A_12 B_21   A_11 B_12 + A_12 B_22 ]
  [ A_21 A_22 ] [ B_21 B_22 ] = [ A_21 B_11 + A_22 B_21   A_21 B_12 + A_22 B_22 ]
(the slide shows this same block product under 2-D, 1-D column, and 1-D row agglomerations of the blocks)
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 58 / 77

59 Coarse-Grain 2-D
  allgather A_[i][j] in ith processor row        { horizontal broadcast }
  allgather B_[i][j] in jth processor column     { vertical broadcast }
  C_[i][j] = 0
  for k = 1, ..., √p
    C_[i][j] = C_[i][j] + A_[i][k] B_[k][j]      { sum local products }
  end
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 59 / 77
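A hedged mpi4py sketch of this allgather-based 2-D algorithm is given below (square process grid, illustrative block size, random local blocks); note how each process buffers all √p blocks of its row of A and its column of B, which is the memory cost the SUMMA variant on the next slide avoids.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))
assert q * q == p, "sketch assumes a square q x q process grid"

cart = comm.Create_cart([q, q])
i, j = cart.Get_coords(cart.Get_rank())
row_comm = cart.Sub([False, True])     # process row i
col_comm = cart.Sub([True, False])     # process column j

nb = 4                                 # assumed block size n/sqrt(p)
A_blk = np.random.rand(nb, nb)         # local block A_[i][j]
B_blk = np.random.rand(nb, nb)         # local block B_[i][j]

A_row = np.empty((q, nb, nb))          # holds A_[i][0..q-1] after the allgather
B_col = np.empty((q, nb, nb))          # holds B_[0..q-1][j] after the allgather
row_comm.Allgather(A_blk, A_row)       # gather all blocks of A in processor row i
col_comm.Allgather(B_blk, B_col)       # gather all blocks of B in processor column j

C_blk = np.zeros((nb, nb))
for k in range(q):
    C_blk += A_row[k] @ B_col[k]       # C_[i][j] += A_[i][k] B_[k][j]
```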

60 SUMMA Algorithm BLAS
Algorithm just described requires excessive memory, since each process accumulates √p blocks of both A and B
One way to reduce memory requirements is to
  broadcast blocks of A successively across processor rows, and
  broadcast blocks of B successively across processor columns
  C_[i][j] = 0
  for k = 1, ..., √p
    broadcast A_[i][k] in ith processor row       { horizontal broadcast }
    broadcast B_[k][j] in jth processor column    { vertical broadcast }
    C_[i][j] = C_[i][j] + A_[i][k] B_[k][j]       { sum local products }
  end
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 60 / 77
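The same setup expressed as a SUMMA-style loop of broadcasts, again as an mpi4py sketch with an assumed grid shape and block size; only one block of A and one block of B is buffered per step.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))
assert q * q == p, "sketch assumes a square q x q process grid"

cart = comm.Create_cart([q, q])
i, j = cart.Get_coords(cart.Get_rank())
row_comm = cart.Sub([False, True])         # process row i; ranks ordered by column index
col_comm = cart.Sub([True, False])         # process column j; ranks ordered by row index

nb = 4                                     # assumed block size n/sqrt(p)
A_blk = np.random.rand(nb, nb)             # local block A_[i][j]
B_blk = np.random.rand(nb, nb)             # local block B_[i][j]
C_blk = np.zeros((nb, nb))

for k in range(q):
    A_k = A_blk.copy() if j == k else np.empty((nb, nb))
    B_k = B_blk.copy() if i == k else np.empty((nb, nb))
    row_comm.Bcast(A_k, root=k)            # broadcast A_[i][k] across processor row i
    col_comm.Bcast(B_k, root=k)            # broadcast B_[k][j] down processor column j
    C_blk += A_k @ B_k                     # accumulate A_[i][k] B_[k][j]
```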

61 SUMMA Algorithm BLAS (figure: SUMMA broadcast pattern for blocks of A and B on a 4 × 4 grid of 16 CPUs) Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 61 / 77

62 Cannon Algorithm BLAS Another approach, due to Cannon (1969), is to circulate blocks of B vertically and blocks of A horizontally in ring fashion Blocks of both matrices must be initially aligned using circular shifts so that correct blocks meet as needed Requires less memory than SUMMA and replaces line broadcasts with shifts, lowering the number of messages Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 62 / 77
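A hedged mpi4py sketch of Cannon's algorithm on a periodic (torus) process grid follows; the block size and the random local blocks are assumptions, and the circular shifts are expressed with the Cartesian Shift and Sendrecv_replace operations.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))
assert q * q == p, "sketch assumes a square q x q process grid"

# periodic (torus) grid so that shifts wrap around
cart = comm.Create_cart([q, q], periods=[True, True])
i, j = cart.Get_coords(cart.Get_rank())

nb = 4                                     # assumed block size n/sqrt(p)
A_blk = np.random.rand(nb, nb)             # local block A_[i][j]
B_blk = np.random.rand(nb, nb)             # local block B_[i][j]
C_blk = np.zeros((nb, nb))

# initial alignment: shift A left by i positions, B up by j positions
src, dst = cart.Shift(1, -i)               # dimension 1 = column index
cart.Sendrecv_replace(A_blk, dest=dst, source=src)
src, dst = cart.Shift(0, -j)               # dimension 0 = row index
cart.Sendrecv_replace(B_blk, dest=dst, source=src)

for _ in range(q):
    C_blk += A_blk @ B_blk                 # multiply currently aligned blocks
    src, dst = cart.Shift(1, -1)           # shift A one position left
    cart.Sendrecv_replace(A_blk, dest=dst, source=src)
    src, dst = cart.Shift(0, -1)           # shift B one position up
    cart.Sendrecv_replace(B_blk, dest=dst, source=src)
```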

63 Cannon Algorithm BLAS Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 63 / 77

64 Fox Algorithm BLAS It is possible to mix techniques from SUMMA and Cannon s algorithm: circulate blocks of B in ring fashion vertically along processor columns step by step so that each block of B comes in conjunction with appropriate block of A broadcast at that same step This algorithm is due to Fox et al. Shifts, especially in Cannon s algorithm, are harder to generalize to nonsquare processor grids than collectives in algorithms like SUMMA Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 64 / 77

65 Execution Time for 3-D Agglomeration For 3-D agglomeration, computing each of p blocks C_[i][j] requires a matrix-matrix product of two (n/p^{1/3}) × (n/p^{1/3}) blocks, so
  F_p = (n/p^{1/3})³ = n³/p
On a 3-D mesh, each broadcast or reduction takes time
  T^{bcast}_{p^{1/3}}((n/p^{1/3})²) = O(α log p + βn²/p^{2/3})
Total time is therefore
  T_p = O(α log p + βn²/p^{2/3} + γn³/p)
However, overall memory footprint is
  M_p = Θ(p(n/p^{1/3})²) = Θ(p^{1/3}n²)
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 65 / 77

66 Strong Scaling of 3-D Agglomeration The 3-D agglomeration efficiency is given by
  E_p(n) = T_1(n) / (p T_p(n)) = O( 1 / (1 + (α/γ)p log p/n³ + (β/γ)p^{1/3}/n) )
For strong scaling to p_s processors we need E_{p_s}(n) = Θ(1), so 3-D agglomeration strong scales to
  p_s = O(min( (γ/α)n³/log(n), (γ/β)n³ ))
processors
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 66 / 77

67 Weak Scaling of 3-D Agglomeration For weak scaling (with constant input size per processor) to p_w processors, we need E_{p_w}(n√p_w) = Θ(1), which holds
However, note that the algorithm is not memory-efficient, as M_p = Θ(p^{1/3}n²); if keeping the memory footprint per processor constant, we must grow n with p^{1/3}
Scaling with constant memory footprint per processor (M_p/p = const) is possible to p_m processors, where E_{p_m}(n p_m^{1/3}) = Θ(1), which holds for p_m = Θ(2^{(γ/α)n³}) processors
The isoefficiency function is Q(p) = Θ(p log(p))
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 67 / 77

68 Execution Time for 2-D Agglomeration For 2-D agglomeration (SUMMA), computation of each block C_[i][j] requires √p matrix-matrix products of (n/√p) × (n/√p) blocks, so
  F_p = √p (n/√p)³ = n³/p
For broadcasts among rows and columns of the processor grid, communication time is
  2√p T^{bcast}_{√p}(n²/p) = Θ(α√p log(p) + βn²/√p)
Total time is therefore
  T_p = O(α√p log(p) + βn²/√p + γn³/p)
The algorithm is memory-efficient: M_p = Θ(n²)
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 68 / 77

69 Strong Scaling of 2-D Agglomeration The 2-D agglomeration efficiency is given by
  E_p(n) = T_1(n) / (p T_p(n)) = O( 1 / (1 + (α/γ)p^{3/2} log p/n³ + (β/γ)√p/n) )
For strong scaling to p_s processors we need E_{p_s}(n) = Θ(1), so 2-D agglomeration strong scales to
  p_s = O(min( (γ/α)n²/log(n)^{2/3}, (γ/β)n² ))
processors
For weak scaling to p_w processors with n²/p matrix elements per processor, we need E_{p_w}(√p_w · n) = Θ(1), so 2-D agglomeration (SUMMA) weak scales to
  p_w = O(2^{(γ/α)n³})
processors
Cannon's algorithm achieves unconditional weak scalability
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 69 / 77

70 Scaling for 1-D Agglomeration For 1-D agglomeration on a 1-D mesh, total time is
  T_p = O(α log(p) + βn² + γn³/p)
The corresponding efficiency is
  E_p = O( 1 / (1 + (α/γ)p log(p)/n³ + (β/γ)p/n) )
Strong scalability is possible to at most p_s = O((γ/β)n) processors
Weak scalability is possible to at most p_w = O((γ/β)²n²) processors
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 70 / 77

71 Rectangular Matrix Multiplication If C is m × n, A is m × k, and B is k × n, choosing a 3-D grid that optimizes the volume-to-surface-area ratio yields bandwidth cost
  W_p(m, n, k) = O( min_{p_1 p_2 p_3 = p} [ mk/(p_1 p_2) + kn/(p_1 p_3) + mn/(p_2 p_3) ] )
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 71 / 77

72 Communication vs. Memory Tradeoff Communication cost for 2-D algorithms for matrix-matrix product is optimal, assuming no replication of storage
If explicit replication of storage is allowed, then lower communication volume is possible via a 3-D algorithm
Generally, we assign an n/p_1 × n/p_2 × n/p_3 computation subvolume to each processor
The largest face of the subvolume gives the communication cost; the smallest face gives the minimal memory usage, since we can keep the smallest face local while successively computing slices of the subvolume
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 72 / 77

73 Leveraging Additional Memory in Matrix Multiplication Provided M total available memory, 2-D and 3-D algorithms achieve bandwidth cost
  W_p(n, M) = n²/√p       if M = Θ(n²)
  W_p(n, M) = n²/p^{2/3}  if M = Θ(n²p^{1/3})
  (the problem does not fit if M < n²)
For general M, it is possible to pick the processor grid to achieve
  W_p(n, M) = O( n³/(√p M^{1/2}) + n²/p^{2/3} )
and an overall execution time of
  T_p(n, M) = O( α(log p + n³√p/M^{3/2}) + βW_p(n, M) + γn³/p )
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 73 / 77

74 Strong Scaling using All Available Memory The efficiency of the algorithm for a given amount of total memory M_p is
  E_p(n, M_p) = O( 1 / (1 + (α/γ)(p log p/n³ + p^{3/2}/M_p^{3/2}) + (β/γ)(√p/M_p^{1/2} + p^{1/3}/n)) )
For strong scaling assuming M_p = p M_1, we need
  E_{p_s}(n, p_s M_1) = T_1(n, M_1) / (p_s T_{p_s}(n, p_s M_1)) = Θ(1)
In this case, we obtain
  p_s = Θ(min( (γ/α)n³/log(n), (γ/β)n³ ))
as good as the 3-D algorithm, but more memory-efficient
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 74 / 77

75 References BLAS
R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar, A three-dimensional approach to parallel matrix multiplication, IBM J. Res. Dev. 39, 1995
J. Berntsen, Communication efficient matrix multiplication on hypercubes, Parallel Comput. 12, 1989
J. Demmel, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger, Communication-optimal parallel recursive rectangular matrix multiplication, IPDPS, 2013
J. W. Demmel, M. T. Heath, and H. A. van der Vorst, Parallel numerical linear algebra, Acta Numerica 2, 1993
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 75 / 77

76 References BLAS
R. Dias da Cunha, A benchmark study based on the parallel computation of the vector outer-product A = uv^T operation, Concurrency: Practice and Experience 9, 1997
G. C. Fox, S. W. Otto, and A. J. G. Hey, Matrix algorithms on a hypercube I: matrix multiplication, Parallel Comput. 4:17-31, 1987
D. Irony, S. Toledo, and A. Tiskin, Communication lower bounds for distributed-memory matrix multiplication, J. Parallel Distrib. Comput. 64, 2004
S. L. Johnsson, Communication efficient basic linear algebra computations on hypercube architectures, J. Parallel Distrib. Comput. 4(2), 1987
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 76 / 77

77 References BLAS
S. L. Johnsson, Minimizing the communication time for matrix multiplication on multiprocessors, Parallel Comput. 19, 1993
B. Lipshitz, Communication-avoiding parallel recursive algorithms for matrix multiplication, Tech. Rept. UCB/EECS, University of California at Berkeley, 2013
O. McBryan and E. F. Van de Velde, Matrix and vector operations on hypercube parallel processors, Parallel Comput. 5, 1987
R. A. Van De Geijn and J. Watts, SUMMA: Scalable universal matrix multiplication algorithm, Concurrency: Practice and Experience 9(4), 1997
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 77 / 77


More information

12 Dynamic Programming (2) Matrix-chain Multiplication Segmented Least Squares

12 Dynamic Programming (2) Matrix-chain Multiplication Segmented Least Squares 12 Dynamic Programming (2) Matrix-chain Multiplication Segmented Least Squares Optimal substructure Dynamic programming is typically applied to optimization problems. An optimal solution to the original

More information

Tools and Primitives for High Performance Graph Computation

Tools and Primitives for High Performance Graph Computation Tools and Primitives for High Performance Graph Computation John R. Gilbert University of California, Santa Barbara Aydin Buluç (LBNL) Adam Lugowski (UCSB) SIAM Minisymposium on Analyzing Massive Real-World

More information

Graph Partitioning for Scalable Distributed Graph Computations

Graph Partitioning for Scalable Distributed Graph Computations Graph Partitioning for Scalable Distributed Graph Computations Aydın Buluç ABuluc@lbl.gov Kamesh Madduri madduri@cse.psu.edu 10 th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering

More information

1 Definition of Reduction

1 Definition of Reduction 1 Definition of Reduction Problem A is reducible, or more technically Turing reducible, to problem B, denoted A B if there a main program M to solve problem A that lacks only a procedure to solve problem

More information

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.

More information

Divide and Conquer Algorithms. Sathish Vadhiyar

Divide and Conquer Algorithms. Sathish Vadhiyar Divide and Conquer Algorithms Sathish Vadhiyar Introduction One of the important parallel algorithm models The idea is to decompose the problem into parts solve the problem on smaller parts find the global

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

Parallel Programming with MPI and OpenMP

Parallel Programming with MPI and OpenMP Parallel Programming with MPI and OpenMP Michael J. Quinn (revised by L.M. Liebrock) Chapter 7 Performance Analysis Learning Objectives Predict performance of parallel programs Understand barriers to higher

More information

Algorithm-Based Fault Tolerance for Fail-Stop Failures

Algorithm-Based Fault Tolerance for Fail-Stop Failures IEEE TRANSCATIONS ON PARALLEL AND DISTRIBUTED SYSYTEMS Algorithm-Based Fault Tolerance for Fail-Stop Failures Zizhong Chen, Member, IEEE, and Jack Dongarra, Fellow, IEEE Abstract Fail-stop failures in

More information

Lecture 4: Principles of Parallel Algorithm Design (part 4)

Lecture 4: Principles of Parallel Algorithm Design (part 4) Lecture 4: Principles of Parallel Algorithm Design (part 4) 1 Mapping Technique for Load Balancing Minimize execution time Reduce overheads of execution Sources of overheads: Inter-process interaction

More information

Parallel Graph Algorithms

Parallel Graph Algorithms Parallel Graph Algorithms Design and Analysis of Parallel Algorithms 5DV050/VT3 Part I Introduction Overview Graphs definitions & representations Minimal Spanning Tree (MST) Prim s algorithm Single Source

More information

Performance without Pain = Productivity Data Layout and Collective Communication in UPC

Performance without Pain = Productivity Data Layout and Collective Communication in UPC Performance without Pain = Productivity Data Layout and Collective Communication in UPC By Rajesh Nishtala (UC Berkeley), George Almási (IBM Watson Research Center), Călin Caşcaval (IBM Watson Research

More information

Maunam: A Communication-Avoiding Compiler

Maunam: A Communication-Avoiding Compiler Maunam: A Communication-Avoiding Compiler Karthik Murthy Advisor: John Mellor-Crummey Department of Computer Science Rice University! karthik.murthy@rice.edu February 2014!1 Knights Tour Closed Knight's

More information