Parallel Numerical Algorithms
1 Parallel Numerical Algorithms
Chapter 3: Dense Linear Systems
Section 3.1: Vector and Matrix Products
Michael T. Heath and Edgar Solomonik
Department of Computer Science, University of Illinois at Urbana-Champaign
CS 554 / CSE 512
Michael T. Heath and Edgar Solomonik, Parallel Numerical Algorithms, 1 / 77
2 Outline
BLAS
3 Basic Linear Algebra Subprograms
Basic Linear Algebra Subprograms (BLAS) are building blocks for many other matrix computations
BLAS encapsulate basic operations on vectors and matrices so they can be optimized for a particular computer architecture while the high-level routines that call them remain portable
BLAS offer good opportunities for optimizing utilization of the memory hierarchy
Generic BLAS are available from netlib, and many computer vendors provide custom versions optimized for their particular systems
4 Examples of BLAS

Level  Work    Example  Function
1      O(n)    daxpy    scalar times vector plus vector
               ddot     inner product
               dnrm2    Euclidean vector norm
2      O(n^2)  dgemv    matrix-vector product
               dtrsv    triangular solve
               dger     outer product
3      O(n^3)  dgemm    matrix-matrix product
               dtrsm    multiple triangular solves
               dsyrk    symmetric rank-k update

Effective time per flop decreases with BLAS level: γ_1 > γ_2 > γ_3, where γ_k is the effective sec/flop of BLAS level k
5 Inner product of two n-vectors x and y is given by
x^T y = sum_{i=1}^n x_i y_i
Computation of the inner product requires n multiplications and n - 1 additions, so
M_1 = Θ(n), Q_1 = Θ(n), T_1 = Θ(γn)
Effectively as hard as a scalar reduction; can be done via binary or binomial tree summation
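The tree summation mentioned above can be sketched as a serial simulation: n local products followed by a pairwise (binary-tree) reduction with Θ(log n) levels. This is an illustrative sketch, not the slides' code; the names `tree_sum` and `inner_product` are ours.

```python
# Hedged sketch: inner product via binary-tree summation (serial simulation).
def tree_sum(values):
    """Sum a list by pairwise combination; each while-iteration is one tree level."""
    while len(values) > 1:
        # combine neighboring partial sums at this level of the tree
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
    return values[0]

def inner_product(x, y):
    # n local multiplications, then a tree reduction of the n partial products
    return tree_sum([xi * yi for xi, yi in zip(x, y)])

print(inner_product([1, 2, 3, 4], [5, 6, 7, 8]))  # 1*5 + 2*6 + 3*7 + 4*8 = 70
```

In the parallel setting each level of the while-loop corresponds to one round of messages, which is where the Θ(α log p) latency term below comes from.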
6 Partition: for i = 1, …, n, fine-grain task i stores x_i and y_i, and computes their product x_i y_i
Communicate: sum reduction over the n fine-grain tasks
(figure: array of fine-grain tasks, each holding a product x_i y_i)
7 Fine-Grain Algorithm
z_i = x_i y_i                             { local scalar product }
reduce z_i across all tasks i = 1, …, n   { sum reduction }
8 Agglomeration and Mapping
Agglomerate: combine k components of both x and y to form each coarse-grain task, which computes the inner product of these subvectors
Communication becomes a sum reduction over n/k coarse-grain tasks
Map: assign (n/k)/p coarse-grain tasks to each of p processors, for a total of n/p components of x and y per processor
(figure: coarse-grain task computing x_7 y_7 + x_8 y_8 + x_9 y_9)
9 Coarse-Grain Algorithm
z_i = x_[i]^T y_[i]                            { local inner product }
reduce z_i across all processors i = 1, …, p   { sum reduction }
[ x_[i] denotes the subvector of x assigned to processor i ]
10 Performance
The parallel costs (S_p, W_p, F_p) for the inner product are given by:
Computational cost F_p = Θ(n/p) regardless of network
The latency and bandwidth costs depend on the network:
1-D mesh: S_p, W_p = Θ(p)
2-D mesh: S_p, W_p = Θ(√p)
hypercube: S_p, W_p = Θ(log p)
For a hypercube or fully connected network, the time is
T_p = αS_p + βW_p + γF_p = Θ(α log(p) + γn/p)
Efficiency and scaling are the same as for binary tree summation
11 Inner Product on 1-D Mesh
For a 1-D mesh, total time is T_p = Θ(γn/p + αp)
To determine strong scalability, we set constant efficiency and solve for p_s:
const = E_{p_s} = T_1 / (p_s T_{p_s}) = Θ(γn / (γn + αp_s^2)) = Θ(1 / (1 + (α/γ)p_s^2/n))
which yields p_s = Θ(√((γ/α)n))
The 1-D mesh is weakly scalable to p_w = Θ((γ/α)n) processors:
E_{p_w}(p_w n) = Θ(1 / (1 + (α/γ)p_w^2/(p_w n))) = Θ(1 / (1 + (α/γ)p_w/n))
12 Inner Product on 2-D Mesh
For a 2-D mesh, total time is T_p = Θ(γn/p + α√p)
To determine strong scalability, we set constant efficiency and solve for p_s:
const = E_{p_s} = T_1 / (p_s T_{p_s}) = Θ(γn / (γn + αp_s^{3/2})) = Θ(1 / (1 + (α/γ)p_s^{3/2}/n))
which yields p_s = Θ((γ/α)^{2/3} n^{2/3})
The 2-D mesh is weakly scalable to p_w = Θ((γ/α)^2 n^2), since
E_{p_w}(p_w n) = Θ(1 / (1 + (α/γ)p_w^{3/2}/(p_w n))) = Θ(1 / (1 + (α/γ)√p_w/n))
13 Outer product of two n-vectors x and y is the n × n matrix Z = xy^T whose (i, j) entry is z_ij = x_i y_j
Computation of the outer product requires n^2 multiplications, so
M_1 = Θ(n^2), Q_1 = Θ(n^2), T_1 = Θ(γn^2)
(in this case, we should treat M_1 as output size, or define the problem as in the BLAS: Z = Z_input + xy^T)
14 Partition: for i, j = 1, …, n, fine-grain task (i, j) computes and stores z_ij = x_i y_j, yielding a 2-D array of n^2 fine-grain tasks
Assuming no replication of data, at most 2n fine-grain tasks store components of x and y, say either
  for some j, task (i, j) stores x_i and task (j, i) stores y_i, or
  task (i, i) stores both x_i and y_i, i = 1, …, n
Communicate:
  For i = 1, …, n, the task that stores x_i broadcasts it to all other tasks in the ith task row
  For j = 1, …, n, the task that stores y_j broadcasts it to all other tasks in the jth task column
15 Fine-Grain Tasks and Communication (figure)
16 Fine-Grain Algorithm
broadcast x_i to tasks (i, k), k = 1, …, n   { horizontal broadcast }
broadcast y_j to tasks (k, j), k = 1, …, n   { vertical broadcast }
z_ij = x_i y_j                               { local scalar product }
17 Agglomeration
With an n × n array of fine-grain tasks, natural strategies are:
2-D: combine a k × k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k)^2 coarse-grain tasks
1-D column: combine the n fine-grain tasks in each column into one coarse-grain task, yielding n coarse-grain tasks
1-D row: combine the n fine-grain tasks in each row into one coarse-grain task, yielding n coarse-grain tasks
18 2-D Agglomeration
Each task that stores a portion of x must broadcast its subvector to all other tasks in its task row
Each task that stores a portion of y must broadcast its subvector to all other tasks in its task column
19 2-D Agglomeration (figure)
20 1-D Agglomeration
If either x or y is stored in one task, then a broadcast is required to communicate the needed values to all other tasks
If either x or y is distributed across tasks, then a multinode broadcast is required to communicate the needed values to other tasks
21 1-D Column Agglomeration (figure)
22 1-D Row Agglomeration (figure)
23 Mapping
2-D: assign (n/k)^2/p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating the target network as a 2-D mesh
1-D: assign n/p coarse-grain tasks to each of p processors using any desired mapping, treating the target network as a 1-D mesh
24 2-D Agglomeration with Block Mapping (figure)
25 1-D Column Agglomeration with Block Mapping (figure)
26 1-D Row Agglomeration with Block Mapping (figure)
27 Coarse-Grain Algorithm
broadcast x_[i] to ith process row      { horizontal broadcast }
broadcast y_[j] to jth process column   { vertical broadcast }
Z_[i][j] = x_[i] y_[j]^T                { local outer product }
[ Z_[i][j] denotes the submatrix of Z assigned to process (i, j) by the mapping ]
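The coarse-grain 2-D outer product can be sketched as a serial simulation on a virtual process grid, with the two broadcasts replaced by slicing into the distributed subvectors. A hedged illustration, not the slides' code; `grid_dim` stands in for √p.

```python
# Serial simulation of the 2-D coarse-grain outer product Z = x y^T.
def outer_product_2d(x, y, grid_dim):
    n = len(x)
    b = n // grid_dim                       # block size (assumes grid_dim divides n)
    Z = [[0.0] * n for _ in range(n)]
    for pi in range(grid_dim):              # process row index
        for pj in range(grid_dim):          # process column index
            xi = x[pi * b:(pi + 1) * b]     # x_[i], broadcast along process row pi
            yj = y[pj * b:(pj + 1) * b]     # y_[j], broadcast along process column pj
            for i, xv in enumerate(xi):     # local outer product of the two blocks
                for j, yv in enumerate(yj):
                    Z[pi * b + i][pj * b + j] = xv * yv
    return Z

Z = outer_product_2d([1, 2, 3, 4], [5, 6, 7, 8], grid_dim=2)
print(Z[1][2])  # x[1] * y[2] = 2 * 7 = 14
```

Each process touches only its own n/√p-sized slices of x and y, which is why the per-process bandwidth cost below is Θ(n/√p) rather than Θ(n).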
28 Performance
The parallel costs (S_p, W_p, F_p) for the outer product are given by:
Computational cost F_p = Θ(n^2/p) regardless of network
The latency and bandwidth costs can be derived from the cost of broadcast:
1-D agglomeration: S_p = Θ(log p), W_p = Θ(n)
2-D agglomeration: S_p = Θ(log p), W_p = Θ(n/√p)
For 1-D agglomeration, execution time is
T_p^{1-D} = T_p^{bcast}(n) + Θ(γn^2/p) = Θ(α log(p) + βn + γn^2/p)
For 2-D agglomeration, execution time is
T_p^{2-D} = 2T_{√p}^{bcast}(n/√p) + Θ(γn^2/p) = Θ(α log(p) + βn/√p + γn^2/p)
29 Strong Scaling
1-D agglomeration is strongly scalable to
p_s = Θ(min((γ/α)n^2 / log((γ/α)n^2), (γ/β)n)) processors, since
E_{p_s}^{1-D} = Θ(1 / (1 + (α/γ) log(p_s) p_s/n^2 + (β/γ)p_s/n))
2-D agglomeration is strongly scalable to
p_s = Θ(min((γ/α)n^2 / log((γ/α)n^2), (γ/β)^2 n^2)) processors, since
E_{p_s}^{2-D} = Θ(1 / (1 + (α/γ) log(p_s) p_s/n^2 + (β/γ)√p_s/n))
30 Weak Scaling
1-D agglomeration is weakly scalable to
p_w = Θ(min(2^{(γ/α)n^2}, (γ/β)^2 n^2)) processors, since
E_{p_w}^{1-D}(√p_w n) = Θ(1 / (1 + (α/γ) log(p_w)/n^2 + (β/γ)√p_w/n))
2-D agglomeration is weakly scalable to
p_w = Θ(2^{(γ/α)n^2}) processors, since
E_{p_w}^{2-D}(√p_w n) = Θ(1 / (1 + (α/γ) log(p_w)/n^2 + (β/γ)/n))
31 Memory Requirements
The memory requirements are dominated by storing Z, which in practice makes the outer product a poor primitive (local flop-to-byte ratio of 1)
If possible, Z should be operated on as it is computed, e.g. if we really need v_i = sum_j f(x_i y_j) for some scalar function f
If Z does not need to be stored, vector storage dominates:
Without further modification, the 1-D algorithm requires M_p = Θ(np) overall storage for the vector
Without further modification, the 2-D algorithm requires M_p = Θ(n√p) overall storage for the vector
32 Consider the matrix-vector product y = Ax, where A is an n × n matrix and x and y are n-vectors
Components of the vector y are given by the inner products
y_i = sum_{j=1}^n a_ij x_j,  i = 1, …, n
The sequential memory, work, and time are
M_1 = Θ(n^2), Q_1 = Θ(n^2), T_1 = Θ(γn^2)
33 Partition: for i, j = 1, …, n, fine-grain task (i, j) stores a_ij and computes a_ij x_j, yielding a 2-D array of n^2 fine-grain tasks
Assuming no replication of data, at most 2n fine-grain tasks store components of x and y, say either
  for some j, task (j, i) stores x_i and task (i, j) stores y_i, or
  task (i, i) stores both x_i and y_i, i = 1, …, n
Communicate:
  For j = 1, …, n, the task that stores x_j broadcasts it to all other tasks in the jth task column
  For i = 1, …, n, a sum reduction over the ith task row gives y_i
34 Fine-Grain Tasks and Communication (figure: 6 × 6 array of tasks holding a_11 through a_66)
35 Fine-Grain Algorithm
broadcast x_j to tasks (k, j), k = 1, …, n    { vertical broadcast }
y_i = a_ij x_j                                { local scalar product }
reduce y_i across tasks (i, k), k = 1, …, n   { horizontal sum reduction }
36 Agglomeration
With an n × n array of fine-grain tasks, natural strategies are:
2-D: combine a k × k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k)^2 coarse-grain tasks
1-D column: combine the n fine-grain tasks in each column into one coarse-grain task, yielding n coarse-grain tasks
1-D row: combine the n fine-grain tasks in each row into one coarse-grain task, yielding n coarse-grain tasks
37 2-D Agglomeration
Subvector of x is broadcast along each task column
Each task computes the local matrix-vector product of its submatrix of A with its subvector of x
Sum reduction along each task row produces a subvector of the result y
38 2-D Agglomeration (figure)
39 1-D Agglomeration
1-D column agglomeration:
  Each task computes the product of its component of x with its column of the matrix, with no communication required
  Sum reduction across tasks then produces y
1-D row agglomeration:
  If x is stored in one task, then a broadcast is required to communicate the needed values to all other tasks
  If x is distributed across tasks, then a multinode broadcast is required to communicate the needed values to other tasks
  Each task computes the inner product of its row of A with the entire vector x to produce its component of y
40 1-D Column Agglomeration (figure)
41 1-D Row Agglomeration (figure)
42 1-D Agglomeration
Column and row algorithms are dual to each other
The column algorithm begins with communication-free local vector scaling (daxpy) computations, combined across processors by a reduction
The row algorithm begins with a broadcast, followed by communication-free local inner-product (ddot) computations
43 Mapping
2-D: assign (n/k)^2/p coarse-grain tasks to each of p processes using any desired mapping in each dimension, treating the target network as a 2-D mesh
1-D: assign n/p coarse-grain tasks to each of p processes using any desired mapping, treating the target network as a 1-D mesh
44 2-D Agglomeration with Block Mapping (figure)
45 1-D Column Agglomeration with Block Mapping (figure)
46 1-D Row Agglomeration with Block Mapping (figure)
47 Coarse-Grain Algorithm
broadcast x_[j] to jth process column   { vertical broadcast }
y_[i] = A_[i][j] x_[j]                  { local matrix-vector product }
reduce y_[i] across ith process row     { horizontal sum reduction }
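The coarse-grain 2-D matrix-vector product can be sketched as a serial simulation, with the vertical broadcast and horizontal reduction replaced by block indexing on a virtual process grid. A hedged illustration with our own names; `grid_dim` plays the role of √p.

```python
# Serial simulation of the 2-D coarse-grain matrix-vector product y = A x.
def matvec_2d(A, x, grid_dim):
    n = len(x)
    b = n // grid_dim                        # block size (assumes grid_dim divides n)
    y = [0.0] * n
    for pi in range(grid_dim):               # process row
        for pj in range(grid_dim):           # process column
            xj = x[pj * b:(pj + 1) * b]      # x_[j], broadcast down column pj
            for i in range(b):               # local matrix-vector product with A_[i][j]
                y[pi * b + i] += sum(A[pi * b + i][pj * b + j] * xj[j]
                                     for j in range(b))
    return y                                 # += plays the role of the row reduction
```

Running with `grid_dim=1` degenerates to the serial algorithm, which makes the simulation easy to sanity-check against a direct computation.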
48 Performance
The parallel costs (S_p, W_p, F_p) for the matrix-vector product are:
Computational cost F_p = Θ(n^2/p) regardless of network
The latency and bandwidth costs can be derived from the cost of broadcast:
1-D agglomeration: S_p = Θ(log p), W_p = Θ(n)
2-D agglomeration: S_p = Θ(log p), W_p = Θ(n/√p)
For 1-D row agglomeration, execution time is
T_p^{1-D} = T_p^{bcast}(n) + Θ(γn^2/p) = Θ(α log(p) + βn + γn^2/p)
For 2-D agglomeration, using T^{bcast} = T^{reduce}, the time is
T_p^{2-D} = 2T_{√p}^{bcast}(n/√p) + Θ(γn^2/p) = Θ(α log(p) + βn/√p + γn^2/p)
So scalability is essentially the same as for the outer product
49 Consider the matrix-matrix product C = AB, where A, B, and the result C are n × n matrices
Entries of C are given by
c_ij = sum_{k=1}^n a_ik b_kj,  i, j = 1, …, n
Each of the n^2 entries of C requires n multiply-add operations, so we model the serial time as T_1 = γn^3
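As a concrete baseline for this cost model, the serial triple-loop product performs exactly n multiply-adds per entry of C. A plain sketch, not a tuned dgemm:

```python
# Naive serial matrix-matrix product: n^2 entries, n multiply-adds each.
def matmul(A, B):
    n, m, k = len(A), len(B[0]), len(B)
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for l in range(k):              # n multiply-adds for entry c_ij
                C[i][j] += A[i][l] * B[l][j]
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```

The parallel algorithms that follow all compute the same γn^3 flops; they differ only in how the three loop dimensions are partitioned across processors.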
50 Matrix-matrix product can be viewed as n^2 inner products, or a sum of n outer products, or n matrix-vector products, and each viewpoint yields a different algorithm
One way to derive parallel algorithms for the matrix-matrix product is to apply the parallel algorithms already developed for the inner product, outer product, or matrix-vector product
However, considering the problem as a whole yields the best algorithms
51 Partition: for i, j, k = 1, …, n, fine-grain task (i, j, k) computes the product a_ik b_kj, yielding a 3-D array of n^3 fine-grain tasks
Assuming no replication of data, at most 3n^2 fine-grain tasks store entries of A, B, or C, say
  task (i, j, j) stores a_ij, task (i, j, i) stores b_ij, and task (i, j, k) stores c_ij for i, j = 1, …, n and some fixed k
We refer to subsets of tasks along the i, j, and k dimensions as rows, columns, and layers, respectively, so the kth column of A and the kth row of B are stored in the kth layer of tasks
(figure: 3-D task array with axes i, j, k)
52 Communicate:
Broadcast entries of the jth column of A horizontally along each task row in the jth layer
Broadcast entries of the ith row of B vertically along each task column in the ith layer
For i, j = 1, …, n, the result c_ij is given by a sum reduction over tasks (i, j, k), k = 1, …, n
53 Fine-Grain Algorithm
broadcast a_ik to tasks (i, q, k), q = 1, …, n    { horizontal broadcast }
broadcast b_kj to tasks (q, j, k), q = 1, …, n    { vertical broadcast }
c_ij = a_ik b_kj                                  { local scalar product }
reduce c_ij across tasks (i, j, q), q = 1, …, n   { lateral sum reduction }
54 Agglomeration
With an n × n × n array of fine-grain tasks, natural strategies are:
3-D: combine a q × q × q subarray of fine-grain tasks
2-D: combine a q × q × n subarray of fine-grain tasks, eliminating sum reductions
1-D column: combine an n × 1 × n subarray of fine-grain tasks, eliminating vertical broadcasts and sum reductions
1-D row: combine a 1 × n × n subarray of fine-grain tasks, eliminating horizontal broadcasts and sum reductions
55 Mapping
Corresponding mapping strategies are:
3-D: assign (n/q)^3/p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating the target network as a 3-D mesh
2-D: assign (n/q)^2/p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating the target network as a 2-D mesh
1-D: assign n/p coarse-grain tasks to each of p processors using any desired mapping, treating the target network as a 1-D mesh
56 Agglomeration with Block Mapping (figure: 1-D row, 1-D column, 2-D, and 3-D agglomerations)
57 Coarse-Grain 3-D Algorithm
broadcast A_[i][k] to ith processor row       { horizontal broadcast }
broadcast B_[k][j] to jth processor column    { vertical broadcast }
C_[i][j] = A_[i][k] B_[k][j]                  { local matrix product }
reduce C_[i][j] across processor layers       { lateral sum reduction }
58 Agglomeration with Block Mapping
With a 2 × 2 block partition, the product is
[ A_11 A_12 ] [ B_11 B_12 ]   [ A_11 B_11 + A_12 B_21   A_11 B_12 + A_12 B_22 ]
[ A_21 A_22 ] [ B_21 B_22 ] = [ A_21 B_11 + A_22 B_21   A_21 B_12 + A_22 B_22 ]
(the slide shows this block decomposition under the 2-D, 1-D column, and 1-D row agglomerations)
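The block identities above can be checked numerically. In this sketch (the helper names `mm`, `madd`, and `block` are ours, not the slides'), C_11 = A_11 B_11 + A_12 B_21 is formed from 2 × 2 blocks and compared with the corresponding block of the full product:

```python
def mm(A, B):
    """Plain matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def madd(X, Y):
    """Entrywise matrix sum."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def block(M, bi, bj, b):
    """Extract the (bi, bj) block of size b x b."""
    return [row[bj * b:(bj + 1) * b] for row in M[bi * b:(bi + 1) * b]]

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
b = 2
# C_11 = A_11 B_11 + A_12 B_21, per the block decomposition above
C11 = madd(mm(block(A, 0, 0, b), block(B, 0, 0, b)),
           mm(block(A, 0, 1, b), block(B, 1, 0, b)))
assert C11 == block(mm(A, B), 0, 0, b)
```

The same identity, applied once per block per step, is exactly what the 2-D algorithms below compute.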
59 Coarse-Grain 2-D Algorithm
allgather A_[i][j] in ith processor row       { horizontal broadcast }
allgather B_[i][j] in jth processor column    { vertical broadcast }
C_[i][j] = 0
for k = 1, …, √p
    C_[i][j] = C_[i][j] + A_[i][k] B_[k][j]   { sum local products }
end
60 SUMMA Algorithm
The algorithm just described requires excessive memory, since each process accumulates √p blocks of both A and B
One way to reduce the memory requirements is to broadcast blocks of A successively across processor rows, and broadcast blocks of B successively across processor columns:
C_[i][j] = 0
for k = 1, …, √p
    broadcast A_[i][k] in ith processor row      { horizontal broadcast }
    broadcast B_[k][j] in jth processor column   { vertical broadcast }
    C_[i][j] = C_[i][j] + A_[i][k] B_[k][j]      { sum local products }
end
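SUMMA's outer loop can be sketched as a serial simulation on a g × g virtual process grid (g standing in for √p): at step k, the kth block column of A is "broadcast" along process rows and the kth block row of B along process columns, and every process accumulates one local block product. A hedged illustration, not a distributed implementation:

```python
# Serial simulation of SUMMA on a g x g virtual process grid.
def summa(A, B, g):
    n = len(A)
    b = n // g                                # block size (assumes g divides n)
    C = [[0.0] * n for _ in range(n)]
    for k in range(g):                        # g = sqrt(p) broadcast steps
        for pi in range(g):                   # every process (pi, pj) in parallel:
            for pj in range(g):
                for i in range(b):            # C_[i][j] += A_[i][k] B_[k][j]
                    for j in range(b):
                        C[pi * b + i][pj * b + j] += sum(
                            A[pi * b + i][k * b + l] * B[k * b + l][pj * b + j]
                            for l in range(b))
    return C
```

Note that at any moment each process holds only one remote block of A and one of B, which is the memory saving over the allgather variant.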
61 SUMMA Algorithm (figure: broadcasts of a block column of A and a block row of B on a 4 × 4 grid of 16 processors)
62 Cannon Algorithm
Another approach, due to Cannon (1969), is to circulate blocks of B vertically and blocks of A horizontally in ring fashion
Blocks of both matrices must be initially aligned using circular shifts so that the correct blocks meet as needed
Requires less memory than SUMMA, and replaces line broadcasts with shifts, lowering the number of messages
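Cannon's skew-and-shift pattern can be sketched as a serial simulation on a g × g grid of scalar "blocks" (block size 1 for clarity): after the initial alignment, each of the g steps multiplies the resident pair and then shifts A left and B up. A hedged illustration under that simplifying assumption:

```python
# Serial simulation of Cannon's algorithm with 1 x 1 blocks on a g x g grid.
def cannon(A, B):
    g = len(A)
    # initial alignment: shift row i of A left by i, column j of B up by j
    A = [A[i][i:] + A[i][:i] for i in range(g)]
    B = [[B[(i + j) % g][j] for j in range(g)] for i in range(g)]
    C = [[0.0] * g for _ in range(g)]
    for _ in range(g):
        for i in range(g):
            for j in range(g):
                C[i][j] += A[i][j] * B[i][j]    # local product of resident blocks
        A = [row[1:] + row[:1] for row in A]    # circular shift A left by one
        B = B[1:] + B[:1]                       # circular shift B up by one
    return C

print(cannon([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```

After the skew, process (i, j) holds a_{i,(i+j+s) mod g} and b_{(i+j+s) mod g, j} at step s, so over g steps the index k sweeps all values and C accumulates the full sum.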
63 Cannon Algorithm (figure)
64 Fox Algorithm
It is possible to mix techniques from SUMMA and Cannon's algorithm: circulate blocks of B in ring fashion vertically along processor columns, step by step, so that each block of B comes in conjunction with the appropriate block of A broadcast at that same step
This algorithm is due to Fox et al.
Shifts, especially in Cannon's algorithm, are harder to generalize to nonsquare processor grids than the collectives in algorithms like SUMMA
65 Execution Time for 3-D Agglomeration
For 3-D agglomeration, computing each of the p blocks C_[i][j] requires a matrix-matrix product of two (n/p^{1/3}) × (n/p^{1/3}) blocks, so
F_p = (n/p^{1/3})^3 = n^3/p
On a 3-D mesh, each broadcast or reduction takes time
T^{bcast}_{p^{1/3}}((n/p^{1/3})^2) = O(α log p + βn^2/p^{2/3})
Total time is therefore
T_p = O(α log p + βn^2/p^{2/3} + γn^3/p)
However, the overall memory footprint is
M_p = Θ(p(n/p^{1/3})^2) = Θ(p^{1/3} n^2)
66 Strong Scaling of 3-D Agglomeration
The 3-D agglomeration efficiency is given by
E_p(n) = T_1(n)/(p T_p(n)) = O(1/(1 + (α/γ)p log p/n^3 + (β/γ)p^{1/3}/n))
For strong scaling to p_s processors we need E_{p_s}(n) = Θ(1), so 3-D agglomeration strong scales to
p_s = O(min((γ/α)n^3/log(n), (γ/β)^3 n^3)) processors
67 Weak Scaling of 3-D Agglomeration
For weak scaling (with constant input size per processor) to p_w processors, we need E_{p_w}(n√p_w) = Θ(1), which holds
However, note that the algorithm is not memory-efficient, as M_p = Θ(p^{1/3} n^2); to keep the memory footprint per processor constant, we must grow n with p^{1/3}
Scaling with constant memory footprint per processor (M_p/p = const) is possible to p_m processors, where E_{p_m}(n p_m^{1/3}) = Θ(1), which holds for p_m = Θ(2^{(γ/α)n^3}) processors
The isoefficiency function is Q(p) = Θ(p log(p))
68 Execution Time for 2-D Agglomeration
For 2-D agglomeration (SUMMA), computation of each block C_[i][j] requires √p matrix-matrix products of (n/√p) × (n/√p) blocks, so
F_p = √p (n/√p)^3 = n^3/p
For the broadcasts among rows and columns of the processor grid, the communication time is
2√p T^{bcast}_{√p}(n^2/p) = Θ(α√p log(p) + βn^2/√p)
Total time is therefore
T_p = O(α√p log(p) + βn^2/√p + γn^3/p)
The algorithm is memory-efficient: M_p = Θ(n^2)
69 Strong Scaling of 2-D Agglomeration
The 2-D agglomeration efficiency is given by
E_p(n) = T_1(n)/(p T_p(n)) = O(1/(1 + (α/γ)p^{3/2} log p/n^3 + (β/γ)√p/n))
For strong scaling to p_s processors we need E_{p_s}(n) = Θ(1), so 2-D agglomeration strong scales to
p_s = O(min((γ/α)^{2/3} n^2/log(n)^{2/3}, (γ/β)^2 n^2)) processors
For weak scaling to p_w processors with n^2/p matrix elements per processor, we need E_{p_w}(√p_w n) = Θ(1), so 2-D agglomeration (SUMMA) weak scales to p_w = O(2^{(γ/α)n^3}) processors
Cannon's algorithm achieves unconditional weak scalability
70 Scaling for 1-D Agglomeration
For 1-D agglomeration on a 1-D mesh, total time is
T_p = O(α log(p) + βn^2 + γn^3/p)
The corresponding efficiency is
E_p = O(1/(1 + (α/γ)p log(p)/n^3 + (β/γ)p/n))
Strong scalability is possible to at most p_s = O((γ/β)n) processors
Weak scalability is possible to at most p_w = O(√((γ/β)n)) processors
71 Rectangular Matrix Multiplication
If C is m × n, A is m × k, and B is k × n, choosing a 3-D grid that optimizes the volume-to-surface-area ratio yields bandwidth cost
W_p(m, n, k) = O( min_{p_1 p_2 p_3 = p} [ mk/(p_1 p_2) + kn/(p_1 p_3) + mn/(p_2 p_3) ] )
72 Communication vs. Memory Tradeoff
The communication cost of 2-D algorithms for the matrix-matrix product is optimal, assuming no replication of storage
If explicit replication of storage is allowed, then lower communication volume is possible via a 3-D algorithm
Generally, we assign an (n/p_1) × (n/p_2) × (n/p_3) computation subvolume to each processor
The largest face of the subvolume gives the communication cost; the smallest face gives the minimal memory usage, since we can keep the smallest face local while successively computing slices of the subvolume
73 Leveraging Additional Memory in Matrix Multiplication
Provided M total available memory, 2-D and 3-D algorithms achieve bandwidth cost
W_p(n, M) = { not possible   : M < n^2 (insufficient memory)
            { n^2/√p        : M = Θ(n^2)
            { n^2/p^{2/3}   : M = Θ(n^2 p^{1/3})
For general M, it is possible to pick the processor grid to achieve
W_p(n, M) = O(n^3/(√p M^{1/2}) + n^2/p^{2/3})
and an overall execution time of
T_p(n, M) = O(α(log p + n^3 √p/M^{3/2}) + βW_p(n, M) + γn^3/p)
74 Strong Scaling using All Available Memory
The efficiency of the algorithm for a given amount of total memory M_p is
E_p(n, M_p) = O(1/(1 + (α/γ)(p log p/n^3 + p^{3/2}/M_p^{3/2}) + (β/γ)(√p/M_p^{1/2} + p^{1/3}/n)))
For strong scaling, assuming M_p = pM_1, we need
E_{p_s}(n, p_s M_1) = T_1(n, M_1)/(p_s T_{p_s}(n, p_s M_1)) = Θ(1)
In this case, we obtain
p_s = Θ(min((γ/α)n^3/log(n), (γ/β)^3 n^3))
as good as the 3-D algorithm, but more memory-efficient
75 References
R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar, A three-dimensional approach to parallel matrix multiplication, IBM J. Res. Dev., 39, 1995
J. Berntsen, Communication efficient matrix multiplication on hypercubes, Parallel Comput. 12, 1989
J. Demmel, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger, Communication-optimal parallel recursive rectangular matrix multiplication, IPDPS, 2013
J. W. Demmel, M. T. Heath, and H. A. van der Vorst, Parallel numerical linear algebra, Acta Numerica 2, 1993
76 References
R. Dias da Cunha, A benchmark study based on the parallel computation of the vector outer-product A = uv^T operation, Concurrency: Practice and Experience 9, 1997
G. C. Fox, S. W. Otto, and A. J. G. Hey, Matrix algorithms on a hypercube I: matrix multiplication, Parallel Comput. 4:17-31, 1987
D. Irony, S. Toledo, and A. Tiskin, Communication lower bounds for distributed-memory matrix multiplication, J. Parallel Distrib. Comput. 64, 2004
S. L. Johnsson, Communication efficient basic linear algebra computations on hypercube architectures, J. Parallel Distrib. Comput. 4(2), 1987
77 References
S. L. Johnsson, Minimizing the communication time for matrix multiplication on multiprocessors, Parallel Comput. 19, 1993
B. Lipshitz, Communication-avoiding parallel recursive algorithms for matrix multiplication, Tech. Rept. UCB/EECS, University of California at Berkeley, 2013
O. McBryan and E. F. Van de Velde, Matrix and vector operations on hypercube parallel processors, Parallel Comput. 5, 1987
R. A. Van De Geijn and J. Watts, SUMMA: Scalable universal matrix multiplication algorithm, Concurrency: Practice and Experience 9(4), 1997
More informationMatrix Multiplication
Matrix Multiplication Nur Dean PhD Program in Computer Science The Graduate Center, CUNY 05/01/2017 Nur Dean (The Graduate Center) Matrix Multiplication 05/01/2017 1 / 36 Today, I will talk about matrix
More informationLecture 3: Sorting 1
Lecture 3: Sorting 1 Sorting Arranging an unordered collection of elements into monotonically increasing (or decreasing) order. S = a sequence of n elements in arbitrary order After sorting:
More informationCS 770G - Parallel Algorithms in Scientific Computing
CS 770G - Parallel lgorithms in Scientific Computing Dense Matrix Computation II: Solving inear Systems May 28, 2001 ecture 6 References Introduction to Parallel Computing Kumar, Grama, Gupta, Karypis,
More informationDense LU Factorization
Dense LU Factorization Dr.N.Sairam & Dr.R.Seethalakshmi School of Computing, SASTRA Univeristy, Thanjavur-613401. Joint Initiative of IITs and IISc Funded by MHRD Page 1 of 6 Contents 1. Dense LU Factorization...
More informationBasic Communication Operations Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
Basic Communication Operations Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003 Topic Overview One-to-All Broadcast
More informationLecture 4: Graph Algorithms
Lecture 4: Graph Algorithms Definitions Undirected graph: G =(V, E) V finite set of vertices, E finite set of edges any edge e = (u,v) is an unordered pair Directed graph: edges are ordered pairs If e
More informationWith regards to bitonic sorts, everything boils down to doing bitonic sorts efficiently.
Lesson 2 5 Distributed Memory Sorting Bitonic Sort The pseudo code for generating a bitonic sequence: 1. create two bitonic subsequences 2. Make one an increasing sequence 3. Make one into a decreasing
More informationDense matrix algebra and libraries (and dealing with Fortran)
Dense matrix algebra and libraries (and dealing with Fortran) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Dense matrix algebra and libraries (and dealing with Fortran)
More informationA Few Numerical Libraries for HPC
A Few Numerical Libraries for HPC CPS343 Parallel and High Performance Computing Spring 2016 CPS343 (Parallel and HPC) A Few Numerical Libraries for HPC Spring 2016 1 / 37 Outline 1 HPC == numerical linear
More informationLecture 13. Writing parallel programs with MPI Matrix Multiplication Basic Collectives Managing communicators
Lecture 13 Writing parallel programs with MPI Matrix Multiplication Basic Collectives Managing communicators Announcements Extra lecture Friday 4p to 5.20p, room 2154 A4 posted u Cannon s matrix multiplication
More informationSCALABILITY ANALYSIS
SCALABILITY ANALYSIS PERFORMANCE AND SCALABILITY OF PARALLEL SYSTEMS Evaluation Sequential: runtime (execution time) Ts =T (InputSize) Parallel: runtime (start-->last PE ends) Tp =T (InputSize,p,architecture)
More informationImplementation of the BLAS Level 3 and LINPACK Benchmark on the AP1000
Implementation of the BLAS Level 3 and LINPACK Benchmark on the AP1000 Richard P. Brent and Peter E. Strazdins Computer Sciences Laboratory and Department of Computer Science Australian National University
More informationBasic Communication Operations (Chapter 4)
Basic Communication Operations (Chapter 4) Vivek Sarkar Department of Computer Science Rice University vsarkar@cs.rice.edu COMP 422 Lecture 17 13 March 2008 Review of Midterm Exam Outline MPI Example Program:
More informationCommunication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication
ommunication-voiding Parallel Sparse-Dense Matrix-Matrix Multiplication Penporn Koanantakool 1,2, riful zad 2, ydin Buluç 2,Dmitriy Morozov 2, Sang-Yun Oh 2,3, Leonid Oliker 2, Katherine Yelick 1,2 1 omputer
More informationLecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC
Lecture 9: Group Communication Operations Shantanu Dutt ECE Dept. UIC Acknowledgement Adapted from Chapter 4 slides of the text, by A. Grama w/ a few changes, augmentations and corrections Topic Overview
More informationContents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11
Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed
More informationCOMMUNICATION IN HYPERCUBES
PARALLEL AND DISTRIBUTED ALGORITHMS BY DEBDEEP MUKHOPADHYAY AND ABHISHEK SOMANI http://cse.iitkgp.ac.in/~debdeep/courses_iitkgp/palgo/index.htm COMMUNICATION IN HYPERCUBES 2 1 OVERVIEW Parallel Sum (Reduction)
More informationMatrix multiplication on multidimensional torus networks
Matrix multiplication on multidimensional torus networks Edgar Solomonik James Demmel Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2012-28
More informationProject C/MPI: Matrix-Vector Multiplication
Master MICS: Parallel Computing Lecture Project C/MPI: Matrix-Vector Multiplication Sebastien Varrette Matrix-vector multiplication is embedded in many algorithms for solving
More informationLecture 6: Parallel Matrix Algorithms (part 3)
Lecture 6: Parallel Matrix Algorithms (part 3) 1 A Simple Parallel Dense Matrix-Matrix Multiplication Let A = [a ij ] n n and B = [b ij ] n n be n n matrices. Compute C = AB Computational complexity of
More informationParallel Reduction from Block Hessenberg to Hessenberg using MPI
Parallel Reduction from Block Hessenberg to Hessenberg using MPI Viktor Jonsson May 24, 2013 Master s Thesis in Computing Science, 30 credits Supervisor at CS-UmU: Lars Karlsson Examiner: Fredrik Georgsson
More informationLecture 12 (Last): Parallel Algorithms for Solving a System of Linear Equations. Reference: Introduction to Parallel Computing Chapter 8.
CZ4102 High Performance Computing Lecture 12 (Last): Parallel Algorithms for Solving a System of Linear Equations - Dr Tay Seng Chuan Reference: Introduction to Parallel Computing Chapter 8. 1 Topic Overview
More informationLecture 8 Parallel Algorithms II
Lecture 8 Parallel Algorithms II Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Original slides from Introduction to Parallel
More informationAll-Pairs Shortest Paths - Floyd s Algorithm
All-Pairs Shortest Paths - Floyd s Algorithm Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico October 31, 2011 CPD (DEI / IST) Parallel
More informationParallel dense linear algebra computations (1)
Parallel dense linear algebra computations (1) Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA, Spring 2008 [L.07] Tuesday, January 29, 2008 1 Sources for today s material Mike Heath
More informationLinear systems of equations
Linear systems of equations Michael Quinn Parallel Programming with MPI and OpenMP material do autor Terminology Back substitution Gaussian elimination Outline Problem System of linear equations Solve
More informationHIGH PERFORMANCE NUMERICAL LINEAR ALGEBRA. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA
1 HIGH PERFORMANCE NUMERICAL LINEAR ALGEBRA Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 2 BLAS BLAS 1, 2, 3 Performance GEMM Optimized BLAS Parallel
More informationParallelizing The Matrix Multiplication. 6/10/2013 LONI Parallel Programming Workshop
Parallelizing The Matrix Multiplication 6/10/2013 LONI Parallel Programming Workshop 2013 1 Serial version 6/10/2013 LONI Parallel Programming Workshop 2013 2 X = A md x B dn = C mn d c i,j = a i,k b k,j
More informationCS 598: Communication Cost Analysis of Algorithms Lecture 15: Communication-optimal sorting and tree-based algorithms
CS 598: Communication Cost Analysis of Algorithms Lecture 15: Communication-optimal sorting and tree-based algorithms Edgar Solomonik University of Illinois at Urbana-Champaign October 12, 2016 Defining
More informationCommunication Optimal Parallel Multiplication of Sparse Random Matrices
Communication Optimal arallel Multiplication of Sparse Random Matrices Grey Ballard Aydin Buluc James Demmel Laura Grigori Benjamin Lipshitz Oded Schwartz Sivan Toledo Electrical Engineering and Computer
More informationWhat is Parallel Computing?
What is Parallel Computing? Parallel Computing is several processing elements working simultaneously to solve a problem faster. 1/33 What is Parallel Computing? Parallel Computing is several processing
More informationParallel Programming in C with MPI and OpenMP
Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 8 Matrix-vector Multiplication Chapter Objectives Review matrix-vector multiplication Propose replication of vectors Develop three
More informationExam Design and Analysis of Algorithms for Parallel Computer Systems 9 15 at ÖP3
UMEÅ UNIVERSITET Institutionen för datavetenskap Lars Karlsson, Bo Kågström och Mikael Rännar Design and Analysis of Algorithms for Parallel Computer Systems VT2009 June 2, 2009 Exam Design and Analysis
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms Chapter 4 Sparse Linear Systems Section 4.3 Iterative Methods Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign
More informationParallel Sorting. Sathish Vadhiyar
Parallel Sorting Sathish Vadhiyar Parallel Sorting Problem The input sequence of size N is distributed across P processors The output is such that elements in each processor P i is sorted elements in P
More informationPerformance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationReducing Inter-process Communication Overhead in Parallel Sparse Matrix-Matrix Multiplication
Reducing Inter-process Communication Overhead in Parallel Sparse Matrix-Matrix Multiplication Md Salman Ahmed, Jennifer Houser, Mohammad Asadul Hoque, Phil Pfeiffer Department of Computing Department of
More informationLecture 15: More Iterative Ideas
Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!
More informationMatrix-vector Multiplication
Matrix-vector Multiplication Review matrix-vector multiplication Propose replication of vectors Develop three parallel programs, each based on a different data decomposition Outline Sequential algorithm
More informationIntroduction to Parallel & Distributed Computing Parallel Graph Algorithms
Introduction to Parallel & Distributed Computing Parallel Graph Algorithms Lecture 16, Spring 2014 Instructor: 罗国杰 gluo@pku.edu.cn In This Lecture Parallel formulations of some important and fundamental
More informationParallel Programming with MPI and OpenMP
Parallel Programming with MPI and OpenMP Michael J. Quinn Chapter 6 Floyd s Algorithm Chapter Objectives Creating 2-D arrays Thinking about grain size Introducing point-to-point communications Reading
More informationCS 598: Communication Cost Analysis of Algorithms Lecture 1: Course motivation and overview; collective communication
CS 598: Communication Cost Analysis of Algorithms Lecture 1: Course motivation and overview; collective communication Edgar Solomonik University of Illinois at Urbana-Champaign August 22, 2016 Motivation
More informationSorting Algorithms. Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
Sorting Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic Overview Issues in Sorting on Parallel
More informationParallel Programs. EECC756 - Shaaban. Parallel Random-Access Machine (PRAM) Example: Asynchronous Matrix Vector Product on a Ring
Parallel Programs Conditions of Parallelism: Data Dependence Control Dependence Resource Dependence Bernstein s Conditions Asymptotic Notations for Algorithm Analysis Parallel Random-Access Machine (PRAM)
More informationAlgorithm-Based Fault Tolerance. for Fail-Stop Failures
Algorithm-Based Fault Tolerance 1 for Fail-Stop Failures Zizhong Chen and Jack Dongarra Abstract Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging.
More informationGiovanni De Micheli. Integrated Systems Centre EPF Lausanne
Two-level Logic Synthesis and Optimization Giovanni De Micheli Integrated Systems Centre EPF Lausanne This presentation can be used for non-commercial purposes as long as this note and the copyright footers
More informationCPS222 Lecture: Parallel Algorithms Last Revised 4/20/2015. Objectives
CPS222 Lecture: Parallel Algorithms Last Revised 4/20/2015 Objectives 1. To introduce fundamental concepts of parallel algorithms, including speedup and cost. 2. To introduce some simple parallel algorithms.
More informationParallel Computing. Parallel Algorithm Design
Parallel Computing Parallel Algorithm Design Task/Channel Model Parallel computation = set of tasks Task Program Local memory Collection of I/O ports Tasks interact by sending messages through channels
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationImproving Performance of Sparse Matrix-Vector Multiplication
Improving Performance of Sparse Matrix-Vector Multiplication Ali Pınar Michael T. Heath Department of Computer Science and Center of Simulation of Advanced Rockets University of Illinois at Urbana-Champaign
More informationAvoiding communication in primal and dual block coordinate descent methods
Avoiding communication in primal and dual block coordinate descent methods Aditya Devarakonda Kimon Fountoulakis James Demmel Michael W. Mahoney Electrical Engineering and Computer Sciences University
More informationChapter 5: Analytical Modelling of Parallel Programs
Chapter 5: Analytical Modelling of Parallel Programs Introduction to Parallel Computing, Second Edition By Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar Contents 1. Sources of Overhead in Parallel
More informationCSE Introduction to Parallel Processing. Chapter 5. PRAM and Basic Algorithms
Dr Izadi CSE-40533 Introduction to Parallel Processing Chapter 5 PRAM and Basic Algorithms Define PRAM and its various submodels Show PRAM to be a natural extension of the sequential computer (RAM) Develop
More informationCSC 447: Parallel Programming for Multi- Core and Cluster Systems
CSC 447: Parallel Programming for Multi- Core and Cluster Systems Parallel Sorting Algorithms Instructor: Haidar M. Harmanani Spring 2016 Topic Overview Issues in Sorting on Parallel Computers Sorting
More informationLinear Arrays. Chapter 7
Linear Arrays Chapter 7 1. Basics for the linear array computational model. a. A diagram for this model is P 1 P 2 P 3... P k b. It is the simplest of all models that allow some form of communication between
More informationMatrix algorithms: fast, stable, communication-optimizing random?!
Matrix algorithms: fast, stable, communication-optimizing random?! Ioana Dumitriu Department of Mathematics University of Washington (Seattle) Joint work with Grey Ballard, James Demmel, Olga Holtz, Robert
More informationMathematics and Computer Science
Technical Report TR-2006-010 Revisiting hypergraph models for sparse matrix decomposition by Cevdet Aykanat, Bora Ucar Mathematics and Computer Science EMORY UNIVERSITY REVISITING HYPERGRAPH MODELS FOR
More informationParallel Programming. Matrix Decomposition Options (Matrix-Vector Product)
Parallel Programming Matrix Decomposition Options (Matrix-Vector Product) Matrix Decomposition Sequential algorithm and its complexity Design, analysis, and implementation of three parallel programs using
More informationCommunication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms
Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms Edgar Solomonik and James Demmel Department of Computer Science University of California at Berkeley, Berkeley,
More informationMultilevel Hierarchical Matrix Multiplication on Clusters
Multilevel Hierarchical Matrix Multiplication on Clusters Sascha Hunold Department of Mathematics, Physics and Computer Science University of Bayreuth, Germany Thomas Rauber Department of Mathematics,
More informationHPC Algorithms and Applications
HPC Algorithms and Applications Dwarf #5 Structured Grids Michael Bader Winter 2012/2013 Dwarf #5 Structured Grids, Winter 2012/2013 1 Dwarf #5 Structured Grids 1. dense linear algebra 2. sparse linear
More informationStudent Number: CSE191 Midterm II Spring Plagiarism will earn you an F in the course and a recommendation of expulsion from the university.
Plagiarism will earn you an F in the course and a recommendation of expulsion from the university. (1 pt each) For Questions 1-5, when asked for a running time or the result of a summation, you must choose
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #12 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Last class Outline
More informationIntroduction to Parallel Computing
Introduction to Parallel Computing George Karypis Sorting Outline Background Sorting Networks Quicksort Bucket-Sort & Sample-Sort Background Input Specification Each processor has n/p elements A ordering
More informationParallel Computation/Program Issues
Parallel Computation/Program Issues Dependency Analysis: Types of dependency Dependency Graphs Bernstein s s Conditions of Parallelism Asymptotic Notations for Algorithm Complexity Analysis Parallel Random-Access
More informationCo-Design Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures
Technical Report Co-Design Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures Ardavan Pedram Andreas Gerstlauer Robert A. van de Geijn UT-CERC-12-02 October 5, 2011 Computer Engineering
More information15. The Software System ParaLab for Learning and Investigations of Parallel Methods
15. The Software System ParaLab for Learning and Investigations of Parallel Methods 15. The Software System ParaLab for Learning and Investigations of Parallel Methods... 1 15.1. Introduction...1 15.2.
More informationParallel Combinatorial BLAS and Applications in Graph Computations
Parallel Combinatorial BLAS and Applications in Graph Computations Aydın Buluç John R. Gilbert University of California, Santa Barbara SIAM ANNUAL MEETING 2009 July 8, 2009 1 Primitives for Graph Computations
More informationScientific Computing. Some slides from James Lambers, Stanford
Scientific Computing Some slides from James Lambers, Stanford Dense Linear Algebra Scaling and sums Transpose Rank-one updates Rotations Matrix vector products Matrix Matrix products BLAS Designing Numerical
More informationOutline. Parallel Numerical Algorithms. Forward Substitution. Triangular Matrices. Solving Triangular Systems. Back Substitution. Parallel Algorithm
Outine Parae Numerica Agorithms Chapter 8 Prof. Michae T. Heath Department of Computer Science University of Iinois at Urbana-Champaign CS 554 / CSE 512 1 2 3 4 Trianguar Matrices Michae T. Heath Parae
More informationCopyright The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Chapter 8
Chapter 8 Matrix-vector Multiplication Chapter Objectives Review matrix-vector multiplicaiton Propose replication of vectors Develop three parallel programs, each based on a different data decomposition
More informationOptimizing Cache Performance in Matrix Multiplication. UCSB CS240A, 2017 Modified from Demmel/Yelick s slides
Optimizing Cache Performance in Matrix Multiplication UCSB CS240A, 2017 Modified from Demmel/Yelick s slides 1 Case Study with Matrix Multiplication An important kernel in many problems Optimization ideas
More information12 Dynamic Programming (2) Matrix-chain Multiplication Segmented Least Squares
12 Dynamic Programming (2) Matrix-chain Multiplication Segmented Least Squares Optimal substructure Dynamic programming is typically applied to optimization problems. An optimal solution to the original
More informationTools and Primitives for High Performance Graph Computation
Tools and Primitives for High Performance Graph Computation John R. Gilbert University of California, Santa Barbara Aydin Buluç (LBNL) Adam Lugowski (UCSB) SIAM Minisymposium on Analyzing Massive Real-World
More informationGraph Partitioning for Scalable Distributed Graph Computations
Graph Partitioning for Scalable Distributed Graph Computations Aydın Buluç ABuluc@lbl.gov Kamesh Madduri madduri@cse.psu.edu 10 th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering
More information1 Definition of Reduction
1 Definition of Reduction Problem A is reducible, or more technically Turing reducible, to problem B, denoted A B if there a main program M to solve problem A that lacks only a procedure to solve problem
More informationIssues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM
Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.
More informationDivide and Conquer Algorithms. Sathish Vadhiyar
Divide and Conquer Algorithms Sathish Vadhiyar Introduction One of the important parallel algorithm models The idea is to decompose the problem into parts solve the problem on smaller parts find the global
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationParallel Programming with MPI and OpenMP
Parallel Programming with MPI and OpenMP Michael J. Quinn (revised by L.M. Liebrock) Chapter 7 Performance Analysis Learning Objectives Predict performance of parallel programs Understand barriers to higher
More informationAlgorithm-Based Fault Tolerance for Fail-Stop Failures
IEEE TRANSCATIONS ON PARALLEL AND DISTRIBUTED SYSYTEMS Algorithm-Based Fault Tolerance for Fail-Stop Failures Zizhong Chen, Member, IEEE, and Jack Dongarra, Fellow, IEEE Abstract Fail-stop failures in
More informationLecture 4: Principles of Parallel Algorithm Design (part 4)
Lecture 4: Principles of Parallel Algorithm Design (part 4) 1 Mapping Technique for Load Balancing Minimize execution time Reduce overheads of execution Sources of overheads: Inter-process interaction
More informationParallel Graph Algorithms
Parallel Graph Algorithms Design and Analysis of Parallel Algorithms 5DV050/VT3 Part I Introduction Overview Graphs definitions & representations Minimal Spanning Tree (MST) Prim s algorithm Single Source
More informationPerformance without Pain = Productivity Data Layout and Collective Communication in UPC
Performance without Pain = Productivity Data Layout and Collective Communication in UPC By Rajesh Nishtala (UC Berkeley), George Almási (IBM Watson Research Center), Călin Caşcaval (IBM Watson Research
More informationMaunam: A Communication-Avoiding Compiler
Maunam: A Communication-Avoiding Compiler Karthik Murthy Advisor: John Mellor-Crummey Department of Computer Science Rice University! karthik.murthy@rice.edu February 2014!1 Knights Tour Closed Knight's
More information