Chapter 8 Dense Matrix Algorithms


1 Chapter 8 Dense Matrix Algorithms (Selected slides & additional slides) A. Grama, A. Gupta, G. Karypis, and V. Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003.

2 Topic Overview Matrix-Vector Multiplication Matrix-Matrix Multiplication Solving a System of Linear Equations

3 Matrix Algorithms: Introduction Due to their regular structure, parallel computations involving matrices and vectors can readily use data decomposition. Most algorithms use one- and two-dimensional block, cyclic, and block-cyclic partitionings.

4 Matrix-Vector Multiplication Multiply a dense n × n matrix A with an n × 1 input vector x to yield an n × 1 result vector y. The serial algorithm requires n² multiplications and additions, so W = n². (1)
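
The serial reference below is a minimal Python/numpy sketch (not from the slides); the doubly nested loop makes the n² multiply-add count explicit.

import numpy as np

def mat_vect(A, x):
    # One dot product per row: n rows times n multiply-adds each, so W = n^2.
    n = A.shape[0]
    y = np.zeros(n)
    for i in range(n):
        for j in range(n):
            y[i] += A[i, j] * x[j]
    return y

A = np.random.rand(5, 5)
x = np.random.rand(5)
assert np.allclose(mat_vect(A, x), A @ x)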

5 Matrix-Vector Multiplication: Rowwise 1-D Partitioning The n × n matrix A is partitioned among n processes, and each process stores one row of the matrix. The n × 1 vector x is distributed such that each process owns one of its elements.

6 Matrix-Vector Multiplication: Rowwise 1-D Partitioning [Figure: Multiplication of an n × n matrix with an n × 1 vector using rowwise block 1-D partitioning; for the one-row-per-process case, p = n. Panels: (a) initial partitioning of the matrix and the starting vector x, (b) distribution of the full vector among all the processes by all-to-all broadcast, (c) entire vector distributed to each process after the broadcast, (d) final distribution of the matrix and the result vector y.]

7 Matrix-Vector Multiplication: Rowwise 1-D Partitioning Each process starts with only one element of x, thus an all-to-all broadcast is required to distribute all the elements of x to all the processes. Process i now computes y[i] = Σ_{j=0}^{n−1} A[i,j] · x[j]. The all-to-all broadcast and the computation of y[i] both take time Θ(n). Therefore, the overall parallel time is Θ(n).

8 Matrix-Vector Multiplication: Rowwise 1-D Partitioning Consider now the case when p < n and we use block 1-D partitioning. Each process initially stores n/p rows of the matrix A and a portion of the vector x of size n/p. The all-to-all broadcast takes place among p processes and involves messages of size n/p. This is followed by n/p local dot products. Thus, the parallel run time of this procedure is T_P = n²/p + t_s log p + t_w n. (2) This is cost-optimal.
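
A hedged mpi4py sketch of this formulation (the sizes below are illustrative, and n is assumed to be a multiple of p): the all-to-all broadcast of the x pieces maps onto Allgather, followed by the n/p local dot products.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()

n = 8 * p                             # toy problem size, n/p rows per process
local_A = np.random.rand(n // p, n)   # this process's n/p rows of A
local_x = np.random.rand(n // p)      # this process's n/p elements of x

x = np.empty(n)
comm.Allgather(local_x, x)            # all-to-all broadcast: every process gets all of x

local_y = local_A @ x                 # n/p local dot products: this process's piece of y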

9 Matrix-Vector Multiplication: Rowwise 1-D Partitioning Scalability Analysis: We know that T_o = pT_P − W, therefore we have T_o = t_s p log p + t_w n p. (3) For isoefficiency, we have W = KT_o, where K = E/(1 − E) for a desired efficiency E. From this, we have W = O(p²) (from the t_w term). There is also a bound on isoefficiency because of concurrency. In this case, p < n, therefore W = n² = Ω(p²). Overall isoefficiency is W = O(p²).
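
As a worked step (not spelled out on the slide), balancing W = n² against the dominant t_w term of T_o in equation (3) reproduces the stated bound:

W = K T_o   (keep only the t_w term)
n² = K t_w n p
n  = K t_w p
W  = n² = K² t_w² p² = O(p²).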

10 Matrix-Vector Multiplication: 2-D Partitioning The n × n matrix A is partitioned among n² processors such that each processor owns a single element. The n × 1 vector x is distributed only in the last column of n processors.

11 Matrix-Vector Multiplication: 2-D Partitioning [Figure: Matrix-vector multiplication with block 2-D partitioning; for the one-element-per-process case, p = n² if the matrix size is n × n. Panels: (a) initial data distribution and communication steps to align the vector along the diagonal, (b) one-to-all broadcast of portions of the vector along process columns, (c) all-to-one reduction of partial results, (d) final distribution of the result vector.]

12 Matrix-Vector Multiplication: 2-D Partitioning We must first distribute the vector x appropriately. The first communication step for the 2-D partitioning aligns the vector x along the principal diagonal of the matrix. The second communication step copies the vector elements from each diagonal process to all the processes in the corresponding column using n simultaneous broadcasts among all processes in the column. Finally, the result vector is computed by performing an all-to-one reduction along each row.

13 Matrix-Vector Multiplication: 2-D Partitioning Three basic communication operations are used in this algorithm: one-to-one communication to align the vector along the main diagonal, one-to-all broadcast of each vector element among the n processes of each column, and all-to-one reduction in each row. Both step 2 and step 3 take Θ(log n) time, so that the overall parallel time is Θ(log n). The cost (process-time product) is Θ(n² log n); hence, the algorithm is not cost-optimal.

14 Matrix-Vector Multiplication: 2-D Partitioning When using fewer than n² processors, each process owns an (n/√p) × (n/√p) block of the matrix A. The vector x is distributed in portions of n/√p elements in the last process-column only. In this case, the message sizes for the alignment, broadcast, and reduction are all n/√p. The computation is a product of an (n/√p) × (n/√p) submatrix with a vector of length n/√p.
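
A hedged mpi4py sketch of this block 2-D formulation (a minimal sketch assuming p is a perfect square, an illustrative block size, and x initially living in the last process column as in the slides): align x to the diagonal, broadcast down each process column, multiply locally, and reduce along each process row.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))                 # sqrt(p) x sqrt(p) process grid
cart = comm.Create_cart([q, q])
row, col = cart.Get_coords(cart.Get_rank())

blk = 4                                  # toy block size n/sqrt(p)
local_A = np.random.rand(blk, blk)       # block A_{row,col}
local_x = np.empty(blk)

# Step 1: align x along the diagonal (last-column processes send to the diagonal).
if col == q - 1:
    x_piece = np.random.rand(blk)
    if row == q - 1:
        local_x[:] = x_piece             # this last-column process is already diagonal
    else:
        cart.Send(x_piece, dest=cart.Get_cart_rank([row, row]))
if row == col and row != q - 1:
    cart.Recv(local_x, source=cart.Get_cart_rank([row, q - 1]))

# Step 2: one-to-all broadcast of the diagonal piece down each process column.
col_comm = cart.Sub([True, False])       # processes sharing this column; rank = row index
col_comm.Bcast(local_x, root=col)        # root is the diagonal process of this column

# Step 3: local product, then all-to-one reduction along each process row.
partial_y = local_A @ local_x
local_y = np.empty(blk)
row_comm = cart.Sub([False, True])       # processes sharing this row; rank = column index
row_comm.Reduce(partial_y, local_y, op=MPI.SUM, root=q - 1)   # result lands in last column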

15 Matrix-Vector Multiplication: 2-D Partitioning The first alignment step takes time t_s + t_w n/√p. The broadcast and reduction steps take time (t_s + t_w n/√p) log(√p). Local matrix-vector products take time t_c n²/p. Total time is T_P ≈ n²/p + t_s log p + t_w (n/√p) log p. (4)

16 Matrix-Vector Multiplication: 2-D Partitioning Scalability Analysis: T_o = pT_P − W = t_s p log p + t_w n √p log p. Equating T_o with W, term by term, for isoefficiency, we have W = K² t_w² p log² p as the dominant term. The isoefficiency due to concurrency is O(p). The overall isoefficiency is O(p log² p) (due to the network bandwidth). For cost optimality, we have W = n² = p log² p. For this, we have p = O(n²/log² n).

17 Matrix-Matrix Multiplication Consider the problem of multiplying two n × n dense, square matrices A and B to yield the product matrix C = A × B. The serial complexity is O(n³). A useful concept in this case is that of block operations. In this view, an n × n matrix A can be regarded as a q × q array of blocks A_{i,j} (0 ≤ i, j < q) such that each block is an (n/q) × (n/q) submatrix. In this view, we perform q³ matrix multiplications, each involving (n/q) × (n/q) matrices.
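
A small numpy illustration of the block view (a sketch with illustrative sizes n = 8 and q = 4): the triple loop performs exactly q³ multiplications of (n/q) × (n/q) blocks.

import numpy as np

n, q = 8, 4
blk = n // q
A = np.random.rand(n, n)
B = np.random.rand(n, n)
C = np.zeros((n, n))

for i in range(q):
    for j in range(q):
        for k in range(q):   # q^3 block multiplications in total
            C[i*blk:(i+1)*blk, j*blk:(j+1)*blk] += (
                A[i*blk:(i+1)*blk, k*blk:(k+1)*blk] @
                B[k*blk:(k+1)*blk, j*blk:(j+1)*blk])

assert np.allclose(C, A @ B)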

18 Matrix-Matrix Multiplication Consider two n × n matrices A and B partitioned into p blocks A_{i,j} and B_{i,j} (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each. Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes block C_{i,j} of the result matrix. Computing submatrix C_{i,j} requires all submatrices A_{i,k} and B_{k,j} for 0 ≤ k < √p. All-to-all broadcast blocks of A along rows and blocks of B along columns. Perform the local submatrix multiplications.
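
A hedged mpi4py sketch of this simple algorithm (illustrative block size, assuming p is a perfect square): the two all-to-all broadcasts map onto Allgather on row and column sub-communicators, followed by the √p local block products.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))               # sqrt(p) x sqrt(p) process grid
cart = comm.Create_cart([q, q])
i, j = cart.Get_coords(cart.Get_rank())

blk = 4                                # toy block size n/sqrt(p)
A_ij = np.random.rand(blk, blk)
B_ij = np.random.rand(blk, blk)

row_comm = cart.Sub([False, True])     # the sqrt(p) processes in row i
col_comm = cart.Sub([True, False])     # the sqrt(p) processes in column j

# All-to-all broadcast: collect A_{i,0..q-1} along the row and B_{0..q-1,j} along the column.
A_row = np.empty((q, blk, blk))
B_col = np.empty((q, blk, blk))
row_comm.Allgather(A_ij, A_row)
col_comm.Allgather(B_ij, B_col)

# Local computation: C_{i,j} = sum over k of A_{i,k} B_{k,j}
C_ij = sum(A_row[k] @ B_col[k] for k in range(q))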

19 Matrix-Matrix Multiplication The two broadcasts take time 2(t_s log(√p) + t_w (n²/p)(√p − 1)). The computation requires √p multiplications of (n/√p) × (n/√p) sized submatrices. The parallel run time is approximately T_P = n³/p + t_s log p + 2 t_w n²/√p. (5) The algorithm is cost-optimal and the isoefficiency is O(p^1.5) due to the bandwidth term t_w and concurrency. Its major drawback is that it is not memory-optimal.

20 Matrix-Matrix Multiplication: Cannon's Algorithm In this algorithm, we schedule the computations of the √p processes of the ith row such that, at any given time, each process is using a different block A_{i,k}. These blocks can be systematically rotated among the processes after every submatrix multiplication so that every process gets a fresh A_{i,k} after each rotation.

21 Matrix-Matrix Multiplication: Cannon's Algorithm [Figure: The communication steps in Cannon's algorithm on 16 processes. Panels: (a) initial alignment of A, (b) initial alignment of B, (c) A and B after the initial alignment, (d) submatrix locations after the first shift, (e) submatrix locations after the second shift, (f) submatrix locations after the third shift.]

22 Matrix-Matrix Multiplication: Cannon's Algorithm Align the blocks of A and B in such a way that each process multiplies its local submatrices. This is done by shifting all submatrices A_{i,j} to the left (with wraparound) by i steps and all submatrices B_{i,j} up (with wraparound) by j steps. Perform the local block multiplication. Each block of A then moves one step left and each block of B moves one step up (again with wraparound). Perform the next block multiplication, add to the partial result, and repeat until all √p blocks have been multiplied.
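
A minimal serial simulation of these steps in numpy (a sketch with illustrative sizes, n = 8 on a 4 × 4 grid of blocks), using np.roll for the initial alignment and for the single-step wraparound shifts:

import numpy as np

n, q = 8, 4                       # n x n matrices on a q x q grid of blocks
blk = n // q
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# View A and B as q x q grids of (blk x blk) blocks: Ab[i, j] is block A_{i,j}.
Ab = A.reshape(q, blk, q, blk).transpose(0, 2, 1, 3).copy()
Bb = B.reshape(q, blk, q, blk).transpose(0, 2, 1, 3).copy()
Cb = np.zeros_like(Ab)

# Initial alignment: shift row i of A left by i and column j of B up by j.
for i in range(q):
    Ab[i] = np.roll(Ab[i], -i, axis=0)
for j in range(q):
    Bb[:, j] = np.roll(Bb[:, j], -j, axis=0)

# q compute-and-shift steps: multiply resident blocks, then do single-step shifts.
for _ in range(q):
    for i in range(q):
        for j in range(q):
            Cb[i, j] += Ab[i, j] @ Bb[i, j]
    Ab = np.roll(Ab, -1, axis=1)     # every block of A moves one step left
    Bb = np.roll(Bb, -1, axis=0)     # every block of B moves one step up

C = Cb.transpose(0, 2, 1, 3).reshape(n, n)
assert np.allclose(C, A @ B)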

23 Matrix-Matrix Multiplication: Cannon's Algorithm In the alignment step, since the maximum distance over which a block shifts is √p − 1, the two shift operations require a total of 2(t_s + t_w n²/p) time. Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes t_s + t_w n²/p time. The computation time for multiplying √p matrices of size (n/√p) × (n/√p) is n³/p. The parallel time is approximately T_P = n³/p + 2√p t_s + 2 t_w n²/√p. (6) The cost-optimality and isoefficiency of the algorithm are identical to those of the first algorithm, except that this algorithm is memory-optimal.

24 Matrix-Matrix Multiplication: DNS Algorithm Uses a 3-D partitioning. Visualize the matrix multiplication algorithm as a cube matrices A and B come in two orthogonal faces and result C comes out the other orthogonal face. Each internal node in the cube represents a single add-multiply operation (and thus the complexity). DNS algorithm partitions this cube using a 3-D block scheme.
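
A tiny numpy illustration of this cube view (a sketch of the 3-D decomposition of the work, not of the DNS communication pattern itself): the 3-D array below holds the n³ scalar multiplications, one per internal node of the cube, and C is the accumulation along the k dimension.

import numpy as np

n = 4
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# D[i, k, j] = A[i, k] * B[k, j]: one entry per (i, j, k) add-multiply node of the cube.
D = A[:, :, None] * B[None, :, :]

# Accumulating along the k dimension yields C[i, j].
C = D.sum(axis=1)
assert np.allclose(C, A @ B)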

25 Matrix-Matrix Multiplication: DNS Algorithm Assume an n × n × n mesh of processors. Move the columns of A and rows of B and perform a broadcast. Each processor computes a single add-multiply. This is followed by an accumulation along the C dimension. Since each add-multiply takes constant time and the accumulation and broadcast take log n time, the total runtime is Θ(log n). This is not cost-optimal. It can be made cost-optimal by using n/log n processors along the direction of accumulation.

26 Matrix-Matrix Multiplication: DNS Algorithm [Figure: The communication steps in the DNS algorithm while multiplying 4 × 4 matrices A and B on 64 processes. Panels: (a) initial distribution of A and B, (b) after moving A[i,j] from P_{i,j,0} to P_{i,j,j}, (c) after broadcasting A[i,j] along the j axis, (d) corresponding distribution of B.]

27 Matrix-Matrix Multiplication: DNS Algorithm Using fewer than n³ processors. Assume that the number of processes p is equal to q³ for some q < n. The two matrices are partitioned into blocks of size (n/q) × (n/q). Each matrix can thus be regarded as a q × q two-dimensional square array of blocks. The algorithm follows from the previous one, except that, in this case, we operate on blocks rather than on individual elements.

28 Matrix-Matrix Multiplication: DNS Algorithm Using fewer than n³ processors. The first one-to-one communication step is performed for both A and B, and takes t_s + t_w (n/q)² time for each matrix. The two one-to-all broadcasts take 2(t_s log q + t_w (n/q)² log q) time for each matrix. The reduction takes time t_s log q + t_w (n/q)² log q. Multiplication of (n/q) × (n/q) submatrices takes (n/q)³ time. The parallel time is approximated by T_P = n³/p + t_s log p + t_w (n²/p^(2/3)) log p. (7) The isoefficiency function is Θ(p (log p)³).

29 Solving a System of Linear Equations Consider the problem of solving linear equations of the kind:
a_{0,0} x_0 + a_{0,1} x_1 + ... + a_{0,n−1} x_{n−1} = b_0,
a_{1,0} x_0 + a_{1,1} x_1 + ... + a_{1,n−1} x_{n−1} = b_1,
...
a_{n−1,0} x_0 + a_{n−1,1} x_1 + ... + a_{n−1,n−1} x_{n−1} = b_{n−1}.
This is written as Ax = b, where A is an n × n matrix with A[i,j] = a_{i,j}, b is an n × 1 vector [b_0, b_1, ..., b_{n−1}]^T, and x is the solution.

30 Solving a System of Linear Equations Two steps in the solution are: reduction to triangular form, and back-substitution. The triangular form is:
x_0 + u_{0,1} x_1 + u_{0,2} x_2 + ... + u_{0,n−1} x_{n−1} = y_0,
      x_1 + u_{1,2} x_2 + ... + u_{1,n−1} x_{n−1} = y_1,
...
      x_{n−1} = y_{n−1}.
We write this as Ux = y. A commonly used method for transforming a given matrix into an upper-triangular matrix is Gaussian elimination.

31 Gaussian Elimination
1.  procedure GAUSSIAN ELIMINATION (A, b, y)
2.  begin
3.     for k := 0 to n − 1 do                /* Outer loop */
4.     begin
5.        for j := k + 1 to n − 1 do
6.           A[k,j] := A[k,j]/A[k,k];        /* Division step */
7.        y[k] := b[k]/A[k,k];
8.        A[k,k] := 1;
9.        for i := k + 1 to n − 1 do
10.       begin
11.          for j := k + 1 to n − 1 do
12.             A[i,j] := A[i,j] − A[i,k] × A[k,j];   /* Elimination step */
13.          b[i] := b[i] − A[i,k] × y[k];
14.          A[i,k] := 0;
15.       endfor;                            /* Line 9 */
16.    endfor;                               /* Line 3 */
17. end GAUSSIAN ELIMINATION
Serial Gaussian Elimination
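
A runnable Python/numpy transcription of the procedure above (a sketch with no pivoting, so it assumes nonzero pivots; the diagonally dominant test matrix below is illustrative):

import numpy as np

def gaussian_elimination(A, b):
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    y = np.zeros(n)
    for k in range(n):                         # outer loop
        A[k, k+1:] /= A[k, k]                  # division step
        y[k] = b[k] / A[k, k]
        A[k, k] = 1.0
        for i in range(k + 1, n):              # elimination step
            A[i, k+1:] -= A[i, k] * A[k, k+1:]
            b[i] -= A[i, k] * y[k]
            A[i, k] = 0.0
    return A, y                                # A is now the unit upper-triangular U with Ux = y

A = np.random.rand(5, 5) + 5 * np.eye(5)       # diagonally dominant, so pivots stay nonzero
b = np.random.rand(5)
U, y = gaussian_elimination(A, b)
x = np.linalg.solve(U, y)                      # back-substitution
assert np.allclose(A @ x, b)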

32 Gaussian Elimination The computation has three nested loops: in the kth iteration of the outer loop, the algorithm performs (n − k)² computations. Summing over k = 1, ..., n, we have roughly n³/3 multiplications and subtractions. [Figure: A typical computation in Gaussian elimination: the division step A[k,j] := A[k,j]/A[k,k] applied to row k of the active part, and the elimination step A[i,j] := A[i,j] − A[i,k] × A[k,j] applied to every row i below it.]

33 Parallel Gaussian Elimination Assume p = n with each row assigned to a processor. The first step of the algorithm normalizes the row. This is a serial operation and takes time (n − k) in the kth iteration. In the second step, the normalized row is broadcast to all the processors. This takes time (t_s + t_w (n − k − 1)) log n. Each processor can then independently eliminate this row from its own row. This requires (n − k − 1) multiplications and subtractions. The total parallel time can be computed by summing over all n iterations as T_P = (3/2) n(n − 1) + t_s n log n + (1/2) t_w n(n − 1) log n. (8) The formulation is not cost-optimal because of the t_w term.
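
A hedged mpi4py sketch of this synchronous rowwise formulation with p = n (one row per process, no pivoting; the diagonally dominant random rows are illustrative): in iteration k the owner normalizes its row, the row is broadcast, and every higher-numbered process eliminates against it.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
n = comm.Get_size()                  # assume p = n: one row of A per process
rank = comm.Get_rank()

np.random.seed(rank)
my_row = np.random.rand(n)
my_row[rank] += n                    # diagonal dominance keeps pivots nonzero
my_b = np.random.rand()
my_y = 0.0

row_k = np.empty(n)
y_k = np.empty(1)
for k in range(n):
    if rank == k:
        y_k[0] = my_b / my_row[k]
        my_y = y_k[0]
        my_row[k+1:] /= my_row[k]    # division (normalization) step
        my_row[k] = 1.0
        row_k[:] = my_row
    comm.Bcast(row_k, root=k)        # one-to-all broadcast of the normalized row k
    comm.Bcast(y_k, root=k)
    if rank > k:
        my_b -= my_row[k] * y_k[0]   # elimination step against row k
        my_row[k+1:] -= my_row[k] * row_k[k+1:]
        my_row[k] = 0.0
# Each process now holds its row of the unit upper-triangular U and y[rank] in my_y.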

34 Parallel Gaussian Elimination [Figure: Gaussian elimination steps during the iteration corresponding to k = 3 for an 8 × 8 matrix partitioned rowwise among eight processes. Panels: (a) computation: (i) A[k,j] := A[k,j]/A[k,k] for k < j < n, (ii) A[k,k] := 1; (b) communication: one-to-all broadcast of row A[k,*]; (c) computation: (i) A[i,j] := A[i,j] − A[i,k] × A[k,j] for k < i < n and k < j < n, (ii) A[i,k] := 0 for k < i < n.]

35 Parallel Gaussian Elimination: Pipelined Execution In the previous formulation, the (k + 1)st iteration starts only after all the computation and communication for the kth iteration is complete. In the pipelined version, there are three steps: normalization of a row, communication, and elimination. These steps are performed in an asynchronous fashion. A processor P_k waits to receive and eliminate all rows prior to k. Once it has done this, it forwards its own row to processor P_{k+1}.

36 Parallel Gaussian Elimination: Pipelined Execution [Figure: Pipelined Gaussian elimination on a 5 × 5 matrix partitioned with one row per process. Panels (a) through (p) show how the communication and computation of successive iterations (k = 0 through k = 4) overlap, with a later iteration starting before the earlier ones have finished.]

37 Parallel Gaussian Elimination: Pipelined Execution The total number of steps in the entire pipelined procedure is Θ(n). In any step, either O(n) elements are communicated between directly-connected processes, or a division step is performed on O(n) elements of a row, or an elimination step is performed on O(n) elements of a row. The parallel time is therefore O(n²). This is cost-optimal.

38 Parallel Gaussian Elimination: Block 1-D with p < n The above algorithm can be easily adapted to the case when p < n. In the kth iteration, a processor with all rows belonging to the active part of the matrix performs (n − k − 1)n/p multiplications and subtractions. In the pipelined version, for n > p, computation dominates communication. The parallel time is given by 2(n/p) Σ_{k=0}^{n−1} (n − k − 1), or approximately n³/p. While the algorithm is cost-optimal, the cost of the parallel algorithm is higher than the sequential run time by a factor of 3/2.

39 Parallel Gaussian Elimination: Block 1-D with p < n [Figure: Computation load on different processes in block and cyclic 1-D partitioning of an 8 × 8 matrix on four processes during the Gaussian elimination iteration corresponding to k = 3. Panels: (a) block 1-D mapping, (b) cyclic 1-D mapping.]

40 Parallel Gaussian Elimination: Cyclic 1-D Mapping The load imbalance problem can be alleviated by using a cyclic mapping. In this case, other than the processing of the last p rows, there is no load imbalance.
