CS 770G - Parallel Algorithms in Scientific Computing
1 CS 770G - Parallel Algorithms in Scientific Computing
Dense Matrix Computation II: Solving Linear Systems
Lecture 6, May 28, 2001
2 References
Introduction to Parallel Computing, Kumar, Grama, Gupta, Karypis, Benjamin Cummings.
Numerical Linear Algebra for High-Performance Computers, Dongarra, Duff, Sorensen, van der Vorst, SIAM.
A portion of the notes comes from Prof. J. Demmel's CS267 course at UC Berkeley.
3 Review of Gaussian Elimination (GE) for Solving A*x = b
Add multiples of each row to later rows to make A upper triangular.
Solve the resulting triangular system U*x = c by substitution.
... for each column i, zero it out below the diagonal by adding multiples of row i to later rows:
for i = 1 to n-1
    ... for each row j below row i, add a multiple of row i to row j:
    for j = i+1 to n
        for k = i to n
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
4 Refine GE Algorithm (1)
Initial version:
for i = 1 to n-1
    for j = i+1 to n
        for k = i to n
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
Remove computation of the constant A(j,i)/A(i,i) from the inner loop:
for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i to n
            A(j,k) = A(j,k) - m * A(i,k)
5 Refine GE Algorithm (2)
Last version:
for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i to n
            A(j,k) = A(j,k) - m * A(i,k)
Don't compute what we already know: zeros below the diagonal in column i.
for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - m * A(i,k)
6 Refine GE Algorithm (3)
Last version:
for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - m * A(i,k)
Store the multipliers m below the diagonal in the zeroed entries for later use:
for i = 1 to n-1
    for j = i+1 to n
        A(j,i) = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - A(j,i) * A(i,k)
7 Refine GE Algorithm (4)
Last version:
for i = 1 to n-1
    for j = i+1 to n
        A(j,i) = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - A(j,i) * A(i,k)
Express using matrix operations (BLAS):
for i = 1 to n-1
    A(i+1:n,i) = A(i+1:n,i) / A(i,i)
    A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
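The final BLAS form translates almost line for line into NumPy. A minimal sketch of the unpivoted kernel, with function and variable names of my own choosing:

    import numpy as np

    def lu_in_place(A):
        # Right-looking LU without pivoting, mirroring the slide's loop:
        # multipliers are stored below the diagonal, U on and above it.
        A = A.astype(float).copy()
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                               # scale column i of L
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 update
        return A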
8 What GE Really Computes
for i = 1 to n-1
    A(i+1:n,i) = A(i+1:n,i) / A(i,i)
    A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
Call the strictly lower triangular matrix of multipliers M, and let L = I + M. Call the upper triangle of the final matrix U.
Lemma (LU Factorization): If the above algorithm terminates (does not divide by zero), then A = L*U.
Solving A*x = b using GE:
    Factorize A = L*U using GE (cost 2/3 n^3 flops)
    Solve L*y = b for y, using substitution (cost n^2 flops)
    Solve U*x = y for x, using substitution (cost n^2 flops)
Thus A*x = (L*U)*x = L*(U*x) = L*y = b, as desired.
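As an illustration of the three-step solve, here is how the factors produced by the lu_in_place sketch above could be combined with SciPy's triangular solvers (again unpivoted, so this assumes a matrix for which GE without pivoting is safe):

    import numpy as np
    from scipy.linalg import solve_triangular

    def solve_with_lu(A, b):
        F = lu_in_place(A)                        # packed factors: M below, U above
        n = F.shape[0]
        L = np.tril(F, -1) + np.eye(n)            # L = I + M
        U = np.triu(F)
        y = solve_triangular(L, b, lower=True)    # L*y = b   (n^2 flops)
        x = solve_triangular(U, y, lower=False)   # U*x = y   (n^2 flops)
        return x                                  # A*x = L*(U*x) = L*y = b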
9 Problems with the Basic GE Algorithm
What if some A(i,i) is zero? Or very small?
The result may not exist, or may be unstable, so we need to pivot.
The current computation is all BLAS 1 or BLAS 2, but we know that BLAS 3 (matrix multiply) is fastest:
for i = 1 to n-1
    A(i+1:n,i) = A(i+1:n,i) / A(i,i)                             ... BLAS 1 (scale a vector)
    A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)    ... BLAS 2 (rank-1 update)
(Figure: performance bars, from fastest to slowest: Peak, BLAS 3, BLAS 2, BLAS 1.)
10 Pivoting in Gaussian Elimination
A = [ 0 1 ]  fails completely, even though A is easy.
    [ 1 0 ]
Illustrate the problems in 3-decimal-digit arithmetic:
A = [ 1e-4 1 ]  and  b = [ 1 ];  the correct answer to 3 places is x = [ 1 ].
    [ 1    1 ]           [ 2 ]                                        [ 1 ]
The result of LU decomposition is:
L = [ 1           0 ] = [ 1   0 ]          ... no roundoff error yet
    [ fl(1/1e-4)  1 ]   [ 1e4 1 ]
U = [ 1e-4  1           ] = [ 1e-4  1    ] ... error in the 4th decimal place
    [ 0     fl(1-1e4*1) ]   [ 0     -1e4 ]
Check: L*U = [ 1e-4 1 ]                    ... the (2,2) entry is entirely wrong
             [ 1    0 ]
The algorithm "forgets" the (2,2) entry: it gets the same L and U for all |A(2,2)| < 5.
Numerical instability: the computed solution x is totally inaccurate.
Cure: Pivot (swap rows of A) so that the entries of L and U are bounded.
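The same failure can be reproduced in IEEE double precision by pushing the pivot below machine epsilon; the values below are my own, chosen to mimic the slide's 3-digit example:

    import numpy as np

    A = np.array([[1e-20, 1.0],
                  [1.0,   1.0]])     # true solution of A*x = [1, 2] is x ~ [1, 1]
    b = np.array([1.0, 2.0])

    m = A[1, 0] / A[0, 0]            # multiplier 1e20
    u22 = A[1, 1] - m * A[0, 1]      # fl(1 - 1e20) = -1e20: the 1 is "forgotten"
    y2 = b[1] - m * b[0]             # forward substitution: fl(2 - 1e20) = -1e20
    x2 = y2 / u22                    # = 1.0 (fine)
    x1 = (b[0] - x2) / A[0, 0]       # = 0.0, completely wrong
    print(x1, x2)                    # unpivoted GE gives x = [0, 1]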
11 Gaussian Elimination with Partial Pivoting (GEPP)
Partial Pivoting: swap rows so that each multiplier satisfies |L(j,i)| = |A(j,i)/A(i,i)| <= 1.
for i = 1 to n-1
    find and record k where |A(k,i)| = max{i <= j <= n} |A(j,i)|
        ... i.e. the largest entry in the rest of column i
    if A(k,i) = 0
        exit with a warning that A is singular, or nearly so
    elseif k != i
        swap rows i and k of A
    end if
    A(i+1:n,i) = A(i+1:n,i) / A(i,i)       ... each quotient lies in [-1,1]
    A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
Lemma: This algorithm computes A = P*L*U, where P is a permutation matrix.
Since each entry |L(j,i)| <= 1, this algorithm is considered numerically stable.
For details see the LAPACK code and Dongarra's book.
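A compact NumPy sketch of GEPP as stated on the slide; the permutation is returned as an index vector rather than an explicit matrix P, and all names are mine:

    import numpy as np

    def gepp(A):
        A = A.astype(float).copy()
        n = A.shape[0]
        p = np.arange(n)                         # row permutation: P*A = L*U
        for i in range(n - 1):
            k = i + np.argmax(np.abs(A[i:, i]))  # largest entry in rest of column i
            if A[k, i] == 0.0:
                raise ValueError("matrix is singular, or nearly so")
            if k != i:
                A[[i, k]] = A[[k, i]]            # swap rows i and k
                p[[i, k]] = p[[k, i]]
            A[i+1:, i] /= A[i, i]                # each quotient lies in [-1, 1]
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        return p, A                              # packed L (below diag) and U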
12 Summary
You have seen an example of how a typical (and important) matrix operation can be reduced to calls to lower-level BLAS routines that have been optimized for a given hardware platform.
Any other matrix operation you may think of (matrix-matrix multiply, Cholesky factorization, QR, Householder method, etc.) can be constructed from BLAS subprograms in a similar fashion, and in fact has been, in the package called LAPACK.
Note that only Level 1 and 2 BLAS routines have been used in the LU decomposition so far.
Efficiency considerations?
13 Overview of LAPACK
Standard library for dense/banded linear algebra:
    Linear systems: A*x = b
    Least squares problems: min over x of ||A*x - b||_2
    Eigenvalue problems: A*x = λ*x, A*x = λ*B*x
    Singular value decomposition (SVD): A = U*Σ*V^T
Algorithms reorganized to use BLAS 3 as much as possible.
Basis of the math libraries on many computers.
Many algorithmic innovations remain.
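LAPACK's GEPP routines are directly callable from SciPy, for example; a small usage sketch (sizes and seed are arbitrary):

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 1000))
    b = rng.standard_normal(1000)

    lu, piv = lu_factor(A)             # wraps LAPACK dgetrf: blocked GEPP, A = P*L*U
    x = lu_solve((lu, piv), b)         # wraps LAPACK dgetrs: the two triangular solves
    print(np.linalg.norm(A @ x - b))   # small residual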
14 Performance of LAPACK (n = 1000) (figure)
15 Performance of LAPACK (n = 100) (figure)
16 Summary, Cont'd
We need to devise algorithms that can make use of Level 3 BLAS (matrix-matrix) routines, for several reasons:
Level 3 routines are known to run much more efficiently, due to a larger ratio of computation to memory references/communication.
Parallel algorithms on distributed-memory machines will require that we decompose the original matrix into blocks which reside on each processor (similar to HW1).
Parallel algorithms will require that we minimize the surface-to-volume ratio of our decompositions, and blocking becomes the natural approach.
17 Converting BLAS 2 to BLAS 3 in GEPP
Blocking
    Used to optimize matrix multiplication.
    Harder here because of the data dependencies in GEPP.
Delayed Updates
    Save up updates to the trailing matrix from several consecutive BLAS 2 updates.
    Apply many saved updates simultaneously in one BLAS 3 operation.
The same idea works for much of dense linear algebra; open questions remain.
Need to choose a block size b (k in Dongarra's book):
    The algorithm will save and apply b updates.
    b must be small enough so that the active submatrix consisting of b columns of A fits in cache.
    b must be large enough to make BLAS 3 fast.
18 Blocked Algorithms - LU Factorization (figure)
19 Blocked Algorithms - LU Factorization
With these relationships, we can develop different algorithms by choosing the order in which operations are performed.
The block size b needs to be chosen carefully: b = 1 produces the usual algorithm, while b > 1 will improve performance on a single processor.
Three natural variants: Left-Looking, Right-Looking, Crout.
20 Blocked Algorithms - LU Factorization
Assume you have already done the first row and column of the GEPP, and you have the sub-block below left to work on.
Notice that the LU decomposition of this sub-block is independent of the portion you have already completed.
21 Blocked Algorithms - LU Factorization
For simplicity, change notation and write the (pivoted) sub-block as
P*A = [ A11 A12 ] = [ L11 0   ] * [ U11 U12 ]
      [ A21 A22 ]   [ L21 L22 ]   [ 0   U22 ]
Notice that, once you have done Gaussian elimination on A11 and A21 (the first block column), you have already obtained L11, U11, and L21.
Now you can re-arrange the block equations by substituting
    U12 = L11^(-1) * A12  and  Ã22 = A22 - L21 * U12,
and repeat the procedure recursively on Ã22.
22 Blocked Algorithms - LU Factorization
A graphical view of what is going on is given in the accompanying figure.
23 Blocked Algorithms - LU Factorization
Left-Looking, Right-Looking, Crout.
Variations in the algorithm are due to the order in which submatrix operations are performed.
There are slight advantages to Crout's algorithm (a hybrid of the first two).
(Figure legend: pre-computed sub-blocks vs. sub-blocks currently being operated on.)
24 Review: BLAS 3 (Blocked) GEPP
for ib = 1 to n-1, step b      ... process the matrix b columns at a time
    end = ib + b - 1           ... point to the end of the block of b columns
    apply the BLAS 2 version of GEPP to get A(ib:n, ib:end) = P' * L' * U'
    ... let LL denote the strict lower triangular part of A(ib:end, ib:end), plus I
    A(ib:end, end+1:n) = LL^(-1) * A(ib:end, end+1:n)   ... update next b rows of U
    A(end+1:n, end+1:n) = A(end+1:n, end+1:n)
                          - A(end+1:n, ib:end) * A(ib:end, end+1:n)
        ... apply the delayed updates with a single matrix multiply with inner dimension b
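For concreteness, a serial sketch of this blocked loop in NumPy, with pivoting omitted so the delayed-update structure stays visible (block size and names are my own):

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_lu(A, b=64):
        # Unpivoted blocked right-looking LU: panel factorization (BLAS 2),
        # block row of U (triangular solve), one delayed BLAS 3 update.
        A = A.astype(float).copy()
        n = A.shape[0]
        for ib in range(0, n, b):
            end = min(ib + b, n)
            for i in range(ib, end):               # BLAS 2 panel factorization
                A[i+1:, i] /= A[i, i]
                A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])
            if end < n:
                LL = np.tril(A[ib:end, ib:end], -1) + np.eye(end - ib)
                A[ib:end, end:] = solve_triangular(LL, A[ib:end, end:], lower=True)
                A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]  # inner dim b
        return A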
25 Parallel Algorithms for Dense Matrices
All that follows is applicable to dense (full) matrices only.
Square matrices are discussed, but the arguments are valid for rectangular matrices as well.
Typical parallelization steps:
    Decomposition: identify parallel work and its partitioning.
    Mapping: which procs execute which portion of the work.
    Assignment: load-balance the work among procs.
    Organization: communication and synchronization.
26 Parallel Algorithms for Dense Matrices
The proc that owns a given portion of a matrix is responsible for doing all of the computation that involves that portion of the matrix.
This is the sensible thing to do, since it minimizes communication (although, due to data dependencies within the matrix, some communication will still be necessary).
The question is: how should we subdivide a matrix so that parallel efficiency is maximized? There are various options.
27 Different Data Layouts for Parallel GE
(Annotations from the layout figure:)
    Column blocked: bad load balance - P0 is idle after the first n/4 steps.
    Column cyclic: load balanced, but can't easily use BLAS 2 or BLAS 3.
    Column block cyclic: can trade load balance against BLAS 2/3 performance by choosing b, but factorization of the block column is a bottleneck.
    Row and column block cyclic: the winner!
    Block skewed: complicated addressing.
28 Row and Column Block Cyclic Layout
The matrix is composed of brow-by-bcol blocks.
Procs are distributed in a 2D array indexed by (pi, pj), with 0 <= pi < Prow and 0 <= pj < Pcol.
Element a(i,j) is mapped to proc (pi, pj) using the formulas:
    pi = floor(i / brow) mod Prow
    pj = floor(j / bcol) mod Pcol
In the figure, p = 4 and Prow = Pcol = brow = bcol = 2.
There is Pcol-fold parallelism in any column, and calls to BLAS 2 and BLAS 3 operate on matrices of size brow-by-bcol.
The serial bottleneck is eased.
The layout need not be symmetric in rows and columns.
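The mapping formulas are easy to check in a few lines; the defaults below mirror the slide's example (p = 4, Prow = Pcol = brow = bcol = 2):

    def owner(i, j, brow=2, bcol=2, Prow=2, Pcol=2):
        # Proc (pi, pj) that owns element a(i, j) in the 2D block cyclic layout.
        pi = (i // brow) % Prow
        pj = (j // bcol) % Pcol
        return pi, pj

    print(owner(5, 2))   # -> (0, 1): block row 2 wraps back to proc row 0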
29 Row and Column Block Cyclic Layout
In LU factorization, the distribution of work becomes uneven as the computation progresses.
Larger block sizes result in greater load imbalance but reduce the frequency of communication between procs: the block size controls this tradeoff.
Some procs need to do more work between synchronization points than others (e.g. partial pivoting over the rows in a single block column while the other procs stay idle; the computation of each block row of U requires the solution of a lower triangular system across the procs in a single row). The processor decomposition controls this type of tradeoff.
30 Parallel GE with a 2D Block Cyclic Layout
The block size b in the algorithm and the block sizes brow and bcol in the layout satisfy b = brow = bcol.
Shaded regions indicate busy processors or communication being performed.
It is unnecessary to have a barrier between each step of the algorithm; e.g. steps 9 and 10 can be pipelined. See Dongarra's book for more details.
31-33 (Figures showing successive steps of the blocked parallel GE.)
Slide 32 caption: matrix multiply of green = green - blue * pink.
34 Parallel Matrix Transpose
The transpose of a matrix A is defined by
    A^T(i,j) = A(j,i),   0 <= i, j < n.
All elements below the diagonal move above the diagonal, and vice versa.
Assume it takes 1 unit of time to exchange a pair of matrix elements. The sequential time for transposing an n x n matrix is then
    T_s = (n^2 - n) / 2.
Consider parallel architectures organized as both 2D mesh and hypercube structures.
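A minimal serial sketch making the (n^2 - n)/2 count explicit: each below-diagonal element is exchanged exactly once with its mirror image.

    def transpose_in_place(A):
        # n*(n-1)/2 pairwise exchanges for an n x n matrix (list of lists).
        n = len(A)
        for i in range(n):
            for j in range(i + 1, n):
                A[i][j], A[j][i] = A[j][i], A[i][j]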
35 Parallel Matrix Transpose - 2D Mesh
(Figure: a 4x4 grid of procs P0..P15, shown in the initial matrix and in the final, transposed matrix.)
Elements/blocks in the lower-left part of the matrix move up to the diagonal, and then right to their final location. Each step taken requires communication.
Elements/blocks in the upper-right part of the matrix move down to the diagonal, and then left to their final location. Each step taken requires communication.
36 Parallel Matrix Transpose - 2D Mesh
(Same figure as the previous slide.)
If each of the p procs contains a single number, then after all of these communication steps the matrix has been transposed.
However, if each proc contains a sub-block of the matrix, then after all blocks have been communicated to their final locations they still need to be locally transposed.
Each sub-block contains (n/√p) x (n/√p) elements, and the cost of communication is higher than before.
The cost of communication is dominated by the elements/blocks that reside in the top-right and bottom-left corners; they have to take approximately 2√p hops.
37 Parallel Matrix Transpose - 2D Mesh
Each block contains n^2/p elements, so it takes at most
    2 (t_s + t_w n^2/p) √p
for all blocks to move to their final destinations. After that, the local blocks need to be transposed, which can be done in time approximately equal to
    n^2 / (2p).
Thus the total wall-clock time is
    T_P = n^2/(2p) + 2 t_s √p + 2 t_w n^2/√p.
Summing over all p processors, the total time consumed by the parallel algorithm is of order
    T_TOT = Θ(n^2 √p),
which is higher than the sequential complexity, Θ(n^2). This algorithm on a 2D mesh is therefore not cost-optimal. The same is true regardless of whether store-and-forward or cut-through routing is used.
38 Parallel Matrix Transpose - Hypercube
(Figure: the proc grid is split into quadrants; the off-diagonal quadrants are exchanged, and the subdivision is repeated within each quadrant.)
This algorithm is called recursive subdivision, and it maps naturally onto a hypercube.
After the blocks have all been transposed, the elements inside each block (local to a proc) still need to be transposed.
Wall-clock time:
    T_P = n^2/(2p) + (t_s + t_w n^2/p) log p
Total time:
    T_TOT = Θ(n^2 log p)
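A serial NumPy sketch of the recursive subdivision pattern (assuming n is a power of two; on a hypercube each quadrant swap becomes one pairwise block exchange between procs):

    import numpy as np

    def recursive_transpose(A):
        # Swap the two off-diagonal quadrants, then recurse into all four:
        # A^T = [[A11^T, A21^T], [A12^T, A22^T]].
        n = A.shape[0]
        if n <= 1:
            return
        h = n // 2
        A[:h, h:], A[h:, :h] = A[h:, :h].copy(), A[:h, h:].copy()
        for quad in (A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]):
            recursive_transpose(quad)

    A = np.arange(16.0).reshape(4, 4)
    recursive_transpose(A)
    print(A)   # matches np.arange(16.0).reshape(4, 4).T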
39 Parallel Matrix Transpose - Hypercube
With cut-through routing, the timing improves slightly, to
    T_P = (t_s + t_h + t_w n^2/(2p)) log p,
with a total time of
    T_TOT = Θ(n^2 log p),
which is still suboptimal.
Using striped partitioning (a.k.a. the column blocked layout) and cut-through routing on a hypercube, the time becomes cost-optimal:
    T_P = n^2/(2p) + t_s (p-1) + t_w n^2/p + (1/2) t_h p log p,
    T_TOT = Θ(n^2).
Note that this type of partitioning may be cost-optimal for the transpose operation, but not necessarily for other matrix operations, such as LU factorization and matrix-matrix multiply.
More information