F10: Parallel Sparse Matrix Computations


Virgil Patterson
5 years ago

Figures mainly from Kumar et al., Introduction to Parallel Computing, 1st ed., Chap. 11
Bo Kågström et al. (RG, EE, MR)

Contents

- Sparse matrices and storage schemes (formats)
- Parallel algorithms for basic sparse operations
  - Inner product (sdot)
  - Matrix-vector multiply (structured and unstructured sparseness)
- Iterative methods for solving Ax = b, parallel aspects
  - Jacobi, Gauss-Seidel, conjugate gradients (CG), [preconditioning]
- Ordering and reordering of rows in sparse A and grid graphs

Things that are not covered: direct methods for sparse systems (+ robust and general, - more memory)
- Reordering (faster and more accurate/stable solution)
- Symbolic factorization (sets up the data structure for the matrix factorization, allocates memory, determines fill-in)
- Numeric factorization

Parallel algorithms for sparse systems Ax = b

- A physical system is represented by a mathematical model (e.g. a PDE)
- The continuous domain (surface, space, etc.) is discretized by a mesh with a finite number of grid points
- Relations between variables in the discretized model give only a small number of nonzero elements in the system matrices
- Computations are much more efficient and require much less storage if the sparseness is exploited
- In addition, the memory needed to store the sparse matrices as dense matrices (typically n x n) is often not available

Discretized domain: a metal sheet

- The metal sheet's surface temperature is modeled by computing its values at grid points 0-48.

Storage schemes for sparse matrices: Coordinate format

- n x n matrix with q = # nonzero elements << n^2
- Three arrays of size q x 1:
  - Nonzero elements, in arbitrary order
  - Row coordinates
  - Column coordinates

Compressed sparse row (CSR) format

- VAL (q x 1): nonzero elements in row order
- J (q x 1): corresponding column index
- Pointer to the first element of the i-th row in VAL and J
- Also compressed sparse column (CSC): rows and columns exchange roles in the storage layout (also called Harwell-Boeing format)

Diagonal storage format

- d = # diagonals (= 4 in the example)
- Nonzero elements (n x d): stored diagonal-wise, one diagonal in each column
- Distance to main diagonal (d x 1)

Ellpack-Itpack format (E-I format)

- Good when m = max(#nz) in any row is not much larger than mean(#nz) per row
- Nonzero elements (n x m): stored row by row
- Corresponding column index (n x m); -1 signals end of row
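As an illustration of the CSR layout, the following Python sketch builds the three arrays from a small dense matrix and uses them to compute y = Ax. The names VAL and J follow the slides; the row-pointer name ROWPTR, the helper names and the 3 x 3 example matrix are assumptions made for this example only.

```python
def dense_to_csr(a):
    """Convert a dense matrix (list of rows) to CSR arrays VAL, J, ROWPTR."""
    val, j, rowptr = [], [], [0]
    for row in a:
        for col, entry in enumerate(row):
            if entry != 0:
                val.append(entry)   # nonzero elements in row order
                j.append(col)       # corresponding column index
        rowptr.append(len(val))     # start of the next row in VAL and J
    return val, j, rowptr


def csr_matvec(val, j, rowptr, x):
    """Return y = A x, touching only the stored nonzero elements."""
    y = []
    for i in range(len(rowptr) - 1):
        s = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            s += val[k] * x[j[k]]
        y.append(s)
    return y


A = [[4, 0, 1],
     [0, 3, 0],
     [2, 0, 5]]
VAL, J, ROWPTR = dense_to_csr(A)
y = csr_matvec(VAL, J, ROWPTR, [1, 2, 3])
```

Here ROWPTR has n + 1 entries, so row i occupies positions ROWPTR[i] up to (but not including) ROWPTR[i + 1]; that extra final entry is a common convention, assumed here rather than taken from the slides.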

Jagged-diagonal format

- First, the matrix rows are ordered by decreasing number of nonzero entries (matrix (b) in the figure)
- Nonzero elements of the (b)-matrix in jagged-diagonal order (q x 1, q = # nz)
- Column index
- Pointer to start index for each diagonal

Parallel dot (inner) product

- Common operation in algorithms for sparse matrices (the vectors are dense!)
- If x and y (n x 1) are uniformly partitioned among p processors, each processor performs n/p multiplications (*) and n/p - 1 additions (+), followed by a global sum
- T_p = 2n/p + (t_s + t_w) log p

Sparse matrix-vector multiply (GEMV)

- y = Ax, A sparse n x n, x dense n x 1, so y is dense
- Most costly operation in most (iterative) algorithms for solving linear systems of equations
- Examples of sparse matrix structures:
  - a few diagonals close to the main diagonal
  - unstructured (almost random location of matrix elements)
  - band matrices: nonzero elements confined to a band around the main diagonal (but can be unstructured within the band)
  - symmetric
- Make use of the matrix structure(s) if possible!

Block-tridiagonal matrices, Laplace PDE

- Assume a discretized mesh for a PDE (Laplace)
- Grid points are numbered row by row from 0 to n-1
- Finite difference approximation of the derivatives of u(x,y). In general, for row i:
  a_i x[i - √n] + b_i x[i - 1] + c_i x[i] + d_i x[i + 1] + e_i x[i + √n] = f_i
- Coefficients with index i represent elements in row i of matrix A (at most 5 elements per row)
- Vector x[0 : n-1] holds approximations to u(x,y) at the n grid points.
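The row formula for the Laplace matrix can be sketched as a matrix-vector product over the five coefficient arrays. This is a minimal Python sketch that assumes a square grid of side x side points, so that n = side * side and the two distant diagonals sit at offset side. The function name, the argument names and the boundary handling (a coefficient is skipped when the corresponding neighbor lies outside the grid) are assumptions made for illustration.

```python
def stencil_matvec(a, b, c, d, e, x, side):
    """Return y = A x for the five-diagonal matrix of a side x side grid.

    a, b, c, d, e hold the coefficients a_i .. e_i of row i; entries that
    would refer to a neighbor outside the grid are ignored.
    """
    n = side * side
    y = [0.0] * n
    for i in range(n):
        row, col = divmod(i, side)
        s = c[i] * x[i]                  # main diagonal
        if row > 0:
            s += a[i] * x[i - side]      # grid neighbor in the previous grid row
        if col > 0:
            s += b[i] * x[i - 1]         # left neighbor
        if col < side - 1:
            s += d[i] * x[i + 1]         # right neighbor
        if row < side - 1:
            s += e[i] * x[i + side]      # grid neighbor in the next grid row
        y[i] = s
    return y


# Hypothetical example: the standard 5-point Laplacian (4 on the diagonal,
# -1 for each neighbor) applied to a vector of ones on a 3 x 3 grid.
side = 3
n = side * side
minus_one = [-1.0] * n
four = [4.0] * n
y = stencil_matvec(minus_one, minus_one, four, minus_one, minus_one,
                   [1.0] * n, side)
```

Each entry of y equals 4 minus the number of grid neighbors of that point, which makes the missing coefficients at the grid boundary visible.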

Block-tridiagonal matrices, Laplace example

- n = # grid points = 16, block size = √n x √n (4 x 4)
- Nonzero coefficients per row (a_i in column i - √n, b_i in column i - 1, c_i on the diagonal, d_i in column i + 1, e_i in column i + √n):

  rows 0-3:    c0 d0 e0 | b1 c1 d1 e1 | b2 c2 d2 e2 | b3 c3 e3
  rows 4-7:    a4 c4 d4 e4 | a5 b5 c5 d5 e5 | a6 b6 c6 d6 e6 | a7 b7 c7 e7
  rows 8-11:   a8 c8 d8 e8 | a9 b9 c9 d9 e9 | a10 b10 c10 d10 e10 | a11 b11 c11 e11
  rows 12-15:  a12 c12 d12 | a13 b13 c13 d13 | a14 b14 c14 d14 | a15 b15 c15

- √n blocks on the main diagonal; √n - 1 blocks on the sub- and super-diagonals
- How do we do parallel matrix-vector multiply with this structure?

Matrix-vector mult. for block-tridiagonal matrix

- y = Ax, diagonal storage for A
- Assume block-striped partitioning of A and x, with p ≤ √n (# elements per processor n/p ≥ √n)
- Each row of A requires 5 vector elements x[i] for its part of the calculation
- Elements x[i] for the main diagonal are already in place (same index i): trivial parallelization, no communication needed
- [Elements x[i-1], x[i+1] for the sub- and super-diagonals are held by neighboring processors: communication with nearest neighbor, cost 2(t_s + t_w)] only if p > √n!
- Elements for the distant diagonals (+/- √n) must also be exchanged: communication cost 2(t_s + t_w √n)
- Computation cost: 5 t_a n/p
- T_p = 5 t_a n/p + 2(t_s + t_w √n)
- Isoefficiency function: Θ(p²)

Better partitioning of block-tridiagonal matrix

- For p > √n the following partition of the grid is much better (n = 36, p = 9): each processor holds a √(n/p) x √(n/p) (= 2 x 2) block of grid points, i.e. a block of n/p rows (+ vector elements)
- Matrix rows that belong to points within a given partition are on the same processor; similarly for the vector elements
- Vector elements corresponding to boundary points of the partition are exchanged with logical neighbors: 4(t_s + t_w √(n/p))
- T_p = 5 t_a n/p + 4(t_s + t_w √(n/p))
- Iso. eff. func: Θ(p²)

Matrix-vector mult. for unstructured matrix

- n x n matrix, m = avg(#nz) per row, mn/p elements per processor
- E-I format: row blocking of VAL and J
- T_p = t_a mn/p + t_s log p + t_w n = Θ(T_s)!
- Similar problem!!
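To make the two communication volumes concrete, the following Python sketch counts how many vector elements a processor must receive under a striped partition and under a square-block partition of the grid. It assumes a side x side grid (n = side * side) with the 5-point neighbor structure of the Laplace example; all function names and the rank numbering are hypothetical choices for this illustration.

```python
def neighbors(i, side):
    """Grid neighbors of point i on a side x side grid numbered row by row."""
    row, col = divmod(i, side)
    out = []
    if row > 0:
        out.append(i - side)
    if col > 0:
        out.append(i - 1)
    if col < side - 1:
        out.append(i + 1)
    if row < side - 1:
        out.append(i + side)
    return out


def halo(owned, side):
    """Vector elements needed by the owned rows but stored on other processors."""
    owned = set(owned)
    needed = {j for i in owned for j in neighbors(i, side)}
    return needed - owned


def stripe(rank, p, side):
    """Block-striped partition: n/p consecutive matrix rows per processor."""
    size = side * side // p
    return range(rank * size, (rank + 1) * size)


def block(rank, q, side):
    """Square-block partition of the grid on a q x q processor grid."""
    b = side // q
    pr, pc = divmod(rank, q)
    return [r * side + c
            for r in range(pr * b, (pr + 1) * b)
            for c in range(pc * b, (pc + 1) * b)]


side = 6                                                 # n = 36 grid points
striped_halo = len(halo(stripe(1, 3, side), side))      # interior stripe, p = 3
blocked_halo = len(halo(block(4, 3, side), side))       # interior block, p = 9
```

For an interior stripe the count is 2 * side elements, while an interior 2 x 2 block needs only 4 * 2 elements, which matches the shape of the message-size terms in the two T_p expressions.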

Faster algorithm for unstructured matrices

- 2D-blocking of A
- Vector belongs to last processor column
- Alignment operation of x
- One-to-all broadcast of x in processor columns
- Single-node accumulation of sub-results
- T_p = t_a mn/p + t_s log p + (3/2) t_w n log p / √p
- Θ(T_s). Not very scalable or cost-optimal!

Sparse matrix and its graph representation

- Dependences between matrix elements are shown by the adjacency-matrix graph
- To minimize communication, the graph is partitioned using good heuristics (no details here!)
- For better algorithms some structure is required (e.g. symmetry)

Matrix-vector mult., unstructured band matrix

- w = band width
- Ellpack-Itpack format
- Exchange of vector elements is needed for doing the computations
- T_p = t_a mn/p + t_s wp/n + t_w w
- Cost-optimal for p = O(mn/w)
- Conclusion: worse scalability for large band width!

Iterative methods for solving Ax = b

- Start with an initial guess x_0 and generate a sequence of approximations x_k to the solution x
- In each iteration the matrix A is used in one or several matrix-vector multiply operations (sparse GEMV)
- The number of iterations needed depends on the method used, the properties of A and the required accuracy of the solution
- In practice, iteration terminates when the residual norm(b - A x_k), or some other measure of error, is as small as desired (<= tol)
- Other common operations: inner products (sdot), saxpy
- Performance analysis is typically done per iteration
- Unlike direct methods for solving Ax = b (LU), no fill-in is incurred
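The cost expressions for the unstructured matrix-vector product (row blocking of the E-I arrays versus 2D blocking of A) can be compared numerically with a small Python sketch. It assumes that log means log base 2 and that the last term of the 2D expression is divided by √p; the symbols t_a, t_s, t_w, m, n, p follow the slides, while the function names and the sample parameter values are hypothetical.

```python
import math


def t_row_blocking(n, m, p, t_a, t_s, t_w):
    """T_p = t_a*m*n/p + t_s*log p + t_w*n (row blocking of VAL and J)."""
    return t_a * m * n / p + t_s * math.log2(p) + t_w * n


def t_2d_blocking(n, m, p, t_a, t_s, t_w):
    """T_p = t_a*m*n/p + t_s*log p + (3/2)*t_w*n*log p / sqrt(p) (2D blocking)."""
    return (t_a * m * n / p
            + t_s * math.log2(p)
            + 1.5 * t_w * n * math.log2(p) / math.sqrt(p))


# Hypothetical machine and problem parameters, chosen only for illustration.
params = dict(n=10_000, m=10, t_a=1.0, t_s=10.0, t_w=1.0)
for p in (16, 64, 1024):
    print(p, t_row_blocking(p=p, **params), t_2d_blocking(p=p, **params))
```

Under these assumptions the 2D term only drops below t_w n once (3/2) log p / √p < 1, so the advantage of 2D blocking shows up for larger processor counts.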

Jacobi method

- Consider A = D + M, where D is diagonal and M the rest
- Jacobi's method: x_{k+1} = D^{-1}(b - M x_k)
- Strict diagonal dominance of A, |a_ii| > sum(|a_ij|), j ≠ i, for all i, guarantees convergence

Parallel Jacobi

- Figure annotations on the algorithm steps: "Requires communication!", "Performed in parallel without communication!!!", "Requires communication!"

Parallel Gauss-Seidel

- Gauss-Seidel reuses new values as soon as they have been computed and performs two Jacobi-steps in each iteration:
  A = D - L - M, D = diagonal, L = strict lower triangular, M = rest
  x_{k+1} = (D - L)^{-1}(M x_k + b)
- Intuitively, in Gauss-Seidel the computation of x must be done sequentially, since x[i] depends on x[0], x[1], ..., x[i-1]
- If A is sparse, x[i] does not depend on all preceding x-values
- For the block-tridiagonal Laplace matrix: computation of x[i] depends on x[i-1] and x[i-√n]
- Since x[i-1] is computed just before x[i], these computations cannot be parallelized (depending on the row ordering!)
- Solution: compute independent x-values (non-neighbors) in parallel!
  - if A[i,j] = 0 then x[i] has no dependency on x[j]
  - x[i] can be computed as soon as all x[j] with j < i and A[i,j] ≠ 0 have been computed
  - several x-values can be computed in parallel

The row ordering impacts the dependences

  rows 0-3:    c0 d0 e0 | b1 c1 d1 e1 | b2 c2 d2 e2 | b3 c3 e3
  rows 4-7:    a4 c4 d4 e4 | a5 b5 c5 d5 e5 | a6 b6 c6 d6 e6 | a7 b7 c7 e7
  rows 8-11:   a8 c8 d8 e8 | a9 b9 c9 d9 e9 | a10 b10 c10 d10 e10 | a11 b11 c11 e11
  rows 12-15:  a12 c12 d12 | a13 b13 c13 d13 | a14 b14 c14 d14 | a15 b15 c15
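The difference between the two sweeps can be sketched in a few lines of Python. This is a serial, dense-matrix illustration only: the function names, the infinity-norm stopping test and the 2 x 2 example system are assumptions for this sketch, and a real sparse code would loop over stored nonzeros instead.

```python
def residual_norm(A, b, x):
    """Infinity norm of the residual b - A x."""
    return max(abs(bi - sum(aij * xj for aij, xj in zip(row, x)))
               for row, bi in zip(A, b))


def jacobi(A, b, tol=1e-10, max_iter=1000):
    """Jacobi: every x[i] is computed from the previous iterate only."""
    n = len(b)
    x = [0.0] * n
    for k in range(max_iter):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]      # all n updates are independent
        if residual_norm(A, b, x) <= tol:
            return x, k + 1
    return x, max_iter


def gauss_seidel(A, b, tol=1e-10, max_iter=1000):
    """Gauss-Seidel: x[i] reuses x[0..i-1] already updated in this sweep."""
    n = len(b)
    x = [0.0] * n
    for k in range(max_iter):
        for i in range(n):           # in-place update creates the dependences
            x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        if residual_norm(A, b, x) <= tol:
            return x, k + 1
    return x, max_iter


# Hypothetical strictly diagonally dominant system with solution (1, 2).
A = [[4.0, 1.0],
     [2.0, 5.0]]
b = [6.0, 12.0]
x_jacobi, it_jacobi = jacobi(A, b)
x_gs, it_gs = gauss_seidel(A, b)
```

The list comprehension in jacobi reads only the old vector, which is why its components can be computed in parallel; the in-place loop in gauss_seidel is what introduces the ordering dependences discussed above.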

Red-black ordering

- Gauss-Seidel with implicit red-black reordering: first the red points are computed, then the black points.

Multi-colored ordering of matrices

- Generalization of Gauss-Seidel

Conjugate gradient (CG) method

- Most used method for iterative solution of Ax = b when A is symmetric (A = A^T) and positive definite (x^T A x > 0 for all vectors x ≠ 0)
- Finds the minimum of q(x) = (1/2) x^T A x - x^T b
- The gradient (derivative) of q(x) is Ax - b (= 0 at the minimum point)
- In iteration k, the search direction p_k and a step length σ_k that minimizes q along p_k are computed
- New x-vector is computed: x_k = x_{k-1} + σ_k p_k
- Residual is updated: r_k = r_{k-1} - σ_k A p_k
- The iteration is finished when the residual is small enough!

Parallel conjugate gradient (PCG) method

- Ingredients in PCG:
  - SAXPY (single-precision a x plus y)
  - Inner products
  - Matrix-vector multiply operations
  - Solution of linear systems (preconditioned CG)
- Parallelization is done with the methods mentioned (or not mentioned!)
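The CG updates for x_k and r_k can be put together into a short serial sketch. The code below uses the standard unpreconditioned CG recurrences; the step length formula σ_k = (r·r)/(p·Ap) and the direction update p = r + β p are the textbook choices and are not spelled out on the slides, and the function names and the 2 x 2 example are assumptions for illustration.

```python
import math


def matvec(A, v):
    """Dense matrix-vector product; a sparse code would use CSR or E-I here."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Inner product (the sdot building block)."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Unpreconditioned CG for symmetric positive definite A, starting at x = 0."""
    x = [0.0] * len(b)
    r = list(b)                      # r_0 = b - A x_0 with x_0 = 0
    p = list(r)                      # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rs_old) <= tol:             # residual small enough
            break
        Ap = matvec(A, p)                        # one matrix-vector multiply
        sigma = rs_old / dot(p, Ap)              # step length along p
        x = [xi + sigma * pi for xi, pi in zip(x, p)]        # saxpy
        r = [ri - sigma * api for ri, api in zip(r, Ap)]     # saxpy
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Hypothetical symmetric positive definite example with solution (1/11, 7/11).
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Per iteration the sketch performs one matrix-vector multiply, two inner products and three saxpy-like vector updates, which are exactly the building blocks listed as PCG ingredients.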


Finite element method (FEM)

- Compute approximate numerical solutions to PDEs over a discretized domain
- Unlike in the finite difference (FD) grid, a grid point exchanges information with all grid points with which it shares an element (in total 9 including itself)
- The stiffness matrix A is derived by computing a set of integrals over the elements of the finite element graph (A[i,j] ≠ 0 if grid points i and j share an element)
- Ax = b, where b is the force vector
- In most applications the graph is quite irregular => unstructured sparse matrix

Stiffness matrix

- Computation of the stiffness matrix A and the force vector b is relatively cheap and can be done locally by the processor that owns the respective grid points: computation of A is trivial to parallelize
- The linear system is large and sparse => solving it is the most computationally expensive phase of the FEM
- Assume Ax = b is solved using the CG method:
  - SAXPY: no communication overhead
  - Dot product: O(log p) with cut-through (CT) routing, p = # processors used
  - Matrix-vector multiply: communication depends on the number of grid points that the processor holds and which share element(s) with another processor
- Key issue: minimize load imbalance and maximize computational intensity (#flops / memory transfer)
- Performance is determined by how the computational grid is partitioned!

1-dimensional partitioning of grid graphs
2-dimensional partitioning of grid graphs

- Optimal partitioning is NP-hard!
- Vertical and horizontal partitionings are overlapped
- Does not give the same number of nodes per partition
- Balance the load between partitions: transfer nodes from heavily loaded to lightly loaded partitions
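The "9 including itself" count for the stiffness matrix can be checked with a small Python sketch. It assumes a regular mesh of quadrilateral elements on a side x side grid of points numbered row by row (the element type is an assumption, as are all names); for such a mesh it builds the nonzero pattern implied by the rule "A[i,j] is nonzero if grid points i and j share an element".

```python
def fem_pattern(side):
    """Map each grid point to the set of points it shares a quad element with."""
    pattern = {i: set() for i in range(side * side)}
    for er in range(side - 1):            # element row
        for ec in range(side - 1):        # element column
            corners = [er * side + ec, er * side + ec + 1,
                       (er + 1) * side + ec, (er + 1) * side + ec + 1]
            for i in corners:
                pattern[i].update(corners)   # all corners couple to each other
    return pattern


pattern = fem_pattern(4)
interior_nnz = len(pattern[5])   # point (1, 1): belongs to four elements
edge_nnz = len(pattern[1])       # point (0, 1): belongs to two elements
corner_nnz = len(pattern[0])     # point (0, 0): belongs to one element
```

Since each element only touches its own corner points, the pattern can be assembled element by element by whichever processor owns those points, in line with the remark that computing A is easy to parallelize.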

Block partitioning of arbitrary graphs

- The graph is partitioned in levels
- Processors are assigned nodes according to the level partition
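One way to realize a level partition is a breadth-first search from a start node, followed by handing out the nodes in level order in equal-sized chunks. The Python sketch below illustrates that idea under stated assumptions: the graph is connected, the levels are BFS levels, and the names bfs_order and level_partition as well as the example graph are hypothetical; the slides do not specify these details.

```python
from collections import deque


def bfs_order(adj, start):
    """Return the nodes in breadth-first (level by level) order and their levels."""
    level = {start: 0}
    order = [start]
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                order.append(v)
                queue.append(v)
    return order, level


def level_partition(adj, start, p):
    """Assign nodes to p processors in level order, ceil(n/p) nodes each."""
    order, _ = bfs_order(adj, start)
    size = -(-len(order) // p)            # ceiling division
    return [order[k * size:(k + 1) * size] for k in range(p)]


# Hypothetical example: a path graph 0 - 1 - 2 - 3 - 4 - 5 on 3 processors.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
parts = level_partition(adj, 0, 3)
```

Because neighboring nodes end up in the same or adjacent levels, most edges stay inside a partition or connect consecutive partitions, which keeps the vector exchange in the matrix-vector multiply local.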
