COSC6365. Introduction to HPC. Lecture 21. Lennart Johnsson Department of Computer Science



Introduction to HPC Lecture 21 Department of Computer Science Most slides from UC Berkeley CS 267 Spring 2011, Lecture 12, Dense Linear Algebra (part 2), Parallel Gaussian Elimination.

Jim Demmel

Dense Matrix Problem Example: Computational Electromagnetics
- Discretize the surface into triangular facets using standard modeling tools
- The amplitudes of the currents on the surface are the unknowns
- The integral equation is discretized into a set of linear equations
[Image: NW Univ. Comp. Electromagnetics Laboratory]

Computational Electromagnetics
Solve Ax=b
- Developed during the 1980s, driven by defense applications
  - Determine the RCS (radar cross section) of an airplane
  - Reduce the signature of the plane (stealth technology)
- Other applications are antenna design and medical equipment
- Two fundamental numerical approaches:
  - MOM, method of moments (frequency domain): large dense matrices
  - Finite differences (time domain): even larger sparse matrices

Computational Electromagnetics: Method of Moments (MOM)
After discretization the integral equation has the form
    A x = b
where A is the (dense) impedance matrix, x is the unknown vector of amplitudes, and b is the excitation vector.
(See Cwik, Patterson, and Scott, "Electromagnetic Scattering on the Intel Touchstone Delta", IEEE Supercomputing 92, pp )

Computational Electromagnetics (MOM)
The main steps in the solution process are:
- Fill: computing the matrix elements of A
- Factor: factoring the dense matrix A
- Solve: solving for one or more excitations b
- Field Calc: computing the fields scattered from the object

Analysis of MOM for Parallel Implementation
Task          Work     Parallelism         Parallel Speed
Fill          O(n^2)   embarrassing        low
Factor        O(n^3)   moderately diff.    very high
Solve         O(n^2)   moderately diff.    high
Field Calc.   O(n)     embarrassing        high

Parallel Implementation Results on Intel Delta
Task                                     Time (hours)
Fill (compute n^2 matrix entries)        9.20   (embarrassingly parallel but slow)
Factor (Gaussian Elimination, O(n^3))    8.25   (good parallelism with right algorithm)
Solve (O(n^2))                           2.17   (reasonable parallelism with right algorithm)
Field Calc. (O(n))                       0.12   (embarrassingly parallel and fast)

The problem solved was for a matrix of size 48, Gflops for Factor - The world record in HPL 59.7 GF/s World record June

Out of Core LU Factor and Solve
CM-5, 64 nodes, 8 GF/s peak, complex data, scalable disk array
N    Disks    Factor         Solve (1024 RHS)    Total
              Hrs   GF/s     Hrs   GF/s          Hrs   GF/s

Review of Matrix Multiplication
Goal: multiply n x n matrices, C = A B, using O(n^3) arithmetic operations while minimizing data movement.
- Sequential
  - Assume fast memory of size M < 3n^2; count slow memory references
  - Thm: need Ω(n^3 / M^(1/2)) slow memory references and Ω(n^3 / M^(3/2)) messages
  - Attainable using blocked or recursive matrix multiply
- Parallel
  - Assume P processors, O(n^2 / P) data per processor
  - Thm: need Ω(n^2 / P^(1/2)) words sent and Ω(P^(1/2)) messages
  - Attainable by Cannon, nearly by SUMMA
    - SUMMA is used in practice (PBLAS)
  - With c copies of the data: c^(1/2) times fewer words, c^(3/2) times fewer messages
- Which other linear algebra problems can we do with as little data movement?
  - Today: solve Ax=b in detail, summarize what's known, open

Why dense A, as opposed to sparse A?
- Many large matrices are sparse, but
  - Dense algorithms are easier to understand
  - Some applications yield large dense matrices
  - LINPACK Benchmark: "How fast is your computer?" = "How fast can you solve dense Ax=b?"
  - Large sparse matrix algorithms often yield smaller (but still large) dense problems
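The blocked matrix multiply named in the sequential case can be sketched as follows. This is a minimal pure-Python illustration, not the code behind the slides: the function name `blocked_matmul` and the block-size parameter `b` are assumptions made here. The idea is that three b x b blocks (one each from A, B, and C) should fit in fast memory together, so b would be chosen on the order of (M/3)^(1/2).

```python
def blocked_matmul(A, B, b):
    """Return C = A * B for square n x n lists of lists, working on b x b blocks.

    b is a hypothetical tuning parameter: ideally three b x b blocks fit in
    fast memory at once, so each block is loaded once per block product.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):          # block row of C
        for jj in range(0, n, b):      # block column of C
            for kk in range(0, n, b):  # block index along the inner dimension
                # C[ii.., jj..] += A[ii.., kk..] * B[kk.., jj..]
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        a_ik = A[i][k]
                        for j in range(jj, min(jj + b, n)):
                            C[i][j] += a_ik * B[k][j]
    return C
```

In pure Python the blocking brings no speedup by itself; the sketch only shows the loop structure that a cache-aware implementation in a compiled language would follow.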

[Figure: Top500 system performance evolution; performance doubling period on average]

The Connection Machine

Gaussian Elimination (GE) for solving Ax=b
- Add multiples of each row to later rows to make A upper triangular
- Solve the resulting triangular system Ux = c by substitution

    ... for each column i
    ... zero it out below the diagonal by adding multiples of row i to later rows
    for i = 1 to n-1
        ... for each row j below row i
        for j = i+1 to n
            ... add a multiple of row i to row j
            tmp = A(j,i);
            for k = i to n
                A(j,k) = A(j,k) - (tmp/A(i,i)) * A(i,k)

[Figure: structure of A after i=1, after i=2, after i=3, and after i=n]
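The two steps above (eliminate, then substitute) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names are invented here, indices are 0-based rather than the slide's 1-based ones, and no pivoting is done, so every diagonal entry A[i][i] is assumed to be nonzero when it is used as a divisor.

```python
def gaussian_eliminate(A, b):
    """Make A upper triangular in place, applying the same row operations to b.

    Assumes every pivot A[i][i] is nonzero (no pivoting, as on the slide).
    """
    n = len(A)
    for i in range(n - 1):             # for each column i
        for j in range(i + 1, n):      # for each row j below row i
            tmp = A[j][i]              # saved because A[j][i] is overwritten at k = i
            for k in range(i, n):
                A[j][k] -= (tmp / A[i][i]) * A[i][k]
            b[j] -= (tmp / A[i][i]) * b[i]


def back_substitute(U, c):
    """Solve the upper triangular system U x = c by substitution."""
    n = len(U)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (c[i] - s) / U[i][i]
    return x


def solve(A, b):
    """Solve A x = b on copies of the inputs, leaving the caller's data intact."""
    A = [row[:] for row in A]
    b = b[:]
    gaussian_eliminate(A, b)
    return back_substitute(A, b)
```

For example, `solve([[2, 1, -1], [-3, -1, 2], [-2, 1, 2]], [8, -11, -3])` gives a result close to `[2, 3, -1]`.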

Refine GE Algorithm (1)
Initial version:
    ... for each column i
    ... zero it out below the diagonal by adding multiples of row i to later rows
    for i = 1 to n-1
        ... for each row j below row i
        for j = i+1 to n
            ... add a multiple of row i to row j
            tmp = A(j,i);
            for k = i to n
                A(j,k) = A(j,k) - (tmp/A(i,i)) * A(i,k)

Remove the computation of the constant tmp/A(i,i) from the inner loop:
    for i = 1 to n-1
        for j = i+1 to n
            m = A(j,i)/A(i,i)
            for k = i to n
                A(j,k) = A(j,k) - m * A(i,k)

[Figure: matrix with row i, row j, and the multiplier m marked]

Refine GE Algorithm (2)
Last version:
    for i = 1 to n-1
        for j = i+1 to n
            m = A(j,i)/A(i,i)
            for k = i to n
                A(j,k) = A(j,k) - m * A(i,k)

Don't compute what we already know: zeros below the diagonal in column i.
    for i = 1 to n-1
        for j = i+1 to n
            m = A(j,i)/A(i,i)
            for k = i+1 to n
                A(j,k) = A(j,k) - m * A(i,k)

[Figure: matrix with row i, row j, and m marked; "Do not compute zeros"]

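Refine GE Algorithm (3)
Last version:
    for i = 1 to n-1
        for j = i+1 to n
            m = A(j,i)/A(i,i)
            for k = i+1 to n
                A(j,k) = A(j,k) - m * A(i,k)

Store the multipliers m below the diagonal, in the zeroed entries, for later use:
    for i = 1 to n-1
        for j = i+1 to n
            A(j,i) = A(j,i)/A(i,i)
            for k = i+1 to n
                A(j,k) = A(j,k) - A(j,i) * A(i,k)

[Figure: matrix with row i and row j marked; "Store m here"]

Refine GE Algorithm (4)
Last version:
    for i = 1 to n-1
        for j = i+1 to n
            A(j,i) = A(j,i)/A(i,i)
            for k = i+1 to n
                A(j,k) = A(j,k) - A(j,i) * A(i,k)

Split the loop:
    for i = 1 to n-1
        for j = i+1 to n
            A(j,i) = A(j,i)/A(i,i)
        for j = i+1 to n
            for k = i+1 to n
                A(j,k) = A(j,k) - A(j,i) * A(i,k)

[Figure: matrix with row i and row j marked; "Store all m's here before updating rest of matrix"]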
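The split-loop version can be sketched in Python as below, together with a small helper that reads the two triangular factors back out of the overwritten matrix. The names `lu_in_place` and `unpack_lu` are assumptions made for this illustration, indices are 0-based, and there is still no pivoting, so nonzero pivots are assumed. The stored multipliers form a unit lower triangular L, the remaining entries form the upper triangular U, and their product reproduces the original A.

```python
def lu_in_place(A):
    """Overwrite A with multipliers below the diagonal and U on and above it.

    Assumes every pivot A[i][i] is nonzero (no pivoting).
    """
    n = len(A)
    for i in range(n - 1):
        # First loop: store all multipliers for column i
        for j in range(i + 1, n):
            A[j][i] = A[j][i] / A[i][i]
        # Second loop: update the rest of the matrix using the stored multipliers
        for j in range(i + 1, n):
            for k in range(i + 1, n):
                A[j][k] -= A[j][i] * A[i][k]
    return A


def unpack_lu(A):
    """Split the overwritten matrix into unit lower triangular L and upper triangular U."""
    n = len(A)
    L = [[A[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    U = [[A[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return L, U
```

Keeping the multipliers costs no extra storage, and it lets the same factorization be reused for additional right-hand sides b.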

Refine GE Algorithm (5)
Last version:
    for i = 1 to n-1
        for j = i+1 to n
            A(j,i) = A(j,i)/A(i,i)
        for j = i+1 to n
            for k = i+1 to n
                A(j,k) = A(j,k) - A(j,i) * A(i,k)

Express using matrix operations (BLAS):
    for i = 1 to n-1
        A(i+1:n,i) = A(i+1:n,i) * ( 1 / A(i,i) )    ... BLAS 1 (scale a vector)
        A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)    ... BLAS 2 (rank-1 update)
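The BLAS formulation can be mirrored in Python by giving each of the two operations its own routine. The helpers `scale_column` and `rank1_update` below are hypothetical pure-Python stand-ins, not calls into a real BLAS library; in a production code they would correspond to a BLAS 1 vector scaling (such as DSCAL) and a BLAS 2 rank-1 update (such as DGER). Indices are 0-based and nonzero pivots are assumed.

```python
def scale_column(A, i):
    """BLAS-1-style step: A(i+1:n, i) = A(i+1:n, i) * (1 / A(i,i))."""
    inv = 1.0 / A[i][i]
    for j in range(i + 1, len(A)):
        A[j][i] *= inv


def rank1_update(A, i):
    """BLAS-2-style step: A(i+1:n, i+1:n) -= A(i+1:n, i) * A(i, i+1:n)."""
    n = len(A)
    row_i = A[i]
    for j in range(i + 1, n):
        m = A[j][i]                    # multiplier stored by scale_column
        row_j = A[j]
        for k in range(i + 1, n):
            row_j[k] -= m * row_i[k]


def lu_blas_style(A):
    """Gaussian elimination without pivoting as n-1 scale and rank-1 update steps."""
    for i in range(len(A) - 1):
        scale_column(A, i)
        rank1_update(A, i)
    return A
```

For example, `lu_blas_style([[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]])` overwrites the matrix with `[[2.0, 1.0, 1.0], [2.0, 1.0, 1.0], [4.0, 3.0, 2.0]]`.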


CS 294-73 Software Engineering for Scientific Computing, Lecture 10: Dense Linear Algebra
Slides from James Demmel and Kathy Yelick
Outline
- What is Dense Linear Algebra?
- Where does the time go in an algorithm?