Parallel Implementations of Gaussian Elimination

Size: px

Start display at page:

Download "Parallel Implementations of Gaussian Elimination"

Thomas Goodwin
6 years ago
Views:

1 s of Western Michigan University January 27, 2012 CS 6260: in Parallel

2 Linear systems of equations General form of a linear system of equations is given by a 11 x a 1n x n = b 1 a 21 x a 2n x n = b 2.. a m1 x a mn x n = b m where a ij s and b i s are known and we are solving for x i s. CS 6260: in Parallel

3 Ax = b More compactly, we can rewrite system of linear equations in the form Ax = b where A = a 11 a a 1n a 21 a a 2n a m1 a m2... a mn, x = x 1 x 2. x n, b = b 1 b 2. b m CS 6260: in Parallel

4 Why study solutions of Ax = b 1 One of the most fundamental problems in the field of scientific computing 2 Arises in many applications: CS 6260: in Parallel

5 Why study solutions of Ax = b 1 One of the most fundamental problems in the field of scientific computing 2 Arises in many applications: Chemical engineering Interpolation Structural analysis Regression Analysis Numerical ODEs and PDEs CS 6260: in Parallel

6 Methods for solving Ax = b 1 Direct methods 2 Iterative methods CS 6260: in Parallel

7 Methods for solving Ax = b 1 Direct methods obtain the exact solution (in real arithmetic) in finitely many operations 2 Iterative methods generate sequence of approximations that converge in the limit to the solution CS 6260: in Parallel

8 Methods for solving Ax = b 1 Direct methods obtain the exact solution (in real arithmetic) in finitely many operations Gaussian elimination (LU factorization) QR factorization WZ factorization 2 Iterative methods generate sequence of approximations that converge in the limit to the solution Jacobi iteration Gauss-Seidal iteration SOR method (successive over-relaxation) CS 6260: in Parallel

9 Methods for solving Ax = b 1 Direct methods obtain the exact solution (in real arithmetic) in finitely many operations Gaussian elimination (LU factorization) QR factorization WZ factorization 2 Iterative methods generate sequence of approximations that converge in the limit to the solution Jacobi iteration Gauss-Seidal iteration SOR method (successive over-relaxation) CS 6260: in Parallel

10 Motivation When solving Ax = b we will assume throughout this presentation that A is non-singular and A and b are known a 11 x a 1n x n = b 1 a 21 x a 2n x n = b 2.. a n1 x a nn x n = b n. CS 6260: in Parallel

11 Motivation Assuming that a 11 =0wefirstsubtracta 21 /a 11 times the first equation from the second equation to eliminate the coefficient x 1 in the second equation, and so on until the coefficients of x 1 in the last n 1 rows have all been eliminated. This gives the modified system of equations CS 6260: in Parallel

12 Motivation Assuming that a 11 =0wefirstsubtracta 21 /a 11 times the first equation from the second equation to eliminate the coefficient x 1 in the second equation, and so on until the coefficients of x 1 in the last n 1 rows have all been eliminated. This gives the modified system of equations where a (1) ij a 11 x 1 + a 12 x a 1n x n = b 1 a (1) 22 x a (1) 2n x n = b (1) 2... a (1) n1 x a nn (1) x n = b n (1), a i1 = a ij a 1j b (1) a i1 i = b i b 1 i, j =1, 2,...,n. a 11 a 11 CS 6260: in Parallel

13 (Forward Reduction) Applying the same process the last n 1 equations of the modified system to eliminate coefficients of x 2 in the last n 2 equations, and so on, until the entire system has been reduced to the (upper) triangular form a 11 a 12 a 1n a (1) 22 a (1) 2n.... a nn (n 1) x 1 x 2 x n = b 1 b (1) 2. b (n 1) n. CS 6260: in Parallel

14 (Forward Reduction) Applying the same process the last n 1 equations of the modified system to eliminate coefficients of x 2 in the last n 2 equations, and so on, until the entire system has been reduced to the (upper) triangular form a 11 a 12 a 1n a (1) 22 a (1) 2n.... a nn (n 1) x 1 x 2 x n = b 1 b (1) 2. b (n 1) n. The superscripts indicate the number of times the elements had to be changed. CS 6260: in Parallel

15 Example Perform the forward reduction the the system given below x x 2 = x 3 1 CS 6260: in Parallel

16 Example Perform the forward reduction the the system given below x x 2 = x 3 1 We start by writing down the augmented matrix for the given system: CS 6260: in Parallel

17 Example We perform appropriate row operations to obtain: CS 6260: in Parallel

18 Example We perform appropriate row operations to obtain: R 2 ( 2 4)R 1 R 2 R 3 ( 1 4 )R 1 R CS 6260: in Parallel

19 Example We perform appropriate row operations to obtain: R 2 ( 2 4)R 1 R 2 R 3 ( 1 4 )R 1 R 3 R 3 ( )R 2 R CS 6260: in Parallel

20 Example Note that the row operations used to eliminate x 1 from the second and the third equations are equivalent to multiplying on the left the augmented matrix: = CS 6260: in Parallel

21 Example Note that the row operations used to eliminate x 1 from the second and the third equations are equivalent to multiplying on the left the augmented matrix: L = CS 6260: in Parallel

22 Example Note that the row operations used to eliminate x 1 from the second and the third equations are equivalent to multiplying on the left the augmented matrix: L L 1 [A b] = CS 6260: in Parallel

23 Example L L 1 L = L2 L 1 Thus, we can write [A b] = (L 2 L 1 ) A = L A = U CS 6260: in Parallel

24 LU factorization Motivation In general, we can think of row operations as multiplying with matrices on the left, thus the i th elimination step is equivalent as multiplication on the left by 1... L i = 1 l i+1,i..... l n,i 1 CS 6260: in Parallel

25 LU factorization Motivation Continuing in this fashion we obtain LAx = Lb, L = Ln 1 L 2 L 1, CS 6260: in Parallel

26 LU factorization Motivation Continuing in this fashion we obtain LAx = Lb, L = Ln 1 L 2 L 1, A = L 1 U = LU. CS 6260: in Parallel

27 LU factorization Motivation Continuing in this fashion we obtain A = L 1 U = LU. Thus, the Gaussian elimination algorithm for solving Ax = b is mathematically equivalent to the three-step process: 1. Factor A = LU 2. Solve (forward substitution) Ly = b 3. Solve (back substitution) Ux = y. CS 6260: in Parallel

28 LU factorization Motivation Continuing in this fashion we obtain A = L 1 U = LU. Thus, the Gaussian elimination algorithm for solving Ax = b is mathematically equivalent to the three-step process: 1. Factor A = LU 2. Solve (forward substitution) Ly = b 3. Solve (back substitution) Ux = y. Order of operations: n n2 n 3 multiplications/divisions n n2 2 5n 6. CS 6260: in Parallel

29 Need for partial pivoting Apply LU factorization without pivoting to A = 1 1 in three-decimal-digit floating point arithmetic. CS 6260: in Parallel

30 Need for partial pivoting A = Solution: L and U are easily obtainable and are given by: 1 0 L = fl(1/10 4, fl(1/10 4 ) rounds to 10 4, ) U = 0 fl(1 10 4, fl( ) rounds to ) so LU = = but A =. 1 1 CS 6260: in Parallel

31 Need for partial pivoting A = but LU = Remark 1: Note that original a 22 has been entirely lost from the computation by subtracting 10 4 from it. Thus if we were to use this LU factorization in order to solve a system there would be no way to guarantee an accurate answer. This is called numerical instability. CS 6260: in Parallel

32 Need for partial pivoting A = but LU = Remark 1: Note that original a 22 has been entirely lost from the computation by subtracting 10 4 from it. Thus if we were to use this LU factorization in order to solve a system there would be no way to guarantee an accurate answer. This is called numerical instability. Remark 2: Suppose we attempted to solve system Ax =[1, 2] T for x using this LU decomposition. The correct answer is x [1, 1] T.Instead,ifweweretostickwiththe three-digit-floating point arithmetic we would get an answer x =[0, 1] T which is completely erroneous. CS 6260: in Parallel

33 Motivation CS 6260: in Parallel

Row-Oriented Algorithm Pipelined Implementation Row-Oriented Algorithm 1 Determination of the local pivot element, 2 Determination of the global pivot element,

34 Row-Oriented Algorithm Pipelined Implementation Row-Oriented Algorithm 1 Determination of the local pivot element, 2 Determination of the global pivot element, 3 Exchange of the pivot row, 4 Distribution of the pivot row, 5 Computation of the elimination factors, 6 Computation of the matrix elements. CS 6260: in Parallel

35 Row-Oriented Algorithm Row-Oriented Algorithm Pipelined Implementation Remark 1: The computation of the solution vector x in the backward substitution is inherently serial, since the values of x k depend on each other and are computed one after another. Thus, in step k the processor P q owning row k computers the value of x k and sends the value to all other processors by a single broadcast operation. CS 6260: in Parallel

36 Row-Oriented Algorithm Row-Oriented Algorithm Pipelined Implementation Remark 1: The computation of the solution vector x in the backward substitution is inherently serial, since the values of x k depend on each other and are computed one after another. Thus, in step k the processor P q owning row k computers the value of x k and sends the value to all other processors by a single broadcast operation. Remark 2: Note that the data distribution is not quite adequate. For example, suppose we are at the i th step and that i > m n/p where m is a natural number. In that case, processors p 0, p 1,...,p m 1 are idle since all their work is done until the back substitution step at the very end. Hence there is an issue with load balancing. CS 6260: in Parallel

37 Row-Oriented Algorithm Row-Oriented Algorithm Pipelined Implementation Remark 2: Note that the data distribution is not quite adequate. For example, suppose we are at the i th step and that i > m n/p where m is a natural number. In that case, processors p 0, p 1,...,p m 1 are idle since all their work is done until the back substitution step at the very end. Hence there is an issue with load balancing. block-cyclic row distribution CS 6260: in Parallel

38 Row-Oriented Algorithm Row-Oriented Algorithm Pipelined Implementation Remark 3: We could also consider column-oriented data distribution. In that case, at the k th step the processor that owns k th column would have all needed data to compute the new pivot. On the one hand, this would reduce communication between processors that we had when considering row-oriented data distribution. On the other hand, with column orientation pivot determination is all done serially. Thus, in case when n >> p (which is often case) row oriented data distribution might be more advantageous since the pivot determination is done in parallel CS 6260: in Parallel

39 Pipelined Implementation Row-Oriented Algorithm Pipelined Implementation 1 Challenges and possible approaches Send ahead strategy CS 6260: in Parallel

40 Pipelined Implementation Row-Oriented Algorithm Pipelined Implementation 1 Challenges and possible approaches Send ahead strategy Challenges with pivoting CS 6260: in Parallel

41 Pipelined Implementation Row-Oriented Algorithm Pipelined Implementation 1 Challenges and possible approaches Send ahead strategy Challenges with pivoting Column pivoting CS 6260: in Parallel

42 Pipelined Implementation Row-Oriented Algorithm Pipelined Implementation 1 Challenges and possible approaches Send ahead strategy Challenges with pivoting Column pivoting 2 Advantages It allows asynchronous execution: processes can reduce their portions of the augmented matrix as soon as the pivot rows are available CS 6260: in Parallel

43 Pipelined Implementation Row-Oriented Algorithm Pipelined Implementation 1 Challenges and possible approaches Send ahead strategy Challenges with pivoting Column pivoting 2 Advantages It allows asynchronous execution: processes can reduce their portions of the augmented matrix as soon as the pivot rows are available It allows processes to overlap communication and computation times. CS 6260: in Parallel

44 Final Remarks Compute communication and computation times. CS 6260: in Parallel

45 Final Remarks Compute communication and computation times. Largely depends on what kind of distributed system used (usual configurations that have been studying are 1D and 2D meshes as well as hypercube). CS 6260: in Parallel

46 Final Remarks Compute communication and computation times. Largely depends on what kind of distributed system used (usual configurations that have been studying are 1D and 2D meshes as well as hypercube). Compute isoefficiency functions and determine scalability. CS 6260: in Parallel

47 Final Remarks Compute communication and computation times. Largely depends on what kind of distributed system used (usual configurations that have been studying are 1D and 2D meshes as well as hypercube). Compute isoefficiency functions and determine scalability. Row (column) oriented algorithms, stable but not scalable. Pipelined implementation is perfectly scalable. Structured systems Symmetric/Hermitian. Banded and upper (lower) triangular. Positive definite (CG method). CS 6260: in Parallel

48 Final Remarks Compute communication and computation times. Largely depends on what kind of distributed system used (usual configurations that have been studying are 1D and 2D meshes as well as hypercube). Compute isoefficiency functions and determine scalability. Row (column) oriented algorithms, stable but not scalable. Pipelined implementation is perfectly scalable. Structured systems Symmetric/Hermitian. Banded and upper (lower) triangular. Positive definite (CG method). Sparse systems and iterative methods (numerical PDEs) CS 6260: in Parallel

Dense Matrix Algorithms

Dense Matrix Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003. Topic Overview Matrix-Vector Multiplication