Numerical Algorithms

Chapter 10 Slide 464 Numerical Algorithms

Slide 465 Numerical Algorithms
Covered in the textbook:
   Matrix multiplication
   Solving a system of linear equations

Slide 466 Matrices - A Review
An n × m matrix (n rows, m columns):

$$A = \begin{bmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,m-2} & a_{0,m-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,m-2} & a_{1,m-1} \\ \vdots & & & & \vdots \\ a_{n-2,0} & a_{n-2,1} & \cdots & a_{n-2,m-2} & a_{n-2,m-1} \\ a_{n-1,0} & a_{n-1,1} & \cdots & a_{n-1,m-2} & a_{n-1,m-1} \end{bmatrix}$$

Slide 467 Matrix Addition
Involves adding corresponding elements of each matrix to form the result matrix. Given the elements of A as $a_{i,j}$ and the elements of B as $b_{i,j}$, each element of C is computed as

$$c_{i,j} = a_{i,j} + b_{i,j} \qquad (0 \le i < n,\ 0 \le j < m)$$

Slide 468 Matrix Multiplication
Multiplication of two matrices, A and B, produces the matrix C whose elements $c_{i,j}$ $(0 \le i < n,\ 0 \le j < m)$ are computed as follows:

$$c_{i,j} = \sum_{k=0}^{l-1} a_{i,k} b_{k,j}$$

where A is an n × l matrix and B is an l × m matrix.

Slide 469 Matrix multiplication, C = A × B
[Figure: row i of A and column j of B are multiplied element by element and the results summed to give element c_{i,j} of C.]

Slide 470 Matrix-Vector Multiplication, c = A × b
Matrix-vector multiplication follows directly from the definition of matrix-matrix multiplication by making B an n × 1 matrix (vector). The result is an n × 1 matrix (vector).
[Figure: row i of A is multiplied with vector b and the results summed to give element c_i of c.]

Slide 471 Relationship of Matrices to Linear Equations
A system of linear equations can be written in matrix form: Ax = b, where matrix A holds the constants a_{i,j}, x is the vector of unknowns, and b is the vector of constants b_i.

Slide 472 Implementing Matrix Multiplication - Sequential Code
Assume throughout that the matrices are square (n × n matrices). The sequential code to compute A × B could simply be

for (i = 0; i < n; i++)
   for (j = 0; j < n; j++) {
      c[i][j] = 0;
      for (k = 0; k < n; k++)
         c[i][j] = c[i][j] + a[i][k] * b[k][j];
   }

This algorithm requires n³ multiplications and n³ additions, leading to a sequential time complexity of O(n³). Very easy to parallelize.

Slide 473 Parallel Code
With n × n matrices, we can obtain:
   Time complexity of O(n²) with n processors. Each instance of the inner loop is independent and can be done by a separate processor.
   Time complexity of O(n) with n² processors. One element of A and B assigned to each processor. Cost optimal, since O(n³) = n × O(n²) = n² × O(n).
   Time complexity of O(log n) with n³ processors, by parallelizing the inner loop. Not cost-optimal, since O(n³) ≠ n³ × O(log n).
O(log n) is the lower bound for parallel matrix multiplication.
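As a concrete (non-textbook) illustration of the first scheme, here is a minimal C/OpenMP sketch in which each thread computes whole rows of C, giving the O(n²) time with n processors; OpenMP and the fixed size N are our assumptions, not the slides':

/* Hypothetical sketch: each row of C is independent, so the outer
   loop can be divided among threads (one "processor" per row). */
#include <omp.h>
#define N 512
double a[N][N], b[N][N], c[N][N];

void matmul_rows(void)
{
   int i, j, k;
   #pragma omp parallel for private(j, k)   /* rows distributed to threads */
   for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
         c[i][j] = 0.0;
         for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];   /* inner product, done serially */
      }
}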

Slide 474 Partitioning into Submatrices
Suppose the matrix is divided into s² submatrices. Each submatrix has n/s × n/s elements. Using the notation A_{p,q} for the submatrix in submatrix row p and submatrix column q:

for (p = 0; p < s; p++)
   for (q = 0; q < s; q++) {
      Cp,q = 0;                     /* clear elements of submatrix */
      for (r = 0; r < s; r++)       /* submatrix multiplication */
         Cp,q = Cp,q + Ap,r * Br,q; /* add to accumulating submatrix */
   }

The line Cp,q = Cp,q + Ap,r * Br,q; means multiply submatrices A_{p,r} and B_{r,q} using matrix multiplication and add to submatrix C_{p,q} using matrix addition. Known as block matrix multiplication.
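A sequential C sketch of the block algorithm above, with the submatrix operations written out; the storage layout (one global array per matrix) and the sizes N and S are illustrative assumptions:

/* Block matrix multiplication: an S x S grid of (N/S) x (N/S) blocks. */
#define N 8          /* matrices are N x N */
#define S 4          /* S x S submatrices  */
#define BS (N/S)     /* block size         */
double A[N][N], B[N][N], C[N][N];

void block_matmul(void)
{
   int p, q, r, i, j, k;
   for (p = 0; p < S; p++)
      for (q = 0; q < S; q++) {
         for (i = 0; i < BS; i++)            /* C[p][q] = 0 */
            for (j = 0; j < BS; j++)
               C[p*BS+i][q*BS+j] = 0.0;
         for (r = 0; r < S; r++)             /* C[p][q] += A[p][r] * B[r][q] */
            for (i = 0; i < BS; i++)
               for (j = 0; j < BS; j++)
                  for (k = 0; k < BS; k++)
                     C[p*BS+i][q*BS+j] +=
                        A[p*BS+i][r*BS+k] * B[r*BS+k][q*BS+j];
      }
}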

Slide 475 Block Matrix Multiplication
[Figure: submatrix row p of A and submatrix column q of B are multiplied and the submatrix products summed to form C_{p,q}.]

Slide 476 Submatrix multiplication
(a) Matrices: 4 × 4 matrices A and B, each partitioned into four 2 × 2 submatrices A_{0,0}, A_{0,1}, A_{1,0}, A_{1,1} and B_{0,0}, B_{0,1}, B_{1,0}, B_{1,1}.
(b) Multiplying to obtain C_{0,0}:

$$C_{0,0} = A_{0,0}B_{0,0} + A_{0,1}B_{1,0}$$

$$= \begin{bmatrix} a_{0,0} & a_{0,1} \\ a_{1,0} & a_{1,1} \end{bmatrix}\begin{bmatrix} b_{0,0} & b_{0,1} \\ b_{1,0} & b_{1,1} \end{bmatrix} + \begin{bmatrix} a_{0,2} & a_{0,3} \\ a_{1,2} & a_{1,3} \end{bmatrix}\begin{bmatrix} b_{2,0} & b_{2,1} \\ b_{3,0} & b_{3,1} \end{bmatrix}$$

$$= \begin{bmatrix} a_{0,0}b_{0,0}+a_{0,1}b_{1,0}+a_{0,2}b_{2,0}+a_{0,3}b_{3,0} & a_{0,0}b_{0,1}+a_{0,1}b_{1,1}+a_{0,2}b_{2,1}+a_{0,3}b_{3,1} \\ a_{1,0}b_{0,0}+a_{1,1}b_{1,0}+a_{1,2}b_{2,0}+a_{1,3}b_{3,0} & a_{1,0}b_{0,1}+a_{1,1}b_{1,1}+a_{1,2}b_{2,1}+a_{1,3}b_{3,1} \end{bmatrix} = C_{0,0}$$

Slide 477 Direct Implementation
One processor computes each element of C, so n² processors would be needed. One row of elements of A and one column of elements of B are needed by each processor; some of the same elements are sent to more than one processor. Can use submatrices.
[Figure: processor P_{i,j} receives row a[i][] of A and column b[][j] of B and computes c[i][j].]

Slide 478 Performance Improvement
Using tree construction, n numbers can be added in log n steps using n processors:
[Figure: the products a_{0,0}b_{0,0}, a_{0,1}b_{1,0}, a_{0,2}b_{2,0}, a_{0,3}b_{3,0} formed by P_0 to P_3 are summed pairwise in a tree to produce c_{0,0}.]
Computational time complexity of O(log n) using n³ processors.

Slide 479 Recursive Implementation
[Figure: A, B, and C are each divided into four submatrices (A_{pp}, A_{pq}, A_{qp}, A_{qq}, and likewise for B and C); the eight submatrix products are assigned to processors P_0 to P_7 and then summed pairwise.]
Apply the same algorithm on each submatrix recursively. An excellent algorithm for shared memory systems because of the locality of its operations.

Slide 480 Recursive Algorithm

mat_mult(App, Bpp, s)
{
   if (s == 1)                  /* if submatrix has one element */
      C = A * B;                /* multiply elements */
   else {                      /* continue to make recursive calls */
      s = s/2;                  /* no of elements in each row/column */
      P0 = mat_mult(App, Bpp, s);
      P1 = mat_mult(Apq, Bqp, s);
      P2 = mat_mult(App, Bpq, s);
      P3 = mat_mult(Apq, Bqq, s);
      P4 = mat_mult(Aqp, Bpp, s);
      P5 = mat_mult(Aqq, Bqp, s);
      P6 = mat_mult(Aqp, Bpq, s);
      P7 = mat_mult(Aqq, Bqq, s);
      Cpp = P0 + P1;            /* add submatrix products to */
      Cpq = P2 + P3;            /* form submatrices of final matrix */
      Cqp = P4 + P5;
      Cqq = P6 + P7;
   }
   return (C);                  /* return final matrix */
}

Slide 481 Mesh Implementations
   Cannon's algorithm
   Fox's algorithm (not in textbook but similar complexity)
   Systolic array
All involve processors arranged in a mesh, shifting elements of the arrays through the mesh and accumulating partial sums at each processor.

Slide 482 Mesh Implementations - Cannon's Algorithm
Uses a mesh of processors with wraparound connections (a torus) to shift the A elements (or submatrices) left and the B elements (or submatrices) up.
1. Initially, processor P_{i,j} has elements a_{i,j} and b_{i,j} (0 ≤ i < n, 0 ≤ j < n).
2. Elements are moved from their initial position to an aligned position. The complete ith row of A is shifted i places left and the complete jth column of B is shifted j places upward. This has the effect of placing the element a_{i,j+i} and the element b_{i+j,j} in processor P_{i,j}. These elements are a pair of those required in the accumulation of c_{i,j}.
3. Each processor, P_{i,j}, multiplies its elements.
4. The ith row of A is shifted one place left, and the jth column of B is shifted one place upward. This has the effect of bringing together the adjacent elements of A and B, which will also be required in the accumulation.
5. Each processor, P_{i,j}, multiplies the elements brought to it and adds the result to the accumulating sum.
6. Steps 4 and 5 are repeated until the final result is obtained (n − 1 shifts with n rows and n columns of elements).
A sequential C simulation of these index movements is sketched below.
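The sketch below simulates the algorithm on an n × n logical mesh with one element per "processor"; it is our illustration for checking the data movement, not a message-passing implementation, and all names are assumptions:

/* Sequential simulation of Cannon's algorithm (one element per cell). */
#define N 4
double a[N][N], b[N][N], c[N][N];

static void shift_row_left(double m[N][N], int i, int places) {
   for (int s = 0; s < places; s++) {
      double t = m[i][0];                     /* wraparound: leftmost */
      for (int j = 0; j < N-1; j++) m[i][j] = m[i][j+1];
      m[i][N-1] = t;                          /* reenters on the right */
   }
}
static void shift_col_up(double m[N][N], int j, int places) {
   for (int s = 0; s < places; s++) {
      double t = m[0][j];                     /* wraparound: topmost */
      for (int i = 0; i < N-1; i++) m[i][j] = m[i+1][j];
      m[N-1][j] = t;                          /* reenters at the bottom */
   }
}

void cannon(void)
{
   int i, j, step;
   for (i = 0; i < N; i++) shift_row_left(a, i, i);   /* step 2: align A */
   for (j = 0; j < N; j++) shift_col_up(b, j, j);     /* step 2: align B */
   for (i = 0; i < N; i++)                            /* step 3 */
      for (j = 0; j < N; j++) c[i][j] = a[i][j] * b[i][j];
   for (step = 1; step < N; step++) {                 /* steps 4-6: n-1 shifts */
      for (i = 0; i < N; i++) shift_row_left(a, i, 1);
      for (j = 0; j < N; j++) shift_col_up(b, j, 1);
      for (i = 0; i < N; i++)
         for (j = 0; j < N; j++) c[i][j] += a[i][j] * b[i][j];
   }
}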

Slide 483 Movement of A and B elements
[Figure: rows of A move left and columns of B move up through processor P_{i,j} of the mesh.]

Slide 484 Step 2 - Alignment of elements of A and B
[Figure: row i of A shifted i places left and column j of B shifted j places up, leaving a_{i,j+i} and b_{i+j,j} in processor P_{i,j}.]

Slide 485 Step 4 - One-place shift of elements of A and B
[Figure: each row of A shifts one place left and each column of B shifts one place up through processor P_{i,j}.]

Slide 486 Systolic array
[Figure: columns of B (b_{3,0} b_{2,0} b_{1,0} b_{0,0}, etc.) are pumped down and rows of A (a_{0,3} a_{0,2} a_{0,1} a_{0,0}, etc.) are pumped across, with a one-cycle delay between adjacent rows and columns; cell (i, j) accumulates c_{i,j}.]

Slide 487 Matrix-Vector Multiplication
[Figure: the same systolic pumping action, with the vector elements b_3 b_2 b_1 b_0 entering a single column of cells; cell i accumulates c_i.]

Slide 488 Solving a System of Linear Equations

$$\begin{aligned} a_{n-1,0}x_0 + a_{n-1,1}x_1 + a_{n-1,2}x_2 + \cdots + a_{n-1,n-1}x_{n-1} &= b_{n-1} \\ &\ \ \vdots \\ a_{2,0}x_0 + a_{2,1}x_1 + a_{2,2}x_2 + \cdots + a_{2,n-1}x_{n-1} &= b_2 \\ a_{1,0}x_0 + a_{1,1}x_1 + a_{1,2}x_2 + \cdots + a_{1,n-1}x_{n-1} &= b_1 \\ a_{0,0}x_0 + a_{0,1}x_1 + a_{0,2}x_2 + \cdots + a_{0,n-1}x_{n-1} &= b_0 \end{aligned}$$

which, in matrix form, is Ax = b. The objective is to find values for the unknowns x_0, x_1, …, x_{n-1}, given values for a_{0,0}, a_{0,1}, …, a_{n-1,n-1} and b_0, …, b_{n-1}.

Slide 489 Solving a System of Linear Equations
Dense matrices:
   Gaussian elimination - parallel time complexity O(n²)
Sparse matrices:
   By iteration - depends upon the iteration method and number of iterations, but typically O(log n)
      Jacobi iteration
      Gauss-Seidel relaxation (not good for parallelization)
      Red-black ordering
      Multigrid

Slide 490 Gaussian Elimination
Converts a general system of linear equations into a triangular system of equations, which can then be solved by back substitution. Uses the characteristic of linear equations that any row can be replaced by that row added to another row multiplied by a constant. Starts at the first row and works toward the bottom row. At the ith row, each row j below the ith row is replaced by row j + (row i)(−a_{j,i}/a_{i,i}). The constant used for row j is −a_{j,i}/a_{i,i}. This has the effect of making all the elements in the ith column below the ith row zero, because

$$a_{j,i} = a_{j,i} + a_{i,i}\left(-\frac{a_{j,i}}{a_{i,i}}\right) = 0$$

Slide 491 Gaussian elimination
[Figure: the algorithm steps through the columns; at column i, element a_{j,i} in each row j below row i is cleared to zero; the columns to the left have already been cleared.]

Slide 492 Partial Pivoting
If a_{i,i} is zero or close to zero, we will not be able to compute the quantity −a_{j,i}/a_{i,i}. The procedure must be modified by so-called partial pivoting: swap the ith row with the row below it that has the largest absolute element in the ith column, if there is one. (Reordering the equations does not affect the system.) In the following, we will not consider partial pivoting.
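Although the slides do not pursue it, the pivoting step itself is straightforward; a minimal C sketch (the function name and signature are our assumptions):

/* Partial pivoting for column i: find the row at or below row i with
   the largest |a[j][i]| and swap it with row i (also swapping b). */
#include <math.h>

void pivot(int n, double a[n][n], double b[n], int i)
{
   int j, k, p = i;
   for (j = i + 1; j < n; j++)             /* find largest pivot candidate */
      if (fabs(a[j][i]) > fabs(a[p][i]))
         p = j;
   if (p != i) {                           /* swap rows i and p */
      for (k = 0; k < n; k++) {
         double t = a[i][k]; a[i][k] = a[p][k]; a[p][k] = t;
      }
      double tb = b[i]; b[i] = b[p]; b[p] = tb;
   }
}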

Slide 493 Sequential Code
Without partial pivoting:

for (i = 0; i < n-1; i++)           /* for each row, except last */
   for (j = i+1; j < n; j++) {      /* step through subsequent rows */
      m = a[j][i]/a[i][i];          /* compute multiplier */
      for (k = i; k < n; k++)       /* last n-i elements of row j */
         a[j][k] = a[j][k] - a[i][k] * m;
      b[j] = b[j] - b[i] * m;       /* modify right side */
   }

The time complexity is O(n³).
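The slides mention back substitution but do not list it; a minimal C sketch of solving the resulting upper-triangular system (names and signature are ours):

/* Back substitution: after elimination, a[][] is upper triangular,
   so solve for x from the last equation upward. */
void back_substitute(int n, double a[n][n], double b[n], double x[n])
{
   for (int i = n - 1; i >= 0; i--) {
      double sum = b[i];
      for (int j = i + 1; j < n; j++)
         sum -= a[i][j] * x[j];     /* subtract the already-known terms */
      x[i] = sum / a[i][i];         /* assumes a[i][i] != 0 (no pivoting) */
   }
}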

Slide 494 Parallel Implementation
[Figure: the ith row, n − i + 1 elements including b[i], is broadcast to the rows below it; the columns to the left have already been cleared to zero.]

Slide 495 Analysis
Communication: n − 1 broadcasts performed sequentially; the ith broadcast contains n − i + 1 elements. Time complexity of O(n²) (see textbook).
Computation: after the row broadcast, each processor P_j beyond the broadcast processor P_i will compute its multiplier and operate upon n − j + 2 elements of its row. Ignoring the computation of the multiplier, there are n − j + 2 multiplications and n − j + 2 subtractions. Time complexity of O(n²) (see textbook).
Efficiency will be relatively low because all the processors before the processor holding row i do not participate in the computation again.

Slide 496 Pipeline implementation of Gaussian elimination
[Figure: processors P_0, P_1, P_2, …, P_{n-1} hold successive rows; each passes rows on down the pipeline as it receives them.]

Slide 497 Strip Partitioning
[Figure: rows divided into contiguous strips of n/p rows; rows 0 to n/p − 1 on P_0, the next n/p rows on P_1, and so on through P_3.]
Poor processor allocation! Processors do not participate in the computation after their last row is processed.

Slide 498 Cyclic-Striped Partitioning
An alternative which equalizes the processor workload:
[Figure: rows allocated to processors in a cyclic (round-robin) fashion, so each processor holds rows spread throughout the matrix.]

Slide 499 Iterative Methods
The time complexity of the direct method, O(N²) with N processors, is significant. The time complexity of an iterative method depends upon the type of iteration, the number of iterations, the number of unknowns, and the required accuracy, but it can be less than that of the direct method, especially when each equation involves only a few unknowns, i.e., a sparse system of linear equations.

Slide 500 Jacobi Iteration
Iteration formula - the ith equation rearranged to have the ith unknown on the left side:

$$x_i^k = \frac{1}{a_{i,i}}\left(b_i - \sum_{j \ne i} a_{i,j}\, x_j^{k-1}\right)$$

The superscript indicates the iteration: $x_i^k$ is the kth iteration of $x_i$, and $x_j^{k-1}$ is the (k−1)th iteration of $x_j$.
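A minimal C sketch of this formula for a general dense system; the convergence test, tolerance, and iteration cap are our additions, not from the slide:

/* Jacobi iteration for Ax = b: every new x[i] is computed from the
   previous iteration's values only, so all updates are independent. */
#include <math.h>

void jacobi(int n, double a[n][n], double b[n], double x[n],
            int max_iter, double tol)
{
   double xnew[n];
   for (int iter = 0; iter < max_iter; iter++) {
      double diff = 0.0;
      for (int i = 0; i < n; i++) {
         double sum = b[i];
         for (int j = 0; j < n; j++)
            if (j != i) sum -= a[i][j] * x[j];   /* old (k-1) values */
         xnew[i] = sum / a[i][i];
         diff = fmax(diff, fabs(xnew[i] - x[i]));
      }
      for (int i = 0; i < n; i++) x[i] = xnew[i];
      if (diff < tol) break;                      /* converged */
   }
}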

Slide 501 Example of a Sparse System of Linear Equations - Laplace's Equation

$$\frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} = 0$$

Solve for f over the two-dimensional x-y space. For a computer solution, finite difference methods are appropriate. The two-dimensional solution space is discretized into a large number of solution points.

Slide 502 Finite Difference Method
[Figure: the solution space f(x, y) discretized into a mesh of points in the x-y plane.]

Slide 503
If the distance between points, Δ, is made small enough:

$$\frac{\partial^2 f}{\partial x^2} \approx \frac{1}{\Delta^2}\left[f(x+\Delta, y) - 2f(x, y) + f(x-\Delta, y)\right]$$

$$\frac{\partial^2 f}{\partial y^2} \approx \frac{1}{\Delta^2}\left[f(x, y+\Delta) - 2f(x, y) + f(x, y-\Delta)\right]$$

Substituting into Laplace's equation, we get

$$\frac{1}{\Delta^2}\left[f(x+\Delta, y) + f(x-\Delta, y) + f(x, y+\Delta) + f(x, y-\Delta) - 4f(x, y)\right] = 0$$

Rearranging, we get

$$f(x, y) = \frac{f(x-\Delta, y) + f(x, y-\Delta) + f(x+\Delta, y) + f(x, y+\Delta)}{4}$$

Rewritten as an iterative formula:

$$f^k(x, y) = \frac{f^{k-1}(x-\Delta, y) + f^{k-1}(x, y-\Delta) + f^{k-1}(x+\Delta, y) + f^{k-1}(x, y+\Delta)}{4}$$

where $f^k(x, y)$ is the kth iteration and $f^{k-1}(x, y)$ is the (k−1)th iteration.
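One Jacobi sweep of this iterative formula, in C; the (n+2) × (n+2) array with a one-point boundary border and the use of a separate output array are our assumptions:

/* One Jacobi sweep over the interior of the grid: fnew holds the kth
   iteration, f the (k-1)th; the caller swaps the arrays each sweep.
   The border cells (rows/cols 0 and n+1) hold fixed boundary values. */
void jacobi_sweep(int n, double f[n+2][n+2], double fnew[n+2][n+2])
{
   for (int i = 1; i <= n; i++)
      for (int j = 1; j <= n; j++)
         fnew[i][j] = 0.25 * (f[i-1][j] + f[i][j-1] +
                              f[i+1][j] + f[i][j+1]);
}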

Slide 504 Natural Order
[Figure: a 10 × 10 mesh of points x_1 through x_100, numbered row by row (natural ordering), surrounded by boundary points (see text).]

Slide 505 Relationship with a General System of Linear Equations
Using natural ordering, the ith point is computed from the ith equation:

$$x_i = \frac{x_{i-n} + x_{i-1} + x_{i+1} + x_{i+n}}{4}$$

or

$$x_{i-n} + x_{i-1} - 4x_i + x_{i+1} + x_{i+n} = 0$$

which is a linear equation with five unknowns (except for those involving boundary points). In general form, the ith equation becomes:

$$a_{i,i-n}x_{i-n} + a_{i,i-1}x_{i-1} + a_{i,i}x_i + a_{i,i+1}x_{i+1} + a_{i,i+n}x_{i+n} = 0$$

where $a_{i,i} = -4$ and $a_{i,i-n} = a_{i,i-1} = a_{i,i+1} = a_{i,i+n} = 1$.

Slide 506
[Figure: the N × N coefficient matrix A times the vector x = 0. The ith row has −4 on the diagonal (a_{i,i}) and 1s in positions i−n, i−1, i+1, and i+n (the entries a_{i,i−n}, a_{i,i−1}, a_{i,i+1}, a_{i,i+n}). Those equations with a boundary point on the diagonal are unnecessary for the solution; including boundary values introduces some zero entries (see text).]

Slide 507 Gauss-Seidel Relaxation
Uses some newly computed values to compute other values within the same iteration.
[Figure: points are computed in sequential order; each point to be computed uses already-computed points before it and old values after it.]
The basic form is not suitable for parallelization.

Slide 508 Gauss-Seidel Iteration Formula

$$x_i^k = \frac{1}{a_{i,i}}\left(b_i - \sum_{j=1}^{i-1} a_{i,j}\, x_j^k - \sum_{j=i+1}^{N} a_{i,j}\, x_j^{k-1}\right)$$

where the superscript indicates the iteration. With natural ordering of the unknowns, the formula reduces to

$$x_i^k = \frac{-1}{a_{i,i}}\left(a_{i,i-n} x_{i-n}^k + a_{i,i-1} x_{i-1}^k + a_{i,i+1} x_{i+1}^{k-1} + a_{i,i+n} x_{i+n}^{k-1}\right)$$

At the kth iteration, two of the four values (before the ith element) are taken from the kth iteration and two values (after the ith element) are taken from the (k−1)th iteration. We have:

$$f^k(x, y) = \frac{f^k(x-\Delta, y) + f^k(x, y-\Delta) + f^{k-1}(x+\Delta, y) + f^{k-1}(x, y+\Delta)}{4}$$
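A C sketch of one Gauss-Seidel sweep for the Laplace stencil; updating the array in place is exactly what makes the left and upper neighbours kth-iteration values (array layout as in the Jacobi sketch earlier, an assumption of ours):

/* Gauss-Seidel sweep: f is updated in place, so f[i-1][j] and
   f[i][j-1] have already been replaced by kth-iteration values when
   point (i, j) is computed, matching the formula above. */
void gauss_seidel_sweep(int n, double f[n+2][n+2])
{
   for (int i = 1; i <= n; i++)
      for (int j = 1; j <= n; j++)
         f[i][j] = 0.25 * (f[i-1][j] + f[i][j-1]     /* new (k) values */
                         + f[i+1][j] + f[i][j+1]);   /* old (k-1) values */
}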

Slide 509 Red-Black Ordering
First, the black points are computed; next, the red points are computed. All black points can be computed simultaneously, and then all red points can be computed simultaneously.
[Figure: the grid colored like a checkerboard, with alternating red and black points.]

Slide 510 Red-Black Parallel Code

forall (i = 1; i < n; i++)
   forall (j = 1; j < n; j++)
      if ((i + j) % 2 == 0)        /* compute red points */
         f[i][j] = 0.25*(f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);
forall (i = 1; i < n; i++)
   forall (j = 1; j < n; j++)
      if ((i + j) % 2 != 0)        /* compute black points */
         f[i][j] = 0.25*(f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);
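The forall construct above is pseudocode; one plausible realization in C with OpenMP (our choice of mechanism, not the textbook's) exploits the fact that all points of one color are independent:

/* Red-black iteration with OpenMP: within each color, no point reads
   another point of the same color, so each colored pass is safe to
   run fully in parallel. */
#include <omp.h>

void red_black_iteration(int n, double f[n+1][n+1])
{
   int i, j;
   #pragma omp parallel for private(j)      /* compute red points */
   for (i = 1; i < n; i++)
      for (j = 1; j < n; j++)
         if ((i + j) % 2 == 0)
            f[i][j] = 0.25 * (f[i-1][j] + f[i][j-1]
                            + f[i+1][j] + f[i][j+1]);
   #pragma omp parallel for private(j)      /* compute black points */
   for (i = 1; i < n; i++)
      for (j = 1; j < n; j++)
         if ((i + j) % 2 != 0)
            f[i][j] = 0.25 * (f[i-1][j] + f[i][j-1]
                            + f[i+1][j] + f[i][j+1]);
}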

Slide 511 Higher-Order Difference Methods
More distant points could be used in the computation. The following update formula uses points two grid spacings away:

$$f^k(x, y) = \frac{1}{60}\Big[\,16 f^{k-1}(x-\Delta, y) + 16 f^{k-1}(x, y-\Delta) + 16 f^{k-1}(x+\Delta, y) + 16 f^{k-1}(x, y+\Delta)$$
$$\qquad\qquad - f^{k-1}(x-2\Delta, y) - f^{k-1}(x, y-2\Delta) - f^{k-1}(x+2\Delta, y) - f^{k-1}(x, y+2\Delta)\,\Big]$$

Slide 512 Nine-point stencil

Slide 513 Overrelaxation
Improved convergence can be obtained by adding the factor $(1-\omega)x_i^{k-1}$ to the Jacobi or Gauss-Seidel formulae. The factor ω is the overrelaxation parameter.

Jacobi overrelaxation formula:

$$x_i^k = \frac{\omega}{a_{ii}}\left(b_i - \sum_{j \ne i} a_{ij}\, x_j^{k-1}\right) + (1-\omega)x_i^{k-1}$$

where 0 < ω < 1.

Gauss-Seidel successive overrelaxation:

$$x_i^k = \frac{\omega}{a_{ii}}\left(b_i - \sum_{j=1}^{i-1} a_{ij}\, x_j^k - \sum_{j=i+1}^{N} a_{ij}\, x_j^{k-1}\right) + (1-\omega)x_i^{k-1}$$

where 0 < ω ≤ 2. If ω = 1, we obtain the Gauss-Seidel method.
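A C sketch of the successive overrelaxation update for the Laplace stencil (names and array layout are ours; with ω = 1 it reduces to the Gauss-Seidel sweep shown earlier):

/* SOR sweep: blend the Gauss-Seidel value with the old value using
   the overrelaxation parameter omega (0 < omega <= 2). */
void sor_sweep(int n, double f[n+2][n+2], double omega)
{
   for (int i = 1; i <= n; i++)
      for (int j = 1; j <= n; j++) {
         double gs = 0.25 * (f[i-1][j] + f[i][j-1]    /* Gauss-Seidel value */
                           + f[i+1][j] + f[i][j+1]);
         f[i][j] = omega * gs + (1.0 - omega) * f[i][j];
      }
}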

Slide 514 Multigrid Method
First, a coarse grid of points is used. With these points, the iteration process will start to converge quickly. At some stage, the number of points is increased to include the points of the coarse grid plus extra points between them. The initial values of the extra points are found by interpolation. Computation continues with this finer grid. The grid can be made finer and finer as the computation proceeds, or the computation can alternate between fine and coarse grids. Coarser grids take distant effects into account more quickly and provide a good starting point for the next finer grid.
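A toy sketch of the coarse-to-fine step in C, reduced to one dimension for brevity; the function name and the linear interpolation scheme are our illustration of the idea, not the textbook's code:

/* Prolongation (coarse -> fine): copy the coarse-grid values to the
   matching fine-grid points and fill the extra points in between by
   linear interpolation. A coarse grid of nc points maps to a fine
   grid of 2*nc - 1 points. */
void prolongate_1d(int nc, double coarse[nc], double fine[2*nc - 1])
{
   for (int i = 0; i < nc; i++)
      fine[2*i] = coarse[i];                      /* inject coarse points */
   for (int i = 0; i < nc - 1; i++)               /* interpolate midpoints */
      fine[2*i + 1] = 0.5 * (coarse[i] + coarse[i+1]);
}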

Slide 515 Multigrid processor allocation
[Figure: processors assigned to the coarsest grid points; the finer grid points between them are allocated to the same processors.]

Slide 516 (Semi) Asynchronous Iteration
As noted earlier, synchronizing on every iteration will cause significant overhead - it is best, if possible, to synchronize only after a number of iterations.