AMath 483/583 Lecture 24

AMath 483/583 Lecture 24

Outline:
- Heat equation and discretization
- OpenMP and MPI for iterative methods
- Jacobi, Gauss-Seidel, SOR

Notes and sample codes:
- Class notes: Linear algebra software
- $UWHPSC/codes/openmp/jacobi1d_omp1.f90
- $UWHPSC/codes/openmp/jacobi1d_omp2.f90
- $UWHPSC/codes/mpi/jacobi1d_mpi.f90

Steady state diffusion

If f(x, t) = f(x) does not depend on time, and the boundary conditions do not depend on time, then u(x, t) converges towards a steady state distribution satisfying (set u_t = 0)

    0 = D u_xx(x) + f(x).

This is now an ordinary differential equation (ODE) for u(x). We can solve it on an interval, say 0 <= x <= 1, with boundary conditions u(0) = α, u(1) = β.

Finite difference method

Let U_i denote the approximation to u(x_i). Approximating the first derivative at the midpoints,

    u_x(x_{i+1/2}) ≈ (U_{i+1} - U_i) / Δx,
    u_x(x_{i-1/2}) ≈ (U_i - U_{i-1}) / Δx,

we can approximate the second derivative at x_i by

    u_xx(x_i) ≈ (1/Δx) * ( (U_{i+1} - U_i)/Δx - (U_i - U_{i-1})/Δx )
              = (1/Δx^2) * (U_{i-1} - 2 U_i + U_{i+1}).

This gives a coupled system of n linear equations (taking D = 1):

    (1/Δx^2) * (U_{i-1} - 2 U_i + U_{i+1}) = -f(x_i)    for i = 1, 2, ..., n,

with U_0 = α and U_{n+1} = β.
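Since the discretized problem is just a tridiagonal linear system, it can also be solved directly with linear algebra software. Here is a minimal sketch (not one of the lecture codes; the source f(x) = 100 exp(x) and the boundary values 20 and 60 are made-up illustration values) using LAPACK's dgtsv, linked with -llapack:

    program tridiag_direct
        implicit none
        integer, parameter :: n = 100
        real(kind=8), parameter :: alpha = 20.d0, beta = 60.d0
        real(kind=8) :: dx, x(n), d(n), dl(n-1), du(n-1), b(n)
        integer :: i, info

        dx = 1.d0 / (n+1)
        do i = 1, n
            x(i) = i*dx
            b(i) = -dx**2 * 100.d0*exp(x(i))   ! right-hand side  -dx^2 * f(x_i)
            enddo
        b(1) = b(1) - alpha                    ! move boundary values to the RHS
        b(n) = b(n) - beta

        dl = 1.d0                              ! sub-diagonal of tridiagonal matrix
        d  = -2.d0                             ! diagonal
        du = 1.d0                              ! super-diagonal

        call dgtsv(n, 1, dl, d, du, b, n, info)   ! b is overwritten with the solution U
        print *, "info = ", info, "   U(1) = ", b(1)
        end program tridiag_direct

For large 2d and 3d problems a direct solve becomes expensive, which is one motivation for the iterative methods below.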

Iterative methods

Coupled system of n linear equations:

    U_{i-1} - 2 U_i + U_{i+1} = -Δx^2 f(x_i)    for i = 1, 2, ..., n,

with U_0 = α and U_{n+1} = β.

An iterative method starts with an initial guess U^[0] to the solution and then improves U^[k] to get U^[k+1] for k = 0, 1, ....

Note:
- Generally does not involve modifying the matrix A.
- Do not have to store the matrix A at all, only know about the stencil.

Jacobi iteration

    U_{i-1} - 2 U_i + U_{i+1} = -Δx^2 f(x_i)

Solve for U_i:

    U_i = (1/2) (U_{i-1} + U_{i+1} + Δx^2 f(x_i)).

Note: with no heat source, f(x) = 0, the temperature at each point is the average of its neighbors.

Suppose U^[k] is an approximation to the solution. Set

    U_i^[k+1] = (1/2) (U_{i-1}^[k] + U_{i+1}^[k] + Δx^2 f(x_i))    for i = 1, 2, ..., n.

Repeat for k = 0, 1, 2, ... until convergence.

Can be shown to converge (eventually... very slowly!).

Jacobi iteration in Fortran

    uold = u                 ! starting values before updating
    do iter = 1, maxiter
        dumax = 0.d0
        do i = 1, n
            u(i) = 0.5d0*(uold(i-1) + uold(i+1) + dx**2*f(i))
            dumax = max(dumax, abs(u(i) - uold(i)))
            enddo
        ! check for convergence:
        if (dumax .lt. tol) exit
        uold = u             ! for next iteration
        enddo

Note: we must use the old value at i-1 for Jacobi. Otherwise we get the Gauss-Seidel method:

    u(i) = 0.5d0*(u(i-1) + u(i+1) + dx**2*f(i))

This actually converges faster!
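For comparison with the coarse-grain OpenMP version discussed next, a minimal fine-grain sketch (my own reconstruction, not necessarily identical to jacobi1d_omp1.f90) parallelizes only the inner update loop each iteration:

    uold = u
    do iter = 1, maxiter
        dumax = 0.d0
        !$omp parallel do reduction(max : dumax)
        do i = 1, n
            u(i) = 0.5d0*(uold(i-1) + uold(i+1) + dx**2*f(i))
            dumax = max(dumax, abs(u(i) - uold(i)))
            enddo
        ! check for convergence:
        if (dumax .lt. tol) exit
        uold = u
        enddo

Here the threads are forked and joined every iteration, so that overhead is paid up to maxiter times; the coarse-grain approach below forks only once.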

Jacobi with OpenMP: coarse grain

General approach:
- Fork threads only once at start of program.
- Each thread is responsible for some portion of the arrays, from i=istart to i=iend.
- Each iteration, must copy u to uold, update u, check for convergence.
- Convergence check requires coordination between threads to get the global dumax.
- Print out final result after leaving the parallel block.

See code in the repository or the notes: $UWHPSC/codes/openmp/jacobi1d_omp2.f90

Jacobi with MPI

Each process is responsible for some portion of the arrays, from i=istart to i=iend.
No shared memory: each process only has part of the array.

Updating formula:

    u(i) = 0.5d0*(uold(i-1) + uold(i+1) + dx**2*f(i))

Need to exchange values at the boundaries:
- Updating at i=istart requires uold(istart-1).
- Updating at i=iend requires uold(iend+1).

Example with n = 9 interior points (plus boundaries):
- Process 0 has istart = 1, iend = 5
- Process 1 has istart = 6, iend = 9

Jacobi with MPI: sending to neighbors

    call mpi_comm_rank(mpi_comm_world, me, ierr)
    ...
    do iter = 1, maxiter
        uold = u
        if (me > 0) then
            ! Send left endpoint value to "left" neighbor
            call mpi_isend(uold(istart), 1, MPI_DOUBLE_PRECISION, &
                           me - 1, 1, MPI_COMM_WORLD, req1, ierr)
            end if
        if (me < ntasks-1) then
            ! Send right endpoint value to "right" neighbor
            call mpi_isend(uold(iend), 1, MPI_DOUBLE_PRECISION, &
                           me + 1, 2, MPI_COMM_WORLD, req2, ierr)
            end if
        ...
        end do

Note:
- Non-blocking mpi_isend is used.
- Different tags (1 and 2) for left-going and right-going messages.
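The slides do not show how istart and iend are computed; one small hypothetical sketch, using ceiling division so the last process takes whatever points are left over, reproduces the n = 9, two-process example above:

    call mpi_comm_size(MPI_COMM_WORLD, ntasks, ierr)
    call mpi_comm_rank(MPI_COMM_WORLD, me, ierr)
    points_per_task = (n + ntasks - 1) / ntasks     ! ceiling(n/ntasks)
    istart = me*points_per_task + 1
    iend   = min((me+1)*points_per_task, n)

With n = 9 and ntasks = 2 this gives points_per_task = 5, so Process 0 gets points 1-5 and Process 1 gets points 6-9.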

Jacobi with MPI: receiving from neighbors

Note: uold(istart) from me+1 goes into uold(iend+1); uold(iend) from me-1 goes into uold(istart-1).

    do iter = 1, maxiter
        ! mpi_isend's from previous slide
        if (me < ntasks-1) then
            ! Receive right endpoint value
            call mpi_recv(uold(iend+1), 1, MPI_DOUBLE_PRECISION, &
                          me + 1, 1, MPI_COMM_WORLD, mpistatus, ierr)
            end if
        if (me > 0) then
            ! Receive left endpoint value
            call mpi_recv(uold(istart-1), 1, MPI_DOUBLE_PRECISION, &
                          me - 1, 2, MPI_COMM_WORLD, mpistatus, ierr)
            end if

        ! Apply Jacobi iteration on my section of array
        do i = istart, iend
            u(i) = 0.5d0*(uold(i-1) + uold(i+1) + dx**2*f(i))
            dumax_task = max(dumax_task, abs(u(i) - uold(i)))
            end do
        end do

Jacobi with MPI: other issues

- Convergence check requires coordination between processes to get the global dumax.
  Use MPI_ALLREDUCE so all processes check the same value.
- Part of the final result must be printed by each process (into a common file heatsoln.txt), in proper order.

See code in the repository or the notes: $UWHPSC/codes/mpi/jacobi1d_mpi.f90

Jacobi with MPI: writing the solution in order

Want to write a table of values x(i), u(i) in heatsoln.txt. Need them to be in proper order, so Process 0 must write to this file first, then Process 1, etc.

Approach:
- Each process me waits for a message from me-1 indicating that it has finished writing its part. (Contents not important.)
- Each process must open the file (without clobbering values already there), write to it, then close the file.
- Assumes all processes share a file system!

On a cluster or supercomputer, you may instead need to either:
- send all results to a single process for writing, or
- write distributed files that may need to be combined later (some visualization tools handle distributed data!).
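Two small sketches of the pieces just described (the variable names dumax_global and token, and the message tag 4, are my own, not taken from jacobi1d_mpi.f90). First, the global convergence check with MPI_ALLREDUCE:

    call mpi_allreduce(dumax_task, dumax_global, 1, MPI_DOUBLE_PRECISION, &
                       MPI_MAX, MPI_COMM_WORLD, ierr)
    ! every process now has the same dumax_global, so all make the same decision:
    if (dumax_global .lt. tol) exit

Second, the ordered-writing handshake: each rank waits for a token from the rank before it, appends its rows to heatsoln.txt, then passes the token on:

    if (me > 0) then
        ! wait until process me-1 has finished writing (message contents unimportant)
        call mpi_recv(token, 1, MPI_INTEGER, me-1, 4, MPI_COMM_WORLD, mpistatus, ierr)
        open(unit=20, file='heatsoln.txt', status='old', position='append')
      else
        open(unit=20, file='heatsoln.txt', status='replace')
      endif
    do i = istart, iend
        write(20, '(2e20.10)') x(i), u(i)
        enddo
    close(unit=20)
    if (me < ntasks-1) then
        call mpi_send(token, 1, MPI_INTEGER, me+1, 4, MPI_COMM_WORLD, ierr)
        endif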

Jacobi in 2D

Updating point 7, for example (u_32):

    U_32^[k+1] = (1/4) (U_22^[k] + U_42^[k] + U_31^[k] + U_33^[k] + h^2 f_32)

Jacobi in 2D using MPI

With two processes, could partition the unknowns so that
- Process 0 takes grid points 1-8,
- Process 1 takes grid points 9-16.

Each time step:
- Process 0 sends its top boundary row (points 5-8) to Process 1,
- Process 1 sends its bottom boundary row (points 9-12) to Process 0.

Jacobi in 2D using MPI

With more grid points and processes, could partition several different ways, e.g. with 4 processes: strips (left) or 2x2 blocks (right). The partition on the right requires less communication.

With m^2 processes on a grid with n^2 points:
- strip partition (left): 2(m^2 - 1)n boundary points communicated,
- block partition (right): 4(m - 1)n boundary points communicated.
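In general (i, j) index form, one Jacobi sweep over an n x n interior grid looks like the following sketch (hypothetical variable names; uold holds the previous iterate and its boundary rows and columns are assumed filled in):

    dumax = 0.d0
    do j = 1, n
        do i = 1, n
            u(i,j) = 0.25d0*(uold(i-1,j) + uold(i+1,j) + uold(i,j-1) + uold(i,j+1) &
                             + h**2*f(i,j))
            dumax = max(dumax, abs(u(i,j) - uold(i,j)))
            enddo
        enddo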

Jacobi in 2D using MPI

For the strip partition on the left: natural to number processes 0, 1, 2, 3 and pass boundary data from Process k to k ± 1.

For an m x m array of processes as on the right: how do we figure out the neighboring process numbers?

Creating a communicator for Cartesian blocks

    integer dims(2)
    logical isperiodic(2), reorder

    ndim = 2                   ! 2d grid of processes
    dims(1) = 4                ! for 4x6 grid of processes
    dims(2) = 6
    isperiodic(1) = .false.    ! periodic in x?
    isperiodic(2) = .false.    ! periodic in y?
    reorder = .true.           ! optimize ordering

    call MPI_CART_CREATE(MPI_COMM_WORLD, ndim, &
                         dims, isperiodic, reorder, comm2d, ierr)

Creates the communicator comm2d.
See also: MPI_CART_CREATE, MPI_CART_SHIFT, MPI_CART_COORDS.

Gauss-Seidel iteration in Fortran

    do iter = 1, maxiter
        dumax = 0.d0
        do i = 1, n
            uold = u(i)
            u(i) = 0.5d0*(u(i-1) + u(i+1) + dx**2*f(i))
            dumax = max(dumax, abs(u(i) - uold))
            enddo
        ! check for convergence:
        if (dumax .lt. tol) exit
        enddo

Note: now u(i) depends on the value of u(i-1) that has already been updated for the previous i.

Good news: this converges about twice as fast as Jacobi!
But... there is a loop-carried dependence, so it cannot be parallelized so easily.
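A brief sketch (hypothetical variable names) of how these routines answer the neighbor question: MPI_CART_COORDS returns this process's position in the process grid, and MPI_CART_SHIFT returns the ranks of its neighbors, giving MPI_PROC_NULL on a non-periodic boundary (sends and receives to MPI_PROC_NULL are simply no-ops):

    integer :: coords(2), left, right, below, above

    call MPI_COMM_RANK(comm2d, me, ierr)
    call MPI_CART_COORDS(comm2d, me, 2, coords, ierr)      ! my position in the process grid
    call MPI_CART_SHIFT(comm2d, 0, 1, left, right, ierr)   ! neighbor ranks in dimension 0
    call MPI_CART_SHIFT(comm2d, 1, 1, below, above, ierr)  ! neighbor ranks in dimension 1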

Red-black ordering

We are free to write the equations of the linear system in any order... this reorders the rows of the coefficient matrix and the right hand side.
We can also number the unknowns of the linear system in any order... this reorders the elements of the solution vector.

Red-black ordering: iterate through the points with odd index first (i = 1, 3, 5, ...) and then the even index points (i = 2, 4, 6, ...).
Then all the points of one color can be updated in any order, and all the points of the other color can then be updated in any order.
Same asymptotic convergence rate as the natural ordering.

Red-Black Gauss-Seidel

    do iter = 1, maxiter
        dumax = 0.d0

        ! UPDATE ODD INDEX POINTS:
        !$omp parallel do reduction(max : dumax) &
        !$omp private(uold)
        do i = 1, n, 2
            uold = u(i)
            u(i) = 0.5d0*(u(i-1) + u(i+1) + dx**2*f(i))
            dumax = max(dumax, abs(u(i) - uold))
            enddo

        ! UPDATE EVEN INDEX POINTS:
        !$omp parallel do reduction(max : dumax) &
        !$omp private(uold)
        do i = 2, n, 2
            uold = u(i)
            u(i) = 0.5d0*(u(i-1) + u(i+1) + dx**2*f(i))
            dumax = max(dumax, abs(u(i) - uold))
            enddo

        ! check for convergence:
        if (dumax .lt. tol) exit
        enddo

Gauss-Seidel method in 2D

If Δx = Δy = h:

    (1/h^2) (U_{i-1,j} + U_{i+1,j} + U_{i,j-1} + U_{i,j+1} - 4 U_{i,j}) = -f(x_i, y_j).

Solve for U_{i,j} and iterate:

    u_{i,j}^[k+1] = (1/4) (u_{i-1,j}^[k+1] + u_{i+1,j}^[k] + u_{i,j-1}^[k+1] + u_{i,j+1}^[k] + h^2 f_{i,j})

Again no need for the matrix A.

Note: the indices above for old and new values assume we iterate in the natural row-wise order.
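As a concrete sketch (hypothetical variable names), one natural-order Gauss-Seidel sweep in 2d updates u in place, so u(i-1,j) and u(i,j-1) already hold new values by the time u(i,j) is computed:

    do j = 1, n
        do i = 1, n
            u(i,j) = 0.25d0*(u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) + h**2*f(i,j))
            enddo
        enddo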

Gauss-Seidel in 2D

Updating point 7, for example (u_32): depends on new values at points 6 and 3, old values at points 8 and 11.

    U_32^[k+1] = (1/4) (U_22^[k+1] + U_42^[k] + U_31^[k+1] + U_33^[k] + h^2 f_32)

Red-black ordering in 2D

Again, all black points can be updated in any order: the new value depends only on red neighbors.
Then all red points can be updated in any order: the new value depends only on black neighbors.

SOR method

Gauss-Seidel moves the solution in the right direction, but in general not far enough: the iterates relax towards the solution.

Successive Over-Relaxation (SOR): compute the Gauss-Seidel approximation and then go further:

    U_i^* = (1/2) (U_{i-1}^[k+1] + U_{i+1}^[k] + Δx^2 f(x_i)),
    U_i^[k+1] = U_i^[k] + ω (U_i^* - U_i^[k]),

where 1 < ω < 2.

Optimal ω (for this problem): ω = 2 - 2πΔx.
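A minimal sketch (hypothetical variable names; pi is assumed to be defined, e.g. pi = acos(-1.d0)) of one natural-order SOR sweep in 1d using this ω:

    omega = 2.d0 - 2.d0*pi*dx
    dumax = 0.d0
    do i = 1, n
        uold  = u(i)
        ustar = 0.5d0*(u(i-1) + u(i+1) + dx**2*f(i))   ! Gauss-Seidel value
        u(i)  = uold + omega*(ustar - uold)            ! over-relax past it
        dumax = max(dumax, abs(u(i) - uold))
        enddo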

Convergence rates

[Figure: errors vs. iteration k on a log scale for Jacobi, Gauss-Seidel, and SOR.]

Red-Black SOR in 1D

    do iter = 1, maxiter
        dumax = 0.d0

        ! UPDATE ODD INDEX POINTS:
        !$omp parallel do reduction(max : dumax) &
        !$omp private(uold, ustar)
        do i = 1, n, 2
            uold = u(i)
            ustar = 0.5d0*(u(i-1) + u(i+1) + dx**2*f(i))
            u(i) = uold + omega*(ustar - uold)
            dumax = max(dumax, abs(u(i) - uold))
            enddo

        ! UPDATE EVEN INDEX POINTS:
        !$omp parallel do reduction(max : dumax) &
        !$omp private(uold, ustar)
        do i = 2, n, 2
            uold = u(i)
            ustar = 0.5d0*(u(i-1) + u(i+1) + dx**2*f(i))
            u(i) = uold + omega*(ustar - uold)
            dumax = max(dumax, abs(u(i) - uold))
            enddo

        ! check for convergence...
        if (dumax .lt. tol) exit
        enddo

Note that uold and ustar must be private!

Announcements

- Homework 6 is in the notes and due next Friday.
- Quizzes for this week's lectures are due next Wednesday.
- Office hours today 9:30-10:20.

Next week:
- Monday: no class.
- Wednesday: guest lecture by Brad Chamberlain, Cray: "Chapel: A Next-Generation Partitioned Global Address Space (PGAS) Language".