HPC Fall 2007 Project 3
2D Steady-State Heat Distribution Problem with MPI

Robert van Engelen

Due date: December 14, 2007

1 Introduction

1.1 Account and Login Information

For this assignment you need an SCS account. The account gives you access to pamd and the phoenix cluster. Contact the instructor if you do not have an SCS account.

To log in to phoenix from outside SCS you must first ssh to pamd (using ssh name@pamd.scs.fsu.edu) and then ssh to phoenix. Use ssh -Y to enable X11 forwarding.

1.2 Setup

Note that when you log in to phoenix your home directory is mounted as fshome. If you don't have a .cshrc file in your phoenix home, copy your fshome/.cshrc to your phoenix home. In your .cshrc on phoenix add the following lines:

    # OpenMPI with GCC, use missing man pages from MPICH
    setenv OPENMPIHOME /opt/openmpi
    setenv MANPATH $MANPATH:$OPENMPIHOME/man:/usr/local/mpich2/1.0.4p1/man
    setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH:/usr/local/openmpi/gcc/lib:/opt/openmpi/gcc/lib
    set path = ( $OPENMPIHOME/bin /opt/openmpi/gcc/bin $path )

1.3 Download

Next, download the project source code from

    http://www.cs.fsu.edu/~engelen/courses/hpc/pr3.zip

The package bundles the following files:

    Makefile: a standard Makefile to build the project
    heatdist.c: steady-state heat distribution Jacobi iterations in MPI
    run.sh: wrapper to run mpirun
    sge.sh: wrapper to run the SGE script
    heatdist.sh: script to run the SGE job (used by sge.sh)

1.4 Overview

The steady-state heat distribution problem consists of an n × n discrete grid with temperature field h. Given the boundary condition T = 10, we solve for h iteratively, starting with h = 0, using the Jacobi iteration

    h^{t+1}_{i,j} = \frac{1}{4}\left(h^t_{i-1,j} + h^t_{i+1,j} + h^t_{i,j-1} + h^t_{i,j+1}\right)

Starting with h^t_{i,j} = 0 at step t = 0, this (slowly) converges to the exact solution h^t_{i,j} = T = 10 as t → ∞ for all points, given the boundary condition T = 10 surrounding the n × n square region as shown below:

    [Figure: the n × n square region, interior initialized to T = 0, with boundary value T = 10 along all four edges.]
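To make the update rule concrete, here is a minimal serial sketch of the Jacobi sweep on the full n × n grid. It is not part of the project package; the fixed iteration count and the variable names are illustrative assumptions, and the row-major one-dimensional storage mirrors the layout used in heatdist.c.

    /* Minimal serial Jacobi sketch (illustrative only; not the project code). */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int n = 82;           /* grid size used in the project        */
        const double T = 10.0;      /* boundary temperature                 */
        double *h    = calloc((size_t)n*n, sizeof(double));
        double *hnew = calloc((size_t)n*n, sizeof(double));
        int i, j, t;

        /* boundary condition: T = 10 on all four edges, interior starts at 0 */
        for (i = 0; i < n; i++)
        {
            h[n*i+0] = h[n*i+n-1] = T;
            h[n*0+i] = h[n*(n-1)+i] = T;
        }
        for (i = 0; i < n*n; i++) hnew[i] = h[i];

        for (t = 0; t < 5000; t++)
        {
            for (i = 1; i < n-1; i++)           /* Jacobi update of the interior */
                for (j = 1; j < n-1; j++)
                    hnew[n*i+j] = 0.25*(h[n*(i-1)+j] + h[n*(i+1)+j]
                                      + h[n*i+j-1]   + h[n*i+j+1]);
            for (i = 1; i < n-1; i++)           /* copy new values back          */
                for (j = 1; j < n-1; j++)
                    h[n*i+j] = hnew[n*i+j];
        }

        printf("center value after 5000 iterations: %g\n", h[n*(n/2)+n/2]);
        free(h); free(hnew);
        return 0;
    }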

The parallelization strategy is to apply a row-wise block partitioning to decompose the domain:

    [Figure: the n × n grid split row-wise into blocks p0, p1, p2, p3 (one per processor), each holding m = (n-2)/P + 2 rows; i indexes rows, j indexes columns, and the interior is (n-2) × (n-2).]

In this project we use an n × n grid with n = 82, where the boundary values are positioned along the edges of the grid (i = 0, i = n-1, j = 0, and j = n-1). The interior region thus consists of 80 × 80 points, and the boundary values stored along the edges are not updated in the Jacobi iterations. Each processor holds m = (n-2)/P + 2 grid rows, with an interior region of (m-2) × (n-2) points. The local edge columns j = 0 and j = n-1 hold fixed boundary values. The local rows i = 0 and i = m-1 are ghost cells (also referred to as "halos"), except on processor p = 0, where local row i = 0 is the fixed top boundary, and on processor p = P-1, where local row i = m-1 is the fixed bottom boundary. The ghost cells are updated in each iteration via nearest-neighbor communication. On each processor the interior points of the field h^t at iteration t are updated as follows:

    for (i = 1; i < m-1; i++)
      for (j = 1; j < n-1; j++)
        hnew[n*i+j] = 0.25*(hp[n*(i-1)+j]+hp[n*(i+1)+j]+hp[n*i+j-1]+hp[n*i+j+1]);

Note that the grid is mapped to the one-dimensional arrays hp and hnew using a row-major layout, where point h_{i,j} corresponds to element hp[n*i+j]. This makes it easy to communicate ghost cell rows. The above code assumes that the values of hp at the ghost cell rows i = 0 and i = m-1 were copied from the hp values of the processors above and below, respectively. Note that the values of hp at j = 0 and j = n-1 are fixed boundary values.
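For illustration, the per-process storage implied by this decomposition could be set up along the following lines. This is a sketch under the assumption that P evenly divides n-2 (which holds for n = 82 and P = 1, 2, 4, or 8); the helper name and signature are hypothetical and do not appear in heatdist.c.

    #include <stdlib.h>

    /* Hypothetical helper: allocate and initialize the local row block of
       process p out of P for an n x n grid with boundary temperature T.
       Illustrative sketch only; heatdist.c does its own allocation. */
    static double *alloc_local_block(int n, int p, int P, double T, int *m_out)
    {
        int m = (n - 2) / P + 2;                  /* local rows incl. ghost/boundary rows */
        double *hp = calloc((size_t)m * n, sizeof(double));
        *m_out = m;

        for (int i = 0; i < m; i++)               /* fixed boundary columns j = 0, n-1    */
            hp[n*i + 0] = hp[n*i + (n-1)] = T;
        if (p == 0)                               /* fixed top boundary row on p = 0      */
            for (int j = 0; j < n; j++) hp[n*0 + j] = T;
        if (p == P-1)                             /* fixed bottom boundary row on p = P-1 */
            for (int j = 0; j < n; j++) hp[n*(m-1) + j] = T;
        return hp;
    }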

The implemented code overlaps computation with communication using the MPI_Isend and MPI_Irecv calls. This is accomplished by updating the ghost cells while the inner interior rows i = 2, ..., m-3 are computed. The outer interior rows are then updated when the ghost cell values arrive:

    if (p < P-1)
    {
      MPI_Isend(&hp[n*(m-2)], n, MPI_DOUBLE, p+1, dntag, MPI_COMM_WORLD, &sndreq[0]);
      MPI_Irecv(&hp[n*(m-1)], n, MPI_DOUBLE, p+1, uptag, MPI_COMM_WORLD, &rcvreq[0]);
    }
    if (p > 0)
    {
      MPI_Isend(&hp[n*1], n, MPI_DOUBLE, p-1, uptag, MPI_COMM_WORLD, &sndreq[1]);
      MPI_Irecv(&hp[n*0], n, MPI_DOUBLE, p-1, dntag, MPI_COMM_WORLD, &rcvreq[1]);
    }

    /* Compute inner interior rows */
    for (i = 2; i < m-2; i++)
      for (j = 1; j < n-1; j++)
        hnew[n*i+j] = 0.25*(hp[n*(i-1)+j]+hp[n*(i+1)+j]+hp[n*i+j-1]+hp[n*i+j+1]);

    /* Wait on receives */
    if (P > 1)
    {
      if (p == 0)
        MPI_Wait(&rcvreq[0], stat);
      else if (p == P-1)
        MPI_Wait(&rcvreq[1], stat);
      else
        MPI_Waitall(2, rcvreq, stat);
    }

    /* Compute outer interior rows */
    for (j = 1; j < n-1; j++)
    {
      hnew[n*1+j]     = 0.25*(hp[n*0+j]    +hp[n*2+j]    +hp[n*1+j-1]    +hp[n*1+j+1]);
      hnew[n*(m-2)+j] = 0.25*(hp[n*(m-3)+j]+hp[n*(m-1)+j]+hp[n*(m-2)+j-1]+hp[n*(m-2)+j+1]);
    }

    /* Wait on sends to release buffer */
    if (P > 1)
    {
      if (p == 0)
        MPI_Wait(&sndreq[0], stat);
      else if (p == P-1)
        MPI_Wait(&sndreq[1], stat);
      else
        MPI_Waitall(2, sndreq, stat);
    }

    /* Copy new */
    for (i = 1; i < m-1; i++)
      for (j = 1; j < n-1; j++)
        hp[n*i+j] = hnew[n*i+j];

See also the lecture notes and the code heatdist.c for more details.
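The excerpt above uses variables declared elsewhere in heatdist.c. A minimal sketch of the surrounding context, with assumed names and tag values (the actual declarations in heatdist.c may differ), might look like this:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int p, P;                        /* process rank and number of processes       */
        int n = 82, m;                   /* global grid size and local row count       */
        int i, j;
        int uptag = 1, dntag = 2;        /* assumed tag values for up/down ghost rows  */
        MPI_Request sndreq[2], rcvreq[2];
        MPI_Status  stat[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &p);
        MPI_Comm_size(MPI_COMM_WORLD, &P);
        m = (n - 2) / P + 2;             /* local rows including ghost/boundary rows   */

        /* ... allocation of hp/hnew and the iteration loop shown above go here ... */

        MPI_Finalize();
        return 0;
    }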

1.5 Getting Started

To get started with this assignment and try out the package content, follow these steps:

    - Open a terminal and ssh -Y to phoenix (via pamd.scs.fsu.edu).
    - Run make to build heatdist.
    - Run sh run.sh, which solves the steady-state heat distribution problem with Jacobi iterations using 4 tasks on the single phoenix head node and an 82 × 82 grid.

The run produces output that shows the upper region of the temperature field residing on processor p0, coded as values 0-9. Note that the value 9 represents the solution T = 10. The run also produces a log file heatdist.clog2 with MPI statistics. You can inspect the log file as follows:

    - Start jumpshot heatdist.clog2 in an xterm.
    - Answer YES to convert the CLOG-2 file to SLOG-2 format.
    - Select the CONVERT button. After conversion, select the OK button.
    - View the timeline by zooming in and out with the cursor. Since there are 5000 iterations that can take up to two seconds to compute, you need to go to a 0.1 second scale to view the actual MPI operations and messages.

The blue rectangles represent computation (with communication overlap that requires an MPI_Wait or MPI_Waitall to receive the ghost cell values needed to compute the outer interior points). Always use make clean to clean up the log files.

2 Your Assignment

1. Study the code. Explain why we need to wait for the MPI_Irecv and MPI_Isend to complete at the two specific points in the code.

2. Run the code with P = 1, P = 2, P = 4, and P = 8 on the phoenix head node using script run.sh (edit the script to change the process count). The phoenix head node has two dual-core Xeon processors. For each run with P = 1, P = 2, P = 4, and P = 8, write down the running time (as displayed by the program) and determine from a couple of snapshots of the jumpshot profile the approximate average computation-to-communication ratio (assume that anything that is not computation is communication overhead). Note that a blue rectangle without any inner nested rectangles represents pure computation (study the MPE event calls in the code!).

3. The current code runs 5000 iterations, which is more than enough to converge. Add the Pacheco termination criterion to the code, where the iterations should stop when

    \sum_{i=1}^{n-2} \sum_{j=1}^{n-2} \left(h^t_{i,j} - h^{t+1}_{i,j}\right)^2 \le \epsilon^2

where \epsilon = 0.1 is given by TOLERANCE=0.1 in the code. To compute the termination criterion in parallel, you need to compute the local errors

    e_p = \sum_{i=1}^{m-2} \sum_{j=1}^{n-2} \left(hp_{i,j} - hnew_{i,j}\right)^2

and then check that \sum_{p=0}^{P-1} e_p \le \epsilon^2. Each processor should know when the termination criterion is met, since they all should terminate in the same iteration t. Run the code with P = 1, P = 2, P = 4, and P = 8 on the phoenix head node with run.sh and compare the total running times to the running times you recorded in the previous question. Note that the tolerance allows the innermost region of the solution to be smaller than T = 10. Hint: you need a global reduction and a broadcast, or a combination of the two; an illustrative sketch of such a check appears after this assignment list.

4. Reduce the overhead of the termination criterion check by checking for convergence periodically, once every 100 iterations. Compare the running times to the running times you recorded in the previous question. Explain why periodic termination checking might be good enough.

5. In this part we run the updated code (with the periodic termination check) with P = 4 and P = 8 on the phoenix cluster using script sge.sh. Edit heatdist.sh to adjust the number of processors. Before you can run the code, copy sge.sh, heatdist.sh, and the heatdist binary to your phoenix home directory, then run sge.sh. Use qstat to check the job status. Use qdel with the job id to remove the job when you notice a problem. When all goes well the output is saved in heatdist.sh.oX, where X is the job id. The logs are saved in heatdist.clog2. Write down the total running time and determine from a couple of jumpshot profile snapshots the approximate average computation-to-communication ratio (anything that is not computation is communication overhead).

6. Use Gauss-Seidel relaxation by updating hp in place instead of writing to hnew. You can no longer use MPI_Isend reliably, since the data may be changed while the send is in progress. Use a locally blocking send instead, but keep the non-blocking MPI_Irecv. Determine for P = 1, P = 2, P = 4, and P = 8 on the phoenix head node (run.sh) how many iterations it takes to converge with this relaxation scheme.
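For reference, one possible way to implement the parallel convergence check in item 3 is sketched below. This is not the required solution; the function name, the use of MPI_Allreduce (which combines the global reduction and the broadcast in one call), and the variable names are assumptions.

    #include <mpi.h>

    /* Hypothetical helper for the convergence check: sums the squared local
       differences between hp and hnew over this process's interior rows and
       combines them across all processes. Because MPI_Allreduce delivers the
       same global sum to every process, all processes see the same result and
       stop in the same iteration. */
    static int converged(const double *hp, const double *hnew,
                         int n, int m, double tolerance)
    {
        double local_err = 0.0, global_err = 0.0;
        int i, j;

        for (i = 1; i < m-1; i++)
            for (j = 1; j < n-1; j++)
            {
                double d = hp[n*i+j] - hnew[n*i+j];
                local_err += d * d;
            }

        MPI_Allreduce(&local_err, &global_err, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);

        return global_err <= tolerance * tolerance;   /* epsilon^2, epsilon = TOLERANCE */
    }

Calling such a check only once every 100 iterations, as in item 4, amortizes the cost of the global reduction over many Jacobi sweeps.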