MPI Casestudy: Parallel Image Processing

David Henty

1 Introduction

The aim of this exercise is to write a complete MPI parallel program that does a very basic form of image processing. We will start by writing a serial code that performs the required calculation but is also designed to make subsequent parallelisation as straightforward as possible. The image itself is a large two-dimensional grid, so the natural parallel approach is to use regular domain decomposition. As the image processing method we will use involves nearest-neighbour interactions between grid points, this will require boundary swaps between neighbouring processes. The exercise also utilises parallel scatter and gather operations. For simplicity, we will use a one-dimensional process grid (i.e. decompose the problem into slices).

The solution can easily be coded using either C or Fortran. The only subtlety is that the natural direction for the parallel decomposition, i.e. whether to slice up the 2D grid over the first or second dimension, is different for the two languages. Some sample images and IO routines in C and Fortran can be found in the file MPP-casestudy.tar on the MPP course web pages.

2 The Method

You will be given a file that represents the output from a very simple edge-detection algorithm applied to a greyscale image of size M x N. The edge pixel values are constructed from the image using

    edge_{i,j} = image_{i-1,j} + image_{i+1,j} + image_{i,j-1} + image_{i,j+1} - 4 image_{i,j}

If an image pixel has the same value as its four surrounding neighbours (i.e. no edge) then the value of edge_{i,j} will be zero. If the pixel is very different from its four neighbours (i.e. a possible edge) then edge_{i,j} will be large in magnitude. We will always consider i and j to lie in the ranges 1, 2, ..., M and 1, 2, ..., N respectively. Pixels that lie outside these ranges (e.g. image_{i,0} or image_{M+1,j}) are considered to be white, i.e. to have the value 255.

Many more sophisticated methods for edge detection exist, but this is a nice simple approach. See Figure 1 for an example of how this works in practice.

The exercise is actually to do the reverse operation and construct the initial image given the edges. This is a slightly artificial thing to do, and is only possible given the very simple approach used to detect the edges in the first place. However, it turns out that the reverse calculation is iterative, requiring many successive operations, each very similar to the edge detection calculation itself.

The fact that calculating the image from the edges requires a large amount of computation, including many boundary swaps in the parallel code, makes it a much more suitable program than edge detection itself for the purposes of timing and parallel scaling studies. As an aside, this inverse operation is also very similar to a large number of real scientific HPC calculations that solve partial differential equations using iterative algorithms such as Jacobi or Gauss-Seidel.
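As a purely illustrative aside (the edge file is supplied, so the exercise itself never needs this code), the edge construction above could be written in C roughly as follows. This is a sketch only: M and N are assumed to be compile-time constants, and out-of-range pixels are handled with a one-pixel halo set to 255 so that they count as white.

```c
/* Illustrative only: construct edge values from an image, using a
 * one-pixel halo set to 255 so that out-of-range pixels are white. */
double image[M+2][N+2], edge[M+2][N+2];

for (int i = 0; i < M+2; i++)
  for (int j = 0; j < N+2; j++)
    image[i][j] = 255.0;            /* white everywhere, including the halo */

/* ... overwrite the interior image[1..M][1..N] with the real pixel values ... */

for (int i = 1; i <= M; i++)
  for (int j = 1; j <= N; j++)
    edge[i][j] = image[i-1][j] + image[i+1][j]
               + image[i][j-1] + image[i][j+1] - 4.0*image[i][j];
```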

Figure 1: Result of simple edge detection

3 Serial Code

You are provided with two routines, pgmread and pgmwrite, in C (see pgmio.c) or Fortran (see pgmio.f90). The first routine reads the input file, and the second writes out an array as a Portable Grey Map (PGM) file that can be viewed using the program display. There is also a helper function pgmsize to compute the size of an image file, which is useful if you are using dynamic array allocation. Instructions detailing how to call these routines from your own program are included as comments at the top of each file - you should not waste time looking at the code that actually implements the routines!

The file formats are very simple, e.g. the edge PGM file is a text file which contains the image dimensions M and N and the M x N integer values of edge_{i,j}. Writing an array as a greyscale image using pgmwrite requires a small amount of processing, as the greyscale values must be positive whereas a general data array (such as edge_{i,j}) can have both positive and negative entries. Although the input and output files both contain only integer values, note that pgmread reads the edge data into an array of double-precision floating-point numbers (i.e. double / DOUBLE PRECISION) and pgmwrite writes out from an array of floating-point numbers. This is because the reverse calculation that we are performing cannot easily be done in integer arithmetic.

3.1 Viewing the edges

Write a program that simply declares a floating-point array (i.e. double in C or DOUBLE PRECISION in Fortran) of just the right size for the smaller image edge192x128.pgm. Use the two supplied routines to read in the edge data file and then immediately write it out again as a PGM file. View the image using display. Is it obvious what the full image should look like when you can only see the edges?

3.2 Reconstructing the image

It turns out that the full image can be reconstructed from the edges by repeated operations of the form

    new_{i,j} = 1/4 (old_{i-1,j} + old_{i+1,j} + old_{i,j-1} + old_{i,j+1} - edge_{i,j})

where old and new are the image values at the beginning and end of each iteration. We will take the initial value for the image array (the value of old at the start of the first iteration) to be pure white.
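In C, the core of this update might look like the minimal sketch below. It assumes the explicit-halo array layout described in the recipe that follows, and a fixed iteration count NITER (an assumed name); the IO is omitted.

```c
/* Sketch only: the serial reconstruction loop, assuming old, new and edge
 * are declared as double [M+2][N+2] with the halos of old set to 255. */
for (int iter = 0; iter < NITER; iter++) {

  for (int i = 1; i <= M; i++)
    for (int j = 1; j <= N; j++)
      new[i][j] = 0.25 * (  old[i-1][j] + old[i+1][j]
                          + old[i][j-1] + old[i][j+1] - edge[i][j] );

  /* copy new back into old, leaving the halos untouched */
  for (int i = 1; i <= M; i++)
    for (int j = 1; j <= N; j++)
      old[i][j] = new[i][j];
}
```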

The simplest way to set pixels off the edge of the array to 255 is to declare all arrays in your program with explicit halos, i.e. double old[M+2][N+2] in C or DOUBLE PRECISION old(0:M+1,0:N+1) in Fortran (and similarly for new and edge), and set the halo values to 255. For IO you will need one more array called buf which has no halos. The entire calculation then becomes:

1. read the edges data file into the buf array
2. loop over i = 1, M; j = 1, N
   - edge_{i,j} = buf_{i-1,j-1} (in C)
   - edge_{i,j} = buf_{i,j} (in Fortran)
3. end loop
4. set the entire old array to 255, including the halos
5. begin loop over iterations
   - loop over i = 1, M; j = 1, N
     - set new_{i,j} = 1/4 (old_{i-1,j} + old_{i+1,j} + old_{i,j-1} + old_{i,j+1} - edge_{i,j})
   - end loop
   - set the old array equal to new, making sure that you do not copy the halos
6. end loop over iterations
7. copy the old array back to buf, excluding the halos
8. write out the final image by passing buf to pgmwrite

The initialisation of old to 255 achieves two things. Most importantly, it sets the very outer edges of the image to 255. However, it also sets up our initial guess for the solution to be a completely white image.

3.3 Testing the serial code

Run your code for 0, 1, 10, 100 and 1000 iterations and view the reconstructed image each time. Do you see what you expect? How many iterations do you think it will take to recover an exact image?

4 Initial parallelisation

For the initial parallelisation we will simply use trivial parallelism, i.e. each process will work on a different section of the image but with no communication between the processes. Although this will not reconstruct the image exactly, since we are not performing the required halo swaps, it serves as a good intermediate step towards a full parallel code. Most importantly, it will have exactly the same data decomposition and parallel IO approaches as a fully working parallel code. Before progressing any further, please ensure you have a backup copy of your working serial code.

The entire parallelisation process is made much simpler if we ensure that the slices of the image operated on by each process are contiguous pieces of the whole image (in terms of the layout in memory). This means dividing up the image between processes over the first dimension i for a C array edge[i][j], and the second dimension j for a Fortran array edge(i,j). Figure 2 illustrates how this would work on 4 processes, where the slices are numbered according to the rank of the process that owns them.

Figure 2: Decomposition strategies for 4 processes

Again, for simplicity, we will always assume that M and N are exactly divisible by the number of processes P. It is easiest to program the exercise if you make P a compile-time constant (a #define in C or a parameter in Fortran). The simplest approach to doing the IO is to read the entire file into an array on one master process (usually rank = 0) and then to distribute it amongst the other processes. Note that this is not particularly efficient in terms of memory usage, as we need enough space to store the whole image on a single process. I will call the dimensions of the image slices M_P and N_P, where

    for C:       M_P = M/P and N_P = N
    for Fortran: M_P = M   and N_P = N/P

4.1 The parallel program

The steps to creating the first parallel program are as follows (a structural sketch of these steps in C appears at the end of this section).

1. re-dimension all the arrays with M and N replaced by M_P and N_P
2. create a new array called masterbuf of size M x N
3. initialise MPI, compute the size and rank, and check that size = P
4. on the master process only, read the edges data file into masterbuf
5. split the data up amongst processes using MPI_Scatter with sendbuf = masterbuf and recvbuf = buf
6. now follow steps 2 through 7 of the original serial code, exactly as before, except with M and N replaced by the local sizes M_P and N_P
7. transfer the data from all the buf arrays back to masterbuf on the master using a call to MPI_Gather
8. on the master process only, write out the final image by passing masterbuf to pgmwrite

4.2 Testing the parallel program

Run your code for 0, 1, 10, 100 and 1000 iterations and look at the output images. How do they compare to the (correct) serial code? Is the total execution time decreasing as you expect? In terms of Amdahl's law, what are the inherently serial and potentially parallel parts of your (incomplete) MPI program?
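A possible overall structure for this first parallel code, in C, is given below. It is a sketch only: M, N and P are fixed at compile time, MP and NP correspond to M_P and N_P above, and the pgmread/pgmwrite calls are shown schematically (check the comments at the top of pgmio.c for the exact calling sequence).

```c
/* Sketch only: structure of the first (communication-free) parallel code. */
#include <mpi.h>

#define M  192
#define N  128
#define P  4          /* assumed number of processes, fixed at compile time */
#define MP (M/P)      /* local slice size: divide over i for C arrays       */
#define NP N

void pgmread (char *filename, void *buf, int nx, int ny);   /* from pgmio.c */
void pgmwrite(char *filename, void *buf, int nx, int ny);

double masterbuf[M][N];   /* whole image; only actually used on the master  */
double buf[MP][NP];       /* local slice, no halos                          */

int main(void)
{
  int rank, size;

  MPI_Init(NULL, NULL);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (size != P) MPI_Abort(MPI_COMM_WORLD, 1);

  if (rank == 0) pgmread("edge192x128.pgm", masterbuf, M, N);

  MPI_Scatter(masterbuf, MP*NP, MPI_DOUBLE,
              buf,       MP*NP, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  /* ... steps 2 to 7 of the serial code, with M and N replaced by MP, NP ... */

  MPI_Gather(buf,       MP*NP, MPI_DOUBLE,
             masterbuf, MP*NP, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  if (rank == 0) pgmwrite("image192x128.pgm", masterbuf, M, N);

  MPI_Finalize();
  return 0;
}
```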

5 Full parallel code

The only addition now required to get a full parallel code is to add halo swaps to the old array (which is the only array for which there are non-local array references of the form old_{i-1,j}, old_{i,j+1}, etc.). This should be done once every iteration, immediately after the start of the iteration loop and before any other computation has taken place.

To do this, each process must know the rank of its neighbouring processes. Referring to the Fortran decomposition as shown in Figure 2, these are given by rank - 1 and rank + 1. However, since we have fixed boundary conditions on the edges, rank 0 does not send any data to (or receive from) the left, and rank 3 need not send any data to (or receive from) the right. This is best achieved by defining a 1D Cartesian topology with non-periodic boundary conditions - you should already have code to do this from the previous "message round a ring" exercise. You also need to ensure that the processes do not deadlock by all trying to do synchronous sends at the same time. Again, you should re-use the code from the same exercise.

The communication involves sending and receiving entire rows or columns of data (depending on whether you are using C or Fortran). The process is as follows - it may be helpful to look at the appropriate decomposition in Figure 2.

For C:

  - send the N array elements (old[M_P][j]; j = 1, N) to rank + 1
  - receive N array elements from rank - 1 into (old[0][j]; j = 1, N)
  - send the N array elements (old[1][j]; j = 1, N) to rank - 1
  - receive N array elements from rank + 1 into (old[M_P+1][j]; j = 1, N)

For Fortran:

  - send the M array elements (old(i,N_P); i = 1, M) to rank + 1
  - receive M array elements from rank - 1 into (old(i,0); i = 1, M)
  - send the M array elements (old(i,1); i = 1, M) to rank - 1
  - receive M array elements from rank + 1 into (old(i,N_P+1); i = 1, M)

Each of the two send-receive pairs is basically the same as a step of the ring exercise, except that data is being sent in different directions (first clockwise then anti-clockwise). Remember that you can send and receive entire halos as single messages due to the way we have chosen to split the data amongst processes. One possible way of coding this in C is sketched at the end of this section.

5.1 Testing the complete code

Again, run your program for 0, 1, 10, 100 and 1000 iterations and compare the output images to the serial code. Are they exactly the same? How do the execution times compare to the serial code and the previous (incomplete) parallel code? How does the time scale with P? Plot parallel scaling curves for a range of problem sizes. You may want to insert explicit timing calls into the code so you can exclude the IO overheads.
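One possible way to code the halo swaps described above, for the C decomposition, is sketched below. It uses a 1D non-periodic Cartesian topology, so the two end processes automatically get MPI_PROC_NULL neighbours, and MPI_Sendrecv as one deadlock-free communication pattern (re-using the paired synchronous sends and receives from your ring exercise works just as well). This is a sketch only: MP and NP correspond to M_P and N_P as before, NITER is an assumed iteration count, and old is assumed to be declared as double old[MP+2][NP+2].

```c
/* Sketch only: halo swaps for the C decomposition, where each process holds
 * old[MP+2][NP+2] and exchanges whole rows with its two neighbours. */
MPI_Comm comm1d;
int dims[1]    = {P};
int periods[1] = {0};          /* non-periodic: fixed boundary conditions */
int left, right;

MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &comm1d);
MPI_Cart_shift(comm1d, 0, 1, &left, &right);

for (int iter = 0; iter < NITER; iter++) {

  /* send last interior row to rank+1, receive halo row 0 from rank-1 */
  MPI_Sendrecv(&old[MP][1], NP, MPI_DOUBLE, right, 1,
               &old[0][1],  NP, MPI_DOUBLE, left,  1,
               comm1d, MPI_STATUS_IGNORE);

  /* send first interior row to rank-1, receive halo row MP+1 from rank+1 */
  MPI_Sendrecv(&old[1][1],    NP, MPI_DOUBLE, left,  2,
               &old[MP+1][1], NP, MPI_DOUBLE, right, 2,
               comm1d, MPI_STATUS_IGNORE);

  /* ... update new from old and edge, then copy new back into old ... */
}
```

Because communication with MPI_PROC_NULL is a no-op, the same two calls work unchanged on every rank, including rank 0 and rank P-1.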

6 Further Work

Congratulations on getting this far! Here are a number of suggestions for extensions to your current parallel code, in no particular order.

6.1 Stopping criterion

It is possible to quantify how many iterations of the main loop are required, rather than simply stopping after some fixed number. The easiest thing to do is to monitor how much the image changes at each iteration; we will stop when none of the pixels change by more than a certain amount. To do this, compute the maximum absolute change of any pixel in the image, i.e. compute Δ given by

    Δ = max_{i,j} |new_{i,j} - old_{i,j}|

where |x| means the absolute value of x, which is always positive. In a parallel code you will need to compute Δ locally on each process, then do a global reduction across processes. Rewrite your code so that it terminates when Δ is less than some amount, say 0.1. Given the overhead of calculating Δ, particularly considering the need for a global reduction, is it worth checking it every iteration? By timing your code with and without computing Δ, estimate the optimal frequency at which you should check for completion.

6.2 Overlapping communication and calculation

One of the tutorial problems concerns overlapping communication and calculation, i.e. doing calculation that does not require the communicated halo data at the same time as the halo is being sent using non-blocking routines. Calculations involving the halos are done separately after the halos have arrived. See if you can implement this in practice in your code. Does it improve performance?

6.3 Derived Data Types for IO

The parallel IO currently proceeds in three phases, e.g. on input:

1. read the data into masterbuf
2. scatter the data from masterbuf to buf
3. copy buf into edge

and in the reverse order for output. By defining a derived datatype that maps onto the internal region of edge, excluding the halos, see if you can transfer data directly between masterbuf and edge.

6.4 Reducing IO memory

By altering the IO routines so that they can read in the data in smaller chunks, see if you can create a working parallel code without the need for the large masterbuf array (you may need to change the order in which the data is stored in the input file). How does the performance of the new parallel input and/or output routines compare to the original, especially on large numbers of processes?

6.5 Alternative decomposition

The way that the slices were mapped onto processes was chosen to simplify the parallelisation. How would the IO and halo-swap routines need to be changed if you wanted to parallelise your code by decomposing over the other dimension (over j for C and i for Fortran)? You should be able to make only minor changes to your existing code if you define appropriate derived datatypes for the halo data, although the IO will be more complicated. What is the effect on performance?
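As a possible starting point for the derived-datatype suggestions in Sections 6.3 and 6.5: in C, a halo that runs down a column of old is not contiguous in memory, but it can still be transferred as a single message if it is described by a vector datatype. The fragment below is a sketch only; MP and NP stand for the local interior dimensions of old (declared as old[MP+2][NP+2]) for whichever decomposition you choose, and left, right and comm1d are assumed to come from the Cartesian topology set up in Section 5. A similar idea, e.g. a subarray type covering the interior of edge, can be used for the IO suggestion in Section 6.3.

```c
/* Sketch only: a derived datatype describing one interior column of
 * old[MP+2][NP+2], i.e. MP doubles separated by a stride of NP+2 elements. */
MPI_Datatype coltype;

MPI_Type_vector(MP, 1, NP+2, MPI_DOUBLE, &coltype);
MPI_Type_commit(&coltype);

/* Swap column halos: send the right-most interior column to the right
 * neighbour while receiving the left halo column from the left neighbour. */
MPI_Sendrecv(&old[1][NP], 1, coltype, right, 1,
             &old[1][0],  1, coltype, left,  1,
             comm1d, MPI_STATUS_IGNORE);
```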