MPI, Part 3. Scientific Computing Course, Part 3



Non-blocking communications

Diffusion: Had to wait. We had to wait for the communications before we could compute: the end points could not be computed without the guardcell data, so all work halted while all the communications occurred. Significant parallel overhead. [Figure: the global domain split between Job 1 (zones n-4 ... n) and Job 2 (zones -1 ... 3).]

Diffusion: Had to wait? But the inner zones could have been computed just fine. Ideally, we would do the inner-zone work while the communication is in flight, then go back and do the end points. [Figure: the same global domain split between Job 1 and Job 2.]

Nonblocking Sends. Allow you to get work done while the message is in flight. You must not alter the send buffer until the send has completed.
C: MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
FORTRAN: MPI_ISEND(BUF, INTEGER COUNT, INTEGER DATATYPE, INTEGER DEST, INTEGER TAG, INTEGER COMM, INTEGER REQUEST, INTEGER IERROR)
Usage pattern: MPI_Isend(...); ...work... ...work...

Nonblocking Recvs. Allow you to get work done while the message is in flight. You must not access the receive buffer until the receive has completed.
C: MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
FORTRAN: MPI_IRECV(BUF, INTEGER COUNT, INTEGER DATATYPE, INTEGER SOURCE, INTEGER TAG, INTEGER COMM, INTEGER REQUEST, INTEGER IERROR)
Usage pattern: MPI_Irecv(...); ...work... ...work...

How do we tell if the message has completed?
C: int MPI_Wait(MPI_Request *request, MPI_Status *status);
FORTRAN: MPI_WAIT(INTEGER REQUEST, INTEGER STATUS(MPI_STATUS_SIZE), INTEGER IERROR)
C: int MPI_Waitall(int count, MPI_Request *array_of_requests, MPI_Status *array_of_statuses);
FORTRAN: MPI_WAITALL(INTEGER COUNT, INTEGER ARRAY_OF_REQUESTS(*), INTEGER ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), INTEGER IERROR)
Also: MPI_Waitany, MPI_Test...
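Putting the three calls together, here is a minimal sketch (in C) of the overlap pattern for the diffusion problem: post the receives and sends for the guardcells, update the inner zones while the messages are in flight, then wait before touching the end points. The array layout, neighbour ranks and tags are assumptions for illustration, not the course's code.

#include <mpi.h>

/* Sketch: overlap interior work with a 1-D guardcell exchange.
   Assumes a local array u[0..n+1] with one guardcell on each side and
   neighbour ranks 'left' and 'right' (MPI_PROC_NULL at the domain edges). */
void exchange_and_work(double *u, int n, int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[4];

    /* Post receives into the guardcells, and sends of the edge zones. */
    MPI_Irecv(&u[0],   1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(&u[n+1], 1, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(&u[1],   1, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(&u[n],   1, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    /* ... update the inner zones 2..n-1 here; they need no guardcell data ... */

    /* Before the guardcells are read (or the send buffers reused), wait. */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    /* ... now update the end points 1 and n using the filled guardcells ... */
}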

Guardcells. This works for the parallel decomposition, too! Job 1 needs info on Job 2's 0th zone, and Job 2 needs info on Job 1's last zone. Pad the array with guardcells and fill them with the info from the appropriate node, by message passing or shared memory. Hydro code: needs guardcells 2 deep. [Figure: the global domain split between Job 1 (zones n-4 ... n) and Job 2 (zones -1 ... 3).]

Guardcell fill. When we're doing boundary conditions, we swap guardcells with the neighbour:
1: u(:, nx:nx+ng, ng:ny-ng)  ->  2: u(:, 1:ng, ng:ny-ng)
2: u(:, ng+1:2*ng, ng:ny-ng)  ->  1: u(:, nx+ng+1:nx+2*ng, ng:ny-ng)
That is (ny-2*ng)*ng values to swap in each direction.

A cute way to do periodic BCs: actually make the decomposed mesh periodic, i.e. make the far ends of the mesh neighbours. A task doesn't know the difference between that and any other neighbouring grid. MPI_Cart_create sets this up for us automatically upon request.
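A minimal sketch of that request, using a 1-D decomposition just to keep it short (the dims, the reorder choice and the printed output are illustrative, not the course code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int size, rank, left, right;
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[1]    = { size };   /* one row of tasks                  */
    int periods[1] = { 1 };      /* periodic: wrap around at the ends */
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 1 /* reorder */, &cart);

    MPI_Comm_rank(cart, &rank);
    /* At a periodic boundary, Cart_shift returns the rank on the far side
       instead of MPI_PROC_NULL, so boundary tasks look like any other. */
    MPI_Cart_shift(cart, 0, 1, &left, &right);
    printf("rank %d: left=%d right=%d\n", rank, left, right);

    MPI_Finalize();
    return 0;
}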

Implementing in MPI. This is no different in principle than the diffusion case: just more values, and more variables (dens, ener, imomx, ...). The simplest way: copy all the variables into an NVARS*(ny-2*ng)*ng sized buffer and send that.
1: u(:, nx:nx+ng, ng:ny-ng)  ->  2: u(:, 1:ng, ng:ny-ng)
2: u(:, ng+1:2*ng, ng:ny-ng)  ->  1: u(:, nx+ng+1:nx+2*ng, ng:ny-ng)
nvars*(ny-2*ng)*ng values to swap in each direction.

Implementing in MPI. Again: no different in principle than diffusion, just more values and more variables (dens, ener, temp, ...); the simplest way is to copy all the variables into one NVARS*(ny-2*ng)*ng sized buffer. A sketch of that approach follows.
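Here is a minimal sketch of the copy-into-a-buffer exchange in C. The flattened u[(v*NX + i)*NY + j] layout, the function and variable names, and counting nx x ny real zones plus ng guardcells per side are all assumptions for illustration, not the course's hydro code.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: swap x-direction guardcells by packing into a buffer.  Sends this
   task's rightmost real zones to the right neighbour and receives the left
   neighbour's rightmost real zones into our left guardcells. */
void swap_x_guardcells_buffered(double *u, int nvars, int nx, int ny, int ng,
                                int left, int right, MPI_Comm comm)
{
    const int NX = nx + 2*ng, NY = ny + 2*ng;  /* stored extents        */
    const int count = nvars * ng * ny;         /* values in one message */
    double *sendbuf = malloc(count * sizeof(double));
    double *recvbuf = malloc(count * sizeof(double));
    int p;

    /* Pack our last ng real columns (i = nx .. nx+ng-1), real rows only. */
    p = 0;
    for (int v = 0; v < nvars; v++)
        for (int i = nx; i < nx + ng; i++)
            for (int j = ng; j < ng + ny; j++)
                sendbuf[p++] = u[(v*NX + i)*NY + j];

    MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, right, 0,
                 recvbuf, count, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);

    /* Unpack into our left guardcells (i = 0 .. ng-1). */
    p = 0;
    for (int v = 0; v < nvars; v++)
        for (int i = 0; i < ng; i++)
            for (int j = ng; j < ng + ny; j++)
                u[(v*NX + i)*NY + j] = recvbuf[p++];

    /* ... then the mirror-image pass: send leftward, fill right guardcells. */

    free(sendbuf);
    free(recvbuf);
}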

Implementing in MPI. An even simpler way: loop over the values, sending each one individually rather than copying them into a buffer. But that means NVARS*nguard*(ny-2*nguard) separate latency hits, which would completely dominate the communication cost.

Implementing in MPI. The buffer approach is simple, but it introduces extraneous copies, and memory bandwidth is already a bottleneck for these codes. It would be nice to just point MPI at the start of the guardcell data and have it read from there.

Implementing in MPI. Let me make one simplification for now: copy whole stripes, including the Ng x Ng corners. This isn't necessary, but it makes things simpler at first, and it costs only 2 x Ng^2 = 8 extra cells, a small fraction of the ~200-2000 cells that would normally be copied.

Implementing in MPI. Recall how the 2d memory is laid out: the y-direction guardcells are contiguous in memory. [Figure: the (i, j) memory layout, with the y-direction guardcell stripe highlighted.]

Implementing in MPI. So we can send them in one go:
FORTRAN: call MPI_Send(u(1,1,ny), nvars*nguard*ny, MPI_REAL, ...)
C: ierr = MPI_Send(&(u[ny][0][0]), nvars*nguard*ny, MPI_FLOAT, ...)

Implementing in MPI: creating MPI datatypes. MPI_Type_contiguous is the simplest case: it lets you build a string of some other type. Arguments: count, old type, &newtype.
MPI_Datatype ybctype;
ierr = MPI_Type_contiguous(nvals*nguard*ny, MPI_FLOAT, &ybctype);
ierr = MPI_Type_commit(&ybctype);
MPI_Send(&(u[ny][0][0]), 1, ybctype, ...);
ierr = MPI_Type_free(&ybctype);

Implementing in MPI: the same thing in Fortran. Arguments: count, old type, new type, ierr.
integer :: ybctype
call MPI_Type_contiguous(nvals*nguard*ny, MPI_REAL, ybctype, ierr)
call MPI_Type_commit(ybctype, ierr)
call MPI_Send(u(1,1,ny), 1, ybctype, ...)
call MPI_Type_free(ybctype, ierr)

Implementing in MPI. Recall how the 2d memory is laid out: the x-direction guardcells or boundary values are not contiguous. How do we do something like this for the x direction? [Figure: the (i, j) memory layout, with the non-contiguous x-direction stripe highlighted.]

Implementing in MPI.
int MPI_Type_vector(int count, int blocklen, int stride, MPI_Datatype old_type, MPI_Datatype *newtype);
For the x-direction guardcells: count = ny, blocklen = ng*nvars, stride = nx*nvars.

Implementing in MPI.
ierr = MPI_Type_vector(ny, nguard*nvars, nx*nvars, MPI_FLOAT, &xbctype);
ierr = MPI_Type_commit(&xbctype);
ierr = MPI_Send(&(u[0][nx][0]), 1, xbctype, ...);
ierr = MPI_Type_free(&xbctype);
(count = ny, blocklen = ng*nvars, stride = nx*nvars)

Implementing in MPI. Check: the total amount of data sent = blocklen*count = ny*ng*nvars, and we skipped over stride*count = nx*ny*nvars. (count = ny, blocklen = ng*nvars, stride = nx*nvars)
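Putting the vector type to work, a sketch of a complete x-direction guardcell swap in C might look like the following. The flat u[(j*NX + i)*nvars + v] layout, the neighbour ranks and the tags are assumptions for illustration; ny here counts every stored row, matching the slide's count = ny.

#include <mpi.h>

/* Sketch: swap x-direction guardcells with MPI_Type_vector, no buffer copies.
   Real columns are i = ng .. ng+nx-1; left/right are neighbour ranks
   (MPI_PROC_NULL at a non-periodic edge). */
void swap_x_guardcells_vector(float *u, int nvars, int nx, int ny, int ng,
                              int left, int right, MPI_Comm comm)
{
    const int NX = nx + 2*ng;
    MPI_Datatype xbctype;

    /* ny blocks of ng*nvars contiguous floats, consecutive blocks separated
       by one full row of NX*nvars floats. */
    MPI_Type_vector(ny, ng*nvars, NX*nvars, MPI_FLOAT, &xbctype);
    MPI_Type_commit(&xbctype);

    /* Send our last ng real columns (starting at i = nx) to the right;
       receive the left neighbour's strip into our left guardcells (i = 0). */
    MPI_Sendrecv(&u[nx*nvars], 1, xbctype, right, 0,
                 &u[0],        1, xbctype, left,  0,
                 comm, MPI_STATUS_IGNORE);

    /* Send our first ng real columns (i = ng) to the left; receive the right
       neighbour's strip into our right guardcells (i = nx+ng). */
    MPI_Sendrecv(&u[ng*nvars],      1, xbctype, left,  1,
                 &u[(nx+ng)*nvars], 1, xbctype, right, 1,
                 comm, MPI_STATUS_IGNORE);

    MPI_Type_free(&xbctype);
}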

In MPI, there's always more than one way: MPI_Type_create_subarray describes a piece of a multi-dimensional array. It is much more convenient for higher-dimensional arrays (otherwise you need vectors of vectors of vectors...).
C: int MPI_Type_create_subarray(int ndims, int *array_of_sizes, int *array_of_subsizes, int *array_of_starts, int order, MPI_Datatype oldtype, MPI_Datatype *newtype);
FORTRAN: call MPI_Type_create_subarray(integer ndims, [array_of_sizes], [array_of_subsizes], [array_of_starts], order, oldtype, newtype, ierr)
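For instance, the same x-direction strip from the vector-type sketch above could be described as a subarray; a minimal sketch in C, with the same assumed (ny, NX, nvars) C-ordered layout:

#include <mpi.h>

/* Sketch: describe the rightmost ng real columns of a (ny, NX, nvars)
   C-ordered array as a subarray type.  Sizes are illustrative. */
MPI_Datatype make_xbc_subarray(int nvars, int nx, int ny, int ng)
{
    const int NX = nx + 2*ng;
    int sizes[3]    = { ny, NX, nvars };   /* the full local array       */
    int subsizes[3] = { ny, ng, nvars };   /* the strip we want to send  */
    int starts[3]   = { 0,  nx, 0     };   /* the strip begins at i = nx */
    MPI_Datatype xbctype;

    MPI_Type_create_subarray(3, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_FLOAT, &xbctype);
    MPI_Type_commit(&xbctype);
    return xbctype;   /* use with count 1; MPI_Type_free() it when done */
}

Note that with a subarray type the buffer argument is the start of the whole array; the starts[] offsets are built into the type, so the send would be MPI_Send(u, 1, xbctype, ...).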

MPI-IO. We would like the new, parallel version to still be able to write out a single output file. But at no point does a single processor have the entire domain...

Parallel I/O. Each processor has to write its own piece of the domain... without overwriting the others. This is easier if there is global coordination.

MPI-IO. Uses MPI to coordinate reading from and writing to a single shared file. The coordination is done with collective operations.

PPM file format: a simple file format. The header is ASCII characters: 'P6', comments, width/height, and the maximum value. The data then follow row by row as triples of bytes, each pixel = 3 bytes. Someone has to write the header, then each PE has to output only its own 3-byte pixels, skipping everyone else's.
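As a sketch of the header step (the helper name, the fixed '255' maximum value and the use of MPI_File_write_at are illustrative assumptions, not the course's writer):

#include <mpi.h>
#include <stdio.h>

/* Sketch: rank 0 writes the ASCII PPM header; every rank computes the same
   header length so it can later be used as the displacement of its view. */
MPI_Offset write_ppm_header(MPI_File fh, int width, int height, int rank)
{
    char header[64];
    int len = snprintf(header, sizeof(header), "P6\n%d %d\n255\n", width, height);

    if (rank == 0)
        MPI_File_write_at(fh, 0, header, len, MPI_CHAR, MPI_STATUS_IGNORE);

    return (MPI_Offset)len;   /* identical on every rank: use as disp */
}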

MPI-IO File View. Each process has a view of the file that consists only of the parts accessible to it. For writing, the views are hopefully non-overlapping! Describing this (how the data are laid out in the file) is very similar to describing how data are laid out in memory...

MPI-IO File View.
int MPI_File_set_view(
    MPI_File fh,
    MPI_Offset disp,        /* displacement in bytes from the start   */
    MPI_Datatype etype,     /* elementary type                        */
    MPI_Datatype filetype,  /* file type; probably different per proc */
    char *datarep,          /* native or internal                     */
    MPI_Info info)          /* MPI_INFO_NULL for today                */
[Figure: the file as a stream of etypes, starting disp bytes in.]

MPI-IO File View. The filetype is built out of etypes and repeats as necessary along the file; each process's filetype picks out the parts of the file that process will access. [Figure: the file tiled by repeating filetype blocks made of etypes.]

MPI-IO File Write.
int MPI_File_write_all(MPI_File fh, void *buf, int count, MPI_Datatype datatype, MPI_Status *status)
Writes (_all: collectively) to the part of the file within the view.
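A minimal end-to-end sketch of these calls in C, writing each rank's strip of rows of a float array into one shared binary file. The filename, the row decomposition and the dummy data are assumptions for illustration, not the course's PPM writer; for the PPM case, disp would be the header length and the etype would be bytes.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: every rank owns nlocal rows of a global (gny x gnx) float array
   and deposits them into a single shared file through a file view. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int gnx = 100, nlocal = 10;       /* illustrative sizes    */
    const int gny = nlocal * size;          /* global number of rows */
    float *data = malloc(nlocal * gnx * sizeof(float));
    for (int i = 0; i < nlocal * gnx; i++)
        data[i] = (float)rank;              /* dummy data            */

    /* Filetype: this rank's block of rows within the global array. */
    int sizes[2]    = { gny,           gnx };
    int subsizes[2] = { nlocal,        gnx };
    int starts[2]   = { rank * nlocal, 0   };
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_FLOAT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "domain.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* disp = 0 (no header in this sketch), etype = MPI_FLOAT. */
    MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL);
    /* Collective write: each rank fills in exactly its own view. */
    MPI_File_write_all(fh, data, nlocal * gnx, MPI_FLOAT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Type_free(&filetype);
    free(data);
    MPI_Finalize();
    return 0;
}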

C syntax:
MPI_Status status;
ierr = MPI_Init(&argc, &argv);
ierr = MPI_Comm_{size,rank}(Communicator, &{size,rank});
ierr = MPI_Send(sendptr, count, MPI_TYPE, destination, tag, Communicator);
ierr = MPI_Recv(rcvptr, count, MPI_TYPE, source, tag, Communicator, &status);
ierr = MPI_Sendrecv(sendptr, count, MPI_TYPE, destination, tag, recvptr, count, MPI_TYPE, source, tag, Communicator, &status);
ierr = MPI_Allreduce(&mydata, &globaldata, count, MPI_TYPE, MPI_OP, Communicator);
Communicator -> MPI_COMM_WORLD
MPI_TYPE -> MPI_FLOAT, MPI_DOUBLE, MPI_INT, MPI_CHAR, ...
MPI_OP -> MPI_SUM, MPI_MIN, MPI_MAX, ...

FORTRAN syntax:
integer status(MPI_STATUS_SIZE)
call MPI_INIT(ierr)
call MPI_COMM_{SIZE,RANK}(Communicator, {size,rank}, ierr)
call MPI_SSEND(sendarr, count, MPI_TYPE, destination, tag, Communicator, ierr)
call MPI_RECV(rcvarr, count, MPI_TYPE, source, tag, Communicator, status, ierr)
call MPI_SENDRECV(sendarr, count, MPI_TYPE, destination, tag, rcvarr, count, MPI_TYPE, source, tag, Communicator, status, ierr)
call MPI_ALLREDUCE(mydata, globaldata, count, MPI_TYPE, MPI_OP, Communicator, ierr)
Communicator -> MPI_COMM_WORLD
MPI_TYPE -> MPI_REAL, MPI_DOUBLE_PRECISION, MPI_INTEGER, MPI_CHARACTER
MPI_OP -> MPI_SUM, MPI_MIN, MPI_MAX, ...