COMP 322: Fundamentals of Parallel Programming


COMP 322: Fundamentals of Parallel Programming https://wiki.rice.edu/confluence/display/parprog/comp322 Lecture 37: Introduction to MPI (contd) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu 1 COMP 322 Lecture 36 20 April 2011

Acknowledgments for Today's Lecture: Principles of Parallel Programming, Calvin Lin & Lawrence Snyder (includes resources available at http://www.pearsonhighered.com/educator/academic/product/0,3110,0321487907,00.html); Parallel Architectures, Calvin Lin, Lectures 5 & 6, CS380P, Spring 2009, UT Austin (http://www.cs.utexas.edu/users/lin/cs380p/schedule.html); mpijava home page: http://www.hpjava.org/mpijava.html; MPI-based Approaches for Java, presentation by Bryan Carpenter (http://www.hpjava.org/courses/arl); MPI lectures given at the Rice HPC Summer Institute 2009, Tim Warburton, May 2009.

Announcements: Homework 7 is due by 5pm on Friday, April 22nd. Send email to comp322-staff if you're running into issues with accessing SUG@R nodes, or anything else. The take-home final exam will be given out on Friday, April 22nd; its content will cover the second half of the semester. Come to Friday's lecture for a review of the final exam material! The exam is due by 5pm on Friday, April 29th.

Organization of a Distributed-Memory Multiprocessor: Figure (a): a host node (Pc) connected to a cluster of processor nodes (P0 ... Pm). Processors P0 ... Pm communicate via an interconnection network, which supports much lower latencies and higher bandwidth than standard TCP/IP networks. Figure (b): each processor node consists of a processor, memory, and a Network Interface Card (NIC) connected to a router node (R) in the interconnect.

Use of MPI on an SMP: [Figure: memory hierarchy for a single Intel Xeon quad-core E5440 Harpertown processor chip. Four MPI processes run on Cores A through D (Process 0 on Core A, ..., Process 3 on Core D); each core has its own registers and L1 d-/i-caches, pairs of cores share an L2 unified cache, and the L3 unified cache and main memory are shared by all cores.] A SUG@R node contains two such chips.

Recap of mpijava Send() and Recv(): The send and receive members of Comm are:

    void Send(Object buf, int offset, int count, Datatype type, int dst, int tag);
    Status Recv(Object buf, int offset, int count, Datatype type, int src, int tag);

The arguments buf, offset, count, and type describe the data buffer, i.e., the storage of the data that is sent or received; they will be discussed on the next slide. dst is the rank of the destination process relative to this communicator. Similarly, in Recv(), src is the rank of the source process. An arbitrarily chosen tag value can be used in Recv() to select between several incoming messages: the call will wait until a message sent with a matching tag value arrives. The Recv() method returns a Status value, discussed later. Both Send() and Recv() are blocking operations by default, analogous to a phaser next operation.
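
For concreteness, here is a minimal, self-contained C analogue of the Send/Recv pattern above (a sketch added to this transcript, not from the original slides), using the standard MPI C bindings that the later slides switch to; the payload value and tag are arbitrary choices for illustration:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;   /* arbitrary payload */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Blocks until a message with matching source and tag arrives */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("Rank 1 received %d from rank %d\n", value, status.MPI_SOURCE);
        }

        MPI_Finalize();
        return 0;
    }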

Deadlock Scenario #1 (C version): Consider:

    int a[10], b[10], myrank;
    MPI_Status status;
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
        MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
        MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    }
    ...

If MPI_Send is blocking, there is a deadlock: process 0 blocks in its first send (tag 1), while process 1 blocks in its first receive, which waits for tag 2.

Deadlock Scenario #2 (C version): Consider the following piece of code, in which process i sends a message to process i + 1 (modulo the number of processes) and receives a message from process i - 1 (modulo the number of processes):

    int a[10], b[10], npes, myrank;
    MPI_Status status;
    ...
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
    MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
    ...

Once again, we have a deadlock if MPI_Send is blocking: every process waits in its send, so no process ever reaches its receive.

Approach #1 to Deadlock Avoidance --- Reorder Send and Recv calls: We can break the circular wait to avoid deadlocks as follows:

    int a[10], b[10], npes, myrank;
    MPI_Status status;
    ...
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank%2 == 1) {
        MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
        MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
    } else {
        MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
        MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
    }
    ...

Approach #2 to Deadlock Avoidance --- a combined Sendrecv() call: Since it is fairly common to want to simultaneously send one message while receiving another (as illustrated on the previous slide), MPI provides a more specialized operation for this. In mpijava the corresponding method of Comm has the complicated signature:

    Status Sendrecv(Object sendbuf, int sendoffset, int sendcount,
                    Datatype sendtype, int dst, int sendtag,
                    Object recvbuf, int recvoffset, int recvcount,
                    Datatype recvtype, int src, int recvtag);

This can be more efficient than doing separate sends and receives, and it can be used to avoid deadlock conditions in certain situations. It is analogous to a phaser next operation, where the programmer does not have access to individual signal/wait operations. There is also a variant called Sendrecv_replace() which only specifies a single buffer: the original data is sent from this buffer, then overwritten with incoming data.
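
For the Sendrecv_replace() variant mentioned above, a minimal C sketch (added here, not from the original slides) of the corresponding MPI_Sendrecv_replace call is shown below; the buffer size, tag values, and ring-style neighbor computation are illustrative choices:

    int a[10], npes, myrank;
    MPI_Status status;
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    /* a[] is sent to the right neighbor, then overwritten with the
       data received from the left neighbor (single buffer) */
    MPI_Sendrecv_replace(a, 10, MPI_INT,
                         (myrank+1)%npes, 1,        /* dest, sendtag */
                         (myrank-1+npes)%npes, 1,   /* source, recvtag */
                         MPI_COMM_WORLD, &status);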

Using Sendrecv for Deadlock Avoidance in Scenario #2 (C version): Consider the following piece of code, in which process i sends a message to process i + 1 (modulo the number of processes) and receives a message from process i - 1 (modulo the number of processes):

    int a[10], b[10], npes, myrank;
    MPI_Status status;
    ...
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Sendrecv(a, 10, MPI_INT, (myrank+1)%npes, 1,
                 b, 10, MPI_INT, (myrank-1+npes)%npes, 1,
                 MPI_COMM_WORLD, &status);
    ...

A combined Sendrecv() call avoids deadlock in this case.

MPI Nonblocking Point-to-point Communication

Latency in Blocking vs. Nonblocking Communication: [Figure comparing the communication timelines of blocking communication and nonblocking communication.]

Non-Blocking Send and Receive operations: In order to overlap communication with computation, MPI provides a pair of functions for performing non-blocking send and receive operations ("I" stands for "Immediate"):

    int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
    int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)

These operations return before the communication has completed. The function MPI_Test tests whether or not the nonblocking send or receive operation identified by its request has finished:

    int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)

MPI_Wait waits for the operation to complete:

    int MPI_Wait(MPI_Request *request, MPI_Status *status)
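
As an added illustration (not from the slides), the C sketch below polls a nonblocking receive with MPI_Test, doing other work until the message arrives; the buffer size and tag are arbitrary, and do_some_useful_work() is a hypothetical application routine:

    int buf[10];
    int flag = 0;
    MPI_Request req;
    MPI_Status status;

    /* Post the receive immediately; the call returns before any data has arrived */
    MPI_Irecv(buf, 10, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);

    /* Overlap: keep computing while the message is in flight */
    while (!flag) {
        do_some_useful_work();            /* hypothetical application routine */
        MPI_Test(&req, &flag, &status);   /* flag becomes nonzero on completion */
    }
    /* Here buf[] is valid and status describes the received message */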

Non-blocking Example: Example pseudo-code on process 0:

    if (procid == 0) {
        MPI_Isend outgoing to 1
        MPI_Irecv incoming from 1
        .. compute ..
        MPI_Wait until Irecv has received incoming
        .. compute ..
        MPI_Wait until Isend no longer needs outgoing
    }

Example pseudo-code on process 1:

    if (procid == 1) {
        MPI_Isend outgoing to 0
        MPI_Irecv incoming from 0
        .. compute ..
        MPI_Wait until Irecv has filled incoming
        .. compute ..
        MPI_Wait until Isend no longer needs outgoing
    }

Using non-blocking sends and receives allows us to overlap the latency and buffering overheads with useful computation.
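
A concrete C rendering of this pseudo-code (a sketch added here, not part of the original slides) could look as follows on either process; peer is the rank of the other process, and the buffer size is illustrative:

    int procid, peer;
    int outgoing[10], incoming[10];
    MPI_Request send_req, recv_req;
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &procid);
    peer = 1 - procid;                  /* process 0 pairs with process 1 */
    MPI_Isend(outgoing, 10, MPI_INT, peer, 0, MPI_COMM_WORLD, &send_req);
    MPI_Irecv(incoming, 10, MPI_INT, peer, 0, MPI_COMM_WORLD, &recv_req);
    /* .. compute on data that does not involve the buffers .. */
    MPI_Wait(&recv_req, MPI_STATUS_IGNORE);   /* incoming[] is now safe to read */
    /* .. compute using incoming[] .. */
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);   /* outgoing[] is now safe to reuse */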

Avoiding Deadlocks (C version): Using non-blocking operations removes most deadlocks. Consider:

    int a[10], b[10], myrank;
    MPI_Status status;
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
        MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
        MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    }
    ...

Replacing either the send or the receive operations with non-blocking counterparts fixes this deadlock.
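
One way to apply that fix (a sketch added here, not taken from the slides) is to replace process 1's branch with nonblocking MPI_Irecv calls, so that both receives are posted before waiting on either:

    else if (myrank == 1) {
        MPI_Request reqs[2];
        /* Post both receives up front; neither call blocks */
        MPI_Irecv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &reqs[1]);
        /* Now process 0's two sends can be matched in either order */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }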

MPI Collective Communication

Collective Communications: A popular feature of MPI is its family of collective communication operations. Each of these operations is defined over a communicator, and all processes in the communicator must perform the same operation; this acts as an implicit barrier (analogous to a phaser next). The simplest example is the broadcast operation: all processes invoke the operation, all agreeing on one root process, and data is broadcast from that root. In mpijava: void Bcast(Object buf, int offset, int count, Datatype type, int root) broadcasts a message from the process with rank root to all processes of the group.
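
For reference, a minimal C fragment using the corresponding MPI_Bcast function (added as a sketch, not part of the slides); the value being broadcast and the choice of root 0 are arbitrary:

    int value;
    int myrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        value = 29;    /* only the root's value matters before the call */
    }
    /* Every process in the communicator must make this same call */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    /* After the call, value == 29 on every process */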

MPI_Bcast: [Figure: a broadcast among 8 processes (proc 0 through 7); after the call, every process holds the root's value, 29.]

More Examples of Collective Operations: All of the following are instance methods of Intracomm:

    void Barrier()
Blocks the caller until all processes in the group have called it.

    void Gather(Object sendbuf, int sendoffset, int sendcount, Datatype sendtype, Object recvbuf, int recvoffset, int recvcount, Datatype recvtype, int root)
Each process sends the contents of its send buffer to the root process.

    void Scatter(Object sendbuf, int sendoffset, int sendcount, Datatype sendtype, Object recvbuf, int recvoffset, int recvcount, Datatype recvtype, int root)
The inverse of the Gather operation.

    void Reduce(Object sendbuf, int sendoffset, Object recvbuf, int recvoffset, int count, Datatype datatype, Op op, int root)
Combines the elements in the send buffer of each process using the reduce operation, and returns the combined value in the receive buffer of the root process.

MPI_Gather: On occasion it is necessary to copy an array of data from each process into a single array on a single process. [Figure: P0, P1, and P2 each contribute a two-element array (1 3, 4 -2, and -1 4, respectively), which are concatenated into the array 1 3 4 -2 -1 4 on P0.] Note: only process 0 (P0) needs to supply storage for the output.
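
A minimal C fragment illustrating this (added as a sketch, not from the slides); each rank contributes two ints and only the root allocates the full receive array (assumes <stdlib.h> is included for malloc):

    int myrank, npes;
    int mine[2];
    int *all = NULL;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    mine[0] = 2 * myrank;       /* arbitrary per-rank values */
    mine[1] = 2 * myrank + 1;
    if (myrank == 0) {
        all = malloc(npes * 2 * sizeof(int));   /* only the root needs the output buffer */
    }
    /* Gather 2 ints from every rank into consecutive slots of all[] on rank 0 */
    MPI_Gather(mine, 2, MPI_INT, all, 2, MPI_INT, 0, MPI_COMM_WORLD);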

MPI_Reduce: In mpijava:

    void MPI.COMM_WORLD.Reduce(
        Object[] sendbuf      /* in */,
        int sendoffset        /* in */,
        Object[] recvbuf      /* out */,
        int recvoffset        /* in */,
        int count             /* in */,
        MPI.Datatype datatype /* in */,
        MPI.Op operator       /* in */,
        int root              /* in */)

[Figure: ranks 0 through 4 contribute the values 15, 10, 12, 8, and 4; their sum, 49, is delivered to the root, rank 2.]

    MPI.COMM_WORLD.Reduce(msg, 0, result, 0, 1, MPI.INT, MPI.SUM, 2);
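
The equivalent call with the C bindings (a sketch added here, reusing the figure's values purely for illustration) would be:

    int contribution[5] = {15, 10, 12, 8, 4};   /* per-rank values from the figure */
    int myrank, myvalue, sum = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    myvalue = contribution[myrank];             /* assumes exactly 5 ranks */
    /* Element-wise sum across all ranks; only the root (rank 2) receives 49 */
    MPI_Reduce(&myvalue, &sum, 1, MPI_INT, MPI_SUM, 2, MPI_COMM_WORLD);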

Predefined Reduction Operations:

    Operation    Meaning                      Datatypes
    MPI_MAX      Maximum                      C integers and floating point
    MPI_MIN      Minimum                      C integers and floating point
    MPI_SUM      Sum                          C integers and floating point
    MPI_PROD     Product                      C integers and floating point
    MPI_LAND     Logical AND                  C integers
    MPI_BAND     Bit-wise AND                 C integers and byte
    MPI_LOR      Logical OR                   C integers
    MPI_BOR      Bit-wise OR                  C integers and byte
    MPI_LXOR     Logical XOR                  C integers
    MPI_BXOR     Bit-wise XOR                 C integers and byte
    MPI_MAXLOC   Maximum value and location   Data-pairs
    MPI_MINLOC   Minimum value and location   Data-pairs

MPI_MAXLOC and MPI_MINLOC: The operation MPI_MAXLOC combines pairs of values (v_i, l_i) and returns the pair (v, l) such that v is the maximum among all v_i's and l is the corresponding l_i (if there is more than one such pair, it is the smallest among all these l_i's). MPI_MINLOC does the same, except for the minimum value of v_i. [Figure: an example use of the MPI_MINLOC and MPI_MAXLOC operators.]
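
A short C sketch (added for illustration, not in the slides) of the usual MPI_MAXLOC pattern, where each rank pairs a local value with its own rank as the location; compute_local_error() is a hypothetical per-rank quantity:

    struct {
        double value;
        int    rank;
    } local, global;

    int myrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    local.value = compute_local_error();   /* hypothetical per-rank quantity */
    local.rank  = myrank;
    /* After the call, rank 0 knows the largest value and which rank owned it */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);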

Datatypes for MPI_MAXLOC and MPI_MINLOC: MPI datatypes for data-pairs used with the MPI_MAXLOC and MPI_MINLOC reduction operations:

    MPI Datatype           C Datatype
    MPI_2INT               pair of ints
    MPI_SHORT_INT          short and int
    MPI_LONG_INT           long and int
    MPI_LONG_DOUBLE_INT    long double and int
    MPI_FLOAT_INT          float and int
    MPI_DOUBLE_INT         double and int

More Collective Communication Operations: If the result of the reduction operation is needed by all processes, MPI provides:

    int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

MPI also provides the MPI_Allgather function, in which the data are gathered at all the processes:

    int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, MPI_Comm comm)

To compute prefix sums, MPI provides:

    int MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
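
As a brief added illustration (not from the slides), an MPI_Scan call that computes an inclusive prefix sum of the ranks themselves:

    int myrank, prefix = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    /* Inclusive prefix sum: rank r receives 0 + 1 + ... + r */
    MPI_Scan(&myrank, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);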

MPI_Alltoall:

    int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, MPI_Comm comm)

[Figure: a 3-process all-to-all with 2-element blocks. Send arrays: P0 has 1 3 | -2 5 | 7 1, P1 has 4 -2 | 5 2 | -3 6, P2 has -1 4 | 5 1 | 2 4. Receive arrays: P0 gets 1 3 | 4 -2 | -1 4, P1 gets -2 5 | 5 2 | 5 1, P2 gets 7 1 | -3 6 | 2 4.]

Each process submits an array to MPI_Alltoall. The array on each process is split into nprocs sub-arrays; sub-array n from process m is sent to process n and placed in the m-th block of the result array.
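
A minimal C fragment matching this picture with 2-int blocks (a sketch added here, not from the slides; the fill values are arbitrary and <stdlib.h> is assumed for malloc):

    int npes, myrank;
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    int *sendbuf = malloc(npes * 2 * sizeof(int));   /* nprocs blocks of 2 ints */
    int *recvbuf = malloc(npes * 2 * sizeof(int));
    for (int i = 0; i < 2 * npes; i++) {
        sendbuf[i] = myrank * 100 + i;               /* arbitrary fill */
    }
    /* Block j of sendbuf goes to process j; it lands in block myrank of
       that process's recvbuf */
    MPI_Alltoall(sendbuf, 2, MPI_INT, recvbuf, 2, MPI_INT, MPI_COMM_WORLD);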

Groups and Communicators: In many parallel algorithms, communication operations need to be restricted to certain subsets of processes. MPI provides mechanisms for partitioning the group of processes that belong to a communicator into subgroups, each corresponding to a different communicator. The simplest such mechanism is:

    int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)

This operation groups processes by color and orders the ranks within each resulting group by key.
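
For example, a C sketch (added here, not from the slides) that splits MPI_COMM_WORLD into one communicator for the even ranks and one for the odd ranks, preserving the original rank order within each:

    int world_rank, sub_rank;
    MPI_Comm subcomm;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    /* color 0 = even ranks, color 1 = odd ranks; key preserves the old order */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);
    MPI_Comm_rank(subcomm, &sub_rank);
    /* Collectives on subcomm now involve only the even (or only the odd) ranks */
    MPI_Comm_free(&subcomm);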

Summary of MPI Collective Communications: A large number of collective operations are available with MPI (too many to mention here). This table summarizes some of the most useful ones:

    MPI_Gather     gather together arrays from all processes in the communicator
    MPI_Reduce     reduce (elementwise) arrays from all processes in the communicator
    MPI_Scatter    a root process sends consecutive chunks of an array to all processes
    MPI_Alltoall   block transpose
    MPI_Bcast      a root process sends the same array of data to all processes