The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs

1 The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs

http://mpi-forum.org
https://www.open-mpi.org/

Mike Bailey, mjb@cs.oregonstate.edu, Oregon State University (mpi.pptx)

Why Two URLs? 2

http://mpi-forum.org : This is the definitive reference for the MPI standard. Go here if you want to read the official specification, which, by the way, continues to evolve.

https://www.open-mpi.org/ : This consortium formed later. Open MPI is an open-source implementation of the MPI standard. If you want to start using MPI, I recommend you look here.

The Open MPI Consortium 3

MPI: The Basic Idea 4

[Figure: two CPUs, each with its own memory, connected by a network]

Programs on different CPUs coordinate computations by passing messages between each other.

Note: Each CPU in the MPI cluster must be conditioned ahead of time by having the MPI server code installed on it. Each slave MPI CPU must also have an integer ID assigned to it (called the rank) and must be registered with the master MPI CPU.

Compiling and Running 5

    % mpicc -o program program.c
or
    % mpic++ -o program program.cpp

    % mpiexec -np 64 program
(-np sets the number of processors to use)

Setting Up and Finishing 6

    #include <mpi.h>

    int main( int argc, char *argv[ ] )
    {
        MPI_Init( &argc, &argv );
        . . .
        MPI_Finalize( );
        return 0;
    }

If you don't need to process command-line arguments, you can also call:

    MPI_Init( NULL, NULL );

MPI Follows a Single-Program-Multiple-Data Model 7

A communicator is a collection of CPUs that are capable of sending messages to each other. This requires MPI server code getting installed on all those CPUs.

Getting information about our place in the communicator:

    int numcpus;    // total # of cpus involved
    int me;         // which one I am

    MPI_Comm_size( MPI_COMM_WORLD, &numcpus );    // the size
    MPI_Comm_rank( MPI_COMM_WORLD, &me );         // my rank
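To tie the last three slides together, here is a minimal, self-contained sketch (the file name hello_mpi.c and the printed message are just for illustration) that initializes MPI, asks the communicator for its size and for this CPU's rank, prints both, and shuts down:

    #include <stdio.h>
    #include <mpi.h>

    int main( int argc, char *argv[ ] )
    {
        MPI_Init( &argc, &argv );

        int numcpus;    // total # of cpus involved
        int me;         // which one I am
        MPI_Comm_size( MPI_COMM_WORLD, &numcpus );
        MPI_Comm_rank( MPI_COMM_WORLD, &me );

        printf( "Hello from rank %d of %d\n", me, numcpus );

        MPI_Finalize( );
        return 0;
    }

Compile and run it the same way as on the previous slides, for example:

    % mpicc -o hello_mpi hello_mpi.c
    % mpiexec -np 4 hello_mpi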

A Good Place to Start: MPI Broadcasting 8

    MPI_Bcast( array, count, type, src, MPI_COMM_WORLD );

array:  address of the data to send from if you are the src node; address of the data to receive into if you are not
count:  # of elements
type:   MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE
src:    rank of the CPU doing the sending

Both the sender and the receivers need to execute MPI_Bcast; there is no separate receive function.

How Does This Work? Think Star Trek Wormholes! 9

MPI Broadcast Example 10

This is our heat transfer equation from before. Clearly, every CPU will need to know the value of k/(ρC):

    \Delta T_i = \frac{k}{\rho C} \, \frac{T_{i-1} - 2T_i + T_{i+1}}{(\Delta x)^2} \, \Delta t

    int numcpus;
    int me;
    float k_over_rho_c;    // the ROOT node will know this value, the others won't (yet)
    #define ROOT 0

    MPI_Comm_size( MPI_COMM_WORLD, &numcpus );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );

    if( me == ROOT )
    {
        << read k_over_rho_c from the data file >>
        MPI_Bcast( &k_over_rho_c, 1, MPI_FLOAT, ROOT, MPI_COMM_WORLD );    // send
    }
    else
    {
        MPI_Bcast( &k_over_rho_c, 1, MPI_FLOAT, ROOT, MPI_COMM_WORLD );    // receive
    }
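Since both branches issue the identical call, the if/else above is only there to emphasize who is sending and who is receiving; a minimal equivalent sketch, using the same variables as above, is a single unconditional line that every rank executes:

    // every rank, ROOT and non-ROOT alike, executes the same call:
    MPI_Bcast( &k_over_rho_c, 1, MPI_FLOAT, ROOT, MPI_COMM_WORLD );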

Sending Data from a Source CPU to Several Destination CPUs 11

    MPI_Send( array, numtosend, type, dst, tag, MPI_COMM_WORLD );

array:      address of the data to send from
numtosend:  # of elements
type:       MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE
dst:        rank of the CPU to send to
tag:        an integer to differentiate this transmission from any other transmission (be sure this is unique!)

Rules:
One message from a specific src to a specific dst cannot overtake a previous message from the same src to the same dst.
MPI_Send( ) blocks until the transfer is far enough along that array can be destroyed or re-used.
There are no guarantees on order from different src's.

Receiving Data in a Destination CPU from a Source CPU 12

    MPI_Recv( array, maxcanreceive, type, src, tag, MPI_COMM_WORLD, &status );

array:          address of the data to receive into
maxcanreceive:  # of elements we can receive, at most
type:           MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE
src:            rank of the CPU we are expecting to get a transmission from
tag:            an integer to differentiate what transmission we are looking for with this call (be sure this matches what the sender is sending!)
status:         of type MPI_Status

Rules:
The receiver blocks waiting for data that matches what it declares to be looking for.
One message from a specific src to a specific dst cannot overtake a previous message from the same src to the same dst.
There are no guarantees on the order from different src's.
The order from different src's could be implied in the tag.
status is of type MPI_Status; the &status can be replaced with MPI_STATUS_IGNORE.

Example 13

Remember, this same code runs on all CPUs:

    int numcpus;
    int me;

    #define MYDATA_SIZE 128
    char mydata[ MYDATA_SIZE ];
    #define ROOT 0

    MPI_Comm_size( MPI_COMM_WORLD, &numcpus );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );

    if( me == ROOT )    // the master
    {
        // not really using the tag here:
        for( int dst = 1; dst < numcpus; dst++ )    // skip ROOT itself
        {
            char *InputData = "Hello, Beavers!";
            MPI_Send( InputData, strlen(InputData)+1, MPI_CHAR, dst, 0, MPI_COMM_WORLD );
        }
    }
    else                // a slave
    {
        MPI_Recv( mydata, MYDATA_SIZE, MPI_CHAR, ROOT, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
        printf( "\"%s\" from rank # %d\n", mydata, me );
    }

How Much Data Did I Actually Receive and Where Did I Get It? 14

    MPI_Recv( array, maxcanreceive, type, src, tag, MPI_COMM_WORLD, &status );

maxcanreceive:  # of elements we can receive, at most
type:           MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE
status:         of type MPI_Status

    MPI_Status status;
    int actualcount;

    MPI_Get_count( &status, type, &actualcount );
    int src = status.MPI_SOURCE;
    int tag = status.MPI_TAG;
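Here is a hedged, self-contained sketch of how these pieces fit together (the message text, buffer size, and the choice of the sender's rank as the tag are made up for illustration): every rank except 0 sends a short string to rank 0, and rank 0 receives from MPI_ANY_SOURCE and then uses the status to find out how much data arrived and who sent it:

    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    #define MAXCHARS 128

    int main( int argc, char *argv[ ] )
    {
        MPI_Init( &argc, &argv );

        int numcpus, me;
        MPI_Comm_size( MPI_COMM_WORLD, &numcpus );
        MPI_Comm_rank( MPI_COMM_WORLD, &me );

        if( me == 0 )
        {
            char buffer[ MAXCHARS ];
            MPI_Status status;
            int actualcount;

            // accept one message from each of the other ranks, in whatever order they arrive:
            for( int i = 1; i < numcpus; i++ )
            {
                MPI_Recv( buffer, MAXCHARS, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
                          MPI_COMM_WORLD, &status );
                MPI_Get_count( &status, MPI_CHAR, &actualcount );
                printf( "Got %d chars (\"%s\") from rank %d, tag %d\n",
                        actualcount, buffer, status.MPI_SOURCE, status.MPI_TAG );
            }
        }
        else
        {
            char msg[ MAXCHARS ];
            sprintf( msg, "greetings from rank %d", me );
            MPI_Send( msg, strlen(msg)+1, MPI_CHAR, 0, me, MPI_COMM_WORLD );    // tag = my rank
        }

        MPI_Finalize( );
        return 0;
    }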

How does MPI let the Sender perform an MPI_Send( ) even if the Receivers are not ready to MPI_Recv( )? 15

[Figure: the Sender's MPI_Send( ) deposits the data into an MPI transmission buffer on the sending side; it is forwarded to an MPI transmission buffer on the receiving side, from which the Receiver's MPI_Recv( ) picks it up]

MPI_Send( ) blocks until the transfer is far enough along that array can be destroyed or re-used.

Another Example 16

You typically don't send the entire workload to each dst; you just send part of it, like this:

    #define NUMELEMENTS ?????

    int numcpus;
    int me;
    #define ROOT 0

    MPI_Comm_size( MPI_COMM_WORLD, &numcpus );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );

    int localsize = NUMELEMENTS / numcpus;    // assuming it comes out evenly
    float *mydata = new float [ localsize ];

    if( me == ROOT )    // the master
    {
        float *InputData = new float [ NUMELEMENTS ];
        << read the full input data into InputData from disk >>
        for( int dst = 1; dst < numcpus; dst++ )
        {
            MPI_Send( &InputData[ dst*localsize ], localsize, MPI_FLOAT, dst, 0, MPI_COMM_WORLD );
        }
    }
    else                // a slave
    {
        MPI_Recv( mydata, localsize, MPI_FLOAT, ROOT, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
        // do something with this subset of the data
    }

And, what's a good example of when you want to do this?

Remember This? It's Baaaaaack. 17

[Figure: a big problem divided into pieces handed to CPU #0, CPU #1, CPU #2, CPU #3]

The Compute : Communicate ratio still applies, except that it is even more important now because there is much more overhead in the Communicate portion.

This pattern of breaking a big problem up into pieces, sending them to different CPUs, computing on the pieces, and getting the results back is so common that it has its own term, Scatter/Gather, and its own MPI function calls.

MPI Scatter 18

Take a data array, break it into ~equal portions, and send one portion to each CPU:

    MPI_Scatter( snd_array, snd_count, snd_type, rcv_array, rcv_count, rcv_type, src, MPI_COMM_WORLD );

snd_count:  # of elements to send to each CPU
snd_type:   MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE
rcv_count:  # of elements to receive
rcv_type:   MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE

Both the sender and the receivers need to execute MPI_Scatter; there is no separate receive function.

MPI Scatter Example 19

    #define NUMELEMENTS ?????

    int numcpus;
    int me;
    #define ROOT 0

    MPI_Comm_size( MPI_COMM_WORLD, &numcpus );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );

    int localsize = NUMELEMENTS / numcpus;    // assuming it comes out evenly
    float *mydata = new float [ localsize ];

    if( me == ROOT )
    {
        float *InputData = new float [ NUMELEMENTS ];
        << read the full input data into InputData from disk >>
        MPI_Scatter( InputData, localsize, MPI_FLOAT, mydata, localsize, MPI_FLOAT, ROOT, MPI_COMM_WORLD );
    }
    else
    {
        MPI_Scatter( NULL, 0, MPI_FLOAT, mydata, localsize, MPI_FLOAT, ROOT, MPI_COMM_WORLD );
    }

(Note: snd_count is the number of elements sent to each CPU, so the ROOT passes localsize, not NUMELEMENTS.)

MPI Gather 20

    MPI_Gather( snd_array, snd_count, snd_type, rcv_array, rcv_count, rcv_type, dst, MPI_COMM_WORLD );

snd_count:  # of elements to send
snd_type:   MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE
rcv_count:  # of elements to receive from each CPU
rcv_type:   MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE

MPI Gather Example 21

    #define NUMELEMENTS ?????

    int numcpus;
    int me;
    #define ROOT 0

    MPI_Comm_size( MPI_COMM_WORLD, &numcpus );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );

    int localsize = NUMELEMENTS / numcpus;    // assuming it comes out evenly
    float *mydata = new float [ localsize ];

    if( me == ROOT )
    {
        float *InputData = new float [ NUMELEMENTS ];
        . . .
        // this is the actual gathering:
        MPI_Gather( mydata, localsize, MPI_FLOAT, InputData, localsize, MPI_FLOAT, ROOT, MPI_COMM_WORLD );
        << write data from InputData to disk >>
    }
    else
    {
        // these are the transmissions into the gathering:
        MPI_Gather( mydata, localsize, MPI_FLOAT, NULL, 0, MPI_FLOAT, ROOT, MPI_COMM_WORLD );
    }

(Note: rcv_count is the number of elements received from each CPU, so the ROOT passes localsize, not NUMELEMENTS.)

22

[Figure: the Scatter and Gather patterns, shown side by side]

A Full MPI Scatter / Gather Example 23

    #define NUMELEMENTS ?????

    int numcpus;
    int me;
    #define ROOT 0

    MPI_Comm_size( MPI_COMM_WORLD, &numcpus );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );

    int localsize = NUMELEMENTS / numcpus;    // assuming it comes out evenly
    float *mydata = new float [ localsize ];

    if( me == ROOT )
    {
        float *InputData = new float [ NUMELEMENTS ];
        << read the full input data into InputData[ ] from disk >>
        MPI_Scatter( InputData, localsize, MPI_FLOAT, mydata, localsize, MPI_FLOAT, ROOT, MPI_COMM_WORLD );
        << do some computing on mydata[ ] >>
        MPI_Gather( mydata, localsize, MPI_FLOAT, InputData, localsize, MPI_FLOAT, ROOT, MPI_COMM_WORLD );
        << write data from InputData[ ] to disk >>
    }
    else
    {
        MPI_Scatter( NULL, 0, MPI_FLOAT, mydata, localsize, MPI_FLOAT, ROOT, MPI_COMM_WORLD );
        << do some computing on mydata[ ] >>
        MPI_Gather( mydata, localsize, MPI_FLOAT, NULL, 0, MPI_FLOAT, ROOT, MPI_COMM_WORLD );
    }

MPI Reduction 24

    MPI_Reduce( partialresult, globalresult, count, type, operator, dst, MPI_COMM_WORLD );

partialresult:  where the partial result is stored on each CPU
globalresult:   place to store the full result on the dst CPU
count:          number of elements in the partial result
type:           MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE
operator:       MPI_MIN, MPI_MAX, MPI_SUM, MPI_PROD, MPI_MINLOC, MPI_MAXLOC
dst:            who is given the final answer

Both the senders and the receiver need to execute MPI_Reduce; there is no separate receive function.

MPI Reduction Example 25

    int numcpus;
    int me;
    float globalsum;
    #define ROOT 0

    MPI_Comm_size( MPI_COMM_WORLD, &numcpus );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );

    float partialsum = 0.;
    << compute this CPU's partialsum by, perhaps, adding up a local array >>

    MPI_Reduce( &partialsum, &globalsum, 1, MPI_FLOAT, MPI_SUM, ROOT, MPI_COMM_WORLD );
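As a runnable sketch of the placeholder above (the per-CPU array size and its contents are made up so the answer is easy to check), each CPU sums its own local array and the ROOT ends up with the grand total:

    #include <stdio.h>
    #include <mpi.h>

    #define ROOT       0
    #define LOCALSIZE  1000    // made-up per-CPU workload

    int main( int argc, char *argv[ ] )
    {
        MPI_Init( &argc, &argv );

        int numcpus, me;
        MPI_Comm_size( MPI_COMM_WORLD, &numcpus );
        MPI_Comm_rank( MPI_COMM_WORLD, &me );

        // fill a local array (here just with 1.'s):
        float localarray[ LOCALSIZE ];
        for( int i = 0; i < LOCALSIZE; i++ )
            localarray[ i ] = 1.;

        // compute this CPU's partial sum:
        float partialsum = 0.;
        for( int i = 0; i < LOCALSIZE; i++ )
            partialsum += localarray[ i ];

        float globalsum;
        MPI_Reduce( &partialsum, &globalsum, 1, MPI_FLOAT, MPI_SUM, ROOT, MPI_COMM_WORLD );

        if( me == ROOT )
            printf( "globalsum = %f (expected %d)\n", globalsum, numcpus * LOCALSIZE );

        MPI_Finalize( );
        return 0;
    }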

MPI Barriers 26

    MPI_Barrier( MPI_COMM_WORLD );

[Figure: ranks 0-5 arrive at the barrier at different times; each waits until all have arrived]

All CPUs must execute a call to MPI_Barrier( ) before any of them can move past it.

Reminder: barriers are based on count, not location.
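One common use of a barrier is timing a parallel section. Here is a minimal sketch, assuming the usual MPI_Init / MPI_Comm_rank setup has already been done and that me and ROOT are defined as on the earlier slides; MPI_Wtime( ) is the standard MPI wall-clock timer:

    MPI_Barrier( MPI_COMM_WORLD );    // make sure every CPU starts the timed region together
    double time0 = MPI_Wtime( );      // wall-clock seconds

    << do the work being timed >>

    MPI_Barrier( MPI_COMM_WORLD );    // wait until every CPU has finished
    double time1 = MPI_Wtime( );

    if( me == ROOT )
        fprintf( stderr, "Elapsed time = %10.6lf seconds\n", time1 - time0 );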

MPI Derived Types 27

Idea: In addition to types MPI_INT, MPI_FLOAT, etc., allow the creation of new MPI types so that you can transmit an array of structures.

Reason: There is significant overhead with each transmission. Better to send one entire array of structures instead of sending several arrays separately.

    MPI_Type_create_struct( count, blocklengths, displacements, types, datatype );

    struct point
    {
        int pointsize;
        float x, y, z;
    };

    MPI_Datatype point_t;

    int blocklengths[ ] = { 1, 1, 1, 1 };
    MPI_Aint displacements[ ] = { 0, 4, 8, 12 };
    MPI_Datatype types[ ] = { MPI_INT, MPI_FLOAT, MPI_FLOAT, MPI_FLOAT };

    MPI_Type_create_struct( 4, blocklengths, displacements, types, &point_t );

You can now use point_t everywhere you could have used MPI_INT, MPI_FLOAT, etc.
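A hedged sketch of actually using the new type (note: MPI requires a call to MPI_Type_commit( ) before a created datatype can be used in communication; the pts array and its size are made up for illustration, and me and ROOT are assumed to be defined as on the earlier slides):

    MPI_Type_commit( &point_t );    // must be committed before it can be used

    #define NUMPOINTS 100           // made-up array size
    struct point pts[ NUMPOINTS ];

    if( me == ROOT )
        << fill pts[ ] with real data >>

    // broadcast the whole array of structures in one transmission:
    MPI_Bcast( pts, NUMPOINTS, point_t, ROOT, MPI_COMM_WORLD );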

28 Welcome to Count the Parallelisms

Suppose We Have This Setup 29

[Figure: two nodes connected by a network; each node has its own memory, a multicore CPU, and a GPU]

30

[Figure: the same setup, with the cores of each multicore CPU highlighted]

1. OpenMP

31

[Figure: the same setup, with the multicore CPUs highlighted]

1. OpenMP
2. CPU SIMD

32

[Figure: the same setup, with the GPUs highlighted (CUDA / OpenCL)]

1. OpenMP
2. CPU SIMD
3. GPU

33

[Figure: the same setup, with the network connection between the nodes highlighted (MPI)]

1. OpenMP
2. CPU SIMD
3. GPU
4. MPI

...and they can all be functioning within the same application!