Lecture 9: MPI continued

David Bindel, 27 Sep 2011

Logistics Matrix multiply is done! (Still have to run it.) Small HW 2 will be up before lecture on Thursday, due next Tuesday. Project 2 will be posted next Tuesday. Email me if interested in Sandia recruiting. Also email me if interested in MEng projects.

Previously on Parallel Programming: we can write a lot of MPI code with the six operations we've seen: MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv... but there are sometimes better ways. Decide on communication style using simple performance models.
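
For reference, a minimal sketch that uses only those six calls (plus MPI_STATUS_IGNORE); the message contents here are arbitrary and the program assumes at least two ranks:

    #include <stdio.h>
    #include <mpi.h>

    /* Minimal sketch: the six core MPI calls. Run with at least 2 ranks. */
    int main(int argc, char** argv)
    {
        int nproc, myid, msg = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);   /* how many processes total */
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);    /* which one am I */
        if (myid == 0 && nproc > 1) {
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (myid == 1) {
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Rank 1 of %d received %d\n", nproc, msg);
        }
        MPI_Finalize();
        return 0;
    }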

Communication performance Basic info: latency and bandwidth. Simplest model: t_comm = α + βm. More realistic: distinguish CPU overhead from gap (≈ inverse bandwidth). Different networks have different parameters. Can tell a lot via a simple ping-pong experiment.
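
A sketch of the kind of ping-pong experiment meant here (the message size and trip count are arbitrary choices): ranks 0 and 1 bounce a message back and forth; repeating over several sizes and fitting a line to time vs. size gives rough estimates of α (intercept) and β (slope).

    #include <stdio.h>
    #include <mpi.h>

    /* Sketch of a ping-pong timing experiment between ranks 0 and 1. */
    enum { NTRIPS = 1000, NBYTES = 1024 };

    int main(int argc, char** argv)
    {
        char buf[NBYTES] = {0};
        int i, rank;
        double t0, t1;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < NTRIPS; ++i) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)   /* average one-way time per message */
            printf("%d bytes: %g s/msg\n", NBYTES, (t1 - t0) / (2.0 * NTRIPS));
        MPI_Finalize();
        return 0;
    }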

OpenMPI on crocus Two quad-core chips per node, five nodes. Heterogeneous network: crossbar switch between cores (?), bus between chips, Gigabit Ethernet between nodes. Default process layout (16-process example): processes 0-3 on first chip, first node; processes 4-7 on second chip, first node; processes 8-11 on first chip, second node; processes 12-15 on second chip, second node. Test ping-pong from 0 to 1, 7, and 8.

Approximate α-β parameters (on node) [Plot: time per message (microseconds) vs. message size (kilobytes); measured and modeled curves for partners 1 and 7] α₁ ≈ 1.0 × 10⁻⁶ s, β₁ ≈ 5.7 × 10⁻¹⁰ s/byte; α₂ ≈ 8.4 × 10⁻⁷ s, β₂ ≈ 6.8 × 10⁻¹⁰ s/byte

Approximate α-β parameters (cross-node) [Plot: time per message (microseconds) vs. message size (kilobytes); measured and modeled curves for partners 1, 7, and 8] α₃ ≈ 7.1 × 10⁻⁵ s, β₃ ≈ 9.7 × 10⁻⁹ s/byte

Moral Not all links are created equal! Might handle with a mixed paradigm: OpenMP on node, MPI across nodes; then you have to worry about thread-safety of MPI calls. Can handle purely within MPI. Can ignore the issue completely? For today, we'll take the last approach.

Reminder: basic send and recv MPI_Send(buf, count, datatype, dest, tag, comm); MPI_Recv(buf, count, datatype, source, tag, comm, status); MPI_Send and MPI_Recv are blocking: Send does not return until data is in the system; Recv does not return until data is ready.
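
As a side note, the status argument is what lets the receiver inspect what actually arrived; a small sketch (assumes MPI is already initialized, and the wildcard receive is just for illustration):

    /* Sketch: receive from any sender, then inspect the status. */
    int data[100], count;
    MPI_Status status;
    MPI_Recv(data, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_INT, &count);   /* number of ints actually received */
    printf("Got %d ints from rank %d with tag %d\n",
           count, status.MPI_SOURCE, status.MPI_TAG);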

Blocking and buffering [Diagram: data travels from P0 through the OS and network to the OS on P1's side, possibly via buffers on each side] Block until data is in the system, maybe in a buffer?

Blocking and buffering [Diagram: data travels from P0 through the OS and network to P1 with no intermediate copy] Alternative: don't copy; block until done.

Problem 1: Potential deadlock Send Send ... blocked ... Both processors wait to finish their send before they can receive! May not happen if there is lots of buffering on both sides.

Solution 1: Alternating order Send Recv Recv Send Could alternate who sends and who receives.
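
A sketch of what alternating order looks like in code (rank, n, and the buffers are assumed to be set up elsewhere; an even number of ranks is assumed so everyone has a partner):

    /* Sketch: even ranks send first, odd ranks receive first. */
    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;
    if (rank % 2 == 0) {
        MPI_Send(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    }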

Solution 2: Combined send/recv Sendrecv Sendrecv Common operations deserve explicit support!

Combined sendrecv MPI_Sendrecv(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status); Blocking operation, combines send and recv to avoid deadlock.
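
For example, a ring shift where every rank passes a value to its right neighbor fits in a single call per rank (a sketch; rank and nproc as before):

    /* Sketch: shift a value one step around a ring of nproc ranks. */
    int right = (rank + 1) % nproc;
    int left  = (rank + nproc - 1) % nproc;
    double val = rank, incoming;
    MPI_Sendrecv(&val,      1, MPI_DOUBLE, right, 0,   /* send to the right */
                 &incoming, 1, MPI_DOUBLE, left,  0,   /* receive from the left */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);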

Problem 2: Communication overhead [Timeline: each process alternates Sendrecv with idle waiting instead of useful computation] Partial solution: nonblocking communication

Blocking vs non-blocking communication MPI_Send and MPI_Recv are blocking Send does not return until data is in system Recv does not return until data is ready Cons: possible deadlock, time wasted waiting Why blocking? Overwrite buffer during send = evil! Read buffer before data ready = evil! Alternative: nonblocking communication Split into distinct initiation/completion phases Initiate send/recv and promise not to touch buffer Check later for operation completion

Overlap communication and computation [Timeline: each process starts its send and recv, computes without touching the message buffers, then ends the send and recv]

Nonblocking operations Initiate message: MPI_Isend(start, count, datatype, dest, tag, comm, request); MPI_Irecv(start, count, datatype, source, tag, comm, request); Wait for message completion: MPI_Wait(request, status); Test for message completion: MPI_Test(request, flag, status);
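
A sketch of the initiate/compute/complete pattern (partner, n, the buffers, and do_local_work are placeholders):

    /* Sketch: post the receive and send, compute, then wait. */
    MPI_Request send_req, recv_req;
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &recv_req);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &send_req);

    do_local_work();   /* hypothetical compute step; must not touch either buffer */

    MPI_Wait(&recv_req, MPI_STATUS_IGNORE);   /* recvbuf is now safe to read */
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);   /* sendbuf is now safe to overwrite */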

Multiple outstanding requests Sometimes useful to have multiple outstanding messages: MPI_Waitall(count, requests, statuses); MPI_Waitany(count, requests, index, status); MPI_Waitsome(incount, requests, outcount, indices, statuses); Multiple versions of test as well.
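
For instance, a 1D exchange with both neighbors can post four requests and complete them together (a sketch; left, right, and the ghost/edge buffers are assumptions):

    /* Sketch: exchange boundary data with both neighbors, then wait on all. */
    MPI_Request reqs[4];
    MPI_Irecv(ghost_left,  n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(ghost_right, n, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(edge_left,   n, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(edge_right,  n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);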

Other send/recv variants Other variants of MPI_Send: MPI_Ssend (synchronous) does not complete until the receive has begun; MPI_Bsend (buffered) user provides a buffer (via MPI_Buffer_attach); MPI_Rsend (ready) user guarantees the receive has already been posted. Can combine modes (e.g. MPI_Issend). MPI_Recv receives anything.
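
A sketch of the buffered mode, since it is the one that needs extra setup (data, n, and dest are assumptions; needs <stdlib.h> for malloc):

    /* Sketch: attach a user buffer, do a buffered send, detach when done. */
    int bufsize = n * sizeof(double) + MPI_BSEND_OVERHEAD;
    char* bsendbuf = malloc(bufsize);
    MPI_Buffer_attach(bsendbuf, bufsize);

    MPI_Bsend(data, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);  /* returns once copied */

    MPI_Buffer_detach(&bsendbuf, &bufsize);   /* waits for buffered sends to finish */
    free(bsendbuf);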

Another approach Send/recv is one-to-one communication. An alternative is one-to-many (and vice versa): Broadcast to distribute data from one process; Reduce to combine data from all processors. Operations are called by all processes in the communicator.

Broadcast and reduce MPI_Bcast(buffer, count, datatype, root, comm); MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm); buffer is copied from root to others; recvbuf receives the result only at root; op ∈ { MPI_MAX, MPI_SUM, ... }

Example: basic Monte Carlo

    #include <stdio.h>
    #include <mpi.h>

    void run_mc(int myid, int nproc, int ntrials);  /* defined on the next slide */

    int main(int argc, char** argv) {
        int nproc, myid, ntrials;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        if (myid == 0) {
            printf("Trials per CPU:\n");
            scanf("%d", &ntrials);
        }
        MPI_Bcast(&ntrials, 1, MPI_INT, 0, MPI_COMM_WORLD);
        run_mc(myid, nproc, ntrials);
        MPI_Finalize();
        return 0;
    }

Example: basic Monte Carlo Let sums[0] = Σᵢ Xᵢ and sums[1] = Σᵢ Xᵢ².

    #include <math.h>   /* for sqrt */

    void run_mc(int myid, int nproc, int ntrials) {
        double sums[2]    = {0, 0};
        double my_sums[2] = {0, 0};
        /* ... run ntrials local experiments, accumulating into my_sums ... */
        MPI_Reduce(my_sums, sums, 2, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0) {
            int N = nproc*ntrials;
            double EX  = sums[0]/N;
            double EX2 = sums[1]/N;
            printf("Mean: %g; err: %g\n", EX, sqrt((EX2 - EX*EX)/N));
        }
    }

Collective operations Involve all processes in communicator Basic classes: Synchronization (e.g. barrier) Data movement (e.g. broadcast) Computation (e.g. reduce)

Barrier MPI_Barrier(comm); Not much more to say. Not needed that often.
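
One place it does show up is timing: synchronize so every rank starts the clock together (a sketch; do_work is a placeholder):

    /* Sketch: bracket a computation with barriers to time it consistently. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    do_work();                                /* hypothetical computation */
    MPI_Barrier(MPI_COMM_WORLD);
    if (myid == 0)
        printf("Elapsed: %g s\n", MPI_Wtime() - t0);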

Broadcast [Diagram: before, only P0 holds A; after Broadcast, P0, P1, P2, and P3 all hold A]

Scatter/gather [Diagram: P0 holds A, B, C, D; Scatter leaves A on P0, B on P1, C on P2, D on P3; Gather is the inverse, collecting the pieces back on P0]
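
A sketch of the matching calls with root 0 (CHUNK, alldata, allresults, and process are placeholders; counts are per process):

    /* Sketch: root scatters one chunk to each rank; results are gathered back. */
    double chunk[CHUNK], result[CHUNK];
    MPI_Scatter(alldata, CHUNK, MPI_DOUBLE,            /* send args matter only at root */
                chunk,   CHUNK, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    process(chunk, result, CHUNK);                     /* hypothetical local work */
    MPI_Gather(result,     CHUNK, MPI_DOUBLE,
               allresults, CHUNK, MPI_DOUBLE,          /* recv args matter only at root */
               0, MPI_COMM_WORLD);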

Allgather [Diagram: P0, P1, P2, P3 hold A, B, C, D respectively; after Allgather, every process holds A, B, C, D]

Alltoall [Diagram: P0 holds A0 A1 A2 A3, P1 holds B0 B1 B2 B3, P2 holds C0 C1 C2 C3, P3 holds D0 D1 D2 D3; after Alltoall, Pj holds Aj Bj Cj Dj (a transpose across processes)]

Reduce [Diagram: P0, P1, P2, P3 hold A, B, C, D; after Reduce, only the root P0 holds the combined result ABCD]

Scan [Diagram: P0, P1, P2, P3 hold A, B, C, D; after Scan, P0 holds A, P1 holds AB, P2 holds ABC, P3 holds ABCD]
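
A sketch of an inclusive prefix sum with MPI_Scan (each rank contributes one value; myid as before):

    /* Sketch: rank i receives the sum of contributions from ranks 0..i. */
    int myval = myid + 1, prefix;
    MPI_Scan(&myval, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    /* With 4 ranks: prefix = 1, 3, 6, 10 on ranks 0, 1, 2, 3. */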

The kitchen sink In addition to the above, there are vector variants (v suffix), more All variants (Allreduce), Reduce_scatter, ... MPI-2 adds one-sided communication (put/get). MPI is not a small library! But a small number of calls goes a long way: Init/Finalize; Comm_rank, Comm_size; Send/Recv variants and Wait; Allreduce, Allgather, Bcast.