Programming with MPI


Programming with MPI: One-sided Communication
Nick Maclaren
nmm1@cam.ac.uk
October 2010

What Is It?

One-sided communication corresponds to what is often called RDMA, that is, Remote Direct Memory Access (it is called RMA in MPI). It is much loved by Cray and the DoD/ASCI people. One process accesses the data of another, which is potentially more efficient on SMP systems. It was added in MPI 2, but is not often used. I will explain why I don't like RDMA, then describe how it can be used semi-safely.

RDMA Models

Active RDMA is batching up sends and receives, a bit like non-blocking transfers completed by MPI_Waitall; it is not too hard to use correctly, but it needs discipline. Passive RDMA uses an uninvolved process's memory; it is much harder to use and less portable. PGAS languages effectively use this model. Lastly, one can use true virtual shared memory, using all processes' memory as if it were local; MPI does not support this, for very good reasons.

Active (Fenced) Communication

[Diagram: three processes; between two collective fences, each process mixes local accesses with Put and Get calls on the others' windows. The fences delimit the access epoch.]

Passive Communication

[Diagram: three processes performing Puts and Gets on one another's windows, interleaved with local accesses; synchronisation is by handshakes, some of them implicit, with no collective fence.]

Problems with RDMA (1)

RDMA is often said to be easier to use correctly. Twaddle! How does the other process know its data is in use? Get it wrong, and almost unfindable chaos ensues. Adding handshaking often makes it pointless, and most specifications don't say how to handle that. MPI does, which makes it fairly complicated.

Problems with RDMA (2)

Another problem is that it handicaps analysis tools. With two-sided communication, unmatched transfers can be diagnosed, but there is no concept of matching for one-sided. Its semantics are similar to SMP threading: lots of lovely potential race conditions, and no realistic tools to detect them. Similar remarks apply to non-trivial tuning. Those problems are insoluble, provably so, and most people's experiences are very negative.

Problems with RDMA (3)

The next problem is implementation and progress. Passive RDMA needs a remote agent acting on the target's memory, and a transfer may hang until the far process reaches an MPI call; with passive use, that may never happen. This can also occur with active use, but it is less likely. Deadlock may happen even in correct code; MPI forbids that for two-sided and (mostly) for active one-sided. There are some subtle problems even on SMP systems.

Problems with RDMA (4)

The last problem is performance complications. On commodity clusters, it may well be slower, because the remote agent may interfere with the MPI process. It is probably faster on specialist HPC clusters: Cray etc. have hardware and kernel RDMA that transfers data without any action by the MPI process. On SMP systems, it is very hard to guess: it may well have lower communication overhead, but it may cause serious cache and TLB problems.

Most Likely Benefit

The most likely benefit is when one logical transfer is many actual ones, i.e. you need to transfer lots of objects together; it is essentially an alternative to packing them up into one array. This applies on an SMP system or a specialist HPC cluster. It does NOT gain you anything using Ethernet, nor InfiniBand with commodity software: TCP/IP and Ethernet need a kernel thread and, currently, so does the OpenIB InfiniBand driver.

An SMP System Myth

The following is NOT true, in general: "one-sided transfers use fewer memory copies". It is true in theory, but generally not in practice; it is probably true only for memory allocated with MPI_Alloc_mem. On all Unix-like systems, including Microsoft's, process A cannot access process B's memory; it can do so only via a shared memory segment, and almost all data areas aren't in one of them, so two memory copies are needed for it, as well.

Recommendation

Stick to simple forms of active communication; the only one taught here is collective-like transfers, though group pairwise transfers are also described, roughly. Be extremely cautious about deadlock: use only methods that don't rely on progress details. Scrupulously adhere to MPI's restrictions, which are designed to maximise reliable implementability. That is, naturally, all that this course teaches.

Preliminaries

It is recommended to create an info object, for efficiency; most of its uses are for other MPI 2 facilities. Do this on each process, but it is not collective. It asserts that you won't use passive communication; curiously, there is no option to specify a read-only window. There is also a relevant function, MPI_Alloc_mem, which may help with performance on some systems, but it is NOT recommended for use in Fortran. It's fine in MPI 3.0, which uses Fortran 2003.

Fortran Example

    INTEGER :: info, error

    CALL MPI_Info_create ( info, error )
    CALL MPI_Info_set ( info, "no_locks", "true", error )
    ! Use the info object
    CALL MPI_Info_free ( info, error )

C Example

    MPI_Info info;
    int error;

    error = MPI_Info_create ( &info );
    error = MPI_Info_set ( info, "no_locks", "true" );
    /* Use the info object */
    error = MPI_Info_free ( &info );

Windows

A window is an array used for one-sided transfers. You must register and release such arrays collectively. The arguments are the address, the size in bytes and the element size; all of them can be different on each process, and the same remark applies to the element type of the array. Offsets are specified in units of the element size, like array indices into the target window. A size of zero means no local memory is accessible.

Alignment

I strongly advise aligning the window correctly for its datatype. This is only a performance issue, but it may be significant, and some systems may benefit from stronger alignment; read the implementation notes to find that out. The element size need not match the element type, but the simplest use is to make it match. It can be 1 if using byte offsets; in that case, it's your job to get it right.
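One way to get suitably aligned memory (and possibly better RDMA performance, as mentioned under Preliminaries) is MPI_Alloc_mem. The following is a minimal C sketch of my own, not from the course handouts; error checking is omitted and the window size of 1000 doubles is arbitrary.

    #include <mpi.h>

    double  *array;
    MPI_Win  window;
    MPI_Aint win_size = 1000 * sizeof ( double );

    /* Ask MPI for memory that it can register efficiently for RDMA */
    MPI_Alloc_mem ( win_size, MPI_INFO_NULL, &array );

    /* Use sizeof(double) as the element (displacement) size */
    MPI_Win_create ( array, win_size, sizeof ( double ),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &window );

    /* ... access epochs go here ... */

    MPI_Win_free ( &window );
    MPI_Free_mem ( array );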

Fortran Example

    REAL ( KIND = KIND ( 0.0D0 ) ) :: array ( 1000 )
    INTEGER :: info, window, size, error
    INTEGER ( KIND = MPI_ADDRESS_KIND ) :: win_size

    ! Create the info object as described above
    CALL MPI_Sizeof ( array, size, error )
    win_size = 1000 * size
    CALL MPI_Win_create ( array, win_size, size, info,   &
                          MPI_COMM_WORLD, window, error )
    CALL MPI_Win_free ( window, error )

C Example

    double array [ 1000 ];
    MPI_Info info;
    MPI_Win window;
    int error;
    MPI_Aint win_size;

    /* Create the info object as described above */
    win_size = 1000 * sizeof ( double );
    error = MPI_Win_create ( array, win_size, sizeof ( double ),
                             info, MPI_COMM_WORLD, &window );
    error = MPI_Win_free ( &window );

Horrible Gotcha

The window must be an ordinary data array. This is not an MPI standard issue, and is not stated there; it is a hardware, system and compiler one, and it applies to other RDMA and asynchronous use, too. Do not use Fortran PARAMETER arrays, do not use C/C++ static const arrays, nor anything any library returns that might be one of those, nor anything else that might be exceptional. At best, the performance might be dire.

Fencing

Fencing is exactly like MPI_Barrier, but on a window set. Assertions are described shortly; ignore them for now.

Fortran example:
    CALL MPI_Win_fence ( assertions, window, error )

C example:
    error = MPI_Win_fence ( assertions, window )

C++ example:
    window.Fence ( assertions )

Use of Windows (1)

The rules for use of windows apply much more generally, but they are easiest to describe in terms of fencing. Fences divide time into a sequence of access epochs. Each window should be considered separately, so consider one epoch on one window. There are no restrictions if the window is not accessed remotely in the epoch; that includes local writes from it using RDMA. This is unlike the rules for MPI_Send.

Use of Windows (2)

A window is exposed if it may be accessed remotely. If the window is exposed, you mustn't update any part of the window locally; this seems unnecessary, but there are good reasons for it, and you can use RDMA to the local process to bypass it. Any location may be updated only once, and not at all if it is read locally or remotely: the standard write-once-or-read-many model. Accumulations are an exception; see later.
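As an illustration of these rules, here is a minimal C sketch of my own (not from the course handouts) showing one fenced epoch; it assumes the variables rank, src and a window 'window' created over each process's local array buf. Within the epoch, only rank 0 issues RDMA calls and nobody touches the exposed window locally.

    /* All processes have created 'window' over their local array buf. */
    MPI_Win_fence ( 0, window );          /* start of the access epoch */

    if ( rank == 0 )
        /* write 100 doubles into rank 1's window, starting at offset 0 */
        MPI_Put ( src, 100, MPI_DOUBLE, 1, 0, 100, MPI_DOUBLE, window );

    /* no local reads or writes of buf here, on any process */

    MPI_Win_fence ( 0, window );          /* end of epoch: transfer complete */

    /* now every process may read and update its own buf again */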

Optimisation Issues

One-sided communication bends the language standards in several subtle ways and can cause accesses to get out of order. That is very rare for C and C++, and truly evil if it does happen; almost all such problems are user error. Fortran is more optimisable, and it can happen there. The window should have the ASYNCHRONOUS attribute, and so should any dummy argument that it is passed to during any access epoch; it is simplest not to pass it during an access epoch.

Assertions (1)

An assertion is an integer passed to synchronisation calls; assertions can be combined using logical OR (IOR in Fortran, | in C). They are purely optional, but may help with performance; if you get them wrong, the behaviour is undefined. Pass the argument as 0 if you are unsure. They fall into two classes, local ones and collective ones; the latter are supported only by MPI_Win_fence. They apply to an epoch between synchronisations, which can be either the preceding or the succeeding epoch.

Assertions (2)

Local assertions:
MPI_MODE_NOSTORE (about the preceding epoch): the window was not updated in any way.
MPI_MODE_NOPUT (about the succeeding epoch): the window will not be updated by RDMA.

Collective assertions:
MPI_MODE_NOPRECEDE (about the preceding epoch): no RDMA calls were made by this process.
MPI_MODE_NOSUCCEED (about the succeeding epoch): no RDMA calls will be made by this process.

Assertions (3)

If alternating computation and communication, do the following collectively (the same on all processes):
... do some computation ...
Fence with MPI_MODE_NOPRECEDE
... do some RDMA communication ...
Fence with MPI_MODE_NOSUCCEED combined with MPI_MODE_NOPUT
... do some computation ...
And use MPI_MODE_NOSTORE when appropriate; it does not need to be the same on all processes.
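A minimal C sketch of that pattern, as my own illustration: it assumes a window 'window' already created over the local work array, and the names nsteps, compute_locally and exchange_with_puts_and_gets are placeholders for your own code.

    for ( int step = 0; step < nsteps; ++step ) {
        compute_locally ( );                   /* no RDMA calls in here */

        /* open the communication epoch: no RDMA preceded this fence */
        MPI_Win_fence ( MPI_MODE_NOPRECEDE, window );

        exchange_with_puts_and_gets ( window );   /* MPI_Put / MPI_Get */

        /* close it: no RDMA follows this fence, and nothing will
           store into the local window during the computation phase */
        MPI_Win_fence ( MPI_MODE_NOSUCCEED | MPI_MODE_NOPUT, window );
    }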

Transfers

MPI_Put and MPI_Get have identical syntax; they effectively start a remote MPI_Recv or MPI_Send, respectively. Both sets of arguments are provided by the caller, with the remote ones interpreted in the context of the target window. It is strongly recommended to match the types and counts, and the remote datatype must match the remote element type. They are like non-blocking transfers, so don't reuse the local buffers; the next synchronisation call completes the transfer. There is no guarantee that it is implemented like that.

Fortran Example

    REAL :: array_1 ( 1000 ), array_2 ( 1000 )
    INTEGER :: to = 3, from = 2, window, error
    INTEGER ( KIND = MPI_ADDRESS_KIND ) :: offset_1, offset_2

    CALL MPI_Put ( array_1, 1000, MPI_REAL, to, offset_1,   &
                   1000, MPI_REAL, window, error )
    CALL MPI_Get ( array_2, 1000, MPI_REAL, from, offset_2, &
                   1000, MPI_REAL, window, error )

C Example

    double array_1 [ 1000 ], array_2 [ 1000 ];
    MPI_Win window;
    MPI_Aint offset_1, offset_2;
    int to = 3, from = 2, error;

    error = MPI_Put ( array_1, 1000, MPI_DOUBLE, to, offset_1,
                      1000, MPI_DOUBLE, window );
    error = MPI_Get ( array_2, 1000, MPI_DOUBLE, from, offset_2,
                      1000, MPI_DOUBLE, window );

Reminder

The target address is not specified directly: window describes the remote array to use, and offset is in units of the remote element size, counted from the start of the remote window.
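In other words, the byte address touched on the target is determined by what the target passed to MPI_Win_create plus the offset. The following tiny C sketch is purely conceptual, my own illustration; MPI does this internally and you never form the remote address yourself.

    /* base and disp_unit are what the target gave to MPI_Win_create;
       offset is the target_disp argument of the Put or Get.          */
    static char *target_address ( char *base, int disp_unit, MPI_Aint offset )
    {
        return base + offset * disp_unit;
    }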

Accumulation (1)

Accumulation is a point-to-point reduction with a remote result, and it is done in an undefined order. The syntax and use are almost identical to MPI_Put, with the reduction operation added before the window argument. It takes exactly the same operations as MPI_Reduce, but only the predefined operations, not user-defined ones. There is one extra predefined operation, MPI_REPLACE, which simply stores the value in the target location.

Accumulation (2)

You mustn't update or access the target location in any other way, but you can use multiple separate locations, and each can use the same or a different operation. You cannot use it for atomic access: you can only accumulate until the next synchronisation. That is partly possible, but it needs features not taught here. Use only one operation for one location; MPI doesn't require that, but your sanity does.

Syntax Comparison

    CALL MPI_Put ( array, 1000, MPI_REAL, to, offset,        &
                   1000, MPI_REAL, window, error )

    CALL MPI_Accumulate ( array, 1000, MPI_REAL, to, offset, &
                          1000, MPI_REAL, MPI_SUM, window, error )

Note that the only change is the addition of MPI_SUM. C and C++ have identical changes.
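For completeness, here is the same comparison as a minimal C sketch of my own, reusing the declarations from the earlier C transfer example (array_1, to, offset_1, window, error):

    /* Put: overwrite 1000 doubles in the target window */
    error = MPI_Put        ( array_1, 1000, MPI_DOUBLE, to, offset_1,
                             1000, MPI_DOUBLE, window );

    /* Accumulate: add them element-wise into the target window instead */
    error = MPI_Accumulate ( array_1, 1000, MPI_DOUBLE, to, offset_1,
                             1000, MPI_DOUBLE, MPI_SUM, window );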

Writing Collectives

The above is enough to write collectives using RDMA; if they are clean and simple, you should have no trouble. MPI specifies that they never deadlock but, for sanity: do not overlap the use of window sets (like communicators, serialise overlapping use), and do not overlap them with other types of transfer (serialise collectives, point-to-point and one-sided).
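As an illustration of the kind of collective meant here, this is a minimal C sketch of my own (not from the course) of a broadcast-like operation over a window: root writes its data into every other process's window between two fences. The function name rdma_bcast and its parameters are my own invention; buf is the local array each process exposed as 'window'.

    void rdma_bcast ( double *buf, int count, int root,
                      int rank, int nprocs, MPI_Win window )
    {
        MPI_Win_fence ( 0, window );              /* open the epoch */
        if ( rank == root )
            for ( int p = 0; p < nprocs; ++p )
                if ( p != root )
                    /* each target location is written exactly once */
                    MPI_Put ( buf, count, MPI_DOUBLE, p, 0,
                              count, MPI_DOUBLE, window );
        MPI_Win_fence ( 0, window );              /* all transfers done */
    }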

Group Pairwise RDMA (1)

This is a lot more complicated, so much so that this course doesn't cover the details. It is also a lot more error-prone: make a mistake, and your program may deadlock, or it may cause data corruption with no diagnostic; worse, these may happen only probabilistically. I do not recommend using this, except for experts. The following is a summary of how to use it.

Group Pairwise RDMA (2)

MPI_Win_post ( group, assertions, window )
This registers the window for RDMA accesses; only the processes in group may access it. MPI specifies that it does not block.

MPI_Win_wait ( window )
This blocks until all RDMA has completed, that is, until all the accessing processes have called MPI_Win_complete. It cancels the registration for RDMA accesses.

Group Pairwise RDMA (3)

MPI_Win_start ( group, assertions, window )
This registers the window for RDMA calls; you may access only the data of processes in group. It may block, but is not required to, i.e. until all the processes have called MPI_Win_post.

MPI_Win_complete ( window )
This cancels the registration for RDMA calls. It may block, but is not required to; the circumstances are too complicated to describe here.

Group Pairwise RDMA (4)

You will need some of the group calls; you need not use them collectively for this. See the lecture on Communicators etc. for them.

MPI_Comm_group ( comm, group )
This obtains the group for a communicator.
MPI_Group_incl ( group, count, ranks, newgroup )
This creates a subset group from some ranks.
MPI_Group_free ( group )
This frees a group after it has been used.
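Purely as an illustration of how those calls fit together, here is a minimal C sketch of my own, assuming just two processes where rank 1 exposes its window to rank 0; rank, buf and window are assumed to be set up already.

    MPI_Group world_group, peer_group;
    int       peer;

    MPI_Comm_group ( MPI_COMM_WORLD, &world_group );

    if ( rank == 1 ) {               /* target: exposes its window        */
        peer = 0;
        MPI_Group_incl ( world_group, 1, &peer, &peer_group );
        MPI_Win_post ( peer_group, 0, window );  /* allow rank 0 access   */
        MPI_Win_wait ( window );                 /* block until complete  */
    } else {                         /* origin: accesses rank 1's window  */
        peer = 1;
        MPI_Group_incl ( world_group, 1, &peer, &peer_group );
        MPI_Win_start ( peer_group, 0, window ); /* may wait for the post */
        MPI_Put ( buf, 100, MPI_DOUBLE, 1, 0, 100, MPI_DOUBLE, window );
        MPI_Win_complete ( window );             /* completes the put     */
    }

    MPI_Group_free ( &peer_group );
    MPI_Group_free ( &world_group );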

Group Pairwise RDMA (5)

That's essentially all of the facilities. You can also test for completion using MPI_Win_test, but you can't touch the window until it completes. If you want to use group pairwise RDMA: read the MPI standard.

Passive RDMA

Did you think that the group pairwise form is bad? Passive RDMA is simpler, but much trickier. Don't go there.

Debugging and Tuning

You're on your own.