Programming with MPI


More on Datatypes and Collectives
Nick Maclaren, nmm1@cam.ac.uk
May 2008

Less Basic Collective Use
A few important facilities we haven't covered
Less commonly used, but fairly often needed
In particular, one of them can help a lot with I/O
And then how to use collectives efficiently
Plus one potentially useful minor feature

Fortran Precisions (1)
Fortran 90 allows selectable precisions:
KIND=SELECTED_INT_KIND(digits)
KIND=SELECTED_REAL_KIND(precision[,range])
You can create an MPI datatype to match these
Then you can use it just like a built-in datatype
Surprisingly, it is a predefined datatype
Do NOT commit or free it
[Don't worry if that makes no sense to you]

Fortran Precisions (2)

INTEGER ( KIND = SELECTED_INT_KIND ( 15 ) ), &
    DIMENSION ( 100 ) :: array
INTEGER :: root, integertype, error

CALL MPI_Type_create_f90_integer ( 15, integertype, error )
CALL MPI_Bcast ( array, 100, integertype, root, &
    MPI_COMM_WORLD, error )

Fortran Precisions (3)
REAL and COMPLEX are very similar:

REAL ( KIND = SELECTED_REAL_KIND ( 15, 300 ) ), &
    DIMENSION ( 100 ) :: array
CALL MPI_Type_create_f90_real ( 15, 300, realtype, error )

COMPLEX ( KIND = SELECTED_REAL_KIND ( 15, 300 ) ), &
    DIMENSION ( 100 ) :: array
CALL MPI_Type_create_f90_complex ( 15, 300, complextype, error )

Searching (1)
You can use global reductions for searching
Bad news: it needs MPI's derived datatypes
Good news: there are some useful built-in ones
We do a reduction with a composite datatype: (<value>, <index>)
As with sum, we build up from a binary operator
Two built-in operators: MPI_MINLOC and MPI_MAXLOC
We shall use finding the minimum as an example

Searching (2)
The operator combines (<value_1>, <index_1>) and (<value_2>, <index_2>)
to produce a result (<value_x>, <index_x>):
If <value_1> < <value_2> then
    <value_x> = <value_1> and <index_x> = <index_1>
Else
    <value_x> = <value_2> and <index_x> = <index_2>
Equality is a bit cleverer; it rarely matters
If it does to you, see the MPI standard

Searching (3)
Operator MPI_MINLOC does precisely that
Operator MPI_MAXLOC searches for the maximum
You create the (<value>, <index>) pairs first
The <index> can be anything, whatever is useful
An <index> should usually be globally unique
I.e. not just the index into a local array
E.g. combine process number and local index
So how do we set up the data?

Fortran Searching (1)
Fortran 77 did not have derived types
The datatypes are arrays of length two
Recommended datatypes are:
MPI_2INTEGER and MPI_2DOUBLE_PRECISION
DOUBLE PRECISION can hold any INTEGER on any current system, when using MPI
I don't recommend MPI_2REAL, except on Cray

Fortran Searching (2)

INTEGER :: sendbuf ( 2, 100 ), recvbuf ( 2, 100 ), myrank, error, i
INTEGER, PARAMETER :: root = 3

DO i = 1, 100
    sendbuf ( 1, i ) = <value>
    sendbuf ( 2, i ) = 1000 * myrank + i
END DO
CALL MPI_Reduce ( sendbuf, recvbuf, 100, MPI_2INTEGER, &
    MPI_MINLOC, root, MPI_COMM_WORLD, error )

C Searching
The datatypes are struct {<value type>; int;}
C structure layout is a can of worms
Recommended datatypes are:
MPI_2INT, MPI_LONG_INT, MPI_DOUBLE_INT
for <value type> of int, long and double
That will usually work, but not always
Use MPI_LONG_DOUBLE_INT for long double
Don't use MPI_FLOAT_INT or MPI_SHORT_INT, for C reasons you don't want to know!

C Example

struct { double value ; int index ; } sendbuf [ 100 ], recvbuf [ 100 ] ;
int root = 3, myrank, error, i ;

for ( i = 0 ; i < 100 ; ++i ) {
    sendbuf [ i ].value = <value> ;
    sendbuf [ i ].index = 1000 * myrank + i ;
}
error = MPI_Reduce ( sendbuf, recvbuf, 100, MPI_DOUBLE_INT,
    MPI_MINLOC, root, MPI_COMM_WORLD ) ;

Data Distribution (1)
It can be inconvenient to make all counts the same
E.g. with a 100 x 100 matrix on 16 CPUs
One approach is to pad the short vectors
That is usually more efficient than it looks
There are also extended MPI collectives for that
Obviously, their interface is more complicated:
MPI_Gatherv, MPI_Scatterv, MPI_Allgatherv, MPI_Alltoallv
Use whichever approach is easiest for you

Data Distribution (2)
A vector of counts instead of a single count
One count for each process
Provided only where there are multiple buffers:
Receive counts for MPI_Gatherv and MPI_Allgatherv
Send counts for MPI_Scatterv
Both counts for MPI_Alltoallv
Used only on the root for MPI_Gatherv and MPI_Scatterv
But, for MPI_Allgatherv and MPI_Alltoallv, the count vectors must match on all processes
I recommend always making them match, for sanity

Gatherv
[Diagram: processes 0, 1 and 2 hold A0, B0 and C0 respectively; after MPI_Gatherv the root (process 0) holds A0 B0 C0]

Allgatherv
[Diagram: processes 0, 1 and 2 hold A0, B0 and C0 respectively; after MPI_Allgatherv every process holds A0 B0 C0]

Data Distribution (3)
The scalar counts may all be different, of course
They must match for each pairwise send and receive
E.g. for MPI_Gatherv: the send count on process N matches the Nth receive count element on the root
MPI_Scatterv is just the converse of MPI_Gatherv
For MPI_Allgatherv: the send count on process N matches the Nth receive count element on all processes

Data Distribution (4)
The most complicated one is MPI_Alltoallv
It isn't hard to use, if you keep a clear head
Use pencil and paper if you get confused
Consider processes M and N: the Nth send count on process M matches the Mth receive count on process N
As said earlier, think of it as a matrix transpose, with the data vectors as its elements

Alltoallv
[Diagram: before, process 0 holds A0 A1 A2, process 1 holds B0 B1 B2 and process 2 holds C0 C1 C2; after MPI_Alltoallv, process 0 holds A0 B0 C0, process 1 holds A1 B1 C1 and process 2 holds A2 B2 C2]
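A minimal C sketch of that count pairing, not from the course itself: it assumes a made-up rule that process M sends 2*M+N+1 integers to process N, chosen only to make the transpose relationship visible, and all variable names are illustrative.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int myrank, nprocs, i, stotal, rtotal;
    int *sendcounts, *sdispls, *recvcounts, *rdispls, *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    sendcounts = malloc(nprocs * sizeof(int));
    sdispls    = malloc(nprocs * sizeof(int));
    recvcounts = malloc(nprocs * sizeof(int));
    rdispls    = malloc(nprocs * sizeof(int));

    /* What I send to process i ... */
    for (i = 0, stotal = 0; i < nprocs; ++i) {
        sendcounts[i] = 2 * myrank + i + 1;
        sdispls[i] = stotal;                 /* contiguous data, first offset zero */
        stotal += sendcounts[i];
    }
    /* ... must match what process i expects to receive from me */
    for (i = 0, rtotal = 0; i < nprocs; ++i) {
        recvcounts[i] = 2 * i + myrank + 1;  /* the "matrix transpose" of the send counts */
        rdispls[i] = rtotal;
        rtotal += recvcounts[i];
    }
    sendbuf = malloc(stotal * sizeof(int));
    recvbuf = malloc(rtotal * sizeof(int));
    for (i = 0; i < stotal; ++i) sendbuf[i] = myrank;

    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    free(sendcounts); free(sdispls); free(recvcounts); free(rdispls);
    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}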

Data Distribution (5)
Where they have a vector of counts, they also have a vector of offsets
The offset of the data of each process
Not the offset of the basic elements
This allows for discontiguous buffers
Each pairwise transfer must still be contiguous
Normally, the first offset will be zero
But here is a picture of when it isn't

Multiple Transfer Buffers
[Diagram: one buffer holding the data for processes 0 to 3 at non-zero offsets]
Counts  = ( 4 2 3 7 )
Offsets = ( 2 8 14 18 )
[So process 0's data occupies elements 2-5, process 1's elements 8-9, process 2's elements 14-16 and process 3's elements 18-24]

Data Distribution (6)
Unlike the counts, the offsets are purely local
They need not match on all processes
Even in the case of MPI_Alltoallv, the offset vectors needn't match in any way
Each one is used just as a mapping for the local layout of its associated buffer

Data Distribution (7)
Keep your use of these collectives simple
MPI won't get confused, but you and I will
And any overlap is undefined behaviour
I will show a picture of a fairly general MPI_Alltoallv
Just to see what can be done, not to recommend it
Then simple examples of using MPI_Gatherv
Start with this (or MPI_Scatterv) when testing

Alltoallv
[Diagram: a fairly general MPI_Alltoallv among processes 0, 1 and 2, with varying counts and offsets, along the lines of the earlier picture]

Gatherv (1)
Fortran example:

INTEGER, DIMENSION ( 0 : * ) :: counts
REAL ( KIND = KIND ( 0.0D0 ) ) :: sendbuf ( 100 ), recvbuf ( 100, 30 )
INTEGER :: myrank, error, i
INTEGER, PARAMETER :: root = 3, &
    offsets ( * ) = (/ ( 100 * i, i = 0, 30 - 1 ) /)

CALL MPI_Gatherv ( sendbuf, counts ( myrank ), &
    MPI_DOUBLE_PRECISION, recvbuf, counts, offsets, &
    MPI_DOUBLE_PRECISION, root, MPI_COMM_WORLD, error )

Gatherv (2)
C example:

double sendbuf [ 100 ], recvbuf [ 30 ] [ 100 ] ;
int counts [ 30 ] ;
int root = 3, offsets [ 30 ], myrank, error, i ;

for ( i = 0 ; i < 30 ; ++i )
    offsets [ i ] = 100 * i ;
error = MPI_Gatherv ( sendbuf, counts [ myrank ], MPI_DOUBLE,
    recvbuf, counts, offsets, MPI_DOUBLE, root, MPI_COMM_WORLD ) ;

Practical Point
That's easy when the counts are predictable
But a lot of the time, they won't be
For scatter, calculate the counts on the root
Use MPI_Scatter for the count values
Then do the full MPI_Scatterv on the data
For gather, calculate a count on each process
Use MPI_Gather for the count values
Then do the full MPI_Gatherv on the data
Allgather and alltoall are similar; the gather case is sketched below
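A minimal sketch of that two-step pattern for gather, assuming each process has already computed mycount double values in mydata; the function name, arguments and the choice of returning the gathered buffer only on the root are my own, not from the course.

#include <mpi.h>
#include <stdlib.h>

/* Gather a variable number of doubles from every process onto root */
double *gather_varying(double *mydata, int mycount,
                       int root, MPI_Comm comm, int *total_out)
{
    int nprocs, myrank, i, total = 0;
    int *counts = NULL, *offsets = NULL;
    double *result = NULL;

    MPI_Comm_size(comm, &nprocs);
    MPI_Comm_rank(comm, &myrank);

    if (myrank == root) {
        counts  = malloc(nprocs * sizeof(int));
        offsets = malloc(nprocs * sizeof(int));
    }

    /* Step 1: a plain MPI_Gather of the count values themselves */
    MPI_Gather(&mycount, 1, MPI_INT, counts, 1, MPI_INT, root, comm);

    /* Step 2: the root builds the offsets, then the full MPI_Gatherv on the data */
    if (myrank == root) {
        for (i = 0; i < nprocs; ++i) {
            offsets[i] = total;
            total += counts[i];
        }
        result = malloc(total * sizeof(double));
    }
    MPI_Gatherv(mydata, mycount, MPI_DOUBLE,
                result, counts, offsets, MPI_DOUBLE, root, comm);

    if (myrank == root) { free(counts); free(offsets); }
    if (total_out) *total_out = total;
    return result;   /* non-NULL only on the root */
}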

Efficiency (1)
Generally, use collectives wherever possible
They provide the most opportunity for implementation tuning
Use the composite ones where that is useful:
MPI_Allgather, MPI_Allreduce, MPI_Alltoall
Do as much as possible in one collective
Fewer, larger transfers are always better
Consider packing scalars into arrays for copying
Even converting integers to reals to do so
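As an illustration, a sketch (not from the course) of packing three scalars into one array so that a single MPI_Bcast replaces three; the scalar names and values are made up, and the integer is converted to a double and back, as suggested above.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int nsteps = 0, root = 0, myrank;
    double tolerance = 0.0, dt = 0.0, packed[3];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == root) {                /* only the root knows the real values */
        nsteps = 1000; tolerance = 1.0e-6; dt = 0.01;
        packed[0] = (double) nsteps;     /* integer converted to real */
        packed[1] = tolerance;
        packed[2] = dt;
    }

    /* One broadcast instead of three */
    MPI_Bcast(packed, 3, MPI_DOUBLE, root, MPI_COMM_WORLD);

    nsteps    = (int) packed[0];         /* convert back after the call */
    tolerance = packed[1];
    dt        = packed[2];

    printf("Process %d: nsteps=%d tol=%g dt=%g\n", myrank, nsteps, tolerance, dt);
    MPI_Finalize();
    return 0;
}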

Efficiency (2)
But what ARE small and large?
A rule of thumb is about 4 KB per buffer
That is each single, pairwise transfer buffer
But it's only a rule of thumb, not a hard fact
Don't waste time packing data unless you need to

Synchronisation
Collectives are not synchronised at all
Except for MPI_Barrier, that is
Up to three successive collectives can overlap
In theory, this allows for improved efficiency
In practice, it makes it hard to measure times
To synchronise, call MPI_Barrier
The first process leaves only after the last process enters

Scheduling (1)
Implementations usually tune for gang scheduling
Collectives often run faster when synchronised
Consider adding a barrier before every collective
It's an absolutely trivial change, after all
Best to run 3-10 times with, and 3-10 without
The answer will either be clear, or it won't matter
If you have major, non-systematic differences, you have a nasty problem and may need help
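A sketch of the kind of comparison suggested above, timing a collective with and without a preceding barrier; the choice of MPI_Allreduce, the buffer size and the repeat count are arbitrary assumptions.

#include <mpi.h>
#include <stdio.h>

#define N    100000
#define REPS 10

int main(int argc, char **argv)
{
    static double sendbuf[N], recvbuf[N];
    double t0, with_barrier = 0.0, without_barrier = 0.0;
    int myrank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    for (i = 0; i < REPS; ++i) {         /* unsynchronised timing */
        t0 = MPI_Wtime();
        MPI_Allreduce(sendbuf, recvbuf, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        without_barrier += MPI_Wtime() - t0;
    }
    for (i = 0; i < REPS; ++i) {         /* barrier before the collective */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        MPI_Allreduce(sendbuf, recvbuf, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        with_barrier += MPI_Wtime() - t0;
    }
    if (myrank == 0)
        printf("mean time: %g s without barrier, %g s with\n",
               without_barrier / REPS, with_barrier / REPS);

    MPI_Finalize();
    return 0;
}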

Scheduling (2)
You can overlap collectives and point-to-point
MPI requires implementations to make that work
I strongly advise not doing that
A correct program will definitely not hang or crash
But it may run horribly slowly...
Remember that three collectives can overlap?
Point-to-point can interleave with those
I recommend alternating the modes of use
It's a lot easier to validate and debug, too!

Scheduling (3)
Start here...
[Consider calling MPI_Barrier here]
Any number of collective calls
[Consider calling MPI_Barrier here]
Any number of point-to-point calls
Wait for all of those calls to finish
And repeat from the beginning...
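A C skeleton of that structure, under the assumption that the point-to-point phase uses non-blocking calls completed with MPI_Waitall; do_collective_phase and post_point_to_point are hypothetical placeholders for the application's real work.

#include <mpi.h>

#define MAX_REQS 64

/* Hypothetical placeholders for the application's real work */
static void do_collective_phase(void)
{
    /* ... any number of collective calls ... */
}
static int post_point_to_point(MPI_Request reqs[])
{
    /* ... would post MPI_Isend/MPI_Irecv calls and return how many ... */
    return 0;
}

int main(int argc, char **argv)
{
    MPI_Request reqs[MAX_REQS];
    int step, nreqs;

    MPI_Init(&argc, &argv);
    for (step = 0; step < 3; ++step) {
        MPI_Barrier(MPI_COMM_WORLD);        /* consider calling MPI_Barrier here */
        do_collective_phase();              /* any number of collective calls */

        MPI_Barrier(MPI_COMM_WORLD);        /* consider calling MPI_Barrier here */
        nreqs = post_point_to_point(reqs);  /* any number of point-to-point calls */
        /* wait for all of those calls to finish before going round again */
        MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
    }
    MPI_Finalize();
    return 0;
}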

In-Place Collectives (1)
You can usually pass the same array twice
But it's a breach of the Fortran standard
And not clearly permitted in C
I recommend avoiding it if at all possible
It will rarely cause trouble or get diagnosed
But, if it does, the bug will be almost unfindable
I have used systems on which it would fail
The Hitachi SR2201 was one, for example
Ask me why, offline, if you are masochistic

In-Place Collectives (2)
MPI 2 defined an MPI_IN_PLACE pseudo-buffer
It specifies that the result overwrites the input
I.e. the real buffer is both source and target
You need to read the MPI standard for the full specification
For allgather[v], alltoall[v] and allreduce: use it for the send buffer on all processes
The send counts, datatype and offsets are then ignored
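For example, a minimal sketch of an in-place MPI_Allreduce, where each process's contribution starts in the buffer that also receives the result; the buffer contents here are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    double buffer[100];
    int myrank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    for (i = 0; i < 100; ++i)
        buffer[i] = myrank + i;          /* this process's contribution */

    /* MPI_IN_PLACE as the send buffer: buffer is both source and target */
    MPI_Allreduce(MPI_IN_PLACE, buffer, 100, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);

    if (myrank == 0)
        printf("buffer[0] after the reduction: %g\n", buffer[0]);

    MPI_Finalize();
    return 0;
}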

In-Place Collectives (3)
I will give only a C example:

double buffer [ 30 ] [ 100 ] ;
int error ;

error = MPI_Alltoall ( MPI_IN_PLACE, 100, MPI_DOUBLE,
    /* or even MPI_IN_PLACE, 0, MPI_INT, */
    buffer, 100, MPI_DOUBLE, MPI_COMM_WORLD ) ;

The Fortran is very similar

Epilogue
That is essentially all you need to know!
We have covered everything that seems to be used
MPI collectives look very complicated, but aren't
A few more features, which are rarely used
Mainly introduced by MPI 2 for advanced uses
They are mentioned, briefly, in the extra lectures
Two exercises, on searching and scatterv