Practical Scientific Computing: Performance-optimized Programming

Practical Scientific Computing: Performance-optimized Programming. Programming with MPI. November 29, 2006. Dr. Ralf-Peter Mundani, Department of Computer Science, Chair V, Technische Universität München, Germany

From an Engineer's Point of View. problem: simulation of a physical phenomenon / technical process: mathematical model, discretisation, algorithm development, implementation, running (sequential) code; what's left: parallelisation

From an Engineer's Point of View. problem: simulation of a physical phenomenon / technical process: mathematical model, discretisation, algorithm development, implementation, running (sequential) code; a typical engineer's answer: "now you only have to do some parallelisation"

MPI, the Message Passing Interface: de facto standard for writing parallel programs; both freely available and vendor-supplied implementations; supports most interconnects; to be used within C, C++, Fortran 77, and Fortran 90 programs; target platforms: SMPs, clusters, massively parallel processors; from our students sometimes called: Most Painful Interface. useful links: http://www.mpi-forum.org/ http://www.hlrs.de/mpi/ http://www-unix.mcs.anl.gov/mpi/

SPMD Model vs. MPMD Model. SPMD, Single Program, Multiple Data: processes perform the same task over different data; but: restriction of the general message-passing model

main_routine ( /* arguments */ )
{
    if (process is to become a controller process) {
        Controller ( /* arguments */ )
    } else {
        Worker ( /* arguments */ )
    }
}

SPMD Model vs. MPMD Model. MPMD, Multiple Program, Multiple Data: processes perform different tasks over different (or the same) data

main_routine ( /* arguments */ )
{
    if (process #1) {
        do_this ( /* arguments */ )
    } else if (process #2) {
        do_that ( /* arguments */ )
    } else if (process #3) {
        ...
    } else if ( ... ) {
        ...
    }
}

Message-Passing Programming Model. sequential programming paradigm. [diagram: one processor P with its own memory M]

Message-Passing Programming Model. message-passing programming paradigm: each processor runs one (or more) processes; all variables are private; communication between processes via messages. [diagram: several processors P, each with its own memory M, connected by a communication network]

Messages. what information has to be provided for the message transfer: which process is sending the message; where is the data stored on the sending process; what kind of data is being sent; how much data is there; which process(es) are receiving the message; where should the data be left on the receiving process; how much data is the receiving process prepared to accept. MPI offers several ways of sending/receiving messages.

Communication. point-to-point communication: 2 processes involved (sender and receiver); the way of sending interacts with the execution of the sub-program: synchronous, asynchronous, blocking, non-blocking

Communication. point-to-point communication: 2 processes involved (sender and receiver); the way of sending interacts with the execution of the sub-program. synchronous (e.g. fax): send is provided information about completion of the message; communication is not complete until the message has been received

Communication. point-to-point communication: 2 processes involved (sender and receiver); the way of sending interacts with the execution of the sub-program. asynchronous (e.g. mail box): send only knows when the message has left; communication completes as soon as the message is on its way

Communication. point-to-point communication: 2 processes involved (sender and receiver); the way of sending interacts with the execution of the sub-program. blocking (e.g. fax): operations only finish when the communication has completed

Communication. point-to-point communication: 2 processes involved (sender and receiver); the way of sending interacts with the execution of the sub-program. non-blocking (e.g. fax with memory): operations return straight away and allow the program to continue; at some later time the program can test for completion of the operation

Communication. collective communication: all (or some) processes involved: barrier, broadcast, scatter and gather, reduce

Communication. collective communication: all (or some) processes involved. barrier: synchronises processes; no data is exchanged, but each process is blocked until all have called the barrier routine

Communication. collective communication: all (or some) processes involved. broadcast: one-to-many communication; one process sends the same message to several destinations with a single operation

Communication. collective communication: all (or some) processes involved. gather and scatter: one process takes/gives data items from/to several processes

Communication. collective communication: all (or some) processes involved. reduce: one process takes data items from several processes and reduces them to a single data item; reduce operations: global sum, global minimum/maximum,

Writing and Running MPI Programs. header file: #include <mpi.h>; all names of routines and constants are prefixed with MPI_. the first routine called in any MPI program must be the initialisation: MPI_Init (argc, argv). clean-up at the end of the program, when all communications have completed: MPI_Finalize (void); MPI_Finalize() does not cancel outstanding communications. MPI_Init() and MPI_Finalize() are mandatory.

Writing and Running MPI Programs. processes can only communicate if they share a communicator; the predefined communicator MPI_COMM_WORLD contains the list of processes, consecutively numbered from 0 (this number is called the rank). rank identifies each process within the communicator; size identifies the amount of processes within the communicator. why create a new communicator: to restrict collective communication to a subset of processes, or to create a virtual topology, e.g. a torus.
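The slides do not show how such a restricted communicator is created; the following is a minimal sketch of my own using MPI_Comm_split, with a purely illustrative even/odd split of MPI_COMM_WORLD:

/* sketch (not from the slides): split MPI_COMM_WORLD into two sub-communicators
 * so that collective operations only involve a subset of the processes */
#include <mpi.h>
#include <stdio.h>

int main (int argc, char **argv)
{
    int world_rank, sub_rank, sub_size;
    MPI_Comm sub_comm;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &world_rank);

    /* processes with the same color end up in the same new communicator */
    int color = world_rank % 2;
    MPI_Comm_split (MPI_COMM_WORLD, color, world_rank, &sub_comm);

    MPI_Comm_rank (sub_comm, &sub_rank);
    MPI_Comm_size (sub_comm, &sub_size);
    printf ("world rank %d -> sub-communicator %d: rank %d of %d\n",
            world_rank, color, sub_rank, sub_size);

    MPI_Comm_free (&sub_comm);
    MPI_Finalize ();
    return 0;
}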

Writing and Running MPI Programs. determination of rank and size: MPI_Comm_rank (comm, rank), MPI_Comm_size (comm, size). some remarks: rank lies in [0, size-1]; size has to be specified at the program's start; MPI-1: size cannot be changed during runtime; MPI-2: spawning of processes during runtime is possible.

Writing and Running MPI Programs. compilation of MPI programs: mpicc, mpicxx, mpif77, or mpif90

> mpicc [-o my_prog] my_prog.c

available nodes for running an MPI program have to be stated explicitly via a machinefile (list of hostnames or FQDNs):

computer1.my_domain.com
computer2.my_domain.com

running an MPI program (MPI-1):

> mpirun -machinefile <file> -np <#procs> my_prog

Writing and Running MPI Programs. running an MPI program (MPI-2):

> mpdboot -n <#mpds> -f <file>
> mpiexec -n <#procs> my_prog

Note: mpd is only started once. clean-up after usage or in case mpd crashed badly (MPI-2 only):

> mpdcleanup -f <file>

Writing and Running MPI Programs. example: Hello world!

int main (int argc, char **argv)
{
    int rank, size;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf ("%d processes running\n", size);
    else
        printf ("Slave %d: Hello world!\n", rank);

    MPI_Finalize ();
    return 0;
}

Messages. a message is an array of elements of a particular MPI datatype; the type of the content must be specified in both the send and the receive, so that no manual type conversions are necessary on heterogeneous parallel architectures. MPI datatypes: basic types (see table) and derived types (built up from basic types, e.g. vector).

MPI datatype      C/C++ datatype
MPI_CHAR          signed char
MPI_SHORT         signed short int
MPI_INT           signed int
MPI_LONG          signed long int

Messages. MPI datatypes (cont'd):

MPI datatype         C/C++ datatype
MPI_UNSIGNED_CHAR    unsigned char
MPI_UNSIGNED_SHORT   unsigned short int
MPI_UNSIGNED         unsigned int
MPI_UNSIGNED_LONG    unsigned long int
MPI_FLOAT            float
MPI_DOUBLE           double
MPI_LONG_DOUBLE      long double
MPI_BYTE             (represents eight binary digits)
MPI_PACKED           (for matching any other type)

Point-to-Point Communication. communication modes:

mode                completion
synchronous send    only completes when the receive has completed
buffered send       always completes (even if the receive has not)
standard send       either synchronous or buffered
ready send          always completes (even if the receive has not)
receive             completes when a message has arrived

all modes exist in both blocking and non-blocking form; blocking: return from the routine implies completion; non-blocking: modes have to be tested for completion.

Blocking Communication. neither sender nor receiver is able to continue the program during the message passing stage. sending a message (generic): MPI_Send (buf, count, datatype, dest, tag, comm). receiving a message: MPI_Recv (buf, count, datatype, src, tag, comm, status). tag: marker to distinguish between different sorts of messages; status: sender and tag can be queried for the message received.

Blocking Communication. synchronous send MPI_Ssend(): the receiving process sends back an acknowledgement; receipt of the acknowledgement by the sender completes the send routine; the sending process is idle until the receiving process catches up. buffered send MPI_Bsend(): the message is copied to a system buffer for later transmission; the user must attach buffer space first (MPI_Buffer_attach); only one buffer can be attached per process at a time; a buffered send is guaranteed to complete immediately; a non-blocking bsend has no advantage over a blocking bsend.
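The buffered send described above could be used as in the following sketch of my own (not from the slides); the single-int payload is arbitrary:

/* sketch: attach a user buffer, send with MPI_Bsend, detach afterwards */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv)
{
    int rank, size, data = 42, recv = 0;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            int bufsize = sizeof (int) + MPI_BSEND_OVERHEAD;
            void *buffer = malloc (bufsize);

            MPI_Buffer_attach (buffer, bufsize);     /* attach buffer space */
            MPI_Bsend (&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Buffer_detach (&buffer, &bufsize);   /* blocks until message has left */
            free (buffer);
        } else if (rank == 1) {
            MPI_Recv (&recv, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf ("received %d\n", recv);
        }
    }

    MPI_Finalize ();
    return 0;
}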

Blocking Communication. standard send MPI_Send(): completes once the message has been sent; depending on the message size it behaves either as a buffered or as a synchronous send. ready send MPI_Rsend(): completes immediately; a matching receive must have already been posted, otherwise the outcome is undefined; performance may be improved by avoiding handshaking and buffering between sender and receiver; a non-blocking rsend has no advantage over a blocking rsend.

Blocking Communication. receive MPI_Recv(): completes when a message has arrived; usage of wildcards is possible: receive from any source (MPI_ANY_SOURCE), receive with any tag (MPI_ANY_TAG). general rule: messages do not overtake each other.
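A small sketch of my own (not from the slides) showing a receive with both wildcards and a query of the status object; using the sender's rank as the tag is purely illustrative:

/* sketch: rank 0 accepts messages from any source with any tag and inspects
 * the status for the actual sender, tag, and element count */
#include <mpi.h>
#include <stdio.h>

int main (int argc, char **argv)
{
    int rank, size;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int i, value, count;
        MPI_Status status;
        for (i = 1; i < size; i++) {
            MPI_Recv (&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                      MPI_COMM_WORLD, &status);
            MPI_Get_count (&status, MPI_INT, &count);
            printf ("got %d element(s) from rank %d with tag %d\n",
                    count, status.MPI_SOURCE, status.MPI_TAG);
        }
    } else {
        MPI_Send (&rank, 1, MPI_INT, 0, rank /* tag */, MPI_COMM_WORLD);
    }

    MPI_Finalize ();
    return 0;
}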

Blocking Communication. example: a simple ping pong

int rank, buf;

MPI_Comm_rank (MPI_COMM_WORLD, &rank);

if (rank == 0) {
    MPI_Send (&rank, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    MPI_Recv (&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
    MPI_Recv (&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send (&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
}

Non-Blocking Communication. question: does this communication in a ring work?

int main (int argc, char **argv)
{
    int rank, buf;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    MPI_Recv (&buf, 1, MPI_INT, rank-1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send (&rank, 1, MPI_INT, rank+1, 0, MPI_COMM_WORLD);

    MPI_Finalize ();
    return 0;
}

Non-Blocking Communication. blocking communication does not return until the communication has completed (thus, buffers can be used or re-used), but there is a risk of idly waiting and/or deadlocks; use non-blocking communication instead. hence, separate communication into three phases: initiate non-blocking communication, do some work (e.g. involving other communications), wait for the non-blocking communication to complete. non-blocking routines have identical arguments to their blocking counterparts, except for an extra argument request; the request handle is important for testing whether the communication has completed.

Non-Blocking Communication. sending a message: MPI_Isend (buf, count, datatype, dest, tag, comm, request). receiving a message: MPI_Irecv (buf, count, datatype, src, tag, comm, request). ways of sending: synchronous send MPI_Issend(), buffered send MPI_Ibsend(), standard send MPI_Isend(), ready send MPI_Irsend()

Non-Blocking Communication. testing communication for completion is essential before making use of the result of the communication or re-using the communication buffer. completion tests come in two types. wait type: blocks until the communication has completed (useful when the data is required or the buffer is about to be re-used): MPI_Wait (request, status). test type: returns TRUE or FALSE depending on whether or not the communication has completed; it does not block: MPI_Test (request, flag, status). MPI_Isend() followed by an immediate MPI_Wait() is comparable to a blocking send.
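As a sketch of the three-phase pattern (initiate, work, wait), here is a small example of my own (not from the slides); the dummy computation merely stands in for useful work that does not touch the communication buffer:

/* sketch: overlap a non-blocking transfer with independent computation */
#include <mpi.h>
#include <stdio.h>

int main (int argc, char **argv)
{
    int rank, size, token = 0;
    MPI_Request request;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    if (size < 2) { MPI_Finalize (); return 0; }

    /* phase 1: initiate non-blocking communication */
    if (rank == 0) {
        token = 42;
        MPI_Isend (&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
    } else if (rank == 1) {
        MPI_Irecv (&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
    }

    if (rank <= 1) {
        /* phase 2: do work that does not use the communication buffer */
        double local = 0.0;
        int i;
        for (i = 0; i < 1000000; i++)
            local += 1.0 / (i + 1.0);

        /* phase 3: wait before the buffer may be touched again */
        MPI_Wait (&request, MPI_STATUS_IGNORE);
        printf ("rank %d: token = %d, local work = %f\n", rank, token, local);
    }

    MPI_Finalize ();
    return 0;
}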

Non-Blocking Communication. waiting and testing for multiple communications:

MPI_Waitall()    blocks until all have completed
MPI_Testall()    TRUE if all have completed, otherwise FALSE
MPI_Waitany()    blocks until one or more have completed, returns (arbitrary) index
MPI_Testany()    returns flag and (arbitrary) index
MPI_Waitsome()   blocks until one or more have completed, returns indices of all completed ones
MPI_Testsome()   returns flag and indices of all completed ones

blocking and non-blocking forms can be combined.

Non-Blocking Communication. example: communication in a ring

int main (int argc, char **argv)
{
    int rank, buf;
    MPI_Request request;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    MPI_Irecv (&buf, 1, MPI_INT, rank-1, 0, MPI_COMM_WORLD, &request);
    MPI_Send (&rank, 1, MPI_INT, rank+1, 0, MPI_COMM_WORLD);
    MPI_Wait (&request, MPI_STATUS_IGNORE);

    MPI_Finalize ();
    return 0;
}
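The example above keeps the slide's rank-1 / rank+1 neighbours; in a closed ring the wrap-around at rank 0 and rank size-1 also has to be handled. The following sketch is my own addition (not from the slides) and uses modulo arithmetic for the neighbour ranks:

/* sketch: ring communication with explicit wrap-around of the neighbours */
#include <mpi.h>
#include <stdio.h>

int main (int argc, char **argv)
{
    int rank, size, buf;
    MPI_Request request;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;   /* neighbour to receive from */
    int right = (rank + 1) % size;          /* neighbour to send to */

    MPI_Irecv (&buf, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &request);
    MPI_Send (&rank, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
    MPI_Wait (&request, MPI_STATUS_IGNORE);

    printf ("rank %d received %d from rank %d\n", rank, buf, left);

    MPI_Finalize ();
    return 0;
}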

Collective Communication. characteristics: all processes (within the communicator) communicate; synchronisation may or may not occur; all collective operations are blocking; no tags allowed; receive buffers must be exactly the right size.

Collective Communication. barrier synchronisation: blocks the calling process until all other processes have called it; thus, MPI_Barrier() always synchronises: MPI_Barrier (comm). broadcast: has a specified root process; every process receives one copy of the message from the root; all processes must specify the same root: MPI_Bcast (buf, count, datatype, root, comm).
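As an illustration of the broadcast signature just given, here is a small sketch of my own (not from the slides); the value 1000 is purely illustrative:

/* sketch: the root determines a parameter and broadcasts it to all processes;
 * every process calls MPI_Bcast with the same root */
#include <mpi.h>
#include <stdio.h>

int main (int argc, char **argv)
{
    int rank, n = 0;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    if (rank == 0)
        n = 1000;                    /* e.g. a problem size read by the root */

    /* after the call, every process holds the same value of n */
    MPI_Bcast (&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf ("rank %d: n = %d\n", rank, n);

    MPI_Finalize ();
    return 0;
}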

Collective Communication. broadcast sends a message to all participating processes. example: the first process in a competition informs the others to stop. [diagram: the root's item A is replicated to all processes]

Collective Communication. gather and scatter: has a specified root process; all processes must specify the same root; send and receive details must be specified as arguments: MPI_Gather (sbuf, scount, sdatatype, rbuf, rcount, rdatatype, root, comm), MPI_Scatter (sbuf, scount, sdatatype, rbuf, rcount, rdatatype, root, comm). variants: MPI_Allgather() and MPI_Alltoall().

Collective Communication. scatter: data from one process are distributed among all processes. example: rows of a matrix for a parallel solution. [diagram: the root holds A B C D; after the scatter, each process holds one of the items]

Collective Communication. gather: data from all processes are collected by one process. example: assembly of the solution vector from partial solutions. [diagram: the processes hold A, B, C, D; after the gather, the root holds A B C D]
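The two examples above (distributing matrix rows, assembling a result) can be combined into a small sketch; this is my own illustration, not from the slides, and the row length N = 4 and the doubling of each row are arbitrary choices:

/* sketch: scatter one row per process, work on it locally, gather the results */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 4   /* illustrative row length */

int main (int argc, char **argv)
{
    int rank, size, i;
    double row[N], *matrix = NULL, *result = NULL;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    if (rank == 0) {
        matrix = malloc (size * N * sizeof (double));
        result = malloc (size * N * sizeof (double));
        for (i = 0; i < size * N; i++)
            matrix[i] = (double) i;
    }

    /* one row of N values per process */
    MPI_Scatter (matrix, N, MPI_DOUBLE, row, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (i = 0; i < N; i++)          /* local work on the own row */
        row[i] *= 2.0;

    /* assemble the processed rows on the root again */
    MPI_Gather (row, N, MPI_DOUBLE, result, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf ("result[0] = %f, result[%d] = %f\n",
                result[0], size * N - 1, result[size * N - 1]);
        free (matrix);
        free (result);
    }

    MPI_Finalize ();
    return 0;
}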

Collective Communication. gather-to-all: all processes collect the distributed data from all others. example: as before, but all processes need the solution for continuation. [diagram: each process holds one of A, B, C, D; after the gather-to-all, every process holds A B C D]

Collective Communication. all-to-all: data from all processes are distributed among all others. example: any ideas? [diagram: process i holds row i of the blocks (A B C D; E F G H; I J K L; M N O P); after the all-to-all, process i holds column i, i.e. (A E I M), (B F J N), (C G K O), (D H L P)]

Collective Communication. all-to-all (cont'd): all-to-all is also known as total exchange. example: transposition of a matrix; matrix A is stored row-wise in memory; the processes transpose their blocks in parallel; a total exchange of the blocks yields the transpose:

A   = ( a b c d ; e f g h ; i k l m ; n o p q )
A^T = ( a e i n ; b f k o ; c g l p ; d h m q )

Collective Communication. global reduction operations: useful for computing a result over distributed data; has a specified root process; every process must specify the same root; every process must specify the same operation; reduction operations can be predefined or user-defined; the root process ends up with an array of results: MPI_Reduce (sbuf, rbuf, count, datatype, op, root, comm).
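A minimal sketch of my own (not from the slides) of such a reduction, computing a global sum of locally computed partial sums; the range 1..1000 is arbitrary:

/* sketch: each process sums a disjoint part of 1..1000, MPI_Reduce combines
 * the partial sums into a global sum on the root */
#include <mpi.h>
#include <stdio.h>

int main (int argc, char **argv)
{
    int rank, size, i;
    double local = 0.0, global = 0.0;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    /* cyclic distribution of the indices 1..1000 over the processes */
    for (i = rank + 1; i <= 1000; i += size)
        local += (double) i;

    MPI_Reduce (&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf ("global sum = %f (expected %f)\n",
                global, 1000.0 * 1001.0 / 2.0);

    MPI_Finalize ();
    return 0;
}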

Collective Communication. global reduction operations (cont'd):

operator    result
MPI_MAX     find maximum
MPI_MIN     find minimum
MPI_SUM     calculate sum
MPI_PROD    calculate product
MPI_LAND    logical AND
MPI_BAND    bitwise AND
MPI_LOR     logical OR
MPI_BOR     bitwise OR

Collective Communication. global reduction operations (cont'd):

operator      result
MPI_LXOR      logical XOR
MPI_BXOR      bitwise XOR
MPI_MAXLOC    find maximum and its position
MPI_MINLOC    find minimum and its position

variants (no specified root): MPI_Allreduce(): all processes receive the result; MPI_Reduce_scatter(): the resulting vector is distributed; MPI_Scan(): processes receive partial results.

Collective Communication. reduce: data from all processes are reduced to a single data item (or items). example: global minimum / maximum / sum / product / ... [diagram: the items A, B, C, D of the individual processes are combined into one result R on the root]

Collective Communication. all-reduce: all processes are provided with the reduced data item(s). example: all processes need the global minimum for continuation. [diagram: the items A, B, C, D are combined into a result R that every process receives]
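A sketch of my own (not from the slides) of the all-reduce variant mentioned above, e.g. for a global minimum that every process needs; the local values are placeholders:

/* sketch: every process contributes a local minimum and all processes
 * receive the global minimum, e.g. as a convergence criterion */
#include <mpi.h>
#include <stdio.h>

int main (int argc, char **argv)
{
    int rank;
    double local_min, global_min;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    local_min = 100.0 - rank;        /* stand-in for a locally computed value */

    /* like MPI_Reduce, but the result is available on every process */
    MPI_Allreduce (&local_min, &global_min, 1, MPI_DOUBLE, MPI_MIN,
                   MPI_COMM_WORLD);

    printf ("rank %d: global minimum = %f\n", rank, global_min);

    MPI_Finalize ();
    return 0;
}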

Collective Communication. reduce-scatter: data from all processes are reduced and distributed. example: any ideas? [diagram: the processes hold the rows (A B C D), (E F G H), (I J K L), (M N O P); the columns are reduced to R = A∘E∘I∘M, S = B∘F∘J∘N, T = C∘G∘K∘O, U = D∘H∘L∘P (∘ denotes the reduction operation), and each process receives one of R, S, T, U]

Collective Communication. parallel prefix: processes receive a partial result of the reduce operation. example: carry look-ahead for adding two numbers. [diagram: with items A, B, C, D on processes 0..3, process i receives the partial result over the first i+1 items, i.e. A, A∘B, A∘B∘C, A∘B∘C∘D]

Collective Communication. parallel prefix (cont'd): carrying problem (overflow) when adding two digits; consider

        c_3 c_2 c_1 c_0      carry
    a_4 a_3 a_2 a_1 a_0      first integer
    b_4 b_3 b_2 b_1 b_0      second integer
    s_4 s_3 s_2 s_1 s_0      sum

c_i can be computed from c_{i-1} by c_i = (a_i + b_i) * c_{i-1} + a_i * b_i; carry look-ahead: each c_i is computed by a parallel prefix; afterwards, the s_i are calculated in parallel.

Collective Communication. parallel prefix (cont'd): implementation with binary trees. example: finding all (partial) sums in O(log n) time. [diagram: up the tree, the leaf values 1 2 3 4 5 6 7 8 are combined pairwise to 3 7 11 15, then 10 26, then 36; down the tree (prefix), the levels 36 / 10 36 / 3 10 21 36 yield the partial sums 1 3 6 10 15 21 28 36; even entries: b_i = c_i, odd entries: b_i = c_{i-1} + b_i]
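MPI offers the parallel prefix described above directly as MPI_Scan(); the following sketch is my own addition (not from the slides) and reproduces the partial sums 1, 3, 6, 10, ... of the tree example with one value per process:

/* sketch: inclusive prefix sum with MPI_Scan; process i contributes i+1 and
 * receives the sum of the contributions of ranks 0..i */
#include <mpi.h>
#include <stdio.h>

int main (int argc, char **argv)
{
    int rank, value, prefix;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    value = rank + 1;                /* process i contributes the value i+1 */

    MPI_Scan (&value, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf ("rank %d: prefix sum = %d\n", rank, prefix);

    MPI_Finalize ();
    return 0;
}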

mundani@in.tum.de http://www5.in.tum.de/~mundani/