ECE 574 Cluster Computing Lecture 13
Vince Weaver
http://web.eece.maine.edu/~vweaver
vincent.weaver@maine.edu
21 March 2017

Announcements

HW#5 finally graded. Most had the right idea, but often the result was not an *exact* match, usually due to edge conditions. Does that matter? Approximate computing?

HW#6 also finally graded. Comment your code! Even if it's just a few new lines, say what it is supposed to be doing. Parallelism is never trivial.

You have to put parallel either in a separate directive, or in sections. Also put the time measurement outside the parallel area (the time in each section is the same with or without threads; the difference is that they can happen simultaneously), i.e. be sure to measure wall-clock time, not user time.

Don't nest parallel! Removing the sections stuff is fine. Also, does it make sense to parallelize the innermost of three loops?

Also, what if you mark variables private that shouldn't be? Scope! If you have sum marked private in an inner loop, you need to make sure it somehow gets added on the outer loop (reduction).

Be careful with bracket placement. You don't need one for a single for statement, for example.

Also, remember that as soon as you do parallel, everything inside the brackets runs on X threads. So if you do parallel, have loops, then a for... those outer loops are each running X times, so you're calculating everything X times over. This isn't a race condition, because we don't modify the inputs, so it doesn't matter how many times we calculate each output.

HW#7 (MPI) will be assigned after the midterm.

Short midterm Thursday.

Project ideas due 30 March (next Thursday).

Spent most of break trying to make PAPI faster. Turned up a lot of Linux kernel bugs; one involved pthreads, fun. Trying to get a paper out of it.

Cluster/Infiniband digression: managed to get the haswell/broadwell cluster connected by Infiniband over IP (20GB/s!) but can't get OpenMPI to use the connection.

Midterm Review

You can bring one page (8.5 by 11 inch) of notes. Otherwise closed notes, computers, cell phones, Beowulf clusters, etc.

Performance: speedup, parallel efficiency, strong and weak scaling, the definition of distributed vs shared memory, and why changing the order of loops can make things faster.

Pthread Programming: know about race conditions and deadlock; know roughly the layout of a pthreads program (define pthread_t thread structures, pthread_create, pthread_join); know why you'd use a mutex.

OpenMP Programming: the parallel directive, scope, section, and the for directive.

MPI continued

Some references:
https://computing.llnl.gov/tutorials/mpi/
http://moss.csc.ncsu.edu/~mueller/cluster/mpi.guide.pdf
https://cvw.cac.cornell.edu/mpicc/default

Writing MPI code

#include "mpi.h"

Over 430 routines. Use mpicc to compile; it runs gcc or another compiler underneath and just sets up the includes and libraries for you. Run with mpirun -n 4 ./test_mpi.

MPI_Init() is called before anything else, MPI_Finalize() at the end.

Error handling: most errors just abort.
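A minimal sketch of what that structure looks like (the program and file name test_mpi are just examples, not one of the class examples):

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char **argv) {
        int rank, size;

        MPI_Init(&argc, &argv);               /* must come before other MPI calls */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's number */
        MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */

        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();                       /* must come last */
        return 0;
    }

Compile and run with something like: mpicc -o test_mpi test_mpi.c and mpirun -n 4 ./test_mpi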

Communicators

You can specify communicator groups and only send messages to specific groups. MPI_COMM_WORLD is the default and means all processes.

Rank

Rank is the process number.

MPI_Comm_rank(MPI_Comm comm, int *rank)
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

You can find the number of processes:
MPI_Comm_size(MPI_Comm comm, int *size)

Error Handling

MPI_SUCCESS (0) is good. By default MPI aborts on any sort of error; you can override this.

Timing

MPI_Wtime(); returns wall-clock time as a double floating point value.

For PAPI-like resolution measurements there is MPI_Wtick();
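A short sketch of timing a region with MPI_Wtime(), assuming it runs inside an MPI program (between MPI_Init and MPI_Finalize); the timed work is just a placeholder:

    double start, elapsed;

    start = MPI_Wtime();            /* wall-clock seconds, as a double */
    /* ... work being timed ... */
    elapsed = MPI_Wtime() - start;

    printf("elapsed %f s (timer resolution %g s)\n", elapsed, MPI_Wtick());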

Point to Point Operations

Buffering: what happens if we do a send but the receiving side is not ready?

Blocking: a blocking send returns once it is safe to modify your send buffer. That does not necessarily mean the data has been sent; it may just have been buffered for sending. A blocking receive only returns when all the data has been received.

Non-blocking: these calls return immediately. It is not safe to change the buffers until you know the operation has finished; there are wait routines for this.

Order: messages will not overtake each other. If you send #1 and #2 to the same receiver, #1 will be received first.

Fairness: no guarantee of fairness. If processes 1 and 2 both send to the same receive on process 3, there is no guarantee which one is received.

MPI_Send, MPI_Recv

blocking:     MPI_Send(buffer,count,type,dest,tag,comm)
non-blocking: MPI_Isend(buffer,count,type,dest,tag,comm,request)
blocking:     MPI_Recv(buffer,count,type,source,tag,comm,status)
non-blocking: MPI_Irecv(buffer,count,type,source,tag,comm,request)

buffer: pointer to the data buffer
count: number of items to send
type: MPI predefines a bunch: MPI_CHAR, MPI_INT, MPI_LONG, MPI_DOUBLE, etc. You can also create your own complex data types.
destination: rank to send to
source: rank to receive from. Can also be MPI_ANY_SOURCE.
tag: arbitrary integer uniquely identifying the message. You can pick it yourself; 0-32767 is guaranteed, it can be higher.
communicator: can specify subgroups. Usually you use MPI_COMM_WORLD.
status: status of the message, a struct in C
request: on non-blocking calls this is a handle to the request that can be queried later to see its status
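A minimal sketch of a blocking send/receive pair between rank 0 and rank 1, assumed to run inside an initialized MPI program; the tag of 0 and the single-int payload are arbitrary choices:

    int rank, value;
    MPI_Status status;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* send one int to rank 1 with tag 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1) {
        /* receive one int from rank 0 with tag 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 got %d\n", value);
    }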

Fancier blocking send/receives

There are lots, with various types of blocking, buffer attaching, and synchronous/asynchronous behavior.

Efficient ways of getting data to all processes

master sends to each process individually: takes a while
some sort of tree: 0 sends to 1 and 2, 1 sends to 3 and 4, etc.
use a broadcast instead

Collective Communication

All processes must participate or there can be problems.
Collectives do not take tag arguments.
They can only operate on MPI-defined data types, not custom ones.

Operations:
Synchronization: all processes wait
Data Movement: broadcast, scatter/gather
  scatter = take one structure and split it among processes
  gather = take data from all processes and combine it
Reduction: one process combines the results of all the others

MPI_Barrier()

All processes wait at this point.

MPI_Barrier(comm)
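A tiny sketch of one common use, lining ranks up before a timed region (assumed to run inside an initialized MPI program):

    double t0;

    MPI_Barrier(MPI_COMM_WORLD);   /* nobody proceeds until every rank arrives */
    t0 = MPI_Wtime();              /* then time the interesting region */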

MPI_Bcast()

MPI_Bcast(&buffer,count,datatype,root,comm)

Sends data from the root process to every other process. It is blocking; when encountering a Bcast, all nodes wait until they have received the data.
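A small sketch, assumed to run inside an initialized MPI program; rank 0 as root and the single int are arbitrary choices:

    int rank, value = 0;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) value = 123;    /* only the root has the value initially */

    /* after this call every rank's copy of value is 123 */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);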

MPI_Scatter() / MPI_Gather()

MPI_Scatter(&sendbuf,sendcnt,sendtype,&recvbuf,recvcnt,recvtype,root,comm)

Copies sendcnt-sized chunks of sendbuf to each process's recvbuf.

MPI_Gather(&sendbuf,sendcnt,sendtype,&recvbuf,recvcount,recvtype,root,comm)

You have to take care if the area you are sending is not a multiple of your number of ranks.
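A sketch of the pair, assuming the array size N divides evenly by the number of ranks and an initialized MPI program; the buffer names are made up:

    #define N 16
    int A[N];       /* full array, only meaningful on the root */
    int chunk[N];   /* per-rank piece; only N/size elements are used */
    int rank, size;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) for (int i = 0; i < N; i++) A[i] = i;

    /* each rank gets N/size ints from A into chunk */
    MPI_Scatter(A, N/size, MPI_INT, chunk, N/size, MPI_INT, 0, MPI_COMM_WORLD);

    /* ... work on chunk ... */

    /* root collects the pieces back into A, in rank order */
    MPI_Gather(chunk, N/size, MPI_INT, A, N/size, MPI_INT, 0, MPI_COMM_WORLD);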

MPI_Reduce()

MPI_Reduce(void* send_data, void* recv_data, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm communicator)

Operations:
MPI_MAX, MPI_MIN: max, min
MPI_SUM: sum
MPI_PROD: product
MPI_LAND, MPI_BAND: logical/bitwise AND
MPI_LOR, MPI_BOR: logical/bitwise OR
MPI_LXOR, MPI_BXOR: logical/bitwise XOR
MPI_MAXLOC, MPI_MINLOC: value and location

You can also create custom operations.
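A sketch of summing one value per rank onto rank 0, assumed to run inside an initialized MPI program; the variable names and per-rank value are made up:

    int rank, local_sum, total = 0;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local_sum = rank;    /* stands in for each rank's partial result */

    /* rank 0 ends up with the sum of local_sum across all ranks */
    MPI_Reduce(&local_sum, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %d\n", total);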

MPI_Allgather()

Gathers to all processes. Equivalent to gathering back to the root and then rebroadcasting to all.

MPI_Allreduce()

Like an MPI_Reduce followed by an MPI_Bcast.

MPI_Allreduce(void* send_data, void* recv_data, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm communicator)

Once the reduction is done, the result is broadcast to all processes.
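For comparison with MPI_Reduce, a small sketch where every rank ends up with the total (variable names are made up; assumed to run inside an initialized MPI program):

    int rank, local, global_sum;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = rank + 1;

    /* every rank, not just a root, receives the sum */
    MPI_Allreduce(&local, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);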

MPI_Reduce_scatter()

Does a reduction and then scatters the result among the processes.

MPI_Alltoall()

Scatters data from all processes to all processes.

MPI_Scatterv()

Vector scatter: send non-contiguous chunks. In addition to the regular scatter parameters, it takes a list of start offsets and lengths.
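A sketch of those extra arguments, assuming exactly 4 ranks, made-up uneven chunk sizes, and an initialized MPI program:

    int sendcounts[4] = {3, 2, 4, 1};   /* length of each rank's chunk */
    int displs[4]     = {0, 3, 5, 9};   /* start offset of each chunk in sendbuf */
    int sendbuf[10];                     /* would be filled in on the root */
    int recvbuf[4];
    int rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* each rank i receives sendcounts[i] ints starting at sendbuf[displs[i]] */
    MPI_Scatterv(sendbuf, sendcounts, displs, MPI_INT,
                 recvbuf, sendcounts[rank], MPI_INT, 0, MPI_COMM_WORLD);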

MPI_Scan()

Lets you do partial reductions.

Custom Data Types

You can create custom data types that aren't the MPI defaults, sort of like structures.

Open question: can you just cast your data to integers and un-cast it on the other side?
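One way this can look, a sketch of the simplest case using MPI_Type_contiguous to build a type covering three consecutive ints (the type name three_ints is made up):

    MPI_Datatype three_ints;

    /* describe 3 consecutive MPI_INTs as one new type */
    MPI_Type_contiguous(3, MPI_INT, &three_ints);
    MPI_Type_commit(&three_ints);

    /* can now be used anywhere a datatype is expected, e.g.: */
    /* MPI_Send(buf, 1, three_ints, dest, tag, MPI_COMM_WORLD); */

    MPI_Type_free(&three_ints);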

Groups vs Communicators

You can create custom groups if you don't want to broadcast to all processes.

Virtual Topologies

Map the processes to a geometric shape (grid or graph). It doesn't have to match the underlying hardware.

Examples

See the provided tar file with example code.

Running MPI code

mpiexec -n 4 ./test_mpi

You'll often see mpirun instead. Some implementations have that, but it's not the official standard way.

Send Example

mpi_send.c

Run with mpirun -np 4 ./mpi_send

Sends 1 million integers (each with a value of 1) to each node. Each adds up 1/4th and then sends only the sum (a single int) back.

Notice this is a lot like pthreads, where we have to do a lot of the work manually.
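A rough sketch of that manual pattern (this is not the actual mpi_send.c, just an illustration; it assumes an initialized MPI program whose rank count divides N evenly, and the tags are arbitrary):

    #define N 1000000
    static int A[N];
    int rank, size, i, sum = 0, total = 0, part;
    MPI_Status status;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (i = 0; i < N; i++) A[i] = 1;
        /* hand each other rank its share of the array */
        for (i = 1; i < size; i++)
            MPI_Send(&A[i*(N/size)], N/size, MPI_INT, i, 0, MPI_COMM_WORLD);
        for (i = 0; i < N/size; i++) sum += A[i];   /* rank 0 does its own share */
        total = sum;
        /* collect the partial sums one by one */
        for (i = 1; i < size; i++) {
            MPI_Recv(&part, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &status);
            total += part;
        }
        printf("total = %d\n", total);
    }
    else {
        MPI_Recv(A, N/size, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        for (i = 0; i < N/size; i++) sum += A[i];
        MPI_Send(&sum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }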

TODO: blocking vs non-blocking example?

Wtime Example

mpi_wtime.c

Same as the previous example, but with timing. Unlike PAPI, the time is returned as a floating point value.

Barrier Example

mpi_barrier.c

Each machine sleeps some amount of time based on its rank. All wait at the barrier until the last one arrives.

Bcast Example

mpi_bcast.c

Same buffer on each machine. At the broadcast call, one rank sends its version of the buffer and the rest wait until they receive the value. In the end they all have the same value.

Scatter Example

mpi_scatter.c

Instead of sending all of A, it breaks it into chunks and sends one to B in each rank.

Gather Example

mpi_gather.c

Each rank has its own copy of A, which it sets entirely to its rank number. Then a gather of one int from each rank happens on rank 0. So what should B have in it? (0, 1, 2, 3, ...)

Reduce Example

mpi_reduce.c

Instead of waiting in a loop for tasks to finish and then adding up the results one by one, use a reduction instead. Many MPI routines are conveniences that could be done by a sequence of separate commands.