Lecture 7: Distributed memory

David Bindel, 15 Feb 2010

Logistics
- HW 1 due Wednesday. See wiki for notes on:
  - Bottom-up strategy and debugging
  - Matrix allocation issues
  - Using SSE and alignment comments
  - Timing and OS scheduling
  - Gnuplot (to go up shortly)
- Submit by emailing me a tar or zip file. For your writeup, text or PDF preferred!
- Please also submit a feedback form (see web page).
- Next HW: a particle dynamics simulation.

Plan for this week
- Last week: shared memory programming
  - Shared memory HW issues (cache coherence)
  - Threaded programming concepts (pthreads and OpenMP)
  - A simple example (Monte Carlo)
- This week: distributed memory programming
  - Distributed memory HW issues (topologies, cost models)
  - Message-passing programming concepts (and MPI)
  - A simple example ("sharks and fish")

Basic questions
- How much does a message cost?
  - Latency: time to get between processors
  - Bandwidth: data transferred per unit time
- How does contention affect communication?
This is a combined hardware-software question! We want to understand just enough for reasonable modeling.

Thinking about interconnects
Several features characterize an interconnect:
- Topology: who do the wires connect?
- Routing: how do we get from A to B?
- Switching: circuits, store-and-forward?
- Flow control: how do we manage limited resources?

Thinking about interconnects
- Links are like streets
- Switches are like intersections
- Hops are like blocks traveled
- Routing algorithm is like a travel plan
- Stop lights are like flow control
- Short packets are like cars, long ones like buses?
At some point the analogy breaks down...

Bus topology
[Diagram: P0-P3, each with a cache ($), sharing one memory (Mem) over a single bus]
- One set of wires (the bus)
- Only one processor allowed at any given time
- Contention for the bus is an issue
- Example: basic Ethernet, some SMPs

Crossbar
[Diagram: crossbar switch connecting inputs P0-P3 to outputs P0-P3]
- Dedicated path from every input to every output
- Takes O(p^2) switches and wires!
- Example: recent AMD/Intel multicore chips (older: front-side bus)

Bus vs. crossbar
- Crossbar: more hardware
- Bus: more contention (less capacity?)
- Generally seek happy medium:
  - Less contention than a bus
  - Less hardware than a crossbar
  - May give up one-hop routing

Network properties
Think about latency and bandwidth via two quantities:
- Diameter: maximum distance between nodes
- Bisection bandwidth: the smallest total bandwidth across any cut that splits the network into two equal halves
Bisection bandwidth is particularly important for all-to-all communication.

Linear topology
- p - 1 links
- Diameter: p - 1
- Bisection bandwidth: 1

Ring topology
- p links
- Diameter: p/2
- Bisection bandwidth: 2

Mesh
- May be more than two dimensions
- Route along each dimension in turn

Torus
Torus : Mesh :: Ring : Linear

Hypercube
- Label processors with binary numbers
- Connect p1 to p2 if labels differ in one bit
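As a quick check against the network properties above: with p = 2^d processors, each node has d = log2(p) links, the diameter is log2(p) (fix one differing bit per hop), and the bisection bandwidth is p/2 (cutting on the leading bit severs exactly p/2 links).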

Fat tree
- Processors at leaves
- Increase link bandwidth near root

Others...
- Butterfly network
- Omega network
- Cayley graph

Current picture
- Old: latencies = hops
- New: roughly constant latency (?)
  - Wormhole routing (or cut-through) flattens latencies vs store-and-forward at the hardware level
  - Software stack dominates HW latency!
  - Latencies are not the same between networks (in-box vs across)
  - May also have store-and-forward at the library level
- Old: mapping algorithms to topologies
- New: avoid topology-specific optimization
  - Want code that runs on next year's machine, too!
  - Bundle topology awareness in vendor MPI libraries?
  - Sometimes specify a software topology

α-β model
Crudest model: t_comm = α + βM
- t_comm = communication time
- α = latency
- β = inverse bandwidth
- M = message size
Works pretty well for basic guidance!
Typically α ≫ β ≫ t_flop. More money on the network buys a lower α.

LogP model
Like α-β, but includes CPU time on send/recv:
- Latency (L): the usual
- Overhead (o): CPU time to send/recv
- Gap (g): minimum time between consecutive sends/recvs
- P: number of processors
Assumes small messages (for a fixed message size, the gap plays the role of an inverse bandwidth).
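One standard reading of the model: a single small message from one processor to another costs roughly L + 2o (send overhead, network latency, receive overhead), and each processor can issue at most one message per gap g, so a stream of back-to-back sends is limited to about 1/g messages per unit time.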

Communication costs
Some basic goals:
- Prefer larger messages to smaller ones (amortize the latency cost)
- Avoid communication when possible
  - Great speedup for Monte Carlo and other embarrassingly parallel codes!
- Overlap communication with computation (see the estimate below)
  - Models tell you how much computation is needed to mask communication costs.
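As a rough estimate under the α-β model above: completely hiding a message of M bytes behind computation requires on the order of (α + βM)/t_flop independent flops per message.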

Message passing programming
Basic operations:
- Pairwise messaging: send/receive
- Collective messaging: broadcast, scatter/gather
- Collective computation: sum, max, other parallel prefix ops
- Barriers (no need for locks!)
- Environmental inquiries (who am I? do I have mail?)
(Much of what follows is adapted from Bill Gropp's material.)

MPI
- Message Passing Interface
- An interface spec; many implementations
- Bindings to C, C++, Fortran

Hello world

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Communicators
- Processes form groups
- Messages sent in contexts
  - Separate communication for libraries
- Group + context = communicator
- Identify process by rank in group
- Default is MPI_COMM_WORLD
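For instance, a minimal sketch of splitting MPI_COMM_WORLD into two smaller communicators with MPI_Comm_split (the even/odd grouping and variable names are purely illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int rank, subrank;
    MPI_Comm subcomm;  /* communicator for my group */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Split world into two groups (even and odd ranks); each group gets
       its own context, and processes are re-ranked within the group. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &subcomm);
    MPI_Comm_rank(subcomm, &subrank);
    printf("world rank %d -> group %d rank %d\n", rank, rank % 2, subrank);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}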

Sending and receiving
Need to specify:
- What's the data? Different machines use different encodings (e.g. endianness), so a "bag o' bytes" model is inadequate.
- How do we identify processes?
- How does the receiver identify messages?
- What does it mean to complete a send/recv?

MPI datatypes
Message is (address, count, datatype). Allow:
- Basic types (MPI_INT, MPI_DOUBLE)
- Contiguous arrays
- Strided arrays (see the sketch below)
- Indexed arrays
- Arbitrary structures
Complex data types may hurt performance?
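For example, a strided array can be described to MPI with a derived datatype instead of packing it by hand. A minimal sketch, assuming a length-400 array of doubles and a neighbor at rank 1 (both made up for illustration):

#include <mpi.h>

/* Sketch: send every 4th double of a length-400 array as one message,
   using a derived strided datatype. */
void send_strided_example(double* a /* length 400 */)
{
    MPI_Datatype stride_t;
    MPI_Type_vector(100, 1, 4, MPI_DOUBLE, &stride_t);  /* 100 blocks of 1, stride 4 */
    MPI_Type_commit(&stride_t);
    MPI_Send(a, 1, stride_t, 1, 0, MPI_COMM_WORLD);
    MPI_Type_free(&stride_t);
}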

MPI tags
- Use an integer tag to label messages
- Help distinguish different message types
- Can screen messages with wrong tag
- MPI_ANY_TAG is a wildcard

MPI Send/Recv
Basic blocking point-to-point communication:

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm,
             MPI_Status *status);

MPI send/recv semantics
- Send returns when data gets to the system... might not yet arrive at destination!
- Recv ignores messages that don't match source and tag
  - MPI_ANY_SOURCE and MPI_ANY_TAG are wildcards
- Recv status contains more info (tag, source, size)
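For instance, a minimal sketch of a wildcard receive that then inspects the status for the sender, tag, and actual message size (the buffer size and datatype are illustrative):

#include <mpi.h>
#include <stdio.h>

/* Sketch: receive up to 100 ints from any source with any tag, then ask
   the status who sent it, with what tag, and how many ints arrived. */
void recv_any_example(void)
{
    int buf[100];
    int count;
    MPI_Status status;

    MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_INT, &count);
    printf("got %d ints from rank %d with tag %d\n",
           count, status.MPI_SOURCE, status.MPI_TAG);
}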

Ping-pong pseudocode

Process 0:
  for i = 1:ntrials
    send b bytes to 1
    recv b bytes from 1
  end

Process 1:
  for i = 1:ntrials
    recv b bytes from 0
    send b bytes to 0
  end

Ping-pong MPI

void ping(char* buf, int n, int ntrials, int p)
{
    for (int i = 0; i < ntrials; ++i) {
        MPI_Send(buf, n, MPI_CHAR, p, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, n, MPI_CHAR, p, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
}

(Pong is similar; see the sketch below.)
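A matching pong mirrors the pseudocode above, receiving first and then sending; a minimal sketch:

void pong(char* buf, int n, int ntrials, int p)
{
    for (int i = 0; i < ntrials; ++i) {
        MPI_Recv(buf, n, MPI_CHAR, p, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(buf, n, MPI_CHAR, p, 0, MPI_COMM_WORLD);
    }
}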

Ping-pong MPI

for (int sz = 1; sz <= MAX_SZ; sz += 1000) {
    if (rank == 0) {
        clock_t t1, t2;
        t1 = clock();
        ping(buf, sz, NTRIALS, 1);
        t2 = clock();
        printf("%d %g\n", sz, (double) (t2-t1)/CLOCKS_PER_SEC);
    } else if (rank == 1) {
        pong(buf, sz, NTRIALS, 0);
    }
}
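A side note on timing: clock() measures processor time. MPI also provides a wall-clock timer, MPI_Wtime, which is usually what you want for communication benchmarks; a minimal sketch, assuming the ping routine above:

/* Time one ping-pong sweep using MPI's wall-clock timer. */
double time_pings(char* buf, int sz, int ntrials)
{
    double t1 = MPI_Wtime();    /* wall-clock seconds */
    ping(buf, sz, ntrials, 1);  /* ntrials round trips with rank 1 */
    return MPI_Wtime() - t1;    /* total elapsed seconds */
}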

Running the code
On my laptop (OpenMPI):

mpicc -std=c99 pingpong.c -o pingpong.x
mpirun -np 2 ./pingpong.x

Details vary, but this is pretty normal.

Approximate α-β parameters (2-core laptop)
[Plot: measured time per message vs. message size b (0 to ~18000 bytes), with the α-β model fit overlaid]
α ≈ 1.46 × 10^-6, β ≈ 3.89 × 10^-10
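Plugging these into t_comm = α + βM: the bandwidth term matches the latency term at M = α/β ≈ 3.8 kB, so a 1-byte message costs essentially α (about 1.5 μs), while an 8000-byte message costs roughly 1.46 × 10^-6 + 3.89 × 10^-10 × 8000 ≈ 4.6 μs.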

Where we are now
Can write a lot of MPI code with the 6 operations we've seen:
- MPI_Init
- MPI_Finalize
- MPI_Comm_size
- MPI_Comm_rank
- MPI_Send
- MPI_Recv
... but there are sometimes better ways.
Next time: non-blocking and collective operations!