
Intro to MPI

Last Time: intro to parallel algorithms; parallel search; parallel sorting (merge sort, sample sort).

Today: network topology; communication primitives; the Message Passing Interface (MPI); randomized algorithms; graph coloring.

Network Model: message-passing model over distributed-memory nodes; no shared RAM; asynchronous. The network is a graph G = (N, E) whose nodes are processors and whose edges are two-way communication links. Two basic communication constructs: send(data, to_proc_i) and recv(data, from_proc_j).
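A minimal sketch of how these two constructs look in MPI's C API (introduced later in the lecture); the choice of ranks 0 and 1, the tag 0, and the value 42 are arbitrary illustrations:

/* process 0 sends one integer to process 1; run with at least two processes */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);                     /* send(data, to_proc_1) */
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  /* recv(data, from_proc_0) */
        printf("rank 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}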

Network Topologies: line, ring, mesh, wraparound mesh. A link connects one pair of tasks and is bidirectional; a node performs one send/recv at a time. Send-recv time: latency + message size / bandwidth. What about a hypercube?

Hypercube: 3-cube, 4-cube, ..., q-cube (generic term). A d-dimensional hypercube has 2^d processes; two processes are connected iff their binary labels differ in exactly one bit; the maximum distance (diameter) is log p. [Figure: binary 1-, 2-, 3-, and 4-cubes, each built from two copies of the next smaller cube, labeled 0 and 1.]

Network Topologies: Fat Tree.

Basic concepts in networks:
Routing: delivering messages.
Diameter: maximum distance between nodes.
Communication costs: latency (time to initiate communication) and bandwidth (channel capacity, data/time); both the number and the size of messages matter.
Point-to-point and collective communications: synchronization, all-to-all messages.

Network models let us reason about and measure communication costs and non-locality.

p2p cost in MPI:
Start-up time t_s: add header/trailer, error correction, execute the routing algorithm, establish the connection between source and destination (the latency term).
Per-hop time t_h: time to travel between two directly connected nodes (node latency).
Per-word transfer time t_w: 1 / bandwidth.
We will assume that the cost of sending a message of m words is t_comm = t_s + t_w * m.
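As a quick worked example of the model (the numbers are illustrative, not measurements of any real machine): with t_s = 10 microseconds and t_w = 1 nanosecond per word, sending m = 10^6 words costs t_comm = 10 us + 10^6 * 1 ns = 10 us + 1000 us, about 1 ms, so the bandwidth term dominates; sending m = 10 words costs about 10.01 us, so the start-up latency dominates. This is why the number of messages matters as much as their total size.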

Network Metrics: ring vs. fully connected network.
Diameter (maximum distance between any two nodes): ring p/2; fully connected 1.
Connectivity (minimum number of links whose removal isolates a node): ring 2; fully connected p-1.
Bisection width (number of links cut to break the network into two equal halves): ring 2; fully connected p^2/4.
Cost (total number of links): ring p; fully connected p(p-1)/2.

Network Metrics: hypercube.
Diameter (maximum distance between any two nodes): log p.
Connectivity (minimum number of links whose removal isolates a node): log p (= d).
Bisection width (number of links cut to break the network into two equal halves): p/2.
Cost (total number of links): (p log p)/2.
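For example, a 4-dimensional hypercube (d = 4, p = 16) has diameter log p = 4, connectivity 4, bisection width p/2 = 8, and cost (p log p)/2 = 32 links; a 16-node ring, by contrast, has diameter 8 and bisection width 2.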

Example problem: how would you perform a fully distributed matrix-matrix multiplication (dgemm)? Assume a 2D grid topology of processes P11 ... P44 and compute C = A B, where C_ij = sum_k A_ik B_kj. How should the data be divided across processes? What is the local work? What communication is needed, and at what cost?

Matrix-Matrix Multiplication on a 2D Mesh: compute C = AB with p = n x n processes (one element per process). Rows of A are fed in from the left; columns of B are fed in from the top. The operation proceeds synchronously. Processes are indexed by (i, j), and process (i, j) computes C_ij.

Matrix-Matrix Multiplication
% (i,j) = process id; p = n*n, one element of C per process
for k = 1:n
    recv( A(i,k), from (i, j-1) )   % A(i,k) streams left-to-right along row i
    recv( B(k,j), from (i-1, j) )   % B(k,j) streams top-to-bottom down column j
    C(i,j) += A(i,k) * B(k,j)
    send( A(i,k), to (i, j+1) )
    send( B(k,j), to (i+1, j) )
end
[Figure: 4 x 4 process grid P11..P44, with row i of A, i.e. A(i,4)...A(i,1), queued to the left of process row i and column j of B, i.e. B(4,j)...B(1,j), queued above process column j.]

Collective Communications

MPI Collectives
A communicator (MPI_Comm) determines the scope within which a point-to-point or collective operation operates. Communicators are dynamic: they can be created and destroyed during program execution. The predefined communicator MPI_COMM_WORLD contains all processes.
int MPI_Barrier(MPI_Comm comm);
Synchronizes all processes within a communicator (avoid it where possible).
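A small sketch of both points (communicators can be created and destroyed at run time, and collectives such as MPI_Barrier act only within their communicator); splitting MPI_COMM_WORLD by rank parity is just an illustrative choice:

#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Comm half;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* create a new communicator: even ranks in one group, odd ranks in the other */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &half);

    MPI_Barrier(half);      /* synchronizes only the processes in 'half' */

    MPI_Comm_free(&half);   /* communicators are dynamic: destroy when no longer needed */
    MPI_Finalize();
    return 0;
}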

Data Movement: broadcast, gather(v), scatter(v), allgather(v), alltoall(v).

Broadcast:
int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
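A short usage sketch (the array contents and root rank 0 are arbitrary): the root fills a small array and MPI_Bcast makes every process in the communicator see the same values.

#include <mpi.h>

int main(int argc, char **argv) {
    double params[4] = {0, 0, 0, 0};
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) { params[0] = 1.5; params[1] = 2.5; params[2] = 3.5; params[3] = 4.5; }
    MPI_Bcast(params, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    /* after the call, every rank holds {1.5, 2.5, 3.5, 4.5} */

    MPI_Finalize();
    return 0;
}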

Gather & Scatter:
int MPI_Gather(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
int MPI_Scatter(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
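A usage sketch combining the two calls (the data values and the "+100" local work are placeholders): the root scatters one int to each process, each process works on it locally, and the root gathers the results back. Note that scount and rcount are per-process counts, not totals.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, p, x, *all = NULL;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (rank == 0) {
        all = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++) all[i] = i;          /* data to distribute */
    }
    MPI_Scatter(all, 1, MPI_INT, &x, 1, MPI_INT, 0, MPI_COMM_WORLD);
    x += 100;                                            /* local work */
    MPI_Gather(&x, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);
    /* on rank 0: all[i] == i + 100 */

    if (rank == 0) free(all);
    MPI_Finalize();
    return 0;
}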

Allgather:
int MPI_Allgather(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm)
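Sketch: each process contributes its own rank, and after MPI_Allgather every process holds the full array (the effect of a gather followed by a broadcast).

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *everyone = malloc(p * sizeof(int));
    MPI_Allgather(&rank, 1, MPI_INT, everyone, 1, MPI_INT, MPI_COMM_WORLD);
    /* on every rank: everyone[i] == i */

    free(everyone);
    MPI_Finalize();
    return 0;
}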

All to All:
int MPI_Alltoall(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm)
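Sketch of the personalized all-to-all exchange (the encoding 100*i + j is only to make the data traceable): block j of rank i's send buffer ends up as block i of rank j's receive buffer.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *sbuf = malloc(p * sizeof(int)), *rbuf = malloc(p * sizeof(int));
    for (int j = 0; j < p; j++) sbuf[j] = 100 * rank + j;   /* one item per destination */

    MPI_Alltoall(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, MPI_COMM_WORLD);
    /* on rank j: rbuf[i] == 100*i + j */

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}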

Global Computation: Reduce, Scan.

Reduce:
int MPI_Reduce(void* sbuf, void* rbuf, int count, MPI_Datatype stype, MPI_Op op, int root, MPI_Comm comm)
int MPI_Allreduce(void* sbuf, void* rbuf, int count, MPI_Datatype stype, MPI_Op op, MPI_Comm comm)
int MPI_Reduce_scatter(void* sbuf, void* rbuf, int* rcounts, MPI_Datatype stype, MPI_Op op, MPI_Comm comm)
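Sketch of a global sum (each rank's contribution here is just its rank number): MPI_Reduce leaves the combined result only at the root, while MPI_Allreduce leaves it on every rank.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double local = (double)rank, at_root = 0.0, everywhere = 0.0;

    MPI_Reduce(&local, &at_root, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of ranks = %g\n", at_root);   /* p*(p-1)/2 */

    MPI_Allreduce(&local, &everywhere, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    /* 'everywhere' now holds the same sum on every rank */

    MPI_Finalize();
    return 0;
}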

Scan (prefix-op):
int MPI_Scan(void* sbuf, void* rbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
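Sketch of an inclusive prefix sum (the choice local = rank + 1 is only to make the result easy to check; MPI_Exscan gives the exclusive variant):

#include <mpi.h>

int main(int argc, char **argv) {
    int rank, local, prefix;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank + 1;
    MPI_Scan(&local, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    /* on rank i: prefix == 1 + 2 + ... + (i+1) = (i+1)*(i+2)/2 */

    MPI_Finalize();
    return 0;
}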

Assignment 1, Problem 1: implement samplesort in parallel.
Samplesort:
1. Local sort, pick samples
2. Gather samples at root
3. Sort samples, pick splitters
4. Broadcast splitters
5. Bin data into buckets
6. Alltoallv data
7. Local sort
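A rough skeleton of how these seven steps map onto the collectives above; it is a sketch, not a reference solution: the local array size n, the sampling rule (p-1 evenly spaced samples from the locally sorted data), and the use of qsort are illustrative choices.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv) {
    int rank, p, n = 1000;                         /* n = number of local keys (illustrative) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *a = malloc(n * sizeof(int));
    srand(rank + 1);
    for (int i = 0; i < n; i++) a[i] = rand();

    /* 1. local sort, pick p-1 evenly spaced samples */
    qsort(a, n, sizeof(int), cmp_int);
    int *samples = malloc((p - 1) * sizeof(int));
    for (int i = 0; i < p - 1; i++) samples[i] = a[(i + 1) * n / p];

    /* 2.-3. gather samples at root, sort them, pick p-1 splitters */
    int *all_samples = NULL, *splitters = malloc((p - 1) * sizeof(int));
    if (rank == 0) all_samples = malloc(p * (p - 1) * sizeof(int));
    MPI_Gather(samples, p - 1, MPI_INT, all_samples, p - 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        qsort(all_samples, p * (p - 1), sizeof(int), cmp_int);
        for (int i = 0; i < p - 1; i++) splitters[i] = all_samples[(i + 1) * (p - 1)];
    }

    /* 4. broadcast splitters */
    MPI_Bcast(splitters, p - 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* 5. bin local (already sorted) data: scounts[j] = number of keys destined for rank j */
    int *scounts = calloc(p, sizeof(int)), *rcounts = malloc(p * sizeof(int));
    int *sdispls = malloc(p * sizeof(int)), *rdispls = malloc(p * sizeof(int));
    for (int i = 0, j = 0; i < n; i++) {
        while (j < p - 1 && a[i] > splitters[j]) j++;
        scounts[j]++;
    }

    /* 6. exchange bucket sizes, then the keys themselves */
    MPI_Alltoall(scounts, 1, MPI_INT, rcounts, 1, MPI_INT, MPI_COMM_WORLD);
    sdispls[0] = rdispls[0] = 0;
    for (int i = 1; i < p; i++) {
        sdispls[i] = sdispls[i - 1] + scounts[i - 1];
        rdispls[i] = rdispls[i - 1] + rcounts[i - 1];
    }
    int m = rdispls[p - 1] + rcounts[p - 1];
    int *b = malloc(m * sizeof(int));
    MPI_Alltoallv(a, scounts, sdispls, MPI_INT, b, rcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    /* 7. final local sort of the received bucket */
    qsort(b, m, sizeof(int), cmp_int);
    printf("rank %d: %d keys after samplesort\n", rank, m);

    free(a); free(b); free(samples); free(splitters);
    free(scounts); free(rcounts); free(sdispls); free(rdispls);
    if (rank == 0) free(all_samples);
    MPI_Finalize();
    return 0;
}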