Lecture 7. Revisiting MPI performance & semantics. Strategies for parallelizing an application. Word problems.


Announcements
Quiz #1 in section on Friday.
Midterm: Monday 10/30, 7:00 to 8:20 PM, room SSB 106.
No lectures 11/13 and 11/15; makeup lecture on 11/17, 10:00 to 11:20 AM, room TBA.
Section will be held on Wednesday 11/15 in our classroom at the usual meeting time.

K-ary d-cubes
Definition: a k-ary d-cube is an interconnection network with k^d nodes.
There are k nodes along each of the d axes, with end-around connections.
It is a generalization of the mesh and the hypercube; the hypercube is the special case k = 2.

K-ary d-cubes
Derive the diameter, number of links, and bisection bandwidth of a k-ary d-cube with p = k^d nodes.
Diameter: d * floor(k/2)
Bisection bandwidth: 2k^(d-1)
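A small sketch (not from the slides) that evaluates these formulas for checking the exercise; the link count d*k^d is an assumption based on each node having 2d end-around links, each shared by two nodes.

#include <cstdio>

// Metrics of a k-ary d-cube (torus) with p = k^d nodes and end-around links.
struct CubeMetrics {
    long nodes;      // k^d
    long diameter;   // d * floor(k/2)
    long links;      // d * k^d (assumed: 2d links per node, each counted once)
    long bisection;  // 2 * k^(d-1)
};

CubeMetrics kary_dcube(long k, long d) {
    long kd = 1;
    for (long i = 0; i < d; ++i) kd *= k;
    return { kd, d * (k / 2), d * kd, 2 * (kd / k) };
}

int main() {
    CubeMetrics m = kary_dcube(4, 3);   // a 4-ary 3-cube: 64 nodes
    std::printf("nodes=%ld diameter=%ld links=%ld bisection=%ld\n",
                m.nodes, m.diameter, m.links, m.bisection);
}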

Hamiltonians
A Hamiltonian cycle is a path in an undirected graph that visits each node exactly once.
Any Hamiltonian circuit on a labeled hypercube defines a Gray code [Skiena].
Map a 1D ring onto a hypercube, a mesh, and a k-ary d-cube.
www.cs.sunysb.edu/~skiena/combinatorica/animations/ham.html
mathworld.wolfram.com/hamiltoniancircuit.html
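For the hypercube case, a minimal sketch (not from the slides) of the binary reflected Gray code mapping: ring position i goes to hypercube node i XOR (i >> 1), so consecutive ring positions land on hypercube nodes that differ in one bit, i.e., on neighbors.

#include <cstdio>

// Binary reflected Gray code: consecutive values differ in exactly one bit,
// so mapping ring position i to hypercube node gray(i) places ring neighbors
// on hypercube neighbors.
unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main() {
    const unsigned d = 3;                  // 3-dimensional hypercube, 8 nodes
    for (unsigned i = 0; i < (1u << d); ++i)
        std::printf("ring position %u -> hypercube node %u\n", i, gray(i));
}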

Prefix sum
The prefix sum (also called a sum-scan) of a sequence of numbers x_k is in turn a sequence of running sums S_k defined as follows:
S_0 = 0, S_k = S_(k-1) + x_k
Thus, scan(3, 1, 4, 0, 2) = (3, 4, 8, 8, 10).
Design an algorithm for prefix sum based on the array interconnect (without end-around connections), and give the accompanying performance model.
Repeat for the hypercube, using the Gray coding to define the layout of values across processors.

Prefix sum algorithm
S = PrefixSum(x) {
    if ($rank == 0) { S = x; Send(S, 1); }
    for i = 1 to P-1
        if ($rank == i) {
            Recv(y, rank-1);
            S = x + y;
            if ($rank < P-1) Send(S, rank+1);
        }
}
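A minimal MPI rendering of the same ring algorithm, as a sketch: it assumes each rank holds a single value (here its rank plus one, chosen only for illustration) and forwards the running sum to its right neighbor.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    double x = rank + 1.0;      // this rank's contribution
    double S = x;
    if (rank > 0) {             // receive the running sum from the left neighbor
        double y;
        MPI_Recv(&y, 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        S += y;
    }
    if (rank < P - 1)           // forward the running sum to the right neighbor
        MPI_Send(&S, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);

    std::printf("rank %d: prefix sum = %g\n", rank, S);
    MPI_Finalize();
}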

Performance model
Number of additions per node = 1
Number of messages sent per node = 1 (except the last node)
Number of messages received per node = 1 (except the first node)
Let the startup time be α time units, and let β be 1/(peak bandwidth).
Running time = (N-1)(α + β(1)) + (N-1)

Time-constrained scaling
Consider a summation algorithm: we add up N numbers on P processors, with N >> P.
Determine the largest problem that can be solved in time T = 10^4 time units on 512 processors.

Performance model
Local additions: N/P - 1
Reduction: (1 + α)(lg P - 1)
T(N, P) ≈ N/P + (1 + α) lg P
Determine the largest problem that can be solved in time T = 10^4 time units on P = 512 processors, with α = 10 time units and addition costing one unit of time.
Consider T(N, 512) ≤ 10^4:
(N/512) + (1 + α)(lg 512 - 1) ≤ 10^4
(N/512) + 90 ≤ 10^4 (approximately)
N ≤ 5 × 10^6 (approximately)
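A quick numeric check of that bound, as a sketch using the model above (α = 10, one time unit per addition):

#include <cmath>
#include <cstdio>

int main() {
    const double P = 512, alpha = 10, Tmax = 1e4;
    // Reduction term from the model: (1 + alpha) * (lg P - 1), about 88 time units.
    double reduction = (1 + alpha) * (std::log2(P) - 1);
    // Solve N/P + reduction <= Tmax for N.
    double Nmax = P * (Tmax - reduction);
    std::printf("largest N ~ %.3g\n", Nmax);   // about 5.07e+06
}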

Revisiting communication costs
Plots from Valkyrie and from Blue Horizon.

Message bandwidth
Ring bandwidth (Fall 2003): MPI Ring Test, 8 nodes, 20 iterations.
[Plot: bandwidth versus message length, for message lengths from 100 to about 12,700 bytes.]

Ring variations in running time (Fall 2003) [plot]

Ring: copy costs [plot]

Correctness issues
When there are multiple outstanding Irecvs, MPI doesn't say how incoming messages are matched, or even whether the process is fair.

MPI_Request req1, req2;
MPI_Status status;
MPI_Irecv(buff,  len, CHAR, nextnode, TYPE, WORLD, &req1);
MPI_Irecv(buff2, len, CHAR, prevnode, TYPE, WORLD, &req2);
MPI_Send(buff,   len, CHAR, nextnode, TYPE, WORLD);
MPI_Wait(&req1, &status);
MPI_Wait(&req2, &status);
MPI_Send(buff2,  len, CHAR, prevnode, TYPE, WORLD);
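One way to remove the ambiguity, sketched below under assumed names (the tag values and the helper function are illustrative, not from the slides): give each logical message its own tag and its own buffer, so each Irecv can only match the intended message, and complete both requests with MPI_Waitall.

#include <mpi.h>

void exchange(char* sendbuf, char* recv_from_prev, char* recv_from_next, int len,
              int prevnode, int nextnode, MPI_Comm comm) {
    const int TAG_FROM_PREV = 100, TAG_FROM_NEXT = 101;   // assumed tag values
    MPI_Request req[2];
    // Each receive can only match a message carrying its specific tag.
    MPI_Irecv(recv_from_prev, len, MPI_CHAR, prevnode, TAG_FROM_PREV, comm, &req[0]);
    MPI_Irecv(recv_from_next, len, MPI_CHAR, nextnode, TAG_FROM_NEXT, comm, &req[1]);
    // The send buffer is distinct from both receive buffers, so the pending
    // receives cannot race with the sends.
    MPI_Send(sendbuf, len, MPI_CHAR, nextnode, TAG_FROM_PREV, comm);
    MPI_Send(sendbuf, len, MPI_CHAR, prevnode, TAG_FROM_NEXT, comm);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}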

Datastar communication characteristics
Message length (bytes)   Bandwidth (MB/sec)
22528   790.08
22656   792.64
22784   795.81
22912   798.60
23040   807.12
23168   806.20
23296   810.40
n_1/2: the message length at which bandwidth reaches half of the 1611 MB/sec peak (the table brackets the crossing)
α = 5.4 μs; T(4 B) = 5.7 μs
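These numbers come from measurement; a minimal ping-pong sketch of the kind used to produce them (assumed structure, not the course's actual benchmark) times round trips between ranks 0 and 1 and reports one-way time and bandwidth per message length.

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int iters = 100;
    for (int len = 4; len <= (1 << 20); len *= 2) {
        std::vector<char> buf(len);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; ++i) {
            if (rank == 0) {
                MPI_Send(buf.data(), len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf.data(), len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf.data(), len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf.data(), len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * iters);   // one-way time per message
        if (rank == 0)
            std::printf("%8d bytes  %10.2f us  %8.2f MB/s\n",
                        len, t * 1e6, len / t / 1e6);
    }
    MPI_Finalize();
}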

Short messages, in increments of 4 bytes [plot]

1K to 70K [plot]

Debugging and coding example
Print out a matrix distributed over multiple processors.
Collect the values on a single processor for output.

    p(0,0)  p(0,1)  p(0,2)
    p(1,0)  p(1,1)  p(1,2)
    p(2,0)  p(2,1)  p(2,2)

Design
Create row and column communicators.
Collect the data in each row to the row root.
Collect the row buffers along that column to the global root (see the sketch after this slide).

[Diagram: the 3 x 3 grid p(0,0) ... p(2,2), shown before and after the row gathers.]
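A sketch of building the row and column communicators with MPI_Comm_split, under the assumption of a q x q grid with ranks numbered row-major (the function and variable names are illustrative):

#include <mpi.h>

void make_grid_comms(int q, MPI_Comm world, MPI_Comm* row_comm, MPI_Comm* col_comm) {
    int rank;
    MPI_Comm_rank(world, &rank);
    int my_row = rank / q;              // row-major numbering of the q x q grid
    int my_col = rank % q;
    // Processes sharing a row index land in the same row_comm; the key orders
    // them by column, so the row root (rank 0 in row_comm) is column 0.
    MPI_Comm_split(world, my_row, my_col, row_comm);
    // Likewise for columns: the column root is row 0.
    MPI_Comm_split(world, my_col, my_row, col_comm);
}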

Gather
int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
               void *recvbuf, int recvcount, MPI_Datatype recvtype,
               int root, MPI_Comm comm)

[Diagram: Gather collects one buffer from each of P_0 ... P_(p-1) at the root; Scatter is the inverse.]

Code
N = size of the array, q = √p, nsq = (N/q)^2

MPI_Comm row_comm, col_comm;
ROOT = 0;

// Each row root gathers the blocks in its row
float Local_Matrix[N/q][N/q];
float* rows = new float[q*nsq];
MPI_Gather(Local_Matrix, nsq, MPI_FLOAT,
           rows, nsq, MPI_FLOAT, ROOT, row_comm);

// Gather all the data from each row root
float* cols = new float[N*N];
MPI_Gather(rows, q*nsq, MPI_FLOAT,
           cols, q*nsq, MPI_FLOAT, ROOT, col_comm);

Finishing up
if (!grid.my_rankr && !grid.my_rankc) {
    OUTPUT
}

[Diagram: the 3 x 3 process grid, with the global root p(0,0) performing the output.]
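Putting the pieces together, a self-contained sketch (assumed layout: a q x q grid with row-major ranks, each rank owning an (N/q) x (N/q) block; all names and demo values are illustrative, not the course's code) that gathers a distributed matrix to the global root in the two steps above and prints what arrived:

#include <mpi.h>
#include <cstdio>
#include <vector>
#include <cmath>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    // Assumes p is a perfect square and N is divisible by q.
    const int q = (int)std::lround(std::sqrt((double)p));   // q x q process grid
    const int N = 6;                                         // global matrix is N x N
    const int b = N / q, nsq = b * b;                        // block size per rank
    const int my_row = rank / q, my_col = rank % q;

    // Row and column communicators, as on the Design slide.
    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, my_row, my_col, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, my_col, my_row, &col_comm);

    // Each rank's local block; filled with its rank number for demonstration.
    std::vector<float> local(nsq, (float)rank);

    // Step 1: each row root (column 0) gathers the blocks in its row.
    std::vector<float> rows(my_col == 0 ? q * nsq : 0);
    MPI_Gather(local.data(), nsq, MPI_FLOAT,
               rows.data(), nsq, MPI_FLOAT, 0, row_comm);

    // Step 2: the global root (row 0, column 0) gathers all row buffers.
    std::vector<float> all(rank == 0 ? N * N : 0);
    if (my_col == 0)
        MPI_Gather(rows.data(), q * nsq, MPI_FLOAT,
                   all.data(), q * nsq, MPI_FLOAT, 0, col_comm);

    // The gathered data arrives block by block in rank order; print it block-wise.
    if (rank == 0) {
        for (int blk = 0; blk < p; ++blk) {
            std::printf("block from rank %d:", blk);
            for (int i = 0; i < nsq; ++i) std::printf(" %4.0f", all[blk * nsq + i]);
            std::printf("\n");
        }
    }
    MPI_Finalize();
}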