Lecture 13. Writing parallel programs with MPI: Matrix Multiplication, Basic Collectives, Managing Communicators


Announcements
Extra lecture Friday 4:00 to 5:20pm, room 2154
A4 posted: Cannon's matrix multiplication on Trestles
Extra office hours today: 2:30 to 5pm

Today's lecture
Writing message passing programs with MPI
Applications: the trapezoidal rule; Cannon's matrix multiplication algorithm
Managing communicators
Basic collectives

The trapezoidal rule
Use the trapezoidal rule to numerically approximate a definite integral, the area under the curve
Divide the interval [a,b] into n segments of width h = (b-a)/n
Area under the i-th trapezoid: (1/2) * (f(a+i*h) + f(a+(i+1)*h)) * h
Area under the entire curve is approximately the sum of all the trapezoids:
    integral from a to b of f(x) dx ≈ h * [ f(a)/2 + f(a+h) + ... + f(b-h) + f(b)/2 ]
[Figure: one trapezoid spanning the segment from a+i*h to a+(i+1)*h]

Reference material
For a discussion of the trapezoidal rule: http://en.wikipedia.org/wiki/trapezoidal_rule
An applet to carry out the integration: http://www.csse.monash.edu.au/~lloyd/tildealgds/numerical/integration
Code on Triton (from the Pacheco hard copy text):
    Serial code: $PUB/Pacheco/ppmpi_c/chap04/serial.c
    Parallel code: $PUB/Pacheco/ppmpi_c/chap04/trap.c

Serial code (following Pacheco)

    float f(float x) { return x*x; }       // Function we're integrating

    main() {
        // a and b: endpoints, n = # of trapezoids
        float h = (b-a)/n;                 // h = trapezoid base width
        float integral = (f(a) + f(b))/2.0;
        float x;
        int i;
        for (i = 1, x = a; i <= n-1; i++) {
            x += h;
            integral = integral + f(x);
        }
        integral = integral*h;
    }

Parallel implementation of the trapezoidal rule
Decompose the integration interval into sub-intervals, one per processor
Each processor computes the integral on its local subdomain
The processors then combine their local integrals into a global one

First version of the parallel code

    local_n = n/p;        // # of trapezoids; assume p divides n evenly
    float local_a = a + my_rank*local_n*h,
          local_b = local_a + local_n*h,
          integral = Trap(local_a, local_b, local_n);

    if (my_rank == ROOT) {        // Sum the integrals calculated by all processes
        total = integral;
        for (source = 1; source < p; source++) {
            MPI_Recv(&integral, 1, MPI_FLOAT, MPI_ANY_SOURCE, tag, WORLD, &status);
            total += integral;
        }
    } else {
        MPI_Send(&integral, 1, MPI_FLOAT, ROOT, tag, WORLD);
    }

Using collective communication
We can take the sums in any order we wish; the result does not depend on the order in which the sums are taken, except to within roundoff
We can often improve performance by taking advantage of global knowledge about communication
Instead of using point-to-point communication operations to accumulate the sum, use collective communication:

    local_n = n/p;
    float local_a = a + my_rank*local_n*h,
          local_b = local_a + local_n*h,
          integral = Trap(local_a, local_b, local_n, h);
    MPI_Reduce(&integral, &total, 1, MPI_FLOAT, MPI_SUM, ROOT, MPI_COMM_WORLD);

Collective communication in MPI
Collective operations are called by all processes within a communicator
Broadcast: distribute data from a designated root process to all the others
    MPI_Bcast(in, count, type, root, comm)
Reduce: combine data from all processes and return the result to a designated root process
    MPI_Reduce(in, out, count, type, op, root, comm)
Allreduce: all processes receive the reduction result; equivalent to a Reduce followed by a Bcast
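To make the calling conventions concrete, here is a minimal sketch (not from the original slides) in which the root broadcasts the problem parameters and then a reduction combines the partial integrals; the Trap() helper and the ROOT constant are assumed to be defined as in the code excerpts above.

    #include <mpi.h>

    // Minimal sketch: root broadcasts the inputs, every rank integrates its
    // sub-interval, and MPI_Reduce combines the partial results on the root.
    // Assumes Trap() and ROOT are defined as in the earlier excerpts.
    void integrate(float a, float b, int n, int my_rank, int p) {
        // Make a, b, n (valid only on the root) known to every process
        MPI_Bcast(&a, 1, MPI_FLOAT, ROOT, MPI_COMM_WORLD);
        MPI_Bcast(&b, 1, MPI_FLOAT, ROOT, MPI_COMM_WORLD);
        MPI_Bcast(&n, 1, MPI_INT,   ROOT, MPI_COMM_WORLD);

        float h = (b - a) / n;
        int   local_n = n / p;                      // assume p divides n evenly
        float local_a = a + my_rank * local_n * h;
        float local_b = local_a + local_n * h;
        float integral = Trap(local_a, local_b, local_n, h);

        float total = 0.0f;
        MPI_Reduce(&integral, &total, 1, MPI_FLOAT, MPI_SUM, ROOT, MPI_COMM_WORLD);
    }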

Final version

    int local_n = n/p;
    float local_a = a + my_rank*local_n*h,
          local_b = local_a + local_n*h,
          integral = Trap(local_a, local_b, local_n, h);
    MPI_Allreduce(&integral, &total, 1, MPI_FLOAT, MPI_SUM, WORLD);

Collective Communication

Broadcast
The root process transmits m pieces of data to all the p-1 other processors
With the linear ring algorithm the root performs p-1 sends of length m, so the cost is (p-1)(α + βm)
Another approach is to use the hypercube algorithm, which has a logarithmic running time (lg(p) communication steps)

Sidebar: what is a hypercube?
A hypercube is a d-dimensional graph with 2^d nodes
A 0-cube is a single node, a 1-cube is a line connecting two points, a 2-cube is a square, and so on
Each node has d neighbors

Properties of hypercubes
A hypercube with p nodes has lg(p) dimensions
Inductive construction: we may construct a d-cube from two (d-1)-dimensional cubes
Diameter: what is the maximum distance between any 2 nodes?
Bisection bandwidth: how many edges are cut by a minimum cut?
[Figure: a 2-cube with nodes labeled 00, 01, 10, 11]

Bookkeeping
Label the nodes with a binary reflected Gray code (http://www.nist.gov/dads/html/graycode.html)
Neighboring labels differ in exactly one bit position, e.g. 001 and 101 are neighbors: 001 = 101 ⊕ e_2, where e_2 = 100

Hypercube broadcast algorithm with p=4
Processor 0 is the root; it sends its data to its hypercube buddy on processor 2 (10)
Processors 0 and 2 then send the data to their respective buddies
[Figure: three snapshots of the 2-cube (nodes 00, 01, 10, 11) as the data propagates]
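The following is a minimal sketch (not from the original slides) of the hypercube broadcast written with point-to-point MPI calls; it assumes rank 0 is the root and that the number of processes is a power of two.

    #include <mpi.h>

    // Sketch of a hypercube (binomial tree) broadcast rooted at rank 0.
    // Assumes the communicator size p is a power of two.
    void hypercube_bcast(float *data, int count, MPI_Comm comm) {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        for (int mask = p >> 1; mask > 0; mask >>= 1) {
            int partner = rank ^ mask;                  // buddy differs in one bit
            if ((rank & (2*mask - 1)) == 0)             // ranks that already hold the data send
                MPI_Send(data, count, MPI_FLOAT, partner, 0, comm);
            else if ((rank & (2*mask - 1)) == mask)     // their buddies receive in this phase
                MPI_Recv(data, count, MPI_FLOAT, partner, 0, comm, MPI_STATUS_IGNORE);
        }
    }

Each pass of the loop is one of the lg(p) phases pictured above, so the total cost is roughly lg(p)(α + βm) rather than (p-1)(α + βm).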

Reduction
We may use the hypercube algorithm to perform reductions as well as broadcasts
Another variant of reduction, Allreduce(), provides all processes with a copy of the reduced result; it is equivalent to a Reduce followed by a Bcast
A clever algorithm performs the Allreduce in one phase rather than performing separate reduce and broadcast phases

Allreduce
Can take advantage of duplex connections
[Figure: three snapshots of the 2-cube (nodes 00, 01, 10, 11) during the pairwise exchanges]
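As an illustration (not from the original slides), here is a minimal sketch of a recursive-doubling allreduce on a hypercube: in each dimension the two buddies exchange partial sums over the duplex link, so no separate broadcast phase is needed. It assumes the number of processes is a power of two.

    #include <mpi.h>

    // Sketch of a recursive-doubling allreduce (sum) over a hypercube.
    // Assumes the communicator size p is a power of two.
    float hypercube_allreduce_sum(float local, MPI_Comm comm) {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        float sum = local, recv;
        for (int mask = 1; mask < p; mask <<= 1) {
            int partner = rank ^ mask;                  // buddy in this dimension
            MPI_Sendrecv(&sum, 1, MPI_FLOAT, partner, 0,
                         &recv, 1, MPI_FLOAT, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            sum += recv;                                // both sides accumulate the partial sum
        }
        return sum;                                     // every rank ends with the global sum
    }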

Parallel Matrix Multiplication

Matrix Multiplication
An important core operation in many numerical algorithms
Given two conforming matrices A and B, form the matrix product AB, where A is m × n and B is n × p
Operation count: O(n^3) multiply-adds for an n × n square matrix
See Demmel: www.cs.berkeley.edu/~demmel/cs267_spr99/lectures/lect02.html

Simplest Serial Algorithm (ijk)

    for i := 0 to n-1
        for j := 0 to n-1
            for k := 0 to n-1
                C[i,j] += A[i,k] * B[k,j]

[Figure: C[i,j] accumulates the inner product of row A[i,:] and column B[:,j]]

Parallel matrix multiplication
Assume p is a perfect square
Each processor gets an n/√p × n/√p chunk of data
Organize the processors into rows and columns, so a process rank is an ordered pair of integers
Assume that we have an efficient serial matrix multiply (dgemm, sgemm)

    p(0,0)  p(0,1)  p(0,2)
    p(1,0)  p(1,1)  p(1,2)
    p(2,0)  p(2,1)  p(2,2)
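One simple way to realize the ordered-pair ranks, sketched here as an assumption rather than taken from the slides, is to lay the ranks of MPI_COMM_WORLD out in row-major order over the √p × √p grid:

    #include <math.h>
    #include <mpi.h>

    // Map a linear MPI rank onto (row, column) coordinates of a √p-by-√p grid.
    // Assumes p is a perfect square and ranks are laid out in row-major order.
    void grid_coords(MPI_Comm comm, int *my_row, int *my_col) {
        int my_rank, p;
        MPI_Comm_rank(comm, &my_rank);
        MPI_Comm_size(comm, &p);
        int sqrt_p = (int)(sqrt((double)p) + 0.5);
        *my_row = my_rank / sqrt_p;     // which row of the processor grid
        *my_col = my_rank % sqrt_p;     // which column of the processor grid
    }

The same row-major convention is what the key = myrank div P split uses later in the lecture to build the row communicators.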

Cannon's algorithm
Move the data incrementally, in √p phases
Circulate each chunk of data among the processors within a row or column
In effect we are using a ring broadcast algorithm
Consider iteration i=1, j=2:  C[1,2] = A[1,0]*B[0,2] + A[1,1]*B[1,2] + A[1,2]*B[2,2]
[Figure: the 3 × 3 block decompositions of A(i,j) and B(i,j). Image: Jim Demmel]

Cannon's algorithm
C[1,2] = A[1,0]*B[0,2] + A[1,1]*B[1,2] + A[1,2]*B[2,2]
We want A[1,0] and B[0,2] to reside on the same processor initially
We then shift rows and columns so that the next pair of values, A[1,1] and B[1,2], line up, and so on with A[1,2] and B[2,2]
[Figure: the block layouts of A and B before and after the shifts]

Skewing the matrices
C[1,2] = A[1,0]*B[0,2] + A[1,1]*B[1,2] + A[1,2]*B[2,2]
We first skew the matrices so that everything lines up
Shift each row i of A by i columns to the left, using sends and receives; the communication wraps around
Do the same for each column of B (shift column j up by j rows)
[Figure: the block layouts of A and B after the initial skew]

Shift and multiply
C[1,2] = A[1,0]*B[0,2] + A[1,1]*B[1,2] + A[1,2]*B[2,2]
The algorithm takes √p steps
At each step, circularly shift each row of A by 1 column to the left and each column of B by 1 row upward
Each processor forms the product of its two local blocks, adding into the accumulated sum
[Figure: the block layouts of A and B after one shift]

Cost of Cannon's Algorithm

    forall i = 0 to √p - 1
        CShift-left A[i, :] by i                // T = α + βn²/p
    forall j = 0 to √p - 1
        CShift-up B[:, j] by j                  // T = α + βn²/p
    for k = 0 to √p - 1
        forall i = 0 to √p - 1 and j = 0 to √p - 1
            C[i,j] += A[i,j]*B[i,j]             // T = 2n³/p^(3/2)
            CShift-left A[i, :] by 1            // T = α + βn²/p
            CShift-up B[:, j] by 1              // T = α + βn²/p
        end forall
    end for

T_P = 2n³/p + 2(1+√p)(α + βn²/p)
E_P = T_1/(p·T_P) = (1 + αp^(3/2)/n³ + β√p/n)^(-1) = (1 + O(√p/n))^(-1)
E_P → 1 as n/√p (the square root of the data per processor) grows
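For reference, the algebra behind the efficiency expression can be spelled out as follows (this derivation is added here for clarity and is not on the original slide):

    \begin{align*}
    p\,T_P &= 2n^3 + 2p\,(1+\sqrt{p})\left(\alpha + \beta\,\frac{n^2}{p}\right)
            \approx 2n^3 + 2\alpha\,p^{3/2} + 2\beta\,\sqrt{p}\,n^2, \\
    E_P &= \frac{T_1}{p\,T_P}
         = \frac{2n^3}{2n^3 + 2\alpha\,p^{3/2} + 2\beta\,\sqrt{p}\,n^2}
         = \left(1 + \frac{\alpha\,p^{3/2}}{n^3} + \frac{\beta\,\sqrt{p}}{n}\right)^{-1}
         = \left(1 + O\!\left(\frac{\sqrt{p}}{n}\right)\right)^{-1}.
    \end{align*}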

Implementation

Communication domains
Cannon's algorithm shifts data along the rows and columns of the processor grid
MPI provides communicators for grouping processors to reflect the communication structure of the algorithm
An MPI communicator is a name space, a subset of processes that communicate
Messages remain within their communicator
A process may be a member of more than one communicator

[Figure: a 4 × 4 process grid; ranks P0..P15 labeled with grid coordinates (0,0)..(3,3), columns X0..X3 and rows Y0..Y3]

Establishing row communicators
Create a communicator for each row and column
By row: key = myrank div P

[Figure: the 4 × 4 process grid; every process in a row shares the same key (its row index, 0..3)]

Creating the communicators

    MPI_Comm rowComm;
    MPI_Comm_split(MPI_COMM_WORLD, myrank / P, myrank, &rowComm);
    MPI_Comm_rank(rowComm, &myRow);

Each process obtains a new communicator
Each process has a rank relative to the new communicator
That rank applies only to the respective communicator
Ranks are ordered according to myrank

[Figure: the 4 × 4 process grid partitioned into four row communicators, one per split key 0..3]

More on Comm_split

    MPI_Comm_split(MPI_Comm comm, int splitkey, int rankkey, MPI_Comm* newcomm)

Within each new communicator, ranks are ordered by the rankkey value; processes sharing the same rankkey value are ordered by their rank in the old communicator
A process may be excluded by passing the constant MPI_UNDEFINED as the splitkey; such a process gets back the special MPI_COMM_NULL communicator
If a process is a member of several communicators, it will have a rank within each one
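To complete the picture for Cannon's algorithm, here is a minimal sketch (consistent with the slide above, but not code from it) that builds both a row communicator and a column communicator with MPI_Comm_split; P denotes the grid dimension √p.

    #include <mpi.h>

    // Build row and column communicators for a P x P process grid.
    // Assumes p = P*P processes, laid out in row-major order by world rank.
    void make_grid_comms(int P, MPI_Comm *rowComm, MPI_Comm *colComm) {
        int myrank;
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        // Processes with the same row index (myrank / P) share a row communicator
        MPI_Comm_split(MPI_COMM_WORLD, myrank / P, myrank, rowComm);
        // Processes with the same column index (myrank % P) share a column communicator
        MPI_Comm_split(MPI_COMM_WORLD, myrank % P, myrank, colComm);
    }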

Circular shift
Communication within columns (and rows) of the process grid; see the sketch below

    p(0,0)  p(0,1)  p(0,2)
    p(1,0)  p(1,1)  p(1,2)
    p(2,0)  p(2,1)  p(2,2)
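A minimal sketch (not from the slides) of one way to express the circular shifts and the Cannon update using the communicators built above. It assumes square n/P × n/P local blocks stored contiguously, a serial block multiply local_mm() (e.g. a wrapper around sgemm), and the rowComm/colComm and myRow/myCol values from the earlier sketches; all of these names are assumptions for illustration.

    #include <mpi.h>

    void local_mm(float *C, float *A, float *B, int nb);   // assumed serial block multiply

    // One possible Cannon main loop: after the initial skew, repeat P times:
    // multiply the local blocks, then circularly shift A left and B up by one.
    void cannon_multiply(float *A, float *B, float *C, int nb, int P,
                         int myRow, int myCol, MPI_Comm rowComm, MPI_Comm colComm) {
        MPI_Status status;
        int left = (myCol + P - 1) % P, right = (myCol + 1) % P;   // neighbors within rowComm
        int up   = (myRow + P - 1) % P, down  = (myRow + 1) % P;   // neighbors within colComm
        for (int k = 0; k < P; k++) {
            local_mm(C, A, B, nb);                                 // C += A*B on local blocks
            // Circularly shift A one column to the left within the row communicator
            MPI_Sendrecv_replace(A, nb*nb, MPI_FLOAT, left, 0, right, 0, rowComm, &status);
            // Circularly shift B one row upward within the column communicator
            MPI_Sendrecv_replace(B, nb*nb, MPI_FLOAT, up, 0, down, 0, colComm, &status);
        }
    }

With the row-major layout assumed earlier, a process's rank within rowComm equals its column index and its rank within colComm equals its row index, which is why myCol and myRow can be used directly as destination and source ranks.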

Parallel print function

Parallel print function
Debug output can be hard to sort out on the screen when many messages say the same thing:
    Process 0 is alive!
    Process 1 is alive!
    ...
    Process 15 is alive!
Compare with
    Processes[0-15] are alive!
Parallel print facility: http://www.llnl.gov/casc/ppf

Summary of capabilities
Compact format lists the sets of nodes with common output:
    PPF_Print( MPI_COMM_WORLD, "Hello world" );
    0-3: Hello world
The %N specifier generates process ID information:
    PPF_Print( MPI_COMM_WORLD, "Message from %N\n" );
    Message from 0-3
Lists of nodes:
    PPF_Print( MPI_COMM_WORLD, (myrank % 2) ?
        "[%N] Hello from the odd numbered nodes!\n" :
        "[%N] Hello from the even numbered nodes!\n" );
    [0,2] Hello from the even numbered nodes!
    [1,3] Hello from the odd numbered nodes!

Practical matters
The facility is installed in $(PUB)/lib/PPF
Use a special version of the arch file called arch.intel.mpi.ppf [arch.intel-mkl.mpi.ppf]
Each module that uses the facility must #include "ptools_ppf.h"
Look in $(PUB)/Examples/MPI/PPF for the example programs ppfexample_cpp.c and test_print.c