CSE 590: Special Topics Course ( Supercomputing ) Lecture 6 ( Analyzing Distributed Memory Algorithms )


CSE 590: Special Topics Course ( Supercomputing ) Lecture 6 ( Analyzing Distributed Memory Algorithms ) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2012

2D Heat Diffusion

Let h( x, y, t ) be the heat at point ( x, y ) at time t.

Heat Equation: ∂h/∂t = α ( ∂^2 h/∂x^2 + ∂^2 h/∂y^2 ), where α is the thermal diffusivity.

Update Equation ( on a discrete grid ), a 2D 5-point stencil applied at every grid point in each time step:

  h( x, y, t+1 ) = h( x, y, t ) + cx [ h( x+1, y, t ) - 2 h( x, y, t ) + h( x-1, y, t ) ]
                                + cy [ h( x, y+1, t ) - 2 h( x, y, t ) + h( x, y-1, t ) ]
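For reference, a sketch of how this update rule arises from the heat equation via a forward difference in time and central differences in space; the coefficient definitions c_x = αΔt/Δx^2 and c_y = αΔt/Δy^2 are an assumption (the slide does not define them):

  h_{t+1}(x, y) \;=\; h_t(x, y)
      \;+\; c_x \bigl[ h_t(x+1, y) - 2\,h_t(x, y) + h_t(x-1, y) \bigr]
      \;+\; c_y \bigl[ h_t(x, y+1) - 2\,h_t(x, y) + h_t(x, y-1) \bigr],
  \qquad c_x = \frac{\alpha\,\Delta t}{\Delta x^2}, \quad c_y = \frac{\alpha\,\Delta t}{\Delta y^2}.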

Standard Serial Implementation

Implementation tricks: reuse storage for odd and even time steps, and keep a halo of ghost cells around the array holding the boundary values.

  for ( int t = 0; t < T; ++t )
  {
      for ( int x = 1; x <= X; ++x )
          for ( int y = 1; y <= Y; ++y )
              g[x][y] = h[x][y] + cx * ( h[x+1][y] - 2 * h[x][y] + h[x-1][y] )
                                + cy * ( h[x][y+1] - 2 * h[x][y] + h[x][y-1] );

      for ( int x = 1; x <= X; ++x )
          for ( int y = 1; y <= Y; ++y )
              h[x][y] = g[x][y];
  }

One Way of Parallelization

[ figure: the grid is cut row-wise into contiguous slabs, one per processor P 0, P 1, P 2, P 3 ]

MPI Implementation of 2D Heat Diffusion

  #define UPDATE( u, v ) ( h[u][v] + cx * ( h[u+1][v] - 2 * h[u][v] + h[u-1][v] ) + cy * ( h[u][v+1] - 2 * h[u][v] + h[u][v-1] ) )

  float h[ XX + 2 ][ Y + 2 ], g[ XX + 2 ][ Y + 2 ];      /* leave enough space for ghost cells */
  MPI_Status stat;
  MPI_Request sendreq[ 2 ], recvreq[ 2 ];

  for ( int t = 0; t < T; ++t )
  {
      if ( myrank < p - 1 )                              /* downward send and upward receive */
      {
          MPI_Isend( h[ XX ], Y, MPI_FLOAT, myrank + 1, 2 * t, MPI_COMM_WORLD, &sendreq[ 0 ] );
          MPI_Irecv( h[ XX + 1 ], Y, MPI_FLOAT, myrank + 1, 2 * t + 1, MPI_COMM_WORLD, &recvreq[ 0 ] );
      }
      if ( myrank > 0 )                                  /* upward send and downward receive */
      {
          MPI_Isend( h[ 1 ], Y, MPI_FLOAT, myrank - 1, 2 * t + 1, MPI_COMM_WORLD, &sendreq[ 1 ] );
          MPI_Irecv( h[ 0 ], Y, MPI_FLOAT, myrank - 1, 2 * t, MPI_COMM_WORLD, &recvreq[ 1 ] );
      }

      /* in addition to the ghost rows, exclude the two outermost interior rows */
      for ( int x = 2; x < XX; ++x )
          for ( int y = 1; y <= Y; ++y )
              g[x][y] = UPDATE( x, y );

      /* wait until data is received for the ghost rows */
      if ( myrank < p - 1 ) MPI_Wait( &recvreq[ 0 ], &stat );
      if ( myrank > 0 ) MPI_Wait( &recvreq[ 1 ], &stat );

      /* update the two outermost interior rows */
      for ( int y = 1; y <= Y; ++y )
      {
          g[ 1 ][y] = UPDATE( 1, y );
          g[ XX ][y] = UPDATE( XX, y );
      }

      /* wait until sending data is complete so that h can be overwritten */
      if ( myrank < p - 1 ) MPI_Wait( &sendreq[ 0 ], &stat );
      if ( myrank > 0 ) MPI_Wait( &sendreq[ 1 ], &stat );

      /* now overwrite h */
      for ( int x = 1; x <= XX; ++x )
          for ( int y = 1; y <= Y; ++y )
              h[x][y] = g[x][y];
  }
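The fragment above assumes that myrank, p, and the slab height XX have already been set up. A minimal initialization sketch under that assumption (taking X to be divisible by p; allocation and boundary setup are only indicated by comments):

  #include <mpi.h>

  int main( int argc, char *argv[] )
  {
      int myrank, p;

      MPI_Init( &argc, &argv );
      MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
      MPI_Comm_size( MPI_COMM_WORLD, &p );

      /* each processor owns XX = X / p consecutive rows of the grid, stored   */
      /* in rows 1 .. XX of its local arrays; rows 0 and XX + 1 are ghost rows */

      /* ... allocate and initialize h and g, then run the time loop above ... */

      MPI_Finalize();
      return 0;
  }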

Analysis of the MPI Implementation of Heat Diffusion

Let the dimension of the 2D grid be X × Y, and suppose we execute T time steps. Let p be the number of processors, and suppose the grid is decomposed along the X direction, so each processor gets XX = X / p consecutive rows.

The computation cost in each time step is clearly Ο( XY / p ). Hence, the total computation cost, t_comp = Ο( T X Y / p ).

All processors except processors 0 and p - 1 send two rows and receive two rows each in every time step. Processors 0 and p - 1 send and receive only one row each. Hence, the total communication cost, t_comm = 4 T ( t_s + Y t_w ), where t_s is the startup time of a message and t_w is the per-word transfer time.

Thus T_p = t_comp + t_comm = Ο( T X Y / p ) + 4 T ( t_s + Y t_w ).
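From these expressions the speedup and efficiency follow directly (a routine derivation, not spelled out on the slide; t_s and t_w are treated as constants):

  T_1 = \Theta( T X Y ), \qquad
  S = \frac{T_1}{T_p} = \frac{\Theta( T X Y )}{\Theta( T X Y / p ) + 4 T ( t_s + Y t_w )}, \qquad
  E = \frac{S}{p} = \frac{1}{1 + \Theta\bigl( p ( t_s + Y t_w ) / ( X Y ) \bigr)}.

So the efficiency remains constant as long as p grows no faster than Θ( XY / ( t_s + Y t_w ) ), i.e., roughly in proportion to X once Y t_w dominates t_s.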

Naïve Matrix Multiplication

  z_ij = ∑_{k=1}^{n} x_ik y_kj,   i.e.,   Z = X Y, where Z = [ z_ij ], X = [ x_ij ], and Y = [ y_ij ] are n × n matrices.

Iter-MM( X, Y, Z, n )
  1. for i ← 1 to n do
  2.   for j ← 1 to n do
  3.     for k ← 1 to n do
  4.       z_ij ← z_ij + x_ik · y_kj
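A direct C rendering of Iter-MM as a minimal sketch; the row-major layout and the function/array names are illustrative assumptions:

  /* Naive Θ(n^3) matrix multiplication: Z <- Z + X * Y for n x n matrices */
  /* stored in row-major order.                                            */
  void iter_mm( const double *X, const double *Y, double *Z, int n )
  {
      for ( int i = 0; i < n; ++i )
          for ( int j = 0; j < n; ++j )
              for ( int k = 0; k < n; ++k )
                  Z[ i * n + j ] += X[ i * n + k ] * Y[ k * n + j ];
  }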

Naïve Matrix Multiplication

Suppose we have p = n^2 processors, and processor P_ij is responsible for computing z_ij.

One master processor initially holds both X and Y, and sends all x_ik and y_kj for k ∈ { 1, 2, …, n } to each processor P_ij. One-to-all broadcast is a bad idea as each processor requires a different part of the input.

Each P_ij computes z_ij and sends it back to the master.

Thus t_comp = 2n, and t_comm = 2 n^2 ( t_s + n t_w ). Hence, T_p = t_comp + t_comm = 2n + 2 n^2 t_s + 2 n^3 t_w.

Total work, p T_p = 2 n^3 + 2 n^4 t_s + 2 n^5 t_w.

Naïve Matrix Multiplication

Observe that row i of X will be required by all P_ij, 1 ≤ j ≤ n. So that row can be broadcast to the group { P_i1, P_i2, …, P_in } of size n. Similarly for the other rows of X, and for all columns of Y.

The communication complexity of broadcasting m units of data to a group of size n is ( t_s + m t_w ) log n.

As before, each P_ij computes z_ij and sends it back to the master. Hence, t_comm = 2 n ( t_s + n t_w ) log n + n^2 ( t_s + t_w ).
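One way to realize this one-to-group broadcast in MPI is to split MPI_COMM_WORLD into per-row and per-column communicators. A minimal sketch under the slide's setup of n^2 processes acting as P_ij; the buffer names xrow and ycol and the choice of group root are illustrative assumptions:

  #include <mpi.h>

  /* Split the n*n processes into row groups and column groups, then broadcast   */
  /* row i of X within row group i and column j of Y within column group j.      */
  /* xrow and ycol are length-n buffers that already hold the data at rank 0 of  */
  /* the respective group.                                                        */
  void broadcast_row_and_column( float *xrow, float *ycol, int n )
  {
      int rank, i, j;
      MPI_Comm row_comm, col_comm;

      MPI_Comm_rank( MPI_COMM_WORLD, &rank );
      i = rank / n;                                       /* this process acts as P_ij */
      j = rank % n;

      MPI_Comm_split( MPI_COMM_WORLD, i, j, &row_comm );  /* group { P_i1, ..., P_in } */
      MPI_Comm_split( MPI_COMM_WORLD, j, i, &col_comm );  /* group { P_1j, ..., P_nj } */

      MPI_Bcast( xrow, n, MPI_FLOAT, 0, row_comm );       /* broadcast row i of X      */
      MPI_Bcast( ycol, n, MPI_FLOAT, 0, col_comm );       /* broadcast column j of Y   */

      MPI_Comm_free( &row_comm );
      MPI_Comm_free( &col_comm );
  }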

Block Matrix Multiplication

[ figure: Z = X × Y, where each of the n × n matrices Z, X, Y is viewed as an ( n / s ) × ( n / s ) grid of s × s blocks ]

Block-MM( X, Y, Z, n )
  1. for i ← 1 to n / s do
  2.   for j ← 1 to n / s do
  3.     for k ← 1 to n / s do
  4.       Iter-MM( X_ik, Y_kj, Z_ij, s )
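A serial C sketch of Block-MM for row-major n × n matrices, assuming s divides n; the helper iter_mm_block (operating on 0-based block indices bi, bj, bk) is an illustrative stand-in for the slide's Iter-MM on submatrices:

  /* Multiply the s x s blocks X_ik and Y_kj and accumulate into Z_ij. */
  /* X, Y, Z are n x n row-major arrays.                                */
  static void iter_mm_block( const double *X, const double *Y, double *Z,
                             int n, int s, int bi, int bj, int bk )
  {
      for ( int i = bi * s; i < ( bi + 1 ) * s; ++i )
          for ( int j = bj * s; j < ( bj + 1 ) * s; ++j )
              for ( int k = bk * s; k < ( bk + 1 ) * s; ++k )
                  Z[ i * n + j ] += X[ i * n + k ] * Y[ k * n + j ];
  }

  /* Block-MM: iterate over all ( n / s )^3 block triples. */
  void block_mm( const double *X, const double *Y, double *Z, int n, int s )
  {
      for ( int bi = 0; bi < n / s; ++bi )
          for ( int bj = 0; bj < n / s; ++bj )
              for ( int bk = 0; bk < n / s; ++bk )
                  iter_mm_block( X, Y, Z, n, s, bi, bj, bk );
  }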

Block Matrix Multiplication

Suppose p = ( n / s )^2, and processor P_ij computes block Z_ij.

One master processor initially holds both X and Y, and sends all blocks X_ik and Y_kj for k ∈ { 1, 2, …, n / s } to each processor P_ij.

Thus t_comp = 2 n s^2 = Ο( n^3 / p ), and t_comm = 2 p ( t_s + n s t_w ) ( w/o broadcast ).

For p = n^2 ( s = 1 ): t_comp = Ο( n ), and t_comm = Ο( n^2 t_s + n^3 t_w ).
For general p ( s = n / √p ): t_comp = Ο( n^3 / p ), and t_comm = Ο( p t_s + n^2 √p t_w ).

Block Matrix Multiplication

Now consider one-to-group broadcasting. Block row i of X, i.e., blocks X_ik for k ∈ { 1, 2, …, n / s }, will be required by n / s different processors, i.e., processors P_ij for j ∈ { 1, 2, …, n / s }. Similarly for the other block rows of X, and for all block columns of Y.

As before, each P_ij computes block Z_ij and sends it back to the master. Hence, t_comm = 2 ( n / s ) ( t_s + n s t_w ) log ( n / s ) + ( n / s )^2 ( t_s + s^2 t_w ).

Recursive Matrix Multiplication

Par-Rec-MM ( X, Y, Z, n )
  1. if n = 1 then Z ← Z + X · Y
  2. else
  3.   in parallel do
         Par-Rec-MM ( X_11, Y_11, Z_11, n / 2 )
         Par-Rec-MM ( X_11, Y_12, Z_12, n / 2 )
         Par-Rec-MM ( X_21, Y_11, Z_21, n / 2 )
         Par-Rec-MM ( X_21, Y_12, Z_22, n / 2 )
       end do
  4.   in parallel do
         Par-Rec-MM ( X_12, Y_21, Z_11, n / 2 )
         Par-Rec-MM ( X_12, Y_22, Z_12, n / 2 )
         Par-Rec-MM ( X_22, Y_21, Z_21, n / 2 )
         Par-Rec-MM ( X_22, Y_22, Z_22, n / 2 )
       end do

Assuming t_s and t_w are constants,

  T( n ) = Θ( 1 ) if n = 1,  and  T( n ) = 8 T( n / 2 ) + Θ( n^2 ) otherwise
         = Θ( n^3 )    [ MT Case 1 ]

( the Θ( n^2 ) term charges each call for communicating its submatrices ). Communication cost is too high!

Recursive Matrix Multiplication

( Par-Rec-MM as on the previous slide, but the recursion stops at m × m submatrices, which are multiplied locally. )

But with a base case of size m,

  T( n ) = Θ( 1 ) if n ≤ m,  and  T( n ) = 8 T( n / 2 ) + Θ( n^2 ) otherwise
         = Θ( n^3 / m )

Parallel running time, T_p = t_comp + t_comm.

For m = n / p^{1/3} ( one base-case block per processor ): t_comp = Ο( n^3 / p ), and t_comm = Ο( n^2 / p^{2/3} )  ( how? )

Cannon's Algorithm

We decompose each matrix into √p × √p blocks of size ( n / √p ) × ( n / √p ) each. We number the processors from P_{0,0} to P_{√p-1,√p-1}. Initially, P_{i,j} holds X_{i,j} and Y_{i,j}.

We rotate block row i of X to the left by i positions, and block column j of Y upward by j positions. So, P_{i,j} now holds X_{i, (j+i) mod √p} and Y_{(i+j) mod √p, j}.

Cannon s Algorithm )*+ now holds L*,+ * and M* +,+. )*+ multiplies these two submatrices, and adds the result to N*,+. Then in each of the next 1 steps, each block row of L is rotated to the left by 1 position, and each block column of M is rotated upward by 1 position. Each )*+ adds the product of its current submatrices to N*,+.

Cannon's Algorithm

The initial arrangement makes up to √p - 1 block rotations of X and Y, followed by one block matrix multiplication per processor. In each of the next √p - 1 steps, each processor performs one block matrix multiplication, and sends and receives one block each.

Hence, t_comp = √p · 2 ( n / √p )^3 = 2 n^3 / p, and t_comm = 4 ( √p - 1 ) ( t_s + ( n^2 / p ) t_w ).

T_p = t_comp + t_comm = Ο( n^3 / p + √p t_s + ( n^2 / √p ) t_w ).

Cannon's Algorithm

What if, initially, one master processor ( say, P_{0,0} ) holds all data ( i.e., matrices X and Y ), and the same processor wants to collect the entire output matrix ( i.e., Z ) at the end?

Processor P_{0,0} initially sends X_{i,j} and Y_{i,j} to processor P_{i,j}, and at the end processor P_{i,j} sends back Z_{i,j} to P_{0,0}.

Since there are p processors, and each submatrix has size ( n / √p ) × ( n / √p ), the additional communication complexity is 3 p ( t_s + ( n^2 / p ) t_w ) = 3 p t_s + 3 n^2 t_w.

So, the communication complexity increases by a factor of √p.