Lecture 16. Parallel Matrix Multiplication


Announcements
Assignment #5: message passing on Triton, GPU programming on Lincoln.
Calendar: no class on Tuesday/Thursday, Nov 16th/18th.
TA evaluation, professor and course evaluation: Nov 23rd.

Today's lecture
Some short topics on MPI.
Matrix multiplication.
Working with communication domains.

Correctness and fairness
Iteration 1: process 1 sends to 2 and 0, process 0 sends to 1, and process 2 sends to 0 and 1.
Process 1 begins iteration 2: 1 sends to 2, and 0 sends to 2 (but for iteration 0).
Problem: the Irecv in process 2 can receive the data sent by process 1 in its iteration 2 while process 2 still expects data from process 0.

    for i = 1 to n
        MPI_Request req1, req2;
        MPI_Status status;
        MPI_Irecv(buff,  len, MPI_CHAR, MPI_ANY_SOURCE, TAG, MPI_COMM_WORLD, &req1);
        MPI_Irecv(buff2, len, MPI_CHAR, MPI_ANY_SOURCE, TAG, MPI_COMM_WORLD, &req2);
        MPI_Send(buff, len, MPI_CHAR, nextnode, TAG, MPI_COMM_WORLD);
        MPI_Send(buff, len, MPI_CHAR, prevnode, TAG, MPI_COMM_WORLD);
        MPI_Wait(&req1, &status);
        MPI_Wait(&req2, &status);
    end for
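One common remedy (a hedged sketch, not from the slide) is to disambiguate iterations by using the loop index as the message tag, so a receive posted for one iteration cannot match a message from a later one; this version also keeps the send buffer separate from the buffers that have pending receives. The function, buffer, and neighbor names are illustrative.

    #include <mpi.h>

    /* Sketch of a fix: tag each message with the iteration number, and do
     * not reuse a buffer that has a pending nonblocking receive on it. */
    void ring_exchange(char *sendbuf, char *rbuf1, char *rbuf2, int len,
                       int nextnode, int prevnode, int n)
    {
        for (int i = 1; i <= n; i++) {
            MPI_Request req[2];
            MPI_Irecv(rbuf1, len, MPI_CHAR, MPI_ANY_SOURCE, i, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(rbuf2, len, MPI_CHAR, MPI_ANY_SOURCE, i, MPI_COMM_WORLD, &req[1]);
            MPI_Send(sendbuf, len, MPI_CHAR, nextnode, i, MPI_COMM_WORLD);
            MPI_Send(sendbuf, len, MPI_CHAR, prevnode, i, MPI_COMM_WORLD);
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        }
    }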

Send_Recv
Instead of separate Send and Recv calls, we can use MPI_Sendrecv_replace() to simplify the code and improve performance. It sends and then receives a message using a single buffer:

    int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype,
                             int dest, int sendtag, int source, int recvtag,
                             MPI_Comm comm, MPI_Status *status);
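As a small usage illustration (added here, not from the slide), the sketch below shifts a buffer one step around a ring: each rank sends its data to the right neighbor and receives from the left, in place. The function name and the choice of MPI_DOUBLE are illustrative.

    #include <mpi.h>

    /* Illustrative sketch: shift 'count' doubles one step around a ring
     * using a single in-place call. Assumes MPI has been initialized. */
    void ring_shift(double *buf, int count, MPI_Comm comm)
    {
        int rank, np;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &np);
        int dest   = (rank + 1) % np;        /* send to the right neighbor */
        int source = (rank - 1 + np) % np;   /* receive from the left      */
        MPI_Sendrecv_replace(buf, count, MPI_DOUBLE, dest, 0,
                             source, 0, comm, MPI_STATUS_IGNORE);
    }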

Parallel Matrix Multiplication

Loop reordering
The simplest formulation of matrix multiply is the ijk formulation:

    for i := 0 to n-1
      for j := 0 to n-1
        for k := 0 to n-1
          C[i,j] += A[i,k] * B[k,j]

The kij formulation is the basis for an efficient parallel algorithm:

    for k := 0 to n-1
      for i := 0 to n-1
        for j := 0 to n-1
          C[i,j] += A[i,k] * B[k,j]
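In C (an illustrative sketch, not from the slides), reordering simply permutes the three loops; the innermost update is identical. Square n × n matrices in row-major storage are assumed.

    /* ijk ordering: row-oriented dot products. */
    void matmul_ijk(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

    /* kij ordering: each value of k contributes one outer product. */
    void matmul_kij(int n, const double *A, const double *B, double *C)
    {
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }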

Formulation
The matrices may be non-square: C is n1 × n2, A is n1 × n3, and B is n3 × n2.

    for k := 0 to n3-1
      for i := 0 to n1-1
        for j := 0 to n2-1
          C[i,j] += A[i,k] * B[k,j]      // i.e., C[i,:] += A[i,k] * B[k,:]

The two innermost loop nests compute n3 outer products:

    for k := 0 to n3-1
      C[:,:] += A[:,k] ⊗ B[k,:]

where ⊗ denotes the outer product.

Outer product
Recall that when we multiply an m × n matrix by an n × p matrix we get an m × p matrix.
The outer product of a column vector aᵀ and a vector b is a matrix C: an m × 1 times a 1 × n gives an m × n matrix, a multiplication table with the rows formed by a[:] and the columns by b[:]. For example:

                             | ax  ay  az |
    (a, b, c)ᵀ ⊗ (x, y, z) = | bx  by  bz |
                             | cx  cy  cz |

A numerical example, with a = (10, 20, 30) and b = (1, 2, 3):

              1    2    3
        10   10   20   30
        20   20   40   60
        30   30   60   90

The SUMMA algorithm computes n partial outer products:

    for k := 0 to n-1
      C[:,:] += A[:,k] ⊗ B[k,:]
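As a concrete illustration (added here, not from the slide), one rank-1 update C += a ⊗ b in C, with a of length m and b of length n; calling it with a = A[:,k] and b = B[k,:] for each k reproduces the kij loop above.

    /* Illustrative sketch: outer-product (rank-1) update of a row-major
     * m x n matrix C: C[i][j] += a[i] * b[j]. */
    void outer_product_update(int m, int n,
                              const double *a, const double *b, double *C)
    {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                C[i*n + j] += a[i] * b[j];
    }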

Serial algorithm
Each row k of B contributes to the n partial outer products:

    for k := 0 to n-1
      C[:,:] += A[:,k] ⊗ B[k,:]

(Figure: column k of A times row k of B, one outer product.)

Parallel algorithm
Processors are organized into rows and columns; a process rank is an ordered pair.
Processor geometry P = px × py.
Blocked (serial) matrix multiply, panel size b << N/max(px, py):

    for k := 0 to n-1 by b
      multicast A[:, k:k+b-1] along processor rows
      multicast B[k:k+b-1, :] along processor columns
      C += A[:, k:k+b-1] * B[k:k+b-1, :]      // local matrix multiply

Each row and column of processors independently participates in a panel broadcast; the owner of the panel changes with k.
(Figure: 3 × 3 grid of processes p(0,0) through p(2,2).)

Parallel matrix multiplication
Assume p is a perfect square.
Each processor gets an n/√p × n/√p chunk of data.
Organize processors into rows and columns; a process rank is an ordered pair of integers.
Assume that we have an efficient serial matrix multiply.
(Figure: 3 × 3 grid of processes p(0,0) through p(2,2).)

What is the performance?

    for k := 0 to n-1 by b
      // Tree broadcast: lg p (α + bβn/√p)
      // For long messages: 2((√p - 1)/√p) bβn
      multicast A[:, k:k+b-1] along rows
      multicast B[k:k+b-1, :] along columns
      // Built-in matrix multiply: 2(n/√p)² b
      C += A[:, k:k+b-1] * B[k:k+b-1, :]

Total running time: ~ 2n³/p + 4βbn/√p

Performance highlights of SUMMA
Running time = 2n³/p + 4βbn/√p
Efficiency = O(1/(1 + √p/n²))
Generality: handles non-square matrices and non-square processor geometries.
Adjust b to trade off latency cost against memory: a small b uses less memory but gives lower efficiency; a large b uses more memory but gives higher efficiency.
Low temporary storage: grows like 2bn/√p.
A variant is used in ScaLAPACK.
R. van de Geijn and J. Watts, "SUMMA: Scalable universal matrix multiplication algorithm," Concurrency: Practice and Experience, 9:255-274 (1997). www.netlib.org/lapack/lawns/lawn96.ps
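Taking the running time above at face value and treating 2βb as a constant, the stated efficiency follows directly; a short derivation, added here for clarity (LaTeX notation):

    E = \frac{T_{\mathrm{serial}}}{p\,T_p}
      = \frac{2n^3}{p\left(2n^3/p + 4\beta b n/\sqrt{p}\right)}
      = \frac{1}{1 + 2\beta b \sqrt{p}/n^2}
      = O\!\left(\frac{1}{1 + \sqrt{p}/n^2}\right)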

Communication domains
SUMMA motivates MPI communication domains: derive communicators that naturally reflect the communication structure along the rows and columns of the processor geometry.

    (Figure: a 4 × 4 process grid with column groups X0-X3 and row groups Y0-Y3.)

         X0   X1   X2   X3
    Y0   P0   P1   P2   P3
    Y1   P4   P5   P6   P7
    Y2   P8   P9   P10  P11
    Y3   P12  P13  P14  P15

Communication domains
A communication domain is a name space specified by an MPI communicator.
Messages remain within their domain.
Communication domains simplify the code by specifying subsets of processes that may communicate.
A processor may be a member of more than one communication domain.

Splitting communicators
Each process computes a key based on its rank.
Derived communicators group together processes that have the same key.
Each process has a rank relative to the new communicator.
If a process is a member of several communicators, it will have a rank within each one.

Splitting communicators for SUMMA
Create a communicator for each row and column.
Group the processors by row: key = myid div P.
Thus, if P = 4:
Processes 0, 1, 2, 3 are in one communicator because they share the same value of key (0).
Processes 4, 5, 6, 7 are in another (key 1), and so on.

MPI support
MPI_Comm_split() is the workhorse:

    int MPI_Comm_split(MPI_Comm comm, int splitkey, int rankkey, MPI_Comm *newcomm);

It is a collective call. Each process receives a new communicator, newcomm, which it shares with the other processes having the same splitkey value.

Establishing row communicators

    MPI_Comm rowComm;
    MPI_Comm_split(MPI_COMM_WORLD, myrank / P, myrank, &rowComm);
    MPI_Comm_rank(rowComm, &myRow);

    (Figure: 4 × 4 process grid; the split key myrank / P is the same across each row.)

         X0   X1   X2   X3   key
    Y0   P0   P1   P2   P3    0
    Y1   P4   P5   P6   P7    1
    Y2   P8   P9   P10  P11   2
    Y3   P12  P13  P14  P15   3

Ranks apply only to the respective communicator and are ordered according to myrank.
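The column communicators (not shown on this slide) can be created the same way; a hedged sketch, assuming the same P × P grid and the myrank and P variables above, using the column index myrank % P as the split key:

    /* Companion sketch: group processes that share a column of the grid.
     * The rank within colComm is the position along the column, i.e. this
     * process's row index. */
    MPI_Comm colComm;
    MPI_Comm_split(MPI_COMM_WORLD, myrank % P, myrank, &colComm);
    int myRankInCol;
    MPI_Comm_rank(colComm, &myRankInCol);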

More on Comm_split

    int MPI_Comm_split(MPI_Comm comm, int splitkey, int rankkey, MPI_Comm *newcomm);

Ranks are assigned arbitrarily among processes sharing the same rankkey value.
A process may be excluded by passing the constant MPI_UNDEFINED as the splitkey; a special MPI_COMM_NULL communicator is then returned to it.
The alternative is to enumerate all the nodes in a set and call MPI_Group_incl() and MPI_Comm_create().
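That group-based alternative can look like the hedged sketch below, which builds a communicator from an explicit list of ranks; the function name and the ranks array are illustrative, not from the slides.

    #include <mpi.h>

    /* Illustrative sketch: create a communicator for an explicit set of
     * ranks of MPI_COMM_WORLD using groups instead of MPI_Comm_split. */
    MPI_Comm make_subset_comm(int *ranks, int nranks)
    {
        MPI_Group worldGroup, subGroup;
        MPI_Comm subComm;
        MPI_Comm_group(MPI_COMM_WORLD, &worldGroup);
        MPI_Group_incl(worldGroup, nranks, ranks, &subGroup);
        /* Collective over MPI_COMM_WORLD; processes not listed in ranks[]
         * receive MPI_COMM_NULL. */
        MPI_Comm_create(MPI_COMM_WORLD, subGroup, &subComm);
        MPI_Group_free(&subGroup);
        MPI_Group_free(&worldGroup);
        return subComm;
    }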

Panel broadcast
Each row/column calls Bcast, a multicast.
The contributing row/column circulates across and downward.

    foreach step in 0 to n by panel
      Ring_Bcast(current column, comm_col)
      Ring_Bcast(current row, comm_row)
      DGEMM( )
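Putting the pieces together, below is a hedged C sketch of the SUMMA loop, following the earlier slide's convention of multicasting the A panel along processor rows and the B panel along processor columns. It assumes the row and column communicators created earlier, a P × P grid with local block size nlocal = n/√p, a panel width b that divides nlocal, the RING_Bcast() routine shown on the next slide, and hypothetical helpers copy_col_panel(), copy_row_panel(), and local_mm() (a serial C += A*B kernel, e.g. a dgemm call) that are not part of the lecture code.

    #include <mpi.h>

    /* Assumed helpers (hypothetical): */
    void RING_Bcast(double *buf, int count, MPI_Datatype type, int root, MPI_Comm comm);
    void copy_col_panel(const double *A, double *panel, int nlocal, int b, int kloc);
    void copy_row_panel(const double *B, double *panel, int nlocal, int b, int kloc);
    void local_mm(int nlocal, int b, const double *Apanel, const double *Bpanel, double *C);

    /* SUMMA sketch: A, B, C are this process's nlocal x nlocal blocks;
     * gridRow = myrank / P and gridCol = myrank % P are its grid coordinates. */
    void summa(int nlocal, int b, int P, int gridRow, int gridCol,
               const double *A, const double *B, double *C,
               double *Apanel, double *Bpanel,
               MPI_Comm rowComm, MPI_Comm colComm)
    {
        int n = nlocal * P;                    /* global matrix dimension */
        for (int k = 0; k < n; k += b) {
            int owner = k / nlocal;            /* grid row/column owning panel k */
            if (gridCol == owner)              /* my piece of A[:, k:k+b-1] */
                copy_col_panel(A, Apanel, nlocal, b, k % nlocal);
            if (gridRow == owner)              /* my piece of B[k:k+b-1, :] */
                copy_row_panel(B, Bpanel, nlocal, b, k % nlocal);
            RING_Bcast(Apanel, nlocal * b, MPI_DOUBLE, owner, rowComm);
            RING_Bcast(Bpanel, nlocal * b, MPI_DOUBLE, owner, colComm);
            local_mm(nlocal, b, Apanel, Bpanel, C);   /* C += Apanel * Bpanel */
        }
    }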

Ring Bcast

    void RING_Bcast(double *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
    {
        int rank, np;
        MPI_Status status;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &np);
        if (rank != root)
            MPI_Recv(buf, count, type, (rank - 1 + np) % np, MPI_ANY_TAG, comm, &status);
        if ((rank + 1) % np != root)
            MPI_Send(buf, count, type, (rank + 1) % np, 0, comm);
    }

Assignment #5
Implement dense-matrix conjugate gradient.
Groups of 2: MPI or CUDA.
Groups of 3: both MPI and CUDA.