Collective Communications II

Ned Nedialkov
McMaster University, Canada
SE/CS 4F03
January 2014

Outline
- Scatter
- Example: parallel A*b
- Distributing a matrix
- Gather
- Serial A*b
- Parallel A*b
- Allocating memory
- Final remarks

Scatter

MPI_Scatter sends data from one process to all other processes in a communicator:

  int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  void *recvbuf, int recvcount, MPI_Datatype recvtype,
                  int root, MPI_Comm comm)

  sendbuf    starting address of send buffer (significant only at root)
  sendcount  number of elements sent to each process (significant only at root)
  sendtype   data type of sendbuf elements (significant only at root)
  recvbuf    address of receive buffer
  recvcount  number of elements for any single receive
  recvtype   data type of recvbuf elements
  root       rank of sending process
  comm       communicator

Assume p processes. MPI_Scatter splits the data at sendbuf on root into p segments, each of sendcount elements, and sends these segments to processes 0, 1, ..., p-1 in order.
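
As a minimal usage sketch (not from the slides; the buffer names and the choice sendcount = 2 are invented for illustration), root 0 scatters two ints to every process, including itself:

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int p, rank, i;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &p);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int sendcount = 2;             /* elements per process */
      int *data = NULL;
      int recv[2];

      if (rank == 0)
      {
          /* only root needs the full send buffer of p*sendcount elements */
          data = malloc(sizeof(int)*p*sendcount);
          for (i = 0; i < p*sendcount; i++)
              data[i] = i;
      }

      /* segment i of data goes to process i */
      MPI_Scatter(data, sendcount, MPI_INT, recv, sendcount, MPI_INT,
                  0, MPI_COMM_WORLD);

      printf("process %d received %d %d\n", rank, recv[0], recv[1]);

      if (rank == 0) free(data);
      MPI_Finalize();
      return 0;
  }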

MPI_Scatterv generalizes MPI_Scatter by allowing a different count and displacement for each process:

  int MPI_Scatterv(void *sendbuf, int *sendcounts, int *displs,
                   MPI_Datatype sendtype, void *recvbuf, int recvcount,
                   MPI_Datatype recvtype, int root, MPI_Comm comm)

  sendcounts  integer array (of size p) specifying the number of elements to send to each process
  displs      integer array (of size p); entry i specifies the displacement (relative to sendbuf) from which to take the outgoing data to process i
  recvbuf     address of receive buffer (output)

For an illustration, see http://www.mpi-forum.org/docs/mpi-1.1/mpi-11-html/node72.html

Example: parallel matrix times vector

We want to compute the product y = A*b, where A ∈ R^(m×n) and b ∈ R^n, on a distributed-memory machine.

Given p processes:
- distribute the work equally
- each process multiplies its part
- process 0 collects the result y

How to distribute the work?

In this example, each process
- stores b
- stores m/p rows of A
- multiplies (m/p rows of A)*b

What if p does not divide m?

Process i works on r_i rows, where

  r_i = floor(m/p) + (1 if i < m mod p, 0 otherwise)

In C, / is integer division and % gives the remainder:

  #define NUM_ROWS(i,p,m) ( m/p + ( (i<m%p)?1:0) )
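
As a worked example (not in the original slides): with m = 10 rows and p = 4 processes, m/p = 2 and m mod p = 2, so processes 0 and 1 get 3 rows each and processes 2 and 3 get 2 rows each. The macro reproduces this:

  #include <stdio.h>

  #define NUM_ROWS(i,p,m) ( m/p + ( (i<m%p)?1:0) )

  int main(void)
  {
      int i, p = 4, m = 10;
      for (i = 0; i < p; i++)
          printf("process %d gets %d rows\n", i, NUM_ROWS(i, p, m));
      /* prints 3, 3, 2, 2 -- the row counts sum to m */
      return 0;
  }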

Example: distributing a matrix

  #define NUM_ROWS(i,p,m) ( m/p + ( (i<m%p)?1:0) )

  /* Distribute matrix A of size num_rows x num_cols to p processes.
     A is stored as a one-dimensional array of size num_rows*num_cols.
     B is of size local_num_rows*num_cols, where
     local_num_rows = NUM_ROWS(my_rank, p, num_rows); */
  int *displs, *sendcounts;
  if (my_rank == 0)
  {
      displs = malloc(sizeof(int)*p);
      sendcounts = malloc(sizeof(int)*p);
      sendcounts[0] = NUM_ROWS(0,p,num_rows)*num_cols;
      displs[0] = 0;
      for (i = 1; i < p; i++)
      {
          displs[i] = displs[i-1] + sendcounts[i-1];
          sendcounts[i] = NUM_ROWS(i,p,num_rows)*num_cols;
      }
  }
  MPI_Scatterv(A, sendcounts, displs, MPI_DOUBLE, B,
               local_num_rows*num_cols, MPI_DOUBLE, 0, MPI_COMM_WORLD);

Gather

MPI_Gather gathers together data from a group of processes:

  int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 int root, MPI_Comm comm)

  sendbuf    starting address of send buffer
  sendcount  number of elements in send buffer
  sendtype   data type of sendbuf elements
  recvbuf    address of receive buffer (significant only at root)
  recvcount  number of elements for any single receive (significant only at root)
  recvtype   data type of recvbuf elements (significant only at root)
  root       rank of receiving process
  comm       communicator

MPI_Gather collects data, stored at sendbuf, from each process in comm and stores it on root at recvbuf.
- Data is received from processes in order, i.e. from process 0, then from process 1, and so on
- Usually sendcount, sendtype are the same as recvcount, recvtype
- root and comm must be the same on all processes
- The receive parameters are significant only on root
- The amount of data sent must equal the amount of data received
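
A minimal usage sketch (again not from the slides, with invented variable names): every process contributes one int and root 0 receives p of them in rank order:

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int p, rank, i;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &p);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int mine = rank*rank;                /* each process contributes one value */
      int *all = NULL;
      if (rank == 0)
          all = malloc(sizeof(int)*p);     /* receive buffer matters only on root */

      MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

      if (rank == 0)
      {
          for (i = 0; i < p; i++)
              printf("from process %d: %d\n", i, all[i]);
          free(all);
      }
      MPI_Finalize();
      return 0;
  }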

MPI_Allgather

  int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                    void *recvbuf, int recvcount, MPI_Datatype recvtype,
                    MPI_Comm comm)

The block of data sent from the ith process is received by every process and placed in the ith block of the buffer recvbuf.

  int MPI_Allgatherv(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                     void *recvbuf, int *recvcounts, int *displs,
                     MPI_Datatype recvtype, MPI_Comm comm)

The opposite of MPI_Scatterv.
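
For comparison, a possible MPI_Allgather sketch (not in the slides; names are illustrative): there is no root, and every process ends up with the full gathered array:

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int p, rank, i;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &p);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int mine = 10*rank;
      int *all = malloc(sizeof(int)*p);    /* every process needs the full buffer */

      MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

      printf("process %d sees:", rank);
      for (i = 0; i < p; i++)
          printf(" %d", all[i]);
      printf("\n");

      free(all);
      MPI_Finalize();
      return 0;
  }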

Serial A*b

  /* dotproduct.c
     compmatrixtimesvector computes the result of matrix times vector.
     The input matrix A is of size m x n, and it is stored as a
     one-dimensional array. The result is stored at y, which contains
     the computed y = A*b.
     Copyright 2014 Ned Nedialkov */
  void compmatrixtimesvector(int m, int n, const double *A,
                             const double *b, double *y)
  {
      int i, j;
      for (i = 0; i < m; i++)
      {
          y[i] = 0.0;
          for (j = 0; j < n; j++)
          {
              const double *pa = A + i * n;
              y[i] += *(pa + j) * b[j];
          }
      }
  }

Parallel A*b

  /* parallelmatvec.c
     parallelmatrixtimesvector performs parallel matrix-vector
     multiplication of a matrix A times vector b. The matrix is
     distributed by rows. Each process contains a
     (num_local_rows) x (cols) matrix local_a stored as a
     one-dimensional array. The vector b is stored on each process.
     Each process computes its result and then process root collects
     the results and returns them in y.

     num_local_rows  number of rows on my_rank
     cols            number of columns on each process
     local_a         pointer to the matrix on my_rank
     b               pointer to the vector b of size cols
     y               pointer to the result on the root process;
                     y is significant only on root

     Copyright 2014 Ned Nedialkov */

Parallel A*b (continued)

  #include <mpi.h>
  #include <stdlib.h>

  void compmatrixtimesvector(int m, int n, const double *A,
                             const double *b, double *y);

  void parallelmatrixtimesvector(int num_local_rows, int cols,
                                 double *local_a, double *b, double *y,
                                 int root, int my_rank, int p,
                                 MPI_Comm comm)
  {
      /* Allocate memory for the local result on my_rank */
      double *local_y = malloc(sizeof(double)*num_local_rows);

      /* Compute the local matrix times vector */
      compmatrixtimesvector(num_local_rows, cols, local_a, b, local_y);

      /* Gather the result on process 0.
         recvcounts[i] is the number of doubles to be received from process i */
      int *recvcounts;
      if (my_rank == root)
          recvcounts = malloc(sizeof(int)*p);

      /* Gather num_local_rows from each process */
      MPI_Gather(&num_local_rows, 1, MPI_INT, recvcounts, 1, MPI_INT,
                 root, comm);

      /* Calculate displs for MPI_Gatherv */
      int *displs;
      if (my_rank == root)
      {
          displs = malloc(sizeof(int)*p);
          displs[0] = 0;
          int i;
          for (i = 1; i < p; i++)
              displs[i] = displs[i-1] + recvcounts[i-1];
      }

      /* Gather y */
      MPI_Gatherv(local_y, num_local_rows, MPI_DOUBLE, y, recvcounts,
                  displs, MPI_DOUBLE, root, comm);

      /* Free memory */
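
To see how the pieces fit together, here is a hypothetical driver, a sketch under assumptions rather than the code from parmatvec.zip: it distributes A with MPI_Scatterv as on the "distributing a matrix" slide and then calls parallelmatrixtimesvector. The names main, A, B, y and the test sizes are invented for illustration.

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define NUM_ROWS(i,p,m) ( m/p + ( (i<m%p)?1:0) )

  /* provided by the slides' parallelmatvec.c (and dotproduct.c) */
  void parallelmatrixtimesvector(int num_local_rows, int cols,
                                 double *local_a, double *b, double *y,
                                 int root, int my_rank, int p,
                                 MPI_Comm comm);

  int main(int argc, char **argv)
  {
      int p, my_rank, i;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &p);
      MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

      int num_rows = 8, num_cols = 4;      /* small test sizes */
      int local_num_rows = NUM_ROWS(my_rank, p, num_rows);

      /* b is replicated on every process */
      double *b = malloc(sizeof(double)*num_cols);
      for (i = 0; i < num_cols; i++) b[i] = 1.0;

      double *A = NULL, *y = NULL;
      int *sendcounts = NULL, *displs = NULL;
      if (my_rank == 0)
      {
          A = malloc(sizeof(double)*num_rows*num_cols);
          y = malloc(sizeof(double)*num_rows);
          for (i = 0; i < num_rows*num_cols; i++) A[i] = i;
          sendcounts = malloc(sizeof(int)*p);
          displs = malloc(sizeof(int)*p);
          sendcounts[0] = NUM_ROWS(0,p,num_rows)*num_cols;
          displs[0] = 0;
          for (i = 1; i < p; i++)
          {
              displs[i] = displs[i-1] + sendcounts[i-1];
              sendcounts[i] = NUM_ROWS(i,p,num_rows)*num_cols;
          }
      }

      /* distribute the rows of A, multiply locally, gather y on root */
      double *B = malloc(sizeof(double)*local_num_rows*num_cols);
      MPI_Scatterv(A, sendcounts, displs, MPI_DOUBLE, B,
                   local_num_rows*num_cols, MPI_DOUBLE, 0, MPI_COMM_WORLD);
      parallelmatrixtimesvector(local_num_rows, num_cols, B, b, y,
                                0, my_rank, p, MPI_COMM_WORLD);

      if (my_rank == 0)
          for (i = 0; i < num_rows; i++)
              printf("y[%d] = %g\n", i, y[i]);

      free(B); free(b);
      if (my_rank == 0) { free(A); free(y); free(sendcounts); free(displs); }
      MPI_Finalize();
      return 0;
  }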

Allocating memory

malloc may return NULL, and then MPI may crash. The following function is a better malloc:

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  void *Malloc(int num, int size, MPI_Comm comm, int rank)
  {
      void *b = malloc(num*size);
      if (b == NULL)
      {
          fprintf(stderr, "*** PROCESS %d: MALLOC COULD NOT ALLOCATE"
                  " %d ELEMENTS OF SIZE %d BYTES\n", rank, num, size);
          MPI_Abort(comm, 1);
      }
      return b;
  }
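
For example (illustrative usage, not from the slides), the earlier allocations can then be replaced by calls such as:

  int *recvcounts = Malloc(p, sizeof(int), comm, my_rank);
  double *local_y = Malloc(num_local_rows, sizeof(double), comm, my_rank);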

Final remarks
- The amount of data sent must match the amount of data received
- Blocking versions only
- No tags: calls are matched according to order of execution
- A collective function can return as soon as its participation is complete

The complete code is at
http://www.cas.mcmaster.ca/~nedialk/courses/4f03/code/parmatvec.zip
Type make, run ./do_all, and study the code and the makefile.

(c) 2013-14 Ned Nedialkov