Advanced Message-Passing Interface (MPI)

Bart Oldeman, Calcul Québec / McGill HPC, Bart.Oldeman@mcgill.ca

Outline of the workshop

Morning: Advanced MPI
- Revision
- More on Collectives
- More on Point-to-Point
- Datatypes and Packing
- Communicators and Groups
- Topologies

Afternoon: Hybrid MPI/OpenMP
- Theory and benchmarking
- Examples

What is MPI?

MPI is a specification for a standardized library: http://www.mpi-forum.org
- You use its subroutines.
- You link it with your code.

History: MPI-1 (1994), MPI-2 (1997), MPI-3 (2012). Different implementations: MPICH(2), MVAPICH(2), OpenMPI, HP-MPI, ...

MPI-3 contains a more modern Fortran interface (use mpi_f08), less prone to errors. It is still very new, but implemented, for instance, in OpenMPI 1.7.
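As a small illustration of the point that MPI is a specification with several interchangeable implementations (this sketch is not part of the original slides), the program below asks the linked library which version of the standard it implements; MPI_Get_version is itself part of the standard, so it works with any of the implementations listed above.

    #include <stdio.h>
    #include <mpi.h>

    /* Minimal sketch: report which MPI standard version the linked library implements. */
    int main(int argc, char *argv[])
    {
        int version, subversion;
        MPI_Init(&argc, &argv);
        MPI_Get_version(&version, &subversion);   /* e.g. 3 and 0 for an MPI-3.0 library */
        printf("This library implements MPI %d.%d\n", version, subversion);
        MPI_Finalize();
        return 0;
    }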

Review: MPI routines we know

- Startup and exit: MPI_Init, MPI_Finalize
- Information on the processes: MPI_Comm_rank, MPI_Comm_size
- Point-to-point communications: MPI_Send, MPI_Recv, MPI_Irecv, MPI_Isend, MPI_Wait
- Collective communications: MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Gather

Example: Hello from N cores

Fortran:

    PROGRAM hello
      USE mpi
      INTEGER ierr, rank, size
      CALL MPI_Init(ierr)
      CALL MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      CALL MPI_Comm_size(MPI_COMM_WORLD, size, ierr)
      WRITE(*,*) 'Hello from processor', rank, 'of', size
      CALL MPI_Finalize(ierr)
    END PROGRAM hello

C:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from processor %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

More on Collectives

- "All" functions: MPI_Allgather, MPI_Allreduce. These combine MPI_Gather/MPI_Reduce with MPI_Bcast: all ranks receive the resulting data.
- MPI_Alltoall: everybody gathers subsequent blocks. Works like a matrix transpose.
- "v" functions: MPI_Scatterv, MPI_Gatherv, MPI_Allgatherv, MPI_Alltoallv. Instead of a single count argument, they take counts and displs arrays that specify the counts and array displacements for every rank involved (a small sketch follows below).
- MPI_Barrier: synchronization.
- MPI_Abort: abort with an error code.
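To make the "v" collectives concrete, here is a hedged sketch (not from the original slides) in which every rank contributes a different number of elements and rank 0 collects them with MPI_Gatherv; the counts and displs arrays play exactly the roles described above.

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    /* Sketch: rank r contributes (r + 1) integers; rank 0 gathers them
     * with MPI_Gatherv using per-rank counts and displacements. */
    int main(int argc, char *argv[])
    {
        int rank, size, i, r;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int mycount = rank + 1;                /* varies per rank */
        int *mydata = malloc(mycount * sizeof(int));
        for (i = 0; i < mycount; i++)
            mydata[i] = rank;

        int *counts = NULL, *displs = NULL, *result = NULL, total = 0;
        if (rank == 0) {
            counts = malloc(size * sizeof(int));
            displs = malloc(size * sizeof(int));
            for (r = 0; r < size; r++) {
                counts[r] = r + 1;             /* how many elements rank r sends */
                displs[r] = total;             /* where they go in the receive buffer */
                total += counts[r];
            }
            result = malloc(total * sizeof(int));
        }

        MPI_Gatherv(mydata, mycount, MPI_INT,
                    result, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("rank 0 gathered %d elements in total\n", total);

        MPI_Finalize();
        return 0;
    }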

Exercise 1: MPI_Alltoall

Log in and compile the file alltoall.f90 or alltoall.c:

    cp /software/workshop/advancedmpi/* .
    module add ifort icc openmpi
    mpicc alltoall.c -o alltoall
    mpif90 alltoall.f90 -o alltoall

There are errors. Can you fix them? Hint: type man MPI_Alltoall to obtain the syntax for the MPI function. To submit the job, use

    msub -q class alltoall.pbs

Exercise 2: Matrix-vector multiplication

Complete the multiplication in mv.f90 or mv.c using MPI_Allgatherv. Rows of the matrix are distributed among processors. Example, with rows 1 and 2 in rank 0 and row 3 in rank 1:

    v = Ax = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} a_{11} x_1 + a_{12} x_2 + a_{13} x_3 \\ a_{21} x_1 + a_{22} x_2 + a_{23} x_3 \\ a_{31} x_1 + a_{32} x_2 + a_{33} x_3 \end{pmatrix} = \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix}

Rank 0 computes v_1 and v_2, rank 1 computes v_3; MPI_Allgatherv then gives every rank the complete vector v.

Exercise 3: Matrix-vector multiplication

Complete the multiplication in mv2.f90 or mv2.c using MPI_Alltoallv. Columns of the matrix and the input vector are distributed among processors. Example, with columns 1 and 2 in rank 0 and column 3 in rank 1: each rank forms partial sums from its own columns,

    rank 0: \begin{pmatrix} a_{11} x_1 + a_{12} x_2 \\ a_{21} x_1 + a_{22} x_2 \\ a_{31} x_1 + a_{32} x_2 \end{pmatrix}, \qquad rank 1: \begin{pmatrix} a_{13} x_3 \\ a_{23} x_3 \\ a_{33} x_3 \end{pmatrix}

After MPI_Alltoallv exchanges these partial sums, each rank adds the pieces for its own rows to obtain its part of v = (v_1, v_2, v_3). Note: one could also use MPI_Reduce or MPI_Allreduce here.
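For orientation, here is a hedged sketch of the row-distributed pattern that Exercise 2 asks for. It is not the workshop's mv.c: it keeps the full matrix on every rank purely for brevity, with a hypothetical 3x3 example, and only shows how the local rows and MPI_Allgatherv fit together.

    #include <stdio.h>
    #include <mpi.h>

    /* Hedged sketch (not mv.c): each rank owns some rows of A, computes its
     * part of v = Ax, and MPI_Allgatherv assembles the full v on every rank. */
    #define N 3

    int main(int argc, char *argv[])
    {
        double A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};   /* full matrix kept everywhere for brevity */
        double x[N] = {1, 1, 1};
        double v[N], vlocal[N];
        int rank, size, i, j, r;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Split the N rows as evenly as possible over the ranks. */
        int counts[size], displs[size], offset = 0;
        for (r = 0; r < size; r++) {
            counts[r] = N / size + (r < N % size ? 1 : 0);
            displs[r] = offset;
            offset += counts[r];
        }

        /* Each rank computes only its own rows of v. */
        for (i = 0; i < counts[rank]; i++) {
            int row = displs[rank] + i;
            vlocal[i] = 0.0;
            for (j = 0; j < N; j++)
                vlocal[i] += A[row][j] * x[j];
        }

        /* Everybody receives the complete result vector. */
        MPI_Allgatherv(vlocal, counts[rank], MPI_DOUBLE,
                       v, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

        if (rank == 0)
            printf("v = %g %g %g\n", v[0], v[1], v[2]);

        MPI_Finalize();
        return 0;
    }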

More on point-to-point

- MPI_Ssend: synchronous send, forced to complete only when the matching receive has been posted.
- MPI_Bsend: buffered send using a user-provided buffer.
- MPI_Rsend: ready send, must come after the matching receive has been posted. Rarely used.
- MPI_Issend, MPI_Ibsend, MPI_Irsend: asynchronous versions.
- MPI_Sendrecv[_replace]: sends and receives in one call, avoiding deadlock (like MPI_Irecv, MPI_Isend, MPI_Wait).

Note: generally plain MPI_Recv and MPI_Send are best.

Packing and Datatypes

These functions create new data types:
- MPI_Type_contiguous, MPI_Type_vector, MPI_Type_indexed: transfer parts of a matrix directly.
- MPI_Type_struct: transfer a struct.
- MPI_Pack, MPI_Unpack: pack and send heterogeneous data.

Note: double precision variables can (on all current machines) contain 53-bit integers without loss of precision, so an alternative is to pack manually into a double precision array.

Pack example

    integer m
    double precision x(m)
    call MPI_Pack_size(1, MPI_INTEGER, MPI_COMM_WORLD, size_int, ierr)
    call MPI_Pack_size(m, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, size_double, ierr)
    bufsize = size_int + size_double
    allocate(buffer(bufsize))
    pos = 0
    if (rank == 0) then
       call MPI_Pack(m, 1, MPI_INTEGER, buffer, bufsize, pos, MPI_COMM_WORLD, ierr)
       call MPI_Pack(x, m, MPI_DOUBLE_PRECISION, buffer, bufsize, pos, &
                     MPI_COMM_WORLD, ierr)
    endif
    call MPI_Bcast(buffer, bufsize, MPI_PACKED, 0, MPI_COMM_WORLD, ierr)
    if (rank > 0) then
       call MPI_Unpack(buffer, bufsize, pos, m, 1, &
                       MPI_INTEGER, MPI_COMM_WORLD, ierr)
       call MPI_Unpack(buffer, bufsize, pos, x, m, &
                       MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, ierr)
    endif

Communicators

So far we have only used MPI_COMM_WORLD. This communicator can be split into subsets, to allow collective operations on a subset of ranks. Easiest to use is MPI_Comm_split(comm, color, key, newcomm[, ierror]) (see the sketch below):
- comm: old communicator
- color: all processes with the same color go into the same new communicator
- key: rank within the new communicator (can be 0 for automatic determination)
- newcomm: resulting new communicator
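As an illustration of MPI_Comm_split (this sketch is not from the original slides), the program below splits MPI_COMM_WORLD into two sub-communicators by even/odd rank and performs a reduction within each half; the color and key arguments are used exactly as described above.

    #include <stdio.h>
    #include <mpi.h>

    /* Sketch: split MPI_COMM_WORLD by even/odd rank and sum the world ranks
     * separately within each sub-communicator. */
    int main(int argc, char *argv[])
    {
        int wrank, wsize;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
        MPI_Comm_size(MPI_COMM_WORLD, &wsize);

        int color = wrank % 2;          /* 0 = even ranks, 1 = odd ranks */
        MPI_Comm subcomm;
        MPI_Comm_split(MPI_COMM_WORLD, color, 0, &subcomm);   /* key = 0: keep original order */

        int subrank, sum;
        MPI_Comm_rank(subcomm, &subrank);
        MPI_Allreduce(&wrank, &sum, 1, MPI_INT, MPI_SUM, subcomm);

        printf("world rank %d is rank %d in color %d; sum of world ranks there = %d\n",
               wrank, subrank, color, sum);

        MPI_Comm_free(&subcomm);
        MPI_Finalize();
        return 0;
    }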

Topologies

Topologies group processes in an n-dimensional grid (Cartesian) or a graph. Here we restrict ourselves to a Cartesian 2D grid. Topologies help the programmer and (sometimes) the hardware.

- MPI_Dims_create(p, n, dims): create a balanced n-dimensional grid for p processes in the n-dimensional array dims.
- MPI_Cart_create(oldcomm, n, dims, periodic, reorder, newcomm): creates a new communicator for a grid with n dimensions given in dims, with periodicity implied by the array periodic. reorder specifies whether the ranks may change for the new communicator.
- MPI_Cart_rank(comm, coords, rank): given n-dimensional coordinates, return the rank.
- MPI_Cart_coords(comm, rank, n, coords): given the rank, return the n coordinates.

(A small sketch of these calls follows after Exercise 4.)

Exercise 4: Matrix-vector multiplication

Complete the multiplication in mv3.f90 or mv3.c using a Cartesian topology. Blocks of the matrix are distributed among processors. Example:
- rows 1-2, columns 1-2 in rank 0 (0,0)
- rows 1-2, column 3 in rank 1 (0,1)
- row 3, columns 1-2 in rank 2 (1,0)
- row 3, column 3 in rank 3 (1,1)

    v = Ax = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}

Each block computes partial sums from the columns it holds:

    \begin{pmatrix} a_{11} x_1 + a_{12} x_2 \\ a_{21} x_1 + a_{22} x_2 \end{pmatrix} + \begin{pmatrix} a_{13} x_3 \\ a_{23} x_3 \end{pmatrix} = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}, \qquad ( a_{31} x_1 + a_{32} x_2 ) + ( a_{33} x_3 ) = v_3

Use an MPI_Reduce call to obtain v. Advantage: both the vectors and the matrix can be distributed in memory.
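As an illustration of the topology calls above (this sketch is not from the original slides and is not the solution to mv3.c), the program below builds a 2D Cartesian communicator with MPI_Dims_create and MPI_Cart_create and lets every process report its grid coordinates.

    #include <stdio.h>
    #include <mpi.h>

    /* Sketch: arrange all processes in a balanced, non-periodic 2D grid
     * and report each process's (row, column) coordinates. */
    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int dims[2] = {0, 0};            /* 0 means: let MPI choose this dimension */
        int periodic[2] = {0, 0};        /* no wrap-around in either direction */
        MPI_Dims_create(size, 2, dims);

        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periodic, 1, &cart);   /* reorder = 1 */

        int crank, coords[2];
        MPI_Comm_rank(cart, &crank);
        MPI_Cart_coords(cart, crank, 2, coords);

        printf("rank %d sits at (%d,%d) in a %dx%d grid\n",
               crank, coords[0], coords[1], dims[0], dims[1]);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }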

Hybrid MPI and OpenMP

Most clusters, including Guillimin, contain multicore nodes; Guillimin has 12 cores per node. Idea: use hybrid MPI and OpenMP, with MPI for internode communication and OpenMP intranode, eliminating intranode communication. This may or may not run faster than pure MPI code.

First step: measure efficiency

Insert MPI_Wtime calls to measure wall-clock time. Run for various values of p to determine scaling.

Amdahl's law

Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup ψ achievable by a parallel computer with p processors performing the computation is

    ψ ≤ 1 / (f + (1 − f)/p)

Example: if f = 0.0035 then the maximum speedup is about 285 as p → ∞, and for p = 1024, ψ ≈ 223.

Karp-Flatt metric

We can also determine the experimentally determined serial fraction e from the measured speedup ψ:

    e = (1/ψ − 1/p) / (1 − 1/p)

Example: p = 2, ψ = 1.95, e = 0.026. Example: p = 1024, ψ = 200, e = 0.0040.
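To make the benchmarking step concrete, here is a hedged sketch (not from the slides) of timing a code region with MPI_Wtime; the Karp-Flatt fraction e is then computed from the measured speedup ψ using the formula above. The do_work function is a placeholder for the real computation.

    #include <stdio.h>
    #include <mpi.h>

    /* Sketch: time a placeholder work region with MPI_Wtime and report the
     * maximum elapsed time over all ranks, which is what limits the speedup. */
    static void do_work(void)
    {
        volatile double s = 0.0;                 /* placeholder computation */
        for (long i = 0; i < 10000000L; i++)
            s += 1.0 / (i + 1.0);
    }

    int main(int argc, char *argv[])
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);             /* start everyone together */
        double t0 = MPI_Wtime();
        do_work();
        double elapsed = MPI_Wtime() - t0;

        double tmax;
        MPI_Reduce(&elapsed, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("wall-clock time: %.3f s (slowest rank)\n", tmax);
            /* speedup psi = serial time / tmax; then e = (1/psi - 1/p) / (1 - 1/p) */

        MPI_Finalize();
        return 0;
    }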

When to consider hybrid?

- If the serial portion is too expensive to parallelize using MPI but can be done using threads: definitely!
- If the problem does not scale well due to excessive communication (e increases significantly as p increases): maybe. Perhaps the MPI performance can be improved instead:
  - Fewer messages (less latency).
  - Shorter messages.
  - Replace communication by computation where possible.
  - Example: for broadcasts, tree-like communication is much more efficient than sending from the master process directly to all other processes (fewer messages in the master process).
  - Analysts are here to help you optimize your code!
- Otherwise pure MPI can be just as fast. Also, you must look out for OpenMP pitfalls: caching, false sharing, synchronization overhead, races.

Example job script for Guillimin

For 48 CPU cores on 4 nodes with 12 cores each:

    #!/bin/bash
    #PBS -l nodes=4:ppn=12
    #PBS -V
    #PBS -N jobname
    cd $PBS_O_WORKDIR
    export IPATH_NO_CPUAFFINITY=1
    export OMP_NUM_THREADS=12
    mpiexec -n 4 -npernode 1 ./yourcode

The particular features of this submission script are as follows:
- export IPATH_NO_CPUAFFINITY=1: tells the underlying software not to pin each process to one CPU core, which would effectively disable OpenMP parallelism.
- export OMP_NUM_THREADS=12: specifies the number of threads used for OpenMP for all 4 processes.
- mpiexec -n 4 -npernode 1 ./yourcode: starts the program yourcode, compiled with MPI, in parallel on 4 nodes, with 1 process per node.
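The job script above launches one MPI process per node, each running several OpenMP threads. A minimal hedged sketch of what the corresponding hybrid program might look like (not from the slides) is shown below; it requests MPI_THREAD_FUNNELED, which is sufficient when only the master thread makes MPI calls.

    #include <stdio.h>
    #include <omp.h>
    #include <mpi.h>

    /* Sketch: hybrid hello world. One MPI process per node, several OpenMP
     * threads per process; only the master thread calls MPI (FUNNELED). */
    int main(int argc, char *argv[])
    {
        int provided, rank, size;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        #pragma omp parallel
        {
            printf("process %d of %d, thread %d of %d\n",
                   rank, size, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

It would typically be compiled with the MPI wrapper plus the compiler's OpenMP flag, for example mpicc -fopenmp with GCC or the Intel equivalent; check your site's documentation for the exact flag.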

OpenMP example: parallel for (C)

Example:

    void addvectors(const int *a, const int *b, int *c, const int n)
    {
        int i;
    #pragma omp parallel for
        for (i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

Here i is automatically made private because it is the loop variable. All other variables are shared. The loop is split between threads: for example, for n=10 with two threads, thread 0 does indices 0 to 4 and thread 1 does indices 5 to 9.

OpenMP example: parallel do (Fortran)

Example:

    subroutine addvectors(a, b, c, n)
      integer n, a(n), b(n), c(n)
      integer i
    !$OMP PARALLEL DO
      do i = 1, n
         c(i) = a(i) + b(i)
      enddo
    !$OMP END PARALLEL DO
    end subroutine

Here i is automatically made private because it is the loop variable. All other variables are shared. The loop is split between threads: for example, for n=10 with two threads, thread 0 does indices 1 to 5 and thread 1 does indices 6 to 10.

Exercise 5: Matrix-vector multiplication

Consider again mv.c and mv.f90. Add a parallel for or parallel do pragma to the inner for/do loop to obtain a hybrid code, and submit it. Measure its performance. Optional: do the same for the other two matrix-vector multiplication codes.
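As a hint of what the hybrid local computation can look like, here is a hedged sketch with hypothetical names (local_matvec, Alocal, vlocal), not the workshop's mv.c: it threads an inner dot-product loop with an OpenMP reduction, while the MPI communication around it stays unchanged. The actual mv.c may structure its loops differently, so the pragma placement there may differ.

    #include <stdio.h>

    /* Hedged sketch (hypothetical names): compute the locally owned rows of
     * v = Ax, splitting the inner dot-product loop among OpenMP threads. */
    void local_matvec(int nlocal, int n, const double *Alocal,
                      const double *x, double *vlocal)
    {
        int i, j;
        for (i = 0; i < nlocal; i++) {            /* rows owned by this MPI process */
            double sum = 0.0;
            #pragma omp parallel for reduction(+:sum)
            for (j = 0; j < n; j++)               /* inner loop split among threads */
                sum += Alocal[i * n + j] * x[j];
            vlocal[i] = sum;
        }
    }

    int main(void)
    {
        double A[2][3] = {{1,2,3},{4,5,6}};       /* two local rows of a 3-column matrix */
        double x[3] = {1, 1, 1}, v[2];
        local_matvec(2, 3, &A[0][0], x, v);
        printf("%g %g\n", v[0], v[1]);            /* expect 6 and 15 */
        return 0;
    }

Compiling with the OpenMP flag and setting OMP_NUM_THREADS as in the job script above turns each MPI process into a multi-threaded worker.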