Parallel Programming with MPI and OpenMP

Parallel Programming with MPI and OpenMP, by Michael J. Quinn

Chapter 6: Floyd's Algorithm

Chapter Objectives
- Creating 2-D arrays
- Thinking about grain size
- Introducing point-to-point communications
- Reading and printing 2-D matrices
- Analyzing performance when computations and communications overlap

Outline
- All-pairs shortest path problem
- Dynamic 2-D arrays
- Parallel algorithm design
- Point-to-point communication
- Block row matrix I/O
- Analysis and benchmarking

All-pairs Shortest Path Problem
[Figure: a directed, weighted graph on vertices A through E and the resulting adjacency matrix containing the shortest-path distances]

Floyd's Algorithm

for k ← 0 to n-1
    for i ← 0 to n-1
        for j ← 0 to n-1
            a[i,j] ← min(a[i,j], a[i,k] + a[k,j])
        endfor
    endfor
endfor
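For reference, here is a minimal serial C sketch of the triple loop above; the function name and the row-pointer representation of the matrix are illustrative choices, not taken from the chapter's program.

/* Serial Floyd's algorithm: a is an n x n distance matrix accessed as
 * a[i][j]. On return, a[i][j] is the length of the shortest path from
 * vertex i to vertex j (assuming no negative cycles and no overflow). */
void floyd(int **a, int n)
{
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (a[i][k] + a[k][j] < a[i][j])
                    a[i][j] = a[i][k] + a[k][j];
}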

Designing the Parallel Algorithm
- Partitioning
- Communication
- Agglomeration and mapping

Partitioning
- Domain or functional decomposition?
- Look at the pseudocode
- The same assignment statement is executed n³ times
- No functional parallelism
- Domain decomposition: divide matrix A into its n² elements

Communication
- Example: the primitive tasks involved in updating a[3,4] when k = 1
- Iteration k: every task in row k broadcasts its value within its task column
- Iteration k: every task in column k broadcasts its value within its task row

Agglomeration and Mapping
- Number of tasks: static
- Communication among tasks: structured
- Computation time per task: constant
- Strategy: agglomerate tasks to minimize communication, and create one task per MPI process

Two Data Decompositions
- Rowwise block striped
- Columnwise block striped

Comparing Decompositions
- Columnwise block striped: broadcast within columns eliminated
- Rowwise block striped: broadcast within rows eliminated, and reading the matrix from a file is simpler
- Choose the rowwise block striped decomposition (block boundaries are sketched below)
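With the rowwise block striped decomposition, each process owns a contiguous block of roughly n/p rows. Below is a small sketch of how the block boundaries can be computed for process id out of p processes; the macro names follow the style of the utility header used with the book's examples, but treat the exact names here as an assumption.

/* Block decomposition of n rows among p processes.
 * Process id owns rows BLOCK_LOW(id,p,n) through BLOCK_HIGH(id,p,n). */
#define BLOCK_LOW(id, p, n)    ((id) * (n) / (p))
#define BLOCK_HIGH(id, p, n)   (BLOCK_LOW((id) + 1, p, n) - 1)
#define BLOCK_SIZE(id, p, n)   (BLOCK_LOW((id) + 1, p, n) - BLOCK_LOW(id, p, n))
#define BLOCK_OWNER(row, p, n) ((((p) * ((row) + 1)) - 1) / (n))

For example, with n = 10 rows and p = 3 processes, process 0 owns rows 0-2, process 1 owns rows 3-5, and process 2 owns rows 6-9.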

File Input
[Figure: the matrix is read from a file and its blocks of rows are distributed among the processes]

Pop Quiz: Why don't we input the entire file at once and then scatter its contents among the processes, allowing concurrent message passing?

Point-to-point Communication
- Involves a pair of processes
- One process sends a message
- The other process receives the message

Send/Receive Not Collective

Function MPI_Send

int MPI_Send (
    void         *message,
    int           count,
    MPI_Datatype  datatype,
    int           dest,
    int           tag,
    MPI_Comm      comm
)

Function MPI_Recv

int MPI_Recv (
    void         *message,
    int           count,
    MPI_Datatype  datatype,
    int           source,
    int           tag,
    MPI_Comm      comm,
    MPI_Status   *status
)

Coding Send/Receive

if (ID == j) {
    /* Receive from i */
}
if (ID == i) {
    /* Send to j */
}

The receive appears before the send. Why does this work?
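A minimal sketch of this pattern as a complete program: process 1 posts its receive and process 0 sends a single integer. The concrete ranks, tag, and payload are illustrative choices; run with at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int id, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    if (id == 1) {
        /* Receiver: posting the receive first is fine; it blocks
           until the matching message arrives. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Process 1 received %d\n", value);
    } else if (id == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}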

Inside MPI_Send and MPI_Recv
[Figure: the message travels from the sending process's program memory into its system buffer, across to the receiving process's system buffer, and finally into the receiver's program memory]

Return from MPI_Send
- The function blocks until the message buffer is free
- The message buffer is free when the message has been copied to a system buffer, or the message has been transmitted
- Typical scenario: the message is copied to a system buffer, and transmission overlaps computation

Return from MPI_Recv
- The function blocks until the message is in the buffer
- If the message never arrives, the function never returns

Deadlock
- Deadlock: a process waits for a condition that will never become true
- It is easy to write send/receive code that deadlocks:
  - two processes, both of which receive before sending (see the sketch below)
  - the send tag doesn't match the receive tag
  - a process sends a message to the wrong destination process
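To make the first scenario concrete, here is a sketch of a two-process exchange; the ranks, tag, and values are illustrative. The commented-out ordering, in which both processes post their receive before their send, hangs forever, while MPI_Sendrecv (or having one rank send first) avoids the problem.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int id, other, outgoing, incoming;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    other = 1 - id;            /* run with exactly two processes */
    outgoing = id;

    /* Deadlock: if BOTH processes did this, neither receive could
       complete, because neither process ever reaches its send.

       MPI_Recv(&incoming, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
       MPI_Send(&outgoing, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
    */

    /* One safe alternative: let MPI pair the send and the receive. */
    MPI_Sendrecv(&outgoing, 1, MPI_INT, other, 0,
                 &incoming, 1, MPI_INT, other, 0,
                 MPI_COMM_WORLD, &status);

    printf("Process %d received %d\n", id, incoming);
    MPI_Finalize();
    return 0;
}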

Dynamic 1-D Array Creation
[Figure: the pointer A resides on the run-time stack and references a contiguous block of elements allocated on the heap]

Dynamic 2-D Array Creation
[Figure: the pointers B and Bstorage reside on the run-time stack; Bstorage references a contiguous heap block holding all matrix elements, and B references a heap array of row pointers into Bstorage]
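A minimal C sketch of the allocation pattern the two pictures describe: the first malloc is the 1-D case, and the array of row pointers B turns the contiguous element block Bstorage into a 2-D matrix. The function name is an illustrative choice.

#include <stdlib.h>

/* Allocate an n x m matrix of ints as one contiguous heap block
 * (Bstorage) plus an array of row pointers (B) into that block.
 * The caller eventually frees both B and *storage_out. */
int **alloc_matrix(int n, int m, int **storage_out)
{
    int *Bstorage = (int *) malloc(n * m * sizeof(int));  /* 1-D element block */
    int **B       = (int **) malloc(n * sizeof(int *));   /* row pointers      */
    for (int i = 0; i < n; i++)
        B[i] = &Bstorage[i * m];       /* row i starts at offset i * m */
    *storage_out = Bstorage;
    return B;
}

Keeping the elements in one contiguous block matters later: a whole block of rows can then be read from the file or passed to a single MPI call.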

Computational Complexity
- Innermost loop has complexity Θ(n)
- Middle loop executed at most ⌈n/p⌉ times
- Outer loop executed n times
- Overall complexity: Θ(n³/p)

Communication Complexity
- No communication in the inner loop
- No communication in the middle loop
- Broadcast in the outer loop: complexity Θ(n log p)
- Overall complexity: Θ(n² log p)

Execution Time Expression (1)

n ⌈n/p⌉ n χ + n ⌈log p⌉ (λ + 4n/β)

Term by term: n = iterations of the outer loop, ⌈n/p⌉ = iterations of the middle loop, n = iterations of the inner loop, χ = cell update time, ⌈log p⌉ = messages per broadcast, λ + 4n/β = message-passing time.

Computation/communication Overlap

Execution Time Expression (2)

n ⌈n/p⌉ n χ + n ⌈log p⌉ λ + ⌈log p⌉ (4n/β)

Term by term: n = iterations of the outer loop, ⌈n/p⌉ = iterations of the middle loop, n = iterations of the inner loop, χ = cell update time, ⌈log p⌉ = messages per broadcast, λ = message-passing (latency) time, 4n/β = message transmission time.

Predicted vs. Actual Performance

Processes   Predicted (sec)   Actual (sec)
    1            25.54           25.54
    2            13.02           13.89
    3             9.01            9.60
    4             6.89            7.29
    5             5.86            5.99
    6             5.01            5.16
    7             4.40            4.50
    8             3.94            3.98
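As a rough worked check of the model behind these predictions (assuming, as the single-process entry suggests, a 1000 x 1000 distance matrix with a cell update time of about χ ≈ 25.5 ns and a message latency of about λ ≈ 250 μs; these values are inferred from the table, not quoted from the chapter): for p = 4, the computation term is 1000 · ⌈1000/4⌉ · 1000 · 25.5 ns ≈ 6.39 s and the latency term is 1000 · ⌈log 4⌉ · 250 μs = 0.5 s, giving about 6.89 s, in line with the predicted column.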

Summary
- Two matrix decompositions: rowwise block striped and columnwise block striped
- Blocking send/receive functions: MPI_Send and MPI_Recv
- Overlapping communications with computations