Claudio Chiaruttini
Dipartimento di Matematica e Informatica
Centro Interdipartimentale per le Scienze Computazionali (CISC)
Università di Trieste
http://www.dmi.units.it/~chiarutt/didattica/parallela

Summary
1) Parallel computation, why?
2) Parallel computer architectures
3) Computational paradigms
4) Execution speedup
5) The MPI environment
6) Distributing arrays among processors

Why parallel computing?
- Solve problems with greater speed
- Run memory-demanding programs

Parallel architecture models
- Shared memory
- Distributed memory (message passing)

Shared vs. Distributed Memory
Shared memory:
- Parallel threads in a single process
- Easy programming: extensions to standard languages (OpenMP)
Distributed memory:
- Several communicating processes
- Difficult programming: special libraries needed (MPI)
- The programmer must explicitly take care of message passing

Computational paradigms
- Pipeline
- Phases
- Divide & Conquer
- Master-Worker

Execution Speedup
Definition (for p processors): $S_p = T_1 / T_p$
Amdahl's law ($f$: sequential fraction of execution time):
$$S_p = \frac{p}{1 + f\,(p-1)}$$
Maximum speedup: $1/f$, for $p \to \infty$

   seq. fraction f:   10%    1%
   max. speedup:      10     100
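As an illustrative calculation (numbers not on the original slide): with $f = 10\%$ and $p = 100$ processors, Amdahl's law gives
$$S_{100} = \frac{100}{1 + 0.1 \cdot 99} = \frac{100}{10.9} \approx 9.2,$$
already close to the asymptotic limit $1/f = 10$, so adding further processors barely helps.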

Scalability of algorithms

Other metrics
- Speedup (how much we gain in time): $S_p = T_1 / T_p$
- Efficiency (how much we use the machine): $E_p = S_p / p$
- Cost: $C_p = p\,T_p$
- Effectiveness (benefit/cost ratio): $F_p = S_p / C_p = E_p S_p / T_1$
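For example (illustrative numbers, not from the slide): with $T_1 = 100\,$s, $p = 10$ and $T_p = 20\,$s, one gets $S_p = 5$, $E_p = 0.5$, $C_p = 200\,$s and $F_p = S_p / C_p = 0.025\,\mathrm{s}^{-1}$, which indeed equals $E_p S_p / T_1 = 0.5 \cdot 5 / 100$.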

Programming models
A simple program:

   DO i = 1, n
      s(i) = sin(a(i))
      r(i) = sqrt(a(i))
      l(i) = log(a(i))
      t(i) = tan(a(i))
   END DO

- Functional decomposition: each process computes one function on all data
- Domain decomposition: each process computes all functions on a chunk of data. Scales well.
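A minimal sketch (not from the original slides) of the domain-decomposition idea for the loop above, assuming n is divisible by nproc and that myid and nproc come from MPI_COMM_RANK / MPI_COMM_SIZE as in the later examples; m, first and last are hypothetical helper variables:

   m     = n / nproc        ! chunk size owned by each process
   first = myid*m + 1       ! first index of this process's chunk
   last  = first + m - 1    ! last index of this process's chunk
   DO i = first, last       ! each process applies ALL functions to ITS chunk
      s(i) = sin(a(i))
      r(i) = sqrt(a(i))
      l(i) = log(a(i))
      t(i) = tan(a(i))
   END DO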

MPI: Message Passing Interface
Message structure:
- Content: data, count, type
- Envelope: source/dest, tag, communicator
Basic functions:
- MPI_SEND: data to destination
- MPI_RECV: data from source

Basic functions

   PROGRAM Hello_world                 ! REMARK: no communication here!
      INCLUDE 'mpif.h'                 ! MPI library
      INTEGER :: err
      CALL MPI_INIT(err)               ! Initialize communication
      WRITE (*,*) "Hello world!"
      CALL MPI_FINALIZE(err)           ! Finalize communication
   END PROGRAM Hello_world

   PROGRAM Hello_from                  ! Each process has its own rank
      INCLUDE 'mpif.h'
      INTEGER :: err, nproc, myid
      CALL MPI_INIT(err)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, err)   ! get the total number of processes
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, err)    ! get the process rank
      WRITE (*,*) "Hello from", myid, "of", nproc
      CALL MPI_FINALIZE(err)
   END PROGRAM Hello_from
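How these programs are compiled and launched is not shown on the slides and depends on the local installation; typically an MPI wrapper compiler such as mpif90 is used to build the executable, and a launcher such as mpirun (e.g. mpirun -np 4 ./hello_from, with an illustrative file name) starts the desired number of processes.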

Sending and receiving messages

   PROGRAM Send_Receive
      ...
      INTEGER :: err, nproc, myid
      INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status
      REAL, DIMENSION(2) :: a
      ...
      IF (myid==0) THEN
         a = (/ 1, 2 /)                     ! Process # 0 holds the data
         CALL MPI_SEND (a, 2, MPI_REAL, &   ! Content: buffer, count and type
                        1, 10, MPI_COMM_WORLD, err)           ! Envelope
      ELSE IF (myid==1) THEN
         CALL MPI_RECV (a, 2, MPI_REAL, &   ! Content: buffer, count and type
                        0, 10, MPI_COMM_WORLD, status, err)   ! Envelope
         WRITE (*,*) myid, ": a =", a
      END IF
      ...
   END PROGRAM Send_Receive

Deadlocks

Avoiding deadlocks 1

   PROGRAM Deadlock
      INTEGER, PARAMETER :: N = 100000
      INTEGER :: err, nproc, myid
      REAL :: a(N), b(N)
      ...
      IF (myid==0) THEN
         a = 0
         CALL MPI_SEND (a, N, MPI_REAL, 1, &
                        10, MPI_COMM_WORLD, err)
         CALL MPI_RECV (b, N, MPI_REAL, 1, &
                        11, MPI_COMM_WORLD, status, err)
      ELSE IF (myid==1) THEN
         a = 1
         CALL MPI_SEND (a, N, MPI_REAL, 0, &
                        11, MPI_COMM_WORLD, err)
         CALL MPI_RECV (b, N, MPI_REAL, 0, &
                        10, MPI_COMM_WORLD, status, err)
      END IF
   END PROGRAM Deadlock

   PROGRAM NO_Deadlock                 ! by breaking symmetry
      INTEGER, PARAMETER :: N = 100000
      INTEGER :: err, nproc, myid
      REAL :: a(N), b(N)
      ...
      IF (myid==0) THEN
         a = 0
         CALL MPI_SEND (a, N, MPI_REAL, 1, &
                        10, MPI_COMM_WORLD, err)
         CALL MPI_RECV (b, N, MPI_REAL, 1, &
                        11, MPI_COMM_WORLD, status, err)
      ELSE IF (myid==1) THEN
         a = 1
         CALL MPI_RECV (b, N, MPI_REAL, 0, &
                        10, MPI_COMM_WORLD, status, err)
         CALL MPI_SEND (a, N, MPI_REAL, 0, &
                        11, MPI_COMM_WORLD, err)
      END IF
   END PROGRAM NO_Deadlock

Avoiding deadlocks 2

   PROGRAM SendRecv
      ...
      INTEGER, PARAMETER :: N = 100000
      REAL :: a(N), b(N)
      ...
      IF (myid==0) THEN
         a = 0
         CALL MPI_SENDRECV (a, N, MPI_REAL, 1, 10, &   ! sent data, count, type, destination, tag
                            b, N, MPI_REAL, 1, 11, &   ! received data, count, type, source, tag
                            MPI_COMM_WORLD, status, err)
      ELSE IF (myid==1) THEN
         a = 1
         CALL MPI_SENDRECV (a, N, MPI_REAL, 0, 11, &
                            b, N, MPI_REAL, 0, 10, &   ! NOTE the tag correspondence
                            MPI_COMM_WORLD, status, err)
      END IF
      WRITE (*,*) myid, "received:", b(1:5)
   END PROGRAM SendRecv

Overlapping communication and computation
- Start communication in advance: non-blocking send/receive
- Synchronization to ensure transfer completion
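A minimal sketch (not from the original slides) of this pattern using the non-blocking MPI_IRECV together with MPI_WAIT; the request variable req and the routine do_local_work are hypothetical:

   INTEGER :: req
   ...
   CALL MPI_IRECV (a, N, MPI_REAL, 0, 10, MPI_COMM_WORLD, req, err)   ! start the transfer
   CALL do_local_work()               ! computation that does not touch a, overlapped with the transfer
   CALL MPI_WAIT (req, status, err)   ! synchronization: a may only be read after this returns

The matching sender would use MPI_ISEND in the same way.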

One-to-all: Broadcast
[Figure: process P0 broadcasts the array a to all processes P0, P1, P2, ..., Pn-1]

   PROGRAM Broadcast
      ! One-to-many communication
      INTEGER :: err, nproc, myid
      INTEGER :: root, i, a(10)
      ...
      root = 0
      IF (myid==root) a = (/ (i, i=1,10) /)
      CALL MPI_BCAST (a, 10, MPI_INTEGER, &        ! s/d buffers, count, type
                      root, MPI_COMM_WORLD, err)   ! source, communicator
      ! REMARK: source and destination buffers have the same name,
      ! but are in different processor memories
      ...
   END PROGRAM Broadcast

One-to-all: Scatter/Gather (1)

   PROGRAM ScatterGather
      ! SCATTER: distribute an array among processes
      ! GATHER: collect a distributed array in a single process
      INTEGER, PARAMETER :: N = 16
      INTEGER :: err, nproc, myid
      INTEGER :: root, i, m, a(N), b(N)
      ...
      root = 0
      m = N/nproc                                    ! number of elements per process
      IF (myid==root) a = (/ (i, i=1,N) /)
      CALL MPI_SCATTER (a, m, MPI_INTEGER, &         ! source buffer
                        b, m, MPI_INTEGER, &         ! destination buffer
                        root, MPI_COMM_WORLD, err)   ! source process, communicator
      b(1:m) = 2*b(1:m)                              ! parallel function computation
      CALL MPI_GATHER (b, m, MPI_INTEGER, &          ! source buffer
                       a, m, MPI_INTEGER, &          ! destination buffer
                       root, MPI_COMM_WORLD, err)    ! destination process, communicator
      ...
   END PROGRAM ScatterGather

Scatter/Gather (2)
[Figure] Scatter: P0 holds a1 a2 a3 a4; afterwards P0 has a1, P1 has a2, P2 has a3, P3 has a4.
[Figure] Gather: the inverse operation, collecting a1, a2, a3, a4 from P0, P1, P2, P3 back into the array on P0.

All-to-all: MPI_Alltoall (1)

   PROGRAM All_to_all
      !----------------------------------------------------------------------
      ! Exchange data among all processes
      !----------------------------------------------------------------------
      INTEGER, PARAMETER :: N = 4
      INTEGER :: i, m
      REAL, DIMENSION(N) :: a, b        ! b added as a separate receive buffer,
                                        ! since MPI forbids aliasing send and receive buffers
      ...
      a = (/ (N*myid+i, i=1,N) /)
      WRITE (*,*) "process", myid, "has:", a
      m = N/nproc
      CALL MPI_ALLTOALL (a, m, MPI_REAL, &   ! sender buffer, count, type
                         b, m, MPI_REAL, &   ! receiver buffer, count, type
                         MPI_COMM_WORLD, err)
      ! REMARK: count is the number of elements sent from one process to the other
      WRITE (*,*) "process", myid, "received:", b
   END PROGRAM All_to_all

All-to-all: MPI_Alltoall (2)
[Figure] Before: P0 holds a1 a2 a3 a4, P1 holds b1 b2 b3 b4, P2 holds c1 c2 c3 c4, P3 holds d1 d2 d3 d4.
After: P0 holds a1 b1 c1 d1, P1 holds a2 b2 c2 d2, P2 holds a3 b3 c3 d3, P3 holds a4 b4 c4 d4.

Reduction functions (1)

   PROGRAM Reduce
      !----------------------------------------------------------------------
      ! Parallel sum, an instance of the MPI reduction functions
      !----------------------------------------------------------------------
      INTEGER, PARAMETER :: N = 4
      INTEGER :: i, m, root
      REAL, DIMENSION(N) :: a, s
      ...
      a = (/ (i, i=1,N) /)
      WRITE (*,*) "process", myid, "has:", a
      root = 0
      CALL MPI_REDUCE (a, s, N, MPI_REAL, &   ! s/d buffers, count, type
                       MPI_SUM, root, &       ! operation, destination
                       MPI_COMM_WORLD, err)
      ! REMARK: with MPI_ALLREDUCE (no root argument) ALL processes get the result
      IF (myid==0) WRITE (*,*) "The sum is", s
      ...
   END PROGRAM Reduce
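As a sketch of the variant mentioned in the remark (not part of the original program), the same sum can be obtained on every process with MPI_ALLREDUCE, which takes no root argument:

   CALL MPI_ALLREDUCE (a, s, N, MPI_REAL, MPI_SUM, &
                       MPI_COMM_WORLD, err)
   WRITE (*,*) "process", myid, "has the sum", s   ! every process now holds the result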

Reduction functions (2)

   MPI_OP        Function
   MPI_MAX       Maximum
   MPI_MIN       Minimum
   MPI_SUM       Sum
   MPI_PROD      Product
   MPI_MAXLOC    Location of maximum
   MPI_MINLOC    Location of minimum
   MPI_...       (further predefined operations)

Distributing arrays among processors
Issues:
- Load balancing
- Communication optimization

Array distribution 1
- Column blocks
- Column cycles
- Column block cycles
- Row and column block cycles
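An illustrative sketch (not from the slides) of the index arithmetic behind the first two layouts, for N columns and nproc processes with ranks 0..nproc-1; m, first, last and owner are hypothetical helper variables:

   ! Column blocks: each process owns one contiguous block of columns
   m     = N / nproc          ! block size (assuming N divisible by nproc)
   first = myid*m + 1         ! first column owned by process myid
   last  = first + m - 1      ! last column owned by process myid

   ! Column cycles: columns are dealt out round-robin
   owner = MOD(j-1, nproc)    ! rank owning column j (columns numbered from 1)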

Array distribution 2
- Grid of processors

In summary
- MPI functions: low-level tools, efficiently implemented
- Parallel high-level language compilers (HPF) need improvement
- Software libraries, e.g. ScaLAPACK for linear algebra
- Careful algorithm design is needed to exploit the hardware
