Bulk Synchronous and SPMD Programming
CS315B Lecture 2
Prof. Aiken

The Bulk Synchronous Model

Bulk Synchronous Model
- A model
  - An idealized machine
- Originally proposed for analyzing parallel algorithms
  - Leslie Valiant, "A Bridging Model for Parallel Computation", 1990

The Machine
- (figure)

Computations
- A sequence of supersteps. Repeat:
  - All processors do local computation
  - Barrier
  - All processors communicate
  - Barrier
- (a minimal MPI sketch of this loop appears below)

What are some properties of this machine model?

Basic Properties
- Uniform compute nodes and communication costs
- Separate communication and computation
- Synchronization is global

What are properties of this computational model?
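To make the superstep structure concrete, here is a minimal sketch (not from the slides) of a BSP-style loop written with MPI barriers; the helpers local_compute() and exchange_with_neighbors() are hypothetical placeholders for application code.

    #include <mpi.h>

    /* Hypothetical application hooks: each rank computes on its own data,
       then exchanges boundary data with its neighbors. */
    void local_compute(void);
    void exchange_with_neighbors(void);

    /* One BSP superstep per iteration: compute, barrier, communicate, barrier. */
    void bsp_loop(int nsteps)
    {
        for (int step = 0; step < nsteps; step++) {
            local_compute();                 /* all processors do local computation */
            MPI_Barrier(MPI_COMM_WORLD);     /* end of the compute phase */
            exchange_with_neighbors();       /* all processors communicate */
            MPI_Barrier(MPI_COMM_WORLD);     /* end of the communication phase */
        }
    }

Real codes often let the message exchange itself provide the needed synchronization; the explicit barriers above simply mirror the model.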

The Idea
- Programs are written for v virtual processors and run on p physical processors
- If v >= p log p, then managing memory, communication and synchronization can be done automatically, within a constant factor of optimal

How Does This Work?
- Roughly:
  - Memory addresses are hashed to a random location in the machine
    - Guarantees that, on average, memory accesses have the same cost
  - The extra log p factor of threads are multiplexed onto the p processors to hide the latency of memory requests
    - The processors are kept busy and do no more compute than necessary

Terminology
- SIMD: Single Instruction, Multiple Data
- SPMD: Single Program, Multiple Data

SPMD

SIMD = Vector Processing

    A[1..N] = B[1..N] * factor;
    j += factor;

Picture

    A[1] = B[1] * factor
    A[2] = B[2] * factor
    A[3] = B[3] * factor
    ...
    j += factor

Comments
- Single thread of control
- Global synchronization at each program instruction
- Can exploit fine-grain parallelism
- Assumption of hardware support

SPMD = Single Program, Multiple Data

    SIMD:
      A[1..N] = B[1..N] * factor;
      j += factor;

    SPMD:
      A[myid] = B[myid] * factor;
      j += factor;
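The SPMD fragment above assigns exactly one element per thread. When N is larger than the number of threads, each thread typically owns a contiguous block of indices; a minimal sketch, assuming myid and nthreads are supplied by the runtime (as MPI rank and size would be):

    /* Block distribution of A[0..N-1] = B[0..N-1] * factor across nthreads
       SPMD threads; myid identifies this thread. */
    void spmd_scale(double *A, const double *B, int N,
                    double factor, int myid, int nthreads)
    {
        int chunk = (N + nthreads - 1) / nthreads;       /* ceiling division */
        int lo = myid * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;

        for (int i = lo; i < hi; i++)
            A[i] = B[i] * factor;

        /* The scalar update j += factor is not shown here: in SPMD it is
           simply replicated on every thread (revisited later under "Control"). */
    }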

Picture

    A[1] = B[1] * factor        A[2] = B[2] * factor        ...
    j += factor                 j += factor

Comments
- Multiple threads of control
  - One (or more) per processor
- Asynchronous
  - All synchronization is programmer-specified
- Threads are distinguished by myid
- Choice: are variables local or global?

Comparison
- SIMD
  - Designed for tightly-coupled, synchronous hardware, i.e., vector units
- SPMD
  - Designed for clusters
  - Too expensive to synchronize every statement
  - Need a model that allows asynchrony

MPI
- Message Passing Interface
- A widely used standard
  - Runs on everything
- A runtime system
- Most popular way to write SPMD programs
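As a reference point (not from the slides), the smallest possible MPI program already shows the SPMD structure: every process runs the same main(), and processes are distinguished only by the rank returned from MPI_Comm_rank.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                  /* every process executes this same program */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* my unique identifier ("myid") */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */

        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }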

MPI Programs
- Standard sequential programs
  - All variables are local to a thread
- Augmented with calls to the MPI interface
  - SPMD model
  - Every thread has a unique identifier
  - Threads can send/receive messages
  - Synchronization primitives

MPI Point-to-Point Routines

    MPI_Send(buffer, count, type, dest, ...)
    MPI_Recv(buffer, count, type, source, ...)

(the full calls, with tag, communicator, and status arguments, are sketched after this group of slides)

Example

    for (...) {
      // p = number of chunks of 1D grid, id = process id, h[] = local chunk of the grid
      // boundary elements of h[] are copies of neighbors' boundary elements

      ... Local computation ...

      // exchange with neighbors on a 1-D grid
      if ( 0 < id )
        MPI_Send ( &h[1], 1, MPI_DOUBLE, id-1, 1, MPI_COMM_WORLD );
      if ( id < p-1 )
        MPI_Recv ( &h[n+1], 1, MPI_DOUBLE, id+1, 1, MPI_COMM_WORLD, &status );
      if ( id < p-1 )
        MPI_Send ( &h[n], 1, MPI_DOUBLE, id+1, 2, MPI_COMM_WORLD );
      if ( 0 < id )
        MPI_Recv ( &h[0], 1, MPI_DOUBLE, id-1, 2, MPI_COMM_WORLD, &status );

      ... More local computation ...
    }

MPI Point-to-Point Routines, Non-Blocking

    MPI_Isend( ... )
    MPI_Irecv( ... )
    MPI_Wait( ... )
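The MPI_Send/MPI_Recv prototypes on the point-to-point slide elide the tag, communicator, and (for receives) status arguments. A small self-contained sketch with the full calls, assuming at least two ranks; the value being sent is made up for illustration.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double value = 0.0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 3.14;
            /* buffer, count, type, dest, tag, communicator */
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* buffer, count, type, source, tag, communicator, status */
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %f\n", value);
        }

        MPI_Finalize();
        return 0;
    }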

Example

    for (...) {
      // p = number of chunks of 1D grid, id = process id, h[] = local chunk of the grid
      // boundary elements of h[] are copies of neighbors' boundary elements

      ... Local computation ...

      // exchange with neighbors on a 1-D grid
      // (req1..req4 are MPI_Request handles, added here: the non-blocking calls
      //  return a request for MPI_Wait rather than taking a status argument)
      if ( 0 < id )
        MPI_Isend ( &h[1], 1, MPI_DOUBLE, id-1, 1, MPI_COMM_WORLD, &req1 );
      if ( id < p-1 )
        MPI_Irecv ( &h[n+1], 1, MPI_DOUBLE, id+1, 1, MPI_COMM_WORLD, &req2 );
      if ( id < p-1 )
        MPI_Isend ( &h[n], 1, MPI_DOUBLE, id+1, 2, MPI_COMM_WORLD, &req3 );
      if ( 0 < id )
        MPI_Irecv ( &h[0], 1, MPI_DOUBLE, id-1, 2, MPI_COMM_WORLD, &req4 );

      MPI_Wait( ...1... )
      MPI_Wait( ...2... )

      ... More local computation ...
    }

MPI Collective Communication Routines

    MPI_Barrier( ... )
    MPI_Bcast( ... )
    MPI_Scatter( ... )
    MPI_Gather( ... )
    MPI_Reduce( ... )

(a usage sketch appears after this group of slides)

Typical Structure

    communicate_get_work_to_do();
    barrier;
    do_local_work();
    // not always needed
    barrier;
    communicate_write_results();

What does this remind you of?

PGAS Model
- PGAS = Partitioned Global Address Space
- There is one global address space
- But each thread owns a partition of the address space that is more efficient to access
  - i.e., the local memory of a processor
- Equivalent in functionality to MPI
  - But typically presented as a programming language
- Examples: Split-C, UPC, Titanium
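To make the collective routines and the "Typical Structure" above concrete, here is a hedged sketch (mine, not the lecture's): rank 0 distributes work with MPI_Scatter, every rank computes locally, and MPI_Reduce collects a global sum; the chunk size and the sum-of-squares computation are made up for illustration.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNK 4   /* elements per rank (illustrative) */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *all = NULL;
        double local[CHUNK];

        if (rank == 0) {                          /* "communicate_get_work_to_do" */
            all = malloc(size * CHUNK * sizeof(double));
            for (int i = 0; i < size * CHUNK; i++)
                all[i] = i;
        }
        MPI_Scatter(all, CHUNK, MPI_DOUBLE,
                    local, CHUNK, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        double partial = 0.0;                     /* "do_local_work" */
        for (int i = 0; i < CHUNK; i++)
            partial += local[i] * local[i];

        double total = 0.0;                       /* "communicate_write_results" */
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            printf("sum of squares = %f\n", total);
            free(all);
        }
        MPI_Finalize();
        return 0;
    }

Note that no explicit MPI_Barrier is needed here: the collectives themselves provide the phase synchronization.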

PGAS Languages
- No library calls for communication
- Instead, variables can name memory locations on other machines

    // Assume y points to a remote location
    // The following is equivalent to a send/receive
    x = *y

PGAS Languages
- Also provide collective communication
  - Barrier
  - Broadcast/Reduce (1-many)
  - Exchange (all-to-all)

PGAS vs. MPI
- Programming model very similar
  - Both provide SPMD
- From a pragmatic point of view, MPI rules
  - Easy to add MPI to an existing sequential language
- For productivity, PGAS is better
  - Programs filled with low-level details of MPI calls
  - PGAS programs easier to modify
  - PGAS compilers can know more / do a better job

Summary
- SPMD is well-matched to cluster programming
  - Also works well on shared memory machines
    - One thread per core
- No need for compiler to discover parallelism
  - No danger of overwhelming # of threads
- Model exposes memory architecture
  - Local vs. global variables
  - Local computation vs. sends/receives

Control

    SIMD:
      forall (i = 1..N)
        A[i] = B[i] * factor;
      j += factor;

    SPMD:
      A[myid] = B[myid] * factor;
      j += factor;

Analysis

Control, Cont.
- SPMD replicates the sequential part of the SIMD computation
  - Across all threads!
- Often cheaper to replicate computation in parallel than compute in one place and broadcast (see the sketch below)
  - A general principle...

Global Synchronization Revisited
- In the presence of non-blocking global memory operations, we also need memory fence operations
  - Why?
- Two choices:
  - Have separate classes of memory and control synchronization operations
    - E.g., barrier and memory_fence
  - Have a single set of operations
    - E.g., barrier implies memory and control synchronization
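The "replicate rather than broadcast" point can be seen in a small sketch (mine, not the lecture's). Both variants below produce the same scalar j on every rank; the replicated version does redundant but purely local work, while the broadcast version pays for a global MPI_Bcast.

    #include <mpi.h>

    /* Variant 1: every rank replicates the cheap sequential computation. */
    double replicated_j(double factor, int nsteps)
    {
        double j = 0.0;
        for (int s = 0; s < nsteps; s++)
            j += factor;                 /* redundant across ranks, but local */
        return j;
    }

    /* Variant 2: rank 0 computes j once and broadcasts it to all ranks. */
    double broadcast_j(double factor, int nsteps, int rank)
    {
        double j = 0.0;
        if (rank == 0)
            for (int s = 0; s < nsteps; s++)
                j += factor;
        MPI_Bcast(&j, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);   /* global operation */
        return j;
    }

For a small amount of work, Variant 1 is usually faster because it avoids coupling the ranks together at all.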

Message Passing Implementations
- Idea: a memory fence is a special message sent on the network; when it arrives, all the memory operations are complete
- To work, the underlying message system must deliver messages in order
  - This is one of the key properties of MPI
  - And most message systems

Bulk Synchronous/SPMD Model
- Easy to understand
  - Phase structure guarantees no data races
  - Barrier synchronization also easy to understand
- Fits many problems well

But
- Assumes 2-level memory hierarchy
  - Local/global, a flat collection of homogeneous sequential processors
- No overlap of communication and computation
- Barriers scale poorly with machine size
  - (# of operations lost) * (# of processors)

Hierarchy
- Current & future machines are more hierarchical
  - 3-4 levels, not 2
- Leads to programs written in a mix of
  - MPI (network level)
  - OpenMP (node level) + vectorization
  - CUDA (GPU level)
- Each is a different programming model
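A hedged sketch of the kind of mixed-model code the Hierarchy slide refers to: MPI across nodes and an OpenMP parallel loop within each node. The array size and the work itself are made up for illustration.

    #include <mpi.h>
    #include <stdio.h>

    #define LOCAL_N 1000000   /* elements owned by each MPI rank (illustrative) */

    static double local_data[LOCAL_N];

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);                 /* network level: one rank per node */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double partial = 0.0;

        /* Node level: OpenMP threads share this rank's memory. */
        #pragma omp parallel for reduction(+:partial)
        for (int i = 0; i < LOCAL_N; i++) {
            local_data[i] = (double)(rank + i);
            partial += local_data[i];
        }

        double total = 0.0;
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum = %f\n", total);

        MPI_Finalize();
        return 0;
    }

Each level here is a separate programming model: the MPI calls, the OpenMP pragma, and (not shown) any GPU kernels are written and reasoned about differently.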

No Overlap of Computation/Communication
- Leaves major portion of the machine idle in each phase
  - And this potential is lost at many scales
    - Hierarchical machines again
- Increasingly, communication is key
  - Data movement is what matters
    - Most of the execution time, most of the energy

Global Operations
- Global operations (such as barriers) are bad
  - Require synchronization across the machine
  - Especially bad when there is performance variation among participating threads
- Need a model that favors asynchrony
  - Couple as few things together as possible

Global Operations Continued
- MPI has evolved to include more asynchronous and point-to-point primitives
- But these do not always mix well with the collective/global operations

I/O
- How do programs get initial data? Produce output?
- In many models the answer is clear
  - Passed in and out from root function
    - Map-Reduce
  - Multithreaded shared memory applications just use the normal file system interface

I/O, Cont.
- Not clear in SPMD
- Program begins running and each thread is running in its own address space
- No thread is special
  - Neither master nor slave

I/O, Cont.
- Option 1: Make thread 0 special
  - Thread 0 does all I/O on behalf of the program
  - Issue: awkward to read/write large data sets
    - Limited by thread 0's memory size
- Option 2: Parallel I/O
  - Each thread has access to its own file system
    - Containing distributed files
  - Each file f_i is a portion of a collective file f

I/O Summary
- Option 2 is clearly more SPMD-ish
  - Creating/deallocating files requires a barrier
- In general, parallel programming languages have not paid much attention to I/O

Next Week
- Intro to Regent programming
- First assignment