Under the Hood, Part 1: Implementing Message Passing


Lecture 27: Under the Hood, Part 1: Implementing Message Passing
Parallel Computer Architecture and Programming, CMU 15-418/15-618

Today's Theme

Message passing model (abstraction)
Threads operate within their own private address spaces.
Threads communicate by sending/receiving messages:
- send: specifies the recipient, the buffer to be transmitted, and an optional message identifier ("tag")
- receive: specifies the sender, the buffer to store the data, and an optional message identifier
- Sending messages is the only way to exchange data between threads 1 and 2
[Figure: thread 1's address space holds variable X; thread 2's address space holds variable Y]
send(x, 2, my_msg_id) semantics: send the contents of local variable X as a message to thread 2 and tag the message with the id my_msg_id.
recv(y, 1, my_msg_id) semantics: receive the message with id my_msg_id from thread 1 and store its contents in local variable Y.
Illustration adapted from Culler, Singh, and Gupta

Message passing systems
Popular software library: MPI (Message Passing Interface); a minimal example follows.
Hardware need not implement system-wide loads and stores to execute message passing programs (it only needs to be able to communicate messages).
- Commodity systems can be connected together to form a large parallel machine (message passing is a programming model for clusters)
[Images: cluster of workstations on an InfiniBand network; IBM Blue Gene/P supercomputer. Image credit: IBM]
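As a concrete (not-from-the-slides) illustration, the send/recv calls on the previous slide map naturally onto MPI. A minimal sketch, assuming ranks play the role of threads; the rank numbers, tag value, and payload are illustrative, and the program must be launched with at least three ranks so that ranks 1 and 2 exist.

/* Minimal sketch of the send/receive exchange above, written against MPI.
   Rank numbers, the tag value, and the payload are illustrative. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int MY_MSG_ID = 42;              /* plays the role of "my_msg_id" */
    if (rank == 1) {
        float x = 3.14f;
        /* send the contents of local variable x to rank 2, tagged MY_MSG_ID */
        MPI_Send(&x, 1, MPI_FLOAT, 2, MY_MSG_ID, MPI_COMM_WORLD);
    } else if (rank == 2) {
        float y;
        /* receive the message tagged MY_MSG_ID from rank 1 into local y */
        MPI_Recv(&y, 1, MPI_FLOAT, 1, MY_MSG_ID, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 2 received %f\n", y);
    }

    MPI_Finalize();
    return 0;
}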

Network Transaction
[Figure: a serialized message travels across the interconnection network from the source node's output buffer to the destination node's input buffer]
A network transaction is a one-way transfer of information from a source output buffer to a destination input buffer.
- It causes some action at the destination, e.g., deposit data, state change, reply
- Its occurrence is not directly visible at the source

Shared Address Space Abstraction
A remote read (Load r1 <- Address) proceeds as follows:
(1) Initiate memory access (source)
(2) Address translation (source)
(3) Local/remote check (source)
(4) Request transaction: the read request crosses the network; the source waits
(5) Remote memory access at the destination
(6) Reply transaction: the read response returns to the source
(7) Complete memory access (source)
Fundamentally a two-way request/response protocol
- writes have an acknowledgement

Key Properties of the SAS Abstraction
- Source and destination data addresses are specified by the source of the request
  - implies a degree of logical coupling and trust
- No storage logically outside the application address space(s)
  - transport may employ temporary buffers
- Operations are fundamentally request/response
- Remote operations are performed on remote memory
  - logically, they do not require intervention of the remote processor

Message Passing Implementation Options (an MPI sketch follows this list)
Synchronous:
- Send completes after the matching receive is found and the source data has been sent
- Receive completes after the data transfer from the matching send is complete
Asynchronous:
- Send completes as soon as the send buffer may be reused
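For reference (not from the slides), MPI exposes flavors that roughly line up with these options: MPI_Ssend completes only after the matching receive has started, while MPI_Isend returns immediately and the send buffer may be reused only after the request completes. A minimal sketch; the buffer, destination, and tag are illustrative.

/* Hedged sketch: how MPI roughly exposes the synchronous / asynchronous
   options above.  Buffer contents, destination, and tag are illustrative. */
#include <mpi.h>

void send_examples(float *buf, int count, int dest, int tag) {
    /* Synchronous flavor: returns only after the matching receive has started. */
    MPI_Ssend(buf, count, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);

    /* Asynchronous flavor: returns immediately; the send buffer may be
       reused only after the request completes. */
    MPI_Request req;
    MPI_Isend(buf, count, MPI_FLOAT, dest, tag, MPI_COMM_WORLD, &req);
    /* ... overlap computation here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* now buf may be reused */
}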

Synchronous Message Passing
The source calls Send(Pdest, local VA, len); the destination calls Receive(Psrc, local VA, len).
(1) Initiate send
(2) Address translation
(3) Local/remote check
(4) Send-ready request crosses the network; the source waits
(5) Remote check for a posted receive (assume success): tag check at the destination
(6) Reply transaction: a receive-ready reply returns to the source
(7) Bulk data transfer (data-transfer request), source VA -> destination VA
Data is not transferred until the target address is known
- limits contention and buffering at the destination
Performance?

Asynchronous Message Passing: Optimistic
The source calls Send(Pdest, local VA, len); the destination's Receive(Psrc, local VA, len) may be posted later.
(1) Initiate send
(2) Address translation
(3) Local/remote check
(4) Send data (the data-transfer request crosses the network)
(5) Remote check for a posted receive (tag match); on fail, allocate a data buffer at the destination
Good news: the source does not stall waiting for the destination to receive
Bad news: storage is required within the message layer (?)

Asynchronous Message Passing: Conservative
The source calls Send(Pdest, local VA, len) and resumes computing; the destination later calls Receive(Psrc, local VA, len).
(1) Initiate send
(2) Address translation
(3) Local/remote check
(4) Send-ready request crosses the network; the source resumes computing
(5) Remote check for a posted receive (assume fail); record the send-ready
(6) Receive-ready request: once the receive is posted (tag match), the destination sends a receive-ready request back to the source
(7) Bulk data reply (data-transfer reply), source VA -> destination VA
Where is the buffering? Contention control? Receiver-initiated protocol? What about short messages?

Key Features of the Message Passing Abstraction
- The source knows the send address, the destination knows the receive address
  - after the handshake, both know both
- Arbitrary storage outside the local address spaces
  - may post many sends before any receives
- Fundamentally a 3-phase transaction
  - includes a request/response
  - can use an optimistic 1-phase protocol in limited safe cases
  - credit scheme

Challenge: Avoiding Input Buffer Overflow
This requires flow control on the sources. Approaches (a toy sketch of approach 1 follows this list):
1. Reserve space per source (credit)
   - when is it available for reuse? (utilize ack messages?)
2. Refuse input when full
   - what does this do to the interconnect?
   - backpressure in a reliable network
   - tree saturation? deadlock?
   - what happens to traffic not bound for the congested destination?
3. Drop packets (?)
4. ???
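A toy sketch of the per-source credit idea (approach 1). MAX_NODES, CREDITS_PER_SRC, and network_send() are hypothetical names used only for illustration; no real NIC or runtime API is implied.

/* Toy sketch of per-source credits (approach 1).  All names are hypothetical. */
#define MAX_NODES        64
#define CREDITS_PER_SRC   8

extern void network_send(int dest, const void *msg, int len);  /* hypothetical transport */

static int credits[MAX_NODES];          /* each entry starts at CREDITS_PER_SRC */

void credits_init(void) {
    for (int i = 0; i < MAX_NODES; i++)
        credits[i] = CREDITS_PER_SRC;
}

/* Returns 1 if the message was handed to the transport,
   0 if we are out of credit for this destination and must back off. */
int try_send(int dest, const void *msg, int len) {
    if (credits[dest] == 0)
        return 0;
    credits[dest]--;                    /* consume one reserved input buffer at dest */
    network_send(dest, msg, len);
    return 1;
}

/* Called when the destination acks that it drained one of our buffers,
   making the reserved slot reusable again. */
void on_ack(int dest) {
    credits[dest]++;
}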

Challenge: Avoiding Fetch Deadlock
A node must continue accepting messages even when it cannot send messages of its own.
- what if the incoming transaction is a request?
  - each request may generate a response, which cannot be sent!
  - what happens when internal buffering is full?
Approaches (a toy sketch of approach 1 follows this list):
1. Logically independent request/reply networks
   - separate physical networks
   - virtual channels with separate input/output queues
2. Bound requests and reserve input buffer space
   - K(P-1) requests + K responses per node
   - service discipline to avoid fetch deadlock?
3. NACK on input buffer full
   - NACK delivery?
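One way to picture approach 1: keep logically separate queues for requests and replies, and always drain replies first, so a full request queue can never block the responses that would free buffer space. A toy sketch; queue_t, msg_t, and the helper functions are hypothetical names, not a real network API.

/* Toy illustration of logically independent request/reply handling.
   All types and helpers below are hypothetical. */
typedef struct msg   msg_t;
typedef struct queue queue_t;

extern int  pop(queue_t *q, msg_t **m);          /* hypothetical: dequeue one message */
extern void handle_reply(msg_t *m);              /* hypothetical */
extern int  reply_space_available(void);         /* hypothetical */
extern void handle_request_and_reply(msg_t *m);  /* hypothetical */

void poll_network(queue_t *reply_q, queue_t *request_q) {
    msg_t *m;

    /* Replies never generate new traffic, so servicing them only frees space. */
    while (pop(reply_q, &m))
        handle_reply(m);

    /* Accept a request only when its reply can be sent (or buffered). */
    while (reply_space_available() && pop(request_q, &m))
        handle_request_and_reply(m);
}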

Implementation Challenges: Big Picture
- One-way transfer of information
- No global knowledge, nor global control
  - barriers, scans, reduce, and global-OR give only fuzzy global state
- Very large number of concurrent transactions
- Management of input buffer resources
  - many sources can issue a request and over-commit the destination before any of them sees the effect
- Latency is large enough that you are tempted to take risks
  - e.g., optimistic protocols, large transfers, dynamic allocation

Lecture 27: Implementing Parallel Runtimes, Part 2
Parallel Computer Architecture and Programming, CMU 15-418/15-618

Objectives What are the costs of using parallelism APIs? How do the runtimes operate?

Basis of Lecture
This lecture is based on runtime and source code analysis of Intel's open source parallel runtimes:
- OpenMP: https://www.openmprtl.org/
- Cilk: https://bitbucket.org/intelcilkruntime/intel-cilk-runtime
and on the LLVM compiler:
- OpenMP is part of LLVM as of 3.8
- Cilk: http://cilkplus.github.io/

OpenMP and Cilk
What do these have in common? - pthreads (both runtimes are built on them)
What benefit does separating the abstraction from the implementation provide?

Simple OpenMP Loop Compiled
What is this code doing? What do the OpenMP semantics specify? How might you accomplish this?

extern float foo( void );

int main (int argc, char** argv) {
    int i;
    float r = 0.0;

    #pragma omp parallel for schedule(dynamic) reduction(+:r)
    for ( i = 0; i < 10; i++ ) {
        r += foo();
    }
    return 0;
}

Example from the OpenMP runtime documentation

Simple OpenMP Loop Compiled

extern float foo( void );

int main (int argc, char** argv) {
    static int zero = 0;
    auto int gtid;
    auto float r = 0.0;

    __kmpc_begin( &loc3, 0 );
    gtid = __kmpc_global_thread_num( &loc3 );
    __kmpc_fork_call( &loc7, 1, main_7_parallel_3, &r );
    __kmpc_end( &loc0 );
    return 0;
}

__kmpc_fork_call: call a function (the microtask) in parallel with the given argument(s)

Example from the OpenMP runtime documentation

Simple OpenMP Loop Compiled
The OpenMP microtask:
- Each thread runs the task
- Initializes its local iteration bounds and reduction variable
- Each thread receives chunks of iterations and operates on them locally
- After finishing all chunks, each thread combines its partial result into the global reduction

struct main_10_reduction_t_5 { float r_10_rpr; };

void main_7_parallel_3( int *gtid, int *btid, float *r_7_shp ) {
    auto int i_7_pr;
    auto int lower, upper, liter, incr;
    auto struct main_10_reduction_t_5 reduce;

    reduce.r_10_rpr = 0.F;
    liter = 0;
    __kmpc_dispatch_init_4( &loc7, *gtid, 35, 0, 9, 1, 1 );
    while ( __kmpc_dispatch_next_4( &loc7, *gtid, &liter, &lower, &upper, &incr ) ) {
        for ( i_7_pr = lower; upper >= i_7_pr; i_7_pr++ )
            reduce.r_10_rpr += foo();
    }
    switch ( __kmpc_reduce_nowait( &loc10, *gtid, 1, 4, &reduce, main_10_reduce_5, &lck ) ) {
    case 1:
        *r_7_shp += reduce.r_10_rpr;
        __kmpc_end_reduce_nowait( &loc10, *gtid, &lck );
        break;
    case 2:
        __kmpc_atomic_float4_add( &loc10, *gtid, r_7_shp, reduce.r_10_rpr );
        break;
    default:;
    }
}

Example from the OpenMP runtime documentation

Simple OpenMP Loop Compiled
All of the generated code combined:

extern float foo( void );

int main (int argc, char** argv) {
    static int zero = 0;
    auto int gtid;
    auto float r = 0.0;

    __kmpc_begin( &loc3, 0 );
    gtid = __kmpc_global_thread_num( &loc3 );
    __kmpc_fork_call( &loc7, 1, main_7_parallel_3, &r );
    __kmpc_end( &loc0 );
    return 0;
}

struct main_10_reduction_t_5 { float r_10_rpr; };

static kmp_critical_name lck = { 0 };
static ident_t loc10;

void main_10_reduce_5( struct main_10_reduction_t_5 *reduce_lhs,
                       struct main_10_reduction_t_5 *reduce_rhs ) {
    reduce_lhs->r_10_rpr += reduce_rhs->r_10_rpr;
}

void main_7_parallel_3( int *gtid, int *btid, float *r_7_shp ) {
    auto int i_7_pr;
    auto int lower, upper, liter, incr;
    auto struct main_10_reduction_t_5 reduce;

    reduce.r_10_rpr = 0.F;
    liter = 0;
    __kmpc_dispatch_init_4( &loc7, *gtid, 35, 0, 9, 1, 1 );
    while ( __kmpc_dispatch_next_4( &loc7, *gtid, &liter, &lower, &upper, &incr ) ) {
        for ( i_7_pr = lower; upper >= i_7_pr; i_7_pr++ )
            reduce.r_10_rpr += foo();
    }
    switch ( __kmpc_reduce_nowait( &loc10, *gtid, 1, 4, &reduce, main_10_reduce_5, &lck ) ) {
    case 1:
        *r_7_shp += reduce.r_10_rpr;
        __kmpc_end_reduce_nowait( &loc10, *gtid, &lck );
        break;
    case 2:
        __kmpc_atomic_float4_add( &loc10, *gtid, r_7_shp, reduce.r_10_rpr );
        break;
    default:;
    }
}

Example from the OpenMP runtime documentation

Fork Call
Forks execution and calls a specified routine (the microtask):
- Determine how many threads to allocate to the parallel region
- Set up the task structures
- Release the allocated threads from their idle loop

Iteration Mechanisms
Static, compile-time iteration schedules:
- kmp_for_static_init
- Computes one set of iteration bounds
Everything else:
- kmp_dispatch_next
- Computes the next set of iteration bounds on each call

OMP Barriers
Two phases: gather and release (a toy sketch of the linear scheme follows this list)
- Gather: non-master threads pass (signal their arrival), while the master waits for all of them
- Release: the opposite - the master signals, and the waiting threads proceed
The barrier structure can be:
- Linear
- Tree
- Hypercube
- Hierarchical
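A toy sketch of the linear gather/release idea using C11 atomics; this is only to illustrate the two phases, and the real libomp barrier code (tree, hypercube, hierarchical variants) is considerably more elaborate.

/* Toy gather/release (linear) barrier with C11 atomics.  Illustrative only. */
#include <stdatomic.h>

typedef struct {
    atomic_int arrived;     /* gather: count of non-master threads that have arrived */
    atomic_int release;     /* release: generation counter bumped by the master */
    int num_threads;
} toy_barrier_t;            /* arrived and release start at 0; num_threads set up front */

void toy_barrier_wait(toy_barrier_t *b, int tid) {
    int gen = atomic_load(&b->release);
    if (tid != 0) {
        /* Gather phase: non-master threads check in, then wait to be released. */
        atomic_fetch_add(&b->arrived, 1);
        while (atomic_load(&b->release) == gen)
            ;                                   /* spin until the master releases us */
    } else {
        /* Master waits for everyone to arrive, then releases them. */
        while (atomic_load(&b->arrived) != b->num_threads - 1)
            ;
        atomic_store(&b->arrived, 0);           /* reset for the next barrier episode */
        atomic_fetch_add(&b->release, 1);       /* release phase */
    }
}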

OMP Atomic
Can the compiler do this in a single read-modify-write (RMW) instruction? Otherwise, it creates a compare-and-swap loop (sketched below).

T* val;
T update;
#pragma omp atomic
*val += update;

If T is int, this compiles to a lock add. If T is float, it becomes a lock cmpxchg loop. Why?
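Roughly what the lock cmpxchg path boils down to, sketched with the GCC/Clang __atomic builtins; the runtime's actual float path (e.g., __kmpc_atomic_float4_add) differs in detail, and the function below is only an illustration.

/* Sketch of a compare-and-swap loop for an atomic float add. Illustrative only. */
#include <stdint.h>
#include <string.h>

void atomic_float_add(float *val, float update) {
    uint32_t *bits = (uint32_t *)val;       /* reinterpret the float's 32-bit storage */
    uint32_t old_bits, new_bits;
    float old_f, new_f;
    do {
        old_bits = __atomic_load_n(bits, __ATOMIC_RELAXED);
        memcpy(&old_f, &old_bits, sizeof old_f);
        new_f = old_f + update;             /* compute the new value locally */
        memcpy(&new_bits, &new_f, sizeof new_bits);
        /* cmpxchg: install new_bits only if *val still holds old_bits; retry otherwise. */
    } while (!__atomic_compare_exchange_n(bits, &old_bits, new_bits, 1,
                                          __ATOMIC_SEQ_CST, __ATOMIC_RELAXED));
}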

OMP Tasks
#pragma omp task depend(inout: x)
The runtime creates a microtask for each task
- Dependences are tracked as a list of address/length tuples
A small usage example follows.
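A small example (mine, not from the slides) of how depend clauses order tasks: the second task names x as an input, so it cannot start until the first task, which names x as inout, has finished.

/* Small example of task dependences; compile with -fopenmp. */
#include <stdio.h>

int main(void) {
    int x = 0, y = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(inout: x)
        x += 1;                      /* modifies x */

        #pragma omp task depend(in: x) depend(out: y)
        y = x * 2;                   /* runs only after the task above completes */

        #pragma omp taskwait
        printf("x=%d y=%d\n", x, y); /* prints x=1 y=2 */
    }
    return 0;
}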

Cilk
Covered in Lecture 5. We discussed the what and the why; now the how.

Simple Cilk Program Compiled
What is this code doing? What do the Cilk semantics specify? Which call is the child? Which is the continuation?

int fib(int n) {
    if (n < 2)
        return n;
    int a = cilk_spawn fib(n-1);
    int b = fib(n-2);
    cilk_sync;
    return a + b;
}

How to create a continuation?
A continuation needs all of the state required to continue:
- register values, stack, etc.
What function allows code to jump back to a prior point of execution? setjmp(jmp_buf env) (a tiny demo follows)
- Saves the stack context
- Control returns to it via longjmp(env, val)
- setjmp returns 0 when saving, and val when returning via longjmp
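A tiny demonstration of the setjmp/longjmp mechanism the runtime builds on; the function and variable names are mine.

/* setjmp returns 0 when saving the context and `val` when re-entered via longjmp. */
#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

static void jump_back(void) {
    longjmp(env, 7);                 /* resume at the saved point, making setjmp return 7 */
}

int main(void) {
    int r = setjmp(env);             /* 0 on the initial call */
    if (r == 0) {
        printf("saved continuation\n");
        jump_back();                 /* never returns here */
    }
    printf("resumed via longjmp, r=%d\n", r);   /* r == 7 */
    return 0;
}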

Basic Block: Unit of Code Analysis
A sequence of instructions:
- Execution can only enter at the first instruction
  - cannot jump into the middle
- Execution can only exit at the last instruction
  - via a branch or function call
  - or by falling through to the start of another basic block

Simple Cilk Program Revisited
[Control-flow diagram of the compiled fib: from entry, setjmp saves the continuation; on a nonzero return, execution resumes at fib(n-2); on 0, the continuation is (maybe) saved and fib(n-1) is called; the parallel and serial paths rejoin at a second setjmp asking "is sync?", which may call __cilkrts_sync; both paths then compute fib1 + fib2, leave the frame, and return.]

Cilk Workers
While there may be work (a pseudocode sketch of this loop follows):
- Try to get the next item from our own queue
- Else try to steal work from a random worker's queue
- If no work is found, wait on a semaphore
If a work item is found:
- Resume with the continuation's stack
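A pseudocode sketch of the worker loop described above. deque_t, work_t, the pop/steal helpers, and resume_on_stack() are hypothetical names for illustration, not the actual Cilk runtime API.

/* Pseudocode sketch of a work-stealing worker loop.  Names are hypothetical. */
#include <semaphore.h>
#include <stdlib.h>

typedef struct work  work_t;
typedef struct deque deque_t;

extern work_t *deque_pop_bottom(deque_t *d);    /* take from our own queue   */
extern work_t *deque_steal_top(deque_t *d);     /* steal from a victim queue */
extern void    resume_on_stack(work_t *w);      /* run on the item's stack   */

void worker_loop(int self, deque_t **deques, int nworkers, sem_t *work_sem) {
    for (;;) {
        work_t *w = deque_pop_bottom(deques[self]);      /* our queue first */
        if (!w) {
            int victim = rand() % nworkers;              /* pick a random victim */
            w = deque_steal_top(deques[victim]);
        }
        if (!w) {
            sem_wait(work_sem);     /* no work anywhere: sleep until signaled */
            continue;
        }
        resume_on_stack(w);         /* resume with the continuation's stack */
    }
}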

Thread Local Storage
Linux supports thread-local storage (a small example follows):
- New in C11: the _Thread_local keyword
  - one instance of the "global" variable per thread
- The compiler places the values into .tbss
- The OS provides each thread with this space
Since Cilk and OpenMP are built on pthreads, these values live in the layer below them.
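A small example (mine) of C11 thread-local storage: each pthread sees its own copy of counter, which the compiler places in .tbss.

/* Each thread gets its own instance of `counter`. */
#include <pthread.h>
#include <stdio.h>

static _Thread_local int counter = 0;     /* one instance per thread */

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++)
        counter++;                        /* no data race: the copy is private */
    printf("counter=%d\n", counter);      /* prints 1000 in every thread */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}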

DEMO