Lecture 15: Data-Driven Tasks, Point-to-Point Synchronization with Phasers


COMP 322: Fundamentals of Parallel Programming
Lecture 15: Data-Driven Tasks, Point-to-Point Synchronization with Phasers
Zoran Budimlić and Mack Joyner
{zoran, mjoyner}@rice.edu
http://comp322.rice.edu
COMP 322, Lecture 15, 14 February 2018

Worksheet #14 Solution: Iterative Averaging Revisited

Answer the questions in the table below for the versions of the Iterative Averaging code shown in slides 7, 8, 10, and 12. Write your answers as functions of m, n, and nc.

                                                       Slide 7   Slide 8   Slide 10   Slide 12
How many tasks are created
(excluding the main program task)?                       m*n        n        m*nc        nc
How many barrier operations
(calls to next per task) are performed?                   0         m          0          m

The SPMD version in slide 12 is the most efficient because it only creates nc tasks. (Task creation is more expensive than a barrier operation.)

Dataflow Computing

Original idea: replace machine instructions by a small set of dataflow operators: Fork, Primitive Ops, Switch, and Merge.

(Figure: schematic of the four dataflow operators.)

Example instruction sequence and its dataflow graph

x = a + b;
y = b * 7;
z = (x - y) * (x + y);

(Figure: dataflow graph with inputs a, b, and the constant 7, and operator nodes 1-5 producing x, y, and z.)

An operator executes when all its input values are present; copies of the result value are distributed to the destination operators. There are no separate branch instructions.

Macro-Dataflow Programming

- Macro-dataflow = extension of the dataflow model from instruction-level to task-level operations
- General idea: build an arbitrary task graph, but restrict all inter-task communication to single-assignment variables (like futures)
  - Static dataflow ==> graph is fixed when program execution starts
  - Dynamic dataflow ==> graph can grow dynamically
- Semantic guarantees: race-freedom and determinism
- Deadlocks are possible due to unavailable inputs (but they are deterministic)

(Figure: task graph with communication via single-assignment variables.)

Extending HJ Futures for Macro-Dataflow: Data-Driven Futures (DDFs)

HjDataDrivenFuture<T1> ddfA = newDataDrivenFuture();
- Allocate an instance of a data-driven-future object (container)
- Object in container must be of type T1, and can only be assigned once via put() operations
- HjDataDrivenFuture extends the HjFuture interface

ddfA.put(V);
- Store object V (of type T1) in ddfA, thereby making ddfA available
- Single-assignment rule: at most one put is permitted on a given DDF

Extending HJ Futures for Macro-Dataflow: Data-Driven Tasks (DDTs)

asyncAwait(ddfA, ddfB, ..., () -> Stmt);
- Create a new data-driven task to start executing Stmt after all of ddfA, ddfB, ... become available (i.e., after the task becomes "enabled")
- The await clause can be used to implement nodes and edges in a computation graph

ddfA.get()
- Return the value (of type T1) stored in ddfA
- Throws an exception if put() has not been performed
- Should be performed by asyncs that contain ddfA in their await clause, or when there is some other synchronization to guarantee that the put() was performed
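To make the API concrete, here is a minimal sketch (not from the slides) that expresses the earlier instruction-sequence example (x = a + b; y = b * 7; z = (x - y) * (x + y)) with DDFs and DDTs. The sample input values are assumptions made so the fragment is self-contained, and it assumes the usual HJlib static imports are in scope.

// assumes the HJlib static imports (e.g., from edu.rice.hj.Module1) are in scope
finish(() -> {
    final int a = 2, b = 3;   // sample inputs (assumed)
    HjDataDrivenFuture<Integer> x = newDataDrivenFuture();
    HjDataDrivenFuture<Integer> y = newDataDrivenFuture();
    HjDataDrivenFuture<Integer> z = newDataDrivenFuture();
    async(() -> x.put(a + b));        // x = a + b
    async(() -> y.put(b * 7));        // y = b * 7
    asyncAwait(x, y, () ->            // enabled once both x and y are available
        z.put((x.get() - y.get()) * (x.get() + y.get())));   // z = (x-y) * (x+y)
    asyncAwait(z, () -> System.out.println("z = " + z.get()));
});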

Converting previous Future example to Data-Driven Futures and AsyncAwait Tasks

1. finish(() -> {
2.   HjDataDrivenFuture<Void> ddfA = newDataDrivenFuture();
3.   HjDataDrivenFuture<Void> ddfB = newDataDrivenFuture();
4.   HjDataDrivenFuture<Void> ddfC = newDataDrivenFuture();
5.   HjDataDrivenFuture<Void> ddfD = newDataDrivenFuture();
6.   HjDataDrivenFuture<Void> ddfE = newDataDrivenFuture();
7.   asyncAwait(ddfA, () -> { ... ; ddfB.put(...); });         // Task B
8.   asyncAwait(ddfA, () -> { ... ; ddfC.put(...); });         // Task C
9.   asyncAwait(ddfB, ddfC, () -> { ... ; ddfD.put(...); });   // Task D
10.  asyncAwait(ddfC, () -> { ... ; ddfE.put(...); });         // Task E
11.  asyncAwait(ddfD, ddfE, () -> { ... });                    // Task F
12.  // Note that creating a producer task after its consumer
13.  // task is permitted with DDFs & DDTs, but not with futures
14.  async(() -> { ... ; ddfA.put(...); });                    // Task A
15. }); // finish

Differences between Futures and DDFs/DDTs

- A consumer task blocks on get() for each future that it reads, whereas an async-await task does not start execution until all of its DDFs are available
- Future tasks cannot deadlock, but it is possible for a DDT to block indefinitely ("deadlock") if one of its input DDFs never becomes available
- DDTs and DDFs are more general than futures
  - A producer task can only write to a single future object, whereas a DDT can write to multiple DDF objects
  - The choice of which future object to write to is tied to a future task at creation time, whereas the choice of output DDF can be deferred to any point within a DDT (see the sketch below)
  - Consumer DDTs can be created before their producer tasks
- DDTs and DDFs can be implemented more efficiently than futures
  - An asyncAwait statement does not block the worker, unlike a future.get()
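A brief sketch (not from the slides) of the multiple-output and creation-order points above, reusing the DDF/DDT API introduced earlier; compute() is a hypothetical helper.

// Consumer DDTs can be created before their producer task exists
HjDataDrivenFuture<Integer> ddfX = newDataDrivenFuture();
HjDataDrivenFuture<Integer> ddfY = newDataDrivenFuture();
asyncAwait(ddfX, () -> System.out.println("x = " + ddfX.get()));
asyncAwait(ddfY, () -> System.out.println("y = " + ddfY.get()));
// A single producer task writes to multiple DDFs, and the put() calls are
// issued inside the task body rather than being fixed at task-creation time
async(() -> {
    int v = compute();   // hypothetical helper
    ddfX.put(v);
    ddfY.put(v * 2);
});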

Two Exception (error) cases for DDFs that cannot occur with futures

- Case 1: If two put()s are attempted on the same DDF, an exception is thrown because of the violation of the single-assignment rule
  - There can be at most one value provided for a future object (since it comes from the producer task's return statement)
- Case 2: If a get() is attempted by a task on a DDF that was not in the task's await list, then an exception is thrown because DDFs do not support blocking gets
  - Futures support blocking gets
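For concreteness, a small sketch (not from the slides) of both error cases, using the same API; the variable names are illustrative.

HjDataDrivenFuture<Integer> ddf1 = newDataDrivenFuture();
ddf1.put(1);
ddf1.put(2);               // Case 1: second put() violates the single-assignment rule => exception

HjDataDrivenFuture<Integer> ddf2 = newDataDrivenFuture();   // no put() is ever performed on ddf2
async(() -> {
    int v = ddf2.get();    // Case 2: get() on a DDF not in this task's await clause => exception
});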

Deadlock example with DDTs (cannot be reproduced with futures)

A parallel program execution contains a deadlock if some task's execution remains incomplete due to it being blocked indefinitely awaiting some condition.

1. HjDataDrivenFuture left = newDataDrivenFuture();
2. HjDataDrivenFuture right = newDataDrivenFuture();
3. finish(() -> {
4.   asyncAwait(left, () -> {
5.     right.put(rightWriter()); });
6.   asyncAwait(right, () -> {
7.     left.put(leftWriter()); });
8. });

HJ-Lib has a deadlock detection mode:
- Enabled using: System.setProperty(HjSystemProperty.trackDeadlocks.propertyKey(), "true");
- Throws an edu.rice.hj.runtime.util.DeadlockException when a deadlock is detected

Barrier vs Point-to-Point Synchronization in One-Dimensional Iterative Averaging Example

(Figure: computation graphs for iterations iter = i and iter = i+1, one using barrier synchronization and one using point-to-point synchronization.)

Question: when can the point-to-point computation graph result in a smaller CPL than the barrier computation graph?
Answer: when there is variability in the node execution times. With a barrier, every task in iteration i+1 waits for the slowest task in iteration i, whereas with point-to-point synchronization a task only waits for its neighbors, so a single slow node does not delay the entire phase.

Phasers: a unified construct for barrier and point-to-point synchronization

- HJ phasers unify barriers with point-to-point synchronization
  - Inspiration for java.util.concurrent.Phaser
- The previous example motivated the need for point-to-point synchronization
  - With barriers, phase i of a task waits for all tasks associated with the same barrier to complete phase i-1
  - With phasers, phase i of a task can select a subset of tasks to wait for
- Phaser properties
  - Support for barrier and point-to-point synchronization
  - Support for dynamic parallelism --- the ability for tasks to drop phaser registrations on termination (end), and for new tasks to add phaser registrations (asyncPhased)
  - A task may be registered on multiple phasers in different modes

Simple Example with Four Async Tasks and One Phaser

1. finish(() -> {
2.   ph = newPhaser(SIG_WAIT); // mode is SIG_WAIT
3.   asyncPhased(ph.inMode(SIG), () -> {
4.     // A1 (SIG mode)
5.     doA1Phase1(); next(); doA1Phase2(); });
6.   asyncPhased(ph.inMode(SIG_WAIT), () -> {
7.     // A2 (SIG_WAIT mode)
8.     doA2Phase1(); next(); doA2Phase2(); });
9.   asyncPhased(ph.inMode(HjPhaserMode.SIG_WAIT), () -> {
10.    // A3 (SIG_WAIT mode)
11.    doA3Phase1(); next(); doA3Phase2(); });
12.  asyncPhased(ph.inMode(HjPhaserMode.WAIT), () -> {
13.    // A4 (WAIT mode)
14.    doA4Phase1(); next(); doA4Phase2(); });
15. });

Computation Graph Schema: Simple Example with Four Async Tasks and One Phaser

Semantics of next depends on the registration mode:
- SIG_WAIT: next = signal + wait
- SIG: next = signal
- WAIT: next = wait

(Figure: computation graph with one column of step nodes per task, registered in modes SIG, SIG_WAIT, SIG_WAIT, and WAIT, connected by signal, wait, and next edges.)

A master thread (worker) gathers all signals and broadcasts a barrier completion.

Summary of Phaser Construct

- Phaser allocation
  - HjPhaser ph = newPhaser(mode);
  - Phaser ph is allocated with registration mode
  - Phaser lifetime is limited to the scope of its Immediately Enclosing Finish (IEF)
- Registration modes
  - HjPhaserMode.SIG, HjPhaserMode.WAIT, HjPhaserMode.SIG_WAIT, HjPhaserMode.SIG_WAIT_SINGLE
  - NOTE: phaser WAIT is unrelated to Java wait/notify (which we will study later)
- Phaser registration
  - asyncPhased(ph1.inMode(<mode1>), ph2.inMode(<mode2>), ..., () -> <stmt>)
  - Spawned task is registered with ph1 in mode1, with ph2 in mode2, ...
  - Child task's capabilities must be a subset of the parent's
  - asyncPhased(() -> <stmt>) propagates all of the parent's phaser registrations to the child
- Synchronization
  - next();
  - Advance each phaser that the current task is registered on to its next phase
  - Semantics depends on the registration mode
  - A barrier is a special case of a phaser, which is why next is used for both
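A minimal sketch (not from the slides) of registration propagation, assuming the no-phaser-argument form of asyncPhased described above; doWork() is a hypothetical helper.

finish(() -> {
    final HjPhaser ph = newPhaser(HjPhaserMode.SIG_WAIT);
    asyncPhased(ph.inMode(HjPhaserMode.SIG_WAIT), () -> {
        doWork(1);
        // the child task below inherits this task's SIG_WAIT registration on ph
        asyncPhased(() -> { doWork(2); next(); doWork(3); });
        next();   // advances ph to its next phase together with the child
    });
});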

Capability Hierarchy

A task can be registered in one of four modes with respect to a phaser: SIG_WAIT_SINGLE, SIG_WAIT, SIG, or WAIT. The mode defines the set of capabilities (signal, wait, single) that the task has with respect to the phaser. The subset relationship defines a natural hierarchy of the registration modes. A task can drop (but not add) capabilities after initialization.

- SIG_WAIT_SINGLE = { signal, wait, single }
- SIG_WAIT = { signal, wait }
- SIG = { signal }
- WAIT = { wait }

Left-Right Neighbor Synchronization (with m=3 tasks)

1. finish(() -> { // Task-0
2.   final HjPhaser ph1 = newPhaser(SIG_WAIT);
3.   final HjPhaser ph2 = newPhaser(SIG_WAIT);
4.   final HjPhaser ph3 = newPhaser(SIG_WAIT);
5.   asyncPhased(ph1.inMode(SIG), ph2.inMode(WAIT),
6.     () -> { doPhase1(1);
7.       next(); // signals ph1, waits on ph2
8.       doPhase2(1);
9.     }); // Task T1
10.  asyncPhased(ph2.inMode(SIG), ph1.inMode(WAIT), ph3.inMode(WAIT),
11.    () -> { doPhase1(2);
12.      next(); // signals ph2, waits on ph1 and ph3
13.      doPhase2(2);
14.    }); // Task T2
15.  asyncPhased(ph3.inMode(SIG), ph2.inMode(WAIT),
16.    () -> { doPhase1(3);
17.      next(); // signals ph3, waits on ph2
18.      doPhase2(3);
19.    }); // Task T3
20. }); // finish

Computation Graph for m=3 example (without async-finish nodes and edges)

(Figure: each task Ti contributes a chain of step nodes labeled with its code line numbers: 6, 7-signal, 7-wait, 8 for T1; 11, 12-signal, 12-wait, 13 for T2; and 16, 17-signal, 17-wait, 18 for T3. Each phaser contributes ph1/ph2/ph3 next-start(0->1) and next-end(0->1) nodes. Edges are labeled continue, signal, and wait.)