DATA-DRIVEN TASKS AND THEIR IMPLEMENTATION. SAĞNAK TAŞIRLAR, VIVEK SARKAR. DEPARTMENT OF COMPUTER SCIENCE, RICE UNIVERSITY


1 DATA-DRIVEN TASKS AND THEIR IMPLEMENTATION. Sağnak Taşırlar, Vivek Sarkar, Department of Computer Science, Rice University

2 Fork/Join graphs constrain parallelism. Fork/Join models restrict task graphs to be series-parallel, so some task graphs cannot be described without hampering parallelism. Fork/Join models also couple control and data dependences: tasks can only be created after all of their data dependences are satisfied, which necessitates ordering task creation to conform to that restriction and may hamper performance.

3 Macro-dataflow for intuitive parallelism. Kernel-based programming: build a task graph of kernel instantiations (e.g., a main task instantiating TaskA, TaskB, and TaskC). Restricting dependences to true dependences gives race-freedom and determinism. Single-assignment data provides productivity.

4 Futures [Baker & Hewitt 1977]. future = (storage, resolving process, waiting tasks). Example: Future F = { stmt1; ...; return v; }; task g = { stmt; ...; F.get(); ... }. The consumer task g blocks in F.get() until F's resolving process has returned v.
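For readers more familiar with standard Java than the futures construct on the slide, the same producer/consumer pattern can be sketched with java.util.concurrent (an analogy only, not the construct used in the talk):

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;

  public class FutureSketch {
      public static void main(String[] args) throws Exception {
          ExecutorService pool = Executors.newFixedThreadPool(2);
          // Task F: the resolving process; its return value resolves the future.
          Future<Integer> f = pool.submit(() -> 40 + 2);
          // Task G: a waiting task; get() blocks this worker until F is resolved.
          Future<Integer> g = pool.submit(() -> f.get() * 10);
          System.out.println(g.get());   // prints 420
          pool.shutdown();
      }
  }

As the Related Work slide later notes, get() blocks the worker thread; that is exactly the cost DDFs avoid by registering the consumer instead.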

5 Data-Driven Futures (DDFs) & Data-Driven Tasks (DDTs). DataDrivenFuture = (storage, waiting tasks). Creation: create an empty Data-Driven Future (DDF) object. Resolution ("put", by the resolving process): resolve what value the DDF refers to. Data-Driven Tasks (DDTs), declared with "async await(...)": a task provides its consumer list of DDFs on declaration and can only read DDFs it is registered to. Difference from futures: creation of the container (DDF) and of the computation (DDT) are separate events.

6 DDF/DDT Code Sample.

  DataDrivenFuture left  = new DataDrivenFuture();
  DataDrivenFuture right = new DataDrivenFuture();
  finish {
    async await (left)        useLeftChild(left);             // Task 1
    async await (right)       useRightChild(right);           // Task 2
    async await (left, right) useBothChildren(left, right);   // Task 3
    async left.put(leftChildCreator());                       // Task 4
    async right.put(rightChildCreator());                     // Task 5
  }

Resulting dependence graph: Task 4 precedes Tasks 1 and 3; Task 5 precedes Tasks 2 and 3.

7 DDTs provide: support for non-series-parallel task dependence graphs (less restricted parallelism, better scheduling opportunities); single assignment (SA), which gives race-freedom on DDF accesses and determinism if all shared data is expressed as DDFs; and an SA-value lifetime restriction that is smaller than the whole graph's lifetime. The DDF creator provides the DDF reference to producers and consumers, so a DDF's lifetime depends on the creator's, the resolver's, and the consumers' lifetimes.

8 Data-Driven Scheduling. Steps register themselves with the items wrapped into DDFs. Each DDF starts out as a placeholder that records its waiting tasks; when a producer resolves it, the value replaces the placeholder and every registered task whose inputs are now all available is moved to the ready queue. For the running example:

  DDF left = new DDF();
  DDF right = new DDF();
  async await (left)        use(left);          // Task 1
  async await (right)       use(right);         // Task 2
  async await (left, right) use(left, right);   // Task 3
  async builder(left);                          // Task 4
  async builder(right);                         // Task 5

Task 1 registers on left, Task 2 on right, and Task 3 on both; the builder tasks (4 and 5) resolve the DDFs, moving Tasks 1 and 2, and finally Task 3, to the ready queue.
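A minimal plain-Java sketch of the register/resolve protocol described above. This is an illustration under assumptions, not the Habanero-Java runtime: SimpleDDF, Node, and readyQueue are hypothetical names.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Queue;
  import java.util.concurrent.ConcurrentLinkedQueue;
  import java.util.concurrent.atomic.AtomicInteger;

  // A single-assignment container that tracks the tasks waiting for its value.
  class SimpleDDF<T> {
      // A consumer task plus a count of its still-unresolved input DDFs.
      static final class Node {
          final Runnable task;
          final AtomicInteger pending;
          Node(Runnable task, int awaitedDdfs) { this.task = task; this.pending = new AtomicInteger(awaitedDdfs); }
      }

      static final Queue<Runnable> readyQueue = new ConcurrentLinkedQueue<>();

      private T value;                                   // placeholder until put()
      private final List<Node> waiters = new ArrayList<>();

      // "async await(this) ...": remember the consumer, or count it down if already resolved.
      synchronized void register(Node consumer) {
          if (value == null) waiters.add(consumer);
          else if (consumer.pending.decrementAndGet() == 0) readyQueue.add(consumer.task);
      }

      // "this.put(v)": resolve once and promote consumers whose inputs are all available.
      synchronized void put(T v) {
          if (value != null) throw new IllegalStateException("single-assignment violated");
          value = v;
          for (Node consumer : waiters)
              if (consumer.pending.decrementAndGet() == 0) readyQueue.add(consumer.task);
          waiters.clear();
      }

      synchronized T get() { return value; }             // only called by ready consumers
  }

Task 3 from the example would then be expressed roughly as a Node with a pending count of 2 registered on both left and right; whichever of the two put() calls happens last moves it to readyQueue.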

9 Mapping Macro-Dataflow to Task-Parallelism. Control and data dependences should be first-class constructs; task-parallel frameworks (e.g., OpenMP, Cilk) have them coupled. Kernel instantiations may have multiple predecessors and need to wait for all of them, which suggests staged readiness: created (control dependence satisfied), data dependences satisfied, schedulable/ready. DDTs provide a natural implementation for macro-dataflow: every kernel instantiation is a DDT, data dependences between DDTs are expressed through DDFs, and this provides race freedom.

10 Experimental Results. Compared the DDT implementation with four macro-dataflow schedulers from past work that used Concurrent Collections (CnC). CnC uses global data collections to synchronize tasks; the DDT/DDF results were obtained at the task-parallel level without allocating global data collections. CnC can be automatically translated to DDFs (ongoing work).

11 Blocking Schedulers. Use Java wait/notify for premature data access. Blocking granularity: instance level vs. collection level (fine-grain vs. coarse-grain). A blocked task blocks an entire worker thread, so more worker threads must be created to avoid deadlock. (Diagram: Worker C issues Get(key_c) on ItemCollection Θ in step 1 and waits; Worker D issues Put(key_c, value_c) in step 2 and notifies.)
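A plain-Java sketch of the blocking Get/Put interaction in the diagram (an illustration only, not the CnC item-collection API; the class name is made up):

  import java.util.HashMap;
  import java.util.Map;

  // Blocking-scheduler style: a premature Get parks the whole worker thread
  // in wait() until a Put for that key arrives.
  class BlockingItemCollection<K, V> {
      private final Map<K, V> items = new HashMap<>();

      public synchronized V get(K key) throws InterruptedException {
          while (!items.containsKey(key)) {
              wait();                        // Worker C blocks here (step 1)
          }
          return items.get(key);
      }

      public synchronized void put(K key, V value) {
          items.put(key, value);             // Worker D supplies value_c (step 2)
          notifyAll();                       // wake waiters; each re-checks its key
      }
  }

This sketch uses collection-level notification (every Put wakes every waiter); an instance-level variant would keep a separate monitor per key. Either way, a blocked step occupies its worker thread, which is why these schedulers need extra worker threads to avoid deadlock.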

12 Delayed async Scheduling. Every kernel instantiation is a guarded execution; the guard condition is the availability of its input data. A task can be created eagerly, before its input data is available, and is promoted to ready when the data is provided. Example:

  Value left  = new Value();
  Value right = new Value();
  finish {
    async when (left.isReady())                    useLeftChild(left);            // Task 1
    async when (right.isReady())                   useRightChild(right);          // Task 2
    async when (right.isReady() && left.isReady()) useBothChildren(left, right);  // Task 3
    async left.put(leftChildCreator());                                           // Task 4
    async right.put(rightChildCreator());                                         // Task 5
  }

(Flowchart: a worker pops a task from the work-sharing ready task queue; if it is a delayed async, the guard is evaluated; if the guard is true the task is scheduled, otherwise it is requeued.)
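The pop/evaluate-guard/requeue loop in the flowchart could look roughly like this in plain Java (GuardedTask and DelayedAsyncWorker are hypothetical names for illustration, not the Habanero-Java implementation):

  import java.util.Queue;
  import java.util.function.BooleanSupplier;

  // A delayed async: a kernel body guarded by a data-availability predicate.
  final class GuardedTask {
      final BooleanSupplier guard;    // e.g. () -> left.isReady() && right.isReady()
      final Runnable body;
      GuardedTask(BooleanSupplier guard, Runnable body) { this.guard = guard; this.body = body; }
  }

  final class DelayedAsyncWorker implements Runnable {
      private final Queue<GuardedTask> readyQueue;     // shared work-sharing queue
      DelayedAsyncWorker(Queue<GuardedTask> readyQueue) { this.readyQueue = readyQueue; }

      @Override public void run() {
          GuardedTask t;
          while ((t = readyQueue.poll()) != null) {
              if (t.guard.getAsBoolean()) {
                  t.body.run();            // guard holds: execute the kernel instantiation
              } else {
                  readyQueue.add(t);       // guard false: requeue and try again later
              }
          }
          // A real worker would block or steal instead of exiting when the queue is
          // momentarily empty; this sketch only shows the guard/requeue cycle.
      }
  }

The repeated guard evaluation and requeueing is the overhead that data-driven scheduling avoids by registering tasks on DDFs instead of polling their inputs.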

13 Data-Driven Rollback & Replay. Throw an exception to unwind a step that accesses data prematurely, and insert the step into the waitlist of the missing item: in the diagram, step 1's Get(key_c) finds key_c empty in ItemCollection Θ, so step 1 is added to waitlist_c and unwound. When Worker D performs Put(key_c, value_c) in step 2, the steps in waitlist_c are re-executed.
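A simplified, single-worker sketch of rollback and replay in plain Java (illustrative names; the real scheduler hands replayed steps back to worker threads rather than running them inline):

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Thrown to unwind a step that tried to Get an item that is not there yet.
  class DataNotReadyException extends RuntimeException {}

  class RollbackReplayCollection<K, V> {
      private final Map<K, V> values = new HashMap<>();
      private final Map<K, List<Runnable>> waitlists = new HashMap<>();
      private Runnable currentStep;                      // step currently executing

      synchronized V get(K key) {
          V v = values.get(key);
          if (v == null) {
              // Insert the whole step into this key's waitlist, then unwind it.
              waitlists.computeIfAbsent(key, k -> new ArrayList<>()).add(currentStep);
              throw new DataNotReadyException();
          }
          return v;
      }

      synchronized void put(K key, V value) {
          values.put(key, value);
          List<Runnable> replay = waitlists.remove(key);
          if (replay != null) replay.forEach(this::execute);   // re-execute waiting steps
      }

      void execute(Runnable step) {
          currentStep = step;
          try { step.run(); }
          catch (DataNotReadyException e) { /* rolled back; now waiting on another key */ }
      }
  }

A step must therefore be safe to re-execute from its beginning, and any work it did before the failing Get is repeated; that is the price rollback and replay pays for keeping worker threads unblocked.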

14 Experimental Setup. Xeon: 4-socket quad-core Intel Xeon, shared 3MB L2 cache per pair of cores, 32 GB main memory, 16 worker threads. Niagara: 8-core, 8-way SMT Sun UltraSPARC T2, shared 4MB L2 cache. Sun Hotspot JDK 1.6 JVM; GCC for JNI. 30 runs for statistical soundness. Read "Serial" as single-threaded execution of the code.

15 Cholesky Decomposition. (Bar chart: execution time in milliseconds, Serial vs. Parallel, for the Coarse Grain Blocking, Fine Grain Blocking, Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures/Tasks schedulers.) Average execution times and 90% confidence intervals over 30 runs of single-threaded and 16-threaded executions for the blocked Cholesky decomposition CnC application with Habanero-Java steps on the 16-core Xeon.

16 Black-Scholes Formula (PARSEC). (Bar chart: execution time in milliseconds, Serial vs. Parallel, for the Coarse Grain Blocking, Fine Grain Blocking, Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures/Tasks schedulers.) Average execution times and 90% confidence intervals over 30 runs of single-threaded and 16-threaded executions for the blocked Black-Scholes CnC application with Habanero-Java steps on the 16-core Xeon with input size 1,000,000 and tile size 62,500.

17 Rician Denoising (Medical Imaging). (Bar chart: execution time in milliseconds, Serial vs. Parallel, for the Coarse Grain Blocking*, Fine Grain Blocking*, Delayed Async*, and Data Driven Futures/Tasks schedulers.) Average execution times and 90% confidence intervals over 30 runs of single-threaded and 16-threaded executions for the blocked Rician Denoising CnC application with Habanero-Java steps on the Xeon. * Explicit memory management was required for the non-DDT schedulers to avoid out-of-memory exceptions.

18 Heart Wall Tracking Dependence Graph. (Diagram: Steps 1 through 10 within iteration J and iteration J+1, with dependences inside each iteration and across consecutive iterations.)

19 Heart Wall Tracking (Rodinia). (Bar chart: execution time in milliseconds, Serial vs. Parallel, for the Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures/Tasks schedulers.) Minimum execution times of 13 runs of single-threaded and 16-threaded executions for the Heart Wall Tracking CnC application with C steps on the Xeon with 104 frames.

20 Related Work. Futures: can build arbitrary task graphs, but get()/force() is usually a blocking operation and future task creation is bound to the container at creation time. Dataflow: typically blocks on one datum (I-var) at a time, unlike async await(...). Nabbit (a Cilk library): can build arbitrary task graphs, but is more explicit than DDTs and has no garbage collection or unwinding of the task graph. Concurrent Collections (CnC): globalized data collections and general tags (keys) make memory management challenging; DDTs can be used to obtain more efficient implementations of CnC.

21 Conclusions. Data-Driven Futures and Data-Driven Tasks: help build arbitrary task graphs and extend task-parallel frameworks; introduce the more intuitive macro-dataflow model to programmers on task-parallel frameworks; support data-driven scheduling that outperforms alternative schedulers in both execution time and memory requirements; and help implement blocking in tasks without blocking workers.

22 Future Work. Compile Concurrent Collections down to DDTs. Compiler optimizations to move DDF allocations and further reduce lifetimes. Hierarchical DDTs for granularity optimizations. Work-stealing support for DDTs. Use DDTs to implement all blocking synchronizations without blocking workers, i.e., replace each waiting continuation with a DDT. Locality-aware scheduling with DDTs. For a hands-on trial, visit
