1 SCHEDULING MACRO-DATAFLOW PROGRAMS ON TASK-PARALLEL RUNTIME SYSTEMS
SAĞNAK TAŞIRLAR
2 Thesis
Our thesis is that advances in task-parallel runtime systems can enable a macro-dataflow programming model, like Concurrent Collections (CnC), to deliver productivity and performance on modern multicore processors.
3 Approach
- Macro-dataflow for expressiveness: determinism, race/deadlock freedom, higher-level abstraction
- Task-parallel runtimes for performance: portable scalability, contemporary consensus
4 Motivation
- Parallelism is not accessible to those who need it most; serial thinking is imposed
- Parallelism for the masses, not just computer scientists
- Parallel programming models of today hide machine details but expose parallelism details, and constrain expressiveness
5 Contributions
- Scheduling CnC on Habanero Java
- Evaluation of scheduling performance for CnC
- Introduction of the Data Driven Futures (DDF) construct
- Implementation of the DDF construct
- Implementation and evaluation of a data-driven runtime with DDFs
Zoran Budimlić, Michael Burke, Vincent Cavé, Kathleen Knobe, Geoff Lowney, Ryan Newton, Jens Palsberg, David Peixotto, Vivek Sarkar, Frank Schlimbach, Sağnak Taşırlar, "The CnC Programming Model", submitted for publication to the Journal of Supercomputing, 2010
6 Outline
Background · CnC Scheduling · Data Driven Futures · Results · Wrap-up
7 Dynamic Task Parallelism
Properties:
- Over-exposure of parallelism
- Scales up/down with the number of cores
- Scheduling maps sets of tasks to threads at runtime
Habanero Java (HJ) employs:
- Finish/async parallelism
- Feeding child tasks data through lexical scope
- Work-sharing/work-stealing runtime scheduling
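As a rough illustration of the finish/async idiom mentioned above, the construct can be emulated on plain Java threads. This is a minimal sketch, not the HJ implementation; the names `FinishAsync`, `async`, `finish`, and `parallelSum` are illustrative.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

// Minimal sketch of HJ-style finish/async on plain Java:
// `finish` waits until every task spawned in its scope completes.
public class FinishAsync {
    private final Phaser phaser = new Phaser(1);      // one party for the finish scope itself
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    public void async(Runnable body) {
        phaser.register();                            // announce the child task to the finish scope
        pool.execute(() -> {
            try { body.run(); } finally { phaser.arriveAndDeregister(); }
        });
    }

    public void finish() {                            // join: wait for all registered tasks
        phaser.arriveAndAwaitAdvance();
        pool.shutdown();
    }

    public static int parallelSum(int n) {            // sum 1..n with one async per addend
        FinishAsync f = new FinishAsync();
        AtomicInteger sum = new AtomicInteger();
        for (int i = 1; i <= n; i++) {
            final int v = i;
            f.async(() -> sum.addAndGet(v));
        }
        f.finish();                                   // all asyncs are guaranteed done here
        return sum.get();
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(4));           // 1+2+3+4 = 10
    }
}
```

Note how the child tasks receive their data (`v`, `sum`) through the lexical scope of the spawning method, as in HJ.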
8 CnC Concepts
- Step: computation abstraction; side-effect free; functional w.r.t. input; special step: the environment
- Item: dynamic single assignment; a value, not storage
- Tag: data tags index items; control tags index steps
- Collections of steps, items, and tags are written (SomeStep), [SomeItem], <Tag> in both the graphical and the textual notation
9 Concurrent Collections Model
Can be classified as: declarative, deterministic, dynamic single assignment, macro-dataflow, a coordination language.
Goal: consider only semantic ordering constraints, inherent in the application rather than the implementation, as described by the CnC graph.
10 Example Program Specification
- Break an input string into sequences of repeated single characters
- Filter, allowing only sequences of odd length
Input string: aaaffqqqmmmmmmm
Sequences of repeated characters: aaa ff qqq mmmmmmm
Filtered sequences: aaa qqq mmmmmmm
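The two steps of this specification can be sketched sequentially in plain Java; the method names `createSpan` and `processSpan` mirror the CnC steps but are otherwise hypothetical:

```java
import java.util.*;

// Sequential sketch of the example CnC graph's semantics:
// createSpan breaks a string into runs of a repeated character (the [span] items),
// processSpan keeps only runs of odd length (the [results] items).
public class SpanFilter {
    static List<String> createSpan(String input) {
        List<String> spans = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int j = i;
            while (j < input.length() && input.charAt(j) == input.charAt(i)) j++;
            spans.add(input.substring(i, j));    // one maximal run of a single character
            i = j;
        }
        return spans;
    }

    static List<String> processSpan(List<String> spans) {
        List<String> results = new ArrayList<>();
        for (String s : spans)
            if (s.length() % 2 == 1) results.add(s);   // odd-length filter
        return results;
    }

    public static void main(String[] args) {
        List<String> spans = createSpan("aaaffqqqmmmmmmm");
        System.out.println(spans);                     // [aaa, ff, qqq, mmmmmmm]
        System.out.println(processSpan(spans));        // [aaa, qqq, mmmmmmm]
    }
}
```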
11 CnC Implementation of Example Program
Graph: [input: j] → (createspan: j) → [span: j, s] → (processspan: j, s) → [results: j, s], with control tags <stringtag: j> prescribing createspan and <spantag: j, s> prescribing processspan.
Example instance:
[input: 1] = aaaffqqqmmmmmmm, [input: 2] = rrhhhhxxx
[span: 1, 1] = aaa, [span: 1, 2] = ff, [span: 1, 3] = qqq, [span: 1, 4] = mmmmmmm
[results: 1, 1] = aaa, [results: 1, 3] = qqq, [results: 1, 4] = mmmmmmm
12 CnC-Habanero Java Build Model
- The CnC translator compiles a textual CnC graph into Habanero Java source files: abstract classes for all steps, definitions for all collections, and graph definition and initialization code
- The user supplies Habanero Java code to invoke the graph, put initial values into it, and implement the abstract steps
- The HJ compiler produces .class files, which a JAR builder links with the Concurrent Collections library and the Habanero Java runtime library into a Java application
13 Outline
Background · CnC Scheduling · Data Driven Futures · Results · Wrap-up
14 CnC Scheduling Challenges
- Control and data dependences are first-class constructs; task-parallel frameworks have them coupled
- Step instances have multiple predecessors and need to wait for all of them
- Layered readiness concepts: control dependence satisfied; data dependences satisfied; schedulable/ready
15 Eager Scheduling
- Treat control-dependence satisfaction as readiness, conforming to the task-parallel runtime's assumption
- Wait for data-dependence satisfaction for safety, either by:
  - Blocking when data is read prematurely, or
  - Discarding the task that read prematurely and replaying it when the data arrives
16 Blocking Eager CnC Schedulers
- Use Java wait/notify for premature data access
- Blocking granularity: instance level vs. collection level
- A blocked task blocks its whole thread: deadlock becomes possible, and more threads must be created as threads block over time
[Diagram: thread Ε blocks in Get(data-tagγ) on item collection Θ against a placeholder; when thread Δ's step performs Put(data-tagγ, valueγ), the collection notifies and the blocked step resumes]
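The wait/notify pattern above can be sketched as a tiny blocking item collection. This is a simplified illustration with collection-level granularity (one monitor guards the whole collection, so any put wakes all blocked readers); the class and method names are not the CnC-HJ API.

```java
import java.util.*;

// Sketch of a blocking eager item collection:
// get() parks the calling thread with wait() until put() delivers the value.
public class BlockingItemCollection<K, V> {
    private final Map<K, V> items = new HashMap<>();

    public synchronized V get(K tag) {
        try {
            while (!items.containsKey(tag))   // premature read: block the whole worker thread
                wait();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return items.get(tag);
    }

    public synchronized void put(K tag, V value) {
        if (items.containsKey(tag))
            throw new IllegalStateException("dynamic single assignment violated: " + tag);
        items.put(tag, value);
        notifyAll();                          // wake every step blocked on this collection
    }

    public static void main(String[] args) throws Exception {
        BlockingItemCollection<String, Integer> c = new BlockingItemCollection<>();
        Thread producer = new Thread(() -> c.put("gamma", 42));
        producer.start();
        System.out.println(c.get("gamma"));   // blocks until the put, then prints 42
        producer.join();
    }
}
```

The single monitor is exactly why a blocked task pins its thread: the worker can do no other useful work while parked in `wait()`.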
17 Data Driven Rollback & Replay
An alternative eager scheduling. The blocking scheduler suffers from expensive recovery from a premature read: it blocks the whole thread, creates a new thread, and switches context to the new thread on every failure. Instead:
- Inform the item instance of the failed task and discard the task
- Throw an exception to unwind the failed task
- Catch it in the runtime and continue with another ready task
- Recreate the task when the needed item arrives
18 Data Driven Rollback & Replay
[Diagram: step 1's Get(data-tagγ) finds an empty placeholder and throws an exception to unwind; step 1 is inserted into waitlistγ; when step 2 performs Put(data-tagγ, valueγ), the steps on waitlistγ are recreated]
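The rollback & replay mechanism can be sketched with a single-threaded driver standing in for the runtime; all names here (`RollbackReplay`, `DataNotReady`, the tag/waitlist maps) are illustrative, not the CnC-HJ implementation.

```java
import java.util.*;

// Sketch of rollback & replay: a premature get() throws, the runtime catches the
// exception, parks the step on the item's waitlist, and re-creates it on put().
public class RollbackReplay {
    static class DataNotReady extends RuntimeException {
        final String tag;
        DataNotReady(String tag) { this.tag = tag; }
    }

    private final Map<String, Integer> items = new HashMap<>();
    private final Map<String, List<Runnable>> waitlists = new HashMap<>();
    final List<Integer> log = new ArrayList<>();      // observable side effects of steps

    int get(String tag) {
        Integer v = items.get(tag);
        if (v == null) throw new DataNotReady(tag);   // roll back: unwind the failed step
        return v;
    }

    void put(String tag, int value) {
        items.put(tag, value);
        List<Runnable> waiters = waitlists.remove(tag);
        if (waiters != null)
            for (Runnable step : waiters) run(step);  // replay: re-create each parked step
    }

    void run(Runnable step) {
        try {
            step.run();
        } catch (DataNotReady e) {                    // caught by the runtime, not the user
            waitlists.computeIfAbsent(e.tag, k -> new ArrayList<>()).add(step);
        }
    }

    public static void main(String[] args) {
        RollbackReplay rt = new RollbackReplay();
        rt.run(() -> rt.log.add(rt.get("gamma") + 1)); // fails, parked on gamma's waitlist
        rt.put("gamma", 41);                           // replays the step
        System.out.println(rt.log);                    // [42]
    }
}
```

Note the cost the slides point out: the step's work before the failed `get` is thrown away and redone on replay.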
19 Data Driven Scheduling
- Do not create tasks until their data dependences are satisfied: no failure, no recovery
- Restrict the computation frontier to ready tasks
- Evaluation of the data-readiness condition:
  - Busy-waiting on data (delayed-async scheduling)
  - Dataflow-like readiness (data-driven scheduling): register tasks on data, and data notifies consumer tasks when created
20 Delayed Asyncs
A guarded execution construct for HJ: a delayed async is promoted to an ordinary async when its guard evaluates to true.
[Diagram: a worker pops a task from the work-sharing ready queue; if it is not delayed, it is assigned to a thread; if it is delayed, its guard is evaluated, and the task either runs (guard true) or is requeued (guard false)]
21 Delayed Async Scheduling
- Every CnC step is a guarded execution; the guard condition is the availability of the items it consumes
- The task is still created eagerly when its control dependence is provided
- It is promoted to ready when its data is provided
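The pop/evaluate/requeue loop can be sketched with a single cooperative worker; this is a toy version (the names `DelayedAsyncQueue`, `delayedAsync`, `drain` are illustrative, and a spin bound guards against livelock rather than a real runtime's scheduling policy).

```java
import java.util.*;
import java.util.function.*;

// Sketch of delayed-async scheduling: each task carries a guard, and an
// unsatisfied guard sends the task back to the ready queue (busy-waiting).
public class DelayedAsyncQueue {
    static class Task {
        final BooleanSupplier guard; final Runnable body;
        Task(BooleanSupplier g, Runnable b) { guard = g; body = b; }
    }
    private final Deque<Task> ready = new ArrayDeque<>();

    void delayedAsync(BooleanSupplier guard, Runnable body) {
        ready.add(new Task(guard, body));              // created eagerly, maybe not runnable yet
    }

    void drain() {
        int spins = 0;
        while (!ready.isEmpty() && spins++ < 1_000) {  // bound the spinning in this toy version
            Task t = ready.poll();
            if (t.guard.getAsBoolean()) t.body.run();  // promoted: guard holds, run to completion
            else ready.add(t);                         // requeue: keep busy-waiting on the guard
        }
    }

    public static void main(String[] args) {
        DelayedAsyncQueue q = new DelayedAsyncQueue();
        Map<String, Integer> items = new HashMap<>();
        List<Integer> out = new ArrayList<>();
        // Consumer is created first but cannot run until its item is available.
        q.delayedAsync(() -> items.containsKey("x"), () -> out.add(items.get("x") * 2));
        q.delayedAsync(() -> true, () -> items.put("x", 21));   // producer has a trivial guard
        q.drain();
        System.out.println(out);   // [42]
    }
}
```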
22 Outline
Background · CnC Scheduling · Data Driven Futures · Results · Wrap-up
23 Data Driven Futures (DDFs)
A task-parallel synchronization construct that acts as a reference to a single-assignment value.
- Creation: create a dangling reference object
- Resolution (put): resolve which value the DDF refers to
- Registration (await): a task provides a consume list of DDFs at declaration, and can only read DDFs it is registered to
24 Data Driven Futures (DDFs)
  DataDrivenFuture leftchild = new DataDrivenFuture();
  DataDrivenFuture rightchild = new DataDrivenFuture();
  finish {
    async leftchild.put(leftchildcreator());
    async rightchild.put(rightchildcreator());
    async await(leftchild) useleftchild(leftchild);
    async await(rightchild) userightchild(rightchild);
    async await(leftchild, rightchild) usebothchildren(leftchild, rightchild);
  }
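The DDF operations above can be approximated on plain Java; this sketch uses `CompletableFuture` as a stand-in for `DataDrivenFuture` (an assumption of the sketch, not the HJ API), with `put` as single-assignment resolution and `asyncAwait` as registration on the full consume list.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of DDF semantics: create a dangling reference, resolve it exactly once,
// and schedule a consumer only after every DDF on its consume list is resolved.
public class DdfSketch {
    static <T> CompletableFuture<T> ddf() { return new CompletableFuture<>(); }

    static <T> void put(CompletableFuture<T> ddf, T value) {
        if (!ddf.complete(value))                       // complete() is false if already resolved
            throw new IllegalStateException("dynamic single assignment violated");
    }

    // Register a consumer on its consume list: it runs only when every DDF is ready.
    static void asyncAwait(Runnable body, CompletableFuture<?>... consumeList) {
        CompletableFuture.allOf(consumeList).thenRun(body);
    }

    public static void main(String[] args) {
        CompletableFuture<Integer> left = ddf(), right = ddf();
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        asyncAwait(() -> log.add("left=" + left.join()), left);
        asyncAwait(() -> log.add("both=" + (left.join() + right.join())), left, right);
        put(left, 1);     // first consumer becomes ready and runs
        put(right, 2);    // second consumer becomes ready and runs
        System.out.println(log);
    }
}
```

Inside a consumer, `join()` never blocks: registration guarantees the value was resolved before the task became ready, which is exactly the DDF reading rule.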
25 Contributions of DDFs
- Support for non-series-parallel task dependency graphs (e.g. TaskA..TaskF with cross dependences)
- Memory-footprint reduction: only the ready part of the execution frontier is exposed
- A DDF's lifetime is not global: the creator feeds consumers and gives access to the producer, so its lifetime is restricted to the lifetimes of its creator, resolver, and consumers, and it can be garbage collected on a managed runtime
26 Data Driven Scheduling
Steps register themselves on the items they consume, which are wrapped into DDFs.
[Diagram: the environment creates DDFα, DDFβ, DDFδ, then Task A resolving DDFα, Task M reading DDFα and DDFβ, Task D resolving DDFδ, Task B resolving DDFβ, and Task N reading DDFβ and DDFδ; the resolver tasks A, D, B enter the ready queue immediately, while tasks M and N become ready only after the DDFs they read are resolved]
27 Outline
Background · CnC Scheduling · Data Driven Futures · Results · Wrap-up
28 Performance Evaluation Legend
- Coarse Grain Blocking: eager blocking scheduling on item collections, for CnC-HJ
- Fine Grain Blocking: eager blocking scheduling on item instances, for CnC-HJ
- Delayed Async: data-driven scheduling via HJ delayed asyncs, for CnC-HJ
- Data Driven Rollback & Replay: eager scheduling with replay and notifications, for CnC-HJ
- Data Driven Futures: hand-coded equivalent of the CnC application on HJ with DDFs
29 Cholesky Decomposition Introduction
- Dense linear algebra kernel with three inherent kernels that need to be pipelined for best performance
- Loop parallelism and data parallelism within some of the kernels
- CnC has been shown to beat optimized libraries such as Intel MKL
30 Cholesky Decomposition on Xeon
[Bar chart of execution times in milliseconds, 1 worker vs. 16 workers, for Coarse Grain Blocking, Fine Grain Blocking, Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures]
Minimum execution times of 30 runs of single-threaded and 16-threaded executions for the blocked Cholesky decomposition CnC application with Habanero-Java steps on Xeon with input matrix size and tile size
31 Cholesky Decomposition on Xeon
[Bar chart of execution times in milliseconds, 1 worker vs. 16 workers, for the five schedulers]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 16-threaded executions for the blocked Cholesky decomposition CnC application with Habanero-Java steps on Xeon with input matrix size and tile size
32 Cholesky Decomposition with MKL on Xeon
[Bar chart, 1 worker vs. 16 workers; 1 worker: 1,782 / 1,766 / 1,834 / 1,821 / 1,723 ms for the five schedulers]
Minimum execution times of 30 runs of single-threaded and 16-threaded executions for the blocked Cholesky decomposition CnC application with Habanero-Java and Intel MKL steps on Xeon with input matrix size and tile size
33 Cholesky Decomposition with MKL on Xeon
[Bar chart of execution times in milliseconds, 1 worker vs. 16 workers, for the five schedulers]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 16-threaded executions for the blocked Cholesky decomposition CnC application with Habanero-Java and Intel MKL steps on Xeon with input matrix size and tile size
34 Cholesky Decomposition on UltraSPARC T2
[Bar chart; 1 worker: 103,883 / 103,817 / 109,335 / 108,297 / 96,587 ms; 64 workers: 6,489 / 5,388 / 5,651 / 5,259 / 4,950 ms for Coarse Grain Blocking, Fine Grain Blocking, Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures]
Minimum execution times of 30 runs of single-threaded and 64-threaded executions for the blocked Cholesky decomposition CnC application with Habanero-Java steps on UltraSPARC T2 with input matrix size and tile size
35 Cholesky Decomposition on UltraSPARC T2
[Bar chart; 64 workers: 7,035 / 6,993 / 6,958 / 6,339 / 5,681 ms for the five schedulers]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 64-threaded executions for the blocked Cholesky decomposition CnC application with Habanero-Java steps on UltraSPARC T2 with input matrix size and tile size
36 Black-Scholes Formula
- Only one step: the Black-Scholes formula
- Embarrassingly parallel, so a good indicator of scheduling overhead
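For reference, the single step's kernel is the closed-form European call price. This sketch is not the thesis code: since `java.lang.Math` has no erf, the standard normal CDF is approximated with the well-known Zelen & Severo polynomial (absolute error around 1e-7).

```java
// Sketch of the Black-Scholes step's kernel: closed-form European call pricing.
public class BlackScholes {
    // Standard normal CDF via the Zelen & Severo polynomial approximation.
    static double phi(double x) {
        double t = 1.0 / (1.0 + 0.2316419 * Math.abs(x));
        double poly = t * (0.319381530 + t * (-0.356563782
                    + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
        double tail = Math.exp(-x * x / 2.0) / Math.sqrt(2.0 * Math.PI) * poly;
        return x >= 0 ? 1.0 - tail : tail;
    }

    // Call price for spot S, strike K, risk-free rate r, volatility sigma, maturity T (years).
    static double call(double S, double K, double r, double sigma, double T) {
        double d1 = (Math.log(S / K) + (r + sigma * sigma / 2.0) * T) / (sigma * Math.sqrt(T));
        double d2 = d1 - sigma * Math.sqrt(T);
        return S * phi(d1) - K * Math.exp(-r * T) * phi(d2);
    }

    public static void main(String[] args) {
        // A standard textbook instance: S=100, K=100, r=5%, sigma=20%, T=1 year.
        System.out.printf("%.4f%n", call(100, 100, 0.05, 0.2, 1.0));  // ~10.4506
    }
}
```

Each CnC step instance prices one tile of independent options with this formula, which is why the benchmark isolates scheduling overhead so well.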
37 Black-Scholes on Xeon
[Bar chart; 1 worker: 33,688 / 33,762 / 34,214 / 33,788 / 34,634 ms; 16 workers: 4,148 / 4,155 / 4,216 / 4,145 / 2,164 ms for Coarse Grain Blocking, Fine Grain Blocking, Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures]
Minimum execution times of 30 runs of single-threaded and 16-threaded executions for the blocked Black-Scholes CnC application with Habanero-Java steps on Xeon with input size 1,000,000 and tile size 62,500
38 Black-Scholes on Xeon
[Bar chart, 1 worker vs. 16 workers; 1 worker: 33,871 / 33,966 / 34,311 / 34,121 / 34,729 ms for the five schedulers]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 16-threaded executions for the blocked Black-Scholes CnC application with Habanero-Java steps on Xeon with input size 1,000,000 and tile size 62,500
39 Black-Scholes on UltraSPARC T2
[Bar chart, 1 worker vs. 64 workers; 64 workers: 6,032 / 5,944 / 6,006 / 5,973 / 4,019 ms for the five schedulers]
Minimum execution times of 30 runs of single-threaded and 64-threaded executions for the blocked Black-Scholes CnC application with Habanero-Java steps on UltraSPARC T2 with input size 1,000,000 and tile size 15,625
40 Black-Scholes on UltraSPARC T2
[Bar chart, 1 worker vs. 64 workers; 64 workers: 6,179 / 6,159 / 6,149 / 6,145 / 4,296 ms for the five schedulers]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 64-threaded executions for the blocked Black-Scholes CnC application with Habanero-Java steps on UltraSPARC T2 with input size 1,000,000 and tile size 15,625
41 Rician Denoising
- Image-processing algorithm with more than four kernels, mostly stencil computations
- Non-trivial dependency graph; fixed-point algorithm; enormous data size
- The CnC schedulers needed explicit memory management; DDFs took advantage of garbage collection
42 Rician Denoising on Xeon
[Bar chart; 1 worker: 500,914 / 499,840 / 512,470 / 481,328 ms; 16 workers: 78,630 / 56,892 / 51,632 / 52,208 ms for Coarse Grain Blocking*, Fine Grain Blocking*, Delayed Async*, and Data Driven Futures]
Minimum execution times of 30 runs of single-threaded and 16-threaded executions for the blocked Rician Denoising CnC application with Habanero-Java steps on Xeon with input image size and tile size
43 Rician Denoising on Xeon
[Bar chart; 16 workers: 81,502 / 58,313 / 53,569 / 53,817 ms for Coarse Grain Blocking*, Fine Grain Blocking*, Delayed Async*, and Data Driven Futures]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 16-threaded executions for the blocked Rician Denoising CnC application with Habanero-Java steps on Xeon with input image size and tile size
44 Rician Denoising on UltraSPARC T2
[Bar chart of execution times in milliseconds, 1 worker vs. 64 workers, for Coarse Grain Blocking*, Fine Grain Blocking*, Delayed Async*, and Data Driven Futures]
Minimum execution times of 30 runs of single-threaded and 64-threaded executions for the blocked Rician Denoising CnC application with Habanero-Java steps on UltraSPARC T2 with input image size and tile size
45 Rician Denoising on UltraSPARC T2
[Bar chart of execution times in milliseconds, 1 worker vs. 64 workers, for Coarse Grain Blocking*, Fine Grain Blocking*, Delayed Async*, and Data Driven Futures]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 64-threaded executions for the blocked Rician Denoising CnC application with Habanero-Java steps on UltraSPARC T2 with input image size and tile size
46 Heart Wall Tracking
- Medical imaging application with nested kernels: the first level is embarrassingly parallel, the second has an intricate dependency graph
- Memory management: many failures on the eager schedulers; the blocking schedulers ran out of memory
47 Heart Wall Tracking on Xeon
[Bar chart of execution times in milliseconds, 1 worker vs. 16 workers, for Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures]
Minimum execution times of 13 runs of single-threaded and 16-threaded executions for the Heart Wall Tracking CnC application with C steps on Xeon with 104 frames
48 Heart Wall Tracking on Xeon
[Bar chart of execution times in milliseconds, 1 worker vs. 16 workers, for Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures]
Average execution times and 90% confidence interval of 13 runs of single-threaded and 16-threaded executions for the Heart Wall Tracking CnC application with C steps on Xeon with 104 frames
49 Outline
Background · CnC Scheduling · Data Driven Futures · Results · Wrap-up
50 Related Work
- Alternative parallel programming models: either too verbose or with constrained parallelism
- Alternative futures and promises: creation and resolution are coupled; either lazy or blocking execution semantics
- Support for unstructured parallelism: the Nabbit library for Cilk++ allows arbitrary task graphs, notifying immediate successors via atomic counter updates, but does not differentiate between data and control dependences
51 Conclusions
- Macro-dataflow is a viable parallelism model: it provides expressiveness while hiding parallelism concerns
- Macro-dataflow can perform competitively by taking advantage of modern task-parallel models
52 Future Work
- Compiling CnC to the data-driven runtime (currently hand-ported); needs finer-grain dependence analysis via tag functions
- Data Driven Future support for work stealing
- Compiler support for automatic DDF registration
- Hierarchical DDFs
- Locality-aware scheduling support for DDFs
53 Acknowledgments
- Committee: Zoran Budimlić, Keith D. Cooper, Vivek Sarkar, Lin Zhong
- Journal of Supercomputing co-authors: Zoran Budimlić, Michael Burke, Vincent Cavé, Kathleen Knobe, Geoff P. Lowney, Ryan R. Newton, Jens Palsberg, David Peixotto, Vivek Sarkar, Frank Schlimbach
- Habanero multicore software research project team members: Zoran Budimlić, Vincent Cavé, Philippe Charles, Vivek Sarkar, Alina Simion Sbîrlea, Dragoș Sbîrlea, Jisheng Zhao
- Intel Technology Pathfinding and Innovation, Software and Services Group: Mark Hampton, Kathleen Knobe, Geoff P. Lowney, Ryan R. Newton, Frank Schlimbach
- Benchmarks: Aparna Chandramowlishwaran (Georgia Tech) and Zoran Budimlić (Rice) for Cholesky decomposition; Yu-Ting Chen (UCLA) for Rician denoising; David Peixotto (Rice) for the Black-Scholes formula; Alina Simion Sbîrlea (Rice) for Heart Wall Tracking
54 Feedback and Clarifications
Thanks for your attention
More informationFinish Accumulators: a Deterministic Reduction Construct for Dynamic Task Parallelism
Finish Accumulators: a Deterministic Reduction Construct for Dynamic Task Parallelism Jun Shirako, Vincent Cavé, Jisheng Zhao, and Vivek Sarkar Department of Computer Science, Rice University {shirako,vincent.cave,jisheng.zhao,vsarkar@rice.edu
More informationPrinciples of Parallel Algorithm Design: Concurrency and Mapping
Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 28 August 2018 Last Thursday Introduction
More informationRecursive inverse factorization
Recursive inverse factorization Anton Artemov Division of Scientific Computing, Uppsala University anton.artemov@it.uu.se 06.09.2017 Context of work Research group on large scale electronic structure computations
More informationLecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter
Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)
More informationDynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle
Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation
More informationECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs
More informationParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser
ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050
More informationCOMP 322: Fundamentals of Parallel Programming Module 1: Parallelism
c 2017 by Vivek Sarkar February 14, 2017 DRAFT VERSION Contents 0 Course Organization 3 1 Task-level Parallelism 3 1.1 Task Creation and Termination (Async, Finish).......................... 3 1.2 Computation
More informationMonitors; Software Transactional Memory
Monitors; Software Transactional Memory Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico March 17, 2016 CPD (DEI / IST) Parallel and Distributed
More informationCooperative Scheduling of Parallel Tasks with General Synchronization Patterns
Cooperative Scheduling of Parallel Tasks with General Synchronization Patterns Shams Imam and Vivek Sarkar Department of Computer Science, Rice University {shams,vsarkar}@rice.edu Abstract. In this paper,
More informationCnC for high performance computing. Kath Knobe Intel SSG
CnC for high performance computing Kath Knobe Intel SSG Thanks Frank Schlimbach Intel Vivek Sarkar and his group Rice University DARPA UHPC (Runnemede) DOE X-Stack (Trilacka Glacier) 2 Outline Intro Checkpoint/restart
More informationParallel Algorithm Design. CS595, Fall 2010
Parallel Algorithm Design CS595, Fall 2010 1 Programming Models The programming model o determines the basic concepts of the parallel implementation and o abstracts from the hardware as well as from the
More informationUniversity of Delaware Department of Electrical and Computer Engineering Computer Architecture and Parallel Systems Laboratory
University of Delaware Department of Electrical and Computer Engineering Computer Architecture and Parallel Systems Laboratory Polytasks: A Compressed Task Representation for HPC Runtimes Daniel Orozco
More informationTransactional Memory
Transactional Memory Architectural Support for Practical Parallel Programming The TCC Research Group Computer Systems Lab Stanford University http://tcc.stanford.edu TCC Overview - January 2007 The Era
More informationAn Update on Haskell H/STM 1
An Update on Haskell H/STM 1 Ryan Yates and Michael L. Scott University of Rochester TRANSACT 10, 6-15-2015 1 This work was funded in part by the National Science Foundation under grants CCR-0963759, CCF-1116055,
More informationLecture 24: Java Threads,Java synchronized statement
COMP 322: Fundamentals of Parallel Programming Lecture 24: Java Threads,Java synchronized statement Zoran Budimlić and Mack Joyner {zoran, mjoyner@rice.edu http://comp322.rice.edu COMP 322 Lecture 24 9
More informationTHREADS AND CONCURRENCY
THREADS AND CONCURRENCY Lecture 22 CS2110 Spring 2013 Graphs summary 2 Dijkstra: given a vertex v, finds shortest path from v to x for each vertex x in the graph Key idea: maintain a 5-part invariant on
More informationDynamic inter-core scheduling in Barrelfish
Dynamic inter-core scheduling in Barrelfish. avoiding contention with malleable domains Georgios Varisteas, Mats Brorsson, Karl-Filip Faxén November 25, 2011 Outline Introduction Scheduling & Programming
More informationThis article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and
This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution
More informationMulticore Computing and Scientific Discovery
scientific infrastructure Multicore Computing and Scientific Discovery James Larus Dennis Gannon Microsoft Research In the past half century, parallel computers, parallel computation, and scientific research
More informationResilient Distributed Concurrent Collections. Cédric Bassem Promotor: Prof. Dr. Wolfgang De Meuter Advisor: Dr. Yves Vandriessche
Resilient Distributed Concurrent Collections Cédric Bassem Promotor: Prof. Dr. Wolfgang De Meuter Advisor: Dr. Yves Vandriessche 1 Evolution of Performance in High Performance Computing Exascale = 10 18
More informationScalable In-memory Checkpoint with Automatic Restart on Failures
Scalable In-memory Checkpoint with Automatic Restart on Failures Xiang Ni, Esteban Meneses, Laxmikant V. Kalé Parallel Programming Laboratory University of Illinois at Urbana-Champaign November, 2012 8th
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationSHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008
SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem
More informationA Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004
A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into
More informationPolyhedral Optimizations for a Data-Flow Graph Language
Polyhedral Optimizations for a Data-Flow Graph Language Alina Sbîrlea 1 Jun Shirako 1 Louis-Noël Pouchet 2 Vivek Sarkar 1 1 Rice University, USA 2 Ohio State University, USA Abstract. This paper proposes
More informationCloud Programming James Larus Microsoft Research. July 13, 2010
Cloud Programming James Larus Microsoft Research July 13, 2010 New Programming Model, New Problems (and some old, unsolved ones) Concurrency Parallelism Message passing Distribution High availability Performance
More informationThroughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling
Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling Tobias Schwarzer 1, Joachim Falk 1, Michael Glaß 1, Jürgen Teich 1, Christian Zebelein 2, Christian
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationUsing Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology
Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore
More informationHeckaton. SQL Server's Memory Optimized OLTP Engine
Heckaton SQL Server's Memory Optimized OLTP Engine Agenda Introduction to Hekaton Design Consideration High Level Architecture Storage and Indexing Query Processing Transaction Management Transaction Durability
More informationCILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY
CILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY 1 OUTLINE CILK and CILK++ Language Features and Usages Work stealing runtime CILK++ Reducers Conclusions 2 IDEALIZED SHARED MEMORY ARCHITECTURE Hardware
More informationA Pluggable Framework for Composable HPC Scheduling Libraries
A Pluggable Framework for Composable HPC Scheduling Libraries Max Grossman 1, Vivek Kumar 2, Nick Vrvilo 1, Zoran Budimlic 1, Vivek Sarkar 1 1 Habanero Extreme Scale So=ware Research Group, Rice University
More informationPOSIX Threads: a first step toward parallel programming. George Bosilca
POSIX Threads: a first step toward parallel programming George Bosilca bosilca@icl.utk.edu Process vs. Thread A process is a collection of virtual memory space, code, data, and system resources. A thread
More informationFILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas
FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS Waqas Akram, Cirrus Logic Inc., Austin, Texas Abstract: This project is concerned with finding ways to synthesize hardware-efficient digital filters given
More informationLecture 27: Safety and Liveness Properties, Java Synchronizers, Dining Philosophers Problem
COMP 322: Fundamentals of Parallel Programming Lecture 27: Safety and Liveness Properties, Java Synchronizers, Dining Philosophers Problem Mack Joyner and Zoran Budimlić {mjoyner, zoran}@rice.edu http://comp322.rice.edu
More informationCompsci 590.3: Introduction to Parallel Computing
Compsci 590.3: Introduction to Parallel Computing Alvin R. Lebeck Slides based on this from the University of Oregon Admin Logistics Homework #3 Use script Project Proposals Document: see web site» Due
More informationImproving the Practicality of Transactional Memory
Improving the Practicality of Transactional Memory Woongki Baek Electrical Engineering Stanford University Programming Multiprocessors Multiprocessor systems are now everywhere From embedded to datacenter
More informationJS Event Loop, Promises, Async Await etc. Slava Kim
JS Event Loop, Promises, Async Await etc Slava Kim Synchronous Happens consecutively, one after another Asynchronous Happens later at some point in time Parallelism vs Concurrency What are those????
More informationMultithreading: Exploiting Thread-Level Parallelism within a Processor
Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced
More informationReminder from last time
Concurrent systems Lecture 7: Crash recovery, lock-free programming, and transactional memory DrRobert N. M. Watson 1 Reminder from last time History graphs; good (and bad) schedules Isolation vs. strict
More informationDeterministic Parallel Programming
Deterministic Parallel Programming Concepts and Practices 04/2011 1 How hard is parallel programming What s the result of this program? What is data race? Should data races be allowed? Initially x = 0
More informationScalable Shared Memory Programing
Scalable Shared Memory Programing Marc Snir www.parallel.illinois.edu What is (my definition of) Shared Memory Global name space (global references) Implicit data movement Caching: User gets good memory
More informationDavid R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.
Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.
More informationPlacement de processus (MPI) sur architecture multi-cœur NUMA
Placement de processus (MPI) sur architecture multi-cœur NUMA Emmanuel Jeannot, Guillaume Mercier LaBRI/INRIA Bordeaux Sud-Ouest/ENSEIRB Runtime Team Lyon, journées groupe de calcul, november 2010 Emmanuel.Jeannot@inria.fr
More informationChapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues 4.2 Silberschatz, Galvin
More informationSustainable Memory Use Allocation & (Implicit) Deallocation (mostly in Java)
COMP 412 FALL 2017 Sustainable Memory Use Allocation & (Implicit) Deallocation (mostly in Java) Copyright 2017, Keith D. Cooper & Zoran Budimlić, all rights reserved. Students enrolled in Comp 412 at Rice
More information