1 SCHEDULING MACRO-DATAFLOW PROGRAMS ON TASK-PARALLEL RUNTIME SYSTEMS
SAĞNAK TAŞIRLAR
2 Thesis
Our thesis is that advances in task-parallel runtime systems can enable a macro-dataflow programming model, like Concurrent Collections (CnC), to deliver productivity and performance on modern multicore processors.
3 Approach
- Macro-dataflow for expressiveness: determinism, race/deadlock freedom, higher-level abstraction
- Task-parallel runtimes for performance: portable scalability, contemporary consensus
4 Motivation
- Parallelism is not accessible to those who need it most; serial thinking is imposed
- Parallelism for the masses, not just computer scientists
- Parallel programming models of today hide machine details but expose parallelism details, and constrain expressiveness
5 Contributions
- Scheduling CnC on Habanero Java
- Evaluation of scheduling performance for CnC
- Introduction of the Data Driven Futures (DDF) construct
- Implementation of the DDF construct
- Implementation and evaluation of a data-driven runtime with DDFs
Zoran Budimlić, Michael Burke, Vincent Cavé, Kathleen Knobe, Geoff Lowney, Ryan Newton, Jens Palsberg, David Peixotto, Vivek Sarkar, Frank Schlimbach, Sağnak Taşırlar, "The CnC Programming Model", submitted for publication to the Journal of Supercomputing, 2010
6 Outline
Background · CnC Scheduling · Data Driven Futures · Results · Wrap-up
7 Dynamic Task Parallelism
Properties:
- Over-exposure of parallelism
- Scales up/down with the number of cores
- Scheduling maps sets of tasks to threads at runtime
Habanero Java (HJ) employs:
- Finish/async parallelism
- Feeding child tasks data through lexical scope
- Work-sharing/work-stealing runtime scheduling
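As a rough illustration of the finish/async idiom mentioned above, the construct can be emulated on plain Java threads. This is a minimal sketch, not the HJ implementation; the names `FinishAsync`, `async`, `finish`, and `parallelSum` are illustrative.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

// Minimal sketch of HJ-style finish/async on plain Java:
// `finish` waits until every task spawned in its scope completes.
public class FinishAsync {
    private final Phaser phaser = new Phaser(1);      // one party for the finish scope itself
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    public void async(Runnable body) {
        phaser.register();                            // announce the child task to the finish scope
        pool.execute(() -> {
            try { body.run(); } finally { phaser.arriveAndDeregister(); }
        });
    }

    public void finish() {                            // join: wait for all registered tasks
        phaser.arriveAndAwaitAdvance();
        pool.shutdown();
    }

    public static int parallelSum(int n) {            // sum 1..n with one async per addend
        FinishAsync f = new FinishAsync();
        AtomicInteger sum = new AtomicInteger();
        for (int i = 1; i <= n; i++) {
            final int v = i;
            f.async(() -> sum.addAndGet(v));
        }
        f.finish();                                   // all asyncs are guaranteed done here
        return sum.get();
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(4));           // 1+2+3+4 = 10
    }
}
```

Note how the child tasks receive their data (`v`, `sum`) through the lexical scope of the spawning method, as in HJ.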
8 CnC Concepts
- Step: computation abstraction; side-effect free; functional w.r.t. input; special step: the environment
- Item: dynamic single assignment; a value, not storage
- Tag: data tags index items; control tags index steps
- Collections of steps, items, and tags are written (SomeStep), [SomeItem], <Tag> in both the graphical and the textual notation
9 Concurrent Collections Model
Can be classified as: declarative, deterministic, dynamic single assignment, macro-dataflow, a coordination language.
Goal: consider only semantic ordering constraints, inherent in the application rather than the implementation, as described by the CnC graph.
10 Example Program Specification
- Break an input string into sequences of repeated single characters
- Filter, allowing only sequences of odd length
Input string: aaaffqqqmmmmmmm
Sequences of repeated characters: aaa ff qqq mmmmmmm
Filtered sequences: aaa qqq mmmmmmm
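The two steps of this specification can be sketched sequentially in plain Java; the method names `createSpan` and `processSpan` mirror the CnC steps but are otherwise hypothetical:

```java
import java.util.*;

// Sequential sketch of the example CnC graph's semantics:
// createSpan breaks a string into runs of a repeated character (the [span] items),
// processSpan keeps only runs of odd length (the [results] items).
public class SpanFilter {
    static List<String> createSpan(String input) {
        List<String> spans = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int j = i;
            while (j < input.length() && input.charAt(j) == input.charAt(i)) j++;
            spans.add(input.substring(i, j));    // one maximal run of a single character
            i = j;
        }
        return spans;
    }

    static List<String> processSpan(List<String> spans) {
        List<String> results = new ArrayList<>();
        for (String s : spans)
            if (s.length() % 2 == 1) results.add(s);   // odd-length filter
        return results;
    }

    public static void main(String[] args) {
        List<String> spans = createSpan("aaaffqqqmmmmmmm");
        System.out.println(spans);                     // [aaa, ff, qqq, mmmmmmm]
        System.out.println(processSpan(spans));        // [aaa, qqq, mmmmmmm]
    }
}
```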
11 CnC Implementation of Example Program
Graph: [input: j] → (createspan: j) → [span: j, s] → (processspan: j, s) → [results: j, s], with control tags <stringtag: j> prescribing createspan and <spantag: j, s> prescribing processspan.
Example instance:
[input: 1] = aaaffqqqmmmmmmm, [input: 2] = rrhhhhxxx
[span: 1, 1] = aaa, [span: 1, 2] = ff, [span: 1, 3] = qqq, [span: 1, 4] = mmmmmmm
[results: 1, 1] = aaa, [results: 1, 3] = qqq, [results: 1, 4] = mmmmmmm
12 CnC-Habanero Java Build Model
- The CnC translator compiles a textual CnC graph into Habanero Java source files: abstract classes for all steps, definitions for all collections, and graph definition and initialization code
- The user supplies Habanero Java code to invoke the graph, put initial values into it, and implement the abstract steps
- The HJ compiler produces .class files, which a JAR builder links with the Concurrent Collections library and the Habanero Java runtime library into a Java application
13 Outline
Background · CnC Scheduling · Data Driven Futures · Results · Wrap-up
14 CnC Scheduling Challenges
- Control and data dependences are first-class constructs; task-parallel frameworks have them coupled
- Step instances have multiple predecessors and need to wait for all of them
- Layered readiness concepts: control dependence satisfied; data dependences satisfied; schedulable/ready
15 Eager Scheduling
- Treat control-dependence satisfaction as readiness, conforming to the task-parallel runtime's assumption
- Wait for data-dependence satisfaction for safety, either by:
  - Blocking when data is read prematurely, or
  - Discarding the task that read prematurely and replaying it when the data arrives
16 Blocking Eager CnC Schedulers
- Use Java wait/notify for premature data access
- Blocking granularity: instance level vs. collection level
- A blocked task blocks its whole thread: deadlock becomes possible, and more threads must be created as threads block over time
[Diagram: thread Ε blocks in Get(data-tagγ) on item collection Θ against a placeholder; when thread Δ's step performs Put(data-tagγ, valueγ), the collection notifies and the blocked step resumes]
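The wait/notify pattern above can be sketched as a tiny blocking item collection. This is a simplified illustration with collection-level granularity (one monitor guards the whole collection, so any put wakes all blocked readers); the class and method names are not the CnC-HJ API.

```java
import java.util.*;

// Sketch of a blocking eager item collection:
// get() parks the calling thread with wait() until put() delivers the value.
public class BlockingItemCollection<K, V> {
    private final Map<K, V> items = new HashMap<>();

    public synchronized V get(K tag) {
        try {
            while (!items.containsKey(tag))   // premature read: block the whole worker thread
                wait();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return items.get(tag);
    }

    public synchronized void put(K tag, V value) {
        if (items.containsKey(tag))
            throw new IllegalStateException("dynamic single assignment violated: " + tag);
        items.put(tag, value);
        notifyAll();                          // wake every step blocked on this collection
    }

    public static void main(String[] args) throws Exception {
        BlockingItemCollection<String, Integer> c = new BlockingItemCollection<>();
        Thread producer = new Thread(() -> c.put("gamma", 42));
        producer.start();
        System.out.println(c.get("gamma"));   // blocks until the put, then prints 42
        producer.join();
    }
}
```

The single monitor is exactly why a blocked task pins its thread: the worker can do no other useful work while parked in `wait()`.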
17 Data Driven Rollback & Replay
An alternative eager scheduling. The blocking scheduler suffers from expensive recovery from a premature read: it blocks the whole thread, creates a new thread, and switches context to the new thread on every failure. Instead:
- Inform the item instance of the failed task and discard the task
- Throw an exception to unwind the failed task
- Catch it in the runtime and continue with another ready task
- Recreate the task when the needed item arrives
18 Data Driven Rollback & Replay
[Diagram: step 1's Get(data-tagγ) finds an empty placeholder and throws an exception to unwind; step 1 is inserted into waitlistγ; when step 2 performs Put(data-tagγ, valueγ), the steps on waitlistγ are recreated]
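The rollback & replay mechanism can be sketched with a single-threaded driver standing in for the runtime; all names here (`RollbackReplay`, `DataNotReady`, the tag/waitlist maps) are illustrative, not the CnC-HJ implementation.

```java
import java.util.*;

// Sketch of rollback & replay: a premature get() throws, the runtime catches the
// exception, parks the step on the item's waitlist, and re-creates it on put().
public class RollbackReplay {
    static class DataNotReady extends RuntimeException {
        final String tag;
        DataNotReady(String tag) { this.tag = tag; }
    }

    private final Map<String, Integer> items = new HashMap<>();
    private final Map<String, List<Runnable>> waitlists = new HashMap<>();
    final List<Integer> log = new ArrayList<>();      // observable side effects of steps

    int get(String tag) {
        Integer v = items.get(tag);
        if (v == null) throw new DataNotReady(tag);   // roll back: unwind the failed step
        return v;
    }

    void put(String tag, int value) {
        items.put(tag, value);
        List<Runnable> waiters = waitlists.remove(tag);
        if (waiters != null)
            for (Runnable step : waiters) run(step);  // replay: re-create each parked step
    }

    void run(Runnable step) {
        try {
            step.run();
        } catch (DataNotReady e) {                    // caught by the runtime, not the user
            waitlists.computeIfAbsent(e.tag, k -> new ArrayList<>()).add(step);
        }
    }

    public static void main(String[] args) {
        RollbackReplay rt = new RollbackReplay();
        rt.run(() -> rt.log.add(rt.get("gamma") + 1)); // fails, parked on gamma's waitlist
        rt.put("gamma", 41);                           // replays the step
        System.out.println(rt.log);                    // [42]
    }
}
```

Note the cost the slides point out: the step's work before the failed `get` is thrown away and redone on replay.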
19 Data Driven Scheduling
- Do not create tasks until their data dependences are satisfied: no failure, no recovery
- Restrict the computation frontier to ready tasks
- Evaluation of the data-readiness condition:
  - Busy-waiting on data (delayed-async scheduling)
  - Dataflow-like readiness (data-driven scheduling): register tasks on data, and data notifies consumer tasks when created
20 Delayed Asyncs
A guarded execution construct for HJ: a delayed async is promoted to an ordinary async when its guard evaluates to true.
[Diagram: a worker pops a task from the work-sharing ready queue; if it is not delayed, it is assigned to a thread; if it is delayed, its guard is evaluated, and the task either runs (guard true) or is requeued (guard false)]
21 Delayed Async Scheduling
- Every CnC step is a guarded execution; the guard condition is the availability of the items it consumes
- The task is still created eagerly when its control dependence is provided
- It is promoted to ready when its data is provided
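The pop/evaluate/requeue loop can be sketched with a single cooperative worker; this is a toy version (the names `DelayedAsyncQueue`, `delayedAsync`, `drain` are illustrative, and a spin bound guards against livelock rather than a real runtime's scheduling policy).

```java
import java.util.*;
import java.util.function.*;

// Sketch of delayed-async scheduling: each task carries a guard, and an
// unsatisfied guard sends the task back to the ready queue (busy-waiting).
public class DelayedAsyncQueue {
    static class Task {
        final BooleanSupplier guard; final Runnable body;
        Task(BooleanSupplier g, Runnable b) { guard = g; body = b; }
    }
    private final Deque<Task> ready = new ArrayDeque<>();

    void delayedAsync(BooleanSupplier guard, Runnable body) {
        ready.add(new Task(guard, body));              // created eagerly, maybe not runnable yet
    }

    void drain() {
        int spins = 0;
        while (!ready.isEmpty() && spins++ < 1_000) {  // bound the spinning in this toy version
            Task t = ready.poll();
            if (t.guard.getAsBoolean()) t.body.run();  // promoted: guard holds, run to completion
            else ready.add(t);                         // requeue: keep busy-waiting on the guard
        }
    }

    public static void main(String[] args) {
        DelayedAsyncQueue q = new DelayedAsyncQueue();
        Map<String, Integer> items = new HashMap<>();
        List<Integer> out = new ArrayList<>();
        // Consumer is created first but cannot run until its item is available.
        q.delayedAsync(() -> items.containsKey("x"), () -> out.add(items.get("x") * 2));
        q.delayedAsync(() -> true, () -> items.put("x", 21));   // producer has a trivial guard
        q.drain();
        System.out.println(out);   // [42]
    }
}
```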
22 Outline
Background · CnC Scheduling · Data Driven Futures · Results · Wrap-up
23 Data Driven Futures (DDFs)
A task-parallel synchronization construct that acts as a reference to a single-assignment value.
- Creation: create a dangling reference object
- Resolution (put): resolve which value the DDF refers to
- Registration (await): a task provides a consume list of DDFs at declaration, and can only read DDFs it is registered to
24 Data Driven Futures (DDFs)
  DataDrivenFuture leftchild = new DataDrivenFuture();
  DataDrivenFuture rightchild = new DataDrivenFuture();
  finish {
    async leftchild.put(leftchildcreator());
    async rightchild.put(rightchildcreator());
    async await(leftchild) useleftchild(leftchild);
    async await(rightchild) userightchild(rightchild);
    async await(leftchild, rightchild) usebothchildren(leftchild, rightchild);
  }
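The DDF operations above can be approximated on plain Java; this sketch uses `CompletableFuture` as a stand-in for `DataDrivenFuture` (an assumption of the sketch, not the HJ API), with `put` as single-assignment resolution and `asyncAwait` as registration on the full consume list.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of DDF semantics: create a dangling reference, resolve it exactly once,
// and schedule a consumer only after every DDF on its consume list is resolved.
public class DdfSketch {
    static <T> CompletableFuture<T> ddf() { return new CompletableFuture<>(); }

    static <T> void put(CompletableFuture<T> ddf, T value) {
        if (!ddf.complete(value))                       // complete() is false if already resolved
            throw new IllegalStateException("dynamic single assignment violated");
    }

    // Register a consumer on its consume list: it runs only when every DDF is ready.
    static void asyncAwait(Runnable body, CompletableFuture<?>... consumeList) {
        CompletableFuture.allOf(consumeList).thenRun(body);
    }

    public static void main(String[] args) {
        CompletableFuture<Integer> left = ddf(), right = ddf();
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        asyncAwait(() -> log.add("left=" + left.join()), left);
        asyncAwait(() -> log.add("both=" + (left.join() + right.join())), left, right);
        put(left, 1);     // first consumer becomes ready and runs
        put(right, 2);    // second consumer becomes ready and runs
        System.out.println(log);
    }
}
```

Inside a consumer, `join()` never blocks: registration guarantees the value was resolved before the task became ready, which is exactly the DDF reading rule.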
25 Contributions of DDFs
- Support for non-series-parallel task dependency graphs (e.g. TaskA..TaskF with cross dependences)
- Memory-footprint reduction: only the ready part of the execution frontier is exposed
- A DDF's lifetime is not global: the creator feeds consumers and gives access to the producer, so its lifetime is restricted to the lifetimes of its creator, resolver, and consumers, and it can be garbage collected on a managed runtime
26 Data Driven Scheduling
Steps register themselves on the items they consume, which are wrapped into DDFs.
[Diagram: the environment creates DDFα, DDFβ, DDFδ, then Task A resolving DDFα, Task M reading DDFα and DDFβ, Task D resolving DDFδ, Task B resolving DDFβ, and Task N reading DDFβ and DDFδ; the resolver tasks A, D, B enter the ready queue immediately, while tasks M and N become ready only after the DDFs they read are resolved]
27 Outline
Background · CnC Scheduling · Data Driven Futures · Results · Wrap-up
28 Performance Evaluation Legend
- Coarse Grain Blocking: eager blocking scheduling on item collections, for CnC-HJ
- Fine Grain Blocking: eager blocking scheduling on item instances, for CnC-HJ
- Delayed Async: data-driven scheduling via HJ delayed asyncs, for CnC-HJ
- Data Driven Rollback & Replay: eager scheduling with replay and notifications, for CnC-HJ
- Data Driven Futures: hand-coded equivalent of the CnC application on HJ with DDFs
29 Cholesky Decomposition Introduction
- Dense linear algebra kernel with three inherent kernels that need to be pipelined for best performance
- Loop parallelism and data parallelism within some of the kernels
- CnC has been shown to beat optimized libraries such as Intel MKL
30 Cholesky Decomposition on Xeon
[Bar chart of execution times in milliseconds, 1 worker vs. 16 workers, for Coarse Grain Blocking, Fine Grain Blocking, Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures]
Minimum execution times of 30 runs of single-threaded and 16-threaded executions for the blocked Cholesky decomposition CnC application with Habanero-Java steps on Xeon with input matrix size and tile size
31 Cholesky Decomposition on Xeon
[Bar chart of execution times in milliseconds, 1 worker vs. 16 workers, for the five schedulers]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 16-threaded executions for the blocked Cholesky decomposition CnC application with Habanero-Java steps on Xeon with input matrix size and tile size
32 Cholesky Decomposition with MKL on Xeon
[Bar chart, 1 worker vs. 16 workers; 1 worker: 1,782 / 1,766 / 1,834 / 1,821 / 1,723 ms for the five schedulers]
Minimum execution times of 30 runs of single-threaded and 16-threaded executions for the blocked Cholesky decomposition CnC application with Habanero-Java and Intel MKL steps on Xeon with input matrix size and tile size
33 Cholesky Decomposition with MKL on Xeon
[Bar chart of execution times in milliseconds, 1 worker vs. 16 workers, for the five schedulers]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 16-threaded executions for the blocked Cholesky decomposition CnC application with Habanero-Java and Intel MKL steps on Xeon with input matrix size and tile size
34 Cholesky Decomposition on UltraSPARC T2
[Bar chart; 1 worker: 103,883 / 103,817 / 109,335 / 108,297 / 96,587 ms; 64 workers: 6,489 / 5,388 / 5,651 / 5,259 / 4,950 ms for Coarse Grain Blocking, Fine Grain Blocking, Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures]
Minimum execution times of 30 runs of single-threaded and 64-threaded executions for the blocked Cholesky decomposition CnC application with Habanero-Java steps on UltraSPARC T2 with input matrix size and tile size
35 Cholesky Decomposition on UltraSPARC T2
[Bar chart; 64 workers: 7,035 / 6,993 / 6,958 / 6,339 / 5,681 ms for the five schedulers]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 64-threaded executions for the blocked Cholesky decomposition CnC application with Habanero-Java steps on UltraSPARC T2 with input matrix size and tile size
36 Black-Scholes Formula
- Only one step: the Black-Scholes formula
- Embarrassingly parallel, so a good indicator of scheduling overhead
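For reference, the single step's kernel is the closed-form European call price. This sketch is not the thesis code: since `java.lang.Math` has no erf, the standard normal CDF is approximated with the well-known Zelen & Severo polynomial (absolute error around 1e-7).

```java
// Sketch of the Black-Scholes step's kernel: closed-form European call pricing.
public class BlackScholes {
    // Standard normal CDF via the Zelen & Severo polynomial approximation.
    static double phi(double x) {
        double t = 1.0 / (1.0 + 0.2316419 * Math.abs(x));
        double poly = t * (0.319381530 + t * (-0.356563782
                    + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
        double tail = Math.exp(-x * x / 2.0) / Math.sqrt(2.0 * Math.PI) * poly;
        return x >= 0 ? 1.0 - tail : tail;
    }

    // Call price for spot S, strike K, risk-free rate r, volatility sigma, maturity T (years).
    static double call(double S, double K, double r, double sigma, double T) {
        double d1 = (Math.log(S / K) + (r + sigma * sigma / 2.0) * T) / (sigma * Math.sqrt(T));
        double d2 = d1 - sigma * Math.sqrt(T);
        return S * phi(d1) - K * Math.exp(-r * T) * phi(d2);
    }

    public static void main(String[] args) {
        // A standard textbook instance: S=100, K=100, r=5%, sigma=20%, T=1 year.
        System.out.printf("%.4f%n", call(100, 100, 0.05, 0.2, 1.0));  // ~10.4506
    }
}
```

Each CnC step instance prices one tile of independent options with this formula, which is why the benchmark isolates scheduling overhead so well.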
37 Black-Scholes on Xeon
[Bar chart; 1 worker: 33,688 / 33,762 / 34,214 / 33,788 / 34,634 ms; 16 workers: 4,148 / 4,155 / 4,216 / 4,145 / 2,164 ms for Coarse Grain Blocking, Fine Grain Blocking, Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures]
Minimum execution times of 30 runs of single-threaded and 16-threaded executions for the blocked Black-Scholes CnC application with Habanero-Java steps on Xeon with input size 1,000,000 and tile size 62,500
38 Black-Scholes on Xeon
[Bar chart, 1 worker vs. 16 workers; 1 worker: 33,871 / 33,966 / 34,311 / 34,121 / 34,729 ms for the five schedulers]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 16-threaded executions for the blocked Black-Scholes CnC application with Habanero-Java steps on Xeon with input size 1,000,000 and tile size 62,500
39 Black-Scholes on UltraSPARC T2
[Bar chart, 1 worker vs. 64 workers; 64 workers: 6,032 / 5,944 / 6,006 / 5,973 / 4,019 ms for the five schedulers]
Minimum execution times of 30 runs of single-threaded and 64-threaded executions for the blocked Black-Scholes CnC application with Habanero-Java steps on UltraSPARC T2 with input size 1,000,000 and tile size 15,625
40 Black-Scholes on UltraSPARC T2
[Bar chart, 1 worker vs. 64 workers; 64 workers: 6,179 / 6,159 / 6,149 / 6,145 / 4,296 ms for the five schedulers]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 64-threaded executions for the blocked Black-Scholes CnC application with Habanero-Java steps on UltraSPARC T2 with input size 1,000,000 and tile size 15,625
41 Rician Denoising
- Image-processing algorithm with more than four kernels, mostly stencil computations
- Non-trivial dependency graph; fixed-point algorithm; enormous data size
- The CnC schedulers needed explicit memory management; DDFs took advantage of garbage collection
42 Rician Denoising on Xeon
[Bar chart; 1 worker: 500,914 / 499,840 / 512,470 / 481,328 ms; 16 workers: 78,630 / 56,892 / 51,632 / 52,208 ms for Coarse Grain Blocking*, Fine Grain Blocking*, Delayed Async*, and Data Driven Futures]
Minimum execution times of 30 runs of single-threaded and 16-threaded executions for the blocked Rician Denoising CnC application with Habanero-Java steps on Xeon with input image size and tile size
43 Rician Denoising on Xeon
[Bar chart; 16 workers: 81,502 / 58,313 / 53,569 / 53,817 ms for Coarse Grain Blocking*, Fine Grain Blocking*, Delayed Async*, and Data Driven Futures]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 16-threaded executions for the blocked Rician Denoising CnC application with Habanero-Java steps on Xeon with input image size and tile size
44 Rician Denoising on UltraSPARC T2
[Bar chart of execution times in milliseconds, 1 worker vs. 64 workers, for Coarse Grain Blocking*, Fine Grain Blocking*, Delayed Async*, and Data Driven Futures]
Minimum execution times of 30 runs of single-threaded and 64-threaded executions for the blocked Rician Denoising CnC application with Habanero-Java steps on UltraSPARC T2 with input image size and tile size
45 Rician Denoising on UltraSPARC T2
[Bar chart of execution times in milliseconds, 1 worker vs. 64 workers, for Coarse Grain Blocking*, Fine Grain Blocking*, Delayed Async*, and Data Driven Futures]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 64-threaded executions for the blocked Rician Denoising CnC application with Habanero-Java steps on UltraSPARC T2 with input image size and tile size
46 Heart Wall Tracking
- Medical imaging application with nested kernels: the first level is embarrassingly parallel, the second has an intricate dependency graph
- Memory management: many failures on the eager schedulers; the blocking schedulers ran out of memory
47 Heart Wall Tracking on Xeon
[Bar chart of execution times in milliseconds, 1 worker vs. 16 workers, for Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures]
Minimum execution times of 13 runs of single-threaded and 16-threaded executions for the Heart Wall Tracking CnC application with C steps on Xeon with 104 frames
48 Heart Wall Tracking on Xeon
[Bar chart of execution times in milliseconds, 1 worker vs. 16 workers, for Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures]
Average execution times and 90% confidence interval of 13 runs of single-threaded and 16-threaded executions for the Heart Wall Tracking CnC application with C steps on Xeon with 104 frames
49 Outline
Background · CnC Scheduling · Data Driven Futures · Results · Wrap-up
50 Related Work
- Alternative parallel programming models: either too verbose or with constrained parallelism
- Alternative futures and promises: creation and resolution are coupled; either lazy or blocking execution semantics
- Support for unstructured parallelism: the Nabbit library for Cilk++ allows arbitrary task graphs, notifying immediate successors via atomic counter updates, but does not differentiate between data and control dependences
51 Conclusions
- Macro-dataflow is a viable parallelism model: it provides expressiveness while hiding parallelism concerns
- Macro-dataflow can perform competitively by taking advantage of modern task-parallel models
52 Future Work
- Compiling CnC to the data-driven runtime (currently hand-ported); needs finer-grain dependence analysis via tag functions
- Data Driven Future support for work stealing
- Compiler support for automatic DDF registration
- Hierarchical DDFs
- Locality-aware scheduling support for DDFs
53 Acknowledgments
- Committee: Zoran Budimlić, Keith D. Cooper, Vivek Sarkar, Lin Zhong
- Journal of Supercomputing co-authors: Zoran Budimlić, Michael Burke, Vincent Cavé, Kathleen Knobe, Geoff P. Lowney, Ryan R. Newton, Jens Palsberg, David Peixotto, Vivek Sarkar, Frank Schlimbach
- Habanero multicore software research project team members: Zoran Budimlić, Vincent Cavé, Philippe Charles, Vivek Sarkar, Alina Simion Sbîrlea, Dragoș Sbîrlea, Jisheng Zhao
- Intel Technology Pathfinding and Innovation, Software and Services Group: Mark Hampton, Kathleen Knobe, Geoff P. Lowney, Ryan R. Newton, Frank Schlimbach
- Benchmarks: Aparna Chandramowlishwaran (Georgia Tech) and Zoran Budimlić (Rice) for Cholesky decomposition; Yu-Ting Chen (UCLA) for Rician denoising; David Peixotto (Rice) for the Black-Scholes formula; Alina Simion Sbîrlea (Rice) for Heart Wall Tracking
54 Feedback and Clarifications
Thanks for your attention
More informationFinish Accumulators: a Deterministic Reduction Construct for Dynamic Task Parallelism
Finish Accumulators: a Deterministic Reduction Construct for Dynamic Task Parallelism Jun Shirako, Vincent Cavé, Jisheng Zhao, and Vivek Sarkar Department of Computer Science, Rice University {shirako,vincent.cave,jisheng.zhao,vsarkar@rice.edu
More informationPrinciples of Parallel Algorithm Design: Concurrency and Mapping
Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 28 August 2018 Last Thursday Introduction
More informationRecursive inverse factorization
Recursive inverse factorization Anton Artemov Division of Scientific Computing, Uppsala University anton.artemov@it.uu.se 06.09.2017 Context of work Research group on large scale electronic structure computations
More informationLecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter
Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)
More informationDynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle
Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation
More informationECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs
More informationParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser
ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050
More informationCOMP 322: Fundamentals of Parallel Programming Module 1: Parallelism
c 2017 by Vivek Sarkar February 14, 2017 DRAFT VERSION Contents 0 Course Organization 3 1 Task-level Parallelism 3 1.1 Task Creation and Termination (Async, Finish).......................... 3 1.2 Computation
More informationMonitors; Software Transactional Memory
Monitors; Software Transactional Memory Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico March 17, 2016 CPD (DEI / IST) Parallel and Distributed
More informationCooperative Scheduling of Parallel Tasks with General Synchronization Patterns
Cooperative Scheduling of Parallel Tasks with General Synchronization Patterns Shams Imam and Vivek Sarkar Department of Computer Science, Rice University {shams,vsarkar}@rice.edu Abstract. In this paper,
More informationCnC for high performance computing. Kath Knobe Intel SSG
CnC for high performance computing Kath Knobe Intel SSG Thanks Frank Schlimbach Intel Vivek Sarkar and his group Rice University DARPA UHPC (Runnemede) DOE X-Stack (Trilacka Glacier) 2 Outline Intro Checkpoint/restart
More informationParallel Algorithm Design. CS595, Fall 2010
Parallel Algorithm Design CS595, Fall 2010 1 Programming Models The programming model o determines the basic concepts of the parallel implementation and o abstracts from the hardware as well as from the
More informationUniversity of Delaware Department of Electrical and Computer Engineering Computer Architecture and Parallel Systems Laboratory
University of Delaware Department of Electrical and Computer Engineering Computer Architecture and Parallel Systems Laboratory Polytasks: A Compressed Task Representation for HPC Runtimes Daniel Orozco
More informationTransactional Memory
Transactional Memory Architectural Support for Practical Parallel Programming The TCC Research Group Computer Systems Lab Stanford University http://tcc.stanford.edu TCC Overview - January 2007 The Era
More informationAn Update on Haskell H/STM 1
An Update on Haskell H/STM 1 Ryan Yates and Michael L. Scott University of Rochester TRANSACT 10, 6-15-2015 1 This work was funded in part by the National Science Foundation under grants CCR-0963759, CCF-1116055,
More informationLecture 24: Java Threads,Java synchronized statement
COMP 322: Fundamentals of Parallel Programming Lecture 24: Java Threads,Java synchronized statement Zoran Budimlić and Mack Joyner {zoran, mjoyner@rice.edu http://comp322.rice.edu COMP 322 Lecture 24 9
More informationTHREADS AND CONCURRENCY
THREADS AND CONCURRENCY Lecture 22 CS2110 Spring 2013 Graphs summary 2 Dijkstra: given a vertex v, finds shortest path from v to x for each vertex x in the graph Key idea: maintain a 5-part invariant on
More informationDynamic inter-core scheduling in Barrelfish
Dynamic inter-core scheduling in Barrelfish. avoiding contention with malleable domains Georgios Varisteas, Mats Brorsson, Karl-Filip Faxén November 25, 2011 Outline Introduction Scheduling & Programming
More informationThis article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and
This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution
More informationMulticore Computing and Scientific Discovery
scientific infrastructure Multicore Computing and Scientific Discovery James Larus Dennis Gannon Microsoft Research In the past half century, parallel computers, parallel computation, and scientific research
More informationResilient Distributed Concurrent Collections. Cédric Bassem Promotor: Prof. Dr. Wolfgang De Meuter Advisor: Dr. Yves Vandriessche
Resilient Distributed Concurrent Collections Cédric Bassem Promotor: Prof. Dr. Wolfgang De Meuter Advisor: Dr. Yves Vandriessche 1 Evolution of Performance in High Performance Computing Exascale = 10 18
More informationScalable In-memory Checkpoint with Automatic Restart on Failures
Scalable In-memory Checkpoint with Automatic Restart on Failures Xiang Ni, Esteban Meneses, Laxmikant V. Kalé Parallel Programming Laboratory University of Illinois at Urbana-Champaign November, 2012 8th
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationSHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008
SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem
More informationA Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004
A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into
More informationPolyhedral Optimizations for a Data-Flow Graph Language
Polyhedral Optimizations for a Data-Flow Graph Language Alina Sbîrlea 1 Jun Shirako 1 Louis-Noël Pouchet 2 Vivek Sarkar 1 1 Rice University, USA 2 Ohio State University, USA Abstract. This paper proposes
More informationCloud Programming James Larus Microsoft Research. July 13, 2010
Cloud Programming James Larus Microsoft Research July 13, 2010 New Programming Model, New Problems (and some old, unsolved ones) Concurrency Parallelism Message passing Distribution High availability Performance
More informationThroughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling
Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling Tobias Schwarzer 1, Joachim Falk 1, Michael Glaß 1, Jürgen Teich 1, Christian Zebelein 2, Christian
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationUsing Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology
Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore
More informationHeckaton. SQL Server's Memory Optimized OLTP Engine
Heckaton SQL Server's Memory Optimized OLTP Engine Agenda Introduction to Hekaton Design Consideration High Level Architecture Storage and Indexing Query Processing Transaction Management Transaction Durability
More informationCILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY
CILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY 1 OUTLINE CILK and CILK++ Language Features and Usages Work stealing runtime CILK++ Reducers Conclusions 2 IDEALIZED SHARED MEMORY ARCHITECTURE Hardware
More informationA Pluggable Framework for Composable HPC Scheduling Libraries
A Pluggable Framework for Composable HPC Scheduling Libraries Max Grossman 1, Vivek Kumar 2, Nick Vrvilo 1, Zoran Budimlic 1, Vivek Sarkar 1 1 Habanero Extreme Scale So=ware Research Group, Rice University
More informationPOSIX Threads: a first step toward parallel programming. George Bosilca
POSIX Threads: a first step toward parallel programming George Bosilca bosilca@icl.utk.edu Process vs. Thread A process is a collection of virtual memory space, code, data, and system resources. A thread
More informationFILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas
FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS Waqas Akram, Cirrus Logic Inc., Austin, Texas Abstract: This project is concerned with finding ways to synthesize hardware-efficient digital filters given
More informationLecture 27: Safety and Liveness Properties, Java Synchronizers, Dining Philosophers Problem
COMP 322: Fundamentals of Parallel Programming Lecture 27: Safety and Liveness Properties, Java Synchronizers, Dining Philosophers Problem Mack Joyner and Zoran Budimlić {mjoyner, zoran}@rice.edu http://comp322.rice.edu
More informationCompsci 590.3: Introduction to Parallel Computing
Compsci 590.3: Introduction to Parallel Computing Alvin R. Lebeck Slides based on this from the University of Oregon Admin Logistics Homework #3 Use script Project Proposals Document: see web site» Due
More informationImproving the Practicality of Transactional Memory
Improving the Practicality of Transactional Memory Woongki Baek Electrical Engineering Stanford University Programming Multiprocessors Multiprocessor systems are now everywhere From embedded to datacenter
More informationJS Event Loop, Promises, Async Await etc. Slava Kim
JS Event Loop, Promises, Async Await etc Slava Kim Synchronous Happens consecutively, one after another Asynchronous Happens later at some point in time Parallelism vs Concurrency What are those????
More informationMultithreading: Exploiting Thread-Level Parallelism within a Processor
Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced
More informationReminder from last time
Concurrent systems Lecture 7: Crash recovery, lock-free programming, and transactional memory DrRobert N. M. Watson 1 Reminder from last time History graphs; good (and bad) schedules Isolation vs. strict
More informationDeterministic Parallel Programming
Deterministic Parallel Programming Concepts and Practices 04/2011 1 How hard is parallel programming What s the result of this program? What is data race? Should data races be allowed? Initially x = 0
More informationScalable Shared Memory Programing
Scalable Shared Memory Programing Marc Snir www.parallel.illinois.edu What is (my definition of) Shared Memory Global name space (global references) Implicit data movement Caching: User gets good memory
More informationDavid R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.
Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.
More informationPlacement de processus (MPI) sur architecture multi-cœur NUMA
Placement de processus (MPI) sur architecture multi-cœur NUMA Emmanuel Jeannot, Guillaume Mercier LaBRI/INRIA Bordeaux Sud-Ouest/ENSEIRB Runtime Team Lyon, journées groupe de calcul, november 2010 Emmanuel.Jeannot@inria.fr
More informationChapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues 4.2 Silberschatz, Galvin
More informationSustainable Memory Use Allocation & (Implicit) Deallocation (mostly in Java)
COMP 412 FALL 2017 Sustainable Memory Use Allocation & (Implicit) Deallocation (mostly in Java) Copyright 2017, Keith D. Cooper & Zoran Budimlić, all rights reserved. Students enrolled in Comp 412 at Rice
More information