SCHEDULING MACRO-DATAFLOW PROGRAMS TASK-PARALLEL RUNTIME SYSTEMS SAĞNAK TAŞIRLAR

Size: px
Start display at page:

Download "SCHEDULING MACRO-DATAFLOW PROGRAMS TASK-PARALLEL RUNTIME SYSTEMS SAĞNAK TAŞIRLAR"

Transcription

1 1 SCHEDULING MACRO-DATAFLOW PROGRAMS ON TASK-PARALLEL RUNTIME SYSTEMS SAĞNAK TAŞIRLAR

2 Thesis 2 Our thesis is that advances in task parallel runtime systems can enable a macro-dataflow programming model, like Concurrent Collections (CnC), to deliver productivity and performance on modern multicore processors.

3 Approach 3 Macro-dataflow for expressiveness Determinism Race/deadlock freedom Higher level abstraction Task parallel runtimes for performance Portable scalability Contemporary consensus

4 Motivation 4 Parallelism not accessible to those who need it most Imposed serial thinking Parallelism for the masses, not just computer scientists Parallel programming models of today: Hide machine details but expose parallelism details Constrain expressiveness

5 Contributions 5 Scheduling CnC on Habanero Java Evaluation of scheduling performance for CnC Introduction of Data Driven Futures (DDF) construct Implementation of DDF construct Implementation and evaluation of data driven runtime with DDFs Zoran Budimlić, Michael Burke, Vincent Cavé, Kathleen Knobe, Geoff Lowney, Ryan Newton, Jens Palsberg, David Peixotto, Vivek Sarkar, Frank Schlimbach, Sağnak Taşırlar, The CnC Programming Model, submitted for publication to the Journal of Supercomputing, 2010

6 Outline 6 Background CnC Scheduling Data Driven Futures Results Wrap up

7 Dynamic Task Parallelism 7 Properties Over exposure of parallelism Scales up/down with # of cores Scheduling maps sets of tasks to threads at runtime Habanero Java (HJ) employs: Finish/async parallelism Feeds child tasks through lexical scope Work sharing/stealing runtime scheduling

8 CnC concepts 8 Step Computation abstraction Side effect free Functional w.r.t. input Special step: Environment Item Dynamic single assignment Value not storage Tag Data tag to index items Control tag to index steps Collection Step Item Tag Graphical Notation (SomeStep) [SomeItem] <Tag> Textual Notation (SomeStep) [SomeItem] <Tag>

9 Concurrent Collections model 9 Can be classified as: Declarative Deterministic Dynamic single assignment Macro-dataflow Coordination language Goal: consider only semantic ordering constraints Inherent in the application not the implementation Will be described by the CnC graph

10 Example Program Specification 10 Break up an input string Sequences of repeated single characters Filter allowing only Sequences of odd length Input string aaaffqqqmmmmmmm Sequences of repeated characters aaa ff qqq mmmmmmm Filtered sequences aaa qqq mmmmmmm

11 11 CnC Implementation of Example Program <stringtag: 1> <stringtag: 2> <spantag: 1, 1> <spantag: 1, 2> <spantag: 1, 3> <spantag: 1, 4> <stringtag: j> <spantag: j, s> [input: j] (createspan: j) [span: j, s] (processspan: j, s) [results: j, s] [input: 1] = aaaffqqqmmmmmmm [input: 2] = rrhhhhxxx Collection Graphical Textual Step ( ) ( ) Item [ ] [ ] Tag <> < > [span: 1, 1] = aaa [span: 1, 2] = ff [span: 1, 3] = qqq [span: 1, 4] = mmmmmmm [results: 1, 1] = aaa [results: 1, 3] = qqq [results: 1, 4] = mmmmmmm

12 CnC-Habanero Java build model 12 Concurrent Collections Textual Graph Habanero Java source files Code to invoke the graph Code to put initial values in graph Code to implement abstract steps CnC Translator CnC compiler Habanero Java source files.class files Concurrent Collections Library Habanero Java Runtime Library Abstract classes for all steps Definitions for all collections Graph definition and initialization JAR Builder Java application User specified CnC Components

13 Outline 13 Background CnC Scheduling Data Driven Futures Results Wrap up

14 CnC Scheduling Challenges 14 Control & data dependences are first level constructs Task parallel frameworks have them coupled Step instances have multiple predecessors Need to wait for all predecessors Layered readiness concepts Control dependence satisfied Data dependence satisfied Schedulable / Ready

15 Eager scheduling 15 Assume control dependence satisfaction is readiness Conforms to task parallel runtime assumption Wait till data dependences satisfaction for safety Block on data prematurely tried to be read Discard task reading prematurely, replay when data arrive

16 Blocking Eager CnC Schedulers 16 Use Java wait/notify for premature data access Blocking granularity Instance level vs Collection level Blocked task blocks whole thread Deadlock possibility Need to create more threads as threads block time Thread Ε ItemCollection Θ Get (data-tag γ ) data-tag α wait data-tag β value α value β Thread Δ step 1 notify data-tag γ placeholder value γ γ Put(data-tag γ,value γ ) step 2

17 Data Driven Rollback & Replay 17 Alternative eager scheduling Blocking scheduler suffers from Expensive recovery from premature read Blocks whole thread Creates new thread Switch context to the new thread on every failure Inform item instance on failed task and discard task Throw an exception to unwind failed task Catch by runtime and continue with another ready task Recreate task when needed item arrives

18 Data Driven Rollback & Replay 18 Throw exception to unwind Thread Ε ItemCollection Θ Insert step 1 to waitlist γ Thread Δ step 1 Get (data-tag γ ) data-tag α value α waitlist α step 3 data-tag β value β waitlist β data-tag γ value empty γ waitlist γ Put(data-tag γ,value γ ) step 2 step 1 Get (data-tag γ ) Get (data-tag δ ) Recreate steps on waitlist γ

19 Data Driven Scheduling 19 Do not create tasks until data dependences satisfied No failure, no recovery Restrict computation frontier to ready tasks Evaluation of data readiness condition Busy waiting on data (delayed async scheduling) Dataflow like readiness (data driven scheduling) Register tasks on data Data notifies consumer tasks when created

20 Delayed Asyncs 20 Guarded execution construct for HJ Promote to async when guard evaluates to true async A async B Pop a Task async Z Delayed? No Yes Evaluate guard Assign to thread Yes Is true? No Work Sharing Ready Task Queue Requeue

21 Delayed Async Scheduling 21 Every CnC step is a guarded execution Guard condition is the availability of items to consume Task still created eagerly when provided control Promotes to ready when data provided

22 Outline 22 Background CnC Scheduling Data Driven Futures Results Wrap up

23 Data Driven Futures (DDFs) 23 Task parallel synchronization construct Acts as a reference to single assignment value Creation Create a dangling reference object Resolution ( Put ) Resolve what value a DDF is referring to Registration ( Await ) A task provides a consume list of DDFs on declaration A task can only read DDFs that it is registered to

24 Data Driven Futures (DDFs) 24 DataDrivenFuture leftchild = new DataDrivenFuture(); DataDrivenFuture rightchild = new DataDrivenFuture(); finish { async leftchild.put(leftchildcreator()); async rightchild.put(rightchildcreator()); async await ( leftchild ) useleftchild(leftchild); async await ( rightchild ) userightchild(rightchild); async await ( leftchild, rightchild ) usebothchildren( leftchild, rightchild ); }

25 Contributions of DDFs 25 Non-series-parallel task dependency graphs support TaskB TaskD TaskA TaskF TaskC TaskE Memory footprint reduction Exposes only ready parts of the execution frontier Not global lifetime Creator: feeds consumers gives access to producer Lifetime restricted to Creator lifetime Resolver lifetime Consumers lifetimes Can be garbage collected on a managed runtime

26 Data Driven Scheduling 26 Steps register self to items wrapped into DDFs DDFβ TaskM DDFα PlaceHolderβ Valueβ DDFα DDFβ PlaceHolderα Valueα TaskN DDFβ DDFδ DDFδ PlaceHolderδ Valueδ TaskC create DDFα, DDFβ, DDFδ create Task A resolving DDFα create Task M reading DDFα, DDFβ create Task D resolving DDFδ create Task B resolving DDFβ create Task N reading DDFβ, DDFδ ready queue Task A Task M D Task Task N B TaskA TaskD TaskB resolve resolve DDFα resolve DDFδ DDFβ

27 Outline 27 Background CnC Scheduling Data Driven Futures Results Wrap up

28 Performance Evaluation Legend 28 Coarse Grain Blocking Eager blocking scheduling on item collections for CnC-HJ Fine Grain Blocking Eager blocking scheduling on item instances for CnC-HJ Delayed Async Data Driven scheduling via HJ delayed asyncs for CnC-HJ Data Driven Rollback & Replay Eager scheduling with replay and notifications for CnC-HJ Data Driven Futures Hand coded CnC application equivalent on HJ with DDFs

29 Cholesky Decomposition Introduction 29 Dense linear algebra kernel Three inherent kernels Need to be pipelined for best performance Loop parallelism within some kernels Data parallelism within some kernels CnC shown to beat optimized libraries, like IntelMKL

30 Cholesky Decomposition on Xeon 30 9,000 8,000 Execution in milli-secs 7,000 6,000 5,000 4,000 3,000 2,000 1,000 1-Worker 16-Workers 7,861 7,730 1, ,818 8,789 7, Coarse Grain Blocking Fine Grain Blocking Delayed Async Data Driven Rollback&Replay Data Driven Futures Minimum execution times of 30 runs of single threaded and 16-threaded executions for blocked Cholesky decomposition CnC application with Habanero-Java steps on Xeon with input matrix size and with tile size

31 Cholesky Decomposition on Xeon 31 12,000 10,000 Execution in milli-secs 8,000 6,000 4,000 2,000 10,081 10,010 2,472 1-Worker 16-Workers 10,305 10,309 8, , Coarse Grain Blocking Fine Grain Blocking Delayed Async Data Driven Rollback&Replay Data Driven Futures Average execution times and 90% confidence interval of 30 runs of single threaded and 16- threaded executions for blocked Cholesky decomposition CnC application with Habanero- Java steps on Xeon with input matrix size and with tile size

32 Cholesky Decomposition with MKL on Xeon 32 2,000 1,800 1,782 1,766 1-Worker 16-Workers 1,834 1,821 1,723 Execution in milli-secs 1,600 1,400 1,200 1, Coarse Grain Blocking Fine Grain Blocking Delayed Async Data Driven Rollback&Replay Data Driven Futures Minimum execution times of 30 runs of single threaded and 16-threaded executions for blocked Cholesky decomposition CnC application with Habanero-Java and Intel MKL steps on Xeon with input matrix size and with tile size

33 Cholesky Decomposition with MKL on Xeon 33 2,200 2,000 1,800 Execution in milli-secs 1,600 1,400 1,200 1, Worker 1,999 1,983 1, Workers 1,949 1, Coarse Grain Blocking Fine Grain Blocking Delayed Async Data Driven Rollback&Replay Data Driven Futures Average execution times and 90% confidence interval of 30 runs of single threaded and 16- threaded executions for blocked Cholesky decomposition CnC application with Habanero Java and Intel MKL steps on Xeon with input matrix size and with tile size

34 Cholesky Decomposition on UltraSPARC T , , , ,883 1-Worker 64-Workers 103, , ,297 96,587 Execution in milli-secs 90,000 80,000 70,000 60,000 50,000 40,000 30,000 20,000 10, ,489 5,388 5,651 5,259 4,950 0 Coarse Grain Blocking Fine Grain Blocking Delayed Async Data Driven Rollback&Replay Data Driven Futures Minimum execution times of 30 runs of single threaded and 64-threaded executions for blocked Cholesky decomposition CnC application with Habanero-Java steps on UltraSPARC T2 with input matrix size and with tile size

35 Cholesky Decomposition on UltraSPARC T , , , , ,863 1-Worker 104, Workers 106,695 99,032 Execution in milli-secs 90,000 80,000 70,000 60,000 50,000 40,000 30,000 20,000 10, ,035 6,993 6,958 6,339 5,681 0 Coarse Grain Blocking Fine Grain Blocking Delayed Async Data Driven Rollback&Replay Data Driven Futures Average execution times and 90% confidence interval of 30 runs of single threaded and 64- threaded executions for blocked Cholesky decomposition CnC application with Habanero- Java steps on UltraSPARC T2 with input matrix size and with tile size

36 Black-Scholes formula 36 Only one step The Black-Scholes formula Embarrassingly parallel Good indicator of scheduling overhead

37 Black-Scholes on Xeon 37 40,000 35,000 1-Worker 16-Workers 33,688 33,762 34,214 33,788 34,634 Execution in milli-secs 30,000 25,000 20,000 15,000 10, ,000 4,148 4,155 4,216 4,145 2,164 0 Coarse Grain Blocking Fine Grain Blocking Delayed Async Data Driven Rollback&Replay Data Driven Futures Minimum execution times of 30 runs of single threaded and 16-threaded executions for blocked Black-Scholes CnC application with Habanero-Java steps on Xeon with input size 1,000,000 and with tile size 62,500

38 Black-Scholes on Xeon 38 40,000 35,000 1-Worker 16-Workers 33,871 33,966 34,311 34,121 34,729 Execution in milli-secs 30,000 25,000 20,000 15,000 10,000 5, ,061 4,300 4,309 4,279 2,353 0 Coarse Grain Blocking Fine Grain Blocking Delayed Async Data Driven Rollback&Replay Data Driven Futures Average execution times and 90% confidence interval of 30 runs of single threaded and 16-threaded executions for blocked Black-Scholes CnC application with Habanero-Java steps on Xeon with input size 1,000,000 and with tile size 62,500

39 Black-Scholes on UltraSPARC T , ,000 Execution in milli-secs 160, , , ,000 80,000 60,000 40,000 1-Worker 64-Workers 179, , , , , ,000 0 Coarse Grain Blocking 6,032 5,944 6,006 5,973 4,019 Fine Grain Blocking Delayed Async Data Driven Rollback&Replay Data Driven Futures Minimum execution times of 30 runs of single threaded and 64-threaded executions for blocked Black-Scholes CnC application with Habanero-Java steps on UltraSPARC T2 with input size 1,000,000 and with tile size 15,625

40 Black-Scholes on UltraSPARC T ,000 1-Worker 64-Workers 207, , ,636 Execution in milli-secs 200, , ,000 50, , , Coarse Grain Blocking 6,179 6,159 6,149 6,145 4,296 Fine Grain Blocking Delayed Async Data Driven Rollback&Replay Data Driven Futures Average execution times and 90% confidence interval of 30 runs of single threaded and 64-threaded executions for blocked Black-Scholes CnC application with Habanero-Java steps on UltraSPARC T2 with input size 1,000,000 and with tile size 15,625

41 Rician Denoising 41 Image processing algorithm More than 4 kernels Mostly stencil computations Non trivial dependency graph Fixed point algorithm Enormous data size CnC schedulers needed explicit memory management DDFs took advantage of garbage collection

42 Rician Denoising on Xeon ,000 1-Worker 16-Workers 500, , , ,328 Execution in milli-secs 400, , , , ,000 78,630 56,892 51,632 52,208 0 Coarse Grain Blocking * Fine Grain Blocking * Delayed Async * Data Driven Futures Minimum execution times of 30 runs of single threaded and 16-threaded executions for blocked Rician Denoising CnC application with Habanero-Java steps on Xeon with input image size and with tile size

43 Rician Denoising on Xeon ,000 1-Worker 16-Workers Execution in milli-secs 500, , , , , , , , ,000 81,502 58,313 53,569 53,817 0 Coarse Grain Blocking * Fine Grain Blocking * Delayed Async * Data Driven Futures Average execution times and 90% confidence interval of 30 runs of single threaded and 16-threaded executions for blocked Rician Denoising CnC application with Habanero-Java steps on Xeon with input image size and with tile size

44 Rician Denoising on UltraSPARC T2 44 2,500,000 1-Worker 64-Workers 2,000,000 1,979,340 1,932,078 1,932,488 Execution in milli-secs 1,500,000 1,000, ,000 1,273, , ,134 80,458 56,621 0 Coarse Grain Blocking * Fine Grain Blocking * Delayed Async * Data Driven Futures Minimum execution times of 30 runs of single threaded and 64-threaded executions for blocked Rician Denoising CnC application with Habanero-Java steps on UltraSPARC T2 with input image size and with tile size

45 Rician Denoising on UltraSPARC T2 45 2,500,000 1-Worker 64-Workers Execution in milli-secs 2,000,000 1,500,000 1,000, , ,988,894 1,939,976 1,943,861 1,282, , ,600 81,707 58,017 Coarse Grain Blocking * Fine Grain Blocking * Delayed Async * Data Driven Futures Average execution times and 90% confidence interval of 30 runs of single threaded and 64-threaded executions for blocked Rician Denoising CnC application with Habanero-Java steps on UltraSPARC T2 with input image size and with tile size

46 Heart Wall Tracking Medical imaging application Nested kernels First level embarrassingly parallel Second level with intricate dependency graph Memory management Many failures on eager schedulers Blocking schedulers ran out of memory

47 Heart Wall Tracking on Xeon , , ,248 1-Worker 16-Workers 157, ,159 Execution in milli-secs 140, , ,000 80,000 60,000 40, ,989 20,000 11,076 9,897 0 Delayed Async Data Driven Rollback&Replay Data Driven Futures Minimum execution times of 13 runs of single threaded and 16-threaded executions for Heart Wall Tracking CnC application with C steps on Xeon with 104 frames

48 Heart Wall Tracking on Xeon , , ,806 1-Worker 16-Workers 158, ,635 Execution in milli-secs 140, , ,000 80,000 60,000 40, ,287 20,000 11,351 10,097 0 Delayed Async Data Driven Rollback&Replay Data Driven Futures Average execution times and 90% confidence interval of 13 runs of single threaded and 16-threaded executions for Heart Wall Tracking CnC application with C steps on Xeon with 104 frames

49 Outline 49 Background CnC Scheduling Data Driven Futures Results Wrap up

50 Related work 50 Alternative parallel programming models: Either too verbose or constrained parallelism Alternative futures, promises Creation and resolution are coupled Either lazy or blocking execution semantics Support for unstructured parallelism Nabbit library for Cilk++ allows arbitrary task graphs Immediate successor atomic counter update for notification Does not differentiate between data, control dependences

51 Conclusions 51 Macro-dataflow is a viable parallelism model Provides expressiveness hiding parallelism concerns Macro-dataflow can perform competitively Taking advantage of modern task parallel models

52 Future Work 52 Compiling CnC to the Data Driven Runtime Currently hand-ported Need finer grain dependency analysis via tag functions Data Driven Future support for Work Stealing Compiler support for automatic DDF registration Hierarchical DDFs Locality aware scheduling support for DDFs

53 Acknowledgments 53 Committee Zoran Budimlić, Keith D. Cooper, Vivek Sarkar, Lin Zhong Journal of Supercomputing co-authors Zoran Budimlić, Michael Burke, Vincent Cavé, Kathleen Knobe, Geoff P. Lowney, Ryan R. Newton, Jens Palsberg, David Peixotto, Vivek Sarkar, Frank Schlimbach Habanero multicore software research project team-members Zoran Budimlić, Vincent Cavé, Philippe Charles, Vivek Sarkar, Alina Simion Sbîrlea, Dragoș Sbîrlea, Jisheng Zhao Intel Technology Pathfinding and Innovation Software and Services Group Mark Hampton, Kathleen Knobe, Geoff P. Lowney, Ryan R. Newton, Frank Schlimbach Benchmarks Aparna Chandramowlishwaran (Georgia Tech.), Zoran Budimlić(Rice) for Cholesky Decomposition Yu-Ting Chen (UCLA) for Rician Denoising David Peixotto (Rice) for Black-Scholes Formula Alina Simion Sbîrlea (Rice) for Heart Wall Tracking

54 Feedback and clarifications 54 Thanks for your attention

DATA-DRIVEN TASKS THEIR IMPLEMENTATION AND SAĞNAK TAŞIRLAR, VIVEK SARKAR DEPARTMENT OF COMPUTER SCIENCE. RICE UNIVERSITY

DATA-DRIVEN TASKS THEIR IMPLEMENTATION AND SAĞNAK TAŞIRLAR, VIVEK SARKAR DEPARTMENT OF COMPUTER SCIENCE. RICE UNIVERSITY 1 DATA-DRIVEN TASKS AND THEIR IMPLEMENTATION SAĞNAK TAŞIRLAR, VIVEK SARKAR DEPARTMENT OF COMPUTER SCIENCE. RICE UNIVERSITY Fork/Join graphs constraint -ism 2 Fork/Join models restrict task graphs to be

More information

The Concurrent Collections (CnC) Parallel Programming Model. Kathleen Knobe Intel.

The Concurrent Collections (CnC) Parallel Programming Model. Kathleen Knobe Intel. The Concurrent Collections (CnC) Parallel Programming Model Kathleen Knobe Intel kath.knobe@intel.com *Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are

More information

RICE UNIVERSITY. Scheduling Macro-DataFlow Programs on Task-Parallel Runtime Systems. by Sagnak Ta Irlar

RICE UNIVERSITY. Scheduling Macro-DataFlow Programs on Task-Parallel Runtime Systems. by Sagnak Ta Irlar RICE UNIVERSITY Scheduling Macro-DataFlow Programs on Task-Parallel Runtime Systems by Sagnak Ta Irlar A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE Master of Science APPROVED,

More information

The Concurrent Collections (CnC) Parallel Programming Model Tutorial. Kathleen Knobe Intel Vivek Sarkar Rice University

The Concurrent Collections (CnC) Parallel Programming Model Tutorial. Kathleen Knobe Intel Vivek Sarkar Rice University The Concurrent Collections (CnC) Parallel Programming Model Tutorial Kathleen Knobe Intel Vivek Sarkar Rice University kath.knobe@intel.com, vsarkar@rice.edu Tutorial CnC 09 July 23, 2009 *Intel and the

More information

CnC-HC. a programming model for CPU-GPU hybrid parallelism. Alina Sbîrlea, Zoran Budimlic, Vivek Sarkar Rice University

CnC-HC. a programming model for CPU-GPU hybrid parallelism. Alina Sbîrlea, Zoran Budimlic, Vivek Sarkar Rice University CnC-HC a programming model for CPU-GPU hybrid parallelism Alina Sbîrlea, Zoran Budimlic, Vivek Sarkar Rice University Acknowledgements CnC-CUDA: Declarative Programming for GPUs, Max Grossman, Alina Simion-Sbirlea,

More information

Multi-core Implementations of the Concurrent Collections Programming Model

Multi-core Implementations of the Concurrent Collections Programming Model Multi-core Implementations of the Concurrent Collections Programming Model Zoran Budimlić, Aparna Chandramowlishwaran, Kathleen Knobe, Geoff Lowney, Vivek Sarkar, and Leo Treggiari Department of Computer

More information

Concurrent Collections

Concurrent Collections Concurrent Collections Zoran Budimlić 1 Michael Burke 1 Vincent Cavé 1 Kathleen Knobe 2 Geoff Lowney 2 Ryan Newton 2 Jens Palsberg 3 David Peixotto 1 Vivek Sarkar 1 Frank Schlimbach 2 Sağnak Taşırlar 1

More information

Lecture 15: Data-Driven Tasks, Point-to-Point Synchronization with Phasers

Lecture 15: Data-Driven Tasks, Point-to-Point Synchronization with Phasers COMP 322: Fundamentals of Parallel Programming Lecture 15: Data-Driven Tasks, Point-to-Point Synchronization with Phasers Zoran Budimlić and Mack Joyner {zoran, mjoyner}@rice.edu http://comp322.rice.edu

More information

The Concurrent Collections Programming Model

The Concurrent Collections Programming Model The Concurrent Collections Programming Model Michael G. Burke Rice University Houston, Texas Kathleen Knobe Intel Corporation Hudson, Massachusetts Ryan Newton Intel Corporation Hudson, Massachusetts Vivek

More information

ABSTRACT. Optimized Event-Driven Runtime Systems for Programmability and Performance

ABSTRACT. Optimized Event-Driven Runtime Systems for Programmability and Performance ABSTRACT Optimized Event-Driven Runtime Systems for Programmability and Performance by Sağnak Taşırlar Modern parallel programming models perform their best under the particular patterns they are tuned

More information

Shared-memory Parallel Programming with Cilk Plus

Shared-memory Parallel Programming with Cilk Plus Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 30 August 2018 Outline for Today Threaded programming

More information

A Case for Cooperative Scheduling in X10's Managed Runtime

A Case for Cooperative Scheduling in X10's Managed Runtime A Case for Cooperative Scheduling in X10's Managed Runtime X10 Workshop 2014 June 12, 2014 Shams Imam, Vivek Sarkar Rice University Task-Parallel Model Worker Threads Please ignore the DP on the cartoons

More information

Dynamic Task Parallelism with a GPU Work-Stealing Runtime System

Dynamic Task Parallelism with a GPU Work-Stealing Runtime System Dynamic Task Parallelism with a GPU Work-Stealing Runtime System Sanjay Chatterjee, Max Grossman, Alina Sbîrlea, and Vivek Sarkar Department of Computer Science Rice University Background As parallel programming

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Overview of research activities Toward portability of performance

Overview of research activities Toward portability of performance Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into

More information

Habanero-Java Library: a Java 8 Framework for Multicore Programming

Habanero-Java Library: a Java 8 Framework for Multicore Programming Habanero-Java Library: a Java 8 Framework for Multicore Programming PPPJ 2014 September 25, 2014 Shams Imam, Vivek Sarkar shams@rice.edu, vsarkar@rice.edu Rice University https://wiki.rice.edu/confluence/display/parprog/hj+library

More information

COMP 322 / ELEC 323: Fundamentals of Parallel Programming

COMP 322 / ELEC 323: Fundamentals of Parallel Programming COMP 322 / ELEC 323: Fundamentals of Parallel Programming Lecture 18: Abstract vs. Real Performance an under the hood look at HJlib Instructors: Vivek Sarkar, Mack Joyner Department of Computer Science,

More information

Asynchronous Checkpoint/Restart for the Concurrent Collections Model

Asynchronous Checkpoint/Restart for the Concurrent Collections Model RICE UNIVERSITY Asynchronous Checkpoint/Restart for the Concurrent Collections Model by Nick Vrvilo A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree Master of Science Approved,

More information

Effective Performance Measurement and Analysis of Multithreaded Applications

Effective Performance Measurement and Analysis of Multithreaded Applications Effective Performance Measurement and Analysis of Multithreaded Applications Nathan Tallent John Mellor-Crummey Rice University CSCaDS hpctoolkit.org Wanted: Multicore Programming Models Simple well-defined

More information

Shared-memory Parallel Programming with Cilk Plus

Shared-memory Parallel Programming with Cilk Plus Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 19 January 2017 Outline for Today Threaded programming

More information

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser HPX High Performance CCT Tech Talk Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 What s HPX? Exemplar runtime system implementation Targeting conventional architectures (Linux based SMPs and clusters) Currently,

More information

Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc.

Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc. Towards a codelet-based runtime for exascale computing Chris Lauderdale ET International, Inc. What will be covered Slide 2 of 24 Problems & motivation Codelet runtime overview Codelets & complexes Dealing

More information

Overview: Emerging Parallel Programming Models

Overview: Emerging Parallel Programming Models Overview: Emerging Parallel Programming Models the partitioned global address space paradigm the HPCS initiative; basic idea of PGAS the Chapel language: design principles, task and data parallelism, sum

More information

COMP 322: Fundamentals of Parallel Programming Module 1: Parallelism

COMP 322: Fundamentals of Parallel Programming Module 1: Parallelism c 2016 by Vivek Sarkar January 10, 2018 Contents 0 Course Organization 2 1 Task-level Parallelism 2 1.1 Task Creation and Termination (Async, Finish).......................... 2 1.2 Computation Graphs.........................................

More information

Scalable and Precise Dynamic Datarace Detection for Structured Parallelism

Scalable and Precise Dynamic Datarace Detection for Structured Parallelism Scalable and Precise Dynamic Datarace Detection for Structured Parallelism Raghavan Raman Rice University raghav@rice.edu Jisheng Zhao Rice University jisheng.zhao@rice.edu Vivek Sarkar Rice University

More information

Habanero Extreme Scale Software Research Project

Habanero Extreme Scale Software Research Project Habanero Extreme Scale Software Research Project Comp215: Garbage Collection Zoran Budimlić (Rice University) Adapted from Keith Cooper s 2014 lecture in COMP 215. Garbage Collection In Beverly Hills...

More information

SCnC: Efficient Unification of Streaming with Dynamic Task Parallelism

SCnC: Efficient Unification of Streaming with Dynamic Task Parallelism SCnC: Efficient Unification of Streaming with Dynamic Task Parallelism Dragoş Sbîrlea Jun Shirako Ryan Newton Vivek Sarkar Rice University, Indiana University dragos@rice.edu shirako@rice.edu rrnewton@indiana.edu

More information

JiST Java in Simulation Time An efficient, unifying approach to simulation using virtual machines

JiST Java in Simulation Time An efficient, unifying approach to simulation using virtual machines JiST Java in Simulation Time An efficient, unifying approach to simulation using virtual machines Rimon Barr, Zygmunt Haas, Robbert van Renesse rimon@acm.org haas@ece.cornell.edu rvr@cs.cornell.edu. Cornell

More information

COMP 322: Fundamentals of Parallel Programming

COMP 322: Fundamentals of Parallel Programming COMP 322: Fundamentals of Parallel Programming https://wiki.rice.edu/confluence/display/parprog/comp322 Lecture 27: Java Threads Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu

More information

Region-Based Memory Management for Task Dataflow Models

Region-Based Memory Management for Task Dataflow Models Region-Based Memory Management for Task Dataflow Models Nikolopoulos, D. (2012). Region-Based Memory Management for Task Dataflow Models: First Joint ENCORE & PEPPHER Workshop on Programmability and Portability

More information

Introduction to Concurrent Collections (CnC)

Introduction to Concurrent Collections (CnC) Introduction to Concurrent Collections (CnC) Introduction - Motivation Why should chemists/physicists care about parallelism? Can they take benefit of parallelism without having knowledge of parallelism?

More information

Revisiting the Past 25 Years: Lessons for the Future. Guri Sohi University of Wisconsin-Madison

Revisiting the Past 25 Years: Lessons for the Future. Guri Sohi University of Wisconsin-Madison Revisiting the Past 25 Years: Lessons for the Future Guri Sohi University of Wisconsin-Madison Outline VLIW OOO Superscalar Enhancing Superscalar And the future 2 Beyond pipelining to ILP Late 1980s to

More information

Lecture 32: Volatile variables, Java memory model

Lecture 32: Volatile variables, Java memory model COMP 322: Fundamentals of Parallel Programming Lecture 32: Volatile variables, Java memory model Vivek Sarkar Department of Computer Science, Rice University vsarkar@rice.edu https://wiki.rice.edu/confluence/display/parprog/comp322

More information

Experience with Processes and Monitors in Mesa. Arvind Krishnamurthy

Experience with Processes and Monitors in Mesa. Arvind Krishnamurthy Experience with Processes and Monitors in Mesa Arvind Krishnamurthy Background Focus of this paper: light-weight processes (threads) and how they synchronize with each other History: Second system; followed

More information

Static Data Race Detection for SPMD Programs via an Extended Polyhedral Representation

Static Data Race Detection for SPMD Programs via an Extended Polyhedral Representation via an Extended Polyhedral Representation Habanero Extreme Scale Software Research Group Department of Computer Science Rice University 6th International Workshop on Polyhedral Compilation Techniques (IMPACT

More information

Capriccio: Scalable Threads for Internet Services

Capriccio: Scalable Threads for Internet Services Capriccio: Scalable Threads for Internet Services Rob von Behren, Jeremy Condit, Feng Zhou, Geroge Necula and Eric Brewer University of California at Berkeley Presenter: Cong Lin Outline Part I Motivation

More information

Ohua: Implicit Dataflow Programming for Concurrent Systems

Ohua: Implicit Dataflow Programming for Concurrent Systems Ohua: Implicit Dataflow Programming for Concurrent Systems Sebastian Ertel Compiler Construction Group TU Dresden, Germany Christof Fetzer Systems Engineering Group TU Dresden, Germany Pascal Felber Institut

More information

Runtime Support for Scalable Task-parallel Programs

Runtime Support for Scalable Task-parallel Programs Runtime Support for Scalable Task-parallel Programs Pacific Northwest National Lab xsig workshop May 2018 http://hpc.pnl.gov/people/sriram/ Single Program Multiple Data int main () {... } 2 Task Parallelism

More information

ECE519 Advanced Operating Systems

ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (10 th Week) (Advanced) Operating Systems 10. Multiprocessor, Multicore and Real-Time Scheduling 10. Outline Multiprocessor

More information

Hierarchical DAG Scheduling for Hybrid Distributed Systems

Hierarchical DAG Scheduling for Hybrid Distributed Systems June 16, 2015 Hierarchical DAG Scheduling for Hybrid Distributed Systems Wei Wu, Aurelien Bouteiller, George Bosilca, Mathieu Faverge, Jack Dongarra IPDPS 2015 Outline! Introduction & Motivation! Hierarchical

More information

Modeling and Mapping for Customizable Domain-Specific Computing

Modeling and Mapping for Customizable Domain-Specific Computing Modeling and Mapping for Customizable Domain-Specific Computing Zoran Budimlić Alex Bui Jason Cong Glenn Reinman Vivek Sarkar Rice University University of California, Los Angeles Abstract In this article,

More information

Lecture 29: Java s synchronized statement

Lecture 29: Java s synchronized statement COMP 322: Fundamentals of Parallel Programming Lecture 29: Java s synchronized statement Vivek Sarkar Department of Computer Science, Rice University vsarkar@rice.edu https://wiki.rice.edu/confluence/display/parprog/comp322

More information

Extensions to Barrelfish Asynchronous C

Extensions to Barrelfish Asynchronous C Extensions to Barrelfish Asynchronous C Michael Quigley michaelforrquigley@gmail.com School of Computing, University of Utah October 27, 2016 1 Abstract The intent of the Microsoft Barrelfish Asynchronous

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

Monitors; Software Transactional Memory

Monitors; Software Transactional Memory Monitors; Software Transactional Memory Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico October 18, 2012 CPD (DEI / IST) Parallel and

More information

c 2015 by Adam R. Smith. All rights reserved.

c 2015 by Adam R. Smith. All rights reserved. c 2015 by Adam R. Smith. All rights reserved. THE PARALLEL INTERMEDIATE LANGUAGE BY ADAM RANDALL SMITH DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

More information

DYNAMIC SCHEDULING IN MULTICORE PROCESSORS

DYNAMIC SCHEDULING IN MULTICORE PROCESSORS DYNAMIC SCHEDULING IN MULTICORE PROCESSORS A thesis submitted to the University of Manchester for the degree of Doctor of Philosophy in the Faculty of Engineering and Physical Sciences 2011 By Demian Rosas

More information

Declarative Aspects of Memory Management in the Concurrent Collections Parallel Programming Model

Declarative Aspects of Memory Management in the Concurrent Collections Parallel Programming Model Declarative Aspects of Memory Management in the Concurrent Collections Parallel Programming Model Zoran Budimlić Rice University zoran@rice.edu Aparna Chandramowlishwaran Georgia Institute of Technology

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

StreamBox: Modern Stream Processing on a Multicore Machine

StreamBox: Modern Stream Processing on a Multicore Machine StreamBox: Modern Stream Processing on a Multicore Machine Hongyu Miao and Heejin Park, Purdue ECE; Myeongjae Jeon and Gennady Pekhimenko, Microsoft Research; Kathryn S. McKinley, Google; Felix Xiaozhu

More information

CnC for Tuning Hints on OCR. Nick Vrvilo, Rice University The 7 th Annual CnC Workshop September 8, 2015

CnC for Tuning Hints on OCR. Nick Vrvilo, Rice University The 7 th Annual CnC Workshop September 8, 2015 CnC for Tuning Hints on OCR Nick Vrvilo, Rice University The 7 th Annual CnC Workshop September 8, 2015 Acknowledgements This work was done as part of my internship with the OCR team, part of Intel Federal,

More information

Isolation for Nested Task Parallelism

Isolation for Nested Task Parallelism Isolation for Nested Task Parallelism Jisheng Zhao Rice University jisheng.zhao@rice.edu Swarat Chaudhuri Rice University swarat@rice.edu Roberto Lublinerman Google, Inc. rluble@google.com Vivek Sarkar

More information

CSE 260 Lecture 19. Parallel Programming Languages

CSE 260 Lecture 19. Parallel Programming Languages CSE 260 Lecture 19 Parallel Programming Languages Announcements Thursday s office hours are cancelled Office hours on Weds 2p to 4pm Jing will hold OH, too, see Moodle Scott B. Baden /CSE 260/ Winter 2014

More information

Finish Accumulators: a Deterministic Reduction Construct for Dynamic Task Parallelism

Finish Accumulators: a Deterministic Reduction Construct for Dynamic Task Parallelism Finish Accumulators: a Deterministic Reduction Construct for Dynamic Task Parallelism Jun Shirako, Vincent Cavé, Jisheng Zhao, and Vivek Sarkar Department of Computer Science, Rice University {shirako,vincent.cave,jisheng.zhao,vsarkar@rice.edu

More information

Principles of Parallel Algorithm Design: Concurrency and Mapping

Principles of Parallel Algorithm Design: Concurrency and Mapping Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 28 August 2018 Last Thursday Introduction

More information

Recursive inverse factorization

Recursive inverse factorization Recursive inverse factorization Anton Artemov Division of Scientific Computing, Uppsala University anton.artemov@it.uu.se 06.09.2017 Context of work Research group on large scale electronic structure computations

More information

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)

More information

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

COMP 322: Fundamentals of Parallel Programming Module 1: Parallelism

COMP 322: Fundamentals of Parallel Programming Module 1: Parallelism c 2017 by Vivek Sarkar February 14, 2017 DRAFT VERSION Contents 0 Course Organization 3 1 Task-level Parallelism 3 1.1 Task Creation and Termination (Async, Finish).......................... 3 1.2 Computation

More information

Monitors; Software Transactional Memory

Monitors; Software Transactional Memory Monitors; Software Transactional Memory Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico March 17, 2016 CPD (DEI / IST) Parallel and Distributed

More information

Cooperative Scheduling of Parallel Tasks with General Synchronization Patterns

Cooperative Scheduling of Parallel Tasks with General Synchronization Patterns Cooperative Scheduling of Parallel Tasks with General Synchronization Patterns Shams Imam and Vivek Sarkar Department of Computer Science, Rice University {shams,vsarkar}@rice.edu Abstract. In this paper,

More information

CnC for high performance computing. Kath Knobe Intel SSG

CnC for high performance computing. Kath Knobe Intel SSG CnC for high performance computing Kath Knobe Intel SSG Thanks Frank Schlimbach Intel Vivek Sarkar and his group Rice University DARPA UHPC (Runnemede) DOE X-Stack (Trilacka Glacier) 2 Outline Intro Checkpoint/restart

More information

Parallel Algorithm Design. CS595, Fall 2010

Parallel Algorithm Design. CS595, Fall 2010 Parallel Algorithm Design CS595, Fall 2010 1 Programming Models The programming model o determines the basic concepts of the parallel implementation and o abstracts from the hardware as well as from the

More information

University of Delaware Department of Electrical and Computer Engineering Computer Architecture and Parallel Systems Laboratory

University of Delaware Department of Electrical and Computer Engineering Computer Architecture and Parallel Systems Laboratory University of Delaware Department of Electrical and Computer Engineering Computer Architecture and Parallel Systems Laboratory Polytasks: A Compressed Task Representation for HPC Runtimes Daniel Orozco

More information

Transactional Memory

Transactional Memory Transactional Memory Architectural Support for Practical Parallel Programming The TCC Research Group Computer Systems Lab Stanford University http://tcc.stanford.edu TCC Overview - January 2007 The Era

More information

An Update on Haskell H/STM 1

An Update on Haskell H/STM 1 An Update on Haskell H/STM 1 Ryan Yates and Michael L. Scott University of Rochester TRANSACT 10, 6-15-2015 1 This work was funded in part by the National Science Foundation under grants CCR-0963759, CCF-1116055,

More information

Lecture 24: Java Threads,Java synchronized statement

Lecture 24: Java Threads,Java synchronized statement COMP 322: Fundamentals of Parallel Programming Lecture 24: Java Threads,Java synchronized statement Zoran Budimlić and Mack Joyner {zoran, mjoyner@rice.edu http://comp322.rice.edu COMP 322 Lecture 24 9

More information

THREADS AND CONCURRENCY

THREADS AND CONCURRENCY THREADS AND CONCURRENCY Lecture 22 CS2110 Spring 2013 Graphs summary 2 Dijkstra: given a vertex v, finds shortest path from v to x for each vertex x in the graph Key idea: maintain a 5-part invariant on

More information

Dynamic inter-core scheduling in Barrelfish

Dynamic inter-core scheduling in Barrelfish Dynamic inter-core scheduling in Barrelfish. avoiding contention with malleable domains Georgios Varisteas, Mats Brorsson, Karl-Filip Faxén November 25, 2011 Outline Introduction Scheduling & Programming

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution

More information

Multicore Computing and Scientific Discovery

Multicore Computing and Scientific Discovery scientific infrastructure Multicore Computing and Scientific Discovery James Larus Dennis Gannon Microsoft Research In the past half century, parallel computers, parallel computation, and scientific research

More information

Resilient Distributed Concurrent Collections. Cédric Bassem Promotor: Prof. Dr. Wolfgang De Meuter Advisor: Dr. Yves Vandriessche

Resilient Distributed Concurrent Collections. Cédric Bassem Promotor: Prof. Dr. Wolfgang De Meuter Advisor: Dr. Yves Vandriessche Resilient Distributed Concurrent Collections Cédric Bassem Promotor: Prof. Dr. Wolfgang De Meuter Advisor: Dr. Yves Vandriessche 1 Evolution of Performance in High Performance Computing Exascale = 10 18

More information

Scalable In-memory Checkpoint with Automatic Restart on Failures

Scalable In-memory Checkpoint with Automatic Restart on Failures Scalable In-memory Checkpoint with Automatic Restart on Failures Xiang Ni, Esteban Meneses, Laxmikant V. Kalé Parallel Programming Laboratory University of Illinois at Urbana-Champaign November, 2012 8th

More information

QR Decomposition on GPUs

QR Decomposition on GPUs QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

Polyhedral Optimizations for a Data-Flow Graph Language

Polyhedral Optimizations for a Data-Flow Graph Language Polyhedral Optimizations for a Data-Flow Graph Language Alina Sbîrlea 1 Jun Shirako 1 Louis-Noël Pouchet 2 Vivek Sarkar 1 1 Rice University, USA 2 Ohio State University, USA Abstract. This paper proposes

More information

Cloud Programming James Larus Microsoft Research. July 13, 2010

Cloud Programming James Larus Microsoft Research. July 13, 2010 Cloud Programming James Larus Microsoft Research July 13, 2010 New Programming Model, New Problems (and some old, unsolved ones) Concurrency Parallelism Message passing Distribution High availability Performance

More information

Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling

Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling Tobias Schwarzer 1, Joachim Falk 1, Michael Glaß 1, Jürgen Teich 1, Christian Zebelein 2, Christian

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore

More information

Heckaton. SQL Server's Memory Optimized OLTP Engine

Heckaton. SQL Server's Memory Optimized OLTP Engine Heckaton SQL Server's Memory Optimized OLTP Engine Agenda Introduction to Hekaton Design Consideration High Level Architecture Storage and Indexing Query Processing Transaction Management Transaction Durability

More information

CILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY

CILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY CILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY 1 OUTLINE CILK and CILK++ Language Features and Usages Work stealing runtime CILK++ Reducers Conclusions 2 IDEALIZED SHARED MEMORY ARCHITECTURE Hardware

More information

A Pluggable Framework for Composable HPC Scheduling Libraries

A Pluggable Framework for Composable HPC Scheduling Libraries A Pluggable Framework for Composable HPC Scheduling Libraries Max Grossman 1, Vivek Kumar 2, Nick Vrvilo 1, Zoran Budimlic 1, Vivek Sarkar 1 1 Habanero Extreme Scale So=ware Research Group, Rice University

More information

POSIX Threads: a first step toward parallel programming. George Bosilca

POSIX Threads: a first step toward parallel programming. George Bosilca POSIX Threads: a first step toward parallel programming George Bosilca bosilca@icl.utk.edu Process vs. Thread A process is a collection of virtual memory space, code, data, and system resources. A thread

More information

FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas

FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS Waqas Akram, Cirrus Logic Inc., Austin, Texas Abstract: This project is concerned with finding ways to synthesize hardware-efficient digital filters given

More information

Lecture 27: Safety and Liveness Properties, Java Synchronizers, Dining Philosophers Problem

Lecture 27: Safety and Liveness Properties, Java Synchronizers, Dining Philosophers Problem COMP 322: Fundamentals of Parallel Programming Lecture 27: Safety and Liveness Properties, Java Synchronizers, Dining Philosophers Problem Mack Joyner and Zoran Budimlić {mjoyner, zoran}@rice.edu http://comp322.rice.edu

More information

Compsci 590.3: Introduction to Parallel Computing

Compsci 590.3: Introduction to Parallel Computing Compsci 590.3: Introduction to Parallel Computing Alvin R. Lebeck Slides based on this from the University of Oregon Admin Logistics Homework #3 Use script Project Proposals Document: see web site» Due

More information

Improving the Practicality of Transactional Memory

Improving the Practicality of Transactional Memory Improving the Practicality of Transactional Memory Woongki Baek Electrical Engineering Stanford University Programming Multiprocessors Multiprocessor systems are now everywhere From embedded to datacenter

More information

JS Event Loop, Promises, Async Await etc. Slava Kim

JS Event Loop, Promises, Async Await etc. Slava Kim JS Event Loop, Promises, Async Await etc Slava Kim Synchronous Happens consecutively, one after another Asynchronous Happens later at some point in time Parallelism vs Concurrency What are those????

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

Reminder from last time

Reminder from last time Concurrent systems Lecture 7: Crash recovery, lock-free programming, and transactional memory DrRobert N. M. Watson 1 Reminder from last time History graphs; good (and bad) schedules Isolation vs. strict

More information

Deterministic Parallel Programming

Deterministic Parallel Programming Deterministic Parallel Programming Concepts and Practices 04/2011 1 How hard is parallel programming What s the result of this program? What is data race? Should data races be allowed? Initially x = 0

More information

Scalable Shared Memory Programing

Scalable Shared Memory Programing Scalable Shared Memory Programing Marc Snir www.parallel.illinois.edu What is (my definition of) Shared Memory Global name space (global references) Implicit data movement Caching: User gets good memory

More information

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms. Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

More information

Placement de processus (MPI) sur architecture multi-cœur NUMA

Placement de processus (MPI) sur architecture multi-cœur NUMA Placement de processus (MPI) sur architecture multi-cœur NUMA Emmanuel Jeannot, Guillaume Mercier LaBRI/INRIA Bordeaux Sud-Ouest/ENSEIRB Runtime Team Lyon, journées groupe de calcul, november 2010 Emmanuel.Jeannot@inria.fr

More information

Chapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues

Chapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues 4.2 Silberschatz, Galvin

More information

Sustainable Memory Use Allocation & (Implicit) Deallocation (mostly in Java)

Sustainable Memory Use Allocation & (Implicit) Deallocation (mostly in Java) COMP 412 FALL 2017 Sustainable Memory Use Allocation & (Implicit) Deallocation (mostly in Java) Copyright 2017, Keith D. Cooper & Zoran Budimlić, all rights reserved. Students enrolled in Comp 412 at Rice

More information