DATA-DRIVEN TASKS AND THEIR IMPLEMENTATION
SAĞNAK TAŞIRLAR, VIVEK SARKAR
DEPARTMENT OF COMPUTER SCIENCE, RICE UNIVERSITY
2 Fork/Join graphs constrain parallelism
- Fork/Join models restrict task graphs to be series-parallel
  - Cannot describe non-series-parallel graphs without hampering parallelism
- Fork/Join models constrain control and data dependences
  - Tasks can only be created after all data dependences are satisfied
  - Necessitates ordering task creation to conform to that restriction
  - May hamper performance
3 Macro-dataflow for intuitive parallelism
- Kernel-based programming
- Build a task graph of kernel instantiations
- Restrict dependences to true dependences
- Race-freedom, determinism
- Provides productivity
- Single-assignment data
[Figure: kernels TaskA, TaskB, TaskC and a task graph of their instantiations rooted at main]
4 Futures [Baker & Hewitt 1977]
- future = (storage, resolvingprocess, waitingtasks)

    Future F = { stmt 1; ; return v; }
    task g = { stmt; F.get(); ; }

[Figure: a future at addressf with tasks TaskF, TaskG, TaskH, TaskJ]
5 Data-Driven Futures (DDFs) & Data-Driven Tasks (DDTs)
- DataDrivenFuture = (storage, waitingtasks)
- Creation
  - Create an empty Data-Driven Future (DDF) object
- Resolution ("put") (resolvingprocess)
  - Resolve what value a DDF is referring to
- Data-Driven Tasks (DDTs) ("async await( )")
  - A task provides a consumer list of DDFs on declaration
  - A task can only read DDFs that it is registered to
- Difference from futures: creation of the container (DDF) and of the computation (DDT) are separate events
6 DDF/DDT Code Sample

    DataDrivenFuture left = new DataDrivenFuture();
    DataDrivenFuture right = new DataDrivenFuture();
    finish {
      async await ( left ) useleftchild(left);                      // Task 1
      async await ( right ) userightchild(right);                   // Task 2
      async await ( left, right ) usebothchildren( left, right );   // Task 3
      async left.put(leftchildcreator());                           // Task 4
      async right.put(rightchildcreator());                         // Task 5
    }

[Figure: dependence graph; Task 4 feeds Task 1 and Task 3, Task 5 feeds Task 2 and Task 3]
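For readers without a Habanero-Java toolchain, the same dependence shape can be approximated in plain Java. This is only an analogy, not the authors' implementation: `CompletableFuture.complete` stands in for `put`, `thenAccept` / `thenAcceptBoth` stand in for `async await`, and the final `allOf(...).join()` stands in for `finish`. The class name `CfAnalogy` and the string values are made up for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

final class CfAnalogy {
    static List<String> run() {
        List<String> log = new CopyOnWriteArrayList<>();

        // Empty containers, created before any producer or consumer exists.
        CompletableFuture<String> left = new CompletableFuture<>();
        CompletableFuture<String> right = new CompletableFuture<>();

        // Consumers: each continuation runs only after its inputs are completed.
        CompletableFuture<Void> t1 = left.thenAccept(l -> log.add("task1:" + l));
        CompletableFuture<Void> t2 = right.thenAccept(r -> log.add("task2:" + r));
        CompletableFuture<Void> t3 =
            left.thenAcceptBoth(right, (l, r) -> log.add("task3:" + l + "+" + r));

        // Producers: resolve the containers asynchronously.
        CompletableFuture<Void> t4 = CompletableFuture.runAsync(() -> left.complete("L"));
        CompletableFuture<Void> t5 = CompletableFuture.runAsync(() -> right.complete("R"));

        // Wait for all five tasks, similar in spirit to the enclosing finish.
        CompletableFuture.allOf(t1, t2, t3, t4, t5).join();
        return log;
    }
}
```

The order of the three log entries may vary between runs, since tasks 1 to 3 are independent of one another.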
7 DDTs provide
- Non-series-parallel task dependence graph support
  - Less restricted parallelism
  - Better scheduling opportunities
- Single assignment (SA)
  - Race-freedom on DDF accesses
  - Determinism if all shared data is expressed as DDFs
- SA-value lifetime restriction
  - Smaller than graph lifetime
  - DDF creator: provides DDF reference to producers and consumers
  - DDF lifetime depends on
    - Creator lifetime
    - Resolver lifetime
    - Consumers' lifetimes
[Figure: task graph of DDTA through DDTF connected through a DDF]
8 Data-Driven Scheduling
- Steps register self to items wrapped into DDFs

    DDF left = new DDF();
    DDF right = new DDF();
    async await (left) use(left);               // Task 1
    async await (right) use(right);             // Task 2
    async await (left,right) use(left,right);   // Task 3
    async builder(left);                        // Task 4
    async builder(right);                       // Task 5

[Figure: Tasks 1 to 3 register on the placeholders of DDFleft and DDFright; Tasks 4 and 5 resolve the DDFs, replacing each placeholder with its value, and the registered tasks move to the ready queue]
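The register / resolve / ready-queue steps on this slide can be sketched in plain Java as follows. This is a minimal single-threaded illustration, not the Habanero-Java runtime: the class names `SketchDdf`, `SketchDdt` and `SketchDdfDemo` are hypothetical, and a real implementation would need thread-safe registration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical single-assignment container: (storage, waitingTasks). */
final class SketchDdf<T> {
    private T value;
    private boolean resolved;
    private final List<SketchDdt> waitingTasks = new ArrayList<>();

    boolean isResolved() { return resolved; }

    void register(SketchDdt task) { waitingTasks.add(task); }

    /** Resolves the DDF exactly once and notifies every registered task. */
    void put(T v) {
        if (resolved) throw new IllegalStateException("single assignment violated");
        value = v;
        resolved = true;
        for (SketchDdt t : waitingTasks) t.inputArrived();
        waitingTasks.clear();
    }

    T get() {
        if (!resolved) throw new IllegalStateException("read before put");
        return value;
    }
}

/** Hypothetical data-driven task: enqueued once all awaited DDFs are resolved. */
final class SketchDdt {
    private int missing;
    private final Runnable body;
    private final Queue<Runnable> readyQueue;

    SketchDdt(Queue<Runnable> readyQueue, Runnable body, SketchDdf<?>... awaited) {
        this.readyQueue = readyQueue;
        this.body = body;
        for (SketchDdf<?> d : awaited) {
            if (!d.isResolved()) { missing++; d.register(this); }
        }
        if (missing == 0) readyQueue.add(body);   // nothing to wait for
    }

    void inputArrived() {
        if (--missing == 0) readyQueue.add(body); // last input promotes the task
    }
}

final class SketchDdfDemo {
    static List<String> run() {
        Queue<Runnable> ready = new ArrayDeque<>();
        List<String> log = new ArrayList<>();
        SketchDdf<Integer> left = new SketchDdf<>();
        SketchDdf<Integer> right = new SketchDdf<>();

        new SketchDdt(ready, () -> log.add("both=" + (left.get() + right.get())), left, right);
        new SketchDdt(ready, () -> log.add("left=" + left.get()), left);
        ready.add(() -> right.put(2));  // producer tasks have no awaited DDFs
        ready.add(() -> left.put(1));

        while (!ready.isEmpty()) ready.remove().run();
        return log;                     // ["both=3", "left=1"]
    }
}
```

Consumers never enter the ready queue until their last input arrives, so no worker ever has to block or poll on a missing value.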
9 Mapping Macro-Dataflow to Task-Parallelism
- Control & data dependences as first level constructs
  - Task-parallel frameworks have them coupled, e.g., OpenMP, Cilk
- Kernel instantiations may have multiple predecessors
  - Need to wait for all
- Staged readiness concepts
  - Created (control dependence satisfied)
  - Data dependences satisfied
  - Schedulable / Ready
- DDTs provide a natural implementation for Macro-Dataflow
  - Every kernel instantiation is a DDT
  - Data dependences between DDTs are expressed through DDFs
  - Provides race freedom
10 Experimental Results
- Compared the DDT implementation with four macro-dataflow schedulers from past work that used Concurrent Collections (CnC)
  - CnC uses global data collections to synchronize tasks
- DDT/DDF results obtained at the task-parallel level without allocating global data collections
- CnC can be automatically translated to DDFs (ongoing work)
11 Blocking Schedulers
- Use Java wait/notify for premature data access
- Blocking granularity
  - Instance level vs. collection level (fine-grain vs. coarse-grain)
- A blocked task blocks an entire worker thread
  - Need to create more worker threads to avoid deadlock
[Figure: timeline; step 1 on Worker C calls Get(key c) on the ItemCollection and waits; step 2 on Worker D calls Put(key c, value c) and notifies]
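A minimal sketch of the coarse-grain (collection-level) variant, assuming a hypothetical `BlockingItems` class rather than the actual CnC item collection: every `get` on a missing key waits on the collection's monitor, and every `put` wakes all waiters. A fine-grain variant would wait on a per-item object instead.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical item collection with collection-level blocking. */
final class BlockingItems<K, V> {
    private final Map<K, V> items = new HashMap<>();

    /** Blocks the calling worker thread until the key has been put. */
    synchronized V get(K key) throws InterruptedException {
        while (!items.containsKey(key)) {
            wait();                       // the whole worker thread is parked here
        }
        return items.get(key);
    }

    /** Single-assignment put; wakes every thread waiting on this collection. */
    synchronized void put(K key, V value) {
        if (items.putIfAbsent(key, value) != null) {
            throw new IllegalStateException("item already put: " + key);
        }
        notifyAll();
    }
}

final class BlockingDemo {
    static int run() {
        BlockingItems<String, Integer> items = new BlockingItems<>();
        int[] seen = new int[1];

        // Worker C: step 1 reads key "c" before it exists and blocks.
        Thread workerC = new Thread(() -> {
            try {
                seen[0] = items.get("c");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        workerC.start();

        // Worker D (here: the calling thread): step 2 provides the item.
        items.put("c", 42);

        try {
            workerC.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen[0];
    }
}
```

If every worker thread were parked in `get` at once, no thread would be left to run the producing step, which is why the slide notes that more worker threads are needed to avoid deadlock.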
12 Delayed async Scheduling
- Every kernel instantiation is a guarded execution
  - Guard condition is the availability of input data
  - Task can be created eagerly before input data is available
  - Promoted to ready when data provided

    Value left = new Value();
    Value right = new Value();
    finish {
      async when ( left.isready() ) useleftchild(left);     // Task 1
      async when ( right.isready() ) userightchild(right);  // Task 2
      async when ( right.isready() && left.isready() )
        usebothchildren( left, right );                     // Task 3
      async left.put(leftchildcreator());                   // Task 4
      async right.put(rightchildcreator());                 // Task 5
    }

[Figure: work-sharing ready task queue holding async 1 to async 5; a task is popped; if it is a delayed async its guard is evaluated; if the guard is true (or the task is not delayed) it is scheduled, otherwise it is requeued]
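The pop / evaluate guard / requeue loop from the figure can be sketched as below. This is a single-threaded illustration under assumed names (`DelayedAsyncSketch`, `Guarded`), not the Habanero-Java scheduler; unguarded tasks are modelled with a guard that is always true.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.BooleanSupplier;

final class DelayedAsyncSketch {
    /** A task plus the guard that must hold before it may run. */
    static final class Guarded {
        final BooleanSupplier guard;
        final Runnable body;
        Guarded(BooleanSupplier guard, Runnable body) {
            this.guard = guard;
            this.body = body;
        }
    }

    static List<String> run() {
        List<String> log = new ArrayList<>();
        Integer[] left = new Integer[1];    // null means "not yet put"
        Integer[] right = new Integer[1];

        Queue<Guarded> queue = new ArrayDeque<>();
        queue.add(new Guarded(() -> left[0] != null, () -> log.add("task1")));
        queue.add(new Guarded(() -> right[0] != null, () -> log.add("task2")));
        queue.add(new Guarded(() -> left[0] != null && right[0] != null,
                              () -> log.add("task3")));
        queue.add(new Guarded(() -> true, () -> { left[0] = 1; log.add("task4"); }));
        queue.add(new Guarded(() -> true, () -> { right[0] = 2; log.add("task5"); }));

        // Scheduler loop: run a popped task if its guard holds, else requeue it.
        while (!queue.isEmpty()) {
            Guarded t = queue.remove();
            if (t.guard.getAsBoolean()) {
                t.body.run();
            } else {
                queue.add(t);
            }
        }
        return log;   // [task4, task5, task1, task2, task3]
    }
}
```

In this sketch tasks 1 to 3 are each popped and requeued once before their inputs exist, which shows the polling cost of this scheme; a guard that never becomes true would make the loop spin forever.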
13 Data Driven Rollback & Replay
- Throw exception to unwind
- Insert step 1 to waitlist c
- Re-execute steps in waitlist c on Put()
[Figure: step 1 on Worker C calls Get(key c) on an ItemCollection in which keys a and b hold values and key c is empty; step 1 is inserted into waitlist c; step 2 on Worker D calls Put(key c, value c), which re-executes step 1]
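The three steps above can be sketched in plain Java as follows. This is a single-threaded illustration with hypothetical names (`ReplaySketch`, `MissingItem`), not the CnC runtime. It assumes a step performs its `get` calls before any visible side effect, so that re-executing it from the start is safe.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

final class ReplaySketch {
    /** Thrown by get() to unwind a step whose input is not yet available. */
    static final class MissingItem extends RuntimeException {
        final String key;
        MissingItem(String key) {
            super("missing item: " + key);
            this.key = key;
        }
    }

    private final Map<String, Integer> items = new HashMap<>();
    private final Map<String, List<Runnable>> waitlists = new HashMap<>();
    private final Queue<Runnable> ready = new ArrayDeque<>();

    int get(String key) {
        Integer v = items.get(key);
        if (v == null) throw new MissingItem(key);   // roll back the calling step
        return v;
    }

    void put(String key, int value) {
        if (items.putIfAbsent(key, value) != null) {
            throw new IllegalStateException("item already put: " + key);
        }
        List<Runnable> waiters = waitlists.remove(key);
        if (waiters != null) ready.addAll(waiters);  // replay the waiting steps
    }

    void submit(Runnable step) { ready.add(step); }

    void runAll() {
        while (!ready.isEmpty()) {
            Runnable step = ready.remove();
            try {
                step.run();
            } catch (MissingItem m) {
                // Park the step on the waitlist of the key it could not read.
                waitlists.computeIfAbsent(m.key, k -> new ArrayList<>()).add(step);
            }
        }
    }

    static List<String> demo() {
        ReplaySketch rt = new ReplaySketch();
        List<String> log = new ArrayList<>();
        rt.submit(() -> { int c = rt.get("c"); log.add("step1 read " + c); });
        rt.submit(() -> { rt.put("c", 7); log.add("step2 put c"); });
        rt.runAll();
        return log;   // ["step2 put c", "step1 read 7"]
    }
}
```

Unlike the blocking schedulers, the worker is released as soon as the exception unwinds, at the cost of redoing any work the step performed before the failed `get`.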
14 Experimental Setup
- 4-socket Xeon quad-core: Intel E GHz
  - Shared 3MB L2 cache per pair of cores
  - Main memory 32 GB
  - #worker threads: 16
- 8-way SMT 8-core Niagara: Sun UltraSPARC T2
  - Shared 4MB L2 cache
  - #worker threads:
- bit Sun Hotspot JDK 1.6 JVM
- GCC for JNI
- 30 runs for statistical soundness
- Read "Serial" as single-threaded execution of code
15 Cholesky decomposition
[Figure: bar chart of execution time in milliseconds, Serial vs. Parallel, for Coarse Grain Blocking, Fine Grain Blocking, Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures Tasks]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 16-threaded executions for blocked Cholesky decomposition CnC application with Habanero-Java steps on 16-core Xeon with input matrix size and with tile size
16 Black-Scholes formula (PARSEC)
[Figure: bar chart of execution time in milliseconds, Serial vs. Parallel, for Coarse Grain Blocking, Fine Grain Blocking, Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures Tasks]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 16-threaded executions for blocked Black-Scholes CnC application with Habanero-Java steps on 16-core Xeon with input size 1,000,000 and with tile size 62,500
17 Rician Denoising (Medical Imaging)
[Figure: bar chart of execution time in milliseconds, Serial vs. Parallel, for Coarse Grain Blocking *, Fine Grain Blocking *, Delayed Async *, and Data Driven Futures Tasks]
Average execution times and 90% confidence interval of 30 runs of single-threaded and 16-threaded executions for blocked Rician Denoising CnC application with Habanero-Java steps on Xeon with input image size and with tile size
* Explicit memory management required for non-DDT schedules to avoid out-of-memory exception
18 Heart Wall Tracking Dependence Graph
[Figure: dependence graph among Step1 through Step10 within IterationJ and IterationJ+1]
19 Heart Wall Tracking (Rodinia)
[Figure: bar chart of execution time in milliseconds, Serial vs. Parallel, for Delayed Async, Data Driven Rollback&Replay, and Data Driven Futures Tasks]
Minimum execution times of 13 runs of single-threaded and 16-threaded executions for Heart Wall Tracking CnC application with C steps on Xeon with 104 frames
20 Related Work
- Futures
  - Can build arbitrary task graphs
  - get()/force() is usually a blocking operation
  - Future task creation is bound to the container at creation time
- Dataflow
  - Typically blocks on one datum (Ivar) at a time, unlike async await ( )
- Nabbit (Cilk library)
  - Can build arbitrary task graphs, more explicit than DDTs
  - No garbage collection and unwinding of task graph
- Concurrent Collections (CnC)
  - Globalized data collections and general tags (keys) make memory management challenging
  - DDTs can be used to obtain more efficient implementations of CnC
21 Conclusions
Data-Driven Futures and Data-Driven Tasks
- help build arbitrary task graphs and extend task-parallel frameworks
- introduce the more-intuitive macro-dataflow to programmers on task-parallel frameworks
- support Data-Driven scheduling that outperforms alternative schedulers in both execution time and memory requirements
- help to implement blocking in tasks without blocking workers
22 Future Work
- Compile Concurrent Collections down to DDTs
- Compiler optimizations to move DDF allocations to further reduce lifetimes
- Hierarchical DDTs for granularity optimizations
- Work-stealing support for DDTs
- Use DDTs to implement all blocking synchronizations without blocking workers, i.e. replace each waiting continuation with a DDT
- Locality-aware scheduling with DDTs
For a hands-on trial, visit
CMSC 330: Organization of Programming Languages Multithreaded Programming Patterns in Java CMSC 330 2 Multiprocessors Description Multiple processing units (multiprocessor) From single microprocessor to
More informationChapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues 4.2 Silberschatz, Galvin
More informationIngo Brenckmann Jochen Kirsten Storage Technology Strategists SAS EMEA Copyright 2003, SAS Institute Inc. All rights reserved.
Intelligent Storage Results from real life testing Ingo Brenckmann Jochen Kirsten Storage Technology Strategists SAS EMEA SAS Intelligent Storage components! OLAP Server! Scalable Performance Data Server!
More informationA Data-flow Approach to Solving the von Neumann Bottlenecks
1 A Data-flow Approach to Solving the von Neumann Bottlenecks Antoniu Pop Inria and École normale supérieure, Paris St. Goar, June 21, 2013 2 Problem Statement J. Preshing. A Look Back at Single-Threaded
More informationIntroduction to Multicore Programming
Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Synchronization 3 Automatic Parallelization and OpenMP 4 GPGPU 5 Q& A 2 Multithreaded
More informationCOMP 322: Fundamentals of Parallel Programming
COMP 322: Fundamentals of Parallel Programming! Lecture 1: The What and Why of Parallel Programming; Task Creation & Termination (async, finish) Vivek Sarkar Department of Computer Science, Rice University
More informationMULTI-THREADED QUERIES
15-721 Project 3 Final Presentation MULTI-THREADED QUERIES Wendong Li (wendongl) Lu Zhang (lzhang3) Rui Wang (ruiw1) Project Objective Intra-operator parallelism Use multiple threads in a single executor
More informationJCudaMP: OpenMP/Java on CUDA
JCudaMP: OpenMP/Java on CUDA Georg Dotzler, Ronald Veldema, Michael Klemm Programming Systems Group Martensstraße 3 91058 Erlangen Motivation Write once, run anywhere - Java Slogan created by Sun Microsystems
More informationExploiting Task-Parallelism on GPU Clusters via OmpSs and rcuda Virtualization
Exploiting Task-Parallelism on Clusters via Adrián Castelló, Rafael Mayo, Judit Planas, Enrique S. Quintana-Ortí RePara 2015, August Helsinki, Finland Exploiting Task-Parallelism on Clusters via Power/energy/utilization
More informationECE519 Advanced Operating Systems
IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (10 th Week) (Advanced) Operating Systems 10. Multiprocessor, Multicore and Real-Time Scheduling 10. Outline Multiprocessor
More informationDavid R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.
Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.
More informationThread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications
Thread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications Janghaeng Lee, Haicheng Wu, Madhumitha Ravichandran, Nathan Clark Motivation Hardware Trends Put more cores
More informationData-triggered Multithreading for Near-Data Processing
Data-triggered Multithreading for Near-Data Processing Hung-Wei Tseng and Dean M. Tullsen Department of Computer Science and Engineering University of California, San Diego La Jolla, CA, U.S.A. Abstract
More informationDisruptor Using High Performance, Low Latency Technology in the CERN Control System
Disruptor Using High Performance, Low Latency Technology in the CERN Control System ICALEPCS 2015 21/10/2015 2 The problem at hand 21/10/2015 WEB3O03 3 The problem at hand CESAR is used to control the
More informationDay 6: Optimization on Parallel Intel Architectures
Day 6: Optimization on Parallel Intel Architectures Lecture day 6 Ryo Asai Colfax International colfaxresearch.com April 2017 colfaxresearch.com/ Welcome Colfax International, 2013 2017 Disclaimer 2 While
More information