Runtime Support for Scalable Task-parallel Programs
1 Runtime Support for Scalable Task-parallel Programs. Pacific Northwest National Lab. xsig workshop, May
2 Single Program Multiple Data: int main () { ... }
3-4 Task Parallelism: int main () { ... }
5 Task-parallel Abstractions. Finer specification of concurrency, data locality, and dependences conveys more application information to the compiler and runtime, and an adaptive runtime system manages the tasks. The application writer specifies the computation and writes optimizable code; tools transform that code to generate an efficient implementation.
6 The Promise. The application writer specifies the computation; the software stack maps that computation to a specific execution environment. We are transferring some of the burden away from the programmer.
7 The Challenge. We are transferring some of the burden to the software stack. Handling a million MPI processes is supposed to be hard; how about billions of tasks? And what about the software ecosystem?
8 Tracing and Constraining Work Stealing Schedulers
9 Research Directions: concurrency management and tracing, dynamic load balancing, data locality optimization, task granularity selection, data race detection.
10 Recursive Task Parallelism

fn() {
    s1;
    async { /*A1*/
        s2;
        finish async s3; //A2
        s4;
    }
    async s5; //A3
    s6;
}

(Figure: the corresponding tree of serial steps s1..s6 and async tasks A1..A3.)
11 Work Stealing. A worker begins with one or a few tasks; tasks spawn more tasks; when a worker runs out of tasks, it steals from another worker. A popular scheduling strategy for recursive parallel programs: a well-studied load balancing strategy with provably efficient scheduling and understandable space and time bounds.
12 Objective. Trace execution under work stealing; exploit information from the trace to perform various optimizations; constrain the scheduler to obtain desired behavior.
13 Tracing. Reference: Steal tree: low-overhead tracing of work stealing schedulers. PLDI '13.
14 Tracing Work Stealing. Record when and where each task executed; capture the order of events for online and offline analysis. Challenges: the sheer size of the trace, and application perturbation that might make tracing impractical.
15 Tracing Approach: Illustration. (Figure: a deque holding tasks, async steps, and steals labeled a, b, c.) Steals occur in order of levels, with almost one steal per level.
16 Space Overhead: Shared Memory. (Plots: total trace space in KB, log scale, for (a) Cholesky, (b) Heat, (c) AllQueens, and (d) Fib across problem configurations, comparing full enumeration (Enum) against StealTree on 2 to 80 cores; a 1 MB line is marked for reference.) Key result: small trace sizes, little affected by core count or problem size; StealTree traces stay near or below 1 MB while enumeration grows by orders of magnitude.
17 Space Overhead: Distributed Memory. (Plot: trace size in KB per core across core counts for AQ-HF, AQ-WF, SCF-HF, SCF-WF, TCE-HF, TCE-WF, PG-HF, PG-WF.) Still less than 160 MB in total.
18 Time Overhead: Shared Memory. (Plot: overhead at 4 to 120 threads for AllQueens, Heat, Fib, FFT, Strassen, NBody, Cholesky, LU, and Matmul.) The time overhead is within the variation in execution time.
19 Time Overhead: Distributed Memory. (Plot: overhead across core counts for AQ-HF, AQ-WF, SCF-HF, SCF-WF, TCE-HF, TCE-WF, PG-HF, PG-WF.) The time overhead is within the variation in execution time.
20 What can we do with a steal tree?
21 Visualization. Core utilization plot over time for the Cilk LU benchmark on 24 cores; the trace is under 100 KB.
22 Replay. Reference: Optimizing data locality for fork/join programs using constrained work stealing. SC '14.
23 Replay Schedulers. Strict, ordered replay (StOWS): exactly reproduce the template schedule, with donation of continuations to be stolen. Strict, unordered replay (StUWS): reproduce the template schedule but allow the order to deviate, respecting the application's dependencies. Relaxed work-stealing replay (RelWS): reproduce the template schedule as much as possible, but allow workers to deviate when they are idle by further stealing work.
24 How good are the schedulers? (Table: execution times for the traced run, StOWS, StUWS, and RelWS on heat, floyd-warshall, fdtd, cg, mg, and parallel prefix; values omitted.) Relaxed work stealing incurs some overhead because it combines replay and work stealing.
25 Relaxed Work Stealing: Adaptability I. Slow down one out of 80 workers by 4x. (Plot: execution time in seconds of fib(48) under the default schedule, under StOWS with the slow worker, and under RelWS with the slow worker.)
26 Relaxed Work Stealing: Adaptability II. Relaxed replay of a schedule from (p-10) workers on p workers. (Plot: execution time in seconds of fib(48) with the default schedule on p-10 threads, the default schedule on p threads, and RelWS(p-10) on p threads, for p = 20, 40, 60.)
27 Relaxed Work Stealing: Adaptability III. Relaxed work stealing of fib(54) with a schedule from fib(48). (Table: execution times for the traced run, StOWS, StUWS, and RelWS on fib(48) and fib(48+6); values omitted.)
28 Retentive Stealing. Reference: Work stealing and persistence-based load balancers for iterative overdecomposed applications. HPDC '12.
29 Iterative Applications. Applications that repeatedly execute the same computation; many scientific applications are iterative, with static or slowly evolving execution characteristics. Execution characteristics preclude static balancing: application characteristics (communication pattern, sparsity, ...) and the execution environment (topology, asymmetry, ...). (Figure: loop "do work" until done.)
30 Retentive Work Stealing. (Figure: local queues of procs 1..n seeded from the tasks each actually executed in the previous iteration.) Intuition: stealing indicates poor initial balance.
31 Retentive Stealing. Use work stealing to load balance within each phase (persistence-based load balancers only rebalance across phases); begin the next iteration with a trace of the previous iteration's schedule.
32 Retentive Stealing Results. (Plots, HF-Be512 on Titan: parallel efficiency and average tasks per core vs. core count for StealRet after 1, 2, 5, 10, and 14 iterations; number of successful steals vs. core count after 1, 2, and 5 iterations.) Retentive stealing stabilizes stealing costs.
33 Retentive Stealing Space Overhead: HF. (Plot: trace size in KB per core per iteration across core counts; execution on Titan.) The space overhead increases but remains manageable across iterations.
34 Retentive Stealing Space Overhead: TCE. (Plot: trace size in KB per core per iteration across core counts; execution on Titan.) The space overhead stays the same across iterations.
35 Data Locality Optimization: NUMA Locality. Reference: Optimizing data locality for fork/join programs using constrained work stealing. SC '14.
36 Constrained Schedules in OpenMP

#pragma omp parallel for schedule(static)
for (i = 0; i < size; i++)
    A[i] = B[i] = 0; // init
#pragma omp parallel for schedule(static)
for (i = 0; i < size; i++)
    B[i] = A[i]; // memcpy

Empirical study: the two loops are naturally matched, leading to good performance. Parallel memory copy of 8 GB of data using OpenMP schedule(static), on an 80-core system with eight NUMA domains and a first-touch policy. Execution time: 169 ms.
37 Cilk Scheduling

cilk_for (i = 0; i < size; i++)
    A[i] = B[i] = 0; // init
cilk_for (i = 0; i < size; i++)
    B[i] = A[i]; // memcpy

Empirical study: random work stealing mismatches the initialization and subsequent use, causing performance degradation. Parallel memory copy of 8 GB of data using MIT Cilk or OpenMP 3.0 tasks. Execution time: 436 ms (Cilk/OMP tasks) vs. 169 ms (OMP static).
38 Can we constrain the scheduler to improve NUMA locality?
39 Solution: Evolve a Schedule. Capture an application phase's steal tree. If the load is balanced, localize data and replay with strict ordered replay; otherwise, and whenever load imbalance is observed during replay, adapt the schedule using relaxed work stealing.
40 Alternative Strategy: Manual Steal Tree Construction. Explicit markup of the steal tree in the user program; useful in non-iterative applications.
41 Data Redistribution Cost. In the first few iterations, data is redistributed (copied) to match a given schedule. (Plot: execution time in seconds vs. number of threads for mg, fdtd, floyd-warshall, heat, cg.)
42 Benchmarks: Iterative, Matching Structure. (Plots: speedup vs. number of threads for heat and floyd-warshall, comparing Cilk interleave, OMP tasks (interleave), and Constrained Iter. RelWS.) Extract a template schedule, apply RelWS for five iterations until convergence, then use StOWS.
43 Benchmarks: Iterative, Differing Structure. (Plot: speedup vs. number of threads for Cilk first-touch, Cilk interleave, OMP tasks (interleave), OMP static (first-touch), Constrained Iter. RelWS, Constrained StUWS, and Constrained User-Specified.) Start with random work stealing on the kernel, refine with RelWS until convergence, then use StOWS.
44 Benchmarks: Iterative, Multiple Structures. (Plot: speedup vs. number of threads for the same scheduler configurations.) We evaluate two approaches: using the same schedule across all kernels, and using a different schedule for each kernel.
45 Benchmarks: Non-iterative, Matching Structure. (Plot: speedup vs. number of threads for the same scheduler configurations.) Reuse the schedule from initialization for other phases with StUWS.
46 Task Granularity Selection. Reference: Optimizing data locality for fork/join programs using constrained work stealing. SC '14.
47 Task Granularity Selection. A key challenge for task-parallel programs: the trade-off between exposing more concurrency and achieving good sequential performance with a coarse grain size.
48 Observation. Concurrency only needs to be exposed to achieve load balance; once the load is balanced, exposed concurrency can be turned off. We can coax the scheduler into selecting coarser-grained work units.
49 Iterative Granularity Selection. Start with a small grain size and obtain a schedule via relaxed work stealing. Then, while coarsening is still needed, drop steal tree leaves and replace them with sequential (coarse) tasks.
50 Dynamic Granularity Selection: heat. (Plots: speedup vs. number of threads for 64x8192-, 64x1024-, and 64x256-block iterative variants and a 64x256-block dynamic variant; histograms of selected block sizes, 64x512 to 64x16k, over three samples.) Iterative locality optimization combined with grain size selection.
51 Dynamic Granularity Selection: cg. (Plots: speedup vs. number of threads for 1024-, 128-, and 32-row iterative variants and a 32-row dynamic variant; histograms of selected row counts, 512 to 16k rows, over three samples.)
52 Data Race Detection. (Figure: two tasks t1 and t2 accessing A.) Reference: Steal tree: low-overhead tracing of work stealing schedulers. PLDI '13.
53 Data Race Detection. Detect conflicting operations in a fork/join program. Key check: determine whether two memory operations can execute in parallel, under any possible schedule.
54 Dynamic Program Structure Tree (DPST)

finish { //F1
    step1;
    async { //A1
        step2;
        async { //A2
            step3;
        }
        step4;
    }
    step5;
}

Two steps s1 and s2 may execute in parallel if: l1 is the least common ancestor (LCA) of s1 and s2 in the DPST; c1 is an ancestor of s1 and an immediate child of l1; and c1 is an async node.
55 Steal-Tree-Aided LCA Computation. The nodes of the DPST can be annotated with the steal tree nodes they belong to. Data race detection involves multiple walks of the DPST for each memory access checked.

lca(s1, s2):
    if (s1.st_node == s2.st_node)
        return dpst_lca(s1, s2); // DPST walk within one steal-tree node
    if (s1.st_node.level > s2.st_node.level)
        return lca(s1.st_node.victim, s2);
    return lca(s1, s2.st_node.victim);
56 Application: Data Race Detection. (Plot: percent reduction at 4 to 120 threads for AllQueens, LU, Heat, NBody, Matmul.) Significant reduction in the number of DPST edges traversed.
57 Other Results. Locality-aware task graph scheduling: color-based constraints on work stealing schedulers. Cache locality optimization: effect-based splicing of concurrent tasks to improve cache locality. Speculative work stealing: exposing greater concurrency. Localized parallel failure recovery.
58 Lessons Learned. Random work stealing with the ability to constrain its behavior can bring several benefits. Steal trees are useful in a variety of contexts: retentive stealing, data locality optimization, task granularity selection, and data race detection. We need to design interfaces to programmatically extract and use work stealing schedules.
59 Continuing Research Challenges. Recursive program specification: enabling users to express high-level intent and properties. Compiler analysis and transformation. Runtime techniques: scheduling and load balancing, fault tolerance, power/energy efficiency, data locality. Correctness and performance tools. Architectural and other low-level support for such abstractions.
60 Conclusions. Abstractions supporting task parallelism can meet performance and programmability challenges. Runtime systems can adapt productively: changing the load balancer or adding fault tolerance involved no change to user code. Maturing an execution paradigm requires a great deal of research and experience.
61 Thank You!
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class
More informationMulticore programming in CilkPlus
Multicore programming in CilkPlus Marc Moreno Maza University of Western Ontario, Canada CS3350 March 16, 2015 CilkPlus From Cilk to Cilk++ and Cilk Plus Cilk has been developed since 1994 at the MIT Laboratory
More informationCS420: Operating Systems
Threads James Moscola Department of Physical Sciences York College of Pennsylvania Based on Operating System Concepts, 9th Edition by Silberschatz, Galvin, Gagne Threads A thread is a basic unit of processing
More informationUni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing
Uni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing Shigeki Akiyama, Kenjiro Taura The University of Tokyo June 17, 2015 HPDC 15 Lightweight Threads Lightweight threads enable
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationScalable Algorithmic Techniques Decompositions & Mapping. Alexandre David
Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability
More informationLoad Balancing. Minsoo Ryu. Department of Computer Science and Engineering. Hanyang University. Real-Time Computing and Communications Lab.
Load Balancing Minsoo Ryu Department of Computer Science and Engineering 2 1 Concepts of Load Balancing Page X 2 Load Balancing Algorithms Page X 3 Overhead of Load Balancing Page X 4 Load Balancing in
More informationCS 475: Parallel Programming Introduction
CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.
More informationShared-memory Parallel Programming with Cilk Plus
Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 19 January 2017 Outline for Today Threaded programming
More informationParallel Programming. March 15,
Parallel Programming March 15, 2010 1 Some Definitions Computational Models and Models of Computation real world system domain model - mathematical - organizational -... computational model March 15, 2010
More informationCUDA GPGPU Workshop 2012
CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline
More informationOpenMP on the IBM Cell BE
OpenMP on the IBM Cell BE PRACE Barcelona Supercomputing Center (BSC) 21-23 October 2009 Marc Gonzalez Tallada Index OpenMP programming and code transformations Tiling and Software Cache transformations
More information15-210: Parallelism in the Real World
: Parallelism in the Real World Types of paralellism Parallel Thinking Nested Parallelism Examples (Cilk, OpenMP, Java Fork/Join) Concurrency Page1 Cray-1 (1976): the world s most expensive love seat 2
More informationA Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores
A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores Marc Tchiboukdjian Vincent Danjean Thierry Gautier Fabien Le Mentec Bruno Raffin Marc Tchiboukdjian A Work Stealing Scheduler for
More informationCOMP Parallel Computing. SMM (2) OpenMP Programming Model
COMP 633 - Parallel Computing Lecture 7 September 12, 2017 SMM (2) OpenMP Programming Model Reading for next time look through sections 7-9 of the Open MP tutorial Topics OpenMP shared-memory parallel
More informationAsymmetry-Aware Work-Stealing Runtimes
Asymmetry-Aware Work-Stealing Runtimes Christopher Torng, Moyang Wang, and Christopher atten School of Electrical and Computer Engineering Cornell University 43rd Int l Symp. on Computer Architecture,
More informationAcknowledgments. Amdahl s Law. Contents. Programming with MPI Parallel programming. 1 speedup = (1 P )+ P N. Type to enter text
Acknowledgments Programming with MPI Parallel ming Jan Thorbecke Type to enter text This course is partly based on the MPI courses developed by Rolf Rabenseifner at the High-Performance Computing-Center
More informationMultithreading in C with OpenMP
Multithreading in C with OpenMP ICS432 - Spring 2017 Concurrent and High-Performance Programming Henri Casanova (henric@hawaii.edu) Pthreads are good and bad! Multi-threaded programming in C with Pthreads
More informationA Course-Based Usability Analysis of Cilk Plus and OpenMP. Michael Coblenz, Robert Seacord, Brad Myers, Joshua Sunshine, and Jonathan Aldrich
A Course-Based Usability Analysis of Cilk Plus and OpenMP Michael Coblenz, Robert Seacord, Brad Myers, Joshua Sunshine, and Jonathan Aldrich 1 PROGRAMMING PARALLEL SYSTEMS Parallel programming is notoriously
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationCS 267 Applications of Parallel Computers. Lecture 23: Load Balancing and Scheduling. James Demmel
CS 267 Applications of Parallel Computers Lecture 23: Load Balancing and Scheduling James Demmel http://www.cs.berkeley.edu/~demmel/cs267_spr99 CS267 L23 Load Balancing and Scheduling.1 Demmel Sp 1999
More informationConcurrent Programming with OpenMP
Concurrent Programming with OpenMP Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico October 11, 2012 CPD (DEI / IST) Parallel and Distributed
More informationJoe Hummel, PhD. Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago.
Joe Hummel, PhD Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago email: joe@joehummel.net stuff: http://www.joehummel.net/downloads.html Async programming:
More informationFractals. Investigating task farms and load imbalance
Fractals Investigating task farms and load imbalance Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationOpenMP. A parallel language standard that support both data and functional Parallelism on a shared memory system
OpenMP A parallel language standard that support both data and functional Parallelism on a shared memory system Use by system programmers more than application programmers Considered a low level primitives
More informationLecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory
Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory 1 Example Programs Initially, A = B = 0 P1 P2 A = 1 B = 1 if (B == 0) if (A == 0) critical section
More informationIntroduction to OpenMP
Christian Terboven, Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group terboven,schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University History De-facto standard for Shared-Memory
More informationCS 140 : Feb 19, 2015 Cilk Scheduling & Applications
CS 140 : Feb 19, 2015 Cilk Scheduling & Applications Analyzing quicksort Optional: Master method for solving divide-and-conquer recurrences Tips on parallelism and overheads Greedy scheduling and parallel
More informationShared Memory programming paradigm: openmp
IPM School of Physics Workshop on High Performance Computing - HPC08 Shared Memory programming paradigm: openmp Luca Heltai Stefano Cozzini SISSA - Democritos/INFM
More informationParallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.
Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader Challenges of Design Verification Contemporary hardware
More information