Runtime Support for Scalable Task-parallel Programs


1 Runtime Support for Scalable Task-parallel Programs. Pacific Northwest National Lab, xsig workshop, May

2 Single Program Multiple Data: every process runs the same int main () { ... }

3 Task Parallelism: a single int main () { ... } spawns tasks for the runtime to schedule

4 Task Parallelism: int main () { ... }
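
The figures accompanying these slides did not survive transcription. As a rough, hypothetical illustration (not from the talk), the contrast can be written in plain C++: under SPMD every process runs the same main() and selects its share of the data by rank, while a task-parallel program recursively spawns tasks and leaves their placement to the runtime; std::async stands in for a task spawn here.

    #include <cstddef>
    #include <future>
    #include <vector>

    // SPMD flavor: rank/nprocs partitioning (MPI-style, shown without MPI).
    long spmd_partial_sum(int rank, int nprocs, const std::vector<int>& data) {
        std::size_t chunk = data.size() / nprocs;
        std::size_t lo = rank * chunk;
        std::size_t hi = (rank == nprocs - 1) ? data.size() : lo + chunk;
        long sum = 0;
        for (std::size_t i = lo; i < hi; i++) sum += data[i];
        return sum;  // partial results would be combined with a reduction
    }

    // Task-parallel flavor: divide and conquer; the scheduler decides placement.
    long task_sum(const std::vector<int>& data, std::size_t lo, std::size_t hi) {
        if (hi - lo < 1024) {  // small ranges run sequentially
            long s = 0;
            for (std::size_t i = lo; i < hi; i++) s += data[i];
            return s;
        }
        std::size_t mid = lo + (hi - lo) / 2;
        auto left = std::async(std::launch::async, task_sum, std::cref(data), lo, mid);
        long right = task_sum(data, mid, hi);
        return left.get() + right;  // join, as a finish would
    }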

5 Task-parallel Abstractions. Finer specification of concurrency, data locality, and dependences conveys more application information to the compiler and runtime; an adaptive runtime system manages the tasks. The application writer specifies the computation and writes optimizable code; tools transform that code to generate an efficient implementation.

6 The Promise. The application writer specifies the computation; the computation is mapped to a specific execution environment by the software stack. We are transferring some of the burden away from the programmer.

7 The Challenge. We are transferring some of the burden to the software stack. Handling a million MPI processes is supposed to be hard; how about billions of tasks? What about the software ecosystem?

8 Tracing and Constraining Work Stealing Schedulers

9 Research Directions: concurrency management and tracing, dynamic load balancing, data locality optimization, task granularity selection, and data race detection.

10 Recursive Task Parallelism

    fn() {
      s1;
      async {              // A1
        s2;
        finish async s3;   // A2
        s4;
      }
      async s5;            // A3
      s6;
    }

[Figure: the corresponding series-parallel tree, with series (S) and parallel (P) nodes over the steps s1..s6 and the async tasks A1, A2, A3.]

11 Work Stealing. A worker begins with one or a few tasks; tasks spawn more tasks; when a worker is out of tasks, it steals from another worker. A popular scheduling strategy for recursive parallel programs, and a well-studied load balancing strategy: provably efficient scheduling with understandable space and time bounds.
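
As a concrete sketch of the strategy just described (illustrative only, not the talk's runtime): each worker treats its own deque as a LIFO stack and steals from the FIFO end of a random victim when it runs dry. The names and the mutex-based locking are assumptions; production work-stealing deques use lock-free protocols.

    #include <atomic>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <random>
    #include <vector>

    struct Worker {
        std::deque<std::function<void()>> tasks;
        std::mutex lock;
    };

    // `pending` counts outstanding tasks; a task that spawns another would
    // increment it before pushing onto the spawner's deque.
    void worker_loop(int id, std::vector<Worker>& workers, std::atomic<int>& pending) {
        std::mt19937 rng(id + 1);
        while (pending.load() > 0) {
            std::function<void()> task;
            {   // try local work first (LIFO end of own deque)
                std::lock_guard<std::mutex> g(workers[id].lock);
                if (!workers[id].tasks.empty()) {
                    task = std::move(workers[id].tasks.back());
                    workers[id].tasks.pop_back();
                }
            }
            if (!task) {  // out of tasks: steal from a random victim's FIFO end
                int victim = static_cast<int>(rng() % workers.size());
                if (victim != id) {
                    std::lock_guard<std::mutex> g(workers[victim].lock);
                    if (!workers[victim].tasks.empty()) {
                        task = std::move(workers[victim].tasks.front());
                        workers[victim].tasks.pop_front();
                    }
                }
            }
            if (task) { task(); pending.fetch_sub(1); }
        }
    }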

12 Objective. Trace execution under work stealing, exploit information from the trace to perform various optimizations, and constrain the scheduler to obtain desired behavior.

13 Tracing. Steal tree: low-overhead tracing of work stealing schedulers. PLDI '13.

14 Tracing Work Stealing. Record when and where each task executed; capture the order of events for online and offline analysis. Challenges: the sheer size of the trace, and application perturbation that might make tracing impractical.

15 Tracing Approach: Illustration. [Figure: a worker's deque of tasks, with async steps and steals marked.] Steals occur in order of levels, with almost one steal per level.
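
The key to the small traces on the following slides is what gets recorded. A hedged sketch of the idea, with illustrative field names rather than the paper's exact encoding: only steals are logged, and everything a worker executes between steals is implied by the program structure.

    #include <cstdint>
    #include <vector>

    struct StealRecord {
        uint32_t victim;  // worker the task was stolen from
        uint32_t level;   // depth in the victim's spawn tree at the steal
        uint32_t step;    // which async in the victim's working phase was taken
    };

    struct WorkerTrace {
        std::vector<StealRecord> steals;  // tasks executed between steals are
                                          // implied, so the trace stays tiny
    };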

16 Space Overhead: Shared Memory. [Figure: total trace space (KB, log scale) for Cholesky, AllQueens, Heat, Fib, and other benchmarks across problem sizes, comparing full task enumeration (Enum) against StealTree on 2 to 80 cores; StealTree traces stay around or below 1MB.] Trace sizes are small and are little affected by core count or problem size.

17 Space Overhead: Distributed Memory. [Figure: trace size (KB/core) across core counts for AQ, SCF, TCE, and PG benchmarks under help-first (HF) and work-first (WF) policies.] Still less than 160MB in total at the largest core counts.

18 Time Overhead: Shared Memory. [Figure: tracing time overhead across thread counts up to 120 for AllQueens, Heat, Fib, FFT, Strassen, NBody, Cholesky, LU, and Matmul.] The time overhead is within the variation in execution time.

19 Time Overhead: Distributed Memory. [Figure: tracing time overhead across core counts for AQ, SCF, TCE, and PG under HF and WF policies.] The time overhead is within the variation in execution time.

20 What can we do with a steal tree?

21 Visualization. Core utilization plot over time; Cilk LU benchmark on 24 cores; trace size <100KB.

22 Replay. Optimizing data locality for fork/join programs using constrained work stealing. SC '14.

23 Replay Schedulers. Strict, ordered replay (StOWS): exactly reproduce the template schedule, via donation of continuations to be stolen. Strict, unordered replay (StUWS): reproduce the template schedule, but allow the order to deviate (respecting the application's dependencies). Relaxed work-stealing replay (RelWS): reproduce the template schedule as much as possible, but allow workers to deviate when they are idle by further stealing work.
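
The three schedulers differ only in when a steal is permitted. A minimal sketch of that policy (parameter names are illustrative, not the paper's API):

    enum class Mode { StOWS, StUWS, RelWS };

    bool may_steal(Mode mode, bool matches_template, bool in_template_order,
                   bool thief_idle) {
        switch (mode) {
            case Mode::StOWS:  // strict, ordered: exact schedule, exact order
                return matches_template && in_template_order;
            case Mode::StUWS:  // strict, unordered: exact schedule, any legal order
                return matches_template;
            case Mode::RelWS:  // relaxed: prefer the template, deviate when idle
                return matches_template || thief_idle;
        }
        return false;
    }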

24 How good are the schedulers? [Table: execution times of the traced run versus StOWS, StUWS, and RelWS for heat, floyd-warshall, fdtd, cg, mg, and parallel prefix.] Relaxed work stealing incurs some overhead because it combines replay and work stealing.

25 Relaxed Work Stealing: Adaptability I. Slow down one out of 80 workers by 4x. [Figure: execution time (s) of fib(48) for the baseline, StOWS with the slow worker, and RelWS with the slow worker.]

26 Relaxed Work Stealing: Adaptability II. Relaxed replay of a schedule from (p-10) workers on p workers. [Figure: execution time (s) of fib(48) with default stealing on p-10 threads, default stealing on p threads, and RelWS replaying the (p-10)-worker schedule on p workers, for p = 20, 40, 60.]

27 Relaxed Work Stealing: Adaptability III. Relaxed work stealing of fib(54) with a schedule from fib(48). [Table: execution times under StOWS, StUWS, and RelWS for fib(48) and fib(48+6) = fib(54) replayed against the fib(48) trace.]

28 Retentive Stealing. Work stealing and persistence-based load balancers for iterative overdecomposed applications. HPDC '12.

29 Iterative Applications. Applications repeatedly executing the same computation; many scientific applications are iterative, with static or slowly evolving execution characteristics. Execution characteristics preclude static balancing: application characteristics (communication pattern, sparsity, ...) and the execution environment (topology, asymmetry, ...). [Flowchart: do work; if not done, repeat.]

30 Retentive Work Stealing. [Figure: local queues seeded for Proc 1..Proc n versus the tasks each processor actually executed.] Intuition: stealing indicates poor initial balance.

31 Retentive stealing. Use work stealing to load balance within each phase (persistence-based load balancers only rebalance across phases), and begin the next iteration with a trace of the previous iteration's schedule, as sketched below.
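
A minimal sketch of this loop, with stand-in types (the real runtime's task descriptors and stealing engine are assumed): iteration i+1's queues are seeded with whatever each worker actually executed in iteration i, so steals taper off over time.

    #include <deque>
    #include <vector>

    using Task = int;  // stand-in for a real task descriptor

    struct Worker {
        std::deque<Task> queue;      // tasks seeded for the current iteration
        std::vector<Task> executed;  // filled in as the worker runs its tasks
    };

    // Stub: within-phase execution with work stealing (see slide 11's loop).
    void run_phase_with_stealing(std::vector<Worker>& workers) { /* ... */ }

    void run_iterations(std::vector<Worker>& workers, int iters) {
        for (int it = 0; it < iters; it++) {
            for (auto& w : workers) w.executed.clear();
            run_phase_with_stealing(workers);  // steals fix residual imbalance
            for (auto& w : workers)            // trace becomes the next seed
                w.queue.assign(w.executed.begin(), w.executed.end());
        }
    }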

32 Retentive stealing results. [Figure: parallel efficiency (%) and number of successful steals versus core count across iterations (StealRet-1 through StealRet-14; SuccSteals-1 through SuccSteals-5), with average tasks per core, for HF-Be512 on Titan.] Retentive stealing stabilizes stealing costs.

33 Retentive Stealing Space Overhead: HF. [Figure: trace size (KB/core) per iteration across core counts; execution on Titan.] The space overhead increases but remains manageable across iterations.

34 Retentive Stealing Space Overhead: TCE. [Figure: trace size (KB/core) per iteration across core counts; execution on Titan.] The space overhead stays the same across iterations.

35 Data Locality Optimization: NUMA Locality. Optimizing data locality for fork/join programs using constrained work stealing. SC '14.

36 Constrained Schedules in OpenMP

    #pragma omp parallel for schedule(static)
    for (i = 0; i < size; i++)
      A[i] = B[i] = 0;   // init
    #pragma omp parallel for schedule(static)
    for (i = 0; i < size; i++)
      B[i] = A[i];       // memcpy

Empirical study: the two loops are naturally matched (each thread touches the same elements in both), leading to good performance. Parallel memory copy of 8GB of data using OpenMP schedule(static), on an 80-core system with eight NUMA domains and a first-touch policy. Execution time: 169ms.

37 Cilk Scheduling

    cilk_for (i = 0; i < size; i++)
      A[i] = B[i] = 0;   // init
    cilk_for (i = 0; i < size; i++)
      B[i] = A[i];       // memcpy

Empirical study: random work stealing mismatches the initialization and the subsequent use, causing performance degradation. Parallel memory copy of 8GB of data using MIT Cilk or OpenMP 3.0 tasks. Execution time: 436ms (Cilk/OMP task) vs. 169ms (OMP static).

38 Can we constrain the scheduler to improve NUMA locality?

39 Solution: Evolve a Schedule. Capture an application phase's steal tree. If the load is not balanced, adapt the schedule using relaxed work stealing; once it is balanced, localize the data to the schedule and use strict ordered replay. If load imbalance is observed later, adapt the schedule again with relaxed work stealing. (A sketch of this loop follows.)
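
The same control flow written out as code; every type and function below is a placeholder for the real runtime's interfaces, not an actual API:

    struct Application { bool more_phases; };
    struct StealTree {};

    StealTree trace_phase(Application&);          // capture a phase's steal tree
    bool is_balanced(const StealTree&);
    bool imbalance_observed(const Application&);
    StealTree run_relaxed_replay(Application&, const StealTree&);    // RelWS
    void localize_data(Application&, const StealTree&);
    void run_strict_ordered_replay(Application&, const StealTree&);  // StOWS

    void evolve_schedule(Application& app) {
        StealTree schedule = trace_phase(app);
        while (app.more_phases) {
            if (!is_balanced(schedule)) {
                schedule = run_relaxed_replay(app, schedule);  // adapt the schedule
            } else {
                localize_data(app, schedule);              // move data to match it
                run_strict_ordered_replay(app, schedule);  // then replay exactly
                if (imbalance_observed(app))               // environment changed?
                    schedule = run_relaxed_replay(app, schedule);
            }
        }
    }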

40 Alternative Strategy: Manual Steal Tree Construction. Explicit markup of the steal tree in the user program; useful in non-iterative applications.

41 Data Redistribution Cost. In the first few iterations, data is redistributed (copied) to match a given schedule. [Figure: execution time (sec) versus number of threads for mg, fdtd, floyd-warshall, heat, and cg.]

42 Benchmarks: Iterative Matching Structure. [Figure: speedup versus number of threads for heat and floyd-warshall, comparing Cilk interleave, OMP tasks (interleave), and Constrained Iter. RelWS.] Extract a template schedule, apply RelWS for five iterations until convergence, then use StOWS.

43 Benchmarks: Iterative Differing Structure. [Figure: speedup versus number of threads, comparing Cilk first-touch, Cilk interleave, OMP tasks (interleave), OMP static (first-touch), Constrained Iter. RelWS, Constrained StUWS, and Constrained User-Specified.] Start with random work stealing on the kernel, refine with RelWS until convergence, then use StOWS.

44 Benchmarks: Iterative Multiple Structures. [Figure: speedup versus number of threads for the same set of schedulers as the previous slide.] We evaluate two approaches: using the same schedule across all kernels, and using a different schedule for each kernel.

45 Benchmarks: Non-iterative Matching Structure. [Figure: speedup versus number of threads for the same set of schedulers.] Reuse the schedule from initialization for the other phases with StUWS.

46 Task Granularity Selection. Optimizing data locality for fork/join programs using constrained work stealing. SC '14.

47 Task granularity selection is a key challenge for task-parallel programs. The trade-off: expose more concurrency, versus achieve good sequential performance with a coarse grain size.

48 Observation. Concurrency only needs to be exposed to achieve load balance; once the load is balanced, exposed concurrency can be turned off. We can coax the scheduler to select coarser-grained work units.

49 Iterative Granularity Selection. Start with a small grain size; run relaxed work stealing to obtain a schedule; while coarsening is needed, drop steal tree leaves and replace them with sequential (coarse) tasks (see the sketch below).
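
A sketch of the coarsening step on the steal tree (types are illustrative, not the paper's data structures): a subtree in which no steal occurred can be collapsed into a single sequential task.

    #include <vector>

    struct Node {
        std::vector<Node> children;
        bool was_stolen_from = false;
        bool sequential = false;  // run this subtree as one coarse task
    };

    void coarsen(Node& n) {
        bool any_steal_below = n.was_stolen_from;
        for (auto& c : n.children) {
            coarsen(c);
            any_steal_below |= !c.sequential;  // child still has parallelism
        }
        if (!any_steal_below) {   // no steals here or below:
            n.children.clear();   // drop the leaves and
            n.sequential = true;  // execute the subtree sequentially
        }
    }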

50 Dynamic Granularity Selection: heat. [Figure: speedup versus number of threads for 64x8192-, 64x1024-, and 64x256-block iterative selection and 64x256-block dynamic selection; histogram of selected grain sizes (64x512 to 64x16k) over three samples.] Iterative locality optimization with grain size selection.

51 Dynamic Granularity Selection: cg. [Figure: speedup versus number of threads for 1024-, 128-, and 32-row iterative selection and 32-row dynamic selection; histogram of selected row counts (512 to 16k rows) over three samples.]

52 Data Race Detection. [Figure: two concurrent tasks t1 and t2 accessing a shared location A.] Steal tree: low-overhead tracing of work stealing schedulers. PLDI '13.

53 Data Race Detection. Detect conflicting operations in a fork/join program. Key check: determine whether two memory operations can execute in parallel, for any possible schedule.

54 Dynamic Program Structure Tree (DPST)

    finish {      // F1
      step1;
      async {     // A1
        step2;
        async {   // A2
          step3;
        }
        step4;
      }
      step5;
    }

Two steps s1 and s2 may execute in parallel if: l1 is the least common ancestor (LCA) of s1 and s2 in the DPST, c1 is an ancestor of s1 and an immediate child of l1, and c1 is an async node.
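
The slide's condition written out as code. DNode and dpst_lca are illustrative stand-ins for the real DPST structures; in the DPST formulation, s1 denotes the step that comes first in depth-first order.

    enum class Kind { Finish, Async, Step };

    struct DNode {
        Kind kind;
        DNode* parent;  // root has parent == nullptr
    };

    DNode* dpst_lca(DNode* a, DNode* b);  // assumed: plain LCA walk of the DPST

    // Walk up from s1 to its ancestor that is an immediate child of l1.
    DNode* child_of_lca(DNode* s1, DNode* l1) {
        DNode* c = s1;
        while (c->parent != l1) c = c->parent;
        return c;
    }

    bool may_happen_in_parallel(DNode* s1, DNode* s2) {
        DNode* l1 = dpst_lca(s1, s2);
        DNode* c1 = child_of_lca(s1, l1);
        return c1->kind == Kind::Async;  // parallel iff that child is an async
    }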

55 Steal-Tree Aided LCA Computation. The nodes of the DPST can be annotated with the steal tree nodes they belong to. Data race detection involves multiple walks of the DPST for each memory access checked; the steal-tree annotation shortens those walks:

    lca(s1, s2):
      if (s1.st_node == s2.st_node)
        return dpst_lca(s1, s2);  // ordinary DPST walk
      if (s1.st_node.level > s2.st_node.level)
        return lca(s1.st_node.victim, s2);
      return lca(s1, s2.st_node.victim);

56 Application: Data Race Detection. [Figure: percent reduction across thread counts up to 120 for AllQueens, LU, Heat, NBody, and Matmul.] Significant reduction in the number of DPST edges traversed.

57 Other Results. Locality-aware task graph scheduling: color-based constraints on work stealing schedulers. Cache locality optimization: effect-based splicing of concurrent tasks to improve cache locality. Speculative work stealing: exposing greater concurrency. Localized parallel failure recovery.

58 Lessons Learned. Random work stealing, with the ability to constrain its behavior, can bring several benefits. Steal trees are useful in a variety of contexts: retentive stealing, data locality optimization, task granularity selection, and data race detection. We need to design interfaces to programmatically extract and use work stealing schedules.

59 Continuing Research Challenges. Recursive program specification: enabling the user to express high-level intent and properties. Compiler analysis and transformation. Runtime techniques: scheduling and load balancing, fault tolerance, power/energy efficiency, and data locality. Correctness and performance tools. Architectural and other low-level support for such abstractions.

60 Conclusions. Abstractions supporting task parallelism can meet performance and programmability challenges. Runtime systems can adapt productively: changing the load balancer or adding fault tolerance involved no change in the user code. Maturing an execution paradigm requires a great deal of research and experience.

61 Thank You!
