A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores

Size: px
Start display at page:

Download "A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores"

Transcription

1 A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores Marc Tchiboukdjian Vincent Danjean Thierry Gautier Fabien Le Mentec Bruno Raffin Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 1/24

2 Shared Cache of Multicore Processors 1. One core with 2 cache levels L 2 L 1 Core Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 2/24

3 Shared Cache of Multicore Processors 1. One core with 2 cache levels 2. Multiple cores with private caches L 2 L 2 L 2 L 2 L 1 L 1 L 1 L 1 Core Core Core Core Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 2/24

4 Shared Cache of Multicore Processors 1. One core with 2 cache levels 2. Multiple cores with private caches 3. Multiple cores with private caches and one shared cache L 3 L 2 L 2 L 2 L 2 L 1 L 1 L 1 L 1 Core Core Core Core The sequential has an unfair advantage. Can we still get linear speedup? Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 2/24

5 Scheduling for Efficient Shared Cache Usage Schedule the computation so that shared cache misses do not increase. with a work stealing scheduler allowing efficient dynamic load balancing for parallel loops: no dependencies between tasks 8MB 256KB 256KB 256KB 256KB 32KB 32KB 32KB 32KB Core 0 Core 1 Core 2 Core 3 Xeon Nehalem E5530 (2.4Ghz) Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 3/24

6 Overview Cache Efficient Work Stealing Scheduling for Parallel Loops 1. Standard Schedulers for Parallel Loops 2. New Scheduler Optimized for Shared Cache 3. Efficient Implementation of the Scheduler 4. Experiments Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 4/24

7 Parallel Loops Examples of parallel loops OpenMP #pragma omp parallel for TBB parallel for Cilk parallel for Parallel STL for each or transform Problem characteristics Schedule n iterations on p cores Iterations can be processed independently Time to process one iteration can vary Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 5/24

8 Static Scheduling of Parallel Loops Static Scheduling Allocate n p iterations to each core ex: OpenMP static scheduling core 1 core 2 core 3 core 4 Characteristics low overhead mechanism bad load balancing if workload is irregular Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 6/24

9 Dynamic Scheduling of Parallel Loops OpenMP dynamic scheduling Allocate iterations in chunks of size g All chunks are stored in a centralized list Each thread remove a chunk from the list and process it Characteristics Good load balancing Contention on the list Chunk creation overhead g core 1 core 2 core 3 core 4 Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 7/24

10 Scheduling Parallel Loops with Work Stealing Work Stealing Each thread has its own list of tasks (= chunks) If list is empty, steal tasks in a randomly selected list Binary tree of tasks to minimize number of steals: one steal half of the iterations Characteristics steal Good load balancing Contention is reduced Task creation overhead core 1 core 2 core 3 core 4 Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 8/24

11 Scheduling Parallel Loops with Work Stealing [0, 127] [0, 63] [64, 127] [0, 31] [32, 63] [0, 15] [16, 31] terminated tasks ready tasks running tasks Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 9/24

12 Scheduling Parallel Loops with Work Stealing [0, 127] steal [64, 127] [0, 63] [64, 95] [96, 127] [0, 31] [32, 63] [64, 79] [80, 95] [0, 15] [16, 31] terminated tasks ready tasks running tasks Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 9/24

13 Parallel Loops with Xkaapi Xkaapi Work stealing library Tasks are created on a steal: reduce task creation overhead Cooperative stealing: The victim stops working to answer work requests The victim can answer to multiple requests at a time Characteristics Good load balancing Low overhead mechanism typedef struct { InputIterator ibeg; InputIterator iend; } Work_t ; // Task void parallel_for (...) { while (iend!= ibeg) do_work ( ibeg++ ) ; } // no more work -> become a thief void splitter ( num_req ) { i = 0 ; size = victim.iend - victim.ibeg ; bloc = size / ( num_req + 1 ) ; local_end = victim.iend ; while ( num_req > 0 ) { thief->iend = local_end ; thief->ibeg = local_end - bloc ; local_end -= bloc ; --num_req ; } } // victim + thieves -> parallel_for Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 10/24

14 Overview Cache Efficient Work Stealing Scheduling for Parallel Loops 1. Standard Schedulers for Parallel Loops steal core 1 core 2 core 3 core 4 2. New Scheduler Optimized for Shared Cache 3. Efficient Implementation of the Scheduler 4. Experiments Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 11/24

15 Reuse Distances [Beyls and D Hollander 01] Number of distinct elements accessed between two accesses to the same element. If first access, reuse distance is infinity. On a fully associative LRU cache of size C: reuse distance C hit reuse distance > C miss h d : number of accesses with a reuse distance d number of cache misses M(C) = d=c+1 h d h d C misses d element access A B A B B C B A C reuse distance cache content A A,B A,B A,B A,B B,C B,C A,B Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 12/24

16 Shared Cache Misses with the Classic Schedule Cores work on elements far away Good temporal locality of the sequential algorithm Cores work on distinct data core 1 core 2 shared cache core 3 core 4 To 1 access by a core corresponds p 1 accesses to distinct data by the other cores Reuse distance is multiplied by p: h seq d = h par p d Number of cache misses M par (C) = d=c+1 h par d = d=c+1 h seq d/p = h d d=c/p+1 C p C seq misses d additional misses h seq d = M seq (C/p) Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 13/24

17 A Shared Cache Aware Schedule Cores work at distance at most m r(m) accesses to distinct elements in a window of size m The reuse distance is increased by at most r(m): h seq d = hd+r(m) win Number of cache misses M win (C) d=c+1 r(m) Small Overhead over Sequential good sequential locality r(m) small h d small for large d h seq d = M seq (C) + h d m C d=c+1 r(m) r(m) shared cache h seq d seq misses C d additional misses Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 14/24

18 Overview Cache Efficient Work Stealing Scheduling for Parallel Loops 1. Standard Schedulers for Parallel Loops steal core 1 core 2 core 3 core 4 2. New Scheduler Optimized for Shared Cache shared cache 3. Efficient Implementation of the Scheduler 4. Experiments m Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 15/24

19 Implementing the Shared Cache Aware Schedule Using a standard parallel for loop: StaticWindow Divide the iterations in n/m chunks of size m Each chunk is processed in parallel with a standard parallel for Two versions: pthread and Xkaapi Pthread version each thread processes 1/p of a chunk wait on a barrier after each chunk shared cache Xkaapi version m each chunk is processed in parallel with a parallel for Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 16/24

20 Implementing the Shared Cache Aware Schedule Optimized implementation using Xkaapi: SlidingWindow Processing iteration i enables iteration i + m Master thread is at the beginning of the sequence On a steal, the master can give work In the interval [ibeg, iend[ like the other workers In the interval [ilast, ibeg + m[ enabled since the last steal ibeg iend ilast Processed Elements Master Work Stolen Work Remaining Elements m-size window typedef struct { InputIterator ibeg; InputIterator iend; } Work_t ; // Task typedef struct { InputIterator ibeg; InputIterator iend; InputIterator ilast; } Master_Work_t ; // Master Task Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 17/24

21 Overview Cache Efficient Work Stealing Scheduling for Parallel Loops 1. Standard Schedulers for Parallel Loops steal core 1 core 2 core 3 core 4 2. New Scheduler Optimized for Shared Cache shared cache 3. Efficient Implementation of the Scheduler m ibeg iend ilast Processed Elements Master Work Stolen Work Remaining Elements m-size window 4. Experiments Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 18/24

22 Application: Isosurface Extraction Isosurface extraction Common scientific visualization filter Memory bounded Algorithm ( C ) M seq p MB M seq(c) 8MB Iterate through all cells in the mesh Interpolate surface inside each cell Cells close in the mesh share points Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 19/24

23 Experiments 2 processors Opteron: 2 Dualcores with private L 1 and L 2 caches Nehalem: Quadcore with private L 1, L 2 caches and shared L 3 2 schedules NoWindow: classic schedule (Static or Sliding) Window: shared cache aware schedule 7 implementations NoWindow: Pthread, TBB and Xkaapi StaticWindow: Pthread, TBB and Xkaapi SlidingWindow: Xkaapi Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 20/24

24 Synchronization overhead Speedup on 4 cores on Opteron NoWindow StaticWindow 0 Pthread TBB Xkaapi Pthread < TBB < Xkaapi On Opteron: no shared cache Window < NoWindow more synchronizations without gain in cache misses Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 21/24

25 Window Size m L3 cache misses Static Seq. No window size m On 4 cores of Nehalem Shared 8MB L 3 cache For small m: M window M seq (C) M no window M seq ( C p M no window = ( C ) M seq = p ) Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 22/24

26 Speedup and Cache Misses on Nehalem Speedup T seq /T par L 3 misses ratio M par /M seq No Static No StaticSliding 0 No Static No StaticSliding Pthread Xkaapi Pthread Xkaapi 4 cores of Nehalem with 8MB of shared cache L 3 Best performance: SlidingWindow Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 23/24

27 Conclusion Shared Cache Aware Scheduler Shared cache aware scheduler for parallel loops Efficient implementation using work stealing For application with good sequential locality the window strategy is as good as if each core had its own copy of the L 3 cache shared cache m Future work Experiment with other applications Automatically find window size with reuse distance histogram Cache-oblivious version (cache size unknown)? Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 24/24

A Work Stealing Algorithm for Parallel Loops on Shared Cache Multicores

A Work Stealing Algorithm for Parallel Loops on Shared Cache Multicores A Work Stealing Algorithm for Parallel Loops on Shared Cache Multicores Marc Tchiboukdjian, Vincent Danjean, Thierry Gautier, Fabien Lementec, and Bruno Raffin MOAIS Project, INRIA- LIG ZIRST 51, avenue

More information

Efficient Work Stealing for Fine-Grained Parallelism

Efficient Work Stealing for Fine-Grained Parallelism Efficient Work Stealing for Fine-Grained Parallelism Karl-Filip Faxén Swedish Institute of Computer Science November 26, 2009 Task parallel fib in Wool TASK 1( int, fib, int, n ) { if( n

More information

On the cost of managing data flow dependencies

On the cost of managing data flow dependencies On the cost of managing data flow dependencies - program scheduled by work stealing - Thierry Gautier, INRIA, EPI MOAIS, Grenoble France Workshop INRIA/UIUC/NCSA Outline Context - introduction of work

More information

KAAPI : Adaptive Runtime System for Parallel Computing

KAAPI : Adaptive Runtime System for Parallel Computing KAAPI : Adaptive Runtime System for Parallel Computing Thierry Gautier, thierry.gautier@inrialpes.fr Bruno Raffin, bruno.raffin@inrialpes.fr, INRIA Grenoble Rhône-Alpes Moais Project http://moais.imag.fr

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #11 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Midterm 1:

More information

! XKaapi : a runtime for highly parallel (OpenMP) application

! XKaapi : a runtime for highly parallel (OpenMP) application XKaapi : a runtime for highly parallel (OpenMP) application Thierry Gautier thierry.gautier@inrialpes.fr MOAIS, INRIA, Grenoble C2S@Exa 10-11/07/2014 Agenda - context - objective - task model and execution

More information

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation

More information

Patterns of Parallel Programming with.net 4. Ade Miller Microsoft patterns & practices

Patterns of Parallel Programming with.net 4. Ade Miller Microsoft patterns & practices Patterns of Parallel Programming with.net 4 Ade Miller (adem@microsoft.com) Microsoft patterns & practices Introduction Why you should care? Where to start? Patterns walkthrough Conclusions (and a quiz)

More information

CS516 Programming Languages and Compilers II

CS516 Programming Languages and Compilers II CS516 Programming Languages and Compilers II Zheng Zhang Spring 2015 Mar 12 Parallelism and Shared Memory Hierarchy I Rutgers University Review: Classical Three-pass Compiler Front End IR Middle End IR

More information

X-Kaapi C programming interface

X-Kaapi C programming interface X-Kaapi C programming interface Fabien Le Mentec, Vincent Danjean, Thierry Gautier To cite this version: Fabien Le Mentec, Vincent Danjean, Thierry Gautier. X-Kaapi C programming interface. [Technical

More information

. Understanding Performance of Shared Memory Programs. Kenjiro Taura. University of Tokyo

. Understanding Performance of Shared Memory Programs. Kenjiro Taura. University of Tokyo .. Understanding Performance of Shared Memory Programs Kenjiro Taura University of Tokyo 1 / 45 Today s topics. 1 Introduction. 2 Artificial bottlenecks. 3 Understanding shared memory performance 2 / 45

More information

Performance Issues in Parallelization. Saman Amarasinghe Fall 2010

Performance Issues in Parallelization. Saman Amarasinghe Fall 2010 Performance Issues in Parallelization Saman Amarasinghe Fall 2010 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries

More information

Parallel Programming. Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops

Parallel Programming. Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops Parallel Programming Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops Single computers nowadays Several CPUs (cores) 4 to 8 cores on a single chip Hyper-threading

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

Multithreading in C with OpenMP

Multithreading in C with OpenMP Multithreading in C with OpenMP ICS432 - Spring 2017 Concurrent and High-Performance Programming Henri Casanova (henric@hawaii.edu) Pthreads are good and bad! Multi-threaded programming in C with Pthreads

More information

Shared-memory Parallel Programming with Cilk Plus

Shared-memory Parallel Programming with Cilk Plus Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 30 August 2018 Outline for Today Threaded programming

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples

More information

2

2 1 2 3 4 5 Code transformation Every time the compiler finds a #pragma omp parallel directive creates a new function in which the code belonging to the scope of the pragma itself is moved The directive

More information

Understanding Task Scheduling Algorithms. Kenjiro Taura

Understanding Task Scheduling Algorithms. Kenjiro Taura Understanding Task Scheduling Algorithms Kenjiro Taura 1 / 48 Contents 1 Introduction 2 Work stealing scheduler 3 Analyzing execution time of work stealing 4 Analyzing cache misses of work stealing 5 Summary

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors

More information

Design and Implementation Of A Cache Hierarchy- Aware Task Scheduling For Parallel Loops On Multicore Architectures

Design and Implementation Of A Cache Hierarchy- Aware Task Scheduling For Parallel Loops On Multicore Architectures Design and Implementation Of A Cache Hierarchy- Aware Task Scheduling For Parallel Loops On Multicore Architectures Nader Khammassi Lab-STICC UMR CNRS 6285 ENSTA-Bretagne 29806 Brest Cedex 9, France nader.khammassi@ensta-bretagne.fr

More information

Performance Issues in Parallelization Saman Amarasinghe Fall 2009

Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries

More information

Lecture 16: Recapitulations. Lecture 16: Recapitulations p. 1

Lecture 16: Recapitulations. Lecture 16: Recapitulations p. 1 Lecture 16: Recapitulations Lecture 16: Recapitulations p. 1 Parallel computing and programming in general Parallel computing a form of parallel processing by utilizing multiple computing units concurrently

More information

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP Parallel Programming with Compiler Directives OpenMP Clemens Grelck University of Amsterdam UvA-SARA High Performance Computing Course June 2013 OpenMP at a Glance Loop Parallelization Scheduling Parallel

More information

Multi-core Computing Lecture 2

Multi-core Computing Lecture 2 Multi-core Computing Lecture 2 MADALGO Summer School 2012 Algorithms for Modern Parallel and Distributed Models Phillip B. Gibbons Intel Labs Pittsburgh August 21, 2012 Multi-core Computing Lectures: Progress-to-date

More information

AUTOMATIC SMT THREADING

AUTOMATIC SMT THREADING AUTOMATIC SMT THREADING FOR OPENMP APPLICATIONS ON THE INTEL XEON PHI CO-PROCESSOR WIM HEIRMAN 1,2 TREVOR E. CARLSON 1 KENZO VAN CRAEYNEST 1 IBRAHIM HUR 2 AAMER JALEEL 2 LIEVEN EECKHOUT 1 1 GHENT UNIVERSITY

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve OPENMP Standard multicore API for scientific computing Based on fork-join model: fork many threads, join and resume sequential thread Uses pragma:#pragma omp parallel Shared/private

More information

MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores

MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores Presented by Xiaohui Chen Joint work with Marc Moreno Maza, Sushek Shekar & Priya Unnikrishnan University of Western Ontario,

More information

Parallel Programming with OpenMP. CS240A, T. Yang, 2013 Modified from Demmel/Yelick s and Mary Hall s Slides

Parallel Programming with OpenMP. CS240A, T. Yang, 2013 Modified from Demmel/Yelick s and Mary Hall s Slides Parallel Programming with OpenMP CS240A, T. Yang, 203 Modified from Demmel/Yelick s and Mary Hall s Slides Introduction to OpenMP What is OpenMP? Open specification for Multi-Processing Standard API for

More information

Concurrent Programming with OpenMP

Concurrent Programming with OpenMP Concurrent Programming with OpenMP Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico October 11, 2012 CPD (DEI / IST) Parallel and Distributed

More information

CMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman)

CMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman) CMSC 714 Lecture 4 OpenMP and UPC Chau-Wen Tseng (from A. Sussman) Programming Model Overview Message passing (MPI, PVM) Separate address spaces Explicit messages to access shared data Send / receive (MPI

More information

Chapter 4: Threads. Chapter 4: Threads

Chapter 4: Threads. Chapter 4: Threads Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples

More information

Parallel Programming Principle and Practice. Lecture 7 Threads programming with TBB. Jin, Hai

Parallel Programming Principle and Practice. Lecture 7 Threads programming with TBB. Jin, Hai Parallel Programming Principle and Practice Lecture 7 Threads programming with TBB Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Outline Intel Threading

More information

PORTING PARALLEL APPLICATIONS TO HETEROGENEOUS SUPERCOMPUTERS: LIBRARIES AND TOOLS CAN MAKE IT TRANSPARENT

PORTING PARALLEL APPLICATIONS TO HETEROGENEOUS SUPERCOMPUTERS: LIBRARIES AND TOOLS CAN MAKE IT TRANSPARENT PORTING PARALLEL APPLICATIONS TO HETEROGENEOUS SUPERCOMPUTERS: LIBRARIES AND TOOLS CAN MAKE IT TRANSPARENT Jean-Yves VET, DDN Storage Patrick CARRIBAULT, CEA Albert COHEN, INRIA CEA, DAM, DIF, F-91297

More information

Intel Thread Building Blocks, Part II

Intel Thread Building Blocks, Part II Intel Thread Building Blocks, Part II SPD course 2013-14 Massimo Coppola 25/03, 16/05/2014 1 TBB Recap Portable environment Based on C++11 standard compilers Extensive use of templates No vectorization

More information

An Efficient OpenMP Loop Scheduler for Irregular Applications on Large-Scale NUMA Machines

An Efficient OpenMP Loop Scheduler for Irregular Applications on Large-Scale NUMA Machines An Efficient OpenMP Loop Scheduler for Irregular Applications on Large-Scale NUMA Machines Marie Durand, Francois Broquedis, Thierry Gautier, Bruno Raffin To cite this version: Marie Durand, Francois Broquedis,

More information

Parallel Programming

Parallel Programming Parallel Programming Lecture delivered by: Venkatanatha Sarma Y Assistant Professor MSRSAS-Bangalore 1 Session Objectives To understand the parallelization in terms of computational solutions. To understand

More information

Exercise: OpenMP Programming

Exercise: OpenMP Programming Exercise: OpenMP Programming Multicore programming with OpenMP 19.04.2016 A. Marongiu - amarongiu@iis.ee.ethz.ch D. Palossi dpalossi@iis.ee.ethz.ch ETH zürich Odroid Board Board Specs Exynos5 Octa Cortex

More information

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor

More information

Thread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications

Thread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications Thread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications Janghaeng Lee, Haicheng Wu, Madhumitha Ravichandran, Nathan Clark Motivation Hardware Trends Put more cores

More information

Parallel Programming. OpenMP Parallel programming for multiprocessors for loops

Parallel Programming. OpenMP Parallel programming for multiprocessors for loops Parallel Programming OpenMP Parallel programming for multiprocessors for loops OpenMP OpenMP An application programming interface (API) for parallel programming on multiprocessors Assumes shared memory

More information

Parallel Programming with OpenMP. CS240A, T. Yang

Parallel Programming with OpenMP. CS240A, T. Yang Parallel Programming with OpenMP CS240A, T. Yang 1 A Programmer s View of OpenMP What is OpenMP? Open specification for Multi-Processing Standard API for defining multi-threaded shared-memory programs

More information

Multithreaded Parallelism and Performance Measures

Multithreaded Parallelism and Performance Measures Multithreaded Parallelism and Performance Measures Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 3101 (Moreno Maza) Multithreaded Parallelism and Performance Measures CS 3101

More information

Loop Modifications to Enhance Data-Parallel Performance

Loop Modifications to Enhance Data-Parallel Performance Loop Modifications to Enhance Data-Parallel Performance Abstract In data-parallel applications, the same independent

More information

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:

More information

Compsci 590.3: Introduction to Parallel Computing

Compsci 590.3: Introduction to Parallel Computing Compsci 590.3: Introduction to Parallel Computing Alvin R. Lebeck Slides based on this from the University of Oregon Admin Logistics Homework #3 Use script Project Proposals Document: see web site» Due

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Runtime Support for Scalable Task-parallel Programs

Runtime Support for Scalable Task-parallel Programs Runtime Support for Scalable Task-parallel Programs Pacific Northwest National Lab xsig workshop May 2018 http://hpc.pnl.gov/people/sriram/ Single Program Multiple Data int main () {... } 2 Task Parallelism

More information

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means High Performance Computing: Concepts, Methods & Means OpenMP Programming Prof. Thomas Sterling Department of Computer Science Louisiana State University February 8 th, 2007 Topics Introduction Overview

More information

CS420: Operating Systems

CS420: Operating Systems Threads James Moscola Department of Physical Sciences York College of Pennsylvania Based on Operating System Concepts, 9th Edition by Silberschatz, Galvin, Gagne Threads A thread is a basic unit of processing

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

SRM-Buffer: An OS Buffer Management SRM-Buffer: An OS Buffer Management Technique toprevent Last Level Cache from Thrashing in Multicores

SRM-Buffer: An OS Buffer Management SRM-Buffer: An OS Buffer Management Technique toprevent Last Level Cache from Thrashing in Multicores SRM-Buffer: An OS Buffer Management SRM-Buffer: An OS Buffer Management Technique toprevent Last Level Cache from Thrashing in Multicores Xiaoning Ding The Ohio State University dingxn@cse.ohiostate.edu

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

OPERATING SYSTEM. Chapter 4: Threads

OPERATING SYSTEM. Chapter 4: Threads OPERATING SYSTEM Chapter 4: Threads Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples Objectives To

More information

Shared-memory Parallel Programming with Cilk Plus

Shared-memory Parallel Programming with Cilk Plus Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 19 January 2017 Outline for Today Threaded programming

More information

Basic Communication Operations (Chapter 4)

Basic Communication Operations (Chapter 4) Basic Communication Operations (Chapter 4) Vivek Sarkar Department of Computer Science Rice University vsarkar@cs.rice.edu COMP 422 Lecture 17 13 March 2008 Review of Midterm Exam Outline MPI Example Program:

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve PTHREADS pthread_create, pthread_exit, pthread_join Mutex: locked/unlocked; used to protect access to shared variables (read/write) Condition variables: used to allow threads

More information

Locality-Aware Work Stealing Based on Online Profiling and Auto-Tuning for Multisocket Multicore Architectures

Locality-Aware Work Stealing Based on Online Profiling and Auto-Tuning for Multisocket Multicore Architectures Locality-Aware Work Stealing Based on Online Profiling and Auto-Tuning for Multisocket Multicore Architectures QUAN CHEN, School of Software Engineering, Shanghai Jiao Tong University MINYI GUO, Department

More information

Efficiently Introduce Threading using Intel TBB

Efficiently Introduce Threading using Intel TBB Introduction This guide will illustrate how to efficiently introduce threading using Intel Threading Building Blocks (Intel TBB), part of Intel Parallel Studio XE. It is a widely used, award-winning C++

More information

LAWS: Locality-Aware Work-Stealing for Multi-socket Multi-core Architectures

LAWS: Locality-Aware Work-Stealing for Multi-socket Multi-core Architectures LAWS: Locality-Aware Work-Stealing for Multi-socket Multi-core Architectures Quan Chen Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai

More information

Task-based Execution of Nested OpenMP Loops

Task-based Execution of Nested OpenMP Loops Task-based Execution of Nested OpenMP Loops Spiros N. Agathos Panagiotis E. Hadjidoukas Vassilios V. Dimakopoulos Department of Computer Science UNIVERSITY OF IOANNINA Ioannina, Greece Presentation Layout

More information

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows

More information

Parallel Computer Architecture and Programming Written Assignment 3

Parallel Computer Architecture and Programming Written Assignment 3 Parallel Computer Architecture and Programming Written Assignment 3 50 points total. Due Monday, July 17 at the start of class. Problem 1: Message Passing (6 pts) A. (3 pts) You and your friend liked the

More information

CS4961 Parallel Programming. Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms 9/5/12. Administrative. Mary Hall September 4, 2012

CS4961 Parallel Programming. Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms 9/5/12. Administrative. Mary Hall September 4, 2012 CS4961 Parallel Programming Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms Administrative Mailing list set up, everyone should be on it - You should have received a test mail last night

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/ [ 9 ] Shared Memory Performance Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture

More information

Plan. Introduction to Multicore Programming. Plan. University of Western Ontario, London, Ontario (Canada) Marc Moreno Maza CS CS 9624

Plan. Introduction to Multicore Programming. Plan. University of Western Ontario, London, Ontario (Canada) Marc Moreno Maza CS CS 9624 Plan Introduction to Multicore Programming Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 4435 - CS 9624 1 Multi-core Architecture Multi-core processor CPU Cache CPU Coherence

More information

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation

More information

Chapter 4: Threads. Operating System Concepts 9 th Edition

Chapter 4: Threads. Operating System Concepts 9 th Edition Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples

More information

Allows program to be incrementally parallelized

Allows program to be incrementally parallelized Basic OpenMP What is OpenMP An open standard for shared memory programming in C/C+ + and Fortran supported by Intel, Gnu, Microsoft, Apple, IBM, HP and others Compiler directives and library support OpenMP

More information

Table of Contents. Cilk

Table of Contents. Cilk Table of Contents 212 Introduction to Parallelism Introduction to Programming Models Shared Memory Programming Message Passing Programming Shared Memory Models Cilk TBB HPF Chapel Fortress Stapl PGAS Languages

More information

Plan. 1 Parallelism Complexity Measures. 2 cilk for Loops. 3 Scheduling Theory and Implementation. 4 Measuring Parallelism in Practice

Plan. 1 Parallelism Complexity Measures. 2 cilk for Loops. 3 Scheduling Theory and Implementation. 4 Measuring Parallelism in Practice lan Multithreaded arallelism and erformance Measures Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 3101 1 2 cilk for Loops 3 4 Measuring arallelism in ractice 5 Announcements

More information

Barbara Chapman, Gabriele Jost, Ruud van der Pas

Barbara Chapman, Gabriele Jost, Ruud van der Pas Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology

More information

Towards an Adaptive Task Pool Implementation

Towards an Adaptive Task Pool Implementation Towards an Adaptive Task Pool Implementation M. Hofmann and G. Rünger Department of Computer Science Chemnitz University of Technology, Germany E-mail: {mhofma,ruenger}@informatik.tu-chemnitz.de Abstract

More information

Chapter 4: Threads. Operating System Concepts 9 th Edit9on

Chapter 4: Threads. Operating System Concepts 9 th Edit9on Chapter 4: Threads Operating System Concepts 9 th Edit9on Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads 1. Overview 2. Multicore Programming 3. Multithreading Models 4. Thread Libraries 5. Implicit

More information

Binary Mesh Partitioning for Cache-Efficient Visualization

Binary Mesh Partitioning for Cache-Efficient Visualization IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL., NO., JANUARY 7 Binary Mesh Partitioning for Cache-Efficient Visualization Marc Tchiboukdjian, Vincent Danjean and Bruno Raffin Abstract One

More information

GNU libstdc++ parallel mode: Algorithms

GNU libstdc++ parallel mode: Algorithms Introduction Library Overview Algorithms Conclusion 1/40 The GNU libstdc++ parallel mode: Algorithms {singler,putze,sanders}@ira.uka.de Institute for Theoretical Computer Science University of Karlsruhe

More information

How to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture

How to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture How to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture Dirk Schmidl, Christian Terboven, Andreas Wolf, Dieter an Mey, Christian Bischof IEEE Cluster 2010 / Heraklion September 21, 2010

More information

Joe Hummel, PhD. Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago.

Joe Hummel, PhD. Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago. Joe Hummel, PhD Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago email: joe@joehummel.net stuff: http://www.joehummel.net/downloads.html Async programming:

More information

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin

More information

Lecture 14: Mixed MPI-OpenMP programming. Lecture 14: Mixed MPI-OpenMP programming p. 1

Lecture 14: Mixed MPI-OpenMP programming. Lecture 14: Mixed MPI-OpenMP programming p. 1 Lecture 14: Mixed MPI-OpenMP programming Lecture 14: Mixed MPI-OpenMP programming p. 1 Overview Motivations for mixed MPI-OpenMP programming Advantages and disadvantages The example of the Jacobi method

More information

CSE 260 Lecture 19. Parallel Programming Languages

CSE 260 Lecture 19. Parallel Programming Languages CSE 260 Lecture 19 Parallel Programming Languages Announcements Thursday s office hours are cancelled Office hours on Weds 2p to 4pm Jing will hold OH, too, see Moodle Scott B. Baden /CSE 260/ Winter 2014

More information

Performance Optimization Part 1: Work Distribution and Scheduling

Performance Optimization Part 1: Work Distribution and Scheduling Lecture 5: Performance Optimization Part 1: Work Distribution and Scheduling Parallel Computer Architecture and Programming CMU 15-418/15-618, Fall 2017 Programming for high performance Optimizing the

More information

A Static Cut-off for Task Parallel Programs

A Static Cut-off for Task Parallel Programs A Static Cut-off for Task Parallel Programs Shintaro Iwasaki, Kenjiro Taura Graduate School of Information Science and Technology The University of Tokyo September 12, 2016 @ PACT '16 1 Short Summary We

More information

Trends and Challenges in Multicore Programming

Trends and Challenges in Multicore Programming Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores

More information

COMP Parallel Computing. SMM (4) Nested Parallelism

COMP Parallel Computing. SMM (4) Nested Parallelism COMP 633 - Parallel Computing Lecture 9 September 19, 2017 Nested Parallelism Reading: The Implementation of the Cilk-5 Multithreaded Language sections 1 3 1 Topics Nested parallelism in OpenMP and other

More information

MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores

MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores Xiaohui Chen, Marc Moreno Maza & Sushek Shekar University of Western Ontario, Canada IBM Toronto Lab February 11, 2015 Plan

More information

Multi-core Computing Lecture 1

Multi-core Computing Lecture 1 Hi-Spade Multi-core Computing Lecture 1 MADALGO Summer School 2012 Algorithms for Modern Parallel and Distributed Models Phillip B. Gibbons Intel Labs Pittsburgh August 20, 2012 Lecture 1 Outline Multi-cores:

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

Chapter 4: Multithreaded Programming

Chapter 4: Multithreaded Programming Chapter 4: Multithreaded Programming Silberschatz, Galvin and Gagne 2013 Chapter 4: Multithreaded Programming Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical

More information

Lecture 2. Memory locality optimizations Address space organization

Lecture 2. Memory locality optimizations Address space organization Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput

More information

Amdahl s Law. AMath 483/583 Lecture 13 April 25, Amdahl s Law. Amdahl s Law. Today: Amdahl s law Speed up, strong and weak scaling OpenMP

Amdahl s Law. AMath 483/583 Lecture 13 April 25, Amdahl s Law. Amdahl s Law. Today: Amdahl s law Speed up, strong and weak scaling OpenMP AMath 483/583 Lecture 13 April 25, 2011 Amdahl s Law Today: Amdahl s law Speed up, strong and weak scaling OpenMP Typically only part of a computation can be parallelized. Suppose 50% of the computation

More information

Contributors: Surabhi Jain, Gengbin Zheng, Maria Garzaran, Jim Cownie, Taru Doodi, and Terry L. Wilmarth

Contributors: Surabhi Jain, Gengbin Zheng, Maria Garzaran, Jim Cownie, Taru Doodi, and Terry L. Wilmarth Presenter: Surabhi Jain Contributors: Surabhi Jain, Gengbin Zheng, Maria Garzaran, Jim Cownie, Taru Doodi, and Terry L. Wilmarth May 25, 2018 ROME workshop (in conjunction with IPDPS 2018), Vancouver,

More information

Scheduling Parallel Programs by Work Stealing with Private Deques

Scheduling Parallel Programs by Work Stealing with Private Deques Scheduling Parallel Programs by Work Stealing with Private Deques Umut Acar Carnegie Mellon University Arthur Charguéraud INRIA Mike Rainey Max Planck Institute for Software Systems PPoPP 25.2.2013 1 Scheduling

More information

Line Segment Intersection Dmitriy V'jukov

Line Segment Intersection Dmitriy V'jukov Line Segment Intersection Dmitriy V'jukov 1. Problem Statement Write a threaded code to find pairs of input line segments that intersect within three-dimensional space. Line segments are defined by 6 integers

More information

ECE 574 Cluster Computing Lecture 10

ECE 574 Cluster Computing Lecture 10 ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular

More information

Chapter 4: Threads. Operating System Concepts 9 th Edition

Chapter 4: Threads. Operating System Concepts 9 th Edition Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples

More information

OpenMP on the IBM Cell BE

OpenMP on the IBM Cell BE OpenMP on the IBM Cell BE PRACE Barcelona Supercomputing Center (BSC) 21-23 October 2009 Marc Gonzalez Tallada Index OpenMP programming and code transformations Tiling and Software Cache transformations

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information