A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores
1 A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores
Marc Tchiboukdjian, Vincent Danjean, Thierry Gautier, Fabien Le Mentec, Bruno Raffin
Marc Tchiboukdjian, A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores, 1/24
2 Shared Cache of Multicore Processors
1. One core with 2 cache levels (L1 and L2)
3 Shared Cache of Multicore Processors
1. One core with 2 cache levels
2. Multiple cores, each with private L1 and L2 caches
4 Shared Cache of Multicore Processors
1. One core with 2 cache levels
2. Multiple cores with private caches
3. Multiple cores with private caches and one shared L3 cache
The sequential execution has an unfair advantage: it gets the whole shared cache to itself. Can we still get linear speedup?
5 Scheduling for Efficient Shared Cache Usage
Goal: schedule the computation so that shared cache misses do not increase, with a work stealing scheduler allowing efficient dynamic load balancing, for parallel loops (no dependencies between tasks).
Test machine: Xeon Nehalem E5530 (2.4GHz), cores 0 to 3, each with a 32KB private L1 and a 256KB private L2, sharing one 8MB L3.
6 Overview: Cache Efficient Work Stealing Scheduling for Parallel Loops
1. Standard Schedulers for Parallel Loops
2. New Scheduler Optimized for Shared Cache
3. Efficient Implementation of the Scheduler
4. Experiments
7 Parallel Loops
Examples of parallel loops:
- OpenMP: #pragma omp parallel for
- TBB: parallel_for
- Cilk: parallel for
- Parallel STL: for_each or transform
Problem characteristics:
- Schedule n iterations on p cores
- Iterations can be processed independently
- Time to process one iteration can vary
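As a minimal illustration of such a loop (not taken from the talk), an OpenMP parallel for in C with fully independent iterations; without OpenMP the pragma is ignored and the loop simply runs sequentially, producing the same result:

```c
#include <stddef.h>

/* Illustrative parallel loop: square each element independently.
 * With OpenMP enabled (-fopenmp), iterations are distributed over the
 * cores; without it, the pragma is ignored and the loop runs
 * sequentially. The result is identical either way. */
void square_all(double *a, size_t n) {
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        a[i] = a[i] * a[i];
}
```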
8 Static Scheduling of Parallel Loops
Static scheduling:
- Allocate n/p iterations to each core (ex: OpenMP static scheduling)
Characteristics:
- Low overhead mechanism
- Bad load balancing if the workload is irregular
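The static split above can be sketched in a few lines of C (an assumption of this sketch is the common `(i*n)/p` block formula, which spreads any remainder without a special case):

```c
#include <stddef.h>

/* Sketch of static scheduling: core i of p gets a contiguous block of
 * roughly n/p iterations, namely [i*n/p, (i+1)*n/p). The integer
 * division spreads the remainder across cores automatically. */
void static_range(size_t n, int p, int i, size_t *begin, size_t *end) {
    *begin = ((size_t)i * n) / (size_t)p;
    *end   = ((size_t)(i + 1) * n) / (size_t)p;
}
```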
9 Dynamic Scheduling of Parallel Loops
OpenMP dynamic scheduling:
- Allocate iterations in chunks of size g
- All chunks are stored in a centralized list
- Each thread removes a chunk from the list and processes it
Characteristics:
- Good load balancing
- Contention on the list
- Chunk creation overhead
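A minimal sketch of the centralized list, assuming it is reduced to a shared atomic counter (a real OpenMP runtime is more elaborate, but this exposes the contention point: every worker hits the same shared word):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical centralized chunk list: workers atomically grab the next
 * chunk of g iterations from a shared counter. */
typedef struct { _Atomic size_t next; size_t n, g; } chunk_list_t;

bool grab_chunk(chunk_list_t *l, size_t *begin, size_t *end) {
    size_t b = atomic_fetch_add(&l->next, l->g);
    if (b >= l->n) return false;                 /* no iterations left */
    *begin = b;
    *end = (b + l->g < l->n) ? b + l->g : l->n;  /* clamp last chunk */
    return true;
}
```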
10 Scheduling Parallel Loops with Work Stealing
Work stealing:
- Each thread has its own list of tasks (= chunks)
- If its list is empty, a thread steals tasks from a randomly selected list
- Binary tree of tasks to minimize the number of steals: one steal takes half of the remaining iterations
Characteristics:
- Good load balancing
- Contention is reduced
- Task creation overhead
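The binary splitting can be sketched as follows (an illustrative fragment, not the talk's code): the thief takes the upper half of the victim's remaining range, so O(log n) steals suffice to spread n iterations.

```c
#include <stdbool.h>

/* Sketch of a steal with binary splitting over a half-open range. */
typedef struct { long begin, end; } range_t;

bool steal_half(range_t *victim, range_t *thief) {
    long remaining = victim->end - victim->begin;
    if (remaining < 2) return false;        /* too little work to split */
    long mid = victim->begin + remaining / 2;
    thief->begin = mid;                     /* thief takes the upper half */
    thief->end = victim->end;
    victim->end = mid;                      /* victim keeps the lower half */
    return true;
}
```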
11 Scheduling Parallel Loops with Work Stealing
Task tree over the iterations: [0, 127] splits into [0, 63] and [64, 127]; [0, 63] splits into [0, 31] and [32, 63]; [0, 31] splits into [0, 15] and [16, 31]. (Legend: terminated tasks, ready tasks, running tasks.)
12 Scheduling Parallel Loops with Work Stealing
After a steal of [64, 127]: the stolen half splits into [64, 95] and [96, 127], then [64, 95] into [64, 79] and [80, 95], while the victim continues in [0, 63] with [0, 31]/[32, 63] and [0, 15]/[16, 31]. (Legend: terminated tasks, ready tasks, running tasks.)
13 Parallel Loops with Xkaapi
Xkaapi:
- Work stealing library
- Tasks are created on a steal: reduces task creation overhead
- Cooperative stealing: the victim stops working to answer work requests, and can answer multiple requests at a time
Characteristics:
- Good load balancing
- Low overhead mechanism

    typedef struct {
        InputIterator ibeg;
        InputIterator iend;
    } Work_t; // Task

    void parallel_for(...) {
        while (iend != ibeg)
            do_work(ibeg++);
        // no more work -> become a thief
    }

    void splitter(num_req) {
        size = victim.iend - victim.ibeg;
        bloc = size / (num_req + 1);
        local_end = victim.iend;
        while (num_req > 0) {
            thief->iend = local_end;
            thief->ibeg = local_end - bloc;
            local_end -= bloc;
            --num_req;
        }
        // victim + thieves -> parallel_for
    }
14 Overview: Cache Efficient Work Stealing Scheduling for Parallel Loops
1. Standard Schedulers for Parallel Loops
2. New Scheduler Optimized for Shared Cache
3. Efficient Implementation of the Scheduler
4. Experiments
15 Reuse Distances [Beyls and D'Hollander 01]
Reuse distance: the number of distinct elements accessed between two accesses to the same element; on a first access, the reuse distance is infinite.
On a fully associative LRU cache of size C:
- reuse distance <= C => hit
- reuse distance > C  => miss
Let h_d be the number of accesses with reuse distance d. The number of cache misses is
    M(C) = Σ_{d > C} h_d
Example trace (cache of size 2):
    element access:  A    B    A    B    B    C    B    A    C
    reuse distance:  inf  inf  1    1    0    inf  1    2    2
    cache content:   A | A,B | A,B | A,B | A,B | B,C | B,C | A,B | A,C
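The definition above can be computed directly. A naive O(n^2) sketch in C (an illustration, not an efficient reuse-distance tool), following the slide's convention that an access misses iff its distance is infinite or exceeds C:

```c
#define INF -1  /* marks a first access (infinite reuse distance) */

/* For each access i, count the distinct elements touched since the
 * previous access to the same element; dist[i] = that count, or INF. */
void reuse_distances(const int *trace, int n, int *dist) {
    for (int i = 0; i < n; i++) {
        int last = -1;
        for (int j = i - 1; j >= 0; j--)
            if (trace[j] == trace[i]) { last = j; break; }
        if (last < 0) { dist[i] = INF; continue; }
        int d = 0;
        for (int j = last + 1; j < i; j++) {   /* count distinct elements */
            int seen = 0;
            for (int k = last + 1; k < j; k++)
                if (trace[k] == trace[j]) { seen = 1; break; }
            if (!seen) d++;
        }
        dist[i] = d;
    }
}

/* Misses on a fully associative LRU cache of size C, per the slide's
 * rule: first accesses miss, plus all accesses with distance > C. */
int lru_misses(const int *dist, int n, int C) {
    int m = 0;
    for (int i = 0; i < n; i++)
        if (dist[i] == INF || dist[i] > C) m++;
    return m;
}
```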
16 Shared Cache Misses with the Classic Schedule
With the classic schedule, cores work on elements far away from each other:
- Good temporal locality of the sequential algorithm
- But the cores work on distinct data
To 1 access by a core correspond p-1 accesses to distinct data by the other cores, so the reuse distance is multiplied by p:
    h^seq_d = h^par_{p·d}
Number of shared cache misses:
    M_par(C) = Σ_{d > C} h^par_d = Σ_{d > C/p} h^seq_d = M_seq(C/p)
The accesses with sequential distance between C/p and C become additional misses: the parallel run behaves as if each core had only a 1/p share of the shared cache.
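Given a sequential reuse-distance histogram, the identity M_par(C) = M_seq(C/p) can be evaluated numerically. A small sketch (assumes cold/first-access misses are passed separately, and h[d] holds the count for each finite distance d):

```c
/* Misses from a reuse-distance histogram h[0..max-1] plus `cold` first
 * accesses: M(C) = cold + sum_{d > C} h[d]. */
long misses_from_hist(const long *h, int max, long cold, long C) {
    long m = cold;
    for (long d = C + 1; d < max; d++) m += h[d];
    return m;
}

/* Classic schedule on p cores sharing a cache of size C: distances are
 * stretched by p, so shared misses equal M_seq(C/p). */
long classic_parallel_misses(const long *h, int max, long cold, long C, int p) {
    return misses_from_hist(h, max, cold, C / p);
}
```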
17 A Shared Cache Aware Schedule
Idea: cores work at distance at most m in the iteration sequence. Let r(m) be the number of accesses to distinct elements in a window of size m.
The reuse distance is increased by at most r(m):
    h^seq_d = h^win_{d + r(m)}
Number of cache misses:
    M_win(C) <= Σ_{d > C - r(m)} h^seq_d = M_seq(C) + Σ_{d = C - r(m) + 1}^{C} h^seq_d
Small overhead over sequential: good sequential locality means r(m) is small and h^seq_d is small for large d, so the additional misses (sequential distances between C - r(m) and C) are few.
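The bound above can likewise be evaluated from a sequential histogram. A sketch under the same assumptions as before (cold misses passed separately, h[d] the count per finite distance):

```c
/* Upper bound on misses under the window schedule:
 * M_win(C) <= cold + sum_{d > C - r} h[d],
 * i.e. M_seq(C) plus the accesses whose distance falls in (C - r, C]. */
long window_misses_bound(const long *h, int max, long cold, long C, long r) {
    long m = cold;
    long from = C - r + 1;
    if (from < 0) from = 0;
    for (long d = from; d < max; d++) m += h[d];
    return m;
}
```

With r = 0 the bound collapses to M_seq(C), matching the sequential run.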
18 Overview: Cache Efficient Work Stealing Scheduling for Parallel Loops
1. Standard Schedulers for Parallel Loops
2. New Scheduler Optimized for Shared Cache
3. Efficient Implementation of the Scheduler
4. Experiments
19 Implementing the Shared Cache Aware Schedule
Using a standard parallel for loop: StaticWindow
- Divide the iterations into n/m chunks of size m
- Each chunk is processed in parallel with a standard parallel for
- Two versions: Pthread and Xkaapi
Pthread version:
- Each thread processes 1/p of a chunk
- Threads wait on a barrier after each chunk
Xkaapi version:
- Each chunk is processed in parallel with a parallel for
20 Implementing the Shared Cache Aware Schedule
Optimized implementation using Xkaapi: SlidingWindow
- Processing iteration i enables iteration i + m
- The master thread is at the beginning of the sequence
- On a steal, the master can give work:
  - in the interval [ibeg, iend[, like the other workers
  - in the interval [ilast, ibeg + m[, enabled since the last steal
Layout of the sequence: processed elements | master work [ibeg, iend[ | stolen work up to ilast | remaining elements, all within an m-size window.

    typedef struct {
        InputIterator ibeg;
        InputIterator iend;
    } Work_t; // Task

    typedef struct {
        InputIterator ibeg;
        InputIterator iend;
        InputIterator ilast;
    } Master_Work_t; // Master Task
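The master's answer to a steal can be sketched like this (hypothetical code, not Xkaapi's: the field semantics and the split policy are assumptions): the master first hands out the interval of iterations enabled since the last steal, before falling back to splitting its own range.

```c
#include <stdbool.h>

/* Sketch of the SlidingWindow master split over integer iteration ids. */
typedef struct { long ibeg, iend, ilast; long m; } master_work_t;
typedef struct { long ibeg, iend; } work_t;

bool master_split(master_work_t *mw, work_t *thief) {
    long win_end = mw->ibeg + mw->m;  /* end of the m-size window */
    if (mw->ilast < win_end) {        /* enabled since the last steal */
        thief->ibeg = mw->ilast;
        thief->iend = win_end;
        mw->ilast = win_end;          /* window fully handed out */
        return true;
    }
    return false;  /* fall back to splitting [ibeg, iend) as usual */
}
```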
21 Overview: Cache Efficient Work Stealing Scheduling for Parallel Loops
1. Standard Schedulers for Parallel Loops
2. New Scheduler Optimized for Shared Cache
3. Efficient Implementation of the Scheduler
4. Experiments
22 Application: Isosurface Extraction
Isosurface extraction:
- Common scientific visualization filter
- Memory bound: performance is governed by shared cache misses, M_seq(C/p) (C/p = 2MB per core) versus M_seq(C) (C = 8MB) on the test machine
Algorithm:
- Iterate through all cells in the mesh
- Interpolate the surface inside each cell
- Cells close in the mesh share points
23 Experiments
2 processors:
- Opteron: 2 dual-cores with private L1 and L2 caches
- Nehalem: quad-core with private L1, L2 caches and a shared L3
2 schedules:
- NoWindow: classic schedule
- Window: shared cache aware schedule (Static or Sliding)
7 implementations:
- NoWindow: Pthread, TBB and Xkaapi
- StaticWindow: Pthread, TBB and Xkaapi
- SlidingWindow: Xkaapi
24 Synchronization Overhead
[Bar chart: speedup on 4 cores of Opteron, NoWindow vs StaticWindow, for Pthread, TBB and Xkaapi]
- Pthread < TBB < Xkaapi
- On Opteron there is no shared cache, so Window < NoWindow: more synchronizations without any gain in cache misses
25 Window Size m
[Plot: L3 cache misses vs window size m, for the Static window, sequential and no-window runs, on 4 cores of Nehalem with a shared 8MB L3 cache]
- For small m: M_window ≈ M_seq(C)
- M_no_window = M_seq(C/p)
26 Speedup and Cache Misses on Nehalem
[Bar charts: speedup T_seq/T_par and L3 miss ratio M_par/M_seq, for No/Static (Pthread) and No/Static/Sliding (Xkaapi)]
- 4 cores of Nehalem with 8MB of shared L3 cache
- Best performance: SlidingWindow
27 Conclusion
Shared cache aware scheduler:
- Shared cache aware scheduler for parallel loops
- Efficient implementation using work stealing
- For applications with good sequential locality, the window strategy is as good as if each core had its own copy of the L3 cache
Future work:
- Experiment with other applications
- Automatically find the window size from the reuse distance histogram
- Cache-oblivious version (cache size unknown)?
A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores. Marc Tchiboukdjian, Vincent Danjean, Thierry Gautier, Fabien Le Mentec, and Bruno Raffin. MOAIS Project, INRIA-LIG.
Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin
More informationLecture 14: Mixed MPI-OpenMP programming. Lecture 14: Mixed MPI-OpenMP programming p. 1
Lecture 14: Mixed MPI-OpenMP programming Lecture 14: Mixed MPI-OpenMP programming p. 1 Overview Motivations for mixed MPI-OpenMP programming Advantages and disadvantages The example of the Jacobi method
More informationCSE 260 Lecture 19. Parallel Programming Languages
CSE 260 Lecture 19 Parallel Programming Languages Announcements Thursday s office hours are cancelled Office hours on Weds 2p to 4pm Jing will hold OH, too, see Moodle Scott B. Baden /CSE 260/ Winter 2014
More informationPerformance Optimization Part 1: Work Distribution and Scheduling
Lecture 5: Performance Optimization Part 1: Work Distribution and Scheduling Parallel Computer Architecture and Programming CMU 15-418/15-618, Fall 2017 Programming for high performance Optimizing the
More informationA Static Cut-off for Task Parallel Programs
A Static Cut-off for Task Parallel Programs Shintaro Iwasaki, Kenjiro Taura Graduate School of Information Science and Technology The University of Tokyo September 12, 2016 @ PACT '16 1 Short Summary We
More informationTrends and Challenges in Multicore Programming
Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores
More informationCOMP Parallel Computing. SMM (4) Nested Parallelism
COMP 633 - Parallel Computing Lecture 9 September 19, 2017 Nested Parallelism Reading: The Implementation of the Cilk-5 Multithreaded Language sections 1 3 1 Topics Nested parallelism in OpenMP and other
More informationMetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores
MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores Xiaohui Chen, Marc Moreno Maza & Sushek Shekar University of Western Ontario, Canada IBM Toronto Lab February 11, 2015 Plan
More informationMulti-core Computing Lecture 1
Hi-Spade Multi-core Computing Lecture 1 MADALGO Summer School 2012 Algorithms for Modern Parallel and Distributed Models Phillip B. Gibbons Intel Labs Pittsburgh August 20, 2012 Lecture 1 Outline Multi-cores:
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationChapter 4: Multithreaded Programming
Chapter 4: Multithreaded Programming Silberschatz, Galvin and Gagne 2013 Chapter 4: Multithreaded Programming Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading
More informationParallel Programming in C with MPI and OpenMP
Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical
More informationLecture 2. Memory locality optimizations Address space organization
Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput
More informationAmdahl s Law. AMath 483/583 Lecture 13 April 25, Amdahl s Law. Amdahl s Law. Today: Amdahl s law Speed up, strong and weak scaling OpenMP
AMath 483/583 Lecture 13 April 25, 2011 Amdahl s Law Today: Amdahl s law Speed up, strong and weak scaling OpenMP Typically only part of a computation can be parallelized. Suppose 50% of the computation
More informationContributors: Surabhi Jain, Gengbin Zheng, Maria Garzaran, Jim Cownie, Taru Doodi, and Terry L. Wilmarth
Presenter: Surabhi Jain Contributors: Surabhi Jain, Gengbin Zheng, Maria Garzaran, Jim Cownie, Taru Doodi, and Terry L. Wilmarth May 25, 2018 ROME workshop (in conjunction with IPDPS 2018), Vancouver,
More informationScheduling Parallel Programs by Work Stealing with Private Deques
Scheduling Parallel Programs by Work Stealing with Private Deques Umut Acar Carnegie Mellon University Arthur Charguéraud INRIA Mike Rainey Max Planck Institute for Software Systems PPoPP 25.2.2013 1 Scheduling
More informationLine Segment Intersection Dmitriy V'jukov
Line Segment Intersection Dmitriy V'jukov 1. Problem Statement Write a threaded code to find pairs of input line segments that intersect within three-dimensional space. Line segments are defined by 6 integers
More informationECE 574 Cluster Computing Lecture 10
ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular
More informationChapter 4: Threads. Operating System Concepts 9 th Edition
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples
More informationOpenMP on the IBM Cell BE
OpenMP on the IBM Cell BE PRACE Barcelona Supercomputing Center (BSC) 21-23 October 2009 Marc Gonzalez Tallada Index OpenMP programming and code transformations Tiling and Software Cache transformations
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More information