Parallel Programming Principle and Practice. Lecture 7: Threads programming with TBB. Jin, Hai


Parallel Programming Principle and Practice
Lecture 7: Threads programming with TBB
Jin, Hai
School of Computer Science and Technology, Huazhong University of Science and Technology

Outline
- Intel Threading Building Blocks
- Task-based programming
- Task Scheduler
- Scalable Memory Allocators
- Concurrent Containers
- Synchronization Primitives

Ways to Improve the Naive Implementation
Programming with OS threads can get complicated and error-prone, even for a pattern as simple as a for-loop.

Task-based Programming: A Better Approach to Parallelism
- Portable task-based technologies
- Intel Threading Building Blocks (Intel TBB) lets you easily write parallel C++ programs that take full advantage of multicore performance, that are portable and composable, and that have future-proof scalability
- C++ template library: general task-based programming, concurrent data containers, and more

Key Features
- A template library intended to ease parallel programming for C++ developers
  - Relies on generic programming to deliver high-performance parallel algorithms with broad applicability
- Provides a high-level abstraction for parallelism
- Facilitates scalable performance
  - Strives for efficient use of cache, and balances load
  - Portable across Linux*, Mac OS*, Windows*, and Solaris*
- Can be used in concert with other packages such as native threads and OpenMP (note that native threads, TBB, and OpenMP then compete for the same cores)
- Open source and licensed versions available

Implement parallel ideas with templates and language features

Task-based Programming Advantages

Implementing Common Parallel Performance Patterns

Intel TBB online

Limitations
TBB is not intended for
- I/O-bound processing
- Real-time processing
General limitations
- Direct use only from C++
- Distributed memory not supported (the target is the desktop)
- Requires more work than sprinkling in pragmas, as with OpenMP for example

Outline
- Intel Threading Building Blocks
- Task-based programming
- Task Scheduler
- Scalable Memory Allocators
- Concurrent Containers
- Synchronization Primitives

Task-based Programming
- Tasks are light-weight entities at user level
- TBB parallel algorithms map tasks onto threads automatically
- The task scheduler manages the thread pool
  - The scheduler is deliberately unfair, favoring tasks that have been most recently in cache
- Oversubscription and undersubscription of core resources are prevented by the TBB scheduler's task-stealing technique

Generic Parallel Algorithms
Loop parallelization
- parallel_for and parallel_reduce: load-balanced parallel execution of a fixed number of independent loop iterations
- parallel_scan: template function that computes a parallel prefix (y[i] = y[i-1] op x[i])
Parallel algorithms for streams
- parallel_do: use for an unstructured stream or pile of work; can add additional work to the pile through a feeder while running
- parallel_for_each: like parallel_do, but without a feeder for adding additional work
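To make the prefix formula concrete, here is a hedged sketch of a parallel running sum using the functional overload of parallel_scan found in recent TBB releases (the function and variable names are illustrative, not from the slides):

    #include "tbb/parallel_scan.h"
    #include "tbb/blocked_range.h"

    // Parallel prefix sum: y[i] = x[0] + x[1] + ... + x[i]; returns the total.
    float prefix_sum(const float *x, float *y, int n) {
        return tbb::parallel_scan(
            tbb::blocked_range<int>(0, n),
            0.0f,                                    // identity element for +
            [=](const tbb::blocked_range<int> &r, float sum, bool is_final) {
                for (int i = r.begin(); i != r.end(); ++i) {
                    sum += x[i];
                    if (is_final) y[i] = sum;        // write only on the final pass
                }
                return sum;
            },
            [](float left, float right) { return left + right; });
    }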

Generic Parallel Algorithms (continued)
Parallel algorithms for streams
- pipeline / parallel_pipeline
  - Linear pipeline of stages; you specify the maximum number of items that can be in flight
  - Each stage (filter) can be parallel, serial in-order, or serial out-of-order; a stage can also be thread-bound
  - Uses cache efficiently: each worker thread flies an item through as many stages as possible, and the scheduler biases towards finishing old items before tackling new ones
Others
- parallel_invoke: parallel execution of a number of user-specified functions
- parallel_sort: comparison sort with an average time complexity of O(N log N); when worker threads are available, parallel_sort creates subtasks that may be executed concurrently
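The slides give no code for the last two algorithms; as a hedged sketch, two independent sequences can be sorted concurrently, with each sort itself spawning subtasks:

    #include <vector>
    #include "tbb/parallel_invoke.h"
    #include "tbb/parallel_sort.h"

    void sort_both(std::vector<int> &a, std::vector<int> &b) {
        // parallel_invoke runs the two functors in parallel; each
        // parallel_sort creates further subtasks when workers are available.
        tbb::parallel_invoke(
            [&] { tbb::parallel_sort(a.begin(), a.end()); },
            [&] { tbb::parallel_sort(b.begin(), b.end()); });
    }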

The parallel_for Template
Requires definition of:
- A range type to iterate over
  - Must define a copy constructor and a destructor
  - Defines is_empty()
  - Defines is_divisible()
  - Defines a splitting constructor, R(R &r, split)
- A body type that operates on the range (or a subrange)
  - Must define a copy constructor and a destructor
  - Defines operator()

Body is Generic
Requirements for the parallel_for body: parallel_for partitions the original range into subranges, and deals out subranges to worker threads in a way that
- Balances load
- Uses cache efficiently
- Scales

Range is Generic
Requirements for the parallel_for range:
- The library provides the predefined ranges blocked_range and blocked_range2d
- You can define your own ranges
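As a hedged sketch of the predefined two-dimensional range (the row-major matrix layout and names are assumptions; requires a TBB release with C++11 lambda support):

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range2d.h"

    // Double every element of a rows x cols matrix stored in row-major order.
    void change_matrix(float *m, int rows, int cols) {
        tbb::parallel_for(
            tbb::blocked_range2d<int>(0, rows, 0, cols),
            [=](const tbb::blocked_range2d<int> &r) {
                // Each task receives a 2-D tile, which helps cache reuse.
                for (int i = r.rows().begin(); i != r.rows().end(); ++i)
                    for (int j = r.cols().begin(); j != r.cols().end(); ++j)
                        m[i * cols + j] *= 2;
            });
    }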

An Example using parallel_for (1 of 4)
Independent iterations and fixed/known bounds. The sequential code starting point:

    const int N = 100000;

    void change_array(float *array, int M) {
        for (int i = 0; i < M; i++) {
            array[i] *= 2;
        }
    }

    int main() {
        float A[N];
        initialize_array(A);
        change_array(A, N);
        return 0;
    }

An Example using parallel_for (2 of 4)
Include and initialize the library. The sequential main:

    int main() {
        float A[N];
        initialize_array(A);
        change_array(A, N);
        return 0;
    }

becomes:

    #include "tbb/task_scheduler_init.h"
    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"

    using namespace tbb;

    int main() {
        task_scheduler_init init;
        float A[N];
        initialize_array(A);
        parallel_change_array(A, N);
        return 0;
    }

An Example using parallel_for (3 of 4)
Use the parallel_for algorithm. The sequential loop

    void change_array(float *array, int M) {
        for (int i = 0; i < M; i++) {
            array[i] *= 2;
        }
    }

becomes a body class plus a parallel_for call:

    class ChangeArrayBody {
        float *array;
    public:
        ChangeArrayBody(float *a) : array(a) {}
        void operator()(const blocked_range<int> &r) const {
            for (int i = r.begin(); i != r.end(); i++) {
                array[i] *= 2;
            }
        }
    };

    void parallel_change_array(float *array, int M) {
        parallel_for(blocked_range<int>(0, M),
                     ChangeArrayBody(array),
                     auto_partitioner());
    }

An Example using parallel_for (4 of 4)
Use the parallel_for algorithm
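The code for this final step is not preserved in the transcription; a minimal sketch of the same example using the C++11 lambda overload of parallel_for, which removes the hand-written body class (available in recent TBB releases):

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"

    void parallel_change_array(float *array, int M) {
        // The lambda plays the role of ChangeArrayBody above.
        tbb::parallel_for(tbb::blocked_range<int>(0, M),
                          [=](const tbb::blocked_range<int> &r) {
                              for (int i = r.begin(); i != r.end(); i++)
                                  array[i] *= 2;
                          });
    }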

Parallel algorithm usage example

Outline
- Intel Threading Building Blocks
- Task-based programming
- Task Scheduler
- Scalable Memory Allocators
- Concurrent Containers
- Synchronization Primitives

Task Scheduler
- The task scheduler is the engine driving Intel Threading Building Blocks
  - Manages the thread pool, hiding the complexity of native thread management
  - Maps logical tasks to threads
- Parallel algorithms are based on the task scheduler interface
- The task scheduler is designed to address common performance issues of parallel programming with native threads
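The slides show no scheduler-level code here; as a hedged sketch, TBB's task_group interface hands tasks to this same scheduler directly. The naive recursion below is for clarity only; real code should fall back to a serial version for small n so each task amortizes scheduler overhead, as the closing slide of this lecture advises:

    #include "tbb/task_group.h"

    long fib(long n) {
        if (n < 2) return n;
        long x = 0, y = 0;
        tbb::task_group g;
        g.run([&] { x = fib(n - 1); });  // spawn one half as a task
        y = fib(n - 2);                  // compute the other half here
        g.wait();                        // join the spawned task
        return x + y;
    }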

Two Execution Orders

Work Depth First; Steal Breadth First

Another example: Quicksort, Steps 1 through 7 (a sequence of animation slides; the figures are not preserved in the transcription)
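As a hedged sketch of the technique those slides illustrate, a recursive parallel quicksort can be written with parallel_invoke; the cutoff value and all names below are assumptions, not from the slides:

    #include <algorithm>
    #include "tbb/parallel_invoke.h"

    const long CUTOFF = 1000;   // assumed threshold for serial sorting

    void parallel_quicksort(float *first, float *last) {
        if (last - first < CUTOFF) {
            std::sort(first, last);          // small ranges: sort serially
            return;
        }
        float pivot = *(first + (last - first) / 2);
        // Three-way partition: [first,mid1) < pivot, [mid1,mid2) == pivot,
        // [mid2,last) > pivot.
        float *mid1 = std::partition(first, last,
                                     [=](float x) { return x < pivot; });
        float *mid2 = std::partition(mid1, last,
                                     [=](float x) { return !(pivot < x); });
        // Recurse on both sides in parallel; idle workers steal one half.
        tbb::parallel_invoke([=] { parallel_quicksort(first, mid1); },
                             [=] { parallel_quicksort(mid2, last); });
    }

Each thread generates work depth-first while idle threads steal the larger, breadth-first pieces, matching the scheduling policy described above.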

The parallel_reduce Template

Numerical Integration Example (1 of 3)

parallel_reduce Example (2 of 3)

parallel_reduce Example (3 of 3)
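The example code on these slides is not preserved; a minimal sketch of the classic numerical-integration computation of pi using the lambda overload of parallel_reduce (the step count and names are assumptions):

    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"

    // Approximate pi as the integral of 4/(1+x^2) over [0,1], midpoint rule.
    double compute_pi(long num_steps) {
        double step = 1.0 / num_steps;
        double sum = tbb::parallel_reduce(
            tbb::blocked_range<long>(0, num_steps),
            0.0,                                     // identity of +
            [=](const tbb::blocked_range<long> &r, double local) {
                for (long i = r.begin(); i != r.end(); ++i) {
                    double x = (i + 0.5) * step;
                    local += 4.0 / (1.0 + x * x);
                }
                return local;                        // partial sum for subrange
            },
            [](double a, double b) { return a + b; });  // join partial sums
        return step * sum;
    }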

Task cancellation avoids unneeded work

Task cancellation example

Uncaught exceptions cancel task execution
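The code for these slides is not preserved; as a hedged sketch using the classic TBB task API, here is a parallel search that cancels the remaining work once any match is found (all names are illustrative):

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"
    #include "tbb/atomic.h"
    #include "tbb/task.h"

    tbb::atomic<int> found_index;   // zero-initialized at namespace scope

    // Returns some index i with data[i] == key, or -1 if none exists.
    int find_any(const int *data, int n, int key) {
        found_index = -1;
        tbb::parallel_for(
            tbb::blocked_range<int>(0, n),
            [=](const tbb::blocked_range<int> &r) {
                for (int i = r.begin(); i != r.end(); ++i) {
                    if (data[i] == key) {
                        found_index = i;
                        // Cancel all tasks of the enclosing algorithm.
                        tbb::task::self().cancel_group_execution();
                        return;
                    }
                }
            });
        return found_index;
    }

Uncaught exceptions behave similarly: an exception thrown from a body cancels the algorithm's remaining tasks and is rethrown at the call site.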

Outline
- Intel Threading Building Blocks
- Task-based programming
- Task Scheduler
- Scalable Memory Allocators
- Concurrent Containers
- Synchronization Primitives

Scalable Memory Allocators
- Serial memory allocation can easily become a bottleneck in multithreaded applications
  - Threads require mutual exclusion on the shared heap
- False sharing: threads accessing the same cache line
  - Even when they access distinct locations, the cache line can ping-pong between cores
- Intel Threading Building Blocks offers two choices for scalable memory allocation, similar to the STL template class std::allocator
  - scalable_allocator
    - Offers scalability, but not protection from false sharing
    - Memory is returned to each thread from a separate pool
  - cache_aligned_allocator
    - Offers both scalability and false-sharing protection
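A minimal sketch of plugging both allocators into STL containers (the vector element type is arbitrary):

    #include <vector>
    #include "tbb/scalable_allocator.h"
    #include "tbb/cache_aligned_allocator.h"

    // Scalable per-thread pools, but no false-sharing protection.
    std::vector<float, tbb::scalable_allocator<float> > a;

    // Scalable and padded out to cache-line boundaries.
    std::vector<float, tbb::cache_aligned_allocator<float> > b;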

Outline
- Intel Threading Building Blocks
- Task-based programming
- Task Scheduler
- Scalable Memory Allocators
- Concurrent Containers
- Synchronization Primitives

Concurrent Containers
- The TBB library provides highly concurrent containers
  - STL containers are not concurrency-friendly: attempting to modify them concurrently can corrupt the container
  - Standard practice is to wrap a lock around an STL container, which turns the container into a serial bottleneck
- The library instead provides fine-grained locking or lock-free implementations
  - Worse single-thread performance, but better scalability
- Can be used with the library, OpenMP, or native threads

Concurrent Containers: Key Features
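The feature list on this slide is not preserved; as a hedged usage sketch, a concurrent_hash_map can be updated safely from many tasks at once (the word-count use case and names are illustrative):

    #include <string>
    #include "tbb/concurrent_hash_map.h"

    typedef tbb::concurrent_hash_map<std::string, int> CountTable;

    void increment(CountTable &table, const std::string &word) {
        CountTable::accessor a;    // holds a write lock on the element
        table.insert(a, word);     // inserts {word, 0} if the key is absent
        a->second += 1;            // safe update while the accessor is held
    }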

Outline
- Intel Threading Building Blocks
- Task-based programming
- Task Scheduler
- Scalable Memory Allocators
- Concurrent Containers
- Synchronization Primitives

Synchronization Primitives
- Parallel tasks must sometimes touch shared data
  - When data updates might overlap, use mutual exclusion to avoid races
- High-level generic abstraction over hardware atomic operations
  - Atomically protects the update of a single variable
- Critical regions of code are protected by scoped locks
  - The extent of the lock is determined by its lifetime (scope)
  - Leaving the lock's scope calls the destructor, making it exception-safe
  - Minimizing lock lifetime avoids possible contention
- Several mutex behaviors are available

Atomic Execution
atomic<T>
- T should be an integral type or a pointer type
- Full type-safe support for 8-, 16-, 32-, and 64-bit integers
Operations (the slide's table of operations is not preserved in the transcription)
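As a hedged sketch of the operations tbb::atomic provides (increment and decrement operators, fetch_and_add, fetch_and_store, and compare_and_swap):

    #include "tbb/atomic.h"

    tbb::atomic<long> counter;   // zero-initialized at namespace scope

    void touch() {
        ++counter;                             // atomic increment
        long prev = counter.fetch_and_add(4);  // atomically add 4, return old value
        counter.compare_and_swap(0, prev + 5); // reset to 0 if value == prev + 5
    }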

Mutex Flavors
- spin_mutex
  - Non-reentrant, unfair, spins in user space
  - VERY FAST in lightly contended situations; use if you need to protect very few instructions
- queuing_mutex
  - Non-reentrant, fair, spins in user space
  - Use queuing_mutex when scalability and fairness are important
- queuing_rw_mutex
  - Non-reentrant, fair, spins in user space
- spin_rw_mutex
  - Non-reentrant, unfair, spins in user space
  - Use a reader-writer mutex to allow non-blocking reads by multiple threads
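A minimal sketch of the scoped-lock idiom described above, using spin_mutex (the shared counter is illustrative):

    #include "tbb/spin_mutex.h"

    tbb::spin_mutex my_mutex;
    long shared_total = 0;

    void add_to_total(long v) {
        // The lock is acquired here and released by the destructor
        // when the scope is left, even if the protected code throws.
        tbb::spin_mutex::scoped_lock lock(my_mutex);
        shared_total += v;
    }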

One last question: how many threads are available?
- Do not ask!
  - Not even the scheduler knows how many threads really are available
    - There may be other processes running on the machine
    - The routine may be nested inside other parallel routines
- Focus on dividing your program into tasks of sufficient size
  - A task should be big enough to amortize scheduler overhead
  - Choose decompositions with good depth-first cache locality and potential breadth-first parallelism
- Let the scheduler do the mapping

References
The content in this chapter comes from the UC Berkeley open course (http://parlab.eecs.berkeley.edu/2010bootcampagenda, Shared Memory Programming with TBB, Michael Wrinn), http://software.intel.com/en-us/courseware, and IDF2012: Task Parallel Evolution and Revolution, Intel Cilk Plus and Intel Threading Building Blocks