Intel Thread Building Blocks
SPD course 2017-18
Massimo Coppola, 23/03/2018

Thread Building Blocks: History
- A library to simplify writing and debugging thread-parallel programs
- Originated circa 2006 as a commercial product
- The first version was still very low-level, little more than a debugging tool: strong emphasis was put on how to performance-debug thread-parallel programs
- Several releases improved the abstraction level
- Current TBB is a programming model & runtime

Thread Building Blocks: Releases
- V4.4 stable, update 6, released Sept. 2016; Intel then changed the release naming scheme
- Latest release: TBB 2018 U2, Dec. 2017
- A C++-based pattern language for threads; supports generic programming and nested parallelism
- Dual licensed, with a separate version for industrial users:
  - Intel Simplified Software License: no commitment to support; no reverse engineering or decompilation
  - Open source version: stable versions (expected to be) aligned with the commercial ones, plus developer, source-only versions; used to be GPL v2, TBB 2017 moved to Apache 2.0
- Documentation is provided online: https://software.intel.com/en-us/tbb-reference-manual

Thread Building Blocks: Compatibility
- Source code on GitHub: https://github.com/01org/tbb/releases
- Multi-OS: Windows, Linux, OS X 10.11+ directly supported; Android support (Apache 2 version); more OS support in the open source version (e.g. FreeBSD 11)
- Several development environments:
  - Intel Parallel Studio 2018 Beta + other SW packages and tools from Intel, most notably Parallel STL
  - Microsoft Visual Studio 2017
  - Works with GCC, Clang, and Intel compilers (requires C++11 support)

What is TBB today
- A runtime and a template library for C++
- Eases writing threaded programs by raising the abstraction level: OS-portable thread programs (Win, Linux, OS X) and, of course, HW-independent programs
- Focus on task production/processing via threads, not on writing thread code
- C++ templates and classes for:
  - common forms of parallelism
  - data structures used by these parallel skeletons (heavy use of generics for expressiveness)
  - auxiliary data structures for parallelism management, e.g. a range to define the set of values of a parameter
- Use of operators to specify each skeleton's semantics: a form of encapsulation of sequential behaviour

TBB Features
- Portable environment: based on C++11 standard compilers, extensive use of templates
- No vectorization support (for portability): use the vector support of your specific compiler, and check the vectorization support in Parallel STL
- Full environment: compile time + runtime
- TBB supports patterns as well as other features: algorithms, containers, mutexes, tasks... a mix of high- and low-level mechanisms; the programmer must choose wisely

TBB Runtime support
- The runtime supports memory allocation, synchronization, and task management, providing operating-system-independent basic primitives
- Two support libraries, which can also be used independently:
  - one library for task generation, parallel patterns, and task scheduling onto threads
  - a specific library for scalable memory allocation
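As a sketch of how the memory library can be used on its own, as mentioned above (assuming the program links against tbbmalloc; the float vector is just an example):

#include <vector>
#include "tbb/scalable_allocator.h"

// A standard container whose storage is served by the TBB scalable
// allocator; the task/pattern library is not involved at all
std::vector<float, tbb::scalable_allocator<float>> samples;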

TBB layers
All TBB architectural elements are present in the user API, except the actual threads:
- Parallel algorithms: generic and scalable: for, reduce, work pile, scan, pipeline, flow graph...
- Concurrent containers: vectors, hash tables, queues
- Tasks: scheduler, work stealing, groups, over/under subscription
- Synchronization: atomic ops, mutexes, condition variables
- Memory: scalable memory allocation, false-sharing avoidance, thread-local storage
- Utility: cross-thread timers
- Threads

Threads and composability
- Composing parallel patterns: a pipeline of farms of maps of farms, or a parallel for nested in a parallel loop within a pipeline; each construct can express more potential parallelism
- Deep nesting → too many threads → overhead
- Potential parallelism should be expressed: it is difficult or impossible for the compiler to extract
- Actual parallelism should be flexibly tuned: it is messy for the programmer to define and optimize, and performance is hardly portable
- TBB solution: potential parallelism = tasks; actual parallelism = threads; mapping tasks over threads is largely automated and performed at run time

Tasks vs threads
- A task is the unit of computation in TBB: it can be executed in parallel with other tasks; the computation is carried out by a thread; mapping tasks onto threads is a choice of the runtime; the TBB user can provide hints on the mapping
- Effects: allow hierarchical pattern composability, raise the level of abstraction, avoid dealing with different thread semantics, increase run-time portability across different architectures, adapt to different numbers of cores/threads per core

Basic TBB abstractions
- TBB algorithms, i.e. the templates actually expressing thread (task) parallel computation
- Data container classes specific to TBB
- A few C++ Concepts, i.e. sets of template requirements that allow combining C++ data container classes with parallel patterns: Splittable, Range
- Lower-level mechanisms (thread storage, mutexes) that allow competent programmers to implement new abstractions and solve special cases

Some supported abstractions
More patterns are added with each version:
- parallel_for
- lambda expressions
- parallel_reduce
- parallel_do
- pipeline, extended to DAGs as supersets of pipelines
- concurrency-safe containers
- mutex helper objects
- atomic<T> template (atomic operations)

Parallel for (and partitioners)
- Expresses independent task computations: parallel_for(iteration space, function)
- Exploits a blocked_range template to express the iteration space; ranges can be recursively split by the library; 1D, 2D, 3D blocked ranges as of TBB 4.0
- Automatic dispatch to independent threads: heuristics within the library, but it can be customized
  - specify an optional partitioner to the parallel_for
  - specify a grainsize parameter in the range
- Partitioners allow customizing the way ranges are split in order to obtain tasks amenable to concurrent computation; grainsize is the standard parameter of partitioners

Parallel_for minimal example

#include "tbb/tbb.h"
using namespace tbb;

class ApplyFoo {
    float *const my_a;
public:
    // Body object: applies Foo to every element of the sub-range r
    void operator()( const blocked_range<size_t>& r ) const {
        float *a = my_a;
        for( size_t i = r.begin(); i != r.end(); ++i )
            Foo(a[i]);
    }
    ApplyFoo( float a[] ) : my_a(a) {}
};

void ParallelApplyFoo( float a[], size_t n ) {
    parallel_for( blocked_range<size_t>(0,n), ApplyFoo(a) );
}
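For comparison, a sketch of the same computation written with a C++11 lambda instead of a hand-written body class (this compact form is part of the TBB API; Foo is the same user function as above):

void ParallelApplyFoo( float a[], size_t n ) {
    // The capture [=] copies the pointer a into the unnamed body object
    tbb::parallel_for( tbb::blocked_range<size_t>(0,n),
        [=]( const tbb::blocked_range<size_t>& r ) {
            for( size_t i = r.begin(); i != r.end(); ++i )
                Foo(a[i]);
        } );
}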

Scheduling tasks to threads
- The partitioner creates multiple tasks by decomposing a range, until we get enough parallelism OR we reach the minimum task size
- The task scheduler dispatches tasks to threads
  - automatically created by the library
  - customizable by the program to suit user needs: scheduler creation/destruction time, number of created threads, stack size for threads
  - customizable per construct via construct parameters
- Much more about the scheduler in the docs; the task scheduler also deals with pipelines and workflows
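A minimal sketch of the explicit scheduler control mentioned above, using the task_scheduler_init interface of TBB 2018-era releases (the thread count and stack size are arbitrary example values):

#include "tbb/task_scheduler_init.h"

int main() {
    // Create the scheduler explicitly: 4 worker threads,
    // each with a 16 MB stack (both values are just examples)
    tbb::task_scheduler_init init( 4, 16*1024*1024 );

    // ... run parallel_for / parallel_reduce here ...

    return 0;  // scheduler destroyed when init goes out of scope
}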

Partitioners and choosing grain size
- As always, a small grain size → high overhead. Intel suggests about 100,000 clock cycles per grain, and also suggests an experimental procedure to set it
- You are expected to already know the issues, and to take into account the number of cores and the load balancing details of your algorithm
- Cache affinity can impact performance: the affinity partitioner tries to exploit it when scheduling tasks to threads

Type      Use                                                 Conditions
simple    Chunks given by grain size (default until TBB 2.2)  g/2 < chunk size < g
auto      Automatic size (heuristics, default nowadays)       g/2 < chunk size
affinity  Automatic size (heuristics to exploit affinity)     g/2 < chunk size
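A minimal sketch of how a grainsize and an explicit partitioner are passed in practice, reusing the ApplyFoo body from the earlier example (the grainsize of 1000 is an arbitrary example value):

// simple_partitioner: split until chunks reach the grainsize (1000 here)
tbb::parallel_for( tbb::blocked_range<size_t>(0, n, 1000),
                   ApplyFoo(a), tbb::simple_partitioner() );

// An affinity_partitioner must outlive the call, so it can remember
// the task-to-thread mapping and replay it on later invocations
tbb::affinity_partitioner ap;
tbb::parallel_for( tbb::blocked_range<size_t>(0, n), ApplyFoo(a), ap );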

Lambda expressions
- Unnamed functions defined by the C++11 standard, formerly known as C++0x (ISO/IEC 14882:2011, released September 2011)
- Use a stereotyped syntax for defining an unnamed free function in place: [ variable_scope ] type_def function_def; with some support for storing the definition
- Capture all variables which are used inside, but defined outside, the function
- The variable scope spec can dictate capturing by reference, by value, or disallow use. In general: [] disallows capture, [=] captures by value, [&] by reference; for specific variables, e.g. [=,&z] captures all by value, with only z by reference
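A short illustration of the capture modes listed above (plain C++11, no TBB involved):

#include <cassert>

int main() {
    int x = 1, z = 0;
    auto by_value = [=]   { return x + 1; };  // x copied; cannot be modified
    auto by_ref   = [&]   { z = x; };         // refers to the outer x and z
    auto mixed    = [=,&z]{ z = x + 1; };     // x by value, only z by reference
    mixed();
    assert( z == 2 && by_value() == 2 );
    by_ref();                                 // sets z back to 1
    return 0;
}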

Reusing concepts: Parallel reduce
- A brief introduction; we will come back to it!
- Expresses the parallel reduction pattern; its basic form is analogous to the parallel for: parallel_reduce(iteration_space, function)
- The iteration space is also defined as a blocked_range
- The function to apply has a different C++ template type w.r.t. the parallel loop: the reduce operator does not have the same const requirements as the one used in a for
- Also accepts an optional partitioner
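As a preview, a minimal sketch of the functional (lambda) form of parallel_reduce, summing an array; this form takes an identity value, a sub-range reducer, and a combiner (ParallelSum is a hypothetical example function):

#include <functional>
#include "tbb/tbb.h"

float ParallelSum( const float a[], size_t n ) {
    return tbb::parallel_reduce(
        tbb::blocked_range<size_t>(0, n),
        0.0f,                                   // identity of the reduction
        [=]( const tbb::blocked_range<size_t>& r, float partial ) -> float {
            for( size_t i = r.begin(); i != r.end(); ++i )
                partial += a[i];                // reduce one sub-range
            return partial;
        },
        std::plus<float>() );                   // combine partial results
}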

Container data structures (I)
- Data structures which are very often used in programs, but whose thread-safe implementation is not trivial, or does not match the standard semantics
- Special care is taken to avoid decreasing program performance
- concurrent_hash_map
  - constant or update access to elements
  - access to an element can block other threads
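A minimal sketch of the blocking element access just mentioned: an accessor (update) or const_accessor (read) acts as a fine-grained lock on one element while it is alive (the string-to-int table is just an example):

#include <string>
#include "tbb/concurrent_hash_map.h"

typedef tbb::concurrent_hash_map<std::string,int> CountTable;

void Increment( CountTable& table, const std::string& key ) {
    CountTable::accessor a;           // write lock on one element
    table.insert( a, key );           // finds or creates the entry
    a->second += 1;                   // other threads block on this key
}                                     // lock released when a is destroyed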

Container data structures (II)
- concurrent_vector
  - Random access by index; the index of the first element is zero
  - Growing the container does not invalidate existing iterators or indices; multiple threads can grow the container and append new elements concurrently
  - Destroying elements is not thread-safe
  - Does not move its elements in memory when growing (and has no insert() or erase()); growing by too small a size increases memory fragmentation
  - Operations on the whole vector are not thread-safe and can move elements in memory (and reduce fragmentation), notably reserve() and shrink_to_fit()
  - Meets the requirements for Container and Reversible Container as specified in the ISO C++ standard; it does not meet the Sequence requirements, due to the absence of the methods insert() and erase()
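A minimal sketch of concurrent growth: push_back is the thread-safe append operation (the parallel_for filling loop and the results vector are just an example):

#include "tbb/concurrent_vector.h"
#include "tbb/parallel_for.h"

tbb::concurrent_vector<int> results;

void CollectSquares( size_t n ) {
    tbb::parallel_for( size_t(0), n, []( size_t i ) {
        // Safe from many threads; returns an iterator to the new element
        results.push_back( int(i * i) );
    } );
    // Element order depends on scheduling; indices stay valid as it grows
}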

Container data structures (III)
- concurrent_queue
  - Simultaneous push/pop from concurrent threads
  - Ensures serialization and preserves object order
  - A bottleneck if improperly used
  - Operations: pop / push / try_push / size
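A minimal sketch of a producer/consumer pair; note that in current TBB the blocking pop and try_push listed above live in the concurrent_bounded_queue variant, used here (the int payload is just an example):

#include "tbb/concurrent_queue.h"

tbb::concurrent_bounded_queue<int> work;

void Producer() {
    for( int i = 0; i < 100; ++i )
        work.push( i );              // blocks only if a capacity is set
}

void Consumer() {
    int item;
    work.pop( item );                // blocks until an item is available
    // try_pop(item) is the non-blocking alternative
}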

Mutexes
- Classes used to build lock objects. The new lock object will generally:
  - wait for the lock according to specific semantics
  - lock the object
  - release the lock when destroyed
- Several characteristics of mutexes: scalable, fair, recursive, yield/block
- Check the implementations in the docs: mutex, recursive_mutex, spin_mutex, queueing_mutex, spin_rw_mutex, queueing_rw_mutex, null_mutex, null_rw_mutex
- Specific reader/writer locks provide an upgrade/downgrade operation to change the r/w role
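A minimal sketch of the scoped-lock idiom described above, using spin_mutex, one of the listed implementations (the shared counter is just an example):

#include "tbb/spin_mutex.h"

long counter = 0;
tbb::spin_mutex counter_mutex;

void SafeIncrement() {
    // The constructor spins until it acquires the mutex
    tbb::spin_mutex::scoped_lock lock( counter_mutex );
    ++counter;
}   // the destructor releases the lock, even if an exception is thrown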

References
- Download docs and code from http://threadingbuildingblocks.org/
- Check the accompanying docs:
  - Getting Started: install and first compilation example ← TRY IT
  - Tutorial: a tour of the main functionalities, with examples
  - Reference
- Quick summaries of lambda expressions in C++:
  - http://www.cprogramming.com/c++11/c++11-lambdaclosures.html
  - http://www.nacad.ufrj.br/online/intel/documentation/en_us/compiler_c/main_cls/cref_cls/common/cppref_lambda_lambdacapt.htm#cppref_lambda_lambdacapt