HPX: The C++ Standards Library for Concurrency and Parallelism. Hartmut Kaiser


HPX: The C++ Standards Library for Concurrency and Parallelism. Hartmut Kaiser (hkaiser@cct.lsu.edu)

HPX: A General Purpose Runtime System

The C++ Standards Library for Concurrency and Parallelism:
- Exposes a coherent and uniform, C++ standards-conforming API for ease of programming parallel, distributed, and heterogeneous applications
- Enables writing fully asynchronous code using hundreds of millions of threads
- Provides unified syntax and semantics for local and remote operations
- Enables seamless data parallelism, orthogonal to task-based parallelism

HPX represents an innovative mixture of:
- A global system-wide address space (AGAS - Active Global Address Space)
- Fine-grain parallelism and lightweight synchronization
- Combined with implicit, work-queue-based, message-driven computation
- Support for hardware accelerators

HPX: A C++ Standard Library

- Widely portable
  - Platforms: x86/64, Xeon Phi, ARM 32/64, Power, BlueGene/Q
  - Operating systems: Linux, Windows, Android, OS X
- Well integrated with the compiler's C++ Standard library
- Enables writing applications which out-perform and out-scale existing applications based on OpenMP/MPI
  - http://stellar-group.org/libraries/hpx
  - https://github.com/stellar-group/hpx/
- Published under the Boost license; has an open, active, and thriving developer community
- Can be used as a platform for research and experimentation

HPX: A C++ Standard Library for Concurrency and Parallelism

[Architecture diagram: the HPX runtime stack. The HPX API (C++1y concurrency/parallelism APIs) sits on top; a policy engine/policies layer steers the Threading Subsystem, Local Control Objects (LCOs), the Performance Counter Framework, the Active Global Address Space (AGAS), and the Parcel Transport Layer; everything runs on top of the OS.]

Programming Model

- Focus on the logical composition of data processing, rather than the physical orchestration of parallel computation
- Provide useful abstractions that shield the programmer from low-level details of parallel and distributed processing
- Centered around data dependencies, not communication patterns
  - Making data dependencies explicit to the system allows for "automagic" parallelization
- Basis for various types of higher-level parallelism, such as iterative, fork-join, continuation-style, asynchronous, and data-parallelism
- Enables runtime-based adaptivity while applying application-defined policies

Programming Model

- The consequent application of the concept of futures
  - Makes data dependencies explicit and visible to the runtime (see the sketch below)
  - Implicit and explicit asynchrony
  - Transparently hides communication and other latencies
  - Makes over-subscription manageable
- Uniform API for local and remote operation
  - Local operation: create a new thread
  - Remote operation: send a parcel (active message), create a thread on behalf of the sender
- Work-stealing scheduler
  - Inherently multi-threaded environment
  - Supports millions of concurrently active threads with minimal thread overhead
  - Enables transparent load balancing of work across all execution resources inside a locality
- The API is fully conforming with C++11/C++14 and ongoing standardization efforts
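As a minimal sketch (illustrative, not taken from the slides; produce/consume are hypothetical helpers), this is how a data dependency is expressed through a future so the runtime can schedule the dependent work as soon as its input is ready:

#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>

int produce() { return 42; }                   // work producing a value
int consume(int value) { return value * 2; }   // work depending on that value

hpx::future<int> pipeline()
{
    // The dependency is carried by the future, not by explicit synchronization;
    // the continuation runs once the producer's result becomes available.
    hpx::future<int> f = hpx::async(produce);
    return f.then([](hpx::future<int> r) { return consume(r.get()); });
}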

HPX: The API

As close as possible to the C++11/14/17 standard library, where appropriate, for instance:

  std::thread                      -> hpx::thread
  std::mutex                       -> hpx::mutex
  std::future                      -> hpx::future    (including N4538, Concurrency TS)
  std::async                       -> hpx::async     (including N3632)
  std::bind                        -> hpx::bind
  std::function                    -> hpx::function
  std::tuple                       -> hpx::tuple
  std::any (N3508)                 -> hpx::any
  std::cout                        -> hpx::cout
  std::for_each(par, ...), etc.    -> hpx::parallel::for_each    (N4507, TS, C++17)
  std::experimental::task_block    -> hpx::parallel::task_block  (N4411)
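As an illustrative sketch (not from the slides) of how far this mapping goes in practice, a complete HPX program can look almost like its standard-library counterpart; this assumes an installed HPX, where including <hpx/hpx_main.hpp> lets a plain main() run on an HPX thread:

#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/iostreams.hpp>

int main()
{
    // hpx::async/hpx::future mirror std::async/std::future, but schedule
    // lightweight HPX threads instead of kernel threads.
    hpx::future<int> f = hpx::async([] { return 6 * 7; });

    hpx::cout << f.get() << hpx::endl;   // hpx::cout mirrors std::cout
    return 0;
}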

Control Model

How is parallelism achieved? Explicit parallelism:

- Low-level: thread
- Middle-level: async(), dataflow(), future::then()
- Higher-level constructs:
  - Parallel algorithms (parallel::for_each and friends; fork-join parallelism for homogeneous tasks)
  - Asynchronous algorithms (alleviate the bad effects of fork/join)
  - Task blocks (fork-join parallelism for heterogeneous tasks; see the sketch below)
  - Asynchronous task blocks
  - Continuation-style parallelism based on composing futures (task-based parallelism)
  - Data parallelism on accelerator architectures (vector ops, GPUs)
    - The same code is used for CPUs and accelerators
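A minimal sketch of a task block, assuming HPX's implementation of the N4411 task-block interface (the header name and the compute_int/compute_double helpers are assumptions):

#include <hpx/include/parallel_task_block.hpp>

int compute_int();          // hypothetical heterogeneous tasks
double compute_double();

void heterogeneous_work()
{
    int a = 0;
    double b = 0.0;

    // Both child tasks may run in parallel; the block joins them before returning.
    hpx::parallel::define_task_block([&](auto& trh) {
        trh.run([&] { a = compute_int(); });
        trh.run([&] { b = compute_double(); });
    });   // implicit join here
}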

Parallel Algorithms (C++17)
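As an illustrative sketch (not from the slide; the data and the abbreviated namespaces follow the later slides, where parallel:: stands for hpx::parallel and par/par(task) are execution policies), the same algorithm can be run in parallel, or asynchronously so that it returns a future instead of blocking:

std::vector<double> v(1000000, 1.0);

// fork-join parallel execution
double sum = parallel::reduce(par, v.begin(), v.end(), 0.0);

// asynchronous execution: the algorithm immediately returns a future
hpx::future<double> fsum = parallel::reduce(par(task), v.begin(), v.end(), 0.0);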

STREAM Benchmark

std::vector<double> a, b, c;   // data
// ... init data
auto a_begin = a.begin(), a_end = a.end(), b_begin = b.begin() ...;

// STREAM benchmark
// copy step: c = a
parallel::copy(par, a_begin, a_end, c_begin);

// scale step: b = k * c
parallel::transform(par, c_begin, c_end, b_begin,
    [](double val) { return 3.0 * val; });

// add two arrays: c = a + b
parallel::transform(par, a_begin, a_end, b_begin, b_end, c_begin,
    [](double val1, double val2) { return val1 + val2; });

// triad step: a = b + k * c
parallel::transform(par, b_begin, b_end, c_begin, c_end, a_begin,
    [](double val1, double val2) { return val1 + 3.0 * val2; });

Dot-product: Vectorization

std::vector<float> data1 = {...};
std::vector<float> data2 = {...};

double p = parallel::inner_product(
    datapar,                                     // parallel and vectorized execution
    std::begin(data1), std::end(data1),
    std::begin(data2),
    0.0f,
    [](auto t1, auto t2) { return t1 + t2; },    // std::plus<>()
    [](auto t1, auto t2) { return t1 * t2; }     // std::multiplies<>()
);

Control Model

How is synchronization expressed?

- Low-level (thread-level) synchronization: mutex, condition_variable, etc.
- Replace (global) barriers with finer-grain synchronization (synchronize on an as-needed basis)
  - Wait only for immediately necessary dependencies, forward progress as much as possible
- Many APIs hand out a future representing the result
  - Parallel and sequential composition of futures (future::then(), when_all(), etc.; see the sketch below)
- Orchestration of parallelism through launching and synchronizing with asynchronous tasks
- Synchronization primitives: barrier, latch, semaphores, channel, etc.
  - Synchronize using futures
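A minimal sketch of composing futures instead of using a global barrier (compute_a/compute_b are hypothetical helpers; the dataflow/unwrapped style matches the gather_async example later in the talk):

hpx::future<int>    fa = hpx::async(compute_a);
hpx::future<double> fb = hpx::async(compute_b);

// Parallel composition: the continuation runs as soon as both inputs are
// ready; nothing else is forced to wait.
hpx::future<double> combined = hpx::dataflow(
    hpx::util::unwrapped([](int a, double b) { return a + b; }),
    fa, fb);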

Synchronization with Futures

A future is an object representing a result which has not been calculated yet.

[Diagram: a consumer thread on locality 1 asks a future object for its value and suspends; the worker executes another thread in the meantime, while the producer thread on locality 2 computes the result; once the result is returned, the consumer thread resumes.]

Futures:
- Enable transparent synchronization with the producer
- Hide the notion of dealing with threads
- Make asynchrony manageable
- Allow for composition of several asynchronous operations (turning concurrency into parallelism)
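A minimal producer/consumer sketch of this behaviour (headers omitted; assumes an HPX build, with hpx::lcos::local::promise standing in for the distributed case):

hpx::lcos::local::promise<int> p;
hpx::future<int> f = p.get_future();               // consumer's handle to the result

hpx::thread producer([&p] { p.set_value(42); });   // producer supplies the value

int result = f.get();   // consumer suspends here (only) until the value arrives
producer.join();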

Data Model

- AGAS is the essential underpinning for all data management
  - Foundation for the syntactic and semantic equivalence of local and remote operations
- The full spectrum of C++ data structures is available, either as distributed data structures or for SPMD-style computation
  - Explicit data partitioning with manually orchestrated boundary exchange, using existing synchronization primitives (for instance channels; see the sketch below)
  - Use of distributed data structures, like partitioned_vector
  - Use of parallel algorithms
  - Use of a co-array-like layer (FORTRAN users like that)
- Load balancing: migration
  - Move objects around between nodes without stopping the application
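A minimal sketch of channel-based boundary exchange (illustrative; a local channel stands in for the distributed hpx::lcos::channel, and the values are assumed): one partition publishes its ghost-zone values, the neighbouring partition receives them as a future and continues only when they are available.

hpx::lcos::local::channel<std::vector<double>> boundary;

// producer side: publish this partition's boundary values
boundary.set(std::vector<double>{1.0, 2.0, 3.0});

// consumer side: receive asynchronously; get() yields a future
hpx::future<std::vector<double>> ghost = boundary.get();
std::vector<double> values = ghost.get();   // suspends only if nothing was sent yet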

Small Example

Extending Parallel Algorithms

Sean Parent: C++ Seasoning, Going Native 2013

Extending Parallel Algorithms

New algorithm: gather

template <typename BiIter, typename Pred>
pair<BiIter, BiIter> gather(BiIter f, BiIter l, BiIter p, Pred pred)
{
    BiIter it1 = stable_partition(f, p, not1(pred));
    BiIter it2 = stable_partition(p, l, pred);
    return make_pair(it1, it2);
}

Sean Parent: C++ Seasoning, Going Native 2013

Extending Parallel Algorithms

New algorithm: gather_async

template <typename BiIter, typename Pred>
future<pair<BiIter, BiIter>> gather_async(BiIter f, BiIter l, BiIter p, Pred pred)
{
    future<BiIter> f1 = parallel::stable_partition(par(task), f, p, not1(pred));
    future<BiIter> f2 = parallel::stable_partition(par(task), p, l, pred);
    return dataflow(
        unwrapped([](BiIter r1, BiIter r2) { return make_pair(r1, r2); }),
        f1, f2);
}

Extending Parallel Algorithms (await)

New algorithm: gather_async

template <typename BiIter, typename Pred>
future<pair<BiIter, BiIter>> gather_async(BiIter f, BiIter l, BiIter p, Pred pred)
{
    future<BiIter> f1 = parallel::stable_partition(par(task), f, p, not1(pred));
    future<BiIter> f2 = parallel::stable_partition(par(task), p, l, pred);
    co_return make_pair(co_await f1, co_await f2);
}
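An illustrative usage sketch (hypothetical data; the small is_even functor exposes argument_type because gather_async applies not1() to the predicate): gather all even numbers around the middle of the sequence without blocking the calling thread.

struct is_even
{
    using argument_type = int;
    bool operator()(int i) const { return i % 2 == 0; }
};

std::vector<int> v = { /* ... */ };
auto fut = gather_async(v.begin(), v.end(), v.begin() + v.size() / 2, is_even{});
auto range = fut.get();   // pair of iterators delimiting the gathered elements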
