Parallel STL in today's SYCL, Ruymán Reyes

1 Parallel STL in today's SYCL
Ruymán Reyes, ruyman@codeplay.com
Codeplay Research
15th November, 2016

2 Outline
1. Parallelism TS
2. The SYCL parallel STL
3. Heterogeneous Execution with Parallel STL
4. Conclusions and Future Work

3 The presenter
Ruyman Reyes, PhD
- Background in HPC, programming models and compilers
- Worked on HPC scientific codes (ScaLAPACK, GROMACS, CP2K)
- Created the first open-source OpenACC implementation
- Contributor to the SYCL specification
- Lead of ComputeCpp (Codeplay's SYCL implementation)
- Coordinating the work on SYCL Parallel STL

4 Codeplay Software
We build software development tools for SoCs.
- Software company based in Edinburgh
- 42 developers with different backgrounds and skill sets
  - Games industry, AI, compilers, HPC, robotics
  - Various levels of expertise (graduates to PhD)
- Customers work in all areas of industry
  - Smartphones
  - Self-driving cars
  - Game consoles
- Our technology is probably in your pocket!

6 Parallel STL: Democratizing Parallelism in C++
- Various libraries offered an STL-like interface for parallel algorithms
  - Thrust, Bolt, libstdc++ Parallel Mode, AMP algorithms
- In 2012, two separate proposals to add parallelism to the C++ standard:
  - NVIDIA (N3408), based on Thrust (CUDA-based C++ library)
  - Microsoft and Intel (N3429), based on Intel TBB and PPL/C++ AMP
- A joint proposal (N3554) was made at the suggestion of SG1
- Many working drafts: N3554, N3850, N3960, N4071, N4409
- Final proposal P0024R2 accepted for C++17 during the Jacksonville meeting
- Latest status is in the C++ draft on GitHub

7 Existing implementations
Following the evolution of the document:
- Microsoft
- HPX: manual/parallel.html
- HSA
- Thibaut Lutz
- NVIDIA
- Codeplay

8 What is the Parallelism TS adding?
A set of execution policies and a collection of parallel algorithms:
- The execution policies
- Paragraphs explaining the conditions for parallel algorithms
- New parallel algorithms
- The exception_list class

9 Sorting with the STL
A sequential sort

    std::vector<int> data = { 8, 9, 1, 4 };
    std::sort(std::begin(data), std::end(data));
    if (std::is_sorted(std::begin(data), std::end(data))) {
      cout << "Data is sorted!" << endl;
    }

11 Sorting with the STL
A parallel sort

    std::vector<int> data = { 8, 9, 1, 4 };
    std::sort(std::par, std::begin(data), std::end(data));
    if (std::is_sorted(std::begin(data), std::end(data))) {
      cout << "Data is sorted!" << endl;
    }

- par is an execution policy object
- The sort will be executed in parallel using an implementation-defined method

12 The Execution Policy
Standard policy classes, defined in the execution namespace:
- class sequenced_policy: never executes in parallel
- class parallel_policy: can use the caller thread, but may spawn others (e.g., std::thread). Invocations do not interleave on a single thread
- class parallel_unsequenced_policy: can use the caller thread or others (e.g., std::thread). Multiple invocations may be interleaved on a single thread
Global objects:

    constexpr sequenced_policy sequenced;  // named seq in the published C++17 standard
    constexpr parallel_policy par;
    constexpr parallel_unsequenced_policy par_unseq;
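
One way to picture how an algorithm can act on these policy classes is overload selection on the policy type. The sketch below is only an illustration under stated assumptions: the `demo` namespace, its stand-in policy structs and the `describe` function are hypothetical names, not the standard ones, so the code builds without any parallel backend. It does not claim to show how a particular standard library implements the policies.

```cpp
#include <cassert>
#include <string>

namespace demo {
// Hypothetical stand-ins for the three policy classes listed on the slide.
struct sequenced_policy {};
struct parallel_policy {};
struct parallel_unsequenced_policy {};

// Global policy objects, following the slide's naming.
constexpr sequenced_policy sequenced{};
constexpr parallel_policy par{};
constexpr parallel_unsequenced_policy par_unseq{};

// One overload per policy type: the compiler picks the implementation from
// the static type of the argument, so the choice costs nothing at run time.
inline std::string describe(sequenced_policy) { return "never parallel"; }
inline std::string describe(parallel_policy) { return "may run in parallel"; }
inline std::string describe(parallel_unsequenced_policy) {
  return "may run in parallel and be vectorized";
}
}  // namespace demo
```

A call such as `demo::describe(demo::par)` resolves at compile time, which is the same mechanism that lets `std::sort(policy, ...)` select a different implementation per policy.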

13 The Execution Policy
Choosing different parallel implementations:

    // May execute in parallel
    std::sort(std::par, std::begin(data), std::end(data));
    // May be parallelized and vectorized
    std::sort(std::par_unseq, std::begin(data), std::end(data));
    // Will not be parallelized
    std::sort(std::sequenced, std::begin(data), std::end(data));

Propagating the policy to the end user:

    template <typename Policy, typename Iterator>
    void library_function(Policy p, Iterator begin, Iterator end) {
      std::sort(p, begin, end);
      std::for_each(p, begin, end,
                    [&](typename std::iterator_traits<Iterator>::value_type &e) { e++; });
      std::for_each(std::sequenced, begin, end, non_parallel_operation);
    }

Implementations can define their own execution policies.
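
The last point, that implementations can define their own policies, can be sketched as follows. Everything here is an assumption made for illustration: the `demo` namespace, the `counting_policy` type (a made-up vendor policy that carries a counter) and the local `is_execution_policy` trait are hypothetical and only mimic the shape of the real interfaces.

```cpp
#include <algorithm>
#include <cassert>
#include <type_traits>
#include <vector>

namespace demo {
struct sequenced_policy {};

// Hypothetical vendor-defined policy: unlike the empty standard tags,
// it carries state (here a pointer to a launch counter).
struct counting_policy {
  int *launches;
};

// Local trait marking which types count as execution policies.
template <typename T> struct is_execution_policy : std::false_type {};
template <> struct is_execution_policy<sequenced_policy> : std::true_type {};
template <> struct is_execution_policy<counting_policy> : std::true_type {};

template <typename It>
void sort_impl(sequenced_policy, It first, It last) {
  std::sort(first, last);
}

template <typename It>
void sort_impl(counting_policy p, It first, It last) {
  ++*p.launches;           // record that this policy's path was taken
  std::sort(first, last);  // a real policy would dispatch to its own backend
}

// Policy-taking overload, enabled only for types the trait accepts.
template <typename Policy, typename It>
typename std::enable_if<
    is_execution_policy<typename std::decay<Policy>::type>::value>::type
sort(Policy &&p, It first, It last) {
  sort_impl(p, first, last);
}

// Library code forwards whatever policy the caller chose.
template <typename Policy, typename It>
void library_function(Policy p, It first, It last) {
  demo::sort(p, first, last);
}
}  // namespace demo
```

Because `library_function` is generic over the policy, a user can pass `demo::counting_policy{&counter}` without the library knowing that type in advance.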

14 Dealing with exceptions
- Different execution threads may abort with different exceptions
- Parallel STL algorithms may throw an exception_list
- Note that an uncaught exception under the parallel_unsequenced_policy will cause a call to terminate

    class exception_list : public exception {
     public:
      typedef unspecified iterator;
      size_t size() const noexcept;
      iterator begin() const noexcept;
      iterator end() const noexcept;
      virtual const char *what() const noexcept;
    };
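
To make the idea of an exception list concrete, here is a minimal sketch of a class with the interface shown above, plus a helper that collects every exception raised while visiting a range. The names `demo::exception_list`, `for_each_collecting` and `count_failures` are hypothetical, and the loop is sequential; it only illustrates the collect-then-rethrow pattern and is not the TS implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <utility>
#include <vector>

namespace demo {
// Holds one std::exception_ptr per failed invocation.
class exception_list : public std::exception {
 public:
  using iterator = std::vector<std::exception_ptr>::const_iterator;
  explicit exception_list(std::vector<std::exception_ptr> errors)
      : errors_(std::move(errors)) {}
  std::size_t size() const noexcept { return errors_.size(); }
  iterator begin() const noexcept { return errors_.begin(); }
  iterator end() const noexcept { return errors_.end(); }
  const char *what() const noexcept override { return "demo::exception_list"; }

 private:
  std::vector<std::exception_ptr> errors_;
};

// Applies f to every element; failures are gathered instead of stopping
// at the first one, then reported together.
template <typename It, typename F>
void for_each_collecting(It first, It last, F f) {
  std::vector<std::exception_ptr> errors;
  for (; first != last; ++first) {
    try {
      f(*first);
    } catch (...) {
      errors.push_back(std::current_exception());
    }
  }
  if (!errors.empty()) throw exception_list(std::move(errors));
}

// Example caller: counts how many elements made the functor throw.
inline std::size_t count_failures(const std::vector<int> &values) {
  try {
    for_each_collecting(values.begin(), values.end(), [](int v) {
      if (v < 0) throw std::invalid_argument("negative value");
    });
  } catch (const exception_list &list) {
    return list.size();
  }
  return 0;
}
}  // namespace demo
```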

15 Parallel Algorithms
- Overloads to STL algorithms taking the SYCL Execution Policy
- Not all STL algorithms are suitable for parallel execution!

16 Introducing new algorithms to the STL
For Each

    template <class ExecutionPolicy, class InputIterator, class Function>
    void for_each(ExecutionPolicy &&exec, InputIterator first, InputIterator last,
                  Function f);

    template <class ExecutionPolicy, class InputIterator, class Size, class Function>
    InputIterator for_each_n(ExecutionPolicy &&exec, InputIterator first, Size n,
                             Function f);

    template <class InputIterator, class Size, class Function>
    InputIterator for_each_n(InputIterator first, Size n, Function f);

- for_each: applies f to the elements in [first, last)
- for_each_n: applies f to the elements in [first, first + n)
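
As a small usage sketch of the policy-free `for_each_n` overload, which became `std::for_each_n` in C++17: the helper name `double_first_n` is hypothetical, and the caller is assumed to pass `n <= v.size()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Doubles the first n elements of v in place and returns how many elements
// after that prefix were left untouched. Assumes n <= v.size().
inline std::size_t double_first_n(std::vector<int> &v, std::size_t n) {
  // for_each_n returns an iterator to first + n.
  auto next = std::for_each_n(v.begin(), n, [](int &x) { x *= 2; });
  return static_cast<std::size_t>(std::distance(next, v.end()));
}
```

The returned iterator is what distinguishes `for_each_n` from `for_each`: the caller can continue processing from where the prefix ended.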

17 Introducing new algorithms to the STL
Numerical parallel algorithms

    template <class InputIterator>
    typename iterator_traits<InputIterator>::value_type
    reduce(InputIterator first, InputIterator last);

    template <class InputIterator, class T>
    T reduce(InputIterator first, InputIterator last, T init);

    template <class InputIterator, class T, class BinaryOperation>
    T reduce(InputIterator first, InputIterator last, T init,
             BinaryOperation binary_op);

As opposed to accumulate, binary_op is applied in an unspecified order.
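
The consequence of the unspecified order is that `binary_op` should be associative and commutative for the result to be well defined. The sketch below contrasts the left fold that `std::accumulate` guarantees with one grouping a parallel reduction might choose (each half folded separately, then combined). The helper names `fold_left`, `fold_halves` and `sum_reduce` are hypothetical, and `fold_halves` assumes a non-empty input.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// The grouping std::accumulate guarantees: (((init op a) op b) op c) ...
template <typename Op>
int fold_left(const std::vector<int> &v, int init, Op op) {
  return std::accumulate(v.begin(), v.end(), init, op);
}

// One grouping a parallel reduce could pick: fold the two halves
// independently, then combine the partial results. Requires !v.empty().
template <typename Op>
int fold_halves(const std::vector<int> &v, int init, Op op) {
  auto mid = v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2);
  int left = std::accumulate(v.begin(), mid, init, op);
  int right = std::accumulate(mid + 1, v.end(), *mid, op);
  return op(left, right);
}

// std::reduce with the default operation (addition), which is safe to regroup.
inline int sum_reduce(const std::vector<int> &v) {
  return std::reduce(v.begin(), v.end(), 0);
}
```

With addition both groupings agree, so `reduce` is a drop-in replacement for `accumulate`. With subtraction they differ, which is why a non-associative operation should stay with `accumulate`.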

18 Introducing new algorithms to the STL
Other algorithms introduced
- Exclusive/Inclusive Scan (Prefix Sum)
- Transform Reduce
- Transform Exclusive/Inclusive Scan
Basic building blocks to construct other algorithms and applications!
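
As an example of a scan used as a building block, the following sketch performs stream compaction (keeping only selected elements) by running an exclusive scan over 0/1 flags to compute each survivor's output index. It uses the sequential C++17 `std::exclusive_scan`; the function names `exclusive_prefix_sum` and `keep_even` are hypothetical. The same three steps (flag, scan, scatter) are each data-parallel, which is what makes the pattern attractive on parallel hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
inline std::vector<std::size_t> exclusive_prefix_sum(
    const std::vector<std::size_t> &in) {
  std::vector<std::size_t> out(in.size());
  std::exclusive_scan(in.begin(), in.end(), out.begin(), std::size_t{0});
  return out;
}

// Keeps the even values of `in`, preserving their order.
inline std::vector<int> keep_even(const std::vector<int> &in) {
  // Step 1: flag the elements to keep.
  std::vector<std::size_t> flags(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    flags[i] = (in[i] % 2 == 0) ? 1 : 0;
  }
  // Step 2: the exclusive scan of the flags is each survivor's output index.
  std::vector<std::size_t> pos = exclusive_prefix_sum(flags);
  std::size_t total = in.empty() ? 0 : pos.back() + flags.back();
  // Step 3: scatter the survivors to their positions.
  std::vector<int> out(total);
  for (std::size_t i = 0; i < in.size(); ++i) {
    if (flags[i]) out[pos[i]] = in[i];
  }
  return out;
}
```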

20 The SYCL Parallel STL
Spec & Examples
- Enables the C++17 Parallel STL to run on any SYCL-supported device
  - Any OpenCL platform with SPIR support
- Improves productivity of C++ developers worried about performance
- Integrates nicely with existing SYCL codebases
- Completely open source
- SYCL Parallel STL introduces two execution policies

23 Sorting with the STL
Sorting on the GPU!

    std::vector<int> data = { 8, 9, 1, 4 };
    std::sort(sycl_policy, std::begin(data), std::end(data));
    if (std::is_sorted(std::begin(data), std::end(data))) {
      cout << "Data is sorted!" << endl;
    }

- sycl_policy is an execution policy
- data is a standard std::vector
- Technically, this will use the device returned by default_selector

24 The SYCL Policy

    template <typename KernelName = DefaultKernelName>
    class sycl_execution_policy {
     public:
      using kernelname = KernelName;
      sycl_execution_policy() = default;
      sycl_execution_policy(cl::sycl::queue q);
      cl::sycl::queue get_queue() const;
    };

- Indicates that the algorithm will be executed on a SYCL device
- Can optionally take a queue
  - Re-use device selection
  - Asynchronous data copy-back
  - ...

26 Why the KernelName template?
How are algorithms implemented?

    auto f = [vectorsize, &bufi, &bufo, op](cl::sycl::handler &h) mutable {
      ...
      auto ai = bufi.template get_access<access::mode::read>(h);
      auto ao = bufo.template get_access<access::mode::write>(h);
      h.parallel_for</* The Kernel Name */>(r, [ai, ao, op](cl::sycl::id<1> id) {
        ao[id.get(0)] = UserFunctor(ai[id.get(0)]);
      });
    };

Two separate calls can generate different kernels!

    transform(par, v.begin(), v.end(), [=](int &val) { val++; });
    transform(par, v.begin(), v.end(), [=](int &val) { val--; });
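
The underlying language fact is that every lambda expression has its own unnamed closure type, so two calls with different lambdas instantiate different kernels, and the policy's name parameter gives each one a stable, user-visible identity. The sketch below shows this in plain C++ with no SYCL dependency. The `demo` namespace, `named_policy`, `launch`, `apply` and the two forward-declared name types are hypothetical stand-ins, not the SYCL Parallel STL API.

```cpp
#include <cassert>
#include <type_traits>

namespace demo {
// True only when both arguments have exactly the same type.
template <typename A, typename B>
constexpr bool same_type(const A &, const B &) {
  return std::is_same<A, B>::value;
}

struct DefaultKernelName {};

// Stand-in for a policy that carries a kernel name as a type.
template <typename KernelName = DefaultKernelName>
struct named_policy {
  using kernelname = KernelName;
};

// Stand-in for a kernel launch: Name identifies the kernel independently of
// the (unnamed) functor type F. Here the "kernel" simply runs on the host.
template <typename Name, typename F>
int launch(F f, int x) {
  return f(x);
}

// An algorithm forwards the policy's name to the launch.
template <typename Policy, typename F>
int apply(Policy, F f, int x) {
  return launch<typename Policy::kernelname>(f, x);
}

// Name types only need to be declared, never defined.
class IncKernel;
class DecKernel;
}  // namespace demo
```

Two lambdas such as `[](int v) { return v + 1; }` and `[](int v) { return v - 1; }` have the same signature but distinct types, so passing them with `named_policy<IncKernel>` and `named_policy<DecKernel>` yields two distinct, separately named instantiations of `launch`.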

27 Using named policies and queues

    using namespace cl::sycl;
    using namespace experimental::parallel::sycl;
    std::vector<int> v = ...;
    // Transform
    default_selector ds;
    {
      queue q(ds);
      sort(sycl_execution_policy(q), std::begin(v), std::end(v));
      sycl_execution_policy<class myname> sepn1(q);
      transform(sepn1, std::begin(v), std::end(v), std::begin(v),
                [=](int i) { return i + 1; });
    }

- Kernel names are only required for lambdas, not functors
- Device selection and queue are re-used
- Data is copied in/out in each call!

28 Avoiding data-copies using buffers

    using namespace cl::sycl;
    using namespace experimental::parallel::sycl;
    std::vector<int> v = ...;
    default_selector h;
    {
      buffer<int> b(std::begin(v), std::end(v));
      b.set_final_data(v.data());
      {
        queue q(h);
        sort(sycl_execution_policy(q), begin(b), end(b));
        sycl_execution_policy<class transform1> sepn1(q);
        transform(sepn1, begin(b), end(b), begin(b),
                  [](int num) { return num + 1; });
      }
    }

- The buffer is constructed from STL containers
- Data will be copied back to the container when the buffer is done
- Note the additional copy from the vector to the buffer and vice versa
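
The copy-in on construction and copy-back on destruction described above can be pictured with a small host-only mock. This is a sketch under stated assumptions: `demo::mock_buffer` is a hypothetical class that only imitates the lifetime behaviour described on the slide. It is not `cl::sycl::buffer`, and no device is involved.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

namespace demo {
// Hypothetical mock: owns a private copy of the data and writes it back to
// the registered destination when it goes out of scope.
class mock_buffer {
 public:
  template <typename It>
  mock_buffer(It first, It last) : storage_(first, last) {}  // copy in
  mock_buffer(const mock_buffer &) = delete;
  mock_buffer &operator=(const mock_buffer &) = delete;

  void set_final_data(int *dst) { final_ = dst; }
  std::vector<int>::iterator begin() { return storage_.begin(); }
  std::vector<int>::iterator end() { return storage_.end(); }

  ~mock_buffer() {
    // Copy back once, when the buffer is done.
    if (final_) std::copy(storage_.begin(), storage_.end(), final_);
  }

 private:
  std::vector<int> storage_;
  int *final_ = nullptr;
};

// Runs two algorithms on the buffer; the vector is only updated at the end.
inline std::vector<int> sort_then_increment(std::vector<int> v) {
  {
    mock_buffer b(v.begin(), v.end());
    b.set_final_data(v.data());
    std::sort(b.begin(), b.end());
    std::transform(b.begin(), b.end(), b.begin(),
                   [](int n) { return n + 1; });
    // v is still unchanged here: both calls worked on the buffer's storage.
  }  // buffer destroyed: single copy back into v
  return v;
}
}  // namespace demo
```

The point of the pattern is that the two algorithm calls share one copy in and one copy out, instead of paying for a round trip per call.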

29 Using device-only data

    using namespace experimental::parallel::sycl;
    default_selector h;
    {
      buffer<int, 1> b(range<1>(size));
      b.set_final_data(v.data());
      {
        cl::sycl::queue q(h);
        {
          auto hostacc = b.get_access<mode::read_write, target::host_buffer>();
          for (auto &e : hostacc) {
            e = read_data_from_file(...);
          }
        }
        sort(sycl_execution_policy(q), begin(b), end(b));
        transform(sycl_policy, begin(b), end(b), begin(b), std::negate<int>());
      }
    }

- Data is initialized on the host using a host accessor
- After the host accessor is done, the data is on the device

31 Heterogeneous Execution Policy
Distribute workload in Parallel STL algorithms
- Execution on two devices at the same time
- Designed for integrated CPU / GPU platforms
- The user decides the percentage of work assigned to the GPU/CPU
- The policy distributes the workload accordingly
HiPEAC internship
- Research work funded via a collaboration grant with HiPEAC
- Ph.D. student from the University of Malaga (A. Vilches)

32 Heterogeneous Execution Example

    cl::sycl::queue q;
    amd_cpu_selector cpu_sel;
    cl::sycl::queue q2(cpu_sel);
    sycl::sycl_heterogeneous_execution_policy<class TransformAlgorithm1> snp(
        q, q2, ratio);
    auto mytransform = [&]() {
      float pi = 3.14;
      std::experimental::parallel::transform(
          snp, std::begin(v1), std::end(v1), std::begin(v2), std::begin(res),
          [=](float a, float b) { return pi * a + b; });
    };

33 Manual distribution of the work

34 Heterogeneous Execution Policy in Use

35 Heterogeneous Execution
Trivial implementation

    ...
    {
      auto buf1_q1 = sycl::helpers::make_const_buffer(first1, first1 + crosspoint);
      auto buf2_q1 = sycl::helpers::make_const_buffer(first2, first2 + crosspoint);
      auto res_q1 = sycl::helpers::make_buffer(result, result + crosspoint);
      auto buf1_q2 = sycl::helpers::make_const_buffer(first1 + crosspoint, last1);
      auto buf2_q2 = sycl::helpers::make_const_buffer(first2 + crosspoint,
                                                      first2 + elements);
      auto res_q2 = sycl::helpers::make_buffer(result + crosspoint,
                                               result + elements);
      impl::transform(named_sep, q1, buf1_q1, buf2_q1, res_q1, binary_op);
      impl::transform(named_sep, q2, buf1_q2, buf2_q2, res_q2, binary_op);
    }
    ...
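
The split around `crosspoint` can be sketched on the host with two ordinary `std::transform` calls standing in for the two queues. This is an illustration only: the `demo` namespace and the functions `crosspoint` and `split_transform` are hypothetical, the way `crosspoint` is derived from the ratio (clamped ratio times element count, truncated) is an assumption, and both halves run sequentially here rather than on two devices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

namespace demo {
// Index at which the input is split: [0, crosspoint) goes to the first
// "device", [crosspoint, elements) to the second. The ratio is clamped to
// [0, 1]. This formula is an assumption made for the sketch.
inline std::size_t crosspoint(std::size_t elements, float ratio) {
  float r = std::min(1.0f, std::max(0.0f, ratio));
  return static_cast<std::size_t>(static_cast<float>(elements) * r);
}

// Binary transform split into two independent sub-ranges.
template <typename BinaryOp>
std::vector<float> split_transform(const std::vector<float> &a,
                                   const std::vector<float> &b, float ratio,
                                   BinaryOp op) {
  assert(a.size() == b.size());
  std::vector<float> res(a.size());
  auto cp = static_cast<std::ptrdiff_t>(crosspoint(a.size(), ratio));
  // First part: would be submitted to the first queue.
  std::transform(a.begin(), a.begin() + cp, b.begin(), res.begin(), op);
  // Second part: would be submitted to the second queue.
  std::transform(a.begin() + cp, a.end(), b.begin() + cp, res.begin() + cp, op);
  return res;
}
}  // namespace demo
```

Because the two sub-ranges do not overlap, the result is the same for any ratio; only the share of work given to each side changes.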

36 Heterogeneous load balancing
Dynamic decision of heterogeneous balancing
- The offloading percentage is a runtime value
- Developers can create runtime evaluation functions
  - Depending on workload
  - Depending on platform
  - Depending on user input
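
A runtime evaluation function of this kind could look like the sketch below, which picks the GPU share from the workload size. All names (`demo::ratio_function`, `make_size_based_ratio`, `gpu_share`) are hypothetical, and the rule itself (keep small inputs entirely on the CPU, offload a fixed fraction otherwise) is a placeholder heuristic chosen for illustration, not a measured or recommended setting.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

namespace demo {
// A balancing decision: maps the number of elements to the fraction of work
// that should be offloaded to the GPU.
using ratio_function = std::function<float(std::size_t)>;

// Placeholder heuristic: inputs smaller than small_limit stay on the CPU,
// larger ones offload large_ratio of the work.
inline ratio_function make_size_based_ratio(std::size_t small_limit,
                                            float large_ratio) {
  return [=](std::size_t elements) {
    return elements < small_limit ? 0.0f : large_ratio;
  };
}

// Number of elements the GPU would receive under the given decision.
inline std::size_t gpu_share(std::size_t elements,
                             const ratio_function &ratio) {
  return static_cast<std::size_t>(static_cast<float>(elements) *
                                  ratio(elements));
}
}  // namespace demo
```

Since the decision is an ordinary callable, it could equally depend on platform queries or user input, as the slide lists.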

38 Conclusions
Parallel STL
- Enables developers to quickly exploit parallel architectures
- SYCL makes implementing these algorithms for heterogeneous platforms trivial
  - Just write single-source C++
  - Will work on any OpenCL + SPIR platform!
Heterogeneous Execution
- Plenty of heterogeneous platforms out there
  - Complex to work with them!
- SYCL allows developers to focus on algorithms and distribute work
- The heterogeneous policy can be optimized and customized per platform
  - The runtime can get platform information and use custom balancing
  - Users can extend the policy with specific balancing decisions

39 @codeplaysoft codeplay.com


Distributed & Heterogeneous Programming in C++ for HPC at SC17 Distributed & Heterogeneous Programming in C++ for HPC at SC17 Michael Wong (Codeplay), Hal Finkel DHPCC++ 2018 1 The Panel 2 Ben Sanders (AMD, HCC, HiP, HSA) Carter Edwards (SNL, Kokkos, ISO C++) CJ Newburn

More information

Intel Software Development Products for High Performance Computing and Parallel Programming

Intel Software Development Products for High Performance Computing and Parallel Programming Intel Software Development Products for High Performance Computing and Parallel Programming Multicore development tools with extensions to many-core Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN

More information

C++14 (Preview) Alisdair Meredith, Library Working Group Chair. Thursday, July 25, 13

C++14 (Preview) Alisdair Meredith, Library Working Group Chair. Thursday, July 25, 13 C++14 (Preview) Alisdair Meredith, Library Working Group Chair 1 1 A Quick Tour of the Sausage Factory A Committee convened under ISO/IEC so multiple companies can co-operate and define the language Google

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 GPGPU History Current GPU Architecture OpenCL Framework Example Optimizing Previous Example Alternative Architectures 2 1996: 3Dfx Voodoo 1 First graphical (3D) accelerator for desktop

More information

Designing and Optimizing LQCD code using OpenACC

Designing and Optimizing LQCD code using OpenACC Designing and Optimizing LQCD code using OpenACC E Calore, S F Schifano, R Tripiccione Enrico Calore University of Ferrara and INFN-Ferrara, Italy GPU Computing in High Energy Physics Pisa, Sep. 10 th,

More information

Object-Oriented Programming for Scientific Computing

Object-Oriented Programming for Scientific Computing Object-Oriented Programming for Scientific Computing Traits and Policies Ole Klein Interdisciplinary Center for Scientific Computing Heidelberg University ole.klein@iwr.uni-heidelberg.de Summer Semester

More information

Scan Algorithm Effects on Parallelism and Memory Conflicts

Scan Algorithm Effects on Parallelism and Memory Conflicts Scan Algorithm Effects on Parallelism and Memory Conflicts 11 Parallel Prefix Sum (Scan) Definition: The all-prefix-sums operation takes a binary associative operator with identity I, and an array of n

More information

OpenACC (Open Accelerators - Introduced in 2012)

OpenACC (Open Accelerators - Introduced in 2012) OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in

More information

CUDA 7.5 OVERVIEW WEBINAR 7/23/15

CUDA 7.5 OVERVIEW WEBINAR 7/23/15 CUDA 7.5 OVERVIEW WEBINAR 7/23/15 CUDA 7.5 https://developer.nvidia.com/cuda-toolkit 16-bit Floating-Point Storage 2x larger datasets in GPU memory Great for Deep Learning cusparse Dense Matrix * Sparse

More information

CAAM 420 Fall 2012 Lecture 29. Duncan Eddy

CAAM 420 Fall 2012 Lecture 29. Duncan Eddy CAAM 420 Fall 2012 Lecture 29 Duncan Eddy November 7, 2012 Table of Contents 1 Templating in C++ 3 1.1 Motivation.............................................. 3 1.2 Templating Functions........................................

More information

Non-numeric types, boolean types, arithmetic. operators. Comp Sci 1570 Introduction to C++ Non-numeric types. const. Reserved words.

Non-numeric types, boolean types, arithmetic. operators. Comp Sci 1570 Introduction to C++ Non-numeric types. const. Reserved words. , ean, arithmetic s s on acters Comp Sci 1570 Introduction to C++ Outline s s on acters 1 2 3 4 s s on acters Outline s s on acters 1 2 3 4 s s on acters ASCII s s on acters ASCII s s on acters Type: acter

More information

Parallel Programming. Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops

Parallel Programming. Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops Parallel Programming Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops Single computers nowadays Several CPUs (cores) 4 to 8 cores on a single chip Hyper-threading

More information

Introducing Parallelism to the Ranges TS

Introducing Parallelism to the Ranges TS Introducing Parallelism to the Ranges TS Gordon Brown, Christopher Di Bella, Michael Haidl, Toomas Remmelg, Ruyman Reyes, Michel Steuwer Distributed & Heterogeneous Programming in C/C++, Oxford, 14/05/2018

More information

Parallel Programming. Libraries and Implementations

Parallel Programming. Libraries and Implementations Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Variables. Data Types.

Variables. Data Types. Variables. Data Types. The usefulness of the "Hello World" programs shown in the previous section is quite questionable. We had to write several lines of code, compile them, and then execute the resulting

More information

Parallel Programming Libraries and implementations

Parallel Programming Libraries and implementations Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.

More information

General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14

General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14 General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14 Lecture Outline Heterogenous multi-core systems and general purpose GPU programming Programming models Heterogenous multi-kernels

More information

x Welcome to the jungle. The free lunch is so over

x Welcome to the jungle. The free lunch is so over Herb Sutter 1975-2005 Put a computer on every desk, in every home, in every pocket. The free lunch is so over 2005-2011 Put a parallel supercomputer on every desk, in every home, in every pocket. Welcome

More information

Abstract Data Types (ADTs) 1. Legal Values. Client Code for Rational ADT. ADT Design. CS 247: Software Engineering Principles

Abstract Data Types (ADTs) 1. Legal Values. Client Code for Rational ADT. ADT Design. CS 247: Software Engineering Principles Abstract Data Types (ADTs) CS 247: Software Engineering Principles ADT Design An abstract data type (ADT) is a user-defined type that bundles together: the range of values that variables of that type can

More information

Heterogeneous Computing Made Easy:

Heterogeneous Computing Made Easy: Heterogeneous Computing Made Easy: Qualcomm Symphony System Manager SDK Wenjia Ruan Sr. Engineer, Advanced Content Group Qualcomm Technologies, Inc. May 2017 Qualcomm Symphony System Manager SDK is a product

More information

An introduction to C++ template programming

An introduction to C++ template programming An introduction to C++ template programming Hayo Thielecke University of Birmingham http://www.cs.bham.ac.uk/~hxt March 2015 Templates and parametric polymorphism Template parameters Member functions of

More information

Contracts programming for C++20

Contracts programming for C++20 Contracts programming for C++20 Current proposal status J. Daniel Garcia ARCOS Group University Carlos III of Madrid Spain April, 28th, 2017 cbed J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es)

More information

Atomic maximum/minimum

Atomic maximum/minimum Atomic maximum/minimum Proposal to extend atomic with maximum/minimum operations Document number: P0493R0 Date: 2016-11-08 Authors: Al Grant (algrant@acm.org), Bronek Kozicki (brok@spamcop.net) Audience:

More information

CE221 Programming in C++ Part 1 Introduction

CE221 Programming in C++ Part 1 Introduction CE221 Programming in C++ Part 1 Introduction 06/10/2017 CE221 Part 1 1 Module Schedule There are two lectures (Monday 13.00-13.50 and Tuesday 11.00-11.50) each week in the autumn term, and a 2-hour lab

More information

Exercise 1.1 Hello world

Exercise 1.1 Hello world Exercise 1.1 Hello world The goal of this exercise is to verify that computer and compiler setup are functioning correctly. To verify that your setup runs fine, compile and run the hello world example

More information

Introduction to C++ Systems Programming

Introduction to C++ Systems Programming Introduction to C++ Systems Programming Introduction to C++ Syntax differences between C and C++ A Simple C++ Example C++ Input/Output C++ Libraries C++ Header Files Another Simple C++ Example Inline Functions

More information

BUILDING PARALLEL ALGORITHMS WITH BULK. Jared Hoberock Programming Systems and Applications NVIDIA Research github.

BUILDING PARALLEL ALGORITHMS WITH BULK. Jared Hoberock Programming Systems and Applications NVIDIA Research github. BUILDING PARALLEL ALGORITHMS WITH BULK Jared Hoberock Programming Systems and Applications NVIDIA Research github.com/jaredhoberock B HELLO, BROTHER! #include #include struct hello

More information

Structured bindings with polymorphic lambas

Structured bindings with polymorphic lambas 1 Introduction Structured bindings with polymorphic lambas Aaryaman Sagar (aary800@gmail.com) August 14, 2017 This paper proposes usage of structured bindings with polymorphic lambdas, adding them to another

More information

The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer

The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer William Killian Tom Scogland, Adam Kunen John Cavazos Millersville University of Pennsylvania

More information

CS 247: Software Engineering Principles. ADT Design

CS 247: Software Engineering Principles. ADT Design CS 247: Software Engineering Principles ADT Design Readings: Eckel, Vol. 1 Ch. 7 Function Overloading & Default Arguments Ch. 12 Operator Overloading U Waterloo CS247 (Spring 2017) p.1/17 Abstract Data

More information

A Standard flat_set. This paper outlines what a (mostly) API-compatible, non-node-based set might look like.

A Standard flat_set. This paper outlines what a (mostly) API-compatible, non-node-based set might look like. A Standard flat_set Document Number: P1222R0 Date: 2018-10-02 Reply to: Zach Laine whatwasthataddress@gmail.com Audience: LEWG 1 Introduction This paper outlines what a (mostly) API-compatible, non-node-based

More information

TEMPLATE IN C++ Function Templates

TEMPLATE IN C++ Function Templates TEMPLATE IN C++ Templates are powerful features of C++ which allows you to write generic programs. In simple terms, you can create a single function or a class to work with different data types using templates.

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

G52CPP C++ Programming Lecture 18

G52CPP C++ Programming Lecture 18 G52CPP C++ Programming Lecture 18 Dr Jason Atkin http://www.cs.nott.ac.uk/~jaa/cpp/ g52cpp.html 1 Welcome Back 2 Last lecture Operator Overloading Strings and streams 3 Operator overloading - what to know

More information

Guillimin HPC Users Meeting January 13, 2017

Guillimin HPC Users Meeting January 13, 2017 Guillimin HPC Users Meeting January 13, 2017 guillimin@calculquebec.ca McGill University / Calcul Québec / Compute Canada Montréal, QC Canada Please be kind to your fellow user meeting attendees Limit

More information

1 HiPEAC January, 2012 Public TASKS, FUTURES AND ASYNCHRONOUS PROGRAMMING

1 HiPEAC January, 2012 Public TASKS, FUTURES AND ASYNCHRONOUS PROGRAMMING 1 HiPEAC January, 2012 Public TASKS, FUTURES AND ASYNCHRONOUS PROGRAMMING TASK-PARALLELISM OpenCL, CUDA, OpenMP (traditionally) and the like are largely data-parallel models Their core unit of parallelism

More information

Functors. Cristian Cibils

Functors. Cristian Cibils Functors Cristian Cibils (ccibils@stanford.edu) Administrivia You should have Evil Hangman feedback on paperless. If it is there, you received credit Functors Let's say that we needed to find the number

More information

GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler

GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler Taylor Lloyd Jose Nelson Amaral Ettore Tiotto University of Alberta University of Alberta IBM Canada 1 Why? 2 Supercomputer Power/Performance GPUs

More information

Source-Code Information Capture

Source-Code Information Capture Source-Code Information Capture Robert Douglas 2015-04-08 Document Number: N4417 followup to N4129 Date: 2015-04-08 Project: Programming Language C++ 1 Introduction Logging, testing, and invariant checking

More information