Parallel STL in today s SYCL Ruymán Reyes
|
|
- Vanessa Bridges
- 6 years ago
- Views:
Transcription
1 Parallel STL in today s SYCL Ruymán Reyes ruyman@codeplay.com Codeplay Research 15 th November, 2016
2 Outline 1 Parallelism TS 2 The SYCL parallel STL 3 Heterogeneous Execution with Parallel STL 4 Conclusions and Future Work 2
3 The presenter Ruyman Reyes, PhD Background in HPC, programming models and compilers Worked in HPC Scientific Code (ScaLAPACK, GROMACs, CP2K) Created the first Open Source OpenACC implementation Contributor to SYCL Specification Lead of ComputeCpp (Codeplay s SYCL implementation) Coordinating the work on SYCL Parallel STL 3
4 Codeplay Software We build software development tools for SoC Software company based in Edinburgh 42 developers Different background and skill set Games Industry, AI, compilers, HPC, robotics Various levels of expertise (graduates to PhD) Customers work in all areas of Industry Smartphones Self-driving cars Game consoles Our technology is probably in your pocket! 4
5 Outline 1 Parallelism TS 2 The SYCL parallel STL 3 Heterogeneous Execution with Parallel STL 4 Conclusions and Future Work 5
6 Parallel STL: Democratizing Parallelism in C++ Various libraries offered STL-like interface for parallel algorithms Thrust, Bolt, libstdc++ Parallel Mode, AMP algorithms In 2012, two separate proposals for parallelism to C++ standard: NVIDIA (N3408), based on Thrust (CUDA-based C++ library) Microsoft and Intel (N3429), based on Intel TBB and PPL/C++AMP Made joint proposal (N3554) suggested by SG1 Many working drafts for N3554, N3850, N3960, N4071, N4409 Final proposal P0024R2 accepted for C++17 during Jacksonville Latest status on C++ draft in github 6
7 Existing implementations Following the evolution of the document Microsoft: HPX: manual/parallel.html HSA: Thibaut Lutz: NVIDIA: Codeplay: 7
8 What is Parallelism TS adding? A set of execution policies and a collection of parallel algorithms The Execution Policies Paragraphs explaining the conditions for parallel algorithms New parallel algorithms The exception_list class 8
9 Sorting with the STL A sequential sort std :: vector <int > data = { 8, 9, 1, 4 }; std :: sort (std :: begin (data ), std :: end (data )); if (std :: is_sorted (data )) { cout << " Data is sorted! " << endl ; } 9
10 Sorting with the STL A parallel sort std :: vector <int > data = { 8, 9, 1, 4 }; std :: sort (std ::par, std :: begin (data ), std :: end (data )); if (std :: is_sorted (data )) { cout << " Data is sorted! " << endl ; } 9
11 Sorting with the STL A parallel sort std :: vector <int > data = { 8, 9, 1, 4 }; std :: sort (std ::par, std :: begin (data ), std :: end (data )); if (std :: is_sorted (data )) { cout << " Data is sorted! " << endl ; } par is an object of an Execution Policy The sort will be executed in parallel using an implementation-defined method 9
12 The Execution Policy Standard policy classes Defined in the execution namespace class sequenced_policy: Never do parallel class parallel_policy: Can use caller thread, but may span others (e.g, std::thread) Invocations do not interleave on a single thread class parallel_unsequenced_policy: Can use caller thread or others (e.g std::thread) Multiple invocations may be interleaved on a single thread Global objects constexpr sequenced_policy sequenced ; constexpr parallel_policy par ; constexpr parallel_unseq_policy par_unseq ; 10
13 The Execution Policy Choosing different parallel implementations // May execute in parallel std :: sort (std ::par, std :: begin (data ), std :: end (data )); // May be parallelized and vectorized std :: sort (std :: par_unseq, std :: begin (data ), std :: end (data )); // Will not be parallelized std :: sort (std :: sequenced, std :: begin (data ), std :: end (data )); Propagating the policy to the end user template <typename Policy, typename Iterator > void library_function ( Policy p, Iterator begin, Iterator end ) { std :: sort (p, begin, end ); std :: for_each (p, begin, end, [&] ( Iterator :: value_type e& ) { e++; } ); std :: for_each (std :: sequenced, begin, end, non_parallel_operation ); } Implementations can define their own Execution Policies 11
14 Dealing with exceptions Different execution threads may abort with different exceptions Parallel STL algorithms may throw an exception_list Note that an uncaught exception on the parallel_unsequenced_policy will cause a terminate. class exception_list : public exception { public : typedef unspecified iterator ; size_t size () const noexcept ; iterator begin () const noexcept ; iterator end () const noexcept ; virtual const char * what () const noexcept ; }; 12
15 Parallel Algorithms Overloads to STL algorithms taking the SYCL Execution Policy Not all STL algorithms are suitable for Parallel Execution! 13
16 Introducing new algorithms to the STL For Each template < class ExecutionPolicy, class InputIterator, class Function > void for_each ( ExecutionPolicy && exec, InputIterator first, InputIterator last, Function f); template < class ExecutionPolicy, class InputIterator, class Size, class Function > InputIterator for_each_n ( ExecutionPolicy && exec, InputIterator first, Size n, Function f); template <class InputIterator, class Size, class Function > InputIterator for_each_n ( InputIterator first, Size n, Function f); for_each: Applies f to elements in [first, last). for_each_n: Applies f to elements in [first, first + n). 14
17 Introducing new algorithms to the STL Numerical parallel algorithms template < class InputIterator > typename iterator_traits <InputIterator >:: value_type reduce ( InputIterator first, InputIterator last ); template < class InputIterator, class T> T reduce ( InputIterator first, InputIterator last, T init ); template < class InputIterator, class T, class BinaryOperation > T reduce ( InputIterator first, InputIterator last, T init, BinaryOperation binary_op ); As opposed to accumulate, binary_op is executed on an unespecified order. 15
18 Introducing new algorithms to the STL Other algorithms initroduced Exclusive/Inclusive Scan (Prefix Sum) Transform Reduce Transform Exclusive/Inclusive Scan Basic block to construct other algorithms and applications! 16
19 Outline 1 Parallelism TS 2 The SYCL parallel STL 3 Heterogeneous Execution with Parallel STL 4 Conclusions and Future Work 17
20 The SYCL Parallel STL Spec & Examples Enable C++17 Parallel STL to run on any SYCL-supported device Any OpenCL platform with SPIR support Improves productivity of C++ developers worried about performance Integrates nicely with existing SYCL codebases Completely Open Source SYCL Parallel STL introduces two execution policies 18
21 Sorting with the STL A sequential sort std :: vector <int > data = { 8, 9, 1, 4 }; std :: sort (std :: begin (data ), std :: end (data )); if (std :: is_sorted (data )) { cout << " Data is sorted! " << endl ; } 19
22 Sorting with the STL Sorting on the GPU! std :: vector <int > data = { 8, 9, 1, 4 }; std :: sort ( sycl_policy, std :: begin (v), std :: end (v)); if (std :: is_sorted (data )) { cout << " Data is sorted! " << endl ; } 19
23 Sorting with the STL Sorting on the GPU! std :: vector <int > data = { 8, 9, 1, 4 }; std :: sort ( sycl_policy, std :: begin (v), std :: end (v)); if (std :: is_sorted (data )) { cout << " Data is sorted! " << endl ; } sycl_policy is an Execution Policy data is an standard stl::vector Technically will use the device returned by default_selector 19
24 The SYCL Policy template < typename KernelName = DefaultKernelName > class sycl_execution_policy { public : using kernelname = KernelName ; }; sycl_execution_policy () = default ; sycl_execution_policy (cl :: sycl :: queue cl :: sycl :: queue get_queue () const ; q); Indicates algorithm will be executed using a SYCL-device Can optionally take a queue Re-use device-selection Asynchronous data copy-back... 20
25 Why the KernelName template? How are algorithms implemented? auto f = [ vectorsize, &bufi, &bufo, op ]( cl :: sycl :: handler &h) mutable {... auto ai = bufi. template get_access <access :: mode :: read >(h); auto ao = bufo. template get_access <access :: mode :: write >(h); h. parallel_for < /* The Kernel Name */ >(r, [ai, ao, op ]( cl :: sycl ::id <1> id) { ao[id.get (0) ] = UserFunctor (ai[id.get (0) ]); }); }; 21
26 Why the KernelName template? How are algorithms implemented? auto f = [ vectorsize, &bufi, &bufo, op ]( cl :: sycl :: handler &h) mutable {... auto ai = bufi. template get_access <access :: mode :: read >(h); auto ao = bufo. template get_access <access :: mode :: write >(h); h. parallel_for < /* The Kernel Name */ >(r, [ai, ao, op ]( cl :: sycl ::id <1> id) { ao[id.get (0) ] = UserFunctor (ai[id.get (0) ]); }); }; Two separate calls can generate different kernels! transform (par, v. begin (), v. end (), [=]( int & val ){ val ++; }); transform (par, v. begin (), v. end (), [=]( int & val ){ val - -; }); 21
27 Using named policies and queues using namespace cl :: sycl ; using namespace experimental :: parallel :: sycl ; std :: vector <int > v =...; // Transform default_selector ds; { queue q(ds); sort ( sycl_execution_policy (q), std :: begin (v), std :: end (v)); sycl_execution_policy < class myname > sepn1 (q); transform (sepn1, std :: begin (v), std :: end (v), std :: begin (v), [=]( int i) { return i + 1;}) ; } Only required for lambdas, not functors Device selection and queue are re-used Data is copied in/out in each call! 22
28 Avoiding data-copies using buffers using namespace cl :: sycl ; using namespace experimental :: parallel :: sycl ; std :: vector <int > v =...; default_selector h; { buffer <int > b(std :: begin (v), std :: end (v)); b. set_final_data (v.data ()); { queue q(h); sort ( sycl_execution_policy (q), begin (b), end (b)); sycl_execution_policy < class transform1 > sepn1 (q); transform ( sepn1, begin (b), end (b), begin (b), []( int num ) { return num + 1; }); } } Buffer is constructed from STL containers Data will be copied back to the container when buffer is done Note the additional copy from vec to buffer and vice-versa 23
29 Using device-only data using namespace experimental :: parallel :: sycl ; default_selector h; { buffer <int, 1> b(range <1>( size )); b. set_final_data (v.data ()); { cl :: sycl :: queue q(h); { auto hostacc = b. get_access <mode :: read_write, target :: host_buffer >() ; for ( auto & : hostacc ) { *i = read_data_from_file (...) ; } } sort ( sycl_execution_policy (q), begin (b), end (b)); transform ( sycl_policy, begin (b), end (b), begin (b), std :: negate <int >() ); } } Data is initialized in the host using a host accessor After host accessor is done, data is on the device 24
30 Outline 1 Parallelism TS 2 The SYCL parallel STL 3 Heterogeneous Execution with Parallel STL 4 Conclusions and Future Work 25
31 Heterogeneous Execution Policy Distribute workload in Parallel STL algorithms Execution on two devices at the same time Designed for integrated CPU / GPU platforms User sets the decide percentage of work assigned to GPU/CPU Policy distributes workload accordingly HiPEAC internship Research work funded via collaboration grant with HiPEAC Ph.D Student from University of Malaga (A. Vilches) 26
32 Heterogeneous Execution Example cl :: sycl :: queue q; amd_cpu_selector cpu_sel ; cl :: sycl :: queue q2( cpu_sel ); sycl :: sycl_heterogeneous_execution_policy < class TransformAlgorithm1 > snp ( q, q2, ratio ); auto mytransform = [&]() { float pi = 3.14; std :: experimental :: parallel :: transform ( snp, std :: begin (v1), std :: end (v1), std :: begin (v2), std :: begin (res ), [=]( float a, float b) { return pi * a + b; }); }; 27
33 Manual distribution of the work 28
34 Heterogeneous Execution Policy in Use 29
35 Heterogeneous Execution Trivial Implementation... { auto buf1_q1 = sycl :: helpers :: make_const_buffer (first1, first1 + crosspoint ); auto buf2_q1 = sycl :: helpers :: make_const_buffer (first2, first2 + crosspoint ); auto res_q1 = sycl :: helpers :: make_buffer (result, result + crosspoint ); auto buf1_q2 = sycl :: helpers :: make_const_buffer ( first1 + crosspoint, last1 ); auto buf2_q2 = sycl :: helpers :: make_const_buffer ( first2 + crosspoint, first2 + elements ); auto res_q2 = sycl :: helpers :: make_buffer ( result + crosspoint, result + elements ); impl :: transform ( named_sep, q1, buf1_q1, buf2_q1, res_q1, binary_op ); impl :: transform ( named_sep, q2, buf1_q2, buf2_q2, res_q2, binary_op ); }... 30
36 Heterogeneous load balancing Dynamic decision of heterogeneous balancing Percentage offloading is runtime value Developers can create runtime evaluation functions Depending on workload Depending on platform Depending on user-input 31
37 Outline 1 Parallelism TS 2 The SYCL parallel STL 3 Heterogeneous Execution with Parallel STL 4 Conclusions and Future Work 32
38 Conclusions Parallel STL Enables developers to quickly exploit parallel architectures SYCL makes implementing these algorithms for heterogeneous platforms trivial Just write single-source C++ Will work on any OpenCL + SPIR platform! Heterogeneous Execution Plenty of heterogeneous platform out there Complex to work with them! SYCL allows developers to focus on algorithms and distribute work Heterogeneous Policy can be optimized and customized per platform Runtime can get platform information and use custom balancing Users can extend the policy with specific balancing decisions 33
39 @codeplaysoft codeplay.com
Using SYCL as an Implementation Framework for HPX.Compute
Using SYCL as an Implementation Framework for HPX.Compute Marcin Copik 1 Hartmut Kaiser 2 1 RWTH Aachen University mcopik@gmail.com 2 Louisiana State University Center for Computation and Technology The
More informationJose Aliaga (Universitat Jaume I, Castellon, Spain), Ruyman Reyes, Mehdi Goli (Codeplay Software) 2017 Codeplay Software Ltd.
SYCL-BLAS: LeveragingSYCL-BLAS Expression Trees for Linear Algebra Jose Aliaga (Universitat Jaume I, Castellon, Spain), Ruyman Reyes, Mehdi Goli (Codeplay Software) 1 About me... Phd in Compilers and Parallel
More informationWorking Draft, Technical Specification for C++ Extensions for Parallelism, Revision 1
Document Number: N3960 Date: 2014-02-28 Reply to: Jared Hoberock NVIDIA Corporation jhoberock@nvidia.com Working Draft, Technical Specification for C++ Extensions for Parallelism, Revision 1 Note: this
More informationWorking Draft, Technical Specification for C++ Extensions for Parallelism
Document Number: N3850 Date: 2014-01-17 Reply to: Jared Hoberock NVIDIA Corporation jhoberock@nvidia.com Working Draft, Technical Specification for C++ Extensions for Parallelism Note: this is an early
More informationCopyright Khronos Group Page 1. Introduction to SYCL. SYCL Tutorial IWOCL
Copyright Khronos Group 2015 - Page 1 Introduction to SYCL SYCL Tutorial IWOCL 2015-05-12 Copyright Khronos Group 2015 - Page 2 Introduction I am - Lee Howes - Senior staff engineer - GPU systems team
More informationCopyright Khronos Group, Page 1 SYCL. SG14, February 2016
Copyright Khronos Group, 2014 - Page 1 SYCL SG14, February 2016 BOARD OF PROMOTERS Over 100 members worldwide any company is welcome to join Copyright Khronos Group 2014 SYCL 1. What is SYCL for and what
More informationThe Cut and Thrust of CUDA
The Cut and Thrust of CUDA Luke Hodkinson Center for Astrophysics and Supercomputing Swinburne University of Technology Melbourne, Hawthorn 32000, Australia May 16, 2013 Luke Hodkinson The Cut and Thrust
More informationProgramming Languages Technical Specification for C++ Extensions for Parallelism
ISO 05 All rights reserved ISO/IEC JTC SC WG N4409 Date: 05-04-0 ISO/IEC DTS 9570 ISO/IEC JTC SC Secretariat: ANSI Programming Languages Technical Specification for C++ Extensions for Parallelism Warning
More informationThe Role of Standards in Heterogeneous Programming
The Role of Standards in Heterogeneous Programming Multi-core Challenge Bristol UWE 45 York Place, Edinburgh EH1 3HP June 12th, 2013 Codeplay Software Ltd. Incorporated in 1999 Based in Edinburgh, Scotland
More informationSYCL for OpenCL. in a nutshell. Maria Rovatsou, Codeplay s R&D Product Development Lead & Contributor to SYCL. IWOCL Conference May 2014
SYCL for OpenCL in a nutshell Maria Rovatsou, Codeplay s R&D Product Development Lead & Contributor to SYCL! IWOCL Conference May 2014 SYCL for OpenCL in a nutshell SYCL in the OpenCL ecosystem SYCL aims
More informationp0616r0 - de-pessimize legacy <numeric> algorithms with std::move
p0616r0 - de-pessimize legacy algorithms with std::move Peter Sommerlad 2017-06-06 Document Number: p0616r0 Date: 2017-06-06 Project: Programming Language C++ Audience: LWG/LEWG Target: C++20
More informationTemplate Library for Parallel For Loops
1 Document number: P0075r2 Date: 2017-11-09 Audience: Reply to: Other authors: Concurrency study group (SG1) Pablo Halpern Clark Nelson Arch D. Robison,
More informationSYCL-based Data Layout Abstractions for CPU+GPU Codes
DHPCC++ 2018 Conference St Catherine's College, Oxford, UK May 14th, 2018 SYCL-based Data Layout Abstractions for CPU+GPU Codes Florian Wende, Matthias Noack, Thomas Steinke Zuse Institute Berlin Setting
More informationObject-Oriented Programming for Scientific Computing
Object-Oriented Programming for Scientific Computing Traits and Policies Ole Klein Interdisciplinary Center for Scientific Computing Heidelberg University ole.klein@iwr.uni-heidelberg.de 11. Juli 2017
More informationtrisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell 05/12/2015 IWOCL 2015 SYCL Tutorial Khronos OpenCL SYCL committee
trisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell Khronos OpenCL SYCL committee 05/12/2015 IWOCL 2015 SYCL Tutorial OpenCL SYCL committee work... Weekly telephone meeting Define
More informationOpenCL: History & Future. November 20, 2017
Mitglied der Helmholtz-Gemeinschaft OpenCL: History & Future November 20, 2017 OpenCL Portable Heterogeneous Computing 2 APIs and 2 kernel languages C Platform Layer API OpenCL C and C++ kernel language
More informationModern C++ Parallelism from CPU to GPU
Modern C++ Parallelism from CPU to GPU Simon Brand @TartanLlama Senior Software Engineer, GPGPU Toolchains, Codeplay C++ Russia 2018 2018-04-21 Agenda About me and Codeplay C++17 CPU Parallelism Third-party
More informationMikhail Dvorskiy, Jim Cownie, Alexey Kukanov
Mikhail Dvorskiy, Jim Cownie, Alexey Kukanov What is the Parallel STL? C++17 C++ Next An extension of the C++ Standard Template Library algorithms with the execution policy argument Support for parallel
More informationA Parallel Algorithms Library N3724
A Parallel Algorithms Library N3724 Jared Hoberock Jaydeep Marathe Michael Garland Olivier Giroux Vinod Grover {jhoberock, jmarathe, mgarland, ogiroux, vgrover}@nvidia.com Artur Laksberg Herb Sutter {arturl,
More informationIntel Thread Building Blocks, Part II
Intel Thread Building Blocks, Part II SPD course 2013-14 Massimo Coppola 25/03, 16/05/2014 1 TBB Recap Portable environment Based on C++11 standard compilers Extensive use of templates No vectorization
More informationHandling Concurrent Exceptions with Executors
Document number: P0797R1 Date: 2018-02-12 (Jacksonville) Project: Programming Language C++, WG21, SG1,SG14, LEWG, LWG Authors: Matti Rintala, Michael Wong, Carter Edwards, Patrice Roy, Gordon Brown, Mark
More informationExpressing Heterogeneous Parallelism in C++ with Intel Threading Building Blocks A full-day tutorial proposal for SC17
Expressing Heterogeneous Parallelism in C++ with Intel Threading Building Blocks A full-day tutorial proposal for SC17 Tutorial Instructors [James Reinders, Michael J. Voss, Pablo Reble, Rafael Asenjo]
More informationCompiler Tools for HighLevel Parallel Languages
Compiler Tools for HighLevel Parallel Languages Paul Keir Codeplay Software Ltd. LEAP Conference May 21st 2013 Presentation Outline Introduction EU Framework 7 Project: LPGPU Offload C++ for PS3 Memory
More informationcuda-on-cl A compiler and runtime for running NVIDIA CUDA C++11 applications on OpenCL 1.2 devices Hugh Perkins (ASAPP)
cuda-on-cl A compiler and runtime for running NVIDIA CUDA C++11 applications on OpenCL 1.2 devices Hugh Perkins (ASAPP) Demo: CUDA on Intel HD5500 global void setvalue(float *data, int idx, float value)
More informationW3101: Programming Languages C++ Ramana Isukapalli
Lecture-6 Operator overloading Namespaces Standard template library vector List Map Set Casting in C++ Operator Overloading Operator overloading On two objects of the same class, can we perform typical
More informationVisionCPP Mehdi Goli
VisionCPP Mehdi Goli Motivation Computer vision Different areas Medical imaging, film industry, industrial manufacturing, weather forecasting, etc. Operations Large size of data Sequence of operations
More informationColin Riddell GPU Compiler Developer Codeplay Visit us at
OpenCL Colin Riddell GPU Compiler Developer Codeplay Visit us at www.codeplay.com 2 nd Floor 45 York Place Edinburgh EH1 3HP United Kingdom Codeplay Overview of OpenCL Codeplay + OpenCL Our technology
More informationCUDA Toolkit 4.2 Thrust Quick Start Guide. PG _v01 March 2012
CUDA Toolkit 4.2 Thrust Quick Start Guide PG-05688-040_v01 March 2012 Document Change History Version Date Authors Description of Change 01 2011/1/13 NOJW Initial import from wiki. CUDA Toolkit 4.2 Thrust
More informationChapter 15 - C++ As A "Better C"
Chapter 15 - C++ As A "Better C" Outline 15.1 Introduction 15.2 C++ 15.3 A Simple Program: Adding Two Integers 15.4 C++ Standard Library 15.5 Header Files 15.6 Inline Functions 15.7 References and Reference
More informationIntel Thread Building Blocks
Intel Thread Building Blocks SPD course 2017-18 Massimo Coppola 23/03/2018 1 Thread Building Blocks : History A library to simplify writing thread-parallel programs and debugging them Originated circa
More informationOpenCL C++ kernel language
Copyright Khronos Group 2016 - Page 1 OpenCL C++ kernel language Vienna April 2016 Adam Stański Bartosz Sochacki Copyright Khronos Group 2016 - Page 2 OpenCL 2.2 OpenCL C++ Open source free compiler https://github.com/khronosgroup/libclcxx
More informationMichel Steuwer.
Michel Steuwer http://homepages.inf.ed.ac.uk/msteuwer/ SKELCL: Algorithmic Skeletons for GPUs X i a i b i = reduce (+) 0 (zip ( ) A B) #include #include #include
More informationSYCL for OpenCL in a Nutshell
SYCL for OpenCL in a Nutshell Luke Iwanski, Games Technology Programmer @ Codeplay! SIGGRAPH Vancouver 2014 1 2 Copyright Khronos Group 2014 SYCL for OpenCL in a nutshell Copyright Khronos Group 2014 Why?
More informationIntel Thread Building Blocks
Intel Thread Building Blocks SPD course 2015-16 Massimo Coppola 08/04/2015 1 Thread Building Blocks : History A library to simplify writing thread-parallel programs and debugging them Originated circa
More informationOutline. Variables Automatic type inference. Generic programming. Generic programming. Templates Template compilation
Outline EDAF30 Programming in C++ 4. The standard library. Algorithms and containers. Sven Gestegård Robertz Computer Science, LTH 2018 1 Type inference 2 3 The standard library Algorithms Containers Sequences
More informationIntermediate Programming, Spring 2017*
600.120 Intermediate Programming, Spring 2017* Misha Kazhdan *Much of the code in these examples is not commented because it would otherwise not fit on the slides. This is bad coding practice in general
More informationContext Tokens for Parallel Algorithms
Doc No: P0335R1 Date: 2018-10-07 Audience: SG1 (parallelism & concurrency) Authors: Pablo Halpern, Intel Corp. phalpern@halpernwightsoftware.com Context Tokens for Parallel Algorithms Contents 1 Abstract...
More informationSingle-source SYCL C++ on Xilinx FPGA. Xilinx Research Labs Khronos 2017/11/12 19
Single-source SYCL C++ on Xilinx FPGA Xilinx Research Labs Khronos booth @SC17 2017/11/12 19 Khronos standards for heterogeneous systems 3D for the Web - Real-time apps and games in-browser - Efficiently
More informationLA-UR Approved for public release; distribution is unlimited.
LA-UR-15-27730 Approved for public release; distribution is unlimited. Title: Author(s): Intended for: Summer 2015 LANL Exit Talk Usher, Will Canada, Curtis Vincent Web Issued: 2015-10-05 Disclaimer: Los
More informationPROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec
PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization
More informationProgramming GPUs with C++14 and Just-In-Time Compilation
Programming GPUs with C++14 and Just-In-Time Compilation Michael Haidl, Bastian Hagedorn, and Sergei Gorlatch University of Muenster {m.haidl, b.hagedorn, gorlatch}@uni-muenster.de Abstract. Systems that
More informationLambda functions. Zoltán Porkoláb: C++11/14 1
Lambda functions Terminology How it is compiled Capture by value and reference Mutable lambdas Use of this Init capture and generalized lambdas in C++14 Constexpr lambda and capture *this and C++17 Zoltán
More informationJoe Hummel, PhD. Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago.
Joe Hummel, PhD Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago email: joe@joehummel.net stuff: http://www.joehummel.net/downloads.html Async programming:
More informationAMD s Unified CPU & GPU Processor Concept
Advanced Seminar Computer Engineering Institute of Computer Engineering (ZITI) University of Heidelberg February 5, 2014 Overview 1 2 Current Platforms: 3 4 5 Architecture 6 2/37 Single-thread Performance
More informationASYNCHRONOUS COMPUTING IN C++
http://stellar-goup.org ASYNCHRONOUS COMPUTING IN C++ Hartmut Kaiser (Hartmut.Kaiser@gmail.com) CppCon 2014 WHAT IS ASYNCHRONOUS COMPUTING? Spawning off some work without immediately waiting for the work
More informationTHE STANDARD TEMPLATE LIBRARY (STL) Week 6 BITE 1513 Computer Game Programming
THE STANDARD TEMPLATE LIBRARY (STL) Week 6 BITE 1513 Computer Game Programming What the heck is STL???? Another hard to understand and lazy to implement stuff? Standard Template Library The standard template
More informationC++ Basics. Data Processing Course, I. Hrivnacova, IPN Orsay
C++ Basics Data Processing Course, I. Hrivnacova, IPN Orsay The First Program Comments Function main() Input and Output Namespaces Variables Fundamental Types Operators Control constructs 1 C++ Programming
More informationSkePU 2 User Guide For the preview release
SkePU 2 User Guide For the preview release August Ernstsson October 20, 2016 Contents 1 Introduction 3 2 License 3 3 Authors and Maintainers 3 3.1 Acknowledgements............................... 3 4 Dependencies
More informationParallel Processing with the SMP Framework in VTK. Berk Geveci Kitware, Inc.
Parallel Processing with the SMP Framework in VTK Berk Geveci Kitware, Inc. November 3, 2013 Introduction The main objective of the SMP (symmetric multiprocessing) framework is to provide an infrastructure
More informationHeterogeneous Computing
Heterogeneous Computing Featured Speaker Ben Sander Senior Fellow Advanced Micro Devices (AMD) DR. DOBB S: GPU AND CPU PROGRAMMING WITH HETEROGENEOUS SYSTEM ARCHITECTURE Ben Sander AMD Senior Fellow APU:
More informationIntroduction to Programming
Introduction to Programming session 6 Instructor: Reza Entezari-Maleki Email: entezari@ce.sharif.edu 1 Spring 2011 These slides are created using Deitel s slides Sharif University of Technology Outlines
More informationDiving into Parallelization in Modern C++ using HPX
Diving into Parallelization in Modern C++ using HPX Thomas Heller, Hartmut Kaiser, John Biddiscombe CERN 2016 May 18, 2016 Computer Architecture Department of Computer This project has received funding
More informationCSCE 206: Structured Programming in C++
CSCE 206: Structured Programming in C++ 2017 Spring Exam 2 Monday, March 20, 2017 Total - 100 Points B Instructions: Total of 13 pages, including this cover and the last page. Before starting the exam,
More informationCSI33 Data Structures
Outline Department of Mathematics and Computer Science Bronx Community College November 22, 2017 Outline Outline 1 Chapter 12: C++ Templates Outline Chapter 12: C++ Templates 1 Chapter 12: C++ Templates
More information/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 10: GPGPU (3) Welcome!
/INFOMOV/ Optimization & Vectorization J. Bikker - Sep-Nov 2018 - Lecture 10: GPGPU (3) Welcome! Today s Agenda: Don t Trust the Template The Prefix Sum Parallel Sorting Stream Filtering Optimizing GPU
More informationHETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE
HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE Haibo Xie, Ph.D. Chief HSA Evangelist AMD China OUTLINE: The Challenges with Computing Today Introducing Heterogeneous System Architecture (HSA)
More informationMore on Templates. Shahram Rahatlou. Corso di Programmazione++
More on Templates Standard Template Library Shahram Rahatlou http://www.roma1.infn.it/people/rahatlou/programmazione++/ it/ / h tl / i / Corso di Programmazione++ Roma, 19 May 2008 More on Template Inheritance
More informationExceptions, Case Study-Exception handling in C++.
PART III: Structuring of Computations- Structuring the computation, Expressions and statements, Conditional execution and iteration, Routines, Style issues: side effects and aliasing, Exceptions, Case
More informationSFU CMPT Topic: Function Objects
SFU CMPT-212 2008-1 1 Topic: Function Objects SFU CMPT-212 2008-1 Topic: Function Objects Ján Maňuch E-mail: jmanuch@sfu.ca Friday 14 th March, 2008 SFU CMPT-212 2008-1 2 Topic: Function Objects Function
More informationSource-Code Information Capture
Source-Code Information Capture Robert Douglas 2014-10-10 Document Number: N4129 followup to N3972 Date: 2014-10-10 Project: Programming Language C++ 1 Introduction Unchanged from N3972 Logging, testing,
More informationINTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro
INTRODUCTION TO GPU COMPUTING WITH CUDA Topi Siro 19.10.2015 OUTLINE PART I - Tue 20.10 10-12 What is GPU computing? What is CUDA? Running GPU jobs on Triton PART II - Thu 22.10 10-12 Using libraries Different
More informationDistributed & Heterogeneous Programming in C++ for HPC at SC17
Distributed & Heterogeneous Programming in C++ for HPC at SC17 Michael Wong (Codeplay), Hal Finkel DHPCC++ 2018 1 The Panel 2 Ben Sanders (AMD, HCC, HiP, HSA) Carter Edwards (SNL, Kokkos, ISO C++) CJ Newburn
More informationIntel Software Development Products for High Performance Computing and Parallel Programming
Intel Software Development Products for High Performance Computing and Parallel Programming Multicore development tools with extensions to many-core Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN
More informationC++14 (Preview) Alisdair Meredith, Library Working Group Chair. Thursday, July 25, 13
C++14 (Preview) Alisdair Meredith, Library Working Group Chair 1 1 A Quick Tour of the Sausage Factory A Committee convened under ISO/IEC so multiple companies can co-operate and define the language Google
More informationMartin Kruliš, v
Martin Kruliš 1 GPGPU History Current GPU Architecture OpenCL Framework Example Optimizing Previous Example Alternative Architectures 2 1996: 3Dfx Voodoo 1 First graphical (3D) accelerator for desktop
More informationDesigning and Optimizing LQCD code using OpenACC
Designing and Optimizing LQCD code using OpenACC E Calore, S F Schifano, R Tripiccione Enrico Calore University of Ferrara and INFN-Ferrara, Italy GPU Computing in High Energy Physics Pisa, Sep. 10 th,
More informationObject-Oriented Programming for Scientific Computing
Object-Oriented Programming for Scientific Computing Traits and Policies Ole Klein Interdisciplinary Center for Scientific Computing Heidelberg University ole.klein@iwr.uni-heidelberg.de Summer Semester
More informationScan Algorithm Effects on Parallelism and Memory Conflicts
Scan Algorithm Effects on Parallelism and Memory Conflicts 11 Parallel Prefix Sum (Scan) Definition: The all-prefix-sums operation takes a binary associative operator with identity I, and an array of n
More informationOpenACC (Open Accelerators - Introduced in 2012)
OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in
More informationCUDA 7.5 OVERVIEW WEBINAR 7/23/15
CUDA 7.5 OVERVIEW WEBINAR 7/23/15 CUDA 7.5 https://developer.nvidia.com/cuda-toolkit 16-bit Floating-Point Storage 2x larger datasets in GPU memory Great for Deep Learning cusparse Dense Matrix * Sparse
More informationCAAM 420 Fall 2012 Lecture 29. Duncan Eddy
CAAM 420 Fall 2012 Lecture 29 Duncan Eddy November 7, 2012 Table of Contents 1 Templating in C++ 3 1.1 Motivation.............................................. 3 1.2 Templating Functions........................................
More informationNon-numeric types, boolean types, arithmetic. operators. Comp Sci 1570 Introduction to C++ Non-numeric types. const. Reserved words.
, ean, arithmetic s s on acters Comp Sci 1570 Introduction to C++ Outline s s on acters 1 2 3 4 s s on acters Outline s s on acters 1 2 3 4 s s on acters ASCII s s on acters ASCII s s on acters Type: acter
More informationParallel Programming. Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops
Parallel Programming Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops Single computers nowadays Several CPUs (cores) 4 to 8 cores on a single chip Hyper-threading
More informationIntroducing Parallelism to the Ranges TS
Introducing Parallelism to the Ranges TS Gordon Brown, Christopher Di Bella, Michael Haidl, Toomas Remmelg, Ruyman Reyes, Michel Steuwer Distributed & Heterogeneous Programming in C/C++, Oxford, 14/05/2018
More informationParallel Programming. Libraries and Implementations
Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationVariables. Data Types.
Variables. Data Types. The usefulness of the "Hello World" programs shown in the previous section is quite questionable. We had to write several lines of code, compile them, and then execute the resulting
More informationParallel Programming Libraries and implementations
Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.
More informationGeneral Purpose GPU Programming (1) Advanced Operating Systems Lecture 14
General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14 Lecture Outline Heterogenous multi-core systems and general purpose GPU programming Programming models Heterogenous multi-kernels
More informationx Welcome to the jungle. The free lunch is so over
Herb Sutter 1975-2005 Put a computer on every desk, in every home, in every pocket. The free lunch is so over 2005-2011 Put a parallel supercomputer on every desk, in every home, in every pocket. Welcome
More informationAbstract Data Types (ADTs) 1. Legal Values. Client Code for Rational ADT. ADT Design. CS 247: Software Engineering Principles
Abstract Data Types (ADTs) CS 247: Software Engineering Principles ADT Design An abstract data type (ADT) is a user-defined type that bundles together: the range of values that variables of that type can
More informationHeterogeneous Computing Made Easy:
Heterogeneous Computing Made Easy: Qualcomm Symphony System Manager SDK Wenjia Ruan Sr. Engineer, Advanced Content Group Qualcomm Technologies, Inc. May 2017 Qualcomm Symphony System Manager SDK is a product
More informationAn introduction to C++ template programming
An introduction to C++ template programming Hayo Thielecke University of Birmingham http://www.cs.bham.ac.uk/~hxt March 2015 Templates and parametric polymorphism Template parameters Member functions of
More informationContracts programming for C++20
Contracts programming for C++20 Current proposal status J. Daniel Garcia ARCOS Group University Carlos III of Madrid Spain April, 28th, 2017 cbed J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es)
More informationAtomic maximum/minimum
Atomic maximum/minimum Proposal to extend atomic with maximum/minimum operations Document number: P0493R0 Date: 2016-11-08 Authors: Al Grant (algrant@acm.org), Bronek Kozicki (brok@spamcop.net) Audience:
More informationCE221 Programming in C++ Part 1 Introduction
CE221 Programming in C++ Part 1 Introduction 06/10/2017 CE221 Part 1 1 Module Schedule There are two lectures (Monday 13.00-13.50 and Tuesday 11.00-11.50) each week in the autumn term, and a 2-hour lab
More informationExercise 1.1 Hello world
Exercise 1.1 Hello world The goal of this exercise is to verify that computer and compiler setup are functioning correctly. To verify that your setup runs fine, compile and run the hello world example
More informationIntroduction to C++ Systems Programming
Introduction to C++ Systems Programming Introduction to C++ Syntax differences between C and C++ A Simple C++ Example C++ Input/Output C++ Libraries C++ Header Files Another Simple C++ Example Inline Functions
More informationBUILDING PARALLEL ALGORITHMS WITH BULK. Jared Hoberock Programming Systems and Applications NVIDIA Research github.
BUILDING PARALLEL ALGORITHMS WITH BULK Jared Hoberock Programming Systems and Applications NVIDIA Research github.com/jaredhoberock B HELLO, BROTHER! #include #include struct hello
More informationStructured bindings with polymorphic lambas
1 Introduction Structured bindings with polymorphic lambas Aaryaman Sagar (aary800@gmail.com) August 14, 2017 This paper proposes usage of structured bindings with polymorphic lambdas, adding them to another
More informationThe Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer
The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer William Killian Tom Scogland, Adam Kunen John Cavazos Millersville University of Pennsylvania
More informationCS 247: Software Engineering Principles. ADT Design
CS 247: Software Engineering Principles ADT Design Readings: Eckel, Vol. 1 Ch. 7 Function Overloading & Default Arguments Ch. 12 Operator Overloading U Waterloo CS247 (Spring 2017) p.1/17 Abstract Data
More informationA Standard flat_set. This paper outlines what a (mostly) API-compatible, non-node-based set might look like.
A Standard flat_set Document Number: P1222R0 Date: 2018-10-02 Reply to: Zach Laine whatwasthataddress@gmail.com Audience: LEWG 1 Introduction This paper outlines what a (mostly) API-compatible, non-node-based
More informationTEMPLATE IN C++ Function Templates
TEMPLATE IN C++ Templates are powerful features of C++ which allows you to write generic programs. In simple terms, you can create a single function or a class to work with different data types using templates.
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationG52CPP C++ Programming Lecture 18
G52CPP C++ Programming Lecture 18 Dr Jason Atkin http://www.cs.nott.ac.uk/~jaa/cpp/ g52cpp.html 1 Welcome Back 2 Last lecture Operator Overloading Strings and streams 3 Operator overloading - what to know
More informationGuillimin HPC Users Meeting January 13, 2017
Guillimin HPC Users Meeting January 13, 2017 guillimin@calculquebec.ca McGill University / Calcul Québec / Compute Canada Montréal, QC Canada Please be kind to your fellow user meeting attendees Limit
More information1 HiPEAC January, 2012 Public TASKS, FUTURES AND ASYNCHRONOUS PROGRAMMING
1 HiPEAC January, 2012 Public TASKS, FUTURES AND ASYNCHRONOUS PROGRAMMING TASK-PARALLELISM OpenCL, CUDA, OpenMP (traditionally) and the like are largely data-parallel models Their core unit of parallelism
More informationFunctors. Cristian Cibils
Functors Cristian Cibils (ccibils@stanford.edu) Administrivia You should have Evil Hangman feedback on paperless. If it is there, you received credit Functors Let's say that we needed to find the number
More informationGPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler
GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler Taylor Lloyd Jose Nelson Amaral Ettore Tiotto University of Alberta University of Alberta IBM Canada 1 Why? 2 Supercomputer Power/Performance GPUs
More informationSource-Code Information Capture
Source-Code Information Capture Robert Douglas 2015-04-08 Document Number: N4417 followup to N4129 Date: 2015-04-08 Project: Programming Language C++ 1 Introduction Logging, testing, and invariant checking
More information