Using SYCL as an Implementation Framework for HPX.Compute

Similar documents
Parallel STL in today s SYCL Ruymán Reyes

OpenCL: History & Future. November 20, 2017

Copyright Khronos Group, Page 1 SYCL. SG14, February 2016

trisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell 05/12/2015 IWOCL 2015 SYCL Tutorial Khronos OpenCL SYCL committee

OpenCL C++ kernel language

Parallelism in C++ Higher-level Parallelization in C++ for Asynchronous Task-Based Programming. Hartmut Kaiser

Distributed & Heterogeneous Programming in C++ for HPC at SC17

HPX The C++ Standards Library for Concurrency and Parallelism. Hartmut Kaiser

ASYNCHRONOUS COMPUTING IN C++

SYCL for OpenCL in a Nutshell

Copyright Khronos Group Page 1. Introduction to SYCL. SYCL Tutorial IWOCL

Stream Computing using Brook+

Compiler Tools for HighLevel Parallel Languages

VisionCPP Mehdi Goli

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices

Copyright Khronos Group Page 1. OpenCL BOF SIGGRAPH 2013

Diving into Parallelization in Modern C++ using HPX

CUDA Toolkit CUPTI User's Guide. DA _v01 September 2012

GpuWrapper: A Portable API for Heterogeneous Programming at CGG

SYCL for OpenCL. in a nutshell. Maria Rovatsou, Codeplay s R&D Product Development Lead & Contributor to SYCL. IWOCL Conference May 2014

Single-source SYCL C++ on Xilinx FPGA. Xilinx Research Labs Khronos 2017/11/12 19

Copyright Khronos Group Page 1

OpenCL Overview Benedict R. Gaster, AMD

Better Code: Concurrency Sean Parent Principal Scientist Adobe Systems Incorporated. All Rights Reserved.

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors

OpenCL Overview. Shanghai March Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

Handling Concurrent Exceptions with Executors

cuda-on-cl A compiler and runtime for running NVIDIA CUDA C++11 applications on OpenCL 1.2 devices Hugh Perkins (ASAPP)

Short Notes of CS201

Programming GPUs with C++14 and Just-In-Time Compilation

Introduction to Multicore Programming

CS201 - Introduction to Programming Glossary By

Unified Memory. Notes on GPU Data Transfers. Andreas Herten, Forschungszentrum Jülich, 24 April Member of the Helmholtz Association

Parallel Programming Abstractions for Productivity, Scalability, and Performance Portability

OpenCL in Action. Ofer Rosenberg

Chapter 3 Parallel Software

OpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR

Better Code: Concurrency Sean Parent Principal Scientist Adobe Systems Incorporated. All Rights Reserved.

Now What? Sean Parent Principal Scientist Adobe Systems Incorporated. All Rights Reserved..

Proposed Wording for Concurrent Data Structures: Hazard Pointer and Read Copy Update (RCU)

04-17 Discussion Notes

Overload Resolution. Ansel Sermersheim & Barbara Geller Amsterdam C++ Group March 2019

Introduction to OpenCL!

Martin Kruliš, v

CUDA Programming Model

Cpt S 122 Data Structures. Course Review Midterm Exam # 2

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD

Better Code: Concurrency Sean Parent Principal Scientist Adobe Systems Incorporated. All Rights Reserved.

AN 831: Intel FPGA SDK for OpenCL

Lecture Topic: An Overview of OpenCL on Xeon Phi

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

Michel Steuwer.

Automatic Intra-Application Load Balancing for Heterogeneous Systems

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

OpenCL parallel Processing using General Purpose Graphical Processing units TiViPE software development

Context Tokens for Parallel Algorithms

OpenACC 2.6 Proposed Features

Digging into the GAT API

Modern Processor Architectures. L25: Modern Compiler Design

THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD

SYCL for OpenCL May15. Copyright Khronos Group Page 1

General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14

Copyright Khronos Group Page 1. Vulkan Overview. June 2015

CLU: Open Source API for OpenCL Prototyping

Modern C++ Parallelism from CPU to GPU

SIGGRAPH Briefing August 2014

A Local-View Array Library for Partitioned Global Address Space C++ Programs

asynchronous programming with allocation aware futures /Naios/continuable Denis Blank Meeting C

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

Communication Library to Overlap Computation and Communication for OpenCL Application

Structured bindings with polymorphic lambas

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!

Distance Learning Advanced C++ - Programming Models, boost and Parallel Computing

Accelerated Library Framework for Hybrid-x86

Intel Thread Building Blocks, Part II

Colin Riddell GPU Compiler Developer Codeplay Visit us at

Closing the Performance Gap with Modern C++

HPX A GENERAL PURPOSE C++ RUNTIME SYSTEM FOR PARALLEL AND DISTRIBUTED APPLICATIONS OF ANY SCALE

PGAS: Partitioned Global Address Space

Overview of research activities Toward portability of performance

HPX. HPX The Futurization of Computing

Givy, a software DSM runtime using raw pointers

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015!

libcppa Now: High-Level Distributed Programming Without Sacrificing Performance

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Performance Tools for Technical Computing

Cougar Open CL v1.0. Users Guide for Open CL support for Delphi/ C++Builder and.net

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

An Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters

Massively Parallel Task-Based Programming with HPX Thomas Heller Computer Architecture Department of Computer Science 2/ 77

Better Code: Concurrency Sean Parent Principal Scientist Adobe Systems Incorporated. All Rights Reserved.

ECE 3574: Dynamic Polymorphism using Inheritance

Heterogeneous Computing

What is DARMA? DARMA is a C++ abstraction layer for asynchronous many-task (AMT) runtimes.

Bullet Cloth Simulation. Presented by: Justin Hensley Implemented by: Lee Howes

HKG OpenCL Support by NNVM & TVM. Jammy Zhou - Linaro

UPC++ Specification v1.0 Draft 6. UPC++ Specification Working Group

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 10: GPGPU (3) Welcome!

Transcription:

Using SYCL as an Implementation Framework for HPX.Compute Marcin Copik 1 Hartmut Kaiser 2 1 RWTH Aachen University mcopik@gmail.com 2 Louisiana State University Center for Computation and Technology The STEllAR Group 1 of 29 May 16, 2017

Plan HPX Concepts HPX.Compute Challenges Benchmarking Summary Goals 1 of 29

What is HPX? High Performance ParalleX 1,2 Runtime for parallel and distributed applications Written purely in C++, with large usage of Boost Unified and standard-conforming C++ API 1 Parallex an advanced parallel execution model for scaling-impaired applications-h. Kaiser et al - ICPPW, 2009 2 A Task Based Programming Model in a Global Address Space - H. Kaiser et al - PGAS, 2014 2 of 29

What is HPX? 3 of 29

HPX and C++ standard HPX implements and even extends: Concurrency TS, N4107 Extended async, N3632 Task block, N4411 Parallelism TS, N4105 Executor, N4406 3 Segmented Iterators and Hierarchical Algorithms-Austern, Matthew H. - Generic Programming: International Seminar on Generic Programming, 2000 4 of 29

HPX and C++ standard HPX implements and even extends: Concurrency TS, N4107 Extended async, N3632 Task block, N4411 Parallelism TS, N4105 Executor, N4406 Another components partitioned vector segmented algorithms 3 3 Segmented Iterators and Hierarchical Algorithms-Austern, Matthew H. - Generic Programming: International Seminar on Generic Programming, 2000 4 of 29

Overview 5 of 29

Execution policy Puts restriction on execution, ensuring thread-safety C++17 sequential parallel parallel unsequenced HPX asynchronous sequential asynchronous parallel 6 of 29

Asynchronous execution Future represents result of an unfinished computation enables sending off operations to another thread TS allows for concurrent composition of different algorithms explicit depiction of data dependencies Compose different operations hpx :: future <type > f1 = hpx :: parallel :: for_each ( par_task,...) ; auto f2 = f1. then ( []( hpx :: future <type > f1) { hpx :: parallel :: for_each ( par_task,...) ; } ); 7 of 29

Plan HPX Concepts HPX.Compute Challenges Benchmarking Summary Goals 7 of 29

HPX.Compute a unified model for heterogeneous programming platform and vendor independent interface based on C++17 and further extensions to C++ standard Backends for: host CUDA HCC 4 SYCL 4 unfinished 8 of 29

HPX.Compute a unified model for heterogeneous programming platform and vendor independent interface based on C++17 and further extensions to C++ standard Three major concepts: target allocator executor 4 unfinished 8 of 29

Target an abstract type expressing data locality and place of execution variety of represented hardware requires a simplified interface Target interface: // Blocks until target is ready void synchronize (); // Future is ready when all tasks allocated on target have been finished hpx :: future <void > get_future () const ; 9 of 29

Target an abstract type expressing data locality and place of execution variety of represented hardware requires a simplified interface SYCL implementation of target communicaties with device through sycl::queue multiple targets may represent the same device requires additional measures for asynchronous communication 9 of 29

Allocator allocate and deallocate larger chunks of data on target data allocation is trivial on backends where memory is accessed with pointers (host, CUDA) SYCL implementation of allocator create sycl::buffer objects not possible to tie a buffer to given device 10 of 29

Executor execute code on device indicated by data location usual GPU-related restrictions on allowed C++ operations marking device functions not required Interface of an executor struct default_executor : hpx :: parallel :: executor_tag { template < typename F, typename Shape, typename... Ts > void bulk_launch ( F && f, Shape const & shape, Ts &&... ts) const ; template < typename F, typename Shape, typename... Ts > std :: vector < hpx :: future <void >> bulk_async_execute ( F && f, Shape const & shape, Ts &&... ts) const ; }; 11 of 29

Plan HPX Concepts HPX.Compute Challenges Benchmarking Summary Goals 11 of 29

Device accessors Capturing data buffers in SYCL a host iterator can only store sycl::buffer and position a separate device iterator has to be created in command group scope sycl::global ptr represents an iterator type on device, but std::iterator traits specialization or related typedefs are missing in SYCL standard Comparision with other backends: an additional static conversion function is necessary distinct iterator types on host and device requires templated function objects or C++14 generic lambda 12 of 29

Data movement Problem: copy data from a device to a given memory block on host, with a selection of an offset and size? host accessor - an intermediate copy in SYCL runtime, no flexibility, may lead to deadlocks if a host accessor is not destroyed set final data - applicable only for buffer destruction, no flexibility range-based subbufer - can emulate offset and size for host accessor map allocator - data is copied to a pointer defined by the SYCL user, but it can not be changed Further issues no ability to synchronize with data transfer 13 of 29

Data movement Suggested extension for SYCL // copy all contents of buffer template < typename T, int N, typename OutIter > sycl :: event copy ( const sycl :: buffer <T,N> & src, OutIter dest ); // copy range [ begin, end ) to buffer, fully replacing its contents template < typename InIter, T, int N> sycl :: event copy ( InIter begin, InIter end, sycl :: buffer <T,N> & dest ); 14 of 29

Data movement Suggested extension for SYCL // write range to buffer starting at pos template < typename T, int N, typename InIter > sycl :: event sycl :: buffer <T,N >:: write ( std :: size_t pos, InIter begin, InIter end ); // read size elements starting at pos template < typename T, int N, typename OutIter > sycl :: event sycl :: buffer <T,N >:: read ( size_t pos, size_t size, OutIter dest ); 15 of 29

Asynchronous execution What SYCL offers for synchronization? blocking wait for tasks in queue blocking wait for enqueued kernels with sycl::event SYCL API does not cover OpenCL callbacks Competing solutions stream callbacks in CUDA an extended future in C++AMP/HCC 16 of 29

Asynchronous execution Use SYCL-OpenCL interoperability for callbacks // future_data is a shared state of hpx :: future cl :: sycl :: queue queue =...; future_data * ptr =...; cl_event marker ; clenqueuemarkerwithwaitlist ( queue. get (), 0, nullptr, & marker ); clseteventcallback ( marker, CL_COMPLETE, []( cl_event, cl_int, void * ptr ) { marker_callback ( static_cast < future_data * >( ptr )); }, ptr ); Downside not applicable for SYCL host device 17 of 29

Non-standard layout datatypes An example: standard C++ tuple common std::tuple implementations, such as in libstdc++ or libc++, are not C++11 standard layout due to multiple inheritance adding a non-standard implementation requires complex changes in existing codebase Approaches for other types refactor current solution to be C++ standard layout manually deconstruct the object and construct again in kernel scope add serialization and deserialization interface to problematic types automatic serialization by the compiler - technique used in HCC 18 of 29

Kernel naming two-tier compilation needs to link kernel code and invocation name has to be unique breaks the standard API for STL algorithms different extensions to C++ may solve this problem 5 5 Khronos s OpenCL SYCL to support Heterogeneous Devices for C++ - Wong, M. et al. - P0236R0 19 of 29

Named execution policy execution policy contains the name use the type of function object if no name is provided used in ParallelSTL project 6 A SYCL named execution policy struct DefaultKernelName {}; template <class KernelName = DefaultKernelName > class sycl_execution_policy {... }; 6 https://github.com/khronosgroup/syclparallelstl/ 20 of 29

Named execution policy execution policy contains the name use the type of function object if no name is provided used in ParallelSTL project 6 Cons: no logical connection between execution policy and kernel name duplicating std::par execution policy 6 https://github.com/khronosgroup/syclparallelstl/ 20 of 29

Named execution policy Our solution: executor parameters an HPX extension to proposed concepts for executors a set of configuration options to control execution control settings which are independent from the actual executor type example: OpenMP-like chunk sizes Pass kernel name as a parameter // uses default executor : par hpx :: parallel :: for_each ( hpx :: parallel :: par. with ( hpx :: parallel :: kernel_name < class Name >() ),... ); 21 of 29

Plan HPX Concepts HPX.Compute Challenges Benchmarking Summary Goals 21 of 29

Benchmarking hardware for STREAM Khronos SYCL GPU: AMD Radeon R9 Fury Nano ComputeCPP: CommunityEdition-0.1.1 OpenCL: AMD APP SDK 2.9 GPU-STREAM has been used to measure SYCL performance: https://github.com/uob-hpc/gpu-stream 22 of 29

STREAM 400 STREAM benchmark on 305 MB arrays Bandwidth (GB/s) 300 200 100 HPX HPX without callbacks SYCL 0 Scale Copy Add Triad 23 of 29

STREAM STREAM scaling with size Bandwidth (GB/s) 100 10 1 HPX HPX without callbacks SYCL 0.1 0.01 0.1 1 10 100 Array size (MB) 24 of 29

Plan HPX Concepts HPX.Compute Challenges Benchmarking Summary Goals 24 of 29

Summary The Good performance and capabilities comparable with competing standards no requirement of marking functions capable of running on a device previous experiments revealed that an overhead of ComputeCpp, an offline device compiler for SYCL, is not severe during build process 25 of 29

Summary The Bad kernel names appearing in standard interface troublesome capture of complex types storing SYCL buffers lack of explicit data movement limited support for SPIR on modern GPUs 26 of 29

Summary The Ugly asynchronous callbacks work but with a slight overhead SYCL pointer types can not be treated as iterators troublesome capture of non-standard layout types 27 of 29

Future Goals demonstrate a complex problem solved over host and GPU with our model and STL algorithms extend implementation with more algorithms Challenges how to express on-chip/local memory through our model? try to reduce overhead for shorter kernels 28 of 29

Thanks for your attention mcopik@gmail.com 29 of 29