Parallel STL in today s SYCL Ruymán Reyes

Size: px

Start display at page:

Download "Parallel STL in today s SYCL Ruymán Reyes"

Vanessa Bridges
6 years ago
Views:

1 Parallel STL in today s SYCL Ruymán Reyes ruyman@codeplay.com Codeplay Research 15 th November, 2016

2 Outline 1 Parallelism TS 2 The SYCL parallel STL 3 Heterogeneous Execution with Parallel STL 4 Conclusions and Future Work 2

3 The presenter Ruyman Reyes, PhD Background in HPC, programming models and compilers Worked in HPC Scientific Code (ScaLAPACK, GROMACs, CP2K) Created the first Open Source OpenACC implementation Contributor to SYCL Specification Lead of ComputeCpp (Codeplay s SYCL implementation) Coordinating the work on SYCL Parallel STL 3

4 Codeplay Software We build software development tools for SoC Software company based in Edinburgh 42 developers Different background and skill set Games Industry, AI, compilers, HPC, robotics Various levels of expertise (graduates to PhD) Customers work in all areas of Industry Smartphones Self-driving cars Game consoles Our technology is probably in your pocket! 4

5 Outline 1 Parallelism TS 2 The SYCL parallel STL 3 Heterogeneous Execution with Parallel STL 4 Conclusions and Future Work 5

6 Parallel STL: Democratizing Parallelism in C++ Various libraries offered STL-like interface for parallel algorithms Thrust, Bolt, libstdc++ Parallel Mode, AMP algorithms In 2012, two separate proposals for parallelism to C++ standard: NVIDIA (N3408), based on Thrust (CUDA-based C++ library) Microsoft and Intel (N3429), based on Intel TBB and PPL/C++AMP Made joint proposal (N3554) suggested by SG1 Many working drafts for N3554, N3850, N3960, N4071, N4409 Final proposal P0024R2 accepted for C++17 during Jacksonville Latest status on C++ draft in github 6

7 Existing implementations Following the evolution of the document Microsoft: HPX: manual/parallel.html HSA: Thibaut Lutz: NVIDIA: Codeplay: 7

8 What is Parallelism TS adding? A set of execution policies and a collection of parallel algorithms The Execution Policies Paragraphs explaining the conditions for parallel algorithms New parallel algorithms The exception_list class 8

9 Sorting with the STL A sequential sort std :: vector <int > data = { 8, 9, 1, 4 }; std :: sort (std :: begin (data ), std :: end (data )); if (std :: is_sorted (data )) { cout << " Data is sorted! " << endl ; } 9

10 Sorting with the STL A parallel sort std :: vector <int > data = { 8, 9, 1, 4 }; std :: sort (std ::par, std :: begin (data ), std :: end (data )); if (std :: is_sorted (data )) { cout << " Data is sorted! " << endl ; } 9

11 Sorting with the STL A parallel sort std :: vector <int > data = { 8, 9, 1, 4 }; std :: sort (std ::par, std :: begin (data ), std :: end (data )); if (std :: is_sorted (data )) { cout << " Data is sorted! " << endl ; } par is an object of an Execution Policy The sort will be executed in parallel using an implementation-defined method 9

12 The Execution Policy Standard policy classes Defined in the execution namespace class sequenced_policy: Never do parallel class parallel_policy: Can use caller thread, but may span others (e.g, std::thread) Invocations do not interleave on a single thread class parallel_unsequenced_policy: Can use caller thread or others (e.g std::thread) Multiple invocations may be interleaved on a single thread Global objects constexpr sequenced_policy sequenced ; constexpr parallel_policy par ; constexpr parallel_unseq_policy par_unseq ; 10

13 The Execution Policy Choosing different parallel implementations // May execute in parallel std :: sort (std ::par, std :: begin (data ), std :: end (data )); // May be parallelized and vectorized std :: sort (std :: par_unseq, std :: begin (data ), std :: end (data )); // Will not be parallelized std :: sort (std :: sequenced, std :: begin (data ), std :: end (data )); Propagating the policy to the end user template <typename Policy, typename Iterator > void library_function ( Policy p, Iterator begin, Iterator end ) { std :: sort (p, begin, end ); std :: for_each (p, begin, end, [&] ( Iterator :: value_type e& ) { e++; } ); std :: for_each (std :: sequenced, begin, end, non_parallel_operation ); } Implementations can define their own Execution Policies 11

14 Dealing with exceptions Different execution threads may abort with different exceptions Parallel STL algorithms may throw an exception_list Note that an uncaught exception on the parallel_unsequenced_policy will cause a terminate. class exception_list : public exception { public : typedef unspecified iterator ; size_t size () const noexcept ; iterator begin () const noexcept ; iterator end () const noexcept ; virtual const char * what () const noexcept ; }; 12

15 Parallel Algorithms Overloads to STL algorithms taking the SYCL Execution Policy Not all STL algorithms are suitable for Parallel Execution! 13

16 Introducing new algorithms to the STL For Each template < class ExecutionPolicy, class InputIterator, class Function > void for_each ( ExecutionPolicy && exec, InputIterator first, InputIterator last, Function f); template < class ExecutionPolicy, class InputIterator, class Size, class Function > InputIterator for_each_n ( ExecutionPolicy && exec, InputIterator first, Size n, Function f); template <class InputIterator, class Size, class Function > InputIterator for_each_n ( InputIterator first, Size n, Function f); for_each: Applies f to elements in [first, last). for_each_n: Applies f to elements in [first, first + n). 14

17 Introducing new algorithms to the STL Numerical parallel algorithms template < class InputIterator > typename iterator_traits <InputIterator >:: value_type reduce ( InputIterator first, InputIterator last ); template < class InputIterator, class T> T reduce ( InputIterator first, InputIterator last, T init ); template < class InputIterator, class T, class BinaryOperation > T reduce ( InputIterator first, InputIterator last, T init, BinaryOperation binary_op ); As opposed to accumulate, binary_op is executed on an unespecified order. 15

18 Introducing new algorithms to the STL Other algorithms initroduced Exclusive/Inclusive Scan (Prefix Sum) Transform Reduce Transform Exclusive/Inclusive Scan Basic block to construct other algorithms and applications! 16

19 Outline 1 Parallelism TS 2 The SYCL parallel STL 3 Heterogeneous Execution with Parallel STL 4 Conclusions and Future Work 17

20 The SYCL Parallel STL Spec & Examples Enable C++17 Parallel STL to run on any SYCL-supported device Any OpenCL platform with SPIR support Improves productivity of C++ developers worried about performance Integrates nicely with existing SYCL codebases Completely Open Source SYCL Parallel STL introduces two execution policies 18

21 Sorting with the STL A sequential sort std :: vector <int > data = { 8, 9, 1, 4 }; std :: sort (std :: begin (data ), std :: end (data )); if (std :: is_sorted (data )) { cout << " Data is sorted! " << endl ; } 19

22 Sorting with the STL Sorting on the GPU! std :: vector <int > data = { 8, 9, 1, 4 }; std :: sort ( sycl_policy, std :: begin (v), std :: end (v)); if (std :: is_sorted (data )) { cout << " Data is sorted! " << endl ; } 19

23 Sorting with the STL Sorting on the GPU! std :: vector <int > data = { 8, 9, 1, 4 }; std :: sort ( sycl_policy, std :: begin (v), std :: end (v)); if (std :: is_sorted (data )) { cout << " Data is sorted! " << endl ; } sycl_policy is an Execution Policy data is an standard stl::vector Technically will use the device returned by default_selector 19

24 The SYCL Policy template < typename KernelName = DefaultKernelName > class sycl_execution_policy { public : using kernelname = KernelName ; }; sycl_execution_policy () = default ; sycl_execution_policy (cl :: sycl :: queue cl :: sycl :: queue get_queue () const ; q); Indicates algorithm will be executed using a SYCL-device Can optionally take a queue Re-use device-selection Asynchronous data copy-back... 20

25 Why the KernelName template? How are algorithms implemented? auto f = [ vectorsize, &bufi, &bufo, op ]( cl :: sycl :: handler &h) mutable {... auto ai = bufi. template get_access <access :: mode :: read >(h); auto ao = bufo. template get_access <access :: mode :: write >(h); h. parallel_for < /* The Kernel Name */ >(r, [ai, ao, op ]( cl :: sycl ::id <1> id) { ao[id.get (0) ] = UserFunctor (ai[id.get (0) ]); }); }; 21

26 Why the KernelName template? How are algorithms implemented? auto f = [ vectorsize, &bufi, &bufo, op ]( cl :: sycl :: handler &h) mutable {... auto ai = bufi. template get_access <access :: mode :: read >(h); auto ao = bufo. template get_access <access :: mode :: write >(h); h. parallel_for < /* The Kernel Name */ >(r, [ai, ao, op ]( cl :: sycl ::id <1> id) { ao[id.get (0) ] = UserFunctor (ai[id.get (0) ]); }); }; Two separate calls can generate different kernels! transform (par, v. begin (), v. end (), [=]( int & val ){ val ++; }); transform (par, v. begin (), v. end (), [=]( int & val ){ val - -; }); 21

27 Using named policies and queues using namespace cl :: sycl ; using namespace experimental :: parallel :: sycl ; std :: vector <int > v =...; // Transform default_selector ds; { queue q(ds); sort ( sycl_execution_policy (q), std :: begin (v), std :: end (v)); sycl_execution_policy < class myname > sepn1 (q); transform (sepn1, std :: begin (v), std :: end (v), std :: begin (v), [=]( int i) { return i + 1;}) ; } Only required for lambdas, not functors Device selection and queue are re-used Data is copied in/out in each call! 22

28 Avoiding data-copies using buffers using namespace cl :: sycl ; using namespace experimental :: parallel :: sycl ; std :: vector <int > v =...; default_selector h; { buffer <int > b(std :: begin (v), std :: end (v)); b. set_final_data (v.data ()); { queue q(h); sort ( sycl_execution_policy (q), begin (b), end (b)); sycl_execution_policy < class transform1 > sepn1 (q); transform ( sepn1, begin (b), end (b), begin (b), []( int num ) { return num + 1; }); } } Buffer is constructed from STL containers Data will be copied back to the container when buffer is done Note the additional copy from vec to buffer and vice-versa 23

29 Using device-only data using namespace experimental :: parallel :: sycl ; default_selector h; { buffer <int, 1> b(range <1>( size )); b. set_final_data (v.data ()); { cl :: sycl :: queue q(h); { auto hostacc = b. get_access <mode :: read_write, target :: host_buffer >() ; for ( auto & : hostacc ) { *i = read_data_from_file (...) ; } } sort ( sycl_execution_policy (q), begin (b), end (b)); transform ( sycl_policy, begin (b), end (b), begin (b), std :: negate <int >() ); } } Data is initialized in the host using a host accessor After host accessor is done, data is on the device 24

30 Outline 1 Parallelism TS 2 The SYCL parallel STL 3 Heterogeneous Execution with Parallel STL 4 Conclusions and Future Work 25

percentage of work assigned to GPU/CPU Policy distributes workload accordingly HiPEAC internship

31 Heterogeneous Execution Policy Distribute workload in Parallel STL algorithms Execution on two devices at the same time Designed for integrated CPU / GPU platforms User sets the decide percentage of work assigned to GPU/CPU Policy distributes workload accordingly HiPEAC internship Research work funded via collaboration grant with HiPEAC Ph.D Student from University of Malaga (A. Vilches) 26

32 Heterogeneous Execution Example cl :: sycl :: queue q; amd_cpu_selector cpu_sel ; cl :: sycl :: queue q2( cpu_sel ); sycl :: sycl_heterogeneous_execution_policy < class TransformAlgorithm1 > snp ( q, q2, ratio ); auto mytransform = [&]() { float pi = 3.14; std :: experimental :: parallel :: transform ( snp, std :: begin (v1), std :: end (v1), std :: begin (v2), std :: begin (res ), [=]( float a, float b) { return pi * a + b; }); }; 27

33 Manual distribution of the work 28

34 Heterogeneous Execution Policy in Use 29

35 Heterogeneous Execution Trivial Implementation... { auto buf1_q1 = sycl :: helpers :: make_const_buffer (first1, first1 + crosspoint ); auto buf2_q1 = sycl :: helpers :: make_const_buffer (first2, first2 + crosspoint ); auto res_q1 = sycl :: helpers :: make_buffer (result, result + crosspoint ); auto buf1_q2 = sycl :: helpers :: make_const_buffer ( first1 + crosspoint, last1 ); auto buf2_q2 = sycl :: helpers :: make_const_buffer ( first2 + crosspoint, first2 + elements ); auto res_q2 = sycl :: helpers :: make_buffer ( result + crosspoint, result + elements ); impl :: transform ( named_sep, q1, buf1_q1, buf2_q1, res_q1, binary_op ); impl :: transform ( named_sep, q2, buf1_q2, buf2_q2, res_q2, binary_op ); }... 30

36 Heterogeneous load balancing Dynamic decision of heterogeneous balancing Percentage offloading is runtime value Developers can create runtime evaluation functions Depending on workload Depending on platform Depending on user-input 31

37 Outline 1 Parallelism TS 2 The SYCL parallel STL 3 Heterogeneous Execution with Parallel STL 4 Conclusions and Future Work 32

38 Conclusions Parallel STL Enables developers to quickly exploit parallel architectures SYCL makes implementing these algorithms for heterogeneous platforms trivial Just write single-source C++ Will work on any OpenCL + SPIR platform! Heterogeneous Execution Plenty of heterogeneous platform out there Complex to work with them! SYCL allows developers to focus on algorithms and distribute work Heterogeneous Policy can be optimized and customized per platform Runtime can get platform information and use custom balancing Users can extend the policy with specific balancing decisions 33

39 @codeplaysoft codeplay.com

Using SYCL as an Implementation Framework for HPX.Compute

Using SYCL as an Implementation Framework for HPX.Compute Marcin Copik 1 Hartmut Kaiser 2 1 RWTH Aachen University mcopik@gmail.com 2 Louisiana State University Center for Computation and Technology The