Modern C++ Parallelism from CPU to GPU


1 Modern C++ Parallelism from CPU to GPU. Simon Brand, Senior Software Engineer, GPGPU Toolchains, Codeplay. C++ Russia.

2 Agenda About me and Codeplay C++17 CPU Parallelism Third-party GPU Parallelism C++20/23 GPU Parallelism 2


4 Codeplay: based in Edinburgh, Scotland. Compilers, debuggers, profilers for heterogeneous systems. LLVM all the things. Standardisation: C++, OpenCL, Vulkan, HSA, SYCL. C++ on GPU! 4

5 Me! C++ professional and enthusiast. Working on compiler backends. blog.tartanllama.xyz. Interested in metaprogramming, functional programming, modern C++. 5

6 Agenda About me and Codeplay C++17 CPU Parallelism Parallel algorithms Third-party GPU Parallelism C++20/23 GPU Parallelism 6

7 Why parallelism? 7

8 Task parallelism vs data parallelism. Task parallelism: executing different tasks at the same time, e.g. one thread for GUI, one thread for data processing, one thread for network communication. Data parallelism: executing one task on many pieces of data at the same time, e.g. change the colour of every pixel, multiply matrices, add two vectors. 8

9 Sorting with the STL

    std::vector<int> data = { 8, 9, 1, 4 };

    // Normal sequential sort algorithm
    std::sort(std::begin(data), std::end(data));

    // An extra parameter to STL algorithms enables parallelism
    std::sort(std::execution::par, std::begin(data), std::end(data));

10 Using execution policies

    using namespace std::execution;

    // May execute in parallel
    std::sort(par, std::begin(data), std::end(data));

    // May be parallelized and vectorized
    std::sort(par_unseq, std::begin(data), std::end(data));

    // Will not be parallelized/vectorized
    std::sort(seq, std::begin(data), std::end(data));

    // Vendor-specific policy, read their documentation!
    std::sort(custom_vendor_policy, std::begin(data), std::end(data));
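For reference, a minimal complete program using these policies might look like the sketch below. It assumes a standard library that ships the C++17 parallel algorithms (recent MSVC, or GCC/Clang with TBB installed and -ltbb on the link line):

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
      std::vector<int> data = {8, 9, 1, 4};

      // Sequential: same semantics as the plain overload.
      std::sort(std::execution::seq, std::begin(data), std::end(data));

      // May run on multiple threads; element access functions
      // must be safe to run concurrently.
      std::sort(std::execution::par, std::begin(data), std::end(data));

      // May additionally be vectorised; comparators must not take locks.
      std::sort(std::execution::par_unseq, std::begin(data), std::end(data));
    }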

11 Parallel overloads available 11

12 Many different existing (partial) implementations are available today: Microsoft, Intel, HPX, HSA, Thibaut Lutz, NVIDIA, Codeplay. 12

13 New algorithms in the STL (serial reduction pattern)

    std::vector<int> is {0,1,2,3,4,5,6};
    std::accumulate(begin(is), end(is), 0); // 21

14 New algorithms in the STL (parallel reduction pattern)

    std::vector<int> is {0,1,2,3,4,5,6};
    std::reduce(std::execution::par, begin(is), end(is), 0); // 21

15 New algorithms in the STL (serial reduction pattern)

    std::vector<int> is {32,16,8,4,2,1};
    std::accumulate(begin(is), end(is), 64, std::minus<>{}); // 1

16 New algorithms in the STL (parallel reduction pattern)

    std::vector<int> is {32,16,8,4,2,1};
    std::reduce(std::execution::par, begin(is), end(is), 64, std::minus<>{}); // 23, not 1!

17 std::reduce requires associativity and commutativity of the operator *: associativity means (a * b) * c = a * (b * c); commutativity means a * b = b * a. 17
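To see why this matters, here is a minimal sketch: std::minus is neither associative nor commutative, so std::reduce is free to regroup and reorder the operations and may produce a different result from std::accumulate, as the slide above observed (23 on one run):

    #include <execution>
    #include <functional>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
      std::vector<int> is = {32, 16, 8, 4, 2, 1};

      // accumulate is strictly left-to-right:
      // (((((64-32)-16)-8)-4)-2)-1 == 1
      std::cout << std::accumulate(begin(is), end(is), 64,
                                   std::minus<>{}) << '\n';

      // reduce may regroup and reorder; with a non-associative,
      // non-commutative operator the result is unspecified.
      std::cout << std::reduce(std::execution::par, begin(is), end(is), 64,
                               std::minus<>{}) << '\n';
    }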

18 What can I do with a Parallel For Each? [diagram: all elements processed on a single core]

    size_t nelems = 10000u;
    std::vector<float> nums(nelems);
    std::fill_n(std::begin(nums), nelems, 1);
    std::for_each(std::begin(nums), std::end(nums),
                  [](float &f) { f = f * f + f; });

Intel Core i7 7th generation

19 What can I do with a Parallel For Each? [diagram: 2500 elements on each of four CPU cores]

    size_t nelems = 10000u;
    std::vector<float> nums(nelems);
    std::fill_n(std::execution::par, std::begin(nums), nelems, 1);
    std::for_each(std::execution::par, std::begin(nums), std::end(nums),
                  [](float &f) { f = f * f + f; });

Intel Core i7 7th generation

20 What can I do with a Parallel For Each? [diagram: 2500 elements on each of four CPU cores]

    size_t nelems = 10000u;
    std::vector<float> nums(nelems);
    std::fill_n(std::execution::par, std::begin(nums), nelems, 1);
    std::for_each(std::execution::par, std::begin(nums), std::end(nums),
                  [](float &f) { f = f * f + f; }); // What about this part?

Intel Core i7 7th generation

21 What can I do with a Parallel For Each? [diagram: all elements processed on the GPU]

    size_t nelems = 10000u;
    std::vector<float> nums(nelems);
    std::fill_n(sycl_policy, std::begin(nums), nelems, 1);
    std::for_each(sycl_named_policy<class KernelName>,
                  std::begin(nums), std::end(nums),
                  [](float &f) { f = f * f + f; });

Intel Core i7 7th generation

22 What can I do with a Parallel For Each? [diagram: 1250 elements on each of four CPU cores plus 5000 elements on the GPU]

    size_t nelems = 10000u;
    std::vector<float> nums(nelems);
    std::fill_n(sycl_heter_policy(cpu, gpu, 0.5), std::begin(nums), nelems, 1);
    std::for_each(sycl_heter_policy<class kname>(cpu, gpu, 0.5),
                  std::begin(nums), std::end(nums),
                  [](float &f) { f = f * f + f; });

Intel Core i7 7th generation. Experimental!
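If you want to try the SYCL execution policies shown above, they come from the experimental SyclParallelSTL project. The sketch below follows that project's sample code; the header paths, namespaces and the kernel name SquarePlus are assumptions that may differ between versions:

    #include <vector>

    #include <sycl/execution_policy>
    #include <experimental/algorithm>

    int main() {
      std::vector<float> nums(10000, 1.0f);

      cl::sycl::queue q;
      // SquarePlus is a hypothetical kernel name for this example.
      sycl::sycl_execution_policy<class SquarePlus> policy(q);

      // Runs the lambda over the data on the device selected by q.
      std::experimental::parallel::for_each(
          policy, std::begin(nums), std::end(nums),
          [](float &f) { f = f * f + f; });
    }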

23 Agenda About me and Codeplay C++17 CPU Parallelism Third-party GPU Parallelism Why GPU? OpenCL and SYCL C++20/23 GPU Parallelism 23

24 CPU vs GPU. CPU: small number of powerful cores; each core can execute different code; suitable for general-purpose computing. GPU: large number of simple cores; groups of cores execute the same instruction on different data; suitable for carrying out one operation on a large amount of data. 24

25 CPU 25

26 GPU 26

27 GPU Programming: nested conditional branches are very expensive; no indirect calls (virtual calls, function pointers); the main bottleneck is data transfer from main memory. The golden rule of GPU programming: use enough of the throughput capabilities to hide data transfer latency. 27
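To make the branching point concrete, here is a purely illustrative SYCL sketch (SYCL itself is introduced on the next slide): neighbouring work items disagree on the branch, so a SIMD group on the GPU has to execute both sides of the if/else, masking off half its lanes each time.

    #include <CL/sycl.hpp>
    #include <vector>

    int main() {
      std::vector<float> data(1024, 1.0f);
      cl::sycl::queue q;
      {
        cl::sycl::buffer<float, 1> buf(data.data(),
                                       cl::sycl::range<1>(data.size()));
        q.submit([&](cl::sycl::handler &cgh) {
          auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
          cgh.parallel_for<class divergent>(
              cl::sycl::range<1>(data.size()), [=](cl::sycl::id<1> i) {
                // Adjacent work items take different branches: divergence.
                if (i[0] % 2 == 0) acc[i] = acc[i] * acc[i]; // even lanes
                else               acc[i] = acc[i] + acc[i]; // odd lanes
              });
        });
      }
    }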

28 SYCL for OpenCL: a cross-platform, single-source, high-level C++ programming layer, built on top of OpenCL and based on standard modern C++, delivering a heterogeneous programming solution for C++. 2016 Codeplay Software Ltd.

29 Heterogeneous Systems: CPU, GPU, Accelerator, APU, FPGA, DSP. 29 2016 Codeplay Software Ltd.

30 SYCL is Entirely Standard C++ (the same vector add in four programming models)

    // CUDA
    __global__ vec_add(float *a, float *b, float *c) {
      return c[i] = a[i] + b[i];
    }
    float *a, *b, *c;
    vec_add<<<range>>>(a, b, c);

    // OpenMP
    vector<float> a, b, c;
    #pragma parallel_for
    for (int i = 0; i < a.size(); i++) {
      c[i] = a[i] + b[i];
    }

    // C++ AMP
    array_view<float> a, b, c;
    extent<2> e(64, 64);
    parallel_for_each(e, [=](index<2> idx) restrict(amp) {
      c[idx] = a[idx] + b[idx];
    });

    // SYCL: no language extensions, just C++
    cgh.parallel_for<class vec_add>(range, [=](cl::sycl::id<2> idx) {
      c[idx] = a[idx] + b[idx];
    });

Codeplay Software Ltd.

31 SYCL Ecosystem (layer stack): Applications, C++ Template Libraries, SYCL for OpenCL, OpenCL, OpenCL-enabled Accelerators. Codeplay Software Ltd.

32 Codeplay Software Ltd.

33 OpenCL Vector Add

    #define __CL_ENABLE_EXCEPTIONS
    #include "cl.hpp"
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <iostream>
    #include <math.h>

    // OpenCL kernel. Each work item takes care of one element of c
    const char *kernelSource =
        "#pragma OPENCL EXTENSION cl_khr_fp64 : enable   \n"
        "__kernel void vecAdd(__global double *a,        \n"
        "                     __global double *b,        \n"
        "                     __global double *c,        \n"
        "                     const unsigned int n) {    \n"
        "  // Get our global thread ID                   \n"
        "  int id = get_global_id(0);                    \n"
        "  // Make sure we do not go out of bounds       \n"
        "  if (id < n)                                   \n"
        "    c[id] = a[id] + b[id];                      \n"
        "}                                               \n";

    int main(int argc, char *argv[]) {
      // Length of vectors
      unsigned int n = 1000;

      // Host input vectors
      double *h_a;
      double *h_b;
      // Host output vector
      double *h_c;

      // Device input buffers
      cl::Buffer d_a;
      cl::Buffer d_b;
      // Device output buffer
      cl::Buffer d_c;

      // Size, in bytes, of each vector
      size_t bytes = n * sizeof(double);

      // Allocate memory for each vector on host
      h_a = new double[n];
      h_b = new double[n];
      h_c = new double[n];

      // Initialize vectors on host
      for (int i = 0; i < n; i++) {
        h_a[i] = sinf(i) * sinf(i);
        h_b[i] = cosf(i) * cosf(i);
      }

      cl_int err = CL_SUCCESS;
      try {
        // Query platforms
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        if (platforms.size() == 0) {
          std::cout << "Platform size 0\n";
          return -1;
        }

        // Get list of devices on default platform and create context
        cl_context_properties properties[] = {
            CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0])(), 0};
        cl::Context context(CL_DEVICE_TYPE_GPU, properties);
        std::vector<cl::Device> devices =
            context.getInfo<CL_CONTEXT_DEVICES>();

        // Create command queue for first device
        cl::CommandQueue queue(context, devices[0], 0, &err);

        // Create device memory buffers
        d_a = cl::Buffer(context, CL_MEM_READ_ONLY, bytes);
        d_b = cl::Buffer(context, CL_MEM_READ_ONLY, bytes);
        d_c = cl::Buffer(context, CL_MEM_WRITE_ONLY, bytes);

        // Bind memory buffers
        queue.enqueueWriteBuffer(d_a, CL_TRUE, 0, bytes, h_a);
        queue.enqueueWriteBuffer(d_b, CL_TRUE, 0, bytes, h_b);

        // Build kernel from source string
        cl::Program::Sources source(
            1, std::make_pair(kernelSource, strlen(kernelSource)));
        cl::Program program_ = cl::Program(context, source);
        program_.build(devices);

Codeplay Software Ltd.

34 OpenCL Vector Add (continued)

        // Create kernel object
        cl::Kernel kernel(program_, "vecAdd", &err);

        // Bind kernel arguments to kernel
        kernel.setArg(0, d_a);
        kernel.setArg(1, d_b);
        kernel.setArg(2, d_c);
        kernel.setArg(3, n);

        // Number of work items in each local work group
        cl::NDRange localSize(64);
        // Number of total work items - localSize must be divisor
        cl::NDRange globalSize((int)(ceil(n / (float)64) * 64));

        // Enqueue kernel
        cl::Event event;
        queue.enqueueNDRangeKernel(kernel, cl::NullRange, globalSize,
                                   localSize, NULL, &event);

        // Block until kernel completion
        event.wait();

        // Read back d_c
        queue.enqueueReadBuffer(d_c, CL_TRUE, 0, bytes, h_c);
      } catch (cl::Error err) {
        std::cerr << "ERROR: " << err.what() << "(" << err.err() << ")"
                  << std::endl;
      }

      // Sum up vector c and print result divided by n;
      // this should equal 1 within error
      double sum = 0;
      for (int i = 0; i < n; i++)
        sum += h_c[i];
      std::cout << "final result: " << sum / n << std::endl;

      // Release host memory
      delete[] h_a;
      delete[] h_b;
      delete[] h_c;

      return 0;
    }

Codeplay Software Ltd.

35 Codeplay Software Ltd.

36 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T> class kernel;

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
      cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
      cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
      cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());

      cl::sycl::queue defaultQueue;

      defaultQueue.submit([&](cl::sycl::handler &cgh) {
        auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);

        cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
          [=](cl::sycl::id<1> idx) {
            outputPtr[idx] = inputAPtr[idx] + inputBPtr[idx];
          });
      });
    }

37 SYCL Vector Add 37

38 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
    }

39 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
      cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
      cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
      cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
    }

The buffers synchronise upon destruction
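That synchronisation-on-destruction rule gives a common RAII pattern: scope the buffer so the host data is guaranteed to be up to date after the closing brace. A minimal sketch, assuming a SYCL 1.2-style implementation such as ComputeCpp:

    #include <CL/sycl.hpp>
    #include <vector>

    int main() {
      std::vector<float> out(16, 0.0f);
      cl::sycl::queue q;
      {
        cl::sycl::buffer<float, 1> outBuf(out.data(),
                                          cl::sycl::range<1>(out.size()));
        q.submit([&](cl::sycl::handler &cgh) {
          auto ptr = outBuf.get_access<cl::sycl::access::mode::write>(cgh);
          cgh.parallel_for<class fill_ones>(
              cl::sycl::range<1>(out.size()),
              [=](cl::sycl::id<1> i) { ptr[i] = 1.0f; });
        });
      } // outBuf destroyed: waits for the kernel, copies results into out
      // out[i] == 1.0f for all i from here on
    }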

40 Device Selectors. Device selectors let you choose a device based on a custom configuration: the selector evaluates all devices within a system's topology and scores them, for example: is the device from vendor V? Is it a GPU? Does it support the double type? Does it have N execution units? [diagram: selector scores GPU, CPU and FPGA, and picks the selected device] Codeplay Software Ltd.
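In SYCL 1.2 this is done by deriving from cl::sycl::device_selector and returning a score for each device; the highest score wins. A minimal sketch along the lines of the criteria above:

    #include <CL/sycl.hpp>

    // Custom selector: prefer GPUs, and GPUs with double support even more.
    class my_selector : public cl::sycl::device_selector {
     public:
      int operator()(const cl::sycl::device &dev) const override {
        int score = 0;
        if (dev.is_gpu()) score += 100;
        if (dev.has_extension("cl_khr_fp64")) score += 10;
        return score; // highest score wins; negative means never select
      }
    };

    int main() {
      cl::sycl::queue q{my_selector{}}; // queue on the best-scoring device
    }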

41 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
      cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
      cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
      cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());

      cl::sycl::queue defaultQueue;
    }

42 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T> class kernel;

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
      cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
      cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
      cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());

      cl::sycl::queue defaultQueue;

      // Create a command group to define an asynchronous task
      defaultQueue.submit([&](cl::sycl::handler &cgh) {
      });
    }

43 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T> class kernel;

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
      cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
      cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
      cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());

      cl::sycl::queue defaultQueue;

      defaultQueue.submit([&](cl::sycl::handler &cgh) {
        auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);
      });
    }

44 Data Dependency Task Graphs [diagram, built up over slides 44-47: buffers A, B, C and D are connected to command groups CG A, CG B and CG C through read and write accessors; the accessors declared in each command group tell the runtime which data each task reads and writes, and the runtime builds the task graph from those dependencies] Codeplay Software Ltd.
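As a concrete sketch of how these graphs arise, the two command groups below are ordered by the runtime purely because the second declares a read-write accessor to the buffer the first writes; no explicit synchronisation is needed (the kernel names cgA and cgB are illustrative):

    #include <CL/sycl.hpp>
    #include <vector>

    int main() {
      std::vector<int> v(16, 1);
      cl::sycl::queue q;
      cl::sycl::buffer<int, 1> buf(v.data(), cl::sycl::range<1>(v.size()));

      q.submit([&](cl::sycl::handler &cgh) { // CG A: writes buf
        auto w = buf.get_access<cl::sycl::access::mode::write>(cgh);
        cgh.parallel_for<class cgA>(
            cl::sycl::range<1>(v.size()),
            [=](cl::sycl::id<1> i) { w[i] = static_cast<int>(i[0]); });
      });

      q.submit([&](cl::sycl::handler &cgh) { // CG B: reads what CG A wrote
        auto rw = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
        cgh.parallel_for<class cgB>(
            cl::sycl::range<1>(v.size()),
            [=](cl::sycl::id<1> i) { rw[i] = rw[i] * 2; });
      });
    }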

48 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T> class kernel;

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
      cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
      cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
      cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());

      cl::sycl::queue defaultQueue;

      defaultQueue.submit([&](cl::sycl::handler &cgh) {
        auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);

        // Create a parallel_for to define the device code
        cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
          [=](cl::sycl::id<1> idx) {
          });
      });
    }

49 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T> class kernel;

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
      cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
      cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
      cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());

      cl::sycl::queue defaultQueue;

      defaultQueue.submit([&](cl::sycl::handler &cgh) {
        auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);

        cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
          [=](cl::sycl::id<1> idx) {
            outputPtr[idx] = inputAPtr[idx] + inputBPtr[idx];
          });
      });
    }

50 SYCL Vector Add

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out);

    int main() {
      std::vector<float> inputA = { /* input a */ };
      std::vector<float> inputB = { /* input b */ };
      std::vector<float> output = { /* output */ };

      parallel_add(inputA, inputB, output);
      // ...
    }
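Note that output must already hold as many elements as the result needs: parallel_add sizes all three buffers from out.size(), so an empty output vector would create empty buffers.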

51 C++ / SYCL Compilation Model [diagram, built up over slides 51-58: in the standard C++ compilation model, a C++ source file goes through the CPU compiler to a CPU object, and the linker produces CPU ISA that runs on the CPU; there is no path to the GPU. SYCL adds one: the device source below is extracted from the same C++ source file and compiled by the SYCL compiler to SPIR, which the linker embeds alongside the CPU ISA, so a single binary drives both the CPU and the GPU]

    auto inA = bufA.get_access<access::mode::read>(cgh);
    auto inB = bufB.get_access<access::mode::read>(cgh);
    auto out = bufO.get_access<access::mode::write>(cgh);

    cgh.parallel_for<class add>(range<2>(64, 64),
      [=](id<2> i) { out[i] = inA[i] + inB[i]; });

Codeplay Software Ltd.

59 Demo Results: running std::sort on an Intel CPU with Intel HD Graphics 520. [Table: execution times in seconds for std::seq, std::par, std::unseq and sycl_execution_policy at input sizes 2^16, 2^17, 2^18 and 2^19; the timings themselves were not captured.] 59

60 Community Edition Available now for free! Visit: computecpp.codeplay.com Codeplay Software Ltd.

61 Agenda About me and Codeplay C++17 CPU Parallelism Third-party GPU Parallelism C++20/23 GPU Parallelism Executors Ranges 61

62 Executors: a unified interface for execution. Above the interface sit the things that create work: invoke, async, the parallel algorithms, define_task_block, future::then, and asynchronous operations such as dispatch, post, defer and strand<> from Boost.Asio / the Networking TS. Below it sit the places work can run: SYCL / OpenCL / CUDA / HCC, OpenMP / MPI, and C++ thread pools. 62

63 Executors static_thread_pool pool(4); auto exec = pool.executor(); exec.execute(func); 63

64 Executors static_thread_pool pool(4); auto exec = pool.executor(); auto future = std::async(exec, func); 64

65 Executors static_thread_pool pool(4); auto exec = pool.executor(); std::for_each(std::execution::par.on(exec), begin, end, func); 65

66 Executors auto exec = get_gpu_executor(); std::for_each(std::execution::par.on(exec), begin, end, func); 66

An executor's cardinality reflects whether an execution launches a single thread of execution or multiple threads of execution: single cardinality or bulk cardinality. 67

An executor's directionality reflects whether an execution provides a channel by which to synchronise or return a result or exception: one-way or two-way directionality. 68

An executor's blocking guarantee reflects whether an execution blocks the caller's thread until execution is complete: possibly-blocking, always-blocking, or never-blocking. 69

70 The combinations give four execution functions:

    Single + one-way: execute()           Single + two-way: twoway_execute()
    Bulk + one-way:   bulk_execute()      Bulk + two-way:   bulk_twoway_execute()
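To make the single-cardinality, one-way shape concrete, here is a toy executor that runs work inline (purely illustrative; the names in P0443 changed between revisions):

    #include <utility>

    // A toy one-way, single-cardinality, always-blocking executor.
    struct inline_executor {
      template <typename F>
      void execute(F &&f) const { std::forward<F>(f)(); }
    };

    int main() {
      inline_executor ex;
      ex.execute([] { /* work runs inline on the caller's thread */ });
    }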

71 Executor Requirements oneway_executor exec; auto newexec = require(exec, never_blocking); auto fut = newexec.twoway_execute([&]() { return func(); }); 71

72 Executor Requirements possibly_blocking_executor exec; auto newexec = prefer(exec, never_blocking); newexec.execute([&]() { func(); }); 72

73 Executor Requirements possibly_blocking_executor exec; auto newexec = prefer(exec, never_blocking); auto isneverblocking = query(newexec, never_blocking); 73

74 Ranges Range View Action 74

75 Ranges Range = pair of iterators View = lazy adaptation Action = eager mutation 75

76 Ranges: Views and Actions

    using namespace ranges;
    int sum = accumulate(view::ints(1)
                           | view::transform([](int i) { return i * i; })
                           | view::take(10),
                         0);

    extern std::vector<int> read_data();
    using namespace ranges;
    std::vector<int> vi = read_data()
                        | action::sort
                        | action::unique;
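As a complete program, the views example above might look like this (assuming range-v3 of that era, where the namespace is view:: rather than today's views::):

    #include <iostream>
    #include <range/v3/all.hpp>

    int main() {
      using namespace ranges;
      // Lazy: squares are only computed as accumulate pulls them through.
      int sum = accumulate(view::ints(1)
                             | view::transform([](int i) { return i * i; })
                             | view::take(10),
                           0);
      std::cout << sum << '\n'; // 1*1 + 2*2 + ... + 10*10 = 385
    }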


78 Parallel Ranges on GPU

    auto exec = get_gpu_executor();
    auto indices = ranges::view::iota(0)
                 | ranges::view::take(width * height);
    std::experimental::transform(exec, indices, gpu_image,
                                 CalculatePixel{height, width, iterations});

79 Parallel Ranges on GPU: the main GPU bottleneck is data transfer, and lazy views let you avoid it. [Diagram: executing f1, f2 and f3 eagerly transfers data between CPU and GPU for each function; with lazy views the fused f1+f2+f3 runs as a single GPU step with one transfer each way] 79
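A small CPU-only sketch of the same fusion idea with range-v3 (illustrative; a GPU backend would run the fused pipeline as one kernel):

    #include <range/v3/all.hpp>
    #include <vector>

    int main() {
      using namespace ranges;
      std::vector<float> data(1000, 2.0f);
      auto f1 = [](float x) { return x + 1.0f; };
      auto f2 = [](float x) { return x * x; };
      auto f3 = [](float x) { return x - 3.0f; };

      // One fused traversal: each element flows through f1, f2 and f3 in
      // turn, so a GPU backend needs one kernel and one transfer each way.
      auto fused = data | view::transform(f1)
                        | view::transform(f2)
                        | view::transform(f3);

      float sum = 0.0f;
      for (float x : fused) sum += x;
    }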

80 Example with Views Intel i7-6700k CPU and Intel HD Graphics 530 GPU Speedups calculated from median execution times of 100 runs per experiment ComputeCpp CE v0.5 Prototype benchmarking branch Codeplay Software Ltd. - CONFIDENTIAL

81 What if there is no predefined function?

    gstorm::sycl_exec exec;
    auto gpu_image = std::experimental::copy(exec, image);
    auto indices = ranges::view::iota(0)
                 | ranges::view::take(width * height);
    std::experimental::transform(exec, indices, gpu_image,
                                 CalculatePixel{height, width, iterations});

Codeplay Software Ltd. - CONFIDENTIAL

82 What if there is no predefined function? [Chart: Mandelbrot, Intel PSTL vs SYCL Ranges] Codeplay Software Ltd. - CONFIDENTIAL

83 Conclusions C++17 makes data parallel programming easier SYCL enables heterogeneous programming in standard C++ Executors will provide a unified model of execution Parallel ranges will make C++ GPU programming even better 83

84 We're hiring! codeplay.com/careers/ simon@codeplay.com info@codeplay.com blog.tartanllama.xyz codeplay.com
