Modern C++ Parallelism from CPU to GPU


1 Modern C++ Parallelism from CPU to GPU. Simon Brand, Senior Software Engineer, GPGPU Toolchains, Codeplay. C++ Russia.

2 Agenda About me and Codeplay C++17 CPU Parallelism Third-party GPU Parallelism C++20/23 GPU Parallelism 2


4 Codeplay: based in Edinburgh, Scotland. Compilers, debuggers, profilers for heterogeneous systems. LLVM all the things. Standardisation: C++, OpenCL, Vulkan, HSA, SYCL. C++ on GPU! 4

5 Me! C++ professional and enthusiast. Working on compiler backends. blog.tartanllama.xyz. Interested in metaprogramming, functional programming, modern C++. 5

6 Agenda About me and Codeplay C++17 CPU Parallelism Parallel algorithms Third-party GPU Parallelism C++20/23 GPU Parallelism 6

7 Why parallelism? 7

8 Task parallelism vs data parallelism. Task parallelism: executing different tasks at the same time, e.g. one thread for GUI, one thread for data processing, one thread for network communication. Data parallelism: executing one task on many pieces of data at the same time, e.g. change the colour of every pixel, multiply matrices, add two vectors. 8

9 Sorting with the STL

    std::vector<int> data = { 8, 9, 1, 4 };

    // Normal sequential sort algorithm
    std::sort(std::begin(data), std::end(data));

    // An extra parameter to STL algorithms enables parallelism
    std::sort(std::execution::par, std::begin(data), std::end(data));

10 Using execution policies

    using namespace std::execution;

    // May execute in parallel
    std::sort(par, std::begin(data), std::end(data));

    // May be parallelized and vectorized
    std::sort(par_unseq, std::begin(data), std::end(data));

    // Will not be parallelized/vectorized
    std::sort(seq, std::begin(data), std::end(data));

    // Vendor-specific policy, read their documentation!
    std::sort(custom_vendor_policy, std::begin(data), std::end(data));
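For reference, a minimal complete program using these policies might look like the sketch below. It assumes a standard library that ships the C++17 parallel algorithms (recent MSVC, or GCC/Clang with TBB installed and -ltbb on the link line):

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
      std::vector<int> data = {8, 9, 1, 4};

      // Sequential: same semantics as the plain overload.
      std::sort(std::execution::seq, std::begin(data), std::end(data));

      // May run on multiple threads; element access functions
      // must be safe to run concurrently.
      std::sort(std::execution::par, std::begin(data), std::end(data));

      // May additionally be vectorised; comparators must not take locks.
      std::sort(std::execution::par_unseq, std::begin(data), std::end(data));
    }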

11 Parallel overloads available 11

12 Many different existing (partial) implementations are available today: Microsoft, Intel, HPX, HSA, Thibaut Lutz, NVIDIA, Codeplay. 12

13 New algorithms in the STL (serial reduction pattern)

    std::vector<int> is {0,1,2,3,4,5,6};
    std::accumulate(begin(is), end(is), 0); // 21

14 New algorithms in the STL (parallel reduction pattern)

    std::vector<int> is {0,1,2,3,4,5,6};
    std::reduce(std::execution::par, begin(is), end(is), 0); // 21

15 New algorithms in the STL (serial reduction pattern)

    std::vector<int> is {32,16,8,4,2,1};
    std::accumulate(begin(is), end(is), 64, std::minus<>{}); // 1

16 New algorithms in the STL (parallel reduction pattern)

    std::vector<int> is {32,16,8,4,2,1};
    std::reduce(std::execution::par, begin(is), end(is), 64, std::minus<>{}); // 23, not 1!

17 std::reduce requires associativity and commutativity of the operator *: associativity means (a * b) * c = a * (b * c); commutativity means a * b = b * a. 17
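To see why this matters, here is a minimal sketch: std::minus is neither associative nor commutative, so std::reduce is free to regroup and reorder the operations and may produce a different result from std::accumulate, as the slide above observed (23 on one run):

    #include <execution>
    #include <functional>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
      std::vector<int> is = {32, 16, 8, 4, 2, 1};

      // accumulate is strictly left-to-right:
      // (((((64-32)-16)-8)-4)-2)-1 == 1
      std::cout << std::accumulate(begin(is), end(is), 64,
                                   std::minus<>{}) << '\n';

      // reduce may regroup and reorder; with a non-associative,
      // non-commutative operator the result is unspecified.
      std::cout << std::reduce(std::execution::par, begin(is), end(is), 64,
                               std::minus<>{}) << '\n';
    }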

18 What can I do with a Parallel For Each? [diagram: all elements processed on a single core]

    size_t nelems = 10000u;
    std::vector<float> nums(nelems);
    std::fill_n(std::begin(nums), nelems, 1);
    std::for_each(std::begin(nums), std::end(nums),
                  [](float &f) { f = f * f + f; });

Intel Core i7 7th generation

19 What can I do with a Parallel For Each? [diagram: 2500 elements on each of four CPU cores]

    size_t nelems = 10000u;
    std::vector<float> nums(nelems);
    std::fill_n(std::execution::par, std::begin(nums), nelems, 1);
    std::for_each(std::execution::par, std::begin(nums), std::end(nums),
                  [](float &f) { f = f * f + f; });

Intel Core i7 7th generation

20 What can I do with a Parallel For Each? [diagram: 2500 elements on each of four CPU cores]

    size_t nelems = 10000u;
    std::vector<float> nums(nelems);
    std::fill_n(std::execution::par, std::begin(nums), nelems, 1);
    std::for_each(std::execution::par, std::begin(nums), std::end(nums),
                  [](float &f) { f = f * f + f; }); // What about this part?

Intel Core i7 7th generation

21 What can I do with a Parallel For Each? [diagram: all elements processed on the GPU]

    size_t nelems = 10000u;
    std::vector<float> nums(nelems);
    std::fill_n(sycl_policy, std::begin(nums), nelems, 1);
    std::for_each(sycl_named_policy<class KernelName>,
                  std::begin(nums), std::end(nums),
                  [](float &f) { f = f * f + f; });

Intel Core i7 7th generation

22 What can I do with a Parallel For Each? [diagram: 1250 elements on each of four CPU cores plus 5000 elements on the GPU]

    size_t nelems = 10000u;
    std::vector<float> nums(nelems);
    std::fill_n(sycl_heter_policy(cpu, gpu, 0.5), std::begin(nums), nelems, 1);
    std::for_each(sycl_heter_policy<class kname>(cpu, gpu, 0.5),
                  std::begin(nums), std::end(nums),
                  [](float &f) { f = f * f + f; });

Intel Core i7 7th generation. Experimental!
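If you want to try the SYCL execution policies shown above, they come from the experimental SyclParallelSTL project. The sketch below follows that project's sample code; the header paths, namespaces and the kernel name SquarePlus are assumptions that may differ between versions:

    #include <vector>

    #include <sycl/execution_policy>
    #include <experimental/algorithm>

    int main() {
      std::vector<float> nums(10000, 1.0f);

      cl::sycl::queue q;
      // SquarePlus is a hypothetical kernel name for this example.
      sycl::sycl_execution_policy<class SquarePlus> policy(q);

      // Runs the lambda over the data on the device selected by q.
      std::experimental::parallel::for_each(
          policy, std::begin(nums), std::end(nums),
          [](float &f) { f = f * f + f; });
    }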

23 Agenda About me and Codeplay C++17 CPU Parallelism Third-party GPU Parallelism Why GPU? OpenCL and SYCL C++20/23 GPU Parallelism 23

24 CPU vs GPU. CPU: small number of powerful cores; each core can execute different code; suitable for general-purpose computing. GPU: large number of simple cores; groups of cores execute the same instruction on different data; suitable for carrying out one operation on a large amount of data. 24

25 CPU 25

26 GPU 26

27 GPU Programming: nested conditional branches are very expensive; no indirect calls (virtual calls, function pointers); the main bottleneck is data transfer from main memory. The golden rule of GPU programming: use enough of the throughput capabilities to hide data transfer latency. 27
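To make the branching point concrete, here is a purely illustrative SYCL sketch (SYCL itself is introduced on the next slide): neighbouring work items disagree on the branch, so a SIMD group on the GPU has to execute both sides of the if/else, masking off half its lanes each time.

    #include <CL/sycl.hpp>
    #include <vector>

    int main() {
      std::vector<float> data(1024, 1.0f);
      cl::sycl::queue q;
      {
        cl::sycl::buffer<float, 1> buf(data.data(),
                                       cl::sycl::range<1>(data.size()));
        q.submit([&](cl::sycl::handler &cgh) {
          auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
          cgh.parallel_for<class divergent>(
              cl::sycl::range<1>(data.size()), [=](cl::sycl::id<1> i) {
                // Adjacent work items take different branches: divergence.
                if (i[0] % 2 == 0) acc[i] = acc[i] * acc[i]; // even lanes
                else               acc[i] = acc[i] + acc[i]; // odd lanes
              });
        });
      }
    }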

28 SYCL for OpenCL: a cross-platform, single-source, high-level C++ programming layer, built on top of OpenCL and based on standard modern C++, delivering a heterogeneous programming solution for C++. 2016 Codeplay Software Ltd.

29 Heterogeneous Systems: CPU, GPU, Accelerator, APU, FPGA, DSP. 29 2016 Codeplay Software Ltd.

30 SYCL is Entirely Standard C++ (the same vector add in four programming models)

    // CUDA
    __global__ vec_add(float *a, float *b, float *c) {
      return c[i] = a[i] + b[i];
    }
    float *a, *b, *c;
    vec_add<<<range>>>(a, b, c);

    // OpenMP
    vector<float> a, b, c;
    #pragma parallel_for
    for (int i = 0; i < a.size(); i++) {
      c[i] = a[i] + b[i];
    }

    // C++ AMP
    array_view<float> a, b, c;
    extent<2> e(64, 64);
    parallel_for_each(e, [=](index<2> idx) restrict(amp) {
      c[idx] = a[idx] + b[idx];
    });

    // SYCL: no language extensions, just C++
    cgh.parallel_for<class vec_add>(range, [=](cl::sycl::id<2> idx) {
      c[idx] = a[idx] + b[idx];
    });

Codeplay Software Ltd.

31 SYCL Ecosystem (layer stack): Applications, C++ Template Libraries, SYCL for OpenCL, OpenCL, OpenCL-enabled Accelerators. Codeplay Software Ltd.

32 Codeplay Software Ltd.

33 OpenCL Vector Add

    #define __CL_ENABLE_EXCEPTIONS
    #include "cl.hpp"
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <iostream>
    #include <math.h>

    // OpenCL kernel. Each work item takes care of one element of c
    const char *kernelSource =
        "#pragma OPENCL EXTENSION cl_khr_fp64 : enable   \n"
        "__kernel void vecAdd(__global double *a,        \n"
        "                     __global double *b,        \n"
        "                     __global double *c,        \n"
        "                     const unsigned int n) {    \n"
        "  // Get our global thread ID                   \n"
        "  int id = get_global_id(0);                    \n"
        "  // Make sure we do not go out of bounds       \n"
        "  if (id < n)                                   \n"
        "    c[id] = a[id] + b[id];                      \n"
        "}                                               \n";

    int main(int argc, char *argv[]) {
      // Length of vectors
      unsigned int n = 1000;

      // Host input vectors
      double *h_a;
      double *h_b;
      // Host output vector
      double *h_c;

      // Device input buffers
      cl::Buffer d_a;
      cl::Buffer d_b;
      // Device output buffer
      cl::Buffer d_c;

      // Size, in bytes, of each vector
      size_t bytes = n * sizeof(double);

      // Allocate memory for each vector on host
      h_a = new double[n];
      h_b = new double[n];
      h_c = new double[n];

      // Initialize vectors on host
      for (int i = 0; i < n; i++) {
        h_a[i] = sinf(i) * sinf(i);
        h_b[i] = cosf(i) * cosf(i);
      }

      cl_int err = CL_SUCCESS;
      try {
        // Query platforms
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        if (platforms.size() == 0) {
          std::cout << "Platform size 0\n";
          return -1;
        }

        // Get list of devices on default platform and create context
        cl_context_properties properties[] = {
            CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0])(), 0};
        cl::Context context(CL_DEVICE_TYPE_GPU, properties);
        std::vector<cl::Device> devices =
            context.getInfo<CL_CONTEXT_DEVICES>();

        // Create command queue for first device
        cl::CommandQueue queue(context, devices[0], 0, &err);

        // Create device memory buffers
        d_a = cl::Buffer(context, CL_MEM_READ_ONLY, bytes);
        d_b = cl::Buffer(context, CL_MEM_READ_ONLY, bytes);
        d_c = cl::Buffer(context, CL_MEM_WRITE_ONLY, bytes);

        // Bind memory buffers
        queue.enqueueWriteBuffer(d_a, CL_TRUE, 0, bytes, h_a);
        queue.enqueueWriteBuffer(d_b, CL_TRUE, 0, bytes, h_b);

        // Build kernel from source string
        cl::Program::Sources source(
            1, std::make_pair(kernelSource, strlen(kernelSource)));
        cl::Program program_ = cl::Program(context, source);
        program_.build(devices);

Codeplay Software Ltd.

34 OpenCL Vector Add (continued)

        // Create kernel object
        cl::Kernel kernel(program_, "vecAdd", &err);

        // Bind kernel arguments to kernel
        kernel.setArg(0, d_a);
        kernel.setArg(1, d_b);
        kernel.setArg(2, d_c);
        kernel.setArg(3, n);

        // Number of work items in each local work group
        cl::NDRange localSize(64);
        // Number of total work items - localSize must be divisor
        cl::NDRange globalSize((int)(ceil(n / (float)64) * 64));

        // Enqueue kernel
        cl::Event event;
        queue.enqueueNDRangeKernel(kernel, cl::NullRange, globalSize,
                                   localSize, NULL, &event);

        // Block until kernel completion
        event.wait();

        // Read back d_c
        queue.enqueueReadBuffer(d_c, CL_TRUE, 0, bytes, h_c);
      } catch (cl::Error err) {
        std::cerr << "ERROR: " << err.what() << "(" << err.err() << ")"
                  << std::endl;
      }

      // Sum up vector c and print result divided by n;
      // this should equal 1 within error
      double sum = 0;
      for (int i = 0; i < n; i++)
        sum += h_c[i];
      std::cout << "final result: " << sum / n << std::endl;

      // Release host memory
      delete[] h_a;
      delete[] h_b;
      delete[] h_c;

      return 0;
    }

Codeplay Software Ltd.

35 Codeplay Software Ltd.

36 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T> class kernel;

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
      cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
      cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
      cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());

      cl::sycl::queue defaultQueue;

      defaultQueue.submit([&](cl::sycl::handler &cgh) {
        auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);

        cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
          [=](cl::sycl::id<1> idx) {
            outputPtr[idx] = inputAPtr[idx] + inputBPtr[idx];
          });
      });
    }

37 SYCL Vector Add 37

38 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
    }

39 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
      cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
      cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
      cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
    }

The buffers synchronise upon destruction
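That synchronisation-on-destruction rule gives a common RAII pattern: scope the buffer so the host data is guaranteed to be up to date after the closing brace. A minimal sketch, assuming a SYCL 1.2-style implementation such as ComputeCpp:

    #include <CL/sycl.hpp>
    #include <vector>

    int main() {
      std::vector<float> out(16, 0.0f);
      cl::sycl::queue q;
      {
        cl::sycl::buffer<float, 1> outBuf(out.data(),
                                          cl::sycl::range<1>(out.size()));
        q.submit([&](cl::sycl::handler &cgh) {
          auto ptr = outBuf.get_access<cl::sycl::access::mode::write>(cgh);
          cgh.parallel_for<class fill_ones>(
              cl::sycl::range<1>(out.size()),
              [=](cl::sycl::id<1> i) { ptr[i] = 1.0f; });
        });
      } // outBuf destroyed: waits for the kernel, copies results into out
      // out[i] == 1.0f for all i from here on
    }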

40 Device Selectors. Device selectors let you choose a device based on a custom configuration: the selector evaluates all devices within a system's topology and scores them, for example: is the device from vendor V? Is it a GPU? Does it support the double type? Does it have N execution units? [diagram: selector scores GPU, CPU and FPGA, and picks the selected device] Codeplay Software Ltd.
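In SYCL 1.2 this is done by deriving from cl::sycl::device_selector and returning a score for each device; the highest score wins. A minimal sketch along the lines of the criteria above:

    #include <CL/sycl.hpp>

    // Custom selector: prefer GPUs, and GPUs with double support even more.
    class my_selector : public cl::sycl::device_selector {
     public:
      int operator()(const cl::sycl::device &dev) const override {
        int score = 0;
        if (dev.is_gpu()) score += 100;
        if (dev.has_extension("cl_khr_fp64")) score += 10;
        return score; // highest score wins; negative means never select
      }
    };

    int main() {
      cl::sycl::queue q{my_selector{}}; // queue on the best-scoring device
    }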

41 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
      cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
      cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
      cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());

      cl::sycl::queue defaultQueue;
    }

42 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T> class kernel;

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
      cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
      cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
      cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());

      cl::sycl::queue defaultQueue;

      // Create a command group to define an asynchronous task
      defaultQueue.submit([&](cl::sycl::handler &cgh) {
      });
    }

43 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T> class kernel;

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
      cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
      cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
      cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());

      cl::sycl::queue defaultQueue;

      defaultQueue.submit([&](cl::sycl::handler &cgh) {
        auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);
      });
    }

44 Data Dependency Task Graphs [diagram, built up over slides 44-47: buffers A, B, C and D are connected to command groups CG A, CG B and CG C through read and write accessors; the accessors declared in each command group tell the runtime which data each task reads and writes, and the runtime builds the task graph from those dependencies] Codeplay Software Ltd.
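As a concrete sketch of how these graphs arise, the two command groups below are ordered by the runtime purely because the second declares a read-write accessor to the buffer the first writes; no explicit synchronisation is needed (the kernel names cgA and cgB are illustrative):

    #include <CL/sycl.hpp>
    #include <vector>

    int main() {
      std::vector<int> v(16, 1);
      cl::sycl::queue q;
      cl::sycl::buffer<int, 1> buf(v.data(), cl::sycl::range<1>(v.size()));

      q.submit([&](cl::sycl::handler &cgh) { // CG A: writes buf
        auto w = buf.get_access<cl::sycl::access::mode::write>(cgh);
        cgh.parallel_for<class cgA>(
            cl::sycl::range<1>(v.size()),
            [=](cl::sycl::id<1> i) { w[i] = static_cast<int>(i[0]); });
      });

      q.submit([&](cl::sycl::handler &cgh) { // CG B: reads what CG A wrote
        auto rw = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
        cgh.parallel_for<class cgB>(
            cl::sycl::range<1>(v.size()),
            [=](cl::sycl::id<1> i) { rw[i] = rw[i] * 2; });
      });
    }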

48 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T> class kernel;

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
      cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
      cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
      cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());

      cl::sycl::queue defaultQueue;

      defaultQueue.submit([&](cl::sycl::handler &cgh) {
        auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);

        // Create a parallel_for to define the device code
        cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
          [=](cl::sycl::id<1> idx) {
          });
      });
    }

49 SYCL Vector Add

    #include <CL/sycl.hpp>

    template <typename T> class kernel;

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out) {
      cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
      cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
      cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());

      cl::sycl::queue defaultQueue;

      defaultQueue.submit([&](cl::sycl::handler &cgh) {
        auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
        auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);

        cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
          [=](cl::sycl::id<1> idx) {
            outputPtr[idx] = inputAPtr[idx] + inputBPtr[idx];
          });
      });
    }

50 SYCL Vector Add

    template <typename T>
    void parallel_add(std::vector<T> inA, std::vector<T> inB,
                      std::vector<T> &out);

    int main() {
      std::vector<float> inputA = { /* input a */ };
      std::vector<float> inputB = { /* input b */ };
      std::vector<float> output = { /* output */ };

      parallel_add(inputA, inputB, output);
      // ...
    }
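Note that output must already hold as many elements as the result needs: parallel_add sizes all three buffers from out.size(), so an empty output vector would create empty buffers.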

51 C++ / SYCL Compilation Model [diagram, built up over slides 51-58: in the standard C++ compilation model, a C++ source file goes through the CPU compiler to a CPU object, and the linker produces CPU ISA that runs on the CPU; there is no path to the GPU. SYCL adds one: the device source below is extracted from the same C++ source file and compiled by the SYCL compiler to SPIR, which the linker embeds alongside the CPU ISA, so a single binary drives both the CPU and the GPU]

    auto inA = bufA.get_access<access::mode::read>(cgh);
    auto inB = bufB.get_access<access::mode::read>(cgh);
    auto out = bufO.get_access<access::mode::write>(cgh);

    cgh.parallel_for<class add>(range<2>(64, 64),
      [=](id<2> i) { out[i] = inA[i] + inB[i]; });

Codeplay Software Ltd.

59 Demo Results: running std::sort on an Intel CPU with Intel HD Graphics 520. [Table: execution times in seconds for std::seq, std::par, std::unseq and sycl_execution_policy at input sizes 2^16, 2^17, 2^18 and 2^19; the timings themselves were not captured.] 59

60 Community Edition Available now for free! Visit: computecpp.codeplay.com Codeplay Software Ltd.

61 Agenda About me and Codeplay C++17 CPU Parallelism Third-party GPU Parallelism C++20/23 GPU Parallelism Executors Ranges 61

62 Executors: a unified interface for execution. Above the interface sit the things that create work: invoke, async, the parallel algorithms, define_task_block, future::then, and asynchronous operations such as dispatch, post, defer and strand<> from Boost.Asio / the Networking TS. Below it sit the places work can run: SYCL / OpenCL / CUDA / HCC, OpenMP / MPI, and C++ thread pools. 62

63 Executors static_thread_pool pool(4); auto exec = pool.executor(); exec.execute(func); 63

64 Executors static_thread_pool pool(4); auto exec = pool.executor(); auto future = std::async(exec, func); 64

65 Executors static_thread_pool pool(4); auto exec = pool.executor(); std::for_each(std::execution::par.on(exec), begin, end, func); 65

66 Executors auto exec = get_gpu_executor(); std::for_each(std::execution::par.on(exec), begin, end, func); 66

An executor's cardinality reflects whether an execution launches a single thread of execution or multiple threads of execution: single cardinality or bulk cardinality. 67

An executor's directionality reflects whether an execution provides a channel by which to synchronise or return a result or exception: one-way or two-way directionality. 68

An executor's blocking guarantee reflects whether an execution blocks the caller's thread until execution is complete: possibly-blocking, always-blocking, or never-blocking. 69

70 The combinations give four execution functions:

    Single + one-way: execute()           Single + two-way: twoway_execute()
    Bulk + one-way:   bulk_execute()      Bulk + two-way:   bulk_twoway_execute()
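To make the single-cardinality, one-way shape concrete, here is a toy executor that runs work inline (purely illustrative; the names in P0443 changed between revisions):

    #include <utility>

    // A toy one-way, single-cardinality, always-blocking executor.
    struct inline_executor {
      template <typename F>
      void execute(F &&f) const { std::forward<F>(f)(); }
    };

    int main() {
      inline_executor ex;
      ex.execute([] { /* work runs inline on the caller's thread */ });
    }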

71 Executor Requirements oneway_executor exec; auto newexec = require(exec, never_blocking); auto fut = newexec.twoway_execute([&]() { return func(); }); 71

72 Executor Requirements possibly_blocking_executor exec; auto newexec = prefer(exec, never_blocking); newexec.execute([&]() { func(); }); 72

73 Executor Requirements possibly_blocking_executor exec; auto newexec = prefer(exec, never_blocking); auto isneverblocking = query(newexec, never_blocking); 73

74 Ranges Range View Action 74

75 Ranges Range = pair of iterators View = lazy adaptation Action = eager mutation 75

76 Ranges: Views and Actions

    using namespace ranges;
    int sum = accumulate(view::ints(1)
                           | view::transform([](int i) { return i * i; })
                           | view::take(10),
                         0);

    extern std::vector<int> read_data();
    using namespace ranges;
    std::vector<int> vi = read_data()
                        | action::sort
                        | action::unique;
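As a complete program, the views example above might look like this (assuming range-v3 of that era, where the namespace is view:: rather than today's views::):

    #include <iostream>
    #include <range/v3/all.hpp>

    int main() {
      using namespace ranges;
      // Lazy: squares are only computed as accumulate pulls them through.
      int sum = accumulate(view::ints(1)
                             | view::transform([](int i) { return i * i; })
                             | view::take(10),
                           0);
      std::cout << sum << '\n'; // 1*1 + 2*2 + ... + 10*10 = 385
    }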


78 Parallel Ranges on GPU

    auto exec = get_gpu_executor();
    auto indices = ranges::view::iota(0)
                 | ranges::view::take(width * height);
    std::experimental::transform(exec, indices, gpu_image,
                                 CalculatePixel{height, width, iterations});

79 Parallel Ranges on GPU: the main GPU bottleneck is data transfer, and lazy views let you avoid it. [Diagram: executing f1, f2 and f3 eagerly transfers data between CPU and GPU for each function; with lazy views the fused f1+f2+f3 runs as a single GPU step with one transfer each way] 79
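A small CPU-only sketch of the same fusion idea with range-v3 (illustrative; a GPU backend would run the fused pipeline as one kernel):

    #include <range/v3/all.hpp>
    #include <vector>

    int main() {
      using namespace ranges;
      std::vector<float> data(1000, 2.0f);
      auto f1 = [](float x) { return x + 1.0f; };
      auto f2 = [](float x) { return x * x; };
      auto f3 = [](float x) { return x - 3.0f; };

      // One fused traversal: each element flows through f1, f2 and f3 in
      // turn, so a GPU backend needs one kernel and one transfer each way.
      auto fused = data | view::transform(f1)
                        | view::transform(f2)
                        | view::transform(f3);

      float sum = 0.0f;
      for (float x : fused) sum += x;
    }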

80 Example with Views Intel i7-6700k CPU and Intel HD Graphics 530 GPU Speedups calculated from median execution times of 100 runs per experiment ComputeCpp CE v0.5 Prototype benchmarking branch Codeplay Software Ltd. - CONFIDENTIAL

81 What if there is no predefined function?

    gstorm::sycl_exec exec;
    auto gpu_image = std::experimental::copy(exec, image);
    auto indices = ranges::view::iota(0)
                 | ranges::view::take(width * height);
    std::experimental::transform(exec, indices, gpu_image,
                                 CalculatePixel{height, width, iterations});

Codeplay Software Ltd. - CONFIDENTIAL

82 What if there is no predefined function? [Chart: Mandelbrot, Intel PSTL vs SYCL Ranges] Codeplay Software Ltd. - CONFIDENTIAL

83 Conclusions C++17 makes data parallel programming easier SYCL enables heterogeneous programming in standard C++ Executors will provide a unified model of execution Parallel ranges will make C++ GPU programming even better 83

84 We're hiring! codeplay.com/careers/ simon@codeplay.com info@codeplay.com blog.tartanllama.xyz codeplay.com
