Modern C++ Parallelism from CPU to GPU
- Christiana Horton
- 6 years ago
Transcription
1 Modern C++ Parallelism from CPU to GPU Simon Senior Software Engineer, GPGPU Toolchains, Codeplay C++ Russia
2 Agenda About me and Codeplay C++17 CPU Parallelism Third-party GPU Parallelism C++20/23 GPU Parallelism 2
4 Codeplay Based in Edinburgh, Scotland Compilers, debuggers, profilers for heterogeneous systems LLVM all the things Standardisation C++, OpenCL, Vulkan, HSA, SYCL C++ on GPU! 4
5 Me! C++ professional and enthusiast Working on compiler backends blog.tartanllama.xyz Interested in metaprogramming, functional programming, modern C++ 5
6 Agenda About me and Codeplay C++17 CPU Parallelism Parallel algorithms Third-party GPU Parallelism C++20/23 GPU Parallelism 6
7 Why parallelism? 7
8 Task parallelism Data parallelism Executing different tasks at the same time Executing one task on many pieces of data at the same time E.g. one thread for GUI, one thread for data processing, one thread for network communication E.g. change the colour of every pixel, multiply matrices, add two vectors 8
9 Sorting with the STL std::vector<int> data = { 8, 9, 1, 4 }; Normal sequential sort algorithm std::sort(std::begin(data), std::end(data)); std::vector<int> data = { 8, 9, 1, 4 }; Extra parameter to STL algorithms enables parallelism std::sort(std::execution::par, std::begin(data), std::end(data)); 9
10 Using execution policies using namespace std::execution; // May execute in parallel std::sort(par, std::begin(data), std::end(data)); // May be parallelized and vectorized std::sort(par_unseq, std::begin(data), std::end(data)); // Will not be parallelized/vectorized std::sort(seq, std::begin(data), std::end(data)); // Vendor-specific policy, read their documentation! std::sort(custom_vendor_policy, std::begin(data), std::end(data)); 10
11 Parallel overloads available 11
12 Many different existing (partial) implementations Available today Microsoft: / Intel: HPX: HSA: Thibaut Lutz: NVIDIA: execution policies.html Codeplay: 12
13 New algorithms into the STL (Serial Reduction pattern) std::vector<int> is {0,1,2,3,4,5,6}; std::accumulate(begin(is), end(is), 0); 21 13
14 New algorithms into the STL (Parallel Reduction Pattern) std::vector<int> is {0,1,2,3,4,5,6}; std::reduce(std::execution::par, begin(is), end(is), 0); 21 14
15 New algorithms into the STL (Serial Reduction pattern) std::vector<int> is {32,16,8,4,2,1}; std::accumulate(begin(is), end(is), 64, std::minus<>{}); 1 15
16 New algorithms into the STL (Parallel Reduction Pattern) std::vector<int> is {32,16,8,4,2,1}; std::reduce(std::execution::par, begin(is), end(is), 64, std::minus<>{}); 23 16
17 std::reduce Requires associativity: for some operator *, (a * b) * c = a * (b * c) Requires commutativity: for some operator *, a * b = b * a 17
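To make the associativity requirement concrete, here is a minimal sketch in plain C++ that regroups the same fold by hand. The function names `fold_in_order` and `fold_in_two_chunks` are made up for this illustration, and the two-chunk grouping is only one of many groupings a parallel reduction could pick; it is not a description of how any particular std::reduce implementation works.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Strict left-to-right fold: this is the order std::accumulate guarantees.
template <typename Op>
int fold_in_order(const std::vector<int>& v, int init, Op op) {
    return std::accumulate(v.begin(), v.end(), init, op);
}

// Hypothetical regrouping: two "workers" each fold one half of the input,
// then the partial results are combined with the initial value.
// This is an illustration of regrouping, not a real parallel implementation.
template <typename Op>
int fold_in_two_chunks(const std::vector<int>& v, int init, Op op) {
    assert(v.size() >= 2 && "need at least one element per chunk");
    const auto mid = v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2);
    // Each half is seeded with its own first element.
    const int left = std::accumulate(v.begin() + 1, mid, v.front(), op);
    const int right = std::accumulate(mid + 1, v.end(), *mid, op);
    return op(op(init, left), right);
}
```

With `{32, 16, 8, 4, 2, 1}` and an initial value of 64, `std::plus<>` gives the same answer under both groupings, while `std::minus<>` gives 1 in order and a different value when regrouped. Which value a real parallel reduction returns for a non-associative operator is unspecified.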
18 What can I do with a Parallel For Each? size_t nelems = 1000u; std::vector<float> nums(nelems); std::fill_n(std::begin(nums), nelems, 1); std::for_each(std::begin(nums), std::end(nums), [=](float &f) { f = f * f + f; }); Intel Core i7 7th generation 18
19 What can I do with a Parallel For Each? 2500 elems 2500 elems 2500 elems 2500 elems size_t nelems = 1000u; std::vector<float> nums(nelems); std::fill_n(std::execution::par, std::begin(nums), nelems, 1); std::for_each(std::execution::par, std::begin(nums), std::end(nums), [=](float &f) { f = f * f + f; }); Intel Core i7 7th generation 19
20 What can I do with a Parallel For Each? 2500 elems 2500 elems 2500 elems 2500 elems size_t nelems = 1000u; std::vector<float> nums(nelems); std::fill_n(std::execution::par, std::begin(nums), nelems, 1); std::for_each(std::execution::par, std::begin(nums), std::end(nums), [=](float &f) { f = f * f + f; }); What about this part? Intel Core i7 7th generation 20
21 What can I do with a Parallel For Each? size_t nelems = 1000u; std::vector<float> nums(nelems); std::fill_n(sycl_policy, std::begin(nums), nelems, 1); std::for_each(sycl_named_policy<class KernelName>, std::begin(nums), std::end(nums), [=](float &f) { f = f * f + f; }); Intel Core i7 7th generation 21
22 What can I do with a Parallel For Each? 1250 elems 1250 elems 1250 elems 1250 elems 5000 elems size_t nelems = 1000u; std::vector<float> nums(nelems); std::fill_n(sycl_heter_policy(cpu, gpu, 0.5), std::begin(nums), nelems, 1); std::for_each(sycl_heter_policy<class kname>(cpu, gpu, 0.5), std::begin(nums), std::end(nums), [=](float &f) { f = f * f + f; }); Intel Core i7 7th generation Experimental! 22
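The chunk sizes drawn on these slides (four chunks of 2500, then four of 1250 plus one of 5000) can be reproduced with a small partitioning sketch. Everything here is hypothetical: `Chunk`, `split_evenly` and `split_cpu_gpu` are names invented for the illustration, and reading the 0.5 argument as "the GPU's share of the elements" is an assumption about the experimental heterogeneous policy, not documented behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open index range [begin, end) assigned to one worker.
struct Chunk {
    std::size_t begin;
    std::size_t end;
    std::size_t size() const { return end - begin; }
};

// Split [first, last) into `parts` contiguous chunks whose sizes differ by at most one.
std::vector<Chunk> split_evenly(std::size_t first, std::size_t last, std::size_t parts) {
    assert(parts > 0 && first <= last);
    std::vector<Chunk> chunks;
    const std::size_t n = last - first;
    std::size_t begin = first;
    for (std::size_t i = 0; i < parts; ++i) {
        const std::size_t len = n / parts + (i < n % parts ? 1 : 0);
        chunks.push_back({begin, begin + len});
        begin += len;
    }
    return chunks;
}

// Hypothetical heterogeneous plan: one chunk for the GPU, the rest shared by CPU threads.
struct HeterogeneousPlan {
    Chunk gpu;
    std::vector<Chunk> cpu;
};

// Assumption: gpu_ratio is the fraction of elements handed to the GPU.
HeterogeneousPlan split_cpu_gpu(std::size_t n, double gpu_ratio, std::size_t cpu_threads) {
    assert(gpu_ratio >= 0.0 && gpu_ratio <= 1.0);
    const std::size_t gpu_len =
        static_cast<std::size_t>(static_cast<double>(n) * gpu_ratio);
    return {Chunk{0, gpu_len}, split_evenly(gpu_len, n, cpu_threads)};
}
```

For 10000 elements, `split_evenly(0, 10000, 4)` yields four chunks of 2500, and `split_cpu_gpu(10000, 0.5, 4)` yields one GPU chunk of 5000 plus four CPU chunks of 1250.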
23 Agenda About me and Codeplay C++17 CPU Parallelism Third-party GPU Parallelism Why GPU? OpenCL and SYCL C++20/23 GPU Parallelism 23
24 CPU GPU Small number of powerful cores Large number of simple cores Each core can execute different code Groups of cores execute the same instruction on different data Suitable for general-purpose computing Suitable for carrying out one operation on a large amount of data 24
25 CPU 25
26 GPU 26
27 GPU Programming Nested conditional branches very expensive No indirect calls (virtual calls, function pointers) Main bottleneck is data transfer from main memory The golden rule of GPU: Use enough of the throughput capabilities to hide data transfer latency 27
28 SYCL for OpenCL Cross-platform, single-source, high-level, C++ programming layer Built on top of OpenCL and based on standard modern C++ Delivering a heterogeneous programming solution to C++ Codeplay Software Ltd.
29 Heterogeneous Systems CPU GPU Accelerator APU 29 FPGA DSP 2016 Codeplay Software Ltd.
30 SYCL is Entirely Standard C++ __global__ void vec_add(float *a, float *b, float *c) { c[i] = a[i] + b[i]; } float *a, *b, *c; vec_add<<<range>>>(a, b, c); vector<float> a, b, c; #pragma parallel_for for(int i = 0; i < a.size(); i++) { c[i] = a[i] + b[i]; } array_view<float> a, b, c; extent<2> e(64, 64); parallel_for_each(e, [=](index<2> idx) restrict(amp) { c[idx] = a[idx] + b[idx]; }); cgh.parallel_for<class vec_add>(range, [=](cl::sycl::id<2> idx) { c[idx] = a[idx] + b[idx]; }); Codeplay Software Ltd.
31 SYCL Ecosystem Applications C++ Template Libraries SYCL for OpenCL OpenCL OpenCL-enabled Accelerators Codeplay Software Ltd.
33 OpenCL Vector Add #define __CL_ENABLE_EXCEPTIONS #include "cl.hpp" #include <cstdio> #include <cstdlib> #include <cstring> #include <iostream> #include <math.h> // OpenCL kernel. Each work item takes care of one element of c const char *kernelsource = "\n" \ "#pragma OPENCL EXTENSION cl_khr_fp64 : enable \n" \ "__kernel void vecadd( __global double *a, \n" \ " __global double *b, \n" \ " __global double *c, \n" \ " const unsigned int n) \n" \ "{ \n" \ " //Get our global thread ID \n" \ " int id = get_global_id(0); \n" \ " \n" \ " //Make sure we do not go out of bounds \n" \ " if (id < n) \n" \ " c[id] = a[id] + b[id]; \n" \ "} \n" \ "\n" ; int main(int argc, char *argv[]) { // Length of vectors unsigned int n = 1000; // Host input vectors double *h_a; double *h_b; // Host output vector double *h_c; // Device input buffers cl::Buffer d_a; cl::Buffer d_b; // Device output buffer cl::Buffer d_c; // Size, in bytes, of each vector size_t bytes = n*sizeof(double); // Allocate memory for each vector on host h_a = new double[n]; h_b = new double[n]; h_c = new double[n]; // Initialize vectors on host for(int i = 0; i < n; i++ ) { h_a[i] = sinf(i)*sinf(i); h_b[i] = cosf(i)*cosf(i); } cl_int err = CL_SUCCESS; try { // Query platforms std::vector<cl::Platform> platforms; cl::Platform::get(&platforms); if (platforms.size() == 0) { std::cout << "Platform size 0\n"; return -1; } // Get list of devices on default platform and create context cl_context_properties properties[] = { CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0])(), 0}; cl::Context context(CL_DEVICE_TYPE_GPU, properties); std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>(); // Create command queue for first device cl::CommandQueue queue(context, devices[0], 0, &err); // Create device memory buffers d_a = cl::Buffer(context, CL_MEM_READ_ONLY, bytes); d_b = cl::Buffer(context, CL_MEM_READ_ONLY, bytes); d_c = cl::Buffer(context, CL_MEM_WRITE_ONLY, bytes); // Bind memory buffers queue.enqueueWriteBuffer(d_a, CL_TRUE, 0, bytes, h_a); queue.enqueueWriteBuffer(d_b, CL_TRUE, 0, bytes, h_b); //Build kernel from source string cl::Program::Sources source(1, std::make_pair(kernelsource,strlen(kernelsource))); cl::Program program_ = cl::Program(context, source); program_.build(devices); Codeplay Software Ltd.
34 OpenCL Vector Add // Create kernel object cl::Kernel kernel(program_, "vecadd", &err); // Bind kernel arguments to kernel kernel.setArg(0, d_a); kernel.setArg(1, d_b); kernel.setArg(2, d_c); kernel.setArg(3, n); // Number of work items in each local work group cl::NDRange localsize(64); // Number of total work items - localsize must be divisor cl::NDRange globalsize((int)(ceil(n/(float)64)*64)); // Enqueue kernel cl::Event event; queue.enqueueNDRangeKernel( kernel, cl::NullRange, globalsize, localsize, NULL, &event); // Block until kernel completion event.wait(); // Read back d_c queue.enqueueReadBuffer(d_c, CL_TRUE, 0, bytes, h_c); } catch (cl::Error err) { std::cerr << "ERROR: "<<err.what()<<"("<<err.err()<<")"<<std::endl; } // Sum up vector c and print result divided by n, this should equal 1 within error double sum = 0; for(int i=0; i<n; i++) sum += h_c[i]; std::cout<<"final result: "<<sum/n<<std::endl; // Release host memory delete[] h_a; delete[] h_b; delete[] h_c; return 0; } Codeplay Software Ltd.
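The global size in the OpenCL example is rounded up to a multiple of the local size, which is why the kernel needs its `if (id < n)` guard. A minimal sketch of that arithmetic using integers instead of `ceil` on floats follows; the helper names are invented for this note and are not part of the OpenCL API.

```cpp
#include <cassert>
#include <cstddef>

// Smallest multiple of local_size that is >= n.
// Precondition: local_size > 0.
constexpr std::size_t round_up_global_size(std::size_t n, std::size_t local_size) {
    return ((n + local_size - 1) / local_size) * local_size;
}

// Number of padding work items that must be masked out by the bounds check.
constexpr std::size_t idle_work_items(std::size_t n, std::size_t local_size) {
    return round_up_global_size(n, local_size) - n;
}

// The values used on the slide: n = 1000 elements, work groups of 64.
static_assert(round_up_global_size(1000, 64) == 1024, "16 work groups of 64");
static_assert(idle_work_items(1000, 64) == 24, "24 work items fail the bounds check");
```

With n = 1000 and a local size of 64, 1024 work items are launched and the last 24 do nothing.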
36 SYCL Vector Add #include <CL/sycl.hpp> template <typename T> class kernel; template <typename T> void parallel_add(std::vector<T> ina, std::vector<T> inb, std::vector<T> &out) { cl::sycl::buffer<T, 1> inputabuf(ina.data(), out.size()); cl::sycl::buffer<T, 1> inputbbuf(inb.data(), out.size()); cl::sycl::buffer<T, 1> outputbuf(out.data(), out.size()); cl::sycl::queue defaultqueue; defaultqueue.submit([&] (cl::sycl::handler &cgh) { auto inputaptr = inputabuf.get_access<cl::sycl::access::mode::read>(cgh); auto inputbptr = inputbbuf.get_access<cl::sycl::access::mode::read>(cgh); auto outputptr = outputbuf.get_access<cl::sycl::access::mode::write>(cgh); cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()), [=](cl::sycl::id<1> idx) { outputptr[idx] = inputaptr[idx] + inputbptr[idx]; }); }); } 36
37 SYCL Vector Add 37
38 SYCL Vector Add #include <CL/sycl.hpp> template <typename T> void parallel_add(std::vector<T> ina, std::vector<T> inb, std::vector<T> &out) { } 38
39 SYCL Vector Add #include <CL/sycl.hpp> template <typename T> void parallel_add(std::vector<T> ina, std::vector<T> inb, std::vector<T> &out) { cl::sycl::buffer<T, 1> inputabuf(ina.data(), out.size()); cl::sycl::buffer<T, 1> inputbbuf(inb.data(), out.size()); cl::sycl::buffer<T, 1> outputbuf(out.data(), out.size()); } The buffers synchronise upon destruction 39
40 Device Selectors Device selectors allow you to choose a device based on a custom configuration Evaluates all devices within a system's topology and scores them For example: If device is from vendor V If device is a GPU If device supports double type If device has N execution units GPU CPU FPGA Selected Device Codeplay Software Ltd.
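The scoring idea can be sketched without any SYCL headers. The types and weights below (`DeviceInfo`, `DeviceKind`, `score`, `select_device`, the vendor name "V") are hypothetical stand-ins chosen for this illustration and are not the SYCL device_selector interface; the one convention borrowed from SYCL is that a negative score means the device is never chosen.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class DeviceKind { cpu, gpu, fpga };

// Hypothetical description of a device; a real selector would query the runtime.
struct DeviceInfo {
    std::string vendor;
    DeviceKind kind;
    bool supports_double;
    unsigned execution_units;
};

// Score one device. A negative score rejects the device outright.
// The weights are arbitrary and only illustrate the criteria from the slide.
int score(const DeviceInfo& d) {
    if (!d.supports_double) return -1;            // hard requirement
    int s = 0;
    if (d.kind == DeviceKind::gpu) s += 100;      // prefer GPUs
    if (d.vendor == "V") s += 50;                 // prefer vendor V
    s += static_cast<int>(d.execution_units);     // more units is better
    return s;
}

// Index of the highest-scoring acceptable device, or devices.size() if none qualifies.
std::size_t select_device(const std::vector<DeviceInfo>& devices) {
    std::size_t best = devices.size();
    int best_score = -1;
    for (std::size_t i = 0; i < devices.size(); ++i) {
        const int s = score(devices[i]);
        if (s > best_score) {
            best_score = s;
            best = i;
        }
    }
    return best;
}
```

Every device in the system is scored and the maximum wins, so preferences such as "a GPU if available, otherwise anything with double support" fall out of the weights rather than out of branching in the caller.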
41 SYCL Vector Add #include <CL/sycl.hpp> template <typename T> void parallel_add(std::vector<T> ina, std::vector<T> inb, std::vector<T> &out) { cl::sycl::buffer<T, 1> inputabuf(ina.data(), out.size()); cl::sycl::buffer<T, 1> inputbbuf(inb.data(), out.size()); cl::sycl::buffer<T, 1> outputbuf(out.data(), out.size()); cl::sycl::queue defaultqueue; } 41
42 SYCL Vector Add #include <CL/sycl.hpp> template <typename T> class kernel; template <typename T> void parallel_add(std::vector<T> ina, std::vector<T> inb, std::vector<T> &out) { cl::sycl::buffer<T, 1> inputabuf(ina.data(), out.size()); cl::sycl::buffer<T, 1> inputbbuf(inb.data(), out.size()); cl::sycl::buffer<T, 1> outputbuf(out.data(), out.size()); cl::sycl::queue defaultqueue; defaultqueue.submit([&] (cl::sycl::handler &cgh) { }); } Create a command group to define an asynchronous task 42
43 SYCL Vector Add #include <CL/sycl.hpp> template <typename T> class kernel; template <typename T> void parallel_add(std::vector<T> ina, std::vector<T> inb, std::vector<T> &out) { cl::sycl::buffer<T, 1> inputabuf(ina.data(), out.size()); cl::sycl::buffer<T, 1> inputbbuf(inb.data(), out.size()); cl::sycl::buffer<T, 1> outputbuf(out.data(), out.size()); cl::sycl::queue defaultqueue; defaultqueue.submit([&] (cl::sycl::handler &cgh) { auto inputaptr = inputabuf.get_access<cl::sycl::access::mode::read>(cgh); auto inputbptr = inputbbuf.get_access<cl::sycl::access::mode::read>(cgh); auto outputptr = outputbuf.get_access<cl::sycl::access::mode::write>(cgh); }); } 43
44 Data Dependency Task Graphs Buffer A Read Accessor CG A Write Accessor Buffer B Read Accessor CG B Write Accessor Buffer C Read Accessor Buffer D Read Accessor CG C Write Accessor Codeplay Software Ltd.
47 Data Dependency Task Graphs Buffer A Read Accessor CG A Write Accessor Buffer B Read Accessor CG A CG B CG B Write Accessor Buffer C Read Accessor Buffer D Read Accessor CG C CG C Write Accessor Codeplay Software Ltd.
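How a runtime could derive the task graph from accessors can be sketched in plain C++. The `CommandGroup` struct and `dependencies` function are hypothetical names for this note, and the rule used (a later command group waits for an earlier one when they touch the same buffer and at least one of them writes it) is a simplified model, not a description of any particular SYCL implementation's scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// A command group described only by the buffers it reads and writes.
struct CommandGroup {
    std::string name;
    std::set<std::string> reads;
    std::set<std::string> writes;
};

// True if the two sets share at least one buffer name.
static bool intersects(const std::set<std::string>& a, const std::set<std::string>& b) {
    for (const auto& x : a)
        if (b.count(x)) return true;
    return false;
}

// Returns edges (i, j) with i < j, meaning command group j must wait for i.
// Command groups are given in submission order.
std::vector<std::pair<std::size_t, std::size_t>>
dependencies(const std::vector<CommandGroup>& submitted) {
    std::vector<std::pair<std::size_t, std::size_t>> edges;
    for (std::size_t j = 0; j < submitted.size(); ++j) {
        for (std::size_t i = 0; i < j; ++i) {
            const auto& earlier = submitted[i];
            const auto& later = submitted[j];
            const bool read_after_write = intersects(earlier.writes, later.reads);
            const bool write_after_read = intersects(earlier.reads, later.writes);
            const bool write_after_write = intersects(earlier.writes, later.writes);
            if (read_after_write || write_after_read || write_after_write)
                edges.emplace_back(i, j);
        }
    }
    return edges;
}
```

As an assumed layout (the slide's arrows did not survive transcription): if CG A reads buffer A and writes B, CG B reads A and writes C, and CG C reads B and C and writes D, then A and B share only a read of A and can run concurrently, while C has to wait for both.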
48 SYCL Vector Add #include <CL/sycl.hpp> template <typename T> class kernel; template <typename T> void parallel_add(std::vector<T> ina, std::vector<T> inb, std::vector<T> &out) { cl::sycl::buffer<T, 1> inputabuf(ina.data(), out.size()); cl::sycl::buffer<T, 1> inputbbuf(inb.data(), out.size()); cl::sycl::buffer<T, 1> outputbuf(out.data(), out.size()); cl::sycl::queue defaultqueue; defaultqueue.submit([&] (cl::sycl::handler &cgh) { auto inputaptr = inputabuf.get_access<cl::sycl::access::mode::read>(cgh); auto inputbptr = inputbbuf.get_access<cl::sycl::access::mode::read>(cgh); auto outputptr = outputbuf.get_access<cl::sycl::access::mode::write>(cgh); cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()), [=](cl::sycl::id<1> idx) { }); }); } Create a parallel_for to define the device code 48
49 SYCL Vector Add #include <CL/sycl.hpp> template <typename T> class kernel; template <typename T> void parallel_add(std::vector<T> ina, std::vector<T> inb, std::vector<T> &out) { cl::sycl::buffer<T, 1> inputabuf(ina.data(), out.size()); cl::sycl::buffer<T, 1> inputbbuf(inb.data(), out.size()); cl::sycl::buffer<T, 1> outputbuf(out.data(), out.size()); cl::sycl::queue defaultqueue; defaultqueue.submit([&] (cl::sycl::handler &cgh) { auto inputaptr = inputabuf.get_access<cl::sycl::access::mode::read>(cgh); auto inputbptr = inputbbuf.get_access<cl::sycl::access::mode::read>(cgh); auto outputptr = outputbuf.get_access<cl::sycl::access::mode::write>(cgh); cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()), [=](cl::sycl::id<1> idx) { outputptr[idx] = inputaptr[idx] + inputbptr[idx]; }); }); } 49
50 SYCL Vector Add template <typename T> void parallel_add(std::vector<T> ina, std::vector<T> inb, std::vector<T> &out); int main() { std::vector<float> inputa = { /* input a */ }; std::vector<float> inputb = { /* input b */ }; std::vector<float> output = { /* output */ }; parallel_add(inputa, inputb, output); //... } 50
51 C++ Compilation Model C++ Source File CPU Compiler CPU Object Linker 51 CPU ISA CPU 2016 Codeplay Software Ltd.
53 C++ Compilation Model C++ Source File CPU Compiler CPU Object Linker CPU ISA CPU? GPU Codeplay Software Ltd.
54 SYCL Compilation Model C++ Source File CPU Compiler CPU Object Linker CPU ISA CPU GPU auto ina = bufa.get_access<access::mode::read>(cgh); auto inb = bufb.get_access<access::mode::read>(cgh); auto out = bufo.get_access<access::mode::write>(cgh); cgh.parallel_for<class add>(range<2>(64, 64), [=](id<2> i) { out[i] = ina[i] + inb[i]; }); Codeplay Software Ltd.
55 SYCL Compilation Model C++ Source File CPU Compiler CPU Object Linker Device Source CPU ISA CPU GPU auto ina = bufa.get_access<access::mode::read>(cgh); auto inb = bufb.get_access<access::mode::read>(cgh); auto out = bufo.get_access<access::mode::write>(cgh); cgh.parallel_for<class add>(range<2>(64, 64), [=](id<2> i) { out[i] = ina[i] + inb[i]; }); Codeplay Software Ltd.
56 SYCL Compilation Model C++ Source File Device Source CPU Compiler CPU Object SYCL Compiler SPIR Linker CPU ISA CPU GPU auto ina = bufa.get_access<access::mode::read>(cgh); auto inb = bufb.get_access<access::mode::read>(cgh); auto out = bufo.get_access<access::mode::write>(cgh); cgh.parallel_for<class add>(range<2>(64, 64), [=](id<2> i) { out[i] = ina[i] + inb[i]; }); Codeplay Software Ltd.
57 SYCL Compilation Model C++ Source File Device Source CPU Compiler CPU Object CPU ISA CPU Linker SYCL Compiler SPIR GPU auto ina = bufa.get_access<access::mode::read>(cgh); auto inb = bufb.get_access<access::mode::read>(cgh); auto out = bufo.get_access<access::mode::write>(cgh); cgh.parallel_for<class add>(range<2>(64, 64), [=](id<2> i) { out[i] = ina[i] + inb[i]; }); Codeplay Software Ltd.
58 SYCL Compilation Model C++ Source File Device Source CPU Compiler CPU Object CPU Linker SYCL Compiler SPIR CPU ISA (plus SPIR) GPU auto ina = bufa.get_access<access::mode::read>(cgh); auto inb = bufb.get_access<access::mode::read>(cgh); auto out = bufo.get_access<access::mode::write>(cgh); cgh.parallel_for<class add>(range<2>(64, 64), [=](id<2> i) { out[i] = ina[i] + inb[i]; }); Codeplay Software Ltd.
59 Demo Results - Running std::sort (Running on Intel i CPU & Intel HD Graphics 520) size 2^16 2^17 2^18 2^19 std::seq s s s s std::par s s s s std::unseq s s s s sycl_execution_policy s s s s 59
60 Community Edition Available now for free! Visit: computecpp.codeplay.com Codeplay Software Ltd.
61 Agenda About me and Codeplay C++17 CPU Parallelism Third-party GPU Parallelism C++20/23 GPU Parallelism Executors Ranges 61
62 invoke defer parallel algorithms async define_task_block dispatch future::then asynchronous operations post strand<> Unified interface for execution SYCL / OpenCL / CUDA / HCC OpenMP / MPI C++ Thread Pool 62 Boost.Asio / Networking TS
63 Executors static_thread_pool pool(4); auto exec = pool.executor(); exec.execute(func); 63
64 Executors static_thread_pool pool(4); auto exec = pool.executor(); auto future = std::async(exec, func); 64
65 Executors static_thread_pool pool(4); auto exec = pool.executor(); std::for_each(std::execution::par.on(exec), begin, end, func); 65
66 Executors auto exec = get_gpu_executor(); std::for_each(std::execution::par.on(exec), begin, end, func); 66
67 An executor's cardinality reflects whether an execution launches a single thread of execution or multiple threads of execution Single cardinality Bulk cardinality Cardinality 67
68 An executor's directionality reflects whether an execution does or does not provide a channel by which to synchronise or return a result or exception One-way directionality Two-way directionality Cardinality Directionality 68
69 An executor's blocking guarantee reflects whether an execution blocks or does not block the caller thread until execution is complete Possibly-blocking guarantee Always-blocking guarantee Never-blocking guarantee Cardinality Blocking Guarantee Directionality 69
70          One-way          Two-way
   Single   execute()        twoway_execute()
   Bulk     bulk_execute()   bulk_twoway_execute()
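The cardinality and directionality axes can be illustrated with a toy executor that runs everything inline on the calling thread. This is a deliberately simplified sketch: `inline_executor` is a name invented here, its member signatures do not match the executors proposal exactly, and the two-way function returns the result directly where a real two-way executor would hand back a future.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy executor: runs work on the calling thread, so it is always blocking.
struct inline_executor {
    // Single cardinality, one-way: run f once, no result channel.
    template <typename F>
    void execute(F&& f) const {
        std::forward<F>(f)();
    }

    // Single cardinality, two-way: run f once and hand its result back.
    // Simplification: a real two-way executor would return a future.
    template <typename F>
    auto twoway_execute(F&& f) const -> decltype(std::forward<F>(f)()) {
        return std::forward<F>(f)();
    }

    // Bulk cardinality, one-way: run f(i) for every index i in [0, n).
    template <typename F>
    void bulk_execute(F&& f, std::size_t n) const {
        for (std::size_t i = 0; i < n; ++i) f(i);
    }
};
```

A caller would write, for example, `exec.bulk_execute([&](std::size_t i) { out[i] = i * i; }, out.size());`. Swapping in an executor backed by a thread pool or a GPU queue would change where and when the calls run without changing the calling code, which is the point of the unified interface.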
71 Executor Requirements oneway_executor exec; auto newexec = require(exec, never_blocking); auto fut = newexec.twoway_execute([&]() { return func(); }); 71
72 Executor Requirements possibly_blocking_executor exec; auto newexec = prefer(exec, never_blocking); newexec.execute([&]() { func(); }); 72
73 Executor Requirements possibly_blocking_executor exec; auto newexec = prefer(exec, never_blocking); auto isneverblocking = query(newexec, never_blocking); 73
74 Ranges Range View Action 74
75 Ranges Range = pair of iterators View = lazy adaptation Action = eager mutation 75
76 Ranges Views Actions using namespace ranges; int sum = accumulate(view::ints(1) | view::transform([](int i){return i*i;}) | view::take(10), 0); extern std::vector<int> read_data(); using namespace ranges; std::vector<int> vi = read_data() | action::sort | action::unique; 76
77 Ranges Views Actions using namespace ranges; int sum = accumulate(view::ints(1) | view::transform([](int i){return i*i;}) | view::take(10), 0); extern std::vector<int> read_data(); using namespace ranges; std::vector<int> vi = read_data() | action::sort | action::unique; 77
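For readers without a ranges library at hand, the two pipelines on this slide can be approximated in standard C++. This is only a sketch of what they compute: `sum_of_first_squares` and `sort_unique` are names invented for this note, and the hand-written loop stands in for the lazy view pipeline rather than reproducing the library's machinery.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// View-like behaviour: nothing is stored. Each integer is generated,
// squared and consumed on demand, and generation stops after `count` items.
// Roughly what ints(1) | transform(square) | take(count) feeds to accumulate.
int sum_of_first_squares(int count) {
    int sum = 0;
    for (int i = 1; i <= count; ++i) sum += i * i;
    return sum;
}

// Action-like behaviour: the container is mutated eagerly and handed back.
// Roughly what read_data() | action::sort | action::unique produces.
std::vector<int> sort_unique(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}
```

`sum_of_first_squares(10)` is 385, and `sort_unique({3, 1, 3, 2, 1})` is `{1, 2, 3}`.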
78 Parallel Ranges on GPU auto exec = get_gpu_executor(); auto indices = ranges::view::iota(0) | ranges::view::take(width * height); std::experimental::transform(exec, indices, gpu_image, CalculatePixel{height, width, iterations}); 78
79 Parallel Ranges on GPU Main GPU bottleneck = data transfer Lazy views allow avoiding data transfers CPU GPU CPU f1 GPU f1+f2+f3 f2 f3 79
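The f1, f2, f3 picture amounts to operation fusion. A CPU-only sketch of the difference follows; the three functions and both helper names are placeholders invented for this note, and on a real GPU each separate pass would also carry a kernel launch and possibly a transfer of the intermediate data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Three arbitrary element-wise stages standing in for f1, f2 and f3.
inline float f1(float x) { return x + 1.0f; }
inline float f2(float x) { return x * 2.0f; }
inline float f3(float x) { return x - 3.0f; }

// Eager version: three passes, each materialising a full intermediate array.
std::vector<float> three_passes(const std::vector<float>& in) {
    std::vector<float> a(in.size()), b(in.size()), c(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) a[i] = f1(in[i]);
    for (std::size_t i = 0; i < in.size(); ++i) b[i] = f2(a[i]);
    for (std::size_t i = 0; i < in.size(); ++i) c[i] = f3(b[i]);
    return c;
}

// Fused version: one pass, no intermediates. This is what composing lazy
// views allows, because the stages are only combined, not executed, until
// the final algorithm runs.
std::vector<float> fused_pass(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = f3(f2(f1(in[i])));
    return out;
}
```

Both functions return the same values; the fused form touches the input once and writes the output once.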
80 Example with Views Intel i7-6700k CPU and Intel HD Graphics 530 GPU Speedups calculated from median execution times of 100 runs per experiment ComputeCpp CE v0.5 Prototype benchmarking branch Codeplay Software Ltd. - CONFIDENTIAL
81 What if there is no predefined function? gstorm::sycl_exec exec; auto gpu_image = std::experimental::copy(exec, image); auto indices = ranges::view::iota(0) | ranges::view::take(width * height); std::experimental::transform(exec, indices, gpu_image, CalculatePixel{ height, width, iterations}); Codeplay Software Ltd. - CONFIDENTIAL
82 What if there is no predefined function? Mandelbrot Intel PSTL vs SYCL Ranges Codeplay Software Ltd. - CONFIDENTIAL
83 Conclusions C++17 makes data parallel programming easier SYCL enables heterogeneous programming in standard C++ Executors will provide a unified model of execution Parallel ranges will make C++ GPU programming even better 83
84 We're Hiring! codeplay.com/careers/ simon@codeplay.com info@codeplay.com blog.tartanllama.xyz codeplay.com
Introduction to OpenACC 16 May 2013 GPUs Reaching Broader Set of Developers 1,000,000 s 100,000 s Early Adopters Research Universities Supercomputing Centers Oil & Gas CAE CFD Finance Rendering Data Analytics
More informationMasterpraktikum Scientific Computing
Masterpraktikum Scientific Computing High-Performance Computing Michael Bader Alexander Heinecke Technische Universität München, Germany Outline Intel Cilk Plus OpenCL Übung, October 7, 2012 2 Intel Cilk
More informationGPU Programming Using CUDA
GPU Programming Using CUDA Michael J. Schnieders Depts. of Biomedical Engineering & Biochemistry The University of Iowa & Gregory G. Howes Department of Physics and Astronomy The University of Iowa Iowa
More informationIntel Array Building Blocks (Intel ArBB) Technical Presentation
Intel Array Building Blocks (Intel ArBB) Technical Presentation Copyright 2010, Intel Corporation. All rights reserved. 1 Noah Clemons Software And Services Group Developer Products Division Performance
More informationKampala August, Agner Fog
Advanced microprocessor optimization Kampala August, 2007 Agner Fog www.agner.org Agenda Intel and AMD microprocessors Out Of Order execution Branch prediction Platform, 32 or 64 bits Choice of compiler
More informationSYCL for OpenCL. in a nutshell. Maria Rovatsou, Codeplay s R&D Product Development Lead & Contributor to SYCL. IWOCL Conference May 2014
SYCL for OpenCL in a nutshell Maria Rovatsou, Codeplay s R&D Product Development Lead & Contributor to SYCL! IWOCL Conference May 2014 SYCL for OpenCL in a nutshell SYCL in the OpenCL ecosystem SYCL aims
More informationOpenCL and CUDA: A Hands-on Introduction
OpenCL and CUDA: A Hands-on Introduction Tim Mattson Intel Corp. Acknowledgements: These slides are based on content produced by Tom Deakin and Simon Mcintosh-Smith from the University of Bristol which
More informationHeterogeneous Computing
Heterogeneous Computing Featured Speaker Ben Sander Senior Fellow Advanced Micro Devices (AMD) DR. DOBB S: GPU AND CPU PROGRAMMING WITH HETEROGENEOUS SYSTEM ARCHITECTURE Ben Sander AMD Senior Fellow APU:
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationAccelerated Test Execution Using GPUs
Accelerated Test Execution Using GPUs Vanya Yaneva Supervisors: Ajitha Rajan, Christophe Dubach Mathworks May 27, 2016 The Problem Software testing is time consuming Functional testing The Problem Software
More informationHSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!
Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2
More informationData Parallelism. CSCI 5828: Foundations of Software Engineering Lecture 28 12/01/2016
Data Parallelism CSCI 5828: Foundations of Software Engineering Lecture 28 12/01/2016 1 Goals Cover the material in Chapter 7 of Seven Concurrency Models in Seven Weeks by Paul Butcher Data Parallelism
More informationParallel Hybrid Computing F. Bodin, CAPS Entreprise
Parallel Hybrid Computing F. Bodin, CAPS Entreprise Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous
More informationPROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec
PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization
More informationAMD s Unified CPU & GPU Processor Concept
Advanced Seminar Computer Engineering Institute of Computer Engineering (ZITI) University of Heidelberg February 5, 2014 Overview 1 2 Current Platforms: 3 4 5 Architecture 6 2/37 Single-thread Performance
More informationOpenCL. Computation on HybriLIT Brief introduction and getting started
OpenCL Computation on HybriLIT Brief introduction and getting started Alexander Ayriyan Laboratory of Information Technologies Joint Institute for Nuclear Research 05.09.2014 (Friday) Tutorial in frame
More informationOpenACC Fundamentals. Steve Abbott November 15, 2017
OpenACC Fundamentals Steve Abbott , November 15, 2017 AGENDA Data Regions Deep Copy 2 while ( err > tol && iter < iter_max ) { err=0.0; JACOBI ITERATION #pragma acc parallel loop reduction(max:err)
More informationOpenACC (Open Accelerators - Introduced in 2012)
OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in
More informationtrisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell 05/12/2015 IWOCL 2015 SYCL Tutorial Khronos OpenCL SYCL committee
trisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell Khronos OpenCL SYCL committee 05/12/2015 IWOCL 2015 SYCL Tutorial OpenCL SYCL committee work... Weekly telephone meeting Define
More informationGPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler
GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler Taylor Lloyd Jose Nelson Amaral Ettore Tiotto University of Alberta University of Alberta IBM Canada 1 Why? 2 Supercomputer Power/Performance GPUs
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationHKG OpenCL Support by NNVM & TVM. Jammy Zhou - Linaro
HKG18-417 OpenCL Support by NNVM & TVM Jammy Zhou - Linaro Agenda OpenCL Overview OpenCL in NNVM & TVM Current Status OpenCL Introduction Open Computing Language Open standard maintained by Khronos with
More informationApplying OpenCL. IWOCL, May Andrew Richards
Applying OpenCL IWOCL, May 2017 Andrew Richards The next generation of software will not be built on CPUs 2 On a 100 millimetre-squared chip, Google needs something like 50 teraflops of performance - Daniel
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationEEM528 GPU COMPUTING
EEM528 CS 193G GPU COMPUTING Lecture 2: GPU History & CUDA Programming Basics Slides Credit: Jared Hoberock & David Tarjan CS 193G History of GPUs Graphics in a Nutshell Make great images intricate shapes
More informationCopyright Khronos Group 2012 Page 1. OpenCL 1.2. August 2012
Copyright Khronos Group 2012 Page 1 OpenCL 1.2 August 2012 Copyright Khronos Group 2012 Page 2 Khronos - Connecting Software to Silicon Khronos defines open, royalty-free standards to access graphics,
More informationLecture 11: GPU programming
Lecture 11: GPU programming David Bindel 4 Oct 2011 Logistics Matrix multiply results are ready Summary on assignments page My version (and writeup) on CMS HW 2 due Thursday Still working on project 2!
More informationNeil Trevett Vice President, NVIDIA OpenCL Chair Khronos President. Copyright Khronos Group, Page 1
Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President Copyright Khronos Group, 2009 - Page 1 Introduction and aims of OpenCL - Neil Trevett, NVIDIA OpenCL Specification walkthrough - Mike
More informationCUDA Programming (Basics, Cuda Threads, Atomics) Ezio Bartocci
TECHNISCHE UNIVERSITÄT WIEN Fakultät für Informatik Cyber-Physical Systems Group CUDA Programming (Basics, Cuda Threads, Atomics) Ezio Bartocci Outline of CUDA Basics Basic Kernels and Execution on GPU
More informationCompiler Tools for HighLevel Parallel Languages
Compiler Tools for HighLevel Parallel Languages Paul Keir Codeplay Software Ltd. LEAP Conference May 21st 2013 Presentation Outline Introduction EU Framework 7 Project: LPGPU Offload C++ for PS3 Memory
More information/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 10: GPGPU (3) Welcome!
/INFOMOV/ Optimization & Vectorization J. Bikker - Sep-Nov 2018 - Lecture 10: GPGPU (3) Welcome! Today s Agenda: Don t Trust the Template The Prefix Sum Parallel Sorting Stream Filtering Optimizing GPU
More informationAN 831: Intel FPGA SDK for OpenCL
AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1
More informationCUDA C Programming Mark Harris NVIDIA Corporation
CUDA C Programming Mark Harris NVIDIA Corporation Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory Models Programming Environment
More informationOpenCL The Open Standard for Heterogeneous Parallel Programming
OpenCL The Open Standard for Heterogeneous Parallel Programming March 2009 Copyright Khronos Group, 2009 - Page 1 Close-to-the-Silicon Standards Khronos creates Foundation-Level acceleration APIs - Needed
More informationParallel programming languages:
Parallel programming languages: A new renaissance or a return to the dark ages? Simon McIntosh-Smith University of Bristol Microelectronics Research Group simonm@cs.bris.ac.uk 1 The Microelectronics Group
More informationOmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel
www.bsc.es OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel Guray Ozen guray.ozen@bsc.es Exascale in BSC Marenostrum 4 (13.7 Petaflops ) General purpose cluster (3400
More informationPragma-based GPU Programming and HMPP Workbench. Scott Grauer-Gray
Pragma-based GPU Programming and HMPP Workbench Scott Grauer-Gray Pragma-based GPU programming Write programs for GPU processing without (directly) using CUDA/OpenCL Place pragmas to drive processing on
More informationA Case for Better Integration of Host and Target Compilation When Using OpenCL for FPGAs
A Case for Better Integration of Host and Target Compilation When Using OpenCL for FPGAs Taylor Lloyd, Artem Chikin, Erick Ochoa, Karim Ali, José Nelson Amaral University of Alberta Sept 7 FSP 2017 1 University
More informationWRITING DATA PARALLEL ALGORITHMS ON GPUs
WRITING DATA PARALLEL ALGORITHMS ON GPUs WITH C++ AMP Ade Miller Technical Director, CenturyLink Cloud. ABSTRACT TODAY MOST PCS, TABLETS AND PHONES SUPPORT MULTI-CORE PROCESSORS AND MOST PROGRAMMERS HAVE
More informationNeil Trevett Vice President, NVIDIA OpenCL Chair Khronos President
4 th Annual Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President Copyright Khronos Group, 2009 - Page 1 CPUs Multiple cores driving performance increases Emerging Intersection GPUs Increasingly
More informationData Parallel Algorithmic Skeletons with Accelerator Support
MÜNSTER Data Parallel Algorithmic Skeletons with Accelerator Support Steffen Ernsting and Herbert Kuchen July 2, 2015 Agenda WESTFÄLISCHE MÜNSTER Data Parallel Algorithmic Skeletons with Accelerator Support
More informationCSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591: GPU Programming Using CUDA in Practice Klaus Mueller Computer Science Department Stony Brook University Code examples from Shane Cook CUDA Programming Related to: score boarding load and store
More informationTools for Multi-Cores and Multi-Targets
Tools for Multi-Cores and Multi-Targets Sebastian Pop Advanced Micro Devices, Austin, Texas The Linux Foundation Collaboration Summit April 7, 2011 1 / 22 Sebastian Pop Tools for Multi-Cores and Multi-Targets
More informationMany-core Processors Lecture 11. Instructor: Philippos Mordohai Webpage:
1 CS 677: Parallel Programming for Many-core Processors Lecture 11 Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Outline More CUDA Libraries
More informationECE 574 Cluster Computing Lecture 15
ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements
More informationJosef Pelikán, Jan Horáček CGG MFF UK Praha
GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing
More informationCUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list
CUDA Programming Week 1. Basic Programming Concepts Materials are copied from the reference list G80/G92 Device SP: Streaming Processor (Thread Processors) SM: Streaming Multiprocessor 128 SP grouped into
More informationThreaded Programming. Lecture 9: Alternatives to OpenMP
Threaded Programming Lecture 9: Alternatives to OpenMP What s wrong with OpenMP? OpenMP is designed for programs where you want a fixed number of threads, and you always want the threads to be consuming
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationGPU Architecture and Programming with OpenCL. OpenCL. GPU Architecture: Why? Today s s Topic. GPUs: : Architectures for Drawing Triangles Fast
Today s s Topic GPU Architecture and Programming with OpenCL David Black-Schaffer david.black-schaffer@it black-schaffer@it.uu.se Room 1221 GPU architecture What and why The good The bad Compute Models
More informationDon t reinvent the wheel. BLAS LAPACK Intel Math Kernel Library
Libraries Don t reinvent the wheel. Specialized math libraries are likely faster. BLAS: Basic Linear Algebra Subprograms LAPACK: Linear Algebra Package (uses BLAS) http://www.netlib.org/lapack/ to download
More informationModule 2: Introduction to CUDA C. Objective
ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding
More informationHPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming
KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel
More informationGPU Architecture and Programming with OpenCL
GPU Architecture and Programming with OpenCL David Black-Schaffer david.black-schaffer@it black-schaffer@it.uu.se Room 1221 Today s s Topic GPU architecture What and why The good The bad Compute Models
More informationINTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD
INTRODUCTION TO OPENCL TM A Beginner s Tutorial Udeepta Bordoloi AMD IT S A HETEROGENEOUS WORLD Heterogeneous computing The new normal CPU Many CPU s 2, 4, 8, Very many GPU processing elements 100 s Different
More informationAccelerating image registration on GPUs
Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining
More informationHSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015!
Advanced Topics on Heterogeneous System Architectures HSA foundation! Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2
More information