Boost.Compute. A C++ library for GPU computing. Kyle Lutz

Similar documents
Major Language Changes, pt. 1

Parallelism and Concurrency in C++17 and C++20. Rainer Grimm Training, Coaching and, Technology Consulting

Working Draft, Technical Specification for C++ Extensions for Parallelism, Revision 1

To use various types of iterators with the STL algorithms ( ). To use Boolean functions to specify criteria for STL algorithms ( 23.8).

Concurrency and Parallelism with C++17 and C++20. Rainer Grimm Training, Coaching and, Technology Consulting

Programming Languages Technical Specification for C++ Extensions for Parallelism

Parallelism in C++ J. Daniel Garcia. Universidad Carlos III de Madrid. November 23, 2018

Lecture 21 Standard Template Library. A simple, but very limited, view of STL is the generality that using template functions provides.

C++ - parallelization and synchronization. Jakub Yaghob Martin Kruliš

C++ - parallelization and synchronization. David Bednárek Jakub Yaghob Filip Zavoral

C++ Standard Template Library

New Iterator Concepts

Data_Structures - Hackveda

Parallel Programming with OpenMP and Modern C++ Alternatives

Scientific programming (in C++)

C and C++ Courses. C Language

Generic Programming with JGL 4

Distributed Real-Time Control Systems. Lecture 16 C++ Libraries Templates

Foundations of Programming, Volume I, Linear structures

Functors. Cristian Cibils

More STL algorithms (revision 2)

Efficient C++ Programming and Memory Management

COPYRIGHTED MATERIAL. Index SYMBOLS. Index

I. Introduction C++11 introduced a comprehensive mechanism to manage generation of random numbers in the <random> header file.

High-Productivity CUDA Development with the Thrust Template Library. San Jose Convention Center September 23 rd 2010 Nathan Bell (NVIDIA Research)

Michel Steuwer.

GJL: The Generic Java Library

Deitel Series Page How To Program Series

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania

OpenCL: History & Future. November 20, 2017

More STL algorithms. Design Decisions. Doc No: N2569= Reply to: Matt Austern Date:

Single-source SYCL C++ on Xilinx FPGA. Xilinx Research Labs Khronos 2017/11/12 19

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

Contents. 2 Introduction to C++ Programming,

Chapters and Appendices F J are PDF documents posted online at the book s Companion Website, which is accessible from.

C++ How To Program 10 th Edition. Table of Contents

Introduction to GIL, Boost and Generic Programming

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

CUDA Toolkit 4.2 Thrust Quick Start Guide. PG _v01 March 2012

OpenCL Overview. Shanghai March Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec

WRITING DATA PARALLEL ALGORITHMS ON GPUs

Parallel Programming Libraries and implementations

I/O and STL Algorithms

AN 831: Intel FPGA SDK for OpenCL

Object-Oriented Programming for Scientific Computing

Parallel Programming. Libraries and Implementations

AutoTVM & Device Fleet

Addressing Heterogeneity in Manycore Applications

WP Tutorial. WP 0.6 for Oxygen Patrick Baudin, Loïc Correnson, Philippe Hermann. CEA LIST, Software Safety Laboratory

Parallel STL in today s SYCL Ruymán Reyes

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

MODULE 37 --THE STL-- ALGORITHM PART V

trisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell 05/12/2015 IWOCL 2015 SYCL Tutorial Khronos OpenCL SYCL committee

Using SYCL as an Implementation Framework for HPX.Compute

C++ for numerical computing - part 2

Contest programming. Linear data structures. Maksym Planeta

Copyright Khronos Group Page 1. Introduction to SYCL. SYCL Tutorial IWOCL

Bigger Better More. The new C++ Standard Library. Thomas Witt April Copyright 2009 Thomas Witt

Chris Sewell Li-Ta Lo James Ahrens Los Alamos National Laboratory

Open Compute Stack (OpenCS) Overview. D.D. Nikolić Updated: 20 August 2018 DAE Tools Project,

THRUST QUICK START GUIDE. DU _v8.0 September 2016

BUILDING PARALLEL ALGORITHMS WITH BULK. Jared Hoberock Programming Systems and Applications NVIDIA Research github.

Templates & the STL. CS 2308 :: Fall 2015 Molly O'Neil

A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening. Alberto Magni, Christophe Dubach, Michael O'Boyle

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Copyright Khronos Group, Page 1. OpenCL. GDC, March 2010

EXTENDING THE REACH OF PARALLEL COMPUTING WITH CUDA

Getting Started with Intel SDK for OpenCL Applications

Debunking the 100X GPU vs CPU Myth: An Evaluation of Throughput Computing on CPU and GPU

Copyright Khronos Group 2012 Page 1. OpenCL 1.2. August 2012

MODULE 35 --THE STL-- ALGORITHM PART III

GPU Programming. Rupesh Nasre.

More performance options

OpenCL Press Conference

GpuWrapper: A Portable API for Heterogeneous Programming at CGG

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

Introduction to Multicore Programming

The OpenVX Computer Vision and Neural Network Inference

Fast BVH Construction on GPUs

Exceptions, Templates, and the STL

End to End Optimization Stack for Deep Learning

Heterogeneous Computing and OpenCL

Automatic Pruning of Autotuning Parameter Space for OpenCL Applications

Expose NVIDIA s performance counters to the userspace for NV50/Tesla

Sequential Containers. Ali Malik

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber,

FPGA-based Supercomputing: New Opportunities and Challenges

SIGGRAPH Briefing August 2014

Java SE 8 New Features

Technology for a better society. hetcomp.com

Lecture Topic: An Overview of OpenCL on Xeon Phi

CS11 Introduction to C++ Spring Lecture 8

Evaluation and Exploration of Next Generation Systems for Applicability and Performance Volodymyr Kindratenko Guochun Shi

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

Building Real-Time Professional Visualization Solutions on GPUs. Kristof Denolf Samuel Maroy Ronny Dewaele

AMD CORPORATE TEMPLATE AMD Radeon Open Compute Platform Felix Kuehling

2 Installation Installing SnuCL CPU Installing SnuCL Single Installing SnuCL Cluster... 7

Chapter 16: Exceptions, Templates, and the Standard Template Library (STL)

Transcription:

Boost.Compute A C++ library for GPU computing Kyle Lutz

GPUs (NVIDIA, AMD, Intel) Multi-core CPUs (Intel, AMD) STL for Parallel Devices Accelerators (Xeon Phi, Adapteva Epiphany) FPGAs (Altera, Xilinx)

accumulate() adjacent_difference() adjacent_find() all_of() any_of() binary_search() copy() copy_if() copy_n() count() count_if() equal() equal_range() exclusive_scan() fill() fill_n() find() find_end() find_if() find_if_not() for_each() gather() generate() generate_n() includes() inclusive_scan() inner_product() inplace_merge() iota() is_partitioned() is_permutation() is_sorted() lower_bound() lexicographical_compare() max_element() merge() min_element() minmax_element() mismatch() next_permutation() none_of() nth_element() partial_sum() partition() partition_copy() partition_point() prev_permutation() random_shuffle() reduce() remove() remove_if() replace() replace_copy() reverse() reverse_copy() rotate() rotate_copy() scatter() search() search_n() set_difference() set_intersection() set_symmetric_difference() set_union() sort() sort_by_key() stable_partition() stable_sort() swap_ranges() transform() transform_reduce() unique() unique_copy() upper_bound() Algorithms

Containers array<t, N> dynamic_bitset<t> flat_map<key, T> flat_set<t> stack<t> string valarray<t> vector<t> Iterators buffer_iterator<t> constant_buffer_iterator<t> constant_iterator<t> counting_iterator<t> discard_iterator function_input_iterator<function> permutation_iterator<elem, Index> transform_iterator<iter, Function> zip_iterator<itertuple> Random Number Generators bernoulli_distribution default_random_engine discrete_distribution linear_congruential_engine mersenne_twister_engine normal_distribution uniform_int_distribution uniform_real_distribution

Library Architecture Boost.Compute Interoperability STL-like API Lambda Expressions RNGs Core OpenCL GPU CPU FPGA

Why OpenCL? (or why not CUDA/Thrust/Bolt/SYCL/OpenACC/OpenMP/C++AMP?) Standard C++ (no special compiler or compiler extensions) Library-based solution (no special build-system integration) Vendor-neutral, open-standard

Low-level API

Low-level API Provides classes to wrap OpenCL objects such as buffer, context, program, and command_queue. Takes care of reference counting and error checking Also provides utility functions for handling error codes or setting up the default device

Low-level API #include <boost/compute/core.hpp> // lookup default compute device auto gpu = boost::compute::system::default_device(); // create opencl context for the device auto ctx = boost::compute::context(gpu); // create command queue for the device auto queue = boost::compute::command_queue(ctx, gpu); // print device name std::cout << device = << gpu.name() << std::endl;

High-level API

Sort Host Data #include <vector> #include <algorithm> std::vector<int> vec = {... }; std::sort(vec.begin(), vec.end());

Sort Host Data #include <vector> #include <boost/compute/algorithm/sort.hpp> std::vector<int> vec = {... }; boost::compute::sort(vec.begin(), vec.end(), queue); 8000 6000 4000 STL Boost.Compute 2000 0 1M 10M 100M

Parallel Reduction #include <boost/compute/algorithm/reduce.hpp> #include <boost/compute/container/vector.hpp> boost::compute::vector<int> data = {... }; int sum = 0; boost::compute::reduce( data.begin(), data.end(), &sum, queue ); std::cout << sum = << sum << std::endl;

Algorithm Internals Fundamentally, STL-like algorithms produce OpenCL kernel objects which are executed on a compute device. OpenCL C++

Custom Functions BOOST_COMPUTE_FUNCTION(int, plus_two, (int x), { return x + 2; }); boost::compute::transform( v.begin(), v.end(), v.begin(), plus_two, queue );

Lambda Expressions Offers a concise syntax for specifying custom operations Fully type-checked by the C++ compiler using boost::compute::lambda::_1; boost::compute::transform( v.begin(), v.end(), v.begin(), _1 + 2, queue );

Additional Features

OpenGL Interop OpenCL provides mechanisms for synchronizing with OpenGL to implement direct rendering on the GPU Boost.Compute provides easy to use functions for interacting with OpenGL in a portable manner. OpenCL OpenGL

Program Caching Helps mitigate run-time kernel compilation costs Frequently-used kernels are stored and retrieved from the global cache Offline cache reduces this to one compilation per system

Auto-tuning OpenCL supports a wide variety of hardware with diverse execution characteristics Algorithms support different execution parameters such as work-group size, amount of work to execute serially These parameters are tunable and their results are measurable Boost.Compute includes benchmarks and tuning utilities to find the optimal parameters for a given device

Auto-tuning

Recent News

Coming soon to Boost Went through Boost peer-review in December 2014 Accepted as an official Boost library in January 2015 Should be packaged in a Boost release this year (1.59)

Boost in GSoC Boost is an accepted organization for the Google Summer of Code 2015 Last year Boost.Compute mentored a student who implemented many new algorithms and features Open to mentoring another student this year See: https://svn.boost.org/trac/boost/wiki/soc2015

Thank You Source http://github.com/kylelutz/compute Documentation http://kylelutz.github.io/compute