The Cut and Thrust of CUDA

Similar documents
CUDA Toolkit 4.2 Thrust Quick Start Guide. PG _v01 March 2012

COEN244: Class & function templates

OBJECT ORIENTED PROGRAMMING USING C++ CSCI Object Oriented Analysis and Design By Manali Torpe

Short Notes of CS201

Thrust. CUDA Course July István Reguly

CS201 - Introduction to Programming Glossary By

Interview Questions of C++

Parallel STL in today s SYCL Ruymán Reyes

SFU CMPT Topic: Function Objects

Introduction to C++ Professor Hugh C. Lauer CS-2303, System Programming Concepts

Lesson 13 - Vectors Dynamic Data Storage

1. The term STL stands for?

I BCS-031 BACHELOR OF COMPUTER APPLICATIONS (BCA) (Revised) Term-End Examination. June, 2015 BCS-031 : PROGRAMMING IN C ++

Fast Introduction to Object Oriented Programming and C++

CS201 Some Important Definitions

Chapter 15 - C++ As A "Better C"

Object-Oriented Programming for Scientific Computing

COMP6771 Advanced C++ Programming

Overload Resolution. Ansel Sermersheim & Barbara Geller Amsterdam C++ Group March 2019

C++ Basics. Data Processing Course, I. Hrivnacova, IPN Orsay

OpenACC Course. Office Hour #2 Q&A

Programming, numerics and optimization

Final exam. Final exam will be 12 problems, drop any 2. Cumulative up to and including week 14 (emphasis on weeks 9-14: classes & pointers)

Stream Computing using Brook+

Advanced Systems Programming

C++ (Non for C Programmer) (BT307) 40 Hours

A brief introduction to C++

Introduction to Programming Using Java (98-388)

GridKa School 2013: Effective Analysis C++ Templates

END TERM EXAMINATION

Object-Oriented Programming for Scientific Computing

05-15/17 Discussion Notes

EL2310 Scientific Programming

QUIZ. 1. Explain the meaning of the angle brackets in the declaration of v below:

CS 6456 OBJCET ORIENTED PROGRAMMING IV SEMESTER/EEE

And Even More and More C++ Fundamentals of Computer Science

THRUST QUICK START GUIDE. DU _v8.0 September 2016

Absolute C++ Walter Savitch

Exception Namespaces C Interoperability Templates. More C++ David Chisnall. March 17, 2011

STUDY NOTES UNIT 1 - INTRODUCTION TO OBJECT ORIENTED PROGRAMMING

GPU Programming. Rupesh Nasre.

GPU programming basics. Prof. Marco Bertini

Programming in C and C++

C++ Important Questions with Answers

CSE 333 Midterm Exam 2/14/14

Introduction to C++ Systems Programming

ECE 2400 Computer Systems Programming Fall 2017 Topic 15: C++ Templates

Topics. Modularity and Object-Oriented Programming. Dijkstra s Example (1969) Stepwise Refinement. Modular program development

Object Oriented Paradigm

Part XI. Algorithms and Templates. Philip Blakely (LSC) C++ Introduction 314 / 370

Bitonic Sort for C++ AMP

Cpt S 122 Data Structures. Introduction to C++ Part II

W3101: Programming Languages C++ Ramana Isukapalli

Overload Resolution. Ansel Sermersheim & Barbara Geller ACCU / C++ June 2018

by Pearson Education, Inc. All Rights Reserved. 2

PHY4321 Summary Notes

CE221 Programming in C++ Part 1 Introduction

Week 8: Operator overloading

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

ECE 574 Cluster Computing Lecture 13

Paytm Programming Sample paper: 1) A copy constructor is called. a. when an object is returned by value

KLiC C++ Programming. (KLiC Certificate in C++ Programming)

CS 376b Computer Vision

C++ for System Developers with Design Pattern

EL2310 Scientific Programming

use static size for this buffer

Index. object lifetimes, and ownership, use after change by an alias errors, use after drop errors, BTreeMap, 309

How the Adapters and Binders Work

Welcome to Teach Yourself Acknowledgments Fundamental C++ Programming p. 2 An Introduction to C++ p. 4 A Brief History of C++ p.

CS

SkePU 2 User Guide For the preview release

High Performance Computing in C and C++

The Foundation of C++: The C Subset An Overview of C p. 3 The Origins and History of C p. 4 C Is a Middle-Level Language p. 5 C Is a Structured

Object Oriented Programming. Solved MCQs - Part 2

Expansion statements. Version history. Introduction. Basic usage

Introduction to Parallel Programming

Polymorphism. Programming in C++ A problem of reuse. Swapping arguments. Session 4 - Genericity, Containers. Code that works for many types.

Advanced C++ Programming Workshop (With C++11, C++14, C++17) & Design Patterns

Data Parallel Algorithmic Skeletons with Accelerator Support

Appendix. Grammar. A.1 Introduction. A.2 Keywords. There is no worse danger for a teacher than to teach words instead of things.

Object-Oriented Programming

More C++ : Vectors, Classes, Inheritance, Templates. with content from cplusplus.com, codeguru.com

CS3157: Advanced Programming. Outline

C++ for numerical computing

C++ Objects Overloading Other C++ Peter Kristensen

6.096 Introduction to C++ January (IAP) 2009

An introduction to C++ template programming

Outline. Function calls and results Returning objects by value. return value optimization (RVO) Call by reference or by value?

Thrust: A. Productivity-Oriented Library for CUDA 26.1 MOTIVATION CHAPTER. Nathan Bell and Jared Hoberock

Software Design and Analysis for Engineers

CS11 Advanced C++ Fall Lecture 1

CMSC 202 Section 010x Spring Justin Martineau, Tuesday 11:30am

Semantics of C++ Hauptseminar im Wintersemester 2009/10 Templates

CS179 GPU Programming Recitation 4: CUDA Particles

A Fast Review of C Essentials Part I

CUB. collective software primitives. Duane Merrill. NVIDIA Research

Cpt S 122 Data Structures. Templates

AN OVERVIEW OF C++ 1

EL2310 Scientific Programming

Evaluation Issues in Generic Programming with Inheritance and Templates in C++

Transcription:

The Cut and Thrust of CUDA Luke Hodkinson Center for Astrophysics and Supercomputing Swinburne University of Technology Melbourne, Hawthorn 32000, Australia May 16, 2013 Luke Hodkinson The Cut and Thrust of CUDA 1/73

Table of contents 1 Introduction 2 Results 3 C++ and Generics 4 Thrust 5 Case Studies Luke Hodkinson The Cut and Thrust of CUDA 2/73

1 Introduction 2 Results 3 C++ and Generics 4 Thrust 5 Case Studies Luke Hodkinson The Cut and Thrust of CUDA 3/73

Purpose A review of required C++ fundamentals. Just the parts we need. An introduction to thrust basics. Two case studies. Luke Hodkinson The Cut and Thrust of CUDA 4/73

What is Thrust? A parallel algorithms library. Resembles the C++ Standard Template Library. Enhances productivity. Luke Hodkinson The Cut and Thrust of CUDA 5/73

Architectures Thrust is not limited to GPUs... Multicore CPUs. GPUs. CUDA. OpenMP. Intel Threading Building Blocks. Luke Hodkinson The Cut and Thrust of CUDA 6/73

Current Statuts Open source. Apache License 2.0. Version 1.6.0. Actively developed. 240+ active users. Now distributed with CUDA toolkit. Luke Hodkinson The Cut and Thrust of CUDA 7/73

1 Introduction 2 Results 3 C++ and Generics 4 Thrust 5 Case Studies Luke Hodkinson The Cut and Thrust of CUDA 8/73

Results The target problem Conversion of an MPI parallel selection algorithm. Used to find median of large distributed arrays. Running with 10,000,000 elements per process. Written in C++ using generic algorithms (STL). Luke Hodkinson The Cut and Thrust of CUDA 9/73

Results Programming effort Number of characters modified 24 Not including build scripts or including headers. Luke Hodkinson The Cut and Thrust of CUDA 10/73

Results Programming time Time spent modifying algorithm 1 minute Not including build scripts. Luke Hodkinson The Cut and Thrust of CUDA 11/73

Results Speedup Measured speedup 122x Measured on gstar. Luke Hodkinson The Cut and Thrust of CUDA 12/73

Results Scaling Luke Hodkinson The Cut and Thrust of CUDA 13/73

Results Metrics Metrics x5 per character changed. x2 per second spent. Luke Hodkinson The Cut and Thrust of CUDA 14/73

Results Of course... results may vary, already written in C++, which brings us to... Luke Hodkinson The Cut and Thrust of CUDA 15/73

1 Introduction 2 Results 3 C++ and Generics 4 Thrust 5 Case Studies Luke Hodkinson The Cut and Thrust of CUDA 16/73

Object Oriented Programming Why do we need it? Allows us to bind data with functions. Operator overloading allows us to create callable data structures. Both of the above allow for easy kernels. Luke Hodkinson The Cut and Thrust of CUDA 17/73

Object Oriented Programming What is it? A programming paradigm characterised by representation of concepts as objects that have both data and procedures which interact with one another. Luke Hodkinson The Cut and Thrust of CUDA 18/73

Object Oriented Programming Some definitions: class A structure to hold data and functions. member Used to describe a peice of data in a class. method Used to describe a function/behavior of a class. object An instantiaion of a class. A class defines an object, but no allocation is given until instantiation. Luke Hodkinson The Cut and Thrust of CUDA 19/73

Object Oriented Programming Defining a class: Example struct point_ type { double x, y; }; class point_ type { public : double x, y; double norm () { return sqrt (x*x + y*y); } }; Luke Hodkinson The Cut and Thrust of CUDA 20/73

Object Oriented Programming Defining a class: Example Declared with class. struct point_ type { double x, y; }; class point_ type { public : double x, y; double norm () { return sqrt (x*x + y*y); } }; Luke Hodkinson The Cut and Thrust of CUDA 20/73

Object Oriented Programming Defining a class: Example Needs scope modifier. struct point_ type { double x, y; }; class point_ type { public : double x, y; double norm () { return sqrt (x*x + y*y); } }; Luke Hodkinson The Cut and Thrust of CUDA 20/73

Object Oriented Programming Defining a class: Example Class s data. struct point_ type { double x, y; }; class point_ type { public : double x, y; double norm () { return sqrt (x*x + y*y); } }; Luke Hodkinson The Cut and Thrust of CUDA 20/73

Object Oriented Programming Defining a class: Example Class s behavior. struct point_ type { double x, y; }; class point_ type { public : double x, y; double norm () { return sqrt (x*x + y*y); } }; Luke Hodkinson The Cut and Thrust of CUDA 20/73

Object Oriented Programming Instantiating a class into an object: Example int main () { // Make an object of type " point_ type ". point_ type point ; } // Call the " norm " method. double value = point. norm (); Luke Hodkinson The Cut and Thrust of CUDA 21/73

Object Oriented Programming Can use struct instead of class. Data members are automatically accessible to methods. Methods callable from objects using.. Luke Hodkinson The Cut and Thrust of CUDA 22/73

Object Oriented Programming Constructors A special method that facilitates initialisation of data. Example struct point_ type { // Default constructor. point_ type () : x( 0.0 ), y( 0.0 ) { } // Constructor. point_ type ( double x_in, double y_in ) : x( x_in ), y( y_in ) { } }; Luke Hodkinson The Cut and Thrust of CUDA 23/73

Object Oriented Programming Constructors Is automatically called upon instantiation. Example int main () { // Construct with defaults ; point_ type point_a ; // Construct with values. point_ type point_b ( 2.0, 4.0 ); }; Luke Hodkinson The Cut and Thrust of CUDA 24/73

Object Oriented Programming Functors C++ allows certain operators to be overloaded. x + y point_type& operator+( const point_type& y ) x - y point_type& operator-( const point_type& y) x*y point_type& operator*( const point_type& y ) x/y point_type& operator/( const point_type& y) x() void operator()() Luke Hodkinson The Cut and Thrust of CUDA 25/73

Object Oriented Programming Functors So, we can make function-like objects (functors) with attached data... Example struct square_ plus { double to_add ; square_ plus ( double value ) : to_add ( value ) {} double operator ()( double x ) { return x* x + to_add ; } }; Luke Hodkinson The Cut and Thrust of CUDA 26/73

Object Oriented Programming Functors Example int main () { } square_ plus op( 10 ); // create a " square_ plus " // object with " to_add " // of 10 op( 2 ); // call object, returns 40 op( 3 ); // call object, returns 90 square_ plus ( 10 )( 2 ); // create and call, // can be very efficient Luke Hodkinson The Cut and Thrust of CUDA 27/73

Object Oriented Programming Functors Example int main () { square_ plus op( 10 ); // create a " square_ plus " // object with " to_add " // of 10 op( 2 ); // call object, returns 40 op( 3 ); // call object, returns 90 square_ plus ( 10 )( 2 ); // create and call, // can be very efficient } Call the constructor to initialise to 10. Luke Hodkinson The Cut and Thrust of CUDA 27/73

Object Oriented Programming Functors Example int main () { square_ plus op( 10 ); // create a " square_ plus " // object with " to_add " // of 10 op( 2 ); // call object, returns 40 op( 3 ); // call object, returns 90 square_ plus ( 10 )( 2 ); // create and call, // can be very efficient } Call functor with argument 2. Luke Hodkinson The Cut and Thrust of CUDA 27/73

Object Oriented Programming Summary Summary Bind data with functions using classes/objects. Initialize objects with constructors. Create function-like objects with operator overloading. Luke Hodkinson The Cut and Thrust of CUDA 28/73

Object Oriented Programming Object oriented concepts we won t be looking at: inheritance, polymorphism, encapsulation, dynamic dispatch. There s a lot more to it! Luke Hodkinson The Cut and Thrust of CUDA 29/73

What is Generic Programming? Some quotes Generic programming is a paradigm for developing efficient, reusable software libraries. Generic programming is about generalizing software components so that they can be easily reused in a wide variety of situations. Generic programming is a style of computer programming in which algorithms are written in terms of to-be-specified-later types that are then instantiated when needed for specific types provided as parameters. Luke Hodkinson The Cut and Thrust of CUDA 30/73

What is Generic Programming? Some quotes Generic programming is a paradigm for developing efficient, reusable software libraries. Generic programming is about generalizing software components so that they can be easily reused in a wide variety of situations. Generic programming is a style of computer programming in which algorithms are written in terms of to-be-specified-later types that are then instantiated when needed for specific types provided as parameters. Luke Hodkinson The Cut and Thrust of CUDA 30/73

What is Generic Programming? Some quotes Generic programming is a paradigm for developing efficient, reusable software libraries. Generic programming is about generalizing software components so that they can be easily reused in a wide variety of situations. Generic programming is a style of computer programming in which algorithms are written in terms of to-be-specified-later types that are then instantiated when needed for specific types provided as parameters. Luke Hodkinson The Cut and Thrust of CUDA 30/73

What is Generic Programming? Example problem Example // Sum a set of numbers. int sum ( int array [], int size ) { int result = 0; for ( unsigned ii = 0; ii < size ; ++ ii ) result += array [ ii ]; return result ; } Luke Hodkinson The Cut and Thrust of CUDA 31/73

What is Generic Programming? Example problem Example // Sum a set of numbers. int sum ( int array [], int size ) { int result = 0; for ( unsigned ii = 0; ii < size ; ++ ii ) result += array [ ii ]; return result ; } But what about different types? Luke Hodkinson The Cut and Thrust of CUDA 31/73

What is Generic Programming? Example problem Potential solutions Write different versions for each type. Fast, but unmanageable. Use pointers to generalise type. mallocs for ints?! Too slow. Lose type information. Use generic programming. Luke Hodkinson The Cut and Thrust of CUDA 32/73

Templates Introduction Templates are how C++ handles generic programming. Example // Sum a set of numbers. template < typename T > T sum ( T array [], int size ) { T result = 0; for ( unsigned ii = 0; ii < size ; ++ ii ) result += array [ ii ]; return result ; } Luke Hodkinson The Cut and Thrust of CUDA 33/73

Templates Introduction Example int main () { int array_ of_ ints [10]; double array_ of_ doubles [10]; // TODO : Initialise arrays. sum < int >( array_of_ints, 10 ); sum < double >( array_ of_ doubles, 10 ); } sum ( array_of_ints, 10 ); sum ( array_ of_ doubles, 10 ); Luke Hodkinson The Cut and Thrust of CUDA 34/73

Pointers Currently our summation precludes summation over a subset. But, if we were to use pointers... Example // Sum a set of numbers. template < typename T > T sum ( T* start, T* finish ) { T result = 0; while ( start!= finish ) result += * start ++; return result ; } Luke Hodkinson The Cut and Thrust of CUDA 35/73

Pointers Example int main () { int array_ of_ ints [10]; double array_ of_ doubles [10]; // TODO : Initialise arrays. } sum <int >( array_of_ints + 2, array_ of_ ints + 8 ); sum < double >( array_ of_ doubles, array_ of_ doubles + 4 ); Luke Hodkinson The Cut and Thrust of CUDA 36/73

Iterators Object Iterators What if we have a more complex container? Say, a linked list? Using the addition operator, ++, would not work. However, remember operator overloading......we can override these operators using C++ objects! What if we define a kind of object to represent positions in containers? Luke Hodkinson The Cut and Thrust of CUDA 37/73

Iterators Object Iterators Example // Sum a set of numbers. template < typename Iterator > typename Iterator :: value_ type sum ( Iterator start, Iterator finish ) { typename Iterator :: value_ type result = 0; while ( start!= finish ) result += * start ++; return result ; } Luke Hodkinson The Cut and Thrust of CUDA 38/73

Iterators Object Iterators Example // Sum a set of numbers. template < typename Iterator > typename Iterator :: value_ type sum ( Iterator start, Iterator finish ) { typename Iterator :: value_ type result = 0; while ( start!= finish ) result += * start ++; return result ; Looks complicated... } Luke Hodkinson The Cut and Thrust of CUDA 38/73

Iterators Object Iterators Example // Sum a set of numbers. template < typename Iterator > typename Iterator :: value_ type sum ( Iterator start, Iterator finish ) { typename Iterator :: value_ type result = 0; while ( start!= finish ) result += * start ++; return result ; } Luke Hodkinson The Cut and Thrust of CUDA 38/73

Iterators Object Iterators Example int main () { std :: list <int > list ; std :: set < double > set ; // TODO : Fill list and set with elements. } sum <int >( list. begin (), list. end () ); sum < double >( set. begin (), set. end () ); Luke Hodkinson The Cut and Thrust of CUDA 39/73

Iterators Object Iterators Example int main () { std :: list <int > list ; STL uses std namespace std :: set < double > set ; // TODO : Fill list and set with elements. } sum <int >( list. begin (), list. end () ); sum < double >( set. begin (), set. end () ); Luke Hodkinson The Cut and Thrust of CUDA 39/73

Iterators Object Iterators Example int main () { std :: list <int > list ; begin points to container start std :: set < double > set ; // TODO : Fill list and set with elements. } sum <int >( list. begin (), list. end () ); sum < double >( set. begin (), set. end () ); Luke Hodkinson The Cut and Thrust of CUDA 39/73

Iterators Object Iterators Example int main () { std :: list <int > list ; end points to container end std :: set < double > set ; // TODO : Fill list and set with elements. } sum <int >( list. begin (), list. end () ); sum < double >( set. begin (), set. end () ); Luke Hodkinson The Cut and Thrust of CUDA 39/73

Iterators Object Iterators vec.begin() vec.end() Luke Hodkinson The Cut and Thrust of CUDA 40/73

Algorithms Introduction C++ STL provides a suite of generic algorithms written in terms of iterators. sort copy generate reverse fill find count search Luke Hodkinson The Cut and Thrust of CUDA 41/73

Algorithms Transform Example template < typename InputIterator, typename OutputIterator, typename UnaryOperation > void transform ( InputIterator first, InputIterator last, OutputIterator result, UnaryOperation op ) { while ( first!= last ) { * result = op( * first ); ++ first ; ++ result ; } } Luke Hodkinson The Cut and Thrust of CUDA 42/73

Algorithms Transform Example template < typename T > struct square { T operator ()( T x ) { return x* x; } }; int main () { std :: vector <int > vec ( 100 ); std :: fill <int >( vec. begin (), vec. end (), 10 ); std :: transform <int >( vec. begin (), vec. end (), square <int >() ); } Luke Hodkinson The Cut and Thrust of CUDA 43/73

Summary Give functions/objects variable types. Very efficient, different code for each type. Manageable code. Allows separation of data and algorithms. Luke Hodkinson The Cut and Thrust of CUDA 44/73

1 Introduction 2 Results 3 C++ and Generics 4 Thrust 5 Case Studies Luke Hodkinson The Cut and Thrust of CUDA 45/73

Thrust Introduction So how can Thrust help us write CUDA capable code? The Thrust library provides replacements for STL containers and algorithms. vector, transform, copy, count, count_if And also some new ones. reduce, transform_reduce They are all hardware optimised. Luke Hodkinson The Cut and Thrust of CUDA 46/73

Thrust Introduction So how can Thrust help us write CUDA capable code? The Thrust library provides replacements for STL containers and algorithms. vector, transform, copy, count, count_if And also some new ones. reduce, transform_reduce They are all hardware optimised. Luke Hodkinson The Cut and Thrust of CUDA 46/73

Thrust Introduction So how can Thrust help us write CUDA capable code? The Thrust library provides replacements for STL containers and algorithms. vector, transform, copy, count, count_if And also some new ones. reduce, transform_reduce They are all hardware optimised. Luke Hodkinson The Cut and Thrust of CUDA 46/73

Thrust Introduction So how can Thrust help us write CUDA capable code? The Thrust library provides replacements for STL containers and algorithms. vector, transform, copy, count, count_if And also some new ones. reduce, transform_reduce They are all hardware optimised. Luke Hodkinson The Cut and Thrust of CUDA 46/73

Thrust Introduction Example Just like the STL is declared inside the std:: namespace...... all thrust symbols are declared inside a thrust:: namespace. Symbols can be the same name without conflicts. int main () { std :: transform ( vec. begin (), vec. end (), square <int >() ); thrust :: transform ( vec. begin (), vec. end (), square <int >() ); } Luke Hodkinson The Cut and Thrust of CUDA 47/73

Thrust Vectors Device Memory Device memory is distinct from host memory. We need a special container class to refer to device memory: Various vectors template < typename T > class std :: vector <T >; template < typename T > class thrust :: host_vector <T >; template < typename T > class thrust :: device_vector <T >; Luke Hodkinson The Cut and Thrust of CUDA 48/73

Thrust Vectors Creating a Vector Device vectors may be constructed with a size... Example thrust :: device_vector < float > dev_vec ( 100 );... or resized afterwards, just like a normal vector. Example thrust :: device_vector < float > dev_vec ; dev_vec. resize ( 100 ); Luke Hodkinson The Cut and Thrust of CUDA 49/73

Copying To/From Device Values can be easily transferred from host to device and vice-versa. Example std :: vector < float > host_vec ( 100 ); thrust :: device_vector < float > dev_vec ( 100 ); thrust :: copy ( host_vec. begin (), host_vec. end (), dev_vec. begin () ); // Do some GPU processing. thrust :: copy ( dev_vec. begin (), dev_vec. end (), host_vec. begin () ); Luke Hodkinson The Cut and Thrust of CUDA 50/73

Thrust Algorithms Introduction Thrust s algorithms are its real power. Each algorithm is optimised for every supported hardware type. Like the STL, algorithms are customised with functors. What kinds of algorithms can be optimised for multicore hardware? Transformation, reduction, sorting, and others. Luke Hodkinson The Cut and Thrust of CUDA 51/73

Thrust Algorithms Introduction Thrust s algorithms are its real power. Each algorithm is optimised for every supported hardware type. Like the STL, algorithms are customised with functors. What kinds of algorithms can be optimised for multicore hardware? Transformation, reduction, sorting, and others. Luke Hodkinson The Cut and Thrust of CUDA 51/73

Thrust Algorithms Introduction Thrust s algorithms are its real power. Each algorithm is optimised for every supported hardware type. Like the STL, algorithms are customised with functors. What kinds of algorithms can be optimised for multicore hardware? Transformation, reduction, sorting, and others. Luke Hodkinson The Cut and Thrust of CUDA 51/73

Thrust Algorithms Introduction Thrust s algorithms are its real power. Each algorithm is optimised for every supported hardware type. Like the STL, algorithms are customised with functors. What kinds of algorithms can be optimised for multicore hardware? Transformation, reduction, sorting, and others. Luke Hodkinson The Cut and Thrust of CUDA 51/73

Transformation Definition Apply a unary operation to each element of a vector and store the result in another (or the same) vector. a 1 a 2 a 3 a 4 a 5 a 6 f (a 1 ) f (a 2 ) f (a 3 ) f (a 4 ) f (a 5 ) f (a 6 ) b 1 b 2 b 3 b 4 b 5 b 6 Luke Hodkinson The Cut and Thrust of CUDA 52/73

Transformation Example Example struct norm { device float operator ()( float x ) { return sqrt ( x* x ); } }; int main () { thrust :: device_vector < float > src ( 100 ), dst ( 100 ); // TODO : Initialise " src ". thrust :: transform ( src. begin (), src. end (), dst. begin (), norm () ); } Luke Hodkinson The Cut and Thrust of CUDA 53/73

Reduction Definition Apply a binary operation to successive elements of a vector and return the scalar result. a 4 a 3 a 2 a 1 f (a 1, a 2 ) f (f 1, a 3 ) f (f 2, a 4 ) b Luke Hodkinson The Cut and Thrust of CUDA 54/73

Reduction Example Example int main () { thrust :: device_vector < float > src ( 100 ); // TODO : Initialise " src ". thrust :: reduce ( src. begin (), src. end (), thrust :: plus < float >() ); } Luke Hodkinson The Cut and Thrust of CUDA 55/73

Sorting Definition a 3 a 1 a 5 a 4 a 2 a 6 a 1 a 2 a 3 a 4 a 5 a 6 Luke Hodkinson The Cut and Thrust of CUDA 56/73

Sorting Example Example int main () { thrust :: device_vector < float > src ( 100 ); // TODO : Initialise " src ". thrust :: sort ( src. begin (), src. end (), thrust :: greater < float >() ); } Luke Hodkinson The Cut and Thrust of CUDA 57/73

Application Just a few fundamental algorithms may not seem like much... Luke Hodkinson The Cut and Thrust of CUDA 58/73

Application Examples... but it can do a lot! Example Discrete Voronoi Bucket sort Bounding box Histograms Lexicographical sort Monte carlo disjoint sequences Padded grid reduction Run-length encoding SAXPY Set operations Luke Hodkinson The Cut and Thrust of CUDA 59/73

Application Strategy 1 Does your algorithm subscribe to a fundamental algorithm (transform, reduce, sort)? Done. 2 Can your algorithm be decomposed into simpler components? Goto 1 for each component. 3 Can t decompose any further and doesn t match a fundamental algorithm? Write your own kernel. Luke Hodkinson The Cut and Thrust of CUDA 60/73

Application Strategy 1 Does your algorithm subscribe to a fundamental algorithm (transform, reduce, sort)? Done. 2 Can your algorithm be decomposed into simpler components? Goto 1 for each component. 3 Can t decompose any further and doesn t match a fundamental algorithm? Write your own kernel. Luke Hodkinson The Cut and Thrust of CUDA 60/73

Application Strategy 1 Does your algorithm subscribe to a fundamental algorithm (transform, reduce, sort)? Done. 2 Can your algorithm be decomposed into simpler components? Goto 1 for each component. 3 Can t decompose any further and doesn t match a fundamental algorithm? Write your own kernel. Luke Hodkinson The Cut and Thrust of CUDA 60/73

Advanced Iterators Overview There are other powerful features of Thrust, designed to improve flexibility and performance. thrust::constant_iterator, thrust::counting_iterator, thrust::transform_iterator and thrust::zip_iterator. Luke Hodkinson The Cut and Thrust of CUDA 61/73

Advanced Iterators zip iterator Algorithm s tend to accept a single input sequence. You may want to operate on multiple sequences, or your functor may require multiple sequences. thrust::zip_iterator provides a powerful solution. Luke Hodkinson The Cut and Thrust of CUDA 62/73

Advanced Iterators zip iterator Example // Initialize vectors. thrust :: device_vector <int > A (3); thrust :: device_vector <char > B (3); A [0] = 10; A [1] = 20; A [2] = 30; B [0] = x ; B [1] = y ; B [2] = z ; // Create iterator ( type omitted ). first = thrust :: make_zip_iterator ( thrust :: make_tuple (A. begin (), B. begin ())); last = thrust :: make_zip_iterator ( thrust :: make_tuple (A.end (), B.end ())); first [0] // returns tuple (10, x ) first [1] // returns tuple (20, y ) first [2] // returns tuple (30, z ) // Maximum of [ first, last ). thrust :: maximum < tuple <int,char > > binary_op ; thrust :: tuple <int,char > init = first [0]; thrust :: reduce (first, last, init, binary_op ); // returns tuple (30, z ) Luke Hodkinson The Cut and Thrust of CUDA 63/73

Advanced Iterators zip iterator Example typedef thrust :: tuple <float,float,float > float3 ; using thrust :: get ; struct dot_prod { device float operator ()( float3 a, float3 b ) { return get <0>(a)* <0>get (b) + get <1>(a)* <1>get (b) + get <2>(a)* <2>get (b); } }; int main () { thrust :: device_vector <float > A0, A1, A2; thrust :: device_vector <float > B0, B1, B2; thrust :: device_vector <float > result ; // Initialise vectors. thrust :: transform ( thrust :: make_zip_iterator ( A0. begin (), A1. begin (), A2. begin () ), thrust :: make_zip_iterator ( A0.end (), A1.end (), A2.end () ), thrust :: make_zip_iterator ( B0. begin (), B1. begin (), B2. begin () ), result. begin (), dot_prod () ); } Luke Hodkinson The Cut and Thrust of CUDA 64/73

1 Introduction 2 Results 3 C++ and Generics 4 Thrust 5 Case Studies Luke Hodkinson The Cut and Thrust of CUDA 65/73

Distributed Median Description Luke Hodkinson The Cut and Thrust of CUDA 66/73

Distributed Median Serial In serial median can be calculated fairly easily by sorting the array and taking the n 2 element. Example thrust :: device_vector < float > values ( 100 ); thrust :: sort ( values. begin (), values. end () ); float median = values [ values. size ()/2]; Luke Hodkinson The Cut and Thrust of CUDA 67/73

Distributed Median Parallel Define a balance function, b(x) = N l (x) n 2 which represents how close we are to the median by b(x) = 0. Here N l (x) is a count of how many elements are below the point x and n is the number of elements in the array. Luke Hodkinson The Cut and Thrust of CUDA 68/73

Distributed Median Now we just need to find where b(x) = 0. We can do this with any root finder, but the Ridders algorithm will work particularaly well for us. Luke Hodkinson The Cut and Thrust of CUDA 69/73

Distributed Median And already had one written. Example template < class Function, class T > T ridders ( Function func, T x1, T x2 ); Luke Hodkinson The Cut and Thrust of CUDA 70/73

Distributed Median Example template < typename Iterator > struct median_function { Iterator start, finish ; long position ; mpi :: comm comm ; long operator ()( const value_type & x ) { long sum_left = std :: count_if ( start, finish, std :: bind2nd ( std :: less <value_type >(), x ) ); return comm. all_reduce ( sum_left ) - position ; } }; Luke Hodkinson The Cut and Thrust of CUDA 71/73

Distributed Median Example template < typename Iterator > struct median_function { long position ; mpi :: comm comm ; device long operator ()( const value_type & x ) { long sum_left = thrust :: count_if ( start, finish, thrust :: bind2nd ( thrust :: less <value_type >(), x ) ); return comm. all_reduce ( sum_left ) - position ; } }; Luke Hodkinson The Cut and Thrust of CUDA 72/73

Distributed Median Example template < typename Iterator > struct median_function { long position ; mpi :: comm comm ; device long operator ()( const value_type & x ) { long sum_left = thrust :: count_if ( start, finish, thrust :: bind2nd ( thrust :: less <value_type >(), x ) ); return comm. all_reduce ( sum_left ) - position ; } }; Add device modifier. Luke Hodkinson The Cut and Thrust of CUDA 72/73

Distributed Median Example template < typename Iterator > struct median_function { long position ; mpi :: comm comm ; device long operator ()( const value_type & x ) { long sum_left = thrust :: count_if ( start, finish, thrust :: bind2nd ( thrust :: less <value_type >(), x ) ); return comm. all_reduce ( sum_left ) - position ; } }; Change to thrust namespace. Luke Hodkinson The Cut and Thrust of CUDA 72/73

Distributed Median Example int main () { thrust :: device_vector < float > vec ( 1000 ); thrust :: generate ( vec. begin (), vec. end (), thrust :: counting_ iterator < float >( 0 ) ); float median = ridders ( median_function < float >( start, finish ), 0, 1000 ); } Luke Hodkinson The Cut and Thrust of CUDA 73/73