Improving Host-GPU Communication with Buffering Schemes

1 Improving Host-GPU Communication with Buffering Schemes
Guillermo Marcus, University of Heidelberg

2 Overview
- Motivation
- Buffering Schemes
- Converting data in the loop

3 Why
- The benefits of double and pooled buffers in DMA transactions are well known. Why not use them with GPUs?
- When using an accelerator, the data format in the GPU and the data format in the application usually do not match
- For some applications, we do not want to reserve multi-gigabyte buffers of host memory for transfers

4 Transfers in CUDA
    Running on...
    Device : GeForce GTX 48
    Quick Mode
    Host to Device Bandwidth, 1 Device(s), Pinned memory
    Transfer Size (Bytes)    Bandwidth(MB/s)
    Device to Host Bandwidth, 1 Device(s), Pinned memory
    Transfer Size (Bytes)    Bandwidth(MB/s)
    616.
    Device to Device Bandwidth, 1 Device(s)
    Transfer Size (Bytes)    Bandwidth(MB/s)
    [bandwidthTest] test results... PASSED
[Figure: CUDA Performance Reference. Legend: read, write]

5 With data conversions
[Figure: CUDA Performance Reference. Legend: read, read DP-SP, read AOS-SOA, write, write DP-SP, write AOS-SOA]
- Convert data from double to single precision float (DP-SP)
- Convert data from AOS to SOA
- Both conversions now need to pass the data through the CPU

6 Using Buffering Schemes
[Diagram labels: App, Buffer Manager, Memory, DMA Buffer, Queue, Board, Copy, Translation, D + E]
- Provides one or more memory buffers paired with a GPU buffer
- Implements typical schemes: (c) CHUNK, (d) DOUBLE, (e) POOLED

7 Chunk Buffer
    DMAopsCUDA::board_type device(0);   // select CUDA device
    // Buffer, including the device memory
    DMAopsCUDA::buffer_type buf1(chunk, MAX, sizeof(int));
    // Create the buffer manager
    ChunkBuffer<int, DMAopsCUDA> chunk_buffer_test(device, buf1, buf1.getbuffersize());

    int *data = (int *) malloc(sizeof(int) * MAX);

    chunk_buffer_test.write(data, MAX, 0, 0, true, false);
    ...
    chunk_buffer_test.read(check, MAX, 0, 0, true, false);
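The idea behind the chunk scheme can be illustrated without the library. The following is a minimal sketch, not the ChunkBuffer implementation: `FakeDevice` and `chunked_write` are hypothetical names, and ordinary host memory stands in for both the pinned staging buffer and the GPU memory so that the example runs without a GPU. It only shows how a large array can be moved through one small, fixed-size buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a device allocation: plain host memory, so the
// sketch runs without a GPU.
struct FakeDevice {
    std::vector<int> memory;
    explicit FakeDevice(std::size_t n) : memory(n, 0) {}
    // Stands in for a host-to-device copy out of the staging buffer.
    void upload(const int* staging, std::size_t offset, std::size_t n) {
        std::memcpy(memory.data() + offset, staging, n * sizeof(int));
    }
};

// Copies `count` ints to the device through a staging buffer holding
// `chunk_elems` ints; returns the number of chunk transfers issued.
std::size_t chunked_write(FakeDevice& dev, const int* src, std::size_t count,
                          std::size_t chunk_elems) {
    assert(chunk_elems > 0 && count <= dev.memory.size());
    std::vector<int> staging(chunk_elems);  // would be pinned memory with a real GPU
    std::size_t transfers = 0;
    for (std::size_t done = 0; done < count; done += chunk_elems) {
        const std::size_t n = std::min(chunk_elems, count - done);
        std::memcpy(staging.data(), src + done, n * sizeof(int));  // application -> staging
        dev.upload(staging.data(), done, n);                       // staging -> device
        ++transfers;
    }
    return transfers;
}
```

With 10 elements and a 4-element staging buffer, the loop issues three transfers (4, 4, and 2 elements), so the staging memory stays at 4 elements regardless of the array size.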

11 Chunk Buffer
[Figure: Read Performance Chunk Buffer. Legend: read, write]

12 Double Buffer
    DMAopsCUDA::board_type device(0);   // CUDA device
    DMAopsCUDA::buffer_type buf2_1(chunk, MAX, sizeof(HT));
    // Second buffer, linked to the device memory of buffer 1
    DMAopsCUDA::buffer_type buf2_2(buf2_1);

    DoubleBuffer<HT, DMAopsCUDA> double_buffer_test(device, buf2_1, buf2_2, buf2_1.getbuffersize());
    ...
    double_buffer_test.write(data2, MAX, 0, 0, true, false);
    double_buffer_test.read(check2, MAX, 0, 0, true, false);
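The double-buffer scheme alternates between two staging buffers so that one can be filled while the other is still being transferred. The sketch below shows only that bookkeeping and is not the DoubleBuffer implementation: `Staging`, `complete`, and `double_buffered_write` are hypothetical names, and the "transfer" is a plain copy that is finished lazily, at the point where a real implementation would wait for an asynchronous copy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One staging buffer plus the bookkeeping for a transfer that has been
// issued but not yet completed.
struct Staging {
    std::vector<float> data;
    std::size_t offset = 0;  // destination offset of the pending transfer
    std::size_t used = 0;    // elements pending; 0 means the buffer is free
};

// Completes the pending transfer of `buf`. Here it is a plain copy; with a
// GPU this is where an asynchronous copy would be waited for.
void complete(Staging& buf, std::vector<float>& device) {
    std::copy(buf.data.begin(), buf.data.begin() + buf.used,
              device.begin() + buf.offset);
    buf.used = 0;
}

// Writes `src` to `device`, alternating between two staging buffers.
// Returns the index of the buffer that was filled at each step.
std::vector<int> double_buffered_write(const std::vector<float>& src,
                                       std::vector<float>& device,
                                       std::size_t chunk_elems) {
    assert(chunk_elems > 0 && device.size() >= src.size());
    Staging bufs[2];
    bufs[0].data.resize(chunk_elems);
    bufs[1].data.resize(chunk_elems);
    std::vector<int> order;
    int cur = 0;
    for (std::size_t done = 0; done < src.size(); done += chunk_elems) {
        Staging& b = bufs[cur];
        if (b.used != 0) complete(b, device);  // reuse only after its transfer finished
        b.used = std::min(chunk_elems, src.size() - done);
        b.offset = done;
        std::copy(src.begin() + done, src.begin() + done + b.used, b.data.begin());
        order.push_back(cur);
        // The transfer of `b` is now pending; the next iteration fills the
        // other buffer instead of waiting for this one.
        cur = 1 - cur;
    }
    for (Staging& b : bufs)
        if (b.used != 0) complete(b, device);  // drain what is still pending
    return order;
}
```

The key property is that a buffer is only refilled after its previous transfer has completed, which is what allows the copy of one buffer and the filling of the other to overlap when the transfer is truly asynchronous.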

14 Double Buffer
[Figures: Read Performance Double Buffer; Write Performance Double Buffer. Legend: read, write]

15 Pooled Buffer
    DMAopsCUDA::board_type device(0);   // CUDA device

    // Create a pool of buffers
    DMAPool<HT, DMAopsCUDA> pool(nbuf, CHUNK, MAX, sizeof(HT));
    PooledBuffer<HT, DMAopsCUDA> pooled_buffer_test(sizeof(HT), device, pool);

    pooled_buffer_test.write(data3, MAX, 0, 0, true, false);
    ...
    pooled_buffer_test.read(check3, MAX, 0, 0, true, false);
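A pool generalizes the double buffer to any number of staging buffers that are handed out and returned as transfers start and finish. The following sketch is an assumption about how such a pool could be organized, not the DMAPool implementation; the class `BufferPool` and its methods are hypothetical, and the buffers are ordinary host vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size pool of staging buffers, handed out by index.
class BufferPool {
public:
    BufferPool(std::size_t nbuf, std::size_t chunk_elems)
        : buffers_(nbuf, std::vector<float>(chunk_elems)) {
        assert(nbuf > 0 && chunk_elems > 0);
        for (std::size_t i = nbuf; i-- > 0;) free_.push_back(i);  // buffer 0 ends on top
    }

    // Returns the index of a free buffer, or -1 if every buffer is in use.
    int acquire() {
        if (free_.empty()) return -1;
        const std::size_t idx = free_.back();
        free_.pop_back();
        return static_cast<int>(idx);
    }

    // Returns a buffer to the pool once its transfer has finished.
    void release(int idx) {
        assert(idx >= 0 && static_cast<std::size_t>(idx) < buffers_.size());
        free_.push_back(static_cast<std::size_t>(idx));
    }

    std::vector<float>& buffer(int idx) {
        return buffers_[static_cast<std::size_t>(idx)];
    }

    std::size_t available() const { return free_.size(); }

private:
    std::vector<std::vector<float>> buffers_;
    std::vector<std::size_t> free_;  // stack of free buffer indices
};
```

A writer would call `acquire()` before filling a chunk and `release()` when the corresponding transfer completes; when `acquire()` returns -1 it has to wait for an outstanding transfer, which bounds the staging memory to the size of the pool.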

17 Pooled Buffers
[Figures: Read Performance Pooled Buffer; Write Performance Pooled Buffer. Legend: read, write]

18 Translators
Defines how to convert the data types back and forth between the host and the GPU.
    template<class T>
    class TrNOP {
    public:
        typedef T host_type;
        typedef T board_type;
        inline static void host2board(unsigned int const count, T *in, T *out,
                                      unsigned int const in_offset, unsigned int const out_offset);
        inline static void board2host(unsigned int const count, T *in, T *out,
                                      unsigned int const in_offset, unsigned int const out_offset);
    }; // end template
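Because a translator exposes only static functions and two typedefs, a buffering scheme can take it as a template parameter and apply it while filling the staging buffer. The sketch below illustrates that composition under stated assumptions: `NopTranslator` mirrors the TrNOP interface but its bodies (a plain element copy) are a guess, `write_translated` is a hypothetical helper, and a host vector named `board` stands in for GPU memory.

```cpp
#include <cassert>
#include <vector>

// Hypothetical pass-through translator with the same static interface as
// TrNOP; the function bodies are an assumption (a plain element copy).
template <class T>
struct NopTranslator {
    typedef T host_type;
    typedef T board_type;
    static void host2board(unsigned int count, const T* in, T* out,
                           unsigned int in_offset, unsigned int out_offset) {
        for (unsigned int i = 0; i < count; ++i)
            out[out_offset + i] = in[in_offset + i];
    }
    static void board2host(unsigned int count, const T* in, T* out,
                           unsigned int in_offset, unsigned int out_offset) {
        host2board(count, in, out, in_offset, out_offset);  // same copy, other direction
    }
};

// Moves `host` to `board` chunk by chunk; the translator converts the data
// as part of the copy into the staging buffer.
template <class Tr>
void write_translated(const std::vector<typename Tr::host_type>& host,
                      std::vector<typename Tr::board_type>& board,
                      unsigned int chunk_elems) {
    assert(chunk_elems > 0 && board.size() >= host.size());
    std::vector<typename Tr::board_type> staging(chunk_elems);
    const unsigned int total = static_cast<unsigned int>(host.size());
    for (unsigned int done = 0; done < total; done += chunk_elems) {
        const unsigned int n = (total - done < chunk_elems) ? total - done : chunk_elems;
        Tr::host2board(n, host.data(), staging.data(), done, 0);  // convert into staging
        for (unsigned int i = 0; i < n; ++i)
            board[done + i] = staging[i];                         // stands in for the transfer
    }
}
```

Since the translator is chosen at compile time, swapping the pass-through for a converting translator changes the type of the staging buffer without touching the transfer loop.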

19 Translator DP-SP
    template<typename T1, typename T2>
    class TrTemplate {
    public:
        typedef T1 host_type;
        typedef T2 board_type;
        inline static void host2board(unsigned int const count, T1 *in, T2 *out,
                                      unsigned int const in_offset, unsigned int const out_offset);
        inline static void board2host(unsigned int const count, T2 *in, T1 *out,
                                      unsigned int const in_offset, unsigned int const out_offset);
    }; // end template

    // Implementation of the template
    template<typename T1, typename T2>
    void TrTemplate<T1, T2>::host2board(unsigned int const count, T1 *in, T2 *out,
                                        unsigned int const in_offset, unsigned int const out_offset)
    {
        for (unsigned int i = 0; i < count; ++i)
            out[out_offset + i] = static_cast<T2>(in[in_offset + i]);
    }

    template<typename T1, typename T2>
    void TrTemplate<T1, T2>::board2host(unsigned int const count, T2 *in, T1 *out,
                                        unsigned int const in_offset, unsigned int const out_offset)
    {
        for (unsigned int i = 0; i < count; ++i)
            out[out_offset + i] = static_cast<T1>(in[in_offset + i]);
    }
- Makes a static cast between T1 and T2
- There is also an SSE-optimized version for double-float conversion
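A short usage example may help show what the DP-SP translation does to the data. This is a sketch modeled on the TrTemplate interface rather than the original class: the name `CastTranslator`, the const-qualified inputs, and the helper `round_trip` are assumptions made for the example. It sends doubles through a float buffer and back, which makes the precision loss of the conversion visible.

```cpp
#include <cassert>
#include <vector>

// Element-wise static_cast translator, modeled on the TrTemplate interface.
// The name and the const-qualified inputs are assumptions of this sketch.
template <typename T1, typename T2>
struct CastTranslator {
    typedef T1 host_type;
    typedef T2 board_type;
    static void host2board(unsigned int count, const T1* in, T2* out,
                           unsigned int in_offset, unsigned int out_offset) {
        for (unsigned int i = 0; i < count; ++i)
            out[out_offset + i] = static_cast<T2>(in[in_offset + i]);
    }
    static void board2host(unsigned int count, const T2* in, T1* out,
                           unsigned int in_offset, unsigned int out_offset) {
        for (unsigned int i = 0; i < count; ++i)
            out[out_offset + i] = static_cast<T1>(in[in_offset + i]);
    }
};

// Sends doubles through a float "board" buffer and back to the host.
std::vector<double> round_trip(const std::vector<double>& host) {
    assert(!host.empty());
    typedef CastTranslator<double, float> Tr;
    const unsigned int n = static_cast<unsigned int>(host.size());
    std::vector<float> board(n);   // single-precision copy, as the GPU would see it
    std::vector<double> back(n);
    Tr::host2board(n, host.data(), board.data(), 0, 0);
    Tr::board2host(n, board.data(), back.data(), 0, 0);
    return back;
}
```

Values that are exactly representable in single precision, such as 0.5 or 1.0, survive the round trip unchanged, while a value like 0.1 comes back rounded to the nearest float.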

20 Double Buffer DP-SP
[Figures: Read Performance Double Buffer DP-SP; Write Performance Double Buffer DP-SP. Legend: read, read DP-SP, write, write DP-SP]

21 Pooled Buffer DP-SP
[Figures: Read Performance Pooled Buffer DP-SP; Write Performance Pooled Buffer DP-SP. Legend: read, read DP-SP, write, write DP-SP]

22 Translator AOS-SOA
    // Implementation of the template
    template<typename T1, typename T2>
    void TrAoStoSoA<T1, T2>::host2board(
        unsigned int const count,
        void *in,
        void *out,
        unsigned int const in_offset,
        unsigned int const out_offset )
    {
        // implementation
        host_type * input = static_cast<host_type *>(in);
        board_type * output = static_cast<board_type *>(out);
        unsigned int i = 0;
        while (i < count) {
            output[out_offset+i] = static_cast<T2>(input[in_offset+i].x);
            output[count+out_offset+i] = static_cast<T2>(input[in_offset+i].y);
            output[count*2+out_offset+i] = static_cast<T2>(input[in_offset+i].z);
            output[count*3+out_offset+i] = static_cast<T2>(input[in_offset+i].a);
            ++i;
        }
    }

    template<typename T1, typename T2>
    void TrAoStoSoA<T1, T2>::board2host(
        unsigned int const count,
        void *in,
        void *out,
        unsigned int const in_offset,
        unsigned int const out_offset )
    {
        // implementation
        board_type * input = static_cast<board_type *>(in);
        host_type * output = static_cast<host_type *>(out);
        unsigned int i = 0;
        while (i < count) {
            output[out_offset+i].x = static_cast<T1>(input[in_offset+i]);
            output[out_offset+i].y = static_cast<T1>(input[count + in_offset+i]);
            output[out_offset+i].z = static_cast<T1>(input[count * 2 + in_offset+i]);
            output[out_offset+i].a = static_cast<T1>(input[count * 3 + in_offset+i]);
            ++i;
        }
    }
In our example, each host element has 4 fields of type T1 (floats), which are converted into 4 interleaved blocks of floats.
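The index arithmetic of the AOS-SOA translator is easier to follow on a tiny input. The sketch below uses the same indexing as the code above with both offsets set to zero; the struct `Particle` and the functions `aos_to_soa` and `soa_to_aos` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side record with the four fields used on the slide.
struct Particle { float x, y, z, a; };

// Converts `in` into four blocks of `count` floats each, laid out as
// [x... | y... | z... | a...], matching host2board with zero offsets.
std::vector<float> aos_to_soa(const std::vector<Particle>& in) {
    const std::size_t count = in.size();
    std::vector<float> out(4 * count);
    for (std::size_t i = 0; i < count; ++i) {
        out[i]             = in[i].x;
        out[count + i]     = in[i].y;
        out[count * 2 + i] = in[i].z;
        out[count * 3 + i] = in[i].a;
    }
    return out;
}

// Inverse conversion, matching board2host with zero offsets.
std::vector<Particle> soa_to_aos(const std::vector<float>& in) {
    assert(in.size() % 4 == 0);
    const std::size_t count = in.size() / 4;
    std::vector<Particle> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        out[i].x = in[i];
        out[i].y = in[count + i];
        out[i].z = in[count * 2 + i];
        out[i].a = in[count * 3 + i];
    }
    return out;
}
```

For two records {1, 2, 3, 4} and {5, 6, 7, 8}, `aos_to_soa` yields 1, 5, 2, 6, 3, 7, 4, 8: all x values first, then all y, z, and a values, and `soa_to_aos` restores the original records.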

23 Double Buffer AOS-SOA
[Figures: Read Performance Double Buffer AOS-SOA; Write Performance Double Buffer AOS-SOA. Legend: read, read AOS, write, write AOS]

24 Pooled Buffer AOS-SOA
[Figures: Read Performance Pooled Buffer DP-SP; Write Performance Pooled Buffer DP-SP. Legend: read, read DP-SP, write, write DP-SP]

25 Conclusions
- We present a way to compose buffering schemes with data transformations using templates
- We reduce the pinned memory needed to perform transfers
- We improve the performance of the transfers compared to simple CUDA implementations
Questions?
This work was done with support from the Volkswagen Foundation under the GRACE project


Structuur van Computerprogramma s 2

Structuur van Computerprogramma s 2 Structuur van Computerprogramma s 2 dr. Dirk Deridder Dirk.Deridder@vub.ac.be http://soft.vub.ac.be/ Vrije Universiteit Brussel - Faculty of Science and Bio-Engineering Sciences - Computer Science Department

More information

CUDA Performance Optimization. Patrick Legresley

CUDA Performance Optimization. Patrick Legresley CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 GPGPU History Current GPU Architecture OpenCL Framework Example (and its Optimization) Alternative Frameworks Most Recent Innovations 2 1996: 3Dfx Voodoo 1 First graphical (3D) accelerator

More information

Recap. ANSI C Reserved Words C++ Multimedia Programming Lecture 2. Erwin M. Bakker Joachim Rijsdam

Recap. ANSI C Reserved Words C++ Multimedia Programming Lecture 2. Erwin M. Bakker Joachim Rijsdam Multimedia Programming 2004 Lecture 2 Erwin M. Bakker Joachim Rijsdam Recap Learning C++ by example No groups: everybody should experience developing and programming in C++! Assignments will determine

More information

Scalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009

Scalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009 Scalable Multi Agent Simulation on the GPU Avi Bleiweiss NVIDIA Corporation San Jose, 2009 Reasoning Explicit State machine, serial Implicit Compute intensive Fits SIMT well Collision avoidance Motivation

More information

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1 Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

More information

Structures, Operators

Structures, Operators Structures Typedef Operators Type conversion Structures, Operators Basics of Programming 1 G. Horváth, A.B. Nagy, Z. Zsóka, P. Fiala, A. Vitéz 10 October, 2018 c based on slides by Zsóka, Fiala, Vitéz

More information

Compiling CUDA and Other Languages for GPUs. Vinod Grover and Yuan Lin

Compiling CUDA and Other Languages for GPUs. Vinod Grover and Yuan Lin Compiling CUDA and Other Languages for GPUs Vinod Grover and Yuan Lin Agenda Vision Compiler Architecture Scenarios SDK Components Roadmap Deep Dive SDK Samples Demos Vision Build a platform for GPU computing

More information

The Challenges of System Design. Raising Performance and Reducing Power Consumption

The Challenges of System Design. Raising Performance and Reducing Power Consumption The Challenges of System Design Raising Performance and Reducing Power Consumption 1 Agenda The key challenges Visibility for software optimisation Efficiency for improved PPA 2 Product Challenge - Software

More information

University of Technology. Laser & Optoelectronics Engineering Department. C++ Lab.

University of Technology. Laser & Optoelectronics Engineering Department. C++ Lab. University of Technology Laser & Optoelectronics Engineering Department C++ Lab. Second week Variables Data Types. The usefulness of the "Hello World" programs shown in the previous section is quite questionable.

More information

ION - Large pages for devices

ION - Large pages for devices ION - Large pages for devices John Einar Reitan Android/Mobile Microconference - LPC 2016 Motivation ARM Display + IOMMU need 2MB pages when rotating Native page size 4kB 64kB pages

More information

Fixed-Point Math and Other Optimizations

Fixed-Point Math and Other Optimizations Fixed-Point Math and Other Optimizations Embedded Systems 8-1 Fixed Point Math Why and How Floating point is too slow and integers truncate the data Floating point subroutines: slower than native, overhead

More information

Lecture 11: OpenCL and Altera OpenCL. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 11: OpenCL and Altera OpenCL. James C. Hoe Department of ECE Carnegie Mellon University 18 643 Lecture 11: OpenCL and Altera OpenCL James C. Hoe Department of ECE Carnegie Mellon University 18 643 F17 L11 S1, James C. Hoe, CMU/ECE/CALCM, 2017 Housekeeping Your goal today: understand Altera

More information

Design Patterns in C++

Design Patterns in C++ Design Patterns in C++ Metaprogramming applied Giuseppe Lipari http://retis.sssup.it/~lipari Scuola Superiore Sant Anna Pisa April 13, 2011 G. Lipari (Scuola Superiore Sant Anna) Metaprogramming applied

More information

libknx Generated by Doxygen Wed Aug :37:55

libknx Generated by Doxygen Wed Aug :37:55 libknx Generated by Doxygen 1.8.1.2 Wed Aug 7 2013 01:37:55 Contents 1 KNX interface library 1 2 Namespace Index 3 2.1 Namespace List............................................ 3 3 Class Index 5 3.1

More information

ME240 Computation for Mechanical Engineering. Lecture 4. C++ Data Types

ME240 Computation for Mechanical Engineering. Lecture 4. C++ Data Types ME240 Computation for Mechanical Engineering Lecture 4 C++ Data Types Introduction In this lecture we will learn some fundamental elements of C++: Introduction Data Types Identifiers Variables Constants

More information

C Programming. Course Outline. C Programming. Code: MBD101. Duration: 10 Hours. Prerequisites:

C Programming. Course Outline. C Programming. Code: MBD101. Duration: 10 Hours. Prerequisites: C Programming Code: MBD101 Duration: 10 Hours Prerequisites: You are a computer science Professional/ graduate student You can execute Linux/UNIX commands You know how to use a text-editing tool You should

More information

C++/Java. C++: Hello World. Java : Hello World. Expression Statement. Compound Statement. Declaration Statement

C++/Java. C++: Hello World. Java : Hello World. Expression Statement. Compound Statement. Declaration Statement C++/Java HB/C2/KERNELC++/1 Statements HB/C2/KERNELC++/2 Java: a pure object-oriented language. C++: a hybrid language. Allows multiple programming styles. Kernel C++/ a better C: the subset of C++ without

More information

GridKa School 2013: Effective Analysis C++ Templates

GridKa School 2013: Effective Analysis C++ Templates GridKa School 2013: Effective Analysis C++ Templates Introduction Jörg Meyer, Steinbuch Centre for Computing, Scientific Data Management KIT University of the State of Baden-Wuerttemberg and National Research

More information

Compile-time factorization

Compile-time factorization Compile-time factorization [crazy template meta-programming] Vladimir Mirnyy C++ Meetup, 15 Sep. 2015, C Base, Berlin blog.scientificcpp.com Compile-time factorization C++ Meetup, Sep. 2015 1 / 20 The

More information

Chapter 3. Fundamental Data Types

Chapter 3. Fundamental Data Types Chapter 3. Fundamental Data Types Byoung-Tak Zhang TA: Hanock Kwak Biointelligence Laboratory School of Computer Science and Engineering Seoul National Univertisy http://bi.snu.ac.kr Variable Declaration

More information

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:

More information

Structures for Heterogeneous Computing. The Best of Both Worlds: Flexible Data. Integrative Scientific Computing Max Planck Institut Informatik

Structures for Heterogeneous Computing. The Best of Both Worlds: Flexible Data. Integrative Scientific Computing Max Planck Institut Informatik The Best of Both Worlds: Flexible Data Structures for Heterogeneous Computing Robert Strzodka Integrative Scientific Computing Max Planck Institut Informatik My GPU Programming Challenges 2000: Conjugate

More information

Design Patterns in C++

Design Patterns in C++ Design Patterns in C++ Template metaprogramming Giuseppe Lipari http://retis.sssup.it/~lipari Scuola Superiore Sant Anna Pisa April 6, 2011 G. Lipari (Scuola Superiore Sant Anna) Template metaprogramming

More information

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 GPGPU History Current GPU Architecture OpenCL Framework Example Optimizing Previous Example Alternative Architectures 2 1996: 3Dfx Voodoo 1 First graphical (3D) accelerator for desktop

More information