Improving Host-GPU Communication with Buffering Schemes


Guillermo Marcus, University of Heidelberg

Overview
- Motivation
- Buffering Schemes
- Converting data in the loop

Why? We know the benefits of double and pooled buffers in DMA transactions, so why not use them with GPUs? When using an accelerator, the data format on the GPU and in the application usually do not match. And for some applications we do not want to reserve multi-gigabyte pinned host-memory buffers just for transfers.

Transfers in CUDA (bandwidthTest)
Running on Device: GeForce GTX 480, Quick Mode
- Host to Device bandwidth, 1 device, pinned memory: 5767.2 MB/s
- Device to Host bandwidth, 1 device, pinned memory: 6160 MB/s
- Device to Device bandwidth, 1 device: 149346.2 MB/s
[bandwidthtest] test results: PASSED
These numbers serve as the CUDA performance reference in the plots that follow.

With data conversions
- Convert data from double to single precision float (DP-SP)
- Convert data from AoS (array of structures) to SoA (structure of arrays)
In both cases the data must now pass through the CPU.
[Figure: read and write bandwidth with DP-SP and AoS-SoA conversions, compared to the CUDA performance reference.]

Using Buffering Schemes
[Diagram: the application hands data to a buffer manager, which moves it through a queue of host memory buffers, with copy and translation steps, into the board (GPU) memory.]
A buffering scheme provides one or more host memory buffers paired with a GPU buffer, and implements the typical schemes: (a) CHUNK, (b) DOUBLE, (c) POOLED.

Chunk Buffer

DMAopsCUDA::board_type device;                          // select the CUDA device
DMAopsCUDA::buffer_type buf1(CHUNK, MAX, sizeof(int));  // buffer, including the device memory
ChunkBuffer<int, DMAopsCUDA> chunk_buffer_test(device, buf1,
                                               buf1.getbuffersize()); // create the buffer manager

int *data = (int *)malloc(sizeof(int) * MAX);
chunk_buffer_test.write(data, MAX, 0, 0, true, false);
...
chunk_buffer_test.read(check, MAX, 0, 0, true, false);

[Figure: Chunk Buffer read and write performance vs. transfer size (1 KB to 128 MB), compared to the CUDA performance reference.]

Double Buffer

DMAopsCUDA::board_type device; // CUDA device
DMAopsCUDA::buffer_type buf2_1(CHUNK, MAX, sizeof(HT));
DMAopsCUDA::buffer_type buf2_2(buf2_1); // second buffer, linked to the device memory of buffer 1

DoubleBuffer<HT, DMAopsCUDA> double_buffer_test(device, buf2_1, buf2_2,
                                                buf2_1.getbuffersize());
...
double_buffer_test.write(data2, MAX, 0, 0, true, false);
double_buffer_test.read(check2, MAX, 0, 0, true, false);

[Figure: Double Buffer read and write performance vs. transfer size (1 KB to 128 MB), compared to the CUDA performance reference.]

Pooled Buffer

DMAopsCUDA::board_type device; // CUDA device

DMAPool<HT, DMAopsCUDA> pool(NBUF, CHUNK, MAX, sizeof(HT)); // create a pool of buffers
PooledBuffer<HT, DMAopsCUDA> pooled_buffer_test(sizeof(HT), device, pool);

pooled_buffer_test.write(data3, MAX, 0, 0, true, false);
...
pooled_buffer_test.read(check3, MAX, 0, 0, true, false);

[Figure: Pooled Buffer read and write performance vs. transfer size (1 KB to 128 MB), compared to the CUDA performance reference.]

Translators
A translator defines how to convert the data back and forth between the host and GPU data types:

template<class T>
class TrNOP {
public:
    typedef T host_type;
    typedef T board_type;
    inline static void host2board(unsigned int const count, T *in, T *out,
                                  unsigned int const in_offset, unsigned int const out_offset);
    inline static void board2host(unsigned int const count, T *in, T *out,
                                  unsigned int const in_offset, unsigned int const out_offset);
}; // end template

Translator DP-SP
Performs a static_cast between T1 and T2. There is also an SSE-optimized version for the double-float conversion.

template<typename T1, typename T2>
class TrTemplate {
public:
    typedef T1 host_type;
    typedef T2 board_type;
    inline static void host2board(unsigned int const count, T1 *in, T2 *out,
                                  unsigned int const in_offset, unsigned int const out_offset);
    inline static void board2host(unsigned int const count, T2 *in, T1 *out,
                                  unsigned int const in_offset, unsigned int const out_offset);
}; // end template

// Implementation of the template
template<typename T1, typename T2>
void TrTemplate<T1, T2>::host2board(unsigned int const count, T1 *in, T2 *out,
                                    unsigned int const in_offset, unsigned int const out_offset)
{
    for (int i = 0; i < count; ++i)
        out[out_offset + i] = static_cast<T2>(in[in_offset + i]);
}

template<typename T1, typename T2>
void TrTemplate<T1, T2>::board2host(unsigned int const count, T2 *in, T1 *out,
                                    unsigned int const in_offset, unsigned int const out_offset)
{
    for (int i = 0; i < count; ++i)
        out[out_offset + i] = static_cast<T1>(in[in_offset + i]);
}

[Figure: Double Buffer DP-SP read and write performance (with and without the DP-SP conversion) vs. transfer size, compared to the CUDA performance reference.]

[Figure: Pooled Buffer DP-SP read and write performance (with and without the DP-SP conversion) vs. transfer size, compared to the CUDA performance reference.]

Translator AOS-SOA
In our example, each host element is a structure with 4 fields of type T1, converted into 4 contiguous blocks of T2 (one block per field) on the board.

// Implementation of the template
template<typename T1, typename T2>
void TrAoStoSoA<T1, T2>::host2board(unsigned int const count, void *in, void *out,
                                    unsigned int const in_offset, unsigned int const out_offset)
{
    host_type *input = static_cast<host_type *>(in);
    board_type *output = static_cast<board_type *>(out);
    unsigned int i = 0;
    while (i < count) {
        output[out_offset + i]             = static_cast<T2>(input[in_offset + i].x);
        output[count + out_offset + i]     = static_cast<T2>(input[in_offset + i].y);
        output[count * 2 + out_offset + i] = static_cast<T2>(input[in_offset + i].z);
        output[count * 3 + out_offset + i] = static_cast<T2>(input[in_offset + i].a);
        ++i;
    }
}

template<typename T1, typename T2>
void TrAoStoSoA<T1, T2>::board2host(unsigned int const count, void *in, void *out,
                                    unsigned int const in_offset, unsigned int const out_offset)
{
    board_type *input = static_cast<board_type *>(in);
    host_type *output = static_cast<host_type *>(out);
    unsigned int i = 0;
    while (i < count) {
        output[out_offset + i].x = static_cast<T1>(input[in_offset + i]);
        output[out_offset + i].y = static_cast<T1>(input[count + in_offset + i]);
        output[out_offset + i].z = static_cast<T1>(input[count * 2 + in_offset + i]);
        output[out_offset + i].a = static_cast<T1>(input[count * 3 + in_offset + i]);
        ++i;
    }
}

[Figure: Double Buffer AoS-SoA read and write performance (with and without the AoS-SoA conversion) vs. transfer size, compared to the CUDA performance reference.]

[Figure: Pooled Buffer AoS-SoA read and write performance (with and without the conversion) vs. transfer size, compared to the CUDA performance reference.]

Conclusions
- We present a way to compose buffering schemes with data transformations using templates.
- We reduce the pinned memory needed to perform transfers.
- We improve transfer performance compared to simple CUDA implementations.
Questions?
This work was done with support from the Volkswagen Foundation under the GRACE project.