Improving Host-GPU Communication with Buffering Schemes

1 Improving Host-GPU Communication with Buffering Schemes. Guillermo Marcus, University of Heidelberg

2 Overview: Motivation; Buffering Schemes; Converting data in the loop

3 Why? We know about the benefits of double/pooled buffers in DMA transactions. Why not use them with GPUs? When using an accelerator, the data format on the GPU and in the application usually do not match. For some applications, we do not want to reserve multi-gigabyte buffers of host memory for transfers.

4 Transfers in CUDA [bandwidthTest output: device GeForce GTX 48, Quick Mode, 1 device, pinned memory; Host to Device, Device to Host, and Device to Device bandwidth (MB/s) by transfer size (bytes); values not recoverable; test result: PASSED] [Plot: CUDA Performance Reference; series: read, write]

5 With data conversions [Plot: CUDA Performance Reference; series: read, read DP-SP, read AOS-SOA, write, write DP-SP, write AOS-SOA] Convert data from double to single precision float. Convert data from AOS to SOA. In both cases the data now has to pass through the CPU.

6 Using Buffering Schemes [Diagram labels: App, Buffer Manager, Memory, DMA Buffer, Queue, Board, Copy, Translation] Provides one or more memory buffers paired with a GPU buffer. Implements typical schemes: CHUNK, DOUBLE, POOLED.

7 Chunk Buffer

    // Select the CUDA device
    DMAopsCUDA::board_type device();
    // Buffer, including the device memory
    DMAopsCUDA::buffer_type buf1(CHUNK, MAX, sizeof(int));
    // Create the buffer manager
    ChunkBuffer< int, DMAopsCUDA > chunk_buffer_test(device, buf1, buf1.getbuffersize());

    int * data = (int*) malloc(sizeof(int)*MAX);
    chunk_buffer_test.write(data, MAX,,, true, false);
    ...
    chunk_buffer_test.read(check, MAX,,, true, false);
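The idea behind a chunk buffer can be illustrated without the library: a large host array is moved through one small staging buffer, one chunk at a time. The following is only a minimal sketch under stated assumptions, not the ChunkBuffer implementation from the slides. `FakeDevice`, `upload`, and `chunked_write` are hypothetical names, and a plain `memcpy` stands in for the DMA transfer so the example runs without a GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a GPU: host memory playing the role of device memory.
struct FakeDevice {
    std::vector<int> memory;
    explicit FakeDevice(std::size_t n) : memory(n, 0) {}

    // Stand-in for a DMA transfer from the staging buffer to device memory.
    void upload(const int* staging, std::size_t count, std::size_t device_offset) {
        assert(device_offset + count <= memory.size());
        std::memcpy(memory.data() + device_offset, staging, count * sizeof(int));
    }
};

// Copies `count` elements to the device through a staging buffer that holds
// only `chunk` elements. Returns the number of transfers issued.
std::size_t chunked_write(FakeDevice& dev, const int* src,
                          std::size_t count, std::size_t chunk) {
    assert(chunk > 0);
    std::vector<int> staging(chunk);  // would be pinned memory in a real CUDA program
    std::size_t transfers = 0;
    for (std::size_t done = 0; done < count; done += chunk) {
        const std::size_t n = std::min(chunk, count - done);  // last chunk may be short
        std::memcpy(staging.data(), src + done, n * sizeof(int));
        dev.upload(staging.data(), n, done);
        ++transfers;
    }
    return transfers;
}
```

With this scheme the staging memory is bounded by the chunk size rather than by the total transfer size; the cost is that filling the buffer and transferring it happen strictly one after the other.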

11 Chunk Buffer [Plots: Chunk Buffer Read Performance; Chunk Buffer Write Performance; series: read, write]

12 Double Buffer

    DMAopsCUDA::board_type device(); // CUDA device
    DMAopsCUDA::buffer_type buf2_1(CHUNK, MAX, sizeof(HT));
    // Second buffer, linked to the device memory of buffer 1
    DMAopsCUDA::buffer_type buf2_2(buf2_1);

    DoubleBuffer< HT, DMAopsCUDA > double_buffer_test(device, buf2_1, buf2_2, buf2_1.getbuffersize());
    ...
    double_buffer_test.write(data2, MAX,,, true, false);
    double_buffer_test.read(check2, MAX,,, true, false);
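The ordering constraint of double buffering is that a staging buffer may only be refilled after its previous transfer has completed. The sketch below shows that constraint in isolation and is not the DoubleBuffer class from the slides. `FakeAsyncCopy` is a hypothetical stand-in for an asynchronous copy engine: it only records a transfer when issued and performs it on `wait()`, so refilling a buffer too early would produce wrong data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical asynchronous copy: issue() records the transfer, wait() performs it.
class FakeAsyncCopy {
public:
    void issue(float* dst, const float* src, std::size_t n) {
        assert(!busy_);  // one outstanding transfer per staging buffer
        dst_ = dst; src_ = src; n_ = n; busy_ = true;
    }
    void wait() {
        if (busy_) {
            std::memcpy(dst_, src_, n_ * sizeof(float));
            busy_ = false;
        }
    }
private:
    float* dst_ = nullptr;
    const float* src_ = nullptr;
    std::size_t n_ = 0;
    bool busy_ = false;
};

// Alternates between two staging buffers: while one is "in flight",
// the other one is filled with the next chunk.
void double_buffered_write(std::vector<float>& device,
                           const std::vector<float>& src, std::size_t chunk) {
    assert(chunk > 0 && device.size() >= src.size());
    std::vector<float> staging[2] = {std::vector<float>(chunk),
                                     std::vector<float>(chunk)};
    FakeAsyncCopy copy[2];
    int cur = 0;
    for (std::size_t done = 0; done < src.size(); done += chunk) {
        const std::size_t n = std::min(chunk, src.size() - done);
        copy[cur].wait();  // the buffer is reusable only after its last transfer finished
        std::memcpy(staging[cur].data(), src.data() + done, n * sizeof(float));
        copy[cur].issue(device.data() + done, staging[cur].data(), n);
        cur = 1 - cur;     // switch to the other staging buffer
    }
    copy[0].wait();        // drain the outstanding transfers
    copy[1].wait();
}
```

In a real CUDA program the deferred copy would presumably be an asynchronous transfer on a stream, with the wait implemented as a stream or event synchronization.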

14 Double Buffer [Plots: Double Buffer Read Performance; Double Buffer Write Performance; series: read, write]

15 Pooled Buffer

    DMAopsCUDA::board_type device(); // CUDA device
    // Create a pool of buffers
    DMAPool< HT, DMAopsCUDA > pool(nbuf, CHUNK, MAX, sizeof(HT));
    PooledBuffer< HT, DMAopsCUDA > pooled_buffer_test(sizeof(HT), device, pool);
    pooled_buffer_test.write(data3, MAX,,, true, false);
    ...
    pooled_buffer_test.read(check3, MAX,,, true, false);
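A pooled scheme generalizes double buffering to N staging buffers: a buffer is taken from a free list, filled, queued for transfer, and returned to the pool once its transfer completes. The sketch below illustrates this bookkeeping only; it is not the DMAPool or PooledBuffer code from the slides. `BufferPool`, `InFlight`, and `pooled_write` are hypothetical names, and the transfer is simulated by a copy that is deferred until the buffer is needed again.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical pool of equally sized staging buffers with a free list.
class BufferPool {
public:
    BufferPool(std::size_t nbuf, std::size_t chunk)
        : buffers_(nbuf, std::vector<int>(chunk)) {
        assert(nbuf > 0 && chunk > 0);
        for (std::size_t i = 0; i < nbuf; ++i) free_.push(i);
    }
    std::size_t available() const { return free_.size(); }
    std::size_t acquire() {
        assert(!free_.empty());
        const std::size_t id = free_.front();
        free_.pop();
        return id;
    }
    void release(std::size_t id) { free_.push(id); }
    std::vector<int>& buffer(std::size_t id) { return buffers_[id]; }
private:
    std::vector<std::vector<int>> buffers_;
    std::queue<std::size_t> free_;
};

// A queued transfer: which buffer, where it goes on the device, how many elements.
struct InFlight { std::size_t id, offset, count; };

void pooled_write(std::vector<int>& device, const std::vector<int>& src,
                  BufferPool& pool, std::size_t chunk) {
    assert(chunk > 0 && device.size() >= src.size());
    std::queue<InFlight> in_flight;
    // Simulated completion: perform the oldest transfer and recycle its buffer.
    auto complete_oldest = [&] {
        const InFlight t = in_flight.front();
        in_flight.pop();
        std::copy_n(pool.buffer(t.id).begin(), t.count, device.begin() + t.offset);
        pool.release(t.id);
    };
    for (std::size_t done = 0; done < src.size(); done += chunk) {
        if (pool.available() == 0) complete_oldest();  // pool exhausted: wait for a buffer
        const std::size_t id = pool.acquire();
        const std::size_t n = std::min(chunk, src.size() - done);
        assert(n <= pool.buffer(id).size());
        std::copy_n(src.begin() + done, n, pool.buffer(id).begin());
        in_flight.push({id, done, n});
    }
    while (!in_flight.empty()) complete_oldest();      // drain remaining transfers
}
```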

17 Pooled Buffer [Plots: Pooled Buffers Read Performance; Pooled Buffer Write Performance; series: read, write]

18 Translators
A translator defines how to convert data back and forth between the host data type and the GPU data type.

    template<class T>
    class TrNOP {
    public:
        typedef T host_type;
        typedef T board_type;
        inline static void host2board(unsigned int const count, T *in, T *out,
                                      unsigned int const in_offset, unsigned int const out_offset);
        inline static void board2host(unsigned int const count, T *in, T *out,
                                      unsigned int const in_offset, unsigned int const out_offset);
    }; // end template
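To illustrate how a translator can be composed with a buffering scheme through templates, here is a minimal sketch in which the conversion runs while the staging buffer is being filled. It is written under assumptions and is not the library's code: `TrNOPSketch`, `TrCastSketch`, and `translated_write` are hypothetical names, the translator bodies are guesses at the simplest possible behavior, and a plain copy stands in for the transfer to the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical no-op translator: host and board types are identical.
template <class T>
struct TrNOPSketch {
    typedef T host_type;
    typedef T board_type;
    static void host2board(unsigned int count, const T* in, T* out,
                           unsigned int in_offset, unsigned int out_offset) {
        std::copy(in + in_offset, in + in_offset + count, out + out_offset);
    }
};

// Hypothetical casting translator: converts each element with static_cast.
template <class Host, class Board>
struct TrCastSketch {
    typedef Host host_type;
    typedef Board board_type;
    static void host2board(unsigned int count, const Host* in, Board* out,
                           unsigned int in_offset, unsigned int out_offset) {
        for (unsigned int i = 0; i < count; ++i)
            out[out_offset + i] = static_cast<Board>(in[in_offset + i]);
    }
};

// Chunked write parameterized on the translator: the conversion happens
// while the staging buffer is filled, so no full-size converted copy is needed.
template <class Translator>
void translated_write(std::vector<typename Translator::board_type>& device,
                      const std::vector<typename Translator::host_type>& src,
                      unsigned int chunk) {
    typedef typename Translator::board_type board_type;
    assert(chunk > 0 && device.size() >= src.size());
    std::vector<board_type> staging(chunk);
    const unsigned int total = static_cast<unsigned int>(src.size());
    for (unsigned int done = 0; done < total; done += chunk) {
        const unsigned int n = std::min(chunk, total - done);
        Translator::host2board(n, src.data(), staging.data(), done, 0);
        // Stand-in for the DMA transfer of the translated chunk.
        std::copy(staging.begin(), staging.begin() + n, device.begin() + done);
    }
}
```

Because the translator is a template parameter, swapping the conversion means changing one type argument, for example `translated_write<TrCastSketch<double, float> >(dev, src, chunk)`.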

19 Translator DP-SP

    template<typename T1, typename T2>
    class TrTemplate {
    public:
        typedef T1 host_type;
        typedef T2 board_type;
        inline static void host2board(unsigned int const count, T1 *in, T2 *out,
                                      unsigned int const in_offset, unsigned int const out_offset);
        inline static void board2host(unsigned int const count, T2 *in, T1 *out,
                                      unsigned int const in_offset, unsigned int const out_offset);
    }; // end template

    // Implementation of the template
    template<typename T1, typename T2>
    void TrTemplate<T1, T2>::host2board(unsigned int const count, T1 *in, T2 *out,
                                        unsigned int const in_offset, unsigned int const out_offset)
    {
        // Convert each host element to the board type
        for (unsigned int i = 0; i < count; ++i)
            out[out_offset+i] = static_cast<T2>(in[in_offset+i]);
    }

    template<typename T1, typename T2>
    void TrTemplate<T1, T2>::board2host(unsigned int const count, T2 *in, T1 *out,
                                        unsigned int const in_offset, unsigned int const out_offset)
    {
        // Convert each board element back to the host type
        for (unsigned int i = 0; i < count; ++i)
            out[out_offset+i] = static_cast<T1>(in[in_offset+i]);
    }

Performs a static_cast between T1 and T2. There is also an SSE-optimized version for double-float conversion.
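The slides mention an SSE-optimized double-to-float conversion but do not show it. The following is one possible sketch of such a routine, not the author's version: `double_to_float` and `to_float` are hypothetical names, and the code assumes an x86 target with SSE2, falling back to the scalar loop elsewhere. It converts four doubles per iteration using `_mm_cvtpd_ps`, which produces two floats from two doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#endif

// Converts `count` doubles to floats; vectorized where SSE2 is available.
void double_to_float(const double* in, float* out, std::size_t count) {
    assert(count == 0 || (in != nullptr && out != nullptr));
    std::size_t i = 0;
#ifdef SKETCH_HAS_SSE2
    for (; i + 4 <= count; i += 4) {
        // Each conversion yields two floats in the low half of the register.
        const __m128 lo = _mm_cvtpd_ps(_mm_loadu_pd(in + i));
        const __m128 hi = _mm_cvtpd_ps(_mm_loadu_pd(in + i + 2));
        // Combine the two low halves into four floats and store them unaligned.
        _mm_storeu_ps(out + i, _mm_movelh_ps(lo, hi));
    }
#endif
    // Scalar tail (or the whole range when SSE2 is not available).
    for (; i < count; ++i) out[i] = static_cast<float>(in[i]);
}

// Convenience wrapper for whole vectors.
std::vector<float> to_float(const std::vector<double>& in) {
    std::vector<float> out(in.size());
    double_to_float(in.data(), out.data(), in.size());
    return out;
}
```

Unaligned loads and stores are used so the routine works at arbitrary offsets into a buffer; aligned variants may be faster when the buffers are known to be 16-byte aligned.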

20 Double Buffer DP-SP [Plots: Double Buffer DP-SP Read Performance; Double Buffer DP-SP Write Performance; series: read, read DP-SP, write, write DP-SP]

21 Pooled Buffer DP-SP [Plots: Pooled Buffer DP-SP Read Performance; Pooled Buffer DP-SP Write Performance; series: read, read DP-SP, write, write DP-SP]

22 Translator AOS-SOA

    // Implementation of the template
    template<typename T1, typename T2>
    void TrAoStoSoA<T1, T2>::host2board(unsigned int const count, void *in, void *out,
                                        unsigned int const in_offset, unsigned int const out_offset)
    {
        host_type * input = static_cast<host_type *>(in);
        board_type * output = static_cast<board_type *>(out);
        // Scatter each field of the structure into its own block of `count` elements
        for (unsigned int i = 0; i < count; ++i) {
            output[out_offset+i]         = static_cast<T2>(input[in_offset+i].x);
            output[count+out_offset+i]   = static_cast<T2>(input[in_offset+i].y);
            output[count*2+out_offset+i] = static_cast<T2>(input[in_offset+i].z);
            output[count*3+out_offset+i] = static_cast<T2>(input[in_offset+i].a);
        }
    }

    template<typename T1, typename T2>
    void TrAoStoSoA<T1, T2>::board2host(unsigned int const count, void *in, void *out,
                                        unsigned int const in_offset, unsigned int const out_offset)
    {
        board_type * input = static_cast<board_type *>(in);
        host_type * output = static_cast<host_type *>(out);
        // Gather the four blocks back into the fields of each structure
        for (unsigned int i = 0; i < count; ++i) {
            output[out_offset+i].x = static_cast<T1>(input[in_offset+i]);
            output[out_offset+i].y = static_cast<T1>(input[count + in_offset+i]);
            output[out_offset+i].z = static_cast<T1>(input[count * 2 + in_offset+i]);
            output[out_offset+i].a = static_cast<T1>(input[count * 3 + in_offset+i]);
        }
    }

In our example, each host element is a structure of 4 values of type T1 (floats), which is converted into 4 blocks of floats, one block per field (x, y, z, a).
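A small worked example makes the resulting memory layout concrete. This sketch mirrors the indexing of the translator above with both offsets set to zero; `Vec4`, `aos_to_soa`, and `soa_to_aos` are hypothetical names chosen for illustration, and the four-float structure is an assumption based on the fields x, y, z, a used in the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed host element: four floats per structure.
struct Vec4 { float x, y, z, a; };

// Array of structures -> structure of arrays: four blocks of `count` floats.
std::vector<float> aos_to_soa(const std::vector<Vec4>& in) {
    const std::size_t count = in.size();
    std::vector<float> out(4 * count);
    for (std::size_t i = 0; i < count; ++i) {
        out[i]             = in[i].x;
        out[count + i]     = in[i].y;
        out[2 * count + i] = in[i].z;
        out[3 * count + i] = in[i].a;
    }
    return out;
}

// Inverse conversion: gather the four blocks back into structures.
std::vector<Vec4> soa_to_aos(const std::vector<float>& in) {
    assert(in.size() % 4 == 0);
    const std::size_t count = in.size() / 4;
    std::vector<Vec4> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        out[i].x = in[i];
        out[i].y = in[count + i];
        out[i].z = in[2 * count + i];
        out[i].a = in[3 * count + i];
    }
    return out;
}
```

For two elements (x0, y0, z0, a0) and (x1, y1, z1, a1), the converted array is laid out as x0, x1, y0, y1, z0, z1, a0, a1. Since the block stride equals the element count passed to the translator, translating a transfer chunk by chunk would lay out each chunk as its own set of four blocks.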

23 Double Buffer AOS-SOA [Plots: Double Buffer AOS-SOA Read Performance; Double Buffer AOS-SOA Write Performance; series: read, read AOS, write, write AOS]

24 Pooled Buffer DP-SP [Plots: Pooled Buffer AOS-SOA Read Performance; Pooled Buffer DP-SP Write Performance; series: read, read DP-SP, write, write DP-SP]

25 Conclusions. We present a way to compose buffering schemes with data transformations using templates. We reduce the pinned memory needed to perform transfers. We improve the performance of the transfers in comparison to simple CUDA implementations. Questions? This work was done with support from the Volkswagen Foundation under the GRACE project.
