Improving Host-GPU Communication with Buffering Schemes
Guillermo Marcus, University of Heidelberg
Overview
- Motivation
- Buffering Schemes
- Converting data in the loop
Why
We know the benefits of double and pooled buffers in DMA transactions, so why not use them with GPUs?
- When using an accelerator, most of the time the data format on the GPU and in the application do not match.
- For some applications, we do not want to reserve multi-gigabyte pinned host-memory buffers for transfers.
Transfers in CUDA

Output of the CUDA bandwidthTest sample:

  Running on... Device: GeForce GTX 48, Quick Mode
  Host to Device Bandwidth, 1 Device(s), Pinned memory:   5767.2 MB/s
  Device to Host Bandwidth, 1 Device(s), Pinned memory:   616. MB/s
  Device to Device Bandwidth, 1 Device(s):                149346.2 MB/s
  [bandwidthtest] test results... PASSED

[Chart: read and write bandwidth vs. transfer size, used below as the CUDA performance reference]
With data conversions

[Chart: read and write bandwidth with and without DP-SP and AOS-SOA conversions, vs. the CUDA performance reference]

- Convert data from double precision to single precision float (DP-SP)
- Convert data from an array of structures (AOS) to a structure of arrays (SOA)
- In both cases, the data now needs to pass through the CPU
Using Buffering Schemes

[Diagram: application, buffer manager, host buffers, and GPU buffer]

Provides one or more memory buffers paired with a GPU buffer. Implements the typical schemes: (a) chunk, (b) double, and (c) pooled buffers.
Chunk Buffer

DMAopsCUDA::board_type device;                          // select the CUDA device
DMAopsCUDA::buffer_type buf1(chunk, MAX, sizeof(int));  // buffer, including the device memory
ChunkBuffer< int, DMAopsCUDA >                          // create the buffer manager
    chunk_buffer_test(device, buf1, buf1.getbuffersize());

int * data = (int*) malloc(sizeof(int) * MAX);
chunk_buffer_test.write(data, MAX, 0, 0, true, false);
...
chunk_buffer_test.read(check, MAX, 0, 0, true, false);
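The chunk scheme above can be sketched in plain C++: a large transfer is staged through one small fixed-size (in the real scheme, pinned) buffer, so only a chunk's worth of host memory has to be reserved. The names below (`chunk_write`, `CHUNK`) are illustrative, not the library's API, and `memcpy` stands in for both the host-to-staging copy and the DMA transfer.

```cpp
#include <cstring>
#include <vector>

static const unsigned int CHUNK = 1024; // elements in the single staging buffer

// Stage a large host array into 'device' memory through one reusable chunk
// buffer, instead of pinning the whole source array at once.
void chunk_write(const int *src, int *device_dst, unsigned int count) {
    std::vector<int> staging(CHUNK);
    for (unsigned int done = 0; done < count; ) {
        unsigned int n = (count - done < CHUNK) ? (count - done) : CHUNK;
        std::memcpy(staging.data(), src + done, n * sizeof(int));        // host -> staging
        std::memcpy(device_dst + done, staging.data(), n * sizeof(int)); // staging -> device (DMA)
        done += n;
    }
}
```

With a single chunk buffer the fill and the transfer serialize; the double and pooled schemes on the following slides exist precisely to overlap them.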
Chunk Buffer
[Charts: Chunk Buffer read and write performance vs. transfer size (bytes), compared to the CUDA performance reference]
Double Buffer

DMAopsCUDA::board_type device;                          // CUDA device
DMAopsCUDA::buffer_type buf2_1(chunk, MAX, sizeof(HT));
DMAopsCUDA::buffer_type buf2_2(buf2_1);                 // second buffer, linked to the device memory of buffer 1

DoubleBuffer< HT, DMAopsCUDA >
    double_buffer_test(device, buf2_1, buf2_2, buf2_1.getbuffersize());
...
double_buffer_test.write(data2, MAX, 0, 0, true, false);
double_buffer_test.read(check2, MAX, 0, 0, true, false);
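The double-buffer scheme can be sketched as a ping-pong loop: each iteration, one staging buffer is filled while the other one is transferred, and the roles swap. The sketch below is sequential C++ (the names are illustrative), so the overlap is only structural; in the real scheme the transfer of one buffer runs asynchronously while the other is filled.

```cpp
#include <algorithm>
#include <cstring>
#include <vector>

static const unsigned int DB_CHUNK = 1024; // elements per staging buffer

// Ping-pong between two staging buffers: 'active' is filled from the host,
// then transferred; the roles alternate every iteration.
void double_buffer_write(const float *src, float *device_dst, unsigned int count) {
    std::vector<float> staging[2] = { std::vector<float>(DB_CHUNK),
                                      std::vector<float>(DB_CHUNK) };
    int active = 0;
    for (unsigned int done = 0; done < count; ) {
        unsigned int n = std::min(count - done, DB_CHUNK);
        std::memcpy(staging[active].data(), src + done, n * sizeof(float));        // fill
        std::memcpy(device_dst + done, staging[active].data(), n * sizeof(float)); // transfer
        done += n;
        active ^= 1; // swap roles for the next iteration
    }
}
```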
Double Buffer
[Charts: Double Buffer read and write performance vs. transfer size (bytes), compared to the CUDA performance reference]
Pooled Buffer

DMAopsCUDA::board_type device;   // CUDA device

DMAPool< HT, DMAopsCUDA > pool(nbuf, CHUNK, MAX, sizeof(HT));   // create a pool of buffers
PooledBuffer< HT, DMAopsCUDA >
    pooled_buffer_test(sizeof(HT), device, pool);

pooled_buffer_test.write(data3, MAX, 0, 0, true, false);
...
pooled_buffer_test.read(check3, MAX, 0, 0, true, false);
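A pooled scheme generalizes double buffering to N staging buffers managed as a free list: a producer acquires a free buffer and fills it, and the transfer engine returns it to the pool when the DMA completes. A minimal sketch of such a pool, with illustrative names rather than the library's `DMAPool` API:

```cpp
#include <cstddef>
#include <deque>
#include <stdexcept>
#include <vector>

// Minimal buffer pool: nbuf fixed-size buffers plus a free list of indices.
class BufferPool {
public:
    BufferPool(std::size_t nbuf, std::size_t buf_elems)
        : buffers_(nbuf, std::vector<float>(buf_elems)) {
        for (std::size_t i = 0; i < nbuf; ++i) free_.push_back(i);
    }
    // Acquire a free buffer; a real scheme would block until one is returned.
    std::size_t acquire() {
        if (free_.empty()) throw std::runtime_error("pool exhausted");
        std::size_t id = free_.front();
        free_.pop_front();
        return id;
    }
    void release(std::size_t id) { free_.push_back(id); } // transfer finished
    std::vector<float> &buffer(std::size_t id) { return buffers_[id]; }
    std::size_t available() const { return free_.size(); }
private:
    std::vector<std::vector<float> > buffers_;
    std::deque<std::size_t> free_;
};
```

With enough buffers in flight, filling, converting, and transferring can all proceed concurrently without reserving pinned memory for the whole data set.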
Pooled Buffer
[Charts: Pooled Buffer read and write performance vs. transfer size (bytes), compared to the CUDA performance reference]
Translators
Define how to convert the data types back and forth between the host and the GPU.

template<class T>
class TrNOP {
public:
    typedef T host_type;
    typedef T board_type;
    inline static void host2board(unsigned int const count, T *in, T *out,
                                  unsigned int const in_offset, unsigned int const out_offset);
    inline static void board2host(unsigned int const count, T *in, T *out,
                                  unsigned int const in_offset, unsigned int const out_offset);
}; // end template
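The translator is a policy class: the buffering scheme never converts data itself, it just calls `Translator::host2board` while filling a staging buffer. A sketch of that plug-in point, where the `stage_block` helper is hypothetical (not part of the library) and TrNOP is given trivial bodies so the example is self-contained:

```cpp
// Hypothetical helper showing how a buffering scheme uses a translator policy:
// convert 'count' host elements into the staging buffer before the DMA step.
template<class Translator>
void stage_block(typename Translator::host_type *host,
                 typename Translator::board_type *staging,
                 unsigned int count) {
    Translator::host2board(count, host, staging, 0, 0);
}

// The no-op translator from the slide, with plain-copy implementations.
template<class T>
class TrNOP {
public:
    typedef T host_type;
    typedef T board_type;
    static void host2board(unsigned int const count, T *in, T *out,
                           unsigned int const in_offset, unsigned int const out_offset) {
        for (unsigned int i = 0; i < count; ++i)
            out[out_offset + i] = in[in_offset + i];
    }
    static void board2host(unsigned int const count, T *in, T *out,
                           unsigned int const in_offset, unsigned int const out_offset) {
        for (unsigned int i = 0; i < count; ++i)
            out[out_offset + i] = in[in_offset + i];
    }
};
```

Swapping TrNOP for a DP-SP or AOS-SOA translator changes the conversion without touching the buffering scheme.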
Translator DP-SP

template<typename T1, typename T2>
class TrTemplate {
public:
    typedef T1 host_type;
    typedef T2 board_type;
    inline static void host2board(unsigned int const count, T1 *in, T2 *out,
                                  unsigned int const in_offset, unsigned int const out_offset);
    inline static void board2host(unsigned int const count, T2 *in, T1 *out,
                                  unsigned int const in_offset, unsigned int const out_offset);
}; // end template

// Implementation of the template: makes a static cast between T1 and T2
template<typename T1, typename T2>
void TrTemplate<T1, T2>::host2board(unsigned int const count, T1 *in, T2 *out,
                                    unsigned int const in_offset, unsigned int const out_offset)
{
    for (unsigned int i = 0; i < count; ++i)
        out[out_offset + i] = static_cast<T2>(in[in_offset + i]);
}

template<typename T1, typename T2>
void TrTemplate<T1, T2>::board2host(unsigned int const count, T2 *in, T1 *out,
                                    unsigned int const in_offset, unsigned int const out_offset)
{
    for (unsigned int i = 0; i < count; ++i)
        out[out_offset + i] = static_cast<T1>(in[in_offset + i]);
}

There is also an SSE-optimized version for the double-float conversion.
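A sketch of what such an SSE-optimized double-to-float path can look like with SSE2 intrinsics (x86 only; the function name is illustrative, not the deck's implementation): `_mm_cvtpd_ps` converts two doubles to two floats, so two of them fill one 4-float store.

```cpp
#include <emmintrin.h>

// SSE2 double -> float conversion sketch. _mm_cvtpd_ps turns 2 doubles into
// 2 floats in the low half of a register; _mm_movelh_ps packs two such pairs
// into one 4-float store. A scalar loop handles the tail.
void dp_to_sp_sse(unsigned int count, const double *in, float *out) {
    unsigned int i = 0;
    for (; i + 4 <= count; i += 4) {
        __m128d lo = _mm_loadu_pd(in + i);       // doubles i, i+1
        __m128d hi = _mm_loadu_pd(in + i + 2);   // doubles i+2, i+3
        __m128  f  = _mm_movelh_ps(_mm_cvtpd_ps(lo), _mm_cvtpd_ps(hi));
        _mm_storeu_ps(out + i, f);               // floats i .. i+3
    }
    for (; i < count; ++i)                       // scalar tail
        out[i] = static_cast<float>(in[i]);
}
```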
Double Buffer DP-SP
[Charts: Double Buffer read and write performance, with and without the DP-SP conversion, vs. transfer size (bytes)]
Pooled Buffer DP-SP
[Charts: Pooled Buffer read and write performance, with and without the DP-SP conversion, vs. transfer size (bytes)]
Translator AOS-SOA

// Implementation of the template. In our example, each host element has 4
// fields of type T1 (floats), converted into 4 interleaved blocks of floats.
template<typename T1, typename T2>
void TrAoStoSoA<T1, T2>::host2board(unsigned int const count, void *in, void *out,
                                    unsigned int const in_offset, unsigned int const out_offset)
{
    host_type  * input  = static_cast<host_type *>(in);
    board_type * output = static_cast<board_type *>(out);
    for (unsigned int i = 0; i < count; ++i) {
        output[out_offset + i]             = static_cast<T2>(input[in_offset + i].x);
        output[count + out_offset + i]     = static_cast<T2>(input[in_offset + i].y);
        output[count * 2 + out_offset + i] = static_cast<T2>(input[in_offset + i].z);
        output[count * 3 + out_offset + i] = static_cast<T2>(input[in_offset + i].a);
    }
}

template<typename T1, typename T2>
void TrAoStoSoA<T1, T2>::board2host(unsigned int const count, void *in, void *out,
                                    unsigned int const in_offset, unsigned int const out_offset)
{
    board_type * input  = static_cast<board_type *>(in);
    host_type  * output = static_cast<host_type *>(out);
    for (unsigned int i = 0; i < count; ++i) {
        output[out_offset + i].x = static_cast<T1>(input[in_offset + i]);
        output[out_offset + i].y = static_cast<T1>(input[count + in_offset + i]);
        output[out_offset + i].z = static_cast<T1>(input[count * 2 + in_offset + i]);
        output[out_offset + i].a = static_cast<T1>(input[count * 3 + in_offset + i]);
    }
}
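The same index mapping, specialized to a concrete 4-float struct so it can be run end to end (the struct and function names are illustrative): field f of element i lands at soa[f * count + i].

```cpp
struct Particle { float x, y, z, a; };   // host-side AOS element

// AOS -> SOA: gather each field into its own contiguous block.
void aos_to_soa(unsigned int count, const Particle *in, float *out) {
    for (unsigned int i = 0; i < count; ++i) {
        out[i]             = in[i].x;
        out[count + i]     = in[i].y;
        out[count * 2 + i] = in[i].z;
        out[count * 3 + i] = in[i].a;
    }
}

// SOA -> AOS: the inverse mapping, rebuilding the structs.
void soa_to_aos(unsigned int count, const float *in, Particle *out) {
    for (unsigned int i = 0; i < count; ++i) {
        out[i].x = in[i];
        out[i].y = in[count + i];
        out[i].z = in[count * 2 + i];
        out[i].a = in[count * 3 + i];
    }
}
```

The SOA layout is what lets consecutive GPU threads read consecutive addresses (coalesced access), which is why the conversion is worth doing on the way in.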
Double Buffer AOS-SOA
[Charts: Double Buffer read and write performance, with and without the AOS-SOA conversion, vs. transfer size (bytes)]
Pooled Buffer AOS-SOA
[Charts: Pooled Buffer read and write performance, with and without the AOS-SOA conversion, vs. transfer size (bytes)]
Conclusions
- We presented a way to compose buffering schemes with data transformations using C++ templates.
- We reduce the pinned host memory needed to perform transfers.
- We improve transfer performance in comparison to straightforward CUDA implementations.

Questions?

This work was done with support from the Volkswagen Foundation under the GRACE project.