Graphics Interoperability


Amanda Rice
6 years ago

1 Goals: Graphics Interoperability
Some questions you may have
  If you have a single NVIDIA card, can it be used both for CUDA computations and for graphics display using OpenGL, without transferring images back and forth between host and device?
  Fortunately the answer is yes, as we see in this chapter
The example applications
  Generating an image in CUDA, then displaying it in OpenGL
  Revisit the GPU ripple application and use graphics interoperability
  Revisit the GPU heat transfer application and use graphics interoperability


2 Basic - 1
Two steps:
  Generate the image using CUDA
  Display the image using GLUT (GLUT is the OpenGL Utility Toolkit)
Select the CUDA device; this also becomes the OpenGL device
You now need to initialize GLUT
Two handles refer to the shared buffer: OpenGL's name and CUDA's name

3 Basic - 2
Set up the shared buffer
Register OpenGL's bufferObj as a shared resource
Map the resource, then get a pointer (devPtr) to the mapped resource
Launch the kernel, passing in devPtr
Unmap the shared resource
Register the keyboard and display callbacks

4 Basic - 3
CUDA generates an image similar to the ripple example from Chapter 5
The image is handed directly to OpenGL without going back to the CPU; this is done by draw_func
The Esc key lets the user exit the program
[Figure: the generated image]


5 Redux the Ripple Program
The original program used CPUAnimBitmap and cudaMalloc
The ripple pattern is shown to the right
Our goal is to use a GPUAnimBitmap; this
  Eliminates the DataBlock
  Eliminates the cudaMalloc
  Also runs more quickly

    int main( void ) {
        GPUAnimBitmap bitmap( DIM, DIM, NULL );
        bitmap.anim_and_exit( (void (*)(uchar4*,void*,int))generate_frame, NULL );
    }


6 Ripple - 1
We have the OpenGL bufferObj and the CUDA resource
There is a void *dataBlock
fAnim is called by glutIdleFunc
animExit is for cleanup
clickDrag is for mouse events
The constructor
  Stores the parameter values
  Selects the CUDA device, which is shared with OpenGL
  Initializes GLUT as seen in the last example (glutInit prefers to be given parameters)
  Sets up the OpenGL buffer handle

7 Ripple - 2
Constructor (continued)
  OpenGL sets up the buffer
  The buffer is then registered with CUDA
The glutIdleFunc callback does three things
  It maps the shared buffer and retrieves the GPU pointer
  It calls fAnim, which launches the kernel to fill in the buffer
  Finally, it unmaps the buffer and asks OpenGL to display the image

8 Remaining Items
The kernel is the same as in Ch05 except
  The buffer is no longer an array of unsigned characters
  It is a four-tuple (uchar4) representing red, green, blue, and alpha
The generate_frame callback launches the kernel
The main program
  Creates the GPUAnimBitmap
  Registers generate_frame with our helper function

    #include "../common/book.h"
    #include "../common/gpu_anim.h"

    #define DIM 1024

    __global__ void kernel( uchar4 *ptr, int ticks ) {
        // map from threadIdx/blockIdx to pixel position
        int x = threadIdx.x + blockIdx.x * blockDim.x;
        int y = threadIdx.y + blockIdx.y * blockDim.y;
        int offset = x + y * blockDim.x * gridDim.x;

        // now calculate the value at that position
        float fx = x - DIM/2;
        float fy = y - DIM/2;
        float d = sqrtf( fx * fx + fy * fy );
        unsigned char grey = (unsigned char)(128.0f + 127.0f *
                                             cos(d/10.0f - ticks/7.0f) /
                                             (d/10.0f + 1.0f));
        ptr[offset].x = grey;
        ptr[offset].y = grey;
        ptr[offset].z = grey;
        ptr[offset].w = 255;
    }

    void generate_frame( uchar4 *pixels, void*, int ticks ) {
        dim3 grids(DIM/16,DIM/16);
        dim3 threads(16,16);
        kernel<<<grids,threads>>>( pixels, ticks );
    }

    int main( void ) {
        GPUAnimBitmap bitmap( DIM, DIM, NULL );
        bitmap.anim_and_exit( (void (*)(uchar4*,void*,int))generate_frame, NULL );
    }
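The per-pixel arithmetic of the ripple kernel can be checked on the CPU before it is moved to the GPU. The following is a minimal sketch, not the deck's code: `ripple_grey` and `pixel_offset` are hypothetical helper names, and the amplitude `127.0f` is an assumption chosen so that the result stays inside the range of an unsigned char.

```cpp
#include <cassert>
#include <cmath>

// Image edge length, matching the DIM used on the slide.
constexpr int DIM = 1024;

// CPU stand-in for the per-pixel work of the ripple kernel.
// (x, y) plays the role of the pixel position a GPU thread would compute
// from threadIdx/blockIdx. The 127.0f amplitude is an assumption.
unsigned char ripple_grey(int x, int y, int ticks) {
    float fx = static_cast<float>(x - DIM / 2);
    float fy = static_cast<float>(y - DIM / 2);
    float d = std::sqrt(fx * fx + fy * fy);  // distance from the image centre
    return static_cast<unsigned char>(
        128.0f + 127.0f * std::cos(d / 10.0f - ticks / 7.0f) / (d / 10.0f + 1.0f));
}

// Row-major offset of a pixel; on the GPU the row width is
// blockDim.x * gridDim.x.
int pixel_offset(int x, int y, int row_width) {
    return x + y * row_width;
}
```

At the centre of the image with `ticks == 0` the distance is zero, so the cosine term is at its maximum and the pixel is fully white. The damping factor `1 / (d/10 + 1)` makes the ripple fade with distance from the centre.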

9 The Original Heat Transfer from Ch05
Inside CPUAnimBitmap is a call to glDrawPixels
The last line triggers a copy of the bitmap from the CPU to the GPU
After the kernel runs, the bitmap is copied back to the host for each frame
These copies back and forth are unnecessary if we use graphics interoperability

10 Heat Transfer - 1
The entire program appears here; we only highlight the differences
Most of the code remains unchanged

    #include "../common/book.h"
    #include "../common/gpu_anim.h"

    #define DIM 1024
    #define MAX_TEMP 1.0f
    #define MIN_TEMP f
    #define SPEED 0.25f

    // these exist on the GPU side
    texture<float> texConstSrc;
    texture<float> texIn;
    texture<float> texOut;

    // this kernel takes in a 2-d array of floats
    // it updates the value-of-interest by a scaled value based
    // on itself and its nearest neighbors
    __global__ void blend_kernel( float *dst, bool dstOut ) {
        // map from threadIdx/blockIdx to pixel position
        int x = threadIdx.x + blockIdx.x * blockDim.x;
        int y = threadIdx.y + blockIdx.y * blockDim.y;
        int offset = x + y * blockDim.x * gridDim.x;

        int left = offset - 1;
        int right = offset + 1;
        if (x == 0) left++;
        if (x == DIM-1) right--;

        int top = offset - DIM;
        int bottom = offset + DIM;
        if (y == 0) top += DIM;
        if (y == DIM-1) bottom -= DIM;

        float t, l, c, r, b;

11 Heat Transfer - 2
This code is relatively unchanged

        if (dstOut) {
            t = tex1Dfetch(texIn,top);
            l = tex1Dfetch(texIn,left);
            c = tex1Dfetch(texIn,offset);
            r = tex1Dfetch(texIn,right);
            b = tex1Dfetch(texIn,bottom);
        } else {
            t = tex1Dfetch(texOut,top);
            l = tex1Dfetch(texOut,left);
            c = tex1Dfetch(texOut,offset);
            r = tex1Dfetch(texOut,right);
            b = tex1Dfetch(texOut,bottom);
        }
        dst[offset] = c + SPEED * (t + b + r + l - 4 * c);
    }

    // NOTE - texOffsetConstSrc could either be passed as a
    // parameter to this function, or passed in constant memory
    // if we declared it as a global above, it would be
    // a parameter here:
    // __global__ void copy_const_kernel( float *iptr,
    //                                    size_t texOffset )
    __global__ void copy_const_kernel( float *iptr ) {
        // map from threadIdx/blockIdx to pixel position
        int x = threadIdx.x + blockIdx.x * blockDim.x;
        int y = threadIdx.y + blockIdx.y * blockDim.y;
        int offset = x + y * blockDim.x * gridDim.x;

        float c = tex1Dfetch(texConstSrc,offset);
        if (c != 0)
            iptr[offset] = c;
    }
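The update rule in blend_kernel, including the way edge cells reuse their own value in place of a missing neighbor, can be sketched on the CPU. This is an illustrative reference under stated assumptions, not the deck's implementation: `blend_step` is a hypothetical name, the grid is a plain `std::vector<float>` instead of a texture, and the grid size is a parameter instead of the fixed DIM.

```cpp
#include <cassert>
#include <vector>

// Same scaling factor as the SPEED constant on the slide.
constexpr float SPEED = 0.25f;

// One heat-relaxation step over a dim x dim grid stored in row-major order.
// Each cell moves toward the average of its four neighbours.
std::vector<float> blend_step(const std::vector<float>& in, int dim) {
    std::vector<float> out(in.size());
    for (int y = 0; y < dim; ++y) {
        for (int x = 0; x < dim; ++x) {
            int offset = x + y * dim;
            // At an edge, the missing neighbour is replaced by the cell
            // itself, which mirrors the left++/right--/top/bottom
            // adjustments in the kernel.
            int left   = (x == 0)       ? offset : offset - 1;
            int right  = (x == dim - 1) ? offset : offset + 1;
            int top    = (y == 0)       ? offset : offset - dim;
            int bottom = (y == dim - 1) ? offset : offset + dim;
            float c = in[offset];
            out[offset] = c + SPEED * (in[top] + in[bottom] +
                                       in[right] + in[left] - 4 * c);
        }
    }
    return out;
}
```

With SPEED at 0.25 a single hot cell surrounded by cold cells gives all of its heat to its four neighbors in one step, and a grid that is already uniform is left unchanged.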

12 Heat Transfer - 3
The only real change here is going from unsigned char to uchar4, the same as in the ripple program
We skipped float_to_color in the first version; on the next slide we look at the old and new versions

    // globals needed by the update routine
    struct DataBlock {
        float *dev_inSrc;
        float *dev_outSrc;
        float *dev_constSrc;
        cudaEvent_t start, stop;
        float totalTime;
        float frames;
    };

    void anim_gpu( uchar4* outputBitmap, DataBlock *d, int ticks ) {
        HANDLE_ERROR( cudaEventRecord( d->start, 0 ) );
        dim3 blocks(DIM/16,DIM/16);
        dim3 threads(16,16);

        // since tex is global and bound, we have to use a flag to
        // select which is in/out per iteration
        volatile bool dstOut = true;
        for (int i=0; i<90; i++) {
            float *in, *out;
            if (dstOut) {
                in = d->dev_inSrc;
                out = d->dev_outSrc;
            } else {
                out = d->dev_inSrc;
                in = d->dev_outSrc;
            }
            copy_const_kernel<<<blocks,threads>>>( in );
            blend_kernel<<<blocks,threads>>>( out, dstOut );
            dstOut = !dstOut;
        }
        float_to_color<<<blocks,threads>>>( outputBitmap, d->dev_inSrc );
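The dstOut flag swaps the roles of the two buffers on every iteration. A small CPU sketch makes it easier to see why, after an even number of iterations such as 90, the newest data sits in the buffer that started as the input, which is the one handed to float_to_color. The function name `ping_pong` and the "add one" stand-in for the kernels are assumptions made for illustration only.

```cpp
#include <cassert>
#include <utility>

// Runs `steps` ping-pong iterations over two one-element "buffers".
// Each iteration reads `in` and writes in + 1 to `out`, standing in for
// the copy_const_kernel / blend_kernel pair.
// Returns {value in the first buffer, value in the second buffer}.
std::pair<int, int> ping_pong(int steps) {
    int bufIn = 0;   // stands in for dev_inSrc
    int bufOut = 0;  // stands in for dev_outSrc
    bool dstOut = true;
    for (int i = 0; i < steps; ++i) {
        int* in  = dstOut ? &bufIn  : &bufOut;
        int* out = dstOut ? &bufOut : &bufIn;
        *out = *in + 1;     // "kernel": newest result lands in out
        dstOut = !dstOut;   // swap roles for the next iteration
    }
    return {bufIn, bufOut};
}
```

Calling `ping_pong(90)` leaves the value 90 in the first buffer and 89 in the second: the last write of an even-length run always lands in the buffer that started as the input.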

13 A Quick Look at float_to_color
[Figure panels: old version, new version]
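The change from unsigned char to uchar4 affects only how a pixel is addressed, not the bytes that end up in the buffer. The sketch below illustrates this on the CPU under stated assumptions: `Pixel4` is a hypothetical stand-in for CUDA's uchar4, `write_old` and `write_new` are made-up names, and the color values are passed in directly instead of being derived from a temperature.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for CUDA's uchar4: four bytes named x, y, z, w.
struct Pixel4 { unsigned char x, y, z, w; };
static_assert(sizeof(Pixel4) == 4, "Pixel4 must pack into four bytes");

// Old style: the bitmap is a flat byte array, four bytes per pixel.
void write_old(unsigned char* optr, int offset,
               unsigned char r, unsigned char g, unsigned char b) {
    optr[offset * 4 + 0] = r;
    optr[offset * 4 + 1] = g;
    optr[offset * 4 + 2] = b;
    optr[offset * 4 + 3] = 255;  // opaque alpha
}

// New style: the bitmap is an array of four-component pixels.
void write_new(Pixel4* optr, int offset,
               unsigned char r, unsigned char g, unsigned char b) {
    optr[offset].x = r;
    optr[offset].y = g;
    optr[offset].z = b;
    optr[offset].w = 255;  // opaque alpha
}
```

Both functions produce the same byte layout, so only the indexing expression has to change when moving between the two pixel types.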

14 Heat Transfer - 4
Most of the code is unchanged
We now call GPUAnimBitmap instead of CPUAnimBitmap

        HANDLE_ERROR( cudaEventRecord( d->stop, 0 ) );
        HANDLE_ERROR( cudaEventSynchronize( d->stop ) );
        float elapsedTime;
        HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                            d->start, d->stop ) );
        d->totalTime += elapsedTime;
        ++d->frames;
        printf( "Average Time per frame: %3.1f ms\n",
                d->totalTime/d->frames );
    }

    // clean up memory allocated on the GPU
    void anim_exit( DataBlock *d ) {
        HANDLE_ERROR( cudaUnbindTexture( texIn ) );
        HANDLE_ERROR( cudaUnbindTexture( texOut ) );
        HANDLE_ERROR( cudaUnbindTexture( texConstSrc ) );
        HANDLE_ERROR( cudaFree( d->dev_inSrc ) );
        HANDLE_ERROR( cudaFree( d->dev_outSrc ) );
        HANDLE_ERROR( cudaFree( d->dev_constSrc ) );
        HANDLE_ERROR( cudaEventDestroy( d->start ) );
        HANDLE_ERROR( cudaEventDestroy( d->stop ) );
    }

    int main( void ) {
        DataBlock data;
        GPUAnimBitmap bitmap( DIM, DIM, &data );
        data.totalTime = 0;
        data.frames = 0;
        HANDLE_ERROR( cudaEventCreate( &data.start ) );
        HANDLE_ERROR( cudaEventCreate( &data.stop ) );

        int imageSize = bitmap.image_size();

15 Heat Transfer - 5
This code remains unchanged; remember the use of constant values for heaters
  The bright white square in the center
  A few single-point heat sinks
  The small rectangular heat sink in the top center region

        // assume float == 4 chars in size (ie rgba)
        HANDLE_ERROR( cudaMalloc( (void**)&data.dev_inSrc, imageSize ) );
        HANDLE_ERROR( cudaMalloc( (void**)&data.dev_outSrc, imageSize ) );
        HANDLE_ERROR( cudaMalloc( (void**)&data.dev_constSrc, imageSize ) );

        HANDLE_ERROR( cudaBindTexture( NULL, texConstSrc,
                                       data.dev_constSrc, imageSize ) );
        HANDLE_ERROR( cudaBindTexture( NULL, texIn,
                                       data.dev_inSrc, imageSize ) );
        HANDLE_ERROR( cudaBindTexture( NULL, texOut,
                                       data.dev_outSrc, imageSize ) );

        // initialize the constant data
        float *temp = (float*)malloc( imageSize );
        for (int i=0; i<DIM*DIM; i++) {
            temp[i] = 0;
            int x = i % DIM;
            int y = i / DIM;
            if ((x>300) && (x<600) && (y>310) && (y<601))
                temp[i] = MAX_TEMP;
        }
        temp[DIM* ] = (MAX_TEMP + MIN_TEMP)/2;
        temp[DIM* ] = MIN_TEMP;
        temp[DIM* ] = MIN_TEMP;
        temp[DIM* ] = MIN_TEMP;
        for (int y=800; y<900; y++) {
            for (int x=400; x<500; x++) {
                temp[x+y*DIM] = MIN_TEMP;
            }
        }

16 Heat Transfer - 6
The upper left corner is not a heater; it was initially very hot and then dissipated
There was a 15% speedup from avoiding the copies of graphics data between the device and the host

        HANDLE_ERROR( cudaMemcpy( data.dev_constSrc, temp, imageSize,
                                  cudaMemcpyHostToDevice ) );

        // initialize the input data
        for (int y=800; y<DIM; y++) {
            for (int x=0; x<200; x++) {
                temp[x+y*DIM] = MAX_TEMP;
            }
        }
        HANDLE_ERROR( cudaMemcpy( data.dev_inSrc, temp, imageSize,
                                  cudaMemcpyHostToDevice ) );
        free( temp );

        bitmap.anim_and_exit( (void (*)(uchar4*,void*,int))anim_gpu,
                              (void (*)(void*))anim_exit );
    }
