Programming with CUDA, WS09

1 Programming with CUDA and Parallel Algorithms. Waqar Saleem, Jens Müller. Lecture 3, Thursday, 29 Nov, 2009

2 Recap: motivational videos; example kernel; thread IDs; memory overhead; CUDA hardware and programming models; threads, blocks and grids; CUDA memory hierarchy; device compute capability.

3 Course website updated. Lecture slides updated with new template.

4 Grid and block dimension limitations
    For a block with blockDim (x,y,z): x*y*z ≤ 512, with x,y ≤ 512 and z ≤ 64.
    For a grid with gridDim (x,y): x,y ≤ 65535.
    Valid for compute capability 1.x.
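The limits above are specific to compute capability 1.x, so it can help to read them from the device at run time instead of hard-coding them. The following is a minimal sketch using the runtime call cudaGetDeviceProperties; the helper names blockFits and printLimits are hypothetical and not part of the lecture.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if block shape b respects the limits stored in p.
bool blockFits(const cudaDeviceProp &p, dim3 b) {
    return b.x * b.y * b.z <= (unsigned int)p.maxThreadsPerBlock &&
           b.x <= (unsigned int)p.maxThreadsDim[0] &&
           b.y <= (unsigned int)p.maxThreadsDim[1] &&
           b.z <= (unsigned int)p.maxThreadsDim[2];
}

// Hypothetical helper: prints the launch limits of one device.
// Returns false if the device could not be queried.
bool printLimits(int device) {
    cudaDeviceProp p;
    cudaError_t err = cudaGetDeviceProperties(&p, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    printf("compute capability: %d.%d\n", p.major, p.minor);
    printf("max threads per block: %d\n", p.maxThreadsPerBlock);
    printf("max block dimensions: %d x %d x %d\n",
           p.maxThreadsDim[0], p.maxThreadsDim[1], p.maxThreadsDim[2]);
    printf("max grid dimensions: %d x %d x %d\n",
           p.maxGridSize[0], p.maxGridSize[1], p.maxGridSize[2]);
    return true;
}
```

Calling printLimits(0) from main reports the limits of the first device; blockFits can then guard a launch whose block shape is computed at run time.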

5 Specifying grid and block dimensions
    int main() {
        int N;  // assign N such that N*N ≤ 512
        float **h_mata, **h_matb, **h_matc;
        // allocate memory for NxN host matrices
        // assign values to h_mata, h_matb, h_matc
        // initialize device
        float **d_mata, **d_matb, **d_matc;
        // allocate memory for NxN device matrices
        // copy host matrices to device matrices
        dim3 blocksize( N, N );  // unspecified dimensions default to 1
        matadd<<< 1, blocksize >>>( d_mata, d_matb, d_matc );
        // copy d_matc to h_matc
        // output h_matc
        // free host matrices
        // free device matrices
    }

6 Specifying grid and block dimensions
    int main() {
        // initialize device
        // allocate NxN host, device matrices, N*N ≤ 512
        // assign host matrices and copy to device
        matadd<<< 1, dim3( N, N ) >>>( d_a, d_b, d_c );
        // make host copy of d_c and use it
        // free host, device memory
    }
    __global__ void matadd( float **g_a, float **g_b, float **g_c ) {
        int i = threadIdx.x;
        int j = threadIdx.y;
        g_c[i][j] = g_a[i][j] + g_b[i][j];
    }

7 Specifying grid and block dimensions
    // define arbitrary N
    int main() {
        //...
        dim3 blocksize( 16, 16 );
        // round up so that the grid covers all N elements in each direction
        dim3 gridsize( ( N + blocksize.x - 1 ) / blocksize.x,
                       ( N + blocksize.y - 1 ) / blocksize.y );
        matadd<<< gridsize, blocksize >>>( d_a, d_b, d_c );
        //...
    }
    __global__ void matadd( float g_a[N][N], float g_b[N][N], float g_c[N][N] ) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // get column index
        int j = blockIdx.y * blockDim.y + threadIdx.y;  // get row index
        if ( i < N && j < N )
            g_c[i][j] = g_a[i][j] + g_b[i][j];
    }
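The slide leaves allocation and copying as comments. One way to fill them in is sketched below, under two assumptions that are not in the lecture: the matrices are stored as flat row-major arrays instead of 2D arrays, and the matrix size n is passed as a kernel argument. The driver runMatadd is a hypothetical helper.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// C = A + B for n x n row-major matrices stored as flat arrays.
__global__ void matadd(const float *a, const float *b, float *c, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n) {  // threads past the matrix edge do nothing
        int idx = row * n + col;
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical host driver for small n > 0.
// Returns the number of wrong elements, or -1 on an allocation or CUDA error.
int runMatadd(int n) {
    size_t count = (size_t)n * n;
    size_t bytes = count * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    float *d_a = NULL, *d_b = NULL, *d_c = NULL;
    int result = -1;
    if (h_a && h_b && h_c) {
        for (size_t i = 0; i < count; ++i) {
            h_a[i] = (float)i;
            h_b[i] = 2.0f * (float)i;
        }
        bool ok = cudaMalloc((void **)&d_a, bytes) == cudaSuccess &&
                  cudaMalloc((void **)&d_b, bytes) == cudaSuccess &&
                  cudaMalloc((void **)&d_c, bytes) == cudaSuccess &&
                  cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
                  cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
        if (ok) {
            dim3 blocksize(16, 16);
            dim3 gridsize((n + blocksize.x - 1) / blocksize.x,
                          (n + blocksize.y - 1) / blocksize.y);
            matadd<<<gridsize, blocksize>>>(d_a, d_b, d_c, n);
            // This copy waits for the kernel and reports a failed launch.
            ok = cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        }
        if (ok) {
            result = 0;
            for (size_t i = 0; i < count; ++i)
                if (h_c[i] != h_a[i] + h_b[i]) ++result;
        }
    }
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return result;
}
```

With n = 100 the grid is 7 x 7 blocks of 16 x 16 threads, so 112 x 112 threads are launched and the bounds check discards the surplus ones.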

8 [Figure: the N x N matrix covered by a grid of gridDim.x by gridDim.y blocks; each block is blockDim.x by blockDim.y = 16 x 16 threads, with thread indices running from (0,0) to (15,15).]

9 Valid also for non-square matrices. Would it be a good idea to move data to shared memory?
    __global__ void matadd( ... ) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if ( i < N && j < N )
            g_c[i][j] = g_a[i][j] + g_b[i][j];
    }

10 CUDA extensions to C
    Vector types: (u)char, (u)short, (u)int, float, double, (u/long)long
    Available vectors: <type>(1/2/3/4); exceptions: (longlong/double)(1/2)
    Vector types are assigned by special functions of the form make_<type name>, e.g.
        int4 rgba = make_int4( r, g, b, a );  // r, g, b, a are ints
    Vector components are accessed via the x, y, z and w fields respectively, e.g.
        printf( "r g b a: %d %d %d %d\n", rgba.x, rgba.y, rgba.z, rgba.w );
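Vector types work the same way in host and device code. The sketch below builds an int4 on the host with make_int4 and reads its fields in a kernel; the kernel sumComponents and the driver sumOnDevice are hypothetical names chosen for illustration.

```cuda
#include <cuda_runtime.h>

// Each thread adds up the four components of one int4.
__global__ void sumComponents(const int4 *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int4 v = in[i];
        out[i] = v.x + v.y + v.z + v.w;
    }
}

// Hypothetical host driver: sums the components of a single int4 on the device.
// Returns false on a CUDA error; on success *result holds x + y + z + w.
bool sumOnDevice(int4 h_in, int *result) {
    int4 *d_in = NULL;
    int *d_out = NULL;
    bool ok = cudaMalloc((void **)&d_in, sizeof(int4)) == cudaSuccess &&
              cudaMalloc((void **)&d_out, sizeof(int)) == cudaSuccess &&
              cudaMemcpy(d_in, &h_in, sizeof(int4), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        sumComponents<<<1, 1>>>(d_in, d_out, 1);
        ok = cudaMemcpy(result, d_out, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

A call such as sumOnDevice(make_int4(1, 2, 3, 4), &s) passes the vector by value on the host side before it is copied to the device.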

11 CUDA extensions to C
    dim3: based on uint3; used to specify dimensions; unspecified components default to 1.
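The defaulting rule means that one- and two-dimensional launches can omit the trailing components. A small host-only sketch follows; totalThreads is a hypothetical helper used only to make the defaults visible.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: total number of threads launched by a grid of blocks.
// The cast makes the whole product use 64-bit arithmetic.
unsigned long long totalThreads(dim3 grid, dim3 block) {
    return (unsigned long long)grid.x * grid.y * grid.z *
           block.x * block.y * block.z;
}

// dim3 block(16, 16);  -> block.x == 16, block.y == 16, block.z == 1
// dim3 grid(4);        -> grid.x == 4,   grid.y == 1,   grid.z == 1
```

For the two shapes in the comments, totalThreads(grid, block) is 4 * 16 * 16 = 1024.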

12 CUDA extensions to C
    Built-in variables indicate
        grid/block dimensions: (grid/block)Dim (dim3)
        block/thread indices: (block/thread)Idx (uint3)
        warp size: warpSize (int)
    Their addresses cannot be taken and they cannot be assigned to.

13 CUDA extensions to C
    Function type qualifiers specify where a function can run and be called from:

        qualifier     called from   runs on
        __host__      host          host
        __global__    host          device
        __device__    device        device

    The __host__ qualifier can be omitted (compiled for host) or combined with __device__ (compiled for both).
    Functions that run on the device (__device__, __global__) cannot recurse, declare static variables, or have a variable number of arguments.
    The address of a __device__ function cannot be taken, but the address of a __global__ function can.
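The three qualifiers can be seen side by side in a short sketch. All function names here (square, plusOne, squarePlusOne, runSquarePlusOne) are hypothetical; the point is only which side may call which function.

```cuda
#include <cuda_runtime.h>

// Compiled for both host and device: callable from either side.
__host__ __device__ float square(float x) { return x * x; }

// Device only: callable from kernels and other device functions.
__device__ float plusOne(float x) { return x + 1.0f; }

// Kernel: runs on the device, launched from the host, returns void.
__global__ void squarePlusOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = plusOne(square(data[i]));
}

// Hypothetical host driver (an unqualified function is a __host__ function).
// Overwrites h_data[0..n-1] with x*x + 1; returns false on a CUDA error.
bool runSquarePlusOne(float *h_data, int n) {
    float *d_data = NULL;
    size_t bytes = (size_t)n * sizeof(float);
    bool ok = cudaMalloc((void **)&d_data, bytes) == cudaSuccess &&
              cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        squarePlusOne<<<blocks, threads>>>(d_data, n);
        ok = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_data);
    return ok;
}
```

Because square is compiled for both sides, the host can reuse it to check the device result; calling plusOne from host code would be rejected by the compiler.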

14 CUDA extensions to C
    A __global__ function is a kernel. It
        must return void
        must have a specified execution configuration
        is called asynchronously
    Function parameters are passed via shared memory and may take up to 256 bytes.

15 CUDA extensions to C
    Variable type qualifiers specify where in the device a variable resides:

        memory      qualifier
        global      __device__
        constant    (__device__) __constant__
        shared      (__device__) __shared__

16 CUDA extensions to C
    __shared__ and __constant__ variables are stored statically.
    __device__ and __constant__ variables are only allowed at file scope.
    __constant__ variables are assigned only by the host.
    __shared__ variables cannot be initialized at declaration.
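A minimal sketch of the three qualifiers in one program follows. The variable and function names (c_scale, d_offset, scaleAndShift, runScaleAndShift) and the block size TILE are assumptions made for illustration. The host writes the file-scope variables with cudaMemcpyToSymbol; the shared array is only a per-thread staging slot here, so no barrier is needed.

```cuda
#include <cuda_runtime.h>

#define TILE 128  // assumed block size for this sketch

__constant__ float c_scale;  // constant memory, file scope, written by the host
__device__ float d_offset;   // global memory, file scope

__global__ void scaleAndShift(const float *in, float *out, int n) {
    __shared__ float tile[TILE];  // shared memory: no initializer allowed
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];  // each thread touches only its own slot
        out[i] = c_scale * tile[threadIdx.x] + d_offset;
    }
}

// Hypothetical host driver: h_out[i] = scale * h_in[i] + offset.
// Returns false on a CUDA error.
bool runScaleAndShift(const float *h_in, float *h_out, int n,
                      float scale, float offset) {
    float *d_in = NULL, *d_out = NULL;
    size_t bytes = (size_t)n * sizeof(float);
    bool ok = cudaMemcpyToSymbol(c_scale, &scale, sizeof(float)) == cudaSuccess &&
              cudaMemcpyToSymbol(d_offset, &offset, sizeof(float)) == cudaSuccess &&
              cudaMalloc((void **)&d_in, bytes) == cudaSuccess &&
              cudaMalloc((void **)&d_out, bytes) == cudaSuccess &&
              cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int blocks = (n + TILE - 1) / TILE;
        scaleAndShift<<<blocks, TILE>>>(d_in, d_out, n);
        ok = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Neither c_scale nor d_offset can be assigned with a plain `=` from host code, which is why both go through cudaMemcpyToSymbol.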

17 A note on variables
    (Automatic) variables declared in device code reside in registers (32-bit registers, a limited number per MP).
    If the variables are too large, they are stored in local memory, a portion of global memory (slow access).
    Use pointers with caution: dereferencing pointers to device variables in host code, and vice versa, is not allowed.

18 CUDA extensions to C
    Barrier synchronization with __syncthreads()
        meant to be lightweight
        implemented by an MP as a single instruction
    Threads in a block may lose concurrency because of
        scheduling of thread blocks to warps per MP
        memory read/write latency
    __syncthreads() defines a barrier in thread execution
        the fastest thread in the block gets to it first and waits
        it resumes only when all other threads reach the barrier
    Useful when threads in a block depend on results from each other.
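A small case where threads depend on each other's results is reversing an array inside one block: every thread writes one slot of a shared tile and then reads a slot written by a different thread. The sketch below assumes a single block of BLOCK threads; reverseInBlock and runReverse are hypothetical names.

```cuda
#include <cuda_runtime.h>

#define BLOCK 64  // assumed block size for this sketch

// Reverses BLOCK elements in place using one thread block.
__global__ void reverseInBlock(int *data) {
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;
    tile[t] = data[t];              // every thread fills one slot
    __syncthreads();                // wait until the whole tile is written
    data[t] = tile[BLOCK - 1 - t];  // now another thread's slot is safe to read
}

// Hypothetical host driver: h_data must hold BLOCK ints.
// Returns false on a CUDA error.
bool runReverse(int *h_data) {
    int *d_data = NULL;
    size_t bytes = BLOCK * sizeof(int);
    bool ok = cudaMalloc((void **)&d_data, bytes) == cudaSuccess &&
              cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        reverseInBlock<<<1, BLOCK>>>(d_data);
        ok = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_data);
    return ok;
}
```

Without the barrier, a thread in one warp could read tile[BLOCK - 1 - t] before the thread in another warp has written it.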

19 CUDA extensions to C
    Timing: clock_t clock()
    Samples the value of an MP counter that is updated every clock cycle.
    Sample at the beginning and end of thread execution to find the total time to execute the thread.
    Caution: includes time for other, time-sliced threads.
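The sampling pattern described above might look like the following sketch. The kernel timedWork, its arithmetic loop, and the driver runTimedWork are all hypothetical; the measured values are raw cycle counts, they include time spent on other warps, the counter may wrap, and the compiler is free to move instructions relative to the two clock() calls.

```cuda
#include <ctime>
#include <cuda_runtime.h>

// Each thread times a short arithmetic loop in MP clock cycles.
__global__ void timedWork(float *data, long long *cycles, int iters) {
    int i = threadIdx.x;
    clock_t start = clock();  // sample at the beginning
    float x = data[i];
    for (int k = 0; k < iters; ++k)
        x = x * 1.0001f + 0.5f;
    data[i] = x;              // store the result so the loop is not removed
    clock_t stop = clock();   // sample at the end
    cycles[i] = (long long)(stop - start);
}

// Hypothetical host driver: runs one block of n threads (n within the block
// size limit) and copies the per-thread cycle counts to h_cycles.
// Returns false on a CUDA error.
bool runTimedWork(int n, int iters, long long *h_cycles) {
    float *d_data = NULL;
    long long *d_cycles = NULL;
    bool ok = cudaMalloc((void **)&d_data, n * sizeof(float)) == cudaSuccess &&
              cudaMemset(d_data, 0, n * sizeof(float)) == cudaSuccess &&
              cudaMalloc((void **)&d_cycles, n * sizeof(long long)) == cudaSuccess;
    if (ok) {
        timedWork<<<1, n>>>(d_data, d_cycles, iters);
        ok = cudaMemcpy(h_cycles, d_cycles, n * sizeof(long long),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_data);
    cudaFree(d_cycles);
    return ok;
}
```

Comparing the counts for two values of iters gives a rough idea of the cost per iteration, subject to the caveats above.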

20 CUDA extensions to C
    Atomic functions perform uninterrupted read-modify-write operations.
    Available in device code only.
    Operate on integer values (atomicExch() also accepts single-precision floats).
    Threads waiting for the same data are serialized.
    atomic(Add/Sub/Min/Max/Inc/Dec/CAS), atomic(And/Or/Xor), atomicExch
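A typical use is a counter shared by all threads. In the sketch below, many threads may increment the same integer, and atomicAdd keeps each read-modify-write intact; the kernel countEven and the driver runCountEven are hypothetical names.

```cuda
#include <cuda_runtime.h>

// Counts the even values in data; all threads update one shared counter.
__global__ void countEven(const int *data, int n, int *count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] % 2 == 0)
        atomicAdd(count, 1);  // serialized among threads hitting this address
}

// Hypothetical host driver: returns the number of even values in h_data,
// or -1 on a CUDA error.
int runCountEven(const int *h_data, int n) {
    int *d_data = NULL, *d_count = NULL;
    int h_count = -1;
    size_t bytes = (size_t)n * sizeof(int);
    bool ok = cudaMalloc((void **)&d_data, bytes) == cudaSuccess &&
              cudaMalloc((void **)&d_count, sizeof(int)) == cudaSuccess &&
              cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemset(d_count, 0, sizeof(int)) == cudaSuccess;
    if (ok) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        countEven<<<blocks, threads>>>(d_data, n, d_count);
        ok = cudaMemcpy(&h_count, d_count, sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_data);
    cudaFree(d_count);
    return ok ? h_count : -1;
}
```

Replacing atomicAdd with a plain `*count += 1` would let concurrent increments overwrite each other and undercount.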

21 CUDA extensions to C
    Memory fence functions: __threadfence(), __threadfence_block()
    Faster, less accurate versions of common C mathematical functions
    Warp vote functions
    volatile variables
    Texture memory operations

22 Handling device memory
    Device memory: linear memory or CUDA arrays
    CUDA arrays are optimized for texture fetches
    Linear memory:
        cudaError_t cudaMalloc( void** d_memptr, size_t size )
        cudaError_t cudaFree( void* d_memptr )
        cudaError_t cudaMemcpy( void *dst, const void *src, size_t size, enum cudaMemcpyKind kind )

23 Handling device memory
    Outline:
        int main() {
            // allocate h_a, h_b, h_c, size N
            // assign values to host vectors
            // initialize device
            // allocate d_a, d_b, d_c, size N
            // copy h_a, h_b to d_a, d_b
            vadd<<<1,N>>>( d_a, d_b, d_c );
            // copy d_c to h_c
            // output h_c
            // free host variables
            // free device variables
        }
    In detail:
        int main() {
            int N;  // assign N
            size_t size = N * sizeof( int );
            int *h_a = (int*) malloc( size );
            int *h_b = (int*) malloc( size );
            int *h_c = (int*) malloc( size );
            // assign values to vectors
            int *d_a, *d_b, *d_c;
            cudaMalloc( (void**) &d_a, size );
            cudaMalloc( (void**) &d_b, size );
            cudaMalloc( (void**) &d_c, size );
            cudaMemcpy( d_a, h_a, size, cudaMemcpyHostToDevice );
            cudaMemcpy( d_b, h_b, size, cudaMemcpyHostToDevice );
            vadd<<<1,N>>>( d_a, d_b, d_c );
            //...
        }

24 Handling device memory (same outline, continued)
        int main() {
            //...
            vadd<<<1,N>>>( d_a, d_b, d_c );
            cudaMemcpy( h_c, d_c, size, cudaMemcpyDeviceToHost );
            // output h_c
            free( h_a ); free( h_b ); free( h_c );
            cudaFree( d_a ); cudaFree( d_b ); cudaFree( d_c );
        }
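The slides do not show the vadd kernel or any checking of the cudaError_t return values. A minimal sketch that adds both is given below, assuming a single block of n threads as in the launch above, so n must stay within the block size limit; the driver name runVadd is hypothetical.

```cuda
#include <cuda_runtime.h>

// One thread per element; the launch uses a single block of n threads.
__global__ void vadd(const int *a, const int *b, int *c) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

// Hypothetical host driver: h_c[i] = h_a[i] + h_b[i] for i in [0, n).
// Every runtime call is checked; returns false as soon as one of them fails.
bool runVadd(const int *h_a, const int *h_b, int *h_c, int n) {
    size_t size = (size_t)n * sizeof(int);
    int *d_a = NULL, *d_b = NULL, *d_c = NULL;
    bool ok = cudaMalloc((void **)&d_a, size) == cudaSuccess &&
              cudaMalloc((void **)&d_b, size) == cudaSuccess &&
              cudaMalloc((void **)&d_c, size) == cudaSuccess &&
              cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        vadd<<<1, n>>>(d_a, d_b, d_c);
        // A kernel launch returns nothing; an invalid configuration shows up here.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_a);  // cudaFree accepts a null pointer, so this is safe on failure
    cudaFree(d_b);
    cudaFree(d_c);
    return ok;
}
```

Since the kernel is called asynchronously, the final cudaMemcpy is also what makes the host wait for the result before reading h_c.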

25 Next time: CUDA texture memory; compiling CUDA programs; CUDA runtime and driver APIs; streams.

26 See you next time!
