Programming with CUDA, WS09


1 Programming with CUDA and Parallel Algorithms
Waqar Saleem, Jens Müller
Lecture 7.5, Thursday, 19 November 2009

2 Recap CUDA texture memory commands

3 Today CUDA driver API

4 Runtime and Driver APIs
There are two interfaces for writing CUDA programs: C for CUDA and the CUDA driver API.
- C for CUDA allows writing kernels in C, provides the runtime API (which builds on the driver API), and must be compiled with nvcc.
- The driver API provides functions to load cubin/PTX kernels, e.g. as compiled from runtime kernels.

5 Runtime and Driver APIs
- The runtime API is provided by the cudart library: functions are prefixed with cuda, and the device is initialized implicitly by the first call into the runtime.
- The driver API is provided by the cuda library: functions and objects are prefixed with cu, and the device is initialized explicitly with cuInit().
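To make the prefixes concrete, a minimal sketch (not from the slides) of the same allocation in each API; N is assumed defined:

    // Runtime API: cuda* prefix, device initialized implicitly
    float *dA;
    cudaMalloc((void **)&dA, N * sizeof(float));
    cudaFree(dA);

    // Driver API: cu* prefix, explicit cuInit() and a context first
    CUdeviceptr dB;
    cuInit(0);
    // ... create a context as on the following slides ...
    cuMemAlloc(&dB, N * sizeof(float));
    cuMemFree(dB);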

6 Driver API
- Initialize the driver API with cuInit()
- Create a CUDA context
- Attach the context to a device
- Make the context current to the calling host thread
(Slide 9 shows these steps in code.)

7 Driver API Object Handles
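(The handle table itself did not survive transcription. From the CUDA reference, the main driver API objects are: CUdevice (device handle), CUcontext (roughly a CPU process), CUmodule (roughly a dynamic library), CUfunction (kernel handle), CUdeviceptr (pointer to device memory), CUarray (CUDA array), CUtexref (texture reference), CUstream (stream), and CUevent (event).)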

8 A runtime API sample

    // Kernel
    __global__ void VecAdd(float *A, float *B, float *C)
    {
        // ...
    }

    // Host code
    int main()
    {
        // ...
        VecAdd<<<1, N>>>(A, B, C);
        // ...
    }

9 Driver API equivalent

    int main()
    {
        // Initialize
        if (cuInit(0) != CUDA_SUCCESS)
            exit(0);

        // Get number of devices supporting CUDA
        int deviceCount = 0;
        cuDeviceGetCount(&deviceCount);
        if (deviceCount == 0) {
            printf("There is no device supporting CUDA.\n");
            exit(0);
        }

        // Get handle for device 0
        CUdevice cuDevice = 0;
        cuDeviceGet(&cuDevice, 0);

        // Create context
        CUcontext cuContext;
        cuCtxCreate(&cuContext, 0, cuDevice);

        // Create module from binary file
        CUmodule cuModule;
        cuModuleLoad(&cuModule, "VecAdd.ptx");

        // Get function handle from module
        CUfunction vecAdd;
        cuModuleGetFunction(&vecAdd, cuModule, "VecAdd");

        // Invoke kernel
        #define ALIGN_UP(offset, alignment) \
            (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)
        int offset = 0;
        void *ptr;

        ptr = (void *)(size_t)A;
        ALIGN_UP(offset, __alignof(ptr));
        cuParamSetv(vecAdd, offset, &ptr, sizeof(ptr));
        offset += sizeof(ptr);

        ptr = (void *)(size_t)B;
        ALIGN_UP(offset, __alignof(ptr));
        cuParamSetv(vecAdd, offset, &ptr, sizeof(ptr));
        offset += sizeof(ptr);

        ptr = (void *)(size_t)C;
        ALIGN_UP(offset, __alignof(ptr));
        cuParamSetv(vecAdd, offset, &ptr, sizeof(ptr));
        offset += sizeof(ptr);

        cuParamSetSize(vecAdd, offset);

        int threadsPerBlock = 256;
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        cuFuncSetBlockShape(vecAdd, threadsPerBlock, 1, 1);
        cuLaunchGrid(vecAdd, blocksPerGrid, 1);
        // ...
    }

10 CUDA Context
- A CUDA context loads cubin/PTX kernels; C kernels must first be compiled down to cubin/PTX using nvcc.
- cubin kernels are not forward compatible; PTX kernels are.
- All driver API resources and actions are encapsulated in contexts and are automatically cleaned up when the context is destroyed.
- CUDA functions called outside a context return an error.
- Each context has its own 32-bit address space.

11 Working with contexts
- Create a context using cuCtxCreate().
- The created context, C, is automatically made current to the calling host thread.
- C has a usage count of 1.
- C is pushed on top of the calling host thread's stack of current contexts, replacing the previously current context, if any.
- The host thread should call cuCtxDestroy() or cuCtxDetach() on C when done with it.

12 Working with contexts
- Pop C from the stack using cuCtxPopCurrent(); make it current again using cuCtxPushCurrent().
- Use a context in other threads via cuCtxAttach() and cuCtxDetach().
- Other context functions: cuCtxSynchronize(), cuCtxGetDevice()
- Each context has a usage count, which is 1 at creation and is incremented/decremented by cuCtxAttach()/cuCtxDetach() respectively.
- A context and its resources are automatically destroyed when its usage count reaches 0. (See the sketch below.)
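A minimal sketch of the pop/push mechanism (not from the slides; cuDevice as obtained on slide 9, error checking omitted):

    CUcontext ctx;
    cuCtxCreate(&ctx, 0, cuDevice);   // current to this thread, usage count 1
    cuCtxPopCurrent(&ctx);            // no longer current to this thread

    // ... in another host thread ...
    cuCtxPushCurrent(ctx);            // now current to that thread
    // ... issue driver API calls ...
    cuCtxPopCurrent(&ctx);            // release it again

    cuCtxDestroy(ctx);                // usage count reaches 0, resources freed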

13 Modules
- Modules contain previously compiled device functions.
- Function names, texture references, and global variables are available at module scope.
- A context may incorporate external modules as well.
- Key functions: cuModuleLoad(), cuModuleGetFunction() (see the sketch below)
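A sketch of the module lookups; the file and symbol names here are hypothetical, and cuModuleGetGlobal()/cuModuleGetTexRef() are not on the slide but are the analogous module-scope lookups:

    CUmodule mod;
    cuModuleLoad(&mod, "kernels.ptx");                   // load a cubin/PTX file

    CUfunction f;
    cuModuleGetFunction(&f, mod, "myKernel");            // device function by name

    CUdeviceptr dptr;
    unsigned int bytes;
    cuModuleGetGlobal(&dptr, &bytes, mod, "myGlobal");   // __device__ variable

    CUtexref tex;
    cuModuleGetTexRef(&tex, mod, "myTexRef");            // texture reference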

14 Data alignment
- An alignment requirement for a type specifies the memory addresses at which variables of that type may be stored.
- Aligned data can be read more efficiently.
- In C/C++, a type's alignment requirement can be obtained with the __alignof() compiler extension.
- Alignment requirements depend on the hardware architecture.
- A memory address a is n-aligned if a is a multiple of n.
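For example, the ALIGN_UP macro from slide 9 rounds an offset up to the next multiple of the alignment:

    #define ALIGN_UP(offset, alignment) \
        (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

    int offset = 4;       // e.g. after a 4-byte int argument
    ALIGN_UP(offset, 8);  // (4 + 7) & ~7 = 8, the next 8-aligned offset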

15 CUDA alignment requirements
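(The requirements table did not survive transcription. Recalled from the CUDA programming guide: built-in vector types carry their own requirements, e.g. char2 is 2-byte aligned, short2 4-byte, int2/float2/double 8-byte, and int4/float4/double2 16-byte aligned.)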

16 Kernel execution
- cuFuncSetBlockShape() sets the arrangement of threads and their IDs.
- cuFuncSetSharedSize() sets the size of shared memory the function will use.
- cuParamSeti(), cuParamSetf(), cuParamSetTexRef(), cuParamSetv() add integer, float, texture reference, and arbitrary arguments to a function's argument list; added arguments have to be aligned.
- cuParamSetSize() sets the total size of the arguments.
- cuLaunch(), cuLaunchGrid(), cuLaunchGridAsync() launch a kernel. (A sketch follows.)
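Putting these together, a sketch for a hypothetical kernel scale(float *data, int n, float f), already fetched into CUfunction func, with device pointer dData:

    int offset = 0;
    void *ptr = (void *)(size_t)dData;          // device pointer argument
    ALIGN_UP(offset, __alignof(ptr));
    cuParamSetv(func, offset, &ptr, sizeof(ptr));
    offset += sizeof(ptr);

    ALIGN_UP(offset, __alignof(int));
    cuParamSeti(func, offset, n);               // integer argument
    offset += sizeof(int);

    ALIGN_UP(offset, __alignof(float));
    cuParamSetf(func, offset, 2.0f);            // float argument
    offset += sizeof(float);

    cuParamSetSize(func, offset);               // total argument size
    cuFuncSetBlockShape(func, 256, 1, 1);       // 256 threads per block
    cuFuncSetSharedSize(func, 0);               // no dynamic shared memory
    cuLaunchGrid(func, (n + 255) / 256, 1);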

17 Device Memory
- Linear memory: cuMemAlloc(), cuMemAllocPitch(), cuMemFree(), cuMemcpyHtoD(), cuMemcpyDtoH(), cuMemcpyHtoDAsync(), cuMemcpyDtoHAsync() (a round-trip sketch follows)
- CUDA array:

    CUDA_ARRAY_DESCRIPTOR desc;
    desc.Format = CU_AD_FORMAT_FLOAT;
    desc.NumChannels = 1;
    desc.Width = desc.Height = n;
    CUarray cuArray;
    cuArrayCreate(&cuArray, &desc);
    cuArrayDestroy(cuArray);

- memory copy functions ...
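For linear memory, a minimal host-to-device round trip using the calls above (N assumed defined, error checking omitted):

    size_t size = N * sizeof(float);
    float *hA = (float *)malloc(size);
    // ... fill hA ...

    CUdeviceptr dA;
    cuMemAlloc(&dA, size);          // allocate linear device memory
    cuMemcpyHtoD(dA, hA, size);     // host -> device
    // ... launch a kernel on dA ...
    cuMemcpyDtoH(hA, dA, size);     // device -> host
    cuMemFree(dA);
    free(hA);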

18 Pinned Memory
- cuMemHostAlloc(), cuMemFreeHost()
- Flags at allocation request portable, write-combined, and/or mapped memory.
- Check CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY via cuDeviceGetAttribute().
- Enable mapping of pinned memory for a context by passing the CU_CTX_MAP_HOST flag to cuCtxCreate(). (See the sketch below.)
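A sketch of allocating mapped pinned memory (cuDevice and size assumed from earlier; cuMemHostGetDevicePointer() is not on the slide but is how the device-side view of mapped memory is obtained):

    int canMap = 0;
    cuDeviceGetAttribute(&canMap, CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY, cuDevice);
    if (canMap) {
        CUcontext ctx;
        cuCtxCreate(&ctx, CU_CTX_MAP_HOST, cuDevice);   // allow mapped host memory

        void *hBuf;
        cuMemHostAlloc(&hBuf, size,
                       CU_MEMHOSTALLOC_PORTABLE | CU_MEMHOSTALLOC_DEVICEMAP);
        CUdeviceptr dBuf;
        cuMemHostGetDevicePointer(&dBuf, hBuf, 0);      // device view of the buffer
        // ... use dBuf in kernels ...
        cuMemFreeHost(hBuf);
    }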

19 Textures
- A texture reference declared in a module as

    texture<float, 2, cudaReadModeElementType> texRef;

  is retrieved in the driver API with

    CUtexref cuTexRef;
    cuModuleGetTexRef(&cuTexRef, cuModule, "texRef");

- Bind it to linear memory:

    CUDA_ARRAY_DESCRIPTOR desc;
    cuTexRefSetAddress2D(cuTexRef, &desc, devPtr, pitch);

  or to a CUDA array:

    cuTexRefSetArray(cuTexRef, cuArray, CU_TRSA_OVERRIDE_FORMAT);

- cuTexRefSetAddressMode(), cuTexRefSetFilterMode(), cuTexRefSetFlags(): addressing, filtering, normalized texels and coordinates
- cuTexRefSetFormat(): analogous to a CUDA array descriptor's format (a fuller configuration sketch follows)
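Completing the configuration calls named above, a sketch of binding a CUDA array with clamped addressing, linear filtering, and normalized coordinates (cuTexRef, cuArray, and func as above; the CU_TR_* constants are the driver API enums):

    cuTexRefSetArray(cuTexRef, cuArray, CU_TRSA_OVERRIDE_FORMAT);
    cuTexRefSetAddressMode(cuTexRef, 0, CU_TR_ADDRESS_MODE_CLAMP);  // x
    cuTexRefSetAddressMode(cuTexRef, 1, CU_TR_ADDRESS_MODE_CLAMP);  // y
    cuTexRefSetFilterMode(cuTexRef, CU_TR_FILTER_MODE_LINEAR);
    cuTexRefSetFlags(cuTexRef, CU_TRSF_NORMALIZED_COORDINATES);
    cuTexRefSetFormat(cuTexRef, CU_AD_FORMAT_FLOAT, 1);
    cuParamSetTexRef(func, CU_PARAM_TR_DEFAULT, cuTexRef);  // pass to the kernel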

20 Asynchronous Execution
- Check CU_DEVICE_ATTRIBUTE_GPU_OVERLAP via cuDeviceGetAttribute().
- Streams: cuStreamCreate(), cuStreamDestroy()
- Events: cuEventCreate(), cuEventRecord(), cuEventSynchronize(), cuEventElapsedTime(), cuEventDestroy() (a timing sketch follows)
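A sketch of timing a kernel launch with the event calls above (func and blocksPerGrid as in the earlier examples):

    CUevent start, stop;
    cuEventCreate(&start, CU_EVENT_DEFAULT);
    cuEventCreate(&stop, CU_EVENT_DEFAULT);

    cuEventRecord(start, 0);                       // 0 = default stream
    cuLaunchGridAsync(func, blocksPerGrid, 1, 0);
    cuEventRecord(stop, 0);
    cuEventSynchronize(stop);                      // wait until stop has occurred

    float ms;
    cuEventElapsedTime(&ms, start, stop);          // elapsed time in milliseconds
    cuEventDestroy(start);
    cuEventDestroy(stop);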

21
- Shared memory: set the size for a function using cuFuncSetSharedSize().
- Multiple devices: cuDeviceGetCount(), cuDeviceGet() (see the sketch below)
- Error handling: errors are reported as in the runtime API; error codes from asynchronous functions can only be obtained after synchronization.
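A sketch of enumerating multiple devices with the calls above (cuDeviceGetName() is not on the slide but is part of the same API):

    int count = 0;
    cuDeviceGetCount(&count);
    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        char name[256];
        cuDeviceGet(&dev, i);
        cuDeviceGetName(name, sizeof(name), dev);
        printf("Device %d: %s\n", i, name);
    }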

22 Next time
Performance optimizations

23 See you next time!
