Programming with CUDA WS 08/09. Lecture 7 Thu, 13 Nov, 2008

1 Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008

2 Previously
- CUDA Runtime: Common
  - Built-in vector types
  - Math functions
  - Timing
  - Textures
- Textures
  - Texture fetch
  - Texture reference
  - Texture read modes
  - Normalized texture coordinates
  - Linear texture filtering

3 Today
- CUDA Runtime: Common, Device, Host

4 CUDA Runtime: Common, Device, Host

5 Device Runtime
- Can only be used in device code.
- Math functions: faster, less accurate versions of functions from the common component, named __<common_function_name>; for example, __logf is the fast counterpart of logf.
- The fast functions are listed in Appendix B of the Programming Guide.
- To use fast math by default, compile with the -use_fast_math option.
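
The trade-off between the two kinds of math function can be sketched as follows. This is a minimal example, not taken from the lecture: the kernel `logBothWays` and the helper `maxLogDifference` are hypothetical names, and the four sample inputs are arbitrary. It evaluates `logf` and the intrinsic `__logf` on the same values and reports how far the results drift apart.

```cuda
#include <cuda_runtime.h>
#include <cmath>

// Evaluates the logarithm twice per element: once with the common-component
// function logf, once with the device-only fast intrinsic __logf.
__global__ void logBothWays(const float* in, float* accurate, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = logf(in[i]);
        fast[i]     = __logf(in[i]);
    }
}

// Hypothetical helper (not from the lecture): returns the largest absolute
// difference between the two results, or -1.0f if a CUDA call fails.
float maxLogDifference()
{
    const int n = 4;
    const float hostIn[n] = {0.5f, 1.0f, 2.0f, 10.0f};  // arbitrary samples
    float hostAccurate[n], hostFast[n];
    float* dev = 0;  // one allocation holding in, accurate and fast back to back
    if (cudaMalloc(&dev, 3 * n * sizeof(float)) != cudaSuccess) return -1.0f;

    cudaMemcpy(dev, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);
    logBothWays<<<1, n>>>(dev, dev + n, dev + 2 * n, n);
    cudaMemcpy(hostAccurate, dev + n, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaMemcpy(hostFast, dev + 2 * n, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1.0f;

    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        float d = std::fabs(hostAccurate[i] - hostFast[i]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

If the file is compiled with -use_fast_math, both calls should map to the intrinsic and the reported difference would be expected to be zero.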

6 Device Runtime
- Synchronization function: __syncthreads()
- Synchronizes all threads in a block.
- Avoids read-after-write, write-after-read, and write-after-write hazards on shared memory that several threads access.
- Dangerous inside conditionals: if not every thread of the block reaches the call, the code can hang or have unwanted effects.
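
A read-after-write hazard of the kind described here can be illustrated with a small sketch. The kernel `reverseInBlock`, the helper `reverseWorks`, and the block size of 64 are assumptions made for this example only. Each thread writes one shared-memory element and then reads an element written by a different thread, so the barrier has to sit between the two steps, and it is called unconditionally by every thread.

```cuda
#include <cuda_runtime.h>

#define BLOCK 64  // hypothetical block size chosen for this sketch

// Reverses BLOCK integers in place using shared memory.
__global__ void reverseInBlock(int* data)
{
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;
    tile[t] = data[t];              // each thread writes one shared element
    __syncthreads();                // all writes must finish before any read
    data[t] = tile[BLOCK - 1 - t];  // read an element written by another thread
}

// Hypothetical host helper: fills 0..BLOCK-1, reverses the values on the
// device, and returns true if the result really is reversed.
bool reverseWorks()
{
    int host[BLOCK];
    for (int i = 0; i < BLOCK; ++i) host[i] = i;

    int* dev = 0;
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return false;
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
    reverseInBlock<<<1, BLOCK>>>(dev);
    cudaError_t err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < BLOCK; ++i)
        if (host[i] != BLOCK - 1 - i) return false;
    return true;
}
```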

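7 Device Runtime
- Atomic functions
- Guaranteed to execute without interference from other threads: the memory address is locked until the operation completes.
- Supported by CUDA cards with compute capability above 1.0.
- Mostly operate on integers only.
- See Appendix C of the Programming Guide.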
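
As a minimal sketch of an atomic function in use, the example below has many threads increment one integer counter with `atomicAdd`. The kernel `countEven` and the helper `countEvenOnDevice` are hypothetical names introduced for illustration, and the helper assumes `n` is positive.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Every thread that sees an even value increments the same counter.
__global__ void countEven(const int* values, int n, int* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && values[i] % 2 == 0)
        atomicAdd(counter, 1);  // read-modify-write as one uninterrupted step
}

// Hypothetical host helper: counts the even numbers in 0..n-1 on the device.
// Assumes n > 0. Returns -1 if a CUDA call fails.
int countEvenOnDevice(int n)
{
    std::vector<int> host(n);
    for (int i = 0; i < n; ++i) host[i] = i;

    int* devValues = 0;
    int* devCounter = 0;
    if (cudaMalloc(&devValues, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&devCounter, sizeof(int)) != cudaSuccess) {
        cudaFree(devValues);
        return -1;
    }
    cudaMemcpy(devValues, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(devCounter, 0, sizeof(int));

    countEven<<<(n + 255) / 256, 256>>>(devValues, n, devCounter);

    int result = -1;
    if (cudaMemcpy(&result, devCounter, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        result = -1;
    cudaFree(devValues);
    cudaFree(devCounter);
    return result;
}
```

With a plain `*counter += 1` instead of `atomicAdd`, concurrent threads could overwrite each other's updates and the count would be unreliable.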

8 Device Runtime
- Warp vote functions
- Supported by CUDA cards with compute capability >= 1.2.
- Check a condition on all threads in a warp:
    int __all(int predicate);   // true (non-zero) if predicate is true for all threads in the warp
    int __any(int predicate);   // true (non-zero) if predicate is true for any thread in the warp
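
The vote functions can be tried out with the sketch below. Note the assumption: current CUDA toolkits spell these intrinsics `__all_sync` and `__any_sync` and add a leading mask argument naming the participating threads, so the example uses those rather than the two-argument-free forms shown in the lecture. The kernel `voteOnLane` and the helper `voteBehavesAsExpected` are hypothetical names.

```cuda
#include <cuda_runtime.h>

// Launched with exactly one warp (32 threads). The predicate holds for the
// lower half of the warp only, so "all" should fail and "any" should succeed.
__global__ void voteOnLane(int* out)
{
    int lane = threadIdx.x;
    int inLowerHalf = lane < 16;
    int allTrue = __all_sync(0xffffffffu, inLowerHalf);  // mask: all 32 lanes
    int anyTrue = __any_sync(0xffffffffu, inLowerHalf);
    if (lane == 0) {
        out[0] = allTrue;
        out[1] = anyTrue;
    }
}

// Hypothetical host helper: returns true if "all" reported zero and "any"
// reported non-zero.
bool voteBehavesAsExpected()
{
    int host[2] = {-1, -1};
    int* dev = 0;
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return false;
    voteOnLane<<<1, 32>>>(dev);
    cudaError_t err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess && host[0] == 0 && host[1] != 0;
}
```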

9 Device Runtime
- Texture functions: fetching textures, or texturing
- Texture data may be stored in linear memory or in CUDA arrays.
- Texturing from linear memory:
    template<class Type>
    Type tex1Dfetch(texture<Type, 1, cudaReadModeElementType> texRef, int x);
    float tex1Dfetch(texture<Type, 1, cudaReadModeNormalizedFloat> texRef, int x);

10 Device Runtime
- Texture functions: fetching textures, or texturing
- Texturing from linear memory: Type can be any of the supported 1-, 2- or 4-component vector types.
    template<class Type>
    Type tex1Dfetch(texture<Type, 1, cudaReadModeElementType> texRef, int x);
    float4 tex1Dfetch(texture<uchar4, 1, cudaReadModeNormalizedFloat> texRef, int x);

11 Device Runtime
- Texture functions: fetching textures, or texturing
- Texturing from linear memory:
  - No addressing modes supported
  - No texture filtering supported

12 Device Runtime
- Texture functions: fetching textures, or texturing
- Texturing from CUDA arrays:
    template<class Type, enum cudaTextureReadMode readMode>
    Type tex1D(texture<Type, 1, readMode> texRef, float x);
    template<class Type, enum cudaTextureReadMode readMode>
    Type tex2D(texture<Type, 2, readMode> texRef, float x, float y);
    template<class Type, enum cudaTextureReadMode readMode>
    Type tex3D(texture<Type, 3, readMode> texRef, float x, float y, float z);

13 Device Runtime
- Texture functions: fetching textures, or texturing
- Texturing from CUDA arrays: run-time attributes determine
  - Coordinate normalization
  - Addressing mode (clamp/wrap)
  - Filtering

14 CUDA Runtime: Common, Device, Host

15 The host runtime can only be used by host functions.
- It is composed of 2 APIs:
  - the high-level CUDA runtime API, which runs on top of
  - the low-level CUDA driver API.
- No mixing: an application should use either one or the other.

16 Each API provides functions for
- Device management
- Context management
- Memory management
- Code module management
- Execution control
- Texture reference management
- OpenGL/Direct3D interoperability

17 The CUDA runtime API implicitly provides
- Initialization
- Context management
- Module management
The CUDA driver API does not, and is harder to program.

18 Recall: nvcc parses an input source file
- Separates device and host code.
- Device code is compiled to a cubin object.
- Generated host code, in C, is compiled by an external tool.

19 Generated host code
- Is in C format.
- Includes the cubin object.
- Applications may
  - ignore the host code and run the cubin object directly using the low-level CUDA driver API, or
  - link to the generated host code and launch it using the high-level CUDA runtime API.

20 The CUDA driver API
- Is harder to program
- Offers greater control
- Does not depend on C
- Does not offer device emulation

21 Naming
- CUDA runtime functions and other entry points are prefixed by cuda.
- CUDA driver functions and other entry points are prefixed by cu.

22 Detour
- Device memory is always allocated as either
  - linear memory, or
  - CUDA arrays.

23 Detour
- Linear memory in device
  - Contiguous segment of memory
  - 32-bit addresses
  - Can be referenced using pointers

24 Detour
- CUDA arrays
  - Opaque memory layout
  - 1D/2D/3D arrays whose elements have 1, 2 or 4 components, each an 8/16/32-bit integer or a 16/32-bit float
  - 16-bit floats are available from the driver API only
  - Optimized for texture fetching
  - Accessible from kernels through texture fetches only

25 Both the CUDA runtime and CUDA driver APIs
- Can access device information
- Enable the host to read/write to linear memory and CUDA arrays, with support for pinned memory

26 Both the CUDA runtime and CUDA driver APIs
- Can access device information
- Enable the host to read/write to linear memory and CUDA arrays, with support for pinned memory
- Provide OpenGL/Direct3D interoperability
- Provide management for asynchronous execution

27 Asynchronous functions
- Kernel launches, and some others:
  - Async memory copies
  - Device <-> device memory copies
  - Memory setting
- Concurrent execution of functions is managed through streams.

28 Streams
- A stream is a queue of operations.
- An application may have multiple stream objects simultaneously.
- A kernel can be scheduled to execute on a stream; the stream is the fourth launch parameter, s:
    kernel<<<ng, nb, ns, s>>>
- Some memory copy functions can also be queued on a stream.

29 Streams
- If no stream is specified, stream 0 is used by default.
- Operations within a stream are executed in order: each operation has to end before the next one begins.

30 CUDA runtime and driver APIs provide execution control through stream management
- <cu/cuda>StreamQuery(): is the stream free?
- <cu/cuda>StreamSynchronize(): wait for the stream's operations to end.

31 CUDA runtime and driver APIs provide execution control through stream management
- cudaThreadSynchronize() / cuCtxSynchronize(): wait for all streams to be free.
- <cu/cuda>StreamDestroy(): wait for the stream to get free, then destroy it.
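
The stream operations from the last few slides can be put together in a short runtime-API sketch. The kernel `addOne`, the helper `runTwoStreams`, and the buffer size are assumptions made for illustration. The host buffer is allocated as pinned memory with `cudaMallocHost`, since asynchronous copies rely on page-locked host memory. Each half of the buffer is queued on its own stream as copy in, kernel, copy out; within a stream these run in order, while the two streams are free to overlap.

```cuda
#include <cuda_runtime.h>

__global__ void addOne(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: processes two halves of a pinned buffer in two
// streams and returns true if every element was incremented exactly once.
bool runTwoStreams()
{
    const int n = 1 << 16;
    const int half = n / 2;
    float* host = 0;
    float* dev = 0;
    if (cudaMallocHost(&host, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(host);
        return false;
    }
    for (int i = 0; i < n; ++i) host[i] = float(i);

    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < 2; ++s) {
        float* h = host + s * half;
        float* d = dev + s * half;
        // All three operations are queued on streams[s] and return at once.
        cudaMemcpyAsync(d, h, half * sizeof(float), cudaMemcpyHostToDevice, streams[s]);
        addOne<<<(half + 255) / 256, 256, 0, streams[s]>>>(d, half);
        cudaMemcpyAsync(h, d, half * sizeof(float), cudaMemcpyDeviceToHost, streams[s]);
    }

    bool ok = true;
    for (int s = 0; s < 2; ++s) {
        // Wait for everything queued on this stream before reading results.
        if (cudaStreamSynchronize(streams[s]) != cudaSuccess) ok = false;
        cudaStreamDestroy(streams[s]);
    }
    for (int i = 0; i < n && ok; ++i)
        ok = (host[i] == float(i) + 1.0f);

    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}
```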

32 Accurate timing using events
    CUevent / cudaEvent_t start, stop;
    <cu/cuda>EventCreate(&start);
    <cu/cuda>EventCreate(&stop);
- Events have to be recorded:
    <cu/cuda>EventRecord(start, 0);   // asynchronous
    // stuff to time
    <cu/cuda>EventRecord(stop, 0);    // asynchronous
- The second argument is the stream. Stream 0: the event is recorded after all preceding operations from all streams. Stream N: the event is recorded after the preceding operations in stream N.

33 Accurate timing using events
    <cu/cuda>EventRecord(start, 0);   // asynchronous
    // stuff to time
    <cu/cuda>EventRecord(stop, 0);    // asynchronous
    <cu/cuda>EventSynchronize(stop);
    float time;
    <cu/cuda>EventElapsedTime(&time, start, stop);
- As the call to record is asynchronous, the stop event has to be synchronized before the elapsed time is read.
    <cu/cuda>EventDestroy(start);
    <cu/cuda>EventDestroy(stop);
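
Written out for the runtime API only, the timing recipe might look like the sketch below. The kernel `scaleAndShift` stands in for the "stuff to time", and both it and the helper `timeKernelMs` are hypothetical names. The elapsed time is reported in milliseconds.

```cuda
#include <cuda_runtime.h>

// Placeholder workload to be timed.
__global__ void scaleAndShift(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Hypothetical helper: returns the kernel's elapsed time in milliseconds,
// or -1.0f if a CUDA call fails.
float timeKernelMs()
{
    const int n = 1 << 20;
    float* dev = 0;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(dev, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                       // asynchronous
    scaleAndShift<<<(n + 255) / 256, 256>>>(dev, n); // stuff to time
    cudaEventRecord(stop, 0);                        // asynchronous

    // Block the host until the stop event has actually been recorded.
    float ms = -1.0f;
    if (cudaEventSynchronize(stop) == cudaSuccess) {
        if (cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) ms = -1.0f;
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return ms;
}
```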

34 Asynchronous execution can get confusing
- It can be switched off, which is useful for debugging.
- Set the environment variable CUDA_LAUNCH_BLOCKING to 1.

35 Device Initialization
- CUDA runtime API: happens automatically with the first function call.
- CUDA driver API: cuInit() MUST be called before calling any other API function.

36 Device Management
    cudaDeviceProp / CUdevice device;
    int devCount;
    cudaGetDeviceCount(&devCount) / cuDeviceGetCount(&devCount)
    for dev = 0 to devCount - 1 do
        cudaGetDeviceProperties(&device, dev) / cuDeviceGet(&device, dev)

37 Device Management
- cudaSetDevice() sets the device to be used.
- It MUST be called before calling any __global__ function.
- Device 0 is used by default.
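
For the runtime API, the enumeration loop and device selection can be sketched as follows. The helper name `listDevices` and the printed format are assumptions for this example. Device ordinals start at 0, so the loop runs from 0 to one less than the device count.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: prints one line per CUDA device, selects device 0,
// and returns the number of devices (or -1 if a CUDA call fails).
int listDevices()
{
    int devCount = 0;
    if (cudaGetDeviceCount(&devCount) != cudaSuccess) return -1;

    for (int dev = 0; dev < devCount; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }

    // Select the device explicitly before launching any kernel.
    if (devCount > 0 && cudaSetDevice(0) != cudaSuccess) return -1;
    return devCount;
}
```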

38 Stream Management
    CUstream / cudaStream_t st;
    cudaStreamCreate(&st); / cuStreamCreate(&st, 0);
    cudaStreamDestroy(st);

40 Event management
    CUevent / cudaEvent_t start, stop;
    <cu/cuda>EventCreate(&start);
    <cu/cuda>EventCreate(&stop);
    <cu/cuda>EventRecord(start, 0);   // asynchronous
    // stuff to time
    <cu/cuda>EventRecord(stop, 0);    // asynchronous
    <cu/cuda>EventSynchronize(stop);
    float time;
    <cu/cuda>EventElapsedTime(&time, start, stop);
    <cu/cuda>EventDestroy(start);
    <cu/cuda>EventDestroy(stop);

41 All for today
- Next time: more on the host runtime APIs
  - Memory, stream, event, texture management
  - Debug mode for runtime API
  - Context, module, execution control for driver API
- Performance & Optimization

42 See you next week!