Programming with CUDA

Jens K. Mueller
Department of Mathematics and Computer Science, Friedrich-Schiller-University Jena
Tuesday, 19th April 2011

Today's lecture: Synchronization and Texture Memory

Synchronization within Blocks
- Within a block, threads can be synchronized using __syncthreads().
- Acts as a barrier: all threads within the block have to reach it before any thread can proceed.
- Avoids data hazards for memory accesses.
- In conditional code, allowed only if the condition evaluates identically for the entire thread block.
- Expected to be lightweight.
- Can degrade device utilization.
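The barrier semantics can be illustrated with a minimal sketch that reverses an array inside one thread block. The kernel name reverseInBlock, the helper runReverseDemo, and the block size N are hypothetical names chosen for illustration. Each thread stores one element to shared memory and then reads an element written by a different thread, so the barrier between the two steps is what prevents a read-after-write hazard.

```cuda
#include <cuda_runtime.h>

#define N 256  // hypothetical block size, chosen for illustration

// Reverses N ints in place using a single thread block.
__global__ void reverseInBlock(int *data)
{
    __shared__ int tile[N];
    int t = threadIdx.x;

    tile[t] = data[t];          // each thread stages one element
    __syncthreads();            // barrier: all stores to tile are complete

    data[t] = tile[N - 1 - t];  // now safe to read another thread's element
}

// Host helper: returns 1 if the device result is the reversed input, else 0.
int runReverseDemo(void)
{
    int h[N];
    int *d = NULL;
    for (int i = 0; i < N; ++i) h[i] = i;

    if (cudaMalloc((void **)&d, sizeof(h)) != cudaSuccess) return 0;
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reverseInBlock<<<1, N>>>(d);
    cudaError_t err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return 0;

    for (int i = 0; i < N; ++i)
        if (h[i] != N - 1 - i) return 0;
    return 1;
}
```

Calling runReverseDemo() from main is enough to try it. Without the barrier, a thread could read a tile entry that its owner has not written yet.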

Synchronization within Blocks (cont.)
Additional functions for compute capability 2.x:
- int __syncthreads_count(int predicate)
  Same as __syncthreads(), but also evaluates predicate for all threads within the block and returns the number of threads for which it evaluates to non-zero.
- int __syncthreads_and(int predicate)
  Same as __syncthreads(), but also evaluates predicate for all threads within the block and returns non-zero iff it evaluates to non-zero for all of them.
- int __syncthreads_or(int predicate)
  Same as __syncthreads(), but also evaluates predicate for all threads within the block and returns non-zero iff it evaluates to non-zero for any of them.
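As a small sketch of the counting variant, the kernel below lets every block count how many of its elements are positive. The names countPositive and runCountDemo and the input pattern are assumptions made for this example. It needs a device of compute capability 2.x or higher.

```cuda
#include <cuda_runtime.h>

// Each block counts how many of its elements are positive.
__global__ void countPositive(const int *in, int *blockCounts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Barrier plus block-wide count of threads whose predicate is non-zero.
    // Every thread of the block receives the same return value.
    int n = __syncthreads_count(in[i] > 0);

    if (threadIdx.x == 0) blockCounts[blockIdx.x] = n;
}

// Host helper: returns the total number of positive elements, or -1 on error.
int runCountDemo(void)
{
    const int threads = 128, blocks = 2, n = threads * blocks;
    int h[n], hCounts[blocks];
    int *dIn = NULL, *dCounts = NULL;

    // Hypothetical input: every fourth element is positive.
    for (int i = 0; i < n; ++i) h[i] = (i % 4 == 0) ? 1 : -1;

    if (cudaMalloc((void **)&dIn, sizeof(h)) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&dCounts, sizeof(hCounts)) != cudaSuccess) {
        cudaFree(dIn);
        return -1;
    }
    cudaMemcpy(dIn, h, sizeof(h), cudaMemcpyHostToDevice);
    countPositive<<<blocks, threads>>>(dIn, dCounts);
    cudaError_t err = cudaMemcpy(hCounts, dCounts, sizeof(hCounts),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dCounts);
    if (err != cudaSuccess) return -1;

    int total = 0;
    for (int b = 0; b < blocks; ++b) total += hCounts[b];
    return total;
}
```

With this input, each block of 128 threads sees 32 positive values, so the helper returns 64.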

Synchronization for Memory Access
- __threadfence_block()
  The calling thread waits until all of its prior global/shared memory accesses are visible to all threads within the block.
- __threadfence()
  The calling thread waits until all of its prior shared memory accesses are visible to all threads within the block and all of its prior global memory accesses are visible to all threads in the device.
- __threadfence_system() (2.x only)
  The calling thread waits until its prior
  - shared memory accesses are visible to all threads within the block,
  - global memory accesses are visible to all threads in the device,
  - page-locked host memory accesses are visible to host threads.
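A common use of __threadfence() is letting the last block to finish combine the results of all blocks. The following is a minimal sketch of that pattern, not a tuned reduction. The names sumOfBlockContributions, blocksDone, partial, total, and runFenceDemo are hypothetical, and each block contributes only the trivial value blockIdx.x + 1.

```cuda
#include <cuda_runtime.h>

// Counts how many blocks have published their partial result.
__device__ unsigned int blocksDone = 0;

// Each block publishes one value; the last block to finish sums them all.
// partial is volatile so that the final reads are not served from a stale cache.
__global__ void sumOfBlockContributions(volatile int *partial, int *total)
{
    if (threadIdx.x == 0) {
        partial[blockIdx.x] = blockIdx.x + 1;  // this block's contribution

        // Make the write above visible device-wide before taking a ticket.
        __threadfence();

        // atomicInc returns the old value; it wraps to 0 only at gridDim.x.
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);

        if (ticket == gridDim.x - 1) {         // this block finished last
            int sum = 0;
            for (unsigned int b = 0; b < gridDim.x; ++b) sum += partial[b];
            *total = sum;
            blocksDone = 0;                    // reset for a later launch
        }
    }
}

// Host helper: returns the device-computed sum, or -1 on error.
int runFenceDemo(void)
{
    const int blocks = 8;
    int *dPartial = NULL, *dTotal = NULL, hTotal = -1;

    if (cudaMalloc((void **)&dPartial, blocks * sizeof(int)) != cudaSuccess)
        return -1;
    if (cudaMalloc((void **)&dTotal, sizeof(int)) != cudaSuccess) {
        cudaFree(dPartial);
        return -1;
    }
    sumOfBlockContributions<<<blocks, 32>>>(dPartial, dTotal);
    cudaError_t err = cudaMemcpy(&hTotal, dTotal, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dPartial);
    cudaFree(dTotal);
    return err == cudaSuccess ? hTotal : -1;
}
```

With 8 blocks the contributions are 1 through 8, so the helper returns 36. Without the fence, the last block could take its ticket and read partial[] before another block's write has become visible.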

Atomic Operations
- Make a read-modify-write on global/shared memory an atomic operation.
- Atomic: no other thread can interfere with the operation.
- Available since compute capability 1.1.
- Since 1.2: also shared memory, and 64-bit words in global memory.
- 2.x: 64-bit words in shared memory.
- Not atomic on page-locked memory as seen by host threads or other devices.
- Mainly signed/unsigned integer operations are supported.

Atomic Operations (cont.)
Atomic functions:
- Arithmetic: atomicAdd, atomicSub, atomicExch, atomicMin, atomicMax, atomicInc, atomicDec, and atomicCAS
- Bitwise: atomicAnd, atomicOr, and atomicXor
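A histogram is a typical case where many threads update the same memory location. The sketch below uses atomicAdd on global memory; the kernel name histogram, the helper runHistogramDemo, the bin count, and the input pattern are assumptions chosen for illustration.

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 4  // hypothetical bin count for this example

// Each thread classifies one input byte and increments the matching bin.
__global__ void histogram(const unsigned char *in, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[in[i] % NUM_BINS], 1u);  // atomic read-modify-write
}

// Host helper: returns 1 if every bin holds the expected count, else 0.
int runHistogramDemo(void)
{
    const int n = 1000, threads = 256;
    const int blocks = (n + threads - 1) / threads;
    unsigned char h[n];
    unsigned int hBins[NUM_BINS];
    unsigned char *dIn = NULL;
    unsigned int *dBins = NULL;

    for (int i = 0; i < n; ++i) h[i] = (unsigned char)(i % NUM_BINS);

    if (cudaMalloc((void **)&dIn, sizeof(h)) != cudaSuccess) return 0;
    if (cudaMalloc((void **)&dBins, sizeof(hBins)) != cudaSuccess) {
        cudaFree(dIn);
        return 0;
    }
    cudaMemcpy(dIn, h, sizeof(h), cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, sizeof(hBins));  // bins must start at zero
    histogram<<<blocks, threads>>>(dIn, n, dBins);
    cudaError_t err = cudaMemcpy(hBins, dBins, sizeof(hBins),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dBins);
    if (err != cudaSuccess) return 0;

    for (int b = 0; b < NUM_BINS; ++b)
        if (hBins[b] != (unsigned int)(n / NUM_BINS)) return 0;
    return 1;
}
```

Replacing atomicAdd with a plain bins[...]++ would let concurrent increments overwrite each other, and the counts would typically come out too small.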

Built-In Vector Types
- {type}{1,2,3,4}, where type is char, uchar, short, ushort, int, uint, long, or ulong
- longlong1, ulonglong1, longlong2, ulonglong2
- float1, float2, float3, float4, double1, double2
- Construct with make_<typename>(...)
- Components are accessible through x, y, z, and w
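As a brief sketch of how the vector types are used, the example below builds float4 values on the host with make_float4 and reads their components in a kernel. The names dot4 and runDot4Demo and the input values are hypothetical.

```cuda
#include <cuda_runtime.h>

// Computes the dot product of a[i] and b[i] for each i.
__global__ void dot4(const float4 *a, const float4 *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i].x * b[i].x + a[i].y * b[i].y
               + a[i].z * b[i].z + a[i].w * b[i].w;
}

// Host helper: returns 1 if all dot products match the expected values.
int runDot4Demo(void)
{
    const int n = 4;
    float4 hA[n], hB[n];
    float hOut[n];
    float4 *dA = NULL, *dB = NULL;
    float *dOut = NULL;

    for (int i = 0; i < n; ++i) {
        hA[i] = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
        hB[i] = make_float4((float)i, (float)i, (float)i, (float)i);
    }

    if (cudaMalloc((void **)&dA, sizeof(hA)) != cudaSuccess) return 0;
    if (cudaMalloc((void **)&dB, sizeof(hB)) != cudaSuccess) {
        cudaFree(dA);
        return 0;
    }
    if (cudaMalloc((void **)&dOut, sizeof(hOut)) != cudaSuccess) {
        cudaFree(dA);
        cudaFree(dB);
        return 0;
    }
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);
    dot4<<<1, n>>>(dA, dB, dOut, n);
    cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dOut);
    if (err != cudaSuccess) return 0;

    // (1 + 2 + 3 + 4) * i = 10 * i, exact in single precision here.
    for (int i = 0; i < n; ++i)
        if (hOut[i] != 10.0f * (float)i) return 0;
    return 1;
}
```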

CUDA Arrays
- Opaque memory layout
- Optimized for textures
- 1-, 2-, or 3-dimensional
- Elements have 1, 2, or 4 components, each a signed or unsigned integer or a float
- Readable from kernels only through texture fetches

CUDA Arrays (cont.)
  cudaError_t cudaMallocArray(struct cudaArray** array,
                              const struct cudaChannelFormatDesc* desc,
                              size_t width, size_t height = 0,
                              unsigned int flags = 0)

  struct cudaChannelFormatDesc {
    int x, y, z, w;
    enum cudaChannelFormatKind f;
  };

  enum cudaChannelFormatKind {
    cudaChannelFormatKindSigned,
    cudaChannelFormatKindUnsigned,
    cudaChannelFormatKindFloat
  };

CUDA Arrays (cont.)
  cudaError_t cudaMemcpy2DToArray(...)
  cudaError_t cudaMemcpy2DFromArray(...)
  cudaError_t cudaFreeArray(struct cudaArray* array)

Texture
- Texture: a region of linear memory or a CUDA array
- Texture reference: declared at compile time and bound at runtime to a texture
- Texture fetch: accessing the texture within kernels
- Read-only within kernels
- Optimized for 2D spatial locality
- Addressing modes allow simpler code
- Interpolation

Texture Reference
Declared at compile time as a static global variable:
  texture<Type, Dim, ReadMode> textureRef;
- Type is the type returned when fetching the texture. It is restricted to integer and single-precision floating-point types and the built-in 1-, 2-, and 4-component vector types.
- Dim is the dimensionality: 1, 2, or 3. Defaults to 1.
- ReadMode is either cudaReadModeNormalizedFloat or cudaReadModeElementType. Defaults to cudaReadModeElementType.

Texture Reference (cont.)
Defined at runtime:
- Texture coordinates (textureRef.normalized)
  - Not normalized: coordinates in [0, maxDim - 1]
  - Normalized: coordinates in [0, 1)
- Addressing mode (textureRef.addressMode[])
  - cudaAddressModeClamp
  - cudaAddressModeWrap
- Linear filtering for interpolation, only if floats are returned (textureRef.filterMode)
  - cudaFilterModeLinear
  - cudaFilterModePoint

Binding a Texture
Linear memory:
  texture<float, 2, cudaReadModeElementType> textureRef;
  cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
  cudaBindTexture2D(0, textureRef, devPtr, &channelDesc, width, height, pitch);
CUDA array:
  texture<float, 2, cudaReadModeElementType> textureRef;
  cudaBindTextureToArray(textureRef, cuArray);

Unbinding a Texture
  cudaUnbindTexture(textureRef);

Texture Fetching
To fetch the texture within kernels:
- Linear memory
    tex1Dfetch(textureRef, int x)
- CUDA arrays
    tex1D(textureRef, float x)
    tex2D(textureRef, float x, float y)
    tex3D(textureRef, float x, float y, float z)

Limitations for Texture References
- 1D texture reference bound to a CUDA array: 8192 for 1.x and [missing in source] for 2.x
- 1D texture reference bound to linear memory: [missing in source]
- 2D texture reference bound to a CUDA array/linear memory: [missing in source] x [missing in source] for 1.x and [missing in source] x [missing in source] for 2.x
- 3D texture reference bound to a CUDA array/linear memory: 2048 x 2048 x 2048
- Maximum number of textures bound to a kernel: 128

Example for using Texture Memory
1. Declare a texture reference
2. Allocate memory
3. Set runtime properties of the texture reference
4. Bind the texture reference to a texture
5. Launch a kernel that fetches the texture
6. Unbind the texture reference
7. Free memory
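The seven steps can be sketched as follows, using a 2D CUDA array and the texture reference API described in this lecture. This is an illustrative sketch under assumptions: the kernel readTexture, the helper runTextureDemo, and the sizes W and H are hypothetical, and the texture reference API was deprecated in later toolkits and removed in CUDA 12, so the code assumes a toolkit from CUDA 11 or earlier.

```cuda
#include <cuda_runtime.h>

#define W 16  // hypothetical texture width
#define H 8   // hypothetical texture height

// 1. Declare a texture reference (file scope).
texture<float, 2, cudaReadModeElementType> textureRef;

// 5. Kernel that fetches the texture. With unnormalized coordinates and
//    point filtering, (x + 0.5, y + 0.5) addresses the center of texel (x, y).
__global__ void readTexture(float *out)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < W && y < H)
        out[y * W + x] = tex2D(textureRef, x + 0.5f, y + 0.5f);
}

// Host helper: returns 1 if the fetched data equals the input, else 0.
int runTextureDemo(void)
{
    float hIn[W * H], hOut[W * H];
    for (int i = 0; i < W * H; ++i) hIn[i] = (float)i;

    // 2. Allocate memory: a CUDA array for the texture, linear memory for output.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *cuArray = NULL;
    float *dOut = NULL;
    if (cudaMallocArray(&cuArray, &desc, W, H) != cudaSuccess) return 0;
    if (cudaMalloc((void **)&dOut, sizeof(hOut)) != cudaSuccess) {
        cudaFreeArray(cuArray);
        return 0;
    }
    // Width and source pitch are given in bytes.
    cudaMemcpy2DToArray(cuArray, 0, 0, hIn, W * sizeof(float),
                        W * sizeof(float), H, cudaMemcpyHostToDevice);

    // 3. Set runtime properties of the texture reference.
    textureRef.normalized = false;
    textureRef.addressMode[0] = cudaAddressModeClamp;
    textureRef.addressMode[1] = cudaAddressModeClamp;
    textureRef.filterMode = cudaFilterModePoint;

    // 4. Bind the texture reference to the CUDA array.
    cudaBindTextureToArray(textureRef, cuArray);

    // 5. Launch the kernel.
    dim3 block(8, 8), grid((W + 7) / 8, (H + 7) / 8);
    readTexture<<<grid, block>>>(dOut);
    cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut),
                                 cudaMemcpyDeviceToHost);

    // 6. Unbind the texture reference.
    cudaUnbindTexture(textureRef);

    // 7. Free memory.
    cudaFreeArray(cuArray);
    cudaFree(dOut);
    if (err != cudaSuccess) return 0;

    for (int i = 0; i < W * H; ++i)
        if (hOut[i] != hIn[i]) return 0;
    return 1;
}
```

Point filtering is chosen here so that the fetched values equal the stored texels exactly; cudaFilterModeLinear would interpolate between neighbors instead.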

Read-Write Coherency
- Texture memory is cached, but the cache is not kept coherent within a kernel.
- Writing to the underlying memory and fetching it through the texture within the same kernel call results in undefined behavior.
- Writing to the memory is only safe from another kernel call or a memory operation.
