CUDA Basics. July 6, 2016

Florence Goodwin

1 CUDA Basics. Member of the Helmholtz Association. July 6, 2016

2 CUDA Kernels
Parallel portion of application: execute as a kernel.
Entire GPU executes kernel, many threads.
CUDA threads: lightweight, fast switching, 1000s execute simultaneously.

3 CUDA Kernels: Parallel Threads
A kernel is a function executed on the GPU as an array of threads in parallel.
All threads execute the same code, but can take different paths.
Each thread has an ID, used to select input/output data and to make control decisions:
    float x = input[threadIdx.x];
    float y = func(x);
    output[threadIdx.x] = y;
NVIDIA Corporation 2013
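The three lines above can be wrapped into a complete program. The following is a minimal sketch, not the course's code: `func` is not defined on the slide, so a hypothetical `2x + 1` stands in for it, and `apply` and `run_apply` are names invented here. It launches a single block, so `threadIdx.x` alone identifies each element, and it uses `cudaMallocManaged` (introduced on a later slide) to keep the host code short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the slide's unspecified func(x).
__device__ float func(float x) { return 2.0f * x + 1.0f; }

// Every thread runs the same code; its ID selects which element it handles.
__global__ void apply(const float* input, float* output)
{
    float x = input[threadIdx.x];
    float y = func(x);
    output[threadIdx.x] = y;
}

// Hypothetical host helper: fills input with 0..n-1, runs one block of
// n threads and returns output[idx].
// Assumes n fits in a single block (n <= 1024) and 0 <= idx < n.
float run_apply(int n, int idx)
{
    float *input, *output;
    cudaMallocManaged(&input, n * sizeof(float));
    cudaMallocManaged(&output, n * sizeof(float));
    for (int i = 0; i < n; ++i)
        input[i] = (float)i;

    apply<<<1, n>>>(input, output);   // 1 block of n threads
    cudaDeviceSynchronize();          // wait before reading on the host

    float result = output[idx];
    cudaFree(input);
    cudaFree(output);
    return result;
}

int main()
{
    printf("output[3] = %g\n", run_apply(8, 3));
    return 0;
}
```

With more than one block, `threadIdx.x` repeats in every block, which is why the scale kernel later in the deck combines it with `blockIdx.x` and `blockDim.x`.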

7 CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocks.
Blocks are grouped into a grid.
A kernel is executed as a grid of blocks of threads.

8 Kernel Execution
CUDA thread -> CUDA core
CUDA thread block -> CUDA Streaming Multiprocessor
CUDA kernel grid -> ...

9 Thread blocks allow scalability
Blocks can execute in any order, concurrently or sequentially.
This independence between blocks gives scalability: a kernel scales across any number of SMs.
[Figure: a kernel grid launch of Blocks 0 to 7, distributed over a device with 2 SMs (SM 0, SM 1) and over a device with 4 SMs (SM 0 to SM 3).]

10 Scale Kernel
CPU version:
    void scale(float alpha, float* A, float* C, int m)
    {
        int i = 0;
        for (i = 0; i < m; ++i)
            C[i] = alpha * A[i];
    }
GPU kernel:
    __global__ void scale(float alpha, float* A, float* C, int m)
    {
        // Global index of this thread across all blocks
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        // Guard: the grid may contain more threads than elements
        if (i < m)
            C[i] = alpha * A[i];
    }

11 Getting data in and out with Unified Memory
GPU has separate memory, but transfers can be managed by the runtime.
Allocate memory with cudaMallocManaged.
Free memory.

12 Define dimensions of thread block
    dim3 blockDim(size_t blockDimX, size_t blockDimY, size_t blockDimZ)
On JURECA (Tesla K80):
Max. dim. of a block: 1024 x 1024 x 64
Max. number of threads per block: 1024
Example:
    // Create 3D thread block with 512 threads
    dim3 blockDim(16, 16, 2);
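The limits quoted for the Tesla K80 differ between devices, so it can help to query them at run time. The sketch below uses `cudaGetDeviceProperties` for that and checks whether the 16 x 16 x 2 block from the example fits. It assumes the GPU of interest is device 0; `threads_in_block` is a helper name invented here, and the block variable is called `block` rather than `blockDim` to avoid confusion with the built-in device variable.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: total number of threads in a block of the given shape.
unsigned int threads_in_block(dim3 block)
{
    return block.x * block.y * block.z;
}

int main()
{
    cudaDeviceProp prop;
    // Assumption: the device of interest is device 0.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "Could not query CUDA device 0\n");
        return 1;
    }

    printf("Device: %s\n", prop.name);
    printf("Max. dim. of a block: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max. number of threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max. dim. of a grid: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    // The 3D block from the slide: 16 * 16 * 2 = 512 threads.
    dim3 block(16, 16, 2);
    bool fits = threads_in_block(block) <= (unsigned int)prop.maxThreadsPerBlock
             && block.x <= (unsigned int)prop.maxThreadsDim[0]
             && block.y <= (unsigned int)prop.maxThreadsDim[1]
             && block.z <= (unsigned int)prop.maxThreadsDim[2];
    printf("Block 16 x 16 x 2 has %u threads and %s on this device\n",
           threads_in_block(block), fits ? "fits" : "does not fit");
    return 0;
}
```

Both limits apply at once: each dimension must stay within its own maximum, and the product of the three must stay within the per-block thread limit.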

13 Define dimensions of grid
    dim3 gridDim(size_t gridDimX, size_t gridDimY, size_t gridDimZ)
On JURECA (Tesla K80):
Max. dim. of a grid: x x
Example:
    // Dimension of problem: nx x ny = 1000 x 1000
    dim3 blockDim(16, 16);  // Don't need to write z = 1
    // Round up so that the grid covers the whole problem
    int gx = (nx % blockDim.x == 0) ? nx / blockDim.x : nx / blockDim.x + 1;
    int gy = (ny % blockDim.y == 0) ? ny / blockDim.y : ny / blockDim.y + 1;
    dim3 gridDim(gx, gy);
Watch out!
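To see the rounding in action, the sketch below launches a 2D grid over the 1000 x 1000 problem from the example and counts how many cells were touched. It is an illustration under assumptions, not the course's code: the kernel `mark` and the helpers `blocks_for` and `count_marked` are invented here, and `blocks_for` uses the equivalent integer formula `(n + block - 1) / block` in place of the conditional expression on the slide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: sets every cell of an nx x ny field to 1.
__global__ void mark(int* field, int nx, int ny)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    // Guard: the rounded-up grid overshoots the problem size.
    if (ix < nx && iy < ny)
        field[iy * nx + ix] = 1;
}

// Hypothetical helper: number of blocks needed to cover n elements,
// rounding up when n is not a multiple of the block size.
unsigned int blocks_for(unsigned int n, unsigned int block)
{
    return (n + block - 1) / block;
}

// Hypothetical host helper: runs mark on an nx x ny field and returns
// how many cells were set.
int count_marked(int nx, int ny)
{
    int* field;
    cudaMallocManaged(&field, (size_t)nx * ny * sizeof(int));
    for (int i = 0; i < nx * ny; ++i)
        field[i] = 0;

    dim3 block(16, 16);  // z defaults to 1
    dim3 grid(blocks_for(nx, block.x), blocks_for(ny, block.y));
    mark<<<grid, block>>>(field, nx, ny);
    cudaDeviceSynchronize();

    int count = 0;
    for (int i = 0; i < nx * ny; ++i)
        count += field[i];
    cudaFree(field);
    return count;
}

int main()
{
    printf("blocks per dimension: %u\n", blocks_for(1000, 16));
    printf("cells marked: %d of %d\n", count_marked(1000, 1000), 1000 * 1000);
    return 0;
}
```

With 16 threads per dimension, 1000 / 16 rounds down to 62 blocks, which would leave the last 8 rows and columns untouched; rounding up gives 63 blocks, and the `if` guard in the kernel keeps the 8 surplus threads per dimension from writing out of bounds.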

14 Call the kernel
    kernel<<<int gridDim, int blockDim>>>([arg]*)
Call returns immediately! Kernel executes asynchronously.
Example:
    // m/blockDim rounds down: this covers all m elements only if m is a multiple of blockDim
    scale<<<m/blockDim, blockDim>>>(alpha, a_gpu, c_gpu, m);
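Because the launch returns immediately, errors do not surface at the call site on their own. A common pattern, sketched below, is to call `cudaGetLastError` right after the launch (which reports an invalid launch configuration) and `cudaDeviceSynchronize` afterwards (which waits for the kernel and reports execution errors). The wrapper name `launch_scale` and the block sizes are assumptions made for this illustration; the grid size is rounded up so that any `m` is covered.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float alpha, float* A, float* C, int m)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < m)
        C[i] = alpha * A[i];
}

// Hypothetical wrapper: launches scale and returns the first error seen.
cudaError_t launch_scale(float alpha, float* A, float* C, int m, int threads)
{
    int blocks = (m + threads - 1) / threads;  // round up
    scale<<<blocks, threads>>>(alpha, A, C, m); // returns immediately

    cudaError_t err = cudaGetLastError();       // launch-time errors
    if (err != cudaSuccess)
        return err;
    return cudaDeviceSynchronize();             // wait; execution errors
}

int main()
{
    const int m = 1000;
    float *a, *c;
    cudaMallocManaged(&a, m * sizeof(float));
    cudaMallocManaged(&c, m * sizeof(float));
    for (int i = 0; i < m; ++i) {
        a[i] = 1.0f;
        c[i] = 0.0f;
    }

    cudaError_t ok = launch_scale(2.0f, a, c, m, 256);
    printf("256 threads per block: %s, c[%d] = %g\n",
           cudaGetErrorString(ok), m - 1, c[m - 1]);

    // 2048 threads exceed the per-block limit, so the launch is rejected.
    cudaError_t bad = launch_scale(2.0f, a, c, m, 2048);
    printf("2048 threads per block: %s\n", cudaGetErrorString(bad));

    cudaFree(a);
    cudaFree(c);
    return 0;
}
```

Without the synchronization, the host could read `c` before the kernel has finished writing it.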

15 Calling the kernel
Define dimensions of thread block:
    dim3 blockDim(size_t blockDimX, size_t blockDimY, size_t blockDimZ)
Define dimensions of grid:
    dim3 gridDim(size_t gridDimX, size_t gridDimY, size_t gridDimZ)
Call the kernel:
    kernel<<<dim3 gridDim, dim3 blockDim>>>([arg]*)

16 Free device memory
    cudaFree(void* pointer)
Example:
    // Free the memory that a_gpu points to on the device
    cudaFree(a_gpu);
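Putting the Unified Memory steps together (allocate with `cudaMallocManaged`, launch, synchronize, free), a complete scale program might look like the sketch below. It is one possible arrangement and not the reference solution of the course exercise; the helper name `scale_with_unified_memory`, the block size of 256 and the test values are assumptions made here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float alpha, float* A, float* C, int m)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < m)
        C[i] = alpha * A[i];
}

// Hypothetical host helper: scales the vector 0..m-1 by alpha on the GPU
// and returns the number of elements that differ from the CPU result.
int scale_with_unified_memory(int m, float alpha)
{
    float *a, *c;
    // Managed memory is accessible from both host and device.
    cudaMallocManaged(&a, m * sizeof(float));
    cudaMallocManaged(&c, m * sizeof(float));
    for (int i = 0; i < m; ++i)
        a[i] = (float)i;

    int threads = 256;                         // assumed block size
    int blocks = (m + threads - 1) / threads;  // round up
    scale<<<blocks, threads>>>(alpha, a, c, m);
    cudaDeviceSynchronize();  // kernel must finish before the host reads c

    int mismatches = 0;
    for (int i = 0; i < m; ++i)
        if (c[i] != alpha * a[i])
            ++mismatches;

    cudaFree(a);
    cudaFree(c);
    return mismatches;
}

int main()
{
    int mismatches = scale_with_unified_memory(2048, 2.0f);
    printf("mismatching elements: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

No explicit copy appears anywhere: the runtime migrates the managed allocations between host and device as they are accessed.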

17 Exercise
CudaBasics/exercises/tasks/scale_vector
Compile with
    nvcc -o scale_vector scale_vector_um.cu

18 Getting data in and out
GPU has separate memory.
Allocate memory on device.
Transfer data from host to device.
Transfer data from device to host.
Free device memory.

19 Allocate memory on device
    cudaMalloc(T** pointer, size_t nbytes)
Example:
    // Allocate a vector of 2048 floats on device
    float* a_gpu;
    int n = 2048;
    // &a_gpu: address of the pointer; sizeof(float): size of a float
    cudaMalloc(&a_gpu, n * sizeof(float));

20 Copy from host to device
    cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind dir)
Example:
    // Copy vector of floats a of length n=2048 to a_gpu on device
    cudaMemcpy(a_gpu, a, n * sizeof(float), cudaMemcpyHostToDevice);

21 Copy from device to host
    cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind dir)
Example:
    // Copy vector of floats a_gpu of length n=2048 to a on host
    cudaMemcpy(a, a_gpu, n * sizeof(float), cudaMemcpyDeviceToHost);
Note the order of the arguments (destination first) and the changed flag.
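The allocate, copy-in, launch and copy-out steps combine into the following sketch, which scales a vector with explicit transfers and verifies the result on the host. The helper name `scale_with_explicit_copies`, the block size and the test data are assumptions made for illustration; the `scale` kernel is the one from the earlier slide.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float alpha, float* A, float* C, int m)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < m)
        C[i] = alpha * A[i];
}

// Hypothetical host helper: scales the vector 0..m-1 by alpha on the GPU
// using explicit copies; returns the number of wrong elements.
int scale_with_explicit_copies(int m, float alpha)
{
    std::vector<float> a(m), c(m, 0.0f);
    for (int i = 0; i < m; ++i)
        a[i] = (float)i;

    size_t nbytes = m * sizeof(float);
    float *a_gpu, *c_gpu;
    cudaMalloc(&a_gpu, nbytes);  // allocate on device
    cudaMalloc(&c_gpu, nbytes);

    // Host -> device: destination first, then source.
    cudaMemcpy(a_gpu, a.data(), nbytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // assumed block size
    int blocks = (m + threads - 1) / threads;  // round up
    scale<<<blocks, threads>>>(alpha, a_gpu, c_gpu, m);

    // Device -> host: this blocking copy runs after the kernel has finished.
    cudaMemcpy(c.data(), c_gpu, nbytes, cudaMemcpyDeviceToHost);

    cudaFree(a_gpu);
    cudaFree(c_gpu);

    int mismatches = 0;
    for (int i = 0; i < m; ++i)
        if (c[i] != alpha * a[i])
            ++mismatches;
    return mismatches;
}

int main()
{
    int mismatches = scale_with_explicit_copies(2048, 2.0f);
    printf("mismatching elements: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Swapping the first two arguments of `cudaMemcpy` or using the wrong direction flag is a frequent mistake, which is presumably what the slide's note about the order and the flag refers to.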

22 Unified Virtual Address Space (UVA)
Requires a 64-bit process and compute capability 2.0 or higher.
cudaMalloc*(...) and cudaHostAlloc(...) return UVA pointers.
cudaMemcpy*(..., cudaMemcpyDefault): the copy direction is inferred from the pointers.
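As an illustration of `cudaMemcpyDefault`, the sketch below copies a buffer to the device and back without naming a direction in either call. It assumes device 0 supports unified addressing (checked through the `unifiedAddressing` field of `cudaDeviceProp`); the helper name `uva_round_trip` and the buffer contents are invented for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical host helper: copies n floats host -> device -> host using
// cudaMemcpyDefault; returns the number of elements that changed,
// or -1 if device 0 does not support unified addressing.
int uva_round_trip(int n)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.unifiedAddressing)
        return -1;

    size_t nbytes = n * sizeof(float);
    float *host_in, *host_out, *dev;
    // Host memory from cudaHostAlloc and device memory from cudaMalloc
    // both live in the unified virtual address space.
    cudaHostAlloc((void**)&host_in, nbytes, cudaHostAllocDefault);
    cudaHostAlloc((void**)&host_out, nbytes, cudaHostAllocDefault);
    cudaMalloc((void**)&dev, nbytes);

    for (int i = 0; i < n; ++i) {
        host_in[i] = (float)i;
        host_out[i] = -1.0f;
    }

    // The runtime infers the direction from the pointer values.
    cudaMemcpy(dev, host_in, nbytes, cudaMemcpyDefault);
    cudaMemcpy(host_out, dev, nbytes, cudaMemcpyDefault);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (host_out[i] != host_in[i])
            ++mismatches;

    cudaFree(dev);
    cudaFreeHost(host_in);
    cudaFreeHost(host_out);
    return mismatches;
}

int main()
{
    printf("mismatching elements after round trip: %d\n", uva_round_trip(2048));
    return 0;
}
```

Memory obtained with `cudaHostAlloc` is released with `cudaFreeHost`, not `cudaFree`.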

23 Getting data in and out
Allocate memory on device:
    cudaMalloc(void** pointer, size_t nbytes)
Transfer data between host and device:
    cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind dir)
    dir = cudaMemcpyHostToDevice
    dir = cudaMemcpyDeviceToHost
Free device memory:
    cudaFree(void* pointer)

24 Exercise Scale Vector
Allocate memory on device:
    cudaMalloc(T** pointer, size_t nbytes)
Transfer data between host and device:
    cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind dir)
    dir = cudaMemcpyHostToDevice
    dir = cudaMemcpyDeviceToHost
Free device memory:
    cudaFree(void* pointer)
Define dimensions of thread block:
    dim3 blockDim(size_t blockDimX, size_t blockDimY, size_t blockDimZ)
Define dimensions of grid:
    dim3 gridDim(size_t gridDimX, size_t gridDimY, size_t gridDimZ)
Call the kernel:
    kernel<<<dim3 gridDim, dim3 blockDim>>>([arg]*)

25 Exercise
CudaBasics/exercises/tasks/jacobi_w_explicit_transfer
Compile with
    make jacobi

Sharing the resources of an SM
Warp 0, Warp 1, ..., Warp 47
Register file: a single large register file (e.g. 16K registers) is partitioned among the threads of the dispatched blocks.
Shared: a single SRAM (e.g. 16 KB)