1 / 34 GPU Programming, Lecture 2: CUDA C Basics
Miaoqing Huang, University of Arkansas
2 / 34 Outline
  Evolution of NVIDIA GPUs
  CUDA Basics
  Detailed Steps
    Device Memories and Data Transfer
    Kernel Functions and Threading
3 / 34 Outline
  Evolution of NVIDIA GPUs
  CUDA Basics
  Detailed Steps
    Device Memories and Data Transfer
    Kernel Functions and Threading
4 / 34 Architecture of G80 GPU
[Figure: G80 block diagram showing the streaming multiprocessors and their special function units (SFUs)]
128 streaming processors in 16 streaming multiprocessors
5 / 34 Architecture of GT200 GPU (e.g., GeForce GTX 295)
[Figure: GT200 block diagram showing the streaming multiprocessors and their special function units (SFUs)]
240 streaming processors in 30 streaming multiprocessors
6 / 34 Architecture of Fermi GPU (e.g., GeForce GTX 480, Tesla C2075)
[Figure: 16-SM Fermi GPU with GigaThread engine, host interface, six DRAM partitions, and a shared 768 KB L2 cache. Each Fermi streaming multiprocessor (SM) contains an instruction cache, two warp schedulers with dispatch units, a 32,768-entry 32-bit register file, 32 CUDA cores (each with a dispatch port, operand collector, FP unit, INT unit, and result queue), 16 LD/ST units, 4 special function units, an interconnect network, and 64 KB of shared memory / L1 cache]
512 Fermi streaming processors in 16 streaming multiprocessors
7 / 34 Architecture of Kepler GPU (e.g., Tesla K20) 2,880 streaming processors in 15 streaming multiprocessors
8 / 34 Architecture of Kepler Streaming Multiprocessor Each streaming multiprocessor contains 192 single-precision cores and 64 double-precision cores
9 / 34 Comparison among Three Architectures
GPU                           G80            GT200          Fermi
Transistors                   681 million    1.4 billion    3.0 billion
CUDA Cores                    128            240            512
Double Precision Capability   None           30 FMAs/clk    256 FMAs/clk
Single Precision Capability   128 MADs/clk   240 MADs/clk   512 FMAs/clk
Special Function Units / SM   2              2              4
Shared Memory / SM            16 KB          16 KB          48 KB / 16 KB
L1 Cache / SM                 None           None           16 KB / 48 KB
L2 Cache                      None           None           768 KB
ECC Support                   No             No             Yes

Multiply-Add (MAD): A x B = Product (extra digits truncated), then + C = Result
Fused Multiply-Add (FMA): A x B = Product, + C (all digits retained), = Result
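To make the MAD/FMA distinction concrete, below is a minimal host-side sketch (not from the slides) using the C math library's fmaf(). The constants are made up and chosen only so that rounding the product first visibly discards low bits; compile with contraction disabled (e.g., gcc -ffp-contract=off) so the separate multiply-add is not silently fused by the compiler.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        float a = 1.0f + ldexpf(1.0f, -12);     /* 1 + 2^-12                          */
        float c = -(1.0f + ldexpf(1.0f, -11));  /* -(1 + 2^-11)                       */
        float mad   = a * a + c;                /* product rounded first -> 0         */
        float fused = fmaf(a, a, c);            /* single rounding at the end -> 2^-24 */
        printf("multiply-add: %g   fused multiply-add: %g\n", mad, fused);
        return 0;
    }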
10 / 34 Fermi vs. Kepler: Compute Capability of Fermi and Kepler GPUs
                                            FERMI       FERMI       KEPLER         KEPLER
                                            GF100       GF104       GK104          GK110
Compute Capability                          2.0         2.1         3.0            3.5
Threads / Warp                              32          32          32             32
Max Warps / Multiprocessor                  48          48          64             64
Max Threads / Multiprocessor                1536        1536        2048           2048
Max Thread Blocks / Multiprocessor          8           8           16             16
32-bit Registers / Multiprocessor           32768       32768       65536          65536
Max Registers / Thread                      63          63          63             255
Max Threads / Thread Block                  1024        1024        1024           1024
Shared Memory Size Configurations (bytes)   16K / 48K   16K / 48K   16K / 32K / 48K   16K / 32K / 48K
Max X Grid Dimension                        2^16 - 1    2^16 - 1    2^32 - 1       2^32 - 1
Hyper-Q                                     No          No          No             Yes
Dynamic Parallelism                         No          No          No             Yes
11 / 34 Compute Capability
The compute capability of a device is defined by a major revision number and a minor revision number
Devices with the same major revision number are of the same core architecture
  Kepler architecture: 3.x
  Fermi architecture: 2.x
  Prior devices: 1.x
NVIDIA CUDA C Programming Guide (v7.5)
  Appendix A lists all CUDA-enabled devices along with their compute capability
  Appendix G gives the technical specifications of each compute capability
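The compute capability can also be queried at run time. A minimal sketch (not from the lecture) using the CUDA runtime API:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   /* properties of device 0 */
        printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
        return 0;
    }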
12 / 34 Outline
  Evolution of NVIDIA GPUs
  CUDA Basics
  Detailed Steps
    Device Memories and Data Transfer
    Kernel Functions and Threading
13 / 34 Data Parallelism: Square Matrix Multiplication Example
[Figure: matrices M, N, and P]
P = M x N, where M, N, and P are square matrices of size width x width
14 / 34 Data Parallelism: Square Matrix Multiplication Example
P = M x N, where M, N, and P are square matrices of size width x width
The approach in C
    for (i = 0; i < width; i++) {
      for (j = 0; j < width; j++) {
        ...
        P[i][j] = ...;
        ...
      }
    }
15 / 34 Data Parallelism: Square Matrix Multiplication Example
P = M x N, where M, N, and P are square matrices of size width x width
The approach in C
    for (i = 0; i < width; i++) {
      for (j = 0; j < width; j++) {
        ...
        P[i][j] = ...;
        ...
      }
    }
The approach in CUDA (the straightforward way)
  One thread calculates one element of P
  Issue width x width threads simultaneously
16 / 34 The Heterogeneous Platform
A typical GPGPU-capable platform consists of
  One or more microprocessors (CPUs): the host
  One or more GPUs: the devices
GPU devices are connected to the CPUs through the PCI Express (PCIe) bus
A CUDA program consists of
  The code on the CPU: the software part
  The code on the GPU: the hardware part
[Figure: example platform with CPUs linked by 25.6 GB/s QPI, GPUs attached over unidirectional 5.8 GB/s 16x PCIe 2.0 links, and GPU memory bandwidths of 102 GB/s and 144 GB/s]
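As a small illustration (an assumption, not lecture code), the host side of a CUDA program can discover the GPUs attached to such a platform and select one before doing any device work:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);     /* number of CUDA-capable GPUs on the PCIe bus */
        printf("Found %d CUDA device(s)\n", count);
        if (count > 0)
            cudaSetDevice(0);           /* later CUDA calls in this host thread use device 0 */
        return 0;
    }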
17 / 34 CUDA Program Structure
Integrated host+device application C program
  Sequential or modestly parallel parts in host C code
  Highly parallel parts in device SPMD kernel C code
Serial Code (host)
  ...
Parallel Kernel (device): Grid 0
  KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
  ...
Parallel Kernel (device): Grid 1
  KernelB<<< nBlk, nTid >>>(args);
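A rough, self-contained sketch of this structure (not the lecture's code; the kernel name incrementKernel and the size N are illustrative, and the memory-management calls are explained in detail on later slides):

    #include <stdio.h>
    #include <cuda_runtime.h>

    #define N 256

    __global__ void incrementKernel(float *a) {
        int i = threadIdx.x;                  /* one thread per array element */
        a[i] = a[i] + 1.0f;
    }

    int main(void) {
        float h_a[N];
        float *d_a;
        for (int i = 0; i < N; i++) h_a[i] = (float)i;      /* serial host code */

        cudaMalloc((void**)&d_a, N * sizeof(float));
        cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);

        incrementKernel<<<1, N>>>(d_a);                      /* parallel kernel (Grid 0) */

        cudaMemcpy(h_a, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_a);
        printf("h_a[10] = %f\n", h_a[10]);                   /* serial host code again */
        return 0;
    }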
18 / 34 CUDA Thread Organization
[Figure: the host launches Kernel 1 and Kernel 2 on the device; each kernel executes as a grid (Grid 1, Grid 2) of thread blocks, and one block, Block (1,1), is expanded to show its threads with 3-D indices. Courtesy: NVIDIA]
A kernel is implemented as a grid of threads
Threads in a grid are further decomposed into blocks
A grid can be up to three-dimensional
A thread block is a batch of threads that can cooperate with each other by:
19 / 34 CUDA Thread Organization
[Figure: the host launches Kernel 1 and Kernel 2 on the device; each kernel executes as a grid (Grid 1, Grid 2) of thread blocks, and one block, Block (1,1), is expanded to show its threads with 3-D indices. Courtesy: NVIDIA]
A kernel is implemented as a grid of threads
Threads in a grid are further decomposed into blocks
A grid can be up to three-dimensional
A thread block is a batch of threads that can cooperate with each other by:
  Synchronizing their execution
  Efficiently sharing data through shared memory
A block can be one-, two-, or three-dimensional
Each block and each thread has its own ID
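As a hedged sketch of how these IDs are typically used (blockDim, the number of threads per block, is introduced on a later slide), a one-dimensional kernel can combine the block ID and thread ID into a unique global index:

    __global__ void scaleKernel(float *data, float alpha, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index   */
        if (i < n)                                       /* guard the extra threads */
            data[i] = alpha * data[i];
    }

    /* Example launch: enough blocks of 256 threads to cover n elements.   */
    /* scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);             */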
20 / 34 Matrix Multiplication: the main() function
    int main(void)
    {
      // 1.
      // Allocate and initialize the matrices M, N, P
      // I/O to read the input matrices M and N
      ...
      // 2.
      // M*N on the processor
      MatrixMultiplication(M, N, P, width);
      // 3.
      // I/O to write the output matrix P
      // Free matrices M, N, P
      ...
      return 0;
    }
21 / 34 Matrix Multiplication: A Simple Host Version in C
    void MatrixMultiplication(float* M, float* N, float* P, int width)
    {
      for (int i = 0; i < width; i++) {
        for (int j = 0; j < width; j++) {
          float sum = 0;
          for (int k = 0; k < width; k++) {
            float a = M[i*width+k];
            float b = N[k*width+j];
            sum += a*b;
          }
          P[i*width+j] = sum;
        }
      }
    }
[Figure: a 4x4 matrix stored in row-major order; element M(i,j) is addressed as M[i*width+j], with i the row and j the column]
22 / 34 Matrix Multiplication: Move computation to the device
    void MatrixMultiplication(float* M, float* N, float* P, int width)
    {
      int size = width*width*sizeof(float);
      float *Md, *Nd, *Pd;
      ...
      // 1.
      // Allocate device memory for M, N, and P
      // Copy M and N to the allocated device memory locations

      // 2.
      // Kernel invocation code - to have the device perform
      // the actual matrix multiplication

      // 3.
      // Copy P from the device memory
      // Free device matrices
    }
23 / 34 Outline
  Evolution of NVIDIA GPUs
  CUDA Basics
  Detailed Steps
    Device Memories and Data Transfer
    Kernel Functions and Threading
24 / 34 Memory Hierarchy on Device
[Figure: a grid of blocks; each block has its own shared memory, each thread its own registers, and all blocks as well as the host access the per-grid global memory]
Memory hierarchy on the device
  Global memory: main means of communicating between host and device; long latency access
  Shared memory: per-block; short latency
  Registers: per-thread local variables; short latency
Data access
  Device code can read/write per-thread registers, per-block shared memory, and per-grid global memory
  Host code can transfer data to/from per-grid global memory
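A small sketch (not from the slides) of where each memory space appears in device code; the kernel name and tile size are made up, and the launch is assumed to use exactly TILE threads per block:

    #define TILE 64

    __global__ void memorySpacesKernel(float *g_in, float *g_out) {
        __shared__ float tile[TILE];             /* per-block shared memory            */
        int i = threadIdx.x;                     /* per-thread register                */
        tile[i] = g_in[blockIdx.x * TILE + i];   /* read from per-grid global memory   */
        __syncthreads();                         /* synchronize the threads in the block */
        /* reverse the tile within the block, then write back to global memory */
        g_out[blockIdx.x * TILE + i] = tile[TILE - 1 - i];
    }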
25 / 34 CUDA Device Memory Allocation
[Figure: device memory hierarchy - grid with blocks, per-block shared memory, per-thread registers, and the global memory accessed by the host]
cudaMalloc()
  Allocates an object in the device global memory
  Requires two parameters
    1. Address of a pointer to the allocated object
    2. Size of the allocated object
cudaFree()
  Frees an object from the device global memory
  Requires the pointer to the freed object
    float* Md;   // d indicates a device data structure
    int size = width*width*sizeof(float);
    cudaMalloc((void**)&Md, size);
    cudaFree(Md);
26 / 34 CUDA Host-Device Data Transfer
[Figure: device memory hierarchy, with data transfers between host memory and device global memory]
cudaMemcpy()
  Memory data transfer
  Requires four parameters
    1. Pointer to destination
    2. Pointer to source
    3. Number of bytes copied
    4. Type of transfer
  Types of transfer
    cudaMemcpyHostToHost
    cudaMemcpyHostToDevice
    cudaMemcpyDeviceToHost
    cudaMemcpyDeviceToDevice
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
27 / 34 Matrix Multiplication: After Integrating the Data Transfer
    void MatrixMultiplication(float* M, float* N, float* P, int width)
    {
      int size = width*width*sizeof(float);
      float *Md, *Nd, *Pd;

      // 1. Allocate and load M, N to device memory
      cudaMalloc((void**)&Md, size);
      cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
      cudaMalloc((void**)&Nd, size);
      cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
      // Allocate P on the device
      cudaMalloc((void**)&Pd, size);

      // 2. Kernel invocation code -- to be shown later

      // 3. Read P from the device
      cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
      // Free device matrices
      cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }
28 / 34 CUDA Thread Organization
[Figure: the host launches Kernel 1 and Kernel 2 on the device; each kernel executes as a grid (Grid 1, Grid 2) of thread blocks, and one block, Block (1,1), is expanded to show its threads with 3-D indices. Courtesy: NVIDIA]
A kernel is implemented as a grid of threads
Threads in a grid are further decomposed into blocks
A grid can be up to three-dimensional
A thread block is a batch of threads that can cooperate with each other by:
  Synchronizing their execution
  Efficiently sharing data through shared memory
A block can be one-, two-, or three-dimensional
Each block and each thread has its own ID
29 / 34 Index (i.e., coordinates) of Block and Thread
[Figure: a block of threads laid out along the x, y, and z axes (thread coordinates tx, ty, tz), shown next to a 4x4 matrix M]
Each block and each thread are assigned an index, i.e., blockIdx and threadIdx
  blockIdx.x, blockIdx.y, blockIdx.z
  threadIdx.x, threadIdx.y, threadIdx.z
30 / 34 Index (i.e., coordinates) of Block and Thread
[Figure: a block of threads laid out along the x, y, and z axes (thread coordinates tx, ty, tz), shown next to a 4x4 matrix M]
Each block and each thread are assigned an index, i.e., blockIdx and threadIdx
  blockIdx.x, blockIdx.y, blockIdx.z
  threadIdx.x, threadIdx.y, threadIdx.z
threadIdx.y indexes the row, threadIdx.x indexes the column
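A hedged sketch of this row/column mapping (not from the slides). The simple kernel on a later slide uses only threadIdx, i.e., a single block; the blockIdx terms shown here are the usual extension to matrices larger than one block:

    __global__ void indexDemoKernel(float *Pd, int width) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;   /* x indexes the column */
        int row = blockIdx.y * blockDim.y + threadIdx.y;   /* y indexes the row    */
        if (row < width && col < width)
            Pd[row * width + col] = (float)(row * width + col);
    }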
31 / 34 Define the Dimension of Grid and Block
Use the pre-defined type dim3 to define the dimensions of the grid and the block
Use gridDim and blockDim to get the dimensions of the grid and the block
    dim3 dimGrid(width_g, height_g, depth_g);
    dim3 dimBlock(width_b, height_b, depth_b);
Total number of threads issued = width_g * height_g * depth_g * width_b * height_b * depth_b
Launch the device computation threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);
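A hedged sketch of the "kernel invocation code" left as step 2 on the earlier data-transfer slide, assuming the simple one-thread-per-element kernel defined on the next slide (it uses only threadIdx, so the whole matrix must fit in a single block):

    dim3 dimGrid(1, 1);              /* one block                       */
    dim3 dimBlock(width, width);     /* width x width threads per block */
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);
    /* Note: a block is limited to 1024 threads, so this works only for width <= 32. */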
32 / 34 Kernel Function Specification
    // Matrix multiplication kernel -- per thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
    {
      int tx = threadIdx.x;
      int ty = threadIdx.y;
      // Pvalue is used to store the element of the matrix
      // that is computed by the thread
      float Pvalue = 0;
      for (int k = 0; k < width; ++k) {
        float Melement = Md[ty*width+k];
        float Nelement = Nd[k*width+tx];
        Pvalue += Melement * Nelement;
      }
      Pd[ty*width+tx] = Pvalue;
    }
33 / 34 CUDA Function Declarations
                                       Executed on   Only callable from
    __device__ float DeviceFunc()      device        device
    __global__ void  KernelFunc()      device        host
    __host__   float HostFunc()        host          host
__global__ defines a kernel function
  Must be a void function
__device__ and __host__ can be used together
  Generates two versions of the same function during compilation
For functions executed on the device
  __global__ functions do not support recursion
  __device__ functions support recursion only in device code compiled for devices of compute capability 2.x and higher
  No static variable declarations inside the function
  No variable number of arguments
  No indirect function calls through pointers
All functions are __host__ functions by default
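A brief sketch of these qualifiers in use (the function names are illustrative, not from the slides):

    __device__ float deviceSquare(float x) {           /* callable only from device code        */
        return x * x;
    }

    __host__ __device__ float clampToOne(float x) {    /* compiled for both host and device      */
        return (x > 1.0f) ? 1.0f : x;
    }

    __global__ void squareKernel(float *a, int n) {    /* kernel: launched from the host         */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            a[i] = clampToOne(deviceSquare(a[i]));
    }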
34 / 34 Check the return code of CUDA APIs
    cudaError_t cuda_ret;
    cuda_ret = cudaMalloc((void**)&A_d, sizeof(float)*n);
    if (cuda_ret != cudaSuccess) {
      printf("some error message");
      exit(-1);
    }
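A common convenience pattern (an assumption, not part of the lecture) is to wrap every CUDA API call in a macro that prints the error string from cudaGetErrorString() and aborts on failure:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err_ = (call);                                    \
            if (err_ != cudaSuccess) {                                    \
                fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                        cudaGetErrorString(err_), __FILE__, __LINE__);    \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    /* Usage: CUDA_CHECK(cudaMalloc((void**)&A_d, sizeof(float)*n)); */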