ECE 8823A GPU Architectures
Module 2: Introduction to CUDA C

Objective
- To understand the major elements of a CUDA program
- Introduce the basic constructs of the programming model
- Illustrate the preceding with a simple but complete CUDA program
Reading Assignment
- Kirk and Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Chapter 3
- CUDA Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract

CUDA/OpenCL Execution Model
- Reflects a multicore processor + GPGPU execution model
- Addition of device functions and declarations to stock C programs
- Compilation separated into host and GPU paths
Compiling a CUDA Program
- Integrated C programs with CUDA extensions
- The NVCC compiler separates the source into host code and device code (PTX, a virtual ISA)
- Host code goes through the host C compiler/linker; device code goes through the device just-in-time compiler
- The result runs on a heterogeneous computing platform with CPUs and GPUs

CUDA/OpenCL Execution Model
- Integrated host+device application C program
  - Serial or modestly parallel parts in host C code
  - Highly parallel parts in device SPMD kernel C code
- Execution alternates between serial code on the host and parallel kernels on the device:
    Serial Code (host)
    KernelA<<< nblk, ntid >>>(args);   // Parallel Kernel (device)
    Serial Code (host)
    KernelB<<< nblk, ntid >>>(args);   // Parallel Kernel (device)
Thread Hierarchy
- All threads execute the same kernel code
- Memory hierarchy is tied to the thread hierarchy (later)
(Figure from http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy)

CUDA/OpenCL Programming Model: The Grid
- The user application issues host tasks to the CPU and GPU tasks to the GPU through the software stack and operating system
- Each kernel executes over an N-dimensional range: a grid (gridDim.x by gridDim.y blocks) of 3D thread blocks
- Note: each thread executes the same kernel code!
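To make the grid concrete, here is a minimal sketch (not from the slides; the kernel name matAddKernel and the 16x16 block shape are illustrative choices) of launching a 2D grid of 2D thread blocks over a matrix:

  // Hypothetical 2D example: each thread handles one matrix element
  __global__ void matAddKernel(float *A, float *B, float *C, int width, int height)
  {
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      if (row < height && col < width)
          C[row * width + col] = A[row * width + col] + B[row * width + col];
  }

  // Host side: 16x16 threads per block, enough blocks to cover the matrix
  dim3 block(16, 16, 1);
  dim3 grid((width + 15) / 16, (height + 15) / 16, 1);
  matAddKernel<<<grid, block>>>(A_d, B_d, C_d, width, height);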
Vector Addition: Conceptual View
- Element-wise: each C[i] is produced by adding A[i] + B[i], for i = 0 .. N-1 (figure shows vectors A, B, and C with a + between each pair of elements)

A Simple Thread Block Model
- Structure an array of threads into thread blocks: blocks blockIdx = 0, 1, 2, ..., gridDim-1, each containing threads threadIdx = 0 .. blockDim-1
- A unique id can be computed for each thread:
    id = blockIdx * blockDim + threadIdx;
- This id is used for workload partitioning
Using Arrays of Parallel Threads
- A CUDA kernel is executed by a grid (array) of threads
- All threads in a grid run the same kernel code (SPMD)
- Each thread has an index that it uses to compute memory addresses and make control decisions:
    i = blockIdx.x * blockDim.x + threadIdx.x;
    C_d[i] = A_d[i] + B_d[i];
- Thread blocks can be multidimensional arrays of threads

Thread Blocks: Scalable Cooperation
- Divide the thread array into multiple blocks (Thread Block 0, Thread Block 1, ..., Thread Block N-1, each with threads 0 .. 255)
- Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization (see the sketch below)
- Threads in different blocks cannot cooperate (only through global memory)
- Every block evaluates the same index computation:
    i = blockIdx.x * blockDim.x + threadIdx.x;
    C_d[i] = A_d[i] + B_d[i];

(David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign)
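A minimal sketch of intra-block cooperation (an illustration, not from the slides): each block stages its slice of an array in shared memory, synchronizes at a barrier, then reads elements written by other threads in the same block.

  __global__ void reverseEachBlock(float *d_data)
  {
      __shared__ float s[256];            // per-block scratchpad (shared memory)
      int t = threadIdx.x;
      int i = blockIdx.x * blockDim.x + t;
      s[t] = d_data[i];
      __syncthreads();                    // barrier: every s[t] is now written
      d_data[i] = s[blockDim.x - 1 - t];  // safely read another thread's element
  }
  // Launch with 256-thread blocks; array length assumed a multiple of 256:
  // reverseEachBlock<<<n/256, 256>>>(d_data);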
Vector Addition: Traditional C Code

  // Compute vector sum C = A+B
  void vecAdd(float* A, float* B, float* C, int n)
  {
      for (int i = 0; i < n; i++)
          C[i] = A[i] + B[i];
  }

  int main()
  {
      // Memory allocation for A_h, B_h, and C_h
      // I/O to read A_h and B_h, N elements
      vecAdd(A_h, B_h, C_h, N);
  }

Heterogeneous Computing: vecAdd Host Code

  #include <cuda.h>
  void vecAdd(float* A, float* B, float* C, int n)
  {
      int size = n * sizeof(float);
      float *A_d, *B_d, *C_d;

      // 1. Allocate device memory for A, B, and C;
      //    copy A and B to device memory
      // 2. Kernel launch code to have the device
      //    perform the actual vector addition
      // 3. Copy C from the device memory;
      //    free device vectors
  }

Part 1 copies the inputs from host memory (CPU) to device memory, Part 2 computes on the GPU, and Part 3 copies the result back to host memory.
Partial Overview of CUDA Memories
- Device code can:
  - R/W per-thread registers
  - R/W per-grid global memory
- Host code can:
  - Transfer data to/from per-grid global memory
- We will cover more later.
(Figure: a grid of blocks; each thread has its own registers, and all threads share global memory, which the host can access)

CUDA Device Memory Management API Functions
- cudaMalloc()
  - Allocates an object in device global memory
  - Two parameters: the address of a pointer to the allocated object, and the size of the allocated object in bytes
- cudaFree()
  - Frees an object from device global memory
  - One parameter: the pointer to the freed object
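A minimal usage sketch of the two calls just described (the buffer name and size are arbitrary):

  float *d_buf;
  int size = 1024 * sizeof(float);
  cudaMalloc((void **) &d_buf, size);  // args: address of the pointer, size in bytes
  // ... use d_buf in kernels ...
  cudaFree(d_buf);                     // arg: pointer to the object being freed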
Host-Device Data Transfer API Functions
- cudaMemcpy(): memory data transfer
- Requires four parameters:
  - Pointer to destination
  - Pointer to source
  - Number of bytes copied
  - Type/direction of transfer
- Transfer to device is asynchronous

  void vecAdd(float* A, float* B, float* C, int n)
  {
      int size = n * sizeof(float);
      float *A_d, *B_d, *C_d;

      // 1. Transfer A and B to device memory
      cudaMalloc((void **) &A_d, size);
      cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
      cudaMalloc((void **) &B_d, size);
      cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);
      // Allocate device memory for C
      cudaMalloc((void **) &C_d, size);

      // 2. Kernel invocation code to be shown later

      // 3. Transfer C from device to host
      cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);

      // Free device memory for A, B, C
      cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
  }
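The slide code omits error handling, but each of these API calls returns a cudaError_t; a common checking pattern (an addition, not part of the original slides; assumes <stdio.h> and <stdlib.h> are included) looks like this:

  cudaError_t err = cudaMalloc((void **) &A_d, size);
  if (err != cudaSuccess) {
      // cudaGetErrorString() turns the error code into a readable message
      fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
      exit(EXIT_FAILURE);
  }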
Example: Vector Addition Kernel

  // Compute vector sum C = A+B
  // Each thread performs one pair-wise addition
  __global__ void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)   // Device Code
  {
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      if (i < n) C_d[i] = A_d[i] + B_d[i];
  }

  int vecAdd(float* A, float* B, float* C, int n)   // Host Code
  {
      // A_d, B_d, C_d allocations and copies omitted
      // Run ceil(n/256) blocks of 256 threads each
      vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);   // Execution Configuration
  }
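Putting the pieces together, a minimal host program that drives vecAdd end to end (a sketch; the array size and initialization values are arbitrary):

  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      int N = 1000;
      float *A_h = (float *) malloc(N * sizeof(float));
      float *B_h = (float *) malloc(N * sizeof(float));
      float *C_h = (float *) malloc(N * sizeof(float));
      for (int i = 0; i < N; i++) { A_h[i] = i; B_h[i] = 2.0f * i; }

      vecAdd(A_h, B_h, C_h, N);        // allocates, copies, launches, copies back

      printf("C_h[1] = %f\n", C_h[1]); // expect 3.0
      free(A_h); free(B_h); free(C_h);
      return 0;
  }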
More on Kernel Launch (Host Code)

  int vecAdd(float* A, float* B, float* C, int n)
  {
      // A_d, B_d, C_d allocations and copies omitted
      // Run ceil(n/256) blocks of 256 threads each
      dim3 DimGrid(n/256, 1, 1);
      if (n % 256) DimGrid.x++;   // edge effects: add one block for the partial remainder
      dim3 DimBlock(256, 1, 1);
      vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
  }

- Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking → cudaDeviceSynchronize(); (see the synchronization sketch below)

Kernel Execution in a Nutshell

  __host__ void vecAdd()
  {
      dim3 DimGrid(ceil(n/256.0), 1, 1);
      dim3 DimBlock(256, 1, 1);
      vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
  }

  __global__ void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) C_d[i] = A_d[i] + B_d[i];
  }

The kernel's blocks (Blk 0 .. Blk N-1) are scheduled onto the GPU's multiprocessors (M0 .. Mk), which share the GPU RAM.
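Because the launch returns immediately, the host must synchronize before it depends on the kernel's results; a small sketch of that asynchrony (an illustration, not from the slides):

  vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);  // returns immediately
  // The host can do independent work here while the GPU runs the kernel.
  cudaDeviceSynchronize();  // block until all previously launched work completes
  // Only now is it safe to stop a host-side timer or inspect results.
  // (A cudaMemcpy from device memory also implicitly waits for the kernel,
  //  since it is ordered after the kernel in the default stream.)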
More on CUDA Function Declarations

                                     Executed on the:   Only callable from the:
    __device__ float DeviceFunc()    device             device
    __global__ void KernelFunc()     device             host
    __host__ float HostFunc()        host               host

- __global__ defines a kernel function
- Each "__" consists of two underscore characters
- A kernel function must return void
- __device__ and __host__ can be used together (see the sketch below)
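As the combination rule suggests, tagging a function __host__ __device__ compiles it for both paths; a small sketch (the function names are hypothetical):

  __host__ __device__ float square(float x)
  {
      return x * x;                     // compiled once for the CPU, once for the GPU
  }

  __global__ void squareKernel(float *d, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d[i] = square(d[i]);   // device-side call
  }

  // On the host, the very same source can be called directly:
  // float y = square(3.0f);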
CPU and GPU Address Spaces
- The CPU and GPU have separate address spaces connected by PCIe; a datum X has a distinct copy in each space
- Requires explicit management in each address space
- Programmer-initiated transfers between address spaces

Unified Memory
- A single unified address space across CPU and GPU (still connected by PCIe); one X visible to both
- All transfers managed under the hood
- Smaller code segments
- Explicit management now becomes an optimization
Using Unified Memory

  #include <iostream>
  #include <math.h>

  // Kernel function to add the elements of two arrays
  __global__ void add(int n, float *x, float *y)
  {
      for (int i = 0; i < n; i++)
          y[i] = x[i] + y[i];
  }

  int main(void)
  {
      int N = 1<<20;
      float *x, *y;

      // Allocate Unified Memory accessible from CPU or GPU
      cudaMallocManaged(&x, N*sizeof(float));
      cudaMallocManaged(&y, N*sizeof(float));

      // Initialize x and y arrays on the host
      for (int i = 0; i < N; i++) {
          x[i] = 1.0f;
          y[i] = 2.0f;
      }

      // Run kernel on 1M elements on the GPU
      add<<<1, 1>>>(N, x, y);

      // Wait for GPU to finish
      cudaDeviceSynchronize();

      // Free memory
      cudaFree(x);
      cudaFree(y);
      return 0;
  }

(From https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/)

The ISA
- An Instruction Set Architecture (ISA) is a contract between the hardware and the software. As the name suggests, it is the set of instructions that the architecture (hardware) can execute.
PTX ISA: Major Features
- Set of predefined, read-only variables
  - E.g., %tid (threadIdx), %ntid (blockDim), %ctaid (blockIdx), %nctaid (gridDim), etc.
  - Some of these are multidimensional, e.g., %tid.x, %tid.y, %tid.z
- Includes architecture variables, e.g., %laneid, %warpid
  - Can be used for auto-tuning of code
- Includes undefined performance counters

PTX ISA: Major Features (2)
- Multiple address spaces
  - Register, parameter
  - Constant, global, etc.
  - More later
- Predicated instruction execution (see the sketch below)
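To tie these PTX variables back to CUDA C, here is the vector-addition kernel annotated with the rough PTX it lowers to; the exact instruction sequence is compiler- and version-dependent, so treat the comments as an illustration rather than a guaranteed translation:

  __global__ void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
  {
      // Roughly:  mov.u32 %r1, %ctaid.x;    // blockIdx.x
      //           mov.u32 %r2, %ntid.x;     // blockDim.x
      //           mov.u32 %r3, %tid.x;      // threadIdx.x
      //           mad.lo.s32 %r4, %r1, %r2, %r3;
      int i = blockIdx.x * blockDim.x + threadIdx.x;

      // The guard becomes a predicated branch:
      //           setp.ge.s32 %p1, %r4, %r5;  // %p1 = (i >= n)
      //           @%p1 bra DONE;              // skip body when predicate is true
      if (i < n) C_d[i] = A_d[i] + B_d[i];
  }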
Versions
- Compute capability: architecture specific, e.g., Volta is 7.0
- CUDA version: determines programming model features; currently 9.0
- PTX version: virtual ISA version; currently 6.1

QUESTIONS?