CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology
Contents
- What is GPGPU? What's the need?
- CUDA-Capable GPU Architecture
- CUDA Programming Model
- Advanced Features in CUDA 6.0 onwards: Unified Memory, Dynamic Parallelism
- Example: K-Means Clustering
- Advantages of CUDA
- Restrictions of CUDA
- Conclusion
- References
CUDA Programming Model 2
What is GPGPU? What's the need?
- GPGPU: General-Purpose Graphics Processing Unit
- Accelerates the compute-intensive path of applications
- Best leveraged by data-parallel algorithms: fine-grained SIMD parallelism, low-latency floating-point computation
- Exciting supercomputing applications: molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
- Such applications exhibit various granularities of parallelism, pose little hindrance to parallel implementation, and reward careful, efficient data delivery
Computation Complexity Support
More Transistors!
- Different design philosophies: the GPU devotes more transistors to processing data in a parallel fashion
GPU as Co-processor
CUDA-Capable GPU Architecture
CUDA Programming Model
- Supports massive multithreading; easily scalable model
- Runs on the GPU as a co-processor, executing many threads in parallel
- Threads are extremely lightweight, with very low creation overhead
- Kernels: the data-parallel portions of an application
- Thread hierarchy, memory hierarchy, compute capability
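As a minimal sketch of a kernel (a hypothetical vector-add example, not from the slides): the `__global__` qualifier marks the data-parallel function, and the launch configuration `<<<blocks, threadsPerBlock>>>` fixes the thread hierarchy for one kernel invocation.

```cuda
// Kernel sketch: each of the n threads adds one element pair.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
    if (i < n)                                      // last block may be partial
        c[i] = a[i] + b[i];
}

// Host side: enough 256-thread blocks to cover all n elements.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```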
Threads, Blocks and Grids
- Logical partitioning: threads (thread IDs), blocks of threads (block IDs, block dimensions), and a grid of blocks (grid dimensions)
- Threads within a block are arranged in a 1D, 2D, or 3D logical fashion, as are blocks within a grid
- Each level is limited by the physical resources available
- All threads follow the SPMD model: the same code runs on every thread, operating on different data
- Threads in the same block can share data; threads in different blocks cannot
- Threads are scheduled in warps (groups of 32 threads)
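The ID and dimension variables above combine into a unique global index per thread; a sketch for the 1D and 2D cases (illustrative names, not from the slides):

```cuda
__global__ void indexDemo(int *out, int width) {
    // 1D grid of 1D blocks:
    int i1d = blockIdx.x * blockDim.x + threadIdx.x;

    // 2D grid of 2D blocks (natural for images/matrices):
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int i2d = y * width + x;   // row-major flattening

    out[i1d] = i2d;
}

// Host side: dim3 sets each dimension of the hierarchy.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// indexDemo<<<grid, block>>>(d_out, width);
```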
Memory Model Hierarchy
- Registers: allocated per thread; read/write by that thread
- Local memory: allocated per thread; read/write by that thread
- Shared memory: allocated per block; read/write by every thread in the block
- Global memory: common to all threads in a grid; read/write
- Constant memory: common to all threads in a grid; read-only
- Texture memory: read-only, cached on chip
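A sketch showing three of these spaces in one kernel (a hypothetical per-block sum, assuming the block size matches `TILE`): per-block `__shared__` storage, per-thread registers, and global memory for input and output.

```cuda
#define TILE 256

// Stage a tile of global memory in shared memory, then reduce it.
__global__ void sumTile(const float *in, float *blockSums) {
    __shared__ float tile[TILE];          // shared: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];            // each thread loads one element
    __syncthreads();                      // wait until the whole tile is loaded

    if (threadIdx.x == 0) {               // thread 0 reduces the tile
        float s = 0.0f;                   // s lives in a per-thread register
        for (int j = 0; j < TILE; ++j)
            s += tile[j];
        blockSums[blockIdx.x] = s;        // result written to global memory
    }
}
```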
Streaming Multiprocessor
- The GPU is a scalable array of Streaming Multiprocessors (SMs)
- Each SM is independent and bounded by its thread and block limits
- Each SM contains multiple streaming processors
- An SM executes one warp (32 threads) at a time
Automatic Scalability
- Hardware is free to assign blocks to any processor at any time
- A kernel therefore scales across any number of parallel processors
CUDA Compilation and PTX
- Any source file containing CUDA language extensions must be compiled with NVCC
- Code sent from the CPU to the GPU is in Parallel Thread Execution (PTX) form; the graphics driver converts PTX into an executable binary
- Host C++ code goes to the standard C/C++ compiler; GPU device functions go to proprietary NVIDIA compilers/assemblers
- NVCC embeds the compiled GPU functions as load images in the host object file
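The host/device split is visible in a typical NVCC workflow (illustrative command lines; the file names are made up):

```
# nvcc splits a .cu file: host code goes to the system C++ compiler,
# device code is compiled to PTX/binary and embedded in the object file.
nvcc -c kernel.cu -o kernel.o          # compile device + host code
g++ main.cpp kernel.o -o app -lcudart  # ordinary host-side link step
nvcc --ptx kernel.cu                   # emit the PTX for inspection
```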
Processing Flow on CUDA
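The flow the slide's diagram depicts can be sketched as host code (hypothetical helper around the vector-add kernel; names are illustrative): allocate on the device, copy input in, launch, copy results out, free.

```cuda
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n);

void runVecAdd(const float *h_a, const float *h_b, float *h_c, int n) {
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);                             // 1. device allocation
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice); // 2. CPU -> GPU
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // 3. kernel launch
    cudaDeviceSynchronize();                             //    wait for the GPU

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost); // 4. GPU -> CPU
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);         // 5. cleanup
}
```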
Advanced Features in CUDA 6.0 onwards
- Unified Memory
- Dynamic Parallelism
- Hyper-Q
- GPUDirect
Unified Memory
- Earlier: separate memories for CPU and GPU, a lot of communication overhead, more complexity
- With Unified Memory: a single memory space shared by CPU and GPU, less communication overhead, no need for deep copies of structured data, simpler programming
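A minimal sketch of the simpler programming model (hypothetical example using `cudaMallocManaged`): one pointer is valid on both CPU and GPU, so the explicit copies from the earlier processing flow disappear.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float)); // one pointer, visible to CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;  // CPU writes directly -- no memcpy
    scale<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();                  // ensure the GPU is done before CPU reads
    // x[] is now readable from the CPU without an explicit copy back
    cudaFree(x);
    return 0;
}
```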
Dynamic Parallelism
- Parallel work can generate more parallel work
- A parent kernel creates child kernels and divides the work further
- Enables better load balancing
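A sketch of a device-side launch (hypothetical kernels; dynamic parallelism requires compute capability 3.5+ and compilation with `-rdc=true`):

```cuda
__global__ void childKernel(const float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { /* ... process the sub-range handed down by the parent ... */ }
}

// Parent kernel: each thread may discover extra work at runtime and
// launch a child grid directly from the device, without returning to the CPU.
__global__ void parentKernel(const float *data, const int *workPerThread) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int n = workPerThread[tid];
    if (n > 0)
        childKernel<<<(n + 255) / 256, 256>>>(data, n); // device-side launch
}
```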
Example: K-Means Clustering
- Classifies millions of data points among a given number of classes
- Uses the nearest mean distance from each centroid, calculated for every point
- Performance comparison: Fermi-architecture GPU (GeForce GTX 480), ~35 million data points, 768 threads with 1D blocks
- Speedup of 5x
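The per-point distance computation the slide describes can be sketched as an assignment kernel, one thread per data point (illustrative names, 1D points for brevity; not the original implementation):

```cuda
// Assignment step of k-means: each thread labels one point with the
// index of its nearest centroid.
__global__ void assignClusters(const float *points, const float *centroids,
                               int *labels, int numPoints, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPoints) return;

    float best = 1e30f;                       // best squared distance so far
    int bestC = 0;
    for (int c = 0; c < k; ++c) {             // scan all centroids
        float d = points[i] - centroids[c];
        if (d * d < best) { best = d * d; bestC = c; }
    }
    labels[i] = bestC;                        // nearest-mean classification
}
```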
Advantages of CUDA
- Coarse-grained thread blocks map naturally to separate processor cores
- Fine-grained threads map to multiple thread contexts
- Easy to scale with increasing parallel resources in a system
- Easy to transform serial programs into parallel CUDA programs
- Fast shared memory, used as a software-managed cache, provides substantial performance improvements
- Supports graphics applications through texture memory hardware
Restrictions of CUDA
- Thread blocks cannot communicate with each other
- Recursive function calls are not allowed in CUDA kernels due to limited per-thread resources
- Individual thread control is not supported
- CUDA does not support the full C standard
- Unlike OpenCL, CUDA-enabled GPUs are available only from NVIDIA
- The SIMD execution model becomes a significant limitation for any inherently divergent task: the higher the divergence, the lower the performance
Conclusion
- CUDA provides an easy-to-program model for parallel applications
- The model can extend to many parallel systems, but the implementation is specific to NVIDIA's GPU architecture
- Other parallel programming libraries, such as Open MPI and OpenCL, provide similar features for multicore CPUs
References
- NVIDIA CUDA Home - http://www.nvidia.com/object/cuda_home_new.html
- CUDA Programming Guide - http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
- Maxwell Architecture - http://devblogs.nvidia.com/parallelforall/maxwell-most-advanced-cuda-gpu-ever-made/
- Unified Memory - http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/ and http://sbel.wisc.edu/documents/tr-2014-09.pdf
- Dynamic Parallelism - http://developer.download.nvidia.com/assets/cuda/files/cudadownloads/techbrief_dynamic_parallelism_in_cuda.pdf