Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1
Outline Tesla Architecture & CUDA CUDA Programming Optimization Strategies Summary Page 2
Outline Tesla Architecture & CUDA CUDA Programming Optimization Strategies Summary Page 3
Revolutionary NVIDIA Tesla
Multi-threaded architecture with a 128-processor computing core
C-language development environment for the GPU
C870 GPU Computing Processor
- One GPU (128 thread processors)
- 1.5 GB dedicated memory
- Fits in a full-length, dual-slot form factor with one open PCI Express x16 slot
D870 Deskside GPU Computing System: desktop (2 x C870)
S870 GPU Computing System: 1U rack-mount chassis (4 x C870)
Page 4
GPU architecture
Massively multithreaded parallel computing platform
- 8 Texture Processor Clusters (TPCs)
- 1 TPC = 2 Streaming Multiprocessors (SMs) + texture unit
- 1 SM = 8 Streaming Processors (SPs)
- 128 thread processors in total
- 1.35 GHz processor clock, 518 GFLOPS peak
- Parallel Data Cache accelerates processing
Page 5
SM Multithreaded Multiprocessor
- 8 SP thread processors
  - 32 GFLOPS peak at 1.35 GHz
  - IEEE 754 32-bit floating point
  - 32-bit integer
- 2 SFU Special Function Units
- Multithreaded instruction unit
  - 768 threads, hardware multithreaded
  - 24 SIMD warps of 32 threads
  - Independent MIMD thread execution
  - Hardware thread scheduling
- 16 KB shared memory
  - Concurrent threads share data
  - Low-latency load/store
Page 6
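The per-SM and per-GPU peak figures on these two slides are consistent if, as in NVIDIA's marketing convention, each SP is assumed to dual-issue a multiply-add plus a multiply (3 flops) per clock:

8 SPs x 1.35 GHz x 3 flops/cycle ≈ 32.4 GFLOPS per SM
16 SMs x 32.4 GFLOPS ≈ 518 GFLOPS for the whole GPU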
SIMT Multithreaded Execution
Warp: the set of 32 parallel threads that execute a SIMD instruction
SM hardware implements zero-overhead warp and thread scheduling
- 768 concurrent threads = 24 warps x 32 threads
- Threads can execute independently
- Best efficiency and performance: threads of a warp execute together
Single-Instruction, Multiple-Thread (SIMT) execution across threads (not just SIMD across data) gives easy single-thread scalar programming with SIMD efficiency across the warp
Page 7
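A minimal sketch (illustrative kernel, not from the slides) of how a thread can recover its warp and lane index from the built-in variables, assuming the warp size of 32 described above:

__global__ void warpInfo(int *warpIds, int *laneIds)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int warp = tid / 32;   // which warp this thread belongs to
    int lane = tid % 32;   // position of the thread inside its warp
    warpIds[tid] = warp;
    laneIds[tid] = lane;
}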
NVIDIA CUDA
CUDA (Compute Unified Device Architecture) enables efficient use of the massive parallelism of NVIDIA GPUs
- Direct execution of data-parallel programs
- Without the overhead of a graphics API
Using CUDA on Tesla GPUs can provide large speedups on data-parallel computations straight out of the box!
- Heterogeneous mixed serial-parallel programming
- Scalable hierarchical thread execution model
- Accessible: minimal but expressive changes to C
Page 8
CUDA Programming Model: Grids, Blocks, and Threads
Execute a sequence of kernels on the GPU computing device
A kernel executes as a grid of thread blocks
A thread block (CTA, Cooperative Thread Array) is an array of threads that can cooperate
- size: 1 to 512 concurrent threads
- shape: 1D, 2D, or 3D
Threads within the same block synchronize and share data in shared memory
Page 9
CUDA: integrated CPU + GPU application C program
- Serial C code executes on the CPU
- Parallel kernel C code executes on GPU thread blocks
Page 10
Memory Spaces
Each thread can:
- Read/write per-thread 32-bit registers
- Read/write per-thread local memory
- Read/write per-block shared memory (on chip)
- Read/write per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
The host can:
- Read/write global memory, constant memory, and texture memory (stored in DRAM)
Page 11
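A minimal sketch (illustrative names, not from the slides) of how these memory spaces appear in CUDA source; the host writes constant and global memory, while the kernel uses registers, shared memory, and global memory. It assumes a single block of 256 threads:

__constant__ float scale;            // per-grid constant memory, read-only on the device
__device__   float globalBuf[256];   // per-grid global memory

__global__ void demo(float *out)
{
    __shared__ float tile[256];      // per-block shared memory (on chip)
    int i = threadIdx.x;             // i lives in a per-thread register
    tile[i] = globalBuf[i];          // global -> shared
    __syncthreads();
    out[i] = tile[i] * scale;        // shared + constant -> global
}

// Host side: initialize constant and global memory before launching
// cudaMemcpyToSymbol(scale, &hostScale, sizeof(float));
// cudaMemcpyToSymbol(globalBuf, hostData, sizeof(hostData));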
Outline Tesla Architecture & CUDA CUDA Programming Optimization Strategies Summary Page 12
CUDA Programming A minimal set of extensions to the C language A runtime library Page 13
Learning by example: matrix addition

CPU version:

void addmatrix( float *a, float *b, float *c, int N )
{
    int i, j, index;
    for( i = 0; i < N; i++ ) {
        for( j = 0; j < N; j++ ) {
            index = i + j * N;
            c[index] = a[index] + b[index];
        }
    }
}

void main()
{
    addmatrix( a, b, c, N );
}

CUDA version:

__global__ void addmatrix( float *a, float *b, float *c, int N )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if( i < N && j < N )
        c[index] = a[index] + b[index];
}

void main()
{
    dim3 dimBlk( blocksize, blocksize );
    dim3 dimGrd( N / dimBlk.x, N / dimBlk.y );
    addmatrix<<<dimGrd, dimBlk>>>( a, b, c, N );
}
Page 14
Language Extensions
Function type qualifiers
- __global__ : kernel, callable from the host
- __device__ : function callable on the device
- __host__ : function callable on the host (default)
- Example: __device__ void trigger();
Variable type qualifiers
- __device__ : variable in device memory
- __constant__ : variable in constant memory
- __shared__ : variable in shared memory
Page 15
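A short sketch (hypothetical names) combining these qualifiers: a __constant__ coefficient table, a __device__ helper, and a __global__ kernel that stages results through __shared__ memory. It assumes blocks of at most 128 threads:

__constant__ float coeff[4];            // lives in constant memory

__device__ float poly(float x)          // callable only from device code
{
    return coeff[0] + x * (coeff[1] + x * (coeff[2] + x * coeff[3]));
}

__global__ void evalPoly(const float *in, float *out)   // kernel, launched from the host
{
    __shared__ float tmp[128];          // one slot per thread of the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tmp[threadIdx.x] = poly(in[i]);
    __syncthreads();
    out[i] = tmp[threadIdx.x];
}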
Language Extensions
Execution configuration
- Definition of the grid and blocks executed on the device
- For a __global__ function: <<< Dg, Db, Ns, S >>>
- Example:
  __global__ void Function( float* parameter );
  dim3 dimGrid( 100, 50 );
  dim3 dimBlock( 4, 8, 8 );
  Function<<< dimGrid, dimBlock >>>( parameter );
Built-in variables
- dim3 gridDim
- uint3 blockIdx
- dim3 blockDim
- uint3 threadIdx
Page 16
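The optional third argument Ns of the execution configuration sizes dynamically allocated shared memory. A brief sketch (illustrative kernel, not from the slides):

__global__ void scaleShared(float *data, float factor)
{
    extern __shared__ float buf[];           // sized by Ns at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = data[i] * factor;
    __syncthreads();
    data[i] = buf[threadIdx.x];
}

// Launch with 256 threads per block and 256 floats of dynamic shared memory:
// scaleShared<<< numBlocks, 256, 256 * sizeof(float) >>>( devData, 2.0f );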
Compilation with NVCC
- NVCC separates host code from device code
- Source files are processed according to C++ syntax rules
- Device code is compiled to PTX intermediate assembly
Page 17
Software Stack
- Device driver
- Application programming interface (API)
- Mathematical libraries: CUFFT and CUBLAS
Page 18
Runtime library Common component Device component Host component Page 19
Common Component
Built-in vector types
- char1, uchar1, int3, long2, etc.
- Structures accessed with x, y, z, w fields:
  uint4 para;
  int y = para.y;
- dim3 is based on uint3
Texture type
- Texture references
- texture<Type, Dim, ReadMode> texRef;
Mathematical functions
- sinf, powf, log, min, etc.
Time function
- clock_t clock();
Page 20
Device Component
Mathematical functions
- pow
- sin, cos, tan
- etc.
Synchronization functions
- void __syncthreads();
Texture functions
- Type tex1Dfetch( texture<Type, 1, cudaReadModeElementType> texRef, int x );
Atomic functions
- atomicAdd();
Page 21
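A brief sketch (illustrative kernel) using two of these device functions: __syncthreads() coordinates a per-block partial sum in shared memory, and atomicAdd() accumulates the block results into one global counter. It assumes 256 threads per block; note that atomicAdd() on global memory requires a device of compute capability 1.1 or higher:

__global__ void blockSum(const int *in, int *total)
{
    __shared__ int partial[256];                    // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = in[i];
    __syncthreads();                                // all loads visible to the block

    // tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);               // one atomic per block
}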
Host Component
- Device management
- Context management
- Memory management
- Code module management
- Execution control
- Texture reference management
- Interoperability with OpenGL and Direct3D
Page 22
Host Component
Memory management
float data[ 256 ];
int size = sizeof( data );
float* devPtr;
cudaMalloc( (void**)&devPtr, size );
cudaMemcpy( devPtr, data, size, cudaMemcpyHostToDevice );
Page 23
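Extending the slide's snippet into a complete round trip (the copy back and the cudaFree() call are added here for illustration; they are not on the slide):

float data[256];
int size = sizeof(data);
float *devPtr;

cudaMalloc( (void**)&devPtr, size );                         // allocate device memory
cudaMemcpy( devPtr, data, size, cudaMemcpyHostToDevice );    // host -> device

// ... launch kernels that work on devPtr ...

cudaMemcpy( data, devPtr, size, cudaMemcpyDeviceToHost );    // device -> host
cudaFree( devPtr );                                          // release device memory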
Outline Tesla Architecture & CUDA CUDA Programming Optimization Strategies Summary Page 24
Optimization Strategies Maximize parallel execution Optimize memory usage Optimize instruction usage Page 25
Maximize Parallel Execution
More blocks per SM, more threads per block
Limiting factors:
- Registers used by the kernel (8192 registers per SM)
- Amount of shared memory used (16 KB per SM)
- At most 8 resident blocks per SM
Threads per block should be a multiple of the warp size, at least 64
Best: 192 or 256 threads per block (maximum 768 threads per SM)
Page 26
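A small host-side sketch (illustrative) of picking a block size that follows these guidelines, 256 threads per block, and rounding the grid size up so every element is covered:

int N = 1000000;                       // number of elements to process
int threadsPerBlock = 256;             // multiple of the warp size (32), within 768 threads/SM
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;    // ceiling division

myKernel<<< blocks, threadsPerBlock >>>( devData, N );       // hypothetical kernel and data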
Optimize Memory Usage
Data transfers
- Host-device transfers have the lowest bandwidth:
  - Minimize data transfers between host and device memory
  - Group many small transfers into one large transfer
- Shared memory is hundreds of times faster than global memory (400~600 cycles of latency):
  - Minimize global memory traffic by using on-chip shared memory
  - Typical programming pattern: stage data coming from global memory into shared memory
  - Best: avoid the transfer entirely by recomputing the data
Page 27
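A sketch of the staging pattern mentioned above (an illustrative 1D averaging stencil, not from the slides): each block loads its tile of global memory into shared memory once, then all threads reuse the on-chip copy. It assumes 256 threads per block:

__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float tile[258];                       // 256 elements + 2 halo values
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // stage from global memory into shared memory
    tile[threadIdx.x + 1] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x == 0)
        tile[0]   = (i > 0)     ? in[i - 1] : 0.0f;   // left halo
    if (threadIdx.x == blockDim.x - 1)
        tile[257] = (i + 1 < n) ? in[i + 1] : 0.0f;   // right halo
    __syncthreads();

    // reuse the on-chip copy instead of re-reading global memory
    if (i < n)
        out[i] = (tile[threadIdx.x] + tile[threadIdx.x + 1] + tile[threadIdx.x + 2]) / 3.0f;
}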
Optimize Memory Usage
Memory accesses to global memory
- The device can read 4-byte, 8-byte, or 16-byte words in a single instruction
- Most efficient: memory accesses by the threads of a half-warp can be coalesced into a single memory transaction (32 bytes, 64 bytes, or 128 bytes)
(Figure: coalesced vs. non-coalesced access patterns)
Page 28
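A sketch of the two access patterns (illustrative kernels): in the first, consecutive threads touch consecutive 4-byte words and the half-warp's accesses coalesce; in the second, a stride of 2 keeps them from coalescing on this hardware:

// Coalesced: thread k of the half-warp reads word k of an aligned segment
__global__ void copyCoalesced(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Non-coalesced: the stride scatters the half-warp across memory segments
__global__ void copyStrided(const float *in, float *out)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
    out[i] = in[i];
}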
Optimize Memory Usage Matrix transpose Page 29
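A sketch of the shared-memory transpose idea this slide illustrates (the kernel follows the commonly published CUDA transpose example; the tile size and names are assumptions): each 16x16 thread block reads a tile with coalesced loads and writes it back transposed with coalesced stores:

#define TILE_DIM 16

__global__ void transpose(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read
    __syncthreads();

    // swap block coordinates so the write is also coalesced
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}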
Optimize Memory Usage
Shared memory is divided into 16 banks
- Bank conflicts can only occur among threads of the same half-warp
- There is no bank conflict between the two halves of a warp
Page 30
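In the transpose sketch above, the column accesses to tile[threadIdx.x][threadIdx.y] map every thread of a half-warp to the same bank; the usual fix, padding each row by one element, is an assumption here rather than something stated on the slide:

// Without padding, tile[0][ty], tile[1][ty], ... are 16 floats apart, so they
// all fall into the same bank and cause a 16-way bank conflict.
__shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 column shifts each row by one bank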
Optimize Instruction Usage
Memory instructions
- A memory instruction issues in 4 clock cycles per warp
- Global memory latency (400~600 cycles) can be hidden by the thread scheduler
Arithmetic instructions
- Minimize the use of arithmetic instructions with low throughput
- Trade precision for speed where possible
Control flow instructions
- Can lead to divergent branching within a warp
- Key: make the branch condition depend on (threadIdx / warp size), so all threads of a warp take the same path and no divergence occurs
Page 31
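A sketch contrasting the two branch conditions (illustrative): the first diverges inside every warp, while the second follows the slide's advice and keeps whole warps on the same path:

__global__ void branches(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: odd and even lanes of the same warp take different paths
    if (threadIdx.x % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;

    // Non-divergent: the condition is uniform across each warp of 32 threads
    if ((threadIdx.x / warpSize) % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;
}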
Summary Page 32
References 1. NVIDIA CUDA Programming Guide 2.0 2. NVIDIA CUDA Optimization Strategies 3. Wikipedia 4. http://www.gpgpu.org/asplos2008/ 5. NVIDIA Tesla: A Unified Graphics and Computing Architecture March-April 2008, IEEE MICRO Page 33
Thank you for your attention! Page 34