EE382N (20): Computer Architecture - Parallelism and Locality
Spring 2015
Lecture 09 - GPUs (II)
Mattan Erez
The University of Texas at Austin
Recap: Streaming Model
1. Use many slimmed-down cores to run in parallel
2. Pack cores full of ALUs (by sharing the instruction stream across groups of fragments)
   - Option 1: explicit SIMD vector instructions
   - Option 2: implicit sharing managed by hardware
3. Avoid latency stalls by interleaving execution of many groups of fragments
   - When one group stalls, work on another group
(Kayvon Fatahalian, 2008)
Make the Compute Core the Focus of the Architecture
- The future of GPUs is programmable processing
- So build the architecture around the processor
- Processors execute computing threads
- Alternative operating mode specifically for computing
- Used to be only one kernel at a time
[Figure: G80 block diagram - Host, Input Assembler, Thread Execution Manager (manages thread blocks), Vtx/Geom/Pixel thread issue, texture units, load/store paths through L2 to the frame buffers (FB) / global memory]
(David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign)
Graphics Processing Units (GPUs)
- General-purpose many-core accelerators
- Use simple in-order shader cores in the 10s to 100s
- Shader core == Streaming Multiprocessor (SM) in NVIDIA GPUs
- Scalar frontend (fetch & decode) + parallel backend
  - Amortizes the cost of the frontend and control
CUDA Exposes a Hierarchy of Data-Parallel Threads
- SPMD model: a single kernel is executed by all scalar threads
- Kernel / Thread-block
  - Multiple thread-blocks (cooperative thread arrays, CTAs) compose a kernel
CUDA Exposes a Hierarchy of Data-Parallel Threads
- SPMD model: a single kernel is executed by all scalar threads
- Kernel / Thread-block / Warp / Thread
  - Multiple warps compose a thread-block
  - Multiple threads (32) compose a warp
  - A warp is scheduled as a batch of threads (see the sketch below)
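To make the hierarchy concrete, here is a minimal sketch (the kernel name `scale`, the grid shape, and the problem size are illustrative, not from the slides): each scalar thread computes a global index from its block and thread coordinates, and the hardware forms warps from consecutive threads within each block.

```cuda
#include <cuda_runtime.h>

// Each scalar thread handles one element of the array.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail block
        x[i] = a * x[i];
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    int threadsPerBlock = 256;                       // 8 warps of 32 threads
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(x, 2.0f, n);  // kernel = grid of CTAs
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```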
SIMT for Balanced Programmability and HW Efficiency
- SIMT: Single-Instruction Multiple-Thread
  - Programmer writes code to be executed by scalar threads
  - Underlying hardware uses vector SIMD pipelines (lanes)
  - HW/SW groups scalar threads to execute in vector lanes
- Enhanced programmability using SIMT (see the sketch below)
  - Hardware/software supports conditional branches: each thread can follow its own control flow
  - Per-thread load/store instructions are also supported: each thread can reference arbitrary address regions
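A short sketch of what SIMT permits (kernel and variable names are hypothetical): threads in the same warp take different branch directions and load from thread-dependent addresses, which a classic SIMD vector ISA could not express directly per lane.

```cuda
// Each thread follows its own control flow and issues its own load:
// the hardware serializes divergent paths within a warp (branch divergence)
// and gathers from arbitrary per-thread addresses.
__global__ void simt_demo(const float *in, const int *index,
                          float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[index[i]];   // per-thread gather from an arbitrary address
    if (v < 0.0f)             // per-thread conditional branch
        v = -v;               // only some lanes of the warp execute this
    out[i] = v;
}
```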
An Older GPU (Fermi), but the Same at a High Level
- Expand the performance sweet spot of the GPU
  - Caching, concurrent kernels, FP64, 512 cores, GDDR5 memory
- Bring more users and more applications to the GPU
  - C++, Visual Studio integration, ECC
[Figure: Fermi die diagram - SMs around a shared L2, six DRAM interfaces, GigaThread engine, and host interface]
Streaming Multiprocessor (SM)
- Objective: optimize for GPU computing
  - New ISA
  - Revamped issue / control flow
  - New CUDA core architecture
- 16 SMs per Fermi chip
- 32 cores per SM (512 total)
- 64 KB of configurable L1$ / shared memory (see the sketch below)

  Ops/clk per SM: FP32 = 32, FP64 = 16, INT = 32, SFU = 4, LD/ST = 16

[Figure: SM block diagram - instruction cache, dual warp scheduler + dispatch units, register file, 32 CUDA cores, 16 load/store units, 4 special function units (SFUs), interconnect network, 64K configurable L1/shared memory, uniform cache]
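Since the 64 KB L1/shared array is software-configurable, here is a sketch of one way to select the split per kernel (the kernel name `my_kernel` is hypothetical; `cudaFuncSetCacheConfig` is the actual CUDA runtime call):

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel(float *data);  // hypothetical kernel

void configure() {
    // Prefer 48 KB shared memory / 16 KB L1 for kernels that rely on
    // programmer-managed tiling; cudaFuncCachePreferL1 flips the split.
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
}
```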
SM Microarchitecture
- New IEEE 754-2008 arithmetic standard
  - Fused multiply-add (FMA) for SP & DP (see the sketch below)
- New integer ALU optimized for 64-bit and extended-precision ops
[Figure: CUDA core detail - dispatch port, operand collector, FP unit, INT unit, result queue; SM diagram as on the previous slide]
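To illustrate why FMA matters, a minimal device-code sketch (the kernel name is illustrative; `fmaf` is the standard CUDA math intrinsic): FMA computes a*b+c with a single rounding step, as IEEE 754-2008 specifies, whereas a separate multiply followed by an add rounds twice.

```cuda
__global__ void fma_demo(const float *a, const float *b, const float *c,
                         float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // One fused instruction, one rounding step (IEEE 754-2008 FMA);
        // writing a[i] * b[i] + c[i] as separate ops may round twice.
        out[i] = fmaf(a[i], b[i], c[i]);
}
```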
Memory Hierarchy
- True cache hierarchy + on-chip shared RAM
  - On-chip shared memory: good fit for regular memory access (dense linear algebra, image processing, ...) - see the sketch below
  - Caches: good fit for irregular or unpredictable memory access (ray tracing, sparse matrix multiply, physics, ...)
- L1 is separate for each SM (16/48 KB)
  - Improves bandwidth and reduces latency
- Unified L2 for all SMs (768 KB)
  - Fast, coherent data sharing across all cores in the GPU
[Figure: Fermi die diagram as before - SMs around the shared L2 and DRAM interfaces]
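A minimal sketch of the "regular access" case the slide mentions (all names are illustrative, and block-boundary handling is simplified): a thread-block stages a tile of a dense array into `__shared__` memory once, then every thread reuses it, trading global-memory traffic for on-chip bandwidth.

```cuda
#define TILE 256  // must match blockDim.x at launch

// Hypothetical 1D stencil-style kernel: each block loads a tile into
// shared memory, synchronizes, then reads its neighbors from on-chip RAM.
__global__ void tile_sum(const float *in, float *out, int n) {
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];   // one global load per element
    __syncthreads();                        // tile now visible to the whole CTA
    if (i < n) {
        float left  = (threadIdx.x > 0)        ? tile[threadIdx.x - 1] : 0.0f;
        float right = (threadIdx.x < TILE - 1) ? tile[threadIdx.x + 1] : 0.0f;
        out[i] = left + tile[threadIdx.x] + right;  // reuse hits shared RAM
    }
}
```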
GigaThread Hardware Thread Scheduler (HTS)
- Hierarchically manages tens of thousands of simultaneously active threads
- 10x faster context switching on Fermi
- Overlapping (concurrent) kernel execution (see the sketch below)
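A sketch of how software exposes concurrent kernel execution (the kernels and sizes are hypothetical): independent kernels launched into different CUDA streams have no ordering dependence, so Fermi-class and later hardware may run them concurrently when resources allow.

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *y) { y[threadIdx.x] *= 2.0f; }

int main() {
    float *x, *y;
    cudaMalloc(&x, 256 * sizeof(float));
    cudaMalloc(&y, 256 * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Kernels in distinct streams are eligible to overlap on the GPU.
    kernelA<<<1, 256, 0, s0>>>(x);
    kernelB<<<1, 256, 0, s1>>>(y);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFree(x); cudaFree(y);
    return 0;
}
```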
GigaThread Streaming Data Transfer (SDT) Engine
- Dual DMA engines
- Simultaneous CPU-to-GPU and GPU-to-CPU data transfer
- Fully overlapped with CPU/GPU processing
[Figure: timeline - for each of Kernel 0..3, the CPU, SDT0 (upload), GPU (execution), and SDT1 (download) stages overlap across consecutive kernels]
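A sketch of driving both DMA engines from CUDA (buffer names and chunking are illustrative): pinned host memory plus `cudaMemcpyAsync` in separate streams lets the upload of one chunk, kernel execution on another, and the download of a third proceed at the same time.

```cuda
#include <cuda_runtime.h>

__global__ void process(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main() {
    const int n = 1 << 20, chunks = 4, c = n / chunks;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));  // pinned memory enables async DMA
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[chunks];
    for (int k = 0; k < chunks; ++k) cudaStreamCreate(&s[k]);

    // Per chunk: H2D copy, kernel, D2H copy. Uploads and downloads of
    // different chunks can occupy the two DMA engines simultaneously,
    // overlapped with kernel execution on the SMs.
    for (int k = 0; k < chunks; ++k) {
        float *hd = h + k * c, *dd = d + k * c;
        cudaMemcpyAsync(dd, hd, c * sizeof(float), cudaMemcpyHostToDevice, s[k]);
        process<<<(c + 255) / 256, 256, 0, s[k]>>>(dd, c);
        cudaMemcpyAsync(hd, dd, c * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    for (int k = 0; k < chunks; ++k) cudaStreamDestroy(s[k]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```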
Thread Life Cycle in HW
- A kernel is launched on the SPA (Streaming Processor Array)
  - Kernels are known as grids of thread blocks
- Blocks are serially distributed to all the SMs
  - Potentially >1 block per SM
  - At least 96 threads per block
- Each SM launches warps of threads
  - 2 levels of parallelism
- SM schedules and executes warps that are ready to run
- As warps and blocks complete, resources are freed
  - SPA can distribute more blocks (see the occupancy sketch below)
[Figure: Host/Device diagram - Kernel 1 launches Grid 1 with blocks (0,0)..(2,1); Kernel 2 launches Grid 2; Block (1,1) expands into threads (0,0)..(4,2)]
(David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign)
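To relate the slide's "potentially >1 block per SM" to code, a sketch using the CUDA occupancy API (the kernel `work` is hypothetical; `cudaOccupancyMaxActiveBlocksPerMultiprocessor` is the actual runtime call, added in CUDA releases after the Fermi era): it reports how many blocks of a given shape can stay resident on one SM, given the kernel's register and shared-memory usage.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *x) { x[threadIdx.x] *= 2.0f; }  // hypothetical kernel

int main() {
    int blocksPerSM = 0, threadsPerBlock = 128;  // 4 warps per block
    // How many blocks of this kernel can be resident on one SM at once?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, work, threadsPerBlock, /*dynamicSMemSize=*/0);
    printf("Resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```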