GPU Architectures Patrick Neill May 30, 2014
Outline
- CPU versus GPU: Why are they different?
- CUDA: Terminology
- GPU: Kepler/Maxwell
- Graphics: Tiled deferred rendering
- Opportunities: What skills you should know
CPU - Evolution
- Problems to solve over time
CPU - Priorities
- Latency
  - Event-driven programming
  - Managing complex control flow: branch prediction, speculative execution
  - Reduction of memory/HD access: large caches
- Small collection of active threads
  - Anything greater than dual core is generally overkill
- Less space devoted to math
GPU - Evolution
- Problems over time
GPU - Priorities
- Parallelization
  - Work-loads related to games are massively parallel
  - 4K = 3840x2160 at ~60 fps ~= 500M threads per second (3840 * 2160 * 60 ≈ 498M)!
- Latency trade-off
  - Sacrifices immediate access to data
  - Allows advanced batching and scheduling of data
- Computation
  - Less complex control flow, more math
  - Most space devoted to math, with a smart thread scheduler to keep it busy
Evolution of Modern GPU
- Fixed function (GeForce 256, Radeon R100)
  - Hardware T&L
  - Render Output Units / Raster Operations Pipeline (ROPs)
- Programmability (GeForce 2 -> 7)
  - Vertex, Pixel
  - Evolved with DirectX SM versions, Cg
- Unification (GeForce 8)
  - Generic compute units
  - Geometry shader
- Generalization (Fermi, Kepler, AMD GCN)
  - Queuing, batching, advanced scheduling
  - Compute focused
Future (CPU, GPU, HPC)
- Simpler, more power efficient
- Integration of components
- Software/language needs
- Generalization
- Power efficiency
- Fast interconnects
- Heterogeneous computing
Why do you care?
- Companies try to leverage IP
  - Obviously biased (except me of course, you can trust me)
- Understand which architectures are best for the problem you need to solve
  - N-Body = GPU
  - GUI = CPU
CUDA
- Compute Unified Device Architecture
- NVIDIA-invented parallel language (OpenCL is similar)
- Turns off graphics-oriented features
  - Disables: rasterizer, input assembler, output merger
- Turns on compute-oriented features
  - LD/ST to generic buffers
  - Support for advanced scheduling features
CUDA Constructs
- Thread ~ Work item
- Block/CTA ~ Work group
- Grid ~ Index space
CUDA Constructs
- Thread: per-thread local memory
- Warp = 32 threads; share through local memory
- Block/CTA = group of warps; shared mem per CTA for intra-CTA sync/communication
- Grid = group of CTAs that share the same kernel; global mem for cross-CTA communication
- Defined by user (compute) to reflect communication
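A minimal sketch of how these constructs look in CUDA C (the reduction kernel and its sizes are illustrative, not from the slides): each thread handles one element, threads within a CTA cooperate through __shared__ memory and __syncthreads(), and CTAs communicate only through global memory.

    #include <cuda_runtime.h>

    // Each CTA (block) sums 256 elements cooperatively through shared memory;
    // each CTA writes one partial sum to global memory (cross-CTA communication).
    __global__ void partialSums(const float *in, float *out, int n)
    {
        __shared__ float smem[256];                      // per-CTA shared memory
        int tid = threadIdx.x;                           // thread index within the CTA
        int i   = blockIdx.x * blockDim.x + threadIdx.x; // global index within the grid

        smem[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                                 // intra-CTA synchronization

        // Tree reduction inside the CTA.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) smem[tid] += smem[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = smem[0];         // one result per CTA
    }

    int main(void)
    {
        const int n = 1 << 20, threads = 256, blocks = n / threads;
        float *in, *out;
        cudaMalloc(&in,  n * sizeof(float));
        cudaMalloc(&out, blocks * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float));
        // Launch configuration: a grid of CTAs, each CTA an integer # of warps (256/32 = 8).
        partialSums<<<blocks, threads>>>(in, out, n);
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(out);
        return 0;
    }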
Warp
Compute - Integration
- Utilize existing libraries
  - cuFFT (10x faster), cuBLAS (6-17x faster), cuSPARSE (8x faster)
- Application integration: Matlab, Mathematica
- DIY: CUDA, OpenCL, DirectX Compute
Source: http://developer.nvidia.com/gpu-accelerated-libraries
Compute - Integration

OpenACC:

    #include <stdio.h>
    #define N 1000000
    int main(void) {
        double pi = 0.0f;
        long i;
    #pragma acc region for
        for (i = 0; i < N; i++) {
            double t = (double)((i + 0.5) / N);
            pi += 4.0 / (1.0 + t * t);
        }
        printf("pi=%16.15f\n", pi / N);
        return 0;
    }

cuFFT:

    #define NX 64
    #define NY 64
    #define NZ 128

    cufftHandle plan;
    cufftComplex *data1, *data2;
    cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
    cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
    /* Create a 3D FFT plan. */
    cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);
    /* Transform the first signal in place. */
    cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);
    /* Transform the second signal using the same plan. */
    cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);
    /* Destroy the cuFFT plan. */
    cufftDestroy(plan);
    cudaFree(data1);
    cudaFree(data2);
Dynamic Parallelism
- Allow the GPU to create work for itself
- Remove CPU -> GPU sync points
Dynamic Parallelism
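A minimal sketch of dynamic parallelism (the kernel names are illustrative; requires compute capability 3.5+, built with nvcc -arch=sm_35 -rdc=true -lcudadevrt): the parent kernel launches a child grid directly from the device, so no round trip through the CPU is needed.

    #include <cuda_runtime.h>

    __global__ void child(int *data)
    {
        data[threadIdx.x] *= 2;            // the child grid does the actual work
    }

    __global__ void parent(int *data)
    {
        if (threadIdx.x == 0) {
            // The GPU creates work for itself: no CPU -> GPU sync point.
            child<<<1, 32>>>(data);
            cudaDeviceSynchronize();       // device-side wait for the child grid
        }
    }

    int main(void)
    {
        int *data;
        cudaMalloc(&data, 32 * sizeof(int));
        cudaMemset(data, 0, 32 * sizeof(int));
        parent<<<1, 1>>>(data);
        cudaDeviceSynchronize();
        cudaFree(data);
        return 0;
    }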
Why do you care?
- Constructs reflect:
  - Communication boundaries
  - Read/write coherency
- Warps
  - The low-level primitive the HW operates on
  - Blocks/CTAs are made up of an integer # of warps
- AMD/Intel HPC chips are similar
- Use existing libraries: don't reinvent the wheel
GPU Yesterday GeForce 3 Source: http://hexus.net/tech/reviews/graphics/251-hercules-geforce-3-ti500/?page=2
Kepler GK110
Kepler GPC
- Similar to a CPU core: stamp out for scaling
- Contains multiple SMXs
- Graphics features
  - Polymorph Engine
  - Raster Engine
  - Independent execution
- Compute features
  - Independent kernels
  - ~CTA (work group)
Kepler SMX
- Schedule threads efficiently; maximize utilization
- 192 CUDA Cores, 64 DP units, 32 SFU, 32 LD/ST
- Perfect utilization? 320 threads available per clock (192 + 64 + 32 + 32)
Kepler Scheduler
Maxwell GM107
Maxwell GPC
- 5 SMMs per GPC (Kepler had 2 SMXs)
- ~90% perf of an SMX, but *much* smaller
- More power efficient
- Scheduling changes
Maxwell GPC
Maxwell SMM
- Schedule threads efficiently; maximize utilization
- 128 CUDA Cores, 32 SFU, 32 LD/ST
- Warp scheduling
  - Each scheduler owns its cores
  - Own instruction buffer
  - Still 2x dispatch units
Maxwell Utilization
- Avoid divergence!
  - Shuffle your program to avoid divergence within a warp (see the sketch below)
- Size of CTA matters!
  - Programmer thinks in CTAs, HW operates on warps
  - CTA size of 48 threads on a 32-thread-warp architecture: 48/32 = 1.5 warps, so the second warp is only partially occupied
- Scheduler
  - 2 dispatch units need two independent instructions per warp
  - A series of dependent instructions = poor utilization
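A minimal sketch of the divergence point (both kernels and the odd/even workload are illustrative): in the first kernel every 32-thread warp contains threads on both sides of the branch, so each warp executes both paths serially; regrouping the work on warp boundaries lets each warp take a single path.

    // Divergent: odd/even threads split inside every warp,
    // so each warp executes both branches one after the other.
    __global__ void divergent(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i & 1) x[i] = sqrtf(x[i]);
        else       x[i] = x[i] * x[i];
    }

    // Coherent: the branch falls on warp boundaries, so all 32 threads
    // of a warp take the same path and no serialization occurs.
    __global__ void coherent(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((i / warpSize) & 1) x[i] = sqrtf(x[i]);
        else                    x[i] = x[i] * x[i];
    }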
Maxwell Utilization
- Latency
  - Memory access: ~10 cycles -> ~800 cycles
  - CUDA cores take a few cycles to complete
- Contention for limited resources
  - 64K regs/SM / (64 warps * 32 threads/warp) = 32 regs/thread before the # of resident warps per SM decreases
- High # of warps is vital for high utilization
- Balance the work-load
  - Tex versus FP versus SFU versus LD/ST
  - Avoid peak utilization of any one unit
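One hedged way to act on the register math above (the kernel and figures are illustrative): __launch_bounds__ tells the compiler the intended CTA size and a minimum CTA residency per SM, so it caps register use per thread instead of silently lowering the resident warp count.

    // Asks the compiler to keep register usage low enough that at least
    // 8 CTAs of 256 threads can be resident per SM (assumes this kernel
    // would otherwise be register-bound).
    __global__ void __launch_bounds__(256, 8)
    saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

The same cap can also be applied per compilation unit with nvcc's -maxrregcount flag.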
Kepler SMX
- Each CUDA Core can do an FMA = 2 FLOP
- 192 CUDA Cores per SMX, 15 x 2 SMX (Titan Z): 192 * 30 * 2 = 11520 FLOP/clock
- At 705 MHz (876 MHz boost): 8.1 TFLOPs (10 TFLOPs boost)
- Intel Xeon E5-2690: 371 GFLOPs (AVX)
- Xbox One: 1.32 TFLOPs; PS4: 1.84 TFLOPs
Infiltrator Demo
- Tiled deferred shading
- Temporal anti-aliasing
- Tessellation and displacement
- Millions of particles, colliding with / lit by the environment
- Physically based materials/lighting (static GI)
Deferred shading
- Render to G-Buffer: material, normal, depth
- Defer shading
  - Wait until all objects are rendered
  - Use pixel position plus G-Buffer
- Shading done exactly once per pixel
Deferred shading
Tiled Deferred Shading
- Render to G-Buffer (materials, normal, depth) per pixel
- Segment screen into tiles (think parallel)
  - DICE 2011 GDC presentation
- Compute light-tile intersection
  - Output culled light list to a buffer
- Perform per-pixel light pass
  - (Materials, normal, depth) + pixel pos
  - Fetch list of lights hitting the tile containing the pixel
  - Iterate + light
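A hedged CUDA-style sketch of the two compute steps above (the Light struct, the screen-space culling test, and the shading math are all illustrative assumptions, not the demo's actual code): one CTA per 16x16 tile cooperatively culls lights into a shared-memory list, then each thread shades its own pixel against only the culled list.

    #include <cuda_runtime.h>

    #define TILE 16
    #define MAX_LIGHTS_PER_TILE 64

    // Illustrative light: position/radius already projected to screen space.
    struct Light { float2 pos; float radius; float3 color; };

    // Circle vs tile rectangle overlap in screen space (illustrative cull test).
    __device__ bool lightHitsTile(Light L, int tileX, int tileY)
    {
        float cx = fminf(fmaxf(L.pos.x, (float)(tileX * TILE)), (float)(tileX * TILE + TILE));
        float cy = fminf(fmaxf(L.pos.y, (float)(tileY * TILE)), (float)(tileY * TILE + TILE));
        float dx = L.pos.x - cx, dy = L.pos.y - cy;
        return dx * dx + dy * dy <= L.radius * L.radius;
    }

    // Launch with dim3 grid((w+TILE-1)/TILE, (h+TILE-1)/TILE), dim3 block(TILE, TILE).
    __global__ void tiledLightPass(const Light *lights, int numLights,
                                   float3 *outColor, int width, int height)
    {
        __shared__ int list[MAX_LIGHTS_PER_TILE];  // culled light list for this tile
        __shared__ int count;
        int tid = threadIdx.y * TILE + threadIdx.x;
        if (tid == 0) count = 0;
        __syncthreads();

        // Step 1: threads stride over all lights, appending hits for this tile.
        for (int l = tid; l < numLights; l += TILE * TILE)
            if (lightHitsTile(lights[l], blockIdx.x, blockIdx.y)) {
                int slot = atomicAdd(&count, 1);
                if (slot < MAX_LIGHTS_PER_TILE) list[slot] = l;
            }
        __syncthreads();

        // Step 2: each thread shades its own pixel using only the culled list.
        int px = blockIdx.x * TILE + threadIdx.x;
        int py = blockIdx.y * TILE + threadIdx.y;
        if (px >= width || py >= height) return;
        float3 c = make_float3(0.f, 0.f, 0.f);
        int n = min(count, MAX_LIGHTS_PER_TILE);
        for (int k = 0; k < n; ++k) {
            Light L = lights[list[k]];
            float dx = L.pos.x - px, dy = L.pos.y - py;
            float falloff = fmaxf(0.f, 1.f - sqrtf(dx * dx + dy * dy) / L.radius);
            c.x += L.color.x * falloff; c.y += L.color.y * falloff; c.z += L.color.z * falloff;
        }
        outColor[py * width + px] = c;
    }

A real implementation would also read the G-Buffer (normal, depth) during shading and, per the DICE presentation, tighten the cull with the tile's min/max depth bounds.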
Tiled Deferred Shading
Opportunities
- NVIDIA/AMD/Intel (Samsung/Qualcomm/Apple)
  - Strong C/C++
  - Strong OS/algorithms fundamentals
  - Parallel programming with an emphasis on performance
  - Compiler experience a plus
  - Graphics knowledge a plus
- Other industries
  - Games industry
  - Seismic processing
  - Biochemistry simulations
  - Weather/climate modeling
  - CAD
Questions? pneill@nvidia.com