GPU Architectures Patrick Neill May 30, 2014
Outline
- CPU versus GPU: Why are they different?
- CUDA: Terminology
- GPU: Kepler/Maxwell
- Graphics: Tiled deferred rendering
- Opportunities: What skills you should know
CPU - Evolution
- Problems to solve over time
CPU - Priorities
- Latency
  - Event-driven programming
  - Managing complex control flow: branch prediction, speculative execution
  - Reduction of memory/HD access: large caches
- Small collection of active threads
  - Anything greater than dual core is generally overkill
- Less space devoted to math
GPU - Evolution
- Problems over time
GPU - Priorities
- Parallelization
  - Work-loads related to games are massively parallel
  - 4K = 3840x2160 at ~60 fps ~= 500M threads per second (3840 * 2160 * 60 ≈ 498M)!
- Latency trade-off
  - Sacrifices immediate access to data
  - Allows advanced batching and scheduling of data
- Computation
  - Less complex control flow, more math
  - Most space devoted to math, with a smart thread scheduler to keep it busy
Evolution of Modern GPU
- Fixed function (GeForce 256, Radeon R100)
  - Hardware T&L
  - Render Output Units / Raster Operations Pipeline (ROPs)
- Programmability (GeForce 2 -> 7)
  - Vertex, Pixel
  - Evolved with DirectX SM versions, Cg
- Unification (GeForce 8)
  - Generic compute units
  - Geometry shader
- Generalization (Fermi, Kepler, AMD GCN)
  - Queuing, batching, advanced scheduling
  - Compute focused
Future (CPU, GPU, HPC)
- Simpler, more power efficient
- Integration of components
- Software/language needs
- Generalization
- Power efficiency
- Fast interconnects
- Heterogeneous computing
Why do you care?
- Companies try to leverage IP
  - Obviously biased (except me of course, you can trust me)
- Understand which architectures are best for the problem you need to solve
  - N-Body = GPU
  - GUI = CPU
CUDA
- Compute Unified Device Architecture
- NVIDIA-invented parallel language (OpenCL is similar)
- Turns off graphics-oriented features
  - Disables: rasterizer, input assembler, output merger
- Turns on compute-oriented features
  - LD/ST to generic buffers
  - Support for advanced scheduling features
CUDA Constructs
- Thread ~ Work item
- Block/CTA ~ Work group
- Grid ~ Index space
CUDA Constructs
- Thread: per-thread local memory
- Warp = 32 threads; share through local memory
- Block/CTA = group of warps; shared mem per CTA for intra-CTA sync/communication
- Grid = group of CTAs that share the same kernel; global mem for cross-CTA communication
- Defined by user (compute) to reflect communication
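A minimal sketch of how these constructs look in CUDA C (the reduction kernel and its sizes are illustrative, not from the slides): each thread handles one element, threads within a CTA cooperate through __shared__ memory and __syncthreads(), and CTAs communicate only through global memory.

    #include <cuda_runtime.h>

    // Each CTA (block) sums 256 elements cooperatively through shared memory;
    // each CTA writes one partial sum to global memory (cross-CTA communication).
    __global__ void partialSums(const float *in, float *out, int n)
    {
        __shared__ float smem[256];                      // per-CTA shared memory
        int tid = threadIdx.x;                           // thread index within the CTA
        int i   = blockIdx.x * blockDim.x + threadIdx.x; // global index within the grid

        smem[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                                 // intra-CTA synchronization

        // Tree reduction inside the CTA.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) smem[tid] += smem[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = smem[0];         // one result per CTA
    }

    int main(void)
    {
        const int n = 1 << 20, threads = 256, blocks = n / threads;
        float *in, *out;
        cudaMalloc(&in,  n * sizeof(float));
        cudaMalloc(&out, blocks * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float));
        // Launch configuration: a grid of CTAs, each CTA an integer # of warps (256/32 = 8).
        partialSums<<<blocks, threads>>>(in, out, n);
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(out);
        return 0;
    }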
Warp
Compute - Integration
- Utilize existing libraries
  - cuFFT (10x faster), cuBLAS (6-17x faster), cuSPARSE (8x faster)
- Application integration: Matlab, Mathematica
- DIY: CUDA, OpenCL, DirectX Compute
Source: http://developer.nvidia.com/gpu-accelerated-libraries
Compute - Integration

OpenACC:

    #include <stdio.h>
    #define N 1000000
    int main(void) {
        double pi = 0.0f;
        long i;
    #pragma acc region for
        for (i = 0; i < N; i++) {
            double t = (double)((i + 0.5) / N);
            pi += 4.0 / (1.0 + t * t);
        }
        printf("pi=%16.15f\n", pi / N);
        return 0;
    }

cuFFT:

    #define NX 64
    #define NY 64
    #define NZ 128

    cufftHandle plan;
    cufftComplex *data1, *data2;
    cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
    cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
    /* Create a 3D FFT plan. */
    cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);
    /* Transform the first signal in place. */
    cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);
    /* Transform the second signal using the same plan. */
    cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);
    /* Destroy the cuFFT plan. */
    cufftDestroy(plan);
    cudaFree(data1);
    cudaFree(data2);
Dynamic Parallelism
- Allow the GPU to create work for itself
- Remove CPU -> GPU sync points
Dynamic Parallelism
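A minimal sketch of dynamic parallelism (the kernel names are illustrative; requires compute capability 3.5+, built with nvcc -arch=sm_35 -rdc=true -lcudadevrt): the parent kernel launches a child grid directly from the device, so no round trip through the CPU is needed.

    #include <cuda_runtime.h>

    __global__ void child(int *data)
    {
        data[threadIdx.x] *= 2;            // the child grid does the actual work
    }

    __global__ void parent(int *data)
    {
        if (threadIdx.x == 0) {
            // The GPU creates work for itself: no CPU -> GPU sync point.
            child<<<1, 32>>>(data);
            cudaDeviceSynchronize();       // device-side wait for the child grid
        }
    }

    int main(void)
    {
        int *data;
        cudaMalloc(&data, 32 * sizeof(int));
        cudaMemset(data, 0, 32 * sizeof(int));
        parent<<<1, 1>>>(data);
        cudaDeviceSynchronize();
        cudaFree(data);
        return 0;
    }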
Why do you care?
- Constructs reflect:
  - Communication boundaries
  - Read/write coherency
- Warps
  - The low-level primitive the HW operates on
  - Blocks/CTAs are made up of an integer # of warps
- AMD/Intel HPC chips are similar
- Use existing libraries: don't reinvent the wheel
GPU Yesterday GeForce 3 Source: http://hexus.net/tech/reviews/graphics/251-hercules-geforce-3-ti500/?page=2
Kepler GK110
Kepler GPC
- Similar to a CPU core: stamp out for scaling
- Contains multiple SMXs
- Graphics features
  - Polymorph Engine
  - Raster Engine
  - Independent execution
- Compute features
  - Independent kernels
  - ~CTA (work group)
Kepler SMX
- Schedule threads efficiently; maximize utilization
- 192 CUDA Cores, 64 DP units, 32 SFU, 32 LD/ST
- Perfect utilization? 320 threads available per clock (192 + 64 + 32 + 32)
Kepler Scheduler
Maxwell GM107
Maxwell GPC
- 5 SMMs per GPC (Kepler had 2 SMXs)
- ~90% perf of an SMX, but *much* smaller
- More power efficient
- Scheduling changes
Maxwell GPC
Maxwell SMM
- Schedule threads efficiently; maximize utilization
- 128 CUDA Cores, 32 SFU, 32 LD/ST
- Warp scheduling
  - Each scheduler owns its cores
  - Own instruction buffer
  - Still 2x dispatch units
Maxwell Utilization
- Avoid divergence!
  - Shuffle your program to avoid divergence within a warp (see the sketch below)
- Size of CTA matters!
  - Programmer thinks in CTAs, HW operates on warps
  - CTA size of 48 threads on a 32-thread-warp architecture: 48/32 = 1.5 warps, so the second warp is only partially occupied
- Scheduler
  - 2 dispatch units need two independent instructions per warp
  - A series of dependent instructions = poor utilization
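A minimal sketch of the divergence point (both kernels and the odd/even workload are illustrative): in the first kernel every 32-thread warp contains threads on both sides of the branch, so each warp executes both paths serially; regrouping the work on warp boundaries lets each warp take a single path.

    // Divergent: odd/even threads split inside every warp,
    // so each warp executes both branches one after the other.
    __global__ void divergent(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i & 1) x[i] = sqrtf(x[i]);
        else       x[i] = x[i] * x[i];
    }

    // Coherent: the branch falls on warp boundaries, so all 32 threads
    // of a warp take the same path and no serialization occurs.
    __global__ void coherent(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((i / warpSize) & 1) x[i] = sqrtf(x[i]);
        else                    x[i] = x[i] * x[i];
    }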
Maxwell Utilization
- Latency
  - Memory access: ~10 cycles -> ~800 cycles
  - CUDA cores take a few cycles to complete
- Contention for limited resources
  - 64K regs/SM / (64 warps * 32 threads/warp) = 32 regs/thread before the # of resident warps per SM decreases
- High # of warps is vital for high utilization
- Balance the work-load
  - Tex versus FP versus SFU versus LD/ST
  - Avoid peak utilization of any one unit
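One hedged way to act on the register math above (the kernel and figures are illustrative): __launch_bounds__ tells the compiler the intended CTA size and a minimum CTA residency per SM, so it caps register use per thread instead of silently lowering the resident warp count.

    // Asks the compiler to keep register usage low enough that at least
    // 8 CTAs of 256 threads can be resident per SM (assumes this kernel
    // would otherwise be register-bound).
    __global__ void __launch_bounds__(256, 8)
    saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

The same cap can also be applied per compilation unit with nvcc's -maxrregcount flag.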
Kepler SMX
- Each CUDA Core can do an FMA = 2 FLOP
- 192 CUDA Cores per SMX, 15 x 2 SMX (Titan Z): 192 * 30 * 2 = 11520 FLOP/clock
- At 705 MHz (876 MHz boost): 8.1 TFLOPs (10 TFLOPs boost)
- Intel Xeon E5-2690: 371 GFLOPs (AVX)
- Xbox One: 1.32 TFLOPs; PS4: 1.84 TFLOPs
Infiltrator Demo
- Tiled deferred shading
- Temporal anti-aliasing
- Tessellation and displacement
- Millions of particles, colliding with / lit by the environment
- Physically based materials/lighting (static GI)
Deferred shading
- Render to G-Buffer: material, normal, depth
- Defer shading
  - Wait until all objects are rendered
  - Use pixel position plus G-Buffer
- Shading done exactly once per pixel
Deferred shading
Tiled Deferred Shading
- Render to G-Buffer (materials, normal, depth) per pixel
- Segment screen into tiles (think parallel)
  - DICE 2011 GDC presentation
- Compute light-tile intersection
  - Output culled light list to a buffer
- Perform per-pixel light pass
  - (Materials, normal, depth) + pixel pos
  - Fetch list of lights hitting the tile containing the pixel
  - Iterate + light
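A hedged CUDA-style sketch of the two compute steps above (the Light struct, the screen-space culling test, and the shading math are all illustrative assumptions, not the demo's actual code): one CTA per 16x16 tile cooperatively culls lights into a shared-memory list, then each thread shades its own pixel against only the culled list.

    #include <cuda_runtime.h>

    #define TILE 16
    #define MAX_LIGHTS_PER_TILE 64

    // Illustrative light: position/radius already projected to screen space.
    struct Light { float2 pos; float radius; float3 color; };

    // Circle vs tile rectangle overlap in screen space (illustrative cull test).
    __device__ bool lightHitsTile(Light L, int tileX, int tileY)
    {
        float cx = fminf(fmaxf(L.pos.x, (float)(tileX * TILE)), (float)(tileX * TILE + TILE));
        float cy = fminf(fmaxf(L.pos.y, (float)(tileY * TILE)), (float)(tileY * TILE + TILE));
        float dx = L.pos.x - cx, dy = L.pos.y - cy;
        return dx * dx + dy * dy <= L.radius * L.radius;
    }

    // Launch with dim3 grid((w+TILE-1)/TILE, (h+TILE-1)/TILE), dim3 block(TILE, TILE).
    __global__ void tiledLightPass(const Light *lights, int numLights,
                                   float3 *outColor, int width, int height)
    {
        __shared__ int list[MAX_LIGHTS_PER_TILE];  // culled light list for this tile
        __shared__ int count;
        int tid = threadIdx.y * TILE + threadIdx.x;
        if (tid == 0) count = 0;
        __syncthreads();

        // Step 1: threads stride over all lights, appending hits for this tile.
        for (int l = tid; l < numLights; l += TILE * TILE)
            if (lightHitsTile(lights[l], blockIdx.x, blockIdx.y)) {
                int slot = atomicAdd(&count, 1);
                if (slot < MAX_LIGHTS_PER_TILE) list[slot] = l;
            }
        __syncthreads();

        // Step 2: each thread shades its own pixel using only the culled list.
        int px = blockIdx.x * TILE + threadIdx.x;
        int py = blockIdx.y * TILE + threadIdx.y;
        if (px >= width || py >= height) return;
        float3 c = make_float3(0.f, 0.f, 0.f);
        int n = min(count, MAX_LIGHTS_PER_TILE);
        for (int k = 0; k < n; ++k) {
            Light L = lights[list[k]];
            float dx = L.pos.x - px, dy = L.pos.y - py;
            float falloff = fmaxf(0.f, 1.f - sqrtf(dx * dx + dy * dy) / L.radius);
            c.x += L.color.x * falloff; c.y += L.color.y * falloff; c.z += L.color.z * falloff;
        }
        outColor[py * width + px] = c;
    }

A real implementation would also read the G-Buffer (normal, depth) during shading and, per the DICE presentation, tighten the cull with the tile's min/max depth bounds.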
Tiled Deferred Shading
Opportunities
- NVIDIA/AMD/Intel (Samsung/Qualcomm/Apple)
  - Strong C/C++
  - Strong OS/algorithms fundamentals
  - Parallel programming with an emphasis on performance
  - Compiler experience a plus
  - Graphics knowledge a plus
- Other industries
  - Games industry
  - Seismic processing
  - Biochemistry simulations
  - Weather/climate modeling
  - CAD
Questions? pneill@nvidia.com