1 GPU Architectures Patrick Neill May 30, 2014

2 Outline: CPU versus GPU (why are they different?); CUDA (terminology); GPU (Kepler/Maxwell); Graphics (tiled deferred rendering); Opportunities (what skills you should know).

3 CPU - Evolution Problem to solve over time

4 CPU - Priorities. Latency: event-driven programming. Managing complex control flow: branch prediction, speculative execution. Reduction of memory/HD access: large caches. Small collection of active threads: anything greater than dual core is generally overkill. Less space devoted to math.

5 GPU - Evolution Problems over time

6 GPU - Priorities. Parallelization: workloads related to games are massively parallel; 4K = 3840x2160 at ~60 fps ~= 500M threads per second! Latency trade-off: sacrifices immediate access to data to allow advanced batching and scheduling of data. Computation: less complex control flow, more math; most space devoted to math, with a smart thread scheduler to keep it busy.
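The 500M figure can be checked with a few lines of C. This is only a sketch: the function name is made up for illustration, and it assumes exactly one shading thread per pixel per frame (no overdraw, no multisampling), which the slide does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Threads launched per second if every pixel of every frame gets exactly
   one thread (assumption: no overdraw, no multisampling). */
uint64_t pixel_threads_per_second(uint32_t width, uint32_t height,
                                  uint32_t frames_per_second)
{
    assert(width > 0 && height > 0 && frames_per_second > 0);
    /* Widen before multiplying so the product cannot overflow 32 bits. */
    return (uint64_t)width * height * frames_per_second;
}
```

Calling it with 3840, 2160 and 60 gives a value just under 500 million, in line with the slide's estimate.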

7 Evolution of Modern GPU. Fixed function (GeForce 256, Radeon R100): hardware T&L; Render Output Units / Raster Operations Pipeline (ROPs). Programmability (GeForce 2 -> 7): vertex and pixel shaders; evolved with DirectX SM versions and Cg. Unification (GeForce 8): generic compute units; geometry shader. Generalization (Fermi, Kepler, AMD GCN): queuing, batching, advanced scheduling; compute focused.

8 Future CPU GPU HPC Simpler more power efficient Integration of components Software/language needs Generalization Power efficiency Fast interconnects Heterogeneous computing

9 Why do you care? Companies try to leverage IP: obviously biased (except me of course, you can trust me). Understand what architectures are best for the problem you need to solve: N-body = GPU, GUI = CPU.

10 CUDA: Compute Unified Device Architecture. Parallel language invented by NVIDIA; OpenCL is similar. Turns off graphics-oriented features: disables rasterizer, input assembler, output merger. Turns on compute-oriented features: LD/ST to generic buffers; support for advanced scheduling features.

11 CUDA Constructs (CUDA term ~ OpenCL term): Thread ~ work item; Block/CTA ~ work group; Grid ~ index space.

12 CUDA Constructs. Thread: per-thread local memory. Warp = 32 threads: share through local memory. Block/CTA = group of warps: shared memory per CTA, used for intra-CTA sync/communication. Grid = group of CTAs that share the same kernel: global memory for cross-CTA communication. Defined by user (compute). Reflect communication.
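How a thread's position in the block and grid maps onto a warp can be modeled on the CPU. The sketch below is plain C, not CUDA; `thread_coords` and `locate_thread` are hypothetical names, and it assumes a one-dimensional launch. The global index uses the same formula a CUDA kernel would write as `blockIdx.x * blockDim.x + threadIdx.x`.

```c
#include <assert.h>

#define WARP_SIZE 32u  /* warp width stated on the slide */

/* Where one thread sits in the hierarchy (hypothetical helper type). */
struct thread_coords {
    unsigned global_id;  /* position in the grid's index space */
    unsigned warp_id;    /* which warp of its block owns the thread */
    unsigned lane_id;    /* position inside that warp, 0..31 */
};

/* Assumes a 1D launch: every block holds block_dim threads. */
struct thread_coords locate_thread(unsigned block_idx, unsigned block_dim,
                                   unsigned thread_idx)
{
    struct thread_coords c;
    assert(block_dim > 0 && thread_idx < block_dim);
    c.global_id = block_idx * block_dim + thread_idx;
    c.warp_id = thread_idx / WARP_SIZE;  /* consecutive threads fill a warp */
    c.lane_id = thread_idx % WARP_SIZE;
    return c;
}
```

For example, thread 70 of block 2 in a launch of 128-thread blocks lands in warp 2, lane 6 of its block.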

13 Warp

14 Compute - Integration. Utilize existing libraries: cuFFT (10x faster), cuBLAS (6-17x faster), cuSPARSE (8x faster). Application integration: Matlab, Mathematica. DIY: CUDA, OpenCL, DirectX Compute.

15 Compute - Integration

OpenACC:

    #include <stdio.h>
    #define N            /* value not given in the source */

    int main(void)
    {
        double pi = 0.0;
        long i;

        /* Ask the compiler to offload the loop to the accelerator. */
        #pragma acc region for
        for (i = 0; i < N; i++) {
            double t = (double)((i + 0.5) / N);
            pi += 4.0 / (1.0 + t * t);
        }
        printf("pi=%16.15f\n", pi / N);
        return 0;
    }

cuFFT:

    #include <cuda_runtime.h>
    #include <cufft.h>

    #define NX 64
    #define NY 64
    #define NZ 128

    cufftHandle plan;
    cufftComplex *data1, *data2;
    cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
    cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
    /* Create a 3D FFT plan. */
    cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);
    /* Transform the first signal in place. */
    cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);
    /* Transform the second signal using the same plan. */
    cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);
    /* Destroy the cuFFT plan. */
    cufftDestroy(plan);
    cudaFree(data1);
    cudaFree(data2);

16 Dynamic Parallelism. Allow the GPU to create work for itself; remove CPU -> GPU sync points.

17 Dynamic Parallelism

18 Why do you care? Constructs reflect: communication boundaries, read/write coherency. Warps: the low-level primitive the HW operates on; blocks/CTAs are made up of an integer # of warps. AMD/Intel HPC chips are similar. Use existing libraries: don't reinvent the wheel.

19 GPU Yesterday: GeForce 3

20 Kepler GK110

21 Kepler GPC. Similar to a CPU core: stamp out for scaling; contains multiple SMXs. Graphics features: Polymorph Engine, Raster Engine, independent execution. Compute features: independent kernels, ~CTA (work group).

22 Kepler SMX. Schedule threads efficiently; maximize utilization. 192 CUDA cores, 64 DP units, 32 SFU, 32 LD/ST. Perfect utilization? 320 threads available per clock.

23 Kepler Scheduler

24 Maxwell GM107

25 Maxwell GPC. 5 SMMs per GPC (Kepler had 2 SMXs). ~90% of the perf of an SMX, but *much* smaller. More power efficient. Scheduling changes.

26 Maxwell GPC

27 Maxwell SMM. Schedule threads efficiently; maximize utilization. 128 CUDA cores, 32 SFU, 32 LD/ST. Warp scheduling: each scheduler owns its cores and has its own instruction buffer; still 2x dispatch units.

28 Maxwell Utilization. Avoid divergence! Shuffle your program to avoid divergence within a warp. Size of CTA matters! The programmer thinks in CTAs, the HW operates on warps. Example: a CTA size of 48 threads on a 32-thread warp architecture gives 48/32 = 1.5 warps, so the second warp is only partially occupied. Scheduler: the 2 dispatch units need two independent instructions per warp; a series of dependent instructions = poor utilization.
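The 48-thread example and the divergence warning can both be put into numbers. The following C sketch uses hypothetical helper names and a deliberately simplified model: a warp that diverges on an if/else is assumed to run both sides one after the other, and real hardware has more cases than this.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE 32u

/* Warps the hardware must allocate for one CTA: always a whole number. */
unsigned warps_per_cta(unsigned cta_threads)
{
    assert(cta_threads > 0);
    return (cta_threads + WARP_SIZE - 1u) / WARP_SIZE;
}

/* Percentage of allocated lanes that actually hold a thread. */
unsigned lane_utilization_percent(unsigned cta_threads)
{
    return 100u * cta_threads / (warps_per_cta(cta_threads) * WARP_SIZE);
}

/* Passes one warp needs for an if/else, with one mask bit per lane:
   1 when all active lanes agree, 2 when they split (simplified model). */
unsigned branch_passes(uint32_t taken_mask, uint32_t active_mask)
{
    uint32_t taken = taken_mask & active_mask;
    return (taken == 0u || taken == active_mask) ? 1u : 2u;
}
```

With 48 threads the CTA occupies 2 warps at 75% lane utilization; rounding the CTA size to a multiple of 32 removes that waste.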

29 Maxwell Utilization. Latency: memory access takes ~10 cycles -> 800 cycles; CUDA cores take a few cycles to complete. Contention for limited resources: 64K regs/SM * SM / 64 warps * warp / 32 threads = 32 regs/thread before the # of warps/SM decreases. A high # of warps is vital for high utilization. Balance the workload: Tex versus FP versus SFU versus LD/ST; avoid peak utilization of any one unit.
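The register budget above can be turned into a small calculator. This C sketch takes the slide's figures (64K registers per SM, 64 resident warps, 32 threads per warp) and assumes registers are the only limiter; shared memory, CTA granularity and the hardware's register allocation rounding are ignored. The function name is hypothetical.

```c
#include <assert.h>

/* Figures from the slide. */
#define REGS_PER_SM      65536u
#define MAX_WARPS_PER_SM 64u
#define WARP_SIZE        32u

/* Resident warps per SM when registers are the only limit
   (assumption: no other resource constrains occupancy). */
unsigned resident_warps(unsigned regs_per_thread)
{
    unsigned by_regs;
    assert(regs_per_thread > 0);
    by_regs = REGS_PER_SM / (regs_per_thread * WARP_SIZE);
    return by_regs < MAX_WARPS_PER_SM ? by_regs : MAX_WARPS_PER_SM;
}
```

At 32 registers per thread all 64 warps fit; doubling to 64 registers per thread halves the resident warps to 32.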

30 Kepler SMX. Each CUDA core can do an FMA = 2 FLOP. 192 CUDA cores per SMX, 15x2 SMX (Titan Z) = 11520 FLOP/clock. 705 MHz (876 MHz boost) = 8.1 TFLOPs (10 TFLOPs boost). Intel Xeon E GFLOPs (AVX) XBox One PS TFLOPs 1.84 TFLOPs
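The Titan Z numbers follow from one multiplication chain. A minimal C sketch, with a hypothetical function name, assuming every core retires one FMA on every clock (a theoretical peak, not a measured rate):

```c
#include <assert.h>

/* Peak single-precision rate in GFLOP/s:
   2 FLOP per FMA * cores per SM * number of SMs * clock. */
double peak_gflops(unsigned cores_per_sm, unsigned sm_count, double clock_mhz)
{
    double flop_per_clock;
    assert(cores_per_sm > 0 && sm_count > 0 && clock_mhz > 0.0);
    flop_per_clock = 2.0 * cores_per_sm * sm_count;  /* 11520 for Titan Z */
    return flop_per_clock * clock_mhz / 1000.0;      /* MFLOP/s -> GFLOP/s */
}
```

With 192 cores, 30 SMXs and 705 MHz this gives about 8.1 TFLOPs; at the 876 MHz boost clock it gives about 10.1 TFLOPs.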

31 Infiltrator Demo: tiled deferred shading; temporal anti-aliasing; tessellation and displacement; millions of particles, colliding with/lit by the environment; physically based materials/lighting (static GI).

32 Deferred shading. Render to G-Buffer: material, normal, depth. Defer shading: wait until all objects are rendered, then use pixel position plus G-Buffer. Shading is done exactly once per pixel.
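A CPU-side sketch of the shading pass, in C, may make the "exactly once per pixel" point concrete. Everything here is illustrative: the texel layout, the single-channel albedo standing in for the material, and the Lambert N.L term are assumptions, not the deck's renderer. The pixel position and depth are carried but unused by this toy lighting model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical G-buffer texel: the slide lists material, normal, depth. */
struct gbuffer_texel {
    float albedo;     /* stand-in for the material (one channel) */
    float normal[3];  /* unit surface normal */
    float depth;      /* stored by the geometry pass; unused below */
};

/* Shading pass: runs after all geometry has been written to the G-buffer,
   so each pixel is shaded once no matter how many objects covered it.
   Lighting is a placeholder Lambert term with one directional light. */
void shade_gbuffer(const struct gbuffer_texel *gbuf, size_t pixel_count,
                   const float light_dir[3], float *out_color)
{
    size_t i;
    assert(gbuf != NULL && light_dir != NULL && out_color != NULL);
    for (i = 0; i < pixel_count; ++i) {
        const float *n = gbuf[i].normal;
        float n_dot_l = n[0] * light_dir[0] + n[1] * light_dir[1]
                      + n[2] * light_dir[2];
        if (n_dot_l < 0.0f)
            n_dot_l = 0.0f;  /* surface faces away from the light */
        out_color[i] = gbuf[i].albedo * n_dot_l;
    }
}
```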

33 Deferred shading

34 Tiled Deferred Shading. Render to G-Buffer (materials, normal, depth) per pixel. Segment screen into tiles (think parallel); see the DICE 2011 GDC presentation. Compute light-tile intersection; output culled light list to buffer. Perform per-pixel light pass: (materials, normal, depth) + pixel pos; fetch the list of lights hitting the tile containing the pixel; iterate + light.
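The light-tile intersection step can be sketched in C under some strong simplifications. This is not the DICE implementation: lights are assumed to be already projected to screen-space circles, the tile size and per-tile cap are arbitrary, depth bounds are ignored, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TILE_SIZE 16           /* assumed tile edge in pixels */
#define MAX_LIGHTS_PER_TILE 8  /* assumed cap for this sketch */

/* Hypothetical light, already projected to a screen-space circle. */
struct screen_light { float x, y, radius; };

/* Culled light list for one tile: indices into the light array. */
struct tile_lights {
    size_t count;
    size_t index[MAX_LIGHTS_PER_TILE];
};

static float clampf(float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Circle versus axis-aligned tile rectangle: find the point of the tile
   closest to the light centre and compare its distance to the radius. */
int light_hits_tile(const struct screen_light *l, unsigned tx, unsigned ty)
{
    float x0 = (float)(tx * TILE_SIZE), y0 = (float)(ty * TILE_SIZE);
    float cx = clampf(l->x, x0, x0 + TILE_SIZE);
    float cy = clampf(l->y, y0, y0 + TILE_SIZE);
    float dx = l->x - cx, dy = l->y - cy;
    return dx * dx + dy * dy <= l->radius * l->radius;
}

/* Build the light list for tile (tx, ty); lights past the cap are dropped. */
void cull_lights_for_tile(const struct screen_light *lights, size_t n,
                          unsigned tx, unsigned ty, struct tile_lights *out)
{
    size_t i;
    assert(lights != NULL && out != NULL);
    out->count = 0;
    for (i = 0; i < n && out->count < MAX_LIGHTS_PER_TILE; ++i)
        if (light_hits_tile(&lights[i], tx, ty))
            out->index[out->count++] = i;
}
```

In the per-pixel pass, a pixel at (px, py) would read the list of tile (px / TILE_SIZE, py / TILE_SIZE) and loop only over those lights. On a GPU each tile maps naturally to one CTA, which is the "think parallel" hint on the slide.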

35 Tiled Deferred Shading

36 Opportunities. NVIDIA/AMD/Intel (Samsung/Qualcomm/Apple): strong C/C++; strong OS/algorithms fundamentals; parallel programming w/ emphasis on performance; compiler experience a plus; graphics knowledge a plus. Other industries: games industry, seismic processing, biochemistry simulations, weather/climate modeling, CAD.

37 Questions?

GPU ARCHITECTURE Chris Schultz, June 2017

MISC: All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA.

2 OUTLINE: CPU versus GPU (why are they different?); CUDA