GPU Architecture. Samuli Laine NVIDIA Research

1 GPU Architecture Samuli Laine NVIDIA Research

2 Today
The graphics pipeline: evolution of the GPU
Throughput-optimized parallel processor design, i.e., the GPU
  Contrast with latency-optimized (CPU-like) design
A look at NVIDIA's GPU architecture

3 Atari: Pong (1972) Dedicated video circuitry

4 CAPCOM: Commando, C64 version (1985) Video chip with HW sprites etc.

5 id Software: DOOM (1993) 2.5D + sprites, everything done on CPU

6 id Software: Quake (1996) True 3D, everything still done on CPU

7 Valve: Half-Life (1998) Triangle rasterization hardware

8 Valve: Half-Life 2 (2004) GPU with programmable shaders

9 DICE: Star Wars Battlefront (2015) GPU with shaders, computation

10 The Graphics Pipeline

11 The Graphics Pipeline
Vertex Transform & Lighting → Triangle Setup & Rasterization → Texturing & Pixel Shading → Depth Test & Blending → Framebuffer

12 The Graphics Pipeline
Vertex → Rasterize → Pixel → Test & Blend → Framebuffer
Remains a useful abstraction
Hardware used to look like this

13 The Graphics Pipeline
Vertex → Rasterize → Pixel → Test & Blend → Framebuffer
Hardware used to look like this
Vertex, pixel processing became programmable
  float4 skin(float4 restpos, uniform float4x4 xform[N_BONES], uniform float weight[N_BONES]) {
    float4 outpos = float4(0,0,0,0);
    for (int b = 0; b < N_BONES; b++) {
      outpos += weight[b] * mul(xform[b], restpos);
      ...

14 Vertex Shaders
f(position, attributes) → (new position, attributes)
Purely functional (no side effects)
Move / animate vertices
Apply view and projection matrices
Prepare data for pixel shaders
  Lighting, texture coordinates, ...
Hardware interpolates vertex attributes over the triangle and passes the results to the pixel shader

15 VS Example 1: Blend Shapes
E.g., face geometries
  Angry, happy, sad, move eyebrow, ...
Each target geometry is stored as a difference vector
For each vertex: average position + n differences
Result is a weighted sum of all targets
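The per-vertex blend-shape computation can be sketched on the CPU as follows. This is a minimal illustration rather than shader code: the function name `blend_shape`, the target names, and all numeric values are made up for the example.

```python
def blend_shape(average, deltas, weights):
    """Return average + sum_i weights[i] * deltas[i] for one vertex.

    average: (x, y, z) position in the average (neutral) geometry
    deltas:  one (dx, dy, dz) difference vector per target
    weights: one blend weight per target
    """
    out = list(average)
    for delta, w in zip(deltas, weights):
        for axis in range(3):
            out[axis] += w * delta[axis]
    return tuple(out)


# Hypothetical data: one vertex and two targets ("happy", "angry").
average = (0.0, 1.0, 0.0)
deltas = [(0.0, 0.5, 0.0), (0.2, 0.0, 0.0)]
weights = [0.5, 1.0]
blended = blend_shape(average, deltas, weights)
```

In a real vertex shader the same sum would run once per vertex, with the weights supplied as uniforms shared by all vertices.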

16 VS Example 2: Skinning
Transform each vertex p_i with each bone as if it were rigidly tied to it
Blend the results using bone weights
  float4 skin(float4 restpos,
              uniform float4x4 xform[N_BONES],
              uniform float weight[N_BONES])
  {
    float4 outpos = float4(0,0,0,0);
    for (int b = 0; b < N_BONES; b++) {
      // Position as if rigidly attached to bone b, scaled by that bone's weight
      outpos += weight[b] * mul(xform[b], restpos);
    }
    return outpos;
  }
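For readers who want to run the skinning loop outside a shader, here is a rough Python equivalent. The helpers `mat_vec` and `translation` and the two-bone setup are assumptions made for this sketch; only `skin` mirrors the slide.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def skin(restpos, xform, weight):
    """Blend the rest position transformed by every bone, using bone weights."""
    outpos = [0.0, 0.0, 0.0, 0.0]
    for bone_matrix, w in zip(xform, weight):
        moved = mat_vec(bone_matrix, restpos)  # as if rigidly tied to this bone
        outpos = [o + w * x for o, x in zip(outpos, moved)]
    return outpos


def translation(tx, ty, tz):
    """Hypothetical helper: a bone transform that only translates."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


# Two bones: one stays put, one moves 2 units along x; equal weights.
bones = [translation(0.0, 0.0, 0.0), translation(2.0, 0.0, 0.0)]
skinned = skin([1.0, 1.0, 0.0, 1.0], bones, [0.5, 0.5])
```

With equal weights the vertex lands halfway between the two rigid results. The weights should sum to 1 so that the w component stays 1.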

17 VS Example 3: Projection
In: vertex position in input space
Out: vertex position in clip space
All* vertex shaders need to do this
  float4 transform(float4 worldpos, uniform float4x4 modelviewprojection)
  {
    return mul(modelviewprojection, worldpos);
  }
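To make the clip-space output concrete, the sketch below builds one possible projection matrix and applies the same matrix-vector product. The OpenGL-style `perspective` matrix and its parameters are assumptions for this example and are not taken from the slides; the divide by w is shown only to check the result, since the hardware performs it after the vertex shader.

```python
import math


def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def perspective(fov_y_deg, aspect, near, far):
    """Assumed OpenGL-style projection: camera looks down -z, NDC z in [-1, 1]."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]


def transform(pos, modelviewprojection):
    """Python counterpart of the slide's transform(): input space to clip space."""
    return mat_vec(modelviewprojection, pos)


mvp = perspective(90.0, 1.0, 1.0, 100.0)
clip_near = transform([0.0, 0.0, -1.0, 1.0], mvp)    # point on the near plane
clip_far = transform([0.0, 0.0, -100.0, 1.0], mvp)   # point on the far plane
ndc_z_near = clip_near[2] / clip_near[3]
ndc_z_far = clip_far[2] / clip_far[3]
```

Points on the near and far planes end up at the two ends of the normalized depth range, which is what the rasterizer expects from this convention.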

18 Pixel Shaders
f(interpolated attributes) → (color, [depth])
Also known as fragment shaders
Purely functional (no side effects)
Calculate color of the surface at the given pixel
Also possible:
  Set blending opacity (alpha)
  Override hardware-generated depth value
  Discard, i.e., produce no output
Hardware takes the produced fragment and blends it into the frame buffer

19 PS Example 1: Lighting
Blinn-Torrance-Phong shading model
Uses the halfway vector h between v and l
[Figure: surface point p with normal n, light vector l, view vector v, and halfway vector h]

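20 PS Example 1: Lighting
[Figure: surface point p with normal n, light vector l, view vector v, and halfway vector h]
  struct interpolants { float4 p, n, v; };
  struct light { float4 pos; float Li; };

  float4 phong(interpolants in, uniform light lgt, uniform float q, uniform float Ks)
  {
    float4 l = lgt.pos - in.p;                  // unnormalized vector from surface point to light
    float r2 = dot(l, l);                       // squared distance to the light
    float4 h = normalize(normalize(l) + in.v);  // halfway vector between l and v
    return Ks * pow(dot(in.n, h), q) * (lgt.Li / r2);
  }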
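The same specular term can be evaluated in plain Python to check the arithmetic. This sketch uses 3-component vectors instead of float4 and clamps the dot product at zero before exponentiation; the clamp is an addition of this example and is not in the slide's code. All input values are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def phong(p, n, v, light_pos, light_li, q, ks):
    """Specular term with halfway vector and inverse-square falloff.

    p, n, v: surface position, unit normal, unit view vector
    light_pos, light_li: light position and intensity
    q, ks: specular exponent and specular coefficient
    """
    l = [a - b for a, b in zip(light_pos, p)]   # unnormalized vector to the light
    r2 = dot(l, l)                              # squared distance to the light
    h = normalize([a + b for a, b in zip(normalize(l), v)])
    n_dot_h = max(dot(n, h), 0.0)               # clamp: assumption of this sketch
    return ks * n_dot_h ** q * (light_li / r2)


# Light directly above the surface point, viewer looking straight down.
value = phong(p=[0.0, 0.0, 0.0], n=[0.0, 0.0, 1.0], v=[0.0, 0.0, 1.0],
              light_pos=[0.0, 0.0, 2.0], light_li=4.0, q=10.0, ks=0.5)
```

Here h coincides with n, so the highlight is at full strength and only Ks and the distance falloff scale the result.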

21 More PS Examples: Melting Ice Procedural, animated texture Bumped environment map

22 More PS Examples: Toon & Fur Toon shading Volumetric fur

23 Power of VS & PS: Half-Life 2 (2004)

24 Questions?

25 The Graphics Pipeline
Vertex → Rasterize → Pixel → Test & Blend → Framebuffer
Hardware used to look like this
Vertex, pixel processing became programmable
  float4 skin(float4 restpos, uniform float4x4 xform[N_BONES], uniform float weight[N_BONES]) {
    float4 outpos = float4(0,0,0,0);
    for (int b = 0; b < N_BONES; b++) {
      outpos += weight[b] * mul(xform[b], restpos);
      ...

26 The Graphics Pipeline
Vertex → Geometry → Rasterize → Pixel → Test & Blend → Framebuffer
Hardware used to look like this
Vertex, pixel processing became programmable
New stages added
  float4 skin(float4 restpos, uniform float4x4 xform[N_BONES], uniform float weight[N_BONES]) {
    float4 outpos = float4(0,0,0,0);
    for (int b = 0; b < N_BONES; b++) {
      outpos += weight[b] * mul(xform[b], restpos);
      ...

27 The Graphics Pipeline
Vertex → Tessellation → Geometry → Rasterize → Pixel → Test & Blend → Framebuffer
Hardware used to look like this
Vertex, pixel processing became programmable
New stages added
Even more stages added
GPU architecture increasingly centers around shader execution
  float4 skin(float4 restpos, uniform float4x4 xform[N_BONES], uniform float weight[N_BONES]) {
    float4 outpos = float4(0,0,0,0);
    for (int b = 0; b < N_BONES; b++) {
      outpos += weight[b] * mul(xform[b], restpos);
      ...

28 Modern GPUs: Unified Design
[Figure: discrete design (separate Shader A, B, C, D units connected through ibuffers and obuffers) versus unified design (one shader core)]
Vertex shaders, pixel shaders, etc. become threads running different programs on a flexible core

29 GPU Architecture Today GP102 (Titan X)

30 GPU Architecture Today GP102 (Titan X)

31 GPU Architecture Today
GP102 (Titan X)
[Figure labels: Vertex Fetch, PolyMorph Engine 4.0, Tessellator, Attribute Setup, Raster Engine, Simultaneous Multi-Projection, Stream Output]
Very small portion of chip is strictly graphics-specific hardware

32 GPU Architecture Today Most of the units are for general-purpose computation, suitable for running arbitrary graphics shaders GP102 (Titan X)

33 What Makes It Fast?
Massive number of independent work items (pixels)
  Allows parallelism
  Usually, coherent control
High degree of data locality
  Main sources of off-chip accesses: textures and frame buffer
  Luckily, these tend to be very coherent!
  Keep as much data as possible on-chip (vertices, attributes, etc.)
Custom scheduling and resource allocation
  No need for software arbitration, thread launching, synchronization, ...
Fixed-function units for common, expensive operations
  E.g., texture filtering

34-37 Different Workloads
Graphics = throughput-sensitive (GPU)
  Large number of independent but similar work items
  Heavy on arithmetic (many math operations per memory operation)
  Coherent control, little data-dependent branching
  Coherent memory accesses
Opposite = latency-sensitive (CPU)
  Long programs with serial dependencies
  Complex data-dependent control and memory access patterns
  Few independent work items
    Not 2 million pixels

38 Physical Realities Today
Clock speeds are not going up by much...
...and power consumption is superlinear in clock frequency
Unavoidable corollary: processors must be parallel

39 Physical Realities Today, cont'd
DRAM is slow
  Latency is 100s of cycles
  More speed is exponentially more expensive
DRAM is bad with random access
  Memory atom is large (32 bytes), need coalesced reads/writes
  Strong pressure towards 64-byte atom
DRAM is power hungry
  Off-chip access may burn 1000x more power than reading from the register file (which is not free either)
Need to minimize DRAM use, otherwise execution units sit idle waiting for data

40 Dealing with DRAM, the CPU Way
1. Get locality by large, fast on-chip caches ($)
2. Reorder instructions to hide latency
3. Use a few threads to further hide latency
   E.g., Intel's HyperThreading™
Great for workloads that exhibit data reuse
  When cache is large enough to accommodate the working set
  Even with non-coherent access patterns
Tolerates unpredictable control by branch prediction

41 Dealing with DRAM, the GPU Way
1. Bite the bullet and wait
   When waiting, switch in other threads that have all the data they need
   With enough threads, DRAM latency is hidden
   What is enough? Need many times more threads than execution units (remember, latency is 100s of cycles)
2. Exploit locality by having individual threads co-operate through fast on-chip memory
Allows execution units to be much simpler
  No need for branch prediction, instruction reordering logic, register renaming, etc.
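A back-of-the-envelope model shows why "enough" means many threads per execution unit. It assumes, purely for illustration, that every thread alternates between a fixed number of compute cycles and one memory request with a fixed latency. The function names and the numbers 400 and 20 are hypothetical.

```python
import math


def threads_to_hide_latency(latency_cycles, compute_cycles):
    """Threads needed per execution unit so that some thread is always ready.

    While one thread waits latency_cycles for memory, the others must
    together supply at least that many cycles of compute.
    """
    return 1 + math.ceil(latency_cycles / compute_cycles)


def utilization(num_threads, latency_cycles, compute_cycles):
    """Fraction of cycles the execution unit is busy under the same model."""
    busy = num_threads * compute_cycles
    period = compute_cycles + latency_cycles
    return min(1.0, busy / period)


# Hypothetical numbers: 400-cycle DRAM latency, 20 compute cycles per request.
needed = threads_to_hide_latency(400, 20)
single_thread = utilization(1, 400, 20)
fully_hidden = utilization(needed, 400, 20)
```

Under these assumptions a single thread keeps the unit busy less than 5% of the time, while 21 threads keep it fully occupied. Real hardware has queues, caches, and variable latency, so the model only indicates the order of magnitude.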

42 Shared Memory
Local to each SM, can be shared between threads
Goal: Bring the data closer to the ALU
  I.e., minimize trips to external memory
  Share values between threads to minimize overfetch and computation
Increases arithmetic intensity by keeping data close to the processors

43 Multicore CPU: Run 10 Threads Fast
[Figure: a few cores, each with its own cache, connected to global memory]
Few processor cores, each supporting 1-2 hardware threads
Large on-chip memory/cache near processor

44 GPU: Run 10000 Threads Fast
[Figure: many SMs, each with its own cache/memory, connected to global memory]
Dozens of SMs, each supporting hundreds / thousands of hardware threads
On-chip memory near processors
  Use as explicit local storage, allow thread co-operation
Hide latency by switching between many threads

45 High-Bandwidth Memory Interfaces
GDDR5 / GDDR5X / HBM2 memory interface
Wide memory bus to GDDR5(X), up to 480 GB/s
HBM2 memory is on-package (next to the GPU die), up to 4096-bit wide bus and 720 GB/s
[Figure: GP102 (Titan X) with GDDR5X memory; GP100 with HBM2 memory]
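Peak bandwidth figures like these follow from bus width times per-pin data rate. The sketch below shows the arithmetic. The per-pin rates (1.4 Gbit/s for HBM2, 10 Gbit/s for GDDR5X) are assumptions of this example and are not stated on the slide, and the function names are hypothetical.

```python
def peak_bandwidth_gb_s(bus_width_bits, gbit_per_s_per_pin):
    """Peak bandwidth in GB/s: every bus pin moves gbit_per_s_per_pin each second."""
    return bus_width_bits * gbit_per_s_per_pin / 8.0


def implied_bus_width_bits(bandwidth_gb_s, gbit_per_s_per_pin):
    """Bus width that a quoted bandwidth implies at a given per-pin rate."""
    return bandwidth_gb_s * 8.0 / gbit_per_s_per_pin


# Assumed per-pin data rates (not from the slides).
HBM2_GBIT_PER_PIN = 1.4
GDDR5X_GBIT_PER_PIN = 10.0

hbm2_bandwidth = peak_bandwidth_gb_s(4096, HBM2_GBIT_PER_PIN)
gddr5x_bus_bits = implied_bus_width_bits(480.0, GDDR5X_GBIT_PER_PIN)
```

With the assumed rates, the 4096-bit HBM2 bus lands close to the quoted 720 GB/s. A very wide, slower bus and a narrower, faster bus are two routes to similar bandwidth.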

46 Questions?

47 NVIDIA Pascal Architecture GP100

48 NVIDIA Pascal Architecture GP100

49 GP100 SM
[Figure labels: scheduler, register file, single-precision ALUs, double-precision ALUs, L1 cache, shared memory]

50 Warps
Threads are executed in warps
  A warp contains up to 32 threads
SM operates at warp granularity
  Resource allocation
  Execution

51 Warp scheduling
At every cycle, each SM chooses which warp to execute
  Actually two warps per cycle in current architectures
  Zero overhead in switching between warps or threads
A warp is eligible to be executed if all of its threads are free to execute
  Not waiting for memory fetches
  Not waiting for results from ALUs
  Not waiting for synchronization

52 Program counter (PC)
All threads in a warp have the same PC
  I.e., they execute the same instruction on a given cycle

53 SIMT execution model
How is this possible?
  Sounds like SIMD, but how can threads be independent?
SIMT = Single Instruction, Multiple Threads
  Close to SIMD, but allows free per-thread control flow
  Built into SM instructions and scheduler
  Dedicated hardware is necessary for efficient implementation

54 SIMT vs SIMD
SIMD (Single Instruction, Multiple Data)
  Used in CPUs, e.g., Intel's SSE/AVX extensions
  Programmer sees a scalar thread with access to a wide ALU
  For example, able to do 4 or 8 additions with a single instruction
SIMT (Single Instruction, Multiple Threads)
  Programmer sees independent scalar threads with scalar ALUs
  Hardware internally converts independent control flow into convergent control flow

55 Managing divergence
How can threads of a warp diverge if they all have the same PC?
Partial solution: per-instruction execution predication
Full solution: execution mask, execution stack in hardware

56-59 Example: Instruction predication
  if (a < 10)
    small++;
  else
    big++;

  ISETP.LT.AND P0, pt, R6, 10, pt;   // set predicate register P0 if a < 10; result can vary across the warp
  @P0  IADD R5, R5, 0x1;             // in threads where P0 is set, R5 = R5 + 1
  @!P0 IADD R4, R4, 0x1;             // in threads where P0 is clear, R4 = R4 + 1
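The effect of predication on a whole warp can be mimicked in a few lines. This is a toy model, not a description of the hardware: every lane sees both additions, and a per-lane predicate decides which one takes effect. The function name and the four-lane warp are made up for the example.

```python
def run_predicated(a_values, small, big):
    """Simulate the predicated if/else for all lanes of one warp.

    a_values, small, big: per-lane values of a, small and big.
    Returns the updated per-lane small and big counters.
    """
    # Predicate set where a < 10; it can differ from lane to lane.
    p0 = [a < 10 for a in a_values]
    # First add: takes effect only in lanes where the predicate is set.
    small = [s + 1 if p else s for s, p in zip(small, p0)]
    # Second add: takes effect only in lanes where the predicate is clear.
    big = [b if p else b + 1 for b, p in zip(big, p0)]
    return small, big


# Hypothetical 4-lane warp.
small, big = run_predicated([3, 12, 9, 10], [0, 0, 0, 0], [0, 0, 0, 0])
```

Both additions are issued for the whole warp regardless of the data, so predication costs the sum of both branches. That is why it suits short branches and why longer ones use the mask-and-stack mechanism.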

60 What about complex cases?
Nested if / else blocks, loops, recursion
Solution: execution mask and execution stack

61-64 Execution mask & stack: Example
  if (a < 10)
    foo();
  else
    bar();

  /*0048*/ ISETP.LT.AND P0, pt, R6, 10, pt;
  /*0050*/ @!P0 BRA 0x70;
  /*0058*/ ...;                // foo()
  /*0060*/ ...;
  /*0068*/ BRA 0x80;
  /*0070*/ ...;                // bar()
  /*0078*/ ...;
  /*0080*/ code continues here

Case 1: All threads take the if branch
  At BRA 0x70: no thread of the warp wants to jump
Case 2: All threads take the else branch
  At BRA 0x70: all threads of the warp want to jump
Case 3: Some threads take the if branch, some take the else branch
  At BRA 0x70: some threads of the warp want to jump: push
  At BRA 0x80: restore active thread mask
  At the end of bar(): pop
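The three cases can be reproduced with a small software model of an execution mask and a stack. It is a simplified sketch of the idea and not the actual hardware mechanism: the warp runs each needed branch once with only the participating lanes active, and the pre-branch mask is pushed only when the warp diverges. The function name and inputs are hypothetical.

```python
def warp_if_else(a_values, foo, bar):
    """Run `if (a < 10) foo(a) else bar(a)` for one warp.

    Returns (per-lane results, list of branches issued, final active mask).
    """
    lanes = range(len(a_values))
    active = [True] * len(a_values)
    results = [None] * len(a_values)
    issued = []
    stack = []

    p0 = [a < 10 for a in a_values]
    if_mask = [active[i] and p0[i] for i in lanes]
    else_mask = [active[i] and not p0[i] for i in lanes]

    # Only branches with at least one participating lane are executed.
    paths = []
    if any(if_mask):
        paths.append(("foo", foo, if_mask))
    if any(else_mask):
        paths.append(("bar", bar, else_mask))

    diverged = len(paths) == 2
    if diverged:
        stack.append(active)          # push: remember the mask to restore

    for name, fn, mask in paths:
        active = mask                 # lanes outside the mask do nothing
        issued.append(name)
        for i in lanes:
            if active[i]:
                results[i] = fn(a_values[i])

    if diverged:
        active = stack.pop()          # pop: all lanes reconverge
    return results, issued, active


results, issued, active = warp_if_else([1, 20, 3, 40],
                                       lambda a: a + 1, lambda a: a - 1)
```

A coherent warp issues one branch only, while a divergent warp pays for both. Nested conditionals would push further masks on the same stack.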

65 Benefits of SIMT
Supports all structured C++ constructs
  if / else, switch / case, loops, function calls, exceptions
  goto kind of works, but don't use it
Multi-level constructs handled efficiently
  break / continue from inside multiple levels of conditionals
  Function return from inside loops and conditionals
  Retreating to exception handler from anywhere
You only need to care about SIMT when tuning for performance
  Unlike traditional SIMD that gives you nothing unless you explicitly use it

66 Consequences of SIMT
An if statement takes the same number of cycles for any number of participating threads greater than zero
  If nobody participates, it's cheap
  Also, masked-out threads don't do memory accesses
A loop is iterated until all active threads in the warp are done
A warp stays alive until every thread in it has terminated
  Terminated threads are dead weight
  Same as in conditionals when masked out
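The loop rule has a simple quantitative consequence, sketched below under the assumption that every iteration costs the same for the whole warp. The function names and trip counts are hypothetical.

```python
def warp_loop_iterations(trip_counts):
    """Iterations the warp executes: it loops until its slowest thread is done."""
    return max(trip_counts, default=0)


def lane_efficiency(trip_counts):
    """Useful lane-iterations divided by lane-iterations the warp occupies."""
    iterations = warp_loop_iterations(trip_counts)
    if iterations == 0:
        return 1.0
    return sum(trip_counts) / (len(trip_counts) * iterations)


# Hypothetical 4-thread warps with per-thread loop trip counts.
coherent = [4, 4, 4, 4]
incoherent = [1, 1, 1, 13]

coherent_cost = warp_loop_iterations(coherent)
incoherent_cost = warp_loop_iterations(incoherent)
coherent_eff = lane_efficiency(coherent)
incoherent_eff = lane_efficiency(incoherent)
```

Both warps do 16 iterations of useful work in total, yet the incoherent one occupies the hardware more than three times as long, because three lanes sit masked out while the fourth finishes.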

67 Coherent Execution Is Great
An if statement is perfectly efficient if either everyone takes it or nobody does
  All threads stay active
A loop is perfectly efficient if everyone does the same number of iterations
Note: These are required for traditional SIMD

68 Incoherent Execution Is Okay
Conditionals are efficient as long as threads usually agree
Loops are efficient if threads usually take roughly the same number of iterations
Much easier to program than explicit SIMD
  SIMT: incoherence supported, performance degrades if control diverges
  SIMD: performance is fixed, incoherence not supported

69 Recap, GPU
Unified programmable cores used for all shader types
  Fixed-function units for rasterization, texture filtering, ROP, etc.
SIMT execution model
  Run scalar threads on a widely parallel machine
  SIMT provides hardware support
  SIMD requires the program to manage control flow
Throughput-oriented design
  Tolerate DRAM latency by having lots of active threads

70 Thank you! Questions?
