GPU Architecture. Michael Doggett Department of Computer Science Lund university

Size: px

Start display at page:

Download "GPU Architecture. Michael Doggett Department of Computer Science Lund university"

Claribel Simmons
6 years ago
Views:

1 GPU Architecture Michael Doggett Department of Computer Science Lund university

2 GPUs from my time at ATI R200 Xbox360 GPU R630 R610 R770

3 Let s start at the beginning...

4 Graphics Hardware before GPUs 1970s s Very Expensive Real-time Rendering needs a lot of compute power Military, large industrial, flight simulators Evans and Sutherland founded 1968 Custom built systems, multiple boards Custom and off-the-shelf ASICs Today s GPU has similar structure, but all on one chip

5 Pixel-Planes/Flow Research Systems University of North Carolina Pixel-Planes 5 [Fuchs89] PixelFlow [Molnar92] Programmable Shading [Lastra95] Final version built and sold by HP 97 Image courtesy [Lastra95]

6 Silicon Graphics Onyx MIPS R4400 RISC processors Reality Engine 92[Akeley93] Geometry Engine 50MHz Intel i860xp, Raster Memory, 1-4 OpenGL, no shaders Images courtesy [Akeley93]

7 Silicon Graphics Onyx Infinite Reality [Montrym97] Images courtesy [Montrym97]

8 Neon: A (Big) (Fast) Single-Chip 3D Workstation Graphics Accelerator Early GPU design (1999) with publication! 8 x 32bit memory controllers Commodity SDRAMs 2D tiled memory interleaving 350 nm, 17.3 x 17.3 mm, 6.8 million transistors, 100MHz

9 GPUs

10 Graphics Processing Unit GPU How to turn millions of triangles into pixels in 1/60 of a second? Parallelism!! Pipelining Single Instruction Multiple Data (SIMD) Multiple Instruction Multiple Data (MIMD) multiple cores Memory Bandwidth reductions (covered previously) Data 0 Data 1 Data 2 Data 3 1 Instruction PU PU PU PU 4-wide SIMD PU=processing unit

11 GPU design parameters Competition 2 strong competitors NVIDIA and AMD (can consider Intel a third) Performance/Dollar Moore s law Number of transistors on a chip doubles every two years APIs (OpenGL and DirectX) hide details Hide architectural changes Provide backward compatibility (mostly)

Before programmable GPUs ATI Founded 1985 2D - Mach

Riva, Riva TNT series 95-99 up to DirectX 6

12 Before programmable GPUs ATI Founded D - Mach series 3D - Rage series nvidia Founded 1993 Riva, Riva TNT series up to DirectX 6 Textures Rasterization Texturing Z & Alpha Image FrameBuffer

Hardware Transform, Clipping and Lighting NV10 99 R100 00 Nvidia GeForce 256 The first GPU ATI Radeon 7500 Transformation requires 32bit float 4x4 matrix

13 Hardware Transform, Clipping and Lighting NV10 99 R Nvidia GeForce 256 The first GPU ATI Radeon 7500 Transformation requires 32bit float 4x4 matrix multiplication Texturing for 4 component (RGBA) 8bit pixels Low precision math DirectX 7 Vertices Textures Image Transform Rasterization Texturing Z & Alpha FrameBuffer

14 Programmable GPUs 1st Generation Shaders run programs that do what fixed function hardware did GPU programs (shaders) have simpler requirements than CPU programs Vertices Shaders Textures Shaders Image Vertex Transform shader Rasterization Pixel Texturing shader Z & Alpha FrameBuffer

15 Programmable GPUs 1st Generation NV20 01 Nvidia GeForce 3 [Lindholm01] R Vertices Shaders Vertex Transform shader Rasterization ATI Radeon wide Vertex shader 4-wide SIMD Pixel shader Textures Shaders Pixel Texturing shader Z & Alpha Fixed point ~16bits Image FrameBuffer

16 Programmable GPUs 1st Generation DirectX8 Pixel Shaders Multiple versions 1.1, 1.3, instructions assembler language Vertices Shaders Textures Shaders Vertex Transform shader Rasterization Pixel Texturing shader Z & Alpha Image FrameBuffer

Programmable GPUs 2nd Generation R300 02 ATI Radeon 9700 256bit memory bus 4-wide Vertex shader 8-wide SIMD Pixel shader 24bit

17 Programmable GPUs 2nd Generation R ATI Radeon bit memory bus 4-wide Vertex shader 8-wide SIMD Pixel shader 24bit float Vertices Shaders Textures Shaders Vertex Transform shader Rasterization Pixel Texturing shader Z & Alpha Image FrameBuffer

Programmable GPUs 2nd Generation NV30 03 Nvidia GeForce FX 5800 16 and 32 bit float Cg (C for graphics) NV40 04 GeForce 6800 [Montrym05] DirectX 9 High

18 Programmable GPUs 2nd Generation NV30 03 Nvidia GeForce FX and 32 bit float Cg (C for graphics) NV40 04 GeForce 6800 [Montrym05] DirectX 9 High Level Shading Language (HLSL) Vertices Shaders Textures Shaders Image Vertex Transform shader Rasterization Pixel Texturing shader Z & Alpha FrameBuffer

19 Unified Shaders

20 Unified Shader Vertex shader Rasterization Pixel shader Z & Alpha FrameBuffer

21 Unified Shader Grouper Rasterization Unified shader Z & Alpha FrameBuffer

22 Unified Shader A revolutionary step in Graphics Hardware One hardware design that performs both Vertex and Pixel shaders Vertex processing power Vertices Vertices Pixels Vertex Shader Unified Shader Pixels Pixel Shader 22

23 Unified Shader GPU based vertex and pixel load balancing Better vertex and pixel resource usage Union of features E.g. Control flow, indexable constant, All 32bit floating point DX9 Shader Model 3.0+ Shader write at address instruction Vertices Pixels 3 SIMDs Unified Shader 48 ALUs Vec4 Scalar 23

24 XBox360 GPU architecture - 05 GPU Primitive Setup Clipper Rasterizer Hierarchical Z/S Index Stream Generator Tessellator Vertex Pipeline Pixel Pipeline Display Pixels Unified Shader Texture/Vertex Fetch UNIFIED MEMORY Output Buffer Memory Export SIMD is 16 vector math units wide - Shared instr. decode and control - Makes GPUs simpler than CPUs DAUGHTER DIE 24

25 Rendering performance Z & Alpha 2 Dies (1 GPU, 1 Logic Embedded DRAM) GPU to Daughter Die high speed interface 8 pixels/clk 32BPP color, 4 samples/pixel Z - Lossless compression 16 pixels/clk Double Z 4 samples/pixel Z - Lossless compression GPU 32GB/s DAUGHTER DIE 25

26 Rendering performance Z & Alpha Z and Alpha logic to EDRAM interface 256GB/s Color and Z - 32 samples 32bit color, 24bit Z, 8bit stencil Double Z - 64 samples 24bit Z, 8bit stencil DAUGHTER DIE 8pix/clk, 4x MSAA, Stencil and Z test, Alpha blending 256GB/s 10MB EDRAM 26

27 2013 Xbox360 GTAV

28 Unified Shader Grouper Rasterization Unified shader Z & Alpha FrameBuffer

Unified Shaders - PC G80 06 (Tesla [Lindholm08]) nvidia GeForce 8800 8 Texture Processor Cluster (TPC) 2 Streaming Multiprocessor (MP) 8 Streaming processors

29 Unified Shaders - PC G80 06 (Tesla [Lindholm08]) nvidia GeForce Texture Processor Cluster (TPC) 2 Streaming Multiprocessor (MP) 8 Streaming processors (SP), Multiply Add (MAD) MHz Level 2 cache placed close to the individual memories (DRAMs) DirectX 10 Shader Shared Memory CUDA - new compute focused API

30 NVIDIA G80 Images courtesy [Lindholm08]

31 Unified Shaders - PC R ATI Radeon SIMDs 16 Stream Processing Units 5 Scalar Units, MAD MHz DirectX 10

32 R600 Top Level Red Compute Rasterizer Yellow Cache Unified shader Thread Scheduler Shader R/W Instr./Const. cache Unified texture cache Compression 4 SIMDs 16 Vector Units per SIMD 5 ALUs per Vector Unit Unified Texture Cache L2 256KB L1 32KB Z + Alpha 32 August 2007 Radeon HD 2900

33 Fast Thread Switching enables GPU arithmetic intensity 1000s of small simple threads with context on chip GPUs hide large memory request latency by switching between threads time (on one processor) 1 Read sleep ALU ops 2 Read ALU ops 3 Read ALU ops 4 Read ALU ops mmxii mcd

34 How many instructions in shaders? ATI Ruby Demos 1-3 DX9 4 DX10 34 instructions Slide courtesy Norm Rubin, AMD

R770 08 ATI Radeon 4870! 10 SIMDs SIMD = 16 x 5 ALU!

35 R ATI Radeon 4870! 10 SIMDs SIMD = 16 x 5 ALU! New memory architecture L2 cache partitioned at MCs Crossbar to L1s! 1.2 TeraFLOPS 35

36 R770 Die! 260mm2 ->small! 956 MTransistors! ~1 Billion Transistors 36

37 R770 Die usage! Red 10 SIMDs! Orange 64 z/stencil 40 texture 37

38 DirectX 11 Direct3D 11 Tessellation Hull Shader Works on control points Tessellator - Fixed Function Generates new vertices Domain Shader Works on vertices Geometry Shader (DX10) Works on primitives (triangles) Compute Shader mmxi mcd

Multi-Graphics core nvidia GF100 (Fermi) GeForce GTX 480 Released March, 2010 4 x Triangle rate, 4 rasterization engines 4 GPC x 3-4 SM x 32 scalars = 480 scalars @ 1401MHz Memory

39 Multi-Graphics core nvidia GF100 (Fermi) GeForce GTX 480 Released March, x Triangle rate, 4 rasterization engines 4 GPC x 3-4 SM x 32 scalars = MHz Memory Hierarchy New L1-L2 cache for shader read/write L1 cache (per SM) is 16 or 48 KB 768KB L2 cache is coherent and shared with texture and other parts of the graphics pipeline mmxii mcd

40 Comparing GPU shaders NVIDIA GeForce GTX 480 SM ATI Radeon HD 5870 SIMD-engine Fetch/ Decode Fetch/ Decode Fetch/ Decode Execution contexts (128 KB) Shared memory (16+48 KB) Execution contexts (256 KB) Shared memory (32 KB) 15 Streaming Multiprocessors 20 SIMD-engines mmxii mcd Images courtesy Kayvon Fatahalian

41 Comparing CPUs and GPUs Core Core Core Core CPU L2 L2 L2 L2 Shared L3 cache GPU cache cache cache cache cache cache cache cache

42 GPUs in Nvidia Volta architecture, 12nm Nvidia Tesla V100 (GV100) 15 TeraFLOPS (32bit FP) AMD Radeon Pro Vega Consoles Microsoft Xbox One X (AMD) Sony Playstation Pro (AMD) Nintendo Switch Nvidia Tegra Mobile ARM Mali Bifrost Architecture Intel discreet GPUs coming mmxvi mcd

43 Next 5-10 years Continued upgrade cycle for GPUs? Importance of parallel programming languages CUDA, OpenCL, GPU Compute killer app is here AI/Deep Learning! Nvidia Volta - Tensor Cores (FP16) mmxvi mcd

44 Summary GPU shaders grew from low precision simple programs GPU ALU is much simpler than CPUs SIMD allows less hardware for many instructions GPUs hide memory latency with 1000s of threads GPUs are massively parallel devices and can be used for general compute (GPGPU)

45 Next... Project Description Due Thursday Thursday Graphics architecture and OpenCL Next week - Tuesday - Guest Speaker Johan Grönqvist, ARM

46 References Henry Fuchs et. al. "A Heterogeneous Multiprocessor Graphics System Using Processor- Enhanced Memories," Proceedings of SIGGRAPH '89 Steven Molnar et. al. PixelFlow: high-speed rendering using image composition, Proceedings of SIGGRAPH 92 Kurt Akeley, "RealityEngine Graphics". Proceedings of SIGGRAPH '93 John Montrym et. al. InfiniteReality: a real-time graphics system, Proceedings of SIGGRAPH 97 Joel McCormack, Neon: A (Big) (Fast) Single-Chip 3D Workstation Graphics Accelerator WRL Research Report 98/1 (Revised 99) Lindholm et. al., A user-programmable vertex engine, Proceedings of SIGGRAPH 01 Montrym et. al. The GeForce 6800, IEEE Micro, March 05 Lindholm et. al. Nvidia tesla: A unified graphics and computing architecture, IEEE Micro, March- April 08 Larry Seiler et. al. Larrabee: A Many-Core x86 Architecture for Visual Computing, Proceedings of SIGGRAPH 08 Jeremy Sugerman et. al. GRAMPS: A Programming Model for Graphics Pipelines, ACM Transactions on Graphics January, 09 Timo Aila et. al. Understanding the Efficiency of Ray Traversal on GPUs. High-Performance Graphics 2009.

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Graphics Architectures and OpenCL Michael Doggett Department of Computer Science Lund university Overview Parallelism Radeon 5870 Tiled Graphics Architectures Important when Memory and Bandwidth limited