Fast Tessellated Rendering on Fermi GF100 Hot3D @ HPG 2010 Tim Purcell
Outline DX10 state of the art DX11 tessellation Fermi GF100 architecture Results Demo 2
State of the Art: DX10 pipeline Input Assembler Vertex Shader Geometry Shader izer/ Interpolator Pixel Shader Image from Far Cry 2, courtesy of Ubisoft Output Merger Pixels are meticulously shaded 3
State of the Art: DX10 pipeline Image from Far Cry 2, courtesy of Ubisoft Pixels are meticulously shaded, but geometric detail is modest 4
Tessellation in DirectX 11 From input assembly Hull shader Computes the tessellation factor Runs pre-expansion Explicitly parallel across control points Control points Vertex Patch Assembly Hull Domain Primitive Assembly Geometry
Tessellation in DirectX 11 From input assembly Fixed function tessellation stage Configured by LOD output from Hull Shader Produces triangles and lines Control points Vertex Patch Assembly Hull Domain Primitive Assembly Geometry
Tessellation in DirectX 11 From input assembly Domain shader Runs post-expansion Maps (u,v) to (x,y,z,w) Implicitly parallel Vertex Patch Assembly Hull Control points Domain Primitive Assembly Geometry
Crossbar Primitive Distributor NVIDIA DX10 Logical Pipeline 8 Dataflow Vertex Shader Geometry Shader Pixel Shader
DX10 Logical Pipeline + Tessellation Primitive Distributor TS TS TS TS TS TS TS TS Tessellation Shader (includes Hull Shader,, and Domain Shader) Dataflow Crossbar 9
Problems With DX10 + Tess. (1) Data expansion in TS can lead to excessive buffering requirements after Have to drain 0 before 1 to maintain API ordering TS 0 TS n 0 n 10
Problems With DX10 + Tess. (2) limited prim rate TS 0 TS 1 TS n Already a problem for CPU-generated geometry Tessellation generates even more primitives 0 1 n / 0 m 11
Crossbar Primitive Distributor Work Distribution Crossbar Crossbar Task Distributor Work Distribution Crossbar Fermi GF100 Logical Pipeline 12 Hull Shader Dataflow Domain Shader (includes Edge, izer, Zcull )
Crossbar Primitive Distributor Work Distribution Crossbar Crossbar Task Distributor Work Distribution Crossbar Fermi GF100 Logical Pipeline 13 Dataflow Polymorph (includes,, and )
Fermi GF100 Tessellation Features 1 Primitive Distributor Crossbar Task Distributor Task ~= Hull Shader output Control points plus tessellation factor Pre-expansion Distributes tasks for subsequent geometry processing Expand patch into primitives Optional Reduces buffering requirements Task Distributor Work Distribution Crossbar Crossbar Work Distribution Crossbar 14
Fermi GF100 Tessellation Features 2 Parallel Includes clipping and culling Eliminate serialization point 15 Crossbar Primitive Distributor Work Distribution Crossbar Crossbar Task Distributor Work Distribution Crossbar
Fermi GF100 Tessellation Features 3 Primitive Distributor Crossbar Parallel ization Including Edge, ization, and Zcull Improved primitive throughput Multiple primitives per clock Screen mapped for load balancing Task Distributor Work Distribution Crossbar Work Distribution Crossbar Crossbar 16
Screen Mapped ization 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 1 2 3 0 1 2 3 0 1 2 3 A 2 3 0 1 2 3 0 1 2 B3 0 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 izer 0 izer 1 izer 2 izer 3 Each block is a tile of pixels izers uniquely own pixel tiles Note: small primitives can overlap multiple tiles 17
Challenge: Maintaining API Order Primitive Distributor Crossbar Parallelization is easy Maintaining API ordering is a challenge Work Distribution Crossbar (WDX) Similar to Pomegranate [Eldridge et al. 2000] Task Distributor Work Distribution Crossbar Work Distribution Crossbar Crossbar 18
Work Distribution Crossbar OWDX SWDX Uses primitive bounding box to determine which rasterizers get which primitives reconstructs API order izers uniquely own pixels No subsequent sorting for ordering required OWDX SWDX OWDX Work Distribution Crossbar SWDX OWDX SWDX OWDX SWDX 19
Handling Load Imbalance Load imbalance is a concern for all region-based architectures because they are non adaptive Static imbalance Different screen regions get differing amounts of work Mitigated by relatively small pixel tiles Dynamic imbalance Lots of tiny primitives in same screen region at the same time Mitigated by buffering We can only buffer so much 20
Results 21
M tris/sec Historical Triangle Rate 2000 1750 GF100 1500 1250 1000 750 500 250 0 Jan-00 Jan-01 Jan-02 Jan-03 Jan-04 Jan-05 Jan-06 Jan-07 Jan-08 Jan-09 Jan-10 Chip Release 22
Triangles per clock Results Synthetic Performance Test 3 2.5 Z + per-vertex color Z-only 2 1.5 1 0.5 0 0 2 4 6 8 10 12 14 16 Triangle Size (pixels) 23
Results Island Demo Frame average of 1.94 triangles per clock 24
Results Unigine Heaven 2.34 triangles per clock in highly tessellated areas 25
Demo Time 26
Summary Tessellation Increases geometric realism without burdening the CPU Fermi GF100 is designed for tessellation Eliminates serialization points Dramatically improved primitive throughput Next-generation visual fidelity 27
Acknowledgements Steve Molnar Ziyad Hakura Andreas Dietrich (demo help) Entire Fermi GF100 team 28
Questions? 29