A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization


1 A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization
Jordi Roca, Victor Moya, Carlos Gonzalez, Vicente Escandell, Albert Murciego, Agustin Fernandez (Computer Architecture Department, UPC); Roger Espasa (Intel)

2 Micropolygons & Tessellation: The future trend in interactive 3D rendering for improved Level-Of-Detail

3 I'm presenting today
An alternative GPU rasterization pipeline to efficiently process microtriangles.
Our approach processes several microtriangles in parallel using GPU shader threads: scalable throughput is guaranteed in next GPU generations.

4 Outline
The Rasterization of Microtriangles.
Parallel Rasterization in GPU Shaders.
Problems & Solutions.
Performance Results.
Conclusion.

5 The triangle rasterizer job
[Figure: a triangle on an X/Y pixel grid with vertices V0 (1,0), V1 (13,4), V2 (2,7)]
Input: screen-projected vertex coordinates: {(1,0),(13,4),(2,7)}
Output: covered fragments: {(1,0),(1,1),(2,1),(3,1), ...}
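The input/output contract on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: samples are taken at pixel centres, a pixel counts as covered only when its centre is strictly inside the triangle (the talk's actual tie-break rule is not spelled out), and the function names `edge` and `rasterize` are made up for this example.

```python
def edge(ax, ay, bx, by, px, py):
    """Cross product (B - A) x (P - A); its sign says on which side of AB the point P lies."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def rasterize(v0, v1, v2):
    """Return pixels whose centre is strictly inside the triangle, in scan order.

    Assumes integer vertex coordinates (as on the slide).
    """
    # Work in half-pixel units so that pixel centres (x + 0.5, y + 0.5) stay integers.
    (x0, y0), (x1, y1), (x2, y2) = [(2 * x, 2 * y) for x, y in (v0, v1, v2)]
    xs = (v0[0], v1[0], v2[0])
    ys = (v0[1], v1[1], v2[1])
    covered = []
    # Visit every pixel of the bounding box and test it independently.
    for y in range(min(ys), max(ys) + 1):
        for x in range(min(xs), max(xs) + 1):
            px, py = 2 * x + 1, 2 * y + 1
            e0 = edge(x0, y0, x1, y1, px, py)
            e1 = edge(x1, y1, x2, y2, px, py)
            e2 = edge(x2, y2, x0, y0, px, py)
            # Inside when all three cross products share the same sign
            # (either sign, so both vertex orders are accepted).
            if (e0 > 0 and e1 > 0 and e2 > 0) or (e0 < 0 and e1 < 0 and e2 < 0):
                covered.append((x, y))
    return covered

fragments = rasterize((1, 0), (13, 4), (2, 7))
print(fragments[:4])  # [(1, 0), (1, 1), (2, 1), (3, 1)]
```

Under these assumptions the first four fragments agree with the list on the slide. Pixel (2,0) has its centre exactly on edge V0-V1 and is dropped by the strict test; a real rasterizer needs a tie-break rule for such cases.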


6 Three Triangle Rasterization Approaches
Scan Lines: Pomegranate '00; software renderers. Hard to parallelize.
Edge Equations (setup + traversal), E(x,y): ax + by + c = 0. Tile-scan traversal: NVIDIA GPUs, '96 to today. Recursive traversal: Intel Larrabee (software-based, SIMD-16).
X-Products (cross products): Fatahalian K., '09 and THIS WORK. More efficient for very small triangles. Independent per-pixel computation.
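To make the difference between the last two approaches concrete, the sketch below computes the same edge value in both ways: once through precomputed coefficients (a, b, c), and once directly as a cross product from the vertex coordinates. The function names are illustrative only and the coefficient convention is one common choice, not necessarily the one used by the authors.

```python
def setup_edge(ax, ay, bx, by):
    """Triangle setup: precompute (a, b, c) so that E(x, y) = a*x + b*y + c."""
    a = ay - by
    b = bx - ax
    c = ax * by - ay * bx
    return a, b, c

def eval_edge(coeffs, x, y):
    """Traversal: evaluate the precomputed edge equation at one point."""
    a, b, c = coeffs
    return a * x + b * y + c

def cross_edge(ax, ay, bx, by, x, y):
    """X-product: no setup, the value is computed per pixel from the vertices."""
    return (bx - ax) * (y - ay) - (by - ay) * (x - ax)

# Edge V0 -> V1 of the example triangle: (1,0) -> (13,4).
coeffs = setup_edge(1, 0, 13, 4)
print(coeffs)  # (-4, 12, 4)

# Both formulations agree at every point of a small grid.
same = all(
    eval_edge(coeffs, x, y) == cross_edge(1, 0, 13, 4, x, y)
    for y in range(8) for x in range(14)
)
print(same)  # True
```

The setup step runs once per triangle and pays off when many pixels reuse the coefficients; the cross-product form skips setup and keeps each pixel independent, which is the property the X-products approach relies on for very small triangles.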

7 Setup equations or X-products?
[Chart: rasterizer efficiency in floating-point operations per pixel (lower is better) versus triangle size, for Edge Equations and X-products]
The high cost of triangle setup is not amortized for 2-pixel triangles.
Cross products are more efficient for very small triangles.

8 The GPU's bottleneck in microtriangles: a single Setup unit!
[Chart: utilization (0% to 100%) of the different GPU units (rstz/setup, memory, color, Z/stencil, shader) over time (Kcycles) while rendering a stream of 1-pixel microtriangles]
Typical 2009 GPU rates: 1 triangle : 32 pixels per clock. But the microtriangle ratio is 1 triangle : 1 pixel.
The single Setup unit starves the pixel pipeline (Shader/ZStencil/Color).
Need more microtriangle throughput. Can shader units help?

9 How could we increase the throughput for microtriangles?
Option 1: Replicate the Triangle Setup unit N times. Increases area. Does not scale to very large numbers of microtriangles.
Option 2: Use the shader units to render microtriangles (THIS WORK). No area cost. Large triangles still use the existing triangle setup unit. Scales in the future, as microtriangles become more frequent and future GPUs offer more shader cores.

10 Proposed Microtriangle pipeline
Selectable by the API user.
Standard DX10 Pipeline (for normal triangles): Input Assembler (Vertex Buffer, Index Buffer) -> Vertex Shader (Texture) -> Geometry Shader (Texture) -> Stream Output -> Rasterizer/Interpolator -> Pixel Shader (Texture) -> Output Merger (Depth/Stencil, Render Target).
Microtriangle Pipeline: Input Assembler (Vertex Buffer, Index Buffer) -> Vertex Shader (Texture) -> Geometry Shader (Texture) -> Stream Output -> Triangle Bound -> Rasterize & Shade Pixels (Rasterize, Interpolate, Pixel Shader; Texture) -> Output Merger (Depth/Stencil, Render Target).

11 Outline
The Rasterization of Microtriangles.
Parallel Rasterization in GPU Shaders.
Problems & Solutions.
Performance Results.
Conclusion.

12 Parallel Rasterization in GPU Shaders
1. Fill shader vector groups with fragments within the bounding boxes of n input microtriangles.
[Figure: shader vectors holding per-fragment Z, S and T values]
2. Run the rasterization program on multiple fragments, followed by the original API fragment shader.
[Figure: thread flow: Thread Entry, Rasterization, Z Interpolation, Attribute Interpolation, Original DirectX Fragment Shader, Thread Exit]
3. Reorder shaded fragments and do the Z test.
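Step 1 can be pictured with a small packing sketch. Everything here is a simplifying assumption made for illustration: the vector width of 4, the inclusive `(xmin, ymin, xmax, ymax)` box layout, the made-up boxes, and the names `candidate_fragments` and `pack_vectors` do not come from the talk.

```python
VECTOR_WIDTH = 4  # hypothetical SIMD width, chosen only to keep the example small

def candidate_fragments(boxes):
    """Yield (triangle_id, x, y) for every pixel inside each inclusive bounding box."""
    for tri_id, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        for y in range(ymin, ymax + 1):
            for x in range(xmin, xmax + 1):
                yield (tri_id, x, y)

def pack_vectors(boxes, width=VECTOR_WIDTH):
    """Group candidate fragments of consecutive triangles into fixed-width vectors."""
    frags = list(candidate_fragments(boxes))
    return [frags[i:i + width] for i in range(0, len(frags), width)]

# Three made-up microtriangle bounding boxes covering 2, 2 and 1 pixels.
boxes = [(0, 0, 1, 0), (2, 0, 2, 1), (3, 3, 3, 3)]
vectors = pack_vectors(boxes)
for v in vectors:
    print(v)
# [(0, 0, 0), (0, 1, 0), (1, 2, 0), (1, 2, 1)]
# [(2, 3, 3)]
```

Each lane of a vector would then run the coverage test for its own fragment, so one vector can mix fragments from several microtriangles. Lanes whose fragment turns out to lie outside its triangle do no useful shading work, which is why the tightness of the bounding boxes matters.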


13 The required features of our rasterization program:
Consistent rasterization (no cracks or repeated pixels): fixed-point arithmetic; tie-break rule for adjacent edges.
Full support of modern GPU aspects:
Z interpolation: Perspective, Orthogonal.
Attribute interpolation: Flat, Non-perspective correct, Perspective correct, Centroid.
Face culling: Front / Back / Front&Back.
MSAA: x2, x4, x6, x8; customizable patterns.
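The talk does not say which tie-break rule is used, so the sketch below substitutes a simple direction-based rule in the spirit of the common top-left convention: a pixel centre lying exactly on an edge belongs to the triangle only if that edge points upward, or is horizontal and points right. Because two adjacent triangles with the same winding traverse their shared edge in opposite directions, exactly one of them claims the pixel. Integer vertex coordinates and the names `owns_tie` and `covers` are assumptions of this example.

```python
def edge(a, b, p):
    """Cross product (b - a) x (p - a) on integer 2D points."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def owns_tie(a, b):
    """An edge owns points lying exactly on it only if it points up,
    or is horizontal and points right. The reversed edge never qualifies too."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    return dy > 0 or (dy == 0 and dx > 0)

def covers(tri, x, y):
    """True if pixel (x, y) belongs to the triangle (integer vertices assumed)."""
    v = [(2 * vx, 2 * vy) for vx, vy in tri]  # half-pixel units keep centres integral
    if edge(v[0], v[1], v[2]) < 0:            # force one consistent winding order
        v[1], v[2] = v[2], v[1]
    p = (2 * x + 1, 2 * y + 1)                # pixel centre
    for a, b in ((v[0], v[1]), (v[1], v[2]), (v[2], v[0])):
        e = edge(a, b, p)
        if e < 0 or (e == 0 and not owns_tie(a, b)):
            return False
    return True

# A 4x4 quad split along its diagonal; the centres of pixels (i, i) lie on the shared edge.
t1 = [(0, 0), (4, 0), (4, 4)]
t2 = [(0, 0), (4, 4), (0, 4)]
hits = {(x, y): covers(t1, x, y) + covers(t2, x, y)
        for y in range(4) for x in range(4)}
print(set(hits.values()))  # {1}
```

Every pixel of the quad is hit exactly once: no cracks and no repeated pixels, even for the four pixels whose centres sit on the shared diagonal.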

14 Outline
The Rasterization of Microtriangles.
Parallel Rasterization in GPU Shaders.
Problems & Solutions.
Performance Results.
Conclusion.


15 Shading of sparse vectors: Bounding Box optimization pre-pass
[Figure: pixel-accurate BB versus subpixel-accurate BB. The sides of the pixel-accurate BB can be shrunk: the gap shows that those pixels will never be hit.]
Increases the density of microtriangle vectors by 20 to 45%.
Culls microtriangles that are entirely subpixel (55% culling ratio).
Simple hardware (four comparators, four adders) performs this optimization inside the Triangle Bound unit.
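A minimal sketch of the idea, assuming coordinates with 8 fractional bits and samples at pixel centres: the pixel-accurate box keeps every pixel the triangle's extents touch, while the subpixel-accurate box keeps only pixels whose centre falls inside the extents, and reports an empty box when no centre does. The function names and the exact rounding are assumptions of this example, not the Triangle Bound unit's actual logic.

```python
FRAC_BITS = 8            # assumed subpixel precision (1/256 of a pixel)
ONE = 1 << FRAC_BITS
HALF = ONE >> 1          # offset of the pixel centre

def pixel_bb(xs, ys):
    """Pixel-accurate box: every pixel touched by the fixed-point extents."""
    return (min(xs) >> FRAC_BITS, min(ys) >> FRAC_BITS,
            max(xs) >> FRAC_BITS, max(ys) >> FRAC_BITS)

def subpixel_bb(xs, ys):
    """Shrunk box: only pixels whose centre lies within the extents, or None if empty."""
    xmin = (min(xs) + HALF - 1) >> FRAC_BITS   # first centre at or right of the left side
    ymin = (min(ys) + HALF - 1) >> FRAC_BITS
    xmax = (max(xs) - HALF) >> FRAC_BITS       # last centre at or left of the right side
    ymax = (max(ys) - HALF) >> FRAC_BITS
    if xmin > xmax or ymin > ymax:
        return None                            # no sample point inside: cull the triangle
    return (xmin, ymin, xmax, ymax)

# Triangle spanning x in [0.75, 2.25] and y in [0.25, 1.25] pixels.
xs, ys = [192, 576, 192], [64, 64, 320]
print(pixel_bb(xs, ys))     # (0, 0, 2, 1) -> 6 candidate pixels
print(subpixel_bb(xs, ys))  # (1, 0, 1, 0) -> 1 candidate pixel

# Triangle spanning [1.625, 1.875] in both axes: it misses the centre at 1.5.
tiny = [416, 480, 416]
print(subpixel_bb(tiny, tiny))  # None
```

In the first case five of the six candidate fragments are discarded before they ever occupy a vector lane; in the second the whole microtriangle is culled.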

16 Avoid cracks or repeated pixels: Use of Fixed-Point arithmetic
[Figure: a lit mesh of adjacent microtriangles rendered with 32-bit floating point and with 24.8-bit fixed point]
The rasterization program must ensure that every pixel is hit by exactly one microtriangle in the mesh (no cracks, no repeats).
We extended the shader ISA with FXMUL and FXMAD fixed-point instructions, which provide consistent cross-product results across microtriangles.
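Why fixed point helps can be shown with a small model. The slide names FXMUL and FXMAD but does not define them, so the Python functions `fxmul` and `fxmad` below are hypothetical stand-ins (a plain integer multiply and multiply-add); their bit widths and rounding in the real ISA are not modelled. Coordinates are snapped to a 24.8 grid, after which the cross product is exact integer arithmetic.

```python
FRAC_BITS = 8

def to_fixed(v):
    """Snap a float coordinate to the nearest 1/256 of a pixel (24.8 style)."""
    return int(round(v * (1 << FRAC_BITS)))

def fxmul(a, b):
    """Hypothetical model of a fixed-point multiply: exact integer product."""
    return a * b

def fxmad(a, b, c):
    """Hypothetical model of a fixed-point multiply-add: exact a*b + c."""
    return a * b + c

def cross(a, b, p):
    """(b - a) x (p - a) on fixed-point points; the result carries 16 fractional bits."""
    t = fxmul(b[1] - a[1], p[0] - a[0])
    return fxmad(b[0] - a[0], p[1] - a[1], -t)

# An edge shared by two triangles, seen in opposite directions by each of them.
a = (to_fixed(0.1), to_fixed(0.2))
b = (to_fixed(3.7), to_fixed(2.9))
p = (to_fixed(1.5), to_fixed(1.5))   # a pixel centre

print(to_fixed(1.5))                       # 384
print(cross(a, b, p) == -cross(b, a, p))   # True
```

With integers, the two triangles sharing the edge always compute exactly opposite values for the same sample, so they can never both claim it or both reject it (given a tie-break for the zero case). Floating-point rounding does not give that guarantee in general.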

17 Outline
The Rasterization of Microtriangles.
Parallel Rasterization in GPU Shaders.
Problems & Solutions.
Performance Results.
Conclusion.


18 Great microtriangle throughput scaling
Render times for 1/2-pixel and 1/8-pixel-size microtriangle meshes scale up 1.3X to 4X with 16 shader cores, with respect to the traditional GPU rasterizer unit.
The better scaling of the 1/8 size (blue) is due to the effectiveness of the Bounding Box optimization.

19 Outline
The Rasterization of Microtriangles.
Parallel Rasterization in GPU Shaders.
Problems & Solutions.
Performance Results.
Conclusion.

20 Conclusions
Near-term 3D rendering demands a microtriangle pipeline to efficiently process tessellated surfaces.
Current GPU rasterizers are not intended for microtriangles: they are designed for high pixel rates on triangles larger than ~10 pixels; their microtriangle throughput is too poor to feed the pixel pipeline; replication inefficiently increases area (bad scalability).
We propose to rasterize microtriangles in GPU shaders: the largest and most scalable resource in today's GPUs; using the more efficient X-products instead of edge setup; as an alternative pipeline selectable by the API user.
Problems and solutions: shading of sparse vectors is addressed by the Bounding Box optimization pre-pass; cracks and repeated pixels are avoided by using fixed-point operations.

21 Thank you! Q&A

22 BACKUP


23 Early Z Test
Enable Early Z Optimization.
[Figure: thread flow diagrams comparing the Late Z Test and Early Z Test paths. Stages shown: Thread Entry, Rasterization, Z Interpolation, Z test request, Sleep Thread, Z test result, Attribute Interpolation, Original DirectX Fragment Shader, Thread Exit]
