Fragment-Parallel Composite and Filter. Anjul Patney, Stanley Tzeng, and John D. Owens University of California, Davis


1 Fragment-Parallel Composite and Filter Anjul Patney, Stanley Tzeng, and John D. Owens University of California, Davis


2 Parallelism in Interactive Graphics. Well-expressed in hardware as well as APIs. Consistently growing in degree & expression: more and more cores on upcoming GPUs; from programmable shaders to pipelines. We should rethink algorithms to exploit this. This paper provides one example: parallelization of composite/filter stages.

3 A Feed-Forward Rendering Pipeline: Primitives -> Geometry Processing -> Rasterization -> Composite -> Filter -> Pixels

4 Composite & Filter. Input: unordered list of fragments. Output: pixel colors. Assumption: no fragments are discarded. [Diagram: sample locations within a pixel]

5 Basic Idea [Diagram: processors assigned pixel-parallel]

6 Basic Idea [Diagram: processors assigned fragment-parallel; labels: Irregularity, Insufficient parallelism]

7 Motivation. Most applications have low depth complexity: pixel-level parallelism is sufficient. We are interested in applications with very high depth complexity and high variation in depth complexity. Further: future platforms will demand more parallelism; high depth complexity can limit pixel-parallelism.

8 Motivation: Distribution of Depth Complexity [Plot: number of subpixels versus number of depth layers]


9 Related Work: Order-Independent Transparency (OIT). Depth-Peeling [Everitt 01]: one pass per transparent layer. Stencil-Routed A-buffer [Myers & Bavoil 07]: one pass per 8 depth layers (8 = maximum MSAA samples per pixel). Bucket Depth-Peeling [Liu et al. 09]: one pass per up to 32 layers (32 = maximum render targets).

10 Related Work: Order-Independent Transparency (OIT). OIT using Direct3D 11 [Gruen et al. 10]: use fragment linked lists; per-pixel sort and composite. Hair Self-Shadowing [Sintorn et al. 09]: each fragment computes its contribution; assumes constant opacity.

11 Related Work: Programmable Rendering Pipelines. RenderAnts [Zhou et al. 09]: sort fragments globally; per-pixel composite/filter. FreePipe [Liu et al. 10]: sort fragments globally; per-pixel composite/filter.

12 Pixel-Parallel Formulation [Diagram: pixels P_i to P_(i+2) and subsamples S_j to S_(j+6), labeled with thread IDs. P: Pixel, S: Subsample]

13 Fragment-Parallel Formulation [Diagram: thread IDs j to j+23 laid over pixels P_i to P_(i+2) and subsamples S_j to S_(j+6). P: Pixel, S: Subsample]

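14 Fragment-Parallel Formulation. How can this behavior be achieved? Revisit the composite equation for fragment 1, fragment 2, ..., background:

$$C_s = \alpha_1 C_1 + (1-\alpha_1)\{\alpha_2 C_2 + (1-\alpha_2)(\cdots(\alpha_N C_N + (1-\alpha_N) C_B)\cdots)\}$$

$$C_s = 1\cdot\alpha_1 C_1 + (1-\alpha_1)\,\alpha_2 C_2 + (1-\alpha_1)(1-\alpha_2)\,\alpha_3 C_3 + \cdots + (1-\alpha_1)(1-\alpha_2)\cdots(1-\alpha_{k-1})\,\alpha_k C_k + \cdots + (1-\alpha_1)(1-\alpha_2)\cdots(1-\alpha_N)\,C_B$$

In the k-th term, the factor $\alpha_k C_k$ is the local contribution $L_k$ and the leading product of $(1-\alpha)$ factors is the global contribution $G_k$.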
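The equivalence of the nested and expanded forms of the composite equation can be checked numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, colors are reduced to a single scalar channel, and fragments are assumed to be already sorted front to back.

```python
def composite_nested(fragments, background):
    """Evaluate the nested form back to front.

    fragments: list of (alpha, color) pairs sorted front to back
    (an assumption of this sketch); colors are single-channel floats.
    """
    result = background
    for alpha, color in reversed(fragments):
        result = alpha * color + (1.0 - alpha) * result
    return result


def composite_expanded(fragments, background):
    """Evaluate the expanded sum front to back."""
    transmittance = 1.0  # product of (1 - alpha) over all closer fragments
    total = 0.0
    for alpha, color in fragments:
        total += transmittance * alpha * color  # global factor times local factor
        transmittance *= 1.0 - alpha
    return total + transmittance * background  # background term


# Hypothetical sample data: three fragments over a background of 0.2.
fragments = [(0.5, 1.0), (0.25, 0.0), (1.0, 0.5)]
background = 0.2
nested = composite_nested(fragments, background)
expanded = composite_expanded(fragments, background)
```

Both functions return the same color for this input; only the expanded form separates each term into a factor that depends on one fragment and a factor that depends on the fragments in front of it.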

15 Fragment-Parallel Formulation.

$$C_s = G_1 L_1 + G_2 L_2 + G_3 L_3 + \cdots + G_N L_N$$

$$G_k = (1-\alpha_1)(1-\alpha_2)\cdots(1-\alpha_{k-1}), \qquad L_k = \alpha_k C_k$$

$L_k$ is trivially parallel (local computation). $G_k$ is the result of a scan operation (product). For the list of input fragments: compute G[ ] and L[ ], multiply, and perform a reduction to add subpixel contributions.
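A minimal sequential sketch of this formulation for one subpixel follows. It is illustrative only: the function names are hypothetical, colors are single-channel floats, the fragments are assumed sorted front to back, and the background is assumed to be supplied as a final fragment with opacity 1 so that the sum stops at N. The standard-library `itertools.accumulate` stands in for a parallel scan primitive.

```python
from itertools import accumulate
import operator


def local_contributions(alphas, colors):
    # L_k = alpha_k * C_k: each entry depends only on its own fragment.
    return [a * c for a, c in zip(alphas, colors)]


def global_contributions(alphas):
    # G_k = product of (1 - alpha_j) for j < k, i.e. an exclusive product scan.
    # accumulate(..., initial=1.0) yields [1, t1, t1*t2, ...]; dropping the
    # last element keeps the scan exclusive. Requires Python 3.8 or later.
    transmit = [1.0 - a for a in alphas]
    return list(accumulate(transmit, operator.mul, initial=1.0))[:-1]


def composite(alphas, colors):
    # Multiply G[] and L[] elementwise, then reduce (sum).
    g = global_contributions(alphas)
    l = local_contributions(alphas, colors)
    return sum(gk * lk for gk, lk in zip(g, l))


# Hypothetical sample data; the last fragment plays the role of the background.
alphas = [0.5, 0.25, 1.0]
colors = [1.0, 0.0, 0.5]
subpixel_color = composite(alphas, colors)
```

On a GPU the list comprehension, the scan, and the sum would each map to a data-parallel primitive; the sequential versions here only show what each primitive computes.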

16 Fragment-Parallel Formulation. Filter, for every pixel:

$$C_p = C_{s_1}\kappa_1 + C_{s_2}\kappa_2 + \cdots + C_{s_M}\kappa_M$$

This can be expressed as another reduction after multiplying with subpixel weights $\kappa_m$, and it can be merged with the previous reduction.
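To see why the two reductions can be merged, substitute the per-subpixel composite sum into the filter sum. The double-index notation below is an assumption introduced for illustration: $G_{m,k}$ and $L_{m,k}$ denote the global and local contributions of the k-th fragment in subpixel $s_m$, and $N_m$ is the number of fragments in that subpixel.

```latex
\begin{aligned}
C_p &= \sum_{m=1}^{M} \kappa_m \, C_{s_m}
     = \sum_{m=1}^{M} \kappa_m \sum_{k=1}^{N_m} G_{m,k} \, L_{m,k} \\
    &= \sum_{m=1}^{M} \sum_{k=1}^{N_m} G_{m,k} \, \bigl(\kappa_m L_{m,k}\bigr)
\end{aligned}
```

Because $\kappa_m$ is constant within a subpixel, it can be folded into each fragment's local term, after which a single sum over all fragments of the pixel yields $C_p$.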

17 Fragment-Parallel Composite & Filter: Final Algorithm
1. Two-key sort (subpixel ID, depth)
2. Segmented scan (obtain G_k)
3. Premultiply with weights (L_k, κ_m)
4. Segmented reduction
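The four steps can be sketched sequentially as follows. This is a hedged illustration rather than the CUDA implementation: the function name, the fragment tuple layout, and the ID scheme (subpixel ID = pixel ID times subpixels per pixel, plus sample index) are assumptions, smaller depth is taken to mean closer to the viewer, colors are single-channel floats, and each pixel is filtered over its own subpixels only. Plain Python loops stand in for the parallel sort, segmented scan, and segmented reduction.

```python
def composite_and_filter(fragments, weights, subpixels_per_pixel):
    """fragments: list of (subpixel_id, depth, alpha, color) in any order.
    weights: filter weight kappa for each sample index within a pixel.
    Returns a dict mapping pixel ID to filtered color.
    """
    # 1. Two-key sort by (subpixel ID, depth).
    ordered = sorted(fragments, key=lambda f: (f[0], f[1]))

    # 2. Segmented exclusive product scan: G_k restarts at 1 for each subpixel.
    g = []
    running = 1.0
    previous = None
    for sid, _depth, alpha, _color in ordered:
        if sid != previous:
            running = 1.0
            previous = sid
        g.append(running)
        running *= 1.0 - alpha

    # 3. Premultiply: G_k * L_k * kappa_m, with L_k = alpha_k * C_k.
    terms = []
    for gk, (sid, _depth, alpha, color) in zip(g, ordered):
        kappa = weights[sid % subpixels_per_pixel]
        terms.append((sid // subpixels_per_pixel, gk * alpha * color * kappa))

    # 4. Segmented reduction: sum the terms belonging to each pixel.
    pixels = {}
    for pid, term in terms:
        pixels[pid] = pixels.get(pid, 0.0) + term
    return pixels


# Hypothetical input: 2 pixels, 2 subpixels each, box filter weights.
sample_fragments = [
    (0, 2.0, 1.0, 0.5), (0, 1.0, 0.5, 1.0), (1, 1.0, 1.0, 0.25),
    (2, 1.0, 1.0, 1.0), (3, 5.0, 1.0, 0.0), (3, 1.0, 0.5, 1.0),
]
result = composite_and_filter(sample_fragments, [0.5, 0.5], 2)
```

Segmenting the scan by subpixel but the reduction by pixel is what lets the filter be folded into the same pass as the composite sum.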

18 Fragment-Parallel Formulation [Diagram: fragments of pixels P_i to P_(i+2) pass through a segmented scan (product), then a segmented reduction (sum). P: Pixel, S: Subsample]

19 Implementation. Hardware used: NVIDIA GeForce GTX 280. We require fast segmented scan and reduce: the CUDPP library provides that, which restricts the implementation to NVIDIA CUDA. No direct access to the hardware rasterizer: we wrote our own.

20 Example System: Polygons. Applications: games. Depth complexity: 1 to a few tens of layers; suited to pixel-parallel. Fragment-parallel software rasterizer.

21 Example System: Particles. Applications: simulations, games. Depth complexity: hundreds of layers; high depth variance. Particle-parallel sprite rasterizer.

22 Example System: Volumes. Applications: scientific visualization. Depth complexity: tens to hundreds of layers; low depth variance. Major-axis-slice rasterizer.

23 Example System: Reyes. Applications: offline rendering. Depth complexity: tens of layers; moderate depth variance. Data-parallel micropolygon rasterizer.

24 Performance Results [Chart: rendering time (ms) for the Particles, Volume, Reyes (grass), and Polygon scenes; series: Fragment Generation, Pixel-Parallel Composite/Filter, Fragment-Parallel Composite/Filter]

25 Performance Variation [Plot: fragments per second versus depth complexity; series: Fragment-Parallel, Pixel-Parallel]

26 Limitations. Increased memory traffic: several passes through CUDPP primitives. Unclear how to optimize for special cases: threshold opacity, threshold depth complexity.

27 Summary and Conclusion. Parallel formulation of the composite equation: maps well to known primitives; can be integrated with filter; consistent performance across varying workloads. FPC is applicable to future rendering pipelines: exploits a higher degree of parallelism; better related to the size of the rendering workload; a tool for building programmable pipelines.

28 Future Work. Performance: reduction in memory traffic; extension to special-case scenes; hybrid PPC-FPC formulations. Applications: integration with hardware rasterizer; cinematic rendering, Photoshop.

29 Acknowledgments. NSF Award; SciDAC Institute for Ultrascale Visualization; NVIDIA Research Fellowship; equipment donated by NVIDIA. Discussions and feedback: Shubho Sengupta (UC Davis), Matt Pharr (Intel), Aaron Lefohn (Intel), Mike Houston (AMD), anonymous reviewers. Implementation assistance: Jeff Stuart, Shubho Sengupta.

30 Thanks!
