Interactive Ray Tracing: Higher Memory Coherence

Dinesh Manocha (UNC Chapel Hill), Sung-Eui Yoon (Lawrence Livermore Labs). http://gamma.cs.unc.edu/rt

Interactive Ray Tracing. Ray tracing is naturally sub-linear in scene size. Ray tracing naturally supports good shading. Ray tracing maps well to multi-core architectures. Moore's Law is a natural boon for ray tracing: the 2015 prediction is 2048^2 resolution with 16 samples per pixel [Shirley 2006]. But...

Low Growth Rate of Memory Bandwidth. [Chart: growth rate during 1993-2005 of disk access speed, RAM access speed, and CPU speed.] Processor speed improvements are not sufficient. Courtesy: http://www.hcibook.com/e3/online/moores-law/

Applications need to have high memory coherence Memory hierarchies

One Driving Application: Massive Models. Model: geometric representation of an object. Many sources: scientific simulation, scanned objects, CAD.

Massive Models: Memory Overhead. Size: tens or hundreds of millions of triangles (previous slide: 100M, 372M, 82M); that's 13GB of raw data alone! Datasets with billions of polygons are becoming available. Naïve rendering is not fast enough. Still want to display in real time.
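
The 13GB figure can be checked with a back-of-the-envelope calculation. The sketch below assumes an unindexed layout (three vertices per triangle, three 4-byte floats per vertex) and assumes the figure refers to the 372M-triangle model; neither assumption is stated on the slide.

```python
# Rough size of raw triangle data. The layout constants are assumptions,
# not taken from the slides.
BYTES_PER_FLOAT = 4
FLOATS_PER_VERTEX = 3       # x, y, z only; no normals or colours
VERTICES_PER_TRIANGLE = 3   # unindexed triangle list


def raw_size_gb(num_triangles: int) -> float:
    """Raw geometry size in decimal gigabytes."""
    total_bytes = (num_triangles * VERTICES_PER_TRIANGLE
                   * FLOATS_PER_VERTEX * BYTES_PER_FLOAT)
    return total_bytes / 1e9


for millions in (82, 100, 372):
    size = raw_size_gb(millions * 1_000_000)
    print(f"{millions}M triangles: {size:.1f} GB")
```

Under these assumptions the 372M-triangle model comes to roughly 13.4 GB, which is in line with the number quoted on the slide.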

Rasterization Standard method for rendering Draw all triangles on a raster:

Rasterization. Advantage: uses graphics hardware / GPUs (fast, growing faster than Moore's Law); 1-2 orders of magnitude faster than ray tracing. Disadvantages: local illumination; performance roughly linear in the number of triangles.

Rasterization. Current GPUs can render 100-400M triangles per second. This assumes the triangles are in GPU memory. CPU-GPU bandwidth is a limitation. Real-time rasterization of massive models becomes a data management problem.

Rasterization: Acceleration. Use multi-resolution representations. Static LODs. View-dependent rendering. Visibility and occlusion culling. Out-of-core rendering. [Hundreds of papers] Develop an integrated solution!

Towards Scalable View-Dependent Rendering. View-dependent rendering. Uses dynamic simplification. New multi-resolution hierarchy (CHPM). Occlusion culling using BVHs. Out-of-core rendering. Improved layouts for high cache throughput. Integrate with low-error shadow maps [Lloyd et al. 2006]. [Yoon et al. 04, Yoon et al. 2005]

Video Demonstration Quick-VDR System

Interactive View-Dependent Shadow Generation Video

Ray Tracing Well studied for 25+ years 1-2 orders of magnitude slower than rasterization But: asymptotic performance ~ logarithmic Good choice for massive models?

Ray Tracing for Massive Models Logarithmic asymptotic behavior Very useful for dealing with massive models Mainly due to its hierarchical data structures BUT: Observed only in in-core datasets

Ray Tracing: Performance. Measured with 2GB main memory. [Plot: render time (log scale) and working set size versus model complexity (M tri, log scale); annotated with the 2GB memory limit and "Memory thrashing!".]

Low Growth Rate of Memory Bandwidth. [Chart repeated: growth rate during 1993-2005 of disk access speed, RAM access speed, and CPU speed.] Recent hardware improvements may not provide an efficient solution to our problem! Courtesy: http://www.hcibook.com/e3/online/moores-law/

Ray Coherence Techniques. Assume coherence between rays. Works well with CAD or architectural models. Primary rays and some secondary rays. Highly-tessellated models: not much coherence between rays. [Figure labels: viewpoint, image plane, small triangles, rays for each pixel.]

Issues Design appropriate hierarchical representations: Should avoid access to lower levels in the tree Access should be coherent

Incoherent Memory Accesses. Model with 370M triangles (scan of Michelangelo's St. Matthew). Assuming 512x512 resolution: hundreds of triangles per pixel; fewer than 1% of triangles visible. Each triangle is likely in a different area of memory.
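
The "fewer than 1% visible" bound follows from counting primary rays: each ray returns at most one nearest triangle, so the number of distinct visible triangles cannot exceed the number of rays. A minimal sketch of that arithmetic, with a hypothetical helper name and one sample per pixel assumed:

```python
def visible_fraction(num_triangles: int, width: int, height: int,
                     samples_per_pixel: int = 1) -> float:
    """Upper bound on the fraction of triangles that primary rays can hit.

    Each primary ray reports at most one (nearest) triangle, so at most
    width * height * samples_per_pixel distinct triangles are visible.
    """
    primary_rays = width * height * samples_per_pixel
    return min(1.0, primary_rays / num_triangles)


fraction = visible_fraction(370_000_000, 512, 512)
print(f"at most {fraction:.4%} of the triangles can be visible")
```

The remaining triangles are still spread across the whole dataset, which is why the few that are touched tend to land in unrelated areas of memory.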

Our approach. Add levels-of-detail to ray tracing. Main benefit: improved memory coherence. LOD: simplified versions of geometry, selected according to an LOD metric. Use ideas from the rasterization literature. Rasterization: selection per object. Ray tracing: selection per ray. [Yoon et al. 2006]

LOD-based Ray Tracing: Issues. Compact and simple to compute: LOD can be considered for each node and ray. Drastic simplification: a factor-of-two simplification gives only a one-level reduction in tree traversal. High-quality and interactive rendering: error should be controllable.
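
The "factor of two gives one level" remark follows from the logarithmic depth of a balanced hierarchy. The sketch below assumes an idealised balanced binary tree with one primitive per leaf; real kd-trees are not perfectly balanced, so the numbers are indicative only.

```python
import math


def balanced_depth(num_primitives: int) -> float:
    """Depth of an idealised balanced binary tree with one primitive per leaf."""
    return math.log2(num_primitives)


def levels_saved(simplification_factor: float) -> float:
    """Tree levels removed when the primitive count shrinks by this factor."""
    return math.log2(simplification_factor)


full = balanced_depth(370_000_000)
half = balanced_depth(185_000_000)
print(f"depth {full:.1f} -> {half:.1f} after halving the model")
print(f"a 1024x simplification saves {levels_saved(1024):.0f} levels")
```

Saving a meaningful number of traversal steps therefore needs simplification by orders of magnitude, which is what "drastic" refers to.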

Our approach R-LODs Highly integrated with kd-tree [Wald et al. 05] Can also be integrated with BVHs Simple but fast LOD metric Works with shadows, reflections Integrates ray and cache coherences

Outline LOD-based ray tracing Results

Ray Tracing: Performance. Measured with 2GB main memory. [Plot: render time (log scale) and working set size versus model complexity (M tri, log scale); annotated with the 2GB memory limit and "Memory thrashing!".]

Ray Tracing: Performance. Achieved up to three orders of magnitude speedup! [Plot: render time (log scale) and working set size versus model complexity (M tri, log scale).]

Real-time Captured Video St. Matthew Model 512 by 512 and 2x2 super-sampling, 4 pixels-of-error

Related Work Interactive ray tracing LOD and out-of-core techniques LOD-based ray tracing

Interactive Ray Tracing Ray coherences [Heckbert and Hanrahan 84, Wald et al. 01, Reshetov et al. 05] Parallel computing [Parker et al. 99, DeMarle et al. 04, Dietrich et al. 05] Hardware acceleration [Purcell et al. 02, Schmittler et al. 04, Woop et al. 05] Large dataset [Pharr et al. 97, Wald et al. 04]

LOD and Out-of-Core. Widely researched techniques [Luebke et al. 02, Chiang et al. 03]. LOD methods combined with out-of-core techniques: point clouds [Rusinkiewicz and Levoy 00]; regular meshes [Hwa et al. 04, Losasso and Hoppe 04]; general meshes [Lindstrom 03, Cignoni et al. 04, Yoon et al. 04, Gobbetti and Marton 05].

LOD Methods for Rasterization. LOD selection difference: LOD selection per object (rasterization) versus LOD selection per ray (ray tracing). (Culling or LOD) hierarchy difference: coarse-grained hierarchy for rasterization versus fine-grained hierarchy for ray tracing. Not clear whether LOD techniques for rasterization are applicable to ray tracing.

LOD-based Ray Tracing. Ray differentials [Igehy 99]. Subdivision meshes [Christensen et al. 03, Stoll et al. 06]. Point clouds [Wand and Straßer 03]. [Figure labels: viewpoint, image plane, ray beam for one pixel, footprint size of ray.]

Outline LOD-based ray tracing R-LOD representation LOD selection LOD and layout computations Results

R-LOD Representation. Tightly integrated with kd-nodes. Consists of a plane, material attributes, and surface deviation. [Figure labels: rays, kd-node, normal, plane, valid extent of the plane, intersection, no intersection.]

LOD-based Runtime Traversal. Modification of efficient kd-tree traversal [Wald 04]. Traverse and evaluate the metric at each node. If the metric is satisfied, intersect with the plane instead: if the ray hits, we're done; if not, go back up and try the other subtree. In either case, there is no need to go deeper!
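
The traversal rule can be sketched in a few lines. This is an illustrative simplification, not the authors' implementation: nodes are bounded by spheres instead of kd-tree cells, leaf geometry is a planar disk standing in for triangles, children are visited without front-to-back ordering, and the names `Patch`, `Node`, `hit_patch`, `trace` and the constant `c` are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec = Tuple[float, float, float]


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


@dataclass
class Patch:
    """A plane with a limited valid extent (a disk of radius `extent`)."""
    point: Vec
    normal: Vec   # unit length
    extent: float


@dataclass
class Node:
    center: Vec                        # bounding-sphere centre (stand-in for a kd cell)
    radius: float                      # node size R used by the metric
    lod: Optional[Patch] = None        # R-LOD of this node, if it has one
    geometry: Optional[Patch] = None   # leaf geometry (stand-in for triangles)
    children: Tuple["Node", ...] = ()


def hit_patch(p: Patch, origin: Vec, direction: Vec) -> Optional[float]:
    """Ray parameter t of a hit inside the patch extent, or None."""
    denom = dot(p.normal, direction)
    if abs(denom) < 1e-12:
        return None                    # ray is parallel to the plane
    t = dot(p.normal, sub(p.point, origin)) / denom
    if t <= 0.0:
        return None                    # plane is behind the ray origin
    hit = (origin[0] + t * direction[0],
           origin[1] + t * direction[1],
           origin[2] + t * direction[2])
    off = sub(hit, p.point)
    return t if dot(off, off) <= p.extent ** 2 else None


def trace(node: Node, origin: Vec, direction: Vec, c: float,
          visited: List[Node]) -> Optional[float]:
    """Nearest hit distance below `node`, using R-LODs where the metric allows."""
    visited.append(node)
    to_center = sub(node.center, origin)
    d_min = max(math.sqrt(dot(to_center, to_center)) - node.radius, 0.0)
    if node.lod is not None and c * d_min > node.radius:
        # Metric satisfied: hit or miss, this subtree is never descended.
        return hit_patch(node.lod, origin, direction)
    if node.geometry is not None:
        return hit_patch(node.geometry, origin, direction)
    hits = [t for t in (trace(ch, origin, direction, c, visited)
                        for ch in node.children) if t is not None]
    return min(hits) if hits else None
```

With `c = 0` the metric never fires and every node on the ray's path is visited; with a larger `c` the traversal stops at the first node whose R-LOD is acceptable, which is where the reduction in working set comes from.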

Properties of R-LODs. Compact and efficient LOD representation: adds only 4 bytes to the 8-byte kd-node. Drastic simplification: useful for performance improvement.
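
To make the size claim concrete, the sketch below packs a node with Python's `struct` module. Only the sizes (8 bytes plus 4 bytes) come from the slide; the field meanings (a packed flags/child-offset word, a float split position, and an index into a separate R-LOD array) are assumptions for illustration.

```python
import struct

# Assumed field layout; the slide states only the byte counts.
KD_NODE = struct.Struct("<If")       # flags + child offset, split position
KD_NODE_LOD = struct.Struct("<IfI")  # same fields plus a 4-byte R-LOD index


def pack_node(flags_and_child: int, split: float, rlod_index: int) -> bytes:
    """Serialise one kd-node that carries an R-LOD reference."""
    return KD_NODE_LOD.pack(flags_and_child, split, rlod_index)


print(KD_NODE.size, KD_NODE_LOD.size)
print(KD_NODE_LOD.unpack(pack_node(7, 0.5, 42)))
```

Keeping the plane, material, and deviation data out of the node itself means traversal code that never uses an R-LOD still touches small, densely packed nodes.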

Properties of R-LODs. Error-controllable LOD rendering: error is measured in screen space in terms of pixels-of-error (PoE). Provides an interactive rendering framework.

Outline LOD-based ray tracing R-LOD representation LOD selection LOD and layout computations Results

Two Main Design Criteria for LOD Metric. Controllability of visual errors. Efficiency: the LOD metric may be evaluated at many nodes for every single ray, tens of millions of evaluations or more.

Visual Artifacts. Visibility difference. Illumination difference. Path difference for secondary rays. Surface deviation. Projected area. Curvature difference. [Figure labels: original mesh, LODs, view direction, ray with original mesh, ray with LODs, image plane.]

R-LOD Error Metric Consider two factors Projected screen-space area of a kd-node Surface deviation

Conservative Projection Method. Measures the screen-space area affected by using an R-LOD. LOD metric: C(B) d_min > R. [Figure labels: viewpoint, image plane, one ray beam, PoE error bound, kd-node B, d_min, R.]
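
The metric compares the node size R with the width of the ray beam at the node's minimum distance d_min. The slide does not define the constant, so the sketch below assumes one plausible form: the width of PoE pixels at unit distance for a pinhole camera. The function names, the field of view, and the omission of the surface-deviation term are all assumptions.

```python
import math


def beam_constant(poe: float, fov_y_degrees: float, image_height: int) -> float:
    """Width covered by `poe` pixels at unit distance (assumed form of the constant)."""
    pixel = 2.0 * math.tan(math.radians(fov_y_degrees) / 2.0) / image_height
    return poe * pixel


def use_rlod(d_min: float, radius: float, poe: float,
             fov_y_degrees: float = 60.0, image_height: int = 512) -> bool:
    """True when the node is small enough on screen to be replaced by its R-LOD."""
    return beam_constant(poe, fov_y_degrees, image_height) * d_min > radius


print(use_rlod(d_min=100.0, radius=0.5, poe=4.0))  # distant node
print(use_rlod(d_min=10.0, radius=0.5, poe=4.0))   # nearby node
```

Because the test is a multiply and a compare, it is cheap enough to run at every node for every ray, and setting PoE to 0 disables LODs entirely.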

R-LODs with Different PoE Values. [Images at PoE: original, 1.85, 5, 10 (512x512, no anti-aliasing).]

R-LODs with Different PoE Values. [Images at PoE: original, 40, 80 (512x512 image resolution).]

LOD Metric for Secondary Rays Applicable to any linear transformation Shadow Planar reflection Not applicable to non-linear transformation Refraction and non-planar reflection Uses more general, but expensive ray differentials [Igehy 99]

C^0 Discontinuity between R-LODs. Possible solutions: posing dependencies [Lindstrom 03, Hwa et al. 04, Yoon et al. 04, Cignoni et al. 05]; implicit surfaces [Wald and Seidel 05]. [Figure label: ray.]

Expansion of R-LODs. Expand the extent of the plane. Inspired by hole-free point cloud rendering [Kalaiah and Varshney 03]. The expansion is a function of the surface deviation (20% of the surface deviation). [Figure label: ray.]

Impact of Expansions of R-LODs. [Images: original model, before expansion (with a hole), after expansion; PoE = 5 at 512 by 512.]

Outline LOD-based ray tracing R-LOD representation LOD selection LOD and layout computations Results

R-LOD Construction Principal component analysis (PCA) Compute the covariance matrix for the plane of R-LODs Normal (= Eigenvector) Hierarchical PCA computation Has linear time complexity Accesses the original data only one time with virtually no memory overhead
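
One way to obtain a linear-time, single-pass hierarchical PCA is to store running moments per node and merge the children's moments at each parent. The sketch below illustrates that idea under stated assumptions: it works on vertex positions only, uses plain power iteration rather than a library eigen-solver, and the function names are hypothetical rather than taken from the authors' code.

```python
def moments(points):
    """Return (count, coordinate sums, sums of outer products) for a point list."""
    n = len(points)
    s = [sum(p[i] for p in points) for i in range(3)]
    q = [[sum(p[i] * p[j] for p in points) for j in range(3)] for i in range(3)]
    return n, s, q


def merge(a, b):
    """Combine the moments of two child nodes without revisiting their points."""
    n = a[0] + b[0]
    s = [a[1][i] + b[1][i] for i in range(3)]
    q = [[a[2][i][j] + b[2][i][j] for j in range(3)] for i in range(3)]
    return n, s, q


def covariance(m):
    """Mean and 3x3 covariance matrix recovered from accumulated moments."""
    n, s, q = m
    mean = [si / n for si in s]
    cov = [[q[i][j] / n - mean[i] * mean[j] for j in range(3)] for i in range(3)]
    return mean, cov


def plane_normal(cov, iterations=200):
    """Eigenvector of the smallest eigenvalue, via power iteration on trace*I - cov.

    Assumes the points are not all identical (non-zero trace).
    """
    trace = cov[0][0] + cov[1][1] + cov[2][2]
    shifted = [[(trace if i == j else 0.0) - cov[i][j] for j in range(3)]
               for i in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        w = [sum(shifted[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


left = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
right = [(1.0, 1.0, 0.0), (2.0, 1.0, 0.0), (1.0, 2.0, 0.0)]
mean, cov = covariance(merge(moments(left), moments(right)))
print(mean, plane_normal(cov))
```

Each parent needs only a constant amount of work on top of its children, so the original vertices are read once at the leaves and the whole hierarchy is fitted in time linear in the input size.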

Ray Coherence. Using LODs improves the utilization of SIMD functionality. Maintains spatial coherence between rays. Keeps ray groups larger.

Cache Coherence. Cache misses can be a major bottleneck, especially for massive models. Use cache-oblivious layouts [Yoon and Manocha 06, Yoon et al. 05]: they work well with various caches (L1, L2, memory, disk) and do not require any code modification. 10%-60% improvement for the LOD-based ray tracer. 3X improvement for ray tracing, collision detection, GPU-based rendering, and iso-surface extraction.

Layout Computation. [Diagram: input graph with vertices va, vb, vc, vd (weights); multilevel optimization using the cache-oblivious metric and local permutations; resulting 1D layout va, vb, vd, vc.] The University of North Carolina at Chapel Hill

OpenCCL: http://gamma.cs.unc.edu/openccl

Specialization to kd-trees and BVHs. What is the input graph? The hierarchy itself? Parent-child and spatial localities are implicitly considered given the input hierarchy. Weights indicate coherence levels between two nodes and are computed from geometric relationships.

Probability Function for Layout Computation. How likely is a node to be accessed? [Figure labels: bounding box of a node; bounding box of a second object; point, sphere, rectangular, ray beam.] Equivalent to the surface area heuristic [MacDonald and Booth 90, Havran 00].
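
The surface area heuristic rests on a geometric-probability fact: for rays distributed uniformly, the chance that a ray hitting a convex parent volume also hits a convex child inside it equals the ratio of their surface areas. A minimal sketch for axis-aligned boxes, with hypothetical function names:

```python
def surface_area(bmin, bmax):
    """Surface area of an axis-aligned box given its min and max corners."""
    dx, dy, dz = (bmax[i] - bmin[i] for i in range(3))
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def access_probability(child_box, parent_box):
    """Chance that a uniformly random ray hitting the parent also hits the child."""
    return surface_area(*child_box) / surface_area(*parent_box)


parent = ((0.0, 0.0, 0.0), (2.0, 1.0, 1.0))
child = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
print(access_probability(child, parent))
```

A layout algorithm can use such probabilities as edge weights, so that nodes likely to be accessed together end up close to each other in memory.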

Layout Algorithms Recursively divide and layout between sub-trees (multi-scale approach) Based on the probability function Works well with various cache block sizes [Yoon and Lindstrom 06]
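
The probability-driven algorithm itself is not reproduced here. As a simpler stand-in that shows the same recursive divide-and-lay-out idea, the sketch below computes the classic van Emde Boas ordering of a complete binary tree, a well-known cache-oblivious layout. It is a different and much simpler technique than the one on the slide: it ignores access probabilities and assumes a complete tree addressed by heap indices (root = 1).

```python
def veb_layout(root: int, height: int) -> list:
    """Heap indices of a complete binary subtree in van Emde Boas order.

    The subtree is cut at half its height; the top part is laid out first,
    followed by each bottom subtree, all recursively.
    """
    if height == 1:
        return [root]
    top_h = height // 2
    bottom_h = height - top_h
    order = veb_layout(root, top_h)
    # Leaves of the top part sit top_h - 1 levels below `root`.
    first_leaf = root << (top_h - 1)
    for leaf in range(first_leaf, first_leaf + (1 << (top_h - 1))):
        order += veb_layout(2 * leaf, bottom_h)       # left child subtree
        order += veb_layout(2 * leaf + 1, bottom_h)   # right child subtree
    return order


print(veb_layout(1, 4))
```

Because every recursion level keeps a subtree contiguous, some level of the recursion fits whatever block size the cache uses, which is the property that lets one layout serve L1, L2, main memory, and disk alike.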

Outline R-LODs for ray tracing Results

Implementation Uses common optimized kd-tree construction methods Based on surface-area heuristics [MacDonald and Booth 90, Havran 00] Out-of-core computation Decompose an input model into a set of clusters [Yoon et al. 04]

Preprocessing. Construction speed: very fast due to its linear complexity (3M triangles per minute). Memory overhead: requires 33% more storage than the optimized kd-tree representation [Wald 04]. Runtime overhead: 5% compared to the non-LOD version of an efficient ray tracer.

Impacts of R-LODs. [Plots comparing PoE = 0 (no LOD) with PoE = 2.5: number of intersected nodes per ray, render time, working set size; annotated with a 10X speedup.]

Real-time Captured Video St. Matthew Model 512 x 512, 2 x 2 anti-aliasing, PoE = 4

Image Quality Comparison. Forest model (32M triangles). PoE = 0 (no LOD) versus PoE = 4 with a cache-oblivious layout of the kd-tree: 4X speedup. [Figure label: shading difference.]

Results CAD model 2 fps 2 times speedup Double Eagle tanker, 82M triangles

Pros and Cons. Limitations: does not handle advanced materials (BRDF); our metric works well only with linear transformations; no guarantee that there are no holes. Advantages: simplicity, interactivity, efficiency.

Ongoing and Future Work Investigate an efficient use of implicit surfaces Allow approximate visibility Extend to global illumination Design an efficient layout algorithm for deforming models

Conclusions Massive model rendering limited by memory access and bus bandwidth It is becoming a data and memory management problem LOD-based ray tracing Main improvement due to working set size reduction 20-1000% speedups Integrate cache and ray coherence techniques

UCRL-PRES-223086. Part of this work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48.