Ray Tracing Performance

Similar documents
Computer Graphics. - Ray-Tracing II - Hendrik Lensch. Computer Graphics WS07/08 Ray Tracing II

Computer Graphics. - Spatial Index Structures - Philipp Slusallek

SPATIAL DATA STRUCTURES. Jon McCaffrey CIS 565

Real Time Ray Tracing

Parallel Physically Based Path-tracing and Shading Part 3 of 2. CIS565 Fall 2012 University of Pennsylvania by Yining Karl Li

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016

Accelerating Ray-Tracing

EDAN30 Photorealistic Computer Graphics. Seminar 2, Bounding Volume Hierarchy. Magnus Andersson, PhD student

Acceleration. Digital Image Synthesis Yung-Yu Chuang 10/11/2007. with slides by Mario Costa Sousa, Gordon Stoll and Pat Hanrahan

Acceleration Data Structures

Accelerating Geometric Queries. Computer Graphics CMU /15-662, Fall 2016

INFOGR Computer Graphics. J. Bikker - April-July Lecture 11: Acceleration. Welcome!

INFOMAGR Advanced Graphics. Jacco Bikker - February April Welcome!

Lecture 4 - Real-time Ray Tracing

Accelerated Raytracing

INFOMAGR Advanced Graphics. Jacco Bikker - February April Welcome!

Intersection Acceleration

Topics. Ray Tracing II. Intersecting transformed objects. Transforming objects

Improving Memory Space Efficiency of Kd-tree for Real-time Ray Tracing Byeongjun Choi, Byungjoon Chang, Insung Ihm

B-KD Trees for Hardware Accelerated Ray Tracing of Dynamic Scenes

Acceleration Data Structures. Michael Doggett Department of Computer Science Lund university

Ray Tracing. Cornell CS4620/5620 Fall 2012 Lecture Kavita Bala 1 (with previous instructors James/Marschner)

Lecture 11 - GPU Ray Tracing (1)

Topics. Ray Tracing II. Transforming objects. Intersecting transformed objects

Accelerating Shadow Rays Using Volumetric Occluders and Modified kd-tree Traversal

Announcements. Written Assignment2 is out, due March 8 Graded Programming Assignment2 next Tuesday

Ray Tracing Acceleration. CS 4620 Lecture 20

Spatial Data Structures

CPSC / Sonny Chan - University of Calgary. Collision Detection II

Fast BVH Construction on GPUs

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

In the previous presentation, Erik Sintorn presented methods for practically constructing a DAG structure from a voxel data set.

Specialized Acceleration Structures for Ray-Tracing. Warren Hunt

Lecture 2 - Acceleration Structures

Acceleration Data Structures for Ray Tracing

Spatial Data Structures. Steve Rotenberg CSE168: Rendering Algorithms UCSD, Spring 2017

Ray Intersection Acceleration

CS 4620 Program 4: Ray II

Spatial Data Structures

Ray Tracing with Multi-Core/Shared Memory Systems. Abe Stephens

Ray Tracing Acceleration Data Structures

CS 465 Program 5: Ray II

Ray Tracing Acceleration. CS 4620 Lecture 22

Shadows for Many Lights sounds like it might mean something, but In fact it can mean very different things, that require very different solutions.

Ray Tracing with Spatial Hierarchies. Jeff Mahovsky & Brian Wyvill CSC 305

Acceleration Structures. CS 6965 Fall 2011

Spatial Data Structures

SAH guided spatial split partitioning for fast BVH construction. Per Ganestam and Michael Doggett Lund University

Accelerating Ray Tracing

improving raytracing speed

CS 563 Advanced Topics in Computer Graphics QSplat. by Matt Maziarz

Scene Management. Video Game Technologies 11498: MSc in Computer Science and Engineering 11156: MSc in Game Design and Development

Row Tracing with Hierarchical Occlusion Maps

CS580: Ray Tracing. Sung-Eui Yoon ( 윤성의 ) Course URL:

The Traditional Graphics Pipeline

Holger Dammertz August 10, The Edge Volume Heuristic Robust Triangle Subdivision for Improved BVH Performance Holger Dammertz, Alexander Keller

Spatial Data Structures

The Traditional Graphics Pipeline

Spatial Data Structures and Speed-Up Techniques. Tomas Akenine-Möller Department of Computer Engineering Chalmers University of Technology

The Traditional Graphics Pipeline

Speeding Up Ray Tracing. Optimisations. Ray Tracing Acceleration

Computer Graphics. Bing-Yu Chen National Taiwan University

Spatial Data Structures

RTfact: Concepts for Generic and High Performance Ray Tracing

Parallel Programming on Larrabee. Tim Foley Intel Corp

Computer Graphics. Bing-Yu Chen National Taiwan University The University of Tokyo

Point Cloud Filtering using Ray Casting by Eric Jensen 2012 The Basic Methodology

Exploiting Local Orientation Similarity for Efficient Ray Traversal of Hair and Fur

Bounding Volume Hierarchies. CS 6965 Fall 2011

Lecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013

Real-Time Graphics Architecture. Kurt Akeley Pat Hanrahan. Ray Tracing.

Real-Time Voxelization for Global Illumination

Acceleration Structure for Animated Scenes. Copyright 2010 by Yong Cao

Rasterization and Graphics Hardware. Not just about fancy 3D! Rendering/Rasterization. The simplest case: Points. When do we care?

Interactive Ray Tracing: Higher Memory Coherence

Dynamic Bounding Volume Hierarchies. Erin Catto, Blizzard Entertainment

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Chapter 11 Global Illumination. Part 1 Ray Tracing. Reading: Angel s Interactive Computer Graphics (6 th ed.) Sections 11.1, 11.2, 11.

The Art and Science of Memory Allocation

Here s the general problem we want to solve efficiently: Given a light and a set of pixels in view space, resolve occlusion between each pixel and

The Right Read Optimization is Actually Write Optimization. Leif Walsh

Point based Rendering

Anti-aliased and accelerated ray tracing. University of Texas at Austin CS384G - Computer Graphics Fall 2010 Don Fussell

CS 563 Advanced Topics in Computer Graphics Culling and Acceleration Techniques Part 1 by Mark Vessella

Embree Ray Tracing Kernels: Overview and New Features

Real-Time Realism will require...

Soft Shadow Volumes for Ray Tracing with Frustum Shooting

Advanced Ray Tracing

15 Sharing Main Memory Segmentation and Paging

Anti-aliased and accelerated ray tracing. University of Texas at Austin CS384G - Computer Graphics

Universiteit Leiden Computer Science

Hello, Thanks for the introduction

Optimisation. CS7GV3 Real-time Rendering

Assignment 6: Ray Tracing

Warped parallel nearest neighbor searches using kd-trees

PowerVR Hardware. Architecture Overview for Developers

Part IV. Review of hardware-trends for real-time ray tracing

Introduction to Parallel Programming Models

Logistics. CS 586/480 Computer Graphics II. Questions from Last Week? Slide Credits

16 Sharing Main Memory Segmentation and Paging

Transcription:

Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel

Ray Tracing Performance Zero to Millions in 45 Minutes?! Gordon Stoll, Intel

Goals for this talk Goals point you toward the current state-of-the-art ( BKM ) for non-researchers: off-the-shelf performance for researchers: baseline for comparison get you interested in poking at the problem Non-Goals present lowest-level details of kernels present the one true way

Acceleration Structures Previous (90s) BKM was to use a uniform grid very high raw traversal speed performance is not robust Current BKM is the kd-tree comparable performance even on regular scenes very good build technique handles irregular scenes other grids, octrees, etc, might as well just use kd-tree packet tracing was the icing on the cake

Acceleration Structures Couldn t leave well enough alone several new schemes to be described later lots of stuff up in the air right now But The good tree building technique applies across the board. General optimization principles are similar.

kd-trees

kd-trees

kd-trees

kd-trees

Advantages of kd-trees Adaptive Can handle the Teapot in a Stadium Compact Relatively little memory overhead Cheap Traversal One FP subtract, one FP multiply

Take advantage of advantages Adaptive You have to build a good tree Compact At least use the compact node representation (8-byte) You can t be fetching whole cache lines every time Cheap traversal No sloppy inner loops! (one subtract, one multiply!)

Bang for the Buck (!/$ ) A basic kd-tree implementation will go pretty fast but extra effort will pay off big.

Fast Ray Tracing w/ kd-trees Adaptive Compact Cheap traversal

Building kd-trees Given: axis-aligned bounding box ( cell ) list of geometric primitives (triangles?) touching cell Core operation: pick an axis-aligned plane to split the cell into two parts sift geometry into two batches (some redundancy) recurse

Building kd-trees Given: axis-aligned bounding box ( cell ) list of geometric primitives (triangles?) touching cell Core operation: pick an axis-aligned plane to split the cell into two parts sift geometry into two batches (some redundancy) recurse termination criteria!

Intuitive kd-tree Building Split Axis Round-robin; largest extent Split Location Middle of extent; median of geometry (balanced tree) Termination Target # of primitives, limited tree depth

Hack kd-tree Building Split Axis Round-robin; largest extent Split Location Middle of extent; median of geometry (balanced tree) Termination Target # of primitives, limited tree depth All of these techniques stink.

Hack kd-tree Building Split Axis Round-robin; largest extent Split Location Middle of extent; median of geometry (balanced tree) Termination Target # of primitives, limited tree depth All of these techniques stink. Don t use them.

Hack kd-tree Building Split Axis Round-robin; largest extent Split Location Middle of extent; median of geometry (balanced tree) Termination Target # of primitives, limited tree depth All of these techniques stink. Don t use them. I mean it.

Building good kd-trees What split do we really want? Clever Idea: The one that makes ray tracing cheap Write down an expression of cost and minimize it Greedy Cost Optimization What is the cost of tracing a ray through a cell? Cost(cell) = C_trav + Prob(hit L) * Cost(L) + Prob(hit R) * Cost(R)

Splitting with Cost in Mind

Split in the middle Makes the L & R probabilities equal Pays no attention to the L & R costs

Split at the Median Makes the L & R costs equal Pays no attention to the L & R probabilities

Cost-Optimized Split Automatically and rapidly isolates complexity Produces large chunks of empty space

Building good kd-trees Need the probabilities Turns out to be proportional to surface area Need the child cell costs Simple triangle count works great (very rough approx.) Empty cell boost Cost(cell) = C_trav + Prob(hit L) * Cost(L) + Prob(hit R) * Cost(R) = C_trav + SA(L) * TriCount(L) + SA(R) * TriCount(R)

Termination Criteria When should we stop splitting? Clever idea: When splitting isn t helping any more. Use the cost estimates in your termination criteria Threshold of cost improvement Stretch over multiple levels (?) Threshold of cell size Absolute probability so small there s no point

Building good kd-trees Basic build algorithm Pick an axis, or optimize across all three Build a set of candidates (split locations) BBox edges or exact triangle intersections Sort them or bin them Walk through candidates to find minimum cost split Characteristics of highly optimized tree stringy, very deep, very small leaves, big empty cells

BKM for BVH building as well

BVH Version of Build

BVH Version of Build

Just Do It Benefits of a good tree are not small not 10%, 20%, 30%... several times faster than a mediocre tree Warning #1 get a good set of scenes for testing scanned models aren t representative, just easy to find Warning #2 beware the NOP

Building Trees Quickly Very important to build good trees first otherwise you have no foundation for comparison (!!!) Don t give up SAH cost optimization! approximate, sure, but don t just drop it Luckily, lots of flexibility axis picking ( hack pick vs. full optimization) candidate picking (bboxes, exact; binning, sorting) termination criteria ( knob controlling tradeoff)

Building Trees Quickly Huge amount of work happening right now I m going to defer to later speakers

Fast Ray Tracing w/ kd-trees adaptive build a cost-optimized tree w/ the surface area heuristic compact cheap traversal

What s in a node? A kd-tree internal node needs: Am I a leaf? Split axis Split location Pointers to children

Compact (8-byte) nodes kd-tree node can be packed into 8 bytes Leaf flag + Split axis 2 bits Split location 32 bit float Always two children, put them side-by-side One 32-bit pointer

Compact (8-byte) nodes kd-tree node can be packed into 8 bytes Leaf flag + Split axis 2 bits Split location 32 bit float Always two children, put them side-by-side One 32-bit pointer So close! Sweep those 2 bits under the rug

No Bounding Box! kd-tree node corresponds to an AABB Doesn t mean it has to *contain* one 24 bytes 4X explosion (!)

Memory Layout Cache lines are much bigger than 8 bytes! advantage of compactness lost with poor layout Pretty easy to do something reasonable Building depth first, watching memory allocator

Other Data Memory should be separated by rate of access Frames << Pixels << Samples [ Ray Trees ] << Rays [ Shading (not quite) ] << Triangle intersections << Tree traversal steps Example: pre-processed triangle, shading info

Fast Ray Tracing w/ kd-trees adaptive build a cost-optimized tree w/ the surface area heuristic compact use an 8-byte node lay out your memory in a cache-friendly way cheap traversal

kd-tree Traversal Step t_min t_split t_max split

kd-tree Traversal Step t_max t_split t_min split

kd-tree Traversal Step t_max t_split t_min split

kd-tree Traversal Step Given: ray P & iv (1/V), t_min, t_max, split_location, split_axis t_at_split = ( split_location - ray->p[split_axis] ) * ray_iv[split_axis] if t_at_split > t_min need to test against near child If t_at_split < t_max need to test against far child

Optimize Your Inner Loop kd-tree traversal is the most critical kernel It happens about a zillion times It s tiny Sloppy coding will show up Optimize, Optimize, Optimize Remove recursion and minimize stack operations Other standard tuning & tweaking

kd-tree Traversal while ( not a leaf ) t_at_split = (split_location - ray->p[split_axis]) * ray_iv[split_axis] if t_split <= t_min continue with far child // hit either far child or none if t_split >= t_max continue with near child // hit near child only // hit both children push (far child, t_split, t_max) onto stack continue with (near child, t_min, t_split)

Can it go faster? How do you make fast code go faster? Parallelize it!

Ray Tracing and Parallelism Classic Answer: Ray-Tree parallelism independent tasks # of tasks = millions (at least) size of tasks = thousands of instructions (at least) So this is wonderful, right?

Parallelism in CPUs Instruction-Level Parallelism (ILP) pipelining, superscalar, OOO, SIMD fine granularity (~10s of instructions window tops) easily confounded by unpredictable control easily confounded by unpredictable latencies So what does ray tracing look like to a CPU?

No joy in ILP-ville At <1000 instruction granularity, ray tracing is anything but embarrassingly parallel kd-tree traversal (CPU view): 1) fetch a tiny fraction of a cache line from who knows where 2) do two piddling floating-point operations 3) do a completely unpredictable branch, or two, or three 4) repeat until frustrated PS: Each operation is dependent on the one before it. PPS: No SIMD for you! Ha!

Split Personality Coarse-Grained parallelism (TLP) is perfect millions of independent tasks thousands of instructions per task Fine-Grained parallelism (ILP) is awful look at a scale <1000 of instructions sequential dependencies unpredictable control paths unpredictable latencies no SIMD

Options Option #1: Forget about ILP, go with TLP improve low-ilp efficiency and use multiple CPU cores Option #2: Let TLP stand in for ILP run multiple independent threads (ray trees) on one core Option #3: Improve the ILP situation directly how? Option #4:

All of the above! multi-core CPUs are here in spades better performance, better low-ilp performance on the right performance curve multi-threaded CPUs are already here improve well-written ray tracer by ~20-30% packet tracing trace multiple rays together in a packet bulk up the inner loop with ILP-friendly operations

Packet Tracing Very, very old idea from vector/simd machines Vector masks Old way if the ray wants to go left, go left if the ray wants to go right, go right New way if any ray wants to go left, go left with mask if any ray wants to go right, go right with mask

Key Observations Doesn t add bad stuff Traverses the same nodes Adds no global fetches Adds no unpredictable branches What it does add SIMD-friendly floating-point operations Some messing around with masks Result: Very robust in relation to single rays

How many rays in a packet? Packet tracing gives us a knob with which to adjust computational intensity. Do natural SIMD width first Real answer is potentially much more complex diminishing returns due to per-ray costs lack of coherence to support bigger packets register pressure, L1 pressure Makes hardware much more likely/possible

Beyond Packets Diminishing returns limit packet size cost of any operation is still linear in packet size run out of overhead to amortize away Frustums and/or Interval Representations constant time operations on a whole packet MLRT, Alex Reshetov, SIGGRAPH 05 inverse frustum culling, interval traversal Carsten Benthin s thesis

Fast Ray Tracing w/ kd-trees Adaptive build a cost-optimized tree (w/ surface area heuristic) Compact use an 8-byte node lay out your memory in a cache-friendly way Cheap traversal optimize your inner loop trace packets

Getting started Read PBRT (yeah, I know, it s 1300 pages) great book, pretty decent kd-tree builder Read Wald s thesis and Benthin s thesis lots of coding details for this stuff Track down the interesting references Learn SIMD programming (e.g. SSE intrinsics) Use a profiler.

Getting started Read PBRT (yeah, I know, it s 1300 pages) great book, pretty decent kd-tree builder Read Wald s thesis and Benthin s thesis lots of coding details for this stuff Track down the interesting references Learn SIMD programming (e.g. SSE intrinsics) Use a profiler. I mean it.

If you remember nothing else Rays per Second is measured in millions.