A Real-time Micropolygon Rendering Pipeline. Kayvon Fatahalian Stanford University

Similar documents
Lecture 13: Reyes Architecture and Implementation. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Real-Time Reyes: Programmable Pipelines and Research Challenges. Anjul Patney University of California, Davis

Real-Time Reyes Programmable Pipelines and Research Challenges

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Real-Time Realism will require...

Data-Parallel Rasterization of Micropolygons with Defocus and Motion Blur

Data-Parallel Rasterization of Micropolygons with Defocus and Motion Blur

Lecture 4: Geometry Processing. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Approximate Catmull-Clark Patches. Scott Schaefer Charles Loop

A Trip Down The (2011) Rasterization Pipeline

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Graphics and Imaging Architectures

Comparing Reyes and OpenGL on a Stream Architecture

LOD and Occlusion Christian Miller CS Fall 2011

Reyes Rendering on the GPU

A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

COMPUTER GRAPHICS COURSE. Rendering Pipelines

The Rasterization Pipeline

Lecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013

Parallel Triangle Rendering on a Modern GPU

MSAA- Based Coarse Shading

Course Recap + 3D Graphics on Mobile GPUs

CS452/552; EE465/505. Clipping & Scan Conversion

Chapter IV Fragment Processing and Output Merging. 3D Graphics for Game Programming

graphics pipeline computer graphics graphics pipeline 2009 fabio pellacini 1

graphics pipeline computer graphics graphics pipeline 2009 fabio pellacini 1

Rendering. Converting a 3D scene to a 2D image. Camera. Light. Rendering. View Plane

TSBK03 Screen-Space Ambient Occlusion

Deferred Splatting. Gaël GUENNEBAUD Loïc BARTHE Mathias PAULIN IRIT UPS CNRS TOULOUSE FRANCE.

Deus Ex is in the Details

Efficient GPU Rendering of Subdivision Surfaces. Tim Foley,

Subdivision Surfaces

Sung-Eui Yoon ( 윤성의 )

Many rendering scenarios, such as battle scenes or urban environments, require rendering of large numbers of autonomous characters.

UC Davis UC Davis Previously Published Works

Hardware Implementation of Micropolygon Rasterization with Motion and Defocus Blur

Dynamic Feature-Adaptive Subdivision

CS4620/5620: Lecture 14 Pipeline

Applications of Explicit Early-Z Z Culling. Jason Mitchell ATI Research

Wed, October 12, 2011

Introduction to the Direct3D 11 Graphics Pipeline

REYES REYES REYES. Goals of REYES. REYES Design Principles

Real-Time Graphics Architecture

Scheduling the Graphics Pipeline on a GPU

Drawing Fast The Graphics Pipeline

Rasterization Overview

Complexity Reduction of Catmull-Clark/Loop Subdivision Surfaces

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016

Rendering Subdivision Surfaces Efficiently on the GPU

INF3320 Computer Graphics and Discrete Geometry

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

Rendering Grass with Instancing in DirectX* 10

Subdivision Of Triangular Terrain Mesh Breckon, Chenney, Hobbs, Hoppe, Watts

Visibility and Occlusion Culling

Curves and Surfaces 2

Craig Peeper Software Architect Windows Graphics & Gaming Technologies Microsoft Corporation

Computer Graphics Shadow Algorithms

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Direct Rendering of Trimmed NURBS Surfaces

OpenGL and RenderMan Rendering Architectures

Terrain Rendering Research for Games. Jonathan Blow Bolt Action Software

Models and Architectures

Computer Graphics Fundamentals. Jon Macey

Subdivision overview

Clipless Dual-Space Bounds for Faster Stochastic Rasterization

Other Rendering Techniques CSE 872 Fall Intro You have seen Scanline converter (+z-buffer) Painter s algorithm Radiosity CSE 872 Fall

Streaming Massive Environments From Zero to 200MPH

Drawing Fast The Graphics Pipeline

Ray tracing. Computer Graphics COMP 770 (236) Spring Instructor: Brandon Lloyd 3/19/07 1

3/1/2010. Acceleration Techniques V1.2. Goals. Overview. Based on slides from Celine Loscos (v1.0)

Rasterization and Graphics Hardware. Not just about fancy 3D! Rendering/Rasterization. The simplest case: Points. When do we care?

Visibility: Z Buffering

Drawing Fast The Graphics Pipeline

Graphics and Interaction Rendering pipeline & object modelling

Computer Graphics. Lecture 9 Hidden Surface Removal. Taku Komura

Real-Time Shadows. Computer Graphics. MIT EECS Durand 1

FRUSTUM-TRACED RASTER SHADOWS: REVISITING IRREGULAR Z-BUFFERS

Lecture 12: Advanced Rendering

Parallel Programming for Graphics

Overview. Pipeline implementation I. Overview. Required Tasks. Preliminaries Clipping. Hidden Surface removal

1.2.3 The Graphics Hardware Pipeline

Intro to Ray-Tracing & Ray-Surface Acceleration

PowerVR Hardware. Architecture Overview for Developers

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Rendering approaches. 1.image-oriented. 2.object-oriented. foreach pixel... 3D rendering pipeline. foreach object...

Introduction to Visualization and Computer Graphics

Real-Time Patch-Based Sort-Middle Rendering on Massively Parallel Hardware

7/29/2010 Beyond Programmable Shading Course, ACM SIGGRAPH 2010

CS130 : Computer Graphics. Tamar Shinar Computer Science & Engineering UC Riverside

Hidden surface removal. Computer Graphics

CSE 167: Lecture #5: Rasterization. Jürgen P. Schulze, Ph.D. University of California, San Diego Fall Quarter 2012

INFOGR Computer Graphics. J. Bikker - April-July Lecture 11: Visibility. Welcome!

Advanced Shading and Texturing

Intro to OpenGL II. Don Fussell Computer Science Department The University of Texas at Austin

Spring 2011 Prof. Hyesoon Kim

The Rasterization Pipeline

Performance OpenGL Programming (for whatever reason)

Hardware-driven Visibility Culling Jeong Hyun Kim

Memory-efficient Adaptive Subdivision for Software Rendering on the GPU

Transcription:

A Real-time Micropolygon Rendering Pipeline Kayvon Fatahalian Stanford University

Detailed surfaces Credit: DreamWorks Pictures, Shrek 2 (2004)

Credit: Pixar Animation Studios, Toy Story 2 (1999)

Credit: Crytek

(one pixel) Zoom in (pixels)

(one pixel) Micropolygons

Camera defocus Motion blur

Rendering goals Highly detailed surfaces: micropolygons (But interoperate with large triangle rendering in same frame) Accurate camera defocus and motion blur [Near future] real-time system

It is inefficient to render micropolygons using the current graphics pipeline.

How does the real-time graphics pipeline evolve to enable efficient micropolygon rendering? How should surfaces be tessellated into micropolygons? How can micropolygons be rasterized efficiently? How expensive is motion blur and defocus? Should the pipeline shade like GPUs or like REYES? (changes to algorithms, abstractions, and HW are fair game)

TESSELLATION: How do we add adaptive tessellation to the pipeline?

Generate micropolygons on demand Base primitives Smooth tessellation Final displaced surface

D3D tessellates parametric surfaces Coarse Vertex Processing Primitive Processing Compute tessellation factor Tessellate Generate new triangles New in D3D11 Fine Vertex Processing Compute position of new triangle verts Rasterize+Cull Fragment Processing Frame buffer ops Fixed function =

Need adaptive tessellation Uniform tessellation per patch over/under tessellates Overtessellation Undertessellation

Performing good adaptive tessellation is very important. Complexity pays for itself. Less surface evaluation Less shading Less (and more efficient) rasterization

Split-dice adaptive tessellation Split: recursively divide primitive (adaptive) Dice: simple uniform tessellation (fast) Base Patch Micropolygon Mesh v 3x2 6x6 u 3x4 Split Split (again) Set Rates Dice

D3D11 tessellation Don t split, just dice Tessellation factors for primitive edges Fancy dicer stitches edges to uniform interior 7 2 4 3 4 7 REYES dicing D3D11 dicing

Parallel, adaptive tessellation must avoid cracks

Offline techniques suboptimal for real-time Retain the entire micropolygon mesh and stitch adjacent subpatches Very robust Hard to parallelize (can t stream) Use only power-of-two dicing factors Easier to parallelizable T-junctions Over tessellates

Split-dice for a real-time pipeline DiagSplit: simple solution Adaptive tessellation Independent subpatch processing Crack-free No power-of-two constraints Fisher et al. [Siggraph Asia 2009]

DiagSplit: idea 1 Leverage D3D11 tessellator as advanced dicer Edge-based tessellation factor function: h(edge) Let h evaluate displacement h=6 h=split h=4 h=3 h=4 h=3 h=5 Dicable: send to Dicer h=split Evaluate h on subpatches

DiagSplit: idea 2 Scale area of internal polygons to hit target area precisely Can scale tessellation of interior region without creating cracks

DiagSplit: idea 3 Permit diagonal (non-isoparametric) splits Non-isoparametric patch split

Direct3D 11 Power-of -Two Split-Dice + Stitching DiagSplit

What is the cost of splitting? Most time is spent evaluating final surface positions Then you ve got to rasterize and shade Avoiding over tessellation pays for itself 10 million MP/sec on one core of Intel I7

Parallel tessellation is an active area Expressibility of parametric surfaces is increasing Approximate Catmull-Clark surfaces [Loop (TOG 2008)] Extension to creases and corners [Kovacs (I3D 2009)] Variety of implementations Depth-first (cache efficient, tiny footprint) Breath-first subdivision (fits current GPU compute model well) Patney (SIGGRAPH Asia 2008), Patney (HPG 2009) Eisenacher (I3D 2009) RenderAnts CUDATess [Schwarz (Eurographics 2009)]

D3D11 Pipeline Micropolygon Pipeline Coarse Vertex Processing Primitive Processing Tessellate (Dice) Fine Vertex Processing Coarse Vertex Processing Primitive Processing Split Tessellate (Dice) Fine Vertex Processing Tessellation Rate Func Bounding Box Func Programmable = Fixed function =

RASTERIZATION: How efficiently can micropolygons be rasterized? (and what s the cost?)

Rasterization

Step 1: per-polygon preprocessing (setup) Clip, back face cull, compute edge equations Make point-in-polygon tests cheap

Step 2: compute candidate sample set Coarse reject/accept of samples Coarse uniform grid

Step 2: compute candidate sample set Coarse reject/accept of samples Coarse uniform grid Hierarchical descent

Step 2: compute candidate sample set Coarse reject/accept of samples Coarse uniform grid Hierarchical descent

Step 2: compute candidate sample set Coarse reject/accept of samples Coarse uniform grid Hierarchical descent

Step 3: point-in-polygon tests Test stamp of samples against polygon simultaneously (data-parallel) [Pineda 88] [Fuchs 89] [Greene 96] [Seiler 08]

Micropolygons: more polygons = more setup

Micropolygons: coarse reject not useful

Micropolygons: large stamps yield low efficiency 47% of tested samples inside triangle 6% of tested samples inside triangle

Re-implement rasterizer (Fatahalian et al. HPG 2009) Rasterize many micropolygons simultaneously Input micropolygons Exec 0 Exec 1 Exec 2 Exec 3 Exec 4 Exec 5 Exec 6 Exec 7 Output fragments

Reimplement rast to increase efficiency (2.5 to 6x more efficient) Conventional rasterizer: 16 sample (4x4) stamp MICROPOLYGON-PARALLEL Sample test efficiency (%) 30 20 10 2% 12% 6% 20% 11% 28% No multi-sampling 4x multi-sampling 16x multi-sampling

Micropolygon rasterization is expensive Primary visibility computation: 1080p resolution, 30 Hz 4x multi-sampling Simple scene (10 M micropolygons) Estimated cost of GPU SW implementation: Approximately 1/3 of high-end GPU Fixed-function micropolygon rasterization is appealing

MOTION BLUR AND DEFOCUS: How expensive are motion blur/defocus?

Motion blur and defocus Many 2D-techniques for approximating blur [Sung 02] [Demers 04] Stochastic point sampling [Cook 84, Cook 86] [Akenine-Moller 07]

Moving micropolygon t x y

Soccer jump 16x multi-sampling 64 unique time samples

Enabling motion/defocus blur costs 3 to 7x more Point-in-polygon tests are more expensive Algorithms that support blur perform more point-in-polygon tests Stationary geometry, perfect focus: Tests 2.5x more samples Cost 3x more

Performance varies with motion Cost (ops)

Most efficient algorithm depends on scene conditions Cost (ops)

[Pixar method] costs increase sharply with defocus Cost (ops)

D3D11 Pipeline Coarse Vertex Processing Primitive Processing Tessellate (Dice) Fine Vertex Processing Micropolygon Pipeline Coarse Vertex Processing Primitive Processing Split Tessellate (Dice) Fine Vertex Processing Tessellation Rate Func Bounding Box Func Rasterize+Cull Rasterize (5D) open and close shutter vertex positions shutter duration, lens parameters Unchanged = Programmable = Fixed function =

SHADING: Should a micropolygon pipeline shade like today s GPUs or like REYES?

D3D11 Pipeline Coarse Vertex Processing Primitive Processing Tessellate (Dice) Fine Vertex Processing Rasterize+Cull Fragment Processing Frame buffer ops Micropolygon Pipeline Coarse Vertex Processing Primitive Processing Split Tessellate (Dice) Fine Vertex Processing Cull+Shade? Rasterize (5D) Cull+Shade? Frame buffer ops Tessellation Rate Func Bounding Box Func Unchanged = Programmable = Fixed function =

GPUs shade micropolygons inefficiently Rasterize micropolygon input Four shading samples output At least 4x overshade! (must share shading results across micropolygons)

MSAA implementations break down with blur Rasterize Motion-blurred polygon input (inefficient to represent as hit mask) Covered samples are spread widely across screen

Future fragment shading? Decoupled sampling: Ragan-Kelley et al. Rasterize Visibility sample hits Shading Result Cache Shading requests Shaded samples Fragment processing Fragments Frame Buffer Ops

REYES-style shading Perform shading computations at vertices Requires connectivity for derivatives Not a problem: system has the diced mesh Decoupled shading for motion blur and defocus just works

Overshade in a REYES system Overtessellation Conservative, coarse occlusion culling

Shading system design space Current GPU fragment shading Sample in screen space Shade only what s visible Fine early Z Deferred shading Shade before exact visibility known Sweet spot? Sample in object space REYES: shade at diced grid vertices

SUMMARY

Good adaptive tessellation Is very important DiagSplit: Build split on top of D3D11 Tess Abstract SPLIT as non-programmable stage Micropolygon Pipeline Coarse Vertex Processing Primitive Processing Split Tessellate (Dice) Fine Vertex Processing Fine Vertex Shading Rasterize (5D) Future Fragment Proc. Frame buffer ops Tessellation Rate Func Bounding Box Func

Shading on surface: Need derivatives & good tessellation We ve got the diced grid for connectivity Micropolygon Pipeline Coarse Vertex Processing Primitive Processing Split Tessellate (Dice) Fine Vertex Evaluation Tessellation Rate Func Bounding Box Func Fine Vertex Shading (requires mesh connectivity) Rasterize (5D) Future Fragment Proc. Frame buffer ops

Rasterization: re-implement for micropolygons Fixed-function HW likely Point sampling for motion blur and defocus are 3-7x more costly Micropolygon Pipeline Coarse Vertex Processing Primitive Processing Split Tessellate (Dice) Fine Vertex Processing Fine Vertex Shading Rasterize (5D) Future Fragment Proc. Frame buffer ops Tessellation Rate Func Bounding Box Func shutter duration, lens parameters

Must share shading across micropolygons to avoid overcompute (quad per MP is inefficient) Motion blur/defocus change fragment shading (even without micropolygons) Micropolygon Pipeline Coarse Vertex Processing Primitive Processing Split Tessellate (Dice) Fine Vertex Processing Fine Vertex Shading Rasterize (5D) Future Fragment Proc. Frame buffer ops Tessellation Rate Func Bounding Box Func

It s coming! We ll have the flops from future HW Lot s of active research on algorithms Real-time micropolygon rendering will be possible in the future

Thank you