Real-Time Reyes: Programmable Pipelines and Research Challenges. Anjul Patney University of California, Davis

Similar documents
Real-Time Reyes Programmable Pipelines and Research Challenges

Fragment-Parallel Composite and Filter. Anjul Patney, Stanley Tzeng, and John D. Owens University of California, Davis

Real-Time Reyes-Style Adaptive Surface Subdivision

Lecture 13: Reyes Architecture and Implementation. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

A Real-time Micropolygon Rendering Pipeline. Kayvon Fatahalian Stanford University

Comparing Reyes and OpenGL on a Stream Architecture

Reyes Rendering on the GPU

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Real-Time Patch-Based Sort-Middle Rendering on Massively Parallel Hardware

Lecture 12: Advanced Rendering

CS427 Multicore Architecture and Parallel Computing

Scheduling the Graphics Pipeline on a GPU

GPU Task-Parallelism: Primitives and Applications. Stanley Tzeng, Anjul Patney, John D. Owens University of California at Davis

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Mattan Erez. The University of Texas at Austin

RenderAnts: Interactive REYES Rendering on GPUs

Portland State University ECE 588/688. Graphics Processors

Direct Rendering of Trimmed NURBS Surfaces

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

A Trip Down The (2011) Rasterization Pipeline

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST

Mattan Erez. The University of Texas at Austin

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016

CS230 : Computer Graphics Lecture 4. Tamar Shinar Computer Science & Engineering UC Riverside

Cornell University CS 569: Interactive Computer Graphics. Introduction. Lecture 1. [John C. Stone, UIUC] NASA. University of Calgary

PowerVR Hardware. Architecture Overview for Developers

Threading Hardware in G80

Real-Time Graphics Architecture

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

Memory-efficient Adaptive Subdivision for Software Rendering on the GPU

The GPGPU Programming Model

Chapter IV Fragment Processing and Output Merging. 3D Graphics for Game Programming

CS130 : Computer Graphics. Tamar Shinar Computer Science & Engineering UC Riverside

Mattan Erez. The University of Texas at Austin

OpenGL and RenderMan Rendering Architectures

Ciril Bohak. - INTRODUCTION TO WEBGL

REYES REYES REYES. Goals of REYES. REYES Design Principles

Shaders. Slide credit to Prof. Zwicker

graphics pipeline computer graphics graphics pipeline 2009 fabio pellacini 1

graphics pipeline computer graphics graphics pipeline 2009 fabio pellacini 1

GPU Architecture and Function. Michael Foster and Ian Frasch

EE 4702 GPU Programming

Ray Casting of Trimmed NURBS Surfaces on the GPU

A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization

Efficient and Scalable Shading for Many Lights

3D Modeling: Surfaces

Lecture 4: Geometry Processing. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

RSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog

Rendering. Converting a 3D scene to a 2D image. Camera. Light. Rendering. View Plane

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

The Application Stage. The Game Loop, Resource Management and Renderer Design

The Need for Programmability

Graphics Processing Unit Architecture (GPU Arch)

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

Efficient GPU Rendering of Subdivision Surfaces. Tim Foley,

2.11 Particle Systems

Deus Ex is in the Details

3/1/2010. Acceleration Techniques V1.2. Goals. Overview. Based on slides from Celine Loscos (v1.0)

The NVIDIA GeForce 8800 GPU

CS452/552; EE465/505. Clipping & Scan Conversion

Spring 2009 Prof. Hyesoon Kim

Rendering Subdivision Surfaces Efficiently on the GPU

Programmable GPUs. Real Time Graphics 11/13/2013. Nalu 2004 (NVIDIA Corporation) GeForce 6. Virtua Fighter 1995 (SEGA Corporation) NV1

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

GRAPHICS PROCESSING UNITS

Rendering Objects. Need to transform all geometry then

The Graphics Pipeline

Graphics Hardware. Instructor Stephen J. Guy

Real-Time Ray Tracing Using Nvidia Optix Holger Ludvigsen & Anne C. Elster 2010

Approximate Catmull-Clark Patches. Scott Schaefer Charles Loop

UC Davis UC Davis Previously Published Works

The Traditional Graphics Pipeline

Why modern versions of OpenGL should be used Some useful API commands and extensions

EECS 487: Interactive Computer Graphics

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Motivation MGB Agenda. Compression. Scalability. Scalability. Motivation. Tessellation Basics. DX11 Tessellation Pipeline

Accelerating CFD with Graphics Hardware

LOD and Occlusion Christian Miller CS Fall 2011

Hardware Accelerated Volume Visualization. Leonid I. Dimitrov & Milos Sramek GMI Austrian Academy of Sciences

TSBK03 Screen-Space Ambient Occlusion

How to Work on Next Gen Effects Now: Bridging DX10 and DX9. Guennadi Riguer ATI Technologies

Rendering Grass with Instancing in DirectX* 10

Drawing Fast The Graphics Pipeline

Drawing Fast The Graphics Pipeline

Advanced Shading and Texturing

Current Trends in Computer Graphics Hardware

Point based Rendering

Per-Pixel Lighting and Bump Mapping with the NVIDIA Shading Rasterizer

B. Tech. Project Second Stage Report on

Parallel Programming for Graphics

Drawing Fast The Graphics Pipeline

Spring 2011 Prof. Hyesoon Kim

Lets assume each object has a defined colour. Hence our illumination model is looks unrealistic.

COMPUTER GRAPHICS COURSE. Rendering Pipelines

NVIDIA Parallel Nsight. Jeff Kiel

Other Rendering Techniques CSE 872 Fall Intro You have seen Scanline converter (+z-buffer) Painter s algorithm Radiosity CSE 872 Fall

Transcription:

Real-Time Reyes: Programmable Pipelines and Research Challenges Anjul Patney University of California, Davis

Real-Time Reyes-Style Adaptive Surface Subdivision Anjul Patney and John D. Owens SIGGRAPH Asia 2008 (to appear) http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=952

Cinematic rendering looks good Image courtesy: Pixar

High geometric complexity Image courtesy: Pixar

High shading complexity Image courtesy: Pixar

Motion blur, Depth-of-field Image courtesy: Pixar

Complex lighting Image courtesy: Pixar

All this is slow Image courtesy: Pixar

Our Goal Make it faster As much as possible in real time Use the GPU Massive available parallelism Fixed-function texture/raster units High Programmability

Enter Reyes Developed in 1980s, offline Pixel-accurate smooth surfaces Eye-space shading Stochastic sampling Order-independent transparency

Reyes Direct3D 10 Bound/Split Vertex Shade Dice (tessellate) Geometry Shade Displace Rasterize Shade Fragment Shade Sample Depth-buffer Compose Blend

Reyes Bound/Split Dice (tessellate) Displace Shade Sample Compose High-quality geometry Convenient surface shading Cinematic quality Well-behaved (?)

Reyes Bound/Split Dice (tessellate) Displace Shade Input: Smooth surfaces Output: 0.5 x 0.5 pixel quads (micropolygons) Sample Compose

Outline Reyes Subdivision algorithm Subdivision on GPU implementation Results Limitations

Outline Reyes Subdivision algorithm Subdivision on GPU implementation Results Limitations

Split-Dice Recursively split a surface till dicing makes sense Uniformly sample it to form a grid of micropolygons

Split Bound/Cull Diceable? No: Split Yes: Dice Dice

Dice Post-split surface Grid Micropolygon

Split-Dice is hard Split Recursive, serial Rapid primitive generation/destruction Dice Huge memory demand

Can we do this in parallel?

Parallel Split Regular computation A lot of independent operations

Analogy: A Dynamic work queue A B C D E F G H I A A B C C E E F H Cull Split No Action

How can we do these efficiently? Creating new primitives How to dynamically allocate space? Culling unneeded primitives How to avoid fragmentation?

Our Choice keep it simple A B C D E F G H I A B C E F H A C E A child primitive is offset by the queue length

and get rid of the holes later A B C E F H A C E A B C E F H A C E Scan-based compact is fast! (Sengupta 07)

Storage Issues for Dice Too many micropolygons Cannot reject early Screen-space buckets (tiles) In parallel? Ideal bucket size?

Outline Reyes Subdivision algorithm Subdivision on GPU implementation Results Limitations

Platform NVIDIA GeForce 8800 GTX 16 SIMD Multiprocessors 16KB shared memory 768 MB memory (no cache) NVIDIA CUDA 1.1 Grids/Blocks/Threads OpenGL interface

Implementation Details Input choice: Bicubic Bézier Surfaces Only affects implementation View Dependent Subdivision every frame Single CPU-GPU transfer Final micropolygons written to a VBO Flat-shaded and displayed

Kernels Implemented Dice Regular, symmetric, parallel 256 threads per patch Primitive information in shared memory Bound/Split Hard to ensure efficiency

Bound/Split: Efficiency Goals Memory Coherence Off-chip memory accesses must be efficient Computational Efficiency Hardware SIMD must be maximally utilized

Memory Coherence during Split Compact work-queue after each iteration Primitives always contiguous in memory Structure-Of-Arrays representation Primitive attributes adjacent in memory 99.5% of all accesses fully coalesced

SIMD utilization during split Intra-Primitive parallelism Independent control points Negligible divergence Vectorized Split 16 Threads per patch 32 threads (1 warp) 90.16% of all branches SIMD coherent

Outline Reyes Subdivision algorithm Subdivision on GPU implementation Results Limitations

Results - Killeroo 14426 grids 5 levels of subdivision Bound/Split: 6.99 ms Dice: 7.21ms 29.69 frames/second (19.92 with 16x AA) Killeroo model courtesy: Headus Inc.

Results - Teapot 4823 grids 11 levels of subdivision Bound/Split: 3.46 ms Dice: 2.42 ms 60.07 frames/second (30.02 with 16x AA)

Time (ms) 30 20 Results Random scenes Subdivision time proportional to number 25 of micropolygons 15 10 5 0 0 10000 20000 Number of grids

Render time Breakdown 100% 75% Rendering lots of small triangles Rendering 50% 25% CUDA-OpenGL transfer Normal generation Subdivision 0% Teapot Killeroo Random Model (30 patches) Should ideally be zero: CUDA limitation

Subdivision Time (ms) Memory Usage (MB) Results - Screen-space buckets 100 50 90 80 70 Acceptable performance, Small memory footprint 40 60 30 50 40 20 30 20 10 10 0 0 0 128 256 384 512 Bucket width (pixels) Subdivision time Memory Usage (max.) Memory Usage (avg.)

Outline Reyes Subdivision algorithm Subdivision on GPU implementation Results Limitations

Limitations Can t split and dice in parallel Uniform dicing Cracks

Limitations Uniform dicing

Limitations - Cracks

Limitations - Cracks

Conclusions Breadth-first recursive subdivision Suits GPUs Works fast Dicing Programmable tessellation is fast 500M micropolygons/sec It is time to experiment with alternate graphics pipelines

Reyes in real-time rendering Visually superior to polygon pipeline Regular, highly parallel workload Extremely well-studied Good candidate for a real-time system?

Future Work Cracks, Displacement mapping Rest of Reyes Offline quality shading Interactive lighting Parallel Stochastic Sampling (Wei 2008) A-buffer (Myers 2007)

Thanks to Feedback and suggestions Per Christensen, Charles Loop, Dave Luebke, Matt Pharr and Daniel Wexler Financial support DOE Early Career PI Award National Science Foundation SciDAC Institute for Ultrascale Visualization Equipment support from NVIDIA

Teapot Video Real-time footage

Killeroo Video Real-time footage

Real-Time Reyes-Style Adaptive Surface Subdivision BACKUP SLIDES

CUDA Thread Structure Image courtesy: NVIDIA CUDA Programming Guide, 1.1

CUDA Memory Architecture Image courtesy: NVIDIA CUDA Programming Guide, 1.1

Displacement Mapping Fairly simple if cracks can be avoided Displace adjacent grids together

Shading & Lighting Interactive preview Lpics / Lightspeed Shaders Intelligent textures File access Shadows

Composite/Filter Stuff Stochastic sampling 20x speedup on a GPU (Wei 2008) A-buffer

Special Effects Motion Blur and DOF Global Illumination Ambient Occlusion