Visual Effects Production on the Cell/BE
Andrew Clinton, Side Effects Software

Visual Effects Production
1. Animation: Character, Keyframing
2. Dynamics Simulation: Fluids, RBD, Cloth
3. Rendering: Raytracing
4. Compositing

Software Products
- Houdini: 3D visual effects production software supporting all stages of content creation
- Houdini Batch: command-line interface to Houdini with Python scripting
- Mantra: bundled production rendering application

Houdini

Render Farms
- Final software renders of visual effects shots are often distributed to a network of servers for processing
- Individual frames can often require hours of computation
- Complex simulations or compositing operations also use the render farm
- Simulation is more difficult to parallelize due to inter-frame dependencies

Cell/BE Based Render Farm
- In co-operation with Sony, we have ported some of the most computationally expensive portions of our software to the Cell/BE
  - Fluid simulation
  - Rendering
  - Compositing

Project Goals
- Houdini Batch / Mantra to run under Linux on the Cell/BE
- Apply SPU and GPU-based optimizations to accelerate bottleneck algorithms
- Produce the same output for rendering/compositing as other supported platforms (Linux/Windows x86)
- Simulation results are sensitive to floating point roundoff, so identical results are not expected between platforms

Development Approach
- Top-down: profile common customer scenes to determine the most important optimizations
- Simplify data structures
- Optimize these algorithms first
- Bottom-up: development of a robust, abstract library of data structures to simplify SPU programming
- Tools: C++, GCC, Sony PASuite, oprofile

Target Algorithms
- Rendering (SPU)
  - Raytracing: shadows, reflections, refractions
  - Ambient Occlusion
  - Global Illumination
  - Micropolygon Rendering: motion blur, depth of field
- Fluid Simulation (SPU)
  - Voxel operations (Advect, Diffuse, etc.)
  - Sparse matrix solver
- Compositing (SPU/GPU)
  - Blur, Defocus, Over, etc.: SPU
  - Noise, Lighting, Fog, etc.: GPU

Raytracing

Raytracing Algorithm
- Geometry (quadrilaterals)
- Build KD-Tree
- Generate rays
  - Ray (origin, direction)
- Calculate intersections (performed on SPU)
  - Intersection (u/v/t, primitive id)
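The slide names the data that crosses the intersection step: a ray (origin, direction) goes in, and an intersection (u/v/t, primitive id) comes out. The following is a minimal C++17 sketch of that interface for a single quadrilateral. It is not the Houdini/Mantra code. It assumes each quad is a planar parallelogram described by a corner and two edge vectors, and it adapts the well-known Möller-Trumbore triangle test to that case. The names `Vec3`, `Ray`, `Hit`, `Quad` and `intersect` are hypothetical.

```cpp
#include <cmath>
#include <optional>

struct Vec3 { float x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

struct Ray { Vec3 origin, direction; };       // input named on the slide
struct Hit { float u, v, t; int primId; };    // output named on the slide

// Assumption: a quad is a planar parallelogram p0 + u*e1 + v*e2, with u, v in [0, 1].
struct Quad { Vec3 p0, e1, e2; };

// Solves origin + t*direction = p0 + u*e1 + v*e2 (Moeller-Trumbore style).
std::optional<Hit> intersect(const Ray& r, const Quad& q, int primId) {
    const Vec3 pvec = cross(r.direction, q.e2);
    const float det = dot(q.e1, pvec);
    if (std::fabs(det) < 1e-8f) return std::nullopt;   // ray parallel to the quad plane
    const float inv = 1.0f / det;
    const Vec3 tvec = sub(r.origin, q.p0);
    const float u = dot(tvec, pvec) * inv;
    if (u < 0.0f || u > 1.0f) return std::nullopt;
    const Vec3 qvec = cross(tvec, q.e1);
    const float v = dot(r.direction, qvec) * inv;
    if (v < 0.0f || v > 1.0f) return std::nullopt;
    const float t = dot(q.e2, qvec) * inv;
    if (t <= 0.0f) return std::nullopt;                // hit is behind the ray origin
    return Hit{u, v, t, primId};
}
```

In a full raytracer this test would run only on the primitives stored in the KD-tree leaves that the ray visits. The traversal itself is not shown here.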

Raytracing on the SPU

Raytracing Benchmark (million rays/second)

Incoherent Rays
Processor                           Box (Hit%: 0.94)   Godzilla (Hit%: 0.28)   Happy Buddha (Hit%: 0.87)
CPU Core 2 2.66 GHz (1 processor)   1.68               0.74                    0.20
Cell/BE PPU 64-bit                  0.19               0.16                    0.05
Cell/BE 8 SPEs                      14.45              1.95                    0.64

Coherent Rays
Processor                           Box (Hit%: 0.23)   Godzilla (Hit%: 0.25)   Happy Buddha (Hit%: 0.21)
CPU Core 2 2.66 GHz (1 processor)   4.23               1.67                    1.81
Cell/BE PPU 64-bit                  0.51               0.27                    0.31
Cell/BE 8 SPEs                      27.02              5.90                    6.05

Ambient Occlusion

Ambient Occlusion
- Offload ray generation in addition to ray intersection algorithm to SPU
- Render time comparison (256 samples/pixel):
  - 2:35 (Intel quad Core2, 4 threads)
  - 5:16 (Cell/BE, 8 SPUs)
- Why slower?
  - Divergent rays and heavy geometry: non-ideal SIMD efficiency and software managed cache overhead
  - KD-tree construction performed on PPU

Micropolygon Sampling

Micropolygon Sampling
[Figure: motion-blurred polygon shown at Time 0 and Time 1]
- Specialization of raytracing for eye rays
- Supports fast motion blur, depth of field
- Too much code to fit into the same SPU program as the raytracer
- Use overlays!
  - Overlay switch time close to 4 µs
  - 100 times faster than programs using libspe2 to switch

Micropolygon Sampling Results
- Micropolygon sampling performance is considerably better on the SPUs as opposed to the PPU
- However, SPU sampling is not yet competitive with Core2 processor performance
  - In-memory representation using linked sample lists not easily ported to SPU
  - Render order dependencies limit the amount of data available to send to SPUs
- Need to consider data structure redesign to take full advantage of SPU performance

Fluid Simulation

Fluid Simulation
- Primarily 2 classes of bottleneck algorithms
  - Voxel processing (e.g. advection, diffusion, etc.)
  - Sparse matrix solver (project)
- Each simulation frame executes a sequence of operations applied to a varying number of scalar and vector fields

Voxel Operations
- 16^3 voxel tile (16 KB)
- Scalar voxel array (3D array of tiles)
- 1 KB per voxel tile plane (16 2D slices per tile)
- Double buffer streaming inputs/outputs
- Use software managed cache for random access with 1 tile plane per page
- 3D field = 3 scalar fields
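The tile sizes on this slide are consistent with 4-byte voxels: 16 x 16 x 16 x 4 bytes is 16 KB per tile, and 16 x 16 x 4 bytes is 1 KB per plane. The following is a minimal sketch of such a tiled scalar array in C++. It is not the layout Houdini uses. It assumes `float` voxels, tiles stored contiguously in x, y, z order, and planes taken as z-slices. The class `TiledScalarField` and its members are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

constexpr int kTile = 16;                            // tile edge length in voxels
constexpr int kPlaneVoxels = kTile * kTile;          // one 2D slice of a tile
constexpr int kTileVoxels = kPlaneVoxels * kTile;    // 16 slices per tile

// Assumption: voxels are 4-byte floats, which reproduces the sizes on the slide.
static_assert(kTileVoxels * sizeof(float) == 16 * 1024, "16 KB per tile");
static_assert(kPlaneVoxels * sizeof(float) == 1024, "1 KB per tile plane");

class TiledScalarField {
public:
    TiledScalarField(int nx, int ny, int nz)
        : tx_((nx + kTile - 1) / kTile),
          ty_((ny + kTile - 1) / kTile),
          tz_((nz + kTile - 1) / kTile),
          data_(static_cast<std::size_t>(tx_) * ty_ * tz_ * kTileVoxels, 0.0f) {}

    // Flat offset of voxel (x, y, z): tile offset first, then offset inside the tile.
    std::size_t index(int x, int y, int z) const {
        const std::size_t tile =
            (static_cast<std::size_t>(z / kTile) * ty_ + y / kTile) * tx_ + x / kTile;
        const std::size_t local =
            (static_cast<std::size_t>(z % kTile) * kTile + y % kTile) * kTile + x % kTile;
        return tile * kTileVoxels + local;
    }

    float& at(int x, int y, int z) { return data_[index(x, y, z)]; }

    // Start of one 1 KB plane, the unit a software cache page would hold.
    const float* plane(int tileX, int tileY, int tileZ, int slice) const {
        const std::size_t tile =
            (static_cast<std::size_t>(tileZ) * ty_ + tileY) * tx_ + tileX;
        return data_.data() + tile * kTileVoxels +
               static_cast<std::size_t>(slice) * kPlaneVoxels;
    }

private:
    int tx_, ty_, tz_;             // tile counts along each axis
    std::vector<float> data_;
};
```

With this layout a whole tile is one contiguous 16 KB block, which suits streaming transfers. A single plane is one contiguous 1 KB block, which suits the one-plane-per-page cache described on the slide.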

Advection
- Use a velocity field to transform a scalar field (e.g. density) through time
- 4 random access inputs (velocity x, y, z; scalar field)
- 1 streaming output

Sparse Matrix Solver
- Uses Jacobi iterations
- Within each iteration, compute:
  - Sparse matrix multiply
  - Vector scale/offset
  - L2 norm calculation
- Normally vector sizes are very large (for a 128^3 simulation, about 1.5 million entries)
- Use double buffering to stream inputs and outputs to memory
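The three per-iteration operations listed above can be sketched as a plain Jacobi loop. This is an illustration only, not the solver from the talk. It assumes a compressed sparse row (CSR) matrix, a separately stored diagonal, and a stopping test on the residual norm. `CsrMatrix`, `multiply`, `l2Norm` and `jacobiSolve` are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed storage: compressed sparse row.
struct CsrMatrix {
    std::vector<int> rowStart;   // size rows + 1
    std::vector<int> col;        // column index of each stored entry
    std::vector<float> val;      // value of each stored entry
    int rows() const { return static_cast<int>(rowStart.size()) - 1; }
};

// Sparse matrix multiply: y = A * x.
void multiply(const CsrMatrix& A, const std::vector<float>& x, std::vector<float>& y) {
    for (int i = 0; i < A.rows(); ++i) {
        float sum = 0.0f;
        for (int k = A.rowStart[i]; k < A.rowStart[i + 1]; ++k)
            sum += A.val[k] * x[A.col[k]];
        y[i] = sum;
    }
}

// L2 norm calculation.
float l2Norm(const std::vector<float>& v) {
    float sum = 0.0f;
    for (float e : v) sum += e * e;
    return std::sqrt(sum);
}

// Jacobi iteration: x <- x + D^-1 (b - A x). Returns the number of updates applied.
int jacobiSolve(const CsrMatrix& A, const std::vector<float>& diag,
                const std::vector<float>& b, std::vector<float>& x,
                float tol, int maxIter) {
    std::vector<float> Ax(x.size()), r(x.size());
    for (int it = 0; it < maxIter; ++it) {
        multiply(A, x, Ax);
        for (std::size_t i = 0; i < x.size(); ++i) r[i] = b[i] - Ax[i];      // offset
        if (l2Norm(r) < tol) return it;
        for (std::size_t i = 0; i < x.size(); ++i) x[i] += r[i] / diag[i];   // scale and add
    }
    return maxIter;
}
```

Each of the three steps reads its vectors front to back, so the access pattern is sequential and can be streamed in fixed-size chunks.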

Fluid Simulation Results

Processor / Sim Stage                Project (Sparse Solver)   Diffuse   Buoyancy   Advect   Enforce   Total
CPU Core 2 2.66 GHz (4 processors)   18.17                     3.59      0.29       1.64     0.42      24.11
Cell/BE 8 SPEs                       25.38                     2.42      0.06       1.77     0.20      29.83

- Results are for a single simulation step of a 128^3 simulation
- Values in seconds

Floating Point Evaluation (no DMA)

Algorithm / Processor   Quad Intersection   GAS Diffuse   VM Multiply          SPU Ideal (*)
Cell/BE SPU             0.52s               0.43s         0.31s (3.2 GFLOPS)   1.36s (23.5 GFLOPS)
Cell/BE PPU             2.38s               N/A           1.67s (0.6 GFLOPS)   19.3s (1.7 GFLOPS)
Intel Core2             0.45s               0.43s         0.26s (3.8 GFLOPS)   1.92s (16.7 GFLOPS)
Speedup (*)             1.73x               2.00x         1.61x                2.82x

Notes:
- SPU Ideal is a benchmark of the ideal SPU FLOPS (it uses only spu_madd instructions)
- Speedup compares Cell/BE performance (with 8 SPUs) against Intel Core2 performance (with 4 Core2 processors), assuming linear scaling

Bandwidth Evaluation (no computation)
- Cell 8 SPUs: 18.0 Gb/s (25.6 Gb/s)
- Cell 1 PPU: 1.5 Gb/s
- Intel Core2 DDR2 800 MHz: 4.4 Gb/s
  - 25.0 Gb/s cached

Basic Performance Assessment
- In the ideal scenario, the Cell processor has peak floating point throughput 2.8x that of current-generation Intel quad-core CPUs
- Real-world algorithms produce floating point speedup of up to 2.0x
- Ideal speedup can only be observed when:
  - Algorithm is fully vectorized, and has high vector utilization
  - Algorithm is compute-bound or has well-behaved DMA access (streaming or prefetched)

Conclusions
- PPU-only port of software to Cell/BE is easy
- SPU port can show competitive performance to Intel 4-core system
  - ambient occlusion
  - fluid simulation
  - compositing blur
- Cell/BE algorithms produce visually identical results to existing x86 platforms
- The Cell/BE can accommodate the flexibility inherent in Houdini and visual effects production in general

Questions?

Appendix A: SPU Programming
- SPU side:
  - N-buffering
  - Software managed cache
  - 4-component vectors
  - Overlays
  - DMA/heap debugging
- PPU side:
  - SPU thread set start/stop
  - Mailbox communication
  - Scheduling, SPU allocation