Streaming Massive Environments From Zero to 200MPH

Similar documents
Many rendering scenarios, such as battle scenes or urban environments, require rendering of large numbers of autonomous characters.

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Wed, October 12, 2011

RSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Shadows in the graphics pipeline

Enhancing Traditional Rasterization Graphics with Ray Tracing. March 2015

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

PowerVR Hardware. Architecture Overview for Developers

Analyze and Optimize Windows* Game Applications Using Intel INDE Graphics Performance Analyzers (GPA)

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016

LOD and Occlusion Christian Miller CS Fall 2011

CS535 Fall Department of Computer Science Purdue University

Working with Metal Overview

Shadows for Many Lights sounds like it might mean something, but In fact it can mean very different things, that require very different solutions.

Dominic Filion, Senior Engineer Blizzard Entertainment. Rob McNaughton, Lead Technical Artist Blizzard Entertainment

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Render-To-Texture Caching. D. Sim Dietrich Jr.

The Application Stage. The Game Loop, Resource Management and Renderer Design

Profiling and Debugging Games on Mobile Platforms

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

CS 563 Advanced Topics in Computer Graphics QSplat. by Matt Maziarz

GeForce4. John Montrym Henry Moreton

Optimisation. CS7GV3 Real-time Rendering

Enabling immersive gaming experiences Intro to Ray Tracing

GUERRILLA DEVELOP CONFERENCE JULY 07 BRIGHTON

Michal Valient Lead Tech Guerrilla Games

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

Table of Contents. Questions or problems?

TSBK03 Screen-Space Ambient Occlusion

PowerVR Performance Recommendations. The Golden Rules

Building scalable 3D applications. Ville Miettinen Hybrid Graphics

Enhancing Traditional Rasterization Graphics with Ray Tracing. October 2015

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

A Reconfigurable Architecture for Load-Balanced Rendering

3/1/2010. Acceleration Techniques V1.2. Goals. Overview. Based on slides from Celine Loscos (v1.0)

Hardware-driven Visibility Culling Jeong Hyun Kim

Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques

Addressing the Memory Wall

Lecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts

Here s the general problem we want to solve efficiently: Given a light and a set of pixels in view space, resolve occlusion between each pixel and

Mali Developer Resources. Kevin Ho ARM Taiwan FAE

Could you make the XNA functions yourself?

2D rendering takes a photo of the 2D scene with a virtual camera that selects an axis aligned rectangle from the scene. The photograph is placed into

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

A Real-time Micropolygon Rendering Pipeline. Kayvon Fatahalian Stanford University

Real-Time Graphics Architecture. Kurt Akeley Pat Hanrahan. Ray Tracing.

Gettin Procedural. Jeremy Shopf 3D Application Research Group

Scene Management. Video Game Technologies 11498: MSc in Computer Science and Engineering 11156: MSc in Game Design and Development

ARM Multimedia IP: working together to drive down system power and bandwidth

Vulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics.

There are two lights in the scene: one infinite (directional) light, and one spotlight casting from the lighthouse.

Attention to Detail! Creating Next Generation Content For Radeon X1800 and beyond

Mobile Performance Tools and GPU Performance Tuning. Lars M. Bishop, NVIDIA Handheld DevTech Jason Allen, NVIDIA Handheld DevTools

Deferred Rendering Due: Wednesday November 15 at 10pm

Goal. Interactive Walkthroughs using Multiple GPUs. Boeing 777. DoubleEagle Tanker Model

Applications of Explicit Early-Z Z Culling. Jason Mitchell ATI Research

ICS RESEARCH TECHNICAL TALK DRAKE TETREAULT, ICS H197 FALL 2013

Specialized Acceleration Structures for Ray-Tracing. Warren Hunt

PLAYSTATION Edge. Mark Cerny Jon Olick Vince Diesi

Direct Rendering of Trimmed NURBS Surfaces

PowerVR Performance Recommendations The Golden Rules. October 2015

Terrain Rendering Research for Games. Jonathan Blow Bolt Action Software

About Phoenix FD PLUGIN FOR 3DS MAX AND MAYA. SIMULATING AND RENDERING BOTH LIQUIDS AND FIRE/SMOKE. USED IN MOVIES, GAMES AND COMMERCIALS.

Occluder Simplification using Planar Sections

Maximizing Face Detection Performance

Deferred Splatting. Gaël GUENNEBAUD Loïc BARTHE Mathias PAULIN IRIT UPS CNRS TOULOUSE FRANCE.

Real-Time Universal Capture Facial Animation with GPU Skin Rendering

Deus Ex is in the Details

Breaking Down Barriers: An Intro to GPU Synchronization. Matt Pettineo Lead Engine Programmer Ready At Dawn Studios

Getting fancy with texture mapping (Part 2) CS559 Spring Apr 2017

Sparkling Effect. February 2007 WP _v01

Adaptive Point Cloud Rendering

Hello, Thanks for the introduction

Performance Analysis and Culling Algorithms

V-RAY 3.6 FOR RHINO KEY FEATURES. January 2018

CS 3733 Operating Systems:

Illumination and Geometry Techniques. Karljohan Lundin Palmerius

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

Out-Of-Core Sort-First Parallel Rendering for Cluster-Based Tiled Displays

UI Elements. If you are not working in 2D mode, you need to change the texture type to Sprite (2D and UI)

Rasterization and Graphics Hardware. Not just about fancy 3D! Rendering/Rasterization. The simplest case: Points. When do we care?

FILE SYSTEMS. CS124 Operating Systems Winter , Lecture 23

Broken Age's Approach to Scalability. Oliver Franzke Lead Programmer, Double Fine Productions

Philipp Slusallek Karol Myszkowski. Realistic Image Synthesis SS18 Instant Global Illumination

CS 354R: Computer Game Technology

Course Recap + 3D Graphics on Mobile GPUs

LPGPU Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM 2014)

The Ultimate Developers Toolkit. Jonathan Zarge Dan Ginsburg

ArcGIS Runtime: Maximizing Performance of Your Apps. Will Jarvis and Ralf Gottschalk

B-KD Trees for Hardware Accelerated Ray Tracing of Dynamic Scenes

Progressive Mesh. Reddy Sambavaram Insomniac Games

A Bandwidth Effective Rendering Scheme for 3D Texture-based Volume Visualization on GPU

Lecture 25: Board Notes: Threads and GPUs

Optimizing Games for ATI s IMAGEON Aaftab Munshi. 3D Architect ATI Research

Graphics Processing Unit Architecture (GPU Arch)

Transcription:

FORZA MOTORSPORT From Zero to 200MPH Chris Tector (Software Architect Turn 10 Studios)

Turn 10 Internal studio at Microsoft Game Studios - we make Forza Motorsport Around 70 full time staff 2

Why am I here? Our goals at Turn 10 The massive model visualization hierarchy Our pipeline, from preprocessing to the runtime 3

Why are you here? Learn about streaming Typical features in a system capable of streaming massive environments Understand the importance of optimization in processing streaming content Practical takeaways for your game Primarily presented as a general system But there are some 360 specific features which are pointed out as they are encountered 4

At Turn 10 GOALS 5

Streaming Rendering at 60fps Track, 8 cars and UI Post processing, reflections, shadows Particles, skids, crowds Split-screen, replays 6

Massive Environments Over 100 tracks, some up to 13 miles long Over 47000 models and over 60000 textures 7

Zero Looks great when standing still All detail in there when in game or photo mode Especially the track since it is the majority of the screen 8

200 Looks great at high speeds All detail is there when in game or replay mode, UGC video Again, especially the track 9

Running Example Le Mans is an 8.4 mile long track It has roughly 6000 models and 3000 textures As this talk goes on we can track how much data is streamed Data streamed : 13.3 Miles driven 1.6 Laps 0.98 GB Loaded 0.14 GB Mesh 0.84 GB Texture 10

Factors to Optimize for Minimize Size on disk (especially when shipping large amounts of content) Size in memory Maximize Disk to memory rate Memory to processor rate All while maximizing quality 11

The Hierarchy MASSIVE MODEL VISUALIZATION 12

Massive Model Visualization in Research Most relevant area to search Good course notes from Siggraph 2007 http://www.siggraph.org/s2007/attendees/courses/4.html But a lot of real time options in the literature aren t game real time 13

Typical massive model visualization hierarchy GPU/GPU Speed GPU/CPU Caches Decompressed Heap Space Compressed Cache Disk/Local Storage 14

Disk Stored on zip disk in packages We store some extra data in zip format, but honor base format so standard browsing tools all still work (explorer, WinZip, etc.) Stored in LZX format inside the archive 90-300MB per track GPU/GPU GPU/CPU Caches Decompressed Heap Compressed Cache Disk/Local Storage 15

Disk to Compressed Cache Fast IO in cache block sizes Block is a group of files within the zip Total up size of files until block size is reached Retrieve that file group with a single read Compressed cache reduces seeks 15MB/s peak 10MB/s average But 100ms seeks GPU/GPU GPU/CPU Caches Decompressed Heap Compressed Cache Disk/Local Storage 16

Compressed Cache LZX format in-memory storage Cache blocks streamed in on demand and out LRU 56 MB Block sizes tuned per track, but typically 1 MB GPU/GPU GPU/CPU Caches Decompressed Heap Compressed Cache Disk/Local Storage 17

Compressed Cache to Heaps Fast platform specific decompression 20 MB/s average Heap implementation Optimized for speed of alloc and free operations Good fragmentation characteristics using address ordered first-fit GPU/GPU GPU/CPU Caches Decompressed Heap Compressed Cache Disk/Local Storage 18

Decompressed Heap Ready for GPU or CPU to consume Contiguous and aligned per allocation 194MB GPU/GPU GPU/CPU Caches Decompressed Heap Compressed Cache Disk/Local Storage 19

Multiple Levels of Texture Storage Three views of each texture Top Mip: Mip 0, the full resolution texture Mip Chain: Mip 1 down to 1x1 Small Texture: 32x32 down to 1x1 Platform specific support here to not require relocating textures as top mip is streamed in GPU/GPU GPU/CPU Caches Decompressed Heap Compressed Cache Disk/Local Storage 20

Multiple Levels of Geometry Storage LOD We consider different LODs as different objects to allow streaming to dump higher LODs when they wouldn t contribute Instances Models are instanced with per instance transform and shader data GPU/GPU GPU/CPU Caches Decompressed Heap Compressed Cache Disk/Local Storage 21

Memory to GPU/CPU Cache CPU specific optimizations for cache friendly rendering High frequency operations have flat, cache line sized structures L1/L2 Caches for CPU Heavy use of command buffers to avoid touching unnecessary render data GPU/GPU GPU/CPU Caches Decompressed Heap Compressed Cache Disk/Local Storage 22

GPU/CPU Caches Right sizing of formats relative to shader needs Vertex/texture fetch caches for GPU Vertex formats, stream counts Texture formats, sizes, mip usage Use of platform specific render controls to reduce mip access, etc. GPU/GPU GPU/CPU Caches Decompressed Heap Compressed Cache Disk/Local Storage 23

Running Example Data streamed : 66.8 Miles driven 7.9 Laps 4.9 GB Loaded 0.7 GB Mesh 4.2 GB Texture 24

The Pipeline BREAK IT DOWN 25

Pre-Computed Visibility Standard Solution Given a scene what is actually visible at a given location Many implementations use conservative occlusion Our Variant Includes Occlusion (depth buffer rejection) LOD selection Contribution Rejection (Don t draw model if less than n pixels) 26

Culling Given this View Occlusion culled (square) Other objects block this in the view Contribution culled (circle) This object does not contribute enough to the view 27

Could do it at Runtime LOD and contribution are easy, occlusion can be implemented Most importantly would have to optimize in runtime Or not do it at all, but that means streaming and rendering too much Visibility information is typically a large amount of data Which means touching a large amount of data Which is bad for cache performance Our solution: don t spend CPU/GPU on an essentially offline process 28

Pipeline Our track processing pipeline is broken into 5 major steps Sampling Splitting Building Optimization Runtime All of this is fully automated Art checks in source scenes Pipeline produces optimized game ready tracks SAMPLING SPLITTING BUILDING OPTIMIZATION RUNTIME 29

Linearize the Space Track is broken up into zones using AI linear view of track Art generates inner and outer splines for track Tools fit a central spline and normalize the space Waypoints are generated at regular intervals along the central spline Zone boundaries are set every n waypoints Runtime Sample points are evenly distributed within the zones SAMPLING SPLITTING BUILDING OPTIMIZATION RUNTIME 30

Track Space Track Zone Waypoint Sample 31

How do we Sample Environment is sampled along track surface only and at a limited height Track is rendered from four views at each sample point Oriented to local track space Sampled values stored at each sample point Also stored at neighboring sample points This is to reduce visibility pops when moving between samples SAMPLING SPLITTING BUILDING OPTIMIZATION RUNTIME 32

Sampling Render all models to depth Run using position only mesh version of each model on entire track Render each individual model inside a D3D occlusion query and store Object ID Location of the camera during rendering Pixel count SAMPLING SPLITTING BUILDING OPTIMIZATION RUNTIME This includes LOD, occlusion and contribution culling 33

Size Reduction Sample data is enormous Contains visibility of every model at every sample point Combine all samples to reduce data required for further processing We condense it down to a list of visibility of models for each zone Keep track of the per model maximum pixel counts, not just binary visibility The pixel counts are the real value! SAMPLING SPLITTING BUILDING OPTIMIZATION RUNTIME Most data is used during pre-processing and then thrown out or drastically reduced for the runtime 34

Splitting Breaks large artist meshes down to object level Example: an entire corner can be modeled and instanced into the track Break model down against a world grid Clusters objects seen together into single models This stage represents workflow balance: The further we move towards procedural data providing greater opportunities for instancing, the less this step does since we don t split instanced objects But it provides workflow savings when modeling large amounts unique geometry SAMPLING SPLITTING BUILDING OPTIMIZATION RUNTIME 35

Building Geometry Collect common geometry per model to reduce draws Create texture and shader usage Textures Removal of duplicates at multiple levels Similar source Cross texture comparisons Cross mip comparisons Compression settings Small Texture Set Holds the small texture for all textures used in the environment (mip chain from 32x32 down to 1x1) Only example of preloaded models or textures 20-60MB SAMPLING SPLITTING BUILDING OPTIMIZATION RUNTIME 36

Optimization Accumulate the working set For the three zones centered on the camera Accumulate the list of models that are visible Based on the set of visible models, generate a visible texture set using the per model texture usage data from the build phase Order the texture list by maximum pixel count of all models which use a particular texture SAMPLING SPLITTING BUILDING OPTIMIZATION RUNTIME 37

PIXEL COUNT Working Set PREVIOUS ZONE CURRENT ZONE NEXT ZONE MODEL LIST MODEL LIST MODEL LIST TEXTURE LIST TEXTURE LIST TEXTURE LIST TEXTURE WORKING SET UNION INTO SET Texture working set holds textures in pixel count order using maximum pixel count from all models which use a texture 38

Optimization Mechanism Removal of textures only Geometry removed by sampling Remove textures at two levels Drop top mip means the texture rendered will only come from mip 1 and lower Drop mip chain means the texture rendered will only come from the small texture set Texture level is removed from texture lists in zones contributing to the working set SAMPLING SPLITTING BUILDING OPTIMIZATION RUNTIME 39

Optimization Criteria Multiple Reduction Passes Trivial Reduction Based on mip Size Object pixel count vs. total pixels in the small texture I.e. Is object pixel count < 32*32? SAMPLING SPLITTING Total Working Set Memory Size Sum of model and texture sizes vs. decompressed heap size Remove top mips or mip-chains in increasing pixel count order BUILDING Total Streaming Bandwidth Compute the difference of the working set of zone n and the working set of zone n+1 Sum of model and texture sizes in working set delta vs. streaming bandwidth (Assuming zone physical size and maximum racing speed you can calculate the time allowed to stream the set) OPTIMIZATION RUNTIME 40

Running Example A single zone transition can vary widely due to occlusion behavior 0.33/2.22 MB Mesh avg/max 1.87/17.9 MB Texture avg/max Data streamed : 120 Miles driven 14 Laps 8.8 GB Loaded 1.3 GB Mesh 7.5 GB Texture 41

Optimization Create a cache efficient order for the package To reduce seek distance and increase cache hit rate We use a first seen metric Walk over the zones and track which zone is the first to use a model or texture Group all models together and order by first zone, same with textures SAMPLING SPLITTING BUILDING OPTIMIZATION RUNTIME 42

Runtime Create delta of zones Decide where camera is in visibility space Map camera position to zones to load Difference of currently loaded zones and zones to load Create delta of resources based on zone deltas Basically reference counting Consolidate the work to ensure free first ordering (this is to help with fragmentation) Stream out (free) data in trailing zones Stream in (allocation, IO and decompress) data in leading zones SAMPLING SPLITTING BUILDING OPTIMIZATION RUNTIME 43

Runtime Considerations Key Areas Work ordering Heap efficiency Decompression efficiency Disk efficiency For many problems any solution is better than doing nothing Make sure all levels of the hierarchy have been addressed SAMPLING SPLITTING BUILDING OPTIMIZATION RUNTIME 44

Flythrough Demo 45

Errors Popping Limited to two Classes Late Arrival (Tuned by limiting the amount needed per zone to stay within the system throughput) Visibility Errors (Tuned by further clustering objects or biasing the sampling results) These tunings conflict though We provide manual overrides Geometry Bias (Affects sampling results) Texture Bias (Affects position in texture working sets during optimization) No amount of automation can compete with unrealistic expectations Example: all models are visible in a single zone means there won t be space for any textures 46

Future Directions Non-linear streaming Integration in sampling, optimization and runtime Domain specific decompression Procedural generation Texture transcoding Streaming over the wire Missing piece of the massive model visualization hierarchy 47

Finally Data streamed: 147 Miles driven 17.5 Laps 10.8 GB Loaded 1.5 GB Mesh 9.3 GB Texture 48

Questions? 49

FORZA MOTORSPORT From Zero to 200MPH Chris Tector (Software Architect Turn 10 Studios)