Order Matters in Resource Creation

William Damon
ATI Research, Inc.
wdamon@ati.com

Introduction

Latencies attributed to loading resources on the fly can seriously impact runtime performance. We typically avoid these hiccups by creating or loading resources ahead of time, before we need to use them; but even then, the physical locations in which our resources ultimately reside can have a serious impact on overall performance. In this article, we introduce and reinforce some common guidelines to keep in mind when setting up render resources.

Depth/Stencil early

Aside from the obvious resources that are created with the rendering context (i.e. the backbuffer and the optional depth-stencil surface), the best surfaces to create first are additional depth-stencil surfaces and render targets. The APIs generally limit the format and size of depth-stencil surfaces to match those of the co-bound render target surfaces, so the best thing to do is to create one depth-stencil surface for each render target format/size combination for which the application requires such a resource. If the application won't be writing depth or stencil information for a particular render target format/size, then there is no need to create a corresponding depth-stencil surface. Generally, depth-stencil buffers can be shared across corresponding render targets or render passes, so there is no need to create a unique depth-stencil buffer per render target. In practice, most applications require only one, maybe two, depth-stencil buffers in addition to the default one that corresponds to the backbuffer. Create these depth-stencil surfaces in order of importance to the application. The ordering matters because when depth-stencil buffers are created first (or at least very early), the driver can allocate them in the best location in local video memory, where the buffers benefit from Hyper-Z technology.
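The effect of creation order can be illustrated with a toy first-fit allocator. This is purely a sketch: the struct, sizes, and placement rule below are invented for illustration and bear no resemblance to how a real driver manages video memory.

```cpp
#include <cstdint>
#include <string>

// Toy model: a fixed amount of fast local video memory (LVM);
// allocations that do not fit spill to slower non-LVM. All names and
// sizes here are illustrative assumptions, not real driver behavior.
struct ToyVideoMemory {
    uint64_t lvmFree;  // bytes of fast LVM remaining

    // Returns where the resource lands under a first-fit policy.
    std::string create(uint64_t bytes) {
        if (bytes <= lvmFree) { lvmFree -= bytes; return "LVM"; }
        return "non-LVM";  // spilled: no Hyper-Z benefit here
    }
};

// Creating the depth-stencil surface early secures fast memory for it;
// creating it late lets other resources claim that memory first.
inline std::string depthStencilPlacement(bool depthFirst) {
    const uint64_t MB = 1024 * 1024;
    ToyVideoMemory vm{32 * MB};          // assume 32 MB of fast LVM
    if (depthFirst) {
        std::string where = vm.create(24 * MB);  // depth-stencil early
        vm.create(24 * MB);              // later resources spill instead
        return where;
    }
    vm.create(24 * MB);                  // other resources created first...
    return vm.create(24 * MB);           // ...so the depth-stencil spills
}
```

Under this (admittedly simplistic) policy, the same depth-stencil surface lands in fast memory or spills depending only on when it was created.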
Render targets also early

As stated above, the other resources to create as early as possible are any additional off-screen render targets the application will require. Sometimes it is not possible to know how many additional render targets will be required, or even what format(s) they should take on. In that case, a good approach is to use the best heuristics available to make an educated guess as to what might be needed throughout the current resource pool lifetime (e.g. through a single level in a game). Obviously, the tradeoff here is that the application may end up creating render targets that it never uses, wasting valuable memory. This usage pattern, however, may be indicative of a larger problem, and the application architect(s) might consider revisiting the design. Alternatively, this approach may be the only way to implement an algorithm or solve a particular problem. In that case, the best the driver can ask for is that the application creates its render targets as early as possible.

Keep an eye on how many render targets are created, along with the size and format of those surfaces. A fullscreen render target at 1024x768 using 8 bits per channel consumes roughly 3MB of space. Add multisampling to that, along with the corresponding multisampled depth-stencil buffer, and you're up to potentially 24MB for one render target! This is why a clear understanding of which off-screen render target formats and sizes will be required, and how many render targets must coexist simultaneously, is important. Creating these surfaces as early as possible allows the driver to place them in the appropriate memory location before spilling into less optimal locations. Finally, on the topic of render targets, the order of creation in terms of the formats and sizes used does not make much difference; the application will most likely benefit the most by allocating the most commonly used render targets first, followed by those less used.

LVM followed by non-LVM, then system memory

Okay, now that the depth-stencil surfaces and render targets are allocated, the next best things to create are those resources that the application would prefer to have in local video memory (LVM), followed by those that live in non-local video memory, and finally those that reside in system memory. In Direct3D terminology this translates into allocating D3DPOOL_DEFAULT resources followed by D3DPOOL_MANAGED resources. Actually, managed resources aren't loaded into LVM (or even non-LVM) until they are needed, so really the application should create default pool resources before managed resources are paged in. The most effective way to ensure this is to create the default pool resources first, or evict managed resources immediately beforehand. Vertex and index buffers are good things to allocate at this point, as are textures. If an application is really pressed for memory, the ordering of textures versus geometry buffers may differ from one application to another based on usage patterns.
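As an aside, the render-target footprint figures quoted earlier can be verified with a few lines of arithmetic. The 4x sample count and the 4-byte depth-stencil format (e.g. D24S8) below are assumptions chosen to match the rough numbers in the text:

```cpp
#include <cstdint>

// Back-of-the-envelope footprint of a 1024x768 render target at
// 8 bits per channel (4 bytes per pixel).
constexpr uint64_t kWidth = 1024, kHeight = 768;
constexpr uint64_t kBytesPerPixel = 4;  // 8 bits x 4 channels

constexpr uint64_t plainTargetBytes() {
    return kWidth * kHeight * kBytesPerPixel;  // exactly 3 MB
}

constexpr uint64_t multisampled4xBytes() {
    const uint64_t samples = 4;                                // assumed 4x MSAA
    const uint64_t color = plainTargetBytes() * samples;       // 12 MB
    const uint64_t depth = kWidth * kHeight * 4 * samples;     // 12 MB, 4-byte D24S8
    return color + depth;                                      // 24 MB
}
```

So a single multisampled render target plus its depth-stencil buffer really does weigh in at eight times the plain surface.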
One application that uses a lot of geometry but only a few textures may want to make sure all that geometry resides in LVM, whereas another application that uses less geometry but constantly switches textures can afford the latency of fetching geometry from non-LVM memory while performing expensive pixel operations.

When to go static

Sometimes deciding whether a vertex or index buffer should be static or dynamic can be confusing. Adding to the confusion is the fact that index buffers behave slightly differently than vertex buffers. Here we will try to dispel some rumors and provide a bit of information as to how different buffers should be allocated. While this information is Direct3D-centric, the same concepts apply in OpenGL. Before we begin, however, let's get a bit of terminology out of the way. The term "static" has multiple meanings, so we cannot blindly say that locking a static buffer is bad. Buffers created without D3DUSAGE_DYNAMIC are not necessarily static, either, as far as the driver is concerned (regardless of vendor). Remember to keep this distinction in mind as we wade through the following discussion.

Index Buffers

Index buffers can be allocated in three memory pools: D3DPOOL_DEFAULT, D3DPOOL_MANAGED, or D3DPOOL_SYSTEMMEM. While system memory pools suffer from a little extra overhead when copying data to the hardware, just about everything else about them causes no confusion or problems, because everything happens in system memory. Consequently, we focus our discussion here on the default and managed pools.

The default pool

An index buffer created in the default pool can be given various usage flags at creation time:

D3DUSAGE_WRITEONLY - If not set, the driver will not create an LVM buffer. Instead, the Direct3D runtime will create a system memory copy of the resource to be flushed to the GPU upon first use (after each update).

D3DUSAGE_DYNAMIC - This flag indicates that the data in the buffer will change frequently. In particular, it stipulates that the contents presently being used in rendering will change frequently. Some drivers do not create a video memory surface in this case, in favor of allowing the Direct3D runtime to create a system memory copy to which the CPU has direct write-combining access.

If D3DUSAGE_WRITEONLY is set without D3DUSAGE_DYNAMIC, current drivers will try to create an LVM buffer. If this fails, the driver must try to fall back to non-LVM. Now, whenever the application locks a default pool index buffer that resides in video memory, the driver receives a lock call. A well-written application will use one of D3DLOCK_DISCARD or D3DLOCK_NOOVERWRITE. D3DLOCK_DISCARD indicates to the driver that it is safe to perform index buffer renaming (i.e. allocate or return another internal buffer without stalling). D3DLOCK_NOOVERWRITE signals that the application is not going to overwrite any of the contents already written (i.e. the driver is safe to return a pointer into the index buffer without stalling). In either case, the driver does not have to stall, and the application need only write the data that it is updating. Failure to use these locking flags appropriately will cause the driver to stall until the current contents of the index buffer are done rendering.

The managed pool

An index buffer created in the managed pool cannot be marked with the usage flag D3DUSAGE_DYNAMIC; the Direct3D runtime disallows it.
Also, there is no such thing as D3DUSAGE_STATIC at the API level, which makes life a little more interesting for the driver. When an index buffer is created in the managed pool, the Direct3D runtime creates the resource in system memory. All application lock calls affect only this system memory copy, and all updates happen here. The first time an unlock is made, the Direct3D runtime calls the driver and attempts to create a write-only buffer in video memory that represents the managed buffer. Different vendors' drivers use varying heuristics to determine whether this means allocating space in LVM or non-LVM. The nice thing about lock calls on managed pool resources is that they provide parameters for the offset and size to lock, making things a bit simpler for the driver. Upon drawing with the index buffer, the runtime presents the driver with some information about how best to transfer the data from the updated system memory host copy into the video memory draw copy. Depending on where the resource ended up residing, various optimized copying mechanisms can be invoked.
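The managed-pool flow just described (updates land in a system-memory copy; a video-memory copy is created lazily and refreshed from only the dirty range at draw time) can be mocked in a few lines. Everything here is illustrative; no real Direct3D objects or runtime behavior are involved:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy managed-pool buffer: "Lock/Unlock" touch only the system-memory
// copy and widen a dirty range; the "video memory" copy is created on
// first draw and refreshed from just the dirty bytes.
struct ManagedBuffer {
    std::vector<uint8_t> sysmem;   // always-present host copy
    std::vector<uint8_t> vidmem;   // created lazily at draw time
    size_t dirtyBegin = SIZE_MAX, dirtyEnd = 0;

    explicit ManagedBuffer(size_t size) : sysmem(size) {}

    // Models Lock(offset, size) + write + Unlock on the host copy.
    void write(size_t offset, const std::vector<uint8_t>& data) {
        std::copy(data.begin(), data.end(), sysmem.begin() + offset);
        dirtyBegin = std::min(dirtyBegin, offset);
        dirtyEnd = std::max(dirtyEnd, offset + data.size());
    }

    // Models a draw call: allocate the video copy if needed, then
    // upload only the bytes that changed since the last draw.
    void draw() {
        if (vidmem.empty()) vidmem.resize(sysmem.size());
        for (size_t i = dirtyBegin; i < dirtyEnd; ++i) vidmem[i] = sysmem[i];
        dirtyBegin = SIZE_MAX;
        dirtyEnd = 0;
    }
};
```

The offset/size parameters on managed-pool locks are exactly what lets the real runtime keep a dirty range this narrow instead of re-uploading the whole buffer.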

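Before the concrete scenarios below, the default-pool locking discipline described earlier is worth sketching: append with D3DLOCK_NOOVERWRITE while there is room left in the buffer, and switch to D3DLOCK_DISCARD (letting the driver rename the buffer) once it fills. The mock below uses a stand-in enum and makes no real Direct3D calls:

```cpp
#include <cstdint>

// Stand-in for D3DLOCK_DISCARD / D3DLOCK_NOOVERWRITE; illustrative only.
enum class LockFlag { Discard, NoOverwrite };

struct DynamicBufferCursor {
    uint32_t capacity;    // total buffer size in bytes
    uint32_t cursor = 0;  // next free byte

    // Choose a lock flag for an append of `bytes` bytes: NOOVERWRITE
    // while appending past data possibly still in flight, DISCARD
    // (rename the buffer, restart at the beginning) once it is full.
    LockFlag lockForAppend(uint32_t bytes) {
        if (cursor + bytes > capacity) {
            cursor = bytes;  // renamed buffer: write starts at offset 0
            return LockFlag::Discard;
        }
        cursor += bytes;
        return LockFlag::NoOverwrite;
    }
};
```

Either flag lets the driver avoid a stall, which is the whole point: the GPU keeps consuming the old contents (or the old renamed buffer) while the CPU writes new data.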
Note that allocating non-D3DUSAGE_DYNAMIC index buffers that exhibit dynamic behavior can sometimes be a win, especially on CrossFire (or similar) configurations. Now, with all this background information in mind, here are three common index buffer usage scenarios and some advice on how to allocate index buffers.

An index buffer that only requires updates to areas that haven't already been written and aren't currently used in a draw: In this case, the application should use the default pool and manage the locks with the locking flags described above. The dynamic usage flag should not be used. Alternatively, the managed pool is not a bad option; it requires a bit more CPU work, and even some GPU overhead, but there shouldn't be any hardware stalls.

An index buffer that requires updates to areas that have already been written and have been used in a draw call: Here the application can go ahead and create the buffer in the default pool with the dynamic usage flag. Locks will likely not be expensive, even though the locking flags are invalid for D3DUSAGE_DYNAMIC buffers, because locking should happen on a system memory buffer. Again, the managed pool isn't a bad alternative, and for cases in which the index buffer is updated once for several draw calls, the managed pool might be a better approach. Your mileage may vary.

An index buffer that requires updates to the entire buffer every draw in which it's used: Definitely use the default pool and manage the locks with the appropriate locking flags in this case. Do not set the dynamic usage flag, however. The managed pool is not a good option in this scenario.

Vertex Buffers

Vertex buffer creation and usage generally follows the same guidelines as index buffers, though the drivers and even the hardware may handle things a bit differently internally. Drivers generally try to create all vertex buffers in video memory, and dynamic vertex buffers generally end up in non-LVM for better CPU access.
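The scenario advice above (which, per the vertex buffer note, largely applies to vertex buffers as well) can be condensed into a small decision helper. The enum, struct, and function names are invented for this sketch; the strings simply echo the article's recommendations:

```cpp
#include <string>

// Invented names summarizing the article's three usage scenarios.
enum class UpdatePattern {
    FreshRegionsOnly,     // never rewrite data already used in a draw
    OverwriteInFlight,    // rewrite regions that may still be in flight
    WholeBufferEveryDraw  // refill the entire buffer for every draw
};

struct BufferAdvice {
    std::string pool;  // recommended creation pool
    bool dynamicFlag;  // whether to pass D3DUSAGE_DYNAMIC
    bool managedOk;    // whether D3DPOOL_MANAGED is a sane alternative
};

inline BufferAdvice adviseIndexBuffer(UpdatePattern p) {
    switch (p) {
        case UpdatePattern::FreshRegionsOnly:
            return {"D3DPOOL_DEFAULT", false, true};   // lock with NOOVERWRITE/DISCARD
        case UpdatePattern::OverwriteInFlight:
            return {"D3DPOOL_DEFAULT", true, true};    // dynamic; sysmem-backed locks
        case UpdatePattern::WholeBufferEveryDraw:
        default:
            return {"D3DPOOL_DEFAULT", false, false};  // DISCARD each refill
    }
}
```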
It is strongly recommended NOT to place vertex buffers (static or dynamic) in the system memory pool. The Direct3D runtime behaves essentially the same for vertex and index buffers.

Other tidbits

Always be sure to use the appropriate flags when creating resources through the API. The flags provide tremendous insight to the driver as to how the resource being created will be used, giving it clear direction as to the best location for that resource for optimal performance. Also, avoid creating and destroying resources on the fly, per frame. Resource allocation has tremendous overhead, comparatively, and this behavior can cause fragmentation and other memory-related problems. Occasionally it makes sense to evict all the managed resources from video memory, such as when switching levels or worlds in a game, and Direct3D provides an API for this (IDirect3DDevice9::EvictManagedResources). Performing this eviction will clean up much of the fragmentation that may have built up throughout the last level, and provide a clean slate for the next one. Lastly, should memory be a point of contention for your application, consider using ATI's plug-in for PIX, or a similar tool, to understand when and how many resources are being created, and what they are used for. Note that the PIX plug-in can also give you useful information as to how well managed vertex/index buffers and textures are behaving. Generally, knowing how video memory is utilized by an application and optimizing resource allocations can go a long way toward providing (or at least setting the stage for) the best runtime performance.

References

The ATI plug-in for PIX: /atipix/index.html

Acknowledgements

The author of this paper wishes to thank Tim Kelley of ATI Technologies Inc. for his great patience and detailed explanations, and the ATI ISV Engineering and Application Research teams for their comments and contributions.