Order Matters in Resource Creation

William Damon
ATI Research, Inc.
wdamon@ati.com

Introduction

Latencies attributed to loading resources on the fly can seriously impact runtime performance. We typically avoid these hiccups by creating or loading resources ahead of time, before we need to use them; but even then, the physical locations in which our resources ultimately reside can have a serious impact on overall performance. In this article, we introduce and reinforce some common guidelines to keep in mind when setting up render resources.

Depth/Stencil early

Aside from the obvious resources that are created with the rendering context (i.e. the backbuffer and the optional depth-stencil surface), the best surfaces to create first are additional depth-stencil surfaces and render targets. The APIs generally limit the format and size of depth-stencil surfaces to match those of the co-bound render target surfaces, so the best thing to do is to create one depth-stencil surface for each render target format/size combination for which the application requires such a resource. If the application won't be writing depth or stencil information for a particular render target format/size, then there is no need to create a corresponding depth-stencil surface. Generally, depth-stencil buffers can be shared across corresponding render targets or render passes, so there is no need to create a unique depth-stencil buffer per render target. In practice, most applications require only one, maybe two, depth-stencil buffers in addition to the default one that corresponds to the backbuffer. Create these depth-stencil surfaces in order of importance to the application. The ordering matters because when depth-stencil buffers are created first (or at least very early), the driver can allocate them in the best location in local video memory, where the buffers benefit from Hyper-Z technology.
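The effect of creation order can be illustrated with a toy first-fit allocator. This is purely a sketch: the struct, sizes, and placement rule below are invented for illustration and bear no resemblance to how a real driver manages video memory.

```cpp
#include <cstdint>
#include <string>

// Toy model: a fixed amount of fast local video memory (LVM);
// allocations that do not fit spill to slower non-LVM. All names and
// sizes here are illustrative assumptions, not real driver behavior.
struct ToyVideoMemory {
    uint64_t lvmFree;  // bytes of fast LVM remaining

    // Returns where the resource lands under a first-fit policy.
    std::string create(uint64_t bytes) {
        if (bytes <= lvmFree) { lvmFree -= bytes; return "LVM"; }
        return "non-LVM";  // spilled: no Hyper-Z benefit here
    }
};

// Creating the depth-stencil surface early secures fast memory for it;
// creating it late lets other resources claim that memory first.
inline std::string depthStencilPlacement(bool depthFirst) {
    const uint64_t MB = 1024 * 1024;
    ToyVideoMemory vm{32 * MB};          // assume 32 MB of fast LVM
    if (depthFirst) {
        std::string where = vm.create(24 * MB);  // depth-stencil early
        vm.create(24 * MB);              // later resources spill instead
        return where;
    }
    vm.create(24 * MB);                  // other resources created first...
    return vm.create(24 * MB);           // ...so the depth-stencil spills
}
```

Under this (admittedly simplistic) policy, the same depth-stencil surface lands in fast memory or spills depending only on when it was created.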
Render targets also early

As stated above, the other resources to create as early as possible are any additional off-screen render targets the application will require. Sometimes it is not possible to know how many additional render targets will be required, or even what format(s) they should take on. In that case, a good approach is to use the best heuristics available to make an educated guess as to what might be needed throughout the current resource pool lifetime (e.g. through a single level in a game). Obviously, the tradeoff here is that the application may end up creating render targets that it never uses, wasting valuable memory. This usage pattern, however, may be indicative of a larger problem, and the application architect(s) might consider revisiting the design. Alternatively, this approach may be the only way to implement an algorithm or solve a particular problem. In that case, the best the driver can ask for is that the application creates its render targets as early as possible.

Keep an eye on how many render targets are created, along with the size and format of those surfaces. A fullscreen render target at 1024x768 using 8 bits per channel consumes roughly 3MB of space. Add multisampling to that, along with the corresponding multisampled depth-stencil buffer, and you're up to potentially 24MB for one render target! This is why a clear understanding of which off-screen render target formats and sizes will be required, and how many render targets must coexist simultaneously, is important. Creating these surfaces as early as possible allows the driver to place them in the appropriate memory location before spilling into less optimal locations. Finally, on the topic of render targets, the order of creation in terms of the formats and sizes used does not make much difference; the application will most likely benefit the most by allocating the most commonly used render targets first, followed by those less used.

LVM followed by non-LVM, then system memory

Okay, now that the depth-stencil surfaces and render targets are allocated, the next best things to create are those resources that the application would prefer to have in local video memory (LVM), followed by those that live in non-local video memory, and finally those that reside in system memory. In Direct3D terminology this translates into allocating D3DPOOL_DEFAULT resources followed by D3DPOOL_MANAGED resources. Actually, managed resources aren't loaded into LVM (or even non-LVM) until they are needed, so really the application should create default pool resources before managed resources are paged in. The most effective way to ensure this is to create the default pool resources first, or evict managed resources immediately beforehand. Vertex and index buffers are good things to allocate at this point, as are textures. If an application is really pressed for memory, the ordering of textures versus geometry buffers may differ from one application to another based on usage patterns.
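As an aside, the render-target footprint figures quoted earlier can be verified with a few lines of arithmetic. The 4x sample count and the 4-byte depth-stencil format (e.g. D24S8) below are assumptions chosen to match the rough numbers in the text:

```cpp
#include <cstdint>

// Back-of-the-envelope footprint of a 1024x768 render target at
// 8 bits per channel (4 bytes per pixel).
constexpr uint64_t kWidth = 1024, kHeight = 768;
constexpr uint64_t kBytesPerPixel = 4;  // 8 bits x 4 channels

constexpr uint64_t plainTargetBytes() {
    return kWidth * kHeight * kBytesPerPixel;  // exactly 3 MB
}

constexpr uint64_t multisampled4xBytes() {
    const uint64_t samples = 4;                                // assumed 4x MSAA
    const uint64_t color = plainTargetBytes() * samples;       // 12 MB
    const uint64_t depth = kWidth * kHeight * 4 * samples;     // 12 MB, 4-byte D24S8
    return color + depth;                                      // 24 MB
}
```

So a single multisampled render target plus its depth-stencil buffer really does weigh in at eight times the plain surface.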
One application that uses a lot of geometry but only a few textures may want to make sure all that geometry resides in LVM, whereas another application that uses less geometry but constantly switches textures can afford the latency of fetching geometry from non-LVM memory while performing expensive pixel operations.

When to go static

Sometimes deciding whether a vertex or index buffer should be static or dynamic can be confusing. Adding to the confusion is the fact that index buffers behave slightly differently than vertex buffers. Here we will try to dispel some rumors and provide a bit of information as to how different buffers should be allocated. While this information is Direct3D-centric, the same concepts apply in OpenGL. Before we begin, however, let's get a bit of terminology out of the way. The term "static" has multiple meanings, so we cannot blindly say that locking a static buffer is bad. Buffers created without D3DUSAGE_DYNAMIC are not necessarily static, either, as far as the driver is concerned (regardless of vendor). Remember to keep this distinction in mind as we wade through the following discussion.

Index Buffers

Index buffers can be allocated in three memory pools: D3DPOOL_DEFAULT, D3DPOOL_MANAGED, or D3DPOOL_SYSTEMMEM. While system memory pools suffer from a little extra overhead when copying data to the hardware, just about everything else about them causes no confusion or problems, because everything happens in system memory. Consequently, we focus our discussion here on the default and managed pools.

The default pool

An index buffer created in the default pool can be given various usage flags at creation time:

D3DUSAGE_WRITEONLY - If not set, the driver will not create an LVM buffer. Instead, the Direct3D runtime will create a system memory copy of the resource to be flushed to the GPU upon first use (after each update).

D3DUSAGE_DYNAMIC - This flag indicates that the data in the buffer will change frequently. In particular, it stipulates that the contents presently being used in rendering will change frequently. Some drivers do not create a video memory surface in this case, in favor of allowing the Direct3D runtime to create a system memory copy to which the CPU has direct write-combining access.

If D3DUSAGE_WRITEONLY is set without D3DUSAGE_DYNAMIC, current drivers will try to create an LVM buffer. If this fails, the driver must try to fall back to non-LVM. Now, whenever the application locks a default pool index buffer that resides in video memory, the driver receives a lock call. A well-written application will use one of D3DLOCK_DISCARD or D3DLOCK_NOOVERWRITE. D3DLOCK_DISCARD indicates to the driver that it is safe to perform index buffer renaming (i.e. allocate or return another internal buffer without stalling). D3DLOCK_NOOVERWRITE signals that the application is not going to overwrite any of the contents already written (i.e. the driver is safe to return a pointer into the index buffer without stalling). In either case, the driver does not have to stall, and the application need only write the data that it is updating. Failure to use these locking flags appropriately will cause the driver to stall until the current contents of the index buffer are done rendering.

The managed pool

An index buffer created in the managed pool cannot be marked with the usage flag D3DUSAGE_DYNAMIC; the Direct3D runtime disallows it.
Also, there is no such thing as D3DUSAGE_STATIC at the API level, which makes life a little more interesting for the driver. When an index buffer is created in the managed pool, the Direct3D runtime creates the resource in system memory. All application lock calls affect only this system memory copy, and all updates happen here. The first time an unlock is made, the Direct3D runtime calls the driver and attempts to create a write-only buffer in video memory that represents the managed buffer. Different vendors' drivers use varying heuristics to determine whether this means allocating space in LVM or non-LVM. The nice thing about lock calls on managed pool resources is that they provide parameters for the offset and size to lock, making things a bit simpler for the driver. Upon drawing with the index buffer, the runtime presents the driver with some information about how best to transfer the data from the updated system memory host copy into the video memory draw copy. Depending on where the resource ended up residing, various optimized copying mechanisms can be invoked.
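The managed-pool flow just described (updates land in a system-memory copy; a video-memory copy is created lazily and refreshed from only the dirty range at draw time) can be mocked in a few lines. Everything here is illustrative; no real Direct3D objects or runtime behavior are involved:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy managed-pool buffer: "Lock/Unlock" touch only the system-memory
// copy and widen a dirty range; the "video memory" copy is created on
// first draw and refreshed from just the dirty bytes.
struct ManagedBuffer {
    std::vector<uint8_t> sysmem;   // always-present host copy
    std::vector<uint8_t> vidmem;   // created lazily at draw time
    size_t dirtyBegin = SIZE_MAX, dirtyEnd = 0;

    explicit ManagedBuffer(size_t size) : sysmem(size) {}

    // Models Lock(offset, size) + write + Unlock on the host copy.
    void write(size_t offset, const std::vector<uint8_t>& data) {
        std::copy(data.begin(), data.end(), sysmem.begin() + offset);
        dirtyBegin = std::min(dirtyBegin, offset);
        dirtyEnd = std::max(dirtyEnd, offset + data.size());
    }

    // Models a draw call: allocate the video copy if needed, then
    // upload only the bytes that changed since the last draw.
    void draw() {
        if (vidmem.empty()) vidmem.resize(sysmem.size());
        for (size_t i = dirtyBegin; i < dirtyEnd; ++i) vidmem[i] = sysmem[i];
        dirtyBegin = SIZE_MAX;
        dirtyEnd = 0;
    }
};
```

The offset/size parameters on managed-pool locks are exactly what lets the real runtime keep a dirty range this narrow instead of re-uploading the whole buffer.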

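Before the concrete scenarios below, the default-pool locking discipline described earlier is worth sketching: append with D3DLOCK_NOOVERWRITE while there is room left in the buffer, and switch to D3DLOCK_DISCARD (letting the driver rename the buffer) once it fills. The mock below uses a stand-in enum and makes no real Direct3D calls:

```cpp
#include <cstdint>

// Stand-in for D3DLOCK_DISCARD / D3DLOCK_NOOVERWRITE; illustrative only.
enum class LockFlag { Discard, NoOverwrite };

struct DynamicBufferCursor {
    uint32_t capacity;    // total buffer size in bytes
    uint32_t cursor = 0;  // next free byte

    // Choose a lock flag for an append of `bytes` bytes: NOOVERWRITE
    // while appending past data possibly still in flight, DISCARD
    // (rename the buffer, restart at the beginning) once it is full.
    LockFlag lockForAppend(uint32_t bytes) {
        if (cursor + bytes > capacity) {
            cursor = bytes;  // renamed buffer: write starts at offset 0
            return LockFlag::Discard;
        }
        cursor += bytes;
        return LockFlag::NoOverwrite;
    }
};
```

Either flag lets the driver avoid a stall, which is the whole point: the GPU keeps consuming the old contents (or the old renamed buffer) while the CPU writes new data.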
Note that allocating non-D3DUSAGE_DYNAMIC index buffers that exhibit dynamic behavior can sometimes be a win, especially on CrossFire (or similar) configurations. Now, with all this background information in mind, here are three common index buffer usage scenarios and some advice on how to allocate index buffers.

An index buffer that only requires updates to areas that haven't already been written and aren't currently used in a draw: In this case, the application should use the default pool and manage the locks with the locking flags described above. The dynamic usage flag should not be used. Alternatively, the managed pool is not a bad option; it requires a bit more CPU work, and even some GPU overhead, but there shouldn't be any hardware stalls.

An index buffer that requires updates to areas that have already been written and have been used in a draw call: Here the application can go ahead and create the buffer in the default pool with the dynamic usage flag. Locks will likely not be expensive, even though the locking flags are invalid for D3DUSAGE_DYNAMIC buffers, because locking should happen on a system memory buffer. Again, the managed pool isn't a bad alternative, and for cases in which the index buffer is updated once for several draw calls, the managed pool might be a better approach. Your mileage may vary.

An index buffer that requires updates to the entire buffer every draw in which it's used: Definitely use the default pool and manage the locks with the appropriate locking flags in this case. Do not set the dynamic usage flag, however. The managed pool is not a good option in this scenario.

Vertex Buffers

Vertex buffer creation and usage generally follows the same guidelines as index buffers, though the drivers and even the hardware may handle things a bit differently internally. Drivers generally try to create all vertex buffers in video memory, and dynamic vertex buffers generally end up in non-LVM for better CPU access.
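The scenario advice above (which, per the vertex buffer note, largely applies to vertex buffers as well) can be condensed into a small decision helper. The enum, struct, and function names are invented for this sketch; the strings simply echo the article's recommendations:

```cpp
#include <string>

// Invented names summarizing the article's three usage scenarios.
enum class UpdatePattern {
    FreshRegionsOnly,     // never rewrite data already used in a draw
    OverwriteInFlight,    // rewrite regions that may still be in flight
    WholeBufferEveryDraw  // refill the entire buffer for every draw
};

struct BufferAdvice {
    std::string pool;  // recommended creation pool
    bool dynamicFlag;  // whether to pass D3DUSAGE_DYNAMIC
    bool managedOk;    // whether D3DPOOL_MANAGED is a sane alternative
};

inline BufferAdvice adviseIndexBuffer(UpdatePattern p) {
    switch (p) {
        case UpdatePattern::FreshRegionsOnly:
            return {"D3DPOOL_DEFAULT", false, true};   // lock with NOOVERWRITE/DISCARD
        case UpdatePattern::OverwriteInFlight:
            return {"D3DPOOL_DEFAULT", true, true};    // dynamic; sysmem-backed locks
        case UpdatePattern::WholeBufferEveryDraw:
        default:
            return {"D3DPOOL_DEFAULT", false, false};  // DISCARD each refill
    }
}
```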
It is strongly recommended NOT to place vertex buffers (static or dynamic) in the system memory pool. The Direct3D runtime behaves essentially the same for vertex and index buffers.

Other tidbits

Always be sure to use the appropriate flags when creating resources through the API. The flags provide tremendous insight to the driver as to how the resource being created will be used, giving it clear direction as to the best location for that resource for optimal performance. Also, avoid creating and destroying resources on the fly, per frame. Resource allocation has tremendous overhead, comparatively, and this behavior can cause fragmentation and other memory-related problems. Occasionally it makes sense to evict all the managed resources from video memory, such as when switching levels or worlds in a game, and Direct3D provides an API for this (IDirect3DDevice9::EvictManagedResources). Performing this eviction will clean up much of the fragmentation that may have built up throughout the last level, and provide a clean slate for the next one. Lastly, should memory be a point of contention for your application, consider using ATI's plug-in for PIX, or a similar tool, to understand when and how many resources are being created, and what they are used for. Note that the PIX plug-in can also give you useful information as to how well managed vertex/index buffers and textures are behaving. Generally, knowing how video memory is utilized by an application and optimizing resource allocations can go a long way toward providing (or at least setting the stage for) the best runtime performance.

References

The ATI plug-in for PIX: /atipix/index.html

Acknowledgements

The author of this paper wishes to thank Tim Kelley of ATI Technologies Inc. for his great patience and detailed explanations, and the ATI ISV Engineering and Application Research teams for their comments and contributions.