Technical Report. GLSL Pseudo-Instancing

Similar documents
SDK White Paper. Matrix Palette Skinning An Example

SDK White Paper. Vertex Lighting Achieving fast lighting results

User Guide. Vertex Texture Fetch Water

Technical Brief. LinkBoost Technology Faster Clocks Out-of-the-Box. May 2006 TB _v01

Android PerfHUD ES quick start guide

Technical Brief. NVIDIA and Microsoft Windows Vista Getting the Most Out Of Microsoft Windows Vista

Technical Report. Anisotropic Lighting using HLSL

Multi-View Soft Shadows. Louis Bavoil

White Paper. Soft Shadows. February 2007 WP _v01

Technical Brief. AGP 8X Evolving the Graphics Interface

GPU LIBRARY ADVISOR. DA _v8.0 September Application Note

Application Note. NVIDIA Business Platform System Builder Certification Guide. September 2005 DA _v01

White Paper. Denoising. February 2007 WP _v01

SDK White Paper. Soft Shadows

Technical Report. SLI Best Practices

NVIDIA nforce 790i SLI Chipsets

Tuning CUDA Applications for Fermi. Version 1.2

SDK White Paper. HLSL Blood Shader Gravity Maps

CUDA Particles. Simon Green

GLExpert NVIDIA Performance Toolkit

Technical Report. Mesh Instancing

Sparkling Effect. February 2007 WP _v01

Cg Toolkit. Cg 2.0 January 2008 Release Notes

SDK White Paper. Occlusion Query Checking for Hidden Pixels

User Guide. GLExpert NVIDIA Performance Toolkit

White Paper. Texture Arrays Terrain Rendering. February 2007 WP _v01

Cg Toolkit. Cg 2.0 May 2008 Release Notes

Soft Particles. Tristan Lorach

Cg Toolkit. Cg 1.2 Release Notes

Technical Report. SLI Best Practices

Order Independent Transparency with Dual Depth Peeling. Louis Bavoil, Kevin Myers

User Guide. DU _v01f January 2004

White Paper. Solid Wireframe. February 2007 WP _v01

Technical Brief. NVIDIA Quadro FX Rotated Grid Full-Scene Antialiasing (RG FSAA)

Enthusiast System Architecture Certification Feature Requirements

NVIDIA SDK. NVMeshMender Code Sample User Guide

User Guide. TexturePerformancePBO Demo

High Quality DXT Compression using OpenCL for CUDA. Ignacio Castaño

Cg Toolkit. Cg 1.4 rc 1 Release Notes

Cg Toolkit. Cg 2.1 October 2008 Release Notes

Cg Toolkit. Cg 2.1 beta August 2008 Release Notes

Constant-Memory Order-Independent Transparency Techniques

GRID SOFTWARE FOR MICROSOFT WINDOWS SERVER VERSION /370.12

GRID SOFTWARE FOR RED HAT ENTERPRISE LINUX WITH KVM VERSION /370.28

Cg Toolkit. Cg 1.3 Release Notes. December 2004

Deinterleaved Texturing for Cache-Efficient Interleaved Sampling. Louis Bavoil

Cg Toolkit. Cg 2.2 April 2009 Release Notes

CUDA/OpenGL Fluid Simulation. Nolan Goodnight

Getting Started. NVIDIA CUDA C Installation and Verification on Mac OS X

Technical Brief. FirstPacket Technology Improved System Performance. May 2006 TB _v01

Horizon-Based Ambient Occlusion using Compute Shaders. Louis Bavoil

White Paper. Perlin Fire. February 2007 WP _v01

User Guide. GPGPU Disease

CUDA Particles. Simon Green

NVWMI VERSION 2.24 STANDALONE PACKAGE

Skinned Instancing. Bryan Dudash

Histogram calculation in CUDA. Victor Podlozhnyuk

NVWMI VERSION 2.18 STANDALONE PACKAGE

NVIDIA CUDA C GETTING STARTED GUIDE FOR MAC OS X

Getting Started. NVIDIA CUDA Development Tools 2.2 Installation and Verification on Mac OS X. May 2009 DU _v01

Optical Flow Estimation with CUDA. Mikhail Smirnov

Getting Started. NVIDIA CUDA Development Tools 2.2 Installation and Verification on Microsoft Windows XP and Windows Vista

Cg Toolkit. Cg 2.2 February 2010 Release Notes

Histogram calculation in OpenCL. Victor Podlozhnyuk

SDK White Paper. Video Filtering on the GPU. Eric Young NVIDIA Corporation 2701 San Tomas Expressway Santa Clara, CA 95050

Getting Started. NVIDIA CUDA Development Tools 2.3 Installation and Verification on Mac OS X

Technical Report. Non-Power-of-Two Mipmap Creation

MOSAIC CONTROL DISPLAYS

QUADRO SYNC II FIRMWARE VERSION 2.02

NVIDIA CUDA C INSTALLATION AND VERIFICATION ON

Technical Brief. NVIDIA Storage Technology Confidently Store Your Digital Assets

NVBLAS LIBRARY. DU _v6.0 February User Guide

NSIGHT ECLIPSE PLUGINS INSTALLATION GUIDE

NVIDIA CAPTURE SDK 6.1 (WINDOWS)

NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X

TUNING CUDA APPLICATIONS FOR MAXWELL

NSIGHT ECLIPSE EDITION

TUNING CUDA APPLICATIONS FOR MAXWELL

GETTING STARTED WITH CUDA SDK SAMPLES

NVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS

CUDA-MEMCHECK. DU _v03 February 17, User Manual

Spring 2009 Prof. Hyesoon Kim

NVIDIA Tesla Compute Cluster Driver for Windows

CUDA TOOLKIT 3.2 READINESS FOR CUDA APPLICATIONS

Spring 2011 Prof. Hyesoon Kim

CUDA-GDB: The NVIDIA CUDA Debugger

NVIDIA CAPTURE SDK 7.1 (WINDOWS)

Tegra 250 Development Kit Android Setup Experience

Particle Simulation using CUDA. Simon Green

Cg Toolkit. Cg 1.2 User s Manual Addendum

Semantic and Annotation Remapping in FX Composer

Shaders. Slide credit to Prof. Zwicker

NVIDIA CAPTURE SDK 6.0 (WINDOWS)

User Guide. NVIDIA Quadro FX 4700 X2 BY PNY Technologies Part No. VCQFX4700X2-PCIE-PB

Technical Brief. 3D Stereo. Consumer Stereoscopic 3D Solution

NVIDIA CUDA TM Developer Guide for NVIDIA Optimus Platforms. Version 1.0

SDK White Paper. Improve Batching Using Texture Atlases

NVIDIA Release 197 Tesla Driver for Windows

NVIDIA CUDA GETTING STARTED GUIDE FOR LINUX

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Transcription:

Technical Report GLSL Pseudo-Instancing

Abstract GLSL Pseudo-Instancing This whitepaper and corresponding SDK sample demonstrate a technique to speed up the rendering of instanced geometry with GLSL. The technique relies upon the efficient in-lining of persistent vertex attributes in OpenGL. World transforms (and potentially other instance specific data) are passed down for each instance using texture coordinates instead of uniform variables. Jeremy Zelsnack sdkfeedback@nvidia.com NVIDIA Corporation 2701 San Tomas Expressway Santa Clara, CA 95050 November 17, 2004 i

Table of Contents GLSL Pseudo-Instancing...i Introduction...3 Technique...4 GeForce 6800 GT...6 GeForce FX 5900 Ultra...9 GeForce FX 5200 Ultra...12 Conclusion...15 Bibliography...16 2

Introduction Rendering large numbers of geometrically instanced data can be useful for world clutter such as rocks, bushes, trees, crates, and debris. In some genres of games, it can also be useful for weapons and armor. Unfortunately, rendering large number of instances results in a large amount of driver work. This is particularly true in GLSL where the driver must map abstract uniform variables into real physical hardware registers. From the hardware s perspective, the large number of constant updates is not ideal either. Constant updates can incur hardware flushes in the vertex processing engines. OpenGL has a notion of persistent vertex attributes that can be passed down by immediate mode calls like gltexcoord(). These API calls are very efficient on the driver side; they don t require validation or potentially complex remapping. They are also very efficient on the hardware side; they do not result in hardware flushes in the vertex processing engines. The efficiency of persistent vertex attributes can be exploited by passing per-instance data such as transforms, color and other data via OpenGL immediate mode calls. The benefit of this technique can be quite large (especially when rendering instances with a small number of vertices on slow CPUs). 3

Technique If you were rendering a large number of instances using GLSL, you could use the modelview matrix stack and use gl_modelviewprojectionmatrix in your vertex shader. The performance of this might not be that great because of the additional CPU load involved in computing the transform matrices and downloading them to the GPU (most games are CPU bound, so it s usually a good idea to offload the CPU when reasonable). An alternate approach would be to send down a view matrix once for all of the instances. A world matrix can be sent down for each instance using gluniform4fvarb(), gluniformmatrix4fvarb() or similar calls. This technique requires more computation because each vertex has to be transformed by three matrices (assuming view-space lighting) instead of just one. However, this offloads additional transformation computations from the over-burdened CPU to the GPU. This is one of the techniques employed in the sample. The pseudo-instancing approach takes a similar, but slightly different approach. Like the above technique, a view transform is passed down for all of the instances. Unlike the above technique, the per-instance world matrices are passed down using glmultitexcoord() instead of calls like gluniform4fvarb(). This exploits the software and hardware advantages of the persistent vertex attributes. The code sample below demonstrates the technique. For more detailed information, please see the glsl_pseudo_instancing SDK sample. 4

// Render the instances of the mesh for(i=0; i<instances.size(); i++) { // Send down the matrix and other parameters if(guseinstancing) { // pseudo-instancing passes per-instance data down // as texture coordinates glmultitexcoord4fv(gl_texture1, instances[i].mworld[0]); glmultitexcoord4fv(gl_texture2, instances[i].mworld[1]); glmultitexcoord4fv(gl_texture3, instances[i].mworld[2]); } else { // Traditional GLSL uniform variable downloading gluniform4fvarb(worldmatrixlocation, 3, (float*)(instances[i].mworld)); } } // Set the color for the instance glcolor4fv(ginstances[i].mcolor); // Render the instance mesh->render(); The pseudo-instancing technique is similar to the instancing technique in Direct3D (supported on Shader Model 3.0 GPUs). The major difference is that the Direct3D instancing API reduces the number of DrawIndexedPrimitive() calls from many to one. This DrawIndexedPrimitive() call reduction has a large performance benefit in Direct3D. In OpenGL, the application still calls gldrawelements() (or the like) for every instance. This isn t too much of a performance hit because gldrawelements() is very efficient in OpenGL. Pseudo-instancing applies well to geometry with a small number of per-instance attributes. The technique does not scale well to complex mesh rendering techniques like skinning; there are not enough vertex attributes to store all of the bone transforms for each instance. The technique is not meant as a replacement for traditional particle system rendering or to replace static meshes where geometry can be reasonably baked into world space. 5

GeForce 6800 GT The following performance data was collected on a GeForce 6800 GT with 256MB on a PCI-Express 3.4 GHz Hyper-Threaded P4 system using ForceWare driver 66.81. This demonstrates the pseudo-instancing performance on a fast CPU with a fast GPU. The ratio of instances per second using pseudo-instancing over instances per second without pseudo-instancing is shown below. On this particular setup, the performance difference is significant for small meshes. 3.5 3 2.5 Performance Ratio 2 1.5 1 0.5 0 2 12 16 18 24 32 50 72 98 210 682 Vertices Per Instance 1024 Instances / Frame 16384 Instances / Frame 65536 Instances / Frame 6

The GeForce 6800 GT can hit over 3 million instances per second for tiny meshes. 3500000 3000000 2500000 Instances / Second 2000000 1500000 1000000 500000 0 2 12 16 18 24 32 50 72 98 210 682 Vertices Per Instance 1024 instances / frame (pseudo-instancing) 16384 instances / frame (pseudo-instancing) 65536 instances / frame (pseudo-instancing) 7

Pseudo-instancing allows the GeForce 6800 GT to render more triangles per second until the mesh size approaches 1,000 vertices. 120 100 80 MTriangles / Second 60 40 20 0 2 12 16 18 24 32 50 72 98 210 682 Vertices Per Instance 1024 instances / frame (pseudo-instancing) 16384 instances / frame (pseudo-instancing) 65536 instances / frame (pseudo-instancing) 1024 instances / frame 16384 instances / frame 65536 instances / frame 8

GeForce FX 5900 Ultra The following performance data was collected on a GeForce FX 5900 Ultra with 256MB on an AGP 8x Athlon XP 2500 system using ForceWare driver 66.81. The configuration represents a moderately fast CPU with a moderately fast GPU The ratio of instances per second using pseudo-instancing over instances per second without pseudo-instancing is shown below. On this setup, the performance difference is quite dramatic. On tiny instance meshes, the pseudo-instancing performance advantage approaches 35x! For more realistically sized meshes, the performance improvement is still impressive. 40 35 30 Performance Ratio 25 20 15 10 5 0 2 12 16 18 24 32 50 72 98 210 682 Vertices Per Instance 1024 Instances / Frame 16384 Instances / Frame 65536 Instances / Frame 9

The GeForceFX 5900 Ultra can hit over 5 million instances per second when using tiny instance meshes. This is on a system without an incredibly fast CPU. 6000000 5000000 4000000 Instances / Second 3000000 2000000 1000000 0 2 12 16 18 24 32 50 72 98 210 682 Vertices Per Instance 1024 instances / frame (pseudo-instancing) 16384 instances / frame (pseudo-instancing) 65536 instances / frame (pseudo-instancing) 10

The graph below shows a dramatic difference in the triangle rendering rates between using pseudo-instancing and the traditional rendering method. The downturn in performance for pseudo-instanced meshes with more than 50 vertices is probably due to less than perfect vertex cache friendliness in the generated meshes. 70 60 50 MTriangles / Second 40 30 20 10 0 2 12 16 18 24 32 50 72 98 210 682 Vertices Per Instance 1024 instances / frame (pseudo-instancing) 16384 instances / frame (pseudo-instancing) 65536 instances / frame (pseudo-instancing) 1024 instances / frame 16384 instances / frame 65536 instances / frame 11

GeForce FX 5200 Ultra The following performance data was collected on a GeForce FX 5200 Ultra with 128MB on an AGP 8x Athlon XP 2500 system using ForceWare driver 66.81. The ratio of instances per second using pseudo-instancing over instances per second without pseudo-instancing is shown below. On this particular setup, the performance difference is substantial for the smaller instance meshes. 12 10 8 Performance Ratio 6 4 2 0 2 12 16 18 24 32 50 72 98 210 682 Vertices Per Instance 1024 Instances / Frame 16384 Instances / Frame 65536 Instances / Frame 12

The GeForce FX 5200 Ultra can hit over 1.5 million instances per second on tiny instance meshes. 1800000 1600000 1400000 1200000 Instances / Second 1000000 800000 600000 400000 200000 0 2 12 16 18 24 32 50 72 98 210 682 Vertices Per Instance 1024 instances / frame (pseudo-instancing) 16384 instances / frame (pseudo-instancing) 65536 instances / frame (pseudo-instancing) 13

The GeForce FX 5200 enjoys increased triangle rendering rates using pseudo-instancing. The downturn in performance for meshes with more than 50 vertices is probably due to less than perfect vertex cache friendliness in the generated meshes 16 14 12 MTriangles / Second 10 8 6 4 2 0 2 12 16 18 24 32 50 72 98 210 682 Vertices Per Instance 1024 instances / frame (pseudo-instancing) 16384 instances / frame (pseudo-instancing) 65536 instances / frame (pseudo-instancing) 1024 instances / frame 16384 instances / frame 65536 instances / frame 14

Conclusion The pseudo-instancing technique is a simple way to increase performance of rendering large numbers of instanced geometry. The performance advantage is particularly large when the instanced geometry has a smaller number of vertices and on slower CPUs. The technique works on all NVIDIA GPUs that support GLSL vertex shaders (hardware acceleration in GeForce 3 and beyond); there is no requirement for Shader Model 3.0 capable hardware for the technique. 15

Bibliography 1. Discussion with Mark Kilgard in the hallway. 16

Notice ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, MATERIALS ) ARE BEING PROVIDED AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation. Trademarks NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation. Other company and product names may be trademarks of the respective companies with which they are associated. Copyright 2004 by NVIDIA Corporation. All rights reserved NVIDIA Corporation 2701 San Tomas Expressway Santa Clara, CA 95050 www.nvidia.com