GeForce 6 Series Performance
Matthias Wloka, Developer Technology

GeForce 6 Series Specific Performance
- Instancing
- Vertex- and Pixel-Shaders 3.0
  - Branching and Looping
  - Vertex Texture Fetch
- Hardware Shadow Maps
- Z- and Stencil-Cull
- FP16 Filter and Blend, MRTs

Marketing Speak Translation
- SM3, i.e., Shader Model 3 hardware
  - Sometimes shorthand for "every GeForce 6 feature not in GeForce FX"
  - Not just VS/PS 3.0
  - See previous slide!
- GeForce 6200 does not support fp16 filter/blend
  - Okay, because value cards lack the memory bandwidth to use fp16 render targets

Simplified Graphics Pipeline
[Diagram: CPU, Geometry Storage, Geometry Processor, Rasterizer, Fragment Processor, Frame Buffer, plus Texture Storage + Filtering]
- New features marked on the diagram: Instancing, Vertex Shader 3.0, Z/Stencil Cull, Pixel Shader 3.0, Fp16 Filter, Shadow Maps, Fp16 Blend, MRT
- Common bottlenecks: CPU, fragment processor
- New features help address these bottlenecks

CPU Bottleneck Getting Worse
[Chart courtesy of Ian Buck, Stanford University]

Explicitly Address CPU Bottleneck
- Reduce draw calls
  - Budget/design for your draw calls!
  - Use instancing to reduce batches
  - Use über-shaders to eliminate batches/passes
  - Use fp16 blending to eliminate passes
- Move more computations to the GPU
  - GPGPU: General-Purpose Computations Using GPUs
  - See http://gpgpu.org

Detail of a Single Vertex Shader Pipeline
[Diagram blocks: Input Vertex Data, Vertex Texture Fetch, Texture Cache, FP32 Scalar Unit, FP32 Vector Unit, Branch Unit, Primitive Assembly, Viewport Processing, To Setup]

Instancing: What Is It?
- Lets the GPU loop over vertex buffers:
  - Tree model VB
  - Transform matrices VB
- A single draw call generates many instances of the object
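A CPU-side sketch of what the single instanced draw does conceptually: every vertex of the model stream is paired with each entry of the per-instance matrix stream. This is an illustration only, not the Direct3D API; `transform`, `draw_instanced`, and `translation` are hypothetical names, and 2D affine matrices are used for brevity.

```python
def transform(matrix, vertex):
    """Apply a 3x3 affine matrix (row-major nested lists) to a 2D vertex."""
    x, y = vertex
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])


def draw_instanced(model_vb, instance_vb):
    """Emulate one instanced draw: replay the model stream once per instance matrix."""
    return [[transform(m, v) for v in model_vb] for m in instance_vb]


def translation(tx, ty):
    """Hypothetical helper: build a per-instance translation matrix."""
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]


tree = [(0.0, 0.0), (1.0, 0.0), (0.5, 2.0)]          # "tree model" stream
matrices = [translation(0, 0), translation(10, 0), translation(0, 5)]
instances = draw_instanced(tree, matrices)           # one call, three trees
```

On hardware the outer loop runs on the GPU, so the CPU issues one draw call regardless of the instance count.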

Instancing Demo
- Complex lighting, post-processing
- Simple CPU collision

Instancing Advantages
- Most flexible and has the least
  - Draw calls
  - Memory overhead
  - CPU/bus overhead
- Alternatives:
  - One draw call per instance, change state in between
  - Static batching (static pre-transformed VB)
  - Dynamic batching (dynamic 2-stream instancing)
  - Vertex constant instancing
- See Instancing code sample and whitepaper: http://download.nvidia.com/developer/sdk/Individual_Samples/samples.html

But
- Multiple vertex streams
  - GPU does extra work
- Vertex sizes are larger
  - Transform matrix is a per-vertex attribute

Attribute Bound
- Extra data fetched per instance
  - Explains slowdown
- Optimize for the vertex cache
  - Cache hit saves all vertex work, including attribute access
- Pack input attributes as tightly as possible
  - Even if vertex shader work is required to unpack
- Move constants or derivables out of attributes

Instancing Performance
[Chart: Instancing Method Comparison, 28-poly mesh. Y axis: FPS relative to HW instancing, 0% to 140%. X axis: # polys, 2800 to 560000. Note: % is relative to HW instancing in each group. Series: Single Draw Calls, Dynamic 2 Stream Instancing, Static 2 Stream Instancing, VS Constant Instancing, Hardware Instancing, Static Pretransformed VB]

Another View
[Chart: FPS per polys, 28-poly mesh. Y axis: FPS, log scale, 1 to 1000. X axis: # polys, log scale, 1000 to 1000000. Series: Single Draw Calls, Static 2 Stream Instancing, Hardware Instancing, Dynamic 2 Stream Instancing, VS Constant Instancing, Static Pretransformed VB]

Vertex Shader 3.0: Flow Control
- Vertex flow control near optimal:
  - Branch instructions have fixed ~1 cycle overhead
  - Divergence is full speed (MIMD)
- Vertex branching is a win
  - Except for short branches
  - Compiler/driver decides
- Example: single unified vertex shader for 1, 2, 3, and 4 bone skinning
- Use branches and loops to
  - Consolidate batches
  - Skip over unnecessary work
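A sketch of the unified-skinning idea in Python rather than shader code: one routine loops over however many bones a vertex uses, instead of one shader per bone count. As a loud simplification, each "bone" here is only a translation offset, not a full matrix; `skin_vertex` and its parameters are hypothetical names.

```python
def skin_vertex(position, bone_offsets, weights, bone_count):
    """Blend a vertex over the first bone_count bones (1 to 4).

    Assumption for brevity: a bone is a plain (x, y, z) translation offset.
    """
    x = y = z = 0.0
    for i in range(bone_count):          # the loop replaces four separate shaders
        ox, oy, oz = bone_offsets[i]
        w = weights[i]
        x += w * (position[0] + ox)
        y += w * (position[1] + oy)
        z += w * (position[2] + oz)
    return (x, y, z)


one_bone = skin_vertex((1.0, 2.0, 3.0), [(1.0, 0.0, 0.0)], [1.0], 1)
two_bones = skin_vertex((1.0, 2.0, 3.0),
                        [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.5, 0.5], 2)
```

Meshes with different bone counts can then share one shader and one batch, with the bone count supplied per draw or per vertex.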

Vertex Texture Fetch (VTF)
- Mipmapped texture fetches from the vertex shader:
  - Only R32F and R32G32B32A32F formats
  - Only point-sampling
  - Up to 4 different texture stages
  - Sample as often as you like
- Large latency
  - Equivalent to 20-30 instructions

Cover the Latency
- Latency means you can hide other ops in it
  - For free
  - Compiler/driver does this for you if possible

    texldl r0, v0, sampler0
    mul r1, v1, c0      // stuff not depending on vtf result
    add r1, r1, r0      // use vtf result for the first time

- Branch over VTF if possible
- Dependent VTFs are slow
  - Less chance to hide latency

Vertex Texture Fetch Performance
- GeForce 6800 capable of peak 600 MVerts/s
  - Minimalist (err, read: no) work per vertex
- Max with a single VTF: 33 MVerts/s
  - Not all vertices in a frame need to be displaced
  - 1 million displaced vertices @ 33 fps!
- Do not use as a general constant-memory replacement
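The budget arithmetic behind these figures, as a small sketch. The function names are hypothetical; the rates are the ones quoted on the slide.

```python
def displaced_vertices_per_frame(vtf_rate_mverts_per_s, fps):
    """Vertices per frame that can each do one VTF at the given frame rate."""
    return vtf_rate_mverts_per_s * 1_000_000 / fps


def vtf_slowdown(peak_mverts_per_s, vtf_mverts_per_s):
    """How many times slower a single-VTF vertex is than the peak vertex rate."""
    return peak_mverts_per_s / vtf_mverts_per_s


budget = displaced_vertices_per_frame(33, 33)   # 33 MVerts/s at 33 fps
slowdown = vtf_slowdown(600, 33)                # roughly 18x below peak
```

The roughly 18x gap to the peak rate is why the fetch should be reserved for vertices that actually need displacement.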

Early Z and Stencil Cull
- Cull pixels that (will) fail depth/stencil tests before entering the pixel shader
- For maximum z-cull:
  - Render roughly front to back
  - Or even better: render a z-only pass before normal rendering
- Do stencil-only passes for other cull tricks
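A toy model of why draw order matters: count how many fragments at one pixel survive a LESS depth test, and therefore would run the pixel shader, for a given submission order. It ignores the coarse, tile-based nature of real z-cull; `shaded_fragments` is a hypothetical name.

```python
def shaded_fragments(depths):
    """Count fragments that pass a LESS depth test at one pixel, in draw order."""
    nearest = float("inf")      # cleared depth buffer
    shaded = 0
    for d in depths:
        if d < nearest:         # passes the depth test, so it gets shaded
            nearest = d
            shaded += 1
    return shaded


layers = [5.0, 1.0, 3.0]
front_to_back = shaded_fragments(sorted(layers))
back_to_front = shaded_fragments(sorted(layers, reverse=True))
```

Front-to-back order shades one fragment here; back-to-front shades all three. A z-only pre-pass gets the front-to-back result without having to sort.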

Things That Disable Z Culling
- Changing depth-test direction
  - For example, less-equal to greater-equal
  - Only resets on clear

Z-Cull Uses Highly Compressed Z Representation
- Triangles with holes (alpha test/texkill/clip planes) are not occluding
- Small triangles are bad occluders
  - Small ~= less than 4x4 pixels
  - Z-cull may not recognize the triangle as an occluder
[Figure: "Good" and "Bad" occluder examples]

Things That Disable Stencil Culling
- Changing stencil function, reference, or mask
  - Only resets on clear
- Writing stencil while rejecting based on stencil
  - Write stencil in a separate pass from rejecting color/z

Stencil Cull Example
1. Render light volume with color write disabled
   - Depth func = LESS, Stencil func = ALWAYS
   - Stencil Z-FAIL = REPLACE (with value X)
   - Rest of stencil ops set to KEEP
2. Render with lighting shader
   - Depth func = ALWAYS, Stencil func = EQUAL, all ops = KEEP, Stencil ref = X
   - Unlit pixels will be culled because stencil does not match the reference value
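The two passes above can be sketched as a per-pixel state machine. This models only the render states listed in the example; which faces of the light volume are drawn is not stated on the slide and is left out. `mark_pass`, `lighting_pass`, and the value chosen for X are hypothetical.

```python
X = 1  # stencil reference value "X" from the example (any nonzero value works)


def mark_pass(scene_depth, volume_depth, stencil):
    """Pass 1: depth func LESS, stencil func ALWAYS, Z-FAIL = REPLACE, rest KEEP."""
    out = list(stencil)
    for i, (s, v) in enumerate(zip(scene_depth, volume_depth)):
        if v is None:           # light volume does not cover this pixel
            continue
        if not (v < s):         # depth test fails, so the Z-FAIL op fires
            out[i] = X
    return out


def lighting_pass(stencil):
    """Pass 2: stencil func EQUAL with ref X; True means the pixel gets shaded."""
    return [s == X for s in stencil]


scene = [0.5, 0.5, 0.5, 0.5]
volume = [None, 0.8, 0.3, 0.5]
stencil = mark_pass(scene, volume, [0, 0, 0, 0])
lit = lighting_pass(stencil)
```

Pixels whose stencil value stays at 0 never reach the lighting shader, which is where the saving comes from.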

Fast Z-Only Rendering
- GeForce FX and 6 Series render z/stencil at double speed!
  - Important for dynamic shadow maps!
  - Makes z-first/only pass (for z-cull benefits) attractive
- Only enabled if:
  - No color writes
  - Disable pixel shaders (no depth replace, no texkill)
  - Disable alpha test/color key
  - 8-bit/component color buffer bound (not float)
  - No user clip planes
  - No AA

Pixel Shader 3.0 Performance
- What is Pixel Shader 3.0?
- 3.0 shaders help both CPU and GPU bottlenecks
  - Consolidate draw calls/passes (über-shaders)
  - Early-outs with dynamic branching
- Gory performance details of particular pixel shader 3.0 features

Detail of a Single Pixel Shader Pipeline
[Diagram blocks and annotations:]
- Input Fragment Data, Texture Data
- FP Texture Processor
  - Texture filter: bi / tri / aniso
  - 1 texture @ full speed
  - 4-tap filter @ full speed
  - 16:1 aniso w/ trilinear
  - FP16 texture filtering
- Texture Cache
- FP32 Shader Unit 1
  - 4 FP ops/pixel, co-issue
  - Texture address calc
  - Free fp16 normalize
  - + mini ALU
- FP32 Shader Unit 2
  - 4 FP ops/pixel, co-issue
  - + mini ALU
- Branch Processor (Shader Model 3.0)
- Fog ALU
- Output Shaded Fragments
- Overall: SIMD architecture, co-issue, FP32 computation

Half (fp16) Performance
- Half (fp16) still matters!
  - Critical for GeForce FX performance
  - Reduces register pressure
  - Better able to hide texture latency
  - Fast fp16 normalize
- Compiler/driver can NOT help you with this

GeForce 6 Single Cycle Normalize()
- Pixel shader unit has single-cycle normalize
- Caveat: only for 3-component 16-bit float values

    float3 f3;
    half3 h3;
    half4 h4;
    f3 = normalize(f3);         // slow: dp3/rsq/mul
    h3 = normalize(h3);         // fast: nrmh
    h4 = normalize(h4);         // slow: dp4/rsq/mul
    h4.xyz = normalize(h4.xyz); // fast: nrmh
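For reference, the slow path named in the comments (dp3/rsq/mul) spelled out as three steps in Python. This only illustrates the instruction sequence that the single-cycle fp16 normalize replaces; `normalize3` is a hypothetical name and the math runs at Python float precision, not fp16.

```python
import math


def normalize3(v):
    """Normalize a 3-vector the long way: dot product, reciprocal sqrt, multiply."""
    d = v[0] * v[0] + v[1] * v[1] + v[2] * v[2]   # dp3
    r = 1.0 / math.sqrt(d)                        # rsq
    return (v[0] * r, v[1] * r, v[2] * r)         # mul


n = normalize3((3.0, 0.0, 4.0))
```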

GeForce 6 Superscalar Execution
- Executes multiple instructions simultaneously
- For example, in a single cycle you can execute
  - Two 2-vector instructions, or
  - One 3-vector and one scalar instruction
- Plus, there are 2 math units per shader pipe
- Use swizzles / write masks to help the compiler

    half4 A, B;
    A.w = sin(A.w);          // write mask needed: A = sin(A.w) is not enough
    A.xyz = A.xyz * B.xyz;

GeForce 6 Series Co-Issue
- 2 different instructions executing in the same cycle in the same shader unit
- 2 separate shader units
- 4 instructions/pixel/cycle
[Diagram: Shader Unit 1 runs Operation 1 and Operation 2 across its R G B A channels; Shader Unit 2 runs Operation 3 and Operation 4]

Flow Control Performance Overview
- Flow control instruction costs: not free, but useful

    Instruction          Cost (cycles)
    if / endif           4
    if / else / endif    6
    call                 2
    ret                  2
    loop / endloop       4

- Additional costs when pixels diverge (more later)

Looping Costs
- DirectX ps.3.0 supports only static loops
  - Unrolling is faster
  - Compiler/driver can do that for you
- Nonetheless useful because
  - Reduces high-level code complexity
  - Reduces passes
    - Multiple lights in a single pass can be a big win
    - Number of lights unknown at compile time
  - Reduces proliferation of pre-compiled shaders
    - Thousands of shaders from just a few templates
  - Overcomes DirectX's 512 static instruction limit

Branching Costs
- Branching can provide a substantial boost
  - If able to skip > 6 instruction cycles, and
  - If the branch condition is coherent
- Noisy branch conditions cause performance loss
  - Potentially worse than taking both branches all the time
[Figure: coherent vs. incoherent branch conditions]
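A rough cost model for the two conditions above, assuming the 6-cycle if / else / endif overhead from the flow control cost table and assuming that an incoherent group of pixels ends up paying for both sides of the branch. Real hardware behavior is more involved; `branch_cost` and `flat_cost` are hypothetical names.

```python
IF_ELSE_ENDIF_COST = 6   # cycles, taken from the flow control cost table


def branch_cost(taken_cycles, skipped_cycles, coherent, overhead=IF_ELSE_ENDIF_COST):
    """Estimated cycles per pixel when the work is wrapped in a dynamic branch."""
    if coherent:
        return overhead + taken_cycles               # skipped side costs nothing
    return overhead + taken_cycles + skipped_cycles  # assumed: both sides execute


def flat_cost(taken_cycles, skipped_cycles):
    """Estimated cycles per pixel without branching: always compute both sides."""
    return taken_cycles + skipped_cycles


coherent_win = flat_cost(10, 20) - branch_cost(10, 20, coherent=True)
incoherent_loss = branch_cost(10, 20, coherent=False) - flat_cost(10, 20)
```

In this model the branch breaks even when exactly 6 cycles are skipped, and an incoherent branch is strictly worse than no branch at all.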

How Coherent Do I Have To Be?
- GPU has hundreds of pixels in flight
  - Best if coherent over regions of > ~1000 pixels
  - That's only ~30x30!
- You need to experiment in your own application
- Soft shadow demo shows:
  - Incoherent branches on a small portion of the screen are still a big win

Combine Branching With Others
- Back face register (vFace)
  - Shade front faces differently from back faces
- Position register (vPos)
  - Shade based on position
  - For example, skip or simplify distant pixels
- Early out:
  - If in shadow, don't do lighting computations
  - If out of range (attenuation zero), don't light
  - Applies to vs.3.0 as well

Soft Shadow Demo

How Soft Shadow Demo Works
- Takes 8 test samples from the shadow map
  - If all 8 are in shadow or all 8 are in the light, then done
  - If on the edge (some in shadow, some in light):
    - Do 56 more samples for additional quality
- 64 samples at much lower cost!
  - Quick-and-dirty importance sampling
- Dynamic sampling > 2x faster
  - Vs. 64 samples everywhere
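The early-out logic above can be sketched as follows. The shadow test itself is passed in as a callable, since the demo's actual sample pattern and comparison are not given here; `soft_shadow` and its parameters are hypothetical names.

```python
def soft_shadow(in_shadow, probe_offsets, extra_offsets):
    """Return (shadowed fraction, samples taken) using an 8-then-56 style early-out.

    in_shadow(offset) -> bool is a caller-supplied shadow-map test (assumption).
    """
    probes = [in_shadow(o) for o in probe_offsets]
    hits = sum(probes)
    if hits == 0 or hits == len(probes):
        # All probes agree: fully lit or fully shadowed, skip the extra samples.
        return hits / len(probes), len(probes)
    # Probes disagree: we are on a shadow edge, refine with the remaining samples.
    extra = [in_shadow(o) for o in extra_offsets]
    total = len(probes) + len(extra)
    return (hits + sum(extra)) / total, total


probe = list(range(8))
rest = list(range(8, 64))
fully_shadowed = soft_shadow(lambda o: True, probe, rest)
on_edge = soft_shadow(lambda o: o % 2 == 0, probe, rest)
```

Only pixels on a shadow edge pay for all 64 samples, which works because the branch is coherent everywhere else.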

Hardware Shadow Maps
- In DirectX:
  - Render to a depth format texture (D3DFMT_D24X8, D3DFMT_D16)
  - Use tex2Dproj to sample
  - Shadow map comparison happens automatically
- In OpenGL:
  - Render to a DEPTH_COMPONENT texture
  - Use TEXTURE_COMPARE_MODE_ARB with COMPARE_R_TO_TEXTURE

Hardware Shadow Map Performance
- Shadow map comparison is free (full speed)
  - No need to compare and filter in the shader
- If bilinear state is on,
  - Then percentage-closer filtering of the 4 nearest texels
- Use a single tap for performance
  - Quality roughly equivalent to 4-tap PCF R32F
- Use multiple taps for higher quality
  - 4-tap HW shadow map roughly as fast as 4-tap manual PCF R32F
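What "percentage-closer filtering of the 4 nearest texels" computes, written out by hand in Python: compare each texel first, then bilinearly weight the 0/1 results. This is a sketch of the manual path the hardware makes free; texel-center offsets and border handling are ignored, and `pcf_bilinear` is a hypothetical name.

```python
import math


def pcf_bilinear(shadow_map, u, v, receiver_depth):
    """Lit fraction in [0, 1] at texel-space (u, v); shadow_map[y][x] holds depths."""
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    fx, fy = u - x0, v - y0

    def lit(x, y):
        # Depth comparison happens before filtering, unlike ordinary bilinear.
        return 1.0 if receiver_depth <= shadow_map[y][x] else 0.0

    top = lit(x0, y0) * (1 - fx) + lit(x0 + 1, y0) * fx
    bottom = lit(x0, y0 + 1) * (1 - fx) + lit(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy


depth_map = [[1.0, 0.2],
             [1.0, 0.2]]            # right column holds a nearer occluder
half_lit = pcf_bilinear(depth_map, 0.5, 0.5, 0.5)
fully_lit = pcf_bilinear(depth_map, 0.0, 0.0, 0.5)
```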

Hardware Shadow Map Fallback
- Possible to use R32F or R16F shadow maps
  - Render depth to a single-channel float texture in the shader
  - Multiple jittered samples for high quality / soft edges
- Easy to maintain both hardware shadow map and R32F/R16F code paths:
  - Same setup and pipeline as any shadow map technique
  - HW shadow map shader code is simpler and faster
- HW shadow maps buy speed or quality (or both)

Texture Instruction Performance
- texldb (scalar LOD bias): full speed
- texldl (explicit scalar LOD selection): full speed
  - Hardware need not calculate derivatives for LOD
  - Possible to dynamically branch over these instructions
- texldd (gradient-based LOD selection): factor of 10 slower!
  - But when you need to use this, you need to use this

Floating Point Texture Performance
- Prefer 64bpp float textures and render targets
  - Half the bandwidth of 128bpp (fp32) textures
  - More importantly: double the cache coherence
  - Poor cache coherence destroys performance
  - Fp16 textures 2x faster than fp32 if texture bound
- Also important: efficient channel allocation
  - Use R32F buffers for scalar data, and R16G16F for 2-vectors
  - Double the cache coherence again!
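The footprint arithmetic behind this advice, as a sketch. The format labels are shorthand rather than API enums, and the 64-byte cache line is an illustrative assumption, not a documented GeForce 6 figure.

```python
# Bytes per texel; labels are shorthand for the formats discussed above.
BYTES_PER_TEXEL = {
    "R16F": 2,
    "R16G16F": 4,
    "R32F": 4,
    "RGBA16F": 8,     # 64bpp
    "RGBA32F": 16,    # 128bpp
}


def texture_bytes(fmt, width, height):
    """Memory footprint of one mip level."""
    return BYTES_PER_TEXEL[fmt] * width * height


def texels_per_cache_line(fmt, line_bytes=64):
    """Texels that fit in one cache line; line_bytes is an assumed size."""
    return line_bytes // BYTES_PER_TEXEL[fmt]


fp16_bytes = texture_bytes("RGBA16F", 1024, 1024)
fp32_bytes = texture_bytes("RGBA32F", 1024, 1024)
```

Halving the texel size doubles the texels per line, which is the "double the cache coherence" effect; storing scalar data in R32F instead of a 4-channel fp32 texture gives another 4x.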

Common Sense Texture Performance
- Use mipmaps
  - GPU fetches local neighborhood for each texel
- For sharper/crisper textures
  - Use anisotropic filtering
  - Use better mipmap generation (use texture tools)
  - Do NOT use LOD bias
  - LOD bias is slower and lower quality

Normal Maps
- Use D3DFMT_V8U8 or DXT5
  - To store x and y
  - Derive z in the shader
- Simon Green's normal map compression paper
  - Compares quality of a variety of formats
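"Derive z in the shader" relies on the normal being unit length, so z = sqrt(1 - x^2 - y^2). A minimal sketch, assuming tangent-space normals with non-negative z; `reconstruct_normal` is a hypothetical name.

```python
import math


def reconstruct_normal(x, y):
    """Rebuild a unit normal from its stored x and y components (z >= 0 assumed)."""
    # Clamp guards against small negative values caused by quantization error.
    z = math.sqrt(max(0.0, 1.0 - x * x - y * y))
    return (x, y, z)


flat = reconstruct_normal(0.0, 0.0)
tilted = reconstruct_normal(0.6, 0.0)
```

Storing two channels instead of three costs a few shader instructions per sample in exchange for a smaller texture.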

Multiple Render Targets
- MRTs are useful for reducing rendering passes
  - When you need to output more than a single 4-vector
  - Deferred shading, particle physics, GPGPU algorithms
  - Replaces up to four passes with one
- But MRT is not free
  - High bandwidth cost, especially with float formats
  - Small overhead per target rendered
- GeForce 6 has a sweet spot of 3 render targets (RTs)
  - Split 6 passes into two 3-RT passes
  - Not one 4-RT pass and one 2-RT pass
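The sweet-spot rule can be sketched as a simple grouping step: given the outputs a technique needs, pack them three per pass rather than filling four targets first. The output names are made up for illustration; `split_into_passes` is a hypothetical name.

```python
def split_into_passes(outputs, targets_per_pass=3):
    """Group render-target outputs into passes of at most targets_per_pass."""
    return [outputs[i:i + targets_per_pass]
            for i in range(0, len(outputs), targets_per_pass)]


# Illustrative output names only.
outputs = ["albedo", "normal", "depth", "specular", "motion", "emissive"]
passes = split_into_passes(outputs)
```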

Other Render Target Advice
- Do not render the entire scene to a texture
  - Not getting AA
  - If the user turns on control panel AA, it is hard to detect
- Instead, render to the back buffer, then StretchRect
  - Drivers give performance priority to the back buffer, ahead of texture surfaces
  - AA works with the back buffer

Full Screen Effects
- Use scissor rects to restrict rendering
  - Light bounds, etc.
- Do not use full-screen quads
  - Use full-screen triangles with a scissor rect instead
  - Completely avoids inefficient diagonals
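One common way to build such a triangle, given here as an assumption since the slide does not list coordinates: a single oversized triangle in clip space whose corners sit at (-1, -1), (3, -1), and (-1, 3). The check below confirms it covers the whole [-1, 1] x [-1, 1] viewport; the scissor rect then trims the overhang. `edge` and `covers` are hypothetical helper names.

```python
# Assumed clip-space vertices of one oversized, counter-clockwise triangle.
FULLSCREEN_TRIANGLE = [(-1.0, -1.0), (3.0, -1.0), (-1.0, 3.0)]


def edge(a, b, p):
    """Signed area test: >= 0 when p lies on or to the left of edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covers(triangle, p):
    """True if point p is inside or on the border of a counter-clockwise triangle."""
    a, b, c = triangle
    return edge(a, b, p) >= 0 and edge(b, c, p) >= 0 and edge(c, a, p) >= 0


viewport_corners = [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]
all_covered = all(covers(FULLSCREEN_TRIANGLE, p) for p in viewport_corners)
```

Because there is only one triangle, no interior diagonal edge crosses the screen.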

Floating Point Blending
- GeForce FX needs to emulate float blending
  - Using ping-pong buffers
  - Lots of context switches and additional passes
  - Blending, e.g., lots of particles becomes infeasible
- But fp16 is 2x the bandwidth of A8R8G8B8

Increased Read Back Performance
- Pre-GeForce 6
  - Best case < 200 MB/s, all chipsets
  - Only PCI cycles used to write back to host memory
- GeForce 6800 (AGP)
  - 600 MB/s to 1.0 GB/s, depending on AGP chipset
- PCI-E workstation boards
  - 1.0 GB/s on Quadro FX 4400
  - Up to 2.4 GB/s on Quadro FX 1400

Read Back Still a BAD Idea
- Read back still synchronizes CPU and GPU
  - CPU stalls until GPU finishes all rendering
    - Can you afford wasting precious CPU cycles?
  - GPU pipeline drains completely and becomes idle

Memory Allocation
- Order of resource allocation affects performance
- Allocate render targets first
  - Sort order by pitch (bpp * width)
  - Sort pitch groups by frequency of use (most used first)
- Then create vertex and pixel shaders
- Load / create remaining textures

Conclusion
- Lots of new/fast features
  - Instancing, vs.3.0 flow control, vertex texture fetch
  - Z-/stencil-cull, fast z-only
  - Fast normalize, ps.3.0 flow control
  - Hardware shadow maps, fp16 blending
- With some sneaky gotchas
- Use these features to attack bottlenecks
  - CPU
  - Pixel shaders...

Questions?
- NVIDIA GPU Programming Guide: http://developer.nvidia.com/object/gpu_programming_guide.html
- Matthias Wloka (mwloka@nvidia.com)
- http://developer.nvidia.com