Canonical Shaders for Optimal Performance. Sébastien Dominé Manager of Developer Technology Tools

Similar documents
Optimal Shaders Using High-Level Languages

AGDC Per-Pixel Shading. Sim Dietrich

GPU Performance Tools

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

CS GPU and GPGPU Programming Lecture 7: Shading and Compute APIs 1. Markus Hadwiger, KAUST

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

In-Game Special Effects and Lighting

Cg 2.0. Mark Kilgard

Drawing a Crowd. David Gosselin Pedro V. Sander Jason L. Mitchell. 3D Application Research Group ATI Research

Direct3D API Issues: Instancing and Floating-point Specials. Cem Cebenoyan NVIDIA Corporation

Technical Report. Mesh Instancing

Shaders. Slide credit to Prof. Zwicker

CS 354R: Computer Game Technology

Programming Graphics Hardware

Introduction to Shaders.

Pump Up Your Pipeline

Self-shadowing Bumpmap using 3D Texture Hardware

Shaders : the sky is the limit Sébastien Dominé NVIDIA Richard Stenson SCEA

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Interactive Cloth Simulation. Matthias Wloka NVIDIA Corporation

GeForce4. John Montrym Henry Moreton

Tutorial on GPU Programming #2. Joong-Youn Lee Supercomputing Center, KISTI

Whiz-Bang Graphics and Media Performance for Java Platform, Micro Edition (JavaME)

Could you make the XNA functions yourself?

Chapter 3 ATI. Jason L. Mitchell

Wednesday, July 24, 13

PROFESSIONAL. WebGL Programming DEVELOPING 3D GRAPHICS FOR THE WEB. Andreas Anyuru WILEY. John Wiley & Sons, Ltd.

NVIDIA FX Composer. Developer Presentation June 2004

The Application Stage. The Game Loop, Resource Management and Renderer Design

Render-To-Texture Caching. D. Sim Dietrich Jr.

CREATING AND USING NORMAL MAPS - A Tutorial

CS427 Multicore Architecture and Parallel Computing

Advanced Shading and Texturing

NVIDIA Developer Toolkit. March 2005

NVIDIA Tools for Artists

Shaders (some slides taken from David M. course)

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

Evolution of GPUs Chris Seitz

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Performance Tools. Raul Aguaviva, Sim Dietrich, and Sébastien Dominé

Beginning Direct3D Game Programming: 1. The History of Direct3D Graphics

Deferred Rendering Due: Wednesday November 15 at 10pm

Chapter Answers. Appendix A. Chapter 1. This appendix provides answers to all of the book s chapter review questions.

Real-Time Hair Rendering on the GPU NVIDIA

Dominic Filion, Senior Engineer Blizzard Entertainment. Rob McNaughton, Lead Technical Artist Blizzard Entertainment

Here I ll present a brief summary of the important aspects and then dive into improvements we made on Black Ops II.

DX10, Batching, and Performance Considerations. Bryan Dudash NVIDIA Developer Technology

CS 4620 Program 3: Pipeline

HISTORY OF MAJOR REVISIONS

ECS 175 COMPUTER GRAPHICS. Ken Joy.! Winter 2014

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

Feeding the Beast: How to Satiate Your GoForce While Differentiating Your Game

EDAF80 Introduction to Computer Graphics. Seminar 3. Shaders. Michael Doggett. Slides by Carl Johan Gribel,

Advanced Rendering & Shading

ECE 574 Cluster Computing Lecture 16

MAXIS-mizing Darkspore*: A Case Study of Graphic Analysis and Optimizations in Maxis Deferred Renderer

The Ultimate Developers Toolkit. Jonathan Zarge Dan Ginsburg

Mattan Erez. The University of Texas at Austin

The Making of Seemore WebGL. Will Eastcott, CEO, PlayCanvas

Technical Report. Anisotropic Lighting using HLSL

User Guide. DU _v04 April 2005

Programmable GPUS. Last Time? Reading for Today. Homework 4. Planar Shadows Projective Texture Shadows Shadow Maps Shadow Volumes

Vertex and Pixel Shaders:

Physically Based Shading in Unity. Aras Pranckevičius Rendering Dude

CSE528 Computer Graphics: Theory, Algorithms, and Applications

Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques

Real-World Applications of Computer Arithmetic

A Trip Down The (2011) Rasterization Pipeline

Input Nodes. Surface Input. Surface Input Nodal Motion Nodal Displacement Instance Generator Light Flocking

CSE 167: Lecture #8: GLSL. Jürgen P. Schulze, Ph.D. University of California, San Diego Fall Quarter 2012

User Guide. DU _v01f January 2004

Introduction Rasterization Z-buffering Shading. Graphics 2012/2013, 4th quarter. Lecture 09: graphics pipeline (rasterization and shading)

GPU Memory Model. Adapted from:

Basics of GPU-Based Programming

CS GPU and GPGPU Programming Lecture 3: GPU Architecture 2. Markus Hadwiger, KAUST

Real-Time Universal Capture Facial Animation with GPU Skin Rendering

Gettin Procedural. Jeremy Shopf 3D Application Research Group

GeForce3 OpenGL Performance. John Spitzer

Rendering Subdivision Surfaces Efficiently on the GPU

Shadow Techniques. Sim Dietrich NVIDIA Corporation

Ultimate Graphics Performance for DirectX 10 Hardware

CS GPU and GPGPU Programming Lecture 3: GPU Architecture 2. Markus Hadwiger, KAUST

Windowing System on a 3D Pipeline. February 2005

12.2 Programmable Graphics Hardware

Lighting and Shading

DirectX10 Effects and Performance. Bryan Dudash

3D buzzwords. Adding programmability to the pipeline 6/7/16. Bandwidth Gravity of modern computer systems

frame buffer depth buffer stencil buffer

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST

The Illusion of Motion Making magic with textures in the vertex shader. Mario Palmero Lead Programmer at Tequila Works

Building scalable 3D applications. Ville Miettinen Hybrid Graphics

Life on the Bleeding Edge: More Secrets of the NVIDIA Demo Team Eugene d Eon NVIDIA Corporation.

The Graphics Pipeline

lecture 18 - ray tracing - environment mapping - refraction

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Introduction to the Direct3D 11 Graphics Pipeline

developer.nvidia.com The Source for GPU Programming

The Rasterization Pipeline

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Homework 3: Programmable Shaders

Transcription:

Canonical Shaders for Optimal Performance Sébastien Dominé Manager of Developer Technology Tools

Agenda Introduction FX Composer 1.0 High Performance Shaders Basics Vertex versus Pixel Talk to your compiler Clever tricks Texture look ups Case Study Conclusion Q&A

Introduction High Level Shading Languages: Powerful in getting your shader up and running Allows TD to quickly reach targeted look and feel Compilers do a good job at optimizing code Is this heaven or what?

Introduction Not quite...there are a lot of pitfalls: Too expensive computation at the wrong stage Not enough hints given to the compiler: Complex HW Shader writers know the computation talk to your compiler Not using clever tricks that help compilers Not always verifying what comes out Look at the assembly fxc;fx Compiler/Shader Perf Look at the scheduling FX Compiler/Shader Perf Look at the fps of unique fullscreen shaders

FX Composer Integrated IDE for HLSL FX development Simulated shader scheduling on nv3x family Disassembly of vertex and pixel shaders Bakes textures from HLSL code Allows render to texture effects HLSL Intellisense Allows scene import from.x and.nvb files Supports animation, lights, skinned meshes, etc... Allows pluggable geometry modifiers (fins,...) Project files.fxcomposer Fxmapping.xml custom semantic/annotation mapping

Driver Model 3D Application HLSL Shader code DX Runtime Direct3D opcodes Direct3D Driver Unified Compiler HW HW opcodes

FX Composer HLSL fx D3d9 opcodes FX Composer Direct3D9 Runtime HLSL Compiler Shader Perf Unified Compiler Driver Dependent (R50) Pixel shaders only Shader Disassembly HW Shader Unit Simulation

FX Composer Demo

High Performance Shaders Basics Vertex versus Pixel Talk to your compiler Clever tricks Texture look ups

High Performance Shaders How many cycles does normalize take? Using ps2.0 math: 2-3 cycles High Level: normalize(v); Using texture fetch (signed normalization cube map): 1 cycle High Level: texcube(s, V); Using ps1.x assembly: 1 cycle dp3 r0, v0_bx2, v0_bx2 mad r0, v0_bias, 1-r0, v0_bx2

The Basics 1/2 Develop your code in the easiest way that makes it work Then optimize it Make a budget Target a reference platform: Target resolution @ x Hz, y XAA, Aniso. Identify key points of interest Use more complex shaders for them

The Basics 2/2 Move constant computations off of the GPU and on to the CPU Example: Computing Material Color * Light Color at each vertex Computing 4x3 matrix transpose in the pixel shader Watch out for row vs. column major matrices

Vertex versus Pixel 1/1 Move linear computations out of the pixel shader and into the vertex shader Example: Move from world space to light space for attenuation ( Light.pos vertex.pos ) / Light.Range Moving into tangent space usually belongs pervertex Unless you are doing per-pixel reflection into a cubemap cubemap is defined in world space

Talk to your compiler 1/3 Use lowest possible precision that will do the job Fixed, half,float (24 or 32 bit) in that order Use _pp modifiers in ps_2_0 asm or half type in HLSL Example: tex2d * diffuse_color Can be done in fixed No advantage to higher precision

Talk to your compiler 2/3 Pick the right profile for the right complexity Use lowest possible pixel shader version that will do the job Remember that DirectX 9 still supports ps1.x If you have a shader that fits in 1.x, use it Try compiling HLSL for ps 1.x Use ps2.0 when it can t be achieved at the level of quality you are requiring Example: Simple diffuse lighting * texture

Talk to your compiler 3/3 Compile with ps_2_a target alternate model for compilation Closely matches GeForce FX hardware Example: Dependent texture fetches under ps_2_0 target generate sub-optimal assembly for ps_2_a hardware Look at assembly output of fxc or FX Composer does the number of instructions match your expectations? What about the scheduling?

Clever Tricks 1/6 Reflection Vector Optimizations There s no substitute for algebraic savings If a reflection vector is used to index into a cube map, its length doesn t matter Standard reflect function can be optimized from: R = 2*N*[dot(N,V)/dot(N,N)] V to R = 2*N*dot(N,V) dot(n,n)*v If N is already normalized, you can remove the dot(n,n): R = 2 * N * dot(n,v) V

Clever Tricks 2/6 Other Normalization Optimizations Other examples: Normal map lookups can be safely left unnormalized Try leaving L vector un-normalized per-pixel for diffuse bump mapping

Clever Tricks 3/6 Computing Normalize Efficiently Normalize(N) dot Normalize(L) can be computed more efficiently: The usual way: (N/ N ) dot (L/ L ) Two rsq computations! Better: = (N dot L) / ( N * L ) = (N dot L)/ ( sqrt( (N dot N) * (L dot L)) = (N dot L) * rsq ( (N dot N) * (L dot L))

Clever Tricks 4/6 Clamping Efficiently C/C++ don t have native min() and max() functions But GPU high-level languages do Use max() and min() instead of if-then-else Why? max(a,b)!= (A>=B?A:B) Because in this case, fxc uses a cmp instruction to handle the if-then-else cmp is not a native NV3X instruction But min and max are native NV3X instructions

Clever Tricks 5/6 Lerps lrp instruction is more expensive than other blending operations on NV3X Look for opportunities to simplify, or remove Note that: lerp(a,b, 1-f) == lerp(b, a, f) Compiler should find this and optimize, but may not yet do that

Clever Tricks 6/6 Vectorize Scalar Constants Make sure scalar constants don t end up misleading the compiler to use extra temporary registers Look at FX Composer/Shader Perf for a report on how many temporary registers end up being used

Texture Lookups 1/3 Texture lookups can be faster than math in many cases Free clamping, free filtering Examples : Free [0..1] clamping for N.L and N.H and self-shadow term A8R8G8B8 Normalization Cubemap L.N, N.H map in A8L8 2D Texture L.N, N.H, N.H k in A8L8 3D Texture Attenuation via 1 or 2 Texture lookups

Texture Lookups 2/3 Vector Normalization Normalize using math or normalization cube map lookup? For 1.x shaders, can use cube map or math (Newton- Raphson approximation) dp3 r0, v0_bx2, v0_bx2 mad r0, v0_bias, 1-r0, v0_bx2 For higher quality, use 2 16-bit signed cube maps for (x,y) and (z). 8-bit cubemaps had artifacts In general, signed textures are either the same speed or faster than unsigned ps.2.0 HW doesn t always have a free _bx2

Texture Lookups 3/3 Lighting - Specular Power Exponentiation: math or texture lookup? 1d texture: Replace power function with texture lookup for fixed exponent. Less flexible, but fast 2d texture: Second coordinate is exponent 3d texture: (u,v,w) = (L dot N, H dot N, spec exponent). Clamping on L dot N for free. Texture lookup, in general, faster than math Texture lookups less flexible than math Changing exponent per pixel for 3d texture approach will be more expensive texture cache thrashing Texture lookups are more flexible for artists They can paint them well!

Unique NP2 Texture Coordinates All non-power-of-two (NP2) textures and most NP2 render targets are addressed using image mode coordinates (0...width, 0...height) Texture coordinates need to be scaled by texture dimensions Driver will try to insert this scale into vertex shader

Unique NP2 Texture Coordinates Problem: If you have one set of texture coordinates used for multiple textures (allowed by ps_2_*) Driver must insert scale into pixel shader! Even if the textures are the same size Solution: Use separate texture coordinates for each NP2 texture Driver will correctly detect that they can be scaled by the vertex shader

Per-Pixel Lighting Case Study Well understood set of solutions Different approaches for achieving the same look Diffuse Lighting: k d * (L dot N) Specular Lighting: k s * (L dot R) spec [Phong] where R is reflected E Cheaper for many lights k s * (H dot N) spec [Blinn] Cheaper for one light

How to compute L, N and H? All should be normalized But, N will be close to normalized if from a tangent space normal map L can be normalized per-vertex Use vertex shader to put into light space for attenuation Optionally re-normalize per-pixel H can be computed per-vertex But can t be linearly interpolated per-pixel But on a well-tessellated mesh will often work fine anyway Renormalizing per-pixel not the correct vector direction but still visually plausible

Per-Pixel Lighting Experimental Results float to half gave a performance increase without quality loss normalize() -> 16-bit cubemaps gave further perf increase without quality loss. 8-bit cubemaps had artifacts Clamp with a texture even faster Both for N.L and self-shadowing term for N.H If light is not too close to a triangle, H per-vertex looks good Otherwise, compute per-pixel

Per-Pixel Lighting Summary There are many variations of even simple diffuse and specular lighting techniques R.L vs N.H Per-object, Per-Vertex or Per-Pixel Exponent Gloss Factors/Specular maps Math, Texture or Per-Vertex Normalization Per-Vertex, Per-Pixel or Texture-based attenuation The ideal method is very app-dependent But more flexibility costs performance The more you do per-vertex, the more quantity and complexity of lighting you can afford

Case Study FX Composer tutorial Implement classic lighting model: A Reflectance Model for Computer Graphics by Cook & Torrance, SIGGRAPH 81. Bake textures for faster computation Use half precision Give more hints to the compiler

Conclusion Select appropriate unit for ops: CPU, Vertex, Pixel CPU for constants, Vertex for linear calculations Select appropriate precision Use lowest pixel shader version that works You only get so many pixel shader cycles per frame Use them for visually interesting effects Per-Pixel Bump Maps Per-Pixel Reflections Give yourself more cycles for effects by not spending them on unneeded precision and calculation

Questions? If you see un-optimal code generated by fxc or FX Composer/Shader Perf, send us e-mail We ll get it fixed Compilers are still evolving Send mail to: devrelfeedback@nvidia.com fxcomposer@nvidia.com