Optimal Shaders Using High-Level Languages

Similar documents
Canonical Shaders for Optimal Performance. Sébastien Dominé Manager of Developer Technology Tools

AGDC Per-Pixel Shading. Sim Dietrich

CS GPU and GPGPU Programming Lecture 7: Shading and Compute APIs 1. Markus Hadwiger, KAUST

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Render-To-Texture Caching. D. Sim Dietrich Jr.

In-Game Special Effects and Lighting

Programming Graphics Hardware

Could you make the XNA functions yourself?

Interactive Cloth Simulation. Matthias Wloka NVIDIA Corporation

HISTORY OF MAJOR REVISIONS

Evolution of GPUs Chris Seitz

CS 354R: Computer Game Technology

PROFESSIONAL. WebGL Programming DEVELOPING 3D GRAPHICS FOR THE WEB. Andreas Anyuru WILEY. John Wiley & Sons, Ltd.

Cg 2.0. Mark Kilgard

ECE 574 Cluster Computing Lecture 16

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

Drawing a Crowd. David Gosselin Pedro V. Sander Jason L. Mitchell. 3D Application Research Group ATI Research

Chapter 3 ATI. Jason L. Mitchell

Chapter Answers. Appendix A. Chapter 1. This appendix provides answers to all of the book s chapter review questions.

12.2 Programmable Graphics Hardware

Dominic Filion, Senior Engineer Blizzard Entertainment. Rob McNaughton, Lead Technical Artist Blizzard Entertainment

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

GeForce4. John Montrym Henry Moreton

Building scalable 3D applications. Ville Miettinen Hybrid Graphics

The Making of Seemore WebGL. Will Eastcott, CEO, PlayCanvas

Shaders. Slide credit to Prof. Zwicker

Programmable Graphics Hardware

Tutorial on GPU Programming #2. Joong-Youn Lee Supercomputing Center, KISTI

Shaders (some slides taken from David M. course)

Technical Report. Mesh Instancing

Beginning Direct3D Game Programming: 1. The History of Direct3D Graphics

The Application Stage. The Game Loop, Resource Management and Renderer Design

Feeding the Beast: How to Satiate Your GoForce While Differentiating Your Game

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware

Vertex and Pixel Shaders:

Shadow Techniques. Sim Dietrich NVIDIA Corporation

EDAF80 Introduction to Computer Graphics. Seminar 3. Shaders. Michael Doggett. Slides by Carl Johan Gribel,

GeForce3 OpenGL Performance. John Spitzer

MAXIS-mizing Darkspore*: A Case Study of Graphic Analysis and Optimizations in Maxis Deferred Renderer

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

Basics of GPU-Based Programming

Lighting and Shading

Input Nodes. Surface Input. Surface Input Nodal Motion Nodal Displacement Instance Generator Light Flocking

Ultimate Graphics Performance for DirectX 10 Hardware

Deferred Rendering Due: Wednesday November 15 at 10pm

ECS 175 COMPUTER GRAPHICS. Ken Joy.! Winter 2014

Mattan Erez. The University of Texas at Austin

Windowing System on a 3D Pipeline. February 2005

GPU Memory Model. Adapted from:

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

The Rasterization Pipeline

Self-shadowing Bumpmap using 3D Texture Hardware

Optimisation. CS7GV3 Real-time Rendering

CREATING AND USING NORMAL MAPS - A Tutorial

The Graphics Pipeline

Programmable GPUS. Last Time? Reading for Today. Homework 4. Planar Shadows Projective Texture Shadows Shadow Maps Shadow Volumes

Direct3D API Issues: Instancing and Floating-point Specials. Cem Cebenoyan NVIDIA Corporation

Direct Rendering of Trimmed NURBS Surfaces

Shader Series Primer: Fundamentals of the Programmable Pipeline in XNA Game Studio Express

Assignment 6: Ray Tracing

Deus Ex is in the Details

Advanced Rendering & Shading

Computer Graphics with OpenGL ES (J. Han) Chapter XI Normal Mapping

GPU Performance Tools

frame buffer depth buffer stencil buffer

Whiz-Bang Graphics and Media Performance for Java Platform, Micro Edition (JavaME)

Advanced Shading and Texturing

Wednesday, July 24, 13

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Abstract. 2 Description of the Effects Used. 1 Introduction Phong Illumination Bump Mapping

The Rasterization Pipeline

Introduction to Shaders.

Introduction Rasterization Z-buffering Shading. Graphics 2012/2013, 4th quarter. Lecture 09: graphics pipeline (rasterization and shading)

1.2.3 The Graphics Hardware Pipeline

Graphics Processing Unit Architecture (GPU Arch)

Dithering and Rendering. CS116B Chris Pollett Apr 18, 2004.

DirectX 9 High Level Shading Language. Jason Mitchell ATI Research

Rendering Subdivision Surfaces Efficiently on the GPU

SGI OpenGL Shader. Marc Olano SGI

Rendering Objects. Need to transform all geometry then

CS 4620 Program 3: Pipeline

Pipeline Operations. CS 4620 Lecture 10

Illumination and Shading

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

Surface Detail Maps with Soft Self- Shadowing. Chris Green, VALVE

Project6 Lighting per-pixel shading and optimizations for Direct3D 8

Real Time Rendering of Expensive Small Environments Colin Branch Stetson University

Shaders in Eve Online Páll Ragnar Pálsson

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Pipeline Operations. CS 4620 Lecture Steve Marschner. Cornell CS4620 Spring 2018 Lecture 11

Today s Agenda. DirectX 9 Features Sim Dietrich, nvidia - Multisample antialising Jason Mitchell, ATI - Shader models and coding tips

Pipeline Operations. CS 4620 Lecture 14

3D buzzwords. Adding programmability to the pipeline 6/7/16. Bandwidth Gravity of modern computer systems

RSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog

Supplement to Lecture 22

3D Programming. 3D Programming Concepts. Outline. 3D Concepts. 3D Concepts -- Coordinate Systems. 3D Concepts Displaying 3D Models

Filtering Cubemaps Angular Extent Filtering and Edge Seam Fixup Methods

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Transcription:

Optimal Shaders Using High-Level Languages

The Good, The Bad, The Ugly All high level languages provide significant power and flexibility that: Make writing shaders easy Make writing slow shaders easy May lead to a spontaneous explosion of slow shaders grinding your game to a halt

The More Power You Have Although high level languages make it easy to write great shaders It is possible to write long inefficient shaders for the simplest of effects We do not have general purpose CPU-style computing (yet), so GPU limitations still need consideration The basic rules of programming GPUs still remain true

Example: Normalize How many cycles does normalize take? Using ps2.0 math: 2-3 cycles High Level: normalize(v); Using texture fetch (signed normalization cube map): 1 cycle High Level: texcube(s, V); Using ps1.x assembly: 1 cycle dp3 r0, v0_bx2, v0_bx2 mad r0, v0_bias, 1-r0, v0_bx2

Pixel Shader Versionitis Use lowest possible pixel shader version that will do the job Budget your shader versions Remember that DirectX 9 still supports ps1.x If you ve got a shader that fits in 1.x, use it Try compiling HLSL for ps 1.x first If not, try assembly Example: Simple diffuse lighting * texture

To Be Precise Use lowest possible precision that will do the job Fixed, half (if available), float (24 or 32 bit) in that order Use _pp modifiers in ps_2_0 asm or half type in HLSL Example: tex2d * diffuse_color Can be done in fixed No advantage to higher precision

Writing Shaders Make a budget Most pixels should use a relatively simple shader Identify key points of interest Use more complex shaders for them Develop your code in the easiest way that makes it work Then optimize it

Per Vertex!= Per Pixel Move linear computations out of the pixel shader and into the vertex shader Example: Move from world space to light space for attenuation ( Light.pos vertex.pos ) / Light.Range Moving into tangent space usually belongs pervertex Unless you are doing per-pixel reflection into a cubemap cubemap is defined in world space

Back To The Basics Move constant computations off of the GPU and on to the CPU Example: Computing Material Color * Light Color at each vertex Computing 4x3 matrix transpose in the pixel shader Watch out for row vs. column major matrices

Reflection Vector Optimizations There s no substitute for algebraic savings If a reflection vector is used to index into a cube map, its length doesn t matter Standard reflect function can be optimized from: R = 2*N*[dot(N,V)/dot(N,N)] V to R = 2*N*dot(N,V) dot(n,n)*v If N is already normalized, you can remove the dot(n,n): R = 2 * N * dot(n,v) V

Other Normalization Optimizations Other examples: Normal map lookups can be safely left unnormalized Try leaving L vector un-normalized per-pixel for diffuse bump mapping

Computing Normalize Efficiently Normalize(N) dot Normalize(L) can be computed more efficiently: The usual way: (N/ N ) dot (L/ L ) Two rsq computations! Better: = (N dot L) / ( N * L ) = (N dot L)/ ( sqrt( (N dot N) * (L dot L)) = (N dot L) * rsq ( (N dot N) * (L dot L))

Texture Lookups Texture lookups can be faster than math in many cases Free clamping, free filtering Examples : Free [0..1] clamping for N.L and N.H and self-shadow term A8R8G8B8 Normalization Cubemap L.N, N.H map in A8L8 2D Texture L.N, N.H, N.H k in A8L8 3D Texture Attenuation via 1 or 2 Texture lookups

Try Alternate Paths Compile with ps_2_a target alternate model for compilation Closely matches GeForce FX hardware Example: Dependent texture fetches under ps_2_0 target generate sub-optimal assembly for ps_2_a hardware Look at assembly output of fxc does the number of instructions match your expectations?

Clamping Efficiently C/C++ don t have native min() and max() functions But GPU high-level languages do Use max() and min() instead of if-else Why? max(a,b)!= (A>=B?A:B) Because in this case, fxc uses a cmp instruction to handle the if-else cmp is not a native NV3X instruction But min and max are native NV3X instructions

Lerps lrp instruction is more expensive than other blending operations on NV3X Look for opportunities to simplify, or remove Note that: lerp(a,b, 1-f) == lerp(b, a, f) Compiler should find this and optimize, but may not yet do that

Unique NP2 Texture Coordinates All non-power-of-two (NP2) textures and most NP2 render targets are addressed using image mode coordinates (0...width, 0...height) Texture coordinates need to be scaled by texture dimensions Driver will try to insert this scale into vertex shader

Unique NP2 Texture Coordinates Problem: If you have one set of texture coordinates used for multiple textures (allowed by ps_2_*) Driver must insert scale into pixel shader! Even if the textures are the same size Solution: Use separate texture coordinates for each NP2 texture Driver will correctly detect that they can be scaled by the vertex shader

Vectorize Scalar Constants May cause simple instructions to be broken into multiple instructions May take more shader cycles to run Packing constants means they can be accessed more readily Compilers will do a better job of packing constants over time

Per-Pixel Lighting Case Study Well understood set of solutions Different approaches to achieving the same look Diffuse Lighting: k d * (L dot N) Specular Lighting: k s * (V dot R) spec [Phong] Cheaper for many lights k s * (H dot N) spec [Blinn] Cheaper for one light

How to compute L, N and H? All should be normalized But, N will be close to normalized if from a tangent space normal map L can be normalized per-vertex Use vertex shader to put into light space for attenuation Optionally re-normalize per-pixel H can be computed per-vertex But can t be linearly interpolated per-pixel But on a well-tessellated mesh will often work fine anyway Renormalizing per-pixel not the correct vector direction but still visually plausible

Vector Normalization Normalize using math or normalization cube map lookup? For 1.x shaders, can use cube map or math (Newton- Raphson approximation) dp3 r0, v0_bx2, v0_bx2 mad r0, v0_bias, 1-r0, v0_bx2 For higher quality, use 2 16-bit signed cube maps for (x,y) and (z) In general, signed textures are either the same speed or faster than unsigned ps.2.0 HW doesn t always have a free _bx2

Specular Power Exponentiation: math or texture lookup? 1d texture: Replace power function with texture lookup for fixed exponent. Less flexible, but fast 2d texture: Second coordinate is exponent 3d texture: (u,v,w) = (L dot N, H dot N, spec exponent). Clamping on L dot N for free. Texture lookup, in general, faster than math Texture lookups less flexible than math Changing exponent per pixel for 3d texture approach will be more expensive texture cache thrashing

Per-Pixel Lighting Experimental Results float to half gave a performance increase without quality loss normalize() -> 16-bit cubemaps gave further perf increase without quality loss. 8-bit cubemaps had artifacts Clamp with a texture even faster Both for N.L and self-shadowing term for N.H If light is not too close to a triangle, H per-vertex looks good Otherwise, compute per-pixel

Per-Pixel Lighting Summary There are many variations of even simple diffuse and specular lighting techniques R.V vs N.H Per-object, Per-Vertex or Per-Pixel Exponent Gloss Factors Math, Texture or Per-Vertex Normalization Per-Vertex, Per-Pixel or Texture-based attenuation The ideal method is very app-dependent But more flexibility costs performance The more you do per-vertex, the more quantity and complexity of lighting you can afford

Shader Optimization Summary Select appropriate unit for ops: CPU, Vertex, Pixel CPU for constants, Vertex for linear calculations Select appropriate precision Use lowest pixel shader version You only get so many pixel shader cycles per frame Use them for visually interesting effects Per-Pixel Bump Maps Per-Pixel Reflections Give yourself more cycles for effects by not spending them on unneeded precision and calculation

Questions? If you see unoptimal code generated by fxc or cgc, send us e-mail We ll get it fixed Compilers are still evolving Send mail to: devrelfeedback@nvidia.com