Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

Similar documents
General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

GeForce3 OpenGL Performance. John Spitzer

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Project Gotham Racing 2 (Xbox) Real-Time Rendering. Microsoft Flighsimulator. Halflife 2

Building scalable 3D applications. Ville Miettinen Hybrid Graphics

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

Practical Performance Analysis Koji Ashida NVIDIA Developer Technology Group

Optimisation. CS7GV3 Real-time Rendering

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

Feeding the Beast: How to Satiate Your GoForce While Differentiating Your Game

Programming Graphics Hardware

Getting fancy with texture mapping (Part 2) CS559 Spring Apr 2017

Evolution of GPUs Chris Seitz

RSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog

GeForce4. John Montrym Henry Moreton

Whiz-Bang Graphics and Media Performance for Java Platform, Micro Edition (JavaME)

The Application Stage. The Game Loop, Resource Management and Renderer Design

Working with Metal Overview

Spring 2009 Prof. Hyesoon Kim

gems_ch28.qxp 2/26/ :49 AM Page 469 PART V PERFORMANCE AND PRACTICALITIES

PROFESSIONAL. WebGL Programming DEVELOPING 3D GRAPHICS FOR THE WEB. Andreas Anyuru WILEY. John Wiley & Sons, Ltd.

Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

CS 354R: Computer Game Technology

Rasterization Overview

1.2.3 The Graphics Hardware Pipeline

E.Order of Operations

Spring 2011 Prof. Hyesoon Kim

GUERRILLA DEVELOP CONFERENCE JULY 07 BRIGHTON

Many rendering scenarios, such as battle scenes or urban environments, require rendering of large numbers of autonomous characters.

Optimizing Games for ATI s IMAGEON Aaftab Munshi. 3D Architect ATI Research

PowerVR Series5. Architecture Guide for Developers

Graphics Processing Unit Architecture (GPU Arch)

Performance OpenGL Programming (for whatever reason)

Tutorial on GPU Programming #2. Joong-Youn Lee Supercomputing Center, KISTI

Windowing System on a 3D Pipeline. February 2005

Deferred Splatting. Gaël GUENNEBAUD Loïc BARTHE Mathias PAULIN IRIT UPS CNRS TOULOUSE FRANCE.

Lecture 2. Shaders, GLSL and GPGPU

Hardware-driven Visibility Culling Jeong Hyun Kim

PowerVR Performance Recommendations. The Golden Rules

GPU Memory Model Overview

GPU Memory Model. Adapted from:

Shadow Techniques. Sim Dietrich NVIDIA Corporation

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Scanline Rendering 2 1/42

How to Work on Next Gen Effects Now: Bridging DX10 and DX9. Guennadi Riguer ATI Technologies

Pump Up Your Pipeline

Dave Shreiner, ARM March 2009

Rendering Grass with Instancing in DirectX* 10

Could you make the XNA functions yourself?

NVIDIA Developer Toolkit. March 2005

Ultimate Graphics Performance for DirectX 10 Hardware

The Rasterization Pipeline

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Pipeline Operations. CS 4620 Lecture Steve Marschner. Cornell CS4620 Spring 2018 Lecture 11

Hardware-driven visibility culling

PowerVR Hardware. Architecture Overview for Developers

Programming Guide. Aaftab Munshi Dan Ginsburg Dave Shreiner. TT r^addison-wesley

Mattan Erez. The University of Texas at Austin

Mobile Performance Tools and GPU Performance Tuning. Lars M. Bishop, NVIDIA Handheld DevTech Jason Allen, NVIDIA Handheld DevTools

CS4620/5620: Lecture 14 Pipeline

Lecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013

Real-World Applications of Computer Arithmetic

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Graphics Hardware. Instructor Stephen J. Guy

LPGPU Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM 2014)

Shadows for Many Lights sounds like it might mean something, but In fact it can mean very different things, that require very different solutions.

Pipeline Operations. CS 4620 Lecture 14

The Source for GPU Programming

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Copyright Khronos Group, Page Graphic Remedy. All Rights Reserved

Mattan Erez. The University of Texas at Austin

3D buzzwords. Adding programmability to the pipeline 6/7/16. Bandwidth Gravity of modern computer systems

Today s Agenda. DirectX 9 Features Sim Dietrich, nvidia - Multisample antialising Jason Mitchell, ATI - Shader models and coding tips

PowerVR Performance Recommendations The Golden Rules. October 2015

Rendering Objects. Need to transform all geometry then

Pipeline Operations. CS 4620 Lecture 10

Drawing Fast The Graphics Pipeline

GCN Performance Tweets AMD Developer Relations

CS 130 Exam I. Fall 2015

Coding OpenGL ES 3.0 for Better Graphics Quality

Render-To-Texture Caching. D. Sim Dietrich Jr.

CS230 : Computer Graphics Lecture 4. Tamar Shinar Computer Science & Engineering UC Riverside

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware

CS451Real-time Rendering Pipeline

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST

Applications of Explicit Early-Z Z Culling. Jason Mitchell ATI Research

Profiling and Debugging Games on Mobile Platforms

Optimizing Direct3D for the GeForce 256 Douglas H. Rogers Please send me your comments/questions/suggestions

Introduction to Shaders.

Abstract. 2 Description of the Effects Used. 1 Introduction Phong Illumination Bump Mapping

CS GPU and GPGPU Programming Lecture 7: Shading and Compute APIs 1. Markus Hadwiger, KAUST

Monday Morning. Graphics Hardware

Direct3D API Issues: Instancing and Floating-point Specials. Cem Cebenoyan NVIDIA Corporation

Drawing a Crowd. David Gosselin Pedro V. Sander Jason L. Mitchell. 3D Application Research Group ATI Research

POWERVR MBX. Technology Overview

Transcription:

Graphics Performance Optimisation John Spitzer Director of European Developer Technology

Overview Understand the stages of the graphics pipeline Cherchez la bottleneck Once found, either eliminate or balance

Simplified Graphics Pipeline CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer Texture Storage + Filtering Vertices Pixels

Possible Pipeline Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer CPU/Bus Bound Vertex Bound Texture Storage + Filtering Pixel Bound

Battle Plan for Better Performance Locate the bottleneck(s) Eliminate the bottleneck (if possible) Decrease workload of the bottlenecked stage Otherwise, make it look better Balance pipeline by increasing workload of the non-bottlenecked stages

Bottleneck Identification Run App Vary FB FPS varies? Yes FB limited No Vary texture size/filtering Vary resolution Vary vertex instructions FPS varies? No FPS varies? No FPS varies? No Yes Yes Yes Texture limited Vary fragment instructions Transform limited Yes FPS varies? No Fragment limited Raster limited Vary vertex size/ AGP rate FPS varies? No Yes Transfer limited CPU limited

CPU Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer CPU/Bus Bound Vertex Bound Texture Storage + Filtering Pixel Bound

CPU Bottlenecks Application limited (most games are in some way) Driver or API limited too many state changes (bad batching) using non-accelerated paths Use VTune (Intel performance analyzer) caveat: truly GPU-limited games hard to distinguish from pathological use of API

Consolidate Small Batches Each vertex buffer/array preferably has thousands of vertices or more Draw as many triangles per call as possible ~50K DIPs/s COMPLETELY saturate 1.5GHz Pentium 4 50fps means 1,000 DIPs/frame! Up to you whether drawing 1K tri/frame or 1M tri/frame

Batch Consolidation Strategies Use degenerate triangles to join strips together Hardware culls zero-area triangles very quickly Use texture pages Use a vertex shader to batch instanced geometry VS2.0 and VP30 have 256 constant 4D vectors

Geometry Transfer Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer CPU/Bus Bound Vertex Bound Texture Storage + Filtering Pixel Bound

Geometry Transfer Bottlenecks Vertex data problems size issues (just under or over 32 bytes) non-native types (e.g. double, packed byte normals) Using the wrong API calls Immediate mode, non-accelerated vertex arrays Non-indexed primitives (e.g. gldrawarrays, DrawPrimitive) AGP misconfigured or aperture set too small

Optimising Geometry Transfer: OpenGL Static geometry display lists okay, but ARB_vertex_buffer_object is better Dynamic geometry - use ARB_vertex_buffer_object vertex size ideally multiples of 32 bytes (compress or pad) access vertices in sequential (cache friendly) pattern always use indexed primitives (i.e. gldrawelements) 16 bit indices can be faster than 32 bit

Optimising Geometry Transfer: Direct3D Static geometry: Create a write-only vertex buffer and only write to it once Dynamic geometry: Create a dynamic vertex buffer Lock with DISCARD at start of frame Then append with NOOVERWRITE until full Use NOOVERWRITE more often than DISCARD Each DISCARD takes either more time or more memory So NOOVERWRITE should be most common Never use no flags

Geometry Transform Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer CPU/Bus Bound Vertex Bound Texture Storage + Filtering Pixel Bound

Geometry Transform Bottlenecks Too many vertices Too much computation per vertex Vertex cache inefficiency

Too Many Vertices Favor triangle strips/fans over lists (fewer vertices) Use levels of detail (but beware of CPU overhead) Use bump maps to fake geometric detail

Too Much Vertex Computation: Fixed Function Avoid superflous work >3 lights (saturation occurs quickly) local lights/viewer, unless really necessary unused texgen or non-identity texture matrices Consider commuting to vertex program if (and only if) good shortcut exists example: texture matrix only needs to be 2x2 not recommended for optimizing fixed function lighting

Too Much Vertex Computation: Vertex Programs Move per-object calculations to CPU, save results as constants Leverage full spectrum of instruction set (LIT, DST, SIN,...) Leverage swizzle and mask operators to minimize MOVs Consider using shader levels of detail

Vertex Cache Inefficiency Always use indexed primitives on high-poly models Re-order vertices to be sequential in use (e.g. NVTriStrip) Favor triangle fans/strips over lists

Rasterization Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer CPU/Bus Bound Vertex Bound Texture Storage + Filtering Pixel Bound

Rasterization Rarely the bottleneck (exception: stencil shadow volumes) Speed influenced primarily by size of triangles Also, by number of vertex attributes to be interpolated Be sure to maximize depth culling efficiency

Maximize Depth Culling Efficiency Always clear depth at the beginning of each frame clear with stencil, if stencil buffer exists feel free to combine with color clear, if applicable Coarsely sort objects front to back Don t switch the direction of the depth test mid-frame Constrain near and far planes to geometry visible in frame Use scissor to minimize superfluous fragment generation for stencil shadow volumes Avoid polygon offset unless you really need it NVIDIA advice use depth bounds test for stencil shadow volumes

Texture Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer CPU/Bus Bound Vertex Bound Texture Storage + Filtering Pixel Bound

Texture Bottlenecks Running out of texture memory Poor texture cache utilization Excessive texture filtering

Conserving Texture Memory Texture resolutions should be only as big as needed Avoid expensive internal formats New GPUs allow floating point 4xfp16 and 4xfp32 formats Compress textures: Collapse monochrome channels into alpha Use 16-bit color depth when possible (environment maps and shadow maps) Use DXT compression

Poor Texture Cache Utilization Localize texture accesses beware of dependent texturing beware of non-power of 2 textures ALWAYS use mipmapping use trilinear/aniso only when necessary (more later!) Avoid negative LOD bias to sharpen texture caches are tuned for standard LODs sharpening usually causes aliasing in the distance opt for anisotropic filtering over sharpening

Excessive Texture Filtering Use trilinear filtering only when needed trilinear filtering can cut fillrate in half typically, only diffuse maps truly benefit light maps are too low resolution to benefit environment maps are distorted anyway Similarly use anisotropic filtering judiciously even more expensive than trilinear not useful for environment maps (again, distortion)

Fragment Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer CPU/Bus Bound Vertex Bound Texture Storage + Filtering Pixel Bound

Fragment Bottlenecks Too many fragments Too much computation per fragment Unnecessary fragment operations

Too Many Fragments Follow prior advice for maximizing depth culling efficiency Consider using a depth-only first pass shade only the visible fragments in subsequent pass(es) improve fragment throughput at the expense of additional vertex burden (only use for frames employing complex shaders)

Too Much Fragment Computation Use a mix of texture and math instructions (they often run in parallel) Move constant per-triangle calculations to vertex program, send data as texture coordinates Do similar with values that can be linear interpolated (e.g. fresnel) Consider using shader levels of detail Use lowest pixel shader version you can

GeForceFX-specific Optimisations Use even numbers of texture instructions Use even numbers of blending (math) instructions Use normalization cubemaps to efficiently normalize vectors Leverage full spectrum of instruction set (LIT, DST, SIN,...) Leverage swizzle and mask operators to minimize MOVs Minimize temporary storage Use 16-bit registers where applicable (most cases) Use all components in each (swizzling is free) Use ps_2_a profile in HLSL

Framebuffer Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer CPU/Bus Bound Vertex Bound Texture Storage + Filtering Pixel Bound

Minimizing Framebuffer Traffic Collapse multiple passes with longer shaders (not always a win) Turn off Z writes for transparent objects and multipass Question the use of floating point frame buffers Use 16-bit Z depth if you can get away with it Reduce number and size of render-to-texture targets Cube maps and shadow maps can be of small resolution and at 16-bit color depth and still look good Try turning cube-maps into hemisphere maps for reflections instead Can be smaller than an equivalent cube map Fewer render target switches Reuse render target textures to reduce memory footprint Do not mask off only some color channels unless really necessary

Finally... Use Occlusion Query Use occlusion query to minimize useless rendering It s cheap and easy! Examples: multi-pass rendering rough visibility determination (lens flare, portals) Caveats: need time for query to process can add fillrate overhead

Tools: NVPerfHUD Drivers now support NVPerfHUD Overlay that shows vital various statistics as the application runs Top graph shows : Number of API calls Draw*Prim*, render states, texture states, shader states Memory allocated AGP and video Bottom graph shows : GPU Idle Graphics HW not processing anything Driver Time Driver doing work (state and resource management, shader compilation) Driver Idle Driver waiting for GPU to finish Frame Time Milliseconds per frame time

NVPerfHUD - Screenshot

Conclusion Complex, programmable GPUs have many potential bottlenecks Rarely is there but one bottleneck in a game Understand what you are bound by in various sections of the scene The skybox is probably texture limited The skinned, dot3 characters are probably transfer or transform limited Exploit imbalances to get things for free

Questions, comments, feedback? John Spitzer, spit@nvidia.com