Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques

Similar documents
Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Save the Nanosecond! PC Graphics Performance for the next 3 years. Richard Huddy European Developer Relations Manager ATI Technologies, Inc.

Today s Agenda. DirectX 9 Features Sim Dietrich, nvidia - Multisample antialising Jason Mitchell, ATI - Shader models and coding tips

The Ultimate Developers Toolkit. Jonathan Zarge Dan Ginsburg

Graphics Processing Unit Architecture (GPU Arch)

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

Copyright Khronos Group, Page Graphic Remedy. All Rights Reserved

PowerVR Hardware. Architecture Overview for Developers

Optimisation. CS7GV3 Real-time Rendering

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

CS427 Multicore Architecture and Parallel Computing

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

Render-To-Texture Caching. D. Sim Dietrich Jr.

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

Hardware-driven Visibility Culling Jeong Hyun Kim

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

PowerVR Series5. Architecture Guide for Developers

Mobile Performance Tools and GPU Performance Tuning. Lars M. Bishop, NVIDIA Handheld DevTech Jason Allen, NVIDIA Handheld DevTools

The Application Stage. The Game Loop, Resource Management and Renderer Design

Siggraph Agenda. Usability & Productivity. FX Composer 2.5. Usability & Productivity 9/12/2008 9:16 AM

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Lecture 25: Board Notes: Threads and GPUs

The Source for GPU Programming

Could you make the XNA functions yourself?

Performance OpenGL Programming (for whatever reason)

Practical Performance Analysis Koji Ashida NVIDIA Developer Technology Group

! Readings! ! Room-level, on-chip! vs.!

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

GeForce4. John Montrym Henry Moreton

Drawing Fast The Graphics Pipeline

Mobile 3D Devices. -- They re not little PCs! Stephen Wilkinson Graphics Software Technical Lead Texas Instruments CSSD/OMAP

Mobile HW and Bandwidth

PowerVR Performance Recommendations. The Golden Rules

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

RSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog

POWERVR MBX. Technology Overview

Beyond Programmable Shading. Scheduling the Graphics Pipeline

Programming Tips For Scalable Graphics Performance

Ultimate Graphics Performance for DirectX 10 Hardware

Graphics Hardware. Instructor Stephen J. Guy

EECS 487: Interactive Computer Graphics

Whiz-Bang Graphics and Media Performance for Java Platform, Micro Edition (JavaME)

Hardware-driven visibility culling

How to Work on Next Gen Effects Now: Bridging DX10 and DX9. Guennadi Riguer ATI Technologies

Introduction to Shaders.

Profiling and Debugging Games on Mobile Platforms

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

GeForce3 OpenGL Performance. John Spitzer

Rendering Grass with Instancing in DirectX* 10

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

PowerVR Performance Recommendations The Golden Rules. October 2015

Mali Developer Resources. Kevin Ho ARM Taiwan FAE

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

CONSOLE ARCHITECTURE

Rendering Objects. Need to transform all geometry then

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

The NVIDIA GeForce 8800 GPU

Vulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics.

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

Applications of Explicit Early-Z Z Culling. Jason Mitchell ATI Research

Introduction to the Direct3D 11 Graphics Pipeline

Wed, October 12, 2011

Building scalable 3D applications. Ville Miettinen Hybrid Graphics

Coming to a Pixel Near You: Mobile 3D Graphics on the GoForce WMP. Chris Wynn NVIDIA Corporation

Streaming Massive Environments From Zero to 200MPH

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization

Direct3D 11 Performance Tips & Tricks

Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems CMU , Fall 2014

Order Matters in Resource Creation

Enhancing Traditional Rasterization Graphics with Ray Tracing. March 2015

PowerVR Performance Recommendations. The Golden Rules

Rasterization and Graphics Hardware. Not just about fancy 3D! Rendering/Rasterization. The simplest case: Points. When do we care?

GCN Performance Tweets AMD Developer Relations

Course Recap + 3D Graphics on Mobile GPUs

Chapter Answers. Appendix A. Chapter 1. This appendix provides answers to all of the book s chapter review questions.

Optimizing Games for ATI s IMAGEON Aaftab Munshi. 3D Architect ATI Research

What s New with GPGPU?

GPU Architecture. Michael Doggett Department of Computer Science Lund university

Maximizing Face Detection Performance

GPU Architecture and Function. Michael Foster and Ian Frasch

printf Debugging Examples

Mali-400 MP: A Scalable GPU for Mobile Devices Tom Olson

Current Trends in Computer Graphics Hardware

CS451Real-time Rendering Pipeline

Lecture 2. Shaders, GLSL and GPGPU

Analyze and Optimize Windows* Game Applications Using Intel INDE Graphics Performance Analyzers (GPA)

Drawing Fast The Graphics Pipeline

Scheduling the Graphics Pipeline on a GPU

A Reconfigurable Architecture for Load-Balanced Rendering

Feeding the Beast: How to Satiate Your GoForce While Differentiating Your Game

Threading Hardware in G80

Real-World Applications of Computer Arithmetic

Windowing System on a 3D Pipeline. February 2005

Transcription:

Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques Jonathan Zarge, Team Lead Performance Tools Richard Huddy, European Developer Relations Manager ATI Technologies, Inc.

March 31, 2006 2 Outline DX9 Optimization Techniques: Richard Huddy 25 minutes Graphics pipeline overview Optimizing at each pipeline stage Tips for writing efficient code Performance Tools: Jonathan Zarge 30 minutes ATI developer performance tools overview PerfDash our new real-time performance analysis tool PIX plug-in track ATI hardware counters in PIX gdebugger OpenGL debugging - now with support for ATI performance counters ATI Content Creation Tools: Jonathan Zarge 5 minutes

DirectX 9 Optimization Techniques

Some early observations Graphics performance problems are both commoner and rarer than you d think The most common problem is that games are CPU-limited But you can certainly think of that as a graphics problem... As it s most often caused by graphics abuse March 31, 2006 4

There s plenty of mileage in Instancing Available on all ATI s recent hardware That s all SM3 hardware On ATI s SM2b hardware thru a simple backdoor Use Instancing for objects up to ~100 polys Batching You will be CPU limited if you don t send your triangles in large groups You can think of this as pretty much a fixed overhead per Draw call in DX9, much less in DX10 March 31, 2006 5

DirectX9 State Changes Top 5 by cost: SetPixelShaderConstant() SetPixeShader() SetVertexShaderConstant() SetVertexShader() SetTexture() So try to avoid these when you can March 31, 2006 6

Unified shaders? Think cross platform Xbox 360 When it happens the dynamics of PC graphics will change radically March 31, 2006 7

Shall we have a target? 1600x1200 and 1280x1024 (at least) 85Hz (then lower refresh rates will just work) 4xAA or 6xAA so pixels can look good Because of the variability of the platform it makes no sense to ask blindly if we are pixel-limited or vertex-limited etc. [And with U.S. that idea stops making sense ] March 31, 2006 8

Down the bottleneck pipeline Cache re-use VFetch, Vertex, texture, Z All caches are totally independent of each other... Vertex shaders Pixel shaders Z buffer Frame buffer March 31, 2006 9

The pre-vs cache I Is purely a memory cache Has a common line size of 256 bits (That s 32 bytes) Is accessible by all vertex fetches Is why vertex data is best aligned to 32 bytes or 64 bytes 44 is very much worse than 64 Roughly sequential access should be your aim March 31, 2006 10

The pre-vs cache II Because it s purely a memory cache... Multiple streams can both help and hinder. Multiple streams with random access is doubly bad... Generally expect 0% to 10% hit for using additional streams March 31, 2006 11

Vertex Engines I Consider compressing your vertex data if that helps you line things up with the 32 byte cache line... Decompress in the Vertex Shader Store compressed data in VB See previous slide for the point... This can be a significant win if it achieves some key alignment objectives March 31, 2006 12

Vertex Engines II HLSL is your best approach... We recommend that you compile with optimisations disabled we ll get to know more that way and usually do better Expect one op per clock per pipe Sometimes you ll get 2 ops instead... Masking out unused channels helps You can get up to 5 ops at once! I ve never seen a game which is vertexthroughput limited at interesting resolutions on modern hardware March 31, 2006 13

The post-vs cache Only accessible when using indexed primitives (can give you free triangles) Operates as a FIFO Use D3DXOptimizeMesh() Is 14 entries for triangles, 15 for lines and 16 for points Cache Size is independent of vertex format! Use highly local wending for best results Flushed between DrawPrim() calls March 31, 2006 14

Triangle setup Never a bottleneck Just joins vertices into triangles Feeds the rasterizer which simply hands out quad pixel blocks to draw March 31, 2006 15

March 31, 2006 16 A Quad-Pixel Processing Unit 4 pixels at once... [Texture cache] Depth Values Always 2x2 screen aligned. texels write or blend Frame Buffer

Texture cache Probably smaller than you d think... Unless you thought only a few KB Partitioned over all active textures So heavy multi-texturing can really hurt Modern hardware has efficient fully associative caches Wrecked by random access! Often from bump-map into env-map Needs reuse to show benefits (i.e. don t minify!) Usually contains uncompressed data At 8, 16, 32 or more bits per texel Some hardware stores DXT1 in compressed format Texture fetches are per-pixel March 31, 2006 17

Making Z work for you... We re faster at rejecting than at accepting... So draw roughly front to back For complex scenes consider Z pre-pass (not for depth_complexity=1!) Take care to Clear() Z (and stencil) Although Z is logically at the end of the shader that s not the best way March 31, 2006 18

Making Z work for you... Note that ATI hardware can do double speed Z/Stencil only work when: Color-writes disabled AA enabled Good for general rendering AA is your default, yes? That s up to 32 AA Z values per clock March 31, 2006 19

Depth Values Can come from:- Actual Z buffer (slow) Compressed Z (fast & lossless) Your pixel can be Z-tested away before the shader has run at all! If you are performing any Z compare then please try hard not to write to odepth Remember that depth values are per-sample... March 31, 2006 20

Bashing the depth buffer You can reduce the huge(*) early Z benefits by... Writing odepth Kills compressed Z and early Z Using alpha-test etc on visible pixels decompresses Z values Changing the Z compare mode (sometimes) Can disable Hi-Z E.g. from LESS to GREATER (*) Top class h/w can reject 256 pixels per clock! March 31, 2006 21

The PS Unit I Shorter shaders generally faster And we can cache them on chip too At the high end there is roughly 4 times as much ALU power as texture power This ratio will only go up Because available bandwidth doesn t rise as fast as chip density So generally push more maths into here March 31, 2006 22

The PS Unit II Is a 4D vector processor So try to match your math to your needs i.e. Mask out unused channels Trust the compilers to schedule things well:- You don t worry about scheduling... PS runs once per pixel... March 31, 2006 23

FB (Fog and) Blend Is not part of the PS unit You can think of it as a special function of the memory controller Although there are lots of latency hiding tricks here... This is still probably the easiest place to get B/W limited So disable blend whenever possible March 31, 2006 24

Pure FB optimisations Fewer bits are written faster... 16BPP > 32BPP > 64BPP > 128BPP (here > means faster) Blending is slower than not Often by more than a factor of 2 ATI: Surfaces are faster when allocated earlier! March 31, 2006 25

PS Dynamic Flow Control DFC can be a significant benefit But only when the selection coherency is at least as big as the hardware batch size Hardware Batch Size X1800 16 pixels X1900 48 pixels Xenos 48 pixels March 31, 2006 26

Conclusion... Several classes of optimisation: Pushing things back up the pipe: E.G. Cull early, not late Getting better parallelism: E.g. Use write masks in your shader code to allow SIMD Doing less is faster than doing more: E.g. Short shaders are faster Understand what is cached: 32 byte vertices are fast! 16 bytes are faster... March 31, 2006 27

Performance Tools

March 31, 2006 29 The Plan Overview of ATI developer performance tools PerfDash ATI PIX Plugin gdebugger ATI content creation tools

March 31, 2006 30 ATI Developer Performance Tools Application OpenGL API Direct3D API gdebugger (Graphic Remedy) DirectX Runtime Driver Direct3D DDI ATI PerfServer PerfDash Hardware

March 31, 2006 31 The Plan Overview of ATI developer performance tools PerfDash ATI PIX Plugin gdebugger ATI content creation tools

PerfDash Overview PerfDash: Performance Dashboard Real-time visualization: API statistics Hardware counters Driver data Virtual counters Local or remote performance profiling Overriding rendering states Loading/saving session data and preferences No special driver No code modifications Plugin architecture March 31, 2006 32

PerfDash Features: Toolbar Toggle global data collection Connect to local or remote machine API, Hardware, All modes Performance server must be running on target machine Virtually no performance impact if not running PerfDash March 31, 2006 33

PerfDash Features: API Statistics Per-frame API call data Sorting of API call counts and times Flexible plotting of all numeric data Plot window properties control appearance Real-time state overrides March 31, 2006 34

PerfDash Features: Hardware Hardware counter values 3D/TCL clocks Primitive counts ALU instructions executed Select custom set of counters Driver data Framerate Memory in use Prims per state change Plotting of all numeric data March 31, 2006 35

PerfDash Features: Virtual Counters Virtual (derived) counters Hardware busy % TCL busy % Pixels passed z-test Temperature bars Plotting of all numeric data March 31, 2006 36

Performance Tuning w/perfdash CPU vs GPU balance Is the GPU saturated? TCL percentage Is vertex processing the bottleneck? Pixel shader bottleneck Draw*Primitive per state change Good indication of batch size Inefficient use of API High memory usage Texture bandwidth and filtering Efficient use of z-buffer and sorting Wireframe for geometry visualization March 31, 2006 37

PerfDash Demo March 31, 2006 38

PerfDash Future Future enhancements Record/playback/step through API calls Support external plugins Support for different platforms More data: pixel stats, render states, buffers, surfaces Bottleneck detection Custom virtual counters Support for OpenGL applications PerfDash release schedule Beta in Q1 06 First release Q2 06 March 31, 2006 39

March 31, 2006 40 The Plan Overview of ATI developer performance tools PerfDash ATI PIX Plugin gdebugger ATI content creation tools

PIX for Windows Based on Xbox PIX Record/playback/visualize model Direct3D API level Numerous included counters Plugin API for additional counters Image comparison Display render states, textures, objects, shaders, vertex decl. 3 recording modes Gather statistics (counters) Record D3D calls Record playable stream information March 31, 2006 41

ATI PIX Plugin Communicates with ATI Driver (through D3D API) Virtual counters Computed from actual hardware counters/driver data Can be graphed with other performance counters Counter examples: Hardware/TnL Busy Vertex Fetch Busy Triangle/Line/Point Count Total/Blended/Pass Z Pixels Local/AGP Texture Memory Used Primitives per RS/TS/PSC/VSC Stalls on Flip/VB March 31, 2006 42

March 31, 2006 43 The Plan Overview of ATI developer performance tools PerfDash ATI PIX Plugin gdebugger ATI content creation tools

gdebugger adds ATI performance counters gdebugger & gdebugger ES: A professional OpenGL and OpenGL ES Debugger and Profiler Shortens developer time required for debugging and profiling OpenGL and OpenGL ES based applications Integrated with ATI performance counters to find graphic pipeline performance bottlenecks and optimize application performance More details: www.gremedy.com ATI Workstation SDK March 31, 2006 44

gdebugger Demo March 31, 2006 45

March 31, 2006 46 The Plan Overview of ATI developer performance tools PerfDash ATI PIX Plugin gdebugger ATI content creation tools

ATI Content Creation Tools RenderMonkey Shader development environment Supports HLSL, D3D assembler and GLSL New version available soon! Compressonator Tool for compressing textures and creating mip-map levels CubeMapGen Tool for creating filtered cube maps without seams Uses angular extent filtering NormalMapper Automatic normal map generation tool Traces rays from low resolution geometry to high resolution geometry March 31, 2006 47

Wrap up Graphics features are a key product differentiator Increasing graphics efficiency can lead to Playable performance on more hardware More time for non-graphics functions Richer stunning effects for your game How do you ascend to these graphic heights? Incorporate the optimization techniques in to your code Utilize the performance tools in your development process March 31, 2006 48

Thank You! Questions? JZarge@ati.com RHuddy@ati.com devrel@ati.com http://www.ati.com/developer/ March 31, 2006 49