Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager


Also on today from ATI...
- 12:00pm to 1:00pm: Precomputed Radiance Transfer and Spherical Harmonic Lighting Techniques for Games. Speaker: Chris Oat
- 2:30pm to 3:30pm: Bringing Hollywood to Real-time. Speaker: Abe Wiley
- 4:00pm to 6:15pm (short break at 5:00pm): Effects Breakdown - How'd They Do That. Speakers: Thorsten Scheuermann, Natalya Tatarchuk, Daniel Ginsburg
- And Jonathan Zarge is presenting our tools strategy on the ATI stand at 6PM Thursday, including some new performance tools...

Early observations
- Graphics is a hard problem
- Look at the trends:
  - Mem B/W doesn't grow fast enough
  - We're having to generate more heat
  - Graphics chip complexity rises much faster than CPU complexity
- Although we're doing a lot to improve the straight-line speed, people want to turn more corners

High End hardware's target?
- 1600x1200 at 85Hz (then lower refresh rates will just work)
- 6xAA, so pixels can look good
- Because of the variability of the platform, it makes no sense to ask blindly if we are pixel-limited or vertex-limited etc.

So, what shall we tune for?
- Cache re-use: VFetch, vertex, texture, Z
  - All caches are totally independent of each other, i.e. writing to one never invalidates another
- Vertex shaders
- Pixel shaders
- Z buffer
- Frame buffer

The X800/X850 hardware pipeline...
[Diagram: VB and IB feed the pre-VS cache (256-bit line size), followed by 6 parallel 5D VS units, the post-VS cache, triangle setup, the rasterizer, and 4 pixel quads of 4D PS units.]

The pre-VS cache I
- Is purely a memory cache
- Has a common line size of 256 bits (that's 32 bytes)
- Is accessible by all vertex fetches
- Is why vertex data is best aligned to 32 bytes or 64 bytes
  - 40 is very much worse than 64
- Truly sequential access would be great...!
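As a rough illustration of the alignment point (this sketch is not from the talk), the Python below counts how many 32-byte lines a single vertex fetch overlaps. It assumes tightly packed vertices in a buffer that starts on a line boundary, and it ignores any reuse between neighbouring fetches, which is closest to the random-access case. The function names are illustrative only.

```python
LINE_BYTES = 32  # 256-bit cache line, as quoted on the slide


def lines_touched(offset: int, size: int, line: int = LINE_BYTES) -> int:
    """Number of cache lines overlapped by bytes [offset, offset + size)."""
    first = offset // line
    last = (offset + size - 1) // line
    return last - first + 1


def fetch_efficiency(stride: int, count: int = 1024) -> float:
    """Useful bytes divided by bytes fetched, for isolated vertex fetches."""
    fetched = sum(lines_touched(i * stride, stride) for i in range(count)) * LINE_BYTES
    return (stride * count) / fetched


if __name__ == "__main__":
    for stride in (32, 40, 64):
        print(stride, fetch_efficiency(stride))
```

Under these assumptions a 40-byte vertex always straddles two lines, so it costs the same 64 bytes of traffic as a 64-byte vertex while using only 62.5% of what was fetched. Real hardware behaviour will differ in detail.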

The pre-VS cache II
- Because it's purely a memory cache...
  - Multiple streams can both help and hinder
  - Multiple streams with random access is doubly bad...
  - Generally expect a 0% to 10% performance hit for using additional streams
- It is surprisingly easy to get bottlenecked here by unfriendly data
  - That's because unfriendly data really can require 10x as much B/W as friendly data
  - Games (and benchmarks) do get stuck here

Vertex Engines 0
- Consider compressing your vertex data if that helps you line things up with the 32-byte cache line...
  - Decompress in the vertex shader
  - Store compressed data in the VB
- See previous slide for the point...
- This can be a significant win if it achieves alignment objectives
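A minimal sketch of the idea, assuming a hypothetical vertex layout (position, normal, tangent, UV, colour) that is not taken from the talk. Storing the normal and tangent as four signed bytes each, instead of three floats each, shrinks the vertex from 48 to 32 bytes. The `unpack_unit_vector` function stands in for the rescale a vertex shader would perform.

```python
import struct

# Hypothetical layouts: position, normal, tangent, uv, colour.
FAT_FORMAT = "<3f3f3f2f4B"     # normal and tangent as 3 floats each
PACKED_FORMAT = "<3f4b4b2f4B"  # normal and tangent as 4 signed bytes each


def pack_unit_vector(v):
    """Quantise a unit vector to 3 signed bytes plus one byte of padding."""
    return tuple(round(c * 127) for c in v) + (0,)


def unpack_unit_vector(b):
    """Stand-in for the vertex shader: scale the bytes back to [-1, 1]."""
    return tuple(c / 127.0 for c in b[:3])


def pack_vertex(pos, normal, tangent, uv, colour):
    """Build one compressed vertex as it would be stored in the VB."""
    return struct.pack(PACKED_FORMAT, *pos, *pack_unit_vector(normal),
                       *pack_unit_vector(tangent), *uv, *colour)
```

The trade-off is a small quantisation error (at most about half of 1/127 per component here) in exchange for a vertex that fits exactly one 32-byte line.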

Vertex Engines I
- On any modern hardware these generally dwarf the task
- Just don't do mad things here!
- 6 vertex engines delivering over 700 million vertices per second
- Software is another matter:
  - Integrated solutions...
  - DX7 parts...
  - You may struggle to get 7 million vertices per second

Vertex Engines II
- HLSL is your best approach...
- Expect one op per clock per pipe
- Often you'll get 2 ops instead... but you won't know that
  - X800 contains both a 4D and a 1D vertex ALU, which can both do useful work in the same clock cycle
  - So be specific about scalar operations...
  - Masking out unused channels helps
- I've never seen a game which is vertex-throughput limited at interesting resolutions on modern hardware
  - Particularly true at our target settings of 1600x1200 and 6xAA

The post-VS cache
- Only accessible when using indexed primitives (can give you free triangles)
- Operates as a FIFO
- Use D3DXOptimizeMesh(), which knows about the current hardware
  - Which means you must not pre-process this!
- Cache has 14 entries for triangles, 15 for lines, 16 for points
- Cache size is independent of vertex format!
- Use highly local wending for best results
- Flushed between DrawPrim() calls
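To make the FIFO behaviour concrete, here is a small Python model (an illustration, not the hardware's actual implementation) that counts how many times the vertex shader would have to run for one indexed draw call. It assumes a 14-entry cache for triangles that starts empty, matching the "flushed between DrawPrim() calls" point. The function name is made up for this sketch.

```python
from collections import deque


def count_vertex_shader_runs(indices, cache_size=14):
    """Count post-VS cache misses for one indexed draw call, FIFO policy."""
    cache = deque(maxlen=cache_size)  # oldest entry falls out when full
    misses = 0
    for index in indices:
        if index not in cache:
            misses += 1
            cache.append(index)  # a hit does not refresh the entry (FIFO, not LRU)
    return misses


if __name__ == "__main__":
    strip_like = [0, 1, 2, 2, 1, 3, 2, 3, 4, 4, 3, 5]
    runs = count_vertex_shader_runs(strip_like)
    print(runs / (len(strip_like) // 3))  # vertex shader runs per triangle
```

Dividing the miss count by the triangle count gives a simple figure of merit for an index ordering: the more local the ordering, the closer it gets to the "free triangles" case.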

Triangle setup
- Never a bottleneck
- Just joins vertices into triangles
- Feeds the rasterizer, which simply hands out quad pixel blocks to draw

Each Quad-Pixel Shader Unit I
- 4 pixels at once...
- Always 2x2 screen aligned.
[Diagram: quad-pixel shader unit connected to the texture cache (texels), the depth buffer, and the frame buffer (write or blend).]

Each Quad-Pixel Shader Unit II
- The quad is allocated to just one triangle at a time
  - But each quad might be working on unrelated triangles
  - Which means a quad could be 25% efficient if the triangle is an inconvenient size...
- Small triangles make good use of the vertex pipe, but at the expense of the pixel pipe efficiency
- Shares all gradient calculations etc.
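The 25% figure can be reproduced with a small model, sketched here in Python under the assumption that every screen-aligned 2x2 block touched by a triangle occupies all four lanes of a quad. The function name and the pixel-set input are illustrative.

```python
def quad_efficiency(pixels):
    """Useful lanes divided by occupied lanes for one triangle's covered pixels.

    `pixels` is a non-empty iterable of integer (x, y) screen coordinates.
    """
    pixels = set(pixels)
    # Each screen-aligned 2x2 block that contains a covered pixel costs 4 lanes.
    quads = {(x // 2, y // 2) for x, y in pixels}
    return len(pixels) / (4 * len(quads))


if __name__ == "__main__":
    print(quad_efficiency([(0, 0)]))                          # one-pixel triangle
    print(quad_efficiency([(1, 1), (2, 1), (1, 2), (2, 2)]))  # misaligned 2x2 block
```

Both cases above come out at 0.25: a single pixel wastes three lanes, and four pixels that straddle four different quads waste twelve.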

Texture cache
- Probably smaller than you'd think...
  - Unless you thought only a few KB
- Partitioned over all active textures
  - So heavy multi-texturing can really hurt
- Wrecked by random access!
  - Often from bump-map into env-map
- Needs reuse to show benefits (i.e. don't minify!)
- Contains uncompressed data
  - At 8, 16, 32 or more bits per texel
- Texture fetches are per-pixel

Making Z work for you...
- We're faster at rejecting than at accepting...
  - So draw roughly front to back
  - For complex scenes consider a Z pre-pass (not for depth_complexity=1!)
- Take care to Clear() Z (and stencil)
- Although Z is logically at the end of the shader, that's not the best way for the hardware to work
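A toy model of why draw order matters, written in Python as an illustration only. It looks at a single pixel, assumes opaque geometry with Z writes enabled and a LESS compare, and assumes that a fragment failing the depth test never reaches the pixel shader. The function name is hypothetical.

```python
def shaded_fragments(depths_in_draw_order):
    """Count fragments at one pixel that pass a LESS depth test and get shaded."""
    nearest = float("inf")  # depth buffer contents after Clear()
    shaded = 0
    for depth in depths_in_draw_order:
        if depth < nearest:  # passes the test, so the pixel shader runs
            nearest = depth
            shaded += 1
    return shaded


if __name__ == "__main__":
    layers = [3.0, 1.0, 2.0]
    print(shaded_fragments(layers))                        # arbitrary order
    print(shaded_fragments(sorted(layers)))                # front to back
    print(shaded_fragments(sorted(layers, reverse=True)))  # back to front
```

Front-to-back order shades the pixel once regardless of depth complexity, while back-to-front order shades it once per layer. A rough sort is enough to capture most of that saving.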

Making Z work for you...
- Note that ATI hardware can do double-speed Z/stencil-only work when:
  - Color writes are disabled
  - AA is enabled
- Good for general rendering
- That's up to 32 AA samples per clock

Depth Values
- Can come from:
  - Actual Z buffer (slow)
  - Compressed Z (fast & lossless)
- Your pixel can be Z-tested away before the shader has run at all!
- If you are performing any Z compare then please try hard not to write to oDepth
- Depth values are per-sample...

Bashing the depth buffer
- You can reduce the huge(*) early Z benefits by...
  - Writing oDepth
    - Hits compressed Z and early Z
  - Using alpha-test etc. on visible pixels
    - Decompresses Z values
  - Changing the Z compare mode (sometimes)
    - Can disable Hi-Z
    - E.g. from LESS to GREATER
- (*) X800 etc. can reject 256 pixels per clock!

The PS Unit 0
- A basic optimisation here is to avoid doing work in the pixel shader which can be done earlier in the pipeline
  - E.g. never multiply two constants together in the PS; instead do that at the earliest possible point back up the pipe
  - If you can do it in the VS, do it there, not in the PS (because pixels are commoner than vertices)
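The constant-folding advice can be sketched as follows. This is plain Python standing in for shader code, with made-up function names; the list comprehension plays the role of the per-pixel shader, and the single multiply in `shade_hoisted` stands for work done once on the CPU when the shader constants are set.

```python
def shade_naive(texels, a, b):
    """Multiplies the two constants again for every pixel."""
    return [(a * b) * t for t in texels]


def shade_hoisted(texels, a, b):
    """Multiplies the constants once, before the per-pixel work."""
    ab = a * b  # done once per draw, not once per pixel
    return [ab * t for t in texels]


def multiplies(pixel_count, hoisted):
    """Multiply count for one draw under each scheme."""
    return pixel_count + 1 if hoisted else 2 * pixel_count


if __name__ == "__main__":
    pixels = 1600 * 1200  # one full-screen pass at the target resolution
    print(multiplies(pixels, hoisted=False), multiplies(pixels, hoisted=True))
```

Both functions return the same values; the hoisted form simply does about half the multiplies for a full-screen pass.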

The PS Unit I
- Shorter shaders are generally faster
  - And we store multiple shaders in the shader cache, which makes switching faster too
- At the high end there is roughly 4 times as much ALU power as texture power
  - This ratio will only go up
  - Because available bandwidth doesn't rise as fast as chip density
- So generally push more maths into here

The PS Unit II
- Is a 4D vector processor
  - So try to match your math to your needs
  - i.e. mask out unused channels
- Trust the compilers to schedule things well:
  - You don't worry about scheduling...
- PS runs once per pixel...

FB (Fog and) Blend
- Is not part of the PS unit
- You can think of it as a special function of the memory controller
- Although there are lots of latency hiding tricks here...
  - This is still probably the easiest place to get B/W limited
- So disable blend whenever possible

Pure FB optimisations
- Fewer bits are written faster...
  - 16BPP > 32BPP > 64BPP > 128BPP (here > means faster)
- Blending is slower than not
  - By more than a factor of 2
- ATI: Surfaces are faster if allocated earlier!
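A back-of-the-envelope Python sketch of the frame-buffer traffic behind these points. It is a simplification: it ignores AA samples, compression and caching, and it models blending as exactly one read plus one write of the destination, which only gives a lower bound on the "more than a factor of 2" quoted above. The function name is illustrative.

```python
def frame_buffer_bytes(width, height, bits_per_pixel, blended=False):
    """Bytes moved for one full-screen pass, ignoring AA, compression and caches."""
    write_bytes = width * height * bits_per_pixel // 8
    # Assumption: blending reads the destination once and writes it once.
    return write_bytes * 2 if blended else write_bytes


if __name__ == "__main__":
    for bpp in (16, 32, 64, 128):
        print(bpp, frame_buffer_bytes(1600, 1200, bpp),
              frame_buffer_bytes(1600, 1200, bpp, blended=True))
```

At 1600x1200 a single opaque 32BPP pass moves 7,680,000 bytes in this model, and every doubling of the pixel format doubles that figure.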

Some cases...
- 3DMark05
  - Sometimes vertex-fetch limited
  - Large batches, good efficiency
- Un-named games:
  - 85 fps no matter what...
  - 9.9 fps leaps to 10.1 fps!
  - Wrong flags push vertex data through the S/W vertex engine

Conclusion...
- Several classes of optimisation:
  - Push things back up the pipe: e.g. cull early, not late
  - Get better parallelism: e.g. use write masks to allow SIMD
  - Doing less is faster than doing more: e.g. short shaders are faster
  - Understand what is cached: 32-byte vertices are fast, 16 bytes are faster. Indexed random access is very bad!