Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Similar documents
Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques

Save the Nanosecond! PC Graphics Performance for the next 3 years. Richard Huddy European Developer Relations Manager ATI Technologies, Inc.

Today s Agenda. DirectX 9 Features Sim Dietrich, nvidia - Multisample antialising Jason Mitchell, ATI - Shader models and coding tips

Graphics Processing Unit Architecture (GPU Arch)

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

Hardware-driven Visibility Culling Jeong Hyun Kim

Performance OpenGL Programming (for whatever reason)

Optimisation. CS7GV3 Real-time Rendering

POWERVR MBX. Technology Overview

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

The Source for GPU Programming

Render-To-Texture Caching. D. Sim Dietrich Jr.

Windowing System on a 3D Pipeline. February 2005

Hardware-driven visibility culling

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

PowerVR Hardware. Architecture Overview for Developers

GeForce4. John Montrym Henry Moreton

RSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

The NVIDIA GeForce 8800 GPU

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

Could you make the XNA functions yourself?

How to Work on Next Gen Effects Now: Bridging DX10 and DX9. Guennadi Riguer ATI Technologies

CS427 Multicore Architecture and Parallel Computing

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

Mobile HW and Bandwidth

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Applications of Explicit Early-Z Z Culling. Jason Mitchell ATI Research

Graphics Hardware. Instructor Stephen J. Guy

A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization

Mobile Performance Tools and GPU Performance Tuning. Lars M. Bishop, NVIDIA Handheld DevTech Jason Allen, NVIDIA Handheld DevTools

Building scalable 3D applications. Ville Miettinen Hybrid Graphics

PowerVR Series5. Architecture Guide for Developers

Ultimate Graphics Performance for DirectX 10 Hardware

Real-World Applications of Computer Arithmetic

Drawing Fast The Graphics Pipeline

Wed, October 12, 2011

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

PowerVR Performance Recommendations The Golden Rules. October 2015

Drawing Fast The Graphics Pipeline

PowerVR Performance Recommendations. The Golden Rules

Tutorial on GPU Programming #2. Joong-Youn Lee Supercomputing Center, KISTI

Lecture 25: Board Notes: Threads and GPUs

Streaming Massive Environments From Zero to 200MPH

Drawing Fast The Graphics Pipeline

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

GCN Performance Tweets AMD Developer Relations

What s New with GPGPU?

1.2.3 The Graphics Hardware Pipeline

Here s the general problem we want to solve efficiently: Given a light and a set of pixels in view space, resolve occlusion between each pixel and

Rendering Objects. Need to transform all geometry then

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Lecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013

Feeding the Beast: How to Satiate Your GoForce While Differentiating Your Game

Enhancing Traditional Rasterization Graphics with Ray Tracing. March 2015

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Course Recap + 3D Graphics on Mobile GPUs

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems CMU , Fall 2014

Real-Time Buffer Compression. Michael Doggett Department of Computer Science Lund university

The Application Stage. The Game Loop, Resource Management and Renderer Design

Direct3D 11 Performance Tips & Tricks

Readings on graphics architecture for Advanced Computer Architecture class

GeForce3 OpenGL Performance. John Spitzer

Threading Hardware in G80

Performance Analysis and Culling Algorithms

Vulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics.

Texture Compression. Jacob Ström, Ericsson Research

The Rasterization Pipeline

Rasterization and Graphics Hardware. Not just about fancy 3D! Rendering/Rasterization. The simplest case: Points. When do we care?

Optimal Shaders Using High-Level Languages

CPU Pipelining Issues

Addressing the Memory Wall

Abstract. 2 Description of the Effects Used. 1 Introduction Phong Illumination Bump Mapping

A Reconfigurable Architecture for Load-Balanced Rendering

Scheduling the Graphics Pipeline on a GPU

E.Order of Operations

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016

GPU Architecture. Michael Doggett Department of Computer Science Lund university

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

Programming Tips For Scalable Graphics Performance

MAXIS-mizing Darkspore*: A Case Study of Graphic Analysis and Optimizations in Maxis Deferred Renderer

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

Programming Graphics Hardware

Optimizing Games for ATI s IMAGEON Aaftab Munshi. 3D Architect ATI Research

Evolution of GPUs Chris Seitz

High Performance Visibility Testing with Screen Segmentation

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Mobile 3D Devices. -- They re not little PCs! Stephen Wilkinson Graphics Software Technical Lead Texas Instruments CSSD/OMAP

Slide Set 9. for ENCM 369 Winter 2018 Section 01. Steve Norman, PhD, PEng

A Bandwidth Effective Rendering Scheme for 3D Texture-based Volume Visualization on GPU

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Real-Time Graphics Architecture

Rendering. Converting a 3D scene to a 2D image. Camera. Light. Rendering. View Plane

Transcription:

Optimizing DirectX Graphics Richard Huddy European Developer Relations Manager

Some early observations Bear in mind that graphics performance problems are both commoner and rarer than you d think The most common problem is that games are CPU-limited But you can certainly think of that as a graphics problem...

Shall we have a target? 1280x1024 (at least) 85Hz (then lower refresh rates will just work) 4xAA or 6xAA so pixels can look good Because of the variability of the platform it makes no sense to ask blindly if we are pixel-limited limited or vertex- limited etc.

So, what shall we tune for? Cache re-use VFetch,, Vertex, texture, Z All caches are totally independent of each other... Vertex shaders Pixel shaders Z buffer Frame buffer

The high-end hardware pipeline... VB IB 256 bit line size Pre-VS cache 6 Parallel VS units Post VS cache Pixel Pixel Quad Pixel Quad Quad Quad PS Unit Rasterizer Triangle setup

The pre-vs cache I Is purely a memory cache Has a common line size of 256 bits (That s s 32 bytes) Is accessible by all vertex fetches Is why vertex data is best aligned to 32 bytes or 64 bytes 40 is very much worse than 64 Truly sequential access would be great...!

The pre-vs cache II Because it s s purely a memory cache... Multiple streams can both help and hinder. Multiple streams with random access is doubly bad... Generally expect 0% to 10% hit for using additional streams

Vertex Engines 0 Consider compressing your vertex data if that helps you line things up with the 32 byte cache line... Decompress in the Vertex Shader Store compressed data in VB See previous slide for the point... This can be a significant win if it achieves alignment objectives

Vertex Engines I On any modern hardware these generally dwarf the task Just don t t do mad things here! 6 vertex engines is plenty Software is another matter:- Integrated solutions... DX7 parts...

Vertex Engines II HLSL is your best approach... Expect one op per clock per pipe Sometimes you ll get 2 ops instead... Masking out unused channels helps You can get up to 5 ops at once! I ve never seen a game which is vertex- throughput limited at interesting resolutions on modern hardware

The post-vs cache Only accessible when using indexed primitives (can give you free triangles) Operates as a FIFO Use D3DXOptimizeMesh() ATI: Is 14 entries for triangles, 15 for lines and 16 for points NV: Is 16 entries on GF2 & GF4MX, 24 entries on all shader hardware Cache Size is independent of vertex format! Use highly local wending for best results Flushed between DrawPrim() calls

Triangle setup Never a bottleneck Just joins vertices into triangles Feeds the rasterizer which simply hands out quad pixel blocks to draw

Each Quad-Pixel Shader Unit 4 pixels at once... Texture cache Depth Values Always 2x2 screen aligned. texels write or blend Frame Buffer

Texture cache Probably smaller than you d d think... Unless you thought only a few KB Partitioned over all active textures So heavy multi-texturing texturing can really hurt Wrecked by random access! Often from bump-map map into env-map Needs reuse to show benefits (i.e. don t t minify!) Usually contains uncompressed data At 8, 16, 32 or more bits per texel NV shader h/w stores DXT1 in compressed format Texture fetches are per-pixel pixel

Making Z work for you... We re faster at rejecting than at accepting... So draw roughly front to back For complex scenes consider Z pre-pass pass (not for depth_complexity=1!) Take care to Clear() Z (and stencil) Although Z is logically at the end of the shader that s s not the best way

Making Z work for you... Note that NV hardware can do double speed Z/Stencil only work when: Color-writes disabled 8-bit/component color buffer bound (not float) No user clip planes AA disabled Good for shadow rendering That s s up to 32 zixels per clock

Making Z work for you... Note that ATI hardware can do double speed Z/Stencil only work when: Color-writes disabled AA enabled Good for general rendering That s s up to 32 AA samples per clock

Depth Values Can come from:- Actual Z buffer (slow) Compressed Z (fast & lossless) Your pixel can be Z-tested Z away before the shader has run at all! If you are performing any Z compare then please try hard not to write to odepth Depth values are per-sample sample...

Bashing the depth buffer You can reduce the huge(*) early Z benefits by... Writing odepth Hits compressed Z and early Z Using alpha-test etc on visible pixels decompresses Z values Changing the Z compare mode (sometimes) Can disable Hi-Z E.g. from LESS to GREATER (*) Top class h/w can reject 256 pixels per clock!

The PS Unit I Shorter shaders generally faster Older NV hardware also benefits from smaller register footprint At the high end there is roughly 4 times as much ALU power as texture power This ratio will only go up Because available bandwidth doesn t t rise as fast as chip density So generally push more maths into here

The PS Unit II Is a 4D vector processor So try to match your math to your needs i.e. Mask out unused channels Trust the compilers to schedule things well:- You don t t worry about scheduling... PS runs once per pixel...

FB (Fog and) Blend Is not part of the PS unit You can think of it as a special function of the memory controller Although there are lots of latency hiding tricks here... This is still probably the easiest place to get B/W limited So disable blend whenever possible

Pure FB optimisations Fewer bits are written faster... 16BPP > 32BPP > 64BPP > 128BPP (here > means faster) Blending is slower than not By more than a factor of 2 ATI: Surfaces are faster if allocated earlier!

Conclusion... Several classes of optimisation: Pushing things back up the pipe: E.G. Cull early, not late Getting better parallelism: E.g. Use write masks to allow SIMD Doing less is faster than doing more: E.g. Short shaders are faster Understand what is cached: 32 byte vertices are fast! 16 bytes are faster...