Today s Agenda. DirectX 9 Features Sim Dietrich, nvidia - Multisample antialising Jason Mitchell, ATI - Shader models and coding tips

Similar documents
Save the Nanosecond! PC Graphics Performance for the next 3 years. Richard Huddy European Developer Relations Manager ATI Technologies, Inc.

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

Graphics Processing Unit Architecture (GPU Arch)

Programming Graphics Hardware

CS427 Multicore Architecture and Parallel Computing

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Evolution of GPUs Chris Seitz

What s New with GPGPU?

The Source for GPU Programming

Anatomy of AMD s TeraScale Graphics Engine

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

ECE 574 Cluster Computing Lecture 16

Graphics Hardware. Instructor Stephen J. Guy

GPU Architecture. Michael Doggett Department of Computer Science Lund university

PowerVR Hardware. Architecture Overview for Developers

Spring 2009 Prof. Hyesoon Kim

How to Work on Next Gen Effects Now: Bridging DX10 and DX9. Guennadi Riguer ATI Technologies

Spring 2011 Prof. Hyesoon Kim

Profiling and Debugging Games on Mobile Platforms

Advanced Visual Effects with Direct3D

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

Optimisation. CS7GV3 Real-time Rendering

Could you make the XNA functions yourself?

Windowing System on a 3D Pipeline. February 2005

GeForce3 OpenGL Performance. John Spitzer

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

Monday Morning. Graphics Hardware

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

Performance OpenGL Programming (for whatever reason)

Fast Tessellated Rendering on Fermi GF100. HPG 2010 Tim Purcell

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

Building scalable 3D applications. Ville Miettinen Hybrid Graphics

Hardware-driven visibility culling

Coming to a Pixel Near You: Mobile 3D Graphics on the GoForce WMP. Chris Wynn NVIDIA Corporation

Drawing a Crowd. David Gosselin Pedro V. Sander Jason L. Mitchell. 3D Application Research Group ATI Research

RSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog

OpenGL ES 2.0 : Start Developing Now. Dan Ginsburg Advanced Micro Devices, Inc.

GPU Architecture. Samuli Laine NVIDIA Research

Direct3D API Issues: Instancing and Floating-point Specials. Cem Cebenoyan NVIDIA Corporation

The Ultimate Developers Toolkit. Jonathan Zarge Dan Ginsburg

Threading Hardware in G80

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

The Application Stage. The Game Loop, Resource Management and Renderer Design

Shaders (some slides taken from David M. course)

Shader Series Primer: Fundamentals of the Programmable Pipeline in XNA Game Studio Express

The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering. Bill Mark and Kekoa Proudfoot. Stanford University

The NVIDIA GeForce 8800 GPU

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

Vulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics.

Working with Metal Overview

From Concept to Silicon

Multimedia in Mobile Phones. Architectures and Trends Lund

Canonical Shaders for Optimal Performance. Sébastien Dominé Manager of Developer Technology Tools

Xbox 360 high-level architecture

A Bandwidth Effective Rendering Scheme for 3D Texture-based Volume Visualization on GPU

Copyright Khronos Group, Page Graphic Remedy. All Rights Reserved

Shaders CSCI 4229/5229 Computer Graphics Fall 2017

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

A Trip Down The (2011) Rasterization Pipeline

Chapter Answers. Appendix A. Chapter 1. This appendix provides answers to all of the book s chapter review questions.

CS452/552; EE465/505. Clipping & Scan Conversion

Deferred Splatting. Gaël GUENNEBAUD Loïc BARTHE Mathias PAULIN IRIT UPS CNRS TOULOUSE FRANCE.

DirectX10 Effects. Sarah Tariq

Optimizing Graphics Drivers with Streaming SIMD Extensions. Copyright 1999, Intel Corporation. All rights reserved.

Introduction to Modern GPU Hardware

From Brook to CUDA. GPU Technology Conference

Shaders CSCI 4239/5239 Advanced Computer Graphics Spring 2014

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

A Reconfigurable Architecture for Load-Balanced Rendering

Mali Developer Resources. Kevin Ho ARM Taiwan FAE

Portland State University ECE 588/688. Graphics Processors

CONSOLE ARCHITECTURE

Programming Tips For Scalable Graphics Performance

Cg 2.0. Mark Kilgard

DirectX10 Effects and Performance. Bryan Dudash

Practical Performance Analysis Koji Ashida NVIDIA Developer Technology Group

Technical Report. Mesh Instancing

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory II

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory I

GPU A rchitectures Architectures Patrick Neill May

Optimizing Games for ATI s IMAGEON Aaftab Munshi. 3D Architect ATI Research

CS GPU and GPGPU Programming Lecture 7: Shading and Compute APIs 1. Markus Hadwiger, KAUST

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware

Ultimate Graphics Performance for DirectX 10 Hardware

developer.nvidia.com The Source for GPU Programming

Optimal Shaders Using High-Level Languages

Lecture 25: Board Notes: Threads and GPUs

Motivation MGB Agenda. Compression. Scalability. Scalability. Motivation. Tessellation Basics. DX11 Tessellation Pipeline

EECS 487: Interactive Computer Graphics

PowerVR Series5. Architecture Guide for Developers

Transcription:

Today s Agenda DirectX 9 Features Sim Dietrich, nvidia - Multisample antialising Jason Mitchell, ATI - Shader models and coding tips Optimization for DirectX 9 Graphics Mike Burrows, Microsoft - Performance Analysis and Solutions Coffee break 11:00 11:15 Richard Huddy, ATI - Optimization Ashutosh Rege, nvidia - Optimization Lunch break 12:30 2:00 Jeff Grills, Sony Online Entertainment - Optimizing Star Wars: Galaxies Rendering Techniques Craig Peeper, Microsoft - Advanced D3DX Effects Coffee break 4:00 4:15 Matthias Wloka, nvidia - Advanced Visual Effects David Gosselin, ATI - ATI Demo Team Visual Effects Shawn Hargreaves, Climax Brighton - Deferred Shading Gary McTaggart, Valve Software - Half-Life 2 / Source Shading

The API Since there s an API between you and the hardware it makes sense to expect that you need to know how to use it Abuse of the API can be a mighty expensive option The two commonest failures: 1 Games are CPU limited at target resolutions 2 Games fail to batch sufficiently

Huge Savings... SetRenderTarget() Let s not have too many of these please! Lock() with no flags is a danger signal! Whether that s a VB that s being rendered from Or a RenderTarget which was rendered to There are milliseconds at stake here! Also use DONOTWAIT appropriately to reclaim CPU cycles these are scarce!

Significant savings... Every DrawPrim call is a significant cost So make sure you get good value from it Every time you set state it costs you significantly Whether you set one or ten... But aggressive state filtering is no longer needed as much in DX9 One pixel is irrelevant, but millions matter... So draw Front to Back Clear() the Z/stencil buffer to make it work fast Sort by shader Set your shader constants in blocks

Compilers can be smart... At ATI we test compilers to make sure that they re good and help make them better Sample Renderman results : (Win, Draw, Lose) HLSL vs Cg on ATI (*) : 5, 7, 2 HLSL vs Cg on NV : 16, 7, 0 (*) Cg compiler failed to compile 9 samples for SM2.0 even though HLSL compiler succeeded So using HLSL seems like the logical choice Not just an industry standard but the best too

And a PC is complex Which is a bit of an understatement A 9800 Pro has a similar number of gates as two Pentium4 processors all on one die But the highly parallel design allows it to do much more work of a very specific kind So you d like to have the CPU and VPU both doing useful work at the same time Luckily the API encourages this

Which bits are fast? System: CPU 1 to 1/3 of a nanosecond (1GHz to 3GHz) System memory High latency compared to the CPU 200-800MHz (for moving data about) Virtual memory Takes all week Graphics card: VPU core 200 to 500MHz Local video memory 200 to 500MHz AGP 8X Bus: 266MHz, 2GB per second, with latency like molasses

Which bits are fast? System: CPU So the CPU is fast, but it still has too much to do All games are CPU limited Graphics card: VPU core Not a blinding fast clock, but very high throughput AGP Bus: Don t texture from here unless you have to!

Inside the VPU You have several units at your disposal Vertex fetch (memory cache) Vertex shader (xform and lighting) Vertex cache (protecting the shader from abuse) Clipper (so fast it might as well not be there ) Triangle setup Fast Z/stencil reject (quad speed rasterizer rejection) Rasterizer Pixel cache Texture cache Z buffer Blend (Yummy! Read-modify-write)

Inside the VPU Because the vertex fetch unit is just reading / caching memory it makes sense to prefer cache-aligned data formats (like 32 bytes or 64 bytes) The vertex cache only works for indexed primitives D3DXOptimizeMesh gets generic benefits!! So we recommend that all rendering is done with DrawIndexedPrimitive() and that you submit data in roughly tri-strip order

Saving nanoseconds Use shorter shaders since they re faster One op per clock is what you should expect ATI hardware can parallelise vector + scalar op pairs NVIDIA hardware prefers a small register footprint Shaders are cached on chip too So switching shader can sometimes be very fast Hand written assembly isn t usually a good bet ps.1.4 modifiers can be free on ps.2.0 hardware

Saving nanoseconds Prefer the shortest shader which does what you want Use the lowest shader model which achieves your target That way you can potentially access the ps1.4 modifiers which run in the same clock cycle But please do not sacrifice quality for speed! That can be the user s choice later on by selecting no-aa, low screen resolution etc

My favourite optimisation stories 1. 9.9 fps to 10.1 fps by using H/W TnL How many triangles per call? Just the one huh? 2. My AGP bus is too slow We must be drawing too many triangles Nope, you have a pure device using S/W 3. Just draw the damned stuff! How productive is cleverness? Static LOD is much faster than dynamic!

Questions After Ashu finishes the section please