Save the Nanosecond! PC Graphics Performance for the next 3 years. Richard Huddy European Developer Relations Manager ATI Technologies, Inc.



A funny thing happened to me
- ATI is now broadly recognised and highly recommended amongst high-end gamers

Another DX performance talk?
- Because although this has been my pet subject for 7 years, there's still complexity to work out
- Like:
  - Choosing sort criteria
  - Preferred ways of handling dynamic data
  - The best way to express a pixel shader algorithm

Nanoseconds
- There are lots of them...
- But, once they're gone... they're gone...
- If a game lasts for around 40 hours of play, that's roughly 10^14 nanoseconds...
- Each frame says goodbye to roughly 10 million of these puppies
- Each VPU clock tick is roughly 2 ns
  - 500MHz is a fast VPU
- [Each CPU tick is roughly 1/2 ns.]
  - 2GHz is a modest CPU
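The figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch; the 100 frames-per-second rate is an assumption chosen because it matches the "10 million nanoseconds per frame" figure, and the function names are made up for illustration.

```python
NS_PER_SECOND = 1_000_000_000


def game_nanoseconds(hours_of_play):
    """Total nanoseconds in a play-through of the given length."""
    return hours_of_play * 3600 * NS_PER_SECOND


def frame_budget_ns(frames_per_second):
    """Nanoseconds available for one frame at the given frame rate."""
    return NS_PER_SECOND / frames_per_second


def clock_tick_ns(clock_hz):
    """Duration of one clock tick in nanoseconds."""
    return NS_PER_SECOND / clock_hz


def ticks_per_frame(clock_hz, frames_per_second):
    """How many clock ticks a processor gets per frame."""
    return clock_hz / frames_per_second
```

Under these assumptions, a 500MHz VPU running at 100 frames per second gets about five million ticks per frame, which is the whole budget for every vertex and pixel in that frame.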

Save the nanosecond
- There's an English saying: "Look after the pennies and the pounds will look after themselves"
- In a sense, pennies are just small pounds
- But delivering fast frames requires you to save millions of nanoseconds
- And you can't get rich by saving a dollar every now and then...

The DirectX API
- Since there's an API between you and the hardware, it makes sense to expect that you need to know how to use it
- Abuse of the API can be a mighty expensive option
- And this is an incredibly common problem

Huge Savings...
- Don't create resources within the performance-sensitive part of your code
  - Offline: Compressing textures
  - Install time: Optimize vertex sequences (D3DXOptimizeMesh)
  - Start-up time: Create VBs, IBs, RTs, etc.
  - Game loop: Create nothing at all

Huge Savings...
- SetRenderTarget()
  - Let's not have too many of these, please!
  - Single-digit counts are good
- Avoid Lock() with zero for flags
  - Whether that's a VB that's being rendered from
  - Or a RenderTarget which was rendered to
  - Because there are milliseconds at stake here!
- Also use DONOTWAIT appropriately to reclaim CPU cycles; these are scarce!

Significant savings...
- Every DrawPrim call is a significant cost
  - So make sure you get good value from it
- Every time you set any state it costs you
  - Whether you set one or ten...
  - But aggressive state filtering is no longer needed so much in DX9
- One pixel is irrelevant, but millions matter...
  - Clear() the Z/stencil buffer to make it work fast
  - Sort front to back
  - Sub-sort by shader
- Set your shader constants in blocks
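One way to read "sort front to back, sub-sort by shader" is as a two-part sort key. The sketch below is not from the talk: `DrawCall`, the coarse `depth_bucket` size and the helper names are all hypothetical, and bucketing the depth is only one possible way of letting the shader sub-sort take effect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrawCall:
    name: str
    view_depth: float  # distance from the camera, smaller is nearer
    shader_id: int


def sort_key(call, depth_bucket=10.0):
    """Coarse front-to-back bucket first, then shader within each bucket."""
    return (int(call.view_depth // depth_bucket), call.shader_id)


def sort_draw_calls(calls, depth_bucket=10.0):
    """Return the draw calls ordered front to back, sub-sorted by shader."""
    return sorted(calls, key=lambda c: sort_key(c, depth_bucket))


def count_shader_switches(calls):
    """Count how often the shader changes when the calls are issued in order."""
    switches = 0
    current = None
    for call in calls:
        if call.shader_id != current:
            switches += 1
            current = call.shader_id
    return switches
```

A larger bucket groups more calls by shader at the cost of a looser depth order, so the bucket size is the knob that trades state changes against overdraw.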

Compilers are smart...
- At ATI we test compilers to make sure that they're good, and help make them better
- Sample results show (Win, Draw, Lose):
  - HLSL vs Cg on ATI*: 5, 7, 2
  - HLSL vs Cg on NV: 16, 7, 0
  - (*) The Cg compiler failed to compile 9 of the 23 Renderman samples for SM2.0 even though the HLSL compiler succeeded
- So using HLSL seems like the logical choice
  - Not just an industry standard but the best too

And a PC is complex
- Which is a bit of an understatement
- A 9800 Pro has a similar number of gates to two Pentium 4 processors, all on one die
- But the highly parallel design allows it to do much more work of a very specific kind
- So you'd like to have the CPU and VPU both doing useful work at the same time
- Luckily the API encourages this

Which bits are fast?
System:
- CPU: 1 to 1/3 of a nanosecond per tick (1GHz to 3GHz)
- System memory: high latency compared to the CPU; 200-800MHz (for moving data about)
- Virtual memory: takes all week
Graphics card:
- VPU core: 200 to 500MHz
- Local video memory: 200 to 500MHz (~20GB per second)
AGP bus:
- 266MHz, 2GB per second, with latency like molasses
- [100MB per second for CPU reads, so don't!]
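To get a feel for what these bandwidth figures mean per frame, the sketch below converts them into transfer times. It assumes decimal units (1GB = 10^9 bytes), ignores latency entirely, and uses made-up dictionary keys; only the three bandwidth numbers come from the slide.

```python
# Approximate bandwidths quoted on the slide, in bytes per second.
BANDWIDTH_BYTES_PER_SECOND = {
    "local_video_memory": 20e9,  # ~20GB per second
    "agp_bus": 2e9,              # 2GB per second
    "cpu_read_over_agp": 100e6,  # 100MB per second for CPU reads
}


def transfer_time_ms(num_bytes, path):
    """Best-case time in milliseconds to move num_bytes over the given path."""
    return num_bytes / BANDWIDTH_BYTES_PER_SECOND[path] * 1000.0
```

With these numbers, moving one million bytes takes about 0.05 ms from local video memory, 0.5 ms over AGP, and 10 ms when the CPU reads it back, which is an entire frame at 100 frames per second.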

Which bits are fast?
System: CPU
- So the CPU is fast, but it still has too much to do
- All games are CPU limited
Graphics card: VPU core
- Not a blindingly fast clock, but phenomenal throughput
AGP bus:
- Don't texture from here unless you have to

Inside the VPU
You have several units at your disposal:
- Vertex fetch (memory cache)
- Vertex shader (xform and lighting)
- Vertex cache (protecting the shader from abuse)
- Clipper (so fast it might as well not be there)
- Triangle setup
- Fast Z/stencil reject (quad speed rasterizer rejection)
- Rasterizer
- Pixel cache
- Texture cache
- Z buffer
- Blend (Yummy! Read-modify-write)

Inside the VPU
- Because the vertex fetch unit is just reading / caching memory, it makes sense to prefer cache-aligned data formats (like 32 bytes or 64 bytes)
- The vertex cache only works for indexed primitives
- So we recommend that all rendering is done with DrawIndexedPrimitive() and that you submit data in roughly tri-strip order
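The benefit of indexed primitives can be illustrated with a toy model of the vertex cache. This is a sketch under stated assumptions: the cache is modelled as a simple FIFO of 16 entries (real sizes and replacement policies vary by hardware and are not given in the talk), and `grid_indices` is a hypothetical helper that emits a grid's triangles row by row, which is roughly strip order.

```python
from collections import deque


def transformed_vertex_count(indices, cache_size=16):
    """Count vertex shader runs for an index stream through a FIFO cache."""
    cache = deque(maxlen=cache_size)  # the oldest entry drops out when full
    misses = 0
    for index in indices:
        if index not in cache:
            misses += 1          # cache miss: the vertex must be shaded
            cache.append(index)
    return misses


def grid_indices(columns, rows):
    """Indexed triangle list for a grid of quads, emitted row by row."""
    stride = columns + 1
    indices = []
    for r in range(rows):
        for c in range(columns):
            v0 = r * stride + c
            v1 = v0 + 1
            v2 = v0 + stride
            v3 = v2 + 1
            indices += [v0, v2, v1, v1, v2, v3]
    return indices
```

For a 4 x 2 grid the index stream has 48 entries but only 15 distinct vertices, so in this model the shader runs 15 times instead of the 48 times a non-indexed draw would need.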

Saving nanoseconds
- Use shorter shaders, since they're faster
  - One op per clock is what you should expect
  - ATI hardware can parallelise vector + scalar op pairs
- Shaders are cached on chip too
  - So switching shader can sometimes be very fast
- Hand-written assembly isn't usually a good bet
- ps.1.4 modifiers can be free in ps.2.0 hardware

Saving nanoseconds
- Prefer the shortest shader which does what you want
- Use the lowest shader model which achieves your target
  - That way you can potentially access the ps.1.4 modifiers, which run in the same clock cycle
- But please do not sacrifice quality for speed!
  - That can be the user's choice later on, by selecting no AA, low screen resolution, etc.

Pre Zee
- An early Z-only pass will save you time if:
  - (1) Your pixel shaders are long
  - (2) You cannot sort front to back
- The definition of "long" here depends upon how well you can usually sort!
- Pre-Z saves you pixels, but costs you vertices
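The trade-off "saves you pixels, but costs you vertices" can be put into a crude cost model. Everything in this sketch is an assumption made for illustration: costs are counted in abstract cycles, the Z-only pass is charged one extra run of the full vertex workload and no pixel cost, and `shaded_overdraw` stands for how many times each visible pixel ends up being shaded without the pre-pass (1.0 means a perfect front-to-back sort).

```python
def shading_cost(vertices, visible_pixels, shaded_overdraw,
                 vs_cycles, ps_cycles, use_pre_z):
    """Rough per-frame cost in cycles, with or without a Z-only pre-pass."""
    vertex_cost = vertices * vs_cycles
    if use_pre_z:
        # Geometry is submitted twice; each visible pixel is shaded once.
        return 2 * vertex_cost + visible_pixels * ps_cycles
    # Single pass: pixels are shaded once per layer of shaded overdraw.
    return vertex_cost + visible_pixels * shaded_overdraw * ps_cycles


def pre_z_pays_off(vertices, visible_pixels, shaded_overdraw,
                   vs_cycles, ps_cycles):
    """True when the pre-pass is cheaper than the single pass in this model."""
    with_pre_z = shading_cost(vertices, visible_pixels, shaded_overdraw,
                              vs_cycles, ps_cycles, True)
    without = shading_cost(vertices, visible_pixels, shaded_overdraw,
                           vs_cycles, ps_cycles, False)
    return with_pre_z < without
```

In this model a 4-cycle pixel shader at 1024x768 with 3x overdraw does not justify the pre-pass for 500K vertices, a 30-cycle shader does, and with perfect sorting the pre-pass never wins, which is the sense in which "long" depends on how well you can sort.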

Optimisation - The Big Picture
- Almost all of the best optimisations come down to one single principle:
  - Do the work as early as possible in the pipeline, to avoid doing it later where the cost would be greater
- This applies to things like:
  - resource creation (prefer install-time costs to runtime costs)
  - culling (culling early is better than culling late)
  - shader tuning (pre-shader opts move from ps to vs to CPU)
  - Z-only pass

What's this about the future?
- Let's look at the trends which are changing the balance

ATI is at the Center of The Digital Experience

Market share...
- At the end of 2003 ATI finally took the lead in market share in game-play graphics from the competition
  - Yeah, but only by 0.2%... So what?
- According to Mercury Research, ATI leads with a roughly 80:20 split at the high end
- Which means that if you're targeting high-end gamers and reviewers then your focus is on ATI
  - That's what the vast majority of your audience is using
- And ATI has a 100% market share lead of new Xbox technologies

Multiple platforms...
- The PC leads the way, so that the various genres of lesser hardware are several years behind PC architecture...
  - Latest PDA hardware is equivalent to cutting-edge PC hardware from just 4 years ago!
  - Laptops are less than 2 years behind high-end workstations
  - Consoles often define the high end as they arrive...

PC Platform retirement
- Top-spec PCs actually have a game-buying life of just two years!
  - PCs older than that are retired for Word, email, web browsing, etc.
  - New PCs or graphics cards are brought into the home, and it's these that are used for games
- Gamers with systems which are >2 years old buy roughly 1 game per year, and these are not high-end games
- Hard-core gamers average 5-10 games per year
- This implies a roughly 2.5:1 CPU scalability issue
- And roughly 4:1 GPU scalability on both power and features

All of which means
- You should require DX8 hardware and upwards for games due Xmas 2004 or later
- We recommend treating low-end DX9 hardware to the DX8 path
  - Even 1024x768 is often too demanding for the low-end DX9 hardware out there
- So you should be able to cope with just two code paths on many games for this year
  - DX8 hardware takes one
  - DX9 hardware takes the other
- But note that because this assertion is based on forecasts and trends, it is highly subjective

DirectX 8 class hardware
- Programmable vertex pipeline is in addition to the FF pipeline
  - That makes it hard to beat the fixed function hardware
  - And this makes it fast to switch between pipelines
- Pixel pipeline is shared between the old-fashioned texture cascade and the new pixel processor

DirectX 9 class hardware
- Programmable vertex pipeline is shared with the FF pipeline
  - That makes it easy to beat the fixed function hardware
  - That makes it slow to switch between pipelines
- For this reason it generally makes sense to prefer the programmable pipeline

So, here is our target:
DX9 style mainstream graphics (per frame):
- > 500K triangles
- < 500 DrawIndexedPrimitive() calls
- < 500 VertexBuffer switches
- < 200 different textures
- < 200 State change groups
- Few calls to SetRenderTarget - aim for 0 to 4...
- 1 pass per poly is typical, but 2 is sometimes smart
- Runs at monitor refresh rate
  - Which gives more than 40 million polys per second
- And everything goes through the programmable pipeline
- No occurrences of Lock(0), DrawPrimitive(), DPUP(), CreateVB(), etc.
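These targets are easy to turn into an automated per-frame check. The sketch below assumes a hypothetical `stats` dictionary that your own engine would fill in; the counter names are invented here, and the 85Hz refresh rate in the usage note is an assumption, since the slide only says "monitor refresh rate".

```python
# Per-frame limits from the slide; a count must stay strictly below each one.
STRICT_LIMITS = {
    "draw_calls": 500,
    "vertex_buffer_switches": 500,
    "textures": 200,
    "state_change_groups": 200,
}

# SetRenderTarget: the slide says "aim for 0 to 4", so 4 is still acceptable.
MAX_SET_RENDER_TARGET_CALLS = 4


def budget_violations(stats):
    """Return the names of the per-frame counters that exceed the targets."""
    violations = [name for name, limit in STRICT_LIMITS.items()
                  if stats[name] >= limit]
    if stats["set_render_target_calls"] > MAX_SET_RENDER_TARGET_CALLS:
        violations.append("set_render_target_calls")
    return violations


def polys_per_second(triangles_per_frame, refresh_hz):
    """Triangle throughput when every refresh shows a new frame."""
    return triangles_per_frame * refresh_hz
```

At an assumed 85Hz refresh, 500K triangles per frame works out to 42.5 million polys per second, consistent with the "more than 40 million" figure above.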

Are we there yet?
Pixel Shader throughput:
- More pixel engines with
  - Higher clock speeds
  - Higher instruction counts
- More vertex engines too, since triangles keep getting smaller
- The pressure moves away from textures and towards the ALU operations
  - Simply because ALU power grows faster than B/W

Are we there yet?
High quality AA: continue to innovate with...
- Programmable sample points
  - Currently 0, 2, 4 or 6
- Full exposure of centroid control
  - DirectX 9.0c API fully exposes this
- Gamma correction of AA in hardware
  - ATI do this already with a 2.2 gamma function

The 3.0 shader model
- Requires 32-bit floats throughout the pipeline
  - But that's not necessarily full IEEE 754...
  - With its -0.0s, NaNs and INFINITYs, etc.
- Although the spec does not require support for blend and fog into float surfaces, you may expect this to be available on much hardware
- Static flow control in pixel shader
  - Has some serious performance implications...

Which constraints are next?
SM3 Precision
- Consistent 32-bit IEEE throughout
  - Which means... s1e8m23
  - One sign bit
  - 8 bits of exponent
  - 23 stored bits of mantissa (24 bits of precision with the implicit leading bit)
- But the propagation rules (like "what is INF * -0.0?") are not necessarily required until SM 4.0
- Higher (64-bit) precision is not for the near term...
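For reference, the layout of a 32-bit IEEE 754 float on the CPU can be inspected directly. This sketch uses Python's standard `struct` module; it shows what full IEEE 754 single precision looks like (1 sign bit, 8 exponent bits, 23 stored mantissa bits) and what the CPU answers for the propagation question above. It says nothing about what any particular shader hardware does.

```python
import math
import struct


def float32_fields(value):
    """Split a value, rounded to IEEE 754 single precision, into its fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 stored bits
    return sign, exponent, mantissa


def inf_times_negative_zero():
    """Full IEEE 754 propagation rules make this product a NaN."""
    return math.inf * -0.0
```

Here `float32_fields(-0.0)` yields a set sign bit with everything else zero, and infinity is the all-ones exponent with a zero mantissa; hardware that is "32 bit but not full IEEE 754" may treat exactly these special encodings differently.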

Stream Processors
- Modern GPUs and VPUs are computing devices built from stream processors
- Stream processors are great for some tasks...
  - Fixed maximum input B/W
  - Fixed processing power
  - Fixed maximum output B/W

Stream Processors?
- Modern GPUs and VPUs are computing devices built from stream processors
  - Vertex Fetch, Vertex Shader, Triangle set up, Pixel Shader, FB fog + blend
- But really, each block is complex...
  - Sp[0] Sp[1] Sp[...] Sp[n-1] Sp[n]

A unified shader model
- The plan as of GDC 04 is that each of the different 4.0 shaders will use the same syntax and feature set
- This allows us to get around the major drawback of hardwired stream processors: fixed resources
- Then the chip can become a pool of vector processors, and the hardware allocates these resources to match demand
- Which implies that benchmarking the hardware becomes somewhat more complex, where:
  - How many vertices per second depends on the pixel complexity
  - How many pixels per second depends on the vertex complexity
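Why vertex throughput would depend on pixel complexity can be seen in a toy model of a shared pool. This is only a sketch under simplifying assumptions that are not from the talk: every unit retires one shader op per clock, the pool is perfectly load balanced, and nothing else (bandwidth, setup, scheduling) ever limits the frame.

```python
def unified_throughput(units, clock_hz, vertices, vs_ops, pixels, ps_ops):
    """Return (vertices per second, pixels per second) for a shared ALU pool.

    vertices and pixels are per-frame counts; vs_ops and ps_ops are the
    shader lengths in ops per vertex and per pixel.
    """
    ops_per_frame = vertices * vs_ops + pixels * ps_ops
    frames_per_second = units * clock_hz / ops_per_frame
    return vertices * frames_per_second, pixels * frames_per_second
```

With 16 units at 500MHz, 500K vertices at 20 ops each and a million pixels at 10 ops each, the model gives 200 million vertices per second; lengthen the pixel shader to 30 ops and the same vertex workload drops to 100 million per second, even though nothing about the vertices changed.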

So isn't this a CPU?
No, look at the differences:
- Cache sizes - CPU = huge
- Number of pipeline stages - VPU = long
- Cache interaction - VPU = none
- Clock speed - CPU = fast
- Generality - VPU tends not to read what it writes
- Vector oriented - VPU is fundamentally 4D
- Number types - CPU is more flexible, supporting integers and floats easily
- Branches - VPUs don't like branching

Some of the targets for DX Next
- Geometry generation in the VPU
  - A fully specified new Topology Processor unit
  - Which means you'll be able to generate new vertices, with all relevant connectivity information, from within the VPU...
  - For example, you can extrude shadow volumes using this new hardware
  - [But the geometry shader probably doesn't get fed its own output...]
- Note please that "DX Next" is just my placeholder name

Some of the targets for DX Next
- Support for virtual memory
  - So texture downloads are much more efficient
  - Now only those pages of the relevant mip levels will be present
  - Contrast that with the current situation, where all of every mip level is required to be present in VPU-accessible memory before the first texel is filtered...
- And DX Next has the notion of graphics hardware contexts, with maximum context switch times
- VM may also include write capabilities...
- Will reduce the pressure to move beyond 512MB, but we'll still head in that direction...

The 4.0 shader model
- Is still being decided by Microsoft
- Will be for the next OS only
  - Expect this circa early 2006
- New geometry shader
- Common capabilities between all shaders
- Faster small-batch performance is a very high priority
  - Which implies a new driver model
- Will last for two or more years
  - DX9 lasts from Q4 2002 until the next OS