Save the Nanosecond! PC Graphics Performance for the next 3 years
Richard Huddy, European Developer Relations Manager, ATI Technologies, Inc.

A funny thing happened to me
- ATI is now broadly recognised and highly recommended amongst high-end gamers

Another DX performance talk?
- Because although this has been my pet subject for 7 years, there's still complexity to work out
- Like:
  - Choosing sort criteria
  - Preferred ways of handling dynamic data
  - The best way to express a pixel shader algorithm

Nanoseconds
- There are lots of them...
- But once they're gone... they're gone...
- If a game lasts for around 40 hours of play, that's roughly 10^14 nanoseconds...
- Each frame says goodbye to roughly 10 million of these puppies
- Each VPU clock tick is roughly 2 ns
  - 500MHz is a fast VPU
- [Each CPU tick is roughly 1/2 ns.]
  - 2GHz is a modest CPU
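The arithmetic behind these figures can be checked with a minimal sketch. The function names are illustrative only, and the 10 million ns frame length is simply the slide's round number (about 100 frames per second).

```cpp
#include <cassert>
#include <cstdint>

// Nanoseconds in a given number of hours of play.
inline std::int64_t nanoseconds_in_hours(std::int64_t hours) {
    return hours * 3600 * 1000000000LL;
}

// Length of one clock tick in nanoseconds, for a clock given in MHz.
inline double tick_ns(double clock_mhz) {
    assert(clock_mhz > 0.0);
    return 1000.0 / clock_mhz;
}

// Clock ticks that fit into one frame of the given length.
inline double ticks_per_frame(double frame_ns, double clock_mhz) {
    return frame_ns / tick_ns(clock_mhz);
}
```

Under these assumptions, 40 hours comes to 1.44 x 10^14 ns, and a 10 million ns frame leaves a 500MHz VPU 5 million ticks and a 2GHz CPU 20 million ticks to spend.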

Save the nanosecond
- There's an English saying: "Look after the pennies and the pounds will look after themselves"
- In a sense, pennies are just small pounds
- But delivering fast frames requires you to save millions of nanoseconds
- And you can't get rich by saving a dollar every now and then...

The DirectX API
- Since there's an API between you and the hardware, it makes sense to expect that you need to know how to use it
- Abuse of the API can be a mighty expensive option
- And this is an incredibly common problem

Huge Savings...
- Don't create resources within the performance-sensitive part of your code
  - Offline: Compress textures
  - Install time: Optimize vertex sequences (D3DXOptimizeMesh)
  - Start-up time: Create VBs, IBs, RTs, etc.
  - Game loop: Create nothing at all

Huge Savings...
- SetRenderTarget()
  - Let's not have too many of these please!
  - Single-digit counts are good
- Avoid Lock() with zero for flags
  - Whether that's a VB that's being rendered from
  - Or a RenderTarget which was rendered to
  - Because there are milliseconds at stake here!
- Also use DONOTWAIT appropriately to reclaim CPU cycles: these are scarce!

Significant savings...
- Every DrawPrim call is a significant cost
  - So make sure you get good value from it
- Every time you set any state it costs you
  - Whether you set one or ten...
  - But aggressive state filtering is no longer needed so much in DX9
- One pixel is irrelevant, but millions matter...
  - Clear() the Z/stencil buffer to make it work fast
  - Sort front to back
  - Sub-sort by shader
  - Set your shader constants in blocks
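One way to picture "sort front to back, sub-sort by shader" is as a two-level sort key. The sketch below is only an illustration: `DrawItem`, its fields and the coarse depth bucket are hypothetical names, not part of DirectX, and a real engine would choose its own bucket granularity.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical draw record; none of these names come from the DirectX API.
struct DrawItem {
    std::uint32_t depth_bucket; // coarse view-space depth, 0 = nearest
    std::uint32_t shader_id;    // identifies the shader in use
    std::uint32_t mesh_id;      // which mesh to draw
};

// Order draws front to back, then by shader within each depth bucket.
inline void sort_draws(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(),
        [](const DrawItem& a, const DrawItem& b) -> bool {
            if (a.depth_bucket != b.depth_bucket)
                return a.depth_bucket < b.depth_bucket;
            return a.shader_id < b.shader_id;
        });
}

// Count how often the shader must be set when the list is submitted in order.
inline std::size_t count_shader_switches(const std::vector<DrawItem>& items) {
    std::size_t switches = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i == 0 || items[i].shader_id != items[i - 1].shader_id)
            ++switches;
    }
    return switches;
}
```

Coarser depth buckets give the shader sub-sort more room to group draws, at the price of a less exact front-to-back order; where that balance sits is a per-game judgment.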

Compilers are smart...
- At ATI we test compilers to make sure that they're good, and help make them better
- Sample results show (Win, Draw, Lose):
  - HLSL vs Cg on ATI*: 5, 7, 2
  - HLSL vs Cg on NV: 16, 7, 0
  - (*) The Cg compiler failed to compile 9 of the 23 Renderman samples for SM2.0 even though the HLSL compiler succeeded
- So using HLSL seems like the logical choice
  - Not just an industry standard but the best too

And a PC is complex
- Which is a bit of an understatement
- A 9800 Pro has a similar number of gates to two Pentium 4 processors, all on one die
- But the highly parallel design allows it to do much more work of a very specific kind
- So you'd like to have the CPU and VPU both doing useful work at the same time
- Luckily the API encourages this

Which bits are fast?
- System:
  - CPU: 1 to 1/3 of a nanosecond per tick (1GHz to 3GHz)
  - System memory: high latency compared to the CPU, 200-800MHz (for moving data about)
  - Virtual memory: takes all week
- Graphics card:
  - VPU core: 200 to 500MHz
  - Local video memory: 200 to 500MHz (~20GB per second)
- AGP bus:
  - 266MHz, 2GB per second, with latency like molasses
  - [100MB per second for CPU reads, so don't!]
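To see why CPU reads across the AGP bus are discouraged, it helps to turn the bandwidth figures into times. This is a rough sketch that ignores latency entirely and assumes 1MB = 10^6 bytes and 1GB = 10^9 bytes; the function name and the 4MB example size are illustrative.

```cpp
#include <cassert>

// Milliseconds needed to move `bytes` at a sustained `bytes_per_second`.
inline double transfer_ms(double bytes, double bytes_per_second) {
    assert(bytes_per_second > 0.0);
    return bytes * 1000.0 / bytes_per_second;
}
```

With these assumptions, 4MB written over AGP at 2GB per second takes about 2 ms, while reading the same 4MB back at 100MB per second takes about 40 ms, which is several whole frames at the frame lengths discussed earlier.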

Which bits are fast?
- System:
  - CPU: so the CPU is fast, but it still has too much to do
  - All games are CPU limited
- Graphics card:
  - VPU core: not a blindingly fast clock, but phenomenal throughput
- AGP bus:
  - Don't texture from here unless you have to

Inside the VPU
- You have several units at your disposal:
  - Vertex fetch (memory cache)
  - Vertex shader (xform and lighting)
  - Vertex cache (protecting the shader from abuse)
  - Clipper (so fast it might as well not be there)
  - Triangle setup
  - Fast Z/stencil reject (quad speed rasterizer rejection)
  - Rasterizer
  - Pixel cache
  - Texture cache
  - Z buffer
  - Blend (Yummy! Read-modify-write)

Inside the VPU
- Because the vertex fetch unit is just reading / caching memory, it makes sense to prefer cache-aligned data formats (like 32 bytes or 64 bytes)
- The vertex cache only works for indexed primitives
- So we recommend that all rendering is done with DrawIndexedPrimitive() and that you submit data in roughly tri-strip order
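The two recommendations above can be illustrated with a small sketch. The `Vertex` layout is a hypothetical 32-byte format, and the cache model is a plain FIFO of assumed size; real hardware cache sizes and replacement policies vary, so treat the counts as indicative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical vertex layout that fills exactly 32 bytes.
struct Vertex {
    float position[3]; // 12 bytes
    float normal[3];   // 12 bytes
    float uv[2];       //  8 bytes
};
static_assert(sizeof(Vertex) == 32, "vertex should fill exactly 32 bytes");

// Count vertex shader runs for an indexed triangle list, assuming a FIFO
// vertex cache holding `cache_size` entries (an assumed model, not a
// description of any particular chip).
inline std::size_t count_shader_runs(const std::vector<std::uint16_t>& indices,
                                     std::size_t cache_size) {
    assert(cache_size > 0);
    std::deque<std::uint16_t> cache;
    std::size_t runs = 0;
    for (std::uint16_t index : indices) {
        if (std::find(cache.begin(), cache.end(), index) != cache.end())
            continue;          // cache hit: the shader is not run again
        ++runs;                // cache miss: the vertex is shaded
        cache.push_back(index);
        if (cache.size() > cache_size)
            cache.pop_front(); // evict the oldest entry
    }
    return runs;
}
```

For a quad drawn as two indexed triangles (indices 0,1,2, 2,1,3) a four-entry cache shades 4 vertices, where the same quad as a non-indexed list would have to shade all 6.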

Saving nanoseconds
- Use shorter shaders since they're faster
  - One op per clock is what you should expect
  - ATI hardware can parallelise vector + scalar op pairs
- Shaders are cached on chip too
  - So switching shader can sometimes be very fast
- Hand-written assembly isn't usually a good bet
  - ps.1.4 modifiers can be free in ps.2.0 hardware

Saving nanoseconds
- Prefer the shortest shader which does what you want
- Use the lowest shader model which achieves your target
  - That way you can potentially access the ps1.4 modifiers which run in the same clock cycle
- But please do not sacrifice quality for speed!
  - That can be the user's choice later on, by selecting no-aa, low screen resolution etc.

Pre Zee
- An early Z-only pass will save you time if
  - (1) Your pixel shaders are long
  - (2) You cannot sort front-to-back
- The definition of "long" here depends upon how well you can usually sort!
- Pre-Z saves you pixels, but costs you vertices

Optimisation - The Big Picture
- Almost all of the best optimisations come down to one single principle:
- Do the work as early as possible in the pipeline, to avoid doing it later where the cost would be greater
- This applies to things like:
  - resource creation (prefer install time costs to runtime costs)
  - culling (culling early is better than late)
  - shader tuning (pre-shader opts move from ps to vs to CPU)
  - Z-only pass

What's this about the future?
- Let's look at the trends which are changing the balance

ATI is at the Center of The Digital Experience

Market share...
- At the end of 2003 ATI finally took the lead in market share in game-play graphics from the competition
  - Yeah, but only by 0.2%... So what?
- According to Mercury Research, ATI leads with a roughly 80:20 split at the high end
  - Which means that if you're targeting high-end gamers and reviewers then your focus is on ATI
  - That's what the vast majority of your audience is using
- And ATI has a 100% market share lead of New Xbox technologies

Multiple platforms...
- The PC leads the way, so that the various genres of lesser hardware are several years behind PC architecture...
  - Latest PDA hardware is equivalent to cutting-edge PC hardware from just 4 years ago!
  - Laptops are less than 2 years behind high-end workstations
  - Consoles often define the high end as they arrive...

PC Platform retirement
- Top-spec PCs actually have a game-buying life of just two years!
  - PCs older than that are retired for Word, email, web browsing etc.
  - New PCs or graphics cards are brought into the home and it's these that are used for games
- Gamers with systems which are >2 years old buy roughly 1 game per year, and these are not high-end games
- Hard-core gamers average 5-10 games per year
- This implies a roughly 2.5:1 CPU scalability issue
- And roughly 4:1 GPU scalability on both power and features

All of which means
- You should require DX8 hardware and upwards for games due Xmas 2004 or later
- We recommend treating low-end DX9 hardware to the DX8 path
  - Even 1024x768 is often too demanding for the low-end DX9 hardware out there
- So you should be able to cope with just two code paths on many games for this year
  - DX8 hardware takes one
  - DX9 hardware takes the other
- But note that because this assertion is based on forecasts and trends it is highly subjective

DirectX 8 class hardware
- Programmable vertex pipeline is in addition to the FF pipeline
  - That makes it hard to beat the fixed function hardware
  - And this makes it fast to switch between pipelines
- Pixel pipeline is shared between the old-fashioned texture cascade and the new pixel processor

DirectX 9 class hardware
- Programmable vertex pipeline is shared with the FF pipeline
  - That makes it easy to beat the fixed function hardware
  - That makes it slow to switch between pipelines
- For this reason it makes sense generally to prefer the programmable pipeline

So, here is our target:
- DX9 style mainstream graphics (per frame):
  - > 500K triangles
  - < 500 DrawIndexedPrimitive() calls
  - < 500 VertexBuffer switches
  - < 200 different textures
  - < 200 State change groups
  - Few calls to SetRenderTarget: aim for 0 to 4...
- 1 pass per poly is typical, but 2 is sometimes smart
- Runs at monitor refresh rate
  - Which gives more than 40 million polys per second
- And everything goes through the programmable pipeline
  - No occurrences of Lock(0), DrawPrimitive(), DPUP(), CreateVB() etc.
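These per-frame targets lend themselves to a simple automated check. The sketch below assumes the application already gathers its own counters; `FrameStats` and both function names are hypothetical, and the thresholds are copied straight from the list above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-frame counters collected by the application itself.
struct FrameStats {
    std::uint32_t triangles;
    std::uint32_t draw_calls;             // DrawIndexedPrimitive() calls
    std::uint32_t vertex_buffer_switches;
    std::uint32_t textures;               // different textures used
    std::uint32_t state_change_groups;
    std::uint32_t render_target_changes;  // SetRenderTarget() calls
};

// True when a frame falls inside the per-frame targets listed above.
inline bool meets_targets(const FrameStats& s) {
    return s.triangles > 500000
        && s.draw_calls < 500
        && s.vertex_buffer_switches < 500
        && s.textures < 200
        && s.state_change_groups < 200
        && s.render_target_changes <= 4;
}

// Triangle throughput when every refresh shows a new frame.
inline double triangles_per_second(std::uint32_t triangles_per_frame,
                                   double refresh_hz) {
    assert(refresh_hz > 0.0);
    return triangles_per_frame * refresh_hz;
}
```

At 500K triangles per frame, 40 million polys per second corresponds to an 80Hz refresh; at an assumed 85Hz the same frame gives 42.5 million.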

Are we there yet?
- Pixel shader throughput:
  - More pixel engines, with
    - Higher clock speeds
    - Higher instruction counts
  - More vertex engines too, since triangles keep getting smaller
  - The pressure moves away from textures and towards the ALU operations
    - Simply because ALU power grows faster than B/W

Are we there yet?
- High quality AA: continue to innovate with...
  - Programmable sample points
    - Currently 0, 2, 4 or 6
  - Full exposure of centroid control
    - DirectX 9.0c API fully exposes this
  - Gamma correction of AA in hardware
    - ATI do this already with a 2.2 gamma function

The 3.0 shader model
- Requires 32 bit floats throughout the pipeline
  - But that's not necessarily full IEEE 754...
  - With its -0.0s, NaNs and INFINITYs etc.
- Although the spec does not require support for blend and fog into float surfaces, you may expect this to be available on much hardware
- Static flow control in pixel shader
  - Has some serious performance implications...

Which constraints are next?
- SM3 Precision
  - Consistent 32 bit IEEE throughout
  - Which means... s1e8m23
    - One sign bit
    - 8 bits of exponent
    - 23 bits of mantissa (24 with the implicit leading bit)
  - But the propagation rules (like "what is INF * -0.0?") are not necessarily required until SM 4.0
- Higher (64 bit) precision is not for the near-term...

Stream Processors
- Modern GPUs and VPUs are computing devices built from stream processors
- Stream processors are great for some tasks...
  - Fixed maximum input B/W
  - Fixed processing power
  - Fixed maximum output B/W

Stream Processors?
- Modern GPUs and VPUs are computing devices built from stream processors
- [Diagram: Vertex Fetch -> Vertex Shader -> Triangle setup -> Pixel Shader -> FB fog + blend]
- But really, each block is complex...
- [Diagram: Sp[0], Sp[1], Sp[...], Sp[n-1], Sp[n]]

A unified shader model
- The plan as of GDC 04
  - Is that each of the different 4.0 shaders will use the same syntax and feature set
- This allows us to get around the major drawback of hardwired stream processors: fixed resources
- Then the chip can become a pool of vector processors, and the hardware allocates these resources to match demand
- Which implies that benchmarking the hardware becomes somewhat more complex, where:
  - How many vertices per second depends on the pixel complexity
  - How many pixels per second depends on the vertex complexity

So isn't this a CPU?
- No, look at the differences:
  - Cache sizes: CPU = huge
  - Number of pipeline stages: VPU = long
  - Cache interaction: VPU = none
  - Clock speed: CPU = fast
  - Generality: VPU tends not to read what it writes
  - Vector oriented: VPU is fundamentally 4D
  - Number types: CPU is more flexible, supporting integers and floats easily
  - Branches: VPUs don't like branching

Some of the targets for DX Next
- Geometry generation in the VPU
  - A fully specified new Topology Processor unit
  - Which means you'll be able to generate new vertices with all relevant connectivity information from within the VPU...
  - For example, you can extrude shadow volumes using this new hardware
  - [But the geometry shader probably doesn't get fed its own output...]
- Note please that DX Next is just my placeholder name

Some of the targets for DX Next
- Support for virtual memory
  - So texture downloads are much more efficient
  - Now only those pages of the relevant mip levels will be present
  - Contrast that with the current situation, where all of every mip level is required to be present in VPU-accessible memory before the first texel is filtered...
  - And DX Next has the notion of graphics hardware contexts with maximum context switch times
  - VM may also include write capabilities...
  - Will reduce the pressure to move beyond 512MB, but we'll still head in that direction...

The 4.0 shader model
- Is still being decided by Microsoft
- Will be for the next OS only
  - Expect this circa early 2006
- New geometry shader
- Common capabilities between all shaders
- Faster small batch performance is a very high priority
  - Which implies a new driver model
- Will last for two or more years
  - DX9 lasts from Q4 2002 until the next OS