Game Developers Conference 2009
Programming Tips For Scalable Graphics Performance
March 25, 2009, Room 2010

Luis Gimenez, Graphics Architect
Ganesh Kumar, Application Engineer
Katen Shah, Graphics Architect

Agenda
- Why Optimize for Scalable Graphics
- Intel GMA Series Architecture and Tools
- Balance Workload Between CPU and GPU
- Minimize Runtime and Driver Overhead
- Optimize Shader Performance
- Case Study
- Q&A

Developing for Integrated Graphics Allows You to Sell Your Game to More Customers!
[Chart: PC Graphics Market Segment, in millions, 2007 to 2013. Series: Desktop Integrated, Desktop Discrete, Mobile Integrated, Mobile Discrete. Source: Mercury Research (Q4 08)]

Scale Your Game!

Intel Integrated Graphics (IIG) Architecture
[Block diagram: memory, commands and internal buses; Cmd Streamer, Video Processing, 2D, Display, Memory/Cache; fixed-function stages VF, VS, GS, Clip, Setup, Rast / Early-Z, SO; Thread Dispatch and I$ Cache feeding an array of Execution Units (EU 0 to EU n, Row 0 to Row N); Sampler, Texture Cache, Render Cache, Pixel Ops]
- Intel GMA 3 & GMA 4 Series support SM4

Intel's New Graphics Performance Analyzers
Today, 2:30 PM to 3:30 PM in Room 3004, West Hall
- SYSTEM ANALYZER
- FRAME ANALYZER

Optimization Hints For Intel Integrated Graphics
How to avoid the frequent pitfalls found when testing integrated graphics playability across numerous games every year:
- Balance Workload Between CPU and GPU
- Minimize Runtime and Driver Overhead
- Optimize Shader Performance

Balance the Workload Between the CPU and the GPU
CPU:
- Complex Algorithms
- Physics/AI
- Simulation
- Animation
- Pre-computing
GPU:
- Massive Data Parallelism
- Per Pixel Lighting
- Shadows
- Post Processing
- Blending
- Animation
OCEAN FOG DEMO: pre-computing the Perlin textures on the CPU and using the GPU for rendering nearly doubled the frame rate.
http://software.intel.com/en-us/articles/ocean-fog-using-direct3d-10/

Maximize CPU and GPU Utilization: Avoid Stalling the Pipeline!
[Diagram: the CPU queues render commands in the GPU command buffer, then reads the output back]
1. CopyResource: copy the render output to a staging resource
2. Map() the staging resource
3. Stall until flush
To avoid stalling the CPU:
- Minimize data read-back and serializing event queries

Maximize CPU and GPU Utilization: Avoid Stalling the Pipeline!
[Timeline diagram: CPU and GPU frame timelines (F0 to F5). With read-back in the same frame, the CPU and GPU stall on each other and the game stutters; with N-2 synchronization, both stay busy]
To avoid stalling the CPU:
- Minimize data read-back and serializing event queries
- Put space between locks
- Synchronize to frame N-1 or N-2
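
The "synchronize to N-1 or N-2" advice above can be sketched in portable C++ as a small ring of read-back slots. This is a minimal illustration, not Direct3D code: `PendingReadback`, `LatentReadbackRing`, and the integer `result` are hypothetical stand-ins for a query or staging resource and whatever data the GPU would eventually produce. The point is only that the CPU asks for the result issued `Latency` frames ago instead of the one it has just submitted.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for one in-flight read-back (in a real renderer this
// would wrap a query or a staging resource).
struct PendingReadback {
    std::uint64_t issuedFrame;
    int result;    // placeholder for the data the GPU would produce
    bool pending;
};

// Ring of read-back slots that only hands back results that are `Latency`
// frames old, so the CPU never waits on the frame that is still in flight.
template <std::size_t Latency>
class LatentReadbackRing {
    static_assert(Latency >= 1, "read-back must lag by at least one frame");

public:
    LatentReadbackRing() : slots_() {}

    // Call once per frame after submitting the copy or query for `frame`.
    void issue(std::uint64_t frame, int result) {
        PendingReadback& slot = slots_[frame % kSlots];
        assert(!slot.pending && "slot reused before its result was collected");
        slot.issuedFrame = frame;
        slot.result = result;
        slot.pending = true;
    }

    // Call once per frame; yields the result issued `Latency` frames earlier.
    bool collect(std::uint64_t frame, int& out) {
        if (frame < Latency) return false;  // nothing is old enough yet
        PendingReadback& slot = slots_[(frame - Latency) % kSlots];
        if (!slot.pending || slot.issuedFrame != frame - Latency) return false;
        out = slot.result;
        slot.pending = false;
        return true;
    }

private:
    static const std::size_t kSlots = Latency + 1;
    std::array<PendingReadback, kSlots> slots_;
};
```

With `LatentReadbackRing<2>`, the value issued in frame 0 first becomes available in frame 2. The trade-off is that game logic consuming the result sees data that is two frames stale.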

Maximize CPU and GPU Utilization: Avoid Stalling the Pipeline!
The IIG driver optimizes the workload before sending it to the GPU.
[Diagram: App, Direct3D, Intel Driver; commands and memory resources (Vertex Buffers, Index Buffer, Textures, Texture Buffer, Depth / Color, Display Buffer); GPU stages Cmd Parser, Vertex Shader, Geometry Shader, Stream Out, Clipper, Setup/Rasterization, Pixel Shader, Output Merger]
To avoid stalling the CPU:
- Minimize data read-back and serializing event queries
- Put space between locks
- Synchronize to frame N-1 or N-2
Reduce CPU work and optimize driver performance by reducing:
- State changes
- Creation and destruction of resources

Optimization Hints For Intel Integrated Graphics
- Balance Load Between CPU and GPU
- Minimize Runtime and Driver Overhead
- Optimize Shader Performance

Minimizing Runtime and Driver Overhead
Manage Your DirectX 10 Resources!
- DirectX 10 manages resources based on USAGE and CPU_ACCESS_FLAG
- The best memory location is decided by the OS/driver/memory manager

DX10 usage types:

IMMUTABLE (non-mappable)
- Update frequency: never
- Access: GPU read
- Resource update: Create(); loaded at creation, never updated
- Use: static VBs/IBs/textures

DEFAULT (non-mappable)
- Update frequency: <= 1 per frame
- Access: GPU read/write
- Resource update: Copy(), Update(); use only for CBs and small textures
- Use: VBs/IBs/CBs/textures

DYNAMIC (mappable)
- Update frequency: > 1 per frame
- Access: CPU write, GPU read
- Resource update: Map() with WRITE_NO_OVERWRITE for partial updates of VBs/IBs; WRITE_DISCARD for full updates or CBs
- Use: dynamically updated VBs/IBs/CBs

STAGING (mappable), transferring data to the GPU
- Access: Copy(); CPU read/write; GPU indirect read/write
- Resource update: Map() to write to mapped memory, with WRITE/DO_NOT_WAIT_FLAG to avoid stalls; then Copy() from the staging resource to video memory
- Use: texture updates

STAGING (mappable), read-back from the GPU
- Resource update: Copy() the GPU output to a staging resource; Map() for read with DO_NOT_WAIT_FLAG to avoid a stall
- Use: surfaces for read-back

Minimizing Runtime and Driver Overhead
Optimize Your Constants Access!
- The IIG driver optimizes the most frequently used constants for DX9/10
- Avoid global constants
- Limit dynamically indexed constants: C[a0], C[r]
- In DX10, when a constant changes, the complete buffer gets updated:
  - Group cbuffers by frequency of updates
  - Organize cbuffers based on feature scaling
  - Inside a cbuffer, order constants by access sequence
  - Inside cbuffers, pack data into float4 boundaries
[Image: Fog Demo]
http://software.intel.com/en-us/articles/directx-constants-optimizations-for-intel-integrated-graphics/
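
The cbuffer guidance above (group by update frequency, pack to float4 boundaries) can be illustrated with CPU-side mirror structs. This is a sketch under stated assumptions: `PerFrameConstants`, `PerObjectConstants`, their fields, and the two byte-counting helpers are hypothetical names invented for the example, and the byte counts only model "the complete buffer gets updated" in the simplest possible way.

```cpp
#include <cassert>
#include <cstddef>

// CPU-side mirrors of two hypothetical constant buffers, split by how often
// they change. All names and fields are illustrative.
struct PerFrameConstants {       // updated once per frame
    float viewProjection[16];    // 4 x float4
    float lightDirection[3];     // xyz ...
    float time;                  // ... plus one scalar fills the float4
};

struct PerObjectConstants {      // updated once per draw call
    float world[16];             // 4 x float4
    float tint[4];               // 1 x float4
};

// Constant buffers are laid out in 16-byte (float4) registers.
static_assert(sizeof(PerFrameConstants) % 16 == 0, "pad to a float4 boundary");
static_assert(sizeof(PerObjectConstants) % 16 == 0, "pad to a float4 boundary");

// Bytes uploaded per frame if everything lived in one buffer that is
// rewritten for every object.
inline std::size_t bytesPerFrameSingleBuffer(std::size_t objectCount) {
    return objectCount * (sizeof(PerFrameConstants) + sizeof(PerObjectConstants));
}

// Bytes uploaded per frame when the buffers are split by update frequency.
inline std::size_t bytesPerFrameSplitBuffers(std::size_t objectCount) {
    return sizeof(PerFrameConstants) + objectCount * sizeof(PerObjectConstants);
}
```

For 100 objects this model gives 16000 bytes per frame for the single buffer against 8080 bytes for the split layout, because the per-frame data is uploaded once instead of once per object.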

Minimizing Runtime and Driver Overhead
Batch Your Primitives!
- Use large batches (> 200 to 1K primitives)
- Minimize state changes between batches
- Use instancing for small batches
http://software.intel.com/en-us/articles/rendering-grass-with-instancing-in-directx-10/

Optimization Hints For Intel Integrated Graphics
- Balance Load Between CPU and GPU
- Minimize Runtime and Driver Overhead
- Optimize Shader Performance
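
As a rough illustration of "minimize state changes between batches" from the batching slide, the sketch below sorts draw items by a (shader, texture) key and counts how often state has to be rebound. `DrawItem` and both helper functions are hypothetical; a real renderer would use a richer sort key, and reordering like this is only safe for geometry whose draw order does not affect the final image (for example, opaque objects with depth testing).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical draw record: which state it needs and how much it draws.
struct DrawItem {
    int shaderId;
    int textureId;
    int primitiveCount;
};

inline bool sameState(const DrawItem& a, const DrawItem& b) {
    return a.shaderId == b.shaderId && a.textureId == b.textureId;
}

// Number of times the (shader, texture) state must be bound when the items
// are submitted in the given order. The first item always counts as one bind.
inline std::size_t countStateChanges(const std::vector<DrawItem>& items) {
    std::size_t changes = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i == 0 || !sameState(items[i - 1], items[i])) ++changes;
    }
    return changes;
}

// Reorder so that items sharing state are adjacent. The sort is stable, so
// submission order within one state is preserved.
inline void sortByState(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return std::tie(a.shaderId, a.textureId) <
                                std::tie(b.shaderId, b.textureId);
                     });
}
```

Once items with identical state sit next to each other, they also become candidates for merging into one larger batch or for instancing.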

Optimizing Shader Performance
Skip Computes That Do Not Render!
- Test for visibility to reject objects that fall outside the view frustum
- Maximize use of Early-Z (cost: 4 pixels/clock in hardware)
- Avoid modifying the Z value (oDepth) in the pixel shader
- Use Occlusion Query for complex scenes
- Use LOD to reduce complexity for distant objects

Optimizing Shader Performance
Optimize the Use of the Intel Integrated Graphics HW!
[Block diagram repeated: Cmd Streamer, VF, VS, GS, Clip, Setup, Rast / Early-Z, SO, Thread Dispatch, I$ Cache, array of Execution Units (EU 0 to EU n, Row 0 to Row N), Sampler, Texture Cache, Render Cache, Pixel Ops]
- For best EU utilization, minimize register usage
- Keep a ratio of more than 4:1 instructions per texture sample
- Large shaders hurt performance because the number of registers is limited
- Use flow control smartly
- Mask alpha when it is not needed
- Minimize use of transcendentals such as LOG, POW, and EXP
- Pre-load shaders to avoid mid-scene compiles
- Avoid mid-scene texture changes
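
The view-frustum visibility test mentioned on the first of the two slides above is commonly done on the CPU with bounding spheres. The following is a minimal sketch of that idea rather than anything taken from the talk: `Plane`, `Sphere`, and `isPotentiallyVisible` are hypothetical names, and the plane normals are assumed to be unit length and to point into the visible volume.

```cpp
#include <cassert>
#include <cstddef>

// Plane in the form n.p + d = 0; points inside the volume give n.p + d >= 0.
// The normal (nx, ny, nz) is assumed to be unit length.
struct Plane  { float nx, ny, nz, d; };
struct Sphere { float x, y, z, radius; };

inline float signedDistance(const Plane& p, const Sphere& s) {
    return p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
}

// Conservative test: returns false only when the sphere lies completely
// behind at least one plane, so such objects can be skipped before any
// draw call is issued.
inline bool isPotentiallyVisible(const Plane* planes, std::size_t planeCount,
                                 const Sphere& s) {
    assert(planes != nullptr || planeCount == 0);
    for (std::size_t i = 0; i < planeCount; ++i) {
        if (signedDistance(planes[i], s) < -s.radius) return false;
    }
    return true;
}
```

A sphere that straddles a plane is kept, so the test may accept a few objects that end up invisible, but it does not reject visible ones.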

Optimizing Shader Performance
Scale Your Pixel Shader and Textures!
- Keep your textures under 256x256 and in the same format if possible
- Prefer multi-texturing over multi-pass
- Use compressed textures and mip-maps
- Use texture arrays / texture atlases
- Minimize Lock/Blit of the Z and/or stencil buffer
- Use shadow maps for IIG, with stencil shadows as a scalable feature
- Minimize Clear() of surfaces
- Minimize post-processing passes

Optimizing for IIG: Demigod

Key Lessons Learned from Optimizing Demigod for IIG

Be Wary of Clear Calls
Why:
- Costlier than you might think
- Affects every pixel on the surface
Recommendations:
- Make sure unused surfaces don't get cleared unnecessarily
- Consider reducing surface resolution when in lower LOD
- Clear Color, Stencil and Z-Buffer in the same API call

Prune Costly Clear Calls

Reduce the Number of Texture Fetches
- Texture cache is limited on integrated graphics
- Reducing texture sizes alone doesn't help as much
- Optimize shaders by reducing texture fetches in low-fidelity modes
- Balance texture load instructions with arithmetic instructions if possible

Simplify Post Processing Effects
Post-processing effects that use multiple passes:
- Bloom
- Motion Blur
- Depth of Field
- High Dynamic Range
Balance visual quality with speed by reducing the number of passes.

Demigod Bloom Effect
[Screenshots: before and after; bloom turned off, and bloom on with fewer passes]

Avoid Pixel Overdraw
Render opaque objects from front to back:
- Render UI and other HUDs first
- Render sky and terrain last
The Early-Z architecture eliminates occluded pixels early in the pipeline.

Example of Back to Front Rendering
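
To make the cost of back-to-front ordering concrete, here is a toy model of an early depth test on a one-dimensional "screen". It is a sketch for intuition only: `OpaqueSpan`, `countShadedPixels`, and `sortFrontToBack` are hypothetical, each object is assumed to have a single constant depth, and real hardware tests depth at a coarser granularity than this per-pixel loop suggests.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical opaque object covering pixels [firstPixel, lastPixel] of a
// one-dimensional screen at a constant depth (smaller means closer).
struct OpaqueSpan {
    std::size_t firstPixel;
    std::size_t lastPixel;
    float depth;
};

// Draw the spans in the given order with a depth test that runs before
// shading, and return how many pixels had to be shaded.
inline std::size_t countShadedPixels(const std::vector<OpaqueSpan>& spans,
                                     std::size_t screenWidth) {
    std::vector<float> depthBuffer(screenWidth,
                                   std::numeric_limits<float>::max());
    std::size_t shaded = 0;
    for (const OpaqueSpan& span : spans) {
        assert(span.firstPixel <= span.lastPixel && span.lastPixel < screenWidth);
        for (std::size_t px = span.firstPixel; px <= span.lastPixel; ++px) {
            if (span.depth < depthBuffer[px]) {  // passes the early depth test
                depthBuffer[px] = span.depth;
                ++shaded;
            }
        }
    }
    return shaded;
}

// Sort opaque objects so the closest one is drawn first.
inline void sortFrontToBack(std::vector<OpaqueSpan>& spans) {
    std::sort(spans.begin(), spans.end(),
              [](const OpaqueSpan& a, const OpaqueSpan& b) {
                  return a.depth < b.depth;
              });
}
```

With three full-width layers on an 8-pixel screen, back-to-front submission shades 24 pixels in this model, while front-to-back submission shades only 8.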

Moving Terrain Rendering to the End

Lastly, Add Benchmark Mode to Your Game for Performance Profiling!
It helps to characterize the workload.
Four key requirements a benchmark must provide:
1. Accurately reflect the real workload
2. Repeatability
3. Ability to run standalone without Internet
4. Ability to automate: built-in demo, command-line execution, and output to a log file
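
The automation requirement above (built-in demo, command-line execution, log output) might be wired up along the following lines. Everything here is an assumption made for illustration: the flag names `--benchmark`, `--demo`, and `--log`, the default values, and the `BenchmarkConfig` and `FrameStats` types are hypothetical and not taken from Demigod or any Intel tool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

struct BenchmarkConfig {
    bool enabled;
    std::string demoName;  // which built-in demo to replay
    std::string logPath;   // where the results are written
};

// Recognizes three hypothetical flags: --benchmark, --demo <name>, --log <path>.
// Unknown arguments are ignored so normal game options still work.
inline BenchmarkConfig parseBenchmarkArgs(const std::vector<std::string>& args) {
    BenchmarkConfig config;
    config.enabled = false;
    config.demoName = "default";
    config.logPath = "benchmark.log";
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (args[i] == "--benchmark") {
            config.enabled = true;
        } else if (args[i] == "--demo" && i + 1 < args.size()) {
            config.demoName = args[++i];
        } else if (args[i] == "--log" && i + 1 < args.size()) {
            config.logPath = args[++i];
        }
    }
    return config;
}

struct FrameStats {
    double averageMs;
    double worstMs;
    double averageFps;
};

// Summarize recorded frame times (milliseconds, all greater than zero).
inline FrameStats summarize(const std::vector<double>& frameTimesMs) {
    assert(!frameTimesMs.empty());
    const double total =
        std::accumulate(frameTimesMs.begin(), frameTimesMs.end(), 0.0);
    FrameStats stats;
    stats.averageMs = total / static_cast<double>(frameTimesMs.size());
    stats.worstMs = *std::max_element(frameTimesMs.begin(), frameTimesMs.end());
    stats.averageFps = 1000.0 / stats.averageMs;
    return stats;
}
```

Replaying a fixed built-in demo with a fixed time step, rather than live input, is what makes the run repeatable; reporting the worst frame time alongside the average helps expose stutter that an average alone would hide.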

Summary
Scale Your Game for Integrated!
- Balance CPU and GPU workload, avoid stalls
- Minimize runtime and driver overhead
- Optimize your shader performance by scaling your game
- Analyze your game, find your most expensive call
- Balance your visual effects against performance penalties
- Add benchmark mode to your game

Additional Resources
- Developers Guide for Intel Integrated Graphics
  http://software.intel.com/en-us/articles/intel-graphics-media-accelerator-developers-guide
- Articles mentioned in this presentation
  http://software.intel.com/en-us/articles/ocean-fog-using-direct3d-10
  http://software.intel.com/en-us/articles/directx-constants-optimizations-for-intel-integrated-graphics/
  http://software.intel.com/en-us/articles/rendering-grass-with-instancing-in-directx-10
- Intel Graphics Performance Analyzer
  www.intel.com/software/gpa
- Intel Graphics Community
  http://softwarecommunities.intel.com/communities/visualcomputing
- Integrated Graphics Software Development Forum
  http://softwarecommunities.intel.com/isn/community/en-US/forums/2414/ShowForum.aspx
- Intel Laptop Gaming TDK
  http://softwarecommunities.intel.com/articles/eng/1017.htm

The gateway to Intel's worldwide technology, engineering and go-to-market support for Visual Computing developers
- Enhance Your Products and Your Business
- Training the Next Generation
- Get the Story Behind the Story
- Investing in Talent and Technology
- See What's New
- Developers Connecting with Intel Engineers
www.intel.com/software/visualadrenaline

For More Information
http://www.intel.com/software/gdc
Contact info
See Intel at GDC:
- Intel Booth at Expo, North Hall
- Intel Interactive Lounge, West Hall, 3rd floor
Take a collateral DVD:
- Here in the room!
- Intel Booth or Interactive Lounge

Intel @ GDC
Wednesday, March 25
- Programming Tips for Scalable Graphics: 10:30 AM to 11:30 AM in Room 2010, West Hall
- Threaded AI For the Win!: 12:00 PM to 1:00 PM in Room 2011, West Hall
- Intel's New Graphics Performance Analyzers: 2:30 PM to 3:30 PM in Room 3004, West Hall
- Kaboom: Real-Time Multi-Threaded Fluid Simulation for Games: 4:00 PM to 5:00 PM in Room 2011, West Hall
Thursday, March 26
- Who Moved the Goalposts? The Rapidly Changing World of CPUs and Optimization: 1:30 PM to 2:30 PM in Room 2011, West Hall
- Taming Your Game Production Demons: the Offset approach: 3:00 PM to 4:00 PM in Room 2011, West Hall
- Optimizing Game Architectures with Intel Threading Building Blocks: 4:30 PM to 5:30 PM in Room 2011, West Hall

Last of Intel @ GDC
Friday, March 27
- Procedural and Multi-Core Techniques to take Visuals to the Next Level: 9:00 AM to 10:00 AM in Room 2010, West Hall
- Rasterization on Larrabee: A First Look at the Larrabee New Instructions (LRBni) in Action: 9:00 AM to 10:00 AM in Room 135, North Hall
- SIMD Programming on Larrabee: A Second Look at the Larrabee New Instructions (LRBni) in Action: 10:30 AM to 11:30 AM in Room 3002, West Hall

Risk Factors
This presentation contains forward-looking statements. All statements made that are not historical facts are subject to a number of risks and uncertainties, and actual results may differ materially. Please refer to our most recent Earnings Release and our most recent Form 10-Q or 10-K filing available on our website for more information on the risk factors that could cause actual results to differ.
Rev. 4/17/07

Backup Slides

Both Intel GMA 3 and 4 Support DirectX 10
Make Your Scaling API Independent!
[Chart: game scaling across DX8, DX9 and DX10 at High Detail, Standard Detail and Low Detail, with the recommendation marked]

Both Intel GMA 3 and 4 Support All Required D3D10 Features
D3D10 optional features:
- MSAA: only single sample supported
- 32-bit FP filtering: not supported
- 16-bit UNORM blending: supported in GMA X4XXX and beyond
- RGB32 RT: not supported
- Use ID3D10Device::CheckFormatSupport to check for supported formats
Other D3D10 performance considerations:
- Limit use of the GS; make it a scalable feature
- Use different Stream Out buffers for different SO formats
- Check for optional features before using them