Programming Tips For Scalable Graphics Performance

Game Developers Conference 2009
Programming Tips For Scalable Graphics Performance
March 25, 2009, Room 2010

Luis Gimenez, Graphics Architect
Ganesh Kumar, Application Engineer
Katen Shah, Graphics Architect

Agenda
- Why Optimize for Scalable Graphics
- Intel GMA Series Architecture and Tools
- Balance Workload Between CPU and GPU
- Minimize Runtime and Driver Overhead
- Optimize Shader Performance
- Case Study
- Q&A

Developing for Integrated Graphics Allows You to Sell Your Game to More Customers!

[Chart: PC Graphics Market Segment, in millions, 2007 to 2013, with series for Desktop Integrated, Desktop Discrete, Mobile Integrated and Mobile Discrete. Source: Mercury Research (Q4 08)]

Scale Your Game!

Intel Integrated Graphics (IIG) Architecture

[Block diagram showing: Memory, Commands and internal buses; Cmd Streamer; Video Processing, 2D and Display; Memory/Cache; the VF, VS, GS, Clip, Setup, Rast / Early-Z and SO stages; Thread Dispatch and I$ Cache; an array of Execution Units (EU 0 to EU n, Row0 to RowN); Sampler, Texture Cache, Render Cache and Pixel Ops.]

- Intel GMA 3 & GMA 4 Series support SM4

Intel's New Graphics Performance Analyzers
Today, 2:30 PM - 3:30 PM in Room 3004, West Hall
- System Analyzer
- Frame Analyzer

Optimization Hints For Intel Integrated Graphics

How to avoid frequent pitfalls found in testing integrated graphics playability over numerous games every year:
- Balance Workload Between CPU and GPU
- Minimize Runtime and Driver Overhead
- Optimize Shader Performance

Balance the Workload Between the CPU and the GPU

CPU:
- Complex Algorithms
- Physics/AI Simulation
- Animation
- Pre-computing

GPU:
- Massive Data Parallelism
- Per Pixel Lighting
- Shadows
- Post Processing
- Blending
- Animation

Ocean Fog demo: pre-computing the Perlin textures on the CPU and using the GPU for rendering nearly doubled the frame rate.
http://software.intel.com/en-us/articles/ocean-fog-using-direct3d-10/

Maximize CPU and GPU Utilization: Avoid Stalling the Pipeline!

[Diagram: read-back sequence through the GPU command buffer. 1. CopyResource copies render output to a staging resource. 2. Map() is called on the staging resource. 3. The CPU stalls until the command buffer is flushed.]

[Timeline diagram: CPU and GPU frames F0 to F5. With per-frame synchronization the CPU and GPU stall in turn and the game stutters; with N-2 synchronization the CPU and GPU frames overlap.]

To avoid stalling the CPU:
- Minimize data read-back
- Minimize serializing event queries
- Put space between locks
- Synchronize to N-1 to N-2 frames
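The "synchronize to N-1 to N-2 frames" hint can be illustrated with a small, language-neutral Python sketch. It is not Direct3D code: `FramePacer` and `max_frames_in_flight` are hypothetical names, and the actual wait (in Direct3D 10, typically an event query polled with GetData) is left out. The sketch only shows the bookkeeping: the CPU never waits on the frame it just submitted, only on the one submitted two frames earlier.

```python
from collections import deque


class FramePacer:
    """Tracks submitted frames and reports which one the CPU should wait on."""

    def __init__(self, max_frames_in_flight=2):
        # Hypothetical knob: 1 gives N-1 synchronization, 2 gives N-2.
        self.max_frames_in_flight = max_frames_in_flight
        self.in_flight = deque()  # frame ids submitted but not yet waited on

    def submit(self, frame_id):
        """Record a submitted frame.

        Returns the id of the older frame the CPU must wait for before it
        starts building the next frame, or None when no wait is needed.
        """
        self.in_flight.append(frame_id)
        if len(self.in_flight) > self.max_frames_in_flight:
            return self.in_flight.popleft()
        return None


pacer = FramePacer(max_frames_in_flight=2)
waits = [pacer.submit(frame) for frame in range(5)]
print(waits)  # frame 2 waits on frame 0, frame 3 on frame 1, and so on
```

Waiting on an older frame gives the GPU a full frame or two of queued work, so neither processor idles while the other finishes.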

Maximize CPU and GPU Utilization: Avoid Stalling the Pipeline!

The IIG driver optimizes the workload before sending it to the GPU.

[Diagram: App, Direct3D and Intel Driver; memory holding commands, vertex buffers, index buffer, textures, texture buffer, depth/color and display buffer; pipeline stages Cmd Parser, Vertex Shader, Geometry Shader, Stream Out, Clipper, Setup/Rasterization, Pixel Shader and Output Merger.]

To avoid stalling the CPU:
- Minimize data read-back
- Minimize serializing event queries
- Put space between locks
- Synchronize to N-1 to N-2 frames

Reduce CPU work and optimize driver performance by reducing:
- State changes
- Creation and destruction of resources

Minimizing Runtime and Driver Overhead: Manage Your DirectX 10 Resources!

- DirectX 10 manages resources based on USAGE and CPU_ACCESS_FLAG
- The best memory location is decided by the OS/driver/memory manager

IMMUTABLE (non-mappable)
- Update frequency: never
- Access: GPU read
- Resource update: Create(); loaded at creation, never updated
- Use: static VBs/IBs/textures

DEFAULT (non-mappable)
- Update frequency: <= 1 per frame
- Access: GPU read-write
- Resource update: Copy(), Update(); use only for CBs and small textures
- Use: VBs/IBs/CBs/textures

DYNAMIC (mappable)
- Update frequency: > 1 per frame
- Access: CPU write, GPU read
- Resource update: Map() with WRITE_NO_OVERWRITE for partial updates of VBs/IBs; WRITE_DISCARD for full updates or CBs
- Use: dynamically updated VBs/IBs, CBs

STAGING (mappable), transferring data to the GPU
- Access: Copy(); CPU read-write; GPU indirect read/write
- Resource update: Map() for write to mapped memory, with WRITE/DO_NOT_WAIT_FLAG to avoid stalls; then Copy() from the staging resource to video memory
- Use: texture updates

STAGING (mappable), reading back from the GPU
- Resource update: Copy() GPU output to the staging resource; Map() for read with DO_NOT_WAIT_FLAG to avoid stalls
- Use: surfaces for read-back

Minimizing Runtime and Driver Overhead: Optimize Your Constants Access!

- The IIG driver optimizes the most frequently used constants for DX9/10
- Avoid global constants
- Limit dynamic-indexed constants: C[a0], C[r]
- In DX10, when a constant changes the complete buffer gets updated
- Group cbuffers by frequency of updates
- Organize cbuffers based on feature scaling
- Inside a cbuffer, order constants by access sequence
- Inside cbuffers, pack data into float4 boundaries

[Image: Fog demo]

http://software.intel.com/en-us/articles/directx-constants-optimizations-for-intel-integrated-graphics/
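To make the "pack data into float4 boundaries" advice concrete, here is a small Python sketch, not HLSL and not from the original slides. It assumes the usual HLSL cbuffer packing rule that a variable may not straddle a 16-byte (float4) register, and it only handles variables of 1 to 4 floats. `pack_constants` and the constant names are hypothetical.

```python
def pack_constants(constants):
    """Lay out constants in declaration order.

    constants: list of (name, float_count) pairs, 1 <= float_count <= 4.
    Returns (offsets, registers): each constant's offset in floats and the
    number of float4 registers the buffer occupies.
    """
    offsets = {}
    cursor = 0  # next free float slot
    for name, count in constants:
        if not 1 <= count <= 4:
            raise ValueError(f"{name}: only 1 to 4 floats are supported here")
        used = cursor % 4
        if used + count > 4:
            # The variable would straddle a float4 boundary, so it starts
            # in the next register and the remaining slots are padding.
            cursor += 4 - used
        offsets[name] = cursor
        cursor += count
    registers = (cursor + 3) // 4
    return offsets, registers


careless = [("light_dir", 3), ("fog_color", 3), ("fog_density", 1), ("time", 1)]
packed = [("light_dir", 3), ("fog_density", 1), ("fog_color", 3), ("time", 1)]

_, careless_registers = pack_constants(careless)
_, packed_registers = pack_constants(packed)
print(careless_registers, packed_registers)
```

Pairing each float3 with a scalar fills every register completely. Because a change to any constant updates the whole buffer, a smaller buffer also means less data to re-upload.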

Minimizing Runtime and Driver Overhead: Batch Your Primitives!

- Use large batches: > 200-1K primitives
- Minimize state changes between batches
- Use instancing for small batches

http://software.intel.com/en-us/articles/rendering-grass-with-instancing-in-directx-10/
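One way to act on "minimize state changes between batches" is to group draw calls that share the same render state before submitting them. The Python sketch below is illustrative only: the state keys and mesh names are made up, and it assumes the draws are opaque so that reordering them does not change the image.

```python
from itertools import groupby


def count_state_changes(draws):
    """Count how often the render state switches when draws run in order."""
    changes = 0
    current = None
    for state, _mesh in draws:
        if state != current:
            changes += 1
            current = state
    return changes


def batch_by_state(draws):
    """Group draws sharing a state key into one batch per state."""
    ordered = sorted(draws, key=lambda draw: draw[0])  # stable sort
    return [
        (state, [mesh for _state, mesh in group])
        for state, group in groupby(ordered, key=lambda draw: draw[0])
    ]


# Hypothetical frame: (state key, mesh) pairs in submission order.
draws = [("grass", "g1"), ("rock", "r1"), ("grass", "g2"),
         ("rock", "r2"), ("grass", "g3")]

batches = batch_by_state(draws)
print(count_state_changes(draws), len(batches))
```

Each resulting batch is also a natural candidate for instancing when its meshes are copies of the same geometry.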

Optimizing Shader Performance: Skip Computes that do not Render!

- Test for visibility to reject objects that fall outside the view frustum
- Maximize use of Early-Z (cost 4 pixels/clock hardware)
- Avoid modifying the Z value (oDepth) in the pixel shader
- Use Occlusion Query for complex scenes
- Use LOD to reduce complexity for objects that are distant

Optimizing Shader Performance: Optimize the Use of the Intel Integrated Graphics HW!

[Block diagram: Cmd Streamer; VF, VS, GS, Clip, Setup, Rast / Early-Z and SO stages; Thread Dispatch and I$ Cache; array of Execution Units (EU 0 to EU n, Row0 to RowN); Sampler, Texture Cache, Render Cache and Pixel Ops.]

- For best EU utilization, minimize register usage
- Keep the ratio of instructions to texture samples above 4:1
- Large shaders impact performance due to the limited number of registers
- Use flow control smartly
- Mask alpha when not needed
- Minimize use of transcendentals such as LOG, POW and EXP
- Pre-load shaders to avoid mid-scene compiles
- Avoid mid-scene texture changes
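The visibility test from "Skip Computes that do not Render" is commonly done on the CPU with bounding spheres. The Python sketch below is one such approach, not the presenters' implementation. It assumes each frustum plane is given as (nx, ny, nz, d) with a unit-length normal pointing into the frustum, and the example uses a simple box as a stand-in for a real frustum.

```python
def sphere_visible(center, radius, planes):
    """Conservative bounding-sphere test against inward-facing planes.

    A point p is inside a plane when n.p + d >= 0. The sphere is rejected
    only when it lies entirely behind at least one plane.
    """
    cx, cy, cz = center
    for nx, ny, nz, d in planes:
        signed_distance = nx * cx + ny * cy + nz * cz + d
        if signed_distance < -radius:
            return False
    return True


# Stand-in "frustum": the box -10 <= x, y, z <= 10 as six inward planes.
planes = [
    (1, 0, 0, 10), (-1, 0, 0, 10),
    (0, 1, 0, 10), (0, -1, 0, 10),
    (0, 0, 1, 10), (0, 0, -1, 10),
]

inside = sphere_visible((0, 0, 0), 1.0, planes)
outside = sphere_visible((12, 0, 0), 1.0, planes)
straddling = sphere_visible((10.5, 0, 0), 1.0, planes)
print(inside, outside, straddling)
```

The test may keep a few objects that sit just outside a frustum corner, but it never rejects a visible one, which is the safe direction for culling.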

Optimizing Shader Performance: Scale Your Pixel Shader and Textures!

- Keep your textures under 256x256 and in the same format if possible
- Prefer multi-texture over multi-pass
- Use compressed textures and mip-maps
- Use texture arrays / texture atlases
- Minimize Lock/Blit of the Z and/or stencil buffer
- Use shadow maps for IIG, and stencil shadows as a scalable feature
- Minimize Clear() calls on surfaces
- Minimize post-processing passes

Optimizing for IIG: Demigod

Key Lessons Learned from Optimizing Demigod for IIG

Be Wary of Clear Calls

Why:
- Costlier than you might think
- Affects every pixel on the surface

Recommendations:
- Make sure unused surfaces don't get cleared unnecessarily
- Consider reducing surface resolution when in lower LOD
- Clear Color, Stencil and Z-Buffer in the same API call

Prune Costly Clear Calls

Reduce the Number of Texture Fetches

- Texture cache is limited on integrated graphics
- Reducing texture sizes alone doesn't help as much
- Optimize shaders by reducing texture fetches in low-fidelity modes
- Balance texture load instructions with arithmetic instructions if possible

Simplify Post Processing Effects

Post-processing effects that use multiple passes:
- Bloom
- Motion Blur
- Depth of Field
- High Dynamic Range

Balance visual quality with speed by reducing the number of passes.

Demigod Bloom Effect

[Images: before and after. Bloom turned off; Bloom on with fewer passes.]

Avoid Pixel Overdraw

Render opaque objects from front to back:
- Render UI and other HUDs first
- Render sky and terrain last

Early-Z architecture eliminates occluded pixels early in the pipeline.

Example of Back to Front Rendering
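The front-to-back ordering recommended under "Avoid Pixel Overdraw" can be sketched as a sort on two keys. This Python example is a hedged illustration, not engine code: the `layer` field (0 for HUD, 1 for ordinary opaque objects, 2 for sky and terrain) and the scene contents are assumptions introduced here.

```python
def opaque_draw_order(objects, camera_pos):
    """Return object names ordered by layer, then nearest to the camera."""

    def squared_distance(obj):
        # Squared distance is enough for ordering and avoids a square root.
        return sum((p - c) ** 2 for p, c in zip(obj["pos"], camera_pos))

    ordered = sorted(objects, key=lambda obj: (obj["layer"], squared_distance(obj)))
    return [obj["name"] for obj in ordered]


# Hypothetical scene with the camera at the origin.
scene = [
    {"name": "sky", "layer": 2, "pos": (0, 0, 1000)},
    {"name": "far_tower", "layer": 1, "pos": (0, 0, 50)},
    {"name": "terrain", "layer": 2, "pos": (0, 0, 0)},
    {"name": "hud", "layer": 0, "pos": (0, 0, 0)},
    {"name": "near_crate", "layer": 1, "pos": (0, 0, 5)},
]

order = opaque_draw_order(scene, camera_pos=(0, 0, 0))
print(order)
```

Drawing near objects first lets Early-Z reject the pixels of everything behind them before the pixel shader runs. Transparent objects still need their own back-to-front pass and are not covered here.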

Moving Terrain Rendering to the End

Lastly, Add Benchmark Mode to Your Game for Performance Profiling!

It helps to characterize the workload. Four key requirements a benchmark must provide:
1. Accurately reflect real workload
2. Repeatability
3. Ability to run standalone without Internet
4. Ability to automate: built-in demo, command-line execution and output to a log file
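A benchmark mode that meets the automation requirement (command-line execution and output to a log) could be structured roughly as follows. This is a minimal Python sketch under stated assumptions: the `--benchmark`, `--frames` and `--log` flags are hypothetical, `render_frame` stands in for the game's own frame function, and an in-memory buffer replaces the log file so the example runs anywhere.

```python
import argparse
import io
import time


def parse_args(argv):
    """Parse hypothetical benchmark flags from a command line."""
    parser = argparse.ArgumentParser(description="Benchmark mode sketch")
    parser.add_argument("--benchmark", action="store_true")
    parser.add_argument("--frames", type=int, default=600)
    parser.add_argument("--log", default="benchmark.log")
    return parser.parse_args(argv)


def run_benchmark(render_frame, frame_count, log):
    """Time a fixed number of frames and write one CSV line per frame."""
    frame_ms = []
    for frame in range(frame_count):
        start = time.perf_counter()
        render_frame(frame)
        frame_ms.append((time.perf_counter() - start) * 1000.0)
    for frame, ms in enumerate(frame_ms):
        log.write(f"{frame},{ms:.3f}\n")
    if frame_ms:
        log.write(f"average_ms,{sum(frame_ms) / len(frame_ms):.3f}\n")
    return frame_ms


args = parse_args(["--benchmark", "--frames", "3"])
log = io.StringIO()  # stands in for open(args.log, "w")
frame_times = run_benchmark(lambda frame: sum(range(1000)), args.frames, log)
print(log.getvalue())
```

Running a fixed frame count over a built-in demo, rather than a fixed wall-clock time, helps repeatability because every run renders the same work.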

Summary: Scale Your Game for Integrated!

- Balance CPU and GPU workload, avoid stalls
- Minimize runtime and driver overhead
- Optimize your shader performance by scaling your game
- Analyze your game, find your most expensive call
- Balance your visual effects against performance penalties
- Add benchmark mode to your game

Additional Resources

Developers Guide for Intel Integrated Graphics
- http://software.intel.com/en-us/articles/intel-graphics-media-accelerator-developers-guide

Articles Mentioned in this Presentation
- http://software.intel.com/en-us/articles/ocean-fog-using-direct3d-10
- http://software.intel.com/en-us/articles/directx-constants-optimizations-for-intel-integrated-graphics/
- http://software.intel.com/en-us/articles/rendering-grass-with-instancing-in-directx-10

Intel Graphics Performance Analyzer
- www.intel.com/software/gpa

Intel Graphics Community
- http://softwarecommunities.intel.com/communities/visualcomputing

Integrated Graphics Software Development Forum
- http://softwarecommunities.intel.com/isn/community/en-US/forums/2414/ShowForum.aspx

Intel Laptop Gaming TDK
- http://softwarecommunities.intel.com/articles/eng/1017.htm

Enhance Your Products and Your Business

The gateway to Intel's worldwide technology, engineering and go-to-market support for Visual Computing developers:
- Training the Next Generation
- Get the Story Behind the Story
- Investing in Talent and Technology
- See What's New
- Developers Connecting with Intel Engineers

www.intel.com/software/visualadrenaline

For More Information

http://www.intel.com/software/gdc

Contact info

See Intel at GDC:
- Intel Booth at Expo, North Hall
- Intel Interactive Lounge, West Hall, 3rd floor

Take a collateral DVD:
- Here in the room!
- Intel Booth or Interactive Lounge

Intel @ GDC

Wednesday, March 25
- Programming Tips for Scalable Graphics: 10:30 AM - 11:30 AM in Room 2010, West Hall
- Threaded AI For the Win!: 12:00 PM - 1:00 PM in Room 2011, West Hall
- Intel's New Graphics Performance Analyzers: 2:30 PM - 3:30 PM in Room 3004, West Hall
- Kaboom: Real-Time Multi-Threaded Fluid Simulation for Games: 4:00 PM - 5:00 PM in Room 2011, West Hall

Thursday, March 26
- Who Moved the Goalposts? The Rapidly Changing World of CPUs and Optimization: 1:30 PM - 2:30 PM in Room 2011, West Hall
- Taming Your Game Production Demons: the Offset approach: 3:00 PM - 4:00 PM in Room 2011, West Hall
- Optimizing Game Architectures with Intel Threading Building Blocks: 4:30 PM - 5:30 PM in Room 2011, West Hall

Friday, March 27
- Procedural and Multi-Core Techniques to take Visuals to the Next Level: 9:00 AM - 10:00 AM in Room 2010, West Hall
- Rasterization on Larrabee: A First Look at the Larrabee New Instructions (LRBni) in Action: 9:00 AM - 10:00 AM in Room 135, North Hall
- SIMD Programming on Larrabee: A Second Look at the Larrabee New Instructions (LRBni) in Action: 10:30 AM - 11:30 AM in Room 3002, West Hall

Risk Factors

This presentation contains forward-looking statements. All statements made that are not historical facts are subject to a number of risks and uncertainties, and actual results may differ materially. Please refer to our most recent Earnings Release and our most recent Form 10-Q or 10-K filing available on our website for more information on the risk factors that could cause actual results to differ.

Rev. 4/17/07

Backup Slides

Both Intel GMA 3 and 4 support DirectX 10: Make your Scaling API Independent!

[Diagram: Game Scaling across DX8, DX9 and DX10 against High Detail, Standard Detail and Low Detail, with the recommendation marked.]

Both Intel GMA 3 and 4 support all required D3D10 Features

D3D10 optional features:
- MSAA: only single sample supported
- 32-bit FP filtering: not supported
- 16-bit UNORM blending: supported in GMA X4XXX and beyond
- RGB32 RT: not supported
- Use ID3D10Device::CheckFormatSupport to check for supported formats

Other D3D10 performance considerations:
- Limit use of the GS; make it a scalable feature
- Use different Stream Out buffers for different SO formats
- Check for optional features before using them
