Feeding the Beast: How to Satiate Your GoForce While Differentiating Your Game

Similar documents
Coming to a Pixel Near You: Mobile 3D Graphics on the GoForce WMP. Chris Wynn NVIDIA Corporation

Whiz-Bang Graphics and Media Performance for Java Platform, Micro Edition (JavaME)

GoForce 3D: Coming to a Pixel Near You

Mobile Performance Tools and GPU Performance Tuning. Lars M. Bishop, NVIDIA Handheld DevTech Jason Allen, NVIDIA Handheld DevTools

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

Building scalable 3D applications. Ville Miettinen Hybrid Graphics

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

Tutorial on GPU Programming #2. Joong-Youn Lee Supercomputing Center, KISTI

Optimizing Games for ATI s IMAGEON Aaftab Munshi. 3D Architect ATI Research

Graphics Processing Unit Architecture (GPU Arch)

Evolution of GPUs Chris Seitz

Module 13C: Using The 3D Graphics APIs OpenGL ES

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

GeForce4. John Montrym Henry Moreton

Mattan Erez. The University of Texas at Austin

Spring 2009 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim

Optimisation. CS7GV3 Real-time Rendering

1.2.3 The Graphics Hardware Pipeline

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Render-To-Texture Caching. D. Sim Dietrich Jr.

Windowing System on a 3D Pipeline. February 2005

Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

The Application Stage. The Game Loop, Resource Management and Renderer Design

Optimal Shaders Using High-Level Languages

Canonical Shaders for Optimal Performance. Sébastien Dominé Manager of Developer Technology Tools

Graphics Hardware. Instructor Stephen J. Guy

Profiling and Debugging Games on Mobile Platforms

Textures. Texture coordinates. Introduce one more component to geometry

Programming Graphics Hardware

CS427 Multicore Architecture and Parallel Computing

PROFESSIONAL. WebGL Programming DEVELOPING 3D GRAPHICS FOR THE WEB. Andreas Anyuru WILEY. John Wiley & Sons, Ltd.

Drawing Fast The Graphics Pipeline

RSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Working with Metal Overview

The Rasterization Pipeline

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

GPU Memory Model. Adapted from:

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

Baback Elmieh, Software Lead James Ritts, Profiler Lead Qualcomm Incorporated Advanced Content Group

GeForce3 OpenGL Performance. John Spitzer

Dave Shreiner, ARM March 2009

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

Performance OpenGL Programming (for whatever reason)

3D buzzwords. Adding programmability to the pipeline 6/7/16. Bandwidth Gravity of modern computer systems

Scanline Rendering 2 1/42

POWERVR MBX. Technology Overview

Mattan Erez. The University of Texas at Austin

Rendering Objects. Need to transform all geometry then

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

OpenGL ES 2.0 Performance Guidelines for the Tegra Series

Hardware-driven Visibility Culling Jeong Hyun Kim

GPU Target Applications

Self-shadowing Bumpmap using 3D Texture Hardware

Real-World Applications of Computer Arithmetic

The Source for GPU Programming

CS 4620 Program 3: Pipeline

CS 130 Final. Fall 2015

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Monday Morning. Graphics Hardware

Drawing Fast The Graphics Pipeline

Graphics Hardware. Computer Graphics COMP 770 (236) Spring Instructor: Brandon Lloyd 2/26/07 1

CIS 665 GPU Programming and Architecture

How to Work on Next Gen Effects Now: Bridging DX10 and DX9. Guennadi Riguer ATI Technologies

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 3.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555C (ID102813)

CS4620/5620: Lecture 14 Pipeline

CS 498 VR. Lecture 19-4/9/18. go.illinois.edu/vrlect19

PowerVR Performance Recommendations. The Golden Rules

E.Order of Operations

Pipeline Operations. CS 4620 Lecture 14

Getting fancy with texture mapping (Part 2) CS559 Spring Apr 2017

PowerVR Performance Recommendations The Golden Rules. October 2015

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

Pipeline Operations. CS 4620 Lecture Steve Marschner. Cornell CS4620 Spring 2018 Lecture 11

Readings on graphics architecture for Advanced Computer Architecture class

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST

The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering. Bill Mark and Kekoa Proudfoot. Stanford University

Vertex Shader Design I

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 2.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555B (ID051413)

Lets assume each object has a defined colour. Hence our illumination model is looks unrealistic.

GPU Memory Model Overview

Sign up for crits! Announcments

Achieving Console Quality Games on Mobile

Drawing Fast The Graphics Pipeline

0;L$+LJK3HUIRUPDQFH ;3URFHVVRU:LWK,QWHJUDWHG'*UDSKLFV

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

CS GPU and GPGPU Programming Lecture 3: GPU Architecture 2. Markus Hadwiger, KAUST

Texture Compression. Jacob Ström, Ericsson Research

Cornell University CS 569: Interactive Computer Graphics. Introduction. Lecture 1. [John C. Stone, UIUC] NASA. University of Calgary

CS 130 Exam I. Fall 2015

The Graphics Pipeline

2.11 Particle Systems

Lecture 2. Shaders, GLSL and GPGPU

CS230 : Computer Graphics Lecture 4. Tamar Shinar Computer Science & Engineering UC Riverside

Sung-Eui Yoon ( 윤성의 )

Transcription:

GDC Europe 2005 Feeding the Beast: How to Satiate Your GoForce While Differentiating Your Game Lars M. Bishop NVIDIA Embedded Developer Technology 1

Agenda GoForce 3D capabilities Strengths and weaknesses What that means to application developers Common performance bottlenecks From a hardware vendor s viewpoint Techniques to optimize performance While simultaneously maximizing quality 2

What is GoForce 3D? Licensable 3D core for mobile devices Discrete solutions: GoForce 3D 4500/4800 OpenGL ES compliant Low power Integrated SRAM Up to VGA resolution 3

GoForce 3D 4800 Features Geometry engine 16-bit color w/ 16-bit Z (40-bit color internal) Fully perspective correct Sub-pixel accuracy Per-pixel fog, alpha blending, alpha-test 4

GoForce 3D 4800 Features cont. Multi-texturing w/ up to 4 textures In OpenGL ES 1.0!! Flexible texture formats (DXT1/4-bit/8-bit) Bilinear / trilinear filtering 5

GoForce 3D Pipeline Flexible Fragment ALU Transform/Setup Raster Texture Raster fragment generation and loop management Scalable Fragment/ALU Stages only trigger on activity Data Write Low power (~50 pipe stages) <50mW per 100M pixel/sec, actual game play 6

Common Performance Bottlenecks CPU Low clock speeds HW floating point functionality not standard System memory bandwidth (bus) This is not your PCI Express x16 GPU Powerful, but features are not free Need to pay attention to the features you enable 7

Performance Bottleneck Causes CPU-bound App-level work Non-accelerated driver work Bus-bound Geometry issues Texture issues GPU-bound Raster feature usage / multipass rendering 8

App-level CPU Bottlenecks Simplify physics, collision, visibility Avoid non-coherent algorithms Small caches: data incoherency hurts performance Avoid floating point or integer division Use Fixed point if CPU does not support floating point Floating point otherwise 9

Render-related CPU Bottlenecks Batch geometry Minimize number of state changes i.e. sort geometry into buckets of common state Avoid examining every triangle every frame e.g. per-triangle app-level culling Avoid multi pass Use more complex fragment shaders instead Avoid CPU-assisted driver paths 10

CPU-Assisted Vertex Computations Vertex lighting is NOT GPU accelerated Non-identity texture matrices NOT accelerated But do you really want per vertex lighting? Per vertex specular is bad, unless highly tessellated Use per pixel computations instead Per pixel tricks Per pixel lighting 11

Specular Lighting Screenshots Per Vertex Per Pixel 12

Alternative Lighting Strategies Fake Phong highlights using multi-texture Pre-compute vertex lighting Stuntcar Extreme (Images Courtesy of Fathammer) 13

System Memory Bandwidth: Vertices Avoid transferring large number of vertices Cull high-level geometry in app: not per triangle Use LODs Use impostors and billboards Minimize per vertex size Don t specify z or w if unused (e.g. screen-space) Use packed ARGB (not floating point) Optimize vertex order for the vertex cache 14

Vertex Cache Optimization What is a vertex cache? Place for the GPU to store vertex data If vertex found in cache, GPU uses cached copy How to utilize this? Use longest possible triangle strip or fan, or Use indexed triangle lists, LRU cache Optimize for 32- or 16-entry LRU vertex cache 32 8-word verts or 16 16-word verts 15

Vertex Cache Diagram CPU GPU Vertex Cache System Memory 16

Avoiding Vertex Computations Avoid multi-pass Every vertex processed once per pass Much more efficient to render in a single pass using a more complex fragment shader Minimize number of vertices Use LODs: Remember your screen size! Cull high-level geometry in app: not per triangle Use triangle strips or (optimized) indexed tri lists 17

System Memory Bandwidth: Textures Avoid overflowing GPU texture memory Memory contains frame buffer and textures GoForce 3D 4800: 1280K SRAM Texture memory: 1280K frame buffer QVGA (320x240) 16-bit color double-buffered w/ z: Frame buffer: 450K Texture memory available: 830K 18

Avoid Overflowing Texture Memory Avoid allocating full-screen buffers Use compact texture formats EXT_texture_compression_dxt1 (4-bit/texel) Palette textures Use lowest texture resolution possible See GPU Gems 2, Chapter 28: Mipmap-Level Measurement If you must exceed budget, sort by texture 19

Palette Textures HW supports 8bit palettes 4bit palettes But only 256 entries total in a rendering pass So only one 8bit palette texture per fragment shader Or up to 16 4bit palettes i.e. can t multi-texture from 2 different 8bit palettes 20

Benefits of DXT1 Bubble: scene uses 8 textures 2 256x256 textures (mipmapped) 6 256x256 textures (not mipmapped) Space: RGB565 (16bits/texel) DXT1 (4bits/texel) = 1.08MB = 0.27MB DXT1: high quality & 25% the size of RGB565 Actually, also faster to texture from 21

Leverage Multi-Texture Four Full Res (256 x 256) 4-bit = 128KB Full Res Base + four 1/4 res lightmaps = 40KB Store diffuse maps in lower resolution and use multi-texture to save memory 22

Texture Size Avoid allocating alpha if unused DXT1 supports RGBA and RGB Avoid allocating mipmaps if not useful Don t bother skipping just 1x1 & 2x2 levels: Very negligible space savings Consider NO mipmaps: Reasonable space savings Saves ~50 clocks of triangle setup calculations But not if image quality suffers (sparklies) 23

GPU Bottleneck Overview Texture optimizations Multi-texturing Filtering Fragment optimization Not all features cost the same Avoiding framebuffer reads Avoiding depth-buffer expense 24

Texture Filtering Bilinear filtering is free Trilinear filtering is half-speed Compared to bilinear Use trilinear filtering when needed Don t turn it on by default 25

Fragment Shading Do as much work as possible in each fragment Use Combiner Programs NV_combiner_program extension from OpenGL ES Combiner Programs allow for powerful effects DOT3 bump/normal mapping Environment-mapped bump mapping Image processing (blurring, edge detect, etc) Up to four textures per pass 26

Assembly Language Consists of Different types of registers Instructions Write masking and co-issue Argument modifiers 27

Overall Program Limits Up to 4 stages Each stage consists of Optional color interpolation Optional texture look-up Math instruction(s) STAGE color GL_DIFFUSE texld v0.x, v0.y, TEX0 mul r0, tex, col 28

Register Types Temporary registers: r0, r1 read and write, persistent across stages Constant registers: c0-c3 read only Color register: col read only Texture register: tex and tex.a read only and read/write, respectively 29

Instructions mov, mul, add, mad, lrp, min, max seq, sne, sge, sgt, sle, slt mmad: (a*b) + (c*d) mmul: r0 = a*b, r1 = c*d mmin: min(a*b, c*d); mmax: max(a*b, c*d) mseq, msne, msge, msgt, msle, mslt keq, kne, kge, kgt, kle, klt mkeq, mkne, mkge, mkgt, mkle, mklt 30

Write Masking and Co-Issue Optional write mask Legal:.rgba,.rgb,.a Defaults to.rgba add r0.a, r0, r1 mul r0.rgb, c0, r1 Co-issue: Alpha and rgb parts execute separately Alpha result available before rgb computation 31

Modifiers Destination clamp Source negate Source swizzle (tex and color registers) Not on temporaries or constants smear : tex.a means tex.aaaa mux : one of each r, g, b, a (e.g. tex.bgar) 32

Combiner Program Examples Environment-mapped Bumpmapping Glow + DOT3 Bumpmapping 33

Fragment Shader Considerations Pixel failing z costs same number of cycles i.e., no z-cull, no early out But killed pixels do not load texels from cache Slightly faster Consumes less power May consider rough front-to-back rendering Do not override state sorting! 34

Fragment Shader Instruction Costs Single cycle Single texture, color-interpolated, z-buffered Additional cycle for each: Additional texture stage Alpha test Alpha blending Fog Color masking 35

Raster-OP Readback: Turn Them Off Alpha blending consumes bandwidth Do not assume 1*src 0*dest is optimal (it s not) Color masking actually reads framebuffer More expensive than not masking Z comparison reads z-buffer If possible, turn all three above options off You ll get an extra throughput boost in rasterization 36

Other Advice Avoid framebuffer LogicOp (bitwise AND, etc) Only rudimentary support (expensive) Don t read from framebuffer glreadpixels() Avoid dynamic texture updates Never per-frame on large textures 37

Other Advice (Continued) There is (currently) no push buffer Run CPU and GPU in parallel as much as possible Going to get better soon Full screen apps can flip Pointer swap versus blit (actual memory copy) 38

Conclusion Offload CPU Avoid CPU driver paths Avoid multi-pass, prefer multi-texture Avoid state changes Watch vertex computations and bandwidth Watch video memory usage But video memory will continue to increase Shift load to fragment shaders 39

Plan for the Future NVIDIA is committed to handheld 3D Expect exciting advances in HW soon Prepare your content to scale up in all directions Expect future GoForce products to have: More powerful/flexible shader capabilities More VRAM Increased vertex/triangle throughput 40

Sooner Than You d Think! 41

GPU Gems Plug Not (yet) 100% applicable, but fast approaching! 42

Questions? Register for NVIDIA Handheld Developer Program http://developer.nvidia.com Email: handset-dev@nvidia.com 43

44