The Source for GPU Programming


The Source for GPU Programming developer.nvidia.com Latest News Developer Events Calendar Technical Documentation Conference Presentations GPU Programming Guide Powerful Tools, SDKs, and more... Join our FREE registered developer program for early access to NVIDIA drivers, cutting edge tools, online support forums, and more!

GeForce 6 Series Performance
Matthias Wloka, Developer Technology

GeForce 6 Series Specific Performance
- Instancing
- Vertex and pixel shaders 3.0
- Branching and looping
- Vertex texture fetch
- Hardware shadow maps
- Z- and stencil-cull
- FP16 filter and blend, MRTs

Marketing Speak Translation
- SM3, i.e., Shader Model 3 hardware
  - Sometimes shorthand for "every GeForce 6 feature not in GeForce FX"
  - Not just VS/PS 3.0; see previous slide!
- GeForce 6200 does not support fp16 filter/blend
  - Okay, because value cards lack the memory bandwidth to use fp16 render targets

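Simplified Graphics Pipeline
[Diagram: CPU -> Geometry Storage -> Geometry Processor -> Rasterizer -> Fragment Processor -> Frame Buffer, plus Texture Storage + Filtering]
- New features by stage: Instancing (geometry storage), Vertex Shader 3.0 (geometry processor), Z/Stencil Cull (rasterizer), Pixel Shader 3.0 (fragment processor), Fp16 Filter and Shadow Maps (texture storage + filtering), Fp16 Blend and MRT (frame buffer)
- Common bottlenecks: CPU, fragment processor
- New features help address these bottlenecks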

CPU Bottleneck Getting Worse
[Figure courtesy of Ian Buck, Stanford University]

Explicitly Address CPU Bottleneck
- Reduce draw calls
  - Budget/design for your draw calls!
  - Use instancing to reduce batches
  - Use über-shaders to eliminate batches/passes
  - Use fp16 blending to eliminate passes
- Move more computations to the GPU
  - GPGPU: General-Purpose Computations Using GPUs
  - See http://gpgpu.org

Detail of a Single Vertex Shader Pipeline
[Diagram: Input Vertex Data -> vertex shader (FP32 Scalar Unit, FP32 Vector Unit, Branch Unit, Vertex Texture Fetch with Texture Cache) -> Primitive Assembly -> Viewport Processing -> To Setup]

Instancing: What Is It?
- Lets the GPU loop over vertex buffers:
  - Tree model VB
  - Transform matrices VB
- A single draw call generates many instances of an object
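As a rough CPU-side illustration of the loop the GPU performs, the sketch below pairs one model vertex buffer with a second buffer of per-instance data. The names `expand_instances`, `tree_vb`, and `offsets_vb` are hypothetical, and a 2D translation stands in for a full transform matrix.

```python
def expand_instances(model_vb, instance_vb):
    """Emulate one instanced draw call: every instance reuses the same model vertices.

    model_vb:    list of (x, y) vertices of the shared mesh
    instance_vb: list of (dx, dy) per-instance translations (stand-in for matrices)
    """
    out = []
    for dx, dy in instance_vb:      # outer loop: one iteration per instance
        for x, y in model_vb:       # inner loop: the shared model vertices
            out.append((x + dx, y + dy))
    return out


tree_vb = [(0.0, 0.0), (1.0, 0.0), (0.5, 2.0)]       # one triangle as the "tree"
offsets_vb = [(0.0, 0.0), (10.0, 0.0), (0.0, 5.0)]   # three instances
vertices = expand_instances(tree_vb, offsets_vb)
```

The CPU issues this work once; the per-instance loop is what a one-draw-call-per-instance approach would otherwise spend a draw call on.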

Instancing Demo
- Complex lighting, post-processing
- Simple CPU collision

Instancing Advantages
- Most flexible and has the least
  - Draw calls
  - Memory overhead
  - CPU/bus overhead
- Alternatives:
  - One draw call per instance, change state in between
  - Static batching (static pre-transformed VB)
  - Dynamic batching (dynamic 2-stream instancing)
  - Vertex constant instancing
- See the Instancing code sample and whitepaper: http://download.nvidia.com/developer/sdk/Individual_Samples/samples.html

But...
- Multiple vertex streams
  - GPU does extra work
- Vertex sizes are larger
  - Transform matrix is a per-vertex attribute

Attribute Bound
- Extra data fetched per instance
  - Explains the slowdown
- Optimize for the vertex cache
  - A cache hit saves all vertex work, including attribute access
- Pack input attributes as tightly as possible
  - Even if vertex shader work is required to unpack them
  - Move constants or derivables out of attributes

Instancing Performance
[Chart: Instancing method comparison, 28-poly mesh. Y axis: FPS relative to hardware instancing within each group (0% to 140%). X axis: # polys (2,800 to 560,000). Series: Single Draw Calls, Dynamic 2 Stream Instancing, Static 2 Stream Instancing, VS Constant Instancing, Hardware Instancing, Static Pretransformed VB.]

Another View
[Chart: FPS versus polygon count, 28-poly mesh, both axes logarithmic. Y axis: FPS (1 to 1,000). X axis: # polys (1,000 to 1,000,000). Series: Single Draw Calls, Static 2 Stream Instancing, Hardware Instancing, Dynamic 2 Stream Instancing, VS Constant Instancing, Static Pretransformed VB.]

Vertex Shader 3.0: Flow Control
- Vertex flow control is near optimal:
  - Branch instructions have a fixed ~1 cycle overhead
  - Divergence is full speed (MIMD)
- Vertex branching is a win
  - Except for short branches
  - Compiler/driver decides
- Example: single unified vertex shader for 1-, 2-, 3-, and 4-bone skinning
- Use branches and loops to
  - Consolidate batches
  - Skip over unnecessary work

Vertex Texture Fetch (VTF)
- Mipmapped texture fetches from the vertex shader:
  - Only R32f and R32G32B32A32f formats
  - Only point sampling
  - Up to 4 different texture stages
  - Sample as often as you like
- Large latency
  - Equivalent to 20-30 instructions

Cover the Latency
- Latency means you can hide other ops in it, for free
  - Compiler/driver does this for you if possible

    texldl r0, v0, sampler0
    mul r1, v1, c0      // stuff not depending on vtf result
    add r1, r1, r0      // use vtf result for the first time

- Branch over VTF if possible
- Dependent VTFs are slow
  - Less chance to hide latency

Vertex Texture Fetch Performance
- GeForce 6800 is capable of a peak 600 MVerts/s
  - With minimalist (err, read: no) work per vertex
- Max with a single VTF: 33 MVerts/s
  - Not all vertices in a frame need to be displaced
  - 1 million displaced vertices @ 33 fps!
- Do not use as a general constant memory replacement
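The "1 million @ 33 fps" figure follows directly from the quoted peak rate. A minimal sketch of that arithmetic, with a hypothetical helper name and assuming the VTF rate is the only limit:

```python
def displaced_vertices_per_frame(vtf_verts_per_second, target_fps):
    """Upper bound on displaced vertices per frame if VTF throughput is the only limit."""
    return vtf_verts_per_second / target_fps


# 33 MVerts/s with a single VTF, at a 33 fps target
budget = displaced_vertices_per_frame(33e6, 33)
```

Real budgets will be lower once the rest of the vertex shader and the pixel workload are counted.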

Early Z and Stencil Cull
- Culls pixels that (will) fail depth/stencil tests before they enter the pixel shader
- For maximum z-cull:
  - Render roughly front to back
  - Or even better: render a z-only pass before normal rendering
- Do stencil-only passes for other cull tricks

Things That Disable Z Culling
- Changing depth-test direction
  - For example, less-equal to greater-equal
  - Only resets on clear

Z-Cull Uses Highly Compressed Z-Rep
- Triangles with holes (alpha test/texkill/clip planes) are not occluding
- Small triangles are bad occluders
  - Small ~= less than 4x4 pixels
  - Z-cull may not recognize the triangle as an occluder
[Figure: good versus bad occluders]

Things That Disable Stencil Culling
- Changing stencil function, reference, or mask
  - Only resets on clear
- Writing stencil while rejecting based on stencil
  - Write stencil in a separate pass from rejecting color/z

Stencil Cull Example
1. Render light volume with color write disabled
   - Depth func = LESS, stencil func = ALWAYS
   - Stencil Z-FAIL = REPLACE (with value X)
   - Rest of stencil ops set to KEEP
2. Render with lighting shader
   - Depth func = ALWAYS, stencil func = EQUAL, all ops = KEEP, stencil ref = X
   - Unlit pixels will be culled because stencil does not match the reference value
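The two passes can be emulated per pixel in software to see which pixels survive. This is a sketch only: the function names, the one-dimensional "screen", and the use of `None` for pixels the light volume does not cover are all assumptions, and which faces of the volume are drawn is left out.

```python
X = 1  # stencil reference value; any value distinct from the cleared value works here


def mark_light_volume(scene_depth, volume_depth, stencil):
    """Pass 1: depth func LESS, stencil func ALWAYS, z-fail op REPLACE, others KEEP."""
    for i, (scene_z, vol_z) in enumerate(zip(scene_depth, volume_depth)):
        if vol_z is None:
            continue                    # light volume does not cover this pixel
        depth_passes = vol_z < scene_z  # LESS
        if not depth_passes:
            stencil[i] = X              # z-fail: REPLACE with X
    return stencil


def pixels_to_light(stencil):
    """Pass 2: stencil func EQUAL with ref X; every other pixel is culled."""
    return [i for i, s in enumerate(stencil) if s == X]


scene_depth = [0.5, 0.5, 0.5, 0.5]
volume_depth = [None, 0.7, 0.3, 0.5]    # depth of the rendered volume surface
stencil = mark_light_volume(scene_depth, volume_depth, [0, 0, 0, 0])
lit = pixels_to_light(stencil)
```

Only the pixels where the volume fails the depth test carry X into the second pass, so the lighting shader runs on those alone.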

Fast Z-Only Rendering
- GeForce FX and 6 Series render z/stencil at double speed!
  - Important for dynamic shadow maps!
  - Makes a z-first/only pass (for z-cull benefits) attractive
- Only enabled if:
  - No color writes
  - Pixel shaders disabled (no depth replace, no texkill)
  - Alpha test/color key disabled
  - 8-bit/component color buffer bound (not float)
  - No user clip planes
  - No AA

Pixel Shader 3.0 Performance
- What is Pixel Shader 3.0?
- 3.0 shaders help both CPU and GPU bottlenecks
  - Consolidate draw calls / passes (über-shaders)
  - Early-outs with dynamic branching
- Gory performance details of particular pixel shader 3.0 features

Detail of a Single Pixel Shader Pipeline
[Diagram components: Input Fragment Data, Texture Data, Texture Cache, FP Texture Processor, FP32 Shader Unit 1, FP32 Shader Unit 2, Branch Processor, Fog ALU, Output Shaded Fragments]
- Texture filter: bilinear / trilinear / anisotropic
  - 1 texture @ full speed
  - 4-tap filter @ full speed
  - 16:1 aniso with trilinear
  - FP16 texture filtering
- Shader Unit 1: 4 FP ops/pixel, co-issue, texture address calc, free fp16 normalize, plus mini ALU
- Shader Unit 2: 4 FP ops/pixel, co-issue, plus mini ALU
- SIMD architecture, co-issue, FP32 computation, Shader Model 3.0

Half (fp16) Performance
- Half (fp16) still matters!
  - Critical for GeForce FX performance
  - Reduces register pressure
  - Better able to hide texture latency
  - Fast fp16 normalize
- Compiler/driver can NOT help you with this

GeForce 6 Single Cycle Normalize()
- Pixel shader unit has a single-cycle normalize
- Caveat: only for 3-component 16-bit float values

    float3 f3;
    half3 h3;
    half4 h4;
    f3 = normalize(f3);          // slow: dp3/rsq/mul
    h3 = normalize(h3);          // fast: nrmh
    h4 = normalize(h4);          // slow: dp4/rsq/mul
    h4.xyz = normalize(h4.xyz);  // fast: nrmh

GeForce 6 Superscalar Execution
- Executes multiple instructions simultaneously
- For example, in a single cycle you can execute
  - Two 2-vector instructions, or
  - One 3-vector and one scalar instruction
- Plus, there are 2 math units per shader pipe
- Use swizzles / write masks to help the compiler

    half4 A, B;
    A.w = sin(A.w);          // A = sin(A.w) is not enough
    A.xyz = A.xyz * B.xyz;

GeForce 6 Series Co-Issue
- 2 different instructions executing in the same cycle in the same shader unit
- 2 separate shader units
- 4 instructions/pixel/cycle
[Diagram: Shader Unit 1 splits its R G B A channels between Operation 1 and Operation 2; Shader Unit 2 splits its R G B A channels between Operation 3 and Operation 4]

Flow Control Performance Overview
- Flow control instruction costs: not free, but useful

    Instruction          Cost (cycles)
    if / endif           4
    if / else / endif    6
    call                 2
    ret                  2
    loop / endloop       4

- Additional costs when pixels diverge (more later)

Looping Costs
- DirectX ps.3.0 supports only static loops
  - Unrolling is faster
  - Compiler/driver can do that for you
- Nonetheless useful because it
  - Reduces high-level code complexity
  - Reduces passes
    - Multiple lights in a single pass can be a big win
    - Number of lights unknown at compile time
  - Reduces proliferation of pre-compiled shaders
    - Thousands of shaders from just a few templates
  - Overcomes DirectX's 512 static instruction limit

Branching Costs
- Branching can provide a substantial boost
  - If able to skip > 6 instruction cycles, and
  - If the branch condition is coherent
[Figure: coherent versus incoherent branch conditions]
- Noisy branch conditions cause performance loss
  - Potentially worse than taking both branches all the time
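A toy cost model makes the "> 6 cycles" threshold and the incoherent worst case concrete. It uses the 6-cycle if / else / endif cost quoted in the flow-control overview; the function names are hypothetical, and the assumption that an incoherent group of pixels pays for both sides of the branch is a simplification, not a hardware specification.

```python
IF_ELSE_ENDIF_CYCLES = 6   # if / else / endif cost from the flow-control overview


def branched_cost(then_cycles, else_cycles, take_then, coherent):
    """Cycles per pixel when the shader uses if / else / endif."""
    if coherent:
        body = then_cycles if take_then else else_cycles
    else:
        body = then_cycles + else_cycles   # assumption: divergent pixels pay for both sides
    return IF_ELSE_ENDIF_CYCLES + body


def flattened_cost(then_cycles, else_cycles):
    """Cycles per pixel when both sides always run and one result is selected."""
    return then_cycles + else_cycles


expensive, cheap = 40, 0
coherent_skip = branched_cost(expensive, cheap, take_then=False, coherent=True)
incoherent = branched_cost(expensive, cheap, take_then=False, coherent=False)
always_both = flattened_cost(expensive, cheap)
```

Under these assumptions a coherent skip costs 6 cycles instead of 40, while an incoherent branch costs 46, which is worse than running both sides unconditionally.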

How Coherent Do I Have To Be?
- GPU has hundreds of pixels in flight
  - Best if coherent over regions of > ~1000 pixels
  - That's only ~30x30!
- You need to experiment in your own application
- Soft shadow demo shows:
  - Incoherent branching on a small portion of the screen is still a big win

Combine Branching With Others
- Back face register (vface)
  - Shade front faces differently from back faces
- Position register (vpos)
  - Shade based on position
  - For example, skip or simplify distant pixels
- Early out:
  - If in shadow, don't do lighting computations
  - If out of range (attenuation zero), don't light
  - Applies to vs.3.0 as well

Soft Shadow Demo

How Soft Shadow Demo Works
- Takes 8 test samples from the shadow map
- If all 8 are in shadow or all 8 are in the light, then done
- If on the edge (some in shadow, some in light)
  - Do 56 more samples for additional quality
- 64 samples at much lower cost!
  - Quick-and-dirty importance sampling
- Dynamic sampling is > 2x faster
  - Versus 64 samples everywhere
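The control flow of the demo can be sketched on the CPU as follows. This is not the demo's shader code: `in_shadow` is a hypothetical stand-in for a shadow-map lookup, and integer offsets stand in for sample positions.

```python
def adaptive_shadow(in_shadow, test_offsets, extra_offsets):
    """Return (shadow_fraction, samples_taken) for one pixel.

    in_shadow(offset) -> bool is a hypothetical shadow-map lookup.
    """
    test = [in_shadow(o) for o in test_offsets]        # 8 test samples
    if all(test) or not any(test):
        # Fully shadowed or fully lit: early out after the test samples.
        return (1.0 if test[0] else 0.0), len(test)
    extra = [in_shadow(o) for o in extra_offsets]      # edge pixel: 56 more samples
    hits = sum(test) + sum(extra)
    total = len(test) + len(extra)
    return hits / total, total


test_offsets = list(range(8))
extra_offsets = list(range(8, 64))

lit = adaptive_shadow(lambda o: False, test_offsets, extra_offsets)
edge = adaptive_shadow(lambda o: o % 2 == 0, test_offsets, extra_offsets)
```

Pixels away from shadow edges stop after 8 lookups, which is where the speedup over 64 samples everywhere comes from.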

Hardware Shadow Maps
- In DirectX:
  - Render to a depth format texture (D3DFMT_D24X8, D3DFMT_D16)
  - Use tex2Dproj to sample
  - Shadow map comparison happens automatically
- In OpenGL:
  - Render to a DEPTH_COMPONENT texture
  - Use TEXTURE_COMPARE_MODE_ARB with COMPARE_R_TO_TEXTURE

Hardware Shadow Map Performance
- Shadow map comparison is free (full speed)
  - No need to compare and filter in the shader
- If bilinear state is on, you get percentage-closer filtering (PCF) of the 4 nearest texels
- Use a single tap for performance
  - Quality roughly equivalent to 4-tap PCF R32F
- Use multiple taps for higher quality
  - 4-tap HW shadow map roughly as fast as 4-tap manual-PCF R32F
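For reference, this is roughly what percentage-closer filtering over the 4 nearest texels computes, and what a manual R32F path has to do in the shader. The sketch is a simplification: the function name is hypothetical, coordinates are in texel units without half-texel centering, and there is no border handling.

```python
import math


def pcf_bilinear(depth_map, u, v, ref_depth):
    """Percentage-closer filtering over the 4 texels nearest to (u, v).

    depth_map is a 2D list indexed [row][col]. Each texel is compared first
    (1.0 = lit, 0.0 = shadowed), then the comparison results are blended
    with bilinear weights. Filtering the depths themselves would be wrong.
    """
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    fx, fy = u - x0, v - y0

    def lit(x, y):
        return 1.0 if ref_depth <= depth_map[y][x] else 0.0

    top = lit(x0, y0) * (1 - fx) + lit(x0 + 1, y0) * fx
    bottom = lit(x0, y0 + 1) * (1 - fx) + lit(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy


depth_map = [[0.2, 0.9],
             [0.2, 0.9]]
visibility = pcf_bilinear(depth_map, 0.5, 0.5, ref_depth=0.5)
```

With hardware shadow maps the compare and the blend happen in the texture unit, so a single tap already yields this 4-texel result.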

Hardware Shadow Map Fallback
- Possible to use R32F or R16F shadow maps
  - Render depth to a single-channel float texture in the shader
  - Multiple jittered samples for high quality / soft edges
- Easy to maintain both hardware shadow map and R32F/R16F code paths:
  - Same setup and pipeline as any shadow map technique
  - HW shadow map shader code is simpler and faster
- HW shadow maps buy speed or quality (or both)

Texture Instruction Performance
- texldb (scalar LOD bias): full speed
- texldl (explicit scalar LOD selection): full speed
  - Hardware need not calculate derivatives for LOD
  - Possible to dynamically branch over these instructions
- texldd (gradient-based LOD selection): a factor of 10 slower!
  - But when you need to use this, you need to use this

Floating Point Texture Performance
- Prefer 64bpp float textures and render targets
  - Half the bandwidth of 128bpp (fp32) textures
  - More importantly: double the cache coherence
  - Poor cache coherence destroys performance
  - Fp16 textures are 2x faster than fp32 if texture bound
- Also important: efficient channel allocation
  - Use R32F buffers for scalar data, and R16G16F for 2-vectors
  - Double the cache coherence again!
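The bandwidth ratios are plain bits-per-pixel arithmetic. A small sketch, with a hypothetical helper and an arbitrary 1024x1024 size chosen for illustration:

```python
def texture_bytes(width, height, bits_per_pixel):
    """Storage for one mip level, ignoring padding and alignment."""
    return width * height * bits_per_pixel // 8


fp16_rgba = texture_bytes(1024, 1024, 64)     # 4 channels x fp16
fp32_rgba = texture_bytes(1024, 1024, 128)    # 4 channels x fp32
fp32_scalar = texture_bytes(1024, 1024, 32)   # R32F, one channel
```

Halving the bytes per texel also doubles how many texels fit in a cache line, which is the cache-coherence point made above.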

Common Sense Texture Performance
- Use mipmaps
  - GPU fetches the local neighborhood for each texel
- For sharper/crisper textures
  - Use anisotropic filtering
  - Use better mipmap generation (use texture tools)
  - Do NOT use LOD bias
  - LOD bias is slower and lower quality

Normal Maps
- Use D3DFMT_V8U8 or DXT5
  - To store x and y
  - Derive z in the shader
- Simon Green's normal map compression paper
  - Compares the quality of a variety of formats
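Deriving z relies on the normal being unit length, so z = sqrt(1 - x^2 - y^2). A minimal sketch, assuming tangent-space normals with non-negative z and x, y already expanded to the [-1, 1] range; the function name is hypothetical.

```python
import math


def reconstruct_normal(x, y):
    """Rebuild a unit normal from its stored x and y components.

    Assumes z >= 0 (tangent space). The clamp guards against x*x + y*y
    slightly exceeding 1 after quantization or compression.
    """
    z_squared = max(0.0, 1.0 - x * x - y * y)
    return (x, y, math.sqrt(z_squared))


n = reconstruct_normal(0.6, 0.0)
```

In a shader this costs a dot product, a subtract, and a square root, traded against storing one channel fewer.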

Multiple Render Targets
- MRTs are useful for reducing rendering passes
  - When you need to output more than a single 4-vector
  - Deferred shading, particle physics, GPGPU algorithms
  - Replaces up to four passes with one
- But MRT is not free
  - High bandwidth cost, especially with float formats
  - Small overhead per target rendered
- GeForce 6 has a sweet spot of 3 render targets (RTs)
  - Split 6 passes into two 3-RT passes
  - Not one 4-RT pass and one 2-RT pass
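The sweet-spot advice amounts to chunking the outputs into groups of at most three. A minimal sketch with a hypothetical helper; the output names are placeholders.

```python
def split_into_mrt_passes(outputs, max_targets=3):
    """Group shader outputs into passes of at most max_targets render targets."""
    return [outputs[i:i + max_targets] for i in range(0, len(outputs), max_targets)]


passes = split_into_mrt_passes(["a", "b", "c", "d", "e", "f"])
```

Six outputs become two 3-RT passes rather than a 4-RT pass followed by a 2-RT pass.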

Other Render Target Advice
- Do not render the entire scene to a texture
  - Not getting AA
  - If the user turns on control panel AA, it is hard to detect
- Instead, render to the back buffer, then StretchRect
  - Drivers give performance priority to the back buffer, ahead of texture surfaces
  - AA works with the back buffer

Full Screen Effects
- Use scissor rects to restrict rendering
  - Light bounds, etc.
- Do not use full-screen quads
  - Use full-screen triangles with a scissor rect instead
  - Completely avoids inefficient diagonals
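One common way to build a full-screen triangle is to oversize it so that the visible [-1, 1] clip-space square lies entirely inside it. The vertex positions below are that common choice, not something taken from the slides, and the helper names are hypothetical.

```python
# Clip-space (x, y) positions; the parts outside [-1, 1] are clipped or scissored away.
FULLSCREEN_TRIANGLE = [(-1.0, -1.0), (3.0, -1.0), (-1.0, 3.0)]


def edge(a, b, p):
    """Signed area test: >= 0 when p is on the left of the directed edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covers(triangle, p):
    """True if point p lies inside or on a counter-clockwise triangle."""
    a, b, c = triangle
    return edge(a, b, p) >= 0 and edge(b, c, p) >= 0 and edge(c, a, p) >= 0


corners = [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]
all_covered = all(covers(FULLSCREEN_TRIANGLE, p) for p in corners)
```

Because a single triangle covers the screen, there is no shared diagonal edge along which pixels get processed twice.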

Floating Point Blending
- GeForce FX needs to emulate float blending
  - Using a ping-pong buffer
  - Lots of context switches and additional passes
  - Blending, e.g., lots of particles becomes infeasible
- But fp16 is 2x the bandwidth of A8R8G8B8

Increased Read Back Performance
- Pre-GeForce 6
  - Best case < 200 MB/s, all chipsets
  - Only PCI cycles used to write back to host memory
- GeForce 6800 (AGP)
  - 600 MB/s to 1.0 GB/s, depending on AGP chipset
- PCI-E workstation boards
  - 1.0 GB/s on Quadro FX 4400
  - Up to 2.4 GB/s on Quadro FX 1400

Read Back Still a BAD Idea
- Read back still synchronizes CPU and GPU
- CPU stalls until the GPU finishes all rendering
  - Can you afford wasting precious CPU cycles?
- GPU pipeline drains completely and becomes idle

Memory Allocation
- Order of resource allocation affects performance
- Allocate render targets first
  - Sort order by pitch (bpp * width)
  - Sort pitch groups by frequency of use (most used first)
- Then create vertex and pixel shaders
- Load / create remaining textures
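One possible reading of the render-target ordering is sketched below: group targets by pitch, then order the groups by how often they are used. The record fields, the `allocation_order` name, and the choice to rank a group by its most-used target are all assumptions made for illustration.

```python
from collections import defaultdict


def allocation_order(render_targets):
    """render_targets: list of dicts with 'name', 'width', 'bpp', 'uses_per_frame'."""
    groups = defaultdict(list)
    for rt in render_targets:
        groups[rt["bpp"] * rt["width"]].append(rt)    # group by pitch (bpp * width)

    # Assumption: a group's frequency is that of its most-used target.
    ordered_groups = sorted(
        groups.values(),
        key=lambda g: max(rt["uses_per_frame"] for rt in g),
        reverse=True,
    )

    order = []
    for group in ordered_groups:
        group.sort(key=lambda rt: rt["uses_per_frame"], reverse=True)
        order.extend(rt["name"] for rt in group)
    return order


targets = [
    {"name": "bloom", "width": 256, "bpp": 32, "uses_per_frame": 2},
    {"name": "scene", "width": 1024, "bpp": 32, "uses_per_frame": 5},
    {"name": "shadow", "width": 1024, "bpp": 32, "uses_per_frame": 1},
    {"name": "blur", "width": 256, "bpp": 32, "uses_per_frame": 4},
]
order = allocation_order(targets)
```

Shaders and the remaining textures would be created only after every name in `order` has been allocated.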

Conclusion
- Lots of new/fast features
  - Instancing, vs.3.0 flow control, vertex texture fetch
  - Z-/stencil-cull, fast z-only
  - Fast normalize, ps.3.0 flow control
  - Hardware shadow maps, fp16 blending
- With some sneaky gotchas
- Use these features to attack bottlenecks
  - CPU
  - Pixel shaders
  - ...

Questions?
- NVIDIA GPU Programming Guide: http://developer.nvidia.com/object/gpu_programming_guide.html
- Matthias Wloka (mwloka@nvidia.com)
- http://developer.nvidia.com
