The Source for GPU Programming
1 The Source for GPU Programming developer.nvidia.com Latest News Developer Events Calendar Technical Documentation Conference Presentations GPU Programming Guide Powerful Tools, SDKs, and more... Join our FREE registered developer program for early access to NVIDIA drivers, cutting edge tools, online support forums, and more!
2 GeForce 6 Series Performance Matthias Wloka Developer Technology
3 GeForce 6 Series Specific Performance Instancing Vertex- and Pixel-Shaders 3.0 Branching and Looping Vertex Texture Fetch Hardware Shadow Maps Z- and Stencil-Cull FP16 Filter and Blend, MRTs
4 Marketing Speak Translation SM3, i.e., Shader Model 3 hardware Sometimes shorthand for Every GeForce 6 feature not in GeForce FX Not just VS/PS 3.0 See previous slide! GeForce 6200 does not support fp16 filter/blend Okay, because: value cards lack memory b/w to use fp16 render-targets
5 Simplified Graphics Pipeline [Diagram stages: CPU, Geometry Storage, Geometry Processor, Rasterizer, Fragment Processor, Frame Buffer, Texture Storage + Filtering. New features marked on the diagram: Instancing, Vertex Shader 3.0, Z/Stencil Cull, Pixel Shader 3.0, Fp16 Filter, Shadow Maps, Fp16 Blend, MRT] Common bottlenecks: CPU, Fragment processor. New features help address these bottlenecks
6 CPU Bottleneck Getting Worse Courtesy Ian Buck, Stanford University
7 Explicitly Address CPU Bottleneck Reduce draw calls Budget/Design for your draw calls! Use instancing to reduce batches Use über-shaders to eliminate batches/passes Use fp16 blending to eliminate passes Move more computations to GPU: GPGPU: General-Purpose Computations Using GPUs See
8 Detail of a Single Vertex Shader Pipeline Input Vertex Data Vertex Texture Fetch FP32 Scalar Unit FP32 Vector Unit Branch Unit Texture Cache Primitive Assembly Viewport Processing To Setup
9 Instancing: What Is It? Lets the GPU loop over vertex buffers: Tree Model VB, Transform Matrices VB. A single draw call generates many instances of an object
10 Instancing Demo Complex lighting, post-processing Simple CPU collision
11 Instancing Advantages Alternatives: One draw call / instance, change state in-between Static batching (static pre-transformed VB) Dynamic batching (dynamic 2 stream instancing) Vertex constant instancing See Instancing code sample and whitepaper: Individual_Samples/samples.html Most flexible and has the least Draw calls Memory overhead CPU/Bus overhead
12 But Multiple vertex streams GPU does extra work Vertex sizes are larger Transform matrix is a per vertex attribute
13 Attribute Bound Extra data fetched per instance Explains slowdown Vertex cache optimize Cache hit saves all vertex work: Including attribute access Pack input attributes as tightly as possible Even if vertex shader work required to unpack Move constants or derivables out of attributes
14 Instancing Performance [Chart: Instancing Method Comparison, 28-poly mesh. Y axis: % FPS relative to hardware instancing within each group; X axis: # Polys. Series: Single Draw Calls, Dynamic 2 Stream Instancing, Static 2 Stream Instancing, VS Constant Instancing, Hardware Instancing, Static Pretransformed VB]
15 Another View [Chart: FPS per polys, 28-poly mesh. Y axis: FPS; X axis: # Polys. Series: Single Draw Calls, Static 2 Stream Instancing, Hardware Instancing, Dynamic 2 Stream Instancing, VS Constant Instancing, Static Pretransformed VB]
16 Vertex Shader 3.0: Flow Control Vertex flow control near optimal: Branch instructions have fixed ~1 cycle overhead Divergence is full speed (MIMD) Vertex branching is a win Except for short branches Compiler/Driver decides Example: Single unified v-shader for 1, 2, 3, and 4 bone skinning Use branches and loops to Consolidate batches Skip over unnecessary work
17 Vertex Texture Fetch (VTF) Mipmapped texture fetches from vertex: Only R32f and R32G32B32A32f formats Only point-sampling Up to 4 different texture stages Sample as often as you like Large latency Equivalent to instructions
18 Cover the Latency Latency means you can hide other ops in it, for free. Compiler/driver does this for you if possible:
    texldl r0, v0, sampler0
    mul r1, v1, c0 // stuff not depending on vtf result
    add r1, r1, r0 // use vtf result for the first time
Branch over VTF if possible. Dependent VTFs are slow: less chance to hide latency
19 Vertex Texture Fetch Performance GeForce 6800 capable of peak 600 MVerts/s with minimalist (err, read: no) work per vertex. Max with a single VTF: 33 MVerts/s. Not all vertices in a frame need to be displaced: 1 million displaced vertices per frame still gives 33 fps! Do not use as general constant memory replacement
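The 33 fps figure on this slide follows directly from the quoted 33 MVerts/s peak. A minimal sketch of that arithmetic, assuming displaced vertices are the only cost in the frame (the function and constant names are illustrative, not from the slides):

```python
# Peak rate quoted on the slide for one vertex texture fetch per vertex.
PEAK_VTF_VERTS_PER_SEC = 33_000_000


def max_fps(displaced_vertices_per_frame, verts_per_sec=PEAK_VTF_VERTS_PER_SEC):
    """Upper bound on frame rate if VTF-displaced vertices were the only cost."""
    if displaced_vertices_per_frame <= 0:
        raise ValueError("displaced_vertices_per_frame must be positive")
    return verts_per_sec / displaced_vertices_per_frame


if __name__ == "__main__":
    for count in (250_000, 1_000_000):
        print(count, "displaced vertices ->", max_fps(count), "fps at best")
```

Real frames do other work as well, so this is a ceiling rather than a prediction.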
20 Early Z and Stencil Cull Cull pixels that (will) fail depth/stencil tests before entering pixel-shader For maximum z-cull: Render roughly front to back Or even better: render z-only pass before normal rendering Do stencil-only passes for other cull tricks
21 Things That Disable Z Culling Changing depth-test direction For example, less-equal to greater-equal Only resets on clear
22 Z-Cull Uses Highly Compressed Z-Rep Triangles with holes (alpha test/texkill/clip planes) are not occluding Small triangles are bad occluders Small ~= less than 4x4 pixels Z-cull may not recognize triangle as occluder [Figure: good and bad occluder examples]
23 Things That Disable Stencil Culling Changing stencil function, reference, or mask Only resets on clear Writing stencil while rejecting based on stencil Write stencil in separate pass from rejecting color/z
24 Stencil Cull Example Step 1: Render light volume with color write disabled: Depth func = LESS, Stencil func = ALWAYS, Stencil Z-FAIL = REPLACE (with value X), rest of stencil ops set to KEEP. Step 2: Render with lighting shader: Depth func = ALWAYS, Stencil func = EQUAL, all ops = KEEP, Stencil ref = X. Unlit pixels will be culled because stencil does not match reference value
25 Fast Z-Only Rendering GeForce FX and 6 Series render z/stencil at double speed! Important for dynamic shadow maps! Makes z-first/only pass (for z-cull benefits) attractive Only enabled if: No color-writes Disable pixel shaders (no depth replace, no texkill) Disable alpha test/color key 8-bit/component color buffer bound (not float) No user clip planes No AA
26 Pixel Shader 3.0 Performance What is Pixel Shader 3.0? 3.0 shaders help both CPU and GPU bottlenecks Consolidate draw calls / passes (über-shaders) Early-outs with dynamic branching Gory performance details of particular pixel shader 3.0 features
27 Detail of a Single Pixel Shader Pipeline [Diagram blocks: Input Fragment Data; Texture Data; FP Texture Processor; Texture Filter (bilinear / trilinear / aniso, 4-tap at full speed, 16:1 aniso with trilinear, FP16 texture filtering); Texture Cache; FP32 Shader Unit 1 (4 FP ops/pixel, co-issue, texture address calc, free fp16 normalize + mini ALU); FP32 Shader Unit 2 (4 FP ops/pixel, co-issue + mini ALU); Shader Model 3.0 Branch Processor; Fog ALU; Output Shaded Fragments. Annotations: SIMD architecture, co-issue FP32 computation]
28 Half (fp16) Performance Half (fp16) still matters! Critical for GeForce FX performance Reduces register pressure Better able to hide texture latency Fast fp16 normalize Compiler/driver can NOT help you with this
29 GeForce 6 Single Cycle Normalize() Pixel shader unit has single-cycle normalize Caveat: only for 3-component 16-bit float values
    float3 f3; half3 h3; half4 h4;
    f3 = normalize(f3); // slow: dp3/rsq/mul
    h3 = normalize(h3); // fast: nrmh
    h4 = normalize(h4); // slow: dp4/rsq/mul
    h4.xyz = normalize(h4.xyz); // fast: nrmh
30 GeForce 6 Superscalar Execution Executes multiple instructions simultaneously For example, in a single cycle you can execute Two 2-vector instructions, or One 3-vector and one scalar instruction Plus, there are 2 math units per shader pipe Use swizzle / write masks to help compiler
    half4 A, B;
    A.w = sin(A.w); // A = sin(A.w) is not enough
    A.xyz = A.xyz * B.xyz;
31 GeForce 6 Series Co-Issue 2 different instructions executing in the same cycle in the same shader unit; 2 separate shader units; 4 instructions/pixel/cycle [Diagram: Shader Unit 1 splits R G B A between Operation 1 and Operation 2; Shader Unit 2 splits R G B A between Operation 3 and Operation 4]
32 Flow Control Performance Overview Flow control instruction costs: Not free, but useful [Table columns: Instruction, Cost (Cycles). Rows: if / endif; if / else / endif; call; ret; loop / endloop] Additional costs when pixels diverge (more later)
33 Looping Costs DirectX ps.3.0 supports only static loops Unrolling is faster Compiler/driver can do that for you Nonetheless useful because Reduces high-level code-complexity Reduces passes Multiple lights in a single pass can be a big win Number of lights unknown at compile time Reduces proliferation of pre-compiled shaders Thousands of shaders from just a few templates Overcomes DirectX's 512 static instruction limit
34 Branching Costs Branching can provide substantial boost If able to skip > 6 instruction cycles, and If the branch condition is coherent [Figure: coherent vs. incoherent branch conditions] Noisy branch conditions cause performance loss Potentially worse than taking both branches all the time
35 How Coherent Do I Have To Be? GPU has hundreds of pixels in flight Best if coherent over regions of > ~1000 pixels That's only ~30x30! You need to experiment in your own application Soft shadow demo shows: Incoherent branches on small portion of screen is still a big win
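The "~30x30" remark is just the square root of the ~1000-pixel figure. A small sketch of that conversion, assuming a square screen region (the function name is illustrative):

```python
import math


def coherent_region_side(pixel_count):
    """Side length, in pixels, of a square region containing pixel_count pixels."""
    if pixel_count < 0:
        raise ValueError("pixel_count must be non-negative")
    return math.sqrt(pixel_count)


if __name__ == "__main__":
    side = coherent_region_side(1000)
    print(f"~1000 coherent pixels is roughly a {side:.1f} x {side:.1f} region")
```

The region does not have to be square in practice; the point is only that the coherent area needed is small relative to the screen.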
36 Combine Branching With Others Back face register (vface) Shade front faces differently from back faces Position register (vpos) Shade based on position For example, skip or simplify distant pixels Early out: If in shadow, don't do lighting computations If out of range (attenuation zero), don't light Applies to vs.3.0 as well
37 Soft Shadow Demo
38 How Soft Shadow Demo Works Takes 8 test samples from shadow map If all 8 in shadow or all 8 in the light then done If on the edge (some in shadow/some in light) Do 56 more samples for additional quality 64 samples at much lower cost! Quick-and-dirty importance sampling Dynamic sampling > 2x faster Vs. 64 samples everywhere
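The sampling scheme described on this slide can be sketched as a per-pixel decision plus an average sample count. This is only a model of the sample budget, not the demo's shader; the function names and the `edge_fraction` parameter are assumptions introduced for illustration.

```python
TEST_SAMPLES = 8    # initial shadow-map taps, from the slide
EXTRA_SAMPLES = 56  # additional taps taken only on shadow edges, from the slide


def samples_for_pixel(test_results):
    """Taps spent on one pixel, given the shadow test results (True = in shadow)."""
    if len(test_results) != TEST_SAMPLES:
        raise ValueError("expected exactly 8 test samples")
    fully_lit = not any(test_results)
    fully_shadowed = all(test_results)
    if fully_lit or fully_shadowed:
        return TEST_SAMPLES                  # early out: all 8 samples agree
    return TEST_SAMPLES + EXTRA_SAMPLES      # edge pixel: 64 taps in total


def average_samples(edge_fraction):
    """Mean taps per pixel when edge_fraction of pixels lie on a shadow edge."""
    if not 0.0 <= edge_fraction <= 1.0:
        raise ValueError("edge_fraction must be between 0 and 1")
    return TEST_SAMPLES + edge_fraction * EXTRA_SAMPLES


if __name__ == "__main__":
    print(samples_for_pixel([True] * 8))            # fully shadowed pixel
    print(samples_for_pixel([True] * 3 + [False] * 5))  # edge pixel
    print(average_samples(0.25))
```

Tap counts alone do not predict frame time, since incoherent branches carry their own cost; the slide's "> 2x faster" figure is a measured result, not something this model derives.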
39 Hardware Shadow Maps In DirectX, Render to a depth format texture (D3DFMT_D24X8, D3DFMT_D16) Use tex2Dproj to sample Shadow map comparison happens automatically In OpenGL, Render to DEPTH_COMPONENT texture Use TEXTURE_COMPARE_MODE_ARB with COMPARE_R_TO_TEXTURE
40 Hardware Shadow Map Performance Shadow map comparison is free (full speed) No need to compare and filter in the shader If bilinear state is on, then percentage-closer filtering (PCF) of the 4 nearest texels Use single tap for performance Quality roughly equivalent to 4-tap PCF R32F Use multiple taps for higher quality 4-tap HW shadow map roughly as fast as 4-tap manual PCF R32F
41 Hardware Shadow Map Fallback Possible to use R32F or R16F shadow maps Render depth to single-channel float texture in shader Multiple jittered samples for high quality / soft edges Easy to maintain hardware shadow maps and R32F/R16F code paths: Same setup and pipeline as any shadow map technique HW shadow map shader code simpler and faster HW shadow maps buy speed or quality (or both)
42 Texture Instruction Performance Texldb (scalar LOD bias): Full speed Texldl (explicit scalar LOD selection): Full speed Hardware need not calculate derivatives for LOD Possible to dynamically branch over these instructions Texldd (gradient-based LOD selection): Factor 10 slower! But when you need to use this, you need to use this
43 Floating Point Texture Performance Prefer 64bpp float textures and render targets Half the bandwidth of 128bpp (fp32) textures More importantly: double cache coherence Poor cache coherence destroys performance Fp16 textures 2x faster than fp32 if texture bound Also important: efficient channel allocation Use R32F buffers for scalar data, and R16G16F for 2-vectors Double cache coherence again!
44 Common Sense Texture Performance Use mipmaps GPU fetches local neighborhood for each texel Sharper/Crisper textures Use anisotropic filtering Use better mipmap generation (use texture tools) Do NOT use LOD bias LOD bias is slower and lower quality
45 Normal Maps Use D3DFMT_V8U8 or DXT5 To store x and y Derive z in shader Simon Green's normal map compression paper Compares quality of variety of formats
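"Derive z in shader" relies on the normal being unit length, so z = sqrt(1 - x^2 - y^2). A minimal CPU-side sketch of that reconstruction follows; it assumes x and y are already unpacked to the [-1, 1] range and that z is non-negative (as in a tangent-space normal map), and the function name is illustrative.

```python
import math


def reconstruct_normal(x, y):
    """Rebuild a unit normal from its stored x and y components, assuming z >= 0."""
    # Clamp so that quantization error in x and y cannot produce a negative radicand.
    z_squared = max(0.0, 1.0 - x * x - y * y)
    return (x, y, math.sqrt(z_squared))


if __name__ == "__main__":
    print(reconstruct_normal(0.0, 0.0))
    print(reconstruct_normal(0.6, 0.0))
```

In a shader the same computation is a dot product, a subtraction, and a square root, which is the trade the slide makes in exchange for storing two channels instead of three.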
46 Multiple Render Targets MRTs useful for reducing rendering passes When you need to output more than single 4-vector Deferred shading, particle physics, GPGPU algorithms Replaces up to four passes with one But MRT is not free High bandwidth cost, especially with float formats Small overhead per target rendered GeForce 6 has a sweet spot of 3 render targets (RTs) Split 6 passes into 2 3-RT passes Not 1 4-RT pass and 1 2-RT pass
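The "two 3-RT passes, not 4 + 2" advice amounts to chunking the outputs into groups of three. A short sketch under that reading; the output names are hypothetical and the group size of 3 is the sweet spot quoted on the slide.

```python
SWEET_SPOT_TARGETS = 3  # render targets per pass, as quoted for GeForce 6


def split_into_mrt_passes(outputs, targets_per_pass=SWEET_SPOT_TARGETS):
    """Group per-pass outputs into MRT passes of at most targets_per_pass targets."""
    if targets_per_pass < 1:
        raise ValueError("targets_per_pass must be at least 1")
    return [
        outputs[start:start + targets_per_pass]
        for start in range(0, len(outputs), targets_per_pass)
    ]


if __name__ == "__main__":
    # Hypothetical outputs of six single-target passes.
    outputs = ["albedo", "normal", "depth", "specular", "velocity", "emissive"]
    for index, group in enumerate(split_into_mrt_passes(outputs), start=1):
        print(f"pass {index}: {group}")
```

Which outputs share a pass would in practice also depend on which shader computes them; this sketch only illustrates the grouping size.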
47 Other Render Target Advice Do not render entire scene to a texture Not getting AA If user turns on control panel AA, hard to detect Instead, render to back buffer, then StretchRect Drivers give performance priority to back buffer Ahead of texture surfaces AA works with back buffer
48 Full Screen Effects Use scissor rects to restrict rendering Light bounds, etc. Do not use full screen quads Use full-screen triangles with scissor rect instead Completely avoids inefficient diagonals
49 Floating Point Blending GeForce FX needs to emulate float blending Using ping-pong buffer Lots of context switches and additional passes Blending, e.g., lots of particles becomes infeasible But fp16 is 2x bandwidth vs. A8R8G8B8
50 Increased Read Back Performance Pre-GeForce 6 Best case, < 200MB/s, all chipsets Only PCI cycles used to write back to host memory GeForce 6800 (AGP) 600 MB/s GB/s, depending on AGP chipset PCI-E Workstation boards 1.0 GB/s on Quadro FX 4400 Up to 2.4 GB/s on Quadro FX 1400
51 Read Back Still a BAD Idea Read back still synchronizes CPU and GPU CPU stalls until GPU finishes all rendering Can you afford wasting precious CPU cycles? GPU pipeline drains completely and becomes idle
52 Memory Allocation Order of resource allocation affects performance Allocate render targets first Sort order by pitch (bpp * width) Sort pitch groups by frequency of use (most used first) Then create vertex and pixel shaders Load / create remaining textures
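The render-target ordering rule can be sketched as "group by pitch, then order the groups by how often they are used". The slide does not say how to rank a group containing several targets, so this sketch assumes a group is as hot as its most-used member; the dictionary keys and target names are likewise hypothetical.

```python
from collections import defaultdict


def allocation_order(render_targets):
    """Return render-target names in a suggested allocation order.

    Each target is a dict with the hypothetical keys
    "name", "width", "bpp" and "uses_per_frame".
    """
    groups = defaultdict(list)
    for target in render_targets:
        pitch = target["bpp"] * target["width"]  # pitch as defined on the slide
        groups[pitch].append(target)

    def group_heat(group):
        # Assumption: rank a pitch group by its most frequently used target.
        return max(t["uses_per_frame"] for t in group)

    ordered = []
    for group in sorted(groups.values(), key=group_heat, reverse=True):
        group.sort(key=lambda t: t["uses_per_frame"], reverse=True)
        ordered.extend(t["name"] for t in group)
    return ordered


if __name__ == "__main__":
    targets = [
        {"name": "shadow_map", "width": 1024, "bpp": 32, "uses_per_frame": 1},
        {"name": "hdr_scene", "width": 1280, "bpp": 64, "uses_per_frame": 4},
        {"name": "bloom", "width": 1024, "bpp": 32, "uses_per_frame": 2},
    ]
    print(allocation_order(targets))
```

Shaders and the remaining textures would be created after this list, following the order given on the slide.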
53 Conclusion Lots of new/fast features Instancing, vs.3.0 flow control, vertex texture fetch Z-/Stencil-cull, fast z-only Fast normalize, ps.3.0 flow control Hardware shadow maps, fp16 blending With some sneaky gotchas Use these features to attack bottlenecks CPU Pixel shaders...
54 Questions? NVIDIA GPU Programming Guide: gpu_programming_guide.html Matthias Wloka
Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager
Optimizing DirectX Graphics Richard Huddy European Developer Relations Manager Some early observations Bear in mind that graphics performance problems are both commoner and rarer than you d think The most
More informationGraphics Processing Unit Architecture (GPU Arch)
Graphics Processing Unit Architecture (GPU Arch) With a focus on NVIDIA GeForce 6800 GPU 1 What is a GPU From Wikipedia : A specialized processor efficient at manipulating and displaying computer graphics
More informationOptimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager
Optimizing for DirectX Graphics Richard Huddy European Developer Relations Manager Also on today from ATI... Start & End Time: 12:00pm 1:00pm Title: Precomputed Radiance Transfer and Spherical Harmonic
More informationSqueezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques
Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques Jonathan Zarge, Team Lead Performance Tools Richard Huddy, European Developer Relations Manager ATI
More informationGraphics Performance Optimisation. John Spitzer Director of European Developer Technology
Graphics Performance Optimisation John Spitzer Director of European Developer Technology Overview Understand the stages of the graphics pipeline Cherchez la bottleneck Once found, either eliminate or balance
More informationReal - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský
Real - Time Rendering Pipeline optimization Michal Červeňanský Juraj Starinský Motivation Resolution 1600x1200, at 60 fps Hw power not enough Acceleration is still necessary 3.3.2010 2 Overview Application
More informationGPU Computation Strategies & Tricks. Ian Buck NVIDIA
GPU Computation Strategies & Tricks Ian Buck NVIDIA Recent Trends 2 Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit
More informationWindowing System on a 3D Pipeline. February 2005
Windowing System on a 3D Pipeline February 2005 Agenda 1.Overview of the 3D pipeline 2.NVIDIA software overview 3.Strengths and challenges with using the 3D pipeline GeForce 6800 220M Transistors April
More informationDirect3D API Issues: Instancing and Floating-point Specials. Cem Cebenoyan NVIDIA Corporation
Direct3D API Issues: Instancing and Floating-point Specials Cem Cebenoyan NVIDIA Corporation Agenda Really two mini-talks today Instancing API Usage Performance / pitfalls Floating-point specials DirectX
More informationHow to Work on Next Gen Effects Now: Bridging DX10 and DX9. Guennadi Riguer ATI Technologies
How to Work on Next Gen Effects Now: Bridging DX10 and DX9 Guennadi Riguer ATI Technologies Overview New pipeline and new cool things Simulating some DX10 features in DX9 Experimental techniques Why This
More informationProgramming Graphics Hardware
Tutorial 5 Programming Graphics Hardware Randy Fernando, Mark Harris, Matthias Wloka, Cyril Zeller Overview of the Tutorial: Morning 8:30 9:30 10:15 10:45 Introduction to the Hardware Graphics Pipeline
More informationReadings on graphics architecture for Advanced Computer Architecture class
Readings on graphics architecture for Advanced Computer Architecture class Attached are several short readings on graphics architecture. They are a mix of application-focused and hardware-focused readings.
More informationEvolution of GPUs Chris Seitz
Evolution of GPUs Chris Seitz Overview Concepts: Real-time rendering Hardware graphics pipeline Evolution of the PC hardware graphics pipeline: 1995-1998: Texture mapping and z-buffer 1998: Multitexturing
More informationGraphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal
Graphics Hardware, Graphics APIs, and Computation on GPUs Mark Segal Overview Graphics Pipeline Graphics Hardware Graphics APIs ATI s low-level interface for computation on GPUs 2 Graphics Hardware High
More informationGeForce4. John Montrym Henry Moreton
GeForce4 John Montrym Henry Moreton 1 Architectural Drivers Programmability Parallelism Memory bandwidth 2 Recent History: GeForce 1&2 First integrated geometry engine & 4 pixels/clk Fixed-function transform,
More informationThe NVIDIA GeForce 8800 GPU
The NVIDIA GeForce 8800 GPU August 2007 Erik Lindholm / Stuart Oberman Outline GeForce 8800 Architecture Overview Streaming Processor Array Streaming Multiprocessor Texture ROP: Raster Operation Pipeline
More informationReal-Time Rendering (Echtzeitgraphik) Michael Wimmer
Real-Time Rendering (Echtzeitgraphik) Michael Wimmer wimmer@cg.tuwien.ac.at Walking down the graphics pipeline Application Geometry Rasterizer What for? Understanding the rendering pipeline is the key
More informationThe Application Stage. The Game Loop, Resource Management and Renderer Design
1 The Application Stage The Game Loop, Resource Management and Renderer Design Application Stage Responsibilities 2 Set up the rendering pipeline Resource Management 3D meshes Textures etc. Prepare data
More informationGraphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics
Why GPU? Chapter 1 Graphics Hardware Graphics Processing Unit (GPU) is a Subsidiary hardware With massively multi-threaded many-core Dedicated to 2D and 3D graphics Special purpose low functionality, high
More informationToday s Agenda. DirectX 9 Features Sim Dietrich, nvidia - Multisample antialising Jason Mitchell, ATI - Shader models and coding tips
Today s Agenda DirectX 9 Features Sim Dietrich, nvidia - Multisample antialising Jason Mitchell, ATI - Shader models and coding tips Optimization for DirectX 9 Graphics Mike Burrows, Microsoft - Performance
More informationGCN Performance Tweets AMD Developer Relations
AMD Developer Relations Overview This document lists all GCN ( Graphics Core Next ) performance tweets that were released on Twitter during the first few months of 2013. Each performance tweet in this
More informationPowerVR Hardware. Architecture Overview for Developers
Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.
More informationHardware-driven Visibility Culling Jeong Hyun Kim
Hardware-driven Visibility Culling Jeong Hyun Kim KAIST (Korea Advanced Institute of Science and Technology) Contents Introduction Background Clipping Culling Z-max (Z-min) Filter Programmable culling
More informationRSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog
RSX Best Practices Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog RSX Best Practices About libgcm Using the SPUs with the RSX Brief overview of GCM Replay December 7 th, 2004
More informationReal - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský
Real - Time Rendering Graphics pipeline Michal Červeňanský Juraj Starinský Overview History of Graphics HW Rendering pipeline Shaders Debugging 2 History of Graphics HW First generation Second generation
More informationDX10, Batching, and Performance Considerations. Bryan Dudash NVIDIA Developer Technology
DX10, Batching, and Performance Considerations Bryan Dudash NVIDIA Developer Technology The Point of this talk The attempt to combine wisdom and power has only rarely been successful and then only for
More informationPerformance OpenGL Programming (for whatever reason)
Performance OpenGL Programming (for whatever reason) Mike Bailey Oregon State University Performance Bottlenecks In general there are four places a graphics system can become bottlenecked: 1. The computer
More informationX. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1
X. GPU Programming 320491: Advanced Graphics - Chapter X 1 X.1 GPU Architecture 320491: Advanced Graphics - Chapter X 2 GPU Graphics Processing Unit Parallelized SIMD Architecture 112 processing cores
More informationArchitectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1
Architectures Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1 Overview of today s lecture The idea is to cover some of the existing graphics
More informationHardware-driven visibility culling
Hardware-driven visibility culling I. Introduction 20073114 김정현 The goal of the 3D graphics is to generate a realistic and accurate 3D image. To achieve this, it needs to process not only large amount
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More informationGUERRILLA DEVELOP CONFERENCE JULY 07 BRIGHTON
Deferred Rendering in Killzone 2 Michal Valient Senior Programmer, Guerrilla Talk Outline Forward & Deferred Rendering Overview G-Buffer Layout Shader Creation Deferred Rendering in Detail Rendering Passes
More informationCS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST
CS 380 - GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1 Markus Hadwiger, KAUST Reading Assignment #2 (until Feb. 17) Read (required): GLSL book, chapter 4 (The OpenGL Programmable
More informationgems_ch28.qxp 2/26/ :49 AM Page 469 PART V PERFORMANCE AND PRACTICALITIES
gems_ch28.qxp 2/26/2004 12:49 AM Page 469 PART V PERFORMANCE AND PRACTICALITIES gems_ch28.qxp 2/26/2004 12:49 AM Page 470 gems_ch28.qxp 2/26/2004 12:49 AM Page 471 As GPUs become more complex, incorporating
More informationWorking with Metal Overview
Graphics and Games #WWDC14 Working with Metal Overview Session 603 Jeremy Sandmel GPU Software 2014 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission
More informationOptimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June
Optimizing and Profiling Unity Games for Mobile Platforms Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June 1 Agenda Introduction ARM and the presenter Preliminary knowledge
More informationUltimate Graphics Performance for DirectX 10 Hardware
Ultimate Graphics Performance for DirectX 10 Hardware Nicolas Thibieroz European Developer Relations AMD Graphics Products Group nicolas.thibieroz@amd.com V1.01 Generic API Usage DX10 designed for performance
More informationFeeding the Beast: How to Satiate Your GoForce While Differentiating Your Game
GDC Europe 2005 Feeding the Beast: How to Satiate Your GoForce While Differentiating Your Game Lars M. Bishop NVIDIA Embedded Developer Technology 1 Agenda GoForce 3D capabilities Strengths and weaknesses
More informationE.Order of Operations
Appendix E E.Order of Operations This book describes all the performed between initial specification of vertices and final writing of fragments into the framebuffer. The chapters of this book are arranged
More informationMonday Morning. Graphics Hardware
Monday Morning Department of Computer Engineering Graphics Hardware Ulf Assarsson Skärmen består av massa pixlar 3D-Rendering Objects are often made of triangles x,y,z- coordinate for each vertex Y X Z
More informationSpring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett
Spring 2010 Prof. Hyesoon Kim AMD presentations from Richard Huddy and Michael Doggett Radeon 2900 2600 2400 Stream Processors 320 120 40 SIMDs 4 3 2 Pipelines 16 8 4 Texture Units 16 8 4 Render Backens
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationGPU Target Applications
John Montrym Henry Moreton GPU Target Applications (Graphics Processing Unit) 1 Interactive Gaming (50M units, 10M gamers) Cinematic quality rendering in real time. Digital Content Creation (DCC) (1M prof,
More informationPowerVR Performance Recommendations. The Golden Rules
PowerVR Performance Recommendations Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind. Redistribution
More informationGeForce3 OpenGL Performance. John Spitzer
GeForce3 OpenGL Performance John Spitzer GeForce3 OpenGL Performance John Spitzer Manager, OpenGL Applications Engineering jspitzer@nvidia.com Possible Performance Bottlenecks They mirror the OpenGL pipeline
More informationPowerVR Series5. Architecture Guide for Developers
Public Imagination Technologies PowerVR Series5 Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.
More informationDrawing Fast The Graphics Pipeline
Drawing Fast The Graphics Pipeline CS559 Fall 2015 Lecture 9 October 1, 2015 What I was going to say last time How are the ideas we ve learned about implemented in hardware so they are fast. Important:
More informationMobile Performance Tools and GPU Performance Tuning. Lars M. Bishop, NVIDIA Handheld DevTech Jason Allen, NVIDIA Handheld DevTools
Mobile Performance Tools and GPU Performance Tuning Lars M. Bishop, NVIDIA Handheld DevTech Jason Allen, NVIDIA Handheld DevTools NVIDIA GoForce5500 Overview World-class 3D HW Geometry pipeline 16/32bpp
More informationLecture 2. Shaders, GLSL and GPGPU
Lecture 2 Shaders, GLSL and GPGPU Is it interesting to do GPU computing with graphics APIs today? Lecture overview Why care about shaders for computing? Shaders for graphics GLSL Computing with shaders
More informationSung-Eui Yoon ( 윤성의 )
Introduction to Computer Graphics and OpenGL Graphics Hardware Sung-Eui Yoon ( 윤성의 ) Course URL: http://sglab.kaist.ac.kr/~sungeui/etri_cg/ Class Objectives Understand how GPUs have been evolved Understand
More informationChapter 10 Computation Culling with Explicit Early-Z and Dynamic Flow Control
Chapter 10 Computation Culling with Explicit Early-Z and Dynamic Flow Control Pedro V. Sander ATI Research John R. Isidoro ATI Research Jason L. Mitchell ATI Research Introduction In last year s course,
More information1.2.3 The Graphics Hardware Pipeline
Figure 1-3. The Graphics Hardware Pipeline 1.2.3 The Graphics Hardware Pipeline A pipeline is a sequence of stages operating in parallel and in a fixed order. Each stage receives its input from the prior
More informationPractical Performance Analysis Koji Ashida NVIDIA Developer Technology Group
Practical Performance Analysis Koji Ashida NVIDIA Developer Technology Group Overview Tools for the analysis Finding pipeline bottlenecks Practice identifying the problems Analysis Tools NVPerfHUD Graph
More informationThreading Hardware in G80
ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &
More informationBuilding scalable 3D applications. Ville Miettinen Hybrid Graphics
Building scalable 3D applications Ville Miettinen Hybrid Graphics What s going to happen... (1/2) Mass market: 3D apps will become a huge success on low-end and mid-tier cell phones Retro-gaming New game
More informationGeneral Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)
ME 290-R: General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing) Sara McMains Spring 2009 Performance: Bottlenecks Sources of bottlenecks CPU Transfer Processing Rasterizer
More informationNext-Generation Graphics on Larrabee. Tim Foley Intel Corp
Next-Generation Graphics on Larrabee Tim Foley Intel Corp Motivation The killer app for GPGPU is graphics We ve seen Abstract models for parallel programming How those models map efficiently to Larrabee
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More information3D buzzwords. Adding programmability to the pipeline 6/7/16. Bandwidth Gravity of modern computer systems
Bandwidth Gravity of modern computer systems GPUs Under the Hood Prof. Aaron Lanterman School of Electrical and Computer Engineering Georgia Institute of Technology The bandwidth between key components
More informationAutomatic Tuning Matrix Multiplication Performance on Graphics Hardware
Automatic Tuning Matrix Multiplication Performance on Graphics Hardware Changhao Jiang (cjiang@cs.uiuc.edu) Marc Snir (snir@cs.uiuc.edu) University of Illinois Urbana Champaign GPU becomes more powerful
Save the Nanosecond! PC Graphics Performance for the next 3 years. Richard Huddy European Developer Relations Manager ATI Technologies, Inc.
Save the Nanosecond! PC Graphics Performance for the next 3 years Richard Huddy European Developer Relations Manager ATI Technologies, Inc. A funny thing happened to me ATI is now broadly recognised and
Lecture 6: Texture. Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011)
Lecture 6: Texture Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011) Today: texturing! Texture filtering - Texture access is not just a 2D array lookup ;-) Memory-system implications
Could you make the XNA functions yourself?
1 Could you make the XNA functions yourself? For the second and especially the third assignment, you need to globally understand what's going on inside the graphics hardware. You will write shaders, which
Programmable Graphics Hardware
Programmable Graphics Hardware Outline 2/ 49 A brief Introduction into Programmable Graphics Hardware Hardware Graphics Pipeline Shading Languages Tools GPGPU Resources Hardware Graphics Pipeline 3/ 49
Real-Time Hair Simulation and Rendering on the GPU. Louis Bavoil
Real-Time Hair Simulation and Rendering on the GPU Sarah Tariq Louis Bavoil Results 166 simulated strands 0.99 Million triangles Stationary: 64 fps Moving: 41 fps 8800GTX, 1920x1200, 8XMSAA Results 166
Spring 2009 Prof. Hyesoon Kim
Spring 2009 Prof. Hyesoon Kim Application Geometry Rasterizer CPU Each stage can also be pipelined. The slowest pipeline stage determines the rendering speed. Frames per second (fps) Executes on
Graphics Hardware. Computer Graphics COMP 770 (236) Spring Instructor: Brandon Lloyd 2/26/07 1
Graphics Hardware Computer Graphics COMP 770 (236) Spring 2007 Instructor: Brandon Lloyd 2/26/07 1 From last time Texture coordinates Uses of texture maps reflectance and other surface parameters lighting
Real-World Applications of Computer Arithmetic
1 Commercial Applications Real-World Applications of Computer Arithmetic Stuart Oberman General purpose microprocessors with high performance FPUs AMD Athlon Intel P4 Intel Itanium Application specific
Graphics Hardware. Instructor Stephen J. Guy
Instructor Stephen J. Guy Overview What is a GPU Evolution of GPU GPU Design Modern Features Programmability! Programming Examples
Spring 2011 Prof. Hyesoon Kim
Spring 2011 Prof. Hyesoon Kim Application Geometry Rasterizer CPU Each stage can also be pipelined. The slowest pipeline stage determines the rendering speed. Frames per second (fps) Executes on
Optimisation. CS7GV3 Real-time Rendering
Optimisation CS7GV3 Real-time Rendering Introduction Talk about lower-level optimization Higher-level optimization is better algorithms Example: not using a spatial data structure vs. using one After that
Whiz-Bang Graphics and Media Performance for Java Platform, Micro Edition (JavaME)
Whiz-Bang Graphics and Media Performance for Java Platform, Micro Edition (JavaME) Pavel Petroshenko, Sun Microsystems, Inc. Ashmi Bhanushali, NVIDIA Corporation Jerry Evans, Sun Microsystems, Inc. Nandini
Rendering. Converting a 3D scene to a 2D image. Camera. Light. Rendering. View Plane
Rendering Pipeline Rendering Converting a 3D scene to a 2D image Rendering Light Camera 3D Model View Plane Rendering Converting a 3D scene to a 2D image Basic rendering tasks: Modeling: creating the world
Drawing Fast The Graphics Pipeline
Drawing Fast The Graphics Pipeline CS559 Spring 2016 Lecture 10 February 25, 2016 1. Put a 3D primitive in the World Modeling Get triangles 2. Figure out what color it should be Do lighting 3. Position
Optimizing Games for ATI's IMAGEON Aaftab Munshi. 3D Architect ATI Research
Optimizing Games for ATI's IMAGEON 2300 Aaftab Munshi 3D Architect ATI Research A 3D hardware solution enables publishers to extend brands to mobile devices while remaining close to original vision of
GPU Architecture. Samuli Laine NVIDIA Research
GPU Architecture Samuli Laine NVIDIA Research Today The graphics pipeline: Evolution of the GPU Throughput-optimized parallel processor design I.e., the GPU Contrast with latency-optimized (CPU-like) design
Scanline Rendering 2 1/42
Scanline Rendering 2 1/42 Review 1. Set up a Camera the viewing frustum has near and far clipping planes 2. Create some Geometry made out of triangles 3. Place the geometry in the scene using Transforms
Mattan Erez. The University of Texas at Austin
EE382V: Principles in Computer Architecture Parallelism and Locality Fall 2008 Lecture 10 The Graphics Processing Unit Mattan Erez The University of Texas at Austin Outline What is a GPU? Why should we
CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today's Contents GPU Cluster and its network topology The Roofline performance
In-Game Special Effects and Lighting
In-Game Special Effects and Lighting Introduction! Tomas Arce! Special Thanks! Matthias Wloka! Craig Galley! Stephen Broumley! Cryrus Lum! Sumie Arce! Inevitable! nvidia! Bungy What Is Per-Pixel Pixel
GoForce 3D: Coming to a Pixel Near You
GoForce 3D: Coming to a Pixel Near You CEDEC 2004 NVIDIA Actively Developing Handheld Solutions Exciting and Growing Market Fully Committed to developing World Class graphics products for the mobile Already
GRAPHICS PROCESSING UNITS
GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
Render-To-Texture Caching. D. Sim Dietrich Jr.
Render-To-Texture Caching D. Sim Dietrich Jr. What is Render-To-Texture Caching? Pixel shaders are becoming more complex and expensive Per-pixel shadows Dynamic Normal Maps Bullet holes Water simulation
Using Virtual Texturing to Handle Massive Texture Data
Using Virtual Texturing to Handle Massive Texture Data San Jose Convention Center - Room A1 Tuesday, September, 21st, 14:00-14:50 J.M.P. Van Waveren id Software Evan Hart NVIDIA How we describe our environment?
2.11 Particle Systems
2.11 Particle Systems 320491: Advanced Graphics - Chapter 2 152 Particle Systems Lagrangian method not mesh-based set of particles to model time-dependent phenomena such as snow fire smoke 320491: Advanced
Many rendering scenarios, such as battle scenes or urban environments, require rendering of large numbers of autonomous characters.
1 2 Many rendering scenarios, such as battle scenes or urban environments, require rendering of large numbers of autonomous characters. Crowd rendering in large environments presents a number of challenges,
Cornell University CS 569: Interactive Computer Graphics. Introduction. Lecture 1. [John C. Stone, UIUC] NASA. University of Calgary
Cornell University CS 569: Interactive Computer Graphics Introduction Lecture 1 [John C. Stone, UIUC] 2008 Steve Marschner 1 2008 Steve Marschner 2 NASA University of Calgary 2008 Steve Marschner 3 2008
Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games
Bringing AAA graphics to mobile platforms Niklas Smedberg Senior Engine Programmer, Epic Games Who Am I A.k.a. Smedis Platform team at Epic Games Unreal Engine 15 years in the industry 30 years of programming
DirectX 10 Performance. Per Vognsen
DirectX 10 Performance Per Vognsen Outline General DX10 API usage Designed for performance Batching and Instancing State Management Constant Buffer Management Resource Updates and Management Reading the
Efficient and Scalable Shading for Many Lights
Efficient and Scalable Shading for Many Lights 1. GPU Overview 2. Shading recap 3. Forward Shading 4. Deferred Shading 5. Tiled Deferred Shading 6. And more! First GPU Shaders Unified Shaders CUDA OpenCL
From Brook to CUDA. GPU Technology Conference
From Brook to CUDA GPU Technology Conference A 50 Second Tutorial on GPU Programming by Ian Buck Adding two vectors in C is pretty easy for (i=0; i
Dominic Filion, Senior Engineer Blizzard Entertainment. Rob McNaughton, Lead Technical Artist Blizzard Entertainment
Dominic Filion, Senior Engineer Blizzard Entertainment Rob McNaughton, Lead Technical Artist Blizzard Entertainment Screen-space techniques Deferred rendering Screen-space ambient occlusion Depth of Field
Rationale for Non-Programmable Additions to OpenGL 2.0
Rationale for Non-Programmable Additions to OpenGL 2.0 NVIDIA Corporation March 23, 2004 This white paper provides a rationale for a set of functional additions to the 2.0 revision of the OpenGL graphics
Direct Rendering of Trimmed NURBS Surfaces
Direct Rendering of Trimmed NURBS Surfaces Hardware Graphics Pipeline 2/ 81 Hardware Graphics Pipeline GPU Video Memory CPU Vertex Processor Raster Unit Fragment Processor Render Target Screen Extended
Rendering Objects. Need to transform all geometry then
Intro to OpenGL Rendering Objects Object has internal geometry (Model) Object relative to other objects (World) Object relative to camera (View) Object relative to screen (Projection) Need to transform
What's New with GPGPU?
What's New with GPGPU? John Owens Assistant Professor, Electrical and Computer Engineering Institute for Data Analysis and Visualization University of California, Davis Microprocessor Scaling is Slowing
The GPGPU Programming Model
The Programming Model Institute for Data Analysis and Visualization University of California, Davis Overview Data-parallel programming basics The GPU as a data-parallel computer Hello World Example Programming
Interactive Cloth Simulation. Matthias Wloka NVIDIA Corporation
Interactive Cloth Simulation Matthias Wloka NVIDIA Corporation MWloka@nvidia.com Overview Higher-order surfaces Vertex-shader deformations Lighting modes Per-vertex diffuse Per-pixel diffuse with bump-map
CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
Modern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions