GeForce4. John Montrym Henry Moreton

Similar documents
Real-World Applications of Computer Arithmetic

GPU Target Applications

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Graphics Processing Unit Architecture (GPU Arch)

The NVIDIA GeForce 8800 GPU

Mattan Erez. The University of Texas at Austin

Portland State University ECE 588/688. Graphics Processors

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Mattan Erez. The University of Texas at Austin

Windowing System on a 3D Pipeline. February 2005

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Evolution of GPUs Chris Seitz

Spring 2011 Prof. Hyesoon Kim

CS427 Multicore Architecture and Parallel Computing

GPU Memory Model Overview

Spring 2009 Prof. Hyesoon Kim

Threading Hardware in G80

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems CMU , Fall 2014

Lecture 4: Geometry Processing. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

CS230 : Computer Graphics Lecture 4. Tamar Shinar Computer Science & Engineering UC Riverside

Tutorial on GPU Programming #2. Joong-Youn Lee Supercomputing Center, KISTI

Hardware-driven visibility culling

Optimizing Games for ATI s IMAGEON Aaftab Munshi. 3D Architect ATI Research

Getting fancy with texture mapping (Part 2) CS559 Spring Apr 2017

Scheduling the Graphics Pipeline on a GPU

Building scalable 3D applications. Ville Miettinen Hybrid Graphics

Working with Metal Overview

Hardware-driven Visibility Culling Jeong Hyun Kim

1.2.3 The Graphics Hardware Pipeline

Programming Graphics Hardware

Monday Morning. Graphics Hardware

graphics pipeline computer graphics graphics pipeline 2009 fabio pellacini 1

graphics pipeline computer graphics graphics pipeline 2009 fabio pellacini 1

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

Enhancing Traditional Rasterization Graphics with Ray Tracing. March 2015

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

3D buzzwords. Adding programmability to the pipeline 6/7/16. Bandwidth Gravity of modern computer systems

Rendering Objects. Need to transform all geometry then

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

PowerVR Hardware. Architecture Overview for Developers

GPU Memory Model. Adapted from:

A Data-Parallel Genealogy: The GPU Family Tree. John Owens University of California, Davis

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CHAPTER 1 Graphics Systems and Models 3

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Enhancing Traditional Rasterization Graphics with Ray Tracing. October 2015

Cornell University CS 569: Interactive Computer Graphics. Introduction. Lecture 1. [John C. Stone, UIUC] NASA. University of Calgary

Readings on graphics architecture for Advanced Computer Architecture class

GeForce3 OpenGL Performance. John Spitzer

Drawing Fast The Graphics Pipeline

Rasterization Overview

ECE 574 Cluster Computing Lecture 16

GPU Architecture. Michael Doggett Department of Computer Science Lund university

The Rasterization Pipeline

Enabling immersive gaming experiences Intro to Ray Tracing

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

CS4620/5620: Lecture 14 Pipeline

Real-Time Buffer Compression. Michael Doggett Department of Computer Science Lund university

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering. Bill Mark and Kekoa Proudfoot. Stanford University

2.11 Particle Systems

Performance Analysis and Culling Algorithms

Grafica Computazionale: Lezione 30. Grafica Computazionale. Hiding complexity... ;) Introduction to OpenGL. lezione30 Introduction to OpenGL

Whiz-Bang Graphics and Media Performance for Java Platform, Micro Edition (JavaME)

ECE 571 Advanced Microprocessor-Based Design Lecture 18

Course Recap + 3D Graphics on Mobile GPUs

Sung-Eui Yoon ( 윤성의 )

The Bifrost GPU architecture and the ARM Mali-G71 GPU

Graphics Hardware. Instructor Stephen J. Guy

A Data-Parallel Genealogy: The GPU Family Tree

Case 1:17-cv SLR Document 1-3 Filed 01/23/17 Page 1 of 33 PageID #: 60 EXHIBIT C

Graphics Hardware. Computer Graphics COMP 770 (236) Spring Instructor: Brandon Lloyd 2/26/07 1

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Parallel Triangle Rendering on a Modern GPU

The Graphics Pipeline

Drawing Fast The Graphics Pipeline

CS130 : Computer Graphics. Tamar Shinar Computer Science & Engineering UC Riverside

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016

Drawing Fast The Graphics Pipeline

CS452/552; EE465/505. Clipping & Scan Conversion

AMD Radeon HD 2900 Highlights

Lecture 2. Shaders, GLSL and GPGPU

Mali-400 MP: A Scalable GPU for Mobile Devices Tom Olson

Texture. Real-Time Graphics Architecture. Kurt Akeley Pat Hanrahan.

Performance OpenGL Programming (for whatever reason)

Project Gotham Racing 2 (Xbox) Real-Time Rendering. Microsoft Flighsimulator. Halflife 2

Point Sample Rendering

What s New with GPGPU?

Feeding the Beast: How to Satiate Your GoForce While Differentiating Your Game

Lecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013

CS130 : Computer Graphics Lecture 2: Graphics Pipeline. Tamar Shinar Computer Science & Engineering UC Riverside

Transcription:

GeForce4 John Montrym Henry Moreton

1 Architectural Drivers Programmability Parallelism Memory bandwidth

2 Recent History: GeForce 1&2 First integrated geometry engine & 4 pixels/clk Fixed-function transform, lighting, and pixel pipelines 25M transistors : 0.18um/6LM : 250MHz 25M polygons/sec : 1G pixels/sec another lighting image

3 Rendering in Transition Pre-2001: pixel painting Image complexity and richness from LOTS of pixels Each pixel derived from 1-2 textures & blending Detail added by transparency and layers Post-2001 fork in the road: Paint more simple pixels, faster - embedded DRAM OR Use Programmable Shading to render better pixels - but, must reduce depth complexity

Examples 4

5 GPUs vs. CPUs More independent calculations Enables wide and deep parallelism API churn shorter development cycles -> ASIC Blend of general- and special- purpose compute resources Both transistor-bound for the foreseeable future

6 Special-Purpose Hardware Most efficient implementations of Cube environment map Shadow calculations Anisotropic filtering Clipping Rasterization Log, exp, dot-product More programmability won t change this

7 Managing DRAM Bandwidth Very large working sets Affordable caches cannot support long-term reuse Target is effective streaming with local reuse Supercomputer techniques apply Latency hiding Vector operations

8 Fit the Machine to DRAM Characteristics Page locality DRAMs pages are 1-D Graphics accesses are 1-D, 2-D, 3-D in memory Page misses and read/write turns getting more expensive Granularity

9 Unified Memory Traffic comprises: Commands primitives,state Vertex data {x,y,z,w} Texture samples Depth values Colors Relative amounts vary widely Powerful programming model

10 Eliminate Redundant Traffic We cache data Textures, vertices We cache work Pixel fragments Designed for 80-90% hit rate (not 99.9%) Leverage coherence Engines traverse locally ex: rasterization order

11 Amplify Peak Bandwidth Lossless compression Lossy compression Conservative (undetectable) Else application must be given the choice Small-grained random access favors fixed compression atoms

12 Reduce Dead Cycles Queuing Amortize read/write turns over longer runs Multi-bank DRAM scheduling

13 Embedded DRAM Tempting to embed megabytes of DRAM. But.. Cannot fit the whole problem Costs are huge Will be just around the corner for a long time

14 A Tour of the GeForce4 process commands convert to FP Host / Front End / Vertex Processor transform vertices to screen-space Transform and Lighting generate pixels delete pixels that cannot be seen determine the colors, transparencies and depth of the pixel Primitive Assembly, Setup & Rasterizer Occlusion Culler Pixel Shader Texture Frame Buffer Controller combine colors and transparencies to produce final output RGBA Register Combiners do final hidden surface test, blend and write out color and new depth Pixel Engines (ROP)

Host / Front End / Vertex Processor Protocol and physical interface to PCI/AGP Command ABI interpreter Context switch DMA gather Host / Front End / Vertex Processor Transform and Lighting Primitive Assembly, Setup & Rasterizer Occlusion Culler Pixel Shader Texture Register Combiners Pixel Engines (ROP) Frame Buffer Controller 15

Transform and Lighting Handles persistent attributes Dispatch Hides latency from the programmer Host / Front End / Vertex Processor Transform and Lighting Primitive Assembly, Setup & Rasterizer Occlusion Culler Pixel Shader Texture Register Combiners Pixel Engines (ROP) Fixed-function modes driven by APIs Multiple vector floating point processors 256 x 128 context RAM 12 x 128 temp regs 16 x128 input and output Frame Buffer Controller 16

17 Vertex Program Examples Deformation Warping Procedural Animation Lens Effects Range-based Fog Elevation-based Fog Animation Morphing Interpolation

Primitive Assembly, Setup & Rasterizer Per-triangle parameter setup Tile walking Sample inclusion determination Host / Front End / Vertex Processor Transform and Lighting Primitive Assembly, Setup & Rasterizer Occlusion Culler Pixel Shader Tiles are traversed in memory page friendly order Texture Register Combiners Pixel Engines (ROP) Frame Buffer Controller 18

Occlusion Culling & Programmable Shading 19 Occlusion Culling reduces Depth Complexity Calculate Z and determine visible pixels Eliminate invisible pixels Programmable Shading enables richer visual quality Accurately model: reflections, shadows, materials More textures/pixel More calculations/pixel consumes many cycles Programmable Shading impractical without Occlusion Culling

Host / Front End / Vertex Processor 20 Occlusion Strategies Possibilities: Maintain local conservative data structure Use actual depth buffer data Or combine the techniques A coherence problem no matter how you slice it. API depth test is at the far end of the pipe! Must preserve semantics Transform and Lighting Primitive Assembly, Setup & Rasterizer Occlusion Culler Pixel Shader Texture Register Combiners Pixel Engines (ROP) Frame Buffer Controller

Pixel Shading / Texturing A pixel shader converts texture coordinates into a color using a shader program. Floating point math Texture lookups Results of previous pixel shaders 4 stages, 1 texture address op per stage Compressed, mipmapped 3-D textures True reflective bump mapping True dependent textures (lookup tables) Host / Front End / Vertex Processor Transform and Lighting Primitive Assembly, Setup & Rasterizer Occlusion Culler Pixel Shader Texture Register Combiners Pixel Engines (ROP) Full 3 3 transform with cubemap or 3-D texture lookup 16-bit-per-component normal maps Frame Buffer Controller 21

Host / Front End / Vertex Processor 22 Pixel Shader Input: values interpolated across triangle IEEE floating point operations Lookup functions using textures Large, multi-dimensional tables Filtered Outputs an ARGB value that register combiners can read Transform and Lighting Primitive Assembly, Setup & Rasterizer Occlusion Culler Pixel Shader Texture Register Combiners Pixel Engines (ROP) Frame Buffer Controller

Host / Front End / Vertex Processor 23 Register Combiners Input Input Input Input A*B A B Sum A*B A B Transform and Lighting Primitive Assembly, Setup & Rasterizer Occlusion Culler Pixel Shader Texture Register Combiners Pixel Engines (ROP) Frame Buffer Controller Output Output Output 1 8 stages, plus a final combiner Up to 4 inputs from texture stages, interpolators, constant registers, earlier combiners Fixed set of operations: Each stage can evaluate A*B+C*D and output result, along with A*B, C*D Alternatively, each stage can evaluate dot products instead of multiplies Can conditionally select A*B or C*D

24 Pixel Shading effects Multi-texturing Dot products for per pixel lighting calculations Reflections Shadowing Custom effects Pixel math

Host / Front End / Vertex Processor 25 Texture Deeply pipelined cache Many hits and misses in flight Compression 4:1 ratio Palettes Transform and Lighting Primitive Assembly, Setup & Rasterizer Occlusion Culler Pixel Shader Texture Register Combiners Pixel Engines (ROP) Lossy small-grained fixed ratio scheme Filtering Bilinear, tri-linear, 8:1 anisotropic Frame Buffer Controller

Host / Front End / Vertex Processor 26 Pixel Engines (ROP) Coalesces shader pixels into memory access grain Performs visibility and blending / transparency calculations Balanced processing power vs. bandwidth Bandwidth is amplified by compression Transform and Lighting Primitive Assembly, Setup & Rasterizer Occlusion Culler Pixel Shader Texture Register Combiners Pixel Engines (ROP) Frame Buffer Controller

27 Multisample Antialiasing Transparent to the application 2 & 4 subsamples per pixel 2, 4, 5, and 9-tap reconstruction filters

28 nview Display Technology Flexible display combinations + + + + + Intuitive user interface Easy set-up Application Management Multi Desktop Support Window Management Window Effects Custom User Profiles

Host / Front End / Vertex Processor 29 Framebuffer Controller 128-pin DDR Schedules requests from all engines Transparent compress/decompress Maps from pixel-linear address to page & partition tiling Flexible in: Width Depth Frequency Banks Transform and Lighting Primitive Assembly, Setup & Rasterizer Occlusion Culler Pixel Shader Texture Register Combiners Pixel Engines (ROP) Frame Buffer Controller

30 Statistics 136M vertices per second 60M triangles per second 4.8G samples/sec 1.2T ops/sec 83.2 GB/sec clear BW 63M transistors TSMC 0.15u 300 MHz pipeline / 325 MHz memory clk

Conclusions 31

Questions 32

33 Related reading Stanford special topics course Real-Time Graphics Architectures with Kurt Akeley & Pat Hanrahan Reading list: http://graphics.stanford.edu/courses/cs448a-01-fall/readings.html