Real-World Applications of Computer Arithmetic

Stuart Oberman

Commercial Applications
  General-purpose microprocessors with high-performance FPUs
    AMD Athlon
    Intel P4
    Intel Itanium
  Application-specific processors
    Digital Signal Processors
    Graphics Processors

AMD Athlon Processor Architecture

Raw FP Performance Comparison
  Latency / throughput (cycles)

  Operation     Athlon    P4
  FADD          4/1       5/1
  FMUL          4/1       7/2
  FDIV (SP)     16/13     23/23
  FDIV (DP)     20/17     38/38
  FDIV (EP)     24/21     43/43
  FSQRT (SP)    19/16     23/23
  FSQRT (DP)    27/25     38/38
  FSQRT (EP)    35/32     43/43

AMD Athlon Newest Offering
  Barton core
  Same basic functionality as original Athlon
  512KB L2 cache
  54.3 million transistors
  74.3W

Microprocessor Performance
  DX8-Game: Unreal Tournament 2003

Microprocessor Performance
  3DMark 2001 SE

3D Graphics Processing Units: Why?
  We have software algorithms for cinematic-quality 3D graphics
    e.g. Shrek, Monsters Inc, Toy Story
  Problem is the rendering time per frame
    Even with large server farms, can take hours per frame
  Want to achieve same quality in real time on the PC
    60 fps, instead of 2 hours / frame
  Requires TREMENDOUS amounts of arithmetic computation

GPUs vs CPUs
  More independent calculations
    Enables wide and deep parallelism
  API churn, shorter development cycles -> ASIC
  Blend of general- and special-purpose compute resources
  Both transistor-bound for the foreseeable future

Special-Purpose Hardware
  Most efficient implementations of
    Cube environment map
    Shadow calculations
    Anisotropic filtering
    Clipping
    Rasterization
    Log, exp, dot-product
  More programmability won't change this

Recent History: GeForce 1&2
  First integrated geometry engine & 4 pixels/clk
  Fixed-function transform, lighting, and pixel pipelines
  25M transistors : 0.18um/6LM : 250MHz
  25M polygons/sec : 1G pixels/sec

Rendering in Transition
  Pre-2001: pixel painting
    Image complexity and richness from LOTS of pixels
    Each pixel derived from 1-2 textures & blending
    Detail added by transparency and layers
  Post-2001 fork in the road:
    Paint more simple pixels, faster - embedded DRAM
    OR
    Use Programmable Shading to render better pixels - but, must reduce depth complexity
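Two of the figures quoted on the slides above can be checked with simple arithmetic: the gap between "2 hours / frame" and "60 fps", and the GeForce 1&2 fill rate implied by 4 pixels per clock at 250MHz. The following is a minimal sketch using only those quoted numbers; the function and variable names are illustrative and do not come from the slides.

```python
# Back-of-the-envelope checks using only figures quoted on the slides.

OFFLINE_SECONDS_PER_FRAME = 2 * 60 * 60   # "2 hours / frame"
REALTIME_FRAMES_PER_SECOND = 60           # "60 fps"


def required_speedup(offline_seconds_per_frame, target_fps):
    """Factor by which per-frame rendering time must shrink to hit target_fps."""
    return offline_seconds_per_frame * target_fps


def fill_rate(pixels_per_clock, clock_hz):
    """Peak pixels per second for a pipeline emitting pixels_per_clock each cycle."""
    return pixels_per_clock * clock_hz


speedup = required_speedup(OFFLINE_SECONDS_PER_FRAME, REALTIME_FRAMES_PER_SECOND)
geforce_fill = fill_rate(4, 250_000_000)  # 4 pixels/clk at 250MHz

print(f"required speedup: {speedup:,}x")
print(f"peak fill rate:   {geforce_fill:,} pixels/sec")
```

Under these numbers the required speedup is 432,000x, and the peak fill rate matches the "1G pixels/sec" stated on the GeForce 1&2 slide.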

Examples

A Tour of the GeForce4
  [Pipeline diagram: Host / Front End / Vertex Processor, Pixel Shader, Texture, Pixel Engines (ROP)]

Host / Front End / Vertex Processor
  Protocol and physical interface to PCI/AGP
  Command ABI interpreter
  Context switch
  Handles persistent attributes
  Dispatch
  Hides latency from the programmer
  Fixed-function modes driven by APIs
  DMA gather
  Multiple vector floating point processors
  256 x 128 context RAM
  12 x 128 temp regs
  16 x 128 input and output

Vertex Program Examples
  Deformation
  Warping
  Procedural Animation
  Lens Effects
  Range-based Fog
  Elevation-based Fog
  Animation
  Morphing

Primitive Assembly, Setup & Rasterizer
  Per-triangle parameter setup
  Tile walking
  Sample inclusion determination
  Tiles are traversed in memory-page-friendly order
  Interpolation
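The rasterizer slide lists "sample inclusion determination" without saying how it is done. One widely used formulation is the edge-function test, sketched below. This is an assumption chosen for illustration, not a description of the GeForce4 hardware; the names `edge`, `sample_inside` and `covered_samples` are hypothetical, and tie-breaking fill rules for samples lying exactly on shared edges are not modeled.

```python
def edge(a, b, p):
    """Signed area term: positive when p lies to the left of the edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def sample_inside(v0, v1, v2, p):
    """True when sample p is inside or on the boundary of triangle v0 v1 v2."""
    e0 = edge(v0, v1, p)
    e1 = edge(v1, v2, p)
    e2 = edge(v2, v0, p)
    # Accept either winding order: all three terms must share a sign.
    all_non_negative = e0 >= 0 and e1 >= 0 and e2 >= 0
    all_non_positive = e0 <= 0 and e1 <= 0 and e2 <= 0
    return all_non_negative or all_non_positive


def covered_samples(v0, v1, v2, width, height):
    """Test one sample at the center of every pixel in a width x height tile."""
    covered = []
    for y in range(height):
        for x in range(width):
            center = (x + 0.5, y + 0.5)
            if sample_inside(v0, v1, v2, center):
                covered.append((x, y))
    return covered


triangle = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
pixels = covered_samples(*triangle, width=4, height=4)
print(len(pixels), "of 16 pixel centers covered")
```

Because the three edge terms are linear in the sample position, they can be updated incrementally from one sample or tile to the next, which is one reason this formulation suits a tile-based traversal.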

Occlusion Culling & Programmable Shading
  Occlusion Culling reduces Depth Complexity
    Calculate Z and determine visible pixels
    Eliminate invisible pixels
  Programmable Shading enables richer visual quality
    Accurately model: reflections, shadows, materials
    More textures/pixel
    More calculations/pixel consumes many cycles
  Programmable Shading impractical without Occlusion Culling

Occlusion Strategies
  Possibilities:
    Maintain local conservative data structure
    Use actual depth buffer data
    Or combine the techniques
  A coherence problem no matter how you slice it
    API depth test is at the far end of the pipe!
    Must preserve semantics

Pixel Shading / Texturing
  A pixel shader converts texture coordinates into a color using a shader program
  Floating point math

Pixel Shader
  Input: values interpolated across triangle
  IEEE floating point operations
  Texture lookups
  Results of previous pixel shaders
  4 stages, 1 texture address op per stage
  Compressed, mipmapped 3-D textures
  True reflective bump mapping
  True dependent textures (lookup tables)
  Lookup functions using textures
    Large, multi-dimensional tables
    Filtered
  Outputs an ARGB value that register combiners can read
  Full 3 x 3 transform with cubemap or 3-D texture lookup
  16-bit-per-component normal maps

  [Diagram: register combiner stage with four inputs, a sum, and three outputs]
  1-8 stages, plus a final combiner
  Up to 4 inputs from texture stages, interpolators, constant registers, earlier combiners
  Fixed set of operations:
    Each stage can evaluate A*B+C*D and output the result, along with A*B and C*D
    Alternatively, each stage can evaluate dot products instead of multiplies
    Can conditionally select A*B or C*D

Pixel Shading Effects
  Multi-texturing
  Dot products for per-pixel lighting calculations
  Reflections
  Shadowing
  Custom effects
  Pixel math
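The combiner operations listed above (A*B+C*D with A*B and C*D also available, dot products in place of multiplies, and a conditional select) can be modeled in a few lines. The following is a minimal sketch of one stage operating on RGB triples. The function names and the `select` parameter are hypothetical, the condition that drives the select in hardware is not given on the slide, and range clamping and input remapping are not modeled.

```python
def dot3(a, b):
    """Three-component dot product."""
    return sum(x * y for x, y in zip(a, b))


def product(a, b, use_dot):
    """Component-wise multiply, or a dot product replicated to all channels."""
    if use_dot:
        d = dot3(a, b)
        return (d, d, d)
    return tuple(x * y for x, y in zip(a, b))


def combiner_stage(a, b, c, d, dot_ab=False, dot_cd=False, select=None):
    """Model one combiner stage.

    Returns (A*B, C*D, third), where third is A*B + C*D when select is None,
    A*B when select is True, and C*D when select is False.
    """
    ab = product(a, b, dot_ab)
    cd = product(c, d, dot_cd)
    if select is None:
        third = tuple(p + q for p, q in zip(ab, cd))
    else:
        third = ab if select else cd
    return ab, cd, third


# Per-pixel diffuse lighting: N . L in the A*B slot, plus a constant ambient
# term formed as C*D.
normal = (0.0, 0.0, 1.0)
light = (0.0, 0.6, 0.8)
ambient = (0.25, 0.25, 0.25)
white = (1.0, 1.0, 1.0)
n_dot_l, amb, lit = combiner_stage(normal, light, ambient, white, dot_ab=True)
print("N.L =", n_dot_l[0], "lit =", lit)
```

The usage lines illustrate the "dot products for per-pixel lighting calculations" effect: the dot product fills the A*B slot while the C*D slot carries an ambient term, and the stage sums the two.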

Texture
  Deeply pipelined cache
    Many hits and misses in flight
  Compression
    4:1 ratio
    Palettes
    Lossy small-grained fixed-ratio scheme
  Filtering
    Bilinear, trilinear, 8:1 anisotropic

Example: Level of Detail Computation
  [Figure: mipmap levels 0, 1, 2, ..., k in texture space (u, v); a pixel quad in screen space (x, y) mapped onto texels]

Anisotropic Filtering
  Access two levels
  n aniso samples on each level, each bilinearly interpolated
  [Figure: footprint across mipmap levels 0, 1, 2, ..., k]

Simplified LOD Computation
  $\mathbf{u}_i = (u_i, v_i, p_i)$ (bold = vector, italic = scalar)
  Differences/partials:
    $\mathbf{x} = (\mathbf{u}_3 + \mathbf{u}_1 - \mathbf{u}_2 - \mathbf{u}_0)/2 = \partial\mathbf{u}/\partial x$
    $\mathbf{y} = (\mathbf{u}_3 + \mathbf{u}_2 - \mathbf{u}_1 - \mathbf{u}_0)/2 = \partial\mathbf{u}/\partial y$
  $\mathit{major2} = \max(|\mathbf{x}|^2, |\mathbf{y}|^2)$
  $\mathit{lod} = \log_2(\mathit{major2})/2$
  $\mathit{area} = |\mathrm{cross}(\mathbf{x}, \mathbf{y})|$
  $\mathit{ratio} = \mathit{area} / \mathit{major2}$
  [Figure: pixel quad $\mathbf{u}_0, \mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3$ with axes x and y; footprint labeled with height and major]

Pixel Engines (ROP)
  Coalesces shader pixels into memory access grain
  Performs visibility and blending / transparency calculations
  Balanced processing power vs bandwidth
  Bandwidth is amplified by compression

Statistics
  136 Mtriangles per second
  4.8 Gsamples/sec
  1.2 Tops/sec
  83.2 GB/sec clear BW
  63M transistors
  TSMC 0.15u
  300 MHz pipeline / 325 MHz memory clk
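The simplified LOD computation from the anisotropic filtering slides can be written out directly. This is a sketch that follows the slide's formulas as reconstructed here (in particular, it assumes major2 uses squared vector lengths and area is the magnitude of the cross product). The function name `lod_terms` and the helper names are illustrative, and degenerate quads with major2 equal to zero are not handled.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def lod_terms(u0, u1, u2, u3):
    """Return (lod, area, ratio) for a 2x2 pixel quad.

    Each u_i is the (u, v, p) texture coordinate of one pixel, with u0/u1 on
    the top row and u2/u3 on the bottom row of the quad.
    """
    # Averaged finite differences across the quad in screen x and screen y.
    x = tuple((c3 + c1 - c2 - c0) / 2 for c0, c1, c2, c3 in zip(u0, u1, u2, u3))
    y = tuple((c3 + c2 - c1 - c0) / 2 for c0, c1, c2, c3 in zip(u0, u1, u2, u3))
    major2 = max(dot(x, x), dot(y, y))   # squared length of the longer axis
    lod = math.log2(major2) / 2          # equals log2 of the major axis length
    normal = cross(x, y)
    area = math.sqrt(dot(normal, normal))
    ratio = area / major2
    return lod, area, ratio


# A footprint stretched 4 texels along u and 1 texel along v.
lod, area, ratio = lod_terms((0.0, 0.0, 0.0), (4.0, 0.0, 0.0),
                             (0.0, 1.0, 0.0), (4.0, 1.0, 0.0))
print(lod, area, ratio)
```

Since the parallelogram area is the major axis length times the perpendicular height, `ratio` works out to height / major, so small values indicate a strongly anisotropic footprint. Halving the logarithm of the squared length avoids taking a square root to obtain the LOD.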

GeForce FX 5800
  Launched Comdex '02
  125 million transistors
  200 million vertices/sec
  DDR-II 500MHz / 1GHz
  4 billion texels/sec
  First generation of FP shaders, both for geometry and pixel processing (128b)