Real-World Applications of Computer Arithmetic
Stuart Oberman

Commercial Applications
  - General-purpose microprocessors with high-performance FPUs
      - AMD Athlon
      - Intel P4
      - Intel Itanium
  - Application-specific processors
      - Digital Signal Processors
      - Graphics Processors

AMD Athlon Processor Architecture
  [Figure]

Raw FP Performance Comparison (latency / throughput, in cycles)

  Operation     Athlon    P4
  FADD          4/1       5/1
  FMUL          4/1       7/2
  FDIV (SP)     16/13     23/23
  FDIV (DP)     20/17     38/38
  FDIV (EP)     24/21     43/43
  FSQRT (SP)    19/16     23/23
  FSQRT (DP)    27/25     38/38
  FSQRT (EP)    35/32     43/43

AMD Athlon Newest Offering
  - Barton core
  - Same basic functionality as original Athlon
  - 512KB L2 cache
  - 54.3 million transistors
  - 74.3 W

Microprocessor Performance
  - DX8 game: Unreal Tournament 2003
  [Figure]
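
As a rough way to read the latency / throughput pairs in the Raw FP Performance Comparison table, the sketch below applies a simple pipeline model. It assumes the second number is the issue interval, meaning the number of cycles between the starts of successive independent operations; the slides do not spell this out. The function names and the dictionary layout are illustrative only.

```python
# (latency, issue interval) in cycles, copied from the comparison table
# (double-precision rows for divide and square root).
# Assumption: the second number is the number of cycles between the starts
# of successive independent operations.
TIMINGS = {
    "Athlon": {"FADD": (4, 1), "FMUL": (4, 1), "FDIV_DP": (20, 17), "FSQRT_DP": (27, 25)},
    "P4": {"FADD": (5, 1), "FMUL": (7, 2), "FDIV_DP": (38, 38), "FSQRT_DP": (38, 38)},
}


def independent_cycles(latency, interval, n):
    """Cycles until the last of n independent operations completes."""
    if n <= 0:
        return 0
    # The first result appears after `latency` cycles; each further
    # operation starts `interval` cycles after the previous one.
    return latency + (n - 1) * interval


def dependent_cycles(latency, n):
    """Cycles for a chain of n operations, each waiting on the previous result."""
    return max(n, 0) * latency


if __name__ == "__main__":
    for cpu, ops in TIMINGS.items():
        lat, ivl = ops["FMUL"]
        print(cpu, independent_cycles(lat, ivl, 100), dependent_cycles(lat, 100))
```

Under this model, 100 independent multiplies take 103 cycles with the Athlon figures and 205 with the P4 figures, while a fully dependent chain of 100 multiplies takes 400 and 700 cycles respectively.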
Microprocessor Performance
  - 3DMark 2001 SE
  [Figure]

3D Graphics Processing Units: Why?
  - We have software algorithms for cinematic-quality 3D graphics
      - e.g. Shrek, Monsters Inc., Toy Story
  - Problem is the rendering time per frame
      - Even with large server farms, can take hours per frame
  - Want to achieve the same quality in real time on the PC
      - 60 fps, instead of 2 hours / frame
  - Requires TREMENDOUS amounts of arithmetic computation

GPUs vs CPUs
  - More independent calculations
      - Enables wide and deep parallelism
  - API churn
      - Shorter development cycles -> ASIC
  - Blend of general- and special-purpose compute resources
  - Both transistor-bound for the foreseeable future

Special-Purpose Hardware
  - Most efficient implementations of
      - Cube environment map
      - Shadow calculations
      - Anisotropic filtering
      - Clipping
      - Rasterization
      - Log, exp, dot-product
  - More programmability won't change this

Recent History: GeForce 1&2
  - First integrated geometry engine & 4 pixels/clk
  - Fixed-function transform, lighting, and pixel pipelines
  - 25M transistors : 0.18um/6LM : 250MHz
  - 25M polygons/sec : 1G pixels/sec
  [Figure: another lighting image]

Rendering in Transition
  - Pre-2001: pixel painting
      - Image complexity and richness from LOTS of pixels
      - Each pixel derived from 1-2 textures & blending
      - Detail added by transparency and layers
  - Post-2001 fork in the road:
      - Paint more simple pixels, faster (embedded DRAM)
      - OR
      - Use Programmable Shading to render better pixels (but must reduce depth complexity)
  [Figure: examples]

A Tour of the GeForce4
  [Block diagram: Host / Front End / Vertex Processor; Pixel Shader; Texture; Pixel Engines (ROP)]

Host / Front End / Vertex Processor
  - Protocol and physical interface to PCI/AGP
  - Command ABI interpreter
  - Context switch
  - Handles persistent attributes
  - Dispatch
  - Hides latency from the programmer
  - Fixed-function modes driven by APIs
  - DMA gather
  - Multiple vector floating-point processors
  - 256 x 128 context RAM
  - 12 x 128 temp regs
  - 16 x 128 input and output

Vertex Program Examples
  - Deformation
  - Warping
  - Procedural Animation
  - Lens Effects
  - Range-based Fog
  - Elevation-based Fog
  - Animation
  - Morphing

Primitive Assembly, Setup & Rasterizer
  - Per-triangle parameter setup
  - Tile walking
  - Sample inclusion determination
  - Tiles are traversed in memory-page-friendly order
  - Interpolation
Occlusion Culling & Programmable Shading
  - Occlusion Culling reduces Depth Complexity
      - Calculate Z and determine visible pixels
      - Eliminate invisible pixels
  - Programmable Shading enables richer visual quality
      - Accurately model: reflections, shadows, materials
      - More textures/pixel
      - More calculations/pixel consumes many cycles
  - Programmable Shading impractical without Occlusion Culling

Occlusion Strategies
  - Possibilities:
      - Maintain local conservative data structure
      - Use actual depth buffer data
      - Or combine the techniques
  - A coherence problem no matter how you slice it
  - API depth test is at the far end of the pipe!
  - Must preserve semantics

Pixel Shading / Texturing
  - A pixel shader converts texture coordinates into a color using a shader program
  - Floating point math

Pixel Shader
  - Input: values interpolated across triangle
  - IEEE floating point operations
  - Texture lookups
  - Results of previous pixel shaders
  - 4 stages, 1 texture address op per stage
  - Compressed, mipmapped 3-D textures
  - True reflective bump mapping
  - True dependent textures (lookup tables)
  - Lookup functions using textures
      - Large, multi-dimensional tables
      - Filtered
  - Outputs an ARGB value that register combiners can read
  - Full 3 x 3 transform with cubemap or 3-D texture lookup
  - 16-bit-per-component normal maps

  [Figure: combiner stage with four inputs, a sum, and three outputs]
  - 1 to 8 stages, plus a final combiner
  - Up to 4 inputs from texture stages, interpolators, constant registers, earlier combiners
  - Fixed set of operations:
      - Each stage can evaluate A*B+C*D and output the result, along with A*B and C*D
      - Alternatively, each stage can evaluate dot products instead of multiplies
      - Can conditionally select A*B or C*D

Pixel Shading effects
  - Multi-texturing
  - Dot products for per-pixel lighting calculations
  - Reflections
  - Shadowing
  - Custom effects
  - Pixel math
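
The combiner stage described above (four inputs, A*B+C*D with A*B and C*D also available, an optional dot-product mode, and a conditional select) can be sketched as a small function. This is an illustrative model only, not the hardware's actual interface: the names `combiner_stage`, `use_dot`, and `select`, and the use of plain RGB triples without clamping or alpha, are assumptions made for this sketch.

```python
def mul3(a, b):
    """Component-wise product of two RGB triples."""
    return tuple(x * y for x, y in zip(a, b))


def dot3(a, b):
    """Three-component dot product, replicated to all three channels."""
    d = sum(x * y for x, y in zip(a, b))
    return (d, d, d)


def combiner_stage(a, b, c, d, use_dot=False, select=None):
    """Model one combiner stage (hypothetical interface).

    Returns (A*B, C*D, third) where `third` is A*B + C*D when select is
    None, A*B when select is True, and C*D when select is False.
    With use_dot=True the two products are dot products instead.
    """
    op = dot3 if use_dot else mul3
    ab = op(a, b)
    cd = op(c, d)
    if select is None:
        third = tuple(p + q for p, q in zip(ab, cd))
    elif select:
        third = ab
    else:
        third = cd
    return ab, cd, third


if __name__ == "__main__":
    # Per-pixel diffuse term: dot product of a surface normal and a light vector.
    normal = (0.0, 0.0, 1.0)
    light = (0.0, 0.0, 1.0)
    zero = (0.0, 0.0, 0.0)
    print(combiner_stage(normal, light, zero, zero, use_dot=True))
```

Chaining several such calls, with one stage's outputs feeding a later stage's inputs, mirrors the "earlier combiners" input source listed on the slide.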
Texture
  - Deeply pipelined cache
      - Many hits and misses in flight
  - Compression
      - 4:1 ratio
      - Palettes
      - Lossy small-grained fixed-ratio scheme
  - Filtering
      - Bilinear, trilinear, 8:1 anisotropic

Example: Level of Detail Computation
  [Figure: texel mipmap (level 0, level 1, level 2, ..., level k) with texture coordinates u, v; a pixel quad in screen space (x, y) mapped into texture space]

Anisotropic Filtering
  - Access two levels
  - n aniso samples on each level, each bilinearly interpolated
  [Figure: footprint across mipmap levels 0, 1, 2, ..., k with sample points]

Simplified LOD Computation
  - $\mathbf{u}_i = (u_i, v_i, p_i)$ (bold = vector, italic = scalar)
  - Differences/partials:
      $\mathbf{x} = (\mathbf{u}_3 + \mathbf{u}_1 - \mathbf{u}_2 - \mathbf{u}_0)/2 = \partial\mathbf{u}/\partial x$
      $\mathbf{y} = (\mathbf{u}_3 + \mathbf{u}_2 - \mathbf{u}_1 - \mathbf{u}_0)/2 = \partial\mathbf{u}/\partial y$
  - $\mathrm{major2} = \max(\mathbf{x}^2, \mathbf{y}^2)$ (squared lengths)
  - $\mathrm{lod} = \log_2(\mathrm{major2})/2$
  - $\mathrm{area} = |\mathrm{cross}(\mathbf{x}, \mathbf{y})|$
  - $\mathrm{ratio} = \mathrm{area} / \mathrm{major2}$
  [Figure: pixel quad with corners u_0, u_1, u_2, u_3, the vectors x and y, and the footprint's height and major axis]

Pixel Engines (ROP)
  - Coalesces shader pixels into memory access grain
  - Performs visibility and blending / transparency calculations
  - Balanced processing power vs bandwidth
  - Bandwidth is amplified by compression

Statistics
  - 136 Mtriangles per second
  - 4.8 Gsamples/sec
  - 1.2 Tops/sec
  - 83.2 GB/sec clear BW
  - 63M transistors
  - TSMC 0.15u
  - 300 MHz pipeline / 325 MHz memory clk
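
The Simplified LOD Computation from the Anisotropic Filtering slides can be written out directly. The sketch below follows the slide's formulas term by term; the function name `simplified_lod`, the corner ordering (u0 and u1 on the top row of the pixel quad, u2 and u3 on the bottom row, as implied by the difference formulas), and the assumption that coordinates are already scaled to texel units are choices made for this example.

```python
import math


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(p * q for p, q in zip(a, b))


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def simplified_lod(u0, u1, u2, u3):
    """Return (lod, ratio) for a 2x2 pixel quad of (u, v, p) coordinates.

    Assumes texel-unit coordinates and a non-degenerate footprint
    (major2 > 0); both are assumptions of this sketch.
    """
    # Averaged finite differences across the quad, per the slide.
    x = tuple((c3 + c1 - c2 - c0) / 2 for c0, c1, c2, c3 in zip(u0, u1, u2, u3))
    y = tuple((c3 + c2 - c1 - c0) / 2 for c0, c1, c2, c3 in zip(u0, u1, u2, u3))
    major2 = max(dot(x, x), dot(y, y))   # squared length of the longer axis
    lod = math.log2(major2) / 2          # log2 of the major-axis length
    n = cross(x, y)
    area = math.sqrt(dot(n, n))          # footprint parallelogram area
    ratio = area / major2                # equals height / major
    return lod, ratio


if __name__ == "__main__":
    # Footprint 4 texels wide and 2 texels tall.
    print(simplified_lod((0, 0, 0), (4, 0, 0), (0, 2, 0), (4, 2, 0)))
```

For the 4-by-2 texel footprint in the usage line, the major axis has squared length 16, so lod is 2.0, and the ratio of 0.5 reflects a footprint half as tall as it is long.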
GeForce FX 5800
  - Launched Comdex 02
  - 125 million transistors
  - 200 million vertices/sec
  - DDR-II 500MHz / 1GHz
  - 4 billion texels/sec
  - First generation of FP shaders, both for geometry and pixel processing (128b)