GPUs Under the Hood
Prof. Aaron Lanterman
School of Electrical and Computer Engineering
Georgia Institute of Technology

Bandwidth: Gravity of modern computer systems
- The bandwidth between key components ultimately dictates system performance
  - Especially true for massively parallel systems processing massive amounts of data
  - Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
  - Ultimately, performance falls back to what the "feeds and speeds" dictate
- PCIe replaced AGP (Accelerated Graphics Port)
From UIUC ECE498 Lecture 6, Fall 2007; used with permission
See http://courses.engr.illinois.edu/ece498/al

3D buzzwords
- Fill rate: how fast the GPU can generate pixels; often a strong predictor of application frame rate
- Performance metrics
  - Mtris/sec: triangle rate
  - Mverts/sec: vertex rate
  - Mpixels/sec: pixel fill (write) rate
  - Mtexels/sec: texture fill (read) rate
  - Msamples/sec: antialiasing fill (write) rate
See http://courses.engr.illinois.edu/ece498/al

Adding programmability to the pipeline
[Figure: programmable graphics pipeline, with the data passed between stages in parentheses]
- 3D Application or Game -> (3D API Commands) -> 3D API: OpenGL or Direct3D
- 3D API -> (GPU Command & Data Stream, crossing the CPU/GPU boundary) -> GPU Front End
- GPU Front End -> (Vertex Index Stream) -> Primitive Assembly
- GPU Front End -> (Pre-transformed Vertices) -> Programmable Vertex Processor -> (Transformed Vertices) -> Primitive Assembly
- Primitive Assembly -> (Assembled Polygons, Lines, and Points) -> Rasterization & Interpolation
- Rasterization & Interpolation -> (Pixel Location Stream) -> Raster Operations
- Rasterization & Interpolation -> (Rasterized Pre-transformed Fragments) -> Programmable Fragment Processor -> (Transformed Fragments) -> Raster Operations
- Raster Operations -> (Pixel Updates) -> Framebuffer
See http://courses.engr.illinois.edu/ece498/al
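
To make the fill-rate metric concrete, here is a rough back-of-the-envelope sketch of how much pixel fill rate a workload demands. The formula (resolution x frame rate x average overdraw) is a common estimate rather than anything stated on the slides, and the function name and workload numbers are hypothetical.

```python
def required_fill_rate_mpixels(width, height, fps, overdraw):
    """Estimated pixel writes per second, in Mpixels/sec.

    overdraw is the assumed average number of times each screen pixel
    is written per frame (hypothetical value, depends on the scene).
    """
    return width * height * fps * overdraw / 1e6


# Hypothetical workload: 1600x1200 at 60 frames/sec, each pixel written
# three times per frame on average.
demand = required_fill_rate_mpixels(1600, 1200, 60, 3.0)
print(f"Required pixel fill rate: {demand:.1f} Mpixels/sec")
```

If the GPU's quoted Mpixels/sec figure is below this estimate, the frame rate target cannot be met regardless of how fast the other stages run.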
Shader data
- Typically floats, and vectors/matrices of floats
- Fixed-size arrays
- Three main types:
  - Per-instance data, e.g., per-vertex position
  - Per-pixel interpolated data, e.g., texture coordinates
  - Per-batch data, e.g., light position
- Data are tightly bound to the GPU

Shader flow control
- Very simple
- No recursion
- Fixed-size loops for Shader Model 2.0 or earlier
- Simple if-then-else statements allowed in the latest APIs
- Texkill (asm), clip (HLSL), or discard (GLSL) lets you abort a write to a pixel (a form of flow control)

Specialized instructions (GeForce 6)
- Dot products
- Exponential instructions: EXP, LOG
- LIT (Blinn specular lighting model calculation!)
- Reciprocal instructions: RCP (reciprocal), RSQ (reciprocal square root!)
- Trigonometric functions: SIN, COS
- Swizzling (swapping xyzw), write masking (only some of xyzw get assigned), and negation are free
From GPU Gems 2, p. 484

Vertex shader
- Transforms vertices to clip space
- Inputs:
  - Common inputs: vertex position (x, y, z, w), texture coordinates, vertex colors
  - Constant inputs
- Output goes to a pixel (fragment) shader
- The vertex shader is executed once per vertex, so it is usually less expensive than the pixel shader
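
The swizzle, write-mask, dot-product, and reciprocal-square-root operations listed above can be mimicked on the CPU to see what they compute. This is a minimal sketch with 4-component registers modelled as Python tuples; the helper names (`swizzle`, `masked_write`, `dp3`, `rsq`, `normalize3`) are made up for illustration and are not a real shader API.

```python
import math

COMPONENT_INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(register, pattern):
    """Rearrange or replicate components, e.g. pattern 'zzzz' or 'xxyz'."""
    return tuple(register[COMPONENT_INDEX[c]] for c in pattern)


def masked_write(dest, src, mask):
    """Copy only the components named in mask (e.g. 'xz') from src to dest."""
    return tuple(s if c in mask else d for c, d, s in zip("xyzw", dest, src))


def dp3(a, b):
    """Three-component dot product, like the DP3 instruction."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def rsq(x):
    """Reciprocal square root, like the RSQ instruction."""
    return 1.0 / math.sqrt(x)


def normalize3(v):
    """Normalize xyz with the DP3 / RSQ / MUL idiom."""
    scale = rsq(dp3(v, v))
    return tuple(component * scale for component in v[:3])


r0 = (1.0, 2.0, 3.0, 4.0)
print(swizzle(r0, "zzzz"))                           # replicate z into every slot
print(swizzle(r0, "xxyz"))                           # arbitrary reordering
print(masked_write((0.0, 0.0, 0.0, 0.0), r0, "xz"))  # only x and z are assigned
print(normalize3((3.0, 0.0, 4.0, 1.0)))              # unit-length xyz
```

On the hardware these rearrangements cost nothing extra because they are part of operand fetch and result write-back; the Python version only illustrates the semantics.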
Vertex shader data flow (Shader Model 3.0)
- Vertex stream: 16 vertex data registers (v0, v1, v2, ..., v15)
- 32 temporary registers (r0, r1, r2, ..., r31)
- Address register (a0) and loop register (aL)
- Constant float registers (C0, C1, C2, ..., Cn; at least 256)
- 16 constant integer registers
- 12 output registers, including:
  - oPos: position
  - oTn: texture
  - oFog: fog
  - oD0: diffuse color
  - oD1: specular color
  - oPts: output point size
- Each register is a 4-component vector register except aL

Vertex shader: logical view
[Figure: block diagram]
- Input data: per-vertex input data
- Vertex processing unit: register file (r0, r1, r2, r3), swizzle/mask unit (.rgba, .xyzw, .zzzz, .xxyz), math/logic unit (cosine, log, sine, sub, add), sampler unit, control logic
- State information: architectural state
- Resources (bound by application): start address, bound s [word lost in extraction], bound samplers, bound constants
- Output data: per-vertex output data (transformed and lit vertices)

Some uses of vertex shaders
- Transform vertices to clip space
- Pass normals and texture coordinates to the pixel shader
- Transform vectors to other spaces (e.g., texture space)
- Calculate per-vertex lighting (e.g., Gouraud shading)
- Distort geometry (waves)
Adapted from Mart Slot's presentation

Easy cross products and normalization
From Stanford CS448A: Real-Time Graphics Architectures
See graphics.stanford.edu/courses/cs448a-01-fall
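
The most common vertex shader job, transforming a vertex to clip space, amounts to four 4-component dot products against the rows of a combined matrix held in constant registers. The sketch below assumes a row-major matrix and uses a made-up matrix and vertex; `dp4` and `transform_to_clip_space` are hypothetical helper names.

```python
def dp4(a, b):
    """Four-component dot product, one per output coordinate."""
    return sum(x * y for x, y in zip(a, b))


def transform_to_clip_space(matrix_rows, position):
    """Multiply a row-major 4x4 matrix by an (x, y, z, w) position."""
    return tuple(dp4(row, position) for row in matrix_rows)


# Hypothetical combined model-view-projection matrix (not a real camera):
# scale x and y by 2, shift x by 1, and copy z into w so that the later
# divide by w makes distant points shrink.
mvp = [
    (2.0, 0.0, 0.0, 1.0),
    (0.0, 2.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0, 0.0),
]

clip = transform_to_clip_space(mvp, (1.0, 2.0, 4.0, 1.0))
print(clip)

# The divide by w happens in fixed-function hardware after the vertex
# shader; it is shown here only to complete the picture.
ndc = tuple(c / clip[3] for c in clip[:3])
print(ndc)
```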
Blinn lighting in one instruction
From Stanford CS448A: Real-Time Graphics Architectures
See graphics.stanford.edu/courses/cs448a-01-fall

Simple graphics pipeline
From Stanford CS448A: Real-Time Graphics Architectures
See graphics.stanford.edu/courses/cs448a-01-fall

Pixel (or fragment) shader (1)
- Determines each fragment's color
  - Custom (sophisticated) pixel operations
  - Texture sampling
- Inputs
  - Interpolated output from the vertex shader
  - Typically vertex position, vertex normals, texture coordinates, etc.
  - These registers can be reused for other purposes
- Output
  - Color (including alpha)
  - Depth value (optional)

Pixel (or fragment) shader (2)
- Executed once per pixel, hence typically executed many more times than a vertex shader
- It is advantageous to compute as much as possible on a per-vertex basis to improve performance
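
The "Blinn lighting in one instruction" slide refers to the LIT instruction. As a rough CPU-side sketch of what LIT computes, following the conventional definition used by the assembly shader languages: the input carries N.L, N.H, and a specular exponent, and the output carries ambient, diffuse, and specular coefficients. The function name `lit` is chosen here for illustration, and the exponent range clamping that real implementations apply is left out.

```python
def lit(n_dot_l, n_dot_h, power):
    """Return (ambient, diffuse, specular, 1.0) lighting coefficients.

    n_dot_l: dot product of the surface normal and light direction
    n_dot_h: dot product of the surface normal and half-angle vector
    power:   specular exponent (clamping omitted in this sketch)
    """
    diffuse = max(n_dot_l, 0.0)
    # Specular is suppressed when the surface faces away from the light.
    specular = max(n_dot_h, 0.0) ** power if n_dot_l > 0.0 else 0.0
    return (1.0, diffuse, specular, 1.0)


print(lit(0.5, 0.5, 2.0))    # lit surface: diffuse 0.5, specular 0.25
print(lit(-0.2, 0.9, 8.0))   # facing away: no diffuse, no specular
```

Doing the clamp, the conditional, and the power function in one instruction is what made Blinn-style specular lighting cheap in early vertex shaders.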
Pixel shader data flow (Shader Model 3.0)
- Pixel stream from the interpolator: color (diffuse/specular) and texture coordinate registers (v0, v1, ..., v9)
- Temporary registers (r0, r1, ..., r31)
- Constant registers (C0, C1, ..., Cn): 16 integer, 224 float
- Sampler registers (s0, s1, ..., s15): up to 16 texture surfaces can be read in a single pass
- Output registers: oC0 (pixel color), oDepth (depth)

Pixel shader: logical view
[Figure: block diagram]
- Input data: per-pixel input data
- Pixel processing unit: register file (r0, r1, r2, r3), swizzle/mask unit (.rgba, .xyzw, .zzzz, .xxyz), math/logic unit (cosine, log, sine, sub, add), sampler unit, control logic
- State information: architectural state
- Resources (bound by application): start address, bound s [word lost in extraction], bound samplers, bound constants
- Output data: per-pixel output data (pixel color, depth info, stencil info), written to the color buffer, depth, and stencil

Some uses of pixel shaders
- Texturing objects
- Per-pixel lighting (e.g., Phong shading)
- Normal mapping (each pixel has its own normal)
- Shadows (determine whether a pixel is shadowed or not)
- Environment mapping
Adapted from Mart Slot's presentation

Old GeForce graphics pipeline
See http://courses.engr.illinois.edu/ece498/al
Vertex cache
- Reusing vertices between primitives saves PCIe bus bandwidth and GPU computational resources
- A vertex cache attempts to exploit commonality between triangles to generate vertex reuse
- Unfortunately, many applications do not use an efficient triangle ordering
See http://courses.engr.illinois.edu/ece498/al

Texture cache
- Stores temporally local texel values to reduce bandwidth requirements
- Due to the nature of texture filtering, high degrees of efficiency are possible (75% or better hit rates)
- Reduces texture (memory) bandwidth by a factor of four for bilinear filtering (at a 75% hit rate, only one texel in four is fetched from memory)
See http://courses.engr.illinois.edu/ece498/al

Built-in texture filtering (GeForce 6)
- Pixel texturing
  - Hardware supports 2D, 3D, and cube map textures
  - Non-power-of-2 textures OK
  - Hardware handles addressing and interpolation
  - Bilinear, trilinear (3D or mipmap), anisotropic
- Vertex texturing
  - Vertex processors can access texture memory too
  - Only nearest-neighbor filtering supported in G60 hardware

ROP (Raster Operations)
- C-ROP performs frame buffer blending
  - Combinations of colors and transparency
  - Antialiasing
  - Read/modify/write the color buffer
- Z-ROP performs the Z operations
  - Determine the visible pixels
  - Discard the occluded pixels
  - Read/modify/write the Z-buffer
- ROP on GeForce also performs
  - Coalescing of transactions
  - Z-buffer compression/decompression
See http://courses.engr.illinois.edu/ece498/al
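
The effect of triangle ordering on the vertex cache can be illustrated with a small simulation. This sketch assumes a simple FIFO cache keyed by vertex index; the actual replacement policy and cache size of any particular GPU are not given in the slides, and the function name and index lists are made up.

```python
from collections import deque


def count_vertex_shader_runs(indices, cache_size):
    """Count cache misses, i.e. how many times a vertex must be fetched
    and transformed, for an index stream and a FIFO cache of cache_size."""
    cache = deque(maxlen=cache_size)  # oldest entry is evicted automatically
    misses = 0
    for index in indices:
        if index not in cache:
            misses += 1
            cache.append(index)
    return misses


# Four triangles forming a strip over vertices 0..5.
strip_order = [0, 1, 2,  1, 2, 3,  2, 3, 4,  3, 4, 5]
# The same four triangles submitted in a scattered order.
scattered_order = [0, 1, 2,  3, 4, 5,  1, 2, 3,  2, 3, 4]

print(count_vertex_shader_runs(strip_order, cache_size=3))      # each vertex processed once
print(count_vertex_shader_runs(scattered_order, cache_size=3))  # several vertices processed twice
```

Both index streams draw the same geometry, but the scattered ordering revisits vertices after they have been evicted, so the vertex processors and the bus do noticeably more work.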
The frame buffer
- The primary determinant of graphics performance other than the GPU
- The most expensive component of a graphics product other than the GPU
- Memory bandwidth is the key
- Frame buffer size also determines
  - Local texture storage
  - Maximum resolutions
  - Antialiasing resolution limits
See http://courses.engr.illinois.edu/ece498/al

Frame Buffer Interface (FBI)
- Manages reading from and writing to the frame buffer
- Perhaps the most performance-critical component of a GPU
- GeForce's FBI is a crossbar
- Independent memory controllers for 4+ independent memory banks give more efficient access to the frame buffer
See http://courses.engr.illinois.edu/ece498/al

GeForce 7800 GTX board details
[Figure: annotated board photo]
- SLI connector
- Single-slot cooling
- sVideo TV out
- DVI x 2
- 16x PCI-Express
- 256 MB / 256-bit GDDR3 at 600 MHz (8 pieces of 8Mx32)
From UIUC ECE498 Lecture 6, Fall 2007; used with permission
See http://courses.engr.illinois.edu/ece498/al

NVIDIA 7800 GTX
[Figure: chip block diagram showing the vertex processors, pixel processors, and ROPs (raster operation units)]
From www.xbitlabs.com/articles/video/display/g70-indepth.html
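
Since memory bandwidth is "the key", it is worth working out what the board figures above imply. The sketch below takes the 256-bit bus, 600 MHz clock, and 8 chips of 8Mx32 from the board slide, and assumes double-data-rate signalling (two transfers per clock), which is what the "DDR" in the memory type means. The function and variable names are hypothetical.

```python
def peak_memory_bandwidth_gb_per_s(bus_width_bits, clock_mhz, transfers_per_clock=2):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_per_clock=2 assumes double-data-rate memory.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


# Figures from the board slide: 8 chips, each 8M words x 32 bits.
chips = 8
words_per_chip = 8 * 2**20
bits_per_word = 32

bus_width_bits = chips * bits_per_word                            # all chips accessed in parallel
capacity_mb = chips * words_per_chip * bits_per_word // 8 // 2**20

bandwidth = peak_memory_bandwidth_gb_per_s(bus_width_bits, 600)
print(f"Bus width: {bus_width_bits} bits, capacity: {capacity_mb} MB")
print(f"Peak memory bandwidth: {bandwidth:.1f} GB/s")
```

The eight 32-bit chips account for both the 256-bit bus width and the 256 MB capacity. The resulting peak figure is an upper bound; sustained bandwidth depends on access patterns, which is why the crossbar memory controller matters.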
NVIDIA 7800 GTX: vertex processors
- 7800 GTX has 8 of these

NVIDIA 7800 GTX: pixel processors
- 8 MADD (multiply/add) instructions in a single cycle
- 7800 GTX has 24 of these
From http://www.xbitlabs.com/articles/video/display/g70-indepth_3.html

Modern GPUs: unified design
Slide by David Luebke from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

GeForce 8 architecture
Slide by David Luebke from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
Why unify? (1)
Slide by David Luebke from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

Why unify? (2)
Slide by David Luebke from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

Dynamic load balancing: Company of Heroes
Slide by David Luebke from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

Motivation for shader languages
- Programming powerful hardware with assembly code is hard
- Programmers need the benefits of a high-level language:
  - Easier programming
  - Easier code reuse
  - Easier debugging
  - Portability

Assembly:
    DP3 R0, c[11].xyzx, c[11].xyzx;
    RSQ R0, R0.x;
    MUL R0, R0.x, c[11].xyzx;
    MOV R1, c[3];
    MUL R1, R1.x, c[0].xyzx;
    DP3 R2, R1.xyzx, R1.xyzx;
    RSQ R2, R2.x;
    MUL R1, R2.x, R1.xyzx;
    ADD R2, R0.xyzx, R1.xyzx;
    DP3 R3, R2.xyzx, R2.xyzx;
    RSQ R3, R3.x;
    MUL R2, R3.x, R2.xyzx;
    DP3 R2, R1.xyzx, R2.xyzx;
    MAX R2, c[3].z, R2.x;
    MOV R2.z, c[3].y;
    MOV R2.w, c[3].y;
    LIT R2, R2;

High-level language:
    float3 cspecular = pow(max(0, dot(nf, H)), phongexp).xxx;
    float3 cplastic = Cd * (cambient + cdiffuse) + Cs * cspecular;

From The Cg Tutorial
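
For readers without a shader toolchain at hand, the two high-level lines can be traced on the CPU. This Python sketch reuses the variable names from the slide (`nf`, `H`, `phongexp`, `Cd`, `Cs`, `cambient`, `cdiffuse`); the wrapper function `plastic_color` and all input values are hypothetical, since the slide does not show how the inputs are computed.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def plastic_color(nf, H, phongexp, Cd, Cs, cambient, cdiffuse):
    """Evaluate the slide's cspecular and cplastic expressions per channel."""
    # pow(max(0, dot(nf, H)), phongexp).xxx : one scalar replicated to RGB
    highlight = max(0.0, dot(nf, H)) ** phongexp
    cspecular = (highlight, highlight, highlight)
    # Cd * (cambient + cdiffuse) + Cs * cspecular, component-wise
    return tuple(
        cd * (ca + cdif) + cs * sp
        for cd, ca, cdif, cs, sp in zip(Cd, cambient, cdiffuse, Cs, cspecular)
    )


# Hypothetical inputs: normal and half-angle vector aligned, red diffuse
# material, white specular color.
color = plastic_color(
    nf=(0.0, 0.0, 1.0),
    H=(0.0, 0.0, 1.0),
    phongexp=10.0,
    Cd=(1.0, 0.0, 0.0),
    Cs=(1.0, 1.0, 1.0),
    cambient=(0.1, 0.1, 0.1),
    cdiffuse=(0.5, 0.5, 0.5),
)
print(color)
```

Two readable lines replace a long run of register-level instructions, which is the slide's argument for high-level shader languages.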
Shader languages
- HLSL/Cg most common
  - Both are more-or-less compatible
- Other alternatives:
  - GLSL (for OpenGL)
  - Assembly? (not anymore)