3D buzzwords. Adding programmability to the pipeline 6/7/16. Bandwidth Gravity of modern computer systems


GPUs Under the Hood
Prof. Aaron Lanterman
School of Electrical and Computer Engineering
Georgia Institute of Technology

Bandwidth: Gravity of modern computer systems
- The bandwidth between key components ultimately dictates system performance
- Especially true for massively parallel systems processing massive amounts of data
- Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
- Ultimately, performance falls back to what the "feeds and speeds" dictate
- PCIe replaced AGP (Accelerated Graphics Port)
From UIUC ECE498 Lecture 6, Fall 2007; used with permission. See http://courses.engr.illinois.edu/ece498/al

3D buzzwords
- Fill rate: how fast the GPU can generate pixels; often a strong predictor of application frame rate
- Performance metrics:
    Mtris/sec - Triangle Rate
    Mverts/sec - Vertex Rate
    Mpixels/sec - Pixel Fill (Write) Rate
    Mtexels/sec - Texture Fill (Read) Rate
    Msamples/sec - Antialiasing Fill (Write) Rate

Adding programmability to the pipeline
[Figure: programmable graphics pipeline. 3D Application or Game -> 3D API commands -> 3D API: OpenGL or Direct3D -> (CPU-GPU boundary) GPU command and data stream -> GPU Front End -> vertex index stream and pre-transformed vertices -> Programmable Vertex Processor -> transformed vertices -> Primitive Assembly -> assembled polygons, lines, and points -> Rasterization & Interpolation -> pixel location stream and rasterized pre-transformed fragments -> Programmable Fragment Processor -> transformed fragments -> Raster Operations -> pixel updates -> Frame buffer]
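As a rough illustration of why fill rate tends to predict frame rate, the sketch below relates the two with simple arithmetic. The function names, the resolution, and the overdraw factor (average number of times each screen pixel is written per frame) are assumptions chosen for the example, not figures from the slides.

```python
def required_fill_rate_mpixels(width, height, fps, overdraw):
    """Pixel writes per second (in Mpixels/sec) needed to hit a target frame rate."""
    return width * height * fps * overdraw / 1e6


def max_frame_rate(fill_rate_mpixels, width, height, overdraw):
    """Upper bound on frames per second if pixel fill were the only limit."""
    return fill_rate_mpixels * 1e6 / (width * height * overdraw)


# Hypothetical workload: 1600x1200 at 60 frames/sec with each pixel written 3 times.
needed = required_fill_rate_mpixels(1600, 1200, 60, 3.0)
print(f"needed fill rate: {needed:.1f} Mpixels/sec")
print(f"frame rate at that fill rate: {max_frame_rate(needed, 1600, 1200, 3.0):.1f}")
```

The same pattern applies to the other metrics (triangle rate, vertex rate, texture fill): divide the hardware rate by the per-frame workload to get a ceiling on frame rate, then take the lowest ceiling.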

Shader data
- Typically floats, and vectors/matrices of floats
- Fixed-size arrays
- Three main types:
    Per-instance data, e.g., per-vertex position
    Per-pixel interpolated data, e.g., texture coordinates
    Per-batch data, e.g., light position
- Data are tightly bound to the GPU

Shader flow control
- Very simple
- No recursion
- Fixed-size loops for Shader Model 2.0 or earlier
- Simple if-then-else statements allowed in the latest APIs
- Texkill (asm), clip (HLSL), or discard (GLSL) lets you abort a write to a pixel (a form of flow control)

Specialized instructions (GeForce 6)
- Dot products
- Exponential instructions: EXP, LOG
- LIT (Blinn specular lighting model calculation!)
- Reciprocal instructions: RCP (reciprocal), RSQ (reciprocal square root!)
- Trigonometric functions: SIN, COS
- Swizzling (swapping xyzw), write masking (only some of xyzw get assigned), and negation are free

Vertex shader
- Transform to clip space
- Inputs:
    Common inputs: vertex position (x, y, z, w), texture coordinates, vertex colors
    Constant inputs
- Output to a pixel (fragment) shader
- The vertex shader is executed once per vertex, so it is usually less expensive than the pixel shader
From GPU Gems 2, p. 484
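The "transform to clip space" step of a vertex shader amounts to multiplying each vertex position (x, y, z, w) by a 4x4 matrix supplied as a constant input. The following is a minimal CPU-side sketch of that idea, not shader code from the slides; the function names are hypothetical, and the projection matrix assumes the OpenGL-style convention (camera looking down -z, normalized depth in [-1, 1]).

```python
import math


def mat_vec4(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component column vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))


def perspective(fov_y_deg, aspect, near, far):
    """OpenGL-style perspective projection matrix (an assumed convention)."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]


def vertex_shader(position, model_view_proj):
    """Runs once per vertex: per-vertex input in, clip-space position out."""
    return mat_vec4(model_view_proj, position)


# A point on the near plane should land at normalized depth -1 after the divide by w.
proj = perspective(90.0, 1.0, 1.0, 100.0)
clip = vertex_shader((0.0, 0.0, -1.0, 1.0), proj)
print(clip, "depth after perspective divide:", clip[2] / clip[3])
```

Here the matrix plays the role of per-batch constant data and the position plays the role of per-vertex data, matching the data types listed earlier.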

Vertex shader data flow (3.0)
[Figure: vertex shader 3.0 register model. Input: vertex stream into 16 vertex data registers (v0 to v15). Working storage: 32 temporary registers (r0 to r31), address register a0, loop register aL. Constants: constant float registers C0 to Cn (at least 256) and 16 constant integer registers. Output: 12 output registers, including oPos (position), oTn (texture), oFog (fog), oD0 (diffuse color), oD1 (specular color), and oPts (output point size). Each register is a 4-component vector register except aL.]

Vertex shader: logical view
[Figure: per-vertex input data enters a vertex processing unit made up of a register file (r0, r1, r2, r3, ...), a swizzle/mask unit (.rgba, .xyzw, .zzzz, .xxyz), a math/logic unit (add, sub, sine, cosine, log), a sampler unit, and control logic. Resources bound by the application: start address, bound textures, bound samplers, bound constants. The unit emits per-vertex output data (transformed and lit vertices). Legend: input data, output data, state information, architectural state.]

Some uses of vertex shaders
- Easy cross products and normalization
- Transform vertices to clip space
- Pass normals and texture coordinates to the pixel shader
- Transform vectors to other spaces (e.g., texture space)
- Calculate per-vertex lighting (e.g., Gouraud shading)
- Distort geometry (waves)
From Stanford CS448A: Real-Time Graphics Architectures. See graphics.stanford.edu/courses/cs448a-01-fall
Adapted from Mart Slot's presentation
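The swizzle/mask unit in the logical view can be mimicked on the CPU to show what source swizzles such as .zzzz or .xxyz and destination write masks do. This is only an illustrative sketch: the helper names are hypothetical, and the rule that a short swizzle repeats its last component is an assumption based on common shader-assembly conventions rather than something stated in the slides.

```python
COMPONENT_INDEX = {"x": 0, "y": 1, "z": 2, "w": 3, "r": 0, "g": 1, "b": 2, "a": 3}


def swizzle(reg, pattern):
    """Rearrange or replicate components of a 4-vector, e.g. 'zzzz' or 'xxyz'.

    Assumed convention: a pattern shorter than four letters repeats its last letter.
    """
    padded = pattern + pattern[-1] * (4 - len(pattern))
    return tuple(reg[COMPONENT_INDEX[c]] for c in padded)


def masked_write(dest, value, mask="xyzw"):
    """Copy only the masked components of value into dest; leave the rest alone."""
    out = list(dest)
    for c in mask:
        out[COMPONENT_INDEX[c]] = value[COMPONENT_INDEX[c]]
    return tuple(out)


def negate(reg):
    """Source negation, applied to every component."""
    return tuple(-c for c in reg)


c3 = (0.0, 1.0, 2.0, 3.0)
r2 = (9.0, 9.0, 9.0, 9.0)
print(swizzle(c3, "xxyz"))
# Mirrors the form "MOV R2.z, c[3].y": replicate c3.y, write only the z component.
print(masked_write(r2, swizzle(c3, "y"), "z"))
```

On the hardware described in these slides the operations are free because they are folded into operand fetch and result write-back; the Python version only models the data movement.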

Blinn lighting in one instruction
Simple graphics pipeline
[Figures from Stanford CS448A: Real-Time Graphics Architectures. See graphics.stanford.edu/courses/cs448a-01-fall]

Pixel (or fragment) shader (1)
- Determines each fragment's color
    Custom (sophisticated) pixel operations
    Texture sampling
- Inputs
    Interpolated output from the vertex shader
    Typically vertex position, vertex normals, texture coordinates, etc.
    These registers can be reused for other purposes
- Output
    Color (including alpha)
    Depth value (optional)

Pixel (or fragment) shader (2)
- Executed once per pixel, hence typically executed many more times than a vertex shader
- It is advantageous to compute what you can on a per-vertex basis to improve performance
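The advice to move work from the pixel shader to the vertex shader can be checked with a simple operation count. The sketch below uses made-up numbers (a 10,000-vertex mesh covering a 1280x720 screen, and arbitrary instruction counts); only the once-per-vertex and once-per-pixel execution model comes from the slides.

```python
def shading_cost(vertex_count, pixel_count, ops_per_vertex, ops_per_pixel):
    """Total shader operations for one draw: each stage runs once per element."""
    return vertex_count * ops_per_vertex + pixel_count * ops_per_pixel


VERTICES = 10_000      # hypothetical mesh size
PIXELS = 1280 * 720    # hypothetical number of shaded pixels

# Same total work per element, but 5 operations are moved from pixel to vertex stage.
all_per_pixel = shading_cost(VERTICES, PIXELS, ops_per_vertex=20, ops_per_pixel=15)
moved_to_vertex = shading_cost(VERTICES, PIXELS, ops_per_vertex=25, ops_per_pixel=10)

print("operations saved per frame:", all_per_pixel - moved_to_vertex)
```

The saving holds only when the moved quantity survives interpolation across the triangle; values that vary nonlinearly between vertices, such as specular highlights, lose accuracy when computed per vertex.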

Pixel shader data flow (3.0)
[Figure: pixel shader 3.0 register model. Input: pixel stream into color (diffuse/specular) and texture coordinate registers v0 to v9. Working storage: temporary registers r0 to r31. Constants: constant registers C0 to Cn (16 INT, 224 float). Sampler registers s0 to s15 (up to 16 texture surfaces can be read in a single pass). Output: oC0 (pixel color) and oDepth (depth).]

Pixel shader: logical view
[Figure: an interpolator supplies per-pixel input data to a pixel processing unit made up of a register file (r0, r1, r2, r3, ...), a swizzle/mask unit (.rgba, .xyzw, .zzzz, .xxyz), a math/logic unit (add, sub, sine, cosine, log), a sampler unit, and control logic. Resources bound by the application: start address, bound textures, bound samplers, bound constants. The unit emits per-pixel output data (pixel color, depth info, stencil info) to the color buffer and depth/stencil buffer. Legend: input data, output data, state information, architectural state.]
Adapted from Mart Slot's presentation

Some uses of pixel shaders
- Texturing objects
- Per-pixel lighting (e.g., Phong shading)
- Normal mapping (each pixel has its own normal)
- Shadows (determine whether a pixel is shadowed or not)
- Environment mapping

Old GeForce graphics pipeline
[Figure: block diagram of the older GeForce pipeline, including the VS/T&L stage]

Vertex cache
- Reusing vertices between primitives saves PCIe bus bandwidth and GPU computational resources
- A vertex cache attempts to exploit commonality between triangles to generate vertex reuse
- Unfortunately, many applications do not use efficient triangle ordering

Texture cache
- Stores temporally local texel values to reduce bandwidth requirements
- Due to the nature of texture filtering, high degrees of efficiency are possible (75% or better hit rates)
- Reduces texture (memory) bandwidth by a factor of four for bilinear filtering

Built-in texture filtering (GeForce 6)
- Pixel texturing
    Hardware supports 2D, 3D, and cube map textures
    Non-power-of-2 textures OK
    Hardware handles addressing and interpolation: bilinear, trilinear (3D or mipmap), anisotropic
- Vertex texturing
    Vertex processors can access texture memory too
    Only nearest-neighbor filtering supported in G60 hardware

ROP (Raster Operations)
- C-ROP performs frame buffer blending
    Combinations of colors and transparency
    Antialiasing
    Read/modify/write the color buffer
- Z-ROP performs the Z operations
    Determine the visible pixels
    Discard the occluded pixels
    Read/modify/write the Z-buffer
- ROP on GeForce also performs
    Coalescing of transactions
    Z-buffer compression/decompression
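Returning to the vertex cache: the effect of triangle ordering can be seen by simulating a small cache over an index list. The sketch below assumes a FIFO replacement policy and a four-entry cache purely for illustration; the slides do not say how the real hardware cache is organized, and the function name is hypothetical.

```python
from collections import deque


def count_vertex_shader_runs(indices, cache_size):
    """Count cache misses, i.e. how many times a vertex must be fetched and shaded."""
    cache = deque(maxlen=cache_size)  # FIFO: the oldest entry falls out when full
    misses = 0
    for idx in indices:
        if idx not in cache:
            misses += 1
            cache.append(idx)
    return misses


# Four triangles forming a strip of six vertices, listed in strip order...
ordered = [0, 1, 2, 2, 1, 3, 2, 3, 4, 4, 3, 5]
# ...and the same four triangles submitted in a less cache-friendly order.
shuffled = [0, 1, 2, 4, 3, 5, 2, 1, 3, 2, 3, 4]

print("ordered :", count_vertex_shader_runs(ordered, cache_size=4))
print("shuffled:", count_vertex_shader_runs(shuffled, cache_size=4))
```

Both lists draw identical geometry, yet the second forces more vertices to be transformed again, which is the cost the slide attributes to inefficient triangle ordering.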

The frame buffer
- The primary determinant of graphics performance other than the GPU
- The most expensive component of a graphics product other than the GPU
- Memory bandwidth is the key
- Frame buffer size also determines
    Local texture storage
    Maximum resolutions
    Antialiasing resolution limits

Frame Buffer Interface (FBI)
- Manages reading from and writing to the frame buffer
- Perhaps the most performance-critical component of a GPU
- GeForce's FBI is a crossbar
- Independent memory controllers for 4+ independent memory banks, for more efficient access to the frame buffer

GeForce 7800 GTX board details
[Figure: annotated board photo. Labels: SLI connector, single-slot cooling, sVideo TV out, DVI x 2, 16x PCI-Express, 256MB/256-bit DDR3 at 600 MHz (8 pieces of 8Mx32). From UIUC ECE498 Lecture 6, Fall 2007; used with permission.]

NVIDIA 7800 GTX
[Figure: chip block diagram. Labels: vertex processors, pixel processors, ROPs (raster op. units). From www.xbitlabs.com/articles/video/display/g70-indepth.html]
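Since memory bandwidth is described as the key frame buffer property, it may help to work out what the board's memory figures imply. The sketch below takes the 256-bit width, 600 MHz clock, and "8 pieces of 8Mx32" from the slide; it assumes that 600 MHz is the memory clock and that double data rate means two transfers per clock. The function names are hypothetical.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, clock_mhz, transfers_per_clock=2):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


def memory_capacity_bytes(chip_count, words_per_chip, bits_per_word):
    """Total capacity of a set of identical memory chips."""
    return chip_count * words_per_chip * bits_per_word // 8


MEGA = 2**20  # "8M" in a chip designation means 8 * 2**20 words

capacity = memory_capacity_bytes(chip_count=8, words_per_chip=8 * MEGA, bits_per_word=32)
bus_width = 8 * 32  # eight chips, each contributing a 32-bit data path

print("capacity:", capacity // MEGA, "MB")
print("bus width:", bus_width, "bits")
print("peak bandwidth:", peak_bandwidth_gb_per_s(bus_width, 600), "GB/s")
```

Under those assumptions the eight chips account for both the 256MB capacity and the 256-bit bus, and the peak works out to roughly 38 GB/s. Sustained bandwidth is lower in practice, which is where the crossbar and independent memory controllers matter.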

NVIDIA 7800 GTX vertex processors
- 7800 GTX has 8 of these
- 8 MADD (multiply/add) instructions in a single cycle

NVIDIA 7800 GTX pixel processors
- 7800 GTX has 24 of these
[Figures from http://www.xbitlabs.com/articles/video/display/g70-indepth_3.html]

Modern GPUs: unified design
GeForce 8 architecture
[Figures: slides by David Luebke, from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf]

Why unify? (1)
Why unify? (2)
Dynamic load balancing: Company of Heroes
[Figures: slides by David Luebke, from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf]

Motivation for shader languages
- Programming powerful hardware with assembly code is hard
- Programmers need the benefits of a high-level language:
    Easier programming
    Easier code reuse
    Easier debugging
    Portability

Assembly:
    DP3 R0, c[11].xyzx, c[11].xyzx;
    RSQ R0, R0.x;
    MUL R0, R0.x, c[11].xyzx;
    MOV R1, c[3];
    MUL R1, R1.x, c[0].xyzx;
    DP3 R2, R1.xyzx, R1.xyzx;
    RSQ R2, R2.x;
    MUL R1, R2.x, R1.xyzx;
    ADD R2, R0.xyzx, R1.xyzx;
    DP3 R3, R2.xyzx, R2.xyzx;
    RSQ R3, R3.x;
    MUL R2, R3.x, R2.xyzx;
    DP3 R2, R1.xyzx, R2.xyzx;
    MAX R2, c[3].z, R2.x;
    MOV R2.z, c[3].y;
    MOV R2.w, c[3].y;
    LIT R2, R2;

High-level language:
    float3 cspecular = pow(max(0, dot(nf, H)), phongexp).xxx;
    float3 cplastic = Cd * (cambient + cdiffuse) + Cs * cspecular;

From The Cg Tutorial
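To make the comparison concrete, here is a CPU-side Python sketch of the two high-level lines. It reads the repeated DP3 / RSQ / MUL triples in the assembly as vector normalization, which is an interpretation rather than something the slide states. The helper names and the test vectors are hypothetical; the variable roles (normal, half vector H, Phong exponent, Cd, Cs) follow the identifiers in the snippet.

```python
import math


def dp3(a, b):
    """Three-component dot product (the DP3 instruction)."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def rsq(x):
    """Reciprocal square root (the RSQ instruction)."""
    return 1.0 / math.sqrt(x)


def normalize(v):
    """DP3, RSQ, MUL: scale v by the reciprocal of its length."""
    scale = rsq(dp3(v, v))
    return (v[0] * scale, v[1] * scale, v[2] * scale)


def half_vector(to_light, to_viewer):
    """Normalized sum of the normalized light and view directions."""
    l, v = normalize(to_light), normalize(to_viewer)
    return normalize((l[0] + v[0], l[1] + v[1], l[2] + v[2]))


def plastic(cd, cs, c_ambient, c_diffuse, normal, h, phong_exp):
    """cplastic = Cd * (cambient + cdiffuse) + Cs * cspecular, per color channel."""
    spec = max(0.0, dp3(normal, h)) ** phong_exp
    return tuple(cd[i] * (c_ambient[i] + c_diffuse[i]) + cs[i] * spec for i in range(3))


n = (0.0, 0.0, 1.0)
h = half_vector(to_light=(1.0, 0.0, 0.0), to_viewer=(0.0, 0.0, 1.0))
color = plastic((1.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.1, 0.1, 0.1), (0.5, 0.5, 0.5), n, h, 2.0)
print(h, color)
```

Even in this form the high-level version names its intent (normalize, half vector, specular term), which is the readability and reuse argument the slide makes against hand-written assembly.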

Shader languages
- HLSL/Cg most common
    Both are more-or-less compatible
- Other alternatives:
    GLSL (for OpenGL)
    Assembly? (not anymore)