ME 290-R: General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing) Sara McMains Spring 2009 Lecture 7
Outline Last time Visibility Shading Texturing Today Texturing continued Frame buffer operations Hardware trends GPU programming intro 2
Procedural Texture Mapping Instead of looking up an image, pass the texture coordinates to a function that computes the texture value on the fly RenderMan, the Pixar rendering language, does this Available with fragment shaders on current-generation hardware Advantages: Near-infinite resolution with small storage cost Idea works for many other things Disadvantage: slower than an image lookup 3
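As a rough illustration (in Python rather than a shading language), a procedural texture is just a function of the texture coordinates. The checkerboard below is a hypothetical example, not from the lecture: it stores nothing, so it can be sampled at any resolution.

```python
def checker(u, v, scale=8):
    """Procedural checkerboard: returns 0.0 or 1.0 from (u, v) alone.

    No stored image is needed; the pattern has effectively infinite
    resolution because it is recomputed at every sample point.
    """
    return float((int(u * scale) + int(v * scale)) % 2)

# Sample anywhere in [0, 1) x [0, 1); adjacent cells alternate.
assert checker(0.0, 0.0) == 0.0
assert checker(0.1, 0.2) == 1.0
```

The trade-off on the slide shows up directly: every sample costs arithmetic instead of a single lookup.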
Other Types of Mapping Environment mapping looks up incoming illumination in a map Simulates reflections from shiny surfaces Bump-mapping computes an offset to the normal vector at each rendered pixel No need to put bumps in geometry, but silhouette looks wrong Displacement mapping adds an offset to the surface at each point Like putting bumps on geometry, but simpler to model All are available in software renderers like RenderMan compliant renderers All these are becoming available in hardware 4
Bump Mapping Look at smooth silhouettes Credit: Rich Riesenfeld 5
Displacement Mapping Look at silhouette Credit: Rich Riesenfeld 6
Deforming Images 3D Animated Flags--By 3DFlags.com intuitionbase.com/waveguide/tut6.html Credit: Rich Riesenfeld 7
Outline Today Texturing continued Frame buffer operations Hardware trends GPU programming intro 8
Rasterization Frame buffer Color buffer Depth buffer Stencil buffer 9
Frame buffer Ops Fragment Alpha Test Stencil Test Depth Test 10
Pipeline: Alpha Test Fragment passes through Alpha Test, Stencil Test, Depth Test in order If (P.alpha op alpha) pass fragment Else reject fragment (op and reference alpha are user-specified) 11
Frame Buffer Ops: Stencil Test If (P.FB.stencil op S) pass fragment Else reject fragment (op and reference value S are user-specified) 13
Frame Buffer Ops: Depth Test If (P.FB.depth op P.depth) pass fragment Else reject fragment Surviving fragments proceed to the frame buffer 15
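The alpha, stencil, and depth tests above can be modeled as a sequence of predicates: a fragment reaches the frame buffer only if all three pass. This Python sketch (the OPS table and fragment_passes are illustrative names, not a real API) mirrors the If/Else logic on the slides:

```python
import operator

# Comparison operators a graphics API lets the user choose from.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def fragment_passes(frag_alpha, ref_alpha, alpha_op,
                    fb_stencil, ref_stencil, stencil_op,
                    fb_depth, frag_depth, depth_op):
    """Sequential per-fragment tests: alpha, then stencil, then depth.
    A fragment rejected by any test never reaches the frame buffer."""
    if not OPS[alpha_op](frag_alpha, ref_alpha):
        return False                                  # alpha test reject
    if not OPS[stencil_op](fb_stencil, ref_stencil):  # P.FB.stencil op S
        return False                                  # stencil test reject
    return OPS[depth_op](fb_depth, frag_depth)        # P.FB.depth op P.depth

# An opaque fragment closer than the stored depth survives all tests.
assert fragment_passes(1.0, 0.5, ">", 1, 1, "==", 0.8, 0.3, ">")
```

Note the depth comparison follows the slides' convention of comparing the stored frame-buffer depth against the incoming fragment's depth.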
Outline Today Frame buffer operations Hardware trends GPU programming intro 17
In the beginning... (1965) Gordon Moore: # transistors per die doubling annually Moore's Law transistor density increasing, feature size decreasing 18
Today Processor performance doubles every ~18 months more transistors, faster clocks DRAM capacity doubles every ~3 years bandwidth increases 25%/yr latency improves 5%/yr 19
Compute vs. Communicate Faster clocks mean faster computation Chips are big: sending a signal all the way across takes multiple clock cycles Communication becoming more expensive relative to computation ratio of computation:bandwidth growing 20
Computation:Bandwidth Case Study NVIDIA GeForce FX 5800 (12/02) 2 fp ops : word of off-chip bandwidth GeForce FX 5950 (6/03) 2.66 fp ops : word of off-chip bandwidth GeForce 6800 (1/04) almost 6 fp ops : word of off-chip bandwidth 21
Computation:Bandwidth Case Study John Owens 22
Predicted Trends John Owens 23
Implications Computation:latency and computation:bandwidth ratios are growing need to do useful work while waiting for data requests to be fulfilled may be faster to compute a value than to use a lookup table Need efficient communication as well as efficient computation 24
Outline Today Texturing Frame buffer operations Hardware trends GPU programming intro 25
GPU Programming Intro Outline Data Parallelism and Stream Processing Computational Resources Inventory CPU-GPU Analogies Example: N-body gravitational simulation Parallel reductions Linear Algebra Representations Overview of Branching Techniques 26
The Importance of Data Parallelism GPUs are designed for graphics Highly parallel tasks GPUs process vertices & fragments independently Temporary registers are zeroed No shared or static data No read-modify-write buffers Data-parallel processing Multiple vertex & fragment pipelines Hide memory latency (with more computation) 27
Arithmetic Intensity Arithmetic intensity = ops per word transferred Computation / bandwidth Best to have high arithmetic intensity Ideal GPGPU apps have Large data sets High parallelism High independence between data elements 28
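The hardware case study earlier measured GPUs in fp ops per word of off-chip bandwidth; the same ratio characterizes an application. A minimal sketch, with illustrative numbers (the ~25 flops per force is from the N-body example later in the lecture; the 12-word traffic estimate is an assumption for two 4-component position reads plus one 4-component write):

```python
def arithmetic_intensity(flops, words_transferred):
    """Arithmetic intensity = operations per word of memory traffic.
    Higher is better: computation is cheap, bandwidth is expensive."""
    return flops / words_transferred

# Hypothetical N-body force kernel: ~25 flops, ~12 words of traffic.
ai = arithmetic_intensity(25, 12)
assert ai > 2.0  # more ops per word than a GeForce FX 5800 could feed
```

A kernel whose intensity falls below the hardware's ops-per-word ratio is bandwidth-bound, not compute-bound.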
Data Streams & Kernels Streams Collection of records requiring similar computation Vertex positions, Voxels, FEM cells, etc. Provide data parallelism Kernels Functions applied to each element in stream transforms, PDE, Few dependencies between stream elements Encourage high Arithmetic Intensity 29
Example: Simulation Grid Common GPGPU computation style Textures represent computational grids = streams Many computations map to grids Matrix algebra Image & volume processing Physically-based simulation Ray tracing Non-grid streams can be mapped to grids 30
Stream Computation Grid Simulation algorithm Made up of steps Each step updates entire grid Must complete before next step can begin Grid is a stream, steps are kernels Kernel applied to each stream element Cloud simulation algorithm 31
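The grid/kernel structure above can be sketched on the CPU; this is an illustrative model (run_simulation and smooth are hypothetical names), where each step reads the old grid and writes a new one, mirroring render-to-texture "ping-pong" on the GPU:

```python
def run_simulation(grid, kernels, steps):
    """Grid is a stream, steps are kernels: each kernel is applied to
    every cell, and a whole step completes before the next begins."""
    for _ in range(steps):
        for kernel in kernels:
            # Kernel reads the old grid, writes a fresh one (no
            # read-modify-write, as on the fragment processor).
            grid = [[kernel(grid, i, j) for j in range(len(grid[0]))]
                    for i in range(len(grid))]
    return grid

# Toy kernel: each cell averages itself with its right neighbor
# (clamped at the edge).
def smooth(g, i, j):
    right = g[i][min(j + 1, len(g[i]) - 1)]
    return 0.5 * (g[i][j] + right)

out = run_simulation([[0.0, 4.0]], [smooth], steps=1)
assert out == [[2.0, 4.0]]
```

On the GPU each list comprehension becomes one full-screen rendering pass with a fragment program as the kernel.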
Scatter vs. Gather Grid communication Grid cells share information 32
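The scatter/gather distinction is the key constraint later slides rely on; a minimal CPU sketch of the two access patterns (illustrative functions, not an API):

```python
def gather(src, idx):
    """Gather: each output element READS from a computed address,
    p = a[i]. This is what a texture fetch provides."""
    return [src[i] for i in idx]

def scatter(values, idx, size):
    """Scatter: each input element WRITES to a computed address,
    a[i] = p. Fragment programs cannot do this; their output
    address is fixed to the pixel being shaded."""
    out = [0] * size
    for v, i in zip(values, idx):
        out[i] = v
    return out

assert gather([10, 20, 30], [2, 0]) == [30, 10]
assert scatter([10, 20], [1, 0], 3) == [20, 10, 0]
```

Grid cells that need to "push" data to neighbors (scatter) must be rewritten so each cell "pulls" from its neighbors (gather) instead.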
Computational Resources Inventory Programmable parallel processors Vertex, Geometry, & Fragment pipelines Rasterizer Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants Texture unit Read-only memory interface Render to texture Write-only memory interface 33
Vertex Processor Fully programmable Processes 4-vectors (RGBA / XYZW) Capable of scatter but not gather Can change the location of current vertex Cannot read info from other vertices On older GPUs can only read a small constant memory Vertex Texture Fetch Random access memory for vertices Not available with older vertex processors 34
Fragment Processor Fully programmable Processes 4-component vectors (RGBA / XYZW) Random access memory read (textures) Capable of gather but not scatter RAM read (texture fetch), but no RAM write Output address fixed to a specific pixel Typically more useful than vertex processor More fragment pipelines than vertex pipelines Direct output (fragment processor is at end of pipeline) 35
Vertex Programming Vertex Program Interface to Transform&Light unit GPU instruction set to perform all vertex math Input: arbitrary vertex attributes Output: transformed vertex attributes homogeneous clip space position (required) color texture coordinates... 36
Vertex Programming Vertex Program Does not generate or destroy vertices Geometry processor on latest cards can No topological information provided No edge, face, or neighboring vertex info But this can be packed in vertex attributes Dynamically loadable 37
What gets bypassed? Modelview vertex transformations Projection transformations Vertex weighting/blending Normal transformation, rescaling, normalization Per-vertex lighting Texture coordinate generation and texture matrix transformations User-clip planes 38
What does NOT get bypassed? Clipping to the view frustum Perspective divide Viewport transformation Depth range transformation Clamping of colors to [0,1] ([0,255]) Primitive rasterization 39
CPU-GPU Analogies CPU programming is familiar GPU programming is graphics-centric Analogies can aid understanding 40
CPU-GPU Analogies CPU GPU Stream / Data Array = Texture Memory Read = Texture Sample 41
Kernels CPU GPU Kernel / loop body / algorithm step = Fragment Program 42
Feedback Each algorithm step depends on the results of previous steps Each time step depends on the results of the previous time step 43
Feedback CPU.. Grid[i][j]= x;... GPU Array Write = Render to Texture 44
GPU Simulation Overview Analogies lead to implementation Algorithm steps are fragment programs Computational kernels Current state is stored in textures Feedback via render to texture 45
Invoking Computation Must invoke computation at each pixel Just draw geometry! Most common GPGPU invocation is a full-screen quad Other Useful Analogies Rasterization = Kernel Invocation Texture Coordinates = Computational Domain Vertex Coordinates = Computational Range 46
Typical Grid Computation Initialize view (so that pixels:texels::1:1) glMatrixMode(GL_MODELVIEW); glLoadIdentity(); glMatrixMode(GL_PROJECTION); glLoadIdentity(); glOrtho(0, 1, 0, 1, 0, 1); glViewport(0, 0, outTexResX, outTexResY); For each algorithm step: Activate render-to-texture Setup input textures, fragment program Draw a full-screen quad 47
Example: N-Body Simulation Brute force N = 8192 bodies N² gravity computations 64M force comps. / frame ~25 flops per force 12.5+ GFLOPS sustained GeForce 6800 Ultra Nyland, Harris, Prins, GP² 2004 poster 48
Computing Gravitational Forces Each body attracts all other bodies N bodies, so N² forces Draw into an N×N buffer Pixel (i, j) computes force between bodies i and j Very simple fragment program More than 2048 bodies makes it trickier 49
Computing Gravitational Forces F(i, j) = g · M_i · M_j / d(i, j)², where d(i, j) = |pos(i) − pos(j)| Force is proportional to the inverse square of the distance between bodies 50
Computing Gravitational Forces Coordinates (i, j) in the N×N force texture, 0 ≤ i, j < N, are used to look up bodies i and j in the body position texture; pixel (i, j) computes force(i, j) = g · M_i · M_j / d(i, j)², with d(i, j) = |pos(i) − pos(j)| 51
Computing Gravitational Forces float4 force(float2 ij { } : WPOS, uniform sampler2d pos) : COLOR0 // Pos texture is 2D, not 1D, so we need to // convert body index into 2D coords for pos tex float4 icoords = getbodycoords(ij); float4 iposmass = texture2d(pos, icoords.xy); float4 jposmass = texture2d(pos, icoords.zw); float3 dir = ipos.xyz - jpos.xyz; float d2 = dot(dir, dir); dir = normalize(dir); return dir * g * iposmass.w * jposmass.w / d2; 52
Computing Total Force Have: an N×N texture of force(i, j) values Need: total force on each particle i 53
Computing Total Force Have: an N×N texture of force(i, j) values Need: total force on each particle i Sum each column of the force texture 54
Computing Total Force Sum each column of the force texture Can do all N columns in parallel This is called a Parallel Reduction 55
Parallel Reductions 1D parallel reduction: sum N columns or rows in parallel add two halves of the texture together (N×N) 56
Parallel Reductions add two halves of the texture together repeatedly... (N×(N/2)) 57
Parallel Reductions add two halves of the texture together repeatedly... (N×(N/4)) 58
Parallel Reductions ...until we're left with a single row of texels (N×1) Requires log2 N steps 59
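The halving scheme can be sketched on the CPU: each "pass" adds the top and bottom halves of the texture, so a power-of-two number of rows collapses to one row in log2 N passes. The function name is illustrative, and the power-of-two row count is an assumption:

```python
def reduce_rows(tex):
    """1D parallel reduction over a 2D 'texture' (list of rows):
    repeatedly add the two halves together until one row remains.
    Assumes the number of rows is a power of two."""
    rows = tex
    while len(rows) > 1:
        half = len(rows) // 2
        # One render pass: each output texel sums two input texels.
        rows = [[a + b for a, b in zip(rows[r], rows[r + half])]
                for r in range(half)]
    return rows[0]

# 4x3 "texture": column sums computed in log2(4) = 2 passes.
tex = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9],
       [0, 1, 2]]
assert reduce_rows(tex) == [12, 16, 20]
```

Each inner list comprehension is one full-screen pass on the GPU, with the two fetches done by a fragment program reading the previous pass's output texture.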
Update Positions and Velocities Now we have a 1-D array of total forces One per body Update velocity: u(i, t+dt) = u(i, t) + F_total(i) * dt Simple fragment program reads previous velocity and force textures, creates new velocity texture Update position: x(i, t+dt) = x(i, t) + u(i, t) * dt Simple fragment program reads previous position and velocity textures, creates new position texture 60
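The two update passes amount to one explicit Euler step. A CPU sketch of the same arithmetic (the function name is illustrative; the position update uses the previous velocity, as in the slide's formula):

```python
def step(pos, vel, f_total, dt):
    """Explicit Euler update matching the two render passes:
    one 'fragment program' writes the new velocity texture,
    one writes the new position texture."""
    new_vel = [v + f * dt for v, f in zip(vel, f_total)]   # u(t+dt)
    new_pos = [x + v * dt for x, v in zip(pos, vel)]       # x(t+dt), old u
    return new_pos, new_vel

# One body with unit mass: velocity 2, total force 1, dt = 0.5.
p, v = step(pos=[0.0], vel=[2.0], f_total=[1.0], dt=0.5)
assert v == [2.5] and p == [1.0]
```

Because each pass only reads previous-step textures and writes a fresh one, the ping-pong render-to-texture pattern applies unchanged.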
Linear Algebra Representations Vector representation 2D textures are the best we can do High texture memory bandwidth Read-write access, dependent fetches A length-N vector is packed into a 2D texture 61
Representation (cont.) Dense matrix representation treat a dense N×N matrix as a set of N column vectors store each vector as a 2D texture 62
Representation (cont.) Banded sparse matrix representation treat a banded matrix as a set of diagonal vectors store each diagonal as a 2D texture 63
Representation (cont.) Banded sparse matrix representation combine opposing diagonal vectors to save space (a diagonal of length N−i and its opposing diagonal of length i fill one length-N vector) 64
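The diagonal storage idea can be sketched on the CPU before worrying about the texture packing. This illustrative example (extract_diagonals and banded_matvec are hypothetical names) stores a banded matrix as one zero-padded vector per diagonal and multiplies against it:

```python
def extract_diagonals(A, offsets):
    """Store a banded NxN matrix as one vector per diagonal;
    off-matrix entries are zero-padded so each vector has length N."""
    n = len(A)
    return {k: [A[i][i + k] if 0 <= i + k < n else 0.0
                for i in range(n)]
            for k in offsets}

def banded_matvec(diags, x):
    """y = A*x using only the stored diagonals."""
    n = len(x)
    y = [0.0] * n
    for k, d in diags.items():
        for i in range(n):
            if 0 <= i + k < n:
                y[i] += d[i] * x[i + k]
    return y

# Tridiagonal 3x3 example.
A = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]
d = extract_diagonals(A, offsets=[-1, 0, 1])
assert banded_matvec(d, [1.0, 1.0, 1.0]) == [3.0, 4.0, 3.0]
```

On the GPU each diagonal vector becomes a 2D texture, and the multiply becomes a few texture fetches per output element, with no storage wasted on the zero entries outside the band.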
Operations Vector-Vector Operations Reduced to 2D texture operations Coded in vertex/fragment programs Example: Vector1 + Vector2 → Vector3 Render a static quad with Vector1 on TexUnit 0 and Vector2 on TexUnit 1, render-to-texture into Vector3 Vertex program: pass through Fragment program: return tex0 + tex1 65
The single float on GPUs Some operations generate single float values e.g. reduce... Read-back to main memory is slow Keep single floats on the GPU as 1x1 textures 66
GPGPU Flow Control Strategies Branching and Looping
Branching Techniques Fragment program branches can be expensive No true fragment branching on older cards SIMD branching on GeForce 6+ Series Incoherent branching hurts performance Sometimes better to move decisions up the pipeline Replace with math Occlusion Query Z-cull Pre-computation Static Branch Resolution 68
Branching with Occlusion Query Use it for iteration termination Do { // outer loop on CPU BeginOcclusionQuery { // Render with fragment program that // discards fragments that satisfy // termination criteria } EndQuery } While query returns > 0 Can be used for subdivision techniques 69
Z-Cull In early pass, modify depth buffer Clear Z to 1 Draw quad at Z=0 Discard fragments that should be modified in later passes Subsequent passes Enable depth test (GL_LESS) Draw full-screen quad at z=0.5 Only fragments with previous depth=1 will be processed Can also use stencil cull on GeForce 6 series Not available on GeForce FX (NV3X) Discard and shader depth output disable Z-cull 70
Pre-computation Pre-compute anything that will not change every iteration! Example: static obstacles in fluid sim Texture containing boundary info for cells inside obstacles Reuse that texture until obstacles are modified Combine with Z-cull for higher performance! 71
Static Branch Resolution Avoid branches where outcome is fixed One region is always true, another false Separate FPs for each region, no branches Example: boundaries 72
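Static branch resolution can be modeled on the CPU as two separate passes, each running a branch-free kernel over its own region, instead of one kernel that tests "am I on the boundary?" at every cell. This is an illustrative sketch (step_two_passes is a hypothetical name):

```python
def step_two_passes(grid, interior_kernel, boundary_kernel):
    """Static branch resolution: draw the interior with one
    specialized kernel and the boundary with another, so neither
    fragment program contains the boundary test."""
    n, m = len(grid), len(grid[0])
    out = [[0.0] * m for _ in range(n)]
    # Pass 1: interior "quad" -- no boundary checks inside this kernel.
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = interior_kernel(grid, i, j)
    # Pass 2: boundary "lines" -- a second, specialized kernel.
    for i in range(n):
        for j in range(m):
            if i in (0, n - 1) or j in (0, m - 1):
                out[i][j] = boundary_kernel(grid, i, j)
    return out

out = step_two_passes([[1.0] * 3 for _ in range(3)],
                      interior_kernel=lambda g, i, j: 2.0,
                      boundary_kernel=lambda g, i, j: 0.0)
assert out[1][1] == 2.0 and out[0][0] == 0.0
```

On the GPU the region test moves from the fragment program into the geometry: one inner quad and four boundary line primitives, each bound to its own fragment program.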
Acknowledgements Jens Krüger Mark Harris Naga K. Govindaraju John Owens GPU Gems 2 chapter 29: Streaming Architectures and Technology Trends 73