
ME 290-R: General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing) Sara McMains Spring 2009 Lecture 7

Outline. Last time: visibility, shading, texturing. Today: texturing continued, frame buffer operations, hardware trends, GPU programming intro. 2

Procedural Texture Mapping: Instead of looking up an image, pass the texture coordinates to a function that computes the texture value on the fly. RenderMan, the Pixar rendering language, does this. Available with fragment shaders on current-generation hardware. Advantages: near-infinite resolution with small storage cost; the idea works for many other things. Disadvantage: slower. 3
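As a quick illustration (not from the original slides), here is a minimal procedural texture written in C: a checkerboard value is computed from the texture coordinates on the fly instead of being fetched from a stored image.

#include <math.h>

/* Procedural checkerboard: the "texture" is a function of (u, v), so it has
 * essentially unlimited resolution but costs a little computation per lookup.
 * "scale" (an illustrative parameter) sets how many squares span [0,1]. */
float checker(float u, float v, float scale)
{
    int iu = (int)floorf(u * scale);
    int iv = (int)floorf(v * scale);
    return ((iu + iv) & 1) ? 1.0f : 0.0f;   /* alternate dark and light squares */
}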

Other Types of Mapping: Environment mapping looks up incoming illumination in a map; simulates reflections from shiny surfaces. Bump mapping computes an offset to the normal vector at each rendered pixel; no need to put bumps in the geometry, but the silhouette looks wrong. Displacement mapping adds an offset to the surface at each point; like putting bumps in the geometry, but simpler to model. All are available in software renderers such as RenderMan-compliant renderers, and all are becoming available in hardware. 4

Bump Mapping Look at smooth silhouettes Credit: Rich Riesenfeld 5

Displacement Mapping Look at silhouette Credit: Rich Riesenfeld 6

Deforming Images 3D Animated Flags--By 3DFlags.com intuitionbase.com/waveguide/tut6.html Credit: Rich Riesenfeld 7

Outline Today Texturing continued Frame buffer operations Hardware trends GPU programming intro 8

Rasterization Frame buffer Color buffer Depth buffer Stencil buffer 9

Frame buffer Ops Fragment Alpha Test Stencil Test Depth Test 10

Pipeline: Alpha Test. The fragment enters the alpha test; on the "no" branch it is rejected. User-specified test: if (P.alpha op alpha) pass the fragment, else reject it. 11

Frame Buffer Ops: Alpha Test. On the "yes" branch the fragment continues on to the stencil test. User-specified test: if (P.alpha op alpha) pass the fragment, else reject it. 12

Frame Buffer Ops: Stencil Test. On the "no" branch the fragment is rejected. User-specified test: if (P.FB.Stencil op S) pass the fragment, else reject it. 13

Frame Buffer Ops: Stencil Test. On the "yes" branch the fragment continues on to the depth test. User-specified test: if (P.FB.Stencil op S) pass the fragment, else reject it. 14

Frame Buffer Ops: Depth Test. On the "no" branch the fragment is rejected. Test: if (P.FB.depth op P.depth) pass the fragment, else reject it. 15

Frame Buffer Ops: Depth Test. On the "yes" branch the fragment is written to the frame buffer. 16
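As a concrete (hedged) example of how these three tests are configured from the host side in fixed-function OpenGL; the reference values 0.5 and 1 and the GL_LESS comparison are illustrative choices, not taken from the slides.

#include <GL/gl.h>

/* Per-fragment tests in pipeline order: alpha test, stencil test, depth test. */
void setup_fragment_tests(void)
{
    glEnable(GL_ALPHA_TEST);
    glAlphaFunc(GL_GREATER, 0.5f);            /* pass the fragment if P.alpha > 0.5 */

    glEnable(GL_STENCIL_TEST);
    glStencilFunc(GL_EQUAL, 1, 0xFF);         /* pass the fragment if the stored stencil value == 1 */
    glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);   /* leave the stencil buffer unchanged */

    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_LESS);                     /* pass the fragment if its depth < the stored depth */
}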

Outline Today Frame buffer operations Hardware trends GPU programming intro 17

In the beginning... (1965) Gordon Moore: # transistors per die doubling annually. Moore's Law: transistor density increasing, size decreasing. 18

Today: Processor performance doubles every ~18 months (more transistors, faster clocks). DRAM capacity doubles every ~3 years; bandwidth increases ~25%/yr; latency improves ~5%/yr. 19

Compute vs. Communicate: Faster clocks mean faster computation, but chips are big, and sending a signal all the way across the chip takes multiple clock cycles. Communication is becoming more expensive relative to computation: the ratio of computation to bandwidth is growing. 20

Computation:Bandwidth Case Study. NVIDIA GeForce FX 5800 (12/02): 2 fp ops per word of off-chip bandwidth. GeForce FX 5950 (6/03): 2.66 fp ops per word of off-chip bandwidth. GeForce 6800 (1/04): almost 6 fp ops per word of off-chip bandwidth. 21

Computation:Bandwidth Case Study John Owens 22

Predicted Trends John Owens 23

Implications: As the computation:latency and computation:bandwidth ratios grow, we need to do useful work while waiting for data requests to be fulfilled; it may be faster to compute a value than to fetch it from a lookup table. We need efficient communication as well as efficient computation. 24

Outline Today Texturing Frame buffer operations Hardware trends GPU programming intro 25

GPU Programming Intro Outline Data Parallelism and Stream Processing Computational Resources Inventory CPU-GPU Analogies Example: N-body gravitational simulation Parallel reductions Linear Algebra Representations Overview of Branching Techniques 26

The Importance of Data Parallelism: GPUs are designed for graphics, which consists of highly parallel tasks. GPUs process fragments independently: temporary registers are zeroed, there is no shared or static data, and there are no read-modify-write buffers. Data-parallel processing across multiple vertex & fragment pipelines hides memory latency (with more computation). 27

Arithmetic Intensity: arithmetic intensity = ops per word transferred (computation / bandwidth). Best to have high arithmetic intensity. Ideal GPGPU apps have large data sets, high parallelism, and high independence between data elements. 28
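In symbols (a restatement, using the GeForce FX 5800 number from the case-study slide above as the worked example):

\[
  \text{arithmetic intensity} \;=\; \frac{\text{arithmetic operations}}{\text{words of memory traffic}},
  \qquad
  \text{e.g. GeForce FX 5800: } \frac{2\ \text{fp ops}}{1\ \text{word of off-chip bandwidth}} = 2 .
\]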

Data Streams & Kernels. Streams: collections of records requiring similar computation (vertex positions, voxels, FEM cells, etc.); they provide data parallelism. Kernels: functions applied to each element in the stream (transforms, PDE steps, ...); few dependencies between stream elements; encourage high arithmetic intensity. 29

Example: Simulation Grid Common GPGPU computation style Textures represent computational grids = streams Many computations map to grids Matrix algebra Image & Volume processing Physically-based simulation ray tracing Non-grid streams can be mapped to grids 30

Stream Computation Grid Simulation algorithm Made up of steps Each step updates entire grid Must complete before next step can begin Grid is a stream, steps are kernels Kernel applied to each stream element Cloud simulation algorithm 31

Scatter vs. Gather Grid communication Grid cells share information 32
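The difference, written out in plain C (array names are illustrative): a gather reads from a computed address, which is what a texture fetch gives you; a scatter writes to a computed address, which a fragment program cannot do because its output pixel is fixed (see the processor slides below).

/* Gather: out[i] = data[index[i]]  (read from a computed address). */
void gather(float *out, const float *data, const int *index, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = data[index[i]];
}

/* Scatter: out[index[i]] = data[i]  (write to a computed address). */
void scatter(float *out, const float *data, const int *index, int n)
{
    for (int i = 0; i < n; i++)
        out[index[i]] = data[i];
}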

Computational Resources Inventory Programmable parallel processors Vertex, Geometry, & Fragment pipelines Rasterizer Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants Texture unit Read-only memory interface Render to texture Write-only memory interface 33

Vertex Processor Fully programmable Processes 4-vectors (RGBA / XYZW) Capable of scatter but not gather Can change the location of current vertex Cannot read info from other vertices On older GPUs can only read a small constant memory Vertex Texture Fetch Random access memory for vertices Not available with older vertex processors 34

Fragment Processor Fully programmable Processes 4-component vectors (RGBA / XYZW) Random access memory read (textures) Capable of gather but not scatter RAM read (texture fetch), but no RAM write Output address fixed to a specific pixel Typically more useful than vertex processor More fragment pipelines than vertex pipelines Direct output (fragment processor is at end of pipeline) 35

Vertex Programming Vertex Program Interface to Transform&Light unit GPU instruction set to perform all vertex math Input: arbitrary vertex attributes Output: transformed vertex attributes homogeneous clip space position (required) color texture coordinates... 36

Vertex Programming. A vertex program does not generate or destroy vertices (the geometry processor on the latest cards can). No topological information is provided: no edge, face, or neighboring-vertex info, but this can be packed into vertex attributes. Dynamically loadable. 37

What gets bypassed? Modelview vertex transformations Projection transformations Vertex weighting/blending Normal transformation, rescaling, normalization Per-vertex lighting Texture coordinate generation and texture matrix transformations User-clip planes 38

What does NOT get bypassed? Clipping to the view frustum Perspective divide Viewport transformation Depth range transformation Clamping of colors to [0,1] ([0,255]) Primitive rasterization 39

CPU-GPU Analogies CPU programming is familiar GPU programming is graphics-centric Analogies can aid understanding 40

CPU-GPU Analogies: CPU stream / data array = GPU texture; CPU memory read = GPU texture sample. 41

Kernels: CPU kernel / loop body / algorithm step = GPU fragment program. 42
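To make the analogy concrete, a hypothetical CPU step in C: the loop body below is the kernel, and it is exactly the part that becomes the fragment program, since rasterizing a quad generates one fragment per (i, j) and runs the kernel at each one.

/* Illustrative kernel: scale each element (stand-in for any per-element step). */
static float kernel_body(const float *in, int i, int j, int w)
{
    return 2.0f * in[j * w + i];
}

/* CPU version of one algorithm step: loop over the grid, applying the kernel.
 * On the GPU the loops disappear; the kernel body becomes the fragment program. */
void run_step_cpu(float *out, const float *in, int w, int h)
{
    for (int j = 0; j < h; j++)
        for (int i = 0; i < w; i++)
            out[j * w + i] = kernel_body(in, i, j, w);
}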

Feedback Each algorithm step depends on the results of previous steps Each time step depends on the results of the previous time step 43

Feedback: on the CPU, ... Grid[i][j] = x; ... On the GPU, array write = render to texture. 44

GPU Simulation Overview Analogies lead to implementation Algorithm steps are fragment programs Computational kernels Current state is stored in textures Feedback via render to texture 45

Invoking Computation Must invoke computation at each pixel Just draw geometry! Most common GPGPU invocation is a full-screen quad Other Useful Analogies Rasterization = Kernel Invocation Texture Coordinates = Computational Domain Vertex Coordinates = Computational Range 46
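A minimal sketch of "just draw geometry" in era-appropriate immediate-mode OpenGL, assuming the render target, input textures, and fragment program have already been bound; the quad covers the unit square set up by the glOrtho call on the next slide.

/* One full-screen quad = one kernel invocation per output pixel.
 * Texture coordinates span the computational domain. */
glBegin(GL_QUADS);
    glTexCoord2f(0.0f, 0.0f); glVertex2f(0.0f, 0.0f);
    glTexCoord2f(1.0f, 0.0f); glVertex2f(1.0f, 0.0f);
    glTexCoord2f(1.0f, 1.0f); glVertex2f(1.0f, 1.0f);
    glTexCoord2f(0.0f, 1.0f); glVertex2f(0.0f, 1.0f);
glEnd();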

Typical Grid Computation. Initialize the view (so that pixels:texels :: 1:1):

glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(0, 1, 0, 1, 0, 1);
glViewport(0, 0, outTexResX, outTexResY);

For each algorithm step: activate render-to-texture, set up input textures and the fragment program, and draw a full-screen quad. 47

Example: N-Body Simulation. Brute force: N = 8192 bodies, N² gravity computations = 64M force computations / frame at ~25 flops per force, for 12.5+ GFLOPS sustained on a GeForce 6800 Ultra. Nyland, Harris, Prins, GP² 2004 poster. 48

Computing Gravitational Forces. Each body attracts all other bodies: N bodies, so N² forces. Draw into an N×N buffer; pixel (i, j) computes the force between bodies i and j. Very simple fragment program (more than 2048 bodies makes it trickier). 49

Computing Gravitational Forces. $F(i,j) = \dfrac{g \, M_i \, M_j}{d(i,j)^2}$, where $d(i,j) = \lVert \mathrm{pos}(i) - \mathrm{pos}(j) \rVert$. Force is proportional to the inverse square of the distance between the bodies. 50

Computing Gravitational Forces. An N×N force texture holds force(i,j) for 0 ≤ i, j < N; a body position texture holds the body positions. $F(i,j) = \dfrac{g \, M_i \, M_j}{d(i,j)^2}$, $d(i,j) = \lVert \mathrm{pos}(i) - \mathrm{pos}(j) \rVert$. The coordinates (i, j) in the force texture are used to find bodies i and j in the body position texture. 51
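For comparison with the fragment program on the next slide, the same N² computation as a CPU analogue in C; the struct layout mirrors the position texture (xyz position, mass in w), and G stands for the gravitational constant.

#include <math.h>

typedef struct { float x, y, z, w; } float4;    /* w holds the body's mass */

/* CPU analogue of the NxN force texture: entry (i, j) holds the force
 * computed from bodies i and j, with the same direction convention as the
 * fragment program (dir = pos_i - pos_j). */
void compute_forces(float4 *force, const float4 *pos, int n, float G)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float dx = pos[i].x - pos[j].x;
            float dy = pos[i].y - pos[j].y;
            float dz = pos[i].z - pos[j].z;
            float d2 = dx * dx + dy * dy + dz * dz;
            if (d2 == 0.0f) {                    /* guard the i == j case */
                force[i * n + j] = (float4){ 0.0f, 0.0f, 0.0f, 0.0f };
                continue;
            }
            float inv_d = 1.0f / sqrtf(d2);
            float s = G * pos[i].w * pos[j].w / d2;   /* |F| = G*Mi*Mj / d(i,j)^2 */
            force[i * n + j] = (float4){ s * dx * inv_d, s * dy * inv_d, s * dz * inv_d, 0.0f };
        }
    }
}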

Computing Gravitational Forces:

float4 force(float2 ij : WPOS, uniform sampler2D pos) : COLOR0
{
    // Pos texture is 2D, not 1D, so we need to
    // convert body index into 2D coords for pos tex
    float4 iCoords  = getBodyCoords(ij);
    float4 iPosMass = texture2D(pos, iCoords.xy);
    float4 jPosMass = texture2D(pos, iCoords.zw);
    float3 dir = iPosMass.xyz - jPosMass.xyz;
    float  d2  = dot(dir, dir);
    dir = normalize(dir);
    // g is the gravitational constant; the .w component holds each body's mass
    return dir * g * iPosMass.w * jPosMass.w / d2;
}  52

Computing Total Force. Have: an N×N array of (i,j) forces. Need: the total force on each particle i. 53

Computing Total Force. Have: an N×N array of (i,j) forces. Need: the total force on each particle i, i.e. the sum of each column of the force array. 54

Computing Total Force. Have: an N×N array of (i,j) forces. Need: the total force on each particle i, i.e. the sum of each column of the force array. We can do all N columns in parallel; this is called a Parallel Reduction. 55

Parallel Reductions. 1D parallel reduction: sum N columns or rows in parallel by adding the two halves of the texture together (N×N). 56

Parallel Reductions. 1D parallel reduction: sum N columns or rows in parallel by adding the two halves of the texture together, repeatedly... (N×(N/2)). 57

Parallel Reductions. 1D parallel reduction: sum N columns or rows in parallel by adding the two halves of the texture together, repeatedly... (N×(N/4)). 58

Parallel Reductions. ...until we're left with a single row of texels (N×1). Requires log₂ N steps. 59
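The same halving scheme as its CPU analogue in C (assuming N is a power of two, as in the 8192-body example): each pass folds the top half of the rows onto the bottom half, so after log₂ N passes row 0 holds the column sums. On the GPU, each pass is one render over an N x (rows/2) region with a fragment program that fetches and adds two texels.

/* 1D parallel reduction over an n x n array stored row-major:
 * repeatedly add the top half of the rows to the bottom half. */
void reduce_columns(float *a, int n)
{
    for (int rows = n; rows > 1; rows /= 2) {
        int half = rows / 2;
        for (int j = 0; j < half; j++)       /* on the GPU this whole pass runs in parallel */
            for (int i = 0; i < n; i++)
                a[j * n + i] += a[(j + half) * n + i];
    }
    /* Row 0 (a[0..n-1]) now holds the sum of each column. */
}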

Update Positions and Velocities. Now we have a 1-D array of total forces, one per body. Update velocity: u(i, t+dt) = u(i, t) + Ftotal(i) * dt; a simple fragment program reads the previous velocity and force textures and creates a new velocity texture. Update position: x(i, t+dt) = x(i, t) + u(i, t) * dt; a simple fragment program reads the previous position and velocity textures and creates a new position texture. 60
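The two update kernels as their CPU analogue in C, mirroring the equations above; on the GPU each loop is a single render-to-texture pass over the 1-D arrays stored as textures.

typedef struct { float x, y, z, w; } float4;    /* same layout as the textures */

/* u(i, t+dt) = u(i, t) + Ftotal(i) * dt ;  x(i, t+dt) = x(i, t) + u(i, t) * dt */
void update_bodies(float4 *pos_new, float4 *vel_new,
                   const float4 *pos, const float4 *vel,
                   const float4 *f_total, int n, float dt)
{
    for (int i = 0; i < n; i++) {
        vel_new[i].x = vel[i].x + f_total[i].x * dt;
        vel_new[i].y = vel[i].y + f_total[i].y * dt;
        vel_new[i].z = vel[i].z + f_total[i].z * dt;
        vel_new[i].w = 0.0f;
        pos_new[i].x = pos[i].x + vel[i].x * dt;    /* uses the previous velocity */
        pos_new[i].y = pos[i].y + vel[i].y * dt;
        pos_new[i].z = pos[i].z + vel[i].z * dt;
        pos_new[i].w = pos[i].w;                    /* carry the mass along */
    }
}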

Linear Algebra Representations. Vector representation: 2D textures are the best we can do; they give high texture memory bandwidth, read-write access, and dependent fetches. [Figure: a length-N vector laid out as a 2D texture.] 61
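The address arithmetic this packing implies, as a small C sketch (tex_w, the chosen texture width, is an illustrative parameter): a 1-D vector index i maps to the 2-D texel that stores it, and back.

typedef struct { int x, y; } texel;

/* Pack a length-n vector row by row into a 2D texture that is tex_w texels wide. */
texel index_to_texel(int i, int tex_w)
{
    texel t = { i % tex_w, i / tex_w };
    return t;
}

int texel_to_index(texel t, int tex_w)
{
    return t.y * tex_w + t.x;
}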

Representation (cont.). Dense matrix representation: treat a dense N×N matrix as a set of N column vectors, and again store these vectors as 2D textures. [Figure: matrix → N vectors → N 2D textures.] 62

Representation (cont.). Banded sparse matrix representation: treat a banded matrix as a set of diagonal vectors (here, a matrix with 2 bands becomes 2 vectors, stored as 2 2D textures). 63

Representation (cont.). Banded sparse matrix representation: combine opposing vectors to save space. [Figure: opposing diagonal vectors packed into a single 2D texture.] 64

Operations. Vector-Vector Operations: reduced to 2D texture operations, coded in vertex/fragment programs. Example: Vector1 + Vector2 → Vector3. A static quad is drawn with Vector1 on TexUnit 0 and Vector2 on TexUnit 1, rendering to the Vector3 texture; the vertex program is a pass-through and the fragment program is just "return tex0 + tex1". 65

The single float on GPUs. Some operations generate single float values (e.g. reduce...). Read-back to main memory is slow, so keep single floats on the GPU as 1x1 textures. 66

GPGPU Flow Control Strategies Branching and Looping

Branching Techniques. Fragment program branches can be expensive: there is no true fragment branching on older cards, only SIMD branching on the GeForce 6 Series and later, and incoherent branching hurts performance. Sometimes it is better to move decisions up the pipeline: replace branches with math, occlusion query, Z-cull, pre-computation, static branch resolution. 68

Branching with Occlusion Query: use it for iteration termination.

do { // outer loop on CPU
    BeginOcclusionQuery {
        // Render with fragment program that
        // discards fragments that satisfy
        // termination criteria
    } EndQuery
} while (query returns > 0);

Can be used for subdivision techniques. 69
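A hedged sketch of that loop against the GL 1.5 occlusion-query API (assuming a loader such as GLEW exposes the entry points; render_termination_test_pass() is a hypothetical stand-in for the render with the discarding fragment program).

GLuint query, survivors;
glGenQueries(1, &query);
do {
    glBeginQuery(GL_SAMPLES_PASSED, query);
    render_termination_test_pass();   /* hypothetical: fragment program discards
                                         fragments that satisfy the termination
                                         criteria, so they are not counted */
    glEndQuery(GL_SAMPLES_PASSED);
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &survivors);   /* waits for the result */
} while (survivors > 0);
glDeleteQueries(1, &query);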

Z-Cull. In an early pass, modify the depth buffer: clear Z to 1, draw a quad at Z=0, and discard the fragments that should be modified in later passes. In subsequent passes, enable the depth test (GL_LESS) and draw a full-screen quad at z=0.5; only fragments whose previous depth is 1 will be processed. Can also use stencil cull on the GeForce 6 Series (not available on GeForce FX / NV3X). Discard and shader depth output disable Z-cull. 70
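A minimal fixed-function sketch of that two-pass setup; draw_fullscreen_quad() and the masking fragment program are hypothetical placeholders.

/* Early pass: build the depth mask.  The masking fragment program discards the
 * pixels that later passes SHOULD update, so their depth stays at 1. */
glClearDepth(1.0);
glClear(GL_DEPTH_BUFFER_BIT);
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_ALWAYS);
glDepthMask(GL_TRUE);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);   /* depth-only pass */
draw_fullscreen_quad(0.0f);                            /* quad at Z = 0 */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);

/* Subsequent passes: the computation runs only where the stored depth is still 1. */
glDepthFunc(GL_LESS);
glDepthMask(GL_FALSE);                                 /* keep the mask intact */
draw_fullscreen_quad(0.5f);                            /* quad at z = 0.5 */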

Pre-computation Pre-compute anything that will not change every iteration! Example: static obstacles in fluid sim Texture containing boundary info for cells inside obstacles Reuse that texture until obstacles are modified Combine with Z-cull for higher performance! 71

Static Branch Resolution Avoid branches where outcome is fixed One region is always true, another false Separate FPs for each region, no branches Example: boundaries 72

Acknowledgements Jens Krüger Mark Harris Naga K. Govindaraju John Owens GPU Gems 2 chapter 29: Streaming Architectures and Technology Trends 73