The Way of the GPU (based on GPGPU SIGGRAPH Course)

Size: px

Start display at page:

Download "The Way of the GPU (based on GPGPU SIGGRAPH Course)"

Bernice Todd
5 years ago
Views:

1 The Way of the GPU (based on GPGPU SIGGRAPH Course) CS535 Fall 2016 Daniel G. Aliaga Department of Computer Science Purdue University

2 Computer Graphics Pipeline Geometry (this is really from 20 years ago ) Modeling Transformation Transform into 3D world coordinate system Lighting Simulate illumination and reflectance Viewing Transformation Transform into 3D camera coordinate system Clipping Clip primitives outside camera s view Projection Transform into 2D camera coordinate system Scan Conversion Draw pixels (incl. texturing, hidden surface ) Image

3 Today, we have GPUs (GPU = graphical processing unit)

4 Motivation: Computational Power Why are GPUs fast? Arithmetic intensity: the specialized nature of GPUs makes it easier to use additional transistors for computation not cache Economics: multi-billion dollar video game market is a pressure cooker that drives innovation

5 Motivation: Flexible and Precise Modern GPUs are deeply programmable Programmable pixel, vertex, video engines Solidifying high-level language support Modern GPUs support high precision 32 bit floating point throughout the pipeline High enough for many (not all) applications

6 The Problem: Difficult To Use GPUs designed for & driven by video games Programming model unusual Programming idioms tied to computer graphics Programming environment tightly constrained Underlying architectures are: Inherently parallel Rapidly evolving (even in basic feature set!) Largely secret Can t simply port CPU code!

7 Diagram of a Modern GPU Input from CPU Host interface Geometry/Vertex processing Triangle setup Pixel processing Memory Interface fast fast fast fast memory memory memory memory

8 nvidia GPU GTX Titan X GHz (i.e., mini processors) 192 gigatexels/sec fill rate 966 Gflops in matrix multiplication 5120x3200 pixels 250W power 91C max GPU temp $999

9 Modern GPU has more ALU s

10 GPU Pipeline: Transform Vertex/Geometry processor (multiple in parallel) Transform from world space to image space Compute per-primitive and per-vertex lighting

11 Rasterizer GPU Pipeline: Rasterize (typically not programmable) Convert geometric rep. (vertex) to image rep. (fragment) Fragment = image fragment Pixel + associated data: color, depth, stencil, etc. Interpolate per-vertex quantities across pixels

12 GPU Pipeline: Shade Fragment processors (multiple in parallel) Compute a color for each pixel Optionally read colors from textures (images)

13 GPU Programming Languages Many options! A while ago: Renderman cg (from NVIDIA) GLSL (GL shading Language) CUDA (more general that graphics) Lets focus first on the concept, later on the language specifics

14 Mapping Parallel Computational Concepts to GPUs

15 Mapping Parallel Computational Concepts to GPUs GPUs are designed for graphics Highly parallel tasks GPUs process independent vertices & fragments Temporary registers are zeroed No shared or static data No read-modify-write buffers Data-parallel processing GPUs architecture is ALU-heavy Multiple vertex & pixel pipelines, multiple ALUs per pipe Hide memory latency (with more computation) 15

Example: Simulation Grid Common GPGPU computation style Textures represent computational grids = streams Many computations map to grids Matrix algebra

16 Example: Simulation Grid Common GPGPU computation style Textures represent computational grids = streams Many computations map to grids Matrix algebra Image & Volume processing Physically-based simulation Global Illumination ray tracing, photon mapping, radiosity Non-grid streams can be mapped to grids 16

17 Typical Stream Computation Grid Simulation algorithm Made up of steps Each step updates entire grid Must complete before next step can begin Grid is a stream, steps are kernels Kernel applied to each stream element 17 Cloud simulation algorithm

18 e.g.: Scatter vs. Gather Grid communication Grid cells share information 18

19 Vertex Processor Fully programmable (SIMD / MIMD) Processes 4-vectors (RGBA / XYZW) Capable of scatter but not gather 19 Can change the location of current vertex Cannot read info from other vertices Can only read a small constant memory Latest GPUs: Vertex Texture Fetch Random access memory for vertices Gather (But not from the vertex stream itself)

20 Fragment Processor Fully programmable (SIMD) Processes 4-component vectors (RGBA / XYZW) Random access memory read (textures) Capable of gather but not scatter RAM read (texture fetch), but no RAM write Output address fixed to a specific pixel Typically more useful than vertex processor More fragment pipelines than vertex pipelines Direct output (fragment processor is at end of pipeline) 20

21 GPU Simulation Overview A Simulation: Its algorithm steps are fragment programs Called Computational kernels Current state is stored in textures Feedback via render to texture Question: How do we invoke computation? 21

22 Invoking Computation Must invoke computation at each pixel Just draw geometry! Most common GPGPU invocation is a full-screen quad Other Useful Analogies Rasterization = Kernel Invocation Texture Coordinates = Computational Domain Vertex Coordinates = Computational Range 22

23 Typical Grid Computation Initialize view (so that pixels:texels::1:1) glmatrixmode(gl_modelview); glloadidentity(); glmatrixmode(gl_projection); glloadidentity(); glortho(0, 1, 0, 1, 0, 1); glviewport(0, 0, outtexresx, outtexresy); For each algorithm step: Activate render-to-texture Setup input textures, fragment program Draw a full-screen quad (1x1) 23

24 Example: N-Body Simulation Brute force N = 8192 bodies N 2 gravity computations 64M force comps. / frame ~25 flops per force 10.5 fps Nyland, Harris, Prins, 17+ GFLOPs sustained in this example GP poster 24

25 Computing Gravitational Forces Each body attracts all other bodies N bodies, so N 2 forces Draw into an NxN buffer Pixel (i,j) computes force between bodies i and j Very simple fragment program More than N=2048 bodies is tricky Why? 25

26 Computing Gravitational Forces N N-body force Texture Body Position Texture j F(i,j) = gm i M j / r(i,j) 2, force(i,j) r(i,j) = pos(i) - pos(j) i j 0 i N Force is proportional to the inverse square of the distance between bodies 26

27 Computing Gravitational Forces float4 force(float2 ij : WPOS, uniform sampler2d pos) : COLOR0 { // Pos texture is 2D, not 1D, so we need to // convert body index into 2D coords for pos tex float4 icoords = getbodycoords(ij); float4 iposmass = texture2d(pos, icoords.xy); float4 jposmass = texture2d(pos, icoords.zw); float3 dir = ipos.xyz - jpos.xyz; float r2 = dot(dir, dir); dir = normalize(dir); return dir * g * iposmass.w * jposmass.w / r2; } 27

28 Computing Total Force Have: array of (i, j) forces Need: total force on each particle i Sum of each column of the force array Can do all N columns in parallel N N-body force Texture force(i,j) 0 i N This is called a Parallel Reduction 28

29 Parallel Reductions 1D parallel reduction: sum N columns or rows in parallel add two halves of texture together repeatedly... NxN Until we re left with a single row of texels Nx(N/2) Nx(N/4) Nx Requires log 2 N steps

30 Update Positions and Velocities Now we have a 1-D array of total forces One per body Update Velocity 30 u(i,t+dt) = u(i,t) + F total (i) * dt Simple pixel shader reads previous velocity and force textures, creates new velocity texture Update Position x(i, t+dt) = x(i,t) + u(i,t) * dt Simple pixel shader reads previous position and velocity textures, creates new position texture

31 Linear Algebra on GPUs Use linear algebra for VR, education, simulation, games and much more!

32 Representation Vector representation 2D textures best we can do Per-fragment vs. per-vertex operations High texture memory bandwidth Read-write access, dependent fetches 1 N 1 N

33 Representation (cont.) Dense Matrix representation treat a dense matrix as a set of column vectors again, store these vectors as 2D textures i Matrix N Vectors N N N 2D-Textures i N N 1 i N

34 Representation (cont.) Banded Sparse Matrix representation treat a banded matrix as a set of diagonal vectors i Matrix 2 Vectors N N 2 2D-Textures 1 2 N 1 2

35 Representation (cont.) Banded Sparse Matrix representation combine opposing vectors to save space i Matrix 2 Vectors N N 2 2D-Textures 1 2 N-i N 1 2

36 Operations Vector-Vector Operations Reduced to 2D texture operations Coded in vertex/fragment shaders Example: Vector1 + Vector2 Vector3 Vector 1 Vector 2 Vector 3 + Static quad TexUnit 0 TexUnit 1 Render To Texture Pass through return tex0 + tex1 Vertex Shader Pixel Shader

37 Operations (cont.) Vector-Vector Operations Reduce operation for scalar products original Texture st 1 pass nd 2 pass Reduce m x n region in fragment shader

38 The single float on GPUs Some operations generate single float values e.g. reduce... Read-back to main-mem is slow Keep single floats on the GPU as 1x1 textures

39 Operations (cont.) In depth example: Vector / Banded-Matrix Multiplication A b x =

40 Example (cont.) Vector / Banded-Matrix Multiplication A b A b x =

41 Example (cont.) Compute the result in 2 Passes Pass 1: b = A x multiply

42 Example (cont.) Compute the result in 2 Passes Pass 2: b = A shift b x multiply

43 Representation (cont.) Random sparse matrix representation Textures do not work Splitting yields highly fragmented textures Difficult to find optimal partitions Idea: encode only non-zero entries in vertex arrays Column as TexCoord 1-4 Matrix Values as TexCoord 0 Matrix Vector = Result Vertex Array 1 Vertex Array 2 Tex0 Pos Tex1-4 Tex0 Pos Tex1-4 Vector Result N = Row Index as Position N

44 Sorting Given an unordered list of elements, produce list ordered by key value Kernel: compare and swap GPUs constrained programming environment limits viable algorithms Based on Bitonic merge sort [Batcher 68] Periodic balanced sorting networks [Dowd 89]

45 Bitonic Merge Sort Overview Repeatedly build bitonic lists and then sort them Bitonic list is two monotonic lists concatenated together, one increasing and one decreasing. List A: (3, 4, 7, 8) List B: (6, 5, 2, 1) List AB: (3, 4, 7, 8, 6, 5, 2, 1) monotonically increasing monotonically decreasing bitonic

46 Bitonic Merge Sort x monotonic lists: (3) (7) (4) (8) (6) (2) (1) (5) 4x bitonic lists: (3,7) (4,8) (6,2) (1,5)

47 Bitonic Merge Sort Sort the bitonic lists

48 Bitonic Merge Sort x monotonic lists: (3,7) (8,4) (2,6) (5,1) 2x bitonic lists: (3,7,8,4) (2,6,5,1)

49 Bitonic Merge Sort Sort the bitonic lists

50 Bitonic Merge Sort Sort the bitonic lists

51 Bitonic Merge Sort Sort the bitonic lists

52 Bitonic Merge Sort x monotonic lists: (3,4,7,8) (6,5,2,1) 1x bitonic list: (3,4,7,8, 6,5,2,1)

53 Bitonic Merge Sort Sort the bitonic list

54 Bitonic Merge Sort Sort the bitonic list

55 Bitonic Merge Sort Sort the bitonic list

56 Bitonic Merge Sort Sort the bitonic list

57 Bitonic Merge Sort Sort the bitonic list

58 Bitonic Merge Sort Done!

59 Bitonic Merge Sort Summary Separate rendering pass for each set of swaps O(log 2 n) passes Each pass performs n compare/swaps Total compare/swaps: O(n log 2 n) Limitations of GPU cost us factor of log n over best CPUbased sorting algorithms

60 Database and Streaming on GPUs Utilize graphics processors for fast computation of common database operations Conjunctive selections Aggregations Semi-linear queries Essential components 60

61 Basic DB Operations Basic SQL query» Select A» From T» Where C A= attributes or aggregations (SUM, COUNT, MAX etc) T=relational table C= Boolean Combination of Predicates (using operators AND, OR, NOT) 61

62 Database Operations Predicates a i op constant or a i op a j op: <,>,<=,>=,!=, =, TRUE, FALSE Boolean combinations Conjunctive Normal Form (CNF) Aggregations COUNT, SUM, MAX, MEDIAN, AVG 62

63 Predicate Evaluation a i op constant (d) Copy the attribute values a i into depth buffer Specify the comparison operation used in the depth test Draw a screen filling quad at depth d and perform the depth test 63

64 a i op d If ( a i op d )pass fragment Else reject fragment P Screen d 64

65 Predicate Evaluation CPU implementation Intel compiler 7.1 with SIMD optimizations 65

66 Predicate Evaluation a i op a j Equivalent to (a i a j ) op 0 Semi-linear queries Defined as linear combination of attribute values compared against a constant Linear combination is computed as a dot product of two vectors s i ai Utilize the vector processing capabilities of GPUs 66

67 67 Semi-linear Query

68 Boolean Combination CNF: (A 1 AND A 2 AND AND A k ) where A i = (B i 1 OR B i 2 OR OR B i mi ) Performed using stencil test recursively C 1 = (TRUE AND A 1 ) = A 1 C i = (A 1 AND A 2 AND AND A i ) = (C i-1 AND A i ) Different stencil values are used to code the outcome of C i Positive stencil values pass predicate evaluation Zero fail predicate evaluation 68

69 A 1 AND A 2 A 2 = (B 2 1 OR B 2 2 OR B 2 3 ) B 2 3 A 1 B 2 2 B

70 A 1 AND A 2 Stencil value = 1 A 1 70

71 A 1 AND A 2 TRUE AND A 1 Stencil value = 0 Stencil value = 1 A 1 71

72 A 1 AND A 2 Stencil = 0 Stencil=2 B 2 2 Stencil = 1 A 1 Stencil=2 B 2 3 Stencil=2 B

73 A 1 AND A 2 Stencil = 0 Stencil=2 B 2 2 Stencil = 1 A 1 Stencil=2 B 2 3 Stencil=2 B

74 A 1 AND A 2 Stencil = 2 A 1 AND B 2 2 Stencil = 0 Stencil=2 A 1 AND B 2 3 Stencil=2 A 1 AND B

75 Aggregations COUNT, MAX, MIN, SUM, AVG These can be efficiently implemented as well! 75

76 Geometric Computations Well studied Computer graphics, computational geometry etc. Widely used in games, simulations, virtual reality applications 76

77 GPUs: Geometric Computations Used for geometric applications Minkowski sums [Kim et al. 02] CSG rendering [Goldfeather et al. 89, Rossignac et al. 90] Voronoi computation [Hoff et al. 01, 02, Sud et al. 04] Isosurface computation [Pascucci 04] Map simplification [Mustafa et al. 01] 77

78 Collision Detection Pointers: Overview Collision Detection: CULLIDE Inter- and Intra-Object Collision Detection: Quick-CULLIDE Reliable Collision Detection: FAR Analysis 78

79 Geometry processing on GPUs so far: GPGPU limited to texture output new APIs allow geometry generation on GPU

80 Examples Particles Fluid Simulation 3D Smoke & Fire Grid displacement Water Simulation 3D Water Surfaces

81 Examples Particles Fluid Simulation 3D Smoke & Fire Water Simulation Grid displacement 3D Water Surfaces Point Compression Point Decompression Point Rendering

82 High Level Shading Languages Cg, HLSL, & OpenGL Shading Language Cg: HLSL: rs.asp OpenGL Shading Language: ex.html 82

83 GPGPU Languages Why do want them? Make programming GPUs easier! Don t need to know OpenGL, DirectX, or ATI/NV extensions Simplify common operations Focus on the algorithm, not on the implementation 83

84 Compilers: CGC & FXC HLSL and Cg are syntactically almost identical Exception: Cg 1.3 allows shader interfaces, unsized arrays Command line compilers Microsoft s FXC.exe Compiles to DirectX vertex and pixel shader assembly only fxc /Tps_2_0 myshader.hlsl NVIDIA s CGC.exe Compiles to everything cgc -profile ps_2_0 myshader.cg Can generate very different assembly! Driver will recompile code Compliance may vary

85 Babelshader Converts between DirectX pixel shaders and OpenGL shaders Allows OpenGL programs to use DirectX HLSL compilers to compile programs into ARB or fp30 assembly. Enables fair benchmarking competition between the HLSL compiler and the Cg compiler on the same platform with the same demo and driver. Example Conversion Between Ps2.0 and ARB 85

86 printf Debugging MOV suspect register to output Comment out anything else writing to output Scale and bias as needed Recompile Display/readback frame buffer Check values Repeat until error is (hopefully) found

87 printf Debugging Examples

88 printf Debugging Examples

89 printf Debugging Examples

90 printf Debugging Examples

91 Ideal Fragment Program Debugger Automate printf debugging Intuitive and easy to use interface Features over performance Debuggers don t need to be fast Should still be interactive Easy to add to existing apps And remove Touch as little GPU state as possible Report actual hardware values No software simulations!

92 Data Parallel Computing Instruction-Level Parallelism Data-Level Parallelism

93 A really naïve shader frag2frame Smooth(vert2frag IN, uniform samplerrect Source : texunit0, uniform samplerrect Operator : texunit1, uniform samplerrect Boundary : texunit2, uniform float4 params) { frag2frame OUT; float2 center = IN.TexCoord0.xy; float4 U = f4texrect(source, center); // Calculate Red-Black (odd-even) masks float2 intpart; float2 place = floor(1.0f - modf(round(center + float2(0.5f, 0.5f)) / 2.0f, intpart)); float2 mask = float2((1.0f-place.x) * (1.0f-place.y), place.x * place.y); if (((mask.x + mask.y) && params.y) (!(mask.x + mask.y) &&!params.y)) { float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f), params.x*center.y - 0.5f*(params.x-1.0f));... float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f, center.y - 1.0f, center.y + 1.0f); float central = -2.0f*(O.x + O.y); } float poisson = ((params.x*params.x)*u.z + (-O.x * f1texrect(source, float2(neighbor.x, center.y)) + -O.x * f1texrect(source, float2(neighbor.y, center.y)) + -O.y * f1texrect(source, float2(center.x, neighbor.z)) + -O.z * f1texrect(source, float2(center.x, neighbor.w)))) / O.w; OUT.COL.x = poisson; }... return OUT;

94 A really naïve shader frag2frame Smooth(vert2frag IN, uniform samplerrect Source : texunit0, uniform samplerrect Operator : texunit1, uniform samplerrect Boundary : texunit2, uniform float4 params) { frag2frame OUT; float2 center = IN.TexCoord0.xy; float4 U = f4texrect(source, center); // Calculate Red-Black (odd-even) masks float2 intpart; float2 place = floor(1.0f - modf(round(center + float2(0.5f, 0.5f)) / 2.0f, intpart)); float2 mask = float2((1.0f-place.x) * (1.0f-place.y), place.x * place.y); if (((mask.x + mask.y) && params.y) (!(mask.x + mask.y) &&!params.y)) { float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f), params.x*center.y - 0.5f*(params.x-1.0f));... float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f, center.y - 1.0f, center.y + 1.0f); float central = -2.0f*(O.x + O.y); } float poisson = ((params.x*params.x)*u.z + (-O.x * f1texrect(source, float2(neighbor.x, center.y)) + -O.x * f1texrect(source, float2(neighbor.y, center.y)) + -O.y * f1texrect(source, float2(center.x, neighbor.z)) + -O.z * f1texrect(source, float2(center.x, neighbor.w)))) / O.w; OUT.COL.x = poisson; }... return OUT;

95 Instruction-Level Parallelism float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f), params.x*center.y - 0.5f*(params.x-1.0f)); float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f, center.y - 1.0f, center.y + 1.0f);

96 Instruction-Level Parallelism float2 offset = center.xy - 0.5f; offset = offset * params.xx + 0.5f; // MADR is cool too one // cycle, two flops float4 neighbor = center.xxyy + float4(-1.0f,1.0f,-1.0f,1.0f);

97 Data-Level Parallelism Pack scalar data into RGBA in texture memory

98 Computational Frequency Think of your CPU program and your vertex and fragment programs as different levels of nested looping.... foreach tri in triangles { // run the vertex program on each vertex v1 = process_vertex(tri.vertex1); v2 = process_vertex(tri.vertex2); v3 = process_vertex(tri.vertex2); }... // assemble the vertices into a triangle assembledtriangle = setup_tri(v1, v2, v3); // rasterize the assembled triangle into [0..many] fragments fragments = rasterize(assembledtriangle); // run the fragment program on each fragment foreach frag in fragments { outbuffer[frag.position] = process_fragment(frag); }

99 Computational Frequency Branches Avoid these, especially in the inner loop i.e., the fragment program.

100 Computational Frequency Static branch resolution write several variants of each fragment program to handle boundary cases eliminates conditionals in the fragment program equivalent to avoiding CPU inner-loop branching case 1: no boundaries case 2: accounts for boundaries

101 Computational Frequency Dynamic branching Use only when needed Good perf requires spatial coherence in branching

102 Computational Frequency Precompute Precompute Precompute

103 Computational Frequency Precompute texture coordinates Take advantage of under-utilized hardware vertex processor rasterizer Reduce instruction count at the per-fragment level Avoid lookups being treated as texture indirections

104 Memory Hierarchy CPU and GPU Memory Hierarchy Disk CPU Main Memory CPU Caches GPU Video Memory CPU Registers GPU Caches GPU Constant Registers GPU Temporary Registers 104

105 CPU Memory Model At any program point 105 Allocate/free local or global memory Random memory access Registers Read/write Local memory Read/write to stack Global memory Disk Read/write to heap Read/write to disk

106 GPU Memory Model Much more restricted memory access 106 Allocate/free memory only before computation Limited memory access during computation (kernel) Registers Read/write Local memory Does not exist Global memory Read-only during computation Write-only at end of computation (pre-computed address) Disk access

107 GPU Memory Model Where is GPU Data Stored? Vertex buffer Frame buffer Texture VS 3.0 GPUs Texture Vertex Buffer Vertex Processor Rasterizer Fragment Processor Frame Buffer(s)

108 Glift: Goal A template for simple creation and efficient use of randomaccess GPU data structures for graphics and GPGPU programming Contributions Abstraction for GPU data structures The template library 108

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

ME 290-R: General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing) Sara McMains Spring 2009 Lecture 7 Outline Last time Visibility Shading Texturing Today Texturing continued