General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

1 ME 290-R: General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing). Sara McMains, Spring 2009. Lecture

2 Outline. Last time: frame buffer operations; GPU programming intro; linear algebra representations; flow control. Today: reduce review; sorting; searching; Cg.

3 Parallel Reductions. 1D parallel reduction: sum N columns or rows in parallel. Add two halves of texture together: N x N. Credit: Mark Harris

4 Parallel Reductions. 1D parallel reduction: sum N columns or rows in parallel. Add two halves of texture together, repeatedly: N x (N/2). Credit: Mark Harris

5 Parallel Reductions. 1D parallel reduction: sum N columns or rows in parallel. Add two halves of texture together, repeatedly: N x (N/4). Credit: Mark Harris

6 Parallel Reductions. 1D parallel reduction: sum N columns or rows in parallel. Add two halves of texture together, repeatedly, until we're left with a single row of texels: N x 1. Requires log2 N steps. Credit: Mark Harris
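
The halving scheme on the slides above can be emulated on the CPU. The following is a minimal Python sketch, not the lecture's implementation: the name `reduce_rows` is hypothetical, the "texture" is a list of rows, and the width is assumed to be a power of two. Each loop iteration stands in for one rendering pass.

```python
def reduce_rows(grid):
    """Sum every row of a grid by repeatedly adding its right half onto its left half.

    Assumption: the row width is a power of two. Each iteration of the
    while loop corresponds to one rendering pass on the GPU.
    """
    rows = [list(r) for r in grid]
    width = len(rows[0])
    passes = 0
    while width > 1:
        half = width // 2
        # One "pass": every output texel adds the texel `half` columns to its right.
        rows = [[r[i] + r[i + half] for i in range(half)] for r in rows]
        width = half
        passes += 1
    return [r[0] for r in rows], passes


sums, passes = reduce_rows([[1, 2, 3, 4], [5, 6, 7, 8], [1, 1, 1, 1], [0, 0, 0, 2]])
print(sums, passes)  # [10, 26, 4, 2] 2
```

For a width of 4 the loop runs twice, matching the log2 N step count on the slide.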

7 Reduce. Any operation that computes a single result from a data set: sum, min, max, average, product, ...

8 Reduce: Max Reduction. Credit: Ian Buck

    // Each output fragment returns the maximum of a 2x2 block of input texels.
    float max(float2 texcoord : TEXCOORD0,
              uniform samplerRECT img) : COLOR
    {
        float a, b, c, d;
        a = f1texRECT(img, texcoord);
        b = f1texRECT(img, texcoord + float2(0, 1));
        c = f1texRECT(img, texcoord + float2(1, 0));
        d = f1texRECT(img, texcoord + float2(1, 1));
        return max(max(a, b), max(c, d));
    }

9 Max Reduction. O(log n) passes to reduce n^2 elements. Can increase the number of reductions done in the fragment program to reduce the number of passes. Credit: Mark Harris, Tim Purcell, Ian Buck
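
A CPU-side sketch of the multi-pass max reduction may help make the pass count concrete. This is an illustration in Python, not the original Cg: `max_reduce_pass` and `max_reduce` are hypothetical names, the image is assumed square with a power-of-two side, and the assumption that each pass renders a quad half the size of its input is mine.

```python
def max_reduce_pass(img):
    """One pass: each output texel is the max of a 2x2 block of input texels."""
    half = len(img) // 2
    return [
        [
            max(img[2 * y][2 * x], img[2 * y][2 * x + 1],
                img[2 * y + 1][2 * x], img[2 * y + 1][2 * x + 1])
            for x in range(half)
        ]
        for y in range(half)
    ]


def max_reduce(img):
    """Repeat the pass until a single texel remains; return (max, pass count)."""
    passes = 0
    while len(img) > 1:
        img = max_reduce_pass(img)
        passes += 1
    return img[0][0], passes


image = [[1, 9, 3, 4], [5, 6, 7, 8], [2, 0, 16, 1], [3, 3, 3, 3]]
print(max_reduce(image))  # (16, 2)
```

A 4 x 4 image (n = 4, n^2 = 16 elements) needs log2 4 = 2 passes.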

10 Linear Algebra Representations. Vector representation: 2D textures are the best we can do. High texture memory bandwidth. Read-write access, dependent fetches. Credit: Jens Krüger

11 The fragment pipeline. Input: fragment attributes, interpolated from vertex information: color (R G B A), position (X Y Z W), texture coordinates (X Y [Z]). Input: texture image: each element of the texture is a 4D vector (X Y Z W); 32 bits = float, 16 bits = half. Credit: Suresh Venkatasubramanian

12 Outline Today Reduce review Sorting Searching Cg

13 Assumptions. Data organized into 1D arrays. Rendering pass == screen-aligned quad. Not using vertex shaders. PS 2.0 GPU: no data-dependent branching at the fragment level.

14 Sorting. Given an unordered list of elements, produce a list ordered by key value. Kernel: compare and swap. Standard sort algorithms are not suited to GPUs, so look at parallel sort algorithms: bitonic merge sort [Batcher 68]; periodic balanced sorting networks [Dowd 89].

15 Bitonic Merge Sort Overview. Repeatedly build bitonic lists and then sort them. A bitonic list is two monotonic lists concatenated together, one increasing and one decreasing. List A: monotonically increasing. List B: monotonically decreasing. List AB (A followed by B): bitonic.

16 Bitonic Merge Sort. 8x monotonic lists (one element each); 4x bitonic lists (two elements each).

17 Bitonic Merge Sort. Sort the bitonic lists.

18 Bitonic Merge Sort. 4x monotonic lists (two elements each); 2x bitonic lists (four elements each).

19 Bitonic Merge Sort. Sort the bitonic lists.

20 Bitonic Merge Sort. Sort the bitonic lists.

21 Bitonic Merge Sort. Sort the bitonic lists.

22 Bitonic Merge Sort. 2x monotonic lists (four elements each); 1x bitonic list (eight elements).

23 Bitonic Merge Sort. Sort the bitonic list.

24 Bitonic Merge Sort. Sort the bitonic list.

25 Bitonic Merge Sort. Sort the bitonic list.

26 Bitonic Merge Sort. Sort the bitonic list.

27 Bitonic Merge Sort. Sort the bitonic list.

28 Bitonic Merge Sort. Done!

29 Bitonic Merge Sort Summary. Separate rendering pass for each set of swaps: O(log^2 n) passes. Each pass performs n compare/swaps. Total compare/swaps: O(n log^2 n). Limitations of the GPU cost us a factor of log n over the best CPU-based sorting algorithms.
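
The pass count can be checked with a short derivation, assuming n is a power of two and that the stage merging bitonic lists of length 2^s needs s compare/swap passes (the stage index s is introduced here for illustration):

```latex
\[
P(n) \;=\; \sum_{s=1}^{\log_2 n} s \;=\; \frac{\log_2 n \,(\log_2 n + 1)}{2} \;=\; O(\log^2 n),
\qquad
C(n) \;=\; n \cdot P(n) \;=\; O(n \log^2 n).
\]
```

For n = 8 this gives P(8) = 3 * 4 / 2 = 6 passes. Comparing C(n) with the O(n log n) of the best comparison sorts on the CPU gives the factor of log n mentioned on the slide.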

30 Bitonic Merge Sort Helper Function

    // Convert a 1D array index into 2D texel coordinates for a texture of the given width.
    float2 convert1dto2d(float coord1d, float width)
    {
        float2 coord2d;
        coord2d.y = coord1d / width;
        coord2d.x = floor(frac(coord2d.y) * width);
        coord2d.y = floor(coord2d.y);
        return coord2d;
    }
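
The address conversion can be tried out on the CPU. The sketch below mirrors the helper's arithmetic in Python; the name `convert_1d_to_2d` is hypothetical. One caveat that is an observation of mine rather than something stated on the slide: because the column is recovered from a fractional part, the result is only guaranteed exact when the division is exact in floating point, for example with power-of-two widths.

```python
import math


def convert_1d_to_2d(coord1d, width):
    """Map a 1D array index to (x, y) texel coordinates, mirroring the Cg helper."""
    y = coord1d / width
    frac_y = y - math.floor(y)          # frac() in Cg
    x = math.floor(frac_y * width)      # column within the row
    return (x, math.floor(y))           # (column, row)


# With width 8, index 19 lives in row 2, column 3 (19 = 2 * 8 + 3).
print(convert_1d_to_2d(19, 8))  # (3, 2)
```

On the CPU, `divmod(coord1d, width)` gives the same row and column with integer arithmetic and no rounding concerns.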

31 Bitonic Merge Sort

    float BitonicSort(float2 elem2d : WPOS,
                      uniform float offset,     // offset  = 2^(stage - 1)
                      uniform float pbufwidth,
                      uniform float stageno,    // stageno = 2^stage
                      uniform float stepno,     // stepno  = 2^step
                      uniform samplerRECT sortedlist) : COLOR
    {
        elem2d = floor(elem2d);
        float elem1d = elem2d.y * pbufwidth + elem2d.x;
        // csign: is the comparison partner above (+1) or below (-1) this element?
        half csign = (fmod(elem1d, stageno) < offset) ? 1 : -1;
        // cdir: is this block being sorted ascending (+1) or descending (-1)?
        half cdir = (fmod(floor(elem1d / stepno), 2) == 0) ? 1 : -1;
        float adr1d = csign * offset + elem1d;
        float2 adr2d = convert1dto2d(adr1d, pbufwidth);
        float val0 = f1texRECT(sortedlist, elem2d);
        float val1 = f1texRECT(sortedlist, adr2d);
        float cmin = (val0 < val1) ? val0 : val1;
        float cmax = (val0 > val1) ? val0 : val1;
        return (csign == cdir) ? cmin : cmax;
    }
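
To see how the kernel and its uniform parameters fit together, here is a CPU emulation in Python. It is a sketch under assumptions: the host loop that chooses `offset`, `stageno` and `stepno` for each pass is not shown on the slides and is reconstructed here from the kernel's comments, the data is a flat list whose length is a power of two, and the sample input is arbitrary.

```python
def bitonic_pass(data, offset, stageno, stepno):
    """Emulate one rendering pass: every element runs the kernel once."""
    out = []
    for elem in range(len(data)):
        csign = 1 if (elem % stageno) < offset else -1   # partner above or below
        cdir = 1 if (elem // stepno) % 2 == 0 else -1    # ascending or descending block
        val0 = data[elem]
        val1 = data[elem + csign * offset]               # comparison partner
        out.append(min(val0, val1) if csign == cdir else max(val0, val1))
    return out


def bitonic_sort(data):
    """Assumed host loop: one pass per (step, stage) pair; returns (sorted, passes)."""
    n = len(data)
    log_n = n.bit_length() - 1
    assert 1 << log_n == n, "length must be a power of two"
    passes = 0
    for step in range(1, log_n + 1):        # stepno = 2^step
        for stage in range(step, 0, -1):    # offset = 2^(stage - 1), stageno = 2^stage
            data = bitonic_pass(data, 2 ** (stage - 1), 2 ** stage, 2 ** step)
            passes += 1
    return data, passes


print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # ([1, 2, 3, 4, 5, 6, 7, 8], 6)
```

Each pass reads only from the previous pass's output and writes a fresh list, which mirrors the fact that a fragment program cannot read and write the same texture in one pass.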

32 Binary Search

    float BinarySearch(float2 elem2d : WPOS,
                       uniform float stride,
                       uniform float pbufwidth,
                       uniform float sortbufwidth,
                       uniform samplerRECT sortlist) : COLOR
    {
        elem2d = floor(elem2d);
        float elem1d = elem2d.y * pbufwidth + elem2d.x;
        float curpos = stride;
        // loop over (LOGN - 1) search passes
        for (int i = 0; i < LOGN - 1; i++) {
            stride = floor(stride * 0.5);
            curpos = Search(curpos, elem1d, stride, sortlist, sortbufwidth);
        }
        // log n-th pass
        curpos = Search(curpos, elem1d, 1.0, sortlist, sortbufwidth);
        // cleanup pass
        curpos = SearchFin(curpos, elem1d, 1.0, sortlist, sortbufwidth);
        return curpos;
    }

33 Making GPU Sorting Faster. Draw several quads with similar computation instead of a single quad: reduces decision making in the fragment program. Push work into the vertex processor and interpolator: reduces computation in the fragment program. More than one compare/swap per sort kernel invocation: reduces computational complexity.

34 Grouping Computation

35 Implementation Details. Specify interpolants for smaller quads: "down" or "up" compare and swap; distance to comparison partner. See Kipfer & Westermann in GPU Gems 2.

36 Outline Today Reduce review Sorting Searching Cg

37 Types of Search. Search for a specific element: binary search. Search for nearest element(s): k-nearest neighbor search. Both searches require ordered data.

38 Binary Search. Find a specific element in an ordered list. Implement just like the CPU algorithm, assuming the hardware supports long enough shaders. Finds the first element of a given value v; if v does not exist, finds the next smallest element > v. The search algorithm is sequential, but many searches can be executed in parallel: the number of pixels drawn determines the number of searches executed in parallel (1 pixel == 1 search).

39 Binary Search. Search for v0. Initialize (position 4): search starts at the center of the sorted array. The center element is >= v0, so search the left half of the sub-array. Sorted list: v0 v0 v0 followed by five larger elements, at indices 0 to 7.

40 Binary Search. Search for v0. Step 1 (position 2): v0 >= v0, so search the left half of the sub-array.

41 Binary Search. Search for v0. Step 2 (position 1): v0 >= v0, so search the left half of the sub-array.

42 Binary Search. Search for v0. Step 3 (position 0): at this point, we either have found v0 or are 1 element too far left. One last step to resolve.

43 Binary Search. Search for v0. Step 4 (position 0): done!

44 Binary Search. Search for v0 and v. Initialize: search starts at the center of the sorted array. Both searches proceed to the left half of the array.

45 Binary Search. Search for v0 and v. Step 1: the search for v0 continues as before. The search for v overshot, so go back to the right.

46 Binary Search. Search for v0 and v. Step 2: we've found the proper v, but are still looking for v0. Both searches continue.

47 Binary Search. Search for v0 and v. Step 3: now we've found the proper v0, but overshot v. The cleanup step takes care of this.

48 Binary Search. Search for v0 and v. Step 4: done! Both v0 and v are located properly.

49 Binary Search Summary. Single rendering pass. Fragment program: each pixel drawn performs an independent search. Iterates log n + 1 times through the list.

50 Binary Search

    float BinarySearch(float2 elem2d : WPOS,
                       uniform float stride,
                       uniform float pbufwidth,
                       uniform float sortbufwidth,
                       uniform samplerRECT sortlist) : COLOR
    {
        elem2d = floor(elem2d);
        float elem1d = elem2d.y * pbufwidth + elem2d.x;
        float curpos = stride;
        // loop over (LOGN - 1) search passes
        for (int i = 0; i < LOGN - 1; i++) {
            stride = floor(stride * 0.5);
            curpos = Search(curpos, elem1d, stride, sortlist, sortbufwidth);
        }
        // log n-th pass
        curpos = Search(curpos, elem1d, 1.0, sortlist, sortbufwidth);
        // cleanup pass
        curpos = SearchFin(curpos, elem1d, 1.0, sortlist, sortbufwidth);
        return curpos;
    }

51 Binary Search: Search Routines

    float Search(float curpos, float elem, float stride,
                 uniform samplerRECT data, float texw)
    {
        float2 adr2d = convert1dto2d(curpos, texw);
        float val = f1texRECT(data, adr2d);
        // Step left if the key is <= the probed value, otherwise step right.
        float dir = (elem <= val) ? -1.0 : 1.0;
        return dir * stride + curpos;
    }

    SearchFin is identical except that it stays put instead of stepping left:
        float dir = (elem <= val) ? 0.0 : 1.0;
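
The fixed-iteration search can be emulated on the CPU to check that it lands on the first element >= the key. This Python sketch makes several assumptions that are not on the slides: the list length is a power of two and at least 2, the initial stride is n/2, the list is indexed directly rather than through a 2D texture, and a read one past the end returns +infinity (a stand-in for whatever out-of-range texture read behavior the original relied on). The helper `read` is hypothetical.

```python
import bisect
import math


def read(data, pos):
    """Texture fetch stand-in; reads past the end return +infinity (assumption)."""
    return data[pos] if pos < len(data) else math.inf


def search(curpos, elem, stride, data):
    direction = -1 if elem <= read(data, curpos) else 1
    return curpos + direction * stride


def search_fin(curpos, elem, stride, data):
    direction = 0 if elem <= read(data, curpos) else 1
    return curpos + direction * stride


def binary_search(elem, data):
    """Return the index of the first element >= elem (len(data) if there is none)."""
    n = len(data)
    log_n = n.bit_length() - 1
    assert n >= 2 and 1 << log_n == n, "length must be a power of two >= 2"
    stride = n // 2
    curpos = stride                           # start at the center
    for _ in range(log_n - 1):                # (LOGN - 1) search passes
        stride //= 2
        curpos = search(curpos, elem, stride, data)
    curpos = search(curpos, elem, 1, data)    # log n-th pass
    return search_fin(curpos, elem, 1, data)  # cleanup pass


sorted_list = [0, 0, 0, 2, 2, 2, 5, 5]
print(binary_search(0, sorted_list), binary_search(2, sorted_list))  # 0 3
print(all(binary_search(k, sorted_list) == bisect.bisect_left(sorted_list, k)
          for k in range(-1, 8)))  # True
```

Every search performs the same number of probes regardless of the key, which is what lets it run on hardware without data-dependent branching.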

52 Nearest Neighbor Search

53 Nearest Neighbor Search. Given a sample point p, find the k points nearest p within a data set. On the CPU, this is easily done with a heap or priority queue, which can add or reject neighbors as the search progresses. We don't know how to build one efficiently on the GPU. kNN-grid: can only add neighbors.

54 kNN-grid Algorithm. (Figure legend: sample point; candidate neighbor; neighbors found; number of neighbors wanted.)

55 kNN-grid Algorithm. Candidate neighbors must be within the maximum search radius. Visit voxels in order of distance to the sample point.

56 kNN-grid Algorithm. If the current number of neighbors found is less than the number requested, grow the search radius.

57 kNN-grid Algorithm. If the current number of neighbors found is less than the number requested, grow the search radius.

58 kNN-grid Algorithm. Don't add neighbors outside the maximum search radius. Don't grow the search radius when a neighbor is outside the maximum radius.

59 kNN-grid Algorithm. Add neighbors within the search radius.

60 kNN-grid Algorithm. Add neighbors within the search radius.

61 kNN-grid Algorithm. Don't expand the search radius if enough neighbors have already been found.

62 kNN-grid Algorithm. Add neighbors within the search radius.

63 kNN-grid Algorithm. Visit all other voxels accessible within the determined search radius. Add neighbors within the search radius.

64 kNN-grid Summary. Finds all neighbors within a sphere centered about the sample point. May locate more than the requested k nearest neighbors. See "Photon Mapping on Programmable Graphics Hardware", Purcell et al.
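
The rules on the kNN-grid slides can be collected into a small CPU sketch. This is an illustrative Python version under assumptions of mine, not the implementation from Purcell et al.: it works in 2D with square cells instead of voxels, the function name `knn_grid` and the parameter `cell` are hypothetical, and cells are ordered by the distance from the sample point to the nearest edge of the cell.

```python
import math


def knn_grid(points, sample, k, max_radius, cell=1.0):
    """Return (neighbors, radius) following the add-only kNN-grid rules."""
    grid = {}
    for p in points:  # bucket the points into grid cells
        key = (math.floor(p[0] / cell), math.floor(p[1] / cell))
        grid.setdefault(key, []).append(p)

    def cell_distance(key):
        # Distance from the sample point to the closest point of the cell.
        dx = max(key[0] * cell - sample[0], 0.0, sample[0] - (key[0] + 1) * cell)
        dy = max(key[1] * cell - sample[1], 0.0, sample[1] - (key[1] + 1) * cell)
        return math.hypot(dx, dy)

    found, radius = [], 0.0
    for key in sorted(grid, key=cell_distance):  # visit cells nearest first
        d_cell = cell_distance(key)
        if d_cell > max_radius or (len(found) >= k and d_cell > radius):
            break  # no remaining cell can contribute a neighbor
        for p in grid[key]:
            d = math.hypot(p[0] - sample[0], p[1] - sample[1])
            if d > max_radius:
                continue              # never add or grow beyond the maximum radius
            if len(found) < k:
                radius = max(radius, d)  # too few neighbors: grow the radius
                found.append(p)
            elif d <= radius:
                found.append(p)       # enough neighbors: add, but do not grow
    return found, radius


pts = [(0.5, 0.5), (0.6, 0.6), (1.5, 0.5), (0.5, 1.4), (3.5, 3.5)]
neighbors, r = knn_grid(pts, sample=(0.5, 0.5), k=3, max_radius=2.0)
print(len(neighbors), r)  # 4 1.0
```

In the usage example three neighbors are requested but four are returned: the radius grew to 1.0 to admit the third point found, and a closer point in a later cell was then added because neighbors can only be added, never rejected.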

65 Outline Today Reduce review Sorting Searching Cg

66 Constant Parameters. Fixed inside the program. Examples: 3.14159..., size of compute window. Example declarations: a const float4 vector v initialized with four literal components; const float pi = 3.14159. Illegal: assigning a new value to pi; float a = pi++.

67 Uniform parameters. Can be passed to a fragment program like normal parameters. A uniform parameter gets its initial value from outside the program, before the fragment program executes. Example: a counter that tracks which pass the algorithm is in. You are allowed to change uniform parameters within the program.

68 Math operators. E.g. cos(x), log(x), pow(x,y), dot(a,b), mul(v, M), sqrt(x), cross(u, v). Using built-in ops is more efficient than writing your own.

69 Swizzling and friends. Swizzle: v1 = (a, -b, c, d); // initialize. v2 = v1.yx; // v2 = (-b, a). s = v1.w; // s = d. Smear: v3 = s.rrr; // v3 = (d, d, d). Can use xyzw or rgba, but not both at once. Write masking: v1 = (e, f, g, h); v1.ar = v2; // v1 = (a, f, g, -b).
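
Swizzling, smearing and write masking can be imitated with tuples to check what each expression produces. The Python sketch below is only an emulation of the Cg syntax: `swizzle` and `write_mask` are hypothetical helper names, a scalar is modelled as a 1-tuple, and the numeric values are sample inputs chosen for illustration.

```python
def _indices(pattern):
    """Translate a swizzle pattern to component indices; xyzw and rgba cannot be mixed."""
    for names in ("xyzw", "rgba"):
        if all(c in names for c in pattern):
            return [names.index(c) for c in pattern]
    raise ValueError("pattern must use only xyzw or only rgba: " + pattern)


def swizzle(v, pattern):
    """Read components in the order given by the pattern (v.yx, s.rrr, ...)."""
    return tuple(v[i] for i in _indices(pattern))


def write_mask(dst, mask, src):
    """Emulate `dst.mask = src`: overwrite only the masked components of dst."""
    out = list(dst)
    for i, value in zip(_indices(mask), src):
        out[i] = value
    return tuple(out)


v1 = (4, -2, 5, 3)
v2 = swizzle(v1, "yx")       # (-2, 4)
s = swizzle(v1, "w")         # (3,)  scalar modelled as a 1-tuple
v3 = swizzle(s, "rrr")       # (3, 3, 3)  smear
print(v2, s, v3)
print(write_mask((1, 5, 3, 2), "ar", v2))  # (4, 5, 3, -2)
```

In the write-mask call, `a` (the fourth component) receives the first component of `v2` and `r` (the first component) receives the second, while the middle two components are left untouched.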

71 The fragment pipeline. float4 v = tex2D(img, float2(x,y)). Texture access is like an array lookup. The value in v can be used to perform another lookup! This is called a dependent read. Texture reads (and dependent reads) are expensive, and are limited in different GPUs. Use them wisely! Credit: Suresh Venkatasubramanian

72 The fragment pipeline. Control flow: (<test>)?a:b operator. if-then-else conditional: [nv3x] both branches are executed, and the condition code is used to decide which value is written to the output register; [nv40] true conditionals. for-loops and do-while: [nv3x] limited to what can be unrolled (i.e. no variable loop limits); [nv40] true looping. WARNING: even though nv40 has true flow control, performance will still suffer if there is no coherence. Credit: Suresh Venkatasubramanian

73 The fragment pipeline. out float4 result : COLOR; // do computation; result = <final answer>. Notes: only the output color can generally be modified (single float4 output on some GPUs). Setting different values in different channels of result can be useful for debugging. There are limits on the number of instructions, both static (program length) and dynamic (number executed). Credit: Suresh Venkatasubramanian

74 Anatomy of a Cg Fragment Program Credit: Paul Kanyuk

75 The fragment pipeline. What comes after fragment programs? Raster operations, then the frame buffer. Depth/stencil tests happen after the fragment program. Blending and aggregation happen as usual. Early z-culling: fragments that would have failed the depth test are killed before executing the fragment program. Optimization point: avoid work in the fragment program if possible. Credit: Suresh Venkatasubramanian

76 Getting data back I: Readbacks. Pipeline: 3D API (OpenGL or Direct3D), GPU Front End, Vertex Processor, Primitive Assembly, Rasterization and Interpolation, Fragment Processor, Raster Operations, Frame Buffer. Readbacks transfer data from the frame buffer to the CPU. Pros: they are very general (any buffer can be transferred); partial buffers can be transferred. Cons: they are slow (reverse data transfer across the PCI/AGP bus is very, very slow; PCIe is better but still slow); data mismatch (readbacks return image data, but the CPU expects vertex data, or has to load the image into a texture). Credit: Suresh Venkatasubramanian

77 Getting data back II: Render-to-texture. Render-to-texture renders directly into a texture. Pros: the transfer does not cross the GPU-CPU boundary; fastest way to transfer data to the fragment processor. Con: only works with depth and color buffers (not stencil). Render-to-texture is the best method for reading data back after a computation. Credit: Suresh Venkatasubramanian

78 Using Render-to-texture. Using the render-texture extension is tricky. You have to set up a pbuffer context, bind an appropriate texture to it, and then render to this context. Then you have to change context and read the bound texture. You cannot write to a texture and read it simultaneously. Mark Harris (NVIDIA) has written a RenderTexture class that wraps all of this. Credit: Suresh Venkatasubramanian

79 The vertex pipeline. Input: vertices (position, color, texture coords). Input: uniform and constant parameters. Matrices can be passed to a vertex program. Lighting/material parameters can also be passed. Credit: Suresh Venkatasubramanian

80 The vertex pipeline. Operations: math/swizzle ops; matrix operators; flow control (as before). [nv3x] No access to textures. Output: modified vertices (position, color). Vertex data is transmitted to primitive assembly. Credit: Suresh Venkatasubramanian

81 Anatomy of a Cg Vertex Program Credit: Paul Kanyuk

82 Vertex programs are useful. We can replace the entire geometry transformation portion of the fixed-function pipeline. Vertex programs are used to change vertex coordinates (move objects around). Shifting operations to vertex programs improves overall pipeline performance. Much of shader processing happens at the vertex level. We have access to the original scene geometry. Credit: Suresh Venkatasubramanian

83 Vertex programs are not useful. Fragment programs allow us to exploit the full parallelism of the GPU pipeline ("a processor at every pixel"). Vertex programs can't read input! [nv3x] Rule of thumb: if a computation requires intensive calculation, it should probably be in the fragment processor. If it requires more geometric/graphic computing, it should be in the vertex processor. Credit: Suresh Venkatasubramanian

84 When might a VP need access to textures? n-body simulation: we have a force field in a texture. Each vertex moves according to this force field: v = a t, s = v t. In each pass, all vertex coordinates are updated. New locations create a new force field. How do we update vertex coordinates? Credit: Suresh Venkatasubramanian

85 Sending data back to vertex program. Solution: [Pass 1] Render all vertices to be stored in a texture. [Pass 2] Compute the force field in a fragment program. [Pass 3] Update the texture containing vertex coordinates in a fragment program using the force field. [Pass 4] Retrieve vertex data from the texture. How? Credit: Suresh Venkatasubramanian

86 Vertex/Pixel Buffer Objects. V/P buffer objects are ways to transfer data between framebuffer/vertex arrays and GPU memory. Conceptually, V/PBOs are like CPU memory, but on the GPU. Can use glReadPixels to read to a PBO. Can create a vertex array from a VBO. Credit: Suresh Venkatasubramanian
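
The pass structure of the n-body example can be sketched on the CPU. The Python below is an illustration under stated assumptions rather than the lecture's method: the force field (attraction toward the centroid of all vertices, unit mass so acceleration equals force) is a hypothetical stand-in, the update is read as a per-time-step increment of velocity and position, and plain lists play the role of the textures that hold vertex data.

```python
def compute_force_field(positions):
    """Pass 2 stand-in: hypothetical force pulling each vertex toward the centroid."""
    cx = sum(p[0] for p in positions) / len(positions)
    cy = sum(p[1] for p in positions) / len(positions)
    return [(cx - x, cy - y) for x, y in positions]


def update_vertices(positions, velocities, forces, dt):
    """Pass 3 stand-in: v += a * dt, then s += v * dt, for every vertex."""
    new_v = [(vx + ax * dt, vy + ay * dt)
             for (vx, vy), (ax, ay) in zip(velocities, forces)]
    new_p = [(x + vx * dt, y + vy * dt)
             for (x, y), (vx, vy) in zip(positions, new_v)]
    return new_p, new_v


def simulate(positions, steps, dt):
    """Repeat the passes; new locations produce a new force field each step."""
    velocities = [(0.0, 0.0)] * len(positions)
    for _ in range(steps):
        forces = compute_force_field(positions)
        positions, velocities = update_vertices(positions, velocities, forces, dt)
    return positions


print(simulate([(0.0, 0.0), (2.0, 0.0)], steps=1, dt=0.5))  # [(0.25, 0.0), (1.75, 0.0)]
```

On the GPU the step that this sketch gets for free, handing the updated positions back to the vertex stage, is exactly the problem the slide's final "How?" raises.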

87 Solution! Pipeline: GPU Front End, Programmable Vertex Processor, Primitive Assembly, Rasterization and Interpolation, Programmable Fragment Processor, Raster Operations, with texture data returned to the vertex processor through a VBO/PBO. Credit: Suresh Venkatasubramanian

88 NV40: Vertex programs can read textures. Pipeline: GPU Front End, Programmable Vertex Processor, Primitive Assembly, Rasterization and Interpolation, Programmable Fragment Processor, Raster Operations, with the vertex processor reading the texture directly. Credit: Suresh Venkatasubramanian

89 Summary of memory flow. Readback: CPU, vertex program, fragment program, frame buffer. Copy-to-texture: CPU, vertex program, fragment program, frame buffer. Render-to-texture: CPU, vertex program, fragment program. Credit: Suresh Venkatasubramanian

90 Summary of memory flow. VBO/PBO transfer: vertex program, fragment program. nv40 texture reference in vertex program: vertex program, fragment program. Credit: Suresh Venkatasubramanian

91 Acknowledgements. Paul Kanyuk, Suresh Venkatasubramanian, Tim Purcell, Mark Harris, Jens Krüger, Ian Buck.
