GPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. S. Ashkiani (UC Davis) GPU Multisplit GTC / 16


1 GPU Multisplit
Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens

2 Motivating Simple Example: Compaction
- Compaction (i.e., binary split)
  - Traditional approach: flags, scan, shuffle
  - e.g., splitter: 10
[Figure: input keys and the compacted output]

3 Motivating Simple Example: Compaction
- Compaction (i.e., binary split)
  - Traditional approach: flags, scan, shuffle
  - e.g., splitter: 10
- Other option: split the input into two buckets
[Figure: input keys with their buckets; output keys with their buckets]

4 Motivating Simple Example: Compaction
- Compaction (i.e., binary split)
  - Traditional approach: flags, scan, shuffle
  - e.g., splitter: 10
- Other option: split the input into two buckets
- Can also be solved by sorting the keys
  - Not always possible
  - Loses stability, i.e., the initial order within buckets is not preserved
[Figure: input keys with their buckets; output keys with their buckets; sorted keys]
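The flags, scan, shuffle pipeline named on this slide can be sketched on the CPU as follows. This is a minimal Python model rather than the authors' GPU code; the function name `compact` and the predicate `k <= 10` for "splitter: 10" are assumptions made for illustration.

```python
from itertools import accumulate

def compact(keys, keep):
    """Keep the keys that satisfy `keep`, preserving their input order."""
    # 1. Flags: mark every key that survives.
    flags = [1 if keep(k) else 0 for k in keys]
    # 2. Scan: exclusive prefix sum of the flags gives each survivor's
    #    output slot; the final entry is the number of survivors.
    offsets = list(accumulate(flags, initial=0))
    # 3. Shuffle: scatter every flagged key to its slot.
    out = [None] * offsets[-1]
    for i, k in enumerate(keys):
        if flags[i]:
            out[offsets[i]] = k
    return out

# Assumed reading of "splitter: 10": keep the keys that are <= 10.
print(compact([12, 3, 10, 25, 7, 18], lambda k: k <= 10))  # [3, 10, 7]
```

On a GPU each of the three steps is a data-parallel pass, which is why the scan result, not a sequential counter, decides where each key lands.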

5 What is Multisplit?
- Multisplit (generalization of binary split)
- Let's try multiple buckets
  - e.g., splitters: 10 and 20
[Figure: input keys with their buckets; output keys with their buckets]

6 What is Multisplit?
- Multisplit (generalization of binary split)
- Let's try multiple buckets
  - e.g., splitters: 10 and 20
- Can also be solved by sorting the keys
[Figure: input keys with their buckets; output keys with their buckets; sorted keys with their buckets]

7 Multisplit primitive
- Input:
  - an unordered set of keys (or key-value pairs)
  - m, the number of buckets
  - a user-specified function that identifies the bucket of each key
- Output: keys (or key-value pairs) separated into m buckets
[Figure: output partitioned into buckets B0, B1, B2, B3]
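The input/output contract above can be written down as a short sequential reference. This is a sketch of the specification only, not of any GPU implementation; `multisplit_reference`, `bucket_of`, and the convention that a key equal to a splitter falls in the lower bucket are all assumptions.

```python
from bisect import bisect_left

def multisplit_reference(keys, m, bucket_of):
    """Stable multisplit: group keys by bucket ID, keeping input order inside each bucket."""
    buckets = [[] for _ in range(m)]
    for k in keys:
        buckets[bucket_of(k)].append(k)
    return [k for bucket in buckets for k in bucket]

# Hypothetical bucket function for splitters 10 and 20:
# bucket 0 holds k <= 10, bucket 1 holds 10 < k <= 20, bucket 2 holds k > 20.
SPLITTERS = [10, 20]

def bucket_of(k):
    return bisect_left(SPLITTERS, k)

print(multisplit_reference([23, 9, 14, 30, 5, 17, 11], 3, bucket_of))
# [9, 5, 14, 17, 11, 23, 30]
```

The output is grouped by bucket but not sorted inside a bucket (9 stays ahead of 5), which is the difference between multisplit and a full sort.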

8 A Fast and Flexible Data-Organization Primitive
- Categorizing key-value pairs into buckets
  - General load balancing
  - Priority queues
- Single Source Shortest Path (SSSP)
  - Serial (Dijkstra): processes the vertex with the lowest weight
  - Bellman-Ford-Moore: processes all vertices in parallel
  - Delta-stepping formulation of SSSP [Davidson et al., 2014]
    - classifies vertices into buckets by their weights
    - processes the lowest weights in parallel
    - but no multisplit primitive was available, so radix sort was used instead
    - using our own multisplit: 2.1x faster
- Other applications
  - colored prefix-sum
  - reorganizing into 8 direction-based buckets in GPU-based ray tracers [Yang et al., 2013]
  - first step in building GPU hash tables [Alcantara et al., 2009]
  - the shallow stages of k-d tree construction [Wu et al., 2011]

9 Common Approaches:
1. Recursive scan-based split
   - log(m) rounds of binary splits
[Figure: buckets B_0 = {i ≤ 40} and B_1 = {i > 40}; initial keys; exclusive scan for B_0; right-to-left exclusive scan for B_1]
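A recursive scan-based split can be modeled as one stable binary split per bit of the bucket ID. The sketch below is a CPU illustration under assumptions: it processes bits from least to most significant, and the names `scan_split`, `recursive_split`, and the example `bucket_of` are hypothetical. Each round follows the figure: bucket-0 items are placed by an exclusive scan, bucket-1 items by a right-to-left exclusive scan.

```python
from itertools import accumulate

def scan_split(items, flags):
    """Stable binary split: flag-0 items fill from the left, flag-1 items from the right."""
    n = len(items)
    zeros = [1 - f for f in flags]
    # left[i]: number of flag-0 items strictly before position i.
    left = list(accumulate(zeros, initial=0))[:-1]
    # right[i]: number of flag-1 items strictly after position i.
    right = list(accumulate(reversed(flags), initial=0))[:-1][::-1]
    out = [None] * n
    for i, item in enumerate(items):
        out[n - 1 - right[i] if flags[i] else left[i]] = item
    return out

def recursive_split(keys, m, bucket_of):
    """Multisplit via ceil(log2(m)) rounds of stable binary splits."""
    items = [(bucket_of(k), k) for k in keys]
    for bit in range((m - 1).bit_length()):      # ceil(log2(m)) rounds
        flags = [(b >> bit) & 1 for b, _ in items]
        items = scan_split(items, flags)
    return [k for _, k in items]

def bucket_of(k):
    """Hypothetical bucket function: three buckets of width 10."""
    return min(k // 10, 2)

print(recursive_split([23, 9, 14, 30, 5, 17, 11], 3, bucket_of))
# [9, 5, 14, 17, 11, 23, 30]
```

Every round moves all n keys, so the cost grows with log(m) even though only the bucket ID bits are inspected.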

10 Common Approaches:
1. Recursive scan-based split
   - log(m) rounds of binary splits
2. Radix sort
   - sorting keys
   - overkill (sorted within buckets)
   - initial order is not preserved
[Figure: initial keys; binary representation; ≤ 7 splits]

11 Common Approaches:
1. Recursive scan-based split
   - log(m) rounds of binary splits
2. Radix sort
   - sorting keys
   - overkill (sorted within buckets)
   - initial order is not preserved
3. Reduced-bit sort
   - sort (bucket ID, key) pairs
   - log m-bit bucket IDs
[Figure: initial keys; new values; new keys; key-value sort]

12 Common Approaches:
1. Recursive scan-based split
   - log(m) rounds of binary splits
2. Radix sort
   - sorting keys
   - overkill (sorted within buckets)
   - initial order is not preserved
3. Reduced-bit sort
   - sort (bucket ID, key) pairs
   - log m-bit bucket IDs
4. Randomized insertions
   - a PRAM algorithm
   - large buffers for buckets
   - random insertions
   - initial order is not preserved
[Figure: initial keys; per-bucket buffers; compaction]
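Of the approaches listed, reduced-bit sort is the easiest to model: label each key with its bucket ID and run a stable key-value sort on the labels only. The sketch below uses Python's built-in stable sort as a stand-in for a GPU radix sort over log m bits; `reduced_bit_sort` and the example `bucket_of` are hypothetical names.

```python
def reduced_bit_sort(keys, bucket_of):
    """Multisplit by sorting (bucket ID, key) pairs on the bucket ID alone."""
    # New keys: the small bucket IDs. New values: the original keys.
    labels = [bucket_of(k) for k in keys]
    # sorted() is stable, so keys with equal labels keep their input order.
    pairs = sorted(zip(labels, keys), key=lambda pair: pair[0])
    return [k for _, k in pairs]

def bucket_of(k):
    """Hypothetical bucket function: three buckets of width 10."""
    return min(k // 10, 2)

print(reduced_bit_sort([23, 9, 14, 30, 5, 17, 11], bucket_of))
# [9, 5, 14, 17, 11, 23, 30]
```

Sorting on the label rather than on the full key is what keeps the work proportional to log m bits instead of the key width.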

13 Designing an Efficient Approach
Stable multisplit: unique permutation + data movement
1. Deriving all permutations: global computations
   - histogram $(h_0, \dots, h_{m-1})$
   - key order per bucket
   For a key $u_i \in B_j$:
   $$p(i) = \underbrace{\sum_{k=0}^{j-1} h_k}_{\text{number of keys in previous buckets}} + \underbrace{\left|\{u_r : u_r \in B_j,\ r < i\}\right|}_{\text{number of keys before me in my own bucket}}$$
[Figure: output partitioned into buckets B_0, B_1, B_2, B_3]
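The permutation formula can be checked with a small worked example. The sketch below evaluates p(i) sequentially as "keys in previous buckets" plus "keys before me in my own bucket"; the function name `multisplit_permutation` and the sample bucket IDs are assumptions for illustration.

```python
from itertools import accumulate

def multisplit_permutation(bucket_ids, m):
    """Return p, where p[i] is the output position of the i-th input key."""
    # Histogram h_0 .. h_{m-1}.
    hist = [0] * m
    for b in bucket_ids:
        hist[b] += 1
    # First term: sum of h_k for k < j (exclusive scan of the histogram).
    bucket_start = list(accumulate(hist, initial=0))[:-1]
    # Second term: number of earlier keys that fall in the same bucket.
    seen = [0] * m
    p = []
    for b in bucket_ids:
        p.append(bucket_start[b] + seen[b])
        seen[b] += 1
    return p

keys = [23, 9, 14, 30, 5, 17, 11]
ids = [2, 0, 1, 2, 0, 1, 1]          # bucket of each key
p = multisplit_permutation(ids, 3)
print(p)                              # [5, 0, 2, 6, 1, 3, 4]

out = [None] * len(keys)
for i, k in enumerate(keys):          # data movement: scatter by p
    out[p[i]] = k
print(out)                            # [9, 5, 14, 17, 11, 23, 30]
```

Both terms depend on keys anywhere in the input, which is why deriving the permutation is a global computation.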

14 Designing an Efficient Approach
Stable multisplit: unique permutation + data movement
1. Deriving all permutations: global computations
   - histogram $(h_0, \dots, h_{m-1})$
   - key order per bucket
2. Final data movements: global random scatters
[Figure: keys scattered into buckets B_0, B_1, B_2, B_3]

15 Our high level ideas
1. Global computations: localize the computations
   - several large-enough local subproblems: local histograms
   - a single small-enough global computation: global histogram
   - several large-enough local subproblems: permutations + scatters
   - avoid shared memory and synchronization: utilize intrinsics
[Figure: local pre-scan; global scan; local post-scan]

16 Our high level ideas
1. Global computations: localize the computations
   - several large-enough local subproblems: local histograms
   - a single small-enough global computation: global histogram
   - several large-enough local subproblems: permutations + scatters
   - avoid shared memory and synchronization: utilize intrinsics
2. Global random scatters: reorder keys locally in the last stage
   - local multisplits
   - more computational cost but a better memory access pattern (coalesced writes)

17 Granularity Tradeoffs
We experimented with a couple of different subproblem granularities:
1. Warp
   - warp-synchronous model with minimal warp divergence
   - fast communication via warp-wide ballots/shuffles
2. Block
   - more expensive communication via shared memory
   - cheaper global computation (scan over m × N_blocks)
   - more locality to extract after reordering

Property                  Direct MS   Warp-level MS          Block-level MS
Subproblem                warp        warp                   block
Reordering                -           warp-wide reordering   block-wide reordering
Computational load        low         medium                 high
Coalesced memory access   low         medium                 high

18 Implementation details & Optimizations
Warp-level MS
1. Pre-scan (local):
   - read keys
   - warp histogram
     1. bit-by-bit balloting
     2. log m rounds
   - store warp histogram
[Figure: warp histograms (h_{0,l}, h_{1,l}, ..., h_{m-1,l}) for warps l = 0, ..., L-1 produced by the pre-scan, laid out bucket by bucket (h_{0,0}, h_{0,1}, ..., h_{0,L-1}, h_{1,0}, ..., h_{m-1,L-1}) for the scan, then read back by the post-scan]

procedure warp_histogram(bucket_id[0:31])
    Input: bucket_id[0:31]    Output: histo[0:m-1]
    for each thread i = 0:31 parallel warp do
        // Thread i counts bucket i; start with every lane as a candidate.
        histo_bmp[i] = 0xFFFFFFFF;
        for (int k = 0; k < ceil(log2(m)); k++) do
            // Bit k of every lane's bucket ID, gathered into one 32-bit word.
            temp_buffer = ballot(bucket_id[i] & 0x01);
            if ((i >> k) & 0x01) then
                histo_bmp[i] &= temp_buffer;
            else
                histo_bmp[i] &= XOR(0xFFFFFFFF, temp_buffer);
            end if
            bucket_id[i] >>= 1;
        end for
        // Lanes still set are exactly those whose bucket ID equals i.
        histo[i] = popc(histo_bmp[i]);
    end for
    return histo[0:m-1];
end procedure
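The ballot-based histogram in the pseudocode can be traced with a CPU emulation of a single 32-lane warp. This is a sketch for checking the logic, not CUDA code: `ballot` here is a plain Python stand-in for the warp vote intrinsic, and the test input is made up.

```python
WARP_SIZE = 32

def ballot(predicates):
    """Emulated warp ballot: bit i of the result is set when lane i's predicate is true."""
    mask = 0
    for lane, pred in enumerate(predicates):
        if pred:
            mask |= 1 << lane
    return mask

def warp_histogram(bucket_ids, m):
    """Count keys per bucket for one warp, for m <= 32 buckets."""
    assert len(bucket_ids) == WARP_SIZE and 1 <= m <= WARP_SIZE
    ids = list(bucket_ids)
    histo_bmp = [0xFFFFFFFF] * WARP_SIZE          # one bitmap per lane
    for k in range((m - 1).bit_length()):         # ceil(log2(m)) rounds
        temp_buffer = ballot([b & 1 for b in ids])
        for lane in range(WARP_SIZE):
            if (lane >> k) & 1:
                histo_bmp[lane] &= temp_buffer                 # keep lanes whose bit k is 1
            else:
                histo_bmp[lane] &= temp_buffer ^ 0xFFFFFFFF    # keep lanes whose bit k is 0
        ids = [b >> 1 for b in ids]
    # Lane i ends with a bitmap of the lanes holding bucket i; popcount gives the count.
    return [bin(histo_bmp[lane]).count("1") for lane in range(m)]

print(warp_histogram([i % 5 for i in range(32)], 5))  # [7, 7, 6, 6, 6]
```

Each round needs one ballot and one bitwise operation per lane, with no shared memory and no explicit synchronization.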

19 Implementation details & Optimizations
Warp-level MS
1. Pre-scan (local):
   - read keys
   - warp histogram
     1. bit-by-bit balloting
     2. log m rounds
   - store warp histogram
2. Scan (global):
   - exclusive scan on histograms
   - m × N_warps elements
[Figure: warp histograms (h_{0,l}, h_{1,l}, ..., h_{m-1,l}) for warps l = 0, ..., L-1 produced by the pre-scan, laid out bucket by bucket (h_{0,0}, h_{0,1}, ..., h_{0,L-1}, h_{1,0}, ..., h_{m-1,L-1}) for the scan]

20 Implementation details & Optimizations
Warp-level MS
1. Pre-scan (local):
   - read keys
   - warp histogram
     1. bit-by-bit balloting
     2. log m rounds
   - store warp histogram
2. Scan (global):
   - exclusive scan on histograms
   - m × N_warps elements
3. Post-scan (local):
   - read keys (key-value)
   - recompute warp histograms
   - compute local offsets
   - warp-level reordering
   - compute final positions
   - final data movement
[Figure: warp histograms (h_{0,l}, h_{1,l}, ..., h_{m-1,l}) for warps l = 0, ..., L-1 produced by the pre-scan, laid out bucket by bucket (h_{0,0}, h_{0,1}, ..., h_{0,L-1}, h_{1,0}, ..., h_{m-1,L-1}) for the scan, then read back by the post-scan]
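The three stages can be put together in a sequential model. This is a simplified sketch under assumptions: warps are emulated as consecutive slices of the input, the warp-level reordering step is left out because it changes only the memory access pattern and not the final positions, and `multisplit_three_stage` and its parameters are hypothetical names.

```python
from itertools import accumulate

def multisplit_three_stage(keys, m, bucket_of, warp_size=32):
    """Stable multisplit via local histograms, one global scan, and local offsets."""
    warps = [keys[i:i + warp_size] for i in range(0, len(keys), warp_size)]
    n_warps = len(warps)

    # 1. Pre-scan (local): one histogram per warp.
    hists = [[0] * m for _ in warps]
    for w, warp in enumerate(warps):
        for k in warp:
            hists[w][bucket_of(k)] += 1

    # 2. Scan (global): exclusive scan over m * n_warps counts, stored
    #    bucket by bucket so that bucket 0 of every warp comes first.
    flat = [hists[w][b] for b in range(m) for w in range(n_warps)]
    base = list(accumulate(flat, initial=0))[:-1]

    # 3. Post-scan (local): final position = scanned base for (bucket, warp)
    #    plus the key's rank among same-bucket keys earlier in its warp.
    out = [None] * len(keys)
    for w, warp in enumerate(warps):
        seen = [0] * m
        for k in warp:
            b = bucket_of(k)
            out[base[b * n_warps + w] + seen[b]] = k
            seen[b] += 1
    return out

keys = [(i * 37) % 100 for i in range(100)]
result = multisplit_three_stage(keys, 4, lambda k: k // 25)
print(result == sorted(keys, key=lambda k: k // 25))  # True: matches a stable sort by bucket ID
```

Only the second stage is global, and its input has m × N_warps elements rather than one element per key.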

21 Performance Evaluation
In this presentation:
1. NVIDIA Tesla K40c GPU
2. Radix sort from CUB, including the one in the reduced-bit sort method
3. Device-wide exclusive scan from CUB
4. Uniform distribution of keys in buckets
More results in the paper:
1. Detailed timing of the different stages of our algorithm
2. Other GPU architectures: Maxwell
3. Different distributions of keys
4. Using our multisplit algorithms in the SSSP method

22 Average running time vs. number of buckets
[Figure: average running time (msec) vs. number of buckets (m) for Block-level MS, Direct MS, Reduced-bit sort, and Warp-level MS; panels (a) Key-only and (b) Key-value]
- Memory access quality: Block-level MS > Warp-level MS > Direct MS
- Computational load: Block-level MS > Warp-level MS > Direct MS

23 Performance vs. Radix Sort
[Figure: speedup vs. number of buckets (m) for Block-level MS, Direct MS, Reduced-bit sort, Warp-level MS, and Binary Split; panels (c) Key-only and (d) Key-value]
- key-only: 3.0x-6.7x
- key-value: 4.4x-8.0x

24 More buckets
- For more buckets than the warp width (m > 32):
  - warp histograms: each thread is in charge of multiple buckets
  - shared memory capacity is the other bottleneck
[Figure: average running time (msec) vs. number of buckets (m) for more buckets, for Block-level MS and reduced-bit sort (key-only and key-value); the legend also lists Radix sort (key-value) and Radix sort (key-only)]
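One way to picture "each thread in charge of multiple buckets" is to let every lane build one bitmap per bucket it owns from the same set of ballots. The sketch below is a CPU emulation under assumptions not stated on the slide: lane i is assigned buckets i, i + 32, i + 64, and so on, and `warp_histogram_many` is a hypothetical name.

```python
from collections import Counter

WARP_SIZE = 32

def warp_histogram_many(bucket_ids, m):
    """Warp histogram for m > 32 buckets, with each lane owning several buckets."""
    assert len(bucket_ids) == WARP_SIZE
    rounds = (m - 1).bit_length()                 # ceil(log2(m)) ballot rounds
    # ballots[k] has bit j set when bit k of lane j's bucket ID is 1.
    ballots = [
        sum(((b >> k) & 1) << lane for lane, b in enumerate(bucket_ids))
        for k in range(rounds)
    ]
    histo = [0] * m
    for lane in range(WARP_SIZE):
        # Assumed assignment: lane owns buckets lane, lane + 32, lane + 64, ...
        for bucket in range(lane, m, WARP_SIZE):
            bmp = 0xFFFFFFFF
            for k in range(rounds):
                if (bucket >> k) & 1:
                    bmp &= ballots[k]
                else:
                    bmp &= ~ballots[k] & 0xFFFFFFFF
            histo[bucket] = bin(bmp).count("1")
    return histo

ids = [(i * 7) % 40 for i in range(32)]
counts = Counter(ids)
print(warp_histogram_many(ids, 40) == [counts[b] for b in range(40)])  # True
```

The number of ballots still grows only with log m, but the per-lane work and the histogram storage grow with m / 32, which is consistent with the slide naming shared memory capacity as a bottleneck.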

25 Conclusions
- Introduced a new, efficient data-organization primitive
- High performance, especially for a low or modest number of buckets
- Full paper: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2016)
- Code will soon be available in CUDPP:

26 Thank You

27 References
Alcantara, D. A., Sharf, A., Abbasinejad, F., Sengupta, S., Mitzenmacher, M., Owens, J. D., and Amenta, N. (2009). Real-time parallel hashing on the GPU. ACM Transactions on Graphics, 28(5):154:1-154:9.
Davidson, A., Baxter, S., Garland, M., and Owens, J. D. (2014). Work-efficient parallel GPU methods for single source shortest paths. In Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2014.
Wu, Z., Zhao, F., and Liu, X. (2011). SAH KD-tree construction on GPU. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, HPG '11.
Yang, X., Xu, D., and Zhao, L. (2013). Efficient data management for incoherent ray tracing. Applied Soft Computing, 13(1):1-8.
