CUB: collective software primitives. Duane Merrill, NVIDIA Research
1 CUB collective software primitives Duane Merrill NVIDIA Research
2 What is CUB? 1. A design model for collective primitives: how to make reusable SIMT software constructs. 2. A library of collective primitives: block-reduce, block-sort, block-histogram, warp-scan, warp-reduce, etc. 3. A library of global primitives: device-reduce, device-sort, device-scan, etc., constructed from the collective primitives; these demonstrate performance and performance-portability.
3 Software reuse
4 Software reuse. Abstraction & composability are fundamental. Reducing redundant programmer effort saves time, energy, and money, and reduces buggy software. Encapsulating complexity empowers ordinary programmers and insulates applications from the underlying hardware. Software reuse empowers a durable programming model.
6 Collective primitives
7 Parallel programming is hard
8 Cooperative parallel programming is hard: parallel decomposition and grain sizing; synchronization; deadlock, livelock, and data races; plurality of state; plurality of flow control (divergence, etc.); bookkeeping control structures; memory access conflicts, coalescing, etc.; occupancy constraints from SMEM, RF, etc.; algorithm selection and instruction scheduling; special hardware functionality, instructions, etc.
10 CUDA today. (Diagram: application thread → CUDA stub → threadblocks.)
11 CUDA today. Collective primitives are the missing layer in today's CUDA software stack. (Diagram: application thread → CUDA stub → threadblocks, each running a BlockSort collective.)
12 What do these have in common? Parallel sparse graph traversal, parallel radix sort, parallel SpMV, parallel BWT compression.
13 What do these have in common? Each is built on block-wide prefix-scan: queue management (parallel sparse graph traversal), partitioning (parallel radix sort), segmented reduction (parallel SpMV), and recurrence solving (parallel BWT compression).
14 Examples of parallel scan data flow, 16 threads contributing 4 items each. (Diagrams: a work-efficient Brent-Kung hybrid, ~30 binary ops, vs. a depth-efficient Kogge-Stone hybrid, ~70 binary ops.)
15 CUDA today. Kernel programming is complicating. (Diagram: application thread → CUDA stub → threadblocks.)
17 Collective design & usage
18 Collective design criteria. Components are easily nested & sequenced. (Diagram: each threadblock's BlockSort is composed from BlockRadixRank, BlockScan, WarpScan, and BlockExchange.)
19 Collective design criteria. Flexible interfaces that scale (& tune) to different block sizes, thread-granularities, etc. (Diagram: the same BlockSort composition instantiated at a different thread block size.)
20 Collective interface design. Parameter fields are separated by concerns:
1. Static specialization interface: template params dictate storage layout and unrolling of algorithmic steps, and allow data placement in fast registers.
2. Reflected shared resource types: reflection enables compile-time allocation and tuning.
3. Collective construction interface: optional params concerning inter-thread communication, orthogonal to function behavior.
4. Operational function interface: method-specific inputs/outputs.

__global__ void ExampleKernel()
{
    // Specialize cub::BlockScan for 128 threads
    typedef cub::BlockScan<int, 128> BlockScanT;

    // Allocate temporary storage in shared memory
    __shared__ typename BlockScanT::TempStorage scan_storage;

    // Obtain a segment of 512 items blocked across 128 threads
    int items[4];
    ...

    // Compute block-wide prefix sum
    BlockScanT(scan_storage).ExclusiveSum(items, items);
}
21 Collective primitive design. Simplified block-wide prefix sum:

template <typename T, int BLOCK_THREADS>
class BlockScan
{
    // Type of shared memory needed by BlockScan
    typedef T TempStorage[BLOCK_THREADS];

    // Per-thread data (shared storage reference)
    TempStorage &temp_storage;

public:
    // Constructor
    BlockScan(TempStorage &storage) : temp_storage(storage) {}

    // Prefix sum operation (each thread contributes its own data item;
    // tid denotes the thread's rank, i.e., threadIdx.x)
    T Sum(T thread_data)
    {
        for (int i = 1; i < BLOCK_THREADS; i *= 2)
        {
            temp_storage[tid] = thread_data;
            __syncthreads();
            if (tid - i >= 0)
                thread_data += temp_storage[tid - i];
            __syncthreads();
        }
        return thread_data;
    }
};
22 Sequencing CUB primitives, using cub::BlockLoad and cub::BlockScan:

__global__ void ExampleKernel(int *d_in)
{
    // Specialize for 128 threads owning 4 integers each   (Specialize, Allocate)
    typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;
    typedef cub::BlockScan<int, 128> BlockScanT;

    // Allocate temporary storage in shared memory
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;

    // Use coalesced (thread-striped) loads and a subsequent local exchange to
    // block a global segment of 512 items across 128 threads   (Load, Scan)
    int items[4];
    BlockLoadT(temp_storage.load).Load(d_in, items);
    __syncthreads();

    // Compute block-wide prefix sum
    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);
    ...
}
23 Nested composition of CUB primitives. (Diagram: cub::BlockScan is built from cub::WarpScan.)
24 Nested composition of CUB primitives. (Diagram: cub::BlockRadixSort is built from cub::BlockRadixRank, which uses cub::BlockScan and cub::WarpScan, plus cub::BlockExchange.)
25 Nested composition of CUB primitives. (Diagram: cub::BlockHistogram, specialized for the BLOCK_HISTO_SORT algorithm, is built from cub::BlockRadixSort and cub::BlockDiscontinuity, with cub::BlockRadixRank, cub::BlockScan, cub::WarpScan, and cub::BlockExchange underneath.)
26 Block-wide and warp-wide CUB primitives: cub::BlockDiscontinuity, cub::BlockExchange, cub::BlockLoad & cub::BlockStore, cub::BlockRadixSort, cub::WarpReduce & cub::BlockReduce, cub::WarpScan & cub::BlockScan, cub::BlockHistogram, and more at the CUB project on GitHub.
27 Tuning with flexible collectives
28 Example: radix sorting throughput (initial GT200 effort, ca. 2010). (Chart: millions of 32-bit keys/s on NVIDIA GTX 580 [1], NVIDIA GTX 480 [1], NVIDIA Tesla C2050 [1], NVIDIA GTX 280 [1], NVIDIA 9800 GTX+ [1], Intel MIC Knights Ferry [4], Intel Core i7 Nehalem 3.2 GHz [2], AMD Radeon HD 6970 [3].) [1] Merrill. Back40 GPU Primitives. [2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010). [3] T. Harada and L. Howes. Introduction to GPU Radix Sort. [4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs.
29 Radix sorting throughput (current). (Chart: millions of 32-bit keys/s on the same platforms: NVIDIA GTX 580 [1], NVIDIA GTX 480 [1], NVIDIA Tesla C2050 [1], NVIDIA GTX 280 [1], NVIDIA 9800 GTX+ [1], Intel MIC Knights Ferry [4], Intel Core i7 Nehalem 3.2 GHz [2], AMD Radeon HD 6970 [3].) [1] Merrill. Back40, CUB GPU Primitives (2013). [2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010). [3] T. Harada and L. Howes. Introduction to GPU Radix Sort. [4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs.
30 Fine-tuning primitives. Tiled prefix sum: data is striped across threads for memory accesses, then blocked across threads for scanning; the scan data flow is tiled from warpscans.

/**
 * Simple CUDA kernel for computing tiled partial sums
 */
template <int BLOCK_THREADS, int ITEMS_PER_THREAD, LoadAlgorithm LOAD_ALGO, ScanAlgorithm SCAN_ALGO>
__global__ void ScanTilesKernel(int *d_in, int *d_out)
{
    // Specialize collective types for problem context
    typedef cub::BlockLoad<int*, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;
    typedef cub::BlockScan<int, BLOCK_THREADS, SCAN_ALGO> BlockScanT;

    // Allocate on-chip temporary storage
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;

    // Load data per thread
    int thread_data[ITEMS_PER_THREAD];
    int offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
    BlockLoadT(temp_storage.load).Load(d_in + offset, thread_data);
    __syncthreads();

    // Compute the block-wide prefix sum
    BlockScanT(temp_storage.scan).Sum(thread_data);
}
31 CUB: device-wide performance-portability vs. Thrust and NPP across the last three major NVIDIA architecture families (Tesla, Fermi, Kepler). (Charts comparing CUB against Thrust v1.7.1 and NPP on Tesla C1060, Tesla C2050, and Tesla K20C: global radix sort, billions of 32-bit keys/s; global prefix scan, billions of 32-bit items/s; global histogram, billions of 8-bit items/s; global reduction, billions of 32-bit items/s; global partition-if, billions of 32-bit inputs/s.)
32 Summary
33 Summary: benefits of using CUB primitives. Simplicity of composition: kernels are simply sequences of primitives (e.g., BlockLoad -> BlockSort -> BlockReduceByKey). High performance: CUB uses the best known algorithms, abstractions, strategies, and techniques. Performance portability: CUB is specialized for the target hardware (e.g., memory conflict rules, special instructions, etc.). Simplicity of tuning: CUB adapts to various grain sizes (threads per block, items per thread, etc.) and provides alternative algorithms. Robustness and durability: CUB supports arbitrary data types and block sizes.
34 Questions? Please visit the CUB project on GitHub Duane Merrill
35 (Diagram: reduce-then-scan device strategy. Each of P tiles reduces its inputs x0, x1, ... to a partial sum; a scan of those partials produces each tile's starting prefix (prefix 0:0, 0:1, ..., 0:P-1); each tile then scans its inputs seeded with its prefix to produce outputs y0, y1, ...)
36 (Diagram: single-pass scan with adaptive look-back. Each tile reduces its inputs, publishes its aggregate, then determines its inclusive prefix by inspecting predecessors' status flags before scanning its outputs y. Status flags: X = invalid, A = aggregate available, P = inclusive prefix available.)
Stanford University NVIDIA Tesla M2090 NVIDIA GeForce GTX 690 Moore s Law 2 Clock Speed 10000 Pentium 4 Prescott Core 2 Nehalem Sandy Bridge 1000 Pentium 4 Williamette Clock Speed (MHz) 100 80486 Pentium
More informationINTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS. Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA
INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA SEQUENCING AND MOORE S LAW Slide courtesy Illumina DRAM I/F
More informationImplementing Radix Sort on Emu 1
1 Implementing Radix Sort on Emu 1 Marco Minutoli, Shannon K. Kuntz, Antonino Tumeo, Peter M. Kogge Pacific Northwest National Laboratory, Richland, WA 99354, USA E-mail: {marco.minutoli, antonino.tumeo}@pnnl.gov
More informationGPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60
1 / 60 GPU Programming Performance Considerations Miaoqing Huang University of Arkansas Fall 2013 2 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationAutomatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo
Automatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo 31 August, 2015 Goals Running CUDA code on CPUs. Why? Performance portability! A major challenge faced
More informationObject Support in an Array-based GPGPU Extension for Ruby
Object Support in an Array-based GPGPU Extension for Ruby ARRAY 16 Matthias Springer, Hidehiko Masuhara Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology June 14, 2016 Object
More informationMay 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND. Mark Harris, May 10, 2017
May 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND Mark Harris, May 10, 2017 INTRODUCING CUDA 9 BUILT FOR VOLTA FASTER LIBRARIES Tesla V100 New GPU Architecture Tensor Cores NVLink Independent Thread Scheduling
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction
More informationA low memory footprint OpenCL simulation of short-range particle interactions
A low memory footprint OpenCL simulation of short-range particle interactions Raymond Namyst STORM INRIA Group With Samuel Pitoiset, Inria and Emmanuel Cieren, Laurent Colombet, Laurent Soulard, CEA/DAM/DPTA
More informationPERFORMANCE ANALYSIS AND DEBUGGING FOR VOLTA. Felix Schmitt 11 th Parallel Tools Workshop September 11-12, 2017
PERFORMANCE ANALYSIS AND DEBUGGING FOR VOLTA Felix Schmitt 11 th Parallel Tools Workshop September 11-12, 2017 INTRODUCING TESLA V100 Volta Architecture Improved NVLink & HBM2 Volta MPS Improved SIMT Model
More informationUnrolling parallel loops
Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:
More informationData Parallel Execution Model
CS/EE 217 GPU Architecture and Parallel Programming Lecture 3: Kernel-Based Data Parallel Execution Model David Kirk/NVIDIA and Wen-mei Hwu, 2007-2013 Objective To understand the organization and scheduling
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware
More informationCOMP 605: Introduction to Parallel Computing Lecture : GPU Architecture
COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University (SDSU) Posted:
More informationScan Primitives for GPU Computing
Scan Primitives for GPU Computing Agenda What is scan A naïve parallel scan algorithm A work-efficient parallel scan algorithm Parallel segmented scan Applications of scan Implementation on CUDA 2 Prefix
More informationKernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow
Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization
More informationExecution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures
Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture
More informationAutomatic FFT Kernel Generation for CUDA GPUs. Akira Nukada Tokyo Institute of Technology
Automatic FFT Kernel Generation for CUDA GPUs. Akira Nukada Tokyo Institute of Technology FFT (Fast Fourier Transform) FFT is a fast algorithm to compute DFT (Discrete Fourier Transform). twiddle factors
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More information