Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs

1 AlgoPARC: Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs. 32nd ACM International Conference on Supercomputing, June 17, 2018. Ben Karsin (1) karsin@hawaii.edu, Volker Weichert (2) weichert@cs.uni-frankfurt.de, Henri Casanova (1) henric@hawaii.edu, John Iacono (3) john.iacono@ulb.ac.be, Nodari Sitchinava (1) nodari@hawaii.edu. (1) DEPARTMENT OF ICS, UNIVERSITY OF HAWAII AT MANOA; (2) GOETHE UNIVERSITY FRANKFURT; (3) DÉPARTEMENT D'INFORMATIQUE, UNIVERSITÉ LIBRE DE BRUXELLES. Work supported by the National Science Foundation under grants and

2 Sorting: A fundamental problem Sorting is a building block Used by countless algorithms...

3 Sorting: A fundamental problem Sorting is a building block Used by countless algorithms... O(N) O(log N)

6 Sorting: A fundamental problem Sorting is a building block Used by countless algorithms... Many solutions

7 Graphics Processing Units Designed for high throughput Extremely Parallel Thousands of cores Huge performance potential Lots of application research No standard performance model

8 NVIDIA GPU Streaming Multiprocessors (SMs): < 20 per GPU, < 200 cores each. [Diagram: NVIDIA GPU with global memory and SMs; SM detail shows control logic, shared memory, processor cores]

9 NVIDIA GPU Streaming Multiprocessors (SMs): < 20 per GPU, < 200 cores each. Memory hierarchy: user-controlled, different scope. [Diagram: as on slide 8]

10 NVIDIA GPU Streaming Multiprocessors (SMs): < 20 per GPU, < 200 cores each. Memory hierarchy: user-controlled, different scope. Thread organization: cores share logic, need lots of parallelism! [Diagram: as on slide 8]

11 Thread Organization [Diagram: SMs and global memory]

12 Thread Organization Threads are grouped into thread-blocks of b threads, which run on the SM. [Diagram: SMs and global memory; thread-block of size b]

13 Thread Organization Threads are grouped into thread-blocks of b threads, which run on the SM. Groups of w = 32 threads form a warp and execute in SIMT lockstep. [Diagram: SMs and global memory; thread-block of size b, warp of size w]

14 Memory Hierarchy 3 levels with different: access scope, capacity, access pattern, latency, peak bandwidth. [Diagram: NVIDIA GPU with global memory and SMs; SM detail shows control logic, shared memory, processor cores]

15 Global Memory Large (up to 32 GB); shared by all threads; slow; blocked accesses (I/O model). [Diagram: as on slide 14]

16 Global Memory Access Pattern Warp - 32 threads execute in lockstep Access global memory together Warp is a single unit 1 operation accesses 32 elements Just like disk accesses in I/O model (B = 32)
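
The analogy with the I/O model can be made concrete with a small counting sketch. It assumes, as the slide does, that global memory is divided into aligned blocks of B = 32 elements and that one warp access costs one transaction per distinct block touched. The function name `warp_transactions` is hypothetical, and real hardware counts transactions in bytes with architecture-dependent segment sizes, so this is only a model-level illustration.

```python
W = 32  # warp width; also the block size B in the I/O-model analogy


def warp_transactions(indices, block=W):
    """Count the distinct aligned blocks of `block` elements touched by one warp access."""
    return len({i // block for i in indices})


coalesced = [t for t in range(W)]      # thread t reads element t
misaligned = [16 + t for t in range(W)]  # same pattern, shifted by half a block
strided = [t * W for t in range(W)]    # thread t reads element 32 * t

print(warp_transactions(coalesced))   # 1
print(warp_transactions(misaligned))  # 2
print(warp_transactions(strided))     # 32
```

Under this model a strided access costs 32 times as much as a coalesced one, which is why the analysis treats a warp access like a block transfer from disk.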

17 Shared Memory Small (48-64 KB per SM); private to SM; user defines sharing; 5-10× faster; unique access pattern, organized into banks. [Diagram: NVIDIA GPU with global memory and SMs; SM detail shows control logic, shared memory, processor cores]

18 Shared Memory Access Pattern Array A is stored across w memory banks. [Diagram: array A laid out over Bank 1 to Bank 4 of shared memory]

19 Shared Memory Access Pattern Separate banks are accessed concurrently. [Diagram: threads T1 to T4, array A, Bank 1 to Bank 4; accesses marked O]

20 Shared Memory Access Pattern Threads accessing the same bank = bank conflict; accesses are serialized. [Diagram: threads T1 to T4, array A, Bank 1 to Bank 4; accesses marked X]
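
A minimal sketch of how a conflict degree could be computed for one warp access. It assumes w = 32 banks, that element i lives in bank i mod w, and that threads reading the very same address are served by a single broadcast; these are modeling assumptions rather than details given on the slides. The helper name `conflict_degree` is hypothetical.

```python
from collections import Counter

W = 32  # number of banks, assumed equal to the warp width w


def conflict_degree(indices, banks=W):
    """Largest number of distinct addresses that fall into one bank.

    A degree of d means the warp access is serialized into d rounds.
    """
    distinct = set(indices)  # identical addresses are assumed to be broadcast
    per_bank = Counter(i % banks for i in distinct)
    return max(per_bank.values())


print(conflict_degree([t for t in range(W)]))       # 1  (stride 1: conflict-free)
print(conflict_degree([2 * t for t in range(W)]))   # 2  (stride 2: 2-way conflict)
print(conflict_degree([W * t for t in range(W)]))   # 32 (stride 32: fully serialized)
```

A stride of 33 is conflict-free again under this model, because 33·t mod 32 = t mod 32.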

21 Registers Small (255 per thread); private to thread; fastest; random access; must be static (known at compile time). [Diagram: NVIDIA GPU with global memory and SMs; SM detail shows control logic, shared memory, processor cores]

22 Talk Outline Motivation/background GPU overview Memory hierarchy State-of-the-art GPU sorting Our multiway mergesort (GPU-MMS) Optimizations Performance results Conclusions & future work

23 State-of-the-art GPU sorting Modern GPU (MGPU): pairwise mergesort. CUB: radix sort, limited application. Thrust: changes algorithm based on input type, comes with CUDA compiler. All are highly engineered and optimized for hardware, and change parameters based on hardware detected.

24 MGPU mergesort Pairwise mergesort; E elements per thread. [Diagram: threads $t_1, t_2, t_3, t_4, \ldots, t_{N/E-1}, t_{N/E}$, each holding E elements]

25 MGPU mergesort Pairwise mergesort; E elements per thread; b threads per thread-block. [Diagram: threads $t_1, \ldots, t_{N/E}$ grouped into thread-blocks of $bE$ elements]

26 MGPU mergesort Pairwise mergesort; E elements per thread; b threads per thread-block. Lots of parallelism: $N/E$ threads! [Diagram: threads $t_1, \ldots, t_{N/E}$ grouped into thread-blocks of $bE$ elements]

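27 MGPU mergesort Each thread-block sorts $bE$ elements. [Diagram: sorted lists of $bE$ elements]

28 MGPU mergesort Each thread-block sorts $bE$ elements. Merge pairs of lists. [Diagram: sorted lists of $bE$ elements]

29 MGPU mergesort Each thread-block sorts $bE$ elements. Merge pairs of lists: $\log \frac{N}{bE}$ merge rounds; b and E are small constants. [Diagram: $\log \frac{N}{bE}$ levels of merging over lists of $bE$ elements]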

30 MGPU bottlenecks Global memory is the main bottleneck. Unavoidable: $O(\log_2 N)$ merge rounds.
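
To get a feel for the number of merge rounds in a pairwise mergesort, the sketch below counts rounds with integer arithmetic. The function name `merge_rounds` and the thread-block size b = 128 are hypothetical choices for illustration; E = 15 is the value a backup slide reports as hard-coded in MGPU.

```python
def merge_rounds(n, base, k=2):
    """Rounds of k-way merging needed to turn sorted runs of `base` elements into one run of n."""
    runs = -(-n // base)       # ceil(n / base) initial sorted runs
    rounds = 0
    while runs > 1:
        runs = -(-runs // k)   # each round merges groups of k runs
        rounds += 1
    return rounds


N = 2 ** 26
b, E = 128, 15  # b is a hypothetical block size; E = 15 as an example value
print(merge_rounds(N, b * E))  # 16
```

Every one of these rounds reads and writes all N elements in global memory, which is the cost the slide identifies as the bottleneck.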

31 Multiway mergesort Reduce global memory bottleneck: merge K lists at a time! $\log_K \frac{N}{B}$ merge rounds. Merging done in internal memory; use a priority queue.

32 Merging K lists Use a heap Load blocks from each list Build min-heap on smallest items K

33 Merging K lists Use a heap Buffer smallest item Heapify to find next smallest K

34 Merging K lists Use a heap Output buffer when full Read block when needed K
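
The heap-based merge on the last three slides can be sketched sequentially with Python's `heapq`. This is a simplified illustration, not the GPU implementation: it keeps one head element per list in the heap and leaves out the block-wise loading and output buffering that the slides describe. The function name `kway_merge` is hypothetical.

```python
import heapq


def kway_merge(lists):
    """Merge K sorted lists using a min-heap keyed on each list's current head."""
    # Heap entries are (value, list index, position within that list).
    heap = [(lst[0], k, 0) for k, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        value, k, pos = heapq.heappop(heap)  # smallest remaining item
        out.append(value)
        if pos + 1 < len(lists[k]):          # refill from the same list
            heapq.heappush(heap, (lists[k][pos + 1], k, pos + 1))
    return out


print(kway_merge([[1, 4, 7], [2, 5, 8], [], [3, 6, 9]]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Each output element costs O(log K) heap work, and the heap itself only needs K entries, which is what allows the merging to stay in fast internal memory.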

35 Parallel Block Heap Warp shares a heap 32 threads all need work K

36 Parallel Block Heap Each node has a sorted list

37 Parallel Block Heap Each node has a sorted list Output

38 Parallel Block Heap Each node has a sorted list Merge child nodes All 32 threads work together Merge

39 Parallel Block Heap Each node has a sorted list Merge child nodes Smallest Largest

40 Parallel Block Heap Each node has a sorted list Merge child nodes Repeat on empty child
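
The "merge child nodes" step can be illustrated sequentially: two sorted child lists are merged, the smallest half moves up to the parent, and the largest half stays at the child level. This sketch assumes both children hold the same number of elements. The slides do not show which child keeps the larger half or how the emptied child is refilled, so neither is modeled here. The function name `merge_children` is hypothetical, and `heapq.merge` stands in for the warp-parallel merge.

```python
import heapq


def merge_children(left, right):
    """Merge two sorted node lists; return (smallest half, largest half)."""
    merged = list(heapq.merge(left, right))  # stand-in for the 32-thread merge
    w = len(left)                            # node capacity, assumed equal for both children
    return merged[:w], merged[w:]


parent, remaining = merge_children([1, 4, 6, 9], [2, 3, 7, 8])
print(parent)     # [1, 2, 3, 4]
print(remaining)  # [6, 7, 8, 9]
```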

41 Multiway mergesort (GPU-MMS) analysis Base case sorts $w^2$ elements; merge groups of K lists per round; $\log_K \frac{N}{w^2}$ rounds. No bank conflicts: perform merging of nodes in registers.

42 Multiway mergesort (GPU-MMS) analysis Base case sorts $w^2$ elements; merge groups of K lists per round; $\log_K \frac{N}{w^2}$ rounds. No bank conflicts: perform merging of nodes in registers. Not work-efficient: a factor of $\log w$ more register accesses.

43 Multiway mergesort (GPU-MMS) analysis Base case sorts $w^2$ elements; merge groups of K lists per round; $\log_K \frac{N}{w^2}$ rounds. No bank conflicts: perform merging of nodes in registers. Not work-efficient: a factor of $\log w$ more register accesses. Low parallelism: lots of shared memory used, dependent operations.
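
As a worked instance of the round count, take N = 2^26 elements and w = 32, so the base case produces sorted runs of w^2 = 2^10 elements. The value K = 16 is an assumption chosen for illustration (the slides do not fix K here); K = 2 is shown for comparison with pairwise merging from the same base case.

```latex
\begin{align*}
\log_K \frac{N}{w^2} &= \frac{\log_2 (N / w^2)}{\log_2 K} \\
K = 16:\quad \log_{16} \frac{2^{26}}{2^{10}} &= \frac{16}{4} = 4 \text{ rounds} \\
K = 2:\quad \log_{2} \frac{2^{26}}{2^{10}} &= 16 \text{ rounds}
\end{align*}
```

Under these assumptions, the multiway merge makes a quarter as many passes over global memory.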

44 Pipelining merge steps Pre-search path to leaf Identify all nodes to be merged

45 Pipelining merge steps Pre-search path to leaf Identify all nodes to be merged Output Merge

46 Tuning K Small K: too many global memory accesses. Large K: not enough parallelism.

47 Sorting Performance Sorting integers on Maxwell GPU

48 Impact of Bank Conflicts Generate input that causes bank conflicts. GPU-MMS is unaffected.

49 Different datatypes Increasing comparison work degrades performance

50 Conclusions Analysis helps us develop better GPU algorithms I/O-efficient techniques work well Minimize global memory accesses Don't forget parallelism

51 Conclusions Analysis helps us develop better GPU algorithms I/O-efficient techniques work well Minimize global memory accesses Don't forget parallelism Future work Optimize GPU-MMS Work efficient (open problem) Apply analysis methods to other algorithms How will future architectures change things?

52 Conclusions Analysis helps us develop better GPU algorithms I/O-efficient techniques work well Minimize global memory accesses Don't forget parallelism Future work Optimize GPU-MMS Work efficient (open problem) Apply analysis methods to other algorithms How will future architectures change things? Thank You! GPU-MMS available:

53 Backup Slides

54 MGPU Merge phase Merge pairs of lists Repeat until sorted TB1 TB2 TB3 TB4

55 MGPU Merge phase Merge pairs of lists Repeat until sorted Find thread-block partition TB1 TB2 TB3 TB4

56 MGPU Merge phase Merge pairs of lists Repeat until sorted Find thread-block partition Each thread-block loads partition into shared memory TB1 TB2 TB3 TB4

57 MGPU Merge phase Merge pairs of lists Repeat until sorted Find thread-block partition Each thread-block loads partition into shared memory And merges... TB1 TB2 TB3 TB4

58 GPU-MMS Bottlenecks Mostly compute-bound. [Chart categories: GMEM, SMEM, Sync, Basecase, Registers]

59 Searching in global memory

60 Model results: MGPU mergesort Model is quite accurate! Shows that E = 31 is ideal for this GPU! (E = 15 is hard-coded)

61 Hiding Latency $t_x$: average time per x operation; minimizing $t_x$ maximizes throughput. But operations have latency...

62 Hiding Latency $t_x$: average time per x operation; minimizing $t_x$ maximizes throughput. But operations have latency... Multiplicity X: multiple threads per core. [Diagram: core, memory, threads]

63 Hiding Latency $t_x$: average time per x operation; minimizing $t_x$ maximizes throughput. But operations have latency... Multiplicity X: multiple threads per core. Thread sends request to slow memory. [Diagram: core, request, memory, threads]

64 Hiding Latency $t_x$: average time per x operation; minimizing $t_x$ maximizes throughput. But operations have latency... Multiplicity X: multiple threads per core. Switch out thread while it waits. [Diagram: core, request, memory, threads]

65 Hiding Latency $t_x$: average time per x operation; minimizing $t_x$ maximizes throughput. But operations have latency... Multiplicity X: multiple threads per core. Schedule new thread to use core. [Diagram: core, request, memory, threads]

66 Hiding Latency $t_x$: average time per x operation; minimizing $t_x$ maximizes throughput. But operations have latency... Multiplicity X: multiple threads per core. Issue more requests to saturate bandwidth. [Diagram: core, two requests, memory, threads]

67 Hiding Latency $t_x$: average time per x operation; minimizing $t_x$ maximizes throughput. But operations have latency... Instruction-level parallelism (ILP) I: consecutive independent instructions. [Diagram: core, memory]

68 Hiding Latency $t_x$: average time per x operation; minimizing $t_x$ maximizes throughput. But operations have latency... Instruction-level parallelism (ILP) I: consecutive independent instructions. Thread requests memory element X. [Diagram: core, request X, memory]

69 Hiding Latency $t_x$: average time per x operation; minimizing $t_x$ maximizes throughput. But operations have latency... Instruction-level parallelism (ILP) I: consecutive independent instructions. Next instruction requests Y. [Diagram: core, request X, memory]

70 Hiding Latency $t_x$: average time per x operation; minimizing $t_x$ maximizes throughput. But operations have latency... Instruction-level parallelism (ILP) I: consecutive independent instructions. Issue next request without waiting for X. [Diagram: core, request Y, request X, memory]

71 Hiding Latency $t_x$: average time per x operation; minimizing $t_x$ maximizes throughput. But operations have latency... Instruction-level parallelism (ILP) I: consecutive independent instructions. Issue more requests to saturate bandwidth. [Diagram: core, request Y, request X, memory]

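72 Impact of X and I (global memory) Copy $2^{16}$ elts. in global memory per thread. When $X \cdot I \geq 8$, performance is limited by bandwidth. [Plot: axes I and X]

73 Impact of X and I (global memory) Copy $2^{16}$ elts. in global memory per thread. When $X \cdot I \geq 8$, performance is limited by bandwidth. [Plot: axes I and X; annotation: $I = 8,\ 1 \leq X$]

74 Impact of X and I (global memory) Copy $2^{16}$ elts. in global memory per thread. When $X \cdot I \geq 8$, performance is limited by bandwidth. [Plot: axes I and X; annotations: $I = 8,\ 1 \leq X$; $I = 4,\ 2 \leq X$]

75 Impact of X and I (global memory) Copy $2^{16}$ elts. in global memory per thread. When $X \cdot I \geq 8$, performance is limited by bandwidth. [Plot: axes I and X; annotations: $I = 8,\ 1 \leq X$; $I = 4,\ 2 \leq X$; $I = 2,\ 4 \leq X$]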

76 Time per memory access Increasing $(X \cdot I)$ reduces latency until peak bandwidth is reached. Parameters for each type of memory: $L_x$ - memory access latency (clock cycles); $B_x$ - peak bandwidth (peak operations per clock cycle, per core). Reduce latency until bandwidth reached: $t_x = \max\left(\frac{1}{B_x}, \frac{L_x}{X \cdot I}\right)$
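
The time-per-access formula on this slide is easy to evaluate directly. In the sketch below, the latency of 8 cycles and bandwidth of 1 operation per cycle are hypothetical values picked so that the bandwidth limit is reached at X · I = 8; they are not measured parameters from the talk. The function name `time_per_op` is also hypothetical.

```python
def time_per_op(latency, bandwidth, x, i):
    """Average time per operation: max(1 / B_x, L_x / (X * I))."""
    return max(1.0 / bandwidth, latency / (x * i))


L_x, B_x = 8, 1  # hypothetical latency (cycles) and peak bandwidth (ops per cycle per core)

print(time_per_op(L_x, B_x, x=1, i=1))  # 8.0  latency-bound
print(time_per_op(L_x, B_x, x=2, i=2))  # 2.0  still latency-bound
print(time_per_op(L_x, B_x, x=2, i=4))  # 1.0  bandwidth limit reached
print(time_per_op(L_x, B_x, x=4, i=4))  # 1.0  extra parallelism no longer helps
```

Multiplicity and ILP are interchangeable in this model: only the product X · I matters, and once it reaches L_x · B_x the time per operation stops improving.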

77 GPU Hardware Parameters Run benchmarks on 3 architectures: ALGOPARC (server in our lab), GIBSON (desktop with GPU), UHHPC (GPU node of UH cluster). [Table: columns ALGOPARC, GIBSON, UHHPC; rows NVIDIA GPU (Quadro M4000, GTX 770, K40), P (total cores), $L_g$, $B_g$, $L_s$, $B_s$, $L_r$, $B_r$ (1, 1, 1)]
