Enterprise. Breadth-First Graph Traversal on GPUs. November 19th, 2015


1 Enterprise: Breadth-First Graph Traversal on GPUs. Hang Liu, H. Howie Huang. November 19th, 2015

2 Graphs Are Ubiquitous


3 Breadth-First Search (BFS) is Important. Wide range of applications: Single Source Shortest Path (SSSP); connectivity detection; distance oracle; reachability problem; centrality problems, e.g., betweenness and closeness centralities.

4 Graphics Processing Unit (GPU). [Figure: GPU architecture with multiple SMXs, each containing an instruction cache, warp schedulers, cores, a register file, and shared memory with L1 cache, connected through an interconnect network to the L2 cache and global memory.] GPU memory hierarchy: register file; shared memory and L1 cache (kilobytes); L2 cache (megabytes); global memory (gigabytes), with access latency in cycles growing at each step down. Thread granularity: Thread -> Warp -> Block -> Grid.

5 Enterprise Innovations: streamlined GPU thread scheduling; GPU workload balancing; hub-vertex-based optimization. Rankings: listed in Graph500 with two GPUs, and in the Green Graph500 Small Data category for the last two years.

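6 Top-Down BFS. [Figure: example graph traversed top-down. With a frontier queue (FQ), the next frontier queue (NFQ) is built using atomic operations. With a status array (SA), indexed by vertex ID and marking each vertex as frontier (F) or unvisited (U), the next status array (NSA) is produced instead.] The status array approach needs to assign threads to non-frontiers.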
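The top-down step on this slide can be sketched on the CPU as below. This is a minimal illustration, not the Enterprise kernel: the function names, the `UNVISITED` encoding, and the example graph are assumptions, and the sequential loop stands in for GPU threads, where the status update would need an atomic operation.

```python
UNVISITED = -1  # assumed encoding for a vertex that has not been reached yet


def top_down_level(adj, status, frontier, level):
    """Expand one BFS level from the frontier queue and return the next frontier queue."""
    next_frontier = []
    for v in frontier:
        for w in adj[v]:
            if status[w] == UNVISITED:
                # On a GPU, concurrent threads would need an atomic operation here
                # so that w is enqueued only once.
                status[w] = level + 1
                next_frontier.append(w)
    return next_frontier


def bfs_top_down(adj, source):
    """Run top-down BFS and return the level of every vertex (UNVISITED if unreachable)."""
    status = [UNVISITED] * len(adj)
    status[source] = 0
    frontier, level = [source], 0
    while frontier:
        frontier = top_down_level(adj, status, frontier, level)
        level += 1
    return status


# Hypothetical undirected example graph, not the one drawn on the slide.
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
levels = bfs_top_down(adj, 0)
```

Here the status array stores the BFS level of each visited vertex, so the frontiers of a level are exactly the entries equal to that level.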

7 Challenge #1: Putting GPU Threads to Good Use. [Figure: frontier percentage (%) per graph for FB, FR, HW, five KR graphs, LJ, OR, PK, RM, TW, and WK.] The average frontier ratio per level across all graphs is very low: ~9%. A more efficient way to use GPU threads is needed.
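To make the frontier ratio concrete, the sketch below computes the percentage of vertices that are frontiers at each level from a finished BFS level array. The function name and the sample level array are assumptions for illustration; the slide's own measurements come from the real datasets.

```python
from collections import Counter


def frontier_percentage_per_level(levels):
    """Return, for each BFS level, the share of all vertices (in %) that are frontiers."""
    reached = [l for l in levels if l >= 0]   # negative values mark unreached vertices
    counts = Counter(reached)
    n = len(levels)
    return [100 * counts[l] / n for l in range(max(reached) + 1)]


# Hypothetical level array for a 5-vertex graph.
levels = [0, 1, 1, 2, 3]
ratios = frontier_percentage_per_level(levels)
average = sum(ratios) / len(ratios)
```

A low average means that a status-array scan spends most of its threads on vertices that are not frontiers.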


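8 Bottom-Up BFS. [Figure: one bottom-up level on the example graph, from status array (SA) to next status array (NSA), with early termination.] Traversal goes from unvisited vertices to visited ones. Early termination reduces the workload. When to switch direction is decided heuristically.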
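A bottom-up level with early termination might look like the following sketch. The names, the `UNVISITED` encoding, and the example graph are assumptions; the returned edge count is only there to show how much work early termination skips.

```python
UNVISITED = -1  # assumed encoding for a vertex that has not been reached yet


def bottom_up_level(adj, status, level):
    """Let every unvisited vertex search its neighbors for a parent at `level`.

    Returns (number of newly visited vertices, number of edges inspected).
    """
    newly_visited = 0
    inspected = 0
    for v in range(len(adj)):
        if status[v] != UNVISITED:
            continue
        for w in adj[v]:
            inspected += 1
            if status[w] == level:
                status[v] = level + 1
                newly_visited += 1
                break  # early termination: one visited parent is enough
    return newly_visited, inspected


# Hypothetical example: vertices 0..3 form a clique, vertex 4 hangs off vertex 3.
adj = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2, 4], [3]]
status = [0, UNVISITED, UNVISITED, UNVISITED, UNVISITED]
result = bottom_up_level(adj, status, 0)
```

In this example the unvisited vertices own 11 edges in total, but only 4 are inspected because vertices 1, 2, and 3 stop at their first neighbor.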

9 Technique #1: Streamlined GPU Thread Scheduling. [Figure labels: Atomic Op, Frontier Queue, Enterprise, Status Array, Compact.]

10 Streamlined GPU Thread Scheduling: Top-Down Workflow. [Figure: threads scan the status array (SA) of the current level and place frontiers in thread bins, which form the next frontier queue (NFQ) used by the follow-up traversal of the next level.] Frontier order matters!
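One way to build a frontier queue from the status array without atomic operations is to give each thread a private bin and then place the bins with a prefix sum. The sketch below simulates that idea sequentially. It is an assumption-laden illustration: the function name, the contiguous slice per thread, and the sample data are not taken from the slides, and a different scan order would give a different frontier order.

```python
from itertools import accumulate


def build_frontier_queue(status, level, num_threads):
    """Compact the vertices whose status equals `level` into a queue via per-thread bins."""
    n = len(status)
    chunk = (n + num_threads - 1) // num_threads  # vertices scanned per simulated thread

    # Step 1: each simulated thread scans its slice of the status array into a private bin.
    bins = []
    for t in range(num_threads):
        lo, hi = t * chunk, min((t + 1) * chunk, n)
        bins.append([v for v in range(lo, hi) if status[v] == level])

    # Step 2: an exclusive prefix sum over the bin sizes gives each thread its write offset.
    sizes = [len(b) for b in bins]
    offsets = [0] + list(accumulate(sizes))[:-1]

    # Step 3: each thread copies its bin to its own region, so no atomics are needed.
    queue = [None] * sum(sizes)
    for t, b in enumerate(bins):
        queue[offsets[t]:offsets[t] + len(b)] = b
    return queue


# Hypothetical status array holding BFS levels (-1 means unvisited).
status = [0, 1, -1, 1, 2, 1, -1, 1]
queue = build_frontier_queue(status, level=1, num_threads=3)
```

With contiguous slices the resulting queue is sorted by vertex ID, which is one concrete way in which the scan pattern determines frontier order.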

11 Streamlined GPU Thread Scheduling: Direction-Switching Workflow. [Figure: the same thread, thread bin, and NFQ pipeline applied to the status array (SA) at the direction-switching level, followed by the follow-up traversal.]

12 Challenge #2: Balancing Workload Between GPU Threads. [Figure: out-degree (log scale) versus percentile of vertices for Gowalla and Orkut.] Different graphs have different out-degree distributions. In Gowalla, 87% of vertices fall below one out-degree threshold and 99.5% fall below a second, higher threshold (the threshold values are illegible). Not every frontier is created equal.

13 Technique #2: GPU Workload Balancing. [Figure: FQ generation and classification turns the status array into four frontier queues by out-degree, from lowest to highest: SmallQueue, MiddleQueue, LargeQueue, and ExtremeQueue, expanded and inspected by a Thread, a Warp, a CTA, and a Grid respectively.] Two steps: (1) classify frontiers while generating the FQ from the SA; (2) assign a different number of threads to different frontiers.
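The two steps can be sketched as a classification pass over the status array. The queue names and the Thread/Warp/CTA/Grid pairing come from the slide; the threshold values below are placeholders chosen for illustration, since the slide's own thresholds did not survive extraction, and the function name and sample data are likewise assumptions.

```python
# Assumed out-degree thresholds (placeholders, not confirmed by the slide).
SMALL_MAX, MIDDLE_MAX, LARGE_MAX = 32, 256, 65536

# Which unit of GPU parallelism works on each queue, as listed on the slide.
WORKER = {
    "SmallQueue": "Thread",
    "MiddleQueue": "Warp",
    "LargeQueue": "CTA",
    "ExtremeQueue": "Grid",
}


def classify_frontiers(out_degree, status, level):
    """Step 1: split the frontiers of `level` into four queues by out-degree."""
    queues = {name: [] for name in WORKER}
    for v, s in enumerate(status):
        if s != level:
            continue  # not a frontier at this level
        d = out_degree[v]
        if d < SMALL_MAX:
            queues["SmallQueue"].append(v)
        elif d < MIDDLE_MAX:
            queues["MiddleQueue"].append(v)
        elif d < LARGE_MAX:
            queues["LargeQueue"].append(v)
        else:
            queues["ExtremeQueue"].append(v)
    return queues


# Hypothetical data: four frontiers at level 1 with very different out-degrees.
out_degree = [3, 40, 1000, 100000, 5]
status = [1, 1, 1, 1, 0]
queues = classify_frontiers(out_degree, status, level=1)

# Step 2: each queue would be handed to a kernel with the matching granularity.
plan = {name: (WORKER[name], members) for name, members in queues.items()}
```

On a GPU, step 2 would launch one kernel per queue so that a high-degree frontier is expanded by many threads while a low-degree one occupies a single thread.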

14 Facebook Execution Timeline. [Figure: kernel timelines in ms for (a) status array, (b) streamlined GPU thread scheduling, and (c) GPU workload balancing, where FQ generation is followed by separate CTA, Warp, and Thread kernels.]

15 Challenge #3: Making Bottom-Up BFS GPU-Aware. In bottom-up BFS the direction-switching level is decided heuristically, and at that level a large portion of the status array is accessed. CPUs have a large LLC (e.g., Xeon E5). GPUs have only a small cache/shared memory of kilobytes per SMX, but it is manually controllable. We have developed a graph-aware, software-controlled caching strategy.

16 Challenge #3: Making Bottom-Up BFS GPU-Aware (Cont'd). [Figure: CDF of total edges versus percentile of vertices for YouTube, Kron, and Wiki-Talk.] A small number of hub vertices contain a considerable share of the edges in YouTube, Kron, and Wiki-Talk. Hub vertices are very important in bottom-up BFS.

17 Technique #3: Graph-Aware, Software-Controlled GPU Cache. [Figure: frontier queue (FQ) vertices with their neighbor lists are checked against the HubCache; a miss falls back to the status array (SA).] Steps: (1) keep the vertex IDs of just-visited hub vertices in shared memory; (2) load the frontier's neighbors in-core; (3) check whether a neighbor ID equals a cached vertex ID.
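The steps can be illustrated with a bottom-up level that consults a small set of cached hub IDs before touching the status array. This is a sketch under stated assumptions: a Python set stands in for GPU shared memory, and `HUB_DEGREE`, the function names, and the example graph are hypothetical.

```python
UNVISITED = -1   # assumed encoding for a vertex that has not been reached yet
HUB_DEGREE = 4   # assumed degree at which a vertex counts as a hub


def build_hub_cache(adj, status, level):
    """Cache the IDs of hub vertices visited at `level` (stand-in for shared memory)."""
    return {v for v in range(len(adj))
            if status[v] == level and len(adj[v]) >= HUB_DEGREE}


def bottom_up_with_hub_cache(adj, status, level, hub_cache):
    """Run one bottom-up level; return (cache hits, cache misses)."""
    hits = misses = 0
    for v in range(len(adj)):
        if status[v] != UNVISITED:
            continue
        # Check each neighbor ID against the cached hub IDs first.
        if any(w in hub_cache for w in adj[v]):
            status[v] = level + 1
            hits += 1
            continue
        # Miss: fall back to inspecting the status array.
        misses += 1
        for w in adj[v]:
            if status[w] == level:
                status[v] = level + 1
                break
    return hits, misses


# Hypothetical example: vertex 0 is a hub with four neighbors, vertex 5 hangs off vertex 4.
adj = [[1, 2, 3, 4], [0], [0], [0], [0, 5], [4]]
status = [0, UNVISITED, UNVISITED, UNVISITED, UNVISITED, UNVISITED]
cache = build_hub_cache(adj, status, level=0)
hits, misses = bottom_up_with_hub_cache(adj, status, 0, cache)
```

Every hit avoids a status-array lookup in global memory, which is where the benefit would come from when a few hubs own most of the edges.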

18 Evaluation. Hardware: GPUs from our own cluster and from Keeneland and Stampede of XSEDE. Metric: GTEPS, billion traversed edges per second. Software: g++ and CUDA; NVIDIA profilers nvprof and nvvp; compiled with an -O optimization flag. All results are averages over repeated runs.
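The GTEPS metric can be written down directly. The sketch below is illustrative only: the function names and the sample numbers are made up, and averaging the per-run GTEPS values (rather than the run times) is an assumption about how the reported average is formed.

```python
from statistics import mean


def gteps(traversed_edges, seconds):
    """Billion traversed edges per second for one BFS run."""
    return traversed_edges / seconds / 1e9


def average_gteps(traversed_edges, run_times):
    """Average the per-run GTEPS over several runs (assumed averaging scheme)."""
    return mean(gteps(traversed_edges, t) for t in run_times)


# Hypothetical numbers: one billion traversed edges, two runs of 0.5 s and 0.25 s.
example = average_gteps(1_000_000_000, [0.5, 0.25])
```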

19 Graph Datasets. [Figure: edge count (million) versus vertex count (million) for FB, FR, HW, five KR graphs, LJ, OR, PK, RM, TW, and WK.]

20 Different Optimizations. [Figure: TEPS (billion, log scale) per graph for Baseline (BL), BL + Thread Scheduling (TS), BL+TS + Workload Balancing (WB), and BL+TS+WB + Hub Caching (HC).] TS improves on the baseline, WB increases performance further, and HC adds a further improvement. The overall speedup ranges from its lowest on a KR graph to its highest on TW.

21 Scalability. [Figure: TEPS (billion) versus GPU count for weak-vertex scaling, weak-edge scaling, and strong scaling.]

22 GPU Counter Analysis. [Figure: power (W) per graph for BL, BL+TS, BL+TS+WB, and BL+TS+WB+HC.] The slide reports an overall power saving and its breakdown across TS, WB, and HC.

23 Conclusion & Future Work. Techniques: streamlined GPU thread scheduling; GPU thread workload balancing; hub-vertex-based optimization. Possible extensions: different workload balancing heuristics; theoretical support for direction switching based on hub vertices.

24 Acknowledgements

25 Thank You {asherliu,


Enterprise: Breadth-First Graph Traversal on GPUs. Hang Liu, H. Howie Huang. Department of Electrical and Computer Engineering, George Washington University. ABSTRACT: The Breadth-First Search (BFS) algorithm