Scalable GPU Graph Traversal. Duane Merrill, Michael Garland, and Andrew Grimshaw. PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Benwen Zhang, Dept. of Computer & Information Sciences, University of Delaware
Introduction Algorithms for analyzing sparse relationships represented as graphs provide crucial tools in many computational fields.
Introduction Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms.
Introduction This paper presents a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum, achieving an asymptotically optimal O(|V| + |E|) work complexity. Contributions: 1. Parallelization strategy 2. Empirical performance characterization 3. High performance
Background We consider graphs of the form G = (V, E) with a set V of n vertices and a set E of m directed edges.
Background compressed sparse row (CSR) sparse matrix format
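To make the CSR layout concrete, here is a small illustrative sketch (not from the paper): a hypothetical 4-vertex directed graph stored as a row-offsets array R and a column-indices array C, the two arrays the later gathering kernels operate on.

```python
# Hypothetical directed graph with edges 0->1, 0->2, 1->2, 2->0, 2->3,
# stored in compressed sparse row (CSR) format.
R = [0, 2, 3, 5, 5]   # row offsets (length n + 1)
C = [1, 2, 2, 0, 3]   # column indices (length m)

def neighbors(v):
    """Adjacency list of vertex v: the slice C[R[v] : R[v+1]]."""
    return C[R[v]:R[v + 1]]
```

Vertex 3 has an empty adjacency list, which CSR encodes simply as two equal consecutive offsets.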
Background Sequential Breadth-First Search Algorithm
Background BFS Graph Traversal
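The sequential baseline can be sketched as the textbook queue-based BFS over the CSR arrays, labeling each vertex with its distance from the source (graph values below are hypothetical, for illustration only).

```python
from collections import deque

def bfs(R, C, source):
    """Sequential BFS over a CSR graph; returns per-vertex distances
    from the source (-1 for unreachable vertices)."""
    n = len(R) - 1
    dist = [-1] * n
    dist[source] = 0
    q = deque([source])
    while q:
        v = q.popleft()
        for u in C[R[v]:R[v + 1]]:   # expand adjacency list of v
            if dist[u] == -1:        # status check: not yet visited
                dist[u] = dist[v] + 1
                q.append(u)
    return dist

# Hypothetical graph: 0->1, 0->2, 1->2, 2->0, 2->3
R = [0, 2, 3, 5, 5]
C = [1, 2, 2, 0, 3]
```

Each vertex and edge is touched once, giving the O(n + m) work that a work-efficient parallelization must match.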
Background Parallel breadth-first search 1. Quadratic parallelizations: inspect every edge or every vertex during every iteration; work complexity is O(n^2 + m), as there may be n BFS iterations in the worst case. 2. Linear parallelizations: each iteration examines only the edges and vertices in that iteration's logical edge- and vertex-frontiers, respectively; a work-efficient parallel BFS algorithm should perform O(n + m) work. 3. Distributed parallelizations: partition the graph structure amongst multiple processors, particularly for very large datasets that are too large to fit within the main memory of a single node.
Background Our parallelization strategy 1. our BFS strategy expands adjacent neighbors in parallel 2. implements out-of-core edge and vertex-frontiers 3. uses local prefix sum for determining enqueue offsets 4. uses a best-effort bitmask for efficient neighbor filtering
Background Prefix scan produces an output list where each element is computed to be the reduction of the elements occurring earlier in the input list. Prefix sum connotes a prefix scan with the addition operator. In the context of parallel BFS, parallel threads use prefix sum when assembling global edge frontiers and global vertex frontiers.
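The enqueue-offset use case can be sketched with a sequential exclusive prefix sum: given each thread's count of items to output, the scan yields every thread's write offset into a shared queue plus the queue's total length (the GPU version runs this scan cooperatively, but the arithmetic is the same).

```python
def exclusive_prefix_sum(counts):
    """Exclusive prefix sum: out[i] is the sum of counts[0..i-1].
    Returns (per-thread write offsets, total output length)."""
    out, total = [], 0
    for c in counts:
        out.append(total)
        total += c
    return out, total

# Hypothetical per-thread neighbor counts in one BFS iteration:
counts = [3, 0, 2, 1]
offsets, total = exclusive_prefix_sum(counts)
```

Thread i then writes its items to the shared queue at positions offsets[i] .. offsets[i] + counts[i] - 1, with no gaps and no collisions.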
Benchmark Suite The majority of the contraction from edge-frontier down to vertex-frontier can actually be performed using duplicate-removal techniques instead of visitation-status lookup.
Microbenchmark Analyses A linear BFS workload is composed of two components: O(n) work related to vertex-frontier processing, and O(m) work related to edge-frontier processing. Because the edge-frontier workload is dominant, we focus our attention on the two fundamental aspects of its operation: neighbor-gathering and status-lookup.
Isolated neighbor-gathering Serial gathering: each thread serially expands neighbors from the column-indices array C. Non-uniform degree distributions can impose significant load imbalance between threads.
Isolated neighbor-gathering Coarse-grained, warp*-based gathering: each thread enlists its entire warp to gather its assigned adjacency list. This approach can suffer underutilization within the warp. *Warp: the set of 32 parallel threads that execute a SIMD instruction
Isolated neighbor-gathering Fine-grained, scan-based gathering: threads construct a shared array of column-indices offsets corresponding to a CTA**-wide concatenation of their assigned adjacency lists. The entire CTA then gathers the referenced neighbors from the column-indices array C using this perfectly packed gather vector. Workload imbalance can still occur in the form of underutilized cycles during offset sharing. **A CTA is an array of concurrent threads that cooperate to compute a result
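A sequential sketch of the scan-based idea (the real version runs the scan and gather cooperatively across the CTA; the graph values are hypothetical): a prefix sum over the threads' degrees produces a perfectly packed vector of offsets into C, which is then consumed as a flat gather.

```python
def scan_based_gather(R, C, assigned_vertices):
    """Build a packed gather vector of column-indices offsets for a
    CTA-wide concatenation of adjacency lists, then gather from C."""
    degrees = [R[v + 1] - R[v] for v in assigned_vertices]
    # Exclusive prefix sum of degrees -> write positions in shared array
    offsets, total = [], 0
    for d in degrees:
        offsets.append(total)
        total += d
    gather = [0] * total
    for v, base in zip(assigned_vertices, offsets):
        for i in range(R[v + 1] - R[v]):
            gather[base + i] = R[v] + i   # offset into C
    # The "CTA" reads the packed vector: one neighbor per slot, no gaps
    return [C[g] for g in gather]

# Hypothetical CSR graph: 0->1, 0->2, 1->2, 2->0, 2->3
R = [0, 2, 3, 5, 5]
C = [1, 2, 2, 0, 3]
```

Because the gather vector is dense, every thread of the CTA has useful work during the gather phase regardless of how skewed the individual degrees are.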
Isolated neighbor-gathering Scan+warp+CTA gathering (hybrid): we can further mitigate inter-warp workload imbalance by introducing a third granularity of thread enlistment: the entire CTA. CTA-wide gathering processes very large adjacency lists; warp-based gathering acquires adjacency lists smaller than the CTA size but greater than the warp width; scan-based gathering efficiently acquires the remaining loose ends.
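The three-tier dispatch above can be sketched as a simple threshold rule per adjacency list. The CTA size of 256 below is a hypothetical configuration value, not one prescribed by the paper; only the warp width of 32 comes from the hardware.

```python
WARP = 32     # threads per warp (hardware SIMD width)
CTA = 256     # hypothetical CTA (thread block) size; an assumption

def gather_strategy(degree):
    """Choose an enlistment granularity for one adjacency list."""
    if degree >= CTA:
        return "cta"    # entire CTA cooperates on a very large list
    if degree >= WARP:
        return "warp"   # one warp cooperates on a medium list
    return "scan"       # small lists packed via scan-based gathering
```

Large lists never leave a warp idle, and the scan tier mops up the loose ends, which is why the hybrid limits all forms of expansion imbalance.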
neighbor-gathering analysis
neighbor-gathering analysis This hybrid scan+warp+CTA strategy demonstrates good gathering rates and limits all forms of load imbalance from adjacency-list expansion.
Isolated status-lookup Bitmask: reduces the size of status data from a 32-bit label to a single bit per vertex. Because we avoid atomic operations, our bitmask is only a conservative approximation of visitation status. Bitmask + Label: if a status bit is unset, we then check the corresponding label. Warp culling: using shared memory per warp, each thread hashes in the neighbor it is currently inspecting. History culling: maintains a cache of recently-inspected vertex identifiers in local shared memory.
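The bitmask heuristic can be sketched as a test-and-set filter. The key property is its asymmetry: a set bit guarantees the vertex was already inspected, but an unset bit is inconclusive (on the GPU, non-atomic updates can be lost to benign races), so an unset bit triggers the fallback check of the full label.

```python
def bitmask_test_and_set(bitmask, v):
    """Best-effort visitation filter over a packed bit array.
    Returns True only if v was definitely seen before; a False result
    is inconclusive and callers must consult the per-vertex label.
    (On the GPU the |= below is non-atomic and updates may be lost.)"""
    word, bit = divmod(v, 32)          # 32 status bits per word
    seen = (bitmask[word] >> bit) & 1
    bitmask[word] |= 1 << bit
    return bool(seen)
```

Filtering most duplicates with one bit per vertex keeps the status structure small enough to be cache-resident, which is the point of the technique.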
Coupling of gathering and lookup The coupled kernel requires O(m) less overall data movement. However, the fused kernel likely suffers from TLB misses experienced by the neighbor-gathering workload, and it inherits the worst aspects of both workloads.
Single-GPU Parallelizations A complete solution must couple expansion and contraction activities. 1. Expand-contract (out-of-core vertex queue): based upon the fused gather-lookup benchmark kernel; consumes the vertex queue for the current BFS iteration and produces the vertex queue for the next. Requires 2n global storage and generates 5n+2m global data movement. 2. Contract-expand (out-of-core edge queue): filters previously-visited and duplicate neighbors from the current edge queue; the surviving vertices are then expanded and copied out into the edge queue for the next iteration. Requires 2m global storage and generates 3n+4m explicit global data movement.
Single-GPU Parallelizations Two-phase (out-of-core vertex and edge queues): this implementation isolates the expansion and contraction workloads into separate kernels. Requires n+m global storage and generates 5n+4m explicit global data movement. Hybrid: combines the relative strengths of the contract-expand and two-phase approaches. If the edge queue for a given BFS iteration contains more vertex identifiers than resident threads, we invoke the two-phase implementation for that iteration; otherwise we invoke the contract-expand implementation. The hybrid approach inherits the 2m global storage requirement from the former and the 5n+4m explicit global data movement from the latter.
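The per-iteration dispatch rule of the hybrid can be sketched in a few lines. The resident-thread count below is a hypothetical device-dependent value, an assumption for illustration; the decision rule itself is the one described above.

```python
RESIDENT_THREADS = 1 << 15   # hypothetical number of co-resident GPU threads

def choose_kernel(edge_queue_len):
    """Pick the implementation for one BFS iteration: two-phase for
    saturating (large-frontier) iterations, contract-expand for
    fleeting (small-frontier) iterations."""
    if edge_queue_len > RESIDENT_THREADS:
        return "two-phase"
    return "contract-expand"
```

Small frontiers avoid the overhead of launching two kernels, while large frontiers get the better-structured two-phase workload: separate implementations for saturating versus fleeting workloads, as the conclusion notes.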
Multi-GPU Parallelizations We implement a simple partitioning of the graph into equally-sized, disjoint subsets of V. Graph traversal proceeds in level-synchronous fashion. 1. Invoke the expansion kernel on each GPU 2. Invoke a fused filter+partition operation for each GPU that sorts neighbors within Qedge_i by ownership into p bins. 3. Barrier across all GPUs 4. Invoke p-1 contraction kernels on each GPU_i to stream and filter the incoming neighbors from its peers. This assembles each vertex queue Qvertex_i for the next BFS iteration.
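Step 2 above can be sketched as binning neighbor identifiers by owning GPU. The contiguous-range ownership rule below is an illustrative assumption (one simple way to realize "equally-sized, disjoint subsets of V"), not necessarily the paper's exact mapping.

```python
import math

def partition_by_owner(edge_queue, n, p):
    """Bin neighbor identifiers by owning GPU, assuming each GPU owns a
    contiguous range of ceil(n/p) vertex identifiers (an assumption)."""
    chunk = math.ceil(n / p)
    bins = [[] for _ in range(p)]
    for v in edge_queue:
        bins[v // chunk].append(v)   # GPU (v // chunk) owns vertex v
    return bins
```

After the barrier, GPU_i streams in bins[i] from each of its p-1 peers and contracts them against its local visitation status.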
Multi-GPU Parallelizations We observe slowdowns for datasets having large average search depth, which require more global synchronization, and speedups for datasets that have small diameters and require little global synchronization.
Conclusion This paper has demonstrated that GPUs are well-suited for sparse graph traversal. It distills several general themes for implementing sparse and dynamic problems for the GPU machine model: Prefix sum is an effective mechanism for coordinating shared data among threads. GPU threads should cooperatively assist each other with data movement tasks. Fusing heterogeneous tasks does not always produce the best results. The relative I/O contribution from global task redistribution can be less costly than anticipated. It is useful to provide separate implementations for saturating versus fleeting workloads.