Scalable GPU Graph Traversal


Scalable GPU Graph Traversal. Duane Merrill, Michael Garland, and Andrew Grimshaw. PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Presented by Benwen Zhang, Dept. of Computer & Information Sciences, University of Delaware.

Introduction Algorithms for analyzing sparse relationships represented as graphs provide crucial tools in many computational fields.

Introduction Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms.

Introduction This paper presents a BFS parallelization focused on fine-grained task management built from efficient prefix sums, achieving an asymptotically optimal O(|V| + |E|) work complexity. 1. Parallelization strategy 2. Empirical performance characterization 3. High performance

Background We consider graphs of the form G = (V, E) with a set V of n vertices and a set E of m directed edges.

Background The graph is represented in compressed sparse row (CSR) sparse-matrix format.
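
For reference, a minimal sketch of the CSR layout assumed here: a row-offsets array R of length n+1 and a column-indices array C of length m, following the paper's array names; the tiny example graph is illustrative only.

    // CSR representation of a directed graph G = (V, E), |V| = n, |E| = m.
    // The out-neighbors of vertex v are C[R[v]] .. C[R[v+1]-1].
    #include <vector>

    struct CsrGraph {
        std::vector<int> R;  // row offsets, size n+1
        std::vector<int> C;  // column indices, size m
    };

    // Example: 4 vertices, edges 0->1, 0->2, 1->3, 2->3
    CsrGraph example_graph() {
        return CsrGraph{{0, 2, 3, 4, 4}, {1, 2, 3, 3}};
    }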

Background Sequential Breadth-First Search Algorithm
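
As a baseline, a minimal sketch of the sequential BFS against which work efficiency is measured, operating on the CSR arrays R and C (the function and variable names are my own):

    #include <queue>
    #include <vector>

    // Labels each vertex with its BFS depth from `source`; unreachable vertices keep -1.
    std::vector<int> sequential_bfs(const std::vector<int>& R,
                                    const std::vector<int>& C, int source) {
        std::vector<int> depth(R.size() - 1, -1);
        std::queue<int> q;
        depth[source] = 0;
        q.push(source);
        while (!q.empty()) {
            int v = q.front(); q.pop();
            for (int e = R[v]; e < R[v + 1]; ++e) {  // expand v's adjacency list
                int n = C[e];
                if (depth[n] == -1) {                // visit each vertex exactly once
                    depth[n] = depth[v] + 1;
                    q.push(n);
                }
            }
        }
        return depth;                                // O(n + m) total work
    }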

Background BFS Graph Traversal

Background Parallel breadth-first search
1. Quadratic parallelizations: inspect every edge or every vertex during every iteration; work complexity is O(n² + m), since there may be n BFS iterations in the worst case.
2. Linear parallelizations: each iteration examines only the edges and vertices in that iteration's logical edge- and vertex-frontiers, respectively; a work-efficient parallel BFS algorithm should perform O(n + m) work.
3. Distributed parallelizations: partition the graph structure amongst multiple processors, particularly for very large datasets that are too large to fit within the main memory of a single node.

Background Our parallelization strategy
1. Our BFS strategy expands adjacent neighbors in parallel.
2. It implements out-of-core edge- and vertex-frontiers.
3. It uses local prefix sums for determining enqueue offsets.
4. It uses a best-effort bitmask for efficient neighbor filtering.

Background Prefix scan produces an output list where each element is computed to be the reduction of the elements occurring earlier in the input list. Prefix sum connotes a prefix scan with the addition operator. In the context of parallel BFS, parallel threads use prefix sum when assembling global edge frontiers and global vertex frontiers.
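
To make the enqueue mechanism concrete, here is a minimal host-side illustration (my own, not the paper's code) of how an exclusive prefix sum over per-thread neighbor counts yields each thread's unique write offset plus the total queue advancement:

    #include <vector>

    // Exclusive prefix sum over per-thread item counts. Thread i may then write its
    // items contiguously at queue[tail + offsets[i]]; `total` is added to the tail.
    std::vector<int> enqueue_offsets(const std::vector<int>& counts, int& total) {
        std::vector<int> offsets(counts.size());
        int running = 0;
        for (size_t i = 0; i < counts.size(); ++i) {
            offsets[i] = running;
            running += counts[i];
        }
        total = running;
        return offsets;
    }

    // e.g. counts = {3, 0, 2, 5}  ->  offsets = {0, 3, 3, 5}, total = 10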

Benchmark Suite The majority of the contraction from edge-frontier down to vertex-frontier can actually be performed using duplicate-removal techniques instead of visitation-status lookup.

Microbenchmark Analyses A linear BFS workload is composed of two components: O(n) work related to vertex-frontier processing and O(m) work related to edge-frontier processing. Because the edge-frontier work is dominant, we focus our attention on the two fundamental aspects of its operation: neighbor-gathering and status-lookup.

Isolated neighbor-gathering Serial gathering: each thread serially expands the neighbors from the column-indices array C. Non-uniform degree distributions can impose significant load imbalance between threads.
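
A minimal device-side sketch of serial gathering (kernel name and queue layout are illustrative, not the paper's code): each thread walks its own adjacency list, so a thread assigned a high-degree vertex keeps working while its warp-mates sit idle.

    // Each thread expands the neighbors of one frontier vertex serially.
    __global__ void serial_gather(const int* R, const int* C,
                                  const int* vertex_frontier, int frontier_size,
                                  int* edge_frontier, const int* scatter_offset) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= frontier_size) return;
        int v   = vertex_frontier[tid];
        int out = scatter_offset[tid];            // offset from a prior prefix sum
        for (int e = R[v]; e < R[v + 1]; ++e) {   // loop length = degree(v),
            edge_frontier[out++] = C[e];          // which varies across threads
        }
    }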

Isolated neighbor-gathering Coarse-grained, warp*-based gathering: each thread enlists its entire warp to gather its assigned adjacency list. This approach can suffer underutilization within the warp. (*Warp: the set of 32 parallel threads that execute a SIMD instruction.)

Isolated neighbor-gathering Fine-grained, scan-based gathering: threads construct a shared array of column-indices offsets corresponding to a CTA**-wide concatenation of their assigned adjacency lists. The entire CTA then gathers the referenced neighbors from the column-indices array C using this perfectly packed gather vector. Workload imbalance can still occur in the form of underutilized cycles during offset-sharing. (**A CTA is an array of concurrent threads that cooperate to compute a result.)

Isolated neighbor-gathering Scan+warp+CTA gathering (hybrid): we can further mitigate inter-warp workload imbalance by introducing a third granularity of thread enlistment: the entire CTA. CTA-wide gathering processes very large adjacency lists; warp-based gathering acquires adjacency lists smaller than the CTA size but greater than the warp width; scan-based gathering efficiently acquires the remaining loose ends.
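
A hedged sketch of the enlistment policy only, with thresholds as I read them from the description above; the actual CTA-wide, warp-wide, and scan-based gather routines are elided as comments:

    // Per-thread dispatch for hybrid gathering inside one CTA, where
    // degree = R[v+1] - R[v] for the thread's assigned frontier vertex v.
    __device__ void hybrid_gather_policy(int degree) {
        const int WARP_SIZE = 32;
        const int CTA_SIZE  = blockDim.x;
        if (degree >= CTA_SIZE) {
            // 1) Enlist the entire CTA to gather this very large adjacency list.
        } else if (degree >= WARP_SIZE) {
            // 2) Enlist the thread's warp for mid-sized adjacency lists.
        } else {
            // 3) Loose ends: pack offsets via a CTA-wide prefix sum and use
            //    fine-grained, scan-based gathering.
        }
    }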


neighbor-gathering analysis This hybrid scan+warp+CTA strategy demonstrates good gathering rates: it limits all forms of load imbalance from adjacency-list expansion.

Isolated status-lookup
Bitmask: reduces the size of the status data from a 32-bit label to a single bit per vertex. Because we avoid atomic operations, the bitmask is only a conservative approximation of visitation status.
Bitmask + Label: if a status bit is unset, we then check the corresponding label (see the sketch after this list).
Warp culling: using per-warp shared memory, each thread hashes in the neighbor it is currently inspecting.
History culling: maintains a cache of recently inspected vertex identifiers in local shared memory.
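
A minimal device-side sketch of the bitmask-then-label check, assuming a label value of -1 marks an unvisited vertex (the function name is my own). The non-atomic bit update is what makes the mask a best-effort, conservative filter:

    // Returns true if neighbor v can be culled. A set bit means v has been (or is
    // about to be) visited; an unset bit is inconclusive because bits are written
    // without atomics, so we fall back to the authoritative per-vertex label.
    __device__ bool status_lookup(int v, unsigned int* bitmask, const int* labels) {
        unsigned int word = bitmask[v >> 5];
        unsigned int bit  = 1u << (v & 31);
        if (word & bit) return true;
        bitmask[v >> 5] = word | bit;    // best-effort set, no atomic needed
        return labels[v] != -1;          // duplicate survivors are benign
    }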


Coupling of gathering and lookup The coupled (fused) kernel requires O(m) less overall data movement, yet it likely suffers from the TLB misses experienced by the neighbor-gathering workload, and it inherits the worst aspects of both workloads.

Single-GPU Parallelizations A complete solution must couple expansion and contraction activities.
1. Expand-contract (out-of-core vertex queue): based upon the fused gather-lookup benchmark kernel. It consumes the vertex queue for the current BFS iteration and produces the vertex queue for the next. Requires 2n global storage and generates 5n + 2m global data movement.
2. Contract-expand (out-of-core edge queue): filters previously-visited and duplicate neighbors from the current edge queue; the surviving vertices are then expanded and copied out into the edge queue for the next iteration. Requires 2m global storage and generates 3n + 4m explicit global data movement.

Single-GPU Parallelizations
3. Two-phase (out-of-core vertex and edge queues): isolates the expansion and contraction workloads into separate kernels. Requires n + m global storage and generates 5n + 4m explicit global data movement.
4. Hybrid: combines the relative strengths of the contract-expand and two-phase approaches. If the edge queue for a given BFS iteration contains more vertex identifiers than resident threads, we invoke the two-phase implementation for that iteration; otherwise we invoke the contract-expand implementation (see the sketch below). The hybrid approach inherits the 2m global storage requirement from the former and the 5n + 4m explicit global data movement from the latter.
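
A host-side sketch of this per-iteration dispatch (resident_threads and the two launch helpers are placeholders, not the paper's API):

    void launch_two_phase();        // placeholder: two-phase implementation
    void launch_contract_expand();  // placeholder: contract-expand implementation

    // Choose the traversal strategy for each BFS iteration by frontier size.
    void bfs_iteration(int edge_queue_size, int resident_threads) {
        if (edge_queue_size > resident_threads) {
            launch_two_phase();         // saturating workload: separate kernels
        } else {
            launch_contract_expand();   // fleeting workload: one fused kernel
        }
    }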


Multi-GPU Parallelizations We implement a simple partitioning of the graph into equally-sized, disjoint subsets of V. Graph traversal proceeds in level-synchronous fashion (one iteration is sketched below):
1. Invoke the expansion kernel on each GPU.
2. Invoke a fused filter+partition operation on each GPU that sorts the neighbors within Qedge_i by ownership into p bins.
3. Barrier across all GPUs.
4. Invoke p-1 contraction kernels on each GPU_i to stream and filter the incoming neighbors from its peers. This assembles each vertex queue Qvertex_i for the next BFS iteration.
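
A hedged host-side sketch of one level-synchronous iteration under these assumptions (all helper names are placeholders; the step numbers match the list above):

    void expand(int i);                       // 1. expansion kernel on GPU_i
    void filter_and_partition(int i, int p);  // 2. sort Qedge_i into p ownership bins
    void barrier_all_gpus();                  // 3. level-synchronous barrier
    void contract_incoming(int i, int j);     // 4. stream/filter neighbors from GPU_j

    void multi_gpu_bfs_iteration(int p) {     // p = number of GPUs
        for (int i = 0; i < p; ++i) {
            expand(i);
            filter_and_partition(i, p);
        }
        barrier_all_gpus();
        for (int i = 0; i < p; ++i)
            for (int j = 0; j < p; ++j)
                if (j != i) contract_incoming(i, j);  // assembles Qvertex_i
    }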

Multi-GPU Parallelizations We observe slowdowns for datasets that have a large average search depth and require more global synchronization, and speedups for datasets that have small diameters and require little global synchronization.

Conclusion This paper has demonstrated that GPUs are well-suited for sparse graph traversal. It distills several general themes for implementing sparse and dynamic problems for the GPU machine model: Prefix sum is an effective mechanism for threads to coordinate the placement of shared data. GPU threads should cooperatively assist each other with data movement tasks. Fusing heterogeneous tasks does not always produce the best results. The relative I/O contribution from global task redistribution can be less costly than anticipated. It is useful to provide separate implementations for saturating versus fleeting workloads.