A Comparative Study on Exact Triangle Counting Algorithms on the GPU


Leyuan Wang, Yangzihao Wang, Carl Yang, John D. Owens
University of California, Davis, CA, USA
HPGP'16, Kyoto, Japan, 31st May 2016

Outline
1. Introduction: motivation; methods and challenges; our contributions
2. Parallel Triangle Counting Algorithms: TC using subgraph matching; TC using set intersection; TC using matrix multiplication
3. Experiments and Analysis
4. Conclusions

Introduction: 1. Motivation

Why triangle counting? Graphs model interactions between entities in many applications, and finding small subgraph patterns is useful for understanding the underlying structure of these graphs. The most important such pattern is the triangle: many important measures of a graph are triangle-based, e.g. the clustering coefficient and the transitivity ratio.

Introduction: 2. Methods and Challenges

Subgraph Matching-based TC
Problem definition: find all occurrences of a small query graph in a large data graph. We can use subgraph matching to solve triangle counting by first defining the query graph to be a triangle, then assigning a unified label to both the triangle and the data graph.
Backtracking strategy (Ullmann [3]): incrementally compose partial solutions, abandoning a partial solution as soon as it is determined that it cannot be completed.
Improvements: filtering rules, join orders, and auxiliary information to prune out false-positive candidates as early as possible, thereby increasing performance.
Advantages:
- Gets the listings of all the triangles for free.
- Can be generalized to find embeddings of triangles with certain label patterns.
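
The backtracking idea can be sketched on the CPU for the unlabeled triangle query. This is a minimal illustration with names of our own choosing, not the paper's GPU implementation, which replaces the depth-first search with a filtering-and-joining strategy:

```python
# Minimal sketch of Ullmann-style backtracking for an unlabeled triangle
# query: extend a partial mapping one query vertex at a time, abandoning
# it as soon as the next vertex cannot be matched.
def match_triangles(adj):
    """adj: dict vertex -> set of neighbors (undirected, no self-loops).
    Returns the set of triangles, each as a frozenset of three vertices."""
    triangles = set()
    for u in adj:                      # candidate for query vertex q0
        for v in adj[u]:               # extend: q1 must neighbor q0
            for w in adj[v]:           # extend: q2 must neighbor q1...
                if w in adj[u]:        # ...and close the cycle back to q0
                    triangles.add(frozenset((u, v, w)))
    return triangles

adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(len(match_triangles(adj)))  # 2 triangles: {0,1,2} and {0,2,3}
```

Because every permutation of a triangle's vertices is explored, the frozenset de-duplicates the six mappings per triangle; the GPU method instead prunes candidates up front.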

Introduction: 2. Methods and Challenges

Intersection-based TC
For each edge, count the number of common neighbors (the intersection) between the neighbor lists of its two end vertices.
[Figure: pipeline example on a small input graph. Advance+Filter forms the filtered edge list ELIST; batch intersection produces a per-edge TC count; a reduction sums the counts into the total.]
An edge e = (u, v), where u and v are its two end vertices, can form triangles with edges connected to both u and v. Let the intersection of the neighbor lists of u and v be (w_1, w_2, ..., w_N), where N is the size of the intersection. Then the number of triangles formed with e is N, where the three edges of the i-th such triangle are (u, v), (w_i, u), and (w_i, v), for i ∈ [1, N].
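
As a concrete serial sketch of this per-edge rule (an illustrative analogue with hypothetical names, not the paper's GPU code):

```python
# Intersection-based triangle counting: for each edge (u, v), the number
# of triangles through that edge equals |N(u) ∩ N(v)|.
def count_triangles(adj):
    """adj: dict vertex -> list of neighbors (undirected, no self-loops)."""
    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            if u < v:  # filter: consider each undirected edge once
                total += len(set(adj[u]) & set(adj[v]))
    # each triangle is counted once per edge, i.e. three times in total
    return total // 3

adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(count_triangles(adj))  # 2
```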

Introduction: 2. Methods and Challenges

Matrix Multiplication-based TC
Method (based on the algorithm of Azad et al. [1]):
1. Order the rows of the adjacency matrix A by increasing vertex degree.
2. Break the rearranged matrix into a lower triangular piece L and an upper triangular piece U such that A = L + U.
3. Count all wedges (i, k, j), where (i, k) is an edge in L with i > k and (k, j) is an edge in U with k < j, by computing B = L · U.
4. Find whether the wedges close via the element-wise (Hadamard) product C = A ∘ B.
5. Count the triangles by summing all the elements of C and dividing the sum by 2.
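
The five steps above can be sketched with dense matrices in plain Python. The degree reordering of step 1 is a performance optimization and is omitted here; the paper's implementation uses sparse cuSPARSE csrgemm rather than this illustrative dense product:

```python
# Dense sketch of the Azad et al.-style formulation:
# split A into L + U, count wedges with B = L·U, close them with C = A∘B,
# then halve the sum of C.
def tc_matrix(A):
    n = len(A)
    # step 2: lower/upper triangular split of the adjacency matrix
    L = [[A[i][j] if i > j else 0 for j in range(n)] for i in range(n)]
    U = [[A[i][j] if i < j else 0 for j in range(n)] for i in range(n)]
    # step 3: B[i][j] = number of wedges i-k-j with (i,k) in L, (k,j) in U
    B = [[sum(L[i][k] * U[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    # steps 4-5: the Hadamard product with A keeps only closed wedges;
    # each triangle is counted twice, so halve the sum
    s = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n))
    return s // 2

# 4-vertex graph with edges 0-1, 0-2, 0-3, 1-2, 2-3 (triangles {0,1,2}, {0,2,3})
A = [[0, 1, 1, 1], [1, 0, 1, 0], [1, 1, 0, 1], [1, 0, 1, 0]]
print(tc_matrix(A))  # 2
```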

Introduction: 3. Our Contributions

We present a comparative study of three specialized parallel triangle counting algorithms with highly scalable implementations on NVIDIA GPUs:
- We apply a state-of-the-art subgraph matching algorithm to triangle counting based on a filtering-and-joining strategy, achieving a speedup of 9–260× over a sequential CPU implementation.
- We develop a best-of-class intersection operator and integrate it into our second triangle counting method, yielding a speedup as high as 630× over the sequential CPU implementation.
- We formulate a parallel matrix-multiplication-based method that achieves a 5–26× speedup over the sequential CPU implementation.

Parallel Triangle Counting Algorithms: Gunrock Programming Model [4]

- Graph represented in Compressed Sparse Row (CSR) format: the sparse adjacency matrix's row offsets, column indices, and values.
- Bulk-synchronous programming: a series of parallel operations separated by global barriers.
- Data-centric abstraction: operations are defined on one or more frontiers of active vertices/edges.
  - Advance: generates a new frontier by visiting the neighboring vertices/edges of the current frontier (work distribution/load balancing).
  - Filter: removes elements from a frontier via a validation test.
  - Compute: user-defined vertex/edge-centric computations that run in parallel; can be combined with advance or filter.
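
A minimal sketch of the CSR layout described above (an illustrative helper of our own, not Gunrock's API):

```python
# CSR for an unweighted graph: row_offsets[v]..row_offsets[v+1] indexes
# the slice of col_indices holding vertex v's neighbors.
def to_csr(adj, n):
    """adj: dict vertex -> sorted neighbor list, for vertices 0..n-1."""
    row_offsets = [0]
    col_indices = []
    for v in range(n):
        col_indices.extend(adj.get(v, []))
        row_offsets.append(len(col_indices))
    return row_offsets, col_indices

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
offs, cols = to_csr(adj, 3)
# neighbors of vertex v are cols[offs[v]:offs[v+1]]
print(offs, cols)  # [0, 2, 4, 6] [1, 2, 0, 2, 0, 1]
```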

Parallel Triangle Counting Algorithms: 1. TC using Subgraph Matching

Challenges:
- GPU operations are based on warps (groups of threads executed in single-instruction, multiple-data fashion), so the different execution paths generated by backtracking algorithms can cause warp divergence.
- Irregular memory access patterns leave memory accesses uncoalesced, which degrades GPU performance.
Solutions: follow a filtering-and-joining procedure using Gunrock's programming model.
- Filtering: prune away candidate vertices that cannot contribute to the final solutions.
- Joining: collect candidate edges for each query edge, then combine the edges that satisfy the predefined intersection rules.

Parallel Triangle Counting Algorithms: 2. TC using Set Intersection

Method:
1. Form the edge list.
2. Compute the set intersection of the two neighbor lists of each edge.
Optimizations:
- Visit all neighbor lists with Gunrock's Advance operator, then filter out half of the edges by node degree/ID order to avoid redundant counting.
- Use a dynamic grouping strategy to divide the edge list into two groups, (1) small neighbor lists and (2) large neighbor lists, and use different work distribution methods for the two groups in batch set intersection to achieve load balancing and maximize resource utilization.
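
The per-edge primitive behind batch set intersection can be sketched as a two-pointer merge over sorted neighbor lists; this is an illustrative serial version, not the GPU kernel, which distributes this work across threads according to the list-size group:

```python
# Merge-style intersection of two sorted neighbor lists: advance the
# pointer at the smaller value; a match means one triangle for this edge.
def intersect_sorted(a, b):
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:            # common neighbor found
            count += 1
            i += 1
            j += 1
    return count

print(intersect_sorted([1, 3, 5, 7], [2, 3, 4, 7, 9]))  # 2 (common: 3 and 7)
```

The merge runs in O(|a| + |b|) time, which is why differently sized list pairs benefit from different work distribution strategies.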

Parallel Triangle Counting Algorithms: 3. TC using Matrix Multiplication

Challenges:
- An efficient sparse matrix multiplication implementation is needed.
- Since sparse matrix multiplication must generate an intermediate array, redundant work is done within the matrix formulation.
Solutions:
- Use the highly optimized GPU sparse matrix multiplication routine csrgemm from cuSPARSE to compute B.
- Compute the Hadamard product only for L (or U) instead of all of A, since C is symmetric.
- Assign a thread to each row and use a counter to track the number of triangles found by each thread, allowing the final sum to be computed with a single parallel reduction.

Experiments and Analysis: Dataset

Dataset            Vertices     Edges         Max Degree  Type
coAuthorsCiteseer  227,320      1,628,268     1,372       rs
coPapersDBLP       540,486      30,491,458    3,299       rs
road_central       14,081,816   33,866,826    8           rm
soc-LJ             4,847,571    137,987,546   20,292      rs
cit-Patents        3,774,768    33,037,896    770         rs
com-Orkut          3,072,441    234,370,166   33,007      rs

Table: Dataset description. The edge count shown is the number of directed edges when the graphs are treated as undirected (if two nodes are connected, there are two directed edges between them), with redundant edges de-duplicated. Graph types are: r: regular, s: scale-free, and m: mesh-like. The datasets are from the DIMACS10 Graph Challenge and the Stanford Network Analysis Project (SNAP).

Experiments and Analysis: Performance

Figure: Execution-time speedup for our four GPU implementations, Green et al.'s GPU implementation, Shun et al.'s 40-core CPU implementation, and Green et al.'s 40-core CPU implementation, all normalized to a baseline CPU implementation (https://bitbucket.org/seshadhri/escape).

Experiments and Analysis: Complexity Analysis

[Figure: log-log plots of runtime (ms) versus sum of square degrees. Left, intersection-based method, fitted line f(x) = 3.059023 × 10^-7 x; right, matrix-based method, fitted line f(x) = 6.144212 × 10^-6 x.]
The runtime of triangle counting on a particular graph should be proportional to the sum of square degrees (SSD) of that graph: O(Σ_{v ∈ V} d(v)^2).
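
For reference, the SSD quantity on the plots' x-axis is directly computable from an adjacency list (hypothetical helper name, ours):

```python
# Sum of square degrees (SSD): the complexity model predicts triangle
# counting runtime proportional to this quantity.
def sum_square_degrees(adj):
    """adj: dict vertex -> list of neighbors."""
    return sum(len(nbrs) ** 2 for nbrs in adj.values())

adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(sum_square_degrees(adj))  # 3^2 + 2^2 + 3^2 + 2^2 = 26
```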

Experiments and Analysis: Subgraph Matching Generality Tests

Figure: Speedups on the Gowalla (left) and Enron (right) datasets of our PSM versus a previous state-of-the-art GPU implementation by Tran et al. [2].

Experiments and Analysis: Discussion

- Intersection-based TC shows the best performance among our three GPU implementations.
- Subgraph-matching-based TC shows some performance wins on mesh-like datasets, since we prune out many leaf nodes in our filtering step.
- Our intersection-based TC achieves better results than previous intersection-based implementations because (1) we filter the edge list and reform the induced subgraph to reduce workload, and (2) we use dynamic work scheduling to maintain high GPU resource utilization. For scale-free graphs, (1) gives us constant speedups over our implementation without this step; but for road networks and some small scale-free graphs, the overhead of (1) causes a performance drop.
- Our matrix-multiplication-based TC is bounded by the same complexity as our intersection-based TC; the low slope at the start of its runtime curve indicates large overhead from the SpGEMM algorithm, which is the bottleneck of our implementation.

Experiments and Analysis: Future Work

- For subgraph-matching-based TC, finding a good matching order for the query graph can greatly reduce the number of intermediate results.
- For intersection-based TC, we expect further performance gains from tuning the launch settings of the large-neighbor-list intersection kernel. We also believe that a third kernel doing batch set intersection between one large and one small neighbor list via binary search will further improve performance.
- For matrix-multiplication-based TC, we believe performance can be improved by avoiding multiplications where the input adjacency matrix A is known to be zero and by avoiding writing the multiplication output to global memory.

Conclusions

- We implemented and compared three triangle counting methods on the GPU, based on subgraph matching, set intersection, and matrix multiplication.
- Our intersection-based implementation achieves state-of-the-art performance, with potential for even better performance.
- Our matrix-multiplication-based TC implementation shows that SpGEMM is the performance bottleneck due to its unavoidable redundant work.
- Our subgraph matching implementation is memory-bound, but shows good performance on mesh-like graphs, and it also allows programmers to extend the method to match more complicated subgraph patterns.

Conclusions: Acknowledgements

Thanks to Seshadhri Comandur for providing the CPU baseline code for comparisons. We appreciate the funding support of DARPA XDATA under US Army award W911QX-12-C-0059, DARPA STTR awards D14PC00023 and D15PC00010, the National Science Foundation under grants CCF-1017399 and OCI-1032859, and a Lawrence Berkeley Laboratory internship. Thanks also to NVIDIA for equipment donations and server time.
Contact: leywang@ucdavis.edu

References

[1] A. Azad, A. Buluç, and J. Gilbert. Parallel triangle counting and enumeration using matrix algebra. In IEEE International Parallel and Distributed Processing Symposium Workshop, IPDPSW 2015, pages 804–811, 2015.

[2] H.-N. Tran, J.-j. Kim, and B. He. Fast subgraph matching on large graphs using graphics processors. In M. Renz, C. Shahabi, X. Zhou, and A. M. Cheema, editors, Proceedings of the 20th International Conference on Database Systems for Advanced Applications, DASFAA 2015, pages 299–315. Springer International Publishing, Cham, Apr. 2015.

[3] J. R. Ullmann. An algorithm for subgraph isomorphism. Journal of the ACM, 23(1):31–42, Jan. 1976.

[4] Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens. Gunrock: A high-performance graph processing library on the GPU. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2016, pages 11:1–11:12, Mar. 2016.