Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations


Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations George M. Slota 1 Sivasankaran Rajamanickam 2 Kamesh Madduri 3 1 Rensselaer Polytechnic Institute, 2 Sandia National Labs, 3 The Pennsylvania State University slotag@rpi.edu, srajama@sandia.gov, madduri@cse.psu.edu GABB2017 29 May 2017 1 / 28

Highlights
- We present an empirical evaluation of the effect of partitioning and ordering on the performance of several prototypical graph computations
- We give experimental evidence that uncommonly-used quality metrics for partitioning and ordering might better predict performance for certain computations
- We introduce a BFS-based ordering method that approximates RCM, gives better computational performance, and can be computed twice as fast
2 / 28

Main Idea
- It is generally understood that graph layout (partitioning + ordering) affects the performance of the distributed graph computations (the building blocks, or GABBs) that comprise more complex analytics
- All graph computations/analytics/frameworks require some layout choice
- Goals of this preliminary study:
  - Examine the general relationship between layout choices and performance impact, especially relative to naïve methods such as randomization
  - E.g., attempt to correlate partitioning and ordering quality metrics with real performance for a graph computation with known properties
  - Identify avenues for future investigation
3 / 28

Graph Layout
- Graph layout: a (vertex ordering + partitioning) pair
- We consider specifically 1D partitioning along with per-part vertex reordering
- Impacts and considerations:
  - Load balance
  - Communication reduction
  - NUMA and memory access locality
  - Overheads for partitioning and ordering methods
4 / 28

Partitioning (1D)
- Partitioning: separating a graph into vertex-disjoint sets under certain balance constraints and optimization criteria
- Balance constraints: computational workload per task/part
  - Workload is application specific
  - Might depend on the number (or summed weights) of vertices and/or the number of edges per task
- Optimization criteria: communication workload
  - Again, application specific
  - Might depend on total edges cut (edge cut), total summed size of one-hop neighborhoods (communication volume), or maximum communication load of a single task (max per-part cut/volume); see the sketch below
5 / 28
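To make these optimization criteria concrete, here is a small illustrative sketch (not code from the talk) that computes the total edge cut and the maximum per-part cut for a 1D vertex partition stored as a part array over a CSR graph; the data layout and function names are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Minimal CSR graph: vertex v's neighbors are adj[offsets[v] .. offsets[v+1]).
// For an undirected graph stored with both edge directions, each cut edge is
// counted once per direction.
struct CSRGraph {
    std::vector<int64_t> offsets;  // size n + 1
    std::vector<int64_t> adj;      // size m (directed edge count)
};

// Given a 1D partition (part[v] = owning task), compute the total edge cut
// and the maximum cut charged to any single part.
void cut_metrics(const CSRGraph& g, const std::vector<int32_t>& part,
                 int32_t num_parts, int64_t& edge_cut, int64_t& max_part_cut) {
    std::vector<int64_t> per_part_cut(num_parts, 0);
    edge_cut = 0;
    const int64_t n = static_cast<int64_t>(g.offsets.size()) - 1;
    for (int64_t v = 0; v < n; ++v) {
        for (int64_t e = g.offsets[v]; e < g.offsets[v + 1]; ++e) {
            const int64_t u = g.adj[e];
            if (part[v] != part[u]) {
                ++edge_cut;                // edge crosses a part boundary
                ++per_part_cut[part[v]];   // charge the cut to v's owner
            }
        }
    }
    max_part_cut = *std::max_element(per_part_cut.begin(), per_part_cut.end());
}
```

Communication volume would instead count, per part, the distinct remote vertices appearing in the one-hop neighborhoods of that part's owned vertices.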

Partitioning Methods
For varying numbers of optimization criteria and balance constraints:
- No criteria, no constraint: random/hash partitioning
- No criteria, single constraint: block methods
- Single criterion, single constraint: KaHIP tools, Scotch, many others
- Single criterion, multiple constraints: (Par)METIS, PaToH
- Multiple criteria, multiple constraints: (Xtra)PuLP
Can we empirically show what kinds of applications might benefit from these varying methods?
6 / 28

Vertex Ordering
- Vertex (re)ordering: creating some specific ordering of the numeric vertex identifiers of a graph
- Want to minimize some metric related to the numeric identifiers f(v) of the vertices v ∈ V(G)
- Minimizing differences between identifiers of vertices in the neighborhood of some vertex might improve spatial locality
- Minimizing overall differences between neighborhoods of neighboring vertices might improve temporal locality
7 / 28

Ordering Optimization Criteria
- Bandwidth: maximum distance from the diagonal to a nonzero in any row of the adjacency matrix
  $\max\{\, |f(v_i) - f(v_j)| \ :\ (v_i, v_j) = e \in E(G) \,\}$
- Co-location ratio: how often nonzeros appear in consecutive columns of the same row of the adjacency matrix
  $\frac{1}{|V|} \sum_{v \in V} \frac{1}{|N(v)|} \Big( 1 + \sum_{i=2}^{|N(v)|} C(v_i) \Big), \quad C(v_i) = \begin{cases} 1, & \text{if } f(v_i) - f(v_{i-1}) = 1 \\ 0, & \text{otherwise} \end{cases}$
- Logarithmic gap arrangement: total spread of nonzeros over all rows of the adjacency matrix
  $\sum_{v \in V} A(v), \quad A(v) = \sum_{i=2}^{|N(v)|} \log |f(u_i) - f(u_{i-1})|, \quad u_i \in N(v)$
8 / 28
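As a rough illustration of how such metrics can be measured, the sketch below computes the bandwidth, a simplified co-location ratio, and the logarithmic gap sum for a CSR graph whose vertices have already been relabeled by the ordering. It is a hypothetical helper, not the authors' code, and its normalizations are simplified relative to the exact formulas above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Assumes vertex IDs equal positions in the ordering and that each adjacency
// list is sorted by ID.
//   bandwidth:   largest |u - v| over all edges (v, u)
//   coloc_ratio: average fraction of each adjacency list whose entries sit in
//                consecutive columns of the row (gap of exactly 1)
//   log_gap_sum: sum over all rows of log2 of the gaps between consecutive entries
struct OrderStats { int64_t bandwidth = 0; double coloc_ratio = 0.0; double log_gap_sum = 0.0; };

OrderStats order_stats(const std::vector<int64_t>& offsets,
                       const std::vector<int64_t>& adj) {
    OrderStats s;
    const int64_t n = static_cast<int64_t>(offsets.size()) - 1;
    for (int64_t v = 0; v < n; ++v) {
        const int64_t deg = offsets[v + 1] - offsets[v];
        if (deg == 0) continue;
        int64_t consecutive = 0;
        for (int64_t e = offsets[v]; e < offsets[v + 1]; ++e) {
            int64_t d = adj[e] - v;
            if (d < 0) d = -d;
            s.bandwidth = std::max(s.bandwidth, d);
            if (e > offsets[v]) {
                const int64_t gap = adj[e] - adj[e - 1];  // nonnegative if sorted
                if (gap == 1) ++consecutive;
                if (gap > 0) s.log_gap_sum += std::log2(static_cast<double>(gap));
            }
        }
        s.coloc_ratio += static_cast<double>(consecutive) / static_cast<double>(deg);
    }
    if (n > 0) s.coloc_ratio /= static_cast<double>(n);
    return s;
}
```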

Vertex Ordering
Prior methods:
- No criteria: random methods
- Community/clustering methods: label propagation, Rabbit Order
- Neighborhood methods: Gray Order, Shingle Order
- Traversal-based methods: BFS, Cuthill-McKee, Reverse Cuthill-McKee (RCM)
Our new method:
- BFS-based approximation of RCM that avoids explicit sorting (see the sketch below)
- Gives approximately a 2× speedup relative to RCM
- We term this method combined with PuLP-MM partitioning DGL (Distributed Graph Layout); by itself in the results, we abbreviate it simply as BFS
9 / 28
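The BFS-based ordering is only described at a high level here, so the following is a speculative sketch of the core idea: relabel vertices in BFS visitation order, skipping the per-frontier degree sort (and final reversal) that RCM performs. Root selection, component handling, and parallelization details are assumptions for illustration.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Relabel vertices by BFS visitation order from a given root. RCM would
// additionally visit each vertex's neighbors in increasing-degree order and
// reverse the final permutation; skipping the sort is what makes this cheaper.
std::vector<int64_t> bfs_order(const std::vector<int64_t>& offsets,
                               const std::vector<int64_t>& adj, int64_t root) {
    const int64_t n = static_cast<int64_t>(offsets.size()) - 1;
    std::vector<int64_t> new_id(n, -1);
    int64_t next = 0;
    std::queue<int64_t> q;
    new_id[root] = next++;
    q.push(root);
    while (!q.empty()) {
        const int64_t v = q.front(); q.pop();
        for (int64_t e = offsets[v]; e < offsets[v + 1]; ++e) {
            const int64_t u = adj[e];
            if (new_id[u] == -1) { new_id[u] = next++; q.push(u); }
        }
    }
    // Unreached vertices (other components) get the remaining labels.
    for (int64_t v = 0; v < n; ++v)
        if (new_id[v] == -1) new_id[v] = next++;
    return new_id;  // new_id[v] is v's position in the new ordering
}
```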

Experimental Approach
- We consider several different distributed graph computations
- Run computations under several partitioning + ordering combinations
- Look at total execution times and end-to-end times (including partitioning and ordering)
- Also look at computation and communication times separately to see the specific impacts of partitioning and ordering
10 / 28

Experimental Setup
- Test systems: Compton at Sandia National Labs and Blue Waters at NCSA - 2x Sandy Bridge CPUs with 64 GB memory
- Layout methods:
  - Partitioners: METIS, METIS-M, PuLP-MM, PuLP-M, Random (with random ordering)
  - Orderings: RCM, DGL, Random (with PuLP-MM partitioning)
- Graph computations: PageRank, subgraph isomorphism, BFS, SSSP, SPARQL queries
- Test instances:
  - Social networks: LiveJournal, Orkut, Twitter
  - Web crawls: uk-2005, WebBase, sk-2005
  - RDF graphs: BSBM, LUBM, DBpedia
11 / 28

Partitioner Comparison
Geometric means over all graphs for 16 and 64 parts. V_max and E_max are the maximum vertex and edge imbalances; values for edge cut (EC) and max per-part cut (EC_max) are improvements relative to the random method (avg/min/max over graphs).

Partitioner | V_max  E_max | EC imp. (avg  min   max) | EC_max imp. (avg  min   max)
Random      | 1.15   1.7   |          1     1     1   |              1     1     1
METIS       | 1.1    3.88  |          7.71  1.5   107 |              2.39  0.25  63
METIS-M     | 1.1    1.5   |          4.4   1.02  41  |              2.16  0.77  22
PuLP-M      | 1.1    1.5   |          5.5   1.17  64  |              2.1   0.54  23
PuLP-MM     | 1.1    1.5   |          5     1.19  63  |              3.18  2.54  204
12 / 28

Ordering Comparison
Geometric mean ordering performance over all graphs and all five partitioning methods. Performance measured by co-location ratio and logarithmic gap sum ratio (a normalized version of the gap sum metric).

         | Co-loc. ratio (higher is better) | Gap sum ratio (lower is better)
         | BFS    RCM    Rand               | BFS    RCM    Rand
16 parts | 0.298  0.112  0.000              | 0.030  0.034  0.048
64 parts | 0.274  0.119  0.000              | 0.007  0.008  0.011
13 / 28

PageRank
- Power method, run for a fixed number of iterations (20)
- Parallelization: bulk synchronous OpenMP+MPI
- Communication: moderate, O(m) per iteration
- Computation: moderate, O(m) per iteration
(Figure source: Wikipedia)
14 / 28
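For reference, a minimal single-node sketch of power-iteration PageRank; the distributed version used in the talk additionally exchanges boundary vertex values across MPI tasks each iteration. The damping factor and pull-based CSR layout here are assumptions.

```cpp
#include <cstdint>
#include <vector>

// Single-node PageRank by power iteration over a CSR graph whose adjacency
// lists store *incoming* neighbors; out_degree[v] is v's out-degree (any vertex
// appearing as an in-neighbor has out-degree >= 1, so no division by zero).
// Each iteration touches every edge once: O(m) work, matching the slide.
std::vector<double> pagerank(const std::vector<int64_t>& in_offsets,
                             const std::vector<int64_t>& in_adj,
                             const std::vector<int64_t>& out_degree,
                             int iterations = 20, double damping = 0.85) {
    const int64_t n = static_cast<int64_t>(in_offsets.size()) - 1;
    std::vector<double> rank(n, 1.0 / n), next(n, 0.0);
    for (int it = 0; it < iterations; ++it) {
        for (int64_t v = 0; v < n; ++v) {
            double sum = 0.0;
            for (int64_t e = in_offsets[v]; e < in_offsets[v + 1]; ++e) {
                const int64_t u = in_adj[e];
                sum += rank[u] / static_cast<double>(out_degree[u]);  // pull from in-neighbors
            }
            next[v] = (1.0 - damping) / n + damping * sum;
        }
        rank.swap(next);
    }
    return rank;
}
```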

PageRank: Results
- 16 tasks; speedups vs. the random methods for partitioning (top) and ordering (bottom)
- Overall mean speedup is best for PuLP-MM at 4.5×, vs. 3.4-3.6× for the other methods
- Mean speedup for the BFS ordering is 2.0×, vs. 1.9× for RCM
15 / 28

PageRank: Execution Timelines
16 tasks running on WebBase. Clockwise from top left: Random, METIS, PuLP-MM, METIS-M.
[Per-task timelines of compute, idle, and communication time (s) for each of the 16 MPI tasks]
16 / 28

Subgraph Counting
- Color-coding algorithm of Alon et al. [Alon et al., 1995]
- Parallelization: bulk synchronous OpenMP+MPI
- Communication: high, O(m·2^k)
- Computation: high, O(m·2^k)
17 / 28
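The color-coding idea can be illustrated with a single trial for the simplest template, a path on k vertices; the dynamic program below does O(m·2^k) work per trial, matching the cost above. This is an illustrative sketch only, not the distributed OpenMP+MPI implementation evaluated in the talk.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// One color-coding trial [Alon et al., 1995]: count paths on k vertices whose
// vertices all received distinct colors ("colorful" paths). dp[v][S] holds the
// number of colorful paths ending at v that use exactly the color set S.
// Memory is O(n * 2^k); intended for small template sizes (k >= 2, k <= ~10).
double colorful_path_count(const std::vector<int64_t>& offsets,
                           const std::vector<int64_t>& adj, int k,
                           std::mt19937& rng) {
    const int64_t n = static_cast<int64_t>(offsets.size()) - 1;
    std::uniform_int_distribution<int> pick(0, k - 1);
    std::vector<int> color(n);
    for (int64_t v = 0; v < n; ++v) color[v] = pick(rng);

    const int num_sets = 1 << k;
    std::vector<std::vector<double>> dp(n, std::vector<double>(num_sets, 0.0));
    for (int64_t v = 0; v < n; ++v) dp[v][1 << color[v]] = 1.0;  // length-1 paths

    for (int len = 2; len <= k; ++len) {
        std::vector<std::vector<double>> next(n, std::vector<double>(num_sets, 0.0));
        for (int64_t v = 0; v < n; ++v) {
            const int cv = 1 << color[v];
            for (int64_t e = offsets[v]; e < offsets[v + 1]; ++e) {
                const int64_t u = adj[e];
                for (int S = 0; S < num_sets; ++S)
                    if (dp[u][S] > 0.0 && !(S & cv))
                        next[v][S | cv] += dp[u][S];  // extend a path from u to v
            }
        }
        dp.swap(next);
    }
    double total = 0.0;
    for (int64_t v = 0; v < n; ++v) total += dp[v][num_sets - 1];
    return total / 2.0;  // undirected: each path is counted once from each end
}
```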

Subgraph Counting: Results
- 16 tasks
- Overall mean speedup is best for PuLP-M at 3.1×, vs. 3.0× for the other methods
- Mean speedup for the BFS ordering is 1.06×, vs. 1.05× for RCM
- High memory storage and access per vertex minimizes the cache benefit of ordering
18 / 28

Subgraph Counting: Execution Timelines
16 tasks running on LiveJournal. Clockwise from top left: Random, METIS, PuLP-MM, METIS-M.
[Per-task timelines of compute, idle, and communication time (s) for each of the 16 MPI tasks]
19 / 28

Subgraph Counting: End-to-end Times
16 tasks running on LiveJournal. End-to-end times for subgraph counting: partitioning + ordering, total computation, and total communication.
20 / 28

Breadth-first Search
- Optimized top-down implementation
- Parallelization: bulk synchronous MPI
- Communication: low, O(m) total
- Computation: low, O(m) total
(Figure source: Wikipedia)
21 / 28
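A minimal single-node, level-synchronous top-down BFS for context; the benchmarked version partitions vertices across MPI tasks and exchanges frontier vertices each level. This sketch is an illustration under those assumptions, not the evaluated code.

```cpp
#include <cstdint>
#include <vector>

// Level-synchronous top-down BFS: each level expands the current frontier and
// marks unvisited neighbors. Total work is O(n + m), i.e., O(m) as on the slide.
std::vector<int64_t> bfs_levels(const std::vector<int64_t>& offsets,
                                const std::vector<int64_t>& adj, int64_t root) {
    const int64_t n = static_cast<int64_t>(offsets.size()) - 1;
    std::vector<int64_t> level(n, -1);
    std::vector<int64_t> frontier{root}, next_frontier;
    level[root] = 0;
    for (int64_t depth = 1; !frontier.empty(); ++depth) {
        next_frontier.clear();
        for (const int64_t v : frontier)
            for (int64_t e = offsets[v]; e < offsets[v + 1]; ++e) {
                const int64_t u = adj[e];
                if (level[u] == -1) { level[u] = depth; next_frontier.push_back(u); }
            }
        frontier.swap(next_frontier);
    }
    return level;
}
```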

Single Source Shortest Paths
- Optimized Δ-stepping algorithm [Meyer and Sanders, 2003]
- Parallelization: bulk synchronous MPI
- Communication: low, O(m) total
- Computation: low, O(m) total
(Figure source: Wikipedia)
22 / 28
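A compact sequential sketch of Δ-stepping [Meyer and Sanders, 2003], showing the bucketed relaxation of light and heavy edges; the bucket representation and choice of Δ are simplifications, and the evaluated implementation is a distributed bulk-synchronous MPI version.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Sequential Δ-stepping: vertices are kept in buckets of width delta by
// tentative distance; light edges (w <= delta) are relaxed repeatedly within a
// bucket, heavy edges once when the bucket has been emptied.
struct WEdge { int64_t to; double w; };

std::vector<double> delta_stepping(const std::vector<std::vector<WEdge>>& g,
                                   int64_t src, double delta) {
    const double INF = std::numeric_limits<double>::infinity();
    const int64_t n = static_cast<int64_t>(g.size());
    std::vector<double> dist(n, INF);
    std::vector<std::vector<int64_t>> buckets(1);
    auto relax = [&](int64_t v, double d) {
        if (d < dist[v]) {
            dist[v] = d;
            const size_t b = static_cast<size_t>(d / delta);
            if (b >= buckets.size()) buckets.resize(b + 1);
            buckets[b].push_back(v);  // stale copies are skipped when popped
        }
    };
    relax(src, 0.0);
    for (size_t i = 0; i < buckets.size(); ++i) {
        std::vector<int64_t> settled;
        while (!buckets[i].empty()) {            // relax light edges to a fixed point
            std::vector<int64_t> cur;
            cur.swap(buckets[i]);
            for (const int64_t v : cur) {
                if (static_cast<size_t>(dist[v] / delta) != i) continue;  // stale entry
                settled.push_back(v);
                for (const WEdge& e : g[v])
                    if (e.w <= delta) relax(e.to, dist[v] + e.w);
            }
        }
        for (const int64_t v : settled)          // then relax heavy edges once
            for (const WEdge& e : g[v])
                if (e.w > delta) relax(e.to, dist[v] + e.w);
    }
    return dist;
}
```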

BFS/SSSP: Results
- 64 tasks; shown are communication speedups for BFS (top) and computation speedups for SSSP (bottom)
- METIS gives the best overall performance for BFS and SSSP; communication cost correlates highly with edge cut
- BFS: mean speedup is 1.30× with the BFS ordering vs. 1.26× for RCM
- SSSP: 1.08× with the BFS ordering vs. 1.04× for RCM
23 / 28

SPARQL Query Processing
- Parallel implementation of the RDF-3X store [Neumann and Weikum, 2010]
- Parallelization: MPI with n-hop replication
- Communication: very low, O(k)
- Computation: high, polynomial
(Figure source: Wikipedia)
24 / 28

SPARQL Query Processing: Results
2-hop replication ratios for each partitioning method, and summed query time (a selection from the Berlin SPARQL Benchmark) over all three graphs for each ordering method. Performance impacts are less obvious than for the other analytics: some dependence on the 2-hop ratio, as expected, but a paradoxical performance relationship with ordering (likely due to pre-processing methods).

Partitioning | 2-hop rep. ratio (BSBM  LUBM   DBpedia) | Sum query time (s) (Rand  BFS   RCM)
Random       |                   7.256  10.58  5.580   |                     3.41  4.58  4.32
METIS        |                   5.566  9.714  1.552   |                     3.71  4.41  3.91
METIS-M      |                   5.577  9.146  1.552   |                     3.87  4.20  4.13
PuLP-M       |                   5.308  8.944  1.905   |                     3.41  3.97  3.94
PuLP-MM      |                   5.112  9.227  2.448   |                     3.32  4.01  3.51
25 / 28

Experimental Limitations and Future Work
- Graph instances: moderate problem sizes (only up to ~2 billion edges); limited by subgraph counting and METIS memory requirements
  - Run future experiments focusing on larger instances and on techniques only applicable to such instances
- Analytics and layout: higher-quality partitioners (for certain objectives) are available (e.g., KaHIP), as are other ordering approaches (e.g., Rabbit Order)
  - Analytics in these experiments used only 1D partitioning; explore 2D and hybrid partitioning in conjunction with these partitioning and ordering methods
- Parallelization: only 16 and 64 tasks and bulk synchronous execution; limited by the large number of experiments being run (analytic × partitioner × ordering for each task count)
  - Scale to larger parallelism along with larger problem sizes; investigate methods or modifications applicable to asynchronous execution
- Overall goal: more explicitly quantify the performance impact of partition and ordering quality given the memory access and communication patterns of a graph-based computation
26 / 28

Acknowledgments
Sandia and FASTMath: This research is supported by NSF grant CCF-1439057 and the DOE Office of Science through the FASTMath SciDAC Institute. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Blue Waters Fellowship: This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070, ACI-1238993, and ACI-1444747) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.
Kamesh Madduri's CAREER Award: This research was also supported by NSF grant ACI-1253881.
27 / 28

Conclusions
- Use of the secondary objective (max per-part cut) had a large impact on communication-heavy synchronous graph computations
- Our BFS-based ordering demonstrated better average performance than RCM in all experiments, suggesting that gap-based metrics might be a better way to quantify ordering quality for irregular graphs
- Overall, performance impact scales in proportion to improvements in partitioning and ordering quality
- The optimal choice of graph layout is highly dependent on the characteristics of the graph computation being run
Thanks! slotag@rpi.edu, www.gmslota.com
28 / 28

Bibliography I
N. Alon, R. Yuster, and U. Zwick. Color-coding. J. ACM, 42(4):844–856, 1995.
U. Meyer and P. Sanders. Δ-stepping: a parallelizable shortest path algorithm. J. Algorithms, 49(1):114–152, 2003.
T. Neumann and G. Weikum. The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1):91–113, 2010.
29 / 28