PuLP: Complex Objective Partitioning of Small-World Networks Using Label Propagation. George M. Slota 1,2, Kamesh Madduri 2, Sivasankaran Rajamanickam 1


PuLP: Complex Objective Partitioning of Small-World Networks Using Label Propagation
George M. Slota 1,2, Kamesh Madduri 2, Sivasankaran Rajamanickam 1
1 Sandia National Laboratories, 2 The Pennsylvania State University
gslota@psu.edu, madduri@cse.psu.edu, srajama@sandia.gov
SIAM CSE15, 17 March 2015
1 / 18

Highlights

We present PuLP, a multi-constraint, multi-objective partitioner
Shared-memory parallelism (OpenMP)
PuLP demonstrates an average speedup of 14.5× relative to state-of-the-art partitioners on a small-world graph test suite
PuLP requires 8-39× less memory than state-of-the-art partitioners
PuLP produces partitions with comparable or better quality than state-of-the-art partitioners for small-world graphs
Exploits the label propagation clustering algorithm

2 / 18

Label Propagation: Algorithm progression

Randomly label with n labels
Iteratively update each v with the max per-label count over its neighbors, ties broken randomly
Algorithm completes when no new updates are possible

3 / 18

Label Propagation: Overview and observations

Label propagation: initialize a graph with n labels, then iteratively assign to each vertex the label with the maximal per-label count over all its neighbors to generate clusters, ties broken randomly (Raghavan et al. 2007)
Clustering algorithm: dense clusters hold the same label
Fast: each iteration is O(n + m)
Naïvely parallel: only per-vertex label updates
Observation: possible applications for large-scale small-world graph partitioning

4 / 18
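The update rule above can be sketched in a few lines of serial Python (a minimal illustration with our own helper names, not the PuLP/OpenMP implementation; the graph is an adjacency-list dict):

```python
import random
from collections import Counter

def label_propagation(adj, max_iters=500, seed=0):
    """Minimal label propagation: each vertex repeatedly adopts the most
    frequent label among its neighbors; ties are broken randomly, but a
    vertex keeps its label when it is already among the winners."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}  # every vertex starts with a unique label
    for _ in range(max_iters):
        updated = False
        for v in adj:
            if not adj[v]:
                continue  # isolated vertex keeps its label
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            winners = [lab for lab, c in counts.items() if c == best]
            if labels[v] not in winners:
                labels[v] = rng.choice(winners)
                updated = True
        if not updated:  # no new updates possible -> done
            break
    return labels

# Two triangles joined by a single edge (2-3): at convergence each
# triangle ends up holding one shared label.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
labels = label_propagation(adj)
```

Keeping the current label when it ties with the winner is what lets the sweep terminate; always re-rolling on ties can oscillate forever.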

Partitioning: Problem description, objectives, and constraints

Goal: minimize execution time for small-world graph analytics
Constraints - vertex and edge balance: per-task memory and computation requirements
Objectives - edge cut and max per-part cut: total and per-task communication requirements
Single-level vs. multilevel partitioners: multilevel produces high quality at the cost of computation; single-level produces sub-optimal quality... until now

5 / 18
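The constraints and objectives above are easy to state in code (a small sketch with our own names; `parts` maps each vertex to its part id):

```python
from collections import Counter

def partition_metrics(adj, parts, p):
    """Edge cut (total objective), max per-part cut (per-task objective),
    and vertex imbalance (constraint) for a given partition."""
    cut_per_part = Counter()
    twice_cut = 0
    for v, nbrs in adj.items():
        for u in nbrs:
            if parts[v] != parts[u]:
                twice_cut += 1              # each cut edge seen from both ends
                cut_per_part[parts[v]] += 1 # cut edges incident to v's part
    edge_cut = twice_cut // 2
    max_part_cut = max(cut_per_part.values(), default=0)
    sizes = Counter(parts.values())
    # vertex imbalance: largest part relative to the ideal size n/p
    imbalance = max(sizes.values()) / (len(adj) / p)
    return edge_cut, max_part_cut, imbalance

# A 4-cycle split into two balanced parts: exactly 2 edges are cut.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
parts = {0: 0, 1: 0, 2: 1, 3: 1}
cut, max_cut, imb = partition_metrics(adj, parts, p=2)
```

An analogous edge-balance constraint would compare per-part edge counts against (m/p); it is omitted here to keep the sketch short.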

PuLP: Overview and description

PuLP: Partitioning using Label Propagation
Utilize label propagation for:
Vertex-balanced partitions, minimizing edge cut (PuLP)
Vertex- and edge-balanced partitions, minimizing edge cut (PuLP-M)
Vertex- and edge-balanced partitions, minimizing edge cut and maximal per-part edge cut (PuLP-MM)
Any combination of the above: multi-objective, multi-constraint

6 / 18

PuLP-MM: Algorithm overview

Randomly initialize p partitions
Create partitions with degree-weighted label propagation
for some number of iterations do
    for some number of iterations do
        Balance partitions to satisfy the vertex constraint
        Refine partitions to minimize edge cut
    for some number of iterations do
        Balance partitions to satisfy the edge constraint and minimize per-part edge cut
        Refine partitions to minimize edge cut

7 / 18
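The first two steps (random initialization followed by degree-weighted label propagation over p part labels) can be sketched as below. This is our own serial simplification of the PULP-lp procedure shown later, assuming neighbors vote with their degrees and no part may shrink below a minimum size:

```python
import random
from collections import Counter

def pulp_lp(adj, p, iters=10, eps=0.25, seed=0):
    """Sketch of PuLP's initialization: degree-weighted label propagation
    restricted to p part labels, with a lower size bound Min_v so that
    no part can be emptied out (our simplification, not the real code)."""
    rng = random.Random(seed)
    n = len(adj)
    parts = {v: rng.randrange(p) for v in adj}  # random initial parts
    sizes = Counter(parts.values())
    min_size = (n / p) * (1 - eps)              # Min_v = (n/p)(1 - eps_l)
    for _ in range(iters):
        changed = 0
        for v in adj:
            if not adj[v]:
                continue
            votes = Counter()
            for u in adj[v]:
                votes[parts[u]] += len(adj[u])  # degree-weighted vote
            best = max(votes, key=votes.get)
            # move only if the source part stays above its minimum size
            if best != parts[v] and sizes[parts[v]] - 1 > min_size:
                sizes[parts[v]] -= 1
                sizes[best] += 1
                parts[v] = best
                changed += 1
        if changed == 0:
            break
    return parts

# Two 4-cliques bridged by one edge, assigned to p = 2 parts.
adj = {v: [u for u in range(4) if u != v] for v in range(4)}
adj.update({v: [u for u in range(4, 8) if u != v] for v in range(4, 8)})
adj[3].append(4)
adj[4].append(3)
parts = pulp_lp(adj, p=2)
```

Degree weighting pulls high-degree hubs and their neighborhoods into the same part early, which is what makes the later balancing and refinement passes start from a low-cut configuration.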

PuLP-MM: Algorithm illustration

Randomly initialize partition labels for p parts
Run degree-weighted label propagation to create initial parts
Iteratively balance for vertices, minimize edge cut
Balance for edges, minimize per-part edge cut
Network shown is the Infectious network dataset from KONECT (http://konect.uni-koblenz.de/)

8 / 18

PuLP-MM: Label propagation step

P ← PULP-lp(G(V, E), p, P, N, I_p)
    Min_v ← (n/p) · (1 − ε_l)
    i ← 0, r ← 1
    while i < I_p and r ≠ 0 do
        r ← 0
        for all v ∈ V do
            C(1..p) ← 0
            for all u ∈ E(v) do
                C(P(u)) ← C(P(u)) + |E(u)|    (degree-weighted vote)
            x ← argmax_j C(j)
            if x ≠ P(v) and N(P(v)) − 1 > Min_v then
                P(v) ← x
                r ← r + 1
        i ← i + 1
    return P

9 / 18

PuLP-MM: Vertex balancing step

P ← PULP-vp(G(V, E), P, p, N, I_b)
    i ← 0, r ← 1
    Max_v ← (n/p) · (1 + ε_u)
    W_v(1..p) ← max(Max_v / N(1..p) − 1, 0)
    while i < I_b and r ≠ 0 do
        r ← 0
        for all v ∈ V do
            C(1..p) ← 0
            for all u ∈ E(v) do
                C(P(u)) ← C(P(u)) + |E(u)|
            for j = 1..p do
                if moving v to P_j violates Max_v then
                    C(j) ← 0
                else
                    C(j) ← C(j) · W_v(j)
            x ← argmax_j C(j)
            if x ≠ P(v) then
                update N(P(v)), N(x)
                update W_v(P(v)), W_v(x)
                P(v) ← x
                r ← r + 1
        i ← i + 1
    return P

10 / 18
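The weighting in PULP-vp can be looked at in isolation: a part at or above its capacity gets weight 0, so moves into it are suppressed, while emptier parts amplify their neighbors' votes. A toy sketch (the function name is ours):

```python
def part_weights(sizes, n, p, eps_u=0.1):
    """PULP-vp style weights W_v(j) = max(Max_v / N(j) - 1, 0):
    zero for parts at or above capacity Max_v, larger for emptier parts."""
    max_v = (n / p) * (1 + eps_u)  # vertex-count capacity per part
    return [max(max_v / s - 1.0, 0.0) for s in sizes]

# 100 vertices in 4 parts: part 0 is overfull, part 3 badly underfull,
# so part 3 gets the largest weight and part 0 gets zero.
weights = part_weights([40, 25, 20, 15], n=100, p=4)
```

Multiplying neighbor counts by these weights turns the same label propagation sweep into a balancing pass without changing its O(n + m) cost.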

PuLP-MM: Edge balancing and max per-part cut minimization step

P ← PULP-cp(G(V, E), P, p, N, M, T, U, I_b)
    i ← 0, r ← 1
    Max_v ← (n/p) · (1 + ε_u), Max_e ← (m/p) · (1 + η_u)
    Cur_Maxe ← max(M(1..p)), Cur_Maxc ← max(T(1..p))
    W_e(1..p) ← Cur_Maxe / M(1..p) − 1, W_c(1..p) ← Cur_Maxc / T(1..p) − 1
    R_e ← 1, R_c ← 1
    while i < I_b and r ≠ 0 do
        r ← 0
        for all v ∈ V do
            C(1..p) ← 0
            for all u ∈ E(v) do
                C(P(u)) ← C(P(u)) + 1
            for j = 1..p do
                if moving v to P_j violates Max_v, Cur_Maxe, or Cur_Maxc then
                    C(j) ← 0
                else
                    C(j) ← C(j) · (W_e(j) · R_e + W_c(j) · R_c)
            x ← argmax_j C(j)
            if x ≠ P(v) then
                P(v) ← x
                r ← r + 1
                update all variables for x and P(v)
        if Cur_Maxe < Max_e then
            Cur_Maxe ← Max_e
            R_c ← R_c · Cur_Maxc
            R_e ← 1
        else
            R_e ← R_e · (Cur_Maxe / Max_e)
            R_c ← 1
        i ← i + 1
    return P

11 / 18

Results: Test environment and graphs

Test system: Compton - Intel Xeon E5-2670 (Sandy Bridge), dual-socket, 16 cores, 64 GB memory
Test graphs: LAW graphs from the UF Sparse Matrix Collection, SNAP, MPI, Koblenz; real (one R-MAT), small-world, 60 K to 70 M vertices, 275 K to 2 B edges
Test algorithms:
METIS - single constraint, single objective
METIS-M - multi constraint, single objective
ParMETIS - METIS-M running in parallel
KaFFPa - single constraint, single objective
PuLP - single constraint, single objective
PuLP-M - multi constraint, single objective
PuLP-MM - multi constraint, multi objective
Metrics: 2 to 128 partitions, serial and parallel running times, memory utilization, edge cut, max per-partition edge cut

12 / 18

Results: PuLP running times - serial (top), parallel (bottom)

In serial, PuLP-MM runs 1.7× faster (geometric mean) than the next fastest of METIS and KaFFPa
In parallel, PuLP-MM runs 14.5× faster (geometric mean) than the next fastest (ParMETIS times are the fastest over 1 to 256 cores)
[Figure: running time vs. number of partitions (2 to 128) on LiveJournal, R-MAT, and Twitter for PuLP, PuLP-M, PuLP-MM, METIS, METIS-M, and KaFFPa in serial; PuLP variants vs. ParMETIS in parallel]

13 / 18

Results: PuLP memory utilization for 128 partitions

PuLP utilizes minimal memory, O(n): 8-39× less than KaFFPa and METIS
Savings are mostly from avoiding a multilevel approach

Network      METIS-M   KaFFPa   PuLP-MM   Graph Size   Improv.
LiveJournal  7.2 GB    5.0 GB   0.44 GB   0.33 GB      21×
Orkut        21 GB     13 GB    0.99 GB   0.88 GB      23×
R-MAT        42 GB     -        1.2 GB    1.02 GB      35×
DBpedia      46 GB     -        2.8 GB    1.6 GB       28×
WikiLinks    103 GB    42 GB    5.3 GB    4.1 GB       25×
sk-2005      121 GB    -        16 GB     13.7 GB      8×
Twitter      487 GB    -        14 GB     12.2 GB      39×

14 / 18

Results: PuLP quality - edge cut and max per-part cut

PuLP-M produces better edge cut than METIS-M over most graphs
PuLP-MM produces better max per-part cut than METIS-M over most graphs
[Figure: edge cut ratio and max per-part cut ratio vs. number of partitions (2 to 128) on LiveJournal, R-MAT, and Twitter for PuLP-M, PuLP-MM, and METIS-M]

15 / 18

Results: PuLP balanced communication

uk-2005 graph from LAW, METIS-M (left) vs. PuLP-MM (right)
Blue: low comm; white: average comm; red: high comm
PuLP reduces max inter-part communication requirements and balances the total communication load across all tasks
[Figure: 16×16 part-to-part communication heatmaps for METIS-M and PuLP-MM]

16 / 18

Results: Re-balancing for multiple constraints and objectives

PuLP can balance a high-quality but unbalanced (single-objective, single-constraint) partition
We observe up to 11% further improvement in edge cut produced by PuLP-MM when starting with a single-constraint partition from KaFFPa
At no cost to the second objective (max per-part cut): our approaches show up to 400% improvement relative to KaFFPa
[Figure: edge cut ratio and max per-part edge cut ratio vs. number of parts (2 to 128) for PuLP-MM, KaFFPa, and KaFFPa+PuLP-MM]

17 / 18

Conclusions and future work

Average of 14.5× faster, 23× less memory, and better edge cut or better per-part edge cut than the next-best tested partitioner on a small-world graph test suite
Future work:
Implementation in Zoltan2
Explore techniques for avoiding local minima, such as simulated annealing
Further parallelization in distributed environments for massive-scale graphs
Explore tradeoffs and interactions among various parameters and iteration counts
New initialization procedures for various graph types (meshes, road networks, etc.)

18 / 18