PuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks George M. Slota 1,2 Kamesh Madduri 2 Sivasankaran Rajamanickam 1 1 Sandia National Laboratories, 2 The Pennsylvania State University gslota@psu.edu, madduri@cse.psu.edu, srajama@sandia.gov BigData14 28 Oct 2014 1 / 37
Highlights We present PuLP, a multi-constraint, multi-objective partitioner designed for small-world graphs Shared-memory parallelism PuLP demonstrates an average speedup of 14.5× relative to state-of-the-art partitioners PuLP requires 8-39× less memory than state-of-the-art partitioners PuLP produces partitions with comparable or better quality than state-of-the-art partitioners for small-world graphs 2 / 37
Overview PuLP: Partitioning Using Label Propagation Overview Graph partitioning formulation Label propagation Using label propagation for partitioning PuLP Algorithm Degree-weighted label prop Label propagation for balancing constraints and minimizing objectives Label propagation for iterative refinement Results Performance comparisons with other partitioners Partitioning quality with different objectives 3 / 37
Overview Partitioning Graph Partitioning: Given a graph G(V, E) and p processes or tasks, create a p-way disjoint partition of the vertices and assign each task one subset of vertices along with their incident edges from G Balance constraints: (weighted) vertices per part, (weighted) edges per part Quality metrics: edge cut, communication volume, maximal per-part edge cut We consider: Balancing edges and vertices per part; Minimizing edge cut (EC) and maximal per-part edge cut (EC_max) 4 / 37
Overview Partitioning - Objectives and Constraints Lots of graph algorithms follow a certain iterative model BFS, SSSP, FASCIA subgraph counting (Slota and Madduri 2014) computation, synchronization, communication, synchronization, computation, etc. Computational load: proportional to vertices and edges per-part Communication load: proportional to total edge cut and max per-part cut We want to minimize the maximal time among tasks for each comp/comm stage 5 / 37
Overview Partitioning - Balance Constraints Balance vertices and edges:
(1 - ε_l) |V|/p ≤ |V(π_i)| ≤ (1 + ε_u) |V|/p   (1)
|E(π_i)| ≤ (1 + η_u) |E|/p   (2)
ε_l and ε_u: lower and upper vertex imbalance ratios
η_u: upper edge imbalance ratio
V(π_i): set of vertices in part π_i
E(π_i): set of edges with both endpoints in part π_i 6 / 37
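Constraints (1)-(2) can be checked directly from a part assignment; a minimal Python sketch (function and variable names are illustrative, not from PuLP):

```python
def check_balance(edges, parts, p, eps_l, eps_u, eta_u):
    """Check constraints (1)-(2): parts[v] is the part of vertex v."""
    n, m = len(parts), len(edges)
    vcount = [0] * p  # |V(pi_i)|
    ecount = [0] * p  # |E(pi_i)|: edges with both endpoints in part i
    for v in range(n):
        vcount[parts[v]] += 1
    for u, v in edges:
        if parts[u] == parts[v]:
            ecount[parts[u]] += 1
    v_ok = all((1 - eps_l) * n / p <= c <= (1 + eps_u) * n / p for c in vcount)
    e_ok = all(c <= (1 + eta_u) * m / p for c in ecount)
    return v_ok and e_ok
```

For a 4-cycle split into two parts of two vertices each, both constraints hold with modest imbalance ratios; putting all vertices in one part violates the lower vertex bound.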
Overview Partitioning - Objectives Given a partition Π, the set of cut edges C(G, Π) and the cut edges per part C(G, π_k) are
C(G, Π) = {(u, v) ∈ E : Π(u) ≠ Π(v)}   (3)
C(G, π_k) = {(u, v) ∈ C(G, Π) : u ∈ π_k ∨ v ∈ π_k}   (4)
Our partitioning problem is then to minimize the total edge cut EC and the max per-part edge cut EC_max:
EC(G, Π) = |C(G, Π)|   (5)
EC_max(G, Π) = max_k |C(G, π_k)|   (6)
7 / 37
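The two objectives (5)-(6) follow directly from the definitions; a short Python sketch (names illustrative):

```python
def edge_cut(edges, parts, p):
    """Return (EC, EC_max) per equations (5)-(6)."""
    cut = 0
    per_part = [0] * p
    for u, v in edges:
        if parts[u] != parts[v]:
            cut += 1
            # Per (4), a cut edge counts toward both endpoint parts.
            per_part[parts[u]] += 1
            per_part[parts[v]] += 1
    return cut, max(per_part)
```

On a 4-cycle with parts {0,1} and {2,3}, two edges are cut and each part sees both of them, so EC = 2 and EC_max = 2.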
Overview Partitioning - HPC Approaches (Par)METIS (Karypis et al.), PT-SCOTCH (Pellegrini et al.), Chaco (Hendrickson et al.), etc. Multilevel methods: Coarsen the input graph in several iterative steps; at the coarsest level, partition the graph via local methods following balance constraints and quality objectives; iteratively uncoarsen the graph, refining the partitioning Problem 1: Designed for traditional HPC scientific problems (e.g. meshes), so they support limited balance constraints and quality objectives Problem 2: The multilevel approach brings high memory requirements, and these methods can run slowly and lack scalability 8 / 37
Overview Label Propagation Label propagation: randomly initialize a graph with some p labels, then iteratively assign to each vertex the label with the maximal per-label count over all neighbors to generate clusters (Raghavan et al. 2007) Clustering algorithm - dense clusters hold the same label Fast - each iteration is O(n + m), usually run for a fixed iteration count (doesn't necessarily converge) Naïvely parallel - only per-vertex label updates Observation: possible applications for large-scale small-world graph partitioning 9 / 37
Overview Partitioning - Big Data Approaches Methods designed for small-world graphs (e.g. social networks and web graphs) Exploit label propagation/clustering for partitioning: Multilevel methods - use label propagation to coarsen the graph (Wang et al. 2014, Meyerhenke et al. 2014) Single-level methods - use label propagation to directly create the partitioning (Ugander and Backstrom 2013, Vaquero et al. 2013) Problem 1: Multilevel methods can still lack scalability, and might also require running a traditional partitioner at the coarsest level Problem 2: Single-level methods can produce sub-optimal partition quality 10 / 37
Overview PuLP PuLP: Partitioning Using Label Propagation Utilize label propagation for: Vertex-balanced partitions, minimizing edge cut (PuLP) Vertex- and edge-balanced partitions, minimizing edge cut (PuLP-M) Vertex- and edge-balanced partitions, minimizing edge cut and maximal per-part edge cut (PuLP-MM) Any combination of the above - multi-objective, multi-constraint 11 / 37
Algorithms Primary Algorithm Overview PuLP-MM Algorithm
Constraint 1: balance vertices; Constraint 2: balance edges
Objective 1: minimize edge cut; Objective 2: minimize per-partition edge cut
Pseudocode (default iteration counts shown):
  Initialize p random partitions
  Execute 3 iterations of degree-weighted label propagation (LP)
  for k1 = 1 iterations do
    for k2 = 3 iterations do
      Balance partitions with 5 LP iterations to satisfy constraint 1
      Refine partitions with 10 FM iterations to minimize objective 1
    for k3 = 3 iterations do
      Balance partitions with 2 LP iterations to satisfy constraint 2 and minimize objective 2 with 5 FM iterations
      Refine partitions with 10 FM iterations to minimize objective 1
12 / 37
Algorithms Primary Algorithm Overview
  Initialize p random partitions
  Execute degree-weighted label propagation (LP)
  for k1 iterations do
    for k2 iterations do
      Balance partitions with LP to satisfy the vertex constraint
      Refine partitions with FM to minimize edge cut
    for k3 iterations do
      Balance partitions with LP to satisfy the edge constraint and minimize max per-part cut
      Refine partitions with FM to minimize edge cut
13 / 37
Algorithms Primary Algorithm Overview Randomly initialize p partitions (p = 4) Network shown is the Infectious network dataset from KONECT (http://konect.uni-koblenz.de/) 14 / 37
Algorithms Primary Algorithm Overview After random initialization, we then perform label propagation to create partitions Initial Observations: Partitions are unbalanced; for high p, some partitions end up empty Edge cut is good, but can be better PuLP Solutions: Impose loose balance constraints, explicitly refine later Degree weightings - cluster around high-degree vertices, let low-degree vertices form the boundary between partitions 15 / 37
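The degree-weighting idea can be sketched by letting each neighbor vote with weight proportional to its degree, so high-degree vertices anchor clusters while low-degree vertices drift to the boundary (an illustrative approximation; PuLP's exact weighting scheme may differ):

```python
def degree_weighted_lp(adj, labels, iters=3):
    """Label propagation where each neighbor's vote is weighted by its
    degree. adj[v] lists neighbors of v; labels is the initial assignment."""
    deg = [len(nbrs) for nbrs in adj]
    for _ in range(iters):
        for v in range(len(adj)):
            votes = {}
            for u in adj[v]:
                votes[labels[u]] = votes.get(labels[u], 0) + deg[u]
            if votes:
                labels[v] = max(votes, key=votes.get)
    return labels
```

On two triangles joined by a single edge, a split along that edge is a stable fixed point of this update, since each vertex's heaviest votes come from its own triangle.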
Algorithms Primary Algorithm Overview Part assignment after random initialization. Network shown is the Infectious network dataset from KONECT (http://konect.uni-koblenz.de/) 17 / 37
Algorithms Primary Algorithm Overview Part assignment after degree-weighted label propagation. Network shown is the Infectious network dataset from KONECT (http://konect.uni-koblenz.de/) 18 / 37
Algorithms Primary Algorithm Overview After label propagation, we balance vertices among partitions and minimize edge cut (baseline PuLP ends here) Observations: Partitions are still unbalanced in terms of edges Edge cut is good, but the max per-part cut isn't necessarily PuLP-M and PuLP-MM Solutions: Maintain vertex balance while explicitly balancing edges Alternate between minimizing total edge cut and max per-part cut (for PuLP-MM; PuLP-M only minimizes total edge cut) 19 / 37
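The refinement stages move boundary vertices greedily; a simplified sketch in the spirit of FM refinement (balance checks and gain buckets omitted; names are illustrative, not PuLP's implementation):

```python
def refine_boundary(adj, parts, p, passes=2):
    """Greedily move each vertex to the neighboring part holding most of
    its neighbors whenever that strictly reduces its local cut."""
    for _ in range(passes):
        moved = False
        for v in range(len(adj)):
            counts = [0] * p
            for u in adj[v]:
                counts[parts[u]] += 1
            best = max(range(p), key=lambda k: counts[k])
            if best != parts[v] and counts[best] > counts[parts[v]]:
                parts[v] = best  # strict improvement in local cut
                moved = True
        if not moved:
            break  # no vertex improved; local minimum reached
    return parts
```

A full FM pass would also track move gains and respect the vertex/edge balance constraints when accepting a move; this sketch shows only the local-cut improvement step.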
Algorithms Primary Algorithm Overview Part assignment after balancing for vertices and minimizing edge cut. Network shown is the Infectious network dataset from KONECT (http://konect.uni-koblenz.de/) 22 / 37
Algorithms Primary Algorithm Overview Part assignment after balancing for edges and minimizing total edge cut and max per-part edge cut Network shown is the Infectious network dataset from KONECT (http://konect.uni-koblenz.de/) 25 / 37
Results Test Environment and Graphs Test system: Compton - Intel Xeon E5-2670 (Sandy Bridge), dual-socket, 16 cores, 64 GB memory.
Test graphs: LAW graphs from UF Sparse Matrix, SNAP, MPI, Koblenz; real-world (one R-MAT), small-world, 60 K-70 M vertices, 275 K-2 B edges
Test Algorithms:
  METIS - single constraint, single objective
  METIS-M - multi constraint, single objective
  ParMETIS - METIS-M running in parallel
  KaFFPa - single constraint, single objective
  PuLP - single constraint, single objective
  PuLP-M - multi constraint, single objective
  PuLP-MM - multi constraint, multi objective
Metrics: 2-128 partitions, serial and parallel running times, memory utilization, edge cut, max per-partition edge cut 26 / 37
Results Running Times - Serial (top), Parallel (bottom) In serial, PuLP-MM runs 1.7× faster (geometric mean) than the next fastest. [Figure: serial running times vs. number of partitions (2-128) on LiveJournal, R-MAT, and Twitter for PULP, PULP-M, PULP-MM, METIS, METIS-M, and KaFFPa-FS.] In parallel, PuLP-MM runs 14.5× faster (geometric mean) than the next fastest (ParMETIS times are the fastest of 1 to 256 cores). [Figure: parallel running times vs. number of partitions (2-128) on LiveJournal, R-MAT, and Twitter for PULP, PULP-M, PULP-MM, ParMETIS, METIS-M (serial), and PULP-M (serial).] 27 / 37
Results Memory utilization for 128 partitions PuLP utilizes minimal memory, O(n), 8-39× less than other partitioners. Savings are mostly from avoiding a multilevel approach.
Network      | METIS-M | KaFFPa | PuLP-MM | Graph Size | Improv.
LiveJournal  | 7.2 GB  | 5.0 GB | 0.44 GB | 0.33 GB    | 21×
Orkut        | 21 GB   | 13 GB  | 0.99 GB | 0.88 GB    | 23×
R-MAT        | 42 GB   | -      | 1.2 GB  | 1.02 GB    | 35×
DBpedia      | 46 GB   | -      | 2.8 GB  | 1.6 GB     | 28×
WikiLinks    | 103 GB  | 42 GB  | 5.3 GB  | 4.1 GB     | 25×
sk-2005      | 121 GB  | -      | 16 GB   | 13.7 GB    | 8×
Twitter      | 487 GB  | -      | 14 GB   | 12.2 GB    | 39×
28 / 37
Results Performance - Edge Cut and Edge Cut Max PuLP-M produces better edge cut than METIS-M over most graphs. PuLP-MM produces better max edge cut than METIS-M over most graphs. [Figure: edge cut ratio and max per-part cut ratio vs. number of partitions (2-128) on LiveJournal, R-MAT, and Twitter for PULP-M, PULP-MM, and METIS-M.] 29 / 37
Results Balanced communication uk-2005 graph from LAW, METIS-M (left) vs. PuLP-MM (right). Blue: low comm; White: avg comm; Red: high comm. PuLP reduces max inter-part communication requirements and balances the total communication load across all tasks. [Figure: 16×16 part-to-part communication heatmaps for METIS-M and PuLP-MM.] 30 / 37
Future Work Explore techniques for avoiding local minima, such as simulated annealing Further parallelization in distributed environments for massive-scale graphs Demonstrate performance of PuLP partitions with graph analytics Explore tradeoffs and interactions among the various parameters and iteration counts 31 / 37
Conclusions We presented PuLP, a multi-constraint, multi-objective partitioner designed for small-world graphs Shared-memory parallelism PuLP demonstrates an average speedup of 14.5× relative to state-of-the-art partitioners PuLP requires 8-39× less memory than state-of-the-art partitioners PuLP produces partitions with comparable or better quality than METIS/ParMETIS for small-world graphs 32 / 37
Conclusions Questions? 32 / 37
Acknowledgments DOE Office of Science through the FASTMath SciDAC Institute Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. NSF grants ACI-1253881, OCI-0821527 Used NERSC hardware for generating partitionings - supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 33 / 37
Backup slides 34 / 37
Results Running Times - Serial (top), Parallel (bottom) PuLP is faster than the others over most tests in serial. In parallel, PuLP is always faster than the others, running 14.5× faster (geometric mean). [Figure: serial and parallel running times vs. number of partitions (2-128) on LiveJournal, Orkut, R-MAT, DBpedia, WikiLinks, sk-2005, and Twitter for the same partitioner sets as the main results.] 35 / 37
Results Memory utilization for 128 partitions PuLP utilizes minimal memory - O(n). Savings are mostly from avoiding a multilevel approach.
Network      | METIS-M | KaFFPa | PuLP-MM | Graph Size | Improv.
LiveJournal  | 7.2 GB  | 5.0 GB | 0.44 GB | 0.33 GB    | 21×
Orkut        | 21 GB   | 13 GB  | 0.99 GB | 0.88 GB    | 23×
R-MAT        | 42 GB   | -      | 1.2 GB  | 1.02 GB    | 35×
DBpedia      | 46 GB   | -      | 2.8 GB  | 1.6 GB     | 28×
WikiLinks    | 103 GB  | 42 GB  | 5.3 GB  | 4.1 GB     | 25×
sk-2005      | 121 GB  | -      | 16 GB   | 13.7 GB    | 8×
Twitter      | 487 GB  | -      | 14 GB   | 12.2 GB    | 39×
36 / 37
Results Performance - Edge Cut and Edge Cut Max PuLP-M produces better edge cut than METIS-M over most graphs. PuLP-MM produces better max edge cut than METIS-M over most graphs. Taken together, these demonstrate the tradeoff for multi-objective partitioning. [Figure: edge cut ratio and max per-part cut ratio vs. number of partitions (2-128) across LiveJournal, Orkut, R-MAT, DBpedia, WikiLinks, sk-2005, and Twitter for PULP-M, PULP-MM, and METIS-M.] 37 / 37
Results Performance - Edge Cut and Edge Cut Max PuLP-M produces better edge cut than METIS-M over most graphs. PuLP-MM produces better max edge cut than METIS-M over most graphs. Taken together, these demonstrate the tradeoff for multi-objective partitioning. Across all Lab for Web Algorithmics graphs. [Figure: edge cut ratio vs. number of partitions (2-128) for PULP-M, PULP-MM, and METIS-M across 15 LAW graphs: Amazon, Arabic, CNR, DBLP, Enron, eu, Hollywood, in, indochina, it, LJournal, sk-2005, uk-2002, uk-2005, webbase.] 37 / 37
Results Performance - Edge Cut and Edge Cut Max PuLP-M produces better edge cut than METIS-M over most graphs. PuLP-MM produces better max edge cut than METIS-M over most graphs. Taken together, these demonstrate the tradeoff for multi-objective partitioning. Across all Lab for Web Algorithmics graphs. [Figure: max per-part cut ratio vs. number of partitions (2-128) for PULP-M, PULP-MM, and METIS-M across 15 LAW graphs: Amazon, Arabic, CNR, DBLP, Enron, eu, Hollywood, in, indochina, it, LJournal, sk-2005, uk-2002, uk-2005, webbase.] 37 / 37