Planar: Parallel Lightweight Architecture-Aware Adaptive Graph Repartitioning
1 Planar: Parallel Lightweight Architecture-Aware Adaptive Graph Repartitioning
Angen Zheng, Alexandros Labrinidis, and Panos K. Chrysanthis
University of Pittsburgh
2 Applications of Graph Partitioning
- Scientific Simulations
- Distributed Graph Computation (Pregel, Hama, Giraph)
- VLSI Design
- Task Scheduling
- Linear Programming
3 A Balanced Partitioning = Even Load Distribution
[Figure: an example graph split evenly across nodes N1, N2, N3]
4 Minimal Edge-Cut = Minimal Data Comm
[Figure: a partitioning of the same graph across N1, N2, N3 that minimizes edge-cut]
5 Minimal Edge-Cut = Minimal Data Comm, But Minimal Data Comm != Minimal Comm Cost
[Figure 1: Pair-Wise Network Bandwidth, with per-pair standard deviations in Mb/s (J. Xue, BigData 15)]
- Group neighboring vertices as close as possible
- The partitioner has to be Architecture-Aware
6 Overview of the State-of-the-Art
Balanced graph (re)partitioning:
- Partitioners (static graphs): offline methods (high quality, poor scalability); online methods (moderate quality, high scalability)
- Repartitioners (dynamic graphs): offline methods (high quality, poor scalability); online methods (moderate~high quality, high scalability)
- None of these methods is Architecture-Aware
7 Roadmap: Introduction, Planar, Evaluation, Conclusions
8 Planar: Problem Statement
Given G = (V, E) and an initial partitioning P:
- Balancing Load
- Minimizing Communication (network cost)
- Minimizing Migration
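The slide's objective formulas did not survive transcription; below is a hedged LaTeX reconstruction of the three goals, using standard repartitioning notation. The symbols w, c, m, P_0, and epsilon are our stand-ins, not necessarily the slide's own.

```latex
% Hedged reconstruction, not the slide's exact formulas.
% P = {P_1, ..., P_k}: partitions; w(v): load of vertex v;
% c(i, j): architecture-aware network cost between partitions i and j;
% P_0(v): initial placement of v; m(v): migration volume of v.
\begin{align*}
\text{Balancing load:}       &\quad \max_i \textstyle\sum_{v \in P_i} w(v)
                               \;\le\; (1+\epsilon)\,\tfrac{1}{k} \textstyle\sum_{v \in V} w(v) \\
\text{Minimizing comm:}      &\quad \min \textstyle\sum_{(u,v) \in E,\; P(u) \ne P(v)}
                               c\big(P(u), P(v)\big) \\
\text{Minimizing migration:} &\quad \min \textstyle\sum_{v:\; P(v) \ne P_0(v)}
                               m(v)\, c\big(P_0(v), P(v)\big)
\end{align*}
```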
9 Planar: Overview
[Timeline: Planar runs at adaptation supersteps S_k, S_k+1, S_k+2, S_k+4, S_k+5]
- Phase-1: Logical Vertex Migration (migration planning: what vertices to move? where to move?)
  - Phase-1a: Minimizing Comm Cost
  - Phase-1b: Ensuring Balanced Partitions
- Phase-2: Physical Vertex Migration (perform the migration plan)
- Phase-3: Convergence Check (still beneficial?)
10 Phase-1a: Minimizing Comm Cost
[Figure: example graph spanning N1, N2, N3 with edge weights 1 and 6]
11 Phase-1a: Minimizing Comm Cost
- Run Planar on each partition in parallel
- Each boundary vertex of a partition makes its own migration decision
- Probabilistic vertex migration
[Figure: the example graph, highlighting boundary vertices of N1 and N2]
13 Use vertex a as an example
- Stay in N1: g(a, N1, N1) = 0
- Max Gain: 0; Optimal Dest: N1
14 Move vertex a to N2?
- old_comm(a, N1) = 13
- new_comm(a, N2) = 7
- mig(a, N1, N2) = 1 * 6 = 6
- g(a, N1, N2) = 13 - 7 - 6 = 0
- Max Gain: 0; Optimal Dest: N1
15 Move vertex a to N3?
- old_comm(a, N1) = 13
- new_comm(a, N3) = 3
- mig(a, N1, N3) = 1 * 1 = 1
- g(a, N1, N3) = 13 - 3 - 1 = 9
- Max Gain: 9; Optimal Dest: N3
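The gain arithmetic on slides 14 and 15 is self-consistent; a minimal runnable sketch, where the decomposition gain = old_comm - new_comm - mig is our reading of the worked numbers (the per-edge cost factors were garbled in transcription):

```python
# Sketch of the per-vertex gain rule implied by slides 13-15.

def gain(old_comm, new_comm, mig):
    """Net cost reduction if the vertex migrates; positive means beneficial."""
    return old_comm - new_comm - mig

# Reproduce the slide's examples for vertex a, currently in N1:
assert gain(13, 7, 6) == 0   # a -> N2: no improvement over staying in N1
assert gain(13, 3, 1) == 9   # a -> N3: max gain 9, optimal destination
```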
16 Phase-1a: Probabilistic Vertex Migration
Migration planning:

Partition | Boundary Vtx | Migration Dest | Gain | Probability
N1        | a            | N3             | 9    | 9/9
N1        | b            | N1 (stay)      | 0    | 0
N2        | d            | N3             | 2    | 2/3
N2        | e            | N3             | 3    | 3/3
N3        | g            | N3 (stay)      | 0    | 0

Migrate with a probability proportional to the gain (sketched below).
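A sketch of the probabilistic migration step. Normalizing each vertex's gain by the largest gain in its own partition is our reading of the 9/9, 2/3, 3/3 entries in the table above, an assumption rather than something the slide states explicitly:

```python
import random

def plan_migrations(candidates):
    """candidates: [(vertex, best_dest, gain)] for ONE partition's boundary."""
    # Assumption: probability = gain / max gain within this partition.
    max_gain = max((g for _, _, g in candidates), default=0)
    plan = []
    for vertex, dest, g in candidates:
        # A vertex with gain 0 (or a full-gain tie with staying) never moves.
        if max_gain > 0 and random.random() < g / max_gain:
            plan.append((vertex, dest))
    return plan

print(plan_migrations([("a", "N3", 9), ("b", "N1", 0)]))            # a moves w.p. 9/9
print(plan_migrations([("d", "N3", 2), ("e", "N3", 3)]))            # d w.p. 2/3, e w.p. 3/3
```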
17 Phase-1b: Balancing Partitions (Quota-Based Vertex Migration)
- Q1: How much work should each overloaded partition migrate to each underloaded partition?
  - Potential gain computation: similar to Phase-1a vertex gain computation
  - Iteratively allocate quota, starting from the partition pair with the largest gain (see the sketch below)
- Q2: What vertices to migrate?
  - Phase-1a vertex migration, but limited by the quota
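A hedged sketch of the quota allocation loop: walk (overloaded, underloaded) partition pairs in decreasing order of potential gain and hand out as much quota as both sides allow. The dict-based interface and the example gains are illustrative assumptions, not Planar's internals.

```python
def allocate_quotas(surplus, deficit, potential_gain):
    """surplus[p]: load an overloaded partition p must shed;
    deficit[p]: load an underloaded partition p can absorb;
    potential_gain[(src, dst)]: estimated gain of shifting load src -> dst."""
    quotas = {}
    # Iterate pairs from largest to smallest potential gain.
    for src, dst in sorted(potential_gain, key=potential_gain.get, reverse=True):
        q = min(surplus.get(src, 0), deficit.get(dst, 0))
        if q > 0:
            quotas[(src, dst)] = q   # Phase-1a-style migration, capped at q
            surplus[src] -= q
            deficit[dst] -= q
    return quotas

# Example: N1 must shed 5 units; N2 can absorb 3, N3 can absorb 4.
print(allocate_quotas({"N1": 5}, {"N2": 3, "N3": 4},
                      {("N1", "N2"): 7, ("N1", "N3"): 2}))
# -> {('N1', 'N2'): 3, ('N1', 'N3'): 2}
```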
18 Planar: Physical Vertex Migration
[Overview pipeline from slide 9, now at Phase-2]
- Phase-2: Physical Vertex Migration: perform the migration plan
19 Planar: Convergence Check
[Overview pipeline from slide 9, now at Phase-3]
- Phase-3: Convergence Check: still beneficial?
20 Phase-3: Convergence
- A repartitioning epoch starts once enough changes (structure/load) accumulate, and runs adaptation supersteps S_k, S_k+1, ... until convergence
- Converge when the improvement achieved per adaptation superstep stays < δ for τ consecutive adaptation supersteps
- δ = 1% and τ = 10 (chosen via sensitivity analysis)
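A sketch of this stopping rule in Python. Here `run_superstep` is a hypothetical callable that performs one adaptation superstep (Phases 1 and 2) and returns the relative improvement it achieved; it is a stand-in, not Planar's actual interface.

```python
def adapt_until_converged(run_superstep, delta=0.01, tau=10):
    """Stop once improvement per superstep stays below delta for tau
    consecutive adaptation supersteps (delta = 1%, tau = 10 per the
    sensitivity analysis on the backup slides)."""
    consecutive_low = 0
    while consecutive_low < tau:
        improvement = run_superstep()
        consecutive_low = consecutive_low + 1 if improvement < delta else 0
```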
21 Evaluation
- Microbenchmarks
  - Convergence study (param selection)
  - Partitioning quality
- Real-world workloads
  - Breadth-First Search (BFS)
  - Single-Source Shortest Path (SSSP)
- Scalability tests
  - Scalability vs graph size
  - Scalability vs # of partitions
  - Scalability vs graph size and # of partitions
22 Partitioning Quality: Setup
- Datasets: 12 datasets from various areas
- # of Parts: 40 (two 20-core machines)
- Initial Partitioners:
  - HP: Hash Partitioning
  - DG: Deterministic Greedy
  - LDG: Linear Deterministic Greedy (sketched below)
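For context, a sketch of Linear Deterministic Greedy (LDG), the streaming heuristic named above (due to Stanton and Kliot): each arriving vertex goes to the partition that already holds most of its neighbors, weighted by a linear penalty on partition fullness. The function and variable names are ours.

```python
def ldg_partition(stream, k, capacity):
    """stream: iterable of (vertex, neighbor_list); returns vertex -> part."""
    assignment, sizes = {}, [0] * k
    for v, neighbors in stream:
        def score(i):
            # Neighbors already placed in partition i, damped by fullness.
            placed = sum(1 for u in neighbors if assignment.get(u) == i)
            return placed * (1.0 - sizes[i] / capacity)
        candidates = [i for i in range(k) if sizes[i] < capacity]
        best = max(candidates, key=score)
        assignment[v] = best
        sizes[best] += 1
    return assignment

# Tiny example: a 4-cycle split into 2 parts of capacity 2.
print(ldg_partition([(1, [2, 4]), (2, [1, 3]), (3, [2, 4]), (4, [1, 3])], 2, 2))
```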
23 Partitioning Quality: Datasets

Dataset         | |V|         | |E|           | Description
wave            | 156,317     | 2,118,662     | FEM
auto            | 448,695     | 6,629,222     | FEM
333SP           | 3,712,815   | 22,217,266    | FEM
CA-CondMat      | 108,…       | …,756         | Collaboration Network
DBLP            | 317,080     | 1,049,866     | Collaboration Network
email-Enron     | 36,692      | 183,831       | Collaboration Network
as-skitter      | 1,696,415   | 22,190,596    | Internet Topology
Amazon          | 334,863     | 925,872       | Product Network
USA-roadNet     | 23,947,347  | 58,333,344    | Road Network
roadNet-PA      | 1,090,919   | 6,167,592     | Road Network
YouTube         | 3,223,589   | 24,447,548    | Social Network
com-LiveJournal | 4,036,537   | 69,362,378    | Social Network
Friendster      | 124,836,180 | 3,612,134,270 | Social Network
24 Partitioning Quality: Planar achieved up to 68% improvement

Initial Partitioner | Max Improv. | Avg. Improv.
HP                  | 68%         | 53%
DG                  | 46%         | 24%
LDG                 | 69%         | 48%
25 Evaluation roadmap (repeated): next, real-world workloads (BFS, SSSP)
26 Real-World Workload: Setup

Cluster Configuration | PittMPICluster                  | Gordon
Interconnect          | FDR InfiniBand                  | QDR InfiniBand
# of Nodes            | …                               | …
Network Topology      | single switch (32 nodes/switch) | 4x4x4 3D torus of switches (16 nodes/switch)
Network Bandwidth     | 56 Gbps                         | 8 Gbps

Node Configuration | PittMPICluster (Intel Haswell) | Gordon (Intel Sandy Bridge)
# of Sockets       | 2 (10 cores/socket)            | 2 (8 cores/socket)
L3 Cache           | 25 MB                          | 20 MB
Memory Bandwidth   | 65 GB/s                        | 85 GB/s
27 Planar: Avoiding Resource Contention on the Memory Subsystems of Multicore Machines
- System bottleneck (A. Zheng, EDBT 16): PittMPICluster: memory (λ=1); Gordon: network (λ=0)
- Degree of contention λ: relates the intra-node comm cost to the maximal inter-node network comm cost (see the sketch below)
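Our reading of how λ is defined on this slide, as a hedged formula; the operator glyphs were lost in transcription, so treat this reconstruction as an assumption rather than the paper's exact definition:

```latex
% Assumed reconstruction of the slide's layout:
\lambda \;=\; \frac{\text{intra-node comm cost}}
                   {\text{maximal inter-node network comm cost}},
\qquad \lambda \in [0, 1]
% lambda = 1: the memory subsystem is the bottleneck (PittMPICluster);
% lambda = 0: the network dominates, so intra-node traffic is
%             comparatively free (Gordon).
```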
28 Real-World Workload: Baselines
Balanced graph (re)partitioning landscape (as on slide 6):
- Partitioners (static graphs): offline methods (high quality, poor scalability); online methods (moderate quality, high scalability)
- Repartitioners (dynamic graphs): offline methods (high quality, poor scalability); online methods (moderate~high quality, high scalability)
- uniplanar
- Initial Partitioner: DG
29 BFS Exec. Time on PittMPICluster (λ=1): Planar achieved up to 9x speedups
- as-skitter, 60 partitions: |V| = 1.6M, |E| = 22M; three 20-core machines
[Chart: BFS execution times; Planar's speedups over the baselines are 9x, 7.5x, 5.8x, 4.1x, 1.48x, and 1.37x, with Planar itself at 1x]
30 BFS Comm Volume on PittMPICluster (λ=1): Planar had the lowest intra-node comm volume
- as-skitter, 60 partitions: |V| = 1.6M, |E| = 22M; three 20-core machines

Reduction vs | Intra-Socket | Inter-Socket
DG           | 51%          | 38%
METIS        | 51%          | 36%
PARMETIS     | 47%          | 34%
uniplanar    | 44%          | 28%
ARAGON       | 4.3%         | 0.8%
PARAGON      | 5.2%         | 2.6%
31 BFS Exec. Time on Gordon (λ=0): Planar achieved up to 3.2x speedups
- as-skitter, 48 partitions: |V| = 1.6M, |E| = 22M; three 16-core machines
[Chart: BFS execution times; Planar's speedups over the baselines are 3.2x, 1.05x, 1.16x, and 1.21x, with Planar itself at 1x]
32 BFS Comm. Volume on Gordon (λ=0): Planar had the lowest inter-node comm volume
- as-skitter, 48 partitions: |V| = 1.6M, |E| = 22M; three 16-core machines
[Chart: inter-node comm volume reductions of 51%, 11%, 0.1%, and 25% over the baselines]
33 Conclusions
PLANAR: an Architecture-Aware Adaptive Graph Repartitioner
- Handles communication heterogeneity and shared resource contention
- Up to 9x speedups on real-world workloads
- Scaled up to a graph with 3.6B edges
Acknowledgments: Peyman Givi, Patrick Pisciuneri, Mark Silvis
Funding: NSF OIA, NSF CBET
34 Thank You!
Homepage: ADMT:
35 Backup Slides
36 Phase-3: Convergence (Param Selection)
- Initial Partitioner: DG (Deterministic Greedy)
- # of Parts: 40 (two 20-core nodes)
- Selected: δ = 1% and τ = 10
37 Scalability vs Graph Size: BFS Exec. Time
- friendster: |V| = 124M, |E| = 3.6B; # of partitions: 60 (three 20-core machines)

Baseline  | Planar speedup (60 cores)
DG        | 1.55x
uniplanar | 1.32x
PARAGON   | 1.08x
38 Scalability vs Graph Size: Repart. Time
- friendster: |V| = 124M, |E| = 3.6B; # of partitions: 60 (three 20-core machines)
[Chart: repartitioning time]
39 Scalability vs # of Partitions: BFS Exec. Time
- friendster: |V| = 124M, |E| = 3.6B; 60~120 partitions (three to six 20-core machines)

Baseline  | Planar speedup (120 cores)
DG        | 2.9x
uniplanar | 1.30x
PARAGON   | 1.15x
40 Scalability vs # of Partitions: Repart. Time
- friendster: |V| = 124M, |E| = 3.6B; 60~120 partitions (three to six 20-core machines)
[Chart: repartitioning time]
41 Intra-Node Shared Resource Contention
[Figure: intra-node message transfer between a sending core and a receiving core via a shared buffer: 1. load from the send buffer; 2a/2b. load and write the shared buffer; 3. load from the shared buffer; 4a/4b. load and write the receive buffer]
42 Intra-Node Shared Resource Contention
- Cached send/shared/receive buffers: multiple copies of the same data in the LLC, contending for the LLC and the memory controller (MC)
43 Intra-Node Shared Resource Contention
- Cached send/shared buffer on one socket and cached receive/shared buffer on another: multiple copies of the same data in the LLC, contending for the LLC, MC, and QPI
44 Inter-Node Comm Cost >> Intra-Node Comm Cost?
[Figure: Node#1 and Node#2 connected by an RDMA-enabled network]
45 Inter-Node Comm Cost vs. Intra-Node Comm Cost
- InfiniBand: 1.7 GB/s ~ 37.5 GB/s; DDR3: 6.25 GB/s ~ 16.6 GB/s
- Dual-socket Xeon E5v2 server with DDR3 memory and FDR 4x NICs per socket [1]
- Revisit the impact of the memory subsystem carefully!
[1] C. Binnig et al. The End of Slow Networks: It's Time for a Redesign. CoRR, 2015.
46 Planar: Avoiding Contention
[Figure: Node#1 and Node#2; each sending core transfers directly from its send buffer through the IB HCA to the remote node's receive buffer, bypassing the intra-node shared buffer]