Argo: Architecture-Aware Graph Partitioning

1 Argo: Architecture-Aware Graph Partitioning. Angen Zheng, Alexandros Labrinidis, Panos K. Chrysanthis, and Jack Lange. Department of Computer Science, University of Pittsburgh. http://db.cs.pitt.edu/group/

2 Big Graphs Are Everywhere [SIGMOD 16 Tutorial]

3 A Balanced Partitioning = Even Load Distribution. Minimal Edge-Cut = Minimal Data Comm. Assumption: the network is the bottleneck.

4 The End of Slow Networks: Network is now as fast as DRAM [C. Binnig, VLDB 15]. InfiniBand: 1.7GB/s~37.5GB/s; DDR3: 6.25GB/s~16.6GB/s. Dual-socket Xeon E5v2 server with DDR3 memory and FDR 4x NICs per socket.

5 The End of Slow Networks: Does edge-cut still matter?

6 Roadmap: Introduction; Does edge-cut still matter?; Why edge-cut still matters; Argo; Evaluation; Conclusions.

7 The End of Slow Networks: Does edge-cut still matter? Graph Partitioners: METIS and LDG. Graph Workloads: BFS, SSSP, and PageRank. Graph Dataset: Orkut (|V| = 3M, |E| = 234M). Number of Partitions: 16 (one partition per core).

8 The End of Slow Networks: Does edge-cut still matter? [Table: SSSP execution time (s) for METIS and LDG under the configurations m:s:c = 1:2:8, 2:2:4, 4:2:2, and 8:2:1, where m = # of machines used, s = # of sockets used per machine, c = # of cores used per socket.] Denser configurations had longer execution time. Contention on the memory subsystems impacted performance. The network may not always be the bottleneck.

9 The End of Slow Networks: Does edge-cut still matter? [Tables: SSSP execution time (s) and SSSP LLC misses (in millions) for METIS and LDG across the same m:s:c configurations. LLC misses (METIS vs LDG): 1:2:8 = 10,292 vs 44,117; 2:2:4 = 10,626 vs 44,689; 4:2:2 = 2,541 vs 1,061.] Denser configurations had longer execution time. Contention on the memory subsystems impacted performance. The network may not always be the bottleneck.

11 The End of Slow Networks: Does edge-cut still matter? [Same SSSP execution time and LLC miss tables as above.] Denser configurations had longer execution time. Contention on the memory subsystems impacted performance. The distribution of edge-cut matters. The network may not always be the bottleneck.

12 The End of Slow Networks: Does edge-cut still matter? [Same SSSP execution time and LLC miss tables as above.] METIS had lower execution time and LLC misses than LDG. Edge-cut matters: higher edge-cut --> higher comm --> higher contention.

13 The End of Slow Networks: Does edge-cut still matter? Yes! Both edge-cut and its distribution matter! Intra-node and inter-node data communication have different performance impacts on the memory subsystems of modern multicore machines.

14 Roadmap: Introduction; Does edge-cut still matter?; Why edge-cut still matters; Argo; Evaluation; Conclusions.

15 Intra-Node Data Comm: Shared Memory. [Diagram: the sending core loads the send buffer (1), then loads and writes the shared buffer (2a, 2b); the receiving core loads the shared buffer (3), then loads and writes the receive buffer (4a, 4b).] Extra memory copy.

16 Intra-Node Data Comm: Shared Memory. Caching the send, shared, and receive buffers causes cache pollution and LLC and memory bandwidth contention.

17 Intra-Node Data Comm: Shared Memory. The sending core caches the send/shared buffers and the receiving core caches the receive/shared buffers, causing cache pollution and LLC and memory bandwidth contention.
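
To make the extra copy concrete: with a shared-memory channel, every message is staged in a shared buffer by the sender and copied out again by the receiver, so the same bytes travel through the cache hierarchy twice on one node. The toy sketch below (purely illustrative, not the actual transport code of any graph engine) models that two-hop copy:

```python
# Toy model of intra-node messaging through a shared buffer.
# Each message is copied twice -- send buffer -> shared buffer -> receive
# buffer -- which is the "extra memory copy" that pollutes the LLC and
# consumes memory bandwidth on both the sending and the receiving core.

def send_via_shared_buffer(send_buf: bytes, shared_buf: bytearray) -> int:
    shared_buf[:len(send_buf)] = send_buf      # copy #1, done by the sending core
    return len(send_buf)

def receive_via_shared_buffer(shared_buf: bytearray, size: int) -> bytes:
    return bytes(shared_buf[:size])            # copy #2, done by the receiving core

shared = bytearray(4096)
msg = b"boundary-vertex updates"
n = send_via_shared_buffer(msg, shared)
assert receive_via_shared_buffer(shared, n) == msg
```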

18 Excess intra-node data communication may hurt performance.

19 Inter-Node Data Comm: RDMA Read/Write. [Diagram: the sending core's send buffer on Node#1 is transferred through the IB HCAs directly into the receive buffer on Node#2.] No extra memory copy and no cache pollution.
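
By contrast, with RDMA the HCA places the data directly into the remote receive buffer, so there is no intermediate shared buffer and no second copy on either node. The toy counterpart below only mimics that single-copy data movement in Python; in reality the transfer is performed by the InfiniBand hardware, not by application code:

```python
# Toy counterpart of the RDMA write path: data lands directly in the
# receiver's buffer, with no intermediate shared buffer to pollute the LLC.

def rdma_write_model(send_buf: bytes, remote_recv_buf: bytearray) -> int:
    remote_recv_buf[:len(send_buf)] = send_buf   # single placement (done by the HCA in reality)
    return len(send_buf)
```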

20 Offloading excess intra-node data comm across nodes may achieve better performance.

21 Roadmap: Introduction; Does edge-cut still matter?; Why edge-cut still matters; Argo; Evaluation; Conclusions.

22 Argo: Graph Partitioning Model. [Diagram: a vertex stream feeds the partitioner, which assigns each arriving vertex to a partition.] Streaming Graph Partitioning Model [I. Stanton, KDD 12].
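
In the streaming model, vertices arrive one at a time and the partitioner assigns each one immediately, using only the neighbors it has already placed. A minimal LDG-style sketch of that loop is shown below (function and variable names are illustrative; this is the baseline heuristic, not Argo's scoring):

```python
def ldg_stream_partition(vertex_stream, num_partitions, capacity):
    """Greedy LDG-style streaming partitioning (illustrative sketch).

    vertex_stream yields (vertex, neighbors) pairs in arrival order.
    Each vertex goes to the partition holding most of its already-placed
    neighbors, discounted by how full that partition is.
    """
    assignment = {}                        # vertex -> partition id
    loads = [0] * num_partitions           # vertices placed per partition

    for v, neighbors in vertex_stream:
        best_p, best_score = 0, float("-inf")
        for p in range(num_partitions):
            placed_here = sum(1 for u in neighbors if assignment.get(u) == p)
            score = placed_here * (1.0 - loads[p] / capacity)   # load penalty
            if score > best_score:
                best_p, best_score = p, score
        assignment[v] = best_p
        loads[best_p] += 1
    return assignment
```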

23 Argo: Architecture-Aware Vertex Placement. Place vertex v on a partition Pi that maximizes the weighted edge-cut, penalizing the placement based on the load of Pi. Weighted by the relative network comm cost, Argo will avoid edge-cut across nodes (inter-node data comm). Great for cases where the network is the bottleneck.
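
One way to read this slide: the greedy score from the streaming model is generalized so that each already-placed neighbor contributes according to how much communication cost is avoided by not cutting its edge across an expensive link. The sketch below is an assumed illustrative form of that idea (the exact objective in the Argo paper may differ); comm_cost[p][q] is the relative network comm cost between the cores hosting partitions p and q:

```python
def architecture_aware_place(v, neighbors, assignment, loads, capacity, comm_cost):
    """Illustrative architecture-aware greedy placement (assumed form).

    For each candidate partition p, an already-placed neighbor u contributes
    the communication cost avoided relative to the most expensive link, so
    neighbors that would otherwise sit behind costly (inter-node) links pull
    v toward their partition hardest.
    """
    num_partitions = len(loads)
    max_cost = max(max(row) for row in comm_cost)
    best_p, best_score = 0, float("-inf")
    for p in range(num_partitions):
        gain = sum(max_cost - comm_cost[p][assignment[u]]
                   for u in neighbors if u in assignment)
        score = gain * (1.0 - loads[p] / capacity)      # penalize loaded partitions
        if score > best_score:
            best_p, best_score = p, score
    return best_p
```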

24 Argo: Architecture-Aware Vertex Placement. Degree of contention λ ∈ [0, 1]: λ = 0 when the network is the bottleneck, λ = 1 when memory is the bottleneck. The refined intra-node network comm cost moves from the original intra-node network comm cost (λ = 0) toward the maximal inter-node network comm cost (λ = 1). Weighted by the refined relative network comm cost, Argo will also avoid edge-cut across cores of the same node (intra-node data comm).
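
The refinement on this slide only touches the cost model: with λ = 0 (network-bound) the original intra-node cost is kept, and with λ = 1 (memory-bound) intra-node communication is priced like the most expensive inter-node link, so the placement heuristic starts avoiding intra-node edge-cut as well. A small sketch, assuming a simple linear interpolation between the two endpoints named on the slide:

```python
def refine_comm_costs(comm_cost, same_node, lam):
    """Interpolate intra-node comm costs toward the maximal inter-node cost.

    comm_cost[p][q]  -- original relative comm cost between partitions p and q
    same_node(p, q)  -- True if partitions p and q are hosted on the same node
    lam              -- degree of contention: 0 (network-bound) .. 1 (memory-bound)
    """
    n = len(comm_cost)
    max_inter = max(comm_cost[p][q]
                    for p in range(n) for q in range(n)
                    if not same_node(p, q))
    refined = [row[:] for row in comm_cost]
    for p in range(n):
        for q in range(n):
            if p != q and same_node(p, q):
                # lam = 0 keeps the original intra-node cost;
                # lam = 1 raises it to the maximal inter-node cost.
                refined[p][q] = comm_cost[p][q] + lam * (max_inter - comm_cost[p][q])
    return refined
```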

25 Roadmap: Introduction; Does edge-cut still matter?; Why edge-cut still matters; Argo; Evaluation; Conclusions.

26 Evaluation: Workloads & Datasets. Three classic graph workloads: Breadth First Search (BFS), Single Source Shortest Path (SSSP), and PageRank. Three real-world large graphs: Orkut (|V| = 3M, |E| = 234M), Friendster (|V| = 124M, |E| = 3.6B), Twitter (|V| = 52M, |E| = 3.9B).

27 Evaluation: Platform. Cluster configuration: 32 nodes; network topology: FDR InfiniBand (single switch); network bandwidth: 56Gbps. Compute node configuration: 2 sockets of Intel Haswell (10 cores/socket); 25MB L3 cache.

28 Evaluation: Partitioners. METIS: the most well-known multi-level partitioner. LDG: the most well-known streaming partitioner. ARGO-H: assumes the network is the bottleneck; weights edge-cut by the original network comm costs. ARGO: assumes memory is the bottleneck; weights edge-cut by the refined network comm costs.

29 Evaluation: SSSP Exec. Time on Orkut dataset. Orkut: |V| = 3M, |E| = 234M; 60 partitions on three 20-core machines. [Chart: SSSP execution time vs. message grouping size; bar labels range from 1x to 5x.] ARGO had the lowest SSSP execution time. (Message grouping: group multiple msgs sent by a single SSSP process to the same destination into one msg.)
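
Message grouping, as used in these experiments, packs the per-vertex updates that one worker produces for the same destination partition into a single outgoing message, trading a little latency for fewer, larger transfers. A minimal sketch of that batching (all names here are illustrative, not the evaluated system's API):

```python
from collections import defaultdict

def group_messages(updates, dest_of, group_size):
    """Batch per-vertex updates that share a destination partition.

    updates    -- iterable of (target_vertex, value) pairs produced locally
    dest_of    -- function mapping a target vertex to its destination partition
    group_size -- max number of updates packed into one outgoing message
    """
    pending = defaultdict(list)
    for vertex, value in updates:
        dest = dest_of(vertex)
        pending[dest].append((vertex, value))
        if len(pending[dest]) == group_size:
            yield dest, pending.pop(dest)        # flush a full batch
    for dest, batch in pending.items():          # flush whatever is left
        yield dest, batch
```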

30 Evaluation: SSSP LLC Misses on Orkut dataset. Orkut: |V| = 3M, |E| = 234M; 60 partitions on three 20-core machines. [Chart: SSSP LLC misses vs. message grouping size; bar labels range from 1x to 50x.] ARGO had the lowest LLC misses.

31 Evaluation: SSSP Comm Vol. on Orkut dataset. Orkut: |V| = 3M, |E| = 234M; 60 partitions on three 20-core machines. [Chart: communication volume breakdown; intra-socket share: METIS 69%, LDG 49%, ARGO-H 70%.] ARGO had the lowest intra-node communication volume. The distribution of the edge-cut also matters.

32 Evaluation: SSSP Exec. Time vs Graph Size. Twitter: |V| = 52M, |E| = 3.9B; 80 partitions on four 20-core machines; message grouping size: 512. ARGO had the lowest SSSP execution time. Up to 6x improvement against ARGO-H. The improvement became larger as the graph size increased.

33 Evaluation: SSSP Exec. Time vs # of Partitions. Twitter: |V| = 52M, |E| = 3.9B; 80~200 partitions on four to ten 20-core machines; message grouping size: 512. ARGO always outperformed LDG and ARGO-H. Up to 11x improvement against ARGO-H.

34 Evaluation: SSSP Exec. Time vs # of Partitions. Twitter: |V| = 52M, |E| = 3.9B; 80~200 partitions on four to ten 20-core machines; message grouping size: 512. [Chart: zoomed-in view with annotations of 13h at 160 partitions and 6h at 180 partitions.] Hours of CPU time saved.

35 Evaluation: Partitioning Overhead. Twitter: |V| = 52M, |E| = 3.9B; 80~200 partitions on four to ten 20-core machines. [Charts: partitioning time as a percentage of the CPU time saved (SSSP execution), vs. # of partitions.] ARGO is indeed slower than LDG. The overhead was negligible in comparison to the CPU time saved. Graph analytics usually have much longer execution times.

36 Conclusions. Findings: the network is not always the bottleneck; contention on memory subsystems may impact performance a lot, due to excess intra-node data comm; both edge-cut and its distribution matter. ARGO: avoids contention by offloading excess intra-node data comm across nodes; achieves up to 11x improvement on real-world workloads; scales well in terms of both graph size and number of partitions. Thanks! Acknowledgments: Peyman Givi, Patrick Pisciuneri. Funding: two NSF CBET grants; BigData 16 Student Travel Award.
