NUMA-aware Graph-structured Analytics

Size: px

Start display at page:

Download "NUMA-aware Graph-structured Analytics"

Elizabeth Holland
5 years ago
Views:

1 NUMA-aware Graph-structured Analytics Kaiyuan Zhang, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, China

2 Big Data Everywhere 00 Million Tweets/day 1.11 Billion Users 6 Billion Photos 100 Hrs of Video every minute How do we understand and use Big Data?

3 Data Analytics 00 Million Tweets/day 1.11 Billion Users 6 Billion Photos 100 Hrs of Video every minute Machine Learning and Data Mining NLP

4 It s about the graphs...

5 NUMA & Graph-analytics Application Hardware Processor Memory Single Unified Multi- Core NUCA Multi- Socket NUMA Now e.g. 80 Cores with 1TB RAM NLP 8 Sockets X (10 Cores with128gb local RAM)

6 How about? NUMA systems Graph-analytics

7 Contribution Polymer: NUMA-aware Graph-structured Analytics A comprehensive analysis that uncovers issues for running graph analytics system on NUMA platform A new system that exploits both NUMA-aware data layout and memory access strategies Three optimizations for global synchronization efficiency, load balance and data structure flexibility A detailed evaluation that demonstrates the performance and scalability benefits

8 Outline Background & Issues Design of Polymer Evaluation

9 Outline Background & Issues Design of Polymer Evaluation

Example: PageRank A centrality analysis algorithm to measure the relative rank for each element of a linked set 3 5 1 2

10 Example: PageRank A centrality analysis algorithm to measure the relative rank for each element of a linked set Characteristics Linked set data dependence Rank of who links it predictable accesses Convergence iterative computation

Graph-analytics The scatter-gather model scatter : propagate the current value of a vertex to its neighbors along edges gather : accumulate values from neighbors to compute the next value of a

11 Graph-analytics The scatter-gather model scatter : propagate the current value of a vertex to its neighbors along edges gather : accumulate values from neighbors to compute the next value of a vertex vertex TOPO In-memory data structure Graph Topology Application-specific Data Runtime State edge curr D1 D2 D3 D D5 D6 next curr next DATA STAT

12 Vertex-centric (e.g. Ligra) STAT/curr TOPO/vertex TOPO/out-edge DATA/curr DATA/next STAT/next

13 Vertex-centric (e.g. Ligra) STAT/curr TOPO/vertex TOPO/out-edge DATA/curr DATA/next STAT/next SEQ RND R W

14 Edge-centric (e.g. X-Stream) partition 5 3 TOPO/edge STAT/curr DATA/curr shuffle phase DATA/Uout DATA/Uin DATA/next STAT/next

15 Edge-centric (e.g. X-Stream) partition 5 3 TOPO/edge STAT/curr DATA/curr RND R shuffle phase DATA/Uout DATA/Uin SEQ SEQ W W R R DATA/next STAT/next RND W

16 NUMA Characteristics A commodity NUMA machine Multiple processor nodes (i.e., socket) Processor = multiple cores + a local DRAM A globally shared memory abstract (cache-coherence) Hallmark: Non-uniform memory access Latency (Cycle) Inst. 0-hop 1-hop 2-hop 80-core Intel Xeon machine Load Store Bandwidth (MB/s) Access 0-hop 1-hop 2-hop IL 80-core Intel Xeon machine SEQ RAND

17 NUMA Characteristics A commodity NUMA machine Multiple processor nodes (i.e., socket) Sequential remote access is faster than random local access Processor = multiple cores + a local DRAM A globally shared memory abstract (cache-coherence) Hallmark: Non-uniform memory access & Random remote access is awesome! Latency (Cycle) Inst. 0-hop 1-hop 2-hop 80-core Intel Xeon machine Load Store Bandwidth (MB/s) Access 0-hop 1-hop 2-hop IL 80-core Intel Xeon machine SEQ RAND

18 NUMA Characteristics The world we lived in: first-touch policy binding virtual pages to physical frames locating on a memory node where a thread first touches the pages CPU MEM Centralized Interleaved Associated The world we want to lived in

19 NUMA Characteristics The world we lived in: first-touch policy binding virtual pages to physical frames locating on a memory node where a thread first touches the pages Both centralized and interleaved data layout will hamper locality and parallelism Centralized Interleaved & Associated CPU Associated layout is the ideal one. MEM The world we want to lived in

20 NUMA Characteristics Lack of locality (access neighboring vertices) It is inevitable to access remote memory Random access is always there update How to mix?? Local Global SEQ RND

21 Access Strategy on NUMA Vertex-centric Model Completely overlooked (e.g. Ligra) L G SEQ RND 5 3 N0 N1 SEQ R L RND W G

22 Access Strategy on NUMA Edge-centric Model Inefficient way (e.g. X-Stream) N0 N1 L G SEQ RND RND R L shuffle phase SEQ W L SEQ R L SEQ W G SEQ R L RND W L

23 Runtime (sec) Normalized Speedup Normalized Speedup Normalized Speedup Scalability & Performance on NUMA Scalability: #Cores vs. #sockets Intel 80-cores (8Sx10C) Ligra X-Stream Galois #cores Ligra X-Stream Galois #sockets X-Stream 8C: 6.92X 8S:.58X Galois 10C: 6.19X 8S: 2.90X X-Stream 1S: 132s 8S: 29s Galois 1S: 33s 8S: 12s Performance (sec) Intel 80-cores (8Sx10C) Ligra X-Stream Galois #sockets Scalability: #sockets Ligra X-Stream Galois #sockets worse! 8 Socket LG: 2.9X XS: 1.X AMD 6-core (8Sx8C)

24 Runtime (sec) Normalized Speedup Normalized Speedup Normalized Speedup Scalability & Performance on NUMA Scalability: #Cores vs. #sockets Intel 80-cores (8Sx10C) X-Stream 1S: 132s 8S: 29s Galois 1S: 33s 8S: 12s Ligra X-Stream Galois #cores Performance (sec) #sockets #sockets X-Stream 8C: 6.92X 8S:.58X Galois 10C: 6.19X 8S: 2.90X Minimize remote & 1 2random accesses 8 Eliminate 160 the 8 Ligra combination Ligraof them 120 X-Stream Galois 6 X-Stream Galois 80 0 Intel 80-cores (8Sx10C) Ligra X-Stream Galois #sockets Scalability: #sockets worse! 8 Socket LG: 2.9X XS: 1.X AMD 6-core (8Sx8C)

25 Outline Background & Issues Design of Polymer Evaluation

26 Goal#1: Reduce remote accesses Co-locating data and computation within the same NUMA node

27 Graph-aware Data Layout Co-locating data and computation within the same NUMA node Graph-aware partitioning TOPO DATA STAT D1 D2 D3 D D5 D

28 Graph-aware Data Layout Co-locating data and computation within the same NUMA node 1. Graph-aware partitioning N N D1 D2 D3 Intuitive D D5 D

29 Graph-aware Data Layout Co-locating data and computation within the same NUMA node 1. Graph-aware partitioning N D1 D2 D3 N D D5 D sophisticated

30 Graph-aware Data Layout Co-locating data and computation within the same NUMA node Graph-aware partitioning D1 D2 D3 N0 agent D D5 D sophisticated N1 5 3

31 Graph-aware Data Layout Co-locating data and computation within the same NUMA node Graph-aware partitioning NUMA-aware allocation Virt-Memory Phys-Memory TOPO D1 D2 D3 N0 agent D D5 D seq 2.local 3.long 1.seq/rnd DATA 2.global 3.long STAT N seq/rnd 2.global 3.short

32 Goal#2: Eliminate random + remote Random remote access access neighboring vertices on other nodes distribute the computations on a singe vertex over multiple nodes

Goal#2: Eliminate random + remote Random remote access access neighboring vertices on other nodes distribute the computations on a singe vertex over

33 Goal#2: Eliminate random + remote Random remote access access neighboring vertices on other nodes distribute the computations on a singe vertex over multiple nodes Each node handles all edges of partial vertices RND SEQ L G RND SEQ G L Each node handles partial edges of all vertices

34 1 NUMA-aware Access Strategy distribute the computations on singe vertex over multiple NUMA-nodes N0 N1 STAT/curr TOPO/vertex TOPO/out-edge DATA/curr DATA/next STAT/next

35 NUMA-aware Access Strategy distribute the computations on singe vertex over multiple NUMA-nodes N0 N1 DATA/curr SEQ R L DATA/next RND W G DATA/curr DATA/next SEQ R G RND W L

36 Optimizations 1. Rolling update 2. Hierarchical and efficient barrier 3. Adaptive data structure

37 Outline Background & Issues Design of Polymer Evaluation

38 Implementation Polymer ~5,300 SLOCs of C++ code Based on scatter-gather iterative model Support both push and pull mode Use a synchronous scheduler Several typical graph applications Sparse MM: PageRank, SpMV and BP Traversal: BFS, CC and SSSP Open source:

39 Experiment Settings Baseline: Ligra v201.3, X-Stream v0.9, and Galois v2.2 Platform 80-core Intel machine (w/o Hyper-Threading) 8 sockets (E7-8850: 10 cores and 128 GB local RAM) 6-core AMD machine (also 8 sockets) Algorithms (6) Graphs () Intel/8X10 3 Spars MM algorithms 3 Traversal algorithms 2 real-world and 2 synthetic graphs AMD/8X8 Graph V E Twitter 1.7 M 1.7B rmat M 2.1B Powerlaw 10.0M 105M roadus 23.9M 58M

40 Overall Performance Algo. V Polymer Ligra X-Stream Galois Twitter PR rmat Power roadus Twitter SpMV rmat Power roadus Twitter BP rmat Power roadus

41 Overall Performance Algo. V Polymer Ligra X-Stream Galois Twitter X X X 11.6 PR rmat X X X 19.6 Power X X X 6.6 roadus X X X 1. Twitter X X X 11.7 SpMV rmat X X X 1.9 Power X X X 6.2 roadus X X X 3.6 Twitter X X X 57.1 BP rmat X X X 75.0 Power X X X 8.6 roadus X X X 7.1

Overall Performance Algo. V Polymer Ligra X-Stream Galois PR SpMV BP Twitter 5.3 2.85X 15.0 5.8X 28.9 2.19X 11.6 rmat27 9.6 2.91X 28.0 1.89X 18.2 2.

55X rmat27 19.2 2.83X 5.3 2.7X 52.5 2.19X 1.9 Compared with best cases Power 1.8 17.3X 31.0 3.08X 5.5 3.6X 6.2 roadus 1.3 2.18X 2.8 2.30X 3.0 2.7X 3.

42 Overall Performance Algo. V Polymer Ligra X-Stream Galois PR SpMV BP Twitter X X X 11.6 rmat X X X 19.6 Power X X X 6.6 roadus X X X 1. Twitter X ~ 3.8X 3.08X 7.89X 1.55X rmat X X X 1.9 Compared with best cases Power X X X 6.2 roadus X X X 3.6 Twitter X X X 57.1 rmat X X X 75.0 Power X X X 8.6 roadus X X X 7.1

43 Overall Performance Algo. V Polymer Ligra X-Stream Galois Twitter BFS rmat Power roadus Twitter CC rmat Power roadus * Twitter SSSP rmat Power roadus *

44 Overall Performance Algo. V Polymer Ligra X-Stream Galois Twitter X X X 2.5 BFS rmat X X X 2.5 Power X X X 0.36 roadus X X X 5.01 Twitter X X X 31.9 CC rmat X X X 33.9 Power X X X 3.51 roadus X X * Twitter X X X 26.3 SSSP rmat X X X 28.5 Power X X X 26.6 roadus X X *

Normalized Speedup Runtime (sec) Normalized Speedup Runtime (sec) Scalability on NUMA Performance & Scalability: #sockets 16 12 8 Ligra

7X 160 120 80 0 Performance & Scalability: #sockets 8 6 2 Ligra X-Stream Galois Polymer 0 1 2 3 5 6 7 8 #sockets 3.

45 Normalized Speedup Runtime (sec) Normalized Speedup Runtime (sec) Scalability on NUMA Performance & Scalability: #sockets Ligra X-Stream Galois Polymer #sockets 12.7X Performance & Scalability: #sockets Ligra X-Stream Galois Polymer #sockets 3.78X #sockets Ligra Galois Polymer Ligra X-Stream Galois Polymer #sockets Intel 80-cores (8Sx10C) PageRank 5.8X/X-Stream 2.85X/Ligra 2.19X/Galois Intel 80-cores (8Sx10C) BFS 0.80X/Galois 1.25X/Ligra 2.96X/X-stream

46 Conclusion Polymer: NUMA-aware Graph-structured Analytics A comprehensive analysis that uncovers several NUMA characteristics and issues with existing NUMA-oblivious graph analytics systems A new system that exploits both NUMA- and graphaware data layout and memory access strategies minimize remote access eliminate remote random access Three optimizations for global synchronization efficiency, load balance and data structure flexibility

47 Thanks Questions projects/polymer.html Institute of Parallel and Distributed Systems

NUMA-Aware Graph-Structured Analytics

NUMA-Aware Graph-Structured Analytics Kaiyuan Zhang Rong Chen Haibo Chen Shanghai Key aboratory of Scalable Computing and Systems Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University,