Communication for PIM-based Graph Processing with Efficient Data Partition. Mingxing Zhang, Youwei Zhuo (equal contribution),

Size: px

Start display at page:

Download "Communication for PIM-based Graph Processing with Efficient Data Partition. Mingxing Zhang, Youwei Zhuo (equal contribution),"

Francine Ramsey
5 years ago
Views:

1 GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition Mingxing Zhang, Youwei Zhuo (equal contribution), Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis, Xuehai Qian Tsinghua University University of Southern California Stanford University

2 Motivation Outline Graph applications Processing-In-Memory The drawbacks of the current solution GraphP Evaluation

3 Graph Applications Social network analytics Recommendation system Bioinformatics

4 Challenges High bandwidth requirement Small amount of computation per vertex Data movement overhead

5 Challenges High bandwidth requirement Small amount of computation per vertex Data movement overhead comp L1 L2 L3 mem

6 PIM: Processing-In-Memory

7 PIM: Processing-In-Memory Idea: Computation logic inside memory Advantage: High memory bandwidth

8 PIM: Processing-In-Memory Idea: Computation logic inside memory Advantage: High memory bandwidth Example: Hybrid Memory Cubes (HMC)

9 PIM: Processing-In-Memory Idea: Computation logic inside memory Advantage: High memory bandwidth Example: Hybrid Memory Cubes (HMC) 320GB/s intra-cube mem.. mem mem mem comp 4x120GB/s inter-cube

10 HMC: Hybrid Memory Cubes 320 Intra-cube bandwidth (GB/s)

11 HMC: Hybrid Memory Cubes Intra-cube Inter-cube bandwidth (GB/s)

12 HMC: Hybrid Memory Cubes Intra-cube Inter-cube bandwidth (GB/s)

13 HMC: Hybrid Memory Cubes Intra-cube Inter-cube bandwidth (GB/s)

14 HMC: Hybrid Memory Cubes Intra-cube Inter-cube Inter-group bandwidth (GB/s)

15 HMC: Hybrid Memory Cubes Intra-cube Bottleneck: Inter-cube communication Inter-cube Inter-group bandwidth (GB/s)

16 Motivation Outline Graph applications Processing-In-Memory The drawbacks of the current solution GraphP Evaluation

17 Current Solution: Tesseract First PIM-based graph processing architecture Ahn, J., Hong, S., Yoo, S., Mutlu, O., & Choi, K. A scalable processing-inmemory accelerator for parallel graph processing. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

18 Current Solution: Tesseract First PIM-based graph processing architecture Programming model Vertex program Ahn, J., Hong, S., Yoo, S., Mutlu, O., & Choi, K. A scalable processing-inmemory accelerator for parallel graph processing. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

19 Current Solution: Tesseract First PIM-based graph processing architecture Programming model Vertex program Partition Based on vertex program Ahn, J., Hong, S., Yoo, S., Mutlu, O., & Choi, K. A scalable processing-inmemory accelerator for parallel graph processing. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

20 PageRank in Vertex Program for (v: vertices) { update = 0.85 * v.rank / v.out_degree; }

21 PageRank in Vertex Program for (v: vertices) { update = 0.85 * v.rank / v.out_degree; for (w: edges.destination) { put(w.id, function{ w.next_rank += update; }); } }

22 PageRank in Vertex Program for (v: vertices) { update = 0.85 * v.rank / v.out_degree; for (w: edges.destination) { put(w.id, function{ w.next_rank += update; }); } } barrier();

23 Graph Partition hmc vertex hmc1

24 Graph Partition hmc vertex inter edge intra edge hmc1

25 Graph Partition hmc vertex inter edge intra edge comm put(w.id, function{ w.next_rank += update; }); hmc1

26 Graph Partition hmc vertex inter edge intra edge comm communication = # of cross-cube edges hmc1

27 Drawback of Tesseract Excessive data communication Why? Programming Model Graph Partition Data Communication Tesseract

28 Drawback of Tesseract Excessive data communication Why? Programming Model Graph Partition Data Communication Tesseract?

29 Drawback of Tesseract Excessive data communication Why? Programming Model Graph Partition Data Communication Tesseract?

30 Drawback of Tesseract Excessive data communication Why? Programming Model Graph Partition Data Communication Tesseract?

31 Outline Motivation GraphP Evaluation

32 GraphP Consider graph partition first. Graph Partition Source-Cut Programming model Two-phase vertex program Reduces inter-cube communication

33 Source-Cut Partition hmc vertex hmc

34 Source-Cut Partition hmc vertex inter edge intra edge hmc

35 Source-Cut Partition hmc vertex inter edge intra edge 2 replica 2 hmc

36 Source-Cut Partition hmc vertex inter edge intra edge 2 replica 2 hmc

37 Source-Cut Partition hmc vertex inter edge intra edge 2 replica 2 hmc

38 Two-Phase Vertex Program for (r: replicas) { r.next_rank = 0.85 * r.next_rank / r.out_degree; } //apply updates from previous iterations

39 Two-Phase Vertex Program for (r: replicas) { r.next_rank = 0.85 * r.next_rank / r.out_degree; } //apply updates from previous iterations

40 Two-Phase Vertex Program

41 Two-Phase Vertex Program for (v: vertices) { for (u: edges.sources) { }

42 Two-Phase Vertex Program for (v: vertices) { for (u: edges.sources) { } update += u.rank;

43 Two-Phase Vertex Program for (v: vertices) { for (u: edges.sources) { } update += u.rank;

44 Two-Phase Vertex Program } for (r: replicas) { put(r.id, function { r.next_rank = update}); } 0 2 barrier();

45 Benefits Strictly less data communication Enables architecture optimizations

46 Less Communication Tesseract GraphP

47 Less Communication Tesseract GraphP

48 Broadcast Optimization for (r: replicas) { } put(r.id, function { r.next_rank = update}); } barrier(); 4 broadcast 4 4 4

49 Naïve Broadcast 15 point to point messages src dst dst dst dst

50 Hierarchical communication 3 intergroup messages src dst dst dst dst

51 Other Optimizations Computation/communication overlap Leveraging low-power state of SerDes Please see the paper for more details

52 Outline Motivation GraphP Evaluation

53 Evaluation Methodology Simulation Infrastructure zsim with HMC support ORION for NOC Energy modeling Configurations Same as Tesseract 16 HMCs Interconnection: Dragonfly and Mesh2D 512 CPUs Single-issue in-order cores Frequency: 1GHz

54 4 graph algorithms Workloads 5 real-world graphs

55 4 graph algorithms Workloads Breadth First Search Single Source Shortest Path Weakly Connected Component PageRank 5 real-world graphs Wiki-Vote (WV) ego-twitter (TT) Soc-Slashdot0902 (SD) Amazon0302 (AZ) ljournal-2008 (LJ)

56 Performance memory bandwidth Tesseract

57 Speedup Performance data partition 1.7x 5 memory bandwidth 0 DDR3 Tesseract SOTA GraphP-SC GraphP-SC -BRD

58 Speedup Performance 20 <1.1x data partition 1.7x 5 memory bandwidth 0 DDR3 Tesseract SOTA GraphP-SC GraphP-SC -BRD

59 Normalized to Tesseract Communication Amount 100% 75% 51.8% Intra-group Inter-group 50% 25% 48.2% 0% 7.1% 0.4% 7.0% 1.7% Tesseract GraphP-SC GraphP-SC -BRD

60 Normalized to Tesseract Energy consumption 100% 100.0% 75% 50% 25% 24.9% 15.9% 0% Tesseract GraphP-SC GraphP-SC -BRD

61 Other results Bandwidth utilization Scalability Replication overhead Please see the paper for more details

62 Conclusions We propose GraphP A new PIM-based graph processing framework Key contributions

63 Conclusions We propose GraphP A new PIM-based graph processing framework Key contributions Data partition as first-order design consideration

64 Conclusions We propose GraphP A new PIM-based graph processing framework Key contributions Data partition as first-order design consideration Source-cut partition

65 Conclusions We propose GraphP A new PIM-based graph processing framework Key contributions Data partition as first-order design consideration Source-cut partition Two-phase vertex program

66 Conclusions We propose GraphP A new PIM-based graph processing framework Key contributions Data partition as first-order design consideration Source-cut partition Two-phase vertex program Enable additional architecture optimizations

67 Conclusions We propose GraphP A new PIM-based graph processing framework Key contributions Data partition as first-order design consideration Source-cut partition Two-phase vertex program Enable additional architecture optimizations GraphP drastically reduces inter-cube communication and improves energy efficiency.

68 GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition Mingxing Zhang, Youwei Zhuo (equal contribution), Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis, Xuehai Qian Tsinghua University University of Southern California Stanford University

69 Workload Size & Capacity 128 GB (16 * 8GB) ~16 billion edges ~400 million edges (SNAP) ~7 billion edges (WebGraph)

70 Two-phase vertex program Equivalent Expressiveness as vertex programs

GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition

GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition Mingxing Zhang Youwei Zhuo Chao Wang Mingyu Gao Yongwei Wu Kang Chen Christos Kozyrakis Xuehai Qian Tsinghua