GraFBoost: Using accelerated flash storage for external graph analytics

Size: px

Start display at page:

Download "GraFBoost: Using accelerated flash storage for external graph analytics"

Isabel Powers
5 years ago
Views:

1 GraFBoost: Using accelerated flash storage for external graph analytics Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind MIT CSAIL Funded by: 1

2 Large Graphs are Found Everywhere in Nature Human neural network Structure of the internet TB to 100s of TB in size! Social networks 1) Connectomics graph of the brain - Source: Connectomics and Graph Theory Jon Clayden, UCL 2) (Part of) the internet - Source: Criteo Engineering Blog 3) The Graph of a Social Network Source: Griff s Graphs 2

3 Storage for Graph Analytics Extremely sparse Irregular structure TB DRAM of DRAM Terabytes in size $$$ $8000/TB, 200W Our goal: $ $500/TB, 10W 3

4 Random Access Challenge in Flash Storage Flash DRAM Bandwidth: 3.6 GB/s ~100 GB/s Access Granularity: 8192 Bytes 128 Bytes Wastes performance by not using most of fetched page Using 8 bytes in a 8192 Byte page uses 1/1024 of bandwidth! 4

Sequentialize Random Access Reduce Overhead of

5 Two Pillars for Flash in GraFBoost Flash Storage Sort-Reduce Algorithm Hardware Acceleration Sequentialize Random Access Reduce Overhead of Sort-Reduce System resources Power consumption Performance 5

6 Vertex-Centric Programming Model Popular model for efficient parallism and distribution Vertex program only sees neighbors Algorithm executed in terms of disjoint iterations Vertex program is executed on one or more Active Vertices Active Vertices

7 Algorithmic Representation of a Vertex Program Iteration for each v src in ActiveList do for each e(v src, v dst ) in G do ev = edge_program (v src,. val, e. weight) v dst. next_val = vertex_update(v dst. next_val, ev) end for end for Random read-modify update into vertex data Sort-reduce algorithm solves this issue 7

8 General Problem of Irregular Array Updates Updating an array x with a stream of update requests xs and update function f For each < idx, arg > in xs: x idx = f(x idx, arg) X xs f Random Updates 8

9 Solution Part One - Sort Sort xs according to index X xs Sort Sorted xs Sequential Updates Much better than naïve random updates Terabyte graphs can generate terabyte logs Significant sorting overhead 9

10 Solution Part Two - Reduce Associative update function f can be interleaved with sort merge merge 1 7 xs e.g., (A + B) + C = A + (B + C) Reduced Overhead 10

11 Normalized size of update stream Big Benefits from Interleaving Reduction Ratio of data copied at each sort phase Kron32 Kron32 WDC Kron30 WDC Twitter 90% Merge Iteration 11

12 Accelerated GraFBoost Architecture In-storage accelerator reduces data movement and cost Software Host (Server/PC/Laptop) FPGA Accelerator-Aware Flash Management Multirate Aggregator Multirate 16-to-1 Merge-Sorter Wire-Speed On-chip Sorter Flash Edge Data Vertex Data Partially Sort- Reduced Files Active Vertices 12

13 Accelerated GraFBoost Architecture In-storage accelerator reduces data movement and cost Software Host (Server/PC/Laptop) FPGA Edge Program Vertex Value Update Log (xs) 1GB DRAM Sort-Reduce Accelerator Edge Property Flash Edge Data Vertex Data Partially Sort- Reduced Files Active Vertices 13

14 Evaluated Graph Analytic Systems In-memory Semi-External External GraphLab (IN) FlashGraph (SE1) X-Stream (SE2) Projected system with 2x memory bandwidth GraphChi (EX) GraFSoft GraFBoost GraFBoost2 Distributed GraphLab: a framework for machine learning and data mining in the cloud, VLDB 2012 FlashGraph: Processing billion-node graphs on an array of commodity SSDs, FAST 2015 X-Stream: edge-centric graph processing using streaming partitions, SOSP 2013 GraphChi: Large-scale graph computation on just a PC, USENIX

15 Evaluation Environment + 32-core Xeon 128 GB RAM 5x 0.5TB PCIe Flash $8000 All software experiments 4-core i5 4 GB RAM $400 + Virtex 7 FPGA 1TB custom flash 1GB on-board RAM $1000 $??? 15

16 Evaluation Result Overview In-memory Semi-External External GraFBoost Large graphs: Medium graphs: Fail Fail Slow Fast Fail Fast Slow Fast Small graphs: Fast Fast Slow Fast GraFBoost has very low resource requirement Memory, CPU, Power 16

17 Normalized Performance Results with a Large Graph: Synthetic Scale 32 Kronecker Graph 0.5 TB in text, 4 Billion vertices GraphLab (IN) out of memory GraphChi (EX) did not finish x 2.8x 10x GraFSoft 0 PR BFS BC SE1 SE2 GraFBoost GraFBoost2 17

18 Normalized Performance Results with a Large Graph: Web Data Commons Web Crawl 4 2 TB in text, 3 Billion vertices GraphLab (IN) out of memory GraphChi (EX) did not finish GraFSoft 0 PR BFS BC SE1 SE2 GraFBoost GraFBoost2 18

19 Normalized Performance Results with a Large Graph: Web Data Commons Web Crawl 4 2 TB in text, 3 Billion vertices GraphLab (IN) out of memory GraphChi (EX) did not finish 3 Only GraFBoost succeeds in both graphs 2 1 GraFBoost can run still larger graphs! GraFSoft 0 PR BFS BC SE1 SE2 GraFBoost GraFBoost2 19

20 Normalized Performance Results with Smaller Graphs: Breadth-First Search TB 0.04 Billion Fastest! Slowest 0.09 TB 0.3 Billion Slowest 0.3 TB 1 Billion Slowest twitter kron28 kron30 IN SE1 SE2 EX GraFBoost GraFBoost2 GraFSoft 20

21 Normalized Performance Results with a Medium Graph: Against an In-Memory Cluster Synthesized Kronecker Scale TB in text, 0.3 Billion vertices PR BFS 5xIN SE1 SE2 EX GraFBoost GraFSoft 21

22 GB Threads Watts GraFBoost Reduces Resource Requirements Conventional GraFBoost External analytics Conventional GraFBoost Hardware Acceleration Conventional GraFBoost External Analytics + Hardware Acceleration 22

23 Future Work Open-source GraFBoost Cleaning up code for users! Acceleration using Amazon F1 Commercial accelerated storage Collaborating with Samsung More applications using sort-reduce Bioinformatics collaboration with Barcelona Supercomputering Center 23

24 Thank You Flash Storage Sort-Reduce Algorithm Hardware Acceleration 24

25 25

26 BlueDBM Cluster Architecture Host Server (24-Core) FPGA (VC707) minflash minflash 1GB RAM Ethernet Host Server (24-Core) FPGA (VC707) minflash minflash 1 TB PCIe 4GB/s FMC 10Gbps network 8 Host Server (24-Core) FPGA (VC707) minflash minflash Uniform latency of 100 µs! 26

27 The BlueDBM Cluster BlueDBM Storage Device 27

Big Data Analytics Using Hardware-Accelerated Flash Storage

Big Data Analytics Using Hardware-Accelerated Flash Storage Sang-Woo Jun University of California, Irvine (Work done while at MIT) Flash Memory Summit, 2018 A Big Data Application: Personalized Genome