GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei Zhu, Tsinghua University


Transcription:

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei Zhu, Tsinghua University

Widely-Used Graph Processing

Existing Solutions
Shared memory (single-node & in-memory): Ligra, Galois, Polymer
Distributed (multi-node & in-memory): GraphLab, GraphX, PowerLyra
Out-of-core (single-node & disk-based): GraphChi, X-Stream, TurboGraph

Existing Solutions: limitations
Shared-memory systems have limited capability for big graphs: the graph must fit in memory.
For large-scale graphs, the irregular structure causes an imbalance of computation and communication in distributed systems and makes random accesses inevitable, which is especially expensive for the disk accesses of out-of-core systems.

Among these, out-of-core single-machine systems are the most cost-effective option for processing big graphs, but they must cope with expensive disk random accesses.

Methodology
How to handle graphs that are much larger than the memory capacity? Partition!
System        Unit of partitioning
GraphChi      shards
TurboGraph    pages
X-Stream      streaming partitions
PathGraph     tree-based partitions
FlashGraph    pages
GridGraph     chunks & blocks

State-of-the-Art Methodology: X-Stream
Access edges sequentially from disk.
Access vertices randomly inside memory.
Guarantee the locality of vertex accesses by partitioning.

The same methodology applies at each level of the memory hierarchy (cache / memory / disk, covering both the in-memory and the out-of-core case):
Access edges sequentially from the slower level.
Access vertices randomly inside the faster level.
Guarantee the locality of vertex accesses by partitioning.

Edge-Centric Scatter-Gather
Scatter: for each streaming partition, load its source vertex chunk into fast memory, stream its edges, and append an update for each edge to per-partition update buffers.
Gather: for each streaming partition, load its destination vertex chunk into fast memory, stream its updates, and apply them to the vertices.
(X-Stream: Edge-centric Graph Processing using Streaming Partitions, A. Roy et al., SOSP '13)
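
For contrast with GridGraph's approach, here is a minimal in-memory sketch of this scatter-gather pattern, not X-Stream's actual implementation: it streams the whole edge list at once, keeps the per-partition update buffers in memory, and uses a PageRank-style contribution as the update.

```python
# Sketch of edge-centric scatter-gather. Scatter streams edges and appends
# one update per edge to the buffer of the destination's partition; gather
# streams each buffer and applies the updates. On disk, these buffers cost
# O(|E|) extra writes and reads per iteration, the cost GridGraph removes.

def scatter(edges, pr, deg, part_of, num_parts):
    buffers = [[] for _ in range(num_parts)]
    for src, dst in edges:
        buffers[part_of(dst)].append((dst, pr[src] / deg[src]))
    return buffers

def gather(buffers, acc):
    for buf in buffers:
        for dst, contribution in buf:
            acc[dst] += contribution

# Example: the 4-vertex, 7-edge graph used later in this talk, 2 partitions.
edges = [(1, 2), (2, 1), (3, 2), (4, 2), (1, 3), (2, 4), (4, 3)]
deg = {1: 2, 2: 2, 3: 1, 4: 2}
pr = {v: 1.0 for v in deg}
acc = {v: 0.0 for v in deg}
buffers = scatter(edges, pr, deg, part_of=lambda v: (v - 1) // 2, num_parts=2)
gather(buffers, acc)
# acc == {1: 0.5, 2: 2.0, 3: 1.0, 4: 0.5}; the buffers hold 7 updates, one per edge.
```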

Motivation
The update stream written during scatter and read back during gather can be as large as O(|E|).
Question: is it possible to apply updates on the fly, and thus bypass the writes and reads of updates altogether?

Basic Idea
Answer: guarantee the locality of both source and destination vertices while streaming edges.
Streaming-Apply: for each edge block, load its source and destination vertex chunks into memory, stream the block's edges, read from the source vertices and write to the destination vertices on the fly.

Solution
Grid representation
Dual sliding windows
Selective scheduling
2-level hierarchical partitioning

Grid Representation
Vertices are partitioned into P equal-sized chunks, and edges into P x P blocks: an edge goes into block (i, j) if its source lies in chunk i (the row) and its destination in chunk j (the column).
Example with 4 vertices, P = 2, and edges (1,2), (2,1), (3,2), (4,2), (1,3), (2,4), (4,3): block (1,1) holds (1,2) and (2,1); block (2,1) holds (3,2) and (4,2); block (1,2) holds (1,3) and (2,4); block (2,2) holds (4,3).
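
A sketch, under simplified assumptions (in-memory lists instead of on-disk block files, contiguous 1-based vertex IDs), of how edges map onto grid blocks; chunk_of and build_grid are hypothetical helper names, not GridGraph's code.

```python
# Partition vertices 1..V into P equal chunks and edges into a P x P grid
# of blocks: edge (src, dst) goes to block (chunk(src), chunk(dst)).

def chunk_of(v, num_vertices, P):
    # Chunks are contiguous ranges of vertex IDs (1-based here).
    size = (num_vertices + P - 1) // P
    return (v - 1) // size

def build_grid(edges, num_vertices, P):
    blocks = [[[] for _ in range(P)] for _ in range(P)]
    for src, dst in edges:
        blocks[chunk_of(src, num_vertices, P)][chunk_of(dst, num_vertices, P)].append((src, dst))
    return blocks

# The example graph from this slide, with 4 vertices and P = 2:
edges = [(1, 2), (2, 1), (3, 2), (4, 2), (1, 3), (2, 4), (4, 3)]
grid = build_grid(edges, num_vertices=4, P=2)
# grid[0][0] == [(1, 2), (2, 1)]   block (1, 1)
# grid[1][0] == [(3, 2), (4, 2)]   block (2, 1)
# grid[0][1] == [(1, 3), (2, 4)]   block (1, 2)
# grid[1][1] == [(4, 3)]           block (2, 2)
```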

Streaming-Apply Processing Model
Stream edges block by block. Each block corresponds to two vertex chunks, the source chunk and the destination chunk, which fit into memory.
Difference from scatter-gather: 2 phases become 1 phase, and updates are applied on the fly.

Dual Sliding Windows
Access the edge blocks in column-oriented order: as the column moves from left to right, the destination window slides; as the row moves from top to bottom within a column, the source window slides.
This minimizes the write amount: only 1 pass over the destination vertices per iteration.
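
A minimal in-memory sketch of the column-oriented traversal, reusing the P = 2 grid of the example graph; the real system streams blocks from disk and caches vertex chunks, whereas streaming_apply, contribute, and the comments marking where chunks would be loaded and written back are illustrative assumptions.

```python
# Edge blocks of the P = 2 grid from the previous sketch:
grid = [[[(1, 2), (2, 1)], [(1, 3), (2, 4)]],   # row 1: source chunk {1, 2}
        [[(3, 2), (4, 2)], [(4, 3)]]]           # row 2: source chunk {3, 4}

def streaming_apply(grid, P, edge_fn):
    for col in range(P):            # destination window slides left to right;
                                    # the destination vertex chunk is loaded here
        for row in range(P):        # source window slides top to bottom;
                                    # the source vertex chunk is (re)loaded here
            for src, dst in grid[row][col]:
                edge_fn(src, dst)   # update applied on the fly, no update stream
        # destination vertex chunk written back to disk here, once per column

# Accumulate PageRank contributions: NewPR[dst] += PR[src] / Deg[src].
pr = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
deg = {1: 2, 2: 2, 3: 1, 4: 2}
newpr = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}

def contribute(src, dst):
    newpr[dst] += pr[src] / deg[src]

streaming_apply(grid, P=2, edge_fn=contribute)
# newpr == {1: 0.5, 2: 2.0, 3: 1.0, 4: 0.5}, matching the walkthrough on the next slides.
```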

Dual Sliding Windows: PageRank walkthrough (P = 2)
Initialize. Each vertex holds (PR, Deg, NewPR): vertex 1 = (1, 2, 0), vertex 2 = (1, 2, 0), vertex 3 = (1, 1, 0), vertex 4 = (1, 2, 0). I/O amount so far: edges 0, source vertices 0, destination vertices 0.

Stream block (1,1): cache source vertex chunk 1 in memory (miss); cache destination vertex chunk 1 in memory (miss); read edges (1,2) and (2,1) from disk. NewPR becomes (0.5, 0.5, 0, 0). I/O amount: edges 2, source vertices 2, destination vertices 2.

Stream block (2,1): cache source vertex chunk 2 in memory (miss); destination vertex chunk 1 is already cached (hit); read edges (3,2) and (4,2) from disk. NewPR becomes (0.5, 2, 0, 0). I/O amount: edges 4, source vertices 4, destination vertices 2.

Stream block (1,2): cache source vertex chunk 1 in memory (miss); cache destination vertex chunk 2 in memory (miss), writing destination vertex chunk 1 back to disk first; read edges (1,3) and (2,4) from disk. NewPR becomes (0.5, 2, 0.5, 0.5). I/O amount: edges 6, source vertices 6, destination vertices 6.

Stream block (2,2): cache source vertex chunk 2 in memory (miss); destination vertex chunk 2 is already cached (hit); read edge (4,3) from disk. NewPR becomes (0.5, 2, 1, 0.5). I/O amount: edges 7, source vertices 8, destination vertices 6.

The iteration finishes: write destination vertex chunk 2 back to disk. Final I/O amount: edges 7, source vertices 8, destination vertices 8.

I/O Access Amount
For 1 iteration: E + (1 + P) * V.
1 pass over the edges (read).
P passes over the source vertices (read).
1 pass over the destination vertices (read + write).
Implication: P should be the minimum value that allows the needed vertex data to fit into memory.

The same analysis applies one level up the hierarchy (cache / memory / disk): the memory access amount per iteration is also E + (1 + P) * V, which instead suggests making P large enough that the needed vertex data fits into the cache.
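
A small back-of-the-envelope helper for this formula; the byte sizes and the graph size in the example are illustrative assumptions (8-byte edges, 4-byte vertex values as in PageRank-like applications), not measurements from the paper.

```python
# Estimated data volume of one pass, following E + (1 + P) * V:
# one read pass over the edges, P read passes over the source vertices,
# and one combined read+write pass over the destination vertices.

def pass_volume_bytes(num_edges, num_vertices, P, edge_bytes=8, vertex_bytes=4):
    edge_pass = num_edges * edge_bytes
    source_reads = P * num_vertices * vertex_bytes
    destination_pass = num_vertices * vertex_bytes
    return edge_pass + source_reads + destination_pass

# Hypothetical example: 1 billion edges, 50 million vertices, P = 4.
print(pass_volume_bytes(10**9, 5 * 10**7, P=4) / 2**30)   # roughly 8.4 GiB per pass
```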

Selective Scheduling
Skip blocks with no active edges. A very simple but important optimization, effective for many algorithms such as BFS and WCC.

Selective Scheduling example: BFS from vertex 1 on the example graph, with P = 2.
Iteration 1. Before: Active = (true, false, false, false), Parent = (1, -, -, -). Only blocks whose source chunk contains active vertices are streamed: 4 edges accessed. After: Active = (false, true, true, false), Parent = (1, 1, 1, -).

Iteration 2. Before: Active = (false, true, true, false), Parent = (1, 1, 1, -). Both source chunks contain active vertices, so all 7 edges are accessed. After: Active = (false, false, false, true), Parent = (1, 1, 1, 2).

Iteration 3. Before: Active = (false, false, false, true), Parent = (1, 1, 1, 2). Only source chunk 2 is active: 3 edges accessed. After: Active = (false, false, false, false); Parent is unchanged.

BFS finishes with Parent = (1, 1, 1, 2), having accessed 4 + 7 + 3 = 14 edges in all.
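
A compact in-memory sketch of this selective-scheduling BFS on the example graph; chunk-level activity tracking stands in for GridGraph's bitmap-based block skipping, and bfs_selective and chunk_of are hypothetical names.

```python
# BFS with selective scheduling: a block (row, col) is streamed only when
# its source chunk (row) contains at least one active vertex.

def bfs_selective(grid, P, chunk_of, root):
    parent = {root: root}
    active = {root}
    edges_accessed = 0
    while active:
        active_chunks = {chunk_of(v) for v in active}
        next_active = set()
        for col in range(P):
            for row in range(P):
                if row not in active_chunks:
                    continue                    # skip block: no active source vertices
                for src, dst in grid[row][col]:
                    edges_accessed += 1
                    if src in active and dst not in parent:
                        parent[dst] = src
                        next_active.add(dst)
        active = next_active
    return parent, edges_accessed

grid = [[[(1, 2), (2, 1)], [(1, 3), (2, 4)]],
        [[(3, 2), (4, 2)], [(4, 3)]]]
parent, accessed = bfs_selective(grid, P=2, chunk_of=lambda v: (v - 1) // 2, root=1)
# parent == {1: 1, 2: 1, 3: 1, 4: 2}; accessed == 14 (4 + 7 + 3), as in the walkthrough.
```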

Impact of P on Selective Scheduling
For BFS from vertex 1 on the example graph:
P = 1: 21 edge accesses (7 + 7 + 7)
P = 2: 14 edge accesses (4 + 7 + 3)
P = 4: 7 edge accesses (2 + 3 + 2)
The effect improves with more fine-grained partitioning. Implication: a larger value of P is preferred.

Dilemma on the Selection of P
Coarse-grained (small P): fewer accesses to vertices, but poorer locality and less selective scheduling.
Fine-grained (large P): better locality and more selective scheduling, but more accesses to vertices.

Dilemma from the Memory Hierarchy
Different levels of the hierarchy call for different selections of P: the disk/memory level wants the hot vertex data to fit into memory, while the memory/cache level wants it to fit into the cache. Which one should we choose?

2-Level Hierarchical Partitioning
Apply a Q x Q partitioning over the P x P grid, grouping the small blocks into larger ones, with Q ≈ V / M and P ≈ V / C; since C << M, P >> Q. (Example: P = 4, Q = 2.)
(P, Q = numbers of partitions at the two levels, M = memory capacity, C = last-level cache capacity, V = size of the vertex data.)
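
A sketch of how P and Q might be derived from the hierarchy sizes; choose_partitions and the memory and cache capacities in the example are illustrative assumptions, not GridGraph's exact sizing policy.

```python
import math

# Choose Q so that vertex chunks fit in memory and P so that they fit in the
# last-level cache; since C << M, P >> Q, and each of the Q x Q memory-level
# blocks groups (P/Q) x (P/Q) cache-level blocks.

def choose_partitions(vertex_data_bytes, memory_bytes, cache_bytes):
    Q = max(1, math.ceil(vertex_data_bytes / memory_bytes))
    P = max(Q, math.ceil(vertex_data_bytes / cache_bytes))
    P = math.ceil(P / Q) * Q          # make P a multiple of Q so blocks nest evenly
    return P, Q

# Hypothetical example: 4 GB of vertex data, 1 GB usable memory, 32 MB LLC.
P, Q = choose_partitions(4 * 2**30, 1 * 2**30, 32 * 2**20)
print(P, Q)   # P = 128, Q = 4: each memory-level block covers 32 x 32 cache-level blocks
```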

Programming Interface
StreamVertices(Fv, F)
StreamEdges(Fe, F)
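
In GridGraph these are C++ functions taking user-defined vertex and edge functions (Fv, Fe) and an optional filter F over active vertices. As a rough illustration of how an application maps onto the two primitives, here is a Python analog of one PageRank iteration; stream_vertices and stream_edges are hypothetical stand-ins, not the real API.

```python
# Hypothetical stand-ins for StreamVertices(Fv, F) and StreamEdges(Fe, F).
# Both apply a user function and return the sum of its results (a reduction);
# the optional 'active' set plays the role of the filter F.

def stream_vertices(fv, vertices, active=None):
    return sum(fv(v) for v in vertices if active is None or v in active)

def stream_edges(fe, edges, active=None):
    return sum(fe(src, dst) for src, dst in edges
               if active is None or src in active)

# One PageRank iteration on the example graph, expressed with the primitives.
edges = [(1, 2), (2, 1), (3, 2), (4, 2), (1, 3), (2, 4), (4, 3)]
deg = {1: 2, 2: 2, 3: 1, 4: 2}
pr = {v: 1.0 for v in deg}
newpr = {v: 0.0 for v in deg}
damping = 0.85

def contribute(src, dst):
    newpr[dst] += pr[src] / deg[src]
    return 1

def apply_rank(v):
    pr[v] = 1 - damping + damping * newpr[v]   # unnormalized PageRank variant
    return 1

stream_edges(contribute, edges)       # scatter contributions along edges
stream_vertices(apply_rank, pr)       # fold them into the new PageRank values
```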

Evaluation
Test environment: AWS EC2 i2.xlarge (4 hyperthreaded cores, 30.5 GB memory, 800 GB SSD) and AWS EC2 d2.xlarge (4 hyperthreaded cores, 30.5 GB memory, 3 x 2 TB HDD).
Applications: BFS, WCC, SpMV, PageRank.

(Bar charts: runtimes of GraphChi, X-Stream, and GridGraph for BFS, WCC, SpMV, and PageRank on the LiveJournal, Twitter, and UK graphs.)
On the Yahoo graph (i2.xlarge, memory limited to 8 GB), GraphChi fails to finish BFS within 48 hours and X-Stream fails to finish both BFS and WCC, while GridGraph completes all four applications.

Disk Bandwidth Usage
(Figure: I/O throughput sampled over an interval while running PageRank on the Yahoo graph.)

Effect of Dual Sliding Windows 800 700 600 I/O Amount 500 400 300 00 00 GraphChi X-Stream GridGraph 0 Reads Writes PageRank on Yahoo 38

Effect of Selective Scheduling 80 70 60 I/O Amount 50 40 30 0 0 X-Stream GridGraph 0 3 4 5 6 7 8 9 0 3 4 5 Iteration WCC on Twitter 39

Impact of P on In-Memory Performance
(Chart: runtime of PageRank on the Twitter graph, with the edges cached in the 30.5 GB memory, as P varies from 1 to over 1000.)
The optimum is reached where the needed vertex data fits into the L3 cache.

Impact of Q on Out-of-Core Performance
(Chart: runtime of SpMV on the Yahoo graph with memory limited to 8 GB, as Q varies.)
The optimum is reached at the smallest Q for which the needed vertex data fits into memory.

Comparison with Distributed Systems
PowerGraph and GraphX*: a cluster of m2.4xlarge instances. GridGraph: a single i2.4xlarge instance with 4 SSDs and about 1.8 GB/s of aggregate disk bandwidth, at a lower hourly cost.
(Chart: runtime comparison of PowerGraph, GraphX, and GridGraph.)
* GraphX: Graph Processing in a Distributed Dataflow Framework, J. E. Gonzalez et al., OSDI '14

Conclusion
GridGraph:
Dual sliding windows reduce the I/O amount, especially writes.
Selective scheduling reduces unnecessary I/O.
2-level hierarchical grid partitioning makes the approach applicable to the 3-level (cache / memory / disk) hierarchy.

Thanks!