GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei Zhu, Tsinghua University


Transcription:

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei Zhu, Tsinghua University

Widely-Used Graph Processing

Existing Solutions
Shared memory (single-node & in-memory): Ligra, Galois, Polymer
Distributed (multi-node & in-memory): GraphLab, GraphX, PowerLyra
Out-of-core (single-node & disk-based): GraphChi, X-Stream, TurboGraph

Existing Solutions: limitations
Shared-memory systems have limited capability for big graphs: the graph must fit in memory.
For large-scale graphs, the irregular structure causes an imbalance of computation and communication in distributed systems and makes random accesses inevitable, which is especially expensive for the disk accesses of out-of-core systems.

Among these, out-of-core single-machine systems are the most cost-effective option for processing big graphs, but they must cope with expensive disk random accesses.

Methodology
How to handle graphs that are much larger than the memory capacity? Partition!
System        Unit of partitioning
GraphChi      shards
TurboGraph    pages
X-Stream      streaming partitions
PathGraph     tree-based partitions
FlashGraph    pages
GridGraph     chunks & blocks

State-of-the-Art Methodology: X-Stream
Access edges sequentially from disk.
Access vertices randomly inside memory.
Guarantee the locality of vertex accesses by partitioning.

The same methodology applies at each level of the memory hierarchy (cache / memory / disk, covering both the in-memory and the out-of-core case):
Access edges sequentially from the slower level.
Access vertices randomly inside the faster level.
Guarantee the locality of vertex accesses by partitioning.

Edge-Centric Scatter-Gather
Scatter: for each streaming partition, load its source vertex chunk into fast memory, stream its edges, and append an update for each edge to per-partition update buffers.
Gather: for each streaming partition, load its destination vertex chunk into fast memory, stream its updates, and apply them to the vertices.
(X-Stream: Edge-centric Graph Processing using Streaming Partitions, A. Roy et al., SOSP '13)
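
For contrast with GridGraph's approach, here is a minimal in-memory sketch of this scatter-gather pattern, not X-Stream's actual implementation: it streams the whole edge list at once, keeps the per-partition update buffers in memory, and uses a PageRank-style contribution as the update.

```python
# Sketch of edge-centric scatter-gather. Scatter streams edges and appends
# one update per edge to the buffer of the destination's partition; gather
# streams each buffer and applies the updates. On disk, these buffers cost
# O(|E|) extra writes and reads per iteration, the cost GridGraph removes.

def scatter(edges, pr, deg, part_of, num_parts):
    buffers = [[] for _ in range(num_parts)]
    for src, dst in edges:
        buffers[part_of(dst)].append((dst, pr[src] / deg[src]))
    return buffers

def gather(buffers, acc):
    for buf in buffers:
        for dst, contribution in buf:
            acc[dst] += contribution

# Example: the 4-vertex, 7-edge graph used later in this talk, 2 partitions.
edges = [(1, 2), (2, 1), (3, 2), (4, 2), (1, 3), (2, 4), (4, 3)]
deg = {1: 2, 2: 2, 3: 1, 4: 2}
pr = {v: 1.0 for v in deg}
acc = {v: 0.0 for v in deg}
buffers = scatter(edges, pr, deg, part_of=lambda v: (v - 1) // 2, num_parts=2)
gather(buffers, acc)
# acc == {1: 0.5, 2: 2.0, 3: 1.0, 4: 0.5}; the buffers hold 7 updates, one per edge.
```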

Motivation
The update stream written during scatter and read back during gather can be as large as O(|E|).
Question: is it possible to apply updates on the fly, and thus bypass the writes and reads of updates altogether?

Basic Idea
Answer: guarantee the locality of both source and destination vertices while streaming edges.
Streaming-Apply: for each edge block, load its source and destination vertex chunks into memory, stream the block's edges, read from the source vertices and write to the destination vertices on the fly.

Solution
Grid representation
Dual sliding windows
Selective scheduling
2-level hierarchical partitioning

Grid Representation
Vertices are partitioned into P equal-sized chunks, and edges into P x P blocks: an edge goes into block (i, j) if its source lies in chunk i (the row) and its destination in chunk j (the column).
Example with 4 vertices, P = 2, and edges (1,2), (2,1), (3,2), (4,2), (1,3), (2,4), (4,3): block (1,1) holds (1,2) and (2,1); block (2,1) holds (3,2) and (4,2); block (1,2) holds (1,3) and (2,4); block (2,2) holds (4,3).
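
A sketch, under simplified assumptions (in-memory lists instead of on-disk block files, contiguous 1-based vertex IDs), of how edges map onto grid blocks; chunk_of and build_grid are hypothetical helper names, not GridGraph's code.

```python
# Partition vertices 1..V into P equal chunks and edges into a P x P grid
# of blocks: edge (src, dst) goes to block (chunk(src), chunk(dst)).

def chunk_of(v, num_vertices, P):
    # Chunks are contiguous ranges of vertex IDs (1-based here).
    size = (num_vertices + P - 1) // P
    return (v - 1) // size

def build_grid(edges, num_vertices, P):
    blocks = [[[] for _ in range(P)] for _ in range(P)]
    for src, dst in edges:
        blocks[chunk_of(src, num_vertices, P)][chunk_of(dst, num_vertices, P)].append((src, dst))
    return blocks

# The example graph from this slide, with 4 vertices and P = 2:
edges = [(1, 2), (2, 1), (3, 2), (4, 2), (1, 3), (2, 4), (4, 3)]
grid = build_grid(edges, num_vertices=4, P=2)
# grid[0][0] == [(1, 2), (2, 1)]   block (1, 1)
# grid[1][0] == [(3, 2), (4, 2)]   block (2, 1)
# grid[0][1] == [(1, 3), (2, 4)]   block (1, 2)
# grid[1][1] == [(4, 3)]           block (2, 2)
```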

Streaming-Apply Processing Model
Stream edges block by block. Each block corresponds to two vertex chunks, the source chunk and the destination chunk, which fit into memory.
Difference from scatter-gather: 2 phases become 1 phase, and updates are applied on the fly.

Dual Sliding Windows
Access the edge blocks in column-oriented order: as the column moves from left to right, the destination window slides; as the row moves from top to bottom within a column, the source window slides.
This minimizes the write amount: only 1 pass over the destination vertices per iteration.
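
A minimal in-memory sketch of the column-oriented traversal, reusing the P = 2 grid of the example graph; the real system streams blocks from disk and caches vertex chunks, whereas streaming_apply, contribute, and the comments marking where chunks would be loaded and written back are illustrative assumptions.

```python
# Edge blocks of the P = 2 grid from the previous sketch:
grid = [[[(1, 2), (2, 1)], [(1, 3), (2, 4)]],   # row 1: source chunk {1, 2}
        [[(3, 2), (4, 2)], [(4, 3)]]]           # row 2: source chunk {3, 4}

def streaming_apply(grid, P, edge_fn):
    for col in range(P):            # destination window slides left to right;
                                    # the destination vertex chunk is loaded here
        for row in range(P):        # source window slides top to bottom;
                                    # the source vertex chunk is (re)loaded here
            for src, dst in grid[row][col]:
                edge_fn(src, dst)   # update applied on the fly, no update stream
        # destination vertex chunk written back to disk here, once per column

# Accumulate PageRank contributions: NewPR[dst] += PR[src] / Deg[src].
pr = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
deg = {1: 2, 2: 2, 3: 1, 4: 2}
newpr = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}

def contribute(src, dst):
    newpr[dst] += pr[src] / deg[src]

streaming_apply(grid, P=2, edge_fn=contribute)
# newpr == {1: 0.5, 2: 2.0, 3: 1.0, 4: 0.5}, matching the walkthrough on the next slides.
```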

Dual Sliding Windows: PageRank walkthrough (P = 2)
Initialize. Each vertex holds (PR, Deg, NewPR): vertex 1 = (1, 2, 0), vertex 2 = (1, 2, 0), vertex 3 = (1, 1, 0), vertex 4 = (1, 2, 0). I/O amount so far: edges 0, source vertices 0, destination vertices 0.

Stream block (1,1): cache source vertex chunk 1 in memory (miss); cache destination vertex chunk 1 in memory (miss); read edges (1,2) and (2,1) from disk. NewPR becomes (0.5, 0.5, 0, 0). I/O amount: edges 2, source vertices 2, destination vertices 2.

Stream block (2,1): cache source vertex chunk 2 in memory (miss); destination vertex chunk 1 is already cached (hit); read edges (3,2) and (4,2) from disk. NewPR becomes (0.5, 2, 0, 0). I/O amount: edges 4, source vertices 4, destination vertices 2.

Stream block (1,2): cache source vertex chunk 1 in memory (miss); cache destination vertex chunk 2 in memory (miss), writing destination vertex chunk 1 back to disk first; read edges (1,3) and (2,4) from disk. NewPR becomes (0.5, 2, 0.5, 0.5). I/O amount: edges 6, source vertices 6, destination vertices 6.

Stream block (2,2): cache source vertex chunk 2 in memory (miss); destination vertex chunk 2 is already cached (hit); read edge (4,3) from disk. NewPR becomes (0.5, 2, 1, 0.5). I/O amount: edges 7, source vertices 8, destination vertices 6.

The iteration finishes: write destination vertex chunk 2 back to disk. Final I/O amount: edges 7, source vertices 8, destination vertices 8.

I/O Access Amount
For 1 iteration: E + (1 + P) * V.
1 pass over the edges (read).
P passes over the source vertices (read).
1 pass over the destination vertices (read + write).
Implication: P should be the minimum value that allows the needed vertex data to fit into memory.

The same analysis applies one level up the hierarchy (cache / memory / disk): the memory access amount per iteration is also E + (1 + P) * V, which instead suggests making P large enough that the needed vertex data fits into the cache.
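
A small back-of-the-envelope helper for this formula; the byte sizes and the graph size in the example are illustrative assumptions (8-byte edges, 4-byte vertex values as in PageRank-like applications), not measurements from the paper.

```python
# Estimated data volume of one pass, following E + (1 + P) * V:
# one read pass over the edges, P read passes over the source vertices,
# and one combined read+write pass over the destination vertices.

def pass_volume_bytes(num_edges, num_vertices, P, edge_bytes=8, vertex_bytes=4):
    edge_pass = num_edges * edge_bytes
    source_reads = P * num_vertices * vertex_bytes
    destination_pass = num_vertices * vertex_bytes
    return edge_pass + source_reads + destination_pass

# Hypothetical example: 1 billion edges, 50 million vertices, P = 4.
print(pass_volume_bytes(10**9, 5 * 10**7, P=4) / 2**30)   # roughly 8.4 GiB per pass
```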

Selective Scheduling
Skip blocks with no active edges. A very simple but important optimization, effective for many algorithms such as BFS and WCC.

Selective Scheduling example: BFS from vertex 1 on the example graph, with P = 2.
Iteration 1. Before: Active = (true, false, false, false), Parent = (1, -, -, -). Only blocks whose source chunk contains active vertices are streamed: 4 edges accessed. After: Active = (false, true, true, false), Parent = (1, 1, 1, -).

Iteration 2. Before: Active = (false, true, true, false), Parent = (1, 1, 1, -). Both source chunks contain active vertices, so all 7 edges are accessed. After: Active = (false, false, false, true), Parent = (1, 1, 1, 2).

Iteration 3. Before: Active = (false, false, false, true), Parent = (1, 1, 1, 2). Only source chunk 2 is active: 3 edges accessed. After: Active = (false, false, false, false); Parent is unchanged.

BFS finishes with Parent = (1, 1, 1, 2), having accessed 4 + 7 + 3 = 14 edges in all.
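
A compact in-memory sketch of this selective-scheduling BFS on the example graph; chunk-level activity tracking stands in for GridGraph's bitmap-based block skipping, and bfs_selective and chunk_of are hypothetical names.

```python
# BFS with selective scheduling: a block (row, col) is streamed only when
# its source chunk (row) contains at least one active vertex.

def bfs_selective(grid, P, chunk_of, root):
    parent = {root: root}
    active = {root}
    edges_accessed = 0
    while active:
        active_chunks = {chunk_of(v) for v in active}
        next_active = set()
        for col in range(P):
            for row in range(P):
                if row not in active_chunks:
                    continue                    # skip block: no active source vertices
                for src, dst in grid[row][col]:
                    edges_accessed += 1
                    if src in active and dst not in parent:
                        parent[dst] = src
                        next_active.add(dst)
        active = next_active
    return parent, edges_accessed

grid = [[[(1, 2), (2, 1)], [(1, 3), (2, 4)]],
        [[(3, 2), (4, 2)], [(4, 3)]]]
parent, accessed = bfs_selective(grid, P=2, chunk_of=lambda v: (v - 1) // 2, root=1)
# parent == {1: 1, 2: 1, 3: 1, 4: 2}; accessed == 14 (4 + 7 + 3), as in the walkthrough.
```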

Impact of P on Selective Scheduling
For BFS from vertex 1 on the example graph:
P = 1: 21 edge accesses (7 + 7 + 7)
P = 2: 14 edge accesses (4 + 7 + 3)
P = 4: 7 edge accesses (2 + 3 + 2)
The effect improves with more fine-grained partitioning. Implication: a larger value of P is preferred.

Dilemma on the Selection of P
Coarse-grained (small P): fewer accesses to vertices, but poorer locality and less selective scheduling.
Fine-grained (large P): better locality and more selective scheduling, but more accesses to vertices.

Dilemma from the Memory Hierarchy
Different levels of the hierarchy call for different selections of P: the disk/memory level wants the hot vertex data to fit into memory, while the memory/cache level wants it to fit into the cache. Which one should we choose?

2-Level Hierarchical Partitioning
Apply a Q x Q partitioning over the P x P grid, grouping the small blocks into larger ones, with Q ≈ V / M and P ≈ V / C; since C << M, P >> Q. (Example: P = 4, Q = 2.)
(P, Q = numbers of partitions at the two levels, M = memory capacity, C = last-level cache capacity, V = size of the vertex data.)
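
A sketch of how P and Q might be derived from the hierarchy sizes; choose_partitions and the memory and cache capacities in the example are illustrative assumptions, not GridGraph's exact sizing policy.

```python
import math

# Choose Q so that vertex chunks fit in memory and P so that they fit in the
# last-level cache; since C << M, P >> Q, and each of the Q x Q memory-level
# blocks groups (P/Q) x (P/Q) cache-level blocks.

def choose_partitions(vertex_data_bytes, memory_bytes, cache_bytes):
    Q = max(1, math.ceil(vertex_data_bytes / memory_bytes))
    P = max(Q, math.ceil(vertex_data_bytes / cache_bytes))
    P = math.ceil(P / Q) * Q          # make P a multiple of Q so blocks nest evenly
    return P, Q

# Hypothetical example: 4 GB of vertex data, 1 GB usable memory, 32 MB LLC.
P, Q = choose_partitions(4 * 2**30, 1 * 2**30, 32 * 2**20)
print(P, Q)   # P = 128, Q = 4: each memory-level block covers 32 x 32 cache-level blocks
```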

Programming Interface
StreamVertices(Fv, F)
StreamEdges(Fe, F)
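
In GridGraph these are C++ functions taking user-defined vertex and edge functions (Fv, Fe) and an optional filter F over active vertices. As a rough illustration of how an application maps onto the two primitives, here is a Python analog of one PageRank iteration; stream_vertices and stream_edges are hypothetical stand-ins, not the real API.

```python
# Hypothetical stand-ins for StreamVertices(Fv, F) and StreamEdges(Fe, F).
# Both apply a user function and return the sum of its results (a reduction);
# the optional 'active' set plays the role of the filter F.

def stream_vertices(fv, vertices, active=None):
    return sum(fv(v) for v in vertices if active is None or v in active)

def stream_edges(fe, edges, active=None):
    return sum(fe(src, dst) for src, dst in edges
               if active is None or src in active)

# One PageRank iteration on the example graph, expressed with the primitives.
edges = [(1, 2), (2, 1), (3, 2), (4, 2), (1, 3), (2, 4), (4, 3)]
deg = {1: 2, 2: 2, 3: 1, 4: 2}
pr = {v: 1.0 for v in deg}
newpr = {v: 0.0 for v in deg}
damping = 0.85

def contribute(src, dst):
    newpr[dst] += pr[src] / deg[src]
    return 1

def apply_rank(v):
    pr[v] = 1 - damping + damping * newpr[v]   # unnormalized PageRank variant
    return 1

stream_edges(contribute, edges)       # scatter contributions along edges
stream_vertices(apply_rank, pr)       # fold them into the new PageRank values
```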

Evaluation
Test environment: AWS EC2 i2.xlarge (4 hyperthreaded cores, 30.5 GB memory, 800 GB SSD) and AWS EC2 d2.xlarge (4 hyperthreaded cores, 30.5 GB memory, 3 x 2 TB HDD).
Applications: BFS, WCC, SpMV, PageRank.

(Bar charts: runtimes of GraphChi, X-Stream, and GridGraph for BFS, WCC, SpMV, and PageRank on the LiveJournal, Twitter, and UK graphs.)
On the Yahoo graph (i2.xlarge, memory limited to 8 GB), GraphChi fails to finish BFS within 48 hours and X-Stream fails to finish both BFS and WCC, while GridGraph completes all four applications.

Disk Bandwidth Usage
(Figure: I/O throughput sampled over an interval while running PageRank on the Yahoo graph.)

Effect of Dual Sliding Windows 800 700 600 I/O Amount 500 400 300 00 00 GraphChi X-Stream GridGraph 0 Reads Writes PageRank on Yahoo 38

Effect of Selective Scheduling 80 70 60 I/O Amount 50 40 30 0 0 X-Stream GridGraph 0 3 4 5 6 7 8 9 0 3 4 5 Iteration WCC on Twitter 39

Impact of P on In-Memory Performance
(Chart: runtime of PageRank on the Twitter graph, with the edges cached in the 30.5 GB memory, as P varies from 1 to over 1000.)
The optimum is reached where the needed vertex data fits into the L3 cache.

Impact of Q on Out-of-Core Performance
(Chart: runtime of SpMV on the Yahoo graph with memory limited to 8 GB, as Q varies.)
The optimum is reached at the smallest Q for which the needed vertex data fits into memory.

Comparison with Distributed Systems
PowerGraph and GraphX*: a cluster of m2.4xlarge instances. GridGraph: a single i2.4xlarge instance with 4 SSDs and about 1.8 GB/s of aggregate disk bandwidth, at a lower hourly cost.
(Chart: runtime comparison of PowerGraph, GraphX, and GridGraph.)
* GraphX: Graph Processing in a Distributed Dataflow Framework, J. E. Gonzalez et al., OSDI '14

Conclusion
GridGraph:
Dual sliding windows reduce the I/O amount, especially writes.
Selective scheduling reduces unnecessary I/O.
2-level hierarchical grid partitioning makes the approach applicable to the 3-level (cache / memory / disk) hierarchy.

Thanks!