RStream: Marrying Relational Algebra with Streaming for Efficient Graph Mining on a Single Machine

1 RStream: Marrying Relational Algebra with Streaming for Efficient Graph Mining on a Single Machine
Guoqing Harry Xu, Kai Wang, Zhiqiang Zuo, John Thorpe, Tien Quang Nguyen
UCLA, Nanjing University, Facebook

4 Big Graph
Graph datasets
Graph systems: GraphChi, GridGraph

13 Graph Analytical Problems
Graph Computation
- Examples: PageRank, Connected Component
- Iterative value computation
- Think Like a Vertex (GraphChi)
Graph Mining
- Examples: Frequent Subgraph Mining, Clique Finding
- Discover structural patterns
- ?

14 Existing Mining Systems
Enumerate all possible subgraphs
For each subgraph, check if it matches the pattern
Pattern is application-specific (clique finding, motif counting, frequent subgraph mining)
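
The enumerate-then-check approach on this slide can be illustrated with a minimal sketch. It assumes an undirected graph given as an edge list and uses clique finding as the application-specific pattern; the function name `find_cliques` is hypothetical, and the candidates here are vertex subsets rather than every possible subgraph.

```python
from itertools import combinations


def find_cliques(edges, k):
    """Enumerate every k-vertex subset and keep those that form a clique."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    def is_clique(vertices):
        # Application-specific pattern check: every pair must be connected.
        return all(b in adjacency[a] for a, b in combinations(vertices, 2))

    # Enumeration step: all k-vertex candidates, filtered by the pattern check.
    return [c for c in combinations(sorted(adjacency), k) if is_clique(c)]


edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(find_cliques(edges, 3))
```

The number of candidates is the binomial coefficient of the vertex count and k, which is why this brute-force style becomes expensive as the subgraph size grows.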

15 Existing Datalog Systems
Relational predicates
- TC(a, b, c) :- R(a, b), a < b, R(b, c), b < c, R(c, a)
- count TC(a, b, c)
Relational algebra enables composition of small structures into big structures
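
As an illustration of how the triangle rule composes small structures into a bigger one, here is a minimal sketch that evaluates it directly in Python. It assumes R stores both directions of every undirected edge, so that the ordering constraints a < b < c keep exactly one tuple per triangle; the function name is hypothetical and this is not how any particular Datalog engine evaluates the rule.

```python
def count_triangles_datalog(edges):
    """Evaluate TC(a, b, c) with constraints a < b < c and count the results."""
    # R holds both directions of every undirected edge (an assumption).
    R = {(u, v) for u, v in edges} | {(v, u) for u, v in edges}
    successors = {}
    for a, b in R:
        successors.setdefault(a, set()).add(b)

    tc = {
        (a, b, c)
        for (a, b) in R if a < b                    # R(a, b), a < b
        for c in successors.get(b, ()) if b < c     # R(b, c), b < c
        if (c, a) in R                              # R(c, a)
    }
    return len(tc)


print(count_triangles_datalog([(1, 2), (2, 3), (1, 3), (3, 4)]))
```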

16 Challenges in Graph Mining
# of subgraphs grows exponentially with the size of subgraphs
[Figure: # of subgraphs vs. size of subgraphs, growing exponentially; source: Arabesque [CHC Teixeira et al., SOSP 5]]

17 Problems with Distributed Mining Systems
Suffer from large startup and communication overhead
- Arabesque on 0-node cluster, 5s startup, s execution
Need enterprise clusters with large amounts of memory
- DistGraph on 8-node cluster, ,768gb memory
Poor load balancing due to dynamic working sets
- some nodes out of memory, other nodes with memory usage < 0%

18 Problems with Datalog Systems
Programming model is not expressive enough for complex graph mining algorithms

20 Thoughts and Insight
Distributed mining systems drawbacks: large startup, underutilized CPUs, poor load balancing
Increasingly large SSDs
Not all users have access to an enterprise cluster
Many users are domain experts with limited background in hosting a cluster

21 Our Proposal: RStream
A single machine, out-of-core graph mining system
A simple and expressive API
- Gather-Apply-Scatter + Relational Algebra => GRAS
An efficient runtime engine
- implements relational algebra with streaming

24 GAS
Gather information from neighbor vertices
Apply and update the vertex property
Scatter information to neighbor vertices
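
The three GAS steps can be sketched on Connected Component, one of the graph computation problems named earlier. This is a minimal in-memory illustration under the assumption of an undirected edge list and minimum-label propagation; the function name is hypothetical and it does not reproduce any specific system's API.

```python
def connected_components(edges):
    """Label every vertex with the smallest vertex ID in its component."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    label = {v: v for v in neighbors}
    active = set(neighbors)
    while active:
        next_active = set()
        for v in active:
            # Gather: collect information (labels) from neighbor vertices.
            gathered = min(label[n] for n in neighbors[v])
            # Apply: update the vertex property if a smaller label was seen.
            if gathered < label[v]:
                label[v] = gathered
                # Scatter: tell neighbors to recompute in the next round.
                next_active.update(neighbors[v])
        active = next_active
    return label


print(connected_components([(1, 2), (2, 3), (4, 5)]))
```

The loop ends when no vertex changes its value, which is the usual stopping condition for this kind of iterative value computation.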

29 GRAS
GAS supports iterative graph processing
Relational Algebra enables composition of structures
GRAS: iterative composition of structures

30 Edge Streaming
X-Stream [A Roy et al., SOSP ]
Use streaming to reduce I/O costs
Sequentially access (larger) datasets from disk, randomly access (smaller) datasets held in memory

31 Edge Streaming
A graph is partitioned into streaming partitions. Each streaming partition contains
- Vertex Table (VID, Value)
- Edge Table (Src, Dest)
- Update Table (Value, Dest)
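
The three tables of a streaming partition can be pictured with a minimal sketch. The class and function names are hypothetical, and two details are assumptions made only for illustration: vertices are assigned to partitions by ID modulo the partition count, and each edge is stored in the partition that owns its source vertex.

```python
from dataclasses import dataclass, field


@dataclass
class StreamingPartition:
    vertex_table: dict = field(default_factory=dict)  # VID -> Value
    edge_table: list = field(default_factory=list)    # (Src, Dest)
    update_table: list = field(default_factory=list)  # (Value, Dest)


def partition_graph(edges, num_partitions, initial_value=0):
    """Split an edge list into streaming partitions (assumed modulo placement)."""
    parts = [StreamingPartition() for _ in range(num_partitions)]
    for src, dest in edges:
        # Each vertex lives in the vertex table of the partition that owns it.
        parts[src % num_partitions].vertex_table.setdefault(src, initial_value)
        parts[dest % num_partitions].vertex_table.setdefault(dest, initial_value)
        # Assumption: an edge is stored with the partition of its source.
        parts[src % num_partitions].edge_table.append((src, dest))
    return parts


parts = partition_graph([(0, 1), (1, 2), (2, 0), (3, 0)], num_partitions=2)
for p in parts:
    print(p)
```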

32 Streaming for Scatter/Gather
[Figure: Scatter phase (Streaming, Load, Shuffle): a streaming partition's Edge Table (src, dest) and Vertex Table (ID, value) produce (value, dest) entries in Update Tables. Gather/Apply phase (Streaming, Load): Update Table entries a and b for the same dest are combined into a+b in the Vertex Table.]
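
A minimal sketch of the two streaming phases, assuming partitions are plain dictionaries with "vertices", "edges", and "updates" entries and that vertices are placed by ID modulo the partition count. These names and the placement rule are hypothetical; the addition used to combine updates mirrors the a+b shown on the slide.

```python
from operator import add


def scatter(partitions, owner):
    """Stream each edge table and shuffle (value, dest) updates to dest's owner."""
    for part in partitions:
        for src, dest in part["edges"]:               # streamed sequentially
            update = (part["vertices"][src], dest)    # random access in memory
            partitions[owner(dest)]["updates"].append(update)  # shuffle


def gather_apply(partitions, combine):
    """Stream each update table and fold the values into the vertex table."""
    for part in partitions:
        for value, dest in part["updates"]:
            part["vertices"][dest] = combine(part["vertices"][dest], value)
        part["updates"].clear()


partitions = [
    {"vertices": {0: 1, 2: 3}, "edges": [(0, 1), (2, 1)], "updates": []},
    {"vertices": {1: 0}, "edges": [(1, 0)], "updates": []},
]
scatter(partitions, owner=lambda vid: vid % 2)
gather_apply(partitions, combine=add)
print(partitions[1]["vertices"])
```

In this toy run vertex 1 receives the values of vertices 0 and 2, so its entry becomes the sum of both updates.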

33 RStream API
[Figure: an RStream program as a sequence of Scatter, GatherApply, and Relational phases]

37 Example: Triangle Counting
Phases: Scatter, R, R
Scatter: edge table (src, dest) and vertex table (VID, value) produce the update table
R: update table (a, b) joined with edge table (b, c) yields (a, b, c)
R: update table (a, b, c) joined with edge table (c, a) yields (a, b, c, a)
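
The Scatter phase followed by two relational phases can be sketched as two joins against the edge table. This is an in-memory illustration with hypothetical function names, assuming a directed edge table; the extra filter that keeps only tuples whose first vertex is the smallest is an assumption added here so each directed triangle is counted once instead of once per rotation.

```python
def join(update_table, edge_table, key_column):
    """Extend each update tuple with every edge whose src equals tuple[key_column]."""
    by_src = {}
    for src, dest in edge_table:
        by_src.setdefault(src, []).append(dest)
    return [t + (dest,) for t in update_table for dest in by_src.get(t[key_column], [])]


def count_triangles(edge_table):
    # Scatter: every edge becomes an initial update tuple (a, b).
    updates = list(edge_table)
    # R: (a, b) joined with (b, c) yields (a, b, c).
    wedges = [t for t in join(updates, edge_table, key_column=1)
              if t[0] < t[1] and t[0] < t[2]]  # assumed rotation filter
    # R: (a, b, c) joined with (c, a) yields (a, b, c, a).
    closed = [t for t in join(wedges, edge_table, key_column=2) if t[3] == t[0]]
    return len(closed)


print(count_triangles([(1, 2), (2, 3), (3, 1), (3, 4)]))
```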

38 Outline
How to provide a general programming interface for graph mining algorithms?
How to implement relational operators efficiently for graphs?

40 Streaming for Join Operator
[Figure: a streaming partition's Update Table is joined with its Edge Table (Src, Dest) using a Locality-Aware Join; steps labeled Streaming, Load, and Shuffle; results go to Update Tables]
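
A minimal sketch of a join implemented with streaming: the edge table is loaded into an in-memory index, the update table is read sequentially in small batches, and each output tuple is shuffled to a target partition. The batch size, the join on the last column, and the modulo-based shuffle are all assumptions for illustration, and the sketch does not model the locality-aware part of the join.

```python
from itertools import islice


def stream(table, batch_size):
    """Yield a table in fixed-size batches, standing in for sequential disk reads."""
    it = iter(table)
    while batch := list(islice(it, batch_size)):
        yield batch


def streaming_join(update_table, edge_table, num_partitions, batch_size=2):
    # Load: the (smaller) edge table is indexed in memory by source vertex.
    index = {}
    for src, dest in edge_table:
        index.setdefault(src, []).append(dest)

    out = [[] for _ in range(num_partitions)]
    # Streaming: the (larger) update table is scanned once, batch by batch.
    for batch in stream(update_table, batch_size):
        for tup in batch:
            for dest in index.get(tup[-1], []):
                # Shuffle: route the grown tuple to the partition owning dest.
                out[dest % num_partitions].append(tup + (dest,))
    return out


print(streaming_join([(1, 2), (2, 3)], [(2, 3), (3, 1), (3, 4)], num_partitions=2))
```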

48 Structural Information
[Figure: same update tuples, different subgraphs]
Structural info is missing!

49 Missing Structural Information
Identical tuples may represent different structures
Different tuples may represent identical structures

50 Adding Structural Info
Encodes the history of joins in update tuples
[Figure: a subgraph and its update tuples, each element annotated with an index]
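
One way to picture encoding the join history is sketched below. The encoding is an assumption made for illustration: each tuple element is stored as a pair (vertex, index of the element it was joined on), and the helper names are hypothetical. With it, two tuples that list the same vertices can still be told apart by their structure.

```python
def extend(tup, parent_index, new_vertex):
    """Append new_vertex, recording the index of the element it was joined on."""
    return tup + ((new_vertex, parent_index),)


def edges_of(tup):
    """Recover the subgraph's edges from the recorded join history."""
    return {frozenset((tup[parent][0], vertex)) for vertex, parent in tup[1:]}


root = ((1, None),)
# Path 1-2-3-4: each new vertex attaches to the previous one.
path = extend(extend(extend(root, 0, 2), 1, 3), 2, 4)
# Star around 2: vertices 3 and 4 both attach to vertex 2.
star = extend(extend(extend(root, 0, 2), 1, 3), 1, 4)

print([v for v, _ in path] == [v for v, _ in star])  # same vertex sequence
print(edges_of(path) == edges_of(star))              # different structure
```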

51 Is Join Enough?
Join grows a subgraph from one of its vertices
For Frequent Subgraph Mining, we need to explore all possibilities of existing subgraphs
A different way of joining to grow a subgraph from all of its vertices

56 Join on All Columns
Joins update table with edge table on every column
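
A minimal sketch of joining on every column, so that a subgraph can grow from any of its vertices rather than only one. The function name is hypothetical, the edge table is assumed to contain both directions of each undirected edge, and the sketch only grows by vertices not yet in the tuple, which is a simplification.

```python
def join_all_columns(update_table, edge_table):
    """Join each update tuple with the edge table on every one of its columns."""
    by_src = {}
    for src, dest in edge_table:
        by_src.setdefault(src, []).append(dest)

    grown = []
    for tup in update_table:
        for column, vertex in enumerate(tup):
            for dest in by_src.get(vertex, []):
                if dest not in tup:  # simplification: add new vertices only
                    # Record which column the growth came from.
                    grown.append((tup, column, dest))
    return grown


edge_table = [(1, 2), (2, 1), (1, 3), (3, 1), (2, 4), (4, 2)]
print(join_all_columns([(1, 2)], edge_table))
```

A join on the last column alone would only find the extension through vertex 2; joining on every column also finds the extension through vertex 1.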

57 Automorphism and Isomorphism
Arabesque [CHC Teixeira et al., SOSP 5]
Different threads can generate identical (automorphic) update tuples
- Select and keep one, remove all the other duplicates
Different tuples may belong to the same isomorphism class
- Aggregate to count the number of each distinct shape
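
The two steps on this slide, removing duplicates and then aggregating by shape, can be sketched as follows. This is a brute-force illustration for small simple subgraphs only: duplicates are detected as equal edge sets, and the shape is a canonical form found by trying every vertex relabeling. The function names are hypothetical, and real systems use far cheaper canonical labeling.

```python
from collections import Counter
from itertools import permutations


def shape(edges):
    """Canonical form: the smallest relabeled edge list over all vertex permutations."""
    vertices = sorted({v for e in edges for v in e})
    best = None
    for perm in permutations(range(len(vertices))):
        relabel = dict(zip(vertices, perm))
        candidate = tuple(sorted(tuple(sorted((relabel[u], relabel[v]))) for u, v in edges))
        if best is None or candidate < best:
            best = candidate
    return best


def count_shapes(subgraphs):
    # Select and keep one copy of each subgraph found more than once.
    distinct = {frozenset(frozenset(e) for e in edges) for edges in subgraphs}
    # Aggregate: count distinct subgraphs per isomorphism class.
    return Counter(shape([tuple(e) for e in edges]) for edges in distinct)


found = [
    [(1, 2), (2, 3)],          # path, found by one thread
    [(3, 2), (2, 1)],          # the same path, found by another thread
    [(4, 5), (5, 6)],          # a different path with the same shape
    [(1, 2), (2, 3), (1, 3)],  # a triangle
]
print(count_shapes(found))
```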

58 Evaluation
Platform
- 0-node cluster, 5TB SSD
- Each node: Xeon(R) CPU E5-640 v processors, gb memory
Application
- Triangle Counting
- Transitive Closure
- N-Clique Finding
- N-Motif Counting
- Frequent Subgraph Mining
Input graphs
Graphs        #Edges   #Vertices
Citeseer      4,7,
Mico          .M       00K
Patents       4M       .7M
LiveJournal   69M      4.8M
Orkut         7M       M
UK            M        9.5M

59 Comparisons with Mining Systems
[Table: results on Citeseer, Mico, and Patent for Triangle Counting (RStream, Arabesque), 5-Clique (RStream, Arabesque), and -FSM K (RStream, Arabesque, ScaleMine, DistGraph); entries shown: Citeseer 0.04, Mico 5.8, Patent 6.7]
RStream outperforms Arabesque by 60.9x, ScaleMine by .x, DistGraph by 7.x

60 Comparisons with Mining Systems
[Figure: FSM on the patent graph; running time (seconds) of RStream, ScaleMine, and Arabesque; x-axis: subgraph size - support]

61 Comparisons with Datalog Systems
[Figure: Triangle Counting and Transitive Closure on LiveJournal and Orkut; time (seconds) for RStream (RS), BigDatalog (BD), and SociaLite (SL)]

63 Size of Intermediate Data
4-Motif Counting on Mico
[Table: Phase vs. #MB; initial graph MB; 688 X; Total .49TB]

64 Conclusions
RStream: A single machine, out-of-core graph mining system
A simple and expressive API
- GAS + Relational Algebra => GRAS
An efficient runtime engine
- implements relational algebra with tuple streaming
