Arabesque. A system for distributed graph mining. Mohammed Zaki, RPI

Size: px

Start display at page:

Download "Arabesque. A system for distributed graph mining. Mohammed Zaki, RPI"

Aubrey Dorsey
6 years ago
Views:

1 rabesque system for distributed graph mining Mohammed Zaki, RPI Carlos Teixeira, lexandre Fonseca, Marco Serafini, Georgos Siganos, shraf boulnaga, Qatar Computing Research Institute (QCRI) 1

2 Big Data Why has data analytics become so hot? Physical and digital worlds increasingly intertwined More and more digital breadcrumbs More and more applications Hadoop has made data analytics accessible Key drivers in systems research Define abstractions that ease development Systems that efficiently implement them 2 2

3 Graphs are Ubiquitous 3 3

4 Graph Mining lgorithms Finding subgraphs of interest in (labeled) input graphs Examples: Clique finding Others: frequent subgraph mining, motifs 4 4

5 pplications Web: Community detection, link spam detection Semantic data: ttributed patterns in RDF Biology: Protein-protein or gene interactions 5 5

6 Some Terminology Input graph Pattern Embeddings 6 6

7 Example: Frequent Graph Mining 7

8 Frequent Subgraph Discovery Mining frequent subgraphs from a database of many graphs C D B D C (G 1 ) (G 2 ) (G 3 ) Maximal Frequent Subgraphs with minimum support (minsup) = 2 C D D B C D D B 8

9 Frequent Subgraph Discovery Mining frequent subgraphs from a single large graph B D C D D C C D B Find subgraphs that have a minimum embedding count Total (6) Edge Disjoint (3) Vertex Disjoint (2) NP-Hard to find edge/vertex disjoint from total C D 9

10 Subgraph Mining: Complete Level-wise Search Candidate generation: add one more edge; enumerate all extensions Support counting: check which are frequent; retain for next iteration C Frequent Level 1 D B D C C Minimum Support = 2 C D D B D D B Candidate Frequent Level 2 C C B C C C D C C B C C C D 10

11 Taming of the Morphisms Challenge of isomorphisms How to detect duplicates? Graph Isomorphism How to count occurrences? Subgraph Isomorphism 11

12 Candidate Generation Can be very expensive: potentially millions of isomorphism checks C C D C D C D Graph isomorphism 12

13 Support Counting Graph Database If frequent Parent Graph Candidate Graph (G) Subgraph Isomorphism testing G1 G2 G3 Costly for large datasets, large graphs, small support: potentially millions of subgraph isomorphism checks G n 13

14 rabesque for Graph Mining 14

15 Challenge Exponential number of subgraphs/embeddings # unique subgraphs (log-scale) 1.7B 117M 7.8M 335K 4K 22K Size of subgraphs 15 15

16 State of the rt: Custom lgorithms Easy to Code High Performance Transparent Distribution Custom lgorithms 16 16

17 State of the rt: Think Like a Vertex Easy to Code Efficient Implementation Transparent Distribution Custom lgorithms Think Like a Vertex 17 17

18 rabesque New system & execution model Purpose-built for graph mining New Think Like an Embedding model Contributions: Simple & Generic PI High performance Distributed & Scalable by design 18 18

19 rabesque Easy to Code High Performance Transparent Distribution Custom lgorithms Think Like a Vertex rabesque 19 19

20 rabesque PI - Clique finding boolean filter(embedding e) { return isclique(e); } void process(embedding e) { output(e); } State of the rt (Mace, centralized) 4,621 LOC boolean isclique(embedding e) { return e.getnumedgesdded() == e.getnumberofvertices() - 1; } 20 20

21 rabesque PI - Motif Counting boolean filter(embedding e) { return e.getnumvertices() <= MX_SIZE; } void process(embedding e) { mapoutput(e.getpattern(), 1); } State of the rt (GTrieScanner, centralized) 3,145 LOC Pair<Pattern, Integer> reduceoutput(pattern p, List<Integer> counts) { return new Pair(p, sum(counts)); } 21 21

22 rabesque PI - Frequent Subgraph mining Ours was the first distributed implementation 280 lines of Java code of which 212 compute frequency metric Baseline (GRMI): 5,443 lines of Java 22 22

23 rabesque: n Efficient System COST: s efficient as centralized state of the art pplication - Graph Centralized Baseline rabesque 1 thread Motifs - MiCo (MS=3) 50s 37s Cliques - MiCo (MS=4) 281s 385s FSM - CiteSeer (S=300) 4.8s 5s 23 23

24 rabesque: Scalable System Scalable to thousands of workers Hours/days Minutes pplication - Graph Centralized Baseline rabesque 640 cores Motifs - MiCo 2 hours 24 minutes 25 seconds Cliques - MiCo 4 hours 8 minutes 1 minute 10 seconds FSM - Patents > 1 day 1 minute 28 seconds Can process graphs with almost 1 billion edges 24 24

25 lternative Paradigms? 25

26 Think Like a Vertex pplication = Stateful vertex object Vertices sends messages to their neighbors Easy to scale to large graphs: partition by vertex Bulk Synchronous Programming (BSP) 1. Receive from all neighbors 2. Compute new state 3. Send to all neighbors 26 26

27 Example: Shortest Path Input: Graph (weighted edges), source vertex Output: Min source vertex distance Superstep 0 1 message values = 2 and Superstep 1 message values = 4, 3, and Superstep 2 message values = 6 and Superstep 3 Complete, no new messages Example taken from: [McCune et al., arxiv: (2015)] 27

28 Matrix-Vector Multiplication E.g. Page-Rank style computation importance: i2 importance: i3 2 3 a12 * i2 a13 * i3 sum inputs 1 new state to neighbors superstep i superstep i+1 superstep i+2 links to v1 0 a12 a13 i1 a12 * i2 + a13 * i3 i2 * = i3 adjacency matrix (transposed) importance new importance 28 28

29 Graph Exploration with TLV 1. Receive embeddings 2. Expand by adding neighboring vertices 3. Send canonical embeddings to their constituting vertices Superstep 2 for vertex 4 Input graph Receive Expand Send

30 Think Like a Pattern Many existing algorithms keep state by pattern dvantages Rebuild embeddings from scratch No need to materialize full intermediate state Idea of TLP: ssign different patterns to different machines void storing materialized embedding 30 30

31 rabesque Details 31

32 How: rabesque Optimizations void Redundant Work Efficient canonicality checking Embedding Compression Overapproximating Directed cyclic Graphs (ODGs) Efficient ggregation 2-level pattern aggregation 32 32

33 rabesque: Fundamentals Subgraphs as 1st class citizens: Embedding == Subgraph Think Like an Embedding model rabesque responsibilities User responsibilities Graph Exploration ggregation (Isomorphism) Filter Load Balancing No redundant work (utomorphism) Process 33 33

34 Graph Exploration Iterative expansion Subgraph order n Subgraph order n + 1 Connect to neighbours, one vertex at a time Input graph Depth Depth

35 Graph Exploration Depth Input graph

36 Model - Think Like an Embedding Exploration step 1 Exploration step 2 Exploration step Input Output Input Output Input Output Filter false true Process Discard Save 1. Start from a set of initial embeddings 2. Expand: add one vertex or edge 3. Filter uninteresting candidates 4. Produce outputs User-defined functions 36 36

37 Guarantee: Completeness For each e, if Filter(e) == true then Process(e) is executed Requirement: nti-monotonicity Filter = true Filter = true Filter = false Keep expanding Filter = false We can prune and be sure that we won t ignore desired embeddings 37 37

38 ggregation Some applications must aggregate across embeddings E.g., Frequent subgraph mining: Count embeddings with same pattern ggregation in parallel with exploration step 38 38

39 ggregation ggregate across multiple embeddings readggregate(k) Process map(k, v) 1 2 gg Filter 1 3 Discard gg Process Save Filter - Process as before Exploration step i Exploration step i+1 User-defined functions 39 39

40 System rchitecture Input Embeddings size n split 1 split 4 Worker 1 Output Embeddings size n + 1 split 1 split 4 split 7 split 7 Previous step split 2 split 5 split 8 Worker 2 Worker 3 split 2 split 5 split 8 Next step split 3 split 3 split 6 split 6 split 9 split

41 rabesque PI pp-defined functions: boolean filter(embedding e) void process(embedding e) boolean aggregationfilter(embedding e) void aggregationprocess(embedding e) Pair<K,V> reduce(k key, List<V> values) Pair<K,V> reduceoutput(k key, List<V> values) Functions provided by rabesque: void map(k key, V value) V readggregate(k key) void output(object value) void mapoutput(k key, V value) 41 41

42 Technical Challenges 42

43 voiding redundant work Problem: utomorphic embeddings utomorphisms == subgraph equivalences Redundant work == Worker 1 Worker

44 voiding redundant work Solution: Decentralized Embedding Canonicality No coordination Efficient == Worker 1 Worker 2 iscanonical(e) true iscanonical(e) false 44 44

45 Embedding Canonicality iscanonical(e) iff at every step add neighbor with smallest ID 5 e 6 Initial embedding (e) Expansions: canonical canonical not canonical ( ) 45 45

46 Handling Exponential growth Goal: handle trillions+ different embeddings? Solution: Overapproximating DGs (ODGs) Compress into less restrictive superset Deal with spurious embeddings Canonical Embeddings Input Graph Embedding List ODG 46 46

47 ggregation by Pattern Label Distinguishable property of a vertex (e.g. color). Pattern - Meta sub-graph or the template. Captures subgraph structure and labelling Embedding - Instance of a pattern. ctual vertices and edges Input graph Pattern Embeddings 47 47

48 Efficient Pattern ggregation Goal: ggregate automorphic patterns to single key Find canonical pattern No known polynomial solution Canonical pattern 3x Expensive graph canonization 48 48

49 Efficient Pattern ggregation Solution: 2-level pattern aggregation 1. Embeddings quick patterns 2. Quick patterns canonical pattern ) Quick patterns 2) Canonical pattern 3x Linear matching to quick pattern 2x Expensive graph canonization 49 49

50 Evaluation 50

51 Evaluation - Setup 20 servers: GHz, 256GB RM 10 Gbps network 3 algorithms: Frequent Subgraph Mining, Counting Motifs and Clique Finding # Vertices # Edges # Labels vg. Degree CiteSeer 3,312 4, MiCO 100,000 1,080, Patents 2,745,761 13,965, Youtube 4,589,876 43,968, SN 5,022, ,613, Instagram 179,527, ,390,

52 Evaluation - TLP & TLV Use case: frequent subgraph mining No scalability. Bottlenecks: TLV: Replication of embeddings, hotspots TLP: very few patterns do all the work 10 8 Ideal TLP TLV Speedup Number of nodes (32 threads) 52 52

53 Evaluation - raquesque Scalability 53 53

54 Evaluation rabesque Scalability pplication - Graph Centralized Baseline rabesque - Num. Servers (32 threads) Motifs - MiCo 8,664s 328s 74s 41s 31s 25s FSM - Citeseer 1,813s 431s 105s 65s 52s 41s Cliques - MiCo 14,901s 1,185s 272s 140s 91s 70s Motifs - Youtube Fail 8,995s 2,218s 1,167s 900s 709s FSM - Patents >19h 548s 186s 132s 102s 88s 54 54

55 Evaluation - ODGs Compression 44 GB 60 MB 4000 vertices 1.7 billion subgraphs 55 55

56 Evaluation - Speedup w ODGs 56 56

57 Efficient Pattern ggregation Solution: 2-level pattern aggregation 1. Embeddings quick patterns 2. Quick patterns canonical pattern ) Quick patterns 2) Canonical pattern 3x Linear matching to quick pattern 2x Expensive graph canonization 57 57

58 Evaluation - Two-level aggregation Motifs MiCo (MS = 4) Motifs Youtube (MS=4) FSM CiteSeer (S=220, MS=7) FSM Patents (S=24k) Embeddings 10,957,439, ,909,854,429 1,680,983,703 1,910,611,704 Quick Patterns Canonical Patterns Reduction Factor 521,782,810x 10,424,278,782x 1,173,052x 1,061,451x 58 58

59 Evaluation - Two-level aggregation 59 59

60 CPU Utilization Breakdown dvantages of a simple PI rabesque does all the work (unlike TLV system) Great opportunities for system-level optimizations P: Pattern ggregation, C: canonicality checks, G: generate new candidates, R/W: Read/write embeddings 60

61 Large Graphs Graph # Vertices # Edges # Labels vg. Degree SN 5,022, ,613, Instagram 179,527, ,390, pplication Time Embeddings Motifs-SN (MS=4) 6h 18m 8.4 trillion Cliques-SN (MS=5) 29m 30 billion Motifs-Instagram (MS=3) 10h 45m 5 trillion 61 61

62 What s Next? 62

63 Future Work Better ways to organize intermediate state Scale to larger intermediate states Support for approximate exploration Out-of-core? Support for real-time graphs Verticals and new applications 63 63

64 Conclusions Fundamental trend: democratizing data analytics rabesque: graph mining system Straightforward to code Transparent and scalable distribution High performance Only a first step: many opportunities for improvement 69 69

65 Download It, Play with It, Hack It Open-source (pache 2.0) Pre-compiled jar User guide 70 70

66 Thank you arabesque.io 71

Arabesque: A System for Distributed Graph Mining

Arabesque: A System for Distributed Graph Mining Carlos H. C. Teixeira, Alexandre J. Fonseca, Marco Serafini, Georgos Siganos, Mohammed J. Zaki, Ashraf Aboulnaga Qatar Computing Research Institute - HBKU,