Arabesque. A system for distributed graph mining. Mohammed Zaki, RPI

Size: px
Start display at page:

Download "Arabesque. A system for distributed graph mining. Mohammed Zaki, RPI"

Transcription

1 rabesque system for distributed graph mining Mohammed Zaki, RPI Carlos Teixeira, lexandre Fonseca, Marco Serafini, Georgos Siganos, shraf boulnaga, Qatar Computing Research Institute (QCRI) 1

2 Big Data Why has data analytics become so hot? Physical and digital worlds increasingly intertwined More and more digital breadcrumbs More and more applications Hadoop has made data analytics accessible Key drivers in systems research Define abstractions that ease development Systems that efficiently implement them 2 2

3 Graphs are Ubiquitous 3 3

4 Graph Mining lgorithms Finding subgraphs of interest in (labeled) input graphs Examples: Clique finding Others: frequent subgraph mining, motifs 4 4

5 pplications Web: Community detection, link spam detection Semantic data: ttributed patterns in RDF Biology: Protein-protein or gene interactions 5 5

6 Some Terminology Input graph Pattern Embeddings 6 6

7 Example: Frequent Graph Mining 7

8 Frequent Subgraph Discovery Mining frequent subgraphs from a database of many graphs C D B D C (G 1 ) (G 2 ) (G 3 ) Maximal Frequent Subgraphs with minimum support (minsup) = 2 C D D B C D D B 8

9 Frequent Subgraph Discovery Mining frequent subgraphs from a single large graph B D C D D C C D B Find subgraphs that have a minimum embedding count Total (6) Edge Disjoint (3) Vertex Disjoint (2) NP-Hard to find edge/vertex disjoint from total C D 9

10 Subgraph Mining: Complete Level-wise Search Candidate generation: add one more edge; enumerate all extensions Support counting: check which are frequent; retain for next iteration C Frequent Level 1 D B D C C Minimum Support = 2 C D D B D D B Candidate Frequent Level 2 C C B C C C D C C B C C C D 10

11 Taming of the Morphisms Challenge of isomorphisms How to detect duplicates? Graph Isomorphism How to count occurrences? Subgraph Isomorphism 11

12 Candidate Generation Can be very expensive: potentially millions of isomorphism checks C C D C D C D Graph isomorphism 12

13 Support Counting Graph Database If frequent Parent Graph Candidate Graph (G) Subgraph Isomorphism testing G1 G2 G3 Costly for large datasets, large graphs, small support: potentially millions of subgraph isomorphism checks G n 13

14 rabesque for Graph Mining 14

15 Challenge Exponential number of subgraphs/embeddings # unique subgraphs (log-scale) 1.7B 117M 7.8M 335K 4K 22K Size of subgraphs 15 15

16 State of the rt: Custom lgorithms Easy to Code High Performance Transparent Distribution Custom lgorithms 16 16

17 State of the rt: Think Like a Vertex Easy to Code Efficient Implementation Transparent Distribution Custom lgorithms Think Like a Vertex 17 17

18 rabesque New system & execution model Purpose-built for graph mining New Think Like an Embedding model Contributions: Simple & Generic PI High performance Distributed & Scalable by design 18 18

19 rabesque Easy to Code High Performance Transparent Distribution Custom lgorithms Think Like a Vertex rabesque 19 19

20 rabesque PI - Clique finding boolean filter(embedding e) { return isclique(e); } void process(embedding e) { output(e); } State of the rt (Mace, centralized) 4,621 LOC boolean isclique(embedding e) { return e.getnumedgesdded() == e.getnumberofvertices() - 1; } 20 20

21 rabesque PI - Motif Counting boolean filter(embedding e) { return e.getnumvertices() <= MX_SIZE; } void process(embedding e) { mapoutput(e.getpattern(), 1); } State of the rt (GTrieScanner, centralized) 3,145 LOC Pair<Pattern, Integer> reduceoutput(pattern p, List<Integer> counts) { return new Pair(p, sum(counts)); } 21 21

22 rabesque PI - Frequent Subgraph mining Ours was the first distributed implementation 280 lines of Java code of which 212 compute frequency metric Baseline (GRMI): 5,443 lines of Java 22 22

23 rabesque: n Efficient System COST: s efficient as centralized state of the art pplication - Graph Centralized Baseline rabesque 1 thread Motifs - MiCo (MS=3) 50s 37s Cliques - MiCo (MS=4) 281s 385s FSM - CiteSeer (S=300) 4.8s 5s 23 23

24 rabesque: Scalable System Scalable to thousands of workers Hours/days Minutes pplication - Graph Centralized Baseline rabesque 640 cores Motifs - MiCo 2 hours 24 minutes 25 seconds Cliques - MiCo 4 hours 8 minutes 1 minute 10 seconds FSM - Patents > 1 day 1 minute 28 seconds Can process graphs with almost 1 billion edges 24 24

25 lternative Paradigms? 25

26 Think Like a Vertex pplication = Stateful vertex object Vertices sends messages to their neighbors Easy to scale to large graphs: partition by vertex Bulk Synchronous Programming (BSP) 1. Receive from all neighbors 2. Compute new state 3. Send to all neighbors 26 26

27 Example: Shortest Path Input: Graph (weighted edges), source vertex Output: Min source vertex distance Superstep 0 1 message values = 2 and Superstep 1 message values = 4, 3, and Superstep 2 message values = 6 and Superstep 3 Complete, no new messages Example taken from: [McCune et al., arxiv: (2015)] 27

28 Matrix-Vector Multiplication E.g. Page-Rank style computation importance: i2 importance: i3 2 3 a12 * i2 a13 * i3 sum inputs 1 new state to neighbors superstep i superstep i+1 superstep i+2 links to v1 0 a12 a13 i1 a12 * i2 + a13 * i3 i2 * = i3 adjacency matrix (transposed) importance new importance 28 28

29 Graph Exploration with TLV 1. Receive embeddings 2. Expand by adding neighboring vertices 3. Send canonical embeddings to their constituting vertices Superstep 2 for vertex 4 Input graph Receive Expand Send

30 Think Like a Pattern Many existing algorithms keep state by pattern dvantages Rebuild embeddings from scratch No need to materialize full intermediate state Idea of TLP: ssign different patterns to different machines void storing materialized embedding 30 30

31 rabesque Details 31

32 How: rabesque Optimizations void Redundant Work Efficient canonicality checking Embedding Compression Overapproximating Directed cyclic Graphs (ODGs) Efficient ggregation 2-level pattern aggregation 32 32

33 rabesque: Fundamentals Subgraphs as 1st class citizens: Embedding == Subgraph Think Like an Embedding model rabesque responsibilities User responsibilities Graph Exploration ggregation (Isomorphism) Filter Load Balancing No redundant work (utomorphism) Process 33 33

34 Graph Exploration Iterative expansion Subgraph order n Subgraph order n + 1 Connect to neighbours, one vertex at a time Input graph Depth Depth

35 Graph Exploration Depth Input graph

36 Model - Think Like an Embedding Exploration step 1 Exploration step 2 Exploration step Input Output Input Output Input Output Filter false true Process Discard Save 1. Start from a set of initial embeddings 2. Expand: add one vertex or edge 3. Filter uninteresting candidates 4. Produce outputs User-defined functions 36 36

37 Guarantee: Completeness For each e, if Filter(e) == true then Process(e) is executed Requirement: nti-monotonicity Filter = true Filter = true Filter = false Keep expanding Filter = false We can prune and be sure that we won t ignore desired embeddings 37 37

38 ggregation Some applications must aggregate across embeddings E.g., Frequent subgraph mining: Count embeddings with same pattern ggregation in parallel with exploration step 38 38

39 ggregation ggregate across multiple embeddings readggregate(k) Process map(k, v) 1 2 gg Filter 1 3 Discard gg Process Save Filter - Process as before Exploration step i Exploration step i+1 User-defined functions 39 39

40 System rchitecture Input Embeddings size n split 1 split 4 Worker 1 Output Embeddings size n + 1 split 1 split 4 split 7 split 7 Previous step split 2 split 5 split 8 Worker 2 Worker 3 split 2 split 5 split 8 Next step split 3 split 3 split 6 split 6 split 9 split

41 rabesque PI pp-defined functions: boolean filter(embedding e) void process(embedding e) boolean aggregationfilter(embedding e) void aggregationprocess(embedding e) Pair<K,V> reduce(k key, List<V> values) Pair<K,V> reduceoutput(k key, List<V> values) Functions provided by rabesque: void map(k key, V value) V readggregate(k key) void output(object value) void mapoutput(k key, V value) 41 41

42 Technical Challenges 42

43 voiding redundant work Problem: utomorphic embeddings utomorphisms == subgraph equivalences Redundant work == Worker 1 Worker

44 voiding redundant work Solution: Decentralized Embedding Canonicality No coordination Efficient == Worker 1 Worker 2 iscanonical(e) true iscanonical(e) false 44 44

45 Embedding Canonicality iscanonical(e) iff at every step add neighbor with smallest ID 5 e 6 Initial embedding (e) Expansions: canonical canonical not canonical ( ) 45 45

46 Handling Exponential growth Goal: handle trillions+ different embeddings? Solution: Overapproximating DGs (ODGs) Compress into less restrictive superset Deal with spurious embeddings Canonical Embeddings Input Graph Embedding List ODG 46 46

47 ggregation by Pattern Label Distinguishable property of a vertex (e.g. color). Pattern - Meta sub-graph or the template. Captures subgraph structure and labelling Embedding - Instance of a pattern. ctual vertices and edges Input graph Pattern Embeddings 47 47

48 Efficient Pattern ggregation Goal: ggregate automorphic patterns to single key Find canonical pattern No known polynomial solution Canonical pattern 3x Expensive graph canonization 48 48

49 Efficient Pattern ggregation Solution: 2-level pattern aggregation 1. Embeddings quick patterns 2. Quick patterns canonical pattern ) Quick patterns 2) Canonical pattern 3x Linear matching to quick pattern 2x Expensive graph canonization 49 49

50 Evaluation 50

51 Evaluation - Setup 20 servers: GHz, 256GB RM 10 Gbps network 3 algorithms: Frequent Subgraph Mining, Counting Motifs and Clique Finding # Vertices # Edges # Labels vg. Degree CiteSeer 3,312 4, MiCO 100,000 1,080, Patents 2,745,761 13,965, Youtube 4,589,876 43,968, SN 5,022, ,613, Instagram 179,527, ,390,

52 Evaluation - TLP & TLV Use case: frequent subgraph mining No scalability. Bottlenecks: TLV: Replication of embeddings, hotspots TLP: very few patterns do all the work 10 8 Ideal TLP TLV Speedup Number of nodes (32 threads) 52 52

53 Evaluation - raquesque Scalability 53 53

54 Evaluation rabesque Scalability pplication - Graph Centralized Baseline rabesque - Num. Servers (32 threads) Motifs - MiCo 8,664s 328s 74s 41s 31s 25s FSM - Citeseer 1,813s 431s 105s 65s 52s 41s Cliques - MiCo 14,901s 1,185s 272s 140s 91s 70s Motifs - Youtube Fail 8,995s 2,218s 1,167s 900s 709s FSM - Patents >19h 548s 186s 132s 102s 88s 54 54

55 Evaluation - ODGs Compression 44 GB 60 MB 4000 vertices 1.7 billion subgraphs 55 55

56 Evaluation - Speedup w ODGs 56 56

57 Efficient Pattern ggregation Solution: 2-level pattern aggregation 1. Embeddings quick patterns 2. Quick patterns canonical pattern ) Quick patterns 2) Canonical pattern 3x Linear matching to quick pattern 2x Expensive graph canonization 57 57

58 Evaluation - Two-level aggregation Motifs MiCo (MS = 4) Motifs Youtube (MS=4) FSM CiteSeer (S=220, MS=7) FSM Patents (S=24k) Embeddings 10,957,439, ,909,854,429 1,680,983,703 1,910,611,704 Quick Patterns Canonical Patterns Reduction Factor 521,782,810x 10,424,278,782x 1,173,052x 1,061,451x 58 58

59 Evaluation - Two-level aggregation 59 59

60 CPU Utilization Breakdown dvantages of a simple PI rabesque does all the work (unlike TLV system) Great opportunities for system-level optimizations P: Pattern ggregation, C: canonicality checks, G: generate new candidates, R/W: Read/write embeddings 60

61 Large Graphs Graph # Vertices # Edges # Labels vg. Degree SN 5,022, ,613, Instagram 179,527, ,390, pplication Time Embeddings Motifs-SN (MS=4) 6h 18m 8.4 trillion Cliques-SN (MS=5) 29m 30 billion Motifs-Instagram (MS=3) 10h 45m 5 trillion 61 61

62 What s Next? 62

63 Future Work Better ways to organize intermediate state Scale to larger intermediate states Support for approximate exploration Out-of-core? Support for real-time graphs Verticals and new applications 63 63

64 Conclusions Fundamental trend: democratizing data analytics rabesque: graph mining system Straightforward to code Transparent and scalable distribution High performance Only a first step: many opportunities for improvement 69 69

65 Download It, Play with It, Hack It Open-source (pache 2.0) Pre-compiled jar User guide 70 70

66 Thank you arabesque.io 71

Arabesque: A System for Distributed Graph Mining

Arabesque: A System for Distributed Graph Mining Arabesque: A System for Distributed Graph Mining Carlos H. C. Teixeira, Alexandre J. Fonseca, Marco Serafini, Georgos Siganos, Mohammed J. Zaki, Ashraf Aboulnaga Qatar Computing Research Institute - HBKU,

More information

RStream:Marrying Relational Algebra with Streaming for Efficient Graph Mining on A Single Machine

RStream:Marrying Relational Algebra with Streaming for Efficient Graph Mining on A Single Machine RStream:Marrying Relational Algebra with Streaming for Efficient Graph Mining on A Single Machine Guoqing Harry Xu Kai Wang, Zhiqiang Zuo, John Thorpe, Tien Quang Nguyen, UCLA Nanjing University Facebook

More information

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader A New Parallel Algorithm for Connected Components in Dynamic Graphs Robert McColl Oded Green David Bader Overview The Problem Target Datasets Prior Work Parent-Neighbor Subgraph Results Conclusions Problem

More information

gspan: Graph-Based Substructure Pattern Mining

gspan: Graph-Based Substructure Pattern Mining University of Illinois at Urbana-Champaign February 3, 2017 Agenda What motivated the development of gspan? Technical Preliminaries Exploring the gspan algorithm Experimental Performance Evaluation Introduction

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 2: MapReduce Algorithm Design (2/2) January 14, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

modern database systems lecture 10 : large-scale graph processing

modern database systems lecture 10 : large-scale graph processing modern database systems lecture 1 : large-scale graph processing Aristides Gionis spring 18 timeline today : homework is due march 6 : homework out april 5, 9-1 : final exam april : homework due graphs

More information

Pregel. Ali Shah

Pregel. Ali Shah Pregel Ali Shah s9alshah@stud.uni-saarland.de 2 Outline Introduction Model of Computation Fundamentals of Pregel Program Implementation Applications Experiments Issues with Pregel 3 Outline Costs of Computation

More information

A Comparative Study on Exact Triangle Counting Algorithms on the GPU

A Comparative Study on Exact Triangle Counting Algorithms on the GPU A Comparative Study on Exact Triangle Counting Algorithms on the GPU Leyuan Wang, Yangzihao Wang, Carl Yang, John D. Owens University of California, Davis, CA, USA 31 st May 2016 L. Wang, Y. Wang, C. Yang,

More information

Biology, Physics, Mathematics, Sociology, Engineering, Computer Science, Etc

Biology, Physics, Mathematics, Sociology, Engineering, Computer Science, Etc Motivation Motifs Algorithms G-Tries Parallelism Complex Networks Networks are ubiquitous! Biology, Physics, Mathematics, Sociology, Engineering, Computer Science, Etc Images: UK Highways Agency, Uriel

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Pattern Mining in Frequent Dynamic Subgraphs

Pattern Mining in Frequent Dynamic Subgraphs Pattern Mining in Frequent Dynamic Subgraphs Karsten M. Borgwardt, Hans-Peter Kriegel, Peter Wackersreuther Institute of Computer Science Ludwig-Maximilians-Universität Munich, Germany kb kriegel wackersr@dbs.ifi.lmu.de

More information

Apache Giraph: Facebook-scale graph processing infrastructure. 3/31/2014 Avery Ching, Facebook GDM

Apache Giraph: Facebook-scale graph processing infrastructure. 3/31/2014 Avery Ching, Facebook GDM Apache Giraph: Facebook-scale graph processing infrastructure 3/31/2014 Avery Ching, Facebook GDM Motivation Apache Giraph Inspired by Google s Pregel but runs on Hadoop Think like a vertex Maximum value

More information

Monotone Constraints in Frequent Tree Mining

Monotone Constraints in Frequent Tree Mining Monotone Constraints in Frequent Tree Mining Jeroen De Knijf Ad Feelders Abstract Recent studies show that using constraints that can be pushed into the mining process, substantially improves the performance

More information

Maximum Clique Problem. Team Bushido bit.ly/parallel-computing-fall-2014

Maximum Clique Problem. Team Bushido bit.ly/parallel-computing-fall-2014 Maximum Clique Problem Team Bushido bit.ly/parallel-computing-fall-2014 Agenda Problem summary Research Paper 1 Research Paper 2 Research Paper 3 Software Design Demo of Sequential Program Summary Of the

More information

Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References q Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 2: MapReduce Algorithm Design (2/2) January 12, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

The Stratosphere Platform for Big Data Analytics

The Stratosphere Platform for Big Data Analytics The Stratosphere Platform for Big Data Analytics Hongyao Ma Franco Solleza April 20, 2015 Stratosphere Stratosphere Stratosphere Big Data Analytics BIG Data Heterogeneous datasets: structured / unstructured

More information

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

Automatic Scaling Iterative Computations. Aug. 7 th, 2012 Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics

More information

CARPENTER Find Closed Patterns in Long Biological Datasets. Biological Datasets. Overview. Biological Datasets. Zhiyu Wang

CARPENTER Find Closed Patterns in Long Biological Datasets. Biological Datasets. Overview. Biological Datasets. Zhiyu Wang CARPENTER Find Closed Patterns in Long Biological Datasets Zhiyu Wang Biological Datasets Gene expression Consists of large number of genes Knowledge Discovery and Data Mining Dr. Osmar Zaiane Department

More information

Efficient Subgraph Matching by Postponing Cartesian Products

Efficient Subgraph Matching by Postponing Cartesian Products Efficient Subgraph Matching by Postponing Cartesian Products Computer Science and Engineering Lijun Chang Lijun.Chang@unsw.edu.au The University of New South Wales, Australia Joint work with Fei Bi, Xuemin

More information

One Trillion Edges. Graph processing at Facebook scale

One Trillion Edges. Graph processing at Facebook scale One Trillion Edges Graph processing at Facebook scale Introduction Platform improvements Compute model extensions Experimental results Operational experience How Facebook improved Apache Giraph Facebook's

More information

MapReduce Design Patterns

MapReduce Design Patterns MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 1: MapReduce Algorithm Design (4/4) January 16, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

MapReduce-style data processing

MapReduce-style data processing MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic

More information

Lecture 3, Review of Algorithms. What is Algorithm?

Lecture 3, Review of Algorithms. What is Algorithm? BINF 336, Introduction to Computational Biology Lecture 3, Review of Algorithms Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Algorithm? Definition A process

More information

Interaction Between Input and Output-Sensitive

Interaction Between Input and Output-Sensitive Interaction Between Input and Output-Sensitive Really? Mamadou M. Kanté Université Blaise Pascal - LIMOS, CNRS Enumeration Algorithms Using Structure, Lorentz Institute, August 26 th, 2015 1 Introduction

More information

Giraph: Large-scale graph processing infrastructure on Hadoop. Qu Zhi

Giraph: Large-scale graph processing infrastructure on Hadoop. Qu Zhi Giraph: Large-scale graph processing infrastructure on Hadoop Qu Zhi Why scalable graph processing? Web and social graphs are at immense scale and continuing to grow In 2008, Google estimated the number

More information

Mining frequent Closed Graph Pattern

Mining frequent Closed Graph Pattern Mining frequent Closed Graph Pattern Seminar aus maschninellem Lernen Referent: Yingting Fan 5.November Fachbereich 21 Institut Knowledge Engineering Prof. Fürnkranz 1 Outline Motivation and introduction

More information

G(B)enchmark GraphBench: Towards a Universal Graph Benchmark. Khaled Ammar M. Tamer Özsu

G(B)enchmark GraphBench: Towards a Universal Graph Benchmark. Khaled Ammar M. Tamer Özsu G(B)enchmark GraphBench: Towards a Universal Graph Benchmark Khaled Ammar M. Tamer Özsu Bioinformatics Software Engineering Social Network Gene Co-expression Protein Structure Program Flow Big Graphs o

More information

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining Chloé-Agathe Azencott & Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & Computational Biology Research Group Max Planck Institutes

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Evaluating Use of Data Flow Systems for Large Graph Analysis

Evaluating Use of Data Flow Systems for Large Graph Analysis Evaluating Use of Data Flow Systems for Large Graph Analysis Andy Yoo and Ian Kaplan, P. O. Box 808, Livermore, CA 94551 This work performed under the auspices of the U.S. Department of Energy by under

More information

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage Arijit Khan Nanyang Technological University (NTU), Singapore Gustavo Segovia ETH Zurich, Switzerland Donald Kossmann Microsoft

More information

Large-Scale Graph Mining. Vincent Leroy

Large-Scale Graph Mining. Vincent Leroy Large-Scale Graph Mining Vincent Leroy FREQUENT SUGRPH MINING (FSM) Graphs represent complex data Chemical compounds, proteins Social networks Knowledge bases (ontologies) 0M entities, 20M facts Frequent

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

motifs In the context of networks, the term motif may refer to di erent notions. Subgraph motifs Coloured motifs { }

motifs In the context of networks, the term motif may refer to di erent notions. Subgraph motifs Coloured motifs { } motifs In the context of networks, the term motif may refer to di erent notions. Subgraph motifs Coloured motifs G M { } 2 subgraph motifs 3 motifs Find interesting patterns in a network. 4 motifs Find

More information

Data mining, 4 cu Lecture 8:

Data mining, 4 cu Lecture 8: 582364 Data mining, 4 cu Lecture 8: Graph mining Spring 2010 Lecturer: Juho Rousu Teaching assistant: Taru Itäpelto Frequent Subgraph Mining Extend association rule mining to finding frequent subgraphs

More information

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang Department of Computer Science San Marcos, TX 78666 Report Number TXSTATE-CS-TR-2010-24 Clustering in the Cloud Xuan Wang 2010-05-05 !"#$%&'()*+()+%,&+!"-#. + /+!"#$%&'()*+0"*-'(%,1$+0.23%(-)+%-+42.--3+52367&.#8&+9'21&:-';

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 10. Graph databases Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Graph Databases Basic

More information

Lecture 22 : Distributed Systems for ML

Lecture 22 : Distributed Systems for ML 10-708: Probabilistic Graphical Models, Spring 2017 Lecture 22 : Distributed Systems for ML Lecturer: Qirong Ho Scribes: Zihang Dai, Fan Yang 1 Introduction Big data has been very popular in recent years.

More information

CS6702 GRAPH THEORY AND APPLICATIONS QUESTION BANK

CS6702 GRAPH THEORY AND APPLICATIONS QUESTION BANK CS6702 GRAPH THEORY AND APPLICATIONS 2 MARKS QUESTIONS AND ANSWERS 1 UNIT I INTRODUCTION CS6702 GRAPH THEORY AND APPLICATIONS QUESTION BANK 1. Define Graph. 2. Define Simple graph. 3. Write few problems

More information

CS 345A Data Mining. MapReduce

CS 345A Data Mining. MapReduce CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes

More information

From Think Like a Vertex to Think Like a Graph. Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson

From Think Like a Vertex to Think Like a Graph. Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson From Think Like a Vertex to Think Like a Graph Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson Large Scale Graph Processing Graph data is everywhere and growing

More information

Graph-Processing Systems. (focusing on GraphChi)

Graph-Processing Systems. (focusing on GraphChi) Graph-Processing Systems (focusing on GraphChi) Recall: PageRank in MapReduce (Hadoop) Input: adjacency matrix H D F S (a,[c]) (b,[a]) (c,[a,b]) (c,pr(a) / out (a)), (a,[c]) (a,pr(b) / out (b)), (b,[a])

More information

MapReduce. Stony Brook University CSE545, Fall 2016

MapReduce. Stony Brook University CSE545, Fall 2016 MapReduce Stony Brook University CSE545, Fall 2016 Classical Data Mining CPU Memory Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining

More information

Graphs. Edges may be directed (from u to v) or undirected. Undirected edge eqvt to pair of directed edges

Graphs. Edges may be directed (from u to v) or undirected. Undirected edge eqvt to pair of directed edges (p 186) Graphs G = (V,E) Graphs set V of vertices, each with a unique name Note: book calls vertices as nodes set E of edges between vertices, each encoded as tuple of 2 vertices as in (u,v) Edges may

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece

More information

P = NP; P NP. Intuition of the reduction idea:

P = NP; P NP. Intuition of the reduction idea: 1 Polynomial Time Reducibility The question of whether P = NP is one of the greatest unsolved problems in the theoretical computer science. Two possibilities of relationship between P and N P P = NP; P

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Distributed Systems. 21. Graph Computing Frameworks. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 21. Graph Computing Frameworks. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 21. Graph Computing Frameworks Paul Krzyzanowski Rutgers University Fall 2016 November 21, 2016 2014-2016 Paul Krzyzanowski 1 Can we make MapReduce easier? November 21, 2016 2014-2016

More information

Workloads Programmierung Paralleler und Verteilter Systeme (PPV)

Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment

More information

Mining Distributed Frequent Itemset with Hadoop

Mining Distributed Frequent Itemset with Hadoop Mining Distributed Frequent Itemset with Hadoop Ms. Poonam Modgi, PG student, Parul Institute of Technology, GTU. Prof. Dinesh Vaghela, Parul Institute of Technology, GTU. Abstract: In the current scenario

More information

Mosaic: Processing a Trillion-Edge Graph on a Single Machine

Mosaic: Processing a Trillion-Edge Graph on a Single Machine Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys

More information

Lightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University

Lightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University Lightweight Streaming-based Runtime for Cloud Computing granules Shrideep Pallickara Community Grids Lab, Indiana University A unique confluence of factors have driven the need for cloud computing DEMAND

More information

Improvements and Implementation of Hierarchical Clustering based on Hadoop Jun Zhang1, a, Chunxiao Fan1, Yuexin Wu2,b, Ao Xiao1

Improvements and Implementation of Hierarchical Clustering based on Hadoop Jun Zhang1, a, Chunxiao Fan1, Yuexin Wu2,b, Ao Xiao1 3rd International Conference on Machinery, Materials and Information Technology Applications (ICMMITA 2015) Improvements and Implementation of Hierarchical Clustering based on Hadoop Jun Zhang1, a, Chunxiao

More information

Community Detection. Community

Community Detection. Community Community Detection Community In social sciences: Community is formed by individuals such that those within a group interact with each other more frequently than with those outside the group a.k.a. group,

More information

Chapter 2. Related Work

Chapter 2. Related Work Chapter 2 Related Work There are three areas of research highly related to our exploration in this dissertation, namely sequential pattern mining, multiple alignment, and approximate frequent pattern mining.

More information

hyperx: scalable hypergraph processing

hyperx: scalable hypergraph processing hyperx: scalable hypergraph processing Jin Huang November 15, 2015 The University of Melbourne overview Research Outline Scalable Hypergraph Processing Problem and Challenge Idea Solution Implementation

More information

A single-exponential FPT algorithm for Distance-Hereditary Vertex Deletion

A single-exponential FPT algorithm for Distance-Hereditary Vertex Deletion A single-exponential FPT algorithm for Distance-Hereditary Vertex Deletion O-joung Kwon Institute for Computer Science and Control, Hungarian Academy of Sciences in Budapest, Hungary Joint work with Eduard

More information

Introduction to Parallel & Distributed Computing Parallel Graph Algorithms

Introduction to Parallel & Distributed Computing Parallel Graph Algorithms Introduction to Parallel & Distributed Computing Parallel Graph Algorithms Lecture 16, Spring 2014 Instructor: 罗国杰 gluo@pku.edu.cn In This Lecture Parallel formulations of some important and fundamental

More information

Introduction to Graph Theory

Introduction to Graph Theory Introduction to Graph Theory Tandy Warnow January 20, 2017 Graphs Tandy Warnow Graphs A graph G = (V, E) is an object that contains a vertex set V and an edge set E. We also write V (G) to denote the vertex

More information

[CoolName++]: A Graph Processing Framework for Charm++

[CoolName++]: A Graph Processing Framework for Charm++ [CoolName++]: A Graph Processing Framework for Charm++ Hassan Eslami, Erin Molloy, August Shi, Prakalp Srivastava Laxmikant V. Kale Charm++ Workshop University of Illinois at Urbana-Champaign {eslami2,emolloy2,awshi2,psrivas2,kale}@illinois.edu

More information

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability

More information

CA485 Ray Walshe Google File System

CA485 Ray Walshe Google File System Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

More information

Lecture 4: Graph Algorithms

Lecture 4: Graph Algorithms Lecture 4: Graph Algorithms Definitions Undirected graph: G =(V, E) V finite set of vertices, E finite set of edges any edge e = (u,v) is an unordered pair Directed graph: edges are ordered pairs If e

More information

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Lifan Xu Wei Wang Marco A. Alvarez John Cavazos Dongping Zhang Department of Computer and Information Science University of Delaware

More information

PREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING

PREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING PREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, G. Czajkowski Google, Inc. SIGMOD 2010 Presented by Ke Hong (some figures borrowed from

More information

FASCIA. Fast Approximate Subgraph Counting and Enumeration. 2 Oct Scalable Computing Laboratory The Pennsylvania State University 1 / 28

FASCIA. Fast Approximate Subgraph Counting and Enumeration. 2 Oct Scalable Computing Laboratory The Pennsylvania State University 1 / 28 FASCIA Fast Approximate Subgraph Counting and Enumeration George M. Slota Kamesh Madduri Scalable Computing Laboratory The Pennsylvania State University 2 Oct. 2013 1 / 28 Overview Background Motivation

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

High Performance Computing. Introduction to Parallel Computing

High Performance Computing. Introduction to Parallel Computing High Performance Computing Introduction to Parallel Computing Acknowledgements Content of the following presentation is borrowed from The Lawrence Livermore National Laboratory https://hpc.llnl.gov/training/tutorials

More information

1 Definition of Reduction

1 Definition of Reduction 1 Definition of Reduction Problem A is reducible, or more technically Turing reducible, to problem B, denoted A B if there a main program M to solve problem A that lacks only a procedure to solve problem

More information

ASAP: Fast, Approximate Graph Pattern Mining at Scale

ASAP: Fast, Approximate Graph Pattern Mining at Scale ASAP: Fast, Approximate Graph Pattern Mining at Scale Anand Padmanabha Iyer, UC Berkeley; Zaoxing Liu and Xin Jin, Johns Hopkins University; Shivaram Venkataraman, Microsoft Research / University of Wisconsin;

More information

UNIT 3. Greedy Method. Design and Analysis of Algorithms GENERAL METHOD

UNIT 3. Greedy Method. Design and Analysis of Algorithms GENERAL METHOD UNIT 3 Greedy Method GENERAL METHOD Greedy is the most straight forward design technique. Most of the problems have n inputs and require us to obtain a subset that satisfies some constraints. Any subset

More information

A Graph-Based Approach for Mining Closed Large Itemsets

A Graph-Based Approach for Mining Closed Large Itemsets A Graph-Based Approach for Mining Closed Large Itemsets Lee-Wen Huang Dept. of Computer Science and Engineering National Sun Yat-Sen University huanglw@gmail.com Ye-In Chang Dept. of Computer Science and

More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

Enterprise. Breadth-First Graph Traversal on GPUs. November 19th, 2015

Enterprise. Breadth-First Graph Traversal on GPUs. November 19th, 2015 Enterprise Breadth-First Graph Traversal on GPUs Hang Liu H. Howie Huang November 9th, 5 Graph is Ubiquitous Breadth-First Search (BFS) is Important Wide Range of Applications Single Source Shortest Path

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

CS 61C: Great Ideas in Computer Architecture. MapReduce

CS 61C: Great Ideas in Computer Architecture. MapReduce CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing

More information

Graph Theory S 1 I 2 I 1 S 2 I 1 I 2

Graph Theory S 1 I 2 I 1 S 2 I 1 I 2 Graph Theory S I I S S I I S Graphs Definition A graph G is a pair consisting of a vertex set V (G), and an edge set E(G) ( ) V (G). x and y are the endpoints of edge e = {x, y}. They are called adjacent

More information

Evolving To The Big Data Warehouse

Evolving To The Big Data Warehouse Evolving To The Big Data Warehouse Kevin Lancaster 1 Copyright Director, 2012, Oracle and/or its Engineered affiliates. All rights Insert Systems, Information Protection Policy Oracle Classification from

More information

Comparative Study of Subspace Clustering Algorithms

Comparative Study of Subspace Clustering Algorithms Comparative Study of Subspace Clustering Algorithms S.Chitra Nayagam, Asst Prof., Dept of Computer Applications, Don Bosco College, Panjim, Goa. Abstract-A cluster is a collection of data objects that

More information

Parallel Graph Algorithms

Parallel Graph Algorithms Parallel Graph Algorithms Design and Analysis of Parallel Algorithms 5DV050/VT3 Part I Introduction Overview Graphs definitions & representations Minimal Spanning Tree (MST) Prim s algorithm Single Source

More information

Math 776 Graph Theory Lecture Note 1 Basic concepts

Math 776 Graph Theory Lecture Note 1 Basic concepts Math 776 Graph Theory Lecture Note 1 Basic concepts Lectured by Lincoln Lu Transcribed by Lincoln Lu Graph theory was founded by the great Swiss mathematician Leonhard Euler (1707-178) after he solved

More information

CSE 417 Branch & Bound (pt 4) Branch & Bound

CSE 417 Branch & Bound (pt 4) Branch & Bound CSE 417 Branch & Bound (pt 4) Branch & Bound Reminders > HW8 due today > HW9 will be posted tomorrow start early program will be slow, so debugging will be slow... Review of previous lectures > Complexity

More information

Graph Similarity for Data Mining Lot 6 of Data Analytics FY17/18

Graph Similarity for Data Mining Lot 6 of Data Analytics FY17/18 Graph Similarity for Data Mining Lot 6 of Data Analytics FY17/18 Meeting Presentation Peter Rodgers Algorithms for the Comparison of Graphs Graph Isomorphism Graph Edit Distance Various Others Used in

More information

Linear Regression Optimization

Linear Regression Optimization Gradient Descent Linear Regression Optimization Goal: Find w that minimizes f(w) f(w) = Xw y 2 2 Closed form solution exists Gradient Descent is iterative (Intuition: go downhill!) n w * w Scalar objective:

More information

Lecture 24: More Reductions (1997) Steven Skiena. skiena

Lecture 24: More Reductions (1997) Steven Skiena.   skiena Lecture 24: More Reductions (1997) Steven Skiena Department of Computer Science State University of New York Stony Brook, NY 11794 4400 http://www.cs.sunysb.edu/ skiena Prove that subgraph isomorphism

More information

Practice Problems for the Final

Practice Problems for the Final ECE-250 Algorithms and Data Structures (Winter 2012) Practice Problems for the Final Disclaimer: Please do keep in mind that this problem set does not reflect the exact topics or the fractions of each

More information

Introduction to MapReduce Algorithms and Analysis

Introduction to MapReduce Algorithms and Analysis Introduction to MapReduce Algorithms and Analysis Jeff M. Phillips October 25, 2013 Trade-Offs Massive parallelism that is very easy to program. Cheaper than HPC style (uses top of the line everything)

More information

Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations

Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations George M. Slota 1 Sivasankaran Rajamanickam 2 Kamesh Madduri 3 1 Rensselaer Polytechnic Institute, 2 Sandia National

More information

Exploring the Hidden Dimension in Graph Processing

Exploring the Hidden Dimension in Graph Processing Exploring the Hidden Dimension in Graph Processing Mingxing Zhang, Yongwei Wu, Kang Chen, *Xuehai Qian, Xue Li, and Weimin Zheng Tsinghua University *University of Shouthern California Graph is Ubiquitous

More information

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES Anish Athalye and Patrick Long Mentors: Austin Clements and Stephen Tu 3 rd annual MIT PRIMES Conference Sequential

More information

Chapter 26 Cluster Heuristic Search

Chapter 26 Cluster Heuristic Search Chapter 26 Cluster Heuristic Search Part I. Preliminaries Part II. Tightly Coupled Multicore Part III. Loosely Coupled Cluster Chapter 18. Massively Parallel Chapter 19. Hybrid Parallel Chapter 20. Tuple

More information

CS521 \ Notes for the Final Exam

CS521 \ Notes for the Final Exam CS521 \ Notes for final exam 1 Ariel Stolerman Asymptotic Notations: CS521 \ Notes for the Final Exam Notation Definition Limit Big-O ( ) Small-o ( ) Big- ( ) Small- ( ) Big- ( ) Notes: ( ) ( ) ( ) ( )

More information

Managing and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia

Managing and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia Managing and Mining Billion Node Graphs Haixun Wang Microsoft Research Asia Outline Overview Storage Online query processing Offline graph analytics Advanced applications Is it hard to manage graphs? Good

More information

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

More information

Project Genesis. Cafepress.com Product Catalog Hundreds of Millions of Products Millions of new products every week Accelerating growth

Project Genesis. Cafepress.com Product Catalog Hundreds of Millions of Products Millions of new products every week Accelerating growth Scaling with HiveDB Project Genesis Cafepress.com Product Catalog Hundreds of Millions of Products Millions of new products every week Accelerating growth Enter Jeremy and HiveDB Our Requirements OLTP

More information

Chapter 4: Association analysis:

Chapter 4: Association analysis: Chapter 4: Association analysis: 4.1 Introduction: Many business enterprises accumulate large quantities of data from their day-to-day operations, huge amounts of customer purchase data are collected daily

More information