Efficient Subgraph Matching by Postponing Cartesian Products

Similar documents
arxiv: v3 [cs.db] 16 Jan 2018

On Graph Query Optimization in Large Networks

Computing A Near-Maximum Independent Set in Linear Time by Reducing-Peeling

Leveraging Set Relations in Exact Set Similarity Join

gspan: Graph-Based Substructure Pattern Mining

igraph: A Framework for Comparisons of Disk Based Graph Indexing Techniques

Survey on Graph Query Processing on Graph Database. Presented by FAN Zhe

Multi-Query Optimization for Subgraph Isomorphism Search

Capturing Topology in Graph Pattern Matching

A Comparative Study on Exact Triangle Counting Algorithms on the GPU

Binary Decision Diagrams

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining

An Optimal and Progressive Approach to Online Search of Top-K Influential Communities

Subgraph Isomorphism. Artem Maksov, Yong Li, Reese Butler 03/04/2015

Graphs and Trees. An example. Graphs. Example 2

CLOSET+:Searching for the Best Strategies for Mining Frequent Closed Itemsets

Advanced Data Management

An Edge-Swap Heuristic for Finding Dense Spanning Trees

Answering Pattern Match Queries in Large Graph Databases Via Graph Embedding

CECI: Compact Embedding Cluster Index for Scalable Subgraph Matching

Motivation. CS389L: Automated Logical Reasoning. Lecture 5: Binary Decision Diagrams. Historical Context. Binary Decision Trees

A6-R3: DATA STRUCTURE THROUGH C LANGUAGE

Tree-Pattern Queries on a Lightweight XML Processor

Chapters 11 and 13, Graph Data Mining

Algorithm Design (8) Graph Algorithms 1/2

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on Artificial Intelligence,

Constraint Satisfaction Problems

Construction of Minimum-Weight Spanners Mikkel Sigurd Martin Zachariasen

Big Data Management and NoSQL Databases

Efficient Subgraph Search with Presorting and Indexing on Label Frequency

I/O Efficient Algorithms for Exact Distance Queries on Disk- Resident Dynamic Graphs

arxiv: v1 [cs.ds] 8 Sep 2007

Discovering Frequent Geometric Subgraphs

Path-Hop: efficiently indexing large graphs for reachability queries. Tylor Cai and C.K. Poon CityU of Hong Kong

Effective and Efficient Dynamic Graph Coloring

Graph Algorithms. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.

Data Mining for Knowledge Management. Association Rules

Behavior models and verification Lecture 6

Learning Path Queries on Graph Databases

Data Mining in Bioinformatics Day 3: Graph Mining

Graph Indexing: A Frequent Structure-based Approach

Leveraging Transitive Relations for Crowdsourced Joins*

On the Space-Time Trade-off in Solving Constraint Satisfaction Problems*

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

Item Set Extraction of Mining Association Rule

Binary Decision Diagrams (BDD)

Computing Pure Nash Equilibria in Symmetric Action Graph Games

GADDI: Distance Index based Subgraph Matching in Biological Networks

Data mining, 4 cu Lecture 8:

Efficient Aggregation for Graph Summarization

Turbo ISO : Towards UltraFast and Robust Subgraph Isomorphism Search in Large Graph Databases

Answering Subgraph Queries over Large Graphs

Algorithms and Data Structures (INF1) Lecture 15/15 Hua Lu

Constraint Satisfaction Problems

Treewidth and graph minors

Outsourcing Privacy-Preserving Social Networks to a Cloud

Arabesque. A system for distributed graph mining. Mohammed Zaki, RPI

Scalable Big Graph Processing in Map Reduce

Honour Thy Neighbour Clique Maintenance in Dynamic Graphs

Let G = (V, E) be a graph. If u, v V, then u is adjacent to v if {u, v} E. We also use the notation u v to denote that u is adjacent to v.

CS570: Introduction to Data Mining

Wang, Jing (2017) Optimizing graph query performance by indexing and caching. PhD thesis.

Dense graphs and a conjecture for tree-depth

GiS: Fast Indexing and Querying of Graph Structures

A Polynomial Algorithm for Submap Isomorphism: Application to Searching Patterns in Images

Spatial Keyword Range Search on Trajectories

Department of Electronics and Technology, Shivaji University, Kolhapur, Maharashtra, India. set algorithm. Figure 1: System Architecture

Rigidity, connectivity and graph decompositions

PTclose: A novel algorithm for generation of closed frequent itemsets from dense and sparse datasets

A NEW ALGORITHM FOR INDUCED SUBGRAPH ISOMORPHISM

Model Checking I Binary Decision Diagrams

Dynamic Grouping Strategy in Cloud Computing

Distribution-Free Models of Social and Information Networks

Clustering in Data Mining

arxiv: v1 [cs.db] 23 Sep 2017

TwigList: Make Twig Pattern Matching Fast

Graph Matching: Fast Candidate Elimination Using Machine Learning Techniques

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 14-1

Vertex Cover is Fixed-Parameter Tractable

4 Basics of Trees. Petr Hliněný, FI MU Brno 1 FI: MA010: Trees and Forests

Association Pattern Mining. Lijun Zhang

Reference Sheet for CO142.2 Discrete Mathematics II

Taming Verification Hardness: An Efficient Algorithm for Testing Subgraph Isomorphism

Paths. Path is a sequence of edges that begins at a vertex of a graph and travels from vertex to vertex along edges of the graph.

Konigsberg Bridge Problem

A Relational View of Subgraph Isomorphism

Algorithms. Graphs. Algorithms

Propagating separable equalities in an MDD store

CS 4407 Algorithms Lecture 5: Graphs an Introduction

Top-k Keyword Search Over Graphs Based On Backward Search

Lesson 3. Prof. Enza Messina

On the Space-Time Trade-off in Solving Constraint Satisfaction Problems*

FAST-ON*: AN EXTENDED ALGORITHM FOR GRAPH ISOMORPHISM PROBLEM AND GRAPH QUERY PROCESSING

Maximum Clique Problem. Team Bushido bit.ly/parallel-computing-fall-2014

General Graph Identification By Hashing

Graphs: Introduction. Ali Shokoufandeh, Department of Computer Science, Drexel University

Interaction Between Input and Output-Sensitive

Trees. Arash Rafiey. 20 October, 2015

Lecture 3: Recap. Administrivia. Graph theory: Historical Motivation. COMP9020 Lecture 4 Session 2, 2017 Graphs and Trees

Transcription:

Efficient Subgraph Matching by Postponing Cartesian Products Computer Science and Engineering Lijun Chang Lijun.Chang@unsw.edu.au The University of New South Wales, Australia Joint work with Fei Bi, Xuemin Lin, Lu Qin, Wenjie Zhang

Outline Ø Introduction & Existing Works Ø Challenges of Subgraph Matching Ø Our Approach v Core-First Decomposition based Framework v Compact Path Index (CPI) based Matching Ø Experiments Ø Conclusion 2

Introduction Ø Subgraph Matching Given a query q and a large data graph G, the problem is to extract all subgraph isomorphic embeddings of q in G. 3

Introduction Ø Subgraph Matching Given a query q and a large data graph G, the problem is to extract all subgraph isomorphic embeddings of q in G. 4

Introduction Ø Subgraph Matching Given a query q and a large data graph G, the problem is to extract all subgraph isomorphic embeddings of q in G. 5

Introduction Ø Subgraph Matching Given a query q and a large data graph G, the problem is to extract all subgraph isomorphic embeddings of q in G. 6

Introduction Ø Applications Protein interaction network analysis Social network analysis Chemical compound search 7

Hardness Ø Subgraph Isomorphism Testing is NP-complete Ø Decide whether there is a subgraph of G that is isomophic to q Ø Enumerating all subgraph isomorphic embeddings is NP-hard Ø Many techniques have been developed for efficient enumeration in practice 8

Existing Work Ø Ullmann s algorithm [J.ACM 76] Iteratively maps query vertices one by one to data vertices, following the input order of query vertices. Cartesian Products between vertices candidates. Ø VF2 [IEEE Trans 04] and QuickSI [VLDB 08] Ø Turbo ISO [SIGMOD 13] Ø Boost ISO [VLDB 15] 9

Existing Work Ø Ullmann s algorithm [J.ACM 76] Ø VF2 [IEEE Trans 04] and QuickSI [VLDB 08] Independently propose to enforce connectivity of the matching order to reduce Cartesian products caused by disconnected query vertices. QuickSI further removes false-positive candidates by first processing infrequent query vertices and edges. Ø Turbo ISO [SIGMOD 13] Ø Boost ISO [VLDB 15] 10

Existing Work Ø Ullmann s algorithm [J.ACM 76] Ø VF2 [IEEE Trans 04] and QuickSI [VLDB 08] Ø Turbo ISO [SIGMOD 13] Compress a query graph by merging together similar vertices (i.e., with the same neighborhoods) Reduce Cartesian product caused by similar query vertices Build a data structure online to facilitate the search process. Ø Boost ISO [VLDB 15] 11

Existing Work Ø Ullmann s algorithm [J.ACM 76] Ø VF2 [IEEE Trans 04] and QuickSI [VLDB 08] Ø Turbo ISO [SIGMOD 13] Ø Boost ISO [VLDB 15, Ren and Wang] Compress a data graph G by merging together similar vertices in G. Develop query-dependent relationship between vertices in G. dynamically reduces duplicate computations. Can be applied to accelerate all previous techniques as well as ours It is still challenging for matching large query graphs. 12

Challenges of Subgraph Matching Challenge I: Redundant Cartesian Products by Dissimilar Vertices. 10 5-100 partial mappings are redundant. Cartesian products: 100 X 1000 = 10 5 No similar vertices in q or G. Matching order of QuickSI and Turbo ISO : (u 1 uu 2 u u 3 uu 4 u 5 u 6 ). Match dense subgraph first: (u 1 u 2 u 5 u 3 u 4 u 6 ) u 1 13

Challenges of Subgraph Matching Our Solution: Postpone Cartesian products. Ø Decompose q into a dense subgraph and a forest, and process the dense subgraph first. Ø The dense subgraph has more edge-connectivity information. Ø We are the first to exploit this feature. 14

Challenges of Subgraph Matching Challenge II: Exponential number of embeddings of query paths in a data graph. Ø Turbo ISO builds a data structure that materializes all embeddings of query paths in a data graph 1. for generating matching order based on estimation of #candidates. 2. for enumerating subgraph isomorphic embeddings. Ø Effective only when the number of embeddings is small Ø Worst-case space complexity: O( V(G) v(q)-1 ). 15

Challenges of Subgraph Matching Our Solution: We propose a polynomial-size data structure to avoid enumerating all embeddings of a query path in the data graph. 16

Our Approach Ø CFL-Match v A Core-First Decomposition based Framework v Compact Path-Index (CPI) based Matching 17

Core-First Decomposition Ø Core-Forest Decomposition Compute the minimal connected subgraph containing all nontree edges of q regarding any spanning tree. Ø Forest-Leaf Decomposition Compute the set of leaf vertices by rooting each tree at its connection vertex. 18

Framework Ø A Core-First Decomposition based Framework 1) Core-First (Core-Forest-Leaf) Decomposition 19

Framework Ø A Core-First Decomposition based Framework 1) Core-First (Core-Forest-Leaf) Decomposition 2) Mapping Extraction i. Core-Match ii. Forest-Match iii. Leaf-Match Categorize leaf nodes according to labels Perform combination instead of enumeration among different labels. 20

Compact Path-Index based Matching Ø Auxiliary Data Structure: Compact Path-Index (CPI) Compactly stores candidate embeddings of query spanning trees. Prunes invalid candidates Serves for computing an effective matching order. Estimate #matches for each root-to-leaf query path based on CPI Add query paths to the matching order in increasing order w.r.t. #matches Ø CPI Structure 21

Compact Path-Index based Matching Ø Auxiliary Data Structure: Compact Path-Index (CPI) Compactly stores candidate embeddings of query spanning trees. Prunes invalid candidates Serves for computing an effective matching order. Estimate #matches for each root-to-leaf query path based on CPI Add query paths to the matching order in increasing order w.r.t. #matches Ø CPI Structure Example 22

CPI-based Matching Ø CPI Structure Candidate set: each query node u has a candidate set u.c. Edge set: there is an edge between v u.c and v u.c for adjacent query nodes u and u in CPI if and only if (v, v ) exists in G. Ø Traverse CPI to find mappings for query vertices (u 0 u 1 u 4 u 3 u 2, u 5, u 6, u 7, u 8, u 9, u 10 ) G is probed only for non-tree edge validation 23

Minimizing the CPI Ø Benefits of minimizing the CPI Ø Less memory consumption Ø Fast embedding enumeration Ø Soundness of CPI For every query node u in CPI, if there is an embedding of q in G that maps u to v, then v must be in u.c. Theorem Given a sound CPI, all embeddings of q in G can be computed by traversing only the CPI while G is only probed for non-tree edge checkings. Ø It is NP-hard to build a minimum sound CPI. 24

CPI Construction Query q Data graph G v 9 is pruned from u 3.C ß edge (u 3, u 4 ); v 1 is pruned from u 1.C ß edge (u 1, u 3 ); v 8 is pruned from u 2.C ß edge (u 1, u 2 ); v 17 is pruned from u 5.C ß edge (u 2, u 5 ); v 27 is pruned from u 9.C ß edge (u 5, u 9 ). Auxiliary Compact Data Path Structure Index 25

Build a small CPI Ø General Idea A heuristic approach: 1) u.c is initialized to contain all vertices in G with the same label as u 2) A data vertex v is pruned from u.c, if u N q (u), such that v N G (v) & v u.c. Ø A two-phase CPI construction process: Top-down construction, bottom-up refinement Exploit the pruning power of both directions of every query edge. Construct CPI of O( E(G) X V(q) ) size in O( E(G) X E(q) ) time 26

Experiment Ø All algorithms are implemented in C++ and run on a machine with 3.2G CPU and 8G RAM. Ø Datasets Real Graphs Synthetic Graphs Randomly generate graphs with 100k vertices with average degree 8 and 50 distinct labels. Ø Query Graphs Randomly generate by random walk Two Categories: V E Degree HPRD 9460 37081 307 7.8 Yeast 3112 12519 71 8.1 Human 4674 86282 44 36.9 S: sparse (average degree 3). N: non-sparse (average degree > 3). 27

Comparing with Existing Techniques CFL-Match: our proposed algorithm Varying the size of query graph V(q) 28

Effectiveness of Our New Framework Ø Match: subgraph matching algorithm with CPI but no query decomposition. Ø CF-Match: only core-forest decomposition with CPI. Ø CFL-Match: our best algorithm. Evaluating our framework 29

Scalability Testing 30

Conclusion Ø A core-first framework for subgraph matching by postponing Cartesian products Ø A new polynomial-size path-based auxiliary data structure CPI, and efficient and effective technique for constructing a small CPI Ø Efficient algorithms for subgraph matching based on the core-first framework and the CPI Ø Extensive empirical studies on real and synthetic graphs 31

Thank you! Questions? Lijun.Chang@unsw.edu.au 32