Survey on Graph Query Processing on Graph Database. Presented by FAN Zhe


1 Survey on Graph Query Processing on Graph Database. Presented by FAN Zhe

2 Outline: Introduction of Graph and Graph Database. Background of Subgraph Isomorphism. Background of Subgraph Query Processing. Background of Similarity Graph Query Processing. Background of Supergraph Query Processing.

3 What is a Graph? Graphs are powerful. Graphs are everywhere. Graphs are complex, and their size and volume keep increasing. Trade-off: easier to model, harder to analyze. Examples: chemical bonds, the Internet, DNA, daily-life objects.

4 Two Scenarios of Graph Databases: bioinformatics and social networks. A graph database is a database that contains millions of graphs.

5 Definition of Graph A graph g is defined as a 4-tuple, g = (V, E, L, l), where V is the set of vertices, E is the set of edges, L is the set of labels, and l is a labelling function that maps each vertex or edge to a label in L. We define the size of a graph g as size(g) = |E(g)|. We restrict our discussion to undirected, labelled, connected graphs.
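The definition above can be sketched as a small data structure. This is only an illustration: the class name `Graph` and its method names are assumptions, not something taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Undirected labelled graph g = (V, E, L, l) (hypothetical sketch)."""
    vertex_label: dict = field(default_factory=dict)  # l on V: vertex -> label
    edge_label: dict = field(default_factory=dict)    # l on E: {u, v} -> label

    def add_vertex(self, v, label):
        self.vertex_label[v] = label

    def add_edge(self, u, v, label=None):
        # An undirected edge is stored once under an order-independent key.
        self.edge_label[frozenset((u, v))] = label

    def labels(self):
        # L: the labels actually used; unlabelled edges carry None and are skipped.
        used = set(self.vertex_label.values()) | set(self.edge_label.values())
        used.discard(None)
        return used

    def size(self):
        # size(g) = |E(g)|
        return len(self.edge_label)

    def is_connected(self):
        # Depth-first search from an arbitrary vertex.
        if not self.vertex_label:
            return True
        start = next(iter(self.vertex_label))
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for edge in self.edge_label:
                if u in edge:
                    for w in edge:
                        if w not in seen:
                            seen.add(w)
                            stack.append(w)
        return seen == set(self.vertex_label)
```

For example, a path A - B - C built with three `add_vertex` calls and two `add_edge` calls has `size()` equal to 2 and is connected.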

6 Graph Query Processing Problems in Current Research
1. Subgraph Isomorphism: [Ullmann, J.ACM 76], [Cordella, PAMI 04], [QuickSI, VLDB 08]
2. Subgraph Query
2.1 One large graph: [GraphGrep, ICPR 02], [TALE, ICDE 08], [GADDI, EDBT 09], [SAPPER, VLDB 10]
2.2 Numbers of small graphs: [GraphGrep, ICPR 02], [gIndex, SIGMOD 04], [FG-index, SIGMOD 07], [C-Tree, ICDE 06], [QuickSI, VLDB 08], [GBLENDER, SIGMOD 10], [iGraph, VLDB 10]
3. Similarity Graph Query (an exact subgraph match is not always available)
3.1 One large graph: [GraphGrep, ICPR 02], [TALE, ICDE 08], [GADDI, EDBT 09], [SAPPER, VLDB 10]
3.2 Numbers of small graphs: [C-Tree, ICDE 06], [Grafil, SIGMOD 05]
4. Supergraph Query (containment graph query): [cIndex, VLDB 07], [GPTree, EDBT 09]
5. Reachability Problem
6. Shortest Path Problem
7. Spatial Data Problem

7 Outline: Introduction of Graph and Graph Database. Background of Subgraph Isomorphism. Background of Subgraph Query Processing. Background of Similarity Graph Query Processing. Background of Supergraph Query Processing.

8 Definition of Subgraph Isomorphism [Figure: two labelled graphs, one with vertices 1 (A), 2 (B), 3 (C) and one with vertices 1 (A), 2 (B), 3 (C), 4 (A).]

9 Subgraph Isomorphism Algorithm Conditions: 1) M[i][j] = 1 means that the i-th vertex in Q corresponds to the j-th vertex in G. 2) Each row in M contains exactly one 1. 3) No column contains more than one 1. M specifies a subgraph isomorphism from Q to G. How do we find such a matrix M? Ullmann's algorithm (1976).

10 Subgraph Isomorphism Algorithm (Cont.) $M_C = M'(M' M_B)^T$, with the requirement $\forall i, j: (M_A[i][j] = 1) \Rightarrow (M_C[i][j] = 1)$.


12 Subgraph Isomorphism Algorithm (Cont.) Given two graphs Q and G, their corresponding adjacency matrices are $M_A$ ($n \times n$) $= [a_{ij}]$ and $M_B$ ($m \times m$) $= [b_{ij}]$. Goal: 1) find an $n \times m$ matrix $M'$ such that $M_C = M'(M' M_B)^T$ satisfies $\forall i, j: (M_A[i][j] = 1) \Rightarrow (M_C[i][j] = 1)$, or 2) report that no such matrix exists.

13 Subgraph Isomorphism Algorithm (Cont.) Step 1. Set up an n × m matrix M such that M[i][j] = 1 if 1) the i-th vertex in Q has the same label as the j-th vertex in G; and 2) the degree of the i-th vertex in Q is no larger than the degree of the j-th vertex in G. [Figure: the two example graphs from the definition slide.]
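A minimal sketch of Step 1, assuming each graph is given as a list of vertex labels plus a 0/1 adjacency matrix. The function name and the example edges are made up for illustration, because the edges of the slide's figure are not recoverable.

```python
def initial_matrix(q_labels, ma, g_labels, mb):
    """Build the n x m candidate matrix M from labels and degrees."""
    deg_q = [sum(row) for row in ma]  # vertex degrees in Q
    deg_g = [sum(row) for row in mb]  # vertex degrees in G
    return [
        [
            1 if q_labels[i] == g_labels[j] and deg_q[i] <= deg_g[j] else 0
            for j in range(len(g_labels))
        ]
        for i in range(len(q_labels))
    ]


# Hypothetical example: Q has edges 0-1 and 0-2; G has edges 0-1, 0-2, 1-2, 2-3.
q_labels = ["A", "B", "C"]
ma = [[0, 1, 1],
      [1, 0, 0],
      [1, 0, 0]]
g_labels = ["A", "B", "C", "A"]
mb = [[0, 1, 1, 0],
      [1, 0, 1, 0],
      [1, 1, 0, 1],
      [0, 0, 1, 0]]

m0 = initial_matrix(q_labels, ma, g_labels, mb)
# m0 == [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
```

In this example the second A vertex of G (index 3) is ruled out for Q's vertex 0 because its degree (1) is smaller than the degree of Q's vertex 0 (2).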

14 Subgraph Isomorphism Algorithm (Cont.) Step 2. Matrices M' are generated by systematically changing to 0 all but one of the 1s in each of the rows of M, subject to the condition that no column of a matrix M' may contain more than one 1. (The maximal search depth is |MA|, the number of rows.)

15 Subgraph Isomorphism Algorithm (Cont.) Step 3. Verify each matrix M' with the following condition: $M_C = M'(M' M_B)^T$ and $\forall i, j: (M_A[i][j] = 1) \Rightarrow (M_C[i][j] = 1)$. Iterate the above steps to enumerate all possible matrices M'. In the worst case, there are (|MB|!) possible matrices. (Subgraph isomorphism is a classical NP-hard problem.)
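Steps 2 and 3 can be sketched as a plain backtracking search. This is a simplified illustration without Ullmann's refinement procedure; the function names are assumptions, and the input `m0` is the Step 1 matrix (non-empty graphs are assumed).

```python
def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(row) for row in zip(*a)]


def verifies(m_prime, ma, mb):
    # M_C = M'(M' M_B)^T; every edge of Q must map to an edge of G.
    mc = matmul(m_prime, transpose(matmul(m_prime, mb)))
    n = len(ma)
    return all(mc[i][j] == 1 for i in range(n) for j in range(n) if ma[i][j] == 1)


def subgraph_isomorphisms(m0, ma, mb):
    """Return every mapping (row i -> column j) that passes verification."""
    n, m = len(m0), len(m0[0])
    used = [False] * m
    assignment, results = [], []

    def search(row):
        if row == n:
            # Build M' with exactly one 1 per row and at most one per column.
            m_prime = [[1 if assignment[i] == j else 0 for j in range(m)]
                       for i in range(n)]
            if verifies(m_prime, ma, mb):
                results.append(list(assignment))
            return
        for j in range(m):
            if m0[row][j] == 1 and not used[j]:
                used[j] = True
                assignment.append(j)
                search(row + 1)
                assignment.pop()
                used[j] = False

    search(0)
    return results
```

Each result is a list whose i-th entry is the vertex of G matched to the i-th vertex of Q. The search depth equals the number of rows of M, as noted in Step 2.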

16 Subgraph Isomorphism Algorithm (Cont.) There are several optimizations of Ullmann's algorithm; if interested, please check the original research paper. QuickSI: VLDB 08. A good survey of graph matching algorithms: Thirty Years of Graph Matching in Pattern Recognition. C++ library for graph isomorphism: VFLib.

17 Outline: Introduction of Graph and Graph Database. Background of Subgraph Isomorphism. Background of Subgraph Query Processing. Background of Similarity Graph Query Processing. Background of Supergraph Query Processing.

18 Subgraph Query Problem definition: Given a graph database D and a query graph q, find all graphs g in D such that q is a subgraph of g. [Figure: sample database of graphs (a), (b), (c) and a query graph.] Complexity: NP-complete.

19 Applications of Subgraph Query: protein interaction analysis; motif discovery in 3D protein structures; drug design; schema matching; graph similarity search; correlation discovery in graph databases.

20 Challenges of Subgraph Query Sequential scan is not scalable: it incurs disk I/O and subgraph isomorphism testing for every graph. An indexing mechanism is needed. Existing systems: DayLight (Daylight.com, commercial); GraphGrep (Dennis Shasha et al., PODS'02); gIndex; FG-index; C-Tree; SwiftIndex; iGraph.

21 Representative Works on Subgraph Query Feature-based approach: gIndex, SIGMOD 04; FG-index, SIGMOD 07. Non-feature-based approach: GraphGrep, PODS 02; QuickSI, VLDB 08; C-Tree, ICDE 06; GString, ICDE 07; GCoding, EDBT 08.

22 GraphGrep (Shasha et al. 02) Fingerprinting: to filter the database. A subgraph matching algorithm. Basic idea: use small components of the query graph and the database graphs to filter the database and to do the matching.

23 GraphGrep (Shasha et al. 02) (Cont.)

24 GraphGrep (Shasha et al. 02) (Cont.)

25 gIndex (Yan et al. 04) If graph G contains query graph Q, G should contain every substructure of Q. Remark: index substructures of a query graph to prune graphs that do not contain these substructures. [Figure: query graph (Q), graph (G), and a shared substructure.]

26 gIndex (Yan et al. 04) (Cont.) Framework: two steps in processing graph queries. Step 1, Index Construction: enumerate structures in the graph database and build an inverted index between structures and graphs. Step 2, Query Processing: enumerate structures in the query graph, calculate the candidate graphs containing these structures, and prune the false positive answers by performing subgraph isomorphism tests.
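The two-step framework can be sketched as a generic filter-and-verify routine. Everything here is a toy stand-in: `features_of` and `is_subgraph` are hypothetical callables, every enumerated feature is assumed to be indexed, and the example graphs have unique vertex labels so that edge-set inclusion is a valid containment test.

```python
from collections import defaultdict


def build_inverted_index(database, features_of):
    """Step 1: map each structure (feature) to the ids of graphs containing it."""
    index = defaultdict(set)
    for gid, g in database.items():
        for f in features_of(g):
            index[f].add(gid)
    return index


def process_query(q, database, index, features_of, is_subgraph):
    """Step 2: filter by intersecting id lists, then verify the candidates."""
    candidates = set(database)
    for f in features_of(q):
        candidates &= index.get(f, set())
    answers = {gid for gid in candidates if is_subgraph(q, database[gid])}
    return candidates, answers


# Toy graphs: each graph is a set of edges between uniquely labelled vertices.
database = {
    "a": {("C", "N"), ("C", "O")},
    "b": {("C", "O"), ("N", "O")},
    "c": {("C", "S")},
}
vertex_labels = lambda g: {label for edge in g for label in edge}  # weak features
edge_inclusion = lambda q, g: q <= g  # valid only for these toy graphs

index = build_inverted_index(database, vertex_labels)
candidates, answers = process_query({("C", "N")}, database, index,
                                    vertex_labels, edge_inclusion)
# candidates == {"a", "b"}, answers == {"a"}
```

Graph (b) is a false positive of the filtering step: it contains both vertex labels but not the edge, so only verification removes it.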

27 gIndex (Yan et al. 04) (Cont.) Two approaches: path-based indexing and subgraph-based indexing.

28 gIndex (Yan et al. 04) (Cont.) Path-Based Approach. [Figure: sample database of graphs (a), (b), (c).] Paths: 0-length: C, O, N, S. 1-length: C-C, C-O, C-N, C-S, N-N, S-O. 2-length: C-C-C, C-O-C, C-N-C, ... 3-length: ... Build an inverted index between paths and graphs.

29 gIndex (Yan et al. 04) (Cont.) Path-Based Approach (Cont.) For the query graph: 0-length: S_C = {a, b, c}, S_N = {a, b, c}. 1-length: S_{C-C} = {a, b, c}, S_{C-N} = {a, b, c}. 2-length: S_{C-N-C} = {a, b}, ... Intersecting these sets, we obtain the candidate answers, graph (a) and graph (b), which may contain this query graph.
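The path-based filtering above can be sketched as follows. The three small graphs are invented for illustration (the slide's molecules are not recoverable), and the helper names are assumptions.

```python
from collections import defaultdict


def make_graph(labels, edges):
    """labels: vertex -> label; edges: iterable of undirected (u, v) pairs."""
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return labels, adj


def label_paths(graph, max_len):
    """Label sequences of all simple paths with at most max_len edges."""
    labels, adj = graph
    found = set()

    def extend(path):
        seq = tuple(labels[v] for v in path)
        found.add(min(seq, seq[::-1]))  # one canonical direction per path
        if len(path) - 1 < max_len:
            for w in adj[path[-1]]:
                if w not in path:
                    extend(path + [w])

    for v in labels:
        extend([v])
    return found


def build_path_index(database, max_len):
    index = defaultdict(set)
    for gid, g in database.items():
        for p in label_paths(g, max_len):
            index[p].add(gid)
    return index


def path_candidates(query, index, all_ids, max_len):
    result = set(all_ids)
    for p in label_paths(query, max_len):
        result &= index.get(p, set())  # intersect the id lists
    return result


# Hypothetical database: (a) C-N-C, (b) C-N-C-C, (c) C-C-N.
db = {
    "a": make_graph({1: "C", 2: "N", 3: "C"}, [(1, 2), (2, 3)]),
    "b": make_graph({1: "C", 2: "N", 3: "C", 4: "C"}, [(1, 2), (2, 3), (3, 4)]),
    "c": make_graph({1: "C", 2: "C", 3: "N"}, [(1, 2), (2, 3)]),
}
query = make_graph({1: "C", 2: "N", 3: "C"}, [(1, 2), (2, 3)])
path_index = build_path_index(db, max_len=2)
cands = path_candidates(query, path_index, db, max_len=2)
# cands == {"a", "b"}: graph (c) has no C-N-C path and is pruned.
```

The surviving candidates would still need a subgraph isomorphism test, since sharing all paths does not guarantee containment.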

30 gIndex (Yan et al. 04) (Cont.) Problem of Path-Based Approach. [Figure: sample database of graphs (a), (b), (c) and a query graph.] Graph (c) contains this query graph. However, if we only index the paths C, C-C, C-C-C, C-C-C-C, we cannot prune graphs (a) and (b).

31 gIndex (Yan et al. 04) (Cont.) Problem of Path-Based Approach: paths are simple, so structural information is lost, and there are too many paths. gIndex proposes to use structures instead of paths, and to use discriminative structures.

32 gIndex (Yan et al. 04) (Cont.) gIndex: Indexing Graphs by Data Mining. Identify frequent structures in the database; frequent structures are subgraphs that appear often in the graph database. Prune redundant frequent structures to maintain a small set of discriminative structures. Create an inverted index between discriminative frequent structures and graphs in the database.

33 gIndex (Yan et al. 04) (Cont.) Frequent Structures. [Figure: sample database of graphs (a), (b), (c) and two frequent structures (a), (b) with support 2.]

34 gIndex (Yan et al. 04) (Cont.) Frequent Structures (Cont.) Efficient frequent graph mining algorithms are available. Apriori-based: AGM/AcGM, Inokuchi et al. (PKDD 00); FSG, Kuramochi et al. (ICDM 01); Vanetik et al. (ICDM 02). Pattern-growth: MoFa, Borgelt et al. (ICDM 02); gSpan, Yan and Han (ICDM 02).

35 gIndex (Yan et al. 04) (Cont.) Frequent Structures: Threshold Issue. How should the minimum support threshold be set? If it is too low, it may generate too many frequent graphs. If it is too high, it may miss important structures. Should we enforce a uniform threshold for structures of different sizes? Answer: a size-increasing support threshold.

36 gIndex (Yan et al. 04) (Cont.) Frequent Structures: Threshold Issue. [Figure: support threshold (%) versus fragment size (edges), with thresholds θ and Θ.] Intuition: large structures with low support will likely be indexed well by their substructures that have similar support. Size-increasing support threshold: the support threshold increases as the indexed structures become larger.

37 gIndex (Yan et al. 04) (Cont.) Frequent Structures: Volume Issue. The number of frequent structures may exceed the number of graphs in the database when the support is low: 1,000 graphs may generate 1,000,000 frequent structures. It is expensive in time and memory to compute and index all frequent structures. This motivates indexing only discriminative structures.

38 gIndex (Yan et al. 04) (Cont.) Redundant Structures. [Figure: sample database of graphs (a), (b), (c).] All graphs contain the structures C, C-C, C-C-C. Why bother indexing these redundant frequent structures? Remove them, and only index structures that provide more information than existing structures.

39 gIndex (Yan et al. 04) (Cont.) Discriminative Structures. Pinpoint the most useful frequent structures. Given a set of structures $f_1, f_2, \ldots, f_n$ and a new structure $x$, we measure the extra indexing power provided by $x$ as $P(x \mid f_1, f_2, \ldots, f_n)$, $f_i \subset x$. When $P$ is small enough, $x$ is a discriminative structure and should be included in the index. Index discriminative frequent structures only: this reduces the index size by an order of magnitude while achieving good performance.
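One way to read the measure above is as a conditional support over graph id lists. The sketch below assumes that reading; the function names and the threshold value `max_p` are hypothetical, not values from the slides.

```python
def conditional_support(d_x, d_subfeatures, all_ids):
    """Estimate P(x | f1, ..., fn) from graph id sets.

    d_x: ids of graphs containing x.
    d_subfeatures: id sets of the already indexed substructures f_i of x.
    """
    covered = set(all_ids)
    for ids in d_subfeatures:
        covered &= ids  # graphs that contain every f_i
    if not covered:
        return 1.0  # x occurs nowhere, so it adds no pruning power
    return len(d_x & covered) / len(covered)


def is_discriminative(d_x, d_subfeatures, all_ids, max_p=0.5):
    # max_p is an assumed cut-off for "P is small enough".
    return conditional_support(d_x, d_subfeatures, all_ids) <= max_p


all_ids = {1, 2, 3, 4, 5, 6}
subs = [{1, 2, 3, 4}, {1, 2, 3, 5}]  # together they leave graphs {1, 2, 3}
p_rare = conditional_support({1}, subs, all_ids)          # 1/3: x prunes further
p_same = conditional_support({1, 2, 3}, subs, all_ids)    # 1.0: x is redundant
```

A structure whose id list equals the intersection of its substructures' id lists filters nothing extra, which is why it can be left out of the index.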

40 gIndex (Yan et al. 04) (Cont.) gIndex Construction. First generate all frequent fragments while taking out redundant ones. Translate fragments into sequences and hold them in a prefix tree. Each fragment has an id list: the ids of the graphs containing the fragment. Graph sequentialization (DFS code): a labeled edge is a 5-tuple $(i, j, l_i, l_{(i,j)}, l_j)$, described in another paper.

41 gIndex (Yan et al. 04) (Cont.) gIndex Construction: gIndex Tree. Each fragment can be mapped to an edge sequence (DFS code). Insert the edge sequences of discriminative fragments into a prefix tree called the gIndex tree.

42 gIndex (Yan et al. 04) (Cont.) gIndex Search pipeline: query, filtering, verification, answers.

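43 gIndex (Yan et al. 04) (Cont.) Query Response Time: Cost Analysis. $T_{index} + |C_q| \times (T_{io} + T_{isomorphism\_testing})$, where $T_{index}$ is the query indexing time, $|C_q|$ is the size of the candidate answer set, $T_{io}$ is the disk I/O time, and $T_{isomorphism\_testing}$ is the isomorphism testing time. Remark: make $|C_q|$ as small as possible.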

44 gIndex (Yan et al. 04) (Cont.) gIndex Search Optimization: Apriori pruning. If a fragment is not in the gIndex tree, we need not check its supergraphs.
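The pruning rule can be sketched with a level-wise enumeration of query fragments. This is a simplification under stated assumptions: fragments are modelled as sets of uniquely named query edges (connectivity and isomorphism are ignored), the index is a plain set rather than a gIndex tree, and the function name is hypothetical.

```python
def indexed_query_fragments(query_edges, index):
    """Enumerate query fragments, growing only those found in the index."""
    level = {frozenset([e]) for e in query_edges}
    found, lookups = set(), 0
    while level:
        lookups += len(level)
        hits = {frag for frag in level if frag in index}
        found |= hits
        # Apriori pruning: a fragment missing from the index is never extended,
        # so its super-fragments are not looked up through it.
        level = {frag | {e} for frag in hits for e in query_edges if e not in frag}
    return found, lookups


query_edges = ["ab", "bc", "cd"]
index = {frozenset(["ab"])}
found, lookups = indexed_query_fragments(query_edges, index)
# found == {frozenset({"ab"})}; 5 lookups instead of all 7 non-empty edge subsets
```

Here the fragments {bc, cd} and {ab, bc, cd} are never looked up, because no indexed fragment leads to them.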

45 gIndex (Yan et al. 04) (Cont.)

46 gIndex (Yan et al. 04) (Cont.)

47 FG-index (Cheng et al. 07) The first work to propose the concept of verification-free query processing. Basic idea: if the query is a frequent feature, there is no need to verify the candidates. If the query is not a frequent feature, the cost is the same as for gIndex. The problem is that if the query graph is large, the probability of it being a frequent feature is low.
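The verification-free idea can be sketched as a lookup that short-circuits the usual filter-and-verify path. The names below are hypothetical; `query_key` stands for some canonical form of the query graph, and `filter_candidates` and `verify` stand for the ordinary index filtering and subgraph isomorphism test.

```python
def fg_index_query(query_key, frequent_index, filter_candidates, verify):
    """Return (answers, number of isomorphism tests performed)."""
    if query_key in frequent_index:
        # The query is itself an indexed frequent feature: its id list is the
        # exact answer set, so no verification is needed.
        return set(frequent_index[query_key]), 0
    answers, tests = set(), 0
    for gid in filter_candidates(query_key):
        tests += 1
        if verify(query_key, gid):
            answers.add(gid)
    return answers, tests


# Toy stand-ins for the index and the expensive steps.
frequent_index = {"C-C": {"a", "b", "c"}}
filter_candidates = lambda q: {"a", "b"}
verify = lambda q, gid: gid == "a"

hit = fg_index_query("C-C", frequent_index, filter_candidates, verify)
miss = fg_index_query("C-S-N", frequent_index, filter_candidates, verify)
# hit == ({"a", "b", "c"}, 0); miss == ({"a"}, 2)
```

The second call shows the fallback case from the slide: an infrequent query pays for candidate verification just as in gIndex.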

48 Closure-Tree (He and Singh 06)

49 Closure-Tree (He and Singh 06) (Cont.)


50 iGraph (VLDB 10) [iGraph, VLDB 10] is a common framework that implements most of the above representative indexes. It uses the same subgraph isomorphism algorithm and a common storage engine that guarantees real disk I/Os by bypassing the OS file system cache. From its experiments and conclusions, we know that: 1. There is no single winner among the above techniques for subgraph query processing. 2. Feature-based indexes, like gIndex and FG-index, have the best pruning power, which leads to the lowest I/O cost and a small candidate set.

51 Outline: Introduction of Graph and Graph Database. Background of Subgraph Isomorphism. Background of Subgraph Query Processing. Background of Similarity Graph Query Processing. Background of Supergraph Query Processing.

52 Precise vs. Approximate Search in Graphs Given a graph database and a query graph Q: find graphs containing Q exactly (precise matching; gIndex, SIGMOD 04), or find graphs containing Q approximately (approximate matching; Grafil).

53 Evaluating Graph Similarity 1. Maximal Common Subgraph (MCS): Given two graphs Q and G, assume that S is subgraph-isomorphic to both Q and G. Then S is called a common subgraph of Q and G. The MCS of Q and G is the common subgraph with the largest number of edges, |E(S)|. [Figure: graphs Q and G and their MCS.]

54 Evaluating Graph Similarity (Cont.) 2. Minimal Graph Edit Distance: the minimal edit distance between Q and G is the minimal number of edit operations (insertion, deletion, or relabeling) in the optimal alignment that transforms Q into G. [Figure: graphs Q and G.]

55 Solution 1 Compute the similarity between the graphs in the database and the query graph directly (costly): a sequential scan plus a subgraph similarity computation for each graph.

56 Solution 2 Form a set of subgraph queries from the original query graph and use exact subgraph search (costly). If we allow 3 edges to be missed in a 20-edge query graph, it may generate 1,140 subgraphs.
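The figure of 1,140 is the number of ways to choose which 3 of the 20 edges are dropped. A quick check, assuming every choice of missed edges yields a distinct relaxed query:

```python
from itertools import combinations
from math import comb

query_edges = range(20)  # 20 edges, identified by index
missed = 3

# Count the relaxed queries by explicit enumeration and by the closed form.
enumerated = sum(1 for _ in combinations(query_edges, missed))
closed_form = comb(20, missed)  # 20 * 19 * 18 / 6
print(enumerated, closed_form)  # prints "1140 1140"
```

The count grows quickly with the number of allowed misses, which is why issuing one exact query per relaxed subgraph is costly.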

57 Solution 3 Precise search: use frequent patterns as indexing features; select features in the database space based on their selectivity; build the index. Approximate search: it is hard to build indices covering similar subgraphs because of the explosive number of subgraphs in databases. Idea: (1) keep the index structure; (2) select features in the query space.

58 Substructure Similarity Measure Structure-based similarity measure: the largest overlapping part of two graphs Q and G. Relaxation: the number of edges that can be relabeled or deleted (relaxation of the query graph).

59 Structural Features [Figure: graph database with graphs (a), (b), (c).] Structural features (small fragments): atom, path, bond, subgraph.

60 Substructure Similarity Measure Feature-based similarity measure: each graph is represented as a feature vector X = {x_1, x_2, ..., x_n}. The similarity is defined by the distance between the corresponding vectors. Easy to index and very fast, but a rough measure.

61 Substructure Similarity Measure Structure-based similarity: accurate measure, but slow. Feature-based similarity: rough measure, but fast. Can we transform structure-based similarity into feature-based similarity?

62 Grafil (Yan et al. 05) If graph G contains the major part of a query graph Q, G should share a number of common features with Q. Given a relaxation ratio, calculate the maximal number of features that can be missed. At least one of them should be contained. [Figure: query (Q), graphs (G1) and (G2), and a shared substructure.]

63 Grafil (Yan et al. 05) (Cont.) Feature-Graph Matrix: an occurrence table between features (rows) and graphs (columns G1 to G5). [Table values not recoverable.] Assume a query graph has 4 features and only 1 feature may be missed due to the relaxation threshold.

64 Grafil (Yan et al. 05) (Cont.) Query Processing Framework: three steps in processing approximate graph queries. Step 1. Index Construction: select small structures as features in a graph database, and build the feature-graph matrix between the features and the graphs in the database.

65 Grafil (Yan et al. 05) (Cont.) Query Processing Framework Step 2. Feature Miss Estimation: determine the indexed features belonging to the query graph, then calculate the upper bound of the number of features that can be missed for an approximate matching, denoted by J. (This is computed on the query graph, not the graph database.)

66 Grafil (Yan et al. 05) (Cont.) Query Processing Framework Step 3. Query Processing: use the feature-graph matrix to calculate the difference in the number of features between graph G and query Q, written $F_G - F_Q$. If this difference exceeds J, discard G. The remaining graphs constitute a candidate answer set.
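A sketch of the filtering step, assuming the feature difference counts how many feature occurrences required by the query are absent from a graph. The matrix values below are invented (the slide's table did not survive), and the function names are hypothetical.

```python
def feature_misses(query_counts, graph_counts):
    """Number of query feature occurrences that the graph fails to provide."""
    return sum(max(0, need - graph_counts.get(f, 0))
               for f, need in query_counts.items())


def grafil_filter(query_counts, feature_graph_matrix, max_misses):
    """Keep graphs whose feature misses do not exceed the bound J (max_misses)."""
    graph_ids = {g for row in feature_graph_matrix.values() for g in row}
    keep = set()
    for g in graph_ids:
        counts = {f: row.get(g, 0) for f, row in feature_graph_matrix.items()}
        if feature_misses(query_counts, counts) <= max_misses:
            keep.add(g)
    return keep


# Hypothetical feature-graph matrix: rows are features, columns are graphs.
matrix = {
    "f1": {"G1": 1, "G2": 1, "G3": 0, "G4": 1, "G5": 1},
    "f2": {"G1": 1, "G2": 1, "G3": 1, "G4": 0, "G5": 1},
    "f3": {"G1": 1, "G2": 0, "G3": 0, "G4": 0, "G5": 1},
    "f4": {"G1": 1, "G2": 1, "G3": 1, "G4": 0, "G5": 0},
}
query = {"f1": 1, "f2": 1, "f3": 1, "f4": 1}  # the query has 4 features
survivors = grafil_filter(query, matrix, max_misses=1)
# survivors == {"G1", "G2", "G5"}
```

With one allowed miss, G3 (two misses) and G4 (three misses) are discarded without any structural comparison; the survivors still need a similarity computation.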

67 Grafil (Yan et al. 05) (Cont.) Selection of Upper Bound: if we allow k edges to be relaxed, the main idea is to transform k edge misses into m feature misses. This is the classic set k-cover problem, which is NP-complete. k: the number of missing edges in q. m: the maximum number of features covered by k edges.
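For small queries the bound m can be computed by brute force, which makes the definition concrete. This is not the estimation method of the Grafil paper, only an exhaustive illustration; it assumes each feature is given by the set of query edges it occupies, and all names and data are made up.

```python
from itertools import combinations


def max_feature_misses(query_edges, feature_edges, k):
    """Largest number of features hit by removing any k query edges."""
    best = 0
    for removed in combinations(query_edges, k):
        removed = set(removed)
        # A feature is missed if at least one of its edges is removed.
        hit = sum(1 for edges in feature_edges.values() if removed & edges)
        best = max(best, hit)
    return best


query_edges = ["e1", "e2", "e3", "e4"]
feature_edges = {
    "fa": {"e1"},
    "fb": {"e1", "e2"},
    "fc": {"e2", "e3"},
    "fd": {"e4"},
}
m1 = max_feature_misses(query_edges, feature_edges, k=1)  # 2
m2 = max_feature_misses(query_edges, feature_edges, k=2)  # 3
```

The exhaustive loop visits every k-subset of edges, so it is only practical for tiny inputs; that cost is the reason a bound has to be estimated in practice.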

68 Grafil (Yan et al. 05) (Cont.) Usage of the feature misses m. [Figure: example with m = 4.]

69 Outline: Introduction of Graph and Graph Database. Background of Subgraph Isomorphism. Background of Subgraph Query Processing. Background of Similarity Graph Query Processing. Background of Supergraph Query Processing.

70 Supergraph Query Processing The counterpart of subgraph query processing. Problem statement: given a graph database D and a query graph q, find all graphs g in D such that q is a supergraph of g.

71 Challenges Problem complexity: NP-complete, the same as subgraph query. Existing feature-based indexes for subgraph queries are not applicable. Inclusion logic for subgraph query: if $f \subseteq q$ and $f \nsubseteq g$, then $q \nsubseteq g$. Exclusion logic for supergraph query: if $f \nsubseteq q$ and $f \subseteq g$, then $q \nsupseteq g$. Representative work: cIndex (Chen et al. 07), a feature-based approach; GPTree (Zhang et al. 09), a feature-based approach with a fast subgraph isomorphism approach.
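The two pruning rules can be contrasted in a few lines. In this sketch each graph is abstracted to the set of index features it contains, which is an assumption made purely for illustration; the function names are hypothetical.

```python
def subgraph_query_candidates(graph_features, query_features):
    """Inclusion logic: prune g if some feature of q is missing from g."""
    return {g for g, feats in graph_features.items()
            if all(f in feats for f in query_features)}


def supergraph_query_candidates(graph_features, query_features):
    """Exclusion logic: prune g if g contains a feature that q lacks."""
    return {g for g, feats in graph_features.items()
            if not any(f not in query_features for f in feats)}


# Features contained in each database graph (toy data).
graph_features = {
    "g1": {"f1"},
    "g2": {"f1", "f2"},
    "g3": {"f3"},
}

sub_cands = subgraph_query_candidates(graph_features, {"f1", "f2"})
super_cands = supergraph_query_candidates(graph_features, {"f1"})
# sub_cands == {"g2"}; super_cands == {"g1"}
```

For the subgraph query, features found in the query do the pruning; for the supergraph query, features absent from the query do it, which is why an index tuned for one kind of query does not carry over to the other.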

72 References
[Shasha et al., PODS 02] Shasha, D., Wang, J.T.L., Giugno, R.: Algorithmics and applications of tree and graph searching. In: PODS. (2002)
[Yan et al., SIGMOD 04] Yan, X., Yu, P.S., Han, J.: Graph indexing based on discriminative frequent structure analysis. In: SIGMOD. (2004)
[He and Singh, ICDE 06] He, H., Singh, A.K.: Closure-tree: An index structure for graph queries. In: ICDE. (2006)
[Cheng et al., SIGMOD 07] Cheng, J., Ke, Y., Ng, W., Lu, A.: Fg-index: towards verification-free query processing on graph databases. In: SIGMOD. (2007)
[Cheng et al., TODS 09] Cheng, J., Ke, Y., Ng, W.: Effective query processing on graph databases. ACM Trans. Database Syst. 34(1) (2009)
[Jiang et al., ICDE 07] Jiang, H., Wang, H., Yu, P.S., Zhou, S.: Gstring: A novel approach for efficient search in graph databases. In: ICDE. (2007)
[Zhang et al., ICDE 07] Zhang, S., Hu, M., Yang, J.: Treepi: A novel graph indexing method. In: ICDE. (2007)
[Williams et al., ICDE 07] Williams, D.W., Huan, J., Wang, W.: Graph database indexing using structured graph decomposition. In: ICDE. (2007)

73 References
[Zhao et al., VLDB 07] Zhao, P., Yu, J.X., Yu, P.S.: Graph indexing: Tree + delta >= graph. In: VLDB. (2007)
[Zou et al., EDBT 08] Zou, L., Chen, L., Yu, J.X., Lu, Y.: A novel spectral coding in a large graph database. In: EDBT. (2008)
[Shang et al., VLDB 08] Shang, H., Zhang, Y., Lin, X., Yu, J.X.: Taming verification hardness: An efficient algorithm for testing subgraph isomorphism. In: VLDB. (2008)
[Chen et al., VLDB 07] Chen, C., Yan, X., Yu, P.S., Han, J., Zhang, D.Q., Gu, X.: Towards graph containment search and indexing. In: VLDB. (2007)
[Zhang et al., EDBT 09] Zhang, S., Li, J., Gao, H., Zou, Z.: A novel approach for efficient supergraph query processing on graph databases. In: EDBT. (2009)
[Raymond et al., CJ 02] Raymond, J.W., Gardiner, E.J., Willett, P.: RASCAL: calculation of graph similarity using maximum common edge subgraphs. Comput. J. 45(6) (2002)
[Yan et al., SIGMOD 05] Yan, X., Yu, P.S., Han, J.: Substructure similarity search in graph databases. In: SIGMOD Conference. (2005)
[Faloutsos and Tong, ICDE 09] Faloutsos, C., Tong, H.: Large graph mining: patterns, tools and case studies. In: ICDE. (2009) Tutorial.

74 References
[Shang et al., ICDE 10] Shang, H., Zhu, K., Lin, X., Zhang, Y., Ichise, R.: Similarity Search on Supergraph Containment. In: ICDE. (2010)
[Ke et al., KDD 07] Ke, Y., Cheng, J., Ng, W.: Correlation search in graph databases. In: KDD. (2007)
[Ke et al., SDM 09] Ke, Y., Cheng, J., Yu, J.X.: Top-k correlative graph mining. In: SDM. (2009)
[Ke et al., ICDM 09] Ke, Y., Cheng, J., Yu, J.X.: Efficient discovery of frequent correlated subgraph pairs. In: ICDM. (2009)

75 Thank you!
